AI Agent Tool Calling

Building an AI Agent with Real Function Calling

This project demonstrates how to build an AI agent that makes actual structured function calls rather than just describing tool usage in text. Many tutorials show agents that fake tool calling or simply generate text like "I would call the weather tool..." - this implementation executes real function calls with validated parameters using LangChain, LangGraph, and a locally-hosted LLM.

The complete implementation is available in the accompanying GitHub repository.

The Problem: Most "Tool Calling" Isn't Real

Many AI agent tutorials create the illusion of tool calling without actual function execution. Common issues include:

  • Using models that lack function calling capabilities
  • Generating text descriptions instead of structured tool calls
  • Skipping critical tool binding steps
  • Missing proper input validation with Pydantic schemas
  • Inadequate docstrings that fail to guide LLM behavior
  • Using outdated LangGraph APIs that AI coding assistants generate

Technical Implementation

Hybrid Architecture: Single Model, Dual Roles

The system uses a single LLM instance (Hermes-3-Llama-3.1-8B) running locally via vLLM, serving two distinct purposes:

  • Agent Mode (llm_with_tools): LLM with tools bound, generates structured tool calls
  • Evaluator Mode (llm): Raw LLM without tools, performs semantic assessment

This hybrid approach runs on a 24GB GPU by avoiding the memory overhead of loading two separate models. The same underlying model serves both functions - one configuration with tool schemas, one without.

Why Model Selection is Critical

Not all LLMs support structured function calling. General instruction models will generate text descriptions like "I would call the weather tool for Boston" instead of making actual function calls. This implementation uses Hermes-3-Llama-3.1-8B, which is specifically trained for function calling.

Function-calling capable models: Hermes-3-Llama-3.1-8B, GPT-4, Claude 3+, Mistral-Large

Models lacking function calling: Mistral-7B-Instruct, base Llama models, most general chat models

Pydantic Schemas for Input Validation

Each tool uses Pydantic models to define expected input schemas, ensuring the LLM generates valid, structured tool calls:


from pydantic import BaseModel, Field
from langchain_core.tools import tool


class WeatherInput(BaseModel):
    location: str = Field(
        description="The city name and optionally state/country (e.g., 'San Francisco, CA')",
        min_length=2,
        max_length=100
    )

@tool(args_schema=WeatherInput)
def get_weather(location: str) -> str:
    """
    Retrieves current weather information for a specified location.

    Args:
        location: The city name and optionally state/country

    Returns:
        A string containing the current temperature and weather conditions.
    """
    return f"Current weather in {location}: Temperature is 72°F, conditions are sunny"

The Critical Tool Binding Step

Simply defining tools is insufficient - they must be explicitly bound to the LLM. This step is commonly skipped by AI coding assistants:


from langchain_openai import ChatOpenAI

# Base LLM without tools (for evaluation)
llm = ChatOpenAI(
    base_url="http://localhost:8082/v1",
    model="NousResearch/Hermes-3-Llama-3.1-8B",
    api_key="EMPTY"  # placeholder; the local vLLM server does not validate it
)

# Bind tools to create agent LLM
tools = [get_weather, calculator]
llm_with_tools = llm.bind_tools(tools)

Without binding, the LLM has no knowledge of available tools and cannot generate structured function calls.
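
As a quick check (assuming the llm and llm_with_tools objects defined above and a running vLLM server), the bound configuration populates tool_calls on its response while the raw configuration does not:

question = "What's the weather in Boston?"

# The bound configuration produces a structured tool call
bound_response = llm_with_tools.invoke(question)
print(bound_response.tool_calls)
# Roughly: [{'name': 'get_weather', 'args': {'location': 'Boston'}, 'id': '...', 'type': 'tool_call'}]

# The raw configuration has no tool schemas and can only return text
raw_response = llm.invoke(question)
print(raw_response.tool_calls)  # [] - empty, no structured calls possible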

LangGraph State Management

The agent uses LangGraph to manage conversation flow with conditional routing:


from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import ToolNode

def call_model(state: MessagesState):
    """Invoke the tool-bound LLM on the conversation so far."""
    return {"messages": [llm_with_tools.invoke(state["messages"])]}

graph_builder = StateGraph(MessagesState)
graph_builder.add_node("agent", call_model)
graph_builder.add_node("tools", ToolNode(tools))
graph_builder.add_conditional_edges(
    "agent",
    lambda x: "tools" if x["messages"][-1].tool_calls else END
)
graph_builder.add_edge("tools", "agent")
graph_builder.add_edge(START, "agent")
graph = graph_builder.compile()

Important: This uses the current LangGraph API with the START and END constants. Many AI coding assistants generate outdated code using set_entry_point() and "__end__".

Semantic Evaluation Without Keyword Matching

Rather than brittle keyword matching, the system uses LLM-based evaluation to assess:

  • Tool Selection: Did the agent choose appropriate tools for the query?
  • Response Quality: Is the final answer clear, complete, and well-formatted?
  • Overall Success: Does the response address the user's question?

This approach handles semantic variations ("multiply" vs "multiplied" vs "times") and provides reasoned assessment with explanations.
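
A minimal sketch of what such an evaluator could look like, reusing the raw llm instance from the binding example; the prompt wording and pass/fail format here are illustrative rather than the repository's exact prompts:

def evaluate_response(question: str, tool_calls_made: list, final_answer: str) -> str:
    """Ask the raw (untooled) LLM to grade tool selection and answer quality."""
    tool_names = [call["name"] for call in tool_calls_made] or "none"
    prompt = (
        "You are evaluating an AI agent's response.\n"
        f"User question: {question}\n"
        f"Tools the agent called: {tool_names}\n"
        f"Final answer: {final_answer}\n\n"
        "Judge: (1) was the tool selection appropriate, (2) is the answer clear "
        "and complete, (3) does it address the question? "
        "Reply with PASS or FAIL and a one-sentence explanation."
    )
    return llm.invoke(prompt).content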

Implemented Tools

Weather Tool

Returns mock weather data for any location, demonstrating a single-parameter tool with string validation and a Pydantic schema that enforces length constraints.

Calculator Tool

Performs basic arithmetic operations (add, subtract, multiply, divide) with multi-parameter validation and error handling for invalid operations and division by zero.
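
A sketch of how the calculator tool can be written under the same Pydantic + docstring pattern, assuming the imports from the weather-tool example; field names and error messages are illustrative and may differ from the repository's exact code:

class CalculatorInput(BaseModel):
    operation: str = Field(
        description="The arithmetic operation to perform: 'add', 'subtract', 'multiply', or 'divide'"
    )
    a: float = Field(description="The first operand")
    b: float = Field(description="The second operand")

@tool(args_schema=CalculatorInput)
def calculator(operation: str, a: float, b: float) -> str:
    """
    Performs a basic arithmetic operation on two numbers.

    Args:
        operation: One of 'add', 'subtract', 'multiply', or 'divide'.
        a: The first operand.
        b: The second operand.

    Returns:
        A string stating the result, or an error message for invalid
        operations or division by zero.
    """
    if operation == "add":
        return f"The result of {a} + {b} is {a + b}"
    if operation == "subtract":
        return f"The result of {a} - {b} is {a - b}"
    if operation == "multiply":
        return f"The result of {a} * {b} is {a * b}"
    if operation == "divide":
        if b == 0:
            return "Error: division by zero is not allowed"
        return f"The result of {a} / {b} is {a / b}"
    return f"Error: unknown operation '{operation}'"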

Both tools follow production patterns:

  • Complete docstrings with parameter descriptions
  • Pydantic validation schemas
  • Natural language return values
  • Proper error messages

Key Implementation Details

The Role of Docstrings

Docstrings are not optional - they are the primary mechanism by which the LLM learns what each tool does. The docstring content is sent to the LLM as part of the tool schema. Many coding agents skip or minimize docstrings, resulting in unreliable tool selection.

Effective tool docstrings include:

  • Clear purpose statement
  • Detailed parameter descriptions with examples
  • Return value documentation
  • Format and constraint specifications
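
One way to confirm what the model actually receives, assuming a recent langchain_core: converting a tool to the OpenAI function format shows the docstring and Field descriptions embedded in the schema that is sent alongside the request.

from langchain_core.utils.function_calling import convert_to_openai_tool
import json

# The "description" entries in this schema come straight from the docstring
# and the Pydantic Field descriptions - this is what guides tool selection.
print(json.dumps(convert_to_openai_tool(get_weather), indent=2))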

vLLM Configuration

The implementation requires specific vLLM flags to enable function calling:


vllm serve NousResearch/Hermes-3-Llama-3.1-8B \
  --port 8082 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --max-model-len 8192
							
  • --enable-auto-tool-choice: Enables automatic tool calling capability
  • --tool-call-parser hermes: Uses Hermes-specific parser for structured calls
  • --max-model-len 8192: Limits context window to fit in 24GB GPU memory
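
Once the server is running, a quick sanity check against its OpenAI-compatible endpoint confirms the model is being served (the openai client and the placeholder key below are assumptions; vLLM does not validate the key):

from openai import OpenAI

# Point the client at the local vLLM server
client = OpenAI(base_url="http://localhost:8082/v1", api_key="EMPTY")

# The served model list should include NousResearch/Hermes-3-Llama-3.1-8B
for model in client.models.list():
    print(model.id)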

Test Results

The implementation includes three test cases demonstrating different agent behaviors:

  1. Weather query - Agent correctly calls get_weather tool
  2. Math query - Agent correctly calls calculator tool with proper operation parameter
  3. General knowledge - Agent appropriately declines when no tool is available

Each test displays the complete message flow (HumanMessage → AIMessage with tool_calls → ToolMessage → final AIMessage) and LLM-based evaluation results.
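
A sketch of how one of these cases can be driven end to end, assuming the compiled graph from the LangGraph example; the repository's actual test harness may differ:

from langchain_core.messages import HumanMessage

# Run the math test case through the compiled graph
result = graph.invoke({"messages": [HumanMessage(content="What is 6 multiplied by 7?")]})

# Walk the message flow: HumanMessage -> AIMessage(tool_calls) -> ToolMessage -> AIMessage
for message in result["messages"]:
    details = getattr(message, "tool_calls", None) or message.content
    print(f"{type(message).__name__}: {details}")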

Educational Value and Extensions

This implementation provides patterns for building agents that interact with:

  • REST APIs and web services
  • Databases with SQL queries
  • Analytics and data processing tools
  • System operations and file management
  • Business systems (CRM, ERP, ticketing)

Extension points:

  • Add more tools following the Pydantic + docstring pattern
  • Implement authentication for external APIs
  • Add retry logic and rate limiting
  • Extend evaluation criteria for domain-specific requirements
  • Scale to multi-agent systems with specialized tool sets

Common Pitfalls Addressed

AI Coding Assistants Generate Outdated Code

Many LLM-based coding assistants (Claude, ChatGPT, etc.) generate LangGraph code using the legacy API. This implementation uses the current API and highlights the differences.

Missing Tool Binding

AI assistants frequently skip the critical bind_tools() step, resulting in agents that cannot actually call functions.

Inadequate Docstrings

Many tutorials use minimal docstrings that don't provide enough information for the LLM to reliably select and use tools.

Wrong Model Selection

Using models that lack function calling capabilities results in text descriptions instead of structured calls.

Project Attribution

This is an educational project demonstrating production-ready patterns for AI agent development with real function calling capabilities.

Technologies Used

  • Python 3.11+
  • LangChain
  • LangGraph
  • vLLM
  • Pydantic
  • Hermes-3-Llama-3.1-8B