Testing LLM Tool Calls: How to Validate Function Calling in AI Apps

LLM tool calls (function calling) are the integration layer between your AI app and the real world — database queries, API calls, external services. They're also where most AI application bugs hide. This guide covers how to test tool call selection, argument generation, error handling, and multi-step tool chains across OpenAI, Anthropic, and Google AI.

Key Takeaways

Tool call selection is the most common failure point. Models pick the wrong tool, call no tool when they should, or call a tool with arguments that pass schema validation but fail at runtime. None of these are caught by unit tests on your tool implementations.

You need layered tests: unit tests for your tool implementations, schema and argument validation, behavioral testing, and integration testing. Unit tests check your tool code with no LLM involved. Schema tests check argument structure. Behavioral tests check that the model picks the right tool for a given prompt. Integration tests check that the full round-trip works end-to-end.

Mock the LLM for deterministic tool argument tests. For testing your tool implementations, inject canned tool call responses. Only test real model behavior when you're specifically testing tool selection decisions.

Test error recovery explicitly. What happens when a tool returns an error? Does the model retry? Give up? Hallucinate a response? Most apps have untested error recovery paths.

Why Tool Call Testing Is Hard

Tool calls (function calling) let LLMs interact with external systems: search the web, query a database, call an API. They're the difference between a chatbot that answers from training data and an AI application that does real work.

The problem: tool calls introduce three distinct failure modes that traditional testing misses:

  1. Wrong tool selected — model calls search_products when it should call get_order_status
  2. Wrong arguments generated — model calls the right tool but generates invalid or nonsensical arguments
  3. Failed error recovery — model receives a tool error and either crashes, loops, or fabricates a response

Each failure mode requires a different testing approach.

The Three Layers of Tool Call Testing

Layer 1: Tool Implementation Tests (No LLM)

Test your tool functions independently. These are standard unit tests — no AI involved.

# test_tools.py
import pytest
from tools import get_order_status, search_products

def test_get_order_status_returns_correct_fields():
    result = get_order_status(order_id="ORD-123")
    assert "status" in result
    assert "items" in result
    assert "estimated_delivery" in result

def test_get_order_status_raises_on_invalid_id():
    with pytest.raises(ValueError, match="Order not found"):
        get_order_status(order_id="INVALID")

def test_search_products_returns_list():
    results = search_products(query="blue jeans", limit=5)
    assert isinstance(results, list)
    assert len(results) <= 5

This layer is fully deterministic. Run it in CI on every commit, without any LLM calls.

Layer 2: Schema and Argument Validation Tests

Verify that your tool schemas are correctly defined and that argument parsing handles edge cases.

import json
from openai import OpenAI

client = OpenAI()

# Your tool definitions
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Get the status of an order by order ID",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {
                        "type": "string",
                        "description": "The order ID, format ORD-XXXXX"
                    }
                },
                "required": ["order_id"]
            }
        }
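    },
    {
        # Second tool, so the selection tests have something to choose between
        "type": "function",
        "function": {
            "name": "search_products",
            "description": "Search the product catalog by free-text query",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "Search terms, e.g. 'blue jeans'"
                    },
                    "limit": {
                        "type": "integer",
                        "description": "Maximum number of results to return"
                    }
                },
                "required": ["query"]
            }
        }
    }
]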

def test_schema_is_valid_openai_format():
    """Validate tool schema before sending to API"""
    for tool in tools:
        assert tool["type"] == "function"
        assert "name" in tool["function"]
        assert "parameters" in tool["function"]
        params = tool["function"]["parameters"]
        assert params["type"] == "object"
        assert "properties" in params
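The schema check above confirms the tool definition is well-formed, but it does not exercise argument edge cases. A minimal sketch of an argument validator for get_order_status follows, using only the standard library. The helper name validate_order_status_args and the ORD-plus-digits pattern are assumptions made for illustration (the schema description only says "format ORD-XXXXX"); a full JSON Schema validator library would be more thorough.

```python
import json
import re

# Assumed reading of the "ORD-XXXXX" format from the schema description.
ORDER_ID_PATTERN = re.compile(r"^ORD-\d+$")


def validate_order_status_args(raw_arguments: str) -> list[str]:
    """Return the problems found in a model-generated arguments string."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return [f"arguments are not valid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]

    problems = []
    order_id = args.get("order_id")
    if order_id is None:
        problems.append("missing required field: order_id")
    elif not isinstance(order_id, str):
        problems.append("order_id must be a string")
    elif not ORDER_ID_PATTERN.match(order_id):
        problems.append(f"order_id has unexpected format: {order_id!r}")

    unexpected = sorted(set(args) - {"order_id"})
    if unexpected:
        problems.append(f"unexpected fields: {unexpected}")
    return problems


def test_valid_arguments_pass():
    assert validate_order_status_args('{"order_id": "ORD-12345"}') == []


def test_edge_cases_are_reported():
    assert validate_order_status_args('{"order_id": 12345}') == [
        "order_id must be a string"
    ]
    assert validate_order_status_args("{}") == ["missing required field: order_id"]
    # Truncated JSON, as can happen when a response hits the token limit
    assert len(validate_order_status_args('{"order_id": "ORD-1')) == 1
```

Returning a list of problems rather than raising keeps the same function usable both in tests and at runtime, where the problems can be passed back to the model as a tool error.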

Layer 3: Behavioral Tests (With Real LLM)

Test that the model selects the right tool for given inputs. These tests are inherently probabilistic — run them multiple times and treat failures as signals, not hard stops.

import json

import pytest
from openai import OpenAI

# `tools` is the list of tool definitions from the Layer 2 example

client = OpenAI()

def get_tool_call(prompt: str) -> dict | None:
    """Get the tool call decision for a prompt"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        tools=tools,
        tool_choice="auto"
    )
    msg = response.choices[0].message
    if msg.tool_calls:
        return {
            "name": msg.tool_calls[0].function.name,
            "args": json.loads(msg.tool_calls[0].function.arguments)
        }
    return None

@pytest.mark.llm  # mark for separate test runs
def test_order_status_query_calls_correct_tool():
    result = get_tool_call("What's the status of order ORD-99412?")
    assert result is not None, "Model should call a tool"
    assert result["name"] == "get_order_status"
    assert result["args"].get("order_id") == "ORD-99412"

@pytest.mark.llm
def test_product_search_query_calls_correct_tool():
    result = get_tool_call("Find me blue running shoes under $100")
    assert result is not None
    assert result["name"] == "search_products"
    assert "blue" in result["args"].get("query", "").lower()

@pytest.mark.llm
def test_greeting_does_not_trigger_tool_call():
    """A greeting needs no tool; the model should answer directly"""
    result = get_tool_call("Hello, how are you?")
    assert result is None, "Greeting should not trigger tool call"
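Because these tests are probabilistic, a single run says little. One way to treat failures as signals is to repeat a check and assert on the pass rate. The sketch below is a hypothetical helper, not part of pytest or any SDK; the run count and any threshold you assert against are assumptions to tune for your own tolerance and budget.

```python
from typing import Callable


def pass_rate(check: Callable[[], bool], runs: int = 5) -> float:
    """Run a boolean check several times and return the fraction that passed."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passed = 0
    for _ in range(runs):
        try:
            if check():
                passed += 1
        except Exception:
            # An API error or malformed response counts as a failed run
            pass
    return passed / runs


# Example usage with the get_tool_call helper from this section:
#
# def selects_order_status() -> bool:
#     result = get_tool_call("What's the status of order ORD-99412?")
#     return result is not None and result["name"] == "get_order_status"
#
# assert pass_rate(selects_order_status, runs=5) >= 0.8
```

Each run is a separate API call, so the run count multiplies both cost and wall-clock time.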

Testing Tool Calls with Anthropic Claude

Claude's tool use API follows the same pattern but with different syntax:

import anthropic
import json

client = anthropic.Anthropic()

tools_anthropic = [
    {
        "name": "get_order_status",
        "description": "Get the status of an order by order ID",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "The order ID"
                }
            },
            "required": ["order_id"]
        }
    }
]

def test_claude_calls_tool_for_order_query():
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        tools=tools_anthropic,
        messages=[{
            "role": "user",
            "content": "What's the status of my order ORD-12345?"
        }]
    )
    
    # Find tool use block
    tool_use = next(
        (block for block in response.content if block.type == "tool_use"),
        None
    )
    
    assert tool_use is not None, "Claude should call get_order_status"
    assert tool_use.name == "get_order_status"
    assert tool_use.input.get("order_id") == "ORD-12345"
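If you test against both providers, keeping two hand-written copies of each schema invites drift. The two formats shown in this guide differ mainly in nesting and in the key that holds the JSON Schema ("parameters" versus "input_schema"), so one can be derived from the other. A minimal sketch, with openai_tool_to_anthropic as a hypothetical helper name:

```python
def openai_tool_to_anthropic(tool: dict) -> dict:
    """Convert an OpenAI-style function tool definition to the Anthropic shape."""
    if tool.get("type") != "function":
        raise ValueError("only function tools can be converted")
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        # Tools without parameters still need an object schema
        "input_schema": fn.get("parameters", {"type": "object", "properties": {}}),
    }


openai_tool = {
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Get the status of an order by order ID",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}

anthropic_tool = openai_tool_to_anthropic(openai_tool)
assert anthropic_tool["name"] == "get_order_status"
assert anthropic_tool["input_schema"]["required"] == ["order_id"]
```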

Testing Multi-Step Tool Chains

Real AI apps often involve chains: the model calls tool A, gets results, then decides to call tool B. Testing these requires simulating the full conversation loop.

def run_agent_loop(user_message: str, max_turns: int = 5) -> str:
    """Run the full agent loop, returning the final text response"""
    messages = [{"role": "user", "content": user_message}]
    
    for _ in range(max_turns):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
            tool_choice="auto"
        )
        
        assistant_msg = response.choices[0].message
        messages.append(assistant_msg.to_dict())
        
        if not assistant_msg.tool_calls:
            # Model gave final response
            return assistant_msg.content
        
        # Execute tool calls and add results
        for tool_call in assistant_msg.tool_calls:
            tool_name = tool_call.function.name
            tool_args = json.loads(tool_call.function.arguments)
            
            result = execute_tool(tool_name, tool_args)
            
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result)
            })
    
    raise RuntimeError("Agent loop exceeded max turns")

def test_order_lookup_chain():
    """Test that order lookup + product recommendation chain works"""
    result = run_agent_loop(
        "I ordered shoes last week (ORD-88821). Are they shipped? "
        "If not, can you show me similar in-stock alternatives?"
    )
    
    # Verify final response is coherent
    assert len(result) > 50, "Response should be substantive"
    assert any(word in result.lower() for word in ["order", "ship", "status", "shoe"])
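The loop above calls execute_tool, which this guide does not define, and the test only inspects the final text. One possible shape for that dispatcher is a small registry that also records each call, so a chain test can assert on which tools ran and in what order. ToolRegistry and its methods are hypothetical names for this sketch, and the stub tools stand in for real implementations.

```python
from typing import Any, Callable


class ToolRegistry:
    """Maps tool names to callables and records every call in order."""

    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., Any]] = {}
        self.calls: list[tuple[str, dict]] = []

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def execute(self, name: str, args: dict) -> Any:
        # Record before lookup so hallucinated tool names show up in the trace
        self.calls.append((name, dict(args)))
        if name not in self._tools:
            raise KeyError(f"Unknown tool: {name}")
        return self._tools[name](**args)

    def tool_names_called(self) -> list[str]:
        return [name for name, _ in self.calls]


# Stub tools with fixed results keep the chain test independent of real data
registry = ToolRegistry()
registry.register(
    "get_order_status",
    lambda order_id: {"order_id": order_id, "status": "processing"},
)
registry.register(
    "search_products",
    lambda query, limit=5: [{"name": f"{query} (in stock)"}][:limit],
)
execute_tool = registry.execute  # drop-in for the call inside run_agent_loop
```

After running the agent loop, a test could then assert that registry.tool_names_called() equals ["get_order_status", "search_products"], which checks the chain itself rather than only the wording of the final answer.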

Testing Error Recovery

Error recovery is where most production bugs live. Test what happens when tools fail:

from unittest.mock import patch, MagicMock

def test_agent_handles_tool_error_gracefully():
    """Model should acknowledge failure, not fabricate a response"""
    
    # Mock the tool to fail. The patch only takes effect if the agent looks up
    # get_order_status through the tools module at call time.
    with patch("tools.get_order_status") as mock_tool:
        mock_tool.side_effect = Exception("Database timeout")
        
        # In production, your agent catches tool errors and passes them back to the model
        # as tool result messages with error content
        result = run_agent_loop_with_error_handling(
            "What's the status of order ORD-12345?"
        )
    
    # Model should indicate it couldn't retrieve the info
    assert any(word in result.lower() for word in [
        "unable", "error", "couldn't", "try again", "problem"
    ])
    
    # Model should NOT fabricate order status
    assert "delivered" not in result.lower()
    assert "shipped" not in result.lower()

def test_agent_retries_on_transient_error():
    """Model should retry transient failures"""
    call_count = 0
    
    def flaky_tool(**kwargs):
        nonlocal call_count
        call_count += 1
        if call_count == 1:
            raise Exception("Temporary network error")
        return {"status": "shipped", "order_id": "ORD-12345"}
    
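    with patch("tools.get_order_status", side_effect=flaky_tool):
        # Use the error-handling loop: run_agent_loop as written has no
        # try/except around the tool call, so the first exception would
        # propagate and fail the test before any retry could happen.
        result = run_agent_loop_with_error_handling("Status of ORD-12345?")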
    
    assert call_count >= 2, "Should retry after transient failure"
    assert "shipped" in result.lower()
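The tests above rely on run_agent_loop_with_error_handling, which is not shown in this guide. Presumably it differs from run_agent_loop in one step: a tool exception is caught and reported back to the model as the tool result instead of crashing the loop. A minimal sketch of that step follows; the function name tool_result_message and the ok/error payload shape are assumptions, not an SDK convention.

```python
import json
from typing import Any, Callable


def tool_result_message(tool_call_id: str, fn: Callable[..., Any], args: dict) -> dict:
    """Run a tool and build the 'tool' role message, reporting failures as content."""
    try:
        payload = {"ok": True, "result": fn(**args)}
    except Exception as exc:
        # Pass the failure to the model so it can retry or explain the problem
        payload = {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": json.dumps(payload),
    }


def failing_lookup(order_id: str) -> dict:
    raise TimeoutError("Database timeout")


message = tool_result_message("call_abc123", failing_lookup, {"order_id": "ORD-12345"})
assert json.loads(message["content"]) == {
    "ok": False,
    "error": "TimeoutError: Database timeout",
}
```

This part is deterministic, so it can be unit tested without a model. Whether the model then retries or apologizes is the behavioral part that needs a real LLM.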

Mocking LLM Responses for Deterministic Tests

For most tool argument tests, you don't need a real LLM. Mock the API response:

import json
from unittest.mock import MagicMock, patch

def make_mock_tool_call_response(tool_name: str, arguments: dict):
    """Create a mock OpenAI response that calls a specific tool"""
    # `name` is reserved by the Mock constructor (it names the mock itself),
    # so set the attribute after creation instead of passing name=tool_name.
    function = MagicMock(arguments=json.dumps(arguments))
    function.name = tool_name

    message = MagicMock(content=None)
    message.tool_calls = [MagicMock(id="call_abc123", function=function)]

    mock_response = MagicMock()
    mock_response.choices = [MagicMock(message=message)]
    return mock_response

@patch("openai.OpenAI")
def test_tool_executor_handles_response_correctly(mock_openai):
    """Test your tool execution logic without real LLM calls"""
    mock_client = mock_openai.return_value
    mock_client.chat.completions.create.return_value = make_mock_tool_call_response(
        "get_order_status",
        {"order_id": "ORD-99999"}
    )
    
    # Test that your agent correctly executes the tool call
    agent = MyAgent(client=mock_client)
    result = agent.run("What's my order status?")
    
    # Verify tool was called with correct args
    # (this tests YOUR code, not the LLM)
    assert result["tool_called"] == "get_order_status"
    assert result["order_id"] == "ORD-99999"

Parallel Tool Call Testing

GPT-4o and Claude 3.5+ can call multiple tools in parallel. Test this explicitly:

@pytest.mark.llm  # calls the real API, so keep it out of the fast suite
def test_parallel_tool_calls_are_handled():
    """Test that your agent handles parallel tool calls correctly"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "Compare the status of orders ORD-111 and ORD-222"
        }],
        tools=tools,
        tool_choice="auto"
    )
    
    # tool_calls is None when the model answers without calling a tool
    tool_calls = response.choices[0].message.tool_calls or []
    
    # Expect both lookups in a single assistant turn, not one per turn
    assert len(tool_calls) == 2
    
    called_order_ids = {
        json.loads(tc.function.arguments)["order_id"]
        for tc in tool_calls
    }
    assert called_order_ids == {"ORD-111", "ORD-222"}
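The test above checks what the model requests. The other half of parallel handling is your own code: every tool call in an assistant turn needs a matching tool message before the next request, or the follow-up API call is rejected. A minimal sketch of a check over a plain-dict message history, with unanswered_tool_call_ids as a hypothetical helper name:

```python
def unanswered_tool_call_ids(messages: list[dict]) -> set[str]:
    """Return IDs of tool calls that have no matching 'tool' result message."""
    requested: set[str] = set()
    answered: set[str] = set()
    for msg in messages:
        if msg.get("role") == "assistant":
            for tool_call in msg.get("tool_calls") or []:
                requested.add(tool_call["id"])
        elif msg.get("role") == "tool":
            answered.add(msg["tool_call_id"])
    return requested - answered


history = [
    {"role": "user", "content": "Compare the status of orders ORD-111 and ORD-222"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{"id": "call_1"}, {"id": "call_2"}],
    },
    {"role": "tool", "tool_call_id": "call_1", "content": "{}"},
]

# The second parallel call was never answered
assert unanswered_tool_call_ids(history) == {"call_2"}
```

This runs without a model, so it fits in the deterministic suite alongside the mocked tests.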

Running Tool Call Tests in CI

Separate your test suite so LLM tests don't block builds:

# .github/workflows/test.yml
jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - run: pytest -m "not llm"  # No API calls, so results are deterministic
  
  llm-integration-tests:
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'  # Only on merge
    env:
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    steps:
      - run: pytest -m "llm" --timeout=30  # --timeout needs the pytest-timeout plugin
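For the -m "llm" selection to work cleanly, the custom marker should be registered; otherwise pytest warns about an unknown mark, and fails outright under --strict-markers. One way to do this in Python is a pytest_configure hook in conftest.py, sketched below. The marker description text is an assumption.

```python
# conftest.py
def pytest_configure(config):
    """Register the custom `llm` marker used to split the test suite."""
    config.addinivalue_line(
        "markers",
        "llm: tests that call a real LLM API (slow, non-deterministic, cost money)",
    )
```

The same registration can instead live in the markers section of your pytest configuration file.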

End-to-End Tool Call Monitoring

Once deployed, you need to monitor tool call behavior in production. Track:

  • Tool selection accuracy: What % of tool calls are the right tool?
  • Argument validity rate: What % of generated arguments pass validation?
  • Tool error rate: How often do tools return errors?
  • Recovery success rate: When tools fail, how often does the model recover gracefully?

Log every tool call with its inputs and outputs. This data is invaluable for catching regressions when you update your prompts or switch model versions.
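As a starting point, the logging and the tool error rate can come from a thin wrapper around tool execution. The sketch below uses only the standard library; ToolCallMonitor, the logger name, and the log record fields are hypothetical. Selection accuracy and recovery success need labeled or reviewed data, so they are not computed here.

```python
import json
import logging
import time
from collections import Counter
from typing import Any, Callable

logger = logging.getLogger("tool_calls")


class ToolCallMonitor:
    """Logs each tool call and keeps per-tool success and error counts."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def call(self, tool_name: str, fn: Callable[..., Any], args: dict) -> Any:
        start = time.perf_counter()
        record: dict = {"tool": tool_name, "args": args}
        try:
            result = fn(**args)
        except Exception as exc:
            self.counts[(tool_name, "error")] += 1
            record.update(outcome="error", error=str(exc))
            raise
        else:
            self.counts[(tool_name, "ok")] += 1
            record.update(outcome="ok", result=result)
            return result
        finally:
            record["duration_ms"] = round((time.perf_counter() - start) * 1000, 2)
            # default=str keeps logging from failing on non-serializable values
            logger.info(json.dumps(record, default=str))

    def error_rate(self, tool_name: str) -> float:
        ok = self.counts[(tool_name, "ok")]
        errors = self.counts[(tool_name, "error")]
        total = ok + errors
        return errors / total if total else 0.0
```

Logging full arguments and results is useful for debugging regressions, but redact anything sensitive before it reaches your log store.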

For automated end-to-end behavioral monitoring — verifying that your AI app's full workflow still produces correct tool calls after a deployment — you can use HelpMeTest to write tests in plain English that exercise your full agent loop and assert on the final behavior, without instrumenting your production code.

Key Takeaways

Testing LLM tool calls requires three layers: unit tests for your tool implementations (deterministic, fast), behavioral tests for tool selection (probabilistic, run against real models), and integration tests for full agent loops (catch multi-step failures). Mock the LLM for argument validation tests, and use real models only when you're specifically testing selection behavior. Always test error recovery: it's the most undertested path in production AI apps.
