How to Test PydanticAI Agents Before They Hit Production

HelpMeTest

13 May 2026 — 5 min read

You built a PydanticAI agent. The structured outputs look right. The tool calls return what you expect in development. You run it manually a few times and ship it.

Then a user hits a case where the agent calls the wrong tool, your retry logic silently swallows the error, and a schema mismatch in the structured output is serialized into garbage that breaks the downstream system.

This is the PydanticAI testing problem: the type annotations give you confidence at the schema level, but they don't tell you whether your agent behaves correctly across the inputs that matter.

Why PydanticAI Agents Need More Than Unit Tests

PydanticAI is designed around type safety and structured outputs. That's a genuine improvement over untyped LLM frameworks: if the model returns a field of the wrong type, Pydantic validation catches it immediately.

But type validation is not behavior validation. Your agent can return a perfectly typed OrderConfirmation object that contains the wrong order ID, escalate a case it should have handled, or call a tool twice when once was correct.
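To make the gap concrete, here is a minimal stdlib-only sketch. A dataclass stands in for the Pydantic model, and the fields of OrderConfirmation as well as the helpers types_ok and behavior_ok are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class OrderConfirmation:
    # Hypothetical fields; a real model would likely carry more
    order_id: str
    status: str


def types_ok(obj: OrderConfirmation) -> bool:
    """Shape check: every field holds a value of the declared type."""
    return isinstance(obj.order_id, str) and isinstance(obj.status, str)


def behavior_ok(obj: OrderConfirmation, expected_order_id: str) -> bool:
    """Behavior check: the confirmation refers to the order the user asked about."""
    return obj.order_id == expected_order_id


# The user asked about order 4521, but the agent confirmed a different one.
wrong = OrderConfirmation(order_id="9999", status="confirmed")
```

Here types_ok(wrong) passes while behavior_ok(wrong, "4521") fails, which is the kind of bug schema validation alone cannot surface.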

The testing challenges that remain:

Tool selection — does the agent pick the right tool for the right input?
Dependency injection — does the agent use your injected context correctly?
Retry behavior — does the retry logic recover correctly from transient failures?
Structured output contents — are the field values correct, not just the field types?
Multi-step reasoning — does the agent reach the right conclusion from a multi-turn input?

Layer 1: Unit Testing Agent Nodes

PydanticAI agents are composed from nodes (functions) and tools (decorated functions). Start by testing each in isolation.

from pydantic_ai import Agent
from pydantic_ai.models.test import TestModel
from your_agent import OrderContext

# Async tests need a pytest plugin such as pytest-asyncio or anyio.
async def test_order_agent_runs_with_context():
    # TestModel does not read the prompt, so pin the reply it should return
    model = TestModel(custom_output_text="Starting a return for your order.")
    agent = Agent(model, deps_type=OrderContext)

    ctx = OrderContext(customer_id="cust_123", region="EU")
    result = await agent.run("I need to return my order", deps=ctx)

    # With no output_type the result is a plain string
    # (result.output in current releases; older releases used result.data)
    assert "return" in result.output.lower()

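TestModel from pydantic_ai.models.test is PydanticAI's built-in mock. It never calls a real API: by default it calls every registered tool and then returns generated data that satisfies the output schema, and you can pin its reply with custom_output_text or custom_output_args. It does not interpret the prompt, so these tests check your wiring rather than the model's judgment, which keeps them fast and deterministic.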

Layer 2: Testing Tool Calls

Tools are where most PydanticAI bugs hide. The model decides which tool to call — and it doesn't always decide correctly.

Test tool invocation explicitly:

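from pydantic_ai import Agent
from pydantic_ai.messages import ModelMessage, ModelResponse, TextPart, ToolCallPart
from pydantic_ai.models.function import AgentInfo, FunctionModel
from your_agent import lookup_order_status, cancel_order  # your tool functions

def scripted_model(messages: list[ModelMessage], info: AgentInfo) -> ModelResponse:
    # First request: call the lookup tool. Once the tool has returned: answer in text.
    if len(messages) == 1:
        call = ToolCallPart(tool_name="lookup_order_status", args={"order_id": "4521"})
        return ModelResponse(parts=[call])
    return ModelResponse(parts=[TextPart(content="Order 4521 has shipped.")])

async def test_agent_calls_lookup_tool_for_order_status():
    agent = Agent(FunctionModel(scripted_model), tools=[lookup_order_status, cancel_order])

    result = await agent.run("What is the status of order #4521?")

    # Verify the right tool was called, using the run's message history
    calls = [
        part
        for message in result.all_messages()
        if isinstance(message, ModelResponse)
        for part in message.parts
        if isinstance(part, ToolCallPart)
    ]
    assert [call.tool_name for call in calls] == ["lookup_order_status"]
    assert calls[0].args_as_dict()["order_id"] == "4521"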

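Don't assume the model will always call the right tool. A mocked model only does what the test tells it to, so the check above covers your tool wiring. To test selection itself, run the inputs where two tools could plausibly apply against the real model and make the same assertions on the recorded tool calls. This is a common place for production agents to fail.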
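For the ambiguous cases, it can help to keep the selection rule in one small helper and feed it the tool calls you recorded from a run. The sketch below is framework-agnostic and stdlib-only: it assumes you have already reduced the run to a list of (tool_name, args) pairs, and check_tool_selection is a hypothetical helper, not part of PydanticAI.

```python
from collections import Counter


def check_tool_selection(calls, expected, forbidden=()):
    """Return a list of problems; an empty list means selection was correct.

    calls: list of (tool_name, args) pairs recorded from one agent run.
    expected: the tool that must be called exactly once.
    forbidden: tools that must not be called at all for this input.
    """
    counts = Counter(name for name, _ in calls)
    problems = []
    if counts[expected] != 1:
        problems.append(
            f"expected exactly one call to {expected}, saw {counts[expected]}"
        )
    for name in forbidden:
        if counts[name]:
            problems.append(f"forbidden tool {name} was called")
    return problems
```

A status question that triggers cancel_order, or that calls lookup_order_status twice, then shows up as a non-empty problem list with a readable message.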

Layer 3: Testing Dependency Injection

PydanticAI's deps_type system lets you inject context into every tool call. Test that your agent uses that context correctly — and fails gracefully when it's misconfigured.

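from dataclasses import dataclass
from pydantic_ai import Agent, RunContext
from pydantic_ai.messages import ModelRequest, ToolReturnPart
from pydantic_ai.models.test import TestModel

@dataclass
class SupportContext:
    user_id: str
    tier: str  # "standard" | "premium"

# Module-level agent so both tests share it; TestModel calls every registered tool
agent = Agent(TestModel(), deps_type=SupportContext)

@agent.tool
def queue_priority(ctx: RunContext[SupportContext]) -> str:
    # Illustrative tool: routing lives in code that reads the injected deps
    return "high" if ctx.deps.tier == "premium" else "normal"

def tool_returns(result) -> list:
    # Contents of every tool return recorded in the run
    return [
        part.content
        for message in result.all_messages()
        if isinstance(message, ModelRequest)
        for part in message.parts
        if isinstance(part, ToolReturnPart)
    ]

async def test_premium_routing():
    premium_ctx = SupportContext(user_id="u_999", tier="premium")
    result = await agent.run("I need urgent help", deps=premium_ctx)

    # Premium users should be routed to priority queue
    assert tool_returns(result) == ["high"]

async def test_standard_routing():
    standard_ctx = SupportContext(user_id="u_001", tier="standard")
    result = await agent.run("I need urgent help", deps=standard_ctx)

    assert tool_returns(result) == ["normal"]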
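The misconfiguration half of that advice can be covered without an agent at all. One possible approach, sketched here with the stdlib only, is to validate the context when it is constructed so a bad tier fails loudly before any tool sees it. ValidatedSupportContext and VALID_TIERS are hypothetical names, and the set of tiers is taken from the comment on SupportContext.

```python
from dataclasses import dataclass

# Assumption: these are the only tiers the routing logic knows about
VALID_TIERS = {"standard", "premium"}


@dataclass(frozen=True)
class ValidatedSupportContext:
    user_id: str
    tier: str

    def __post_init__(self):
        # Reject bad configuration at construction time instead of mid-run
        if not self.user_id:
            raise ValueError("user_id must not be empty")
        if self.tier not in VALID_TIERS:
            raise ValueError(
                f"unknown tier {self.tier!r}; expected one of {sorted(VALID_TIERS)}"
            )
```

A test then only needs to assert that ValidatedSupportContext(user_id="u_1", tier="gold") raises ValueError, with no model involved.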

Layer 4: Testing Structured Output Correctness

PydanticAI validates that your output matches the schema. You still need to test that the values are correct.

from pydantic import BaseModel
from pydantic_ai import Agent
from pydantic_ai.models.test import TestModel

class RefundDecision(BaseModel):
    approved: bool
    amount: float
    reason: str

async def test_refund_approval_for_late_delivery():
    # Pin the structured output TestModel returns. Against a real model, keep the
    # same assertions and drop the pinned values.
    model = TestModel(custom_output_args={
        "approved": True,
        "amount": 49.99,
        "reason": "delivery exceeded 14 days",
    })
    # output_type / result.output in current releases (formerly result_type / result.data)
    agent = Agent(model, output_type=RefundDecision)

    result = await agent.run("My package arrived 3 weeks late, I want a refund")

    assert result.output.approved is True
    assert result.output.amount > 0
    assert "delivery" in result.output.reason.lower()

Schema conformance is table stakes. Value correctness is what your users care about.

Layer 5: Testing Retry and Error Handling

PydanticAI supports automatic retries when validation fails. Test that your retry logic recovers correctly rather than silently degrading.

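from pydantic_ai import Agent
from pydantic_ai.messages import ModelMessage, ModelResponse, ToolCallPart
from pydantic_ai.models.function import AgentInfo, FunctionModel
# RefundDecision is the model defined in the Layer 4 example

async def test_agent_retries_on_validation_failure():
    # Simulate a model that fails once, then succeeds
    responses = [
        {"invalid_field": "wrong"},   # First call: validation fails
        {"approved": True, "amount": 25.0, "reason": "goodwill gesture"},  # Retry: succeeds
    ]
    calls = 0

    def flaky_model(messages: list[ModelMessage], info: AgentInfo) -> ModelResponse:
        nonlocal calls
        args = responses[calls]
        calls += 1
        # Structured output is delivered by calling the agent's output tool
        output_tool = info.output_tools[0].name
        return ModelResponse(parts=[ToolCallPart(tool_name=output_tool, args=args)])

    agent = Agent(FunctionModel(flaky_model), output_type=RefundDecision, retries=2)

    result = await agent.run("I had a bad experience")

    assert result.output.approved is True
    assert calls == 2  # Confirms retry happened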

If you're not testing retry behavior, you don't know how your agent handles the real-world case where models return malformed outputs under load.
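The other half of retry testing is the exhaustion case: when every attempt is malformed, the run should fail loudly rather than hand back a default. The stdlib-only sketch below illustrates the behavior worth asserting. It is not PydanticAI's implementation, and run_with_retries, validate_refund, and RetriesExhausted are hypothetical names.

```python
class RetriesExhausted(Exception):
    """Raised when no attempt produced a valid output."""


def run_with_retries(call_model, validate, max_retries):
    """Call the model until validate() accepts its output.

    Makes at most 1 + max_retries attempts and raises RetriesExhausted,
    carrying every validation error, if none of them succeeds.
    """
    errors = []
    for attempt in range(1 + max_retries):
        raw = call_model(attempt)
        try:
            return validate(raw)
        except ValueError as exc:
            errors.append(str(exc))
    raise RetriesExhausted(errors)


def validate_refund(raw: dict) -> dict:
    # Minimal stand-in for schema validation: only checks required keys
    missing = {"approved", "amount", "reason"} - set(raw)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return raw
```

With max_retries=2, a model that is wrong once and then right returns the valid payload after two calls, while a model that is always wrong raises RetriesExhausted with three recorded errors. Both outcomes deserve a test.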

What Code-Level Tests Miss

Your PydanticAI tests verify the agent's logic in isolation. They don't verify how the agent behaves once it's deployed against a real model, handling real user inputs.

Production agents fail for reasons outside your test suite:

Model version changes: OpenAI or Anthropic updates a model. Structured output prompts that worked with the old model start returning unexpected formats with the new one.
Input distribution shift: users in production phrase things differently than your test fixtures. The agent works correctly for your cases and fails on the ones you didn't think of.
Dependency drift: the external service your tool calls changes its response schema. Your tool still executes successfully but parses garbage.
Concurrency failures: your agent works fine in single-threaded tests but silently corrupts state when multiple requests arrive simultaneously.
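The concurrency failure in particular can be reproduced in a few lines. The sketch below uses only asyncio. handle_request and handle_request_isolated are hypothetical stand-ins for whatever wraps your agent call, and the shared dict plays the role of any module-level or cached state.

```python
import asyncio


async def handle_request(customer_id: str, shared: dict) -> str:
    # Buggy pattern: per-request state is written to a shared object
    shared["customer_id"] = customer_id
    await asyncio.sleep(0)  # yield control, as a model or tool call would
    return shared["customer_id"]


async def handle_request_isolated(customer_id: str) -> str:
    # Safer pattern: state lives in a local created for this request only
    local = {"customer_id": customer_id}
    await asyncio.sleep(0)
    return local["customer_id"]


async def run_concurrently(n: int):
    """Run n overlapping requests through both handlers and return the results."""
    shared = {}
    ids = [f"cust_{i}" for i in range(n)]
    leaked = await asyncio.gather(*(handle_request(c, shared) for c in ids))
    isolated = await asyncio.gather(*(handle_request_isolated(c) for c in ids))
    return ids, list(leaked), list(isolated)
```

Run one at a time, both handlers return the right customer. Run through asyncio.gather, only the isolated handler still does, which is why a sequential test suite never notices the bug.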

Monitoring PydanticAI Agents in Production

Once your agent is live, you need ongoing behavioral monitoring — not just a CI suite that runs once on deploy.

HelpMeTest lets you write natural language behavioral tests against your deployed PydanticAI endpoint and run them on a schedule:

Test: order status lookup — correct order returned
When user asks: "What is the status of my most recent order?"
And user is authenticated as customer with one completed order
Then: response includes order number
And: response includes current delivery status
And: agent does not ask for clarification
And: response time under 5 seconds

Tests run continuously. If your agent's behavior shifts after a model update, a tool API change, or a dependency version bump, you find out before your users do — not from a support ticket three days later.

Free tier: 10 tests, unlimited health checks.

PydanticAI Testing Checklist

Before shipping any PydanticAI agent:

Unit tests for every tool function with TestModel mocking
Explicit tool selection tests — verify the right tool is called for each input type
Dependency injection tests — verify context is used correctly across all paths
Structured output value tests — not just schema conformance, but correct field values
Retry behavior tests — verify recovery from validation failures
Edge case inputs — empty inputs, malformed requests, boundary values
Error propagation — does the agent fail gracefully when a tool raises an exception?
Production monitoring for behavioral drift after model or tool updates
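The edge-case item is easiest to keep honest as a table of inputs and expected results. The sketch below applies that idea to a hypothetical parse_order_id helper that a tool might use; the helper, its 12-digit limit, and the cases are assumptions for illustration.

```python
import re
from typing import Optional


def parse_order_id(text: str) -> Optional[str]:
    """Extract the digits of an order reference such as '#4521', or None."""
    # Assumption: order IDs are 1 to 12 digits and are written after a '#'
    match = re.search(r"#(\d{1,12})\b", text)
    return match.group(1) if match else None


# (input, expected) pairs: empty, whitespace, malformed, typical, boundary values
EDGE_CASES = [
    ("", None),
    ("   ", None),
    ("order #", None),
    ("What is the status of order #4521?", "4521"),
    ("#0", "0"),
    ("#" + "9" * 13, None),  # one digit over the assumed limit
]


def failures():
    """Return (input, expected, actual) for every case that does not match."""
    return [
        (text, expected, parse_order_id(text))
        for text, expected in EDGE_CASES
        if parse_order_id(text) != expected
    ]
```

Asserting that failures() is empty gives one test that names every broken case at once, and adding a new edge case is a one-line change.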

The type annotations tell you the shape is right. The tests tell you the behavior is right. You need both.
