How to Test PydanticAI Agents Before They Hit Production
You built a PydanticAI agent. The structured outputs look right. The tool calls return what you expect in development. You run it manually a few times and ship it.
Then a user hits a case where the agent calls the wrong tool, your retry logic silently swallows the error, and an output that no longer matches the schema gets serialized into garbage that breaks the downstream system.
This is the PydanticAI testing problem: the type annotations give you confidence at the schema level, but they don't tell you whether your agent behaves correctly across the inputs that matter.
Why PydanticAI Agents Need More Than Unit Tests
PydanticAI is designed around type safety and structured outputs. That's a genuine improvement over untyped LLM frameworks: if the model returns a field of the wrong type, Pydantic validation catches it immediately.
But type validation is not behavior validation. Your agent can return a perfectly typed OrderConfirmation object that contains the wrong order ID. It can also escalate a case it should have handled, or call a tool twice when once was correct.
The testing challenges that remain:
- Tool selection — does the agent pick the right tool for the right input?
- Dependency injection — does the agent use your injected context correctly?
- Retry behavior — does the retry logic recover correctly from transient failures?
- Structured output contents — are the field values correct, not just the field types?
- Multi-step reasoning — does the agent reach the right conclusion from a multi-turn input?
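To make the gap between type validation and behavior validation concrete, here is a minimal standard-library sketch. The `OrderConfirmation` fields and both check functions are assumptions made up for illustration; they are not part of PydanticAI.

```python
from dataclasses import dataclass


@dataclass
class OrderConfirmation:
    # Hypothetical output shape; the field names are assumptions for illustration.
    order_id: str
    status: str


def is_well_typed(output: OrderConfirmation) -> bool:
    """Schema-level check: every field has the declared type."""
    return isinstance(output.order_id, str) and isinstance(output.status, str)


def is_correct(output: OrderConfirmation, expected_order_id: str) -> bool:
    """Behavior-level check: the output refers to the order the user asked about."""
    return output.order_id == expected_order_id


# A reply that passes type validation but names the wrong order.
wrong_order = OrderConfirmation(order_id="4522", status="confirmed")
```

Only the second check catches the mistake, and it needs an expected value that the schema alone cannot supply.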
Layer 1: Unit Testing Agent Nodes
PydanticAI agents are composed from nodes (functions) and tools (decorated functions). Start by testing each in isolation.
from pydantic_ai import Agent
from pydantic_ai.models.test import TestModel
from your_agent import OrderContext

async def test_order_classification():
    # TestModel never calls a real LLM; custom_output_text fixes the text it returns
    model = TestModel(custom_output_text="initiate_return")
    agent = Agent(model, deps_type=OrderContext)
    ctx = OrderContext(customer_id="cust_123", region="EU")
    result = await agent.run("I need to return my order", deps=ctx)
    # Assert the response reflects the correct intent
    # (result.output was named result.data in early releases)
    assert "return" in result.output.lower()
TestModel from pydantic_ai.models.test is PydanticAI's built-in mock. It lets you control model responses without making real API calls, keeping unit tests fast and deterministic.
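A tool's body is ordinary Python, so it can also be tested with no agent or model in the loop. The sketch below assumes a hypothetical `return_window_days` tool and invented return-window values, and redefines `OrderContext` locally so it runs on its own.

```python
from dataclasses import dataclass


@dataclass
class OrderContext:
    # Mirrors the fields used in this section; the real class lives in your_agent.
    customer_id: str
    region: str


def return_window_days(ctx: OrderContext) -> int:
    """Hypothetical tool body: EU orders get a longer return window."""
    return 30 if ctx.region == "EU" else 14


def test_return_window_depends_on_region() -> None:
    assert return_window_days(OrderContext("cust_123", "EU")) == 30
    assert return_window_days(OrderContext("cust_456", "US")) == 14
```

Tests like this run in microseconds and pin down the deterministic part of the agent before any model behavior enters the picture.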
Layer 2: Testing Tool Calls
Tools are where most PydanticAI bugs hide. The model decides which tool to call — and it doesn't always decide correctly.
Test tool invocation explicitly:
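from pydantic_ai import Agent
from pydantic_ai.messages import ModelResponse, ToolCallPart
from pydantic_ai.models.test import TestModel
from your_agent import lookup_order_status, cancel_order

async def test_agent_calls_lookup_tool_for_order_status():
    # call_tools restricts which registered tools TestModel invokes
    model = TestModel(call_tools=["lookup_order_status"])
    agent = Agent(model, tools=[lookup_order_status, cancel_order])
    result = await agent.run("What is the status of order #4521?")
    # Collect every tool call recorded in the run's message history
    tool_calls = [
        part
        for message in result.all_messages()
        if isinstance(message, ModelResponse)
        for part in message.parts
        if isinstance(part, ToolCallPart)
    ]
    # Verify the right tool was called, exactly once
    assert [call.tool_name for call in tool_calls] == ["lookup_order_status"]
    # TestModel fills arguments with schema-valid placeholders, so check the key, not the value
    assert "order_id" in tool_calls[0].args_as_dict()
Don't assume the model will always call the right tool. Explicitly test the cases where two tools could plausibly apply; this is where production agents fail most often. Keep in mind that TestModel calls the tools you tell it to, so a test like this verifies wiring and argument shape. Checking which tool a real model picks for an ambiguous input means running those cases against that model.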
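One framework-agnostic way to assert on tool usage is to wrap each tool so that every call is recorded. The following is a sketch using only the standard library; `record_calls`, `count_calls`, and the canned tool bodies are hypothetical helpers written for this example, not PydanticAI APIs.

```python
import functools
from typing import Any, Callable


def record_calls(log: list) -> Callable:
    """Decorator factory that appends (tool name, keyword args) to `log` on every call."""

    def decorator(tool: Callable) -> Callable:
        @functools.wraps(tool)
        def wrapper(**kwargs: Any) -> Any:
            log.append((tool.__name__, kwargs))
            return tool(**kwargs)

        return wrapper

    return decorator


calls: list = []


@record_calls(calls)
def lookup_order_status(order_id: str) -> str:
    # Hypothetical tool body returning canned data.
    return f"order {order_id}: shipped"


@record_calls(calls)
def cancel_order(order_id: str) -> str:
    # Hypothetical tool body returning canned data.
    return f"order {order_id}: cancelled"


def count_calls(log: list, name: str) -> int:
    """Number of times the named tool was called."""
    return sum(1 for tool_name, _ in log if tool_name == name)
```

A test can then assert both that the expected tool ran with the expected arguments and that it ran exactly once, which covers the "called a tool twice" failure mode.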
Layer 3: Testing Dependency Injection
PydanticAI's deps_type system lets you inject context into every tool call. Test that your agent uses that context correctly — and fails gracefully when it's misconfigured.
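The "fails gracefully when misconfigured" half can be checked without an agent at all. Here is a minimal sketch, assuming a `SupportContext` with `user_id` and `tier` fields like the one this section's tests use; the validation rules and the `context_error` helper are assumptions added for illustration.

```python
from dataclasses import dataclass
from typing import Optional

VALID_TIERS = {"standard", "premium"}


@dataclass
class SupportContext:
    user_id: str
    tier: str  # "standard" | "premium"

    def __post_init__(self) -> None:
        # Reject misconfigured context up front instead of routing on a bad value.
        if not self.user_id:
            raise ValueError("user_id must not be empty")
        if self.tier not in VALID_TIERS:
            raise ValueError(f"unknown tier: {self.tier!r}")


def context_error(user_id: str, tier: str) -> Optional[str]:
    """Return the validation error message, or None if the context is valid."""
    try:
        SupportContext(user_id=user_id, tier=tier)
    except ValueError as exc:
        return str(exc)
    return None
```

Failing at construction time means a typo in a tier name surfaces as a clear error rather than as a silently wrong routing decision.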
from dataclasses import dataclass
from pydantic_ai import Agent, RunContext
from pydantic_ai.models.test import TestModel

@dataclass
class SupportContext:
    user_id: str
    tier: str  # "standard" | "premium"

agent = Agent(TestModel(), deps_type=SupportContext)

@agent.tool
def route_ticket(ctx: RunContext[SupportContext]) -> str:
    # Illustrative tool: the priority comes from the injected context, not from the model
    return "high" if ctx.deps.tier == "premium" else "normal"

async def test_premium_routing():
    premium_ctx = SupportContext(user_id="u_999", tier="premium")
    result = await agent.run("I need urgent help", deps=premium_ctx)
    # Premium users should be routed to the priority queue.
    # TestModel calls every tool and echoes the return values in its text output.
    assert "high" in result.output

async def test_standard_routing():
    standard_ctx = SupportContext(user_id="u_001", tier="standard")
    result = await agent.run("I need urgent help", deps=standard_ctx)
    assert "normal" in result.output
Layer 4: Testing Structured Output Correctness
PydanticAI validates that your output matches the schema. You still need to test that the values are correct.
from pydantic import BaseModel
from pydantic_ai import Agent
from pydantic_ai.models.test import TestModel

class RefundDecision(BaseModel):
    approved: bool
    amount: float
    reason: str

async def test_refund_approval_for_late_delivery():
    # custom_output_args fixes the structured output TestModel returns
    # (early releases named these custom_result_args, result_type, and result.data)
    model = TestModel(custom_output_args={"approved": True, "amount": 49.99, "reason": "delivery exceeded 14 days"})
    agent = Agent(model, output_type=RefundDecision)
    result = await agent.run("My package arrived 3 weeks late, I want a refund")
    assert result.output.approved is True
    assert result.output.amount > 0
    assert "delivery" in result.output.reason.lower()
Schema conformance is table stakes. Value correctness is what your users care about.
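Value checks are easier to reuse when they are written as explicit business rules instead of one-off assertions. The sketch below uses a plain dataclass with the same three fields to stay dependency-free; the rules themselves (refund capped at the order total, denied refunds carry no amount) are assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class RefundDecision:
    # Same fields as the model in this section, as a plain dataclass.
    approved: bool
    amount: float
    reason: str


def refund_violations(decision: RefundDecision, order_total: float) -> list:
    """Return the business rules a decision breaks; an empty list means it is acceptable."""
    violations = []
    if decision.amount < 0:
        violations.append("amount is negative")
    # isclose tolerates float rounding when the refund equals the order total.
    if decision.amount > order_total and not math.isclose(decision.amount, order_total):
        violations.append("amount exceeds order total")
    if not decision.approved and decision.amount != 0:
        violations.append("denied refund has a non-zero amount")
    if not decision.reason.strip():
        violations.append("reason is empty")
    return violations
```

Every output in the example passes type validation, so a function like this is the only thing standing between a well-typed but wrong refund and the downstream system.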
Layer 5: Testing Retry and Error Handling
PydanticAI supports automatic retries when validation fails. Test that your retry logic recovers correctly rather than silently degrading.
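from pydantic_ai import Agent
from pydantic_ai.messages import ModelMessage, ModelResponse, ToolCallPart
from pydantic_ai.models.function import AgentInfo, FunctionModel

async def test_agent_retries_on_validation_failure():
    # Simulate a model that fails once, then succeeds
    responses = [
        {"invalid_field": "wrong"},  # First call: validation fails
        {"approved": True, "amount": 25.0, "reason": "goodwill gesture"},  # Retry: succeeds
    ]
    call_count = 0

    def scripted_model(messages: list[ModelMessage], info: AgentInfo) -> ModelResponse:
        nonlocal call_count
        args = responses[call_count]
        call_count += 1
        # Structured output is returned by calling the agent's output tool
        return ModelResponse(parts=[ToolCallPart(tool_name=info.output_tools[0].name, args=args)])

    # FunctionModel lets the test script each model response; RefundDecision is the Layer 4 model
    agent = Agent(FunctionModel(scripted_model), output_type=RefundDecision, retries=2)
    result = await agent.run("I had a bad experience")
    assert result.output.approved is True
    assert call_count == 2  # Confirms retry happened
If you're not testing retry behavior, you don't know how your agent handles the real-world case where models return malformed outputs under load.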
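The property worth pinning down is that exhausted retries raise an error instead of returning a partial result. The sketch below is a standard-library stand-in for that logic, not PydanticAI's implementation; `run_with_retries`, `scripted`, and `RetriesExhausted` are hypothetical names.

```python
from typing import Callable


class RetriesExhausted(Exception):
    """Raised when no attempt produced a valid output."""


REQUIRED_FIELDS = {"approved", "amount", "reason"}


def run_with_retries(call_model: Callable[[], dict], max_retries: int) -> tuple:
    """Call the model until its output has the required fields.

    Returns (valid output, number of calls made). Raises RetriesExhausted
    rather than returning an invalid output.
    """
    attempts = 0
    while attempts <= max_retries:
        attempts += 1
        output = call_model()
        if REQUIRED_FIELDS.issubset(output):
            return output, attempts
    raise RetriesExhausted(f"no valid output after {attempts} calls")


def scripted(responses: list) -> Callable[[], dict]:
    """Build a fake model that returns the given responses in order."""
    remaining = iter(responses)
    return lambda: next(remaining)
```

With `max_retries=2` the loop makes at most three calls: the initial attempt plus two retries. A test should cover both the recovery path and the exhaustion path.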
What Code-Level Tests Miss
Your PydanticAI tests verify the agent's logic in isolation. They don't verify how the agent behaves once it's deployed against a real model, handling real user inputs.
Production agents fail for reasons outside your test suite:
- Model version changes — OpenAI or Anthropic updates a model. Your structured output prompts that worked with the old model start returning unexpected formats with the new one.
- Input distribution shift — users in production phrase things differently than your test fixtures. The agent works correctly for your cases, fails on the cases you didn't think of.
- Dependency drift — the external service your tool calls changes its response schema. Your tool still executes successfully but parses garbage.
- Concurrency failures — your agent works fine in single-threaded tests, silently corrupts state when multiple requests arrive simultaneously.
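Dependency drift is the one item on this list that a tool can partly defend against in code, by validating the upstream payload before handing it to the agent. The following is a sketch under assumptions: the expected fields and the `parse_order_status` helper are invented for illustration.

```python
from typing import Any


class SchemaDriftError(Exception):
    """Raised when an upstream response no longer has the expected shape."""


# Hypothetical contract for an order-status service response.
EXPECTED_FIELDS = {"order_id": str, "status": str}


def parse_order_status(payload: dict) -> dict:
    """Validate the upstream payload before the tool returns it to the agent."""
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in payload:
            raise SchemaDriftError(f"missing field: {field}")
        value: Any = payload[field]
        if not isinstance(value, expected_type):
            raise SchemaDriftError(f"wrong type for field: {field}")
    # Keep only the fields the tool's contract promises.
    return {field: payload[field] for field in EXPECTED_FIELDS}
```

A renamed or retyped field now fails loudly at the tool boundary instead of flowing into the agent's reasoning as garbage.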
Monitoring PydanticAI Agents in Production
Once your agent is live, you need ongoing behavioral monitoring — not just a CI suite that runs once on deploy.
HelpMeTest lets you write natural language behavioral tests against your deployed PydanticAI endpoint and run them on a schedule:
Test: order status lookup — correct order returned
When user asks: "What is the status of my most recent order?"
And user is authenticated as customer with one completed order
Then: response includes order number
And: response includes current delivery status
And: agent does not ask for clarification
And: response time under 5 seconds
Tests run continuously. If your agent's behavior shifts after a model update, a tool API change, or a dependency version bump, you find out before your users do, not from a support ticket three days later.
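For readers who want to see what such conditions amount to as plain assertions, here is a rough Python equivalent. It is not HelpMeTest's format or API; the `behavioral_failures` function and its regex heuristics (what counts as an order number, a delivery status, or a clarifying question) are assumptions for illustration.

```python
import re


def behavioral_failures(reply: str, elapsed_seconds: float) -> list:
    """Check one agent reply against the four conditions in the test above."""
    failures = []
    # Heuristic: an order number is three or more digits, optionally prefixed with '#'.
    if not re.search(r"#?\d{3,}", reply):
        failures.append("no order number in reply")
    # Heuristic: a small, assumed vocabulary of delivery statuses.
    if not re.search(r"\b(shipped|delivered|in transit|processing)\b", reply, re.IGNORECASE):
        failures.append("no delivery status in reply")
    # Heuristic: a reply ending in a question mark is asking for clarification.
    if reply.rstrip().endswith("?"):
        failures.append("agent asked for clarification")
    if elapsed_seconds >= 5.0:
        failures.append("response took 5 seconds or longer")
    return failures
```

A scheduled job could call the deployed endpoint, time the request, and alert whenever this list is non-empty.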
Free tier: 10 tests, unlimited health checks. Try HelpMeTest →
PydanticAI Testing Checklist
Before shipping any PydanticAI agent:
- Unit tests for every tool function with TestModel mocking
- Explicit tool selection tests: verify the right tool is called for each input type
- Dependency injection tests: verify context is used correctly across all paths
- Structured output value tests: not just schema conformance, but correct field values
- Retry behavior tests: verify recovery from validation failures
- Edge case inputs: empty inputs, malformed requests, boundary values
- Error propagation: does the agent fail gracefully when a tool raises an exception?
- Production monitoring for behavioral drift after model or tool updates
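The error propagation item is the easiest one to leave untested. One possible shape for it, sketched with the standard library only: wrap tool execution so that every exception becomes an explicit failure the test can inspect. `ToolResult`, `call_tool_safely`, and `flaky_lookup` are hypothetical names, not PydanticAI APIs.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToolResult:
    ok: bool
    value: Optional[str] = None
    error: Optional[str] = None


def call_tool_safely(tool: Callable[[str], str], argument: str) -> ToolResult:
    """Run a tool and turn any exception into an explicit, inspectable failure."""
    try:
        return ToolResult(ok=True, value=tool(argument))
    except Exception as exc:  # broad on purpose: no tool error may pass silently
        return ToolResult(ok=False, error=f"{type(exc).__name__}: {exc}")


def flaky_lookup(order_id: str) -> str:
    # Hypothetical tool that rejects empty input and otherwise returns canned data.
    if not order_id:
        raise ValueError("order_id is empty")
    return f"order {order_id}: shipped"
```

The same pattern doubles as an edge-case test: feed the tool empty and malformed inputs and assert that each one yields a failure result with a readable message.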
The type annotations tell you the shape is right. The tests tell you the behavior is right. You need both.