
Testing AI Agents Built with Google ADK

HelpMeTest

13 May 2026 — 7 min read

Google's Agent Development Kit (ADK) makes it fast to build production-grade AI agents. Within a few hundred lines of Python, you have a multi-agent system with tool use, session management, and streaming responses.

ADK ships evaluation tooling (the adk eval command and the AgentEvaluator class), but it doesn't give you a complete testing strategy. That's your problem to solve.

This guide covers how to test ADK agents systematically — from individual tools to full multi-agent pipelines — and how to monitor them once they're live.

What Makes ADK Agents Different to Test

ADK agents share the testing challenges of all LLM-powered systems, but they have some specific characteristics that shape how you test them.

Agents have state across turns. ADK's session management means an agent can remember what happened in previous turns. A test that only covers a single turn will miss bugs that only appear after the agent has built up context.

Tool results feed back into the agent. When an ADK agent calls a tool, the result goes back into the context window. If the tool returns unexpected data, the agent's next response is unpredictable. You're testing the interaction between agent and tool, not just the tool in isolation.

Multi-agent architectures amplify failures. ADK supports orchestrator and sub-agent patterns. An error in a sub-agent doesn't just affect that sub-agent — it propagates to the orchestrator, which may retry, misinterpret the error, or call additional sub-agents trying to compensate.

Streaming changes the test model. ADK agents often stream responses. Your tests need to handle partial responses and verify the final assembled output, not just the first chunk.
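As a minimal sketch of that last point, the helper below collects every chunk of a stream and returns the assembled text, so assertions run against the complete response. The fake_stream generator is a hypothetical stand-in for whatever your agent runner yields; how your ADK setup exposes chunks is not shown here.

```python
from typing import Iterable, Iterator


def assemble_stream(chunks: Iterable[str]) -> str:
    """Join all streamed text chunks into the final response text."""
    return "".join(chunks)


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streaming agent response."""
    yield "The Pro plan "
    yield "is $100"
    yield "/month."


final_text = assemble_stream(fake_stream())

# Assert on the assembled output, never on a single partial chunk.
assert "$100/month" in final_text
```

Note that the fact being checked ("$100/month") is split across two chunks here, which is exactly the case a first-chunk assertion would miss.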

Testing Individual Tools

Every ADK agent has tools. Test them first, in isolation, before testing the full agent.

A tool test doesn't involve the LLM at all. You call the tool function directly with controlled inputs and assert the outputs.

# tools/search.py
def search_knowledge_base(query: str, max_results: int = 5) -> list[dict]:
    """Search the internal knowledge base and return relevant documents."""
    # implementation
    ...

# tests/test_tools.py
from tools.search import search_knowledge_base

def test_search_returns_results_for_valid_query():
    results = search_knowledge_base("product pricing")
    assert len(results) > 0
    assert all("title" in r and "content" in r for r in results)

def test_search_respects_max_results():
    results = search_knowledge_base("pricing", max_results=2)
    assert len(results) <= 2

def test_search_returns_empty_for_gibberish():
    results = search_knowledge_base("xkzqmwbvcf")
    assert results == []

def test_search_handles_empty_query():
    # Either outcome is acceptable: an empty result list or a ValueError.
    # Any other exception fails the test.
    try:
        results = search_knowledge_base("")
    except ValueError:
        return
    assert results == []

Cover happy paths, edge cases, and error conditions. This layer of testing runs fast, is deterministic, and catches the majority of tool bugs.

Testing Agent Responses

Once tools are tested, test the agent's behavior when it calls those tools.

The key is to mock the tool layer when testing agent behavior. This gives you deterministic tests that verify how the agent uses tools, not whether the tools themselves work.

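from unittest.mock import patch

# NOTE: AgentTestCase and its run_agent()/create_session() helpers are treated
# here as a project-specific test harness. This import path could not be
# confirmed as part of the published ADK package, so point it at your own
# base class if needed.
from google.adk.testing import AgentTestCase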

class TestSupportAgent(AgentTestCase):
    
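    # Patch the name where the agent looks it up. If the agent module does
    # "from tools.search import search_knowledge_base", patch that module instead.
    @patch("tools.search.search_knowledge_base")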
    def test_agent_uses_search_for_product_questions(self, mock_search):
        mock_search.return_value = [
            {"title": "Pricing FAQ", "content": "Pro plan is $100/month."}
        ]
        
        response = self.run_agent("What does the pro plan cost?")
        
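        # Verify the tool was called
        mock_search.assert_called_once()
        # The model may pass the query positionally or by keyword,
        # and it may not use the exact word "pricing"
        args, kwargs = mock_search.call_args
        query = args[0] if args else kwargs["query"]
        assert any(term in query.lower() for term in ("pric", "cost", "plan"))
        
        # Verify the response uses the tool output
        assert "100" in response.text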

    @patch("tools.search.search_knowledge_base")
    def test_agent_handles_empty_search_results(self, mock_search):
        mock_search.return_value = []
        
        response = self.run_agent("What is the refund policy for the platinum tier?")
        
        # Agent should acknowledge it doesn't have the information
        # rather than hallucinating an answer
        assert any(phrase in response.text.lower() for phrase in [
            "don't have", "not found", "unable to find", "no information"
        ])

Testing that the agent handles empty tool results correctly is just as important as testing the happy path. Agents that hallucinate when tools return nothing are a significant source of production incidents.

Testing Multi-Turn Conversations

ADK's session management means you need to test sequences of turns, not just individual messages.

# These are methods of an AgentTestCase subclass such as TestSupportAgent above
def test_agent_remembers_user_context_across_turns(self):
    session = self.create_session()
    
    # First turn: establish context
    response1 = self.run_agent(
        "I'm looking for a testing tool for my startup with 5 engineers.",
        session=session
    )
    
    # Second turn: verify context was retained
    response2 = self.run_agent(
        "What would be the monthly cost?",
        session=session
    )
    
    # Agent should remember "startup with 5 engineers" context
    # and give pricing relevant to that profile
    assert response2.text is not None
    # Don't assert exact wording — LLM output varies
    # Assert the semantic meaning instead
    assert any(keyword in response2.text.lower() for keyword in [
        "$100", "pro", "free", "plan"
    ])

def test_agent_does_not_leak_context_between_sessions(self):
    session_a = self.create_session()
    session_b = self.create_session()
    
    # Session A: user mentions confidential information
    self.run_agent("Our budget is $50,000 per year.", session=session_a)
    
    # Session B: different user should not know about session A
    response = self.run_agent("What budget should I plan for?", session=session_b)
    
    # The $50,000 from session A should not appear in session B's response
    assert "50,000" not in response.text
    assert "50000" not in response.text

Session isolation tests are critical for multi-tenant applications. A bug here means user A can see user B's data.
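One way to make leak checks less dependent on a specific phrase like "50,000" is to plant a unique canary token in each session and scan every other session's responses for it. The sketch below assumes you have already collected response texts per session; make_canary and find_leaks are illustrative names, not ADK functions.

```python
import uuid


def make_canary() -> str:
    """Return a unique token that a model is very unlikely to produce by chance."""
    return f"CANARY-{uuid.uuid4().hex}"


def find_leaks(
    canaries: dict[str, str],
    responses: dict[str, list[str]],
) -> list[tuple[str, str]]:
    """Return (owner_session, leaking_session) pairs where a canary planted
    in one session appears in a different session's responses."""
    leaks = []
    for owner, canary in canaries.items():
        for session_id, texts in responses.items():
            if session_id == owner:
                continue  # A session may legitimately echo its own canary.
            if any(canary in text for text in texts):
                leaks.append((owner, session_id))
    return leaks


# Example with hand-written responses standing in for real agent output.
canaries = {"session_a": make_canary(), "session_b": make_canary()}
responses = {
    "session_a": [f"Noted, your reference is {canaries['session_a']}."],
    "session_b": ["I don't have any budget information for you yet."],
}
assert find_leaks(canaries, responses) == []
```

In a real test, each canary would be sent as part of a user message in its own session before the responses are collected.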

Testing Multi-Agent Pipelines

ADK's orchestrator/sub-agent pattern creates a new testing challenge: you need to verify that the orchestrator correctly routes tasks to sub-agents and assembles their results.

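# Methods of an AgentTestCase-style class. tech_support_agent, billing_agent,
# AgentResponse, and run_orchestrator() are assumed to come from your project.
def test_orchestrator_routes_technical_questions_to_tech_agent(self):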
    with patch.object(tech_support_agent, "run") as mock_tech:
        with patch.object(billing_agent, "run") as mock_billing:
            mock_tech.return_value = AgentResponse(text="Technical answer here.")
            
            response = self.run_orchestrator("How do I configure SSO?")
            
            # Should route to tech agent, not billing
            mock_tech.assert_called_once()
            mock_billing.assert_not_called()

def test_orchestrator_handles_sub_agent_failure(self):
    with patch.object(tech_support_agent, "run") as mock_tech:
        mock_tech.side_effect = Exception("Sub-agent timeout")
        
        response = self.run_orchestrator("How do I configure SSO?")
        
        # Orchestrator should degrade gracefully, not crash
        assert response is not None
        assert response.text is not None
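        # Should communicate the failure without exposing the internal error.
        # Two separate assertions: an apology must not excuse a leaked error.
        text = response.text.lower()
        assert "sub-agent timeout" not in text
        assert any(phrase in text for phrase in ["sorry", "unable", "try again"])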

The second test, sub-agent failure, is often left untested even though it is a common source of production outages. When a sub-agent fails, what does the orchestrator do?
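One possible answer is a thin wrapper that catches the failure, logs the internal detail, and returns a user-safe message. The sketch below is framework-agnostic and treats a sub-agent as a plain callable; call_sub_agent_safely and FALLBACK_MESSAGE are hypothetical names, and a real ADK orchestrator would wire this in differently.

```python
import logging
from typing import Callable

logger = logging.getLogger("orchestrator")

FALLBACK_MESSAGE = "Sorry, I couldn't complete that request. Please try again shortly."


def call_sub_agent_safely(sub_agent: Callable[[str], str], query: str) -> str:
    """Call a sub-agent and return a user-safe message if it raises."""
    try:
        return sub_agent(query)
    except Exception:
        # Keep the internal error in the logs, not in the user-facing text.
        logger.exception("Sub-agent failed for query %r", query)
        return FALLBACK_MESSAGE


def failing_agent(query: str) -> str:
    """Stand-in sub-agent that always times out."""
    raise TimeoutError("Sub-agent timeout")


reply = call_sub_agent_safely(failing_agent, "How do I configure SSO?")
assert reply == FALLBACK_MESSAGE
assert "timeout" not in reply.lower()
```

Whether to retry, fall back to another sub-agent, or return the apology immediately is a design decision. The test only needs to pin down whichever behavior you choose.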

Behavioral Testing vs. Exact Match

A mistake many teams make when testing LLM agents: asserting exact strings in responses.

This creates brittle tests that fail whenever the model is updated, even when the behavior is correct. A test that asserts response.text == "The Pro plan costs $100 per month." will fail if the model says "The Pro plan is $100/month." — which is functionally identical.

Test behavior, not wording:

# Bad: brittle
assert response.text == "The Pro plan costs $100 per month."

# Good: behavioral
assert "$100" in response.text
assert "pro" in response.text.lower()

# Better: semantic
assert pricing_mentioned(response.text)  # custom validator you write yourself

# Best for complex assertions: use an LLM as judge
# (llm_judge is a placeholder for your own judge wrapper, not a library object)
assert llm_judge.evaluates_as_correct(
    question="What does the Pro plan cost?",
    expected_fact="$100/month",
    response=response.text
)

For complex behavioral assertions, using a second LLM to evaluate the first LLM's response is a legitimate testing pattern. Define what "correct" means in plain language, and let a judge model verify it.
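To make the two helpers above concrete, here is a minimal sketch. pricing_mentioned uses a simple regular expression, and judge_is_correct takes the judge model as an injected callable (ask_judge), so the test harness decides which model to call. Both names, the prompt wording, and the PASS/FAIL protocol are assumptions for illustration.

```python
import re
from typing import Callable

# Matches dollar amounts such as $100, $ 100, or $1,200.50.
PRICE_PATTERN = re.compile(r"\$\s?\d[\d,]*(?:\.\d{2})?")


def pricing_mentioned(text: str) -> bool:
    """Return True if the text contains a dollar amount."""
    return PRICE_PATTERN.search(text) is not None


def judge_is_correct(
    question: str,
    expected_fact: str,
    response: str,
    ask_judge: Callable[[str], str],
) -> bool:
    """Ask a judge model whether the response states the expected fact."""
    prompt = (
        f"Question: {question}\n"
        f"Expected fact: {expected_fact}\n"
        f"Response: {response}\n"
        "Answer PASS if the response states the expected fact, otherwise FAIL."
    )
    verdict = ask_judge(prompt)
    return verdict.strip().upper().startswith("PASS")


def fake_judge(prompt: str) -> str:
    """Deterministic stand-in for a real judge model, for demonstration only."""
    response_part = prompt.split("Response: ", 1)[1]
    return "PASS" if "$100" in response_part else "FAIL"


assert pricing_mentioned("The Pro plan is $100/month.")
assert judge_is_correct(
    "What does the Pro plan cost?", "$100/month",
    "The Pro plan is $100/month.", fake_judge,
)
```

Injecting the judge as a callable keeps the assertion logic testable without network access. In a real suite, ask_judge would wrap a call to a second model.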

End-to-End Testing Through the UI

If your ADK agent powers a chat interface, customer support widget, or any user-facing product, you need end-to-end tests that exercise the full stack — from user input through the agent to the rendered response.

Protocol-level tests (Python unit tests of the agent) verify the agent's logic. End-to-end tests verify that the agent is correctly wired into the application, that responses render correctly, that error states are handled gracefully, and that the experience is actually usable.

HelpMeTest lets you write these tests in plain English, without code:

Open the support chat widget.
Type: "I need help setting up SSO for my team."
Wait for the agent to respond.
Verify the response mentions configuration steps or asks a clarifying question.
Verify no error message appears.
Verify the response renders completely (no cut-off text, no spinner stuck).

These tests run in a real browser, catch rendering issues the unit tests can't see, and verify the full integration from the user's perspective. Run them in CI and you'll catch integration breakage before users do.

Testing in Production: Tracing and Monitoring

Testing before deployment catches known failures. Production monitoring catches unknown ones.

For ADK agents in production:

Enable tracing. ADK integrates with OpenTelemetry and Google Cloud Trace. Every agent execution should produce a trace showing which tools were called, in what order, with what inputs and outputs, and how long each step took. Without traces, debugging production failures is guesswork.

Track key metrics per agent run:

Latency (time to first token, time to complete response)
Tool call count (agents that call more tools than expected may be confused)
Session length (very long sessions may indicate the agent is stuck in a loop)
Error rate (tool failures, model API errors, timeouts)

Set up alerts:

Error rate > 5% on any agent in a rolling 10-minute window
P95 latency > 2× baseline
Tool call count > 3× the typical average (loop detection)
Session length > 20 turns (loop detection)
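The alert rules above can be sketched as a pure function over one window of run records. The AgentRun fields and rule names are assumptions chosen for illustration. In practice these values would come from your tracing or metrics backend, and the thresholds would live in your alerting system rather than in application code.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class AgentRun:
    latency_s: float   # Time to complete the response, in seconds
    tool_calls: int    # Number of tool calls in this run
    turns: int         # Session length in turns at the time of this run
    failed: bool       # True for tool failures, model API errors, timeouts


def check_alerts(
    runs: list[AgentRun],
    baseline_p95_s: float,
    typical_tool_calls: float,
) -> list[str]:
    """Return the names of the alert rules that fire for one window of runs."""
    alerts: list[str] = []
    if not runs:
        return alerts

    error_rate = sum(r.failed for r in runs) / len(runs)
    if error_rate > 0.05:
        alerts.append("error_rate")

    if len(runs) >= 2:  # quantiles() needs at least two data points
        p95 = quantiles([r.latency_s for r in runs], n=20)[-1]
        if p95 > 2 * baseline_p95_s:
            alerts.append("p95_latency")

    if any(r.tool_calls > 3 * typical_tool_calls for r in runs):
        alerts.append("tool_call_loop")

    if any(r.turns > 20 for r in runs):
        alerts.append("long_session")

    return alerts


window = [AgentRun(1.0, 2, 3, False) for _ in range(20)]
window.append(AgentRun(1.0, 10, 25, True))  # One looping, failed run
fired = check_alerts(window, baseline_p95_s=1.0, typical_tool_calls=2)
```

Keeping the rules in a pure function makes the thresholds themselves unit-testable before they are wired into a monitoring stack.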

Log all inputs and outputs (with PII scrubbing). When a user reports the agent did something wrong, you need to be able to reproduce it exactly.
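As a rough illustration of the scrubbing step, the function below masks email addresses and phone-like numbers before a message is logged. The patterns are deliberately crude assumptions and will both miss some PII and over-match some plain numbers. A production system would more likely use a dedicated PII detection service.

```python
import re

# Crude patterns for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


scrubbed = scrub_pii("Contact jane.doe@example.com or +1 (555) 123-4567.")
assert "jane.doe@example.com" not in scrubbed
assert "555" not in scrubbed
```

If exact reproduction matters, consider storing the unscrubbed original in an access-controlled store and only the scrubbed version in general-purpose logs.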

Regression Testing After Model Updates

ADK lets you swap the underlying model. When Google releases a new Gemini version, you want to adopt it — but you need to verify it doesn't break your agent's behavior.

Before updating your model:

1. Collect 100+ real conversations from production that represent diverse use cases
2. Run all of them through both the old model and the new model
3. Compare outputs using your behavioral test assertions
4. Flag cases where the new model's output violates any assertion
5. Manually review the flagged cases

This gives you a regression test suite derived from real usage, which is far more valuable than synthetic test data.

For continuous evaluation, run your test suite against every new model version automatically. If the pass rate drops, don't upgrade until you understand why.
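The comparison steps above can be sketched as follows. Each model is represented as a plain callable from prompt to response text, and each case pairs a prompt with the behavioral checks it must satisfy. These shapes, and the names flag_violations and pass_rate, are assumptions for illustration. Wiring in real ADK agents and real production conversations is left out.

```python
from typing import Callable

Model = Callable[[str], str]
Check = Callable[[str], bool]
Case = tuple[str, list[Check]]


def flag_violations(cases: list[Case], model: Model) -> list[str]:
    """Return the prompts whose model output violates at least one check."""
    flagged = []
    for prompt, checks in cases:
        output = model(prompt)
        if not all(check(output) for check in checks):
            flagged.append(prompt)
    return flagged


def pass_rate(cases: list[Case], model: Model) -> float:
    """Return the fraction of cases where every check passes."""
    if not cases:
        return 1.0
    return 1 - len(flag_violations(cases, model)) / len(cases)


def old_model(prompt: str) -> str:
    """Stand-in for the current model."""
    return "The Pro plan is $100/month."


def new_model(prompt: str) -> str:
    """Stand-in for the candidate model, which drops the price."""
    return "Pricing varies by plan."


cases: list[Case] = [
    ("What does the Pro plan cost?", [lambda text: "$100" in text]),
]

old_rate = pass_rate(cases, old_model)
new_rate = pass_rate(cases, new_model)
to_review = flag_violations(cases, new_model)  # Cases for manual review
```

Because LLM output is not deterministic, a single run per case can be noisy. Running each case several times and comparing pass rates is a reasonable extension.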

The Testing Maturity Ladder

If you're starting from zero, here's a practical progression:

Level 1 (Week 1): Unit test every tool function. This is fast and catches the most bugs per hour of effort.

Level 2 (Week 2): Add agent-level tests with mocked tools. Verify routing logic, empty result handling, and error degradation.

Level 3 (Week 3): Add multi-turn tests. Cover context retention and session isolation.

Level 4 (Month 2): Add end-to-end tests through the UI. Verify the full stack.

Level 5 (Month 3): Add production monitoring, tracing, and alerts. Build a regression test suite from real conversations.

Most teams stop at Level 1 or 2. Level 5 is where production incidents stop being surprises: you catch more of them early, and the ones that slip through come with the traces you need for an informed post-mortem.

ADK makes it easy to build agents that are impressive in demos. Making them reliable in production is a different skill — one that's built through systematic testing, not by hoping the model gets it right.

Test the tools. Test the behavior. Test the integration. Monitor what you can't test in advance.