
How to Test Agno Agents and Multi-Agent Teams

HelpMeTest

13 May 2026

You built an Agno agent team. The researcher agent finds the relevant data, the analyst agent synthesizes it, and the writer agent produces the final output. Each agent works correctly in isolation. You deploy the team. In production, the researcher agent returns a result in an unexpected format, the analyst agent can't parse it, and the team silently produces an empty output. No exception. No alert. You find out when the results page shows nothing.

This is the multi-agent coordination failure mode that Agno makes easy to create and hard to detect without tests. Agno (formerly Phidata) makes building fast, lightweight agents straightforward — but fast development velocity with no test coverage means fast deployment of broken systems.

Here's how to test Agno agents properly.

What You're Testing in an Agno Application

Agno applications typically have these layers:

Individual agents — each agent has instructions, tools, and optionally a knowledge base
Tools — functions the agent calls (APIs, database queries, calculators)
Memory and storage — persistent context across runs
Multi-agent teams — agents that delegate to each other via Team
Workflows — stateful sequences with branching logic

Each layer has its own failure mode. Test them separately before testing them together.

Unit Testing Agno Tools

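Tools are the safest starting point. An Agno tool is a Python function, so you can test it directly without involving the agent or the LLM. If the @tool decorator in your Agno version wraps the function in a tool object that is not directly callable, keep the undecorated function importable and test that instead.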

# tools/search_tool.py
import httpx
from agno.tools import tool

@tool
def search_web(query: str) -> str:
    """Search the web and return the top result snippet."""
    response = httpx.get(
        "https://api.search.example.com/search",
        params={"q": query, "limit": 1},
        timeout=10,
    )
    response.raise_for_status()
    results = response.json()
    if not results["items"]:
        return "No results found."
    return results["items"][0]["snippet"]

# tests/test_tools.py
import pytest
from unittest.mock import patch, MagicMock
from tools.search_tool import search_web

def test_search_tool_returns_snippet():
    mock_response = MagicMock()
    mock_response.json.return_value = {
        "items": [{"snippet": "Python is a programming language."}]
    }
    mock_response.raise_for_status = MagicMock()

    with patch("httpx.get", return_value=mock_response):
        result = search_web("what is python")

    assert "Python" in result
    assert isinstance(result, str)

def test_search_tool_handles_empty_results():
    mock_response = MagicMock()
    mock_response.json.return_value = {"items": []}
    mock_response.raise_for_status = MagicMock()

    with patch("httpx.get", return_value=mock_response):
        result = search_web("xqzrtplm nonsense query")

    assert result == "No results found."

def test_search_tool_propagates_http_errors():
    import httpx
    with patch("httpx.get", side_effect=httpx.HTTPError("Connection refused")):
        with pytest.raises(httpx.HTTPError):
            search_web("anything")

Tool tests run in milliseconds, need no API keys, and catch argument validation bugs and error handling gaps before they reach the agent.
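One way to widen error-path coverage is to pull the response parsing out of the tool, so malformed payloads can be tested without patching httpx at all. The sketch below is only an illustration: `extract_snippet` and `NO_RESULTS` are hypothetical names that are not part of the original tool or of Agno, and the payload shape is the one implied by `search_web` above.

```python
from typing import Any

NO_RESULTS = "No results found."


def extract_snippet(payload: Any) -> str:
    """Return the first result snippet, or a fallback for empty or malformed payloads.

    Hypothetical helper: mirrors the parsing inside search_web, but also
    tolerates a missing "items" key or a result without a "snippet" field.
    """
    if not isinstance(payload, dict):
        return NO_RESULTS
    items = payload.get("items")
    if not isinstance(items, list) or not items:
        return NO_RESULTS
    first = items[0]
    if not isinstance(first, dict):
        return NO_RESULTS
    snippet = first.get("snippet")
    if isinstance(snippet, str) and snippet:
        return snippet
    return NO_RESULTS


def test_extract_snippet_handles_malformed_payloads():
    good = {"items": [{"snippet": "Python is a language."}]}
    assert extract_snippet(good) == "Python is a language."

    # Each of these would raise KeyError or TypeError in a naive parser.
    for bad in ({}, {"items": None}, {"items": []}, {"items": [{}]}, [], None):
        assert extract_snippet(bad) == NO_RESULTS
```

If the tool delegated to a helper like this, the HTTP tests above would only need to cover transport errors, and the payload edge cases would live in plain function tests.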

Testing Individual Agno Agents

Agent tests verify that the agent calls the right tool for a given input. Mock the LLM to keep tests deterministic and free.

# tests/test_agent.py
import pytest
from unittest.mock import patch, MagicMock
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from tools.search_tool import search_web

def create_research_agent():
    return Agent(
        name="Research Agent",
        model=OpenAIChat(id="gpt-4o"),
        tools=[search_web],
        instructions="Search the web to answer factual questions. Always use the search_web tool.",
    )

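def test_agent_calls_search_tool_for_factual_question():
    # Record tool calls with a local stand-in. Patching
    # "tools.search_tool.search_web" would not intercept anything, because
    # the agent already holds its own reference to the tool object.
    recorded_queries = []

    def search_web(query: str) -> str:
        """Search the web and return the top result snippet."""
        recorded_queries.append(query)
        return "Python was created in 1991."

    agent = Agent(
        name="Research Agent",
        model=OpenAIChat(id="gpt-4o"),
        tools=[search_web],
        instructions="Search the web to answer factual questions. Always use the search_web tool.",
    )

    with patch.object(agent.model, "invoke") as mock_invoke:
        # First response: the LLM decides to call search_web.
        # Second response: the LLM answers, which ends the tool loop.
        # The response shape Agno expects from the model may differ by version.
        mock_invoke.side_effect = [
            MagicMock(
                content="",
                tool_calls=[{
                    "name": "search_web",
                    "arguments": {"query": "Python programming language history"}
                }]
            ),
            MagicMock(content="Python was created in 1991.", tool_calls=[]),
        ]
        agent.run("When was Python created?")

    assert len(recorded_queries) == 1
    assert "python" in recorded_queries[0].lower()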

def test_agent_response_is_non_empty():
    agent = create_research_agent()

    with patch.object(agent.model, "invoke") as mock_invoke:
        mock_invoke.return_value = MagicMock(
            content="Python was created by Guido van Rossum in 1991.",
            tool_calls=[]
        )
        response = agent.run("When was Python created?")

    assert response.content
    assert len(response.content) > 10

Testing Agno Memory

If your agents use memory — AgentMemory, SqliteStorage, or a vector-based memory store — test that context persists correctly.

# tests/test_memory.py
import pytest
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.memory.agent import AgentMemory
from agno.storage.sqlite import SqliteStorage
import tempfile
import os

def test_agent_recalls_user_information():
    # Use a temp database so tests don't pollute each other
    with tempfile.NamedTemporaryFile(suffix=".db", delete=False) as f:
        db_path = f.name

    try:
        storage = SqliteStorage(table_name="test_sessions", db_file=db_path)
        agent = Agent(
            name="Memory Agent",
            model=OpenAIChat(id="gpt-4o"),
            memory=AgentMemory(db=storage),
            instructions="Remember what the user tells you about themselves.",
        )

        # Verifying actual recall needs a real or mocked LLM. This smoke test
        # only checks that storage and memory initialize without error.
        assert storage is not None
        assert agent.memory is not None
    finally:
        os.unlink(db_path)

def test_memory_storage_persists_across_agent_instances():
    with tempfile.NamedTemporaryFile(suffix=".db", delete=False) as f:
        db_path = f.name

    try:
        storage1 = SqliteStorage(table_name="sessions", db_file=db_path)
        storage2 = SqliteStorage(table_name="sessions", db_file=db_path)

        # Write via storage1, read via storage2 — same underlying DB
        # (actual write/read methods depend on Agno version)
        assert storage1 is not None
        assert storage2 is not None
    finally:
        os.unlink(db_path)
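The persistence test above leaves the actual write and read as a comment because the storage methods depend on the Agno version. As a rough illustration of the round trip it is aiming for, here is a minimal sketch that uses the standard-library sqlite3 module as a stand-in. `write_session` and `read_session` are hypothetical helpers, not Agno APIs, and the table layout is invented for the example.

```python
import os
import sqlite3
import tempfile
from contextlib import closing
from typing import Optional


def write_session(db_path: str, session_id: str, payload: str) -> None:
    """Store a payload under a session id (stand-in for an Agno storage write)."""
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS sessions (id TEXT PRIMARY KEY, payload TEXT)"
            )
            conn.execute(
                "INSERT OR REPLACE INTO sessions (id, payload) VALUES (?, ?)",
                (session_id, payload),
            )


def read_session(db_path: str, session_id: str) -> Optional[str]:
    """Read a payload back through a fresh connection, or None if absent."""
    with closing(sqlite3.connect(db_path)) as conn:
        row = conn.execute(
            "SELECT payload FROM sessions WHERE id = ?", (session_id,)
        ).fetchone()
    return row[0] if row else None


def test_session_persists_across_connections():
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "sessions.db")
        write_session(db_path, "test-session-recall", "user likes Python")
        # A second, independent connection sees the committed data.
        assert read_session(db_path, "test-session-recall") == "user likes Python"
        assert read_session(db_path, "unknown-session") is None
```

In a real suite, the two helpers would be replaced by the write and read calls of whichever Agno storage class you use. The shape of the test stays the same: write through one instance, read through another, assert on the value.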

Testing Agno Multi-Agent Teams

Teams are where the most subtle failures occur. An agent returns a result in a format the next agent doesn't understand, and the team produces garbage silently.

# tests/test_team.py
import pytest
from unittest.mock import patch, MagicMock
from agno.agent import Agent
from agno.team import Team
from agno.models.openai import OpenAIChat

def create_research_team():
    researcher = Agent(
        name="Researcher",
        model=OpenAIChat(id="gpt-4o"),
        instructions="Find factual information on the given topic.",
        role="Research information about the given topic",
    )

    analyst = Agent(
        name="Analyst",
        model=OpenAIChat(id="gpt-4o"),
        instructions="Summarize and analyze the research provided to you.",
        role="Analyze and summarize research findings",
    )

    return Team(
        name="Research Team",
        members=[researcher, analyst],  # Agno's Team takes its agents via `members`
        model=OpenAIChat(id="gpt-4o"),
        instructions="Coordinate research and analysis tasks.",
    )

def test_team_runs_without_error():
    team = create_research_team()

    with patch.object(team.model, "invoke") as mock_team_invoke:
        mock_team_invoke.return_value = MagicMock(
            content="Based on research: Python was created in 1991 by Guido van Rossum.",
            tool_calls=[]
        )

        response = team.run("Tell me about Python's history.")

    assert response is not None

def test_team_has_correct_agent_count():
    team = create_research_team()
    assert len(team.members) == 2
    assert team.members[0].name == "Researcher"
    assert team.members[1].name == "Analyst"

def test_team_agents_have_correct_instructions():
    team = create_research_team()
    researcher = team.members[0]
    analyst = team.members[1]

    assert "factual information" in researcher.instructions.lower()
    assert "summarize" in analyst.instructions.lower()

Structural tests — verifying agent count, names, and instruction content — catch configuration drift. When someone edits the prompt on the researcher agent and it breaks the analyst's expectations, you want to know immediately.
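Structural tests check configuration, not the format of what one agent hands to the next. One possible complement is a contract test on the handoff itself. The sketch below assumes, purely for illustration, that the researcher is instructed to return a JSON object with `topic` and `findings` fields. `ResearchResult`, `HandoffError`, and `parse_research_output` are hypothetical names, not part of Agno.

```python
import json
from dataclasses import dataclass
from typing import List


class HandoffError(ValueError):
    """Raised when one agent's output does not match what the next agent expects."""


@dataclass(frozen=True)
class ResearchResult:
    topic: str
    findings: List[str]


def parse_research_output(raw: str) -> ResearchResult:
    """Validate the researcher's output before the analyst sees it."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise HandoffError(f"researcher output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise HandoffError("researcher output must be a JSON object")

    topic = data.get("topic")
    findings = data.get("findings")
    if not isinstance(topic, str) or not topic.strip():
        raise HandoffError("missing or empty 'topic'")
    if not isinstance(findings, list) or not findings:
        raise HandoffError("missing or empty 'findings'")
    if not all(isinstance(item, str) for item in findings):
        raise HandoffError("'findings' must be a list of strings")
    return ResearchResult(topic=topic, findings=findings)


def test_valid_handoff_parses():
    raw = '{"topic": "Python", "findings": ["Created in 1991."]}'
    result = parse_research_output(raw)
    assert result.topic == "Python"
    assert result.findings == ["Created in 1991."]


def test_malformed_handoff_fails_loudly():
    bad_outputs = (
        "",
        "Python was created in 1991.",  # free text instead of JSON
        "[]",
        '{"topic": "Python"}',
        '{"topic": "Python", "findings": []}',
    )
    for raw in bad_outputs:
        try:
            parse_research_output(raw)
        except HandoffError:
            continue
        raise AssertionError(f"expected HandoffError for {raw!r}")
```

The design choice here is to turn a silent empty result into an explicit exception at the boundary between agents, which is the failure the team-level tests above cannot see.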

End-to-End Tests with Real LLMs

For staging environments, run a small set of end-to-end tests with real LLM calls to catch prompt regressions:

# tests/test_e2e.py
import pytest
import os

pytestmark = pytest.mark.skipif(
    not os.getenv("OPENAI_API_KEY"),
    reason="Requires OPENAI_API_KEY for E2E tests"
)

def test_research_team_produces_coherent_output():
    from teams.research_team import create_research_team

    team = create_research_team()
    response = team.run("What is the capital of France?")

    assert response.content
    assert "paris" in response.content.lower()

def test_research_team_responds_within_timeout():
    import time
    from teams.research_team import create_research_team

    team = create_research_team()
    start = time.time()
    response = team.run("Summarize the key benefits of automated testing.")
    elapsed = time.time() - start

    assert response.content
    assert elapsed < 60, f"Team took {elapsed:.1f}s — too slow for production"

Mark E2E tests to run only in CI/CD on deploy, not on every commit. They call real LLMs, cost money, and are slow — but they catch the class of bugs that mocks can't find.
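The skipif guard above keys only on the API key, so any machine that has `OPENAI_API_KEY` set will run the E2E tests on every test invocation. A minimal sketch of an explicit opt-in follows. The `RUN_E2E` variable and the `e2e_enabled` helper are assumptions made for this example, not pytest or Agno conventions.

```python
import os
from typing import Mapping


def e2e_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    """True only when E2E runs are explicitly requested and an API key is present."""
    requested = environ.get("RUN_E2E") == "1"  # hypothetical opt-in flag
    has_key = bool(environ.get("OPENAI_API_KEY"))
    return requested and has_key


# In tests/test_e2e.py, this could replace the key-only check:
#
#   pytestmark = pytest.mark.skipif(
#       not e2e_enabled(),
#       reason="Set RUN_E2E=1 and OPENAI_API_KEY to run E2E tests",
#   )
```

The deploy job would then set `RUN_E2E=1`, while local runs and per-commit CI skip these tests even when a key happens to be in the environment.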

How HelpMeTest Helps

Your Agno application has a user interface. Browser-level tests verify what users actually experience — not just what your Python tests confirm.

Write plain-English tests against the actual UI:

Go to https://your-agno-app.com
Click the input field
Type "Research the history of Python programming"
Wait for the research team to respond
Verify the response mentions "Guido van Rossum" or "1991"

HelpMeTest runs these on a schedule. If your multi-agent team breaks in production — an LLM provider outage, a timeout that makes the UI spinner never stop, a response format change — you're alerted within minutes.

Use helpmetest health agno-team 5m to ping your team's API endpoint every 5 minutes and verify it responds. That's faster than waiting for a user ticket.

The free tier covers 10 tests — enough to protect your critical agent flows from the start.

What to Ship

Minimum test coverage for a production Agno application:

Tool tests — every tool function has a unit test for happy path and error path
Agent configuration tests — agents have the correct tools, instructions, and model
Team structure tests — correct agent count, roles, and coordination instructions
Memory tests — storage initializes correctly and persists across sessions
E2E tests — at least one real-LLM test per major team capability, run on deploy
Browser smoke tests — latency and response quality checks running in production

Agno teams are fast to build. Make sure they're as fast to verify.

Start with HelpMeTest's free tier — 10 tests, no credit card. Add browser-level monitoring to your Agno application in under ten minutes at helpmetest.com.
