How to Test Agno Agents and Multi-Agent Teams
You built an Agno agent team. The researcher agent finds the relevant data, the analyst agent synthesizes it, and the writer agent produces the final output. Each agent works correctly in isolation. You deploy the team. In production, the researcher agent returns a result in an unexpected format, the analyst agent can't parse it, and the team silently produces an empty output. No exception. No alert. You find out when the results page shows nothing.
This is the multi-agent coordination failure mode that Agno makes easy to create and hard to detect without tests. Agno (formerly Phidata) makes building fast, lightweight agents straightforward — but fast development velocity with no test coverage means fast deployment of broken systems.
Here's how to test Agno agents properly.
What You're Testing in an Agno Application
Agno applications typically have these layers:
- Individual agents — each agent has instructions, tools, and optionally a knowledge base
- Tools — functions the agent calls (APIs, database queries, calculators)
- Memory and storage — persistent context across runs
- Multi-agent teams — agents that delegate to each other via Team
- Workflows — stateful sequences with branching logic
Each layer has its own failure mode. Test them separately before testing them together.
Unit Testing Agno Tools
Tools are the safest starting point. An Agno tool is a Python function — test it directly without involving the agent or LLM.
# tools/search_tool.py
import httpx
from agno.tools import tool


@tool
def search_web(query: str) -> str:
    """Search the web and return the top result snippet."""
    response = httpx.get(
        "https://api.search.example.com/search",
        params={"q": query, "limit": 1},
        timeout=10,
    )
    response.raise_for_status()
    results = response.json()
    if not results["items"]:
        return "No results found."
    return results["items"][0]["snippet"]

# tests/test_tools.py
import httpx
import pytest
from unittest.mock import patch, MagicMock
from tools.search_tool import search_web

# Note: depending on the Agno version, @tool may return a wrapper object
# rather than the bare function. If calling search_web(...) directly fails,
# call the wrapped function instead (recent versions appear to expose it
# as search_web.entrypoint; check your installed version).


def test_search_tool_returns_snippet():
    mock_response = MagicMock()
    mock_response.json.return_value = {
        "items": [{"snippet": "Python is a programming language."}]
    }
    mock_response.raise_for_status = MagicMock()
    with patch("httpx.get", return_value=mock_response):
        result = search_web("what is python")
    assert "Python" in result
    assert isinstance(result, str)


def test_search_tool_handles_empty_results():
    mock_response = MagicMock()
    mock_response.json.return_value = {"items": []}
    mock_response.raise_for_status = MagicMock()
    with patch("httpx.get", return_value=mock_response):
        result = search_web("xqzrtplm nonsense query")
    assert result == "No results found."


def test_search_tool_propagates_http_errors():
    with patch("httpx.get", side_effect=httpx.HTTPError("Connection refused")):
        with pytest.raises(httpx.HTTPError):
            search_web("anything")

Tool tests run in milliseconds, need no API keys, and catch argument validation bugs and error-handling gaps before they reach the agent.
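One error path the tests above do not cover is a malformed payload: the tool indexes results["items"][0]["snippet"] directly, so a response without an "items" key would raise KeyError inside the agent run. A minimal sketch of one way to test that path, assuming you pull the parsing into its own function. The name parse_top_snippet is hypothetical and is not part of Agno or the original tool.

```python
from typing import Any

NO_RESULTS = "No results found."


def parse_top_snippet(payload: Any) -> str:
    """Hypothetical helper: extract the first snippet from a search payload.

    Returns a fixed message instead of raising when the payload is empty
    or not shaped as expected.
    """
    if not isinstance(payload, dict):
        return NO_RESULTS
    items = payload.get("items")
    if not isinstance(items, list) or not items:
        return NO_RESULTS
    first = items[0]
    if not isinstance(first, dict):
        return NO_RESULTS
    snippet = first.get("snippet")
    if isinstance(snippet, str) and snippet:
        return snippet
    return NO_RESULTS


def test_parse_top_snippet_happy_path():
    payload = {"items": [{"snippet": "Python is a programming language."}]}
    assert parse_top_snippet(payload) == "Python is a programming language."


def test_parse_top_snippet_tolerates_malformed_payloads():
    # Each of these would raise KeyError, IndexError, or TypeError in a
    # direct payload["items"][0]["snippet"] lookup.
    malformed = ({}, {"items": None}, {"items": [{}]}, {"items": ["x"]}, None, [])
    for payload in malformed:
        assert parse_top_snippet(payload) == NO_RESULTS
```

With a helper like this, the tool body reduces to the HTTP call plus parse_top_snippet(response.json()), and the parsing rules can be tested without any mocking. Whether a malformed payload should return a message or raise is a design choice; returning a string keeps the agent running, while raising makes the failure visible sooner.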
Testing Individual Agno Agents
Agent tests verify that the agent calls the right tool for a given input. Mock the LLM to keep tests deterministic and free.
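The idea of mocking the LLM can be shown without Agno at all. The sketch below is framework-agnostic and uses only the standard library: ScriptedModel, FakeReply, and run_once are hypothetical names invented for illustration, and run_once is a toy stand-in for an agent loop, not how Agno dispatches tools internally.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FakeReply:
    """Hypothetical reply shape: final text, or a list of tool calls."""
    content: str = ""
    tool_calls: List[dict] = field(default_factory=list)


class ScriptedModel:
    """Returns pre-scripted replies in order and records every prompt."""

    def __init__(self, replies: List[FakeReply]):
        self._replies = list(replies)
        self.prompts: List[str] = []

    def invoke(self, prompt: str) -> FakeReply:
        self.prompts.append(prompt)
        if not self._replies:
            raise AssertionError("model invoked more times than scripted")
        return self._replies.pop(0)


def run_once(model: ScriptedModel, tools: Dict[str, Callable[..., str]], question: str) -> str:
    """Toy agent loop: one model turn, optional tool calls, one final turn."""
    reply = model.invoke(question)
    if not reply.tool_calls:
        return reply.content
    results = [tools[call["name"]](**call["arguments"]) for call in reply.tool_calls]
    return model.invoke("\n".join(results)).content


def test_tool_is_called_with_scripted_arguments():
    calls = []

    def search_web(query: str) -> str:
        calls.append(query)
        return "Python was created in 1991."

    model = ScriptedModel([
        FakeReply(tool_calls=[{"name": "search_web", "arguments": {"query": "Python history"}}]),
        FakeReply(content="Python was created in 1991."),
    ])
    answer = run_once(model, {"search_web": search_web}, "When was Python created?")

    assert calls == ["Python history"]          # the tool ran exactly once
    assert "1991" in answer                     # the final turn was returned
    assert model.prompts[-1] == "Python was created in 1991."  # tool output fed back
```

Scripting two replies (a tool call, then a final answer) matters: a fake that always returns the same tool call would keep a real agent loop requesting the tool forever. The same two-step script applies when you patch a real model object; only the reply shape changes.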
# tests/test_agent.py
from unittest.mock import patch, MagicMock
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from tools.search_tool import search_web


def create_research_agent():
    return Agent(
        name="Research Agent",
        model=OpenAIChat(id="gpt-4o"),
        tools=[search_web],
        instructions="Search the web to answer factual questions. Always use the search_web tool.",
    )


# Note: the reply shape that agent.model.invoke must return is an assumption
# here and depends on the Agno version and model provider. Adjust the
# MagicMock objects below to match what your version actually parses.


def test_agent_calls_search_tool_for_factual_question():
    agent = create_research_agent()

    # The agent already holds a reference to search_web, so patching
    # tools.search_tool.search_web would not affect it. Patch the HTTP
    # boundary the tool uses instead and assert on the outgoing request.
    mock_http = MagicMock()
    mock_http.json.return_value = {"items": [{"snippet": "Python was created in 1991."}]}

    with patch.object(agent.model, "invoke") as mock_invoke, \
            patch("httpx.get", return_value=mock_http) as mock_get:
        mock_invoke.side_effect = [
            # First turn: simulate the LLM deciding to call search_web
            MagicMock(
                content="",
                tool_calls=[{
                    "name": "search_web",
                    "arguments": {"query": "Python programming language history"},
                }],
            ),
            # Second turn: the LLM answers using the tool result
            MagicMock(content="Python was created in 1991.", tool_calls=[]),
        ]
        agent.run("When was Python created?")

    mock_get.assert_called_once()
    assert "python" in mock_get.call_args.kwargs["params"]["q"].lower()


def test_agent_response_is_non_empty():
    agent = create_research_agent()
    with patch.object(agent.model, "invoke") as mock_invoke:
        mock_invoke.return_value = MagicMock(
            content="Python was created by Guido van Rossum in 1991.",
            tool_calls=[],
        )
        response = agent.run("When was Python created?")
    assert response.content
    assert len(response.content) > 10

Testing Agno Memory
If your agents use memory — AgentMemory, SqliteStorage, or a vector-based memory store — test that context persists correctly.
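Whatever storage class you use, the property worth testing is a round trip: data written through one handle is readable through another, and one session never sees another session's data. A minimal sketch of that pattern using only the standard library's sqlite3. The facts table and the save_fact and load_facts helpers are hypothetical stand-ins for illustration, not Agno's storage schema or API.

```python
import os
import sqlite3
import tempfile
from contextlib import closing
from typing import List

_SCHEMA = "CREATE TABLE IF NOT EXISTS facts (session_id TEXT NOT NULL, fact TEXT NOT NULL)"


def save_fact(db_path: str, session_id: str, fact: str) -> None:
    """Open a fresh connection, store one fact, commit, and close."""
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:  # commits on success, rolls back on error
            conn.execute(_SCHEMA)
            conn.execute(
                "INSERT INTO facts (session_id, fact) VALUES (?, ?)",
                (session_id, fact),
            )


def load_facts(db_path: str, session_id: str) -> List[str]:
    """Open a separate connection and read back the facts for one session."""
    with closing(sqlite3.connect(db_path)) as conn:
        conn.execute(_SCHEMA)
        rows = conn.execute(
            "SELECT fact FROM facts WHERE session_id = ? ORDER BY rowid",
            (session_id,),
        ).fetchall()
    return [row[0] for row in rows]


def test_facts_persist_across_connections_and_sessions_stay_isolated():
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "memory.db")
        save_fact(db_path, "session-a", "User's name is Ada.")
        save_fact(db_path, "session-b", "User prefers Rust.")

        # Each read uses a new connection, so this only passes if the
        # writes were actually committed to disk.
        assert load_facts(db_path, "session-a") == ["User's name is Ada."]
        assert load_facts(db_path, "session-b") == ["User prefers Rust."]
        assert load_facts(db_path, "session-c") == []
```

The Agno-specific tests can follow the same structure once you substitute the write and read methods your installed version provides.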
# tests/test_memory.py
import os
import tempfile

from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.memory.agent import AgentMemory
from agno.storage.sqlite import SqliteStorage


def test_agent_recalls_user_information():
    # Use a temp database so tests don't pollute each other
    with tempfile.NamedTemporaryFile(suffix=".db", delete=False) as f:
        db_path = f.name
    try:
        storage = SqliteStorage(table_name="test_sessions", db_file=db_path)
        # Assumption: session storage is passed via storage=, and
        # AgentMemory's own db argument expects a memory database rather
        # than a SqliteStorage. Check this against your Agno version.
        agent = Agent(
            name="Memory Agent",
            model=OpenAIChat(id="gpt-4o"),
            session_id="test-session-recall",
            storage=storage,
            memory=AgentMemory(),
            instructions="Remember what the user tells you about themselves.",
        )
        # Verifying actual recall needs a real LLM; in CI, use mocks.
        # The important thing is that storage initializes without error
        assert storage is not None
        # And the agent can be created with memory
        assert agent.memory is not None
    finally:
        os.unlink(db_path)


def test_memory_storage_persists_across_agent_instances():
    with tempfile.NamedTemporaryFile(suffix=".db", delete=False) as f:
        db_path = f.name
    try:
        storage1 = SqliteStorage(table_name="sessions", db_file=db_path)
        storage2 = SqliteStorage(table_name="sessions", db_file=db_path)
        # Write via storage1, read via storage2: same underlying DB
        # (actual write/read methods depend on Agno version)
        assert storage1 is not None
        assert storage2 is not None
    finally:
        os.unlink(db_path)

Testing Agno Multi-Agent Teams
Teams are where the most subtle failures occur. An agent returns a result in a format the next agent doesn't understand, and the team produces garbage silently.
# tests/test_team.py
from unittest.mock import patch, MagicMock
from agno.agent import Agent
from agno.team import Team
from agno.models.openai import OpenAIChat


def create_research_team():
    researcher = Agent(
        name="Researcher",
        model=OpenAIChat(id="gpt-4o"),
        instructions="Find factual information on the given topic.",
        role="Research information about the given topic",
    )
    analyst = Agent(
        name="Analyst",
        model=OpenAIChat(id="gpt-4o"),
        instructions="Summarize and analyze the research provided to you.",
        role="Analyze and summarize research findings",
    )
    # Agno's Team takes its agents through the members argument;
    # confirm the parameter name against your installed version.
    return Team(
        name="Research Team",
        members=[researcher, analyst],
        model=OpenAIChat(id="gpt-4o"),
        instructions="Coordinate research and analysis tasks.",
    )


def test_team_runs_without_error():
    team = create_research_team()
    with patch.object(team.model, "invoke") as mock_team_invoke:
        mock_team_invoke.return_value = MagicMock(
            content="Based on research: Python was created in 1991 by Guido van Rossum.",
            tool_calls=[],
        )
        response = team.run("Tell me about Python's history.")
    assert response is not None


def test_team_has_correct_agent_count():
    team = create_research_team()
    assert len(team.members) == 2
    assert team.members[0].name == "Researcher"
    assert team.members[1].name == "Analyst"


def test_team_agents_have_correct_instructions():
    team = create_research_team()
    researcher = team.members[0]
    analyst = team.members[1]
    assert "factual information" in researcher.instructions.lower()
    assert "summarize" in analyst.instructions.lower()

Structural tests, which verify agent count, names, and instruction content, catch configuration drift. When someone edits the researcher agent's prompt and it breaks the analyst's expectations, you want to know immediately.
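Structural tests do not cover the failure from the introduction, where one agent's output arrives in a shape the next agent cannot parse. One way to make that failure loud is to validate the handoff explicitly. The sketch below assumes the researcher is instructed to return a JSON object with topic and findings fields; that format, and the names ResearchResult, HandoffError, and parse_research_output, are assumptions made for illustration and are not part of Agno.

```python
import json
from dataclasses import dataclass
from typing import List


class HandoffError(ValueError):
    """Raised when one agent's output does not match what the next expects."""


@dataclass
class ResearchResult:
    topic: str
    findings: List[str]


def parse_research_output(raw: str) -> ResearchResult:
    """Validate the researcher's output before the analyst sees it."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError) as exc:
        raise HandoffError(f"researcher output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise HandoffError("researcher output must be a JSON object")

    topic = data.get("topic")
    findings = data.get("findings")
    if not isinstance(topic, str) or not topic.strip():
        raise HandoffError("missing or empty 'topic'")
    if (
        not isinstance(findings, list)
        or not findings
        or not all(isinstance(item, str) for item in findings)
    ):
        raise HandoffError("'findings' must be a non-empty list of strings")
    return ResearchResult(topic=topic, findings=findings)


def test_valid_handoff_parses():
    raw = '{"topic": "Python", "findings": ["Created in 1991."]}'
    result = parse_research_output(raw)
    assert result.topic == "Python"
    assert result.findings == ["Created in 1991."]


def test_malformed_handoff_fails_loudly():
    malformed = [
        "",                                     # empty output
        "Python was created in 1991.",          # prose instead of JSON
        "[]",                                   # wrong top-level type
        '{"topic": "Python"}',                  # missing findings
        '{"topic": "Python", "findings": []}',  # empty findings
    ]
    for raw in malformed:
        try:
            parse_research_output(raw)
        except HandoffError:
            continue
        raise AssertionError(f"accepted malformed handoff: {raw!r}")
```

Calling a validator like this between agents turns a silent empty result into an exception you can alert on, and the malformed cases double as a regression suite whenever the researcher's prompt changes.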
End-to-End Tests with Real LLMs
For staging environments, run a small set of end-to-end tests with real LLM calls to catch prompt regressions:
# tests/test_e2e.py
import os
import time

import pytest

pytestmark = pytest.mark.skipif(
    not os.getenv("OPENAI_API_KEY"),
    reason="Requires OPENAI_API_KEY for E2E tests",
)


def test_research_team_produces_coherent_output():
    from teams.research_team import create_research_team
    team = create_research_team()
    response = team.run("What is the capital of France?")
    assert response.content
    assert "paris" in response.content.lower()


def test_research_team_responds_within_timeout():
    from teams.research_team import create_research_team
    team = create_research_team()
    start = time.time()
    response = team.run("Summarize the key benefits of automated testing.")
    elapsed = time.time() - start
    assert response.content
    assert elapsed < 60, f"Team took {elapsed:.1f}s — too slow for production"

Mark E2E tests to run only in CI/CD on deploy, not on every commit. They call real LLMs, cost money, and are slow, but they catch the class of bugs that mocks can't find.
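Skipping on a missing API key alone is not enough to keep these tests off every commit: if the CI environment always has the key set, they will always run. A small sketch of a stricter gate, assuming an opt-in environment variable. The variable name RUN_E2E and the helper e2e_enabled are hypothetical; pick whatever name your pipeline already uses.

```python
import os
from typing import Mapping


def e2e_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Return True only when E2E runs are requested AND an API key is set.

    RUN_E2E is a hypothetical opt-in flag that the deploy job would set.
    """
    requested = env.get("RUN_E2E", "").strip().lower() in {"1", "true", "yes"}
    return requested and bool(env.get("OPENAI_API_KEY"))
```

In the test module, the gate would replace the key-only check, for example pytestmark = pytest.mark.skipif(not e2e_enabled(), reason="E2E tests run on deploy only"). Taking the environment as a parameter keeps the gate itself testable without touching real environment variables.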
How HelpMeTest Helps
Your Agno application has a user interface. Browser-level tests verify what users actually experience — not just what your Python tests confirm.
Write plain-English tests against the actual UI:
Go to https://your-agno-app.com
Click the input field
Type "Research the history of Python programming"
Wait for the research team to respond
Verify the response mentions "Guido van Rossum" or "1991"
HelpMeTest runs these on a schedule. If your multi-agent team breaks in production (an LLM provider outage, a timeout that makes the UI spinner never stop, a response format change), you're alerted within minutes.
Use helpmetest health agno-team 5m to ping your team's API endpoint every 5 minutes and verify it responds. That's faster than waiting for a user ticket.
The free tier covers 10 tests — enough to protect your critical agent flows from the start.
What to Ship
Minimum test coverage for a production Agno application:
- Tool tests — every tool function has a unit test for happy path and error path
- Agent configuration tests — agents have the correct tools, instructions, and model
- Team structure tests — correct agent count, roles, and coordination instructions
- Memory tests — storage initializes correctly and persists across sessions
- E2E tests — at least one real-LLM test per major team capability, run on deploy
- Browser smoke tests — latency and response quality checks running in production
Agno teams are fast to build. Make sure they're as fast to verify.
Start with HelpMeTest's free tier — 10 tests, no credit card. Add browser-level monitoring to your Agno application in under ten minutes at helpmetest.com.