How to Test Microsoft AutoGen Multi-Agent Systems
Your AutoGen pipeline worked fine in dev. The Planner agent handed off to the Executor, the Executor called the right tool, the Critic reviewed the output, and everything terminated cleanly. Then you deployed it. Now the Executor occasionally ignores the Planner's instructions, the conversation sometimes loops 40 times before terminating, and the Critic intermittently approves outputs it should reject. Welcome to multi-agent systems in production.
AutoGen makes it easy to build complex agent pipelines. It does not make it easy to test them. This post breaks down a practical testing strategy across four layers — from individual agent behavior to the end-user applications that surface agent output.
Why AutoGen Systems Are Hard to Test
Single-agent systems are hard enough. Multi-agent systems multiply the failure surface:
Non-determinism. LLM responses vary. Two runs of the same conversation can produce structurally different message sequences. A test that passes once may fail on the next run — not because your code changed, but because the model responded differently.
Agent handoffs. When a GroupChat hands off between agents, you're trusting the model to select the right next speaker. If the selection logic is implicit (e.g., "the manager decides"), a subtle prompt change can break routing without any obvious error.
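One way to take routing out of the model's hands is to make the handoff rule an ordinary function you can unit test. The sketch below is an illustration, not AutoGen's API: the `ROUTES` table, the `HANDOFF_TO_CRITIC` token, and the `next_speaker_name` helper are hypothetical names. If the AutoGen version you run accepts a callable for `speaker_selection_method` (worth verifying against its documentation), a function like this could sit behind it.

```python
from typing import Optional

# Hypothetical handoff tokens mapped to the agent that should speak next.
ROUTES = {
    "HANDOFF_TO_EXECUTOR": "Executor",
    "HANDOFF_TO_CRITIC": "Critic",
}
# Assumed fallback when no handoff token is present.
DEFAULT_SPEAKER = "Planner"


def next_speaker_name(last_content: Optional[str]) -> str:
    """Pick the next speaker from an explicit token in the last message."""
    text = last_content or ""  # content may be None on function-call messages
    for token, agent_name in ROUTES.items():
        if token in text:
            return agent_name
    return DEFAULT_SPEAKER
```

Because the rule is plain code, a prompt change that drops the token shows up as a failing assertion instead of a silent misroute.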
Termination conditions. AutoGen conversations need to know when to stop. A poorly defined is_termination_msg function — or one that works in testing but fails on edge-case phrasings — can cause infinite loops or premature exits in production.
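A cheap way to catch edge-case phrasings is to run the termination predicate against a table of messages you have actually seen. The version below is a stricter variant than a plain substring check, under the assumption that the sentinel must close the message; the `SENTINEL` value and the end-of-message rule are illustrative choices, not something AutoGen prescribes.

```python
from typing import Any, Mapping

SENTINEL = "TASK_COMPLETE"  # assumed termination token


def is_termination_msg(message: Mapping[str, Any]) -> bool:
    """True only when the sentinel is the last token of a string message."""
    content = message.get("content")
    if not isinstance(content, str):
        return False  # missing or None content never terminates
    # Ignore trailing whitespace and closing punctuation after the sentinel.
    return content.rstrip().rstrip(".!").endswith(SENTINEL)


# Each case pairs a message with the expected decision.
CASES = [
    ({"content": "TASK_COMPLETE"}, True),
    ({"content": "All done. TASK_COMPLETE.\n"}, True),
    ({"content": "I will reply TASK_COMPLETE once the report is ready"}, False),
    ({"content": "task_complete"}, False),  # case-sensitive on purpose
    ({"content": None}, False),
    ({}, False),
]

for message, expected in CASES:
    assert is_termination_msg(message) is expected, message
```

The third case is the one a substring check gets wrong: an agent that merely mentions the token would end the conversation early.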
Tool call reliability. Agents that call functions can hallucinate function names, pass wrong argument types, or silently fail when tools return errors. These failures often don't raise exceptions — they just produce wrong output.
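Since these failures rarely raise, it helps to check a proposed function call before executing it. Here is a minimal sketch, assuming the call arrives as a dict with a `name` and a JSON string of `arguments`; the `TOOL_SCHEMAS` and `REQUIRED_ARGS` tables and the `validate_function_call` helper are hypothetical and not part of AutoGen.

```python
import json
from typing import Any, Dict, List

# Hypothetical registry: tool name -> {argument name: expected Python type}.
TOOL_SCHEMAS: Dict[str, Dict[str, type]] = {
    "search_database": {"query": str, "limit": int},
}
REQUIRED_ARGS: Dict[str, List[str]] = {"search_database": ["query"]}


def validate_function_call(call: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the call looks valid."""
    name = call.get("name")
    if name not in TOOL_SCHEMAS:
        return [f"unknown function: {name!r}"]
    try:
        args = json.loads(call.get("arguments") or "{}")
    except json.JSONDecodeError as exc:
        return [f"arguments are not valid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]

    problems: List[str] = []
    schema = TOOL_SCHEMAS[name]
    for required in REQUIRED_ARGS.get(name, []):
        if required not in args:
            problems.append(f"missing required argument: {required}")
    for key, value in args.items():
        expected = schema.get(key)
        if expected is None:
            problems.append(f"unexpected argument: {key}")
        elif type(value) is not expected:  # exact match, so True is not an int
            problems.append(
                f"{key} should be {expected.__name__}, got {type(value).__name__}"
            )
    return problems
```

Returning a list of problems rather than raising makes it easy to assert on specific failure modes, such as a hallucinated function name or a limit passed as a string.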
Layer 1: Testing Individual Agents in Isolation
Before testing how agents talk to each other, verify each agent behaves correctly on its own. Mock out the other agents and test the logic of a single agent's responses.
import pytest
from unittest.mock import MagicMock, patch
from autogen import AssistantAgent
def test_planner_agent_produces_structured_plan():
    """Verify the Planner agent returns a plan with required sections."""
    planner = AssistantAgent(
        name="Planner",
        system_message="""You are a task planner. When given a goal, respond with:
PLAN:
1. [step one]
2. [step two]
DONE""",
        llm_config={"config_list": [{"model": "gpt-4", "api_key": "..."}]},
    )
    # Inject a canned reply to test parsing logic, not the LLM.
    # Note: generate_reply itself is patched, so the assertions below exercise
    # only the format checks, not the agent's own behavior.
    with patch.object(planner, "generate_reply") as mock_reply:
        mock_reply.return_value = "PLAN:\n1. Research topic\n2. Write summary\nDONE"
        reply = planner.generate_reply(
            messages=[{"role": "user", "content": "Summarize quantum computing"}],
            sender=MagicMock(),
        )
    assert "PLAN:" in reply
    assert "DONE" in reply
    lines = reply.split("\n")
    plan_lines = [line for line in lines if line.strip().startswith(("1.", "2.", "3."))]
    assert len(plan_lines) >= 1

This pattern (mock the LLM, test the parsing and routing logic) lets you run thousands of unit tests cheaply without hitting API rate limits or paying for inference.
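The parsing logic is easiest to test when it lives in its own function that takes a reply string and returns structured data. The sketch below assumes the PLAN/DONE reply format from the Planner's system message; `parse_plan` and `STEP_PATTERN` are hypothetical names introduced for illustration.

```python
import re
from typing import List

# Matches lines such as "1. Research topic" and captures the step text.
STEP_PATTERN = re.compile(r"^\s*(\d+)\.\s+(.*\S)\s*$")


def parse_plan(reply: str) -> List[str]:
    """Extract the ordered step texts from a reply in PLAN: ... DONE format."""
    lines = reply.splitlines()
    stripped = [line.strip() for line in lines]
    if "PLAN:" not in stripped or "DONE" not in stripped:
        raise ValueError("reply is missing the PLAN: or DONE marker")
    start = stripped.index("PLAN:")
    end = stripped.index("DONE")
    if end < start:
        raise ValueError("DONE appears before PLAN:")

    steps = []
    for line in lines[start + 1:end]:
        match = STEP_PATTERN.match(line)
        if match:
            steps.append(match.group(2))
    if not steps:
        raise ValueError("plan contains no numbered steps")
    return steps
```

With a function like this, canned replies (including malformed ones) can be fed in directly, and no agent or mock is needed for the parsing tests.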
For agents with custom is_termination_msg logic, test that function directly:
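def is_done(message):
    # content can be present but None (e.g. on function-call messages),
    # so coerce it to an empty string before the substring check
    return "TASK_COMPLETE" in (message.get("content") or "")


def test_termination_detects_exact_token():
    assert is_done({"content": "TASK_COMPLETE"}) is True


def test_termination_ignores_partial_match():
    assert is_done({"content": "task_complete"}) is False  # case-sensitive


def test_termination_handles_missing_content():
    assert is_done({}) is False


def test_termination_handles_none_content():
    assert is_done({"content": None}) is False

Layer 2: Testing Agent Conversations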
Once individual agents are verified, test how they interact. The goal here is to validate message flow, role adherence, and conversation termination — not to test LLM intelligence.
import pytest
from autogen import AssistantAgent, UserProxyAgent, GroupChat, GroupChatManager
def make_mock_agent(name, reply):
    """Create an AssistantAgent that always returns a fixed reply."""
    agent = AssistantAgent(
        name=name,
        system_message=f"You are {name}.",
        llm_config=False,  # disable real LLM calls
    )
    agent.register_reply(
        trigger=lambda sender: True,
        reply_func=lambda *args, **kwargs: (True, reply),
        position=0,
    )
    return agent


def test_groupchat_terminates_after_completion():
    """Verify the conversation ends when the termination condition is met."""
    planner = make_mock_agent("Planner", "Step 1: do the thing. HANDOFF_TO_EXECUTOR")
    executor = make_mock_agent("Executor", "Done. TASK_COMPLETE")
    user_proxy = UserProxyAgent(
        name="User",
        human_input_mode="NEVER",
        is_termination_msg=lambda m: "TASK_COMPLETE" in (m.get("content") or ""),
        max_consecutive_auto_reply=10,
        code_execution_config=False,
    )
    groupchat = GroupChat(
        agents=[user_proxy, planner, executor],
        messages=[],
        max_round=10,
        # The manager has no LLM, so it cannot pick speakers itself;
        # use a deterministic order instead.
        speaker_selection_method="round_robin",
    )
    manager = GroupChatManager(groupchat=groupchat, llm_config=False)
    user_proxy.initiate_chat(manager, message="Run the pipeline")
    messages = groupchat.messages
    contents = [m["content"] for m in messages]
    # Conversation must have ended early rather than running all 10 rounds
    assert len(messages) < 10
    # TASK_COMPLETE must appear somewhere
    assert any("TASK_COMPLETE" in c for c in contents)

For role adherence (making sure the Critic doesn't do what the Executor is supposed to do), collect all messages and check which agent sent what:
def test_executor_is_the_only_agent_that_writes_code():
    # ... set up agents with mocked replies ...
    executor_messages = [
        m for m in groupchat.messages
        if m.get("name") == "Executor"
    ]
    non_executor_messages = [
        m for m in groupchat.messages
        if m.get("name") != "Executor" and m.get("name") != "User"
    ]
    # Only executor messages should contain code blocks
    for msg in non_executor_messages:
        assert "```python" not in (msg.get("content") or ""), \
            f"Agent {msg.get('name')} produced code; only Executor should"

Layer 3: Testing Tool and Function Calls
AutoGen agents can call registered functions. These function calls are where a lot of production bugs hide — wrong arguments, missing error handling, silent failures.
import json
from unittest.mock import patch
from autogen import AssistantAgent, UserProxyAgent
def search_database(query: str, limit: int = 10) -> str:
    """Tool function registered with an AutoGen agent."""
    # Real implementation would hit a DB
    return json.dumps({"results": [], "query": query, "limit": limit})


def test_search_tool_called_with_correct_arguments():
    """Verify the agent passes valid arguments to the search tool."""
    call_log = []

    def tracked_search(query: str, limit: int = 10) -> str:
        call_log.append({"query": query, "limit": limit})
        return json.dumps({"results": [{"id": 1, "title": "Result"}]})

    assistant = AssistantAgent(
        name="Researcher",
        system_message="Use the search_database tool to answer questions.",
        llm_config={
            "config_list": [{"model": "gpt-4", "api_key": "..."}],
            "functions": [{
                "name": "search_database",
                "description": "Search the database",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {"type": "string"},
                        "limit": {"type": "integer"},
                    },
                    "required": ["query"],
                },
            }],
        },
    )
    user_proxy = UserProxyAgent(
        name="User",
        human_input_mode="NEVER",
        function_map={"search_database": tracked_search},
        max_consecutive_auto_reply=3,
        code_execution_config=False,
    )
    # Inject a canned function call response to avoid real LLM
    with patch.object(assistant, "generate_reply") as mock:
        mock.return_value = {
            "function_call": {"name": "search_database", "arguments": '{"query": "AI testing tools", "limit": 5}'}
        }
        user_proxy.initiate_chat(assistant, message="Find articles about AI testing")
    assert len(call_log) > 0
    assert call_log[0]["query"] == "AI testing tools"
    assert isinstance(call_log[0]["limit"], int)

Test error handling explicitly. What happens when your tool throws?
def test_agent_handles_tool_failure_gracefully():
    def failing_search(query: str, limit: int = 10) -> str:
        raise ConnectionError("Database unreachable")

    # Set up agent with failing tool
    # ... (same setup as above, but with failing_search in function_map)
    # The conversation should not crash; it should produce an error message
    # and either retry or report the failure to the user
    # This is a two-agent chat, so read the transcript from the proxy
    # rather than from a groupchat object
    messages = user_proxy.chat_messages[assistant]
    error_acknowledged = any(
        "error" in (m.get("content") or "").lower()
        or "failed" in (m.get("content") or "").lower()
        for m in messages
    )
    assert error_acknowledged, "Agent silently swallowed a tool failure"

Layer 4: End-to-End Testing the Applications AutoGen Powers
Your AutoGen pipeline is probably not the product itself — it powers something: a web app, an API, a Slack bot. End-to-end testing means testing that thing, from the user's perspective.
This is where unit and integration tests run out. You need to verify that when a user submits a request in your UI, the right agent pipeline runs, and the output surfaces correctly.
For web applications powered by AutoGen agents, HelpMeTest runs Robot Framework + Playwright tests that interact with your actual frontend:
*** Test Cases ***
Research Pipeline Produces Output Within Timeout
    [Documentation]    Verify AutoGen pipeline returns results to UI within 30s
    Navigate To    ${BASE_URL}/research
    Fill In    id=query-input    text=Summarize quantum computing advances in 2024
    Click Button    id=submit-research
    Wait Until Element Contains    id=results-panel    text=Summary    timeout=30s
    Element Should Contain    id=results-panel    text=Quantum
    Element Should Not Contain    id=results-panel    text=Error

Agent Pipeline Shows Intermediate Steps
    [Documentation]    Multi-agent pipeline should surface step visibility to users
    Navigate To    ${BASE_URL}/research
    Fill In    id=query-input    text=Market analysis for electric vehicles
    Click Button    id=submit-research
    Wait Until Element Is Visible    css=.agent-step-indicator    timeout=5s
    # Steps should appear as agents hand off
    Wait Until Element Contains    id=pipeline-status    text=Planning    timeout=10s
    Wait Until Element Contains    id=pipeline-status    text=Executing    timeout=20s
    Wait Until Element Contains    id=pipeline-status    text=Complete    timeout=30s

For health monitoring of your AutoGen-powered service, HelpMeTest's CLI lets you set up continuous checks:
helpmetest health autogen-research-api 60s

This runs your test suite on a 60-second cadence and alerts when the pipeline starts failing, before users report it.
How HelpMeTest Fits Into an AutoGen Testing Workflow
Layers 1-3 — unit tests, conversation tests, tool call tests — live in your repo and run in CI. They're fast, cheap, and should run on every commit.
Layer 4 is where HelpMeTest operates. Once your AutoGen pipeline is deployed, the question shifts from "does the code work?" to "does the product work, right now, for real users?"
HelpMeTest monitors the UI your AutoGen agents power:
- Test the inputs users actually submit — not just the happy path you thought of
- Verify agent output surfaces correctly — formatted, complete, not truncated
- Catch regressions when you update models or prompts — a new model version can break UI assumptions
- Health checks with helpmetest health: continuous monitoring with configurable grace periods
You can create tests in plain English without writing Playwright selectors by hand, or use the MCP server integration (helpmetest mcp) to write tests directly from Claude Code or Cursor.
Start With the Failure Points That Matter
Multi-agent systems have more places to break than single-agent ones. The testing strategy has to match that complexity:
- Unit test each agent's logic with mocked replies — fast, cheap, runs in CI
- Integration test conversations with deterministic mock agents — validates routing and termination
- Test tool calls explicitly — verify arguments, types, and error handling
- End-to-end test the application layer — what users actually see
The LLM is the one component you cannot fully control. Everything around it — the routing logic, the termination conditions, the tool integrations, the UI — is deterministic code that you can and should test like deterministic code.
HelpMeTest's free tier includes 10 tests — enough to cover your critical AutoGen-powered user flows. Start testing at helpmetest.com.