
How to Test Microsoft AutoGen Multi-Agent Systems

HelpMeTest

13 May 2026 — 6 min read

Your AutoGen pipeline worked fine in dev. The Planner agent handed off to the Executor, the Executor called the right tool, the Critic reviewed the output, and everything terminated cleanly. Then you deployed it. Now the Executor occasionally ignores the Planner's instructions, the conversation sometimes loops 40 times before terminating, and the Critic intermittently approves outputs it should reject. Welcome to multi-agent systems in production.

AutoGen makes it easy to build complex agent pipelines. It does not make it easy to test them. This post breaks down a practical testing strategy across four layers — from individual agent behavior to the end-user applications that surface agent output.

Why AutoGen Systems Are Hard to Test

Single-agent systems are hard enough. Multi-agent systems multiply the failure surface:

Non-determinism. LLM responses vary. Two runs of the same conversation can produce structurally different message sequences. A test that passes once may fail on the next run — not because your code changed, but because the model responded differently.

Agent handoffs. When a GroupChat hands off between agents, you're trusting the model to select the right next speaker. If the selection logic is implicit (e.g., "the manager decides"), a subtle prompt change can break routing without any obvious error.

Termination conditions. AutoGen conversations need to know when to stop. A poorly defined is_termination_msg function, or one that works in testing but fails on edge-case phrasings, can cause premature exits or runaway loops that stop only when a limit such as max_round or max_consecutive_auto_reply is hit.

Tool call reliability. Agents that call functions can hallucinate function names, pass wrong argument types, or silently fail when tools return errors. These failures often don't raise exceptions — they just produce wrong output.
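One way to make these tool call failures loud is to validate each proposed call before executing it. The following is a minimal sketch using only the standard library; the TOOL_SCHEMAS registry, its layout, and the validate_tool_call helper are assumptions made for illustration, not part of AutoGen.

```python
import json

# Hypothetical registry: tool name -> {argument name: (expected type, required)}
TOOL_SCHEMAS = {
    "search_database": {"query": (str, True), "limit": (int, False)},
}

def validate_tool_call(name, raw_arguments, schemas=TOOL_SCHEMAS):
    """Return a list of problems with a proposed tool call (empty list means valid)."""
    if name not in schemas:
        return [f"unknown tool: {name}"]
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return [f"arguments are not valid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]

    problems = []
    schema = schemas[name]
    for arg, (expected_type, required) in schema.items():
        if arg not in args:
            if required:
                problems.append(f"missing required argument: {arg}")
            continue
        value = args[arg]
        # bool is a subclass of int in Python, so reject it explicitly for int fields
        wrong_type = not isinstance(value, expected_type) or (
            expected_type is int and isinstance(value, bool)
        )
        if wrong_type:
            problems.append(
                f"{arg} should be {expected_type.__name__}, got {type(value).__name__}"
            )
    for arg in args:
        if arg not in schema:
            problems.append(f"unexpected argument: {arg}")
    return problems
```

A test can then assert that a hallucinated tool name or a string passed where an integer is expected yields a non-empty problem list, instead of waiting for wrong output to show up downstream.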

Layer 1: Testing Individual Agents in Isolation

Before testing how agents talk to each other, verify each agent behaves correctly on its own. Mock out the other agents and test the logic of a single agent's responses.

import pytest
from unittest.mock import MagicMock, patch
from autogen import AssistantAgent

def test_planner_agent_produces_structured_plan():
    """Verify the Planner agent returns a plan with required sections."""
    planner = AssistantAgent(
        name="Planner",
        system_message="""You are a task planner. When given a goal, respond with:
        PLAN:
        1. [step one]
        2. [step two]
        DONE""",
        llm_config={"config_list": [{"model": "gpt-4", "api_key": "..."}]},
    )

    # Inject a canned reply through the agent's reply hook, so the real
    # generate_reply dispatch runs but no LLM call is made
    canned = "PLAN:\n1. Research topic\n2. Write summary\nDONE"
    planner.register_reply(
        trigger=lambda sender: True,
        reply_func=lambda *args, **kwargs: (True, canned),
        position=0,
    )
    reply = planner.generate_reply(
        messages=[{"role": "user", "content": "Summarize quantum computing"}],
        sender=MagicMock(),
    )

    assert "PLAN:" in reply
    assert "DONE" in reply
    lines = reply.split("\n")
    plan_lines = [l for l in lines if l.strip().startswith(("1.", "2.", "3."))]
    assert len(plan_lines) >= 1

This pattern — mock the LLM, test the parsing and routing logic — lets you run thousands of unit tests cheaply without hitting API rate limits or paying for inference.

For agents with custom is_termination_msg logic, test that function directly:

def is_done(message):
    return "TASK_COMPLETE" in message.get("content", "")

def test_termination_detects_exact_token():
    assert is_done({"content": "TASK_COMPLETE"}) is True

def test_termination_is_case_sensitive():
    assert is_done({"content": "task_complete"}) is False  # lowercase must not match

def test_termination_handles_missing_content():
    assert is_done({}) is False
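The substring check above has two edge cases worth covering: a message whose content is present but not a string (some messages, for example ones carrying only a function call, may have content set to None), and longer tokens that merely contain the marker. The following is a sketch of a stricter predicate; the name is_done_strict and the word-boundary rule are assumptions, not AutoGen behavior.

```python
import re

# Assumption: agents are prompted to end with the literal token TASK_COMPLETE
_DONE_TOKEN = re.compile(r"\bTASK_COMPLETE\b")

def is_done_strict(message):
    """Termination check that tolerates missing, None, or non-string content."""
    content = message.get("content")
    if not isinstance(content, str):
        return False
    # Word boundaries keep TASK_COMPLETED or NOT_TASK_COMPLETE from matching
    return _DONE_TOKEN.search(content) is not None
```

It accepts the same message dictionaries as is_done, so it can be passed as is_termination_msg and exercised with the same style of direct unit tests.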

Layer 2: Testing Agent Conversations

Once individual agents are verified, test how they interact. The goal here is to validate message flow, role adherence, and conversation termination — not to test LLM intelligence.

import pytest
from autogen import AssistantAgent, UserProxyAgent, GroupChat, GroupChatManager

def make_mock_agent(name, reply):
    """Create an AssistantAgent that always returns a fixed reply."""
    agent = AssistantAgent(
        name=name,
        system_message=f"You are {name}.",
        llm_config=False,  # disable real LLM calls
    )
    agent.register_reply(
        trigger=lambda sender: True,
        reply_func=lambda *args, **kwargs: (True, reply),
        position=0,
    )
    return agent

def test_groupchat_terminates_after_completion():
    """Verify the conversation ends when the termination condition is met."""
    planner = make_mock_agent("Planner", "Step 1: do the thing. HANDOFF_TO_EXECUTOR")
    executor = make_mock_agent("Executor", "Done. TASK_COMPLETE")

    user_proxy = UserProxyAgent(
        name="User",
        human_input_mode="NEVER",
        is_termination_msg=lambda m: "TASK_COMPLETE" in m.get("content", ""),
        max_consecutive_auto_reply=10,
        code_execution_config=False,
    )

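    groupchat = GroupChat(
        agents=[user_proxy, planner, executor],
        messages=[],
        max_round=10,
        # The default "auto" asks an LLM to pick the next speaker;
        # round_robin keeps the order deterministic for this test
        speaker_selection_method="round_robin",
    )
    manager = GroupChatManager(groupchat=groupchat, llm_config=False)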

    user_proxy.initiate_chat(manager, message="Run the pipeline")

    messages = groupchat.messages
    contents = [m["content"] for m in messages]

    # Conversation must have ended — not run all 10 rounds
    assert len(messages) < 10
    # TASK_COMPLETE must appear somewhere
    assert any("TASK_COMPLETE" in c for c in contents)
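Message flow can also be checked as an ordering constraint, for instance that the Planner always speaks before the Executor. The helpers below are a sketch that works on plain message dictionaries carrying a name key, as groupchat.messages is assumed to; speaker_sequence, appears_in_order, and the sample data are illustrative names, not AutoGen APIs.

```python
def speaker_sequence(messages):
    """Return the ordered list of speaker names, collapsing consecutive repeats."""
    sequence = []
    for message in messages:
        name = message.get("name")
        if name is not None and (not sequence or sequence[-1] != name):
            sequence.append(name)
    return sequence

def appears_in_order(sequence, expected):
    """True if `expected` occurs as a (not necessarily contiguous) subsequence."""
    remaining = iter(sequence)
    # `name in remaining` advances the iterator, so order is enforced
    return all(name in remaining for name in expected)

# Illustrative conversation history
sample_messages = [
    {"name": "User", "content": "Run the pipeline"},
    {"name": "Planner", "content": "Step 1: do the thing. HANDOFF_TO_EXECUTOR"},
    {"name": "Executor", "content": "Working on it"},
    {"name": "Executor", "content": "Done. TASK_COMPLETE"},
]
```

In a conversation test, asserting appears_in_order(speaker_sequence(groupchat.messages), ["Planner", "Executor"]) would catch a routing change that lets the Executor run before any plan exists.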

For role adherence — making sure the Critic doesn't do what the Executor is supposed to do — collect all messages and check which agent sent what:

def test_executor_is_the_only_agent_that_writes_code():
    # ... set up agents with mocked replies ...
    
    executor_messages = [
        m for m in groupchat.messages
        if m.get("name") == "Executor"
    ]
    non_executor_messages = [
        m for m in groupchat.messages
        if m.get("name") != "Executor" and m.get("name") != "User"
    ]

    # Only executor messages should contain code blocks
    for msg in non_executor_messages:
        assert "```python" not in msg.get("content", ""), \
            f"Agent {msg.get('name')} produced code — only Executor should"

Layer 3: Testing Tool and Function Calls

AutoGen agents can call registered functions. These function calls are where a lot of production bugs hide — wrong arguments, missing error handling, silent failures.

import json
from unittest.mock import patch  # used below to stub generate_reply
from autogen import AssistantAgent, UserProxyAgent

def search_database(query: str, limit: int = 10) -> str:
    """Tool function registered with an AutoGen agent."""
    # Real implementation would hit a DB
    return json.dumps({"results": [], "query": query, "limit": limit})

def test_search_tool_called_with_correct_arguments():
    """Verify the agent passes valid arguments to the search tool."""
    call_log = []

    def tracked_search(query: str, limit: int = 10) -> str:
        call_log.append({"query": query, "limit": limit})
        return json.dumps({"results": [{"id": 1, "title": "Result"}]})

    assistant = AssistantAgent(
        name="Researcher",
        system_message="Use the search_database tool to answer questions.",
        llm_config={
            "config_list": [{"model": "gpt-4", "api_key": "..."}],
            "functions": [{
                "name": "search_database",
                "description": "Search the database",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {"type": "string"},
                        "limit": {"type": "integer"},
                    },
                    "required": ["query"],
                },
            }],
        },
    )

    user_proxy = UserProxyAgent(
        name="User",
        human_input_mode="NEVER",
        function_map={"search_database": tracked_search},
        max_consecutive_auto_reply=3,
        code_execution_config=False,
    )

    # Inject a canned function call response to avoid real LLM
    with patch.object(assistant, "generate_reply") as mock:
        mock.return_value = {
            "function_call": {"name": "search_database", "arguments": '{"query": "AI testing tools", "limit": 5}'}
        }
        user_proxy.initiate_chat(assistant, message="Find articles about AI testing")

    assert len(call_log) > 0
    assert call_log[0]["query"] == "AI testing tools"
    assert isinstance(call_log[0]["limit"], int)

Test error handling explicitly. What happens when your tool throws?

def test_agent_handles_tool_failure_gracefully():
    def failing_search(query: str, limit: int = 10) -> str:
        raise ConnectionError("Database unreachable")

    # Set up agent with failing tool
    # ... (same setup as above, but with failing_search in function_map)

    # The conversation should not crash — it should produce an error message
    # and either retry or report the failure to the user
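    # Two-agent chat: read the history from the proxy, not from a GroupChat
    messages = user_proxy.chat_messages[assistant]
    error_acknowledged = any(
        # content can be None on function-call messages, so fall back to ""
        "error" in (m.get("content") or "").lower() or
        "failed" in (m.get("content") or "").lower()
        for m in messages
    )
    assert error_acknowledged, "Agent silently swallowed a tool failure"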
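One way to keep a tool failure visible, whatever the framework does with exceptions, is to wrap each tool so that errors come back as structured text the agent can read. This is a sketch under that assumption; report_errors is a hypothetical decorator, and the JSON error shape is an arbitrary choice.

```python
import functools
import json

def report_errors(tool):
    """Wrap a tool so exceptions come back as a structured error string."""
    @functools.wraps(tool)
    def wrapper(*args, **kwargs):
        try:
            return tool(*args, **kwargs)
        except Exception as exc:  # deliberately broad: any tool failure should be visible
            return json.dumps({"error": type(exc).__name__, "detail": str(exc)})
    return wrapper

@report_errors
def failing_search(query: str, limit: int = 10) -> str:
    raise ConnectionError("Database unreachable")
```

Registering the wrapped function in function_map means the test above can look for the error key in the conversation rather than relying on the model to mention the failure.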

Layer 4: End-to-End Testing the Applications AutoGen Powers

Your AutoGen pipeline is probably not the product itself — it powers something: a web app, an API, a Slack bot. End-to-end testing means testing that thing, from the user's perspective.

This is where unit and integration tests run out. You need to verify that when a user submits a request in your UI, the right agent pipeline runs, and the output surfaces correctly.

For web applications powered by AutoGen agents, HelpMeTest runs Robot Framework + Playwright tests that interact with your actual frontend:

*** Test Cases ***
Research Pipeline Produces Output Within Timeout
    [Documentation]    Verify AutoGen pipeline returns results to UI within 30s
    Navigate To    ${BASE_URL}/research
    Fill In    id=query-input    text=Summarize quantum computing advances in 2024
    Click Button    id=submit-research
    Wait Until Element Contains    id=results-panel    text=Summary    timeout=30s
    Element Should Contain    id=results-panel    text=Quantum
    Element Should Not Contain    id=results-panel    text=Error

Agent Pipeline Shows Intermediate Steps
    [Documentation]    Multi-agent pipeline should surface step visibility to users
    Navigate To    ${BASE_URL}/research
    Fill In    id=query-input    text=Market analysis for electric vehicles
    Click Button    id=submit-research
    Wait Until Element Is Visible    css=.agent-step-indicator    timeout=5s
    # Steps should appear as agents hand off
    Wait Until Element Contains    id=pipeline-status    text=Planning    timeout=10s
    Wait Until Element Contains    id=pipeline-status    text=Executing    timeout=20s
    Wait Until Element Contains    id=pipeline-status    text=Complete    timeout=30s
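When the pipeline sits behind an API or a bot rather than a web page, the same wait-with-timeout idea used in the tests above can be sketched in Python. The wait_until helper below is a generic polling loop written for illustration; it is not part of HelpMeTest or AutoGen, and the predicate would be whatever check fits the service, such as a status lookup.

```python
import time

def wait_until(predicate, timeout=30.0, interval=0.5):
    """Poll `predicate` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(interval)
```

Using a monotonic clock keeps the deadline stable if the system time changes during a long-running pipeline.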

For health monitoring of your AutoGen-powered service, HelpMeTest's CLI lets you set up continuous checks:

helpmetest health autogen-research-api 60s

This runs your test suite on a 60-second cadence and alerts when the pipeline starts failing — before users report it.

How HelpMeTest Fits Into an AutoGen Testing Workflow

Layers 1-3 — unit tests, conversation tests, tool call tests — live in your repo and run in CI. They're fast, cheap, and should run on every commit.

Layer 4 is where HelpMeTest operates. Once your AutoGen pipeline is deployed, the question shifts from "does the code work?" to "does the product work, right now, for real users?"

HelpMeTest monitors the UI your AutoGen agents power:

Test the inputs users actually submit — not just the happy path you thought of
Verify agent output surfaces correctly — formatted, complete, not truncated
Catch regressions when you update models or prompts — a new model version can break UI assumptions
Health checks with helpmetest health — continuous monitoring with configurable grace periods

You can create tests in plain English without writing Playwright selectors by hand, or use the MCP server integration (helpmetest mcp) to write tests directly from Claude Code or Cursor.

Start With the Failure Points That Matter

Multi-agent systems have more places to break than single-agent ones. The testing strategy has to match that complexity:

Unit test each agent's logic with mocked replies — fast, cheap, runs in CI
Integration test conversations with deterministic mock agents — validates routing and termination
Test tool calls explicitly — verify arguments, types, and error handling
End-to-end test the application layer — what users actually see

The LLM is the one component you cannot fully control. Everything around it — the routing logic, the termination conditions, the tool integrations, the UI — is deterministic code that you can and should test like deterministic code.

HelpMeTest's free tier includes 10 tests — enough to cover your critical AutoGen-powered user flows. Start testing at helpmetest.com.
