Testing AutoGen GroupChat: Speaker Selection, Termination, and Deterministic Conversations
GroupChat is where AutoGen gets genuinely powerful — and genuinely unpredictable. You have a manager agent directing a conversation between specialists, a speaker selection policy deciding who talks next, and a termination condition that needs to fire at exactly the right moment. Any one of those can go wrong in ways that are hard to reproduce and harder to catch in CI.
This post focuses specifically on testing GroupChat mechanics: how you validate speaker selection logic, write reliable is_termination_msg functions with proper test coverage, create deterministic conversation recordings for regression testing, and enforce max_round limits. If you're looking for a general AutoGen testing overview, the broader AutoGen testing guide covers agent isolation and LLM mocking from first principles. This post assumes you're already there and want to go deeper on GroupChat.
The GroupChat Testing Problem
GroupChat introduces three failure modes that don't exist in two-agent conversations:
Speaker selection is probabilistic. The GroupChatManager uses an LLM call to pick the next speaker unless you override speaker_selection_method. In production, you might have auto mode where the manager interprets the conversation context and decides who speaks. In test, that same call costs money, takes time, and can return a different agent name depending on how the model feels that day.
Termination conditions must handle message variety. Your is_termination_msg function sees every message in the conversation. It needs to correctly identify termination signals, ignore false positives from mid-conversation messages that happen to contain your trigger phrase, and handle edge cases like empty content or tool-call messages without crashing.
max_round is a safety net, not a feature. A group chat that consistently hits max_round is a group chat with a broken termination condition. But testing that max_round actually stops the conversation — and doesn't silently pass through — is something most teams skip.
Setting Up: AutoGen v0.4+ GroupChat API
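AutoGen v0.4 introduced autogen_agentchat with a revised API, and the examples here use that version. In that API the turn limit is the max_turns argument on a team (the counterpart of max_round in earlier releases), and model-driven speaker selection lives in SelectorGroupChat.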
# conftest.py
import pytest
from unittest.mock import AsyncMock, MagicMock, patch
from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.teams import RoundRobinGroupChat, SelectorGroupChat
from autogen_agentchat.messages import TextMessage
from autogen_core.models import ChatCompletionClient


@pytest.fixture
def mock_model_client():
    """Returns a mock model client that returns controlled responses."""
    client = AsyncMock(spec=ChatCompletionClient)
    return client


@pytest.fixture
def make_agent(mock_model_client):
    """Factory for creating test agents with the mock client."""
    def _make(name: str, system_message: str = "You are a helpful assistant."):
        return AssistantAgent(
            name=name,
            model_client=mock_model_client,
            system_message=system_message,
        )
    return _make

Testing Speaker Selection Logic
The safest approach is to avoid testing auto speaker selection in unit tests — you're testing AutoGen's internals, not your code. What you should test is any custom speaker selection function you write, and the behavior of your group chat when specific agents speak in specific orders.
Testing a Custom Speaker Selection Function
If you've written a speaker_selection_func to enforce routing rules, that function is pure Python and should have pure Python tests:
# your_app/groupchat.py
from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.messages import BaseMessage
from typing import Sequence


def route_by_topic(
    last_message: str,
    agents: list[AssistantAgent],
    messages: Sequence[BaseMessage],
) -> AssistantAgent | str:
    """Route to specialist based on keywords in the last message."""
    agent_map = {agent.name: agent for agent in agents}
    if any(kw in last_message.lower() for kw in ["data", "query", "sql", "database"]):
        return agent_map.get("data_analyst", "auto")
    if any(kw in last_message.lower() for kw in ["write", "draft", "summarize", "report"]):
        return agent_map.get("writer", "auto")
    if "DONE" in last_message or "TERMINATE" in last_message:
        return "auto"  # let the manager decide to end
    return agent_map.get("coordinator", "auto")

# tests/test_speaker_selection.py
import pytest
from unittest.mock import MagicMock
from autogen_agentchat.agents import AssistantAgent
from your_app.groupchat import route_by_topic


@pytest.fixture
def agents():
    coordinator = MagicMock(spec=AssistantAgent)
    coordinator.name = "coordinator"
    data_analyst = MagicMock(spec=AssistantAgent)
    data_analyst.name = "data_analyst"
    writer = MagicMock(spec=AssistantAgent)
    writer.name = "writer"
    return [coordinator, data_analyst, writer]


def test_routes_data_keywords_to_analyst(agents):
    result = route_by_topic("Can you run a SQL query on the database?", agents, [])
    assert result.name == "data_analyst"


def test_routes_writing_keywords_to_writer(agents):
    result = route_by_topic("Please draft a summary of the findings.", agents, [])
    assert result.name == "writer"


def test_routes_unknown_to_coordinator(agents):
    result = route_by_topic("What should we do next?", agents, [])
    assert result.name == "coordinator"


def test_data_keyword_is_case_insensitive(agents):
    result = route_by_topic("We need to QUERY the DATABASE.", agents, [])
    assert result.name == "data_analyst"


def test_terminate_signal_returns_auto(agents):
    result = route_by_topic("TERMINATE the conversation.", agents, [])
    assert result == "auto"


def test_missing_agent_falls_back_to_auto(agents):
    # Remove the writer from the list
    agents_without_writer = [a for a in agents if a.name != "writer"]
    result = route_by_topic("Please draft a report.", agents_without_writer, [])
    assert result == "auto"

These tests run in milliseconds with no LLM calls. They also document the routing contract explicitly: when this function is modified, the tests catch regressions immediately.
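One gap the tests above leave open: route_by_topic matches keywords by substring, so "data" also fires inside "metadata" and "report" inside "reporter". Below is a minimal sketch of a whole-word matcher you could swap in for the `any(kw in ...)` checks. The helper name `mentions_any` and the `DATA_KEYWORDS` constant are hypothetical, not part of AutoGen or the original module.

```python
import re
from typing import Iterable


def mentions_any(text: str, keywords: Iterable[str]) -> bool:
    """Return True if any keyword appears in text as a whole word (case-insensitive)."""
    escaped = [re.escape(kw) for kw in keywords]
    if not escaped:
        # An empty alternation would match at every word boundary, so bail out early.
        return False
    pattern = r"\b(?:" + "|".join(escaped) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


DATA_KEYWORDS = ["data", "query", "sql", "database"]

# Substring matching treats "metadata" as a hit for "data"...
assert "data" in "update the metadata field".lower()
# ...while whole-word matching does not.
assert not mentions_any("Update the metadata field", DATA_KEYWORDS)
assert mentions_any("Run a SQL query", DATA_KEYWORDS)
```

The trade-off is that whole-word matching no longer catches inflected forms such as "queries", so the keyword lists may need to grow. Whichever behavior you choose, pin it with a test so the choice is deliberate.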
Testing is_termination_msg Functions
The termination function is one of the highest-leverage test targets in any AutoGen project. A bad termination function either stops conversations too early (false positives) or allows them to run forever (false negatives). Both are production bugs.
# your_app/termination.py
from autogen_agentchat.messages import BaseMessage


def is_task_complete(message: BaseMessage) -> bool:
    """
    Terminate when:
    - The message content contains TASK_COMPLETE
    - Or the message is from the coordinator and ends with an uppercase DONE
    Does NOT terminate on:
    - Messages that mention completion in passing
    - Tool call result messages (they have no content attribute or content is None)
    """
    content = getattr(message, "content", None)
    if content is None:
        return False
    if not isinstance(content, str):
        return False
    # Explicit signal check: require uppercase to avoid false positives
    if "TASK_COMPLETE" in content:
        return True
    source = getattr(message, "source", "")
    # Case-sensitive on purpose, so a trailing lowercase "done" does not end the chat
    if source == "coordinator" and content.strip().endswith("DONE"):
        return True
    return False

# tests/test_termination.py
import pytest
from unittest.mock import MagicMock
from autogen_agentchat.messages import TextMessage
from your_app.termination import is_task_complete


def make_message(content, source="agent"):
    msg = MagicMock()
    msg.content = content
    msg.source = source
    return msg


class TestTerminationHappyPath:
    def test_terminates_on_task_complete_signal(self):
        msg = make_message("All steps are done. TASK_COMPLETE")
        assert is_task_complete(msg) is True

    def test_terminates_when_coordinator_says_done(self):
        msg = make_message("The analysis is complete. DONE", source="coordinator")
        assert is_task_complete(msg) is True

    def test_task_complete_mid_sentence(self):
        msg = make_message("TASK_COMPLETE — moving to next phase")
        assert is_task_complete(msg) is True


class TestTerminationFalsePositives:
    def test_does_not_terminate_on_lowercase_done(self):
        msg = make_message("We are done with step one.", source="coordinator")
        assert is_task_complete(msg) is False

    def test_does_not_terminate_on_completion_mention(self):
        msg = make_message("Once the task is complete, we should review.")
        assert is_task_complete(msg) is False

    def test_does_not_terminate_when_non_coordinator_says_done(self):
        msg = make_message("DONE", source="data_analyst")
        assert is_task_complete(msg) is False


class TestTerminationEdgeCases:
    def test_handles_none_content_gracefully(self):
        msg = make_message(None)
        assert is_task_complete(msg) is False

    def test_handles_list_content_gracefully(self):
        # Tool call messages can have list-type content
        msg = make_message([{"type": "tool_call", "name": "search"}])
        assert is_task_complete(msg) is False

    def test_handles_empty_string(self):
        msg = make_message("")
        assert is_task_complete(msg) is False

    def test_handles_whitespace_only(self):
        msg = make_message(" ")
        assert is_task_complete(msg) is False

Run these before every deploy. They are fast and deterministic, and they guard against one of the most common classes of GroupChat bug in production: a termination check that fires too early or never fires.
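One caveat about the make_message helper: a bare MagicMock fabricates any attribute you ask for, so it cannot model a message that genuinely lacks content. The sketch below uses types.SimpleNamespace as a stand-in for such messages. `has_text_content` is a hypothetical local copy of the guard at the top of the termination function, included only so the snippet runs without AutoGen installed.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


def has_text_content(message) -> bool:
    """Mirror of the termination guard: only string content is ever inspected."""
    content = getattr(message, "content", None)
    return isinstance(content, str)


# A MagicMock invents `content` on access, so the getattr default is never used.
mock_without_content = MagicMock()
assert getattr(mock_without_content, "content", None) is not None

# SimpleNamespace only has the attributes you give it, so the default applies.
tool_result = SimpleNamespace(source="data_analyst")
assert getattr(tool_result, "content", None) is None

# Both are rejected, but only the SimpleNamespace case exercises the
# "no content attribute" path that the docstring promises to handle.
assert has_text_content(mock_without_content) is False
assert has_text_content(tool_result) is False
assert has_text_content(SimpleNamespace(content="TASK_COMPLETE", source="writer")) is True
```

Constructing real message objects is stricter still. Simple stand-ins like this are a reasonable middle ground when you want to keep the unit tests free of library setup.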
Deterministic Conversation Testing with Recorded Responses
The key to making GroupChat integration tests deterministic is replacing the LLM client with a response recorder that returns pre-scripted answers in sequence. This gives you full conversation replay without real API calls.
# tests/helpers/recorded_client.py
from autogen_core.models import (
    CreateResult,
    LLMMessage,  # public export; avoids importing from the private _types module
    RequestUsage,
)


class RecordedModelClient:
    """
    Replays a sequence of pre-recorded responses.
    Raises IndexError if the conversation exceeds the recorded script.
    """

    def __init__(self, responses: list[str]):
        self._responses = list(responses)
        self._index = 0
        self.calls: list[list[LLMMessage]] = []

    async def create(
        self,
        messages: list[LLMMessage],
        *,
        tools=None,
        cancellation_token=None,
        **kwargs,
    ) -> CreateResult:
        # Keep every prompt so tests can assert on what each agent was shown
        self.calls.append(messages)
        if self._index >= len(self._responses):
            raise IndexError(
                f"RecordedModelClient exhausted after {len(self._responses)} responses. "
                f"Conversation ran longer than expected."
            )
        content = self._responses[self._index]
        self._index += 1
        return CreateResult(
            content=content,
            usage=RequestUsage(prompt_tokens=10, completion_tokens=10),
            finish_reason="stop",
            cached=False,
        )

    @property
    def capabilities(self):
        from autogen_core.models import ModelCapabilities
        return ModelCapabilities(vision=False, function_calling=True, json_output=True)

    @property
    def model_info(self):
        # Assumption: newer 0.4.x releases read model_info instead of capabilities.
        # A plain dict with the same flags is enough for these tests; check your version.
        return {
            "vision": False,
            "function_calling": True,
            "json_output": True,
            "family": "unknown",
        }

# tests/test_groupchat_integration.py
import pytest
from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.teams import RoundRobinGroupChat
from autogen_agentchat.conditions import TextMentionTermination
from tests.helpers.recorded_client import RecordedModelClient


@pytest.fixture
def recorded_client():
    """A client with a scripted 3-turn conversation."""
    return RecordedModelClient([
        "I'll start by analyzing the requirements.",  # coordinator turn 1
        "I found 3 relevant database records. QUERY_DONE",  # data_analyst turn 1
        "Here is the summary report. TASK_COMPLETE",  # writer turn 1
    ])


@pytest.mark.asyncio
async def test_three_agent_conversation_completes(recorded_client):
    coordinator = AssistantAgent(
        name="coordinator",
        model_client=recorded_client,
        system_message="You coordinate the team.",
    )
    data_analyst = AssistantAgent(
        name="data_analyst",
        model_client=recorded_client,
        system_message="You query databases.",
    )
    writer = AssistantAgent(
        name="writer",
        model_client=recorded_client,
        system_message="You write reports.",
    )
    termination = TextMentionTermination("TASK_COMPLETE")
    team = RoundRobinGroupChat(
        [coordinator, data_analyst, writer],
        termination_condition=termination,
        max_turns=10,
    )
    result = await team.run(task="Analyze Q1 sales data and produce a summary report.")

    # Verify the conversation ran the expected number of turns
    assert len(result.messages) == 4  # task + 3 agent responses

    # Verify each agent spoke in the expected order (skip the initial task message)
    agent_names = [
        m.source for m in result.messages
        if hasattr(m, "source") and m.source != "user"
    ]
    assert agent_names == ["coordinator", "data_analyst", "writer"]

    # Verify termination happened on TASK_COMPLETE, not on max_turns
    last_message = result.messages[-1]
    assert "TASK_COMPLETE" in last_message.content

    # Verify all scripted responses were consumed (conversation wasn't cut short)
    assert recorded_client._index == 3

Testing max_round Enforcement
A conversation that hits max_turns without terminating is a bug. But you also need to verify that max_turns actually stops the conversation when the termination condition never fires. Here's how to test both sides:
@pytest.mark.asyncio
async def test_conversation_stops_at_max_turns_when_termination_never_fires():
    """Verify max_turns acts as a hard ceiling."""
    # Script more responses than max_turns allows
    infinite_client = RecordedModelClient(
        ["Thinking about the problem..."] * 20
    )
    agent_a = AssistantAgent(
        name="agent_a",
        model_client=infinite_client,
        system_message="You think carefully.",
    )
    agent_b = AssistantAgent(
        name="agent_b",
        model_client=infinite_client,
        system_message="You also think carefully.",
    )
    # Termination condition that never fires
    from autogen_agentchat.conditions import TextMentionTermination
    termination = TextMentionTermination("THIS_WILL_NEVER_APPEAR")
    team = RoundRobinGroupChat(
        [agent_a, agent_b],
        termination_condition=termination,
        max_turns=4,
    )
    result = await team.run(task="Start discussing.")

    # Should stop at max_turns, not run forever.
    # The initial task message also has a source ("user"), so exclude it.
    agent_messages = [
        m for m in result.messages
        if hasattr(m, "source") and m.source != "user"
    ]
    assert len(agent_messages) == 4
    # Confirm we did NOT consume all 20 scripted responses
    assert infinite_client._index == 4


@pytest.mark.asyncio
async def test_termination_fires_before_max_turns():
    """Verify early termination doesn't require hitting max_turns."""
    early_stop_client = RecordedModelClient([
        "I have the answer. TASK_COMPLETE",
    ])
    agent = AssistantAgent(
        name="solver",
        model_client=early_stop_client,
        system_message="You solve problems.",
    )
    from autogen_agentchat.conditions import TextMentionTermination
    termination = TextMentionTermination("TASK_COMPLETE")
    team = RoundRobinGroupChat(
        [agent],
        termination_condition=termination,
        max_turns=10,
    )
    result = await team.run(task="Solve this problem.")

    # Only one agent turn should have occurred (the task message is excluded)
    agent_messages = [
        m for m in result.messages
        if hasattr(m, "source") and m.source != "user"
    ]
    assert len(agent_messages) == 1
    assert early_stop_client._index == 1  # only one LLM call made

Testing Agent Role Specialization
In a well-designed GroupChat, each agent should only perform tasks consistent with its role. Testing this means asserting on the content agents produce, not just that they spoke.
@pytest.mark.asyncio
async def test_data_analyst_produces_structured_output():
    """The data analyst should always include numeric findings."""
    import re
    analyst_client = RecordedModelClient([
        "Found 47 records matching criteria. Average value: $1,234. Max: $5,678."
    ])
    analyst = AssistantAgent(
        name="data_analyst",
        model_client=analyst_client,
        system_message="You produce quantitative analysis with specific numbers.",
    )
    from autogen_agentchat.conditions import MaxMessageTermination
    team = RoundRobinGroupChat(
        [analyst],
        termination_condition=MaxMessageTermination(2),
    )
    result = await team.run(task="Analyze the Q1 revenue data.")
    analyst_messages = [
        m for m in result.messages
        if getattr(m, "source", None) == "data_analyst"
    ]
    assert len(analyst_messages) == 1
    content = analyst_messages[0].content
    # Role specialization assertion: analyst output must contain numbers
    numbers_found = re.findall(r'\d+', content)
    assert len(numbers_found) >= 2, (
        f"Data analyst output should contain numeric findings, got: {content}"
    )

Asserting Message Sequences
Sometimes the order of agent contributions matters as much as their content. A sequential research pipeline — gather, analyze, summarize — should always produce output in that order:
@pytest.mark.asyncio
async def test_pipeline_message_sequence_is_correct():
    """Verify agents speak in the correct pipeline order."""
    pipeline_client = RecordedModelClient([
        "Research complete: found 5 sources.",  # researcher
        "Analysis done: 3 key themes identified.",  # analyst
        "Summary: the main finding is X. TASK_COMPLETE",  # summarizer
    ])
    researcher = AssistantAgent(
        name="researcher", model_client=pipeline_client,
        system_message="You gather information."
    )
    analyst = AssistantAgent(
        name="analyst", model_client=pipeline_client,
        system_message="You analyze information."
    )
    summarizer = AssistantAgent(
        name="summarizer", model_client=pipeline_client,
        system_message="You summarize findings."
    )
    from autogen_agentchat.conditions import TextMentionTermination
    team = RoundRobinGroupChat(
        [researcher, analyst, summarizer],
        termination_condition=TextMentionTermination("TASK_COMPLETE"),
        max_turns=6,
    )
    result = await team.run(task="Research and summarize the topic.")

    # Extract agent message sequence (exclude the initial task message)
    sequence = [
        m.source for m in result.messages
        if hasattr(m, "source") and m.source != "user"
    ]
    assert sequence == ["researcher", "analyst", "summarizer"], (
        f"Expected pipeline order [researcher, analyst, summarizer], got: {sequence}"
    )

Putting It Together: A GroupChat Test Suite Structure
A complete GroupChat test suite for a production system looks like this:
tests/
  unit/
    test_speaker_selection.py          # pure function tests, no mocking needed
    test_termination_conditions.py     # pure function tests, fast
  integration/
    test_groupchat_conversations.py    # RecordedModelClient, deterministic
    test_max_round_enforcement.py      # boundary condition tests
    test_role_specialization.py        # content assertion tests
  fixtures/
    recorded_conversations/
      research_pipeline_happy_path.json          # stored message sequences
      research_pipeline_early_termination.json

Store your RecordedModelClient response scripts as JSON files alongside the tests. When a conversation's behavior changes intentionally, for example because you updated a system prompt, update the recording and commit it. The diff shows exactly what changed about the conversation.
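Loading such a recording could look like the sketch below. The post does not define a file schema, so this assumes a hypothetical layout in which each file holds a JSON list of objects with "speaker" and "content" keys. `load_script` is an illustrative name, not an AutoGen API.

```python
import json
from pathlib import Path


def load_script(path: Path) -> list[str]:
    """Read a recorded conversation and return the response texts in order.

    Assumed (hypothetical) schema: a JSON list of {"speaker": ..., "content": ...}.
    """
    turns = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(turns, list):
        raise ValueError(f"{path} must contain a JSON list of turns")
    # Only the content is replayed; the speaker field is there to make diffs readable.
    return [turn["content"] for turn in turns]
```

A test would then build its client with something like RecordedModelClient(load_script(fixture_path)). Keeping the speaker name next to each response costs nothing at load time and makes a changed recording much easier to review.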
Beyond Unit Tests: Monitoring GroupChat in Production
Unit and integration tests catch most GroupChat bugs before deployment, but they can't catch the long tail of LLM behavior drift — when your model provider updates a model and speaker selection starts routing differently, or when a new class of user inputs triggers your termination function incorrectly.
Production monitoring needs to track: average turns per conversation (a rising average signals termination condition degradation), speaker distribution per conversation (unexpected skews indicate routing problems), and conversations that hit max_turns (these should be near zero in a healthy system).
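These three metrics can be computed from whatever conversation logs you already keep. The sketch below assumes a hypothetical ConversationLog record holding the ordered speaker names and a flag for whether the turn limit ended the run. Both the record shape and the `summarize` helper are illustrative, and speaker share is aggregated over the whole batch rather than per conversation.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass
class ConversationLog:
    """Hypothetical record of one finished conversation."""
    speakers: list[str]    # agent name for each turn, in order
    hit_max_turns: bool    # True if the turn limit stopped the run


def summarize(logs: list[ConversationLog]) -> dict:
    """Compute average turns, speaker share, and max_turns hit rate for a batch."""
    if not logs:
        raise ValueError("need at least one conversation")
    turn_counts = [len(log.speakers) for log in logs]
    speaker_totals = Counter(name for log in logs for name in log.speakers)
    total_turns = sum(turn_counts)
    return {
        "avg_turns": mean(turn_counts),
        # Fraction of all turns taken by each agent across the batch
        "speaker_share": {
            name: count / total_turns for name, count in speaker_totals.items()
        },
        # Fraction of conversations that ended on the turn limit
        "max_turns_rate": sum(log.hit_max_turns for log in logs) / len(logs),
    }
```

Comparing these numbers week over week, or before and after a model upgrade, is usually more informative than any single absolute threshold.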
HelpMeTest provides health monitoring that works alongside your unit tests — you can define assertions on conversation-level metrics and get alerted when they drift outside expected ranges. It's particularly useful for catching the class of GroupChat bugs that only appear with real users at scale. Plans start at $100/month.
The tests in this post give you a foundation that works today. The monitoring gives you confidence that tomorrow's model update doesn't quietly break your GroupChat routing.