Testing Multi-Agent Orchestration with LangGraph and CrewAI

Multi-agent systems (LangGraph workflows, CrewAI crews) are dramatically harder to test than single agents. State flows between agents, handoffs can fail silently, and emergent behaviors arise from agent interactions. This guide covers testing strategies for multi-agent orchestration: state validation, handoff testing, agent isolation, and failure injection.

Single agents are hard enough to test. Multi-agent systems add another dimension: the interactions between agents. A workflow where a Researcher agent feeds data to an Analyst agent, which produces a report for a Writer agent, has failure modes at every handoff — and most of them are silent.

This guide covers testing LangGraph workflows and CrewAI crews, but the principles apply to any multi-agent orchestration framework.

What to Test in Multi-Agent Systems

State Transitions

In LangGraph, state passes between nodes (agents). The core test: does state transform correctly at each step?

# tests/test_research_graph.py
import pytest
from langgraph.graph import StateGraph
from typing import TypedDict, Annotated
from app.agents import ResearcherAgent, AnalystAgent, WriterAgent

class ResearchState(TypedDict):
    query: str
    search_results: list[dict]
    analysis: str
    final_report: str
    errors: list[str]

def test_researcher_node_adds_results_to_state():
    """Researcher agent must populate search_results."""
    initial_state = ResearchState(
        query="quantum computing trends 2026",
        search_results=[],
        analysis="",
        final_report="",
        errors=[]
    )
    
    researcher = ResearcherAgent(search_client=MockSearchClient())
    result_state = researcher.run(initial_state)
    
    assert len(result_state["search_results"]) > 0, \
        "Researcher must add at least one search result"
    assert result_state["query"] == initial_state["query"], \
        "Researcher must not modify the original query"
    assert result_state["errors"] == [], \
        "Researcher should not add errors on success"

def test_analyst_node_reads_search_results():
    """Analyst agent must process search_results and populate analysis."""
    state_with_results = ResearchState(
        query="quantum computing",
        search_results=[
            {"title": "IBM unveils 1000-qubit chip", "content": "IBM's new chip...", "url": "https://..."},
            {"title": "Google quantum milestone", "content": "Google researchers...", "url": "https://..."}
        ],
        analysis="",
        final_report="",
        errors=[]
    )
    
    analyst = AnalystAgent()
    result_state = analyst.run(state_with_results)
    
    assert len(result_state["analysis"]) > 100, \
        "Analyst must produce a substantive analysis (>100 chars)"
    # Analysis should reference the source material
    assert "IBM" in result_state["analysis"] or "Google" in result_state["analysis"]

Testing individual nodes in isolation catches bugs before they compound across the graph.
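
The MockSearchClient used above is whatever stub your codebase already has. If you don't have one, a minimal sketch (hypothetical; adjust the interface to match your real search client) is enough:

# tests/conftest.py: hypothetical stub for the real search client
class MockSearchClient:
    """Returns canned results so node tests never touch a real search API."""

    def __init__(self, results: list[dict] | None = None):
        self._results = results or [
            {"title": "Stub result", "content": "Stub content", "url": "https://example.com"}
        ]

    def search(self, query: str) -> list[dict]:
        return self._results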

Handoff Correctness

Test that the output of agent A is exactly what agent B expects as input:

def test_researcher_to_analyst_handoff():
    """Verify researcher output is compatible with analyst input schema."""
    researcher_output_fields = set(ResearcherAgent.OUTPUT_FIELDS)
    analyst_required_fields = set(AnalystAgent.REQUIRED_INPUT_FIELDS)
    
    missing = analyst_required_fields - researcher_output_fields
    assert not missing, \
        f"Researcher doesn't produce all fields Analyst requires: {missing}"

def test_analyst_produces_valid_report_input():
    """Analyst output must be parseable by the Writer agent."""
    state = create_state_with_mock_analysis()
    analyst = AnalystAgent()
    result = analyst.run(state)
    
    # Writer expects structured analysis — validate it
    try:
        WriterAgent.validate_input(result["analysis"])
    except ValueError as e:
        pytest.fail(f"Analyst output failed Writer validation: {e}")

Handoff tests act as integration tests between pairs of agents. They're cheaper than full end-to-end tests but catch interface mismatches.
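
If every agent exposes OUTPUT_FIELDS and REQUIRED_INPUT_FIELDS class attributes as above, the same check can be parametrized over every adjacent pair in the pipeline. A sketch, assuming all three agents declare those attributes:

import pytest
from app.agents import ResearcherAgent, AnalystAgent, WriterAgent

# Producer/consumer pairs in pipeline order
AGENT_PIPELINE = [
    (ResearcherAgent, AnalystAgent),
    (AnalystAgent, WriterAgent),
]

@pytest.mark.parametrize("producer,consumer", AGENT_PIPELINE)
def test_adjacent_agents_have_compatible_schemas(producer, consumer):
    """Every field a downstream agent requires must be produced upstream."""
    missing = set(consumer.REQUIRED_INPUT_FIELDS) - set(producer.OUTPUT_FIELDS)
    assert not missing, \
        f"{producer.__name__} does not produce {missing} required by {consumer.__name__}"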

Testing LangGraph Graphs

Graph Structure Tests

Before testing behavior, verify the graph is wired correctly:

from unittest.mock import patch

from app.graphs import create_research_graph

def test_graph_has_correct_nodes():
    graph = create_research_graph()
    compiled = graph.compile()
    
    # All expected nodes are present
    node_names = set(compiled.nodes.keys())
    expected_nodes = {"researcher", "analyst", "writer", "__start__", "__end__"}
    assert expected_nodes.issubset(node_names), \
        f"Missing nodes: {expected_nodes - node_names}"

def test_graph_conditional_routing():
    """If researcher finds no results, graph should skip analyst and go to error handler."""
    graph = create_research_graph()
    compiled = graph.compile()
    
    # Run with a query that returns no results
    initial_state = {"query": "xyzzy_nonexistent_topic_12345", "search_results": [],
                     "analysis": "", "final_report": "", "errors": []}
    
    with patch("app.agents.researcher.search", return_value=[]):
        final_state = compiled.invoke(initial_state)
    
    assert "no results found" in " ".join(final_state["errors"]).lower(), \
        "Graph should add error when researcher finds nothing"
    assert final_state["final_report"] == "", \
        "Graph should not generate a report when there are no results"

Full Graph Execution with Mocked LLMs

Test the complete graph end-to-end without hitting real LLMs or external APIs:

from unittest.mock import patch, MagicMock

def make_mock_llm_responses(node_responses: dict):
    """Map node names to their mock LLM responses."""
    
    def mock_invoke(messages, **kwargs):
        # Detect which node is calling based on system prompt content
        system_prompt = next(
            (m["content"] for m in messages if m["role"] == "system"), ""
        )
        for node_name, response in node_responses.items():
            if node_name.lower() in system_prompt.lower():
                return MagicMock(content=[MagicMock(text=response)])
        return MagicMock(content=[MagicMock(text="Default response")])
    
    return mock_invoke

def test_full_research_workflow():
    mock_responses = {
        "researcher": '["Result 1: AI advances", "Result 2: ML breakthroughs"]',
        "analyst": "Key trends: AI adoption is accelerating. Three major developments...",
        "writer": "# Quantum Computing in 2026\n\nThis year marked a turning point..."
    }
    
    with patch("app.llm.client.messages.create", side_effect=make_mock_llm_responses(mock_responses)):
        graph = create_research_graph()
        result = graph.compile().invoke({
            "query": "AI trends 2026",
            "search_results": [],
            "analysis": "",
            "final_report": "",
            "errors": []
        })
    
    assert len(result["final_report"]) > 0
    assert result["errors"] == []
    assert "2026" in result["final_report"]

Testing CrewAI Crews

CrewAI uses a different abstraction (agents, tasks, crews) but the testing principles are the same.

Agent Task Isolation

Test each task in isolation before testing the crew:

from crewai import Agent, Task, Crew
from unittest.mock import MagicMock

def test_research_task_produces_required_output():
    """Research task must return structured findings."""
    mock_llm = MagicMock()
    mock_llm.call.return_value = """
    FINDINGS:
    1. Cloud adoption reached 85% among enterprises in 2026
    2. Cost savings averaging 40% vs on-premise
    3. Top providers: AWS (32%), Azure (28%), GCP (19%)
    """
    
    researcher = Agent(
        role="Senior Researcher",
        goal="Find accurate statistics about cloud adoption",
        backstory="You are an experienced market researcher",
        llm=mock_llm,
        verbose=False
    )
    
    task = Task(
        description="Research cloud adoption rates in 2026",
        expected_output="Numbered list of at least 3 findings with statistics",
        agent=researcher
    )
    
    result = task.execute_sync()
    
    # Validate output structure
    assert "FINDINGS:" in result or "1." in result, \
        "Research task should produce numbered findings"
    assert len(result.split("\n")) >= 3, \
        "Should have at least 3 lines of findings"

def test_analyst_uses_researcher_output():
    """Analyst task should reference researcher findings in its output."""
    mock_llm = MagicMock()
    
    # Researcher produces findings
    mock_llm.call.side_effect = [
        "FINDINGS:\n1. Cloud adoption 85%\n2. Cost savings 40%",  # researcher
        "ANALYSIS: Based on the 85% adoption rate, the trend is clearly..."  # analyst
    ]
    
    researcher = Agent(role="Researcher", goal="Find data", backstory="...", llm=mock_llm)
    analyst = Agent(role="Analyst", goal="Analyze data", backstory="...", llm=mock_llm)
    
    research_task = Task(
        description="Find cloud adoption data",
        expected_output="Statistics list",
        agent=researcher
    )
    analyst_task = Task(
        description="Analyze the research findings",
        expected_output="Strategic analysis",
        agent=analyst,
        context=[research_task]  # Analyst gets researcher output
    )
    
    crew = Crew(agents=[researcher, analyst], tasks=[research_task, analyst_task], verbose=False)
    result = crew.kickoff()
    
    # Analyst output should reference the researcher's key findings
    assert "85%" in result.raw or "adoption" in result.raw.lower()

Testing Crew-Level Behaviors

def test_crew_handles_researcher_failure_gracefully():
    """If researcher fails, crew should not crash — it should report the failure."""
    mock_llm = MagicMock()
    mock_llm.call.side_effect = [
        Exception("Search API timeout"),  # researcher fails
    ]
    
    crew = create_research_crew(llm=mock_llm)
    
    # Crew should not raise an exception
    try:
        result = crew.kickoff(inputs={"query": "test"})
        # If it doesn't raise, it should indicate failure in output
        assert "error" in result.raw.lower() or "failed" in result.raw.lower(), \
            "Crew should communicate researcher failure in output"
    except Exception as e:
        pytest.fail(f"Crew raised unexpected exception on tool failure: {e}")

def test_crew_task_order_is_correct():
    """Tasks must execute in dependency order."""
    execution_order = []
    
    def track_execution(original_func, task_name):
        def wrapper(*args, **kwargs):
            execution_order.append(task_name)
            return original_func(*args, **kwargs)
        return wrapper
    
    crew = create_research_crew()
    for task in crew.tasks:
        task.execute_sync = track_execution(task.execute_sync, task.description[:20])
    
    with patch_all_llms():
        crew.kickoff()
    
    # Research must come before analysis
    research_idx = next(i for i, name in enumerate(execution_order) if "research" in name.lower())
    analysis_idx = next(i for i, name in enumerate(execution_order) if "analys" in name.lower())
    assert research_idx < analysis_idx, "Research must execute before analysis"
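
The patch_all_llms helper used above isn't shown in this guide. One plausible shape is a context manager that swaps the shared LLM for a MagicMock returning canned text (a sketch, assuming your agents resolve their LLM through a single factory such as app.llm.get_llm):

from contextlib import contextmanager
from unittest.mock import MagicMock, patch

@contextmanager
def patch_all_llms(canned_response: str = "Stub LLM output"):
    """Replace the shared LLM so crew tests never call a real model."""
    mock_llm = MagicMock()
    mock_llm.call.return_value = canned_response
    # Assumes a single factory; point the patch at however your agents get their LLM
    with patch("app.llm.get_llm", return_value=mock_llm):
        yield mock_llm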

Failure Injection Testing

Multi-agent systems must be resilient. Test what happens when agents fail:

import random

class FailingAgent:
    """Wraps a real agent to simulate failures."""
    
    def __init__(self, agent, failure_rate: float = 0.3):
        self._agent = agent
        self._failure_rate = failure_rate
    
    def run(self, state):
        if random.random() < self._failure_rate:
            raise RuntimeError(f"Simulated {type(self._agent).__name__} failure")
        return self._agent.run(state)

def test_graph_retries_on_transient_failure():
    """Graph should retry failed nodes up to max_retries times."""
    call_count = {"count": 0}
    
    def flaky_search(query):
        call_count["count"] += 1
        if call_count["count"] < 3:  # Fail twice, then succeed
            raise RuntimeError("Search API temporarily unavailable")
        return [{"title": "Result", "content": "Content", "url": "https://..."}]
    
    with patch("app.agents.researcher.search", flaky_search):
        graph = create_research_graph(max_retries=3)
        result = graph.compile().invoke({"query": "test", "search_results": [], "errors": []})
    
    assert call_count["count"] == 3, "Should have retried twice before succeeding"
    assert result["errors"] == [], "Should have no errors after successful retry"

def test_graph_enters_error_state_after_max_retries():
    """After max_retries, graph should enter error state, not crash."""
    with patch("app.agents.researcher.search", side_effect=RuntimeError("Permanent failure")):
        graph = create_research_graph(max_retries=3)
        result = graph.compile().invoke({"query": "test", "search_results": [], "errors": []})
    
    assert len(result["errors"]) > 0, "Should have errors after max retries"
    assert "permanent failure" in " ".join(result["errors"]).lower() or \
           "failed" in " ".join(result["errors"]).lower()
    assert result["final_report"] == "", "Should not generate report on failure"
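
FailingAgent is most useful in soak-style property tests: run a node repeatedly under random failures and assert that every outcome is explicit, either a clean success or a raised error, never a half-updated state. A sketch reusing ResearchState, ResearcherAgent, and MockSearchClient from earlier snippets:

def test_node_failure_leaves_no_partial_state():
    """Under injected failures, a node either succeeds fully or raises cleanly."""
    random.seed(42)  # keep the failure pattern reproducible across CI runs
    flaky_researcher = FailingAgent(
        ResearcherAgent(search_client=MockSearchClient()), failure_rate=0.5
    )

    for _ in range(20):
        state = ResearchState(query="test", search_results=[], analysis="",
                              final_report="", errors=[])
        try:
            result = flaky_researcher.run(state)
            assert result["search_results"], "Successful runs must populate results"
        except RuntimeError:
            assert state["search_results"] == [], \
                "Failed runs must not leave partial results in the input state"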

Performance Testing

Multi-agent systems have compounding latency. Profile them:

import time
import statistics

def test_full_crew_completes_within_sla():
    """Full research crew must complete in under 60 seconds."""
    crew = create_research_crew(llm=fast_mock_llm())
    
    start = time.time()
    crew.kickoff(inputs={"query": "test query"})
    duration = time.time() - start
    
    assert duration < 60, f"Crew took {duration:.1f}s — exceeds 60s SLA"

def test_crew_token_usage_within_budget():
    """Full crew must not exceed 10,000 tokens per run."""
    token_counter = TokenCounter()
    
    crew = create_research_crew(llm=counting_mock_llm(token_counter))
    crew.kickoff(inputs={"query": "quantum computing"})
    
    assert token_counter.total < 10_000, \
        f"Crew used {token_counter.total} tokens — exceeds 10K budget"

End-to-End Validation

After testing individual agents and their interactions, validate the complete workflow against a real application:

*** Test Cases ***
Multi-Agent Research Report Published Successfully
    # Run the research crew
    Run Process  python  run_crew.py  --query  "AI testing tools 2026"
    
    # Validate the output appeared in the target system
    Go To  https://app.example.com/reports
    Wait For Elements State  .report-list  visible  timeout=30s
    ${count}=  Get Element Count  .report-item
    Should Be True  ${count} > 0
    Get Text  .report-item:first-child .title  contains  AI

For teams using HelpMeTest, install the CLI and run browser-level E2E tests as part of your CI:

curl -fsSL https://helpmetest.com/install | bash
helpmetest test "Multi-Agent Report Generation"

Common Mistakes

Testing the whole crew but not individual agents. When a crew test fails, you have no idea which agent is wrong. Test each agent in isolation first.

Not testing handoff schemas. If agent A produces JSON but agent B expects a list of strings, the crew silently degrades — the downstream agent gets useless context. Define and test handoff schemas explicitly.
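
One way to make a handoff contract explicit is a small Pydantic model per handoff, validated both inside the producing node and in the handoff test. A sketch (field names are illustrative, and make_initial_state is a hypothetical fixture):

import pytest
from pydantic import BaseModel, Field, ValidationError

class ResearchFindings(BaseModel):
    """Contract for the researcher-to-analyst handoff."""
    query: str
    search_results: list[dict] = Field(min_length=1)

def test_researcher_output_matches_handoff_schema():
    state = ResearcherAgent(search_client=MockSearchClient()).run(make_initial_state())
    try:
        ResearchFindings(query=state["query"], search_results=state["search_results"])
    except ValidationError as e:
        pytest.fail(f"Researcher output violates the handoff schema: {e}")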

Ignoring non-determinism. Different LLM calls in the same session can produce different ordering of results. Use temperature=0 in tests and assert behavioral properties, not exact outputs.
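
For example, in the analyst test from the LangGraph section, prefer property assertions over exact-string comparisons:

# Brittle: tied to exact model wording, will break between runs
# assert result_state["analysis"] == "Key trends: AI adoption is accelerating..."

# Behavioral: holds on any run that actually used the sources
assert len(result_state["analysis"]) > 100
assert any(r["title"].split()[0] in result_state["analysis"]
           for r in state_with_results["search_results"]), \
    "Analysis should reference at least one source by name"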

No failure injection. Multi-agent systems in production will encounter failures. If you only test the happy path, you don't know how the system behaves when a single agent fails.

Conclusion

Testing multi-agent systems requires testing at three levels: individual agent node behavior, agent-to-agent handoffs, and full workflow execution. LangGraph graphs and CrewAI crews have specific testing affordances — use them. Inject failures intentionally to verify resilience. Track state transitions explicitly rather than treating the system as a black box. The investment in structured multi-agent testing pays off in systems that fail predictably, recover gracefully, and don't produce mysterious outputs when a single component misbehaves.
