Testing CrewAI Tools, Task Context, and Crew Output: A Practical Guide
CrewAI's abstraction is clean: you define agents with roles and tools, wire them into tasks, and the crew handles delegation. What's less clean is testing it. A crew.kickoff() call can trigger a chain of LLM calls, tool invocations, and agent-to-agent handoffs that costs several dollars, takes two minutes, and returns a different result every time. That's not a test — it's an experiment.
This post focuses on the practical mechanics of making CrewAI systems testable: mocking tools that call external APIs, testing that context passes correctly between tasks, validating both structured and unstructured crew outputs, testing error handling in delegation chains, and choosing when to use crew.kickoff() versus testing individual task execution. The existing CrewAI testing guide covers configuration validation and high-level strategy. This post is about the specifics.
The Tool Mocking Problem
Tools are where most CrewAI bugs live. A tool that calls an external API introduces network latency, rate limits, auth failures, and unpredictable responses into your test suite. The fix is straightforward: mock the tool's underlying function, not the LLM.
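CrewAI tools are subclasses of BaseTool. Their logic lives in _run(). Test that method directly with the HTTP client patched, and patch _run() itself when you test the crew around it.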
# your_app/tools/web_search.py
import os

import httpx
from crewai.tools import BaseTool
from pydantic import BaseModel, Field


class WebSearchInput(BaseModel):
    query: str = Field(description="The search query to execute.")
    max_results: int = Field(default=5, description="Maximum number of results.")


class WebSearchTool(BaseTool):
    name: str = "web_search"
    description: str = "Searches the web for current information on a topic."
    args_schema: type[BaseModel] = WebSearchInput

    def _run(self, query: str, max_results: int = 5) -> str:
        response = httpx.get(
            "https://api.search-provider.com/search",
            params={"q": query, "limit": max_results},
            headers={"Authorization": f"Bearer {self._get_api_key()}"},
        )
        response.raise_for_status()
        results = response.json()["results"]
        return "\n".join(f"- {r['title']}: {r['snippet']}" for r in results)

    def _get_api_key(self) -> str:
        return os.environ["SEARCH_API_KEY"]

# tests/test_web_search_tool.py
import httpx
import pytest
from unittest.mock import MagicMock, patch

from your_app.tools.web_search import WebSearchTool


@pytest.fixture
def search_tool():
    return WebSearchTool()


def test_tool_formats_results_correctly(search_tool):
    mock_response = MagicMock()
    mock_response.json.return_value = {
        "results": [
            {"title": "Python 3.13 Released", "snippet": "New features include..."},
            {"title": "Python Performance Benchmarks", "snippet": "3.13 is 15% faster..."},
        ]
    }
    mock_response.raise_for_status = MagicMock()
    with patch("httpx.get", return_value=mock_response):
        with patch.dict("os.environ", {"SEARCH_API_KEY": "test-key"}):
            result = search_tool._run(query="Python 3.13", max_results=2)
    assert "Python 3.13 Released" in result
    assert "Python Performance Benchmarks" in result
    assert result.startswith("- ")


def test_tool_raises_on_api_error(search_tool):
    # Simulate a rate-limit failure surfacing from the HTTP call
    rate_limit_error = httpx.HTTPStatusError(
        "429 Too Many Requests",
        request=MagicMock(),
        response=MagicMock(status_code=429),
    )
    with patch("httpx.get", side_effect=rate_limit_error):
        with patch.dict("os.environ", {"SEARCH_API_KEY": "test-key"}):
            with pytest.raises(httpx.HTTPStatusError):
                search_tool._run(query="Python 3.13")


def test_tool_passes_max_results_to_api(search_tool):
    mock_response = MagicMock()
    mock_response.json.return_value = {"results": []}
    mock_response.raise_for_status = MagicMock()
    with patch("httpx.get", return_value=mock_response) as mock_get:
        with patch.dict("os.environ", {"SEARCH_API_KEY": "test-key"}):
            search_tool._run(query="test query", max_results=10)
    # Inspect the keyword arguments the tool passed to httpx.get
    params = mock_get.call_args.kwargs["params"]
    assert params["limit"] == 10


def test_tool_respects_default_max_results(search_tool):
    mock_response = MagicMock()
    mock_response.json.return_value = {"results": []}
    mock_response.raise_for_status = MagicMock()
    with patch("httpx.get", return_value=mock_response) as mock_get:
        with patch.dict("os.environ", {"SEARCH_API_KEY": "test-key"}):
            search_tool._run(query="test query")
    params = mock_get.call_args.kwargs["params"]
    assert params["limit"] == 5

Testing _run() directly bypasses CrewAI entirely. Because nothing touches the network or an LLM, these tests run in milliseconds and cover the tool's request construction, response formatting, and error propagation.
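The tool tests repeat the same response setup in every test. A small builder can remove that repetition and make edge cases such as an empty result list cheap to cover. The sketch below is illustrative only: make_search_response and format_results are hypothetical names, format_results copies the check-then-format steps from _run() so the example runs without CrewAI or httpx installed, and RuntimeError stands in for httpx.HTTPStatusError.

```python
from unittest.mock import MagicMock


def make_search_response(results, status_code=200):
    """Build a stand-in for the HTTP response object that _run() reads from."""
    response = MagicMock()
    response.status_code = status_code
    response.json.return_value = {"results": results}
    if status_code >= 400:
        # Hypothetical error type; with httpx this would be httpx.HTTPStatusError.
        response.raise_for_status.side_effect = RuntimeError(f"HTTP {status_code}")
    return response


def format_results(response):
    """Copy of the check-then-format steps from WebSearchTool._run (assumption: kept in sync by hand)."""
    response.raise_for_status()
    results = response.json()["results"]
    return "\n".join(f"- {r['title']}: {r['snippet']}" for r in results)


one_hit = make_search_response([{"title": "A", "snippet": "b"}])
no_hits = make_search_response([])

assert format_results(one_hit) == "- A: b"
# An empty result list formats to an empty string, with no error raised.
assert format_results(no_hits) == ""
```

The empty-string case is worth a deliberate decision: an agent may read an empty tool result as "nothing to report", so you may prefer the tool to return an explicit "no results" message.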
Testing Task Context Passing
Sequential CrewAI tasks can pass context — the output of an earlier task feeds into a later one via context=[previous_task]. When this goes wrong, the second agent operates on stale or missing data. Testing context passing requires creating controlled task outputs and verifying they appear in subsequent task inputs.
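The core idea is independent of CrewAI: a fake LLM that records every prompt lets a test assert that an earlier step's output reached a later step's input. The sketch below uses a hypothetical RecordingLLM class and a hypothetical run_two_steps function as a stand-in for a two-task sequential crew. Neither is CrewAI code, and the prompt wording is invented for illustration.

```python
class RecordingLLM:
    """Fake LLM: returns canned replies in order and records every prompt it sees."""

    def __init__(self, replies):
        self._replies = list(replies)
        self.prompts = []

    def call(self, prompt):
        self.prompts.append(prompt)
        return self._replies[len(self.prompts) - 1]


def run_two_steps(llm, topic):
    """Hypothetical stand-in for a two-task sequential crew (not CrewAI code)."""
    research = llm.call(f"Research the topic: {topic}")
    # Context passing: the first step's output is embedded in the second prompt.
    analysis = llm.call(f"Analyze these findings:\n{research}")
    return analysis


llm = RecordingLLM(["1. Fact A\n2. Fact B", "Fact A matters most."])
final = run_two_steps(llm, "renewable energy")

assert final == "Fact A matters most."
assert "Fact A" in llm.prompts[1]      # upstream output reached the second prompt
assert "Fact A" not in llm.prompts[0]  # and was not present in the first
```

The same two assertions, applied to the prompts a mocked CrewAI LLM receives, are what turn "the crew completed" into "the analyst actually saw the research".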
# your_app/crew.py
from crewai import Agent, Task, Crew

from your_app.tools.web_search import WebSearchTool


def build_research_crew(llm) -> Crew:
    search_tool = WebSearchTool()

    researcher = Agent(
        role="Senior Researcher",
        goal="Find accurate, current information on the given topic.",
        backstory="Expert researcher with access to web search tools.",
        tools=[search_tool],
        llm=llm,
        verbose=False,
    )
    analyst = Agent(
        role="Data Analyst",
        goal="Analyze research findings and extract key insights.",
        backstory="Expert at synthesizing research into actionable insights.",
        llm=llm,
        verbose=False,
    )

    research_task = Task(
        description="Research the topic: {topic}. Find the 3 most important facts.",
        expected_output="A numbered list of 3 key facts about {topic}.",
        agent=researcher,
    )
    analysis_task = Task(
        description="Analyze the research findings and identify the single most important insight.",
        expected_output="One paragraph summarizing the key insight from the research.",
        agent=analyst,
        context=[research_task],  # This is what we're testing
    )

    return Crew(
        agents=[researcher, analyst],
        tasks=[research_task, analysis_task],
        verbose=False,
    )

# tests/test_task_context.py
from unittest.mock import MagicMock, patch

from your_app.crew import build_research_crew


def make_mock_llm(responses: list[str], prompts_seen: list[str] | None = None):
    """Create an LLM mock that returns responses in sequence and records prompts."""
    call_count = 0

    def mock_call(*args, **kwargs):
        nonlocal call_count
        if prompts_seen is not None:
            # Capture the prompt for later assertions
            prompts_seen.append(str(args) + str(kwargs))
        # Assumption: the caller reads `.content` from the return value. If your
        # CrewAI version expects llm.call to return a plain string, return that instead.
        response = MagicMock()
        response.content = responses[call_count % len(responses)]
        call_count += 1
        return response

    llm = MagicMock()
    llm.call = mock_call
    return llm


def test_analysis_task_receives_research_output():
    """Verify that task context actually propagates to the dependent agent."""
    research_output = "1. Fact A\n2. Fact B\n3. Fact C"
    analysis_output = "The key insight is Fact A, which matters most because..."

    # Track every prompt the mocked LLM receives, in call order
    prompts_received = []
    mock_llm = make_mock_llm([research_output, analysis_output], prompts_received)

    with patch("your_app.tools.web_search.WebSearchTool._run", return_value=research_output):
        crew = build_research_crew(llm=mock_llm)
        result = crew.kickoff(inputs={"topic": "renewable energy"})

    # The crew should complete
    assert result is not None
    # The researcher's output should show up in a later prompt (the analyst's)
    assert len(prompts_received) >= 2
    assert any("Fact A" in prompt for prompt in prompts_received[1:])


def test_task_context_list_is_wired_correctly():
    """Unit test: verify task context references are set up correctly."""
    mock_llm = MagicMock()
    crew = build_research_crew(llm=mock_llm)
    research_task = crew.tasks[0]
    analysis_task = crew.tasks[1]

    # Analysis task must have the research task in its context
    assert research_task in analysis_task.context, (
        "analysis_task.context must include research_task for context passing to work"
    )
    # Research task must have no upstream context
    assert not research_task.context, (
        "research_task should have no upstream context dependencies"
    )


def test_tasks_assigned_to_correct_agents():
    """Verify task-agent assignments before running the crew."""
    mock_llm = MagicMock()
    crew = build_research_crew(llm=mock_llm)
    research_task, analysis_task = crew.tasks

    assert research_task.agent.role == "Senior Researcher"
    assert analysis_task.agent.role == "Data Analyst"
    assert any("search" in t.name.lower() for t in research_task.agent.tools)

    # Analyst should NOT have the search tool; that's the researcher's responsibility
    analyst_tool_names = [t.name for t in analysis_task.agent.tools]
    assert "web_search" not in analyst_tool_names

Testing Structured vs Unstructured Output
CrewAI supports structured output via Pydantic models. The behavior is different enough that you need separate test strategies for each.
Testing Unstructured Output
# tests/test_crew_output.py
import textwrap
from unittest.mock import MagicMock, patch

from your_app.crew import build_research_crew


def test_unstructured_output_contains_required_sections():
    """Unstructured output should contain expected content markers."""
    expected_sections = ["## Summary", "## Key Findings", "## Recommendations"]
    mock_output = textwrap.dedent(
        """\
        ## Summary
        The research indicates strong growth in renewable energy adoption.
        ## Key Findings
        - Solar capacity increased 40% year over year
        - Battery storage costs dropped 25%
        ## Recommendations
        Invest in utility-scale solar infrastructure.
        """
    )
    mock_llm = MagicMock()
    mock_llm.call.return_value = MagicMock(content=mock_output)

    with patch("your_app.tools.web_search.WebSearchTool._run", return_value="mock search results"):
        crew = build_research_crew(llm=mock_llm)
        result = crew.kickoff(inputs={"topic": "renewable energy"})

    # Every required section marker should survive into the final output
    output = str(result)
    for section in expected_sections:
        assert section in output, f"Missing section: {section}"

    # The final task should also describe what the analyst is asked to produce
    final_task = crew.tasks[-1]
    assert "insight" in final_task.expected_output.lower(), (
        "Final task expected_output should describe what the analyst produces"
    )

Testing Structured Output with Pydantic
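Schema tests earn their keep only when the schema can reject something. When a field should accept a fixed set of values, such as a confidence rating of high, medium, or low, a Literal type makes Pydantic enforce it. A minimal sketch, assuming pydantic is installed; ConstrainedFinding and is_valid_finding are illustrative names and not part of CrewAI or of the schema used in this post.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class ConstrainedFinding(BaseModel):
    """Hypothetical finding schema with a closed set of confidence values."""

    title: str
    detail: str
    confidence: Literal["high", "medium", "low"]


def is_valid_finding(payload: dict) -> bool:
    """Return True if the payload satisfies the schema, False if validation fails."""
    try:
        ConstrainedFinding(**payload)
        return True
    except ValidationError:
        return False


good = {"title": "Solar growth", "detail": "40% YoY", "confidence": "high"}
bad = {"title": "Solar growth", "detail": "40% YoY", "confidence": "definitely"}

assert is_valid_finding(good)
assert not is_valid_finding(bad)  # "definitely" is outside the allowed set
```

With a constraint like this in place, a test that feeds an out-of-range value can assert that validation fails instead of documenting that it passes.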
# your_app/output_schemas.py
from typing import List

from pydantic import BaseModel, Field


class ResearchFinding(BaseModel):
    title: str = Field(description="Brief title for this finding")
    detail: str = Field(description="Detailed explanation")
    confidence: str = Field(description="high, medium, or low")


class ResearchReport(BaseModel):
    topic: str = Field(description="The research topic")
    findings: List[ResearchFinding] = Field(description="List of key findings")
    recommendation: str = Field(description="The primary recommendation")

# tests/test_structured_output.py
from unittest.mock import MagicMock

from crewai import Agent, Task, Crew

from your_app.output_schemas import ResearchReport


def test_structured_output_validates_correctly():
    """Verify that valid structured output passes schema validation."""
    valid_output = {
        "topic": "Renewable Energy",
        "findings": [
            {
                "title": "Solar growth accelerating",
                "detail": "40% capacity increase YoY",
                "confidence": "high",
            }
        ],
        "recommendation": "Invest in utility-scale solar.",
    }
    # This is the validation your crew output goes through
    report = ResearchReport(**valid_output)
    assert report.topic == "Renewable Energy"
    assert len(report.findings) == 1
    assert report.findings[0].confidence == "high"


def test_structured_output_rejects_invalid_confidence():
    """Documents a known gap: confidence is not yet restricted to allowed values."""
    invalid_output = {
        "topic": "Test",
        "findings": [
            {
                "title": "Something",
                "detail": "Details here",
                "confidence": "definitely",  # not a valid value
            }
        ],
        "recommendation": "Do something.",
    }
    # Currently confidence is a plain string, so this payload is accepted.
    # Once the schema gains a validator, construction should raise instead.
    report = ResearchReport(**invalid_output)
    # This passes today; a future validation improvement should fail it
    assert report.findings[0].confidence == "definitely"
    # TODO: add Literal["high", "medium", "low"] constraint to confidence field


def test_crew_with_structured_output_type():
    """Verify crew is configured to use structured output type."""
    mock_llm = MagicMock()
    analyst = Agent(
        role="Data Analyst",
        goal="Produce structured analysis.",
        backstory="Expert analyst.",
        llm=mock_llm,
    )
    task = Task(
        description="Analyze renewable energy trends.",
        expected_output="A structured research report.",
        agent=analyst,
        output_pydantic=ResearchReport,  # structured output configuration
    )
    # Building the crew runs CrewAI's own configuration validation
    Crew(agents=[analyst], tasks=[task])

    # Verify the task is configured for structured output
    assert task.output_pydantic == ResearchReport
    assert task.output_json is None  # shouldn't have both

Testing Async Crew Execution
crew.kickoff_async() lets you run crews concurrently. Testing async execution requires verifying both the output and the concurrency behavior.
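Completion alone does not show that two runs overlapped; wall-clock time does. The sketch below illustrates the measurement with the standard library only. It assumes kickoff_async behaves like a blocking call moved to a worker thread, which asyncio.to_thread stands in for here. blocking_job and run_concurrently are hypothetical names, and the delay and threshold are arbitrary.

```python
import asyncio
import time


def blocking_job(name: str, delay: float) -> str:
    """Stand-in for a synchronous crew run that blocks for `delay` seconds."""
    time.sleep(delay)
    return f"{name} done"


async def run_concurrently(delay: float = 0.3):
    """Run two blocking jobs in worker threads and report wall-clock time."""
    start = time.monotonic()
    results = await asyncio.gather(
        asyncio.to_thread(blocking_job, "crew_1", delay),
        asyncio.to_thread(blocking_job, "crew_2", delay),
    )
    return results, time.monotonic() - start


results, elapsed = asyncio.run(run_concurrently())

assert results == ["crew_1 done", "crew_2 done"]
# Sequential execution would take at least 2 * delay (0.6s); overlap keeps it under that.
assert elapsed < 0.55
```

Timing assertions are sensitive to machine load, so a generous margin between the concurrent and sequential bounds matters more than a tight number.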
# tests/test_async_crew.py
import asyncio
import time
from unittest.mock import MagicMock, patch

import pytest  # @pytest.mark.asyncio requires the pytest-asyncio plugin

from your_app.crew import build_research_crew


@pytest.mark.asyncio
async def test_async_kickoff_completes():
    """Verify async crew execution returns the same result type as sync."""
    mock_llm = MagicMock()
    mock_llm.call.return_value = MagicMock(content="Analysis complete. Key insight: X.")

    with patch("your_app.tools.web_search.WebSearchTool._run", return_value="mock results"):
        crew = build_research_crew(llm=mock_llm)
        result = await crew.kickoff_async(inputs={"topic": "AI trends"})

    assert result is not None


@pytest.mark.asyncio
async def test_multiple_crews_run_concurrently():
    """Verify that two crews started with kickoff_async both complete under asyncio.gather."""
    mock_llm = MagicMock()
    call_delay = 0.05  # 50ms simulated LLM latency

    def slow_llm_response(*args, **kwargs):
        time.sleep(call_delay)
        return MagicMock(content="Result.")

    mock_llm.call = slow_llm_response

    with patch("your_app.tools.web_search.WebSearchTool._run", return_value="results"):
        crew_1 = build_research_crew(llm=mock_llm)
        crew_2 = build_research_crew(llm=mock_llm)
        start = time.monotonic()
        results = await asyncio.gather(
            crew_1.kickoff_async(inputs={"topic": "Topic A"}),
            crew_2.kickoff_async(inputs={"topic": "Topic B"}),
        )
        elapsed = time.monotonic() - start

    assert len(results) == 2
    # Both completed: that's the main assertion for async crew tests
    assert all(r is not None for r in results)
    # elapsed is reported for diagnostics only; a hard threshold here would be flaky
    print(f"Two crews finished in {elapsed:.2f}s")

Testing Error Handling in Task Delegation
When a tool fails or an agent produces malformed output, you need your crew to fail gracefully rather than silently returning garbage. Testing the error paths is as important as testing the happy path.
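One way to make "fail gracefully" concrete, and therefore testable, is to decide what a failing tool hands back to the agent. The sketch below wraps a tool function so that any exception becomes a clearly labeled error string. This is a hypothetical pattern, not something CrewAI provides: run_tool_safely, flaky_search, working_search, and the TOOL_ERROR prefix are all invented for illustration.

```python
def run_tool_safely(run, **kwargs) -> str:
    """Call a tool function; turn any exception into an explicit, labeled error string."""
    try:
        return run(**kwargs)
    except Exception as exc:  # broad on purpose: the agent should see every failure
        return f"TOOL_ERROR ({type(exc).__name__}): {exc}"


def flaky_search(query: str) -> str:
    """Hypothetical tool body that always hits a rate limit."""
    raise RuntimeError("429 Too Many Requests")


def working_search(query: str) -> str:
    """Hypothetical tool body that succeeds."""
    return f"- result for {query}"


assert run_tool_safely(working_search, query="solar") == "- result for solar"
assert (
    run_tool_safely(flaky_search, query="solar")
    == "TOOL_ERROR (RuntimeError): 429 Too Many Requests"
)
```

A labeled marker gives tests something definite to assert on, so a crew-level test can check that the marker reaches the final output or triggers a retry, instead of accepting either outcome.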
# tests/test_error_handling.py
import httpx
import pydantic
import pytest
from unittest.mock import MagicMock, patch

from crewai import Agent, Crew, Task

from your_app.crew import build_research_crew


def test_crew_handles_tool_rate_limit_error():
    """Verify the crew surfaces tool errors rather than swallowing them."""
    mock_llm = MagicMock()
    mock_llm.call.return_value = MagicMock(content="Found the information.")
    rate_limit_error = httpx.HTTPStatusError(
        "429 Too Many Requests",
        request=MagicMock(),
        response=MagicMock(status_code=429),
    )

    with patch("your_app.tools.web_search.WebSearchTool._run", side_effect=rate_limit_error):
        crew = build_research_crew(llm=mock_llm)
        # The crew should either raise or return an error indicator
        # Behavior depends on your max_rpm and error handling configuration
        try:
            result = crew.kickoff(inputs={"topic": "test"})
            # If the crew catches errors and continues, result should still be valid
            assert result is not None
        except httpx.HTTPStatusError:
            # If errors propagate, they should be the original error type.
            # This is also acceptable behavior; document which one you expect.
            pass


def test_task_with_no_agent_raises_on_construction():
    """Verify misconfigured tasks fail at construction, not at runtime."""
    mock_llm = MagicMock()
    agent = Agent(role="Researcher", goal="Research.", backstory="Expert.", llm=mock_llm)
    # Assumption: Task.agent may be optional in your CrewAI version (hierarchical
    # crews assign agents at runtime), in which case the missing agent is caught
    # when the sequential Crew is built. If Task itself validates the agent, the
    # error is raised one statement earlier and this test still passes.
    with pytest.raises((ValueError, pydantic.ValidationError)):
        orphan_task = Task(
            description="Do something.",
            expected_output="Something done.",
            agent=None,  # This should fail
        )
        Crew(agents=[agent], tasks=[orphan_task])


def test_agent_without_required_tool_is_detectable():
    """Verify you can detect a misconfigured agent before running the crew."""
    mock_llm = MagicMock()
    crew = build_research_crew(llm=mock_llm)
    researcher = crew.agents[0]

    # Researcher needs web_search tool
    researcher_tool_names = [t.name for t in researcher.tools]
    assert "web_search" in researcher_tool_names, (
        f"Researcher is missing web_search tool. Has: {researcher_tool_names}"
    )
    # This assertion runs in CI before any LLM calls happen
    # It catches tool assignment bugs at the cheapest possible moment

Individual Task Execution vs crew.kickoff()
For testing specific task logic without running the full crew, you can execute a task on its own by wrapping it in a single-task crew. This is faster and isolates failures to a single agent:
# tests/test_individual_tasks.py
from unittest.mock import MagicMock, patch

from crewai import Agent, Crew, Task


def test_researcher_task_in_isolation():
    """Test just the research task without running the full crew."""
    mock_llm = MagicMock()
    mock_llm.call.return_value = MagicMock(
        content="1. Solar adoption is accelerating\n2. Wind costs dropped 30%\n3. Grid storage is the bottleneck"
    )
    researcher = Agent(
        role="Senior Researcher",
        goal="Find accurate information.",
        backstory="Expert researcher.",
        llm=mock_llm,
        verbose=False,
    )
    task = Task(
        description="Research renewable energy trends.",
        expected_output="A numbered list of 3 key facts.",
        agent=researcher,
    )

    # This researcher has no tools attached; the patch only guards against
    # an accidental real search if one is added later.
    with patch("your_app.tools.web_search.WebSearchTool._run", return_value="mock search results"):
        # Execute single task via a single-task crew
        single_task_crew = Crew(agents=[researcher], tasks=[task], verbose=False)
        result = single_task_crew.kickoff()

    output = str(result)
    # Verify the task produced a numbered list
    assert "1." in output
    assert "2." in output
    assert "3." in output

A Complete Test File Structure for CrewAI Projects
tests/
    unit/
        test_web_search_tool.py        # tool._run() tests: no LLM, no network
        test_database_tool.py          # same pattern for other tools
        test_output_schemas.py         # Pydantic schema validation
    integration/
        test_task_context_wiring.py    # crew structure tests: no LLM calls
        test_crew_kickoff.py           # end-to-end with fully mocked LLM
        test_async_execution.py        # kickoff_async behavior
    error_handling/
        test_tool_failures.py          # rate limits, auth errors, timeouts
        test_misconfiguration.py       # missing agents, missing tools, bad schemas

Run unit/ on every commit: the tests are fast and catch tool bugs before they become crew bugs. Run integration/ in CI pull request checks. Reserve actual LLM calls for manual smoke tests before production deployments.
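Keeping real LLM calls out of routine runs is easier when the switch is explicit. A minimal sketch, assuming an opt-in environment variable: RUN_LIVE_LLM_TESTS and live_llm_enabled are hypothetical names, and the pytest usage appears only in a comment so the block runs with the standard library alone.

```python
import os


def live_llm_enabled(env=None) -> bool:
    """True only when the (hypothetical) RUN_LIVE_LLM_TESTS flag is set to "1"."""
    env = os.environ if env is None else env
    return env.get("RUN_LIVE_LLM_TESTS") == "1"


# With pytest, a live smoke test could then be gated like this:
#   @pytest.mark.skipif(not live_llm_enabled(), reason="set RUN_LIVE_LLM_TESTS=1 to run")
#   def test_crew_against_real_llm(): ...

assert live_llm_enabled({"RUN_LIVE_LLM_TESTS": "1"})
assert not live_llm_enabled({})
assert not live_llm_enabled({"RUN_LIVE_LLM_TESTS": "0"})
```

Passing the environment as a parameter keeps the gate itself unit-testable without touching the real process environment.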
The Gap Unit Tests Can't Close
Mocked LLMs and controlled tool responses let you build a fast, deterministic test suite. But they can't test the quality of agent reasoning, whether an updated system prompt still produces structured output, or whether the LLM actually uses the tool when it should. Those are behavioral tests that need real LLM calls.
Running full crew behavioral tests on every commit is too slow and too expensive. But running them on a schedule — daily, or triggered by system prompt changes — catches the drift before users do.
HelpMeTest is built for exactly this layer: health check runs that execute your CrewAI crew against real LLM endpoints on a schedule, with assertions on output structure, tool call counts, and execution time. When an LLM provider update breaks your crew's structured output, you find out in the monitoring dashboard instead of from a support ticket. Plans start at $100/month.
The unit tests in this post give you a fast foundation. The monitoring fills the gap the unit tests leave.