Testing CrewAI Tools, Task Context, and Crew Output: A Practical Guide

CrewAI's abstraction is clean: you define agents with roles and tools, wire them into tasks, and the crew handles delegation. What's less clean is testing it. A single crew.kickoff() call can trigger a chain of LLM calls, tool invocations, and agent-to-agent handoffs that can cost several dollars, take minutes, and return a different result on every run. That's not a test; it's an experiment.

This post focuses on the practical mechanics of making CrewAI systems testable: mocking tools that call external APIs, testing that context passes correctly between tasks, validating both structured and unstructured crew outputs, testing error handling in delegation chains, and choosing when to use crew.kickoff() versus testing individual task execution. The existing CrewAI testing guide covers configuration validation and high-level strategy. This post is about the specifics.

The Tool Mocking Problem

Tools are where most CrewAI bugs live. A tool that calls an external API introduces network latency, rate limits, auth failures, and unpredictable responses into your test suite. The fix is straightforward: mock the tool's underlying function, not the LLM.

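CrewAI tools are subclasses of BaseTool, and their logic lives in _run(). For tool unit tests, call _run() directly with the HTTP client patched. For crew-level tests, patch _run() itself.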

# your_app/tools/web_search.py
from crewai.tools import BaseTool
from pydantic import BaseModel, Field
import httpx


class WebSearchInput(BaseModel):
    query: str = Field(description="The search query to execute.")
    max_results: int = Field(default=5, description="Maximum number of results.")


class WebSearchTool(BaseTool):
    name: str = "web_search"
    description: str = "Searches the web for current information on a topic."
    args_schema: type[BaseModel] = WebSearchInput

    def _run(self, query: str, max_results: int = 5) -> str:
        response = httpx.get(
            "https://api.search-provider.com/search",
            params={"q": query, "limit": max_results},
            headers={"Authorization": f"Bearer {self._get_api_key()}"},
        )
        response.raise_for_status()
        results = response.json()["results"]
        return "\n".join(f"- {r['title']}: {r['snippet']}" for r in results)

    def _get_api_key(self) -> str:
        import os
        return os.environ["SEARCH_API_KEY"]


# tests/test_web_search_tool.py
import pytest
from unittest.mock import patch, MagicMock
from your_app.tools.web_search import WebSearchTool


@pytest.fixture
def search_tool():
    return WebSearchTool()


def test_tool_formats_results_correctly(search_tool):
    mock_response = MagicMock()
    mock_response.json.return_value = {
        "results": [
            {"title": "Python 3.13 Released", "snippet": "New features include..."},
            {"title": "Python Performance Benchmarks", "snippet": "3.13 is 15% faster..."},
        ]
    }
    mock_response.raise_for_status = MagicMock()

    with patch("httpx.get", return_value=mock_response):
        with patch.dict("os.environ", {"SEARCH_API_KEY": "test-key"}):
            result = search_tool._run(query="Python 3.13", max_results=2)

    assert "Python 3.13 Released" in result
    assert "Python Performance Benchmarks" in result
    assert result.startswith("- ")


def test_tool_raises_on_api_error(search_tool):
    import httpx
    # httpx.get() returns normally on a 429; raise_for_status() is what raises
    mock_response = MagicMock()
    mock_response.raise_for_status.side_effect = httpx.HTTPStatusError(
        "429 Too Many Requests",
        request=MagicMock(),
        response=MagicMock(status_code=429),
    )

    with patch("httpx.get", return_value=mock_response):
        with patch.dict("os.environ", {"SEARCH_API_KEY": "test-key"}):
            with pytest.raises(httpx.HTTPStatusError):
                search_tool._run(query="Python 3.13")


def test_tool_passes_max_results_to_api(search_tool):
    mock_response = MagicMock()
    mock_response.json.return_value = {"results": []}
    mock_response.raise_for_status = MagicMock()

    with patch("httpx.get", return_value=mock_response) as mock_get:
        with patch.dict("os.environ", {"SEARCH_API_KEY": "test-key"}):
            search_tool._run(query="test query", max_results=10)

    # Inspect the query parameters the tool sent to the API
    params = mock_get.call_args.kwargs["params"]
    assert params["limit"] == 10


def test_tool_respects_default_max_results(search_tool):
    mock_response = MagicMock()
    mock_response.json.return_value = {"results": []}
    mock_response.raise_for_status = MagicMock()

    with patch("httpx.get", return_value=mock_response) as mock_get:
        with patch.dict("os.environ", {"SEARCH_API_KEY": "test-key"}):
            search_tool._run(query="test query")

    # The default max_results of 5 should reach the API unchanged
    params = mock_get.call_args.kwargs["params"]
    assert params["limit"] == 5

Testing _run() directly bypasses CrewAI entirely. With the HTTP client mocked, these tests typically finish in milliseconds and cover the tool's request construction, response formatting, and error propagation.
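
One way to shrink the mocking surface further is to pull the result formatting out of _run() into a pure function, so it can be tested with no patching at all. The sketch below assumes a hypothetical helper named format_results that is not part of the original tool; _run() would call it on response.json()["results"].

```python
from typing import Mapping, Sequence


def format_results(results: Sequence[Mapping[str, str]]) -> str:
    """Render search hits as a bulleted list, one hit per line.

    Hypothetical helper: mirrors the join expression inside WebSearchTool._run().
    """
    return "\n".join(f"- {r['title']}: {r['snippet']}" for r in results)


def test_format_results_renders_one_bullet_per_hit():
    hits = [
        {"title": "Python 3.13 Released", "snippet": "New features include..."},
        {"title": "Benchmarks", "snippet": "3.13 is faster..."},
    ]
    lines = format_results(hits).splitlines()
    assert lines == [
        "- Python 3.13 Released: New features include...",
        "- Benchmarks: 3.13 is faster...",
    ]


def test_format_results_handles_empty_list():
    # No hits should produce an empty string, not a stray bullet
    assert format_results([]) == ""
```

With this split, the patched-httpx tests only need to cover request construction and error propagation.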

Testing Task Context Passing

Sequential CrewAI tasks can pass context — the output of an earlier task feeds into a later one via context=[previous_task]. When this goes wrong, the second agent operates on stale or missing data. Testing context passing requires creating controlled task outputs and verifying they appear in subsequent task inputs.

# your_app/crew.py
from crewai import Agent, Task, Crew
from your_app.tools.web_search import WebSearchTool


def build_research_crew(llm) -> Crew:
    search_tool = WebSearchTool()

    researcher = Agent(
        role="Senior Researcher",
        goal="Find accurate, current information on the given topic.",
        backstory="Expert researcher with access to web search tools.",
        tools=[search_tool],
        llm=llm,
        verbose=False,
    )

    analyst = Agent(
        role="Data Analyst",
        goal="Analyze research findings and extract key insights.",
        backstory="Expert at synthesizing research into actionable insights.",
        llm=llm,
        verbose=False,
    )

    research_task = Task(
        description="Research the topic: {topic}. Find the 3 most important facts.",
        expected_output="A numbered list of 3 key facts about {topic}.",
        agent=researcher,
    )

    analysis_task = Task(
        description="Analyze the research findings and identify the single most important insight.",
        expected_output="One paragraph summarizing the key insight from the research.",
        agent=analyst,
        context=[research_task],  # This is what we're testing
    )

    return Crew(
        agents=[researcher, analyst],
        tasks=[research_task, analysis_task],
        verbose=False,
    )


# tests/test_task_context.py
import pytest
from unittest.mock import MagicMock, patch
from crewai import Agent, Task, Crew


def make_mock_llm(responses: list[str]):
    """Create an LLM mock that returns responses in sequence."""
    call_count = 0

    def mock_call(*args, **kwargs):
        nonlocal call_count
        response = MagicMock()
        response.content = responses[call_count % len(responses)]
        call_count += 1
        return response

    llm = MagicMock()
    llm.call = mock_call
    return llm


def test_analysis_task_receives_research_output():
    """Verify that task context actually propagates to the dependent agent."""
    research_output = "1. Fact A\n2. Fact B\n3. Fact C"
    analysis_output = "The key insight is Fact A, which matters most because..."

    # Record every prompt the mocked LLM receives, in call order
    prompts_received = []
    # Assumes one LLM call per task: researcher first, then analyst
    scripted_outputs = [research_output, analysis_output]

    def capture_call(*args, **kwargs):
        prompts_received.append(str(args) + str(kwargs))
        # Reuse the last output if the agent loop makes extra calls
        index = min(len(prompts_received), len(scripted_outputs)) - 1
        # Response shape follows this post's mock convention; some CrewAI
        # versions expect llm.call() to return a plain string instead.
        return MagicMock(content=scripted_outputs[index])

    mock_llm = MagicMock()
    mock_llm.call = MagicMock(side_effect=capture_call)

    with patch("your_app.tools.web_search.WebSearchTool._run", return_value=research_output):
        from your_app.crew import build_research_crew
        crew = build_research_crew(llm=mock_llm)
        result = crew.kickoff(inputs={"topic": "renewable energy"})

    # The crew should complete
    assert result is not None
    # The analyst's prompt (the final LLM call) must include the research output
    assert "Fact B" in prompts_received[-1]


def test_task_context_list_is_wired_correctly():
    """Unit test: verify task context references are set up correctly."""
    mock_llm = MagicMock()

    from your_app.crew import build_research_crew
    crew = build_research_crew(llm=mock_llm)

    research_task = crew.tasks[0]
    analysis_task = crew.tasks[1]

    # Analysis task must have the research task in its context
    assert research_task in analysis_task.context, (
        "analysis_task.context must include research_task for context passing to work"
    )

    # Research task must have no upstream context
    assert not research_task.context or len(research_task.context) == 0, (
        "research_task should have no upstream context dependencies"
    )


def test_tasks_assigned_to_correct_agents():
    """Verify task-agent assignments before running the crew."""
    mock_llm = MagicMock()

    from your_app.crew import build_research_crew
    crew = build_research_crew(llm=mock_llm)

    research_task, analysis_task = crew.tasks

    assert research_task.agent.role == "Senior Researcher"
    assert analysis_task.agent.role == "Data Analyst"
    assert any("search" in t.name.lower() for t in research_task.agent.tools)
    # Analyst should NOT have the search tool — that's the researcher's responsibility
    analyst_tool_names = [t.name for t in analysis_task.agent.tools]
    assert "web_search" not in analyst_tool_names
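
The wiring assertions generalize to crews with more tasks. In a sequential crew a task can only consume context from tasks that run before it, so a small helper can flag any context reference that points forward. This is a sketch using a stand-in FakeTask dataclass rather than crewai.Task; the helper name find_forward_context_refs is hypothetical, and it assumes each task exposes context as a list or None.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality, so tasks compare by object
class FakeTask:
    """Stand-in for a task object; only the attributes the helper reads."""
    description: str
    context: list["FakeTask"] = field(default_factory=list)


def find_forward_context_refs(tasks: list) -> list[tuple[str, str]]:
    """Return (task, upstream) description pairs where upstream does not run earlier."""
    seen_ids: set[int] = set()
    violations = []
    for task in tasks:
        for upstream in (task.context or []):
            if id(upstream) not in seen_ids:
                violations.append((task.description, upstream.description))
        seen_ids.add(id(task))
    return violations


research = FakeTask("research")
analysis = FakeTask("analysis", context=[research])

# Correct order: research runs first, so analysis can read its output
assert find_forward_context_refs([research, analysis]) == []
# Swapped order: analysis would run before its context exists
assert find_forward_context_refs([analysis, research]) == [("analysis", "research")]
```

Against a real crew, the same helper could be pointed at crew.tasks, provided the installed CrewAI version stores context as a plain list.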

Testing Structured vs Unstructured Output

CrewAI supports structured output via Pydantic models. The behavior is different enough that you need separate test strategies for each.

Testing Unstructured Output

# tests/test_crew_output.py
import pytest
from unittest.mock import MagicMock, patch


def test_unstructured_output_contains_required_sections():
    """Unstructured output should contain expected content markers."""
    expected_sections = ["## Summary", "## Key Findings", "## Recommendations"]
    
    mock_output = """
## Summary
The research indicates strong growth in renewable energy adoption.

## Key Findings
- Solar capacity increased 40% year over year
- Battery storage costs dropped 25%

## Recommendations
Invest in utility-scale solar infrastructure.
"""
    
    mock_llm = MagicMock()
    mock_llm.call.return_value = MagicMock(content=mock_output)

    with patch("your_app.tools.web_search.WebSearchTool._run", return_value="mock search results"):
        from your_app.crew import build_research_crew
        crew = build_research_crew(llm=mock_llm)
        result = crew.kickoff(inputs={"topic": "renewable energy"})

    # Every required heading must survive into the final crew output
    output = str(result)
    for section in expected_sections:
        assert section in output, f"Missing section: {section}"
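
A plain substring check for a heading also passes when the heading text appears mid-sentence. A stricter sketch matches each heading on a line of its own. The helper name missing_sections is hypothetical, and it assumes the crew output has already been converted to a string.

```python
import re


def missing_sections(text: str, required: list[str]) -> list[str]:
    """Return the required headings that do not appear on a line of their own."""
    missing = []
    for heading in required:
        # MULTILINE anchors ^ and $ to line boundaries; trailing spaces are allowed
        pattern = rf"^{re.escape(heading)}[ \t]*$"
        if not re.search(pattern, text, flags=re.MULTILINE):
            missing.append(heading)
    return missing


report = "## Summary\nGrowth is strong.\n\n## Key Findings\n- Solar up 40%\n"
required = ["## Summary", "## Key Findings", "## Recommendations"]

assert missing_sections(report, required) == ["## Recommendations"]
# A heading quoted inside a sentence does not count
assert missing_sections("See the ## Summary below", ["## Summary"]) == ["## Summary"]
```

Returning the list of missing headings, rather than a boolean, keeps the assertion message useful when a test fails.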

Testing Structured Output with Pydantic

# your_app/output_schemas.py
from pydantic import BaseModel, Field
from typing import List


class ResearchFinding(BaseModel):
    title: str = Field(description="Brief title for this finding")
    detail: str = Field(description="Detailed explanation")
    confidence: str = Field(description="high, medium, or low")


class ResearchReport(BaseModel):
    topic: str = Field(description="The research topic")
    findings: List[ResearchFinding] = Field(description="List of key findings")
    recommendation: str = Field(description="The primary recommendation")


# tests/test_structured_output.py
import pytest
import json
from unittest.mock import MagicMock, patch
from your_app.output_schemas import ResearchReport, ResearchFinding


def test_structured_output_validates_correctly():
    """Verify that valid structured output passes schema validation."""
    valid_output = {
        "topic": "Renewable Energy",
        "findings": [
            {
                "title": "Solar growth accelerating",
                "detail": "40% capacity increase YoY",
                "confidence": "high"
            }
        ],
        "recommendation": "Invest in utility-scale solar."
    }

    # This is the validation your crew output goes through
    report = ResearchReport(**valid_output)
    assert report.topic == "Renewable Energy"
    assert len(report.findings) == 1
    assert report.findings[0].confidence == "high"


@pytest.mark.xfail(
    reason="confidence is a plain str; constrain it with Literal['high', 'medium', 'low']",
    strict=True,
)
def test_structured_output_rejects_invalid_confidence():
    """Pydantic should catch invalid enum-like values before they reach callers."""
    from pydantic import ValidationError

    invalid_output = {
        "topic": "Test",
        "findings": [
            {
                "title": "Something",
                "detail": "Details here",
                "confidence": "definitely"  # not a valid value
            }
        ],
        "recommendation": "Do something."
    }

    # Expected to fail today: the schema accepts any string for confidence.
    # Once the Literal constraint is added this test passes, and the strict
    # xfail marker then reports it so the marker can be removed.
    with pytest.raises(ValidationError):
        ResearchReport(**invalid_output)


def test_crew_with_structured_output_type():
    """Verify crew is configured to use structured output type."""
    mock_llm = MagicMock()
    
    from crewai import Agent, Task, Crew
    from your_app.output_schemas import ResearchReport

    analyst = Agent(
        role="Data Analyst",
        goal="Produce structured analysis.",
        backstory="Expert analyst.",
        llm=mock_llm,
    )

    task = Task(
        description="Analyze renewable energy trends.",
        expected_output="A structured research report.",
        agent=analyst,
        output_pydantic=ResearchReport,  # structured output configuration
    )

    crew = Crew(agents=[analyst], tasks=[task])

    # Verify the task is configured for structured output
    assert task.output_pydantic == ResearchReport
    assert task.output_json is None  # shouldn't have both
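
The confidence field in the schema is described as high, medium, or low but typed as a plain string. In the Pydantic model, the usual way to enforce that is to annotate the field as Literal["high", "medium", "low"]. The dependency-free sketch below shows the same check with a plain dataclass so the behavior is visible without Pydantic; Finding and ALLOWED_CONFIDENCE are hypothetical names, not part of the original schema.

```python
from dataclasses import dataclass
from typing import Literal, get_args

Confidence = Literal["high", "medium", "low"]
ALLOWED_CONFIDENCE = get_args(Confidence)  # the three permitted strings


@dataclass(frozen=True)
class Finding:
    """Dependency-free stand-in for ResearchFinding with a checked confidence."""
    title: str
    detail: str
    confidence: str

    def __post_init__(self) -> None:
        # Dataclasses do not enforce annotations, so validate explicitly
        if self.confidence not in ALLOWED_CONFIDENCE:
            raise ValueError(
                f"confidence must be one of {ALLOWED_CONFIDENCE}, got {self.confidence!r}"
            )


def test_finding_accepts_known_confidence():
    assert Finding("Solar growth", "40% YoY", "high").confidence == "high"


def test_finding_rejects_unknown_confidence():
    try:
        Finding("Something", "Details", "definitely")
    except ValueError as exc:
        assert "definitely" in str(exc)
    else:
        raise AssertionError("expected ValueError for unknown confidence")
```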

Testing Async Crew Execution

crew.kickoff_async() lets you run crews concurrently. Testing async execution requires verifying both the output and the concurrency behavior.

# tests/test_async_crew.py
import pytest
import asyncio
import time
from unittest.mock import MagicMock, AsyncMock, patch


@pytest.mark.asyncio
async def test_async_kickoff_completes():
    """Verify async crew execution returns the same result type as sync."""
    mock_llm = MagicMock()
    mock_llm.call.return_value = MagicMock(content="Analysis complete. Key insight: X.")

    with patch("your_app.tools.web_search.WebSearchTool._run", return_value="mock results"):
        from your_app.crew import build_research_crew
        crew = build_research_crew(llm=mock_llm)
        result = await crew.kickoff_async(inputs={"topic": "AI trends"})

    assert result is not None


@pytest.mark.asyncio
async def test_multiple_crews_run_concurrently():
    """Verify that kickoff_async allows crews to run in parallel."""
    mock_llm = MagicMock()
    call_delay = 0.05  # 50ms simulated LLM latency

    def slow_llm_response(*args, **kwargs):
        time.sleep(call_delay)
        return MagicMock(content="Result.")

    mock_llm.call = slow_llm_response

    with patch("your_app.tools.web_search.WebSearchTool._run", return_value="results"):
        from your_app.crew import build_research_crew
        
        crew_1 = build_research_crew(llm=mock_llm)
        crew_2 = build_research_crew(llm=mock_llm)

        start = time.monotonic()
        results = await asyncio.gather(
            crew_1.kickoff_async(inputs={"topic": "Topic A"}),
            crew_2.kickoff_async(inputs={"topic": "Topic B"}),
        )
        elapsed = time.monotonic() - start

    assert len(results) == 2
    assert all(r is not None for r in results)
    # Completion is the main assertion here. `elapsed` is recorded for debugging
    # only: a strict upper bound on it tends to be flaky on shared CI runners.
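
Wall-clock thresholds are a fragile way to prove two runs overlapped. A barrier gives a deterministic signal instead: both workers must be inside the blocking call at the same time, or the wait times out. The sketch below uses asyncio.to_thread as a stand-in for kickoff_async, on the assumption that the crew's blocking work runs in a worker thread; make_blocking_call and run_both are hypothetical names.

```python
import asyncio
import threading


def make_blocking_call(barrier: threading.Barrier):
    """Build a blocking function that only returns once both workers have arrived."""
    def blocking_call(label: str) -> str:
        # Raises threading.BrokenBarrierError if the other worker never shows up,
        # which is what would happen if the two calls ran one after the other.
        barrier.wait(timeout=2.0)
        return f"done:{label}"
    return blocking_call


async def run_both() -> list:
    barrier = threading.Barrier(2)
    call = make_blocking_call(barrier)
    # Each call runs in its own worker thread, like two crews kicked off together
    return await asyncio.gather(
        asyncio.to_thread(call, "a"),
        asyncio.to_thread(call, "b"),
    )


results = asyncio.run(run_both())
assert results == ["done:a", "done:b"]
```

In a crew test, the same barrier could sit inside the mocked LLM call in place of time.sleep.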

Testing Error Handling in Task Delegation

When a tool fails or an agent produces malformed output, you need your crew to fail gracefully rather than silently returning garbage. Testing the error paths is as important as testing the happy path.

# tests/test_error_handling.py
import pytest
from unittest.mock import MagicMock, patch
import httpx


def test_crew_handles_tool_rate_limit_error():
    """Verify the crew surfaces tool errors rather than swallowing them."""
    mock_llm = MagicMock()
    mock_llm.call.return_value = MagicMock(content="Found the information.")

    rate_limit_error = httpx.HTTPStatusError(
        "429 Too Many Requests",
        request=MagicMock(),
        response=MagicMock(status_code=429),
    )

    with patch("your_app.tools.web_search.WebSearchTool._run", side_effect=rate_limit_error):
        from your_app.crew import build_research_crew
        crew = build_research_crew(llm=mock_llm)

        # The crew should either raise or return an error indicator
        # Behavior depends on your max_rpm and error handling configuration
        try:
            result = crew.kickoff(inputs={"topic": "test"})
            # If the crew catches errors and continues, result should still be valid
            assert result is not None
        except httpx.HTTPStatusError:
            # If errors propagate, they should be the original error type
            pass  # This is also acceptable behavior — document which you expect


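def test_task_with_no_agent_is_rejected_by_sequential_crew():
    """Verify an agentless task fails at crew construction, not at runtime."""
    from crewai import Agent, Crew, Task
    import pydantic

    # Assumption: Task accepts agent=None (hierarchical crews assign agents
    # later), so the check happens when a sequential Crew is built.
    task = Task(description="Do something.", expected_output="Something done.", agent=None)
    agent = Agent(role="Helper", goal="Help.", backstory="Helper.", llm=MagicMock())

    with pytest.raises((ValueError, pydantic.ValidationError)):
        Crew(agents=[agent], tasks=[task])  # default process is sequential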


def test_agent_without_required_tool_is_detectable():
    """Verify you can detect a misconfigured agent before running the crew."""
    mock_llm = MagicMock()

    from your_app.crew import build_research_crew
    crew = build_research_crew(llm=mock_llm)

    researcher = crew.agents[0]
    analyst = crew.agents[1]

    # Researcher needs web_search tool
    researcher_tool_names = [t.name for t in researcher.tools]
    assert "web_search" in researcher_tool_names, (
        f"Researcher is missing web_search tool. Has: {researcher_tool_names}"
    )

    # This assertion runs in CI before any LLM calls happen
    # It catches tool assignment bugs at the cheapest possible moment
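
If the intended behavior is for the crew to continue after a tool failure, it helps to make the failure visible in the tool's return value so that neither the agent nor the test mistakes it for real data. The wrapper below is a hypothetical sketch, not a CrewAI API: run_tool_or_report and the TOOL_ERROR marker are illustrative names, and a built-in TimeoutError stands in for a network failure.

```python
from unittest.mock import MagicMock


def run_tool_or_report(tool_fn, **kwargs) -> str:
    """Call a tool function; on failure return a clearly marked error string.

    Hypothetical wrapper: the marker lets tests (and the agent) tell a failed
    call apart from real results.
    """
    try:
        return tool_fn(**kwargs)
    except Exception as exc:  # broad on purpose: any tool failure gets reported
        return f"TOOL_ERROR[{type(exc).__name__}]: {exc}"


def test_failure_is_reported_not_swallowed():
    failing = MagicMock(side_effect=TimeoutError("search timed out"))
    result = run_tool_or_report(failing, query="test")
    assert result == "TOOL_ERROR[TimeoutError]: search timed out"
    failing.assert_called_once_with(query="test")


def test_success_passes_through_unchanged():
    working = MagicMock(return_value="- Title: snippet")
    assert run_tool_or_report(working, query="test") == "- Title: snippet"
```

A crew-level test can then assert that the marker never appears in the final output, which turns a silent failure into a detectable one.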

Individual Task Execution vs crew.kickoff()

For testing specific task logic without running the full crew, you can wrap a single task in its own one-task crew. This is faster and isolates failures to a single agent:

# tests/test_individual_tasks.py
import pytest
from unittest.mock import MagicMock, patch
from crewai import Agent, Task


def test_researcher_task_in_isolation():
    """Test just the research task without running the full crew."""
    mock_llm = MagicMock()
    mock_llm.call.return_value = MagicMock(
        content="1. Solar adoption is accelerating\n2. Wind costs dropped 30%\n3. Grid storage is the bottleneck"
    )

    researcher = Agent(
        role="Senior Researcher",
        goal="Find accurate information.",
        backstory="Expert researcher.",
        llm=mock_llm,
        verbose=False,
    )

    task = Task(
        description="Research renewable energy trends.",
        expected_output="A numbered list of 3 key facts.",
        agent=researcher,
    )

    with patch("your_app.tools.web_search.WebSearchTool._run", return_value="mock search results"):
        # Execute single task via a single-task crew
        from crewai import Crew
        single_task_crew = Crew(agents=[researcher], tasks=[task], verbose=False)
        result = single_task_crew.kickoff()

    output = str(result)
    # Verify the task produced a numbered list
    assert "1." in output
    assert "2." in output
    assert "3." in output
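
Checking for "1." as a substring also matches decimals such as "1.5" inside prose. A slightly stricter sketch extracts the item numbers from lines that actually start with a number, a period, and a space. The helper name numbered_items is hypothetical.

```python
import re


def numbered_items(text: str) -> list[int]:
    """Return the item numbers of lines that start with 'N.' followed by text."""
    pattern = r"^[ \t]*(\d+)\.[ \t]+\S"
    return [int(m.group(1)) for m in re.finditer(pattern, text, flags=re.MULTILINE)]


output = (
    "1. Solar adoption is accelerating\n"
    "2. Wind costs dropped 30%\n"
    "3. Grid storage is the bottleneck"
)

assert numbered_items(output) == [1, 2, 3]
# Decimals inside prose are not mistaken for list items
assert numbered_items("Costs fell 1.5x and 2.3% overall") == []
```

Comparing against [1, 2, 3] checks both the count and the ordering in one assertion.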

A Complete Test File Structure for CrewAI Projects

tests/
  unit/
    test_web_search_tool.py     # tool._run() tests — no LLM, no network
    test_database_tool.py       # same pattern for other tools
    test_output_schemas.py      # Pydantic schema validation
  integration/
    test_task_context_wiring.py # crew structure tests — no LLM calls
    test_crew_kickoff.py        # end-to-end with fully mocked LLM
    test_async_execution.py     # kickoff_async behavior
  error_handling/
    test_tool_failures.py       # rate limits, auth errors, timeouts
    test_misconfiguration.py    # missing agents, missing tools, bad schemas

Run unit/ on every commit — they're fast and catch tool bugs before they become crew bugs. Run integration/ in CI pull request checks. Reserve actual LLM calls for manual smoke tests before production deployments.
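
The split is easier to enforce when each directory maps to a pytest marker, so a commit hook can run something like pytest -m unit. The conftest.py sketch below tags tests by directory. It assumes a pytest version where collected items expose a pathlib-style path attribute (pytest 7 or later), and the marker names are this example's own choice; they would also need registering under the markers option in the pytest configuration.

```python
# tests/conftest.py
from pathlib import Path
from typing import Optional

# Directory name -> marker name; mirrors the layout shown above
DIR_MARKERS = {
    "unit": "unit",
    "integration": "integration",
    "error_handling": "error_handling",
}


def marker_for_path(path: Path) -> Optional[str]:
    """Return the marker for the first path component that names a known test directory."""
    for part in path.parts:
        if part in DIR_MARKERS:
            return DIR_MARKERS[part]
    return None


def pytest_collection_modifyitems(config, items):
    """pytest hook: tag each collected test with a marker based on its directory."""
    for item in items:
        marker = marker_for_path(Path(str(item.path)))
        if marker is not None:
            item.add_marker(marker)
```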

The Gap Unit Tests Can't Close

Mocked LLMs and controlled tool responses let you build a fast, deterministic test suite. But they can't test the quality of agent reasoning, whether an updated system prompt still produces structured output, or whether the LLM actually uses the tool when it should. Those are behavioral tests that need real LLM calls.

Running full crew behavioral tests on every commit is too slow and too expensive. But running them on a schedule — daily, or triggered by system prompt changes — catches the drift before users do.

HelpMeTest is built for exactly this layer: health check runs that execute your CrewAI crew against real LLM endpoints on a schedule, with assertions on output structure, tool call counts, and execution time. When an LLM provider update breaks your crew's structured output, you find out in the monitoring dashboard instead of from a support ticket. Plans start at $100/month.

The unit tests in this post give you a fast foundation. The monitoring fills the gap the unit tests leave.
