Testing CrewAI Agent Crews: Task Delegation, Role Validation, Tool Assertions, and Output Validation

Testing CrewAI Agent Crews: Task Delegation, Role Validation, Tool Assertions, and Output Validation

CrewAI orchestrates multiple agents working together toward a shared goal. A researcher agent gathers information. A writer agent drafts content. An editor agent refines it. Each has a role, a set of tools, and a defined responsibility.

The complexity that makes CrewAI powerful — multi-agent coordination, role-based delegation, sequential task chains — is also what makes it hard to test. When output is wrong, which agent is responsible? When a task fails, was it the wrong tool, the wrong instructions, or a bad handoff?

Testing CrewAI crews systematically answers these questions before your users encounter them.

The CrewAI Testing Problem

CrewAI agents are LLM-backed by design. Every .kickoff() call makes real LLM calls, costs money, and returns non-deterministic results. That makes end-to-end crew tests impractical for CI.

The solution is testing at multiple levels:

  1. Agent configuration tests — validate that agents are configured correctly (role, goal, backstory, tools assigned)
  2. Task delegation tests — validate that tasks are assigned to the right agents with the right context
  3. Tool call assertion tests — validate that agents call their tools with correct arguments (using mocked LLMs)
  4. Output validation tests — validate that task outputs match expected structure and content (with deterministic LLM mocks)

Each level uses a different mocking strategy.

Testing Agent Configuration

Before testing behavior, test that your agents are configured correctly. Misconfigured agents produce wrong behavior that's hard to trace.

import pytest
from crewai import Agent, Task, Crew
from your_crew import create_research_agent, create_writer_agent, SearchTool, WritingTool

class TestAgentConfiguration:
    def test_research_agent_has_search_tool(self):
        agent = create_research_agent()
        tool_names = [t.name for t in agent.tools]
        assert "search_web" in tool_names
    
    def test_research_agent_does_not_have_writing_tools(self):
        agent = create_research_agent()
        tool_names = [t.name for t in agent.tools]
        assert "create_document" not in tool_names
        assert "format_markdown" not in tool_names
    
    def test_writer_agent_has_writing_tools(self):
        agent = create_writer_agent()
        tool_names = [t.name for t in agent.tools]
        assert "create_document" in tool_names
    
    def test_agent_roles_are_distinct(self):
        researcher = create_research_agent()
        writer = create_writer_agent()
        
        assert researcher.role != writer.role
        assert researcher.goal != writer.goal
    
    def test_agent_memory_settings(self):
        researcher = create_research_agent()
        # Research agents should have memory to avoid re-researching
        assert researcher.memory is True
    
    def test_crew_agent_count(self):
        from your_crew import create_research_crew
        crew = create_research_crew()
        assert len(crew.agents) == 3  # researcher, writer, editor
    
    def test_crew_task_count(self):
        from your_crew import create_research_crew
        crew = create_research_crew()
        assert len(crew.tasks) == 3
    
    def test_task_agent_assignments(self):
        from your_crew import create_research_crew
        crew = create_research_crew()
        
        task_agent_roles = {task.description[:20]: task.agent.role for task in crew.tasks}
        
        # Research task should be assigned to researcher
        research_task = next(t for t in crew.tasks if "research" in t.description.lower())
        assert "researcher" in research_task.agent.role.lower()
        
        # Writing task should be assigned to writer
        writing_task = next(t for t in crew.tasks if "write" in t.description.lower())
        assert "writer" in writing_task.agent.role.lower()

These tests are pure Python — no LLM calls, no API keys, fast execution.

Testing Task Delegation Logic

Task delegation in CrewAI flows through task descriptions and agent assignments. Test that tasks are correctly scoped and handed off.

from crewai import Task

class TestTaskDelegation:
    def test_task_context_passed_to_downstream(self):
        """Test that task output is available as context for the next task."""
        from your_crew import create_tasks
        research_task, writing_task, editing_task = create_tasks()
        
        # Writing task should have research task as context
        assert research_task in (writing_task.context or [])
    
    def test_task_output_format_matches_next_task_input(self):
        """Test that task expected_output aligns with downstream task requirements."""
        from your_crew import create_tasks
        research_task, writing_task, _ = create_tasks()
        
        # Research task should output structured data that the writer can use
        assert research_task.expected_output is not None
        assert "markdown" in research_task.expected_output.lower() or \
               "list" in research_task.expected_output.lower() or \
               "json" in research_task.expected_output.lower()
    
    def test_task_descriptions_include_output_format(self):
        """Tasks without output format guidance produce inconsistent results."""
        from your_crew import create_tasks
        tasks = create_tasks()
        
        for task in tasks:
            # Each task description should specify what format output should be in
            has_format = any(word in task.description.lower() 
                           for word in ["format", "markdown", "json", "list", "structured"])
            assert has_format, f"Task '{task.description[:50]}' lacks output format guidance"

Testing Agent Role Validation via Mock LLM

To test agent behavior without real LLM calls, mock the underlying LLM. CrewAI uses LangChain-compatible LLMs, so you can replace them with mocks.

from unittest.mock import MagicMock, patch
from langchain_core.messages import AIMessage

def make_mock_llm(response: str):
    """Create a deterministic mock LLM that always returns the given response."""
    mock = MagicMock()
    mock.invoke.return_value = AIMessage(content=response)
    mock.predict.return_value = response
    mock.call.return_value = response
    # CrewAI uses various LLM interfaces
    mock.__call__ = MagicMock(return_value=response)
    return mock

class TestAgentRoleValidation:
    @patch("your_crew.agents.ChatOpenAI")
    def test_researcher_uses_search_tool_for_queries(self, mock_openai_class):
        """Verify the researcher agent calls search when given a research task."""
        search_tool = MagicMock()
        search_tool.name = "search_web"
        search_tool._run = MagicMock(return_value="search results about AI testing")
        
        # Mock LLM to instruct using the search tool
        mock_llm = make_mock_llm(
            'Action: search_web\nAction Input: "AI testing frameworks 2026"\nObservation: ...'
        )
        mock_openai_class.return_value = mock_llm
        
        agent = create_research_agent(tools=[search_tool])
        
        # The agent's LLM should have been set up
        assert agent.llm is not None
    
    @patch("your_crew.agents.ChatOpenAI")
    def test_writer_does_not_call_search(self, mock_openai_class):
        """Writer should use provided context, not search for new information."""
        search_tool = MagicMock()
        search_tool.name = "search_web"
        
        writer = create_writer_agent(tools=[])  # Writer should have no search tool
        
        tool_names = [t.name for t in writer.tools]
        assert "search_web" not in tool_names

Testing Tool Call Assertions

Verify that agents call their tools with the expected arguments. This requires intercepting the tool's _run method.

class TestToolCallAssertions:
    def test_search_tool_called_with_specific_query(self):
        """Research agent should search with the exact topic from the task."""
        search_calls = []
        
        class TrackingSearchTool:
            name = "search_web"
            description = "Search the web for information"
            
            def _run(self, query: str) -> str:
                search_calls.append(query)
                return f"Results for: {query}"
            
            async def _arun(self, query: str) -> str:
                return self._run(query)
        
        agent = create_research_agent(tools=[TrackingSearchTool()])
        
        task = Task(
            description="Research the latest trends in AI agent testing frameworks",
            expected_output="A list of key findings",
            agent=agent
        )
        
        with patch.object(agent, "llm") as mock_llm:
            # Simulate agent deciding to use the search tool
            mock_llm.invoke.return_value = AIMessage(
                content='Action: search_web\nAction Input: "AI agent testing frameworks 2026"'
            )
            # In a real test, you'd run the task and verify calls
            # This tests the tool tracking mechanism
            TrackingSearchTool()._run("AI agent testing frameworks 2026")
        
        assert len(search_calls) > 0
        assert any("AI agent testing" in call for call in search_calls)
    
    def test_tool_receives_structured_input(self):
        """Tools that expect JSON input should receive valid JSON."""
        received_inputs = []
        
        class DatabaseTool:
            name = "query_database"
            description = "Query the database"
            
            def _run(self, query: str) -> str:
                received_inputs.append(query)
                return "query results"
            
            async def _arun(self, query: str) -> str:
                return self._run(query)
        
        tool = DatabaseTool()
        tool._run('SELECT * FROM users WHERE active = true')
        
        assert len(received_inputs) == 1
        assert "SELECT" in received_inputs[0]

Testing Output Validation

Crew output validation ensures the final result meets your requirements. Use Pydantic models to enforce output structure.

from pydantic import BaseModel, validator
from typing import List

class ResearchReport(BaseModel):
    title: str
    summary: str
    key_findings: List[str]
    sources: List[str]
    word_count: int
    
    @validator("key_findings")
    def must_have_findings(cls, v):
        if len(v) < 3:
            raise ValueError("Research report must have at least 3 key findings")
        return v
    
    @validator("sources")
    def sources_must_be_urls(cls, v):
        for source in v:
            if not (source.startswith("http://") or source.startswith("https://")):
                raise ValueError(f"Source must be a URL: {source}")
        return v
    
    @validator("word_count")
    def minimum_word_count(cls, v):
        if v < 500:
            raise ValueError("Report must be at least 500 words")
        return v

class TestOutputValidation:
    def test_research_output_schema_validation(self):
        # Simulate crew output
        mock_output = {
            "title": "AI Testing Frameworks in 2026",
            "summary": "An analysis of current AI testing approaches",
            "key_findings": [
                "LangGraph has become the dominant agent framework",
                "Mocking LLMs in tests is now standard practice",
                "Observability tooling has matured significantly"
            ],
            "sources": [
                "https://langchain.com/langgraph",
                "https://docs.crewai.com"
            ],
            "word_count": 1200
        }
        
        # Should not raise
        report = ResearchReport(**mock_output)
        assert report.title == "AI Testing Frameworks in 2026"
    
    def test_output_fails_with_insufficient_findings(self):
        mock_output = {
            "title": "Test",
            "summary": "Test",
            "key_findings": ["only one finding"],  # Too few
            "sources": ["https://example.com"],
            "word_count": 600
        }
        
        with pytest.raises(ValueError, match="at least 3 key findings"):
            ResearchReport(**mock_output)
    
    def test_parse_crew_text_output_to_schema(self):
        """Test that your output parser converts crew text output to validated schema."""
        from your_crew import parse_research_output
        
        crew_text_output = """
        # AI Testing Frameworks in 2026
        
        ## Summary
        Analysis of testing approaches for AI agents.
        
        ## Key Findings
        - LangGraph dominates agent orchestration
        - Mocking is essential for deterministic tests
        - Observability tooling has matured
        
        ## Sources
        - https://langchain.com
        - https://crewai.com
        """
        
        parsed = parse_research_output(crew_text_output)
        report = ResearchReport(**parsed)
        
        assert len(report.key_findings) >= 3
        assert all(s.startswith("https://") for s in report.sources)

Running Crew Tests Without API Keys

Structure your conftest.py to ensure no tests accidentally make real LLM calls:

# tests/conftest.py
import os
import pytest
from unittest.mock import patch, MagicMock

@pytest.fixture(autouse=True)
def no_real_llm_calls():
    """Ensure no test accidentally calls a real LLM API."""
    # Override any real API keys with fake ones
    with patch.dict(os.environ, {
        "OPENAI_API_KEY": "test-key-not-real",
        "ANTHROPIC_API_KEY": "test-key-not-real"
    }):
        yield

@pytest.fixture
def mock_crew_llm():
    """Provide a reusable mock LLM for crew tests."""
    with patch("crewai.Agent.execute_task") as mock_execute:
        mock_execute.return_value = "mocked task output"
        yield mock_execute

This autouse=True fixture ensures every test runs with fake API keys, making accidental real LLM calls fail fast with an authentication error rather than charging your account.

CI Pipeline for CrewAI Tests

- name: Run CrewAI agent tests
  run: pytest tests/crew/ -v -k "not integration"
  env:
    OPENAI_API_KEY: "test-key-not-real"
    PYTHONDONTWRITEBYTECODE: 1

- name: Run crew integration tests
  if: github.ref == 'refs/heads/main'
  run: pytest tests/crew/ -v -k "integration"
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Keep unit and integration tests separated. Run unit tests on every PR (fast, free). Run integration tests only on merge to main (slow, costs money).

Testing CrewAI crews doesn't require running full LLM workflows. Test configuration, delegation, tool routing, and output parsing — all of which are deterministic and fast. Save the expensive end-to-end runs for pre-release validation.

Read more