Testing CrewAI Agent Crews: Task Delegation, Role Validation, Tool Assertions, and Output Validation
CrewAI orchestrates multiple agents working together toward a shared goal. A researcher agent gathers information. A writer agent drafts content. An editor agent refines it. Each has a role, a set of tools, and a defined responsibility.
The complexity that makes CrewAI powerful — multi-agent coordination, role-based delegation, sequential task chains — is also what makes it hard to test. When output is wrong, which agent is responsible? When a task fails, was it the wrong tool, the wrong instructions, or a bad handoff?
Testing CrewAI crews systematically answers these questions before your users encounter them.
The CrewAI Testing Problem
CrewAI agents are LLM-backed by design. Every .kickoff() call makes real LLM calls, costs money, and returns non-deterministic results. That makes end-to-end crew tests impractical for CI.
The solution is testing at multiple levels:
- Agent configuration tests — validate that agents are configured correctly (role, goal, backstory, tools assigned)
- Task delegation tests — validate that tasks are assigned to the right agents with the right context
- Tool call assertion tests — validate that agents call their tools with correct arguments (using mocked LLMs)
- Output validation tests — validate that task outputs match expected structure and content (with deterministic LLM mocks)
Each level uses a different mocking strategy.
Testing Agent Configuration
Before testing behavior, test that your agents are configured correctly. Misconfigured agents produce wrong behavior that's hard to trace.

    import pytest
    from your_crew import (
        create_research_agent,
        create_research_crew,
        create_writer_agent,
    )


    class TestAgentConfiguration:
        def test_research_agent_has_search_tool(self):
            agent = create_research_agent()
            tool_names = [t.name for t in agent.tools]
            assert "search_web" in tool_names

        def test_research_agent_does_not_have_writing_tools(self):
            agent = create_research_agent()
            tool_names = [t.name for t in agent.tools]
            assert "create_document" not in tool_names
            assert "format_markdown" not in tool_names

        def test_writer_agent_has_writing_tools(self):
            agent = create_writer_agent()
            tool_names = [t.name for t in agent.tools]
            assert "create_document" in tool_names

        def test_agent_roles_are_distinct(self):
            researcher = create_research_agent()
            writer = create_writer_agent()
            assert researcher.role != writer.role
            assert researcher.goal != writer.goal

        def test_agent_memory_settings(self):
            researcher = create_research_agent()
            # Research agents should have memory to avoid re-researching
            assert researcher.memory is True

        def test_crew_agent_count(self):
            crew = create_research_crew()
            assert len(crew.agents) == 3  # researcher, writer, editor

        def test_crew_task_count(self):
            crew = create_research_crew()
            assert len(crew.tasks) == 3

        def test_task_agent_assignments(self):
            crew = create_research_crew()
            # Research task should be assigned to researcher
            research_task = next(t for t in crew.tasks if "research" in t.description.lower())
            assert "researcher" in research_task.agent.role.lower()
            # Writing task should be assigned to writer
            writing_task = next(t for t in crew.tasks if "write" in t.description.lower())
            assert "writer" in writing_task.agent.role.lower()

These tests are pure Python: no LLM calls, no API keys, fast execution.
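The per-tool assertions above repeat one pattern, which can be folded into a single table-driven check. The sketch below is a minimal illustration: `check_tool_policy` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for real agents, assuming only that an agent exposes `.role` and a `.tools` list whose items have a `.name`.

```python
from types import SimpleNamespace


def check_tool_policy(agent, required=(), forbidden=()):
    """Return a list of tool-policy violations for one agent (empty list means OK)."""
    names = {tool.name for tool in agent.tools}
    problems = []
    for name in sorted(set(required) - names):
        problems.append(f"{agent.role}: missing required tool {name!r}")
    for name in sorted(set(forbidden) & names):
        problems.append(f"{agent.role}: has forbidden tool {name!r}")
    return problems


# Stand-ins for real agents (hypothetical roles and tool names).
researcher = SimpleNamespace(
    role="Senior Researcher",
    tools=[SimpleNamespace(name="search_web")],
)
writer = SimpleNamespace(
    role="Content Writer",
    tools=[SimpleNamespace(name="create_document"), SimpleNamespace(name="search_web")],
)

researcher_problems = check_tool_policy(
    researcher,
    required={"search_web"},
    forbidden={"create_document", "format_markdown"},
)
writer_problems = check_tool_policy(
    writer,
    required={"create_document"},
    forbidden={"search_web"},
)
```

Returning a list instead of asserting inside the helper lets one test report every violation at once, for example with `assert not problems, "\n".join(problems)`.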
Testing Task Delegation Logic
Task delegation in CrewAI flows through task descriptions and agent assignments. Test that tasks are correctly scoped and handed off.
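Hand-offs can also be checked for the whole chain in one pass. The sketch below assumes tasks run in list order (a sequential process) and that each task exposes `.agent` and an optional `.context` list, as the tasks in this article do. `find_handoff_problems` is a hypothetical helper, and the `SimpleNamespace` objects stand in for real tasks.

```python
from types import SimpleNamespace


def find_handoff_problems(tasks):
    """Check an ordered task list and return readable problems (empty list means OK)."""
    problems = []
    seen = []  # tasks that run before the current one
    for index, task in enumerate(tasks):
        if task.agent is None:
            problems.append(f"task {index} has no agent assigned")
        for upstream in (task.context or []):
            # Identity comparison: the context must be one of the earlier task objects.
            if not any(upstream is earlier for earlier in seen):
                problems.append(
                    f"task {index} takes context from a task that does not run before it"
                )
        seen.append(task)
    return problems


# Stand-ins for real tasks (hypothetical descriptions and agents).
research = SimpleNamespace(description="Research the topic", agent="researcher", context=None)
writing = SimpleNamespace(description="Write the draft", agent="writer", context=[research])
editing = SimpleNamespace(description="Edit the draft", agent="editor", context=[writing])

ordered_problems = find_handoff_problems([research, writing, editing])
misordered_problems = find_handoff_problems([writing, research, editing])
```

A chain-level check like this catches reordered or orphaned tasks, while per-task tests pin down the specific hand-offs you care about.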

    from your_crew import create_tasks


    class TestTaskDelegation:
        def test_task_context_passed_to_downstream(self):
            """Test that task output is available as context for the next task."""
            research_task, writing_task, editing_task = create_tasks()
            # Writing task should have research task as context
            assert research_task in (writing_task.context or [])

        def test_task_output_format_matches_next_task_input(self):
            """Test that task expected_output aligns with downstream task requirements."""
            research_task, writing_task, _ = create_tasks()
            # Research task should output structured data that the writer can use
            assert research_task.expected_output is not None
            expected = research_task.expected_output.lower()
            assert any(word in expected for word in ["markdown", "list", "json"])

        def test_task_descriptions_include_output_format(self):
            """Tasks without output format guidance produce inconsistent results."""
            tasks = create_tasks()
            for task in tasks:
                # Each task description should specify what format output should be in
                has_format = any(
                    word in task.description.lower()
                    for word in ["format", "markdown", "json", "list", "structured"]
                )
                assert has_format, f"Task '{task.description[:50]}' lacks output format guidance"

Testing Agent Role Validation via Mock LLM
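To test agent behavior without real LLM calls, mock the underlying LLM. Older CrewAI releases accept LangChain-compatible chat models, while newer ones wrap providers in CrewAI's own LLM class, so patch whichever class your agent factory actually constructs.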
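One way to do this is a small hand-written stand-in that replays scripted replies and records every prompt it receives. The sketch below is an illustration only: `ScriptedLLM` is a hypothetical class, and the `call(messages, **kwargs)` method name is an assumption, so match it to whatever interface your CrewAI version invokes on its LLM object.

```python
class ScriptedLLM:
    """Deterministic stand-in that replays canned replies in order."""

    def __init__(self, replies):
        self._replies = list(replies)
        self.prompts = []  # every prompt the code under test sent

    def call(self, messages, **kwargs):
        # Assumed interface: one positional prompt/messages argument.
        self.prompts.append(messages)
        if not self._replies:
            raise AssertionError(
                "ScriptedLLM ran out of replies: more LLM calls were made than the test scripted"
            )
        return self._replies.pop(0)


llm = ScriptedLLM([
    'Action: search_web\nAction Input: "AI testing"',
    "Final Answer: three findings",
])

first = llm.call("Research AI testing")
second = llm.call("Observation: search results")
```

Failing loudly when the script runs out turns an unexpected extra LLM call into a test failure instead of a silent default. A `MagicMock`-based mock is more compact when a single canned reply is enough.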

    from unittest.mock import MagicMock, patch
    from langchain_core.messages import AIMessage
    from your_crew import create_research_agent, create_writer_agent


    def make_mock_llm(response: str):
        """Create a deterministic mock LLM that always returns the given response."""
        mock = MagicMock()
        # Cover the call styles different LLM wrappers use
        mock.invoke.return_value = AIMessage(content=response)
        mock.predict.return_value = response
        mock.call.return_value = response
        # Direct calls: set return_value (assigning __call__ on a mock instance has no effect)
        mock.return_value = response
        return mock


    class TestAgentRoleValidation:
        @patch("your_crew.agents.ChatOpenAI")
        def test_researcher_uses_search_tool_for_queries(self, mock_openai_class):
            """Verify the researcher is wired to the mocked LLM and the search tool."""
            search_tool = MagicMock()
            search_tool.name = "search_web"
            search_tool._run = MagicMock(return_value="search results about AI testing")
            # Mock LLM reply that instructs the agent to use the search tool
            mock_llm = make_mock_llm(
                'Action: search_web\nAction Input: "AI testing frameworks 2026"\nObservation: ...'
            )
            mock_openai_class.return_value = mock_llm
            agent = create_research_agent(tools=[search_tool])
            # This checks wiring only; it does not execute a task
            assert agent.llm is not None
            assert "search_web" in [t.name for t in agent.tools]

        @patch("your_crew.agents.ChatOpenAI")
        def test_writer_does_not_call_search(self, mock_openai_class):
            """Writer should use provided context, not search for new information."""
            writer = create_writer_agent(tools=[])  # Writer should have no search tool
            tool_names = [t.name for t in writer.tools]
            assert "search_web" not in tool_names

Testing Tool Call Assertions
Verify that agents call their tools with the expected arguments. This requires intercepting the tool's _run method.
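When the tool is a mock, `unittest.mock` can do the interception and the argument check. The sketch below routes a ReAct-style `Action:` / `Action Input:` reply, the format used in this article's mocked responses, to a mocked tool. `parse_action` and `dispatch_action` are hypothetical helpers that stand in for the agent's executor; they are not CrewAI's own parser.

```python
import re
from unittest.mock import MagicMock

# Matches 'Action: <tool>' followed by 'Action Input: "<argument>"' on the next line.
ACTION_RE = re.compile(
    r'Action:\s*(?P<tool>\S+)\s*\nAction Input:\s*"?(?P<arg>[^"\n]*)"?'
)


def parse_action(llm_text):
    """Return (tool_name, tool_input), or None if the reply requests no tool."""
    match = ACTION_RE.search(llm_text)
    if match is None:
        return None
    return match.group("tool"), match.group("arg").strip()


def dispatch_action(llm_text, tools):
    """Run the tool named in the reply and return its observation."""
    parsed = parse_action(llm_text)
    if parsed is None:
        return None
    tool_name, tool_input = parsed
    tool = tools[tool_name]  # KeyError if the reply names an unknown tool
    return tool._run(tool_input)


search_tool = MagicMock()
search_tool.name = "search_web"
search_tool._run.return_value = "search results about AI testing"

reply = 'Action: search_web\nAction Input: "AI agent testing frameworks 2026"'
observation = dispatch_action(reply, {search_tool.name: search_tool})

# The tool must have been called exactly once with the query from the reply.
search_tool._run.assert_called_once_with("AI agent testing frameworks 2026")
```

`assert_called_once_with` fails on both a wrong argument and a repeated call, which a simple "was it called" check would miss.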

    from unittest.mock import patch
    from crewai import Task
    from langchain_core.messages import AIMessage
    from your_crew import create_research_agent


    class TestToolCallAssertions:
        def test_search_tool_called_with_specific_query(self):
            """Research agent should search with the exact topic from the task."""
            search_calls = []

            # Plain class for brevity; a real agent may require a CrewAI tool subclass
            class TrackingSearchTool:
                name = "search_web"
                description = "Search the web for information"

                def _run(self, query: str) -> str:
                    search_calls.append(query)
                    return f"Results for: {query}"

                async def _arun(self, query: str) -> str:
                    return self._run(query)

            tool = TrackingSearchTool()
            agent = create_research_agent(tools=[tool])
            task = Task(
                description="Research the latest trends in AI agent testing frameworks",
                expected_output="A list of key findings",
                agent=agent,
            )
            with patch.object(agent, "llm") as mock_llm:
                # Simulate agent deciding to use the search tool
                mock_llm.invoke.return_value = AIMessage(
                    content='Action: search_web\nAction Input: "AI agent testing frameworks 2026"'
                )
                # In a real test, you'd run the task and verify calls.
                # Calling the tool directly only exercises the tracking mechanism.
                tool._run("AI agent testing frameworks 2026")
            assert len(search_calls) > 0
            assert any("AI agent testing" in call for call in search_calls)

        def test_tool_receives_structured_input(self):
            """The tool should receive the query string exactly as it was passed."""
            received_inputs = []

            class DatabaseTool:
                name = "query_database"
                description = "Query the database"

                def _run(self, query: str) -> str:
                    received_inputs.append(query)
                    return "query results"

                async def _arun(self, query: str) -> str:
                    return self._run(query)

            tool = DatabaseTool()
            tool._run("SELECT * FROM users WHERE active = true")
            assert len(received_inputs) == 1
            assert "SELECT" in received_inputs[0]

Testing Output Validation
Crew output validation ensures the final result meets your requirements. Use Pydantic models to enforce output structure.
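Crew output typically arrives as free text, so a schema can only be enforced after a parsing step. The sketch below shows one possible parser, assuming the crew emits markdown with a `# Title` line and `## Summary`, `## Key Findings`, and `## Sources` sections. The name `parse_research_output` mirrors the helper this article's tests import from `your_crew`, whose real implementation is not shown; the heading names and the word-count rule are assumptions.

```python
import re


def parse_research_output(text):
    """Split markdown-style crew output into the fields a report schema expects."""
    title = ""
    sections = {}  # lowercased section heading -> list of non-empty lines
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("## "):
            current = line[3:].strip().lower()
            sections[current] = []
        elif line.startswith("# "):
            title = line[2:].strip()
        elif line and current is not None:
            sections[current].append(line)

    def bullets(name):
        # Drop the leading "- " or "* " marker from each list item.
        return [re.sub(r"^[-*]\s+", "", item) for item in sections.get(name, [])]

    return {
        "title": title,
        "summary": " ".join(sections.get("summary", [])),
        "key_findings": bullets("key findings"),
        "sources": bullets("sources"),
        # Assumption: word count is measured over the raw text, markup included.
        "word_count": len(text.split()),
    }


sample = """
# AI Testing Frameworks in 2026
## Summary
Analysis of testing approaches for AI agents.
## Key Findings
- LangGraph dominates agent orchestration
- Mocking is essential for deterministic tests
## Sources
- https://langchain.com
"""
parsed = parse_research_output(sample)
```

Because this sketch counts words in the raw text, a short fixture reports a small `word_count`; a schema with a minimum length needs a longer sample or an explicit override in tests.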

    import pytest
    from pydantic import BaseModel, field_validator  # Pydantic v2; on v1 use @validator
    from typing import List


    class ResearchReport(BaseModel):
        title: str
        summary: str
        key_findings: List[str]
        sources: List[str]
        word_count: int

        @field_validator("key_findings")
        @classmethod
        def must_have_findings(cls, v):
            if len(v) < 3:
                raise ValueError("Research report must have at least 3 key findings")
            return v

        @field_validator("sources")
        @classmethod
        def sources_must_be_urls(cls, v):
            for source in v:
                if not source.startswith(("http://", "https://")):
                    raise ValueError(f"Source must be a URL: {source}")
            return v

        @field_validator("word_count")
        @classmethod
        def minimum_word_count(cls, v):
            if v < 500:
                raise ValueError("Report must be at least 500 words")
            return v


    class TestOutputValidation:
        def test_research_output_schema_validation(self):
            # Simulate crew output
            mock_output = {
                "title": "AI Testing Frameworks in 2026",
                "summary": "An analysis of current AI testing approaches",
                "key_findings": [
                    "LangGraph has become the dominant agent framework",
                    "Mocking LLMs in tests is now standard practice",
                    "Observability tooling has matured significantly",
                ],
                "sources": [
                    "https://langchain.com/langgraph",
                    "https://docs.crewai.com",
                ],
                "word_count": 1200,
            }
            # Should not raise
            report = ResearchReport(**mock_output)
            assert report.title == "AI Testing Frameworks in 2026"

        def test_output_fails_with_insufficient_findings(self):
            mock_output = {
                "title": "Test",
                "summary": "Test",
                "key_findings": ["only one finding"],  # Too few
                "sources": ["https://example.com"],
                "word_count": 600,
            }
            # Pydantic's ValidationError is a ValueError subclass
            with pytest.raises(ValueError, match="at least 3 key findings"):
                ResearchReport(**mock_output)

        def test_parse_crew_text_output_to_schema(self):
            """Test that your output parser converts crew text output to validated schema."""
            import textwrap
            from your_crew import parse_research_output

            crew_text_output = textwrap.dedent("""
                # AI Testing Frameworks in 2026
                ## Summary
                Analysis of testing approaches for AI agents.
                ## Key Findings
                - LangGraph dominates agent orchestration
                - Mocking is essential for deterministic tests
                - Observability tooling has matured
                ## Sources
                - https://langchain.com
                - https://crewai.com
            """)
            parsed = parse_research_output(crew_text_output)
            # The sample is far shorter than the 500-word minimum, so override
            # word_count: this test targets section parsing, not length.
            report = ResearchReport(**{**parsed, "word_count": 600})
            assert len(report.key_findings) >= 3
            assert all(s.startswith("https://") for s in report.sources)

Running Crew Tests Without API Keys
Structure your conftest.py to ensure no tests accidentally make real LLM calls:

    # tests/conftest.py
    import os
    import pytest
    from unittest.mock import patch


    @pytest.fixture(autouse=True)
    def no_real_llm_calls():
        """Ensure no test accidentally calls a real LLM API."""
        # Override any real API keys with fake ones
        with patch.dict(os.environ, {
            "OPENAI_API_KEY": "test-key-not-real",
            "ANTHROPIC_API_KEY": "test-key-not-real",
        }):
            yield


    @pytest.fixture
    def mock_crew_llm():
        """Provide a reusable mock LLM for crew tests."""
        with patch("crewai.Agent.execute_task") as mock_execute:
            mock_execute.return_value = "mocked task output"
            yield mock_execute

The autouse=True fixture runs for every test and substitutes fake API keys, so an accidental real LLM call fails fast with an authentication error instead of charging your account.
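Fake keys only help once the provider rejects the request, which still means traffic leaves the machine. A stricter guard, sketched below with the standard library only, refuses outbound connections outright. `block_network` and `NetworkBlockedError` are hypothetical names, and the guard covers only `socket.socket.connect`; `connect_ex` and lower-level paths are not intercepted.

```python
import socket
from contextlib import contextmanager
from unittest.mock import patch


class NetworkBlockedError(RuntimeError):
    """Raised when code under test tries to open a network connection."""


@contextmanager
def block_network():
    """Refuse socket.socket.connect() calls while the context is active."""

    def _refuse(self, address, *args, **kwargs):
        raise NetworkBlockedError(f"blocked outbound connection to {address!r}")

    with patch.object(socket.socket, "connect", _refuse):
        yield


# Demonstration: the attempt is refused before any traffic is sent.
with block_network():
    probe = socket.socket()
    try:
        probe.connect(("127.0.0.1", 9))
        blocked = False
    except NetworkBlockedError:
        blocked = True
    finally:
        probe.close()
```

In `conftest.py`, the same context manager can be wrapped in an autouse fixture whose body is `with block_network(): yield`, so a stray call fails locally with a clear message.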
CI Pipeline for CrewAI Tests

    - name: Run CrewAI agent tests
      run: pytest tests/crew/ -v -k "not integration"
      env:
        OPENAI_API_KEY: "test-key-not-real"
        PYTHONDONTWRITEBYTECODE: 1
    - name: Run crew integration tests
      if: github.ref == 'refs/heads/main'
      run: pytest tests/crew/ -v -k "integration"
      env:
        OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Keep unit and integration tests separated. Run unit tests on every PR (fast, free). Run integration tests only on merge to main (slow, costs money).
Testing CrewAI crews doesn't require running full LLM workflows. Test configuration, delegation, tool routing, and output parsing — all of which are deterministic and fast. Save the expensive end-to-end runs for pre-release validation.