Mocking External APIs in Agentic Systems
AI agents call external APIs — search engines, databases, Slack, GitHub, payment processors. Mocking these in tests is essential: real APIs are slow, rate-limited, non-deterministic, and sometimes irreversible. This guide covers four mocking strategies for agentic systems: HTTP interception, tool-level mocking, record/replay, and in-process fakes.
An AI agent that manages your GitHub issues calls the GitHub API. An agent that schedules meetings calls Google Calendar. A customer support agent hits your CRM, your ticket system, and your billing API — all in a single workflow.
Testing these agents against real APIs has serious problems:
- Rate limits kill CI pipelines
- API calls cost money
- Tests that delete records or send emails have irreversible effects
- Flaky external services make tests non-deterministic
- Real APIs are slow (network latency × number of tool calls × number of tests = hours)
Mocking solves all of this. But agentic systems require mocking at multiple layers, and choosing the right layer matters.
The Four Mocking Layers
Layer 1: HTTP Interception
The lowest-level approach intercepts HTTP requests before they leave the process. Tools like responses (Python), nock (Node.js), or httpretty intercept calls to any HTTP client.
```python
# Python: mock all GitHub API calls
import responses

from agent.tools import create_github_issue

@responses.activate
def test_create_issue_tool():
    responses.post(
        "https://api.github.com/repos/myorg/myrepo/issues",
        json={
            "id": 1,
            "number": 42,
            "title": "Bug: login fails on mobile",
            "html_url": "https://github.com/myorg/myrepo/issues/42"
        },
        status=201
    )

    result = create_github_issue(
        owner="myorg",
        repo="myrepo",
        title="Bug: login fails on mobile",
        body="Steps to reproduce..."
    )

    assert result["number"] == 42
    assert result["html_url"].endswith("/42")
    assert len(responses.calls) == 1  # Exactly one API call was made
```

HTTP interception works with any HTTP client (requests, httpx, fetch), provided you pick an interception library that supports that client, and it requires no changes to your production code. The downside: you're asserting on HTTP request/response details, which tightly couples tests to the implementation.
Layer 2: Tool-Level Mocking
Rather than intercepting HTTP, mock the tool functions themselves. This is the right level of abstraction for agent orchestration tests — you're testing that the agent calls the right tools, not the HTTP details:
```python
# Python: mock tools at the function level
from unittest.mock import MagicMock, patch

from agent.workflow import run_issue_triage_agent

def test_agent_creates_issue_for_bug_report():
    mock_search = MagicMock(return_value=[
        {"title": "Similar bug #38", "url": "https://github.com/myorg/myrepo/issues/38"}
    ])
    mock_create = MagicMock(return_value={
        "number": 42,
        "html_url": "https://github.com/myorg/myrepo/issues/42"
    })

    with patch("agent.tools.search_existing_issues", mock_search), \
         patch("agent.tools.create_github_issue", mock_create):
        result = run_issue_triage_agent(
            "Users are reporting login failures on mobile Safari since yesterday's deploy"
        )

    # Agent should have searched before creating
    mock_search.assert_called_once()
    search_call = mock_search.call_args
    assert "login" in str(search_call).lower() or "mobile" in str(search_call).lower()

    # Agent should have created an issue
    mock_create.assert_called_once()
    create_call = mock_create.call_args.kwargs
    assert "mobile" in create_call["title"].lower() or "login" in create_call["title"].lower()
```

Tool-level mocking makes the test read like documentation: "the agent should search before creating." It decouples tests from HTTP details and lets you focus on agent behavior.
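One thing the assertions above don't actually prove is ordering. Attaching both tool mocks to a single parent MagicMock records their invocations in one shared sequence, so you can assert that the search genuinely happened before the create. A minimal sketch with a toy orchestration function (run_triage here is a stand-in for the real agent loop):

```python
from unittest.mock import MagicMock

# Toy orchestration stand-in: search first, create only when nothing matches.
# Your real agent would drive these tools via the LLM.
def run_triage(search, create, report: str):
    if not search(query=report):
        create(title=report)

def test_agent_searches_before_creating():
    parent = MagicMock()
    parent.search.return_value = []  # nothing found, so a create should follow

    run_triage(parent.search, parent.create, "login fails on mobile")

    # mock_calls records child-mock invocations as (name, args, kwargs)
    # tuples in one ordered sequence
    names = [name for name, args, kwargs in parent.mock_calls]
    assert names.index("search") < names.index("create")
```

The same parent-mock trick works with patch: pass the parent's children as the replacement objects and inspect parent.mock_calls afterwards.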
Layer 3: Record and Replay
Record real API interactions once, then replay them in tests. This gives you realistic responses without hitting real APIs repeatedly.
Several libraries support this pattern:
VCR.py (Python):
```python
import vcr

from agent.tools import search_existing_issues

@vcr.use_cassette("cassettes/github_search.yaml")
def test_search_issues():
    results = search_existing_issues(query="mobile login", repo="myorg/myrepo")
    assert len(results) > 0
    assert all("url" in r for r in results)
```

On the first run, when the cassette doesn't exist yet, VCR hits the real API and saves the interaction to cassettes/github_search.yaml. Subsequent runs replay from the file — no network needed. Recording behavior is controlled by VCR's record_mode setting: the default, "once", records only when the cassette is missing, while "new_episodes" also records any request the cassette doesn't already contain.
Polly.js (Node.js):
```javascript
import { Polly } from "@pollyjs/core";
import NodeHttpAdapter from "@pollyjs/adapter-node-http";

Polly.register(NodeHttpAdapter);

describe("GitHub tool", () => {
  let polly;

  beforeEach(() => {
    polly = new Polly("github-search", {
      adapters: ["node-http"],
      mode: "replay",
    });
  });

  afterEach(() => polly.stop());

  it("searches existing issues", async () => {
    const { searchIssues } = await import("./tools/github.js");
    const results = await searchIssues("mobile login", "myorg/myrepo");
    expect(results.length).toBeGreaterThan(0);
  });
});
```

Record/replay is ideal for complex, stateful API interactions where writing manual mocks would be too much work. The cassette files are committed to git, so the whole team shares the same recorded responses.
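For reference, recorded cassettes are plain text (YAML for VCR.py, HAR-style JSON for Polly.js), so leaked tokens or stale responses are easy to spot in code review. An abridged sketch of what a VCR.py cassette looks like in git:

```yaml
interactions:
- request:
    method: GET
    uri: https://api.github.com/search/issues?q=mobile+login
    body: null
  response:
    status:
      code: 200
      message: OK
    body:
      string: '{"items": [{"title": "Similar bug #38"}]}'
version: 1
```

Treat cassette diffs like code diffs: a surprising change in a recorded response often means the upstream API changed underneath you.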
Layer 4: In-Process Fakes
Fakes are lightweight implementations of external services that run in the same process. Unlike mocks (which assert call behavior), fakes implement just enough real behavior to be useful:
```python
# A fake GitHub API for testing
from dataclasses import dataclass, field

@dataclass
class FakeGitHubAPI:
    issues: list = field(default_factory=list)
    _next_number: int = 1

    def search_issues(self, query: str, repo: str) -> list:
        """Return issues whose title or body contains any query word."""
        words = query.lower().split()
        return [
            issue for issue in self.issues
            if any(w in (issue["title"] + issue["body"]).lower() for w in words)
        ]

    def create_issue(self, owner: str, repo: str, title: str, body: str) -> dict:
        issue = {
            "number": self._next_number,
            "title": title,
            "body": body,
            "html_url": f"https://github.com/{owner}/{repo}/issues/{self._next_number}",
            "state": "open"
        }
        self.issues.append(issue)
        self._next_number += 1
        return issue

    def close_issue(self, number: int) -> dict:
        issue = next((i for i in self.issues if i["number"] == number), None)
        if not issue:
            raise ValueError(f"Issue #{number} not found")
        issue["state"] = "closed"
        return issue
```

Using the fake in tests:
```python
def test_agent_deduplicates_issues(fake_github):
    # Pre-populate with an existing issue
    fake_github.create_issue(
        owner="myorg", repo="myrepo",
        title="Bug: login fails on mobile",
        body="Reported by user A"
    )

    with patch("agent.tools.github_api", fake_github):
        result = run_issue_triage_agent(
            "Mobile login is broken, same thing user A reported"
        )

    # Agent should have found the existing issue and not created a duplicate
    assert len(fake_github.issues) == 1, "Agent should not have created a duplicate issue"
    assert "existing" in result.output.lower() or "duplicate" in result.output.lower()
```

Fakes are the best choice when:
- Tests involve stateful interactions (create, then read, then update)
- You want to test the agent's behavior across multiple tool calls
- The mocked service's behavior matters (not just whether the call was made)
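To see the stateful point concretely, here's a create-then-update roundtrip against a fake. The class below is a compact inline stand-in for the FakeGitHubAPI shown earlier (trimmed so the sketch runs on its own):

```python
from dataclasses import dataclass, field

# Compact stand-in for the fuller FakeGitHubAPI above, inlined so this
# example is self-contained.
@dataclass
class FakeGitHub:
    issues: list = field(default_factory=list)
    _next: int = 1

    def create_issue(self, title: str, body: str = "") -> dict:
        issue = {"number": self._next, "title": title, "body": body, "state": "open"}
        self.issues.append(issue)
        self._next += 1
        return issue

    def close_issue(self, number: int) -> dict:
        issue = next(i for i in self.issues if i["number"] == number)
        issue["state"] = "closed"
        return issue

def test_create_then_close_roundtrip():
    gh = FakeGitHub()
    created = gh.create_issue("Bug: login fails on mobile")
    gh.close_issue(created["number"])
    # The fake carries state across calls, so the update is observable later
    assert gh.issues[0]["state"] == "closed"
```

A MagicMock could only confirm that close_issue was called; the fake lets you assert the issue actually ended up closed.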
Mocking Irreversible Actions
Some tool calls are irreversible: sending emails, posting to Slack, deleting records, charging credit cards. Always mock these in tests, but test that the mock was called:
```python
def test_agent_sends_confirmation_email():
    sent_emails = []

    def mock_send_email(to: str, subject: str, body: str) -> dict:
        sent_emails.append({"to": to, "subject": subject, "body": body})
        return {"message_id": "msg_fake_123", "status": "sent"}

    with patch("agent.tools.send_email", mock_send_email):
        run_subscription_cancellation_agent(user_email="user@example.com")

    assert len(sent_emails) == 1, "Agent should send exactly one confirmation email"
    assert sent_emails[0]["to"] == "user@example.com"
    assert "cancel" in sent_emails[0]["subject"].lower()
```

For Slack messages, GitHub comments, and other social actions, assert the content of what would have been posted, not just that the function was called.
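The same capture-list pattern extends to Slack. A self-contained sketch, where announce_incident is a hypothetical stand-in for the agent step that calls the posting tool:

```python
def test_incident_post_content():
    posted = []

    def fake_post(channel: str, text: str) -> dict:
        # Capture the content so the test can assert on the message itself
        posted.append({"channel": channel, "text": text})
        return {"ok": True}

    # Hypothetical agent step that announces an incident via the tool
    def announce_incident(post, summary: str):
        post(channel="#incidents", text=f":rotating_light: {summary}")

    announce_incident(fake_post, "payment processing is down")

    assert posted[0]["channel"] == "#incidents"
    assert "payment processing" in posted[0]["text"]
```

Asserting on channel and message text catches a whole class of bugs (wrong channel, empty message, leaked internal details) that a bare assert_called_once misses.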
Handling Rate Limits in Tests
When you can't mock (e.g., when running a selected set of integration tests against a staging environment), add client-side throttling so tests don't blow through the service's rate limits and start behaving differently than production:
```python
import time

class RateLimitAwareClient:
    def __init__(self, real_client, max_calls_per_minute: int = 30):
        self._client = real_client
        self._max_calls = max_calls_per_minute
        self._calls = []

    def search_issues(self, *args, **kwargs):
        self._enforce_rate_limit()
        return self._client.search_issues(*args, **kwargs)

    def _enforce_rate_limit(self):
        now = time.time()
        # Keep only timestamps from the last 60 seconds
        self._calls = [t for t in self._calls if now - t < 60]
        if len(self._calls) >= self._max_calls:
            # Sleep until the oldest call ages out of the window
            sleep_time = 60 - (now - self._calls[0])
            time.sleep(sleep_time)
        self._calls.append(now)
```

Structuring Mock Fixtures
Organize mock fixtures so they're reusable across tests:
```python
# conftest.py
from unittest.mock import patch

import pytest

from tests.fakes import FakeDatabaseAPI, FakeGitHubAPI, FakeSlackAPI

@pytest.fixture
def fake_github():
    return FakeGitHubAPI()

@pytest.fixture
def fake_slack():
    return FakeSlackAPI()

@pytest.fixture
def fake_db():
    db = FakeDatabaseAPI()
    db.seed_from_fixture("tests/fixtures/sample_data.json")
    return db

@pytest.fixture
def agent_environment(fake_github, fake_slack, fake_db):
    """Complete mocked environment for agent tests."""
    with patch("agent.tools.github_api", fake_github), \
         patch("agent.tools.slack_api", fake_slack), \
         patch("agent.tools.database", fake_db):
        yield {
            "github": fake_github,
            "slack": fake_slack,
            "db": fake_db
        }
```

Then use in tests:
```python
def test_agent_full_triage_workflow(agent_environment):
    env = agent_environment
    result = run_full_triage_agent("Critical: payment processing is down")

    # Assert a GitHub issue was created
    assert len(env["github"].issues) == 1
    assert "critical" in env["github"].issues[0]["title"].lower()

    # Assert a Slack alert was sent
    assert len(env["slack"].messages) >= 1
    slack_message = env["slack"].messages[0]
    assert "#incidents" in slack_message["channel"]
```

End-to-End Testing After Mocked Development
Once you've verified behavior with mocks, run a smaller suite of end-to-end tests against real (staging) infrastructure. These should be:
- Fewer — maybe 5-10 per agent workflow
- Read-only or reversible — don't delete production data
- Isolated — use dedicated test accounts, namespaces, or sandboxes
- Run less often — nightly or pre-release, not on every commit
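One way to wire up the "run less often" rule, assuming pytest: gate E2E tests behind a custom marker (the name "e2e" is just a convention) so the per-commit run can exclude them:

```python
import pytest

# E2E tests carry a marker so CI can skip them on every commit while a
# nightly job selects them explicitly.
@pytest.mark.e2e
def test_triage_against_staging():
    ...  # drives the agent against a dedicated staging GitHub org
```

Register the marker in pytest.ini (under `markers`), then run `pytest -m "not e2e"` on commits and `pytest -m e2e` in the nightly job.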
For browser-visible effects of agent workflows, HelpMeTest's Robot Framework tests work well:
```shell
curl -fsSL https://helpmetest.com/install | bash
helpmetest test "Agent E2E: Issue Triage"
```

Conclusion
Mocking external APIs in agentic systems requires choosing the right layer: HTTP interception for testing tool implementations, function-level mocking for testing agent orchestration, record/replay for realistic but reproducible responses, and in-process fakes for stateful multi-step workflows. The key principle: mock at the boundary closest to what you're testing. Don't test HTTP details when you're testing agent logic, and don't test agent logic when you're testing tool implementation.