Mocking External APIs in Agentic Systems

AI agents call external APIs — search engines, databases, Slack, GitHub, payment processors. Mocking these in tests is essential: real APIs are slow, rate-limited, non-deterministic, and sometimes irreversible. This guide covers four mocking strategies for agentic systems: HTTP interception, tool-level mocking, record/replay, and in-process fakes.

An AI agent that manages your GitHub issues calls the GitHub API. An agent that schedules meetings calls Google Calendar. A customer support agent hits your CRM, your ticket system, and your billing API — all in a single workflow.

Testing these agents against real APIs has serious problems:

  • Rate limits kill CI pipelines
  • API calls cost money
  • Tests that delete records or send emails have irreversible effects
  • Flaky external services make tests non-deterministic
  • Real APIs are slow (network latency × number of tool calls × number of tests = hours)

Mocking solves all of this. But agentic systems can be mocked at several layers, and choosing the right layer matters.

The Four Mocking Layers

Layer 1: HTTP Interception

The lowest-level approach intercepts HTTP requests before they leave the process. Tools like responses (Python, for the requests library), respx (for httpx), nock (Node.js), or httpretty (which patches at the socket level) intercept outbound calls before they reach the network.

# Python: mock all GitHub API calls
import responses
from agent.tools import create_github_issue

@responses.activate
def test_create_issue_tool():
    responses.post(
        "https://api.github.com/repos/myorg/myrepo/issues",
        json={
            "id": 1,
            "number": 42,
            "title": "Bug: login fails on mobile",
            "html_url": "https://github.com/myorg/myrepo/issues/42"
        },
        status=201
    )

    result = create_github_issue(
        owner="myorg",
        repo="myrepo",
        title="Bug: login fails on mobile",
        body="Steps to reproduce..."
    )

    assert result["number"] == 42
    assert result["html_url"].endswith("/42")
    assert len(responses.calls) == 1  # Exactly one API call was made

HTTP interception requires no changes to your production code, as long as the library matches your HTTP client (responses only patches requests; respx covers httpx, and nock covers Node's http module). The downside: you're asserting on HTTP request/response details, which tightly couples tests to implementation.
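
That coupling can be put to work when the request payload itself is what you're testing. responses supports request matchers for this; a minimal sketch, assuming the same create_github_issue tool as above (strict_match=False matches on a subset of the JSON body):

# Python: pin down the request body with a matcher
import responses
from responses import matchers
from agent.tools import create_github_issue

@responses.activate
def test_create_issue_sends_expected_body():
    responses.post(
        "https://api.github.com/repos/myorg/myrepo/issues",
        json={"number": 42, "html_url": "https://github.com/myorg/myrepo/issues/42"},
        status=201,
        # The mock only matches (and the test only passes) if the
        # tool's JSON body contains this field
        match=[matchers.json_params_matcher(
            {"title": "Bug: login fails on mobile"},
            strict_match=False,
        )],
    )

    create_github_issue(
        owner="myorg",
        repo="myrepo",
        title="Bug: login fails on mobile",
        body="Steps to reproduce..."
    )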

Layer 2: Tool-Level Mocking

Rather than intercepting HTTP, mock the tool functions themselves. This is the right level of abstraction for agent orchestration tests — you're testing that the agent calls the right tools, not the HTTP details:

# Python: mock tools at the function level
from unittest.mock import patch, MagicMock
from agent.workflow import run_issue_triage_agent

def test_agent_creates_issue_for_bug_report():
    mock_search = MagicMock(return_value=[
        {"title": "Similar bug #38", "url": "https://github.com/myorg/myrepo/issues/38"}
    ])
    mock_create = MagicMock(return_value={
        "number": 42,
        "html_url": "https://github.com/myorg/myrepo/issues/42"
    })

    with patch("agent.tools.search_existing_issues", mock_search), \
         patch("agent.tools.create_github_issue", mock_create):

        result = run_issue_triage_agent(
            "Users are reporting login failures on mobile Safari since yesterday's deploy"
        )

    # Agent should have searched before creating
    mock_search.assert_called_once()
    search_call = mock_search.call_args
    assert "login" in str(search_call).lower() or "mobile" in str(search_call).lower()

    # Agent should have created an issue
    mock_create.assert_called_once()
    create_call = mock_create.call_args[1]
    assert "mobile" in create_call["title"].lower() or "login" in create_call["title"].lower()

Tool-level mocking makes the test read like documentation: "the agent should search before creating." It decouples from HTTP details and lets you focus on agent behavior.

Layer 3: Record and Replay

Record real API interactions once, then replay them in tests. This gives you realistic responses without hitting real APIs repeatedly.

Several libraries support this pattern:

VCR.py (Python):

import vcr

@vcr.use_cassette("cassettes/github_search.yaml")
def test_search_issues():
    from agent.tools import search_existing_issues
    results = search_existing_issues(query="mobile login", repo="myorg/myrepo")
    assert len(results) > 0
    assert all("url" in r for r in results)

With vcrpy's default record mode ("once"), the first run hits the real API and saves the response to cassettes/github_search.yaml. Subsequent runs replay from the file — no network needed. Pass record_mode="new_episodes" to use_cassette when you want new, unmatched requests appended to an existing cassette.
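
One caveat before committing cassettes: they record requests verbatim, including Authorization headers. vcrpy can scrub these via a shared, configured VCR instance; a minimal sketch:

import vcr

# Shared configuration: strip auth headers so cassettes committed
# to git never contain real tokens
my_vcr = vcr.VCR(
    cassette_library_dir="cassettes",
    record_mode="once",
    filter_headers=["authorization"],
)

@my_vcr.use_cassette("github_search.yaml")
def test_search_issues_scrubbed():
    from agent.tools import search_existing_issues
    results = search_existing_issues(query="mobile login", repo="myorg/myrepo")
    assert all("url" in r for r in results)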

Polly.js (Node.js):

import { Polly } from "@pollyjs/core";
import NodeHttpAdapter from "@pollyjs/adapter-node-http";

Polly.register(NodeHttpAdapter);

describe("GitHub tool", () => {
  let polly;

  beforeEach(() => {
    polly = new Polly("github-search", {
      adapters: ["node-http"],
      mode: "replay",
    });
  });

  afterEach(() => polly.stop());

  it("searches existing issues", async () => {
    const { searchIssues } = await import("./tools/github.js");
    const results = await searchIssues("mobile login", "myorg/myrepo");
    expect(results.length).toBeGreaterThan(0);
  });
});

Record/replay is ideal for complex, stateful API interactions where writing manual mocks would be too much work. The cassette files are committed to git, so the team shares the same recorded responses.

Layer 4: In-Process Fakes

Fakes are lightweight implementations of external services that run in the same process. Unlike mocks (which assert call behavior), fakes implement just enough real behavior to be useful:

# A fake GitHub API for testing
from dataclasses import dataclass, field

@dataclass
class FakeGitHubAPI:
    issues: list = field(default_factory=list)
    _next_number: int = 1

    def search_issues(self, query: str, repo: str) -> list:
        """Return issues whose title or body contains any query word."""
        words = query.lower().split()
        return [
            issue for issue in self.issues
            if any(w in (issue["title"] + issue["body"]).lower() for w in words)
        ]

    def create_issue(self, owner: str, repo: str, title: str, body: str) -> dict:
        issue = {
            "number": self._next_number,
            "title": title,
            "body": body,
            "html_url": f"https://github.com/{owner}/{repo}/issues/{self._next_number}",
            "state": "open"
        }
        self.issues.append(issue)
        self._next_number += 1
        return issue

    def close_issue(self, number: int) -> dict:
        issue = next((i for i in self.issues if i["number"] == number), None)
        if not issue:
            raise ValueError(f"Issue #{number} not found")
        issue["state"] = "closed"
        return issue

Using the fake in tests:

def test_agent_deduplicates_issues(fake_github):
    # Pre-populate with an existing issue
    fake_github.create_issue(
        owner="myorg", repo="myrepo",
        title="Bug: login fails on mobile",
        body="Reported by user A"
    )

    with patch("agent.tools.github_api", fake_github):
        result = run_issue_triage_agent(
            "Mobile login is broken, same thing user A reported"
        )

    # Agent should have found the existing issue and not created a duplicate
    assert len(fake_github.issues) == 1, "Agent should not have created a duplicate issue"
    assert "existing" in result.output.lower() or "duplicate" in result.output.lower()

Fakes are the best choice when:

  • Tests involve stateful interactions (create, then read, then update)
  • You want to test the agent's behavior across multiple tool calls
  • The mocked service's behavior matters (not just whether the call was made)

Mocking Irreversible Actions

Some tool calls are irreversible: sending emails, posting to Slack, deleting records, charging credit cards. Always mock these in tests, but test that the mock was called:

def test_agent_sends_confirmation_email():
    sent_emails = []

    def mock_send_email(to: str, subject: str, body: str) -> dict:
        sent_emails.append({"to": to, "subject": subject, "body": body})
        return {"message_id": "msg_fake_123", "status": "sent"}

    with patch("agent.tools.send_email", mock_send_email):
        run_subscription_cancellation_agent(user_email="user@example.com")

    assert len(sent_emails) == 1, "Agent should send exactly one confirmation email"
    assert sent_emails[0]["to"] == "user@example.com"
    assert "cancel" in sent_emails[0]["subject"].lower()

For Slack, GitHub comments, and other social actions — assert the content of what would have been posted, not just that the function was called.
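
A fake that records outgoing posts makes those content assertions natural. A minimal sketch, showing one plausible shape for the FakeSlackAPI imported in conftest.py below (the real one can grow threads, reactions, or channel lookups as tests need them):

# A fake Slack client that records what would have been posted
from dataclasses import dataclass, field

@dataclass
class FakeSlackAPI:
    messages: list = field(default_factory=list)

    def post_message(self, channel: str, text: str) -> dict:
        # Record instead of posting, so tests can assert on content,
        # not just on whether the call happened
        self.messages.append({"channel": channel, "text": text})
        return {"ok": True, "ts": f"{len(self.messages)}.000000"}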

Handling Rate Limits in Tests

When you can't mock (e.g., when running selected integration tests against a staging environment), wrap the real client in a self-throttling adapter so tests stay under the service's rate limits instead of failing intermittently:

import time

class RateLimitAwareClient:
    def __init__(self, real_client, max_calls_per_minute: int = 30):
        self._client = real_client
        self._max_calls = max_calls_per_minute
        self._calls = []  # timestamps of recent calls

    def search_issues(self, *args, **kwargs):
        self._enforce_rate_limit()
        return self._client.search_issues(*args, **kwargs)

    def _enforce_rate_limit(self):
        now = time.time()
        # Keep only calls within the sliding one-minute window
        self._calls = [t for t in self._calls if now - t < 60]
        if len(self._calls) >= self._max_calls:
            # Sleep until the oldest call ages out of the window
            time.sleep(60 - (now - self._calls[0]))
        self._calls.append(now)
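
Wiring it up is a thin fixture around whatever real client you already have. A sketch, where GitHubClient and STAGING_GITHUB_TOKEN are placeholders for your own client and credentials:

import os
import pytest

@pytest.fixture
def throttled_github():
    # GitHubClient is a stand-in for your real client class
    real = GitHubClient(token=os.environ["STAGING_GITHUB_TOKEN"])
    # Stay well under the staging API's published limits
    return RateLimitAwareClient(real, max_calls_per_minute=10)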

Structuring Mock Fixtures

Organize mock fixtures so they're reusable across tests:

# conftest.py
import pytest
from unittest.mock import patch

from tests.fakes import FakeGitHubAPI, FakeSlackAPI, FakeDatabaseAPI

@pytest.fixture
def fake_github():
    return FakeGitHubAPI()

@pytest.fixture
def fake_slack():
    return FakeSlackAPI()

@pytest.fixture
def fake_db():
    db = FakeDatabaseAPI()
    db.seed_from_fixture("tests/fixtures/sample_data.json")
    return db

@pytest.fixture
def agent_environment(fake_github, fake_slack, fake_db):
    """Complete mocked environment for agent tests."""
    with patch("agent.tools.github_api", fake_github), \
         patch("agent.tools.slack_api", fake_slack), \
         patch("agent.tools.database", fake_db):
        yield {
            "github": fake_github,
            "slack": fake_slack,
            "db": fake_db
        }

Then use in tests:

def test_agent_full_triage_workflow(agent_environment):
    env = agent_environment
    
    result = run_full_triage_agent("Critical: payment processing is down")
    
    # Assert GitHub issue was created
    assert len(env["github"].issues) == 1
    assert "critical" in env["github"].issues[0]["title"].lower()
    
    # Assert Slack alert was sent
    assert len(env["slack"].messages) >= 1
    slack_message = env["slack"].messages[0]
    assert "#incidents" in slack_message["channel"]

End-to-End Testing After Mocked Development

Once you've verified behavior with mocks, run a smaller suite of end-to-end tests against real (staging) infrastructure. These should be:

  • Fewer — maybe 5-10 per agent workflow
  • Read-only or reversible — don't delete production data
  • Isolated — use dedicated test accounts, namespaces, or sandboxes
  • Run less often — nightly or pre-release, not on every commit (see the gating sketch below)
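
One lightweight way to enforce the last point is a pytest marker gated on an environment variable. A sketch, assuming RUN_E2E is set only by your nightly pipeline:

import os
import pytest
from agent.workflow import run_issue_triage_agent

# Skip E2E tests unless the nightly pipeline opts in with RUN_E2E=1
e2e = pytest.mark.skipif(
    not os.environ.get("RUN_E2E"),
    reason="E2E tests run nightly, not on every commit",
)

@e2e
def test_issue_triage_against_staging():
    result = run_issue_triage_agent("Smoke test: staging triage run")
    assert result is not None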

For browser-visible effects of agent workflows, HelpMeTest's Robot Framework tests work well:

curl -fsSL https://helpmetest.com/install | bash
helpmetest test "Agent E2E: Issue Triage"

Conclusion

Mocking external APIs in agentic systems requires choosing the right layer: HTTP interception for testing tool implementations, function-level mocking for testing agent orchestration, record/replay for realistic but reproducible responses, and in-process fakes for stateful multi-step workflows. The key principle: mock at the boundary closest to what you're testing. Don't test HTTP details when you're testing agent logic, and don't test agent logic when you're testing tool implementation.
