How to Test LangGraph Agents Before They Hit Production

HelpMeTest

13 May 2026

You built a LangGraph agent. The nodes look right. The graph compiles. You run it manually and it works.

Then you ship it, and a user hits an edge case in your routing logic, your agent ends up in an infinite loop, and your production system starts burning tokens until someone notices.

This is the LangGraph testing problem: the graph runs, but you don't know if it runs correctly across the state space that matters.

Why LangGraph Agents Are Hard to Test

A LangGraph agent is fundamentally different from a REST API or a function. It has:

State that evolves across multiple node executions
Routing logic that decides which node runs next based on that state
Non-deterministic behavior (LLM calls) inside individual nodes
Cycles that can loop indefinitely if not carefully controlled

Unit testing a single node tells you the node works in isolation. It doesn't tell you whether your graph reaches the right terminal state from a given initial state — which is what your users actually care about.

Layer 1: Unit Testing Individual Nodes

Start here. Each node in your graph is a Python function that takes state and returns a state update. Test them in isolation.

from langchain_core.messages import HumanMessage

from your_agent import classify_intent

def test_classify_intent_customer_support():
    state = {"messages": [HumanMessage(content="My order hasn't arrived")]}
    update = classify_intent(state)
    assert update["intent"] == "support"
    assert update["next"] == "support_handler"

def test_classify_intent_sales():
    state = {"messages": [HumanMessage(content="I want to upgrade my plan")]}
    update = classify_intent(state)
    assert update["intent"] == "sales"
    assert update["next"] == "sales_handler"

For nodes that call LLMs, mock the model in unit tests. LangChain provides GenericFakeChatModel for this:

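from langchain_core.language_models import GenericFakeChatModel
from langchain_core.messages import AIMessage

# SummarizeNode is assumed to live in the same module as the other nodes.
from your_agent import SummarizeNode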

def test_summarize_node_with_mock():
    mock_model = GenericFakeChatModel(messages=iter([
        AIMessage(content="Order #1234 not received. Customer wants refund.")
    ]))
    node = SummarizeNode(model=mock_model)
    result = node.invoke({"transcript": "I ordered last week and nothing came..."})
    assert "refund" in result["summary"].lower()

This keeps node tests fast and deterministic. You're testing the node logic, not the LLM.
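Mocking like this only works if the node receives its model from outside instead of constructing one internally. A minimal sketch of that shape, assuming a hypothetical `make_summarize_node` factory and a stand-in `StubModel` (neither is part of LangChain; the stub only mimics a chat model's `invoke` returning an object with a `content` attribute):

```python
from dataclasses import dataclass


@dataclass
class StubReply:
    """Stand-in for a chat message: only the .content attribute is modelled."""
    content: str


class StubModel:
    """Hypothetical fake model that returns canned replies in order."""

    def __init__(self, replies):
        self._replies = iter(replies)
        self.prompts = []  # record prompts so tests can inspect them

    def invoke(self, prompt):
        self.prompts.append(prompt)
        return StubReply(next(self._replies))


def make_summarize_node(model):
    """Build a node function that closes over the injected model."""

    def summarize(state):
        reply = model.invoke(f"Summarize this transcript:\n{state['transcript']}")
        # Return only the keys this node updates, not the whole state.
        return {"summary": reply.content.strip()}

    return summarize


stub = StubModel(["  Order #1234 not received. Customer wants refund.  "])
node = make_summarize_node(stub)
update = node({"transcript": "I ordered last week and nothing came..."})
```

Because the stub records every prompt, the same test can also check what the node sent to the model, not only what it did with the reply.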

Layer 2: Testing Routing Functions

Routing functions are where most LangGraph bugs live. They decide which node runs next — get it wrong and your agent goes somewhere it shouldn't.

Test every branch explicitly:

from your_agent import route_after_classification

def test_routes_support_to_handler():
    state = {"intent": "support", "confidence": 0.95}
    next_node = route_after_classification(state)
    assert next_node == "support_handler"

def test_routes_low_confidence_to_escalation():
    state = {"intent": "unknown", "confidence": 0.3}
    next_node = route_after_classification(state)
    assert next_node == "human_escalation"

def test_routes_sales_to_sales_handler():
    state = {"intent": "sales", "confidence": 0.88}
    next_node = route_after_classification(state)
    assert next_node == "sales_handler"

Don't assume the default case works. Test it explicitly, especially the low-confidence and fallback paths.
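One way to keep branch coverage honest is to list every routing case in a single table, including the ones that are easy to forget. The sketch below is illustrative only: the router body, the 0.5 threshold, and the `billing` intent are assumptions, not the article's implementation.

```python
CONFIDENCE_THRESHOLD = 0.5  # assumed cutoff, not taken from the article

HANDLERS = {"support": "support_handler", "sales": "sales_handler"}


def route_after_classification(state):
    """Hypothetical router: escalate on low confidence or an unmapped intent."""
    if state.get("confidence", 0.0) < CONFIDENCE_THRESHOLD:
        return "human_escalation"
    return HANDLERS.get(state.get("intent"), "human_escalation")


ROUTING_CASES = [
    ({"intent": "support", "confidence": 0.95}, "support_handler"),
    ({"intent": "sales", "confidence": 0.88}, "sales_handler"),
    ({"intent": "unknown", "confidence": 0.3}, "human_escalation"),
    # Edge cases that are easy to forget:
    ({"intent": "support", "confidence": 0.2}, "human_escalation"),   # known intent, low confidence
    ({"intent": "billing", "confidence": 0.99}, "human_escalation"),  # unmapped intent, high confidence
    ({"intent": "sales", "confidence": 0.5}, "sales_handler"),        # exactly at the threshold
    ({}, "human_escalation"),                                         # empty state
]


def check_routing_cases():
    """Return (state, expected, actual) for every case that routes wrongly."""
    failures = []
    for state, expected in ROUTING_CASES:
        actual = route_after_classification(state)
        if actual != expected:
            failures.append((state, expected, actual))
    return failures
```

With pytest, the same table can feed `@pytest.mark.parametrize("state, expected", ROUTING_CASES)` so each row reports as its own test.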

Layer 3: Testing Individual Nodes via the Compiled Graph

A compiled LangGraph graph exposes its nodes by name, so you can invoke a single node without running the rest of the graph. This exercises the node as the graph wraps it, not as a bare function:

from langchain_core.messages import HumanMessage

from your_agent import build_graph

def test_support_handler_node():
    graph = build_graph()
    compiled = graph.compile()
    
    result = compiled.nodes["support_handler"].invoke({
        "intent": "support",
        "messages": [HumanMessage(content="My order hasn't arrived")],
        "customer_id": "cust_123"
    })
    
    assert "ticket_id" in result
    assert result["status"] == "created"

This catches integration issues between your node implementation and the graph's state schema — things unit tests on the raw function won't surface.
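One mismatch of this kind is a node writing a key the state schema never declared. A cheap complementary check, sketched here with a hypothetical `AgentState` whose fields follow the examples in this article, compares a node's update against the schema's declared keys:

```python
from typing import TypedDict, get_type_hints


class AgentState(TypedDict, total=False):
    """Hypothetical state schema; field names follow the examples above."""
    intent: str
    messages: list
    customer_id: str
    ticket_id: str
    status: str


def unexpected_keys(update, schema=AgentState):
    """Return keys in a node's update that the state schema does not declare."""
    return sorted(set(update) - set(get_type_hints(schema)))


good_update = {"ticket_id": "T-1", "status": "created"}
bad_update = {"ticketId": "T-1", "status": "created"}  # camelCase typo
```

Asserting `unexpected_keys(result) == []` in each node test turns a misspelled key into a named failure instead of a missing value several nodes later.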

Layer 4: Full Graph Trajectory Testing

After individual nodes check out, test whether the full graph reaches the right terminal state from given inputs. This is where you catch routing bugs and infinite loops.

from langchain_core.messages import HumanMessage

from your_agent import build_graph

def test_support_ticket_creation_flow():
    graph = build_graph().compile()

    final_state = graph.invoke({
        "messages": [HumanMessage(content="My order hasn't arrived and it's been 2 weeks")]
    })

    # Assert terminal state
    assert final_state["status"] == "ticket_created"
    assert final_state["ticket_id"] is not None
    # Exact match, consistent with the classify_intent unit tests
    assert final_state["intent"] == "support"

    # Assert the graph didn't loop excessively
    assert final_state.get("iteration_count", 0) < 5

Add an iteration counter to your state if you have cycles. Loops that work correctly should terminate in a bounded number of steps.
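A counter is only useful if something increments it and something else acts on it. The sketch below shows one possible arrangement; the node, the router, the `succeeds_on_attempt` test knob, and the budget of 5 are all hypothetical. The small `run_cycle` driver stands in for the graph so the guard can be tested without compiling anything.

```python
MAX_ITERATIONS = 5  # assumed budget; tune to your graph


def retry_lookup(state):
    """Hypothetical node in a cycle: makes one attempt and bumps the counter."""
    attempts = state.get("iteration_count", 0) + 1
    # Stand-in for real work: the lookup "succeeds" on the attempt named in state.
    found = attempts >= state.get("succeeds_on_attempt", float("inf"))
    return {"iteration_count": attempts, "found": found}


def route_after_lookup(state):
    """Router for the cycle: stop on success, bail out when the budget is spent."""
    if state.get("found"):
        return "done"
    if state.get("iteration_count", 0) >= MAX_ITERATIONS:
        return "human_escalation"
    return "retry_lookup"


def run_cycle(state):
    """Tiny driver that mimics the retry loop so the guard can be tested alone."""
    state = dict(state)
    next_node = "retry_lookup"
    while next_node == "retry_lookup":
        state.update(retry_lookup(state))
        next_node = route_after_lookup(state)
    return next_node, state
```

LangGraph also enforces its own step budget: passing `{"recursion_limit": N}` in the config to `invoke` raises `GraphRecursionError` once the limit is hit. Treat that as a backstop for runaway graphs, and keep the explicit counter for cycles where you want a graceful exit path.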

The Trajectory Problem

Individual state assertions tell you where the agent ended up. They don't tell you how it got there. An agent that reaches the right final state via the wrong path is a bug waiting to happen in production.

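LangChain's AgentEvals package (published on PyPI as agentevals) includes evaluators that compare the sequence of node executions, not just the final output. The sketch below uses its strict graph-trajectory match, which reads the run history back from a checkpointer, so the graph is compiled with one. Verify the import paths against the version you install:

from langgraph.checkpoint.memory import MemorySaver
from agentevals.graph_trajectory.strict import graph_trajectory_strict_match
from agentevals.graph_trajectory.utils import extract_langgraph_trajectory_from_thread

from your_agent import build_graph

# The trajectory is read from the checkpointer, so a thread_id is required.
graph = build_graph().compile(checkpointer=MemorySaver())
config = {"configurable": {"thread_id": "trajectory-test"}}
graph.invoke(initial_state, config)

reference_trajectory = {
    "results": [],  # not compared by the strict match
    "steps": [["__start__", "classify_intent", "support_handler", "ticket_creation", "notify_customer"]],
}

actual = extract_langgraph_trajectory_from_thread(graph, config)
result = graph_trajectory_strict_match(
    outputs=actual["outputs"],
    reference_outputs=reference_trajectory,
)
assert result["score"]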

This catches cases where your agent reaches the right answer by an accidental path: fragile behavior that breaks on slightly different inputs.
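When a trajectory check fails, a score alone doesn't say where the path went wrong. A dependency-free helper can report the first step that diverged. The sketch assumes each streamed chunk is a dict keyed by node name, which is the shape `graph.stream(..., stream_mode="updates")` yields; the sample chunks below are made up for illustration.

```python
def node_sequence(chunks):
    """Flatten streamed update chunks into an ordered list of node names.

    Assumes each chunk is a dict keyed by node name, the shape produced
    by graph.stream(..., stream_mode="updates").
    """
    sequence = []
    for chunk in chunks:
        sequence.extend(chunk.keys())
    return sequence


def first_divergence(actual, expected):
    """Return the index of the first step where the paths differ, or None."""
    for i, (a, e) in enumerate(zip(actual, expected)):
        if a != e:
            return i
    if len(actual) != len(expected):
        return min(len(actual), len(expected))
    return None


# Sample chunks standing in for real stream output (ticket_creation was skipped).
sample_chunks = [
    {"classify_intent": {"intent": "support"}},
    {"support_handler": {"status": "created"}},
    {"notify_customer": {"notified": True}},
]
expected = ["classify_intent", "support_handler", "ticket_creation", "notify_customer"]
actual = node_sequence(sample_chunks)
```

Here the run skipped `ticket_creation`, so `first_divergence(actual, expected)` points at the step where the customer was notified about a ticket that was never created.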

What Code-Level Tests Miss

All of the above tests run against your graph in isolation. That's necessary but not sufficient for production LangGraph agents.

Production agents fail for reasons outside the graph:

The LLM changes behavior: model updates shift classification accuracy. Your routing logic worked with the old model and fails with the new one.
Tool integrations drift: the external API your agent calls changes its response format. The call still returns 200, but your parsing node breaks silently.
State accumulation: in long-running multi-turn agents, state grows large enough to exceed context windows. It works in tests with short sessions and fails in production with real users.
Concurrency: multiple users hit your agent simultaneously, and shared external state gets corrupted.

These failure modes need ongoing monitoring against real production behavior, not just CI tests.
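Silent parsing failures can at least be made loud. The sketch below validates a tool response before it enters agent state; the order-lookup contract, field names, and `ToolResponseError` are hypothetical and stand in for whatever your tool actually returns.

```python
class ToolResponseError(ValueError):
    """Raised when a tool response does not match the expected shape."""


REQUIRED_FIELDS = {"order_id": str, "status": str}  # hypothetical order-lookup contract


def parse_order_response(payload):
    """Validate a decoded JSON payload before it enters agent state."""
    if not isinstance(payload, dict):
        raise ToolResponseError(f"expected object, got {type(payload).__name__}")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in payload:
            raise ToolResponseError(f"missing field: {field}")
        if not isinstance(payload[field], expected_type):
            raise ToolResponseError(f"wrong type for field: {field}")
    return {"order_id": payload["order_id"], "order_status": payload["status"]}


def try_parse(payload):
    """Helper for the examples: return the parsed update or the error text."""
    try:
        return parse_order_response(payload)
    except ToolResponseError as exc:
        return str(exc)
```

A renamed field such as `orderId` now raises at the boundary, where a monitor or an error-handling branch can see it, instead of surfacing later as a missing value in state.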

Monitoring LangGraph Agents in Production

Once your agent is live, you need behavioral monitoring — periodic test runs that verify your agent still behaves correctly as models, prompts, and external dependencies change.

HelpMeTest lets you write natural-language tests against your deployed agent endpoint and run them on a schedule, with no test infrastructure to maintain:

Test: customer support agent ticket creation
When user says: "My order hasn't arrived after 2 weeks"
Then: response contains ticket confirmation
And: agent does not ask for clarification more than once
And: response time under 10 seconds

Tests run every hour. If your agent's routing breaks after a model update or a tool API change, you find out before your users do.
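To make the idea concrete, here is a rough sketch of what a single scheduled behavioral check does. It is not how HelpMeTest is implemented; `run_behavior_check` and `fake_agent` are hypothetical names, and `ask` would normally wrap a request to your deployed endpoint.

```python
import time


def run_behavior_check(ask, prompt, must_contain, max_seconds):
    """Send one prompt through `ask` and report content and latency checks.

    `ask` is any callable that takes the prompt and returns the reply text,
    for example a thin wrapper around an HTTP call to your agent endpoint.
    """
    started = time.monotonic()
    reply = ask(prompt)
    elapsed = time.monotonic() - started
    return {
        "contains_expected": must_contain.lower() in reply.lower(),
        "within_time_limit": elapsed <= max_seconds,
        "elapsed_seconds": elapsed,
    }


def fake_agent(prompt):
    """Stand-in for the deployed agent so the sketch runs offline."""
    return "Ticket #A-1 created. We'll email you an update."


report = run_behavior_check(
    fake_agent,
    prompt="My order hasn't arrived after 2 weeks",
    must_contain="ticket",
    max_seconds=10,
)
```

Substring checks like this are brittle against LLM phrasing, which is part of the case for natural-language assertions over hand-rolled scripts.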

Free tier: 10 tests, unlimited health checks.

LangGraph Testing Checklist

Before shipping any LangGraph agent:

Unit tests for every node function with mocked LLM calls
Explicit tests for every routing branch including fallback paths
Full graph trajectory tests from representative initial states
Cycle detection — verify loops terminate in bounded steps
Tests for the error path — what happens when a node fails
State schema validation — does your graph handle unexpected state shapes
Concurrency test if multiple users can invoke the same agent
Production monitoring for behavioral drift after model/tool updates
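The error-path item is the one most often skipped, so here is one possible shape for it. This is a sketch under assumptions: the `with_error_capture` wrapper, the `error` state key, and the router are hypothetical, and your graph may prefer to let exceptions propagate instead.

```python
def with_error_capture(node_name, node_fn):
    """Wrap a node so exceptions become state updates instead of crashing the run."""

    def wrapped(state):
        try:
            return node_fn(state)
        except Exception as exc:  # broad on purpose: any node failure gets routed
            return {"error": {"node": node_name, "message": str(exc)}}

    return wrapped


def route_on_error(state):
    """Hypothetical router: send failed runs to escalation."""
    return "human_escalation" if state.get("error") else "notify_customer"


def flaky_ticket_creation(state):
    """Stand-in node that fails the way a ticketing API outage would."""
    raise TimeoutError("ticketing API timed out")


safe_node = with_error_capture("ticket_creation", flaky_ticket_creation)
update = safe_node({"intent": "support"})
```

The test then asserts two things: the failure was recorded with the node that caused it, and the router sends the run to escalation.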

The graph compiling is table stakes. The trajectory being correct is what matters.
