How to Test LangGraph Agents Before They Hit Production
You built a LangGraph agent. The nodes look right. The graph compiles. You run it manually and it works.
Then you ship it, and a user hits an edge case in your routing logic, your agent ends up in an infinite loop, and your production system starts burning tokens until someone notices.
This is the LangGraph testing problem: the graph runs, but you don't know if it runs correctly across the state space that matters.
Why LangGraph Agents Are Hard to Test
A LangGraph agent is fundamentally different from a REST API or a function. It has:
- State that evolves across multiple node executions
- Routing logic that decides which node runs next based on that state
- Non-deterministic behavior (LLM calls) inside individual nodes
- Cycles that can loop indefinitely if not carefully controlled
Unit testing a single node tells you the node works in isolation. It doesn't tell you whether your graph reaches the right terminal state from a given initial state — which is what your users actually care about.
Layer 1: Unit Testing Individual Nodes
Start here. Each node in your graph is a Python function that takes state and returns a state update, so each one can be tested in isolation.
    from langchain_core.messages import HumanMessage

    from your_agent import classify_intent

    def test_classify_intent_customer_support():
        state = {"messages": [HumanMessage(content="My order hasn't arrived")]}
        update = classify_intent(state)
        assert update["intent"] == "support"
        assert update["next"] == "support_handler"

    def test_classify_intent_sales():
        state = {"messages": [HumanMessage(content="I want to upgrade my plan")]}
        update = classify_intent(state)
        assert update["intent"] == "sales"
        assert update["next"] == "sales_handler"

For nodes that call LLMs, mock the model in unit tests. LangChain provides GenericFakeChatModel for this:
    from langchain_core.language_models import GenericFakeChatModel
    from langchain_core.messages import AIMessage

    from your_agent import SummarizeNode  # the node class under test

    def test_summarize_node_with_mock():
        # The fake model replays these messages in order, one per call.
        mock_model = GenericFakeChatModel(messages=iter([
            AIMessage(content="Order #1234 not received. Customer wants refund.")
        ]))
        node = SummarizeNode(model=mock_model)
        result = node.invoke({"transcript": "I ordered last week and nothing came..."})
        assert "refund" in result["summary"].lower()

This keeps node tests fast and deterministic. You're testing the node logic, not the LLM.
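If your node wraps a client that isn't a LangChain model, the same idea can be sketched with a hand-rolled stand-in. The names below (ScriptedModel, make_summarize_node) are hypothetical and not part of LangChain or LangGraph. The only assumption is that the node receives its model as a dependency instead of constructing it internally.

```python
from typing import Callable, Iterable


class ScriptedModel:
    """Hypothetical stand-in for an LLM client: replays canned replies in order."""

    def __init__(self, replies: Iterable[str]) -> None:
        self._replies = iter(replies)
        self.prompts: list = []  # every prompt the node sent, kept for assertions

    def invoke(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return next(self._replies)


def make_summarize_node(model) -> Callable[[dict], dict]:
    """Build a node function that closes over an injected model."""

    def summarize(state: dict) -> dict:
        prompt = f"Summarize this support transcript:\n{state['transcript']}"
        # Return only the keys this node is responsible for updating.
        return {"summary": model.invoke(prompt).strip()}

    return summarize


def test_summarize_node_with_scripted_model() -> None:
    model = ScriptedModel(["  Order not received. Customer wants a refund.  "])
    node = make_summarize_node(model)

    update = node({"transcript": "I ordered last week and nothing came..."})

    assert update == {"summary": "Order not received. Customer wants a refund."}
    # The node logic under test: the transcript must actually reach the model.
    assert "nothing came" in model.prompts[0]


test_summarize_node_with_scripted_model()
```

Recording the prompts on the fake is a deliberate choice: it lets a test check what the node sent to the model as well as what it did with the reply.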
Layer 2: Testing Routing Functions
Routing functions are where most LangGraph bugs live. They decide which node runs next — get it wrong and your agent goes somewhere it shouldn't.
Test every branch explicitly:
    from your_agent import route_after_classification

    def test_routes_support_to_handler():
        state = {"intent": "support", "confidence": 0.95}
        next_node = route_after_classification(state)
        assert next_node == "support_handler"

    def test_routes_low_confidence_to_escalation():
        state = {"intent": "unknown", "confidence": 0.3}
        next_node = route_after_classification(state)
        assert next_node == "human_escalation"

    def test_routes_sales_to_sales_handler():
        state = {"intent": "sales", "confidence": 0.88}
        next_node = route_after_classification(state)
        assert next_node == "sales_handler"

Don't assume the default case works. Test it explicitly, especially the low-confidence and fallback paths.
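Fallback paths are easiest to cover when the cases live in a table, so adding a boundary value costs one line. The sketch below uses a hypothetical router with an assumed confidence threshold of 0.5; the real route_after_classification and its cut-off may differ.

```python
CONFIDENCE_THRESHOLD = 0.5  # assumed cut-off; the real router may use another value

HANDLERS = {"support": "support_handler", "sales": "sales_handler"}


def route_after_classification(state: dict) -> str:
    """Hypothetical router: escalate on low confidence or an unmapped intent."""
    if state.get("confidence", 0.0) < CONFIDENCE_THRESHOLD:
        return "human_escalation"
    return HANDLERS.get(state.get("intent"), "human_escalation")


# (state, expected next node) pairs, including boundary and malformed inputs.
ROUTING_CASES = [
    ({"intent": "support", "confidence": 0.95}, "support_handler"),
    ({"intent": "sales", "confidence": 0.88}, "sales_handler"),
    ({"intent": "unknown", "confidence": 0.3}, "human_escalation"),
    ({"intent": "support", "confidence": 0.5}, "support_handler"),    # exactly at threshold
    ({"intent": "support", "confidence": 0.49}, "human_escalation"),  # just below it
    ({"intent": "billing", "confidence": 0.99}, "human_escalation"),  # confident but unmapped
    ({}, "human_escalation"),                                         # keys missing entirely
]


def test_every_routing_branch() -> None:
    for state, expected in ROUTING_CASES:
        assert route_after_classification(state) == expected, state


test_every_routing_branch()
```

Under pytest, the same table maps directly onto pytest.mark.parametrize, which reports each case as its own test.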
Layer 3: Testing Individual Nodes via the Compiled Graph
LangGraph lets you invoke a single node of a compiled graph without executing the rest of the graph. Use this to test a node in its real context:

    from langchain_core.messages import HumanMessage

    from your_agent import build_graph

    def test_support_handler_node():
        graph = build_graph()
        compiled = graph.compile()
        # Invoke only the support_handler node, not the whole graph.
        result = compiled.nodes["support_handler"].invoke({
            "intent": "support",
            "messages": [HumanMessage(content="My order hasn't arrived")],
            "customer_id": "cust_123"
        })
        assert "ticket_id" in result
        assert result["status"] == "created"

This catches integration issues between your node implementation and the graph's state schema, which unit tests on the raw function won't surface.
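One class of schema mismatch can also be checked without the graph at all: a node returning a key the state schema doesn't declare. A minimal sketch, assuming the state is declared as a TypedDict; AgentState, support_handler, and undeclared_keys are hypothetical names for illustration.

```python
from typing import TypedDict, get_type_hints


class AgentState(TypedDict, total=False):
    """Hypothetical state schema for the support agent."""
    intent: str
    messages: list
    customer_id: str
    ticket_id: str
    status: str


def support_handler(state: AgentState) -> dict:
    """Hypothetical node: opens a ticket and reports its status."""
    return {"ticket_id": f"tkt_{state['customer_id']}", "status": "created"}


def undeclared_keys(update: dict, schema: type) -> set:
    """Return keys in a node's update that the state schema does not declare."""
    return set(update) - set(get_type_hints(schema))


def test_support_handler_update_matches_schema() -> None:
    update = support_handler({"intent": "support", "customer_id": "cust_123"})
    assert undeclared_keys(update, AgentState) == set()
    # A misspelled key is reported instead of passing unnoticed.
    assert undeclared_keys({"tikcet_id": "tkt_1"}, AgentState) == {"tikcet_id"}


test_support_handler_update_matches_schema()
```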
Layer 4: Full Graph Trajectory Testing
After individual nodes check out, test whether the full graph reaches the right terminal state from given inputs. This is where you catch routing bugs and infinite loops.
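    from langchain_core.messages import HumanMessage

    from your_agent import build_graph

    def test_support_ticket_creation_flow():
        graph = build_graph().compile()
        final_state = graph.invoke({
            "messages": [HumanMessage(content="My order hasn't arrived and it's been 2 weeks")]
        })
        # Assert terminal state
        assert final_state["status"] == "ticket_created"
        assert final_state["ticket_id"] is not None
        assert "support" in final_state["intent"]
        # Assert the graph didn't loop excessively
        assert final_state.get("iteration_count", 0) < 5

Add an iteration counter to your state if you have cycles. Loops that work correctly should terminate in a bounded number of steps. LangGraph's recursion_limit config value (25 by default) is a backstop that raises GraphRecursionError when a run exceeds it, but an explicit counter lets the test assert on how many iterations actually happened.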
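The counter pattern can be sketched in plain Python. Everything here is hypothetical: retry_step stands in for a looping node, route_after_retry for its conditional edge, and run_loop for the graph runtime, with an assumed budget of five iterations.

```python
MAX_ITERATIONS = 5  # assumed budget; pick one that fits your graph


def retry_step(state: dict) -> dict:
    """Hypothetical looping node: does its work and bumps the counter."""
    return {"iteration_count": state.get("iteration_count", 0) + 1}


def route_after_retry(state: dict) -> str:
    """Leave the cycle when the work is done or the budget is spent."""
    if state.get("resolved"):
        return "finish"
    if state.get("iteration_count", 0) >= MAX_ITERATIONS:
        return "human_escalation"
    return "retry_step"


def run_loop(state: dict) -> tuple:
    """Tiny driver standing in for the graph runtime: node, then router, repeat."""
    state = dict(state)
    while True:
        state.update(retry_step(state))
        next_node = route_after_retry(state)
        if next_node != "retry_step":
            return state, next_node


def test_unresolvable_request_escalates_within_budget() -> None:
    final_state, exit_node = run_loop({"resolved": False})
    assert exit_node == "human_escalation"
    assert final_state["iteration_count"] == MAX_ITERATIONS


test_unresolvable_request_escalates_within_budget()
```

Putting the budget check in the router means a runaway loop ends in a defined state (escalation here) that a test can assert on, instead of surfacing as an exception.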
The Trajectory Problem
Individual state assertions tell you where the agent ended up. They don't tell you how it got there. An agent that reaches the right final state via the wrong path is a bug waiting to happen in production.
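LangChain's AgentEvals package (installed as agentevals) lets you evaluate the sequence of node executions, not just the final output:

    # Schematic: the import path and the evaluate_trajectory call are illustrative.
    # Check the AgentEvals documentation for the evaluator names and signatures
    # in the version you install.
    from langchain_agent_evals import evaluate_trajectory

    expected_trajectory = [
        "classify_intent",
        "support_handler",
        "ticket_creation",
        "notify_customer"
    ]

    # Assumes graph and initial_state are set up as in the previous test, and that
    # the state records visited node names under a "trajectory" key.
    result = graph.invoke(initial_state)
    score = evaluate_trajectory(
        actual=result["trajectory"],
        expected=expected_trajectory
    )
    assert score > 0.9

This catches cases where your agent reaches the right answer through accidental reasoning, which is fragile behavior that breaks on slightly different inputs.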
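If the state doesn't already record a trajectory, one way to obtain it is from streamed updates: with stream_mode="updates", graph.stream(...) yields dicts keyed by the name of the node that just ran. The sketch below works on such chunks as plain dicts, so it runs without LangGraph installed. The helpers trajectory_from_updates and in_order_score are hypothetical, not AgentEvals functions, and the scoring rule is an assumption.

```python
from typing import Iterable


def trajectory_from_updates(chunks: Iterable[dict]) -> list:
    """Flatten update chunks ({node_name: update}) into an ordered list of node names."""
    return [node for chunk in chunks for node in chunk]


def in_order_score(actual: list, expected: list) -> float:
    """Fraction of expected nodes that appear in `actual` in the expected order."""
    if not expected:
        return 1.0
    position = 0
    matched = 0
    for node in expected:
        try:
            position = actual.index(node, position) + 1
        except ValueError:
            continue  # expected node never ran after the current position
        matched += 1
    return matched / len(expected)


# Chunks shaped like those from graph.stream(inputs, stream_mode="updates").
chunks = [
    {"classify_intent": {"intent": "support"}},
    {"support_handler": {"status": "created"}},
    {"ticket_creation": {"ticket_id": "tkt_1"}},
    {"notify_customer": {"notified": True}},
]
expected = ["classify_intent", "support_handler", "ticket_creation", "notify_customer"]

actual = trajectory_from_updates(chunks)
assert actual == expected  # strict match: same nodes, same order
assert in_order_score(actual, expected) == 1.0

# A run that skipped ticket creation scores lower even if the final state looks fine.
skipped = ["classify_intent", "support_handler", "notify_customer"]
assert in_order_score(skipped, expected) == 0.75
```

For deterministic routing, the strict equality check is usually the better assertion. A fractional score is more useful when some steps are allowed to vary between runs.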
What Code-Level Tests Miss
All of the above tests run against your graph in isolation. That's necessary but not sufficient for production LangGraph agents.
Production agents fail for reasons outside the graph:
- The LLM changes behavior — model updates shift classification accuracy. Your routing logic worked with the old model, fails with the new one.
- Tool integrations drift — the external API your agent calls changes its response format. The tool output still returns 200, but your parsing node breaks silently.
- State accumulation — in long-running multi-turn agents, state grows large enough to exceed context windows. Works in tests with short sessions, fails in production with real users.
- Concurrency — multiple users hit your agent simultaneously, shared external state gets corrupted.
These failure modes need ongoing monitoring against real production behavior, not just CI tests.
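The tool-drift case in particular can be turned from a silent failure into a loud one by validating the response shape inside the parsing node, so that both CI and monitoring see an explicit error. A sketch, assuming a hypothetical order-status tool that returns order_id and status fields; the field names and ToolContractError are illustrative.

```python
class ToolContractError(ValueError):
    """Raised when a tool response no longer matches the shape the node expects."""


REQUIRED_FIELDS = {"order_id": str, "status": str}  # assumed contract


def parse_order_status(payload: dict) -> dict:
    """Hypothetical parsing helper: validate first, then build the state update."""
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in payload:
            raise ToolContractError(f"missing field: {field}")
        if not isinstance(payload[field], expected_type):
            raise ToolContractError(
                f"{field} should be {expected_type.__name__}, "
                f"got {type(payload[field]).__name__}"
            )
    return {"order_id": payload["order_id"], "order_status": payload["status"]}


def test_parse_accepts_current_format() -> None:
    update = parse_order_status({"order_id": "1234", "status": "shipped"})
    assert update == {"order_id": "1234", "order_status": "shipped"}


def test_parse_rejects_drifted_format() -> None:
    # The API renamed `status` to `state`: still a 200, but the contract is broken.
    try:
        parse_order_status({"order_id": "1234", "state": "shipped"})
    except ToolContractError as exc:
        assert "status" in str(exc)
    else:
        raise AssertionError("drifted payload was accepted")


test_parse_accepts_current_format()
test_parse_rejects_drifted_format()
```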
Monitoring LangGraph Agents in Production
Once your agent is live, you need behavioral monitoring — periodic test runs that verify your agent still behaves correctly as models, prompts, and external dependencies change.
HelpMeTest lets you write natural language tests against your deployed agent endpoint and run them on a schedule. No test infrastructure to maintain:
    Test: customer support agent ticket creation
    When user says: "My order hasn't arrived after 2 weeks"
    Then: response contains ticket confirmation
    And: agent does not ask for clarification more than once
    And: response time under 10 seconds

Tests run every hour. If your agent's routing breaks after a model update or a tool API change, you find out before your users do.
Free tier: 10 tests, unlimited health checks.
LangGraph Testing Checklist
Before shipping any LangGraph agent:
- Unit tests for every node function with mocked LLM calls
- Explicit tests for every routing branch including fallback paths
- Full graph trajectory tests from representative initial states
- Cycle detection — verify loops terminate in bounded steps
- Tests for the error path — what happens when a node fails
- State schema validation — does your graph handle unexpected state shapes
- Concurrency test if multiple users can invoke the same agent
- Production monitoring for behavioral drift after model/tool updates
The graph compiling is table stakes. The trajectory being correct is what matters.