Unit Testing LangGraph StateGraph Nodes with pytest: Mocks, State Reducers, and Checkpointers

Most LangGraph developers test their graphs end-to-end: compile the graph, call .invoke(), assert on the final output. This works until something goes wrong. When a multi-hop workflow produces a wrong answer, you have no idea which node introduced the error, whether it was a routing decision or a state mutation, or whether the LLM even behaved the way you expected.

Unit testing individual StateGraph nodes solves this. Each node is a plain Python function: it takes the current state and returns a state update. You can test it directly without compiling the graph, without making LLM calls, and without running the rest of the pipeline.

This guide focuses on unit tests specifically. For production monitoring and integration testing, LangGraph's own documentation covers end-to-end patterns. Here we go granular: individual nodes, individual state reducers, and the checkpointer in isolation.

Setting Up Your Test Environment

pip install langgraph langchain-openai pytest pytest-mock pytest-cov

Your project structure should keep node functions importable separately from the compiled graph:

myagent/
  nodes/
    __init__.py
    researcher.py      # node functions
    writer.py
    router.py
  graph.py             # graph compilation (StateGraph, add_node, etc.)
  state.py             # TypedDict state definition
tests/
  test_researcher_node.py
  test_writer_node.py
  test_router.py
  test_state_reducers.py
  test_checkpointer.py

The key discipline: node functions must be importable and callable without importing the compiled graph. If your node functions live inside graph.py as closures, refactor them out first; otherwise you cannot test them in isolation.

Defining State and Nodes for Testability

# myagent/state.py
from typing import TypedDict, Annotated
from langchain_core.messages import BaseMessage
import operator

class AgentState(TypedDict):
    messages: Annotated[list[BaseMessage], operator.add]
    context: list[str]
    step_count: int
    final_answer: str | None

# myagent/nodes/researcher.py
from langchain_core.messages import HumanMessage, AIMessage
from myagent.state import AgentState

def research_node(state: AgentState, llm, retriever) -> AgentState:
    """Retrieves context and appends an AI message summarizing findings."""
    query = state["messages"][-1].content
    docs = retriever.invoke(query)
    
    summary_prompt = f"Summarize these documents for answering: {query}\n\nDocs: {docs}"
    response = llm.invoke([HumanMessage(content=summary_prompt)])
    
    return {
        "messages": [response],
        "context": docs,
        "step_count": state["step_count"] + 1,
        "final_answer": None,
    }

Notice llm and retriever are injected as parameters rather than imported as module-level singletons. This makes mocking trivial in tests.
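Because a graph node is normally called with the state alone, the extra llm and retriever parameters have to be bound before the function is registered. One way to do that is functools.partial. The following is a minimal sketch under stated assumptions: FakeLLM and FakeRetriever are hypothetical stand-ins, and the node is simplified to use plain strings instead of message objects so the example runs without LangChain installed.

```python
from functools import partial


class FakeRetriever:
    """Hypothetical stand-in for a real retriever."""

    def invoke(self, query: str) -> list[str]:
        return [f"doc about {query}"]


class FakeLLM:
    """Hypothetical stand-in for a chat model; returns a canned reply."""

    def invoke(self, prompt: str) -> str:
        return "summary"


def research_node(state: dict, llm, retriever) -> dict:
    # Simplified version of the node above: plain strings instead of messages.
    query = state["messages"][-1]
    docs = retriever.invoke(query)
    response = llm.invoke(f"Summarize for {query}: {docs}")
    return {
        "messages": [response],
        "context": docs,
        "step_count": state["step_count"] + 1,
    }


# Bind the dependencies once. The resulting callable takes only `state`,
# which is the call shape a graph node is expected to have.
bound_research_node = partial(
    research_node, llm=FakeLLM(), retriever=FakeRetriever()
)

update = bound_research_node({"messages": ["What is RAG?"], "step_count": 0})
```

In graph.py, the bound callable is what would be registered, for example graph.add_node("researcher", bound_research_node) on a StateGraph instance assumed to be named graph. Tests keep calling the unbound research_node with mocks.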

Unit Testing Nodes with pytest-mock

# tests/test_researcher_node.py
import pytest
from unittest.mock import MagicMock
from langchain_core.messages import HumanMessage, AIMessage
from myagent.nodes.researcher import research_node
from myagent.state import AgentState

@pytest.fixture
def mock_llm():
    llm = MagicMock()
    llm.invoke.return_value = AIMessage(
        content="Based on the documents, RAG is a technique that combines retrieval with generation."
    )
    return llm

@pytest.fixture
def mock_retriever():
    retriever = MagicMock()
    retriever.invoke.return_value = [
        "Document 1: RAG stands for Retrieval-Augmented Generation.",
        "Document 2: RAG improves factuality by grounding outputs in retrieved text.",
    ]
    return retriever

@pytest.fixture
def initial_state() -> AgentState:
    return {
        "messages": [HumanMessage(content="What is RAG?")],
        "context": [],
        "step_count": 0,
        "final_answer": None,
    }

def test_research_node_adds_ai_message(initial_state, mock_llm, mock_retriever):
    result = research_node(initial_state, llm=mock_llm, retriever=mock_retriever)
    
    assert len(result["messages"]) == 1
    assert isinstance(result["messages"][0], AIMessage)

def test_research_node_populates_context(initial_state, mock_llm, mock_retriever):
    result = research_node(initial_state, llm=mock_llm, retriever=mock_retriever)
    
    assert len(result["context"]) == 2
    assert "RAG" in result["context"][0]

def test_research_node_increments_step_count(initial_state, mock_llm, mock_retriever):
    result = research_node(initial_state, llm=mock_llm, retriever=mock_retriever)
    
    assert result["step_count"] == 1

def test_research_node_calls_retriever_with_query(initial_state, mock_llm, mock_retriever):
    research_node(initial_state, llm=mock_llm, retriever=mock_retriever)
    
    mock_retriever.invoke.assert_called_once_with("What is RAG?")

def test_research_node_does_not_set_final_answer(initial_state, mock_llm, mock_retriever):
    result = research_node(initial_state, llm=mock_llm, retriever=mock_retriever)
    
    assert result["final_answer"] is None

These five tests make zero network calls, so they typically finish in milliseconds, and each one checks exactly one behavior. If research_node breaks, the failing test name tells you which behavior regressed.
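The same mocks can also verify what the node sent to its dependencies, and how it behaves when one of them fails. The sketch below is self-contained and rests on assumptions: Message is a hypothetical stand-in for the langchain_core message classes, and research_node is a trimmed copy of the node above. It uses call_args and side_effect from unittest.mock.

```python
from dataclasses import dataclass
from unittest.mock import MagicMock


@dataclass
class Message:
    """Hypothetical stand-in for langchain_core's message classes."""

    content: str


def research_node(state: dict, llm, retriever) -> dict:
    # Trimmed copy of the node under test, using the stand-in Message class.
    query = state["messages"][-1].content
    docs = retriever.invoke(query)
    prompt = f"Summarize these documents for answering: {query}\n\nDocs: {docs}"
    response = llm.invoke([Message(content=prompt)])
    return {"messages": [response], "context": docs}


def make_state() -> dict:
    return {"messages": [Message(content="What is RAG?")]}


def test_prompt_includes_query_and_docs():
    llm = MagicMock()
    llm.invoke.return_value = Message(content="summary")
    retriever = MagicMock()
    retriever.invoke.return_value = ["RAG stands for Retrieval-Augmented Generation."]

    research_node(make_state(), llm=llm, retriever=retriever)

    # call_args.args holds the positional arguments of the most recent call.
    (sent_messages,) = llm.invoke.call_args.args
    prompt = sent_messages[0].content
    assert "What is RAG?" in prompt
    assert "Retrieval-Augmented Generation" in prompt


def test_llm_not_called_when_retriever_fails():
    llm = MagicMock()
    retriever = MagicMock()
    # side_effect makes the mock raise instead of returning a value.
    retriever.invoke.side_effect = TimeoutError("retriever unavailable")

    try:
        research_node(make_state(), llm=llm, retriever=retriever)
    except TimeoutError:
        pass
    else:
        raise AssertionError("expected TimeoutError to propagate")

    llm.invoke.assert_not_called()
```

Under pytest, the try/except in the second test would usually be written as a pytest.raises(TimeoutError) block; it is spelled out here only to keep the sketch free of third-party imports.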

Testing State Reducer Functions

LangGraph state reducers merge incoming updates with existing state. The operator.add reducer on messages concatenates the existing list with the new one. Custom reducers can be more complex and deserve their own tests.

# myagent/state.py (extended)
def merge_context(existing: list[str], new: list[str]) -> list[str]:
    """Merges new context, deduplicating by content."""
    existing_set = set(existing)
    return existing + [doc for doc in new if doc not in existing_set]

# tests/test_state_reducers.py
from myagent.state import merge_context

def test_merge_context_appends_new_documents():
    existing = ["doc A", "doc B"]
    new = ["doc C", "doc D"]
    result = merge_context(existing, new)
    assert result == ["doc A", "doc B", "doc C", "doc D"]

def test_merge_context_deduplicates():
    existing = ["doc A", "doc B"]
    new = ["doc B", "doc C"]
    result = merge_context(existing, new)
    assert result == ["doc A", "doc B", "doc C"]
    assert result.count("doc B") == 1

def test_merge_context_handles_empty_existing():
    result = merge_context([], ["doc A"])
    assert result == ["doc A"]

def test_merge_context_handles_empty_new():
    result = merge_context(["doc A"], [])
    assert result == ["doc A"]

Reducer bugs are subtle and cause state corruption that is almost impossible to debug from end-to-end test output. Test them directly.
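To see why a reducer bug ends up corrupting state, it can help to model how an update is folded into the existing state. The sketch below is a rough mental model, not LangGraph's actual implementation: apply_update and REDUCERS are hypothetical names, and the rule assumed is "use the reducer where one is declared, otherwise overwrite".

```python
import operator


def merge_context(existing: list[str], new: list[str]) -> list[str]:
    """Reducer from above: append new docs, skipping ones already present."""
    existing_set = set(existing)
    return existing + [doc for doc in new if doc not in existing_set]


# Hypothetical mapping of state keys to reducers, mirroring Annotated[...] fields.
REDUCERS = {"messages": operator.add, "context": merge_context}


def apply_update(state: dict, update: dict) -> dict:
    """Simplified model of a state merge: reduce if a reducer exists, else overwrite."""
    merged = dict(state)
    for key, value in update.items():
        reducer = REDUCERS.get(key)
        if reducer is not None and key in state:
            merged[key] = reducer(state[key], value)
        else:
            merged[key] = value
    return merged


state = {"messages": ["hi"], "context": ["doc A"], "step_count": 0}
update = {"messages": ["hello"], "context": ["doc A", "doc B"], "step_count": 1}
merged = apply_update(state, update)

# Duplicates inside `new` itself are not filtered, because the set is built
# from `existing` only. Whether that is acceptable is a design decision.
within_batch = merge_context(["doc A"], ["doc B", "doc B"])
```

The last line points at a case the four tests above do not cover: merge_context only deduplicates against the existing list, so a repeated document within a single update passes through twice. A fifth test pinning down the intended behavior would make that choice explicit.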

Testing Conditional Edges

Conditional edges in LangGraph are routing functions — they take state and return the name of the next node. They are pure functions and should be tested exhaustively.

# myagent/nodes/router.py
from myagent.state import AgentState

def should_continue(state: AgentState) -> str:
    """Decides whether to keep researching or produce a final answer."""
    if state["step_count"] >= 3:
        return "writer"
    if state["final_answer"] is not None:
        return "writer"
    # Otherwise keep researching, whether or not any context has been gathered yet.
    return "researcher"

# tests/test_router.py
from myagent.nodes.router import should_continue
from langchain_core.messages import HumanMessage

def base_state():
    return {
        "messages": [HumanMessage(content="test")],
        "context": [],
        "step_count": 0,
        "final_answer": None,
    }

def test_routes_to_researcher_when_no_context():
    state = base_state()
    assert should_continue(state) == "researcher"

def test_routes_to_researcher_when_context_exists_and_steps_low():
    state = {**base_state(), "context": ["doc1"], "step_count": 1}
    assert should_continue(state) == "researcher"

def test_routes_to_writer_after_max_steps():
    state = {**base_state(), "context": ["doc1"], "step_count": 3}
    assert should_continue(state) == "writer"

def test_routes_to_writer_when_final_answer_set():
    state = {**base_state(), "final_answer": "The answer is 42"}
    assert should_continue(state) == "writer"

Every branch of every conditional edge deserves a test. If you have a routing function with five branches, you need five tests. This is where most LangGraph debugging time gets spent — make it cheap to catch routing errors early.
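When the branches multiply, a table of cases keeps the per-branch tests cheap to extend. The sketch below assumes a copy of should_continue with the same behavior as the one above, so it runs on its own; BASE, CASES, and check_routing are hypothetical names. It also adds boundary cases around the step limit.

```python
def should_continue(state: dict) -> str:
    """Copy of the routing function above, with equivalent behavior."""
    if state["step_count"] >= 3:
        return "writer"
    if state["final_answer"] is not None:
        return "writer"
    return "researcher"


BASE = {"messages": [], "context": [], "step_count": 0, "final_answer": None}

# Each case is (state overrides, expected next node).
CASES = [
    ({}, "researcher"),
    ({"context": ["doc1"], "step_count": 2}, "researcher"),  # just below the limit
    ({"context": ["doc1"], "step_count": 3}, "writer"),      # exactly at the limit
    ({"step_count": 4}, "writer"),                           # above the limit
    ({"final_answer": "The answer is 42"}, "writer"),
    ({"final_answer": ""}, "writer"),  # empty string is not None
]


def check_routing(overrides: dict, expected: str) -> None:
    state = {**BASE, **overrides}
    assert should_continue(state) == expected, (overrides, expected)


for overrides, expected in CASES:
    check_routing(overrides, expected)
```

Under pytest, the same table can feed @pytest.mark.parametrize("overrides, expected", CASES) on a test function, so each row is reported as its own test. The empty-string case is worth a look: because the check is "is not None", an empty final answer still routes to the writer.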

Testing MemorySaver Checkpointer Persistence

The MemorySaver checkpointer stores conversation state across invocations. Testing it verifies that state persists correctly between turns and that different thread IDs remain isolated.

# tests/test_checkpointer.py
import pytest
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import StateGraph, END
from langchain_core.messages import HumanMessage, AIMessage
from myagent.state import AgentState

def make_test_graph(checkpointer):
    """Minimal graph for checkpointer testing."""
    def echo_node(state: AgentState) -> AgentState:
        last = state["messages"][-1].content
        return {
            "messages": [AIMessage(content=f"Echo: {last}")],
            "context": state["context"],
            "step_count": state["step_count"] + 1,
            "final_answer": None,
        }
    
    graph = StateGraph(AgentState)
    graph.add_node("echo", echo_node)
    graph.set_entry_point("echo")
    graph.add_edge("echo", END)
    return graph.compile(checkpointer=checkpointer)

@pytest.fixture
def memory():
    return MemorySaver()

def test_checkpointer_persists_messages_across_turns(memory):
    graph = make_test_graph(memory)
    config = {"configurable": {"thread_id": "test-thread-1"}}
    
    # First turn
    result1 = graph.invoke(
        {"messages": [HumanMessage(content="Hello")], "context": [], "step_count": 0, "final_answer": None},
        config=config
    )
    assert result1["messages"][-1].content == "Echo: Hello"
    
    # Second turn — history should be preserved in the checkpointer
    result2 = graph.invoke(
        {"messages": [HumanMessage(content="World")], "context": [], "step_count": 0, "final_answer": None},
        config=config
    )
    # The graph accumulated messages across both turns
    assert any("Hello" in m.content for m in result2["messages"])
    assert any("World" in m.content for m in result2["messages"])

def test_checkpointer_isolates_threads(memory):
    graph = make_test_graph(memory)
    
    config_a = {"configurable": {"thread_id": "thread-A"}}
    config_b = {"configurable": {"thread_id": "thread-B"}}
    
    initial = {"messages": [HumanMessage(content="test")], "context": [], "step_count": 0, "final_answer": None}
    
    graph.invoke(initial, config=config_a)
    
    # Thread B should have no knowledge of Thread A's history
    checkpoint_b = memory.get(config_b)
    assert checkpoint_b is None

def test_checkpointer_get_state(memory):
    graph = make_test_graph(memory)
    config = {"configurable": {"thread_id": "state-test"}}
    
    graph.invoke(
        {"messages": [HumanMessage(content="test")], "context": [], "step_count": 0, "final_answer": None},
        config=config
    )
    
    state_snapshot = graph.get_state(config)
    assert state_snapshot is not None
    assert len(state_snapshot.values["messages"]) > 0
    assert state_snapshot.values["step_count"] == 1

These tests verify the checkpointer without any LLM calls. The echo node is a stand-in that makes the graph deterministic — you are testing persistence and isolation, not LLM behavior.

Tracing with LangSmith in Tests

LangSmith tracing in tests is useful for debugging flaky integration tests. Wrap your test invocations with a traced run to capture exactly what happened at each step.

# conftest.py
import os
import pytest

@pytest.fixture(scope="session", autouse=True)
def configure_langsmith():
    """Enable LangSmith tracing in test runs if env vars are set."""
    if os.getenv("LANGCHAIN_API_KEY"):
        os.environ["LANGCHAIN_TRACING_V2"] = "true"
        os.environ["LANGCHAIN_PROJECT"] = f"test-{os.getenv('CI_COMMIT_SHA', 'local')}"
    yield

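With this fixture, any test that invokes a compiled graph is traced automatically when LANGCHAIN_API_KEY is present. The fixture sets LANGCHAIN_TRACING_V2 and LANGCHAIN_PROJECT itself, so CI only needs to supply the API key and CI_COMMIT_SHA; traces are then grouped into one project per commit. To group them per repository or per branch instead, change the project name the fixture builds.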

Do not enable tracing for pure unit tests that mock the LLM — there is nothing useful to trace and it adds latency. Use it selectively for integration tests where you need to debug why an agent made a particular decision.
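One way to make that selectivity explicit is to compute the tracing settings from two inputs: the environment and whether the test is an integration test. The helper below is a hypothetical sketch, not part of LangSmith or pytest. It only decides which variables to set, reusing the variable names from the fixture above.

```python
def tracing_env(environ: dict, is_integration: bool) -> dict:
    """Return the tracing-related env vars a test should run with.

    Hypothetical helper: tracing is enabled only for integration tests,
    and only when an API key is available.
    """
    if not is_integration or not environ.get("LANGCHAIN_API_KEY"):
        return {"LANGCHAIN_TRACING_V2": "false"}
    return {
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_PROJECT": f"test-{environ.get('CI_COMMIT_SHA', 'local')}",
    }


unit_settings = tracing_env({"LANGCHAIN_API_KEY": "key"}, is_integration=False)
ci_settings = tracing_env(
    {"LANGCHAIN_API_KEY": "key", "CI_COMMIT_SHA": "abc123"}, is_integration=True
)
```

A function-scoped fixture could apply the result with monkeypatch.setenv, using request.node.get_closest_marker("integration") to decide whether the current test is an integration test. Whether changes made at that point take effect depends on when the tracing client reads its environment, so treat this as a starting point to verify against your installed versions.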

Running Node Tests Independently of the Compiled Graph

A key benefit of this architecture is speed. You can run all node unit tests in under a second:

# Run only unit tests (fast, no integration)
pytest tests/ -m "not integration" -v

# Run integration tests separately
pytest tests/ -m integration -v

# Run with coverage to find untested nodes
pytest tests/ -m "not integration" --cov=myagent/nodes --cov-report=term-missing

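Define the marker so pytest does not warn about it. The pytest.ini form is shown here; in pyproject.toml the same markers entry goes under [tool.pytest.ini_options] as a list of strings.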

[pytest]
markers =
    integration: marks tests as integration tests (require real LLM calls or external services)

End-to-End Monitoring for LangGraph Applications

Unit tests catch logic errors. They do not catch model drift, latency regressions, or UI-level breakage when a new model version changes output formatting. Once your LangGraph application is in production, you need a layer of continuous health monitoring on top of your test suite.

HelpMeTest covers this layer — it runs end-to-end tests against your live application using Robot Framework and Playwright, checking that your AI-powered features work correctly from the user's perspective on a continuous basis. For teams shipping LangGraph applications, this means catching "the chatbot stopped responding correctly" before your users do. Health monitoring at $100/month is the operational complement to the unit test suite you just built.


The patterns in this guide — injectable dependencies, node functions separated from the compiled graph, explicit state reducers, thoroughly tested routing functions — are what make LangGraph applications maintainable at scale. Start here, keep the unit tests fast and isolated, and add integration tests only where the behavior of multiple nodes together cannot be verified any other way.
