Testing LangChain Applications: Unit Testing Chains, Mocking LLMs, Eval Harnesses

LangChain applications are complex pipelines: prompts, chains, retrievers, tools, agents, and memory all interact. Testing them requires strategies that go beyond standard unit testing — you need to mock LLMs, test chains in isolation, and run eval harnesses that verify output quality over representative datasets. This guide covers all three layers.

Key Takeaways

Mock the LLM in unit tests. LangChain provides FakeListLLM and FakeStreamingListLLM for deterministic testing without API costs.

Test each chain component in isolation. Don't test the entire pipeline when you want to verify prompt formatting or output parsing. Break it down.

Use LangChain's built-in testing utilities. Fake chat models such as FakeListChatModel and fake embeddings let you replace the model and embedding APIs in tests.

Evaluation datasets are your ground truth. Build a representative set of inputs/expected outputs and run your chain against it on every significant change.

Test error handling and fallbacks. LLM API calls fail. Chains should handle timeouts, rate limits, and unexpected output formats gracefully.

LangChain Testing Architecture

A LangChain application typically has these testable layers:

User Input
  ↓
Input Validation / Preprocessing
  ↓
Prompt Template (testable: formatting, variable injection)
  ↓
LLM Call (mockable: FakeListLLM, FakeStreamingListLLM)
  ↓
Output Parser (testable: parsing, validation, error handling)
  ↓
Tool Calls / Retrieval (mockable: fake retrievers, tool mocks)
  ↓
Final Response

Each layer has different testing strategies. Unit tests verify individual components. Integration tests verify the assembled chain with a real (or fake) LLM. Eval harnesses verify output quality over datasets.


Unit Testing with Fake LLMs

LangChain provides fake LLM implementations that return predetermined responses — enabling fast, deterministic tests that don't hit the API.

FakeListLLM

from langchain_community.llms.fake import FakeListLLM
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

def test_summarization_chain_formats_prompt():
    """Verify the chain formats the input correctly and returns the LLM response."""
    fake_llm = FakeListLLM(responses=["This is a summary of the document."])

    template = PromptTemplate(
        input_variables=["document"],
        template="Summarize the following document in one sentence:\n\n{document}"
    )

    chain = LLMChain(llm=fake_llm, prompt=template)
    result = chain.invoke({"document": "LangChain is a framework for building LLM applications."})

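    assert result["text"] == "This is a summary of the document."
    # FakeListLLM does not record the prompts it receives, so verify the
    # rendered prompt by formatting the template directly
    rendered = template.format(document="LangChain is a framework for building LLM applications.")
    assert "LangChain is a framework" in rendered
    assert "Summarize" in rendered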

FakeStreamingListLLM

from langchain_community.llms.fake import FakeStreamingListLLM

def test_streaming_chain():
    """Test streaming response handling."""
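    # Each entry in `responses` is one complete reply; stream() yields that
    # reply one character at a time
    fake_llm = FakeStreamingListLLM(responses=["Hello world!"])

    collected = []
    for chunk in fake_llm.stream("Say hello"):
        collected.append(chunk)

    assert len(collected) > 1  # the reply arrived in several chunks
    assert "".join(collected) == "Hello world!"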

Fake Embeddings

from langchain_community.embeddings import FakeEmbeddings
from langchain_community.vectorstores import FAISS  # requires the faiss-cpu package

def test_retriever_returns_relevant_documents():
    """Test retriever logic with deterministic fake embeddings."""
    embeddings = FakeEmbeddings(size=1536)

    vectorstore = FAISS.from_texts(
        texts=[
            "Return policy: 30 days with receipt",
            "Shipping: 5-7 business days standard",
            "Support: email support@example.com"
        ],
        embedding=embeddings
    )

    retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
    docs = retriever.invoke("What is the return policy?")

    assert len(docs) == 2
    # With fake embeddings, results are not semantically meaningful,
    # but we can verify the retriever mechanics work
    assert all(hasattr(doc, "page_content") for doc in docs)

Testing Prompt Templates

Prompt templates are logic — they concatenate strings, inject variables, and format context. Test them independently:

from langchain.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate
import pytest

def test_chat_prompt_template_variables():
    """Verify the prompt template accepts the correct variables."""
    system = SystemMessagePromptTemplate.from_template(
        "You are a helpful assistant. Today's date is {date}."
    )
    human = HumanMessagePromptTemplate.from_template("{question}")

    chat_prompt = ChatPromptTemplate.from_messages([system, human])

    assert set(chat_prompt.input_variables) == {"date", "question"}

def test_prompt_formats_correctly():
    """Verify prompt formatting with known inputs."""
    template = ChatPromptTemplate.from_messages([
        ("system", "You are a {role}. Always respond in {language}."),
        ("human", "{question}")
    ])

    messages = template.format_messages(
        role="customer service agent",
        language="English",
        question="What is your return policy?"
    )

    assert "customer service agent" in messages[0].content
    assert "English" in messages[0].content
    assert "return policy" in messages[1].content

def test_prompt_raises_on_missing_variable():
    """Missing required variables should raise an error."""
    template = ChatPromptTemplate.from_template("Hello, {name}!")

    with pytest.raises(KeyError):
        template.format_messages()  # Missing 'name'

Testing Output Parsers

Output parsers transform LLM text responses into structured data. They're easy to unit test because they're pure functions:

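from typing import Literal

from langchain.output_parsers import PydanticOutputParser
from langchain.pydantic_v1 import BaseModel, Field
import pytest

class ProductReview(BaseModel):
    # Literal restricts the field to these values; a plain `str` would accept any text
    sentiment: Literal["positive", "negative", "neutral"] = Field(description="positive, negative, or neutral")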
    score: float = Field(description="score from 0 to 1", ge=0, le=1)
    summary: str = Field(description="one sentence summary")

parser = PydanticOutputParser(pydantic_object=ProductReview)

def test_parser_extracts_valid_json():
    """Parser correctly extracts structured data from LLM output."""
    llm_output = '''
    ```json
    {
        "sentiment": "positive",
        "score": 0.85,
        "summary": "Great product, fast shipping."
    }
    ```
    '''
    result = parser.parse(llm_output)

    assert result.sentiment == "positive"
    assert result.score == 0.85
    assert isinstance(result, ProductReview)

def test_parser_rejects_invalid_sentiment():
    """Parser validates enum values."""
    invalid_output = '{"sentiment": "amazing", "score": 0.9, "summary": "Great!"}'

    with pytest.raises(Exception):
        parser.parse(invalid_output)

def test_parser_rejects_out_of_range_score():
    """Parser validates numeric ranges."""
    invalid_output = '{"sentiment": "positive", "score": 1.5, "summary": "Great!"}'

    with pytest.raises(Exception):
        parser.parse(invalid_output)

Testing Agents and Tools

LangChain agents dynamically choose which tools to call. Testing them requires mocking both the LLM (to control which tool is selected) and the tools themselves:

from langchain.agents import AgentExecutor, create_react_agent
from langchain.prompts import PromptTemplate
from langchain.tools import tool
from langchain_core.tools import ToolException
from langchain_community.llms.fake import FakeListLLM

# Minimal ReAct prompt for tests. create_react_agent requires the
# {tools}, {tool_names}, and {agent_scratchpad} variables.
REACT_PROMPT = PromptTemplate.from_template(
    "Answer the question. You have access to these tools:\n{tools}\n\n"
    "Use this format:\n"
    "Thought: reason about what to do\n"
    "Action: one of [{tool_names}]\n"
    "Action Input: the input to the action\n"
    "Observation: the result of the action\n"
    "Final Answer: the answer to the question\n\n"
    "Question: {input}\n"
    "Thought:{agent_scratchpad}"
)

@tool
def get_weather(location: str) -> str:
    """Get current weather for a location."""
    return f"Sunny, 22°C in {location}"

@tool
def search_database(query: str) -> str:
    """Search the product database."""
    return f"Found 5 products matching '{query}'"

def test_agent_calls_correct_tool():
    """Agent selects the appropriate tool based on the query."""
    # Control the LLM's tool selection via fake responses
    # ReAct format: Thought -> Action -> Action Input -> Observation -> Final Answer
    fake_responses = [
        "Thought: I need to search the database for this product.\nAction: search_database\nAction Input: blue headphones",
        "Final Answer: Found 5 products matching 'blue headphones'"
    ]
    fake_llm = FakeListLLM(responses=fake_responses)

    tools = [get_weather, search_database]
    agent = create_react_agent(fake_llm, tools, REACT_PROMPT)
    executor = AgentExecutor(
        agent=agent, tools=tools, verbose=False, return_intermediate_steps=True
    )

    result = executor.invoke({"input": "Find blue headphones"})
    assert "blue headphones" in result["output"]
    # The first intermediate step records which tool the agent actually called
    assert result["intermediate_steps"][0][0].tool == "search_database"

def test_agent_handles_tool_error():
    """Agent recovers gracefully when a tool raises an error."""
    @tool
    def failing_tool(query: str) -> str:
        """A tool that always fails."""
        raise ToolException("Database unavailable")

    # Only ToolException is caught, and only when handle_tool_error is set.
    # The error text is then passed back to the agent as the observation.
    # Any other exception type propagates out of the executor.
    failing_tool.handle_tool_error = True

    fake_llm = FakeListLLM(responses=[
        "Action: failing_tool\nAction Input: test",
        "Thought: The tool failed. I'll note this.\nFinal Answer: Unable to complete request due to system error."
    ])

    executor = AgentExecutor(
        agent=create_react_agent(fake_llm, [failing_tool], REACT_PROMPT),
        tools=[failing_tool],
        handle_parsing_errors=True,
        max_iterations=3
    )

    result = executor.invoke({"input": "Run the failing tool"})
    # Should not raise, should return an error message
    assert "Unable to complete request" in result["output"]

Testing RAG Chains

Retrieval-Augmented Generation chains combine retrieval with generation. Test them by mocking the retriever:

from langchain.chains import RetrievalQA
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever
from langchain_community.llms.fake import FakeListLLM

class StubRetriever(BaseRetriever):
    """Returns fixed documents and records the queries it receives.

    RetrievalQA validates that its retriever is a BaseRetriever,
    so a bare MagicMock is rejected at construction time.
    """
    docs: list[Document] = []
    queries: list[str] = []

    def _get_relevant_documents(self, query, *, run_manager=None):
        self.queries.append(query)
        return self.docs

def test_rag_chain_uses_retrieved_context():
    """RAG chain queries the retriever and returns the retrieved documents."""
    fake_docs = [
        Document(page_content="HelpMeTest Pro costs $100/month.", metadata={"source": "pricing.html"}),
        Document(page_content="Free plan includes 10 tests.", metadata={"source": "pricing.html"})
    ]

    retriever = StubRetriever(docs=fake_docs)

    fake_llm = FakeListLLM(responses=["HelpMeTest Pro costs $100 per month."])

    chain = RetrievalQA.from_chain_type(
        llm=fake_llm,
        retriever=retriever,
        return_source_documents=True
    )

    result = chain.invoke({"query": "How much does HelpMeTest Pro cost?"})

    # Verify the retriever was called once, with the user's query
    assert retriever.queries == ["How much does HelpMeTest Pro cost?"]
    assert "$100" in result["result"]
    assert len(result["source_documents"]) == 2

def test_rag_chain_handles_empty_retrieval():
    """RAG chain responds appropriately when no documents are retrieved."""
    retriever = StubRetriever(docs=[])  # No documents found

    fake_llm = FakeListLLM(responses=["I don't have information about that topic."])

    chain = RetrievalQA.from_chain_type(
        llm=fake_llm,
        retriever=retriever
    )

    result = chain.invoke({"query": "What is the answer to life?"})
    assert result["result"] == "I don't have information about that topic."

Building an Evaluation Harness

Beyond unit tests, evaluate your chain over a dataset of representative examples:

# eval/evaluate_chain.py
import json
from dataclasses import dataclass
from langchain.evaluation import load_evaluator
from myapp.chain import build_qa_chain

@dataclass
class EvalResult:
    question: str
    answer: str
    expected: str
    score: float
    passed: bool

def run_evaluation(eval_dataset_path: str, threshold: float = 0.7) -> list[EvalResult]:
    """Evaluate the QA chain against a dataset."""
    chain = build_qa_chain()
    evaluator = load_evaluator("labeled_criteria", criteria="correctness")

    with open(eval_dataset_path) as f:
        dataset = json.load(f)

    results = []
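    for item in dataset:
        # RetrievalQA-style chains read the input from "query" and return "result";
        # adjust both keys if build_qa_chain() returns a different chain type
        answer = chain.invoke({"query": item["question"]})["result"]

        eval_result = evaluator.evaluate_strings(
            prediction=answer,
            reference=item["expected_answer"],
            input=item["question"]
        )

        # The criteria evaluator is an LLM judge that scores 1 (criterion met)
        # or 0 (not met); the score is None if the verdict cannot be parsed
        score = eval_result.get("score") or 0.0

        results.append(EvalResult(
            question=item["question"],
            answer=answer,
            expected=item["expected_answer"],
            score=score,
            passed=score >= threshold
        ))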

    return results

# pytest integration
def test_qa_chain_eval_harness():
    """Full evaluation of QA chain against test dataset."""
    results = run_evaluation("tests/fixtures/qa_eval_dataset.json", threshold=0.7)

    pass_rate = sum(1 for r in results if r.passed) / len(results)
    failed = [r for r in results if not r.passed]

    assert pass_rate >= 0.85, (
        f"Eval pass rate {pass_rate:.0%} below 85%. "
        f"Failed examples:\n" +
        "\n".join(f"  Q: {r.question}\n  Expected: {r.expected}\n  Got: {r.answer}" for r in failed[:3])
    )
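
The harness above reads a JSON list of objects with question and expected_answer keys. A malformed fixture would otherwise surface as a KeyError halfway through an expensive run, so it can be worth validating the file first. The sketch below is one way to do that with the standard library only; validate_eval_dataset and the example entries are illustrative, not part of LangChain.

```python
import json

REQUIRED_KEYS = {"question", "expected_answer"}

# Illustrative fixture content, in the shape run_evaluation() expects
EXAMPLE_DATASET = """
[
  {"question": "How long is the return window?", "expected_answer": "30 days with receipt"},
  {"question": "How long does standard shipping take?", "expected_answer": "5-7 business days"}
]
"""


def validate_eval_dataset(raw: str) -> list:
    """Parse the dataset and fail fast if any entry is malformed."""
    dataset = json.loads(raw)
    if not isinstance(dataset, list) or not dataset:
        raise ValueError("eval dataset must be a non-empty JSON list")
    for index, item in enumerate(dataset):
        if not isinstance(item, dict):
            raise ValueError(f"item {index} is not a JSON object")
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            raise ValueError(f"item {index} is missing keys: {sorted(missing)}")
    return dataset


def test_example_dataset_is_valid():
    assert len(validate_eval_dataset(EXAMPLE_DATASET)) == 2
```

Rejecting an empty list also guards the pass-rate calculation against division by zero.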

Testing Memory and State

LangChain memory components maintain conversation history. Test that context is preserved and that the history passed to the model stays bounded:

from langchain.memory import ConversationBufferMemory

def test_buffer_memory_retains_history():
    """Memory correctly stores and retrieves conversation history."""
    memory = ConversationBufferMemory()
    memory.chat_memory.add_user_message("My name is Alice.")
    memory.chat_memory.add_ai_message("Nice to meet you, Alice!")
    memory.chat_memory.add_user_message("What is my name?")
    memory.chat_memory.add_ai_message("Your name is Alice.")

    history = memory.load_memory_variables({})
    assert "Alice" in history["history"]

def test_memory_window_limits_context():
    """Windowed memory only exposes the most recent exchanges."""
    from langchain.memory import ConversationBufferWindowMemory

    # Keep last 2 exchanges; return_messages=True yields message objects
    memory = ConversationBufferWindowMemory(k=2, return_messages=True)

    for i in range(10):
        memory.chat_memory.add_user_message(f"Message {i}")
        memory.chat_memory.add_ai_message(f"Response {i}")

    # The window is applied when history is loaded. The underlying
    # chat_memory still stores all 20 messages.
    messages = memory.load_memory_variables({})["history"]
    # Should only contain the last 2 exchanges (4 messages)
    assert len(messages) == 4, f"Memory has {len(messages)} messages, expected 4"
    assert messages[-1].content == "Response 9"

CI Configuration

# .github/workflows/langchain-tests.yml
name: LangChain Tests
on:
  push:
    paths:
      - 'src/**'
      - 'tests/**'

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install langchain langchain-community pytest faiss-cpu
      - run: pytest tests/unit/ -v -m "not eval"

  eval-tests:
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'  # Only on main
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install langchain langchain-community langchain-openai pytest
      - run: pytest tests/eval/ -v
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
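
The unit-test job deselects tests with -m "not eval", which only has an effect if eval tests carry that marker and pytest knows about it. One way to register the marker is a conftest.py hook, sketched below. The file location and the marker description are assumptions for this example.

```python
# tests/conftest.py (assumed location)

def pytest_configure(config):
    """Register the custom 'eval' marker so pytest does not warn about it."""
    config.addinivalue_line(
        "markers",
        "eval: slow tests that call a real LLM to score chain outputs",
    )
```

Eval tests can then be decorated with @pytest.mark.eval so the unit-test job skips them even if they live outside tests/eval/.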

Conclusion

Testing LangChain applications well requires working at multiple layers: unit tests for prompts, parsers, and chain logic using fake LLMs; integration tests for assembled chains; and eval harnesses that measure output quality over representative datasets.

The fake LLM utilities (FakeListLLM, fake embeddings) are the most valuable tools — they make LangChain tests fast, deterministic, and cost-free. Start there. Add evaluation harnesses when your chain matures and you have a representative dataset to validate against.

HelpMeTest complements this with end-to-end browser tests that verify your LangChain application works correctly from the user's perspective — ensuring the full stack, not just the chain logic, functions as expected on every deployment.
