How to Test Haystack Pipelines and RAG Applications

How to Test Haystack Pipelines and RAG Applications

You built a Haystack pipeline. The retriever finds relevant documents, the prompt template structures them correctly, and the generator produces clean answers. It works in your test environment. You deploy it. Three days later a customer reports that the chatbot is confidently answering with information from six months ago. The document store was never re-indexed after the last content update.

Haystack gives you a powerful framework for building RAG and LLM pipelines. It does not give you automated tests, retrieval quality monitoring, or regression detection. That part is yours to build.

Here's how to do it systematically.

What Makes Haystack Testing Different

Haystack pipelines are composed — you chain components (DocumentStore, Retriever, PromptBuilder, Generator) and data flows through them. This creates specific failure modes:

  • Retrieval drift — the document store is stale or the embeddings are mismatched
  • Prompt regression — a PromptBuilder template was edited and the downstream output changed
  • Component misconfiguration — a component's parameters changed between versions
  • Generator hallucination — the generator produces answers not grounded in retrieved documents
  • Pipeline connectivity — a component was swapped and the output schema changed

Testing each layer in isolation tells you which component is failing. Testing the pipeline end-to-end tells you if it all fits together.

Unit Testing Haystack Components

Individual Haystack components — custom converters, preprocessors, rankers — are pure functions. Test them directly.

# test_components.py
import pytest
from haystack.components.preprocessors import DocumentSplitter
from haystack import Document

def test_splitter_chunks_on_sentence_boundary():
    splitter = DocumentSplitter(split_by="sentence", split_length=2)
    doc = Document(content="First sentence. Second sentence. Third sentence.")
    result = splitter.run(documents=[doc])

    assert len(result["documents"]) == 2
    # First chunk: sentences 1-2
    assert "First sentence" in result["documents"][0].content
    # Second chunk: sentence 3
    assert "Third sentence" in result["documents"][1].content

def test_splitter_handles_empty_document():
    splitter = DocumentSplitter(split_by="word", split_length=100)
    doc = Document(content="")
    result = splitter.run(documents=[doc])
    # Should return a single empty-content document, not crash
    assert isinstance(result["documents"], list)

Component tests are fast and catch the majority of bugs before they reach pipeline integration.

Testing the Retriever

Retrieval quality is the highest-risk component in any RAG pipeline. A retriever returning zero results or irrelevant chunks produces bad answers every time — but doesn't raise an exception.

# test_retrieval.py
import pytest
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack import Document

@pytest.fixture
def document_store():
    store = InMemoryDocumentStore()
    store.write_documents([
        Document(content="The return policy allows returns within 30 days of purchase."),
        Document(content="Shipping takes 3-5 business days for standard delivery."),
        Document(content="Enterprise plans include priority support with a 4-hour SLA."),
    ])
    return store

def test_retriever_finds_relevant_document(document_store):
    retriever = InMemoryBM25Retriever(document_store=document_store)
    result = retriever.run(query="What is the return policy?")

    assert len(result["documents"]) > 0
    # The most relevant document should mention returns
    top_doc = result["documents"][0]
    assert "return" in top_doc.content.lower()

def test_retriever_top_result_score(document_store):
    retriever = InMemoryBM25Retriever(document_store=document_store, top_k=3)
    result = retriever.run(query="return policy refund")

    # Top result should have a meaningful score
    assert result["documents"][0].score is not None
    assert result["documents"][0].score > 0

def test_retriever_returns_empty_for_out_of_scope_query(document_store):
    retriever = InMemoryBM25Retriever(document_store=document_store, top_k=3)
    result = retriever.run(query="quantum physics dark matter")

    # Not zero necessarily — but scores should be low
    if result["documents"]:
        assert result["documents"][0].score < 0.1

For embedding-based retrievers (vector search), test that the embedding model produces consistent vectors and that semantic similarity queries return topically relevant documents.

Pipeline Integration Tests

Once components work in isolation, test the assembled pipeline. Use a real document store with a known corpus so you can assert on expected outputs.

# test_pipeline.py
import pytest
from haystack import Pipeline, Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from unittest.mock import MagicMock, patch

PROMPT_TEMPLATE = """
Answer the question using only the provided documents.
If the answer is not in the documents, say "I don't know."

Documents:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}

Question: {{ question }}
"""

@pytest.fixture
def rag_pipeline():
    store = InMemoryDocumentStore()
    store.write_documents([
        Document(content="Returns are accepted within 30 days with a receipt."),
        Document(content="Products must be in original packaging for returns."),
    ])

    pipeline = Pipeline()
    pipeline.add_component("retriever", InMemoryBM25Retriever(document_store=store))
    pipeline.add_component("prompt_builder", PromptBuilder(template=PROMPT_TEMPLATE))
    pipeline.add_component("generator", OpenAIGenerator(model="gpt-4o-mini"))

    pipeline.connect("retriever.documents", "prompt_builder.documents")
    pipeline.connect("prompt_builder.prompt", "generator.prompt")
    return pipeline

def test_pipeline_answers_known_question(rag_pipeline):
    with patch.object(
        rag_pipeline.get_component("generator"),
        "run",
        return_value={"replies": ["Returns are accepted within 30 days with a receipt."]}
    ):
        result = rag_pipeline.run({
            "retriever": {"query": "return policy"},
            "prompt_builder": {"question": "What is the return policy?"},
        })

    assert "replies" in result["generator"]
    assert len(result["generator"]["replies"]) > 0
    assert "30 days" in result["generator"]["replies"][0]

def test_pipeline_handles_empty_retrieval(rag_pipeline):
    with patch.object(
        rag_pipeline.get_component("retriever"),
        "run",
        return_value={"documents": []}
    ), patch.object(
        rag_pipeline.get_component("generator"),
        "run",
        return_value={"replies": ["I don't know."]}
    ):
        result = rag_pipeline.run({
            "retriever": {"query": "black holes"},
            "prompt_builder": {"question": "What do you know about black holes?"},
        })

    assert "I don't know" in result["generator"]["replies"][0]

Mocking the generator keeps tests fast and free. The key assertion is that the pipeline's plumbing works — data flows from retriever to prompt builder to generator with the right schema at each step.

Golden Answer Tests

For production RAG systems, golden answer tests are your strongest regression signal. Pick a set of questions where the correct answer is deterministic based on your document corpus. Run these in CI. Any change to your pipeline (new model, updated prompt, re-indexed documents) that breaks a golden answer is caught before deployment.

# test_golden_answers.py
import pytest

GOLDEN_CASES = [
    {
        "question": "How long do I have to return a product?",
        "expected_pattern": r"30 days",
    },
    {
        "question": "What condition must items be in for returns?",
        "expected_pattern": r"original packaging",
    },
    {
        "question": "Do you offer free shipping?",
        "expected_pattern": r"I don't know|not sure|not mentioned",
    },
]

@pytest.mark.parametrize("case", GOLDEN_CASES)
def test_golden_answer(rag_pipeline, case):
    import re
    # For integration tests against a real LLM, remove the mock
    result = rag_pipeline.run({
        "retriever": {"query": case["question"]},
        "prompt_builder": {"question": case["question"]},
    })

    answer = result["generator"]["replies"][0]
    assert re.search(case["expected_pattern"], answer, re.IGNORECASE), (
        f"Expected pattern '{case['expected_pattern']}' not found in: {answer}"
    )

Keep golden tests in a separate suite with a @pytest.mark.integration marker and run them in CI only when you deploy to staging — they call the real LLM and cost money.

Testing YAML Pipeline Configurations

Haystack lets you define pipelines in YAML. If your team uses this, test that the YAML loads and validates correctly — a misconfigured YAML is a silent failure mode.

# test_pipeline_config.py
from haystack.core.serialization import default_from_dict
import yaml

def test_pipeline_yaml_loads_correctly():
    with open("pipelines/rag_pipeline.yaml") as f:
        config = yaml.safe_load(f)

    pipeline = Pipeline.from_dict(config)

    # Verify expected components are present
    assert "retriever" in pipeline.graph.nodes
    assert "prompt_builder" in pipeline.graph.nodes
    assert "generator" in pipeline.graph.nodes

def test_pipeline_yaml_connections_are_valid():
    with open("pipelines/rag_pipeline.yaml") as f:
        config = yaml.safe_load(f)

    pipeline = Pipeline.from_dict(config)
    # Pipeline.from_dict raises if connections are invalid
    # If we get here, the graph is valid
    assert pipeline is not None

How HelpMeTest Helps

Your Haystack application has a user-facing interface — a chatbot, a search UI, an internal tool. That surface needs browser-level tests that your pipeline tests can't reach.

Write plain-English tests against the actual UI:

Go to https://your-haystack-app.com/search
Type "return policy" in the search box
Click Search
Wait for results to appear
Verify the first result mentions "30 days"

HelpMeTest runs these on a schedule and alerts you when the answer changes — your document store was re-indexed with corrupted data, the LLM provider silently switched to a worse model, or the UI broke in a browser update.

Set up helpmetest health haystack-search 5m for a continuous uptime check that runs every five minutes in production. If your retrieval pipeline is down, you'll know before your users do.

The free tier covers 10 tests — enough to protect your critical pipeline paths from the start.

What to Actually Ship

Minimum test coverage for a production Haystack application:

  1. Component tests — each custom component, preprocessor, and converter has a unit test
  2. Retrieval quality tests — known queries return topically relevant documents
  3. Pipeline integration tests — data flows correctly from retriever to generator
  4. Golden answer tests — 5-10 deterministic Q&A pairs that run in CI on deploy
  5. YAML validation — pipeline configurations load without errors
  6. Smoke tests — latency and availability running on a schedule in production

Haystack moves fast. Model upgrades, component API changes, and document store updates all introduce regressions. Tests catch them before users do.


Start with HelpMeTest's free tier — 10 tests, no credit card. Add browser-level monitoring to your Haystack application in under ten minutes at helpmetest.com.

Read more