
How to Test LlamaIndex RAG Pipelines Before They Hit Production

HelpMeTest

13 May 2026 — 5 min read

You built a LlamaIndex RAG pipeline. The retrieval looks right on your test documents. The LLM synthesizes clean answers. You ship it.

Then users start getting wrong answers. The retrieval is returning the right documents — mostly. The LLM is synthesizing correctly — mostly. But "mostly correct" at retrieval times "mostly correct" at synthesis equals a meaningful rate of wrong answers, and you have no systematic way to catch them.
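To make the compounding concrete, here is a back-of-the-envelope sketch. The 90% figures are illustrative assumptions, not measurements, and the calculation treats the two stages as failing independently, which real pipelines only approximate.

```python
def end_to_end_accuracy(retrieval_accuracy: float, synthesis_accuracy: float) -> float:
    """Probability that both stages succeed, assuming independent failures."""
    return retrieval_accuracy * synthesis_accuracy


# Illustrative numbers only: each stage is right 90% of the time.
accuracy = end_to_end_accuracy(0.90, 0.90)
print(f"end-to-end accuracy: {accuracy:.0%}")  # 81%
print(f"wrong answers: {1 - accuracy:.0%}")    # 19%
```

Under these assumptions, two stages that each look fine in isolation still give roughly one wrong answer in five.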

This is the LlamaIndex testing problem: RAG pipelines fail silently. Wrong retrievals don't throw exceptions. Hallucinated answers don't return error codes. The system looks healthy while producing garbage.

Why LlamaIndex Pipelines Are Hard to Test

LlamaIndex is one of the most widely used frameworks for building RAG applications. It abstracts document loading, chunking, indexing, retrieval, and synthesis into clean pipelines.

But the abstraction hides failure modes that are unique to retrieval-augmented systems:

Retrieval failures — the right documents exist in your index but aren't returned for a given query. Your pipeline returns an answer based on partially relevant context.
Context window failures — you retrieve more context than fits in the model's window. The context gets silently truncated, or the model ignores parts of it.
Synthesis hallucinations — the model synthesizes an answer that sounds correct but contradicts the retrieved documents.
Index drift — your underlying data changes but your index doesn't update. Users get answers based on stale data.
Query transform failures — your query transformation or reranking step changes the query in a way that retrieves worse documents.
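Of the failure modes above, context window overflow is the easiest to guard against mechanically. The following is a minimal sketch, not a LlamaIndex API: `estimate_tokens` and `fits_context_budget` are hypothetical helpers, and the four-characters-per-token ratio is a rough heuristic that you would replace with your model's real tokenizer.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; swap in your model's real tokenizer."""
    return int(len(text) / chars_per_token) + 1


def fits_context_budget(
    chunks: list[str],
    prompt: str,
    max_tokens: int,
    reserved_for_answer: int = 512,
) -> bool:
    """Return True if prompt + retrieved chunks leave room for the answer."""
    used = estimate_tokens(prompt) + sum(estimate_tokens(c) for c in chunks)
    return used + reserved_for_answer <= max_tokens


# Three 400-character chunks plus a short prompt against two window sizes.
chunks = ["a" * 400] * 3
print(fits_context_budget(chunks, "q" * 40, max_tokens=1000))  # True
print(fits_context_budget(chunks, "q" * 40, max_tokens=800))   # False
```

A check like this can run in a test over your largest expected retrievals, so an overflow shows up as a failed assertion instead of a silently shortened context.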

Layer 1: Testing the Retrieval Layer

Start by testing retrieval independently of generation. Many RAG failures originate here:

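from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

def test_retrieval_returns_relevant_docs():
    # Build a small index from fixture documents.
    # Note: from_documents embeds with the configured embedding model,
    # so this test needs that model (or a mock) to be available.
    documents = SimpleDirectoryReader("./test_docs").load_data()
    index = VectorStoreIndex.from_documents(documents)
    retriever = index.as_retriever(similarity_top_k=3)
    
    nodes = retriever.retrieve("What is our refund policy?")
    
    # Verify retrieval returned something
    assert len(nodes) > 0
    
    # Verify the most relevant doc is actually about refunds
    top_doc_text = nodes[0].get_text().lower()
    assert "refund" in top_doc_text or "return" in top_doc_text
    
    # Verify relevance scores are reasonable.
    # score can be None for some retrievers, and 0.7 is only a starting
    # point: similarity ranges vary by embedding model, so calibrate it.
    assert nodes[0].score is not None and nodes[0].score > 0.7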

Test retrieval with queries where you know the ground truth answer exists in specific documents. If retrieval fails here, no amount of prompt engineering will fix your downstream synthesis.
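One way to turn ground-truth queries into a number you can track is hit rate and mean reciprocal rank (MRR) over a labeled set. This is a minimal sketch under stated assumptions: `hit_rate_and_mrr` is a hypothetical helper, and `retrieve_ids` stands in for whatever wrapper you write around your retriever to return ranked document IDs.

```python
from typing import Callable


def hit_rate_and_mrr(
    labeled_queries: dict[str, str],
    retrieve_ids: Callable[[str], list[str]],
) -> tuple[float, float]:
    """Score a retriever against {query: expected_doc_id} ground truth."""
    if not labeled_queries:
        raise ValueError("labeled_queries must not be empty")
    hits = 0
    reciprocal_rank_sum = 0.0
    for query, expected_id in labeled_queries.items():
        ranked_ids = retrieve_ids(query)
        if expected_id in ranked_ids:
            hits += 1
            # Rank is 1-based: a hit in first position contributes 1.0.
            reciprocal_rank_sum += 1.0 / (ranked_ids.index(expected_id) + 1)
    total = len(labeled_queries)
    return hits / total, reciprocal_rank_sum / total


# Stand-in retriever for illustration; real code would call your retriever.
fake_results = {
    "refund policy": ["refunds.md", "terms.md"],
    "cancel plan": ["terms.md", "cancel.md"],
    "data export": ["terms.md", "faq.md"],
}
labels = {
    "refund policy": "refunds.md",
    "cancel plan": "cancel.md",
    "data export": "export.md",
}
hit_rate, mrr = hit_rate_and_mrr(labels, lambda q: fake_results[q])
print(f"hit rate: {hit_rate:.2f}, MRR: {mrr:.2f}")
```

Asserting a minimum hit rate in CI catches regressions from chunking or embedding changes that a single hand-picked query would miss.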

Layer 2: Testing Retrieval Coverage

Beyond relevance, test coverage — does your pipeline retrieve documents for the full range of query types users will send?

def test_retrieval_coverage_across_query_types(index):
    # `index` is assumed to be a pytest fixture that builds the
    # VectorStoreIndex once for the test module
    retriever = index.as_retriever(similarity_top_k=3)
    
    # Test queries that should return results
    positive_queries = [
        "pricing information",
        "how to cancel subscription",
        "data privacy policy",
        "supported integrations",
    ]
    
    for query in positive_queries:
        nodes = retriever.retrieve(query)
        assert len(nodes) > 0, f"No results for: {query}"
        assert nodes[0].score > 0.5, f"Low relevance for: {query}"
    
    # Test queries that should return nothing meaningful
    negative_queries = [
        "weather in Tokyo",
        "recipe for pasta carbonara",
    ]
    
    for query in negative_queries:
        nodes = retriever.retrieve(query)
        # A top-k retriever usually returns k nodes even for unrelated
        # queries, so check the score rather than the count
        if len(nodes) > 0:
            assert nodes[0].score < 0.4, f"False positive for out-of-domain query: {query}"
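The 0.5 and 0.4 cutoffs in a test like this are only meaningful if in-domain and out-of-domain scores actually separate for your embedding model. Below is a small sketch for deriving a cutoff from observed top scores instead of guessing. `calibrate_threshold` is a hypothetical helper and the sample scores are made up for illustration.

```python
def calibrate_threshold(
    in_domain_scores: list[float],
    out_of_domain_scores: list[float],
) -> float:
    """Return the midpoint between the two score groups, or fail if they overlap."""
    lowest_in = min(in_domain_scores)
    highest_out = max(out_of_domain_scores)
    if lowest_in <= highest_out:
        raise ValueError(
            f"Scores overlap: lowest in-domain {lowest_in} "
            f"<= highest out-of-domain {highest_out}"
        )
    return (lowest_in + highest_out) / 2


# Illustrative top-1 scores collected from positive and negative queries.
threshold = calibrate_threshold([0.62, 0.71, 0.80], [0.20, 0.35])
print(f"suggested cutoff: {threshold:.3f}")
```

If the function raises, no single threshold can distinguish the two groups, which is itself a useful finding about the retriever.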

Layer 3: Testing the Full Query Pipeline

After retrieval checks out, test the full pipeline from query to answer:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

def test_query_engine_answers_correctly():
    # Build the index from fixture documents
    documents = SimpleDirectoryReader("./test_docs").load_data()
    index = VectorStoreIndex.from_documents(documents)
    query_engine = index.as_query_engine()
    
    response = query_engine.query("What is the cancellation policy?")
    
    # Verify an answer was produced
    assert response.response is not None
    assert len(response.response) > 50
    
    # Verify source nodes were used
    assert len(response.source_nodes) > 0
    
    # Verify the answer is grounded in retrieved content
    # (basic check: a key term must appear in both the sources and the answer)
    source_text = " ".join([n.get_text().lower() for n in response.source_nodes])
    answer_lower = response.response.lower()
    
    key_terms = ["cancel", "days", "notice", "policy"]
    assert any(term in source_text and term in answer_lower for term in key_terms)
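A hard-coded term list only works for one query. A slightly more general version of the same idea is to measure what fraction of the answer's content words also appear in the sources. This is a crude lexical sketch, not a semantic check: `content_terms`, `grounded_fraction`, and the stopword list are hypothetical and deliberately simple.

```python
import re

# Small illustrative stopword list; extend it for real use.
STOPWORDS = {
    "this", "that", "with", "from", "have", "will", "your",
    "they", "their", "there", "which", "within", "about",
}


def content_terms(text: str) -> set[str]:
    """Lowercased words longer than three characters, minus stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if len(w) > 3 and w not in STOPWORDS}


def grounded_fraction(answer: str, source_text: str) -> float:
    """Fraction of the answer's content terms that also occur in the sources."""
    answer_terms = content_terms(answer)
    if not answer_terms:
        return 0.0
    return len(answer_terms & content_terms(source_text)) / len(answer_terms)


source = "Customers may cancel at any time with 30 days written notice."
print(grounded_fraction("You can cancel with 30 days notice.", source))
print(grounded_fraction("Refunds arrive within seven business days.", source))
```

In a test, you would assert that the fraction stays above a floor you pick from known-good answers. Paraphrased answers will score lower than verbatim ones, so treat a low score as a signal to inspect, not proof of hallucination.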

Layer 4: Testing Response Groundedness

The most critical LlamaIndex test: does the answer come from the retrieved documents, or is the model hallucinating?

import re

def test_response_is_grounded_not_hallucinated(index):
    # `index` is assumed to be a pytest fixture that builds the index
    query_engine = index.as_query_engine()
    response = query_engine.query("What is the price of the Enterprise plan?")
    
    # Source nodes are the only place a price may legitimately come from
    source_text = " ".join([n.get_text() for n in response.source_nodes])
    
    # Matches "$1,000", "$49.99", and "1000 dollars"
    price_pattern = r'\$\s?(\d[\d,]*(?:\.\d+)?)|(\d[\d,]*(?:\.\d+)?) dollars'
    
    def extract_prices(text):
        # Normalize every match to a float so "$1,000" equals "1000 dollars"
        matches = re.findall(price_pattern, text, re.IGNORECASE)
        return {float((a or b).replace(",", "")) for a, b in matches}
    
    prices_in_answer = extract_prices(response.response)
    prices_in_sources = extract_prices(source_text)
    
    # If the answer mentions a specific price, it must be in the sources
    for price in prices_in_answer:
        assert price in prices_in_sources, \
            f"Price {price} in answer not found in source docs: possible hallucination"

This won't catch all hallucinations, but it catches one of the most dangerous categories: fabricated prices that appear nowhere in your retrieved documents.
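The same pattern extends beyond prices to other figures users act on, such as percentages and time limits. The sketch below is one possible generalization, assuming simple "number plus unit" phrasing. `numeric_claims` and `unsupported_claims` are hypothetical helpers, and the unit list is illustrative, not exhaustive.

```python
import re

# A number followed by a percent sign or a time unit.
NUMBER_WITH_UNIT = re.compile(
    r"(\d[\d,]*(?:\.\d+)?)\s*(%|(?:percent|days?|hours?|months?|years?)\b)",
    re.IGNORECASE,
)


def numeric_claims(text: str) -> set[tuple[float, str]]:
    """Extract (value, unit) pairs, normalizing plurals and 'percent'."""
    claims = set()
    for number, unit in NUMBER_WITH_UNIT.findall(text):
        value = float(number.replace(",", ""))
        unit = unit.lower().rstrip("s")
        if unit == "percent":
            unit = "%"
        claims.add((value, unit))
    return claims


def unsupported_claims(answer: str, source_text: str) -> set[tuple[float, str]]:
    """Numeric claims made in the answer that the sources never state."""
    return numeric_claims(answer) - numeric_claims(source_text)


source = "We refund 100 percent of the fee within 30 days."
answer = "Refunds are issued within 14 days and cover 100% of the fee."
print(unsupported_claims(answer, source))
```

Here the 14-day claim is flagged because the source only mentions 30 days, while "100%" is accepted as matching "100 percent".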

Layer 5: Testing LlamaIndex Agents

If you're using LlamaIndex's agent capabilities (not just query engines), test agent tool selection and reasoning:

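from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool

def test_agent_uses_correct_tool_for_query():
    # pricing_engine and support_engine are assumed to be two separate
    # query engines built elsewhere (for example, as fixtures)
    tool_calls = []
    
    def track(engine, tool_name):
        # Wrap engine.query so every call is recorded under the tool's name
        original_query = engine.query
        def tracked_query(*args, **kwargs):
            tool_calls.append(tool_name)
            return original_query(*args, **kwargs)
        engine.query = tracked_query
    
    # Track both engines; otherwise the negative assertion below is vacuous
    track(pricing_engine, "pricing_lookup")
    track(support_engine, "support_lookup")
    
    pricing_tool = QueryEngineTool.from_defaults(
        query_engine=pricing_engine,
        name="pricing_lookup",
        description="Use for questions about prices, plans, and billing"
    )
    support_tool = QueryEngineTool.from_defaults(
        query_engine=support_engine,
        name="support_lookup",
        description="Use for technical support and troubleshooting questions"
    )
    
    # from_tools belongs to the older agent API; check your installed
    # LlamaIndex version, since newer releases may use workflow-based agents
    agent = ReActAgent.from_tools([pricing_tool, support_tool])
    
    response = agent.query("How much does the Pro plan cost?")
    
    assert "pricing_lookup" in tool_calls
    assert "support_lookup" not in tool_calls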
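Tool-selection tests are only meaningful when every tool is instrumented, since an untracked tool can never show up in the call log. A reusable sketch of that instrumentation follows. `record_calls` is a hypothetical helper that works on any objects exposing a `query` method and allowing attribute assignment; the `DummyEngine` class stands in for real query engines.

```python
def record_calls(named_engines: dict, call_log: list) -> None:
    """Patch each engine's query() so calls are appended to call_log by name."""
    for name, engine in named_engines.items():
        original = engine.query

        # Default arguments bind the current name and method for this
        # iteration, avoiding the late-binding closure pitfall in loops.
        def tracked(*args, _name=name, _original=original, **kwargs):
            call_log.append(_name)
            return _original(*args, **kwargs)

        engine.query = tracked


class DummyEngine:
    """Stand-in for a query engine, used only for this illustration."""

    def query(self, question: str) -> str:
        return f"answer to: {question}"


pricing, support = DummyEngine(), DummyEngine()
calls: list[str] = []
record_calls({"pricing_lookup": pricing, "support_lookup": support}, calls)

pricing.query("How much does the Pro plan cost?")
print(calls)  # ['pricing_lookup']
```

With both engines wrapped, asserting that a name is absent from the log actually tells you the agent did not route to that tool.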

What Code-Level Tests Miss

Your retrieval and synthesis tests run against your current index. Production fails differently:

Index staleness — your underlying data changes. New pricing, updated policies, deprecated features. The index reflects yesterday's truth. Users get yesterday's answers.
Query distribution shift — users ask questions you didn't anticipate. Retrieval that works for your test queries fails for real user phrasing.
Embedding model drift — your embedding model updates. Documents indexed with the old model retrieve poorly with the new embeddings.
Chunk size edge cases — a key fact lands at a chunk boundary and gets split between two chunks, so neither chunk alone answers the query. Your tests with clean documents don't catch this.
Reranker failures — your reranker works correctly on development queries but demotes the right document for specific real-world phrasing patterns.
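Index staleness, the first item above, is one of the few production failures you can detect mechanically: record a content hash per document at indexing time and compare it against the current source. This is a minimal sketch under that assumption. `fingerprint` and `find_stale_documents` are hypothetical helpers, and storing the fingerprints alongside the index is left to you.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stable content hash for a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale_documents(
    current_docs: dict[str, str],
    indexed_fingerprints: dict[str, str],
) -> dict[str, list[str]]:
    """Compare current source documents against fingerprints saved at index time."""
    changed = sorted(
        doc_id
        for doc_id, text in current_docs.items()
        if doc_id in indexed_fingerprints
        and fingerprint(text) != indexed_fingerprints[doc_id]
    )
    missing = sorted(d for d in current_docs if d not in indexed_fingerprints)
    deleted = sorted(d for d in indexed_fingerprints if d not in current_docs)
    return {
        "changed": changed,
        "missing_from_index": missing,
        "deleted_from_source": deleted,
    }


# Fingerprints captured when the index was built (illustrative data).
indexed = {
    "pricing.md": fingerprint("Pro: $49/month"),
    "old.md": fingerprint("Legacy plan"),
}
current = {"pricing.md": "Pro: $59/month", "faq.md": "New FAQ"}
print(find_stale_documents(current, indexed))
```

Any non-empty list in the result means the index no longer matches the source and should be rebuilt or updated before it serves more answers.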

Monitoring LlamaIndex Pipelines in Production

Once your RAG pipeline is live, you need ongoing quality monitoring — not just a CI suite.

HelpMeTest lets you write behavioral tests against your deployed LlamaIndex endpoint and run them on a schedule:

Test: pricing query returns current plan prices
Go to https://yourapp.com/chat
Type "What are your current pricing plans?"
Then: response mentions at least one specific price
And: response does not say "I don't have information"
And: response time under 8 seconds
And: response does not contradict published pricing page

Tests run continuously. If your RAG pipeline starts hallucinating after an index update, embedding model change, or LLM version bump, you find out before your users do.
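For readers who want to see the shape of such a check in code, here is a rough Python equivalent of three of the assertions above. It is a sketch, not HelpMeTest's API: `check_pricing_answer` is a hypothetical function, and `ask` is an assumed callable that sends a question to your deployed endpoint and returns the answer text.

```python
import re
import time
from typing import Callable


def check_pricing_answer(
    ask: Callable[[str], str],
    max_seconds: float = 8.0,
) -> list[str]:
    """Run one behavioral check and return a list of failure descriptions."""
    started = time.monotonic()
    answer = ask("What are your current pricing plans?")
    elapsed = time.monotonic() - started

    failures = []
    # A dollar sign followed by a digit counts as a specific price.
    if not re.search(r"\$\s?\d", answer):
        failures.append("no specific price mentioned")
    # Normalize typographic apostrophes before matching the disclaimer.
    if "i don't have information" in answer.lower().replace("\u2019", "'"):
        failures.append("answer disclaims knowledge")
    if elapsed > max_seconds:
        failures.append(f"response took {elapsed:.1f}s")
    return failures


# Stand-in for a real endpoint call, used only for illustration.
print(check_pricing_answer(lambda q: "The Pro plan is $49 per month."))  # []
```

Checking the answer against the published pricing page, the last assertion in the test above, needs a second data source and is omitted here.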

Free tier: 10 tests, unlimited health checks.

LlamaIndex Testing Checklist

Before shipping any LlamaIndex RAG pipeline:

Retrieval relevance tests — top results actually answer the query
Retrieval coverage tests — all query types return relevant docs
Out-of-domain rejection tests — low-relevance queries return low scores
Full pipeline tests — end-to-end from query to answer
Groundedness tests — answer claims are present in source documents
Agent tool selection tests — if using agents, right tool is called per query type
Index freshness validation — stale documents don't serve outdated answers
Production behavioral monitoring for retrieval and synthesis drift

Your index is correct today. It might not be correct tomorrow. Build the monitoring that tells you when it's not.
