How to Test LlamaIndex RAG Pipelines Before They Hit Production
You built a LlamaIndex RAG pipeline. The retrieval looks right on your test documents. The LLM synthesizes clean answers. You ship it.
Then users start getting wrong answers. The retrieval is returning the right documents — mostly. The LLM is synthesizing correctly — mostly. But "mostly correct" at retrieval times "mostly correct" at synthesis equals a meaningful rate of wrong answers, and you have no systematic way to catch them.
This is the LlamaIndex testing problem: RAG pipelines fail silently. Wrong retrievals don't throw exceptions. Hallucinated answers don't return error codes. The system looks healthy while producing garbage.
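To put rough numbers on that compounding effect, here is a small sketch. The 90% stage accuracies are hypothetical, and multiplying them assumes the two stages fail independently, which real pipelines only approximate.

```python
def end_to_end_accuracy(retrieval_acc: float, synthesis_acc: float) -> float:
    """Chance that both stages succeed, assuming their failures are independent."""
    return retrieval_acc * synthesis_acc


# Hypothetical rates: each stage is right 90% of the time.
accuracy = end_to_end_accuracy(0.90, 0.90)
wrong_per_1000 = round((1 - accuracy) * 1000)

print(f"End-to-end accuracy: {accuracy:.0%}")
print(f"Wrong answers per 1,000 queries: about {wrong_per_1000}")
```

Two stages that each look fine in a demo can still leave roughly one answer in five wrong, and nothing in the pipeline raises an error when that happens.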
Why LlamaIndex Pipelines Are Hard to Test
LlamaIndex is one of the most widely used frameworks for building RAG applications. It abstracts document loading, chunking, indexing, retrieval, and synthesis into clean pipelines.
But the abstraction hides failure modes that are unique to retrieval-augmented systems:
- Retrieval failures — the right documents exist in your index but aren't returned for a given query. Your pipeline returns an answer based on partially relevant context.
- Context window failures — you retrieve more context than fits in the model's window. The model silently truncates or ignores parts of it.
- Synthesis hallucinations — the model synthesizes an answer that sounds correct but contradicts the retrieved documents.
- Index drift — your underlying data changes but your index doesn't update. Users get answers based on stale data.
- Query transform failures — your query transformation or reranking step changes the query in a way that retrieves worse documents.
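Of these, context window overflow is one you can guard against mechanically. The sketch below estimates whether a set of retrieved chunks fits a token budget. The four-characters-per-token ratio is a rough heuristic for English text, not a real tokenizer, and the helper names and numbers are hypothetical.

```python
from typing import List

CHARS_PER_TOKEN = 4  # rough heuristic for English text, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length, rounding up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def context_overflow(chunks: List[str], context_window: int, reserved: int) -> int:
    """Return how many estimated tokens exceed the budget (0 if everything fits).

    `reserved` covers the prompt template, the question, and the answer.
    """
    budget = context_window - reserved
    used = sum(estimate_tokens(chunk) for chunk in chunks)
    return max(0, used - budget)


# Hypothetical numbers: three 2,000-character chunks and a 1,024-token window.
chunks = ["x" * 2000] * 3
overflow = context_overflow(chunks, context_window=1024, reserved=256)
print(f"Estimated overflow: {overflow} tokens")
```

In a retrieval test you could pass the text of each retrieved node and assert that the overflow is zero. For tighter bounds, swap the heuristic for your model's actual tokenizer.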
Layer 1: Testing the Retrieval Layer
Start by testing retrieval independently of generation. This is where most RAG pipelines fail:
    from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

    def test_retrieval_returns_relevant_docs():
        documents = SimpleDirectoryReader("./test_docs").load_data()
        index = VectorStoreIndex.from_documents(documents)
        retriever = index.as_retriever(similarity_top_k=3)
        nodes = retriever.retrieve("What is our refund policy?")
        # Verify retrieval returned something
        assert len(nodes) > 0
        # Verify the most relevant doc is actually about refunds
        top_doc_text = nodes[0].get_text().lower()
        assert "refund" in top_doc_text or "return" in top_doc_text
        # Verify relevance scores are reasonable
        assert nodes[0].score > 0.7

Test retrieval with queries where you know the ground truth answer exists in specific documents. If retrieval fails here, no amount of prompt engineering will fix your downstream synthesis.
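Once you have a handful of queries with known source documents, you can score retrieval in aggregate instead of one assertion at a time. The sketch below computes hit rate and mean reciprocal rank (MRR) over a labeled set. The function name, the labels, and the canned retriever are hypothetical stand-ins; the scoring logic itself uses only the standard library.

```python
from typing import Callable, Dict, List, Set, Tuple


def hit_rate_and_mrr(
    retrieve_ids: Callable[[str], List[str]],
    ground_truth: Dict[str, Set[str]],
    k: int = 3,
) -> Tuple[float, float]:
    """Score a retriever against labeled queries.

    Hit rate: fraction of queries with an expected doc in the top k.
    MRR: mean of 1/rank of the first expected doc (0 when none is found).
    """
    if not ground_truth:
        raise ValueError("ground_truth must contain at least one query")
    hits, reciprocal_ranks = 0, 0.0
    for query, expected in ground_truth.items():
        ranked = retrieve_ids(query)[:k]
        rank = next(
            (i for i, doc_id in enumerate(ranked, start=1) if doc_id in expected),
            None,
        )
        if rank is not None:
            hits += 1
            reciprocal_ranks += 1.0 / rank
    total = len(ground_truth)
    return hits / total, reciprocal_ranks / total


# Hypothetical labels: each query maps to the documents that should answer it.
GROUND_TRUTH = {
    "What is our refund policy?": {"refunds.md"},
    "how to cancel subscription": {"billing.md"},
    "data privacy policy": {"privacy.md"},
}

# Canned results standing in for a real retriever.
CANNED = {
    "What is our refund policy?": ["refunds.md", "billing.md"],
    "how to cancel subscription": ["pricing.md", "billing.md"],
    "data privacy policy": ["pricing.md", "faq.md"],
}

hit_rate, mrr = hit_rate_and_mrr(lambda q: CANNED[q], GROUND_TRUTH, k=3)
print(f"hit rate: {hit_rate:.2f}, MRR: {mrr:.2f}")
```

To use this with a LlamaIndex retriever, `retrieve_ids` would wrap `retriever.retrieve(query)` and map each returned node to whatever document identifier your labels use, for example a file name stored in node metadata, assuming your loader populates one.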
Layer 2: Testing Retrieval Coverage
Beyond relevance, test coverage — does your pipeline retrieve documents for the full range of query types users will send?
    # Assumes `index` is a VectorStoreIndex built once at module level or by a fixture
    def test_retrieval_coverage_across_query_types():
        retriever = index.as_retriever(similarity_top_k=3)
        # Test queries that should return results
        positive_queries = [
            "pricing information",
            "how to cancel subscription",
            "data privacy policy",
            "supported integrations",
        ]
        for query in positive_queries:
            nodes = retriever.retrieve(query)
            assert len(nodes) > 0, f"No results for: {query}"
            assert nodes[0].score > 0.5, f"Low relevance for: {query}"
        # Test queries that should return nothing meaningful
        negative_queries = [
            "weather in Tokyo",
            "recipe for pasta carbonara",
        ]
        for query in negative_queries:
            nodes = retriever.retrieve(query)
            # Either no results or very low relevance
            if len(nodes) > 0:
                assert nodes[0].score < 0.4, f"False positive for out-of-domain query: {query}"

Layer 3: Testing the Full Query Pipeline
After retrieval checks out, test the full pipeline from query to answer:
    from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

    def test_query_engine_answers_correctly():
        documents = SimpleDirectoryReader("./test_docs").load_data()
        index = VectorStoreIndex.from_documents(documents)
        query_engine = index.as_query_engine()
        response = query_engine.query("What is the cancellation policy?")
        # Verify an answer was produced
        assert response.response is not None
        assert len(response.response) > 50
        # Verify source nodes were used
        assert len(response.source_nodes) > 0
        # Verify the answer is grounded in retrieved content
        # (basic check: a key term appears in both the sources and the answer)
        source_text = " ".join(n.get_text().lower() for n in response.source_nodes)
        answer_lower = response.response.lower()
        key_terms = ["cancel", "days", "notice", "policy"]
        assert any(term in source_text and term in answer_lower for term in key_terms)

Layer 4: Testing Response Groundedness
The most critical LlamaIndex test: does the answer come from the retrieved documents, or is the model hallucinating?
    import re

    # Assumes `index` is a VectorStoreIndex built once at module level or by a fixture
    def test_response_is_grounded_not_hallucinated():
        query_engine = index.as_query_engine()
        response = query_engine.query("What is the price of the Enterprise plan?")
        answer = response.response
        # Collect the text of every source node the answer was built from
        source_text = " ".join(n.get_text() for n in response.source_nodes)
        # If the answer mentions a specific price, it must be in the sources
        price_pattern = r'\$[\d,]+|[\d,]+ dollars'

        def normalize(price):
            # "$1,200" and "1200 dollars" both become "1200"
            return re.sub(r"[^\d]", "", price)

        prices_in_answer = {normalize(p) for p in re.findall(price_pattern, answer, re.IGNORECASE)}
        prices_in_sources = {normalize(p) for p in re.findall(price_pattern, source_text, re.IGNORECASE)}
        for price in prices_in_answer:
            assert price in prices_in_sources, \
                f"Price '{price}' in answer not found in source docs (possible hallucination)"

This won't catch all hallucinations, but it catches a dangerous category: fabricated figures that contradict or extend your documents.
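The same idea extends beyond prices to any number the answer states: notice periods, limits, percentages. The sketch below is a heuristic under stated assumptions, not a full groundedness evaluator. It flags every numeric value in the answer that appears in none of the source texts. The function names and sample strings are hypothetical.

```python
import re
from typing import List

# Digits with optional thousands separators and an optional decimal part.
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def numbers_in(text: str) -> List[float]:
    """Extract numeric values so that '1,200' and '1200' compare equal."""
    return [float(match.replace(",", "")) for match in NUMBER.findall(text)]


def unsupported_numbers(answer: str, sources: List[str]) -> List[float]:
    """Return numbers stated in the answer that appear in no source text."""
    source_numbers = {n for source in sources for n in numbers_in(source)}
    return [n for n in numbers_in(answer) if n not in source_numbers]


# Hypothetical source chunks and a model answer that invents a trial period.
sources = [
    "The Enterprise plan costs $1,200 per month.",
    "Cancel any time with 30 days notice.",
]
answer = "The Enterprise plan is $1200 per month, with a 14-day trial."

print(unsupported_numbers(answer, sources))
```

In a pipeline test, the answer would be `response.response` and the sources would be the text of each entry in `response.source_nodes`. A non-empty result is a signal to investigate rather than proof of a hallucination, since a number can be legitimately derived from the sources, for example by summing two line items.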
Layer 5: Testing LlamaIndex Agents
If you're using LlamaIndex's agent capabilities (not just query engines), test agent tool selection and reasoning:
    from llama_index.core.agent import ReActAgent
    from llama_index.core.tools import QueryEngineTool

    # Assumes `pricing_engine` and `support_engine` are query engines built elsewhere,
    # and a LlamaIndex release where ReActAgent.from_tools and agent.query are available
    def test_agent_uses_correct_tool_for_query():
        pricing_tool = QueryEngineTool.from_defaults(
            query_engine=pricing_engine,
            name="pricing_lookup",
            description="Use for questions about prices, plans, and billing",
        )
        support_tool = QueryEngineTool.from_defaults(
            query_engine=support_engine,
            name="support_lookup",
            description="Use for technical support and troubleshooting questions",
        )
        agent = ReActAgent.from_tools([pricing_tool, support_tool])
        # Track which tools get called by wrapping each engine's query method.
        # Both tools are wrapped so the "not called" assertion is meaningful.
        tool_calls = []

        def track(tool, name):
            original_query = tool.query_engine.query

            def tracked_query(*args, **kwargs):
                tool_calls.append(name)
                return original_query(*args, **kwargs)

            tool.query_engine.query = tracked_query

        track(pricing_tool, "pricing_lookup")
        track(support_tool, "support_lookup")
        response = agent.query("How much does the Pro plan cost?")
        assert "pricing_lookup" in tool_calls
        assert "support_lookup" not in tool_calls

What Code-Level Tests Miss
Your retrieval and synthesis tests run against your current index. Production fails differently:
- Index staleness — your underlying data changes. New pricing, updated policies, deprecated features. The index reflects yesterday's truth. Users get yesterday's answers.
- Query distribution shift — users ask questions you didn't anticipate. Retrieval that works for your test queries fails for real user phrasing.
- Embedding model drift — your embedding model updates. Documents indexed with the old model retrieve poorly with the new embeddings.
- Chunk size edge cases — a key fact lands at a chunk boundary and gets split between two chunks, so no single retrieved chunk contains it in full. Your tests with clean documents don't catch this.
- Reranker failures — your reranker works correctly on development queries but demotes the right document for specific real-world phrasing patterns.
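The chunk boundary case is one of the few items on this list you can check offline. A minimal sketch, assuming you maintain a short list of key facts that must survive chunking intact: it reports facts that no single chunk contains in full. The character-based chunker is purely illustrative and is not how LlamaIndex splits text. In practice you would pass the text of the nodes your real splitter produced.

```python
from typing import List


def fixed_size_chunks(text: str, size: int, overlap: int = 0) -> List[str]:
    """Naive character chunker, used only to illustrate the boundary problem."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must be in [0, size)")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def split_facts(chunks: List[str], key_facts: List[str]) -> List[str]:
    """Return key facts that no single chunk contains in full."""
    return [
        fact for fact in key_facts
        if not any(fact in chunk for chunk in chunks)
    ]


# Hypothetical document and key fact.
document = "Refunds are available within 30 days of purchase. Contact support to start."
facts = ["within 30 days"]

no_overlap = fixed_size_chunks(document, size=30)
with_overlap = fixed_size_chunks(document, size=30, overlap=15)

print(split_facts(no_overlap, facts))    # the fact straddles the first boundary
print(split_facts(with_overlap, facts))  # overlap keeps it inside one chunk
```

Running a check like this whenever you change chunk size or overlap settings shows which facts a new configuration would cut in half.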
Monitoring LlamaIndex Pipelines in Production
Once your RAG pipeline is live, you need ongoing quality monitoring — not just a CI suite.
HelpMeTest lets you write behavioral tests against your deployed LlamaIndex endpoint and run them on a schedule:
    Test: pricing query returns current plan prices
    Go to https://yourapp.com/chat
    Type "What are your current pricing plans?"
    Then: response mentions at least one specific price
    And: response does not say "I don't have information"
    And: response time under 8 seconds
    And: response does not contradict published pricing page

Tests run continuously. If your RAG pipeline starts hallucinating after an index update, embedding model change, or LLM version bump, you find out before your users do.
Free tier: 10 tests, unlimited health checks.
LlamaIndex Testing Checklist
Before shipping any LlamaIndex RAG pipeline:
- Retrieval relevance tests — top results actually answer the query
- Retrieval coverage tests — all query types return relevant docs
- Out-of-domain rejection tests — low-relevance queries return low scores
- Full pipeline tests — end-to-end from query to answer
- Groundedness tests — answer claims are present in source documents
- Agent tool selection tests — if using agents, right tool is called per query type
- Index freshness validation — stale documents don't serve outdated answers
- Production behavioral monitoring for retrieval and synthesis drift
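The index freshness item can be checked mechanically. A minimal sketch, assuming you save a content hash per source document when you build the index and compare it with the current content on a schedule. The function names and sample documents are hypothetical.

```python
import hashlib
from typing import Dict, List


def fingerprint(docs: Dict[str, bytes]) -> Dict[str, str]:
    """Map each document name to the SHA-256 hex digest of its content."""
    return {name: hashlib.sha256(content).hexdigest() for name, content in docs.items()}


def stale_documents(indexed: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Names whose content changed, appeared, or disappeared since indexing."""
    names = set(indexed) | set(current)
    return sorted(name for name in names if indexed.get(name) != current.get(name))


# Hypothetical snapshot saved at index-build time vs. the documents as they are now.
at_index_time = fingerprint({"pricing.md": b"Pro: $49", "refunds.md": b"30 days"})
now = fingerprint({"pricing.md": b"Pro: $59", "refunds.md": b"30 days", "sla.md": b"99.9%"})

print(stale_documents(at_index_time, now))
```

Any name in the result means the index no longer matches its sources and should be rebuilt or updated before it serves more answers.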
Your index is correct today. It might not be correct tomorrow. Build the monitoring that tells you when it's not.