Testing LlamaIndex RAG Pipelines: Retrieval Accuracy, Context Quality, Hallucination Detection
LlamaIndex is a popular framework for building RAG (Retrieval-Augmented Generation) pipelines. Testing these pipelines requires verifying three distinct components: the retriever (is it finding relevant documents?), the context (is the retrieved content accurate?), and the generator (is the answer faithful to the context and free of hallucination?). This guide covers all three.
Key Takeaways
Test the retriever and generator separately. A hallucinating answer might be the generator's fault or the retriever's fault — a poorly retrieved context forces the model to invent. Isolate the layers.
Retrieval accuracy is measurable with precision and recall. Build a labeled dataset of questions + relevant documents and evaluate retriever performance against it.
Use LlamaIndex's built-in evaluators. FaithfulnessEvaluator, RelevancyEvaluator, and CorrectnessEvaluator give you structured scores without building eval infrastructure from scratch.
Mock the LLM in unit tests, use real LLMs for eval. Unit tests verify pipeline mechanics. Eval tests measure output quality — they need a real (or realistic) LLM.
Test indexing too. The quality of your vector index affects everything downstream. Verify documents are indexed correctly before testing retrieval.
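The first takeaway, isolating the layers, can be sketched as a small triage helper that decides which layer to investigate when an answer goes wrong. This is a minimal illustration and not part of LlamaIndex: the `QueryOutcome` fields and the `triage` function are hypothetical names, and the faithfulness verdict is assumed to come from whatever check you already run.

```python
from dataclasses import dataclass


@dataclass
class QueryOutcome:
    """One evaluated query (all fields are illustrative assumptions)."""
    retrieved_ids: set[str]   # document IDs the retriever returned
    relevant_ids: set[str]    # document IDs labeled as relevant
    answer_faithful: bool     # verdict from a faithfulness check


def triage(outcome: QueryOutcome) -> str:
    """Attribute a bad answer to the retriever or the generator."""
    found_relevant = bool(outcome.retrieved_ids & outcome.relevant_ids)
    if not found_relevant:
        # Without relevant context the generator had nothing to ground on.
        return "retriever"
    if not outcome.answer_faithful:
        # Relevant context was available but the answer strayed from it.
        return "generator"
    return "ok"
```

Checking retrieval first reflects the point above: an unfaithful answer built on irrelevant context is a retrieval problem before it is a generation problem.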
LlamaIndex RAG Pipeline Architecture
A LlamaIndex RAG pipeline has these components:
Documents
↓ (ingestion)
Document Transformations (chunking, metadata extraction)
↓
Embedding Model → Vector Store (index)
↓ (query time)
Query → Embedding → Vector Search → Retrieved Nodes
↓
Response Synthesis (LLM generates answer from nodes)
↓
Final Response
Each component has testable properties:
- Ingestion: Are documents chunked correctly? Is metadata preserved?
- Retrieval: Are relevant documents returned for the query?
- Context: Is retrieved content sufficient and accurate?
- Generation: Is the answer faithful to the context?
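To make these seams concrete, here is a toy pipeline in plain Python where each stage is a separate function that can be tested alone. It is a sketch under simplifying assumptions: word overlap stands in for vector search, the LLM is an injected callable, and none of the function names are LlamaIndex APIs.

```python
from typing import Callable


def chunk(text: str, max_words: int) -> list[str]:
    """Ingestion seam: split text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def retrieve(query: str, chunks: list[str], top_k: int) -> list[str]:
    """Retrieval seam: rank chunks by lowercase words shared with the query."""
    query_words = set(query.lower().split())

    def overlap(chunk_text: str) -> int:
        return len(query_words & set(chunk_text.lower().split()))

    return sorted(chunks, key=overlap, reverse=True)[:top_k]


def synthesize(query: str, context: list[str], llm: Callable[[str], str]) -> str:
    """Generation seam: the LLM is injected so tests can pass a stub."""
    prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"
    return llm(prompt)


def echo_llm(prompt: str) -> str:
    """Stub LLM that returns the first context line of the prompt."""
    return prompt.splitlines()[1]


chunks = chunk("Refunds are accepted within 30 days. Shipping takes 5 days.", max_words=6)
top = retrieve("How long do refunds take?", chunks, top_k=1)
answer = synthesize("How long do refunds take?", top, echo_llm)
```

Because the stub LLM only echoes its context, any wrong `answer` here must come from chunking or retrieval, which is the isolation the list above asks for.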
Testing Indexing and Ingestion
Document Chunking
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
import pytest

def test_document_chunking_preserves_content():
    """Verify chunking doesn't lose document content."""
    document = Document(
        text="LlamaIndex is a framework for LLM applications. " * 100,  # Long document
        metadata={"source": "docs/intro.md"}
    )
    splitter = SentenceSplitter(chunk_size=256, chunk_overlap=20)
    nodes = splitter.get_nodes_from_documents([document])
    # Join all chunk text (overlapping regions appear twice) and check the content survived
    all_text = " ".join(node.text for node in nodes)
    assert "LlamaIndex is a framework" in all_text
    assert len(nodes) > 1, "Long document should produce multiple chunks"

def test_chunk_size_respected():
    """Chunks should stay close to the configured chunk_size (measured in tokens)."""
    splitter = SentenceSplitter(chunk_size=128)
    document = Document(text="This is a test sentence. " * 100)
    nodes = splitter.get_nodes_from_documents([document])
    for node in nodes:
        # Rough estimate (1 token ≈ 4 chars), so allow headroom above chunk_size=128
        approx_tokens = len(node.text) / 4
        assert approx_tokens <= 200, f"Chunk too large: ~{approx_tokens:.0f} tokens"

def test_metadata_preserved_through_chunking():
    """Document metadata should be present on all derived nodes."""
    document = Document(
        text="Important document content. " * 50,
        metadata={"source": "report.pdf", "author": "Alice", "page": 1}
    )
    splitter = SentenceSplitter(chunk_size=128)
    nodes = splitter.get_nodes_from_documents([document])
    for node in nodes:
        assert "source" in node.metadata, "Source metadata missing from node"
        assert node.metadata["author"] == "Alice"
Index Construction
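Index-construction tests stay fast and repeatable when the embedding model is replaced by a stub that always returns the same vector for the same text. The sketch below shows one way to build such a stub with the standard library only. `stub_embedding` is a hypothetical helper, its vectors carry no semantic meaning, and `dim` must not exceed 32 because a SHA-256 digest has 32 bytes.

```python
import hashlib
import math


def stub_embedding(text: str, dim: int = 8) -> list[float]:
    """Map text to a deterministic unit-length vector (not semantically meaningful)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Use the first `dim` bytes as raw components, centered around zero.
    raw = [byte - 127.5 for byte in digest[:dim]]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]
```

A stub like this is only suitable for checking mechanics such as node counts and stored metadata. Retrieval quality still has to be measured with a real embedding model. LlamaIndex's own `MockEmbedding` class plays a similar role inside the framework.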
from llama_index.core import Document, MockEmbedding, VectorStoreIndex

def test_index_built_from_documents():
    """Verify index is constructed with the expected number of documents."""
    docs = [
        Document(text="HelpMeTest is a testing platform.", metadata={"id": "1"}),
        Document(text="It supports Robot Framework and Playwright.", metadata={"id": "2"}),
        Document(text="Pricing starts at $100/month for Pro.", metadata={"id": "3"}),
    ]
    # MockEmbedding returns fixed vectors, so no embedding model or API key is needed
    index = VectorStoreIndex.from_documents(docs, embed_model=MockEmbedding(embed_dim=8))
    assert index.docstore.docs  # Not empty
    assert len(index.docstore.docs) >= 3  # At least one node per document
Testing the Retriever
Precision and Recall Measurement
Build a labeled dataset where you know which documents are relevant to each query:
import pytest
from llama_index.core import VectorStoreIndex, Document
from dataclasses import dataclass

@dataclass
class RetrievalTestCase:
    query: str
    relevant_doc_ids: set[str]

RETRIEVAL_TEST_CASES = [
    RetrievalTestCase(
        query="What is the return policy?",
        relevant_doc_ids={"doc-returns-policy", "doc-customer-service"}
    ),
    RetrievalTestCase(
        query="How do I integrate with CI/CD?",
        relevant_doc_ids={"doc-ci-integration", "doc-github-actions"}
    ),
    RetrievalTestCase(
        query="What programming languages are supported?",
        relevant_doc_ids={"doc-sdk-overview", "doc-python-sdk", "doc-js-sdk"}
    ),
]

def compute_precision_recall(retrieved_ids: set, relevant_ids: set) -> tuple[float, float]:
    if not retrieved_ids:
        return 0.0, 0.0
    true_positives = len(retrieved_ids & relevant_ids)
    precision = true_positives / len(retrieved_ids)
    recall = true_positives / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

def test_retriever_precision_recall():
    """Retriever achieves minimum precision and recall on labeled test cases."""
    index = load_production_index()  # Load your built index
    retriever = index.as_retriever(similarity_top_k=5)
    precisions, recalls = [], []
    for test_case in RETRIEVAL_TEST_CASES:
        nodes = retriever.retrieve(test_case.query)
        retrieved_ids = {node.node.metadata.get("doc_id") for node in nodes}
        p, r = compute_precision_recall(retrieved_ids, test_case.relevant_doc_ids)
        precisions.append(p)
        recalls.append(r)
    avg_precision = sum(precisions) / len(precisions)
    avg_recall = sum(recalls) / len(recalls)
    assert avg_precision >= 0.7, f"Avg precision {avg_precision:.2f} below 0.70"
    assert avg_recall >= 0.6, f"Avg recall {avg_recall:.2f} below 0.60"
Testing Top-K Relevance
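Precision and recall treat the retrieved set as unordered, so they cannot tell whether the relevant document came back first or fifth. A rank-aware metric such as mean reciprocal rank (MRR) captures that difference. The sketch below is a minimal, library-independent illustration. The function names and document IDs are assumptions for the example, not LlamaIndex APIs.

```python
def reciprocal_rank(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """Return 1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average the reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


runs = [
    (["doc-returns-policy", "doc-pricing"], {"doc-returns-policy"}),  # first hit at rank 1
    (["doc-pricing", "doc-ci-integration"], {"doc-ci-integration"}),  # first hit at rank 2
]
mrr = mean_reciprocal_rank(runs)  # (1.0 + 0.5) / 2 = 0.75
```

An MRR close to 1.0 means the first relevant document usually sits at the top of the ranking. That same property is what the keyword-based top-result check in this section probes for a single query.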
def test_top_result_is_most_relevant():
    """The highest-ranked retrieved document should be the most relevant."""
    index = load_production_index()
    retriever = index.as_retriever(similarity_top_k=3)
    nodes = retriever.retrieve("What is the return policy?")
    assert len(nodes) > 0, "No documents retrieved"
    top_doc = nodes[0].node.text.lower()
    assert "return" in top_doc or "refund" in top_doc, \
        f"Top result not about returns: {top_doc[:200]}"

def test_retriever_filters_by_metadata():
    """Metadata filters correctly narrow the retrieval scope."""
    from llama_index.core.vector_stores.types import MetadataFilters, ExactMatchFilter
    index = load_production_index()
    retriever = index.as_retriever(
        similarity_top_k=5,
        filters=MetadataFilters(
            filters=[ExactMatchFilter(key="category", value="pricing")]
        )
    )
    nodes = retriever.retrieve("How much does it cost?")
    for node in nodes:
        assert node.node.metadata.get("category") == "pricing", \
            f"Retrieved node not in pricing category: {node.node.metadata}"
Testing Response Quality with LlamaIndex Evaluators
LlamaIndex ships with built-in evaluators that use an LLM-as-judge pattern:
from llama_index.core import VectorStoreIndex
from llama_index.core.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    CorrectnessEvaluator,
    RetrieverEvaluator,
)
from llama_index.llms.openai import OpenAI
import pytest

@pytest.fixture
def eval_llm():
    return OpenAI(model="gpt-4o-mini", temperature=0)

def test_response_faithfulness(eval_llm):
    """LLM response should not hallucinate facts not in the retrieved context."""
    evaluator = FaithfulnessEvaluator(llm=eval_llm)
    index = load_production_index()
    query_engine = index.as_query_engine()
    response = query_engine.query("What is the Pro plan price?")
    eval_result = evaluator.evaluate_response(
        query="What is the Pro plan price?",
        response=response
    )
    assert eval_result.passing, \
        f"Faithfulness check failed: {eval_result.feedback}"

def test_response_relevancy(eval_llm):
    """Response should be relevant to the question asked."""
    evaluator = RelevancyEvaluator(llm=eval_llm)
    index = load_production_index()
    query_engine = index.as_query_engine()
    response = query_engine.query("Does HelpMeTest support Playwright?")
    eval_result = evaluator.evaluate_response(
        query="Does HelpMeTest support Playwright?",
        response=response
    )
    assert eval_result.passing, \
        f"Relevancy check failed: {eval_result.feedback}"

def test_response_correctness_with_ground_truth(eval_llm):
    """Response correctness against known ground truth."""
    evaluator = CorrectnessEvaluator(llm=eval_llm)
    index = load_production_index()
    query_engine = index.as_query_engine()
    response = query_engine.query("How much does HelpMeTest Pro cost per month?")
    eval_result = evaluator.evaluate_response(
        query="How much does HelpMeTest Pro cost per month?",
        response=response,
        reference="HelpMeTest Pro costs $100 per month."
    )
    assert eval_result.score >= 3.0, \
        f"Correctness score {eval_result.score}/5.0 below threshold. Feedback: {eval_result.feedback}"
Batch Evaluation with Datasets
For systematic evaluation across many examples:
import asyncio
from llama_index.core.evaluation import (
    BatchEvalRunner,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
)
from llama_index.llms.openai import OpenAI

async def run_batch_evaluation():
    """Evaluate the RAG pipeline across a full test dataset."""
    index = load_production_index()
    query_engine = index.as_query_engine()
    llm = OpenAI(model="gpt-4o-mini", temperature=0)
    questions = [
        "What is the return policy?",
        "How do I integrate with GitHub Actions?",
        "What is the free plan limit?",
        "Does HelpMeTest support parallel test execution?",
        "How do I set up health monitoring?",
    ]
    evaluators = {
        "faithfulness": FaithfulnessEvaluator(llm=llm),
        "relevancy": RelevancyEvaluator(llm=llm),
    }
    runner = BatchEvalRunner(evaluators, workers=4)
    eval_results = await runner.aevaluate_queries(
        query_engine,
        queries=questions
    )
    # Summarize results: one pass rate per evaluator
    for metric_name, results in eval_results.items():
        pass_rate = sum(1 for r in results if r.passing) / len(results)
        print(f"{metric_name}: {pass_rate:.0%} pass rate")
        assert pass_rate >= 0.8, f"{metric_name} pass rate {pass_rate:.0%} below 80%"

def test_batch_evaluation():
    asyncio.run(run_batch_evaluation())
Testing Hallucination
Hallucination testing verifies that the model doesn't add facts not present in the retrieved context:
from llama_index.core import VectorStoreIndex, Document
from llama_index.core.evaluation import FaithfulnessEvaluator

def test_no_hallucination_on_unknown_topic():
    """When documents don't contain relevant info, the response should admit uncertainty."""
    # Index only contains pricing information
    docs = [
        Document(text="HelpMeTest Pro costs $100/month with unlimited tests."),
    ]
    index = VectorStoreIndex.from_documents(docs)
    query_engine = index.as_query_engine()
    # Ask about something not in the index
    response = query_engine.query("What is the company's founding date?")
    # The response should NOT invent a founding date
    response_text = str(response).lower()
    assert "founded" not in response_text or "don't have" in response_text or "not available" in response_text, \
        f"Possible hallucination detected: {response_text}"

def test_faithfulness_score_across_dataset(eval_llm):
    """Average faithfulness score across dataset should exceed threshold."""
    evaluator = FaithfulnessEvaluator(llm=eval_llm)
    index = load_production_index()
    query_engine = index.as_query_engine()
    test_queries = load_eval_dataset()  # list of {"query": str} dicts
    scores = []
    for item in test_queries:
        response = query_engine.query(item["query"])
        result = evaluator.evaluate_response(
            query=item["query"],
            response=response
        )
        scores.append(1.0 if result.passing else 0.0)
    avg_faithfulness = sum(scores) / len(scores)
    assert avg_faithfulness >= 0.85, \
        f"Average faithfulness {avg_faithfulness:.2f} below 0.85 threshold"
Performance Testing
RAG pipelines can be slow. Test that retrieval and generation stay within acceptable latency bounds:
import time

def test_query_latency():
    """Each end-to-end query should finish within 10 seconds, with P95 under 8 seconds."""
    index = load_production_index()
    query_engine = index.as_query_engine()
    queries = [
        "What is the return policy?",
        "How do I set up CI integration?",
        "What browsers are supported?",
    ]
    latencies = []
    for query in queries:
        start = time.perf_counter()
        query_engine.query(query)
        elapsed = time.perf_counter() - start
        latencies.append(elapsed)
        assert elapsed < 10.0, f"Query '{query}' took {elapsed:.1f}s (>10s limit)"
    # Nearest-rank P95; with only a few samples this is simply the slowest query
    p95 = sorted(latencies)[int(len(latencies) * 0.95)]
    print(f"P95 latency: {p95:.2f}s")
    assert p95 < 8.0, f"P95 latency {p95:.2f}s exceeds 8s"

def test_retrieval_latency():
    """Retrieval (without generation) should be fast."""
    index = load_production_index()
    retriever = index.as_retriever(similarity_top_k=5)
    start = time.perf_counter()
    nodes = retriever.retrieve("What is the pricing?")
    elapsed = time.perf_counter() - start
    assert elapsed < 1.0, f"Retrieval took {elapsed:.2f}s (>1s limit)"
CI Integration
# .github/workflows/llamaindex-eval.yml
name: LlamaIndex RAG Evaluation
on:
  push:
    paths:
      - 'src/rag/**'
      - 'data/documents/**'
  schedule:
    - cron: '0 6 * * 1'  # Weekly evaluation
jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install llama-index pytest
      - run: pytest tests/rag/test_chunking.py tests/rag/test_retriever.py -v
  eval-tests:
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      # pytest-timeout provides the --timeout flag used below
      - run: pip install llama-index pytest pytest-timeout ragas
      - run: pytest tests/rag/test_evaluation.py -v --timeout=300
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
HelpMeTest Integration
HelpMeTest adds end-to-end testing to your LlamaIndex application, verifying the complete user experience:
*** Test Cases ***
RAG Chat Returns Relevant Answer
    Go To    https://app.example.com/ask
    Input Text    id=question    What is the return policy?
    Click Button    id=ask-btn
    Wait Until Page Contains Element    .answer-text    timeout=15s
    ${answer}=    Get Text    .answer-text
    Should Contain    ${answer}    30 days
    Should Not Be Empty    ${answer}

RAG Chat Shows Source Documents
    Go To    https://app.example.com/ask
    Input Text    id=question    How do I enable CI integration?
    Click Button    id=ask-btn
    Wait Until Page Contains    Sources:    timeout=15s
    Page Should Contain Element    .source-document

No Answer For Out-Of-Scope Questions
    Go To    https://app.example.com/ask
    Input Text    id=question    What is the population of France?
    Click Button    id=ask-btn
    Wait Until Page Contains Element    .answer-text    timeout=15s
    ${answer}=    Get Text    .answer-text
    Should Contain Any    ${answer}    don't have    not available    outside my knowledge
Conclusion
Testing LlamaIndex RAG pipelines well requires a three-layer strategy: unit tests for pipeline mechanics (chunking, metadata, retriever logic), offline eval tests with LlamaIndex's built-in evaluators for quality measurement, and end-to-end tests for the complete user experience.
The built-in evaluators (FaithfulnessEvaluator, RelevancyEvaluator, CorrectnessEvaluator) provide the most value for minimal setup — start there. Build your labeled retrieval dataset next, since retrieval quality is the largest single determinant of RAG pipeline quality. Add latency tests when performance becomes a concern.
Run unit tests on every push and evaluation tests weekly or before major releases. This gives you fast feedback on mechanical regressions while catching quality regressions before they reach production users.
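The quality gate implied by this schedule can be sketched as a comparison of per-metric pass rates against minimum thresholds, mirroring the 80% pass-rate check in the batch evaluation example. The function name, metric names, and sample outcomes below are illustrative assumptions rather than output from a real run.

```python
def failing_metrics(
    results: dict[str, list[bool]],
    thresholds: dict[str, float],
) -> dict[str, float]:
    """Return {metric: pass_rate} for metrics whose pass rate is below its threshold."""
    failures = {}
    for metric, outcomes in results.items():
        pass_rate = sum(outcomes) / len(outcomes) if outcomes else 0.0
        if pass_rate < thresholds.get(metric, 0.0):
            failures[metric] = pass_rate
    return failures


# Hypothetical per-query pass/fail outcomes for two evaluators.
results = {
    "faithfulness": [True, True, True, True, False],  # 4 of 5 pass
    "relevancy": [True, True, False, False, True],    # 3 of 5 pass
}
failures = failing_metrics(results, {"faithfulness": 0.8, "relevancy": 0.8})
```

A scheduled evaluation job could fail the build whenever `failures` is non-empty and print the offending metrics with their pass rates.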