Testing RAG Applications: Retrieval, Chunking, and Answer Quality
RAG applications fail at multiple points — poor chunking, irrelevant retrieval, unfaithful generation. Testing each layer independently gives you clear signal on where failures originate. This guide covers practical testing strategies for every stage of the RAG pipeline.
Key Takeaways
RAG has three distinct failure modes. Retrieval failure (wrong chunks returned), chunking failure (right source, wrong chunk boundaries), and generation failure (hallucinations despite correct context). Test each separately.
Retrieval quality is measurable without an LLM. Use precision@k and recall@k against a labeled query set. This is fast, cheap, and highly diagnostic.
Chunking strategy affects retrieval quality profoundly. Test different chunk sizes and overlap values against your specific query distribution before deploying.
Faithfulness testing verifies the generator doesn't add fabricated claims. Ground every output claim in the retrieved context. Claims with no contextual basis are hallucinations.
Build a golden dataset early. A set of 50–200 (query, expected_answer, relevant_chunks) examples is your most valuable testing asset. Build it before you optimize anything.
The RAG Testing Challenge
RAG (Retrieval-Augmented Generation) applications combine two complex systems: a retrieval system (typically vector search) and a generative LLM. Both can fail. And when the final output is wrong, you need to know which one failed.
Without structured testing, debugging RAG failures is guesswork: "Did we retrieve the wrong documents? Did the model ignore the documents we retrieved? Was the document chunked in a way that lost important context?"
A properly tested RAG pipeline gives you a failure dashboard instead of a mystery.
The RAG Pipeline and Its Failure Points
A standard RAG pipeline:
User query
→ Query preprocessing (optional)
→ Embedding the query
→ Vector search against document store
→ Retrieve top-k chunks
→ Build context window (retrieved chunks + system prompt)
→ LLM generates answer
→ Answer delivery
Each step is a potential failure point:
- Query preprocessing: Query expansion or reformulation can distort the retrieval signal
- Embedding quality: Low-quality embeddings produce poor similarity matches
- Chunk size/overlap: Chunks that cut across sentence or concept boundaries lose context
- Retrieval quality: The wrong chunks get returned for the query
- Context assembly: Important retrieved chunks get truncated or poorly ordered
- Generation faithfulness: The LLM ignores or contradicts the retrieved context
- Answer relevance: The final answer doesn't actually address the user's question
Testing Retrieval Quality
Retrieval testing is the cheapest high-signal step because it needs no LLM calls: you only check whether the right documents come back for a given query.
Build a Labeled Query Set
First, create a set of (query, relevant_document_ids) pairs. This is your ground truth for retrieval evaluation.
[
  {
    "query": "What is the return policy for software purchases?",
    "relevant_doc_ids": ["policy-doc-id-42", "faq-returns-id-17"]
  },
  {
    "query": "How do I reset my password?",
    "relevant_doc_ids": ["help-auth-id-05"]
  }
]
Creating this dataset is manual work, but 50–100 labeled pairs are usually enough for stable retrieval metrics. Use actual user queries from logs when possible, since synthetic queries are less representative.
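A labeled set is only as useful as it is well-formed, so a quick validation pass before each eval run can catch empty queries or missing labels early. The sketch below assumes the JSON layout shown above; the function name `validate_labeled_queries` and the specific checks are illustrative choices, not part of any library.

```python
import json


def validate_labeled_queries(items):
    """Return a list of human-readable problems found in the labeled set."""
    problems = []
    seen_queries = set()
    for index, item in enumerate(items):
        query = str(item.get("query", "")).strip()
        doc_ids = item.get("relevant_doc_ids") or []
        if not query:
            problems.append(f"item {index}: empty query")
        if not doc_ids:
            problems.append(f"item {index}: no relevant_doc_ids")
        # Case-insensitive duplicate check so near-identical queries are flagged.
        if query.lower() in seen_queries:
            problems.append(f"item {index}: duplicate query")
        seen_queries.add(query.lower())
    return problems


# Example input in the same shape as the labeled set above.
raw = """
[
  {"query": "How do I reset my password?", "relevant_doc_ids": ["help-auth-id-05"]},
  {"query": "how do I reset my password?", "relevant_doc_ids": []}
]
"""
print(validate_labeled_queries(json.loads(raw)))
# → ['item 1: no relevant_doc_ids', 'item 1: duplicate query']
```

An empty result means the set passed these basic checks; it says nothing about whether the labels themselves are correct.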
Measure Precision and Recall
For each query in your labeled set:
- Precision@k: Of the k chunks returned, how many were actually relevant?
- Recall@k: Of all relevant chunks, how many were in the top k?
- MRR (Mean Reciprocal Rank): How highly was the first relevant result ranked?
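def evaluate_retrieval(retriever, labeled_queries, k=5):
    """Average precision@k, recall@k, and MRR over a labeled query set."""
    # retriever.search and chunk.doc_id stand in for your own retriever's interface.
    results = []
    for item in labeled_queries:
        retrieved = retriever.search(item["query"], k=k)
        retrieved_ids = [chunk.doc_id for chunk in retrieved]
        relevant_ids = set(item["relevant_doc_ids"])
        # Precision counts every returned chunk that comes from a relevant document.
        relevant_chunks = sum(1 for doc_id in retrieved_ids if doc_id in relevant_ids)
        precision = relevant_chunks / k
        # Recall counts each relevant document once, however many of its chunks appear.
        found = set(retrieved_ids) & relevant_ids
        recall = len(found) / len(relevant_ids) if relevant_ids else 0.0
        # Reciprocal rank of the first relevant result (0 if none was retrieved).
        reciprocal_rank = 0.0
        for rank, doc_id in enumerate(retrieved_ids, 1):
            if doc_id in relevant_ids:
                reciprocal_rank = 1 / rank
                break
        results.append({"precision": precision, "recall": recall, "mrr": reciprocal_rank})

    def avg(key):
        # Guard against an empty labeled set.
        return sum(r[key] for r in results) / len(results) if results else 0.0

    return {
        "precision@k": avg("precision"),
        "recall@k": avg("recall"),
        "mrr": avg("mrr"),
        "total_queries": len(labeled_queries)
    }
Run this after every change to your embedding model, chunk size, or retrieval configuration. Precision@5 below 0.6 or Recall@5 below 0.7 usually means the retrieval layer is the primary problem.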
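To sanity-check the arithmetic by hand, it can help to compute the three metrics for a single query. The sketch below follows the metric definitions above; the document IDs are made up for illustration, and `single_query_metrics` is a hypothetical helper name.

```python
def single_query_metrics(retrieved_ids, relevant_ids, k):
    """Return (precision@k, recall@k, reciprocal rank) for one query."""
    top_k = retrieved_ids[:k]
    relevant = set(relevant_ids)
    hits = [doc_id for doc_id in top_k if doc_id in relevant]
    precision = len(hits) / k
    recall = len(set(hits)) / len(relevant)
    reciprocal_rank = 0.0
    for rank, doc_id in enumerate(top_k, start=1):
        if doc_id in relevant:
            reciprocal_rank = 1 / rank
            break
    return precision, recall, reciprocal_rank


# Two of the three relevant documents appear in the top 5; the first is at rank 2.
precision, recall, rr = single_query_metrics(
    ["x", "a", "y", "b", "z"], ["a", "b", "c"], k=5
)
print(precision, round(recall, 3), rr)
# → 0.4 0.667 0.5
```

Here precision is 2/5, recall is 2/3, and the reciprocal rank is 1/2 because the first relevant result sits in second position.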
Testing Chunking Strategies
Chunking — splitting documents into retrievable pieces — has a massive impact on retrieval quality that's often underestimated.
Common chunking problems:
- Chunks too small: Each chunk lacks enough context to be meaningful. Answers require information from multiple chunks that may not be retrieved together.
- Chunks too large: One chunk covers too many topics. Similarity scores get diluted.
- Boundary cuts: Splitting mid-sentence or mid-table destroys the semantic unit.
Test chunking by running your full retrieval eval against different chunk configurations:
| Config | Chunk Size | Overlap | Precision@5 | Recall@5 |
|---|---|---|---|---|
| Baseline | 512 tokens | 0 | 0.61 | 0.58 |
| Large | 1024 tokens | 0 | 0.64 | 0.67 |
| Overlapping | 512 tokens | 128 | 0.70 | 0.72 |
| Sentence-aware | Variable | 0 | 0.73 | 0.71 |
There's no universally correct chunk size. The optimal configuration depends on your document structure and query distribution. Test with your actual data.
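To compare configurations like those in the table, you need a chunker whose size and overlap are parameters. The following is a minimal sketch of fixed-size chunking with overlap. It assumes whitespace-separated words as tokens, which is a simplification; a real pipeline would normally count tokens with the embedding model's tokenizer. The name `chunk_tokens` is illustrative.

```python
def chunk_tokens(tokens, chunk_size, overlap=0):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, so no trailing fragment is repeated.
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = "a b c d e f g h i j".split()
print(chunk_tokens(tokens, chunk_size=4, overlap=2))
# → [['a', 'b', 'c', 'd'], ['c', 'd', 'e', 'f'], ['e', 'f', 'g', 'h'], ['g', 'h', 'i', 'j']]
```

Each configuration you test then becomes one (chunk_size, overlap) pair fed through the same retrieval eval.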
Advanced techniques to test:
- Hierarchical chunking: Small chunks for retrieval, larger parent chunks for context
- Semantic chunking: Split on sentence boundaries and semantic similarity rather than fixed token counts
- Metadata-enriched chunks: Prepend document title and section header to each chunk
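Two of these ideas, splitting on sentence boundaries and prepending metadata, can be sketched in a few lines. The example below is deliberately naive: it assumes sentences end with `.`, `!`, or `?` followed by whitespace (so abbreviations will be split incorrectly), it measures size in words rather than model tokens, and it does not attempt the semantic-similarity part of semantic chunking. The function name and parameters are hypothetical.

```python
import re


def sentence_chunks(text, max_words, title=""):
    """Group whole sentences into chunks of at most max_words, prefixed by a title."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = len(sentence.split())
        # Start a new chunk rather than cutting a sentence in half.
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    prefix = f"{title}\n" if title else ""
    return [prefix + chunk for chunk in chunks]


text = (
    "Refunds take five days. Contact support to start one. "
    "Software keys are not refundable."
)
for chunk in sentence_chunks(text, max_words=9, title="Return Policy"):
    print(repr(chunk))
```

A single sentence longer than `max_words` is kept whole in this sketch, which trades a strict size limit for intact semantic units.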
Testing Generation Faithfulness
Once you know retrieval is working, test whether the generator actually uses the retrieved context faithfully.
Faithfulness measures whether every claim in the output is supported by the retrieved chunks.
import json

def test_faithfulness(query: str, retrieved_chunks: list[str], generated_answer: str):
    # call_llm is assumed to be your own helper: it takes a prompt string
    # and returns the model's text reply.
    context = "\n".join(retrieved_chunks)
    # Extract claims from the answer
    claim_prompt = f"""
    Extract all factual claims from this answer as a JSON list of strings:
    Answer: {generated_answer}
    Return only JSON: ["claim1", "claim2", ...]
    """
    claims = json.loads(call_llm(claim_prompt))
    if not claims:
        # An answer with no factual claims has nothing that can be unsupported.
        return {"score": 1.0, "unsupported_claims": []}
    # Check each claim against context
    unsupported = []
    for claim in claims:
        support_prompt = f"""
        Does this context support this claim? Answer yes or no.
        Context: {context}
        Claim: {claim}
        """
        verdict = call_llm(support_prompt).strip().lower()
        # Match the leading word so a reply that merely contains "yes" is not read as support.
        if not verdict.startswith("yes"):
            unsupported.append(claim)
    faithfulness_score = 1 - (len(unsupported) / len(claims))
    return {"score": faithfulness_score, "unsupported_claims": unsupported}
A faithfulness score below 0.85 suggests the model is adding claims that are not present in the retrieved context.
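The claim-extraction step above parses the model's reply with `json.loads`, which raises if the model wraps the list in extra prose. One way to make that step more tolerant is sketched below. The helper name `parse_claims` is hypothetical, and returning an empty list on failure is a design choice you may prefer to replace with a retry or an error.

```python
import json
import re


def parse_claims(raw):
    """Parse an LLM reply that should contain a JSON list of claim strings."""
    # Grab the outermost bracketed span, ignoring any text around it.
    match = re.search(r"\[.*\]", raw, flags=re.DOTALL)
    if match is None:
        return []
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    # Keep only non-empty strings; drop anything else the model slipped in.
    return [item.strip() for item in parsed if isinstance(item, str) and item.strip()]


reply = 'Sure, here is the list:\n["Refunds take 5 days", "Keys are final"]\nHope that helps.'
print(parse_claims(reply))
# → ['Refunds take 5 days', 'Keys are final']
```

Logging the raw reply whenever parsing fails makes it easier to see whether the prompt or the parser needs adjusting.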
Testing Answer Relevance
An answer can be faithful to the retrieved context and still fail to answer the actual question, so test answer relevance separately: does the generated answer actually address what the user asked?
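def test_answer_relevance(query: str, answer: str) -> float:
    # call_llm is assumed to be your own helper that returns the model's text reply.
    prompt = f"""
    Rate how well this answer addresses the question on a scale of 0-1.
    Question: {query}
    Answer: {answer}
    Return only a number between 0 and 1.
    """
    score = float(call_llm(prompt).strip())
    # Clamp in case the model returns a value outside the requested range.
    return min(1.0, max(0.0, score))
Low answer relevance with high faithfulness usually means the retrieved context was off-topic: the model stayed true to chunks that did not contain the answer. That's a retrieval problem, not a generation problem.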
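The rule of thumb above can be written as a small triage helper that turns the two scores into a first guess about which layer to inspect. This is a sketch, not a definitive diagnosis: the function name is hypothetical, and the default thresholds are simply the faithfulness (0.85) and answer relevance (0.80) targets used elsewhere in this guide.

```python
def likely_failure_layer(faithfulness, answer_relevance,
                         faithfulness_floor=0.85, relevance_floor=0.80):
    """Return a first guess at the failing layer from two per-answer scores."""
    if faithfulness < faithfulness_floor:
        # The answer contains claims the retrieved context does not support.
        return "generation"
    if answer_relevance < relevance_floor:
        # Faithful but off-target: the context itself was probably off-topic.
        return "retrieval"
    return "ok"


print(likely_failure_layer(faithfulness=0.95, answer_relevance=0.40))
# → retrieval
```

Treat the result as a pointer for where to look first; confirming it still means reading the retrieved chunks for the failing queries.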
End-to-End RAG Evaluation
After testing each layer, run end-to-end evaluations that measure the full pipeline against your golden dataset.
For each (query, expected_answer) pair:
- Run the full RAG pipeline
- Compute faithfulness of the output
- Compute answer relevance
- If you have expected answers: compute semantic similarity against expected
- Optionally: LLM-as-judge for holistic quality
Track these metrics across pipeline changes:
golden_results = {
    "retrieval_precision@5": 0.72,
    "faithfulness": 0.91,
    "answer_relevance": 0.85,
    "semantic_similarity_to_expected": 0.88
}
Run end-to-end evals on every significant change: new embedding model, chunking strategy change, system prompt update, model upgrade.
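Tracking metrics across changes is easier when a script compares the current run against a stored baseline. The sketch below assumes both runs are dictionaries of metric name to score, in the shape shown above. The `tolerance` parameter and the `current` numbers are hypothetical values chosen for illustration.

```python
def find_regressions(baseline, current, tolerance=0.02):
    """Return metrics whose score dropped by more than `tolerance`."""
    regressions = {}
    for name, old_value in baseline.items():
        new_value = current.get(name)
        if new_value is None:
            continue  # metric was not computed in the current run
        drop = old_value - new_value
        if drop > tolerance:
            regressions[name] = round(drop, 4)
    return regressions


baseline = {"retrieval_precision@5": 0.72, "faithfulness": 0.91, "answer_relevance": 0.85}
current = {"retrieval_precision@5": 0.66, "faithfulness": 0.92, "answer_relevance": 0.84}
print(find_regressions(baseline, current))
# → {'retrieval_precision@5': 0.06}
```

A small tolerance avoids flagging run-to-run noise, which LLM-scored metrics such as faithfulness tend to have.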
Testing Context Window Management
Long document collections often produce more retrieved chunks than fit in the context window. How you handle this matters:
- Truncation: What happens when you have 15 relevant chunks but only 5 fit? Are the right 5 selected?
- Reranking: Does a reranker correctly promote the most relevant chunks to the top?
- Ordering: Does chunk order in the context window affect output quality? (It often does — models tend to use information at the start and end of context more than the middle.)
Test reranking by comparing retrieval metrics with and without a reranker on your labeled query set. The reranker should improve precision@k significantly or it's not worth its latency cost.
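Truncation and ordering can both be made explicit and testable. The sketch below shows one possible approach: greedily pack ranked chunks into a token budget, then order them so the strongest sit at the start and end of the context. Both function names are hypothetical, and counting whitespace-separated words is a stand-in for a real tokenizer.

```python
def count_words(text):
    """Stand-in token counter; swap in your model's tokenizer."""
    return len(text.split())


def pack_chunks(ranked_chunks, token_budget, count_tokens=count_words):
    """Keep chunks in rank order, skipping any that would exceed the budget."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > token_budget:
            continue  # too big for the remaining space; a later chunk may still fit
        selected.append(chunk)
        used += cost
    return selected


def edges_first(ranked_chunks):
    """Put the best chunks at the start and end, the weakest in the middle."""
    front, back = [], []
    for index, chunk in enumerate(ranked_chunks):
        (front if index % 2 == 0 else back).append(chunk)
    return front + back[::-1]


print(pack_chunks(["a b c", "d e f g", "h"], token_budget=4))
# → ['a b c', 'h']
print(edges_first([1, 2, 3, 4, 5]))
# → [1, 3, 5, 4, 2]
```

Whether the edges-first ordering helps is itself something to measure: run the end-to-end eval with and without it before adopting it.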
Regression Testing RAG Pipelines
RAG systems regress silently. Someone updates the embedding model, changes the chunk size, or modifies the system prompt — and retrieval quality drops by 15% without anyone noticing for weeks.
Run your retrieval eval and faithfulness tests in CI:
# .github/workflows/rag-eval.yaml
on:
  push:
    paths:
      - 'src/rag/**'
      - 'prompts/**'
      - 'chunking/**'
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python eval/retrieval.py --fail-below 0.65
      - run: python eval/faithfulness.py --fail-below 0.85
Gate merges on retrieval precision and faithfulness thresholds. Silent regressions are among the biggest reliability risks in RAG applications.
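The workflow relies on each eval script exiting with a non-zero code when its metric falls below the threshold. The internals of those scripts are not shown in this guide, so the following is only a sketch of how a `--fail-below` flag could be wired up with `argparse`. `compute_score` is a hypothetical placeholder for the real evaluation.

```python
import argparse


def compute_score():
    """Hypothetical placeholder: a real script would run the eval and return its metric."""
    return 0.72


def main(argv=None):
    parser = argparse.ArgumentParser(description="Fail the build when a metric is too low.")
    parser.add_argument("--fail-below", type=float, required=True)
    args = parser.parse_args(argv)
    score = compute_score()
    print(f"score={score:.2f} threshold={args.fail_below:.2f}")
    # Exit code 0 lets the CI step pass; any non-zero code fails it.
    return 0 if score >= args.fail_below else 1


exit_code = main(["--fail-below", "0.65"])
print(exit_code)
# → 0
```

In an actual script you would end with `raise SystemExit(main())` so the return value becomes the process exit code that CI sees.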
Production Monitoring
Offline evals miss real user query distributions. Monitor production RAG runs:
Log for every query:
- User query and preprocessed query
- Retrieved chunk IDs and similarity scores
- Full generated answer
- Latency breakdown (embedding, retrieval, generation)
Evaluate asynchronously on a sample:
- Faithfulness score
- Answer relevance score
- User feedback signals (thumbs down, follow-up questions suggesting confusion)
Alert on:
- Faithfulness drops below 0.80 (model hallucinating)
- Retrieval returning zero results for queries that previously had results (document store issues)
- Average similarity scores declining (embedding drift or data quality issues)
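The alert conditions above can be checked over a window of logged queries. The sketch below assumes each log record is a dictionary with a `faithfulness` score (or `None` if the query was not sampled for evaluation) and a list of `chunk_scores`. These field names, the baseline value, and the `similarity_drop` margin are all assumptions for illustration. It also simplifies the second alert by flagging any query with zero results, without checking whether that query previously returned results.

```python
from statistics import mean


def check_alerts(records, baseline_similarity,
                 faithfulness_floor=0.80, similarity_drop=0.05):
    """Return alert messages for a window of logged RAG queries."""
    alerts = []
    scored = [r["faithfulness"] for r in records if r.get("faithfulness") is not None]
    if scored and mean(scored) < faithfulness_floor:
        alerts.append("faithfulness below floor")
    if any(not r["chunk_scores"] for r in records):
        alerts.append("query returned zero chunks")
    # Compare the average top similarity score against a stored baseline.
    top_scores = [max(r["chunk_scores"]) for r in records if r["chunk_scores"]]
    if top_scores and mean(top_scores) < baseline_similarity - similarity_drop:
        alerts.append("similarity scores declining")
    return alerts


window = [
    {"faithfulness": 0.9, "chunk_scores": [0.82, 0.75]},
    {"faithfulness": 0.6, "chunk_scores": []},
    {"faithfulness": None, "chunk_scores": [0.64]},
]
print(check_alerts(window, baseline_similarity=0.80))
# → ['faithfulness below floor', 'query returned zero chunks', 'similarity scores declining']
```

Running a check like this on a schedule over the most recent window keeps alerting cheap, since only the sampled records need an LLM-scored faithfulness value.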
Putting It Together
A RAG testing checklist:
- Labeled query set (50+ examples) exists
- Retrieval precision@5 ≥ 0.70
- Retrieval recall@5 ≥ 0.70
- Faithfulness score ≥ 0.85 on golden set
- Answer relevance ≥ 0.80 on golden set
- Chunking configuration tested across multiple strategies
- CI gates on retrieval and faithfulness metrics
- Production logging and async eval in place
- Alert thresholds configured
Testing a RAG application is more work than testing a simple LLM call — but the payoff is knowing exactly where failures come from and catching regressions before users do.