Testing RAG Applications: Retrieval, Chunking, and Answer Quality

HelpMeTest

16 May 2026

RAG applications fail at multiple points — poor chunking, irrelevant retrieval, unfaithful generation. Testing each layer independently gives you clear signal on where failures originate. This guide covers practical testing strategies for every stage of the RAG pipeline.

Key Takeaways

RAG has three distinct failure modes. Retrieval failure (wrong chunks returned), chunking failure (right source, wrong chunk boundaries), and generation failure (hallucinations despite correct context). Test each separately.

Retrieval quality is measurable without an LLM. Use precision@k and recall@k against a labeled query set. This is fast, cheap, and highly diagnostic.

Chunking strategy affects retrieval quality profoundly. Test different chunk sizes and overlap values against your specific query distribution before deploying.

Faithfulness testing verifies the generator doesn't add fabricated claims. Ground every output claim in the retrieved context. Claims with no contextual basis are hallucinations.

Build a golden dataset early. A set of 50–200 (query, expected_answer, relevant_chunks) examples is your most valuable testing asset. Build it before you optimize anything.

The RAG Testing Challenge

RAG (Retrieval-Augmented Generation) applications combine two complex systems: a retrieval system (typically vector search) and a generative LLM. Both can fail. And when the final output is wrong, you need to know which one failed.

Without structured testing, debugging RAG failures is guesswork: "Did we retrieve the wrong documents? Did the model ignore the documents we retrieved? Was the document chunked in a way that lost important context?"

A properly tested RAG pipeline gives you a failure dashboard instead of a mystery.

The RAG Pipeline and Its Failure Points

A standard RAG pipeline:

User query
  → Query preprocessing (optional)
  → Embedding the query
  → Vector search against document store
  → Retrieve top-k chunks
  → Build context window (retrieved chunks + system prompt)
  → LLM generates answer
  → Answer delivery

Each step is a potential failure point:

Query preprocessing: Query expansion or reformulation can distort the retrieval signal
Embedding quality: Low-quality embeddings produce poor similarity matches
Chunk size/overlap: Chunks that cut across sentence or concept boundaries lose context
Retrieval quality: The wrong chunks get returned for the query
Context assembly: Important retrieved chunks get truncated or poorly ordered
Generation faithfulness: The LLM ignores or contradicts the retrieved context
Answer relevance: The final answer doesn't actually address the user's question

Testing Retrieval Quality

Retrieval testing is the cheapest diagnostic step: it needs no LLM calls, only a check of whether the right documents come back for a given query.

Build a Labeled Query Set

First, create a set of (query, relevant_document_ids) pairs. This is your ground truth for retrieval evaluation.

[
  {
    "query": "What is the return policy for software purchases?",
    "relevant_doc_ids": ["policy-doc-id-42", "faq-returns-id-17"]
  },
  {
    "query": "How do I reset my password?",
    "relevant_doc_ids": ["help-auth-id-05"]
  }
]

Creating this dataset is manual work, but 50–100 labeled pairs give you reliable retrieval metrics. Use actual user queries from logs when possible — synthetic queries are less representative.
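Because the labeled set is built by hand, it is worth checking it for obvious mistakes before trusting any metric computed from it. The following is a minimal sketch, assuming the JSON structure shown above; `validate_labeled_queries` and the `known_doc_ids` parameter are hypothetical names, not part of any library.

```python
import json


def validate_labeled_queries(items, known_doc_ids=None):
    """Return a list of human-readable problems found in a labeled query set."""
    problems = []
    seen_queries = set()
    for i, item in enumerate(items):
        query = str(item.get("query", "")).strip()
        relevant = item.get("relevant_doc_ids") or []
        if not query:
            problems.append(f"item {i}: empty query")
        if not relevant:
            problems.append(f"item {i}: no relevant_doc_ids")
        # Case-insensitive duplicate check on the query text.
        if query and query.lower() in seen_queries:
            problems.append(f"item {i}: duplicate query")
        seen_queries.add(query.lower())
        # Optionally confirm every labeled ID exists in the document store.
        if known_doc_ids is not None:
            for doc_id in relevant:
                if doc_id not in known_doc_ids:
                    problems.append(f"item {i}: unknown doc id {doc_id!r}")
    return problems


sample = json.loads("""[
  {"query": "How do I reset my password?", "relevant_doc_ids": ["help-auth-id-05"]},
  {"query": "how do i reset my password?", "relevant_doc_ids": []}
]""")
problems = validate_labeled_queries(sample, known_doc_ids={"help-auth-id-05"})
for problem in problems:
    print(problem)
```

Running a check like this whenever the labeled set changes keeps stale document IDs from silently lowering recall.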

Measure Precision and Recall

For each query in your labeled set:

Precision@k: Of the k chunks returned, how many were actually relevant?
Recall@k: Of all relevant chunks, how many were in the top k?
MRR (Mean Reciprocal Rank): How highly was the first relevant result ranked?

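def evaluate_retrieval(retriever, labeled_queries, k=5):
    """Average precision@k, recall@k, and MRR over a labeled query set."""
    if not labeled_queries:
        raise ValueError("labeled_queries must not be empty")

    results = []
    for item in labeled_queries:
        relevant_ids = set(item["relevant_doc_ids"])
        if not relevant_ids:
            raise ValueError(f"no relevant_doc_ids for query: {item['query']!r}")

        retrieved = retriever.search(item["query"], k=k)
        retrieved_ids = [chunk.doc_id for chunk in retrieved][:k]

        # Precision counts every returned chunk that comes from a relevant
        # document, so two chunks from the same relevant document both count.
        relevant_chunks = sum(1 for doc_id in retrieved_ids if doc_id in relevant_ids)
        precision = relevant_chunks / k

        # Recall counts distinct relevant documents found in the top k.
        recall = len(set(retrieved_ids) & relevant_ids) / len(relevant_ids)

        # MRR: reciprocal rank of the first relevant result (0 if none)
        mrr = 0.0
        for rank, doc_id in enumerate(retrieved_ids, 1):
            if doc_id in relevant_ids:
                mrr = 1 / rank
                break

        results.append({"precision": precision, "recall": recall, "mrr": mrr})

    def avg(key):
        return sum(r[key] for r in results) / len(results)

    return {
        "precision@k": avg("precision"),
        "recall@k": avg("recall"),
        "mrr": avg("mrr"),
        "total_queries": len(results)
    }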

Run this after every change to your embedding model, chunk size, or retrieval configuration. Precision@5 below 0.6 or Recall@5 below 0.7 usually means the retrieval layer is the primary problem.
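To make the three metrics concrete, here is a small worked example for a single query. It is a sketch under one assumption: every retrieved chunk whose source document is labeled relevant counts toward precision, while recall counts distinct relevant documents. The IDs `faq-shipping-id-03` and `blog-id-99` are made up for illustration.

```python
def metrics_at_k(retrieved_doc_ids, relevant_doc_ids, k):
    """Precision@k, recall@k, and reciprocal rank for one ranked result list."""
    top_k = retrieved_doc_ids[:k]
    relevant = set(relevant_doc_ids)
    # Chunk-level precision: each returned chunk from a relevant document counts.
    precision = sum(1 for doc_id in top_k if doc_id in relevant) / k
    # Document-level recall: distinct relevant documents seen in the top k.
    recall = len(set(top_k) & relevant) / len(relevant)
    reciprocal_rank = 0.0
    for rank, doc_id in enumerate(top_k, start=1):
        if doc_id in relevant:
            reciprocal_rank = 1 / rank
            break
    return precision, recall, reciprocal_rank


# Source document of each of the five returned chunks, in rank order.
retrieved = [
    "faq-shipping-id-03",  # rank 1: not relevant
    "policy-doc-id-42",    # rank 2: relevant (first hit)
    "policy-doc-id-42",    # rank 3: second chunk from the same document
    "faq-returns-id-17",   # rank 4: relevant
    "blog-id-99",          # rank 5: not relevant
]
relevant = ["policy-doc-id-42", "faq-returns-id-17"]

precision, recall, rr = metrics_at_k(retrieved, relevant, k=5)
print(precision, recall, rr)  # 3/5, 2/2, 1/2
```

Three of the five chunks come from relevant documents, both relevant documents were found, and the first relevant chunk sits at rank 2.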

Testing Chunking Strategies

Chunking — splitting documents into retrievable pieces — has a massive impact on retrieval quality that's often underestimated.

Common chunking problems:

Chunks too small: Each chunk lacks enough context to be meaningful. Answers require information from multiple chunks that may not be retrieved together.
Chunks too large: One chunk covers too many topics. Similarity scores get diluted.
Boundary cuts: Splitting mid-sentence or mid-table destroys the semantic unit.

Test chunking by running your full retrieval eval against different chunk configurations:

Config	Chunk Size	Overlap	Precision@5	Recall@5
Baseline	512 tokens	0	0.61	0.58
Large	1024 tokens	0	0.64	0.67
Overlapping	512 tokens	128	0.70	0.72
Sentence-aware	Variable	0	0.73	0.71

There's no universally correct chunk size. The optimal configuration depends on your document structure and query distribution. Test with your actual data.
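The fixed-size configurations in the table can be produced with a simple sliding window. The sketch below assumes the document is already tokenized into a list; whitespace-separated words stand in for real tokens here, and `chunk_tokens` is a hypothetical helper. A real tokenizer would give different counts.

```python
def chunk_tokens(tokens, chunk_size, overlap=0):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Two of the configurations from the table, applied to a 2000-token document.
configs = [
    {"name": "Baseline", "chunk_size": 512, "overlap": 0},
    {"name": "Overlapping", "chunk_size": 512, "overlap": 128},
]
document_tokens = [f"tok{i}" for i in range(2000)]
chunk_counts = {
    cfg["name"]: len(chunk_tokens(document_tokens, cfg["chunk_size"], cfg["overlap"]))
    for cfg in configs
}
print(chunk_counts)
```

Each configuration would then be indexed separately and scored with the same retrieval eval, so the only variable that changes between rows is the chunking.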

Advanced techniques to test:

Hierarchical chunking: Small chunks for retrieval, larger parent chunks for context
Semantic chunking: Split on sentence boundaries and semantic similarity rather than fixed token counts
Metadata-enriched chunks: Prepend document title and section header to each chunk
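As an illustration of two of these ideas together, the sketch below packs whole sentences into chunks and prepends the document title and section header to each one. It is a rough approximation, not a full semantic chunker: the sentence splitter is a naive regex, sizes are measured in words rather than tokens, and `sentence_aware_chunks` is a hypothetical name.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_aware_chunks(text, max_words, title="", section=""):
    """Pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words still becomes its own chunk.
    """
    header = " / ".join(part for part in (title, section) if part)
    chunks, current, count = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    # Metadata enrichment: prefix every chunk with its title and section.
    if header:
        chunks = [f"[{header}] {chunk}" for chunk in chunks]
    return chunks


text = (
    "Refunds are issued within 14 days. Software licenses are excluded. "
    "Contact support to start a return."
)
chunks = sentence_aware_chunks(text, max_words=10, title="Return Policy", section="Software")
for chunk in chunks:
    print(chunk)
```

The header prefix gives the embedding some topical signal even when the chunk body is a short fragment.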

Testing Generation Faithfulness

Once you know retrieval is working, test whether the generator actually uses the retrieved context faithfully.

Faithfulness measures: do all claims in the output have support in the retrieved chunks?

import json

# call_llm(prompt) -> str is assumed to be your own wrapper around an LLM client.

def test_faithfulness(query: str, retrieved_chunks: list[str], generated_answer: str):
    context = "\n".join(retrieved_chunks)

    # Extract claims from the answer
    claim_prompt = f"""
    Extract all factual claims from this answer as a JSON list of strings:

    Answer: {generated_answer}

    Return only JSON: ["claim1", "claim2", ...]
    """
    claims = json.loads(call_llm(claim_prompt))
    if not claims:
        # No factual claims (for example, a refusal): nothing can be unsupported
        return {"score": 1.0, "unsupported_claims": []}

    # Check each claim against context
    unsupported = []
    for claim in claims:
        support_prompt = f"""
        Does this context support this claim? Answer with one word: yes or no.

        Context: {context}
        Claim: {claim}
        """
        # Match on the start of the reply so "no, but yes in part" is not a yes
        verdict = call_llm(support_prompt).strip().lower()
        if not verdict.startswith("yes"):
            unsupported.append(claim)

    faithfulness_score = 1 - (len(unsupported) / len(claims))
    return {"score": faithfulness_score, "unsupported_claims": unsupported}

A faithfulness score below 0.85 suggests the model is adding claims that the retrieved context does not support.
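One practical weak point in a check like this is parsing the claim list: models often wrap JSON in a code fence or add a sentence around it, which makes a bare `json.loads` call fail. The helper below is a hedged sketch of a more tolerant parser; `parse_claims` is a hypothetical name, and the approach (take the outermost bracketed span) is an assumption that works for flat lists of strings.

```python
import json
import re


def parse_claims(raw: str) -> list[str]:
    """Parse an LLM reply that should contain a JSON list of strings."""
    # Take the outermost [...] span and ignore any fence or prose around it.
    match = re.search(r"\[.*\]", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON list found in model reply")
    parsed = json.loads(match.group(0))
    if not isinstance(parsed, list):
        raise ValueError("model reply did not contain a JSON list")
    # Normalize to non-empty strings.
    return [str(claim).strip() for claim in parsed if str(claim).strip()]


# A reply wrapped in a Markdown code fence, as many models produce.
fence = "`" * 3
raw_reply = fence + 'json\n["Refunds take 14 days", "Licenses are excluded"]\n' + fence
claims = parse_claims(raw_reply)
print(claims)
```

Raising on an unparseable reply, instead of returning an empty list, keeps a formatting failure from being scored as a perfectly faithful answer.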

Testing Answer Relevance

An answer can be faithful to the retrieved context and still fail to answer the actual question. Test answer relevance separately:

Does the generated answer actually address what the user asked?

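def test_answer_relevance(query: str, answer: str) -> float:
    # call_llm(prompt) -> str is assumed to be your own LLM client wrapper
    prompt = f"""
    Rate how well this answer addresses the question on a scale of 0-1.

    Question: {query}
    Answer: {answer}

    Return only a number between 0 and 1.
    """
    score = float(call_llm(prompt).strip())
    # Clamp in case the model returns a value slightly outside the range
    return min(1.0, max(0.0, score))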

Low answer relevance with high faithfulness usually means the retrieved context was off-topic: the model stayed faithful to chunks that did not contain the answer. That is a retrieval problem, not a generation problem.
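The two scores can be combined into a rough triage rule. This is a sketch, not a definitive diagnosis: `diagnose` is a hypothetical helper, the default thresholds are borrowed from the checklist values used elsewhere in this guide (0.85 and 0.80), and the "both low" branch is an assumption about where to look first.

```python
def diagnose(faithfulness, relevance, faith_min=0.85, rel_min=0.80):
    """Map a (faithfulness, relevance) pair to the layer most likely at fault."""
    faithful = faithfulness >= faith_min
    relevant = relevance >= rel_min
    if faithful and relevant:
        return "ok"
    if faithful and not relevant:
        # The model used its context correctly, but the context was off-topic.
        return "retrieval: context likely off-topic"
    if not faithful and relevant:
        # The answer is on-topic but contains claims the context does not support.
        return "generation: answer adds unsupported claims"
    # Assumption: with both low, check retrieval first, since bad context
    # makes the generation scores hard to interpret.
    return "both: check retrieval first, then generation"


print(diagnose(0.95, 0.40))
print(diagnose(0.50, 0.90))
```

Tagging each failing example with a label like this makes it easy to count which layer accounts for most failures.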

End-to-End RAG Evaluation

After testing each layer, run end-to-end evaluations that measure the full pipeline against your golden dataset.

For each (query, expected_answer) pair:

Run the full RAG pipeline
Compute faithfulness of the output
Compute answer relevance
If you have expected answers: compute semantic similarity against expected
Optionally: LLM-as-judge for holistic quality
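The steps above can be sketched as a single loop. Everything passed in is an assumption: `run_pipeline`, `faithfulness_fn`, `relevance_fn`, and `embed` are hypothetical callables standing in for your own pipeline, judges, and embedding model, and cosine similarity is just one common choice for comparing an answer with the expected answer.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def evaluate_end_to_end(golden, run_pipeline, faithfulness_fn, relevance_fn, embed):
    """Run every golden example through the pipeline and average the scores."""
    if not golden:
        raise ValueError("golden dataset must not be empty")
    rows = []
    for example in golden:
        # run_pipeline is assumed to return (answer, retrieved_chunks).
        answer, chunks = run_pipeline(example["query"])
        rows.append({
            "faithfulness": faithfulness_fn(example["query"], chunks, answer),
            "answer_relevance": relevance_fn(example["query"], answer),
            "semantic_similarity_to_expected": cosine_similarity(
                embed(answer), embed(example["expected_answer"])
            ),
        })
    return {key: sum(row[key] for row in rows) / len(rows) for key in rows[0]}


# Stubs standing in for the real pipeline and judges (all hypothetical).
golden = [{
    "query": "How do I reset my password?",
    "expected_answer": "Use the reset link on the login page.",
}]
report = evaluate_end_to_end(
    golden,
    run_pipeline=lambda q: ("Use the reset link on the login page.", ["chunk text"]),
    faithfulness_fn=lambda q, chunks, answer: 1.0,
    relevance_fn=lambda q, answer: 0.9,
    embed=lambda text: [float(len(text)), 1.0],
)
print(report)
```

Passing the components in as callables keeps the loop testable with stubs, so the harness itself can be verified without any LLM calls.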

Track these metrics across pipeline changes:

golden_results = {
    "retrieval_precision@5": 0.72,
    "faithfulness": 0.91,
    "answer_relevance": 0.85,
    "semantic_similarity_to_expected": 0.88
}

Run end-to-end evals on every significant change: new embedding model, chunking strategy change, system prompt update, model upgrade.
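Tracking is most useful when each new run is compared against a stored baseline. The sketch below flags any metric that dropped by more than a tolerance; `find_regressions` is a hypothetical helper, the 0.02 tolerance is an arbitrary assumption to absorb run-to-run noise, and the "current" numbers are invented for illustration.

```python
def find_regressions(baseline, current, tolerance=0.02):
    """Return {metric: (old, new)} for metrics that dropped by more than tolerance."""
    regressions = {}
    for metric, old in baseline.items():
        new = current.get(metric)
        # A metric missing from the new run is treated as a regression.
        if new is None or old - new > tolerance:
            regressions[metric] = (old, new)
    return regressions


baseline = {
    "retrieval_precision@5": 0.72,
    "faithfulness": 0.91,
    "answer_relevance": 0.85,
    "semantic_similarity_to_expected": 0.88,
}
# Hypothetical results after changing the chunk size.
current = {
    "retrieval_precision@5": 0.64,
    "faithfulness": 0.92,
    "answer_relevance": 0.84,
    "semantic_similarity_to_expected": 0.88,
}
regressions = find_regressions(baseline, current)
print(regressions)
```

Here only retrieval precision is flagged; the small dip in answer relevance stays inside the tolerance.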

Testing Context Window Management

Long document collections often produce more retrieved chunks than fit in the context window. How you handle this matters:

Truncation: What happens when you have 15 relevant chunks but only 5 fit? Are the right 5 selected?
Reranking: Does a reranker correctly promote the most relevant chunks to the top?
Ordering: Does chunk order in the context window affect output quality? (It often does — models tend to use information at the start and end of context more than the middle.)

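Test reranking by retrieving a larger candidate set (for example, the top 20), reranking it, and comparing precision@5 and MRR with and without the reranker on your labeled query set. Reordering the same k results cannot change precision@k, so the comparison only makes sense when the reranker picks the top k from a larger pool. If it does not improve these metrics noticeably, it is not worth its latency cost.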
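Truncation and ordering can both be made explicit and testable. The following is a minimal sketch with hypothetical helpers: `select_within_budget` greedily keeps the highest-scoring chunks that fit a token budget, and `order_edges_first` places the strongest chunks at the start and end of the context with the weakest in the middle. The chunk dictionaries and their `score` and `tokens` fields are assumptions about your data shape.

```python
def select_within_budget(chunks, token_budget):
    """Greedily keep the highest-scoring chunks whose token counts fit the budget."""
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        # Skip a chunk that does not fit; a smaller, lower-ranked one still might.
        if used + chunk["tokens"] <= token_budget:
            selected.append(chunk)
            used += chunk["tokens"]
    return selected


def order_edges_first(chunks_by_relevance):
    """Alternate chunks between the front and the back of the context.

    Input is sorted best-first; the best chunk ends up first, the second
    best last, and the weakest chunks land in the middle.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


chunks = [
    {"id": "a", "score": 0.9, "tokens": 300},
    {"id": "b", "score": 0.8, "tokens": 500},
    {"id": "c", "score": 0.7, "tokens": 200},
]
kept = select_within_budget(chunks, token_budget=600)
print([c["id"] for c in kept])
print(order_edges_first([1, 2, 3, 4, 5]))
```

Whether edge-first ordering helps is model-dependent, so it is worth comparing it against plain best-first ordering on the golden dataset.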

Regression Testing RAG Pipelines

RAG systems regress silently. Someone updates the embedding model, changes the chunk size, or modifies the system prompt — and retrieval quality drops by 15% without anyone noticing for weeks.

Run your retrieval eval and faithfulness tests in CI:

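# .github/workflows/rag-eval.yaml
on:
  push:
    paths:
      - 'src/rag/**'
      - 'prompts/**'
      - 'chunking/**'
  # Run on pull requests as well, so the result can gate merges
  pull_request:
    paths:
      - 'src/rag/**'
      - 'prompts/**'
      - 'chunking/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      # Assumes dependencies are listed in requirements.txt
      - run: pip install -r requirements.txt
      - run: python eval/retrieval.py --fail-below 0.65
      - run: python eval/faithfulness.py --fail-below 0.85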

Gate merges on retrieval precision and faithfulness thresholds. Silent regressions are the biggest reliability risk in RAG applications.
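For a CI step to gate anything, the eval script has to exit with a non-zero status when the metric falls below the threshold. The sketch below shows one way a `--fail-below` flag could be implemented; the internals of the eval scripts are not specified here, so `main`, `gate`, and the `compute_score` callable are all assumptions.

```python
import argparse


def gate(score, threshold):
    """Return a process exit code: 0 if the score passes, 1 otherwise."""
    return 0 if score >= threshold else 1


def main(argv, compute_score):
    """Parse --fail-below, compute the metric, and return an exit code."""
    parser = argparse.ArgumentParser(
        description="Fail CI when a metric drops below a threshold"
    )
    parser.add_argument("--fail-below", type=float, required=True)
    args = parser.parse_args(argv)
    score = compute_score()
    print(f"score={score:.3f} threshold={args.fail_below:.3f}")
    return gate(score, args.fail_below)


# In a real script this would be something like:
#     raise SystemExit(main(sys.argv[1:], compute_score=run_retrieval_eval))
# where run_retrieval_eval is your own function. Here a fixed score stands in.
exit_code = main(["--fail-below", "0.65"], compute_score=lambda: 0.72)
print(exit_code)
```

Printing the score alongside the threshold makes a failed CI run self-explanatory in the job log.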

Production Monitoring

Offline evals miss real user query distributions. Monitor production RAG runs:

Log for every query:

User query and preprocessed query
Retrieved chunk IDs and similarity scores
Full generated answer
Latency breakdown (embedding, retrieval, generation)
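One way to capture these fields is a single structured record per query, written as one JSON line. This is a sketch: `RagTrace` and its field names are hypothetical, chosen to mirror the list above.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class RagTrace:
    """One log record per production query."""
    query: str
    preprocessed_query: str
    chunk_ids: list[str]
    similarity_scores: list[float]
    answer: str
    latency_ms: dict[str, float]  # keys: embedding, retrieval, generation
    timestamp: float = field(default_factory=time.time)

    def to_json_line(self) -> str:
        """Serialize to a single JSON line suitable for append-only logs."""
        return json.dumps(asdict(self), ensure_ascii=False)


trace = RagTrace(
    query="How do I reset my password?",
    preprocessed_query="reset password",
    chunk_ids=["help-auth-id-05"],
    similarity_scores=[0.83],
    answer="Use the reset link on the login page.",
    latency_ms={"embedding": 12.0, "retrieval": 35.0, "generation": 840.0},
)
print(trace.to_json_line())
```

Storing chunk IDs instead of full chunk text keeps records small while still letting an offline job reconstruct the context for faithfulness scoring.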

Evaluate asynchronously on a sample:

Faithfulness score
Answer relevance score
User feedback signals (thumbs down, follow-up questions suggesting confusion)
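Sampling can be made deterministic by hashing a stable identifier, so the same query is always either in or out of the evaluated sample. A minimal sketch, assuming each logged query has a string ID; `should_evaluate` is a hypothetical helper.

```python
import hashlib


def should_evaluate(trace_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly sample_rate of traces for async eval."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


sampled = [tid for tid in (f"trace-{i}" for i in range(1000)) if should_evaluate(tid, 0.05)]
print(len(sampled))
```

Hash-based sampling needs no shared state between workers, which suits an asynchronous evaluation job.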

Alert on:

Faithfulness drops below 0.80 (model hallucinating)
Retrieval returning zero results for queries that previously had results (document store issues)
Average similarity scores declining (embedding drift or data quality issues)
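The alert conditions above can be sketched as a rolling average plus a small rule check. `RollingMean` and `check_alerts` are hypothetical helpers; the 0.80 faithfulness floor comes from the list above, while the 0.05 similarity-drop threshold is an arbitrary assumption that would need tuning.

```python
from collections import deque


class RollingMean:
    """Mean over the most recent `window` values."""

    def __init__(self, window):
        self.values = deque(maxlen=window)

    def add(self, value):
        self.values.append(value)
        return sum(self.values) / len(self.values)


def check_alerts(faithfulness_mean, retrieved_count, had_results_before,
                 similarity_mean, similarity_baseline, max_similarity_drop=0.05):
    """Return the list of alert messages triggered by the current readings."""
    alerts = []
    if faithfulness_mean < 0.80:
        alerts.append("faithfulness below 0.80")
    if retrieved_count == 0 and had_results_before:
        alerts.append("zero results for a query that previously had results")
    if similarity_baseline - similarity_mean > max_similarity_drop:
        alerts.append("average similarity declining")
    return alerts


faithfulness = RollingMean(window=3)
for score in (0.90, 0.75, 0.70):
    mean = faithfulness.add(score)

alerts = check_alerts(
    faithfulness_mean=mean,
    retrieved_count=0,
    had_results_before=True,
    similarity_mean=0.60,
    similarity_baseline=0.70,
)
print(alerts)
```

Alerting on a rolling mean instead of single scores avoids paging on one noisy judge verdict.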

Putting It Together

A RAG testing checklist:

Labeled query set (50+ examples) exists
Retrieval precision@5 ≥ 0.70
Retrieval recall@5 ≥ 0.70
Faithfulness score ≥ 0.85 on golden set
Answer relevance ≥ 0.80 on golden set
Chunking configuration tested across multiple strategies
CI gates on retrieval and faithfulness metrics
Production logging and async eval in place
Alert thresholds configured

Testing a RAG application is more work than testing a simple LLM call — but the payoff is knowing exactly where failures come from and catching regressions before users do.