Testing RAG Pipelines: Retrieval Quality, Context Relevance, and Answer Faithfulness
RAG (Retrieval-Augmented Generation) systems fail in ways that don't look like failures. The answer is fluent, confident, and wrong. The retrieved context was relevant but the LLM ignored it. The retrieved context was irrelevant and the LLM hallucinated around it. Without systematic testing, these failures reach users invisibly.
Testing a RAG pipeline requires evaluating three distinct components: the retrieval step (did we fetch the right documents?), the context relevance step (are the fetched documents actually useful for this query?), and the generation step (does the answer faithfully use the context?). Each component can fail independently.
The Three Failure Modes of RAG Systems
Retrieval failure: The most relevant documents aren't returned. Either the embedding similarity doesn't capture semantic meaning correctly, or the top-K cutoff is too aggressive. Result: the LLM has no good context to work with.
Context relevance failure: Documents are retrieved but they're not relevant to the specific question. A query about Python exception handling might retrieve a Python installation guide. The retrieved text is "about Python" but doesn't answer the question.
Faithfulness failure: The context is correct and relevant, but the LLM ignores or misinterprets it. Instead of grounding the answer in retrieved facts, the model uses its training data and produces confident hallucinations.
Evaluating Retrieval Quality
Retrieval quality tests verify that the right documents come back for given queries. This requires a test dataset: a set of (query, relevant_doc_ids) pairs, optionally with known-irrelevant document IDs for negative tests.
Building a Retrieval Test Dataset
# test_dataset.json structure
[
  {
    "query": "How do I configure CORS in Express.js?",
    "relevant_doc_ids": ["express-cors-setup", "express-middleware-guide"],
    "irrelevant_doc_ids": ["flask-cors-setup", "django-cors-headers"]
  },
  {
    "query": "What is the difference between supervised and unsupervised learning?",
    "relevant_doc_ids": ["ml-fundamentals", "supervised-learning-intro"],
    "irrelevant_doc_ids": ["deep-learning-architectures"]
  }
]

Measuring Recall@K
Recall@K measures what fraction of the relevant documents appear in the top-K retrieved results, i.e. |relevant ∩ top-K| / |relevant|:
import pytest
import json
from your_rag_system import retrieve_documents

def recall_at_k(retrieved_ids: list, relevant_ids: list, k: int) -> float:
    """Fraction of the relevant documents that appear in the top-K results."""
    if not relevant_ids:
        return 1.0  # nothing to find; avoid dividing by zero
    top_k = retrieved_ids[:k]
    hits = len(set(top_k) & set(relevant_ids))
    return hits / len(relevant_ids)
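# Worked example (hypothetical IDs): two relevant docs, one of them in the top 5
# recall_at_k(['d1', 'd7', 'd3', 'd9', 'd2'], relevant_ids=['d1', 'd4'], k=5)
# -> 1 hit / 2 relevant docs = 0.5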
@pytest.fixture
def test_dataset():
    with open('tests/retrieval_test_dataset.json') as f:
        return json.load(f)

def test_retrieval_recall_at_5(test_dataset):
    total_recall = 0
    for case in test_dataset:
        results = retrieve_documents(case['query'], top_k=5)
        retrieved_ids = [r['id'] for r in results]
        recall = recall_at_k(retrieved_ids, case['relevant_doc_ids'], k=5)
        total_recall += recall
    avg_recall = total_recall / len(test_dataset)
    # Set your minimum acceptable recall threshold
    assert avg_recall >= 0.80, f"Retrieval recall@5 is {avg_recall:.2f}, below threshold 0.80"
def test_irrelevant_docs_not_in_top_3(test_dataset):
    for case in test_dataset:
        results = retrieve_documents(case['query'], top_k=3)
        retrieved_ids = [r['id'] for r in results]
        for irrelevant_id in case['irrelevant_doc_ids']:
            assert irrelevant_id not in retrieved_ids, (
                f"Irrelevant doc '{irrelevant_id}' appeared in top-3 for query: {case['query']}"
            )

Precision@K
Precision@K measures the fraction of the top-K retrieved documents that are actually relevant, i.e. |relevant ∩ top-K| / K:
def precision_at_k(retrieved_ids: list, relevant_ids: list, k: int) -> float:
    """Fraction of the top-K retrieved documents that are relevant."""
    top_k = retrieved_ids[:k]
    hits = len(set(top_k) & set(relevant_ids))
    return hits / k
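# Same hypothetical lists as above, now judged against K instead of |relevant|:
# precision_at_k(['d1', 'd7', 'd3'], relevant_ids=['d1', 'd4'], k=3)
# -> 1 hit / k=3 = 0.33 (recall@3 on the same lists would be 1/2 = 0.5)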
def test_retrieval_precision_at_3(test_dataset):
    total_precision = 0
    for case in test_dataset:
        results = retrieve_documents(case['query'], top_k=3)
        retrieved_ids = [r['id'] for r in results]
        precision = precision_at_k(retrieved_ids, case['relevant_doc_ids'], k=3)
        total_precision += precision
    avg_precision = total_precision / len(test_dataset)
    assert avg_precision >= 0.70, f"Precision@3 is {avg_precision:.2f}, below threshold 0.70"

Evaluating Context Relevance
Even if the right documents are retrieved, they may not be relevant to the specific question. Context relevance scoring evaluates whether each retrieved chunk actually helps answer the query.
LLM-Based Relevance Scoring
from openai import OpenAI

client = OpenAI()

def score_context_relevance(query: str, context: str) -> float:
    """Score how relevant a context chunk is to a query. Returns 0-1."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a relevance scorer. Given a query and a context chunk, "
                    "score how relevant the context is to answering the query. "
                    "Return only a number from 0 to 1, where 1 = perfectly relevant, "
                    "0 = completely irrelevant."
                )
            },
            {
                "role": "user",
                "content": f"Query: {query}\n\nContext: {context}\n\nRelevance score:"
            }
        ],
        max_tokens=10,
        temperature=0,
    )
    score_str = response.choices[0].message.content.strip()
    try:
        return float(score_str)
    except ValueError:
        # The judge occasionally returns stray text; fail loudly rather than mis-score
        raise ValueError(f"Relevance scorer returned non-numeric output: {score_str!r}")
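# Illustrative call (judge output varies run to run; the value shown is an assumption):
# score_context_relevance(
#     "How do I configure CORS in Express.js?",
#     "Install the cors package and call app.use(cors()) before registering routes.",
# )  # -> e.g. 0.9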
def test_context_relevance_above_threshold():
    test_cases = [
        {
            "query": "How do I handle database connection errors in Python?",
            "context": "Use try/except blocks around database operations. SQLAlchemy raises OperationalError for connection failures.",
            "min_score": 0.8,
        },
        {
            "query": "How do I handle database connection errors in Python?",
            "context": "Python was created by Guido van Rossum in 1991 and is named after Monty Python.",
            "max_score": 0.2,
        },
    ]
    for case in test_cases:
        score = score_context_relevance(case['query'], case['context'])
        if 'min_score' in case:
            assert score >= case['min_score'], f"Expected relevance >= {case['min_score']}, got {score}"
        if 'max_score' in case:
            assert score <= case['max_score'], f"Expected relevance <= {case['max_score']}, got {score}"

Evaluating Answer Faithfulness
Faithfulness measures whether the generated answer is grounded in the provided context, or whether the LLM is making things up.
Claim Extraction and Verification
def extract_claims(answer: str) -> list[str]:
    """Extract factual claims from an answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Extract each factual claim from the following answer "
                    "as a list of simple statements, one per line."
                )
            },
            {"role": "user", "content": answer}
        ],
        temperature=0,
    )
    claims_text = response.choices[0].message.content
    # One claim per line; strip any list markers the model adds
    return [line.strip('- ').strip() for line in claims_text.split('\n') if line.strip()]
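# Illustrative input/output (the exact claim splits depend on the judge model):
# extract_claims("Redis is in-memory and was released in 2009.")
# -> ["Redis is an in-memory store.", "Redis was released in 2009."]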
def verify_claim_against_context(claim: str, context: str) -> bool:
    """Check if a claim is supported by the context."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Given a context and a claim, determine if the claim is supported by the context. "
                    "Respond with only 'YES' or 'NO'."
                )
            },
            {
                "role": "user",
                "content": f"Context: {context}\n\nClaim: {claim}\n\nIs the claim supported?"
            }
        ],
        max_tokens=5,
        temperature=0,
    )
    # startswith() tolerates trailing punctuation like "YES."
    return response.choices[0].message.content.strip().upper().startswith('YES')
def faithfulness_score(answer: str, context: str) -> float:
    """Calculate what fraction of claims in the answer are supported by context."""
    claims = extract_claims(answer)
    if not claims:
        return 1.0
    supported = sum(1 for c in claims if verify_claim_against_context(c, context))
    return supported / len(claims)
def test_answer_faithfulness():
    context = (
        "Redis is an in-memory data structure store used as a database, cache, and message broker. "
        "It supports data structures such as strings, hashes, lists, sets, and sorted sets. "
        "Redis was created by Salvatore Sanfilippo and first released in 2009."
    )
    answer = (
        "Redis is an in-memory database that supports strings, hashes, and lists. "
        "It was created in 2009 and is commonly used for caching."
    )
    score = faithfulness_score(answer, context)
    assert score >= 0.90, f"Faithfulness score {score:.2f} is below 0.90 threshold"

def test_hallucination_detection():
    context = (
        "Python is dynamically typed. Variables do not need type declarations."
    )
    hallucinated_answer = (
        "Python is a statically typed language, like Java, which requires explicit type declarations. "
        "This was introduced in Python 3.5."
    )
    score = faithfulness_score(hallucinated_answer, context)
    assert score < 0.3, f"Hallucination not detected - faithfulness score was {score:.2f}"

End-to-End RAG Pipeline Test
Combine all three evaluations in a single pipeline test:
# Assumes your_rag_system exposes a rag_pipeline object whose .answer() returns
# the generated answer along with the retrieved documents
from your_rag_system import rag_pipeline

def test_rag_pipeline_end_to_end():
    query = "What are the main differences between PostgreSQL and MySQL?"
    # Run the full RAG pipeline
    result = rag_pipeline.answer(query)

    # 1. Check retrieval quality
    retrieved_ids = [doc['id'] for doc in result.retrieved_docs]
    assert any('postgres' in doc_id for doc_id in retrieved_ids), \
        "No PostgreSQL-related documents retrieved"

    # 2. Check context relevance
    for doc in result.retrieved_docs:
        relevance = score_context_relevance(query, doc['content'])
        assert relevance >= 0.5, \
            f"Retrieved document '{doc['id']}' has low relevance score: {relevance:.2f}"

    # 3. Check answer faithfulness
    full_context = ' '.join(doc['content'] for doc in result.retrieved_docs)
    faithfulness = faithfulness_score(result.answer, full_context)
    assert faithfulness >= 0.85, \
        f"Answer faithfulness {faithfulness:.2f} below threshold"

    # 4. Check the answer is non-empty and not a refusal
    assert len(result.answer) > 50, "Answer is too short"
    assert "I don't know" not in result.answer, \
        "RAG system refused to answer with valid context available"

Regression Testing After Changes
Track metrics over time to catch regressions when you update embeddings, chunking strategy, or the LLM:
# Store baseline metrics from last known-good run
BASELINE_METRICS = {
    'recall_at_5': 0.85,
    'precision_at_3': 0.72,
    'avg_context_relevance': 0.78,
    'avg_faithfulness': 0.91,
}
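# The test below calls compute_retrieval_metrics, which the article never defines.
# A minimal sketch of one possible implementation, reusing recall_at_k,
# precision_at_k, score_context_relevance, faithfulness_score,
# retrieve_documents, and rag_pipeline from the sections above:
from statistics import mean

def compute_retrieval_metrics(test_dataset: list[dict]) -> dict:
    recalls, precisions, relevances, faith = [], [], [], []
    for case in test_dataset:
        results = retrieve_documents(case['query'], top_k=5)
        retrieved_ids = [r['id'] for r in results]
        recalls.append(recall_at_k(retrieved_ids, case['relevant_doc_ids'], k=5))
        precisions.append(precision_at_k(retrieved_ids, case['relevant_doc_ids'], k=3))
        # LLM-judged metrics: relevance of each retrieved chunk, plus
        # faithfulness of the end-to-end answer against its own context
        relevances.extend(score_context_relevance(case['query'], r['content']) for r in results)
        result = rag_pipeline.answer(case['query'])
        full_context = ' '.join(doc['content'] for doc in result.retrieved_docs)
        faith.append(faithfulness_score(result.answer, full_context))
    return {
        'recall_at_5': mean(recalls),
        'precision_at_3': mean(precisions),
        'avg_context_relevance': mean(relevances),
        'avg_faithfulness': mean(faith),
    }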
def test_no_retrieval_regression(test_dataset):
    current_metrics = compute_retrieval_metrics(test_dataset)
    for metric, baseline in BASELINE_METRICS.items():
        current = current_metrics[metric]
        # Allow 5% degradation before failing
        assert current >= baseline * 0.95, (
            f"Regression detected: {metric} dropped from {baseline:.2f} to {current:.2f}"
        )

Key Takeaways
- Test all three components separately: retrieval quality, context relevance, and answer faithfulness
- Use a curated test dataset with known (query, relevant_doc_ids) pairs for retrieval testing
- LLM-based relevance and faithfulness scoring is practical for automated evaluation
- Faithfulness testing via claim extraction is one of the most reliable automated approaches to hallucination detection
- Track metrics over time: a 5% regression in faithfulness after changing the LLM is a real problem
- Run RAG evaluation tests in CI on every change to embedding models, chunking, or LLM selection