Ragas Guide: Evaluating RAG Pipelines with Faithfulness, Relevancy, and Precision
Ragas gives you a rigorous metric suite for RAG pipelines: faithfulness, answer relevancy, context precision, and context recall. Each metric isolates a different failure mode — bad retrieval vs. bad generation vs. incomplete recall. This guide shows you how to compute them, interpret scores, and integrate Ragas into CI.
The RAG Evaluation Problem
Retrieval-Augmented Generation has a deceptively complex failure space. Your chatbot can fail because:
- Retrieval is bad — wrong chunks, irrelevant documents
- Retrieval is incomplete — right topic, but missing key facts
- Generation hallucinates — model ignores context, invents facts
- Generation is off-topic — context was fine, but the answer doesn't address the question
A single "thumbs up / thumbs down" score can't tell you which layer broke. Ragas fixes this with four targeted metrics, each measuring a distinct failure mode.
Installation
pip install ragas
Ragas uses OpenAI as the default judge. Set your key:
export OPENAI_API_KEY=sk-...
The Four Core Metrics
1. Faithfulness
What it measures: Does the generated answer stick to the retrieved context?
A faithful answer only makes claims that are explicitly supported by the context. Faithfulness catches hallucinations where the model ignores retrieved evidence and generates from its training data.
Score: 0.0 (completely hallucinated) to 1.0 (every claim supported by context)
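The arithmetic behind this score can be sketched as follows: the answer is split into individual claims, each claim is judged as supported or unsupported by the context, and the score is the supported fraction. In Ragas an LLM does both the claim extraction and the judging; here the verdicts are hand-labeled, and `faithfulness_score` is a hypothetical helper, not part of the Ragas API.

```python
def faithfulness_score(claim_verdicts: list[bool]) -> float:
    """Fraction of answer claims that are supported by the retrieved context."""
    if not claim_verdicts:
        return 0.0  # assumption: an answer with no claims scores 0
    return sum(claim_verdicts) / len(claim_verdicts)


# Hand-labeled verdicts for an answer with four claims (True = supported).
verdicts = [True, True, True, False]
print(f"{faithfulness_score(verdicts):.2f}")
```

With three of four claims supported, the sketch yields 0.75: one unsupported claim is enough to pull the score well below 1.0.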
2. Answer Relevancy
What it measures: Does the answer actually address the user's question?
A high-relevancy answer is complete and directly on-topic. Low relevancy means the answer is vague, tangential, or answers a different question.
Score: 0.0 (unrelated) to 1.0 (perfectly addresses the question)
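One way to compute a score like this, and roughly the approach Ragas documents, is to have the judge generate questions that the answer would plausibly be answering, embed them, and average their cosine similarity to the original question. The sketch below shows only that final averaging step. The two-dimensional vectors are toy stand-ins for real embeddings, and `answer_relevancy_score` is a hypothetical helper.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def answer_relevancy_score(
    question_vec: list[float],
    generated_question_vecs: list[list[float]],
) -> float:
    """Mean similarity between the real question and questions implied by the answer."""
    sims = [cosine_similarity(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy embeddings: one generated question matches the original, one is orthogonal.
question = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0]]
print(answer_relevancy_score(question, generated))
```

An answer that drifts off-topic implies questions far from the one the user asked, which lowers the average.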
3. Context Precision
What it measures: Are the retrieved chunks relevant to the question?
Context precision = useful chunks / total retrieved chunks. If you retrieve 10 chunks and 3 are actually relevant, precision is 0.30. Low precision means your retriever returns noise.
Score: 0.0 (all retrieved chunks are irrelevant) to 1.0 (all retrieved chunks are relevant)
4. Context Recall
What it measures: Did the retrieval capture all the information needed to answer the question?
Recall requires a ground-truth answer. It checks whether the retrieved chunks contain all the facts in the expected answer. Low recall means your knowledge base is missing critical information or your retriever can't find it.
Score: 0.0 (nothing relevant was retrieved) to 1.0 (all needed information was retrieved)
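As a minimal sketch of the recall arithmetic: the ground-truth answer is split into statements, each statement is checked against the retrieved chunks, and the score is the fraction that can be attributed to them. Ragas uses an LLM for the attribution step; the verdicts below are hand-labeled for illustration, and `context_recall_score` is a hypothetical helper.

```python
def context_recall_score(attributions: dict[str, bool]) -> float:
    """Fraction of ground-truth statements that the retrieved context supports."""
    if not attributions:
        return 0.0  # assumption: no ground-truth statements means nothing to recall
    return sum(attributions.values()) / len(attributions)


# Ground-truth statements mapped to "found in retrieved context?" (illustrative labels).
attributions = {
    "The free plan includes up to 10 tests.": True,
    "The free plan includes unlimited health checks.": True,
    "The free plan has 5-minute monitoring intervals.": False,
}
print(f"{context_recall_score(attributions):.2f}")
```

Two of the three statements are covered, so recall is about 0.67; the missing statement points at a fact the retriever never surfaced.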
Quick Start
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_precision,
context_recall
)
from datasets import Dataset
# Sample data
data = {
    "question": [
        "What is HelpMeTest's pricing for the Pro plan?",
        "Does HelpMeTest support visual testing?",
        "What monitoring intervals does HelpMeTest offer?"
    ],
    "answer": [
        "HelpMeTest Pro costs $100 per month and includes unlimited tests with parallel execution.",
        "Yes, HelpMeTest supports visual testing with AI-powered flaw detection across mobile, tablet, and desktop viewports.",
        "HelpMeTest monitors every 5 minutes on the free plan and every 10 seconds on Enterprise."
    ],
    "contexts": [
        ["HelpMeTest Pro plan: $100/month, unlimited tests, parallel execution, 3-month data retention."],
        ["HelpMeTest visual testing: multi-viewport (mobile, tablet, desktop), AI flaw detection, baseline comparison."],
        ["HelpMeTest health monitoring: 5-minute intervals on free, 10-second intervals on Enterprise."]
    ],
    "ground_truth": [
        "The Pro plan costs $100/month.",
        "HelpMeTest supports visual testing with AI-powered detection.",
        "5-minute intervals on free, 10-second on Enterprise."
    ]
}
dataset = Dataset.from_dict(data)
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)
print(result)
Output:
{'faithfulness': 0.97, 'answer_relevancy': 0.91, 'context_precision': 0.88, 'context_recall': 0.94}
Evaluating Your Actual RAG System
Wire Ragas into your real pipeline:
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset
# Your actual RAG components
from myapp.retriever import retrieve_documents
from myapp.generator import generate_answer
TEST_QUESTIONS = [
    {
        "question": "What testing frameworks does HelpMeTest use?",
        "ground_truth": "HelpMeTest uses Robot Framework with Playwright for browser automation."
    },
    {
        "question": "Can I self-host HelpMeTest?",
        "ground_truth": "No, HelpMeTest is a cloud-hosted SaaS product. Self-hosting is not available."
    },
    {
        "question": "What is included in HelpMeTest's free plan?",
        "ground_truth": "The free plan includes up to 10 tests, unlimited health checks, and 5-minute monitoring intervals."
    },
]
def build_eval_dataset(questions: list[dict]) -> Dataset:
    rows = []
    for item in questions:
        q = item["question"]
        # Run the real retriever and generator so the scores reflect production behavior
        docs = retrieve_documents(q)
        # Assumes LangChain-style documents that expose a page_content attribute
        contexts = [doc.page_content for doc in docs]
        answer = generate_answer(q, docs)
        rows.append({
            "question": q,
            "answer": answer,
            "contexts": contexts,
            "ground_truth": item["ground_truth"]
        })
    return Dataset.from_list(rows)
dataset = build_eval_dataset(TEST_QUESTIONS)
result = evaluate(dataset, metrics=[
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
])
# Print per-row breakdown
df = result.to_pandas()
print(df[["question", "faithfulness", "answer_relevancy", "context_precision", "context_recall"]])
Interpreting Your Scores
Diagnosing Failures by Metric Pattern
| Pattern | Likely cause | Fix |
|---|---|---|
| Low faithfulness, high context precision | Generator ignores context | Stronger system prompt, lower temperature |
| Low context precision | Retriever returns noise | Better embedding model, re-ranking |
| Low context recall | Missing info in vector DB | Expand knowledge base, tune chunking |
| Low answer relevancy | Answer is vague or off-topic | Prompt engineering, better generation model |
| All metrics low | Fundamental pipeline issue | Audit end-to-end with examples |
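The table can be turned into a small triage helper. This is a sketch under stated assumptions: `diagnose` is a hypothetical function, and the 0.6 cutoff for "low" is an illustrative choice rather than anything Ragas prescribes.

```python
METRICS = ("faithfulness", "answer_relevancy", "context_precision", "context_recall")


def diagnose(scores: dict[str, float], low: float = 0.6) -> list[str]:
    """Map low-scoring metrics to the likely causes from the table."""
    low_metrics = {m for m in METRICS if scores[m] < low}
    if low_metrics == set(METRICS):
        return ["Fundamental pipeline issue"]
    findings = []
    # Low faithfulness only points at the generator when retrieval precision is fine.
    if "faithfulness" in low_metrics and "context_precision" not in low_metrics:
        findings.append("Generator ignores context")
    if "context_precision" in low_metrics:
        findings.append("Retriever returns noise")
    if "context_recall" in low_metrics:
        findings.append("Missing info in vector DB")
    if "answer_relevancy" in low_metrics:
        findings.append("Answer is vague or off-topic")
    return findings


scores = {
    "faithfulness": 0.45,
    "answer_relevancy": 0.82,
    "context_precision": 0.90,
    "context_recall": 0.88,
}
print(diagnose(scores))
```

Running it per row of the results DataFrame, rather than on the aggregate, helps separate a few bad questions from a systematic problem.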
Score Benchmarks
| Score | Quality |
|---|---|
| 0.9+ | Excellent |
| 0.75–0.9 | Good, production-viable |
| 0.6–0.75 | Needs improvement |
| < 0.6 | Significant quality problems |
These are rough guides. Your domain may require different thresholds — a medical or legal RAG system should target 0.95+ faithfulness.
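For reports and dashboards, the bands can be applied mechanically. The sketch below uses a hypothetical `quality_band` helper and assumes that a score sitting exactly on a boundary (0.9, 0.75, 0.6) belongs to the higher band, which the table leaves open.

```python
def quality_band(score: float) -> str:
    """Label a metric score with the rough quality bands from the table."""
    if score >= 0.9:
        return "Excellent"
    if score >= 0.75:
        return "Good, production-viable"
    if score >= 0.6:
        return "Needs improvement"
    return "Significant quality problems"


for metric, score in {"faithfulness": 0.97, "context_precision": 0.88}.items():
    print(f"{metric}: {score:.2f} ({quality_band(score)})")
```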
Context Precision at Rank k
Standard context precision treats all retrieved chunks equally. Context Precision@K weights earlier-ranked chunks more heavily, since most generators pay more attention to the first chunks in the context window:
from ragas.metrics import context_precision
# context_precision computes a weighted score by default
# Provide contexts in retrieval order (most relevant first according to your retriever)
data = {
    "question": ["What is HelpMeTest?"],
    "answer": ["HelpMeTest is a cloud SaaS testing platform."],
    "contexts": [[
        "HelpMeTest is a cloud-hosted SaaS for automated testing.",  # rank 1
        "HelpMeTest supports Robot Framework and Playwright.",  # rank 2
        "The company was founded to make testing accessible.",  # rank 3 (less relevant)
    ]],
    "ground_truth": ["HelpMeTest is a cloud SaaS testing platform."]
}
If you're seeing low precision, check whether your vector database is returning chunks in relevance order.
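To see why order matters, here is a minimal sketch of a rank-weighted precision: Precision@k is computed at each rank that holds a relevant chunk, then averaged over the relevant chunks. This follows the Context Precision@K formula as described in the Ragas documentation, but the 0/1 relevance labels are hand-assigned here (Ragas obtains them from the judge LLM), and `context_precision_at_k` is a hypothetical helper.

```python
def context_precision_at_k(relevance: list[int]) -> float:
    """Mean of Precision@k over the ranks that hold a relevant chunk.

    `relevance` lists 1 (relevant) or 0 (not relevant) in retrieval order.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    weighted_sum = 0.0
    hits = 0
    for k, is_relevant in enumerate(relevance, start=1):
        hits += is_relevant
        # Precision@k only contributes at ranks where the chunk is relevant.
        weighted_sum += (hits / k) * is_relevant
    return weighted_sum / total_relevant


print(context_precision_at_k([1, 1, 0]))  # relevant chunks ranked first
print(context_precision_at_k([0, 1, 1]))  # same chunks, ranked last
```

The same two relevant chunks score 1.0 when ranked first and about 0.58 when ranked last, so a retriever that returns the right chunks in the wrong order is still penalized.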
Custom Metrics with Ragas
Beyond the four core metrics, define domain-specific criteria:
from ragas import evaluate
from ragas.metrics import faithfulness
# Ragas' aspect critique scores answers against a custom, natural-language criterion
from ragas.metrics import AspectCritique
# Example: measure whether the answer maintains a professional tone
professional_tone = AspectCritique(
    name="professional_tone",
    definition="The answer is written in a professional, business-appropriate tone without casual language or slang."
)
result = evaluate(dataset, metrics=[faithfulness, professional_tone])
Async Evaluation for Large Datasets
For large datasets, running evaluations concurrently cuts wall-clock time:
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from ragas.run_config import RunConfig
# Configure parallelism
run_config = RunConfig(
    max_workers=16,  # concurrent evaluations
    max_retries=3,   # retry on rate limits
    timeout=60       # seconds per evaluation
)
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    run_config=run_config
)
For very large datasets (10k+ rows), consider batching:
import pandas as pd
def evaluate_in_batches(dataset, metrics, batch_size=100):
    results = []
    for i in range(0, len(dataset), batch_size):
        # Select the next slice of rows; the last batch may be shorter
        batch = dataset.select(range(i, min(i + batch_size, len(dataset))))
        result = evaluate(batch, metrics=metrics)
        results.append(result.to_pandas())
    # Combine per-batch rows into a single DataFrame
    return pd.concat(results, ignore_index=True)
Using a Local Judge Model
Cut costs and keep data on-premise by using a local LLM as judge:
from langchain_community.chat_models import ChatOllama
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_community.embeddings import OllamaEmbeddings
# Local LLM (requires Ollama running locally)
local_llm = LangchainLLMWrapper(ChatOllama(model="llama3.1:70b"))
local_embeddings = LangchainEmbeddingsWrapper(OllamaEmbeddings(model="nomic-embed-text"))
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy],
    llm=local_llm,
    embeddings=local_embeddings
)
Local judge quality is typically lower than GPT-4, so use a local model during development and GPT-4 for pre-production checks.
Ragas in CI
Fail your pipeline when RAG quality drops:
# scripts/eval_rag.py
import sys
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset
THRESHOLDS = {
    "faithfulness": 0.80,
    "answer_relevancy": 0.75,
    "context_precision": 0.70,
    "context_recall": 0.70,
}
# Build dataset from your test fixtures.
# build_eval_dataset and TEST_QUESTIONS are the helpers defined in
# "Evaluating Your Actual RAG System"; import them from wherever they live in your project.
dataset = build_eval_dataset(TEST_QUESTIONS)
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
# Collect every metric that falls below its threshold
failed = []
for metric, threshold in THRESHOLDS.items():
    score = result[metric]
    if score < threshold:
        failed.append(f"  {metric}: {score:.2f} < {threshold:.2f}")
if failed:
    print("RAG quality check FAILED:")
    print("\n".join(failed))
    sys.exit(1)  # non-zero exit code fails the CI job
else:
    print("RAG quality check PASSED")
    for metric, score in result.items():
        print(f"  {metric}: {score:.2f}")
GitHub Actions step:
- name: RAG quality gate
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  run: python scripts/eval_rag.py
Tracking Scores Over Time
Single-run scores are useful. Trends are essential. Track Ragas scores across builds:
import datetime
import json
import os
from pathlib import Path
def save_eval_results(result, run_id: str):
    record = {
        # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "run_id": run_id,
        "scores": {k: float(v) for k, v in result.items()}
    }
    # Append one JSON record per run so the file doubles as a time series
    history_file = Path("eval-history.jsonl")
    with history_file.open("a") as f:
        f.write(json.dumps(record) + "\n")
# In CI: pass git SHA as run_id
run_id = os.environ.get("GITHUB_SHA", "local")
save_eval_results(result, run_id)
Plot this over time to catch gradual quality drift, a common issue when the underlying model is updated or your knowledge base grows.
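Beyond plotting, a simple automated check can flag drift. The sketch below reads records in the `eval-history.jsonl` format written above and compares the latest score to the mean of the preceding runs. `load_history` and `detect_drift` are hypothetical helpers, and the window of 3 runs and the 0.05 tolerance are illustrative defaults to tune for your own score variance.

```python
import json
from pathlib import Path


def load_history(path: Path) -> list[dict]:
    """Read one JSON record per non-empty line, oldest first."""
    with path.open() as f:
        return [json.loads(line) for line in f if line.strip()]


def detect_drift(
    records: list[dict],
    metric: str,
    window: int = 3,
    max_drop: float = 0.05,
) -> bool:
    """True if the latest score is more than max_drop below the mean of the previous `window` runs."""
    values = [r["scores"][metric] for r in records]
    if len(values) < window + 1:
        return False  # not enough history to compare against
    baseline = sum(values[-(window + 1):-1]) / window
    return baseline - values[-1] > max_drop


# In-memory example; in CI you would pass load_history(Path("eval-history.jsonl")).
history = [{"scores": {"faithfulness": v}} for v in (0.90, 0.91, 0.89, 0.80)]
print(detect_drift(history, "faithfulness"))
```

Comparing against a rolling baseline rather than a fixed threshold catches slow declines that never cross the CI gate in a single run.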
Common Pitfalls
1. Evaluating without ground truth
Context recall requires ground_truth. Without it, you lose visibility into retrieval completeness. Invest in curating 20-50 ground-truth QA pairs.
2. Chunking misalignment
If your chunks don't contain complete thoughts, faithfulness scores suffer even when the generator is correct. Aim for semantic chunking (full paragraphs) over fixed-token chunking.
3. Ignoring rank order in context
Pass retrieved documents in relevance rank order. Context precision at rank k weights position, so order matters.
4. Using low-quality judge models
Ragas evaluation is only as good as the judge. GPT-3.5 often gives noisy faithfulness scores. Stick with GPT-4 or better for final quality gates.
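For the chunking pitfall, a paragraph-preserving splitter is one simple alternative to fixed-token chunking. The sketch below is a hypothetical helper, not part of Ragas. It assumes paragraphs are separated by blank lines, and it packs whole paragraphs into chunks of up to `max_chars` characters without ever cutting a paragraph in half.

```python
def chunk_by_paragraph(text: str, max_chars: int = 500) -> list[str]:
    """Group whole paragraphs into chunks of at most max_chars where possible."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


doc = "A" * 30 + "\n\n" + "B" * 30 + "\n\n" + "C" * 30
for chunk in chunk_by_paragraph(doc, max_chars=70):
    print(len(chunk))
```

A single paragraph longer than `max_chars` is kept whole as an oversized chunk, which trades a strict size limit for complete thoughts.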
Ragas vs DeepEval
| | Ragas | DeepEval |
|---|---|---|
| Primary focus | RAG pipeline metrics | General LLM unit testing |
| Integration | HuggingFace Datasets | pytest |
| Custom metrics | AspectCritique | G-Eval (very flexible) |
| CI fit | Script-based eval | Native pytest support |
| Best for | RAG quality tracking | Feature-level testing |
For comprehensive coverage: use Ragas for pipeline-level evaluation and DeepEval for unit-level assertions. They complement each other.
Next Steps
- Curate your golden dataset — 30+ representative questions with ground-truth answers
- Set up nightly CI runs — catch model drift before users do
- Explore Promptfoo for prompt regression testing across model versions
- Add LangSmith for production tracing alongside offline Ragas evaluation
For teams that need monitoring beyond eval scripts — scheduled RAG quality checks with alerting — HelpMeTest runs your Ragas test suites on a schedule and notifies you when scores drop.