RAG Pipeline Evaluation with RAGAS: Faithfulness, Relevancy Metrics
Retrieval-Augmented Generation (RAG) pipelines are notoriously hard to evaluate. A model can produce fluent, confident-sounding answers that are completely unsupported by the retrieved documents. RAGAS gives you a structured framework to measure exactly that — and to wire it into your CI pipeline so regressions surface before they reach production.
What RAGAS Measures
RAGAS evaluates RAG pipelines across four core dimensions:
- Faithfulness — Does the answer contain only claims supported by the retrieved context?
- Answer Relevancy — How well does the answer address the actual question?
- Context Precision — Are the retrieved chunks actually relevant to the question?
- Context Recall — Does the retrieved context contain the information needed to answer?
Each metric is scored from 0 to 1. Context recall requires a ground-truth answer, and in recent RAGAS releases context precision does too. Faithfulness and answer relevancy need only the question, the answer, and the retrieved contexts, which makes them useful for production monitoring where you don't have labeled data.
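To make the 0 to 1 scale concrete, here is a minimal sketch of how a faithfulness-style score can be derived: the answer is split into claims, each claim is judged as supported or unsupported by the retrieved context, and the score is the supported fraction. In RAGAS an LLM performs the claim extraction and judging; the `verdicts` list below is a hand-written stand-in for that step, and `faithfulness_ratio` is a hypothetical helper, not part of the library.

```python
def faithfulness_ratio(verdicts: list[bool]) -> float:
    """Fraction of answer claims judged as supported by the context."""
    if not verdicts:
        # No claims to check; returning 0.0 is an assumption of this sketch
        return 0.0
    return sum(verdicts) / len(verdicts)


# Hypothetical verdicts for an answer with four claims, one unsupported
verdicts = [True, True, True, False]
score = faithfulness_ratio(verdicts)
print(f"faithfulness = {score:.2f}")  # prints "faithfulness = 0.75"
```

A single unsupported claim in a short answer moves the score a lot, which is one reason to look at individual samples as well as the average.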
Installation
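pip install ragas langchain langchain-openai langchain-community datasets faiss-cpu

RAGAS uses an LLM internally to compute its LLM-based metrics, including faithfulness and answer relevancy. By default it calls an OpenAI model, so OPENAI_API_KEY must be set; which model is the default depends on the RAGAS version. You can swap in any LangChain-compatible model.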
Basic Evaluation
Start with the core data structure RAGAS expects:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# Each sample: question, answer, contexts (list of retrieved chunks), ground_truth
data = {
    "question": [
        "What is the return policy?",
        "How do I cancel my subscription?",
    ],
    "answer": [
        "You can return items within 30 days for a full refund.",
        "Go to Account Settings and click Cancel Subscription.",
    ],
    "contexts": [
        ["Our return policy allows returns within 30 days of purchase for a full refund if items are unused."],
        ["To cancel, navigate to Settings > Billing > Cancel Subscription."],
    ],
    "ground_truth": [
        "Items can be returned within 30 days for a full refund.",
        "Cancel via Settings > Billing > Cancel Subscription.",
    ],
}

dataset = Dataset.from_dict(data)
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88, 'context_precision': 0.95, 'context_recall': 0.90}

Evaluating a Real RAG Pipeline
Hook RAGAS directly into your LangChain retriever and chain:
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
# This import path comes from the RAGAS 0.0.x series; later releases moved the
# LangChain integration, so check the docs for your installed version.
from ragas.langchain.evalchain import RagasEvaluatorChain
from ragas.metrics import faithfulness, answer_relevancy

# Build your RAG chain (abbreviated)
embeddings = OpenAIEmbeddings()
# Recent langchain-community versions also require
# allow_dangerous_deserialization=True when loading a pickled index.
vectorstore = FAISS.load_local("./vectorstore", embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
llm = ChatOpenAI(model="gpt-4o-mini")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,  # the evaluator needs the retrieved chunks
)

# Wrap with RAGAS evaluator
evaluator = RagasEvaluatorChain(metric=faithfulness)

test_questions = [
    "What payment methods do you accept?",
    "What is the SLA for enterprise support?",
]

for question in test_questions:
    # invoke() returns a dict with "query", "result", and "source_documents"
    result = qa_chain.invoke({"query": question})
    score = evaluator(
        {
            "query": question,
            "result": result["result"],
            "source_documents": result["source_documents"],
        }
    )
    print(f"Q: {question}")
    print(f"Faithfulness: {score['faithfulness_score']:.2f}")

Generating Synthetic Test Datasets
Collecting real question/answer pairs is slow. RAGAS can generate a test dataset from your documents:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import DirectoryLoader

# Load your knowledge base
loader = DirectoryLoader("./docs", glob="**/*.md")
documents = loader.load()

# Configure the generator
generator_llm = ChatOpenAI(model="gpt-4o")
critic_llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings,
)

# Generate 50 test cases with a mix of question types
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=50,
    distributions={
        simple: 0.5,         # Direct lookup questions
        reasoning: 0.3,      # Multi-hop reasoning
        multi_context: 0.2,  # Requires multiple chunks
    },
)
testset.to_pandas().to_csv("./testset.csv", index=False)

The generated testset includes questions, ground-truth answers, and the context chunks they were generated from. It does not include your pipeline's own answers, so run each question through your RAG chain before evaluating.
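Before a generated test set can be scored, each question needs an answer and retrieved contexts from the pipeline under test. A minimal sketch of that step, assuming the test set has been loaded as a list of dicts with `question` and `ground_truth` keys, and assuming a hypothetical `rag_fn` callable that wraps your chain and returns an answer plus its retrieved chunks (`build_eval_rows` and `fake_rag` are illustrative names, not RAGAS APIs):

```python
from typing import Callable

# Hypothetical signature: question -> (answer, retrieved context chunks)
RagFn = Callable[[str], tuple[str, list[str]]]


def build_eval_rows(testset_rows: list[dict], rag_fn: RagFn) -> dict[str, list]:
    """Run every test question through the pipeline and collect columns."""
    columns = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
    for row in testset_rows:
        answer, contexts = rag_fn(row["question"])
        columns["question"].append(row["question"])
        columns["answer"].append(answer)
        columns["contexts"].append(contexts)
        columns["ground_truth"].append(row["ground_truth"])
    return columns


# Stand-in pipeline for illustration only; replace with a call to your chain
def fake_rag(question: str) -> tuple[str, list[str]]:
    return f"stub answer to: {question}", ["stub context chunk"]


rows = [
    {
        "question": "What is the return policy?",
        "ground_truth": "Items can be returned within 30 days.",
    }
]
eval_columns = build_eval_rows(rows, fake_rag)
```

The resulting dict has the same four keys as the `data` dict in the basic evaluation example, so it can be handed to `Dataset.from_dict` in the same way.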
Setting Quality Thresholds
Define minimum acceptable scores and fail your pipeline if they drop:
# evaluate_rag.py
import ast
import sys

import pandas as pd
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

THRESHOLDS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_precision": 0.75,
}


def run_evaluation(dataset: Dataset) -> bool:
    """Score the dataset and report whether every metric meets its threshold."""
    result = evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision],
    )
    passed = True
    for metric, threshold in THRESHOLDS.items():
        score = result[metric]
        status = "PASS" if score >= threshold else "FAIL"
        print(f"{status} {metric}: {score:.3f} (threshold: {threshold})")
        if score < threshold:
            passed = False
    return passed


if __name__ == "__main__":
    # Assumes the CSV already has an "answer" column holding your pipeline's
    # output next to question, contexts, and ground_truth.
    df = pd.read_csv("./testset.csv")
    # CSV stores each contexts list as text; this assumes it was written as a
    # Python list literal and parses it back into a list of strings.
    df["contexts"] = df["contexts"].apply(ast.literal_eval)
    dataset = Dataset.from_pandas(df)
    success = run_evaluation(dataset)
    sys.exit(0 if success else 1)

CI Integration
Add RAG evaluation to your GitHub Actions pipeline:
# .github/workflows/rag-eval.yml
name: RAG Pipeline Evaluation

on:
  pull_request:
    paths:
      - "docs/**"
      - "src/rag/**"

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: pip install ragas langchain openai datasets

      - name: Run RAG evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python evaluate_rag.py

      - name: Upload evaluation report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: ragas-report
          path: ragas_report.json

Tracking Scores Over Time
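Evaluation scores are only useful if you track them across commits. Write each run's results to a file so regressions can be surfaced. Calling the function below from evaluate_rag.py also produces the ragas_report.json that the workflow's upload step expects: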
import json
import os
from datetime import datetime, timezone


def save_results(result: dict, output_path: str = "ragas_report.json") -> dict:
    """Write metric scores plus commit metadata to a JSON report."""
    report = {
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # GITHUB_SHA is set by GitHub Actions; falls back to "local" elsewhere
        "commit": os.environ.get("GITHUB_SHA", "local"),
        "scores": {k: float(v) for k, v in result.items()},
    }
    with open(output_path, "w") as f:
        json.dump(report, f, indent=2)
    return report

Store these reports as CI artifacts or push them to a time-series store. When faithfulness drops from 0.91 to 0.78 after a retriever change, you want to catch that in the PR, not in a user complaint.
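One way to catch such a drop is to compare the current report against a stored baseline. The sketch below assumes both reports use the `scores` layout written by `save_results`; the `find_regressions` helper, the `tolerance` default, and the commit values are illustrative assumptions.

```python
def find_regressions(baseline: dict, current: dict, tolerance: float = 0.02) -> dict:
    """Return metrics whose score dropped by more than `tolerance`."""
    regressions = {}
    for metric, old_score in baseline["scores"].items():
        new_score = current["scores"].get(metric)
        if new_score is None:
            continue  # metric was not measured in the current run
        if old_score - new_score > tolerance:
            regressions[metric] = {"baseline": old_score, "current": new_score}
    return regressions


# Hypothetical reports for two commits
baseline = {"commit": "abc123", "scores": {"faithfulness": 0.91, "answer_relevancy": 0.88}}
current = {"commit": "def456", "scores": {"faithfulness": 0.78, "answer_relevancy": 0.88}}

for metric, scores in find_regressions(baseline, current).items():
    print(f"REGRESSION {metric}: {scores['baseline']:.2f} -> {scores['current']:.2f}")
```

A small tolerance keeps run-to-run noise from LLM-based scoring from failing every build; the right value depends on how stable your scores are.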
Common Failure Patterns
Low faithfulness usually means the LLM is hallucinating beyond the retrieved context. Check whether your prompt template explicitly instructs the model to answer only from the provided documents.
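As an illustration of such an instruction, here is a minimal sketch of a prompt builder that restricts the model to the retrieved chunks. The wording of the instruction and the `build_grounded_prompt` helper are assumptions for this example, not a template shipped with RAGAS or LangChain.

```python
def build_grounded_prompt(question: str, contexts: list[str]) -> str:
    """Assemble a prompt that restricts the model to the retrieved chunks."""
    # Number the chunks so the model (and a reviewer) can refer to them
    numbered = "\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(contexts, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_grounded_prompt(
    "What is the return policy?",
    ["Our return policy allows returns within 30 days of purchase."],
)
print(prompt)
```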
Low context precision means your retriever is pulling irrelevant chunks. Tune chunk size and overlap, or add a reranker on top of the similarity search so that only the most relevant chunks reach the prompt.
Low answer relevancy often points to prompt issues — the model answers a related but different question than what was asked. Review your system prompt and few-shot examples.
RAGAS gives you the signal; fixing the underlying problem still requires understanding where in the pipeline the failure originates. Use the per-sample scores (not just averages) to identify which question types consistently fail.
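A minimal sketch of that per-sample analysis, assuming the scores have been exported as one dict per sample and that each sample carries a `question_type` label (a hypothetical field name; use whatever your test set provides):

```python
from collections import defaultdict
from statistics import mean


def failing_groups(samples: list[dict], metric: str, threshold: float) -> dict[str, float]:
    """Average `metric` per question type and keep groups below `threshold`."""
    by_type = defaultdict(list)
    for sample in samples:
        by_type[sample["question_type"]].append(sample[metric])
    averages = {qtype: mean(scores) for qtype, scores in by_type.items()}
    return {qtype: avg for qtype, avg in averages.items() if avg < threshold}


# Hand-written per-sample scores for illustration only
samples = [
    {"question_type": "simple", "faithfulness": 0.95},
    {"question_type": "simple", "faithfulness": 0.90},
    {"question_type": "multi_context", "faithfulness": 0.60},
    {"question_type": "multi_context", "faithfulness": 0.70},
]
print(failing_groups(samples, "faithfulness", threshold=0.85))
```

Here the overall average would pass a lower threshold, while the breakdown shows that only the multi-context questions are failing, which points at retrieval across chunks instead of the prompt.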