RAG Pipeline Evaluation with RAGAS: Faithfulness, Relevancy Metrics

Retrieval-Augmented Generation (RAG) pipelines are notoriously hard to evaluate. A model can produce fluent, confident-sounding answers that are completely unsupported by the retrieved documents. RAGAS gives you a structured framework to measure exactly that — and to wire it into your CI pipeline so regressions surface before they reach production.

What RAGAS Measures

RAGAS evaluates RAG pipelines across four core dimensions:

  • Faithfulness — Does the answer contain only claims supported by the retrieved context?
  • Answer Relevancy — How well does the answer address the actual question?
  • Context Precision — Are the retrieved chunks actually relevant to the question?
  • Context Recall — Does the retrieved context contain the information needed to answer?

Each metric is scored from 0 to 1. Context recall requires a ground-truth answer, and in recent RAGAS releases context precision does too. Faithfulness and answer relevancy do not, which makes them useful for production monitoring where you don't have labeled data.
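
To make the faithfulness score concrete: RAGAS describes it as the fraction of claims in the answer that the retrieved context supports. The sketch below hard-codes the claim verdicts that an LLM judge would normally produce. The function name `faithfulness_score` and the sample verdicts are illustrative assumptions, not part of the RAGAS API.

```python
def faithfulness_score(claim_verdicts: list[bool]) -> float:
    """Fraction of answer claims judged as supported by the context."""
    if not claim_verdicts:
        # Assumption: an answer with no extractable claims scores 0.
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)


# Hypothetical verdicts for an answer that makes three claims:
# two are backed by the retrieved chunks, one is not.
verdicts = [True, True, False]
print(f"faithfulness = {faithfulness_score(verdicts):.2f}")
```

A single unsupported claim in a short answer pulls the score down sharply, which is why per-sample inspection matters for terse responses.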

Installation

pip install ragas datasets langchain langchain-openai langchain-community faiss-cpu

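RAGAS uses an LLM internally as a judge for these metrics, and answer relevancy also relies on an embedding model. By default it calls OpenAI models (which one depends on the RAGAS version), so OPENAI_API_KEY must be set, but you can swap in any LangChain-compatible model.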

Basic Evaluation

Start with the core data structure RAGAS expects:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# Each sample: question, answer, contexts (list of retrieved chunks), ground_truth
data = {
    "question": [
        "What is the return policy?",
        "How do I cancel my subscription?",
    ],
    "answer": [
        "You can return items within 30 days for a full refund.",
        "Go to Account Settings and click Cancel Subscription.",
    ],
    "contexts": [
        ["Our return policy allows returns within 30 days of purchase for a full refund if items are unused."],
        ["To cancel, navigate to Settings > Billing > Cancel Subscription."],
    ],
    "ground_truth": [
        "Items can be returned within 30 days for a full refund.",
        "Cancel via Settings > Billing > Cancel Subscription.",
    ],
}

dataset = Dataset.from_dict(data)

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)

print(result)
# Example output (scores are illustrative and vary between runs):
# {'faithfulness': 0.92, 'answer_relevancy': 0.88, 'context_precision': 0.95, 'context_recall': 0.90}

Evaluating a Real RAG Pipeline

Hook RAGAS directly into your LangChain retriever and chain:

from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
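# Import path used by early RAGAS releases (0.0.x); later versions moved the LangChain integration.
from ragas.langchain.evalchain import RagasEvaluatorChain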
from ragas.metrics import faithfulness, answer_relevancy

# Build your RAG chain (abbreviated)
embeddings = OpenAIEmbeddings()
# Recent langchain-community releases also require allow_dangerous_deserialization=True here.
vectorstore = FAISS.load_local("./vectorstore", embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
llm = ChatOpenAI(model="gpt-4o-mini")

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
)

# Wrap with RAGAS evaluator
evaluator = RagasEvaluatorChain(metric=faithfulness)

test_questions = [
    "What payment methods do you accept?",
    "What is the SLA for enterprise support?",
]

for question in test_questions:
    result = qa_chain(question)
    score = evaluator(
        {
            "query": question,
            "result": result["result"],
            "source_documents": result["source_documents"],
        }
    )
    print(f"Q: {question}")
    print(f"Faithfulness: {score['faithfulness_score']:.2f}")

Generating Synthetic Test Datasets

Collecting real question/answer pairs is slow. RAGAS can generate a test dataset from your documents:

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import DirectoryLoader

# Load your knowledge base
loader = DirectoryLoader("./docs", glob="**/*.md")
documents = loader.load()

# Configure the generator
generator_llm = ChatOpenAI(model="gpt-4o")
critic_llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings,
)

# Generate 50 test cases with a mix of question types
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=50,
    distributions={
        simple: 0.5,        # Direct lookup questions
        reasoning: 0.3,     # Multi-hop reasoning
        multi_context: 0.2, # Requires multiple chunks
    },
)

testset.to_pandas().to_csv("./testset.csv", index=False)

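The generated test set includes questions, reference (ground-truth) answers, and the source context chunks. It does not include your pipeline's output: before evaluating, run each question through your own pipeline to collect its answer and retrieved contexts.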
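
One way to turn generated questions into rows that `evaluate` can consume is to run each question through the pipeline and record what comes back. This is a minimal sketch: the `pipeline` callable, its return shape, and `fake_pipeline` are hypothetical stand-ins for your own retriever and LLM call.

```python
from typing import Callable

# Hypothetical pipeline signature: question -> (answer, retrieved chunks).
Pipeline = Callable[[str], tuple[str, list[str]]]


def build_eval_rows(
    questions: list[str],
    ground_truths: list[str],
    pipeline: Pipeline,
) -> dict[str, list]:
    """Run every question through the pipeline and collect evaluation columns."""
    if len(questions) != len(ground_truths):
        raise ValueError("questions and ground_truths must have the same length")
    rows: dict[str, list] = {
        "question": [], "answer": [], "contexts": [], "ground_truth": [],
    }
    for question, truth in zip(questions, ground_truths):
        answer, contexts = pipeline(question)
        rows["question"].append(question)
        rows["answer"].append(answer)
        rows["contexts"].append(contexts)
        rows["ground_truth"].append(truth)
    return rows


def fake_pipeline(question: str) -> tuple[str, list[str]]:
    # Stand-in for a real retriever + LLM call.
    return f"stub answer to: {question}", ["stub chunk"]


rows = build_eval_rows(
    ["What is the return policy?"],
    ["Items can be returned within 30 days for a full refund."],
    fake_pipeline,
)
print(rows["answer"])
```

The resulting dictionary has the same column layout as the `data` dictionary in the basic evaluation example, so it can be passed to `Dataset.from_dict`.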

Setting Quality Thresholds

Define minimum acceptable scores and fail your pipeline if they drop:

# evaluate_rag.py
import sys
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

THRESHOLDS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_precision": 0.75,
}

def run_evaluation(dataset: Dataset) -> bool:
    result = evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision],
    )

    passed = True
    for metric, threshold in THRESHOLDS.items():
        score = result[metric]
        status = "PASS" if score >= threshold else "FAIL"
        print(f"{status} {metric}: {score:.3f} (threshold: {threshold})")
        if score < threshold:
            passed = False

    return passed

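if __name__ == "__main__":
    import ast
    import pandas as pd

    # Assumes the CSV already holds question, answer, contexts, and
    # ground_truth columns, i.e. the test set has been run through your pipeline.
    df = pd.read_csv("./testset.csv")
    # CSV stores each contexts list as a string; parse it back into a list.
    df["contexts"] = df["contexts"].apply(ast.literal_eval)
    dataset = Dataset.from_pandas(df)
    success = run_evaluation(dataset)
    sys.exit(0 if success else 1)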

CI Integration

Add RAG evaluation to your GitHub Actions pipeline:

# .github/workflows/rag-eval.yml
name: RAG Pipeline Evaluation

on:
  pull_request:
    paths:
      - "docs/**"
      - "src/rag/**"

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: pip install ragas datasets pandas langchain langchain-openai langchain-community

      - name: Run RAG evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python evaluate_rag.py

      - name: Upload evaluation report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: ragas-report
          path: ragas_report.json

Tracking Scores Over Time

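Evaluation scores are only useful if you track them across commits. The CI workflow above uploads ragas_report.json, so evaluate_rag.py needs to write that file. A helper like this, called after evaluate() returns, does the job: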

import json
import os
from datetime import datetime, timezone

def save_results(result: dict, output_path: str = "ragas_report.json") -> dict:
    # Record when, and for which commit, the scores were produced.
    report = {
        # datetime.utcnow() is deprecated; use a timezone-aware UTC timestamp.
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "commit": os.environ.get("GITHUB_SHA", "local"),
        "scores": {k: float(v) for k, v in result.items()},
    }
    with open(output_path, "w") as f:
        json.dump(report, f, indent=2)
    return report

Store these reports as CI artifacts or push them to a time-series store. When faithfulness drops from 0.91 to 0.78 after a retriever change, you want to catch that in the PR — not in a user complaint.
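
Catching a drop like that in the PR comes down to comparing the current report's scores with a stored baseline. The sketch below assumes both are plain metric-to-score dictionaries. The function name `find_regressions` and the `tolerance` default are illustrative choices, not RAGAS features.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return metrics whose score fell by more than `tolerance`, with the drop size."""
    regressions = {}
    for metric, old_score in baseline.items():
        new_score = current.get(metric)
        if new_score is None:
            continue  # metric was not measured in the current run
        drop = old_score - new_score
        if drop > tolerance:
            regressions[metric] = round(drop, 4)
    return regressions


baseline_scores = {"faithfulness": 0.91, "answer_relevancy": 0.88}
current_scores = {"faithfulness": 0.78, "answer_relevancy": 0.87}
print(find_regressions(baseline_scores, current_scores))
```

A small tolerance helps because LLM-judged scores fluctuate slightly between runs even when nothing has changed.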

Common Failure Patterns

Low faithfulness usually means the LLM is hallucinating beyond the retrieved context. Check whether your prompt template explicitly instructs the model to answer only from the provided documents.
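
As an illustration of that kind of instruction, here is a minimal sketch of a grounded prompt template. The wording and the names `GROUNDED_PROMPT` and `build_prompt` are assumptions to adapt to your own chain.

```python
GROUNDED_PROMPT = """Answer the question using only the context below.
If the context does not contain the answer, reply "I don't know."

Context:
{context}

Question: {question}
Answer:"""


def build_prompt(question: str, chunks: list[str]) -> str:
    """Join the retrieved chunks and fill in the grounded-answer template."""
    context = "\n\n".join(chunks)
    return GROUNDED_PROMPT.format(context=context, question=question)


print(build_prompt(
    "What is the return policy?",
    ["Our return policy allows returns within 30 days of purchase."],
))
```

Giving the model an explicit way to decline tends to reduce unsupported claims, though it can lower answer relevancy when retrieval misses.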

Low context precision means your retriever is pulling irrelevant chunks or ranking them above the relevant ones. Tune chunk size and overlap, or add a reranker on top of the cosine-similarity search.
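
Context precision is rank-aware: the same relevant chunks score higher when they appear near the top of the retrieved list. The sketch below uses an average-precision style calculation to illustrate the idea. The relevance flags are hand-written, and this approximates the concept rather than reproducing the library's implementation.

```python
def rank_aware_precision(relevant_flags: list[bool]) -> float:
    """Mean of precision@k, taken at each rank k that holds a relevant chunk."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision@k at this rank
    return precision_sum / hits if hits else 0.0


# Two relevant chunks out of three, in different positions.
print(rank_aware_precision([True, False, True]))  # relevant at ranks 1 and 3
print(rank_aware_precision([False, True, True]))  # same chunks, ranked lower
```

The second ordering scores lower even though the same chunks were retrieved, which is the gap a reranker is meant to close.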

Low answer relevancy often points to prompt issues — the model answers a related but different question than what was asked. Review your system prompt and few-shot examples.

RAGAS gives you the signal; fixing the underlying problem still requires understanding where in the pipeline the failure originates. Use the per-sample scores (not just averages) to identify which question types consistently fail.
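
A rough sketch of that per-sample analysis, using only the standard library: the records below are hypothetical, and in practice they would come from your evaluation output joined with a question-type label such as the one the test set generator assigns.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-sample scores with a question-type label.
samples = [
    {"question_type": "simple", "faithfulness": 0.95},
    {"question_type": "simple", "faithfulness": 0.90},
    {"question_type": "reasoning", "faithfulness": 0.60},
    {"question_type": "reasoning", "faithfulness": 0.70},
    {"question_type": "multi_context", "faithfulness": 0.85},
]


def mean_by_type(records: list[dict], metric: str) -> dict[str, float]:
    """Average one metric per question type."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for record in records:
        grouped[record["question_type"]].append(record[metric])
    return {qtype: mean(scores) for qtype, scores in grouped.items()}


def failing_types(records: list[dict], metric: str, threshold: float) -> list[str]:
    """Question types whose average score is below the threshold."""
    averages = mean_by_type(records, metric)
    return sorted(q for q, avg in averages.items() if avg < threshold)


print(failing_types(samples, "faithfulness", 0.85))
```

Here the overall average looks acceptable, but grouping shows that the reasoning questions are the ones dragging faithfulness down.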
