Ragas Guide: Evaluating RAG Pipelines with Faithfulness, Relevancy, and Precision

HelpMeTest

16 May 2026 — 7 min read

Ragas gives you a rigorous metric suite for RAG pipelines: faithfulness, answer relevancy, context precision, and context recall. Each metric isolates a different failure mode — bad retrieval vs. bad generation vs. incomplete recall. This guide shows you how to compute them, interpret scores, and integrate Ragas into CI.

The RAG Evaluation Problem

Retrieval-Augmented Generation has a deceptively complex failure space. Your chatbot can fail because:

Retrieval is bad — wrong chunks, irrelevant documents
Retrieval is incomplete — right topic, but missing key facts
Generation hallucinates — model ignores context, invents facts
Generation is off-topic — context was fine, but the answer doesn't address the question

A single "thumbs up / thumbs down" score can't tell you which layer broke. Ragas fixes this with four targeted metrics, each measuring a distinct failure mode.

Installation

pip install ragas

Ragas uses OpenAI as the default judge. Set your key:

export OPENAI_API_KEY=sk-...

The Four Core Metrics

1. Faithfulness

What it measures: Does the generated answer stick to the retrieved context?

A faithful answer only makes claims that are explicitly supported by the context. Faithfulness catches hallucinations where the model ignores retrieved evidence and generates from its training data.

Score: 0.0 (completely hallucinated) to 1.0 (every claim supported by context)
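
The arithmetic behind this score can be sketched as follows: the answer is broken into individual claims, each claim is checked against the context, and the score is the supported fraction. This is an illustration only. The claims and verdicts below are hand-labeled (in Ragas the judge LLM produces them), and `faithfulness_score` is a hypothetical helper, not part of the library.

```python
# Illustrative only: verdicts are hand-labeled here; in Ragas a judge LLM produces them.
def faithfulness_score(verdicts: list[bool]) -> float:
    """Fraction of answer claims that the retrieved context supports."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


# Claims extracted from a made-up answer, with a verdict for each one.
claims = {
    "The Pro plan costs $100 per month.": True,          # stated in the context
    "The Pro plan includes parallel execution.": True,   # stated in the context
    "The Pro plan includes a dedicated manager.": False,  # invented, not in the context
}

score = faithfulness_score(list(claims.values()))
print(f"faithfulness = {score:.2f}")  # 2 of 3 claims are supported
```

One unsupported claim out of three is enough to pull the score well below the thresholds used later in this guide.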

2. Answer Relevancy

What it measures: Does the answer actually address the user's question?

A high-relevancy answer is complete and directly on-topic. Low relevancy means the answer is vague, tangential, or answers a different question.

Score: 0.0 (unrelated) to 1.0 (perfectly addresses the question)
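
A rough sketch of how a score like this can be computed, assuming the approach the Ragas documentation describes: the judge generates questions that the answer could be responding to, and the score is the mean cosine similarity between their embeddings and the original question's embedding. The two-dimensional vectors below are toy values standing in for real embeddings, and both helper functions are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_relevancy_score(question_vec, generated_question_vecs) -> float:
    """Mean similarity between the real question and questions implied by the answer."""
    sims = [cosine(question_vec, vec) for vec in generated_question_vecs]
    return sum(sims) / len(sims) if sims else 0.0


# Toy embeddings (assumption): real ones come from an embedding model.
question_vec = [1.0, 0.0]
on_topic = [[1.0, 0.0], [0.8, 0.6]]   # questions close to the original
off_topic = [[0.0, 1.0], [0.6, 0.8]]  # questions about something else

high = answer_relevancy_score(question_vec, on_topic)
low = answer_relevancy_score(question_vec, off_topic)
print(f"on-topic: {high:.2f}, off-topic: {low:.2f}")
```

Because the score depends on embeddings, the choice of embedding model affects it as well as the choice of judge.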

3. Context Precision

What it measures: Are the retrieved chunks relevant to the question?

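As a first approximation, context precision = useful chunks / total retrieved chunks: if you retrieve 10 chunks and 3 are relevant, precision is roughly 0.30. Ragas also takes rank into account (relevant chunks near the top score higher), so the exact value depends on where those 3 chunks appear. Low precision means your retriever returns noise.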

Score: 0.0 (all retrieved chunks are irrelevant) to 1.0 (all retrieved chunks are relevant)

4. Context Recall

What it measures: Did the retrieval capture all the information needed to answer the question?

Recall requires a ground-truth answer. It checks whether the retrieved chunks contain all the facts in the expected answer. Low recall means your knowledge base is missing critical information or your retriever can't find it.

Score: 0.0 (nothing relevant was retrieved) to 1.0 (all needed information was retrieved)
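
To make the idea concrete, here is a deliberately crude sketch: it lists the facts the ground-truth answer needs and checks whether each one appears in any retrieved chunk. The substring match is a stand-in for the judge LLM that Ragas uses for attribution, and `context_recall_score` is a hypothetical helper, not a library function.

```python
def context_recall_score(required_facts: list[str], contexts: list[str]) -> float:
    """Fraction of required facts found in at least one retrieved chunk.

    Uses a case-insensitive substring match as a crude stand-in for an LLM judge.
    """
    if not required_facts:
        return 0.0
    joined = " ".join(contexts).lower()
    found = [fact.lower() in joined for fact in required_facts]
    return sum(found) / len(found)


required = ["5-minute intervals on free", "10-second intervals on Enterprise"]

complete = context_recall_score(
    required,
    ["HelpMeTest health monitoring: 5-minute intervals on free, 10-second intervals on Enterprise."],
)
partial = context_recall_score(
    required,
    ["HelpMeTest health monitoring: 5-minute intervals on free."],
)
print(f"complete retrieval: {complete:.2f}, partial retrieval: {partial:.2f}")
```

The second call models a retriever that found the right document but only part of the needed information, which is the "retrieval is incomplete" failure mode.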

Quick Start

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)
from datasets import Dataset

# Sample data
data = {
    "question": [
        "What is HelpMeTest's pricing for the Pro plan?",
        "Does HelpMeTest support visual testing?",
        "What monitoring intervals does HelpMeTest offer?"
    ],
    "answer": [
        "HelpMeTest Pro costs $100 per month and includes unlimited tests with parallel execution.",
        "Yes, HelpMeTest supports visual testing with AI-powered flaw detection across mobile, tablet, and desktop viewports.",
        "HelpMeTest monitors every 5 minutes on the free plan and every 10 seconds on Enterprise."
    ],
    "contexts": [
        ["HelpMeTest Pro plan: $100/month, unlimited tests, parallel execution, 3-month data retention."],
        ["HelpMeTest visual testing: multi-viewport (mobile, tablet, desktop), AI flaw detection, baseline comparison."],
        ["HelpMeTest health monitoring: 5-minute intervals on free, 10-second intervals on Enterprise."]
    ],
    "ground_truth": [
        "The Pro plan costs $100/month.",
        "HelpMeTest supports visual testing with AI-powered detection.",
        "5-minute intervals on free, 10-second on Enterprise."
    ]
}

dataset = Dataset.from_dict(data)

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)

print(result)

Example output (exact scores vary with the judge model and Ragas version):

{'faithfulness': 0.97, 'answer_relevancy': 0.91, 'context_precision': 0.88, 'context_recall': 0.94}

Evaluating Your Actual RAG System

Wire Ragas into your real pipeline:

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset

# Your actual RAG components
from myapp.retriever import retrieve_documents
from myapp.generator import generate_answer

TEST_QUESTIONS = [
    {
        "question": "What testing frameworks does HelpMeTest use?",
        "ground_truth": "HelpMeTest uses Robot Framework with Playwright for browser automation."
    },
    {
        "question": "Can I self-host HelpMeTest?",
        "ground_truth": "No, HelpMeTest is a cloud-hosted SaaS product. Self-hosting is not available."
    },
    {
        "question": "What is included in HelpMeTest's free plan?",
        "ground_truth": "The free plan includes up to 10 tests, unlimited health checks, and 5-minute monitoring intervals."
    },
]

def build_eval_dataset(questions: list[dict]) -> Dataset:
    rows = []
    for item in questions:
        q = item["question"]
        docs = retrieve_documents(q)
        contexts = [doc.page_content for doc in docs]
        answer = generate_answer(q, docs)
        
        rows.append({
            "question": q,
            "answer": answer,
            "contexts": contexts,
            "ground_truth": item["ground_truth"]
        })
    
    return Dataset.from_list(rows)


dataset = build_eval_dataset(TEST_QUESTIONS)
result = evaluate(dataset, metrics=[
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
])

# Print per-row breakdown
df = result.to_pandas()
print(df[["question", "faithfulness", "answer_relevancy", "context_precision", "context_recall"]])
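
Once you have a per-row breakdown, the lowest-scoring rows are usually the most informative ones to read by hand. A minimal sketch, assuming the rows are plain dictionaries such as those produced by `df.to_dict("records")`; the scores below are hand-written placeholders and `worst_rows` is a hypothetical helper.

```python
def worst_rows(rows: list[dict], metric: str, n: int = 3) -> list[dict]:
    """Return the n rows with the lowest score for the given metric."""
    scored = [row for row in rows if row.get(metric) is not None]
    return sorted(scored, key=lambda row: row[metric])[:n]


# Hand-written rows standing in for result.to_pandas().to_dict("records").
rows = [
    {"question": "What testing frameworks does HelpMeTest use?", "faithfulness": 0.95},
    {"question": "Can I self-host HelpMeTest?", "faithfulness": 0.40},
    {"question": "What is included in HelpMeTest's free plan?", "faithfulness": 0.80},
]

for row in worst_rows(rows, "faithfulness", n=2):
    print(f'{row["faithfulness"]:.2f}  {row["question"]}')
```

Reading the answer and contexts for these rows side by side tends to show quickly whether the retriever or the generator is at fault.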

Interpreting Your Scores

Diagnosing Failures by Metric Pattern

Pattern	Likely cause	Fix
Low faithfulness, high context precision	Generator ignores context	Stronger system prompt, lower temperature
Low context precision	Retriever returns noise	Better embedding model, re-ranking
Low context recall	Missing info in vector DB	Expand knowledge base, tune chunking
Low answer relevancy	Answer is vague or off-topic	Prompt engineering, better generation model
All metrics low	Fundamental pipeline issue	Audit end-to-end with examples
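
The table above can be encoded as a small helper so that an evaluation run prints a suggested starting point. This is a sketch under assumptions: the single cutoff of 0.7 for "low" is an arbitrary choice, and `diagnose` is a hypothetical function, not part of Ragas.

```python
METRICS = ("faithfulness", "answer_relevancy", "context_precision", "context_recall")


def diagnose(scores: dict[str, float], low: float = 0.7) -> list[str]:
    """Map a set of aggregate scores to the likely causes listed in the table."""
    low_metrics = {name for name in METRICS if scores[name] < low}

    if len(low_metrics) == len(METRICS):
        return ["Fundamental pipeline issue: audit end-to-end with examples"]

    findings = []
    if "faithfulness" in low_metrics and "context_precision" not in low_metrics:
        findings.append("Generator ignores context: stronger system prompt, lower temperature")
    if "context_precision" in low_metrics:
        findings.append("Retriever returns noise: better embedding model, re-ranking")
    if "context_recall" in low_metrics:
        findings.append("Missing info in vector DB: expand knowledge base, tune chunking")
    if "answer_relevancy" in low_metrics:
        findings.append("Answer is vague or off-topic: prompt engineering, better generation model")
    return findings


example = {
    "faithfulness": 0.55,
    "answer_relevancy": 0.90,
    "context_precision": 0.88,
    "context_recall": 0.90,
}
for finding in diagnose(example):
    print(finding)
```

Treat the output as a hint for where to look first; several causes can apply at once.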

Score Benchmarks

Score	Quality
0.9+	Excellent
0.75–0.9	Good, production-viable
0.6–0.75	Needs improvement
< 0.6	Significant quality problems

These are rough guides. Your domain may require different thresholds — a medical or legal RAG system should target 0.95+ faithfulness.
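
If you want these bands in a report, a small lookup is enough. The sketch below assumes each boundary value belongs to the higher band (0.9 counts as "Excellent", 0.75 as "Good"), which the table leaves open; `quality_band` is a hypothetical helper, and the scores are the ones from the Quick Start output.

```python
def quality_band(score: float) -> str:
    """Translate a 0-1 metric score into the rough quality bands above."""
    if score >= 0.9:
        return "Excellent"
    if score >= 0.75:
        return "Good, production-viable"
    if score >= 0.6:
        return "Needs improvement"
    return "Significant quality problems"


scores = {
    "faithfulness": 0.97,
    "answer_relevancy": 0.91,
    "context_precision": 0.88,
    "context_recall": 0.94,
}
for metric, score in scores.items():
    print(f"{metric}: {score:.2f} ({quality_band(score)})")
```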

Context Precision at Rank k

A plain ratio of relevant to retrieved chunks treats every position equally. Ragas's context_precision is rank-aware: it weights earlier-ranked chunks more heavily, since most generators pay more attention to the first chunks in the context window:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision

# context_precision computes a rank-weighted score by default.
# Provide contexts in retrieval order (most relevant first according to your retriever).

data = {
    "question": ["What is HelpMeTest?"],
    "answer": ["HelpMeTest is a cloud SaaS testing platform."],
    "contexts": [[
        "HelpMeTest is a cloud-hosted SaaS for automated testing.",   # rank 1
        "HelpMeTest supports Robot Framework and Playwright.",          # rank 2
        "The company was founded to make testing accessible.",          # rank 3 (less relevant)
    ]],
    "ground_truth": ["HelpMeTest is a cloud SaaS testing platform."]
}

# Score only the retrieval ordering for this single example
result = evaluate(Dataset.from_dict(data), metrics=[context_precision])
print(result)

If you're seeing low precision, check whether your vector database is returning chunks in relevance order.
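
To see why order matters, here is a minimal sketch of a rank-weighted precision score, assuming the formula the Ragas documentation describes: precision@k is computed at every rank that holds a relevant chunk, and those values are averaged. The relevance flags are hand-labeled (Ragas gets them from the judge), and `context_precision_at_k` is a hypothetical helper.

```python
def context_precision_at_k(relevance: list[bool]) -> float:
    """Average of precision@k over the ranks that contain a relevant chunk.

    `relevance[i]` says whether the chunk at rank i + 1 is relevant.
    """
    hits = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / rank  # precision@rank at this position
    return weighted_sum / hits if hits else 0.0


# Same three chunks, two relevant, in two different retrieval orders.
relevant_first = context_precision_at_k([True, True, False])
relevant_last = context_precision_at_k([False, True, True])
print(f"relevant first: {relevant_first:.2f}, relevant last: {relevant_last:.2f}")
```

Both orderings contain the same chunks, yet the second scores lower because the irrelevant chunk sits at rank 1.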

Custom Metrics with Ragas

Beyond the four core metrics, define domain-specific criteria:

from ragas import evaluate
from ragas.metrics import faithfulness

# AspectCritique scores each answer against a natural-language criterion.
# Its name and import path have changed between Ragas releases (newer versions
# may expose it as AspectCritic), so check the docs for your installed version.
from ragas.metrics import AspectCritique

# Example: measure whether the answer maintains a professional tone
professional_tone = AspectCritique(
    name="professional_tone",
    definition="The answer is written in a professional, business-appropriate tone without casual language or slang."
)

result = evaluate(dataset, metrics=[faithfulness, professional_tone])

Async Evaluation for Large Datasets

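For large datasets, running judge calls concurrently cuts wall-clock time significantly. Ragas manages the concurrency itself; RunConfig controls how many evaluations run at once, how often they retry, and when they time out:

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from ragas.run_config import RunConfig

# Configure parallelism
run_config = RunConfig(
    max_workers=16,        # concurrent evaluations
    max_retries=3,         # retry on rate limits
    timeout=60             # seconds per evaluation
)

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    run_config=run_config
)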

For very large datasets (10k+ rows), consider batching:

import pandas as pd
from ragas import evaluate

def evaluate_in_batches(dataset, metrics, batch_size=100):
    """Evaluate the dataset in slices and return one combined per-row DataFrame."""
    results = []
    for i in range(0, len(dataset), batch_size):
        batch = dataset.select(range(i, min(i + batch_size, len(dataset))))
        result = evaluate(batch, metrics=metrics)
        results.append(result.to_pandas())
    return pd.concat(results, ignore_index=True)

Using a Local Judge Model

Cut costs and keep data on-premise by using a local LLM as judge:

from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings
from ragas import evaluate
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import faithfulness, answer_relevancy

# Local LLM (requires Ollama running locally)
local_llm = LangchainLLMWrapper(ChatOllama(model="llama3.1:70b"))
local_embeddings = LangchainEmbeddingsWrapper(OllamaEmbeddings(model="nomic-embed-text"))

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy],
    llm=local_llm,
    embeddings=local_embeddings
)

Local judges tend to be less reliable than GPT-4, so use them during development and switch to GPT-4 for pre-production checks.

Ragas in CI

Fail your pipeline when RAG quality drops:

# scripts/eval_rag.py
import sys
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset

THRESHOLDS = {
    "faithfulness": 0.80,
    "answer_relevancy": 0.75,
    "context_precision": 0.70,
    "context_recall": 0.70,
}

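# Build dataset from your test fixtures.
# build_eval_dataset and TEST_QUESTIONS are the helpers from "Evaluating Your Actual RAG System";
# import them from wherever you keep them in your project.
dataset = build_eval_dataset(TEST_QUESTIONS)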
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])

failed = []
for metric, threshold in THRESHOLDS.items():
    score = result[metric]
    if score < threshold:
        failed.append(f"  {metric}: {score:.2f} < {threshold:.2f}")

if failed:
    print("RAG quality check FAILED:")
    print("\n".join(failed))
    sys.exit(1)
else:
    print("RAG quality check PASSED")
    for metric, score in result.items():
        print(f"  {metric}: {score:.2f}")

GitHub Actions step:

- name: RAG quality gate
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  run: python scripts/eval_rag.py

Tracking Scores Over Time

Single-run scores are useful. Trends are essential. Track Ragas scores across builds:

import json
import datetime
from pathlib import Path

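def save_eval_results(result, run_id: str):
    record = {
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "run_id": run_id,
        "scores": {k: float(v) for k, v in result.items()}
    }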
    
    history_file = Path("eval-history.jsonl")
    with history_file.open("a") as f:
        f.write(json.dumps(record) + "\n")

# In CI: pass git SHA as run_id
import os
run_id = os.environ.get("GITHUB_SHA", "local")
save_eval_results(result, run_id)

Plot this over time to catch gradual quality drift — a common issue when the underlying model is updated or your knowledge base grows.
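
Before reaching for a plotting library, a simple numeric check can flag drift automatically. The sketch below assumes the record shape written by save_eval_results (one JSON object per line with a "scores" mapping) and compares the latest run with the mean of the runs before it. `load_history`, `detect_drift`, the window of 5 runs, and the 0.05 tolerance are all illustrative choices.

```python
import json
from pathlib import Path
from statistics import mean


def load_history(path: str = "eval-history.jsonl") -> list[dict]:
    """Read one JSON record per line; return an empty list if the file is missing."""
    file = Path(path)
    if not file.exists():
        return []
    return [json.loads(line) for line in file.read_text().splitlines() if line.strip()]


def detect_drift(history: list[dict], window: int = 5, max_drop: float = 0.05) -> dict[str, float]:
    """Return metrics whose latest score fell more than max_drop below the recent mean."""
    if len(history) < 2:
        return {}
    latest = history[-1]["scores"]
    previous = history[-(window + 1):-1]  # up to `window` runs before the latest
    drops = {}
    for metric, score in latest.items():
        past = [run["scores"][metric] for run in previous if metric in run["scores"]]
        if past and mean(past) - score > max_drop:
            drops[metric] = round(mean(past) - score, 3)
    return drops


# In-memory records shaped like the lines in eval-history.jsonl (made-up scores).
example_history = [
    {"run_id": "a", "scores": {"faithfulness": 0.92, "context_recall": 0.81}},
    {"run_id": "b", "scores": {"faithfulness": 0.90, "context_recall": 0.80}},
    {"run_id": "c", "scores": {"faithfulness": 0.80, "context_recall": 0.80}},
]
drifted = detect_drift(example_history)
print(drifted)
```

In CI you would call `detect_drift(load_history())` after saving the current run and fail or warn when the result is non-empty.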

Common Pitfalls

1. Evaluating without ground truth: Context recall requires ground_truth. Without it, you lose visibility into retrieval completeness. Invest in curating 20-50 ground-truth QA pairs.

2. Chunking misalignment: If your chunks don't contain complete thoughts, faithfulness scores suffer even when the generator is correct. Aim for semantic chunking (full paragraphs) over fixed-token chunking.

3. Ignoring rank order in context: Pass retrieved documents in relevance rank order. Context precision at rank k weights position, so order matters.

4. Using low-quality judge models: Ragas evaluation is only as good as the judge. GPT-3.5 often gives noisy faithfulness scores. Stick with GPT-4 or better for final quality gates.
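
For the chunking pitfall (item 2 above), one simple way to keep complete thoughts together is to split on paragraph boundaries and pack whole paragraphs into chunks up to a size limit. This is a minimal sketch, not a recommendation of specific settings: `chunk_by_paragraph` is a hypothetical helper, paragraphs are assumed to be separated by blank lines, and the 800-character default is arbitrary.

```python
def chunk_by_paragraph(text: str, max_chars: int = 800) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars.

    A paragraph is never split; one that exceeds max_chars becomes its own chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if not current or len(candidate) <= max_chars:
            current = candidate       # paragraph fits (or starts a new chunk)
        else:
            chunks.append(current)    # flush the full chunk
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


sample = "A. first.\n\nB. second.\n\nC. third."
chunks = chunk_by_paragraph(sample, max_chars=22)
for index, chunk in enumerate(chunks, start=1):
    print(f"chunk {index}: {chunk!r}")
```

Compared with fixed-token splitting, chunk sizes become uneven, but no sentence is cut in half, which helps the judge find the supporting passage for each claim.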

Ragas vs DeepEval

	Ragas	DeepEval
Primary focus	RAG pipeline metrics	General LLM unit testing
Integration	HuggingFace Datasets	pytest
Custom metrics	AspectCritique	G-Eval (very flexible)
CI fit	Script-based eval	Native pytest support
Best for	RAG quality tracking	Feature-level testing

For comprehensive coverage: use Ragas for pipeline-level evaluation and DeepEval for unit-level assertions. They complement each other.

Next Steps

Curate your golden dataset — 30+ representative questions with ground-truth answers
Set up nightly CI runs — catch model drift before users do
Explore Promptfoo for prompt regression testing across model versions
Add LangSmith for production tracing alongside offline Ragas evaluation

For teams that need monitoring beyond eval scripts — scheduled RAG quality checks with alerting — HelpMeTest runs your Ragas test suites on a schedule and notifies you when scores drop.