Evaluating RAG Systems with RAGAS and TruLens
RAG evaluation frameworks exist because manual inspection doesn't scale. You can't read 10,000 (question, answer, context) triples and judge whether the model is hallucinating, whether the retrieved context is relevant, or whether the answer actually addresses the question. RAGAS and TruLens automate this judgment using LLM-based scoring, giving you quantitative metrics you can track over time and integrate into CI.
This guide covers how to use both frameworks to build automated RAG evaluation pipelines.
RAGAS: Metric-First RAG Evaluation
RAGAS (Retrieval-Augmented Generation Assessment) provides a suite of reference-free metrics that evaluate different aspects of RAG quality. The key insight: you don't need a ground-truth answer for most metrics. You need the question, the generated answer, and the retrieved contexts.
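Concretely, a single evaluation sample is just three fields. The snippet below is a hypothetical illustration of that shape (the values are invented, but the field names match the dataset built later in this guide):

```python
# One RAGAS evaluation sample: no ground-truth answer required.
# Values are hypothetical illustrations.
sample = {
    "question": "How does vector indexing work in Qdrant?",
    "answer": "Qdrant builds an HNSW graph over stored vectors to enable fast approximate search.",
    "contexts": [
        "Qdrant uses HNSW (Hierarchical Navigable Small World) graphs for vector indexing.",
        "HNSW parameters such as m and ef_construct trade recall against speed.",
    ],
}
```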
Core RAGAS Metrics
| Metric | Measures | Range |
|---|---|---|
| `faithfulness` | Are claims in the answer supported by the retrieved context? | 0-1 |
| `answer_relevancy` | Does the answer address the question? | 0-1 |
| `context_precision` | Are retrieved chunks relevant to the question? | 0-1 |
| `context_recall` | Does the retrieved context cover the answer? (requires ground truth) | 0-1 |
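For intuition, faithfulness-style scoring typically decomposes the answer into atomic claims and asks a judge LLM whether each claim is supported by the retrieved context. The sketch below illustrates that idea only; it is not RAGAS's actual prompts or implementation, and `judge_llm` stands in for any LangChain chat model:

```python
def naive_faithfulness(answer: str, contexts: list[str], judge_llm) -> float:
    """Fraction of answer claims the retrieved context supports.

    Simplified illustration only; RAGAS uses carefully engineered
    prompts and structured outputs for both steps.
    """
    # Step 1: decompose the answer into atomic claims.
    raw = judge_llm.invoke(
        f"List each factual claim in this answer, one per line:\n\n{answer}"
    ).content
    claims = [line.strip() for line in raw.splitlines() if line.strip()]
    if not claims:
        return 0.0

    # Step 2: check each claim against the retrieved context.
    context_block = "\n".join(contexts)
    supported = 0
    for claim in claims:
        verdict = judge_llm.invoke(
            f"Context:\n{context_block}\n\nClaim: {claim}\n\n"
            "Is the claim supported by the context? Answer yes or no."
        ).content.strip().lower()
        supported += verdict.startswith("yes")

    # Faithfulness = supported claims / total claims.
    return supported / len(claims)
```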
Installing and Setting Up RAGAS
```bash
pip install ragas datasets langchain-openai
```

```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Configure RAGAS to use your LLM and embeddings as the judge
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
```

Preparing a RAGAS Test Dataset
```python
# Build an evaluation dataset from your RAG system's outputs
def build_ragas_dataset(questions: list[str], rag_pipeline) -> Dataset:
    data = {
        "question": [],
        "answer": [],
        "contexts": [],
    }
    for question in questions:
        result = rag_pipeline.answer(question)
        data["question"].append(question)
        data["answer"].append(result.answer)
        data["contexts"].append([doc.content for doc in result.retrieved_docs])
    return Dataset.from_dict(data)

# Your evaluation questions
eval_questions = [
    "What are the main benefits of using PostgreSQL over MySQL?",
    "How does vector indexing work in Qdrant?",
    "What is the difference between cosine and dot product similarity?",
    "How do I configure connection pooling in SQLAlchemy?",
    "What are the steps to set up row-level security in PostgreSQL?",
]

eval_dataset = build_ragas_dataset(eval_questions, rag_pipeline)
```

Running RAGAS Evaluation
```python
results = evaluate(
    dataset=eval_dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
    ],
    llm=evaluator_llm,
    embeddings=evaluator_embeddings,
)

print(results)
# Output:
# {'faithfulness': 0.87, 'answer_relevancy': 0.91, 'context_precision': 0.78}
```
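The result object also exposes per-question scores via `to_pandas()`, which gives CI something to archive alongside the aggregate numbers (the CSV file name below is an arbitrary choice):

```python
# Persist per-question scores so CI can upload them as an artifact
df = results.to_pandas()
df.to_csv("ragas-per-question-scores.csv", index=False)
```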
Integrating RAGAS into Pytest

```python
import pytest
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset

RAGAS_THRESHOLDS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_precision": 0.75,
}

@pytest.fixture(scope="session")
def ragas_eval_results(rag_pipeline, eval_questions):
    dataset = build_ragas_dataset(eval_questions, rag_pipeline)
    return evaluate(
        dataset=dataset,
        metrics=[faithfulness, answer_relevancy, context_precision],
        llm=evaluator_llm,
        embeddings=evaluator_embeddings,
    )

def test_faithfulness_above_threshold(ragas_eval_results):
    score = ragas_eval_results["faithfulness"]
    threshold = RAGAS_THRESHOLDS["faithfulness"]
    assert score >= threshold, (
        f"RAG faithfulness {score:.3f} is below threshold {threshold}. "
        "The model may be hallucinating or ignoring retrieved context."
    )

def test_answer_relevancy_above_threshold(ragas_eval_results):
    score = ragas_eval_results["answer_relevancy"]
    threshold = RAGAS_THRESHOLDS["answer_relevancy"]
    assert score >= threshold, (
        f"Answer relevancy {score:.3f} is below threshold {threshold}. "
        "Answers may not be addressing the questions correctly."
    )

def test_context_precision_above_threshold(ragas_eval_results):
    score = ragas_eval_results["context_precision"]
    threshold = RAGAS_THRESHOLDS["context_precision"]
    assert score >= threshold, (
        f"Context precision {score:.3f} is below threshold {threshold}. "
        "Retrieval is pulling in irrelevant documents."
    )
```

Per-Question Analysis with RAGAS
Aggregate scores hide individual failures. Inspect per-question results to find systematic issues:
```python
def test_no_single_question_with_critical_faithfulness_failure(ragas_eval_results):
    """No individual question should have faithfulness below 0.5."""
    scores = ragas_eval_results.to_pandas()
    critical_failures = scores[scores["faithfulness"] < 0.5]
    if len(critical_failures) > 0:
        failure_details = []
        for _, row in critical_failures.iterrows():
            failure_details.append(
                f"Q: {row['question'][:80]}...\n"
                f"  Faithfulness: {row['faithfulness']:.2f}\n"
                f"  Answer: {row['answer'][:100]}..."
            )
        pytest.fail(
            f"{len(critical_failures)} questions had critical faithfulness failures:\n\n"
            + "\n\n".join(failure_details)
        )
```

TruLens: Tracing and Feedback-Based Evaluation
TruLens takes a different approach: it instruments your RAG pipeline with a tracing layer and applies feedback functions to each traced call. This makes it easier to debug specific failures and track metrics per component.
Setting Up TruLens
```bash
pip install trulens trulens-providers-openai
```

```python
from trulens.core import TruSession
from trulens.apps.langchain import TruChain
from trulens.providers.openai import OpenAI as TruOpenAI

session = TruSession()
session.reset_database()

provider = TruOpenAI(model_engine="gpt-4o-mini")
```

Defining Feedback Functions
```python
from trulens.core import Feedback
import numpy as np

# Feedback selectors locate values inside the traced app, so build your
# existing LangChain RAG chain first and point select_context at it.
rag_chain = build_rag_chain()
context = TruChain.select_context(rag_chain)  # where retrieved chunks appear

# Faithfulness: are claims in the answer supported by retrieved context?
f_faithfulness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Faithfulness")
    .on(context.collect())  # all retrieved chunks, gathered into one list
    .on_output()            # the generated answer
)

# Answer relevance: does the answer address the question?
f_answer_relevance = (
    Feedback(provider.relevance_with_cot_reasons, name="Answer Relevance")
    .on_input_output()  # applies to (question, answer)
)

# Context relevance: is each retrieved chunk relevant to the question?
f_context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()
    .on(context)         # scored once per retrieved chunk
    .aggregate(np.mean)  # averaged across chunks
)
```
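Feedback functions don't have to come from a provider: `Feedback` can also wrap a plain Python callable that returns a score in [0, 1]. A minimal sketch (the emptiness heuristic here is purely illustrative):

```python
def answer_is_substantive(output: str) -> float:
    """Custom feedback: flag empty or near-empty answers.

    Illustrative heuristic only; custom feedback functions can encode
    any project-specific check.
    """
    return 1.0 if len(output.split()) > 3 else 0.0

f_substantive = (
    Feedback(answer_is_substantive, name="Substantive Answer")
    .on_output()  # score only the chain's final answer
)
```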
Wrapping Your RAG Chain with TruLens

```python
# Wrap the chain (built above) with TruLens instrumentation
tru_rag = TruChain(
    rag_chain,
    app_name="Production RAG",
    app_version="v1.2",
    feedbacks=[f_faithfulness, f_answer_relevance, f_context_relevance],
)

# Run evaluations with tracing: calls made inside the recording context
# are traced and scored by the feedback functions
with tru_rag as recording:
    for question in eval_questions:
        rag_chain.invoke({"question": question})

# Retrieve traced records (a DataFrame with one column per feedback
# function) and the list of feedback column names
records, feedback_names = session.get_records_and_feedback(app_ids=[tru_rag.app_id])
```
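Beyond the DataFrame, TruLens ships a local dashboard for browsing per-call traces and the judge's reasoning behind each score, which is where most debugging happens. A quick sketch (API as of TruLens 1.x; older `trulens_eval` releases used `tru.run_dashboard()` instead):

```python
# Launch the local TruLens dashboard (a Streamlit app) to inspect
# traces, feedback scores, and per-score reasoning
from trulens.dashboard import run_dashboard

run_dashboard(session)
```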
Integrating TruLens into Pytest

```python
TRULENS_THRESHOLDS = {
    "Faithfulness": 0.85,
    "Answer Relevance": 0.80,
    "Context Relevance": 0.75,
}

@pytest.fixture(scope="session")
def trulens_results(rag_chain, eval_questions):
    session = TruSession()
    session.reset_database()
    tru_rag = TruChain(
        rag_chain,
        app_name="Test RAG",
        app_version="test",
        feedbacks=[f_faithfulness, f_answer_relevance, f_context_relevance],
    )
    with tru_rag:
        for question in eval_questions:
            rag_chain.invoke({"question": question})
    # The records DataFrame has one column per feedback function, named
    # after each Feedback's name. Note that feedback is scored in
    # background threads by default, so ensure it has finished before
    # reading scores.
    records, feedback_names = session.get_records_and_feedback(app_ids=[tru_rag.app_id])
    return records

def test_trulens_faithfulness(trulens_results):
    avg = trulens_results["Faithfulness"].mean()
    assert avg >= TRULENS_THRESHOLDS["Faithfulness"], (
        f"TruLens faithfulness {avg:.3f} below threshold "
        f"{TRULENS_THRESHOLDS['Faithfulness']}"
    )
```

Comparing RAGAS vs TruLens
| Aspect | RAGAS | TruLens |
|---|---|---|
| Setup | Simple, batch evaluation | Requires instrumentation |
| Debugging | Aggregate metrics | Per-call traces with reasoning |
| CI integration | Easy | Moderate |
| Customization | Define custom metrics | Define custom feedback functions |
| Best for | Benchmarking and regression | Production monitoring and debugging |
Use RAGAS for CI regression testing. Use TruLens for production monitoring and debugging specific failures.
Building a RAG Evaluation CI Pipeline
```yaml
# .github/workflows/rag-evaluation.yml
name: RAG Evaluation

on:
  push:
    paths:
      - 'src/rag/**'
      - 'src/embeddings/**'
      - 'src/prompts/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run RAG evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          PINECONE_API_KEY: ${{ secrets.PINECONE_TEST_API_KEY }}
        run: |
          pip install ragas datasets langchain-openai
          pytest tests/rag_evaluation/ -v --tb=short \
            --junitxml=rag-eval-results.xml

      - name: Upload evaluation results
        uses: actions/upload-artifact@v4
        with:
          name: rag-evaluation-results
          path: rag-eval-results.xml
```
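Absolute thresholds catch gross failures, but gradual drift can stay under them. One option is to also compare each run against the previous run's scores; a minimal sketch, assuming you persist them as JSON between runs (the file name and tolerance are arbitrary choices):

```python
import json
from pathlib import Path

import pytest

BASELINE_PATH = Path("rag-eval-baseline.json")  # hypothetical: persisted by CI
MAX_REGRESSION = 0.03  # hypothetical per-metric tolerance

def test_no_regression_vs_baseline(ragas_eval_results):
    if not BASELINE_PATH.exists():
        pytest.skip("No baseline recorded yet")
    baseline = json.loads(BASELINE_PATH.read_text())
    regressions = [
        f"{metric}: {old:.3f} -> {ragas_eval_results[metric]:.3f}"
        for metric, old in baseline.items()
        if ragas_eval_results[metric] < old - MAX_REGRESSION
    ]
    assert not regressions, (
        "Metrics regressed versus baseline:\n" + "\n".join(regressions)
    )
```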
Key Takeaways

- RAGAS provides reference-free metrics: faithfulness, answer relevancy, and context precision
- TruLens provides per-call tracing with reasoning — better for debugging than aggregate metrics
- Run RAGAS in CI to catch regressions when you change embeddings, chunking, LLM, or prompts
- Per-question failure analysis is more actionable than aggregate scores alone
- Set explicit thresholds for each metric and fail CI when thresholds are breached
- Use RAGAS for benchmarking, TruLens for production monitoring — they complement each other