Evaluating RAG Systems with RAGAS and TruLens
RAG evaluation frameworks exist because manual inspection doesn't scale. You can't read 10,000 (question, answer, context) triples and judge whether the model is hallucinating, whether the retrieved context is relevant, or whether the answer actually addresses the question. RAGAS and TruLens automate this judgment using LLM-based scoring, giving you quantitative metrics you can track over time and integrate into CI.

This guide covers how to use both frameworks to build automated RAG evaluation pipelines.

RAGAS: Metric-First RAG Evaluation

RAGAS (Retrieval-Augmented Generation Assessment) provides a suite of reference-free metrics that evaluate different aspects of RAG quality. The key insight: you don't need a ground-truth answer for most metrics. You need the question, the generated answer, and the retrieved contexts.

Core RAGAS Metrics

Metric            | Measures                                                              | Range
faithfulness      | Are claims in the answer supported by retrieved context?              | 0-1
answer_relevancy  | Does the answer address the question?                                 | 0-1
context_precision | Are retrieved chunks relevant to the question?                        | 0-1
context_recall    | Does the retrieved context cover the answer? (needs ground truth)     | 0-1
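To make the faithfulness definition concrete, here is a toy illustration. This is not the RAGAS implementation — RAGAS uses an LLM judge to extract claims from the answer and verify each against the context — but the final score is simply the fraction of claims the context supports:

```python
# Toy illustration of the faithfulness score (verdicts hard-coded here;
# in RAGAS an LLM judge produces the per-claim supported/unsupported calls).
def faithfulness_score(claim_verdicts: list[bool]) -> float:
    """Fraction of answer claims supported by the retrieved context."""
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

# Suppose the judge extracted 4 claims and found 3 supported:
print(faithfulness_score([True, True, True, False]))  # 0.75
```

This is why a low faithfulness score points at hallucination: the model is emitting claims the retrieved context never stated.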

Installing and Setting Up RAGAS

pip install ragas datasets langchain-openai

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Configure RAGAS to use your LLM
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

Preparing a RAGAS Test Dataset

# Build evaluation dataset from your RAG system's outputs
def build_ragas_dataset(questions: list[str], rag_pipeline) -> Dataset:
    data = {
        "question": [],
        "answer": [],
        "contexts": [],
    }

    for question in questions:
        result = rag_pipeline.answer(question)
        data["question"].append(question)
        data["answer"].append(result.answer)
        data["contexts"].append([doc.content for doc in result.retrieved_docs])

    return Dataset.from_dict(data)

# Your evaluation questions
eval_questions = [
    "What are the main benefits of using PostgreSQL over MySQL?",
    "How does vector indexing work in Qdrant?",
    "What is the difference between cosine and dot product similarity?",
    "How do I configure connection pooling in SQLAlchemy?",
    "What are the steps to set up row-level security in PostgreSQL?",
]

eval_dataset = build_ragas_dataset(eval_questions, rag_pipeline)
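`build_ragas_dataset` assumes a pipeline object exposing `answer(question)` that returns something with an `.answer` string and `.retrieved_docs` whose items carry `.content`. A minimal stub matching that assumed interface (all names here are illustrative, not part of RAGAS) is useful for wiring up the harness before the real pipeline is available:

```python
from dataclasses import dataclass

@dataclass
class RetrievedDoc:
    content: str

@dataclass
class RAGResult:
    answer: str
    retrieved_docs: list[RetrievedDoc]

class StubRAGPipeline:
    """Stand-in pipeline implementing the interface build_ragas_dataset expects."""

    def answer(self, question: str) -> RAGResult:
        docs = [RetrievedDoc(content=f"Context for: {question}")]
        return RAGResult(answer=f"Stub answer to: {question}", retrieved_docs=docs)
```

Swapping the stub for the production pipeline requires no changes to the dataset builder, since only the duck-typed interface matters.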

Running RAGAS Evaluation

results = evaluate(
    dataset=eval_dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
    ],
    llm=evaluator_llm,
    embeddings=evaluator_embeddings,
)

print(results)
# Output:
# {'faithfulness': 0.87, 'answer_relevancy': 0.91, 'context_precision': 0.78}

Integrating RAGAS into Pytest

import pytest
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset

RAGAS_THRESHOLDS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_precision": 0.75,
}

@pytest.fixture(scope="session")
def ragas_eval_results(rag_pipeline, eval_questions):
    dataset = build_ragas_dataset(eval_questions, rag_pipeline)
    return evaluate(
        dataset=dataset,
        metrics=[faithfulness, answer_relevancy, context_precision],
        llm=evaluator_llm,
        embeddings=evaluator_embeddings,
    )

def test_faithfulness_above_threshold(ragas_eval_results):
    score = ragas_eval_results["faithfulness"]
    threshold = RAGAS_THRESHOLDS["faithfulness"]
    assert score >= threshold, (
        f"RAG faithfulness {score:.3f} is below threshold {threshold}. "
        "The model may be hallucinating or ignoring retrieved context."
    )

def test_answer_relevancy_above_threshold(ragas_eval_results):
    score = ragas_eval_results["answer_relevancy"]
    threshold = RAGAS_THRESHOLDS["answer_relevancy"]
    assert score >= threshold, (
        f"Answer relevancy {score:.3f} is below threshold {threshold}. "
        "Answers may not be addressing the questions correctly."
    )

def test_context_precision_above_threshold(ragas_eval_results):
    score = ragas_eval_results["context_precision"]
    threshold = RAGAS_THRESHOLDS["context_precision"]
    assert score >= threshold, (
        f"Context precision {score:.3f} is below threshold {threshold}. "
        "Retrieval is pulling in irrelevant documents."
    )

Per-Question Analysis with RAGAS

Aggregate scores hide individual failures. Inspect per-question results to find systematic issues:

def test_no_single_question_with_critical_faithfulness_failure(ragas_eval_results, eval_questions):
    """No individual question should have faithfulness below 0.5."""
    scores = ragas_eval_results.to_pandas()

    critical_failures = scores[scores["faithfulness"] < 0.5]

    if len(critical_failures) > 0:
        failure_details = []
        for _, row in critical_failures.iterrows():
            failure_details.append(
                f"Q: {row['question'][:80]}...\n"
                f"  Faithfulness: {row['faithfulness']:.2f}\n"
                f"  Answer: {row['answer'][:100]}..."
            )
        pytest.fail(
            f"{len(critical_failures)} questions had critical faithfulness failures:\n\n" +
            "\n\n".join(failure_details)
        )

TruLens: Tracing and Feedback-Based Evaluation

TruLens takes a different approach: it instruments your RAG pipeline with a tracing layer and applies feedback functions to each traced call. This makes it easier to debug specific failures and track metrics per component.

Setting Up TruLens

pip install trulens trulens-providers-openai

from trulens.core import TruSession
from trulens.apps.langchain import TruChain
from trulens.providers.openai import OpenAI as TruOpenAI

session = TruSession()
session.reset_database()

provider = TruOpenAI(model_engine="gpt-4o-mini")

Defining Feedback Functions

from trulens.core import Feedback
import numpy as np

# Faithfulness: are claims in the answer grounded in the retrieved context?
f_faithfulness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Faithfulness")
    .on(TruChain.select_context().collect())  # retrieved context chunks
    .on_output()  # generated answer
)

# Answer relevance: does the answer address the question?
f_answer_relevance = (
    Feedback(provider.relevance_with_cot_reasons, name="Answer Relevance")
    .on_input_output()  # applies to (question, answer)
)

# Context relevance: is the retrieved context relevant to the question?
f_context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()
    .on(TruChain.select_context())
    .aggregate(np.mean)
)

Wrapping Your RAG Chain with TruLens

# Your existing LangChain RAG chain
rag_chain = build_rag_chain()

# Wrap with TruLens instrumentation
tru_rag = TruChain(
    rag_chain,
    app_name="Production RAG",
    app_version="v1.2",
    feedbacks=[f_faithfulness, f_answer_relevance, f_context_relevance],
)

# Run evaluations with tracing
with tru_rag as recording:
    for question in eval_questions:
        rag_chain.invoke({"question": question})

# Retrieve results (records is a DataFrame with one score column per feedback)
records, feedback_names = session.get_records_and_feedback(app_ids=[tru_rag.app_id])

Integrating TruLens into Pytest

TRULENS_THRESHOLDS = {
    "Faithfulness": 0.85,
    "Answer Relevance": 0.80,
    "Context Relevance": 0.75,
}

@pytest.fixture(scope="session")
def trulens_results(rag_chain, eval_questions):
    session = TruSession()
    session.reset_database()

    tru_rag = TruChain(
        rag_chain,
        app_name="Test RAG",
        app_version="test",
        feedbacks=[f_faithfulness, f_answer_relevance, f_context_relevance],
    )

    with tru_rag:
        for question in eval_questions:
            rag_chain.invoke({"question": question})

    # records is a DataFrame with one score column per feedback function
    records, _feedback_names = session.get_records_and_feedback(app_ids=[tru_rag.app_id])
    return records

def test_trulens_faithfulness(trulens_results):
    avg = trulens_results["Faithfulness"].mean()
    assert avg >= TRULENS_THRESHOLDS["Faithfulness"], (
        f"TruLens faithfulness {avg:.3f} below threshold {TRULENS_THRESHOLDS['Faithfulness']}"
    )

Comparing RAGAS vs TruLens

Aspect         | RAGAS                        | TruLens
Setup          | Simple, batch evaluation     | Requires instrumentation
Debugging      | Aggregate metrics            | Per-call traces with reasoning
CI integration | Easy                         | Moderate
Customization  | Define custom metrics        | Define custom feedback functions
Best for       | Benchmarking and regression  | Production monitoring and debugging

Use RAGAS for CI regression testing. Use TruLens for production monitoring and debugging specific failures.
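The two uses can share one evaluation set: a periodic RAGAS run over a sample of logged production questions keeps CI benchmarks anchored to real traffic. A small sketch, assuming you already log production questions somewhere (the function and parameter names here are illustrative):

```python
import random

def sample_eval_questions(logged_questions: list[str], k: int = 50, seed: int = 0) -> list[str]:
    """Deterministically sample logged production questions for a periodic
    RAGAS run. A fixed seed keeps evaluation cost bounded and makes runs
    comparable across days."""
    rng = random.Random(seed)
    if len(logged_questions) <= k:
        return list(logged_questions)
    return rng.sample(logged_questions, k)
```

Feeding the sampled questions into `build_ragas_dataset` then reuses the same CI thresholds against production-shaped traffic.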

Building a RAG Evaluation CI Pipeline

# .github/workflows/rag-evaluation.yml
name: RAG Evaluation

on:
  push:
    paths:
      - 'src/rag/**'
      - 'src/embeddings/**'
      - 'src/prompts/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Run RAG evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          PINECONE_API_KEY: ${{ secrets.PINECONE_TEST_API_KEY }}
        run: |
          pip install pytest ragas datasets langchain-openai
          pytest tests/rag_evaluation/ -v --tb=short \
            --junitxml=rag-eval-results.xml
      - name: Upload evaluation results
        uses: actions/upload-artifact@v4
        with:
          name: rag-evaluation-results
          path: rag-eval-results.xml
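The evaluation tests require an API key, which developers running the suite locally may not have exported. One way to keep local runs green without weakening CI is a reusable skip marker (a sketch; adapt the env var name to your setup):

```python
import os

import pytest

# Skip LLM-judged evaluation tests when no API key is configured, so a
# plain local `pytest` run without credentials doesn't fail spuriously.
requires_openai = pytest.mark.skipif(
    not os.environ.get("OPENAI_API_KEY"),
    reason="OPENAI_API_KEY not set; skipping LLM-judged RAG evaluation",
)

@requires_openai
def test_example_faithfulness():
    ...  # decorate the RAGAS threshold tests from earlier the same way
```

In CI the secret is always present, so the tests run; locally they skip with an explicit reason instead of erroring on a missing credential.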

Key Takeaways

  • RAGAS provides reference-free metrics: faithfulness, answer relevancy, and context precision
  • TruLens provides per-call tracing with reasoning — better for debugging than aggregate metrics
  • Run RAGAS in CI to catch regressions when you change embeddings, chunking, LLM, or prompts
  • Per-question failure analysis is more actionable than aggregate scores alone
  • Set explicit thresholds for each metric and fail CI when thresholds are breached
  • Use RAGAS for benchmarking, TruLens for production monitoring — they complement each other
