TruLens: LLM Observability and Evaluation for RAG Applications

TruLens: LLM Observability and Evaluation for RAG Applications

TruLens is an open-source observability and evaluation framework for LLM applications. Its core contribution is the RAG Triad — three metrics (groundedness, context relevance, answer relevance) that together diagnose where a RAG pipeline breaks. Traces are stored locally or remotely, and a built-in dashboard makes debugging fast.


What TruLens Does

TruLens wraps your LLM application and records every inference — inputs, outputs, intermediate steps — to a local SQLite database (or remote store). On top of that trace data, it runs feedback functions: automated evaluators that score each recorded inference.

The result is a searchable record of how your application performed, with per-inference quality scores you can filter and analyze.

Key capabilities:

  • Instrument any LLM app — LangChain, LlamaIndex, raw API calls, or custom code
  • RAG Triad evaluation — three metrics designed specifically for RAG failure modes
  • Feedback functions — plug in any evaluator (OpenAI, Hugging Face, custom)
  • Local dashboard — visualize traces and scores without sending data anywhere
  • No cloud required — fully local operation possible

Installation

pip install trulens trulens-providers-openai

For LlamaIndex integration:

pip install trulens trulens-apps-llamaindex trulens-providers-openai

The RAG Triad

TruLens introduced the RAG Triad as a systematic framework for evaluating RAG systems. Three metrics, each targeting a specific failure mode:

1. Context Relevance

Question: Is the retrieved context relevant to the user's query?

Measures retrieval quality. Low context relevance means your retriever is pulling irrelevant documents — garbage in, garbage out.

2. Groundedness

Question: Is the generated answer supported by the retrieved context?

Measures hallucination. Low groundedness means the model ignores its context and generates from training data instead.

3. Answer Relevance

Question: Does the generated answer address the user's question?

Measures output quality. Low answer relevance means the answer is off-topic or incomplete, even if it's factually grounded.

A high-quality RAG system scores well on all three. When something breaks, the pattern tells you where:

Context Relevance Groundedness Answer Relevance Diagnosis
Low Fix your retriever
High Low Generator is hallucinating
High High Low Prompt or generation logic issue
High High High Working well

Quick Start with LangChain

import os
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

from trulens.core import TruSession
from trulens.apps.langchain import TruChain
from trulens.providers.openai import OpenAI as TruOpenAI

# Initialize TruLens session (stores to local SQLite)
session = TruSession()
session.reset_database()  # fresh start

# Build your LangChain RAG app
embeddings = OpenAIEmbeddings()
docs = [
    "HelpMeTest Pro costs $100/month with unlimited tests.",
    "HelpMeTest uses Robot Framework with Playwright.",
    "The free plan allows up to 10 tests with 5-minute monitoring.",
    "Visual testing supports mobile, tablet, and desktop viewports.",
    "HelpMeTest is cloud-hosted SaaS — no self-hosting available."
]

from langchain.schema import Document
vector_store = FAISS.from_documents(
    [Document(page_content=d) for d in docs],
    embeddings
)

llm = ChatOpenAI(model="gpt-4o", temperature=0)
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(search_kwargs={"k": 3})
)

# Set up TruLens feedback functions (the RAG Triad)
provider = TruOpenAI()

from trulens.core import Feedback
import numpy as np

# Context relevance: are retrieved docs relevant to query?
context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()
    .on(TruChain.select_context(rag_chain))
    .aggregate(np.mean)
)

# Groundedness: is answer supported by context?
groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(TruChain.select_context(rag_chain).collect())
    .on_output()
)

# Answer relevance: does answer address the question?
answer_relevance = (
    Feedback(provider.relevance_with_cot_reasons, name="Answer Relevance")
    .on_input_output()
)

# Wrap your chain with TruLens instrumentation
tru_rag = TruChain(
    rag_chain,
    app_name="HelpMeTest Support Bot",
    app_version="1.0",
    feedbacks=[context_relevance, groundedness, answer_relevance]
)

# Run queries — TruLens records everything
test_questions = [
    "What does HelpMeTest Pro cost?",
    "Does HelpMeTest support self-hosting?",
    "What monitoring frequency is available?",
    "Which testing frameworks does HelpMeTest use?",
]

with tru_rag as recording:
    for q in test_questions:
        response = rag_chain.invoke(q)
        print(f"Q: {q}")
        print(f"A: {response['result']}\n")

Viewing Results in the Dashboard

# Launch the TruLens dashboard
session.get_leaderboard()  # aggregate scores

# Or launch the full UI
from trulens.dashboard import run_dashboard
run_dashboard(session)

Open http://localhost:8501 — you'll see:

  • Leaderboard: aggregate scores per app version
  • Evaluations: per-question scores with reasoning
  • Traces: full request traces with timing

The dashboard is local — no data leaves your machine.


Programmatic Results Access

# Get leaderboard as DataFrame
leaderboard = session.get_leaderboard()
print(leaderboard[["app_name", "app_version", "Context Relevance", "Groundedness", "Answer Relevance"]])

# Get per-record scores
records = session.get_records_and_feedback()[0]
df = records[["input", "output", "Context Relevance", "Groundedness", "Answer Relevance"]]
print(df)

# Find failing records
failing = df[df["Groundedness"] < 0.7]
print(f"Grounding failures: {len(failing)}")

Using TruLens with LlamaIndex

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.query_engine import RetrieverQueryEngine

from trulens.apps.llamaindex import TruLlama
from trulens.core import TruSession, Feedback
from trulens.providers.openai import OpenAI as TruOpenAI
import numpy as np

session = TruSession()
provider = TruOpenAI()

# Build LlamaIndex app
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=3)

# RAG Triad feedback
context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()
    .on(TruLlama.select_source_nodes().node.text)
    .aggregate(np.mean)
)

groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(TruLlama.select_source_nodes().node.text.collect())
    .on_output()
)

answer_relevance = (
    Feedback(provider.relevance_with_cot_reasons, name="Answer Relevance")
    .on_input_output()
)

tru_query_engine = TruLlama(
    query_engine,
    app_name="LlamaIndex RAG",
    feedbacks=[context_relevance, groundedness, answer_relevance]
)

with tru_query_engine as recording:
    response = query_engine.query("What is HelpMeTest?")

Custom Feedback Functions

Beyond the RAG Triad, write domain-specific evaluators:

from trulens.core import Feedback

# Custom: check that answers never claim self-hosting
def no_self_hosting_claim(output: str) -> float:
    """Returns 1.0 if output doesn't claim self-hosting, 0.0 if it does."""
    forbidden = ["self-host", "on-premise", "your own server", "deploy yourself"]
    if any(phrase.lower() in output.lower() for phrase in forbidden):
        return 0.0
    return 1.0

no_self_hosting = Feedback(
    no_self_hosting_claim,
    name="No Self-Hosting Claim"
).on_output()

# Custom: pricing accuracy check
def pricing_accuracy(input: str, output: str) -> float:
    """Checks that pricing questions mention correct prices."""
    pricing_keywords = ["cost", "price", "pricing", "how much"]
    if any(kw in input.lower() for kw in pricing_keywords):
        if "$100" in output or "100/month" in output.lower():
            return 1.0
        return 0.0
    return 1.0  # Not a pricing question, skip

pricing = Feedback(pricing_accuracy, name="Pricing Accuracy").on_input_output()

Comparing App Versions

TruLens makes A/B testing straightforward — track separate versions under different app_version labels:

# Version 1: baseline
tru_v1 = TruChain(
    chain_v1,
    app_name="Support Bot",
    app_version="v1-gpt4o",
    feedbacks=[context_relevance, groundedness, answer_relevance]
)

# Version 2: new prompt
tru_v2 = TruChain(
    chain_v2,
    app_name="Support Bot",
    app_version="v2-improved-prompt",
    feedbacks=[context_relevance, groundedness, answer_relevance]
)

# Run the same questions through both
questions = ["What does HelpMeTest cost?", "How do I set up monitoring?"]

with tru_v1 as recording:
    for q in questions:
        chain_v1.invoke(q)

with tru_v2 as recording:
    for q in questions:
        chain_v2.invoke(q)

# Compare in leaderboard
leaderboard = session.get_leaderboard()
print(leaderboard[["app_version", "Context Relevance", "Groundedness", "Answer Relevance"]])

Output:

        app_version  Context Relevance  Groundedness  Answer Relevance
0       v1-gpt4o              0.71          0.83             0.79
1  v2-improved-prompt          0.71          0.91             0.87

V2 improved groundedness and answer relevance without changing retrieval — the improvement came from the prompt, not the retriever.


CI Integration

Run TruLens evaluations in CI and fail on quality regression:

# scripts/trulens_eval.py
import sys
from trulens.core import TruSession

session = TruSession()

# Import and run your instrumented app
from myapp.rag import run_eval_questions
run_eval_questions()  # runs questions through tru_rag

# Get aggregate scores
leaderboard = session.get_leaderboard()
scores = leaderboard.iloc[0]

THRESHOLDS = {
    "Context Relevance": 0.7,
    "Groundedness": 0.8,
    "Answer Relevance": 0.75,
}

failed = []
for metric, threshold in THRESHOLDS.items():
    score = scores.get(metric, 0)
    print(f"{metric}: {score:.2f} (threshold: {threshold})")
    if score < threshold:
        failed.append(f"{metric}: {score:.2f} < {threshold}")

if failed:
    print("\nFAILED:")
    for f in failed:
        print(f"  {f}")
    sys.exit(1)

print("\nPASSED all quality thresholds")

Persistent Storage for Production

By default, TruLens stores to local SQLite. For team-shared results or production tracking:

from trulens.core import TruSession

# PostgreSQL
session = TruSession(
    database_url="postgresql://user:pass@host:5432/trulens"
)

# Or Snowflake
session = TruSession(
    database_url="snowflake://user:pass@account/db/schema"
)

This enables shared evaluation history across team members and CI runs.


TruLens vs LangSmith

TruLens LangSmith
Open source Yes No (freemium)
Local operation Yes (SQLite) Requires Langchain/LCEL API
RAG Triad Built-in Via custom evaluators
Dashboard Local (Streamlit) Cloud UI
LangChain integration Wrapper-based Native
Dataset management Limited Full-featured
Annotation queues No Yes

TruLens wins on privacy and cost (fully local, open source). LangSmith wins on team collaboration features and tighter LangChain integration.


Next Steps

  • Start with the RAG Triad on your existing RAG app — it takes under an hour to instrument
  • Find your weakest metric and fix that layer first
  • Add custom feedback functions for domain-specific checks (pricing accuracy, brand safety)
  • Build a CI eval script that blocks merges when groundedness drops
  • Explore building your own eval framework if you need deeper customization than TruLens provides

For teams that want eval results in a shared dashboard with alerting — without maintaining the infrastructure — HelpMeTest runs your TruLens evaluation scripts on a schedule and surfaces quality trends in one place.

Read more