TruLens: LLM Observability and Evaluation for RAG Applications
TruLens is an open-source observability and evaluation framework for LLM applications. Its core contribution is the RAG Triad — three metrics (groundedness, context relevance, answer relevance) that together diagnose where a RAG pipeline breaks. Traces are stored locally or remotely, and a built-in dashboard makes debugging fast.
What TruLens Does
TruLens wraps your LLM application and records every inference — inputs, outputs, intermediate steps — to a local SQLite database (or remote store). On top of that trace data, it runs feedback functions: automated evaluators that score each recorded inference.
The result is a searchable record of how your application performed, with per-inference quality scores you can filter and analyze.
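To make the record-then-score idea concrete, here is a minimal sketch built only on Python's standard library. It is not how TruLens is implemented; the table layout, `record_inference`, `toy_app`, and `mentions_price` are all hypothetical stand-ins that show the shape of the workflow: run the app, score the result, persist both.

```python
import json
import sqlite3


def record_inference(conn, app_fn, prompt, feedback_fns):
    """Run app_fn on prompt, score the output, and persist one trace row."""
    output = app_fn(prompt)
    scores = {name: fn(prompt, output) for name, fn in feedback_fns.items()}
    conn.execute(
        "INSERT INTO records (input, output, scores) VALUES (?, ?, ?)",
        (prompt, output, json.dumps(scores)),
    )
    conn.commit()
    return output, scores


# Hypothetical stand-ins for a real LLM app and a real evaluator.
def toy_app(prompt):
    return "HelpMeTest Pro costs $100/month."


def mentions_price(prompt, output):
    return 1.0 if "$" in output else 0.0


conn = sqlite3.connect(":memory:")  # in-memory database for the demo
conn.execute(
    "CREATE TABLE records ("
    "id INTEGER PRIMARY KEY, input TEXT, output TEXT, scores TEXT)"
)

output, scores = record_inference(
    conn,
    toy_app,
    "What does HelpMeTest Pro cost?",
    {"Mentions Price": mentions_price},
)
rows = conn.execute("SELECT input, output, scores FROM records").fetchall()
print(rows)
```

Because every row carries its own scores, filtering for low-quality inferences becomes an ordinary query over stored data.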
Key capabilities:
- Instrument any LLM app — LangChain, LlamaIndex, raw API calls, or custom code
- RAG Triad evaluation — three metrics designed specifically for RAG failure modes
- Feedback functions — plug in any evaluator (OpenAI, Hugging Face, custom)
- Local dashboard — visualize traces and scores without sending data anywhere
- No cloud required — fully local operation possible
Installation
pip install trulens trulens-providers-openai
For LlamaIndex integration:
pip install trulens trulens-apps-llamaindex trulens-providers-openai
The RAG Triad
TruLens introduced the RAG Triad as a systematic framework for evaluating RAG systems. Three metrics, each targeting a specific failure mode:
1. Context Relevance
Question: Is the retrieved context relevant to the user's query?
Measures retrieval quality. Low context relevance means your retriever is pulling irrelevant documents — garbage in, garbage out.
2. Groundedness
Question: Is the generated answer supported by the retrieved context?
Measures hallucination. Low groundedness means the answer contains claims the retrieved context does not support, typically because the model fell back on its training data.
3. Answer Relevance
Question: Does the generated answer address the user's question?
Measures output quality. Low answer relevance means the answer is off-topic or incomplete, even if it's factually grounded.
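Each metric compares a different pair of texts: query against context, answer against context, and answer against query. The sketch below illustrates only that pairing. It uses a crude token-overlap score as a stand-in for the LLM-based judgments TruLens relies on, so the numbers are not comparable to real TruLens scores; `tokens`, `overlap`, and `rag_triad` are hypothetical helper names.

```python
import re


def tokens(text):
    """Lowercased word-like tokens, keeping '$' so prices survive."""
    return set(re.findall(r"[a-z0-9$]+", text.lower()))


def overlap(source, target):
    """Fraction of target's tokens that also appear in source."""
    target_tokens = tokens(target)
    if not target_tokens:
        return 0.0
    return len(tokens(source) & target_tokens) / len(target_tokens)


def rag_triad(query, context, answer):
    return {
        # How much of the query is covered by the retrieved context?
        "Context Relevance": overlap(context, query),
        # How much of the answer is backed by the retrieved context?
        "Groundedness": overlap(context, answer),
        # How much of the query is addressed by the answer?
        "Answer Relevance": overlap(answer, query),
    }


query = "What does HelpMeTest Pro cost?"
context = "HelpMeTest Pro costs $100/month with unlimited tests."

grounded = rag_triad(query, context, "HelpMeTest Pro costs $100/month.")
ungrounded = rag_triad(query, context, "Yes, you can self-host it for free.")
print(grounded)
print(ungrounded)
```

The second answer shares no tokens with the context, so its groundedness collapses to zero while the retrieval score is unchanged. That is the separation of concerns the Triad is designed to expose.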
A high-quality RAG system scores well on all three. When something breaks, the pattern tells you where:
| Context Relevance | Groundedness | Answer Relevance | Diagnosis |
|---|---|---|---|
| Low | — | — | Fix your retriever |
| High | Low | — | Generator is hallucinating |
| High | High | Low | Prompt or generation logic issue |
| High | High | High | Working well |
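The table reads top to bottom: the first low score identifies the layer to fix. A minimal sketch of that lookup follows, assuming a single cutoff of 0.7 separates "low" from "high" for every metric. Both the cutoff and the function name `diagnose` are illustrative choices, not part of TruLens.

```python
def diagnose(context_relevance, groundedness, answer_relevance, threshold=0.7):
    """Return the diagnosis for the first metric that falls below threshold."""
    if context_relevance < threshold:
        # Bad retrieval makes the downstream scores uninformative.
        return "Fix your retriever"
    if groundedness < threshold:
        return "Generator is hallucinating"
    if answer_relevance < threshold:
        return "Prompt or generation logic issue"
    return "Working well"


print(diagnose(0.45, 0.90, 0.90))
print(diagnose(0.85, 0.50, 0.90))
print(diagnose(0.85, 0.90, 0.40))
print(diagnose(0.85, 0.90, 0.90))
```

Checking the metrics in this order matters: a low context relevance score is reported even when the other two look fine, because a generator working from irrelevant documents cannot be judged fairly.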
Quick Start with LangChain
import os

from langchain.chains import RetrievalQA
from langchain.schema import Document
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from trulens.apps.langchain import TruChain
from trulens.core import TruSession
from trulens.providers.openai import OpenAI as TruOpenAI

# The OpenAI clients below read the API key from the environment
assert os.environ.get("OPENAI_API_KEY"), "Set OPENAI_API_KEY first"

# Initialize the TruLens session (stores to local SQLite by default)
session = TruSession()
session.reset_database()  # fresh start: clears previously recorded data

# Build your LangChain RAG app
embeddings = OpenAIEmbeddings()
docs = [
    "HelpMeTest Pro costs $100/month with unlimited tests.",
    "HelpMeTest uses Robot Framework with Playwright.",
    "The free plan allows up to 10 tests with 5-minute monitoring.",
    "Visual testing supports mobile, tablet, and desktop viewports.",
    "HelpMeTest is cloud-hosted SaaS; no self-hosting is available.",
]
vector_store = FAISS.from_documents(
    [Document(page_content=d) for d in docs],
    embeddings,
)
llm = ChatOpenAI(model="gpt-4o", temperature=0)
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(search_kwargs={"k": 3}),
)
# Set up TruLens feedback functions (the RAG Triad)
import numpy as np
from trulens.core import Feedback

provider = TruOpenAI()

# Context relevance: are the retrieved docs relevant to the query?
context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()
    .on(TruChain.select_context(rag_chain))
    .aggregate(np.mean)  # average the per-chunk scores
)

# Groundedness: is the answer supported by the context?
groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(TruChain.select_context(rag_chain).collect())
    .on_output()
)

# Answer relevance: does the answer address the question?
answer_relevance = (
    Feedback(provider.relevance_with_cot_reasons, name="Answer Relevance")
    .on_input_output()
)

# Wrap your chain with TruLens instrumentation
tru_rag = TruChain(
    rag_chain,
    app_name="HelpMeTest Support Bot",
    app_version="1.0",
    feedbacks=[context_relevance, groundedness, answer_relevance],
)
# Run queries inside the recording context so TruLens captures each call
test_questions = [
    "What does HelpMeTest Pro cost?",
    "Does HelpMeTest support self-hosting?",
    "What monitoring frequency is available?",
    "Which testing frameworks does HelpMeTest use?",
]
with tru_rag as recording:
    for q in test_questions:
        response = rag_chain.invoke(q)
        print(f"Q: {q}")
        print(f"A: {response['result']}\n")
Viewing Results in the Dashboard
# Aggregate scores per app version, returned as a DataFrame
session.get_leaderboard()
# Or launch the full dashboard UI
from trulens.dashboard import run_dashboard
run_dashboard(session)
Open the dashboard URL (http://localhost:8501 unless another port is chosen). You'll see:
- Leaderboard: aggregate scores per app version
- Evaluations: per-question scores with reasoning
- Traces: full request traces with timing
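The dashboard runs locally, so traces and scores stay on your machine. Feedback functions backed by a hosted provider such as OpenAI still send the evaluated text to that provider.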
Programmatic Results Access
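# Feedback functions run in the background, so scores can take a moment to appear.
# Get the leaderboard as a DataFrame; reset_index() makes app_name and
# app_version ordinary columns in case they are returned as the index
leaderboard = session.get_leaderboard().reset_index()
print(leaderboard[["app_name", "app_version", "Context Relevance", "Groundedness", "Answer Relevance"]])
# Get per-record scores (the second return value lists the feedback names)
records, feedback_names = session.get_records_and_feedback()
df = records[["input", "output", "Context Relevance", "Groundedness", "Answer Relevance"]]
print(df)
# Find failing records
failing = df[df["Groundedness"] < 0.7]
print(f"Grounding failures: {len(failing)}")
Using TruLens with LlamaIndex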
import numpy as np
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from trulens.apps.llamaindex import TruLlama
from trulens.core import Feedback, TruSession
from trulens.providers.openai import OpenAI as TruOpenAI

session = TruSession()
provider = TruOpenAI()

# Build the LlamaIndex app
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=3)

# RAG Triad feedback
context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()
    .on(TruLlama.select_source_nodes().node.text)
    .aggregate(np.mean)
)
groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(TruLlama.select_source_nodes().node.text.collect())
    .on_output()
)
answer_relevance = (
    Feedback(provider.relevance_with_cot_reasons, name="Answer Relevance")
    .on_input_output()
)

tru_query_engine = TruLlama(
    query_engine,
    app_name="LlamaIndex RAG",
    feedbacks=[context_relevance, groundedness, answer_relevance],
)
with tru_query_engine as recording:
    response = query_engine.query("What is HelpMeTest?")
Custom Feedback Functions
Beyond the RAG Triad, write domain-specific evaluators:
from trulens.core import Feedback

# Custom: check that answers never claim self-hosting
def no_self_hosting_claim(output: str) -> float:
    """Returns 1.0 if output doesn't claim self-hosting, 0.0 if it does."""
    forbidden = ["self-host", "on-premise", "your own server", "deploy yourself"]
    if any(phrase.lower() in output.lower() for phrase in forbidden):
        return 0.0
    return 1.0

no_self_hosting = Feedback(
    no_self_hosting_claim,
    name="No Self-Hosting Claim",
).on_output()

# Custom: pricing accuracy check
def pricing_accuracy(input: str, output: str) -> float:
    """Checks that pricing questions mention correct prices."""
    pricing_keywords = ["cost", "price", "pricing", "how much"]
    if any(kw in input.lower() for kw in pricing_keywords):
        if "$100" in output or "100/month" in output.lower():
            return 1.0
        return 0.0
    return 1.0  # Not a pricing question, skip

pricing = Feedback(pricing_accuracy, name="Pricing Accuracy").on_input_output()
Comparing App Versions
TruLens makes A/B testing straightforward — track separate versions under different app_version labels:
# chain_v1 and chain_v2 are two variants of your RAG chain, built elsewhere
# Version 1: baseline
tru_v1 = TruChain(
    chain_v1,
    app_name="Support Bot",
    app_version="v1-gpt4o",
    feedbacks=[context_relevance, groundedness, answer_relevance],
)
# Version 2: new prompt
tru_v2 = TruChain(
    chain_v2,
    app_name="Support Bot",
    app_version="v2-improved-prompt",
    feedbacks=[context_relevance, groundedness, answer_relevance],
)
# Run the same questions through both
questions = ["What does HelpMeTest cost?", "How do I set up monitoring?"]
with tru_v1 as recording:
    for q in questions:
        chain_v1.invoke(q)
with tru_v2 as recording:
    for q in questions:
        chain_v2.invoke(q)
# Compare in the leaderboard; reset_index() makes app_version an ordinary
# column in case it is returned as part of the index
leaderboard = session.get_leaderboard().reset_index()
print(leaderboard[["app_version", "Context Relevance", "Groundedness", "Answer Relevance"]])
Output:
          app_version  Context Relevance  Groundedness  Answer Relevance
0            v1-gpt4o               0.71          0.83              0.79
1  v2-improved-prompt               0.71          0.91              0.87
V2 improved groundedness and answer relevance without changing retrieval, so the gain came from the prompt rather than the retriever.
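Reading deltas off a leaderboard by eye gets error-prone once there are more than a few metrics or versions. The sketch below labels each metric as improved, regressed, or unchanged, using the scores from the output shown above. The helper `compare_versions` and the `min_delta` noise floor of 0.01 are illustrative assumptions, not TruLens features.

```python
def compare_versions(baseline, candidate, min_delta=0.01):
    """Label each metric by how the candidate moved relative to the baseline."""
    report = {}
    for metric, base_score in baseline.items():
        # Round to hide floating-point noise such as 0.08000000000000007.
        delta = round(candidate[metric] - base_score, 4)
        if delta >= min_delta:
            verdict = "improved"
        elif delta <= -min_delta:
            verdict = "regressed"
        else:
            verdict = "unchanged"
        report[metric] = (delta, verdict)
    return report


v1 = {"Context Relevance": 0.71, "Groundedness": 0.83, "Answer Relevance": 0.79}
v2 = {"Context Relevance": 0.71, "Groundedness": 0.91, "Answer Relevance": 0.87}

report = compare_versions(v1, v2)
for metric, (delta, verdict) in report.items():
    print(f"{metric}: {delta:+.2f} ({verdict})")
```

With only a handful of evaluation questions, small deltas can be noise, so a larger question set or a higher `min_delta` is worth considering before acting on the result.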
CI Integration
Run TruLens evaluations in CI and fail on quality regression:
# scripts/trulens_eval.py
import sys

from trulens.core import TruSession

session = TruSession()

# Import and run your instrumented app
from myapp.rag import run_eval_questions
run_eval_questions()  # runs questions through tru_rag

# Get aggregate scores (first leaderboard row; select the row for the
# version under test if several versions are recorded)
leaderboard = session.get_leaderboard()
scores = leaderboard.iloc[0]

THRESHOLDS = {
    "Context Relevance": 0.7,
    "Groundedness": 0.8,
    "Answer Relevance": 0.75,
}

failed = []
for metric, threshold in THRESHOLDS.items():
    score = scores.get(metric, 0)
    print(f"{metric}: {score:.2f} (threshold: {threshold})")
    if score < threshold:
        failed.append(f"{metric}: {score:.2f} < {threshold}")

if failed:
    print("\nFAILED:")
    for f in failed:
        print(f"  {f}")
    sys.exit(1)

print("\nPASSED all quality thresholds")
Persistent Storage for Production
By default, TruLens stores to local SQLite. For team-shared results or production tracking:
from trulens.core import TruSession

# PostgreSQL
session = TruSession(
    database_url="postgresql://user:pass@host:5432/trulens"
)

# Or Snowflake
session = TruSession(
    database_url="snowflake://user:pass@account/db/schema"
)
This enables shared evaluation history across team members and CI runs.
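Hard-coding credentials in a connection string is risky once the script lives in a shared repository or CI job. One common pattern is to read the URL from the environment and fall back to local SQLite. In this sketch the variable name `TRULENS_DATABASE_URL` and both helper functions are hypothetical, and the fallback value `sqlite:///default.sqlite` is an assumption about the default file name.

```python
import os
from urllib.parse import urlsplit, urlunsplit

# Assumed local fallback; adjust to whatever SQLite file you actually use.
DEFAULT_DATABASE_URL = "sqlite:///default.sqlite"


def resolve_database_url(env=None):
    """Return the configured database URL, or the local SQLite fallback."""
    env = os.environ if env is None else env
    url = env.get("TRULENS_DATABASE_URL", "").strip()  # hypothetical variable name
    return url or DEFAULT_DATABASE_URL


def redact_password(url):
    """Mask the password so the URL is safe to print in CI logs."""
    parts = urlsplit(url)
    if parts.password is None:
        return url
    host = parts.hostname or ""
    if parts.port:
        host += f":{parts.port}"
    netloc = f"{parts.username}:***@{host}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


url = resolve_database_url({"TRULENS_DATABASE_URL": "postgresql://user:pass@host:5432/trulens"})
print(redact_password(url))

# Usage with TruLens (requires the trulens package):
# session = TruSession(database_url=resolve_database_url())
```

The same script then runs unchanged on a laptop (no variable set, local SQLite) and in CI (variable injected from a secret store).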
TruLens vs LangSmith
| | TruLens | LangSmith |
|---|---|---|
| Open source | Yes | No (freemium) |
| Local operation | Yes (SQLite) | Requires the LangSmith API |
| RAG Triad | Built-in | Via custom evaluators |
| Dashboard | Local (Streamlit) | Cloud UI |
| LangChain integration | Wrapper-based | Native |
| Dataset management | Limited | Full-featured |
| Annotation queues | No | Yes |
TruLens wins on privacy and cost (fully local, open source). LangSmith wins on team collaboration features and tighter LangChain integration.
Next Steps
- Start with the RAG Triad on your existing RAG app — it takes under an hour to instrument
- Find your weakest metric and fix that layer first
- Add custom feedback functions for domain-specific checks (pricing accuracy, brand safety)
- Build a CI eval script that blocks merges when groundedness drops
- Explore building your own eval framework if you need deeper customization than TruLens provides
For teams that want eval results in a shared dashboard with alerting — without maintaining the infrastructure — HelpMeTest runs your TruLens evaluation scripts on a schedule and surfaces quality trends in one place.