Opik LLM Evaluation: Open-Source Testing with Comet Opik
Open-source LLM evaluation tools matter for teams that can't send production data to third-party cloud services, that need to self-host their observability stack, or that simply want full control over their evaluation pipeline. Opik, from Comet, is one of the most capable open-source options in this space. It provides tracing, dataset management, and LLM evaluation — all of which you can run on your own infrastructure.
This guide is a practical walkthrough: how to instrument your LLM application with Opik, how to build evaluation datasets, how to score your outputs, and how to integrate the whole thing into a CI workflow.
Why Open-Source Evaluation Matters
The case for open-source LLM observability isn't purely ideological. There are concrete operational reasons:
Data residency — in regulated industries (healthcare, finance, legal), sending user queries to a third-party logging service may violate compliance requirements. Self-hosting eliminates this concern.
Cost at scale — cloud observability platforms charge per request or per trace. At high volume, this adds up. Running Opik yourself means you pay infrastructure costs, not per-trace fees.
Customization — open-source means you can modify the scoring logic, the storage backend, the UI. Cloud platforms give you what they give you.
Auditability — you can inspect exactly how traces are collected, stored, and scored. No black boxes.
Setting Up Opik
Cloud (Easiest Start)
pip install opik
opik configure  # Set API key and workspace

Self-Hosted with Docker
# docker-compose.yml
version: '3'
services:
  opik:
    image: ghcr.io/comet-ml/opik:latest
    ports:
      - "5173:5173"  # UI
      - "8080:8080"  # API
    volumes:
      - opik-data:/data
    environment:
      - OPIK_STORAGE_TYPE=local
volumes:
  opik-data:

docker compose up -d
opik configure --url http://localhost:8080 --api-key local

Configure the client:
import opik

opik.configure(
    url="http://localhost:8080",  # or cloud URL
    api_key="your-api-key",
    workspace="your-workspace"
)

Tracing LLM Applications
The @opik.track decorator instruments any function with tracing. Nested decorators create parent-child spans:
import opik
from opik import track
import openai

client = openai.OpenAI()


@track(name="llm-call")
def call_llm(prompt: str, model: str = "gpt-4o-mini") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content


@track(name="retrieve-context")
def retrieve_context(query: str) -> list[str]:
    # Simulate vector DB retrieval
    return [
        "Relevant document chunk 1...",
        "Relevant document chunk 2..."
    ]


@track(name="rag-pipeline")
def answer_question(question: str) -> str:
    # This creates a parent trace with two child spans
    context_chunks = retrieve_context(question)
    context = "\n".join(context_chunks)
    prompt = f"""Answer the following question using only the provided context.
Context:
{context}
Question: {question}"""
    return call_llm(prompt)

For automatic OpenAI tracing, Opik provides an integration that wraps the client:
import openai
from opik.integrations.openai import track_openai

client = track_openai(openai.OpenAI())

# All calls through this client are automatically traced
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}]
)

Building Evaluation Datasets
Opik datasets work like datasets on other evaluation platforms: each one is a collection of input/expected-output pairs that you run your pipeline against.
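To make the item shape concrete, here is a minimal stdlib sketch that turns a flat question/answer record into a nested item with `input`, `expected_output`, and `metadata` keys. The helper name `to_dataset_item` and its required-field check are illustrative assumptions, not part of the Opik SDK.

```python
from typing import Any

REQUIRED_FIELDS = ("question", "expected")  # assumed minimum for a usable item


def to_dataset_item(pair: dict[str, Any]) -> dict[str, Any]:
    """Convert one flat QA record into a nested dataset item.

    Hypothetical helper: raises ValueError when a required field is missing
    or empty, so malformed records are caught before they reach the dataset.
    """
    missing = [field for field in REQUIRED_FIELDS if not pair.get(field)]
    if missing:
        raise ValueError(f"record is missing required fields: {missing}")
    return {
        "input": {
            "question": pair["question"],
            "context": list(pair.get("context", [])),
        },
        "expected_output": {"answer": pair["expected"]},
        "metadata": dict(pair.get("metadata", {})),
    }


example = to_dataset_item({
    "question": "What is the refund policy?",
    "expected": "Full refund within 30 days of purchase",
    "metadata": {"category": "refund"},
})
print(example["expected_output"])
```

Validating records up front keeps half-filled items out of the dataset. The Opik SDK calls that create and populate a dataset follow.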
import opik

client = opik.Opik()

# Create a dataset
dataset = client.get_or_create_dataset("rag-qa-v1")

# Add items
qa_pairs = [
    {
        "question": "What is the refund policy?",
        "expected": "Full refund within 30 days of purchase",
        "context": ["Our refund policy allows full refunds within 30 days..."],
        "metadata": {"category": "refund", "difficulty": "easy"}
    },
    {
        "question": "How do I reset my password?",
        "expected": "Click 'Forgot Password' on the login page",
        "context": ["To reset your password, navigate to the login page..."],
        "metadata": {"category": "account", "difficulty": "easy"}
    },
    {
        "question": "What happens if my subscription lapses?",
        "expected": "Account is downgraded to free tier, data retained 90 days",
        "context": ["If your subscription expires, your account will automatically..."],
        "metadata": {"category": "billing", "difficulty": "medium"}
    }
]

for item in qa_pairs:
    dataset.insert([{
        "input": {"question": item["question"], "context": item["context"]},
        "expected_output": {"answer": item["expected"]},
        "metadata": item["metadata"]
    }])

print(f"Dataset has {len(dataset.get_all_items())} items")

Seeding Datasets from Production
The most valuable dataset items come from production. Opik lets you log production traces and later promote specific ones to a dataset:
from opik import track, opik_context


@track(name="production-rag")
def production_query(question: str, user_id: str) -> str:
    opik_context.update_current_trace(
        metadata={
            "user_id": user_id,
            "environment": "production"
        }
    )
    # answer_question is the RAG pipeline defined in the tracing section
    answer = answer_question(question)
    return answer


# Later, after reviewing traces in the UI:
# Select low-quality traces → "Add to dataset" → "rag-qa-failures-v1"
# Now you have a dataset of real failure cases to fix

Scoring Functions
Opik ships with built-in metrics and supports custom ones:
Built-in Metrics
from opik.evaluation.metrics import (
    Hallucination,
    AnswerRelevance,
    ContextRecall,
    ContextPrecision
)

hallucination_metric = Hallucination()
relevance_metric = AnswerRelevance()

# Score a single output
score = hallucination_metric.score(
    input="What is the refund policy?",
    output="Full refund within 30 days of purchase",
    context=["Our refund policy allows full refunds within 30 days..."]
)
print(f"Hallucination score: {score.value}")  # 0 = not hallucinated, 1 = hallucinated

Running Full Dataset Evaluations
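Conceptually, a dataset evaluation runs the task on every item, scores each result with every metric, and aggregates the scores per metric. The sketch below shows that loop in plain Python. The names `run_evaluation`, `exact_match`, and `echo_task` are hypothetical and chosen for illustration; this is not how Opik implements `evaluate`.

```python
from statistics import mean
from typing import Any, Callable

Item = dict[str, Any]
Task = Callable[[Item], dict[str, Any]]
Metric = Callable[[Item, dict[str, Any]], float]


def run_evaluation(
    items: list[Item],
    task: Task,
    metrics: dict[str, Metric],
) -> dict[str, float]:
    """Run `task` on every item and return the mean score per metric."""
    per_metric: dict[str, list[float]] = {name: [] for name in metrics}
    for item in items:
        result = task(item)
        for name, metric in metrics.items():
            per_metric[name].append(metric(item, result))
    # An empty dataset yields no scores, so report 0.0 rather than failing
    return {
        name: mean(values) if values else 0.0
        for name, values in per_metric.items()
    }


def exact_match(item: Item, result: dict[str, Any]) -> float:
    """Toy metric: 1.0 when the output equals the expected answer."""
    return 1.0 if result["output"] == item["expected_output"]["answer"] else 0.0


def echo_task(item: Item) -> dict[str, Any]:
    """Stand-in for a real pipeline: answers every question with 'yes'."""
    return {"output": "yes"}


items = [
    {"input": {"question": "Is water wet?"}, "expected_output": {"answer": "yes"}},
    {"input": {"question": "Is fire cold?"}, "expected_output": {"answer": "no"}},
]
print(run_evaluation(items, echo_task, {"exact_match": exact_match}))
# {'exact_match': 0.5}
```

Opik's `evaluate` follows the same basic shape, with experiment tracking and LLM-judged metrics added, as in the example that follows.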
import opik
from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination, AnswerRelevance

client = opik.Opik()
dataset = client.get_dataset("rag-qa-v1")


def rag_task(dataset_item: dict) -> dict:
    question = dataset_item["input"]["question"]
    context = dataset_item["input"]["context"]
    # Build prompt with context
    context_text = "\n".join(context)
    prompt = f"Using only this context:\n{context_text}\n\nAnswer: {question}"
    # call_llm is the traced helper defined in the tracing section
    answer = call_llm(prompt)
    return {
        "output": answer,
        "context": context  # Pass context for context-based metrics
    }


eval_results = evaluate(
    experiment_name="rag-pipeline-v2.1",
    dataset=dataset,
    task=rag_task,
    scoring_metrics=[
        Hallucination(),
        AnswerRelevance(),
    ],
    experiment_config={
        "model": "gpt-4o-mini",
        "retrieval_top_k": 3,
        "prompt_version": "v2.1"
    }
)
print(f"Hallucination rate: {eval_results.get_metric('hallucination'):.3f}")
print(f"Answer relevance: {eval_results.get_metric('answer_relevance'):.3f}")

Custom Scoring Functions
from opik.evaluation.metrics import base_metric, score_result


class ExactIntentMatch(base_metric.BaseMetric):
    """Score 1 if predicted intent exactly matches expected."""

    def __init__(self, name: str = "exact-intent-match"):
        super().__init__(name=name)

    def score(
        self, output: dict, expected_output: dict, **kwargs
    ) -> score_result.ScoreResult:
        predicted = output.get("intent", "").lower()
        expected = expected_output.get("intent", "").lower()
        match = predicted == expected
        return score_result.ScoreResult(
            name=self.name,
            value=1.0 if match else 0.0,
            reason=f"Predicted '{predicted}', expected '{expected}'"
        )


class ResponseLengthConstraint(base_metric.BaseMetric):
    """Score based on whether response length is within acceptable bounds."""

    def __init__(self, min_words: int = 20, max_words: int = 200):
        super().__init__(name="length-constraint")
        self.min_words = min_words
        self.max_words = max_words

    def score(self, output: dict, **kwargs) -> score_result.ScoreResult:
        word_count = len(output.get("response", "").split())
        if word_count < self.min_words:
            return score_result.ScoreResult(
                name=self.name,
                value=0.0,
                reason=f"Too short: {word_count} words (min: {self.min_words})"
            )
        elif word_count > self.max_words:
            return score_result.ScoreResult(
                name=self.name,
                value=0.5,
                reason=f"Too long: {word_count} words (max: {self.max_words})"
            )
        else:
            return score_result.ScoreResult(
                name=self.name,
                value=1.0,
                reason=f"Length OK: {word_count} words"
            )

LangChain and LlamaIndex Integration
Opik integrates directly with popular LLM frameworks, which is particularly useful if you're building RAG pipelines or agent systems:
# LangChain integration
from opik.integrations.langchain import OpikTracer

tracer = OpikTracer(project_name="rag-app")

# prompt_template, llm, and output_parser are your existing LangChain components
chain = (
    prompt_template
    | llm
    | output_parser
)

# All LangChain calls automatically traced
result = chain.invoke(
    {"question": "What is the return policy?"},
    config={"callbacks": [tracer]}
)

# LlamaIndex integration
from opik.integrations.llama_index import LlamaIndexCallbackHandler
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager

Settings.callback_manager = CallbackManager([
    LlamaIndexCallbackHandler()
])
# Your LlamaIndex code runs unchanged, fully traced

CI Integration
For CI, run your evaluations as part of the test pipeline and fail on score regression:
# scripts/run_opik_evals.py
import os
import sys

import opik
from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination, AnswerRelevance

# rag_task is the task function from the dataset evaluation example;
# import it from the module where you define it.

SCORE_THRESHOLDS = {
    "hallucination": 0.1,      # Allow at most 10% hallucination rate
    "answer_relevance": 0.80,  # Require 80%+ answer relevance
}


def main():
    client = opik.Opik()
    dataset = client.get_dataset("rag-qa-v1")
    results = evaluate(
        experiment_name=f"ci-{os.environ.get('GITHUB_SHA', 'local')}",
        dataset=dataset,
        task=rag_task,
        scoring_metrics=[Hallucination(), AnswerRelevance()]
    )

    failed = False
    for metric, threshold in SCORE_THRESHOLDS.items():
        score = results.get_metric(metric)
        if metric == "hallucination":
            # Lower is better for hallucination
            if score > threshold:
                print(f"FAIL: {metric} = {score:.3f} > threshold {threshold}")
                failed = True
            else:
                print(f"PASS: {metric} = {score:.3f} <= threshold {threshold}")
        else:
            # Higher is better for other metrics
            if score < threshold:
                print(f"FAIL: {metric} = {score:.3f} < threshold {threshold}")
                failed = True
            else:
                print(f"PASS: {metric} = {score:.3f} >= threshold {threshold}")

    if failed:
        sys.exit(1)
    print("All evaluation scores within thresholds")
    sys.exit(0)


if __name__ == "__main__":
    main()

# .github/workflows/llm-eval.yml
- name: Run LLM evaluations
  run: python scripts/run_opik_evals.py
  env:
    # The Opik SDK reads the server URL from OPIK_URL_OVERRIDE
    OPIK_URL_OVERRIDE: ${{ secrets.OPIK_URL }}
    OPIK_API_KEY: ${{ secrets.OPIK_API_KEY }}
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Comparing Experiments
After running multiple experiments, Opik's UI shows you side-by-side comparisons. You can filter by metadata to understand which categories of inputs drove score changes.
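To make "which categories drove the change" concrete, the following sketch computes the per-category difference in mean score between two runs. It assumes each result has already been reduced to a `(category, score)` pair for a single metric. The function names and sample numbers are made up for illustration and are not an Opik API.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-item results: (category, score) pairs for one metric
ScoredItems = list[tuple[str, float]]


def mean_by_category(results: ScoredItems) -> dict[str, float]:
    """Average a metric's scores within each metadata category."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for category, score in results:
        grouped[category].append(score)
    return {category: mean(scores) for category, scores in grouped.items()}


def category_deltas(baseline: ScoredItems, candidate: ScoredItems) -> dict[str, float]:
    """Return candidate minus baseline mean score for categories in both runs."""
    base = mean_by_category(baseline)
    cand = mean_by_category(candidate)
    return {
        category: round(cand[category] - base[category], 3)
        for category in sorted(base.keys() & cand.keys())
    }


baseline = [("refund", 1.0), ("refund", 0.5), ("billing", 0.5)]
candidate = [("refund", 1.0), ("refund", 1.0), ("billing", 0.25)]
print(category_deltas(baseline, candidate))
# {'billing': -0.25, 'refund': 0.25}
```

In this made-up example the overall picture is mixed: refund questions improved while billing questions got worse, which an aggregate score alone would hide.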
For programmatic comparison:
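import opik

client = opik.Opik()

# Get recent experiments for comparison
experiments = client.search_experiments(
    project_name="rag-app",
    limit=5
)
for exp in experiments:
    scores = exp.get_scores()
    hallucination = scores.get("hallucination")
    # Apply numeric formatting only when a score is present
    formatted = f"{hallucination:.3f}" if hallucination is not None else "N/A"
    print(f"{exp.name}: hallucination={formatted}")

Complementing Opik with End-to-End Tests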
Opik covers the LLM evaluation layer: did your model answer correctly, hallucinate, or stay on topic? It doesn't cover the application layer: did the user's question get routed correctly, did the response render properly in the UI, did the loading state work?
For application-level coverage, HelpMeTest provides Robot Framework + Playwright automation that tests your application from the outside — exactly as a user would experience it. The combination of Opik (model quality) and HelpMeTest (application behavior) gives you full-stack confidence before each deploy.
A typical quality gate:
- Opik eval passes (hallucination ≤ 10%, relevance ≥ 80%) → allow merge
- HelpMeTest E2E suite passes on staging → allow deploy
- Production traces feed back into Opik datasets for the next evaluation cycle
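The thresholds in the first gate are absolute. To also catch a score that still clears its threshold but is clearly worse than the last accepted run, the gate can compare against a stored baseline. A minimal sketch, assuming scores are plain dictionaries keyed by metric name; `find_regressions`, the `tolerance` default, and the sample scores are illustrative assumptions.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    lower_is_better: frozenset[str] = frozenset({"hallucination"}),
    tolerance: float = 0.02,
) -> list[str]:
    """List metrics whose score got worse than baseline by more than tolerance."""
    failures = []
    for metric, base_score in baseline.items():
        if metric not in current:
            failures.append(f"{metric}: missing from current run")
            continue
        change = current[metric] - base_score
        # Flip the sign so a positive `worsening` always means "got worse"
        worsening = change if metric in lower_is_better else -change
        if worsening > tolerance:
            failures.append(f"{metric}: {base_score:.3f} -> {current[metric]:.3f}")
    return failures


baseline = {"hallucination": 0.05, "answer_relevance": 0.85}
current = {"hallucination": 0.06, "answer_relevance": 0.80}
print(find_regressions(baseline, current))
# ['answer_relevance: 0.850 -> 0.800']
```

Here both scores would still pass the absolute thresholds, but the drop in answer relevance exceeds the tolerance and is reported. A CI script could exit non-zero whenever the returned list is non-empty.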
Summary
Opik is a mature, production-ready open-source LLM evaluation platform. Self-hosting removes data residency concerns and per-trace costs. The combination of automatic tracing, dataset management, built-in metrics (hallucination, relevance, context recall), and a clean Python SDK makes it practical to add systematic LLM evaluation to teams that were previously doing it ad hoc or not at all.
The CI integration pattern — run evals, check thresholds, fail on regression — is the key discipline. Once that gate is in place, LLM quality becomes a first-class engineering concern rather than something you check manually before big releases.