Opik LLM Evaluation: Open-Source Testing with Comet Opik

Open-source LLM evaluation tools matter for teams that can't send production data to third-party cloud services, that need to self-host their observability stack, or that simply want full control over their evaluation pipeline. Opik, from Comet, is one of the most capable open-source options in this space. It provides tracing, dataset management, and LLM evaluation — all of which you can run on your own infrastructure.

This guide is a practical walkthrough: how to instrument your LLM application with Opik, how to build evaluation datasets, how to score your outputs, and how to integrate the whole thing into a CI workflow.

Why Open-Source Evaluation Matters

The case for open-source LLM observability isn't purely ideological. There are concrete operational reasons:

Data residency — in regulated industries (healthcare, finance, legal), sending user queries to a third-party logging service may violate compliance requirements. Self-hosting eliminates this concern.

Cost at scale — cloud observability platforms charge per request or per trace. At high volume, this adds up. Running Opik yourself means you pay infrastructure costs, not per-trace fees.

Customization — open-source means you can modify the scoring logic, the storage backend, the UI. Cloud platforms give you what they give you.

Auditability — you can inspect exactly how traces are collected, stored, and scored. No black boxes.

Setting Up Opik

Cloud (Easiest Start)

pip install opik
opik configure  # Set API key and workspace

Self-Hosted with Docker

# docker-compose.yml
version: '3'
services:
  opik:
    image: ghcr.io/comet-ml/opik:latest
    ports:
      - "5173:5173"  # UI
      - "8080:8080"  # API
    volumes:
      - opik-data:/data
    environment:
      - OPIK_STORAGE_TYPE=local

volumes:
  opik-data:

docker compose up -d
opik configure --url http://localhost:8080 --api-key local
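
Before pointing the SDK at a self-hosted instance, it can help to confirm that the API port from the compose file is accepting connections. The following is a generic TCP readiness check using only the standard library. `wait_for_port` is a hypothetical helper, not part of Opik, and an open port does not by itself prove the service is healthy.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling `wait_for_port("localhost", 8080)` after `docker compose up -d` gives a script a simple point to block on before it runs `opik configure`.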

Configure the client:

import opik

opik.configure(
    url="http://localhost:8080",  # or cloud URL
    api_key="your-api-key",
    workspace="your-workspace"
)
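
Hard-coding the API key is fine for a first experiment, but in shared code the values usually come from the environment. The sketch below collects keyword arguments for `opik.configure()` from environment variables. `opik_settings_from_env` is a hypothetical helper; `OPIK_URL` and `OPIK_API_KEY` follow the names used in this guide's CI workflow, and `OPIK_WORKSPACE` is an assumed name.

```python
import os
from typing import Mapping, Optional


def opik_settings_from_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect keyword arguments for opik.configure() from environment variables."""
    env = os.environ if env is None else env
    mapping = {
        "url": "OPIK_URL",              # name used in this guide's CI workflow
        "api_key": "OPIK_API_KEY",      # name used in this guide's CI workflow
        "workspace": "OPIK_WORKSPACE",  # assumed name, adjust to your setup
    }
    # Include only the variables that are set and non-empty
    settings = {key: env[var] for key, var in mapping.items() if env.get(var)}
    if "url" not in settings:
        settings["url"] = "http://localhost:8080"  # self-hosted address from this guide
    return settings
```

With this in place the configure call becomes `opik.configure(**opik_settings_from_env())`, and no secret needs to live in the source tree.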

Tracing LLM Applications

The @opik.track decorator instruments any function with tracing. Nested decorators create parent-child spans:

import opik
from opik import track
import openai

client = openai.OpenAI()

@track(name="llm-call")
def call_llm(prompt: str, model: str = "gpt-4o-mini") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

@track(name="retrieve-context")
def retrieve_context(query: str) -> list[str]:
    # Simulate vector DB retrieval
    return [
        "Relevant document chunk 1...",
        "Relevant document chunk 2..."
    ]

@track(name="rag-pipeline")
def answer_question(question: str) -> str:
    # This creates a parent trace with two child spans
    context_chunks = retrieve_context(question)
    context = "\n".join(context_chunks)
    
    prompt = f"""Answer the following question using only the provided context.
    
Context:
{context}

Question: {question}"""
    
    return call_llm(prompt)

For automatic OpenAI tracing, Opik provides an integration that wraps the client:

from opik.integrations.openai import track_openai

client = track_openai(openai.OpenAI())

# All calls through this client are automatically traced
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}]
)

Building Evaluation Datasets

Opik datasets work the same way as datasets on other evaluation platforms: each one is a collection of input/expected-output pairs that you run your pipeline against.

import opik

client = opik.Opik()

# Create a dataset
dataset = client.get_or_create_dataset("rag-qa-v1")

# Add items
qa_pairs = [
    {
        "question": "What is the refund policy?",
        "expected": "Full refund within 30 days of purchase",
        "context": ["Our refund policy allows full refunds within 30 days..."],
        "metadata": {"category": "refund", "difficulty": "easy"}
    },
    {
        "question": "How do I reset my password?",
        "expected": "Click 'Forgot Password' on the login page",
        "context": ["To reset your password, navigate to the login page..."],
        "metadata": {"category": "account", "difficulty": "easy"}
    },
    {
        "question": "What happens if my subscription lapses?",
        "expected": "Account is downgraded to free tier, data retained 90 days",
        "context": ["If your subscription expires, your account will automatically..."],
        "metadata": {"category": "billing", "difficulty": "medium"}
    }
]

# insert() takes a list, so build every item first and send them as one batch
dataset.insert([
    {
        "input": {"question": item["question"], "context": item["context"]},
        "expected_output": {"answer": item["expected"]},
        "metadata": item["metadata"]
    }
    for item in qa_pairs
])

print(f"Dataset has {len(dataset.get_all_items())} items")
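
As a dataset grows, malformed or repeated entries become easy to miss. The sketch below validates each QA pair and drops repeated questions before anything is sent to `dataset.insert`. `to_dataset_item` and `build_items` are hypothetical helpers, and the code makes no assumption about whether Opik deduplicates on its side.

```python
REQUIRED_KEYS = ("question", "expected", "context")


def to_dataset_item(pair: dict) -> dict:
    """Convert one QA pair into the input/expected_output/metadata layout used above."""
    missing = [key for key in REQUIRED_KEYS if not pair.get(key)]
    if missing:
        raise ValueError(f"QA pair is missing required fields: {missing}")
    return {
        "input": {"question": pair["question"], "context": list(pair["context"])},
        "expected_output": {"answer": pair["expected"]},
        "metadata": dict(pair.get("metadata", {})),
    }


def build_items(pairs: list) -> list:
    """Validate all pairs and drop repeated questions, keeping the first occurrence."""
    seen = set()
    items = []
    for pair in pairs:
        item = to_dataset_item(pair)
        # Compare questions case-insensitively and ignore surrounding whitespace
        key = item["input"]["question"].strip().lower()
        if key not in seen:
            seen.add(key)
            items.append(item)
    return items
```

The result can be passed directly as `dataset.insert(build_items(qa_pairs))`.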

Seeding Datasets from Production

The most valuable dataset items come from production. Opik lets you log production traces and later promote specific ones to a dataset:

from opik import track, opik_context

@track(name="production-rag")
def production_query(question: str, user_id: str) -> str:
    opik_context.update_current_trace(
        metadata={
            "user_id": user_id,
            "environment": "production"
        }
    )
    
    answer = answer_question(question)
    return answer

# Later, after reviewing traces in the UI:
# Select low-quality traces → "Add to dataset" → "rag-qa-failures-v1"
# Now you have a dataset of real failure cases to fix
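
If you export reviewed traces instead of promoting them by hand, the selection step is a small filter. The sketch below assumes each exported trace is a plain dict with `id`, `question`, and a `scores` mapping. That shape is hypothetical, not an Opik export format, and both function names are invented for illustration.

```python
def select_failures(traces: list, metric: str, threshold: float, lower_is_better: bool = True) -> list:
    """Return traces whose score on `metric` falls on the failing side of `threshold`.

    Traces without a score for the metric are skipped rather than guessed at.
    """
    failures = []
    for trace in traces:
        score = trace.get("scores", {}).get(metric)
        if score is None:
            continue
        failing = score > threshold if lower_is_better else score < threshold
        if failing:
            failures.append(trace)
    return failures


def to_failure_item(trace: dict) -> dict:
    """Shape a failing trace as a dataset item; the expected answer is left for a reviewer."""
    return {
        "input": {"question": trace["question"]},
        "metadata": {"source_trace_id": trace["id"], "needs_expected_output": True},
    }
```

Leaving the expected output empty is deliberate: a failing production answer is not a trustworthy reference, so a person should write the correct one before the item is used for scoring.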

Scoring Functions

Opik ships with built-in metrics and supports custom ones:

Built-in Metrics

from opik.evaluation.metrics import (
    Hallucination,
    AnswerRelevance,
    ContextRecall,
    ContextPrecision
)

hallucination_metric = Hallucination()
relevance_metric = AnswerRelevance()

# Score a single output
score = hallucination_metric.score(
    input="What is the refund policy?",
    output="Full refund within 30 days of purchase",
    context=["Our refund policy allows full refunds within 30 days..."]
)
print(f"Hallucination score: {score.value}")  # 0 = not hallucinated, 1 = hallucinated

Running Full Dataset Evaluations

import opik
from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination, AnswerRelevance

client = opik.Opik()
dataset = client.get_dataset("rag-qa-v1")

def rag_task(dataset_item: dict) -> dict:
    question = dataset_item["input"]["question"]
    context = dataset_item["input"]["context"]
    
    # Build prompt with context
    prompt = f"Using only this context:\n{chr(10).join(context)}\n\nAnswer: {question}"
    answer = call_llm(prompt)
    
    return {
        "output": answer,
        "context": context  # Pass context for context-based metrics
    }

eval_results = evaluate(
    experiment_name="rag-pipeline-v2.1",
    dataset=dataset,
    task=rag_task,
    scoring_metrics=[
        Hallucination(),
        AnswerRelevance(),
    ],
    experiment_config={
        "model": "gpt-4o-mini",
        "retrieval_top_k": 3,
        "prompt_version": "v2.1"
    }
)

print(f"Hallucination rate: {eval_results.get_metric('hallucination'):.3f}")
print(f"Answer relevance: {eval_results.get_metric('answer_relevance'):.3f}")
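
The two printed numbers are averages over the whole dataset. If the result object in your installed SDK version exposes per-item scores rather than a summary (worth checking, since result APIs change between releases), the aggregation is short to write by hand. `aggregate_scores` below is a hypothetical helper that works on plain dicts.

```python
from collections import defaultdict
from statistics import mean


def aggregate_scores(item_scores: list) -> dict:
    """Average per-item scores by metric name.

    item_scores holds one dict per dataset item, for example
    {"hallucination": 0.0, "answer_relevance": 0.9}. A metric that is
    missing from an item is ignored for that item.
    """
    by_metric = defaultdict(list)
    for scores in item_scores:
        for name, value in scores.items():
            by_metric[name].append(value)
    return {name: mean(values) for name, values in by_metric.items()}
```

With only three dataset items, as in the example dataset above, a single bad answer moves the average by a third, so treat small-dataset averages as a smoke test rather than a measurement.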

Custom Scoring Functions

from opik.evaluation.metrics import base_metric, score_result

class ExactIntentMatch(base_metric.BaseMetric):
    """Score 1 if predicted intent exactly matches expected."""
    
    def __init__(self, name: str = "exact-intent-match"):
        super().__init__(name=name)
    
    def score(self, output: dict, expected_output: dict, **kwargs) -> score_result.ScoreResult:
        predicted = output.get("intent", "").lower()
        expected = expected_output.get("intent", "").lower()
        
        match = predicted == expected
        
        return score_result.ScoreResult(
            name=self.name,
            value=1.0 if match else 0.0,
            reason=f"Predicted '{predicted}', expected '{expected}'"
        )

class ResponseLengthConstraint(base_metric.BaseMetric):
    """Score based on whether response length is within acceptable bounds."""
    
    def __init__(self, min_words: int = 20, max_words: int = 200):
        super().__init__(name="length-constraint")
        self.min_words = min_words
        self.max_words = max_words
    
    def score(self, output: dict, **kwargs) -> score_result.ScoreResult:
        word_count = len(output.get("response", "").split())
        
        if word_count < self.min_words:
            return score_result.ScoreResult(
                name=self.name,
                value=0.0,
                reason=f"Too short: {word_count} words (min: {self.min_words})"
            )
        elif word_count > self.max_words:
            return score_result.ScoreResult(
                name=self.name,
                value=0.5,
                reason=f"Too long: {word_count} words (max: {self.max_words})"
            )
        else:
            return score_result.ScoreResult(
                name=self.name,
                value=1.0,
                reason=f"Length OK: {word_count} words"
            )

LangChain and LlamaIndex Integration

Opik integrates directly with popular LLM frameworks, which is particularly useful if you're building RAG pipelines or agent systems:

# LangChain integration
from opik.integrations.langchain import OpikTracer

tracer = OpikTracer(project_name="rag-app")

chain = (
    prompt_template 
    | llm 
    | output_parser
)

# All LangChain calls automatically traced
result = chain.invoke(
    {"question": "What is the return policy?"},
    config={"callbacks": [tracer]}
)

# LlamaIndex integration
from opik.integrations.llama_index import LlamaIndexCallbackHandler
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager

Settings.callback_manager = CallbackManager([
    LlamaIndexCallbackHandler()
])

# Your LlamaIndex code runs unchanged, fully traced

CI Integration

For CI, run your evaluations as part of the test pipeline and fail on score regression:

# scripts/run_opik_evals.py
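import os
import sys
import opik
from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination, AnswerRelevance

# rag_task is the task function from "Running Full Dataset Evaluations";
# import it here from wherever your application defines it.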

SCORE_THRESHOLDS = {
    "hallucination": 0.1,       # Allow at most 10% hallucination rate
    "answer_relevance": 0.80,   # Require 80%+ answer relevance
}

def main():
    client = opik.Opik()
    dataset = client.get_dataset("rag-qa-v1")
    
    results = evaluate(
        experiment_name=f"ci-{os.environ.get('GITHUB_SHA', 'local')}",
        dataset=dataset,
        task=rag_task,
        scoring_metrics=[Hallucination(), AnswerRelevance()]
    )
    
    failed = False
    for metric, threshold in SCORE_THRESHOLDS.items():
        score = results.get_metric(metric)
        
        if metric == "hallucination":
            # Lower is better for hallucination
            if score > threshold:
                print(f"FAIL: {metric} = {score:.3f} > threshold {threshold}")
                failed = True
            else:
                print(f"PASS: {metric} = {score:.3f} <= threshold {threshold}")
        else:
            # Higher is better for other metrics
            if score < threshold:
                print(f"FAIL: {metric} = {score:.3f} < threshold {threshold}")
                failed = True
            else:
                print(f"PASS: {metric} = {score:.3f} >= threshold {threshold}")
    
    if failed:
        sys.exit(1)
    
    print("All evaluation scores within thresholds")
    sys.exit(0)

if __name__ == "__main__":
    main()

# .github/workflows/llm-eval.yml
- name: Run LLM evaluations
  run: python scripts/run_opik_evals.py
  env:
    OPIK_URL: ${{ secrets.OPIK_URL }}
    OPIK_API_KEY: ${{ secrets.OPIK_API_KEY }}
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
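
The threshold logic in the script is easier to test when it is separated from the evaluation run. The sketch below pulls it into a pure function that takes plain dicts. `check_thresholds` is a hypothetical helper, and treating a missing score as a failure is a design choice made here, not something the script above does.

```python
def check_thresholds(scores: dict, thresholds: dict, lower_is_better=frozenset({"hallucination"})) -> list:
    """Return one failure message per metric that violates its threshold."""
    failures = []
    for metric, threshold in thresholds.items():
        score = scores.get(metric)
        if score is None:
            # A missing score fails the gate instead of passing silently
            failures.append(f"{metric}: no score reported")
        elif metric in lower_is_better and score > threshold:
            failures.append(f"{metric} = {score:.3f} > threshold {threshold}")
        elif metric not in lower_is_better and score < threshold:
            failures.append(f"{metric} = {score:.3f} < threshold {threshold}")
    return failures
```

In the CI script, `main()` would collect the scores, call this function, print each message, and exit with status 1 when the returned list is non-empty.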

Comparing Experiments

After running multiple experiments, Opik's UI shows you side-by-side comparisons. You can filter by metadata to understand which categories of inputs drove score changes.

For programmatic comparison:

client = opik.Opik()

# Get recent experiments for comparison
experiments = client.search_experiments(
    project_name="rag-app",
    limit=5
)

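for exp in experiments:
    scores = exp.get_scores()
    hallucination = scores.get("hallucination")
    # Format only numeric scores; the "N/A" fallback cannot take the .3f format spec
    shown = f"{hallucination:.3f}" if hallucination is not None else "N/A"
    print(f"{exp.name}: hallucination={shown}")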
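
Once two experiments' scores are in hand as plain dicts, deciding whether a change helped is a matter of comparing deltas with the right sign per metric. The sketch below is one way to do that. `compare_runs` and its `tolerance` parameter are hypothetical, and the default tolerance of 0.01 is an arbitrary placeholder.

```python
def compare_runs(baseline: dict, candidate: dict, lower_is_better=("hallucination",), tolerance: float = 0.01) -> dict:
    """Classify each metric present in both runs as 'improved', 'regressed', or 'unchanged'."""
    verdicts = {}
    for metric in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[metric] - baseline[metric]
        if metric in lower_is_better:
            delta = -delta  # flip the sign so that positive always means better
        if delta > tolerance:
            verdicts[metric] = "improved"
        elif delta < -tolerance:
            verdicts[metric] = "regressed"
        else:
            verdicts[metric] = "unchanged"
    return verdicts
```

The tolerance keeps run-to-run noise from being reported as a regression; LLM-judged metrics are not perfectly repeatable, so a band of some width is usually needed.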

Complementing Opik with End-to-End Tests

Opik covers the LLM evaluation layer: did your model answer correctly, hallucinate, or stay on topic? It doesn't cover the application layer: did the user's question get routed correctly, did the response render properly in the UI, did the loading state work?

For application-level coverage, HelpMeTest provides Robot Framework + Playwright automation that tests your application from the outside — exactly as a user would experience it. The combination of Opik (model quality) and HelpMeTest (application behavior) gives you full-stack confidence before each deploy.

A typical quality gate:

  1. Opik eval passes (hallucination ≤ 10%, relevance ≥ 80%) → allow merge
  2. HelpMeTest E2E suite passes on staging → allow deploy
  3. Production traces feed back into Opik datasets for the next evaluation cycle

Summary

Opik is a mature, production-ready open-source LLM evaluation platform. Self-hosting removes data residency concerns and per-trace costs. The combination of automatic tracing, dataset management, built-in metrics (hallucination, relevance, context recall), and a clean Python SDK makes it practical to add systematic LLM evaluation to teams that were previously doing it ad hoc or not at all.

The CI integration pattern — run evals, check thresholds, fail on regression — is the key discipline. Once that gate is in place, LLM quality becomes a first-class engineering concern rather than something you check manually before big releases.
