Testing LLMs with Langfuse: Tracing, Evals, and Datasets

LLM applications introduce a new category of failure that traditional testing tools were never built to catch. A response that was accurate yesterday might drift subtly today — same prompt, different model behavior. Tracing, evaluation datasets, and online scoring are how production teams stay ahead of that drift. Langfuse is one of the most widely adopted open-source platforms for doing all three. This guide walks through how to use it effectively for testing LLM-powered applications.

What Langfuse Actually Does

Langfuse is an LLM engineering platform that gives you three core capabilities:

  1. Tracing — captures every prompt, completion, and intermediate step your application makes, with latency, token counts, and cost attached.
  2. Dataset evals — lets you build curated input/output pairs and run your LLM pipeline against them to catch regressions.
  3. Online evaluations — scores real production traces automatically using model-based or rule-based scorers.

It integrates with OpenAI, Anthropic, LangChain, LlamaIndex, and most other popular LLM frameworks. You can self-host it (it's open source) or use the cloud version at cloud.langfuse.com.

Setting Up Tracing

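Install the SDK. The examples in this guide follow the v2 Python SDK (langfuse.decorators, item.observe, lf.score); later major versions rename some of these calls, so check the documentation for the version you install: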

pip install langfuse openai

The fastest way to get traces is the drop-in replacement for the OpenAI client, which only requires changing the import:

from langfuse.openai import openai

# Your existing OpenAI calls automatically get traced
response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize the following: ..."}]
)

For more granular control — especially in multi-step agent applications — use the observe decorator:

from langfuse.decorators import observe, langfuse_context

@observe()
def classify_document(text: str) -> str:
    langfuse_context.update_current_observation(
        input=text,
        metadata={"pipeline": "document-classifier", "version": "2.1"}
    )
    # call_classifier_model is a placeholder for your own model call
    result = call_classifier_model(text)
    langfuse_context.update_current_observation(output=result)
    return result

@observe()
def process_batch(documents: list[str]) -> list[str]:
    return [classify_document(doc) for doc in documents]

Every call to process_batch creates a parent trace, and each classify_document call inside it becomes a child span. This nesting is what makes multi-step pipelines debuggable: you can see exactly which step in a 10-step chain produced the wrong answer.

Building Evaluation Datasets

Datasets are the backbone of regression testing for LLMs. The concept is simple: a dataset is a collection of (input, expected_output) pairs. You run your current pipeline against them and score the results.

Create a dataset via the SDK:

from langfuse import Langfuse

lf = Langfuse()

dataset = lf.create_dataset(
    name="document-classifier-v1",
    description="Classification test cases from production edge cases"
)

# Add items
test_cases = [
    {"input": "Invoice #4521 for $1,200 due June 30", "expected": "invoice"},
    {"input": "Please review the attached contract", "expected": "legal"},
    {"input": "Q3 revenue was up 12% YoY", "expected": "financial-report"},
]

for case in test_cases:
    lf.create_dataset_item(
        dataset_name="document-classifier-v1",
        input={"text": case["input"]},
        expected_output={"label": case["expected"]}
    )

Then run your pipeline against the dataset:

def run_eval(pipeline_version: str):
    dataset = lf.get_dataset("document-classifier-v1")
    
    for item in dataset.items:
        with item.observe(run_name=pipeline_version) as trace_id:
            predicted = classify_document(item.input["text"])
            
            # Score the result
            lf.score(
                trace_id=trace_id,
                name="exact-match",
                value=1.0 if predicted == item.expected_output["label"] else 0.0
            )

run_eval("classifier-v2.1")
lf.flush()  # send any queued events before a short-lived script exits

After running this, Langfuse shows a run-by-run comparison. If, say, v2.0 scored 87% accuracy and v2.1 scores 91%, you have a quantitative signal that makes "is this change an improvement?" answerable without manually reviewing hundreds of outputs.
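Strict string equality can undercount correct answers when the model returns "Invoice" or " legal" instead of the expected label. The following is a minimal sketch of a more forgiving scorer and a per-run accuracy figure. The helper names (normalize_label, exact_match, run_accuracy) are illustrative assumptions, not part of the Langfuse SDK, and whether case and whitespace should be ignored depends on your labels.

```python
from statistics import mean


def normalize_label(label: str) -> str:
    """Lowercase and trim a label so 'Invoice ' and 'invoice' compare equal."""
    return label.strip().lower()


def exact_match(predicted: str, expected: str) -> float:
    """Return 1.0 when the normalized labels match, otherwise 0.0."""
    return 1.0 if normalize_label(predicted) == normalize_label(expected) else 0.0


def run_accuracy(pairs: list[tuple[str, str]]) -> float:
    """Mean exact-match score over (predicted, expected) pairs."""
    return mean(exact_match(predicted, expected) for predicted, expected in pairs)


# Hypothetical predictions paired with the expected labels from the dataset.
pairs = [
    ("invoice", "invoice"),
    (" Legal", "legal"),
    ("invoice", "financial-report"),
]
print(f"accuracy: {run_accuracy(pairs):.2f}")  # prints "accuracy: 0.67"
```

The value returned by exact_match could be passed as the score value in the eval loop in place of the inline comparison.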

Online Evaluations in Production

Dataset evals catch regressions before deploy. Online evals catch problems after deploy — things like toxicity, hallucination, or off-topic responses that only appear on real user inputs.

Set up an LLM-as-judge scorer in the Langfuse UI, or define one via the SDK:

# Example: scoring factual accuracy using GPT-4 as judge
def score_factual_accuracy(trace_id: str, output: str, context: str):
    judge_prompt = f"""
    Given the context: {context}
    
    Rate the factual accuracy of this output on a scale of 0-1:
    {output}
    
    Return only a number between 0 and 1.
    """
    
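    # call_judge_model is a placeholder for your own judge call; clamp the
    # parsed value so a stray reply such as "1.2" cannot leave the 0-1 range
    score = min(1.0, max(0.0, float(call_judge_model(judge_prompt))))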
    
    lf.score(
        trace_id=trace_id,
        name="factual-accuracy",
        value=score,
        comment=f"Judge scored at {score}"
    )

You can wire this up to run automatically on a sample of production traces — say, 10% of all responses — to get a rolling accuracy signal without incurring the cost of judging every single call.
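One way to pick that sample, sketched under the assumption that each trace has a stable string ID, is to hash the ID and judge only traces whose hash falls below the sample rate. The function name should_judge is hypothetical. Unlike a call to a random number generator, this gives the same decision for the same trace on every worker and on every retry.

```python
import hashlib


def should_judge(trace_id: str, sample_rate: float = 0.10) -> bool:
    """Deterministically select roughly `sample_rate` of traces by hashing the ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Illustrative trace IDs; in practice these would come from your traces.
selected = sum(should_judge(f"trace-{i}") for i in range(10_000))
print(f"judged {selected} of 10000 traces")
```

A judge function such as score_factual_accuracy would then be called only when should_judge returns True.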

Integrating Langfuse with CI Pipelines

The most impactful thing you can do with Langfuse is gate your deployments on eval scores. Here's a GitHub Actions step that fails the build if accuracy drops below 85%:

- name: Run LLM regression evals
  run: |
    python scripts/run_langfuse_evals.py \
      --dataset document-classifier-v1 \
      --run-name ${{ github.sha }} \
      --min-score 0.85
  env:
    LANGFUSE_PUBLIC_KEY: ${{ secrets.LANGFUSE_PUBLIC_KEY }}
    LANGFUSE_SECRET_KEY: ${{ secrets.LANGFUSE_SECRET_KEY }}

The eval script queries the run's scores after completion and exits with code 1 if the threshold isn't met. This is the LLM equivalent of a test suite failure.
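The workflow names scripts/run_langfuse_evals.py but its contents are not shown, so the following is only one possible shape for the gating logic. It assumes the per-item scores for the run have already been collected; fetch_scores is a hypothetical callback standing in for whatever code runs the eval and gathers those scores, and no Langfuse calls are made here.

```python
import argparse
from statistics import mean
from typing import Callable, Sequence


def gate(scores: Sequence[float], min_score: float) -> int:
    """Return a process exit code: 0 if the mean score meets the threshold, else 1."""
    if not scores:
        # No scores usually means the eval never ran; treat that as a failure.
        return 1
    return 0 if mean(scores) >= min_score else 1


def parse_args(argv: Sequence[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Gate a build on eval scores")
    parser.add_argument("--dataset", required=True)
    parser.add_argument("--run-name", required=True)
    parser.add_argument("--min-score", type=float, required=True)
    return parser.parse_args(argv)


def main(argv: Sequence[str], fetch_scores: Callable[[str, str], Sequence[float]]) -> int:
    args = parse_args(argv)
    scores = fetch_scores(args.dataset, args.run_name)
    return gate(scores, args.min_score)


def demo_scores(dataset: str, run_name: str) -> list[float]:
    """Hypothetical stand-in that returns fixed scores instead of querying anything."""
    return [1.0, 1.0, 0.0, 1.0]


# In a real script this would be: sys.exit(main(sys.argv[1:], your_fetch_function))
exit_code = main(
    ["--dataset", "document-classifier-v1", "--run-name", "abc123", "--min-score", "0.85"],
    demo_scores,
)
print(f"exit code: {exit_code}")  # prints "exit code: 1" because the mean is 0.75
```

Treating an empty score list as a failure is a deliberate choice in this sketch: a run that silently produced nothing should not pass the gate.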

Combining Langfuse with End-to-End Testing

Tracing and evaluation cover the LLM layer. They don't cover what happens when a user interacts with your application's UI. For that you need end-to-end tests.

This is where a platform like HelpMeTest complements Langfuse. HelpMeTest runs Robot Framework + Playwright tests against your live application — checking that the UI surfaces correct LLM outputs, that error states are handled gracefully, and that response times stay within acceptable bounds. You can run these tests on a schedule or as part of the same CI pipeline that triggers your Langfuse eval run.

A typical setup looks like:

  1. PR opens → Langfuse dataset eval runs (LLM accuracy check)
  2. Eval passes → deploy to staging
  3. HelpMeTest E2E suite runs against staging (UI + behavior check)
  4. All green → merge and deploy to production
  5. Langfuse online evals run on sampled production traffic (drift monitoring)

This layered approach means you're catching LLM regressions at the model level, and UI regressions at the application level, before users ever see them.

Prompt Management and Versioning

Langfuse has a built-in prompt registry that versions your prompts and links each version to the traces it produced. This is more useful than it sounds.

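from langfuse import Langfuse
from langfuse.openai import openai  # traced drop-in client

lf = Langfuse()

# Fetch a versioned prompt
prompt = lf.get_prompt("document-classifier", version=3)

# Use it (assumes a chat-type prompt, so compile() returns a list of messages)
messages = prompt.compile(document_type="invoice")
response = openai.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    langfuse_prompt=prompt  # links the generation to this prompt version
)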

When you later look at a trace and ask "wait, why did this response come out wrong?", you can click through to see exactly which prompt version was active at that moment. No more debugging production issues with no idea what prompt was running.

Key Metrics to Track

Once you have tracing in place, these are the metrics worth monitoring:

  • Latency by step — which part of your pipeline is slow? Often it's not the LLM call itself.
  • Token costs — which user segments or query types cost the most?
  • Score distributions — is your accuracy score normally distributed, or are there clusters of failures?
  • Error rates — how often does your LLM return malformed JSON, refuse to answer, or time out?
  • Drift over time — is accuracy declining week-over-week even though you haven't changed anything?

Langfuse's dashboard surfaces all of these out of the box. The custom dashboards feature lets you build views tailored to your specific pipeline — useful when you have multiple LLM steps and need to track each independently.
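To make the drift check concrete, here is a small sketch that groups scores by ISO week and flags weeks whose mean fell noticeably against the previous week. It assumes you have exported (date, score) pairs by some means; the function names and the 0.02 tolerance are illustrative choices, not Langfuse features.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def weekly_means(scored: list[tuple[date, float]]) -> dict[tuple[int, int], float]:
    """Group (day, score) pairs by ISO (year, week) and average each group."""
    buckets: dict[tuple[int, int], list[float]] = defaultdict(list)
    for day, score in scored:
        iso = day.isocalendar()
        buckets[(iso[0], iso[1])].append(score)
    # Sort so that weeks come out in chronological order.
    return {week: mean(values) for week, values in sorted(buckets.items())}


def declining_weeks(
    means: dict[tuple[int, int], float], tolerance: float = 0.02
) -> list[tuple[int, int]]:
    """Return weeks whose mean dropped by more than `tolerance` vs. the previous week."""
    weeks = list(means)
    return [cur for prev, cur in zip(weeks, weeks[1:]) if means[prev] - means[cur] > tolerance]


# Made-up scores for three consecutive weeks.
scored = [
    (date(2024, 6, 3), 0.92), (date(2024, 6, 5), 0.90),
    (date(2024, 6, 10), 0.91), (date(2024, 6, 12), 0.89),
    (date(2024, 6, 17), 0.84), (date(2024, 6, 19), 0.82),
]
means = weekly_means(scored)
flagged = declining_weeks(means)
print(flagged)
```

With this sample data only the third week is flagged, since the first drop (0.91 to 0.90) stays inside the tolerance.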

Practical Tips for Teams Getting Started

Start with tracing, not evals. Before you can build good datasets, you need to understand what your pipeline is actually doing. Spend two weeks in production with tracing on, look at the traces, and identify the failure patterns. Those patterns become your dataset.

Curate datasets from production failures. Every time a user reports a bad response or your online eval flags something as low quality, add it to your dataset. Datasets built from real failures catch real regressions.

Be skeptical of LLM-as-judge without calibration. If you're using GPT-4 to score GPT-4 outputs, you need to validate that the judge scores correlate with human judgment. Run a calibration set where humans label 100-200 outputs and check that the judge agrees 80%+ of the time.
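The agreement check can be as simple as the sketch below. It assumes human labels are booleans (acceptable or not) and that a judge score at or above 0.5 counts as acceptable; both the function name and the threshold are assumptions to adapt to your own rubric.

```python
def agreement_rate(
    judge_scores: list[float], human_labels: list[bool], threshold: float = 0.5
) -> float:
    """Fraction of items where the thresholded judge score matches the human label."""
    if len(judge_scores) != len(human_labels):
        raise ValueError("judge_scores and human_labels must have the same length")
    if not judge_scores:
        raise ValueError("calibration set is empty")
    matches = sum(
        (score >= threshold) == label for score, label in zip(judge_scores, human_labels)
    )
    return matches / len(judge_scores)


# Tiny made-up calibration set; a real one would have 100-200 items.
judge_scores = [0.9, 0.8, 0.3, 0.6, 0.1]
human_labels = [True, True, False, False, False]
rate = agreement_rate(judge_scores, human_labels)
print(f"agreement: {rate:.0%}")  # prints "agreement: 80%"
```

Raw agreement is easy to read but can look inflated when one label dominates, so it is worth also checking the disagreements by hand.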

Version your experiments. Every time you change a prompt, a model, or a retrieval strategy, run it as a named experiment in Langfuse. This gives you a historical record of what you tried and what it did to your scores.

When Langfuse Isn't Enough

Langfuse is excellent at LLM-level observability. It doesn't cover:

  • End-to-end user flows through your application
  • Visual regressions in how your UI renders LLM output
  • Non-LLM parts of your backend that might affect the user experience
  • Load testing and performance under concurrent users

For those, you need dedicated tooling. Langfuse fits cleanly into a broader testing stack — it handles the AI layer, while tools like HelpMeTest handle the application layer.

Summary

Langfuse gives LLM teams three things that matter: visibility into what's happening inside your pipeline (tracing), a systematic way to catch regressions (dataset evals), and automated quality monitoring in production (online evals). The setup is straightforward, the self-hosting option keeps your data in-house, and the CI integration makes it possible to block deploys on score thresholds.

The teams seeing the most value from it aren't just using it reactively to debug problems. They're using it proactively — building datasets before deploying new models, running evals before merging prompt changes, and watching online scores as a leading indicator of user experience degradation. That's the shift from "we'll fix it when users complain" to "we caught it before it shipped."
