Braintrust AI Evaluation: Datasets, Scoring, and CI Integration

The hardest thing about deploying LLM-powered features isn't building them — it's knowing whether they're getting better or worse. Prompt tweaks, model upgrades, retrieval changes: each one can improve performance on some inputs while degrading it on others. Braintrust gives you the infrastructure to measure this systematically, with datasets, scoring functions, and experiment tracking that integrates directly into your development workflow.

This guide is a practical walkthrough of the platform: how to build datasets that actually catch regressions, how to write scoring functions that measure what matters, and how to wire it all into a CI pipeline that gates deploys on quality thresholds.

The Evaluation Gap

Most teams start with a small set of manual test prompts they run before shipping changes. The problems with this approach emerge quickly:

  • Coverage is arbitrary — you test the examples you thought of, not the ones that break
  • No quantification — "it seems better" is not a signal you can track over time
  • No history — you can't compare the current version to what you had three months ago

Braintrust is designed to solve all three. It's an evaluation platform that stores datasets, runs experiments, computes scores, and displays results in a UI built for comparing LLM versions.

Setup

Install the SDK:

npm install braintrust autoevals
# or
pip install braintrust autoevals

Set your API key:

export BRAINTRUST_API_KEY=your-api-key

Run a minimal eval to verify everything works:

import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("My first eval", {
  data: () => [
    { input: "hello", expected: "Hello, world!" }
  ],
  task: async (input) => `${input}, world!`,
  scores: [Levenshtein]
});

Run it:

npx braintrust eval my-eval.ts

You'll get a link to the Braintrust UI showing the score, the diff from previous runs, and a row-by-row breakdown. That's the core loop — every subsequent eval is a new experiment you can compare to the baseline.

Building Datasets That Catch Real Regressions

A dataset is only as good as the examples in it. The best datasets are built from three sources: manually crafted edge cases, production failures you've observed, and synthetic variations generated to stress-test specific behaviors.

import { initDataset } from "braintrust";

const dataset = await initDataset("my-project", {
  dataset: "customer-support-v1"
});

// Add manually crafted cases
await dataset.insert({
  input: { message: "I need to cancel my subscription" },
  expected: { intent: "cancellation", requires_human: false },
  metadata: { source: "manual", category: "cancellation" }
});

// Add production failures (most valuable!)
await dataset.insert({
  input: { message: "unsubscribe me please" },
  expected: { intent: "cancellation", requires_human: false },
  metadata: { source: "production-failure", ticket: "TKT-4521" }
});

// Add synthetic variations
const cancellationVariants = [
  "I want to stop my subscription",
  "Please cancel my account",
  "How do I unsubscribe?",
  "I'd like to discontinue the service",
];

for (const variant of cancellationVariants) {
  await dataset.insert({
    input: { message: variant },
    expected: { intent: "cancellation", requires_human: false },
    metadata: { source: "synthetic", template: "cancellation" }
  });
}

The metadata field is important. It lets you slice your scores by category in the Braintrust UI — so you can see that your model scores 94% on cancellation intents but only 71% on billing disputes, which tells you where to focus improvement effort.
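
To make the slicing idea concrete, here is a minimal sketch of the per-category breakdown computed locally. The ScoredRow shape and the meanScoreByCategory helper are hypothetical names for illustration, not part of the Braintrust SDK; the UI does the equivalent grouping for you.

```typescript
// Hypothetical row shape: one scored example plus the metadata it was inserted with.
interface ScoredRow {
  score: number; // 0..1 from a single scorer
  metadata: { category: string };
}

// Group rows by metadata.category and return the mean score for each category.
function meanScoreByCategory(rows: ScoredRow[]): Record<string, number> {
  const totals: Record<string, { sum: number; count: number }> = {};
  for (const row of rows) {
    const key = row.metadata.category;
    if (!totals[key]) totals[key] = { sum: 0, count: 0 };
    totals[key].sum += row.score;
    totals[key].count += 1;
  }
  const means: Record<string, number> = {};
  for (const key of Object.keys(totals)) {
    means[key] = totals[key].sum / totals[key].count;
  }
  return means;
}

const sliced = meanScoreByCategory([
  { score: 1, metadata: { category: "cancellation" } },
  { score: 1, metadata: { category: "cancellation" } },
  { score: 1, metadata: { category: "billing" } },
  { score: 0, metadata: { category: "billing" } },
]);
// sliced.cancellation is 1 and sliced.billing is 0.5
```

An overall mean of 0.75 would hide the fact that every miss in this toy sample is a billing case, which is exactly what the category tag surfaces.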

Writing Scoring Functions

Braintrust has a library of built-in scorers (autoevals) covering exact match, fuzzy match, embedding similarity, and LLM-as-judge. For most real-world use cases, you'll want a combination.

Exact and Fuzzy Match

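import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("intent-classifier", {
  // loadDataset is a local helper (not an SDK export) that returns the dataset rows
  data: loadDataset("customer-support-v1"),
  // classifyIntent is your own function, assumed to return { intent: string }
  task: async (input) => await classifyIntent(input.message),
  scores: [
    // Exact match on the intent field
    ({ output, expected }) => ({
      name: "intent-match",
      score: output.intent === expected.intent ? 1 : 0
    }),
    // Fuzzy match on the intent string (Levenshtein compares strings, not objects)
    ({ output, expected }) =>
      Levenshtein({ output: output.intent, expected: expected.intent })
  ]
});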
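
If you are wondering what a fuzzy-match score actually measures, the sketch below computes a normalized edit-distance similarity from scratch. It assumes the common normalization of dividing the distance by the longer string's length; the editDistance and fuzzyScore names are illustrative, and the autoevals implementation may differ in detail.

```typescript
// Classic dynamic-programming edit distance (insertions, deletions, substitutions).
function editDistance(a: string, b: string): number {
  // prev[j] is the distance between the first i-1 chars of a and the first j chars of b.
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);

  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const substitution = prev[j - 1] + (a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1);
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, substitution));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity in [0, 1]: 1 means identical strings, 0 means every character differs.
function fuzzyScore(output: string, expected: string): number {
  const longest = Math.max(output.length, expected.length);
  return longest === 0 ? 1 : 1 - editDistance(output, expected) / longest;
}

const firstEvalScore = fuzzyScore("hello, world!", "Hello, world!");
```

Under this formula the minimal eval from the Setup section scores 12/13, about 0.92: one substituted character out of thirteen. That is a useful reminder that fuzzy match rewards near-misses, so it suits free text better than labels where "almost right" is still wrong.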

LLM-as-Judge for Qualitative Scoring

For open-ended responses where there's no single correct answer, you need a model to evaluate quality:

import { Eval } from "braintrust";
import { LLMClassifierFromSpec } from "autoevals";

const HelpfulnessScorer = LLMClassifierFromSpec("helpfulness", {
  prompt: `You are evaluating a customer support response.

User message: {{input.message}}
Support response: {{output.response}}

Is this response:
(A) Helpful and complete — addresses the user's issue directly
(B) Partially helpful — acknowledges the issue but doesn't resolve it  
(C) Not helpful — generic, off-topic, or incorrect

Answer with just the letter.`,
  choice_scores: { A: 1.0, B: 0.5, C: 0.0 }
});

Eval("support-response-quality", {
  data: loadDataset("support-responses-v2"),
  task: async (input) => await generateSupportResponse(input.message),
  scores: [HelpfulnessScorer]
});
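
The step that turns the judge's letter into a number is simple but worth making explicit, because judge models do not always answer cleanly. Below is a minimal sketch of that mapping, assuming the judge replies with a letter optionally wrapped in parentheses. The scoreFromJudgeReply helper is hypothetical and is not how autoevals is implemented internally.

```typescript
const choiceScores: Record<string, number> = { A: 1.0, B: 0.5, C: 0.0 };

// Map a raw judge reply to a score. Returns null when the reply does not start
// with a single recognized choice letter, so the caller can flag it for review.
function scoreFromJudgeReply(
  reply: string,
  scores: Record<string, number>
): number | null {
  // Accept "A", "(A)", "a." and similar, but reject words such as "Because".
  const match = reply.trim().toUpperCase().match(/^\(?([A-Z])\)?(?![A-Z])/);
  if (!match) return null;
  const choice = match[1];
  return Object.prototype.hasOwnProperty.call(scores, choice) ? scores[choice] : null;
}

const partial = scoreFromJudgeReply("(B) Partially helpful", choiceScores); // 0.5
const unparsable = scoreFromJudgeReply("Because the reply is vague", choiceScores); // null
```

Returning null instead of 0 for an unparsable reply keeps judge failures separate from genuinely bad responses, which matters when you later read the averages.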

Custom Scoring Functions

Sometimes you need domain-specific logic that no generic scorer captures:

// Assumes the task returns the response as a plain string
function scoreContractCompliance({ output }: { output: string }) {
  const issues: string[] = [];
  
  // Check required fields are present
  if (!output.includes("case number")) {
    issues.push("Missing case number reference");
  }
  
  // Check prohibited phrases aren't present (lowercase both sides so "I don't know" can match)
  const prohibited = ["I don't know", "not my problem", "talk to someone else"];
  const lowered = output.toLowerCase();
  for (const phrase of prohibited) {
    if (lowered.includes(phrase.toLowerCase())) {
      issues.push(`Contains prohibited phrase: "${phrase}"`);
    }
  }
  
  // Check response length
  if (output.split(" ").length < 20) {
    issues.push("Response too short");
  }
  
  return {
    name: "contract-compliance",
    score: issues.length === 0 ? 1 : Math.max(0, 1 - (issues.length * 0.33)),
    metadata: { issues }
  };
}
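
It helps to tabulate what the penalty formula in scoreContractCompliance produces for each issue count. The sketch below isolates that one expression; complianceScore and rounded are illustrative helper names.

```typescript
// Same penalty rule as scoreContractCompliance: each issue costs 0.33, floored at 0.
function complianceScore(issueCount: number): number {
  return Math.max(0, 1 - issueCount * 0.33);
}

// Round to two decimals for display; the raw values carry floating-point noise.
function rounded(value: number): number {
  return Math.round(value * 100) / 100;
}

const penaltyTable = [0, 1, 2, 3, 4].map((n) => ({
  issues: n,
  score: rounded(complianceScore(n)),
}));
// scores by issue count: 0 -> 1, 1 -> 0.67, 2 -> 0.34, 3 -> 0.01, 4 -> 0
```

Note that three issues still score 0.01 rather than 0, because 3 × 0.33 is 0.99. If you want three strikes to zero out the score, use a penalty of 1/3 per issue instead.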

Experiment Tracking and Comparison

Every eval run in Braintrust creates an experiment. Experiments are versioned, and you can compare any two side by side. The comparison view is where the value becomes obvious: instead of only asking whether the score went up, you see which specific examples improved and which regressed.

The typical workflow when making a prompt change:

# Run eval on current version
PROMPT_VERSION=v2 npx braintrust eval my-eval.ts

# Make prompt changes
vim prompts/support-classifier.txt

# Run eval on new version
PROMPT_VERSION=v3 npx braintrust eval my-eval.ts

In the Braintrust UI, you select both experiments and click "Compare". You see:

  • Overall score: v2 was 0.82, v3 is 0.87 (+0.05)
  • 12 examples improved, 3 regressed
  • The 3 regressions: all "account locked" intents, which v3's prompt now misclassifies

You know exactly what changed, where it improved, and what you broke. That's the feedback loop that makes iteration fast.
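
The comparison itself is a row-by-row diff keyed on the example. As a rough sketch of the logic, assuming you have each experiment's scores as a map from example id to score (the RowScores shape and compareExperiments helper are hypothetical, not SDK types):

```typescript
// Hypothetical shape: example id -> score from one scorer in one experiment.
type RowScores = Record<string, number>;

interface Comparison {
  meanDelta: number;   // mean(candidate) - mean(baseline) over shared examples
  improved: string[];  // ids whose score went up
  regressed: string[]; // ids whose score went down
}

function compareExperiments(baseline: RowScores, candidate: RowScores): Comparison {
  const improved: string[] = [];
  const regressed: string[] = [];
  let baselineSum = 0;
  let candidateSum = 0;
  let shared = 0;

  for (const id of Object.keys(baseline)) {
    // Only compare examples that exist in both experiments.
    if (!Object.prototype.hasOwnProperty.call(candidate, id)) continue;
    const before = baseline[id];
    const after = candidate[id];
    baselineSum += before;
    candidateSum += after;
    shared += 1;
    if (after > before) improved.push(id);
    else if (after < before) regressed.push(id);
  }

  const meanDelta = shared === 0 ? 0 : (candidateSum - baselineSum) / shared;
  return { meanDelta, improved, regressed };
}

const comparison = compareExperiments(
  { "cancel-1": 0, "billing-1": 0, "locked-1": 1, "refund-1": 1 },
  { "cancel-1": 1, "billing-1": 1, "locked-1": 0, "refund-1": 1 }
);
// meanDelta is 0.25, improved is ["cancel-1", "billing-1"], regressed is ["locked-1"]
```

The mean moved up by 0.25 here, yet one example got worse. Keeping the regressed list next to the delta is what stops a net improvement from hiding a new failure.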

CI Integration

The most important step is making evals blocking in CI. A score drop should fail the build the same way a failing unit test would.

# .github/workflows/llm-eval.yml
name: LLM Evaluation

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/llm/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Install dependencies
        run: npm ci
      
      - name: Run evals
        id: eval
        run: npx braintrust eval evals/support-classifier.eval.ts
        env:
          BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      
      - name: Check score threshold
        run: |
          python scripts/check_eval_scores.py \
            --experiment-name "support-classifier-${{ github.sha }}" \
            --min-score 0.85 \
            --scorer intent-match
        env:
          BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }}

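The check_eval_scores.py script fetches the experiment via the Braintrust API and exits with code 1 if the named scorer is missing or its mean falls below the threshold. It looks the experiment up by name, so the eval run has to name its experiment support-classifier-<commit SHA>; the workflow above does not set that name on its own.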

import os
import sys
import requests
import argparse

def check_scores(experiment_name: str, min_score: float, scorer: str):
    api_key = os.environ["BRAINTRUST_API_KEY"]
    
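    # Fetch experiment via Braintrust API
    # The query parameter and response shape used here are assumptions; check them against the API reference
    response = requests.get(
        "https://api.braintrust.dev/v1/experiment",
        params={"name": experiment_name},
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=30,
    )
    response.raise_for_status()  # fail the build on auth or network errors

    objects = response.json().get("objects", [])
    if not objects:
        print(f"Experiment '{experiment_name}' not found")
        sys.exit(1)

    experiment = objects[0]
    scores = experiment["scores"]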
    
    if scorer not in scores:
        print(f"Scorer '{scorer}' not found in experiment")
        sys.exit(1)
    
    actual_score = scores[scorer]["mean"]
    
    if actual_score < min_score:
        print(f"FAIL: {scorer} score {actual_score:.3f} < threshold {min_score}")
        print(f"View experiment: {experiment['permalink']}")
        sys.exit(1)
    
    print(f"PASS: {scorer} score {actual_score:.3f} >= threshold {min_score}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--experiment-name", required=True)
    parser.add_argument("--min-score", type=float, required=True)
    parser.add_argument("--scorer", required=True)
    args = parser.parse_args()
    
    check_scores(args.experiment_name, args.min_score, args.scorer)

Now prompt changes that degrade accuracy fail the PR. Changes that hold or improve it pass, with the actual score printed in the build logs.
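
The script above gates on a single scorer. If you want to gate on several at once, the core check is a small pure function. Here is a sketch in TypeScript, assuming you already have the mean score per scorer in hand; checkThresholds and its input shapes are illustrative and not part of any Braintrust API.

```typescript
interface GateResult {
  passed: boolean;
  failures: string[]; // one human-readable line per failing scorer
}

// Compare each scorer's mean against its minimum. A scorer that has a threshold
// but no reported score counts as a failure, so a renamed scorer cannot slip through.
function checkThresholds(
  means: Record<string, number>,
  thresholds: Record<string, number>
): GateResult {
  const failures: string[] = [];
  for (const scorer of Object.keys(thresholds)) {
    const minimum = thresholds[scorer];
    if (!Object.prototype.hasOwnProperty.call(means, scorer)) {
      failures.push(`${scorer}: no score reported`);
    } else if (means[scorer] < minimum) {
      failures.push(`${scorer}: ${means[scorer].toFixed(3)} < ${minimum}`);
    }
  }
  return { passed: failures.length === 0, failures };
}

const gate = checkThresholds(
  { "intent-match": 0.87, helpfulness: 0.6 },
  { "intent-match": 0.85, helpfulness: 0.7 }
);
// gate.passed is false; gate.failures is ["helpfulness: 0.600 < 0.7"]
```

A CI wrapper would print the failures and exit non-zero when passed is false, mirroring what check_eval_scores.py does for one scorer.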

Layering Braintrust with Application-Level Testing

Braintrust evaluates your LLM pipeline in isolation. It doesn't know what happens when your pipeline runs inside your actual application — whether the UI surfaces the output correctly, whether the API contract is honored, whether error states are handled gracefully.

For that layer, teams pair Braintrust with end-to-end testing. HelpMeTest lets you write Robot Framework tests that run against your live application, verifying that user-facing behavior is correct regardless of what's happening inside the LLM. A complete quality gate looks like:

  1. Braintrust eval: LLM pipeline accuracy ≥ 85% (blocks merge)
  2. HelpMeTest E2E: application behavior correct on staging (blocks deploy)
  3. Both green: deploy to production

The Braintrust layer catches model-level regressions early. The E2E layer catches integration and UI regressions. Neither is sufficient alone.

Dataset Management Over Time

Datasets are living artifacts. As you add features, encounter edge cases, and fix bugs, your datasets should grow to reflect what you've learned.

A few practices that keep datasets useful:

Link dataset items to production incidents. When a user reports a bad response, add the input to your dataset immediately. Use the metadata field to link back to the ticket number.

Tag by category and priority. metadata: { category: "billing", priority: "high" } lets you filter in the UI and focus on the areas that matter most.

Prune items that no longer apply. If you remove a feature, remove its test cases. Stale cases dilute your score signals.

Version your datasets. When you make major changes to your pipeline, create a new dataset version rather than mutating the existing one. This lets you run both old and new experiments against the same historical baseline.
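
Two of these practices, linking items to incidents and pruning retired categories, are easy to automate. The sketch below shows one way to shape that logic before calling dataset.insert. The Incident and DatasetRecord types and both helpers are hypothetical, chosen to mirror the record layout used earlier in this guide.

```typescript
interface DatasetRecord {
  input: { message: string };
  expected?: { intent: string; requires_human: boolean }; // filled in once labeled
  metadata: { source: string; category: string; ticket?: string };
}

// Hypothetical incident shape coming from your support or bug tracker.
interface Incident {
  ticketId: string;
  userMessage: string;
  category: string;
}

// Turn a reported incident into a dataset record that links back to its ticket.
function recordFromIncident(incident: Incident): DatasetRecord {
  return {
    input: { message: incident.userMessage },
    metadata: {
      source: "production-failure",
      category: incident.category,
      ticket: incident.ticketId,
    },
  };
}

// Drop records whose category belongs to a feature that no longer exists.
function pruneRetired(records: DatasetRecord[], retired: string[]): DatasetRecord[] {
  return records.filter((record) => retired.indexOf(record.metadata.category) === -1);
}

const fromTicket = recordFromIncident({
  ticketId: "TKT-4521",
  userMessage: "unsubscribe me please",
  category: "cancellation",
});
const kept = pruneRetired(
  [fromTicket, { input: { message: "gift card balance?" }, metadata: { source: "manual", category: "gift-cards" } }],
  ["gift-cards"]
);
// kept contains only the cancellation record
```

The expected field is left empty on purpose: a production failure tells you what the input was, but someone still has to decide what the correct output should have been.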

Summary

Braintrust gives LLM teams the infrastructure to move from "we think this is better" to "we know this is better, here's the score delta and the exact inputs that changed." The combination of versioned datasets, flexible scoring functions, and experiment comparison makes evaluation continuous rather than ad hoc.

Wiring it into CI is the step that transforms evaluation from a pre-launch ritual into a development discipline. When accuracy regressions fail builds, teams iterate differently — more carefully, with faster feedback loops, and with a shared quantitative definition of "better."
