Braintrust AI Evaluation: Datasets, Scoring, and CI Integration
The hardest thing about deploying LLM-powered features isn't building them — it's knowing whether they're getting better or worse. Prompt tweaks, model upgrades, retrieval changes: each one can improve performance on some inputs while degrading it on others. Braintrust gives you the infrastructure to measure this systematically, with datasets, scoring functions, and experiment tracking that integrates directly into your development workflow.
This guide is a practical walkthrough of the platform: how to build datasets that actually catch regressions, how to write scoring functions that measure what matters, and how to wire it all into a CI pipeline that gates deploys on quality thresholds.
The Evaluation Gap
Most teams start with a small set of manual test prompts they run before shipping changes. The problems with this approach emerge quickly:
- Coverage is arbitrary — you test the examples you thought of, not the ones that break
- No quantification — "it seems better" is not a signal you can track over time
- No history — you can't compare the current version to what you had three months ago
Braintrust is designed to solve all three. It's an evaluation platform that stores datasets, runs experiments, computes scores, and displays results in a UI built for comparing LLM versions.
Setup
Install the SDK:
npm install braintrust autoevals
# or
pip install braintrust autoevals
Set your API key:
export BRAINTRUST_API_KEY=your-api-key
Run a minimal eval to verify everything works:
import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";
Eval("My first eval", {
  data: () => [
    { input: "hello", expected: "Hello, world!" }
  ],
  task: async (input) => `${input}, world!`,
  scores: [Levenshtein]
});
Run it:
npx braintrust eval my-eval.ts
You'll get a link to the Braintrust UI showing the score, the diff from previous runs, and a row-by-row breakdown. That's the core loop: every subsequent eval is a new experiment you can compare to the baseline.
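The minimal eval returns `hello, world!` where `Hello, world!` is expected, so its score lands just below 1. The sketch below shows one way such a fuzzy-match score can be computed. It assumes the score is normalised as 1 minus the edit distance divided by the longer string's length; `editDistance` and `similarity` are illustrative helpers, not the autoevals implementation.

```typescript
// Classic dynamic-programming edit distance (insertions, deletions, substitutions).
function editDistance(a: string, b: string): number {
  let previous: number[] = [];
  for (let j = 0; j <= b.length; j++) previous.push(j);
  for (let i = 1; i <= a.length; i++) {
    const current: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const substitution = previous[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1);
      current.push(Math.min(previous[j] + 1, current[j - 1] + 1, substitution));
    }
    previous = current;
  }
  return previous[b.length];
}

// Assumed normalisation: 1 means identical strings, 0 means nothing in common.
function similarity(output: string, expected: string): number {
  const longest = Math.max(output.length, expected.length);
  return longest === 0 ? 1 : 1 - editDistance(output, expected) / longest;
}

// One substitution ("h" -> "H") across 13 characters.
console.log(similarity("hello, world!", "Hello, world!").toFixed(3)); // prints "0.923"
```

A single wrong character in a short string costs several points, which is why fuzzy match suits short free-text outputs better than long ones.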
Building Datasets That Catch Real Regressions
A dataset is only as good as the examples in it. The best datasets are built from three sources: manually crafted edge cases, production failures you've observed, and synthetic variations generated to stress-test specific behaviors.
import { initDataset } from "braintrust";
const dataset = await initDataset("my-project", {
  dataset: "customer-support-v1"
});
// Add manually crafted cases
await dataset.insert({
  input: { message: "I need to cancel my subscription" },
  expected: { intent: "cancellation", requires_human: false },
  metadata: { source: "manual", category: "cancellation" }
});
// Add production failures (most valuable!)
await dataset.insert({
  input: { message: "unsubscribe me please" },
  expected: { intent: "cancellation", requires_human: false },
  metadata: { source: "production-failure", ticket: "TKT-4521" }
});
// Add synthetic variations
const cancellationVariants = [
  "I want to stop my subscription",
  "Please cancel my account",
  "How do I unsubscribe?",
  "I'd like to discontinue the service",
];
for (const variant of cancellationVariants) {
  await dataset.insert({
    input: { message: variant },
    expected: { intent: "cancellation", requires_human: false },
    metadata: { source: "synthetic", template: "cancellation" }
  });
}
// Make sure buffered records are uploaded before the script exits
await dataset.flush();
The metadata field is important. It lets you slice your scores by category in the Braintrust UI, so you can see, for example, that your model scores 94% on cancellation intents but only 71% on billing disputes, which tells you where to focus improvement effort.
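To make the slicing idea concrete, here is a minimal local sketch of a per-category breakdown. The `ScoredRow` shape and the `meanScoreByCategory` helper are hypothetical names for illustration and are not part of the Braintrust SDK; the sample scores are made up.

```typescript
// Hypothetical row shape: one scored example plus the metadata attached at insert time.
interface ScoredRow {
  score: number; // between 0 and 1
  metadata: { category: string };
}

// Group rows by metadata.category and return the mean score for each category.
function meanScoreByCategory(rows: ScoredRow[]): Record<string, number> {
  const totals: Record<string, { sum: number; count: number }> = {};
  for (const row of rows) {
    const category = row.metadata.category;
    const entry = totals[category] ?? { sum: 0, count: 0 };
    entry.sum += row.score;
    entry.count += 1;
    totals[category] = entry;
  }
  const means: Record<string, number> = {};
  for (const category of Object.keys(totals)) {
    means[category] = totals[category].sum / totals[category].count;
  }
  return means;
}

// Illustrative data only.
const sampleRows: ScoredRow[] = [
  { score: 1, metadata: { category: "cancellation" } },
  { score: 1, metadata: { category: "cancellation" } },
  { score: 1, metadata: { category: "billing" } },
  { score: 0, metadata: { category: "billing" } },
];
console.log(meanScoreByCategory(sampleRows)); // cancellation: 1, billing: 0.5
```

An overall mean of 0.75 would hide the fact that every failure here sits in one category, which is the reason to tag items consistently.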
Writing Scoring Functions
Braintrust has a library of built-in scorers (autoevals) covering exact match, fuzzy match, embedding similarity, and LLM-as-judge. For most real-world use cases, you'll want a combination.
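As a rough mental model, a scorer is a function that receives the task output and the expected value and returns a named score between 0 and 1. The sketch below combines two such functions into one weighted score. The `ScorerArgs`, `Score`, and `Scorer` types, the `weightedScorer` helper, and the weights are assumptions for illustration, not autoevals APIs.

```typescript
// Hypothetical local types; the real SDK ships its own scorer typings.
interface Classification {
  intent: string;
  requires_human: boolean;
}
interface ScorerArgs {
  output: Classification;
  expected: Classification;
}
interface Score {
  name: string;
  score: number; // between 0 and 1
}
type Scorer = (args: ScorerArgs) => Score;

const intentMatch: Scorer = ({ output, expected }) => ({
  name: "intent-match",
  score: output.intent === expected.intent ? 1 : 0,
});

const escalationMatch: Scorer = ({ output, expected }) => ({
  name: "escalation-match",
  score: output.requires_human === expected.requires_human ? 1 : 0,
});

// Combine scorers into one weighted score; weights are normalised to sum to 1.
function weightedScorer(name: string, parts: Array<[Scorer, number]>): Scorer {
  const totalWeight = parts.reduce((sum, [, weight]) => sum + weight, 0);
  if (totalWeight <= 0) throw new Error("weights must sum to a positive number");
  return (args) => ({
    name,
    score: parts.reduce(
      (sum, [scorer, weight]) => sum + scorer(args).score * (weight / totalWeight),
      0
    ),
  });
}

const classificationQuality = weightedScorer("classification-quality", [
  [intentMatch, 3],
  [escalationMatch, 1],
]);
```

Reporting the individual scores alongside the combined one is usually worth it, since a single blended number hides which component moved.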
Exact and Fuzzy Match
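import { Eval, initDataset } from "braintrust";
import { Levenshtein } from "autoevals";
// classifyIntent is your own pipeline function, defined elsewhere
Eval("intent-classifier", {
  data: initDataset("my-project", { dataset: "customer-support-v1" }),
  task: async (input) => await classifyIntent(input.message),
  scores: [
    // Exact match on the intent field
    ({ output, expected }) => ({
      name: "intent-match",
      score: output.intent === expected.intent ? 1 : 0
    }),
    // Fuzzy match on the intent label. Levenshtein compares strings,
    // so pass the field rather than the whole object.
    ({ output, expected }) =>
      Levenshtein({ output: output.intent, expected: expected.intent })
  ]
});
LLM-as-Judge for Qualitative Scoring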
For open-ended responses where there's no single correct answer, you need a model to evaluate quality:
import { Eval, initDataset } from "braintrust";
import { LLMClassifierFromTemplate } from "autoevals";
const HelpfulnessScorer = LLMClassifierFromTemplate({
  name: "helpfulness",
  promptTemplate: `You are evaluating a customer support response.
User message: {{input.message}}
Support response: {{output.response}}
Is this response:
(A) Helpful and complete: addresses the user's issue directly
(B) Partially helpful: acknowledges the issue but doesn't resolve it
(C) Not helpful: generic, off-topic, or incorrect
Answer with just the letter.`,
  choiceScores: { A: 1.0, B: 0.5, C: 0.0 }
});
// generateSupportResponse is your own pipeline function; the template
// assumes it returns an object with a `response` field
Eval("support-response-quality", {
  data: initDataset("my-project", { dataset: "support-responses-v2" }),
  task: async (input) => await generateSupportResponse(input.message),
  scores: [HelpfulnessScorer]
});
Custom Scoring Functions
Sometimes you need domain-specific logic that no generic scorer captures:
// Scores a plain-text support response against house rules
function scoreContractCompliance({ output }: { output: string }) {
  const issues: string[] = [];
  const text = output.toLowerCase();
  // Check required fields are present
  if (!text.includes("case number")) {
    issues.push("Missing case number reference");
  }
  // Check prohibited phrases aren't present (lowercase on both sides of the comparison)
  const prohibited = ["i don't know", "not my problem", "talk to someone else"];
  for (const phrase of prohibited) {
    if (text.includes(phrase)) {
      issues.push(`Contains prohibited phrase: "${phrase}"`);
    }
  }
  // Check response length in words
  if (output.trim().split(/\s+/).length < 20) {
    issues.push("Response too short");
  }
  return {
    name: "contract-compliance",
    // Each issue costs 0.33, floored at 0
    score: Math.max(0, 1 - issues.length * 0.33),
    metadata: { issues }
  };
}
Experiment Tracking and Comparison
Every eval run in Braintrust creates an experiment. Experiments are versioned, and you can compare them side by side. The comparison view is where the value becomes obvious: beyond whether the overall score went up, you see which specific examples improved and which regressed.
The typical workflow when making a prompt change:
# Run eval on current version
PROMPT_VERSION=v2 npx braintrust eval my-eval.ts
# Make prompt changes
vim prompts/support-classifier.txt
# Run eval on new version
PROMPT_VERSION=v3 npx braintrust eval my-eval.ts
In the Braintrust UI, select both experiments and click "Compare". A comparison might show, for example:
- Overall score: v2 was 0.82, v3 is 0.87 (+0.05)
- 12 examples improved, 3 regressed
- The 3 regressions: all "account locked" intents, which v3's prompt now misclassifies
You know exactly what changed, where it improved, and what you broke. That's the feedback loop that makes iteration fast.
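The same comparison can be sketched locally to show why the per-example view matters. The `RowScores` shape, the `compareRuns` helper, and the example ids are hypothetical and the scores are made up; this is not how the Braintrust UI computes its diff.

```typescript
// Hypothetical shape: per-example scores keyed by a stable example id.
type RowScores = Record<string, number>;

interface Comparison {
  meanDelta: number;
  improved: string[];
  regressed: string[];
}

function mean(values: number[]): number {
  return values.length === 0 ? 0 : values.reduce((a, b) => a + b, 0) / values.length;
}

// Compare two runs over the examples they have in common.
function compareRuns(baseline: RowScores, candidate: RowScores): Comparison {
  const shared = Object.keys(baseline).filter((id) => id in candidate);
  const improved = shared.filter((id) => candidate[id] > baseline[id]);
  const regressed = shared.filter((id) => candidate[id] < baseline[id]);
  const meanDelta =
    mean(shared.map((id) => candidate[id])) - mean(shared.map((id) => baseline[id]));
  return { meanDelta, improved, regressed };
}

// Illustrative data only: the mean rises by 0.25 even though ex-3 regresses.
const v2Scores: RowScores = { "ex-1": 1, "ex-2": 0, "ex-3": 1, "ex-4": 0 };
const v3Scores: RowScores = { "ex-1": 1, "ex-2": 1, "ex-3": 0, "ex-4": 1 };
console.log(compareRuns(v2Scores, v3Scores));
```

A positive mean delta and a non-empty `regressed` list can coexist, so a gate that only looks at the mean will let targeted regressions through.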
CI Integration
The most important step is making evals blocking in CI. A score drop should fail the build the same way a failing unit test would.
# .github/workflows/llm-eval.yml
name: LLM Evaluation
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/llm/**'
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: npm ci
      - name: Run evals
        id: eval
        run: npx braintrust eval evals/support-classifier.eval.ts
        env:
          BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      # Assumes the eval file names its experiment "support-classifier-<commit sha>"
      - name: Check score threshold
        run: |
          pip install requests
          python scripts/check_eval_scores.py \
            --experiment-name "support-classifier-${{ github.sha }}" \
            --min-score 0.85 \
            --scorer intent-match
        env:
          BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }}
The check_eval_scores.py script fetches the experiment via the Braintrust API and exits with code 1 if the named scorer's mean score falls below the threshold:
import argparse
import os
import sys

import requests

API_URL = "https://api.braintrust.dev/v1"


def check_scores(experiment_name: str, min_score: float, scorer: str) -> None:
    headers = {"Authorization": f"Bearer {os.environ['BRAINTRUST_API_KEY']}"}

    # Look up the experiment by name
    response = requests.get(
        f"{API_URL}/experiment",
        params={"experiment_name": experiment_name},
        headers=headers,
        timeout=30,
    )
    response.raise_for_status()
    experiments = response.json()["objects"]
    if not experiments:
        print(f"Experiment '{experiment_name}' not found")
        sys.exit(1)

    # Fetch the per-scorer summary; confirm the endpoint and field names
    # against the current REST API reference before relying on them
    response = requests.get(
        f"{API_URL}/experiment/{experiments[0]['id']}/summarize",
        params={"summarize_scores": "true"},
        headers=headers,
        timeout=30,
    )
    response.raise_for_status()
    summary = response.json()
    scores = summary.get("scores") or {}

    if scorer not in scores:
        print(f"Scorer '{scorer}' not found in experiment")
        sys.exit(1)

    actual_score = scores[scorer]["score"]
    if actual_score < min_score:
        print(f"FAIL: {scorer} score {actual_score:.3f} < threshold {min_score}")
        print(f"View experiment: {summary.get('experiment_url')}")
        sys.exit(1)
    print(f"PASS: {scorer} score {actual_score:.3f} >= threshold {min_score}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--experiment-name", required=True)
    parser.add_argument("--min-score", type=float, required=True)
    parser.add_argument("--scorer", required=True)
    args = parser.parse_args()
    check_scores(args.experiment_name, args.min_score, args.scorer)
Now prompt changes that degrade accuracy fail the PR. Changes that clear the threshold pass, with the score printed in the build log.
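If more than one scorer should gate the build, the same check generalises to a map of thresholds. The sketch below is a hedged illustration in TypeScript: `findGateFailures`, the `ScoreMap` type, and the numbers are hypothetical, and fetching the scores is left out.

```typescript
// Hypothetical inputs: mean score per scorer, and the minimum each one must reach.
type ScoreMap = Record<string, number>;

// Return one readable failure per scorer that is missing or below its threshold.
function findGateFailures(scores: ScoreMap, thresholds: ScoreMap): string[] {
  const failures: string[] = [];
  for (const scorer of Object.keys(thresholds)) {
    if (!(scorer in scores)) {
      failures.push(`${scorer}: no score reported`);
    } else if (scores[scorer] < thresholds[scorer]) {
      failures.push(`${scorer}: ${scores[scorer].toFixed(3)} < ${thresholds[scorer]}`);
    }
  }
  return failures;
}

// Illustrative numbers only.
const gateFailures = findGateFailures(
  { "intent-match": 0.87, helpfulness: 0.62 },
  { "intent-match": 0.85, helpfulness: 0.7 }
);
for (const failure of gateFailures) {
  console.log(`FAIL ${failure}`); // FAIL helpfulness: 0.620 < 0.7
}
// In CI, a non-zero exit code is what fails the build, for example:
// if (gateFailures.length > 0) process.exit(1);
```

Treating a missing scorer as a failure is a deliberate choice here: a renamed or silently dropped scorer should not turn the gate green.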
Layering Braintrust with Application-Level Testing
Braintrust evaluates your LLM pipeline in isolation. It doesn't know what happens when your pipeline runs inside your actual application — whether the UI surfaces the output correctly, whether the API contract is honored, whether error states are handled gracefully.
For that layer, teams pair Braintrust with end-to-end testing. HelpMeTest lets you write Robot Framework tests that run against your live application, verifying that user-facing behavior is correct regardless of what's happening inside the LLM. A complete quality gate looks like:
- Braintrust eval: LLM pipeline accuracy ≥ 85% (blocks merge)
- HelpMeTest E2E: application behavior correct on staging (blocks deploy)
- Both green: deploy to production
The Braintrust layer catches model-level regressions early. The E2E layer catches integration and UI regressions. Neither is sufficient alone.
Dataset Management Over Time
Datasets are living artifacts. As you add features, encounter edge cases, and fix bugs, your datasets should grow to reflect what you've learned.
A few practices that keep datasets useful:
Link dataset items to production incidents. When a user reports a bad response, add the input to your dataset immediately. Use the metadata field to link back to the ticket number.
Tag by category and priority. metadata: { category: "billing", priority: "high" } lets you filter in the UI and focus on the areas that matter most.
Prune items that no longer apply. If you remove a feature, remove its test cases. Stale cases dilute your score signals.
Version your datasets. When you make major changes to your pipeline, create a new dataset version rather than mutating the existing one. This lets you run both old and new experiments against the same historical baseline.
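The tagging and pruning practices are easy to apply when items carry consistent metadata. Here is a minimal sketch over an in-memory list; the `DatasetItem` shape, the helper names, and the `legacy-chat` category are hypothetical and chosen only to mirror the metadata fields used in this guide.

```typescript
// Hypothetical local view of dataset items.
interface DatasetItem {
  id: string;
  input: { message: string };
  metadata: { category: string; priority: "high" | "normal"; ticket?: string };
}

// Drop items whose category belongs to a feature that no longer exists.
function pruneRetired(items: DatasetItem[], retired: string[]): DatasetItem[] {
  return items.filter((item) => !retired.includes(item.metadata.category));
}

// Keep only high-priority items, to focus on the areas that matter most.
function highPriorityOnly(items: DatasetItem[]): DatasetItem[] {
  return items.filter((item) => item.metadata.priority === "high");
}

// Illustrative data only.
const allItems: DatasetItem[] = [
  { id: "a", input: { message: "Why was I charged twice?" }, metadata: { category: "billing", priority: "high", ticket: "TKT-4521" } },
  { id: "b", input: { message: "Open the old chat widget" }, metadata: { category: "legacy-chat", priority: "normal" } },
  { id: "c", input: { message: "Please cancel my account" }, metadata: { category: "cancellation", priority: "normal" } },
];

const activeItems = pruneRetired(allItems, ["legacy-chat"]);
console.log(activeItems.map((item) => item.id)); // [ 'a', 'c' ]
console.log(highPriorityOnly(activeItems).map((item) => item.id)); // [ 'a' ]
```

Both helpers return new arrays rather than mutating their input, which fits the advice to keep older dataset versions intact for historical comparisons.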
Summary
Braintrust gives LLM teams the infrastructure to move from "we think this is better" to "we know this is better, here's the score delta and the exact inputs that changed." The combination of versioned datasets, flexible scoring functions, and experiment comparison makes evaluation continuous rather than ad hoc.
Wiring it into CI is the step that transforms evaluation from a pre-launch ritual into a development discipline. When accuracy regressions fail builds, teams iterate differently — more carefully, with faster feedback loops, and with a shared quantitative definition of "better."