Testing LLM Applications with the OpenAI Evals Framework
OpenAI Evals is a framework for evaluating large language model outputs. It lets you define test cases (called "evals") that measure whether a model produces correct, consistent, or appropriately formatted responses — and run them as automated checks in your development pipeline.
What Evals Test
Evals measure properties of LLM outputs:
- Accuracy — does the model give the right answer?
- Format compliance — does the output follow a required structure (JSON, code, markdown)?
- Regression — did a model or prompt update degrade quality?
- Consistency — does the same input reliably produce equivalent outputs?
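Each of these properties reduces to a small check on model output. The sketch below is illustrative only: the function names are hypothetical and are not part of the evals library, and the comparisons are deliberately simple.

```python
import json


def accuracy(outputs, ideals):
    """Accuracy: fraction of outputs equal to their ideal answer (whitespace trimmed)."""
    correct = sum(o.strip() == i.strip() for o, i in zip(outputs, ideals))
    return correct / len(ideals)


def is_valid_json(output):
    """Format compliance: True if the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def is_consistent(outputs):
    """Consistency: True if repeated runs agree after trimming and lowercasing."""
    return len({o.strip().lower() for o in outputs}) <= 1


def regressed(old_accuracy, new_accuracy, tolerance=0.0):
    """Regression: True if accuracy dropped by more than the tolerance."""
    return new_accuracy < old_accuracy - tolerance
```

The framework packages checks like these behind dataset files and eval specs, so they can run without custom glue code.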
Installing the Evals Library
pip install evals
# or from source
git clone https://github.com/openai/evals
cd evals
pip install -e .
Set your API key:
export OPENAI_API_KEY=sk-...
Anatomy of an Eval
An eval has two parts:
- A dataset: a JSONL file of test cases, each with an "input" and an expected "ideal" output
- An eval spec: YAML that ties the dataset to an evaluation method (match, includes, model-graded)
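A malformed dataset line usually surfaces only once an eval run is underway, so it can help to check the file first. The following is a minimal sketch, assuming each line is a JSON object with a chat-style "input" list and an "ideal" field; validate_sample is a hypothetical helper, not a function from the evals library.

```python
import json


def validate_sample(line):
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        sample = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(sample, dict):
        return ["sample is not a JSON object"]

    problems = []
    messages = sample.get("input")
    if not isinstance(messages, list) or not messages:
        problems.append("'input' must be a non-empty list of chat messages")
    else:
        for msg in messages:
            # Each chat message is assumed to carry a role and its content
            if not isinstance(msg, dict) or "role" not in msg or "content" not in msg:
                problems.append("each message needs 'role' and 'content'")
                break
    if "ideal" not in sample:
        problems.append("missing 'ideal'")
    return problems
```

Running this over every line of a dataset file before invoking an eval gives a quick list of lines to repair.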
Dataset format (data/qa_test.jsonl)
{"input": [{"role": "user", "content": "What is the capital of France?"}], "ideal": "Paris"}
{"input": [{"role": "user", "content": "What is 12 * 8?"}], "ideal": "96"}
{"input": [{"role": "user", "content": "Who wrote Hamlet?"}], "ideal": "Shakespeare"}
Eval spec (evals/registry/evals/my_qa.yaml)
my_qa:
  id: my_qa.v0
  description: Basic factual QA test
  metrics: [accuracy]
my_qa.v0:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: data/qa_test.jsonl
Running an Eval
oaieval gpt-4o my_qa
Output:
Final report:
{
  "accuracy": 0.97,
  "n": 100,
  "failures": 3
}
Eval Types
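The string-based eval types differ mainly in how they compare the model's completion with the ideal answer. The sketch below is a simplified reading of those comparison rules, not the library's own code: the function names are hypothetical, and the behavior is an assumption worth checking against the version you have installed.

```python
def match(sampled, ideal):
    """Match-style rule (assumed): the completion starts with an ideal answer."""
    ideals = ideal if isinstance(ideal, list) else [ideal]
    return any(sampled.startswith(i) for i in ideals)


def includes(sampled, ideal, ignore_case=False):
    """Includes-style rule (assumed): an ideal answer appears anywhere in the completion."""
    ideals = ideal if isinstance(ideal, list) else [ideal]
    if ignore_case:
        sampled = sampled.lower()
        ideals = [i.lower() for i in ideals]
    return any(i in sampled for i in ideals)
```

Under these rules a verbose answer such as "The capital is Paris." fails a match-style check but passes an includes-style one, which is often the deciding factor when choosing between the two.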
Exact Match
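Tests that the model's answer starts with the expected value. In the library's basic Match class the comparison is a case-sensitive prefix check, so "Paris." passes against an ideal of "Paris" but "paris" does not: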
my_match_eval.v0:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: data/match_cases.jsonl
Includes
Tests that the expected string appears anywhere in the response:
my_includes_eval.v0:
  class: evals.elsuite.basic.includes:Includes
  args:
    samples_jsonl: data/includes_cases.jsonl
Model-Graded
Uses a grader model (typically GPT-4) to evaluate open-ended outputs:
my_modelgraded_eval.v0:
  class: evals.elsuite.modelgraded.classify:ModelBasedClassify
  args:
    samples_jsonl: data/open_ended_cases.jsonl
    eval_type: cot_classify
    modelgraded_spec: closedqa
Writing Custom Evals in Python
For complex evaluation logic, subclass evals.Eval:
import json

import evals
import evals.metrics
import evals.record
from evals.eval import Eval
from evals.record import RecorderBase


class JSONFormatEval(Eval):
    """Tests that the model returns valid JSON matching a schema."""

    def __init__(self, completion_fns, samples_jsonl, schema, **kwargs):
        super().__init__(completion_fns, **kwargs)
        self.samples_jsonl = samples_jsonl
        self.schema = schema

    def eval_sample(self, sample, rng):
        prompt = sample["input"]
        result = self.completion_fn(prompt)
        output = result.get_completions()[0]
        try:
            parsed = json.loads(output)
            assert isinstance(parsed, dict), "Top-level JSON value is not an object"
            # Validate required keys
            for key in self.schema.get("required", []):
                assert key in parsed, f"Missing key: {key}"
        except (json.JSONDecodeError, AssertionError) as e:
            # Invalid JSON or a missing key: record a failed match with the reason
            evals.record.record_match(False, expected="valid", picked=str(e))
        else:
            # record_and_check_match compares the sampled text with `expected`,
            # so JSON output would never match "valid"; record the pass directly.
            evals.record.record_match(True, expected="valid", picked="valid")

    def run(self, recorder: RecorderBase):
        samples = evals.get_jsonl(self.samples_jsonl)
        self.eval_all_samples(recorder, samples)
        return {"accuracy": evals.metrics.get_accuracy(recorder.get_events("match"))}
Register it in evals/registry/evals/json_format.yaml:
json_format:
  id: json_format.v0
  description: Validates model outputs are valid JSON
json_format.v0:
  class: myevals.json_format:JSONFormatEval
  args:
    samples_jsonl: data/json_format_cases.jsonl
    schema:
      required: [name, value, unit]
Using Evals in CI
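The threshold check that a CI job performs on an eval report can be sketched as a standalone function, which is easier to test than an inline script. This sketch assumes the report appears in the captured output as a single-line JSON object with an "accuracy" field; find_report and gate are hypothetical names, and the log format is an assumption to verify against your own runs.

```python
import json


def find_report(log_text):
    """Return the last JSON object in the log that has an 'accuracy' key, or None."""
    for line in reversed(log_text.splitlines()):
        try:
            candidate = json.loads(line)
        except json.JSONDecodeError:
            continue  # Not JSON: ordinary log output
        if isinstance(candidate, dict) and "accuracy" in candidate:
            return candidate
    return None


def gate(log_text, threshold=0.90):
    """Return a process exit code: 0 if accuracy meets the threshold, 1 otherwise."""
    report = find_report(log_text)
    if report is None:
        return 1  # Treat a missing report as a failure rather than a silent pass
    return 0 if report["accuracy"] >= threshold else 1
```

Returning a failure when no report is found is a deliberate choice: a run that crashed before producing a report should not turn the build green.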
# .github/workflows/llm-tests.yml
name: LLM Eval Tests
on: [push]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install evals
      - name: Run factual accuracy eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          set -o pipefail  # fail the step if oaieval itself fails, not only tee
          oaieval gpt-4o-mini my_qa 2>&1 | tee eval_output.txt
          python -c "
          import json, sys
          # Find the last output line that is a JSON object with an accuracy field
          # (assumes the report is emitted as a single JSON line)
          lines = open('eval_output.txt').readlines()
          for line in reversed(lines):
              try:
                  result = json.loads(line)
              except json.JSONDecodeError:
                  continue
              if isinstance(result, dict) and 'accuracy' in result:
                  break
          else:
              print('No accuracy report found in eval output')
              sys.exit(1)
          if result['accuracy'] < 0.90:
              print(f'Accuracy too low: {result[\"accuracy\"]}')
              sys.exit(1)
          print('Eval passed')
          "
Prompt Regression Testing
Create a baseline dataset from your current prompt, then run evals after every prompt change:
import json

import openai


def capture_baseline(prompts: list[str], output_file: str):
    """Run prompts against current model and save as eval dataset."""
    client = openai.OpenAI()
    with open(output_file, "w") as f:
        for prompt in prompts:
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
            )
            output = response.choices[0].message.content
            f.write(json.dumps({
                "input": [{"role": "user", "content": prompt}],
                "ideal": output,
            }) + "\n")


# Capture baseline before prompt changes
capture_baseline(
    prompts=open("test_prompts.txt").read().splitlines(),
    output_file="data/regression_baseline.jsonl",
)
Then run oaieval new_model_or_prompt regression_eval to check for degradation. Here regression_eval stands for an eval you register the same way as my_qa, with samples_jsonl pointing at data/regression_baseline.jsonl.
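To see what a regression check computes, the comparison can be sketched in plain Python without the framework. This is a minimal illustration under stated assumptions: normalize and compare_to_baseline are hypothetical helpers, the baseline samples use the "input"/"ideal" layout written by capture_baseline, and new_outputs is assumed to hold the new model's answers in the same order.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def compare_to_baseline(baseline, new_outputs):
    """Compare new outputs with baseline 'ideal' answers, sample by sample."""
    changed = []
    for sample, new in zip(baseline, new_outputs):
        if normalize(sample["ideal"]) != normalize(new):
            # Report the last user message as the prompt that changed
            changed.append(sample["input"][-1]["content"])
    unchanged = len(baseline) - len(changed)
    return {"accuracy": unchanged / len(baseline), "changed_prompts": changed}
```

Free-form answers rarely repeat word for word, so a string comparison like this suits short factual outputs; for open-ended prompts a model-graded eval is likely the better fit.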
Key Takeaways
- Use exact-match evals for factual questions and structured outputs
- Use model-graded evals for creative or open-ended responses
- Capture baselines before changing prompts or upgrading models
- Run evals in CI with a minimum accuracy threshold
- Prefer small, focused datasets (50–200 samples) for fast iteration