MLOps

Testing LLM Applications with OpenAI Evals Framework

HelpMeTest

20 May 2026 — 3 min read

OpenAI Evals is a framework for evaluating large language model outputs. It lets you define test cases (called "evals") that measure whether a model produces correct, consistent, or appropriately formatted responses — and run them as automated checks in your development pipeline.

What Evals Test

Evals measure properties of LLM outputs:

Accuracy — does the model give the right answer?
Format compliance — does the output follow a required structure (JSON, code, markdown)?
Regression — did a model or prompt update degrade quality?
Consistency — does the same input reliably produce equivalent outputs?
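Of the four properties, consistency is the least obvious to quantify. One way to score it, sketched below under the assumption that you have already collected several responses to the same prompt, is to measure how many of them agree after light normalization. The helper names `normalize` and `consistency_rate` are hypothetical and not part of the Evals library.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def consistency_rate(outputs: list[str]) -> float:
    """Fraction of outputs that agree with the most common normalized output."""
    if not outputs:
        raise ValueError("need at least one output")
    counts = Counter(normalize(o) for o in outputs)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(outputs)


# Three responses to the same prompt; two agree after normalization.
rate = consistency_rate(["Paris", "paris ", "Lyon"])
print(round(rate, 2))  # 0.67
```

A rate of 1.0 means every sampled response was equivalent under this normalization; what counts as "equivalent" for your application may need a stricter or looser rule.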

Installing the Evals Library

pip install evals
# or from source
git clone https://github.com/openai/evals
cd evals
pip install -e .

Set your API key:

export OPENAI_API_KEY=sk-...

Anatomy of an Eval

An eval has two parts:

A dataset — JSONL file of test cases, each with an input and an expected ideal output
An eval spec — YAML that ties the dataset to an evaluation method (match, includes, model-graded)

Dataset format (`data/qa_test.jsonl`)

{"input": [{"role": "user", "content": "What is the capital of France?"}], "ideal": "Paris"}
{"input": [{"role": "user", "content": "What is 12 * 8?"}], "ideal": "96"}
{"input": [{"role": "user", "content": "Who wrote Hamlet?"}], "ideal": "Shakespeare"}

Eval spec (`evals/registry/evals/my_qa.yaml`)

my_qa:
  id: my_qa.v0
  description: Basic factual QA test
  metrics: [accuracy]

my_qa.v0:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: data/qa_test.jsonl
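The `class` value in the spec follows a `module.path:ClassName` pattern. As an illustration of how a string of that shape can be turned into a class object, here is a minimal sketch using `importlib`. It is not the library's own loader, which may work differently, and `resolve_class` is a hypothetical name.

```python
import importlib


def resolve_class(path: str) -> type:
    """Resolve a 'package.module:ClassName' string to the class object."""
    module_name, sep, class_name = path.partition(":")
    if not sep or not module_name or not class_name:
        raise ValueError(f"expected 'module:Class', got {path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# A standard-library class is used so the sketch runs without evals installed.
cls = resolve_class("collections:OrderedDict")
print(cls.__name__)  # OrderedDict
```

The practical consequence is that the module named in `class` has to be importable from the environment where the eval runs.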

Running an Eval

oaieval gpt-4o my_qa

Example output (from a run over a 100-sample dataset rather than the three rows shown above):

Final report:
{
  "accuracy": 0.97,
  "n": 100,
  "failures": 3
}

Eval Types

Exact Match

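Tests that the model's answer matches the expected value. Note that the built-in Match class checks whether the response starts with the expected string and compares case-sensitively, so "Paris." passes against an ideal of "Paris" but "paris" does not: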

my_match_eval.v0:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: data/match_cases.jsonl

Includes

Tests that the expected string appears anywhere in the response:

my_includes_eval.v0:
  class: evals.elsuite.basic.includes:Includes
  args:
    samples_jsonl: data/includes_cases.jsonl
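To see why a substring check is more forgiving than a whole-string comparison, the sketch below contrasts the two ideas in plain Python. Both functions are hypothetical stand-ins written for illustration; they are not the library's implementations, and the `ignore_case` parameter is an assumption of this sketch.

```python
def strict_equal(response: str, ideal: str) -> bool:
    """Pass only if the whole response equals the ideal answer."""
    return response.strip() == ideal


def includes(response: str, ideal: str, ignore_case: bool = False) -> bool:
    """Pass if the ideal answer appears anywhere in the response."""
    if ignore_case:
        response, ideal = response.lower(), ideal.lower()
    return ideal in response


response = "The capital of France is Paris."
print(strict_equal(response, "Paris"))  # False
print(includes(response, "Paris"))      # True
print(includes(response, "paris"))      # False
print(includes(response, "paris", ignore_case=True))  # True
```

The leniency cuts both ways: a response such as "Paris is not the capital" would also pass a substring check, so includes-style evals suit prompts where the expected string is unlikely to appear in a wrong answer.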

Model-Graded

Uses a grader model (typically GPT-4) to evaluate open-ended outputs:

my_modelgraded_eval.v0:
  class: evals.elsuite.modelgraded.classify:ModelBasedClassify
  args:
    samples_jsonl: data/open_ended_cases.jsonl
    eval_type: cot_classify
    modelgraded_spec: closedqa
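The general idea behind chain-of-thought grading can be sketched without the framework: send the grader a prompt containing the question and the submitted answer, let it reason, and read the verdict from the final line. Everything below is illustrative. The template text, the Y/N choice set, and the `grade` and `fake_grader` names are assumptions of this sketch, not the contents of the `closedqa` spec.

```python
from typing import Callable

GRADER_TEMPLATE = (
    "You are grading an answer.\n"
    "Question: {question}\n"
    "Submitted answer: {answer}\n"
    "Reason step by step, then finish with a final line containing only Y or N."
)


def grade(question: str, answer: str, grader: Callable[[str], str]) -> bool:
    """Ask a grader for a verdict and read the choice from its last line."""
    prompt = GRADER_TEMPLATE.format(question=question, answer=answer)
    reply = grader(prompt)
    lines = [ln.strip() for ln in reply.splitlines() if ln.strip()]
    verdict = lines[-1] if lines else ""
    if verdict not in {"Y", "N"}:
        raise ValueError(f"grader did not return Y or N: {verdict!r}")
    return verdict == "Y"


def fake_grader(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "The answer names the right author.\nY"


print(grade("Who wrote Hamlet?", "William Shakespeare", fake_grader))  # True
```

Raising on an unparseable verdict, rather than counting it as a failure, keeps grader formatting problems separate from genuine model failures.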

Writing Custom Evals in Python

For complex evaluation logic, subclass evals.Eval:

import json

import evals
import evals.metrics
from evals.eval import Eval
from evals.record import RecorderBase


class JSONFormatEval(Eval):
    """Tests that the model returns a JSON object containing the required keys."""

    def __init__(self, completion_fns, samples_jsonl, schema, **kwargs):
        super().__init__(completion_fns, **kwargs)
        self.samples_jsonl = samples_jsonl
        self.schema = schema

    def eval_sample(self, sample, rng):
        prompt = sample["input"]
        result = self.completion_fn(prompt)
        output = result.get_completions()[0]

        try:
            parsed = json.loads(output)
            if not isinstance(parsed, dict):
                raise ValueError("Top-level JSON value is not an object")
            # Validate required keys (explicit check; assert is stripped under -O)
            missing = [k for k in self.schema.get("required", []) if k not in parsed]
            if missing:
                raise ValueError(f"Missing keys: {missing}")
            # The raw output never equals the string "valid", so record the pass directly
            evals.record.record_match(True, expected="valid", picked="valid")
        except ValueError as e:  # json.JSONDecodeError is a subclass of ValueError
            evals.record.record_match(False, expected="valid", picked=str(e))

    def run(self, recorder: RecorderBase):
        samples = evals.get_jsonl(self.samples_jsonl)
        self.eval_all_samples(recorder, samples)
        return {"accuracy": evals.metrics.get_accuracy(recorder.get_events("match"))}

Register the class in a YAML spec so it can be run by name:

json_format:
  id: json_format.v0
  description: Validates model outputs are valid JSON

json_format.v0:
  class: myevals.json_format:JSONFormatEval
  args:
    samples_jsonl: data/json_format_cases.jsonl
    schema:
      required: [name, value, unit]
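The validation logic itself does not need the framework, which makes it easy to unit-test before wiring it into an eval class. A minimal sketch, assuming the same "required keys" notion of a schema used above; `check_json_output` is a hypothetical helper name.

```python
import json


def check_json_output(output: str, required: list[str]) -> tuple[bool, str]:
    """Return (passed, reason) for one model output."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(parsed, dict):
        return False, "top-level value is not an object"
    missing = [key for key in required if key not in parsed]
    if missing:
        return False, f"missing keys: {missing}"
    return True, "valid"


required = ["name", "value", "unit"]
print(check_json_output('{"name": "mass", "value": 3, "unit": "kg"}', required))
print(check_json_output('{"name": "mass"}', required))
print(check_json_output("not json", required))
```

Returning a reason alongside the boolean gives the eval something useful to record for failed samples.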

Using Evals in CI

# .github/workflows/llm-tests.yml
name: LLM Eval Tests

on: [push]

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install evals
      - name: Run factual accuracy eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
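        run: |
          set -o pipefail  # fail the step if oaieval itself fails, not just tee
          oaieval gpt-4o-mini my_qa 2>&1 | tee eval_output.txt
          python -c "
          import json, sys
          # Find the last output line that parses as a JSON object with an accuracy field
          accuracy = None
          for line in reversed(open('eval_output.txt').readlines()):
              try:
                  result = json.loads(line)
              except json.JSONDecodeError:
                  continue
              if isinstance(result, dict) and 'accuracy' in result:
                  accuracy = result['accuracy']
                  break
          # A missing result is a failure, not a silent pass
          if accuracy is None:
              sys.exit('No accuracy found in eval output')
          if accuracy < 0.90:
              sys.exit(f'Accuracy too low: {accuracy}')
          print('Eval passed')
          "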

Prompt Regression Testing

Create a baseline dataset from your current prompt, then run evals after every prompt change:

import openai
import json

def capture_baseline(prompts: list[str], output_file: str):
    """Run prompts against current model and save as eval dataset."""
    client = openai.OpenAI()
    with open(output_file, "w") as f:
        for prompt in prompts:
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
            )
            output = response.choices[0].message.content
            f.write(json.dumps({
                "input": [{"role": "user", "content": prompt}],
                "ideal": output,
            }) + "\n")

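# Capture baseline before prompt changes
with open("test_prompts.txt") as prompt_file:
    # Skip blank lines so empty prompts are not sent to the API
    prompts = [line.strip() for line in prompt_file if line.strip()]

capture_baseline(prompts=prompts, output_file="data/regression_baseline.jsonl")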

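Then register an eval (named regression_eval here) whose samples_jsonl points at data/regression_baseline.jsonl, and run oaieval new_model_or_prompt regression_eval to check for degradation.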
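Captured baselines are free-form text, so a strict string comparison is likely to flag harmless rewording as a regression. One option, sketched here with the standard library's `difflib`, is to flag only outputs that drift past a similarity threshold. The 0.8 threshold and the `find_regressions` helper are assumptions chosen for illustration; a model-graded comparison may be a better fit for longer answers.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a, b).ratio()


def find_regressions(
    baseline: dict[str, str],
    current: dict[str, str],
    min_similarity: float = 0.8,
) -> list[str]:
    """Return prompts whose new output drifted too far from the baseline."""
    drifted = []
    for prompt, old in baseline.items():
        new = current.get(prompt, "")
        if similarity(old, new) < min_similarity:
            drifted.append(prompt)
    return drifted


baseline = {"Who wrote Hamlet?": "William Shakespeare wrote Hamlet."}
current_ok = {"Who wrote Hamlet?": "William Shakespeare wrote Hamlet"}
current_bad = {"Who wrote Hamlet?": "I am not sure."}
print(find_regressions(baseline, current_ok))   # []
print(find_regressions(baseline, current_bad))  # ['Who wrote Hamlet?']
```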

Key Takeaways

Use exact-match evals for factual questions and structured outputs
Use model-graded evals for creative or open-ended responses
Capture baselines before changing prompts or upgrading models
Run evals in CI with a minimum accuracy threshold
Prefer small, focused datasets (50–200 samples) for fast iteration
