Testing LLM Applications with OpenAI Evals Framework

OpenAI Evals is a framework for evaluating large language model outputs. It lets you define test cases (called "evals") that measure whether a model produces correct, consistent, or appropriately formatted responses — and run them as automated checks in your development pipeline.

What Evals Test

Evals measure properties of LLM outputs:

  • Accuracy — does the model give the right answer?
  • Format compliance — does the output follow a required structure (JSON, code, markdown)?
  • Regression — did a model or prompt update degrade quality?
  • Consistency — does the same input reliably produce equivalent outputs?
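
As one illustration of the last property, consistency can be approximated by sampling the same prompt several times and measuring how often the outputs agree. The sketch below is not part of the Evals library; the helper names normalize and consistency_rate are hypothetical, and the normalization (lowercasing, collapsing whitespace) is an assumption about what counts as "equivalent".

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def consistency_rate(outputs: list[str]) -> float:
    """Fraction of outputs that agree with the most common normalized output."""
    if not outputs:
        raise ValueError("need at least one output")
    counts = Counter(normalize(o) for o in outputs)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(outputs)


# Three of the four runs agree after normalization.
print(consistency_rate(["Paris", "paris ", "Paris", "Lyon"]))  # 0.75
```

Exact agreement after normalization is a strict criterion; for longer free-form answers a looser comparison is likely to be more useful.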

Installing the Evals Library

pip install evals
# or from source
git clone https://github.com/openai/evals
cd evals
pip install -e .

Set your API key:

export OPENAI_API_KEY=sk-...

Anatomy of an Eval

An eval has two parts:

  1. A dataset — JSONL file of test cases, each with an input and an expected ideal output
  2. An eval spec — YAML that ties the dataset to an evaluation method (match, includes, model-graded)

Dataset format (data/qa_test.jsonl)

{"input": [{"role": "user", "content": "What is the capital of France?"}], "ideal": "Paris"}
{"input": [{"role": "user", "content": "What is 12 * 8?"}], "ideal": "96"}
{"input": [{"role": "user", "content": "Who wrote Hamlet?"}], "ideal": "Shakespeare"}
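
A malformed line in the dataset tends to surface only once a run is underway, so it can help to check the file first. The following is a minimal sketch that assumes the chat-style format shown above (an "input" list of messages with "role" and "content", plus an "ideal" field); validate_sample and validate_lines are hypothetical helpers, not functions from the Evals library.

```python
import json
from typing import Iterable


def validate_sample(line: str) -> list[str]:
    """Return the problems found in one JSONL line (empty list if none)."""
    try:
        sample = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(sample, dict):
        return ["top-level value must be an object"]

    problems = []
    messages = sample.get("input")
    if not isinstance(messages, list) or not messages:
        problems.append("'input' must be a non-empty list of messages")
    else:
        for i, msg in enumerate(messages):
            if not isinstance(msg, dict) or not {"role", "content"} <= set(msg):
                problems.append(f"message {i} needs 'role' and 'content'")
    if "ideal" not in sample:
        problems.append("missing 'ideal'")
    return problems


def validate_lines(lines: Iterable[str]) -> dict[int, list[str]]:
    """Map 1-based line numbers to problems, skipping blank lines."""
    bad = {}
    for lineno, line in enumerate(lines, start=1):
        if line.strip():
            problems = validate_sample(line)
            if problems:
                bad[lineno] = problems
    return bad
```

To check a file, pass an open file handle: with open("data/qa_test.jsonl", encoding="utf-8") as f: print(validate_lines(f)). An empty dict means every line passed.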

Eval spec (evals/registry/evals/my_qa.yaml)

my_qa:
  id: my_qa.v0
  description: Basic factual QA test
  metrics: [accuracy]

my_qa.v0:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: data/qa_test.jsonl

Running an Eval

oaieval gpt-4o my_qa

Example output for a 100-sample dataset (the exact log format depends on the evals version):

Final report:
{
  "accuracy": 0.97,
  "n": 100,
  "failures": 3
}

Eval Types

Exact Match

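Tests that the model's answer starts with the expected value. The comparison is a case-sensitive prefix check, so "Paris." passes against an ideal of "Paris", while "paris" and "The answer is Paris" do not: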

my_match_eval.v0:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: data/match_cases.jsonl

Includes

Tests that the expected string appears anywhere in the response:

my_includes_eval.v0:
  class: evals.elsuite.basic.includes:Includes
  args:
    samples_jsonl: data/includes_cases.jsonl
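
The difference between the two templates is easiest to see side by side. The functions below are standalone sketches of the comparison rules as I understand them (a prefix check for Match, a substring check for Includes); they are not the library's code, and prefix_match and includes are illustrative names.

```python
def prefix_match(completion: str, ideals: list[str]) -> bool:
    """True if the completion starts with any expected answer (case-sensitive)."""
    return any(completion.startswith(ideal) for ideal in ideals)


def includes(completion: str, ideals: list[str]) -> bool:
    """True if any expected answer appears anywhere in the completion."""
    return any(ideal in completion for ideal in ideals)


answer = "The capital of France is Paris."
print(prefix_match(answer, ["Paris"]))    # False: the answer does not begin with "Paris"
print(includes(answer, ["Paris"]))        # True
print(prefix_match("Paris.", ["Paris"]))  # True
```

In practice this means a prefix-style eval works best when the prompt instructs the model to answer with the bare value, while a substring-style eval tolerates surrounding explanation.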

Model-Graded

Uses a grader model (typically GPT-4) to evaluate open-ended outputs:

my_modelgraded_eval.v0:
  class: evals.elsuite.modelgraded.classify:ModelBasedClassify
  args:
    samples_jsonl: data/open_ended_cases.jsonl
    eval_type: cot_classify
    modelgraded_spec: closedqa

Writing Custom Evals in Python

For complex evaluation logic, subclass evals.Eval:

import json

import evals
import evals.metrics
from evals.eval import Eval
from evals.record import RecorderBase


class JSONFormatEval(Eval):
    """Tests that the model returns valid JSON containing the required keys."""

    def __init__(self, completion_fns, samples_jsonl, schema, **kwargs):
        super().__init__(completion_fns, **kwargs)
        self.samples_jsonl = samples_jsonl
        self.schema = schema

    def eval_sample(self, sample, rng):
        prompt = sample["input"]
        result = self.completion_fn(prompt)
        output = result.get_completions()[0]

        try:
            parsed = json.loads(output)
            if not isinstance(parsed, dict):
                raise ValueError("Top-level JSON value is not an object")
            # Validate required keys
            for key in self.schema.get("required", []):
                if key not in parsed:
                    raise ValueError(f"Missing key: {key}")
        except ValueError as e:  # json.JSONDecodeError is a subclass of ValueError
            evals.record.record_match(False, expected="valid", picked=str(e))
        else:
            # record_and_check_match would compare the output text against
            # "valid" and mark valid JSON as a failure, so record the pass directly.
            evals.record.record_match(True, expected="valid", picked="valid")

    def run(self, recorder: RecorderBase):
        samples = evals.get_jsonl(self.samples_jsonl)
        self.eval_all_samples(recorder, samples)
        return {"accuracy": evals.metrics.get_accuracy(recorder.get_events("match"))}
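
The validation logic itself does not depend on the framework, so it can be pulled into a plain function and unit-tested without making API calls. This is a sketch under that assumption; check_json_output is a hypothetical helper name, and it only checks for required top-level keys rather than implementing full JSON Schema validation.

```python
import json


def check_json_output(output: str, required: list[str]) -> tuple[bool, str]:
    """Return (ok, reason) for a single model output."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(parsed, dict):
        return False, "top-level value is not an object"
    missing = [key for key in required if key not in parsed]
    if missing:
        return False, "missing keys: " + ", ".join(missing)
    return True, "valid"


keys = ["name", "value", "unit"]
print(check_json_output('{"name": "g", "value": 9.8, "unit": "m/s^2"}', keys))
# (True, 'valid')
print(check_json_output('{"name": "g"}', keys))
# (False, 'missing keys: value, unit')
```

An eval_sample method could then call this helper and pass the boolean and the reason to the recorder.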

Register it in evals/registry/evals/json_format.yaml:

json_format:
  id: json_format.v0
  description: Validates model outputs are valid JSON

json_format.v0:
  class: myevals.json_format:JSONFormatEval
  args:
    samples_jsonl: data/json_format_cases.jsonl
    schema:
      required: [name, value, unit]

Using Evals in CI

# .github/workflows/llm-tests.yml
name: LLM Eval Tests

on: [push]

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install evals
      - name: Run factual accuracy eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
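        run: |
          set -o pipefail  # keep an oaieval failure from being masked by tee
          # Assumes oaieval's --record_path flag, which writes a JSONL log
          # that includes a line holding the final_report object.
          oaieval gpt-4o-mini my_qa --record_path eval_log.jsonl 2>&1 | tee eval_output.txt
          python -c "
          import json, sys
          # Find the final report in the JSONL record log
          report = None
          for line in open('eval_log.jsonl'):
              try:
                  record = json.loads(line)
              except json.JSONDecodeError:
                  continue
              if isinstance(record, dict) and 'final_report' in record:
                  report = record['final_report']
          # Fail closed: a missing report must not count as a pass
          if report is None or 'accuracy' not in report:
              sys.exit('No accuracy found in eval log')
          if report['accuracy'] < 0.90:
              sys.exit(f'Accuracy too low: {report[\"accuracy\"]}')
          print('Eval passed')
          "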

Prompt Regression Testing

Create a baseline dataset from your current prompt, then run evals after every prompt change:

import json

import openai


def capture_baseline(prompts: list[str], output_file: str) -> None:
    """Run prompts against the current model and save the outputs as an eval dataset."""
    client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(output_file, "w", encoding="utf-8") as f:
        for prompt in prompts:
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
                temperature=0,  # reduce run-to-run variation in the baseline
            )
            output = response.choices[0].message.content
            f.write(json.dumps({
                "input": [{"role": "user", "content": prompt}],
                "ideal": output,
            }) + "\n")


# Capture baseline before prompt changes (one prompt per line, blank lines skipped)
with open("test_prompts.txt", encoding="utf-8") as f:
    prompts = [line for line in f.read().splitlines() if line.strip()]

capture_baseline(prompts=prompts, output_file="data/regression_baseline.jsonl")

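Then register an eval (here called regression_eval) whose samples_jsonl points at data/regression_baseline.jsonl, and run oaieval new_model_or_prompt regression_eval to check for degradation. Free-form outputs rarely repeat word for word, so a model-graded eval is usually a better fit for this kind of baseline than a strict match.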
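
For a quick check outside the framework, a lexical similarity score can flag outputs that drifted noticeably from the baseline. The sketch below uses difflib from the standard library; the function names and the 0.8 threshold are assumptions chosen for illustration, and character-level similarity is only a rough proxy for meaning.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] after case and whitespace normalization."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def find_regressions(baseline: list[str], current: list[str],
                     threshold: float = 0.8) -> list[int]:
    """Indices of samples whose new output falls below the similarity threshold."""
    if len(baseline) != len(current):
        raise ValueError("baseline and current must have the same length")
    return [
        i for i, (old, new) in enumerate(zip(baseline, current))
        if similarity(old, new) < threshold
    ]


old_outputs = ["Paris is the capital of France.", "96"]
new_outputs = ["paris is the capital of  France.", "I cannot answer that."]
print(find_regressions(old_outputs, new_outputs))  # [1]
```

Flagged indices are candidates for manual review or a model-graded check rather than definitive failures, since a rephrased answer can still be correct.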

Key Takeaways

  • Use exact-match evals for factual questions and structured outputs
  • Use model-graded evals for creative or open-ended responses
  • Capture baselines before changing prompts or upgrading models
  • Run evals in CI with a minimum accuracy threshold
  • Prefer small, focused datasets (50–200 samples) for fast iteration
