LLM Output Regression Testing: Catch Prompt Regressions Before Users Do

Every time you change a prompt, update a model version, modify your retrieval pipeline, or adjust a system instruction, you've potentially introduced a regression. Unlike traditional software regressions — where a function returns the wrong value and a unit test catches it — LLM regressions are subtle. The output format might break. Responses might become less relevant for a specific question type. Hallucination rates might increase on edge cases. And none of this shows up in your application logs until a user complains.

LLM output regression testing is the practice of systematically detecting these quality drops before deployment. This guide shows you how to build a regression pipeline that gates deployments on quality.

The Core Problem

Traditional regression tests compare actual output to expected output. With LLMs, this approach fails immediately:

# WRONG: an exact-match assertion is brittle and will fail intermittently
def test_summarizer():
    result = summarize("The quick brown fox jumps over the lazy dog.")
    assert result == "A fox jumps over a dog."  # wording varies between runs, even when the summary is correct

LLM regression testing requires a different mental model: instead of comparing to a fixed expected string, you compare quality dimensions across two configurations. Did the correctness score drop? Did the response length distribution shift? Did the hallucination rate increase?
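As a minimal sketch of that mental model, the snippet below compares aggregate scores and output lengths across two runs instead of comparing strings. The function name `compare_quality`, the record shape, and both tolerances (0.05 score drop, 25% length shift) are illustrative assumptions, not values from any library.

```python
from statistics import mean

def compare_quality(baseline: list[dict], candidate: list[dict],
                    max_score_drop: float = 0.05,
                    max_length_shift: float = 0.25) -> dict:
    """Compare aggregate quality dimensions of two runs over the same cases.

    Each record is assumed to look like {"score": float, "output": str}.
    """
    base_score = mean(r["score"] for r in baseline)
    cand_score = mean(r["score"] for r in candidate)
    base_len = mean(len(r["output"].split()) for r in baseline)
    cand_len = mean(len(r["output"].split()) for r in candidate)

    # Relative change in mean output length (guard against an all-empty baseline)
    length_shift = abs(cand_len - base_len) / base_len if base_len else 0.0

    return {
        "score_delta": cand_score - base_score,
        "length_shift": length_shift,
        "regressed": (base_score - cand_score) > max_score_drop
                     or length_shift > max_length_shift,
    }

baseline = [{"score": 0.9, "output": "Paris is the capital of France."},
            {"score": 0.8, "output": "The refund window is 30 days."}]
candidate = [{"score": 0.9, "output": "Paris."},
             {"score": 0.5, "output": "Refunds: 60 days."}]
report = compare_quality(baseline, candidate)
print(report["regressed"])  # True
```

Here the candidate is flagged on both dimensions: the mean score falls by 0.15 and the mean length shrinks from six words to two.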

Building the Regression Dataset

The dataset is the foundation. It must:

  1. Cover your real traffic distribution — not just happy paths
  2. Include edge cases that exposed past failures
  3. Have ground truth labels for at least some examples
  4. Be stable: add examples, never remove them

# tests/fixtures/regression_dataset.py
REGRESSION_CASES = [
    # Happy path
    {
        "id": "rg-001",
        "input": "What is the capital of France?",
        "context": "France is a country in Western Europe. Paris is its capital and largest city.",
        "expected_contains": ["Paris"],
        "should_not_contain": ["London", "Berlin", "Rome"],
        "category": "factual",
    },
    # Refusal case — model should NOT hallucinate when context is missing
    {
        "id": "rg-002",
        "input": "What is the population of Mars?",
        "context": "Mars is the fourth planet from the Sun.",
        "expected_contains": ["don't", "cannot", "no information", "not available"],
        "category": "refusal",
    },
    # Format case — model should respect output format requirements
    {
        "id": "rg-003",
        "input": "List the top 3 features",
        "context": "Features: real-time sync, offline mode, dark theme, export to PDF",
        "output_must_match_schema": {
            "type": "array",
            "maxItems": 3,
        },
        "category": "format",
    },
    # Edge case — previously caused hallucination
    {
        "id": "rg-004",
        "input": "What's the refund policy for premium users?",
        "context": "Standard refund policy: 30 days. No special policy for premium.",
        "expected_contains": ["30 days"],
        "should_not_contain": ["60 days", "90 days", "premium users get"],
        "category": "hallucination_risk",
    },
]

Categorize your cases — this lets you track regressions by category and understand whether a prompt change hurt factual accuracy but helped format compliance.
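The dataset requirements listed earlier can themselves be checked automatically. The sketch below assumes a hypothetical helper `validate_dataset` and a stored set of case IDs from the previous dataset version (`previous_ids`); neither is part of the code shown in this guide.

```python
REQUIRED_KEYS = {"id", "input", "context", "category"}

def validate_dataset(cases: list[dict], previous_ids: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the dataset looks sound."""
    problems = []
    seen = set()
    for case in cases:
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            problems.append(f"{case.get('id', '<no id>')}: missing keys {sorted(missing)}")
        case_id = case.get("id")
        if case_id in seen:
            problems.append(f"{case_id}: duplicate id")
        seen.add(case_id)
    # Stability rule: cases may be added but never removed
    for removed in sorted(previous_ids - seen):
        problems.append(f"{removed}: present in previous snapshot but removed")
    return problems

cases = [
    {"id": "rg-001", "input": "What is the capital of France?",
     "context": "Paris is its capital.", "category": "factual"},
    {"id": "rg-001", "input": "List the top 3 features", "category": "format"},
]
problems = validate_dataset(cases, previous_ids={"rg-001", "rg-002"})
for problem in problems:
    print(problem)
```

Running a check like this before the suite keeps a malformed or shrunken dataset from silently inflating the pass rate.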

The Regression Runner

# scripts/regression_test.py
import json
import sys
from dataclasses import dataclass
from typing import Optional
from your_app.pipeline import run_pipeline

@dataclass
class RegressionResult:
    case_id: str
    category: str
    passed: bool
    score: float
    failure_reason: Optional[str]
    output: str

def evaluate_case(case: dict, output: str) -> RegressionResult:
    failures = []
    score = 1.0
    
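    # Check expected phrases. Refusal cases list alternative phrasings, so one
    # match is enough; every other category must contain all listed phrases.
    expected = case.get("expected_contains", [])
    missing = [p for p in expected if p.lower() not in output.lower()]
    if case.get("category") == "refusal":
        if expected and len(missing) == len(expected):
            failures.append(f"No refusal phrase found; expected one of: {expected}")
            score -= 0.3
    else:
        for phrase in missing:
            failures.append(f"Missing expected phrase: '{phrase}'")
            score -= 0.3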
    
    # Check should not contain
    for phrase in case.get("should_not_contain", []):
        if phrase.lower() in output.lower():
            failures.append(f"Contains forbidden phrase: '{phrase}'")
            score -= 0.5
    
    # Check output length is reasonable
    word_count = len(output.split())
    if word_count < 5:
        failures.append(f"Output too short: {word_count} words")
        score -= 0.4
    if word_count > 500:
        failures.append(f"Output too long: {word_count} words")
        score -= 0.2
    
    score = max(0, score)
    
    return RegressionResult(
        case_id=case["id"],
        category=case["category"],
        passed=len(failures) == 0,
        score=score,
        failure_reason="; ".join(failures) if failures else None,
        output=output,
    )

def run_regression_suite(cases: list[dict], fail_below: float = 0.85) -> bool:
    results = []
    
    for case in cases:
        output = run_pipeline(
            input=case["input"],
            context=case["context"],
        )
        result = evaluate_case(case, output)
        results.append(result)
        
        status = "PASS" if result.passed else "FAIL"
        print(f"[{status}] {result.case_id} ({result.category}): score={result.score:.2f}")
        if result.failure_reason:
            print(f"       Reason: {result.failure_reason}")
    
    # Summary by category
    categories = set(r.category for r in results)
    print("\n--- Category Summary ---")
    for cat in sorted(categories):
        cat_results = [r for r in results if r.category == cat]
        pass_rate = sum(1 for r in cat_results if r.passed) / len(cat_results)
        print(f"{cat}: {pass_rate:.0%} ({sum(1 for r in cat_results if r.passed)}/{len(cat_results)})")
    
    overall_pass_rate = sum(1 for r in results if r.passed) / len(results)
    print(f"\nOverall: {overall_pass_rate:.2%}")
    
    return overall_pass_rate >= fail_below

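if __name__ == "__main__":
    import argparse

    from tests.fixtures.regression_dataset import REGRESSION_CASES

    # Accept the threshold from the command line, e.g. --fail-below 0.85
    parser = argparse.ArgumentParser(description="Run the LLM regression suite")
    parser.add_argument("--fail-below", type=float, default=0.85)
    args = parser.parse_args()

    passed = run_regression_suite(REGRESSION_CASES, fail_below=args.fail_below)
    sys.exit(0 if passed else 1)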
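The format case in the dataset carries an `output_must_match_schema` key, which the evaluator above does not inspect. One way to cover it is sketched below. The helper `check_array_schema` is hypothetical and deliberately handles only the two schema keys used in that case (`type: array` and `maxItems`). A full JSON Schema validator such as the `jsonschema` package would be the more general choice.

```python
import json

def check_array_schema(output: str, schema: dict) -> list[str]:
    """Check a model output against a tiny subset of JSON Schema.

    Supports only {"type": "array", "maxItems": int}; other keys are ignored.
    """
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError as exc:
        return [f"Output is not valid JSON: {exc.msg}"]

    failures = []
    if schema.get("type") == "array" and not isinstance(parsed, list):
        failures.append(f"Expected a JSON array, got {type(parsed).__name__}")
    max_items = schema.get("maxItems")
    if max_items is not None and isinstance(parsed, list) and len(parsed) > max_items:
        failures.append(f"Array has {len(parsed)} items, maximum is {max_items}")
    return failures

schema = {"type": "array", "maxItems": 3}
print(check_array_schema('["real-time sync", "offline mode", "dark theme"]', schema))
print(check_array_schema('["a", "b", "c", "d"]', schema))
print(check_array_schema("Here are the top features: ...", schema))
```

The returned failure strings could be appended to the `failures` list in `evaluate_case` whenever a case defines `output_must_match_schema`.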

LLM-as-Judge for Semantic Correctness

String matching catches format and exact-fact regressions. For semantic quality, use an LLM judge:

import openai
import json

def llm_judge_correctness(
    question: str,
    context: str,
    output: str,
    expected: str,
) -> tuple[float, str]:
    """Returns (score 0-1, explanation)."""
    
    client = openai.OpenAI()
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an expert judge evaluating AI response quality. "
                    "Score the response from 0.0 to 1.0 based on:\n"
                    "- Factual accuracy relative to the context (most important)\n"
                    "- Whether it answers the question asked\n"
                    "- Absence of hallucinations (information not in context)\n\n"
                    "Respond ONLY with valid JSON: "
                    "{\"score\": 0.0-1.0, \"reason\": \"brief explanation\"}"
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Question: {question}\n\n"
                    f"Context: {context}\n\n"
                    f"Response to evaluate: {output}\n\n"
                    f"Reference answer: {expected}"
                ),
            },
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    
    result = json.loads(response.choices[0].message.content)
    return result["score"], result["reason"]

# Add to your regression runner
def evaluate_case_with_llm_judge(case: dict, output: str) -> RegressionResult:
    # Run string-based checks first (fast, cheap)
    base_result = evaluate_case(case, output)
    
    # Add LLM judge if the case has an expected answer
    if "expected_answer" in case:
        judge_score, judge_reason = llm_judge_correctness(
            question=case["input"],
            context=case["context"],
            output=output,
            expected=case["expected_answer"],
        )
        
        # Blend scores: 60% LLM judge, 40% string checks
        blended_score = 0.6 * judge_score + 0.4 * base_result.score
        
        return RegressionResult(
            case_id=base_result.case_id,
            category=base_result.category,
            passed=blended_score >= 0.7,
            score=blended_score,
            failure_reason=f"LLM judge: {judge_reason}" if blended_score < 0.7 else None,
            output=output,
        )
    
    return base_result
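The judge's reply is itself model output, so it may occasionally be malformed or out of range. The following is a sketch of a defensive parser that could sit between the API call and the score blending. The name `parse_judge_reply` and the choice to score unparseable replies as 0.0 are assumptions made for illustration.

```python
import json

def parse_judge_reply(raw: str) -> tuple[float, str]:
    """Parse a judge reply of the form {"score": ..., "reason": ...}.

    Returns (score clamped to [0, 1], reason). Malformed replies score 0.0 so
    they surface as failures instead of crashing the suite.
    """
    try:
        data = json.loads(raw)
        score = float(data["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0.0, f"Unparseable judge reply: {raw[:80]!r}"
    reason = str(data.get("reason", ""))
    return min(1.0, max(0.0, score)), reason

print(parse_judge_reply('{"score": 0.8, "reason": "Accurate and grounded"}'))
print(parse_judge_reply('{"score": 1.7, "reason": "Out of range"}'))
print(parse_judge_reply("Sure! Here is my evaluation..."))
```

Scoring a broken reply as 0.0 is conservative: it forces someone to look at the case, at the cost of an occasional false failure.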

Comparing Two Configurations

The most powerful use of regression testing is comparing configurations before deployment:

# scripts/compare_configs.py
from scripts.regression_test import evaluate_case  # assumes the repo root is on PYTHONPATH
from your_app.pipeline import run_pipeline_v1, run_pipeline_v2
from tests.fixtures.regression_dataset import REGRESSION_CASES

def compare_configurations():
    print("Running v1 and v2 on the regression dataset...\n")
    
    v1_results = []
    v2_results = []
    
    for case in REGRESSION_CASES:
        v1_output = run_pipeline_v1(case["input"], case["context"])
        v2_output = run_pipeline_v2(case["input"], case["context"])
        
        v1_result = evaluate_case(case, v1_output)
        v2_result = evaluate_case(case, v2_output)
        
        v1_results.append(v1_result)
        v2_results.append(v2_result)
        
        # Flag regressions and improvements
        if v1_result.passed and not v2_result.passed:
            print(f"⚠️  REGRESSION [{case['id']}]: v1 passed, v2 failed")
            print(f"   v2 failure: {v2_result.failure_reason}")
        elif not v1_result.passed and v2_result.passed:
            print(f"✅ IMPROVEMENT [{case['id']}]: v2 fixed a v1 failure")
    
    v1_rate = sum(1 for r in v1_results if r.passed) / len(v1_results)
    v2_rate = sum(1 for r in v2_results if r.passed) / len(v2_results)
    
    print(f"\nv1 pass rate: {v1_rate:.2%}")
    print(f"v2 pass rate: {v2_rate:.2%}")
    print(f"Delta: {(v2_rate - v1_rate):+.2%}")
    
    regressions = sum(
        1 for v1, v2 in zip(v1_results, v2_results)
        if v1.passed and not v2.passed
    )
    
    if regressions > 0:
        print(f"\n❌ {regressions} regressions detected. Do not deploy v2.")
        return False
    
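    # With zero pass/fail regressions the pass rate cannot drop, so compare mean
    # scores to catch cases that got worse without flipping pass/fail.
    v1_mean = sum(r.score for r in v1_results) / len(v1_results)
    v2_mean = sum(r.score for r in v2_results) / len(v2_results)
    if v2_mean < v1_mean - 0.05:
        print(f"\n❌ v2 mean score dropped {(v1_mean - v2_mean):.2f}. Do not deploy.")
        return False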
    
    print("\n✅ No regressions. v2 is safe to deploy.")
    return True
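An overall pass rate can hide a trade between categories. The sketch below breaks the comparison down per category. The function names and the simplified record shape (`{"category", "passed"}`) are assumptions chosen to keep the example self-contained.

```python
from collections import defaultdict

def pass_rate_by_category(results: list[dict]) -> dict[str, float]:
    """Map each category to its pass rate. Records: {"category": str, "passed": bool}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        passes[r["category"]] += int(r["passed"])
    return {cat: passes[cat] / totals[cat] for cat in totals}

def category_deltas(v1: list[dict], v2: list[dict]) -> dict[str, float]:
    """Pass-rate change per category (positive means v2 improved)."""
    r1, r2 = pass_rate_by_category(v1), pass_rate_by_category(v2)
    return {cat: r2.get(cat, 0.0) - r1[cat] for cat in sorted(r1)}

v1 = [{"category": "factual", "passed": True}, {"category": "factual", "passed": True},
      {"category": "format", "passed": False}, {"category": "format", "passed": True}]
v2 = [{"category": "factual", "passed": True}, {"category": "factual", "passed": False},
      {"category": "format", "passed": True}, {"category": "format", "passed": True}]
deltas = category_deltas(v1, v2)
print(deltas)  # {'factual': -0.5, 'format': 0.5}
```

Both runs pass three of four cases, yet v2 traded factual accuracy for format compliance, which only the per-category view reveals.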

CI Pipeline Integration

# .github/workflows/llm-regression.yml
name: LLM Regression Tests

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'app/pipeline.py'
      - 'app/retrieval.py'

jobs:
  regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      
      - run: pip install -r requirements.txt
      
      - name: Run LLM regression suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python scripts/regression_test.py --fail-below 0.85
      
      - name: Compare against main
        if: github.event_name == 'pull_request'
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          # Check out the base branch and run v1
          git fetch --depth=1 origin "${{ github.base_ref }}"
          git checkout FETCH_HEAD
          python scripts/run_and_save_results.py --output /tmp/v1_results.json
          
          # Return to the PR commit and run v2
          git checkout "${{ github.sha }}"
          python scripts/run_and_save_results.py --output /tmp/v2_results.json
          
          # Compare
          python scripts/compare_configs.py \
            --v1 /tmp/v1_results.json \
            --v2 /tmp/v2_results.json \
            --fail-on-regression
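The workflow refers to `scripts/run_and_save_results.py` and to file-based flags on `compare_configs.py`, neither of which is shown in this guide. The following is one possible sketch of the persistence and comparison logic behind them. All function names and the JSON record shape (`case_id`, `passed`, `score`) are assumptions.

```python
import json
import tempfile
from pathlib import Path

def save_results(results: list[dict], path: str) -> None:
    """Write per-case results ({"case_id", "passed", "score"}) as JSON."""
    Path(path).write_text(json.dumps(results, indent=2), encoding="utf-8")

def load_results(path: str) -> dict[str, dict]:
    """Load results and index them by case_id."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    return {r["case_id"]: r for r in records}

def find_regressions(v1_path: str, v2_path: str) -> list[str]:
    """Return case IDs that passed in v1 but fail (or are missing) in v2."""
    v1, v2 = load_results(v1_path), load_results(v2_path)
    return sorted(
        case_id for case_id, r in v1.items()
        if r["passed"] and not v2.get(case_id, {}).get("passed", False)
    )

with tempfile.TemporaryDirectory() as tmp:
    v1_file, v2_file = f"{tmp}/v1_results.json", f"{tmp}/v2_results.json"
    save_results([{"case_id": "rg-001", "passed": True, "score": 1.0},
                  {"case_id": "rg-004", "passed": True, "score": 1.0}], v1_file)
    save_results([{"case_id": "rg-001", "passed": True, "score": 1.0},
                  {"case_id": "rg-004", "passed": False, "score": 0.5}], v2_file)
    regressions = find_regressions(v1_file, v2_file)
print(regressions)  # ['rg-004']
```

A CI step built on this would exit non-zero whenever `find_regressions` returns a non-empty list.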

Monitoring Production Regression

Regression testing in CI catches regressions from code changes. But models change without your code changing. Set up continuous evaluation on production traffic:

# In your application code
import asyncio
import random
from your_app.evaluators import async_evaluate

# Hold references to in-flight evaluation tasks so they are not garbage-collected
background_tasks = set()

async def handle_request(request):
    response = await pipeline.run(request.input, request.context)
    
    # Sample 5% of production requests for quality evaluation
    if random.random() < 0.05:
        task = asyncio.create_task(
            async_evaluate(
                input=request.input,
                context=request.context,
                output=response,
                metadata={"user_tier": request.user.tier},
            )
        )
        background_tasks.add(task)
        task.add_done_callback(background_tasks.discard)
    
    return response

Alert when the sampled quality score drops below baseline. This catches silent model degradation that CI can't detect.
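A rolling average is one simple way to implement that alert. The sketch below is illustrative only: the class `QualityMonitor`, the window size, the minimum sample count, and the 0.05 tolerance are all assumed values that would need tuning against real traffic.

```python
from collections import deque

class QualityMonitor:
    """Track a rolling mean of sampled quality scores and flag drops below baseline."""

    def __init__(self, baseline: float, tolerance: float = 0.05,
                 window: int = 200, min_samples: int = 30):
        self.baseline = baseline
        self.tolerance = tolerance
        self.min_samples = min_samples
        self.scores = deque(maxlen=window)  # oldest scores fall off automatically

    def record(self, score: float) -> bool:
        """Add a score; return True if an alert should fire."""
        self.scores.append(score)
        if len(self.scores) < self.min_samples:
            return False  # too few samples to judge
        rolling_mean = sum(self.scores) / len(self.scores)
        return rolling_mean < self.baseline - self.tolerance

monitor = QualityMonitor(baseline=0.9, tolerance=0.05, window=5, min_samples=5)
healthy = [monitor.record(0.9) for _ in range(5)]
degraded = [monitor.record(0.6) for _ in range(5)]
print(any(healthy), degraded[-1])  # False True
```

Requiring a minimum number of samples avoids paging someone over a single low-scoring response.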

What Regression Testing Doesn't Replace

LLM regression tests operate at the pipeline level — they verify that your AI produces good outputs. They don't verify that:

  • The feature renders correctly in the browser
  • Error states display user-friendly messages
  • The application handles the AI response correctly when it's in an unexpected format

For that, use end-to-end browser testing with HelpMeTest. Regression tests guard the model quality. End-to-end tests guard the user experience. Both are necessary.

Summary

LLM output regression testing requires moving from "does the output match a fixed string?" to "did quality drop compared to baseline?" Build a curated dataset that covers your real traffic distribution, write evaluators that check semantic correctness and format compliance, run comparisons between old and new configurations in CI, and sample production traffic continuously to catch silent model drift. Building this pipeline takes roughly a week of work; deploying a silent regression that erodes user trust costs far more.
