LLM Output Regression Testing: Catch Prompt Regressions Before Users Do
Every time you change a prompt, update a model version, modify your retrieval pipeline, or adjust a system instruction, you've potentially introduced a regression. Unlike traditional software regressions — where a function returns the wrong value and a unit test catches it — LLM regressions are subtle. The output format might break. Responses might become less relevant for a specific question type. Hallucination rates might increase on edge cases. And none of this shows up in your application logs until a user complains.
LLM output regression testing is the practice of systematically detecting these quality drops before deployment. This guide shows you how to build a regression pipeline that gates deployments on quality.
The Core Problem
Traditional regression tests compare actual output to expected output. With LLMs, this approach fails immediately:
# WRONG: this test will fail constantly
def test_summarizer():
    result = summarize("The quick brown fox jumps over the lazy dog.")
    assert result == "A fox jumps over a dog."  # LLM wording varies from run to run

LLM regression testing requires a different mental model: instead of comparing to a fixed expected string, you compare quality dimensions across two configurations. Did the correctness score drop? Did the response length distribution shift? Did the hallucination rate increase?
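One way to make this concrete is to aggregate per-case measurements for each configuration and compare them dimension by dimension. The sketch below assumes a hypothetical `RunStats` container and illustrative thresholds (`max_score_drop`, `max_length_shift`, `max_hallucination_rise`); none of these names come from a library, and sensible threshold values depend on your own data.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass
class RunStats:
    """Per-case measurements for one configuration (hypothetical structure)."""
    scores: list[float]        # correctness score per case, 0.0 to 1.0
    word_counts: list[int]     # response length per case
    hallucinated: list[bool]   # True if the case was flagged as a hallucination


def rate(flags: list[bool]) -> float:
    """Fraction of True values in a list of flags."""
    return sum(flags) / len(flags)


def compare_dimensions(
    baseline: RunStats,
    candidate: RunStats,
    max_score_drop: float = 0.05,          # illustrative threshold
    max_length_shift: float = 0.25,        # illustrative threshold
    max_hallucination_rise: float = 0.02,  # illustrative threshold
) -> list[str]:
    """Return one message per regressed dimension; an empty list means none."""
    problems = []

    # Did the correctness score drop?
    score_drop = mean(baseline.scores) - mean(candidate.scores)
    if score_drop > max_score_drop:
        problems.append(f"mean correctness dropped by {score_drop:.2f}")

    # Did the response length distribution shift?
    base_len = median(baseline.word_counts)
    new_len = median(candidate.word_counts)
    length_shift = abs(new_len - base_len) / base_len if base_len else 0.0
    if length_shift > max_length_shift:
        problems.append(f"median length shifted by {length_shift:.0%}")

    # Did the hallucination rate increase?
    rise = rate(candidate.hallucinated) - rate(baseline.hallucinated)
    if rise > max_hallucination_rise:
        problems.append(f"hallucination rate rose by {rise:.0%}")

    return problems
```

Comparing aggregates rather than strings means a reworded but equally good answer does not count as a failure, while a real shift in any dimension does.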
Building the Regression Dataset
The dataset is the foundation. It must:
- Cover your real traffic distribution — not just happy paths
- Include edge cases that exposed past failures
- Have ground truth labels for at least some examples
- Be stable — add examples, never remove them
# tests/fixtures/regression_dataset.py
REGRESSION_CASES = [
    # Happy path
    {
        "id": "rg-001",
        "input": "What is the capital of France?",
        "context": "France is a country in Western Europe. Paris is its capital and largest city.",
        "expected_contains": ["Paris"],
        "should_not_contain": ["London", "Berlin", "Rome"],
        "category": "factual",
    },
    # Refusal case: model should NOT hallucinate when context is missing
    {
        "id": "rg-002",
        "input": "What is the population of Mars?",
        "context": "Mars is the fourth planet from the Sun.",
        "expected_contains": ["don't", "cannot", "no information", "not available"],
        "category": "refusal",
    },
    # Format case: model should respect output format requirements
    {
        "id": "rg-003",
        "input": "List the top 3 features",
        "context": "Features: real-time sync, offline mode, dark theme, export to PDF",
        "output_must_match_schema": {
            "type": "array",
            "maxItems": 3,
        },
        "category": "format",
    },
    # Edge case: previously caused hallucination
    {
        "id": "rg-004",
        "input": "What's the refund policy for premium users?",
        "context": "Standard refund policy: 30 days. No special policy for premium.",
        "expected_contains": ["30 days"],
        "should_not_contain": ["60 days", "90 days", "premium users get"],
        "category": "hallucination_risk",
    },
]

Categorize your cases. This lets you track regressions by category and understand whether a prompt change hurt factual accuracy but helped format compliance.
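The dataset requirements above (stable ids, examples that are only ever added, a category on every case) can themselves be checked automatically. The following is a minimal sketch, not part of the original pipeline: `validate_dataset`, `category_counts`, `REQUIRED_KEYS`, and the idea of passing in a set of previously known ids are all assumptions made for illustration.

```python
from collections import Counter

# Keys every case in the dataset above carries (assumed to be mandatory)
REQUIRED_KEYS = {"id", "input", "context", "category"}


def validate_dataset(cases: list[dict], previous_ids=frozenset()) -> list[str]:
    """Return a list of problems found in the dataset; empty means it is valid.

    previous_ids is the set of case ids from the last accepted version of the
    dataset, so that removed cases can be detected.
    """
    errors = []
    seen = set()
    for case in cases:
        case_id = case.get("id", "<no id>")
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            errors.append(f"{case_id}: missing keys {sorted(missing)}")
        if case_id in seen:
            errors.append(f"duplicate id: {case_id}")
        seen.add(case_id)

    # The dataset should only grow: flag ids that existed before but are gone
    for case_id in sorted(set(previous_ids) - seen):
        errors.append(f"case removed from dataset: {case_id}")
    return errors


def category_counts(cases: list[dict]) -> dict[str, int]:
    """Count cases per category to spot categories with thin coverage."""
    return dict(Counter(case.get("category", "<uncategorized>") for case in cases))
```

Running a check like this before the suite itself keeps a silently shrinking or malformed dataset from inflating the pass rate.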
The Regression Runner
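# scripts/regression_test.py
import argparse
import json
import sys
from dataclasses import dataclass
from typing import Optional

from your_app.pipeline import run_pipeline


@dataclass
class RegressionResult:
    case_id: str
    category: str
    passed: bool
    score: float
    failure_reason: Optional[str]
    output: str


def evaluate_case(case: dict, output: str) -> RegressionResult:
    failures = []
    score = 1.0
    lowered = output.lower()

    # Check expected phrases. Refusal cases list alternative phrasings, so one
    # match is enough; every other category requires all listed phrases.
    expected = case.get("expected_contains", [])
    missing = [p for p in expected if p.lower() not in lowered]
    if case["category"] == "refusal":
        if expected and len(missing) == len(expected):
            failures.append(f"Missing all expected phrases: {expected}")
            score -= 0.3
    else:
        for phrase in missing:
            failures.append(f"Missing expected phrase: '{phrase}'")
            score -= 0.3

    # Check forbidden phrases
    for phrase in case.get("should_not_contain", []):
        if phrase.lower() in lowered:
            failures.append(f"Contains forbidden phrase: '{phrase}'")
            score -= 0.5

    # Check output format. This is a minimal check that covers only the
    # "array" type and "maxItems" keywords used in the dataset, not full
    # JSON Schema validation.
    schema = case.get("output_must_match_schema")
    if schema and schema.get("type") == "array":
        try:
            parsed = json.loads(output)
        except json.JSONDecodeError:
            parsed = None
        if not isinstance(parsed, list):
            failures.append("Output is not a JSON array")
            score -= 0.5
        elif len(parsed) > schema.get("maxItems", len(parsed)):
            failures.append(f"Too many items: {len(parsed)}")
            score -= 0.5

    # Check output length is reasonable
    word_count = len(output.split())
    if word_count < 5:
        failures.append(f"Output too short: {word_count} words")
        score -= 0.4
    if word_count > 500:
        failures.append(f"Output too long: {word_count} words")
        score -= 0.2

    score = max(0.0, score)
    return RegressionResult(
        case_id=case["id"],
        category=case["category"],
        passed=len(failures) == 0,
        score=score,
        failure_reason="; ".join(failures) if failures else None,
        output=output,
    )


def run_regression_suite(cases: list[dict], fail_below: float = 0.85) -> bool:
    results = []
    for case in cases:
        output = run_pipeline(
            input=case["input"],
            context=case["context"],
        )
        result = evaluate_case(case, output)
        results.append(result)
        status = "PASS" if result.passed else "FAIL"
        print(f"[{status}] {result.case_id} ({result.category}): score={result.score:.2f}")
        if result.failure_reason:
            print(f"  Reason: {result.failure_reason}")

    # Summary by category
    categories = set(r.category for r in results)
    print("\n--- Category Summary ---")
    for cat in sorted(categories):
        cat_results = [r for r in results if r.category == cat]
        cat_passed = sum(1 for r in cat_results if r.passed)
        pass_rate = cat_passed / len(cat_results)
        print(f"{cat}: {pass_rate:.0%} ({cat_passed}/{len(cat_results)})")

    overall_pass_rate = sum(1 for r in results if r.passed) / len(results)
    print(f"\nOverall: {overall_pass_rate:.2%}")
    return overall_pass_rate >= fail_below


if __name__ == "__main__":
    from tests.fixtures.regression_dataset import REGRESSION_CASES

    # Parse --fail-below so the threshold passed on the command line takes effect
    parser = argparse.ArgumentParser(description="Run the LLM regression suite")
    parser.add_argument("--fail-below", type=float, default=0.85,
                        help="minimum overall pass rate required to exit 0")
    args = parser.parse_args()

    passed = run_regression_suite(REGRESSION_CASES, fail_below=args.fail_below)
    sys.exit(0 if passed else 1)

LLM-as-Judge for Semantic Correctness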
String matching catches format and exact-fact regressions. For semantic quality, use an LLM judge:
import json

import openai


def llm_judge_correctness(
    question: str,
    context: str,
    output: str,
    expected: str,
) -> tuple[float, str]:
    """Returns (score 0-1, explanation)."""
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an expert judge evaluating AI response quality. "
                    "Score the response from 0.0 to 1.0 based on:\n"
                    "- Factual accuracy relative to the context (most important)\n"
                    "- Whether it answers the question asked\n"
                    "- Absence of hallucinations (information not in context)\n\n"
                    "Respond ONLY with valid JSON: "
                    "{\"score\": 0.0-1.0, \"reason\": \"brief explanation\"}"
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Question: {question}\n\n"
                    f"Context: {context}\n\n"
                    f"Response to evaluate: {output}\n\n"
                    f"Reference answer: {expected}"
                ),
            },
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    result = json.loads(response.choices[0].message.content)
    # The judge may return the score as a string or an int; normalize to float
    return float(result["score"]), result["reason"]


# Add to your regression runner
def evaluate_case_with_llm_judge(case: dict, output: str) -> RegressionResult:
    # Run string-based checks first (fast, cheap)
    base_result = evaluate_case(case, output)

    # Add LLM judge if the case has an expected answer
    if "expected_answer" in case:
        judge_score, judge_reason = llm_judge_correctness(
            question=case["input"],
            context=case["context"],
            output=output,
            expected=case["expected_answer"],
        )
        # Blend scores: 60% LLM judge, 40% string checks
        blended_score = 0.6 * judge_score + 0.4 * base_result.score
        return RegressionResult(
            case_id=base_result.case_id,
            category=base_result.category,
            passed=blended_score >= 0.7,
            score=blended_score,
            failure_reason=f"LLM judge: {judge_reason}" if blended_score < 0.7 else None,
            output=output,
        )
    return base_result

Comparing Two Configurations
The most powerful use of regression testing is comparing configurations before deployment:
# scripts/compare_configs.py
import sys

# Assumes the repository root is on PYTHONPATH so that scripts/ is importable
from scripts.regression_test import evaluate_case
from tests.fixtures.regression_dataset import REGRESSION_CASES
from your_app.pipeline import run_pipeline_v1, run_pipeline_v2


def compare_configurations() -> bool:
    print("Running v1 and v2 on the regression dataset...\n")
    v1_results = []
    v2_results = []
    for case in REGRESSION_CASES:
        v1_output = run_pipeline_v1(case["input"], case["context"])
        v2_output = run_pipeline_v2(case["input"], case["context"])
        v1_result = evaluate_case(case, v1_output)
        v2_result = evaluate_case(case, v2_output)
        v1_results.append(v1_result)
        v2_results.append(v2_result)

        # Flag regressions and improvements
        if v1_result.passed and not v2_result.passed:
            print(f"⚠️ REGRESSION [{case['id']}]: v1 passed, v2 failed")
            print(f"  v2 failure: {v2_result.failure_reason}")
        elif not v1_result.passed and v2_result.passed:
            print(f"✅ IMPROVEMENT [{case['id']}]: v2 fixed a v1 failure")

    v1_rate = sum(1 for r in v1_results if r.passed) / len(v1_results)
    v2_rate = sum(1 for r in v2_results if r.passed) / len(v2_results)
    print(f"\nv1 pass rate: {v1_rate:.2%}")
    print(f"v2 pass rate: {v2_rate:.2%}")
    print(f"Delta: {(v2_rate - v1_rate):+.2%}")

    regressions = sum(
        1 for v1, v2 in zip(v1_results, v2_results)
        if v1.passed and not v2.passed
    )
    if regressions > 0:
        print(f"\n❌ {regressions} regressions detected. Do not deploy v2.")
        return False

    # With no per-case regressions the pass rate cannot drop, so compare mean
    # scores instead: this catches cases that fail in both versions but score
    # lower in v2.
    v1_mean = sum(r.score for r in v1_results) / len(v1_results)
    v2_mean = sum(r.score for r in v2_results) / len(v2_results)
    if v2_mean < v1_mean - 0.05:
        print(f"\n❌ v2 mean score dropped {(v1_mean - v2_mean):.2f}. Do not deploy.")
        return False

    print("\n✅ No regressions. v2 is safe to deploy.")
    return True


if __name__ == "__main__":
    sys.exit(0 if compare_configurations() else 1)

CI Pipeline Integration
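The workflow in this section passes results between two checkouts as JSON files, but the guide does not define the file format or the `run_and_save_results.py` script that writes them. The sketch below shows one way the file handling could look. The record shape (`case_id`, `passed`, `score`) and the helper names `save_results`, `load_results`, and `find_regressions` are assumptions for illustration, not the author's implementation.

```python
import json
from pathlib import Path


def save_results(results: list[dict], path: str) -> None:
    """Write per-case results to a JSON file (hypothetical format)."""
    Path(path).write_text(json.dumps(results, indent=2), encoding="utf-8")


def load_results(path: str) -> dict[str, dict]:
    """Load saved results and key them by case id."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    return {record["case_id"]: record for record in records}


def find_regressions(v1_path: str, v2_path: str) -> list[str]:
    """Return ids of cases that passed in v1 but fail, or are missing, in v2."""
    v1 = load_results(v1_path)
    v2 = load_results(v2_path)
    regressions = []
    for case_id, old in v1.items():
        new = v2.get(case_id)
        if old["passed"] and (new is None or not new["passed"]):
            regressions.append(case_id)
    return sorted(regressions)
```

A comparison script invoked with a flag such as `--fail-on-regression` could then exit non-zero whenever `find_regressions` returns a non-empty list. Treating a case that is missing from the v2 file as a regression keeps a crashed or truncated run from passing silently.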
# .github/workflows/llm-regression.yml
name: LLM Regression Tests

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'app/pipeline.py'
      - 'app/retrieval.py'

jobs:
  regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # fetch full history so origin/main can be checked out
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - name: Run LLM regression suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python scripts/regression_test.py --fail-below 0.85
      - name: Compare against main
        if: github.event_name == 'pull_request'
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          # Check out main and run v1 (git stash would do nothing on a clean CI checkout)
          git checkout origin/main
          python scripts/run_and_save_results.py --output /tmp/v1_results.json
          # Return to the PR commit and run v2
          git checkout "${{ github.sha }}"
          python scripts/run_and_save_results.py --output /tmp/v2_results.json
          # Compare
          python scripts/compare_configs.py \
            --v1 /tmp/v1_results.json \
            --v2 /tmp/v2_results.json \
            --fail-on-regression

Monitoring Production Regression
Regression testing in CI catches regressions caused by code changes. But a hosted model can change behind the same API without any change to your code. Set up continuous evaluation on production traffic:
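# In your application code
import asyncio
import random

from your_app import pipeline  # assumed location of the pipeline object
from your_app.evaluators import async_evaluate

# Hold references to in-flight evaluation tasks so they are not
# garbage-collected before they finish
_background_tasks = set()


async def handle_request(request):
    response = await pipeline.run(request.input, request.context)

    # Sample 5% of production requests for quality evaluation
    if random.random() < 0.05:
        task = asyncio.create_task(
            async_evaluate(
                input=request.input,
                context=request.context,
                output=response,
                metadata={"user_tier": request.user.tier},
            )
        )
        _background_tasks.add(task)
        task.add_done_callback(_background_tasks.discard)

    return response

Alert when the sampled quality score drops below baseline. This catches silent model degradation that CI can't detect.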
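The alerting step can be sketched as a rolling mean over the most recent sampled scores, compared against a fixed baseline. `QualityMonitor` and its parameters (`tolerance`, `window`, `min_samples`) are hypothetical names with illustrative defaults; a real deployment would more likely push the scores to an existing metrics system and alert from there.

```python
from collections import deque


class QualityMonitor:
    """Tracks sampled quality scores and signals when they fall below baseline."""

    def __init__(self, baseline: float, tolerance: float = 0.05,
                 window: int = 200, min_samples: int = 50):
        self.baseline = baseline        # expected mean score, e.g. from CI runs
        self.tolerance = tolerance      # allowed drop before alerting
        self.min_samples = min_samples  # avoid alerting on a handful of samples
        self.scores = deque(maxlen=window)  # keeps only the most recent scores

    def record(self, score: float) -> bool:
        """Add one sampled score; return True if an alert should fire."""
        self.scores.append(score)
        if len(self.scores) < self.min_samples:
            return False
        rolling_mean = sum(self.scores) / len(self.scores)
        return rolling_mean < self.baseline - self.tolerance
```

The `min_samples` guard trades detection speed for fewer false alarms: at a 5% sampling rate a low-traffic service may need a smaller window to react in reasonable time.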
What Regression Testing Doesn't Replace
LLM regression tests operate at the pipeline level — they verify that your AI produces good outputs. They don't verify that:
- The feature renders correctly in the browser
- Error states display user-friendly messages
- The application handles the AI response correctly when it's in an unexpected format
For that, use end-to-end browser testing with HelpMeTest. Regression tests guard the model quality. End-to-end tests guard the user experience. Both are necessary.
Summary
LLM output regression testing requires moving from "does the output match a fixed string?" to "did quality drop compared to baseline?" Build a curated dataset that covers your real traffic distribution, write evaluators that check semantic correctness and format compliance, run comparisons between old and new configurations in CI, and sample production traffic continuously to catch silent model drift. Building this pipeline takes roughly a week; deploying a silent regression that erodes user trust costs far more.