Regression Testing for LLM-Powered Applications

LLM regressions are silent killers: the app still returns a response, but the quality degraded after a prompt change, model upgrade, or context window modification. Traditional regression tests that assert exact string equality fail immediately with LLMs. This guide covers behavioral regression testing, prompt versioning, LLM-as-judge evaluation, and CI gates for LLM-powered applications.

Traditional software regression: you change a function, run tests, and if they fail you've introduced a regression. Easy to detect.

LLM regression: you change a prompt, run your app, and it still returns responses. They look fine. But response quality dropped 15%, factual accuracy degraded, the model started hallucinating on edge cases — and you won't notice for weeks because users suffer in silence instead of filing bug reports.

This is the core challenge of LLM regression testing: regressions are often subtle, probabilistic, and not caught by tests that just check "did the app return something."

What Causes LLM Regressions

Understanding the root causes helps you design the right regression tests:

Prompt changes — Even small edits to system prompts can significantly alter output style, accuracy, or behavior. Adding "Be concise" can cause the model to omit important details.

Model upgrades — Moving from GPT-4o to GPT-4o-mini, or from claude-sonnet-4-6 to claude-haiku-4-5, changes behavior. The new model might be faster and cheaper but handle edge cases differently.

Context changes — Adding retrieved documents (RAG), changing the conversation history format, or modifying tool schemas all change what the model sees.

Temperature/sampling changes — Higher temperature means more creative but less consistent outputs.

Token limit changes — Truncating context to save costs can silently cut off critical information.
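
The last two causes live in configuration rather than prompts, which makes them easy to change without anyone noticing. One way to keep them visible is to centralize generation settings in a single versioned object that tests can assert against. A minimal sketch, assuming a hypothetical GenerationConfig; the field names and default values are illustrative:

# app/llm_config.py (hypothetical module)
from dataclasses import dataclass

@dataclass(frozen=True)
class GenerationConfig:
    model: str = "claude-sonnet-4-6"
    temperature: float = 0.2          # low temperature keeps regression runs consistent
    max_tokens: int = 1024            # output budget
    max_context_tokens: int = 8000    # input truncation limit; silent regressions hide here

CURRENT_CONFIG = GenerationConfig()

def test_generation_config_pinned():
    """Fail loudly if sampling or truncation settings change without updating the tests."""
    assert CURRENT_CONFIG.temperature <= 0.3
    assert CURRENT_CONFIG.max_context_tokens >= 8000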

Building a Behavioral Test Suite

Behavioral tests assert properties of outputs rather than exact strings:

# tests/regression/test_summarizer.py
import pytest
from app.summarizer import summarize_article

# Test cases: input, expected behavioral properties
REGRESSION_CASES = [
    {
        "id": "basic-news",
        "input": open("fixtures/news_article_100words.txt").read(),
        "max_length": 50,  # words
        "must_contain": [],
        "must_not_contain": ["I don't know", "As an AI"],
        "must_be_factual": True,
        "tone": "neutral"
    },
    {
        "id": "technical-doc",
        "input": open("fixtures/api_documentation.txt").read(),
        "max_length": 100,
        "must_contain": ["API", "endpoint"],
        "must_not_contain": [],
        "must_be_factual": True,
        "tone": "neutral"
    }
]

def count_words(text: str) -> int:
    return len(text.split())

@pytest.mark.parametrize("case", REGRESSION_CASES, ids=[c["id"] for c in REGRESSION_CASES])
def test_summarizer_regression(case):
    result = summarize_article(case["input"])

    # Length constraint
    assert count_words(result) <= case["max_length"], \
        f"Summary too long: {count_words(result)} words, max {case['max_length']}"

    # Required terms
    for term in case.get("must_contain", []):
        assert term.lower() in result.lower(), f"Summary missing required term: {term}"

    # Forbidden patterns
    for pattern in case.get("must_not_contain", []):
        assert pattern.lower() not in result.lower(), \
            f"Summary contains forbidden pattern: {pattern}"

These assertions are deterministic checks on measurable properties: cheap to run and unambiguous when they fail. The model's output can still vary between runs, so keep temperature low for regression fixtures to make the outputs themselves as stable as possible.

LLM-as-Judge Evaluation

For properties that can't be measured directly (accuracy, relevance, helpfulness), use another LLM to score the output:

import json
from anthropic import Anthropic

client = Anthropic()

def evaluate_with_llm(
    original_input: str,
    llm_output: str,
    criteria: list[str]
) -> dict:
    """Use Claude to evaluate output quality."""
    
    evaluation_prompt = f"""
You are a quality evaluator. Evaluate the following AI output against each criterion.

INPUT:
{original_input}

OUTPUT TO EVALUATE:
{llm_output}

CRITERIA:
{json.dumps(criteria, indent=2)}

For each criterion, rate it 1-5 (1=poor, 5=excellent) and explain why.
Return JSON with this structure:
{{
  "scores": {{"criterion_name": score}},
  "explanations": {{"criterion_name": "reason"}},
  "overall": score
}}
"""
    
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": evaluation_prompt}]
    )
    
    return json.loads(response.content[0].text)


def test_chatbot_quality_regression():
    test_question = "How do I reset my password?"
    response = chatbot.answer(test_question)
    
    evaluation = evaluate_with_llm(
        original_input=test_question,
        llm_output=response,
        criteria=[
            "accuracy: Does the response correctly explain password reset steps?",
            "clarity: Is the response easy to understand?",
            "completeness: Does it cover the main ways to reset (email, phone, support)?",
            "safety: Does it avoid security anti-patterns like 'share your old password'?"
        ]
    )
    
    assert evaluation["scores"]["accuracy"] >= 4, \
        f"Accuracy score too low: {evaluation['scores']['accuracy']}. {evaluation['explanations']['accuracy']}"
    assert evaluation["scores"]["safety"] >= 5, \
        f"Safety score too low: {evaluation['scores']['safety']}. {evaluation['explanations']['safety']}"
    assert evaluation["overall"] >= 3.5, \
        f"Overall score too low: {evaluation['overall']}"

LLM-as-judge works well but adds cost and non-determinism. Run it on a curated set of 20-50 high-stakes test cases, not every possible input.
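
One way to tame that non-determinism is to run the judge more than once per case and aggregate the scores before asserting. A minimal sketch building on the evaluate_with_llm function above; the run count and median aggregation are choices, not a fixed recipe:

import statistics

def evaluate_with_llm_stable(original_input: str, llm_output: str,
                             criteria: list[str], runs: int = 3) -> dict:
    """Run the judge several times and take the median overall score to smooth out variance."""
    evaluations = [
        evaluate_with_llm(original_input, llm_output, criteria)
        for _ in range(runs)
    ]
    overall = statistics.median(e["overall"] for e in evaluations)
    # Keep the individual evaluations around for debugging borderline failures
    return {"overall": overall, "runs": evaluations}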

Prompt Versioning

Version your prompts like code. Changes to prompts are changes to your application — treat them that way:

# prompts/versions.py
SUMMARIZER_PROMPTS = {
    "v1.0": """
You are a text summarizer. Summarize the following article in {max_words} words or fewer.
Article: {article}
""",
    "v1.1": """
You are a professional editor. Create a {max_words}-word executive summary of the following article.
Focus on key facts, decisions, and outcomes. Omit filler phrases.
Article: {article}
""",
    "v2.0": """
Summarize this article in exactly {max_words} words or fewer.
Requirements:
- Start with the most important fact
- Include specific numbers and names when present
- Use active voice
- Do not begin with "The article" or "This piece"

Article: {article}
"""
}

CURRENT_VERSION = "v2.0"

Before deploying a new prompt version, run your regression suite against both the old and new versions:

def test_prompt_upgrade_does_not_regress():
    old_version = "v1.1"
    new_version = "v2.0"
    
    results = {}
    for version in [old_version, new_version]:
        prompt = SUMMARIZER_PROMPTS[version]
        results[version] = []
        
        for case in REGRESSION_CASES:
            summary = summarize_with_prompt(
                article=case["input"],
                prompt_template=prompt,
                max_words=case["max_length"]
            )
            # assumes each case also defines a "criteria" list for the judge
            criteria = case.get("criteria", ["accuracy", "clarity", "completeness"])
            score = evaluate_with_llm(case["input"], summary, criteria)
            results[version].append(score["overall"])
    
    old_avg = sum(results[old_version]) / len(results[old_version])
    new_avg = sum(results[new_version]) / len(results[new_version])
    
    # New version must not degrade average quality by more than 0.2 points
    assert new_avg >= old_avg - 0.2, \
        f"Prompt upgrade degraded quality: {old_avg:.2f}{new_avg:.2f}"
    
    print(f"Quality comparison: {old_version}={old_avg:.2f}, {new_version}={new_avg:.2f}")

Model Upgrade Testing

When upgrading to a new LLM model, run a comparison suite before switching in production:

# scripts/compare_models.py
import statistics
from app.llm import create_client

MODELS_TO_COMPARE = [
    "claude-opus-4-6",
    "claude-sonnet-4-6",
    "claude-haiku-4-5-20251001"
]

def compare_models_on_suite(test_cases: list, evaluator) -> dict:
    results = {model: [] for model in MODELS_TO_COMPARE}
    
    for case in test_cases:
        for model in MODELS_TO_COMPARE:
            client = create_client(model=model)
            output = client.generate(case["prompt"])
            score = evaluator.score(case["input"], output, case["criteria"])
            results[model].append({
                "score": score,
                "latency_ms": client.last_latency_ms,
                "cost_usd": client.last_cost_usd
            })
    
    # Summary
    summary = {}
    for model, runs in results.items():
        scores = [r["score"] for r in runs]
        summary[model] = {
            "avg_quality": statistics.mean(scores),
            "min_quality": min(scores),
            "avg_latency_ms": statistics.mean(r["latency_ms"] for r in runs),
            "total_cost_usd": sum(r["cost_usd"] for r in runs)
        }
    
    return summary

Run this before committing to a model upgrade. The output shows quality vs cost vs latency tradeoffs across models.
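
A quick way to review the results, assuming a TEST_CASES list and an evaluator object defined elsewhere in your project:

if __name__ == "__main__":
    summary = compare_models_on_suite(TEST_CASES, evaluator)
    for model, stats in summary.items():
        print(f"{model}: quality={stats['avg_quality']:.2f} "
              f"(min {stats['min_quality']:.2f}), "
              f"latency={stats['avg_latency_ms']:.0f}ms, "
              f"cost=${stats['total_cost_usd']:.2f}")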

Baseline Snapshots

For outputs that are hard to evaluate programmatically, maintain baseline snapshots:

# tests/regression/test_snapshots.py
import pytest
from pathlib import Path

from app.descriptions import generate_product_description  # assumed import path; adjust to your app

SNAPSHOT_DIR = Path("tests/snapshots")

def save_snapshot(name: str, output: str):
    (SNAPSHOT_DIR / f"{name}.txt").write_text(output)

def load_snapshot(name: str) -> str | None:
    path = SNAPSHOT_DIR / f"{name}.txt"
    return path.read_text() if path.exists() else None

def test_product_description_snapshot():
    """Check that product description hasn't changed significantly."""
    input_data = {
        "product": "Wireless headphones",
        "features": ["noise cancelling", "30h battery", "USB-C"],
        "tone": "professional"
    }
    
    current_output = generate_product_description(**input_data)
    snapshot = load_snapshot("product-description-v2")
    
    if snapshot is None:
        # First run: save the snapshot
        save_snapshot("product-description-v2", current_output)
        pytest.skip("Snapshot created — run again to compare")
    
    # Compare: allow up to 20% word difference
    snapshot_words = set(snapshot.lower().split())
    current_words = set(current_output.lower().split())
    overlap = len(snapshot_words & current_words)
    similarity = overlap / max(len(snapshot_words), len(current_words))
    
    assert similarity >= 0.8, \
        f"Output diverged too much from snapshot (similarity: {similarity:.1%})"

Commit snapshots to git. When you intentionally change output behavior, update the snapshot and commit it as part of the same PR.
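
To make intentional updates explicit rather than ad hoc, you can gate snapshot overwrites behind an environment variable. A sketch building on the helpers above; UPDATE_SNAPSHOTS is an assumed convention, not a pytest feature:

import os

def assert_matches_snapshot(name: str, current_output: str, min_similarity: float = 0.8):
    """Compare output against the stored snapshot, or overwrite it when UPDATE_SNAPSHOTS=1."""
    if os.environ.get("UPDATE_SNAPSHOTS") == "1":
        save_snapshot(name, current_output)  # intentional change: commit the new snapshot file
        return
    snapshot = load_snapshot(name)
    assert snapshot is not None, f"No snapshot named {name}; run with UPDATE_SNAPSHOTS=1 to create it"
    snapshot_words = set(snapshot.lower().split())
    current_words = set(current_output.lower().split())
    similarity = len(snapshot_words & current_words) / max(len(snapshot_words), len(current_words))
    assert similarity >= min_similarity, \
        f"Output diverged from snapshot (similarity: {similarity:.1%})"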

CI Integration

Structure LLM regression tests in layers by speed and cost:

# .github/workflows/llm-regression.yml
name: LLM Regression Tests

on:
  push:
    paths:
      - 'prompts/**'
      - 'app/llm/**'
  schedule:
    - cron: '0 2 * * *'  # Nightly

jobs:
  behavioral-fast:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: pytest tests/regression/behavioral -v
        # No LLM calls — only property assertions on deterministic mocks

  llm-judge-evaluation:
    runs-on: ubuntu-latest
    needs: behavioral-fast
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: pytest tests/regression/llm_judge -v --timeout=300
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        # LLM-as-judge on 20-50 curated cases

  snapshot-comparison:
    runs-on: ubuntu-latest
    needs: behavioral-fast
    if: github.event_name == 'schedule'  # Only nightly
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: pytest tests/regression/snapshots -v
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Setting Quality Gates

Define numeric quality gates that must pass before deployment:

# tests/regression/quality_gates.py
QUALITY_GATES = {
    "summarizer": {
        "min_accuracy_score": 3.8,  # out of 5
        "max_hallucination_rate": 0.05,  # 5%
        "max_refused_responses": 0.01,  # 1%
    },
    "support_chatbot": {
        "min_accuracy_score": 4.0,
        "min_safety_score": 4.8,  # Higher bar for user-facing safety
        "max_hallucination_rate": 0.02,
    },
    "code_reviewer": {
        "min_relevance_score": 3.5,
        "max_false_positive_rate": 0.15,
        "min_catch_rate": 0.70,  # Must catch 70% of known bugs
    }
}

These gates turn regression testing from a manual review into an automated pass/fail decision. When a change degrades quality below a gate, the CI fails and deployment is blocked.
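
A sketch of how enforcement might look in a test, assuming your evaluation pipeline produces a dict of measured metrics whose names mirror the gate keys (the run_summarizer_evaluation helper is hypothetical):

def check_quality_gates(component: str, measured: dict) -> list[str]:
    """Return a list of gate violations; an empty list means the change can ship."""
    violations = []
    for gate, threshold in QUALITY_GATES[component].items():
        metric = gate.removeprefix("min_").removeprefix("max_")
        value = measured[metric]
        if gate.startswith("min_") and value < threshold:
            violations.append(f"{metric}={value} is below the minimum {threshold}")
        elif gate.startswith("max_") and value > threshold:
            violations.append(f"{metric}={value} is above the maximum {threshold}")
    return violations

def test_summarizer_quality_gates():
    measured = run_summarizer_evaluation()  # hypothetical: returns e.g. {"accuracy_score": 4.1, ...}
    violations = check_quality_gates("summarizer", measured)
    assert not violations, "Quality gate failures:\n" + "\n".join(violations)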

Conclusion

Regression testing for LLM applications requires a different mindset than traditional software testing. The key shifts: assert behavioral properties instead of exact strings, version prompts like code, use LLM-as-judge for quality that's hard to measure directly, maintain baseline snapshots for complex outputs, and run automated comparison before model upgrades. With these patterns in place, you can iterate on prompts and models with confidence that regressions will be caught before they reach users.
