Prompt Regression Testing and Version Control for LLM Applications

Prompts are code. Changing a single word in a system prompt can break a feature that was working fine. Without version control and regression testing for prompts, every prompt change is a gamble. This guide covers how to version prompts in Git, run regression evals on every prompt diff, integrate those evals into CI, and track performance trends over time.

Key Takeaways

Prompts belong in version control, not in the database. Prompts stored only in a database or admin UI can't be diff'd, reviewed in PRs, or reverted atomically with the code that uses them.

Diff-based eval testing is more efficient than full eval runs. Only run the test cases most likely to be affected by the specific change — then run the full suite before merge.

Performance metrics should trend, not just pass/fail. A PR that moves relevance from 0.82 to 0.79 is worth flagging even if both scores "pass." Track deltas, not just thresholds.

Prompt versioning needs semantic versions, not just git hashes. A semantic version scheme (1.0 → 1.1 for tweaks, 2.0 for major rewrites) lets you reason about compatibility between prompt and application code.

Never A/B test prompts in production without eval coverage. Shadow mode (run both, compare offline) is safer than live traffic splitting for consequential decisions.
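The semantic-versioning takeaway can be made concrete with a small compatibility check. The sketch below is hypothetical: `PromptVersion`, `is_compatible`, and the rule that application code pins a major version are illustrative assumptions, not part of any library or of the registry shown later.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class PromptVersion:
    """Hypothetical MAJOR.MINOR version for a prompt."""
    major: int
    minor: int

    @classmethod
    def parse(cls, text: str) -> "PromptVersion":
        major, minor = text.split(".")
        return cls(int(major), int(minor))


def is_compatible(prompt: PromptVersion, required_major: int, min_minor: int = 0) -> bool:
    """Assumed rule: same major version, and at least the minor the code was written against."""
    return prompt.major == required_major and prompt.minor >= min_minor


# Application code written against prompt 1.x
print(is_compatible(PromptVersion.parse("1.1"), required_major=1))  # True
print(is_compatible(PromptVersion.parse("2.0"), required_major=1))  # False
```

Under this assumed rule, a tweak (1.0 to 1.1) can ship without touching application code, while a major rewrite (2.0) forces an explicit code change.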

Why Prompts Are Code

A system prompt change that moves "You are a helpful assistant" to "You are a concise, professional assistant" can:

  • Break tests that expected verbose, friendly responses
  • Change the format of responses in ways that break downstream parsing
  • Shift the model's behavior on edge cases in unpredictable ways

Despite this, most teams manage prompts as strings in a database with no diff history, no review process, and no regression testing. A bug introduced by a prompt change can take days to trace.
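To see how a wording change can break downstream code, consider a parser that depends on the response layout. This is an illustrative sketch only: `extract_ticket_id` and both sample responses are made up for the example.

```python
import re
from typing import Optional


def extract_ticket_id(response: str) -> Optional[str]:
    """Downstream parser that assumes the model writes 'Ticket ID: <id>' on its own line."""
    match = re.search(r"^Ticket ID: (\w+-\d+)$", response, flags=re.MULTILINE)
    return match.group(1) if match else None


# Response style under the original, more verbose prompt (made-up sample)
verbose_response = "Thanks for reaching out!\nTicket ID: SUP-1042\nWe'll follow up soon."
# Response style after asking for a "concise, professional" tone (made-up sample)
concise_response = "Logged as SUP-1042. We'll follow up soon."

print(extract_ticket_id(verbose_response))  # SUP-1042
print(extract_ticket_id(concise_response))  # None
```

Nothing in the application code changed, yet the parser silently returns `None`. That is the kind of failure a format-stability eval is meant to catch before merge.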

The fix: treat prompts exactly like code.

Versioning Prompts in Git

Directory Structure

prompts/
├── system/
│   ├── customer_support_v1.txt
│   ├── customer_support_v2.txt      # Current production
│   └── customer_support_v3_draft.txt # In development
├── user_templates/
│   ├── query_with_context.jinja2
│   └── summarization.jinja2
├── CHANGELOG.md
└── registry.yaml                    # Maps feature flags to prompt versions

Registry Pattern

# prompts/registry.yaml
prompts:
  customer_support:
    production: v2
    canary: v3_draft  # 10% of traffic
    versions:
      v1:
        file: system/customer_support_v1.txt
        released: 2026-01-15
        deprecated: 2026-03-01
        performance:
          relevance: 0.71
          faithfulness: 0.84
      v2:
        file: system/customer_support_v2.txt
        released: 2026-03-01
        performance:
          relevance: 0.79
          faithfulness: 0.87
      v3_draft:
        file: system/customer_support_v3_draft.txt
        released: null
        performance: null  # Not yet measured

# src/prompt_registry.py
import yaml
from pathlib import Path

class PromptRegistry:
    def __init__(self, registry_path: str = "prompts/registry.yaml"):
        self._registry = yaml.safe_load(Path(registry_path).read_text())
    
    def get_prompt(self, name: str, version: str = "production") -> str:
        prompt_config = self._registry["prompts"][name]
        version_key = prompt_config[version]  # e.g. "v2"
        file_path = prompt_config["versions"][version_key]["file"]
        return Path(f"prompts/{file_path}").read_text()
    
    def get_active_version(self, name: str) -> str:
        return self._registry["prompts"][name]["production"]
    
    def get_performance_baseline(self, name: str, version: str = "production") -> dict:
        version_key = self._registry["prompts"][name][version]
        # `performance: null` in the YAML loads as None, so fall back to an empty dict
        return self._registry["prompts"][name]["versions"][version_key].get("performance") or {}

# Usage
registry = PromptRegistry()
system_prompt = registry.get_prompt("customer_support")  # Gets production version
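Because the registry is hand-edited, a typo in an alias or file path only surfaces when `get_prompt` raises at runtime. A minimal consistency check, assuming the registry layout shown above, might look like this. `validate_registry` is a hypothetical helper, not part of the original registry class.

```python
from pathlib import Path


def validate_registry(registry: dict, prompts_dir: Path) -> list[str]:
    """Return a list of human-readable problems; an empty list means the registry is consistent."""
    problems = []
    for name, config in registry.get("prompts", {}).items():
        versions = config.get("versions", {})
        # Every alias (production, canary) must point at a declared version
        for alias in ("production", "canary"):
            target = config.get(alias)
            if target is not None and target not in versions:
                problems.append(f"{name}: alias '{alias}' points to unknown version '{target}'")
        # Every declared version must reference a file that exists on disk
        for version, meta in versions.items():
            if not (prompts_dir / meta["file"]).is_file():
                problems.append(f"{name}: {version} file '{meta['file']}' not found")
    return problems
```

One way to use it is to load `prompts/registry.yaml` with `yaml.safe_load`, pass the result and `Path("prompts")` to `validate_registry`, and fail a unit test if the returned list is non-empty.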

Prompt Change Workflow

# 1. Create a branch for prompt changes
git checkout -b feat/improve-support-prompt

# 2. Edit the prompt file
vim prompts/system/customer_support_v3_draft.txt

# 3. Update the registry
vim prompts/registry.yaml  # Set canary: v3_draft

# 4. Run regression evals locally (pass the registry alias, not the version key)
pytest evals/test_prompt_regression.py --prompt-version=canary

# 5. If evals pass, open a PR; CI will run the full eval suite
git add prompts/
git commit -m "feat(prompts): improve customer support tone in v3"
git push

Changelog Convention

# Prompt Changelog

## customer_support v3 (2026-05-19)

### Changed
- Softened opening tone: "I can definitely help with that" → "I'd be happy to help"
- Added instruction to always confirm user's issue before providing solution
- Reduced max response length guidance from 200 to 150 words

### Why
- v2 relevance score was 0.79 (target 0.82)
- User research showed responses felt curt (NPS 32 → improving)
- Long responses had 40% less scroll depth than short ones

### Performance (measured on core eval set, 500 samples)
- Relevance: 0.82 (+0.03 vs v2)
- Faithfulness: 0.88 (+0.01 vs v2)
- Coherence: 0.79 (-0.01 vs v2, within tolerance)

Diff-Based Regression Testing

Running the full eval suite on every prompt change is expensive. Start with targeted tests:

# evals/test_prompt_regression.py
import json
import re
import subprocess
from pathlib import Path

import pytest

# Assumes src/ is importable as a package. generate_with_prompt, evaluate_relevance,
# and matches_expected_format are project-specific helpers defined elsewhere.
from src.prompt_registry import PromptRegistry

def get_changed_prompts() -> list[str]:
    """Detect which prompt files changed in this PR."""
    result = subprocess.run(
        ["git", "diff", "--name-only", "origin/main...HEAD"],
        capture_output=True, text=True, check=True
    )
    # splitlines() returns [] for empty output, unlike split('\n')
    changed_files = result.stdout.splitlines()
    return [f for f in changed_files if f.startswith("prompts/")]

def get_affected_test_cases(changed_prompts: list[str]) -> list[dict]:
    """Load test cases most relevant to the changed prompts."""
    # Strip the version suffix: customer_support_v3_draft -> customer_support
    prompt_names = {
        re.sub(r"_v\d+.*$", "", Path(prompt_file).stem)
        for prompt_file in changed_prompts
    }
    
    affected = []
    # Read the dataset once and keep cases tagged for any changed prompt
    with open("evals/datasets/core.jsonl") as f:
        for line in f:
            example = json.loads(line)
            if example.get("prompt_context") in prompt_names:
                affected.append(example)
    
    return affected

class TestPromptRegression:
    
    @pytest.fixture(scope="class")
    def changed_prompts(self):
        return get_changed_prompts()
    
    @pytest.fixture(scope="class")
    def affected_cases(self, changed_prompts):
        return get_affected_test_cases(changed_prompts)
    
    @pytest.fixture(scope="class")
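    def version_to_test(self, request):
        # pytest.config was removed in pytest 5; read options through the request fixture.
        # Assumes --prompt-version is registered via pytest_addoption in conftest.py.
        return request.config.getoption("--prompt-version", default="canary")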
    
    def test_no_relevance_regression(self, affected_cases, version_to_test):
        """Test affected cases with new prompt version vs production."""
        registry = PromptRegistry()
        
        prod_scores = []
        new_scores = []
        
        for case in affected_cases:
            prod_response = generate_with_prompt(
                case["input"],
                registry.get_prompt("customer_support", version="production")
            )
            new_response = generate_with_prompt(
                case["input"],
                registry.get_prompt("customer_support", version=version_to_test)
            )
            
            prod_scores.append(evaluate_relevance(case["input"], prod_response))
            new_scores.append(evaluate_relevance(case["input"], new_response))
        
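        if not prod_scores:
            # Avoid dividing by zero when no test case is tagged for the changed prompts
            pytest.skip("No affected test cases for the changed prompts")
        
        prod_mean = sum(prod_scores) / len(prod_scores)
        new_mean = sum(new_scores) / len(new_scores)
        delta = new_mean - prod_mean
        
        print(f"Relevance: prod={prod_mean:.3f} new={new_mean:.3f} delta={delta:+.3f}")
        
        # The tolerance is an absolute score difference, not a percentage
        assert delta >= -0.05, (
            f"Relevance regression: {delta:+.3f}. "
            "New prompt scores more than 0.05 below production on affected test cases."
        )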
    
    def test_response_format_stability(self, affected_cases, version_to_test):
        """Catch format changes that could break downstream parsers."""
        registry = PromptRegistry()
        
        format_regressions = []
        
        for case in affected_cases:
            if "expected_format" not in case:
                continue
            
            new_response = generate_with_prompt(
                case["input"],
                registry.get_prompt("customer_support", version=version_to_test)
            )
            
            if not matches_expected_format(new_response, case["expected_format"]):
                format_regressions.append({
                    "input": case["input"],
                    "expected_format": case["expected_format"],
                    "actual": new_response[:200]
                })
        
        assert not format_regressions, (
            f"Format regression in {len(format_regressions)} cases:\n"
            + json.dumps(format_regressions[:2], indent=2)
        )
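pytest rejects command-line flags it does not know about, so the `--prompt-version` flag used above has to be registered before the tests can read it. A minimal `conftest.py` sketch, assuming it sits next to the test file in `evals/` (the default value and help text are illustrative):

```python
# evals/conftest.py (sketch)
def pytest_addoption(parser):
    """Register --prompt-version so pytest accepts the flag."""
    parser.addoption(
        "--prompt-version",
        action="store",
        default="canary",
        help="Registry alias of the prompt version to evaluate (e.g. production, canary)",
    )
```

`pytest_addoption` is a standard pytest hook, and pytest picks it up automatically from `conftest.py`. With it in place, `pytest evals/test_prompt_regression.py --prompt-version=canary` parses cleanly.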

A single pass/fail gate misses slow drift. Recording every eval run and comparing recent scores against earlier ones reveals gradual degradation:

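# evals/tracking.py
import json
import subprocess
from datetime import datetime
from pathlib import Path

METRICS_HISTORY_FILE = Path("evals/.metrics_history.jsonl")

def get_git_hash() -> str:
    """Return the current commit hash (assumes the code runs inside a git checkout)."""
    return subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()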

def record_eval_run(
    prompt_name: str,
    prompt_version: str,
    metrics: dict[str, float],
    git_hash: str | None = None
):
    """Append metrics to history for trend analysis."""
    entry = {
        "timestamp": datetime.now().isoformat(),
        "prompt_name": prompt_name,
        "prompt_version": prompt_version,
        "git_hash": git_hash or get_git_hash(),
        "metrics": metrics
    }
    
    with METRICS_HISTORY_FILE.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def get_score_trend(
    prompt_name: str,
    metric: str,
    last_n_runs: int = 20
) -> list[dict]:
    """Return the last N runs for trend analysis."""
    history = []
    with METRICS_HISTORY_FILE.open() as f:
        for line in f:
            entry = json.loads(line)
            # Skip runs that did not record this metric so the means never see None
            if entry["prompt_name"] == prompt_name and metric in entry["metrics"]:
                history.append({
                    "timestamp": entry["timestamp"],
                    "version": entry["prompt_version"],
                    "score": entry["metrics"][metric]
                })
    
    return history[-last_n_runs:]

def detect_trend_degradation(
    prompt_name: str,
    metric: str,
    window: int = 5,
    threshold: float = 0.03
) -> dict:
    """
    Compare the mean of the last `window` runs with the mean of the `window` runs before them.
    Returns a warning if the mean has dropped by more than `threshold`.
    """
    history = get_score_trend(prompt_name, metric, last_n_runs=window * 2)
    
    if len(history) < window * 2:
        return {"status": "insufficient_data"}
    
    recent = [h["score"] for h in history[-window:]]
    earlier = [h["score"] for h in history[-window * 2:-window]]
    
    recent_mean = sum(recent) / len(recent)
    earlier_mean = sum(earlier) / len(earlier)
    
    delta = recent_mean - earlier_mean
    
    if delta < -threshold:
        return {
            "status": "degrading",
            "delta": delta,
            "recent_mean": recent_mean,
            "earlier_mean": earlier_mean,
            "warning": f"{metric} has degraded by {delta:.3f} over last {window} runs"
        }
    
    return {"status": "stable", "delta": delta}
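The function above compares two window means. If you want an actual per-run slope instead, a least-squares fit over the recent scores is one option. This is a sketch of that alternative, not part of the tracking module above; `least_squares_slope` and the sample scores are illustrative.

```python
def least_squares_slope(scores: list[float]) -> float:
    """Slope of the best-fit line through (run index, score); negative means scores are falling."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to fit a slope")
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(scores))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance


# Made-up relevance scores from six consecutive eval runs
scores = [0.82, 0.81, 0.81, 0.80, 0.79, 0.78]
slope = least_squares_slope(scores)
print(f"slope per run: {slope:+.4f}")
```

A slope uses every point in the window, so it is less sensitive to where the two halves happen to be split. In exchange, it can be pulled around by a single outlier run.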

CI Integration for Prompt PRs

# .github/workflows/prompt-eval.yml
name: Prompt Regression Eval

on:
  pull_request:
    paths:
      - 'prompts/**'

jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      changed_prompts: ${{ steps.detect.outputs.changed }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      
      - id: detect
        run: |
          CHANGED=$(git diff --name-only origin/main...HEAD | grep '^prompts/' | jq -R -s -c 'split("\n")[:-1]')
          echo "changed=$CHANGED" >> $GITHUB_OUTPUT
  
  prompt-regression:
    needs: detect-changes
    runs-on: ubuntu-latest
    if: needs.detect-changes.outputs.changed_prompts != '[]'
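    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # the tests run `git diff origin/main...HEAD`, which needs full history
      
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      
      - name: Install dependencies
        run: pip install -r requirements.txt  # assumes the repo keeps its deps in requirements.txt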
      
      - name: Run targeted regression evals
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          CHANGED_PROMPTS: ${{ needs.detect-changes.outputs.changed_prompts }}
        run: |
          pytest evals/test_prompt_regression.py \
            --prompt-version=canary \
            -v --tb=short
      
      - name: Compare scores and post PR comment
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python scripts/post_eval_comment.py \
            --pr ${{ github.event.number }} \
            --prompt-version canary

# scripts/post_eval_comment.py
"""Post eval comparison as PR comment."""
import os

import requests

# evaluate_version(version, sample_size) is a project-specific helper assumed to
# return a dict of metric name -> float score; it is not defined in this snippet.

def compare_versions_and_comment(pr_number: int, new_version: str):
    # Run both versions on sample
    prod_metrics = evaluate_version("production", sample_size=50)
    new_metrics = evaluate_version(new_version, sample_size=50)
    
    # Build markdown table
    rows = []
    for metric in ["relevance", "faithfulness", "coherence"]:
        prod_val = prod_metrics.get(metric, "N/A")
        new_val = new_metrics.get(metric, "N/A")
        
        if isinstance(prod_val, float) and isinstance(new_val, float):
            delta = new_val - prod_val
            status = "✅" if delta >= -0.02 else "⚠️" if delta >= -0.05 else "❌"
            rows.append(f"| {metric} | {prod_val:.3f} | {new_val:.3f} | {delta:+.3f} | {status} |")
    
    comment = f"""## Prompt Eval Comparison

Tested `{new_version}` vs production on 50 samples from core eval set.

| Metric | Production | {new_version} | Delta | Status |
|--------|-----------|-------|-------|--------|
{chr(10).join(rows)}

> ✅ Within tolerance | ⚠️ Minor regression | ❌ Blocking regression

[View full eval results]({os.environ.get('GITHUB_SERVER_URL')}/{os.environ.get('GITHUB_REPOSITORY')}/actions/runs/{os.environ.get('GITHUB_RUN_ID')})
"""
    
    # Post to GitHub; fail the CI step if the API rejects the request
    response = requests.post(
        f"https://api.github.com/repos/{os.environ['GITHUB_REPOSITORY']}/issues/{pr_number}/comments",
        json={"body": comment},
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
        timeout=30,
    )
    response.raise_for_status()

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--pr", type=int, required=True)
    parser.add_argument("--prompt-version", required=True)
    args = parser.parse_args()
    compare_versions_and_comment(args.pr, args.prompt_version)
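The script above only posts a comment, so a blocking regression would not stop the merge by itself. One way to turn the same thresholds into a CI gate is to exit non-zero when any metric lands in the blocking band. The helper names `classify_delta` and `gate_exit_code` are hypothetical; the thresholds mirror the ones used for the status column.

```python
import sys


def classify_delta(delta: float) -> str:
    """Mirror the comment thresholds: pass, warn, or block."""
    if delta >= -0.02:
        return "pass"
    if delta >= -0.05:
        return "warn"
    return "block"


def gate_exit_code(prod_metrics: dict, new_metrics: dict) -> int:
    """Return 1 if any metric present in both dicts has a blocking regression, else 0."""
    blocking = [
        metric
        for metric in prod_metrics.keys() & new_metrics.keys()
        if classify_delta(new_metrics[metric] - prod_metrics[metric]) == "block"
    ]
    for metric in sorted(blocking):
        print(f"Blocking regression in {metric}", file=sys.stderr)
    return 1 if blocking else 0
```

Calling `sys.exit(gate_exit_code(prod_metrics, new_metrics))` after posting the comment would make the workflow step fail on a blocking regression, and a branch protection rule could then require that step to pass.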

Shadow Mode A/B Testing

Before switching production prompts, run in shadow mode:

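import asyncio
import random
from datetime import datetime

# PromptRegistry, generate_async, and log_shadow_comparison are assumed to be
# imported from the application code.

class ShadowModeRouter:
    """
    Serves every request with the production prompt and, for a sampled
    percentage of requests, also runs the new prompt in the background
    for offline comparison, without exposing users to the new prompt.
    """
    
    def __init__(self, shadow_percentage: float = 0.1):
        self.shadow_percentage = shadow_percentage
        self.registry = PromptRegistry()
        # Hold references so background tasks are not garbage-collected mid-run
        self._shadow_tasks: set[asyncio.Task] = set()
    
    async def generate(self, user_input: str, context: dict) -> str:
        # Always use production prompt for the actual response
        prod_prompt = self.registry.get_prompt("customer_support", version="production")
        response = await generate_async(user_input, prod_prompt, context)
        
        # Shadow: also run with new prompt, store result for comparison
        if random.random() < self.shadow_percentage:
            task = asyncio.create_task(
                self._shadow_generate(user_input, context, production_response=response)
            )
            self._shadow_tasks.add(task)
            task.add_done_callback(self._shadow_tasks.discard)
        
        return response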
    
    async def _shadow_generate(
        self,
        user_input: str,
        context: dict,
        production_response: str
    ):
        """Run new prompt in background, log comparison."""
        canary_prompt = self.registry.get_prompt("customer_support", version="canary")
        canary_response = await generate_async(user_input, canary_prompt, context)
        
        await log_shadow_comparison({
            "input": user_input,
            "production": production_response,
            "canary": canary_response,
            "timestamp": datetime.now().isoformat()
        })

# Analyze shadow mode results
def analyze_shadow_results(
    results_file: str,
    min_samples: int = 200
) -> dict:
    results = load_shadow_results(results_file)
    
    if len(results) < min_samples:
        return {"status": "insufficient_samples", "count": len(results)}
    
    # Run LLM-as-judge pairwise comparison
    wins = {"production": 0, "canary": 0, "tie": 0}
    
    for result in results:
        comparison = compare_responses(
            question=result["input"],
            response_a=result["production"],
            response_b=result["canary"],
            n_trials=3
        )
        wins[comparison["winner"]] += 1
    
    total = len(results)
    return {
        "status": "complete",
        "n_samples": total,
        "production_win_rate": wins["production"] / total,
        "canary_win_rate": wins["canary"] / total,
        "tie_rate": wins["tie"] / total,
        "recommendation": (
            "Promote canary to production" if wins["canary"] > wins["production"] * 1.1
            else "Keep production prompt" if wins["production"] > wins["canary"] * 1.1
            else "Inconclusive — more samples needed"
        )
    }
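The 1.1 win-ratio rule above does not depend on sample size: 11 wins against 10 triggers a promotion just as 1,100 against 1,000 does. As a complement, a two-sided exact sign test on the non-tied comparisons indicates whether the split could plausibly be chance. This swaps in a standard statistical test that the original code does not use; `sign_test_p_value` is a hypothetical helper.

```python
from math import comb


def sign_test_p_value(canary_wins: int, production_wins: int) -> float:
    """Two-sided exact sign test on non-tied comparisons (null: each side equally likely to win)."""
    n = canary_wins + production_wins
    if n == 0:
        return 1.0
    k = min(canary_wins, production_wins)
    # Probability, under a fair coin, of the losing side getting k or fewer wins
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Same 1.1 win ratio at two very different sample sizes
print(sign_test_p_value(11, 10))
print(sign_test_p_value(1100, 1000))
```

A team might require both the win-ratio condition and a p-value below a chosen cutoff (0.05 is a common convention) before promoting the canary. Ties are dropped here, which is the usual treatment in a sign test.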

Continuous Prompt Monitoring with HelpMeTest

After promoting a new prompt to production, HelpMeTest monitors for regression continuously:

*** Test Cases ***
Post-Deploy Prompt Health Check
    [Documentation]    Run after every prompt promotion to detect regressions early
    ${score}=    Evaluate Prompt    customer_support    version=production    
    ...    cases=50
    ${baseline}=    Get Stored Baseline    customer_support    metric=relevance
    Should Be True    ${score} >= ${baseline} - 0.05
    ...    msg=Post-deploy relevance ${score} below baseline ${baseline}

Configure HelpMeTest to run this every hour for 24 hours after a prompt promotion — catching regressions before they affect more than a fraction of users.

Summary

Prompt regression testing requires the same discipline as code testing:

  1. Version control — prompts in Git, not databases; diff every change
  2. Registry pattern — map names to versions with performance baselines
  3. Diff-based testing — targeted eval runs on affected cases before full suite
  4. Trend tracking — score deltas over time, not just pass/fail thresholds
  5. Shadow mode — compare offline before exposing users to new prompts
  6. CI gates — prompt changes that regress score metrics block merges

Your prompt is code. Your eval suite is your test suite. The workflow is the same.


By HelpMeTest