Prompt Testing in CI/CD: How to Catch LLM Regressions Before Users Do

Every time you change a prompt, update a model, or adjust your context injection logic, you risk a silent regression — the LLM starts producing subtly different outputs that break downstream behavior. Without CI tests for your prompts, you find these regressions in production from user complaints. This guide shows how to build a prompt testing pipeline that catches regressions before they ship.

Key Takeaways

Prompt changes are unversioned code changes. Many teams keep prompts in a config file or environment variable and change them outside code review. This means prompt regressions don't get caught by your existing PR review process or test suite.

LLM regressions are silent. A broken API returns a 500. A broken prompt returns a 200 with output that's subtly wrong — wrong format, wrong tone, missing field, incorrect assumption. Without explicit tests, you only discover this when users report problems.

You need output behavior tests, not equality checks. You can't assert output == "expected string" for non-deterministic LLM output. You assert structural properties: "response contains a JSON object with field X", "summary is under 200 words", "no profanity present".

Behavioral tests catch what unit tests miss. A unit test on your prompt construction code tells you the string is formatted correctly. An end-to-end behavioral test tells you the LLM actually uses it to produce the right output — a very different claim.

The Prompt Regression Problem

Most LLM application bugs don't come from code changes. They come from prompt changes.

The pattern is familiar: you tweak the system prompt to improve output quality for one use case. The change looks harmless — a few words rephrased, a constraint added. You ship it. Three days later, a user reports that summaries are now always in bullet points when they should be in paragraphs. Or the JSON output stopped including a field. Or responses started refusing requests that used to work.

Prompt regressions are uniquely hard to catch because:

1. They don't fail loudly. The model still produces output. It just produces the wrong output. No exception, no error code, no failed health check.

2. They're not in your diff. If your prompt is in a database or environment variable, the "code change" that caused the regression isn't in your PR. It happened outside the normal review process.

3. They're non-deterministic. The regression might not appear on every call — only on certain input patterns, at certain times, or with certain model load characteristics. Intermittent failures are the hardest to catch.

4. They compound with model updates. When your LLM provider updates the underlying model (which happens without notice for "minor" versions), your prompts may behave differently. You find out when users notice.

CI/CD testing for prompts addresses each of these problems. Here's how to build it.

What Prompt Tests Actually Look Like

Prompt tests are behavioral assertions: given an input, does the output have these properties?

They're not equality assertions. assert response == "The capital of France is Paris" can fail from one run to the next even when the prompt is working correctly, because the model rarely words its answer the same way twice. Instead:

assert "Paris" in response
assert len(response.split()) < 100  # not too verbose
assert "I cannot" not in response and "I don't know" not in response

The assertions test structure, content properties, and behavioral constraints — not exact string matches.
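Because output varies between runs, a single sample can pass or fail by chance. One way to handle that is to sample several times and require a minimum pass rate. The sketch below is illustrative only: `generate` is a hypothetical zero-argument wrapper around your model call, and `check_response` and `pass_rate` are names made up for this example.

```python
from typing import Callable


def check_response(response: str) -> bool:
    """Bundle the structural checks from the example above into one predicate."""
    return (
        "Paris" in response
        and len(response.split()) < 100
        and "I cannot" not in response
        and "I don't know" not in response
    )


def pass_rate(generate: Callable[[], str], check: Callable[[str], bool], runs: int = 5) -> float:
    """Call the model `runs` times and return the fraction of outputs that pass."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passed = sum(1 for _ in range(runs) if check(generate()))
    return passed / runs


if __name__ == "__main__":
    # Canned responses stand in for a real model call so the sketch runs offline.
    canned = iter(["The capital of France is Paris.", "Paris.", "I cannot answer that."])
    rate = pass_rate(lambda: next(canned), check_response, runs=3)
    print(f"pass rate: {rate:.2f}")
```

A CI job could then fail when the rate drops below a threshold you choose (say 0.8) rather than on a single unlucky sample. More runs give a steadier signal but cost more API calls.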

Categories of Assertions

Format assertions — does the output conform to the expected structure?

  • Is the JSON parseable?
  • Does it include required fields?
  • Is the response within expected length bounds?

Content assertions — does the output contain (or not contain) expected elements?

  • Does the summary cover the main topic?
  • Is the generated code syntactically valid?
  • Does the response address the user's question?

Behavioral assertions — does the model behave within defined guardrails?

  • Does it refuse requests it should refuse?
  • Does it not add unsolicited disclaimers to factual responses?
  • Does it follow the persona/tone constraints in the system prompt?

Regression assertions — does the output still have properties that used to be true?

  • This output always used to include a recommendations field. Does it still?
  • This assistant always used to respond in the user's language. Does it still?
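As a rough illustration, the first three categories can each be reduced to a small predicate. The function names, field names, and banned phrases below are assumptions for this sketch, not part of any library.

```python
import json


def format_ok(output: str, required_fields: list[str]) -> bool:
    """Format assertion: output is a JSON object containing every required field."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(field in data for field in required_fields)


def content_ok(output: str, must_mention: list[str]) -> bool:
    """Content assertion: every expected term appears (case-insensitive)."""
    lowered = output.lower()
    return all(term.lower() in lowered for term in must_mention)


def behavior_ok(output: str, banned_phrases: list[str]) -> bool:
    """Behavioral assertion: none of the banned phrases appear (case-insensitive)."""
    lowered = output.lower()
    return not any(phrase.lower() in lowered for phrase in banned_phrases)


if __name__ == "__main__":
    sample = '{"summary": "Rates rose again.", "recommendations": ["Refinance early"]}'
    print(format_ok(sample, ["summary", "recommendations"]))
    print(content_ok(sample, ["rates"]))
    print(behavior_ok(sample, ["as an ai", "i cannot"]))
```

A regression assertion is usually one of these checks pinned to a property you already rely on. Calling `format_ok(output, ["recommendations"])` on every build, for example, guards the field mentioned above.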

Building the Pipeline

A prompt testing CI/CD pipeline has four components:

1. A Prompt Test Suite

A set of (input, assertions) pairs that cover your critical prompt behaviors. Store these in version control alongside your code — they're specifications, not ephemeral scripts.

Example structure:

# prompt-tests/summarizer.yaml
tests:
  - name: "long article summary"
    input:
      system: "{{ system_prompt }}"
      user: "{{ load_fixture('long_article.txt') }}"
    assertions:
      - type: json_parseable
      - type: max_length
        words: 200
      - type: contains
        value: "{{ extract_main_topic('long_article.txt') }}"
      - type: not_contains
        value: "I cannot summarize"

  - name: "empty input handling"
    input:
      system: "{{ system_prompt }}"
      user: ""
    assertions:
      - type: not_empty
      - type: not_contains
        value: "error"
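The {{ ... }} placeholders in this file need to be resolved before a test runs. A minimal sketch of one way to do that follows, assuming placeholders are either a plain variable name or a load_fixture('file') call. The `render` function and the fixtures directory are hypothetical, and helpers such as extract_main_topic are left out because their behavior isn't specified here.

```python
import re
from pathlib import Path

PLACEHOLDER = re.compile(r"\{\{\s*(.+?)\s*\}\}")
FIXTURE_CALL = re.compile(r"load_fixture\('([^']+)'\)")


def render(template: str, variables: dict[str, str], fixtures_dir: Path) -> str:
    """Replace each {{ ... }} placeholder with a variable or a fixture file's text."""

    def resolve(match: re.Match) -> str:
        expr = match.group(1)
        fixture = FIXTURE_CALL.fullmatch(expr)
        if fixture:
            return (fixtures_dir / fixture.group(1)).read_text(encoding="utf-8")
        if expr in variables:
            return variables[expr]
        # Fail loudly so a typo in a test file cannot pass silently.
        raise KeyError(f"unknown placeholder: {expr}")

    return PLACEHOLDER.sub(resolve, template)
```

For example, `render(test["input"]["user"], {"system_prompt": prompt_text}, Path("prompt-tests/fixtures"))` would return the fixture text for the first test above. A full template engine is a reasonable alternative if your test files need more than these two forms.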

2. A Test Runner

A script that loads your prompt, calls the LLM, and evaluates the assertions. This runs in CI on every PR that touches prompt files or LLM configuration.

For teams using Python:

import json

import anthropic
import yaml

def run_prompt_tests(config_path: str, prompt_path: str) -> list[dict]:
    # Load the test definitions and the system prompt under test.
    with open(config_path) as f:
        tests = yaml.safe_load(f)['tests']

    with open(prompt_path) as f:
        system_prompt = f.read()

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    results = []

    for test in tests:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=system_prompt,
            messages=[{"role": "user", "content": test['input']['user']}]
        )

        text = response.content[0].text
        passed = evaluate_assertions(text, test['assertions'])
        results.append({"test": test['name'], "passed": passed, "output": text})

    return results

def is_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def evaluate_assertions(output: str, assertions: list) -> bool:
    # A test passes only if every assertion holds.
    for assertion in assertions:
        kind = assertion['type']
        if kind == 'contains':
            ok = assertion['value'] in output
        elif kind == 'not_contains':
            ok = assertion['value'] not in output
        elif kind == 'max_length':
            ok = len(output.split()) <= assertion['words']
        elif kind == 'not_empty':
            ok = bool(output.strip())
        elif kind == 'json_parseable':
            ok = is_json(output)
        else:
            # Unknown types fail loudly instead of passing silently.
            raise ValueError(f"unknown assertion type: {kind}")
        if not ok:
            return False
    return True
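For CI to block a merge, the runner script has to exit non-zero when any test fails, and it helps to leave a report behind for later inspection. Here is a minimal sketch of that last step. `write_report` is a hypothetical helper, and the report path is the one uploaded by the CI workflow later in this guide.

```python
import json
import sys
from pathlib import Path


def write_report(results: list[dict], report_path: Path) -> int:
    """Write results as JSON and return a process exit code (0 means all passed)."""
    report_path.parent.mkdir(parents=True, exist_ok=True)
    report_path.write_text(json.dumps(results, indent=2), encoding="utf-8")

    failed = [r["test"] for r in results if not r["passed"]]
    for name in failed:
        print(f"FAILED: {name}", file=sys.stderr)
    return 1 if failed else 0


# Intended use at the bottom of the runner script (not executed here):
#   results = run_prompt_tests(config_path, prompt_path)
#   sys.exit(write_report(results, Path("test-results/prompt-tests.json")))
```

Returning the exit code instead of calling sys.exit inside the helper keeps it easy to test on its own.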

3. CI Integration

Add the prompt test runner to your CI pipeline. It should run on any PR that modifies:

  • Prompt files
  • System prompt templates
  • Context injection logic
  • Model configuration (model name, temperature, max_tokens)
  • Tool definitions

# .github/workflows/prompt-tests.yml
name: Prompt Tests

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'prompt-tests/**'
      - 'src/llm/**'
      - 'config/model*.yaml'

jobs:
  prompt-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install dependencies
        run: pip install anthropic pyyaml

      - name: Run prompt test suite
        run: python scripts/run_prompt_tests.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

      - name: Upload results
        if: always()  # keep the report even when tests fail
        uses: actions/upload-artifact@v4
        with:
          name: prompt-test-results
          path: test-results/prompt-tests.json

Cost note: Running your full prompt test suite on every PR costs real money in API calls. Keep your suite focused on critical behaviors (20-50 tests, not 500). Non-critical prompt behaviors can run nightly rather than on every PR.
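One way to split the suite is to tag the tests that must run on every PR and let a command-line flag choose between the tagged subset and everything. The sketch below assumes a hypothetical `critical: true` key on each test entry and a `--suite` flag with the values `critical` and `full`. Neither is defined by any library.

```python
import argparse
from typing import Optional


def select_tests(tests: list[dict], suite: str) -> list[dict]:
    """Return the tests to run: 'critical' for every PR, 'full' for nightly runs."""
    if suite == "full":
        return list(tests)
    if suite == "critical":
        # `critical` is a hypothetical per-test flag in the YAML file.
        return [t for t in tests if t.get("critical", False)]
    raise ValueError(f"unknown suite: {suite!r}")


def parse_args(argv: Optional[list[str]] = None) -> argparse.Namespace:
    """Parse the runner's command line; the PR default is the cheaper suite."""
    parser = argparse.ArgumentParser(description="Run prompt tests")
    parser.add_argument("--suite", choices=["critical", "full"], default="critical")
    return parser.parse_args(argv)
```

Defaulting to the smaller suite means a forgotten flag costs you coverage on one run instead of an unexpectedly large API bill.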

4. End-to-End Behavioral Tests

Beyond unit-level prompt assertions, run behavioral tests that verify the full user-visible behavior of your LLM feature. These catch regressions that span the model, the UI, and the integration layer together.

For an LLM app that exposes a web interface, the tests below use Robot Framework syntax:

*** Test Cases ***
Summarizer produces output under 200 words
    As  AuthUser
    Go To  https://app.yourproduct.com/summarize
    Fill Text  .document-input  ${LONG_ARTICLE}
    Click  button[data-testid="summarize"]
    Wait For Elements State  .summary-output  visible  timeout=30s
    ${text}=  Get Text  .summary-output
    ${word_count}=  Evaluate  len($text.split())
    Should Be True  ${word_count} < 200

Summarizer does not return error message for valid input
    As  AuthUser
    Go To  https://app.yourproduct.com/summarize
    Fill Text  .document-input  This is a short test document for summarization.
    Click  button[data-testid="summarize"]
    Wait For Elements State  .summary-output  visible  timeout=30s
    Get Text  .summary-output  not contains  Something went wrong
    Get Text  .summary-output  not contains  I cannot

These tests run in CI before every deploy. They don't test the prompt in isolation; they test the complete user experience. A prompt regression that produces plausible output but breaks the downstream parser will fail these tests even if it passes the unit-level prompt assertions.
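The parser half of that claim can also be checked below the UI by feeding model output straight into the code that consumes it. In the sketch below, `parse_summary` stands in for your real downstream parser. The `Summary` type and its two fields are assumptions made for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class Summary:
    text: str
    recommendations: list[str]


def parse_summary(raw: str) -> Summary:
    """Hypothetical downstream parser: expects a JSON object with two fields."""
    data = json.loads(raw)
    return Summary(text=data["summary"], recommendations=list(data["recommendations"]))


def parser_accepts(raw: str) -> bool:
    """Integration check: True if the downstream parser can consume the model output."""
    try:
        parse_summary(raw)
        return True
    except (json.JSONDecodeError, KeyError, TypeError):
        return False
```

A prompt tweak that makes the model wrap its JSON in a Markdown code fence would still read fine to a human, but `parser_accepts` would return False for it.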

Monitoring Prompt Health in Production

CI tests catch regressions before they ship. You also need visibility into prompt behavior in production, where:

  • Real user inputs are different from your test fixtures
  • Model providers roll out minor version updates silently
  • Output quality degrades gradually over time as user input patterns shift

Sampling and Logging

Log every LLM call with: input token count, output token count, response time, and — if you can compute it cheaply — an automated quality signal (was the response parseable? did it pass format assertions?).
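A per-call log record along those lines might be assembled as in the sketch below. `call_model` is a hypothetical wrapper that returns the output text plus input and output token counts, and JSON parseability is used as the cheap quality signal.

```python
import json
import time
from typing import Callable


def is_parseable_json(text: str) -> bool:
    """Cheap quality signal: does the output parse as JSON?"""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def logged_call(call_model: Callable[[str], tuple[str, int, int]], user_input: str) -> tuple[str, dict]:
    """Run one model call and return (output, log_record)."""
    started = time.perf_counter()
    output, input_tokens, output_tokens = call_model(user_input)
    elapsed_ms = (time.perf_counter() - started) * 1000
    record = {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "response_ms": round(elapsed_ms, 1),
        "parseable": is_parseable_json(output),
    }
    return output, record
```

The record holds counts and a boolean, not the prompt or the response text, which keeps log volume down and avoids storing user content by default.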

Set up a heartbeat-style health check: send a signal while your automated quality signal stays at or above a threshold, and let the check alert you when the signals stop:

# In your LLM request handler, after each call:
import subprocess

if quality_score >= threshold:
    # Heartbeat to HelpMeTest: quality is healthy, next signal expected within 10 minutes
    subprocess.run(["helpmetest", "health", "llm-quality", "10m"], check=False)

If the health check doesn't receive a signal within 10 minutes, you get alerted. Because the signal is only sent while quality is healthy, a sustained drop in quality (or a handler that has stopped running) silences the heartbeat, and you find out before users do.
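The `quality_score` compared against the threshold has to come from somewhere. One simple option is the fraction of recent calls that passed your cheap format check. The `RollingQuality` class below is a hypothetical helper sketched for this purpose, and the window size is arbitrary.

```python
from collections import deque


class RollingQuality:
    """Fraction of recent calls whose output passed the cheap quality check."""

    def __init__(self, window: int = 100) -> None:
        # A bounded deque drops the oldest result once the window is full.
        self._results = deque(maxlen=window)

    def record(self, passed: bool) -> None:
        self._results.append(passed)

    def score(self) -> float:
        # With no data yet, report 1.0 so a cold start does not look like an outage.
        if not self._results:
            return 1.0
        return sum(self._results) / len(self._results)
```

A window smooths over one-off failures. A smaller window reacts faster but is noisier, so the right size depends on your traffic.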

Regression Detection

Run your full prompt test suite nightly against production. This catches:

  • Gradual model drift (the model provider updated something)
  • Environment-specific issues (production model config differs from staging)
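  • Regressions that slipped through review

# .github/workflows/nightly-prompt-tests.yml
name: Nightly Prompt Tests

on:
  schedule:
    - cron: '0 6 * * *'  # 06:00 UTC daily

jobs:
  nightly-prompt-regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install dependencies
        run: pip install anthropic pyyaml

      - name: Run full prompt test suite
        run: python scripts/run_prompt_tests.py --suite full
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          TARGET_ENV: production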

If nightly tests fail, you get the failure before your users' morning session. If they've been failing since 11pm, you fix it at 6am instead of when the first support ticket comes in.
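To tell new regressions apart from tests that were already failing, you can compare tonight's results with the previous run. The sketch below assumes each result is a dict with `test` and `passed` keys, as produced by the runner above. `newly_failing` is a hypothetical helper name.

```python
def newly_failing(previous: list[dict], current: list[dict]) -> list[str]:
    """Names of tests that passed in the previous run but fail in the current one."""
    passed_before = {r["test"] for r in previous if r["passed"]}
    return sorted(
        r["test"] for r in current
        if not r["passed"] and r["test"] in passed_before
    )
```

Tests that are new in the current run, or that were already failing, are deliberately excluded so the alert points only at behavior that changed overnight.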

The Prompt Test Pyramid

Think of prompt testing in three layers:

Fast (run on every PR):

  • Format assertion tests — is the output parseable and structured correctly?
  • Guardrail tests — does the model refuse what it should refuse?
  • Critical behavior tests — the 5-10 things that would break your app if they regressed

Medium (run on merge to main):

  • Full behavioral test suite — all user-visible LLM features
  • Integration tests — does the LLM output work with downstream parsers and consumers?

Slow (run nightly):

  • Full regression suite against production
  • Quality scoring across a sample of real user inputs
  • A/B comparison when you're evaluating a prompt change

Teams with no CI testing skip the fast layer entirely and discover regressions from users. A fast layer of 10-20 focused tests that runs in under 2 minutes can stop many prompt regressions from shipping.
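The A/B comparison in the slow layer can be as simple as running the same cases under both prompts and comparing pass counts. In this sketch, `generate` is a hypothetical function taking (system_prompt, user_input) and returning the model's text, and each case pairs an input with a check function.

```python
from typing import Callable

Case = tuple[str, Callable[[str], bool]]


def pass_count(generate: Callable[[str, str], str], system_prompt: str, cases: list[Case]) -> int:
    """Number of cases whose output satisfies its check under the given prompt."""
    return sum(1 for user_input, check in cases if check(generate(system_prompt, user_input)))


def compare_prompts(generate: Callable[[str, str], str], current_prompt: str,
                    candidate_prompt: str, cases: list[Case]) -> dict:
    """Run every case under both prompts and summarize the pass counts."""
    current = pass_count(generate, current_prompt, cases)
    candidate = pass_count(generate, candidate_prompt, cases)
    return {
        "current": current,
        "candidate": candidate,
        "total": len(cases),
        "candidate_is_no_worse": candidate >= current,
    }
```

With non-deterministic output, a difference of one or two cases may be noise, so treat the summary as a prompt for closer inspection and not as a verdict.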

Getting Started

If you have an LLM feature in production with no prompt tests today, here's the minimum viable starting point:

  1. Write 5 tests for your most critical LLM behavior (the feature your users would notice first if it broke)
  2. Assert on structure: is the output parseable? Does it have required fields?
  3. Assert on one behavioral property: does it stay within length? Does it include/exclude specific content?
  4. Run these in CI on any PR touching prompts or LLM config
  5. Add a health check so you know if your LLM endpoint is responding in production

That's it. 5 tests, CI integration, 1 health check. You've gone from zero visibility to catching the majority of prompt regressions before they reach users.

The full pipeline described above can come incrementally. Start with what you can do today.
