Prompt Testing in CI/CD: How to Catch LLM Regressions Before Users Do
Every time you change a prompt, update a model, or adjust your context injection logic, you risk a silent regression — the LLM starts producing subtly different outputs that break downstream behavior. Without CI tests for your prompts, you find these regressions in production from user complaints. This guide shows how to build a prompt testing pipeline that catches regressions before they ship.
Key Takeaways
Prompt changes are unversioned code changes. Many teams keep prompts in a config file or environment variable, where edits never pass through code review. This means prompt regressions don't get caught by your existing PR review process or test suite.
LLM regressions are silent. A broken API returns a 500. A broken prompt returns a 200 with output that's subtly wrong — wrong format, wrong tone, missing field, incorrect assumption. Without explicit tests, you only discover this when users report problems.
You need output behavior tests, not equality checks. You can't assert output == "expected string" for non-deterministic LLM output. You assert structural properties: "response contains a JSON object with field X", "summary is under 200 words", "no profanity present".
Behavioral tests catch what unit tests miss. A unit test on your prompt construction code tells you the string is formatted correctly. An end-to-end behavioral test tells you the LLM actually uses it to produce the right output — a very different claim.
The Prompt Regression Problem
Most LLM application bugs don't come from code changes. They come from prompt changes.
The pattern is familiar: you tweak the system prompt to improve output quality for one use case. The change looks harmless — a few words rephrased, a constraint added. You ship it. Three days later, a user reports that summaries are now always in bullet points when they should be in paragraphs. Or the JSON output stopped including a field. Or responses started refusing requests that used to work.
Prompt regressions are uniquely hard to catch because:
1. They don't fail loudly. The model still produces output. It just produces the wrong output. No exception, no error code, no failed health check.
2. They're not in your diff. If your prompt is in a database or environment variable, the "code change" that caused the regression isn't in your PR. It happened outside the normal review process.
3. They're non-deterministic. The regression might not appear on every call — only on certain input patterns, at certain times, or with certain model load characteristics. Intermittent failures are the hardest to catch.
4. They compound with model updates. When your LLM provider updates the underlying model (which happens without notice for "minor" versions), your prompts may behave differently. You find out when users notice.
CI/CD testing for prompts addresses each of these problems. Here's how to build it.
What Prompt Tests Actually Look Like
Prompt tests are behavioral assertions: given an input, does the output have these properties?
They're not equality assertions. assert response == "The capital of France is Paris" will fail on every other run even when the prompt is working correctly. Instead:
assert "Paris" in response
assert len(response.split()) < 100  # not too verbose
assert "I cannot" not in response and "I don't know" not in response
The assertions test structure, content properties, and behavioral constraints, not exact string matches.
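Because output varies from run to run, even property assertions like the ones above can fail intermittently. One common way to handle this is to sample the model several times and require a minimum pass rate. The sketch below is illustrative only: `pass_rate`, `capital_check`, and the canned `fake_generate` stand-in are hypothetical names, and a real suite would call the LLM where `fake_generate` is used.

```python
from typing import Callable


def pass_rate(generate: Callable[[], str], check: Callable[[str], bool], runs: int = 5) -> float:
    """Call the generator `runs` times and return the fraction of outputs passing `check`."""
    passed = sum(1 for _ in range(runs) if check(generate()))
    return passed / runs


def capital_check(response: str) -> bool:
    # The same properties as the assertions above, combined into one predicate.
    return (
        "Paris" in response
        and len(response.split()) < 100
        and "I cannot" not in response
        and "I don't know" not in response
    )


# Stand-in for a real LLM call (hypothetical): returns canned responses in order.
_canned = iter([
    "The capital of France is Paris.",
    "Paris is the capital of France.",
    "I cannot answer that.",
    "It's Paris.",
])


def fake_generate() -> str:
    return next(_canned)


rate = pass_rate(fake_generate, capital_check, runs=4)
print(f"pass rate: {rate:.2f}")
```

A test can then assert something like `rate >= 0.75` instead of demanding that every single sample passes. The right threshold depends on how costly a miss is for that behavior.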
Categories of Assertions
Format assertions — does the output conform to the expected structure?
- Is the JSON parseable?
- Does it include required fields?
- Is the response within expected length bounds?
Content assertions — does the output contain (or not contain) expected elements?
- Does the summary cover the main topic?
- Is the generated code syntactically valid?
- Does the response address the user's question?
Behavioral assertions — does the model behave within defined guardrails?
- Does it refuse requests it should refuse?
- Does it not add unsolicited disclaimers to factual responses?
- Does it follow the persona/tone constraints in the system prompt?
Regression assertions — does the output still have properties that used to be true?
- This output always used to include a recommendations field. Does it still?
- This assistant always used to respond in the user's language. Does it still?
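The four categories above can each be expressed as a small predicate over the output string. The following is a rough sketch, not a complete assertion library: the function names, the refusal markers, and the sample output are all assumptions made for illustration, and the keyword-based content check is deliberately crude.

```python
import json


def check_format(output: str, required_fields: list) -> bool:
    """Format: the output is a JSON object containing every required field."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(field in data for field in required_fields)


def check_content(output: str, must_mention: str) -> bool:
    """Content: a crude keyword check that the main topic is mentioned."""
    return must_mention.lower() in output.lower()


# Illustrative refusal phrases; a real list would be tuned to your model.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm unable")


def check_behavior(output: str, should_refuse: bool) -> bool:
    """Behavior: the model refuses exactly when a refusal is expected."""
    refused = any(marker in output.lower() for marker in REFUSAL_MARKERS)
    return refused == should_refuse


sample = '{"summary": "Solar capacity grew in 2023.", "recommendations": ["Invest in storage"]}'

results = {
    "format": check_format(sample, ["summary"]),
    "content": check_content(sample, "solar"),
    "behavior": check_behavior(sample, should_refuse=False),
    # Regression: a field that has always been present must still be there.
    "regression": check_format(sample, ["recommendations"]),
}
print(results)
```

Note that the regression check is just a format check pinned to a property that used to hold. What makes it a regression test is that it stays in the suite permanently.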
Building the Pipeline
A prompt testing CI/CD pipeline has four components:
1. A Prompt Test Suite
A set of (input, assertions) pairs that cover your critical prompt behaviors. Store these in version control alongside your code — they're specifications, not ephemeral scripts.
Example structure:
# prompt-tests/summarizer.yaml
tests:
  - name: "long article summary"
    input:
      system: "{{ system_prompt }}"
      user: "{{ load_fixture('long_article.txt') }}"
    assertions:
      - type: json_parseable
      - type: max_length
        words: 200
      - type: contains
        value: "{{ extract_main_topic('long_article.txt') }}"
      - type: not_contains
        value: "I cannot summarize"
  - name: "empty input handling"
    input:
      system: "{{ system_prompt }}"
      user: ""
    assertions:
      - type: not_empty
      - type: not_contains
        value: "error"
2. A Test Runner
A script that loads your prompt, calls the LLM, and evaluates the assertions. This runs in CI on every PR that touches prompt files or LLM configuration.
For teams using Python:
import json

import anthropic
import yaml


def run_prompt_tests(config_path: str, prompt_path: str) -> list:
    """Run every test in the YAML suite against the system prompt file."""
    with open(config_path) as f:
        tests = yaml.safe_load(f)['tests']
    with open(prompt_path) as f:
        system_prompt = f.read()
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    results = []
    for test in tests:
        # Template placeholders such as {{ load_fixture(...) }} are sent as-is
        # here; resolve them before this call if your suite uses them.
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=system_prompt,
            messages=[{"role": "user", "content": test['input']['user']}]
        )
        text = response.content[0].text
        passed = evaluate_assertions(text, test['assertions'])
        results.append({"test": test['name'], "passed": passed, "output": text})
    return results


def evaluate_assertions(output: str, assertions: list) -> bool:
    """Return True only if the output satisfies every assertion."""
    for assertion in assertions:
        kind = assertion['type']
        if kind == 'json_parseable':
            try:
                json.loads(output)
            except json.JSONDecodeError:
                return False
        elif kind == 'contains':
            if assertion['value'] not in output:
                return False
        elif kind == 'not_contains':
            if assertion['value'] in output:
                return False
        elif kind == 'max_length':
            if len(output.split()) > assertion['words']:
                return False
        elif kind == 'not_empty':
            if not output.strip():
                return False
        else:
            # Fail loudly instead of silently passing an unsupported assertion
            raise ValueError(f"Unknown assertion type: {kind}")
    return True
3. CI Integration
Add the prompt test runner to your CI pipeline. It should run on any PR that modifies:
- Prompt files
- System prompt templates
- Context injection logic
- Model configuration (model name, temperature, max_tokens)
- Tool definitions
# .github/workflows/prompt-tests.yml
name: Prompt Tests
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'prompt-tests/**'
      - 'src/llm/**'
      - 'config/model*.yaml'
jobs:
  prompt-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install dependencies
        run: pip install anthropic pyyaml
      - name: Run prompt test suite
        run: python scripts/run_prompt_tests.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - name: Upload results
        if: always()  # upload results even when tests fail
        uses: actions/upload-artifact@v4
        with:
          name: prompt-test-results
          path: test-results/prompt-tests.json
Cost note: Running your full prompt test suite on every PR costs real money in API calls. Keep your suite focused on critical behaviors (20-50 tests, not 500). Non-critical prompt behaviors can run nightly rather than on every PR.
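For the workflow to block a PR, the script it runs has to exit non-zero when a test fails, and the upload step expects a results file at test-results/prompt-tests.json. The sketch below shows one way the end of such a script might look. `write_results` is a hypothetical helper, and the example results simply mimic the shape returned by `run_prompt_tests` above.

```python
import json
import sys
import tempfile
from pathlib import Path


def write_results(results: list, path: str) -> int:
    """Write results as JSON and return a process exit code (0 means all passed)."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(results, indent=2))
    failed = [r["test"] for r in results if not r["passed"]]
    for name in failed:
        print(f"FAILED: {name}", file=sys.stderr)
    return 1 if failed else 0


# Example results in the same shape the runner produces (made-up values).
example = [
    {"test": "long article summary", "passed": True, "output": "A short summary."},
    {"test": "empty input handling", "passed": False, "output": "error: no input"},
]

# A temporary directory keeps this demo self-contained; in CI the path
# would be test-results/prompt-tests.json.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "test-results" / "prompt-tests.json"
    exit_code = write_results(example, str(target))
    saved = json.loads(target.read_text())

print(exit_code)
# A real script would finish with: sys.exit(exit_code)
```

Writing the file before computing the exit code means the artifact is available for debugging even when the job fails.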
4. End-to-End Behavioral Tests
Beyond unit-level prompt assertions, run behavioral tests that verify the full user-visible behavior of your LLM feature. These catch regressions that span the model, the UI, and the integration layer together.
For an LLM app that exposes a chat interface:
*** Test Cases ***
Summarizer produces output under 200 words
    As    AuthUser
    Go To    https://app.yourproduct.com/summarize
    Fill Text    .document-input    ${LONG_ARTICLE}
    Click    button[data-testid="summarize"]
    Wait For Elements State    .summary-output    visible    timeout=30s
    ${text}=    Get Text    .summary-output
    ${word_count}=    Javascript    return "${text}".split(/\s+/).length
    Should Be True    ${word_count} < 200

Summarizer does not return error message for valid input
    As    AuthUser
    Go To    https://app.yourproduct.com/summarize
    Fill Text    .document-input    This is a short test document for summarization.
    Click    button[data-testid="summarize"]
    Wait For Elements State    .summary-output    visible    timeout=30s
    Get Text    .summary-output    !=    Something went wrong
    Get Text    .summary-output    !=    I cannot
These tests run in CI before every deploy. They don't test the prompt in isolation; they test the complete user experience. A prompt regression that produces output which looks correct on its own but breaks the downstream parser will fail these tests even if it passes the unit-level prompt assertions.
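To make the parser case concrete, here is a small sketch of an output that satisfies a content assertion yet fails once it reaches the code that consumes it. Everything in it is assumed for illustration: `Summary`, `parse_summary`, and the idea that the app expects a bare JSON object are hypothetical stand-ins for your own integration layer.

```python
import json
from dataclasses import dataclass


@dataclass
class Summary:
    text: str
    recommendations: list


def parse_summary(raw: str) -> Summary:
    """Hypothetical downstream parser: expects a bare JSON object."""
    data = json.loads(raw)
    return Summary(text=data["summary"], recommendations=data["recommendations"])


def survives_parser(raw: str) -> bool:
    """True if the downstream parser accepts the raw model output."""
    try:
        parse_summary(raw)
        return True
    except (json.JSONDecodeError, KeyError, TypeError):
        return False


good = '{"summary": "Rates held steady.", "recommendations": []}'

# Same content, but the model has started wrapping it in a Markdown code fence.
fence = "`" * 3
fenced = f"{fence}json\n{good}\n{fence}"

print("content assertion passes:", "Rates held steady." in fenced)
print("parser accepts output:", survives_parser(fenced))
```

The content check still passes on the fenced output, which is why an assertion suite alone can miss this class of regression.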
Monitoring Prompt Health in Production
CI tests catch regressions before they ship. You also need visibility into prompt behavior in production, where:
- Real user inputs are different from your test fixtures
- Model providers roll out minor version updates silently
- Output quality degrades gradually over time as user input patterns shift
Sampling and Logging
Log every LLM call with: input token count, output token count, response time, and — if you can compute it cheaply — an automated quality signal (was the response parseable? did it pass format assertions?).
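A minimal sketch of such a log record follows, assuming JSON output is what the feature is supposed to produce. The names `quality_signal` and `log_llm_call` are hypothetical, and the token counts are passed in by the caller (with the Anthropic SDK they are available on `response.usage` as `input_tokens` and `output_tokens`).

```python
import json
import logging
import time

logger = logging.getLogger("llm_calls")


def quality_signal(output: str) -> bool:
    """Cheap format check: does the output parse as a JSON object?"""
    try:
        return isinstance(json.loads(output), dict)
    except json.JSONDecodeError:
        return False


def log_llm_call(input_tokens: int, output_tokens: int, started: float, output: str) -> dict:
    """Build one structured log record per call and emit it as a JSON line."""
    record = {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "response_time_s": round(time.monotonic() - started, 3),
        "format_ok": quality_signal(output),
    }
    logger.info(json.dumps(record))
    return record


started = time.monotonic()
# ... the LLM call would happen here ...
record = log_llm_call(812, 64, started, '{"summary": "ok"}')
print(record["format_ok"])
```

Logging the boolean signal rather than the full output keeps records small and avoids storing user content in logs.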
Set up a heartbeat-style health check: send a signal after each call whose quality score meets your threshold, and let silence trigger the alert:
import subprocess

# In your LLM request handler, after each call:
if quality_score >= threshold:
    # Heartbeat to HelpMeTest: quality is still healthy
    subprocess.run(["helpmetest", "health", "llm-quality", "10m"], check=False)
If the health check doesn't receive a signal within 10 minutes, you get alerted. If quality degrades enough that you're not sending signals, you find out before users do.
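The handler snippet above uses `quality_score` and `threshold` without defining them. One possible definition, sketched here under the assumption that each call yields a pass/fail format signal, is a rolling pass rate over the most recent calls. `RollingQuality` and the window size are illustrative choices, not part of any library.

```python
from collections import deque


class RollingQuality:
    """Tracks the pass rate of the most recent `window` quality checks."""

    def __init__(self, window: int = 50):
        self.results = deque(maxlen=window)  # old entries drop off automatically

    def record(self, passed: bool) -> None:
        self.results.append(passed)

    def score(self) -> float:
        # With no data yet, report a perfect score rather than raising.
        if not self.results:
            return 1.0
        return sum(self.results) / len(self.results)


quality = RollingQuality(window=4)
for passed in [True, True, False, True, False, False]:
    quality.record(passed)

quality_score = quality.score()  # only the last 4 results count
threshold = 0.8
below_threshold = quality_score < threshold
print(quality_score, below_threshold)
```

A rolling window smooths out single bad responses, so one malformed output does not flip the health state on its own.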
Regression Detection
Run your full prompt test suite nightly against production. This catches:
- Gradual model drift (the model provider updated something)
- Environment-specific issues (production model config differs from staging)
- Regressions that slipped through review
# .github/workflows/nightly-prompt-tests.yml
on:
  schedule:
    - cron: '0 6 * * *'  # 06:00 UTC daily
jobs:
  nightly-prompt-regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install dependencies
        run: pip install anthropic pyyaml
      - name: Run full prompt test suite
        run: python scripts/run_prompt_tests.py --suite full
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          TARGET_ENV: production
If nightly tests fail, you see the failure before your users' morning session. If something broke at 11pm, you find out at the 6am run instead of when the first support ticket comes in.
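Nightly runs are most useful when each one is compared with the previous night, so that new failures stand out from known ones. The sketch below assumes results are stored as lists of records with "test" and "passed" keys, as in the runner earlier. `newly_failing` is a hypothetical helper and the two result sets are made-up data.

```python
def newly_failing(previous: list, current: list) -> list:
    """Names of tests that passed in the previous run but fail in the current one."""
    passed_before = {r["test"] for r in previous if r["passed"]}
    return sorted(
        r["test"] for r in current
        if not r["passed"] and r["test"] in passed_before
    )


last_night = [
    {"test": "long article summary", "passed": True},
    {"test": "empty input handling", "passed": True},
    {"test": "refusal guardrail", "passed": False},
]
tonight = [
    {"test": "long article summary", "passed": True},
    {"test": "empty input handling", "passed": False},
    {"test": "refusal guardrail", "passed": False},
]

regressions = newly_failing(last_night, tonight)
print(regressions)
```

A test that was already failing yesterday is excluded, which keeps the nightly alert focused on what changed since the last run.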
The Prompt Test Pyramid
Think of prompt testing in three layers:
Fast (run on every PR):
- Format assertion tests — is the output parseable and structured correctly?
- Guardrail tests — does the model refuse what it should refuse?
- Critical behavior tests — the 5-10 things that would break your app if they regressed
Medium (run on merge to main):
- Full behavioral test suite — all user-visible LLM features
- Integration tests — does the LLM output work with downstream parsers and consumers?
Slow (run nightly):
- Full regression suite against production
- Quality scoring across a sample of real user inputs
- A/B comparison when you're evaluating a prompt change
Most teams skip the fast layer entirely (no CI testing) and discover regressions from users. The fast layer — 10-20 focused tests running in under 2 minutes — prevents the majority of prompt regressions from shipping.
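One way to wire the three layers into a single suite is to tag each test with a tier and pick tiers based on what triggered the run. The sketch below is an assumption-heavy illustration: the tier names, the mapping from GitHub Actions event names to tiers, and the choice to have slower triggers also run the faster tiers are all design decisions you may make differently. In a workflow, the event name could be read from the `GITHUB_EVENT_NAME` environment variable.

```python
# Which tier each trigger unlocks (assumed mapping, adjust to your workflow).
TIER_FOR_EVENT = {"pull_request": "fast", "push": "medium", "schedule": "slow"}
TIER_ORDER = ["fast", "medium", "slow"]

tests = [
    {"name": "output is valid JSON", "tier": "fast"},
    {"name": "refuses disallowed request", "tier": "fast"},
    {"name": "output works with downstream parser", "tier": "medium"},
    {"name": "quality score on sampled inputs", "tier": "slow"},
]


def select_tests(all_tests: list, event: str) -> list:
    """Return names of tests to run for this event; slower triggers include faster tiers."""
    max_index = TIER_ORDER.index(TIER_FOR_EVENT[event])
    allowed = set(TIER_ORDER[: max_index + 1])
    return [t["name"] for t in all_tests if t["tier"] in allowed]


for event in ("pull_request", "push", "schedule"):
    print(event, len(select_tests(tests, event)))
```

Keeping the tier on the test definition, rather than in separate files per tier, makes it cheap to promote a test to the fast layer once it proves its value.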
Getting Started
If you have an LLM feature in production with no prompt tests today, here's the minimum viable starting point:
- Write 5 tests for your most critical LLM behavior (the feature your users would notice first if it broke)
- Assert on structure: is the output parseable? Does it have required fields?
- Assert on one behavioral property: does it stay within length? Does it include/exclude specific content?
- Run these in CI on any PR touching prompts or LLM config
- Add a health check so you know if your LLM endpoint is responding in production
That's it. 5 tests, CI integration, 1 health check. You've gone from zero visibility to catching the majority of prompt regressions before they reach users.
The full pipeline described above can come incrementally. Start with what you can do today.