How to Test DSPy Programs: From Modules to Production
Your DSPy program compiles cleanly. The optimizer finds a configuration that scores 87% on your dev set. You ship it. Three days later, your support queue fills up with users saying the outputs are wrong.
The optimizer did its job. It improved accuracy on the metric you gave it. But your metric didn't fully capture what "correct" means for your users. The compiled program is confidently wrong in a new way, and you have no tests to catch it.
This is the DSPy testing problem: optimization is not validation.
What Makes DSPy Testing Different
Static LLM applications are relatively easy to test — the prompts don't change, so you can write assertions against known inputs. DSPy programs are different. The whole point of DSPy is that the optimizer rewrites your prompts, selects few-shot examples, and chains modules together in ways you didn't explicitly write.
That means:
- The prompt your program uses in production may look nothing like what you wrote
- Compiled programs can regress when you change the optimizer, the LM backend, or the training set
- Signatures define contracts, but DSPy does not fully enforce them at runtime; your tests have to verify that outputs honor them
You need tests at multiple layers: unit tests for modules, contract tests for signatures, evaluation pipelines for end-to-end correctness, and regression tests across compiled versions. Let's build each one.
Layer 1: Unit Testing DSPy Modules
Every DSPy Module has a forward() method. Test it like any function: give it controlled inputs, assert on the outputs.
```python
import dspy
import pytest

# Configure a low-variance, cached LM for tests
lm = dspy.LM("openai/gpt-4o-mini", temperature=0.0, cache=True)
dspy.configure(lm=lm)


class SentimentClassifier(dspy.Module):
    def __init__(self):
        super().__init__()
        self.classify = dspy.Predict("text -> sentiment: str, confidence: float")

    def forward(self, text):
        return self.classify(text=text)


@pytest.fixture
def classifier():
    return SentimentClassifier()


def test_positive_sentiment(classifier):
    result = classifier("The product exceeded all my expectations.")
    assert result.sentiment.lower() in ("positive", "very positive")
    assert result.confidence > 0.7


def test_negative_sentiment(classifier):
    result = classifier("Total waste of money. Broken on arrival.")
    assert result.sentiment.lower() in ("negative", "very negative")
    assert result.confidence > 0.7


def test_neutral_sentiment(classifier):
    result = classifier("The package arrived on Tuesday.")
    assert result.sentiment.lower() == "neutral"
```

Two things matter here: temperature=0.0 keeps outputs as stable as the provider allows (it reduces variance but does not guarantee identical results), and cache=True means repeated test runs don't burn API credits. DSPy's caching layer is keyed on the exact prompt, so an identical request returns the stored output without hitting the API again.
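To make the caching behavior concrete, here is a minimal sketch of a request-keyed cache. It is not DSPy's implementation: `request_key`, `CachedClient`, and `fake_backend` are hypothetical names, and the fake backend stands in for a real LM call.

```python
import hashlib
import json


def request_key(model: str, prompt: str, **params) -> str:
    """Hash the full request so any change to model, prompt, or params yields a new key."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedClient:
    """Hypothetical wrapper that replays stored responses for repeated requests."""

    def __init__(self, backend):
        self.backend = backend  # callable(model, prompt, **params) -> str
        self.store = {}
        self.backend_calls = 0

    def complete(self, model: str, prompt: str, **params) -> str:
        key = request_key(model, prompt, **params)
        if key not in self.store:
            # Cache miss: call the backend once and remember the response
            self.backend_calls += 1
            self.store[key] = self.backend(model, prompt, **params)
        return self.store[key]


def fake_backend(model, prompt, **params):
    # Stand-in for a real LM call
    return f"echo:{prompt}"


client = CachedClient(fake_backend)
first = client.complete("demo-model", "Classify: great product", temperature=0.0)
second = client.complete("demo-model", "Classify: great product", temperature=0.0)
third = client.complete("demo-model", "Classify: great product!", temperature=0.0)
```

The practical consequence for tests: any change to the prompt, including one introduced by a recompile, produces a new key, so the first run after such a change still reaches the API.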
Layer 2: Testing Signatures
A DSPy Signature is a contract: these inputs produce these outputs, with these field descriptions. Test that your signatures reject invalid data and produce outputs that match your type expectations.
```python
import dspy
from dspy import Signature, InputField, OutputField


class RAGSignature(Signature):
    """Answer questions using retrieved context. Be concise."""

    context: list[str] = InputField(desc="Retrieved passages")
    question: str = InputField(desc="User question")
    answer: str = OutputField(desc="Factual answer, 1-2 sentences")
    confidence: float = OutputField(desc="Confidence score between 0 and 1")


def test_signature_output_types():
    predictor = dspy.Predict(RAGSignature)
    result = predictor(
        context=["Paris is the capital of France."],
        question="What is the capital of France?",
    )
    assert isinstance(result.answer, str)
    assert len(result.answer) > 0
    # DSPy should parse confidence as float; validate the range
    confidence = float(result.confidence)
    assert 0.0 <= confidence <= 1.0


def test_signature_uses_context():
    predictor = dspy.Predict(RAGSignature)
    result = predictor(
        context=["The Eiffel Tower was completed in 1889."],
        question="When was the Eiffel Tower completed?",
    )
    assert "1889" in result.answer


def test_signature_handles_empty_context():
    predictor = dspy.Predict(RAGSignature)
    result = predictor(
        context=[],
        question="What is the speed of light?",
    )
    # Should return some answer; just not hallucinate from missing context
    assert isinstance(result.answer, str)
```

Field description quality directly affects output quality. If your OutputField(desc=...) is vague, the LM won't reliably produce parseable output. Test the failure modes explicitly.
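The type and range checks above tend to get repeated across tests. One way to centralize them is a small validator, sketched here under assumptions: `validate_prediction` is a hypothetical helper, the two-sentence limit mirrors the answer field's description, and `SimpleNamespace` stands in for a DSPy prediction so the example runs without an API key.

```python
import re
from types import SimpleNamespace


def validate_prediction(pred, max_sentences: int = 2) -> list[str]:
    """Return a list of contract violations; an empty list means the prediction passes."""
    problems = []

    answer = getattr(pred, "answer", None)
    if not isinstance(answer, str) or not answer.strip():
        problems.append("answer is missing or empty")
    else:
        # Rough sentence count: split after terminal punctuation followed by whitespace
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
        if len(sentences) > max_sentences:
            problems.append(f"answer has {len(sentences)} sentences")

    try:
        confidence = float(getattr(pred, "confidence", None))
    except (TypeError, ValueError):
        problems.append("confidence is not a number")
    else:
        if not 0.0 <= confidence <= 1.0:
            problems.append(f"confidence {confidence} is outside [0, 1]")

    return problems


# Stand-ins for predictions returned by a predictor
good = SimpleNamespace(answer="Paris.", confidence="0.93")
bad = SimpleNamespace(answer="", confidence="high")
too_long = SimpleNamespace(answer="A. B. C.", confidence=1.5)
```

In a test, `assert validate_prediction(result) == []` then fails with a readable list of every violated expectation rather than stopping at the first one.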
Layer 3: Evaluation with DSPy's Built-in Evaluate
DSPy ships an Evaluate class designed for running programs against a labeled dev set (a list of dspy.Example objects). Use it to measure accuracy across a representative sample before and after any change.
```python
import dspy
from dspy.evaluate import Evaluate

# Build a labeled dev set
devset = [
    dspy.Example(
        question="What year did WWII end?",
        context=["World War II ended in 1945 with Germany's surrender in May and Japan's in September."],
        answer="1945",
    ).with_inputs("question", "context"),
    dspy.Example(
        question="Who invented the telephone?",
        context=["Alexander Graham Bell is credited with patenting the telephone in 1876."],
        answer="Alexander Graham Bell",
    ).with_inputs("question", "context"),
    # Add 50-100 examples for meaningful signal
]


def exact_match(example, pred, trace=None):
    # Case-insensitive substring check: passes if the gold answer appears in the prediction
    return example.answer.strip().lower() in pred.answer.strip().lower()


def evaluate_program(program, devset):
    evaluator = Evaluate(
        devset=devset,
        metric=exact_match,
        num_threads=4,
        display_progress=True,
    )
    result = evaluator(program)
    # Evaluate reports a percentage (0-100). Depending on the DSPy version it is
    # a bare number or a result object with a .score attribute; normalize to 0-1.
    score = getattr(result, "score", result)
    return score / 100.0


# Run this in CI before and after optimizer changes
def test_baseline_accuracy():
    program = MyRAGProgram()  # your own dspy.Module, defined elsewhere
    score = evaluate_program(program, devset)
    assert score >= 0.80, f"Accuracy dropped: {score:.2%}"
```

Your metric is the most important part. exact_match (a case-insensitive substring check, despite the name) is a start, but most real programs need semantic similarity, field-specific validation, or even a judge LM. Write the metric to match what users actually need, not what's easy to automate.
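As one step beyond substring matching, a token-level F1 metric gives partial credit for partially correct answers. This is a sketch, not a recommendation for every task: `normalize`, `token_f1`, and `f1_metric` are hypothetical names, the 0.5 threshold is arbitrary, and the metric assumes the same `(example, pred, trace=None)` signature used by exact_match above.

```python
import re
from collections import Counter
from types import SimpleNamespace


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into tokens."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()


def token_f1(gold: str, predicted: str) -> float:
    """Harmonic mean of token precision and recall between two strings."""
    gold_tokens, pred_tokens = normalize(gold), normalize(predicted)
    if not gold_tokens or not pred_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(gold_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def f1_metric(example, pred, trace=None, threshold: float = 0.5):
    score = token_f1(example.answer, pred.answer)
    # When a trace is passed (as optimizers do), return pass/fail;
    # otherwise return the graded score for evaluation.
    return score >= threshold if trace is not None else score


# Stand-ins for a labeled example and a prediction
example = SimpleNamespace(answer="Alexander Graham Bell")
partial = SimpleNamespace(answer="Bell")
partial_score = f1_metric(example, partial)
```

Token overlap still cannot tell a correct paraphrase from a wrong answer that reuses the same words, which is where a semantic-similarity or judge-LM metric earns its cost.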
Layer 4: Regression Testing Compiled vs Uncompiled Programs
The optimizer might improve your target metric while regressing on edge cases your devset doesn't cover. Always test the compiled program explicitly against the uncompiled version on a held-out test set.
```python
import os
import pickle

import dspy
from dspy.teleprompt import BootstrapFewShot

# MyRAGProgram and trainset come from your own code; exact_match and
# evaluate_program are the helpers defined in the evaluation layer.


def load_or_compile(program, trainset, cache_path):
    # Reuse a previously compiled artifact if one exists
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    optimizer = BootstrapFewShot(metric=exact_match, max_bootstrapped_demos=4)
    compiled = optimizer.compile(program, trainset=trainset)
    with open(cache_path, "wb") as f:
        pickle.dump(compiled, f)
    return compiled


def test_compiled_does_not_regress_on_edge_cases():
    base = MyRAGProgram()
    compiled = load_or_compile(base, trainset, "compiled_rag.pkl")
    edge_cases = [
        dspy.Example(
            question="What if the context contradicts the question?",
            context=["The sky is green."],
            answer="green",  # should use context, not world knowledge
        ).with_inputs("question", "context"),
    ]
    base_score = evaluate_program(base, edge_cases)
    compiled_score = evaluate_program(compiled, edge_cases)
    # Compiled program should not regress more than 5 points on edge cases
    assert compiled_score >= base_score - 0.05, (
        f"Compiled program regressed on edge cases: "
        f"base={base_score:.2%}, compiled={compiled_score:.2%}"
    )


def test_compiled_program_determinism():
    compiled = load_or_compile(MyRAGProgram(), trainset, "compiled_rag.pkl")
    input_kwargs = dict(
        question="What year did WWII end?",
        context=["World War II ended in 1945."],
    )
    # Bypass the cache: with cache=True the second call would be replayed
    # from the first and the comparison would always pass
    uncached_lm = dspy.LM("openai/gpt-4o-mini", temperature=0.0, cache=False)
    with dspy.context(lm=uncached_lm):
        result_1 = compiled(**input_kwargs)
        result_2 = compiled(**input_kwargs)
    assert result_1.answer == result_2.answer, "Compiled program is non-deterministic at temperature=0"
```

Treat the compiled program artifact the same way you treat a model checkpoint: version it, test it on every promotion, and never silently replace it.
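An aggregate score can hide individual regressions: a compiled program may fix three examples and break two others while the total goes up. A per-example comparison surfaces those cases. The sketch below makes several assumptions: `find_regressions` is a hypothetical helper, programs are any callables that take the example's inputs as keyword arguments, and plain dicts and functions stand in for DSPy examples and modules.

```python
from types import SimpleNamespace


def find_regressions(base, compiled, examples, metric):
    """Return the examples the base program passes but the compiled program fails."""
    regressions = []
    for example in examples:
        inputs = example["inputs"]
        gold = SimpleNamespace(**example["labels"])
        base_ok = bool(metric(gold, base(**inputs)))
        compiled_ok = bool(metric(gold, compiled(**inputs)))
        if base_ok and not compiled_ok:
            regressions.append(example)
    return regressions


def contains_answer(example, pred, trace=None):
    return example.answer.lower() in pred.answer.lower()


def base_program(question, context):
    # Stand-in for an uncompiled program that answers from the context
    return SimpleNamespace(answer=context[0])


def compiled_program(question, context):
    # Stand-in for a compiled program that falls back on world knowledge
    return SimpleNamespace(answer="The sky is blue.")


edge_cases = [
    {
        "inputs": {"question": "What color is the sky?", "context": ["The sky is green."]},
        "labels": {"answer": "green"},
    },
]

regressed = find_regressions(base_program, compiled_program, edge_cases, contains_answer)
```

Asserting that the returned list is empty, and printing it when it is not, turns a vague "score dropped" failure into a list of specific inputs to investigate.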
Layer 5: Monitoring DSPy-Powered Applications in Production
Unit tests and evaluations catch problems before deployment. Production monitoring catches everything else: prompt injection, distribution shift, latency spikes, and silent failures where the program returns something syntactically valid but semantically wrong.
For applications built on top of DSPy programs, you need:
- Structured logging — log every input/output pair with a trace ID
- Latency tracking — DSPy programs with multiple hops can time out unpredictably
- Output validation — if your program produces a confidence score, alert when confidence drops consistently below a threshold
- Canary evaluation — run your dev set against production periodically and alert on regression
```python
import time
import logging
from functools import wraps

logger = logging.getLogger(__name__)


def monitor_dspy_call(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            result = func(*args, **kwargs)
            latency = time.monotonic() - start
            logger.info({
                "event": "dspy_call",
                "module": func.__qualname__,
                "latency_ms": round(latency * 1000),
                "success": True,
            })
            return result
        except Exception as e:
            latency = time.monotonic() - start
            logger.error({
                "event": "dspy_call_error",
                "module": func.__qualname__,
                "latency_ms": round(latency * 1000),
                "error": str(e),
            })
            raise
    return wrapper
```

Wrap your forward() methods with this decorator and ship the logs to whatever observability stack you use. The pattern is the same whether you're running locally or in a Kubernetes cluster.
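The output-validation item from the list above can be sketched as a rolling-window check. This is an illustration under assumptions: `ConfidenceMonitor` is a hypothetical class, the window size and threshold are placeholders to tune per application, and what happens on an alert (paging, logging, a metric) is left to your stack.

```python
from collections import deque


class ConfidenceMonitor:
    """Tracks a rolling mean of confidence scores and flags sustained drops."""

    def __init__(self, window: int = 50, threshold: float = 0.6):
        self.scores = deque(maxlen=window)  # oldest scores fall off automatically
        self.threshold = threshold

    def rolling_mean(self) -> float:
        return sum(self.scores) / len(self.scores)

    def record(self, confidence: float) -> bool:
        """Add a score; return True when a full window's mean is below the threshold."""
        self.scores.append(confidence)
        window_full = len(self.scores) == self.scores.maxlen
        return window_full and self.rolling_mean() < self.threshold


monitor = ConfidenceMonitor(window=3, threshold=0.6)
alerts = [monitor.record(c) for c in (0.9, 0.8, 0.4, 0.3, 0.2)]
```

Waiting for a full window before alerting avoids paging on a single low-confidence answer; the trade-off is that a sudden collapse takes up to one window of requests to surface.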
How HelpMeTest Helps
DSPy programs power end-user applications: chatbots, search interfaces, document processors, code assistants. The layers above cover the LLM internals. But the application itself needs end-to-end testing too — the UI, the API, the auth flow, the error states.
That's where HelpMeTest comes in. You write tests in plain English or generate them automatically from your app, and HelpMeTest runs them against your live application using real browsers. When your DSPy program changes behavior after a recompile, your application-level tests catch the surface-level breakage: wrong output displayed, truncated response, missing field, UI error state triggered.
A health check for a DSPy-powered search feature looks like:
```
helpmetest health "dspy-search-widget" 5m
```

That keeps a Playwright test running on a 5-minute interval, hitting your real application, asserting that the search widget returns results and the results are non-empty. If the DSPy program starts returning empty outputs, you know in 5 minutes instead of finding out from users.
Visual regression tests catch the cases where your DSPy program changes its output format enough that the frontend renders it incorrectly — broken markdown, missing sections, response truncated mid-sentence.
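Some of these format problems can also be caught cheaply before the response reaches the frontend. The following is a rough heuristic, not a substitute for visual tests: `format_problems` is a hypothetical helper, and the set of accepted sentence endings is an assumption that will need adjusting for your output style.

```python
FENCE = "`" * 3  # markdown code-fence marker, built here to avoid writing it literally


def format_problems(text: str) -> list[str]:
    """Return format problems that would likely break frontend rendering."""
    stripped = text.strip()
    if not stripped:
        return ["empty response"]

    problems = []
    # An odd number of fence markers means a code block was never closed
    if stripped.count(FENCE) % 2 == 1:
        problems.append("unclosed code fence")
    # Text that stops without terminal punctuation may have been cut off mid-sentence
    if not stripped.endswith((".", "!", "?", ")", '"', FENCE)):
        problems.append("possibly truncated: no terminal punctuation")
    return problems


complete = format_problems("Paris is the capital of France.")
truncated = format_problems("Paris is the capital of")
unclosed = format_problems("Here is the code:\n" + FENCE + "python\nprint(1)")
```

A check like this can run inside the same monitoring wrapper as the latency logging, with the browser-level tests covering what the heuristic misses.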
The Testing Stack for DSPy Programs
| Layer | Tool | What it catches |
|---|---|---|
| Module unit tests | pytest + DSPy cache | Logic errors, type mismatches, obvious regressions |
| Signature tests | pytest | Contract violations, unparseable outputs |
| Evaluation | DSPy Evaluate | Accuracy regression across your dataset |
| Compilation regression | pytest + pickled artifacts | Optimizer introducing new failure modes |
| Production monitoring | Structured logs + alerting | Distribution shift, latency, silent failures |
| Application E2E | HelpMeTest | User-visible breakage, UI integration, auth flows |
DSPy shifts where the complexity lives — from hand-written prompts to the optimizer. That's genuinely useful. But it doesn't shift the responsibility for correctness. That's still yours, and tests are how you exercise it.
Start with the free tier at helpmetest.com — 10 tests, no credit card. Write one health check for your DSPy-powered endpoint. If it breaks after your next compile, you'll know before your users do.