How to Test DSPy Programs: From Modules to Production

HelpMeTest

13 May 2026 — 6 min read

Your DSPy program compiles cleanly. The optimizer finds a configuration that scores 87% on your dev set. You ship it. Three days later, your support queue fills up with users saying the outputs are wrong.

The optimizer did its job. It improved accuracy on the metric you gave it. But your metric didn't fully capture what "correct" means for your users. The compiled program is confidently wrong in a new way, and you have no tests to catch it.

This is the DSPy testing problem: optimization is not validation.

What Makes DSPy Testing Different

Static LLM applications are relatively easy to test — the prompts don't change, so you can write assertions against known inputs. DSPy programs are different. The whole point of DSPy is that the optimizer rewrites your prompts, selects few-shot examples, and chains modules together in ways you didn't explicitly write.

That means:

- The prompt your program uses in production may look nothing like what you wrote
- Compiled programs can regress when you change the optimizer, the LM backend, or the training set
- Signatures define contracts, but DSPy won't enforce them at runtime without your tests doing it

You need tests at multiple layers: unit tests for modules, contract tests for signatures, evaluation pipelines for end-to-end correctness, and regression tests across compiled versions. Let's build each one.

Layer 1: Unit Testing DSPy Modules

Every DSPy Module has a forward() method. Test it like any function: give it controlled inputs, assert on the outputs.

import dspy
import pytest

# Configure a deterministic LM for tests
lm = dspy.LM("openai/gpt-4o-mini", temperature=0.0, cache=True)
dspy.configure(lm=lm)

class SentimentClassifier(dspy.Module):
    def __init__(self):
        super().__init__()  # required so dspy.Module can track sub-modules
        self.classify = dspy.Predict("text -> sentiment: str, confidence: float")

    def forward(self, text):
        return self.classify(text=text)

@pytest.fixture
def classifier():
    return SentimentClassifier()

def test_positive_sentiment(classifier):
    result = classifier("The product exceeded all my expectations.")
    assert result.sentiment.lower() in ("positive", "very positive")
    assert result.confidence > 0.7

def test_negative_sentiment(classifier):
    result = classifier("Total waste of money. Broken on arrival.")
    assert result.sentiment.lower() in ("negative", "very negative")
    assert result.confidence > 0.7

def test_neutral_sentiment(classifier):
    result = classifier("The package arrived on Tuesday.")
    assert result.sentiment.lower() == "neutral"

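Two things matter here: temperature=0.0 to reduce run-to-run variation, and cache=True so repeated test runs don't burn API credits. DSPy's caching layer is keyed on the exact request (model, prompt, and parameters), so an identical call returns the stored output without hitting the API again. Hosted models are not guaranteed to be bit-for-bit deterministic even at temperature 0, so the cache is what actually keeps reruns stable.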
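To make the caching point concrete, here is a minimal sketch of a request-keyed cache. It illustrates the idea only and is not DSPy's actual implementation: `cache_key` and `CachedClient` are hypothetical names, and `fake_backend` stands in for a real LM call.

```python
import hashlib
import json


def cache_key(model: str, prompt: str, **params) -> str:
    """Hash the full request; any change to model, prompt, or params changes the key."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedClient:
    """Hypothetical wrapper that memoizes a backend call by request hash."""

    def __init__(self, backend):
        self.backend = backend
        self.store = {}
        self.backend_calls = 0

    def __call__(self, model: str, prompt: str, **params) -> str:
        key = cache_key(model, prompt, **params)
        if key not in self.store:
            self.backend_calls += 1
            self.store[key] = self.backend(model, prompt, **params)
        return self.store[key]


def fake_backend(model, prompt, **params):
    # Stand-in for a real LM call (assumption for this sketch).
    return f"echo:{prompt}"


client = CachedClient(fake_backend)
client("m", "Classify: great product", temperature=0.0)
client("m", "Classify: great product", temperature=0.0)   # served from the cache
client("m", "Classify: great product!", temperature=0.0)  # one character differs: cache miss
```

The practical consequence is that any change to the prompt, including one made by the optimizer, is a cache miss. Expect real API calls the first time your tests run against a newly compiled program.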

Layer 2: Testing Signatures

A DSPy Signature is a contract: these inputs produce these outputs, with these field descriptions. Test that your signatures reject invalid data and produce outputs that match your type expectations.

import dspy
from dspy import Signature, InputField, OutputField

class RAGSignature(Signature):
    """Answer questions using retrieved context. Be concise."""
    context: list[str] = InputField(desc="Retrieved passages")
    question: str = InputField(desc="User question")
    answer: str = OutputField(desc="Factual answer, 1-2 sentences")
    confidence: float = OutputField(desc="Confidence score between 0 and 1")

def test_signature_output_types():
    predictor = dspy.Predict(RAGSignature)
    result = predictor(
        context=["Paris is the capital of France."],
        question="What is the capital of France?"
    )
    assert isinstance(result.answer, str)
    assert len(result.answer) > 0
    # DSPy should parse confidence as float; validate the range
    confidence = float(result.confidence)
    assert 0.0 <= confidence <= 1.0

def test_signature_uses_context():
    predictor = dspy.Predict(RAGSignature)
    result = predictor(
        context=["The Eiffel Tower was completed in 1889."],
        question="When was the Eiffel Tower completed?"
    )
    assert "1889" in result.answer

def test_signature_handles_empty_context():
    predictor = dspy.Predict(RAGSignature)
    result = predictor(
        context=[],
        question="What is the speed of light?"
    )
    # This only checks that the call succeeds and returns a string.
    # Detecting hallucination from missing context needs a metric or a judge LM.
    assert isinstance(result.answer, str)

Field description quality directly affects output quality. If your OutputField(desc=...) is vague, the LM won't reliably produce parseable output. Test the failure modes explicitly.
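One way to test those failure modes without calling an LM is to put the contract checks in a plain function and feed it hand-built outputs. The sketch below is an assumption-laden illustration: `validate_prediction` is a hypothetical helper, and `SimpleNamespace` stands in for the prediction object so the example runs without DSPy installed.

```python
from types import SimpleNamespace


def validate_prediction(pred) -> list[str]:
    """Return a list of contract violations; an empty list means the output is valid."""
    problems = []

    answer = getattr(pred, "answer", None)
    if not isinstance(answer, str) or not answer.strip():
        problems.append("answer must be a non-empty string")

    try:
        confidence = float(getattr(pred, "confidence", None))
    except (TypeError, ValueError):
        problems.append("confidence is not parseable as a float")
    else:
        if not 0.0 <= confidence <= 1.0:
            problems.append(f"confidence {confidence} is outside [0, 1]")

    return problems


# Hand-built stand-ins for program outputs (not real LM responses).
good = SimpleNamespace(answer="Paris", confidence="0.93")
bad = SimpleNamespace(answer="  ", confidence="high")
out_of_range = SimpleNamespace(answer="Paris", confidence=85)
```

The same function can then be called on real predictions inside the signature tests, so the offline cases and the live cases share one definition of "valid".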

Layer 3: Evaluation with DSPy's Built-in Evaluate

DSPy ships an Evaluate class designed for running programs against a Dataset. Use it to measure accuracy across a representative sample before and after any change.

import dspy
from dspy.evaluate import Evaluate

# Build a labeled dev set
devset = [
    dspy.Example(
        question="What year did WWII end?",
        context=["World War II ended in 1945 with Germany's surrender in May and Japan's in September."],
        answer="1945"
    ).with_inputs("question", "context"),
    dspy.Example(
        question="Who invented the telephone?",
        context=["Alexander Graham Bell is credited with patenting the telephone in 1876."],
        answer="Alexander Graham Bell"
    ).with_inputs("question", "context"),
    # Add 50-100 examples for meaningful signal
]

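def exact_match(example, pred, trace=None):
    # Despite the name, this is a lenient check: it passes when the gold
    # answer appears anywhere in the prediction.
    return example.answer.strip().lower() in pred.answer.strip().lower()

def evaluate_program(program, devset):
    evaluator = Evaluate(
        devset=devset,
        metric=exact_match,
        num_threads=4,
        display_progress=True
    )
    result = evaluator(program)
    # Evaluate reports a percentage (0-100); newer DSPy releases wrap it in a
    # result object with a .score attribute. Normalize to a 0-1 fraction so the
    # thresholds below read naturally. Verify this against your DSPy version.
    score = getattr(result, "score", result)
    return score / 100.0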

# Run this in CI before and after optimizer changes.
# MyRAGProgram stands for your own dspy.Module; it is not defined here.
def test_baseline_accuracy():
    program = MyRAGProgram()
    score = evaluate_program(program, devset)
    assert score >= 0.80, f"Accuracy dropped: {score:.2%}"

Your metric is the most important part. exact_match is a start, but most real programs need semantic similarity, field-specific validation, or even a judge LM. Write the metric to match what users actually need — not what's easy to automate.
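As one step up from substring matching, here is a minimal sketch of a token-overlap F1 metric with the same `(example, pred, trace=None)` shape as the metric above. `token_f1` and `_tokens` are hypothetical names, and `SimpleNamespace` stands in for the example and prediction objects so the block runs on its own. It is still a lexical measure, not semantic similarity.

```python
import re
from collections import Counter
from types import SimpleNamespace


def _tokens(text: str) -> list[str]:
    """Lowercase and split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def token_f1(example, pred, trace=None) -> float:
    """Harmonic mean of token precision and recall between gold and predicted answers."""
    gold = Counter(_tokens(example.answer))
    guess = Counter(_tokens(pred.answer))
    overlap = sum((gold & guess).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(guess.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


gold = SimpleNamespace(answer="Alexander Graham Bell")
reordered = SimpleNamespace(answer="Bell, Alexander Graham")
padded = SimpleNamespace(answer="The telephone was invented by Alexander Graham Bell")
wrong = SimpleNamespace(answer="Thomas Edison")

scores = [token_f1(gold, p) for p in (reordered, padded, wrong)]
```

Unlike the containment check, this penalizes long answers that bury the gold tokens in filler. If you need a pass/fail signal, compare the score against a threshold you pick from labeled examples.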

Layer 4: Regression Testing Compiled vs Uncompiled Programs

The optimizer might improve your target metric while regressing on edge cases your devset doesn't cover. Always test the compiled program explicitly against the uncompiled version on a held-out test set.

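import dspy
from dspy.teleprompt import BootstrapFewShot
import pickle
import os

# exact_match and evaluate_program come from the Layer 3 example; trainset is
# your own list of dspy.Example objects, kept separate from the test set.
# Only unpickle artifacts you produced yourself: pickle.load can execute
# arbitrary code. DSPy modules also offer save() and load() as an alternative.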

def load_or_compile(program, trainset, cache_path):
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    optimizer = BootstrapFewShot(metric=exact_match, max_bootstrapped_demos=4)
    compiled = optimizer.compile(program, trainset=trainset)
    with open(cache_path, "wb") as f:
        pickle.dump(compiled, f)
    return compiled

def test_compiled_does_not_regress_on_edge_cases():
    base = MyRAGProgram()
    compiled = load_or_compile(base, trainset, "compiled_rag.pkl")

    # With a single case the score can only be 0 or 1, so add more cases
    # before relying on the tolerance below.
    edge_cases = [
        dspy.Example(
            question="What color is the sky?",
            context=["The sky is green."],
            answer="green"  # should use context, not world knowledge
        ).with_inputs("question", "context"),
    ]

    base_score = evaluate_program(base, edge_cases)
    compiled_score = evaluate_program(compiled, edge_cases)

    # Compiled program should not regress more than 5 points on edge cases
    assert compiled_score >= base_score - 0.05, (
        f"Compiled program regressed on edge cases: "
        f"base={base_score:.2%}, compiled={compiled_score:.2%}"
    )

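def test_compiled_program_determinism():
    compiled = load_or_compile(MyRAGProgram(), trainset, "compiled_rag.pkl")
    input_kwargs = dict(
        question="What year did WWII end?",
        context=["World War II ended in 1945."]
    )
    # Bypass the cache here; otherwise the second call is served from the
    # cache and the assertion passes trivially.
    uncached_lm = dspy.LM("openai/gpt-4o-mini", temperature=0.0, cache=False)
    with dspy.context(lm=uncached_lm):
        result_1 = compiled(**input_kwargs)
        result_2 = compiled(**input_kwargs)
    assert result_1.answer == result_2.answer, "Compiled program is non-deterministic at temperature=0"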

Treat the compiled program artifact the same way you treat a model checkpoint: version it, test it on every promotion, and never silently replace it.
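A lightweight way to make silent replacement detectable is to store a content hash next to the artifact and check it before loading. This is a minimal sketch under stated assumptions: the manifest layout and the helper names (`fingerprint`, `record_artifact`, `artifact_matches`) are hypothetical, and the bytes written below only pretend to be a compiled program.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def fingerprint(path: Path) -> str:
    """SHA-256 of the artifact's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def record_artifact(artifact: Path, manifest: Path, metadata: dict) -> dict:
    """Write a manifest entry describing the artifact at promotion time."""
    entry = {"file": artifact.name, "sha256": fingerprint(artifact), **metadata}
    manifest.write_text(json.dumps(entry, indent=2, sort_keys=True))
    return entry


def artifact_matches(artifact: Path, manifest: Path) -> bool:
    """True when the artifact on disk is the one the manifest was written for."""
    expected = json.loads(manifest.read_text())["sha256"]
    return fingerprint(artifact) == expected


with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "compiled_rag.pkl"
    manifest = Path(tmp) / "compiled_rag.manifest.json"

    artifact.write_bytes(b"pretend this is a compiled program")
    record_artifact(artifact, manifest, {"optimizer": "BootstrapFewShot"})
    unchanged = artifact_matches(artifact, manifest)

    artifact.write_bytes(b"silently replaced")
    changed = artifact_matches(artifact, manifest)
```

Committing the manifest to version control gives you a reviewable diff each time the compiled program changes, and a test can refuse to load an artifact whose hash does not match.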

Layer 5: Monitoring DSPy-Powered Applications in Production

Unit tests and evaluations catch problems before deployment. Production monitoring catches everything else: prompt injection, distribution shift, latency spikes, and silent failures where the program returns something syntactically valid but semantically wrong.

For applications built on top of DSPy programs, you need:

- Structured logging: log every input/output pair with a trace ID
- Latency tracking: DSPy programs with multiple hops can time out unpredictably
- Output validation: if your program produces a confidence score, alert when confidence drops consistently below a threshold
- Canary evaluation: run your dev set against production periodically and alert on regression

import json
import logging
import time
import uuid
from functools import wraps

logger = logging.getLogger(__name__)

def monitor_dspy_call(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        trace_id = uuid.uuid4().hex  # correlates log lines for this call
        start = time.monotonic()
        try:
            result = func(*args, **kwargs)
            latency = time.monotonic() - start
            # Serialize to JSON so log pipelines can parse the fields.
            # Add redacted inputs and outputs here if your privacy rules allow it.
            logger.info(json.dumps({
                "event": "dspy_call",
                "trace_id": trace_id,
                "module": func.__qualname__,
                "latency_ms": round(latency * 1000),
                "success": True,
            }))
            return result
        except Exception as e:
            latency = time.monotonic() - start
            logger.error(json.dumps({
                "event": "dspy_call_error",
                "trace_id": trace_id,
                "module": func.__qualname__,
                "latency_ms": round(latency * 1000),
                "error": str(e),
            }))
            raise
    return wrapper

Wrap your forward() methods with this decorator and ship the logs to whatever observability stack you use. The pattern is the same whether you're running locally or in a Kubernetes cluster.
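The output-validation item in the list above can be sketched as a rolling-window check on confidence scores. `ConfidenceMonitor` is a hypothetical helper, and the threshold and window size are placeholder values, not recommendations.

```python
from collections import deque


class ConfidenceMonitor:
    """Flags a sustained drop: the mean of the last `window` scores is below `threshold`."""

    def __init__(self, threshold: float = 0.6, window: int = 50):
        self.threshold = threshold
        self.values = deque(maxlen=window)

    def mean(self) -> float:
        return sum(self.values) / len(self.values)

    def observe(self, confidence: float) -> bool:
        """Record one score; return True only when a full window averages below threshold."""
        self.values.append(confidence)
        window_full = len(self.values) == self.values.maxlen
        return window_full and self.mean() < self.threshold


monitor = ConfidenceMonitor(threshold=0.6, window=3)
alerts = [monitor.observe(c) for c in (0.9, 0.8, 0.4, 0.3, 0.2)]
```

Requiring a full window before alerting avoids paging on a single low-confidence answer; the trade-off is a detection delay of up to one window of traffic.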

How HelpMeTest Helps

DSPy programs power end-user applications: chatbots, search interfaces, document processors, code assistants. The layers above cover the LLM internals. But the application itself needs end-to-end testing too — the UI, the API, the auth flow, the error states.

That's where HelpMeTest comes in. You write tests in plain English or generate them automatically from your app, and HelpMeTest runs them against your live application using real browsers. When your DSPy program changes behavior after a recompile, your application-level tests catch the surface-level breakage: wrong output displayed, truncated response, missing field, UI error state triggered.

A health check for a DSPy-powered search feature looks like:

helpmetest health "dspy-search-widget" 5m

That keeps a Playwright test running on a 5-minute interval, hitting your real application, asserting that the search widget returns results and the results are non-empty. If the DSPy program starts returning empty outputs, you know in 5 minutes instead of finding out from users.

Visual regression tests catch the cases where your DSPy program changes its output format enough that the frontend renders it incorrectly — broken markdown, missing sections, response truncated mid-sentence.

The Testing Stack for DSPy Programs

| Layer | Tool | What it catches |
| --- | --- | --- |
| Module unit tests | pytest + DSPy cache | Logic errors, type mismatches, obvious regressions |
| Signature tests | pytest | Contract violations, unparseable outputs |
| Evaluation | DSPy `Evaluate` | Accuracy regression across your dataset |
| Compilation regression | pytest + pickled artifacts | Optimizer introducing new failure modes |
| Production monitoring | Structured logs + alerting | Distribution shift, latency, silent failures |
| Application E2E | HelpMeTest | User-visible breakage, UI integration, auth flows |

DSPy shifts where the complexity lives — from hand-written prompts to the optimizer. That's genuinely useful. But it doesn't shift the responsibility for correctness. That's still yours, and tests are how you exercise it.

Start with the free tier at helpmetest.com — 10 tests, no credit card. Write one health check for your DSPy-powered endpoint. If it breaks after your next compile, you'll know before your users do.
