A/B Testing LLM Models and Prompts: A Statistical Framework

"The new prompt feels better" is not an evaluation strategy. Moving from GPT-4 to Claude, or changing a system prompt, requires rigorous A/B testing to make confident decisions — especially when the differences in quality are subtle and user impact is significant.

This guide covers the statistical framework for A/B testing LLMs: how to design experiments, calculate required sample sizes, analyze results correctly, and avoid the common mistakes that lead to wrong conclusions.

Why LLM A/B Testing Is Hard

Traditional software A/B tests compare binary outcomes: did the user click? Did they convert? LLM quality is continuous, multi-dimensional, and noisy.

Sources of variance in LLM evaluation:

  1. Model non-determinism. At temperature > 0, the same prompt produces different outputs on each run.
  2. Input sensitivity. Small changes in phrasing can cause large quality swings.
  3. Evaluation noise. LLM-as-judge scores vary slightly across evaluation runs.
  4. Selection bias. If your test set isn't representative, results won't generalize.

These sources of variance add up. The improvement you are looking for is small relative to the noise, so confidence intervals are wide and you need more data than you think to draw reliable conclusions.

Experimental Design

Define the Hypothesis First

Before running any tests, state your hypothesis precisely:

❌ Vague: "Claude is better than GPT-4 for our use case"

✅ Specific: "Switching from GPT-4-turbo to Claude-opus-4-6 in our document summarization endpoint will improve faithfulness (measured by LLM-as-judge) from 0.84 to ≥0.87, with statistical significance at α=0.05 and power 1−β=0.80"

A specific hypothesis forces you to:

  • Define the primary metric (faithfulness, not "overall quality")
  • Specify the minimum detectable effect (0.03 improvement)
  • Commit to statistical parameters before seeing data
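
One way to make that commitment concrete is to write the hypothesis down as an immutable record before any data is collected. The sketch below is only an illustration: the class name `ExperimentHypothesis` and its fields are hypothetical, not part of any library.

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class ExperimentHypothesis:
    """Pre-registered hypothesis; frozen so it cannot be edited after the fact."""
    primary_metric: str
    baseline: float
    target: float
    alpha: float = 0.05
    power: float = 0.80

    def __post_init__(self):
        # Reject obviously invalid parameters at registration time
        if not 0 < self.alpha < 1 or not 0 < self.power < 1:
            raise ValueError("alpha and power must be in (0, 1)")
        if self.target <= self.baseline:
            raise ValueError("target must exceed baseline")

    @property
    def minimum_detectable_effect(self) -> float:
        # Rounded to hide floating-point noise in the subtraction
        return round(self.target - self.baseline, 10)

hypothesis = ExperimentHypothesis("faithfulness", baseline=0.84, target=0.87)
print(json.dumps(asdict(hypothesis)))        # store this alongside the experiment
print(hypothesis.minimum_detectable_effect)  # 0.03
```

Storing the serialized record with a timestamp before the first evaluation run makes it easy to check later that the metric and thresholds were not changed after seeing results.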

Choose Your Primary Metric

Pick one primary metric for each experiment. If you test multiple metrics simultaneously, you need to adjust your significance threshold (for example with a Bonferroni correction, which divides α by the number of metrics), and that makes detection harder.
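
To get a feel for what that correction costs, here is a rough sketch using only the standard library. The helper names are illustrative, and the sample size uses the usual two-group normal approximation with the example values from this guide (effect 0.03, std dev 0.12).

```python
import math
from statistics import NormalDist

def bonferroni_alpha(alpha: float, n_metrics: int) -> float:
    """Per-metric significance threshold under a Bonferroni correction."""
    return alpha / n_metrics

def samples_per_group(effect: float, std: float, alpha: float, power: float = 0.80) -> int:
    """Two-group normal approximation: n = 2 * ((z_alpha + z_beta) * std / effect)^2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * std / effect) ** 2)

one_metric = samples_per_group(0.03, 0.12, alpha=0.05)
three_metrics = samples_per_group(0.03, 0.12, alpha=bonferroni_alpha(0.05, 3))
print(one_metric, three_metrics)
```

Under these assumptions, testing three metrics instead of one raises the required sample size by roughly a third.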

Change Type          | Recommended Primary Metric
Model swap           | User preference (head-to-head)
Prompt optimization  | Task-specific accuracy
RAG configuration    | Faithfulness + context recall
Output format change | User satisfaction score
Temperature tuning   | Quality variance (std dev of scores)

Sample Size Calculation

The most common mistake in LLM evaluation is running too few samples. Here's how to calculate the required size:

from scipy import stats
import numpy as np

def required_sample_size(
    baseline_mean: float,
    minimum_detectable_effect: float,
    baseline_std: float,
    alpha: float = 0.05,
    power: float = 0.80
) -> int:
    """
    Calculate required sample size for a two-sample t-test.
    
    Args:
        baseline_mean: Current system's mean score (e.g., 0.84)
        minimum_detectable_effect: Smallest improvement worth detecting (e.g., 0.03)
        baseline_std: Standard deviation of scores (e.g., 0.12)
        alpha: Significance level (probability of false positive)
        power: Statistical power (probability of detecting real effect)
    
    Returns:
        Required number of samples per group
    """
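    # Standardized effect size (Cohen's d)
    effect_size = minimum_detectable_effect / baseline_std
    
    # Z-scores for alpha (two-tailed) and power
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    
    # Two independent groups: the difference of two means has twice the
    # variance of a single mean, hence the factor of 2 (normal approximation)
    n = 2 * ((z_alpha + z_beta) / effect_size) ** 2
    return int(np.ceil(n))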

# Example: Testing prompt change
baseline_mean = 0.84
min_effect = 0.03    # We care about improvements of 0.03+
baseline_std = 0.12  # Typical variance for our evaluation metric

n_per_group = required_sample_size(baseline_mean, min_effect, baseline_std)
print(f"Required samples per group: {n_per_group}")  # ~252
print(f"Total samples needed: {n_per_group * 2}")    # ~504

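Most teams run 50-100 samples and declare a winner. With 50 samples per group and a standard deviation of 0.12, that gives roughly 55% power to detect a 0.05 improvement, barely better than a coin flip, and only about 24% power to detect 0.03.

Rule of thumb: For typical LLM quality scores (std dev ~0.10-0.15), an unpaired comparison needs roughly 200-400 samples per group to detect an improvement of about 0.03. Smaller effects cost much more: detecting 0.02 at a std dev of 0.15 takes close to 900 samples per group.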
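
The power figures for small test sets can be checked with the same normal approximation run in reverse. This is a minimal sketch using only the standard library; `power_two_sample` is an illustrative helper, and it ignores the negligible probability of rejecting in the wrong direction.

```python
import math
from statistics import NormalDist

def power_two_sample(n_per_group: int, effect: float, std: float, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-group comparison of means."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    # Expected z-statistic if the true improvement equals `effect`
    noncentrality = (effect / std) * math.sqrt(n_per_group / 2)
    return z.cdf(noncentrality - z_alpha)

for n in (50, 100, 252):
    print(n, round(power_two_sample(n, effect=0.05, std=0.12), 2),
          round(power_two_sample(n, effect=0.03, std=0.12), 2))
```

Running it for your own standard deviation before collecting data shows quickly whether a planned test set is large enough to be worth evaluating.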

Running the Experiment

Parallel Evaluation Protocol

Evaluate both variants on the same inputs simultaneously. This eliminates input distribution variance:

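import asyncio
from dataclasses import dataclass

# Assumed to exist in your codebase:
#   generate(model, prompt, input_item): async helper that returns the model's response
#   evaluator.evaluate(input_item, response): async, returns an object with .overall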

@dataclass
class ABTestConfig:
    experiment_id: str
    control_model: str
    treatment_model: str
    control_prompt: str
    treatment_prompt: str
    n_samples: int
    evaluator: object

async def run_ab_test(
    config: ABTestConfig,
    inputs: list[dict]  # Your test inputs
) -> dict:
    """Run A/B test with paired evaluation."""
    
    assert len(inputs) >= config.n_samples, \
        f"Need {config.n_samples} inputs, got {len(inputs)}"
    
    # Use same inputs for both variants (paired design)
    test_inputs = inputs[:config.n_samples]
    
    async def evaluate_pair(input_item):
        # Run both variants on same input
        control_response, treatment_response = await asyncio.gather(
            generate(config.control_model, config.control_prompt, input_item),
            generate(config.treatment_model, config.treatment_prompt, input_item)
        )
        
        # Evaluate both
        control_score, treatment_score = await asyncio.gather(
            config.evaluator.evaluate(input_item, control_response),
            config.evaluator.evaluate(input_item, treatment_response)
        )
        
        return {
            'input_id': input_item['id'],
            'tags': input_item.get('tags', ['untagged']),  # kept for subgroup analysis
            'control_score': control_score.overall,
            'treatment_score': treatment_score.overall,
            'delta': treatment_score.overall - control_score.overall
        }
    
    # Evaluate all pairs concurrently (with rate limiting)
    semaphore = asyncio.Semaphore(10)
    async def bounded(item):
        async with semaphore:
            return await evaluate_pair(item)
    
    results = await asyncio.gather(*[bounded(inp) for inp in test_inputs])
    return analyze_results(results, config)
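
The paired design also changes the sample size arithmetic, because the test is run on per-input deltas rather than on two independent groups. The sketch below assumes both variants have the same score standard deviation and that their scores on the same input have correlation `correlation`; both are assumptions you would estimate from a pilot run, and `paired_sample_size` is an illustrative helper.

```python
import math
from statistics import NormalDist

def paired_sample_size(effect: float, std: float, correlation: float,
                       alpha: float = 0.05, power: float = 0.80) -> int:
    """Number of paired inputs needed, using a normal approximation."""
    z = NormalDist()
    # Std dev of (treatment - control) on the same input
    std_delta = std * math.sqrt(2 * (1 - correlation))
    n = ((z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * std_delta / effect) ** 2
    return math.ceil(n)

for rho in (0.0, 0.5, 0.8):
    print(rho, paired_sample_size(effect=0.03, std=0.12, correlation=rho))
```

With uncorrelated scores you still need 252 inputs, but each input serves both variants, so the total is half that of an unpaired test. The more the two variants agree on which inputs are hard, the fewer inputs you need.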

Analysis

from scipy import stats
import numpy as np

def analyze_results(results: list[dict], config: ABTestConfig) -> dict:
    control_scores = np.array([r['control_score'] for r in results])
    treatment_scores = np.array([r['treatment_score'] for r in results])
    deltas = treatment_scores - control_scores
    
    # Paired t-test (more powerful than independent since we used paired design)
    t_stat, p_value = stats.ttest_rel(treatment_scores, control_scores)
    
    # Effect size (Cohen's d for paired samples), using the sample std (ddof=1)
    cohens_d = deltas.mean() / deltas.std(ddof=1)
    
    # Bootstrap confidence interval on the delta
    bootstrap_deltas = [
        np.random.choice(deltas, len(deltas)).mean() 
        for _ in range(10000)
    ]
    ci_lower = np.percentile(bootstrap_deltas, 2.5)
    ci_upper = np.percentile(bootstrap_deltas, 97.5)
    
    # Practical significance: what fraction improved?
    improved = (deltas > 0).sum() / len(deltas)
    meaningfully_improved = (deltas > 0.05).sum() / len(deltas)  # improved by more than 0.05 (absolute)
    
    return {
        'experiment_id': config.experiment_id,
        'n_samples': len(results),
        'control': {
            'model': config.control_model,
            'mean': control_scores.mean(),
            'std': control_scores.std(),
        },
        'treatment': {
            'model': config.treatment_model,
            'mean': treatment_scores.mean(),
            'std': treatment_scores.std(),
        },
        'delta': {
            'mean': deltas.mean(),
            'ci_95': (ci_lower, ci_upper),
            'cohens_d': cohens_d,
        },
        'statistical': {
            'p_value': p_value,
            'significant_at_05': p_value < 0.05,
            'significant_at_01': p_value < 0.01,
        },
        'practical': {
            'pct_improved': improved,
            'pct_meaningfully_improved': meaningfully_improved,
        },
        'recommendation': make_recommendation(p_value, deltas.mean(), cohens_d)
    }

def make_recommendation(p_value, mean_delta, cohens_d):
    if p_value >= 0.05:
        return "NO_DIFFERENCE: Cannot reject null hypothesis at the planned sample size. Keep control or plan a larger experiment."
    
    if mean_delta < 0:
        return "KEEP_CONTROL: Treatment is significantly worse."
    
    if abs(cohens_d) < 0.2:
        return "MARGINAL: Statistically significant but effect is small. Consider cost tradeoffs."
    
    if abs(cohens_d) >= 0.5:
        return "SHIP_TREATMENT: Large, significant improvement."
    
    return "LIKELY_SHIP: Significant improvement, moderate effect size."
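
Quality scores are bounded and often skewed, so it can be worth cross-checking the paired t-test with a test that makes no normality assumption. The following is a minimal sketch of a sign-flip permutation test on the deltas, using only the standard library. The function name and the example deltas are made up for illustration.

```python
import random

def sign_flip_p_value(deltas: list[float], n_permutations: int = 10_000, seed: int = 0) -> float:
    """Two-sided paired permutation test.

    Under the null hypothesis, each delta is equally likely to be positive
    or negative, so we randomly flip signs and compare mean deltas.
    """
    rng = random.Random(seed)
    n = len(deltas)
    observed = abs(sum(deltas) / n)
    hits = 0
    for _ in range(n_permutations):
        permuted = sum(d if rng.random() < 0.5 else -d for d in deltas) / n
        # Small tolerance so floating-point ties count as "at least as extreme"
        if abs(permuted) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0
    return (hits + 1) / (n_permutations + 1)

example_deltas = [0.05, 0.02, -0.01, 0.04, 0.03, 0.06, -0.02, 0.01, 0.05, 0.03, 0.02, 0.04]
print(f"p = {sign_flip_p_value(example_deltas):.4f}")
```

If this p-value and the t-test p-value disagree sharply, that is a hint that a few extreme deltas are driving the result.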

Common Mistakes and How to Avoid Them

Peeking at Results

The problem: Running evaluation in batches and checking for significance after each batch. If you stop at the first batch that shows p<0.05, you inflate your false positive rate.

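Example: You plan to run 400 samples. After 100, p=0.04, so you declare victory. But if you check after every 100 samples and stop at the first p<0.05, the overall false positive rate is roughly 13%, not 5%. (The often-quoted 1 − 0.95⁴ ≈ 19% assumes four independent tests; cumulative looks at the same data are correlated, so the inflation is somewhat smaller.)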
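
The inflation is easy to reproduce by simulation. The sketch below draws deltas with no true effect, checks a z-test at each look, and counts how often any look crosses the 5% threshold. It uses only the standard library and assumes unit-variance normal deltas for simplicity; the function name is illustrative.

```python
import math
import random

def peeking_false_positive_rate(looks=(100, 200, 300, 400), n_sims=2000, seed=0) -> float:
    """Fraction of null experiments declared significant at any look."""
    rng = random.Random(seed)
    z_crit = 1.959964  # two-sided 5% critical value
    false_positives = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for look in looks:
            while n < look:
                total += rng.gauss(0.0, 1.0)  # delta with no true effect
                n += 1
            z = total / math.sqrt(n)  # z-statistic with known unit variance
            if abs(z) > z_crit:
                false_positives += 1
                break  # the experimenter stops and "ships"
    return false_positives / n_sims

peeking_rate = peeking_false_positive_rate()
single_look_rate = peeking_false_positive_rate(looks=(400,))
print(f"four looks: {peeking_rate:.3f}, single look: {single_look_rate:.3f}")
```

The single-look rate should land near 0.05 and the four-look rate clearly above it, with some simulation noise in both.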

Fix: Pre-commit to your sample size before running. Use sequential testing methods (like SPRT, mSPRT, or group-sequential boundaries) if you genuinely need to stop early.

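import numpy as np
from scipy import stats

class SequentialTest:
    """
    Group-sequential test for early stopping, using an approximate
    O'Brien-Fleming-style boundary. Use this if you cannot pre-commit
    to a fixed sample size. Call should_stop() only at a few pre-planned
    interim looks, not after every sample.
    """
    def __init__(self, alpha=0.05, beta=0.20, min_samples=50, n_final=400):
        self.alpha = alpha
        self.beta = beta  # not used by the boundary; informs your choice of n_final
        self.min_samples = min_samples
        self.n_final = n_final  # your planned final sample size
        self.results = []
    
    def add_result(self, control_score: float, treatment_score: float):
        self.results.append(treatment_score - control_score)
    
    def should_stop(self) -> tuple[bool, str]:
        if len(self.results) < self.min_samples:
            return False, "Minimum samples not reached"
        
        deltas = np.array(self.results)
        t_stat, p_value = stats.ttest_1samp(deltas, 0)
        
        # O'Brien-Fleming-style boundary (conservative early stopping):
        # the critical z-value is scaled by sqrt(n_final / n), so early looks
        # need overwhelming evidence and the final look is close to alpha.
        # This is an approximation; exact boundaries depend on the number of looks.
        n = len(self.results)
        info_fraction = min(n / self.n_final, 1.0)
        z_crit = stats.norm.ppf(1 - self.alpha / 2) / np.sqrt(info_fraction)
        p_threshold = 2 * stats.norm.sf(z_crit)
        
        if p_value < p_threshold:
            direction = "treatment" if deltas.mean() > 0 else "control"
            return True, f"Significant at p={p_value:.4f}. Prefer {direction}."
        
        return False, f"Continue. Current p={p_value:.4f}"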

Testing Too Many Things At Once

The problem: Changing both the model and the prompt simultaneously. You can't tell which change drove the difference.

Fix: One variable per experiment. If you want to test both, run:

  1. Control (current model + current prompt)
  2. Treatment A (new model + current prompt)
  3. Treatment B (current model + new prompt)
  4. Treatment C (new model + new prompt)

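This is a 2×2 factorial design. With a paired setup it needs twice as many generations and evaluations (four arms per input instead of two), but it lets you separate the model effect from the prompt effect and measure their interaction.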
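
Given per-input scores for all four arms, the main effects and the interaction are simple contrasts. The sketch below is illustrative: the dictionary keys and the example scores are invented, and in practice you would also put a confidence interval on each contrast.

```python
from statistics import mean

def factorial_effects(scores: list[dict]) -> dict:
    """Estimate main effects and interaction from paired 2x2 scores.

    Each item holds one input's score under the four arms:
    'control', 'new_model', 'new_prompt', and 'both'.
    """
    model = mean(((s["new_model"] - s["control"]) + (s["both"] - s["new_prompt"])) / 2
                 for s in scores)
    prompt = mean(((s["new_prompt"] - s["control"]) + (s["both"] - s["new_model"])) / 2
                  for s in scores)
    # Interaction: does the new prompt help more (or less) on the new model?
    interaction = mean(s["both"] - s["new_model"] - s["new_prompt"] + s["control"]
                       for s in scores)
    return {"model_effect": model, "prompt_effect": prompt, "interaction": interaction}

example_scores = [
    {"control": 0.80, "new_model": 0.85, "new_prompt": 0.82, "both": 0.90},
    {"control": 0.70, "new_model": 0.74, "new_prompt": 0.73, "both": 0.80},
    {"control": 0.90, "new_model": 0.93, "new_prompt": 0.91, "both": 0.97},
]
effects = factorial_effects(example_scores)
print({k: round(v, 3) for k, v in effects.items()})
```

A positive interaction means the two changes reinforce each other, so judging either one in isolation would understate the benefit of shipping both.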

Not Accounting for Evaluation Noise

LLM-as-judge scores vary across evaluation runs. This adds noise on top of model output variance.

Measure evaluation noise:

def measure_eval_noise(evaluator, fixed_example: dict, n_runs: int = 20) -> float:
    """
    Run the same evaluation multiple times to measure judge variance.
    Returns standard deviation of scores across runs.
    """
    scores = []
    for _ in range(n_runs):
        result = evaluator.evaluate(
            question=fixed_example['input'],
            expected=fixed_example['expected'],
            actual=fixed_example['actual']
        )
        scores.append(result.overall)
    
    std = np.std(scores, ddof=1)  # sample standard deviation across runs
    print(f"Evaluation noise: std={std:.4f}, cv={std/np.mean(scores):.1%}")
    return std

# If evaluation std > 0.05 on the same input, you have noisy evals
# Mitigate by: setting temperature=0 for judge, averaging 3 runs

Mitigating evaluation noise:

  • Use temperature=0 for your LLM judge (this reduces run-to-run variation but may not remove it entirely)
  • Average 3 evaluation runs for each example
  • Increase sample size to compensate for noise
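
To judge how much averaging helps, it is useful to combine the two noise sources. The sketch below assumes judge noise is independent of output quality variance, so variances add and averaging k judge runs divides the judge variance by k. The function name and the numbers are hypothetical.

```python
import math

def effective_std(output_std: float, judge_std: float, runs_per_example: int = 1) -> float:
    """Std dev of the per-example score after averaging judge runs."""
    return math.sqrt(output_std ** 2 + judge_std ** 2 / runs_per_example)

single_run = effective_std(output_std=0.10, judge_std=0.05, runs_per_example=1)
three_runs = effective_std(output_std=0.10, judge_std=0.05, runs_per_example=3)

# Required sample size scales with variance, i.e. with the squared std dev
relative_samples = (single_run / three_runs) ** 2
print(f"{single_run:.4f} -> {three_runs:.4f}, sample size ratio {relative_samples:.2f}")
```

In this example a single judge run needs about 15% more inputs than three averaged runs. Whether tripling judge calls is worth that depends on which is cheaper for you: inputs or evaluations.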

Ignoring Subgroup Performance

A treatment that improves average quality might hurt performance on an important subgroup.

def subgroup_analysis(results: list[dict]) -> dict:
    """Check if the treatment effect is consistent across subgroups."""
    subgroups = {}
    
    for result in results:
        for tag in result.get('tags', ['untagged']):
            if tag not in subgroups:
                subgroups[tag] = []
            subgroups[tag].append(result['delta'])
    
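    # Keep only subgroups large enough for a reliable estimate
    testable = {tag: d for tag, d in subgroups.items() if len(d) >= 20}
    
    # Bonferroni-adjust the threshold: testing many subgroups at 0.05 each
    # would flag spurious "significant" effects by chance
    threshold = 0.05 / max(len(testable), 1)
    
    analysis = {}
    for tag, deltas in testable.items():
        t_stat, p_value = stats.ttest_1samp(deltas, 0)
        analysis[tag] = {
            'n': len(deltas),
            'mean_delta': np.mean(deltas),
            'p_value': p_value,
            'significant': p_value < threshold,
            'direction': 'improvement' if np.mean(deltas) > 0 else 'regression'
        }
    
    return analysis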

Head-to-Head (Preference) Testing

For user-facing quality differences that are hard to quantify with metrics, use preference testing:

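import random

import anthropic
from scipy import stats

async def run_preference_test(
    control_outputs: list[str],
    treatment_outputs: list[str],
    questions: list[str],
    judge_model: str = "claude-opus-4-6"
) -> dict:
    """Have an LLM judge which response it prefers."""
    client = anthropic.AsyncAnthropic()  # async client, since this is a coroutine
    
    preferences = []  # recorded as "A" = control, "B" = treatment
    for q, ctrl, treat in zip(questions, control_outputs, treatment_outputs):
        # Randomize presentation order to cancel out the judge's position bias
        swapped = random.random() < 0.5
        first, second = (treat, ctrl) if swapped else (ctrl, treat)
        
        prompt = f"""You are comparing two AI responses to the same question.

Question: {q}

Response A: {first}

Response B: {second}

Which response is better? Consider: accuracy, helpfulness, clarity, and completeness.
Answer with only "A", "B", or "TIE"."""
        
        response = await client.messages.create(
            model=judge_model,
            max_tokens=10,
            temperature=0,
            messages=[{"role": "user", "content": prompt}]
        )
        verdict = response.content[0].text.strip().upper()
        if swapped and verdict in ("A", "B"):
            # Map the verdict back so that "B" always means treatment
            verdict = "B" if verdict == "A" else "A"
        preferences.append(verdict)
    
    a_wins = preferences.count("A")
    b_wins = preferences.count("B")
    ties = preferences.count("TIE")
    total = len(preferences)
    
    # Binomial test: is B win rate significantly > 50%?
    # Exclude ties
    non_tie = a_wins + b_wins
    if non_tie > 0:
        # binomtest replaces binom_test, which was removed from recent SciPy releases
        p_value = stats.binomtest(b_wins, non_tie, 0.5, alternative='greater').pvalue
    else:
        p_value = 1.0  # every comparison was a tie
    
    return {
        'A_wins': a_wins / total,
        'B_wins': b_wins / total,
        'ties': ties / total,
        'p_value_B_better': p_value,
        'B_significantly_better': p_value < 0.05
    }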
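
A p-value alone does not say how large the preference is, so it helps to report a confidence interval on the treatment win rate among non-tie comparisons. Below is a minimal sketch of the Wilson score interval using only the standard library; `wilson_interval` is an illustrative helper and the counts are made up.

```python
import math
from statistics import NormalDist

def wilson_interval(wins: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a win rate of `wins` out of `n` non-tie comparisons."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = wins / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width

low, high = wilson_interval(wins=60, n=100)
print(f"win rate 0.60, 95% CI ({low:.3f}, {high:.3f})")
```

With 60 wins out of 100 the interval only just excludes 0.50, a reminder that a 60/40 split on 100 comparisons is weaker evidence than it looks.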

Decision Framework

After running your experiment:

Scenario                                        | Decision
p ≥ 0.05                                        | No evidence of difference. Keep control unless treatment has cost advantages.
p < 0.05, treatment worse                       | Keep control.
p < 0.05, treatment better, Cohen's d < 0.2     | Marginal improvement. Weigh against cost/latency changes.
p < 0.05, treatment better, Cohen's d ≥ 0.2     | Ship treatment.
Subgroup regression despite overall improvement | Investigate before shipping.

Always document the experiment: Record hypothesis, sample size, results, decision, and rationale. This institutional memory prevents re-running the same experiments and helps calibrate future ones.

Summary

Rigorous LLM A/B testing requires:

  1. Pre-specified hypotheses with a single primary metric
  2. Correct sample sizes — calculate before running, not after
  3. Paired experimental design — same inputs for both variants
  4. Appropriate statistical tests — paired t-test, bootstrap CI
  5. Subgroup analysis — verify consistent improvement across user segments
  6. No peeking — commit to sample size before seeing results

The overhead of statistical rigor is small compared to the cost of shipping a model change that you think is an improvement but actually isn't, or worse, of not shipping a real improvement because your underpowered test showed p=0.07.
