A/B Testing AI Features: Experiments With Guardrails for LLM Products

A/B testing AI features is harder than testing traditional software. Changing a button color produces a measurable, binary outcome — users click or they don't. Changing an LLM prompt produces different text outputs, and measuring which output is "better" requires defining what better even means for your use case.

This guide covers how to design, run, and safely conclude A/B tests on AI features — including model variants, prompt changes, and RAG pipeline configurations.

What You're Testing in AI A/B Experiments

AI A/B tests typically compare one or more of:

  • Model variants: GPT-4o vs. GPT-4o-mini, Claude Sonnet vs. Haiku
  • Prompt variants: Different system prompts, few-shot examples, output format instructions
  • Temperature/parameter variants: Higher vs. lower temperature for creative tasks
  • RAG configurations: Different chunk sizes, retrieval strategies, context window sizes
  • Feature flags: Show AI-generated content to 50% of users, static content to the other 50%

Each variant produces different outputs. The challenge is measuring which variant better serves your users.
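One way to keep a comparison interpretable is to describe each arm as a single configuration object and confirm that only the setting under test differs. The sketch below is illustrative only: `VariantConfig`, its fields, the default values, and the example prompts are hypothetical names, not part of any particular framework.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class VariantConfig:
    """Everything that may differ between arms, kept in one place (hypothetical fields)."""
    name: str
    model: str
    system_prompt: str
    temperature: float = 0.2
    chunk_size: int = 512
    top_k: int = 5

def changed_fields(a: VariantConfig, b: VariantConfig) -> list[str]:
    """Return the settings that differ between two variants, ignoring the label."""
    return [
        f.name for f in fields(a)
        if f.name != "name" and getattr(a, f.name) != getattr(b, f.name)
    ]

# Example arms: same model and parameters, different system prompt.
CONTROL_CONFIG = VariantConfig(
    name="control",
    model="gpt-4o-mini",
    system_prompt="You are a helpful assistant.",
)
TREATMENT_CONFIG = VariantConfig(
    name="treatment",
    model="gpt-4o-mini",
    system_prompt="You are a helpful assistant. Reason step by step.",
)

print(changed_fields(CONTROL_CONFIG, TREATMENT_CONFIG))
```

If more than one field shows up in the result, the experiment can still run, but any difference in outcomes can no longer be attributed to a single change.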

Defining Success Metrics

Before running any experiment, define how you'll measure success. AI feature metrics fall into three categories:

Behavioral metrics (user actions):

  • Task completion rate (did the user complete what they came to do?)
  • Click-through rate on AI-generated recommendations
  • Copy/use rate (did the user copy the AI's text or regenerate?)
  • Session length and return rate

Quality metrics (automated evaluation):

  • Faithfulness score (for RAG systems)
  • Output length vs. user-specified target
  • Format compliance (did the LLM follow instructions?)
  • Latency (p50, p95, p99)

Satisfaction metrics (explicit feedback):

  • Thumbs up/down on responses
  • Star ratings
  • "Was this helpful?" prompts
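
A minimal tracker for these events might look like this (db is assumed to be any storage client that exposes an insert method):

from datetime import datetime, timezone

class AIExperimentMetrics:
    def __init__(self, experiment_id: str, db):
        self.experiment_id = experiment_id
        self.db = db  # Storage client with an insert(dict) method (assumed interface)

    def track_response(
        self,
        variant: str,
        query: str,
        response: str,
        latency_ms: float,
        user_id: str,
    ):
        """Record one LLM response served to a user."""
        self.db.insert({
            "experiment_id": self.experiment_id,
            "variant": variant,
            "query": query,
            "response": response,
            "response_length": len(response),
            "latency_ms": latency_ms,
            "user_id": user_id,
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

    def track_user_action(
        self,
        variant: str,
        user_id: str,
        action: str,  # "copy", "regenerate", "thumbs_up", "thumbs_down", "task_completed"
    ):
        """Record what the user did with a response."""
        self.db.insert({
            "experiment_id": self.experiment_id,
            "variant": variant,
            "user_id": user_id,
            "action": action,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })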
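Once responses and actions are tracked, the behavioral metrics listed above reduce to per-variant rates. The sketch below assumes the rows written by track_response and track_user_action have been loaded back as lists of dicts. The function name `action_rates` is hypothetical, and counting by distinct user is a choice made here because variants are assigned per user.

```python
from collections import defaultdict

def action_rates(responses: list[dict], actions: list[dict]) -> dict:
    """Share of exposed users in each variant who performed each action at least once."""
    # Users who actually received a response in each variant (the denominator).
    exposed = defaultdict(set)
    for row in responses:
        exposed[row["variant"]].add(row["user_id"])

    # Users who performed each action, grouped by variant.
    acted = defaultdict(lambda: defaultdict(set))
    for row in actions:
        acted[row["variant"]][row["action"]].add(row["user_id"])

    rates = {}
    for variant, users in exposed.items():
        rates[variant] = {
            action: len(who & users) / len(users)
            for action, who in acted[variant].items()
        }
    return rates

# Tiny illustrative dataset (made-up rows, not real results).
example_responses = [
    {"variant": "control", "user_id": "u1"},
    {"variant": "control", "user_id": "u2"},
    {"variant": "control", "user_id": "u1"},
    {"variant": "treatment", "user_id": "u3"},
    {"variant": "treatment", "user_id": "u4"},
]
example_actions = [
    {"variant": "control", "user_id": "u1", "action": "copy"},
    {"variant": "control", "user_id": "u1", "action": "copy"},
    {"variant": "control", "user_id": "u2", "action": "regenerate"},
    {"variant": "treatment", "user_id": "u3", "action": "copy"},
    {"variant": "treatment", "user_id": "u4", "action": "copy"},
]
example_rates = action_rates(example_responses, example_actions)
```

Counting distinct users rather than raw events keeps one very active user from dominating a variant's numbers.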

Setting Up the Experiment Framework

Use a feature flag service or build simple variant assignment:

import hashlib
import time
from enum import Enum

class Variant(Enum):
    CONTROL = "control"
    TREATMENT = "treatment"

def assign_variant(user_id: str, experiment_id: str, traffic_split: float = 0.5) -> Variant:
    """
    Deterministic variant assignment: the same user always gets the same variant.
    Prevents variant switching from confounding results.
    """
    # Including the experiment ID makes assignments independent across experiments.
    hash_input = f"{user_id}:{experiment_id}"
    hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
    normalized = (hash_value % 10000) / 10000  # 0.0000 to 0.9999

    return Variant.TREATMENT if normalized < traffic_split else Variant.CONTROL

# Usage. llm, metrics, PROMPT_V1_BASELINE, and PROMPT_V2_WITH_COT are assumed
# to be defined elsewhere in your application.
def handle_user_query(user_id: str, query: str) -> str:
    variant = assign_variant(user_id, experiment_id="prompt-v2-test")

    # Treatment: new prompt with chain-of-thought. Control: existing prompt.
    if variant == Variant.TREATMENT:
        system_prompt = PROMPT_V2_WITH_COT
    else:
        system_prompt = PROMPT_V1_BASELINE

    # Time the call so latency can be compared between variants.
    start = time.perf_counter()
    response = llm.complete(query, system_prompt=system_prompt)
    elapsed_ms = (time.perf_counter() - start) * 1000

    metrics.track_response(
        variant=variant.value,
        query=query,
        response=response,
        latency_ms=elapsed_ms,
        user_id=user_id,
    )

    return response

Testing the Experiment Infrastructure

Before running a live experiment, test that the infrastructure works correctly:

def test_variant_assignment_is_deterministic():
    """Same user always gets the same variant."""
    user_id = "user-abc-123"
    experiment_id = "test-exp-001"
    
    variants = [assign_variant(user_id, experiment_id) for _ in range(10)]
    assert len(set(variants)) == 1, f"Variant changed across calls: {variants}"

def test_variant_assignment_splits_traffic_correctly():
    """50/50 split should assign roughly equal users to each variant."""
    users = [f"user-{i}" for i in range(1000)]
    variants = [assign_variant(u, "test-exp") for u in users]
    
    treatment_count = sum(1 for v in variants if v == Variant.TREATMENT)
    treatment_ratio = treatment_count / len(users)
    
    # Should be close to 0.5 (allow 5 percentage points of tolerance)
    assert 0.45 <= treatment_ratio <= 0.55, (
        f"Traffic split is {treatment_ratio:.2%}, expected ~50%"
    )

def test_variant_isolation_between_experiments():
    """Assignments in one experiment should not predict assignments in another."""
    users = [f"user-{i}" for i in range(1000)]

    differing = sum(
        1 for u in users
        if assign_variant(u, "experiment-1") != assign_variant(u, "experiment-2")
    )

    # A single user may land in the same variant in both experiments, so check
    # the population: with independent 50/50 splits, about half should differ.
    assert 0.40 <= differing / len(users) <= 0.60

Automated Quality Guards During the Experiment

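Running a bad prompt in production while collecting data is costly. Add automated quality checks that catch degraded responses as they happen. The guardrail below checks each response and falls back to the control when a treatment response fails a high-severity check: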

import logging
import random
import time
from typing import Callable, Optional

logger = logging.getLogger(__name__)

class ExperimentGuardrail:
    def __init__(
        self,
        min_faithfulness: float = 0.80,
        max_latency_p95_ms: float = 3000,
        min_format_compliance: float = 0.95,
        faithfulness_scorer: Optional[Callable[[str, str], float]] = None,
    ):
        self.min_faithfulness = min_faithfulness
        # Applied to each response here, so it acts as a per-request ceiling.
        self.max_latency_p95_ms = max_latency_p95_ms
        # Aggregate target; not enforced per response in check_response.
        self.min_format_compliance = min_format_compliance
        # Callable (response, context) -> score in [0, 1], e.g. an LLM judge.
        # Assumed to be supplied by you; the check is skipped when it is None.
        self.faithfulness_scorer = faithfulness_scorer

    def check_response(self, response: str, context: str, latency_ms: float) -> dict:
        issues = []

        # Format compliance: did the LLM follow the output format?
        if not self.is_valid_format(response):
            issues.append({"type": "format", "severity": "high"})

        # Latency guard
        if latency_ms > self.max_latency_p95_ms:
            issues.append({"type": "latency", "value": latency_ms, "severity": "medium"})

        # Faithfulness is expensive to score, so check a 10% sample
        if self.faithfulness_scorer is not None and random.random() < 0.1:
            faithfulness = self.faithfulness_scorer(response, context)
            if faithfulness < self.min_faithfulness:
                issues.append({"type": "faithfulness", "value": faithfulness, "severity": "high"})

        return {"ok": len(issues) == 0, "issues": issues}

    def is_valid_format(self, response: str) -> bool:
        # Check your expected format, e.g. must be valid JSON or start with a bullet
        return len(response) > 10  # Minimal check; customize per use case

# get_llm_response(variant, query, context) is assumed to be defined elsewhere.
def handle_query_with_guardrails(user_id: str, query: str, context: str) -> str:
    variant = assign_variant(user_id, "exp-001")
    guardrail = ExperimentGuardrail()

    start = time.perf_counter()
    response = get_llm_response(variant, query, context)
    latency_ms = (time.perf_counter() - start) * 1000

    check = guardrail.check_response(response, context, latency_ms)

    high_severity = [i for i in check["issues"] if i["severity"] == "high"]
    if high_severity:
        # Log every trigger so fallbacks can be accounted for in the analysis
        logger.error("Guardrail triggered for %s: %s", variant.value, high_severity)
        # Falling back only helps when the failing response came from the treatment
        if variant == Variant.TREATMENT:
            return get_llm_response(Variant.CONTROL, query, context)

    return response
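The per-response check handles one bad answer at a time. To pause the whole experiment when quality degrades, one option is a rolling-window circuit breaker fed with each guardrail result. This is a sketch under assumptions: the class name `ExperimentCircuitBreaker`, the window size, and the thresholds are all hypothetical and would need tuning for real traffic.

```python
from collections import deque

class ExperimentCircuitBreaker:
    """Pauses an experiment when too many recent treatment responses fail checks."""

    def __init__(self, window_size: int = 200, min_samples: int = 50,
                 max_failure_rate: float = 0.10):
        self.window = deque(maxlen=window_size)  # most recent pass/fail results
        self.min_samples = min_samples           # avoid tripping on a handful of requests
        self.max_failure_rate = max_failure_rate
        self.paused = False

    def failure_rate(self) -> float:
        if not self.window:
            return 0.0
        return self.window.count(False) / len(self.window)

    def record(self, ok: bool) -> None:
        """Record one guardrail result (the "ok" field) for a treatment response."""
        self.window.append(bool(ok))
        if len(self.window) >= self.min_samples and self.failure_rate() > self.max_failure_rate:
            # Stays paused until someone investigates and resets it.
            self.paused = True

    def reset(self) -> None:
        self.window.clear()
        self.paused = False

# Usage: 40 passing responses followed by a burst of 10 failures.
breaker = ExperimentCircuitBreaker(window_size=100, min_samples=20, max_failure_rate=0.10)
for _ in range(40):
    breaker.record(True)
paused_before_burst = breaker.paused
for _ in range(10):
    breaker.record(False)
paused_after_burst = breaker.paused
```

While `breaker.paused` is true, the request handler can route every user to the control, which leaves variant assignment untouched and makes the pause easy to reverse.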

Statistical Analysis

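Collect enough data before drawing conclusions, and pick a test that matches each metric. The function below uses a chi-squared test for completion rates and a Mann-Whitney U test for latency, which is usually skewed: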

from scipy import stats
import numpy as np

def analyze_experiment_results(control_data: dict, treatment_data: dict) -> dict:
    """
    control_data / treatment_data: {
        'task_completions': [0, 1, 1, 0, ...],  # one entry per user
        'latencies_ms': [234, 456, ...],
        'thumbs_up_rate': float,  # not analyzed in this function
    }
    """
    results = {}

    # Task completion rate: chi-squared test for proportions.
    # The test assumes independent observations, so count each user once.
    control_completions = sum(control_data['task_completions'])
    control_total = len(control_data['task_completions'])
    treatment_completions = sum(treatment_data['task_completions'])
    treatment_total = len(treatment_data['task_completions'])

    contingency = [
        [control_completions, control_total - control_completions],
        [treatment_completions, treatment_total - treatment_completions],
    ]
    chi2, p_value, _, _ = stats.chi2_contingency(contingency)

    control_rate = control_completions / control_total
    treatment_rate = treatment_completions / treatment_total

    results['task_completion'] = {
        'control_rate': control_rate,
        'treatment_rate': treatment_rate,
        'p_value': p_value,
        'significant': p_value < 0.05,
        # Relative lift is undefined when the control rate is zero
        'relative_improvement': (treatment_rate / control_rate - 1) if control_rate > 0 else None,
    }
    
    # Latency — Mann-Whitney U test (non-parametric, handles skewed distributions)
    u_stat, p_lat = stats.mannwhitneyu(
        control_data['latencies_ms'],
        treatment_data['latencies_ms'],
        alternative='two-sided'
    )
    
    results['latency'] = {
        'control_p50': np.percentile(control_data['latencies_ms'], 50),
        'treatment_p50': np.percentile(treatment_data['latencies_ms'], 50),
        'control_p95': np.percentile(control_data['latencies_ms'], 95),
        'treatment_p95': np.percentile(treatment_data['latencies_ms'], 95),
        'p_value': p_lat,
        'significant': p_lat < 0.05,
    }
    
    return results

def test_experiment_has_sufficient_power():
    """Verify you have enough data before calling results."""
    # Placeholder floor: derive the real number from your baseline rate
    # and the smallest effect you want to detect.
    MIN_SAMPLES_PER_VARIANT = 500

    # db is assumed to be a query helper defined elsewhere in your test setup.
    control_count = db.count("experiment_results", variant="control", experiment_id="exp-001")
    treatment_count = db.count("experiment_results", variant="treatment", experiment_id="exp-001")
    
    assert control_count >= MIN_SAMPLES_PER_VARIANT, (
        f"Control has only {control_count} samples, need {MIN_SAMPLES_PER_VARIANT}"
    )
    assert treatment_count >= MIN_SAMPLES_PER_VARIANT, (
        f"Treatment has only {treatment_count} samples, need {MIN_SAMPLES_PER_VARIANT}"
    )
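The comment on MIN_SAMPLES_PER_VARIANT says to adjust it based on expected effect size. One common way to do that is the normal-approximation formula for comparing two proportions, sketched below with only the standard library. The function name `required_sample_size`, the 40% baseline completion rate, and the 8% relative lift are illustrative assumptions, not measurements.

```python
import math
from statistics import NormalDist

def required_sample_size(
    baseline_rate: float,
    relative_lift: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Approximate users needed per variant for a two-sided test of two proportions."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    if not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("Both rates must be strictly between 0 and 1")
    if p1 == p2:
        raise ValueError("relative_lift must be non-zero")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # critical value for the desired power

    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)

# Example: detect an 8% relative lift on an assumed 40% completion rate.
n_per_variant = required_sample_size(baseline_rate=0.40, relative_lift=0.08)
print(n_per_variant)
```

Under these assumptions the answer comes out at about 3,700 users per variant, well above the 500-sample placeholder, which is why the threshold should be computed rather than guessed. Smaller lifts or lower baseline rates push the requirement higher still.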

Gradual Rollout Pattern

Don't jump from 0% to 50% traffic. Use a staged rollout with automated checks at each stage:

ROLLOUT_STAGES = [
    {"traffic_pct": 0.01, "min_hours": 1, "max_error_rate": 0.05},
    {"traffic_pct": 0.05, "min_hours": 4, "max_error_rate": 0.03},
    {"traffic_pct": 0.25, "min_hours": 24, "max_error_rate": 0.02},
    {"traffic_pct": 0.50, "min_hours": 48, "max_error_rate": 0.02},
]

def can_advance_rollout(stage: dict, current_metrics: dict) -> tuple[bool, str]:
    """Returns (can_advance, reason)"""
    
    if current_metrics["hours_running"] < stage["min_hours"]:
        return False, f"Need {stage['min_hours']}h, only {current_metrics['hours_running']:.1f}h elapsed"
    
    if current_metrics["error_rate"] > stage["max_error_rate"]:
        return False, f"Error rate {current_metrics['error_rate']:.2%} exceeds {stage['max_error_rate']:.2%}"
    
    if current_metrics["sample_count"] < 100:
        return False, f"Only {current_metrics['sample_count']} samples, need at least 100"
    
    return True, "All checks passed"

def test_rollout_gates_work():
    """Verify that rollout gates correctly block advancement."""
    stage = {"traffic_pct": 0.05, "min_hours": 4, "max_error_rate": 0.03}
    
    # Should block: too few hours
    can_advance, reason = can_advance_rollout(stage, {
        "hours_running": 1.5,
        "error_rate": 0.01,
        "sample_count": 200,
    })
    assert not can_advance
    assert "hours" in reason.lower()
    
    # Should block: error rate too high
    can_advance, reason = can_advance_rollout(stage, {
        "hours_running": 5.0,
        "error_rate": 0.05,
        "sample_count": 200,
    })
    assert not can_advance
    assert "error rate" in reason.lower()
    
    # Should pass
    can_advance, reason = can_advance_rollout(stage, {
        "hours_running": 5.0,
        "error_rate": 0.01,
        "sample_count": 500,
    })
    assert can_advance

Common Mistakes in AI A/B Testing

Novelty effect: Users engage more with anything new. Run experiments long enough (minimum 2 weeks) to see past the initial novelty spike.

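Peeking at results: Checking significance daily and stopping when p < 0.05 inflates false positive rates dramatically. Decide your sample size before the experiment and don't stop early on a significant result. If you need interim looks, use a sequential testing procedure designed for them.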

Ignoring tail latency: An AI variant that's faster on average but has much worse p99 latency creates a poor experience for 1% of users — which at scale is a lot of people.

Testing prompts in isolation: Prompt changes interact with the model version, the context, and user query distribution. A prompt that's better in offline testing can be worse in production if offline queries don't represent real users.

Not testing the experiment logic itself: Your variant assignment, metric tracking, and rollout logic all need tests. A bug in the assignment code can silently corrupt your entire experiment.
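The peeking mistake described above can be demonstrated with a small A/A simulation, in which both arms share the same true completion rate and any "significant" result is a false positive by construction. This is a sketch using only the standard library; the function names, the 30% base rate, and the ten evenly spaced looks are assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist

def is_significant(control_hits: int, treatment_hits: int, n: int, z_crit: float) -> bool:
    """Two-sided z-test for two proportions with n users per arm."""
    pooled = (control_hits + treatment_hits) / (2 * n)
    if pooled in (0.0, 1.0):
        return False  # no variation yet, nothing to test
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (treatment_hits - control_hits) / n / se
    return abs(z) > z_crit

def simulate_peeking(n_sims: int = 1000, n_per_arm: int = 1000, n_looks: int = 10,
                     base_rate: float = 0.30, alpha: float = 0.05, seed: int = 0) -> dict:
    """Compare false positive rates: test once at the end vs. stop at any significant look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    look_every = max(1, n_per_arm // n_looks)

    stopped_on_a_peek = 0
    significant_at_end = 0
    for _ in range(n_sims):
        control_hits = treatment_hits = 0
        ever_significant = False
        final_significant = False
        for n in range(1, n_per_arm + 1):
            # Both arms draw from the same rate, so there is no real effect.
            control_hits += rng.random() < base_rate
            treatment_hits += rng.random() < base_rate
            if n % look_every == 0 or n == n_per_arm:
                final_significant = is_significant(control_hits, treatment_hits, n, z_crit)
                ever_significant = ever_significant or final_significant
        stopped_on_a_peek += ever_significant
        significant_at_end += final_significant

    return {
        "fixed_horizon_fpr": significant_at_end / n_sims,
        "peeking_fpr": stopped_on_a_peek / n_sims,
    }

peeking_result = simulate_peeking(n_sims=500)
print(peeking_result)
```

The fixed-horizon rate should sit near the nominal 5%, while the rate for stopping at the first significant look typically lands well above it, even though nothing differs between the arms.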

AI A/B testing is how you move from intuition ("I think this prompt is better") to evidence ("this prompt increases task completion by 8% at p=0.03"). Build the infrastructure once, run experiments continuously, and let data — not instinct — drive your AI feature decisions.
