ML Model Regression Testing: Catch Model Degradation Before Production

ML Model Regression Testing: Catch Model Degradation Before Production

ML models degrade in ways that traditional software doesn't. Code doesn't spontaneously get worse — but a model trained on March data can silently degrade by June as the distribution of real-world inputs shifts. A retrained model can score higher on aggregate metrics while regressing on the specific cases your users care about most.

ML regression testing is the practice of validating model behavior against known baselines before every deployment and monitoring for drift in production. This guide covers how to build that validation layer.

Why ML Models Regress

Unlike bugs in code, model degradation is often invisible and gradual:

  • Data drift: Production inputs drift from the training distribution over time
  • Concept drift: The underlying relationship between inputs and outputs changes (user behavior evolves, market conditions shift)
  • Retraining regression: A new model trained on more data scores better overall but worse on important subpopulations
  • Dependency drift: An upstream data pipeline changes its output format or semantics
  • Infrastructure changes: Different hardware, library versions, or serialization produces slightly different floating-point results

Regression testing catches these issues before they reach production.

The Model Validation Baseline

The foundation of regression testing is a fixed evaluation dataset that you never retrain on:

# evaluation/baseline.py
import json
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

class ModelBaseline:
    """
    Tracks performance metrics across model versions.
    Load once, compare every time.
    """
    
    def __init__(self, baseline_path: str):
        with open(baseline_path) as f:
            self.baseline = json.load(f)
    
    def evaluate(self, model, X_eval, y_eval) -> dict:
        predictions = model.predict(X_eval)
        probabilities = model.predict_proba(X_eval)[:, 1] if hasattr(model, 'predict_proba') else None
        
        metrics = {
            "accuracy": accuracy_score(y_eval, predictions),
            "f1_macro": f1_score(y_eval, predictions, average='macro'),
            "f1_weighted": f1_score(y_eval, predictions, average='weighted'),
        }
        
        if probabilities is not None:
            metrics["auc_roc"] = roc_auc_score(y_eval, probabilities)
        
        return metrics
    
    def compare_to_baseline(self, current_metrics: dict, max_regression_pct: float = 0.02) -> dict:
        """
        Returns comparison with pass/fail for each metric.
        max_regression_pct: maximum allowed drop relative to baseline (e.g., 0.02 = 2%)
        """
        results = {}
        
        for metric, current_value in current_metrics.items():
            if metric not in self.baseline:
                results[metric] = {"status": "no_baseline", "current": current_value}
                continue
            
            baseline_value = self.baseline[metric]
            regression = (baseline_value - current_value) / baseline_value
            
            results[metric] = {
                "baseline": baseline_value,
                "current": current_value,
                "regression_pct": regression,
                "passed": regression <= max_regression_pct,
            }
        
        return results

Writing Regression Tests

import pytest
import json
import numpy as np
from your_ml import load_model, load_evaluation_dataset

@pytest.fixture(scope="session")
def eval_data():
    return load_evaluation_dataset("data/eval/regression_eval_v2.parquet")

@pytest.fixture(scope="session")
def baseline():
    with open("evaluation/baseline_metrics.json") as f:
        return json.load(f)

@pytest.fixture(scope="session")
def candidate_model():
    return load_model("models/candidate/model.pkl")

def test_accuracy_not_regressed(candidate_model, eval_data, baseline):
    X, y = eval_data
    predictions = candidate_model.predict(X)
    accuracy = (predictions == y).mean()
    
    baseline_accuracy = baseline["accuracy"]
    regression = baseline_accuracy - accuracy
    
    assert regression <= 0.02, (
        f"Accuracy regressed by {regression:.2%}.\n"
        f"Baseline: {baseline_accuracy:.4f}, Candidate: {accuracy:.4f}"
    )

def test_f1_score_not_regressed(candidate_model, eval_data, baseline):
    from sklearn.metrics import f1_score
    X, y = eval_data
    predictions = candidate_model.predict(X)
    f1 = f1_score(y, predictions, average='weighted')
    
    baseline_f1 = baseline["f1_weighted"]
    regression = baseline_f1 - f1
    
    assert regression <= 0.02, (
        f"F1 regressed by {regression:.2%}.\n"
        f"Baseline: {baseline_f1:.4f}, Candidate: {f1:.4f}"
    )

def test_no_new_high_confidence_errors(candidate_model, eval_data, baseline):
    """
    High-confidence wrong predictions are more damaging than low-confidence ones.
    Test that we haven't introduced new high-confidence errors.
    """
    X, y = eval_data
    probabilities = candidate_model.predict_proba(X)
    predictions = candidate_model.predict(X)
    
    # Find high-confidence wrong predictions (>90% confidence but wrong)
    max_proba = probabilities.max(axis=1)
    wrong = predictions != y
    high_confidence_errors = (max_proba > 0.9) & wrong
    
    error_rate = high_confidence_errors.mean()
    baseline_error_rate = baseline.get("high_confidence_error_rate", 0)
    
    assert error_rate <= baseline_error_rate + 0.005, (
        f"High-confidence error rate increased from {baseline_error_rate:.4f} to {error_rate:.4f}"
    )

Slice-Based Testing

Aggregate metrics can hide regressions in specific subgroups. Test slices of your evaluation data:

def test_performance_on_critical_segments(candidate_model, eval_data, baseline):
    """
    Test model performance on specific data segments that matter most.
    Aggregate accuracy can improve while critical segments regress.
    """
    X, y, metadata = eval_data  # metadata contains segment info
    
    segments = {
        "new_users": metadata["user_age_days"] < 7,
        "enterprise_tier": metadata["tier"] == "enterprise",
        "mobile_traffic": metadata["platform"] == "mobile",
        "high_value_transactions": metadata["transaction_value"] > 1000,
    }
    
    for segment_name, mask in segments.items():
        if mask.sum() < 50:  # Skip segments with too few samples
            continue
        
        X_seg, y_seg = X[mask], y[mask]
        predictions = candidate_model.predict(X_seg)
        accuracy = (predictions == y_seg).mean()
        
        baseline_accuracy = baseline["segments"].get(segment_name, {}).get("accuracy")
        if baseline_accuracy is None:
            continue  # No baseline for this segment yet
        
        regression = baseline_accuracy - accuracy
        
        assert regression <= 0.03, (
            f"Segment '{segment_name}' accuracy regressed by {regression:.2%}.\n"
            f"Baseline: {baseline_accuracy:.4f}, Candidate: {accuracy:.4f}\n"
            f"Segment size: {mask.sum()} samples"
        )

Behavioral Regression Tests

Beyond aggregate metrics, lock in specific behaviors that must not change:

BEHAVIORAL_TEST_CASES = [
    # (input_features, expected_prediction, description)
    ({"age": 25, "income": 50000, "credit_score": 720}, 1, "approved_young_good_credit"),
    ({"age": 45, "income": 120000, "credit_score": 800}, 1, "approved_mid_excellent_credit"),
    ({"age": 30, "income": 25000, "credit_score": 580}, 0, "rejected_poor_credit"),
    ({"age": 65, "income": 80000, "credit_score": 750}, 1, "approved_senior_good_credit"),
]

@pytest.mark.parametrize("features, expected, description", BEHAVIORAL_TEST_CASES)
def test_behavioral_consistency(candidate_model, features, expected, description):
    """
    These specific cases define expected model behavior.
    If any of these flip, something fundamental has changed.
    """
    import pandas as pd
    X = pd.DataFrame([features])
    prediction = candidate_model.predict(X)[0]
    
    assert prediction == expected, (
        f"Behavioral test '{description}' failed.\n"
        f"Input: {features}\n"
        f"Expected: {expected}, Got: {prediction}"
    )

Data Drift Detection

Before running a model on new data, check if the data has drifted from the training distribution:

from scipy import stats
import numpy as np

def detect_feature_drift(
    train_data: np.ndarray,
    production_data: np.ndarray,
    feature_names: list[str],
    significance_level: float = 0.05,
) -> dict:
    """
    Run Kolmogorov-Smirnov test on each feature.
    Returns features that have drifted significantly.
    """
    drift_results = {}
    
    for i, feature_name in enumerate(feature_names):
        train_feature = train_data[:, i]
        prod_feature = production_data[:, i]
        
        ks_stat, p_value = stats.ks_2samp(train_feature, prod_feature)
        
        drift_results[feature_name] = {
            "ks_statistic": ks_stat,
            "p_value": p_value,
            "drifted": p_value < significance_level,
        }
    
    return drift_results

def test_production_data_not_drifted(candidate_model, train_data, recent_production_data, feature_names):
    drift = detect_feature_drift(
        train_data, 
        recent_production_data, 
        feature_names
    )
    
    drifted_features = [name for name, result in drift.items() if result["drifted"]]
    
    if drifted_features:
        pytest.warns(
            UserWarning,
            match="data drift",
            message=f"Warning: features have drifted from training distribution: {drifted_features}"
        )
        
        # Fail if too many features have drifted
        drift_fraction = len(drifted_features) / len(feature_names)
        assert drift_fraction < 0.3, (
            f"{drift_fraction:.0%} of features drifted — model may be unreliable.\n"
            f"Drifted: {drifted_features}"
        )

Latency and Memory Regression

Model performance isn't just accuracy — latency and memory matter for production:

import time
import tracemalloc
import numpy as np

def test_inference_latency_not_regressed(candidate_model, baseline):
    """Larger models are more accurate but may be too slow for production."""
    
    # Generate representative batch
    batch_size = 100
    X_batch = np.random.randn(batch_size, 20)  # 20 features
    
    # Warm up
    candidate_model.predict(X_batch)
    
    # Measure
    times = []
    for _ in range(10):
        start = time.perf_counter()
        candidate_model.predict(X_batch)
        times.append((time.perf_counter() - start) * 1000)
    
    p95_latency_ms = np.percentile(times, 95)
    baseline_latency_ms = baseline.get("inference_latency_p95_ms", float('inf'))
    
    assert p95_latency_ms <= baseline_latency_ms * 1.1, (
        f"Inference p95 latency increased from {baseline_latency_ms:.0f}ms to {p95_latency_ms:.0f}ms"
    )

def test_memory_footprint_acceptable(candidate_model, baseline):
    X_batch = np.random.randn(1000, 20)
    
    tracemalloc.start()
    candidate_model.predict(X_batch)
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    
    peak_mb = peak / 1024 / 1024
    baseline_mb = baseline.get("inference_peak_memory_mb", float('inf'))
    
    assert peak_mb <= baseline_mb * 1.2, (
        f"Peak memory {peak_mb:.1f}MB exceeds baseline {baseline_mb:.1f}MB by more than 20%"
    )

Automating Baseline Updates

When you intentionally improve a model and all regression tests pass, update the baseline:

def update_baseline(model, eval_data, output_path: str):
    """
    Run after a model improvement is validated and merged.
    Establishes new baseline for future regression testing.
    """
    X, y, metadata = eval_data
    predictions = model.predict(X)
    probabilities = model.predict_proba(X)
    
    from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
    
    # Compute segment metrics
    segments = {
        "new_users": metadata["user_age_days"] < 7,
        "enterprise_tier": metadata["tier"] == "enterprise",
    }
    
    segment_metrics = {}
    for name, mask in segments.items():
        if mask.sum() >= 50:
            segment_metrics[name] = {
                "accuracy": accuracy_score(y[mask], predictions[mask]),
                "sample_count": int(mask.sum()),
            }
    
    baseline = {
        "accuracy": accuracy_score(y, predictions),
        "f1_weighted": f1_score(y, predictions, average='weighted'),
        "auc_roc": roc_auc_score(y, probabilities[:, 1]),
        "high_confidence_error_rate": float(
            ((probabilities.max(axis=1) > 0.9) & (predictions != y)).mean()
        ),
        "segments": segment_metrics,
        "updated_at": datetime.utcnow().isoformat(),
        "eval_dataset_version": "v2",
        "sample_count": len(y),
    }
    
    with open(output_path, "w") as f:
        json.dump(baseline, f, indent=2)
    
    print(f"Baseline updated: accuracy={baseline['accuracy']:.4f}, F1={baseline['f1_weighted']:.4f}")

CI/CD Integration

Run regression tests on every model artifact:

# .github/workflows/model-regression.yml
name: ML Model Regression Tests

on:
  push:
    paths:
      - 'models/**'
      - 'training/**'

jobs:
  regression-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      
      - name: Install dependencies
        run: pip install -r requirements.txt
      
      - name: Download evaluation dataset
        run: aws s3 cp s3://your-bucket/eval-data/ data/eval/ --recursive
      
      - name: Run regression tests
        run: pytest tests/ml_regression/ -v --tb=short
      
      - name: Generate regression report
        if: always()
        run: python scripts/generate_regression_report.py
      
      - name: Upload regression report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: regression-report
          path: reports/regression_report.html

The Regression Testing Mindset

ML regression testing is fundamentally different from software regression testing in one key way: passing all tests doesn't mean your model is correct for every input — it means it's correct for your evaluation set. The evaluation set is a sample, and its quality determines the value of your tests.

Invest in your evaluation dataset:

  • Keep it current — add real-world edge cases you've discovered in production
  • Keep it balanced — ensure minority classes and critical subgroups are represented
  • Keep it honest — never train on it, and resist the temptation to curate it toward making your model look good
  • Version it — so you can compare apples to apples across model versions

A model that passes all your regression tests but fails your users is a model with an inadequate test suite, not a good model. Build evaluation datasets that represent the real world, and your regression tests become a genuine quality gate.

Read more