ML Model Regression Testing: Catch Model Degradation Before Production
ML models degrade in ways that traditional software doesn't. Code doesn't spontaneously get worse — but a model trained on March data can silently degrade by June as the distribution of real-world inputs shifts. A retrained model can score higher on aggregate metrics while regressing on the specific cases your users care about most.
ML regression testing is the practice of validating model behavior against known baselines before every deployment and monitoring for drift in production. This guide covers how to build that validation layer.
Why ML Models Regress
Unlike bugs in code, model degradation is often invisible and gradual:
- Data drift: Production inputs drift from the training distribution over time
- Concept drift: The underlying relationship between inputs and outputs changes (user behavior evolves, market conditions shift)
- Retraining regression: A new model trained on more data scores better overall but worse on important subpopulations
- Dependency drift: An upstream data pipeline changes its output format or semantics
- Infrastructure changes: Different hardware, library versions, or serialization produces slightly different floating-point results
Regression testing catches these issues before they reach production.
The Model Validation Baseline
The foundation of regression testing is a fixed evaluation dataset that you never retrain on:
# evaluation/baseline.py
import json
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
class ModelBaseline:
"""
Tracks performance metrics across model versions.
Load once, compare every time.
"""
def __init__(self, baseline_path: str):
with open(baseline_path) as f:
self.baseline = json.load(f)
def evaluate(self, model, X_eval, y_eval) -> dict:
predictions = model.predict(X_eval)
probabilities = model.predict_proba(X_eval)[:, 1] if hasattr(model, 'predict_proba') else None
metrics = {
"accuracy": accuracy_score(y_eval, predictions),
"f1_macro": f1_score(y_eval, predictions, average='macro'),
"f1_weighted": f1_score(y_eval, predictions, average='weighted'),
}
if probabilities is not None:
metrics["auc_roc"] = roc_auc_score(y_eval, probabilities)
return metrics
def compare_to_baseline(self, current_metrics: dict, max_regression_pct: float = 0.02) -> dict:
"""
Returns comparison with pass/fail for each metric.
max_regression_pct: maximum allowed drop relative to baseline (e.g., 0.02 = 2%)
"""
results = {}
for metric, current_value in current_metrics.items():
if metric not in self.baseline:
results[metric] = {"status": "no_baseline", "current": current_value}
continue
baseline_value = self.baseline[metric]
regression = (baseline_value - current_value) / baseline_value
results[metric] = {
"baseline": baseline_value,
"current": current_value,
"regression_pct": regression,
"passed": regression <= max_regression_pct,
}
return resultsWriting Regression Tests
import pytest
import json
import numpy as np
from your_ml import load_model, load_evaluation_dataset
@pytest.fixture(scope="session")
def eval_data():
return load_evaluation_dataset("data/eval/regression_eval_v2.parquet")
@pytest.fixture(scope="session")
def baseline():
with open("evaluation/baseline_metrics.json") as f:
return json.load(f)
@pytest.fixture(scope="session")
def candidate_model():
return load_model("models/candidate/model.pkl")
def test_accuracy_not_regressed(candidate_model, eval_data, baseline):
X, y = eval_data
predictions = candidate_model.predict(X)
accuracy = (predictions == y).mean()
baseline_accuracy = baseline["accuracy"]
regression = baseline_accuracy - accuracy
assert regression <= 0.02, (
f"Accuracy regressed by {regression:.2%}.\n"
f"Baseline: {baseline_accuracy:.4f}, Candidate: {accuracy:.4f}"
)
def test_f1_score_not_regressed(candidate_model, eval_data, baseline):
from sklearn.metrics import f1_score
X, y = eval_data
predictions = candidate_model.predict(X)
f1 = f1_score(y, predictions, average='weighted')
baseline_f1 = baseline["f1_weighted"]
regression = baseline_f1 - f1
assert regression <= 0.02, (
f"F1 regressed by {regression:.2%}.\n"
f"Baseline: {baseline_f1:.4f}, Candidate: {f1:.4f}"
)
def test_no_new_high_confidence_errors(candidate_model, eval_data, baseline):
"""
High-confidence wrong predictions are more damaging than low-confidence ones.
Test that we haven't introduced new high-confidence errors.
"""
X, y = eval_data
probabilities = candidate_model.predict_proba(X)
predictions = candidate_model.predict(X)
# Find high-confidence wrong predictions (>90% confidence but wrong)
max_proba = probabilities.max(axis=1)
wrong = predictions != y
high_confidence_errors = (max_proba > 0.9) & wrong
error_rate = high_confidence_errors.mean()
baseline_error_rate = baseline.get("high_confidence_error_rate", 0)
assert error_rate <= baseline_error_rate + 0.005, (
f"High-confidence error rate increased from {baseline_error_rate:.4f} to {error_rate:.4f}"
)Slice-Based Testing
Aggregate metrics can hide regressions in specific subgroups. Test slices of your evaluation data:
def test_performance_on_critical_segments(candidate_model, eval_data, baseline):
"""
Test model performance on specific data segments that matter most.
Aggregate accuracy can improve while critical segments regress.
"""
X, y, metadata = eval_data # metadata contains segment info
segments = {
"new_users": metadata["user_age_days"] < 7,
"enterprise_tier": metadata["tier"] == "enterprise",
"mobile_traffic": metadata["platform"] == "mobile",
"high_value_transactions": metadata["transaction_value"] > 1000,
}
for segment_name, mask in segments.items():
if mask.sum() < 50: # Skip segments with too few samples
continue
X_seg, y_seg = X[mask], y[mask]
predictions = candidate_model.predict(X_seg)
accuracy = (predictions == y_seg).mean()
baseline_accuracy = baseline["segments"].get(segment_name, {}).get("accuracy")
if baseline_accuracy is None:
continue # No baseline for this segment yet
regression = baseline_accuracy - accuracy
assert regression <= 0.03, (
f"Segment '{segment_name}' accuracy regressed by {regression:.2%}.\n"
f"Baseline: {baseline_accuracy:.4f}, Candidate: {accuracy:.4f}\n"
f"Segment size: {mask.sum()} samples"
)Behavioral Regression Tests
Beyond aggregate metrics, lock in specific behaviors that must not change:
BEHAVIORAL_TEST_CASES = [
# (input_features, expected_prediction, description)
({"age": 25, "income": 50000, "credit_score": 720}, 1, "approved_young_good_credit"),
({"age": 45, "income": 120000, "credit_score": 800}, 1, "approved_mid_excellent_credit"),
({"age": 30, "income": 25000, "credit_score": 580}, 0, "rejected_poor_credit"),
({"age": 65, "income": 80000, "credit_score": 750}, 1, "approved_senior_good_credit"),
]
@pytest.mark.parametrize("features, expected, description", BEHAVIORAL_TEST_CASES)
def test_behavioral_consistency(candidate_model, features, expected, description):
"""
These specific cases define expected model behavior.
If any of these flip, something fundamental has changed.
"""
import pandas as pd
X = pd.DataFrame([features])
prediction = candidate_model.predict(X)[0]
assert prediction == expected, (
f"Behavioral test '{description}' failed.\n"
f"Input: {features}\n"
f"Expected: {expected}, Got: {prediction}"
)Data Drift Detection
Before running a model on new data, check if the data has drifted from the training distribution:
from scipy import stats
import numpy as np
def detect_feature_drift(
train_data: np.ndarray,
production_data: np.ndarray,
feature_names: list[str],
significance_level: float = 0.05,
) -> dict:
"""
Run Kolmogorov-Smirnov test on each feature.
Returns features that have drifted significantly.
"""
drift_results = {}
for i, feature_name in enumerate(feature_names):
train_feature = train_data[:, i]
prod_feature = production_data[:, i]
ks_stat, p_value = stats.ks_2samp(train_feature, prod_feature)
drift_results[feature_name] = {
"ks_statistic": ks_stat,
"p_value": p_value,
"drifted": p_value < significance_level,
}
return drift_results
def test_production_data_not_drifted(candidate_model, train_data, recent_production_data, feature_names):
drift = detect_feature_drift(
train_data,
recent_production_data,
feature_names
)
drifted_features = [name for name, result in drift.items() if result["drifted"]]
if drifted_features:
pytest.warns(
UserWarning,
match="data drift",
message=f"Warning: features have drifted from training distribution: {drifted_features}"
)
# Fail if too many features have drifted
drift_fraction = len(drifted_features) / len(feature_names)
assert drift_fraction < 0.3, (
f"{drift_fraction:.0%} of features drifted — model may be unreliable.\n"
f"Drifted: {drifted_features}"
)Latency and Memory Regression
Model performance isn't just accuracy — latency and memory matter for production:
import time
import tracemalloc
import numpy as np
def test_inference_latency_not_regressed(candidate_model, baseline):
"""Larger models are more accurate but may be too slow for production."""
# Generate representative batch
batch_size = 100
X_batch = np.random.randn(batch_size, 20) # 20 features
# Warm up
candidate_model.predict(X_batch)
# Measure
times = []
for _ in range(10):
start = time.perf_counter()
candidate_model.predict(X_batch)
times.append((time.perf_counter() - start) * 1000)
p95_latency_ms = np.percentile(times, 95)
baseline_latency_ms = baseline.get("inference_latency_p95_ms", float('inf'))
assert p95_latency_ms <= baseline_latency_ms * 1.1, (
f"Inference p95 latency increased from {baseline_latency_ms:.0f}ms to {p95_latency_ms:.0f}ms"
)
def test_memory_footprint_acceptable(candidate_model, baseline):
X_batch = np.random.randn(1000, 20)
tracemalloc.start()
candidate_model.predict(X_batch)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
peak_mb = peak / 1024 / 1024
baseline_mb = baseline.get("inference_peak_memory_mb", float('inf'))
assert peak_mb <= baseline_mb * 1.2, (
f"Peak memory {peak_mb:.1f}MB exceeds baseline {baseline_mb:.1f}MB by more than 20%"
)Automating Baseline Updates
When you intentionally improve a model and all regression tests pass, update the baseline:
def update_baseline(model, eval_data, output_path: str):
"""
Run after a model improvement is validated and merged.
Establishes new baseline for future regression testing.
"""
X, y, metadata = eval_data
predictions = model.predict(X)
probabilities = model.predict_proba(X)
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
# Compute segment metrics
segments = {
"new_users": metadata["user_age_days"] < 7,
"enterprise_tier": metadata["tier"] == "enterprise",
}
segment_metrics = {}
for name, mask in segments.items():
if mask.sum() >= 50:
segment_metrics[name] = {
"accuracy": accuracy_score(y[mask], predictions[mask]),
"sample_count": int(mask.sum()),
}
baseline = {
"accuracy": accuracy_score(y, predictions),
"f1_weighted": f1_score(y, predictions, average='weighted'),
"auc_roc": roc_auc_score(y, probabilities[:, 1]),
"high_confidence_error_rate": float(
((probabilities.max(axis=1) > 0.9) & (predictions != y)).mean()
),
"segments": segment_metrics,
"updated_at": datetime.utcnow().isoformat(),
"eval_dataset_version": "v2",
"sample_count": len(y),
}
with open(output_path, "w") as f:
json.dump(baseline, f, indent=2)
print(f"Baseline updated: accuracy={baseline['accuracy']:.4f}, F1={baseline['f1_weighted']:.4f}")CI/CD Integration
Run regression tests on every model artifact:
# .github/workflows/model-regression.yml
name: ML Model Regression Tests
on:
push:
paths:
- 'models/**'
- 'training/**'
jobs:
regression-test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.11'
- name: Install dependencies
run: pip install -r requirements.txt
- name: Download evaluation dataset
run: aws s3 cp s3://your-bucket/eval-data/ data/eval/ --recursive
- name: Run regression tests
run: pytest tests/ml_regression/ -v --tb=short
- name: Generate regression report
if: always()
run: python scripts/generate_regression_report.py
- name: Upload regression report
if: always()
uses: actions/upload-artifact@v4
with:
name: regression-report
path: reports/regression_report.htmlThe Regression Testing Mindset
ML regression testing is fundamentally different from software regression testing in one key way: passing all tests doesn't mean your model is correct for every input — it means it's correct for your evaluation set. The evaluation set is a sample, and its quality determines the value of your tests.
Invest in your evaluation dataset:
- Keep it current — add real-world edge cases you've discovered in production
- Keep it balanced — ensure minority classes and critical subgroups are represented
- Keep it honest — never train on it, and resist the temptation to curate it toward making your model look good
- Version it — so you can compare apples to apples across model versions
A model that passes all your regression tests but fails your users is a model with an inadequate test suite, not a good model. Build evaluation datasets that represent the real world, and your regression tests become a genuine quality gate.