Detecting and Testing for ML Model Drift: A Production Guide
ML models degrade silently as the world changes. Data drift tests verify that incoming features match the training distribution. Concept drift tests detect when the relationship between features and labels has changed. This guide covers both with Evidently, Alibi Detect, and CI integration.
The Drift Problem
You deploy a credit risk model trained on 2023 data. In 2024, a recession changes customer behavior — income distributions shift, spending patterns change, default rates move. Your model's predictions are now based on a reality that no longer exists.
This is concept drift: the statistical relationship between input features and the target variable has changed. Accuracy degrades, but without monitoring you notice only when business outcomes suffer: chargebacks increase and predicted default rates no longer match observed ones.
There are two types:
- Data drift (covariate shift) — input feature distributions change, even if the label relationship is stable
- Concept drift — the mapping from features to labels changes (yesterday's "risky customer" profile is today's "reliable customer")
The two types call for different detection methods.
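To make the distinction concrete, here is a minimal synthetic sketch. It is not taken from a real credit model: it assumes a single numeric feature and a fixed rule, predict, standing in for an already-trained model (a hypothetical name), and simulates each kind of drift on its own.

```python
import numpy as np

rng = np.random.default_rng(42)


def predict(x: np.ndarray) -> np.ndarray:
    """Hypothetical 'trained model': flags a customer as risky when x > 0."""
    return x > 0


def accuracy(x: np.ndarray, y: np.ndarray) -> float:
    """Share of rows where the fixed model agrees with the true label."""
    return float((predict(x) == y).mean())


# Training-time world: x ~ N(0, 1) and the true label is "x > 0".
x_ref = rng.normal(0.0, 1.0, size=10_000)
y_ref = x_ref > 0

# Data drift: the inputs move (mean 0 -> 1) but the labelling rule is unchanged.
x_data_drift = rng.normal(1.0, 1.0, size=10_000)
y_data_drift = x_data_drift > 0

# Concept drift: the inputs look the same, but the true rule is now "x > 0.5".
x_concept_drift = rng.normal(0.0, 1.0, size=10_000)
y_concept_drift = x_concept_drift > 0.5

acc_ref = accuracy(x_ref, y_ref)
acc_data_drift = accuracy(x_data_drift, y_data_drift)
acc_concept_drift = accuracy(x_concept_drift, y_concept_drift)

print(f"reference:     mean(x)={x_ref.mean():+.2f}  accuracy={acc_ref:.2f}")
print(f"data drift:    mean(x)={x_data_drift.mean():+.2f}  accuracy={acc_data_drift:.2f}")
print(f"concept drift: mean(x)={x_concept_drift.mean():+.2f}  accuracy={acc_concept_drift:.2f}")
```

In this toy setup the input statistics reveal the data drift but not the concept drift, while accuracy reveals the concept drift but needs labels. The perfect accuracy under data drift is an artifact of the toy rule; a real model can also lose accuracy when inputs move into regions that were sparse in training.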
Data Drift Detection with Evidently
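Evidently is a widely used open-source library for ML monitoring and drift detection. The examples in this section use its Report and TestSuite classes as imported from evidently.report and evidently.test_suite; Evidently's API has changed between releases, so pin a version that provides these modules.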
Installation
pip install evidently pandas numpy scikit-learn
Basic Data Drift Test
import pandas as pd
import numpy as np
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
from evidently.metrics import DataDriftTable
# Reference data: your training/baseline distribution
reference_data = pd.read_parquet('data/reference_2023.parquet')
# Current data: recent production traffic
current_data = pd.read_parquet('data/current_week.parquet')
# Build a drift report
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_data, current_data=current_data)
report.save_html('drift_report.html')
# Access results programmatically
result = report.as_dict()
Writing Drift Tests
import pandas as pd
import pytest
from evidently.test_suite import TestSuite
from evidently.tests import (
    TestShareOfDriftedColumns,
    TestColumnDrift,
    TestNumberOfDriftedColumns,
)


@pytest.fixture
def reference_data():
    return pd.read_parquet('data/reference_2023.parquet')


@pytest.fixture
def current_data():
    return pd.read_parquet('data/current_production.parquet')


def test_less_than_30_percent_columns_drifted(reference_data, current_data):
    """Fewer than 30% of feature columns should show drift."""
    suite = TestSuite(tests=[
        TestShareOfDriftedColumns(lt=0.3)  # Less than 30% drifted
    ])
    suite.run(reference_data=reference_data, current_data=current_data)
    assert suite.as_dict()['summary']['all_passed'], \
        "Too many feature columns are drifting; model may be unreliable"


def test_critical_features_not_drifted(reference_data, current_data):
    """Core business features must not drift: these most affect model predictions."""
    critical_features = ['income', 'credit_score', 'debt_to_income_ratio']
    suite = TestSuite(tests=[
        TestColumnDrift(column_name=feature)
        for feature in critical_features
    ])
    suite.run(reference_data=reference_data, current_data=current_data)
    results = suite.as_dict()
    for test_result in results['tests']:
        feature = test_result['parameters']['column_name']
        status = test_result['status']
        if status == 'FAIL':
            # Key names depend on the Evidently version; fall back to 'N/A' if absent
            drift_score = test_result['parameters'].get('score', 'N/A')
            pytest.fail(
                f"Critical feature '{feature}' is drifting (score: {drift_score}). "
                f"Consider retraining."
            )


def test_no_new_null_columns(reference_data, current_data):
    """Columns with low null rates in reference must not have high null rates in production."""
    threshold = 0.05  # A column counts as clean if under 5% of reference values are null
    for col in reference_data.columns:
        ref_null_rate = reference_data[col].isna().mean()
        curr_null_rate = current_data[col].isna().mean()
        if ref_null_rate < threshold:  # This column was clean in training
            # Clean columns may reach at most a 15% null rate in production
            assert curr_null_rate < 0.15, \
                f"Column '{col}' null rate jumped from {ref_null_rate:.2%} to {curr_null_rate:.2%}"
Statistical Drift Tests
For lower-level control, use statistical tests directly:
from scipy import stats
import numpy as np


def ks_test_drift(reference: np.ndarray, current: np.ndarray,
                  threshold: float = 0.05) -> tuple[bool, float]:
    """
    Kolmogorov-Smirnov test for distributional drift.
    Returns (drifted: bool, p_value: float).
    """
    ks_stat, p_value = stats.ks_2samp(reference, current)
    return p_value < threshold, p_value


def psi_score(reference: np.ndarray, current: np.ndarray,
              bins: int = 10) -> float:
    """
    Population Stability Index (PSI).
    PSI < 0.1: No drift. 0.1-0.25: Moderate drift. > 0.25: Significant drift.
    """
    # Bin edges come from the reference sample; current values outside
    # that range are not counted by np.histogram.
    ref_counts, bin_edges = np.histogram(reference, bins=bins)
    curr_counts, _ = np.histogram(current, bins=bin_edges)
    # PSI is defined on the share of samples per bin, not on densities
    ref_pct = ref_counts / max(ref_counts.sum(), 1)
    curr_pct = curr_counts / max(curr_counts.sum(), 1)
    # Avoid log(0) and division by zero for empty bins
    ref_pct = np.clip(ref_pct, 1e-6, None)
    curr_pct = np.clip(curr_pct, 1e-6, None)
    psi = np.sum((curr_pct - ref_pct) * np.log(curr_pct / ref_pct))
    return float(psi)


def test_income_feature_psi_below_threshold(reference_data, current_data):
    """PSI for income must stay below 0.1 (the no-drift band)."""
    psi = psi_score(
        reference_data['income'].dropna().values,
        current_data['income'].dropna().values
    )
    assert psi < 0.1, \
        f"Income PSI={psi:.3f} exceeds the no-drift threshold of 0.1. " \
        f"Review recent data collection or consider model retraining."


def test_ks_test_no_drift_in_age_feature(reference_data, current_data):
    """KS test must not find significant distributional difference in age."""
    drifted, p_value = ks_test_drift(
        reference_data['age'].dropna().values,
        current_data['age'].dropna().values,
        threshold=0.01  # Lower significance level: flags only strong evidence of drift
    )
    assert not drifted, \
        f"KS test detects drift in 'age' feature (p={p_value:.4f}). " \
        f"Check data pipeline for age calculation changes."
Concept Drift Detection with Alibi Detect
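Confirming concept drift requires the actual labels, which often arrive late or not at all. Alibi Detect's drift detectors can instead be run on model outputs. A shift in the distribution of predictions does not prove concept drift, but it is a useful label-free warning sign: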
pip install alibi-detect

from alibi_detect.cd import KSDrift
import numpy as np


def test_prediction_distribution_stable():
    """
    Test that the distribution of model predictions (outputs) hasn't shifted.
    A shift in prediction distribution often signals concept drift even without labels.
    """
    # Load baseline predictions from a reference period
    baseline_predictions = np.load('data/baseline_predictions.npy')
    current_predictions = np.load('data/current_predictions.npy')
    # KS Drift detector on prediction scores
    detector = KSDrift(baseline_predictions, p_val=0.01)
    result = detector.predict(current_predictions)
    # KSDrift returns one p-value per feature as an array; report the smallest
    min_p_val = float(np.min(result['data']['p_val']))
    assert not result['data']['is_drift'], \
        f"Prediction distribution has drifted! " \
        f"p-value: {min_p_val:.4f}. " \
        f"Consider retraining or investigating data quality."


def test_model_confidence_distribution_stable():
    """
    Low-confidence predictions increasing in volume signals the model
    is encountering unfamiliar inputs.
    """
    baseline_confidence = np.load('data/baseline_confidence.npy')
    current_confidence = np.load('data/current_confidence.npy')
    # High fraction of uncertain predictions indicates out-of-distribution data
    current_low_confidence_rate = (current_confidence < 0.6).mean()
    baseline_low_confidence_rate = (baseline_confidence < 0.6).mean()
    allowed_increase = 0.15  # 15 percentage point increase allowed
    actual_increase = current_low_confidence_rate - baseline_low_confidence_rate
    assert actual_increase < allowed_increase, \
        f"Low-confidence prediction rate increased by {actual_increase:.1%} " \
        f"(baseline: {baseline_low_confidence_rate:.1%}, current: {current_low_confidence_rate:.1%}). " \
        f"Model may be seeing out-of-distribution inputs."
Embedding Drift for NLP Models
For text models, compare embedding distributions:
import json


def load_texts(path):
    """Illustrative helper (not part of any library): assumes a JSONL file with a 'text' field per line."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line)['text'] for line in f if line.strip()]


def test_text_embedding_drift():
    """
    For NLP models, drift in input embedding space indicates
    the model is processing semantically different text than during training.
    """
    from sentence_transformers import SentenceTransformer
    from alibi_detect.cd import MMDDrift
    encoder = SentenceTransformer('all-MiniLM-L6-v2')
    # Reference: texts from training period
    reference_texts = load_texts('data/reference_texts.jsonl')
    current_texts = load_texts('data/current_texts.jsonl')
    # Embed both (first 500 texts of each set to keep the test fast)
    reference_embeddings = encoder.encode(reference_texts[:500])
    current_embeddings = encoder.encode(current_texts[:500])
    # MMD (Maximum Mean Discrepancy) compares whole multivariate distributions,
    # which suits high-dimensional embeddings better than per-feature tests.
    # The PyTorch backend is chosen because sentence-transformers already depends on torch.
    detector = MMDDrift(reference_embeddings, backend='pytorch',
                        p_val=0.01, n_permutations=100)
    result = detector.predict(current_embeddings)
    assert not result['data']['is_drift'], \
        f"Text embedding distribution has drifted (p={result['data']['p_val']:.4f}). " \
        f"Check if user query patterns have changed significantly."
Scheduled Drift Testing in CI
name: ML Drift Monitoring
on:
  schedule:
    - cron: '0 6 * * *'  # Daily at 06:00 UTC
jobs:
  drift-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        # pytest runs the tests; pyarrow lets pandas read the parquet files
        run: pip install pytest evidently alibi-detect pandas pyarrow numpy scipy
      - name: Download recent production data
        env:
          S3_BUCKET: ${{ secrets.ML_DATA_BUCKET }}
        run: python scripts/fetch_production_data.py --days 7
      - name: Run drift tests
        run: pytest tests/drift/ -v --tb=short --junitxml=drift-results.xml
      - name: Upload drift report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: drift-report
          path: drift_report.html
      - name: Alert on drift
        if: failure()
        uses: slackapi/slack-github-action@v1
        with:
          payload: '{"text": "⚠️ ML Drift Alert: Model drift detected. Review drift-report artifact."}'
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_ML_ALERTS }}
Integrating with HelpMeTest
For continuous drift monitoring beyond scheduled CI runs:
# Run drift check as a health check
helpmetest health ml-drift-monitor 1h --command "python scripts/drift_check.py --fail-on-drift"
This runs drift detection hourly and alerts if drift is detected, without needing a CI trigger.
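The scripts/drift_check.py file invoked by that command is not shown in this guide. The sketch below is one way it could look, under stated assumptions: it compares two saved prediction arrays with PSI, and the --reference, --current, and --psi-threshold flags are hypothetical. Only --fail-on-drift appears in the command above.

```python
import argparse

import numpy as np


def psi_score(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index computed on per-bin sample shares."""
    ref_counts, edges = np.histogram(reference, bins=bins)
    cur_counts, _ = np.histogram(current, bins=edges)
    ref_pct = np.clip(ref_counts / max(ref_counts.sum(), 1), 1e-6, None)
    cur_pct = np.clip(cur_counts / max(cur_counts.sum(), 1), 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))


def main(argv=None) -> int:
    """Return the process exit code: 1 only if drift is found and --fail-on-drift is set."""
    parser = argparse.ArgumentParser(description="Check prediction drift with PSI.")
    # Hypothetical flags; the default paths reuse the file names from the tests above
    parser.add_argument("--reference", default="data/baseline_predictions.npy")
    parser.add_argument("--current", default="data/current_predictions.npy")
    parser.add_argument("--psi-threshold", type=float, default=0.25)
    parser.add_argument("--fail-on-drift", action="store_true")
    args = parser.parse_args(argv)

    score = psi_score(np.load(args.reference), np.load(args.current))
    drifted = score > args.psi_threshold
    print(f"PSI={score:.3f} threshold={args.psi_threshold} drifted={drifted}")
    return 1 if (drifted and args.fail_on_drift) else 0


# In the real script, finish with:
#     if __name__ == "__main__":
#         raise SystemExit(main())
```

Returning the exit code from main, rather than exiting inside it, keeps the check easy to call from tests. A health check that runs a command typically keys off the exit status, which is the assumption behind the non-zero return here.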
Summary
ML drift monitoring requires two types of tests:
- Data drift — statistical tests (KS, PSI) and Evidently reports verify feature distributions match training data
- Concept drift — prediction distribution and confidence monitoring detect silent accuracy degradation
Key practices:
- Run drift checks daily or weekly, not just on deployment
- Monitor critical features more strictly than less important ones
- Use PSI < 0.1 as a no-drift threshold for tabular features
- For NLP models, compare embedding distributions rather than raw text statistics
- Alert immediately on critical feature drift, but investigate before emergency retraining
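As a rough illustration of the threshold and alerting practices above, the following sketch maps per-feature PSI values to an action. The PSI bands are the ones quoted earlier in this guide (below 0.1, 0.1 to 0.25, above 0.25). The triage policy and the names classify_psi and triage are assumptions made for this example, not a standard.

```python
# Critical features as listed in the Evidently tests earlier in this guide
CRITICAL_FEATURES = {"income", "credit_score", "debt_to_income_ratio"}


def classify_psi(psi: float) -> str:
    """Map a PSI value to the drift band used in this guide."""
    if psi < 0.1:
        return "none"
    if psi <= 0.25:
        return "moderate"
    return "significant"


def triage(psi_by_feature: dict[str, float]) -> dict[str, str]:
    """
    Assumed policy: critical features alert on any drift;
    other features alert only on significant drift and are
    otherwise queued for investigation.
    """
    actions = {}
    for feature, psi in psi_by_feature.items():
        band = classify_psi(psi)
        if band == "none":
            actions[feature] = "ok"
        elif feature in CRITICAL_FEATURES or band == "significant":
            actions[feature] = "alert"
        else:
            actions[feature] = "investigate"
    return actions


# Example PSI values (made up for illustration)
example = triage({"income": 0.12, "age": 0.05, "tenure_months": 0.12, "num_accounts": 0.4})
print(example)
```

An alert here means a human looks at the feature promptly; it does not by itself mean the model should be retrained.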
Drift detection is a maintenance task, not a one-time setup — the reference distributions need to be refreshed periodically as the world evolves.
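One way to refresh the reference periodically is a rolling window that ends shortly before the data being checked. The sketch below is an assumption about how that could be organised, not a prescribed method: the function names, the 90-day window, and the event_date field are all hypothetical, and the 7-day gap mirrors the --days 7 fetch in the CI workflow.

```python
from datetime import date, timedelta


def reference_window(as_of: date, window_days: int = 90,
                     gap_days: int = 7) -> tuple[date, date]:
    """
    Return [start, end) for the reference period: window_days long and
    ending gap_days before as_of, so it never overlaps the current data.
    """
    end = as_of - timedelta(days=gap_days)
    start = end - timedelta(days=window_days)
    return start, end


def select_rows(rows: list[dict], start: date, end: date) -> list[dict]:
    """Keep rows whose (hypothetical) 'event_date' falls inside [start, end)."""
    return [row for row in rows if start <= row["event_date"] < end]


start, end = reference_window(date(2024, 6, 30))
rows = [
    {"event_date": date(2024, 3, 1), "income": 41_000},   # too old
    {"event_date": date(2024, 5, 15), "income": 52_000},  # inside the window
    {"event_date": date(2024, 6, 28), "income": 48_000},  # belongs to current data
]
reference_rows = select_rows(rows, start, end)
print(start, end, len(reference_rows))
```

A rolling reference adapts to gradual change, but it can also absorb slow drift without ever flagging it, so it is worth keeping a fixed long-term baseline to compare against as well.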