Detecting and Testing for ML Model Drift: A Production Guide

ML models degrade silently as the world changes. Data drift tests verify that incoming features still match the training distribution. Concept drift tests detect when the relationship between features and labels has changed. This guide covers both, using Evidently, Alibi Detect, and CI integration.


The Drift Problem

You deploy a credit risk model trained on 2023 data. In 2024, a recession changes customer behavior — income distributions shift, spending patterns change, default rates move. Your model's predictions are now based on a reality that no longer exists.

The most damaging part of this is concept drift: the statistical relationship between input features and the target variable has changed. Your model's accuracy degrades, but without monitoring you'll only notice when business outcomes suffer: chargebacks increase and predicted default rates stop matching observed ones.

Drift comes in two types:

  • Data drift (covariate shift) — input feature distributions change, even if the label relationship is stable
  • Concept drift — the mapping from features to labels changes (yesterday's "risky customer" profile is today's "reliable customer")

The two types require different detection methods.
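
To make the distinction concrete, here is a minimal synthetic sketch. The income figures, the 30k and 35k cut-offs, and the `default_rate` helper are all made up for illustration and are not taken from a real credit model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000


def default_rate(income: np.ndarray, risk_threshold: float) -> float:
    """Toy labelling rule: a customer defaults when income falls below the threshold."""
    return float((income < risk_threshold).mean())


# Reference period: incomes centred on 50k, customers below 30k default
ref_income = rng.normal(50_000, 10_000, n)
ref_rate = default_rate(ref_income, 30_000)

# Data drift: the income distribution moves, the labelling rule is unchanged
drifted_income = rng.normal(42_000, 10_000, n)
data_drift_rate = default_rate(drifted_income, 30_000)

# Concept drift: incomes look the same, but the income level that implies default rises
same_income = rng.normal(50_000, 10_000, n)
concept_drift_rate = default_rate(same_income, 35_000)

print(f"reference:     mean income={ref_income.mean():,.0f}  default rate={ref_rate:.3f}")
print(f"data drift:    mean income={drifted_income.mean():,.0f}  default rate={data_drift_rate:.3f}")
print(f"concept drift: mean income={same_income.mean():,.0f}  default rate={concept_drift_rate:.3f}")
```

In the concept drift case the input distribution gives no warning at all, which is why feature-level checks alone are not enough.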


Data Drift Detection with Evidently

Evidently is a widely used open-source library for ML monitoring and drift detection.

Installation

pip install evidently pandas numpy scipy scikit-learn pyarrow pytest

Basic Data Drift Test

import pandas as pd
import numpy as np
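# These imports follow the Report/TestSuite API of Evidently 0.4.x;
# newer releases may expose a different API, so pin the version you test against.
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset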

# Reference data: your training/baseline distribution
reference_data = pd.read_parquet('data/reference_2023.parquet')

# Current data: recent production traffic
current_data = pd.read_parquet('data/current_week.parquet')

# Build a drift report
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_data, current_data=current_data)
report.save_html('drift_report.html')

# Access results programmatically
result = report.as_dict()
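
The dictionary can then be reduced to the few numbers you want to log or alert on. The sketch below assumes the layout produced by Evidently 0.4.x, where the preset emits an entry named `DatasetDriftMetric` with a `result` mapping; check the keys against your installed version. The `summarize_drift` helper and the `example` dictionary are hypothetical stand-ins, not real Evidently output.

```python
def summarize_drift(result: dict) -> dict:
    """Pull dataset-level drift figures out of a report dictionary (assumed layout)."""
    for entry in result.get("metrics", []):
        if entry.get("metric") == "DatasetDriftMetric":
            res = entry.get("result", {})
            return {
                "dataset_drift": res.get("dataset_drift"),
                "drifted_columns": res.get("number_of_drifted_columns"),
                "share_drifted": res.get("share_of_drifted_columns"),
            }
    # Nothing matched: the layout differs from the one assumed here
    return {}


# Hand-written stand-in for report.as_dict()
example = {
    "metrics": [
        {
            "metric": "DatasetDriftMetric",
            "result": {
                "dataset_drift": False,
                "number_of_drifted_columns": 3,
                "share_of_drifted_columns": 0.25,
            },
        }
    ]
}

summary = summarize_drift(example)
print(summary)
```

Returning an empty dictionary instead of raising keeps a layout change from being mistaken for a drift alert; the caller can treat an empty summary as its own failure mode.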

Writing Drift Tests

import pandas as pd
import pytest
from evidently.test_suite import TestSuite
from evidently.tests import (
    TestShareOfDriftedColumns,
    TestColumnDrift,
    TestNumberOfDriftedColumns
)


@pytest.fixture
def reference_data():
    return pd.read_parquet('data/reference_2023.parquet')


@pytest.fixture
def current_data():
    return pd.read_parquet('data/current_production.parquet')


def test_less_than_30_percent_columns_drifted(reference_data, current_data):
    """At most 30% of feature columns should show drift."""
    suite = TestSuite(tests=[
        TestShareOfDriftedColumns(lt=0.3)  # Less than 30% drifted
    ])
    suite.run(reference_data=reference_data, current_data=current_data)
    
    assert suite.as_dict()['summary']['all_passed'], \
        "Too many feature columns are drifting — model may be unreliable"


def test_critical_features_not_drifted(reference_data, current_data):
    """Core business features must not drift — these most affect model predictions."""
    critical_features = ['income', 'credit_score', 'debt_to_income_ratio']
    
    suite = TestSuite(tests=[
        TestColumnDrift(column_name=feature) 
        for feature in critical_features
    ])
    suite.run(reference_data=reference_data, current_data=current_data)
    
    results = suite.as_dict()
    
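    for test_result in results['tests']:
        # Parameter keys differ between Evidently versions, so read them defensively
        params = test_result.get('parameters', {})
        feature = params.get('column_name', test_result.get('name', 'unknown'))
        status = test_result['status']
        if status == 'FAIL':
            drift_score = params.get('score', params.get('drift_score', 'N/A'))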
            pytest.fail(
                f"Critical feature '{feature}' is drifting (score: {drift_score}). "
                f"Consider retraining."
            )


def test_no_new_null_columns(reference_data, current_data):
    """Columns with low null rates in reference must not have high null rates in production."""
    clean_threshold = 0.05  # Columns under 5% nulls in reference count as clean
    max_current_null_rate = 0.15  # Clean columns must stay under 15% nulls in production
    
    for col in reference_data.columns:
        ref_null_rate = reference_data[col].isna().mean()
        curr_null_rate = current_data[col].isna().mean()
        
        if ref_null_rate < clean_threshold:  # This column was clean in training
            assert curr_null_rate < max_current_null_rate, \
                f"Column '{col}' null rate jumped from {ref_null_rate:.2%} to {curr_null_rate:.2%}"

Statistical Drift Tests

For lower-level control, use statistical tests directly:

from scipy import stats
import numpy as np


def ks_test_drift(reference: np.ndarray, current: np.ndarray, 
                   threshold: float = 0.05) -> tuple[bool, float]:
    """
    Kolmogorov-Smirnov test for distributional drift.
    Returns (drifted: bool, p_value: float).
    """
    ks_stat, p_value = stats.ks_2samp(reference, current)
    return p_value < threshold, p_value


def psi_score(reference: np.ndarray, current: np.ndarray, 
               bins: int = 10) -> float:
    """
    Population Stability Index (PSI). 
    PSI < 0.1: No drift. 0.1-0.25: Moderate drift. > 0.25: Significant drift.
    """
    # Bin edges come from the reference sample; current values outside that range are not counted
    ref_counts, bin_edges = np.histogram(reference, bins=bins)
    curr_counts, _ = np.histogram(current, bins=bin_edges)
    
    # PSI is defined on the share of observations per bin, not on densities
    ref_share = ref_counts / ref_counts.sum()
    curr_share = curr_counts / max(curr_counts.sum(), 1)
    
    # Floor empty bins to avoid log(0) and division by zero
    eps = 1e-6
    ref_share = np.clip(ref_share, eps, None)
    curr_share = np.clip(curr_share, eps, None)
    
    psi = np.sum((curr_share - ref_share) * np.log(curr_share / ref_share))
    return float(psi)


def test_income_feature_psi_below_threshold(reference_data, current_data):
    """PSI for income must stay below 0.1 (the no-drift band)."""
    psi = psi_score(
        reference_data['income'].dropna().values,
        current_data['income'].dropna().values
    )
    
    assert psi < 0.1, \
        f"Income PSI={psi:.3f} is above the no-drift threshold of 0.1. " \
        f"Review recent data collection or consider model retraining."


def test_ks_test_no_drift_in_age_feature(reference_data, current_data):
    """KS test must not find significant distributional difference in age."""
    drifted, p_value = ks_test_drift(
        reference_data['age'].dropna().values,
        current_data['age'].dropna().values,
        threshold=0.01  # Stricter threshold for this feature
    )
    
    assert not drifted, \
        f"KS test detects drift in 'age' feature (p={p_value:.4f}). " \
        f"Check data pipeline for age calculation changes."
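
One caveat when choosing between the two tests above: with large production samples, a KS test will report a very small p-value for shifts too small to matter, while PSI measures the size of the shift itself. The sketch below illustrates this on synthetic data. The sample size, the 0.05 shift, and the compact `psi_from_shares` helper are assumptions chosen for the demonstration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 200_000

# Two standard normal samples; the current one is shifted by 5% of a standard deviation
reference = rng.normal(loc=0.0, scale=1.0, size=n)
current = rng.normal(loc=0.05, scale=1.0, size=n)


def psi_from_shares(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """PSI computed on per-bin shares, with bin edges taken from the reference sample."""
    ref_counts, edges = np.histogram(reference, bins=bins)
    curr_counts, _ = np.histogram(current, bins=edges)
    ref_share = np.clip(ref_counts / ref_counts.sum(), 1e-6, None)
    curr_share = np.clip(curr_counts / curr_counts.sum(), 1e-6, None)
    return float(np.sum((curr_share - ref_share) * np.log(curr_share / ref_share)))


ks_stat, p_value = stats.ks_2samp(reference, current)
psi = psi_from_shares(reference, current)

print(f"KS statistic={ks_stat:.4f}, p-value={p_value:.2e}")
print(f"PSI={psi:.4f}")
```

On high-volume features it can therefore make sense to gate on an effect-size measure such as PSI or the KS statistic, and to keep p-value thresholds for smaller samples.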

Concept Drift Detection with Alibi Detect

Confirming concept drift requires the actual labels, which often arrive late or not at all in production. Until labels are available, a shift in the model's own outputs is a useful proxy signal, and Alibi Detect's drift detectors can be run on those outputs:

pip install alibi-detect

from alibi_detect.cd import KSDrift
import numpy as np


def test_prediction_distribution_stable():
    """
    Test that the distribution of model predictions (outputs) hasn't shifted.
    A shift in prediction distribution often signals concept drift even without labels.
    """
    # Load baseline predictions from a reference period
    baseline_predictions = np.load('data/baseline_predictions.npy')
    current_predictions = np.load('data/current_predictions.npy')
    
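    # KS drift detector on prediction scores; inputs are shaped (n_samples, n_features)
    x_ref = baseline_predictions.reshape(len(baseline_predictions), -1)
    x_cur = current_predictions.reshape(len(current_predictions), -1)
    detector = KSDrift(x_ref, p_val=0.01)
    result = detector.predict(x_cur)
    
    # KSDrift returns one p-value per feature, so report the smallest one
    p_value = float(np.min(result['data']['p_val']))
    
    assert not result['data']['is_drift'], \
        f"Prediction distribution has drifted! " \
        f"p-value: {p_value:.4f}. " \
        f"Consider retraining or investigating data quality."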


def test_model_confidence_distribution_stable():
    """
    Low-confidence predictions increasing in volume signals the model 
    is encountering unfamiliar inputs.
    """
    baseline_confidence = np.load('data/baseline_confidence.npy')
    current_confidence = np.load('data/current_confidence.npy')
    
    # High fraction of uncertain predictions indicates out-of-distribution data
    current_low_confidence_rate = (current_confidence < 0.6).mean()
    baseline_low_confidence_rate = (baseline_confidence < 0.6).mean()
    
    allowed_increase = 0.15  # 15 percentage point increase allowed
    actual_increase = current_low_confidence_rate - baseline_low_confidence_rate
    
    assert actual_increase < allowed_increase, \
        f"Low-confidence prediction rate increased by {actual_increase:.1%} " \
        f"(baseline: {baseline_low_confidence_rate:.1%}, current: {current_low_confidence_rate:.1%}). " \
        f"Model may be seeing out-of-distribution inputs."
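
The confidence arrays loaded above have to be produced somewhere. One common definition, assumed here, is the probability the model assigns to its predicted class. The `confidence_scores` helper and the sample probabilities are illustrative only.

```python
import numpy as np


def confidence_scores(probabilities: np.ndarray) -> np.ndarray:
    """Confidence per prediction: the probability assigned to the predicted class."""
    probabilities = np.asarray(probabilities, dtype=float)
    if probabilities.ndim == 1:
        # Binary case with positive-class scores: take the larger of p and 1 - p
        return np.maximum(probabilities, 1.0 - probabilities)
    # Multi-class case: one row per sample, one column per class
    return probabilities.max(axis=1)


# Made-up class probabilities for three predictions
probs = np.array([[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]])
confidence = confidence_scores(probs)
low_confidence_rate = float((confidence < 0.6).mean())

print(confidence, low_confidence_rate)
```

The resulting array can be written with `np.save` for the reference period and again for each monitoring window, so the test compares like with like.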

Embedding Drift for NLP Models

For text models, compare embedding distributions:

def test_text_embedding_drift():
    """
    For NLP models, drift in input embedding space indicates
    the model is processing semantically different text than during training.
    """
    from sentence_transformers import SentenceTransformer
    from alibi_detect.cd import MMDDrift
    
    encoder = SentenceTransformer('all-MiniLM-L6-v2')
    
    # Reference: texts from training period
    reference_texts = load_texts('data/reference_texts.jsonl')
    current_texts = load_texts('data/current_texts.jsonl')
    
    # Embed both
    reference_embeddings = encoder.encode(reference_texts[:500])
    current_embeddings = encoder.encode(current_texts[:500])
    
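    # MMD (Maximum Mean Discrepancy) suits high-dimensional data better than per-feature tests.
    # MMDDrift needs a TensorFlow or PyTorch backend installed (chosen via the backend argument).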
    detector = MMDDrift(reference_embeddings, p_val=0.01, n_permutations=100)
    result = detector.predict(current_embeddings)
    
    assert not result['data']['is_drift'], \
        f"Text embedding distribution has drifted (p={result['data']['p_val']:.4f}). " \
        f"Check if user query patterns have changed significantly."
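
The test above calls `load_texts`, which is not defined in this guide. A minimal sketch follows, assuming each line of the JSONL file is a JSON object with a `text` field; the field name is a guess, so adjust it to your logging schema.

```python
import json
from pathlib import Path


def load_texts(path, text_field="text"):
    """Read one JSON object per line and return the values stored under `text_field`."""
    texts = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # Skip blank lines
            record = json.loads(line)
            texts.append(record[text_field])
    return texts
```

A missing field raises `KeyError` here on purpose: a schema change in the logged data is itself something the drift job should surface.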

Scheduled Drift Testing in CI

name: ML Drift Monitoring
on:
  schedule:
    - cron: '0 6 * * *'  # Daily at 06:00 UTC

jobs:
  drift-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      
      - name: Install dependencies
        run: pip install evidently alibi-detect pandas numpy scipy pyarrow pytest
      
      - name: Download recent production data
        env:
          S3_BUCKET: ${{ secrets.ML_DATA_BUCKET }}
        run: python scripts/fetch_production_data.py --days 7
      
      - name: Run drift tests
        run: pytest tests/drift/ -v --tb=short --junitxml=drift-results.xml
      
      - name: Upload drift report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: drift-report
          path: drift_report.html
      
      - name: Alert on drift
        if: failure()
        uses: slackapi/slack-github-action@v1
        with:
          payload: '{"text": "⚠️ ML Drift Alert: Model drift detected. Review drift-report artifact."}'
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_ML_ALERTS }}

Integrating with HelpMeTest

For continuous drift monitoring beyond scheduled CI runs:

# Run drift check as a health check
helpmetest health ml-drift-monitor 1h --command "python scripts/drift_check.py --fail-on-drift"

This runs drift detection hourly and alerts if drift is detected, without needing a CI trigger.
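
The guide does not show `scripts/drift_check.py`, so the following is only a sketch of what such a script might contain. The file paths, the `--reference` and `--current` options, and the 0.01 threshold are assumptions; the one requirement taken from the command above is that `--fail-on-drift` turns detected drift into a non-zero exit code.

```python
import argparse

import numpy as np
from scipy import stats


def check_drift(reference, current, p_threshold=0.01):
    """Return (drifted, p_value) from a two-sample KS test."""
    _, p_value = stats.ks_2samp(reference, current)
    return bool(p_value < p_threshold), float(p_value)


def main(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical drift check entry point")
    parser.add_argument("--reference", default="data/baseline_predictions.npy")
    parser.add_argument("--current", default="data/current_predictions.npy")
    parser.add_argument("--fail-on-drift", action="store_true")
    args = parser.parse_args(argv)

    reference = np.load(args.reference)
    current = np.load(args.current)
    drifted, p_value = check_drift(reference, current)
    print(f"drifted={drifted} p_value={p_value:.4g}")

    # A non-zero exit code lets a scheduler or health check treat drift as a failure
    return 1 if (drifted and args.fail_on_drift) else 0
```

In a real script, finish with `sys.exit(main())` under the usual `if __name__ == "__main__":` guard so the return value becomes the process exit code.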


Summary

ML drift monitoring requires two types of tests:

  1. Data drift — statistical tests (KS, PSI) and Evidently reports verify feature distributions match training data
  2. Concept drift — prediction distribution and confidence monitoring detect silent accuracy degradation

Key practices:

  • Run drift checks daily or weekly, not just on deployment
  • Monitor critical features more strictly than less important ones
  • Use PSI < 0.1 as a no-drift threshold for tabular features
  • For NLP models, compare embedding distributions rather than raw text statistics
  • Alert immediately on critical feature drift, but investigate before emergency retraining

Drift detection is a maintenance task, not a one-time setup — the reference distributions need to be refreshed periodically as the world evolves.
