Environment Drift Detection: Catching Config Differences Before They Break Tests

Environment drift is quiet. It doesn't announce itself. It accumulates through configuration changes, dependency updates, and infrastructure modifications — each individually harmless, collectively producing an environment that no longer matches what your tests expect. By the time drift causes a visible failure, tracing it back to its source is archaeology.

Detection is the answer. Make drift visible as it happens, not after it breaks something.

What Environment Drift Looks Like

Drift takes several forms, and each requires a different detection approach.

Configuration drift is the most common. An environment variable that exists in production but not in staging. A feature flag toggled in one environment but not another. A connection string pointing at a different database version. These differences are invisible in code review because they live outside the codebase.
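
A minimal sketch of how these differences can be surfaced, assuming each environment's configuration has already been exported as a plain dictionary. The `diff_config_keys` helper and the variable names and values are hypothetical, chosen only for illustration:

```python
def diff_config_keys(production: dict, staging: dict) -> dict:
    """Report keys missing from either side and keys whose values differ."""
    prod_keys, staging_keys = set(production), set(staging)
    return {
        "missing_in_staging": sorted(prod_keys - staging_keys),
        "missing_in_production": sorted(staging_keys - prod_keys),
        "different_values": sorted(
            key for key in prod_keys & staging_keys
            if production[key] != staging[key]
        ),
    }


# Hypothetical exports, for illustration only
production = {"LOG_LEVEL": "warn", "CACHE_TTL": "300", "NEW_CHECKOUT": "on"}
staging = {"LOG_LEVEL": "debug", "CACHE_TTL": "300"}

report = diff_config_keys(production, staging)
print(report)
```

Splitting the result into three buckets keeps "the variable does not exist" separate from "the variable exists with a different value", which usually call for different fixes.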

Dependency drift happens when package versions diverge between environments. Your application container is built from the same Dockerfile everywhere, but the base image was last pulled six months ago in staging and last week in production. The OS packages differ. A transitive dependency resolves differently.

Schema drift occurs when database migrations run in production but not in the staging environment that's supposed to mirror it. Or vice versa — a migration tested in staging never makes it to production. Your application code and your database schema get out of sync.

Infrastructure drift is when the underlying infrastructure diverges: different instance types, different network configurations, different storage backends. Often caused by manual changes made during incident response and never formalized.

Building a Drift Detection Baseline

You can't detect drift without knowing what "correct" looks like. Start by establishing a baseline: the known-good state of each environment, captured as data.

For configuration, a checksum approach works well:

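#!/usr/bin/env python3
# scripts/capture-env-baseline.py
# Usage: capture-env-baseline.py <environment> > baselines/<environment>.json

import hashlib
import json
import os
import sys
from datetime import datetime, timezone

# Placeholder policy: names containing these markers are stored as hashes only
SECRET_MARKERS = ("SECRET", "TOKEN", "PASSWORD", "API_KEY")

def is_secret(var: str) -> bool:
    return any(marker in var.upper() for marker in SECRET_MARKERS)

def get_remote_env_var(var: str, **connection_params):
    """Placeholder: reads the local environment. Replace with a lookup
    against the target environment (SSH, orchestrator API, secrets manager)."""
    return os.environ.get(var)

def capture_config_state(env_name: str, connection_params: dict) -> dict:
    """Capture a snapshot of critical configuration values."""
    state = {
        "environment": env_name,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "config": {}
    }

    # Variables worth tracking; extend this list for your application
    critical_vars = [
        "NODE_ENV", "LOG_LEVEL", "FEATURE_FLAGS",
        "CACHE_TTL", "MAX_REQUEST_SIZE", "RATE_LIMIT_WINDOW"
    ]

    for var in critical_vars:
        value = get_remote_env_var(var, **connection_params)
        if value is None:
            # Record unset variables explicitly so they show up as drift
            state["config"][var] = {"value": None}
        elif is_secret(var):
            # Hash secret values so the baseline is safe to commit
            state["config"][var] = {"hash": hashlib.sha256(value.encode()).hexdigest()}
        else:
            state["config"][var] = {"value": value}

    return state

if __name__ == "__main__":
    env_name = sys.argv[1] if len(sys.argv) > 1 else "production"
    # Connection parameters are deployment-specific; an empty dict is a placeholder
    json.dump(capture_config_state(env_name, {}), sys.stdout, indent=2)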

Store this baseline in your repository. Now you can compare any environment against it.
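
When only a yes/no answer is needed, the snapshot can be reduced to a single fingerprint. The following is a sketch, assuming the snapshot has the shape shown above (a dictionary with a "config" section); the `config_fingerprint` helper and the sample snapshots are hypothetical and not part of the script:

```python
import hashlib
import json


def config_fingerprint(state: dict) -> str:
    """Hash only the "config" section so timestamps and names don't affect the result."""
    # sort_keys gives a canonical serialization, so key order no longer matters
    canonical = json.dumps(state["config"], sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical snapshots, for illustration only
baseline = {"environment": "production", "captured_at": "2024-01-01T00:00:00",
            "config": {"LOG_LEVEL": {"value": "info"}, "CACHE_TTL": {"value": "300"}}}
same_config = {"environment": "staging", "captured_at": "2024-02-01T00:00:00",
               "config": {"CACHE_TTL": {"value": "300"}, "LOG_LEVEL": {"value": "info"}}}
drifted = {"environment": "staging", "captured_at": "2024-02-01T00:00:00",
           "config": {"CACHE_TTL": {"value": "60"}, "LOG_LEVEL": {"value": "info"}}}

print(config_fingerprint(baseline) == config_fingerprint(same_config))  # True
print(config_fingerprint(baseline) == config_fingerprint(drifted))      # False
```

A fingerprint only says that something differs, not what. It is a cheap first check; a key-by-key comparison is still needed to find the offending value.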

Automated Drift Checks

A drift check compares the current state of an environment against the baseline and reports differences. Run it as a CI step before deploying:

#!/bin/bash
# scripts/check-drift.sh

set -e

ENVIRONMENT=${1:-staging}
BASELINE_FILE="baselines/production.json"
CURRENT_STATE=$(mktemp)

echo "Capturing current state of $ENVIRONMENT..."
python3 scripts/capture-env-baseline.py "$ENVIRONMENT" > "$CURRENT_STATE"

echo "Comparing against production baseline..."
python3 scripts/compare-states.py "$BASELINE_FILE" "$CURRENT_STATE"

DIFF_COUNT=$(python3 scripts/compare-states.py "$BASELINE_FILE" "$CURRENT_STATE" --count)

if [ "$DIFF_COUNT" -gt 0 ]; then
  echo ""
  echo "WARNING: $DIFF_COUNT configuration difference(s) detected between $ENVIRONMENT and production"
  echo "Run 'scripts/show-drift.sh $ENVIRONMENT' for details"
  exit 1
fi

echo "No configuration drift detected"

# scripts/compare-states.py
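import json
import sys
from deepdiff import DeepDiff

# Keys that legitimately differ between environments.
# Paths are relative to the "config" dict, which is what gets compared.
EXCLUDE_PATHS = {
    "root['DATABASE_URL']",
    "root['REDIS_URL']",
}

def compare(baseline_path, current_path, quiet=False):
    """Print each difference (unless quiet) and return the number of differences."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(current_path) as f:
        current = json.load(f)

    # Only "config" is compared, so "environment" and "captured_at" are ignored
    diff = DeepDiff(baseline["config"], current["config"], exclude_paths=EXCLUDE_PATHS)

    count = 0
    for change_type, changes in diff.items():
        # values_changed and type_changes are dicts keyed by path;
        # added and removed items are collections of paths
        for path in changes:
            count += 1
            if quiet:
                continue
            if change_type in ("values_changed", "type_changes"):
                print(f"CHANGED: {path}")
                print(f"  Expected: {changes[path]['old_value']}")
                print(f"  Got:      {changes[path]['new_value']}")
            elif change_type == "dictionary_item_added":
                print(f"ADDED: {path} (not in production)")
            elif change_type == "dictionary_item_removed":
                print(f"MISSING: {path} (exists in production)")
            else:
                print(f"{change_type}: {path}")

    return count

if __name__ == "__main__":
    # Usage: compare-states.py <baseline.json> <current.json> [--count]
    count_only = "--count" in sys.argv[3:]
    total = compare(sys.argv[1], sys.argv[2], quiet=count_only)
    if count_only:
        print(total)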

Database Schema Drift Detection

Schema drift deserves its own check. Compare the list of applied migrations between environments:

-- Run in each environment, compare outputs
SELECT migration_name, applied_at
FROM schema_migrations
ORDER BY applied_at;

#!/bin/bash
# scripts/check-schema-drift.sh
# Order by name, not applied_at: timestamps differ between environments
PROD_MIGRATIONS=$(psql "$PROD_DATABASE_URL" -tAc \
  "SELECT migration_name FROM schema_migrations ORDER BY migration_name")

STAGING_MIGRATIONS=$(psql "$STAGING_DATABASE_URL" -tAc \
  "SELECT migration_name FROM schema_migrations ORDER BY migration_name")

# diff exits non-zero when the two lists differ
if ! diff <(echo "$PROD_MIGRATIONS") <(echo "$STAGING_MIGRATIONS"); then
  echo "ERROR: Schema drift detected between production and staging"
  echo "Production has migrations that staging does not (or vice versa)"
  exit 1
fi

For more detailed schema comparison, pgdiff or migra can generate the SQL diff between two schemas:

# Using migra (Python); the [pg] extra pulls in the PostgreSQL driver
pip install 'migra[pg]'
# migra <from> <target> prints the SQL that turns <from> into <target>
migra postgresql://user@staging/mydb postgresql://user@prod/mydb
# Outputs the SQL to bring staging in line with production

Dependency Version Checks

Docker image digests are a reliable way to track dependency state, because a digest changes whenever any part of the image changes:

# Record image digests at build time
docker inspect myapp:production --format='{{.RepoDigests}}' > baselines/production-image-digest.txt

# Compare at deploy time
CURRENT_DIGEST=$(docker inspect myapp:staging --format='{{.RepoDigests}}')
BASELINE_DIGEST=$(cat baselines/production-image-digest.txt)

if [ "$CURRENT_DIGEST" != "$BASELINE_DIGEST" ]; then
  echo "WARNING: Image digest differs from production baseline"
  echo "This may indicate dependency version skew"

  # Show the diff in installed packages
  docker run --rm myapp:production dpkg -l > /tmp/prod-packages.txt
  docker run --rm myapp:staging dpkg -l > /tmp/staging-packages.txt
  diff /tmp/prod-packages.txt /tmp/staging-packages.txt
fi
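
A raw diff of two `dpkg -l` listings can be noisy. The sketch below reduces the listings to the packages whose versions actually differ, assuming Debian-style output where installed packages appear on rows starting with `ii`. The function names and the sample excerpts are hypothetical:

```python
def parse_dpkg_list(text: str) -> dict:
    """Map package name -> version for installed ("ii") rows of `dpkg -l` output."""
    packages = {}
    for line in text.splitlines():
        fields = line.split()
        # Installed rows look like: ii  <name>  <version>  <arch>  <description...>
        if len(fields) >= 3 and fields[0] == "ii":
            packages[fields[1]] = fields[2]
    return packages


def package_skew(prod_text: str, staging_text: str) -> list:
    """Return (name, prod_version, staging_version) per mismatch; None means absent."""
    prod, staging = parse_dpkg_list(prod_text), parse_dpkg_list(staging_text)
    return [
        (name, prod.get(name), staging.get(name))
        for name in sorted(set(prod) | set(staging))
        if prod.get(name) != staging.get(name)
    ]


# Hypothetical excerpts of `dpkg -l` output, for illustration only
PROD = """\
ii  libssl3:amd64  3.0.11-1  amd64  Secure Sockets Layer toolkit
ii  zlib1g:amd64   1:1.2.13  amd64  compression library
"""
STAGING = """\
ii  libssl3:amd64  3.0.9-1   amd64  Secure Sockets Layer toolkit
ii  zlib1g:amd64   1:1.2.13  amd64  compression library
ii  curl           7.88.1    amd64  command line tool for transferring data
"""

skew = package_skew(PROD, STAGING)
for name, prod_version, staging_version in skew:
    print(f"{name}: production={prod_version} staging={staging_version}")
```

To run it against the files written by the shell snippet, read `/tmp/prod-packages.txt` and `/tmp/staging-packages.txt` and pass their contents to `package_skew`.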

Drift Detection in Your CI Pipeline

Wire drift detection into your deployment pipeline as a blocking gate for staging, and a warning gate for less critical environments:

# .github/workflows/deploy-staging.yml
jobs:
  drift-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Check configuration drift
        run: ./scripts/check-drift.sh staging
        # Fails the pipeline if drift is detected
      
      - name: Check schema drift
        run: ./scripts/check-schema-drift.sh
      
      - name: Report drift status
        if: failure()
        uses: actions/github-script@v7
        with:
          script: |
            await github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: 'Environment drift detected in staging',
              body: 'Drift check failed in CI. See workflow run for details.',
              labels: ['environment-drift', 'needs-attention']
            })
  
  deploy:
    needs: drift-check
    # Only runs if drift-check passes
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to staging
        run: ./scripts/deploy.sh staging
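
The split between blocking and warning gates can be expressed as a small policy function. This is a sketch under stated assumptions: `BLOCKING_ENVIRONMENTS`, `gate_exit_code`, and the `dev` environment name are hypothetical and do not appear in the scripts above.

```python
import sys

# Hypothetical policy: environments where drift should fail the pipeline
BLOCKING_ENVIRONMENTS = {"staging"}


def gate_exit_code(environment: str, diff_count: int) -> int:
    """Return the exit code a CI step should use for this drift result."""
    if diff_count == 0:
        return 0
    if environment in BLOCKING_ENVIRONMENTS:
        print(f"ERROR: {diff_count} difference(s) in {environment}; blocking deploy",
              file=sys.stderr)
        return 1
    # Non-blocking environments still report the drift, but let the job continue
    print(f"WARNING: {diff_count} difference(s) in {environment}; continuing",
          file=sys.stderr)
    return 0


print(gate_exit_code("staging", 2))  # 1
print(gate_exit_code("dev", 2))      # 0
print(gate_exit_code("staging", 0))  # 0
```

Keeping the policy in one place means promoting an environment from warning to blocking is a one-line change rather than an edit to every workflow file.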

Remediation Workflow

Detection without remediation is just noise. When drift is detected, you need a clear path to fixing it.

For configuration drift, the remediation is almost always the same: update your infrastructure-as-code (IaC) definitions to reflect the production value, apply the change to all environments, and commit it.

For schema drift, the remediation depends on direction:

  • Staging ahead of production: run the missing migrations in production before the next deploy
  • Production ahead of staging: apply the missing migrations to staging
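
The direction can be worked out from the two migration lists themselves. A minimal sketch, assuming each list holds the migration names applied in that environment; the `classify_schema_drift` helper and the migration names are hypothetical:

```python
def classify_schema_drift(prod_migrations: list, staging_migrations: list) -> dict:
    """Work out which side is missing migrations and where to apply them."""
    prod, staging = set(prod_migrations), set(staging_migrations)
    missing_in_prod = sorted(staging - prod)
    missing_in_staging = sorted(prod - staging)

    if not missing_in_prod and not missing_in_staging:
        direction = "in sync"
    elif missing_in_prod and missing_in_staging:
        direction = "diverged"
    elif missing_in_prod:
        direction = "staging ahead of production"
    else:
        direction = "production ahead of staging"

    return {
        "direction": direction,
        "apply_to_production": missing_in_prod,
        "apply_to_staging": missing_in_staging,
    }


# Hypothetical migration names, for illustration only
prod = ["001_create_users", "002_add_email_index"]
staging = ["001_create_users", "002_add_email_index", "003_add_orders"]

result = classify_schema_drift(prod, staging)
print(result["direction"])
print("Run in production:", result["apply_to_production"])
```

The "diverged" case, where each side has migrations the other lacks, falls outside the two bullets above and usually needs a manual review before anything is applied.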

Automate the remediation hint as part of your drift alert:

# When drift is detected, also output the remediation command
if [ "$DIFF_COUNT" -gt 0 ]; then
  echo ""
  echo "Remediation:"
  echo "  1. Review differences above"
  echo "  2. Update terraform/environments/staging.tfvars to match production"
  echo "  3. Run: terraform workspace select staging && terraform apply"
  echo "  4. Re-run this check to verify"
fi

The goal isn't just to detect drift — it's to make fixing it the path of least resistance. When the drift report includes the remediation steps, the person who sees it can fix it immediately rather than filing a ticket that gets lost.

Consistent environments produce consistent test results. Drift detection is what keeps them consistent over time.
