Business Continuity Testing Checklist for Engineering Teams

Business continuity testing (BCT) goes beyond disaster recovery: it systematically validates that your organization can maintain critical operations under a wide range of adverse conditions. Where DR testing focuses on technical recovery, BCT covers the full picture: technology, people, processes, and communication.

This checklist is designed for engineering teams who own operational resilience. Use it to audit your current posture, identify gaps, and schedule the testing that turns your business continuity plan from a document into a proven capability.

How to Use This Checklist

Rate each item on a 3-point scale:

  • ✅ Tested — validated within the past testing cycle
  • ⚠️ Untested — exists but has not been validated
  • ❌ Gap — not addressed at all

Work through categories in order of risk. Start with the items most likely to cause severe business impact if they fail.
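One way to turn the ratings into a work queue is sketched below. The `impact` score (1 to 5) and the names `ChecklistItem` and `prioritize` are assumptions made for illustration; the checklist itself only defines the three ratings.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    TESTED = "tested"      # validated within the past testing cycle
    UNTESTED = "untested"  # exists but has not been validated
    GAP = "gap"            # not addressed at all


@dataclass
class ChecklistItem:
    name: str
    status: Status
    impact: int  # hypothetical business-impact score: 1 (minor) to 5 (severe)


def prioritize(items):
    """Return items that still need work, highest impact first.

    Within the same impact score, gaps are listed before untested items.
    """
    pending = [item for item in items if item.status is not Status.TESTED]
    return sorted(pending, key=lambda item: (-item.impact, item.status is not Status.GAP))
```

Sorting on a tuple keeps the rule explicit: impact decides first, and the rating only breaks ties.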


Category 1: Infrastructure Resilience

Compute

  • Single instance failure: Can the application survive the loss of any individual server or pod without downtime?
    • Test: Terminate a random instance/pod while monitoring service health
    • Frequency: Weekly (automated)
  • Large-scale node failure: Can the application serve traffic when only half of its compute nodes are available?
    • Test: Cordon and drain N/2 nodes in your cluster while monitoring service health
    • Frequency: Monthly
  • Auto-scaling under load: Does the application scale out before performance degrades?
    • Test: Load test at 2× normal traffic while monitoring response times and scale events
    • Frequency: Monthly
  • Spot/preemptible instance handling: If you use spot instances, do pods terminate gracefully when preempted?
    • Test: Force-terminate a spot instance and verify pods shut down cleanly within their termination grace period (30 seconds by default in Kubernetes)
    • Frequency: Monthly
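"Monitoring service health" during these tests usually means probing an endpoint at a fixed interval and then asking how long the service was actually down. The sketch below covers only the analysis half, under the assumption that probe results have already been collected as `(timestamp_seconds, healthy)` pairs; the function name is hypothetical.

```python
def longest_outage_seconds(samples):
    """Return the longest continuous unhealthy period in a probe log.

    samples: list of (timestamp_seconds, healthy) tuples sorted by time.
    An outage runs from the first failed probe to the next healthy probe.
    """
    longest = 0.0
    outage_start = None
    for timestamp, healthy in samples:
        if not healthy and outage_start is None:
            outage_start = timestamp  # outage begins
        elif healthy and outage_start is not None:
            longest = max(longest, timestamp - outage_start)  # outage ends
            outage_start = None
    if outage_start is not None:
        # Still failing when probing stopped: count up to the last sample
        longest = max(longest, samples[-1][0] - outage_start)
    return longest
```

A result of zero after terminating an instance is the pass condition for the single-instance test; for the other tests, compare the result with whatever downtime budget you have agreed on.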

Networking

  • Load balancer failover: If your primary load balancer becomes unavailable, does the backup take over?
    • Test: Stop the primary LB process and verify traffic continues flowing
    • Frequency: Quarterly
  • DNS failover: Does DNS fail over to the backup address within your RTO when the primary is unreachable?
    • Test: Make the primary endpoint unreachable; measure how long clients take to resolve the backup address
    • Frequency: Quarterly
  • CDN fallback: If your CDN is unavailable, does traffic fall back to origin without total service failure?
    • Test: Simulate CDN failure; verify origin serves traffic
    • Frequency: Semi-annually
  • Cross-AZ traffic: Under single-AZ failure, does traffic route to remaining AZs automatically?
    • Test: Block traffic to and from one AZ; verify service continues from remaining AZs
    • Frequency: Quarterly
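The DNS failover measurement can be approximated by polling a resolver until it returns the backup address. This is a rough sketch: the function name and parameters are assumptions, and the result reflects only the resolver used by the machine running it, including any local caching.

```python
import socket
import time


def measure_dns_failover(hostname, backup_ip, timeout_s=300.0, interval_s=5.0,
                         resolve=socket.gethostbyname,
                         clock=time.monotonic, sleep=time.sleep):
    """Poll DNS until `hostname` resolves to `backup_ip`.

    Returns the elapsed seconds, or None if the timeout passes first.
    `resolve`, `clock`, and `sleep` are injectable so the loop can be tested offline.
    """
    start = clock()
    while clock() - start <= timeout_s:
        try:
            if resolve(hostname) == backup_ip:
                return clock() - start
        except OSError:
            pass  # resolution errors (socket.gaierror) are expected mid-failover
        sleep(interval_s)
    return None
```

Start the timer at the moment you make the primary unreachable, and compare the returned value with your RTO.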

Storage

  • Primary storage failure: If primary storage becomes unavailable, do stateful services fail gracefully?
    • Test: Simulate disk full or storage unavailability
    • Frequency: Quarterly
  • Persistent volume rescheduling: Do PVs reattach correctly when pods are rescheduled to new nodes?
    • Test: Delete pod with PV; verify new pod attaches to same PV
    • Frequency: Monthly

Category 2: Data Resilience

Backup Integrity

  • Backup creation verification: Are backups actually being created on schedule?
    • Test: Check backup timestamps; alert if the newest backup is older than your RPO
    • Frequency: Daily (automated)
  • Backup integrity check: Are backup files non-corrupted and complete?
    • Test: Verify file checksums; attempt partial restoration
    • Frequency: Weekly
  • Full restoration test: Can you restore from backup within your RTO?
    • Test: Restore latest backup to isolated environment; measure time; validate row counts
    • Frequency: Monthly
  • Point-in-time recovery: Can you restore to a specific point in time (not just the latest backup)?
    • Test: Restore to T-2 hours; verify specific records from that timestamp exist
    • Frequency: Quarterly
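The checksum half of the integrity check can be sketched as follows, assuming a SHA-256 digest was recorded when the backup was written. The function names are illustrative, and a matching checksum only shows the file is unchanged; it does not replace a restoration test.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large backups are never loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as backup_file:
        for chunk in iter(lambda: backup_file.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(path, expected_sha256):
    """Return True if the file's digest matches the digest recorded at backup time."""
    return sha256_of(path) == expected_sha256.strip().lower()
```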

Replication

  • Replication lag monitoring: Is replication lag within your RPO at all times?
    • Test: Continuous monitoring; alert if lag exceeds RPO threshold
    • Frequency: Continuous
  • Replica promotion: Can you promote a replica to primary within your RTO?
    • Test: Manually trigger replica promotion; measure time; validate writes work
    • Frequency: Monthly
  • Multi-region replication: If you have a secondary region, is it current and writable?
    • Test: Fail over to secondary region; verify data age and write capability
    • Frequency: Quarterly
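Several of these tests come down to timing a recovery step and comparing it with the RTO. A minimal timing wrapper might look like the sketch below; `run_timed_drill` and its return shape are assumptions, and `action` stands in for whatever triggers your promotion or failover.

```python
import time


def run_timed_drill(name, action, rto_seconds, clock=time.monotonic):
    """Run a recovery action and report whether it finished within the RTO.

    `action` is any callable that blocks until recovery is complete.
    """
    start = clock()
    action()
    elapsed = clock() - start
    return {
        "name": name,
        "elapsed_seconds": elapsed,
        "within_rto": elapsed <= rto_seconds,
    }
```

Recording the elapsed time on every run, not only pass or fail, makes it easier to spot recovery times drifting toward the limit.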

Data Integrity

  • Transaction consistency: Are transactions applied consistently after failover (no partial writes)?
    • Test: Run transactions during failover; verify either all committed or all rolled back
    • Frequency: Quarterly
  • Cache invalidation: Does cache invalidation work correctly after a failover event?
    • Test: Update data; trigger failover; verify cache reflects new data after recovery
    • Frequency: Quarterly
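The all-or-nothing property can be illustrated on a small scale with an invariant check. The sketch below uses an in-memory SQLite database and a hypothetical `accounts` table purely as a stand-in; in a real test the interruption would be the failover itself and the invariant would come from your own schema.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with two accounts holding 100 each."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 100)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount, fail_midway=False):
    """Move funds in one transaction; `with conn` rolls back if an exception escapes."""
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
        if fail_midway:
            raise RuntimeError("simulated failure between the two writes")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))


def balances(conn):
    """Return {account_id: balance} for checking the invariant after a failure."""
    return dict(conn.execute("SELECT id, balance FROM accounts"))
```

After an interrupted transfer, the balances should be exactly what they were before it started, and their sum should never change.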

Category 3: Application Resilience

Service Dependencies

  • Graceful degradation: When a non-critical dependency is down, does the application serve reduced functionality rather than failing completely?
    • Test: Disable each non-critical service; verify application responds with fallback behavior
    • Frequency: Quarterly
  • Circuit breaker behavior: Do circuit breakers open under sustained dependency failure and close on recovery?
    • Test: Make a dependency return errors; verify circuit opens; restore dependency; verify circuit closes
    • Frequency: Quarterly
  • Timeout handling: Do all external service calls have timeouts? Do they handle timeouts gracefully?
    • Test: Inject latency > timeout into each dependency; verify timeout errors, not hangs
    • Frequency: Quarterly
  • Retry with backoff: Do retryable failures use exponential backoff (not tight retry loops)?
    • Test: Enable dependency errors; capture retry timing in logs; verify exponential pattern
    • Frequency: Quarterly
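For the retry test, the "verify exponential pattern" step can be automated once attempt timestamps have been pulled from the logs. The sketch below assumes each delay should grow by at least `min_factor` until an optional cap is reached; both parameters are illustrative, and clients that add heavy jitter may need a lower factor.

```python
def is_exponential_backoff(attempt_times, min_factor=1.5, max_delay=None):
    """Check that the gaps between retry attempts keep growing.

    attempt_times: timestamps (seconds) of each attempt, in order.
    max_delay: optional cap; once a delay reaches it, growth is no longer required.
    """
    delays = [later - earlier for earlier, later in zip(attempt_times, attempt_times[1:])]
    if len(delays) < 2:
        return False  # too few retries to judge the pattern
    for previous, current in zip(delays, delays[1:]):
        capped = max_delay is not None and current >= max_delay
        if not capped and current < previous * min_factor:
            return False
    return True
```

A tight retry loop shows up as roughly constant delays, which this check rejects.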

Statefulness

  • Stateless API recovery: After pod restart, do API requests succeed without any manual intervention?
    • Test: Kill all API pods; verify requests succeed after recovery
    • Frequency: Weekly
  • Session persistence: If session data is stored externally (Redis), do sessions survive pod restarts?
    • Test: Log in; restart API pods; verify session still valid
    • Frequency: Monthly
  • Queue durability: Are messages in queues preserved across service restarts?
    • Test: Publish messages; kill consumer; verify messages consumed after restart
    • Frequency: Monthly

Category 4: Security and Access

  • Credential rotation continuity: When credentials are rotated, does the service continue operating without downtime?
    • Test: Rotate a credential during peak hours; monitor for errors
    • Frequency: Every rotation (quarterly)
  • VPN/bastion availability: If your primary VPN endpoint fails, can engineers still access production systems?
    • Test: Disable primary VPN; attempt access via secondary path
    • Frequency: Quarterly
  • Break-glass access: In a P0 incident with primary auth unavailable, can authorized engineers access systems via break-glass procedure?
    • Test: Follow break-glass runbook; measure time to access
    • Frequency: Annually
  • Secrets management failover: If your secrets manager (Vault, AWS Secrets Manager) is unavailable, can applications continue with cached credentials?
    • Test: Make secrets manager unavailable; verify app continues with previously fetched secrets
    • Frequency: Quarterly
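The behavior the secrets-manager test looks for is "serve the last known value when a refresh fails". A minimal sketch of that pattern follows; `CachedSecret` and the `fetch` callable are hypothetical, and whether serving stale credentials is acceptable depends on your rotation policy.

```python
import time


class CachedSecret:
    """Cache a secret and fall back to the cached value if refreshing fails."""

    def __init__(self, fetch, ttl_seconds=300.0, clock=time.monotonic):
        self._fetch = fetch            # callable that reads from the secrets manager
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        if self._fetched_at is not None and now - self._fetched_at < self._ttl:
            return self._value         # still fresh, no network call
        try:
            self._value = self._fetch()
            self._fetched_at = now
        except Exception:
            if self._fetched_at is None:
                raise                  # nothing cached yet, so the failure must surface
            # Otherwise keep serving the stale value until the manager recovers
        return self._value
```

Note that a service starting fresh during the outage has nothing cached and will still fail, which is worth covering in the same test.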

Category 5: Operational Readiness

Runbooks

  • Runbook accuracy: Are runbooks executable as written, with no missing steps or outdated commands?
    • Test: Follow runbook verbatim in a staging environment; document gaps
    • Frequency: Quarterly
  • Runbook discovery: Can on-call engineers find the right runbook within 2 minutes during an incident?
    • Test: Give a new on-call engineer a simulated incident; time their runbook discovery
    • Frequency: Semi-annually

Monitoring and Alerting

  • Alert delivery: When a critical alert fires, does it reach the on-call engineer via all configured channels?
    • Test: Trigger a test alert; verify receipt via PagerDuty, Slack, and SMS
    • Frequency: Monthly
  • Alert accuracy: Do critical alerts fire for genuine failures (not false positives)?
    • Test: Review last 30 days of alerts; calculate false positive rate; target < 10%
    • Frequency: Monthly
  • Dashboard availability: During an incident, are dashboards accessible even if the primary monitoring service is degraded?
    • Test: Simulate monitoring service degradation; verify dashboards still load
    • Frequency: Quarterly
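The false positive rate in the alert accuracy test is a simple ratio. The sketch below assumes each reviewed alert has been labeled with a boolean `genuine` field during the monthly review; that field name is an assumption, not something an alerting tool provides by default.

```python
def false_positive_rate(alerts):
    """Fraction of reviewed alerts that were not genuine failures."""
    alerts = list(alerts)
    if not alerts:
        return 0.0
    false_positives = sum(1 for alert in alerts if not alert["genuine"])
    return false_positives / len(alerts)


def meets_target(rate, target=0.10):
    """The checklist targets a false positive rate below 10%."""
    return rate < target
```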

On-Call

  • On-call coverage: Is there always someone on-call who can respond within your SLA?
    • Test: Audit on-call schedule for gaps; test paging at unexpected times
    • Frequency: Monthly
  • Escalation path: If primary on-call doesn't respond in 15 minutes, does escalation fire automatically?
    • Test: Don't acknowledge a test alert for 15 minutes; verify escalation
    • Frequency: Quarterly
  • Knowledge transfer: If primary on-call is unavailable, does a secondary have enough context to handle a P1?
    • Test: Run a shadowing exercise in which the secondary leads a simulated incident
    • Frequency: Quarterly
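Auditing the on-call schedule for gaps is an interval-coverage problem. The sketch below assumes shifts have been exported as `(start, end)` datetime pairs; the function name is illustrative.

```python
from datetime import datetime


def find_coverage_gaps(shifts, window_start, window_end):
    """Return (gap_start, gap_end) periods in the window with nobody on call.

    shifts: iterable of (start, end) datetime pairs; overlaps are fine.
    """
    gaps = []
    covered_until = window_start
    for start, end in sorted(shifts):
        if start > covered_until:
            gaps.append((covered_until, min(start, window_end)))
        covered_until = max(covered_until, end)
        if covered_until >= window_end:
            break
    if covered_until < window_end:
        gaps.append((covered_until, window_end))
    return gaps
```

An empty result means the window is fully covered. Run it over the next few weeks of the schedule so that gaps are found before they are reached.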

Category 6: Communication

  • Status page updates: Can you publish a status page update within 10 minutes of detecting a P1?
    • Test: Simulate P1; time status page update; verify accuracy
    • Frequency: Semi-annually
  • Customer communication template: Do you have pre-written templates for common incident types?
    • Test: Review templates for currency; update if product has changed
    • Frequency: Quarterly
  • Internal communication: Do all internal stakeholders (CTO, support, sales) have a single source of truth for incident status?
    • Test: Simulate incident; verify all stakeholders receive updates without requesting them
    • Frequency: Semi-annually

Automation Example: Running Checks in CI

#!/usr/bin/env python3
# bct_checks.py — Automated BCT validation

import boto3
import psycopg2
from datetime import datetime, timezone
from dataclasses import dataclass

@dataclass
class CheckResult:
    name: str
    passed: bool
    details: str
    severity: str  # critical, high, medium

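def check_backup_age(bucket: str, prefix: str, max_age_hours: int) -> CheckResult:
    s3 = boto3.client('s3')
    # list_objects_v2 returns at most 1,000 keys per call, so paginate to see every backup
    paginator = s3.get_paginator('list_objects_v2')
    objects = [
        obj
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
        for obj in page.get('Contents', [])
    ]
    
    if not objects:
        return CheckResult("backup_age", False, "No backups found", "critical")
    
    # LastModified is timezone-aware, so compare against an aware "now"
    latest = max(objects, key=lambda x: x['LastModified'])
    age_hours = (datetime.now(timezone.utc) - latest['LastModified']).total_seconds() / 3600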
    
    passed = age_hours <= max_age_hours
    return CheckResult(
        name="backup_age",
        passed=passed,
        details=f"Latest backup: {latest['Key']} ({age_hours:.1f}h ago)",
        severity="critical" if not passed else "medium"
    )

def check_replication_lag(primary_url: str, max_lag_seconds: int) -> CheckResult:
    with psycopg2.connect(primary_url) as conn:
        cur = conn.cursor()
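        # Run on the primary: pg_stat_replication has one row per connected standby.
        # replay_lag (PostgreSQL 10+) is an interval; it is NULL once a standby is idle and caught up.
        # pg_last_xact_replay_timestamp() only returns a value on a standby, so it cannot be used here.
        cur.execute("""
            SELECT client_addr,
                   EXTRACT(EPOCH FROM replay_lag) AS lag_seconds
            FROM pg_stat_replication
        """)
        rows = cur.fetchall()
        
        if not rows:
            return CheckResult("replication_lag", False, "No replication connections", "critical")
        
        # Treat NULL lag as zero; float() normalizes the Decimal some server versions return
        max_lag = max(float(row[1] or 0) for row in rows)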
        passed = max_lag <= max_lag_seconds
        
        return CheckResult(
            name="replication_lag",
            passed=passed,
            details=f"Max lag: {max_lag:.1f}s (threshold: {max_lag_seconds}s)",
            severity="critical" if max_lag > max_lag_seconds * 2 else "high"
        )

def check_health_endpoint(url: str, timeout: int = 5) -> CheckResult:
    import requests
    try:
        resp = requests.get(url, timeout=timeout)
        passed = resp.status_code == 200
        return CheckResult(
            name=f"health_{url}",
            passed=passed,
            details=f"HTTP {resp.status_code}",
            severity="critical" if not passed else "medium"
        )
    except requests.RequestException as e:
        return CheckResult(
            name=f"health_{url}",
            passed=False,
            details=str(e),
            severity="critical"
        )

def run_all_checks(config: dict) -> list[CheckResult]:
    results = []
    
    # Backup checks
    results.append(check_backup_age(
        bucket=config['backup_bucket'],
        prefix=config['backup_prefix'],
        max_age_hours=config['rpo_hours']
    ))
    
    # Replication checks
    results.append(check_replication_lag(
        primary_url=config['primary_db_url'],
        max_lag_seconds=config['rpo_hours'] * 3600
    ))
    
    # Service health checks
    for url in config['health_endpoints']:
        results.append(check_health_endpoint(url))
    
    return results

if __name__ == '__main__':
    import json
    import sys
    
    # Use a context manager so the config file is closed promptly
    with open('bct_config.json') as f:
        config = json.load(f)
    results = run_all_checks(config)
    
    critical_failures = [r for r in results if not r.passed and r.severity == 'critical']
    
    print(f"BCT Check Results: {sum(r.passed for r in results)}/{len(results)} passed")
    for r in results:
        status = "✅" if r.passed else "❌"
        print(f"  {status} {r.name}: {r.details}")
    
    if critical_failures:
        print(f"\n{len(critical_failures)} critical failures!")
        sys.exit(1)
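The script reads its settings from `bct_config.json`. The key names below are the ones `run_all_checks` looks up; every value is a placeholder chosen for illustration, so substitute your own bucket, connection string, and endpoints.

```python
import json


def example_config():
    """Return a sample configuration; only the key names come from bct_checks.py."""
    return {
        "backup_bucket": "example-backups",          # placeholder bucket name
        "backup_prefix": "postgres/daily/",          # placeholder key prefix
        "rpo_hours": 24,                             # placeholder RPO
        "primary_db_url": "postgresql://bct_monitor@db.example.internal:5432/postgres",
        "health_endpoints": ["https://api.example.com/healthz"],
    }


if __name__ == "__main__":
    # Print the JSON so it can be redirected into bct_config.json
    print(json.dumps(example_config(), indent=2))
```

Because the script exits with status 1 on any critical failure, a CI job that runs it will fail the pipeline without extra wiring.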

Testing Frequency Summary

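Test                 | Daily | Weekly | Monthly | Quarterly | Annual
Backup creation      |   ✓   |        |         |           |
Backup integrity     |       |   ✓    |         |           |
Pod restart          |       |   ✓    |         |           |
Full backup restore  |       |        |    ✓    |           |
Replica promotion    |       |        |    ✓    |           |
Node failure         |       |        |    ✓    |           |
AZ failure           |       |        |         |     ✓     |
Full DR simulation   |       |        |         |           |
Break-glass access   |       |        |         |           |   ✓
Tabletop exercise    |       |        |         |           |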

Getting Started

If you have not run any BCT tests:

  1. Start with backup restoration — it's the highest-impact, lowest-risk test
  2. Add automated backup age monitoring within the first week
  3. Run a pod restart test in staging before production
  4. Schedule a tabletop exercise with your team within the next 30 days
  5. Build from there, adding one new test category per quarter

The goal is not to have a perfect checklist — it's to find out what's broken before a real disaster does. Every gap this checklist surfaces is an incident you prevented.
