Disaster Recovery Testing: Runbooks, Game Days, and Tabletop Exercises

Disaster recovery plans that exist only in documentation are not plans — they're aspirations. The difference between a DR plan that works and one that fails during an actual disaster is whether you've tested it.

This guide covers the three tiers of DR testing: tabletop exercises, game days, and live DR drills.

The Three Tiers of DR Testing

Tabletop exercises — Paper-based walkthroughs. The team talks through a scenario without touching production. Cheapest to run, uncovers process and communication gaps.

Game days — Controlled failure injection in staging or production. Teams practice the response in real systems. Surfaces operational gaps (broken runbooks, wrong permissions, unclear ownership).

Live DR drills — Full failover and recovery in production. The most expensive and disruptive, but the only way to truly verify RTO and RPO.

RTO and RPO Definitions

Recovery Time Objective (RTO) — Maximum acceptable downtime. "We must be back online within 4 hours."

Recovery Point Objective (RPO) — Maximum acceptable data loss. "We can lose at most 1 hour of transactions."

These targets must be tested, not assumed:

```python
from datetime import datetime

def measure_recovery_time(start_failure_time: datetime,
                          service_restored_time: datetime) -> float:
    """Actual RTO achieved during a DR drill, in minutes."""
    return (service_restored_time - start_failure_time).total_seconds() / 60

def measure_data_loss(last_good_backup_time: datetime,
                      failure_time: datetime) -> float:
    """Actual RPO — data loss window in minutes."""
    return (failure_time - last_good_backup_time).total_seconds() / 60

# Verify against targets
actual_rto_minutes = measure_recovery_time(failure_start, service_restored)
assert actual_rto_minutes <= TARGET_RTO_MINUTES, \
    f"RTO {actual_rto_minutes:.1f}min exceeds target {TARGET_RTO_MINUTES}min"
```
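
Plugging illustrative timestamps into those formulas makes the arithmetic concrete (all times here are hypothetical):

```python
from datetime import datetime

failure_start = datetime(2026, 1, 15, 2, 0)      # failure began at 02:00
service_restored = datetime(2026, 1, 15, 2, 47)  # service back at 02:47
last_good_backup = datetime(2026, 1, 15, 1, 30)  # last restorable point: 01:30

# Achieved RTO: failure start -> service restored = 47 minutes
actual_rto = (service_restored - failure_start).total_seconds() / 60
# Achieved RPO: last good backup -> failure start = 30 minutes of data at risk
actual_rpo = (failure_start - last_good_backup).total_seconds() / 60
```

A 4-hour RTO target passes this drill; a 15-minute target would not, and only running the drill tells you which.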

Writing Testable Runbooks

A runbook that can't be executed without guessing is a liability. Write runbooks that are:

  • Executable — every step is a concrete command or action
  • Verifiable — each step has a success criterion
  • Atomic — each step can be completed or safely paused, so execution can resume from checkpoints
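
One way to make "executable" and "verifiable" concrete is to model each step as a command paired with its success criterion. A minimal sketch (the class and field names are illustrative, not from any library):

```python
from dataclasses import dataclass

@dataclass
class RunbookStep:
    name: str
    command: str              # the exact command an operator runs
    expected: str             # success criterion, checked against command output
    checkpoint: bool = False  # safe place to pause and resume

    def verify(self, actual_output: str) -> bool:
        """A step passes only if its success criterion appears in the output."""
        return self.expected in actual_output

step = RunbookStep(
    name="Confirm primary is down",
    command='kubectl exec postgres-primary-0 -- psql -U postgres -c "SELECT 1"',
    expected="connection refused",
    checkpoint=True,
)
```

A step with no `expected` value is a smell: it means the author could not say what success looks like.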

Runbook Template

# Runbook: Database Primary Failover

**Last tested:** 2026-01-15 by @alice
**RTO target:** 15 minutes
**RPO target:** 60 seconds

## Prerequisites
- [ ] PagerDuty incident created: `https://pd.example.com/...`
- [ ] War room Slack channel opened: `#incident-YYYY-MM-DD`
- [ ] Primary on-call confirmed: `/pd escalate`

## Step 1: Confirm Failure (2 min)

Verify the primary is actually failing (not a monitoring glitch):

```bash
kubectl exec -it postgres-primary-0 -n database -- \
  psql -U postgres -c "SELECT 1"
```

Expected: `ERROR: connection refused` or a timeout.
Not expected: `1` (the primary is fine; stop and investigate monitoring instead).

Checkpoint: Primary confirmed unreachable ✓

## Step 2: Promote Replica (3 min)

```bash
# Identify which replica has the most recent WAL position
kubectl exec -it postgres-replica-0 -n database -- \
  psql -U postgres -c "SELECT pg_last_wal_receive_lsn()"

kubectl exec -it postgres-replica-1 -n database -- \
  psql -U postgres -c "SELECT pg_last_wal_receive_lsn()"

# Promote the replica with the higher LSN (e.g., replica-0)
kubectl exec -it postgres-replica-0 -n database -- \
  pg_ctl promote -D /var/lib/postgresql/data
```

Success check:

```bash
kubectl exec -it postgres-replica-0 -n database -- \
  psql -U postgres -c "SELECT pg_is_in_recovery()"
```

Expected: `f` (false; the replica is now primary)

## Step 3: Update DNS / Service (2 min)

```bash
kubectl patch service postgres-primary -n database \
  -p '{"spec":{"selector":{"statefulset.kubernetes.io/pod-name":"postgres-replica-0"}}}'
```

Verify: `kubectl get endpoints postgres-primary -n database`
Expected: the endpoint shows postgres-replica-0's IP

## Step 4: Verify Application Recovery (5 min)

Check that the application is accepting database connections:

```bash
curl -s https://api.example.com/health | jq '.database'
```

Expected: `"ok"`

Run smoke tests:

```bash
./scripts/smoke-test.sh https://api.example.com
```

## Step 5: Notify Stakeholders

Post to #status-page: "Database failover complete. Service restored. Investigating root cause."
Update the PagerDuty incident status to "Resolved."

## Post-Recovery

  • Schedule post-mortem within 48 hours
  • Document actual RTO: ___ minutes
  • Document any data loss: ___ minutes (RPO)
  • Create tickets for runbook improvements found during failover

Testing Runbooks Automatically

```python
import subprocess
import re

def validate_runbook(runbook_path: str) -> list:
    """
    Static analysis of a runbook to catch common issues.
    Returns list of warnings.
    """
    with open(runbook_path) as f:
        content = f.read()
    
    warnings = []
    
    # Check for "Last tested" date
    if "Last tested" not in content:
        warnings.append("MISSING: 'Last tested' date — runbook may be stale")
    else:
        # Tolerate markdown bold around the label: "**Last tested:** 2026-01-15"
        match = re.search(r'Last tested[:*]*\s*(\d{4}-\d{2}-\d{2})', content)
        if match:
            from datetime import datetime
            test_date = datetime.strptime(match.group(1), "%Y-%m-%d")
            days_since = (datetime.now() - test_date).days
            if days_since > 90:
                warnings.append(f"STALE: Runbook not tested in {days_since} days (threshold: 90)")
    
    # Check that each bash code block is followed by a success criterion:
    # the "Expected" output belongs after the closing fence, not inside the block
    for i, segment in enumerate(content.split("```bash")[1:], start=1):
        parts = segment.split("```", 1)
        after_block = parts[1][:500] if len(parts) > 1 else ""
        if "Expected" not in after_block:
            warnings.append(f"MISSING: No 'Expected' output defined after code block {i}")
    
    # Check for RTO target
    if "RTO target" not in content:
        warnings.append("MISSING: RTO target not defined")
    
    # Check for stakeholder notification step
    if "Notify" not in content and "stakeholder" not in content.lower():
        warnings.append("MISSING: No stakeholder notification step")
    
    return warnings

def test_runbooks_are_valid():
    import glob
    runbook_files = glob.glob("runbooks/*.md")
    
    all_warnings = {}
    for runbook in runbook_files:
        warnings = validate_runbook(runbook)
        if warnings:
            all_warnings[runbook] = warnings
    
    assert len(all_warnings) == 0, \
        "Runbook issues found:\n" + "\n".join(
            f"  {rb}:\n" + "\n".join(f"    - {w}" for w in warns)
            for rb, warns in all_warnings.items()
        )
```

Running a Tabletop Exercise

A tabletop exercise is a structured conversation. The facilitator presents a scenario, and participants talk through their responses.

Scenario Template

SCENARIO: Database corruption after schema migration

TIME: Thursday, 2:15 PM
CONTEXT: A new schema migration ran successfully in staging but is producing 
         data corruption on the production database.

SYMPTOMS:
- Sentry is showing DatabaseError exceptions at 30x normal rate
- A few customers are reporting they can't access their data
- The migration ran 45 minutes ago
- The migration cannot be rolled back (adds a NOT NULL column)

QUESTIONS TO DISCUSS:
1. Who are the first people to be notified? (Who owns the incident?)
2. What is the first thing you check?
3. How do you assess the scope of corruption?
4. What options do you have? (Point-in-time restore, data repair script, manual fixes?)
5. Who makes the decision to restore from backup?
6. How do you communicate with affected customers?
7. What is your RTO from this point? What is the RPO (data loss)?

Tabletop Exercise Facilitation Checklist

## Before the Tabletop

- [ ] Select scenario (rotate types: database, network, third-party, human error)
- [ ] Confirm all relevant team members can attend (on-call, engineering leads, support)
- [ ] Send scenario description 24h in advance OR keep it secret for realism

## During the Tabletop

- [ ] State the ground rules: no blame, focus on process
- [ ] Read the scenario aloud
- [ ] Ask probing questions when answers are vague ("Who specifically does that?")
- [ ] Note where participants disagree or are uncertain
- [ ] Track open questions that need follow-up
- [ ] Time the exercise (aim for 60-90 minutes)

## After the Tabletop

- [ ] Document findings within 48 hours
- [ ] Create tickets for each gap found
- [ ] Schedule follow-up to verify gaps are closed
- [ ] Update runbooks based on decisions made during the tabletop
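
Findings only close gaps if they become tracked work. A minimal sketch of a findings log that enforces the "create tickets for each gap" rule (the field names and ticket ID are illustrative):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Finding:
    description: str
    owner: str
    kind: str                    # "gap", "open question", "disagreement"
    ticket: Optional[str] = None # filled in once a tracking ticket exists

findings = [
    Finding("No one knew who can authorize a production restore", "alice", "gap"),
    Finding("On-call rotation doc out of date", "bob", "gap", ticket="OPS-1234"),
]

# The exercise isn't closed until every finding has a ticket
untracked = [f for f in findings if f.ticket is None]
```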

Game Day Checklist

A game day exercises the actual systems, not just the process:

## Pre-Game Day (1 week before)

- [ ] Define success criteria and metrics to track
- [ ] Set communication plan (who notifies whom, what channels)
- [ ] Notify affected teams and stakeholders
- [ ] Create rollback plan (how to stop the game day if something goes wrong)
- [ ] Verify staging environment is production-equivalent

## Day-Of Setup (30 min before)

- [ ] Open war room channel
- [ ] Confirm all participants are available
- [ ] Verify monitoring dashboards are accessible
- [ ] Start recording metrics baseline (throughput, error rate, latency)
- [ ] Confirm rollback command is ready to execute

## During the Game Day

- [ ] Document each action with timestamp
- [ ] Track actual vs. expected RTO at each step
- [ ] Record all surprises (things that didn't work as expected)
- [ ] If system enters uncontrolled state → execute rollback → pause game day

## Post-Game Day (within 24 hours)

- [ ] Document actual RTO and RPO achieved
- [ ] List all runbook steps that failed or needed updates
- [ ] Create action items with owners and deadlines
- [ ] Schedule next game day (quarterly is common)
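
The "document each action with timestamp" item is easy to support with a small action log that can compute the achieved RTO afterwards. A sketch (class and method names are illustrative):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class GameDayLog:
    """Timestamped action log; doubles as the source for achieved-RTO math."""
    events: list = field(default_factory=list)

    def record(self, action: str, at: Optional[datetime] = None) -> None:
        self.events.append((at or datetime.now(timezone.utc), action))

    def minutes_between(self, first: str, second: str) -> float:
        """Elapsed minutes between the first occurrences of two actions."""
        firsts = {a: t for t, a in reversed(self.events)}  # earliest entry wins
        return (firsts[second] - firsts[first]).total_seconds() / 60

log = GameDayLog()
log.record("failure injected", datetime(2026, 1, 15, 14, 0, tzinfo=timezone.utc))
log.record("replica promoted", datetime(2026, 1, 15, 14, 9, tzinfo=timezone.utc))
log.record("service restored", datetime(2026, 1, 15, 14, 12, tzinfo=timezone.utc))
```

The same log, kept in the war room channel or a shared doc, becomes the raw material for the post-game-day writeup.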

Measuring DR Maturity

```python
def calculate_dr_maturity_score(runbooks: list, game_days: list,
                                tabletops: list) -> dict:
    """Score DR maturity across key dimensions."""
    return {
        "runbook_coverage": {
            "score": len([r for r in runbooks if r.last_tested_within_90_days]) / len(runbooks),
            "description": "Fraction of runbooks tested within 90 days"
        },
        "rto_achievement": {
            "score": len([gd for gd in game_days if gd.actual_rto <= gd.target_rto]) / len(game_days),
            "description": "Fraction of game days that met RTO target"
        },
        "game_day_frequency": {
            "score": min(1.0, len(game_days) / 4),  # 4 per year = 1.0
            "description": "Game days per year (target: 4)"
        },
        "tabletop_frequency": {
            "score": min(1.0, len([t for t in tabletops if t.age_days < 90]) / 4),
            "description": "Tabletop exercises in the last quarter (target: 4)"
        }
    }
```
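
The per-dimension scores can be rolled up into a single number for trend tracking. A sketch using equal weights (the weighting is an assumption, not from the source; weight the dimensions by business impact if yours differ):

```python
def overall_dr_score(dimensions: dict) -> float:
    """Unweighted mean of the per-dimension scores, in [0.0, 1.0]."""
    scores = [d["score"] for d in dimensions.values()]
    return sum(scores) / len(scores)

# Example input shaped like calculate_dr_maturity_score's output
example = {
    "runbook_coverage": {"score": 0.8},
    "rto_achievement": {"score": 0.5},
    "game_day_frequency": {"score": 1.0},
    "tabletop_frequency": {"score": 0.7},
}
```

Tracking this quarterly makes regressions visible before a real incident does.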

DR testing is an investment in confidence. When a real disaster happens at 3 AM, your team's behavior under pressure is determined by muscle memory built through practice. Every tabletop, every game day, every updated runbook is a deposit into that account.
