Game Day Exercises: How to Run Resilience Testing Like Netflix and AWS

HelpMeTest

22 May 2026

A game day exercise is a structured event in which an engineering team deliberately causes failures in its own systems, then observes what happens, learns from it, and improves. The practice is closely associated with Netflix, which built Chaos Monkey, with AWS, whose Well-Architected guidance recommends running game days regularly, and with Google, which treats disaster testing as part of Site Reliability Engineering.

The goal is not to show that systems fail. They will. The goal is to find out how they fail — before your customers do.

What Makes a Game Day Different from Chaos Engineering

Chaos engineering (running automated, continuous failure injection) is a long-term practice. A game day is a focused, planned event:

| Aspect | Chaos Engineering | Game Day |
|--------|-------------------|----------|
| Frequency | Continuous | Quarterly or event-driven |
| Team involvement | Automated, low overhead | Full team engagement |
| Scope | Narrow, specific failure types | Broad, realistic scenarios |
| Learning | Metrics and alerts | Observation, hypothesis, discussion |
| Goal | Find regressions | Deepen team knowledge and readiness |

Both are valuable. Game days are where your team builds intuition about system behavior under stress — something automated tests can't provide.

How Netflix Pioneered Game Days

Netflix's Chaos Monkey (2011) started as a game day concept before becoming automated. The original insight: if you're going to experience random failure in production, you might as well practice it deliberately, on your schedule, with everyone paying attention.

Netflix's principles for resilience testing:

Build the habit before the crisis. Teams that run regular game days respond better to real incidents because they have muscle memory.
Failure in production is the only real test. Staging environments don't have the complexity, load, or dependencies of production.
Blameless post-mortems amplify learning. The goal of a game day is systems knowledge, not finding who wrote the bad code.
Automate what you've learned. Each game day finding that's worth preventing becomes a new automated chaos test.

Planning a Game Day

Step 1: Choose the Scenario

The best game day scenarios are:

Realistic: Based on failures that have happened or plausibly could happen
Scoped: Focused enough to complete in a day, broad enough to be interesting
Unknown outcome: The team shouldn't know exactly how the system will respond
Fixable: Within the scope of what the team can address

Good scenarios:

Primary database fails at peak traffic
Third-party payment provider becomes unavailable
Message queue fills to capacity
A critical microservice is deployed with a memory leak
Network partition between two data centers
Primary on-call is unavailable during a P1

Scenario sources:

Past incidents ("what if this had been worse?")
Near-misses ("we almost had this happen last quarter")
Single points of failure identified in architecture review
External dependency risks
New infrastructure components that haven't been tested under stress

Step 2: Define Hypotheses

Before the game day, write down your hypotheses about system behavior. This forces the team to commit to predictions and makes the learning concrete.

Scenario: Primary PostgreSQL database fails at 2pm (peak traffic)

Hypotheses:
H1: Patroni will detect the failure within 30 seconds
H2: The replica will be promoted within 60 seconds
H3: Application error rate will spike to 100% during the 60-second window
H4: Connection pool will re-establish connections within 10 seconds of promotion
H5: No data loss (zero transactions lost)
H6: Health dashboard will show the failure within 90 seconds
H7: PagerDuty alert will fire to on-call within 2 minutes

After the game day, you'll have evidence for each hypothesis. Most teams are surprised by which ones were wrong.
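One way to keep the predictions checkable is to record each hypothesis as a threshold and compare it with the measured value afterwards. The sketch below is only an illustration under assumptions: the `Hypothesis` class, its field names, and the measured numbers are hypothetical and not part of any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A prediction of the form 'this will take at most `limit` `unit`'."""
    id: str
    description: str
    limit: float
    unit: str

    def evaluate(self, actual: float) -> str:
        # A hypothesis is "met" when the measured value stays within the limit.
        verdict = "met" if actual <= self.limit else "missed"
        return (f"{self.id}: predicted <= {self.limit:g}{self.unit}, "
                f"actual {actual:g}{self.unit} ({verdict})")


# Hypothetical encoding of three of the hypotheses listed above.
hypotheses = [
    Hypothesis("H1", "Patroni detects the failure", 30, "s"),
    Hypothesis("H2", "Replica is promoted", 60, "s"),
    Hypothesis("H7", "PagerDuty alert reaches on-call", 120, "s"),
]

# Illustrative measurements in seconds; real values come from the game day itself.
measured = {"H1": 45, "H2": 165, "H7": 45}

results = [h.evaluate(measured[h.id]) for h in hypotheses]
for line in results:
    print(line)
```

Writing the limit down before the exercise is the point of the design: the verdict is computed from numbers agreed in advance, so it cannot drift toward whatever happened on the day.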

Step 3: Define Blast Radius

Before executing, agree on:

What's in scope: Which systems, environments, and failure types
What's out of scope: Systems you won't touch, customer segments you'll protect
Stop conditions: Criteria that trigger an abort if real damage is occurring
Rollback plan: How you restore systems if something goes unexpectedly wrong

```yaml
# game_day_plan.yaml
scenario: "Database Primary Failover"
date: 2026-06-15
time: "14:00-17:00 PDT"

scope:
  environment: production
  systems:
    - postgres-primary
    - api-service
    - worker-service
  excluded:
    - payment-processor  # third-party, no test credentials
    - data-warehouse     # separate team owns this

stop_conditions:
  - payment processing fails for > 5 minutes
  - data loss confirmed
  - customer-facing error rate > 50% for > 10 minutes
  - on-call engineer is not available for rollback

rollback:
  estimated_minutes: 15
  steps:
    - restart postgres-primary service
    - manually re-add primary to Patroni cluster as replica

participants:
  game_master: Sarah (SRE lead)
  responders: [Mike (backend), Ana (infra), Tom (on-call)]
  observers: [CTO, Engineering Manager]
  note_taker: Ana
```
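A stop condition such as "customer-facing error rate > 50% for > 10 minutes" has two parts: a threshold and a duration. The following sketch shows one way a game master could check such a condition against sampled metrics. The function name `sustained_breach`, the `(seconds, value)` sample format, and the example data are all assumptions made for illustration.

```python
from typing import Iterable, Tuple


def sustained_breach(samples: Iterable[Tuple[float, float]],
                     threshold: float,
                     min_duration_s: float) -> bool:
    """Return True if the value stays above `threshold` for longer than
    `min_duration_s` seconds without dropping back in between.

    `samples` are (seconds_since_injection, value) pairs in time order.
    """
    breach_start = None
    for t, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = t  # first sample of a new breach
            if t - breach_start > min_duration_s:
                return True
        else:
            breach_start = None  # recovered, so the clock resets
    return False


# Error rate sampled once a minute: 80% for 4 minutes, then recovery.
short_spike = [(m * 60, 0.8 if m < 4 else 0.02) for m in range(15)]
# Error rate stuck at 60% for the whole 15 minutes.
stuck = [(m * 60, 0.6) for m in range(15)]

print(sustained_breach(short_spike, threshold=0.5, min_duration_s=600))  # False
print(sustained_breach(stuck, threshold=0.5, min_duration_s=600))        # True
```

Resetting the clock on recovery means a brief failover spike does not trigger an abort, while a breach that never clears does.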

Step 4: Prepare the Environment

Before the game day:

Confirm all participants know their roles
Verify you have access to all necessary systems and runbooks
Test your failure injection mechanism in staging
Confirm monitoring and dashboards are working
Establish communication channels (incident Slack channel)
Pre-write status page update templates
Check that the rollback procedure works

Running the Game Day

The Game Day Structure

T-30: Pre-game briefing (30 min)

Introduce scenario and hypotheses to all participants
Confirm roles, communication channels, and stop conditions
Answer questions — everyone should understand the plan

T-0: Inject failure

Game master executes the failure injection
All participants move to observation mode
Note-taker begins capturing timeline

T+5 to T+RTO: Observe and respond

Team observes system behavior and responds as in a real incident
Game master does NOT guide the response — that's the point
Observers watch and take notes
Note-taker captures every action with timestamps

T+RTO: Restore + debrief

Restore systems to normal state
Immediately run a structured debrief (while observations are fresh)

The Game Master Role

The game master designs and executes the scenario — but does not participate in the response. Their job during the exercise is to:

Execute the failure injection at the planned time
Monitor stop conditions and call an abort if necessary
Track the timeline precisely
Resist the urge to help the responding team

```bash
# Game master execution checklist
# T-0: Trigger failure

echo "=== GAME DAY START: $(date -u +%Y-%m-%dT%H:%M:%SZ) ===" | tee -a game_day.log

# Inject failure: stop PostgreSQL primary
echo "Stopping postgres-primary..." | tee -a game_day.log
ssh postgres-primary "sudo systemctl stop postgresql"

# Monitor and record every 5 seconds (stop with Ctrl-C once systems are restored)
while true; do
  TIMESTAMP=$(date -u +%H:%M:%S)
  PATRONI_STATUS=$(patronictl -c /etc/patroni/patroni.yml list 2>/dev/null | tail -3)
  # curl already prints 000 when the request fails, so no fallback echo is needed
  APP_STATUS=$(curl -s -o /dev/null --max-time 3 -w "%{http_code}" http://api.internal/health)
  echo "$TIMESTAMP | App: $APP_STATUS | $PATRONI_STATUS" >> game_day.log
  sleep 5
done
```

The Note-Taker Role

The note-taker captures a timeline of everything that happens:

14:00:00 — Failure injected (postgres-primary stopped)
14:00:45 — PagerDuty alert fires to Mike
14:01:12 — Mike acknowledges alert, joins #incident channel
14:01:30 — Mike posts: "seeing 500s from api-service"
14:02:00 — Patroni detects failure, begins leader election
14:02:45 — postgres-replica promoted to leader
14:03:10 — App starts reconnecting (connection pool recovery)
14:03:30 — API health checks returning 200
14:04:00 — Mike declares incident resolved
14:04:15 — Status page updated

Total downtime: 3m30s (target RTO: 2m)

This timeline becomes the primary artifact for the post-mortem.
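The headline numbers for the post-mortem can be derived from the timeline rather than estimated from memory. The sketch below is a minimal example under assumptions: the event labels and the `seconds_between` helper are hypothetical, the timestamps mirror four entries from the example timeline, and all events are assumed to fall on the same day.

```python
from datetime import datetime

# (HH:MM:SS, label) pairs taken from the example timeline above.
events = [
    ("14:00:00", "failure_injected"),
    ("14:00:45", "alert_fired"),
    ("14:02:45", "replica_promoted"),
    ("14:03:30", "health_checks_ok"),
]


def seconds_between(start_label: str, end_label: str, events) -> int:
    """Elapsed seconds between two labelled events on the same day."""
    times = {label: datetime.strptime(ts, "%H:%M:%S") for ts, label in events}
    return int((times[end_label] - times[start_label]).total_seconds())


target_rto_s = 120  # the 2-minute target RTO from the example

time_to_alert = seconds_between("failure_injected", "alert_fired", events)
time_to_promote = seconds_between("failure_injected", "replica_promoted", events)
downtime = seconds_between("failure_injected", "health_checks_ok", events)

print(f"time to alert:   {time_to_alert}s")
print(f"time to promote: {time_to_promote}s")
print(f"downtime:        {downtime}s ({downtime - target_rto_s}s over target)")
```

For the example timeline this gives 210 seconds of downtime, which is the 3m30s recorded by the note-taker and 90 seconds over the target.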

The Debrief

Run the debrief immediately after the game day — same day, within an hour of restoration. Memory fades fast.

Debrief Structure (60-90 minutes)

1. Timeline review (15 min): Walk through the note-taker's timeline. Ensure everyone agrees on what happened and when.

2. Hypothesis review (20 min): Go through each pre-game hypothesis:

What did we predict?
What actually happened?
Why was the prediction wrong (if it was)?

H1: Patroni detects failure within 30s
  Predicted: 30s
  Actual: 45s (exceeded)
  Why: Default Patroni TTL is 30s, plus 10-15s for health check cycle
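  Action: Reduce Patroni TTL from 30s to 15s in production (loop_wait and retry_timeout must be lowered as well, since Patroni expects loop_wait + 2 * retry_timeout <= ttl)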

H3: Error rate spikes to 100% during failover window
  Predicted: 100%
  Actual: 73% (better than expected)
  Why: Some requests hit the replica during failover window
  Learning: Read traffic was serving successfully, only writes failed

3. Surprises (10 min): What happened that nobody predicted? Surprises are the most valuable learning.

4. Wins (5 min): What worked well? Reinforce positive behaviors.

5. Action items (20 min): For each finding, generate a specific, owned action item:

| Finding | Action | Owner | Due |
|---------|--------|-------|-----|
| RTO exceeded target by 90s | Reduce Patroni TTL from 30s to 15s | Mike | June 22 |
| Alert took 45s to fire | Decrease health check interval | Ana | June 20 |
| Runbook step 4 was unclear | Rewrite runbook step 4 | Sarah | June 18 |
| Connection pool exhaustion during recovery | Increase pool size and add retry | Mike | June 25 |
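"Specific and owned" can be checked mechanically before the debrief ends. The sketch below flags action items with no owner or due date and orders the rest by deadline. The `ActionItem` class and both helper functions are hypothetical, the year 2026 is assumed from the example plan date, and the last item is an invented finding added only to show what gets flagged.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    finding: str
    action: str
    owner: Optional[str] = None
    due: Optional[date] = None


def incomplete(items: List[ActionItem]) -> List[str]:
    """List the findings whose action item lacks an owner or a due date."""
    return [item.finding for item in items if not item.owner or item.due is None]


def by_due_date(items: List[ActionItem]) -> List[ActionItem]:
    """Order the complete items by due date, earliest first."""
    complete = [item for item in items if item.owner and item.due is not None]
    return sorted(complete, key=lambda item: item.due)


items = [
    ActionItem("RTO exceeded target by 90s", "Reduce Patroni TTL from 30s to 15s",
               "Mike", date(2026, 6, 22)),
    ActionItem("Alert took 45s to fire", "Decrease health check interval",
               "Ana", date(2026, 6, 20)),
    ActionItem("Runbook step 4 was unclear", "Rewrite runbook step 4",
               "Sarah", date(2026, 6, 18)),
    # Hypothetical finding with no owner yet, to show what gets flagged.
    ActionItem("Status page update was late", "Pre-approve status page templates"),
]

print(incomplete(items))
print([item.owner for item in by_due_date(items)])
```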

6. What we'll automate (5 min): Which game day tests should become automated chaos tests running continuously?

Game Day Templates

Scenario Template

````markdown
# Game Day: [Scenario Name]

**Date:** [Date]  
**Time:** [Start] - [End]  
**Environment:** [Production/Staging]

## Scenario Description
[2-3 sentences describing what failure will be simulated]

## Hypotheses
1. [What we predict will happen]
2. [What we predict will happen]
...

## Success Criteria
- RTO: ≤ [X] minutes
- RPO: ≤ [X] minutes  
- Error rate: < [X]% during failover window

## Participants
- Game Master: 
- Responders: 
- Observers: 
- Note-taker: 

## Stop Conditions
- [ ] [Condition 1]
- [ ] [Condition 2]

## Failure Injection
```bash
# Exact commands to inject failure
```

## Rollback Procedure
```bash
# Exact commands to restore systems
```
````

Post-Mortem Template

```markdown
# Game Day Post-Mortem: [Scenario Name]

**Date:** [Date]  
**Duration:** [Total time from injection to restoration]  
**Actual RTO:** [X] minutes (Target: [Y] minutes)

## Timeline
[From note-taker]

## Hypothesis Results
| Hypothesis | Predicted | Actual | Met? |
|------------|-----------|--------|------|
| ... | ... | ... | Yes/No |

## Surprises
1. [Something unexpected that happened]

## Wins
1. [Something that worked well]

## Action Items
| Finding | Action | Owner | Due Date |
|---------|--------|-------|----------|
| ... | ... | ... | ... |

## Automation Candidates
- [ ] [Test to automate]
```

How AWS Runs Game Days

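The name "Wheel of Misfortune" comes from Google SRE, where it describes a role-played incident drill; AWS documentation generally uses the term "game day" for its own exercises. The approach, as it is usually described: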

No warning to responders. The responders don't know the scenario in advance — they practice responding to the unknown, like a real incident.
Real on-call rotations. They run game days during actual on-call shifts, not special scheduled times.
Production by default. AWS tests in production because staging doesn't have the same complexity and dependencies.
Two-pizza team scope. Each game day is scoped to one team's systems — broad enough to be realistic, narrow enough to be learnable.

Game Day Frequency

| Organization Size | Recommended Frequency |
|-------------------|-----------------------|
| Early startup (<10 engineers) | Quarterly tabletop + semi-annual live exercise |
| Growth stage (10-50 engineers) | Quarterly live exercises |
| Scale stage (50+ engineers) | Monthly for SRE team; quarterly for each product team |

Even one game day per year is far better than none. Start small: a 2-hour tabletop exercise covering three scenarios will usually surface gaps in how the team understands its own systems.
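The frequency recommendations can be read as a simple lookup by team size. The sketch below is one possible encoding; the function name `recommended_cadence` is hypothetical, and because the stated ranges overlap at exactly 50 engineers, assigning 50 to the growth stage is an arbitrary choice made here.

```python
def recommended_cadence(engineers: int) -> str:
    """Map engineering headcount to the cadence recommended above.

    Assumption: a team of exactly 50 is treated as growth stage.
    """
    if engineers < 1:
        raise ValueError("engineers must be a positive number")
    if engineers < 10:
        return "Quarterly tabletop + semi-annual live exercise"
    if engineers <= 50:
        return "Quarterly live exercises"
    return "Monthly for SRE team; quarterly for each product team"


for size in (6, 30, 200):
    print(size, "->", recommended_cadence(size))
```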

Summary

A game day is a structured learning event, not a test you pass or fail. The value comes from:

Hypothesis validation — You learn where your mental model of the system is wrong
Team readiness — Responders build muscle memory for incident response
Runbook validation — You find the gaps in documentation before real incidents do
Action items — You leave with specific improvements to implement
Automation seeds — The findings drive your chaos engineering roadmap

Every production incident that surprises your team is a game day you didn't run. Choose the planned version.