Game Day Exercises: How to Run Resilience Testing Like Netflix and AWS
A game day exercise is a structured event where an engineering team deliberately causes failures in their own systems — and then observes, learns, and improves from the experience. It's the practice that turned Netflix into the company that wrote Chaos Monkey, that AWS uses before every major infrastructure change, and that Google builds into the discipline of Site Reliability Engineering.
The goal is not to show that systems fail. They will. The goal is to find out how they fail — before your customers do.
What Makes a Game Day Different from Chaos Engineering
Chaos engineering (running automated, continuous failure injection) is a long-term practice. A game day is a focused, planned event:
| | Chaos Engineering | Game Day |
|---|---|---|
| Frequency | Continuous | Quarterly or event-driven |
| Team involvement | Automated, low overhead | Full team engagement |
| Scope | Narrow, specific failure types | Broad, realistic scenarios |
| Learning | Metrics and alerts | Observation, hypothesis, discussion |
| Goal | Find regressions | Deepen team knowledge and readiness |
Both are valuable. Game days are where your team builds intuition about system behavior under stress — something automated tests can't provide.
How Netflix Pioneered Game Days
Netflix's Chaos Monkey (2011) started as a game day concept before becoming automated. The original insight: if you're going to experience random failure in production, you might as well practice it deliberately, on your schedule, with everyone paying attention.
Netflix's principles for resilience testing:
- Build the habit before the crisis. Teams that run regular game days respond better to real incidents because they have muscle memory.
- Failure in production is the only real test. Staging environments don't have the complexity, load, or dependencies of production.
- Blameless post-mortems amplify learning. The goal of a game day is systems knowledge, not finding who wrote the bad code.
- Automate what you've learned. Each game day finding that's worth preventing becomes a new automated chaos test.
Planning a Game Day
Step 1: Choose the Scenario
The best game day scenarios are:
- Realistic: Based on failures that have happened or plausibly could happen
- Scoped: Focused enough to complete in a day, broad enough to be interesting
- Unknown outcome: The team shouldn't know exactly how the system will respond
- Fixable: Within the scope of what the team can address
Good scenarios:
- Primary database fails at peak traffic
- Third-party payment provider becomes unavailable
- Message queue fills to capacity
- A critical microservice is deployed with a memory leak
- Network partition between two data centers
- Primary on-call is unavailable during a P1
Scenario sources:
- Past incidents ("what if this had been worse?")
- Near-misses ("we almost had this happen last quarter")
- Single points of failure identified in architecture review
- External dependency risks
- New infrastructure components that haven't been tested under stress
Step 2: Define Hypotheses
Before the game day, write down your hypotheses about system behavior. This forces the team to commit to predictions and makes the learning concrete.
Scenario: Primary PostgreSQL database fails at 2pm (peak traffic)
Hypotheses:
H1: Patroni will detect the failure within 30 seconds
H2: The replica will be promoted within 60 seconds
H3: Application error rate will spike to 100% during the 60-second window
H4: Connection pool will re-establish connections within 10 seconds of promotion
H5: No data loss (zero transactions lost)
H6: Health dashboard will show the failure within 90 seconds
H7: PagerDuty alert will fire to on-call within 2 minutes
After the game day, you'll have evidence for each hypothesis. Most teams are surprised by which ones were wrong.
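One way to keep predictions honest is to record them as data before the exercise and compare them mechanically afterwards. The sketch below is only an illustration: the `Hypothesis` class and `review` function are hypothetical names, the measurements are made up, and only hypotheses that state a numeric upper bound (H1, H2, H5, H7 from the example scenario) are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """One pre-game prediction expressed as a numeric upper bound."""
    label: str          # e.g. "H1"
    description: str
    limit: float        # predicted upper bound
    unit: str           # "s" for seconds, "txns" for transactions


def review(hypotheses, observed):
    """Compare each prediction with what was measured.

    `observed` maps a hypothesis label to the measured value. Labels with
    no measurement are reported as "not measured" rather than guessed.
    """
    rows = []
    for h in hypotheses:
        actual = observed.get(h.label)
        if actual is None:
            verdict = "not measured"
        elif actual <= h.limit:
            verdict = "held"
        else:
            verdict = "wrong"
        rows.append((h.label, h.limit, actual, verdict))
    return rows


# Predictions taken from the example scenario.
HYPOTHESES = [
    Hypothesis("H1", "Patroni detects the failure", 30, "s"),
    Hypothesis("H2", "Replica is promoted", 60, "s"),
    Hypothesis("H5", "Transactions lost", 0, "txns"),
    Hypothesis("H7", "PagerDuty alert reaches on-call", 120, "s"),
]

# Hypothetical measurements, for illustration only.
observed = {"H1": 45, "H2": 58, "H7": 45}
for label, limit, actual, verdict in review(HYPOTHESES, observed):
    print(f"{label}: predicted <= {limit}, actual {actual} -> {verdict}")
```

Reporting an unmeasured hypothesis explicitly is a deliberate choice here: a prediction nobody could verify is itself a finding about observability.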
Step 3: Define Blast Radius
Before executing, agree on:
- What's in scope: Which systems, environments, and failure types
- What's out of scope: Systems you won't touch, customer segments you'll protect
- Stop conditions: Criteria that trigger an abort if real damage is occurring
- Rollback plan: How you restore systems if something goes unexpectedly wrong
# game_day_plan.yaml
scenario: "Database Primary Failover"
date: 2026-06-15
time: 14:00-17:00 PDT
scope:
  environment: production
  systems:
    - postgres-primary
    - api-service
    - worker-service
  excluded:
    - payment-processor (third-party, no test credentials)
    - data-warehouse (separate team owns this)
stop_conditions:
  - payment processing fails for > 5 minutes
  - data loss confirmed
  - customer-facing error rate > 50% for > 10 minutes
  - on-call engineer is not available for rollback
rollback:
  - restart postgres-primary service
  - manually re-add primary to Patroni cluster as replica
  - estimated rollback time: 15 minutes
participants:
  - game_master: Sarah (SRE lead)
  - responders: [Mike (backend), Ana (infra), Tom (on-call)]
  - observers: [CTO, Engineering Manager]
  - note_taker: Ana
Step 4: Prepare the Environment
Before the game day:
- Confirm all participants know their roles
- Verify you have access to all necessary systems and runbooks
- Test your failure injection mechanism in staging
- Confirm monitoring and dashboards are working
- Establish communication channels (incident Slack channel)
- Pre-write status page update templates
- Check that rollback procedure works
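The checklist above can double as a go/no-go gate before T-0. The following is a minimal sketch under stated assumptions: `run_preflight` is a hypothetical helper, and each lambda is a placeholder that a team would replace with a real probe (a dashboard request, a staging injection run, a rollback rehearsal record).

```python
from typing import Callable, Dict, List, Tuple


def run_preflight(checks: Dict[str, Callable[[], bool]]) -> Tuple[bool, List[str]]:
    """Run every check and return (ready, names_of_failed_checks).

    A check that raises is treated as failed, so one broken probe cannot
    hide the results of the others.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return (not failed, failed)


# Placeholder checks: each lambda stands in for a real probe.
checks = {
    "roles confirmed": lambda: True,
    "dashboards loading": lambda: True,
    "injection tested in staging": lambda: True,
    "rollback rehearsed": lambda: False,  # pretend this one was skipped
}

ready, failed = run_preflight(checks)
print("GO" if ready else f"NO-GO, unresolved: {', '.join(failed)}")
```

All checks run even after the first failure, so the briefing starts with the complete list of open items.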
Running the Game Day
The Game Day Structure
T-30: Pre-game briefing (30 min)
- Introduce scenario and hypotheses to all participants
- Confirm roles, communication channels, and stop conditions
- Answer questions — everyone should understand the plan
T-0: Inject failure
- Game master executes the failure injection
- All participants move to observation mode
- Note-taker begins capturing timeline
T+5 to T+RTO: Observe and respond
- Team observes system behavior and responds as in a real incident
- Game master does NOT guide the response — that's the point
- Observers watch and take notes
- Note-taker captures every action with timestamps
T+RTO: Restore + debrief
- Restore systems to normal state
- Immediately run a structured debrief (while observations are fresh)
The Game Master Role
The game master designs and executes the scenario — but does not participate in the response. Their job during the exercise is to:
- Execute the failure injection at the planned time
- Monitor stop conditions and call an abort if necessary
- Track the timeline precisely
- Resist the urge to help the responding team
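Stop conditions such as "customer-facing error rate > 50% for > 10 minutes" combine a threshold with a duration, which is easy to misjudge under pressure. The sketch below shows one possible way to evaluate such a condition from periodic samples. The function name `breached_for` and the sample data are assumptions for illustration, not part of any monitoring tool.

```python
from typing import Iterable, Tuple


def breached_for(samples: Iterable[Tuple[float, float]],
                 threshold: float,
                 min_duration_s: float) -> bool:
    """Return True if the value stayed above `threshold` for longer than `min_duration_s`.

    `samples` holds (time_in_seconds, value) pairs in chronological order.
    A single sample at or below the threshold resets the clock.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of a new breach
            if ts - breach_start > min_duration_s:
                return True
        else:
            breach_start = None  # recovered, start over
    return False


# Error rate sampled once a minute, hypothetical values.
sustained = [(60 * i, 0.8) for i in range(12)]           # above 50% from t=0 to t=660
recovered = [(0, 0.8), (300, 0.8), (360, 0.2), (420, 0.8), (900, 0.8)]

print(breached_for(sustained, threshold=0.5, min_duration_s=600))  # True: abort
print(breached_for(recovered, threshold=0.5, min_duration_s=600))  # False: keep going
```

Resetting the clock on any healthy sample is a conservative reading of "for > 10 minutes". A team that prefers to abort on cumulative time above the threshold would sum the breach intervals instead.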
# Game master execution checklist
# T-0: Trigger failure
echo "=== GAME DAY START: $(date -u +%Y-%m-%dT%H:%M:%SZ) ===" | tee -a game_day.log
# Inject failure: stop PostgreSQL primary
# (on a Patroni-managed node the service to stop may be patroni instead)
echo "Stopping postgres-primary..." | tee -a game_day.log
ssh postgres-primary "sudo systemctl stop postgresql"
# Monitor and record every 5 seconds until interrupted with Ctrl-C
while true; do
  TIMESTAMP=$(date -u +%H:%M:%S)
  # Last three lines of the cluster table; empty if patronictl cannot reach the cluster
  PATRONI_STATUS=$(patronictl -c /etc/patroni/patroni.yml list 2>/dev/null | tail -3)
  # curl prints 000 by itself when the request fails, so no fallback echo is needed;
  # --max-time keeps a hung endpoint from stalling the loop
  APP_STATUS=$(curl -s -o /dev/null --max-time 3 -w "%{http_code}" http://api.internal/health)
  echo "$TIMESTAMP | App: $APP_STATUS | $PATRONI_STATUS" >> game_day.log
  sleep 5
done
The Note-Taker Role
The note-taker captures a timeline of everything that happens:
14:00:00 — Failure injected (postgres-primary stopped)
14:00:45 — PagerDuty alert fires to Mike
14:01:12 — Mike acknowledges alert, joins #incident channel
14:01:30 — Mike posts: "seeing 500s from api-service"
14:02:00 — Patroni detects failure, begins leader election
14:02:45 — postgres-replica promoted to leader
14:03:10 — App starts reconnecting (connection pool recovery)
14:03:30 — API health checks returning 200
14:04:00 — Mike declares incident resolved
14:04:15 — Status page updated
Total downtime: 3m30s (target RTO: 2m)
This timeline becomes the primary artifact for the post-mortem.
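Because every entry starts with a timestamp, the headline numbers for the post-mortem can be derived from the notes instead of being worked out by hand. The sketch below assumes the `HH:MM:SS` prefix shown in the example timeline and that the exercise does not cross midnight. `parse_timeline` and `seconds_until` are hypothetical helper names.

```python
from datetime import datetime


def parse_timeline(text):
    """Return (datetime, event) pairs from lines that start with HH:MM:SS."""
    events = []
    for line in text.strip().splitlines():
        stamp, _, rest = line.strip().partition(" ")
        try:
            when = datetime.strptime(stamp, "%H:%M:%S")
        except ValueError:
            continue  # skip lines without a leading timestamp, e.g. "Total downtime"
        # Drop the separator (hyphen or dash) between timestamp and description.
        events.append((when, rest.lstrip(" -\u2014").strip()))
    return events


def seconds_until(events, start_marker, end_marker):
    """Seconds between the first events containing each marker (case-insensitive)."""
    def first(marker):
        for when, text in events:
            if marker.lower() in text.lower():
                return when
        raise ValueError(f"no event contains {marker!r}")
    return int((first(end_marker) - first(start_marker)).total_seconds())


TIMELINE = """
14:00:00 - Failure injected (postgres-primary stopped)
14:00:45 - PagerDuty alert fires to Mike
14:02:45 - postgres-replica promoted to leader
14:03:30 - API health checks returning 200
Total downtime: 3m30s (target RTO: 2m)
"""

events = parse_timeline(TIMELINE)
downtime = seconds_until(events, "Failure injected", "health checks returning 200")
time_to_alert = seconds_until(events, "Failure injected", "PagerDuty alert")
print(f"time to alert: {time_to_alert}s")                  # 45s
print(f"downtime: {downtime}s, over RTO by {downtime - 120}s")  # 210s, over by 90s
```

Matching events by a text marker keeps the sketch short but relies on the note-taker using consistent wording. Tagging key events explicitly would be more robust.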
The Debrief
Run the debrief immediately after the game day — same day, within an hour of restoration. Memory fades fast.
Debrief Structure (60-90 minutes)
1. Timeline review (15 min) Walk through the note-taker's timeline. Ensure everyone agrees on what happened and when.
2. Hypothesis review (20 min) Go through each pre-game hypothesis:
- What did we predict?
- What actually happened?
- Why was the prediction wrong (if it was)?
H1: Patroni detects failure within 30s
Predicted: 30s
Actual: 45s (exceeded)
Why: Default Patroni TTL is 30s, plus 10-15s for health check cycle
Action: Consider reducing Patroni TTL to 20s in production
H3: Error rate spikes to 100% during failover window
Predicted: 100%
Actual: 73% (better than expected)
Why: Some requests hit the replica during failover window
Learning: Read traffic was serving successfully, only writes failed
3. Surprises (10 min) What happened that nobody predicted? Surprises are the most valuable learning.
4. Wins (5 min) What worked well? Reinforce positive behaviors.
5. Action items (20 min) For each finding, generate a specific, owned action item:
| Finding | Action | Owner | Due |
|---|---|---|---|
| RTO exceeded target by 90s | Reduce Patroni TTL from 30s to 15s | Mike | June 22 |
| Alert took 45s to fire | Decrease health check interval | Ana | June 20 |
| Runbook step 4 was unclear | Rewrite runbook step 4 | Sarah | June 18 |
| Connection pool exhaustion during recovery | Increase pool size and add retry | Mike | June 25 |
6. What we'll automate (5 min) Which game day tests should become automated chaos tests running continuously?
Game Day Templates
Scenario Template
# Game Day: [Scenario Name]
**Date:** [Date]
**Time:** [Start] - [End]
**Environment:** [Production/Staging]
## Scenario Description
[2-3 sentences describing what failure will be simulated]
## Hypotheses
1. [What we predict will happen]
2. [What we predict will happen]
...
## Success Criteria
- RTO: ≤ [X] minutes
- RPO: ≤ [X] minutes
- Error rate: < [X]% during failover window
## Participants
- Game Master:
- Responders:
- Observers:
- Note-taker:
## Stop Conditions
- [ ] [Condition 1]
- [ ] [Condition 2]
## Failure Injection
```bash
# Exact commands to inject failure
```
## Rollback Procedure
```bash
# Exact commands to restore systems
```
Post-Mortem Template
```markdown
# Game Day Post-Mortem: [Scenario Name]
**Date:** [Date]
**Duration:** [Total time from injection to restoration]
**Actual RTO:** [X] minutes (Target: [Y] minutes)
## Timeline
[From note-taker]
## Hypothesis Results
| Hypothesis | Predicted | Actual | Met? |
|------------|-----------|--------|------|
| ... | ... | ... | Yes/No |
## Surprises
1. [Something unexpected that happened]
## Wins
1. [Something that worked well]
## Action Items
| Finding | Action | Owner | Due Date |
|---------|--------|-------|----------|
| ... | ... | ... | ... |
## Automation Candidates
- [ ] [Test to automate]
```
How AWS Runs Game Days
AWS recommends regular game days in its Well-Architected Framework. The format is closely related to the "Wheel of Misfortune" role-play exercise from Google SRE. The approach:
- No warning to responders. The responders don't know the scenario in advance — they practice responding to the unknown, like a real incident.
- Real on-call rotations. They run game days during actual on-call shifts, not special scheduled times.
- Production by default. AWS tests in production because staging doesn't have the same complexity and dependencies.
- Two-pizza team scope. Each game day is scoped to one team's systems — broad enough to be realistic, narrow enough to be learnable.
Game Day Frequency
| Organization Size | Recommended Frequency |
|---|---|
| Early startup (<10 engineers) | Quarterly tabletop + semi-annual live exercise |
| Growth stage (10-50 engineers) | Quarterly live exercises |
| Scale stage (50+ engineers) | Monthly for SRE team; quarterly for each product team |
Even one game day per year is far better than none. Start small: a 2-hour tabletop exercise covering three scenarios will surface gaps in how the team understands its own systems.
Summary
A game day is a structured learning event, not a test you pass or fail. The value comes from:
- Hypothesis validation — You learn where your mental model of the system is wrong
- Team readiness — Responders build muscle memory for incident response
- Runbook validation — You find the gaps in documentation before real incidents do
- Action items — You leave with specific improvements to implement
- Automation seeds — The findings drive your chaos engineering roadmap
Every production incident that surprises your team is a game day you didn't run. Choose the planned version.