How to Write a Chaos Engineering Runbook: Templates and Best Practices

A chaos engineering runbook is the document that makes your experiments repeatable, learnable, and defensible. Without runbooks, chaos experiments are one-off events — knowledge that lives in one engineer's head and disappears when they leave. With runbooks, experiments become institutional memory and the foundation of a mature resilience program.

This guide covers what a chaos runbook should contain, how to write one, and how to use it to build an organizational practice around resilience testing.

Why Runbooks Matter in Chaos Engineering

Repeatability. An experiment documented in a runbook can be run by any team member, not just the original author. This is essential as chaos programs scale beyond the initial enthusiasts.

Accountability. Runbooks establish who authorized the experiment, who ran it, and what the expected and actual blast radii were. In regulated industries, this documentation is often required.

Learning persistence. The most valuable output of a chaos experiment is what you learned. Runbooks capture this knowledge formally. An experiment without a runbook produces knowledge that evaporates.

Safety. Runbooks include pre-experiment checks, rollback procedures, and emergency stop conditions. They're the guardrails that prevent chaos experiments from becoming actual incidents.

Stakeholder communication. When running chaos experiments that affect shared environments, runbooks give stakeholders (operations, management, security) a way to understand what will happen before it happens.

The Chaos Runbook Structure

A complete chaos runbook has seven sections:

1. Experiment Header

# Experiment: [Short descriptive name]

**ID:** CHAOS-2026-042
**Status:** Draft | Approved | Completed
**Owner:** @sre-team-member
**Date Created:** 2026-05-20
**Last Run:**
**Environment:** Staging only / Staging + Production

## Summary
One paragraph describing what this experiment tests and why it matters.

2. Hypothesis

The most important section. State exactly what you believe will happen:

## Hypothesis

**When:** The payment service is experiencing 20% HTTP 500 errors

**We believe:** The checkout service will degrade gracefully — it will:
- Return a user-friendly error message on the checkout page
- Not expose internal error details to users
- Retry failed payment requests up to 3 times before surfacing the error
- Log the failure with full context for debugging

**We will know this is true when:**
- checkout_error_rate < 25% (measured in Prometheus, 5-minute window)
- No `payment_error` strings visible in browser DevTools network responses
- Exactly 3 retry attempts (after the initial request) visible in payment service logs per checkout failure
- checkout_payment_retry_total counter increments in Prometheus

The hypothesis must be falsifiable — you must be able to determine definitively whether it held or failed.

3. Steady State

Define what "normal" looks like so you can detect deviation:

## Steady State

Confirm all of the following before starting the experiment:

| Metric | Threshold | Current | Source |
|---|---|---|---|
| API error rate (5xx) | < 0.5% | Check Grafana → Error Rates | Prometheus |
| API p99 latency | < 500ms | Check Grafana → Latency | Prometheus |
| Active user sessions | > 100 | Check Grafana → Sessions | DataDog |
| Payment service health | GREEN | Check PagerDuty | Healthcheck |
| No active incidents | — | Check PagerDuty | PagerDuty |

**Do not start the experiment if any threshold is breached or any incident is active.**
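
The go/no-go rule above can be turned into a small gate script. The sketch below is a minimal illustration, not part of any tool: `within_threshold` and `steady_state_gate` are hypothetical helper names, and the readings are typed in by hand rather than fetched from Prometheus or DataDog.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: compare current readings with steady-state thresholds.

# within_threshold VALUE OP THRESHOLD
# OP is "lt" or "gt". Returns 0 if the reading satisfies the threshold.
within_threshold() {
  local value="$1" op="$2" threshold="$3"
  awk -v v="$value" -v t="$threshold" -v op="$op" 'BEGIN {
    if (op == "lt") exit !(v + 0 < t + 0)
    if (op == "gt") exit !(v + 0 > t + 0)
    exit 2   # an unknown operator counts as a failure
  }'
}

# steady_state_gate reads lines of "NAME VALUE OP THRESHOLD" on stdin,
# prints one verdict per metric, and returns non-zero if any is breached.
steady_state_gate() {
  local name value op threshold breached=0
  while read -r name value op threshold; do
    [ -z "$name" ] && continue
    if within_threshold "$value" "$op" "$threshold"; then
      echo "OK $name=$value"
    else
      echo "BREACH $name=$value (required: $op $threshold)"
      breached=1
    fi
  done
  return "$breached"
}

# Illustrative readings only; the metric names are assumptions.
steady_state_gate <<'EOF'
api_5xx_error_rate_pct 0.3 lt 0.5
api_p99_latency_ms 420 lt 500
active_user_sessions 180 gt 100
EOF
```

Because the gate returns a non-zero status on any breach, it could sit in front of the fault-injection command so that the experiment never starts from an unhealthy baseline.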

4. Blast Radius Assessment

Document who and what will be affected:

## Blast Radius

**Environment:** Staging (customers/users.staging.yourapp.com)
**Expected impact:** Users on the checkout flow in staging
**Duration:** 15 minutes (10-minute experiment + 5-minute recovery observation)
**Affected services:** checkout-service, payment-service

**Not affected:**
- Production environment
- All other user flows (browse, cart, profile)
- Internal admin tools

**Risk level:** LOW (staging only, isolated payment failure injection)

**Stakeholder notifications required:**
- [ ] QA team lead: @qa-lead
- [ ] Staging environment users: email notification 30 min before

5. Execution Steps

Step-by-step instructions for running the experiment:

## Execution Steps

### Pre-Experiment Checklist (15 minutes before)

- [ ] Verify steady state metrics are within thresholds (see Steady State section)
- [ ] Notify stakeholders (see Blast Radius section)
- [ ] Open monitoring dashboard: https://grafana.internal/d/checkout-resilience
- [ ] Have rollback procedure ready (see Rollback section)
- [ ] Confirm you have access to the chaos tooling

### Experiment Execution

**Step 1: Start monitoring**
```bash
# Open Grafana dashboard in browser
open https://grafana.internal/d/checkout-resilience

# Start capturing logs
kubectl logs -n staging deploy/checkout-service -f | tee /tmp/chaos-checkout-$(date +%Y%m%d-%H%M%S).log &
```

**Step 2: Inject fault (10:00 AM)**

```bash
chaos run --rollback-strategy=always \
  experiments/payment-service-failure.yaml
```

**Step 3: Observe and record (10:00-10:10 AM)**

Record what you see in the monitoring dashboard:

- checkout_error_rate peak: ___
- checkout_p99_latency peak: ___
- First error visible to users: ___ seconds after injection
- User-visible error message: ___

**Step 4: Stop injection**

The Chaos Toolkit will stop the experiment automatically. Note the time: ___

**Step 5: Observe recovery (10:10-10:15 AM)**

Record recovery:

- Time to return to steady state: ___ seconds
- Any residual effects: ___

### Post-Experiment

- [ ] Verify all metrics have returned to steady state
- [ ] Review checkout-service logs for unexpected errors
- [ ] Save all screenshots and log files
- [ ] Fill in Findings section (below)

6. Rollback Procedure

What to do if things go wrong:

## Rollback Procedure

### Emergency Stop (use immediately if experiment goes wrong)

```bash
# Stop all running Chaos Toolkit experiments
chaos terminate

# If payment service fault is still injected:
kubectl rollout undo deployment/payment-service -n staging

# Verify services recovering:
watch -n5 'kubectl get pods -n staging'
```

### When to Trigger Emergency Stop

Stop immediately if:

- checkout_error_rate exceeds 50% (much higher than expected)
- Production environment shows any anomalies (should not happen, but verify)
- Any PagerDuty incident fires during the experiment
- The experiment does not complete within 20 minutes

**Emergency contacts:**

- Platform team: #platform-ops Slack channel
- On-call SRE: PagerDuty → HelpMeTest SRE
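
The measurable stop conditions can also be encoded so that the person running the experiment does not have to judge them under pressure. This is a rough sketch under stated assumptions: `should_abort` is a hypothetical helper, its three inputs are supplied by the operator or a polling loop, and the production-anomaly condition is deliberately left to a human because it has no single numeric threshold in this runbook.

```bash
#!/usr/bin/env bash
# Hypothetical helper: decide whether to trigger the emergency stop.
# Arguments: checkout error rate (%), minutes elapsed, open incident count.
# Limits mirror the runbook's stop conditions (50% error rate, 20 minutes).
should_abort() {
  local error_rate_pct="$1" elapsed_min="$2" open_incidents="$3"

  if awk -v r="$error_rate_pct" 'BEGIN { exit !(r + 0 > 50) }'; then
    echo "ABORT: checkout_error_rate ${error_rate_pct}% exceeds 50%"
    return 0
  fi
  if [ "$open_incidents" -gt 0 ]; then
    echo "ABORT: ${open_incidents} PagerDuty incident(s) open"
    return 0
  fi
  if [ "$elapsed_min" -ge 20 ]; then
    echo "ABORT: experiment still running after ${elapsed_min} minutes"
    return 0
  fi

  echo "CONTINUE"
  return 1
}
```

A zero exit status means "stop now", so the function can guard the emergency stop commands directly, for example `should_abort 62.5 7 0 && echo "run the Emergency Stop steps"`.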

7. Findings and Remediation

Fill in after the experiment:

## Findings

**Date Run:** 2026-05-21 10:00 AM
**Run By:** @sre-engineer
**Duration:** 12 minutes

### What Happened

[Describe what actually happened during the experiment in plain language]

### Hypothesis Verification

| Hypothesis | Expected | Actual | Passed? |
|---|---|---|---|
| Checkout error rate < 25% | < 25% | 18% | ✅ |
| No internal error details exposed | None | None | ✅ |
| 3 retry attempts per failure | 3 retries | 1 retry | ❌ |
| checkout_payment_retry_total increments | Yes | No | ❌ |

**Overall: PARTIAL PASS** — Hypothesis partially held. Error rate was acceptable, but retry logic was not working as expected.

### Unexpected Findings

1. The checkout service only retried once, not three times as configured. The retry configuration appears to be misconfigured.
2. The error message shown to users included the payment provider name ("Stripe"), which may be more information than we want to expose.

### Metrics During Experiment

- checkout_error_rate peak: 18%
- checkout_p99_latency peak: 2,100ms
- Recovery time: 45 seconds after injection stopped

### Screenshots

- [Grafana dashboard during experiment](link)
- [User-visible error message](link)

### Remediation Items

| Finding | Action | Owner | Priority | Due |
|---|---|---|---|---|
| Retry count is 1, not 3 | Fix retry configuration in checkout-service | @backend-team | HIGH | 2026-05-28 |
| Stripe name exposed in error message | Update error message to generic text | @backend-team | MEDIUM | 2026-06-04 |

### Next Steps

- [ ] Fix retry configuration (tracked: JIRA-1234)
- [ ] Fix error message (tracked: JIRA-1235)
- [ ] Re-run this experiment after fixes are deployed to verify

### Approved for Production? 

☐ Yes — hypothesis held, no remediation required before production run  
☒ No — remediation required first (see Remediation Items)
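
The overall verdict in the Hypothesis Verification table can be derived mechanically from the per-row results, which keeps the labels consistent across runbooks. The sketch below assumes a three-level scheme (PASS, PARTIAL PASS, FAIL) inferred from the example findings; `overall_verdict` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: derive the overall verdict from per-assertion results.
# Each argument is "pass" or "fail", one per row of the verification table.
overall_verdict() {
  local passed=0 failed=0 result
  for result in "$@"; do
    case "$result" in
      pass) passed=$((passed + 1)) ;;
      fail) failed=$((failed + 1)) ;;
      *) echo "unknown result: $result" >&2; return 2 ;;
    esac
  done

  if [ "$failed" -eq 0 ] && [ "$passed" -gt 0 ]; then
    echo "PASS"
  elif [ "$passed" -gt 0 ]; then
    echo "PARTIAL PASS"
  else
    echo "FAIL"
  fi
}

# The four rows from the example table: two held, two did not.
overall_verdict pass pass fail fail   # prints: PARTIAL PASS
```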

Managing a Library of Chaos Runbooks

As your chaos program matures, you'll accumulate dozens of runbooks. Organize them for discoverability:

Directory Structure

chaos/
├── README.md                          # index of all experiments
├── experiments/
│   ├── pod-termination/
│   │   ├── runbook.md
│   │   └── experiment.yaml
│   ├── network-latency/
│   │   ├── runbook.md
│   │   └── experiment.yaml
│   └── payment-service-failure/
│       ├── runbook.md
│       └── experiment.yaml
├── templates/
│   └── runbook-template.md
└── findings/
    └── 2026-Q2-findings-summary.md
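
A small scaffolding helper makes it more likely that new experiments follow this layout. The following is a sketch only: `new_experiment` is a hypothetical function, and it assumes the `templates/runbook-template.md` file from the tree above already exists.

```bash
#!/usr/bin/env bash
# Hypothetical helper: scaffold a new experiment directory from the template.
# Usage: new_experiment CHAOS_ROOT EXPERIMENT_NAME
new_experiment() {
  local root="$1" name="$2"
  local template="$root/templates/runbook-template.md"
  local target="$root/experiments/$name"

  if [ ! -f "$template" ]; then
    echo "missing template: $template" >&2
    return 1
  fi
  if [ -e "$target" ]; then
    echo "experiment already exists: $target" >&2
    return 1
  fi

  mkdir -p "$target" || return 1
  cp "$template" "$target/runbook.md" || return 1
  : > "$target/experiment.yaml"   # empty placeholder, filled in later
  echo "$target"
}
```

Calling `new_experiment chaos database-latency` from the repository root would create `chaos/experiments/database-latency/` with a copy of the template and an empty experiment file; refusing to overwrite an existing directory protects recorded findings.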

Runbook Status Tracking

Maintain an index of all experiments with their current status:

# Chaos Engineering Runbook Index

| Experiment | Last Run | Status | Findings | Owner |
|---|---|---|---|---|
| Pod termination | 2026-05-15 | ✅ PASS | None | @sre-team |
| Network latency | 2026-04-28 | ⚠️ PARTIAL | Retry misconfigured | @backend-team |
| Payment failure | 2026-05-21 | ❌ FAIL | See CHAOS-2026-042 | @backend-team |
| Database connection pool | — | Draft | — | @db-team |
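
Index rows tend to drift if they are edited by hand, so one option is to generate them from each runbook's header. This sketch assumes every `runbook.md` keeps the `**Last Run:**`, `**Status:**`, and `**Owner:**` fields from the header template on their own lines; `header_field` and `index_row` are hypothetical helpers, and the Findings column is omitted because it needs a human summary.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: build an index row from a runbook's header fields.

# header_field FILE FIELD prints the text after "**FIELD:**" (first match).
header_field() {
  local file="$1" field="$2"
  sed -n "s/^\*\*${field}:\*\* *//p" "$file" | head -n 1
}

# index_row EXPERIMENT_DIR prints "| name | last run | status | owner |".
index_row() {
  local dir="$1" runbook="$1/runbook.md"
  local last_run status owner
  last_run="$(header_field "$runbook" "Last Run")"
  status="$(header_field "$runbook" "Status")"
  owner="$(header_field "$runbook" "Owner")"
  # Fall back to placeholders when a field is empty or missing.
  printf '| %s | %s | %s | %s |\n' \
    "$(basename "$dir")" "${last_run:-n/a}" "${status:-Draft}" "${owner:-unassigned}"
}
```

Looping `index_row` over `chaos/experiments/*/` would produce the body of the table, leaving only the Findings column to fill in.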

Quarterly Runbook Review

Each quarter, review all runbooks:

  • Are the steady state thresholds still accurate?
  • Have services changed in ways that affect the experiment?
  • Are remediation items from previous runs addressed?
  • Are there new failure modes worth adding to the library?
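
To seed the quarterly review, it can help to list which runbooks have not been run since the previous quarter. A minimal sketch, assuming the directory layout shown earlier and a `**Last Run:** YYYY-MM-DD` header line in each runbook; `stale_runbooks` is a hypothetical helper and the cutoff date is chosen by the reviewer.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list runbooks not run since a cutoff date.
# Usage: stale_runbooks CHAOS_ROOT CUTOFF_DATE   (dates as YYYY-MM-DD)
stale_runbooks() {
  local root="$1" cutoff="$2" runbook last_run
  for runbook in "$root"/experiments/*/runbook.md; do
    [ -f "$runbook" ] || continue
    last_run="$(sed -n 's/^\*\*Last Run:\*\* *//p' "$runbook" | head -n 1)"
    # ISO dates compare correctly as plain strings; empty means never run.
    if [ -z "$last_run" ] || [[ "$last_run" < "$cutoff" ]]; then
      echo "$(basename "$(dirname "$runbook")") ${last_run:-never}"
    fi
  done
}
```

Each output line names an experiment and its last run date (or `never`), which gives the review meeting a concrete list to work through.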

Runbook Anti-Patterns

Vague hypotheses. "The system will be resilient" is not a hypothesis. Every runbook must have specific, measurable assertions.

Missing rollback procedures. Every runbook must have a tested rollback procedure. "Revert the change" is not a rollback procedure.

Runbooks without findings. A runbook that was executed but never updated with findings is a runbook that produced no lasting value. Make findings mandatory.

No stakeholder communication plan. Chaos experiments that affect shared environments must notify affected teams. Surprises create distrust.

Runbooks that never get re-run. Runbooks should be re-run after remediation and periodically thereafter. Systems change; past passes don't guarantee future passes.

Starting Your Runbook Library

Begin with the five scenarios most likely to affect your users:

  1. Most critical service pod termination — What happens when your most important service loses a pod?
  2. Database latency — How does the application behave when database queries slow down?
  3. External API failure — What happens when your payment provider or email service is unavailable?
  4. Authentication service degradation — Does the application fail safely when auth is slow?
  5. Connection pool exhaustion — What happens under high concurrent load?

Write the runbook for each. Run the experiment. Document findings. Fix what's broken. Repeat quarterly.


HelpMeTest's continuous monitoring provides the steady-state baseline data that chaos runbooks require; without it, you're guessing about what "normal" looks like.
