Chaos Engineering Intro: Build Systems That Survive Failure
Netflix's infrastructure team had a problem in 2010: they couldn't be sure their systems would survive an AWS outage. Their solution — a tool that randomly terminates production instances — became Chaos Monkey, and the discipline around it became chaos engineering.
The core idea: instead of hoping your system handles failure gracefully, you prove it.
What Chaos Engineering Is
Chaos engineering is the practice of deliberately introducing failures into a system to verify that it continues to function and recover correctly.
It's not about breaking things randomly. It's a disciplined experiment:
- Form a hypothesis — "If a database replica fails, our application will route to the primary with no user-visible impact"
- Define steady state — measure what normal looks like (error rate, latency, throughput)
- Inject a controlled failure — kill the replica
- Observe — did the system maintain steady state?
- Learn — if not, you found a real weakness before customers did
The experiment is controlled, time-limited, and run with the ability to abort and roll back.
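That loop is easy to express as a small harness. Below is a minimal sketch in Python; everything in it (the function names, the 0.1% threshold, the five-minute window) is illustrative rather than tied to any particular tool. The steady-state metric, the injection, and the rollback are passed in as plain callables so you can plug in whatever your stack provides.

```python
import time
from typing import Callable

def run_experiment(
    measure: Callable[[], float],   # returns the steady-state metric (e.g. error rate)
    inject: Callable[[], None],     # introduces the controlled failure
    rollback: Callable[[], None],   # removes the failure again
    threshold: float = 0.001,       # hypothesis: the metric stays below this
    duration_s: int = 300,          # how long the failure stays injected
    poll_s: int = 10,               # how often to re-check the metric
) -> bool:
    baseline = measure()                      # 1. establish steady state first
    print(f"baseline metric: {baseline:.4f}")

    inject()                                  # 2. introduce the controlled failure
    try:
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:    # 3. observe
            current = measure()
            if current > threshold:           # 4. hypothesis violated: abort early
                print(f"ABORT: metric {current:.4f} exceeded {threshold:.4f}")
                return False
            time.sleep(poll_s)
        print("hypothesis held: steady state maintained under failure")
        return True
    finally:
        rollback()                            # always remove the failure, pass or fail
```

The properties worth copying are exactly the ones in the list above: the baseline is measured before anything is injected, the hypothesis is a number rather than a feeling, and rollback runs unconditionally.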
Why Teams Do It
Production failures are expensive. Gartner has estimated the average cost of downtime at $5,600 per minute. Most outages aren't caused by the initial failure itself — they're caused by the system's failure to handle that failure.
You can't test resilience without exercising it. Unit tests and integration tests verify behavior in normal conditions. They can't tell you what happens when a downstream service is slow, a disk fills up, or a node loses network connectivity. Chaos experiments test the things that actually cause incidents.
Confidence at scale. As systems grow — more services, more dependencies, more infrastructure — it becomes impossible to reason about failure modes. Chaos engineering gives you evidence instead of assumptions.
Chaos vs. Reliability Testing
| | Chaos Engineering | Reliability Testing |
|---|---|---|
| When | Ongoing, production | Pre-production |
| Target | Real system under real load | Simulated conditions |
| Goal | Discover unknown weaknesses | Verify known requirements |
| Risk | Controlled real impact | No production impact |
Reliability testing (load tests, disaster recovery drills) is planned and verifies specific requirements. Chaos engineering discovers weaknesses you didn't think to test for.
Both matter. Chaos engineering doesn't replace reliability testing — it reveals what to add to reliability tests.
Chaos Monkey and the Simian Army
Netflix open-sourced Chaos Monkey in 2012. The original version had one job: randomly terminate EC2 instances in production during business hours.
The reasoning: if your instances can be killed at any moment, your engineers build systems that tolerate it. And they find out about the gaps during business hours when the whole team is available, not at 2am on Saturday.
Netflix expanded this into the Simian Army:
- Chaos Monkey — kills instances
- Latency Monkey — introduces network delays
- Conformity Monkey — finds instances that don't follow best practices
- Janitor Monkey — cleans up unused resources
- Security Monkey — finds security policy violations
- Chaos Gorilla — simulates an entire availability zone going down
Modern chaos tools have expanded far beyond the original AWS-focused toolset.
Chaos Monkey Setup
The original Chaos Monkey is now part of the Spinnaker ecosystem, but there are simpler alternatives for teams not running Spinnaker.
Running Chaos Monkey with Spinnaker
Add to config.yml:
```yaml
chaosmonkey:
  enabled: true
  schedule:
    cron: "0 * * * 1-5"   # Every hour, weekdays only
  accounts:
    - name: production
      regions:
        - us-east-1
      groups:
        - application-asg
```

Alternatives for Kubernetes
For Kubernetes-native chaos, Chaos Mesh and Litmus are more practical:
```bash
# Install Chaos Mesh
curl -sSL https://mirrors.chaos-mesh.org/v2.6.3/install.sh | bash
```

Basic pod failure experiment:
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: pod-failure-experiment
spec:
  schedule: "*/10 * * * *"   # every 10 minutes
  type: PodChaos
  podChaos:
    action: pod-failure
    mode: one
    duration: 60s
    selector:
      namespaces:
        - production
      labelSelectors:
        app: api-server
```

This takes one API server pod out of service for 60 seconds every 10 minutes. If your autoscaler and health checks work correctly, this is invisible to users. If they don't, you'll find out now.
The GameDay
A GameDay is a scheduled chaos experiment involving the whole team.
Structure of a well-run GameDay:
Before:
- Define the scenario (network partition between services A and B)
- Agree on success criteria (error rate stays below 0.1%, P99 stays below 500ms)
- Identify rollback steps
- Notify stakeholders
- Establish the "steady state" baseline
During:
- Inject the failure
- Monitor dashboards
- Take notes on system and team behavior
- Roll back immediately if impact exceeds expected bounds
After:
- Document what happened
- File tickets for every weakness discovered
- Update runbooks
- Share findings across teams
GameDays start small (a single service in staging) and grow more ambitious as confidence and capability develop.
Principles for Getting Started
Start in staging, not production. Run your first experiments against a non-production environment. Once you have tooling, observability, and runbooks in place, graduate to production.
One failure at a time. Don't inject multiple failures simultaneously. You need to isolate cause and effect.
Have an abort condition. Define the conditions under which you stop: error rate exceeds 1%, a critical alert fires, or the on-call engineer calls a halt. Build the kill switch before starting.
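One concrete way to encode the "alert fires" condition is to ask Alertmanager whether anything page-worthy is currently firing, and halt if so. A rough sketch, assuming a reachable Alertmanager and a `severity: page` labelling convention — both are assumptions about your setup, not something the API requires:

```python
import requests

ALERTMANAGER = "http://alertmanager.example.internal:9093"   # assumption: your Alertmanager URL

def should_abort() -> bool:
    """Abort the experiment if any page-severity alert is currently firing."""
    resp = requests.get(
        f"{ALERTMANAGER}/api/v2/alerts", params={"active": "true"}, timeout=5
    )
    resp.raise_for_status()
    firing = [
        alert for alert in resp.json()
        if alert.get("labels", {}).get("severity") == "page"
    ]
    return len(firing) > 0
```

Whatever drives the experiment should run a check like this between observations, and a human should be able to trip the same switch manually.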
Measure before you inject. Chaos experiments require a baseline. If you don't know what normal looks like, you can't tell if the experiment degraded it.
Fix what you find. Chaos experiments that reveal weaknesses but don't result in fixes are theater. Every finding should produce a ticket.
What to Inject First
Easy starting points:
| Failure | Expected behavior | Common gap |
|---|---|---|
| Kill one instance/pod | Traffic routes to healthy instances | Health checks too slow, no retries |
| Kill all instances of a service | Dependent services degrade gracefully | No circuit breaker, hard dependency |
| Add 200ms latency to database | App stays responsive, queries time out gracefully | No timeout configured, cascade failure |
| Exhaust disk space on a node | App fails gracefully, alert fires | No disk monitoring, writes fail silently |
| Terminate a worker process | Job queue drains without loss | No ack timeout, jobs dropped |
Start with the failure most likely to happen. Look at your incident history — what caused the last three outages? Start there.
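For the 200ms-latency row, the injection itself can be as small as a traffic-control rule on the database host (on Kubernetes, a Chaos Mesh NetworkChaos resource does the same job). A rough sketch using `tc netem`, assuming root on the target host and that `eth0` is the interface carrying database traffic; verify both before running it anywhere:

```python
import subprocess

IFACE = "eth0"   # assumption: the interface carrying database traffic

def inject_latency(ms: int = 200) -> None:
    # Adds a netem qdisc that delays all egress packets on IFACE by `ms` milliseconds.
    # Requires root (or CAP_NET_ADMIN) on the host where it runs.
    subprocess.run(
        ["tc", "qdisc", "add", "dev", IFACE, "root", "netem", "delay", f"{ms}ms"],
        check=True,
    )

def remove_latency() -> None:
    # Removes the root qdisc again, restoring normal latency.
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root"], check=True)
```

These two functions drop straight into the inject/rollback hooks of the experiment harness sketched earlier.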
Observability Requirements
You can't run chaos experiments without observability. At minimum, you need:
- Metrics: error rate, latency (P50, P95, P99), throughput, saturation
- Alerts: firing when user-visible impact exceeds threshold
- Dashboards: real-time view of system state during experiments
- Logs: structured, searchable, correlated with request IDs
If you can't answer "is the system in steady state right now?", you can't run chaos experiments safely.
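Concretely, answering that question can be as simple as a couple of PromQL queries snapshotted on a schedule. The metric names below (`http_requests_total`, `http_request_duration_seconds_bucket`) and the Prometheus URL are assumptions; substitute whatever your services actually export.

```python
import requests

PROMETHEUS = "http://prometheus.example.internal:9090"   # assumption: your Prometheus URL

STEADY_STATE_QUERIES = {
    # fraction of requests returning 5xx over the last 5 minutes
    "error_rate": 'sum(rate(http_requests_total{status=~"5.."}[5m])) '
                  '/ sum(rate(http_requests_total[5m]))',
    # 99th percentile request latency in seconds
    "p99_latency": 'histogram_quantile(0.99, '
                   'sum(rate(http_request_duration_seconds_bucket[5m])) by (le))',
}

def query(promql: str) -> float:
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": promql}, timeout=5)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else float("nan")

def steady_state() -> dict:
    """Snapshot of the steady-state metrics."""
    return {name: query(q) for name, q in STEADY_STATE_QUERIES.items()}
```

Take one snapshot before the experiment, keep polling during it, and compare afterwards.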
Chaos Engineering Tools
| Tool | Best For | License |
|---|---|---|
| Chaos Monkey | AWS EC2 termination | Open source |
| Chaos Mesh | Kubernetes | Open source |
| Litmus | Kubernetes, GitOps workflows | Open source |
| Gremlin | Full-featured, SaaS | Commercial |
| AWS Fault Injection Service | AWS-native | Commercial |
| k6 | Application-level, scripted | Open source |
Tools are not the bottleneck. The bottleneck is observability, team buy-in, and the discipline to act on findings.
Is Your System Ready for Chaos?
Signs you're ready:
- You have an on-call rotation with runbooks
- Incidents are documented and reviewed
- You have basic metrics and alerting
- Services have health checks and circuit breakers
Signs you're not ready:
- No alerting in place
- Single points of failure you know about but haven't fixed
- No rollback process for deployments
- Teams don't have insight into what "normal" looks like
Fix the known weaknesses first. Chaos engineering is for discovering unknown weaknesses — don't use it to rediscover what you already know.
The Goal
The goal of chaos engineering isn't to cause outages. It's to develop confidence that your system handles failure gracefully, and to discover the cases where it doesn't before your users do.
Teams that practice chaos engineering regularly report fewer incidents, shorter mean time to recovery, and more confidence shipping changes to production. That's the outcome — not the experiments themselves, but the resilience improvements they drive.