Chaos Engineering Intro: Build Systems That Survive Failure

Netflix's infrastructure team had a problem in 2010: they couldn't be sure their systems would survive an AWS outage. Their solution — a tool that randomly terminates production instances — became Chaos Monkey, and the discipline around it became chaos engineering.

The core idea: instead of hoping your system handles failure gracefully, you prove it.

What Chaos Engineering Is

Chaos engineering is the practice of deliberately introducing failures into a system to verify that it continues to function and recover correctly.

It's not about breaking things randomly. It's a disciplined experiment:

  1. Form a hypothesis — "If a database replica fails, our application will route to the primary with no user-visible impact"
  2. Define steady state — measure what normal looks like (error rate, latency, throughput)
  3. Inject a controlled failure — kill the replica
  4. Observe — did the system maintain steady state?
  5. Learn — if not, you found a real weakness before customers did

The experiment is controlled, time-limited, and run with the ability to abort and roll back.
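The five-step loop above can be sketched in code. This is a minimal illustration, not a real harness: `inject`, `revert`, and `read_error_rate` are hypothetical hooks you would wire to your own tooling and metrics.

```python
def run_experiment(inject, revert, read_error_rate, abort_threshold=0.01):
    """One controlled chaos experiment: baseline, inject, observe, roll back."""
    baseline = read_error_rate()       # 2. define steady state
    inject()                           # 3. inject a controlled failure
    try:
        observed = read_error_rate()   # 4. observe the system under failure
        # 5. learn: did we hold steady state, or find a real weakness?
        return {"baseline": baseline, "observed": observed,
                "passed": observed <= abort_threshold}
    finally:
        revert()                       # always roll back, even on error
```

The `finally` clause is the abort-and-roll-back guarantee: whatever happens during observation, the injected failure is reverted.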

Why Teams Do It

Production failures are expensive. One widely cited Gartner estimate puts the average cost of downtime at $5,600 per minute. And most outages aren't caused by the initial failure itself — they're caused by the system's failure to handle that failure.

You can't test resilience without exercising it. Unit tests and integration tests verify behavior in normal conditions. They can't tell you what happens when a downstream service is slow, a disk fills up, or a node loses network connectivity. Chaos experiments test the things that actually cause incidents.

Confidence at scale. As systems grow — more services, more dependencies, more infrastructure — it becomes impossible to reason about failure modes. Chaos engineering gives you evidence instead of assumptions.

Chaos vs. Reliability Testing

|        | Chaos Engineering           | Reliability Testing        |
| ------ | --------------------------- | -------------------------- |
| When   | Ongoing, production         | Pre-production             |
| Target | Real system under real load | Simulated conditions       |
| Goal   | Discover unknown weaknesses | Verify known requirements  |
| Risk   | Controlled real impact      | No production impact       |

Reliability testing (load tests, disaster recovery drills) is planned and verifies specific requirements. Chaos engineering discovers weaknesses you didn't think to test for.

Both matter. Chaos engineering doesn't replace reliability testing — it reveals what to add to reliability tests.

Chaos Monkey and the Simian Army

Netflix open-sourced Chaos Monkey in 2012. The original version had one job: randomly terminate EC2 instances in production during business hours.

The reasoning: if your instances can be killed at any moment, your engineers build systems that tolerate it. And they find out about the gaps during business hours when the whole team is available, not at 2am on Saturday.

Netflix expanded this into the Simian Army:

  • Chaos Monkey — kills instances
  • Latency Monkey — introduces network delays
  • Conformity Monkey — finds instances that don't follow best practices
  • Janitor Monkey — cleans up unused resources
  • Security Monkey — finds security policy violations
  • Chaos Gorilla — simulates an entire availability zone going down

Modern chaos tools have expanded far beyond the original AWS-focused toolset.

Chaos Monkey Setup

The original Chaos Monkey is now part of the Spinnaker ecosystem, but there are simpler alternatives for teams not running Spinnaker.

Running Chaos Monkey with Spinnaker

Add to config.yml:

chaosmonkey:
  enabled: true
  schedule:
    cron: "0 * * * 1-5"  # Every hour, weekdays only
  accounts:
    - name: production
      regions:
        - us-east-1
      groups:
        - application-asg

Alternatives for Kubernetes

For Kubernetes-native chaos, Chaos Mesh and Litmus are more practical:

# Install Chaos Mesh
curl -sSL https://mirrors.chaos-mesh.org/v2.6.3/install.sh | bash

Basic pod failure experiment:

apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: pod-failure-experiment
spec:
  schedule: "@every 10m"
  type: PodChaos
  historyLimit: 3
  concurrencyPolicy: Forbid
  podChaos:
    action: pod-failure
    mode: one
    duration: 60s
    selector:
      namespaces:
        - production
      labelSelectors:
        app: api-server

This takes one API server pod offline for 60 seconds every 10 minutes. If your autoscaler and health checks work correctly, this is invisible to users. If they don't, you'll find out now.
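Whether a failed pod is invisible to users mostly comes down to whether clients retry. A minimal retry-with-backoff sketch in Python; `do_request` is a hypothetical stand-in for your HTTP client call, and `ConnectionError` stands in for whatever your client raises when it hits a dying pod.

```python
import time

def call_with_retries(do_request, attempts=3, base_delay=0.1):
    """Absorb a transient pod failure instead of surfacing it to the user."""
    for attempt in range(attempts):
        try:
            return do_request()
        except ConnectionError:
            if attempt == attempts - 1:
                raise                              # out of retries: surface it
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
```

With retries in place, the load balancer has time to route the retried request to a healthy pod.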

The GameDay

A GameDay is a scheduled chaos experiment involving the whole team.

Structure of a well-run GameDay:

Before:

  • Define the scenario (network partition between services A and B)
  • Agree on success criteria (error rate stays below 0.1%, P99 stays below 500ms)
  • Identify rollback steps
  • Notify stakeholders
  • Establish the "steady state" baseline

During:

  • Inject the failure
  • Monitor dashboards
  • Take notes on system and team behavior
  • Roll back immediately if impact exceeds expected bounds

After:

  • Document what happened
  • File tickets for every weakness discovered
  • Update runbooks
  • Share findings across teams

GameDays start small (a single service in staging) and grow more ambitious as confidence and capability develop.
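The success criteria agreed on before the run are easiest to enforce when encoded as a single check the team watches during the experiment. A sketch using the illustrative 0.1% error-rate and 500 ms P99 thresholds from above:

```python
def meets_success_criteria(error_rate, p99_latency_ms,
                           max_error_rate=0.001, max_p99_ms=500):
    """The GameDay passes only while both agreed-upon criteria hold."""
    return error_rate <= max_error_rate and p99_latency_ms <= max_p99_ms
```

The moment this returns False, the rollback steps identified beforehand kick in.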

Principles for Getting Started

Start in staging, not production. Run your first experiments against a non-production environment. Once you have tooling, observability, and runbooks in place, graduate to production.

One failure at a time. Don't inject multiple failures simultaneously. You need to isolate cause and effect.

Have an abort condition. Define the conditions under which you stop: error rate exceeds 1%, on-call engineer requests halt, alert fires. Build the kill switch before starting.

Measure before you inject. Chaos experiments require a baseline. If you don't know what normal looks like, you can't tell if the experiment degraded it.
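As a sketch of what "measure before you inject" means in practice: the baseline can be an average over a pre-experiment window, with degradation defined as drift beyond a tolerance. The 10% tolerance here is illustrative, not a recommendation.

```python
def baseline(samples):
    """Average a window of pre-experiment measurements to define "normal"."""
    return sum(samples) / len(samples)

def degraded(baseline_value, current, tolerance=0.10):
    """Flag degradation when the current reading drifts more than
    `tolerance` (10% by default) from the baseline."""
    return abs(current - baseline_value) > tolerance * baseline_value
```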

Fix what you find. Chaos experiments that reveal weaknesses but don't result in fixes are theater. Every finding should produce a ticket.

What to Inject First

Easy starting points:

| Failure                         | Expected behavior                                 | Common gap                              |
| ------------------------------- | ------------------------------------------------- | --------------------------------------- |
| Kill one instance/pod           | Traffic routes to healthy instances               | Health checks too slow, no retries      |
| Kill all instances of a service | Dependent services degrade gracefully             | No circuit breaker, hard dependency     |
| Add 200ms latency to database   | App stays responsive, queries time out gracefully | No timeout configured, cascade failure  |
| Exhaust disk space on a node    | App fails gracefully, alert fires                 | No disk monitoring, writes fail silently |
| Terminate a worker process      | Job queue drains without loss                     | No ack timeout, jobs dropped            |

Start with the failure most likely to happen. Look at your incident history — what caused the last three outages? Start there.
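The "no circuit breaker" gap from the table is worth a sketch: a breaker opens after a few consecutive failures, so a dead dependency fails fast instead of cascading. This is a toy version; the threshold and the `ConnectionError` failure mode are illustrative, and a production breaker would also reset after a cool-down period.

```python
class CircuitBreaker:
    """Toy circuit breaker: opens after `max_failures` consecutive errors."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.max_failures

    def call(self, fn, fallback):
        if self.open:
            return fallback()      # fail fast: dependency presumed down
        try:
            result = fn()
            self.failures = 0      # success resets the breaker
            return result
        except ConnectionError:
            self.failures += 1
            return fallback()
```

Once open, the breaker stops hammering the dead dependency, which is exactly what the "kill all instances of a service" experiment tests.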

Observability Requirements

You can't run chaos experiments without observability. At minimum, you need:

  • Metrics: error rate, latency (P50, P95, P99), throughput, saturation
  • Alerts: firing when user-visible impact exceeds threshold
  • Dashboards: real-time view of system state during experiments
  • Logs: structured, searchable, correlated with request IDs

If you can't answer "is the system in steady state right now?", you can't run chaos experiments safely.
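For the latency percentiles above, a simple nearest-rank computation over a window of samples is enough to establish a baseline. A minimal Python sketch:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: sort, then take the ceil(p% of n)-th value."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]
```

During an experiment, comparing live P99 against the pre-injection P99 tells you whether steady state is holding.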

Chaos Engineering Tools

| Tool                        | Best for                     | License     |
| --------------------------- | ---------------------------- | ----------- |
| Chaos Monkey                | AWS EC2 termination          | Open source |
| Chaos Mesh                  | Kubernetes                   | Open source |
| Litmus                      | Kubernetes, GitOps workflows | Open source |
| Gremlin                     | Full-featured, SaaS          | Commercial  |
| AWS Fault Injection Service | AWS-native                   | Commercial  |
| k6                          | Application-level, scripted  | Open source |

Tools are not the bottleneck. The bottleneck is observability, team buy-in, and the discipline to act on findings.

Is Your System Ready for Chaos?

Signs you're ready:

  • You have on-call rotation with runbooks
  • Incidents are documented and reviewed
  • You have basic metrics and alerting
  • Services have health checks and circuit breakers

Signs you're not ready:

  • No alerting in place
  • Single points of failure you know about but haven't fixed
  • No rollback process for deployments
  • Teams don't have insight into what "normal" looks like

Fix the known weaknesses first. Chaos engineering is for discovering unknown weaknesses — don't use it to rediscover what you already know.

The Goal

The goal of chaos engineering isn't to cause outages. It's to develop confidence that your system handles failure gracefully, and to discover the cases where it doesn't before your users do.

Teams that practice chaos engineering regularly report fewer incidents, shorter mean time to recovery, and more confidence shipping changes to production. That's the outcome — not the experiments themselves, but the resilience improvements they drive.
