Chaos Engineering Intro: Build Systems That Survive Failure
Netflix's infrastructure team had a problem in 2010: they couldn't be sure their systems would survive an AWS outage. Their solution — a tool that randomly terminates production instances — became Chaos Monkey, and the discipline around it became chaos engineering.
The core idea: instead of hoping your system handles failure gracefully, you prove it.
What Chaos Engineering Is
Chaos engineering is the practice of deliberately introducing failures into a system to verify that it continues to function and recover correctly.
It's not about breaking things randomly. It's a disciplined experiment:
- Form a hypothesis — "If a database replica fails, our application will route to the primary with no user-visible impact"
- Define steady state — measure what normal looks like (error rate, latency, throughput)
- Inject a controlled failure — kill the replica
- Observe — did the system maintain steady state?
- Learn — if not, you found a real weakness before customers did
The experiment is controlled, time-limited, and run with the ability to abort and roll back.
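That loop is easy to express as a small harness. Below is a minimal sketch in Python; everything in it (the function names, the 0.1% threshold, the five-minute window) is illustrative rather than tied to any particular tool. The steady-state metric, the injection, and the rollback are passed in as plain callables so you can plug in whatever your stack provides.

```python
import time
from typing import Callable

def run_experiment(
    measure: Callable[[], float],   # returns the steady-state metric (e.g. error rate)
    inject: Callable[[], None],     # introduces the controlled failure
    rollback: Callable[[], None],   # removes the failure again
    threshold: float = 0.001,       # hypothesis: the metric stays below this
    duration_s: int = 300,          # how long the failure stays injected
    poll_s: int = 10,               # how often to re-check the metric
) -> bool:
    baseline = measure()                      # 1. establish steady state first
    print(f"baseline metric: {baseline:.4f}")

    inject()                                  # 2. introduce the controlled failure
    try:
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:    # 3. observe
            current = measure()
            if current > threshold:           # 4. hypothesis violated: abort early
                print(f"ABORT: metric {current:.4f} exceeded {threshold:.4f}")
                return False
            time.sleep(poll_s)
        print("hypothesis held: steady state maintained under failure")
        return True
    finally:
        rollback()                            # always remove the failure, pass or fail
```

The properties worth copying are exactly the ones in the list above: the baseline is measured before anything is injected, the hypothesis is a number rather than a feeling, and rollback runs unconditionally.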
Why Teams Do It
Production failures are expensive. Gartner has estimated the average cost of downtime at $5,600 per minute. Most outages aren't caused by the initial failure itself — they're caused by the system's failure to handle that failure.
You can't test resilience without exercising it. Unit tests and integration tests verify behavior in normal conditions. They can't tell you what happens when a downstream service is slow, a disk fills up, or a node loses network connectivity. Chaos experiments test the things that actually cause incidents.
Confidence at scale. As systems grow — more services, more dependencies, more infrastructure — it becomes impossible to reason about failure modes. Chaos engineering gives you evidence instead of assumptions.
Chaos vs. Reliability Testing
| | Chaos Engineering | Reliability Testing |
|---|---|---|
| When | Ongoing, production | Pre-production |
| Target | Real system under real load | Simulated conditions |
| Goal | Discover unknown weaknesses | Verify known requirements |
| Risk | Controlled real impact | No production impact |
Reliability testing (load tests, disaster recovery drills) is planned and verifies specific requirements. Chaos engineering discovers weaknesses you didn't think to test for.
Both matter. Chaos engineering doesn't replace reliability testing — it reveals what to add to reliability tests.
Chaos Monkey and the Simian Army
Netflix open-sourced Chaos Monkey in 2012. The original version had one job: randomly terminate EC2 instances in production during business hours.
The reasoning: if your instances can be killed at any moment, your engineers build systems that tolerate it. And they find out about the gaps during business hours when the whole team is available, not at 2am on Saturday.
Netflix expanded this into the Simian Army:
- Chaos Monkey — kills instances
- Latency Monkey — introduces network delays
- Conformity Monkey — finds instances that don't follow best practices
- Janitor Monkey — cleans up unused resources
- Security Monkey — finds security policy violations
- Chaos Gorilla — simulates an entire availability zone going down
Modern chaos tools have expanded far beyond the original AWS-focused toolset.
Chaos Monkey Setup
The original Chaos Monkey is now part of the Spinnaker ecosystem, but there are simpler alternatives for teams not running Spinnaker.
Running Chaos Monkey with Spinnaker
Add to config.yml:
```yaml
chaosmonkey:
  enabled: true
  schedule:
    cron: "0 * * * 1-5"   # Every hour, weekdays only
  accounts:
    - name: production
      regions:
        - us-east-1
      groups:
        - application-asg
```

Alternatives for Kubernetes
For Kubernetes-native chaos, Chaos Mesh and Litmus are more practical:
```bash
# Install Chaos Mesh
curl -sSL https://mirrors.chaos-mesh.org/v2.6.3/install.sh | bash
```

Basic pod failure experiment:
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: pod-failure-experiment
spec:
  schedule: "*/10 * * * *"   # every 10 minutes
  type: PodChaos
  podChaos:
    action: pod-failure
    mode: one
    duration: 60s
    selector:
      namespaces:
        - production
      labelSelectors:
        app: api-server
```

This takes one API server pod out of service for 60 seconds every 10 minutes. If your autoscaler and health checks work correctly, this is invisible to users. If they don't, you'll find out now.
The GameDay
A GameDay is a scheduled chaos experiment involving the whole team.
Structure of a well-run GameDay:
Before:
- Define the scenario (network partition between services A and B)
- Agree on success criteria (error rate stays below 0.1%, P99 stays below 500ms)
- Identify rollback steps
- Notify stakeholders
- Establish the "steady state" baseline
During:
- Inject the failure
- Monitor dashboards
- Take notes on system and team behavior
- Roll back immediately if impact exceeds expected bounds
After:
- Document what happened
- File tickets for every weakness discovered
- Update runbooks
- Share findings across teams
GameDays start small (a single service in staging) and grow more ambitious as confidence and capability develop.
Principles for Getting Started
Start in staging, not production. Run your first experiments against a non-production environment. Once you have tooling, observability, and runbooks in place, graduate to production.
One failure at a time. Don't inject multiple failures simultaneously. You need to isolate cause and effect.
Have an abort condition. Define the conditions under which you stop: error rate exceeds 1%, a critical alert fires, or the on-call engineer calls a halt. Build the kill switch before starting.
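One concrete way to encode the "alert fires" condition is to ask Alertmanager whether anything page-worthy is currently firing, and halt if so. A rough sketch, assuming a reachable Alertmanager and a `severity: page` labelling convention — both are assumptions about your setup, not something the API requires:

```python
import requests

ALERTMANAGER = "http://alertmanager.example.internal:9093"   # assumption: your Alertmanager URL

def should_abort() -> bool:
    """Abort the experiment if any page-severity alert is currently firing."""
    resp = requests.get(
        f"{ALERTMANAGER}/api/v2/alerts", params={"active": "true"}, timeout=5
    )
    resp.raise_for_status()
    firing = [
        alert for alert in resp.json()
        if alert.get("labels", {}).get("severity") == "page"
    ]
    return len(firing) > 0
```

Whatever drives the experiment should run a check like this between observations, and a human should be able to trip the same switch manually.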
Measure before you inject. Chaos experiments require a baseline. If you don't know what normal looks like, you can't tell if the experiment degraded it.
Fix what you find. Chaos experiments that reveal weaknesses but don't result in fixes are theater. Every finding should produce a ticket.
What to Inject First
Easy starting points:
| Failure | Expected behavior | Common gap |
|---|---|---|
| Kill one instance/pod | Traffic routes to healthy instances | Health checks too slow, no retries |
| Kill all instances of a service | Dependent services degrade gracefully | No circuit breaker, hard dependency |
| Add 200ms latency to database | App stays responsive, queries time out gracefully | No timeout configured, cascade failure |
| Exhaust disk space on a node | App fails gracefully, alert fires | No disk monitoring, writes fail silently |
| Terminate a worker process | Job queue drains without loss | No ack timeout, jobs dropped |
Start with the failure most likely to happen. Look at your incident history — what caused the last three outages? Start there.
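For the 200ms-latency row, the injection itself can be as small as a traffic-control rule on the database host (on Kubernetes, a Chaos Mesh NetworkChaos resource does the same job). A rough sketch using `tc netem`, assuming root on the target host and that `eth0` is the interface carrying database traffic; verify both before running it anywhere:

```python
import subprocess

IFACE = "eth0"   # assumption: the interface carrying database traffic

def inject_latency(ms: int = 200) -> None:
    # Adds a netem qdisc that delays all egress packets on IFACE by `ms` milliseconds.
    # Requires root (or CAP_NET_ADMIN) on the host where it runs.
    subprocess.run(
        ["tc", "qdisc", "add", "dev", IFACE, "root", "netem", "delay", f"{ms}ms"],
        check=True,
    )

def remove_latency() -> None:
    # Removes the root qdisc again, restoring normal latency.
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root"], check=True)
```

These two functions drop straight into the inject/rollback hooks of the experiment harness sketched earlier.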
Observability Requirements
You can't run chaos experiments without observability. At minimum, you need:
- Metrics: error rate, latency (P50, P95, P99), throughput, saturation
- Alerts: firing when user-visible impact exceeds threshold
- Dashboards: real-time view of system state during experiments
- Logs: structured, searchable, correlated with request IDs
If you can't answer "is the system in steady state right now?", you can't run chaos experiments safely.
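Concretely, answering that question can be as simple as a couple of PromQL queries snapshotted on a schedule. The metric names below (`http_requests_total`, `http_request_duration_seconds_bucket`) and the Prometheus URL are assumptions; substitute whatever your services actually export.

```python
import requests

PROMETHEUS = "http://prometheus.example.internal:9090"   # assumption: your Prometheus URL

STEADY_STATE_QUERIES = {
    # fraction of requests returning 5xx over the last 5 minutes
    "error_rate": 'sum(rate(http_requests_total{status=~"5.."}[5m])) '
                  '/ sum(rate(http_requests_total[5m]))',
    # 99th percentile request latency in seconds
    "p99_latency": 'histogram_quantile(0.99, '
                   'sum(rate(http_request_duration_seconds_bucket[5m])) by (le))',
}

def query(promql: str) -> float:
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": promql}, timeout=5)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else float("nan")

def steady_state() -> dict:
    """Snapshot of the steady-state metrics."""
    return {name: query(q) for name, q in STEADY_STATE_QUERIES.items()}
```

Take one snapshot before the experiment, keep polling during it, and compare afterwards.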
Chaos Engineering Tools
| Tool | Best For | License |
|---|---|---|
| Chaos Monkey | AWS EC2 termination | Open source |
| Chaos Mesh | Kubernetes | Open source |
| Litmus | Kubernetes, GitOps workflows | Open source |
| Gremlin | Full-featured, SaaS | Commercial |
| AWS Fault Injection Service | AWS-native | Commercial |
| k6 | Application-level, scripted | Open source |
Tools are not the bottleneck. The bottleneck is observability, team buy-in, and the discipline to act on findings.
Is Your System Ready for Chaos?
Signs you're ready:
- You have an on-call rotation with runbooks
- Incidents are documented and reviewed
- You have basic metrics and alerting
- Services have health checks and circuit breakers
Signs you're not ready:
- No alerting in place
- Single points of failure you know about but haven't fixed
- No rollback process for deployments
- Teams don't have insight into what "normal" looks like
Fix the known weaknesses first. Chaos engineering is for discovering unknown weaknesses — don't use it to rediscover what you already know.
The Goal
The goal of chaos engineering isn't to cause outages. It's to develop confidence that your system handles failure gracefully, and to discover the cases where it doesn't before your users do.
Teams that practice chaos engineering regularly report fewer incidents, shorter mean time to recovery, and more confidence shipping changes to production. That's the outcome — not the experiments themselves, but the resilience improvements they drive.