Chaos Monkey: Netflix's Tool for Building Resilient Systems
Netflix streams to over 230 million subscribers worldwide. When a server fails at 9 PM on a Friday, that is the wrong moment for the engineers on call to discover that their service cannot survive the outage. Chaos Monkey was born from this exact problem: it randomly terminates production instances so that teams are forced to build services that survive failure gracefully, not because a post-mortem demanded it, but because the system is tested under real conditions every day.
This guide covers what Chaos Monkey is, how the broader Simian Army fits together, how to wire it into a Spring Boot application, and how to make chaos testing a first-class part of your CI/CD pipeline.
What Is Chaos Monkey?
Chaos Monkey is an open-source resilience tool originally developed by Netflix. Its job is simple: it randomly selects virtual machine instances or containers within a configured scope and terminates them. The assumption is that failures will happen in production regardless of how carefully you build your system. The question is whether your system recovers automatically, or whether a human must intervene at 3 AM to restart a process.
The name comes from the idea of a wild monkey running through a data center, randomly pulling cables. Controlled chaos in a test or staging environment is far less dangerous than uncontrolled chaos in production—but Netflix eventually ran Chaos Monkey in production too, because staging environments rarely reflect real traffic and dependency patterns accurately.
The original Netflix implementation targeted instances in AWS Auto Scaling Groups. Chaos Monkey for Spring Boot, a separate open-source project developed by codecentric and widely adopted, applies the same idea inside Spring Boot applications running in any environment.
The Simian Army
Netflix did not stop at Chaos Monkey. They built a whole "Simian Army" of failure-injection tools, each targeting a different failure mode:
- Chaos Monkey — terminates random instances
- Latency Monkey — injects artificial delays into RESTful client-server communication
- Conformity Monkey — finds instances not following best practices and shuts them down
- Doctor Monkey — checks health indicators on each instance and removes unhealthy instances from service
- Janitor Monkey — finds and removes unused resources to reduce clutter and cost
- Security Monkey — detects security violations and misconfigurations
- 10-18 Monkey — detects problems with applications serving customers in multiple regions with different languages and character sets
- Chaos Gorilla — simulates the outage of an entire AWS Availability Zone
Netflix has since retired the Simian Army as a project, but many of its ideas live on in modern observability, compliance, and cloud-native tooling. Chaos Monkey itself remains the most widely replicated concept, and the Spring Boot variant is the most accessible starting point for teams not running on Netflix's infrastructure.
Chaos Monkey for Spring Boot
The Spring Boot variant of Chaos Monkey provides an assault model: it intercepts Spring beans and randomly introduces failures such as exceptions, latency, or application kills. It is configured entirely through Spring properties, which makes it easy to enable per environment.
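To make the assault model concrete, here is a minimal plain-Java sketch of the idea. It is not the library's implementation: it uses a JDK dynamic proxy where the real tool uses Spring AOP, and the names ChaosProxySketch, Greeter, and wrap are hypothetical. It also assumes that a level of N means roughly one call in N is attacked.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: a JDK dynamic proxy standing in for AOP-based interception.
final class ChaosProxySketch {

    // Hypothetical bean interface used for the demonstration.
    interface Greeter {
        String greet(String name);
    }

    // Counts attacked calls so the effect can be observed.
    static final AtomicInteger attacks = new AtomicInteger();

    // Wraps target so that roughly one call in `level` is delayed by a random
    // duration in [minMs, maxMs] before being passed on to the real object.
    @SuppressWarnings("unchecked")
    static <T> T wrap(T target, Class<T> type, int level, int minMs, int maxMs, Random random) {
        return (T) Proxy.newProxyInstance(
            type.getClassLoader(),
            new Class<?>[] {type},
            (proxy, method, args) -> {
                if (random.nextInt(level) == 0) {
                    attacks.incrementAndGet();
                    Thread.sleep(minMs + random.nextInt(maxMs - minMs + 1));
                }
                try {
                    return method.invoke(target, args);
                } catch (InvocationTargetException e) {
                    throw e.getCause(); // rethrow the target's own exception
                }
            });
    }

    public static void main(String[] args) {
        Greeter real = name -> "hello " + name;
        Greeter chaotic = wrap(real, Greeter.class, 5, 10, 30, new Random());
        for (int i = 0; i < 20; i++) {
            chaotic.greet("world");
        }
        System.out.println("attacked " + attacks.get() + " of 20 calls");
    }
}
```

The caller never sees the proxy: it holds a Greeter and simply experiences slower calls, which is what makes this style of injection useful for testing timeouts.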
Adding the Dependency
<!-- pom.xml -->
<dependency>
    <groupId>de.codecentric</groupId>
    <artifactId>chaos-monkey-spring-boot</artifactId>
    <version>3.1.0</version>
</dependency>
Or with Gradle:
// build.gradle
implementation 'de.codecentric:chaos-monkey-spring-boot:3.1.0'
Enabling Chaos Monkey
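Chaos Monkey is disabled by default. The usual setup is to start the application with the chaos-monkey Spring profile and set chaos.monkey.enabled to true. Keeping the settings in a profile-specific file means they are loaded only when that profile is active. Do not set spring.profiles.active inside a profile-specific file, because recent Spring Boot versions reject it there.
# application-chaos-monkey.yml (loaded only with the chaos-monkey profile)
chaos:
  monkey:
    enabled: true
    assaults:
      level: 5 # attack roughly every 5th request; 1 = every request
      latencyActive: true
      latencyRangeStart: 1000 # ms
      latencyRangeEnd: 3000 # ms
      exceptionsActive: false
      killApplicationActive: false
    watcher:
      service: true # watch @Service beans
      restController: true # watch @RestController beans
      component: false
      repository: false
      restTemplate: false
      webClient: false
To launch with this profile:
java -jar myapp.jar --spring.profiles.active=chaos-monkey
Configuring Watchers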
Watchers tell Chaos Monkey which Spring beans to assault. You can target repositories, services, REST controllers, and more:
chaos:
  monkey:
    watcher:
      service: true
      repository: true
      restController: true
      component: false
      restTemplate: true # outbound HTTP calls
      webClient: true # reactive outbound calls
Targeting repositories is particularly useful for simulating database latency. Targeting restTemplate or webClient simulates downstream service degradation without actually breaking a downstream service.
Using the Actuator Endpoint
Chaos Monkey exposes a management endpoint via Spring Boot Actuator, which lets you change assault configuration at runtime without restarting. The endpoint has to be enabled and exposed first, for example with management.endpoint.chaosmonkey.enabled=true and chaosmonkey listed in management.endpoints.web.exposure.include.
# Enable latency assaults at runtime
curl -X POST http://localhost:8080/actuator/chaosmonkey/assaults \
  -H 'Content-Type: application/json' \
  -d '{
    "latencyActive": true,
    "latencyRangeStart": 2000,
    "latencyRangeEnd": 5000,
    "level": 3
  }'
# Check current configuration
curl http://localhost:8080/actuator/chaosmonkey
# Disable Chaos Monkey entirely
curl -X POST http://localhost:8080/actuator/chaosmonkey/disable
This runtime configurability is key for GameDay exercises: you can ramp up assault intensity incrementally and observe behavior without a deploy cycle.
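A GameDay ramp can also be scripted. The sketch below uses only the JDK's java.net.http client to build the same JSON body as the curl call and post it for a series of levels. The class name AssaultRamp, the base URL, the chosen levels, and the one-minute pause are assumptions for illustration. It also assumes that a lower level value means more frequent attacks (level N hitting roughly every Nth request), so the ramp counts down.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper for stepping through assault levels during a GameDay.
final class AssaultRamp {

    // Builds the JSON body for POST /actuator/chaosmonkey/assaults.
    static String payload(int level, int startMs, int endMs) {
        return "{\"latencyActive\":true"
            + ",\"latencyRangeStart\":" + startMs
            + ",\"latencyRangeEnd\":" + endMs
            + ",\"level\":" + level + "}";
    }

    // Sends one assault configuration and returns the HTTP status code.
    static int apply(HttpClient client, String baseUrl, String json)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest
            .newBuilder(URI.create(baseUrl + "/actuator/chaosmonkey/assaults"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(json))
            .build();
        return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
    }

    // Steps from gentle to aggressive, pausing so dashboards can be observed.
    static void ramp(String baseUrl) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        for (int level : new int[] {10, 5, 2}) {
            int status = apply(client, baseUrl, payload(level, 1000, 3000));
            System.out.println("level " + level + " -> HTTP " + status);
            Thread.sleep(60_000); // assumed observation window per step
        }
    }
}
```

Calling AssaultRamp.ramp("http://localhost:8080") against a running instance would walk through the three steps. Stopping the experiment remains a separate, deliberate action.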
Exception Assaults
Beyond latency, you can configure Chaos Monkey to throw runtime exceptions on a percentage of requests:
chaos:
  monkey:
    assaults:
      level: 3
      exceptionsActive: true
      exception:
        type: java.lang.RuntimeException
        arguments:
          - type: java.lang.String
            value: "Chaos Monkey was here"
This forces you to verify that your exception handling, circuit breakers, and fallback logic actually work under production-like conditions rather than in isolated unit tests.
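One way to read the type and arguments pair is as a constructor call made by reflection. The following sketch shows how such a configuration could be turned into an exception instance. It is an illustration of the concept, not the library's code, and the class name ExceptionAssaultSketch is hypothetical.

```java
import java.lang.reflect.Constructor;

// Illustrative only: builds an exception from a class name and a String argument.
final class ExceptionAssaultSketch {

    // Looks up a (String) constructor on the configured type and invokes it.
    // Misconfiguration (unknown class, no such constructor) surfaces as an
    // IllegalArgumentException instead of a checked reflection error.
    static RuntimeException build(String typeName, String message) {
        try {
            Class<? extends RuntimeException> type =
                Class.forName(typeName).asSubclass(RuntimeException.class);
            Constructor<? extends RuntimeException> ctor = type.getConstructor(String.class);
            return ctor.newInstance(message);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot build " + typeName, e);
        }
    }

    public static void main(String[] args) {
        RuntimeException ex = build("java.lang.RuntimeException", "Chaos Monkey was here");
        // prints "java.lang.RuntimeException: Chaos Monkey was here"
        System.out.println(ex.getClass().getName() + ": " + ex.getMessage());
    }
}
```

Choosing an exception type your code actually handles, such as a timeout or I/O failure wrapper, makes the experiment closer to a real outage than a bare RuntimeException.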
Integrating with Resilience4j
Chaos Monkey is most valuable when combined with a resilience library like Resilience4j. The circuit breaker should open when Chaos Monkey starts injecting failures, and fallback methods should execute automatically.
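import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;
import io.github.resilience4j.timelimiter.annotation.TimeLimiter;
import java.util.concurrent.CompletableFuture;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Service;
@Service
public class ProductService {
    private static final Logger log = LoggerFactory.getLogger(ProductService.class);
    private final ProductRepository repository;
    public ProductService(ProductRepository repository) {
        this.repository = repository;
    }
    @CircuitBreaker(name = "productService", fallbackMethod = "getProductFallback")
    @Retry(name = "productService")
    @TimeLimiter(name = "productService")
    public CompletableFuture<Product> getProduct(Long id) {
        return CompletableFuture.supplyAsync(() -> repository.findById(id)
            .orElseThrow(() -> new ProductNotFoundException(id)));
    }
    // Fallback: same parameters as getProduct, plus the exception that triggered it.
    public CompletableFuture<Product> getProductFallback(Long id, Exception ex) {
        log.warn("Fallback triggered for product {}: {}", id, ex.getMessage());
        return CompletableFuture.completedFuture(Product.defaultProduct(id));
    }
}
# Resilience4j configuration
resilience4j:
  circuitbreaker:
    instances:
      productService:
        slidingWindowSize: 10
        failureRateThreshold: 50
        waitDurationInOpenState: 10s
        permittedNumberOfCallsInHalfOpenState: 3
  retry:
    instances:
      productService:
        maxAttempts: 3
        waitDuration: 500ms
  timelimiter:
    instances:
      productService:
        timeoutDuration: 2s
When Chaos Monkey injects latency longer than the TimeLimiter's 2-second timeout, which the 1000-3000 ms range configured earlier will do on some calls, those calls time out and count as failures. After enough of them, the circuit breaker should trip and start routing to the fallback. If it does not, you have a configuration problem that Chaos Monkey just helped you discover before a real outage did.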
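To see why "enough failures" trips the breaker, the sketch below models a count-based sliding window with the same numbers as the configuration (window of 10, threshold of 50 percent). It is a deliberately simplified stand-in, not Resilience4j's implementation: it only models the transition from closed to open, ignores the half-open state, and assumes the window must be full before the rate is evaluated. The class name SlidingWindowBreakerSketch is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified count-based circuit breaker: models CLOSED -> OPEN only.
final class SlidingWindowBreakerSketch {
    private final int windowSize;
    private final double failureRateThreshold; // percent
    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failed call
    private boolean open = false;

    SlidingWindowBreakerSketch(int windowSize, double failureRateThreshold) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
    }

    // Records one call outcome and opens the breaker once the window is full
    // and the failure rate reaches the threshold.
    void record(boolean failed) {
        window.addLast(failed);
        if (window.size() > windowSize) {
            window.removeFirst(); // keep only the most recent calls
        }
        if (window.size() == windowSize && failureRate() >= failureRateThreshold) {
            open = true;
        }
    }

    // Failure rate over the current window, in percent.
    double failureRate() {
        if (window.isEmpty()) {
            return 0.0;
        }
        long failures = window.stream().filter(f -> f).count();
        return 100.0 * failures / window.size();
    }

    boolean isOpen() {
        return open;
    }

    public static void main(String[] args) {
        SlidingWindowBreakerSketch breaker = new SlidingWindowBreakerSketch(10, 50.0);
        long timeoutMs = 2000;
        // Assumed sample latencies; 6 of these 10 calls exceed the timeout.
        long[] latenciesMs = {300, 2500, 2900, 400, 2200, 2700, 350, 2600, 2100, 500};
        for (long latency : latenciesMs) {
            breaker.record(latency > timeoutMs);
        }
        System.out.println("open = " + breaker.isOpen());
    }
}
```

With a 1000-3000 ms latency range and a 2-second timeout, only part of the attacked calls fail, so whether the breaker opens depends on both the assault level and the latency range. That interplay is worth checking explicitly in an experiment.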
CI/CD Integration
Running chaos experiments in CI has one important precondition: you need an environment that behaves like production, meaning real databases, real message queues, and real downstream dependencies (or close approximations, such as stubs verified by contract tests). Chaos Monkey in CI is most useful in a dedicated chaos stage that runs after normal integration tests pass.
GitHub Actions Example
# .github/workflows/chaos.yml
name: Chaos Engineering
on:
  schedule:
    - cron: '0 2 * * 1-5' # Weeknight runs (02:00 UTC, Monday to Friday)
  workflow_dispatch: # Manual trigger for GameDays
jobs:
  chaos-test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: test
        ports:
          - 5432:5432
    steps:
      - uses: actions/checkout@v4
      - name: Set up JDK 21
        uses: actions/setup-java@v4
        with:
          java-version: '21'
          distribution: 'temurin'
      - name: Start application with Chaos Monkey
        run: |
          mvn spring-boot:run \
            -Dspring-boot.run.profiles=chaos-monkey,test \
            -Dspring-boot.run.arguments="--chaos.monkey.enabled=true" &
          echo $! > app.pid
          # Poll the health endpoint (up to about 60 s) instead of sleeping a fixed time
          for i in $(seq 1 30); do
            curl -sf http://localhost:8080/actuator/health > /dev/null && break
            sleep 2
          done
      - name: Run chaos scenario tests
        run: |
          # Enable latency assault
          curl -X POST http://localhost:8080/actuator/chaosmonkey/assaults \
            -H 'Content-Type: application/json' \
            -d '{"latencyActive":true,"latencyRangeStart":1000,"latencyRangeEnd":3000,"level":5}'
          # Run test scenarios against the app
          mvn test -Dtest=ChaosScenarioTest -Dsurefire.failIfNoSpecifiedTests=false
      - name: Collect metrics
        if: always()
        run: |
          curl http://localhost:8080/actuator/metrics/resilience4j.circuitbreaker.calls
          curl http://localhost:8080/actuator/health
      - name: Stop application
        if: always()
        run: kill $(cat app.pid) || true
Best Practices
Start with latency, not kills. Slow dependencies are a very common real-world failure mode: a slow database degrades your service long before it fails completely. Latency assaults tend to reveal timeout misconfiguration, missing circuit breakers, and thread pool exhaustion, which kill assaults rarely exercise.
Define your steady-state hypothesis first. Before running any experiment, write down what "normal" looks like: p99 response time under 200ms, error rate below 0.1%, no dead-letter queue growth. Run Chaos Monkey and compare observed metrics against the hypothesis. An experiment without a hypothesis is just noise.
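A hypothesis is easier to hold yourself to when it is written as a check. The sketch below encodes two of the example thresholds above (p99 under 200 ms, error rate below 0.1%) in plain Java. The class name SteadyStateCheck and the nearest-rank percentile method are choices made for illustration, and the samples would come from your own metrics system.

```java
import java.util.Arrays;

// Hypothetical helper: evaluates observed metrics against a steady-state hypothesis.
final class SteadyStateCheck {

    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static long percentile(long[] samplesMs, double p) {
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    // True when both example thresholds from the hypothesis are met.
    static boolean holds(long[] latenciesMs, long errors, long totalRequests) {
        boolean latencyOk = percentile(latenciesMs, 99.0) < 200;       // p99 under 200 ms
        boolean errorsOk = (double) errors / totalRequests < 0.001;    // error rate below 0.1%
        return latencyOk && errorsOk;
    }

    public static void main(String[] args) {
        long[] latencies = new long[100];
        Arrays.fill(latencies, 50L);
        latencies[0] = 500L; // a single slow outlier does not move p99 here
        System.out.println("hypothesis holds: " + holds(latencies, 0, 1000));
    }
}
```

Running the same check before and during the experiment turns "compare observed metrics against the hypothesis" into a pass or fail result.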
Control the blast radius. Start with non-critical services or specific bean types. Do not enable Chaos Monkey on your payment processing service on day one. Expand scope as confidence grows.
Always have a kill switch. The Actuator endpoint is your kill switch. Ensure it is accessible during GameDay exercises and that on-call engineers know how to use it.
Run chaos in staging before production. This seems obvious but is frequently skipped. Staging runs reveal broken fallbacks and misconfigured timeouts without user impact.
Correlate with observability. Chaos Monkey's value is zero if you cannot observe the effects. Ensure distributed tracing, structured logging, and metrics dashboards are all in place before you start injecting failures.
Measuring Success
A chaos experiment is successful not when nothing breaks, but when the system recovers automatically within an acceptable time window and users experience degraded (not failed) service. Track:
- MTTR (Mean Time to Recovery) — how long does it take for the circuit breaker to close and service to resume?
- Error budget consumption — how much of your SLO error budget was consumed by the chaos event?
- Fallback hit rate — what percentage of requests were served by fallback paths?
- False alarm rate — did any alerts fire that required human intervention, or did the system self-heal?
Over time, chaos experiments should drive these numbers in the right direction. If they do not, the experiments are revealing real resilience gaps that need engineering work.
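Two of the metrics above reduce to simple arithmetic, sketched here in plain Java. The class name ChaosMetrics and the way the error budget is defined (the number of failed requests the SLO tolerates over a window) are assumptions for illustration; your SLO tooling may define the budget differently.

```java
// Hypothetical helper: arithmetic behind two of the tracked metrics.
final class ChaosMetrics {

    // Share of requests answered by a fallback path, in percent.
    static double fallbackHitRate(long fallbackResponses, long totalRequests) {
        return 100.0 * fallbackResponses / totalRequests;
    }

    // Share of the error budget consumed by an experiment, in percent.
    // The budget is the number of failures the SLO tolerates over the window:
    // (1 - sloTarget) * requestsInWindow.
    static double errorBudgetConsumed(long failedRequests, double sloTarget, long requestsInWindow) {
        double budget = (1.0 - sloTarget) * requestsInWindow;
        return 100.0 * failedRequests / budget;
    }

    public static void main(String[] args) {
        // Example: 250 of 1,000 requests during the experiment hit a fallback.
        System.out.println("fallback hit rate %: " + fallbackHitRate(250, 1_000));
        // Example: 500 failures against a 99.9% SLO over 1,000,000 requests.
        System.out.println("error budget consumed %: " + errorBudgetConsumed(500, 0.999, 1_000_000));
    }
}
```

In the second example the SLO tolerates about 1,000 failures over the window, so 500 failed requests use roughly half of the budget in a single experiment.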
Conclusion
Chaos Monkey forces a fundamental shift in how teams think about reliability. Instead of hoping failures do not happen, you design for the certainty that they will. The Spring Boot implementation makes this accessible to any team running a JVM-based microservices architecture. Start with a single service, a single assault type, and a clear hypothesis. Build confidence gradually. By the time you encounter an unplanned outage, your system will have already practiced recovering from it hundreds of times.