Chaos Monkey: Netflix's Tool for Building Resilient Systems

Netflix streams to over 230 million subscribers worldwide. When a server fails at 9 PM on a Friday, the engineers on call cannot afford to learn for the first time that their service cannot handle the outage. Chaos Monkey was born from this exact problem: a tool that randomly terminates production instances so that teams are forced to build services that survive failure gracefully—not because a post-mortem demanded it, but because the system was tested under real conditions every single day.

This guide covers what Chaos Monkey is, how the broader Simian Army fits together, how to wire it into a Spring Boot application, and how to make chaos testing a first-class part of your CI/CD pipeline.

What Is Chaos Monkey?

Chaos Monkey is an open-source resilience tool originally developed by Netflix. Its job is simple: it randomly selects virtual machine instances or containers within a configured scope and terminates them. The assumption is that failures will happen in production regardless of how carefully you build your system. The question is whether your system recovers automatically, or whether a human must intervene at 3 AM to restart a process.
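The selection step described above can be sketched in a few lines of plain Java. This is only an illustration of "pick one random instance inside a configured scope": the `Instance` record, the group names, and the `InstancePicker` class are hypothetical and are not part of any Netflix or cloud SDK. A real tool would call the provider's terminate API where this sketch prints a message.

```java
import java.util.List;
import java.util.Optional;
import java.util.Random;

public class InstancePicker {

    /** Hypothetical stand-in for a cloud instance; not a real SDK type. */
    public record Instance(String id, String group) {}

    private final Random random;

    public InstancePicker(Random random) {
        this.random = random;
    }

    /** Picks one random victim among the instances that belong to the given group. */
    public Optional<Instance> pickVictim(List<Instance> fleet, String group) {
        List<Instance> inScope = fleet.stream()
                .filter(i -> i.group().equals(group))
                .toList();
        if (inScope.isEmpty()) {
            return Optional.empty(); // nothing in scope, so nothing to terminate
        }
        return Optional.of(inScope.get(random.nextInt(inScope.size())));
    }

    public static void main(String[] args) {
        List<Instance> fleet = List.of(
                new Instance("i-001", "checkout"),
                new Instance("i-002", "checkout"),
                new Instance("i-003", "search"));
        InstancePicker picker = new InstancePicker(new Random());
        // Only the "checkout" group is in scope; "search" is never touched.
        picker.pickVictim(fleet, "checkout")
              .ifPresent(v -> System.out.println("Would terminate " + v.id()));
    }
}
```

Restricting the candidate list before the random pick is what keeps the blast radius bounded: instances outside the configured scope can never be selected.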

The name comes from the idea of a wild monkey running through a data center, randomly pulling cables. Controlled chaos in a test or staging environment is far less dangerous than uncontrolled chaos in production—but Netflix eventually ran Chaos Monkey in production too, because staging environments rarely reflect real traffic and dependency patterns accurately.

The original Netflix implementation targeted AWS Auto Scaling Groups. A separate open-source project, Chaos Monkey for Spring Boot (developed by codecentric and widely adopted), targets Spring Boot applications running in any environment.

The Simian Army

Netflix did not stop at Chaos Monkey. They built a whole "Simian Army" of failure-injection tools, each targeting a different failure mode:

  • Chaos Monkey — terminates random instances
  • Latency Monkey — injects artificial delays into RESTful client-server communication
  • Conformity Monkey — finds instances not following best practices and shuts them down
  • Doctor Monkey — checks health indicators on each instance and removes unhealthy instances from service
  • Janitor Monkey — finds and removes unused resources to reduce clutter and cost
  • Security Monkey — detects security violations and misconfigurations
  • 10-18 Monkey — detects problems with applications serving customers in multiple regions with different languages and character sets
  • Chaos Gorilla — simulates the outage of an entire AWS Availability Zone

Most of these tools are conceptual ancestors of modern observability and cloud-native tooling. Chaos Monkey itself remains the most widely replicated concept, and the Spring Boot variant is the most accessible starting point for teams not running on Netflix's infrastructure.

Chaos Monkey for Spring Boot

The Spring Boot variant of Chaos Monkey provides an assault model: it intercepts Spring beans and randomly introduces failures such as exceptions, latency, or application kills. It is configured entirely through Spring properties, which makes it easy to enable per environment.

Adding the Dependency

<!-- pom.xml -->
<dependency>
    <groupId>de.codecentric</groupId>
    <artifactId>chaos-monkey-spring-boot</artifactId>
    <version>3.1.0</version>
</dependency>

Or with Gradle:

// build.gradle
implementation 'de.codecentric:chaos-monkey-spring-boot:3.1.0'

Enabling Chaos Monkey

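Chaos Monkey is disabled by default. To activate it, start the application with the chaos-monkey Spring profile and set chaos.monkey.enabled to true. Keeping the settings in a profile-specific file means they are loaded only when that profile is active:

# application-chaos-monkey.yml
chaos:
  monkey:
    enabled: true
    assaults:
      level: 5                    # attack roughly 1 in N requests; 1 = every request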
      latencyActive: true
      latencyRangeStart: 1000     # ms
      latencyRangeEnd: 3000       # ms
      exceptionsActive: false
      killApplicationActive: false
    watcher:
      service: true               # watch @Service beans
      restController: true        # watch @RestController beans
      component: false
      repository: false
      restTemplate: false
      webClient: false

To launch with this profile:

java -jar myapp.jar --spring.profiles.active=chaos-monkey
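To make the level and latency-range settings more concrete, here is a minimal sketch of how such sampling could work. It assumes that level N means "attack roughly one in N calls" and that the latency is drawn uniformly from the configured range. The `AssaultSampler` class is hypothetical and is not the library's actual implementation, which may differ in detail.

```java
import java.util.Random;

public class AssaultSampler {

    private final int level;
    private final Random random;

    public AssaultSampler(int level, Random random) {
        if (level < 1) {
            throw new IllegalArgumentException("level must be >= 1");
        }
        this.level = level;
        this.random = random;
    }

    /** Returns true for roughly one call in `level`; level 1 attacks every call. */
    public boolean shouldAttack() {
        return random.nextInt(level) == 0;
    }

    /** Picks a delay in [startMs, endMs], mirroring latencyRangeStart and latencyRangeEnd. */
    public int pickLatencyMs(int startMs, int endMs) {
        return startMs + random.nextInt(endMs - startMs + 1);
    }

    public static void main(String[] args) {
        AssaultSampler sampler = new AssaultSampler(5, new Random());
        int attacked = 0;
        for (int i = 0; i < 1_000; i++) {
            if (sampler.shouldAttack()) {
                attacked++;
            }
        }
        // With level 5, about 200 of 1,000 calls are expected to be attacked.
        System.out.println("Attacked calls: " + attacked);
        System.out.println("Sample delay: " + sampler.pickLatencyMs(1000, 3000) + " ms");
    }
}
```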

Configuring Watchers

Watchers tell Chaos Monkey which Spring beans to assault. You can target repositories, services, REST controllers, and more:

chaos:
  monkey:
    watcher:
      service: true
      repository: true
      restController: true
      component: false
      restTemplate: true    # outbound HTTP calls
      webClient: true       # reactive outbound calls

Targeting repositories is particularly useful for simulating database latency. Targeting restTemplate or webClient simulates downstream service degradation without actually breaking a downstream service.

Using the Actuator Endpoint

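Chaos Monkey exposes a management endpoint via Spring Boot Actuator. The endpoint must be enabled and exposed first (management.endpoint.chaosmonkey.enabled=true, and chaosmonkey added to management.endpoints.web.exposure.include). After that, you can change assault configuration at runtime without restarting: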

# Enable latency assaults at runtime
curl -X POST http://localhost:8080/actuator/chaosmonkey/assaults \
  -H 'Content-Type: application/json' \
  -d '{
    "latencyActive": true,
    "latencyRangeStart": 2000,
    "latencyRangeEnd": 5000,
    "level": 3
  }'

# Check current configuration
curl http://localhost:8080/actuator/chaosmonkey

# Disable Chaos Monkey entirely
curl -X POST http://localhost:8080/actuator/chaosmonkey/disable

This runtime configurability is key for GameDay exercises: you can increase assault intensity step by step and observe behavior without a deploy cycle.
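For scripted GameDays, the same calls can be issued from Java with the JDK's built-in HTTP client. The sketch below reuses the assaults URL and JSON field names from the curl example. The `ChaosMonkeyClient` class, the three latency steps, and the 60-second pause are illustrative assumptions, and the `main` method expects an application with the endpoint exposed on localhost:8080.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ChaosMonkeyClient {

    private final String baseUrl;

    public ChaosMonkeyClient(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    /** Builds the JSON body for a latency assault; field names follow the curl example. */
    public static String latencyAssaultJson(int level, int startMs, int endMs) {
        return "{\"latencyActive\":true,\"latencyRangeStart\":" + startMs
                + ",\"latencyRangeEnd\":" + endMs + ",\"level\":" + level + "}";
    }

    /** Builds (but does not send) the POST request that updates the assault configuration. */
    public HttpRequest assaultRequest(String json) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/actuator/chaosmonkey/assaults"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    public static void main(String[] args) throws Exception {
        ChaosMonkeyClient client = new ChaosMonkeyClient("http://localhost:8080");
        HttpClient http = HttpClient.newHttpClient();
        // Hypothetical ramp: widen the latency range in three steps.
        int[][] steps = {{500, 1000}, {1000, 2000}, {2000, 5000}};
        for (int[] step : steps) {
            HttpRequest request = client.assaultRequest(latencyAssaultJson(3, step[0], step[1]));
            HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(step[0] + "-" + step[1] + " ms -> HTTP " + response.statusCode());
            Thread.sleep(60_000); // leave time to watch dashboards before the next step
        }
    }
}
```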

Exception Assaults

Beyond latency, you can configure Chaos Monkey to throw runtime exceptions on a percentage of requests:

chaos:
  monkey:
    assaults:
      level: 3
      exceptionsActive: true
      exception:
        type: java.lang.RuntimeException
        arguments:
          - type: java.lang.String
            value: "Chaos Monkey was here"

This forces you to verify that your exception handling, circuit breakers, and fallback logic actually work under production-like conditions rather than in isolated unit tests.

Integrating with Resilience4j

Chaos Monkey is most valuable when combined with a resilience library like Resilience4j. The circuit breaker should open when Chaos Monkey starts injecting failures, and fallback methods should execute automatically.

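import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;
import io.github.resilience4j.timelimiter.annotation.TimeLimiter;
import java.util.concurrent.CompletableFuture;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Service;

@Service
public class ProductService {

    private static final Logger log = LoggerFactory.getLogger(ProductService.class);
    private final ProductRepository repository;

    // Constructor injection: Spring supplies the repository bean
    public ProductService(ProductRepository repository) {
        this.repository = repository;
    }

    // All three annotations refer to the "productService" instances configured below
    @CircuitBreaker(name = "productService", fallbackMethod = "getProductFallback")
    @Retry(name = "productService")
    @TimeLimiter(name = "productService")
    public CompletableFuture<Product> getProduct(Long id) {
        return CompletableFuture.supplyAsync(() -> repository.findById(id)
            .orElseThrow(() -> new ProductNotFoundException(id)));
    }

    // Fallback: same parameters as getProduct plus the exception that triggered it
    public CompletableFuture<Product> getProductFallback(Long id, Exception ex) {
        log.warn("Fallback triggered for product {}: {}", id, ex.getMessage());
        return CompletableFuture.completedFuture(Product.defaultProduct(id));
    }
}

# Resilience4j configuration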
resilience4j:
  circuitbreaker:
    instances:
      productService:
        slidingWindowSize: 10
        failureRateThreshold: 50
        waitDurationInOpenState: 10s
        permittedNumberOfCallsInHalfOpenState: 3
  retry:
    instances:
      productService:
        maxAttempts: 3
        waitDuration: 500ms
  timelimiter:
    instances:
      productService:
        timeoutDuration: 2s

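When Chaos Monkey injects latency longer than the 2-second TimeLimiter timeout (the 1000 to 3000 ms range configured earlier exceeds it on some calls), those calls time out and count as failures. Once at least 50% of the last 10 calls have failed, the circuit breaker should open and start routing requests to the fallback. If it does not, you have a configuration problem that Chaos Monkey just helped you discover before a real outage did.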
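The arithmetic behind that expectation can be checked with a small count-based sliding window. The sketch below is a simplified stand-in and not Resilience4j's implementation: the `SlidingWindowBreaker` class is hypothetical, it has no half-open state, and it assumes the failure rate is evaluated only once the window is full and that a rate equal to the threshold opens the breaker.

```java
public class SlidingWindowBreaker {

    private final boolean[] window;            // true = the call failed
    private final double failureRateThreshold; // percent, e.g. 50
    private int recorded = 0;
    private int next = 0;
    private boolean open = false;

    public SlidingWindowBreaker(int slidingWindowSize, double failureRateThreshold) {
        this.window = new boolean[slidingWindowSize];
        this.failureRateThreshold = failureRateThreshold;
    }

    /** Records one call outcome and opens the breaker when the threshold is reached. */
    public void record(boolean failed) {
        window[next] = failed;
        next = (next + 1) % window.length;
        if (recorded < window.length) {
            recorded++;
        }
        // Evaluate only when the window is full (simplifying assumption).
        if (recorded == window.length && failureRate() >= failureRateThreshold) {
            open = true;
        }
    }

    /** Failure rate in percent over the calls recorded so far. */
    public double failureRate() {
        if (recorded == 0) {
            return 0.0;
        }
        int failures = 0;
        for (int i = 0; i < recorded; i++) {
            if (window[i]) {
                failures++;
            }
        }
        return 100.0 * failures / recorded;
    }

    public boolean isOpen() {
        return open;
    }

    public static void main(String[] args) {
        // Hypothetical injected latencies in ms; anything over 2000 ms counts as a timeout.
        int[] latenciesMs = {1200, 2500, 3000, 1500, 2800, 1000, 2100, 1900, 2999, 1100};
        SlidingWindowBreaker breaker = new SlidingWindowBreaker(10, 50);
        for (int latency : latenciesMs) {
            breaker.record(latency > 2000);
        }
        System.out.println("failureRate=" + breaker.failureRate() + " open=" + breaker.isOpen());
    }
}
```

In the sample data, five of the ten calls exceed the 2-second timeout, which is exactly the 50% threshold, so the breaker opens on the tenth call.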

CI/CD Integration

Running chaos experiments in CI has one important precondition: you need an environment that behaves like production, meaning real databases, real message queues, and real downstream dependencies (or close approximations via contract tests). Chaos Monkey in CI is most useful in a dedicated chaos stage that runs after normal integration tests pass.

GitHub Actions Example

# .github/workflows/chaos.yml
name: Chaos Engineering

on:
  schedule:
    - cron: '0 2 * * 1-5'   # Weeknight runs
  workflow_dispatch:          # Manual trigger for GameDays

jobs:
  chaos-test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: test
        ports:
          - 5432:5432

    steps:
      - uses: actions/checkout@v4

      - name: Set up JDK 21
        uses: actions/setup-java@v4
        with:
          java-version: '21'
          distribution: 'temurin'

      - name: Start application with Chaos Monkey
        run: |
          mvn spring-boot:run \
            -Dspring-boot.run.profiles=chaos-monkey,test \
            -Dspring-boot.run.arguments="--chaos.monkey.enabled=true" &
          echo $! > app.pid
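          # Wait up to about 60 s for the health endpoint instead of sleeping blindly
          for i in $(seq 1 30); do
            curl -sf http://localhost:8080/actuator/health > /dev/null && break
            sleep 2
          done
          # Fail the step if the application never became healthy
          curl -sf http://localhost:8080/actuator/health > /dev/null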

      - name: Run chaos scenario tests
        run: |
          # Enable latency assault
          curl -X POST http://localhost:8080/actuator/chaosmonkey/assaults \
            -H 'Content-Type: application/json' \
            -d '{"latencyActive":true,"latencyRangeStart":1000,"latencyRangeEnd":3000,"level":5}'

          # Run test scenarios against the app
          mvn test -Dtest=ChaosScenarioTest -Dsurefire.failIfNoSpecifiedTests=false

      - name: Collect metrics
        if: always()
        run: |
          curl http://localhost:8080/actuator/metrics/resilience4j.circuitbreaker.calls
          curl http://localhost:8080/actuator/health

      - name: Stop application
        if: always()
        run: kill $(cat app.pid) || true

Best Practices

Start with latency, not kills. Latency is the most common real-world failure mode—a database that is slow degrades your service long before it fails completely. Latency assaults reveal timeout misconfiguration, missing circuit breakers, and thread pool exhaustion far more often than kill assaults do.

Define your steady-state hypothesis first. Before running any experiment, write down what "normal" looks like: p99 response time under 200ms, error rate below 0.1%, no dead-letter queue growth. Run Chaos Monkey and compare observed metrics against the hypothesis. An experiment without a hypothesis is just noise.
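A steady-state hypothesis is easiest to enforce when it is written as a check that can run automatically. The sketch below uses the example thresholds from the paragraph above (p99 under 200 ms, error rate below 0.1%). The `SteadyStateCheck` class and the nearest-rank percentile method are assumptions made for illustration; in practice the samples would come from your metrics backend.

```java
import java.util.Arrays;

public class SteadyStateCheck {

    /** Thresholds that define "normal": p99 latency and error rate must stay below these. */
    public record Hypothesis(double maxP99Ms, double maxErrorRate) {}

    /** Nearest-rank percentile of the samples; pct is in the range (0, 100]. */
    public static double percentile(double[] samples, double pct) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Returns true when the observed metrics stay within the hypothesis. */
    public static boolean holds(Hypothesis h, double[] latenciesMs, long errors, long total) {
        double p99 = percentile(latenciesMs, 99);
        double errorRate = total == 0 ? 0.0 : (double) errors / total;
        return p99 < h.maxP99Ms() && errorRate < h.maxErrorRate();
    }

    public static void main(String[] args) {
        Hypothesis hypothesis = new Hypothesis(200.0, 0.001);
        // Hypothetical latencies observed during an experiment, in milliseconds.
        double[] observed = {120, 95, 180, 150, 2100, 130, 110, 140, 160, 105};
        System.out.println("Steady state holds: " + holds(hypothesis, observed, 0, 10));
    }
}
```

Running the same check before and during the experiment separates problems that already existed from those the injected failure caused.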

Control the blast radius. Start with non-critical services or specific bean types. Do not enable Chaos Monkey on your payment processing service on day one. Expand scope as confidence grows.

Always have a kill switch. The Actuator endpoint is your kill switch. Ensure it is accessible during GameDay exercises and that on-call engineers know how to use it.

Run chaos in staging before production. This seems obvious but is frequently skipped. Staging runs reveal broken fallbacks and misconfigured timeouts without user impact.

Correlate with observability. Chaos Monkey's value is zero if you cannot observe the effects. Ensure distributed tracing, structured logging, and metrics dashboards are all in place before you start injecting failures.

Measuring Success

A chaos experiment is successful not when nothing breaks, but when the system recovers automatically within an acceptable time window and users experience degraded (not failed) service. Track:

  • MTTR (Mean Time to Recovery) — how long does it take for the circuit breaker to close and service to resume?
  • Error budget consumption — how much of your SLO error budget was consumed by the chaos event?
  • Fallback hit rate — what percentage of requests were served by fallback paths?
  • False alarm rate — did any alerts fire that required human intervention, or did the system self-heal?

Over time, repeated experiments should shorten MTTR and reduce both error budget consumption and the number of alerts that need human intervention. If they do not, the experiments are revealing real resilience gaps that need engineering work.
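Several of the metrics listed above reduce to simple arithmetic once the raw events are collected. The following sketch shows one possible way to compute them. The `ChaosMetrics` class, the `Recovery` record, and the choice to measure error budget against the total requests expected in the SLO window are assumptions, not a standard API.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

public class ChaosMetrics {

    /** One outage window: when the circuit breaker opened and when it closed again. */
    public record Recovery(Instant opened, Instant closed) {}

    /** Mean time to recovery across all recorded outage windows. */
    public static Duration meanTimeToRecovery(List<Recovery> recoveries) {
        if (recoveries.isEmpty()) {
            return Duration.ZERO;
        }
        long totalMs = 0;
        for (Recovery r : recoveries) {
            totalMs += Duration.between(r.opened(), r.closed()).toMillis();
        }
        return Duration.ofMillis(totalMs / recoveries.size());
    }

    /** Fraction of requests that were answered by a fallback path. */
    public static double fallbackHitRate(long fallbackResponses, long totalRequests) {
        return totalRequests == 0 ? 0.0 : (double) fallbackResponses / totalRequests;
    }

    /**
     * Fraction of the error budget consumed by the experiment.
     * sloTarget is a fraction such as 0.999 and must be less than 1.
     */
    public static double errorBudgetConsumed(long failedRequests, long requestsInSloWindow, double sloTarget) {
        double allowedFailures = (1.0 - sloTarget) * requestsInSloWindow;
        return failedRequests / allowedFailures;
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2024-01-01T02:00:00Z"); // hypothetical experiment start
        List<Recovery> recoveries = List.of(
                new Recovery(start, start.plusSeconds(10)),
                new Recovery(start.plusSeconds(60), start.plusSeconds(80)));
        System.out.println("MTTR: " + meanTimeToRecovery(recoveries).toSeconds() + " s");
        System.out.println("Fallback hit rate: " + fallbackHitRate(250, 1_000));
        System.out.println("Error budget consumed: " + errorBudgetConsumed(50, 1_000_000, 0.999));
    }
}
```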

Conclusion

Chaos Monkey forces a fundamental shift in how teams think about reliability. Instead of hoping failures do not happen, you design for the certainty that they will. The Spring Boot implementation makes this accessible to any team running a JVM-based microservices architecture. Start with a single service, a single assault type, and a clear hypothesis. Build confidence gradually. By the time you encounter an unplanned outage, your system will have already practiced recovering from it hundreds of times.
