Chaos Engineering for Microservices: Breaking Things to Build Resilience

Distributed systems fail in ways you don't expect. Chaos engineering is the practice of intentionally introducing failures into your system to discover weaknesses before they cause production incidents. For microservices, it's not optional — it's how you gain confidence that your system actually works under adverse conditions.

The Core Premise

The term "chaos engineering" comes out of Netflix, whose Chaos Monkey tool randomly terminates production instances. The insight is simple: if random failures in production are inevitable, you should practice handling them in controlled conditions first.

A chaos experiment follows this pattern:

  1. Hypothesize — "If the inventory service becomes slow, checkout should still complete within 3 seconds by using cached inventory data"
  2. Define the steady state — measure normal behavior (error rate, latency, throughput)
  3. Introduce the chaos — make inventory service slow
  4. Observe — did the system stay in steady state?
  5. Remediate — if not, fix the weakness and repeat
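
The five steps above can be sketched as a small harness. This is a minimal illustration rather than a framework: `SteadyState`, `Measurement`, and `run_experiment` are hypothetical names, and the `measure`, `inject`, and `rollback` callables stand in for whatever metrics and fault-injection tooling you actually use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SteadyState:
    """Thresholds that define 'normal' for one experiment (illustrative)."""
    max_error_rate: float
    max_p99_seconds: float


@dataclass
class Measurement:
    error_rate: float
    p99_seconds: float


def within_steady_state(m: Measurement, limits: SteadyState) -> bool:
    return (m.error_rate <= limits.max_error_rate
            and m.p99_seconds <= limits.max_p99_seconds)


def run_experiment(
    measure: Callable[[], Measurement],
    inject: Callable[[], None],
    rollback: Callable[[], None],
    limits: SteadyState,
) -> bool:
    """Return True if the hypothesis held (the system stayed in steady state)."""
    baseline = measure()
    if not within_steady_state(baseline, limits):
        # Do not inject faults into a system that is already unhealthy.
        raise RuntimeError("system is not in steady state; aborting experiment")

    inject()
    try:
        during = measure()
    finally:
        rollback()  # always remove the fault, even if measuring fails

    return within_steady_state(during, limits)
```

A `False` result is the useful outcome here: it points at a weakness to remediate before repeating the experiment.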

Tools for Chaos Engineering

Toxiproxy (Local Development)

Toxiproxy is the easiest way to introduce network failures locally. It proxies TCP connections and lets you add "toxics" — latency, bandwidth limits, connection resets.

# Install Toxiproxy
brew install toxiproxy

# Start the proxy server
toxiproxy-server &

# Create a proxy for the database (point the application at localhost:25432)
toxiproxy-cli create postgres --listen localhost:25432 --upstream localhost:5432

# Add 500ms latency
toxiproxy-cli toxic add postgres --type latency --attribute latency=500

# Limit bandwidth to 100KB/s
toxiproxy-cli toxic add postgres --type bandwidth --attribute rate=100

# Delay closing of the connection by 200ms
toxiproxy-cli toxic add postgres --type slow_close --attribute delay=200

# Reset the connection after 1000ms
toxiproxy-cli toxic add postgres --type reset_peer --attribute timeout=1000

In your test suite, control Toxiproxy via its API:

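import time

import requests

TOXIPROXY_API = 'http://localhost:8474'
# order_data (used below) is assumed to be a sample checkout payload from your fixtures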

def add_latency(proxy_name, latency_ms):
    requests.post(f'{TOXIPROXY_API}/proxies/{proxy_name}/toxics', json={
        'name': 'latency',
        'type': 'latency',
        'attributes': {'latency': latency_ms}
    })

def remove_toxic(proxy_name, toxic_name):
    requests.delete(f'{TOXIPROXY_API}/proxies/{proxy_name}/toxics/{toxic_name}')

def test_checkout_handles_slow_database():
    add_latency('postgres', 2000)
    
    try:
        start = time.time()
        # Client-side timeout so the test itself cannot hang forever
        response = requests.post('http://localhost:3000/checkout', json=order_data, timeout=10)
        elapsed = time.time() - start
        
        # Should timeout and return a degraded response, not hang indefinitely
        assert response.status_code in (200, 503)
        assert elapsed < 5.0, "Request hung instead of timing out"
        
        if response.status_code == 503:
            assert 'retry_after' in response.json()
    finally:
        remove_toxic('postgres', 'latency')
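
The try/finally pattern above can be wrapped in a context manager so that cleanup is never forgotten. The sketch below is one possible shape, not part of Toxiproxy itself: `toxic` and `http_send` are hypothetical helpers, the default toxic name is an arbitrary choice, and the standard library's `urllib` is used in place of `requests` to keep the example dependency-free. The `toxicity` field is the Toxiproxy API's probability that the toxic is applied to a connection.

```python
import json
import urllib.request
from contextlib import contextmanager

TOXIPROXY_API = "http://localhost:8474"


def http_send(method, url, payload=None):
    """Send a JSON request to the Toxiproxy API with the standard library."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


@contextmanager
def toxic(proxy, toxic_type, attributes, name=None, toxicity=1.0, send=http_send):
    """Add a toxic on entry and always remove it on exit."""
    name = name or f"{toxic_type}_test"  # hypothetical default name
    send("POST", f"{TOXIPROXY_API}/proxies/{proxy}/toxics", {
        "name": name,
        "type": toxic_type,
        "toxicity": toxicity,  # probability the toxic applies to a connection
        "attributes": attributes,
    })
    try:
        yield name
    finally:
        send("DELETE", f"{TOXIPROXY_API}/proxies/{proxy}/toxics/{name}")
```

A test body then reads `with toxic('postgres', 'latency', {'latency': 2000}): ...`, and passing `toxicity=0.2` would affect roughly one connection in five.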

LitmusChaos (Kubernetes)

LitmusChaos is an open-source chaos engineering platform for Kubernetes. It provides a library of chaos experiments as Kubernetes custom resources.

# Install LitmusChaos
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v3.0.0.yaml

# Install the pod-delete experiment definition (URL quoted so the shell does not glob the "?")
kubectl apply -f "https://hub.litmuschaos.io/api/chaos?file=charts/generic/pod-delete/experiment.yaml"

Run a pod deletion experiment:

# pod-delete-experiment.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: order-service-chaos
  namespace: default
spec:
  appinfo:
    appns: default
    applabel: app=order-service
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "false"

Then send traffic while the experiment runs and check availability:

def test_service_survives_pod_deletion():
    """Service should maintain >99% availability during pod chaos."""
    
    # Start chaos experiment
    apply_manifest('pod-delete-experiment.yaml')
    
    errors = 0
    total = 0
    
    # Send traffic for 90 seconds
    deadline = time.time() + 90
    while time.time() < deadline:
        try:
            response = requests.get('http://order-service/health', timeout=2)
            total += 1
            if response.status_code != 200:
                errors += 1
        except requests.exceptions.RequestException:
            errors += 1
            total += 1
        
        time.sleep(0.1)
    
    delete_manifest('pod-delete-experiment.yaml')
    
    error_rate = errors / total
    assert error_rate < 0.01, f"Error rate {error_rate:.1%} exceeds 1% SLO during pod chaos"
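
The test above calls `apply_manifest` and `delete_manifest` without defining them. One way to write them, assuming `kubectl` is on the PATH and already pointed at the right cluster, is a thin wrapper around `subprocess`. The `kubectl_command` helper and the `run` parameter are hypothetical additions that make the command construction easy to check in isolation.

```python
import subprocess


def kubectl_command(action, manifest_path, namespace="default"):
    """Build the kubectl argument list for applying or deleting a manifest."""
    if action not in ("apply", "delete"):
        raise ValueError(f"unsupported action: {action}")
    cmd = ["kubectl", action, "-n", namespace, "-f", manifest_path]
    if action == "delete":
        cmd.append("--ignore-not-found")  # makes cleanup safe to repeat
    return cmd


def apply_manifest(manifest_path, namespace="default", run=subprocess.run):
    # check=True raises CalledProcessError if kubectl exits non-zero
    run(kubectl_command("apply", manifest_path, namespace), check=True)


def delete_manifest(manifest_path, namespace="default", run=subprocess.run):
    run(kubectl_command("delete", manifest_path, namespace), check=True)
```

Calling `delete_manifest` from a `finally` block keeps a failed assertion from leaving the ChaosEngine active.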

Chaos Monkey for Spring Boot

For Java/Spring services, Chaos Monkey for Spring Boot adds chaos features directly to the application:

<dependency>
    <groupId>de.codecentric</groupId>
    <artifactId>chaos-monkey-spring-boot</artifactId>
    <version>3.1.0</version>
</dependency>

# application-chaos.yml
chaos:
  monkey:
    enabled: true
    watcher:
      service: true
      repository: true
    assaults:
      level: 5
      latencyActive: true
      latencyRangeStart: 1000
      latencyRangeEnd: 3000
      exceptionsActive: true
      exception:
        type: java.lang.RuntimeException
        arguments:
          - type: java.lang.String
            value: "chaos-monkey-exception"

Common Chaos Experiments

1. Pod/Process Killing

Hypothesis: "Killing an instance of service X should not cause user-visible errors because other instances handle the traffic."

def test_payment_service_survives_instance_loss():
    initial_pods = get_pod_count('payment-service')
    
    # Kill one pod
    kill_random_pod('payment-service')
    
    # Immediately send requests
    for i in range(20):
        response = requests.post('http://checkout/pay', json=payment_data, timeout=5)
        assert response.status_code == 200, \
            f"Request {i} failed after pod kill: {response.status_code}"
        time.sleep(0.5)
    
    # Verify deployment scaled back up
    wait_for_pods('payment-service', count=initial_pods, timeout=60)

2. Network Partition

Hypothesis: "If service A can't reach service B, service A should open its circuit breaker and return cached or degraded results."

def test_circuit_breaker_activates_during_network_partition():
    # Block traffic between order-service and inventory-service
    block_traffic('order-service', 'inventory-service')
    
    try:
        # Client-side timeout so a hung service fails the test instead of blocking it
        response = requests.post('http://order-service/checkout', json=order_data, timeout=10)
        
        # Should succeed with degraded inventory check (no stock reservation)
        # or fail fast with proper error — NOT hang for 30 seconds
        assert response.elapsed.total_seconds() < 5.0
        
        # Check that circuit breaker metrics are updated
        metrics = get_metrics('order-service')
        assert metrics['circuit_breaker.inventory.state'] in ('open', 'half-open')
    finally:
        unblock_traffic('order-service', 'inventory-service')
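
The test above checks for the `open` and `half-open` states, so it helps to see how a breaker moves between them. The class below is a deliberately small sketch, not the implementation of any particular library: the threshold, the reset timeout, and the `clock` parameter are illustrative assumptions, and it is not thread-safe.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # a trial call is allowed through
        return "open"

    def call(self, func, fallback):
        if self.state == "open":
            return fallback()  # fail fast without touching the dependency
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

During a partition, `call(check_inventory, use_cached_inventory)` would start returning the cached result immediately once the threshold is reached, which is what keeps checkout under the five-second bound.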

3. Memory Pressure

Hypothesis: "When service X is under memory pressure, it should shed load gracefully rather than crash."

# LitmusChaos memory stress
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
spec:
  experiments:
    - name: pod-memory-hog
      spec:
        components:
          env:
            - name: MEMORY_CONSUMPTION
              value: "1024"  # MB
            - name: TOTAL_CHAOS_DURATION
              value: "120"
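
For the hypothesis to hold, the service needs some way to shed load. One simple approach, sketched below under the assumption that memory use grows with the number of concurrent requests, is to cap in-flight work and reject the rest. `LoadShedder` is a hypothetical name, and the `(status, body)` return shape is only for illustration.

```python
import threading


class LoadShedder:
    """Reject work beyond a fixed number of in-flight requests."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, work):
        # Non-blocking acquire: if no slot is free, shed the request immediately.
        if not self._slots.acquire(blocking=False):
            return 503, {"retry_after": 1}
        try:
            return 200, work()
        finally:
            self._slots.release()
```

Under the memory-hog experiment, a service built this way should answer some requests with 503 rather than being OOM-killed.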

4. Disk I/O Saturation

Hypothesis: "When the database node has high disk I/O, write operations queue and complete slowly, but reads from replica continue at normal latency."

# Stress disk I/O on a specific pod
kubectl exec -it postgres-0 -- stress-ng --io 4 --timeout 60s

def test_read_replica_handles_primary_disk_stress():
    # Saturate primary disk
    start_disk_stress('postgres-primary')
    
    try:
        # Write operations should be slow
        write_start = time.time()
        create_order(order_data)
        write_time = time.time() - write_start
        assert write_time > 1.0  # Confirm stress is working
        
        # Read operations from replica should remain fast
        read_start = time.time()
        orders = get_orders(customer_id='abc')
        read_time = time.time() - read_start
        assert read_time < 0.5, f"Read from replica took {read_time}s — too slow"
    finally:
        stop_disk_stress('postgres-primary')

5. DNS Failures

Hypothesis: "Service discovery failures should be handled by cached DNS entries, not cause immediate errors."

def test_service_survives_temporary_dns_failure():
    # Cause DNS failures for payment service name resolution
    corrupt_dns('payment-service')
    
    try:
        # Services with cached DNS should continue working
        response = requests.post('http://order-service/checkout', json=order_data, timeout=10)
        
        # May fail if cache is cold, but should fail fast
        assert response.elapsed.total_seconds() < 3.0
    finally:
        restore_dns('payment-service')
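
The hypothesis relies on the client keeping DNS answers around. A minimal sketch of that behavior is a resolver that caches results for a TTL and falls back to a stale entry when a fresh lookup fails. `StaleTolerantResolver` is a hypothetical class, the 30-second TTL is arbitrary, and real clients usually get this from their HTTP library, runtime, or a local caching resolver instead.

```python
import socket
import time


class StaleTolerantResolver:
    def __init__(self, ttl=30.0, resolve=socket.gethostbyname, clock=time.monotonic):
        self.ttl = ttl
        self.resolve = resolve
        self.clock = clock
        self._cache = {}  # hostname -> (address, fetched_at)

    def lookup(self, hostname):
        cached = self._cache.get(hostname)
        if cached and self.clock() - cached[1] < self.ttl:
            return cached[0]  # fresh cache hit
        try:
            address = self.resolve(hostname)
        except OSError:  # socket.gaierror is a subclass of OSError
            if cached:
                return cached[0]  # serve the stale entry instead of failing
            raise  # cold cache: nothing to fall back on
        self._cache[hostname] = (address, self.clock())
        return address
```

This also shows why the test allows a failure when the cache is cold: with no earlier answer, the only safe behavior is to fail fast.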

Running Chaos in CI/CD

Add a chaos test stage to your pipeline, but be strategic:

# .github/workflows/chaos.yml
name: Chaos Tests
on:
  schedule:
    - cron: '0 2 * * 1'  # Weekly, Monday 2am
  workflow_dispatch:  # Manual trigger

jobs:
  chaos-tests:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository
        uses: actions/checkout@v4

      - name: Deploy to chaos environment
        run: ./deploy.sh chaos-env
      
      - name: Run chaos experiments
        run: |
          pytest tests/chaos/ -v \
            --chaos-env=$CHAOS_ENV_URL \
            --timeout=600
      
      - name: Generate chaos report
        if: always()  # still produce the report when experiments fail
        run: python scripts/chaos-report.py

Don't run chaos tests on every commit — they're slow and intentionally cause failures. Run them:

  • On a schedule (weekly)
  • Before major releases
  • When making changes to resilience code (timeouts, retries, circuit breakers)

Chaos Engineering Maturity

Level 1 — Explore manually: Use Toxiproxy or Istio fault injection ad hoc to understand failure modes. No automation yet.

Level 2 — Automated experiment catalog: Encode your chaos experiments as code. Run them manually before releases.

Level 3 — Continuous chaos: Run experiments automatically in staging on a schedule. Alert on SLO violations during chaos.

Level 4 — Production chaos: Netflix level. Run controlled experiments in production with automatic rollback if SLOs degrade.

Most teams get enormous value from levels 1-2 without the operational complexity of levels 3-4.
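
What separates levels 3 and 4 from the earlier ones is a guardrail: the experiment watches an SLO signal and rolls itself back when that signal degrades. The function below is a rough sketch of that loop. `run_with_guardrail` and all of its parameters are hypothetical, and `error_rate` stands in for a query against your real metrics system.

```python
import time


def run_with_guardrail(inject, rollback, error_rate, max_error_rate,
                       duration_s, poll_s=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Run a fault for up to duration_s; roll back early on an SLO breach.

    Returns True if the experiment ran to completion, False if it was aborted.
    """
    inject()
    try:
        deadline = clock() + duration_s
        while clock() < deadline:
            if error_rate() > max_error_rate:
                return False  # abort early; the finally block still rolls back
            sleep(poll_s)
        return True
    finally:
        rollback()
```

The polling interval bounds how long users are exposed to a breach, so in production it should be short relative to the experiment duration.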

What to Observe During Chaos

  • Error rate — does it stay within SLO?
  • Latency — do p99 latencies stay acceptable?
  • Retry storms — do retries amplify load on a recovering service?
  • Data consistency — are there orphaned records or partial writes?
  • Alerting — did your monitors fire during the chaos?
  • Logs — are errors logged clearly with actionable context?

Chaos engineering without observability is just sabotage. Instrument everything before you start breaking things.
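
If a metrics stack is not yet in place, the first two signals in the list can at least be computed from the samples a test collects. The helper below is a minimal sketch: `summarize` is a hypothetical name, the input is assumed to be a list of `(latency_seconds, ok)` tuples, and the percentile uses the nearest-rank method.

```python
def summarize(samples):
    """Compute error rate and p99 latency from (latency_seconds, ok) tuples."""
    if not samples:
        raise ValueError("no samples collected")
    latencies = sorted(latency for latency, _ in samples)
    errors = sum(1 for _, ok in samples if not ok)
    n = len(latencies)
    # Nearest-rank p99: ceil(0.99 * n), computed with integers to avoid float error.
    rank = (99 * n + 99) // 100
    return {
        "error_rate": errors / n,
        "p99_seconds": latencies[rank - 1],
    }
```

Comparing the summary taken during chaos with one taken beforehand gives a concrete answer to whether the system stayed in its steady state.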

The goal isn't to find every possible failure — it's to build a team culture where resilience is a first-class concern, and where engineers are confident the system can handle adversity. That confidence only comes from empirical testing.
