Chaos Engineering for Microservices: Breaking Things to Build Resilience
Distributed systems fail in ways you don't expect. Chaos engineering is the practice of intentionally introducing failures into your system to discover weaknesses before they cause production incidents. For microservices, it's not optional — it's how you gain confidence that your system actually works under adverse conditions.
The Core Premise
Netflix popularized chaos engineering with Chaos Monkey, a tool that randomly terminates production instances. The insight is simple: if random failures in production are inevitable, you should practice handling them in controlled conditions first.
A chaos experiment follows this pattern:
- Hypothesize — "If the inventory service becomes slow, checkout should still complete within 3 seconds by using cached inventory data"
- Define the steady state — measure normal behavior (error rate, latency, throughput)
- Introduce the chaos — make inventory service slow
- Observe — did the system stay in steady state?
- Remediate — if not, fix the weakness and repeat
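The pattern above can be sketched as a small harness. This is an illustration, not a framework: `run_experiment` and its callables are hypothetical names, and the steady-state check is whatever measurement fits your service.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    hypothesis: str
    baseline_ok: bool       # was the system healthy before the fault was injected?
    held_under_chaos: bool  # did it stay in steady state while the fault was active?


def run_experiment(
    hypothesis: str,
    steady_state: Callable[[], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Run one chaos experiment: check baseline, inject, observe, always roll back."""
    # If the system is already unhealthy, the experiment would tell us nothing.
    if not steady_state():
        return ExperimentResult(hypothesis, baseline_ok=False, held_under_chaos=False)

    inject()
    try:
        held = steady_state()  # observe while the fault is active
    finally:
        rollback()  # remove the fault even if the check raises
    return ExperimentResult(hypothesis, baseline_ok=True, held_under_chaos=held)
```

A result with `held_under_chaos=False` is the signal to remediate and run the same experiment again.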
Tools for Chaos Engineering
Toxiproxy (Local Development)
Toxiproxy is the easiest way to introduce network failures locally. It proxies TCP connections and lets you add "toxics" — latency, bandwidth limits, connection resets.
# Install Toxiproxy
brew install toxiproxy

# Start the proxy server
toxiproxy-server &

# Create a proxy for the database
toxiproxy-cli create postgres --listen localhost:25432 --upstream localhost:5432

# Add 500ms latency
toxiproxy-cli toxic add postgres --type latency --attribute latency=500

# Limit bandwidth to 100KB/s
toxiproxy-cli toxic add postgres --type bandwidth --attribute rate=100

# Delay closing the TCP connection by 200ms
toxiproxy-cli toxic add postgres --type slow_close --attribute delay=200

# Reset connections after a 1000ms timeout
toxiproxy-cli toxic add postgres --type reset_peer --attribute timeout=1000

In your test suite, control Toxiproxy via its API:
import time

import requests

TOXIPROXY_API = 'http://localhost:8474'

# order_data is assumed to be a sample order payload defined elsewhere in the test suite.


def add_latency(proxy_name, latency_ms):
    """Add a latency toxic (in milliseconds) to the named proxy."""
    response = requests.post(f'{TOXIPROXY_API}/proxies/{proxy_name}/toxics', json={
        'name': 'latency',
        'type': 'latency',
        'attributes': {'latency': latency_ms}
    })
    response.raise_for_status()


def remove_toxic(proxy_name, toxic_name):
    """Remove a toxic by name so later tests start clean."""
    requests.delete(f'{TOXIPROXY_API}/proxies/{proxy_name}/toxics/{toxic_name}')


def test_checkout_handles_slow_database():
    add_latency('postgres', 2000)
    try:
        start = time.time()
        # Client-side timeout so a hung service fails the test instead of blocking it
        response = requests.post('http://localhost:3000/checkout', json=order_data, timeout=10)
        elapsed = time.time() - start

        # Should time out and return a degraded response, not hang indefinitely
        assert response.status_code in (200, 503)
        assert elapsed < 5.0, "Request hung instead of timing out"
        if response.status_code == 503:
            assert 'retry_after' in response.json()
    finally:
        remove_toxic('postgres', 'latency')

LitmusChaos (Kubernetes)
LitmusChaos is an open-source chaos engineering platform for Kubernetes. It provides a library of chaos experiments as Kubernetes custom resources.
# Install LitmusChaos
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v3.0.0.yaml

# Install the pod-delete chaos experiment
kubectl apply -f "https://hub.litmuschaos.io/api/chaos?file=charts/generic/pod-delete/experiment.yaml"

Run a pod deletion experiment:
# pod-delete-experiment.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: order-service-chaos
  namespace: default
spec:
  appinfo:
    appns: default
    applabel: app=order-service
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "false"

Then check availability from a test while the experiment runs:

import time

import requests


def test_service_survives_pod_deletion():
    """Service should maintain >99% availability during pod chaos."""
    # Start chaos experiment (apply_manifest and delete_manifest are assumed
    # to be kubectl wrappers defined elsewhere in the test suite)
    apply_manifest('pod-delete-experiment.yaml')
    errors = 0
    total = 0
    try:
        # Send traffic for 90 seconds
        deadline = time.time() + 90
        while time.time() < deadline:
            total += 1
            try:
                response = requests.get('http://order-service/health', timeout=2)
                if response.status_code != 200:
                    errors += 1
            except requests.exceptions.RequestException:
                errors += 1
            time.sleep(0.1)
    finally:
        # Always stop the experiment, even if the traffic loop raises
        delete_manifest('pod-delete-experiment.yaml')

    error_rate = errors / total
    assert error_rate < 0.01, f"Error rate {error_rate:.1%} exceeds 1% SLO during pod chaos"

Chaos Monkey for Spring Boot
For Java/Spring services, Chaos Monkey for Spring Boot adds chaos features directly to the application:

<dependency>
    <groupId>de.codecentric</groupId>
    <artifactId>chaos-monkey-spring-boot</artifactId>
    <version>3.1.0</version>
</dependency>

# application-chaos.yml
chaos:
  monkey:
    enabled: true
    watcher:
      service: true
      repository: true
    assaults:
      level: 5
      latencyActive: true
      latencyRangeStart: 1000
      latencyRangeEnd: 3000
      exceptionsActive: true
      exception:
        type: java.lang.RuntimeException
        arguments:
          - type: java.lang.String
            value: "chaos-monkey-exception"

Common Chaos Experiments
1. Pod/Process Killing
Hypothesis: "Killing an instance of service X should not cause user-visible errors because other instances handle the traffic."
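The test for this hypothesis relies on helpers such as `get_pod_count` and `kill_random_pod`, which this article does not define. A minimal sketch of what they might look like, assuming `kubectl` is on the PATH and pods carry an `app=<service>` label (both are assumptions; adjust to your cluster):

```python
import random
import subprocess
from typing import Callable, List

# A runner takes kubectl arguments and returns stdout; injectable for unit tests.
Runner = Callable[[List[str]], str]


def kubectl(args: List[str]) -> str:
    """Run kubectl and return stdout; raises CalledProcessError on failure."""
    return subprocess.run(
        ["kubectl", *args], check=True, capture_output=True, text=True
    ).stdout


def list_pods(service: str, run: Runner = kubectl) -> List[str]:
    # Assumption: pods are labeled app=<service>. Output lines look like "pod/<name>".
    out = run(["get", "pods", "-l", f"app={service}", "-o", "name"])
    return [line.strip() for line in out.splitlines() if line.strip()]


def get_pod_count(service: str, run: Runner = kubectl) -> int:
    return len(list_pods(service, run))


def kill_random_pod(service: str, run: Runner = kubectl) -> str:
    pods = list_pods(service, run)
    if not pods:
        raise RuntimeError(f"no pods found for {service}")
    victim = random.choice(pods)
    # --wait=false returns immediately so the test can send traffic right away
    run(["delete", victim, "--wait=false"])
    return victim
```

The `run` parameter exists only so the helpers can be exercised with a fake in unit tests. Note that this `get_pod_count` also counts pods that are still terminating, so a stricter version would filter on readiness.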
import time

import requests


def test_payment_service_survives_instance_loss():
    initial_pods = get_pod_count('payment-service')

    # Kill one pod
    kill_random_pod('payment-service')

    # Immediately send requests
    for i in range(20):
        response = requests.post('http://checkout/pay', json=payment_data, timeout=5)
        assert response.status_code == 200, \
            f"Request {i} failed after pod kill: {response.status_code}"
        time.sleep(0.5)

    # Verify the deployment scaled back up
    wait_for_pods('payment-service', count=initial_pods, timeout=60)

2. Network Partition
Hypothesis: "If service A can't reach service B, service A should use a circuit breaker and return cached/degraded results."
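To make the hypothesis concrete, here is a minimal sketch of the circuit breaker behavior it assumes. The class name, thresholds, and fallback callable are illustrative and not taken from any particular library.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Minimal breaker: closed, open after N failures, half-open after a cooldown."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 10.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # cooldown elapsed: allow a trial call
        return "open"

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.state == "open":
            return fallback()  # fail fast without touching the dependency
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        # A successful call closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A production service would normally use an established library rather than a hand-rolled breaker. The point here is the state machine: during a partition the breaker should report "open" or "half-open", and callers should get the fallback quickly instead of waiting on a dead connection.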
import requests


def test_circuit_breaker_activates_during_network_partition():
    # Block traffic between order-service and inventory-service
    block_traffic('order-service', 'inventory-service')
    try:
        response = requests.post('http://order-service/checkout', json=order_data, timeout=10)

        # Should succeed with degraded inventory check (no stock reservation)
        # or fail fast with a proper error, NOT hang for 30 seconds
        assert response.elapsed.total_seconds() < 5.0

        # Check that circuit breaker metrics are updated
        metrics = get_metrics('order-service')
        assert metrics['circuit_breaker.inventory.state'] in ('open', 'half-open')
    finally:
        unblock_traffic('order-service', 'inventory-service')

3. Memory Pressure
Hypothesis: "When service X is under memory pressure, it should shed load gracefully rather than crash."
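One common way to shed load is to cap the number of in-flight requests and reject the rest with a 503. The sketch below illustrates that approach under stated assumptions: `LoadShedder` and `handle` are hypothetical names, and a real service might shed on memory usage or queue depth instead of a fixed concurrency limit.

```python
import threading
from contextlib import contextmanager


class Overloaded(Exception):
    """Raised when a request is rejected to protect the process."""


class LoadShedder:
    """Caps concurrent requests; excess requests are rejected instead of queued."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    @contextmanager
    def admit(self):
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                raise Overloaded("too many requests in flight")
            self._in_flight += 1
        try:
            yield
        finally:
            with self._lock:
                self._in_flight -= 1


def handle(shedder: LoadShedder, work) -> tuple:
    """Return (status, body); a 503 tells the caller to retry later."""
    try:
        with shedder.admit():
            return 200, work()
    except Overloaded:
        return 503, {"retry_after": 1}
```

Rejecting early keeps memory bounded because rejected requests never allocate the buffers that full processing would need.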
# LitmusChaos memory stress
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
spec:
  experiments:
    - name: pod-memory-hog
      spec:
        components:
          env:
            - name: MEMORY_CONSUMPTION
              value: "1024"  # MB
            - name: TOTAL_CHAOS_DURATION
              value: "120"

4. Disk I/O Saturation
Hypothesis: "When the database node has high disk I/O, write operations queue and complete slowly, but reads from replica continue at normal latency."
# Stress disk I/O on a specific pod
kubectl exec -it postgres-0 -- stress-ng --io 4 --timeout 60s

import time


def test_read_replica_handles_primary_disk_stress():
    # Saturate primary disk
    start_disk_stress('postgres-primary')
    try:
        # Write operations should be slow
        write_start = time.time()
        create_order(order_data)
        write_time = time.time() - write_start
        assert write_time > 1.0  # Confirm stress is working

        # Read operations from replica should remain fast
        read_start = time.time()
        orders = get_orders(customer_id='abc')
        read_time = time.time() - read_start
        assert read_time < 0.5, f"Read from replica took {read_time:.2f}s, too slow"
    finally:
        stop_disk_stress('postgres-primary')

5. DNS Failures
Hypothesis: "Service discovery failures should be handled by cached DNS entries, not cause immediate errors."
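What "handled by cached DNS entries" might look like inside a client is sketched below: a resolver that serves a stale address when a lookup fails. `StaleTolerantResolver` is a hypothetical helper, not part of any library, and many runtimes or sidecars already cache DNS for you.

```python
import socket
import time
from typing import Callable, Dict, Tuple


class StaleTolerantResolver:
    """Caches resolved addresses and serves stale entries when lookups fail."""

    def __init__(self, ttl: float = 30.0,
                 lookup: Callable[[str], str] = socket.gethostbyname,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self.lookup = lookup
        self.clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}

    def resolve(self, host: str) -> str:
        cached = self._cache.get(host)
        if cached and self.clock() - cached[1] < self.ttl:
            return cached[0]  # fresh entry, no lookup needed
        try:
            address = self.lookup(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            if cached:
                return cached[0]  # DNS is down: fall back to the stale address
            raise  # cold cache: fail fast and let the caller handle it
        self._cache[host] = (address, self.clock())
        return address
```

The cold-cache branch matters as much as the warm one: with nothing to fall back on, the right behavior is a quick error, not a long hang.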
import requests


def test_service_survives_temporary_dns_failure():
    # Cause DNS failures for payment service name resolution
    corrupt_dns('payment-service')
    try:
        # Services with cached DNS should continue working
        response = requests.post('http://order-service/checkout', json=order_data, timeout=10)

        # May fail if cache is cold, but should fail fast
        assert response.elapsed.total_seconds() < 3.0
    finally:
        restore_dns('payment-service')

Running Chaos in CI/CD
Add a chaos test stage to your pipeline, but be strategic:
# .github/workflows/chaos.yml
name: Chaos Tests
on:
  schedule:
    - cron: '0 2 * * 1'  # Weekly, Monday 2am UTC
  workflow_dispatch:  # Manual trigger
jobs:
  chaos-tests:
    runs-on: ubuntu-latest
    steps:
      - name: Check out the repository
        uses: actions/checkout@v4
      - name: Deploy to chaos environment
        run: ./deploy.sh chaos-env
      - name: Run chaos experiments
        run: |
          pytest tests/chaos/ -v \
            --chaos-env=$CHAOS_ENV_URL \
            --timeout=600
      - name: Generate chaos report
        if: always()  # Report even when experiments fail
        run: python scripts/chaos-report.py

Don't run chaos tests on every commit: they're slow and intentionally cause failures. Run them:
- On a schedule (weekly)
- Before major releases
- When making changes to resilience code (timeouts, retries, circuit breakers)
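That policy can be encoded in the pipeline itself. A minimal sketch, assuming your CI passes the trigger type and the list of changed files to a script; the trigger names and path patterns here are hypothetical and would need to match your repository layout.

```python
from fnmatch import fnmatch
from typing import Iterable

# Hypothetical path patterns for code that implements resilience behavior.
RESILIENCE_PATTERNS = ("*/retry*.py", "*/timeouts*.py", "*/circuit_breaker*.py")


def should_run_chaos(trigger: str, changed_files: Iterable[str] = ()) -> bool:
    """Decide whether a pipeline run should include the chaos stage."""
    if trigger in ("schedule", "release", "manual"):
        return True
    # On ordinary pushes, run only when resilience code was touched.
    return any(
        fnmatch(path, pattern)
        for path in changed_files
        for pattern in RESILIENCE_PATTERNS
    )
```

For example, `should_run_chaos("push", ["src/orders/retry_policy.py"])` returns `True`, while a push that only touches documentation skips the chaos stage.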
Chaos Engineering Maturity
Level 1 — Explore manually: Use Toxiproxy or Istio fault injection ad-hoc to understand failure modes. No automation yet.
Level 2 — Automated experiment catalog: Encode your chaos experiments as code. Run them manually before releases.
Level 3 — Continuous chaos: Run experiments automatically in staging on a schedule. Alert on SLO violations during chaos.
Level 4 — Production chaos: Netflix level. Run controlled experiments in production with automatic rollback if SLOs degrade.
Most teams get enormous value from levels 1-2 without the operational complexity of levels 3-4.
What to Observe During Chaos
- Error rate — does it stay within SLO?
- Latency — do p99 latencies stay acceptable?
- Retry storms — do retries amplify load on a recovering service?
- Data consistency — are there orphaned records or partial writes?
- Alerting — did your monitors fire during the chaos?
- Logs — are errors logged clearly with actionable context?
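Several of these signals can be computed from the request samples a chaos test already collects. The sketch below assumes each sample records latency, success, and the number of attempts; the field names and SLO thresholds are illustrative, not a standard.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    latency_s: float   # end-to-end latency of one user request
    ok: bool           # did the request succeed?
    attempts: int = 1  # 1 means no retries were needed


def p99(latencies: List[float]) -> float:
    """Nearest-rank 99th percentile; requires at least one value."""
    ordered = sorted(latencies)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n) using integer math
    return ordered[rank - 1]


def summarize(samples: List[Sample]) -> dict:
    total = len(samples)
    return {
        "error_rate": sum(not s.ok for s in samples) / total,
        "p99_latency_s": p99([s.latency_s for s in samples]),
        # Above 1.0 means retries are multiplying load on downstream services
        "retry_amplification": sum(s.attempts for s in samples) / total,
    }


def within_slo(summary: dict, max_error_rate: float, max_p99_s: float) -> bool:
    return (summary["error_rate"] <= max_error_rate
            and summary["p99_latency_s"] <= max_p99_s)
```

A retry amplification that climbs during recovery is an early sign of a retry storm, even when the error rate itself still looks acceptable.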
Chaos engineering without observability is just sabotage. Instrument everything before you start breaking things.
The goal isn't to find every possible failure — it's to build a team culture where resilience is a first-class concern, and where engineers are confident the system can handle adversity. That confidence only comes from empirical testing.