Chaos Engineering with Chaos Monkey and Litmus Chaos for Kubernetes
Chaos engineering is the practice of deliberately introducing failures into a system to discover weaknesses before they cause incidents. Netflix coined the term with Chaos Monkey — a tool that randomly terminates EC2 instances to force engineers to build resilient services.
For Kubernetes workloads, Litmus Chaos has become one of the most widely used chaos engineering frameworks. This guide covers how to set up chaos experiments, define steady-state hypotheses, and integrate chaos testing into your reliability practice.
The Chaos Engineering Process
Chaos engineering follows a scientific method:
- Define steady state — what does "working normally" look like? (throughput, latency, error rate)
- Hypothesize — "I believe the system will maintain steady state even if X happens"
- Introduce chaos — inject the failure
- Observe — does the system maintain steady state?
- Automate and repeat — run experiments continuously, not just after incidents
The goal is to find weaknesses before your users find them.
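As a schematic, the whole loop fits in a few lines of Python (a sketch; the two callables are stand-ins for real probes and fault injectors, not any particular API):
def run_experiment(measure_steady_state, inject_failure, tolerance: float) -> bool:
    """Schematic chaos loop; callers supply the probe and the injector."""
    baseline = measure_steady_state()        # 1. define steady state
    # 2. hypothesis: steady state holds (within tolerance) under the failure
    inject_failure()                         # 3. introduce chaos
    observed = measure_steady_state()        # 4. observe
    return observed >= baseline - tolerance  # 5. automate: assert this in CI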
Setting Up Litmus Chaos
# Install Litmus using helm
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm install chaos litmuschaos/litmus \
--namespace litmus \
--create-namespace \
  --set portal.frontend.service.type=ClusterIP
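# Verify the install: the operator pods should reach Running (names vary by chart version)
kubectl get pods -n litmus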
# Install Chaos Hub (pre-built experiments)
kubectl apply -f "https://hub.litmuschaos.io/api/chaos?file=charts/generic/k8s-pod-delete/experiment.yaml"
Your First Chaos Experiment: Pod Kill
The simplest experiment: kill a random pod and verify your service recovers.
# pod-kill-experiment.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: nginx-chaos
namespace: default
spec:
appinfo:
appns: default
applabel: "app=nginx"
appkind: deployment
chaosServiceAccount: litmus-admin
  # Chaos experiments to run against the target
experiments:
- name: pod-delete
spec:
components:
env:
- name: TOTAL_CHAOS_DURATION
value: "30" # Inject chaos for 30 seconds
- name: CHAOS_INTERVAL
value: "10" # Kill pods every 10 seconds
- name: FORCE
value: "false" # Graceful termination (SIGTERM)kubectl apply -f pod-kill-experiment.yaml
# Watch experiment progress
kubectl get chaosresult nginx-chaos-pod-delete -o jsonpath='{.status.experimentstatus.verdict}'
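If the verdict is not what you expected, describing the ChaosResult shows the probe outcomes and which step failed:
kubectl describe chaosresult nginx-chaos-pod-delete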
Defining Steady-State Hypotheses with Probes
Litmus Chaos supports probes that check steady state before and during chaos:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: api-resilience-chaos
namespace: default
spec:
appinfo:
appns: default
applabel: "app=api-service"
appkind: deployment
experiments:
- name: pod-delete
spec:
probe:
# HTTP probe: check API is responding
- name: "api-availability-probe"
type: "httpProbe"
httpProbe/inputs:
url: "http://api-service/health"
insecureSkipVerify: false
responseTimeout: 2000 # ms
method:
get:
criteria: "=="
responseCode: "200"
mode: "Continuous" # Check throughout the experiment
runProperties:
probeTimeout: 3
interval: 5 # Check every 5 seconds
retry: 2
# Prometheus probe: check error rate during chaos
- name: "error-rate-probe"
type: "promProbe"
promProbe/inputs:
endpoint: "http://prometheus:9090"
query: |
sum(rate(http_requests_total{status=~"5.."}[1m]))
/
sum(rate(http_requests_total[1m]))
comparator:
type: "float"
criteria: "<="
value: "0.05" # Error rate should stay under 5% during chaos
mode: "Continuous"
runProperties:
probeTimeout: 3
interval: 10
retry: 2
components:
env:
- name: TOTAL_CHAOS_DURATION
value: "60"
- name: CHAOS_INTERVAL
value: "15"Common Kubernetes Chaos Experiments
Network Latency Injection
- name: pod-network-latency
spec:
components:
env:
- name: NETWORK_INTERFACE
value: "eth0"
- name: NETWORK_LATENCY
value: "2000" # Add 2 seconds of latency
- name: TOTAL_CHAOS_DURATION
value: "120"
- name: TARGET_PODS
value: "api-service-7d4b9f8c-xkl4p"Hypothesis: "Even with 2 seconds of network latency injected into the API service, our frontend should show a degraded but functional UI (not a 500 error), because we have configured timeouts and fallbacks."
CPU Stress
- name: pod-cpu-hog
spec:
components:
env:
- name: CPU_CORES
value: "2" # Consume 2 CPU cores
- name: CPU_LOAD
value: "100" # 100% utilization
- name: TOTAL_CHAOS_DURATION
value: "60"
- name: PODS_AFFECTED_PERC
value: "50" # Affect 50% of matching podsNode Drain
Node Drain
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: node-drain-chaos
namespace: litmus
spec:
engineState: active
chaosServiceAccount: litmus-admin
experiments:
- name: node-drain
spec:
components:
env:
- name: TARGET_NODE
value: "worker-node-2"
- name: TOTAL_CHAOS_DURATION
value: "120"Hypothesis: "When worker-node-2 is drained, Kubernetes reschedules all affected pods within 2 minutes, and the service maintains > 99% availability during rescheduling."
Writing Chaos Tests in Code
For repeatable, version-controlled chaos experiments, define experiments programmatically:
import subprocess
import time
import requests
class ChaosExperiment:
def __init__(self, name: str, engine_yaml: str, target_url: str):
self.name = name
self.engine_yaml = engine_yaml
self.target_url = target_url
def measure_steady_state(self, duration_seconds: int = 30) -> dict:
"""Measure baseline metrics before chaos."""
metrics = {"success_rates": [], "latencies_ms": []}
start = time.time()
while time.time() - start < duration_seconds:
t = time.perf_counter()
try:
response = requests.get(f"{self.target_url}/health", timeout=5)
elapsed = (time.perf_counter() - t) * 1000
metrics["success_rates"].append(1 if response.status_code == 200 else 0)
metrics["latencies_ms"].append(elapsed)
except Exception:
metrics["success_rates"].append(0)
metrics["latencies_ms"].append(5000)
time.sleep(1)
return {
"success_rate": sum(metrics["success_rates"]) / len(metrics["success_rates"]),
"p95_latency_ms": sorted(metrics["latencies_ms"])[int(0.95 * len(metrics["latencies_ms"]))]
}
def run(self) -> dict:
print(f"[Chaos] Measuring steady state...")
baseline = self.measure_steady_state(duration_seconds=30)
print(f" Success rate: {baseline['success_rate']:.2%}")
print(f" P95 latency: {baseline['p95_latency_ms']:.1f}ms")
print(f"[Chaos] Applying chaos engine: {self.name}")
subprocess.run(["kubectl", "apply", "-f", self.engine_yaml], check=True)
print("[Chaos] Measuring during chaos...")
during_chaos = self.measure_steady_state(duration_seconds=60)
print(f" Success rate: {during_chaos['success_rate']:.2%}")
print(f" P95 latency: {during_chaos['p95_latency_ms']:.1f}ms")
print("[Chaos] Waiting for recovery...")
time.sleep(30)
print("[Chaos] Measuring post-chaos...")
after_chaos = self.measure_steady_state(duration_seconds=30)
print(f" Success rate: {after_chaos['success_rate']:.2%}")
print(f" P95 latency: {after_chaos['p95_latency_ms']:.1f}ms")
# Cleanup
subprocess.run(["kubectl", "delete", "-f", self.engine_yaml], check=True)
return {
"baseline": baseline,
"during_chaos": during_chaos,
"after_chaos": after_chaos
}
def test_api_survives_pod_deletion():
experiment = ChaosExperiment(
name="api-pod-delete",
engine_yaml="experiments/pod-delete.yaml",
target_url="https://api.example.com"
)
results = experiment.run()
# During chaos: should maintain 95% success rate
assert results["during_chaos"]["success_rate"] >= 0.95, \
f"Success rate during chaos {results['during_chaos']['success_rate']:.2%} below 95%"
# After chaos: should return to baseline within 30 seconds
assert results["after_chaos"]["success_rate"] >= results["baseline"]["success_rate"] - 0.01, \
"Service did not recover to baseline after chaos"
# Latency during chaos: should stay under 3x baseline P95
baseline_p95 = results["baseline"]["p95_latency_ms"]
chaos_p95 = results["during_chaos"]["p95_latency_ms"]
assert chaos_p95 <= baseline_p95 * 3, \
f"Latency during chaos ({chaos_p95:.1f}ms) is more than 3x baseline ({baseline_p95:.1f}ms)"Chaos in CI/CD
Chaos in CI/CD
Running chaos experiments in CI catches regressions before they reach production:
# .github/workflows/chaos-tests.yml
name: Chaos Testing
on:
schedule:
- cron: '0 2 * * *' # Daily at 2am
workflow_dispatch:
jobs:
chaos-tests:
runs-on: ubuntu-latest
environment: staging # Only run against staging
steps:
- uses: actions/checkout@v4
- name: Set up kubectl
uses: azure/setup-kubectl@v3
      - name: Configure kubeconfig
        run: |
          mkdir -p ~/.kube
          echo "${{ secrets.STAGING_KUBECONFIG }}" | base64 -d > ~/.kube/config
- name: Install Litmus
        run: |
          helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
          helm upgrade --install chaos litmuschaos/litmus \
            --namespace litmus --create-namespace
- name: Run pod-delete experiment
run: |
kubectl apply -f chaos/pod-delete-engine.yaml
sleep 90 # Wait for experiment to complete
verdict=$(kubectl get chaosresult api-chaos-pod-delete \
-o jsonpath='{.status.experimentstatus.verdict}')
if [ "$verdict" != "Pass" ]; then
echo "Chaos experiment failed: $verdict"
exit 1
fi
- name: Run network-latency experiment
run: |
kubectl apply -f chaos/network-latency-engine.yaml
sleep 150
verdict=$(kubectl get chaosresult api-chaos-network-latency \
-o jsonpath='{.status.experimentstatus.verdict}')
[ "$verdict" = "Pass" ] || exit 1
- name: Cleanup
if: always()
run: |
          kubectl delete chaosengine --all -n default
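The fixed sleep values above are a simplification: if an experiment runs long, the verdict check races it. Polling the ChaosResult is more robust (a sketch; the result name and timeout are assumptions):
# Poll until the experiment reaches a terminal verdict, up to ~5 minutes
for _ in $(seq 1 30); do
  verdict=$(kubectl get chaosresult api-chaos-pod-delete \
    -o jsonpath='{.status.experimentstatus.verdict}' 2>/dev/null)
  if [ "$verdict" = "Pass" ] || [ "$verdict" = "Fail" ]; then break; fi
  sleep 10
done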
Game Days
Beyond automated chaos tests, run periodic game days — structured exercises where your team simulates specific failure scenarios:
| Scenario | What to Simulate | What to Validate |
|---|---|---|
| Database failover | Primary DB killed | App fails over within 30s |
| Region outage | All pods in zone-a killed | Traffic routes to zone-b |
| Cache failure | Redis pod killed | App degrades gracefully (slower, not down) |
| Dependency timeout | Upstream service adds 30s latency | App times out and returns cached/fallback |
| Certificate expiry | TLS cert replaced with expired cert | Alert fires before users notice |
Chaos engineering is most valuable when it's boring — when experiments run, find nothing, and confirm your system is as resilient as you thought. That's the signal that your reliability investments are working.