Chaos Mesh Testing Guide: PodChaos, NetworkChaos & Kubernetes Fault Injection
Chaos Mesh is a Kubernetes-native chaos engineering platform. It injects failures — pod crashes, network delays, CPU stress, disk failures — directly into your cluster via Kubernetes CRDs, without modifying application code. The goal is to discover resilience failures before they happen in production.
Why Chaos Engineering
Your monitoring, alerting, and recovery procedures are theoretical until tested under real failure conditions. Questions that chaos experiments answer:
- Does the service restart automatically when pods crash?
- Does the circuit breaker activate when downstream latency increases?
- Does the database connection pool recover when connectivity is restored?
- Does HPA scale up before user-visible errors appear under CPU pressure?
Installation
# Install via Helm
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh \
  --create-namespace \
  --version 2.6.3

# Verify installation
kubectl get pods -n chaos-mesh
# chaos-controller-manager, chaos-daemon, chaos-dashboard should be running

PodChaos: Pod Failure and Kill
PodChaos injects pod-level failures. The two most useful types:
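pod-failure: makes the pod unavailable for the duration of the experiment. Chaos Mesh does this by swapping each container's image for a pause image rather than by sending a signal, so the pod object still exists but stops serving and fails its readiness checks. Kubernetes then removes it from service endpoints.
pod-kill: deletes the pod. With gracePeriod: 0 the containers are killed immediately, with no graceful shutdown. Kubernetes creates a replacement if the pod is managed by a controller such as a Deployment or StatefulSet.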
Pod Kill Experiment
# pod-kill.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-test
  namespace: production
spec:
  action: pod-kill
  mode: one # one | all | fixed | fixed-percent | random-max-percent
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-service
  duration: "30s" # How long the experiment runs
  gracePeriod: 0 # SIGKILL immediately

# Apply the experiment
kubectl apply -f pod-kill.yaml
# Watch what happens
kubectl get pods -n production -w
# Check experiment status
kubectl describe podchaos pod-kill-test -n production
# Cleanup
kubectl delete -f pod-kill.yaml

Pod Failure Experiment (Readiness Probe Failure)
# pod-failure.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-test
  namespace: production
spec:
  action: pod-failure
  mode: fixed-percent
  value: "33" # Fail 33% of matched pods
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
  duration: "5m"

This tests whether your load balancer correctly removes unhealthy pods from rotation and whether your service degrades gracefully with reduced capacity.
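With fixed-percent, the number of pods actually hit depends on how many replicas match the selector. The sketch below estimates that count for each selector mode. It assumes the percentage is rounded down to a whole pod, which is an assumption to verify against your Chaos Mesh version; the function name `pods_affected` is hypothetical and not part of any Chaos Mesh API.

```python
import math


def pods_affected(mode: str, matched: int, value: int = 0) -> int:
    """Estimate how many of `matched` pods a selector mode would target."""
    if matched <= 0:
        return 0
    if mode == "one":
        return 1
    if mode == "all":
        return matched
    if mode == "fixed":
        return min(value, matched)
    if mode == "fixed-percent":
        # Assumption: the percentage is rounded down to a whole pod.
        return math.floor(matched * value / 100)
    raise ValueError(f"unsupported mode: {mode}")


if __name__ == "__main__":
    for replicas in (2, 3, 10):
        hit = pods_affected("fixed-percent", replicas, 33)
        print(f"{replicas} replicas at 33% -> {hit} pod(s)")
```

If the rounding assumption holds, a 33% experiment against a small Deployment may select no pods at all, so it is worth checking the experiment status to confirm that something was actually injected.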
Container Kill
Kill a specific container in a multi-container pod:
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: sidecar-kill-test
  namespace: production
spec:
  action: container-kill
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: my-service
  containerNames:
    - "envoy-sidecar"
  duration: "1m"

NetworkChaos: Latency, Packet Loss, Partition
NetworkChaos modifies network traffic between pods. This is the most valuable chaos type for testing distributed systems.
Network Delay (Latency Injection)
# network-delay.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-test
  namespace: production
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: order-service
  delay:
    latency: "500ms" # Add 500ms to all outgoing connections
    correlation: "25" # 25% correlation to make delay realistic
    jitter: "50ms" # +/- 50ms variation
  direction: to # to | from | both
  target: # Optional: only affect traffic to this selector
    selector:
      namespaces:
        - production
      labelSelectors:
        app: payment-service
    mode: all
  duration: "5m"

What to watch during this experiment:
- Does order-service's circuit breaker open when payment-service latency spikes?
- Do request timeouts trigger at the expected threshold?
- Does the error rate remain within SLA (retries absorbing some failures)?
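Before running the delay experiment, it can help to predict which client timeouts should fire. The sketch below is simple arithmetic over the values in the manifest above (500ms latency, 50ms jitter). The 40ms baseline and the function names are hypothetical, and the model treats the injected delay as a single one-way addition per request; connection setup and multi-packet exchanges can add more in practice.

```python
def injected_latency_range_ms(baseline_ms: float, latency_ms: float, jitter_ms: float):
    """Return (low, high) expected request latency with the delay applied once."""
    low = baseline_ms + max(latency_ms - jitter_ms, 0)
    high = baseline_ms + latency_ms + jitter_ms
    return low, high


def timeout_outcome(timeout_ms: float, baseline_ms: float,
                    latency_ms: float = 500, jitter_ms: float = 50) -> str:
    """Classify whether a client timeout should fire under the injected delay."""
    low, high = injected_latency_range_ms(baseline_ms, latency_ms, jitter_ms)
    if timeout_ms < low:
        return "always times out"
    if timeout_ms >= high:
        return "never times out"
    return "sometimes times out"


if __name__ == "__main__":
    # Hypothetical baseline latency of 40ms between the two services.
    for timeout in (400, 550, 1000):
        print(f"timeout {timeout}ms: {timeout_outcome(timeout, baseline_ms=40)}")
```

A timeout that lands in the "sometimes" band is a useful case to test, because it exercises retries without failing every request.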
Packet Loss
# network-loss.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: packet-loss-test
  namespace: production
spec:
  action: loss
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: user-service
  loss:
    loss: "30" # 30% packet loss
    correlation: "50"
  direction: to
  duration: "3m"

Network Partition (Simulate Split-Brain)
# network-partition.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: db-partition-test
  namespace: production
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-service
  direction: to
  target:
    selector:
      namespaces:
        - production
      labelSelectors:
        app: postgres
    mode: all
  duration: "2m"

This tests whether your application correctly detects database connectivity loss and whether it recovers cleanly when connectivity is restored (connection pool reset, in-flight transaction handling).
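For reference, the kind of recovery behavior this experiment exercises often looks like a reconnect loop with exponential backoff. The sketch below is a generic illustration, not the behavior of any particular database driver or pool; `connect_with_backoff` and its parameters are hypothetical names, and `connect` stands in for whatever your client uses to open a connection.

```python
import time


def connect_with_backoff(connect, max_attempts=5, base_delay=0.5,
                         max_delay=8.0, sleep=time.sleep):
    """Call `connect()` until it succeeds, doubling the wait after each failure."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # Give up and surface the failure to the caller.
            sleep(delay)
            delay = min(delay * 2, max_delay)


if __name__ == "__main__":
    # Simulate a partition that heals after two failed attempts.
    state = {"calls": 0}

    def flaky_connect():
        state["calls"] += 1
        if state["calls"] <= 2:
            raise ConnectionError("database unreachable")
        return "connection"

    waits = []
    result = connect_with_backoff(flaky_connect, sleep=waits.append)
    print(result, waits)
```

Passing `sleep` as a parameter keeps the loop testable without real waiting. During the partition experiment, the thing to observe is whether your real client behaves comparably or instead holds on to dead connections.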
StressChaos: CPU and Memory Pressure
StressChaos generates CPU or memory load inside pods, simulating resource contention.
CPU Stress
# cpu-stress.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-test
  namespace: production
spec:
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-service
  stressors:
    cpu:
      workers: 4 # Number of CPU-burning workers
      load: 80 # CPU load percentage (0-100)
  duration: "5m"

What to verify:
- HPA triggers and scales up within your acceptable time window
- CPU throttling doesn't cause timeout cascades
- Pods don't get OOM-killed because slower request processing lets queued work pile up in memory
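To set expectations for the HPA check, the sketch below applies the scaling formula from the Kubernetes documentation, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). It ignores the HPA tolerance band and stabilization windows. The per-pod utilization figures, the 50% target, and the function names are hypothetical.

```python
import math


def average_utilization(per_pod_percent):
    """Average CPU utilization across pods, as a percentage of requests."""
    return sum(per_pod_percent) / len(per_pod_percent)


def desired_replicas(current_replicas, current_percent, target_percent,
                     min_replicas=1, max_replicas=10):
    """Core HPA formula, clamped to the configured replica bounds."""
    raw = math.ceil(current_replicas * current_percent / target_percent)
    return max(min_replicas, min(raw, max_replicas))


if __name__ == "__main__":
    # mode: one stresses a single pod out of four; the others stay at 30%.
    one_pod = average_utilization([80, 30, 30, 30])
    # mode: all stresses every pod.
    all_pods = average_utilization([80, 80, 80, 80])
    print("mode one:", desired_replicas(4, one_pod, target_percent=50))
    print("mode all:", desired_replicas(4, all_pods, target_percent=50))
```

Under these assumed numbers, stressing a single pod leaves the average below the target, so the HPA would not be expected to scale. If the goal is to test autoscaling, the experiment may need mode: all or a higher load.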
Memory Stress
# memory-stress.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress-test
  namespace: production
spec:
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: worker-service
  stressors:
    memory:
      workers: 2
      size: "512MB" # Memory to occupy; in Chaos Mesh 2.x this is the total across workers
  duration: "3m"

This tests OOM-kill behavior and whether Kubernetes restarts the pod within the expected time.
Experiment Scheduling
Run chaos experiments automatically on a schedule, for example weekly during business hours, to catch regressions early:
# scheduled-chaos.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: weekly-pod-kill
  namespace: chaos-testing
spec:
  schedule: "0 10 * * 2" # Every Tuesday at 10 AM
  type: PodChaos
  historyLimit: 5
  concurrencyPolicy: Forbid # Don't run if previous is still active
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - staging
      labelSelectors:
        app: api-service
    duration: "1m"

# List scheduled experiments
kubectl get schedule -A
# View experiment history
kubectl get podchaos -A

Workflow: Multi-Step Chaos Scenarios
Chaos Mesh Workflows run experiments in sequence or in parallel, enabling complex failure scenarios:
# workflow.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: cascading-failure-test
  namespace: chaos-testing
spec:
  entry: entry-task
  templates:
    - name: entry-task
      templateType: Serial # Run steps in order
      deadline: 25m # Must cover the child steps (5m + 5m + 2m + 10m)
      children:
        - add-network-delay
        - wait-for-alerts
        - kill-pods
        - verify-recovery
    - name: add-network-delay
      templateType: NetworkChaos
      networkChaos:
        action: delay
        mode: all
        selector:
          namespaces: [production]
          labelSelectors:
            app: payment-service
        delay:
          latency: "2s"
        duration: 5m
    - name: wait-for-alerts
      templateType: Suspend
      deadline: 5m # Wait 5 minutes (enough time for alerts to fire)
    - name: kill-pods
      templateType: PodChaos
      podChaos:
        action: pod-kill
        mode: fixed-percent
        value: "50"
        selector:
          namespaces: [production]
          labelSelectors:
            app: api-service
        duration: 2m
    - name: verify-recovery
      templateType: Suspend
      deadline: 10m # Wait for recovery, then experiment ends

Integrating Chaos into CI
Run chaos experiments against staging as part of pre-release testing:
# .github/workflows/chaos-test.yml
name: Chaos Engineering Tests
on:
  schedule:
    - cron: '0 9 * * 2,4' # Tuesday and Thursday mornings
  workflow_dispatch:
jobs:
  chaos-tests:
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          kubeconfig: ${{ secrets.KUBE_CONFIG }}
      - name: Record baseline metrics
        run: |
          # Get baseline error rate. promtool needs the server URL as its first
          # argument; -o json prints the result vector as a JSON array for jq.
          BASELINE_ERRORS=$(kubectl exec -n monitoring deploy/prometheus -- \
            promtool query instant -o json http://localhost:9090 \
            'sum(rate(http_requests_total{status=~"5.."}[5m]))' \
            | jq -r '.[0].value[1]')
          echo "BASELINE_ERRORS=$BASELINE_ERRORS" >> "$GITHUB_ENV"
      - name: Run pod kill experiment
        run: |
          kubectl apply -f chaos/pod-kill-staging.yaml
          # Wait for experiment duration
          sleep 90
          # Check experiment ran successfully (Chaos Mesh 2.x reports desiredPhase)
          STATUS=$(kubectl get podchaos pod-kill-staging -n chaos-testing \
            -o jsonpath='{.status.experiment.desiredPhase}')
          echo "Experiment status: $STATUS"
      - name: Measure impact
        run: |
          # Query error rate during chaos
          CHAOS_ERRORS=$(kubectl exec -n monitoring deploy/prometheus -- \
            promtool query instant -o json http://localhost:9090 \
            'sum(rate(http_requests_total{status=~"5.."}[2m]))' \
            | jq -r '.[0].value[1]')
          echo "Baseline error rate: $BASELINE_ERRORS"
          echo "Chaos error rate: $CHAOS_ERRORS"
          # Fail if error rate increased by more than 5x
          python3 -c "
          baseline = float('$BASELINE_ERRORS') or 0.001
          chaos = float('$CHAOS_ERRORS')
          ratio = chaos / baseline
          print(f'Error rate ratio: {ratio:.2f}x')
          if ratio > 5:
              raise SystemExit('Error rate too high during chaos: resilience failure')
          "
      - name: Wait for recovery
        run: |
          kubectl delete -f chaos/pod-kill-staging.yaml
          sleep 30
          # Verify error rate returned to baseline
          RECOVERY_ERRORS=$(kubectl exec -n monitoring deploy/prometheus -- \
            promtool query instant -o json http://localhost:9090 \
            'sum(rate(http_requests_total{status=~"5.."}[1m]))' \
            | jq -r '.[0].value[1]')
          echo "Recovery error rate: $RECOVERY_ERRORS"
      - name: Cleanup
        if: always()
        run: kubectl delete -f chaos/ --ignore-not-found

Observability During Experiments
Chaos experiments are only useful when you can measure their impact.
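One way to turn that measurement into a pass/fail signal is an error-ratio gate like the inline check in the CI job. The sketch below is a standalone variant that also tolerates an empty query result, where jq prints `null`, and a zero baseline. The function names, the 0.001 floor, and the 5x threshold are illustrative choices, not Chaos Mesh or Prometheus features.

```python
def parse_rate(raw: str) -> float:
    """Parse a Prometheus sample value; treat a missing result as 0 errors/s."""
    text = raw.strip()
    if text in ("", "null"):
        return 0.0
    return float(text)


def error_ratio(baseline_raw: str, chaos_raw: str, floor: float = 0.001) -> float:
    """Ratio of the chaos error rate to the baseline, with a floor on the baseline."""
    baseline = max(parse_rate(baseline_raw), floor)
    return parse_rate(chaos_raw) / baseline


def passes_gate(baseline_raw: str, chaos_raw: str, max_ratio: float = 5.0) -> bool:
    """True when the error rate grew by no more than `max_ratio` times."""
    return error_ratio(baseline_raw, chaos_raw) <= max_ratio


if __name__ == "__main__":
    for baseline, chaos in (("0.02", "0.05"), ("0.02", "0.2"), ("null", "0.004")):
        verdict = "pass" if passes_gate(baseline, chaos) else "fail"
        print(f"baseline={baseline} chaos={chaos}: {verdict}")
```

The floor keeps a near-zero baseline from producing a division by zero or an enormous ratio. Whether 0.001 errors per second is a sensible floor depends on your traffic volume.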
Metrics to Watch
# Error rate during chaos (promtool needs the server URL and Unix or RFC 3339
# timestamps; the date calls below assume GNU date)
kubectl exec -n monitoring deploy/prometheus -- \
  promtool query range \
  --start="$(date -u -d '-10 minutes' +%s)" --end="$(date -u +%s)" --step=15s \
  http://localhost:9090 \
  'sum(rate(http_requests_total{status=~"5.."}[1m])) by (service)'
# Pod restart count (did Kubernetes recover?)
kubectl get pods -n production | grep -E "RESTARTS|[1-9][0-9]* "
# HPA scaling activity
kubectl get hpa -n production -w
# Response latency percentiles
kubectl exec -n monitoring deploy/prometheus -- \
  promtool query instant http://localhost:9090 \
  'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))'

Chaos Dashboard
The Chaos Mesh dashboard provides real-time experiment visualization:
# Port-forward the dashboard
kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333
# Open http://localhost:2333

Designing Your Chaos Experiment Catalog
Start with the most likely and most impactful failure modes:
| Priority | Experiment | What It Tests |
|---|---|---|
| 1 | Pod kill (1 of N) | Kubernetes self-healing, traffic routing |
| 2 | Network delay (100ms) to database | Circuit breaker, timeout configuration |
| 3 | Network delay (2s) between services | Cascading timeout behavior |
| 4 | CPU stress (80%) | HPA responsiveness, request queuing |
| 5 | Network partition to database | Failover, connection pool recovery |
| 6 | Kill all pods simultaneously | Cold-start behavior, database connection storms |
Run experiments individually first, then combine them to test cascading failures. Document the results — expected behavior vs. actual behavior — and treat deviations as bugs to fix.