Chaos Mesh Testing Guide: PodChaos, NetworkChaos & Kubernetes Fault Injection

Chaos Mesh is a Kubernetes-native chaos engineering platform. It injects failures — pod crashes, network delays, CPU stress, disk failures — directly into your cluster via Kubernetes CRDs, without modifying application code. The goal is to discover resilience failures before they happen in production.

Why Chaos Engineering

Your monitoring, alerting, and recovery procedures are theoretical until tested under real failure conditions. Questions that chaos experiments answer:

  • Does the service restart automatically when pods crash?
  • Does the circuit breaker activate when downstream latency increases?
  • Does the database connection pool recover when connectivity is restored?
  • Does HPA scale up before user-visible errors appear under CPU pressure?

Installation

# Install via Helm
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update

helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh \
  --create-namespace \
  --version 2.6.3

# Verify installation
kubectl get pods -n chaos-mesh
# chaos-controller-manager, chaos-daemon, chaos-dashboard should be running

PodChaos: Pod Failure and Kill

PodChaos injects pod-level failures. The two most useful types:

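pod-failure: makes the pod unavailable for the duration of the experiment while leaving the Pod object in place. Chaos Mesh does this by swapping each container's image for a pause image, so the application stops serving. Containers with readiness or liveness probes fail them, and Kubernetes removes the pod from service endpoints and may restart its containers.

pod-kill: deletes the pod through the Kubernetes API. The pod is not restarted in place; its controller (a Deployment or StatefulSet, for example) creates a replacement. A pod with no controller stays gone.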

Pod Kill Experiment

# pod-kill.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-test
  namespace: production
spec:
  action: pod-kill
  mode: one          # one | all | fixed | fixed-percent | random-max-percent
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-service
  duration: "30s"    # How long the experiment runs
  gracePeriod: 0     # Delete immediately, with no graceful shutdown period

# Apply the experiment
kubectl apply -f pod-kill.yaml

# Watch what happens
kubectl get pods -n production -w

# Check experiment status
kubectl describe podchaos pod-kill-test -n production

# Cleanup
kubectl delete -f pod-kill.yaml

Pod Failure Experiment (Readiness Probe Failure)

# pod-failure.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-test
  namespace: production
spec:
  action: pod-failure
  mode: fixed-percent
  value: "33"      # Fail 33% of matched pods
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
  duration: "5m"

This tests whether your load balancer correctly removes unhealthy pods from rotation and whether your service degrades gracefully with reduced capacity.
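
To make the `mode` and `value` fields more concrete, here is a minimal Python sketch of how a selector's matched pods could be narrowed down to targets. It is an illustration, not Chaos Mesh's implementation: the function name `select_targets`, the pod names, and the rounding-down behavior for percentages are all assumptions.

```python
import math
import random


def select_targets(pods, mode, value=None, rng=None):
    """Pick target pods according to a PodChaos-style mode (simplified)."""
    rng = rng or random.Random()
    if not pods:
        return []
    if mode == "one":
        return [rng.choice(pods)]
    if mode == "all":
        return list(pods)
    if mode == "fixed":
        # Never ask for more pods than the selector matched.
        return rng.sample(pods, min(int(value), len(pods)))
    if mode == "fixed-percent":
        # Assumption: fractional pod counts round down.
        return rng.sample(pods, math.floor(len(pods) * int(value) / 100))
    if mode == "random-max-percent":
        # Assumption: any count from 0 up to the rounded-down maximum is possible.
        max_count = math.floor(len(pods) * int(value) / 100)
        return rng.sample(pods, rng.randint(0, max_count))
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical pod names for a six-replica payment-service.
pods = [f"payment-service-{i}" for i in range(6)]
print(len(select_targets(pods, "fixed-percent", "33")))  # prints 1
```

Under these assumptions, 33% of six pods selects one pod, so check the replica count before relying on a percentage to take out a meaningful share of capacity.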

Container Kill

Kill a specific container in a multi-container pod:

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: sidecar-kill-test
  namespace: production
spec:
  action: container-kill
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: my-service
  containerNames:
    - "envoy-sidecar"
  duration: "1m"

NetworkChaos: Latency, Packet Loss, Partition

NetworkChaos modifies network traffic between pods. This is the most valuable chaos type for testing distributed systems.

Network Delay (Latency Injection)

# network-delay.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-test
  namespace: production
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: order-service
  delay:
    latency: "500ms"     # Add 500ms to packets sent to the target
    correlation: "25"    # Each delay depends 25% on the previous one, so values drift rather than jump
    jitter: "50ms"       # +/- 50ms variation
  direction: to          # to | from | both
  target:                # Optional: only affect traffic to this selector
    selector:
      namespaces:
        - production
      labelSelectors:
        app: payment-service
    mode: all
  duration: "5m"

What to watch during this experiment:

  • Does order-service's circuit breaker open when payment-service latency spikes?
  • Do request timeouts trigger at the expected threshold?
  • Does the error rate remain within SLA (retries absorbing some failures)?
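
Before running the experiment, it can help to predict which client timeouts the injected delay should trip. The sketch below is a simplified model in Python: it assumes the delay is added once per request, and the function names and the baseline and timeout figures are hypothetical. Real requests that span several packets or a fresh TCP handshake may see more added latency.

```python
def injected_delay_bounds(latency_ms, jitter_ms):
    """Return the (min, max) added delay in ms for a latency +/- jitter setting."""
    return max(latency_ms - jitter_ms, 0), latency_ms + jitter_ms


def timeout_verdict(baseline_ms, timeout_ms, latency_ms, jitter_ms):
    """Classify whether a client timeout is expected to fire under injected delay."""
    low, high = injected_delay_bounds(latency_ms, jitter_ms)
    if baseline_ms + low >= timeout_ms:
        return "always times out"
    if baseline_ms + high >= timeout_ms:
        return "sometimes times out"
    return "never times out"


# Hypothetical numbers: 120ms baseline latency, 600ms client timeout,
# combined with the 500ms +/- 50ms delay from the manifest above.
print(timeout_verdict(120, 600, 500, 50))  # prints "sometimes times out"
```

If the prediction and the observed behavior disagree, the timeout is probably configured somewhere other than where you think it is.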

Packet Loss

# network-loss.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: packet-loss-test
  namespace: production
spec:
  action: loss
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: user-service
  loss:
    loss: "30"        # 30% packet loss
    correlation: "50"
  direction: to
  duration: "3m"

Network Partition (Simulate Split-Brain)

# network-partition.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: db-partition-test
  namespace: production
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-service
  direction: to
  target:
    selector:
      namespaces:
        - production
      labelSelectors:
        app: postgres
    mode: all
  duration: "2m"

This tests whether your application correctly detects database connectivity loss and whether it recovers cleanly when connectivity is restored (connection pool reset, in-flight transaction handling).
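
One recovery behavior worth checking during a partition is how the client reconnects. The following Python sketch shows retry with exponential backoff, one common approach; it is not taken from any particular driver or pool library, and `connect_with_backoff` and its parameters are hypothetical names.

```python
import time


def connect_with_backoff(connect, max_attempts=5, base_delay=0.5,
                         max_delay=8.0, sleep=time.sleep):
    """Call connect() until it succeeds, doubling the wait after each failure."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # Give up and surface the last error to the caller.
            sleep(delay)
            delay = min(delay * 2, max_delay)


# Usage with a fake connection that fails twice before succeeding.
attempts = {"count": 0}


def flaky_connect():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("database unreachable")
    return "connected"


print(connect_with_backoff(flaky_connect, sleep=lambda seconds: None))  # prints "connected"
```

A two-minute partition outlasts the five attempts used here, so in practice the attempt limit and maximum delay need to be sized against the longest outage you expect the service to ride out.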

StressChaos: CPU and Memory Pressure

StressChaos generates CPU or memory load inside pods, simulating resource contention.

CPU Stress

# cpu-stress.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-test
  namespace: production
spec:
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-service
  stressors:
    cpu:
      workers: 4       # Number of CPU-burning workers
      load: 80         # CPU load percentage (0-100)
  duration: "5m"

What to verify:

  • HPA triggers and scales up within your acceptable time window
  • CPU throttling doesn't cause timeout cascades
  • Pods aren't OOM-killed because slower request processing lets queued work pile up in memory
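
To judge whether the HPA should react at all, it helps to work through its scaling rule: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance (0.1 by default). The Python sketch below applies that rule; the function name and the example numbers are hypothetical.

```python
import math


def desired_replicas(current_replicas, current_utilization,
                     target_utilization, tolerance=0.1):
    """Apply the HPA scaling rule to an average utilization figure."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # Close enough to target: no scaling.
    return math.ceil(current_replicas * ratio)


# Hypothetical: 3 replicas averaging 80% CPU against a 50% target.
print(desired_replicas(3, 80, 50))  # prints 5
```

The HPA acts on the average across pods. With `mode: one`, a single stressed pod among several idle ones may not move the average enough to trigger scaling, which is itself a useful finding.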

Memory Stress

# memory-stress.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress-test
  namespace: production
spec:
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: worker-service
  stressors:
    memory:
      workers: 2
      size: "512MB"    # Total memory to occupy, shared across the workers
  duration: "3m"

This tests OOM-kill behavior and whether Kubernetes restarts the container within the expected time.

Experiment Scheduling

Run chaos experiments automatically on a schedule, for example weekly during business hours, so regressions surface while engineers are available to respond:

# scheduled-chaos.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: weekly-pod-kill
  namespace: chaos-testing
spec:
  schedule: "0 10 * * 2"    # Every Tuesday at 10 AM
  type: PodChaos
  historyLimit: 5
  concurrencyPolicy: Forbid  # Don't run if previous is still active
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - staging
      labelSelectors:
        app: api-service
    duration: "1m"

# List scheduled experiments
kubectl get schedule -A

# View experiment history
kubectl get podchaos -A
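
Cron weekday numbering is easy to get wrong, so here is a small Python sketch that checks whether a given moment matches the `0 10 * * 2` schedule above. The function name is hypothetical, and the datetime is assumed to already be in whatever time zone the scheduler uses.

```python
from datetime import datetime


def matches_weekly_schedule(moment, minute=0, hour=10, cron_weekday=2):
    """Return True if moment matches a 'minute hour * * weekday' cron entry."""
    # Cron counts Sunday as 0. isoweekday() counts Monday as 1 and Sunday as 7,
    # so taking it modulo 7 maps Sunday to 0 and leaves Tuesday as 2.
    return (moment.minute == minute
            and moment.hour == hour
            and moment.isoweekday() % 7 == cron_weekday)


print(matches_weekly_schedule(datetime(2024, 1, 2, 10, 0)))  # Tuesday: prints True
print(matches_weekly_schedule(datetime(2024, 1, 3, 10, 0)))  # Wednesday: prints False
```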

Workflow: Multi-Step Chaos Scenarios

Chaos Mesh Workflows run experiments in sequence, enabling complex failure scenarios:

# workflow.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: cascading-failure-test
  namespace: chaos-testing
spec:
  entry: entry-task
  templates:
    - name: entry-task
      templateType: Serial    # Run steps in order
      deadline: 25m           # Must cover the children's combined run time (22m)
      children:
        - add-network-delay
        - wait-for-alerts
        - kill-pods
        - verify-recovery

    - name: add-network-delay
      templateType: NetworkChaos
      deadline: 5m     # In a Workflow, the template's deadline takes the place of the chaos duration
      networkChaos:
        action: delay
        mode: all
        selector:
          namespaces: [production]
          labelSelectors:
            app: payment-service
        delay:
          latency: "2s"

    - name: wait-for-alerts
      templateType: Suspend
      deadline: 5m     # Wait 5 minutes (enough time for alerts to fire)

    - name: kill-pods
      templateType: PodChaos
      deadline: 2m     # In a Workflow, the template's deadline takes the place of the chaos duration
      podChaos:
        action: pod-kill
        mode: fixed-percent
        value: "50"
        selector:
          namespaces: [production]
          labelSelectors:
            app: api-service

    - name: verify-recovery
      templateType: Suspend
      deadline: 10m    # Wait for recovery, then experiment ends
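
Because a Serial template runs its children one after another, the entry deadline has to cover the sum of the child run times. The Python sketch below adds them up for the four steps above; `parse_duration` and `serial_runtime` are hypothetical helpers that handle only simple `s`, `m`, and `h` units.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Convert strings such as '90s', '5m' or '1h30m' to seconds."""
    parts = re.findall(r"(\d+)([smh])", text)
    # Reject anything the pattern did not fully consume, e.g. '5x' or '5 m'.
    if not parts or "".join(number + unit for number, unit in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(int(number) * _UNIT_SECONDS[unit] for number, unit in parts)


def serial_runtime(child_durations):
    """Total seconds needed when children run strictly one after another."""
    return sum(parse_duration(duration) for duration in child_durations)


# Delay, wait for alerts, kill pods, verify recovery.
total = serial_runtime(["5m", "5m", "2m", "10m"])
print(total // 60, "minutes")  # prints "22 minutes"
```

If the entry deadline is shorter than this total, the workflow is cut off before the final step completes.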

Integrating Chaos into CI

Run chaos experiments against staging as part of pre-release testing:

# .github/workflows/chaos-test.yml
name: Chaos Engineering Tests

on:
  schedule:
    - cron: '0 9 * * 2,4'  # Tuesday and Thursday mornings
  workflow_dispatch:

jobs:
  chaos-tests:
    runs-on: ubuntu-latest
    environment: staging

    steps:
      - uses: actions/checkout@v4

      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          kubeconfig: ${{ secrets.KUBE_CONFIG }}

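      - name: Record baseline metrics
        run: |
          # Get baseline error rate. promtool needs the server URL (assumed to be
          # the default localhost:9090 inside the pod); its JSON output is a bare
          # array of samples, and an empty result falls back to 0.
          BASELINE_ERRORS=$(kubectl exec -n monitoring deploy/prometheus -- \
            promtool query --format=json instant http://localhost:9090 \
            'sum(rate(http_requests_total{status=~"5.."}[5m]))' \
            | jq -r '.[0].value[1] // "0"')
          echo "BASELINE_ERRORS=$BASELINE_ERRORS" >> "$GITHUB_ENV"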

      - name: Run pod kill experiment
        run: |
          kubectl apply -f chaos/pod-kill-staging.yaml
          
          # Wait for experiment duration
          sleep 90
          
          # Check the experiment's desired phase (Run or Stop in Chaos Mesh 2.x)
          STATUS=$(kubectl get podchaos pod-kill-staging -n chaos-testing \
            -o jsonpath='{.status.experiment.desiredPhase}')
          echo "Experiment status: $STATUS"

      - name: Measure impact
        run: |
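          # Query error rate during chaos (server URL assumed to be the default
          # localhost:9090 inside the pod; an empty result falls back to 0)
          CHAOS_ERRORS=$(kubectl exec -n monitoring deploy/prometheus -- \
            promtool query --format=json instant http://localhost:9090 \
            'sum(rate(http_requests_total{status=~"5.."}[2m]))' \
            | jq -r '.[0].value[1] // "0"')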
          
          echo "Baseline error rate: $BASELINE_ERRORS"
          echo "Chaos error rate: $CHAOS_ERRORS"
          
          # Fail if error rate increased by more than 5x
          python3 -c "
          baseline = float('$BASELINE_ERRORS') or 0.001
          chaos = float('$CHAOS_ERRORS')
          ratio = chaos / baseline
          print(f'Error rate ratio: {ratio:.2f}x')
          if ratio > 5:
              raise SystemExit('Error rate too high during chaos — resilience failure')
          "

      - name: Wait for recovery
        run: |
          kubectl delete -f chaos/pod-kill-staging.yaml
          sleep 30
          
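          # Measure the error rate after recovery for comparison with the baseline
          # (server URL assumed to be the default localhost:9090 inside the pod)
          RECOVERY_ERRORS=$(kubectl exec -n monitoring deploy/prometheus -- \
            promtool query --format=json instant http://localhost:9090 \
            'sum(rate(http_requests_total{status=~"5.."}[1m]))' \
            | jq -r '.[0].value[1] // "0"')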
          
          echo "Recovery error rate: $RECOVERY_ERRORS"

      - name: Cleanup
        if: always()
        run: kubectl delete -f chaos/ --ignore-not-found
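
The inline ratio check in the "Measure impact" step can also live in a small, testable Python function. The sketch below tolerates the awkward inputs a metrics query can produce, such as `null`, an empty string, or `NaN`; the names `error_ratio` and `within_budget` are hypothetical, and the 0.001 floor and 5x limit are taken from the workflow above.

```python
import math


def error_ratio(baseline, chaos, floor=0.001):
    """Return chaos/baseline error rate, flooring the baseline to avoid dividing by zero."""
    def to_rate(raw):
        try:
            value = float(raw)
        except (TypeError, ValueError):
            return 0.0  # 'null', '' or None: treat as no errors observed.
        return 0.0 if math.isnan(value) else value

    return to_rate(chaos) / max(to_rate(baseline), floor)


def within_budget(baseline, chaos, max_ratio=5.0):
    """True when the error rate under chaos stays within max_ratio of the baseline."""
    return error_ratio(baseline, chaos) <= max_ratio


print(within_budget("0.1", "0.6"))     # prints False
print(within_budget("null", "0.004"))  # prints True
```

Treating unparseable values as zero is a design choice: it keeps the job from crashing on a quiet baseline, but it would also hide a broken query, so you may prefer to fail loudly instead.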

Observability During Experiments

Chaos experiments are only useful when you can measure their impact.

Metrics to Watch

# Error rate during chaos (timestamps computed with GNU date; server URL
# assumed to be the default localhost:9090 inside the pod)
kubectl exec -n monitoring deploy/prometheus -- \
  promtool query range \
  --start="$(date -d '10 minutes ago' +%s)" --end="$(date +%s)" --step=15s \
  http://localhost:9090 \
  'sum(rate(http_requests_total{status=~"5.."}[1m])) by (service)'

# Pod restart count (did Kubernetes recover?); column 4 is RESTARTS
kubectl get pods -n production | awk 'NR == 1 || $4 + 0 > 0'

# HPA scaling activity
kubectl get hpa -n production -w

# Response latency percentiles
kubectl exec -n monitoring deploy/prometheus -- \
  promtool query instant http://localhost:9090 \
  'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))'
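
For scripted checks, restart counts are more reliable to read from the JSON form of the pod list than from table output. The Python sketch below works on the structure returned by `kubectl get pods -o json`, using the `status.containerStatuses[].restartCount` field; the function name and the sample pod names are hypothetical.

```python
def restarted_pods(pod_list):
    """Map pod name to total container restarts, skipping pods with none."""
    restarts = {}
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        total = sum(status.get("restartCount", 0) for status in statuses)
        if total > 0:
            restarts[pod["metadata"]["name"]] = total
    return restarts


# Hypothetical, trimmed-down sample of the kubectl JSON output.
sample = {
    "items": [
        {"metadata": {"name": "api-service-abc"},
         "status": {"containerStatuses": [{"restartCount": 2}, {"restartCount": 0}]}},
        {"metadata": {"name": "api-service-def"},
         "status": {"containerStatuses": [{"restartCount": 0}]}},
    ]
}
print(restarted_pods(sample))  # prints {'api-service-abc': 2}
```

In practice the input would come from `json.load(sys.stdin)` with the kubectl output piped in. Note that `restartCount` tracks container restarts inside an existing pod; a pod that was deleted and recreated appears under a new name with a count of 0.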

Chaos Dashboard

The Chaos Mesh dashboard provides real-time experiment visualization:

# Port-forward the dashboard
kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333

# Open http://localhost:2333

Designing Your Chaos Experiment Catalog

Start with the most likely and most impactful failure modes:

| Priority | Experiment | What It Tests |
|---|---|---|
| 1 | Pod kill (1 of N) | Kubernetes self-healing, traffic routing |
| 2 | Network delay (100ms) to database | Circuit breaker, timeout configuration |
| 3 | Network delay (2s) between services | Cascading timeout behavior |
| 4 | CPU stress (80%) | HPA responsiveness, request queuing |
| 5 | Network partition to database | Failover, connection pool recovery |
| 6 | Kill all pods simultaneously | Cold-start behavior, database connection storms |

Run experiments individually first, then combine them to test cascading failures. Document the results — expected behavior vs. actual behavior — and treat deviations as bugs to fix.
