Chaos Engineering with Chaos Monkey and LitmusChaos for Kubernetes

Chaos engineering is the practice of deliberately injecting failures into a system to uncover weaknesses before they cause incidents. Netflix popularized the practice with Chaos Monkey, a tool that randomly terminates EC2 instances in production to force engineers to build resilient services.

For Kubernetes workloads, LitmusChaos (a CNCF project) has become one of the most widely used chaos engineering frameworks. This guide covers how to set up chaos experiments, define steady-state hypotheses, and integrate chaos testing into your reliability practice.

The Chaos Engineering Process

Chaos engineering follows a scientific method:

  1. Define steady state — what does "working normally" look like? (throughput, latency, error rate)
  2. Hypothesize — "I believe the system will maintain steady state even if X happens"
  3. Introduce chaos — inject the failure
  4. Observe — does the system maintain steady state?
  5. Automate and repeat — run experiments continuously, not just after incidents

The goal is to find weaknesses before your users find them.

Setting Up LitmusChaos

# Install Litmus using helm
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm install chaos litmuschaos/litmus \
  --namespace litmus \
  --create-namespace \
  --set portal.frontend.service.type=ClusterIP

<span class="hljs-comment"># Install Chaos Hub (pre-built experiments)
kubectl apply -f https://hub.litmuschaos.io/api/chaos?file=charts/generic/k8s-pod-delete/experiment.yaml
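
Before applying any ChaosEngine, it is worth confirming that the operator actually came up. A minimal sketch, assuming kubectl is already pointed at the target cluster:

import json
import subprocess

def litmus_running(namespace: str = "litmus") -> bool:
    """Return True if every pod in the Litmus namespace reports phase=Running."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    pods = json.loads(out)["items"]
    return bool(pods) and all(p["status"].get("phase") == "Running" for p in pods)

if __name__ == "__main__":
    print("Litmus ready:", litmus_running())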

Your First Chaos Experiment: Pod Kill

The simplest experiment: kill a random pod and verify your service recovers.

# pod-kill-experiment.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  engineState: active
  appinfo:
    appns: default
    applabel: "app=nginx"
    appkind: deployment
  
  chaosServiceAccount: litmus-admin
  
  # Chaos to inject (steady-state probes are covered in the next section)
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"      # Inject chaos for 30 seconds
            - name: CHAOS_INTERVAL
              value: "10"      # Kill pods every 10 seconds
            - name: FORCE
              value: "false"   # Graceful termination (SIGTERM)

# Apply the engine to start the experiment
kubectl apply -f pod-kill-experiment.yaml

# Watch experiment progress
kubectl get chaosresult nginx-chaos-pod-delete -o jsonpath='{.status.experimentstatus.verdict}'
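
The verdict stays Awaited while the experiment is still running, so fixed sleeps are fragile. A small polling helper, sketched here with the same jsonpath and assuming kubectl access, waits for a terminal verdict instead:

import subprocess
import time

def wait_for_verdict(result_name: str, namespace: str = "default",
                     timeout_s: int = 300) -> str:
    """Poll a ChaosResult until its verdict is no longer 'Awaited'."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        verdict = subprocess.run(
            ["kubectl", "get", "chaosresult", result_name, "-n", namespace,
             "-o", "jsonpath={.status.experimentstatus.verdict}"],
            capture_output=True, text=True,
        ).stdout.strip()
        if verdict and verdict != "Awaited":
            return verdict  # typically "Pass" or "Fail"
        time.sleep(5)
    raise TimeoutError(f"{result_name} still Awaited after {timeout_s}s")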

Defining Steady-State Hypotheses with Probes

LitmusChaos supports probes that check steady state before, during, and after chaos:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: api-resilience-chaos
  namespace: default
spec:
  engineState: active
  appinfo:
    appns: default
    applabel: "app=api-service"
    appkind: deployment
  
  experiments:
    - name: pod-delete
      spec:
        probe:
          # HTTP probe: check API is responding
          - name: "api-availability-probe"
            type: "httpProbe"
            httpProbe/inputs:
              url: "http://api-service/health"
              insecureSkipVerify: false
              responseTimeout: 2000  # ms
              method:
                get:
                  criteria: "=="
                  responseCode: "200"
            mode: "Continuous"  # Check throughout the experiment
            runProperties:
              probeTimeout: 3
              interval: 5   # Check every 5 seconds
              retry: 2
          
          # Prometheus probe: check error rate during chaos
          - name: "error-rate-probe"
            type: "promProbe"
            promProbe/inputs:
              endpoint: "http://prometheus:9090"
              query: |
                sum(rate(http_requests_total{status=~"5.."}[1m]))
                /
                sum(rate(http_requests_total[1m]))
              comparator:
                type: "float"
                criteria: "<="
                value: "0.05"   # Error rate should stay under 5% during chaos
            mode: "Continuous"
            runProperties:
              probeTimeout: 3
              interval: 10
              retry: 2
        
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "15"

Common Kubernetes Chaos Experiments

Network Latency Injection

- name: pod-network-latency
  spec:
    components:
      env:
        - name: NETWORK_INTERFACE
          value: "eth0"
        - name: NETWORK_LATENCY
          value: "2000"   # Add 2 seconds of latency
        - name: TOTAL_CHAOS_DURATION
          value: "120"
        - name: TARGET_PODS
          value: "api-service-7d4b9f8c-xkl4p"   # pinning a pod name is brittle; prefer PODS_AFFECTED_PERC for repeatable runs

Hypothesis: "Even with 2 seconds of network latency injected into the API service, our frontend should show a degraded but functional UI (not a 500 error), because we have configured timeouts and fallbacks."

CPU Stress

- name: pod-cpu-hog
  spec:
    components:
      env:
        - name: CPU_CORES
          value: "2"        # Consume 2 CPU cores
        - name: CPU_LOAD
          value: "100"      # 100% utilization
        - name: TOTAL_CHAOS_DURATION
          value: "60"
        - name: PODS_AFFECTED_PERC
          value: "50"       # Affect 50% of matching pods

Node Drain

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: node-drain-chaos
  namespace: litmus
spec:
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: node-drain
      spec:
        components:
          env:
            - name: TARGET_NODE
              value: "worker-node-2"
            - name: TOTAL_CHAOS_DURATION
              value: "120"

Hypothesis: "When worker-node-2 is drained, Kubernetes reschedules all affected pods within 2 minutes, and the service maintains > 99% availability during rescheduling."

Writing Chaos Tests in Code

For repeatable, version-controlled chaos experiments, define experiments programmatically:

import subprocess
import time

import requests

class ChaosExperiment:
    def __init__(self, name: str, engine_yaml: str, target_url: str):
        self.name = name
        self.engine_yaml = engine_yaml
        self.target_url = target_url
    
    def measure_steady_state(self, duration_seconds: int = 30) -> dict:
        """Measure baseline metrics before chaos."""
        metrics = {"success_rates": [], "latencies_ms": []}
        
        start = time.time()
        while time.time() - start < duration_seconds:
            t = time.perf_counter()
            try:
                response = requests.get(f"{self.target_url}/health", timeout=5)
                elapsed = (time.perf_counter() - t) * 1000
                metrics["success_rates"].append(1 if response.status_code == 200 else 0)
                metrics["latencies_ms"].append(elapsed)
            except Exception:
                metrics["success_rates"].append(0)
                metrics["latencies_ms"].append(5000)
            time.sleep(1)
        
        return {
            "success_rate": sum(metrics["success_rates"]) / len(metrics["success_rates"]),
            "p95_latency_ms": sorted(metrics["latencies_ms"])[int(0.95 * len(metrics["latencies_ms"]))]
        }
    
    def run(self) -> dict:
        print(f"[Chaos] Measuring steady state...")
        baseline = self.measure_steady_state(duration_seconds=30)
        print(f"  Success rate: {baseline['success_rate']:.2%}")
        print(f"  P95 latency: {baseline['p95_latency_ms']:.1f}ms")
        
        print(f"[Chaos] Applying chaos engine: {self.name}")
        subprocess.run(["kubectl", "apply", "-f", self.engine_yaml], check=True)
        
        print("[Chaos] Measuring during chaos...")
        during_chaos = self.measure_steady_state(duration_seconds=60)
        print(f"  Success rate: {during_chaos['success_rate']:.2%}")
        print(f"  P95 latency: {during_chaos['p95_latency_ms']:.1f}ms")
        
        print("[Chaos] Waiting for recovery...")
        time.sleep(30)
        
        print("[Chaos] Measuring post-chaos...")
        after_chaos = self.measure_steady_state(duration_seconds=30)
        print(f"  Success rate: {after_chaos['success_rate']:.2%}")
        print(f"  P95 latency: {after_chaos['p95_latency_ms']:.1f}ms")
        
        # Cleanup
        subprocess.run(["kubectl", "delete", "-f", self.engine_yaml], check=True)
        
        return {
            "baseline": baseline,
            "during_chaos": during_chaos,
            "after_chaos": after_chaos
        }

def test_api_survives_pod_deletion():
    experiment = ChaosExperiment(
        name="api-pod-delete",
        engine_yaml="experiments/pod-delete.yaml",
        target_url="https://api.example.com"
    )
    
    results = experiment.run()
    
    # During chaos: should maintain 95% success rate
    assert results["during_chaos"]["success_rate"] >= 0.95, \
        f"Success rate during chaos {results['during_chaos']['success_rate']:.2%} below 95%"
    
    # After chaos: should return to baseline within 30 seconds
    assert results["after_chaos"]["success_rate"] >= results["baseline"]["success_rate"] - 0.01, \
        "Service did not recover to baseline after chaos"
    
    # Latency during chaos: should stay under 3x baseline P95
    baseline_p95 = results["baseline"]["p95_latency_ms"]
    chaos_p95 = results["during_chaos"]["p95_latency_ms"]
    assert chaos_p95 <= baseline_p95 * 3, \
        f"Latency during chaos ({chaos_p95:.1f}ms) is more than 3x baseline ({baseline_p95:.1f}ms)"

Chaos in CI/CD

Running chaos experiments in CI catches regressions before they reach production:

# .github/workflows/chaos-tests.yml
name: Chaos Testing

on:
  schedule:
    - cron: '0 2 * * *'  # Daily at 2am
  workflow_dispatch:

jobs:
  chaos-tests:
    runs-on: ubuntu-latest
    environment: staging  # Only run against staging
    
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up kubectl
        uses: azure/setup-kubectl@v3
      
      - name: Configure kubeconfig
        run: |
          mkdir -p ~/.kube
          echo "${{ secrets.STAGING_KUBECONFIG }}" | base64 -d > ~/.kube/config
          chmod 600 ~/.kube/config
      
      - name: Install Litmus
        run: |
          helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
          helm repo update
          helm upgrade --install chaos litmuschaos/litmus \
            --namespace litmus --create-namespace
      
      - name: Run pod-delete experiment
        run: |
          kubectl apply -f chaos/pod-delete-engine.yaml
          sleep 90  # Wait for experiment to complete
          
          verdict=$(kubectl get chaosresult api-chaos-pod-delete \
            -o jsonpath='{.status.experimentstatus.verdict}')
          
          if [ "$verdict" != "Pass" ]; then
            echo "Chaos experiment failed: $verdict"
            exit 1
          fi
      
      - name: Run network-latency experiment
        run: |
          kubectl apply -f chaos/network-latency-engine.yaml
          sleep 150
          
          verdict=$(kubectl get chaosresult api-chaos-network-latency \
            -o jsonpath='{.status.experimentstatus.verdict}')
          
          [ "$verdict" = "Pass" ] || exit 1
      
      - name: Cleanup
        if: always()
        run: |
          kubectl delete chaosengine --all -n default

Game Days

Beyond automated chaos tests, run periodic game days — structured exercises where your team simulates specific failure scenarios:

| Scenario           | What to Simulate                    | What to Validate                           |
|--------------------|-------------------------------------|--------------------------------------------|
| Database failover  | Primary DB killed                   | App fails over within 30s                  |
| Region outage      | All pods in zone-a killed           | Traffic routes to zone-b                   |
| Cache failure      | Redis pod killed                    | App degrades gracefully (slower, not down) |
| Dependency timeout | Upstream service adds 30s latency   | App times out and returns cached/fallback  |
| Certificate expiry | TLS cert replaced with expired cert | Alert fires before users notice            |

Chaos engineering is most valuable when it's boring — when experiments run, find nothing, and confirm your system is as resilient as you thought. That's the signal that your reliability investments are working.
