LitmusChaos: CNCF Kubernetes Chaos Engineering Guide

LitmusChaos is a CNCF incubating chaos engineering platform built natively for Kubernetes. It uses Kubernetes-native custom resources to define, run, and observe chaos experiments — no external tools, no sidecars, no proprietary agents. Chaos experiments are just YAML that runs in your cluster like any other workload.

Why Chaos Engineering on Kubernetes

Kubernetes adds a new class of reliability concerns that traditional monitoring doesn't reveal:

  • What happens when a pod gets evicted mid-request?
  • Does your application handle node drain gracefully?
  • When a dependent service slows down (not crashes), does it cascade?
  • Are your PodDisruptionBudgets configured correctly?
  • Do your readiness probes actually gate traffic correctly?

These questions can't be answered by code review or load testing alone. You need to deliberately introduce the failure and observe what happens. LitmusChaos makes this systematic.

Architecture

LitmusChaos runs three components in your cluster:

Chaos Operator: Watches for ChaosEngine resources and orchestrates experiment execution.

Chaos Experiments: Pre-built experiment definitions stored in ChaosHub. Each experiment (pod-delete, node-cpu-hog, network-latency, etc.) runs as a Kubernetes Job.

Chaos Center: Optional web UI for managing experiments, scheduling, and observability (useful for teams, not required for CLI-driven workflows).

The flow: you create a ChaosEngine CR → the operator picks it up → runs the specified ChaosExperiment as a Job → collects results into a ChaosResult CR → updates the ChaosEngine status.
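
You can watch this lifecycle as it happens; a minimal sketch (the resource names come from the examples later in this guide):

# Watch engines and results move through their phases
kubectl get chaosengine,chaosresult -w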

Installation

Install via Helm:

# Add Litmus Helm repo
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update

# Install Litmus in its own namespace
kubectl create ns litmus
helm install chaos litmuschaos/litmus \
  --namespace litmus \
  --set portal.frontend.service.type=NodePort

Verify installation:

kubectl get pods -n litmus
# NAME                                        READY   STATUS
# chaos-litmus-frontend-d66...                1/1     Running
# chaos-litmus-server-...                     1/1     Running
# chaos-operator-ce-...                       1/1     Running

Install the generic experiment CRDs:

kubectl apply -f https://hub.litmuschaos.io/api/chaos/master?file=charts/generic/experiments.yaml

Your First Experiment: Pod Delete

The simplest experiment — randomly delete a pod and verify your deployment recovers:

1. Ensure your target carries the label the chaos engine will select; the examples below use app=my-app, which the ChaosEngine references as applabel. Add it if your deployment doesn't already set it:

kubectl label deployment my-app app=my-app
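
The engine's applabel is matched against pod labels, so the label needs to be on the pods themselves (normally set via the deployment's pod template), not just the Deployment object. A quick check:

# The selector should return your running pods
kubectl get pods -l app=my-app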

2. Create a ServiceAccount for the experiment:

# pod-delete-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-delete-sa
  namespace: default
  labels:
    name: pod-delete-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-delete-sa
  namespace: default
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/exec", "pods/log", "events", "replicationcontrollers"]
    verbs: ["create", "list", "get", "patch", "update", "delete", "deletecollection"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets", "replicasets", "daemonsets"]
    verbs: ["list", "get"]
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["create", "list", "get", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-delete-sa
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-delete-sa
subjects:
  - kind: ServiceAccount
    name: pod-delete-sa
    namespace: default

Apply it:

kubectl apply -f pod-delete-rbac.yaml

3. Create the ChaosEngine:

# pod-delete-engine.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: my-app-chaos
  namespace: default
spec:
  appinfo:
    appns: default
    applabel: "app=my-app"
    appkind: deployment
  chaosServiceAccount: pod-delete-sa
  monitoring: false
  jobCleanUpPolicy: delete
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"        # Run chaos for 30 seconds
            - name: CHAOS_INTERVAL
              value: "10"        # Delete a pod every 10 seconds
            - name: FORCE
              value: "false"     # Graceful delete (use true for kill -9)

Apply it:

kubectl apply -f pod-delete-engine.yaml

4. Monitor the experiment:

# Watch the ChaosEngine status
kubectl describe chaosengine my-app-chaos

# Watch the chaos job
kubectl get jobs

# See the chaos result
kubectl describe chaosresult my-app-chaos-pod-delete

Reading Chaos Results

kubectl describe chaosresult my-app-chaos-pod-delete

Output:

Status:
  Experiment Status:
    Fail Step: N/A
    Phase: Completed
    Probe Success Percentage: 100
    Verdict: Pass
  History:
    Failed Runs: 0
    Passed Runs: 3
    Stopped Runs: 0

Verdict: Pass means your application recovered from all pod deletions within the duration. Verdict: Fail means something went wrong — check the chaos job logs:

kubectl logs -l job-name=my-app-chaos-pod-delete -n default

Network Chaos Experiments

Simulate network latency between your services:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: network-latency-chaos
  namespace: default
spec:
  appinfo:
    appns: default
    applabel: "app=my-app"
    appkind: deployment
  chaosServiceAccount: pod-delete-sa
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: NETWORK_LATENCY
              value: "2000"      # 2000ms latency
            - name: JITTER
              value: "500"       # ±500ms jitter
            - name: CONTAINER_RUNTIME
              value: containerd
            - name: SOCKET_PATH
              value: /run/containerd/containerd.sock

This is where chaos engineering gets valuable. You're not testing if your service crashes under 2 seconds of latency — you're testing if your timeouts, retries, and circuit breakers actually work. If a downstream service slows to 2 seconds and your upstream has a 30-second timeout with no circuit breaker, you've just found a cascading failure scenario.
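
While the latency experiment runs, you can measure the injected delay yourself from a throwaway client pod; a minimal sketch, assuming the service name and /health endpoint from the probe example later in this guide:

# Time a request against the target service during chaos
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s -o /dev/null -w 'total: %{time_total}s\n' http://my-app-service/health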

CPU and Memory Stress

Test resource exhaustion:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: cpu-stress-chaos
spec:
  appinfo:
    appns: default
    applabel: "app=my-app"
    appkind: deployment
  experiments:
    - name: pod-cpu-hog
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CPU_CORES
              value: "1"         # Hog 1 CPU core per pod
            - name: CPU_LOAD
              value: "100"       # 100% load on those cores

CPU stress reveals:

  • Whether your HPA triggers correctly and scales up (see the watch commands below)
  • Whether throttled pods cause request timeouts for callers
  • Whether your resource limits are set correctly (pod gets OOMKilled vs CPU throttled)
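
A quick way to observe these effects while the hog runs; a sketch assuming an HPA named my-app and metrics-server installed:

# Watch the HPA react to the CPU spike
kubectl get hpa my-app -w

# Check per-pod usage against requests/limits (requires metrics-server)
kubectl top pods -l app=my-app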

Probes: Continuous Validation During Chaos

LitmusChaos probes validate your system health while chaos runs. Instead of checking if the system survives, you check if it continues to serve traffic:

experiments:
  - name: pod-delete
    spec:
      probe:
        - name: check-api-responds
          type: httpProbe
          httpProbe/inputs:
            url: http://my-app-service/health
            insecureSkipVerify: false
            method:
              get:
                criteria: "=="
                responseCode: "200"
          mode: Continuous
          runProperties:
            probeTimeout: 5
            interval: 5
            retry: 1

With mode: Continuous, the probe runs every 5 seconds throughout the chaos duration. If a check fails and its single retry (retry: 1) also fails, the probe is marked failed and the experiment's verdict is Fail. This catches scenarios where your application briefly becomes unavailable during pod restarts; you can confirm the outcome with the check below.
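
After the run, the probe outcome is recorded on the ChaosResult. A minimal sketch that pulls the success percentage shown in the describe output earlier:

# Field mirrors the "Probe Success Percentage" line in the describe output
kubectl get chaosresult my-app-chaos-pod-delete \
  -o jsonpath='{.status.experimentStatus.probeSuccessPercentage}'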

Scheduling Chaos

For regular chaos execution, use ChaosSchedule:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: weekly-pod-delete
  namespace: default
spec:
  schedule:
    repeat:
      properties:
        minChaosInterval: "2h"    # Don't run more than once per 2 hours
      workHours:
        - includedHours: "9-17"   # Only during business hours
      workDays:
        includedDays: "Mon,Tue,Wed,Thu,Fri"
  engineTemplateSpec:
    appinfo:
      appns: default
      applabel: "app=my-app"
      appkind: deployment
    experiments:
      - name: pod-delete
        spec:
          components:
            env:
              - name: TOTAL_CHAOS_DURATION
                value: "30"
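
Apply it and confirm the schedule was accepted (a minimal sketch, assuming the manifest is saved as chaos-schedule.yaml):

kubectl apply -f chaos-schedule.yaml
kubectl get chaosschedule weekly-pod-delete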

Running scheduled chaos during business hours (not nights/weekends) ensures your team is available to respond when chaos exposes real issues.

ChaosHub: Pre-Built Experiments

LitmusChaos ships with a library of pre-built experiments:

  • Pod: pod-delete, pod-cpu-hog, pod-memory-hog, pod-network-latency, pod-network-loss, pod-network-corruption
  • Node: node-cpu-hog, node-memory-hog, node-io-stress, node-restart, node-drain, node-taint
  • K8s: container-kill, disk-fill, k8s-pod-delete
  • AWS: ec2-stop-by-id, ebs-loss-by-id, rds-instance-stop
  • Azure: azure-instance-stop, azure-disk-loss
  • GCP: gcp-vm-instance-stop, gcp-vm-disk-loss

Browse experiments:

kubectl get chaosexperiment

CI/CD Integration

Running chaos experiments in CI validates that your application's resilience characteristics don't regress:

# .github/workflows/chaos-tests.yml
name: Chaos Tests
on:
  schedule:
    - cron: '0 10 * * 1'   # Monday 10am

jobs:
  chaos:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up kubectl
        uses: azure/setup-kubectl@v3
        with:
          version: 'latest'

      - name: Apply chaos experiment
        run: |
          kubectl apply -f chaos/pod-delete-engine.yaml
          
      - name: Wait for experiment to complete
        run: |
          kubectl wait chaosengine my-app-chaos \
            --for=jsonpath='{.status.engineStatus}'=completed \
            --timeout=120s

      - name: Check result
        run: |
          VERDICT=$(kubectl get chaosresult my-app-chaos-pod-delete \
            -o jsonpath='{.status.experimentStatus.verdict}')
          echo "Chaos result: $VERDICT"
          if [ "$VERDICT" != "Pass" ]; then
            echo "Chaos experiment failed!"
            kubectl describe chaosresult my-app-chaos-pod-delete
            exit 1
          fi

      - name: Cleanup
        if: always()
        run: kubectl delete chaosengine my-app-chaos --ignore-not-found

Observability During Chaos

Connect chaos experiments to your observability stack. LitmusChaos emits Prometheus metrics:

# Check if Litmus metrics are exposed
kubectl port-forward svc/chaos-litmus-server 9091:9091 -n litmus
curl http://localhost:9091/metrics | grep litmus

Key metrics:

  • litmuschaos_passed_experiments — experiments that passed
  • litmuschaos_failed_experiments — experiments that failed
  • litmuschaos_awaited_experiments — experiments in progress
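
These metrics can feed alerting as well; a sketch of a Prometheus rule (the group and alert names are illustrative, not part of Litmus) that fires when any experiment fails:

groups:
  - name: litmus-chaos
    rules:
      - alert: ChaosExperimentFailed
        expr: increase(litmuschaos_failed_experiments[1h]) > 0
        labels:
          severity: warning
        annotations:
          summary: "A LitmusChaos experiment failed in the last hour"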

Add chaos markers to your Grafana dashboards to correlate chaos events with application metrics. When pod deletion spikes latency on your service dashboard, you can see the chaos annotation at the exact moment it started.
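
One way to add those markers is to post an annotation to Grafana's HTTP API when an experiment starts; a sketch, where GRAFANA_URL and GRAFANA_TOKEN are assumed environment variables for your stack:

# Hypothetical setup: export GRAFANA_URL and GRAFANA_TOKEN first
curl -s -X POST "$GRAFANA_URL/api/annotations" \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"time\": $(date +%s%3N), \"tags\": [\"chaos\", \"pod-delete\"], \"text\": \"LitmusChaos: my-app-chaos started\"}"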

LitmusChaos is one of the most mature Kubernetes-native chaos platforms. Its CNCF incubation status signals production-readiness, and its open experiment library means you can start running meaningful chaos without building custom tooling. The key is to start simple (pod-delete), instrument properly (probes + Prometheus), and expand to more complex scenarios as your confidence in the baseline grows.
