LitmusChaos: CNCF Kubernetes Chaos Engineering Guide
LitmusChaos is a CNCF incubating chaos engineering platform built natively for Kubernetes. It uses Kubernetes custom resources to define, run, and observe chaos experiments — no external tools, no sidecars, no proprietary agents. Chaos experiments are just YAML that runs in your cluster like any other workload.
Why Chaos Engineering on Kubernetes
Kubernetes adds a new class of reliability concerns that traditional monitoring doesn't reveal:
- What happens when a pod gets evicted mid-request?
- Does your application handle node drain gracefully?
- When a dependent service slows down (not crashes), does it cascade?
- Are your PodDisruptionBudgets configured correctly?
- Do your readiness probes actually gate traffic correctly?
These questions can't be answered by code review or load testing alone. You need to deliberately introduce the failure and observe what happens. LitmusChaos makes this systematic.
Architecture
LitmusChaos runs three components in your cluster:
Chaos Operator: Watches for ChaosEngine resources and orchestrates experiment execution.
Chaos Experiments: Pre-built experiment definitions stored in ChaosHub. Each experiment (pod-delete, node-cpu-hog, pod-network-latency, etc.) runs as a Kubernetes Job.
Chaos Center: Optional web UI for managing experiments, scheduling, and observability (useful for teams, not required for CLI-driven workflows).
The flow: you create a ChaosEngine CR → operator picks it up → runs the specified ChaosExperiment as a Job → collects results → updates the ChaosEngine status.
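You can watch this chain end to end on a live cluster. A quick sketch, assuming an engine named my-app-chaos (the name used in the examples later in this guide); the runner pod follows the <engine>-runner naming convention:

# Engine status moves from initialized to completed as the run progresses
kubectl get chaosengine my-app-chaos -o jsonpath='{.status.engineStatus}'

# The operator launches a runner pod, which in turn creates the experiment Job
kubectl get pod my-app-chaos-runner
kubectl get jobs

# One ChaosResult exists per <engine>-<experiment> pair
kubectl get chaosresults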
Installation
Install via Helm:
# Add Litmus Helm repo
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update
<span class="hljs-comment"># Install Litmus in its own namespace
kubectl create ns litmus
helm install chaos litmuschaos/litmus \
--namespace litmus \
--<span class="hljs-built_in">set portal.frontend.service.type=NodePortVerify installation:
kubectl get pods -n litmus
# NAME READY STATUS
<span class="hljs-comment"># chaos-litmus-frontend-d66... 1/1 Running
<span class="hljs-comment"># chaos-litmus-server-... 1/1 Running
<span class="hljs-comment"># chaos-operator-ce-... 1/1 Running Install the generic experiment CRDs:
kubectl apply -f https://hub.litmuschaos.io/api/chaos/master?file=charts/generic/experiments.yamlYour First Experiment: Pod Delete
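Before moving on, it's worth confirming the CRDs registered and the experiment definitions landed; the CRD names come from the litmuschaos.io API group:

kubectl get crd | grep litmuschaos.io
# chaosengines.litmuschaos.io
# chaosexperiments.litmuschaos.io
# chaosresults.litmuschaos.io

# The pre-built definitions are namespaced; this lists them in the current namespace
kubectl get chaosexperiments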
Your First Experiment: Pod Delete

The simplest experiment — randomly delete a pod and verify your deployment recovers:
1. Label your target deployment:
kubectl label deployment my-app app=my-app

The label must match the applabel referenced in the ChaosEngine below; in most deployments the pods already carry app=my-app via the pod template.

2. Create a ServiceAccount for the experiment:
# pod-delete-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: pod-delete-sa
namespace: default
labels:
name: pod-delete-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: pod-delete-sa
namespace: default
rules:
- apiGroups: [""]
resources: ["pods", "pods/exec", "pods/log", "events", "replicationcontrollers"]
verbs: ["create", "list", "get", "patch", "update", "delete", "deletecollection"]
- apiGroups: ["apps"]
resources: ["deployments", "statefulsets", "replicasets", "daemonsets"]
verbs: ["list", "get"]
- apiGroups: ["litmuschaos.io"]
resources: ["chaosengines", "chaosexperiments", "chaosresults"]
verbs: ["create", "list", "get", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: pod-delete-sa
namespace: default
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: pod-delete-sa
subjects:
- kind: ServiceAccount
name: pod-delete-sa
  namespace: default

kubectl apply -f pod-delete-rbac.yaml
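Optionally, sanity-check the grants before running anything; kubectl auth can-i supports impersonating the new ServiceAccount:

# Should print "yes" if the Role and RoleBinding applied correctly
kubectl auth can-i delete pods -n default \
  --as=system:serviceaccount:default:pod-delete-sa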
3. Create the ChaosEngine:

# pod-delete-engine.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: my-app-chaos
namespace: default
spec:
  engineState: active        # set to "stop" to halt a running engine
  annotationCheck: "false"   # skip the litmuschaos.io/chaos=true annotation gate
appinfo:
appns: default
applabel: "app=my-app"
appkind: deployment
chaosServiceAccount: pod-delete-sa
monitoring: false
jobCleanUpPolicy: delete
experiments:
- name: pod-delete
spec:
components:
env:
- name: TOTAL_CHAOS_DURATION
value: "30" # Run chaos for 30 seconds
- name: CHAOS_INTERVAL
value: "10" # Delete a pod every 10 seconds
- name: FORCE
value: "false" # Graceful delete (use true for kill -9)kubectl apply -f pod-delete-engine.yaml4. Monitor the experiment:
4. Monitor the experiment:

# Watch the ChaosEngine status
kubectl describe chaosengine my-app-chaos

# Watch the chaos job
kubectl get jobs

# See the chaos result
kubectl describe chaosresult my-app-chaos-pod-delete

Reading Chaos Results
kubectl describe chaosresult my-app-chaos-pod-delete

Output:
Status:
Experiment Status:
Fail Step: N/A
Phase: Completed
Probe Success Percentage: 100
Verdict: Pass
History:
Failed Runs: 0
Passed Runs: 3
    Stopped Runs:  0

Verdict: Pass means your application recovered from all pod deletions within the duration. Verdict: Fail means something went wrong — check the chaos job logs:
kubectl logs -l job-name=my-app-chaos-pod-delete -n default

Network Chaos Experiments
Simulate network latency between your services:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: network-latency-chaos
namespace: default
spec:
appinfo:
appns: default
applabel: "app=my-app"
appkind: deployment
chaosServiceAccount: pod-delete-sa
experiments:
- name: pod-network-latency
spec:
components:
env:
- name: TOTAL_CHAOS_DURATION
value: "60"
- name: NETWORK_LATENCY
value: "2000" # 2000ms latency
- name: JITTER
value: "500" # ±500ms jitter
- name: CONTAINER_RUNTIME
value: containerd
- name: SOCKET_PATH
          value: /run/containerd/containerd.sock

This is where chaos engineering gets valuable. You're not testing if your service crashes under 2 seconds of latency — you're testing whether your timeouts, retries, and circuit breakers actually work. If a downstream service slows to 2 seconds and your upstream has a 30-second timeout with no circuit breaker, you've just found a cascading failure scenario.
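While the experiment runs, you can measure the injected delay directly. A minimal sketch, assuming a pod named client with curl in its image:

# Each request should take ~2s (plus/minus jitter) while chaos is active
kubectl exec client -- curl -s -o /dev/null \
  -w 'total: %{time_total}s\n' http://my-app-service/health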
CPU and Memory Stress
Test resource exhaustion:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: cpu-stress-chaos
spec:
appinfo:
appns: default
applabel: "app=my-app"
    appkind: deployment
  chaosServiceAccount: pod-delete-sa  # reuse the SA from the pod-delete example; broaden its Role if needed
experiments:
- name: pod-cpu-hog
spec:
components:
env:
- name: TOTAL_CHAOS_DURATION
value: "60"
- name: CPU_CORES
value: "1" # Hog 1 CPU core per pod
- name: CPU_LOAD
value: "100" # 100% load on those coresCPU stress reveals:
- Whether your HPA triggers correctly and scales up
- Whether throttled pods cause request timeouts for callers
- Whether your resource limits behave as expected: exceeding a CPU limit throttles the pod, while only memory pressure triggers an OOMKill (the commands below show how to watch this live)
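A minimal way to watch that reaction, assuming an HPA named my-app and metrics-server installed:

# Watch HPA scaling decisions while the cpu-hog experiment runs
kubectl get hpa my-app --watch

# In a second terminal, confirm the stress shows up in pod metrics
kubectl top pods -l app=my-app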
Probes: Continuous Validation During Chaos
LitmusChaos probes validate your system health while chaos runs. Instead of checking if the system survives, you check if it continues to serve traffic:
experiments:
- name: pod-delete
spec:
probe:
- name: check-api-responds
type: httpProbe
httpProbe/inputs:
url: http://my-app-service/health
insecureSkipVerify: false
method:
get:
criteria: "=="
responseCode: "200"
mode: Continuous
runProperties:
probeTimeout: 5
interval: 5
            retry: 1

With mode: Continuous, the probe runs every 5 seconds throughout the chaos duration. If the health endpoint returns anything other than 200 for more than 1 retry, the chaos experiment fails. This catches scenarios where your application briefly becomes unavailable during pod restarts.
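httpProbe is one of several probe types. For comparison, a cmdProbe sketch in the same schema family (the command and threshold are illustrative; exact field support varies by Litmus version):

probe:
  - name: ready-replicas-check
    type: cmdProbe
    cmdProbe/inputs:
      # Runs inline in the experiment pod; asserts at least 2 ready replicas
      command: "kubectl get deploy my-app -o jsonpath='{.status.readyReplicas}'"
      comparator:
        type: int
        criteria: ">="
        value: "2"
    mode: EOT            # evaluate once at end of test
    runProperties:
      probeTimeout: 10
      interval: 5
      retry: 2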
Scheduling Chaos
For regular chaos execution, use ChaosSchedule:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
name: weekly-pod-delete
namespace: default
spec:
schedule:
repeat:
properties:
minChaosInterval: "2h" # Don't run more than once per 2 hours
      workHours:
        includedHours: "9-17"   # Only during business hours
      workDays:
        includedDays: "Mon,Tue,Wed,Thu,Fri"
engineTemplateSpec:
appinfo:
appns: default
applabel: "app=my-app"
appkind: deployment
experiments:
- name: pod-delete
spec:
components:
env:
- name: TOTAL_CHAOS_DURATION
value: "30"Running scheduled chaos during business hours (not nights/weekends) ensures your team is available to respond when chaos exposes real issues.
ChaosHub: Pre-Built Experiments
LitmusChaos ships with a library of pre-built experiments:
| Category | Experiments |
|---|---|
| Pod | pod-delete, pod-cpu-hog, pod-memory-hog, pod-network-latency, pod-network-loss, pod-network-corruption |
| Node | node-cpu-hog, node-memory-hog, node-io-stress, node-restart, node-drain, node-taint |
| K8s | container-kill, disk-fill, k8s-pod-delete |
| AWS | ec2-stop-by-id, ebs-loss-by-id, rds-instance-stop |
| Azure | azure-instance-stop, azure-disk-loss |
| GCP | gcp-vm-instance-stop, gcp-vm-disk-loss |
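Each category maps to a chart on ChaosHub, and category charts install the same way as the generic chart earlier. For example, for the AWS experiments (the kube-aws chart name is assumed from the hub's URL pattern):

kubectl apply -f "https://hub.litmuschaos.io/api/chaos/master?file=charts/kube-aws/experiments.yaml"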
Browse experiments:
kubectl get chaosexperiment

CI/CD Integration
Running chaos experiments in CI validates that your application's resilience characteristics don't regress:
# .github/workflows/chaos-tests.yml
name: Chaos Tests
on:
schedule:
- cron: '0 10 * * 1' # Monday 10am
jobs:
chaos:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up kubectl
uses: azure/setup-kubectl@v3
with:
version: 'latest'
- name: Apply chaos experiment
run: |
kubectl apply -f chaos/pod-delete-engine.yaml
- name: Wait for experiment to complete
run: |
kubectl wait chaosengine my-app-chaos \
--for=jsonpath='{.status.engineStatus}'=completed \
--timeout=120s
- name: Check result
run: |
VERDICT=$(kubectl get chaosresult my-app-chaos-pod-delete \
-o jsonpath='{.status.experimentStatus.verdict}')
echo "Chaos result: $VERDICT"
if [ "$VERDICT" != "Pass" ]; then
echo "Chaos experiment failed!"
kubectl describe chaosresult my-app-chaos-pod-delete
exit 1
fi
- name: Cleanup
if: always()
        run: kubectl delete chaosengine my-app-chaos --ignore-not-found
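One gap worth noting: the workflow above assumes kubectl is already authenticated against the target cluster. A typical fix is a credentials step ahead of the chaos steps; the secret name here is illustrative:

      - name: Configure cluster access
        run: |
          mkdir -p ~/.kube
          echo "${{ secrets.KUBECONFIG_B64 }}" | base64 -d > ~/.kube/config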
Observability During Chaos

Connect chaos experiments to your observability stack. LitmusChaos emits Prometheus metrics:
# Check if Litmus metrics are exposed
kubectl port-forward svc/chaos-litmus-server 9091:9091 -n litmus
curl http://localhost:9091/metrics | grep litmus

Key metrics:
- litmuschaos_passed_experiments: experiments that passed
- litmuschaos_failed_experiments: experiments that failed
- litmuschaos_awaited_experiments: experiments in progress
Add chaos markers to your Grafana dashboards to correlate chaos events with application metrics. When pod deletion spikes latency on your service dashboard, you can see the chaos annotation at the exact moment it started.
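As a concrete starting point, a minimal Prometheus alerting-rule sketch over the counters above (thresholds and windows are illustrative):

groups:
  - name: chaos
    rules:
      - alert: ChaosExperimentFailed
        expr: increase(litmuschaos_failed_experiments[10m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "A Litmus chaos experiment failed in the last 10 minutes"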
LitmusChaos is one of the most mature Kubernetes-native chaos platforms. Its CNCF incubation status signals real-world adoption, and its open experiment library means you can start running meaningful chaos without building custom tooling. The key is to start simple (pod-delete), instrument properly (probes + Prometheus), and expand to more complex scenarios as your confidence in the baseline grows.