LitmusChaos: Chaos Engineering for Kubernetes
Kubernetes made it dramatically easier to deploy and scale applications. It also made it dramatically easier to build complex distributed systems that fail in ways no single developer fully understands. A pod gets evicted. A node runs out of memory. The network between two namespaces develops 200ms of latency. Do your applications recover? Does your monitoring detect the problem? Does your on-call alert fire before users notice?
LitmusChaos is a CNCF-hosted chaos engineering platform built specifically for Kubernetes. It provides a declarative, Kubernetes-native approach to running chaos experiments: you define what failure to inject as a Kubernetes custom resource, and the Litmus operator handles the rest. This guide walks through the full stack: architecture, installation, writing experiments, scheduling them, and making sense of results in the Litmus Portal.
LitmusChaos Architecture
LitmusChaos has three main layers:
ChaosHub is a Git-based repository of pre-built chaos experiments. The default hub (hosted by the Litmus team) contains 50+ experiments targeting pods, nodes, network, storage, and cloud provider resources. You can host private hubs for custom experiments.
Litmus Operator runs inside your cluster and watches for ChaosEngine custom resources. When it detects one, it schedules the appropriate experiment runner pod, injects the failure, monitors the steady-state, and records results back into the cluster.
Litmus Portal is an optional web UI and backend service that provides experiment management, scheduling, team access control, and analytics dashboards. It is particularly useful for large teams running many experiments across multiple clusters.
The core custom resource types are:
ChaosExperiment: defines the failure type and its parameters
ChaosEngine: binds an experiment to a target application and triggers execution
ChaosResult: written by the operator after each run with pass/fail verdict and metrics
ChaosSchedule: schedules repeated ChaosEngine runs within a time range, at a minimum interval
Installing LitmusChaos via Helm
LitmusChaos requires cluster-admin permissions because it manages resources across namespaces. Install the operator and portal together:
# Add the Litmus Helm repository
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update
# Create a dedicated namespace
kubectl create namespace litmus
# Install the Litmus Portal (includes operator, frontend, and backend)
helm install chaos litmuschaos/litmus \
  --namespace litmus \
  --set portal.frontend.service.type=LoadBalancer \
  --set portal.server.graphqlServer.replicaCount=1 \
  --version 3.4.0
# Watch the pods come up
kubectl get pods -n litmus -w
Once all pods are running, retrieve the frontend URL:
kubectl get svc -n litmus chaos-litmus-frontend-service
The default credentials are admin / litmus. Change them immediately on first login.
For clusters without a load balancer, use port-forwarding:
kubectl port-forward svc/chaos-litmus-frontend-service 9091:9091 -n litmus
Installing Just the Operator (Lightweight)
For CI environments where the portal UI is not needed, install only the operator and CRDs:
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v3.4.0.yaml
Writing Your First ChaosExperiment: pod-delete
The pod-delete experiment randomly kills pods in a target deployment. This is the Kubernetes equivalent of Chaos Monkey's instance termination. Before creating the engine, install the experiment from the ChaosHub:
# Quote the URL so the shell does not treat "?" as a glob character
kubectl apply -f "https://hub.litmuschaos.io/api/chaos/3.4.0?file=charts/generic/pod-delete/experiment.yaml" \
  -n your-app-namespace
Now create a service account with the permissions the experiment needs:
# rbac-pod-delete.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-delete-sa
  namespace: your-app-namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-delete-role
  namespace: your-app-namespace
rules:
  - apiGroups: [""]
    resources: ["pods", "events"]
    verbs: ["create", "list", "get", "patch", "update", "delete", "deletecollection"]
  - apiGroups: [""]
    resources: ["pods/exec", "pods/log", "replicationcontrollers"]
    verbs: ["get", "list", "create"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets", "replicasets", "daemonsets"]
    verbs: ["list", "get"]
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["create", "list", "get", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-delete-rb
  namespace: your-app-namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-delete-role
subjects:
  - kind: ServiceAccount
    name: pod-delete-sa
    namespace: your-app-namespace

kubectl apply -f rbac-pod-delete.yaml
Now define the ChaosEngine. This is what actually triggers the experiment:
# engine-pod-delete.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos-engine
  namespace: your-app-namespace
spec:
  appinfo:
    appns: your-app-namespace
    applabel: "app=nginx" # Matches pods via label selector
    appkind: deployment
  chaosServiceAccount: pod-delete-sa
  monitoring: true
  jobCleanUpPolicy: retain # retain | delete; retain lets you inspect runner logs
  annotationCheck: "false"
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60" # Run experiment for 60 seconds
            - name: CHAOS_INTERVAL
              value: "10" # Delete a pod every 10 seconds
            - name: FORCE
              value: "false" # Graceful termination
            - name: PODS_AFFECTED_PERC
              value: "50" # Kill up to 50% of matching pods
        probe:
          - name: "check-nginx-response"
            type: httpProbe
            httpProbe/inputs:
              url: "http://nginx-service.your-app-namespace.svc.cluster.local"
              insecureSkipVerify: false
              method:
                get:
                  criteria: "=="
                  responseCode: "200"
            mode: Continuous
            runProperties:
              probeTimeout: 5
              interval: 3
              retry: 2
              probePollingInterval: 2
Apply it and watch the results:
kubectl apply -f engine-pod-delete.yaml
kubectl get chaosresult -n your-app-namespace -w
The ChaosResult will show Verdict: Pass if the probes stayed healthy throughout the experiment, or Verdict: Fail if any probe check failed during chaos injection.
Network Latency Experiment
The pod-network-latency experiment injects TC (traffic control) rules into the pod's network namespace to simulate slow network links. This is essential for testing timeout configurations and circuit breaker behavior.
# Install the experiment definition
kubectl apply -f "https://hub.litmuschaos.io/api/chaos/3.4.0?file=charts/generic/pod-network-latency/experiment.yaml" \
  -n your-app-namespace

# engine-network-latency.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: network-latency-engine
  namespace: your-app-namespace
spec:
  appinfo:
    appns: your-app-namespace
    applabel: "app=payment-service"
    appkind: deployment
  # Assumes this ServiceAccount already exists with the RBAC the experiment needs
  chaosServiceAccount: pod-network-latency-sa
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "120"
            - name: NETWORK_LATENCY
              value: "2000" # 2000ms latency
            - name: JITTER
              value: "200" # ±200ms jitter
            - name: CONTAINER_RUNTIME
              value: "containerd"
            - name: SOCKET_PATH
              value: "/run/containerd/containerd.sock"
            - name: DESTINATION_IPS
              value: "" # Empty = affect all IPs; set to target specific services
            - name: DESTINATION_HOSTS
              value: "database-service" # Only affect traffic to this host
        probe:
          - name: "payment-latency-probe"
            type: httpProbe
            httpProbe/inputs:
              url: "http://payment-service.your-app-namespace.svc.cluster.local/health"
              responseTimeout: 5000 # 5s; the probe should still pass with 2s of added latency
              method:
                get:
                  criteria: "=="
                  responseCode: "200"
            mode: Continuous
            runProperties:
              probeTimeout: 8
              interval: 5
              retry: 3
Notice the DESTINATION_HOSTS parameter. Targeting a specific hostname rather than all traffic is how you control the blast radius: you can simulate "the database is slow" without affecting all network traffic from the pod.
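Whether a timeout survives the injected delay depends on how many delayed packets sit on the request's critical path. The sketch below is a rough budgeting model, not anything Litmus or tc computes: it assumes each delayed hop adds at most latency plus jitter, and `baseline_ms` and `delayed_hops` are hypothetical inputs you would have to measure for your own service.

```python
def worst_case_response_ms(baseline_ms, latency_ms, jitter_ms, delayed_hops):
    """Upper bound on response time if every delayed hop adds latency + jitter."""
    return baseline_ms + delayed_hops * (latency_ms + jitter_ms)


def fits_timeout(baseline_ms, latency_ms, jitter_ms, delayed_hops, timeout_ms):
    """True if the worst-case response time still fits inside the timeout."""
    worst = worst_case_response_ms(baseline_ms, latency_ms, jitter_ms, delayed_hops)
    return worst <= timeout_ms


# Values from the engine above: 2000ms latency, 200ms jitter, 5000ms probe timeout.
# The 50ms baseline is an assumed figure for illustration.
for hops in (1, 2, 3):
    print(hops, fits_timeout(50, 2000, 200, hops, 5000))
```

Under these assumptions, a health check that makes one or two sequential database calls stays within the 5s timeout, while three sequential calls would exceed it. That is the kind of dependency chain this experiment is meant to expose.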
Scheduling with ChaosSchedule
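Running experiments manually is useful for GameDays but does not give you continuous confidence. ChaosSchedule wraps a ChaosEngine template in a recurring schedule defined by a time range, a minimum interval between runs, and the days on which chaos is allowed: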
# schedule-pod-delete.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx-pod-delete
  namespace: your-app-namespace
spec:
  schedule:
    # Set only one of now, once, or repeat; this schedule uses repeat
    repeat:
      timeRange:
        startTime: "2024-01-01T09:00:00Z"
        endTime: "2025-12-31T17:00:00Z"
      properties:
        minChaosInterval: "2h" # At least 2 hours between runs
      workDays:
        includedDays: "Mon,Tue,Wed,Thu,Fri"
  engineTemplateSpec:
    appinfo:
      appns: your-app-namespace
      applabel: "app=nginx"
      appkind: deployment
    chaosServiceAccount: pod-delete-sa
    experiments:
      - name: pod-delete
        spec:
          components:
            env:
              - name: TOTAL_CHAOS_DURATION
                value: "30"
              - name: CHAOS_INTERVAL
                value: "10"
              - name: PODS_AFFECTED_PERC
                value: "25"
This schedule runs pod-delete on weekdays between the start and end timestamps, leaving at least 2 hours between runs. Note that timeRange bounds the overall period, not the hours of each day; to restrict runs to business hours, add a workHours.includedHours range such as 9-17. The results accumulate in ChaosResult resources and the Litmus Portal aggregates them into trend dashboards.
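To make the schedule's constraints concrete, here is a minimal sketch of the three checks it expresses: time range, included days, and minimum interval. This is an illustration of the semantics only, not the Litmus scheduler's implementation, and the function and constant names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Values copied from the schedule above.
START = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
END = datetime(2025, 12, 31, 17, 0, tzinfo=timezone.utc)
INCLUDED_DAYS = {"Mon", "Tue", "Wed", "Thu", "Fri"}
MIN_INTERVAL = timedelta(hours=2)


def run_allowed(now, last_run=None):
    """Return True if a run at `now` satisfies all three schedule constraints."""
    if not (START <= now <= END):
        return False  # outside the overall time range
    if DAY_NAMES[now.weekday()] not in INCLUDED_DAYS:
        return False  # not an included work day
    if last_run is not None and now - last_run < MIN_INTERVAL:
        return False  # too soon after the previous run
    return True
```

A check like this can be handy in a review when deciding whether a proposed schedule could ever fire outside the hours the team expects.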
Probes: Defining Steady-State
Probes are the mechanism that defines what "healthy" looks like during and after chaos. LitmusChaos supports four probe types:
httpProbe checks an HTTP endpoint. Use this for services that expose a health endpoint.
cmdProbe runs a shell command inside the cluster and checks the exit code or output. Useful for database connectivity checks or custom business logic validation.
k8sProbe checks the state of a Kubernetes resource. For example, verifying that a Deployment has at least N ready replicas throughout the experiment.
promProbe queries a Prometheus metric and checks it against a threshold. This is the most powerful probe for validating SLO-level behavior.
# Prometheus probe example
probe:
  - name: "error-rate-probe"
    type: promProbe
    promProbe/inputs:
      endpoint: "http://prometheus.monitoring.svc.cluster.local:9090"
      query: |
        sum(rate(http_requests_total{status=~"5.."}[1m])) /
        sum(rate(http_requests_total[1m]))
      comparator:
        type: float
        criteria: "<="
        value: "0.05" # Error rate must stay at or below 5%
    mode: Continuous
    runProperties:
      probeTimeout: 10
      interval: 15
      retry: 2
      probePollingInterval: 5
This probe continuously queries Prometheus during the experiment and fails the ChaosResult if the HTTP error rate exceeds 5%. This directly measures SLO impact rather than just checking if the service is reachable.
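The comparison the probe performs can be sketched in a few lines. This is an illustrative model of "query result compared against a threshold using the configured criteria", not Litmus source code; the function names are hypothetical, and returning 0.0 when there is no traffic is a choice made for this sketch (Prometheus itself may return NaN or no data in that case, which is worth testing against your own query).

```python
import operator

# Map the criteria strings used in the probe definition to comparison functions.
CRITERIA = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
}


def error_rate(errors_per_sec, requests_per_sec):
    """Ratio of 5xx request rate to total request rate."""
    if requests_per_sec == 0:
        return 0.0  # sketch-only convention for "no traffic"
    return errors_per_sec / requests_per_sec


def probe_passes(observed, criteria, threshold):
    """Apply the comparator: the threshold arrives as a string, as in the YAML."""
    return CRITERIA[criteria](observed, float(threshold))


print(probe_passes(error_rate(2, 100), "<=", "0.05"))
print(probe_passes(error_rate(8, 100), "<=", "0.05"))
```

With 2 errors per 100 requests the check holds, and with 8 per 100 it does not, which is the condition that would turn the verdict to Fail.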
Observability with Litmus Portal
The Litmus Portal provides several observability features worth enabling:
Experiment Analytics shows pass/fail trends over time per experiment and per application. It is where you detect gradual degradation—an experiment that passed reliably for three months but started failing two weeks ago signals a regression introduced in a recent deploy.
Chaos Dashboard overlays chaos experiment timing onto application metrics from Prometheus/Grafana. This makes it trivial to correlate "chaos was injected at 14:32" with "p99 latency spiked at 14:33 and recovered at 14:36."
Workflow Builder provides a visual editor for composing multi-step chaos workflows where experiments run sequentially or in parallel with pre- and post-conditions.
To connect Prometheus to the portal:
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: litmus-portal-prometheus-config
  namespace: litmus
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'chaos-exporter'
        static_configs:
          - targets: ['chaos-exporter.litmus.svc.cluster.local:8080']
EOF
The chaos-exporter service (installed with the operator) exposes Prometheus metrics including experiment pass/fail rates, chaos injection duration, and probe verdict counts.
Advanced: Writing Custom Experiments
When the pre-built experiments do not cover your failure mode, you can write custom experiments. A LitmusChaos experiment is a Kubernetes Job with the chaos logic in a container. The experiment definition declares the parameters it accepts:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: custom-disk-fill
  namespace: your-app-namespace
spec:
  definition:
    scope: Namespaced
    permissions:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["get", "list"]
    image: "your-registry/custom-chaos:latest"
    imagePullPolicy: Always
    args:
      - -c
      - ./disk-fill
    command:
      - /bin/bash
    env:
      - name: TOTAL_CHAOS_DURATION
        value: "60"
      - name: FILL_PERCENTAGE
        value: "80"
    labels:
      name: custom-disk-fill
Your container image runs the chaos logic and exits 0 for success or non-zero for failure. The Litmus operator handles everything else: scheduling, result recording, probe execution, and cleanup.
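As an illustration of what the chaos logic inside such an image might look like, here is a minimal sketch of the parameter handling and sizing step for a disk-fill script. It is an assumption about how you could structure it, not the implementation of any Litmus experiment: the function names and the `TARGET_PATH` variable are hypothetical, and the sketch only reports what it would do rather than writing any data.

```python
import os
import shutil
import sys


def bytes_to_fill(total_bytes, used_bytes, fill_percentage):
    """Bytes to add so usage reaches fill_percentage of capacity (never negative)."""
    target = total_bytes * fill_percentage // 100
    return max(0, target - used_bytes)


def plan_from_env(env, total_bytes, used_bytes):
    """Validate the experiment parameters and return (duration_seconds, bytes_to_write)."""
    duration = int(env.get("TOTAL_CHAOS_DURATION", "60"))
    fill = int(env.get("FILL_PERCENTAGE", "80"))
    if duration <= 0 or not 0 < fill <= 100:
        raise ValueError("duration must be > 0 and fill percentage in 1..100")
    return duration, bytes_to_fill(total_bytes, used_bytes, fill)


def main():
    """Return the process exit code: 0 on success, 1 on invalid parameters."""
    usage = shutil.disk_usage(os.environ.get("TARGET_PATH", "/"))  # hypothetical variable
    try:
        duration, to_write = plan_from_env(os.environ, usage.total, usage.used)
    except ValueError as exc:
        print(f"invalid parameters: {exc}", file=sys.stderr)
        return 1
    print(f"would write {to_write} bytes and hold for {duration}s")
    return 0
```

A real entrypoint would call `sys.exit(main())`, write and later remove the filler data, and make sure cleanup runs even when the script is interrupted.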
Best Practices for LitmusChaos
Start with game days, not schedules. Run experiments manually during a planned session with the full team present before automating them. This gives everyone a chance to understand what the experiment does and to observe the system response together.
One experiment per engine. Although a single ChaosEngine can list multiple experiments, keep engines focused. It makes it easier to correlate which experiment caused which effect.
Use jobCleanUpPolicy: retain during development. The default is delete, which removes runner pods immediately after the experiment. Retaining them lets you inspect logs for debugging. Switch to delete for automated scheduled runs.
Tag your chaos resources. Add labels to ChaosEngine resources indicating the team, application, and severity so you can filter results in the portal.
Integrate results into deployment gates. A ChaosResult resource with Verdict: Fail can be detected in CI and block a deployment. Use kubectl get chaosresult in your pipeline to check experiment health before promoting to production.
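A deployment gate of this kind can be sketched as a small function over the JSON that `kubectl get chaosresult -n your-app-namespace -o json` prints. The sketch assumes the verdict lives at `status.experimentStatus.verdict` and treats anything other than Pass (including a still-running Awaited result) as blocking; both the function name and that policy are choices made for illustration.

```python
import json


def failed_results(chaosresult_list_json):
    """Return 'name: verdict' strings for every ChaosResult that did not pass."""
    items = json.loads(chaosresult_list_json).get("items", [])
    failed = []
    for item in items:
        status = item.get("status", {}).get("experimentStatus", {})
        verdict = status.get("verdict", "Awaited")  # missing verdict counts as not passed
        if verdict != "Pass":
            failed.append(f'{item["metadata"]["name"]}: {verdict}')
    return failed
```

In a pipeline, the script would read the kubectl output, print the returned list, and exit non-zero when it is non-empty so the promotion step is skipped.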
LitmusChaos gives Kubernetes-native teams a complete chaos engineering stack without leaving the Kubernetes API. By expressing failure scenarios as declarative resources, experiments become version-controlled, reviewable, and reproducible—the same properties we expect from infrastructure-as-code, now applied to resilience testing.