LitmusChaos: Chaos Engineering for Kubernetes

Kubernetes made it dramatically easier to deploy and scale applications. It also made it dramatically easier to build complex distributed systems that fail in ways no single developer fully understands. A pod gets evicted. A node runs out of memory. The network between two namespaces develops 200ms of latency. Do your applications recover? Does your monitoring detect the problem? Does your on-call alert fire before users notice?

LitmusChaos is a CNCF-incubating chaos engineering platform built specifically for Kubernetes. It provides a declarative, Kubernetes-native approach to running chaos experiments: you define what failure to inject as a Kubernetes custom resource, and the Litmus operator handles the rest. This guide walks through the full stack: architecture, installation, writing experiments, scheduling them, and making sense of results in the Litmus Portal.

LitmusChaos Architecture

LitmusChaos has three main layers:

ChaosHub is a Git-based repository of pre-built chaos experiments. The default hub (hosted by the Litmus team) contains 50+ experiments targeting pods, nodes, network, storage, and cloud provider resources. You can host private hubs for custom experiments.

Litmus Operator runs inside your cluster and watches for ChaosEngine custom resources. When it detects one, it schedules the appropriate experiment runner pod, injects the failure, monitors the steady-state, and records results back into the cluster.

Litmus Portal is an optional web UI and backend service that provides experiment management, scheduling, team access control, and analytics dashboards. It is particularly useful for large teams running many experiments across multiple clusters.

The core custom resource types are:

  • ChaosExperiment — defines the failure type and its parameters
  • ChaosEngine — binds an experiment to a target application and triggers execution
  • ChaosResult — written by the operator after each run with pass/fail verdict and metrics
  • ChaosSchedule — runs a ChaosEngine template repeatedly within a time range, at a minimum interval, on selected days

Installing LitmusChaos via Helm

LitmusChaos requires cluster-admin permissions because it manages resources across namespaces. Install the operator and portal together:

# Add the Litmus Helm repository
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update

# Create a dedicated namespace
kubectl create namespace litmus

# Install the Litmus Portal (includes operator, frontend, and backend)
helm install chaos litmuschaos/litmus \
  --namespace litmus \
  --set portal.frontend.service.type=LoadBalancer \
  --set portal.server.graphqlServer.replicaCount=1 \
  --version 3.4.0

# Watch the pods come up
kubectl get pods -n litmus -w

Once all pods are running, retrieve the frontend URL:

kubectl get svc -n litmus chaos-litmus-frontend-service

The default credentials are admin / litmus. Change them immediately on first login.

For clusters without a load balancer, use port-forwarding:

kubectl port-forward svc/chaos-litmus-frontend-service 9091:9091 -n litmus

Installing Just the Operator (Lightweight)

For CI environments where the portal UI is not needed, install only the operator and CRDs:

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v3.4.0.yaml

Writing Your First ChaosExperiment: pod-delete

The pod-delete experiment randomly kills pods in a target deployment. This is the Kubernetes equivalent of Chaos Monkey's instance termination. Before creating the engine, install the experiment from the ChaosHub:

# Quote the URL so the shell does not treat "?" as a glob character
kubectl apply -f "https://hub.litmuschaos.io/api/chaos/3.4.0?file=charts/generic/pod-delete/experiment.yaml" \
  -n your-app-namespace

Now create a service account with the permissions the experiment needs:

# rbac-pod-delete.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-delete-sa
  namespace: your-app-namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-delete-role
  namespace: your-app-namespace
rules:
  - apiGroups: [""]
    resources: ["pods", "events"]
    verbs: ["create", "list", "get", "patch", "update", "delete", "deletecollection"]
  - apiGroups: [""]
    resources: ["pods/exec", "pods/log", "replicationcontrollers"]
    verbs: ["get", "list", "create"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets", "replicasets", "daemonsets"]
    verbs: ["list", "get"]
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["create", "list", "get", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-delete-rb
  namespace: your-app-namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-delete-role
subjects:
  - kind: ServiceAccount
    name: pod-delete-sa
    namespace: your-app-namespace

Apply the RBAC manifest:

kubectl apply -f rbac-pod-delete.yaml

Now define the ChaosEngine. This is what actually triggers the experiment:

# engine-pod-delete.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos-engine
  namespace: your-app-namespace
spec:
  appinfo:
    appns: your-app-namespace
    applabel: "app=nginx"           # Matches pods via label selector
    appkind: deployment
  chaosServiceAccount: pod-delete-sa
  monitoring: true
  jobCleanUpPolicy: retain          # retain | delete — retain lets you inspect runner logs
  annotationCheck: "false"
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"           # Run experiment for 60 seconds
            - name: CHAOS_INTERVAL
              value: "10"           # Delete a pod every 10 seconds
            - name: FORCE
              value: "false"        # Graceful termination
            - name: PODS_AFFECTED_PERC
              value: "50"           # Kill up to 50% of matching pods
        probe:
          - name: "check-nginx-response"
            type: httpProbe
            httpProbe/inputs:
              url: "http://nginx-service.your-app-namespace.svc.cluster.local"
              insecureSkipVerify: false
              method:
                get:
                  criteria: "=="
                  responseCode: "200"
            mode: Continuous
            runProperties:
              probeTimeout: 5
              interval: 3
              retry: 2
              probePollingInterval: 2

Apply it and watch the results:

kubectl apply -f engine-pod-delete.yaml
kubectl get chaosresult -n your-app-namespace -w

The ChaosResult will show Verdict: Pass if the probes stayed healthy throughout the experiment, or Verdict: Fail if any probe check failed during chaos injection.
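
For scripted runs, the watch command can be swapped for a polling loop. The following is a minimal sketch, assuming the ChaosResult is named <engine-name>-<experiment-name> (nginx-chaos-engine-pod-delete in this example) and that the verdict is stored at .status.experimentStatus.verdict. The wait_for_verdict helper is hypothetical, not part of Litmus.

```shell
# Poll a ChaosResult until its verdict is final.
# Usage: wait_for_verdict <namespace> <chaosresult-name> [max_attempts] [sleep_seconds]
# Returns 0 on Pass, 1 on Fail or Stopped, 2 if no final verdict appears in time.
wait_for_verdict() {
  local ns=$1
  local name=$2
  local attempts=${3:-60}
  local pause=${4:-5}
  local verdict=""
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    # Assumed field path; the verdict reads "Awaited" while chaos is still running
    verdict=$(kubectl get chaosresult "$name" -n "$ns" \
      -o jsonpath='{.status.experimentStatus.verdict}' 2>/dev/null)
    case "$verdict" in
      Pass) echo "Pass"; return 0 ;;
      Fail|Stopped) echo "$verdict"; return 1 ;;
    esac
    i=$((i + 1))
    sleep "$pause"
  done
  echo "Timeout"
  return 2
}
```

Calling wait_for_verdict your-app-namespace nginx-chaos-engine-pod-delete after applying the engine prints the final verdict and sets an exit code a script can branch on.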

Network Latency Experiment

The pod-network-latency experiment injects TC (traffic control) rules into the pod's network namespace to simulate slow network links. This is essential for testing timeout configurations and circuit breaker behavior.

# Install the experiment definition (URL quoted so the shell does not glob on "?")
kubectl apply -f "https://hub.litmuschaos.io/api/chaos/3.4.0?file=charts/generic/pod-network-latency/experiment.yaml" \
  -n your-app-namespace

# engine-network-latency.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: network-latency-engine
  namespace: your-app-namespace
spec:
  appinfo:
    appns: your-app-namespace
    applabel: "app=payment-service"
    appkind: deployment
  chaosServiceAccount: pod-network-latency-sa
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "120"
            - name: NETWORK_LATENCY
              value: "2000"         # 2000ms latency
            - name: JITTER
              value: "200"          # ±200ms jitter
            - name: CONTAINER_RUNTIME
              value: "containerd"
            - name: SOCKET_PATH
              value: "/run/containerd/containerd.sock"
            - name: DESTINATION_IPS
              value: ""             # No IP filter here; if both filters are empty, all traffic is affected
            - name: DESTINATION_HOSTS
              value: "database-service"  # Only affect traffic to this host
        probe:
          - name: "payment-latency-probe"
            type: httpProbe
            httpProbe/inputs:
              url: "http://payment-service.your-app-namespace.svc.cluster.local/health"
              responseTimeout: 5000  # 5s — experiment should still pass with 2s latency added
              method:
                get:
                  criteria: "=="
                  responseCode: "200"
            mode: Continuous
            runProperties:
              probeTimeout: 8
              interval: 5
              retry: 3

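Notice the DESTINATION_HOSTS parameter. Targeting a specific hostname rather than all traffic is how you control the blast radius: you can simulate "the database is slow" without affecting the rest of the pod's network traffic. The engine also assumes that a pod-network-latency-sa service account already exists in the namespace; create it beforehand, following the same pattern as pod-delete-sa.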

Scheduling with ChaosSchedule

Running experiments manually is useful for GameDays but does not give you continuous confidence. ChaosSchedule wraps a ChaosEngine template in a recurring schedule:

# schedule-pod-delete.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx-pod-delete
  namespace: your-app-namespace
spec:
  schedule:
    # now, once, and repeat are alternative modes; only repeat is set here
    repeat:
      timeRange:
        startTime: "2024-01-01T09:00:00Z"
        endTime: "2025-12-31T17:00:00Z"
      properties:
        minChaosInterval: "2h"    # At least 2 hours between runs
      workDays:
        includedDays: "Mon,Tue,Wed,Thu,Fri"
  engineTemplateSpec:
    appinfo:
      appns: your-app-namespace
      applabel: "app=nginx"
      appkind: deployment
    chaosServiceAccount: pod-delete-sa
    experiments:
      - name: pod-delete
        spec:
          components:
            env:
              - name: TOTAL_CHAOS_DURATION
                value: "30"
              - name: CHAOS_INTERVAL
                value: "10"
              - name: PODS_AFFECTED_PERC
                value: "25"

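This schedule runs pod-delete on weekdays within the configured date range, leaving at least 2 hours between runs. The timeRange bounds the dates only; to confine runs to business hours, add a workHours.includedHours range under repeat. Results accumulate in ChaosResult resources, and the Litmus Portal aggregates them into trend dashboards.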

Probes: Defining Steady-State

Probes are the mechanism that defines what "healthy" looks like during and after chaos. LitmusChaos supports four probe types:

httpProbe checks an HTTP endpoint. Use this for services that expose a health endpoint.

cmdProbe runs a shell command inside the cluster and checks the exit code or output. Useful for database connectivity checks or custom business logic validation.

k8sProbe checks the state of a Kubernetes resource. For example, verifying that a Deployment has at least N ready replicas throughout the experiment.

promProbe queries a Prometheus metric and checks it against a threshold. This is the most powerful probe for validating SLO-level behavior.

# Prometheus probe example
probe:
  - name: "error-rate-probe"
    type: promProbe
    promProbe/inputs:
      endpoint: "http://prometheus.monitoring.svc.cluster.local:9090"
      query: |
        sum(rate(http_requests_total{status=~"5.."}[1m])) /
        sum(rate(http_requests_total[1m]))
      comparator:
        type: float
        criteria: "<="
        value: "0.05"    # Error rate must stay at or below 5%
    mode: Continuous
    runProperties:
      probeTimeout: 10
      interval: 15
      retry: 2
      probePollingInterval: 5

This probe continuously queries Prometheus during the experiment and fails the ChaosResult if the HTTP error rate exceeds 5%. This directly measures SLO impact rather than just checking if the service is reachable.
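
Before wiring a query into a probe, it can help to dry-run it by hand. The sketch below assumes curl and jq are installed and that Prometheus is reachable at the endpoint used in the probe. It calls the Prometheus instant-query HTTP API (/api/v1/query) and repeats the probe's "<=" comparison locally. The prom_value and within_threshold helpers are hypothetical names for illustration.

```shell
# Run an instant query against the Prometheus HTTP API and print the first
# sample value. Requires curl and jq.
# Usage: prom_value <endpoint> <promql>
prom_value() {
  local endpoint=$1
  local query=$2
  curl -sG "$endpoint/api/v1/query" --data-urlencode "query=$query" \
    | jq -r '.data.result[0].value[1]'
}

# Return 0 when value <= threshold. awk handles the float comparison,
# mirroring the probe's comparator (type: float, criteria: "<=").
# Usage: within_threshold <value> <threshold>
within_threshold() {
  awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 <= t + 0) }'
}
```

For example, within_threshold "$(prom_value http://prometheus.monitoring.svc.cluster.local:9090 "$QUERY")" 0.05 succeeds only while the error rate is at or below 5%, where QUERY holds the same expression as the probe.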

Observability with Litmus Portal

The Litmus Portal provides several observability features worth enabling:

Experiment Analytics shows pass/fail trends over time per experiment and per application. It is where you detect gradual degradation—an experiment that passed reliably for three months but started failing two weeks ago signals a regression introduced in a recent deploy.

Chaos Dashboard overlays chaos experiment timing onto application metrics from Prometheus/Grafana. This makes it trivial to correlate "chaos was injected at 14:32" with "p99 latency spiked at 14:33 and recovered at 14:36."

Workflow Builder provides a visual editor for composing multi-step chaos workflows where experiments run sequentially or in parallel with pre- and post-conditions.

To connect Prometheus to the portal:

kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: litmus-portal-prometheus-config
  namespace: litmus
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'chaos-exporter'
        static_configs:
          - targets: ['chaos-exporter.litmus.svc.cluster.local:8080']
EOF

The chaos-exporter service (installed with the operator) exposes Prometheus metrics including experiment pass/fail rates, chaos injection duration, and probe verdict counts.
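
A quick way to confirm the exporter is serving data is to fetch its /metrics endpoint and count the chaos samples. This is a sketch under one assumption: that the exporter's metric names start with litmuschaos_. Check the raw output and adjust the prefix if your version differs. The count_chaos_metrics helper is hypothetical.

```shell
# Count metric samples whose name starts with the given prefix.
# Reads Prometheus text exposition format on stdin; "# HELP" and "# TYPE"
# comment lines begin with "#" and are therefore not counted.
# Usage: curl -s http://localhost:8080/metrics | count_chaos_metrics [prefix]
count_chaos_metrics() {
  local prefix=${1:-litmuschaos_}   # assumed metric name prefix
  # grep -c prints 0 and exits 1 when nothing matches; keep the count, drop the error
  grep -c "^${prefix}" || true
}
```

With kubectl port-forward svc/chaos-exporter 8080:8080 -n litmus running in another terminal, a non-zero count indicates that chaos metrics are being exposed.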

Advanced: Writing Custom Experiments

When the pre-built experiments do not cover your failure mode, you can write custom experiments. A LitmusChaos experiment is a Kubernetes Job with the chaos logic in a container. The experiment definition declares the parameters it accepts:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: custom-disk-fill
  namespace: your-app-namespace
spec:
  definition:
    scope: Namespaced
    permissions:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["get", "list"]
    image: "your-registry/custom-chaos:latest"
    imagePullPolicy: Always
    args:
      - -c
      - ./disk-fill
    command:
      - /bin/bash
    env:
      - name: TOTAL_CHAOS_DURATION
        value: "60"
      - name: FILL_PERCENTAGE
        value: "80"
    labels:
      name: custom-disk-fill

Your container image runs the chaos logic and exits 0 for success or non-zero for failure. The Litmus operator handles everything else: scheduling, result recording, probe execution, and cleanup.
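
To make the contract concrete, here is a minimal sketch of what the ./disk-fill entrypoint inside the image could look like. It is an illustration, not the Litmus disk-fill implementation. It assumes FILL_PERCENTAGE means "raise usage of the volume to this percentage of capacity", and FILL_DIR is a hypothetical extra parameter naming the directory to fill.

```shell
#!/bin/bash
# Hypothetical ./disk-fill entrypoint for the custom-chaos image.

# Print how many KiB must be written so usage reaches <percent> of capacity.
# Prints 0 if the volume is already at or above the target.
kib_to_fill() {
  local capacity_kib=$1
  local used_kib=$2
  local percent=$3
  local target_kib=$(( capacity_kib * percent / 100 ))
  if [ "$target_kib" -gt "$used_kib" ]; then
    echo $(( target_kib - used_kib ))
  else
    echo 0
  fi
}

run_disk_fill() {
  local dir=${FILL_DIR:-/tmp}          # assumption: directory on the volume to fill
  local filler="$dir/chaos-fill.bin"
  local need
  # df -P -k: POSIX output format, sizes in KiB; column 2 = total, column 3 = used
  set -- $(df -P -k "$dir" | awk 'NR == 2 { print $2, $3 }')
  need=$(kib_to_fill "$1" "$2" "$FILL_PERCENTAGE")
  # Expand the path now so cleanup still works after this function returns
  trap "rm -f '$filler'" EXIT
  if [ "$need" -gt 0 ]; then
    dd if=/dev/zero of="$filler" bs=1024 count="$need" 2>/dev/null || return 1
  fi
  sleep "$TOTAL_CHAOS_DURATION"        # hold the pressure for the chaos window
}

# Inject only when both parameters are provided through the experiment env.
if [ -n "${TOTAL_CHAOS_DURATION:-}" ] && [ -n "${FILL_PERCENTAGE:-}" ]; then
  run_disk_fill
fi
```

The script exits 0 when the fill and wait complete and non-zero when the write fails, which matches the success and failure contract described above. The EXIT trap removes the filler file so the space is released even if the container is interrupted.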

Best Practices for LitmusChaos

Start with game days, not schedules. Run experiments manually during a planned session with the full team present before automating them. This gives everyone a chance to understand what the experiment does and to observe the system response together.

One experiment per engine. Although a single ChaosEngine can list multiple experiments, keep engines focused. It makes it easier to correlate which experiment caused which effect.

Use jobCleanUpPolicy: retain during development. The default is delete, which removes runner pods immediately after the experiment. Retaining them lets you inspect logs for debugging. Switch to delete for automated scheduled runs.

Tag your chaos resources. Add labels to ChaosEngine resources indicating the team, application, and severity so you can filter results in the portal.

Integrate results into deployment gates. A ChaosResult resource with Verdict: Fail can be detected in CI and block a deployment. Use kubectl get chaosresult in your pipeline to check experiment health before promoting to production.
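
A deployment gate along these lines can be sketched in a few lines of shell. The example assumes the verdict is stored at .status.experimentStatus.verdict on each ChaosResult; list_verdicts and gate_on_verdicts are hypothetical helper names.

```shell
# Print one "name verdict" pair per ChaosResult in the namespace.
# Usage: list_verdicts <namespace>
list_verdicts() {
  kubectl get chaosresult -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.experimentStatus.verdict}{"\n"}{end}'
}

# Read "name verdict" pairs on stdin and return non-zero if any verdict is
# not Pass (Fail, Stopped, Awaited, or missing all block the gate).
gate_on_verdicts() {
  local failed=0
  local name verdict
  while read -r name verdict; do
    [ -z "$name" ] && continue
    if [ "$verdict" != "Pass" ]; then
      echo "chaos gate: $name verdict=${verdict:-unknown}" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

In a pipeline step, list_verdicts your-app-namespace | gate_on_verdicts exits non-zero when any experiment has not passed, which most CI systems treat as a failed step. Treating Awaited as blocking is a deliberate choice here: a run that has not finished is not evidence of resilience.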

LitmusChaos gives Kubernetes-native teams a complete chaos engineering stack without leaving the Kubernetes API. By expressing failure scenarios as declarative resources, experiments become version-controlled, reviewable, and reproducible—the same properties we expect from infrastructure-as-code, now applied to resilience testing.
