Chaos Toolkit: Open-Source Chaos Engineering Getting Started Guide

Chaos Toolkit is an open-source chaos engineering tool that defines experiments as declarative JSON or YAML files. Each experiment specifies a steady-state hypothesis (what "normal" looks like), a set of fault-injection actions, and rollback activities. Chaos Toolkit verifies your hypothesis before chaos, injects faults, then verifies the hypothesis again — reporting whether your system maintained its steady state through the disruption.

The Chaos Toolkit Philosophy

Chaos Toolkit separates concerns clearly:

Steady-state hypothesis: What does normal look like? Before running any chaos, verify the system is in steady state. If it's already degraded, the experiment is invalid. After chaos, re-verify steady state — if it's different, chaos revealed a weakness.

Method: The fault injection. What you're going to break.

Rollback: What to restore afterward. Always runs, even if the experiment fails.

Probes vs Actions: Probes read state (HTTP calls, file reads, process checks). Actions change state (kill processes, inject latency, delete resources). Experiments combine both.

This structure makes experiments readable and auditable — a non-technical stakeholder can understand what each experiment does.

Installation

pip install chaostoolkit

# Verify
chaos --version

Install extensions for your infrastructure:

# Kubernetes support
pip install chaostoolkit-kubernetes

# AWS support
pip install chaostoolkit-aws

# Spring Boot actuator support
pip install chaostoolkit-spring

# Prometheus support
pip install chaostoolkit-prometheus

# Wiremock support (for API chaos)
pip install chaostoolkit-wiremock

Your First Experiment

A minimal experiment that verifies a web service stays accessible while a pod is deleted:

{
  "version": "1.0.0",
  "title": "Web service remains available during pod restart",
  "description": "Verify the web service continues to serve requests when a pod is deleted",
  "tags": ["kubernetes", "availability"],
  "steady-state-hypothesis": {
    "title": "Service is accessible",
    "probes": [
      {
        "type": "probe",
        "name": "service-responds-with-200",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "http://my-app.default.svc.cluster.local/health",
          "timeout": 3
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "kill-random-pod",
      "provider": {
        "type": "python",
        "module": "chaosk8s.pod.actions",
        "func": "terminate_pods",
        "arguments": {
          "label_selector": "app=my-app",
          "ns": "default",
          "rand": true
        }
      }
    },
    {
      "type": "probe",
      "name": "wait-for-pod-to-restart",
      "provider": {
        "type": "process",
        "path": "sleep",
        "arguments": "5"
      }
    }
  ],
  "rollbacks": []
}

Run it:

chaos run experiment.json

Output:

[2024-01-15 10:00:00 INFO] Validating the experiment's syntax
[2024-01-15 10:00:00 INFO] Running experiment: Web service remains available during pod restart
[2024-01-15 10:00:00 INFO] Steady-state strategy: default
[2024-01-15 10:00:00 INFO] Executing steady-state hypothesis: Service is accessible
[2024-01-15 10:00:00 INFO] Probe: service-responds-with-200
[2024-01-15 10:00:00 INFO] Steady state hypothesis is met!
[2024-01-15 10:00:00 INFO] Playing your experiment's method now...
[2024-01-15 10:00:00 INFO] Action: kill-random-pod
[2024-01-15 10:00:05 INFO] Probe: wait-for-pod-to-restart
[2024-01-15 10:00:05 INFO] Verifying the experiment's steady-state hypothesis...
[2024-01-15 10:00:05 INFO] Probe: service-responds-with-200
[2024-01-15 10:00:05 CRITICAL] Steady state probe 'service-responds-with-200' is not in the expected state: HTTP call returned status 503

[2024-01-15 10:00:05 CRITICAL] Failed to run experiment 'Web service remains available during pod restart'

The experiment found a real problem: five seconds after the pod was deleted, the service was still returning 503. Your application likely needs a second replica and a properly configured PodDisruptionBudget.
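
As a starting point for a fix, here is a minimal PodDisruptionBudget sketch, assuming a Deployment labelled app=my-app (the name and labels are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app
```

Keep in mind that a PDB only guards voluntary disruptions such as node drains and evictions; surviving a direct pod deletion like the one in this experiment requires at least two replicas behind the Service.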

Steady-State Hypothesis in Depth

The hypothesis is both a precondition and a postcondition. It must pass before chaos runs and after chaos runs. Each probe has a tolerance that defines what "healthy" means:

HTTP status code:

{
  "type": "probe",
  "name": "api-returns-200",
  "tolerance": 200,
  "provider": {
    "type": "http",
    "url": "http://api/health"
  }
}

HTTP response body match:

{
  "type": "probe",
  "name": "api-reports-healthy",
  "tolerance": {
    "type": "jsonpath",
    "path": "$.status",
    "expect": "healthy"
  },
  "provider": {
    "type": "http",
    "url": "http://api/health",
    "timeout": 3
  }
}

Process exit code:

{
  "type": "probe",
  "name": "service-process-running",
  "tolerance": 0,
  "provider": {
    "type": "process",
    "path": "pgrep",
    "arguments": "my-service"
  }
}

Prometheus metric value:

{
  "type": "probe",
  "name": "error-rate-below-threshold",
  "tolerance": {
    "type": "range",
    "range": [0, 0.01]
  },
  "provider": {
    "type": "python",
    "module": "chaosprometheus.probes",
    "func": "query",
    "arguments": {
      "query": "sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m]))",
      "when": "now"
    }
  }
}
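
Conceptually, tolerance matching works like the hypothetical helper below. This is a sketch for intuition only, not chaoslib's actual implementation, and it covers just the scalar and range cases shown above:

```python
def within_tolerance(tolerance, value):
    """Sketch: match a probe's return value against its declared tolerance."""
    if isinstance(tolerance, bool):
        # Boolean tolerance: the probe must return exactly True or False.
        return value is tolerance
    if isinstance(tolerance, (int, float, str)):
        # Scalar tolerance: exact match (e.g. HTTP status 200, exit code 0).
        return value == tolerance
    if isinstance(tolerance, dict) and tolerance.get("type") == "range":
        # Range tolerance: value must fall inside [low, high].
        low, high = tolerance["range"]
        return low <= value <= high
    raise ValueError(f"unsupported tolerance: {tolerance!r}")

within_tolerance(200, 200)                                 # -> True
within_tolerance(200, 503)                                 # -> False
within_tolerance({"type": "range", "range": [0, 0.01]}, 0.002)  # -> True
```

Note the bool check comes first: in Python `isinstance(True, int)` is true, so a boolean tolerance would otherwise be swallowed by the scalar branch.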

Kubernetes Experiments

The chaostoolkit-kubernetes extension provides actions and probes for Kubernetes:

Available actions:

# Terminate pods
chaosk8s.pod.actions.terminate_pods(label_selector, ns, rand=False, grace_period=-1)

# Scale deployment
chaosk8s.deployment.actions.scale_deployment(name, replicas, ns)

# Drain nodes
chaosk8s.node.actions.drain_nodes(label_selector, delete_pods_with_local_storage=False)

Network chaos (this example shells out to tc, so the container image must include tc and run with the NET_ADMIN capability; a proxy such as Toxiproxy is an alternative):

{
  "type": "action",
  "name": "add-network-latency",
  "provider": {
    "type": "process",
    "path": "kubectl",
    "arguments": "exec deployment/my-app -- tc qdisc add dev eth0 root netem delay 500ms"
  }
}
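
If you inject latency this way, pair it with a rollback that removes the qdisc; otherwise the latency outlives the experiment:

```json
{
  "type": "action",
  "name": "remove-network-latency",
  "provider": {
    "type": "process",
    "path": "kubectl",
    "arguments": "exec deployment/my-app -- tc qdisc del dev eth0 root"
  }
}
```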

Full Kubernetes experiment:

{
  "version": "1.0.0",
  "title": "Service survives database pod restart",
  "steady-state-hypothesis": {
    "title": "API and database are accessible",
    "probes": [
      {
        "type": "probe",
        "name": "api-healthy",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "http://api.default.svc/health",
          "timeout": 5
        }
      },
      {
        "type": "probe",
        "name": "db-pods-running",
        "tolerance": 1,
        "provider": {
          "type": "python",
          "module": "chaosk8s.pod.probes",
          "func": "pods_in_phase",
          "arguments": {
            "label_selector": "app=postgres",
            "phase": "Running",
            "ns": "default"
          }
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "terminate-database-pod",
      "provider": {
        "type": "python",
        "module": "chaosk8s.pod.actions",
        "func": "terminate_pods",
        "arguments": {
          "label_selector": "app=postgres",
          "ns": "default",
          "rand": true,
          "grace_period": 0
        }
      }
    }
  ],
  "rollbacks": []
}

AWS Experiments

The chaostoolkit-aws extension integrates with AWS services:

{
  "version": "1.0.0",
  "title": "Application survives AZ failure",
  "steady-state-hypothesis": {
    "title": "Application serves traffic",
    "probes": [
      {
        "type": "probe",
        "name": "alb-serving-requests",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "https://api.example.com/health",
          "timeout": 5
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "stop-instances-in-az",
      "provider": {
        "type": "python",
        "module": "chaosaws.ec2.actions",
        "func": "stop_instances",
        "arguments": {
          "filters": [
            {"Name": "tag:Environment", "Values": ["staging"]},
            {"Name": "availability-zone", "Values": ["us-east-1a"]}
          ]
        }
      }
    },
    {
      "type": "probe",
      "name": "wait-30-seconds",
      "provider": {
        "type": "process",
        "path": "sleep",
        "arguments": "30"
      }
    }
  ],
  "rollbacks": [
    {
      "type": "action",
      "name": "restart-stopped-instances",
      "provider": {
        "type": "python",
        "module": "chaosaws.ec2.actions",
        "func": "start_instances",
        "arguments": {
          "filters": [
            {"Name": "tag:Environment", "Values": ["staging"]},
            {"Name": "availability-zone", "Values": ["us-east-1a"]}
          ]
        }
      }
    }
  ]
}

Note the rollbacks section — it starts stopped instances even if the experiment fails. Rollbacks ensure your infrastructure returns to a clean state.

Writing Custom Actions and Probes

Custom probes are Python functions:

# custom_probes.py
import requests

def check_order_creation_rate(threshold: float, url: str) -> bool:
    """Check that orders per minute exceeds threshold."""
    response = requests.get(f"{url}/metrics/order-rate")
    current_rate = response.json()["orders_per_minute"]
    return current_rate >= threshold

Reference in your experiment:

{
  "type": "probe",
  "name": "orders-flowing",
  "tolerance": true,
  "provider": {
    "type": "python",
    "module": "custom_probes",
    "func": "check_order_creation_rate",
    "arguments": {
      "threshold": 100,
      "url": "http://api-gateway"
    }
  }
}
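
Custom actions are written the same way: plain Python functions referenced by module and function name. The example below is hypothetical (the module and function names are illustrative); it shows the common pattern of an action returning data that a rollback can use to restore state:

```python
# custom_actions.py -- hypothetical example module
import pathlib

def truncate_config_file(path: str) -> str:
    """Simulate a bad deploy by emptying a config file.

    Returns the original content so a rollback action can restore it.
    """
    p = pathlib.Path(path)
    original = p.read_text()
    p.write_text("")
    return original

def restore_config_file(path: str, content: str) -> None:
    """Rollback counterpart: write the saved content back."""
    pathlib.Path(path).write_text(content)
```

In the experiment, the action goes in "method" and the restore function in "rollbacks", each with a python provider pointing at custom_actions.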

YAML Format

JSON experiments can be verbose. Chaos Toolkit also accepts YAML:

version: "1.0.0"
title: Service survives network partition
description: Verify service handles 2-second network latency gracefully

steady-state-hypothesis:
  title: API responds within 1 second
  probes:
    - type: probe
      name: api-fast-response
      tolerance:
        type: range
        range: [200, 299]
      provider:
        type: http
        url: http://api/health
        timeout: 1

method:
  - type: action
    name: add-network-latency
    provider:
      type: process
      path: kubectl
      arguments: "exec -n default deploy/my-app -- tc qdisc add dev eth0 root netem delay 2000ms"

  - type: probe
    name: check-under-chaos
    provider:
      type: http
      url: http://api/health
      timeout: 5

rollbacks:
  - type: action
    name: remove-network-latency
    provider:
      type: process
      path: kubectl
      arguments: "exec -n default deploy/my-app -- tc qdisc del dev eth0 root"

CI/CD Integration

# .github/workflows/chaos.yml
name: Chaos Engineering Tests
on:
  schedule:
    - cron: '0 9 * * 3'  # Wednesdays at 9am

jobs:
  chaos:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
          
      - name: Install Chaos Toolkit
        run: |
          pip install chaostoolkit chaostoolkit-kubernetes
          
      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          kubeconfig: ${{ secrets.KUBECONFIG_STAGING }}
          
      - name: Run experiments
        run: |
          for experiment in chaos-experiments/*.json; do
            echo "Running: $experiment"
            chaos run $experiment || {
              echo "EXPERIMENT FAILED: $experiment"
              exit 1
            }
          done
          
      - name: Upload journal
        uses: actions/upload-artifact@v3
        if: always()
        with:
          name: chaos-journals
          path: "*.json"

Reading the Chaos Journal

Chaos Toolkit writes a journal file (journal.json by default) with the full experiment run:

chaos run experiment.json --journal-path my-journal.json

Generate a human-readable report:

pip install chaostoolkit-reporting
chaos report --export-format=pdf my-journal.json chaos-report.pdf

The journal captures every probe and action with timestamps, inputs, and outputs — a complete record of what happened and when.
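
To check results programmatically, you can read the journal's top-level fields. This sketch assumes the journal exposes "status" and "deviated" at the top level (treat the exact field names as an assumption to verify against your journal files):

```python
import json

def summarize_journal(path: str) -> dict:
    """Pull the headline result out of a Chaos Toolkit journal file."""
    with open(path) as f:
        journal = json.load(f)
    return {
        "title": journal.get("experiment", {}).get("title"),
        "status": journal.get("status"),      # e.g. "completed" or "failed"
        "deviated": journal.get("deviated"),  # True if the hypothesis failed after chaos
    }
```

A CI step could call this and fail the build whenever "deviated" is true, rather than parsing log output.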

Chaos Toolkit vs Gremlin vs LitmusChaos

Chaos Toolkit: Open-source, experiment-as-code, extensible via Python. No SaaS required. Best for teams that want maximum control and want experiments version-controlled alongside their application code.

Gremlin: Commercial SaaS with a polished UI. More accessible for teams that don't want to write experiments as code. Powerful but costs money.

LitmusChaos: Open-source, Kubernetes-native, a CNCF incubating project. Best for Kubernetes-heavy teams that want native CRD-based experiments. Chaos Toolkit can integrate with LitmusChaos as a backend.

Chaos Toolkit's strength is its extensibility and its declarative experiment format. When you want chaos experiments to be as reviewable as your infrastructure code — stored in Git, reviewed in PRs, traceable to specific failures — Chaos Toolkit's approach fits naturally.
