Chaos Toolkit: Open-Source Chaos Engineering Getting Started Guide
Chaos Toolkit is an open-source chaos engineering tool that defines experiments as declarative JSON or YAML files. Each experiment specifies a steady-state hypothesis (what "normal" looks like), a set of fault-injection actions, and rollback activities. Chaos Toolkit verifies your hypothesis before chaos, injects faults, then verifies the hypothesis again — reporting whether your system maintained its steady state through the disruption.
The Chaos Toolkit Philosophy
Chaos Toolkit separates concerns clearly:
Steady-state hypothesis: What does normal look like? Before running any chaos, verify the system is in steady state. If it's already degraded, the experiment is invalid. After chaos, re-verify steady state — if it's different, chaos revealed a weakness.
Method: The fault injection. What you're going to break.
Rollback: What to restore afterward. Always runs, even if the experiment fails.
Probes vs Actions: Probes read state (HTTP calls, file reads, process checks). Actions change state (kill processes, inject latency, delete resources). Experiments combine both.
This structure makes experiments readable and auditable — a non-technical stakeholder can understand what each experiment does.
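Conceptually, the run loop those pieces form can be sketched in a few lines of Python. This is an illustrative sketch, not the toolkit's actual runner; the `service` dict and the probe/action callables are invented for the example:

```python
# Simplified sketch of the Chaos Toolkit experiment lifecycle:
# verify steady state, run the method, re-verify, always roll back.
# The "service" dict and all callables below are illustrative only.

def run_experiment(hypothesis, method, rollbacks):
    if not hypothesis():                  # precondition: system must be healthy
        return "aborted: steady state not met before chaos"
    try:
        for activity in method:           # inject the faults
            activity()
        if not hypothesis():              # postcondition: did we stay healthy?
            return "deviated: steady state broken by chaos"
        return "completed"
    finally:
        for rollback in rollbacks:        # rollbacks always run
            rollback()

service = {"replicas": 2, "healthy": True}

def service_is_up():                      # probe: reads state only
    return service["replicas"] > 0 and service["healthy"]

def kill_replica():                       # action: changes state
    service["replicas"] -= 1
    service["healthy"] = service["replicas"] > 0

def restore_replicas():                   # rollback: restore the original state
    service["replicas"] = 2
    service["healthy"] = True

result = run_experiment(service_is_up, [kill_replica], [restore_replicas])
# With two replicas, killing one leaves the service healthy -> "completed"
```

With a single replica the same experiment would deviate: the post-chaos hypothesis check fails, yet the rollback still restores the replicas, mirroring the guarantee described above.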
Installation
pip install chaostoolkit
# Verify
chaos --version
Install extensions for your infrastructure:
# Kubernetes support
pip install chaostoolkit-kubernetes
# AWS support
pip install chaostoolkit-aws
# Spring Boot actuator support
pip install chaostoolkit-spring
# Prometheus support
pip install chaostoolkit-prometheus
# WireMock support (for API chaos)
pip install chaostoolkit-wiremock
Your First Experiment
A minimal experiment that verifies a web service is accessible during a simulated pod deletion:
{
  "version": "1.0.0",
  "title": "Web service remains available during pod restart",
  "description": "Verify the web service continues to serve requests when a pod is deleted",
  "tags": ["kubernetes", "availability"],
  "steady-state-hypothesis": {
    "title": "Service is accessible",
    "probes": [
      {
        "type": "probe",
        "name": "service-responds-with-200",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "http://my-app.default.svc.cluster.local/health",
          "timeout": 3
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "kill-random-pod",
      "provider": {
        "type": "python",
        "module": "chaosk8s.pod.actions",
        "func": "terminate_pods",
        "arguments": {
          "label_selector": "app=my-app",
          "ns": "default",
          "rand": true
        }
      }
    },
    {
      "type": "probe",
      "name": "wait-for-pod-to-restart",
      "provider": {
        "type": "process",
        "path": "sleep",
        "arguments": "5"
      }
    }
  ],
  "rollbacks": []
}
Run it:
chaos run experiment.json
Output:
[2024-01-15 10:00:00 INFO] Validating the experiment's syntax
[2024-01-15 10:00:00 INFO] Running experiment: Web service remains available during pod restart
[2024-01-15 10:00:00 INFO] Steady-state strategy: default
[2024-01-15 10:00:00 INFO] Executing steady-state hypothesis: Service is accessible
[2024-01-15 10:00:00 INFO] Probe: service-responds-with-200
[2024-01-15 10:00:00 INFO] Steady state hypothesis is met!
[2024-01-15 10:00:00 INFO] Playing your experiment's method now...
[2024-01-15 10:00:00 INFO] Action: kill-random-pod
[2024-01-15 10:00:05 INFO] Probe: wait-for-pod-to-restart
[2024-01-15 10:00:05 INFO] Verifying the experiment's steady-state hypothesis...
[2024-01-15 10:00:05 INFO] Probe: service-responds-with-200
[2024-01-15 10:00:05 CRITICAL] Steady state probe 'service-responds-with-200' is not in the expected state: HTTP call returned status 503
[2024-01-15 10:00:05 CRITICAL] Failed to run experiment 'Web service remains available during pod restart'
The experiment found a real problem: deleting the pod causes roughly 5 seconds of 503s while a replacement starts. Your application might need a second replica and a properly configured PodDisruptionBudget.
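As an illustration of that fix, a PodDisruptionBudget for this deployment might look like the following (the name and namespace are assumptions matching the experiment's `app=my-app` selector). Note that a PDB guards against voluntary disruptions such as node drains; surviving a direct pod deletion still requires more than one replica:

```yaml
# Hypothetical PDB matching the experiment's app=my-app selector
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
  namespace: default
spec:
  minAvailable: 1        # keep at least one pod during voluntary disruptions
  selector:
    matchLabels:
      app: my-app
```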
Steady-State Hypothesis in Depth
The hypothesis is both a precondition and a postcondition. It must pass before chaos runs and after chaos runs. Each probe has a tolerance that defines what "healthy" means:
HTTP status code:
{
  "type": "probe",
  "name": "api-returns-200",
  "tolerance": 200,
  "provider": {
    "type": "http",
    "url": "http://api/health"
  }
}
HTTP response body match:
{
  "type": "probe",
  "name": "api-reports-healthy",
  "tolerance": {
    "type": "jsonpath",
    "path": "$.status",
    "expect": "healthy"
  },
  "provider": {
    "type": "http",
    "url": "http://api/health",
    "timeout": 3
  }
}
Process exit code:
{
  "type": "probe",
  "name": "service-process-running",
  "tolerance": 0,
  "provider": {
    "type": "process",
    "path": "pgrep",
    "arguments": "my-service"
  }
}
Prometheus metric value:
{
  "type": "probe",
  "name": "error-rate-below-threshold",
  "tolerance": {
    "type": "range",
    "range": [0, 0.01]
  },
  "provider": {
    "type": "python",
    "module": "chaosprometheus.probes",
    "func": "query",
    "arguments": {
      "query": "sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m]))",
      "when": "now"
    }
  }
}
Kubernetes Experiments
The chaostoolkit-kubernetes extension provides actions and probes for Kubernetes:
Available actions:
# Terminate pods
chaosk8s.pod.actions.terminate_pods(label_selector, ns, rand=False, grace_period=-1)
# Scale deployment
chaosk8s.deployment.actions.scale_deployment(name, replicas, ns)
# Drain nodes
chaosk8s.node.actions.drain_nodes(label_selector, delete_pods_with_local_storage=False)
Network chaos (the example below shells into a pod and runs tc netem, which requires the tc binary in the container image and the NET_ADMIN capability):
{
  "type": "action",
  "name": "add-network-latency",
  "provider": {
    "type": "process",
    "path": "kubectl",
    "arguments": "exec deployment/my-app -- tc qdisc add dev eth0 root netem delay 500ms"
  }
}
Full Kubernetes experiment:
{
  "version": "1.0.0",
  "title": "Service survives database pod restart",
  "steady-state-hypothesis": {
    "title": "API and database are accessible",
    "probes": [
      {
        "type": "probe",
        "name": "api-healthy",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "http://api.default.svc/health",
          "timeout": 5
        }
      },
      {
        "type": "probe",
        "name": "db-pods-running",
        "tolerance": true,
        "provider": {
          "type": "python",
          "module": "chaosk8s.pod.probes",
          "func": "pods_in_phase",
          "arguments": {
            "label_selector": "app=postgres",
            "phase": "Running",
            "ns": "default"
          }
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "terminate-database-pod",
      "provider": {
        "type": "python",
        "module": "chaosk8s.pod.actions",
        "func": "terminate_pods",
        "arguments": {
          "label_selector": "app=postgres",
          "ns": "default",
          "rand": true,
          "grace_period": 0
        }
      }
    }
  ],
  "rollbacks": []
}
AWS Experiments
The chaostoolkit-aws extension integrates with AWS services:
{
  "version": "1.0.0",
  "title": "Application survives AZ failure",
  "steady-state-hypothesis": {
    "title": "Application serves traffic",
    "probes": [
      {
        "type": "probe",
        "name": "alb-serving-requests",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "https://api.example.com/health",
          "timeout": 5
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "stop-instances-in-az",
      "provider": {
        "type": "python",
        "module": "chaosaws.ec2.actions",
        "func": "stop_instances",
        "arguments": {
          "filters": [
            {"Name": "tag:Environment", "Values": ["staging"]},
            {"Name": "availability-zone", "Values": ["us-east-1a"]}
          ]
        }
      }
    },
    {
      "type": "probe",
      "name": "wait-30-seconds",
      "provider": {
        "type": "process",
        "path": "sleep",
        "arguments": "30"
      }
    }
  ],
  "rollbacks": [
    {
      "type": "action",
      "name": "restart-stopped-instances",
      "provider": {
        "type": "python",
        "module": "chaosaws.ec2.actions",
        "func": "start_instances",
        "arguments": {
          "filters": [
            {"Name": "tag:Environment", "Values": ["staging"]},
            {"Name": "availability-zone", "Values": ["us-east-1a"]}
          ]
        }
      }
    }
  ]
}
Note the rollbacks section: it starts the stopped instances even if the experiment fails. Rollbacks ensure your infrastructure returns to a clean state.
Writing Custom Actions and Probes
Custom probes are Python functions:
# custom_probes.py
import requests

def check_order_creation_rate(threshold: float, url: str) -> bool:
    """Check that orders per minute exceeds threshold."""
    response = requests.get(f"{url}/metrics/order-rate")
    current_rate = response.json()["orders_per_minute"]
    return current_rate >= threshold
Reference it in your experiment:
{
  "type": "probe",
  "name": "orders-flowing",
  "tolerance": true,
  "provider": {
    "type": "python",
    "module": "custom_probes",
    "func": "check_order_creation_rate",
    "arguments": {
      "threshold": 100,
      "url": "http://api-gateway"
    }
  }
}
YAML Format
JSON experiments can be verbose. Chaos Toolkit also accepts YAML:
version: "1.0.0"
title: Service survives network partition
description: Verify service handles 2-second network latency gracefully
steady-state-hypothesis:
  title: API responds within 1 second
  probes:
    - type: probe
      name: api-fast-response
      tolerance:
        type: range
        range: [200, 299]
      provider:
        type: http
        url: http://api/health
        timeout: 1
method:
  - type: action
    name: add-network-latency
    provider:
      type: process
      path: kubectl
      arguments: "exec -n default deploy/my-app -- tc qdisc add dev eth0 root netem delay 2000ms"
  - type: probe
    name: check-under-chaos
    provider:
      type: http
      url: http://api/health
      timeout: 5
rollbacks:
  - type: action
    name: remove-network-latency
    provider:
      type: process
      path: kubectl
      arguments: "exec -n default deploy/my-app -- tc qdisc del dev eth0 root"
CI/CD Integration
# .github/workflows/chaos.yml
name: Chaos Engineering Tests
on:
  schedule:
    - cron: '0 9 * * 3' # Wednesdays at 9am
jobs:
  chaos:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install Chaos Toolkit
        run: |
          pip install chaostoolkit chaostoolkit-kubernetes
      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          kubeconfig: ${{ secrets.KUBECONFIG_STAGING }}
      - name: Run experiments
        run: |
          for experiment in chaos-experiments/*.json; do
            echo "Running: $experiment"
            chaos run "$experiment" || {
              echo "EXPERIMENT FAILED: $experiment"
              exit 1
            }
          done
      - name: Upload journal
        uses: actions/upload-artifact@v3
        if: always()
        with:
          name: chaos-journals
          path: "*.json"
Reading the Chaos Journal
Chaos Toolkit writes a journal file (journal.json by default) with the full experiment run:
chaos run experiment.json --journal-path my-journal.json
Generate a human-readable report:
pip install chaostoolkit-reporting
chaos report --export-format=pdf my-journal.json chaos-report.pdf
The journal captures every probe and action with timestamps, inputs, and outputs — a complete record of what happened and when.
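Because the journal is plain JSON, it is easy to post-process. A sketch of summarizing one with Python's standard library; the exact journal schema varies between Chaos Toolkit versions, so the `status` and `run` fields below are an assumption based on the common layout, and the sample journal is fabricated for illustration:

```python
import json

# Fabricated journal text standing in for a real journal.json;
# "status" plus a "run" list of activity records mirrors the usual layout,
# but field names can differ between Chaos Toolkit versions.
journal_text = json.dumps({
    "status": "failed",
    "run": [
        {"activity": {"name": "kill-random-pod", "type": "action"},
         "status": "succeeded"},
        {"activity": {"name": "wait-for-pod-to-restart", "type": "probe"},
         "status": "succeeded"},
    ],
})

def summarize(journal: dict) -> str:
    """Render a one-line-per-activity summary of a chaos journal."""
    lines = [f"experiment status: {journal['status']}"]
    for record in journal.get("run", []):
        activity = record["activity"]
        lines.append(f"  {activity['type']} {activity['name']}: {record['status']}")
    return "\n".join(lines)

summary = summarize(json.loads(journal_text))
print(summary)
```

A summary like this is handy as a CI job annotation when the full PDF report is overkill.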
Chaos Toolkit vs Gremlin vs LitmusChaos
Chaos Toolkit: Open-source, experiment-as-code, extensible via Python. No SaaS required. Best for teams that want maximum control and want experiments version-controlled alongside their application code.
Gremlin: Commercial SaaS with a polished UI. More accessible for teams that don't want to write experiments as code. Powerful but costs money.
LitmusChaos: Open-source, Kubernetes-native, a CNCF incubating project. Best for Kubernetes-heavy teams that want native CRD-based experiments. Chaos Toolkit can integrate with LitmusChaos as a backend.
Chaos Toolkit's strength is its extensibility and its declarative experiment format. When you want chaos experiments to be as reviewable as your infrastructure code — stored in Git, reviewed in PRs, traceable to specific failures — Chaos Toolkit's approach fits naturally.