Chaos Engineering in CI/CD: Automating Resilience Validation in Your Pipeline
Most teams run chaos experiments manually — a team gathers, runs an experiment, watches dashboards, and documents findings. This works, but it doesn't scale. Manual chaos experiments happen infrequently, require coordination, and catch problems after code has already been deployed to production.
Chaos engineering in CI/CD changes this. Automated chaos experiments run on every deployment to staging, validating that new code hasn't broken your system's resilience properties before it reaches production.
Why Automate Chaos in CI/CD?
Catch resilience regressions before production. A code change that works functionally may break a circuit breaker implementation or remove a retry mechanism. Automated chaos tests catch this before the change ships.
Continuous validation. Infrastructure changes, dependency updates, and configuration changes can all affect resilience. Automated chaos runs against every change.
Remove the coordination overhead. Manual chaos experiments require scheduling, team coordination, and post-experiment documentation. Once the experiments run in the pipeline, none of that has to be arranged for each run.
Build confidence in deployments. When your pipeline includes automated chaos validation, "did we break resilience?" becomes a question the pipeline answers automatically.
What Belongs in CI/CD vs. What Stays Manual
Not all chaos experiments are suited for CI/CD automation. Choose carefully:
Good for CI/CD automation:
- Predictable, fast experiments (under 10 minutes)
- Isolated to staging/test environments
- Well-defined steady state with automated verification
- Low blast radius — affects only the service under test
- Repeatable with deterministic results
Stay manual:
- GameDay scenarios (complex, multi-team)
- Production chaos (requires human judgment)
- Novel hypothesis exploration
- Long-duration soak experiments
- Experiments with unclear steady state
Start with a small set of automated experiments — 5-10 scenarios that represent your most critical resilience properties. Expand from there.
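One way to enforce the "under 10 minutes" criterion is to lint experiment files before they join the automated set. The sketch below gives only a rough lower bound: it sums the `pauses` declared under `method` and `rollbacks` in an already-parsed experiment (a plain dict, for example from `yaml.safe_load`) and ignores how long each action itself takes. The helper names and the 600-second budget are illustrative assumptions, not part of Chaos Toolkit.

```python
from typing import Any, Dict

CI_BUDGET_SECONDS = 600  # assumption: the 10-minute guideline expressed in seconds


def declared_pause_seconds(experiment: Dict[str, Any]) -> float:
    """Sum every pauses.before / pauses.after declared in method and rollbacks."""
    total = 0.0
    for section in ("method", "rollbacks"):
        for activity in experiment.get(section) or []:
            pauses = activity.get("pauses") or {}
            total += float(pauses.get("before", 0)) + float(pauses.get("after", 0))
    return total


def fits_ci_budget(experiment: Dict[str, Any], budget: float = CI_BUDGET_SECONDS) -> bool:
    """True when the declared pauses alone stay within the CI time budget."""
    return declared_pause_seconds(experiment) <= budget


# Hypothetical parsed experiment with a single 60-second pause after injection.
example = {
    "method": [
        {"type": "action", "name": "terminate-pods", "pauses": {"after": 60}},
    ],
    "rollbacks": [],
}
print(declared_pause_seconds(example), fits_ci_budget(example))
```

Because the check only sees declared pauses, an experiment that passes it can still be slow; treat a failure as a definite "too long" and a pass as "not obviously too long".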
Chaos Toolkit: The Foundation for CI/CD Chaos
Chaos Toolkit is well suited to CI/CD. It's:
- CLI-first (runs in any CI environment)
- Declarative (experiments as YAML or JSON)
- Extensible (plugins for AWS, Kubernetes, Gremlin, etc.)
- Produces structured output (JSON reports)
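The structured output is what makes a run machine-checkable. As a hedged sketch, the function below reduces a journal (already loaded with `json.load`) to a single verdict. It assumes the journal has top-level `status` and `deviated` fields and a `steady_states.after.probes` list whose entries carry `tolerance_met`; inspect a journal from your own run to confirm the layout before relying on it. `summarize_journal` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def summarize_journal(journal: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce a journal dict to an overall verdict plus the probes that failed."""
    # "after" may be missing or null when the run stopped before the final check.
    after = (journal.get("steady_states") or {}).get("after") or {}
    failed: List[str] = [
        (probe.get("activity") or {}).get("name", "<unnamed>")
        for probe in after.get("probes", [])
        if not probe.get("tolerance_met", False)
    ]
    ok = (
        journal.get("status") == "completed"
        and not journal.get("deviated", False)
        and not failed
    )
    return {"ok": ok, "status": journal.get("status"), "failed_probes": failed}


# Hypothetical journal fragment: the run completed, but one probe missed its tolerance.
sample = {
    "status": "completed",
    "deviated": True,
    "steady_states": {
        "after": {
            "probes": [
                {"activity": {"name": "api-responds-to-health-check"}, "tolerance_met": True},
                {"activity": {"name": "error-rate-acceptable"}, "tolerance_met": False},
            ]
        }
    },
}
print(summarize_journal(sample))
```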
Installation
pip install chaostoolkit
pip install chaostoolkit-kubernetes # for Kubernetes experiments
pip install chaostoolkit-aws # for AWS experiments
Basic Experiment Structure
# experiments/pod-termination.yaml
version: 1.0.0
title: API resilience when pods are terminated
description: >
  Verify the API continues serving requests (with acceptable latency)
  when one-third of pods are terminated unexpectedly.
tags:
  - kubernetes
  - resilience
  - api
steady-state-hypothesis:
  title: API is responsive
  probes:
    - name: api-responds-to-health-check
      type: probe
      tolerance: 200
      provider:
        type: http
        url: http://api-service.staging/health
        timeout: 5
    - name: error-rate-acceptable
      type: probe
      tolerance: true
      provider:
        type: python
        module: chaosprometheus.probes
        func: query_within_range
        arguments:
          query: |
            sum(rate(http_requests_total{namespace="staging",status=~"5.."}[2m]))
            / sum(rate(http_requests_total{namespace="staging"}[2m])) * 100
          min: 0
          max: 5.0
method:
  - type: action
    name: terminate-one-third-of-pods
    provider:
      type: python
      module: chaosk8s.pod.actions
      func: terminate_pods
      arguments:
        label_selector: app=api-service
        ns: staging
        mode: percentage
        qty: 33 # percent of matching pods, because mode is "percentage"
        rand: true
    pauses:
      after: 60 # wait 60s after injection for system to react
rollbacks: [] # Kubernetes will restart terminated pods automatically
Running in CI
# Execute experiment and capture exit code
chaos run --rollback-strategy=always experiments/pod-termination.yaml
# Exit code 0: hypothesis held (resilient)
# Exit code 1: hypothesis failed (resilience regression)
GitHub Actions Integration
# .github/workflows/resilience-validation.yaml
name: Resilience Validation
on:
  push:
    branches: [main]
jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to staging
        run: ./scripts/deploy.sh staging
      - name: Wait for deployment
        run: ./scripts/wait-for-healthy.sh staging 120
  chaos-validation:
    needs: deploy-staging
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false # run all experiments even if one fails
      matrix:
        experiment:
          - pod-termination
          - network-latency
          - memory-pressure
          - dependency-failure
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install Chaos Toolkit
        run: |
          pip install chaostoolkit chaostoolkit-kubernetes chaostoolkit-prometheus
      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          kubeconfig: ${{ secrets.STAGING_KUBECONFIG }}
      - name: Run chaos experiment
        run: |
          chaos run --rollback-strategy=always \
            --journal-path=chaos-journal-${{ matrix.experiment }}.json \
            experiments/${{ matrix.experiment }}.yaml
        env:
          PROMETHEUS_URL: ${{ vars.STAGING_PROMETHEUS_URL }}
      - name: Upload chaos journal
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: chaos-journal-${{ matrix.experiment }}
          path: chaos-journal-${{ matrix.experiment }}.json
      - name: Publish chaos results
        if: always()
        run: |
          python scripts/publish-chaos-results.py \
            --journal chaos-journal-${{ matrix.experiment }}.json \
            --experiment ${{ matrix.experiment }} \
            --build ${{ github.sha }}
  production-deploy:
    needs: chaos-validation
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4 # the deploy script lives in the repository
      - name: Deploy to production
        run: ./scripts/deploy.sh production
Core Experiment Library
Build a library of standard experiments for your stack:
Pod Termination (Kubernetes)
# experiments/pod-termination.yaml
method:
  - type: action
    name: kill-random-pod
    provider:
      type: python
      module: chaosk8s.pod.actions
      func: terminate_pods
      arguments:
        label_selector: app=api-service,tier=backend
        ns: staging
        qty: 1
        rand: true
Network Latency (Kubernetes + Toxiproxy)
# experiments/network-latency.yaml
method:
  - type: action
    name: add-latency-to-database
    provider:
      type: python
      module: chaostoxiproxy.actions
      func: add_latency_to_upstream
      arguments:
        proxy_name: database-proxy
        latency_ms: 500
        jitter_ms: 100
    pauses:
      after: 30
rollbacks:
  - type: action
    name: remove-database-latency
    provider:
      type: python
      module: chaostoxiproxy.actions
      func: remove_upstream_toxics
      arguments:
        proxy_name: database-proxy
Dependency HTTP Failures (WireMock)
# experiments/dependency-failure.yaml
method:
  - type: action
    name: make-payment-service-fail
    provider:
      type: http
      url: http://wiremock:8080/__admin/mappings
      method: POST
      headers:
        Content-Type: application/json
      arguments:
        request:
          urlPattern: '/payments/.*'
          method: POST
        response:
          status: 503
          body: '{"error": "Service unavailable"}'
rollbacks:
  - type: action
    name: restore-payment-service
    provider:
      type: http
      url: http://wiremock:8080/__admin/mappings
      method: DELETE
Memory Pressure (Linux stress-ng)
# experiments/memory-pressure.yaml
method:
  - type: action
    name: memory-stress
    background: true
    provider:
      type: process
      path: kubectl
      arguments: |
        exec deploy/api-service -n staging --
        stress-ng --vm 1 --vm-bytes 70% --timeout 60s
    pauses:
      after: 45
rollbacks: [] # stress-ng auto-terminates after 60s
Parallelizing Chaos Experiments
Running experiments sequentially can make your pipeline slow. Parallelize independent experiments:
# GitHub Actions matrix: run experiments in parallel
chaos-experiments:
  strategy:
    matrix:
      experiment: [pod-termination, network-latency, memory-pressure]
    max-parallel: 3 # run all simultaneously
  steps:
    - name: Run ${{ matrix.experiment }}
      run: chaos run experiments/${{ matrix.experiment }}.yaml
Caution: Only parallelize experiments that target independent components. Running pod termination and memory pressure on the same service simultaneously makes results uninterpretable.
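A hedged sketch of how a pipeline could decide what is safe to run together: given a hand-maintained mapping from experiment name to the component it disturbs, it packs experiments into waves so that no wave contains two experiments aimed at the same component. The mapping and the `plan_waves` helper are hypothetical and not a Chaos Toolkit or GitHub Actions feature; the component names are taken from the experiments in this article.

```python
from typing import Dict, List, Set


def plan_waves(targets: Dict[str, str]) -> List[List[str]]:
    """Greedily group experiments so each wave touches a component at most once."""
    waves: List[List[str]] = []
    used_per_wave: List[Set[str]] = []
    for experiment, component in targets.items():
        for wave, used in zip(waves, used_per_wave):
            if component not in used:
                wave.append(experiment)
                used.add(component)
                break
        else:
            # Every existing wave already disturbs this component: open a new wave.
            waves.append([experiment])
            used_per_wave.append({component})
    return waves


# Assumed mapping of experiment -> component under test.
targets = {
    "pod-termination": "api-service",
    "network-latency": "database-proxy",
    "memory-pressure": "api-service",
    "dependency-failure": "wiremock",
}
waves = plan_waves(targets)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

Each wave can then become one parallel matrix job set, with waves executed one after another.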
Chaos Results as PR Comments
Publish chaos experiment results directly to pull requests:
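# scripts/publish-chaos-results.py
import argparse
import json
import os

import requests


def publish_results(journal_path, experiment, build_sha):
    with open(journal_path) as f:
        journal = json.load(f)

    passed = journal['status'] == 'completed'
    # 'after' can be missing or null when the run stops before the final check
    after = (journal.get('steady_states') or {}).get('after') or {}
    hypothesis_held = all(
        probe['status'] == 'succeeded'
        for probe in after.get('probes', [])
    )
    emoji = '✅' if (passed and hypothesis_held) else '❌'
    status = 'PASSED' if (passed and hypothesis_held) else 'FAILED'
    comment = f"""## {emoji} Chaos Experiment: {experiment}

**Status:** {status}
**Build:** {build_sha[:8]}

| Check | Result |
|---|---|
| Experiment completed | {'✅' if passed else '❌'} |
| Steady state held | {'✅' if hypothesis_held else '❌'} |
"""
    # Post to GitHub PR (skipped when PR_NUMBER is not set)
    pr_number = os.environ.get('PR_NUMBER')
    if pr_number:
        requests.post(
            f"https://api.github.com/repos/{os.environ['GITHUB_REPOSITORY']}/issues/{pr_number}/comments",
            json={'body': comment},
            headers={'Authorization': f"Bearer {os.environ['GITHUB_TOKEN']}"},
            timeout=10,
        )


if __name__ == '__main__':
    # Flags match the invocation in the workflow's "Publish chaos results" step
    parser = argparse.ArgumentParser()
    parser.add_argument('--journal', required=True)
    parser.add_argument('--experiment', required=True)
    parser.add_argument('--build', required=True)
    args = parser.parse_args()
    publish_results(args.journal, args.experiment, args.build)
Chaos Pipeline Anti-Patterns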
Running chaos against production in CI. Automated pipeline chaos must target isolated staging environments. Production chaos requires human oversight.
No rollback strategy. Every automated chaos experiment must have a rollback. --rollback-strategy=always ensures cleanup even on experiment failure.
Experiments too large for CI. An experiment that takes 45 minutes to run will block every deploy. Keep automated chaos experiments under 10 minutes.
Not isolating experiment blast radius. If your staging environment is shared, a chaos experiment affecting one service may interfere with another team's tests. Use namespace isolation.
Treating chaos failures as flaky tests. If a chaos experiment fails, it found a real resilience problem — don't retry to make it pass. Fix the resilience issue, then rerun.
No chaos experiment versioning. As your system changes, old experiments may no longer apply. Version and review your experiment library quarterly.
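The production and blast-radius anti-patterns can be partly guarded against mechanically. The following sketch walks a parsed experiment and collects every `ns` argument that is not in an allowlist, so the pipeline can refuse to run the file. The allowlist, the helper names, and the choice to inspect only `ns` arguments are assumptions for illustration; experiments that reach a cluster through URLs or cloud credentials would need additional checks.

```python
from typing import Any, Dict, Iterator, List, Set

ALLOWED_NAMESPACES: Set[str] = {"staging"}  # assumption: the only namespace CI may disturb


def iter_activities(experiment: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield every probe and action from the hypothesis, method, and rollbacks."""
    hypothesis = experiment.get("steady-state-hypothesis") or {}
    yield from hypothesis.get("probes") or []
    yield from experiment.get("method") or []
    yield from experiment.get("rollbacks") or []


def disallowed_namespaces(
    experiment: Dict[str, Any], allowed: Set[str] = ALLOWED_NAMESPACES
) -> List[str]:
    """Return one message per activity whose `ns` argument is outside the allowlist."""
    problems: List[str] = []
    for activity in iter_activities(experiment):
        arguments = (activity.get("provider") or {}).get("arguments")
        # Process providers may pass arguments as a string; only dicts can carry `ns`.
        if isinstance(arguments, dict) and "ns" in arguments:
            if arguments["ns"] not in allowed:
                name = activity.get("name", "<unnamed>")
                problems.append(f"{name}: ns={arguments['ns']}")
    return problems


# Hypothetical experiment that accidentally points one action at production.
risky = {
    "method": [
        {"name": "kill-random-pod", "provider": {"arguments": {"ns": "staging"}}},
        {"name": "kill-prod-pod", "provider": {"arguments": {"ns": "production"}}},
    ]
}
print(disallowed_namespaces(risky))
```

Running a check like this as the first pipeline step, and failing the job when the list is non-empty, keeps a mistyped namespace from ever reaching `chaos run`.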
Getting Started: First Automated Chaos Experiment
- Install Chaos Toolkit and configure kubectl access to staging
- Verify steady state by running the health check probe manually
- Write your first experiment: pod termination for your most critical service
- Run it manually:
chaos run experiments/pod-termination.yaml
- Add to CI after the first three manual runs produce consistent, interpretable results
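The "three consistent runs" rule can be made explicit with a small wrapper. This is a minimal sketch that reads "consistent" as "every run exited with code 0", which is an assumption; a team might equally want to review the journals by hand. The helper names are hypothetical, and the `runner` parameter exists only so the command can be swapped out when `chaos` is not installed.

```python
import subprocess
from typing import Any, Callable, Sequence


def run_once(path: str, runner: Callable[..., Any] = subprocess.run) -> int:
    """Run `chaos run` on one experiment file and return the process exit code."""
    completed = runner(["chaos", "run", "--rollback-strategy=always", path])
    return completed.returncode


def ready_for_ci(exit_codes: Sequence[int], required_runs: int = 3) -> bool:
    """True when enough runs happened and every one of them exited with 0."""
    return len(exit_codes) >= required_runs and all(code == 0 for code in exit_codes)


# Usage (requires Chaos Toolkit and cluster access, so it is left as a comment):
# codes = [run_once("experiments/pod-termination.yaml") for _ in range(3)]
# print("add to CI" if ready_for_ci(codes) else "keep running manually")
```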
Start small. One experiment per critical service, run on every merge to main. Build from there.
HelpMeTest's health monitoring provides the steady-state baseline that automated chaos experiments need to determine whether resilience is holding.