Chaos Engineering in CI/CD: Automating Resilience Validation in Your Pipeline

Most teams run chaos experiments manually — a team gathers, runs an experiment, watches dashboards, and documents findings. This works, but it doesn't scale. Manual chaos experiments happen infrequently, require coordination, and catch problems after code has already been deployed to production.

Chaos engineering in CI/CD changes this. Automated chaos experiments run on every deployment to staging, validating that new code hasn't broken your system's resilience properties before it reaches production.

Why Automate Chaos in CI/CD?

Catch resilience regressions before production. A code change that works functionally may break a circuit breaker implementation or remove a retry mechanism. Automated chaos tests catch this before the change ships.

Continuous validation. Infrastructure changes, dependency updates, and configuration changes can all affect resilience. Automated chaos runs against every change.

Remove the coordination overhead. Manual chaos experiments require scheduling, team coordination, and post-experiment documentation. For the experiments you automate, CI/CD integration removes most of this: the pipeline schedules the run and records the result.

Build confidence in deployments. When your pipeline includes automated chaos validation, "did we break resilience?" becomes a question the pipeline answers automatically.

What Belongs in CI/CD vs. What Stays Manual

Not all chaos experiments are suited for CI/CD automation. Choose carefully:

Good for CI/CD automation:

  • Predictable, fast experiments (under 10 minutes)
  • Isolated to staging/test environments
  • Well-defined steady state with automated verification
  • Low blast radius — affects only the service under test
  • Repeatable with deterministic results

Stay manual:

  • GameDay scenarios (complex, multi-team)
  • Production chaos (requires human judgment)
  • Novel hypothesis exploration
  • Long-duration soak experiments
  • Experiments with unclear steady state

Start with a small set of automated experiments — 5-10 scenarios that represent your most critical resilience properties. Expand from there.
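
The criteria above can be turned into a quick triage check. The following is a minimal sketch, not part of Chaos Toolkit: the `ExperimentProfile` fields and the `ci_blockers` helper are hypothetical names chosen for illustration, and the 10-minute limit mirrors the guideline in the list.

```python
from dataclasses import dataclass


@dataclass
class ExperimentProfile:
    """Hypothetical description of a chaos experiment, used only for triage."""
    name: str
    duration_minutes: float
    environment: str               # e.g. "staging", "test", "production"
    automated_verification: bool   # steady state is checked by probes, not people
    single_service: bool           # blast radius limited to the service under test
    deterministic: bool            # repeated runs give the same outcome


def ci_blockers(profile: ExperimentProfile, max_minutes: float = 10.0) -> list[str]:
    """Return the reasons an experiment should stay manual (empty list = CI-ready)."""
    blockers = []
    if profile.duration_minutes >= max_minutes:
        blockers.append(f"runs {profile.duration_minutes} min, limit is {max_minutes}")
    if profile.environment not in {"staging", "test"}:
        blockers.append(f"targets {profile.environment}, not staging/test")
    if not profile.automated_verification:
        blockers.append("steady state is not verified automatically")
    if not profile.single_service:
        blockers.append("blast radius extends beyond the service under test")
    if not profile.deterministic:
        blockers.append("results are not repeatable")
    return blockers


candidate = ExperimentProfile("pod-termination", 3, "staging", True, True, True)
print(ci_blockers(candidate))  # an empty list means the experiment is a CI candidate
```

An experiment with any blocker stays in the manual column until the blocker is resolved.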

Chaos Toolkit: The Foundation for CI/CD Chaos

Chaos Toolkit is well suited to CI/CD pipelines. It's:

  • CLI-first (runs in any CI environment)
  • Declarative (experiments as YAML)
  • Extensible (plugins for AWS, Kubernetes, Gremlin, etc.)
  • Produces structured output (JSON reports)

Installation

pip install chaostoolkit
pip install chaostoolkit-kubernetes  # for Kubernetes experiments
pip install chaostoolkit-aws         # for AWS experiments

Basic Experiment Structure

# experiments/pod-termination.yaml
version: 1.0.0
title: API resilience when pods are terminated
description: >
  Verify the API continues serving requests (with acceptable latency) 
  when one-third of pods are terminated unexpectedly.

tags:
  - kubernetes
  - resilience
  - api

steady-state-hypothesis:
  title: API is responsive
  probes:
    - name: api-responds-to-health-check
      type: probe
      tolerance: 200
      provider:
        type: http
        url: http://api-service.staging/health
        timeout: 5
    - name: error-rate-acceptable
      type: probe
      tolerance: true
      provider:
        type: python
        module: chaosprometheus.probes
        func: query_within_range
        arguments:
          query: |
            sum(rate(http_requests_total{namespace="staging",status=~"5.."}[2m])) 
            / sum(rate(http_requests_total{namespace="staging"}[2m])) * 100
          min: 0
          max: 5.0

method:
  - type: action
    name: terminate-one-third-of-pods
    provider:
      type: python
      module: chaosk8s.pod.actions
      func: terminate_pods
      arguments:
        label_selector: app=api-service
        ns: staging
        mode: percentage  # interpret qty as a percentage of matching pods
        qty: 33
        rand: true
    pauses:
      after: 60  # wait 60s after injection for system to react

rollbacks: []  # Kubernetes will restart terminated pods automatically
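
The error-rate probe above boils down to a small piece of arithmetic: 5xx requests divided by all requests, expressed as a percentage and compared with the 0 to 5 range. Here is a minimal sketch of that check in plain Python. The function names are hypothetical, and raising on zero traffic is an assumption about how you may want to treat an unverifiable steady state.

```python
def error_rate_percent(error_requests: float, total_requests: float) -> float:
    """Share of requests that returned a 5xx status, as a percentage."""
    if total_requests <= 0:
        # Assumption: with no traffic the steady state cannot be verified,
        # so fail loudly instead of reporting a misleading 0%.
        raise ValueError("no traffic observed; steady state cannot be verified")
    return error_requests * 100 / total_requests


def within_range(value: float, minimum: float = 0.0, maximum: float = 5.0) -> bool:
    """Mirror of the min/max bounds used by the error-rate probe."""
    return minimum <= value <= maximum


rate = error_rate_percent(error_requests=3, total_requests=200)
print(rate, within_range(rate))
```

Running the same check by hand against staging metrics before automating it is a cheap way to confirm the 5% ceiling is realistic for your service.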

Running in CI

# Execute experiment and capture exit code
chaos run --rollback-strategy=always experiments/pod-termination.yaml

# Exit code 0: hypothesis held (resilient)
# Exit code 1: hypothesis failed (resilience regression)
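
If your CI system drives steps from Python instead of shell, the same invocation can be wrapped in a small helper. This is a sketch under the assumption that the `chaos` CLI is on the PATH; `build_chaos_command` and `run_experiment` are hypothetical names, and the only flags used are the ones that appear in this article.

```python
import subprocess


def build_chaos_command(experiment_path: str, journal_path: str) -> list[str]:
    """Assemble the `chaos run` invocation used throughout this article."""
    return [
        "chaos", "run",
        "--rollback-strategy=always",
        f"--journal-path={journal_path}",
        experiment_path,
    ]


def run_experiment(experiment_path: str, journal_path: str) -> bool:
    """Run one experiment; True means the CLI exited with code 0."""
    result = subprocess.run(build_chaos_command(experiment_path, journal_path))
    return result.returncode == 0
```

A caller would fail the pipeline step when `run_experiment(...)` returns `False`, for example by calling `sys.exit(1)`.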

GitHub Actions Integration

# .github/workflows/resilience-validation.yaml
name: Resilience Validation

on:
  push:
    branches: [main]

jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to staging
        run: ./scripts/deploy.sh staging
      - name: Wait for deployment
        run: ./scripts/wait-for-healthy.sh staging 120

  chaos-validation:
    needs: deploy-staging
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false  # run all experiments even if one fails
      matrix:
        experiment:
          - pod-termination
          - network-latency
          - memory-pressure
          - dependency-failure
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      
      - name: Install Chaos Toolkit
        run: |
          pip install chaostoolkit chaostoolkit-kubernetes chaostoolkit-prometheus
      
      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          kubeconfig: ${{ secrets.STAGING_KUBECONFIG }}
      
      - name: Run chaos experiment
        run: |
          chaos run --rollback-strategy=always \
            --journal-path=chaos-journal-${{ matrix.experiment }}.json \
            experiments/${{ matrix.experiment }}.yaml
        env:
          PROMETHEUS_URL: ${{ vars.STAGING_PROMETHEUS_URL }}
      
      - name: Upload chaos journal
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: chaos-journal-${{ matrix.experiment }}
          path: chaos-journal-${{ matrix.experiment }}.json
      
      - name: Publish chaos results
        if: always()
        run: |
          python scripts/publish-chaos-results.py \
            --journal chaos-journal-${{ matrix.experiment }}.json \
            --experiment ${{ matrix.experiment }} \
            --build ${{ github.sha }}

  production-deploy:
    needs: chaos-validation
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4  # each job starts on a fresh runner, so fetch the deploy script
      - name: Deploy to production
        run: ./scripts/deploy.sh production
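
The workflow gates the production deploy on every matrix job succeeding. If you also want a single summary of which experiments blocked the deploy, the uploaded journals can be aggregated. The sketch below is illustrative only: it assumes each journal has been loaded into a dict, that `status` is `completed` for a finished run (as the publishing script in this article also assumes), and that a boolean `deviated` field marks a broken steady state. Verify both field names against a journal produced by your Chaos Toolkit version.

```python
def experiment_passed(journal: dict) -> bool:
    """A run passes when it completed and the steady state did not deviate."""
    return journal.get("status") == "completed" and not journal.get("deviated", False)


def blocking_experiments(journals: dict[str, dict]) -> list[str]:
    """Return the names of experiments that should block the production deploy."""
    return sorted(name for name, journal in journals.items()
                  if not experiment_passed(journal))


journals = {
    "pod-termination": {"status": "completed", "deviated": False},
    "network-latency": {"status": "completed", "deviated": True},
    "memory-pressure": {"status": "aborted"},
}
print(blocking_experiments(journals))
```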

Core Experiment Library

Build a library of standard experiments for your stack:

Pod Termination (Kubernetes)

# experiments/pod-termination.yaml
method:
  - type: action
    name: kill-random-pod
    provider:
      type: python
      module: chaosk8s.pod.actions
      func: terminate_pods
      arguments:
        label_selector: app=api-service,tier=backend
        ns: staging
        qty: 1
        rand: true

Network Latency (Kubernetes + Toxiproxy)

# experiments/network-latency.yaml
# Adjust module/func to match the Toxiproxy extension you install
method:
  - type: action
    name: add-latency-to-database
    provider:
      type: python
      module: chaostoxiproxy.actions
      func: add_latency_to_upstream
      arguments:
        proxy_name: database-proxy
        latency_ms: 500
        jitter_ms: 100
    pauses:
      after: 30  # hold the latency for 30s before re-checking steady state

rollbacks:
  - type: action
    name: remove-database-latency
    provider:
      type: python
      module: chaostoxiproxy.actions
      func: remove_upstream_toxics
      arguments:
        proxy_name: database-proxy

Dependency HTTP Failures (WireMock)

# experiments/dependency-failure.yaml
method:
  - type: action
    name: make-payment-service-fail
    provider:
      type: http
      url: http://wiremock:8080/__admin/mappings
      method: POST
      headers:
        Content-Type: application/json
      arguments:
        request:
          urlPattern: '/payments/.*'
          method: POST
        response:
          status: 503
          body: '{"error": "Service unavailable"}'

rollbacks:
  - type: action
    name: restore-payment-service
    provider:
      type: http
      url: http://wiremock:8080/__admin/mappings
      method: DELETE  # removes all stub mappings, not only the 503 stub

Memory Pressure (Linux stress-ng)

# experiments/memory-pressure.yaml
method:
  - type: action
    name: memory-stress
    background: true
    provider:
      type: process
      path: kubectl
      arguments: |
        exec deploy/api-service -n staging --
        stress-ng --vm 1 --vm-bytes 70% --timeout 60s
    pauses:
      after: 45  # re-check steady state while stress-ng is still running

rollbacks: []  # stress-ng auto-terminates after 60s

Parallelizing Chaos Experiments

Running experiments sequentially can make your pipeline slow. Parallelize independent experiments:

# GitHub Actions matrix: run experiments in parallel
chaos-experiments:
  strategy:
    matrix:
      experiment: [pod-termination, network-latency, memory-pressure]
    max-parallel: 3  # run all simultaneously
  steps:
    - name: Run ${{ matrix.experiment }}
      run: chaos run experiments/${{ matrix.experiment }}.yaml

Caution: Only parallelize experiments that target independent components. Running pod termination and memory pressure on the same service simultaneously makes results uninterpretable.
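
One way to respect this rule is to plan batches so that no two experiments in the same batch disturb the same component. The following is a minimal sketch. The `plan_batches` helper and the experiment-to-target mapping are hypothetical and would need to reflect what each of your experiments actually touches.

```python
from collections import defaultdict


def plan_batches(targets: dict[str, str]) -> list[list[str]]:
    """Group experiments into batches in which no two share a target.

    `targets` maps experiment name -> the component it disturbs.
    Experiments within a batch may run in parallel; batches run in sequence.
    """
    by_target = defaultdict(list)
    for experiment, target in targets.items():
        by_target[target].append(experiment)

    # The busiest target determines how many sequential batches are needed.
    depth = max((len(names) for names in by_target.values()), default=0)
    return [
        [names[i] for names in by_target.values() if i < len(names)]
        for i in range(depth)
    ]


print(plan_batches({
    "pod-termination": "api-service",
    "memory-pressure": "api-service",
    "network-latency": "database-proxy",
}))
```

Each batch could then become its own matrix job, with later batches depending on earlier ones through `needs`.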

Chaos Results as PR Comments

Publish chaos experiment results directly to pull requests:

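# scripts/publish-chaos-results.py
import argparse
import json
import os

import requests


def publish_results(journal_path, experiment, build_sha):
    with open(journal_path) as f:
        journal = json.load(f)

    passed = journal.get('status') == 'completed'
    # 'after' is missing or null when the run stopped before the final check
    after = (journal.get('steady_states') or {}).get('after') or {}
    # steady_state_met is only true when every probe met its tolerance
    hypothesis_held = after.get('steady_state_met') is True

    emoji = '✅' if (passed and hypothesis_held) else '❌'
    status = 'PASSED' if (passed and hypothesis_held) else 'FAILED'

    comment = f"""## {emoji} Chaos Experiment: {experiment}

**Status:** {status}
**Build:** {build_sha[:8]}

| Check | Result |
|---|---|
| Experiment completed | {'✅' if passed else '❌'} |
| Steady state held | {'✅' if hypothesis_held else '❌'} |
"""

    # Post to GitHub PR (skipped when the run is not tied to a pull request)
    pr_number = os.environ.get('PR_NUMBER')
    if pr_number:
        response = requests.post(
            f"https://api.github.com/repos/{os.environ['GITHUB_REPOSITORY']}/issues/{pr_number}/comments",
            json={'body': comment},
            headers={'Authorization': f"Bearer {os.environ['GITHUB_TOKEN']}"},
            timeout=10,
        )
        response.raise_for_status()


if __name__ == '__main__':
    # Flags match the invocation in the workflow's "Publish chaos results" step
    parser = argparse.ArgumentParser()
    parser.add_argument('--journal', required=True)
    parser.add_argument('--experiment', required=True)
    parser.add_argument('--build', required=True)
    args = parser.parse_args()
    publish_results(args.journal, args.experiment, args.build)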

Chaos Pipeline Anti-Patterns

Running chaos against production in CI. Automated pipeline chaos must target isolated staging environments. Production chaos requires human oversight.

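No rollback strategy. Every automated chaos experiment must define how its changes are undone, unless the platform heals itself (as Kubernetes does for terminated pods). Pass --rollback-strategy=always so rollbacks run even when the experiment fails or the hypothesis deviates.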

Experiments too large for CI. An experiment that takes 45 minutes to run will block every deploy. Keep automated chaos experiments under 10 minutes.

Not isolating experiment blast radius. If your staging environment is shared, a chaos experiment affecting one service may interfere with another team's tests. Use namespace isolation.

Treating chaos failures as flaky tests. If a chaos experiment fails, assume it found a real resilience problem, and don't retry until it passes. Fix the resilience issue, then rerun.

No chaos experiment versioning. As your system changes, old experiments may no longer apply. Version and review your experiment library quarterly.
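
Several of these anti-patterns can be caught mechanically before an experiment ever runs. The sketch below lints an experiment that has already been parsed into a dict (for example with a YAML loader). It assumes the Chaos Toolkit layout of a `method` list, a top-level `rollbacks` key, and per-activity `pauses`. The `lint_experiment` name, the `ns` argument check, and the 600-second budget are illustrative choices, and pause time is only a lower bound on the real run time.

```python
def lint_experiment(experiment: dict, max_pause_seconds: int = 600) -> list[str]:
    """Return a list of anti-pattern findings; an empty list means none were found."""
    problems = []

    if "rollbacks" not in experiment:
        problems.append("no rollbacks section (use an empty list if none are needed)")

    total_pause = 0
    for activity in experiment.get("method", []):
        arguments = activity.get("provider", {}).get("arguments", {})
        # Process providers pass arguments as a string, so only inspect dicts.
        if isinstance(arguments, dict) and arguments.get("ns") == "production":
            problems.append(f"{activity.get('name')}: targets the production namespace")
        pauses = activity.get("pauses", {})
        total_pause += pauses.get("before", 0) + pauses.get("after", 0)

    if total_pause > max_pause_seconds:
        problems.append(f"pauses add up to {total_pause}s, over the {max_pause_seconds}s budget")
    return problems


experiment = {
    "method": [{
        "name": "kill-random-pod",
        "provider": {"arguments": {"ns": "staging", "qty": 1}},
        "pauses": {"after": 60},
    }],
    "rollbacks": [],
}
print(lint_experiment(experiment))
```

Running a check like this as an early pipeline step keeps oversized or production-targeting experiments out of the automated library.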

Getting Started: First Automated Chaos Experiment

  1. Install Chaos Toolkit and configure kubectl access to staging
  2. Verify steady state by running the health check probe manually
  3. Write your first experiment: pod termination for your most critical service
  4. Run it manually: chaos run experiments/pod-termination.yaml
  5. Add to CI after the first three manual runs produce consistent, interpretable results

Start small. One experiment per critical service, run on every merge to main. Build from there.

