Security Chaos Engineering: Testing Your System's Resilience to Attacks

Traditional chaos engineering focuses on infrastructure failure: pods terminating, networks degrading, databases slowing down. Security chaos engineering extends this to adversarial conditions: authentication systems under attack, authorization bypasses attempted, API rate limits hammered.

The question isn't just "does the system survive?" — it's "does the system secure itself correctly when under attack?"

What Security Chaos Engineering Tests

Security chaos engineering validates a different set of hypotheses than reliability chaos:

Authentication resilience: When authentication servers are unavailable, does the application fail closed (deny all access) or fail open (allow unauthenticated access)?

Authorization under load: When the authorization service is slow, do applications wait (possibly timing out legitimately) or short-circuit (skipping checks)?

Rate limiting effectiveness: When API endpoints receive 10,000 requests per minute, do rate limits correctly throttle, or do they silently fail to protect?

Certificate expiration handling: When a TLS certificate expires (or is close to expiry), does the application detect this and alert before users are affected?

Secret rotation resilience: When credentials are rotated, do services pick up new credentials without downtime?

DDoS mitigation: When synthetic DDoS traffic is injected, do protection mechanisms activate and limit blast radius?

Each is a testable hypothesis with a clear pass/fail criterion.
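
One way to keep a hypothesis honest is to write it down as a name plus a predicate before running anything. The sketch below is illustrative only: the `Hypothesis` class and the example status codes are assumptions made for this example, not part of any chaos tool.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Hypothesis:
    """A security hypothesis with an explicit pass/fail predicate (hypothetical helper)."""
    name: str
    passes: Callable[[List[int]], bool]

    def evaluate(self, observed_statuses):
        return "PASS" if self.passes(list(observed_statuses)) else "FAIL"


# Fail closed: while auth is down, at least one response was observed
# and none of them is a 2xx.
fails_closed = Hypothesis(
    name="auth outage fails closed",
    passes=lambda statuses: bool(statuses)
    and all(not (200 <= s < 300) for s in statuses),
)

print(fails_closed.evaluate([401, 503, 401]))  # PASS
print(fails_closed.evaluate([401, 200]))       # FAIL
```

Treating an empty observation list as a failure avoids a vacuous pass when the probes never ran.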

Experiment 1: Authentication Service Unavailability

Hypothesis: When the authentication service is unavailable, all protected API endpoints return 401/503 — no endpoint becomes accessible without authentication.

This tests the critical "fail closed" property. If your application bypasses authentication checks when the auth service is slow or down, you have a catastrophic security vulnerability that only manifests under failure conditions.

# Chaos experiment: auth service unavailable
title: Auth service unavailability
description: Verify system fails closed when auth is unavailable

steady-state-hypothesis:
  title: Protected endpoints require authentication
  probes:
    - name: unauthenticated-request-rejected
      type: probe
      tolerance: 401
      provider:
        type: http
        url: http://api.staging/users/profile
        # No Authorization header
    - name: authenticated-request-succeeds
      type: probe
      tolerance: 200
      provider:
        type: http
        url: http://api.staging/users/profile
        headers:
          Authorization: Bearer ${VALID_TOKEN}

method:
  - type: action
    name: scale-auth-service-to-zero
    provider:
      type: python
      module: chaosk8s.deployment.actions
      func: scale_deployment
      arguments:
        name: auth-service
        replicas: 0
        ns: staging
    # Give the auth pods time to terminate before probing
    pauses:
      after: 10

  # Critical: verify protected endpoints are NOT accessible without auth
  - type: probe
    name: protected-endpoint-still-requires-auth
    tolerance: [401, 503]  # either is acceptable; 200 is not
    provider:
      type: http
      url: http://api.staging/users/profile
      timeout: 10

  - type: probe
    name: admin-endpoint-inaccessible
    tolerance: [401, 503]
    provider:
      type: http
      url: http://api.staging/admin/users
      timeout: 10
rollbacks:
  - type: action
    name: restore-auth-service
    provider:
      type: python
      module: chaosk8s.deployment.actions
      func: scale_deployment
      arguments:
        name: auth-service
        replicas: 3
        ns: staging

What failure looks like: If the hypothesis fails, protected endpoints return 200 without authentication while the auth service is down. That is a fail-open bug, and it stays invisible until the auth service actually fails.

Implementation fix: Token verification code should reject the request whenever it cannot validate the token, rather than assuming the token is valid or returning a value the caller might let through:

# Wrong: fails open when auth service is unavailable
async def verify_token(token):
    try:
        user = await auth_service.verify(token)
        return user
    except AuthServiceUnavailable:
        return None  # 🚨 returns None = unauthenticated, caller may allow through

# Correct: fails closed
async def verify_token(token):
    try:
        user = await auth_service.verify(token)
        return user
    except AuthServiceUnavailable:
        raise UnauthorizedError("Authentication service unavailable")  # 401

Experiment 2: Rate Limiting Under Flood

Hypothesis: When an API endpoint receives 1,000 requests per minute from a single IP, the rate limiter activates within 5 seconds and blocks subsequent requests with 429 responses. Legitimate users from other IPs are not affected.

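# chaos/rate_limit_flood.py
import asyncio
import time

import aiohttp

ATTACKER_IP = '1.2.3.4'
LEGITIMATE_IP = '5.6.7.8'


async def fetch_status(session, url, ip):
    """Send one GET and return its status code, releasing the connection."""
    # X-Forwarded-For only simulates distinct clients if the rate limiter
    # under test keys on that header.
    async with session.get(url, headers={'X-Forwarded-For': ip}) as resp:
        return resp.status


async def flood_endpoint(url, requests_per_second=20, duration=30):
    """Send rapid requests to trigger rate limiting."""
    results = {'success': 0, 'rate_limited': 0, 'errors': 0, 'first_429_after': None}
    start = time.time()

    async with aiohttp.ClientSession() as session:
        while time.time() - start < duration:
            tasks = [
                fetch_status(session, url, ATTACKER_IP)
                for _ in range(requests_per_second)
            ]
            statuses = await asyncio.gather(*tasks, return_exceptions=True)

            for status in statuses:
                if isinstance(status, Exception):
                    results['errors'] += 1
                elif status == 429:
                    results['rate_limited'] += 1
                    if results['first_429_after'] is None:
                        # Resolution is one batch, roughly one second
                        results['first_429_after'] = time.time() - start
                elif status == 200:
                    results['success'] += 1

            await asyncio.sleep(1)

    return results


async def check_legitimate_traffic(url, duration=30):
    """Send one request per second from a different client IP."""
    results = {'success': 0, 'blocked': 0, 'errors': 0}
    start = time.time()

    async with aiohttp.ClientSession() as session:
        while time.time() - start < duration:
            try:
                status = await fetch_status(session, url, LEGITIMATE_IP)
                if status == 429:
                    results['blocked'] += 1
                elif status == 200:
                    results['success'] += 1
            except aiohttp.ClientError:
                results['errors'] += 1
            await asyncio.sleep(1)

    return results


async def run_experiment():
    # Flood from the "attacker" IP while checking "legitimate" traffic in parallel
    flood_results, legitimate_results = await asyncio.gather(
        flood_endpoint('http://api.staging/search', requests_per_second=50, duration=30),
        check_legitimate_traffic('http://api.staging/search', duration=30),
    )

    # Assertions
    assert flood_results['rate_limited'] > 0, "Rate limiting never activated"
    assert legitimate_results['blocked'] == 0, "Legitimate traffic was incorrectly blocked"

    activation_time = flood_results['first_429_after']
    assert activation_time < 5, \
        f"Rate limiting took {activation_time:.1f}s to activate (max: 5s)"


if __name__ == '__main__':
    asyncio.run(run_experiment())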

Common finding: Rate limiters that use in-memory counters don't work in multi-instance deployments — each instance has its own counter, and attackers bypass limits by distributing requests across instances. Use Redis-backed rate limiting:

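# Redis-backed rate limiting with Lua script (atomic operation)
import redis

redis_client = redis.Redis(host='redis.internal')

# INCR and EXPIRE run atomically inside Redis. The TTL is set only when the
# key is first created, so the window is fixed instead of being reset by
# every request.
RATE_LIMIT_LUA = """
local count = redis.call('INCR', KEYS[1])
if count == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return count
"""
rate_limit_script = redis_client.register_script(RATE_LIMIT_LUA)

def is_rate_limited(ip, limit=100, window=60):
    key = f'rate_limit:{ip}'
    count = rate_limit_script(keys=[key], args=[window])
    return count > limit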
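
The multi-instance finding can be reproduced without any infrastructure. The following is a small simulation, not a production limiter: `InMemoryLimiter` and `allowed_requests` are hypothetical names, and a single shared Python object stands in for a central store such as Redis.

```python
import itertools


class InMemoryLimiter:
    """Per-process counter, as found inside a single application instance."""

    def __init__(self, limit):
        self.limit = limit
        self.counts = {}

    def allow(self, ip):
        self.counts[ip] = self.counts.get(ip, 0) + 1
        return self.counts[ip] <= self.limit


def allowed_requests(limiters, total_requests, ip="1.2.3.4"):
    """Round-robin requests across instances, like a load balancer would."""
    rotation = itertools.cycle(limiters)
    return sum(next(rotation).allow(ip) for _ in range(total_requests))


LIMIT = 100

# Three instances, each with its own counter.
separate = [InMemoryLimiter(LIMIT) for _ in range(3)]
# One shared counter stands in for a central store such as Redis.
shared = [InMemoryLimiter(LIMIT)] * 3

print(allowed_requests(separate, 1000))  # 300: the limit times the instance count
print(allowed_requests(shared, 1000))    # 100: the intended limit
```

The effective limit scales with the number of instances, which is why a flood test against a single-instance staging setup can pass while the multi-instance deployment stays exposed.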

Experiment 3: TLS Certificate Expiry Detection

Hypothesis: When a TLS certificate is within 30 days of expiration, an alert fires automatically before users are affected.

This experiment doesn't actually expire a certificate — instead, it validates that your monitoring detects impending expiry.

#!/bin/bash
# Check certificate expiry for your domains
check_cert_expiry() {
    local domain=$1
    local threshold_days=$2

    expiry=$(echo | openssl s_client -connect "$domain:443" -servername "$domain" 2>/dev/null \
        | openssl x509 -noout -enddate 2>/dev/null \
        | cut -d= -f2)

    expiry_epoch=$(date -d "$expiry" +%s)
    now_epoch=$(date +%s)
    days_until_expiry=$(( (expiry_epoch - now_epoch) / 86400 ))

    echo "Domain: $domain | Expires: $expiry | Days remaining: $days_until_expiry"

    if [ "$days_until_expiry" -lt "$threshold_days" ]; then
        echo "ALERT: Certificate expires in $days_until_expiry days (threshold: $threshold_days)"
        return 1
    fi
    return 0
}

# Check all your domains
DOMAINS=("api.yourapp.com" "app.yourapp.com" "admin.yourapp.com")
THRESHOLD=30

for domain in "${DOMAINS[@]}"; do
    check_cert_expiry "$domain" "$THRESHOLD"
done

Integrate this into your CI or as a cron job that alerts your ops team.
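
For teams that would rather keep the check inside a Python test suite, the same date arithmetic can be sketched with the standard library alone. The function names below are assumptions for illustration. The input is the `notAfter` string in the format that `ssl.SSLSocket.getpeercert()` returns, and fetching it from a live endpoint is left out.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Whole days between `now` and a certificate's notAfter string.

    `not_after` uses the format returned by ssl.SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expiry = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expiry - now).days


def needs_alert(not_after, threshold_days=30, now=None):
    """True when the certificate is inside the alerting window."""
    return days_until_expiry(not_after, now) < threshold_days
```

Passing `now` explicitly keeps the function testable with fixed dates, so the threshold logic can be checked without waiting for a real certificate to age.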

For chaos testing: use a short-lived certificate in staging to verify your alerting actually fires:

# Generate a certificate that expires in 5 days (for testing)
openssl req -x509 -newkey rsa:4096 -keyout test-key.pem -out test-cert.pem \
    -days 5 -nodes -subj "/CN=api.staging.yourapp.com"

# Deploy to staging, verify alert fires

Experiment 4: Secret Rotation Resilience

Hypothesis: When database credentials are rotated, the application reconnects with new credentials within 60 seconds without requiring a deployment.

# chaos/secret_rotation.py
import subprocess
import time
import requests

def rotate_database_password():
    """Rotate the staging database password via AWS Secrets Manager."""
    subprocess.run([
        'aws', 'secretsmanager', 'rotate-secret',
        '--secret-id', 'staging/db-password',
        '--region', 'us-east-1',
    ], check=True)

def wait_for_reconnection(api_url, timeout=120):
    """Wait for application to reconnect with new credentials."""
    start = time.time()
    
    while time.time() - start < timeout:
        try:
            resp = requests.get(f'{api_url}/health', timeout=5)
            if resp.status_code == 200:
                health = resp.json()
                if health.get('database') == 'connected':
                    return time.time() - start  # return time to reconnect
        except requests.exceptions.RequestException:
            pass
        time.sleep(5)
    
    return None  # timeout - failed to reconnect

# Run experiment
print("Rotating database credentials...")
rotate_database_password()

print("Waiting for application to reconnect...")
reconnect_time = wait_for_reconnection('http://api.staging')

if reconnect_time is None:
    print("FAIL: Application did not reconnect within 120 seconds")
    raise SystemExit(1)

# Check the 60-second hypothesis before reporting success
if reconnect_time >= 60:
    print(f"FAIL: Reconnection took {reconnect_time:.1f} seconds (max: 60)")
    raise SystemExit(1)

print(f"PASS: Application reconnected in {reconnect_time:.1f} seconds")

Common finding: Applications that read credentials at startup and cache them indefinitely fail when credentials rotate. Use the AWS Secrets Manager SDK or Vault Agent to fetch fresh credentials periodically:

import boto3
import json
import time

class RotatingCredentialManager:
    def __init__(self, secret_name, refresh_interval=300):
        self.secret_name = secret_name
        self.refresh_interval = refresh_interval
        self._credentials = None
        self._last_refresh = 0
        self.sm = boto3.client('secretsmanager')
    
    def get(self):
        if time.time() - self._last_refresh > self.refresh_interval:
            self._refresh()
        return self._credentials
    
    def _refresh(self):
        response = self.sm.get_secret_value(SecretId=self.secret_name)
        self._credentials = json.loads(response['SecretString'])
        self._last_refresh = time.time()
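
Periodic refresh still leaves a window of up to `refresh_interval` seconds in which the cached credentials are already invalid. One possible way to close it is to refresh on the first authentication failure and retry once. The sketch below is backend-agnostic and makes several assumptions: `fetch` is any callable that returns current credentials, `connect_fn` is any callable that opens a connection, and `PermissionError` is only a placeholder for whatever exception your database driver raises on rejected credentials.

```python
import time


class RefreshOnFailure:
    """Wraps a credential fetcher and re-fetches when a connect attempt is rejected."""

    def __init__(self, fetch, refresh_interval=300, clock=time.monotonic):
        self._fetch = fetch              # hypothetical callable returning credentials
        self._interval = refresh_interval
        self._clock = clock
        self._credentials = None
        self._fetched_at = None

    def get(self, force=False):
        stale = (
            self._fetched_at is None
            or self._clock() - self._fetched_at > self._interval
        )
        if force or stale:
            self._credentials = self._fetch()
            self._fetched_at = self._clock()
        return self._credentials

    def connect(self, connect_fn, auth_error=PermissionError):
        """Try cached credentials first; on an auth error, refresh once and retry."""
        try:
            return connect_fn(self.get())
        except auth_error:
            return connect_fn(self.get(force=True))
```

Retrying only once keeps a genuinely wrong secret from turning into a retry loop against the database.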

Experiment 5: Authorization Cache Poisoning

Hypothesis: When a user's permissions are revoked, they lose access within 60 seconds — cached permissions do not allow continued access after revocation.

import time

import requests

BASE_URL = 'http://api.staging'

# create_test_user, login, revoke_role and delete_test_user are
# project-specific helpers, not shown here.

def test_permission_revocation_propagates():
    """Verify that permission revocation is reflected within 60 seconds."""

    # Create test user with admin access
    user = create_test_user(role='admin')
    try:
        token = login(user)
        headers = {'Authorization': f'Bearer {token}'}

        # Verify admin access works
        resp = requests.get(f'{BASE_URL}/admin/users', headers=headers, timeout=5)
        assert resp.status_code == 200, 'Admin access should work before revocation'

        # Revoke admin permissions
        revoke_role(user.id, 'admin')

        # Poll until access is denied or timeout
        start = time.time()
        revocation_time = None

        while time.time() - start < 60:
            resp = requests.get(f'{BASE_URL}/admin/users', headers=headers, timeout=5)
            if resp.status_code == 403:
                revocation_time = time.time() - start
                break
            time.sleep(5)

        assert revocation_time is not None, 'Access was still granted 60 seconds after revocation'
        print(f"Permission revocation propagated in {revocation_time:.1f}s")
    finally:
        # Cleanup runs even when an assertion fails
        delete_test_user(user.id)

Common finding: Applications that cache authorization decisions (for performance) don't invalidate the cache on permission changes. This means a revoked user can continue accessing resources until the cache expires.

Fix: Use cache invalidation events. When permissions change, publish an event that clears affected cache entries:

# On permission change:
def revoke_user_role(user_id, role):
    db.execute('DELETE FROM user_roles WHERE user_id = ? AND role = ?', [user_id, role])
    redis_client.delete(f'user_permissions:{user_id}')  # invalidate permission cache
    event_bus.publish('permission.revoked', {'user_id': user_id, 'role': role})
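
Invalidation events can be lost, so it helps to cap the cache lifetime below the revocation target as a backstop. The following is a minimal in-process sketch under stated assumptions: `PermissionCache` and its `load` callable are hypothetical, and the 30-second TTL is just an example value under the 60-second hypothesis.

```python
import time


class PermissionCache:
    """In-process permission cache with a hard TTL and explicit invalidation."""

    def __init__(self, load, ttl=30, clock=time.monotonic):
        self._load = load      # hypothetical loader: user_id -> set of roles
        self._ttl = ttl
        self._clock = clock
        self._entries = {}     # user_id -> (roles, cached_at)

    def roles(self, user_id):
        entry = self._entries.get(user_id)
        if entry is None or self._clock() - entry[1] > self._ttl:
            entry = (self._load(user_id), self._clock())
            self._entries[user_id] = entry
        return entry[0]

    def invalidate(self, user_id):
        """Call from the 'permission.revoked' event handler."""
        self._entries.pop(user_id, None)
```

With both mechanisms in place, the event gives near-immediate revocation in the normal case and the TTL bounds the worst case when an event never arrives.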

Security Chaos Testing Environment Requirements

Security chaos experiments must run in strictly isolated environments:

Never run against production. Authentication failure injection and rate limit testing must only target isolated staging.

Use synthetic test users. Create dedicated test accounts for chaos experiments. Never use real user credentials.

Audit all chaos actions. Every security chaos experiment should be logged in your audit trail — what experiment ran, when, by whom.

Scope your blast radius. A rate limit flood test should only affect the specific test endpoint, not all endpoints for all users.

Clean up test users. Test accounts created for security chaos experiments must be deleted at experiment end to prevent accumulation of privileged test credentials.
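
Two of these requirements, refusing non-staging targets and auditing every run, can be enforced in code rather than by convention. The sketch below is one possible shape, not a prescribed implementation. The allowlist contents, the logger name `chaos.audit`, and both function names are assumptions for illustration.

```python
import getpass
import json
import logging
from contextlib import contextmanager
from datetime import datetime, timezone
from urllib.parse import urlparse

ALLOWED_HOSTS = {"api.staging"}  # assumption: hosts that experiments may target

audit_log = logging.getLogger("chaos.audit")


def assert_target_allowed(url, allowed_hosts=ALLOWED_HOSTS):
    """Refuse to run against any host that is not explicitly allowlisted."""
    host = urlparse(url).hostname
    if host not in allowed_hosts:
        raise RuntimeError(f"refusing to run chaos experiment against {host!r}")
    return host


@contextmanager
def audited_experiment(name, target_url, operator=None):
    """Log who ran which experiment against what, and how it ended."""
    record = {
        "experiment": name,
        "target": assert_target_allowed(target_url),
        "operator": operator or getpass.getuser(),
        "started_at": datetime.now(timezone.utc).isoformat(),
    }
    audit_log.info("start %s", json.dumps(record))
    try:
        yield record
        record["outcome"] = "completed"
    except Exception as exc:
        record["outcome"] = f"error: {exc}"
        raise
    finally:
        audit_log.info("end %s", json.dumps(record))
```

Wrapping an experiment in `with audited_experiment("rate-limit-flood", "http://api.staging/search"):` would then fail before any traffic is sent if the URL points anywhere outside the allowlist.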

The Security Chaos Engineering Mindset

Security chaos engineering is about asking "how does our system behave under attack?" rather than "does our security configuration pass a checklist?"

The most valuable findings are the ones that don't appear in static security audits: the authentication bypass that only occurs when the auth service is slow, the authorization cache that doesn't invalidate on permission revocation, the rate limiter that fails silently under high load.

These vulnerabilities exist in production systems today. Security chaos engineering surfaces them in controlled conditions before attackers find them.

