Security Chaos Engineering: Testing Your System's Resilience to Attacks
Traditional chaos engineering focuses on infrastructure failure: pods terminating, networks degrading, databases slowing down. Security chaos engineering extends this to adversarial conditions: authentication systems under attack, authorization bypasses attempted, API rate limits hammered.
The question isn't just "does the system survive?" — it's "does the system secure itself correctly when under attack?"
What Security Chaos Engineering Tests
Security chaos engineering validates a different set of hypotheses than reliability chaos:
Authentication resilience: When authentication servers are unavailable, does the application fail closed (deny all access) or fail open (allow unauthenticated access)?
Authorization under load: When the authorization service is slow, do applications wait for a decision (and deny the request if the check times out) or short-circuit and skip the check?
Rate limiting effectiveness: When API endpoints receive 10,000 requests per minute, do rate limits correctly throttle, or do they silently fail to protect?
Certificate expiration handling: When a TLS certificate nears expiry, does monitoring detect it and alert before the certificate expires and users are affected?
Secret rotation resilience: When credentials are rotated, do services pick up new credentials without downtime?
DDoS mitigation: When synthetic DDoS traffic is injected, do protection mechanisms activate and limit blast radius?
Each is a testable hypothesis with a clear pass/fail criterion.
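One way to keep the pass/fail criterion explicit is to write each hypothesis down as data plus a predicate. The sketch below is illustrative only: the class name, the observation keys (`unauthenticated_status`, `rate_limited_count`), and the two sample hypotheses are assumptions made for this example, not part of any chaos tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SecurityHypothesis:
    """One security chaos hypothesis with an explicit pass/fail check."""
    name: str
    fault: str                      # the condition injected during the experiment
    passes: Callable[[dict], bool]  # predicate over the observed result


# Hypothetical examples; the observation keys are assumptions for this sketch.
HYPOTHESES = [
    SecurityHypothesis(
        name="auth fails closed",
        fault="auth service scaled to zero",
        passes=lambda obs: obs["unauthenticated_status"] in (401, 503),
    ),
    SecurityHypothesis(
        name="rate limiter activates",
        fault="flood from a single client",
        passes=lambda obs: obs["rate_limited_count"] > 0,
    ),
]


def evaluate(hypotheses, observations):
    """Return {hypothesis name: True/False} for every hypothesis that has an observation."""
    return {
        h.name: h.passes(observations[h.name])
        for h in hypotheses
        if h.name in observations
    }
```

Feeding `evaluate` the observations collected during a run gives a per-hypothesis verdict that can gate a pipeline or feed a report.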
Experiment 1: Authentication Service Unavailability
Hypothesis: When the authentication service is unavailable, all protected API endpoints return 401/503 — no endpoint becomes accessible without authentication.
This tests the critical "fail closed" property. If your application bypasses authentication checks when the auth service is slow or down, you have a catastrophic security vulnerability that only manifests under failure conditions.
# Chaos experiment: auth service unavailable
title: Auth service unavailability
description: Verify system fails closed when auth is unavailable
steady-state-hypothesis:
  title: Protected endpoints require authentication
  probes:
    - name: unauthenticated-request-rejected
      type: probe
      tolerance: 401
      provider:
        type: http
        url: http://api.staging/users/profile
        # No Authorization header
    - name: authenticated-request-succeeds
      type: probe
      tolerance: 200
      provider:
        type: http
        url: http://api.staging/users/profile
        headers:
          Authorization: Bearer ${VALID_TOKEN}
method:
  - type: action
    name: scale-auth-service-to-zero
    provider:
      type: python
      module: chaosk8s.deployment.actions
      func: scale_deployment
      arguments:
        name: auth-service
        replicas: 0
        ns: staging
    pauses:
      after: 10
  # Critical: verify protected endpoints are NOT accessible without auth
  - type: probe
    name: protected-endpoint-still-requires-auth
    tolerance: [401, 503]  # either is acceptable; 200 is not
    provider:
      type: http
      url: http://api.staging/users/profile
      timeout: 10
  - type: probe
    name: admin-endpoint-inaccessible
    tolerance: [401, 503]
    provider:
      type: http
      url: http://api.staging/admin/users
rollbacks:
  - type: action
    name: restore-auth-service
    provider:
      type: python
      module: chaosk8s.deployment.actions
      func: scale_deployment
      arguments:
        name: auth-service
        replicas: 3
        ns: staging

What failure looks like: If protected endpoints return 200 without authentication while the auth service is down, the hypothesis fails and the application is failing open. That is a critical security bug: invisible in normal operation, and exploitable whenever the auth service degrades.
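When many endpoints are probed during the outage, it helps to classify the responses in one place. The following is a minimal sketch, assuming the probe results have already been collected into a dictionary of URL to status code; the function name and the example URLs are hypothetical. It treats only 401 and 503 as "closed", following the hypothesis above, so anything else (including redirects) is flagged for review.

```python
# Status codes that count as "closed" while the auth service is down.
FAIL_CLOSED_STATUSES = {401, 503}


def find_fail_open(observed):
    """Return the URLs whose unauthenticated response was not a denial.

    `observed` maps URL -> HTTP status code, or None when the request failed
    at the network level (no response means no access, so it counts as closed).
    """
    return sorted(
        url
        for url, status in observed.items()
        if status is not None and status not in FAIL_CLOSED_STATUSES
    )


if __name__ == "__main__":
    observed = {
        "http://api.staging/users/profile": 401,
        "http://api.staging/admin/users": 200,   # fail-open: served without auth
        "http://api.staging/orders": None,       # connection error
    }
    print(find_fail_open(observed))
```

An empty list means every probed endpoint failed closed; any entry in the list is a finding to investigate before the experiment is considered passed.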
Implementation fix: Token verification code should reject the request when it cannot validate the token, rather than assuming the token is valid:

# Wrong: fails open when auth service is unavailable
async def verify_token(token):
    try:
        user = await auth_service.verify(token)
        return user
    except AuthServiceUnavailable:
        return None  # 🚨 returns None = unauthenticated, caller may allow through


# Correct: fails closed
async def verify_token(token):
    try:
        user = await auth_service.verify(token)
        return user
    except AuthServiceUnavailable:
        raise UnauthorizedError("Authentication service unavailable")  # 401

Experiment 2: Rate Limiting Under Flood
Hypothesis: When an API endpoint receives more than 1,000 requests per minute from a single IP, the rate limiter activates within 5 seconds and blocks subsequent requests with 429 responses. Legitimate users from other IPs are not affected.
# chaos/rate_limit_flood.py
import asyncio
import time

import aiohttp


async def fetch_status(session, url, headers=None):
    """Issue one GET and return the HTTP status, releasing the connection."""
    async with session.get(url, headers=headers) as resp:
        return resp.status


async def flood_endpoint(url, requests_per_second=20, duration=30):
    """Send rapid requests to trigger rate limiting."""
    results = {'success': 0, 'rate_limited': 0, 'errors': 0, 'first_429_at': None}
    # Only simulates a distinct client if the limiter trusts X-Forwarded-For
    headers = {'X-Forwarded-For': '1.2.3.4'}
    start = time.time()
    async with aiohttp.ClientSession() as session:
        while time.time() - start < duration:
            tasks = [fetch_status(session, url, headers) for _ in range(requests_per_second)]
            statuses = await asyncio.gather(*tasks, return_exceptions=True)
            for status in statuses:
                if isinstance(status, Exception):
                    results['errors'] += 1
                elif status == 429:
                    results['rate_limited'] += 1
                    if results['first_429_at'] is None:
                        # Seconds from flood start to the first throttled batch
                        results['first_429_at'] = time.time() - start
                elif status == 200:
                    results['success'] += 1
            await asyncio.sleep(1)
    return results


async def check_legitimate_traffic(url, duration=30):
    """Send one request per second as a second client and count 429 responses."""
    results = {'ok': 0, 'blocked': 0, 'errors': 0}
    headers = {'X-Forwarded-For': '5.6.7.8'}
    start = time.time()
    async with aiohttp.ClientSession() as session:
        while time.time() - start < duration:
            try:
                status = await fetch_status(session, url, headers)
            except (aiohttp.ClientError, asyncio.TimeoutError):
                results['errors'] += 1
            else:
                if status == 429:
                    results['blocked'] += 1
                elif status == 200:
                    results['ok'] += 1
            await asyncio.sleep(1)
    return results


async def run_experiment():
    url = 'http://api.staging/search'
    # Flood from the "attacker" IP while "legitimate" traffic runs alongside it
    flood_results, legitimate_results = await asyncio.gather(
        flood_endpoint(url, requests_per_second=50, duration=30),
        check_legitimate_traffic(url, duration=30),
    )

    # Assertions
    assert flood_results['rate_limited'] > 0, "Rate limiting never activated"
    assert legitimate_results['blocked'] == 0, "Legitimate traffic was incorrectly blocked"
    activation_time = flood_results['first_429_at']
    assert activation_time < 5, \
        f"Rate limiting took {activation_time:.1f}s to activate (max: 5s)"


asyncio.run(run_experiment())

Common finding: Rate limiters that use in-memory counters don't work in multi-instance deployments. Each instance keeps its own counter, so requests spread across N instances can reach roughly N times the intended limit. Use Redis-backed rate limiting:
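# Redis-backed fixed-window rate limiting with a Lua script (atomic operation)
import redis

redis_client = redis.Redis(host='redis.internal')

# INCR and EXPIRE run atomically. The TTL is set only when the window starts,
# so a steady flood cannot keep pushing the expiry forward.
INCR_WITH_TTL = redis_client.register_script("""
local count = redis.call('INCR', KEYS[1])
if count == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return count
""")


def is_rate_limited(ip, limit=100, window=60):
    key = f'rate_limit:{ip}'
    count = INCR_WITH_TTL(keys=[key], args=[window])
    return count > limit

Experiment 3: TLS Certificate Expiry Detection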
Hypothesis: When a TLS certificate is within 30 days of expiration, an alert fires automatically before users are affected.
This experiment doesn't actually expire a certificate — instead, it validates that your monitoring detects impending expiry.
#!/bin/bash
# Check certificate expiry for your domains (relies on GNU date's -d option)
check_cert_expiry() {
    local domain=$1
    local threshold_days=$2
    local expiry expiry_epoch now_epoch days_until_expiry

    expiry=$(echo | openssl s_client -connect "$domain:443" -servername "$domain" 2>/dev/null \
        | openssl x509 -noout -enddate 2>/dev/null \
        | cut -d= -f2)

    # Treat an unreadable certificate as an alert rather than a silent pass
    if [ -z "$expiry" ]; then
        echo "ALERT: Could not read certificate for $domain"
        return 1
    fi

    expiry_epoch=$(date -d "$expiry" +%s)
    now_epoch=$(date +%s)
    days_until_expiry=$(( (expiry_epoch - now_epoch) / 86400 ))

    echo "Domain: $domain | Expires: $expiry | Days remaining: $days_until_expiry"

    if [ "$days_until_expiry" -lt "$threshold_days" ]; then
        echo "ALERT: Certificate expires in $days_until_expiry days (threshold: $threshold_days)"
        return 1
    fi
    return 0
}

# Check all your domains
DOMAINS=("api.yourapp.com" "app.yourapp.com" "admin.yourapp.com")
THRESHOLD=30

for domain in "${DOMAINS[@]}"; do
    check_cert_expiry "$domain" "$THRESHOLD"
done

Run this in CI or as a cron job that alerts your ops team.
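For monitoring written in Python, the same threshold check can be sketched with the standard library. This is a minimal illustration, not a replacement for the script above: the function names are hypothetical, and `fetch_not_after` uses the default TLS context, so it raises a verification error for self-signed or already expired certificates instead of returning a date.

```python
import socket
import ssl
import time


def fetch_not_after(host, port=443, timeout=5):
    """Return the certificate's notAfter string, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now=None):
    """Whole days between `now` (epoch seconds) and the notAfter timestamp."""
    expiry_epoch = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return int((expiry_epoch - now) // 86400)


def needs_alert(not_after, threshold_days=30, now=None):
    """True when the certificate expires in fewer than `threshold_days` days."""
    return days_until_expiry(not_after, now) < threshold_days
```

Keeping the date arithmetic separate from the network call makes the threshold logic easy to unit test with a fixed `now`.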
For chaos testing: use a short-lived certificate in staging to verify your alerting actually fires:
# Generate a certificate that expires in 5 days (for testing)
openssl req -x509 -newkey rsa:4096 -keyout test-key.pem -out test-cert.pem \
    -days 5 -nodes -subj "/CN=api.staging.yourapp.com"
# Deploy to staging, verify alert fires

Experiment 4: Secret Rotation Resilience
Hypothesis: When database credentials are rotated, the application reconnects with new credentials within 60 seconds without requiring a deployment.
# chaos/secret_rotation.py
import subprocess
import sys
import time

import requests


def rotate_database_password():
    """Rotate the staging database password via AWS Secrets Manager."""
    subprocess.run([
        'aws', 'secretsmanager', 'rotate-secret',
        '--secret-id', 'staging/db-password',
        '--region', 'us-east-1',
    ], check=True)


def wait_for_reconnection(api_url, timeout=120):
    """Poll /health until the database reports connected; return seconds elapsed."""
    start = time.time()
    while time.time() - start < timeout:
        try:
            resp = requests.get(f'{api_url}/health', timeout=5)
            if resp.status_code == 200:
                health = resp.json()
                if health.get('database') == 'connected':
                    return time.time() - start  # time to reconnect
        except requests.exceptions.RequestException:
            pass
        time.sleep(5)
    return None  # timeout: failed to reconnect


# Run experiment
print("Rotating database credentials...")
rotate_database_password()

# Caveat: pooled connections may keep working briefly after rotation, so the
# health check is only meaningful if it opens a fresh database connection.
print("Waiting for application to reconnect...")
reconnect_time = wait_for_reconnection('http://api.staging')

if reconnect_time is None:
    print("FAIL: Application did not reconnect within 120 seconds")
    sys.exit(1)
if reconnect_time >= 60:
    print(f"FAIL: Reconnection took too long: {reconnect_time:.1f}s (max: 60s)")
    sys.exit(1)
print(f"PASS: Application reconnected in {reconnect_time:.1f} seconds")

Common finding: Applications that read credentials at startup and cache them indefinitely fail when credentials rotate. Use the AWS Secrets Manager SDK or Vault Agent to fetch fresh credentials periodically:
import boto3
import json
import time


class RotatingCredentialManager:
    def __init__(self, secret_name, refresh_interval=300):
        self.secret_name = secret_name
        self.refresh_interval = refresh_interval
        self._credentials = None
        self._last_refresh = 0
        self.sm = boto3.client('secretsmanager')

    def get(self):
        # Re-fetch once the cached credentials are older than the refresh interval
        if time.time() - self._last_refresh > self.refresh_interval:
            self._refresh()
        return self._credentials

    def _refresh(self):
        response = self.sm.get_secret_value(SecretId=self.secret_name)
        self._credentials = json.loads(response['SecretString'])
        self._last_refresh = time.time()

Experiment 5: Authorization Cache Poisoning
Hypothesis: When a user's permissions are revoked, they lose access within 60 seconds — cached permissions do not allow continued access after revocation.
import time

import requests

BASE_URL = 'http://api.staging'  # same staging API as the earlier experiments


def test_permission_revocation_propagates():
    """Verify that permission revocation is reflected within 60 seconds."""
    # create_test_user, login, revoke_role and delete_test_user are assumed
    # to be helpers from your own test suite.
    user = create_test_user(role='admin')
    try:
        token = login(user)
        headers = {'Authorization': f'Bearer {token}'}

        # Verify admin access works
        resp = requests.get(f'{BASE_URL}/admin/users', headers=headers, timeout=5)
        assert resp.status_code == 200, 'Admin access should work before revocation'

        # Revoke admin permissions
        revoke_role(user.id, 'admin')

        # Poll until access is denied or timeout
        start = time.time()
        revocation_time = None
        while time.time() - start < 60:
            resp = requests.get(f'{BASE_URL}/admin/users', headers=headers, timeout=5)
            if resp.status_code == 403:
                revocation_time = time.time() - start
                break
            time.sleep(5)

        assert revocation_time is not None, \
            'Access was still granted 60 seconds after revocation'
        print(f"Permission revocation propagated in {revocation_time:.1f}s")
    finally:
        # Cleanup runs even when an assertion fails
        delete_test_user(user.id)

Common finding: Applications that cache authorization decisions (for performance) often don't invalidate the cache when permissions change, so a revoked user keeps access until the cache entry expires.
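The finding can be reproduced without any infrastructure. The sketch below is a hypothetical in-memory TTL cache (the class name, the `load_roles` callback, and the injectable clock are all assumptions for illustration) showing how a revoked role stays visible until the entry expires or is explicitly invalidated.

```python
import time


class PermissionCache:
    """Tiny TTL cache of user roles with an injectable clock for testing."""

    def __init__(self, load_roles, ttl=300, clock=time.monotonic):
        self._load_roles = load_roles   # callable: user_id -> iterable of roles
        self._ttl = ttl
        self._clock = clock
        self._entries = {}              # user_id -> (roles, fetched_at)

    def roles(self, user_id):
        entry = self._entries.get(user_id)
        if entry is None or self._clock() - entry[1] >= self._ttl:
            # Copy the roles so later changes in the source do not leak in
            entry = (frozenset(self._load_roles(user_id)), self._clock())
            self._entries[user_id] = entry
        return entry[0]

    def invalidate(self, user_id):
        self._entries.pop(user_id, None)


if __name__ == "__main__":
    store = {"u1": {"admin"}}
    now = [0.0]
    cache = PermissionCache(lambda uid: store.get(uid, set()), ttl=300, clock=lambda: now[0])

    cache.roles("u1")                 # warm the cache
    store["u1"].discard("admin")      # revoke in the source of truth only
    now[0] = 60.0
    print("admin" in cache.roles("u1"))   # still cached 60s after revocation
    cache.invalidate("u1")
    print("admin" in cache.roles("u1"))   # reloaded from the source of truth
```

With a 300-second TTL, the revoked user keeps admin access for up to five minutes unless something calls `invalidate`, which is exactly the window the experiment is designed to detect.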
Fix: Use cache invalidation events. When permissions change, publish an event that clears affected cache entries:
# On permission change:
def revoke_user_role(user_id, role):
    db.execute('DELETE FROM user_roles WHERE user_id = ? AND role = ?', [user_id, role])
    redis_client.delete(f'user_permissions:{user_id}')  # invalidate permission cache
    event_bus.publish('permission.revoked', {'user_id': user_id, 'role': role})

Security Chaos Testing Environment Requirements
Security chaos experiments must run in strictly isolated environments:
Never run against production. Authentication failure injection and rate limit testing must only target an isolated staging environment.
Use synthetic test users. Create dedicated test accounts for chaos experiments. Never use real user credentials.
Audit all chaos actions. Every security chaos experiment should be logged in your audit trail — what experiment ran, when, by whom.
Scope your blast radius. A rate limit flood test should only affect the specific test endpoint, not all endpoints for all users.
Clean up test users. Test accounts created for security chaos experiments must be deleted at experiment end to prevent accumulation of privileged test credentials.
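Two of these requirements, staging-only targets and an audit trail, are easy to enforce in code before any experiment starts. The following is a minimal sketch under stated assumptions: the allowlist contents, the exception name, and the audit record fields are hypothetical and should be adapted to your own environment and logging pipeline.

```python
import json
import time
from urllib.parse import urlparse

# Assumption: the only hosts chaos experiments are allowed to target.
ALLOWED_HOSTS = frozenset({"api.staging"})


class UnsafeTargetError(RuntimeError):
    """Raised when an experiment is pointed at a host outside the allowlist."""


def assert_safe_target(url, allowed_hosts=ALLOWED_HOSTS):
    """Return the target host, or raise if it is not an approved staging host."""
    host = urlparse(url).hostname
    if host not in allowed_hosts:
        raise UnsafeTargetError(f"Refusing to run chaos experiment against {host!r}")
    return host


def audit_record(experiment, operator, target_url, now=None):
    """Build one JSON audit line: what ran, when, by whom, against which host."""
    return json.dumps(
        {
            "experiment": experiment,
            "operator": operator,
            "target": assert_safe_target(target_url),
            "started_at": time.time() if now is None else now,
        },
        sort_keys=True,
    )
```

Calling `audit_record` first in every experiment script means a run against an unapproved host fails before any fault is injected, and every approved run leaves a record.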
The Security Chaos Engineering Mindset
Security chaos engineering is about asking "how does our system behave under attack?" rather than "does our security configuration pass a checklist?"
The most valuable findings are the ones that don't appear in static security audits: the authentication bypass that only occurs when the auth service is slow, the authorization cache that doesn't invalidate on permission revocation, the rate limiter that fails silently under high load.
These vulnerabilities exist in production systems today. Security chaos engineering surfaces them in controlled conditions before attackers find them.