Chaos Engineering for Databases: Testing How Your App Survives Data Layer Failures

Database failures are uniquely destructive. Most applications can degrade gracefully when a downstream API is slow — retry logic, circuit breakers, and fallback responses let users continue with reduced functionality. But when the database fails, most applications fail completely.

This makes the database the highest-value target for chaos engineering. Understanding exactly how your application behaves when the database is unavailable, slow, or returning errors is essential knowledge — and knowledge most teams only gain after a production incident.

Database Failure Modes Worth Testing

Before designing experiments, understand the failure modes that actually occur in production:

Connection pool exhaustion: Too many concurrent queries saturate the connection pool. New requests hang waiting for a connection rather than failing fast.

Replication lag: Read replicas fall behind the primary. Applications routing reads to replicas see stale data. In the worst case, you query for a row that was written seconds ago and it has not reached the replica yet, so it appears not to exist.

Slow query storms: A missing index, a new query pattern, or a large dataset causes queries to slow down. These often cascade — slower queries hold connections longer, exhausting the pool.

Primary failover: In high-availability setups, the primary database fails and a replica is promoted. Applications must handle the brief disconnection during promotion.

Disk full: The database runs out of disk space. Writes fail; reads may still succeed. Error messages are often confusing.

Network partition: The application loses network connectivity to the database. Connection timeouts fire after your TCP timeout (often 30-60 seconds without explicit configuration).

Each is a distinct failure mode requiring different application-level handling.
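As a small illustration of failing fast during a network partition, the sketch below checks TCP reachability with an explicit timeout instead of relying on the operating system default. The function name `can_reach` and the 3-second default are assumptions for illustration, not part of any library.

```python
import socket

def can_reach(host: str, port: int, timeout_seconds: float = 3.0) -> bool:
    """Return True if a TCP connection opens within timeout_seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts
        return False
```

A health check could call `can_reach('staging-db.internal', 5432)` and report the database as unreachable after 3 seconds rather than after the TCP default.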

Experiment 1: Connection Pool Exhaustion

Connection pool exhaustion is among the most common database-related production incidents. When your connection pool is full, new database requests wait (or fail immediately, depending on configuration) rather than executing.

Hypothesis: When the database connection pool is 90% utilized, the API will return 503 errors with an appropriate error message rather than hanging indefinitely.

Setup: Saturate connections using parallel long-running queries.

-- Create a stored procedure that holds a connection
CREATE OR REPLACE PROCEDURE hold_connection(seconds INT)
LANGUAGE plpgsql AS $$
BEGIN
    PERFORM pg_sleep(seconds);
END;
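$$;

# Chaos experiment: saturate connection pool
import os
import threading
import time

import psycopg2

def hold_connection(duration_seconds=30):
    """Open one connection and keep it busy for duration_seconds."""
    conn = psycopg2.connect(
        host='staging-db.internal',
        dbname='app',
        user='chaos_user',
        password=os.environ['CHAOS_DB_PASSWORD'],
    )
    try:
        with conn.cursor() as cur:
            # Pass the duration as a bound parameter rather than formatting it into the SQL
            cur.execute('CALL hold_connection(%s)', (duration_seconds,))
    finally:
        conn.close()  # release the connection even if the call fails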

# Spawn threads to hold connections
MAX_POOL_SIZE = 20
holder_threads = []

for i in range(int(MAX_POOL_SIZE * 0.9)):  # hold 90% of pool
    t = threading.Thread(target=hold_connection, args=(60,))
    t.daemon = True
    t.start()
    holder_threads.append(t)

print(f"Holding {len(holder_threads)} connections for 60 seconds...")
time.sleep(5)  # give connections time to be established

# Now verify application behavior under pool exhaustion
# (external: check health endpoint, run functional tests)

What to observe:

  • Does your API return 503 or time out? Returning 503 is the correct behavior.
  • Does the error message expose internal connection details? It shouldn't.
  • How long does a request wait before giving up? Should be under 5 seconds.
  • Do connection pool metrics alert? They should.
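The first and third observations can be checked with a small probe run while the connections are held. This is a minimal sketch using only the standard library; `probe`, `verdict`, and the outcome labels are hypothetical names, and the 5-second limit mirrors the target in the list above.

```python
import time
import urllib.error
import urllib.request
from typing import Tuple

def probe(url: str, client_timeout: float = 10.0) -> Tuple[str, float]:
    """Send one GET and report (outcome, elapsed_seconds)."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=client_timeout) as resp:
            outcome = f"http_{resp.status}"
    except urllib.error.HTTPError as err:
        outcome = f"http_{err.code}"   # 4xx/5xx responses arrive as HTTPError
    except OSError:
        outcome = "no_response"        # timeout, reset, or refused connection
    return outcome, time.monotonic() - started

def verdict(outcome: str, elapsed: float, max_wait: float = 5.0) -> str:
    """Classify a probe result against the fail-fast expectation."""
    if outcome == "http_503":
        return "pass" if elapsed <= max_wait else "slow_failure"
    if outcome == "no_response":
        return "hang"
    return "unexpected"
```

For example, `verdict(*probe('http://api.staging/health'))` returning `"hang"` suggests the request waited on the pool until the client gave up.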

Common finding: Applications without explicit connection timeout configuration hang for 30-60 seconds (TCP default) before failing. Set connection pool wait timeout explicitly:

# SQLAlchemy: explicit connection pool configuration
from sqlalchemy import create_engine

engine = create_engine(
    DATABASE_URL,
    pool_size=10,
    max_overflow=5,
    pool_timeout=3,      # fail after 3 seconds of waiting for connection
    pool_pre_ping=True,  # verify connections before use
)
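To make the effect of a pool wait timeout concrete, here is a toy pool built on the standard library. It is not how SQLAlchemy implements its pool; `TinyPool`, `PoolTimeout`, and `handle_request` are hypothetical names used only to show the fail-fast pattern of turning a wait timeout into a 503.

```python
import queue

class PoolTimeout(Exception):
    """Raised when no connection becomes free within the wait limit."""

class TinyPool:
    def __init__(self, connections, wait_timeout):
        self._free = queue.Queue()
        for conn in connections:
            self._free.put(conn)
        self._wait_timeout = wait_timeout

    def acquire(self):
        try:
            # Block for at most wait_timeout seconds, then give up
            return self._free.get(timeout=self._wait_timeout)
        except queue.Empty:
            raise PoolTimeout(f"no connection free after {self._wait_timeout}s") from None

    def release(self, conn):
        self._free.put(conn)

def handle_request(pool):
    """Return (http_status, body) for one request that needs a connection."""
    try:
        conn = pool.acquire()
    except PoolTimeout:
        # Fail fast with a generic message; no internal details leak out
        return 503, "Service temporarily unavailable"
    try:
        return 200, f"ok via {conn}"
    finally:
        pool.release(conn)
```

With every connection held elsewhere, `handle_request` returns a 503 after the wait timeout instead of blocking indefinitely.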

Experiment 2: Replication Lag

Read replicas are a common scaling strategy. But replication lag means reads from replicas may return stale data. Test your application's behavior when replication lag is significant.

Hypothesis: When replication lag exceeds 10 seconds, the application either reads from the primary or shows users an appropriate "data may be delayed" message rather than silently showing stale data.

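Inject delay using Toxiproxy. Note that a latency toxic on a proxy between the application and the replica delays the replica's responses; it does not make the replica's data stale. Treat it as an approximation of a struggling replica, and use the WAL replay pause further down when you need real staleness: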

# Create a proxy in front of the read replica
toxiproxy-cli create --listen 0.0.0.0:5433 \
  --upstream staging-replica.internal:5432 \
  replica_proxy

# Add latency toxic to simulate replication lag
toxiproxy-cli toxic add --type latency \
  --attribute latency=10000 \
  --attribute jitter=1000 \
  replica_proxy

Alternative: use pg_wal_replay_pause on the replica:

-- Connect to replica and pause WAL replay
SELECT pg_wal_replay_pause();

-- Wait 30 seconds while primary receives writes
-- Then resume
SELECT pg_wal_replay_resume();

What to observe:

  • Does the application read stale data silently?
  • Does replication lag trigger any alerts?
  • Does the application fall back to primary reads when lag is detected?

Common finding: Applications route all reads to replicas without monitoring lag. A user writes a record and immediately queries for it — it's not there. The application needs lag detection logic or should route "read-your-own-writes" queries to the primary.
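One way to sketch the routing decision described above, assuming the application tracks how long ago the current user last wrote and that a monitor reports replica lag in seconds. `choose_read_target` and its parameters are hypothetical; `pg_last_xact_replay_timestamp()` is a PostgreSQL function that can be queried on the replica.

```python
# SQL a monitor could run on the replica to estimate lag in seconds.
# On an idle primary this value keeps growing even when nothing is missing.
REPLICA_LAG_SQL = (
    "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())"
)

def choose_read_target(seconds_since_user_write, replica_lag_seconds,
                       max_lag_seconds=10.0):
    """Return 'primary' or 'replica' for the next read."""
    if replica_lag_seconds is None:
        return "primary"   # lag unknown: be conservative
    if replica_lag_seconds > max_lag_seconds:
        return "primary"   # replica is too far behind for any read
    if (seconds_since_user_write is not None
            and seconds_since_user_write <= replica_lag_seconds):
        return "primary"   # the user's own write may not be replayed yet
    return "replica"
```

The third branch is the read-your-own-writes rule: a user who wrote 2 seconds ago is sent to the primary while the replica is 5 seconds behind.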

Experiment 3: Slow Query Injection

Long-running queries hold connections and cause cascading slowdowns. Test your application's behavior when queries slow down significantly.
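The cascade can be estimated with Little's law: the average number of busy connections equals the request rate multiplied by how long each request holds a connection. The sketch below is illustrative arithmetic; the function names and the traffic figures are assumptions.

```python
def connections_in_use(requests_per_second, query_seconds):
    """Little's law: average busy connections = arrival rate x time held."""
    return requests_per_second * query_seconds

def pool_is_saturated(requests_per_second, query_seconds, pool_capacity):
    """True when steady-state demand meets or exceeds the pool capacity."""
    return connections_in_use(requests_per_second, query_seconds) >= pool_capacity

# Assumed workload: 50 requests/second against a pool of 15 connections.
# At 20 ms per query the pool needs about 1 connection.
# At 2 s per query it would need about 100, far beyond capacity.
healthy = pool_is_saturated(50, 0.02, 15)
degraded = pool_is_saturated(50, 2.0, 15)
```

Here `healthy` is `False` and `degraded` is `True`: a 100x slowdown in query time produces a 100x increase in connections needed, with no change in traffic.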

Using pg_sleep in statements:

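-- PostgreSQL: simulate slow queries for a specific table
-- (staging only, NEVER in production). ON SELECT rules cannot be DO ALSO,
-- so put a read-only view that sleeps in front of the table instead.
BEGIN;
ALTER TABLE users RENAME TO users_real;
CREATE VIEW users AS
    SELECT u.* FROM users_real u CROSS JOIN pg_sleep(2);
COMMIT;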

Using Toxiproxy latency on the database port:

# Add 3-second latency to all database connections
toxiproxy-cli toxic add --type latency \
  --attribute latency=3000 \
  db_proxy

What to observe:

  • Do database statement timeouts fire before HTTP request timeouts? They should, so that an abandoned request does not leave its query running and holding a connection.
  • Are slow queries logged? They should be.
  • Does the connection pool fill up as slow queries hold connections?
  • Does the application have per-query timeout settings?

Configure per-query timeouts:

# Set statement timeout per connection
from sqlalchemy import event

@event.listens_for(engine, 'connect')
def set_statement_timeout(dbapi_connection, connection_record):
    cursor = dbapi_connection.cursor()
    cursor.execute('SET statement_timeout = 5000')  # 5 second timeout
    cursor.close()

Experiment 4: Primary Failover

In HA database setups, primary failover is a normal operation. Test that your application handles the brief disconnection during promotion.

Using AWS RDS Multi-AZ failover:

# Trigger RDS failover (requires AWS CLI)
aws rds reboot-db-instance \
  --db-instance-identifier staging-primary \
  --force-failover

Using Patroni (self-hosted Postgres HA):

# Trigger Patroni failover
patronictl -c /etc/patroni.yaml failover staging-cluster \
  --master primary-node \
  --force

What to observe:

  • How long is the application unavailable? Should be under 60 seconds for most HA setups.
  • Do database connections reconnect automatically?
  • Are active requests at failover time retried or returned as errors?
  • Do health checks detect the failover?
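To answer the first question with a number, poll a health endpoint at a fixed interval during the failover and compute the longest run of failures. The helper below is a sketch; `longest_outage` and the `(timestamp, ok)` sample format are assumptions.

```python
def longest_outage(samples):
    """Longest continuous failure window, in seconds.

    samples: list of (timestamp_seconds, ok) tuples sorted by time.
    An outage starts at the first failed sample and ends at the next
    successful one (or at the last sample if it never recovers).
    """
    longest = 0.0
    outage_start = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    if outage_start is not None:
        # Still failing when sampling stopped
        longest = max(longest, samples[-1][0] - outage_start)
    return longest
```

The resolution is limited by the polling interval, so a 1-second poll cannot distinguish a 2-second outage from a 3-second one.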

Common finding: Applications that cache the database hostname don't reconnect to the new primary automatically. They need DNS-aware connection pooling or health check logic that triggers reconnection.

Connection retry configuration:

# SQLAlchemy: pessimistic ping that recovers from a disconnect
from sqlalchemy import event, exc, select

@event.listens_for(engine, 'engine_connect')
def ping_connection(connection, branch):
    if branch:
        # Branched connections are skipped (only relevant to SQLAlchemy 1.x)
        return
    try:
        connection.scalar(select(1))
    except exc.DBAPIError as err:
        if err.connection_invalidated:
            # The pool has been invalidated; running the SELECT again
            # makes the connection re-establish itself
            connection.scalar(select(1))
        else:
            raise
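Reconnecting at the pool level does not retry the request that was in flight when the primary went away. A minimal sketch of application-level retry with exponential backoff follows; `retry_with_backoff` and its defaults are hypothetical, and it is only safe for idempotent operations.

```python
import time

def retry_with_backoff(operation, attempts=5, base_delay=0.5,
                       retry_on=(ConnectionError,), sleep=time.sleep):
    """Call operation() until it succeeds or attempts run out.

    Waits base_delay, then 2x, 4x, ... between attempts. The final
    failure is re-raised so callers can return an error to the user.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

With the defaults, the waits add up to 7.5 seconds (0.5 + 1 + 2 + 4), so size `attempts` and `base_delay` against the failover duration you measured.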

Experiment 5: Disk Full Simulation

A database running out of disk space causes write failures with often-confusing error messages.

Simulate with a filesystem fill:

# In a staging environment: fill the data volume, leaving roughly 500 MB free.
# The fill file must live on the same filesystem as the PostgreSQL data directory.
fallocate -l $(df --output=avail /var/lib/postgresql | tail -1 | awk '{print $1-500000}')K /var/lib/postgresql/diskfill

# Observe application behavior
# Then clean up:
rm /var/lib/postgresql/diskfill

What to observe:

  • What error does your application surface to users?
  • Are write failures distinguished from read failures?
  • Does the application alert on disk utilization before it's full?

Common finding: Applications surface raw database error messages like "ERROR: could not write to file: No space left on device" to users. Implement error translation to return user-friendly messages.
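Error translation can key off the PostgreSQL SQLSTATE code rather than the message text; psycopg2 exposes it as the `pgcode` attribute on raised exceptions. The mapping below is a sketch: the SQLSTATE codes are PostgreSQL's, while `translate_db_error`, the HTTP statuses, and the message wording are assumptions.

```python
# SQLSTATE -> (HTTP status, message that is safe to show users)
USER_MESSAGES = {
    "53100": (503, "We can't save changes right now. Please try again shortly."),  # disk_full
    "53300": (503, "The service is busy. Please try again shortly."),              # too_many_connections
    "57014": (504, "That request took too long. Please try again."),               # query_canceled
}

DEFAULT_MESSAGE = (500, "Something went wrong on our side.")

def translate_db_error(sqlstate):
    """Map a SQLSTATE code to a user-facing (status, message) pair.

    Unknown or missing codes fall back to a generic 500 so raw
    database text never reaches the response body.
    """
    return USER_MESSAGES.get(sqlstate, DEFAULT_MESSAGE)
```

The raw error should still be logged server-side with full detail; only the response body is sanitized.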

Chaos Toolkit Experiment for Database Failures

# experiments/database-connection-pool.yaml
version: 1.0.0
title: Database connection pool exhaustion
description: Verify API returns 503 when connection pool is exhausted

steady-state-hypothesis:
  title: API is serving requests normally
  probes:
    - name: api-health-check
      type: probe
      tolerance: 200
      provider:
        type: http
        url: http://api.staging/health
        timeout: 5
    - name: create-user-api-works
      type: probe
      tolerance: 201
      provider:
        type: http
        url: http://api.staging/users
        method: POST
        headers:
          Content-Type: application/json
          Authorization: Bearer ${AUTH_TOKEN}
        arguments:
          email: chaos-test@example.com

method:
  - type: action
    name: saturate-connection-pool
    background: true
    provider:
      type: python
      module: experiments.database_chaos
      func: saturate_connections
      arguments:
        connection_string: ${DATABASE_URL}
        num_connections: 18  # 90% of pool size 20
        hold_duration: 60
    pauses:
      after: 5  # wait for connections to be established

  # Probe during the experiment to verify behavior:
  - type: probe
    name: verify-api-returns-503-on-new-requests
    tolerance: 503
    provider:
      type: http
      url: http://api.staging/users
      method: POST

rollbacks: []  # connections release automatically after hold_duration
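The experiment refers to `experiments.database_chaos.saturate_connections` without showing it. One possible shape for that function is sketched below, assuming psycopg2 as the driver. The `connect` parameter is a hypothetical addition so the function can be exercised without a database, and the body uses `pg_sleep` directly rather than the stored procedure.

```python
import threading

def _hold(connect, connection_string, hold_duration):
    """Open one connection and keep it busy with pg_sleep."""
    conn = connect(connection_string)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT pg_sleep(%s)", (hold_duration,))
    finally:
        conn.close()  # always release, even if the query fails

def saturate_connections(connection_string, num_connections, hold_duration,
                         connect=None):
    """Hold num_connections open for hold_duration seconds, then release.

    Returns the number of holder threads started. Blocks until all
    holders finish, which is why the experiment runs it in the background.
    """
    if connect is None:
        import psycopg2  # assumed driver; imported lazily
        connect = psycopg2.connect
    threads = [
        threading.Thread(target=_hold,
                         args=(connect, connection_string, hold_duration))
        for _ in range(num_connections)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return len(threads)
```

Because every holder closes its connection in a `finally` block, the empty rollback list is reasonable: nothing is left to clean up once the sleep ends.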

Database Chaos Testing Checklist

Before considering database chaos tests complete, verify:

  • Connection pool exhaustion → 503 with timeout, not hang
  • Connection pool wait time is configured explicitly (not TCP default)
  • Slow queries → per-query statement timeout fires
  • Slow queries are logged via slow_query_log (MySQL) or log_min_duration_statement (PostgreSQL)
  • Primary failover → reconnection within 60 seconds
  • Replication lag → application handles stale reads
  • Disk full → write failures return clear (not raw DB) errors
  • Connection pool metrics are monitored with alerts

Common Database Chaos Mistakes

Testing against production databases. Database chaos experiments should only run against isolated staging databases with non-production data. Never run connection-holding or latency injection against production.

Not restoring connection limits. Experiments that hold connections must have cleanup logic. A failed experiment that doesn't clean up can leave your staging environment unusable.

Ignoring read replica routing. Applications that mix primary and replica reads need chaos tests for both paths. A test that only exercises primary reads misses replica-specific failures.

No pre-experiment baseline. Measure baseline connection pool utilization, query latency, and error rates before the experiment. Without a baseline, you can't interpret results.
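A baseline comparison can be as simple as computing the same percentile before and during the experiment. The sketch below uses the nearest-rank method for p95; `p95`, `latency_ratio`, and the sample data are illustrative assumptions.

```python
def p95(values):
    """95th percentile by the nearest-rank method."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]

def latency_ratio(baseline_ms, experiment_ms):
    """How many times slower p95 latency is during the experiment."""
    return p95(experiment_ms) / p95(baseline_ms)

# Assumed samples: latencies in milliseconds collected before and during a run
baseline = list(range(1, 101))            # 1..100 ms
experiment = [3 * ms for ms in baseline]  # every request three times slower
ratio = latency_ratio(baseline, experiment)
```

Recording the ratio alongside error rates gives each experiment a number that can be compared across runs.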


