Kubernetes Failover Testing: Simulating Node Failures and Pod Disruptions
Kubernetes is designed for resilience — but resilience requires testing. A pod that doesn't restart cleanly, a deployment that fails to roll over to healthy nodes, a service that routes traffic to terminating pods — these failures are silent until they happen at 3am during peak traffic.
This guide covers systematic Kubernetes failover testing: how to simulate failures at every layer, validate that your cluster responds correctly, and automate these tests before problems reach production.
Kubernetes Failure Modes
Understanding what can fail helps you design tests that cover the right scenarios:
| Layer | Failure | Impact |
|---|---|---|
| Pod | OOM kill, crash, liveness probe failure | Application restarts or traffic rerouted |
| Node | Hardware failure, kernel panic, cordon and drain | Pods rescheduled to other nodes |
| Network | Packet loss, latency spike, partition | Service degradation |
| Storage | PV unavailable, slow disk | Stateful workloads fail |
| Control plane | API server unavailable | No new deployments; existing workloads continue |
| Availability zone | AZ outage | All nodes in AZ lost |
Test each layer. Don't assume Kubernetes handles failures because it's theoretically designed to — verify that it handles your specific workloads with your specific configuration.
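One way to keep "test each layer" honest is to track which layers from the table actually have a test. The sketch below is only an illustration: the `FailoverTest` record, the layer names, and the example suite are hypothetical and not part of any Kubernetes tooling.

```python
from dataclasses import dataclass

# Layers taken from the failure-mode table above (names are illustrative).
LAYERS = ("pod", "node", "network", "storage", "control-plane", "availability-zone")


@dataclass(frozen=True)
class FailoverTest:
    name: str   # hypothetical test identifier
    layer: str  # one of LAYERS


def untested_layers(tests: list) -> list:
    """Return the layers that no registered test exercises, in table order."""
    covered = {t.layer for t in tests}
    return [layer for layer in LAYERS if layer not in covered]


if __name__ == "__main__":
    suite = [
        FailoverTest("pod_restart", "pod"),
        FailoverTest("node_drain", "node"),
        FailoverTest("packet_loss", "network"),
    ]
    print("Layers without a test:", untested_layers(suite))
```

A check like this could run in CI so that a missing layer is flagged before anyone relies on the suite.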
Pod-Level Failure Testing
Validating Pod Restart Behavior
```bash
#!/bin/bash
# test_pod_restart.sh: validate that pods restart and traffic routes correctly

NAMESPACE="production"
DEPLOYMENT="api"
SERVICE_URL="http://api.internal/health"
RTO_SECONDS=30

echo "=== Pod Restart Failover Test ==="

# Record initial pods
INITIAL_PODS=$(kubectl get pods -n "$NAMESPACE" -l app="$DEPLOYMENT" \
  -o jsonpath='{.items[*].metadata.name}')
echo "Initial pods: $INITIAL_PODS"

# Start continuous health monitoring
HEALTH_LOG=$(mktemp)
monitor_health() {
  while true; do
    # curl -w already prints 000 when the request fails, so only swallow the exit code
    STATUS=$(curl -s -o /dev/null -w "%{http_code}" --max-time 2 "$SERVICE_URL" || true)
    echo "$(date +%s) $STATUS" >> "$HEALTH_LOG"
    sleep 0.5
  done
}
monitor_health &
MONITOR_PID=$!
sleep 5

# Kill all pods (force delete simulates a crash)
echo "Deleting all pods..."
KILL_TIME=$(date +%s)
kubectl delete pods -n "$NAMESPACE" -l app="$DEPLOYMENT" --grace-period=0 --force

# Wait for recovery
RECOVERED=false
while [ "$(date +%s)" -lt "$((KILL_TIME + RTO_SECONDS + 10))" ]; do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" --max-time 2 "$SERVICE_URL" || true)
  if [ "$STATUS" = "200" ]; then
    RECOVERY_TIME=$(date +%s)
    ACTUAL_RTO=$((RECOVERY_TIME - KILL_TIME))
    echo "Service recovered in ${ACTUAL_RTO}s"
    RECOVERED=true
    break
  fi
  sleep 1
done

kill "$MONITOR_PID" 2>/dev/null || true

# Analyze downtime (grep -c prints 0 itself when nothing matches)
TOTAL_REQUESTS=$(wc -l < "$HEALTH_LOG")
FAILED=$(grep -c -E ' (000|50[0-9])$' "$HEALTH_LOG" || true)
echo "Downtime requests: ${FAILED}/${TOTAL_REQUESTS}"

# Verify new pods are running
NEW_PODS=$(kubectl get pods -n "$NAMESPACE" -l app="$DEPLOYMENT" \
  -o jsonpath='{.items[*].metadata.name}')
echo "New pods: $NEW_PODS"

# Validate that the running pod count matches the desired replica count
DESIRED=$(kubectl get deployment "$DEPLOYMENT" -n "$NAMESPACE" \
  -o jsonpath='{.spec.replicas}')
RUNNING=$(kubectl get pods -n "$NAMESPACE" -l app="$DEPLOYMENT" \
  --field-selector=status.phase=Running --no-headers | wc -l)
echo "Desired: $DESIRED | Running: $RUNNING"
if [ "$RUNNING" -eq "$DESIRED" ]; then
  echo "PASS: All replicas running"
else
  echo "FAIL: Replica count mismatch"
fi

rm -f "$HEALTH_LOG"

if [ "$RECOVERED" = "false" ]; then
  echo "FAIL: Service did not recover within $((RTO_SECONDS + 10))s"
  exit 1
fi

if [ "$ACTUAL_RTO" -le "$RTO_SECONDS" ]; then
  echo "PASS: RTO met"
else
  echo "WARN: RTO exceeded"
fi
```
Testing Liveness and Readiness Probes
Probes are critical for failover — a pod that fails its liveness check gets restarted; one that fails readiness gets removed from load balancing. Test that they work:
```bash
#!/bin/bash
# test_probes.sh: verify liveness and readiness probe behavior

NAMESPACE="production"
POD_NAME="api-7d9f4b8c6-xk9p2"

echo "=== Liveness Probe Test ==="

# Record the restart count before triggering the failure
RESTART_COUNT_BEFORE=$(kubectl get pod -n "$NAMESPACE" "$POD_NAME" \
  -o jsonpath='{.status.containerStatuses[0].restartCount}')

# Make the app fail its liveness check
kubectl exec -n "$NAMESPACE" "$POD_NAME" -- \
  sh -c "touch /tmp/unhealthy"  # If your app checks for this file

# Wait for Kubernetes to detect the failure and restart the container
echo "Waiting for restart..."
sleep 60  # Longer than the liveness probe's periodSeconds x failureThreshold

RESTART_COUNT_AFTER=$(kubectl get pod -n "$NAMESPACE" "$POD_NAME" \
  -o jsonpath='{.status.containerStatuses[0].restartCount}' 2>/dev/null || echo "pod replaced")
echo "Restart count before: $RESTART_COUNT_BEFORE"
echo "Restart count after: $RESTART_COUNT_AFTER"

echo ""
echo "=== Readiness Probe Test ==="

# Make the app fail readiness (but stay alive)
kubectl exec -n "$NAMESPACE" "$POD_NAME" -- \
  sh -c "touch /tmp/not_ready"
sleep 15

# Check that the pod is NotReady and therefore not receiving traffic
POD_READY=$(kubectl get pod -n "$NAMESPACE" "$POD_NAME" \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
echo "Pod ready status: $POD_READY"
if [ "$POD_READY" = "False" ]; then
  echo "PASS: Pod removed from load balancing"
else
  echo "FAIL: Pod still receiving traffic"
fi

# Verify other pods are still listed as healthy endpoints
kubectl get endpoints -n "$NAMESPACE" api-service
```
Node-Level Failure Testing
Simulating Node Failure
```bash
#!/bin/bash
# test_node_failure.sh: drain a node and measure the impact on the service

NAMESPACE="production"
TARGET_NODE="k8s-worker-3"
SERVICE_URL="http://api.internal/health"

echo "=== Node Failure Test ==="

# Check current pod distribution
echo "Pod distribution before failure:"
kubectl get pods -n "$NAMESPACE" -o wide --no-headers

# Get pods on target node
PODS_ON_NODE=$(kubectl get pods -n "$NAMESPACE" --field-selector spec.nodeName="$TARGET_NODE" \
  -o jsonpath='{.items[*].metadata.name}')
echo "Pods on $TARGET_NODE: $PODS_ON_NODE"

# Start health monitoring
HEALTH_LOG=$(mktemp)
monitor() {
  while true; do
    # curl -w already prints 000 when the request fails, so only swallow the exit code
    CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 2 "$SERVICE_URL" || true)
    echo "$(date +%s) $CODE" >> "$HEALTH_LOG"
    sleep 0.5
  done
}
monitor &
MONITOR_PID=$!
sleep 5

# Simulate node failure by cordoning and draining
echo "Simulating node failure: $TARGET_NODE"
FAILURE_TIME=$(date +%s)

# Option 1: Graceful (eviction with drain)
kubectl drain "$TARGET_NODE" --ignore-daemonsets --delete-emptydir-data --force
echo "Drain finished after $(( $(date +%s) - FAILURE_TIME ))s"

# Option 2: Abrupt (for hardware crash simulation)
# kubectl delete node "$TARGET_NODE"
# ssh "$TARGET_NODE" "sudo poweroff"  # actual machine shutdown

# Wait for pods to reschedule
sleep 30
echo "Pod distribution after failure:"
kubectl get pods -n "$NAMESPACE" -o wide

# Stop monitoring
kill "$MONITOR_PID" 2>/dev/null || true

# Analyze (grep -c prints 0 itself when nothing matches)
TOTAL=$(wc -l < "$HEALTH_LOG")
FAILED=$(grep -c -E ' (000|50[0-9])$' "$HEALTH_LOG" || true)
if [ "$TOTAL" -gt 0 ]; then
  echo "Error rate during node failure: ${FAILED}/${TOTAL} ($(( FAILED * 100 / TOTAL ))%)"
else
  echo "No health checks were recorded"
fi

# Verify all pods are running on remaining nodes
DESIRED=$(kubectl get deployment api -n "$NAMESPACE" -o jsonpath='{.spec.replicas}')
RUNNING=$(kubectl get pods -n "$NAMESPACE" -l app=api --field-selector=status.phase=Running \
  --no-headers | wc -l)
echo "Desired replicas: $DESIRED | Running: $RUNNING"

# Re-add node to cluster
kubectl uncordon "$TARGET_NODE"
rm -f "$HEALTH_LOG"
```
Testing Pod Disruption Budgets
PodDisruptionBudgets (PDBs) ensure a minimum number of pods remain available during voluntary disruptions. Verify yours are configured and working:
```yaml
# pdb.yaml: example PDB
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: production
spec:
  minAvailable: 2  # At least 2 pods must be available during disruptions
  selector:
    matchLabels:
      app: api
```
```bash
# Test PDB enforcement
kubectl get pdb -n production

# Try to drain a node when it would violate the PDB
kubectl drain k8s-worker-2 --ignore-daemonsets --delete-emptydir-data

# Kubernetes should block or slow this down if the PDB would be violated
# Expected output:
# error when evicting pods/"api-7d9f4b8c6-xk9p2" -n "production"
# (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
```
Network Failure Testing
Simulating Network Partition with tc
```bash
#!/bin/bash
# Simulate packet loss between services using tc (traffic control).
# The target container needs the tc binary (iproute2) and the NET_ADMIN capability.

TARGET_POD="api-7d9f4b8c6-xk9p2"
NAMESPACE="production"
DURATION_SECONDS=60
PACKET_LOSS_PERCENT=50

echo "=== Network Partition Test: ${PACKET_LOSS_PERCENT}% packet loss for ${DURATION_SECONDS}s ==="

# Inject packet loss into the target pod's network
kubectl exec -n "$NAMESPACE" "$TARGET_POD" -- \
  tc qdisc add dev eth0 root netem loss "${PACKET_LOSS_PERCENT}%"
echo "Packet loss injected. Testing service behavior..."

# Probe the application until the test duration has elapsed
END_TIME=$(( $(date +%s) + DURATION_SECONDS ))
i=0
while [ "$(date +%s)" -lt "$END_TIME" ]; do
  i=$((i + 1))
  # curl -w already prints 000 when the request fails, so only swallow the exit code
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 "http://api.internal/endpoint" || true)
  echo "Request $i: $STATUS"
  sleep 3
done

# Remove packet loss once the test duration is over
kubectl exec -n "$NAMESPACE" "$TARGET_POD" -- \
  tc qdisc del dev eth0 root
echo "Network partition removed"
```
Using Chaos Mesh for Network Chaos
Chaos Mesh provides Kubernetes-native chaos injection:
```yaml
# network-chaos.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: api-network-delay
  namespace: production
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api
  delay:
    latency: "100ms"
    correlation: "25"
    jitter: "50ms"
  duration: "5m"
  direction: both
```
```bash
# Apply and monitor
kubectl apply -f network-chaos.yaml

# Check application response times during chaos
hey -z 5m -c 10 http://api.internal/endpoint

# Cleanup
kubectl delete networkchaos api-network-delay -n production
```
Availability Zone Failure Simulation
Test that your multi-AZ deployment survives losing an entire AZ:
```bash
#!/bin/bash
# test_az_failure.sh: simulate an AZ outage

FAILED_AZ="us-east-1b"
NAMESPACE="production"

echo "=== AZ Failure Test: Simulating loss of $FAILED_AZ ==="

# Get all nodes in the target AZ
NODES_IN_AZ=$(kubectl get nodes \
  -l "topology.kubernetes.io/zone=$FAILED_AZ" \
  -o jsonpath='{.items[*].metadata.name}')
echo "Nodes in $FAILED_AZ: $NODES_IN_AZ"
if [ -z "$NODES_IN_AZ" ]; then
  echo "No nodes found in $FAILED_AZ, nothing to test"
  exit 1
fi

# Count pods that will be affected (node names joined into an alternation for grep -E)
NODE_PATTERN=$(echo "$NODES_IN_AZ" | tr ' ' '|')
AFFECTED_PODS=$(kubectl get pods -n "$NAMESPACE" -o wide --no-headers | grep -c -E "$NODE_PATTERN" || true)
echo "Pods to be evacuated: $AFFECTED_PODS"

# Start load testing
hey -z 5m -c 50 http://api.internal/endpoint &
HEY_PID=$!

# Simulate AZ failure: cordon and drain all nodes in the AZ
FAILURE_START=$(date +%s)
for NODE in $NODES_IN_AZ; do
  kubectl cordon "$NODE"
done
for NODE in $NODES_IN_AZ; do
  kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --force --timeout=120s
done
DRAIN_END=$(date +%s)
echo "Drain completed in $((DRAIN_END - FAILURE_START))s"

# Wait for pods to settle
sleep 30

# Verify all pods are running on remaining AZs
echo "Pod distribution after AZ failure:"
kubectl get pods -n "$NAMESPACE" -o wide

# Check that remaining nodes are spread across multiple AZs
kubectl get nodes -l "topology.kubernetes.io/zone" \
  --label-columns topology.kubernetes.io/zone

# Wait for load test to finish
wait "$HEY_PID"

echo "Restoring nodes..."
for NODE in $NODES_IN_AZ; do
  kubectl uncordon "$NODE"
done
```
Automating Failover Tests
Run failover tests on a schedule in a staging cluster:
```yaml
# CronJob for weekly failover testing
apiVersion: batch/v1
kind: CronJob
metadata:
  name: failover-test-weekly
  namespace: test-automation
spec:
  schedule: "0 4 * * 0"  # Sundays at 4am
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: failover-test-sa
          containers:
            - name: failover-test
              image: company/failover-test:latest
              command: ["python", "failover_tests.py"]
              env:
                - name: TARGET_NAMESPACE
                  value: "staging"
                - name: NOTIFY_SLACK
                  valueFrom:
                    secretKeyRef:
                      name: slack-webhook
                      key: url
          restartPolicy: Never
```
```python
# failover_tests.py: orchestrate all failover tests
import asyncio
import json
import os
import sys
import urllib.request
from datetime import datetime, timezone

# Project-local test modules; each coroutine returns a dict that includes 'rto_met'
from tests.pod_restart import test_pod_restart
from tests.node_failure import test_node_failure
from tests.network_chaos import test_network_latency


async def notify_slack(results: dict, webhook_url: str) -> None:
    """Post a one-line summary to a Slack incoming webhook (minimal sketch)."""
    status = "PASSED" if results['passed'] else "FAILED"
    payload = json.dumps({'text': f"Failover tests {status} at {results['run_at']}"}).encode()
    request = urllib.request.Request(
        webhook_url, data=payload, headers={'Content-Type': 'application/json'}
    )
    # urllib is blocking, so run the call in a worker thread
    await asyncio.to_thread(urllib.request.urlopen, request, timeout=10)


async def run_all_tests() -> dict:
    namespace = os.environ['TARGET_NAMESPACE']
    results = {
        'run_at': datetime.now(timezone.utc).isoformat(),
        'tests': {}
    }

    # Pod restart test
    results['tests']['pod_restart'] = await test_pod_restart(
        namespace=namespace,
        deployment='api',
        rto_seconds=30
    )

    # Node failure test (one node)
    results['tests']['node_failure'] = await test_node_failure(
        namespace=namespace,
        rto_seconds=120
    )

    # Network chaos test
    results['tests']['network_chaos'] = await test_network_latency(
        namespace=namespace,
        latency_ms=100,
        duration_seconds=60
    )

    # Calculate overall pass/fail
    results['passed'] = all(
        t.get('rto_met', False)
        for t in results['tests'].values()
    )

    # Notify
    await notify_slack(results, os.environ['NOTIFY_SLACK'])
    return results


if __name__ == '__main__':
    results = asyncio.run(run_all_tests())
    print(json.dumps(results, indent=2))
    sys.exit(0 if results['passed'] else 1)
```
What Good Kubernetes Failover Looks Like
Pod failure (single pod in multi-replica deployment):
- Detection: < 5 seconds (requires probes tuned below the defaults of periodSeconds: 10 and failureThreshold: 3)
- Pod removed from LB: < 5 seconds
- New pod started: < 30 seconds
- Zero requests to failed pod after removal
Node failure (one of N nodes):
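- Pod eviction: not immediate. By default the node is marked NotReady after about 40s, and pods are evicted only once their default 300s toleration for the not-ready and unreachable taints expires
- Pod rescheduling: 2-5 minutes with tuned tolerations; closer to 6 minutes end to end with the defaults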
- Service continuity: depends on PDB and replica count
- Full recovery: 5-10 minutes
AZ failure (1 of 3 AZs):
- Traffic automatically routes to remaining AZs (if multi-AZ service configured)
- Pod rescheduling: 5-15 minutes
- Full capacity on remaining AZs: 10-20 minutes
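The detection and eviction figures above follow from a handful of settings, so it can help to estimate them before running a test. The sketch below is an approximation under stated assumptions: probe defaults of `periodSeconds: 10`, `failureThreshold: 3`, and `timeoutSeconds: 1`, the 40s NotReady figure quoted above (check your cluster version), and the default 300s `tolerationSeconds`. The class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeConfig:
    period_seconds: int = 10    # assumed default probe interval
    failure_threshold: int = 3  # assumed default consecutive failures
    timeout_seconds: int = 1    # assumed default per-probe timeout


def worst_case_detection_seconds(probe: ProbeConfig) -> int:
    """Rough upper bound on the time until a probe is considered failed.

    The app breaks just after a successful probe, then fails
    failure_threshold probes spaced period_seconds apart, and the
    last one may take up to timeout_seconds to time out.
    """
    return probe.period_seconds * probe.failure_threshold + probe.timeout_seconds


def node_eviction_delay_seconds(not_ready_after: int = 40, toleration_seconds: int = 300) -> int:
    """Approximate delay between a node dying and its pods being evicted."""
    return not_ready_after + toleration_seconds


if __name__ == "__main__":
    print("Default probe detection:", worst_case_detection_seconds(ProbeConfig()), "s")
    tuned = ProbeConfig(period_seconds=1, failure_threshold=3, timeout_seconds=1)
    print("Tuned probe detection:", worst_case_detection_seconds(tuned), "s")
    print("Default node eviction delay:", node_eviction_delay_seconds(), "s")
```

With the assumed defaults, detection takes roughly 30 seconds rather than 5, which is why a sub-5-second target implies a 1-second probe period or similar tuning.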
If your measured numbers are significantly worse than the targets above, investigate:
- Image pull time (use pre-pulled images or image pull policy optimization)
- Startup probe vs readiness probe (use startup probes for slow-starting apps)
- PDB configuration (too strict PDBs slow down node draining)
- Node affinity rules (may prevent rescheduling if no suitable nodes remain)
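The PDB item in the list above is easy to check with arithmetic: a budget with an integer `minAvailable` permits evictions only while the number of healthy pods exceeds that value. The following is a simplified sketch that ignores percentage values and `maxUnavailable`; the function names are illustrative.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Number of pods that may be evicted right now under an integer minAvailable."""
    return max(0, healthy_pods - min_available)


def drain_would_block(healthy_pods: int, min_available: int) -> bool:
    """True when the budget leaves no room for even one eviction."""
    return allowed_disruptions(healthy_pods, min_available) == 0


if __name__ == "__main__":
    # The example PDB in this article uses minAvailable: 2.
    for replicas in (2, 3, 5):
        print(
            f"replicas={replicas} minAvailable=2 "
            f"allowed={allowed_disruptions(replicas, 2)} "
            f"blocks_drain={drain_would_block(replicas, 2)}"
        )
```

With two replicas and `minAvailable: 2`, no eviction is ever permitted, so a drain keeps retrying until the budget is relaxed or more replicas become healthy.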
Summary
Kubernetes failover testing covers four layers:
- Pod failures — liveness/readiness probes, restart policies
- Node failures — pod rescheduling, PodDisruptionBudgets
- Network failures — packet loss, latency, partition behavior
- AZ failures — multi-AZ coverage, cross-zone traffic routing
Automate what you can. Pod restart tests should run weekly in staging. Node failure tests should run monthly. AZ failure simulation should run quarterly. Each test result should be captured, compared to your RTO/RPO targets, and trigger alerts when thresholds are missed.
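As a sketch of the capture-and-compare step, the function below reads a health log in the `<epoch seconds> <HTTP status>` format written by the test scripts, finds the longest outage, and compares it to an RTO target. It assumes any `000` or 5xx status counts as a failure, which is slightly broader than the 50x pattern in the shell scripts. The function names and result keys other than `rto_met` are illustrative.

```python
from typing import Iterable


def is_failure(status: str) -> bool:
    """Treat no response (000) and any 5xx status as a failed request."""
    return status == "000" or (len(status) == 3 and status.startswith("5"))


def longest_outage_seconds(log_lines: Iterable[str]) -> int:
    """Longest span from a first failed request to the next successful one."""
    longest = 0
    outage_start = None
    last_ts = None
    for line in log_lines:
        parts = line.split()
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        ts, status = int(parts[0]), parts[1]
        last_ts = ts
        if is_failure(status):
            if outage_start is None:
                outage_start = ts
        elif outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None
    # The log may end while the service is still down.
    if outage_start is not None and last_ts is not None:
        longest = max(longest, last_ts - outage_start)
    return longest


def evaluate_rto(log_lines: Iterable[str], rto_seconds: int) -> dict:
    """Summarize a run in a shape an orchestrator or alert hook could consume."""
    actual = longest_outage_seconds(log_lines)
    return {"actual_rto": actual, "rto_seconds": rto_seconds, "rto_met": actual <= rto_seconds}


if __name__ == "__main__":
    sample = ["100 200", "101 000", "102 000", "105 503", "108 200", "109 200"]
    print(evaluate_rto(sample, rto_seconds=30))
```

Storing each returned dictionary alongside its run timestamp gives a history that makes regressions visible from one run to the next.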
Kubernetes resilience is not a feature you enable — it's a property you verify through testing.