Chaos Engineering and Observability: You Can't Break What You Can't See
Chaos engineering is not about breaking systems randomly. It's about running controlled experiments to validate hypotheses about system behavior under failure conditions. Without observability, you can't form those hypotheses, can't detect the blast radius of an experiment, and can't determine whether your system actually recovered.
Observability isn't a prerequisite for chaos engineering — it's the prerequisite. Teams that run chaos experiments without solid observability tooling are flying blind.
The Observability Foundation
Before your first chaos experiment, you need three things working:
Metrics: Numeric measurements over time — request rate, error rate, latency percentiles, resource utilization. Your SLIs (Service Level Indicators) are defined as metrics.
Traces: Request flow across service boundaries. When a chaos experiment causes latency, distributed traces show you exactly where the slowdown is occurring.
Logs: Structured events with context. When something fails, logs tell you why — the specific error, the affected request, the state at failure time.
Without all three, chaos experiments produce confusing results. A latency spike in metrics tells you something is wrong. Traces tell you which service. Logs tell you what error was returned.
Defining Your Steady State
Every chaos experiment starts with a steady state hypothesis: "I believe that when [failure condition], the system will [behavior]."
Steady state is defined in terms of observable metrics. You can't have a steady state hypothesis without metrics.
Common steady state definitions:
# Steady state: API is healthy
steady_state:
  - metric: http_request_error_rate
    target: "< 0.5%"
    window: 5m
  - metric: http_request_p99_latency_ms
    target: "< 500"
    window: 5m
  - metric: active_user_sessions
    target: "> 0"   # quoted: a bare > would start a YAML block scalar
    window: 1m
Before starting any experiment, verify that your system is actually in steady state. An experiment run against a system already experiencing issues produces uninterpretable results.
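The pre-experiment check can be automated. The following is a minimal sketch, assuming you have already fetched the current metric values from your monitoring system; `parseTarget` and `checkSteadyState` are hypothetical helper names, and the observed values are made up for illustration.

```javascript
// Hypothetical helpers: evaluate steady-state targets such as "< 0.5%" or "> 0"
// against metric values already fetched from your monitoring system.

// Parse a target string like "< 0.5%" into { op, value }.
function parseTarget(target) {
  const match = /^\s*(<=|>=|<|>)\s*(-?\d+(?:\.\d+)?)\s*%?\s*$/.exec(target);
  if (!match) throw new Error(`Unsupported target: ${target}`);
  return { op: match[1], value: Number(match[2]) };
}

const comparators = {
  '<': (a, b) => a < b,
  '<=': (a, b) => a <= b,
  '>': (a, b) => a > b,
  '>=': (a, b) => a >= b,
};

// Returns { ok, violations } for the given definitions and observed values.
function checkSteadyState(definitions, observed) {
  const violations = [];
  for (const { metric, target } of definitions) {
    const { op, value } = parseTarget(target);
    const actual = observed[metric];
    // A missing metric counts as a violation: steady state cannot be claimed blind.
    if (typeof actual !== 'number' || !comparators[op](actual, value)) {
      violations.push({ metric, target, actual });
    }
  }
  return { ok: violations.length === 0, violations };
}

const steadyState = [
  { metric: 'http_request_error_rate', target: '< 0.5%' },
  { metric: 'http_request_p99_latency_ms', target: '< 500' },
  { metric: 'active_user_sessions', target: '> 0' },
];

// Illustrative values only (error rate expressed in percent).
const result = checkSteadyState(steadyState, {
  http_request_error_rate: 0.2,
  http_request_p99_latency_ms: 640,
  active_user_sessions: 1200,
});
console.log(result.ok ? 'Steady state verified' : 'Abort experiment', result.violations);
```

In this example the p99 latency is already above its target, so the experiment should not start until the system is back within bounds.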
Instrumentation Required Before Chaos
Golden Signals
Google's SRE book defines four golden signals that every service should expose:
Latency: How long requests take. Measure latency for successful and failed requests separately: fast error responses can make overall latency look healthy while requests are failing.
# Prometheus: p99 request latency from a histogram
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
Traffic: How much demand your system is serving:
rate(http_requests_total[5m])
Errors: Rate of failed requests:
# Aggregate both sides first; dividing the raw vectors only matches series with identical labels
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
Saturation: How "full" your system is (CPU, memory, connection pool utilization):
# Connection pool saturation
db_connection_pool_used / db_connection_pool_max
These four signals define your steady state. Every chaos experiment must monitor all four throughout its duration.
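To make the arithmetic behind these queries concrete, here is a rough sketch of how the four signals fall out of raw counters and histogram buckets. It assumes two counter snapshots taken a known number of seconds apart and a set of cumulative buckets; `estimateQuantile` and `goldenSignals` are hypothetical names, and the interpolation is a simplified version of the idea behind `histogram_quantile`, not a reimplementation of it.

```javascript
// Estimate a quantile from cumulative histogram buckets by interpolating
// linearly inside the bucket that contains the target rank.
// buckets: [{ le: upperBound, count: cumulativeCount }], le may be Infinity.
function estimateQuantile(q, buckets) {
  const sorted = [...buckets].sort((a, b) => (a.le === b.le ? 0 : a.le < b.le ? -1 : 1));
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const i = sorted.findIndex((b) => b.count >= rank);
  // If the rank lands in the +Inf bucket, fall back to the highest finite bound.
  if (sorted[i].le === Infinity) return i > 0 ? sorted[i - 1].le : NaN;
  const lower = i === 0 ? 0 : sorted[i - 1].le;
  const below = i === 0 ? 0 : sorted[i - 1].count;
  const inBucket = sorted[i].count - below;
  return lower + (sorted[i].le - lower) * ((rank - below) / inBucket);
}

// Traffic, errors, and saturation from two counter snapshots taken
// `seconds` apart, mirroring the rate() and ratio queries.
function goldenSignals(prev, curr, seconds) {
  const requests = curr.requestsTotal - prev.requestsTotal;
  const errors = curr.errorsTotal - prev.errorsTotal;
  return {
    trafficPerSecond: requests / seconds,
    errorRatio: requests === 0 ? 0 : errors / requests,
    saturation: curr.poolUsed / curr.poolMax,
  };
}

// Illustrative numbers only.
const latencyBuckets = [
  { le: 0.1, count: 900 },
  { le: 0.25, count: 970 },
  { le: 0.5, count: 995 },
  { le: 1, count: 1000 },
  { le: Infinity, count: 1000 },
];
const p99Seconds = estimateQuantile(0.99, latencyBuckets);

const signals = goldenSignals(
  { requestsTotal: 10000, errorsTotal: 40 },
  { requestsTotal: 13000, errorsTotal: 55, poolUsed: 18, poolMax: 20 },
  300
);
console.log({ p99Seconds, ...signals });
```

Note that the estimated p99 can only be as precise as the bucket boundaries: choose buckets that bracket your latency target, or the percentile will be too coarse to detect a chaos-induced regression.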
Distributed Tracing Setup
For microservices chaos experiments, you need distributed tracing to understand which service is being affected:
// OpenTelemetry Node.js setup (written against the 1.x SDK packages)
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { SimpleSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { Resource } = require('@opentelemetry/resources');
const provider = new NodeTracerProvider({
  resource: new Resource({
    'service.name': 'user-service',
    'service.version': process.env.APP_VERSION,
  }),
});
// SimpleSpanProcessor exports each span as it ends; BatchSpanProcessor is the usual choice under production load
provider.addSpanProcessor(
  new SimpleSpanProcessor(new JaegerExporter({
    endpoint: 'http://jaeger:14268/api/traces',
  }))
);
provider.register();
With distributed tracing in place, when a chaos experiment causes latency, you can trace a specific slow request and see exactly which service added the latency and why.
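Once spans are exported, finding the service that added the latency is mostly arithmetic on span durations. The sketch below assumes a simplified span shape (`spanId`, `parentId`, `service`, `startMs`, `endMs`) rather than any tracing backend's real export format, and it assumes child spans run sequentially; `selfTimeByService` is a hypothetical helper.

```javascript
// Self time = a span's duration minus the time spent in its direct children.
// Summing self time per service shows where a slow request actually waited.
// Assumes children of a span do not overlap in time.
function selfTimeByService(spans) {
  const childTime = new Map();
  for (const s of spans) {
    if (s.parentId !== null) {
      const duration = s.endMs - s.startMs;
      childTime.set(s.parentId, (childTime.get(s.parentId) || 0) + duration);
    }
  }
  const totals = {};
  for (const s of spans) {
    const self = Math.max(0, s.endMs - s.startMs - (childTime.get(s.spanId) || 0));
    totals[s.service] = (totals[s.service] || 0) + self;
  }
  return totals;
}

// Illustrative trace: a checkout request during a 5 second payment delay.
const trace = [
  { spanId: 'a', parentId: null, service: 'checkout-service', startMs: 0, endMs: 5300 },
  { spanId: 'b', parentId: 'a', service: 'payment-service', startMs: 100, endMs: 5200 },
  { spanId: 'c', parentId: 'b', service: 'payment-db', startMs: 150, endMs: 180 },
];

const selfTimes = selfTimeByService(trace);
const slowest = Object.entries(selfTimes).sort((x, y) => y[1] - x[1])[0];
console.log(selfTimes, 'slowest:', slowest[0]);
```

Here the checkout span is long only because it waits on its child; the self time points at the payment service, which is where the injected delay lives.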
Structured Logging
Structured logs make it possible to correlate events across services during a chaos experiment:
// Structured logging with correlation IDs
const logger = {
  info: (msg, context) => console.log(JSON.stringify({
    level: 'info',
    timestamp: new Date().toISOString(),
    message: msg,
    ...context,
  })),
};
// Usage in request handler
logger.info('Processing request', {
  requestId: req.headers['x-request-id'],
  userId: req.user?.id,
  endpoint: req.path,
  duration_ms: Date.now() - req.startTime,
});
During a chaos experiment, filter logs by the experiment time window and look for error patterns, timeout messages, and retry attempts.
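Filtering by time window is straightforward when every log line is JSON with a timestamp. A minimal sketch, assuming one JSON object per line in the shape produced by the logger above; `summarizeLogs` is a hypothetical helper, and the substring matching on "timeout" and "retry" is an assumption about how your services word their messages.

```javascript
// Count errors, timeouts, and retries inside an experiment window and
// collect the request IDs involved, so they can be looked up in traces.
function summarizeLogs(lines, windowStart, windowEnd) {
  const start = Date.parse(windowStart);
  const end = Date.parse(windowEnd);
  const counts = { error: 0, timeout: 0, retry: 0 };
  const requestIds = new Set();
  for (const line of lines) {
    let entry;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip lines that are not structured
    }
    const ts = Date.parse(entry.timestamp);
    if (Number.isNaN(ts) || ts < start || ts > end) continue;
    const message = String(entry.message || '').toLowerCase();
    let matched = false;
    if (entry.level === 'error') { counts.error += 1; matched = true; }
    if (message.includes('timeout')) { counts.timeout += 1; matched = true; }
    if (message.includes('retry')) { counts.retry += 1; matched = true; }
    if (matched && entry.requestId) requestIds.add(entry.requestId);
  }
  return { counts, affectedRequests: [...requestIds] };
}

// Illustrative log lines.
const logLines = [
  JSON.stringify({ level: 'info', timestamp: '2024-05-01T10:00:30.000Z', message: 'Processing request', requestId: 'a1' }),
  JSON.stringify({ level: 'error', timestamp: '2024-05-01T10:05:10.000Z', message: 'Payment timeout after 5000ms', requestId: 'b2' }),
  JSON.stringify({ level: 'warn', timestamp: '2024-05-01T10:05:12.000Z', message: 'Retry attempt 1 for payment call', requestId: 'b2' }),
  'plain text line that cannot be parsed',
  JSON.stringify({ level: 'error', timestamp: '2024-05-01T10:20:00.000Z', message: 'Unrelated failure', requestId: 'c3' }),
];

const logSummary = summarizeLogs(logLines, '2024-05-01T10:05:00.000Z', '2024-05-01T10:15:00.000Z');
console.log(logSummary);
```

The request IDs returned here are the bridge to tracing: each one can be used to pull up the matching trace for a failed or retried request.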
Chaos Experiment Design with Observability in Mind
Hypothesis-Driven Experiments
A well-formed chaos hypothesis is directly tied to your observability stack:
Hypothesis: When the payment service has 20% of requests fail with 500 errors,
the checkout service will degrade gracefully: users will see an error message
rather than a hanging request, and the error rate on the checkout API will
not exceed 5% (we expect checkout to succeed for most users via retry logic).
Observable steady state:
- checkout_error_rate < 5% (measured via Prometheus)
- checkout_p99_latency_ms < 1000 (measured via Prometheus)
- payment_failure_alert fires within 2 minutes (measured via Alertmanager)
- No zombie transactions visible in database (measured via SQL query)
Observability during experiment:
- Grafana dashboard: checkout + payment golden signals
- Distributed traces: identify which checkout requests are payment-dependent
- Logs: grep for payment timeout messages in checkout service
Automated Steady State Checks
Use chaos engineering tools that integrate with your observability stack to automatically verify steady state:
Chaos Toolkit with Prometheus:
# chaos-experiment.yaml
# Module and function names are as given here; confirm them against the extension versions you have installed.
version: 1.0.0
title: Payment service degradation
description: Verify checkout degrades gracefully when payment fails
steady-state-hypothesis:
  title: System operates normally
  probes:
    - name: checkout-error-rate-acceptable
      type: probe
      tolerance: true
      provider:
        type: python
        module: chaosprometheus.probes
        func: query_within_range
        arguments:
          query: |
            sum(rate(http_requests_total{service="checkout",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="checkout"}[5m])) * 100
          min: 0
          max: 5  # less than 5% errors
method:
  - type: action
    name: inject-payment-failures
    provider:
      type: python
      module: chaosgremlin.actions
      func: attack_targets
      arguments:
        attack_configuration:
          attack_type: "latency"
          delay: 5000  # 5 second delay
rollbacks:
  - type: action
    name: stop-payment-failures
    provider:
      type: python
      module: chaosgremlin.actions
      func: halt_all_attacks
Dashboards for Chaos Experiments
Create a dedicated chaos experiment dashboard in Grafana that shows:
{
  "panels": [
    {
      "title": "Request Error Rate by Service",
      "type": "timeseries",
      "targets": [{
        "expr": "sum by (service) (rate(http_requests_total{status=~'5..'}[1m])) / sum by (service) (rate(http_requests_total[1m]))",
        "legendFormat": "{{service}}"
      }]
    },
    {
      "title": "P99 Latency by Service",
      "type": "timeseries",
      "targets": [{
        "expr": "histogram_quantile(0.99, sum by (le, service) (rate(http_request_duration_seconds_bucket[1m])))",
        "legendFormat": "{{service}}"
      }]
    },
    {
      "title": "Active Chaos Experiments",
      "type": "table",
      "targets": [{
        "expr": "chaos_experiment_active",
        "legendFormat": "{{name}}"
      }]
    }
  ]
}
The "Active Chaos Experiments" panel is particularly valuable: it shows which experiments are running, so you can line them up against metric changes and distinguish chaos-induced failures from organic ones. Note that chaos_experiment_active is not a built-in metric; your chaos tooling has to export it.
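The `chaos_experiment_active` series has to come from somewhere, since neither Prometheus nor Grafana provides it. One possible approach is sketched below: render the gauge in the Prometheus text exposition format and serve it from an endpoint that Prometheus scrapes. `renderChaosMetrics` and the experiment names are hypothetical; in practice you would more likely use a Prometheus client library than build the text by hand.

```javascript
// Escape a label value for the Prometheus text exposition format:
// backslash, double quote, and newline must be escaped.
function escapeLabelValue(value) {
  return String(value)
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n');
}

// Render one gauge sample per experiment: 1 while running, 0 otherwise.
function renderChaosMetrics(experiments) {
  const lines = [
    '# HELP chaos_experiment_active Set to 1 while a chaos experiment is running.',
    '# TYPE chaos_experiment_active gauge',
  ];
  for (const { name, active } of experiments) {
    lines.push(`chaos_experiment_active{name="${escapeLabelValue(name)}"} ${active ? 1 : 0}`);
  }
  return lines.join('\n') + '\n';
}

// Illustrative experiments.
const metricsBody = renderChaosMetrics([
  { name: 'payment-latency', active: true },
  { name: 'cache-eviction', active: false },
]);
console.log(metricsBody);
```

Setting the gauge to 1 at experiment start and back to 0 in the rollback step keeps the dashboard panel aligned with what was actually injected.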
Alerting During Chaos Experiments
Configure your alerting to suppress non-critical alerts during planned chaos experiments, while keeping critical alerts active:
# Alertmanager: suppress non-critical alerts during chaos
inhibit_rules:
  - source_match:
      alertname: 'ChaosExperimentActive'
    target_match_re:
      severity: 'warning'
    equal: ['environment']
You still want alerts to fire, but you want to distinguish "expected behavior during experiment" from "unexpected behavior." Tag chaos experiment periods explicitly in your monitoring systems.
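The same decision can be expressed in code if you triage alerts after the fact. This sketch mirrors the inhibit rule above: a warning is suppressed only when an experiment was active in the same environment at the time the alert fired, and anything more severe still notifies. The alert and experiment shapes, and the `triageAlert` name, are assumptions for illustration rather than Alertmanager's API.

```javascript
// Decide what to do with an alert given the recorded experiment windows.
// Returns the matching experiment name (or null) and the action to take.
function triageAlert(alert, experiments) {
  const firedAt = Date.parse(alert.startsAt);
  const active = experiments.find((e) =>
    e.environment === alert.labels.environment &&
    firedAt >= Date.parse(e.start) &&
    firedAt <= Date.parse(e.end));
  if (!active) return { experiment: null, action: 'notify' };
  // Only warnings are muted; critical alerts stay loud even during chaos.
  const action = alert.labels.severity === 'warning' ? 'suppress' : 'notify';
  return { experiment: active.name, action };
}

// Illustrative experiment window and alert.
const chaosWindows = [
  { name: 'payment-latency', environment: 'staging', start: '2024-05-01T10:05:00Z', end: '2024-05-01T10:15:00Z' },
];
const warningAlert = {
  startsAt: '2024-05-01T10:07:00Z',
  labels: { alertname: 'HighLatency', severity: 'warning', environment: 'staging' },
};
console.log(triageAlert(warningAlert, chaosWindows));
```

Keeping the experiment name on the result, even for alerts that still notify, is what lets the on-call engineer tell expected behavior from unexpected behavior at a glance.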
Post-Experiment Analysis
After each experiment, conduct a structured analysis using your observability data:
Did steady state hold? Compare pre-experiment metrics with during-experiment metrics. Quantify the blast radius.
Did the system recover? Plot recovery time — how long after experiment end before metrics returned to steady state.
What was unexpected? Look for metrics that changed in ways your hypothesis didn't predict. Unexpected behavior is often the most valuable finding.
Are there alerts we should have? If the experiment caused significant user impact but no alerts fired, you have an alerting gap.
Document findings with screenshots of your Grafana dashboards during the experiment — make the data visible in your post-experiment report.
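Blast radius and recovery time can both be computed from the exported time series rather than read off a chart. The sketch below assumes samples of the form `{ t, value }` with `t` in seconds, and treats the system as recovered once the metric stays below the threshold for a number of consecutive samples; the function names, the threshold, and the sample data are all illustrative.

```javascript
// Peak value of a metric while the experiment was running (blast radius).
function peakDuring(samples, start, end) {
  const values = samples.filter((s) => s.t >= start && s.t <= end).map((s) => s.value);
  return values.length === 0 ? null : Math.max(...values);
}

// Seconds from experiment end until the metric drops below `threshold`
// and stays there for `stableSamples` consecutive samples; null if it never does.
function recoveryTimeSeconds(samples, experimentEnd, threshold, stableSamples = 3) {
  const after = samples.filter((s) => s.t >= experimentEnd).sort((a, b) => a.t - b.t);
  let runStart = null;
  let runLength = 0;
  for (const s of after) {
    if (s.value < threshold) {
      if (runLength === 0) runStart = s.t;
      runLength += 1;
      if (runLength >= stableSamples) return runStart - experimentEnd;
    } else {
      runLength = 0; // a relapse resets the stability window
    }
  }
  return null;
}

// Illustrative error-rate samples (percent), one per minute.
// Experiment ran from t=300 to t=600.
const errorRate = [
  { t: 240, value: 0.4 }, { t: 300, value: 0.5 }, { t: 360, value: 9 },
  { t: 420, value: 14 }, { t: 480, value: 13 }, { t: 540, value: 12 },
  { t: 600, value: 11 }, { t: 660, value: 7 }, { t: 720, value: 3 },
  { t: 780, value: 6 }, { t: 840, value: 2 }, { t: 900, value: 1 },
  { t: 960, value: 1 },
];

const peak = peakDuring(errorRate, 300, 600);
const recovery = recoveryTimeSeconds(errorRate, 600, 5);
console.log({ peak, recovery });
```

Requiring several consecutive healthy samples matters: in the data above the error rate dips below the threshold at t=720 and then relapses, so counting the first dip as recovery would understate the real recovery time.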
Observability Gaps That Block Chaos Engineering
No service-level metrics. If you can only see infrastructure metrics (CPU, memory) but not application metrics (request rate, error rate), you can't define steady state for your services.
No distributed tracing. Without traces, you can't determine which service caused a latency increase during an experiment.
Unstructured logs. String logs you can't query make it impossible to correlate events during experiments.
No alerting baseline. If your system generates constant alerts even in steady state, you can't tell whether chaos is causing new alerts.
Metrics retention too short. Some chaos experiments need to run for hours. If your metrics retention is 15 minutes, you can't review historical data after the experiment.
Before your first chaos experiment, audit these gaps. Fix them first.