SLI, SLO, and Error Budget Testing: A Practical Guide
SLI, SLO, and error budget are the vocabulary of reliability engineering. They give you a rigorous way to answer "is our service reliable enough?" and "how much risk can we take with this deployment?" without relying on gut feeling or incident retrospectives.
This guide covers how to implement SLIs and SLOs in code, write tests that verify compliance, and set up error budget burn rate alerting.
Core Concepts
SLI (Service Level Indicator) — A quantitative measure of service behavior, measured from the real user's perspective.
Examples:
- Request latency (P95 < 200ms)
- Availability (HTTP 5xx rate < 0.1%)
- Data freshness (pipeline outputs updated within 5 minutes)
- Task completion rate (batch jobs succeed > 99.9% of the time)
SLO (Service Level Objective) — The target value for an SLI over a time window.
Examples:
- 99.9% of requests complete in < 200ms, measured over a 28-day rolling window
- Error rate < 0.1% over 28 days
- 99.5% of pipelines complete within their SLA window
Error Budget — The allowed room for failure. A 99.9% SLO over 28 days gives you about 40 minutes of budget (0.1% × 28 days × 24 hours/day × 60 minutes/hour ≈ 40.3 minutes).
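To make that arithmetic concrete, here is a quick back-of-the-envelope calculation (plain Python, mirroring the formula above) for a few common targets:

def budget_minutes(target: float, window_days: int = 28) -> float:
    """Error budget in minutes for an SLO target over a rolling window."""
    return (1.0 - target) * window_days * 24 * 60

for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} over 28 days -> {budget_minutes(target):.1f} min")
# 99.00% over 28 days -> 403.2 min
# 99.90% over 28 days -> 40.3 min
# 99.99% over 28 days -> 4.0 min

Each extra nine cuts the budget by a factor of ten, which is why tightening a target is an expensive decision.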
Defining SLIs in Code
from dataclasses import dataclass
from typing import Callable

@dataclass
class SLI:
    name: str
    description: str
    query: Callable  # Function that computes the current SLI value
    unit: str

@dataclass
class SLO:
    name: str
    sli: SLI
    target: float      # e.g., 0.999 for 99.9%
    window_days: int   # Rolling window for measurement

    @property
    def error_budget_fraction(self) -> float:
        return 1.0 - self.target

    def error_budget_minutes(self) -> float:
        return self.error_budget_fraction * self.window_days * 24 * 60

# Define SLIs (query_prometheus is assumed to wrap your Prometheus HTTP API client)
availability_sli = SLI(
    name="http_availability",
    description="Fraction of HTTP requests that succeed (non-5xx)",
    query=lambda: query_prometheus(
        'sum(rate(http_requests_total{status!~"5.."}[5m])) / '
        'sum(rate(http_requests_total[5m]))'
    ),
    unit="ratio"
)

latency_sli = SLI(
    name="http_latency_p95",
    description="95th percentile request latency",
    query=lambda: query_prometheus(
        'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))'
    ),
    unit="seconds"
)

# Define SLOs
availability_slo = SLO(
    name="API Availability",
    sli=availability_sli,
    target=0.999,  # 99.9%
    window_days=28
)

latency_slo = SLO(
    name="API Latency P95",
    sli=latency_sli,
    target=0.95,  # 95% of requests must complete under the 200ms threshold
    window_days=28
)

print(f"Error budget for {availability_slo.name}: "
      f"{availability_slo.error_budget_minutes():.1f} minutes per {availability_slo.window_days} days")
# → 40.3 minutes per 28 days

Writing SLO Compliance Tests
Synthetic Monitoring Tests
Synthetic tests continuously validate your SLO from the outside, mimicking real user requests:
import requests
import time

import pytest

class SLOComplianceChecker:
    def __init__(self, endpoint: str, target_p95_ms: float, target_success_rate: float):
        self.endpoint = endpoint
        self.target_p95_ms = target_p95_ms
        self.target_success_rate = target_success_rate

    def run_sample(self, n_requests: int = 100) -> dict:
        latencies = []
        successes = 0
        errors = []
        for _ in range(n_requests):
            start = time.perf_counter_ns()
            try:
                response = requests.get(self.endpoint, timeout=5.0)
                elapsed_ms = (time.perf_counter_ns() - start) / 1_000_000
                latencies.append(elapsed_ms)
                if response.status_code < 500:
                    successes += 1
                else:
                    errors.append({"status": response.status_code, "latency_ms": elapsed_ms})
            except requests.exceptions.Timeout:
                elapsed_ms = (time.perf_counter_ns() - start) / 1_000_000
                latencies.append(elapsed_ms)
                errors.append({"status": "timeout", "latency_ms": elapsed_ms})
        latencies.sort()
        # Clamp the index so small samples can't run off the end of the list
        p95_ms = latencies[min(int(0.95 * len(latencies)), len(latencies) - 1)]
        success_rate = successes / n_requests
        return {
            "n_requests": n_requests,
            "success_rate": success_rate,
            "p95_latency_ms": p95_ms,
            "errors": errors,
        }

@pytest.mark.slo  # register the "slo" marker in pytest.ini to avoid unknown-mark warnings
def test_api_availability_slo():
    checker = SLOComplianceChecker(
        endpoint="https://api.yourservice.com/health",
        target_p95_ms=200.0,
        target_success_rate=0.999
    )
    result = checker.run_sample(n_requests=200)

    assert result["success_rate"] >= 0.999, \
        f"Success rate {result['success_rate']:.4f} below 99.9% SLO target. " \
        f"Errors: {result['errors'][:3]}"
    assert result["p95_latency_ms"] <= 200.0, \
        f"P95 latency {result['p95_latency_ms']:.1f}ms exceeds 200ms SLO"

Error Budget Burn Rate Testing
Burn rate tells you how quickly you're consuming your error budget. A burn rate > 1 means you're on track to exhaust the budget before the window ends.
def calculate_burn_rate(current_error_rate: float, slo_target: float) -> float:
    """
    Burn rate of 1.0 = consuming budget at exactly the allowed rate.
    Burn rate of 2.0 = consuming budget twice as fast (will exhaust in half the window).
    """
    error_budget = 1.0 - slo_target
    return current_error_rate / error_budget
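Burn rate also gives you time to exhaustion: at a constant rate, the budget lasts window / burn_rate. A small helper (a sketch layered on the definitions above) makes the alerting thresholds used below concrete:

def hours_to_exhaustion(burn_rate: float, window_days: int = 28) -> float:
    """At a constant burn rate, the budget lasts window / burn_rate."""
    return (window_days * 24) / burn_rate

# A 14.4x burn rate exhausts a 28-day budget in about 2 days:
print(hours_to_exhaustion(14.4) / 24)  # → ~1.94 days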
def test_error_budget_burn_rate_alert():
    """
    Verify that the monitoring system fires when burn rate is too high.

    Fast burn (1-hour window): alert if burn rate > 14.4x
    Slow burn (6-hour window): alert if burn rate > 6x
    (These are Google's recommended multiwindow burn rate thresholds.)
    """
    # Simulate a burst of errors
    simulated_error_rate = 0.05  # 5% error rate, 50x the 0.1% budget
    burn_rate = calculate_burn_rate(
        current_error_rate=simulated_error_rate,
        slo_target=0.999
    )

    # At a 5% error rate with a 0.1% budget: burn rate = 50x
    assert burn_rate > 14.4, \
        f"Expected fast-burn alert to trigger (burn_rate={burn_rate:.1f} should be > 14.4x)"

    # Verify the alert system would fire (monitoring_client wraps your alerting API)
    alert_fired = monitoring_client.check_alert_status("error_budget_burn_rate_high")
    assert alert_fired, f"Burn rate alert did not fire at burn_rate={burn_rate:.1f}x"

Prometheus SLO Alerting Rules
# prometheus-slo-rules.yaml
groups:
  - name: slo_error_budget
    rules:
      # SLI: availability
      - record: slo:http_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

      - record: slo:http_availability:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[1h]))
          /
          sum(rate(http_requests_total[1h]))

      - record: slo:http_availability:ratio_rate6h
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[6h]))
          /
          sum(rate(http_requests_total[6h]))

      # Fast burn alert: 14.4x burn rate over 1h (exhausts a 28-day budget in ~2 days)
      - alert: HighErrorBudgetBurnRate
        expr: |
          (
            (1 - slo:http_availability:ratio_rate1h) / (1 - 0.999)
          ) > 14.4
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error budget burn rate"
          description: "Burn rate {{ $value | humanize }}x; at this rate the 28-day budget is exhausted in under two days"

      # Slow burn alert: 6x burn rate over 6h
      - alert: ElevatedErrorBudgetBurnRate
        expr: |
          (
            (1 - slo:http_availability:ratio_rate6h) / (1 - 0.999)
          ) > 6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Elevated error budget burn rate"
          description: "Burn rate {{ $value | humanize }}x over last 6 hours"

Testing Alerting Rules
def test_prometheus_alert_fires_on_high_error_rate(prometheus_test_client):
    """
    Inject a high error rate and verify the alert fires within the expected time.
    """
    # Simulate a 15% error rate (150x the SLO budget)
    prometheus_test_client.inject_metric(
        metric='http_requests_total',
        labels={'status': '500'},
        value=150.0  # 150 5xx per time unit
    )
    prometheus_test_client.inject_metric(
        metric='http_requests_total',
        labels={'status': '200'},
        value=850.0  # 850 successful per time unit (85% success = 15% error)
    )

    # Wait for the rule to evaluate
    import time
    time.sleep(65)  # Rule evaluation interval + buffer

    alerts = prometheus_test_client.get_active_alerts()
    alert_names = [a['alertname'] for a in alerts]
    assert 'HighErrorBudgetBurnRate' in alert_names, \
        f"Expected HighErrorBudgetBurnRate alert, got: {alert_names}"

    alert = next(a for a in alerts if a['alertname'] == 'HighErrorBudgetBurnRate')
    burn_rate = float(alert['value'])
    assert burn_rate > 14.4, f"Burn rate {burn_rate:.1f} below expected threshold"

Error Budget Policy
The error budget policy defines what happens when the budget is exhausted:
class ErrorBudgetPolicy:
    """Automate responses based on remaining error budget."""

    def __init__(self, slo: SLO, metrics_client):
        self.slo = slo
        self.metrics = metrics_client

    def get_remaining_budget_fraction(self, window_hours: int = 28 * 24) -> float:
        """Returns remaining error budget as a fraction (1.0 = full budget, 0.0 = exhausted)."""
        consumed_errors = self.metrics.get_error_count(hours=window_hours)
        total_requests = self.metrics.get_request_count(hours=window_hours)
        if total_requests == 0:
            return 1.0
        actual_error_rate = consumed_errors / total_requests
        allowed_error_rate = 1.0 - self.slo.target
        consumed_fraction = actual_error_rate / allowed_error_rate
        return max(0.0, 1.0 - consumed_fraction)

    def get_deployment_recommendation(self) -> str:
        """Return deployment policy based on current error budget."""
        remaining = self.get_remaining_budget_fraction()
        if remaining > 0.50:
            return "PROCEED"   # > 50% budget remaining
        elif remaining > 0.10:
            return "CAUTION"   # 10-50% budget remaining: require extra review
        else:
            return "FREEZE"    # < 10% budget: no risky deployments
def test_error_budget_policy_blocks_risky_deploys():
    """When error budget is exhausted, high-risk deployments should be blocked."""
    policy = ErrorBudgetPolicy(availability_slo, metrics_client=test_metrics_client)

    # Simulate an exhausted budget: a 50% error rate is 500x the 0.1% budget
    test_metrics_client.set_error_rate(0.50)

    recommendation = policy.get_deployment_recommendation()
    assert recommendation == "FREEZE", \
        f"Expected FREEZE when budget exhausted, got {recommendation}"

SLO Review Cadence
Beyond automated testing, SLOs need regular human review:
def generate_slo_report(slo: SLO, window_days: int = 28) -> str:
    """Generate a weekly SLO compliance report."""
    metrics = query_slo_metrics(slo, window_days)  # assumed helper over your metrics store

    report = f"""
## SLO Report: {slo.name}

**Period:** Last {window_days} days
**Target:** {slo.target:.4%}
**Actual:** {metrics.actual:.4%}
**Status:** {'✅ COMPLIANT' if metrics.actual >= slo.target else '❌ BREACHED'}

### Error Budget
- Total budget: {slo.error_budget_minutes():.1f} minutes
- Consumed: {metrics.budget_consumed_minutes:.1f} minutes
- Remaining: {metrics.budget_remaining_minutes:.1f} minutes ({metrics.budget_remaining_pct:.1%})

### Incidents
{chr(10).join(f"- {i.timestamp}: {i.duration_minutes:.1f}min — {i.description}"
              for i in metrics.incidents)}

### Recommendation
{'Deploy velocity is sustainable.' if metrics.budget_remaining_pct > 0.5
 else 'Slow deploy cadence — budget is constrained.' if metrics.budget_remaining_pct > 0.1
 else '⚠ FREEZE non-critical changes. Budget nearly exhausted.'}
"""
    return report

SLIs and SLOs work best when they're tied to real user pain. A latency SLO that nobody notices breaching is the wrong SLO. Start with one or two SLOs that directly correspond to user experience, test them rigorously, and expand from there.