SLI, SLO, and Error Budget Testing: A Practical Guide

SLIs, SLOs, and error budgets are the vocabulary of reliability engineering. They give you a rigorous way to answer "is our service reliable enough?" and "how much risk can we take with this deployment?" without relying on gut feeling or incident retrospectives.

This guide covers how to implement SLIs and SLOs in code, write tests that verify compliance, and set up error budget burn rate alerting.

Core Concepts

SLI (Service Level Indicator) — A quantitative measure of service behavior. It must be measurable from the real user's perspective.

Examples:

  • Request latency (P95 < 200ms)
  • Availability (HTTP 5xx rate < 0.1%)
  • Data freshness (pipeline outputs updated within 5 minutes)
  • Task completion rate (batch jobs succeed > 99.9% of the time)

SLO (Service Level Objective) — The target value for an SLI over a time window.

Examples:

  • 99.9% of requests complete in < 200ms, measured over a 28-day rolling window
  • Error rate < 0.1% over 28 days
  • 99.5% of pipelines complete within their SLA window

Error Budget — The allowed room for failure. A 99.9% SLO over 28 days gives you about 40 minutes of budget (0.1% × 28 days × 24 hours × 60 minutes ≈ 40.3 minutes).
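The arithmetic generalizes to any target; a quick sketch (the helper name is ours, and the `SLO` class in the next section wraps the same formula):

```python
def error_budget_minutes(target: float, window_days: int = 28) -> float:
    """Minutes of full downtime allowed before the SLO is breached."""
    return (1.0 - target) * window_days * 24 * 60

# Each extra nine divides the budget by ten
for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} -> {error_budget_minutes(target):.1f} min / 28 days")
# 99.00% -> 403.2 min / 28 days
# 99.90% -> 40.3 min / 28 days
# 99.99% -> 4.0 min / 28 days
```

Note how quickly the budget tightens: a 99.99% target leaves only four minutes of total downtime per window, which is why most teams start at two or three nines.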

Defining SLIs in Code

from dataclasses import dataclass
from typing import Callable

@dataclass
class SLI:
    name: str
    description: str
    query: Callable  # Function that computes the current SLI value
    unit: str

@dataclass
class SLO:
    name: str
    sli: SLI
    target: float        # e.g., 0.999 for 99.9%
    window_days: int     # Rolling window for measurement
    
    @property
    def error_budget_fraction(self) -> float:
        return 1.0 - self.target
    
    def error_budget_minutes(self) -> float:
        return self.error_budget_fraction * self.window_days * 24 * 60

# Define SLIs
availability_sli = SLI(
    name="http_availability",
    description="Fraction of HTTP requests that succeed (non-5xx)",
    query=lambda: query_prometheus(
        'sum(rate(http_requests_total{status!~"5.."}[5m])) / '
        'sum(rate(http_requests_total[5m]))'
    ),
    unit="ratio"
)

latency_sli = SLI(
    name="http_latency_under_200ms",
    description="Fraction of requests served in under 200ms",
    query=lambda: query_prometheus(
        'sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m])) / '
        'sum(rate(http_request_duration_seconds_count[5m]))'
    ),
    unit="ratio"
)

# Define SLOs
availability_slo = SLO(
    name="API Availability",
    sli=availability_sli,
    target=0.999,    # 99.9%
    window_days=28
)

latency_slo = SLO(
    name="API Latency P95",
    sli=latency_sli,
    target=0.95,     # 95% of requests under 200ms
    window_days=28
)

print(f"Error budget for {availability_slo.name}: "
      f"{availability_slo.error_budget_minutes():.1f} minutes per {availability_slo.window_days} days")
# → 40.3 minutes per 28 days

Writing SLO Compliance Tests

Synthetic Monitoring Tests

Synthetic tests continuously validate your SLO from the outside, mimicking real user requests:

import requests
import time
import pytest

class SLOComplianceChecker:
    def __init__(self, endpoint: str, target_p95_ms: float, target_success_rate: float):
        self.endpoint = endpoint
        self.target_p95_ms = target_p95_ms
        self.target_success_rate = target_success_rate
    
    def run_sample(self, n_requests: int = 100) -> dict:
        latencies = []
        successes = 0
        errors = []
        
        for i in range(n_requests):
            start = time.perf_counter_ns()
            try:
                response = requests.get(self.endpoint, timeout=5.0)
                elapsed_ms = (time.perf_counter_ns() - start) / 1_000_000
                latencies.append(elapsed_ms)
                
                if response.status_code < 500:
                    successes += 1
                else:
                    errors.append({"status": response.status_code, "latency_ms": elapsed_ms})
            except requests.exceptions.Timeout:
                elapsed_ms = (time.perf_counter_ns() - start) / 1_000_000
                latencies.append(elapsed_ms)
                errors.append({"status": "timeout", "latency_ms": elapsed_ms})
            except requests.exceptions.RequestException as exc:
                # Connection failures count against availability too
                elapsed_ms = (time.perf_counter_ns() - start) / 1_000_000
                latencies.append(elapsed_ms)
                errors.append({"status": type(exc).__name__, "latency_ms": elapsed_ms})
        
        latencies.sort()
        p95_ms = latencies[min(int(0.95 * len(latencies)), len(latencies) - 1)]
        success_rate = successes / n_requests
        
        return {
            "n_requests": n_requests,
            "success_rate": success_rate,
            "p95_latency_ms": p95_ms,
            "errors": errors
        }

@pytest.mark.slo
def test_api_availability_slo():
    checker = SLOComplianceChecker(
        endpoint="https://api.yourservice.com/health",
        target_p95_ms=200.0,
        target_success_rate=0.999
    )
    
    result = checker.run_sample(n_requests=200)
    
    assert result["success_rate"] >= 0.999, \
        f"Success rate {result['success_rate']:.4f} below 99.9% SLO target. " \
        f"Errors: {result['errors'][:3]}"
    
    assert result["p95_latency_ms"] <= 200.0, \
        f"P95 latency {result['p95_latency_ms']:.1f}ms exceeds 200ms SLO"

Error Budget Burn Rate Testing

Burn rate tells you how quickly you're consuming your error budget. A burn rate > 1 means you're on track to exhaust the budget before the window ends.

def calculate_burn_rate(current_error_rate: float, slo_target: float) -> float:
    """
    Burn rate of 1.0 = consuming budget at exactly the allowed rate.
    Burn rate of 2.0 = consuming budget twice as fast (will exhaust in half the window).
    """
    error_budget = 1.0 - slo_target
    return current_error_rate / error_budget

def test_error_budget_burn_rate_alert():
    """
    Verify that the monitoring system fires when burn rate is too high.
    
    Fast burn (1-hour window): alert if burn rate > 14.4x
    Slow burn (6-hour window): alert if burn rate > 6x
    (These are Google's recommended multiwindow burn rate thresholds)
    """
    # Simulate a burst of errors
    simulated_error_rate = 0.05  # 5% error rate (50x the 0.1% SLO)
    
    burn_rate = calculate_burn_rate(
        current_error_rate=simulated_error_rate,
        slo_target=0.999
    )
    
    # At 5% error rate with 0.1% budget: burn rate = 50x
    assert burn_rate > 14.4, \
        f"Expected fast-burn alert to trigger (burn_rate={burn_rate:.1f} should be > 14.4x)"
    
    # Verify the alert system would fire
    alert_fired = monitoring_client.check_alert_status("error_budget_burn_rate_high")
    assert alert_fired, f"Burn rate alert did not fire at burn_rate={burn_rate:.1f}x"
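Those thresholds are not arbitrary: a burn rate of N exhausts the whole budget in window/N, and Google's numbers come from asking "how fast must we burn to spend a given slice of the budget within the alert window?" A sketch of the arithmetic (helper name is ours; the SRE Workbook assumes a 30-day SLO window):

```python
def burn_rate_threshold(budget_slice: float, alert_window_hours: float,
                        slo_window_days: int = 30) -> float:
    """Burn rate that consumes `budget_slice` of the whole error budget
    within `alert_window_hours`."""
    return budget_slice * slo_window_days * 24 / alert_window_hours

for budget_slice, hours in ((0.02, 1), (0.05, 6)):
    rate = burn_rate_threshold(budget_slice, hours)
    print(f"{budget_slice:.0%} of budget in {hours}h -> {rate:.1f}x")
# 2% of budget in 1h -> 14.4x
# 5% of budget in 6h -> 6.0x
```

In other words, the fast-burn alert pages when an hour of traffic would eat 2% of a month's budget, and the slow-burn alert tickets when six hours would eat 5%.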

Prometheus SLO Alerting Rules

# prometheus-slo-rules.yaml
groups:
  - name: slo_error_budget
    rules:
      # SLI: availability
      - record: slo:http_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
      
      - record: slo:http_availability:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[1h]))
          /
          sum(rate(http_requests_total[1h]))
      
      # Needed by the slow-burn alert below
      - record: slo:http_availability:ratio_rate6h
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[6h]))
          /
          sum(rate(http_requests_total[6h]))
      
      # Fast burn alert: 14.4x burn rate over 1h (will exhaust budget in 2 days)
      - alert: HighErrorBudgetBurnRate
        expr: |
          (
            (1 - slo:http_availability:ratio_rate1h) / (1 - 0.999)
          ) > 14.4
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error budget burn rate"
          description: "Burn rate {{ $value | humanize }}x — exhausts budget in {{ 672 | div $value | humanizeDuration }}"
      
      # Slow burn alert: 6x burn rate over 6h (will exhaust budget in ~4.7 days)
      - alert: ElevatedErrorBudgetBurnRate
        expr: |
          (
            (1 - slo:http_availability:ratio_rate6h) / (1 - 0.999)
          ) > 6
        for: 15m
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Elevated error budget burn rate"
          description: "Burn rate {{ $value | humanize }}x over last 6 hours"

Testing Alerting Rules

def test_prometheus_alert_fires_on_high_error_rate(prometheus_test_client):
    """
    Inject a high error rate and verify the alert fires within the expected time.
    """
    # Simulate 15% error rate (150x the SLO budget)
    prometheus_test_client.inject_metric(
        metric='http_requests_total',
        labels={'status': '500'},
        value=150.0  # 150 5xx per time unit
    )
    prometheus_test_client.inject_metric(
        metric='http_requests_total',
        labels={'status': '200'},
        value=850.0  # 850 successful per time unit (85% success = 15% error)
    )
    
    # Wait for alert to evaluate
    import time
    time.sleep(65)  # Rule evaluation interval + buffer
    
    alerts = prometheus_test_client.get_active_alerts()
    alert_names = [a['alertname'] for a in alerts]
    
    assert 'HighErrorBudgetBurnRate' in alert_names, \
        f"Expected HighErrorBudgetBurnRate alert, got: {alert_names}"
    
    alert = next(a for a in alerts if a['alertname'] == 'HighErrorBudgetBurnRate')
    burn_rate = float(alert['value'])
    
    assert burn_rate > 14.4, f"Burn rate {burn_rate:.1f} below expected threshold"
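If you would rather not stand up a live Prometheus for this, `promtool test rules` can unit-test the same alerting rules offline. A sketch, assuming the rules above are saved as prometheus-slo-rules.yaml (run with `promtool test rules slo-rules-test.yaml`):

```yaml
# slo-rules-test.yaml
rule_files:
  - prometheus-slo-rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # 150 errors/min vs 850 successes/min = 15% error rate = 150x burn
      - series: 'http_requests_total{status="500"}'
        values: '0+150x120'
      - series: 'http_requests_total{status="200"}'
        values: '0+850x120'
    alert_rule_test:
      - eval_time: 1h10m
        alertname: HighErrorBudgetBurnRate
        exp_alerts:
          - exp_labels:
              severity: critical
```

This runs in CI in seconds and catches typos in rule expressions before they reach a real Prometheus.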

Error Budget Policy

The error budget policy defines what happens when budget is exhausted:

class ErrorBudgetPolicy:
    """Automate responses based on remaining error budget."""
    
    def __init__(self, slo: SLO, metrics_client):
        self.slo = slo
        self.metrics = metrics_client
    
    def get_remaining_budget_fraction(self) -> float:
        """Returns remaining error budget as a fraction (1.0 = full budget, 0.0 = exhausted)."""
        window_hours = self.slo.window_days * 24  # use the SLO's own window
        consumed_errors = self.metrics.get_error_count(hours=window_hours)
        total_requests = self.metrics.get_request_count(hours=window_hours)
        
        if total_requests == 0:
            return 1.0
        
        actual_error_rate = consumed_errors / total_requests
        allowed_error_rate = 1.0 - self.slo.target
        
        consumed_fraction = actual_error_rate / allowed_error_rate
        return max(0.0, 1.0 - consumed_fraction)
    
    def get_deployment_recommendation(self) -> str:
        """Return deployment policy based on current error budget."""
        remaining = self.get_remaining_budget_fraction()
        
        if remaining > 0.50:
            return "PROCEED"  # > 50% budget remaining
        elif remaining > 0.10:
            return "CAUTION"  # 10-50% budget remaining — require extra review
        else:
            return "FREEZE"   # < 10% budget — no risky deployments

def test_error_budget_policy_blocks_risky_deploys():
    """When error budget is exhausted, high-risk deployments should be blocked."""
    policy = ErrorBudgetPolicy(availability_slo, metrics_client=test_metrics_client)
    
    # Simulate budget exhausted
    test_metrics_client.set_error_rate(0.50)  # 500x the budget
    
    recommendation = policy.get_deployment_recommendation()
    assert recommendation == "FREEZE", \
        f"Expected FREEZE when budget exhausted, got {recommendation}"

SLO Review Cadence

Beyond automated testing, SLOs need regular human review:

def generate_slo_report(slo: SLO, window_days: int = 28) -> str:
    """Generate a weekly SLO compliance report."""
    metrics = query_slo_metrics(slo, window_days)
    
    report = f"""
## SLO Report: {slo.name}
**Period:** Last {window_days} days
**Target:** {slo.target:.4%}
**Actual:** {metrics.actual:.4%}
**Status:** {'✅ COMPLIANT' if metrics.actual >= slo.target else '❌ BREACHED'}

### Error Budget
- Total budget: {slo.error_budget_minutes():.1f} minutes
- Consumed: {metrics.budget_consumed_minutes:.1f} minutes
- Remaining: {metrics.budget_remaining_minutes:.1f} minutes ({metrics.budget_remaining_pct:.1%})

### Incidents
{chr(10).join(f"- {i.timestamp}: {i.duration_minutes:.1f}min — {i.description}" 
              for i in metrics.incidents)}

### Recommendation
{'Deploy velocity is sustainable.' if metrics.budget_remaining_pct > 0.5 
 else 'Slow deploy cadence — budget is constrained.' if metrics.budget_remaining_pct > 0.1
 else '⚠ FREEZE non-critical changes. Budget nearly exhausted.'}
"""
    return report

SLIs and SLOs work best when they're tied to real user pain. A latency SLO that nobody notices breaching is the wrong SLO. Start with one or two SLOs that directly correspond to user experience, test them rigorously, and expand from there.
