Developer Platform Reliability: SLOs, Chaos Testing, and Platform Health Checks

Internal developer platforms (IDPs) are infrastructure for your engineers. When they break, development stalls. Yet most organizations define SLOs for their user-facing products while running their IDPs on vibes and hope.

This guide covers how to define meaningful SLOs for developer platforms, test them with chaos experiments, and build health checks that surface problems before developers notice.

Why Developer Platforms Need SLOs

Your platform team's customers are internal developers. When the CI pipeline is broken, the artifact registry is unreachable, or the internal portal shows stale data, developers can't ship. The cost is direct: engineering time lost, blocked deployments, frustrated teams.

SLOs for developer platforms answer: "How reliable does this platform need to be, and are we meeting that bar?"

A developer platform typically consists of:

  • CI/CD systems (GitHub Actions, Jenkins, ArgoCD)
  • Artifact registries (Docker Hub, ECR, Artifactory)
  • Internal portals (Backstage, Port)
  • Secret management (Vault, AWS Secrets Manager)
  • Service catalogs and documentation
  • Infrastructure provisioning (Terraform Cloud, Pulumi Cloud)
  • Observability tooling (Grafana, Datadog, Honeycomb)

Each component needs its own SLOs.

Defining SLOs for Platform Components

CI/CD Pipeline SLOs

# slos/ci-cd.yaml
slos:
  - name: "CI Pipeline Availability"
    description: "Percentage of CI pipeline runs that complete (pass or fail) without infrastructure errors"
    indicator:
      type: ratio
      good_events: "ci_runs_total{outcome!='infrastructure_error'}"
      total_events: "ci_runs_total"
    target: 99.5  # 0.5% allowed infrastructure failure rate
    window: 30d

  - name: "CI Pipeline Latency"
    description: "Percentage of CI runs that complete within 20 minutes"
    indicator:
      type: ratio
      good_events: "ci_run_duration_seconds_bucket{le='1200'}"  # 20 min = 1200s
      total_events: "ci_run_duration_seconds_count"
    target: 95.0  # 5% can be slow
    window: 30d

  - name: "Artifact Push Availability"
    description: "Successful Docker image pushes to registry"
    indicator:
      type: ratio
      good_events: "registry_push{status='200'}"
      total_events: "registry_push_total"
    target: 99.9
    window: 30d
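
To make these targets concrete, a ratio SLO can be turned into an error budget with a few lines of arithmetic. The following is a minimal sketch, not part of any SLO tool: `BudgetReport`, `evaluate_ratio_slo`, and the example counts are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class BudgetReport:
    sli: float              # observed good/total ratio
    allowed_bad: float      # bad events the target permits at this volume
    budget_consumed: float  # fraction of the error budget used (1.0 = exhausted)
    slo_met: bool


def evaluate_ratio_slo(good_events: int, total_events: int,
                       target_percent: float) -> BudgetReport:
    """Evaluate a ratio SLO such as "CI Pipeline Availability" (target 99.5)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 < target_percent < 100:
        raise ValueError("target_percent must be between 0 and 100 (exclusive)")

    target = target_percent / 100
    sli = good_events / total_events
    bad_events = total_events - good_events
    # The error budget is the share of events the target allows to be bad.
    allowed_bad = (1 - target) * total_events
    return BudgetReport(
        sli=sli,
        allowed_bad=allowed_bad,
        budget_consumed=bad_events / allowed_bad,
        slo_met=sli >= target,
    )
```

For example, 9,970 good runs out of 10,000 against a 99.5 target gives an SLI of 99.7% with about 60% of the budget spent: 30 bad runs against 50 allowed.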

Secret Manager SLOs

  - name: "Secret Retrieval Availability"
    description: "Applications can read secrets on startup"
    indicator:
      type: ratio
      good_events: "secret_reads{status='success'}"
      total_events: "secret_reads_total"
    target: 99.99  # Four nines — startup failures are critical
    window: 30d

  - name: "Secret Retrieval Latency"
    description: "Secret reads complete under 500ms"
    indicator:
      type: ratio
      good_events: "secret_read_duration_ms_bucket{le='500'}"
      total_events: "secret_read_duration_ms_count"
    target: 99.0
    window: 30d

Internal Portal SLOs

  - name: "Developer Portal Availability"
    description: "Portal loads successfully for engineers"
    indicator:
      type: availability
      probe: "https://internal.example.com/health"
      expected_status: 200
    target: 99.5  # Lower target — planned maintenance is more acceptable
    window: 30d

  - name: "Service Catalog Freshness"
    description: "Service entities updated within 1 hour of GitHub push"
    indicator:
      type: freshness
      max_staleness_minutes: 60
    target: 99.0
    window: 30d
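
A freshness indicator is less standard than a ratio of request counts, so here is one possible way to compute it. This sketch samples the catalog at a point in time and reports the fraction of entities that are currently fresh. The `Entity` fields (`pushed_at`, `synced_at`) and both function names are hypothetical; a real portal would expose this data differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Entity:
    name: str
    pushed_at: datetime            # time of the latest GitHub push
    synced_at: Optional[datetime]  # time the catalog last ingested the repo


def is_fresh(entity: Entity, now: datetime, limit: timedelta) -> bool:
    # Fresh if the catalog already reflects the latest push...
    if entity.synced_at is not None and entity.synced_at >= entity.pushed_at:
        return True
    # ...or if the push is recent enough that the sync is not yet overdue.
    return now - entity.pushed_at <= limit


def freshness_ratio(entities: List[Entity], now: datetime,
                    max_staleness_minutes: int = 60) -> float:
    """Fraction of catalog entities that are fresh at time `now`."""
    if not entities:
        raise ValueError("no entities to evaluate")
    limit = timedelta(minutes=max_staleness_minutes)
    fresh = sum(1 for e in entities if is_fresh(e, now, limit))
    return fresh / len(entities)
```

Sampling this ratio on a schedule and comparing it with the 99.0 target gives a time series that can be tracked like any other SLI.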

Implementing SLO Monitoring

Prometheus-Based SLO Tracking

# prometheus/recording-rules.yaml
groups:
  - name: platform_slos
    interval: 30s
    rules:
      # CI Pipeline availability
      - record: ci:run_success_rate:5m
        expr: |
          sum(rate(ci_runs_total{outcome!="infrastructure_error"}[5m]))
          /
          sum(rate(ci_runs_total[5m]))

      # Error budget consumption rate
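      # Error budget burn rate over 1h: observed error rate divided by the
      # allowed error rate (0.5% = 0.005). A value of 1 means the budget is
      # being spent exactly as fast as the 30-day window allows.
      - record: ci:error_budget_burn_rate:1h
        expr: |
          (
            1 - sum(rate(ci_runs_total{outcome!="infrastructure_error"}[1h]))
                / sum(rate(ci_runs_total[1h]))
          ) / 0.005

# tests/slos/test_slo_compliance.py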
import pytest
import httpx
from datetime import datetime, timedelta

PROMETHEUS = "http://prometheus:9090"

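def query_prometheus(query, start=None, end=None, step="60s"):
    """Run a range query when start and end are given, otherwise an instant query."""
    if start and end:
        r = httpx.get(f"{PROMETHEUS}/api/v1/query_range", params={
            "query": query,
            # Unix timestamps avoid the missing timezone in naive isoformat() strings
            "start": start.timestamp(),
            "end": end.timestamp(),
            "step": step
        }, timeout=30)
    else:
        r = httpx.get(f"{PROMETHEUS}/api/v1/query", params={"query": query}, timeout=30)

    r.raise_for_status()
    return r.json()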

def test_ci_availability_slo_met():
    """CI pipeline availability should meet 99.5% SLO over the last 30 days."""
    # increase(...[30d]) already covers the whole window, so one instant query
    # is enough. A 30-day range query at 60s steps would exceed Prometheus's
    # limit of 11,000 points per series.
    result = query_prometheus(
        'sum(increase(ci_runs_total{outcome!="infrastructure_error"}[30d])) / sum(increase(ci_runs_total[30d]))'
    )

    data = result["data"]["result"]
    if not data:
        pytest.skip("No CI run data available")

    availability = float(data[0]["value"][1])
    assert availability >= 0.995, \
        f"CI availability SLO breached: {availability:.3%} < 99.5%"

def test_secret_retrieval_latency_slo_met():
    """99% of secret reads should complete in under 500ms."""
    result = query_prometheus(
        'histogram_quantile(0.99, sum(rate(secret_read_duration_ms_bucket[30d])) by (le))'
    )
    
    data = result["data"]["result"]
    if not data:
        pytest.skip("No secret read metrics available")
    
    p99_ms = float(data[0]["value"][1])
    assert p99_ms <= 500, \
        f"Secret retrieval p99 latency SLO breached: {p99_ms}ms > 500ms"

def test_error_budget_not_exhausted():
    """Error budget should not be more than 80% consumed."""
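    # 30-day window, 99.5% target = 0.5% error budget
    # The budget is counted in runs: 0.5% of CI runs in the window may fail on
    # infrastructure errors (for a time-based SLO, 0.5% of 30 days = 216 minutes)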
    
    result = query_prometheus(
        'sum(increase(ci_runs_total{outcome="infrastructure_error"}[30d])) / sum(increase(ci_runs_total[30d]))'
    )
    
    data = result["data"]["result"]
    if not data:
        pytest.skip("No data available")
    
    actual_error_rate = float(data[0]["value"][1])
    allowed_error_rate = 0.005  # 99.5% SLO
    budget_consumed = actual_error_rate / allowed_error_rate
    
    assert budget_consumed <= 0.8, \
        f"Error budget {budget_consumed:.0%} consumed (threshold: 80%)"

Chaos Engineering for Developer Platforms

Platform components need to survive failures in their own dependencies. Chaos experiments reveal brittleness.
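
Every chaos experiment has the same shape: apply a fault, observe, and roll the fault back even when the observation fails. One way to make the rollback hard to forget is a small context manager. This is a sketch under the assumption that a fault can be expressed as a pair of callables; `injected_fault` is a name invented here, not part of any chaos tool.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def injected_fault(inject: Callable[[], None],
                   restore: Callable[[], None]) -> Iterator[None]:
    """Apply a fault for the duration of a with-block and always roll it back."""
    # If inject() itself raises, nothing was applied, so restore() is not called.
    inject()
    try:
        yield
    finally:
        # Runs on success, assertion failure, or any other exception in the block.
        restore()
```

A test would wrap its assertions in `with injected_fault(block, unblock):` so a failing assertion cannot leave the fault active for the tests that follow.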

CI/CD Chaos: Registry Outage

# chaos/test_registry_outage.py
"""Test CI pipeline behavior when artifact registry is unavailable."""
import subprocess
import time
import pytest

@pytest.fixture
def blocked_registry(docker_network):
    """Block traffic to the artifact registry."""
    # Use iptables or tc-netem to simulate registry outage
    subprocess.run([
        "iptables", "-A", "OUTPUT",
        "-d", "registry.example.com",
        "-j", "DROP"
    ], check=True)
    
    yield
    
    # Restore
    subprocess.run([
        "iptables", "-D", "OUTPUT",
        "-d", "registry.example.com",
        "-j", "DROP"
    ], check=True)

def test_ci_pipeline_fails_gracefully_on_registry_outage(blocked_registry):
    """CI pipeline should fail with a clear error, not hang indefinitely."""
    start = time.time()
    
    # Trigger a pipeline that requires a registry push
    # (trigger_test_pipeline is a project-specific helper, not shown here)
    result = trigger_test_pipeline(timeout=120)
    
    elapsed = time.time() - start
    
    # Should fail, not hang
    assert result["status"] == "FAILURE"
    assert elapsed < 120, "Pipeline hung instead of failing fast"
    
    # Error should mention the registry, not be a cryptic timeout
    assert any(
        keyword in result["error_message"].lower()
        for keyword in ["registry", "push", "network", "connection"]
    ), f"Unclear error message: {result['error_message']}"

Secret Manager Chaos: Latency Injection

# chaos/test_vault_latency.py
"""Test application startup with slow Vault responses."""
import requests
import time
import threading

def inject_latency(proxy_name, delay_ms=2000, duration_seconds=30):
    """Use toxiproxy to inject latency into the named proxy (e.g. "vault")."""
    toxiproxy = requests.Session()
    toxiproxy.headers["Content-Type"] = "application/json"
    toxics_url = f"http://toxiproxy:8474/proxies/{proxy_name}/toxics"
    toxic_name = f"{proxy_name}_latency"

    # Add latency toxic
    toxic_payload = {
        "name": toxic_name,
        "type": "latency",
        "attributes": {"latency": delay_ms, "jitter": 100}
    }
    toxiproxy.post(toxics_url, json=toxic_payload).raise_for_status()

    try:
        time.sleep(duration_seconds)
    finally:
        # Remove toxic, even if the sleep is interrupted
        toxiproxy.delete(f"{toxics_url}/{toxic_name}")

def test_app_starts_despite_slow_vault():
    """Applications should start successfully even if Vault is slow."""
    # Inject 2-second latency
    latency_thread = threading.Thread(
        target=inject_latency,
        args=("vault", 2000, 60),
        daemon=True
    )
    latency_thread.start()
    
    # Try starting the application
    start = time.time()
    app_process = start_test_application()
    ready = wait_for_ready(app_process, timeout=30)
    elapsed = time.time() - start
    
    assert ready, "Application failed to start within 30s under Vault latency"
    assert elapsed < 30, f"Application startup took {elapsed:.1f}s — too slow"
    
    app_process.terminate()

def test_app_handles_vault_unavailable():
    """Application should start with cached/default secrets when Vault is down."""
    # Block Vault entirely
    block_service("vault")
    app_process = None  # keeps the finally block safe if startup raises

    try:
        app_process = start_test_application()
        ready = wait_for_ready(app_process, timeout=10)

        if not ready:
            # Acceptable: app refuses to start without secrets
            # Verify it exits cleanly with a meaningful error
            returncode = app_process.wait(timeout=5)
            assert returncode != 0
            logs = app_process.stderr.read().decode()
            assert "vault" in logs.lower() or "secret" in logs.lower()
        else:
            # Also acceptable: app starts with fallback config
            # Verify it reports degraded state
            health = requests.get("http://localhost:8080/health", timeout=5).json()
            assert health["status"] in ("degraded", "partial")
    finally:
        unblock_service("vault")
        if app_process is not None and app_process.poll() is None:
            app_process.terminate()
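
Both outcomes this test accepts can come from a single startup routine on the application side. The sketch below is one possible shape for it, assuming the application keeps a local copy of its last known secrets. `load_secrets`, its `fetch` callable, and the `"ok"`/`"degraded"` status strings are illustrative assumptions, not a Vault client API.

```python
from typing import Callable, Dict, Optional, Tuple


def load_secrets(fetch: Callable[[], Dict[str, str]],
                 cached: Optional[Dict[str, str]]) -> Tuple[Dict[str, str], str]:
    """Return (secrets, status), where status is "ok" or "degraded"."""
    try:
        return fetch(), "ok"
    except Exception as exc:
        if cached is None:
            # No fallback available: fail startup with an error that names the cause.
            raise RuntimeError(
                f"secret store unavailable and no cached secrets: {exc}"
            ) from exc
        # Fall back to the last known values and report a degraded state.
        return dict(cached), "degraded"
```

The returned status is what a `/health` endpoint would surface, and the `RuntimeError` message is what should end up on stderr when the application refuses to start.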

Platform Portal Chaos: Database Degradation

import requests

# inject_cpu_stress is a project-specific helper (not shown here)
def test_portal_serves_stale_data_when_db_slow():
    """Portal should serve cached data and display staleness warning under DB load."""
    # Inject CPU stress on the database host
    inject_cpu_stress(target="postgres-portal", cpu_percent=90, duration=30)

    response = requests.get("https://internal.example.com/api/catalog/services", timeout=10)
    
    # Portal should still respond
    assert response.status_code == 200
    
    # But should indicate data may be stale
    assert response.json().get("stale") is True or \
           "X-Data-Staleness" in response.headers
    
    # Response should still be fast (served from cache)
    assert response.elapsed.total_seconds() < 2.0
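
For this test to pass, the portal needs a cache that keeps answering when the database does not. A minimal sketch of that behavior follows. `StaleOnErrorCache` and its response fields are assumptions chosen to match the `stale` flag the test looks for, not the API of Backstage or any other portal.

```python
import time
from typing import Any, Callable, Dict, List, Optional


class StaleOnErrorCache:
    """Serve the last good value, flagged as stale, when the loader fails."""

    def __init__(self, loader: Callable[[], List[Any]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._clock = clock
        self._value: Optional[List[Any]] = None
        self._loaded_at: Optional[float] = None

    def get(self) -> Dict[str, Any]:
        try:
            self._value = self._loader()
            self._loaded_at = self._clock()
            return {"items": self._value, "stale": False, "age_seconds": 0.0}
        except Exception:
            if self._loaded_at is None:
                raise  # nothing cached yet, so the failure must surface
            age = self._clock() - self._loaded_at
            return {"items": self._value, "stale": True, "age_seconds": age}
```

In a real service the loader would also need its own timeout so that a slow database turns into a fast failure; otherwise the response-time assertion in the test cannot hold.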

Platform Health Checks

Run continuous health checks against platform components to detect failures before developers notice.

Synthetic Monitoring

# tests/platform_health/test_health_checks.py
"""Synthetic monitoring for developer platform components."""
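import os
import subprocess

import httpx

# Credentials and endpoints come from the environment (variable names are assumptions)
GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
VAULT_ADDR = os.environ["VAULT_ADDR"]
VAULT_TOKEN = os.environ["VAULT_TOKEN"]
PORTAL_TOKEN = os.environ["PORTAL_TOKEN"]
GRAFANA_PASSWORD = os.environ["GRAFANA_PASSWORD"]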

def test_ci_can_start_new_run():
    """Verify CI is accepting new pipeline triggers."""
    r = httpx.post(
        "https://api.github.com/repos/org/smoke-test/dispatches",
        json={"event_type": "health_check"},
        headers={"Authorization": f"token {GITHUB_TOKEN}"}
    )
    assert r.status_code == 204, f"CI dispatch failed: {r.text}"

def test_artifact_registry_push_pull():
    """Verify the artifact registry accepts push and pull."""
    # Push a tiny test image
    subprocess.run([
        "docker", "build", "-t", "registry.example.com/smoke:latest",
        "-f", "-", "."
    ], input=b"FROM scratch", check=True, capture_output=True)
    
    push_result = subprocess.run([
        "docker", "push", "registry.example.com/smoke:latest"
    ], capture_output=True)
    assert push_result.returncode == 0, f"Registry push failed: {push_result.stderr}"
    
    # Pull it back
    pull_result = subprocess.run([
        "docker", "pull", "registry.example.com/smoke:latest"
    ], capture_output=True)
    assert pull_result.returncode == 0, f"Registry pull failed: {pull_result.stderr}"

def test_vault_secret_accessible():
    """Verify Vault is reachable and can return secrets."""
    import hvac
    client = hvac.Client(url=VAULT_ADDR, token=VAULT_TOKEN)
    
    secret = client.secrets.kv.v2.read_secret_version(
        path="smoke-test/health-check"
    )
    assert secret["data"]["data"]["value"] == "ok"

def test_internal_portal_returns_200():
    r = httpx.get("https://internal.example.com/health", timeout=10)
    assert r.status_code == 200

def test_portal_service_catalog_not_empty():
    r = httpx.get(
        "https://internal.example.com/api/catalog/services",
        headers={"Authorization": f"Bearer {PORTAL_TOKEN}"}
    )
    assert r.status_code == 200
    services = r.json()["items"]
    assert len(services) > 0, "Service catalog is empty — sync may have broken"

def test_grafana_datasource_healthy():
    """Grafana's datasources should all be working."""
    r = httpx.get(
        "https://grafana.internal.example.com/api/datasources",
        auth=("admin", GRAFANA_PASSWORD)
    )
    datasources = r.json()
    
    for ds in datasources:
        health_r = httpx.get(
            f"https://grafana.internal.example.com/api/datasources/{ds['id']}/health",
            auth=("admin", GRAFANA_PASSWORD)
        )
        assert health_r.json()["status"] == "OK", \
            f"Datasource '{ds['name']}' unhealthy: {health_r.json()['message']}"

Scheduling Platform Health Checks

Run these checks continuously. With HelpMeTest, you can schedule them without managing test infrastructure:

# Portal availability check
Go To https://internal.example.com/health
Status Should Be 200
Page Should Contain "status":"ok"

# CI API availability
GET https://api.github.com/repos/org/status
Status Should Be 200

HelpMeTest runs checks every 5 minutes and sends alerts when any platform component fails.

Alerting on SLO Breaches

# prometheus/alerts.yaml
groups:
  - name: platform_slos
    rules:
      - alert: CIPipelineAvailabilitySLOAtRisk
        expr: |
          (
            1 - sum(rate(ci_runs_total{outcome!="infrastructure_error"}[6h]))
            /
            sum(rate(ci_runs_total[6h]))
          ) > 0.005  # Burning error budget faster than 1x
        for: 30m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "CI pipeline error budget burning fast"
          description: >
            CI pipeline error rate over 6h is {{ $value | humanizePercentage }},
            above the 0.5% that the 99.5% SLO allows. If this rate persists, the
            30-day error budget will run out before the window ends.

      - alert: VaultLatencySLOBreach
        expr: |
          histogram_quantile(0.99, 
            sum(rate(secret_read_duration_ms_bucket[1h])) by (le)
          ) > 500
        for: 10m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Vault p99 latency exceeds SLO"
          description: >
            Vault secret retrieval p99 latency is {{ $value }}ms.
            SLO target: 500ms. Applications may be experiencing slow startup.
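
The 0.005 threshold in the first alert corresponds to a burn rate of 1, meaning the budget is being spent exactly as fast as the window allows. The arithmetic behind that is sketched below. The function names are illustrative, and the projection assumes the error rate stays constant and the full budget is still available.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return error_rate / (1 - slo_target)


def days_until_budget_exhausted(error_rate: float, slo_target: float,
                                window_days: float = 30.0) -> float:
    """Days a full error budget lasts if the current error rate persists."""
    rate = burn_rate(error_rate, slo_target)
    if rate <= 0:
        return float("inf")  # no errors, so the budget is never spent
    return window_days / rate
```

A 1% error rate against the 99.5% target is a burn rate of about 2, which would use up a 30-day budget in about 15 days.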

Platform reliability is not a soft goal — it's a multiplier on all engineering productivity. Every minute your CI is broken is a minute every engineer in your company loses. Define SLOs, test against them, and treat breaches with the same urgency as production incidents.
