Developer Platform Reliability: SLOs, Chaos Testing, and Platform Health Checks
Internal developer platforms (IDPs) are infrastructure for your engineers. When they break, development slows or stops. Yet many platform teams define SLOs for their user-facing products while running their own IDPs on vibes and hope.
This guide covers how to define meaningful SLOs for developer platforms, test them with chaos experiments, and build health checks that surface problems before developers notice.
Why Developer Platforms Need SLOs
Your platform team's customers are internal developers. When the CI pipeline is broken, the artifact registry is unreachable, or the internal portal shows stale data, developers can't ship. The cost is direct: engineering time lost, blocked deployments, frustrated teams.
SLOs for developer platforms answer: "How reliable does this platform need to be, and are we meeting that bar?"
A developer platform typically consists of:
- CI/CD systems (GitHub Actions, Jenkins, ArgoCD)
- Artifact registries (Docker Hub, ECR, Artifactory)
- Internal portals (Backstage, Port)
- Secret management (Vault, AWS Secrets Manager)
- Service catalogs and documentation
- Infrastructure provisioning (Terraform Cloud, Pulumi Cloud)
- Observability tooling (Grafana, Datadog, Honeycomb)
Each component needs its own SLOs.
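To make a target concrete, it helps to translate the percentage into the amount of time a component may be out of SLO. The sketch below shows that arithmetic; the function name `error_budget` is an assumption of this example, and it treats the budget as time-based, which is only an approximation for event-based indicators such as CI runs.

```python
from datetime import timedelta


def error_budget(target_percent: float, window: timedelta) -> timedelta:
    """Return how long a component may be out of SLO within the window."""
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in (0, 100]")
    # A 99.5% target leaves 0.5% of the window as budget.
    allowed_fraction = 1 - target_percent / 100
    return window * allowed_fraction


if __name__ == "__main__":
    for target in (99.5, 99.9, 99.99):
        budget = error_budget(target, timedelta(days=30))
        print(f"{target}% over 30 days: {budget.total_seconds() / 60:.2f} minutes")
```

Over a 30-day window, 99.5% leaves about 216 minutes, 99.9% about 43 minutes, and 99.99% a little over 4 minutes, which is why the target for each component deserves a deliberate choice.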
Defining SLOs for Platform Components
CI/CD Pipeline SLOs
# slos/ci-cd.yaml
slos:
  - name: "CI Pipeline Availability"
    description: "Percentage of CI pipeline runs that complete (pass or fail) without infrastructure errors"
    indicator:
      type: ratio
      good_events: "ci_runs_total{outcome!='infrastructure_error'}"
      total_events: "ci_runs_total"
    target: 99.5  # 0.5% allowed infrastructure failure rate
    window: 30d
  - name: "CI Pipeline Latency"
    description: "Percentage of CI runs that complete within 20 minutes"
    indicator:
      type: ratio
      # Histogram bucket counting runs at or under 20 min (1200s)
      good_events: "ci_run_duration_seconds_bucket{le='1200'}"
      total_events: "ci_run_duration_seconds_count"
    target: 95.0  # 5% can be slow
    window: 30d
  - name: "Artifact Push Availability"
    description: "Successful Docker image pushes to registry"
    indicator:
      type: ratio
      good_events: "registry_push_total{status='200'}"
      total_events: "registry_push_total"
    target: 99.9
    window: 30d
Secret Manager SLOs
- name: "Secret Retrieval Availability"
  description: "Applications can read secrets on startup"
  indicator:
    type: ratio
    good_events: "secret_reads_total{status='success'}"
    total_events: "secret_reads_total"
  target: 99.99  # Four nines: startup failures are critical
  window: 30d
- name: "Secret Retrieval Latency"
  description: "Secret reads complete under 500ms"
  indicator:
    type: ratio
    # Histogram bucket counting reads at or under 500ms
    good_events: "secret_read_duration_ms_bucket{le='500'}"
    total_events: "secret_read_duration_ms_count"
  target: 99.0
  window: 30d
Internal Portal SLOs
- name: "Developer Portal Availability"
  description: "Portal loads successfully for engineers"
  indicator:
    type: availability
    probe: "https://internal.example.com/health"
    expected_status: 200
  target: 99.5  # Lower target: planned maintenance is more acceptable
  window: 30d
- name: "Service Catalog Freshness"
  description: "Service entities updated within 1 hour of GitHub push"
  indicator:
    type: freshness
    max_staleness_minutes: 60
  target: 99.0
  window: 30d
Implementing SLO Monitoring
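The freshness indicator is less standard than a ratio of counters, so it is worth spelling out how it could be computed. The following is a minimal sketch, assuming you can collect pairs of timestamps (when a change was pushed, and when the catalog reflected it); the function name `freshness_ratio` and the input shape are assumptions of this example, not part of any particular SLO tool.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

Event = Tuple[datetime, Optional[datetime]]  # (pushed_at, catalog_updated_at)


def freshness_ratio(
    events: Iterable[Event],
    max_staleness: timedelta = timedelta(minutes=60),
) -> Optional[float]:
    """Fraction of pushes the catalog reflected within max_staleness.

    A catalog_updated_at of None means the push was never synced and
    counts as stale. Returns None when there are no events to judge.
    """
    total = 0
    fresh = 0
    for pushed_at, updated_at in events:
        total += 1
        if updated_at is not None and updated_at - pushed_at <= max_staleness:
            fresh += 1
    return fresh / total if total else None
```

Comparing the returned ratio with the 99.0% target gives the same pass or fail signal as the ratio-based indicators.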
Prometheus-Based SLO Tracking
# prometheus/recording-rules.yaml
groups:
  - name: platform_slos
    interval: 30s
    rules:
      # CI Pipeline availability
      - record: ci:run_success_rate:5m
        expr: |
          sum(rate(ci_runs_total{outcome!="infrastructure_error"}[5m]))
          /
          sum(rate(ci_runs_total[5m]))
      # Error budget burn rate: observed error rate divided by the
      # allowed error rate (0.5% = 0.005). A value above 1 means the
      # budget is being spent faster than the SLO allows.
      - record: ci:error_budget_burn_rate:5m
        expr: |
          (1 - ci:run_success_rate:5m) / 0.005
# tests/slos/test_slo_compliance.py
import pytest
import httpx

PROMETHEUS = "http://prometheus:9090"


def query_prometheus(query, start=None, end=None, step="60s"):
    """Run a range query when start and end are given, else an instant query."""
    if start and end:
        r = httpx.get(f"{PROMETHEUS}/api/v1/query_range", params={
            "query": query,
            # Unix timestamps avoid the timezone suffix that RFC 3339 requires
            "start": start.timestamp(),
            "end": end.timestamp(),
            "step": step,
        })
    else:
        r = httpx.get(f"{PROMETHEUS}/api/v1/query", params={"query": query})
    r.raise_for_status()
    return r.json()


def test_ci_availability_slo_met():
    """CI pipeline availability should meet 99.5% SLO over the last 30 days."""
    # increase(...[30d]) already spans the window, so an instant query is enough
    result = query_prometheus(
        'sum(increase(ci_runs_total{outcome!="infrastructure_error"}[30d])) / sum(increase(ci_runs_total[30d]))'
    )
    data = result["data"]["result"]
    if not data:
        pytest.skip("No CI run data available")
    availability = float(data[0]["value"][1])
    assert availability >= 0.995, \
        f"CI availability SLO breached: {availability:.3%} < 99.5%"


def test_secret_retrieval_latency_slo_met():
    """99% of secret reads should complete in under 500ms."""
    result = query_prometheus(
        'histogram_quantile(0.99, sum(rate(secret_read_duration_ms_bucket[30d])) by (le))'
    )
    data = result["data"]["result"]
    if not data:
        pytest.skip("No secret read metrics available")
    p99_ms = float(data[0]["value"][1])
    assert p99_ms <= 500, \
        f"Secret retrieval p99 latency SLO breached: {p99_ms}ms > 500ms"


def test_error_budget_not_exhausted():
    """Error budget should not be more than 80% consumed."""
    # 30-day window, 99.5% target = 0.5% error budget
    # 0.5% of 30 days = 216 minutes of allowed downtime
    result = query_prometheus(
        'sum(increase(ci_runs_total{outcome="infrastructure_error"}[30d])) / sum(increase(ci_runs_total[30d]))'
    )
    data = result["data"]["result"]
    if not data:
        pytest.skip("No data available")
    actual_error_rate = float(data[0]["value"][1])
    allowed_error_rate = 0.005  # 99.5% SLO
    budget_consumed = actual_error_rate / allowed_error_rate
    assert budget_consumed <= 0.8, \
        f"Error budget {budget_consumed:.0%} consumed (threshold: 80%)"
Chaos Engineering for Developer Platforms
Platform components need to survive failures in their own dependencies. Chaos experiments reveal brittleness.
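A chaos experiment usually follows the same shape: confirm the system is healthy, inject a fault, check a hypothesis about how the system behaves, then roll the fault back and confirm recovery. The sketch below captures that shape in plain Python. `run_experiment` and `ExperimentResult` are hypothetical names for illustration, and the four callables stand in for whatever health probe and fault-injection mechanism you use.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ExperimentResult:
    steady_before: bool
    hypothesis_held: Optional[bool]  # None if the fault was never injected
    steady_after: Optional[bool]     # None if the fault was never injected


def run_experiment(
    steady_state: Callable[[], bool],
    inject: Callable[[], None],
    restore: Callable[[], None],
    hypothesis: Callable[[], bool],
) -> ExperimentResult:
    """Run one chaos experiment and always roll the fault back."""
    # Never inject a fault into a system that is already unhealthy.
    if not steady_state():
        return ExperimentResult(False, None, None)
    inject()
    try:
        held = hypothesis()
    finally:
        restore()  # runs even if the hypothesis check raises
    # Re-check health to confirm the system recovered after rollback.
    return ExperimentResult(True, held, steady_state())
```

The `finally` clause matters most in practice: a failed assertion that leaves a firewall rule or latency toxic in place turns a test failure into a real outage.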
CI/CD Chaos: Registry Outage
# chaos/test_registry_outage.py
"""Test CI pipeline behavior when artifact registry is unavailable."""
import subprocess
import time

import pytest

# docker_network (a fixture) and trigger_test_pipeline (a helper) are
# project-specific and not shown here.


@pytest.fixture
def blocked_registry(docker_network):
    """Block traffic to the artifact registry."""
    # Use iptables or tc-netem to simulate registry outage (requires root)
    subprocess.run([
        "iptables", "-A", "OUTPUT",
        "-d", "registry.example.com",
        "-j", "DROP"
    ], check=True)
    yield
    # Restore: pytest runs this teardown even when the test fails
    subprocess.run([
        "iptables", "-D", "OUTPUT",
        "-d", "registry.example.com",
        "-j", "DROP"
    ], check=True)


def test_ci_pipeline_fails_gracefully_on_registry_outage(blocked_registry):
    """CI pipeline should fail with a clear error, not hang indefinitely."""
    start = time.time()
    # Trigger a pipeline that requires a registry push
    result = trigger_test_pipeline(timeout=120)
    elapsed = time.time() - start
    # Should fail, not hang
    assert result["status"] == "FAILURE"
    assert elapsed < 120, "Pipeline hung instead of failing fast"
    # Error should mention the registry, not be a cryptic timeout
    assert any(
        keyword in result["error_message"].lower()
        for keyword in ["registry", "push", "network", "connection"]
    ), f"Unclear error message: {result['error_message']}"
Secret Manager Chaos: Latency Injection
# chaos/test_vault_latency.py
"""Test application startup with slow Vault responses."""
import time
from contextlib import contextmanager

import requests

TOXIPROXY = "http://toxiproxy:8474"

# start_test_application, wait_for_ready, block_service and unblock_service
# are project-specific helpers that are not shown here.


@contextmanager
def inject_latency(proxy, delay_ms=2000, jitter_ms=100):
    """Use toxiproxy to add latency to a proxy for the duration of a with-block."""
    toxic_name = f"{proxy}_latency"
    # Add latency toxic
    requests.post(
        f"{TOXIPROXY}/proxies/{proxy}/toxics",
        json={
            "name": toxic_name,
            "type": "latency",
            "attributes": {"latency": delay_ms, "jitter": jitter_ms},
        },
    ).raise_for_status()
    try:
        yield
    finally:
        # Remove toxic, even if the test body fails
        requests.delete(f"{TOXIPROXY}/proxies/{proxy}/toxics/{toxic_name}")


def test_app_starts_despite_slow_vault():
    """Applications should start successfully even if Vault is slow."""
    # Inject 2-second latency before the application starts, so startup
    # cannot race ahead of the fault
    with inject_latency("vault", delay_ms=2000):
        start = time.time()
        app_process = start_test_application()
        try:
            ready = wait_for_ready(app_process, timeout=30)
            elapsed = time.time() - start
        finally:
            app_process.terminate()
    assert ready, "Application failed to start within 30s under Vault latency"
    assert elapsed < 30, f"Application startup took {elapsed:.1f}s, too slow"


def test_app_handles_vault_unavailable():
    """Application should start with cached/default secrets when Vault is down."""
    # Block Vault entirely
    block_service("vault")
    app_process = None
    try:
        app_process = start_test_application()
        ready = wait_for_ready(app_process, timeout=10)
        if not ready:
            # Acceptable: app refuses to start without secrets
            # Verify it exits cleanly with a meaningful error
            returncode = app_process.wait(timeout=5)
            assert returncode != 0
            logs = app_process.stderr.read().decode()
            assert "vault" in logs.lower() or "secret" in logs.lower()
        else:
            # Also acceptable: app starts with fallback config
            # Verify it reports degraded state
            health = requests.get("http://localhost:8080/health").json()
            assert health["status"] in ("degraded", "partial")
    finally:
        unblock_service("vault")
        if app_process is not None and app_process.poll() is None:
            app_process.terminate()
Platform Portal Chaos: Database Degradation
import requests

# inject_cpu_stress is a project-specific helper that is not shown here.


def test_portal_serves_stale_data_when_db_slow():
    """Portal should serve cached data and display staleness warning under DB load."""
    # Inject CPU stress on the database host
    inject_cpu_stress(target="postgres-portal", cpu_percent=90, duration=30)
    response = requests.get("https://internal.example.com/api/catalog/services")
    # Portal should still respond
    assert response.status_code == 200
    # But should indicate data may be stale
    assert response.json().get("stale") is True or \
        "X-Data-Staleness" in response.headers
    # Response should still be fast (served from cache)
    assert response.elapsed.total_seconds() < 2.0
Platform Health Checks
Run continuous health checks against platform components to detect failures before developers notice.
Synthetic Monitoring
# tests/platform_health/test_health_checks.py
"""Synthetic monitoring for developer platform components."""
import os
import subprocess

import httpx
import hvac

# Credentials are read from the environment; the variable names are
# this example's assumption.
GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
VAULT_ADDR = os.environ["VAULT_ADDR"]
VAULT_TOKEN = os.environ["VAULT_TOKEN"]
PORTAL_TOKEN = os.environ["PORTAL_TOKEN"]
GRAFANA_PASSWORD = os.environ["GRAFANA_PASSWORD"]


def test_ci_can_start_new_run():
    """Verify CI is accepting new pipeline triggers."""
    r = httpx.post(
        "https://api.github.com/repos/org/smoke-test/dispatches",
        json={"event_type": "health_check"},
        headers={"Authorization": f"token {GITHUB_TOKEN}"}
    )
    assert r.status_code == 204, f"CI dispatch failed: {r.text}"


def test_artifact_registry_push_pull():
    """Verify the artifact registry accepts push and pull."""
    # Push a tiny test image
    subprocess.run([
        "docker", "build", "-t", "registry.example.com/smoke:latest",
        "-f", "-", "."
    ], input=b"FROM scratch", check=True, capture_output=True)
    push_result = subprocess.run([
        "docker", "push", "registry.example.com/smoke:latest"
    ], capture_output=True)
    assert push_result.returncode == 0, f"Registry push failed: {push_result.stderr}"
    # Pull it back
    pull_result = subprocess.run([
        "docker", "pull", "registry.example.com/smoke:latest"
    ], capture_output=True)
    assert pull_result.returncode == 0, f"Registry pull failed: {pull_result.stderr}"


def test_vault_secret_accessible():
    """Verify Vault is reachable and can return secrets."""
    client = hvac.Client(url=VAULT_ADDR, token=VAULT_TOKEN)
    secret = client.secrets.kv.v2.read_secret_version(
        path="smoke-test/health-check"
    )
    assert secret["data"]["data"]["value"] == "ok"


def test_internal_portal_returns_200():
    r = httpx.get("https://internal.example.com/health", timeout=10)
    assert r.status_code == 200


def test_portal_service_catalog_not_empty():
    r = httpx.get(
        "https://internal.example.com/api/catalog/services",
        headers={"Authorization": f"Bearer {PORTAL_TOKEN}"}
    )
    assert r.status_code == 200
    services = r.json()["items"]
    assert len(services) > 0, "Service catalog is empty; sync may have broken"


def test_grafana_datasource_healthy():
    """Grafana's datasources should all be working."""
    r = httpx.get(
        "https://grafana.internal.example.com/api/datasources",
        auth=("admin", GRAFANA_PASSWORD)
    )
    assert r.status_code == 200, f"Could not list datasources: {r.text}"
    datasources = r.json()
    for ds in datasources:
        health_r = httpx.get(
            f"https://grafana.internal.example.com/api/datasources/{ds['id']}/health",
            auth=("admin", GRAFANA_PASSWORD)
        )
        assert health_r.json()["status"] == "OK", \
            f"Datasource '{ds['name']}' unhealthy: {health_r.json()['message']}"
Scheduling Platform Health Checks
Run these checks continuously. With HelpMeTest, you can schedule them without managing test infrastructure:
# Portal availability check
Go To https://internal.example.com/health
Status Should Be 200
Page Should Contain "status":"ok"
# CI API availability
GET https://api.github.com/repos/org/status
Status Should Be 200
HelpMeTest runs checks every 5 minutes and sends alerts when any platform component fails.
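However the checks are scheduled, alerting on every single failure tends to page people for transient blips. One common mitigation is to alert only after several consecutive failures and to send a single recovery notice afterwards. The sketch below illustrates that idea; the class name `CheckState` and the threshold of 3 are assumptions of this example and do not describe how any particular tool behaves.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckState:
    """Tracks one health check and decides when to notify."""
    threshold: int = 3            # consecutive failures before alerting
    consecutive_failures: int = 0
    alerting: bool = False

    def record(self, passed: bool) -> Optional[str]:
        """Record one check result; return "alert", "recover", or None."""
        if passed:
            self.consecutive_failures = 0
            if self.alerting:
                self.alerting = False
                return "recover"
            return None
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold and not self.alerting:
            self.alerting = True
            return "alert"
        return None
```

With checks every 5 minutes and a threshold of 3, an outage is reported after roughly 10 to 15 minutes, so the threshold is a trade-off between noise and detection time.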
Alerting on SLO Breaches
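The alert rules in this section compare an observed error rate with the rate the SLO allows. The ratio between the two is usually called the burn rate. The following sketch shows the arithmetic; the function names are hypothetical, and it assumes the error rate stays constant, which real incidents rarely do.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring.

    error_rate and slo_target are fractions, e.g. 0.02 and 0.995.
    """
    allowed = 1 - slo_target
    if allowed <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / allowed


def days_until_exhausted(
    error_rate: float,
    slo_target: float,
    window_days: float = 30.0,
    budget_remaining: float = 1.0,
) -> float:
    """Days until the remaining budget is gone at a constant error rate."""
    rate = burn_rate(error_rate, slo_target)
    if rate == 0:
        return float("inf")
    # At burn rate 1 a full budget lasts exactly one window.
    return window_days * budget_remaining / rate
```

For a 99.5% target, a sustained 2% error rate is a burn rate of about 4, which would use up a full 30-day budget in about 7.5 days.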
# prometheus/alerts.yaml
groups:
  - name: platform_slos
    rules:
      - alert: CIPipelineAvailabilitySLOAtRisk
        expr: |
          (
            1 - sum(rate(ci_runs_total{outcome!="infrastructure_error"}[6h]))
            /
            sum(rate(ci_runs_total[6h]))
          ) > 0.005  # Burning error budget faster than 1x
        for: 30m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "CI pipeline error budget burning fast"
          description: >
            CI pipeline error rate over 6h is {{ $value | humanizePercentage }},
            above the 0.5% that the 99.5% SLO allows.
            If sustained, the 30-day error budget will run out before the window ends.
      - alert: VaultLatencySLOBreach
        expr: |
          histogram_quantile(0.99,
            sum(rate(secret_read_duration_ms_bucket[1h])) by (le)
          ) > 500
        for: 10m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Vault p99 latency exceeds SLO"
          description: >
            Vault secret retrieval p99 latency is {{ $value }}ms.
            SLO target: 500ms. Applications may be experiencing slow startup.
Platform reliability is not a soft goal; it is a multiplier on all engineering productivity. Every minute your CI is broken is a minute lost for every engineer waiting on it. Define SLOs, test against them, and treat breaches with the same urgency as production incidents.