Platform Engineering QA: Testing Internal Developer Platforms End-to-End
Internal developer platforms (IDPs) are the infrastructure that your engineering teams depend on every day. A broken platform breaks every team simultaneously. This guide covers the full QA strategy for IDPs: from the testing pyramid for platform components, to golden path validation, self-service workflow smoke tests, chaos testing for platform resilience, and SLO-based testing that proves your platform meets its reliability commitments.
Key Takeaways
The platform testing pyramid inverts the standard model. For application code, unit tests dominate. For IDPs, integration and E2E tests dominate — because the platform's value is in the interactions between components, not individual functions.
Golden path tests are your highest-priority E2E tests. A golden path is the recommended happy path for developer workflows (e.g., "create a new service"). If the golden path is broken, the platform is broken for everyone. Test it first, test it constantly.
Self-service workflows need idempotency tests. Platform APIs that create resources (namespaces, clusters, databases) must be safe to call multiple times. Test that retries don't create duplicate resources or fail noisily.
Chaos test your platform components, not your applications. Kill the platform API server, exhaust the Crossplane provider, saturate the ArgoCD application controller — the platform must degrade gracefully and self-heal.
SLO tests measure what matters to developers. Define concrete SLOs for platform operations (e.g., "new service provisioned within 5 minutes") and write tests that measure and assert on them. These are the tests that keep platform teams honest.
The IDP Testing Pyramid
Standard software testing advice says: write many unit tests, fewer integration tests, and minimal E2E tests. For platforms, this inverts:
```text
            /\
           /E2E\           Golden path workflows
          /------\         (high priority — must pass)
         /  Integ \        Component integration tests
        /-----------\      (catalog↔ArgoCD, Crossplane↔provider)
       /  Contract   \     API contract tests between platform services
      /---------------\
     /   Smoke Tests   \   Platform API health checks
    /-------------------\
```

Unit tests for individual platform components (individual Crossplane compositions, individual Backstage processors) still matter, but they're covered in the component-specific guides. This guide focuses on the layers above: integration, contract, smoke, and E2E testing for the platform as a whole.
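The contract layer never gets its own example later in this guide, so here is a minimal consumer-side sketch in Python. The endpoint shape, field names, and types are assumptions for illustration — a real contract test would validate live responses from the platform API (or recorded fixtures) against the schema both services agreed on.

```python
# tests/contract_namespace_api.py
# Consumer-side contract check for the platform namespace API (sketch).
# REQUIRED_FIELDS is a hypothetical schema, not the real API contract.

REQUIRED_FIELDS = {
    "name": str,
    "team": str,
    "environment": str,
    "status": str,
}


def check_namespace_contract(payload: dict) -> list:
    """Return a list of contract violations (empty list means the payload conforms)."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            errors.append(f"wrong type for {field}: {type(payload[field]).__name__}")
    return errors


if __name__ == "__main__":
    # In a real run this payload would come from
    # GET https://platform.internal/api/v1/namespaces/<name>
    sample = {
        "name": "team-a-staging",
        "team": "team-a",
        "environment": "staging",
        "status": "Active",
    }
    violations = check_namespace_contract(sample)
    assert violations == [], violations
    print("contract OK")
```

Tools like Pact formalize this pattern with broker-managed contracts, but even a plain schema assertion catches breaking changes between platform services before developers hit them.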
Defining Your Golden Paths
A golden path is the recommended, opinionated workflow for common developer tasks. Document your golden paths first, then write tests for each step.
Example: "Create a New Service" Golden Path
1. Developer submits ServiceClaim CR to cluster
2. Crossplane creates: namespace, ServiceAccount, RoleBinding, NetworkPolicy
3. ArgoCD detects the new namespace and creates an Application for it
4. Backstage catalog ingests the new component entity
5. Developer can access the service URL within 10 minutes
6. Developer can deploy an image to the service within 15 minutes

Each step is a testable assertion with a timeout. Write a test that walks through all of them end-to-end.
Example: "Create a Database" Golden Path
1. Developer applies XDatabase claim
2. Crossplane provisions RDS instance + subnet group + security group
3. Connection string is written to a Kubernetes Secret in the developer's namespace
4. Developer can connect to the database within 20 minutes
5. Backstage shows the database as a dependency of the service

Testing Self-Service Workflows
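The database golden path above can be automated the same way as the service path. Here is a sketch of step 4, the connection check; the Secret name and keys are assumptions — adjust them to whatever your XDatabase composition actually writes.

```python
# tests/golden-path-database.py
# Sketch of step 4 of the database golden path: read the connection Secret
# Crossplane wrote into the developer's namespace and prove the endpoint is
# reachable. Secret name "db-connection" and its keys are hypothetical.
import base64
import json
import socket
import subprocess


def read_connection_secret(namespace: str, secret_name: str = "db-connection") -> dict:
    """Fetch the Secret via kubectl and base64-decode every data field."""
    out = subprocess.run(
        f"kubectl get secret {secret_name} -n {namespace} -o json",
        shell=True, capture_output=True, text=True, check=True,
    ).stdout
    data = json.loads(out)["data"]
    return {k: base64.b64decode(v).decode() for k, v in data.items()}


def can_connect(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """TCP-level reachability check — enough to prove provisioning + networking."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    conn = read_connection_secret("my-team-staging")
    assert can_connect(conn["host"], int(conn["port"])), "database unreachable"
    print("OK: database endpoint reachable")
```

A TCP connect keeps the test credential-free; a stricter variant would open a real database session with the credentials from the Secret.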
Idempotency Tests
Platform APIs must be safe to call multiple times. Test this explicitly:
```bash
#!/usr/bin/env bash
# scripts/test-idempotency.sh
set -e

NAMESPACE="idempotency-test"

echo "Test: Creating namespace via platform API (first call)"
curl -sf -X POST "https://platform.internal/api/v1/namespaces" \
    -H "Authorization: Bearer $PLATFORM_TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"name\": \"$NAMESPACE\", \"team\": \"platform-test\", \"environment\": \"staging\"}"

echo "Test: Creating same namespace again (idempotency check)"
# No -f here: we want to inspect the status code even on an error response
RESPONSE=$(curl -s -X POST "https://platform.internal/api/v1/namespaces" \
    -H "Authorization: Bearer $PLATFORM_TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"name\": \"$NAMESPACE\", \"team\": \"platform-test\", \"environment\": \"staging\"}" \
    -w "\n%{http_code}")

HTTP_CODE=$(echo "$RESPONSE" | tail -1)
BODY=$(echo "$RESPONSE" | sed '$d')   # everything except the status-code line

# Should return 200 (already exists) or 201 (created), not 409 or 500
if [[ "$HTTP_CODE" == "409" || "$HTTP_CODE" == "500" ]]; then
    echo "FAIL: Idempotency check failed — got HTTP $HTTP_CODE"
    echo "Response: $BODY"
    exit 1
fi
echo "OK: Idempotency check passed (HTTP $HTTP_CODE)"

# Verify only one namespace was created
NS_COUNT=$(kubectl get namespace -l platform.internal/team=platform-test --no-headers | grep "$NAMESPACE" | wc -l)
if [ "$NS_COUNT" -ne 1 ]; then
    echo "FAIL: Expected 1 namespace, found $NS_COUNT"
    exit 1
fi
echo "OK: Exactly 1 namespace exists after 2 creation calls"
```

Golden Path Automation Test
```python
#!/usr/bin/env python3
# tests/golden-path-new-service.py
import os
import subprocess
import sys
import time

import requests
import yaml

PLATFORM_API = "https://platform.internal/api/v1"
TOKEN = os.environ["PLATFORM_TOKEN"]
HEADERS = {"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"}


def run(cmd, check=True):
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    if check and result.returncode != 0:
        print(f"FAIL: Command failed: {cmd}")
        print(result.stderr)
        sys.exit(1)
    return result.stdout.strip()


def wait_for(condition_fn, timeout_s, poll_interval_s=5, description="condition"):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if condition_fn():
            return True
        print(f"  Waiting for {description}...")
        time.sleep(poll_interval_s)
    print(f"FAIL: Timed out waiting for {description} after {timeout_s}s")
    sys.exit(1)


SERVICE_NAME = f"golden-path-test-{int(time.time())}"
print(f"Golden Path Test: Create New Service '{SERVICE_NAME}'")

# Step 1: Submit service claim
print("\n[1/6] Submitting ServiceClaim...")
claim = {
    "apiVersion": "platform.internal/v1alpha1",
    "kind": "ServiceClaim",
    "metadata": {"name": SERVICE_NAME, "namespace": "platform-claims"},
    "spec": {
        "team": "platform-test",
        "environment": "staging",
        "language": "go",
        "template": "microservice",
    },
}
run(f"echo '{yaml.dump(claim)}' | kubectl apply -f -")  # pipe: sh has no <<< herestring

# Step 2: Wait for namespace to be created by Crossplane
print("[2/6] Waiting for namespace creation (max 3 minutes)...")
wait_for(
    lambda: run(f"kubectl get namespace {SERVICE_NAME} --no-headers 2>/dev/null", check=False) != "",
    timeout_s=180,
    description=f"namespace {SERVICE_NAME}",
)
print(f"  OK: Namespace {SERVICE_NAME} created")

# Step 3: Wait for ArgoCD application to appear
print("[3/6] Waiting for ArgoCD application (max 2 minutes)...")
wait_for(
    lambda: run(f"argocd app get {SERVICE_NAME} --grpc-web 2>/dev/null", check=False) != "",
    timeout_s=120,
    description=f"ArgoCD application {SERVICE_NAME}",
)
print(f"  OK: ArgoCD application {SERVICE_NAME} created")

# Step 4: Wait for Backstage catalog ingestion
print("[4/6] Waiting for Backstage catalog entry (max 5 minutes)...")


def backstage_entity_exists():
    try:
        resp = requests.get(
            f"https://backstage.internal/api/catalog/entities/by-name/component/default/{SERVICE_NAME}",
            headers={"Authorization": f"Bearer {TOKEN}"},
            timeout=10,
        )
        return resp.status_code == 200
    except requests.RequestException:
        return False


wait_for(backstage_entity_exists, timeout_s=300, description=f"Backstage entity {SERVICE_NAME}")
print("  OK: Backstage catalog entry created")

# Step 5: Verify RBAC (developer can deploy)
print("[5/6] Verifying developer RBAC...")
rbac_check = run(
    f"kubectl auth can-i create deployments --namespace {SERVICE_NAME} "
    f"--as system:serviceaccount:{SERVICE_NAME}:developer",
    check=False,
)
if "yes" not in rbac_check:
    print(f"FAIL: Developer ServiceAccount cannot create deployments in {SERVICE_NAME}")
    sys.exit(1)
print("  OK: Developer RBAC configured correctly")

# Step 6: Verify network policy
print("[6/6] Verifying network policies...")
netpol_count = run(f"kubectl get networkpolicies -n {SERVICE_NAME} --no-headers | wc -l")
if int(netpol_count) < 1:
    print(f"FAIL: No NetworkPolicies created in {SERVICE_NAME}")
    sys.exit(1)
print(f"  OK: {netpol_count} NetworkPolicy/NetworkPolicies created")

# Cleanup
print("\nCleaning up...")
run(f"kubectl delete serviceclaim {SERVICE_NAME} -n platform-claims --ignore-not-found")
print("\nGolden Path Test PASSED — all 6 steps completed successfully")
```

Platform API Smoke Tests
Smoke tests run against the live platform and verify that core APIs are responding correctly. Run these after every platform deployment:
```bash
#!/usr/bin/env bash
# scripts/smoke-tests.sh

PLATFORM_API="https://platform.internal/api/v1"
BACKSTAGE_API="https://backstage.internal/api"
ARGOCD_API="https://argocd.internal"

PASS=0
FAIL=0

smoke_test() {
    local name=$1
    local url=$2
    local expected_status=$3
    local auth_header=$4

    # Build the header as an array so the quoted value survives word splitting
    local -a auth_args=()
    if [ -n "$auth_header" ]; then
        auth_args=(-H "Authorization: Bearer $auth_header")
    fi

    # No -f: we want the real status code (e.g. 401), not a curl failure
    actual_status=$(curl -s -o /dev/null -w "%{http_code}" \
        "${auth_args[@]}" \
        "$url" 2>/dev/null) || actual_status="000"

    if [ "$actual_status" = "$expected_status" ]; then
        echo "OK: $name (HTTP $actual_status)"
        PASS=$((PASS + 1))
    else
        echo "FAIL: $name — expected HTTP $expected_status, got $actual_status"
        FAIL=$((FAIL + 1))
    fi
}

# Platform API
smoke_test "Platform API health" "$PLATFORM_API/health" "200"
smoke_test "Platform API auth required" "$PLATFORM_API/namespaces" "401"
smoke_test "Platform API authenticated" "$PLATFORM_API/namespaces" "200" "$PLATFORM_TOKEN"

# Backstage
smoke_test "Backstage health" "$BACKSTAGE_API/health" "200"
smoke_test "Backstage catalog entities" "$BACKSTAGE_API/catalog/entities?limit=1" "200" "$BACKSTAGE_TOKEN"
smoke_test "Backstage auth" "$BACKSTAGE_API/auth/providers" "200"

# ArgoCD
smoke_test "ArgoCD API server" "$ARGOCD_API/api/v1/applications?limit=1" "200"

# Crossplane
CROSSPLANE_HEALTHY=$(kubectl get provider -o jsonpath='{.items[*].status.conditions[?(@.type=="Healthy")].status}' 2>/dev/null)
if [[ "$CROSSPLANE_HEALTHY" == *"False"* ]]; then
    echo "FAIL: One or more Crossplane providers are unhealthy"
    kubectl get provider
    FAIL=$((FAIL + 1))
else
    echo "OK: All Crossplane providers are healthy"
    PASS=$((PASS + 1))
fi

echo ""
echo "Smoke tests: $PASS passed, $FAIL failed"
[ $FAIL -eq 0 ]
```

Chaos Testing Platform Components
Platforms must degrade gracefully when components fail. Chaos tests verify this:
Testing ArgoCD Controller Resilience
```bash
#!/usr/bin/env bash
# chaos/test-argocd-controller-failure.sh

echo "Chaos Test: ArgoCD application controller failure"

# Record current application count
INITIAL_APP_COUNT=$(argocd app list --grpc-web -o json | jq '. | length')
echo "Initial application count: $INITIAL_APP_COUNT"

# Kill the application controller
kubectl delete pod -n argocd -l app.kubernetes.io/name=argocd-application-controller
echo "Killed ArgoCD application controller — waiting for recovery..."

# Wait for the controller to restart
kubectl wait --for=condition=available \
    deployment/argocd-application-controller \
    -n argocd \
    --timeout=120s 2>/dev/null || \
kubectl wait --for=condition=ready \
    pod -l app.kubernetes.io/name=argocd-application-controller \
    -n argocd \
    --timeout=120s

echo "Controller restarted — verifying application state..."
sleep 30  # Give controller time to reconcile

# Verify no applications were lost
FINAL_APP_COUNT=$(argocd app list --grpc-web -o json | jq '. | length')
if [ "$FINAL_APP_COUNT" -ne "$INITIAL_APP_COUNT" ]; then
    echo "FAIL: Application count changed after controller restart ($INITIAL_APP_COUNT → $FINAL_APP_COUNT)"
    exit 1
fi

# Verify all applications are still healthy
UNHEALTHY=$(argocd app list --grpc-web -o json | \
    jq '[.[] | select(.status.health.status != "Healthy")] | length')
if [ "$UNHEALTHY" -gt 0 ]; then
    echo "FAIL: $UNHEALTHY applications became unhealthy after controller restart"
    argocd app list --grpc-web -o json | \
        jq '.[] | select(.status.health.status != "Healthy") | .metadata.name'
    exit 1
fi

echo "PASS: ArgoCD application controller recovered correctly"
echo "  - Application count preserved: $FINAL_APP_COUNT"
echo "  - All applications healthy"
```

Testing Crossplane Provider Failure
```bash
#!/usr/bin/env bash
# chaos/test-crossplane-provider-failure.sh

echo "Chaos Test: Crossplane AWS provider pod failure"

# Record current managed resource count
INITIAL_MR_COUNT=$(kubectl get managed --all-namespaces --no-headers | wc -l)
echo "Initial managed resource count: $INITIAL_MR_COUNT"

# Kill the provider pod
kubectl delete pod -n crossplane-system \
    -l pkg.crossplane.io/revision=upbound-provider-aws
echo "Killed Crossplane AWS provider — waiting for recovery..."

# Wait for provider to recover
kubectl wait --for=condition=healthy provider/upbound-provider-aws \
    --timeout=180s

echo "Provider recovered — verifying managed resource state..."
sleep 60  # Give provider time to re-sync

# Verify no managed resources were deleted during outage
FINAL_MR_COUNT=$(kubectl get managed --all-namespaces --no-headers | wc -l)
if [ "$FINAL_MR_COUNT" -lt "$INITIAL_MR_COUNT" ]; then
    echo "FAIL: Managed resource count decreased during provider outage ($INITIAL_MR_COUNT → $FINAL_MR_COUNT)"
    exit 1
fi

# Verify no managed resources are in error state
ERROR_COUNT=$(kubectl get managed --all-namespaces -o json | \
    jq '[.items[] | select(.status.conditions[]? | select(.type == "Ready" and .status == "False"))] | length')
if [ "$ERROR_COUNT" -gt 0 ]; then
    echo "WARN: $ERROR_COUNT managed resources in error state (may be pre-existing)"
    kubectl get managed --all-namespaces -o json | \
        jq '.items[] | select(.status.conditions[]? | select(.type == "Ready" and .status == "False")) | .metadata.name'
fi

echo "PASS: Crossplane provider recovered without data loss"
```

Using Chaos Mesh for Systematic Chaos
```yaml
# chaos/platform-chaos-experiments.yaml
# Experiment 1: API server latency injection
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: platform-api-latency
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - platform-system
    labelSelectors:
      app: platform-api
  delay:
    latency: "200ms"
    jitter: "50ms"
  duration: "5m"
---
# Experiment 2: Backstage database connection failure
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: backstage-db-partition
  namespace: chaos-testing
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - backstage
    labelSelectors:
      app: backstage
  direction: to
  target:
    mode: all
    selector:
      namespaces:
        - backstage
      labelSelectors:
        app: postgresql
  duration: "2m"
```

Apply and verify platform behavior during chaos:
```bash
# Apply chaos experiment
kubectl apply -f chaos/platform-chaos-experiments.yaml

# Verify platform API degrades gracefully (circuit breaker, fallback cache)
sleep 30
smoke_test_result=$(./scripts/smoke-tests.sh 2>&1)
if echo "$smoke_test_result" | grep -q "FAIL: Platform API"; then
    echo "FAIL: Platform API did not degrade gracefully under 200ms latency"
    echo "$smoke_test_result"
    exit 1
fi
echo "OK: Platform API remained functional under network latency"

# Verify Backstage served cached responses during DB partition
BACKSTAGE_RESPONSE=$(curl -sf \
    -H "Authorization: Bearer $BACKSTAGE_TOKEN" \
    "https://backstage.internal/api/catalog/entities?limit=1")
if [ -z "$BACKSTAGE_RESPONSE" ]; then
    echo "FAIL: Backstage returned empty response during DB partition (no cache)"
    exit 1
fi
echo "OK: Backstage served cached catalog data during DB partition"

# Remove chaos
kubectl delete -f chaos/platform-chaos-experiments.yaml
```

SLO Testing for Platform Reliability
Platform SLOs define the reliability commitments to developer teams. Make them testable:
```yaml
# slos/platform-slos.yaml
slos:
  - name: service-provisioning-latency
    description: "New service provisioned within 10 minutes of claim submission"
    target: 0.95          # 95% of requests meet this SLO
    threshold_seconds: 600

  - name: deployment-pipeline-availability
    description: "ArgoCD API available 99.5% of the time"
    target: 0.995
    measurement_window: 30d

  - name: platform-api-latency-p99
    description: "Platform API p99 latency under 2 seconds"
    target: 0.99
    threshold_ms: 2000

  - name: catalog-freshness
    description: "Backstage catalog reflects Git state within 15 minutes"
    target: 0.99
    threshold_seconds: 900
```

SLO Measurement Test
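The catalog-freshness SLO above can be measured with a small sketch like this, alongside the provisioning-latency test below. The timestamp sources are assumptions: in practice the commit time would come from `git log -1 --format=%cI` on the catalog repo and the refresh time from the entity the Backstage catalog API returns.

```python
# tests/catalog_freshness.py
# Sketch: check Backstage catalog freshness against the Git commit it should
# reflect. Where the two timestamps come from is an assumption (see lead-in);
# only the comparison logic is shown here.
from datetime import datetime, timezone

FRESHNESS_SLO_SECONDS = 900  # catalog must reflect Git state within 15 minutes


def catalog_is_fresh(commit_time: datetime, entity_refresh_time: datetime,
                     threshold_s: int = FRESHNESS_SLO_SECONDS) -> bool:
    """True if the catalog refreshed within threshold_s after the Git commit.

    A refresh timestamp older than the commit means the entity still reflects
    a previous revision, so that counts as stale too.
    """
    lag = (entity_refresh_time - commit_time).total_seconds()
    return 0 <= lag <= threshold_s


if __name__ == "__main__":
    commit = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    refreshed = datetime(2024, 1, 1, 12, 10, tzinfo=timezone.utc)  # 10 min later
    assert catalog_is_fresh(commit, refreshed)
    print("catalog freshness SLO met")
```

Run it per entity on a schedule and feed the boolean into the same results store as the other SLOs to compute the 99% target over time.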
```python
#!/usr/bin/env python3
# tests/slo-test.py
import statistics
import subprocess
import sys
import time


def test_service_provisioning_latency(n_samples=10, timeout_s=600):
    """
    Creates N service claims and measures provisioning time.
    Fails if fewer than 95% complete within 600 seconds.
    """
    print(f"SLO Test: Service provisioning latency (n={n_samples})")
    latencies = []
    failures = 0

    for i in range(n_samples):
        service_name = f"slo-test-{int(time.time())}-{i}"
        start = time.time()

        # Submit claim (YAML kept flush-left inside the triple-quoted string)
        claim_yaml = f"""\
apiVersion: platform.internal/v1alpha1
kind: ServiceClaim
metadata:
  name: {service_name}
  namespace: platform-claims
spec:
  team: slo-test
  environment: staging
  language: go
"""
        subprocess.run(
            f"echo '{claim_yaml}' | kubectl apply -f -",
            shell=True, check=True, capture_output=True,
        )

        # Wait for namespace (proxy for provisioning complete)
        deadline = time.time() + timeout_s
        provisioned = False
        while time.time() < deadline:
            result = subprocess.run(
                f"kubectl get namespace {service_name} --no-headers 2>/dev/null",
                shell=True, capture_output=True, text=True,
            )
            if result.returncode == 0 and result.stdout.strip():
                provisioned = True
                break
            time.sleep(10)

        elapsed = time.time() - start
        latencies.append(elapsed)
        if provisioned:
            print(f"  [{i+1}/{n_samples}] {service_name}: {elapsed:.1f}s")
        else:
            print(f"  [{i+1}/{n_samples}] {service_name}: TIMEOUT after {elapsed:.1f}s")
            failures += 1

        # Cleanup
        subprocess.run(
            f"kubectl delete serviceclaim {service_name} -n platform-claims --ignore-not-found",
            shell=True, capture_output=True,
        )

    success_rate = (n_samples - failures) / n_samples
    p50 = statistics.median(latencies)
    p95 = sorted(latencies)[int(0.95 * len(latencies))]

    print("\nResults:")
    print(f"  Success rate: {success_rate:.1%} (target: 95%)")
    print(f"  p50 latency: {p50:.1f}s")
    print(f"  p95 latency: {p95:.1f}s (threshold: {timeout_s}s)")

    if success_rate < 0.95:
        print(f"FAIL: SLO not met — success rate {success_rate:.1%} < 95%")
        sys.exit(1)
    print("PASS: Service provisioning SLO met")


if __name__ == "__main__":
    test_service_provisioning_latency()
```

Running Platform QA in CI
```yaml
# .github/workflows/platform-qa.yaml
name: Platform QA

on:
  schedule:
    - cron: '0 */6 * * *'   # every 6 hours
  workflow_dispatch:        # manual trigger

jobs:
  smoke-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run smoke tests
        env:
          PLATFORM_TOKEN: ${{ secrets.PLATFORM_TOKEN }}
          BACKSTAGE_TOKEN: ${{ secrets.BACKSTAGE_TOKEN }}
        run: ./scripts/smoke-tests.sh

  golden-path:
    runs-on: ubuntu-latest
    needs: smoke-tests
    steps:
      - uses: actions/checkout@v4
      - name: Run golden path test
        env:
          PLATFORM_TOKEN: ${{ secrets.PLATFORM_TOKEN }}
          KUBECONFIG: ${{ secrets.KUBECONFIG }}
        run: python3 tests/golden-path-new-service.py

  slo-tests:
    runs-on: ubuntu-latest
    needs: smoke-tests
    steps:
      - uses: actions/checkout@v4
      - name: Run SLO measurement
        env:
          PLATFORM_TOKEN: ${{ secrets.PLATFORM_TOKEN }}
        run: python3 tests/slo-test.py

  chaos:
    runs-on: ubuntu-latest
    needs: [smoke-tests, golden-path]
    if: github.event_name == 'schedule'   # only in scheduled runs
    steps:
      - uses: actions/checkout@v4
      - name: Run chaos tests
        run: |
          ./chaos/test-argocd-controller-failure.sh
          ./chaos/test-crossplane-provider-failure.sh
```

Platform QA Dashboard
Track platform QA results over time with a simple dashboard. Write results to a time-series store after each run:
```bash
#!/usr/bin/env bash
# scripts/record-qa-results.sh

TIMESTAMP=$(date -u +%Y-%m-%dT%H:%M:%SZ)
RUN_ID=$GITHUB_RUN_ID

# Write results as JSON for ingestion into your observability platform
# (SMOKE_PASS, SMOKE_FAIL, etc. are exported by the preceding test stages)
cat > /tmp/qa-results.json << EOF
{
  "timestamp": "$TIMESTAMP",
  "run_id": "$RUN_ID",
  "smoke_tests": {
    "passed": $SMOKE_PASS,
    "failed": $SMOKE_FAIL
  },
  "golden_path": {
    "passed": $GOLDEN_PASS,
    "duration_seconds": $GOLDEN_DURATION
  },
  "slo": {
    "service_provisioning_success_rate": $SLO_SUCCESS_RATE,
    "service_provisioning_p95_seconds": $SLO_P95
  }
}
EOF

# Post to your metrics endpoint
curl -sf -X POST "https://metrics.internal/api/platform-qa" \
    -H "Content-Type: application/json" \
    -d @/tmp/qa-results.json
```

Conclusion
Platform engineering QA is different from application QA because the platform serves as infrastructure for every other team. Failures are multiplied — not just one team is affected, but all teams simultaneously. This demands a testing approach that prioritizes golden path E2E tests above all else, combines chaos testing to verify graceful degradation, and enforces SLOs with automated measurement.
Start with smoke tests after every deployment, add golden path tests to your pre-release checklist, and run chaos tests weekly. Measure SLO compliance continuously. A platform team that can prove its reliability commitments with test data earns the trust of the developers who depend on it.
HelpMeTest can monitor your platform engineering pipelines automatically — sign up free