Setting SLOs for Test Suite Flakiness Rate and Alerting on Regressions
Production services have SLOs — "99.9% of requests succeed within 500ms." Test suites have... hope. Engineers hope their tests are stable, hope failures mean real bugs, hope the suite is trustworthy. When tests fail, they re-run and hope.
This informal relationship with reliability is the source of the "flaky test" culture: failures are expected, ignored, worked around. The fix is to apply the same rigor to test reliability that SRE teams apply to production services — define what "reliable" means, measure it, and alert when it degrades.
Defining Test Suite Reliability
A test suite's reliability can be measured with one number: first-pass success rate — the percentage of CI runs where all tests pass on the first attempt, without retries.
First-pass success rate = (Runs with 0 failures) / (Total CI runs) × 100
A suite with 5% flakiness has a 95% first-pass success rate. Engineers re-run 1 in 20 CI runs due to flakiness. Over a year, that's significant wasted time and eroded trust.
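
The formula translates directly into a few lines of Python. This is a minimal sketch: `first_pass_rate` and the sample outcomes are hypothetical names and data used for illustration, not part of any CI tooling.

```python
def first_pass_rate(outcomes):
    """Return the first-pass success rate as a percentage.

    `outcomes` is an iterable of booleans, one per CI run:
    True if every test passed on the first attempt, False otherwise.
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one CI run to compute a rate")
    return sum(outcomes) / len(outcomes) * 100


# Hypothetical sample: 20 runs, one of which needed a re-run.
sample = [True] * 19 + [False]
print(f"{first_pass_rate(sample):.1f}%")
```

Raising on an empty list is a deliberate choice here: a rate of 0% for "no runs" would look like a total outage rather than missing data.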
Setting the SLO
Pick a target based on your current state and team tolerance:
| First-pass rate | Interpretation |
|---|---|
| < 90% | Crisis: test suite is not trustworthy |
| 90-95% | Problematic: engineers are spending meaningful time on re-runs |
| 95-99% | Acceptable: occasional flakiness, manageable |
| 99-99.5% | Good: flakiness is rare, trust is maintained |
| > 99.5% | Excellent: almost all failures are real bugs |
Start with a target 5 percentage points above your current rate. If you're at 90% today, target 95% in 90 days. Set achievable targets rather than aspirational ones, then raise them as you hit them.
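
The "5 points above today" rule needs a cap once a suite is already above 95%, since the target cannot exceed 100%. A small sketch of that arithmetic follows; the function name `suggest_target` and the 99.5% ceiling (borrowed from the top band of the table) are assumptions made for illustration.

```python
def suggest_target(current_rate, step=5.0, ceiling=99.5):
    """Suggest the next SLO target: `step` points above today, capped at `ceiling`."""
    return min(current_rate + step, ceiling)


# Hypothetical current rates for three different suites.
for current in (82.0, 90.0, 97.0):
    target = suggest_target(current)
    print(f"current {current:.1f}% -> next target {target:.1f}%")
```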
Document the SLO:
# .github/test-slo.yaml
test_reliability:
  slo_target: 97            # percent
  measurement_window: 7d    # rolling 7-day window
  exclusions:
    - infrastructure_failures  # Docker pull failures, runner unavailability
  alert_threshold: 95       # alert when rate drops below this
  alert_channel: "#test-reliability"

Measuring Flakiness Rate
Track first-pass success over time. For each CI run, record: timestamp, commit SHA, pass/fail, number of retries needed.
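
One way to picture the per-run record is as a small data class with a derived "first pass" flag. This is a sketch under assumptions: the `CIRunRecord` class, its field names, and the sample values are hypothetical and only mirror the four fields listed above.

```python
import csv
import io
from dataclasses import astuple, dataclass


@dataclass
class CIRunRecord:
    timestamp: str  # ISO 8601, UTC
    commit: str     # commit SHA
    passed: bool    # final result of the run
    retries: int    # re-runs needed before the final result

    @property
    def first_pass(self):
        """A run counts toward the SLO only if it passed with no retries."""
        return self.passed and self.retries == 0


def append_record(stream, record):
    """Append one record as a CSV row to an open text stream."""
    csv.writer(stream).writerow(astuple(record))


# Hypothetical sample data written to an in-memory buffer.
buffer = io.StringIO()
append_record(buffer, CIRunRecord("2024-01-15T09:00:00Z", "abc1234", True, 0))
append_record(buffer, CIRunRecord("2024-01-15T09:20:00Z", "def5678", True, 1))
print(buffer.getvalue())
```

Keeping the retry count separate from pass/fail matters: a run that went green on its second attempt is a pass for the commit but a miss for the first-pass rate.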
Collecting data from GitHub Actions
name: Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    outputs:
      first_pass: ${{ steps.test.outcome }}
    steps:
      - uses: actions/checkout@v4
      - id: test
        run: npm test
        # Let later steps run so the result is recorded even when tests fail
        continue-on-error: true
      - name: Record result
        if: always()
        run: |
          STATUS="${{ steps.test.outcome }}"
          TIMESTAMP=$(date -u +%Y-%m-%dT%H:%M:%SZ)
          COMMIT="${{ github.sha }}"
          RUN_ID="${{ github.run_id }}"
          echo "${TIMESTAMP},${RUN_ID},${COMMIT},${STATUS}" \
            >> test-reliability-log.csv
      - name: Fail if tests failed
        if: steps.test.outcome == 'failure'
        run: exit 1

The runner's filesystem is discarded when the job ends, so upload each row to persistent storage: an S3 bucket, a GitHub Gist, or a dedicated metrics database.
Calculating the SLO
# scripts/calculate-slo.py
import argparse
import csv
import json
from datetime import datetime, timedelta, timezone

WINDOW_DAYS = 7
TARGET = 97.0

# --json prints a machine-readable summary for use in CI
parser = argparse.ArgumentParser()
parser.add_argument("--json", action="store_true")
args = parser.parse_args()

# The log has no header row, so column names are supplied here
with open("test-reliability-log.csv") as f:
    reader = csv.DictReader(f, fieldnames=["timestamp", "run_id", "commit", "status"])
    rows = list(reader)

cutoff = datetime.now(timezone.utc) - timedelta(days=WINDOW_DAYS)
recent = [
    r for r in rows
    if datetime.fromisoformat(r["timestamp"].replace("Z", "+00:00")) > cutoff
]

total = len(recent)
passed = sum(1 for r in recent if r["status"] == "success")
rate = (passed / total * 100) if total > 0 else 0
# Budget consumed is the observed failure rate, measured against 100 - TARGET
error_budget_used = (100 - rate) if total > 0 else 0

if args.json:
    print(json.dumps({"rate": round(rate, 1), "total": total, "passed": passed}))
else:
    print(f"Test reliability SLO report ({WINDOW_DAYS}-day window)")
    print(f"Total runs: {total}")
    print(f"First-pass successes: {passed}")
    print(f"Success rate: {rate:.1f}%")
    print(f"SLO target: {TARGET}%")
    print(f"Status: {'✓ SLO MET' if rate >= TARGET else '✗ SLO BREACHED'}")
    print(f"Error budget used: {error_budget_used:.1f}pp of {100 - TARGET:.1f}pp")

Run this weekly and post results to your team channel.
Error Budget Thinking
Borrow the error budget concept from SRE. If your SLO is 97%, your error budget is 3% — the allowed failure rate. When you use up your budget, you stop new feature development and fix flaky tests.
Error budget = 100% - SLO target = 3%
Current failure rate = 100% - actual first-pass rate
If current failure rate > error budget → budget exhausted → fix tests

This creates a forcing function. Teams can't ignore flakiness indefinitely: when the budget runs out, fixing tests becomes the highest priority.
For a 7-day window:
- 100 CI runs
- SLO: 97% → 3 allowed failures
- Actual: 94% → 6 failures → budget exhausted → fix required
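
The worked example above can be checked with a short function that converts the SLO into a count of allowed failures for the window. This is an illustrative sketch; `error_budget` is a hypothetical helper, and the inputs are the numbers from the example.

```python
def error_budget(total_runs, failed_runs, slo_target):
    """Return (allowed_failures, remaining, exhausted) for one window.

    `slo_target` is a percentage such as 97.0. The budget counts as
    exhausted only when failures exceed the allowance, matching the
    "failure rate > error budget" rule.
    """
    allowed = total_runs * (100 - slo_target) / 100
    remaining = allowed - failed_runs
    return allowed, remaining, remaining < 0


# Numbers from the 7-day example: 100 runs, 6 failures, 97% SLO.
allowed, remaining, exhausted = error_budget(100, 6, 97.0)
print(f"allowed failures: {allowed:.0f}")
print(f"remaining budget: {remaining:.0f}")
print("budget exhausted" if exhausted else "budget OK")
```

A negative remaining budget shows how far over the allowance the suite is, which is often more useful in a report than a bare "breached" flag.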
Alerting on Regression
Set up alerts when the first-pass rate drops below the alert threshold:
# .github/workflows/slo-check.yml
name: Test SLO Check
on:
  schedule:
    - cron: "0 9 * * MON-FRI"  # Weekdays at 09:00 UTC
  workflow_dispatch:
jobs:
  check-slo:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Download reliability log
        # Requires AWS credentials to be configured for this job
        run: aws s3 cp s3://your-bucket/test-reliability-log.csv .
      - name: Check SLO
        id: slo
        run: |
          # calculate-slo.py must support --json and print an object with a "rate" field
          RATE=$(python scripts/calculate-slo.py --json | jq '.rate')
          echo "rate=$RATE" >> "$GITHUB_OUTPUT"
          # 95 is the alert_threshold from .github/test-slo.yaml
          if (( $(echo "$RATE < 95" | bc -l) )); then
            echo "breach=true" >> "$GITHUB_OUTPUT"
          else
            echo "breach=false" >> "$GITHUB_OUTPUT"
          fi
      - name: Alert on breach
        if: steps.slo.outputs.breach == 'true'
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "⚠️ Test SLO Breach: first-pass rate is ${{ steps.slo.outputs.rate }}% (target: 97%)",
              "channel": "#test-reliability"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}

Per-test Flakiness SLOs
Beyond the suite-level SLO, track per-test reliability. A test that fails 20% of the time is a much higher-priority fix than one that fails 1% of the time.
# scripts/per-test-flakiness.py
# Reads test results JSON files and reports per-test failure rates
import glob
import json
from collections import defaultdict

test_results = defaultdict(lambda: {"pass": 0, "fail": 0})

for results_file in glob.glob("test-results/*.json"):
    with open(results_file) as f:
        data = json.load(f)
    for suite in data.get("testResults", []):
        # Adjust the inner key to your reporter's JSON shape
        # (Jest's --json output uses "assertionResults")
        for test in suite.get("testResults", []):
            name = test["fullName"]
            if test["status"] == "passed":
                test_results[name]["pass"] += 1
            elif test["status"] == "failed":
                test_results[name]["fail"] += 1

# Report tests at or above 5% failure rate
print("Tests at or above 5% failure rate (quarantine candidates):\n")
for name, counts in test_results.items():
    total = counts["pass"] + counts["fail"]
    if total < 5:
        continue  # Not enough data
    rate = counts["fail"] / total * 100
    if rate >= 5:
        print(f"  {rate:.0f}% failure rate ({counts['fail']}/{total}): {name}")

Run this weekly. Tests above 5% failure rate are quarantine candidates. Tests above 20% failure rate are critical fixes.
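
The two thresholds can be folded into a single triage function so the weekly report labels each test directly. This is a sketch under assumptions: `triage`, the tier labels, and the sample counts are hypothetical, and both thresholds are treated as inclusive to match the `rate >= 5` check in the script.

```python
def triage(fail_count, total_runs, min_runs=5):
    """Classify a test by failure rate using the 5% and 20% thresholds."""
    if total_runs < min_runs:
        return "insufficient data"
    # Integer comparison avoids floating-point edge cases at the boundaries.
    if fail_count * 100 >= 20 * total_runs:
        return "critical"
    if fail_count * 100 >= 5 * total_runs:
        return "quarantine candidate"
    return "ok"


# Hypothetical (failures, total runs) per test name.
samples = {
    "checkout submits order": (1, 100),
    "search debounces input": (7, 100),
    "upload retries on 503": (25, 100),
    "new test, few runs": (1, 3),
}
for name, (fails, total) in samples.items():
    print(f"{triage(fails, total):>22}: {name}")
```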
The Reliability Roadmap
Starting from zero, here's a practical path to a reliable test suite:
Week 1: Establish baseline
- Record first-pass success rate for the past 30 days
- Identify the 5 most flaky tests
Weeks 2-4: Fix quick wins
- Fix timing issues (largest category)
- Add beforeEach state resets
- Quarantine tests that can't be fixed immediately
Month 2: Process changes
- Add the flakiness rate check to CI
- Set and communicate the SLO
- Start the error budget tracker
Month 3+: Maintain and improve
- Weekly flakiness report to the team
- Error budget reviews in sprint planning
- Gradually raise the SLO target as reliability improves
The goal isn't zero flakiness — it's making flakiness visible, measurable, and actively managed. A test suite with a published SLO, an error budget, and alerts when that budget is breached is a test suite engineers can trust.
For teams running continuous test monitoring in production — where tests run 24/7 and failures need to be triaged immediately — HelpMeTest provides the infrastructure layer, running tests on a schedule, alerting on failures, and distinguishing infrastructure noise from real regressions with AI-powered analysis.