Setting SLOs for Test Suite Flakiness Rate and Alerting on Regressions

Production services have SLOs — "99.9% of requests succeed within 500ms." Test suites have... hope. Engineers hope their tests are stable, hope failures mean real bugs, hope the suite is trustworthy. When tests fail, they re-run and hope.

This informal relationship with reliability is the source of the "flaky test" culture: failures are expected, ignored, worked around. The fix is to apply the same rigor to test reliability that SRE teams apply to production services — define what "reliable" means, measure it, and alert when it degrades.

Defining Test Suite Reliability

A test suite's reliability can be measured with one number: first-pass success rate — the percentage of CI runs where all tests pass on the first attempt, without retries.

First-pass success rate = (Runs with 0 failures) / (Total CI runs) × 100

A suite with 5% flakiness has a 95% first-pass success rate, so engineers re-run about 1 in 20 CI runs. On a team that triggers 100 CI runs a week, that is roughly 5 re-runs every week, each one costing a full pipeline cycle and eroding trust in the results.
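
As a rough illustration of the formula above, the sketch below computes the first-pass rate from a list of run outcomes. The function name `first_pass_rate` and the boolean-list input are assumptions made for this example, not part of any CI tool.

```python
from typing import Optional, Sequence


def first_pass_rate(first_attempt_passed: Sequence[bool]) -> Optional[float]:
    """Return the first-pass success rate as a percentage.

    Each element is True when a CI run passed on its first attempt.
    Returns None when there are no runs, since the rate is undefined.
    """
    total = len(first_attempt_passed)
    if total == 0:
        return None
    return sum(first_attempt_passed) / total * 100


# 19 of 20 runs pass on the first attempt.
runs = [True] * 19 + [False]
print(f"{first_pass_rate(runs):.1f}%")  # prints 95.0%
```

Returning `None` for an empty window keeps "no data" distinct from "0% reliable", which matters once alerts are driven by this number.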

Setting the SLO

Pick a target based on your current state and team tolerance:

First-pass rate    Interpretation
< 90%              Crisis: test suite is not trustworthy
90-95%             Problematic: engineers are spending meaningful time on re-runs
95-99%             Acceptable: occasional flakiness, manageable
99-99.5%           Good: flakiness is rare, trust is maintained
> 99.5%            Excellent: almost all failures are real bugs

Start with a target 5 percentage points above your current rate. If you're at 90% today, target 95% in 90 days. Don't set aspirational targets. Set achievable ones, then raise them.
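
The bands and the "current rate plus 5 points" rule can be sketched in a few lines. The function names, the treatment of boundary values (each band includes its lower bound), and the 99.5% ceiling on the proposed target are assumptions for illustration.

```python
def interpret(rate: float) -> str:
    """Map a first-pass rate (percent) to the bands in the table above."""
    if rate < 90:
        return "Crisis"
    if rate < 95:
        return "Problematic"
    if rate < 99:
        return "Acceptable"
    if rate < 99.5:
        return "Good"
    return "Excellent"


def next_target(current_rate: float, step: float = 5.0, ceiling: float = 99.5) -> float:
    """Propose the next SLO target: current rate plus a step, capped at a ceiling."""
    return min(current_rate + step, ceiling)


print(interpret(90.0), next_target(90.0))  # Problematic 95.0
print(interpret(97.0), next_target(97.0))  # Acceptable 99.5
```

The cap exists because adding 5 points to a rate that is already high would produce a target above 100%.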

Document the SLO:

# .github/test-slo.yaml
test_reliability:
  slo_target: 97  # percent
  measurement_window: 7d  # rolling 7-day window
  exclusions:
    - infrastructure_failures  # Docker pull failures, runner unavailability
  alert_threshold: 95  # alert when rate drops below this
  alert_channel: "#test-reliability"
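
A documented SLO is easier to trust if it is sanity-checked before use. The sketch below assumes the YAML has already been parsed into a dict (for example with a YAML library) and applies a few checks. The helper names `parse_window` and `validate`, and the rules themselves, are assumptions rather than an established schema.

```python
import re
from datetime import timedelta

# The documented SLO as it might look after parsing the YAML file into a dict.
config = {
    "slo_target": 97,
    "measurement_window": "7d",
    "exclusions": ["infrastructure_failures"],
    "alert_threshold": 95,
    "alert_channel": "#test-reliability",
}


def parse_window(window: str) -> timedelta:
    """Convert a window such as '7d' or '12h' into a timedelta."""
    match = re.fullmatch(r"(\d+)([dh])", window)
    if match is None:
        raise ValueError(f"Unsupported window format: {window!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return timedelta(days=amount) if unit == "d" else timedelta(hours=amount)


def validate(cfg: dict) -> list:
    """Return a list of problems found in the config (empty when it looks sane)."""
    problems = []
    if not 0 < cfg["slo_target"] <= 100:
        problems.append("slo_target must be in (0, 100]")
    if cfg["alert_threshold"] > cfg["slo_target"]:
        problems.append("alert_threshold should not exceed slo_target")
    try:
        parse_window(cfg["measurement_window"])
    except ValueError as exc:
        problems.append(str(exc))
    return problems


print(validate(config))  # []
```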

Measuring Flakiness Rate

Track first-pass success over time. For each CI run, record: timestamp, commit SHA, pass/fail, number of retries needed.

Collecting data from GitHub Actions

name: Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    outputs:
      first_pass: ${{ steps.test.outcome }}
    steps:
      - uses: actions/checkout@v4
      - id: test
        run: npm test
        continue-on-error: true

      - name: Record result
        if: always()
        run: |
          STATUS="${{ steps.test.outcome }}"
          TIMESTAMP=$(date -u +%Y-%m-%dT%H:%M:%SZ)
          COMMIT="${{ github.sha }}"
          RUN_ID="${{ github.run_id }}"

          echo "${TIMESTAMP},${RUN_ID},${COMMIT},${STATUS}" \
            >> test-reliability-log.csv

      - name: Fail if tests failed
        if: steps.test.outcome == 'failure'
        run: exit 1

The runner's filesystem is discarded when the job ends, so the CSV row must be uploaded to persistent storage: an S3 bucket, a GitHub Gist, or a dedicated metrics database.
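
One caveat when measuring first-pass results: re-running a GitHub Actions workflow keeps the same `run_id`, so a re-run that passes would otherwise be logged as a second, successful row. The sketch below keeps only the earliest row per `run_id`. It assumes re-runs append rows to the same log; the sample data and the `first_attempts` helper are made up for illustration.

```python
import csv
import io

FIELDS = ["timestamp", "run_id", "commit", "status"]


def first_attempts(rows):
    """Keep only the earliest row per run_id, dropping later re-run rows."""
    earliest = {}
    for row in rows:
        seen = earliest.get(row["run_id"])
        # ISO-8601 UTC timestamps in a single format sort correctly as strings.
        if seen is None or row["timestamp"] < seen["timestamp"]:
            earliest[row["run_id"]] = row
    return sorted(earliest.values(), key=lambda r: r["timestamp"])


# Made-up sample log: run 101 failed, then passed on a re-run.
sample = io.StringIO(
    "2024-05-01T10:00:00Z,101,abc123,failure\n"
    "2024-05-01T10:12:00Z,101,abc123,success\n"
    "2024-05-01T11:00:00Z,102,def456,success\n"
)
rows = list(csv.DictReader(sample, fieldnames=FIELDS))
for row in first_attempts(rows):
    print(row["run_id"], row["status"])
# 101 failure
# 102 success
```

An alternative is to log `github.run_attempt` as an extra column and filter on attempt 1, which avoids relying on timestamp ordering.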

Calculating the SLO

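# scripts/calculate-slo.py
import argparse
import csv
import json
from datetime import datetime, timedelta, timezone

WINDOW_DAYS = 7
TARGET = 97.0

parser = argparse.ArgumentParser(description="Report the first-pass success rate")
parser.add_argument("--json", action="store_true", help="print machine-readable JSON only")
args = parser.parse_args()

with open("test-reliability-log.csv", newline="") as f:
    # The log has no header row, so the column names are supplied here
    reader = csv.DictReader(f, fieldnames=["timestamp", "run_id", "commit", "status"])
    rows = list(reader)

# Timestamps are logged in UTC with a trailing "Z"; compare timezone-aware values
cutoff = datetime.now(timezone.utc) - timedelta(days=WINDOW_DAYS)
recent = [
    r for r in rows
    if datetime.fromisoformat(r["timestamp"].replace("Z", "+00:00")) > cutoff
]

total = len(recent)
passed = sum(1 for r in recent if r["status"] == "success")
rate = (passed / total * 100) if total > 0 else 0  # an empty window reports as 0%

# The budget is the allowed failure rate; usage is the actual failure rate
error_budget = 100 - TARGET
error_budget_used = 100 - rate

if args.json:
    print(json.dumps({"total": total, "passed": passed, "rate": round(rate, 1), "target": TARGET}))
else:
    print(f"Test reliability SLO report ({WINDOW_DAYS}-day window)")
    print(f"Total runs: {total}")
    print(f"First-pass successes: {passed}")
    print(f"Success rate: {rate:.1f}%")
    print(f"SLO target: {TARGET}%")
    print(f"Status: {'✓ SLO MET' if rate >= TARGET else '✗ SLO BREACHED'}")
    print(f"Error budget used: {error_budget_used:.1f}pp of {error_budget:.1f}pp")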

Run this weekly and post results to your team channel.

Error Budget Thinking

Borrow the error budget concept from SRE. If your SLO is 97%, your error budget is 3% — the allowed failure rate. When you use up your budget, you stop new feature development and fix flaky tests.

Error budget = 100% - SLO target = 3%
Current failure rate = 100% - actual first-pass rate

If current failure rate > error budget → budget exhausted → fix tests

This creates a forcing function. Teams can't ignore flakiness indefinitely — when the budget runs out, fixing tests becomes the highest priority.

For a 7-day window:

  • 100 CI runs
  • SLO: 97% → 3 allowed failures
  • Actual: 94% → 6 failures → budget exhausted → fix required
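
The arithmetic in this example can be sketched as a small helper. The function name `error_budget` and the shape of the returned dict are hypothetical, chosen only to mirror the numbers above.

```python
import math


def error_budget(total_runs: int, failed_runs: int, slo_target: float) -> dict:
    """Summarize error budget consumption for one measurement window."""
    # Round before flooring so float noise (e.g. 2.9999999) does not drop a failure.
    allowed = math.floor(round(total_runs * (100 - slo_target) / 100, 9))
    return {
        "allowed_failures": allowed,
        "actual_failures": failed_runs,
        "remaining": allowed - failed_runs,
        "exhausted": failed_runs > allowed,
    }


print(error_budget(total_runs=100, failed_runs=6, slo_target=97))
# {'allowed_failures': 3, 'actual_failures': 6, 'remaining': -3, 'exhausted': True}
```

A negative `remaining` value shows how far past the budget the window has gone, which is useful when deciding how many flaky tests need fixing before feature work resumes.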

Alerting on Regression

Set up alerts when the first-pass rate drops below the alert threshold:

# .github/workflows/slo-check.yml
name: Test SLO Check
on:
  schedule:
    - cron: "0 9 * * MON-FRI" # Weekdays at 09:00 UTC (GitHub schedules run in UTC)
  workflow_dispatch:

jobs:
  check-slo:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Download reliability log
        run: aws s3 cp s3://your-bucket/test-reliability-log.csv .
      - name: Check SLO
        id: slo
        run: |
          RATE=$(python scripts/calculate-slo.py --json | jq '.rate')
          echo "rate=$RATE" >> $GITHUB_OUTPUT

          if (( $(echo "$RATE < 95" | bc -l) )); then
            echo "breach=true" >> $GITHUB_OUTPUT
          else
            echo "breach=false" >> $GITHUB_OUTPUT
          fi

      - name: Alert on breach
        if: steps.slo.outputs.breach == 'true'
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "⚠️ Test SLO Breach: first-pass rate is ${{ steps.slo.outputs.rate }}% (target: 97%)",
              "channel": "#test-reliability"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
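
The breach decision and the message text can also live in Python, where they are easier to unit-test than shell arithmetic. This is a minimal sketch: `build_alert` is a hypothetical helper, the defaults mirror the SLO config shown earlier, and sending the payload to Slack is left out.

```python
import json
from typing import Optional


def build_alert(rate: float, alert_threshold: float = 95.0,
                slo_target: float = 97.0,
                channel: str = "#test-reliability") -> Optional[dict]:
    """Return a Slack-style payload when the rate is below the threshold, else None."""
    if rate >= alert_threshold:
        return None
    return {
        "text": (
            f"Test SLO breach: first-pass rate is {rate:.1f}% "
            f"(alert threshold: {alert_threshold:g}%, target: {slo_target:g}%)"
        ),
        "channel": channel,
    }


print(build_alert(96.2))  # None: below target but above the alert threshold
print(json.dumps(build_alert(93.4), indent=2))
```

Naming both the alert threshold and the SLO target in the message avoids confusion when the two differ, as they do in the config above (95 versus 97).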

Per-test Flakiness SLOs

Beyond the suite-level SLO, track per-test reliability. A test that fails 20% of the time is a much higher-priority fix than one that fails 1% of the time.

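# scripts/per-test-flakiness.py
# Reads test results JSON files and reports per-test failure rates.
# Assumes Jest-style JSON: `jest --json` nests per-test entries under
# "assertionResults", while some reporters use a nested "testResults" key.

import glob
import json
from collections import defaultdict

MIN_RUNS = 5      # skip tests without enough data
THRESHOLD = 5.0   # percent

test_results = defaultdict(lambda: {"pass": 0, "fail": 0})

for results_file in glob.glob("test-results/*.json"):
    with open(results_file) as f:
        data = json.load(f)
    for suite in data.get("testResults", []):
        tests = suite.get("assertionResults") or suite.get("testResults", [])
        for test in tests:
            name = test["fullName"]
            if test["status"] == "passed":
                test_results[name]["pass"] += 1
            elif test["status"] == "failed":
                test_results[name]["fail"] += 1
            # Skipped or pending tests are ignored

# Collect tests at or above the threshold
flaky = []
for name, counts in test_results.items():
    total = counts["pass"] + counts["fail"]
    if total < MIN_RUNS:
        continue  # Not enough data
    rate = counts["fail"] / total * 100
    if rate >= THRESHOLD:
        flaky.append((rate, counts["fail"], total, name))

# Report the worst offenders first
print("Tests at or above 5% failure rate (quarantine candidates):\n")
for rate, fails, total, name in sorted(flaky, reverse=True):
    print(f"  {rate:.0f}% failure rate ({fails}/{total}): {name}")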

Run this weekly. Tests above 5% failure rate are quarantine candidates. Tests above 20% failure rate are critical fixes.
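
The two thresholds can be turned into a triage label per test. The sketch below is one way to do it: the `triage` function and its labels are assumptions, and it uses integer comparisons so that values exactly on a threshold are not affected by floating-point rounding.

```python
def triage(passes: int, fails: int, min_runs: int = 5) -> str:
    """Classify one test by failure rate using the 5% and 20% thresholds."""
    total = passes + fails
    if total < min_runs:
        return "insufficient data"
    if fails * 100 > 20 * total:   # failure rate above 20%
        return "critical fix"
    if fails * 100 >= 5 * total:   # failure rate of 5% or more
        return "quarantine candidate"
    return "healthy"


# Made-up counts for illustration.
print(triage(passes=70, fails=30))  # critical fix
print(triage(passes=90, fails=10))  # quarantine candidate
print(triage(passes=99, fails=1))   # healthy
print(triage(passes=2, fails=1))    # insufficient data
```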

The Reliability Roadmap

Starting from zero, here's a practical path to a reliable test suite:

Week 1: Establish baseline

  • Record first-pass success rate for the past 30 days
  • Identify the 5 most flaky tests

Week 2-4: Fix quick wins

  • Fix timing issues (largest category)
  • Add beforeEach state resets
  • Quarantine tests that can't be fixed immediately

Month 2: Process changes

  • Add the flakiness rate check to CI
  • Set and communicate the SLO
  • Start the error budget tracker

Month 3+: Maintain and improve

  • Weekly flakiness report to the team
  • Error budget reviews in sprint planning
  • Gradually raise the SLO target as reliability improves

The goal isn't zero flakiness — it's making flakiness visible, measurable, and actively managed. A test suite with a published SLO, an error budget, and alerts when that budget is breached is a test suite engineers can trust.

For teams running continuous test monitoring in production — where tests run 24/7 and failures need to be triaged immediately — HelpMeTest provides the infrastructure layer, running tests on a schedule, alerting on failures, and distinguishing infrastructure noise from real regressions with AI-powered analysis.
