Setting SLOs for Test Suite Flakiness Rate and Alerting on Regressions

HelpMeTest

22 May 2026 — 5 min read

Production services have SLOs — "99.9% of requests succeed within 500ms." Test suites have... hope. Engineers hope their tests are stable, hope failures mean real bugs, hope the suite is trustworthy. When tests fail, they re-run and hope.

This informal relationship with reliability is the source of the "flaky test" culture: failures are expected, ignored, worked around. The fix is to apply the same rigor to test reliability that SRE teams apply to production services — define what "reliable" means, measure it, and alert when it degrades.

Defining Test Suite Reliability

A test suite's reliability can be measured with one number: first-pass success rate — the percentage of CI runs where all tests pass on the first attempt, without retries.

First-pass success rate = (Runs with 0 failures) / (Total CI runs) × 100

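A suite in which 5% of CI runs fail for reasons unrelated to the change under test has a 95% first-pass success rate, so engineers re-run 1 in every 20 CI runs. Over a year, those re-runs add up to significant wasted time, and each one erodes trust in the suite a little further.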
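As a rough illustration of the formula above, the sketch below computes the first-pass success rate from a list of run outcomes. The function name `first_pass_rate` and the sample data are illustrative assumptions, not part of any CI tool's API.

```python
def first_pass_rate(outcomes):
    """Return the percentage of runs that passed on the first attempt.

    `outcomes` is a list of booleans: True means every test passed
    without a retry, False means the run had at least one failure.
    """
    if not outcomes:
        raise ValueError("no CI runs recorded")
    return 100 * sum(outcomes) / len(outcomes)


# Hypothetical sample: 19 clean runs and 1 run that needed a re-run.
runs = [True] * 19 + [False]
print(f"First-pass success rate: {first_pass_rate(runs):.1f}%")
```

Raising on an empty list, rather than returning 0, is a deliberate choice here: a window with no runs says nothing about reliability and should not be read as a breach.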

Setting the SLO

Pick a target based on your current state and team tolerance:

First-pass rate	Interpretation
< 90%	Crisis: the test suite is not trustworthy
90-95%	Problematic: engineers are spending meaningful time on re-runs
95-99%	Acceptable: occasional flakiness, manageable
99-99.5%	Good: flakiness is rare, trust is maintained
> 99.5%	Excellent: almost all failures are real bugs

Start with a target about 5 percentage points above your current rate. If you're at 90% today, target 95% in 90 days. Don't set aspirational targets; set achievable ones, then raise them.
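The tiers and the "5 points above current" rule can be sketched as two small helpers. The names `interpret` and `next_target`, the lower-inclusive boundaries, and the 99.5% ceiling are assumptions made for illustration; adjust them to your team's tolerance.

```python
def interpret(rate):
    """Map a first-pass rate (percent) to the tiers in the table above.

    Boundaries are treated as lower-inclusive, which is an assumption:
    exactly 95% counts as "Acceptable", exactly 90% as "Problematic".
    """
    if rate < 90:
        return "Crisis"
    if rate < 95:
        return "Problematic"
    if rate < 99:
        return "Acceptable"
    if rate <= 99.5:
        return "Good"
    return "Excellent"


def next_target(current_rate, step=5.0, ceiling=99.5):
    """Suggest a target `step` percentage points above the current rate.

    The ceiling keeps the suggestion achievable once the suite is
    already close to fully reliable.
    """
    return min(current_rate + step, ceiling)


print(interpret(90.0), next_target(90.0))
```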

Document the SLO:

# .github/test-slo.yaml
test_reliability:
  slo_target: 97  # percent
  measurement_window: 7d  # rolling 7-day window
  exclusions:
    - infrastructure_failures  # Docker pull failures, runner unavailability
  alert_threshold: 95  # alert when rate drops below this
  alert_channel: "#test-reliability"

Measuring Flakiness Rate

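Track first-pass success over time. For each CI run, record the timestamp, run ID, commit SHA, and pass/fail outcome. If you also want the number of retries, GitHub Actions exposes the attempt number as github.run_attempt.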

Collecting data from GitHub Actions

name: Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    outputs:
      first_pass: ${{ steps.test.outcome }}
    steps:
      - uses: actions/checkout@v4
      - id: test
        run: npm test
        continue-on-error: true

      - name: Record result
        if: always()
        run: |
          STATUS="${{ steps.test.outcome }}"
          TIMESTAMP=$(date -u +%Y-%m-%dT%H:%M:%SZ)
          COMMIT="${{ github.sha }}"
          RUN_ID="${{ github.run_id }}"

          echo "${TIMESTAMP},${RUN_ID},${COMMIT},${STATUS}" \
            >> test-reliability-log.csv

      - name: Fail if tests failed
        if: steps.test.outcome == 'failure'
        run: exit 1

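The CSV written in the "Record result" step lives only in the runner's workspace, which is discarded when the job ends. Append each row to durable storage instead: an S3 bucket, a GitHub Gist, or a dedicated metrics database.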
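If you would rather record results from a script than from inline shell, a minimal sketch might look like the following. The helper name `append_result` is hypothetical; the column order matches the CSV written by the workflow above, and uploading the file to durable storage is left out.

```python
import csv
from datetime import datetime, timezone


def append_result(path, run_id, commit, status, now=None):
    """Append one CI run to the log as timestamp,run_id,commit,status."""
    # Default to the current UTC time; `now` is injectable for testing.
    now = now or datetime.now(timezone.utc)
    row = [now.strftime("%Y-%m-%dT%H:%M:%SZ"), run_id, commit, status]
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(row)
    return row
```

A call such as `append_result("test-reliability-log.csv", "123", "abc123", "success")` adds one line to the log. Using the `csv` module instead of string concatenation keeps the file parseable even if a field ever contains a comma.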

Calculating the SLO

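# scripts/calculate-slo.py
import csv
import json
import sys
from datetime import datetime, timedelta, timezone

WINDOW_DAYS = 7
TARGET = 97.0

# The log has no header row, so name the columns explicitly.
with open("test-reliability-log.csv", newline="") as f:
    reader = csv.DictReader(f, fieldnames=["timestamp", "run_id", "commit", "status"])
    rows = list(reader)

# Timestamps are UTC with a trailing "Z"; parse them as timezone-aware values.
cutoff = datetime.now(timezone.utc) - timedelta(days=WINDOW_DAYS)
recent = [
    r for r in rows
    if datetime.fromisoformat(r["timestamp"].replace("Z", "+00:00")) > cutoff
]

total = len(recent)
passed = sum(1 for r in recent if r["status"] == "success")
rate = (passed / total * 100) if total > 0 else 0

# --json prints a machine-readable summary for use in automated checks.
if "--json" in sys.argv[1:]:
    print(json.dumps({"total": total, "passed": passed, "rate": round(rate, 1)}))
    sys.exit(0)

print(f"Test reliability SLO report ({WINDOW_DAYS}-day window)")
print(f"Total runs: {total}")
print(f"First-pass successes: {passed}")
print(f"Success rate: {rate:.1f}%")
print(f"SLO target: {TARGET}%")
print(f"Status: {'✓ SLO MET' if rate >= TARGET else '✗ SLO BREACHED'}")

# Budget used is the observed failure rate, measured against the
# allowed failure rate (100 - TARGET).
error_budget_used = (100 - rate) if total > 0 else 0.0
print(f"Error budget used: {error_budget_used:.1f}pp of {100-TARGET:.1f}pp")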

Run this weekly and post results to your team channel.

Error Budget Thinking

Borrow the error budget concept from SRE. If your SLO is 97%, your error budget is 3% — the allowed failure rate. When you use up your budget, you stop new feature development and fix flaky tests.

Error budget = 100% - SLO target = 3%
Current failure rate = 100% - actual first-pass rate

If current failure rate > error budget → budget exhausted → fix tests

This creates a forcing function. Teams can't ignore flakiness indefinitely — when the budget runs out, fixing tests becomes the highest priority.

For a 7-day window:

100 CI runs
SLO: 97% → 3 allowed failures
Actual: 94% → 6 failures → budget exhausted → fix required
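The arithmetic in this example can be sketched in a few lines. The function name `error_budget` and the shape of the returned dictionary are illustrative assumptions.

```python
def error_budget(total_runs, failed_runs, slo_target):
    """Compare observed failures with the failures the SLO allows.

    `slo_target` is a percentage, e.g. 97 for a 97% first-pass SLO.
    The allowed count may be fractional for small windows.
    """
    allowed = total_runs * (100 - slo_target) / 100
    return {
        "allowed_failures": allowed,
        "actual_failures": failed_runs,
        "remaining": allowed - failed_runs,
        "exhausted": failed_runs > allowed,
    }


# The 7-day example above: 100 runs, 97% SLO, 6 failed runs.
budget = error_budget(total_runs=100, failed_runs=6, slo_target=97)
print(budget)
```

A negative `remaining` value shows how far past the budget the suite is, which can be handy when deciding how many flaky tests to fix before feature work resumes.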

Alerting on Regression

Set up alerts when the first-pass rate drops below the alert threshold:

# .github/workflows/slo-check.yml
name: Test SLO Check
on:
  schedule:
    - cron: "0 9 * * MON-FRI" # Weekdays at 09:00 UTC (GitHub Actions schedules run in UTC)
  workflow_dispatch:

jobs:
  check-slo:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Download reliability log
        run: aws s3 cp s3://your-bucket/test-reliability-log.csv .
      - name: Check SLO
        id: slo
        run: |
          RATE=$(python scripts/calculate-slo.py --json | jq '.rate')
          echo "rate=$RATE" >> $GITHUB_OUTPUT

          if (( $(echo "$RATE < 95" | bc -l) )); then
            echo "breach=true" >> $GITHUB_OUTPUT
          else
            echo "breach=false" >> $GITHUB_OUTPUT
          fi

      - name: Alert on breach
        if: steps.slo.outputs.breach == 'true'
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "⚠️ Test SLO Breach: first-pass rate is ${{ steps.slo.outputs.rate }}% (target: 97%)",
              "channel": "#test-reliability"
            }
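        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
          # v1 of the action needs this when the URL is a Slack incoming webhook
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK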
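The breach decision in this workflow is easier to unit-test if it lives in a function rather than inline shell. A minimal sketch, assuming a hypothetical helper `build_alert` and the 95% threshold and 97% target from the SLO file:

```python
import json


def build_alert(rate, alert_threshold=95.0, slo_target=97.0):
    """Return a Slack-style payload if the rate is below the threshold, else None."""
    if rate >= alert_threshold:
        return None
    return {
        "text": (
            f"Test SLO breach: first-pass rate is {rate:.1f}% "
            f"(alert threshold: {alert_threshold:g}%, target: {slo_target:g}%)"
        )
    }


alert = build_alert(93.4)
if alert is not None:
    # Print the payload; sending it to Slack is left to the workflow.
    print(json.dumps(alert))
```

Including both the threshold and the target in the message avoids confusion when the alert fires at a different value than the published SLO.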

Per-test Flakiness SLOs

Beyond the suite-level SLO, track per-test reliability. A test that fails 20% of the time is a much higher-priority fix than one that fails 1% of the time.

# scripts/per-test-flakiness.py
# Reads test results JSON files and reports per-test failure rates

import json
import glob
from collections import defaultdict

test_results = defaultdict(lambda: {"pass": 0, "fail": 0})

for results_file in glob.glob("test-results/*.json"):
    with open(results_file) as f:
        data = json.load(f)
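    for suite in data.get("testResults", []):
        # Jest's --json output nests per-test entries under "assertionResults";
        # fall back to "testResults" for reporters that use that key instead.
        for test in suite.get("assertionResults") or suite.get("testResults", []):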
            name = test["fullName"]
            if test["status"] == "passed":
                test_results[name]["pass"] += 1
            elif test["status"] == "failed":
                test_results[name]["fail"] += 1

# Report tests above 5% failure rate
print("Tests above 5% failure rate (quarantine candidates):\n")
for name, counts in test_results.items():
    total = counts["pass"] + counts["fail"]
    if total < 5:
        continue  # Not enough data
    rate = counts["fail"] / total * 100
    if rate >= 5:
        print(f"  {rate:.0f}% failure rate ({counts['fail']}/{total}): {name}")

Run this weekly. Tests above 5% failure rate are quarantine candidates. Tests above 20% failure rate are critical fixes.
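The two thresholds can be turned into a triage label per test. This is a sketch under stated assumptions: the function name `triage` and the label strings are hypothetical, the 5-run minimum mirrors the script above, and the 5% boundary is inclusive as it is in that script.

```python
def triage(fail, total, min_runs=5):
    """Classify a test by failure rate using the 5% and 20% thresholds."""
    if total < min_runs:
        return "insufficient data"
    rate = 100 * fail / total
    if rate > 20:
        return "critical fix"
    if rate >= 5:
        return "quarantine candidate"
    return "ok"


# Hypothetical counts: (failures, total runs) per test name.
counts = {
    "checkout submits order": (3, 10),
    "login shows error": (1, 10),
    "search returns results": (1, 100),
}
for name, (fail, total) in counts.items():
    print(f"{triage(fail, total)}: {name}")
```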

The Reliability Roadmap

Starting from zero, here's a practical path to a reliable test suite:

Week 1: Establish baseline

Record first-pass success rate for the past 30 days
Identify the 5 most flaky tests

Weeks 2-4: Fix quick wins

Fix timing issues (typically the largest category of flaky failures)
Add beforeEach state resets
Quarantine tests that can't be fixed immediately

Month 2: Process changes

Add the flakiness rate check to CI
Set and communicate the SLO
Start the error budget tracker

Month 3+: Maintain and improve

Weekly flakiness report to the team
Error budget reviews in sprint planning
Gradually raise the SLO target as reliability improves

The goal isn't zero flakiness — it's making flakiness visible, measurable, and actively managed. A test suite with a published SLO, an error budget, and alerts when that budget is breached is a test suite engineers can trust.

For teams running continuous test monitoring in production — where tests run 24/7 and failures need to be triaged immediately — HelpMeTest provides the infrastructure layer, running tests on a schedule, alerting on failures, and distinguishing infrastructure noise from real regressions with AI-powered analysis.
