Detecting Flaky Tests in CI with GitHub Actions Test Annotations

Flaky tests are hard to detect manually — a test that fails 5% of the time looks like noise until someone tracks it across 20 CI runs. GitHub Actions gives you tools to surface flaky tests automatically: test annotations that appear directly on pull requests, workflow features for re-running failed jobs, and patterns for tracking flakiness rates over time.

Test Annotations in GitHub Actions

GitHub Actions supports "check annotations": inline messages that appear on the Files changed tab of a pull request and in the workflow run summary, pointing to specific lines of code. Many test reporters emit annotations automatically.

Jest with GitHub Actions reporter

Jest's github-actions reporter outputs annotations natively:

npx jest --reporters=default --reporters=github-actions

Or in jest.config.js:

module.exports = {
  reporters: [
    "default",
    process.env.GITHUB_ACTIONS && "github-actions",
  ].filter(Boolean),
};

Failed tests appear as annotations on the relevant test file lines in the PR.

Playwright with GitHub reporter

Playwright includes a github reporter:

// playwright.config.ts
export default defineConfig({
  reporter: process.env.CI
    ? [["github"], ["html", { open: "never" }]]
    : [["list"]],
});

Failed tests show as check annotations on the specific test() call in the PR diff.

Custom annotations for any test framework

For frameworks without native GitHub Actions support, output annotations manually using the Actions workflow commands:

# In any CI step
echo "::error file=tests/checkout.test.js,line=42,title=Flaky Test::Test 'processes checkout' failed 2/5 runs"
echo "::warning file=tests/auth.test.js,line=18,title=Slow Test::Test took 8.3s (threshold: 5s)"

Use this in a script that analyzes test results:

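// scripts/report-flaky.js
// Expects a Jest JSON report, e.g. from:
//   npx jest --json --outputFile=test-results.json --testLocationInResults
const fs = require("fs");
const path = require("path");

const results = JSON.parse(fs.readFileSync("test-results.json", "utf8"));

results.testResults.forEach((file) => {
  // Jest reports absolute paths; annotations resolve most reliably with repo-relative ones.
  const relPath = path.relative(process.cwd(), file.name);
  // Jest's --json output lists per-test results under assertionResults.
  file.assertionResults.forEach((test) => {
    if (test.status === "failed") {
      // location is only populated when Jest runs with --testLocationInResults
      const loc = `file=${relPath},line=${test.location?.line || 1}`;
      console.log(`::error ${loc},title=Test Failed::${test.fullName}`);
    }
  });
});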
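Test names and failure messages often contain characters that workflow commands treat specially (newlines, `%`, and, inside properties, `:` and `,`). The sketch below shows one way to escape them before printing an annotation. The escaping rules follow those used by GitHub's `@actions/core` toolkit; `formatAnnotation` is a hypothetical helper name introduced here for illustration, not part of any library.

```javascript
// Escape the message part of a workflow command.
function escapeData(value) {
  return String(value)
    .replace(/%/g, "%25")
    .replace(/\r/g, "%0D")
    .replace(/\n/g, "%0A");
}

// Property values (file, title) additionally need ':' and ',' escaped.
function escapeProperty(value) {
  return escapeData(value).replace(/:/g, "%3A").replace(/,/g, "%2C");
}

// Hypothetical helper: builds one "::error ...::message" style line.
function formatAnnotation(level, { file, line, title }, message) {
  const props = [
    `file=${escapeProperty(file)}`,
    `line=${line}`,
    `title=${escapeProperty(title)}`,
  ].join(",");
  return `::${level} ${props}::${escapeData(message)}`;
}

// Example: a multi-line failure message is collapsed into a single command line.
console.log(
  formatAnnotation(
    "error",
    { file: "tests/checkout.test.js", line: 42, title: "Test Failed" },
    "processes checkout\nExpected 90, received 100"
  )
);
```

Without escaping, a newline in the message would end the command early and the rest of the failure text would show up as ordinary log output.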

Re-running Failed Jobs

GitHub Actions can re-run only the failed jobs from a workflow run, leaving passing jobs untouched. This needs no special configuration; it works for any workflow, such as this one:

name: Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test

From the GitHub UI, click "Re-run failed jobs" on a failed workflow run. This re-runs the jobs that failed, along with any jobs that depend on them, using the same commit and configuration.

For automated re-runs on transient failures, use the retry pattern:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - name: Run tests with retry
        uses: nick-fields/retry@v2
        with:
          timeout_minutes: 10
          max_attempts: 3
          command: npm test
          retry_on: error

The nick-fields/retry action re-runs the step up to 3 times on failure. Be careful: this masks flakiness rather than fixing it. Use it only for known-transient infrastructure issues (Docker pull failures, network timeouts), not for test logic failures.
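If you do retry, it helps to record that a retry was needed so the flakiness stays visible. The following is a minimal sketch of that idea, not a replacement for the action above: `classifyAttempts` and `runWithRetry` are hypothetical names, and the warning uses the same workflow-command syntax shown earlier.

```javascript
const { spawnSync } = require("node:child_process");

// Classify a sequence of attempt outcomes (true = the attempt passed).
function classifyAttempts(outcomes) {
  if (outcomes.length === 0) return "not-run";
  if (outcomes[0]) return "passed"; // green on the first try
  return outcomes.includes(true) ? "flaky" : "failed";
}

// Run a command up to maxAttempts times, stopping at the first success.
function runWithRetry(command, args, maxAttempts = 3) {
  const outcomes = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const { status } = spawnSync(command, args, { stdio: "inherit" });
    outcomes.push(status === 0);
    if (status === 0) break;
  }
  const verdict = classifyAttempts(outcomes);
  if (verdict === "flaky") {
    // Surface the retry instead of hiding it behind a green check.
    console.log(
      `::warning title=Passed on retry::'${command} ${args.join(" ")}' needed ${outcomes.length} attempts`
    );
  }
  return verdict;
}

// Usage in a CI script (left commented out so the sketch has no side effects):
// process.exitCode = runWithRetry("npm", ["test"]) === "failed" ? 1 : 0;
```

The step still goes green when a later attempt passes, but the warning leaves a trail you can count over time.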

Detecting Flakiness Across Runs

A single failed CI run doesn't prove flakiness. You need to see the pattern across multiple runs. Here's a workflow that runs the full test suite repeatedly (5 times on the nightly schedule, 10 by default when triggered manually) and reports the failure rate:

name: Flakiness Detection
on:
  schedule:
    - cron: "0 2 * * *" # Daily at 2am
  workflow_dispatch:
    inputs:
      repeat_count:
        description: "Number of times to run the suite"
        default: "10"

jobs:
  detect-flakiness:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm ci

      - name: Run suite multiple times and collect results
        run: |
          REPEAT=${{ github.event.inputs.repeat_count || 5 }}
          PASS=0
          FAIL=0
          FAILED_TESTS=""

          for i in $(seq 1 $REPEAT); do
            echo "--- Run $i of $REPEAT ---"
            if npx jest --json --outputFile=results-$i.json; then
              PASS=$((PASS + 1))
            else
              FAIL=$((FAIL + 1))
              # Extract failing test names (Jest's --json output nests
              # per-test results under assertionResults)
              FAILING=$(node -e "
                const r = require('./results-$i.json');
                r.testResults.forEach(f =>
                  (f.assertionResults || []).filter(t => t.status === 'failed')
                    .forEach(t => console.log(t.fullName))
                );
              ")
              FAILED_TESTS="$FAILED_TESTS\n$FAILING"
            fi
          done

          echo "Results: $PASS/$REPEAT passed"
          if [ $FAIL -gt 0 ]; then
            echo "FLAKY TESTS DETECTED:"
            echo -e "$FAILED_TESTS" | sort | uniq -c | sort -rn
          fi

          # Fail workflow if flakiness rate > 20%
          RATE=$((FAIL * 100 / REPEAT))
          if [ $RATE -gt 20 ]; then
            echo "::error::Flakiness rate ${RATE}% exceeds threshold (20%)"
            exit 1
          fi

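This workflow runs nightly, executes the suite 5 times, and fails if more than 20% of those runs fail. The sort | uniq -c | sort -rn pipeline counts how often each test name appears in the failure output, so the tests at the top are your highest-priority quarantine candidates. A test that fails in every run is broken rather than flaky, so compare each count against the number of runs.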
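The shell pipeline only counts failures. To separate tests that fail intermittently from tests that fail every time, the per-run reports can be aggregated in a small script. This is a sketch under the assumption that each report is a parsed Jest `--json` file whose file entries list per-test results under `assertionResults`; `summarizeRuns` is a hypothetical name.

```javascript
// Aggregate several parsed Jest --json reports into per-test failure stats.
function summarizeRuns(runs) {
  const stats = new Map(); // test fullName -> { passed, failed }
  for (const run of runs) {
    for (const file of run.testResults ?? []) {
      for (const test of file.assertionResults ?? []) {
        // Ignore pending/todo/skipped tests.
        if (test.status !== "passed" && test.status !== "failed") continue;
        const entry = stats.get(test.fullName) ?? { passed: 0, failed: 0 };
        entry[test.status] += 1;
        stats.set(test.fullName, entry);
      }
    }
  }
  return [...stats.entries()]
    .filter(([, s]) => s.failed > 0)
    .map(([name, s]) => ({
      name,
      failed: s.failed,
      total: s.passed + s.failed,
      // Mixed outcomes on the same commit suggest flakiness; all-fail suggests a real bug.
      kind: s.passed > 0 ? "flaky" : "broken",
    }))
    .sort((a, b) => b.failed - a.failed);
}

// Example with two in-memory reports standing in for results-1.json and results-2.json.
const exampleRuns = [
  { testResults: [{ assertionResults: [
    { fullName: "auth logs in", status: "failed" },
    { fullName: "search ranks results", status: "failed" },
  ] }] },
  { testResults: [{ assertionResults: [
    { fullName: "auth logs in", status: "passed" },
    { fullName: "search ranks results", status: "failed" },
  ] }] },
];
console.log(summarizeRuns(exampleRuns));
```

In the example, "auth logs in" is reported as flaky (1 of 2 runs failed) and "search ranks results" as broken (2 of 2).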

PR-level Flakiness Detection

For tests added or modified in a PR, run the changed test files repeatedly before allowing merge:

name: New Test Flakiness Check
on:
  pull_request:
    paths:
      - "tests/**/*.test.js"
      - "tests/**/*.spec.js"

jobs:
  check-new-tests:
    runs-on: ubuntu-latest
    steps:
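      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # Need full history to find changed files
      - uses: actions/setup-node@v4
      - run: npm ci

      - name: Find changed test files
        id: changed-tests
        run: |
          # :(glob) lets ** match zero or more directories, like the paths filter above;
          # --diff-filter=d skips deleted files
          CHANGED=$(git diff --name-only --diff-filter=d origin/${{ github.base_ref }}...HEAD \
            -- ':(glob)tests/**/*.test.js' ':(glob)tests/**/*.spec.js' | tr '\n' ' ')
          echo "files=$CHANGED" >> "$GITHUB_OUTPUT"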

      - name: Run changed tests 10 times
        if: steps.changed-tests.outputs.files != ''
        run: |
          FILES="${{ steps.changed-tests.outputs.files }}"
          echo "Checking flakiness for: $FILES"

          FAIL=0
          for i in {1..10}; do
            if ! npx jest $FILES --passWithNoTests; then
              FAIL=$((FAIL + 1))
            fi
          done

          if [ $FAIL -gt 0 ]; then
            echo "::warning::$FAIL/10 runs failed for modified tests. Review for flakiness."
          fi

          # Block if > 30% failure rate on new tests
          if [ $FAIL -gt 3 ]; then
            echo "::error::New tests are flaky ($FAIL/10 failures). Fix before merging."
            exit 1
          fi

This catches flaky tests in review, before they reach the main branch and start failing other people's CI runs.
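Ten runs catch obviously unstable tests, but low failure rates can still slip through. The arithmetic below is a rough guide, assuming each run fails independently with the same probability; the function names are illustrative.

```javascript
// Probability that at least one of `runs` executions fails,
// for a test that fails independently with probability `failureRate`.
function detectionProbability(failureRate, runs) {
  return 1 - Math.pow(1 - failureRate, runs);
}

// Smallest number of runs needed to see at least one failure
// with the given confidence (e.g. 0.95).
function runsNeeded(failureRate, confidence) {
  return Math.ceil(Math.log(1 - confidence) / Math.log(1 - failureRate));
}

// A test that fails 5% of the time shows up in 10 runs about 40% of the time.
console.log(detectionProbability(0.05, 10).toFixed(2)); // "0.40"

// Catching that same test with 95% confidence takes 59 runs.
console.log(runsNeeded(0.05, 0.95)); // 59
```

This is why the PR gate works best alongside nightly detection and history tracking: the gate filters out badly flaky tests cheaply, while the slower layers accumulate enough runs to expose rare failures.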

Storing Flakiness History

Track flakiness rates over time using GitHub Actions artifacts and a simple aggregator:

- name: Save test results
  uses: actions/upload-artifact@v4
  if: always()
  with:
    name: test-results-${{ github.run_id }}
    path: test-results.json
    retention-days: 30

- name: Update flakiness database
  if: always()
  run: |
    node scripts/update-flakiness-db.js \
      --results test-results.json \
      --run-id ${{ github.run_id }} \
      --commit ${{ github.sha }}

The update-flakiness-db.js script appends to a JSON file stored in a separate branch or gist, building a historical record of which tests fail and how often. Query this weekly to produce a "most flaky tests" report.
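The script itself is not shown here, so the following is only a sketch of what its core could look like. The database shape, the function names `recordRun` and `mostFlaky`, and the use of `assertionResults` from a Jest `--json` report are all assumptions made for illustration; reading and writing the JSON file on the separate branch or gist is left out.

```javascript
// Hypothetical database shape:
//   { "<test fullName>": { runs: number, failures: number, lastFailedCommit: string | null } }

// Fold one parsed Jest --json report into the database (mutates and returns db).
function recordRun(db, report, commit) {
  for (const file of report.testResults ?? []) {
    for (const test of file.assertionResults ?? []) {
      // Only count tests that actually ran.
      if (test.status !== "passed" && test.status !== "failed") continue;
      const entry = (db[test.fullName] ??= { runs: 0, failures: 0, lastFailedCommit: null });
      entry.runs += 1;
      if (test.status === "failed") {
        entry.failures += 1;
        entry.lastFailedCommit = commit;
      }
    }
  }
  return db;
}

// Produce the weekly "most flaky tests" list, highest failure rate first.
function mostFlaky(db, limit = 10) {
  return Object.entries(db)
    .map(([name, e]) => ({ name, rate: e.failures / e.runs, ...e }))
    .filter((e) => e.failures > 0)
    .sort((a, b) => b.rate - a.rate)
    .slice(0, limit);
}

// Example: two runs on different commits.
const exampleDb = {};
recordRun(exampleDb, { testResults: [{ assertionResults: [
  { fullName: "checkout applies discount", status: "failed" },
  { fullName: "auth logs in", status: "passed" },
] }] }, "abc123");
recordRun(exampleDb, { testResults: [{ assertionResults: [
  { fullName: "checkout applies discount", status: "passed" },
  { fullName: "auth logs in", status: "passed" },
] }] }, "def456");
console.log(mostFlaky(exampleDb));
```

Storing counts per test rather than raw reports keeps the file small, at the cost of losing per-run detail; the uploaded artifacts cover that for the retention window.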

Combining Annotations, Detection, and Tracking

A mature flakiness management system has three layers:

  1. Per-PR annotations — failed tests appear inline on the PR diff, no context-switching needed
  2. New test flakiness gate — new tests must be stable before merge
  3. Nightly detection — all tests run multiple times, flakiness rate tracked over time

These layers work together: annotations surface failures immediately, the gate prevents new flakiness from entering the codebase, and nightly detection surfaces existing flakiness for prioritization.

The result is a test suite where flakiness is visible, tracked, and actively reduced — rather than something developers learn to ignore.

For teams that want continuous monitoring beyond CI — running tests against production environments 24/7 and distinguishing infrastructure noise from real failures — HelpMeTest provides always-on test execution with AI-powered failure analysis.
