Detecting Flaky Tests in CI with GitHub Actions Test Annotations
Flaky tests are hard to detect manually — a test that fails 5% of the time looks like noise until someone tracks it across 20 CI runs. GitHub Actions gives you tools to surface flaky tests automatically: test annotations that appear directly on pull requests, workflow features for re-running failed jobs, and patterns for tracking flakiness rates over time.
Test Annotations in GitHub Actions
GitHub Actions supports check annotations: inline messages attached to specific lines of code, shown in the Files changed tab of a pull request and in the workflow run summary. Many test reporters output annotations automatically.
Jest with GitHub Actions reporter
Jest's github-actions reporter outputs annotations natively:
npx jest --reporters=default --reporters=github-actions
Or in jest.config.js:
module.exports = {
  reporters: [
    "default",
    process.env.GITHUB_ACTIONS && "github-actions",
  ].filter(Boolean),
};
Failed tests appear as annotations on the relevant test file lines in the PR.
Playwright with GitHub reporter
Playwright includes a github reporter:
// playwright.config.ts
import { defineConfig } from "@playwright/test";
export default defineConfig({
  reporter: process.env.CI
    ? [["github"], ["html", { open: "never" }]]
    : [["list"]],
});
Failed tests show as check annotations on the specific test() call in the PR diff.
Custom annotations for any test framework
For frameworks without native GitHub Actions support, output annotations manually using the Actions workflow commands:
# In any CI step
echo "::error file=tests/checkout.test.js,line=42,title=Flaky Test::Test 'processes checkout' failed 2/5 runs"
echo "::warning file=tests/auth.test.js,line=18,title=Slow Test::Test took 8.3s (threshold: 5s)"
Use this in a script that analyzes test results:
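// scripts/report-flaky.js
// Assumes test-results.json was written by:
//   npx jest --json --outputFile=test-results.json --testLocationInResults
const fs = require("fs");
const path = require("path");
const results = JSON.parse(fs.readFileSync("test-results.json", "utf8"));
results.testResults.forEach((file) => {
  // Annotations need a repo-relative path; Jest reports absolute paths in `name`.
  const relPath = path.relative(process.cwd(), file.name);
  file.assertionResults.forEach((test) => {
    if (test.status === "failed") {
      // `location` is only populated with --testLocationInResults; fall back to line 1.
      const loc = `file=${relPath},line=${test.location?.line || 1}`;
      console.log(`::error ${loc},title=Test Failed::${test.fullName}`);
    }
  });
});
Re-running Failed Jobs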
GitHub Actions supports re-running only the failed jobs from a workflow, without re-running passing jobs:
name: Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test
From the GitHub UI, click "Re-run failed jobs" on a failed workflow run. This re-runs only the jobs that failed, using the same commit and configuration.
For automated re-runs on transient failures, use the retry pattern:
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - name: Run tests with retry
        uses: nick-fields/retry@v2
        with:
          timeout_minutes: 10
          max_attempts: 3
          command: npm test
          retry_on: error
The nick-fields/retry action re-runs the step up to 3 times on failure. Be careful: this masks flakiness rather than fixing it. Use it only for known-transient infrastructure issues (Docker pull failures, network timeouts), not for test logic failures.
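One way to keep retries from hiding flakiness is to record when a step only passed on a later attempt. The following is a minimal sketch, not part of the retry action: `classifyAttempts` and `reportAttempts` are hypothetical helpers, and how you collect the per-attempt outcomes (the `attempts` array) is left as an assumption.

```javascript
// Hypothetical helper: classify a retried step from its attempt outcomes.
// `attempts` is an array of booleans (true = attempt passed), oldest first.
function classifyAttempts(attempts) {
  if (attempts.length === 0) throw new Error("no attempts recorded");
  const passedFinally = attempts[attempts.length - 1];
  const failedEarlier = attempts.slice(0, -1).some((ok) => !ok);
  if (passedFinally && failedEarlier) return "flaky"; // only passed on retry
  return passedFinally ? "stable" : "failing";
}

// Build a warning annotation (a GitHub Actions workflow command) for steps
// that needed a retry; returns null when there is nothing to report.
function reportAttempts(stepName, attempts) {
  if (classifyAttempts(attempts) !== "flaky") return null;
  return `::warning title=Passed on retry::${stepName} needed ${attempts.length} attempts`;
}

// Example: the step failed twice, then passed on the third attempt.
console.log(reportAttempts("npm test", [false, false, true]));
```

Printing the returned line from a CI step would surface the retry as a warning annotation, so a green build still leaves a trace that something was unstable.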
Detecting Flakiness Across Runs
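A single failed CI run doesn't prove flakiness. You need to see the pattern across multiple runs. Here's a workflow that runs the full test suite repeatedly (5 times on the nightly schedule, or a configurable count that defaults to 10 when triggered manually) and reports the share of runs that failed: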
name: Flakiness Detection
on:
  schedule:
    - cron: "0 2 * * *" # Daily at 02:00 UTC
  workflow_dispatch:
    inputs:
      repeat_count:
        description: "Number of times to run the suite"
        default: "10"
jobs:
  detect-flakiness:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm ci
      - name: Run suite multiple times and collect results
        run: |
          REPEAT=${{ github.event.inputs.repeat_count || 5 }}
          PASS=0
          FAIL=0
          FAILED_TESTS=""
          for i in $(seq 1 "$REPEAT"); do
            echo "--- Run $i of $REPEAT ---"
            if npx jest --json --outputFile="results-$i.json"; then
              PASS=$((PASS + 1))
            else
              FAIL=$((FAIL + 1))
              # Extract failing test names (jest --json lists them under assertionResults)
              FAILING=$(node -e "
                const r = require('./results-$i.json');
                r.testResults.forEach(f =>
                  f.assertionResults.filter(t => t.status === 'failed')
                    .forEach(t => console.log(t.fullName))
                );
              ")
              FAILED_TESTS="$FAILED_TESTS\n$FAILING"
            fi
          done
          echo "Results: $PASS/$REPEAT passed"
          if [ "$FAIL" -gt 0 ]; then
            echo "FLAKY TESTS DETECTED:"
            echo -e "$FAILED_TESTS" | sort | uniq -c | sort -rn
          fi
          # Fail workflow if the share of failed runs exceeds 20%
          RATE=$((FAIL * 100 / REPEAT))
          if [ "$RATE" -gt 20 ]; then
            echo "::error::Flakiness rate ${RATE}% exceeds threshold (20%)"
            exit 1
          fi
This workflow runs nightly, repeats the suite 5 times (or the count given on manual dispatch), and fails if more than 20% of runs fail. The sort | uniq -c | sort -rn pipeline shows which tests appear most often in the failure results: these are your highest-priority quarantine candidates.
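Counting failures per test name does not tell you whether a test is flaky (it fails in some runs and passes in others) or simply broken (it fails every run). A minimal sketch of that distinction follows. It assumes each run is the parsed output of `jest --json`, where tests are listed under `testResults[].assertionResults[]`; `summarizeRuns` and the sample data are hypothetical.

```javascript
// Hypothetical helper: summarize per-test outcomes across repeated runs.
// Assumes each element of `runs` is parsed `jest --json` output, i.e.
// { testResults: [{ assertionResults: [{ fullName, status }] }] }.
function summarizeRuns(runs) {
  const counts = new Map(); // fullName -> { passed, failed }
  for (const run of runs) {
    for (const file of run.testResults) {
      for (const test of file.assertionResults) {
        // Ignore skipped/pending/todo tests; only real outcomes count.
        if (test.status !== "passed" && test.status !== "failed") continue;
        const entry = counts.get(test.fullName) || { passed: 0, failed: 0 };
        entry[test.status] += 1;
        counts.set(test.fullName, entry);
      }
    }
  }
  return [...counts.entries()]
    .filter(([, c]) => c.failed > 0)
    .map(([name, c]) => ({
      name,
      failed: c.failed,
      total: c.passed + c.failed,
      // Mixed outcomes suggest flakiness; failing every run suggests a real bug.
      kind: c.passed > 0 ? "flaky" : "consistently failing",
    }))
    .sort((a, b) => b.failed - a.failed);
}

// Sample data standing in for results-1.json, results-2.json, results-3.json.
const run = (checkout, auth) => ({
  testResults: [{
    assertionResults: [
      { fullName: "checkout processes checkout", status: checkout },
      { fullName: "auth rejects bad token", status: auth },
    ],
  }],
});
const sampleRuns = [
  run("passed", "failed"),
  run("failed", "failed"),
  run("passed", "failed"),
];

console.log(summarizeRuns(sampleRuns));
```

In the sample data, the auth test fails all three runs and is reported as consistently failing, while the checkout test fails one of three and is reported as flaky. Only the second kind is a quarantine candidate; the first needs a fix.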
PR-level Flakiness Detection
For new tests added in a PR, run them repeatedly before allowing merge:
name: New Test Flakiness Check
on:
  pull_request:
    paths:
      - "tests/**/*.test.js"
      - "tests/**/*.spec.js"
jobs:
  check-new-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # Need full history to find changed files
      - uses: actions/setup-node@v4
      - run: npm ci # Jest must be installed before the runs below
      - name: Find changed test files
        id: changed-tests
        run: |
          # --diff-filter=d skips deleted files, which Jest cannot run
          CHANGED=$(git diff --name-only --diff-filter=d origin/${{ github.base_ref }}...HEAD \
            -- 'tests/**/*.test.js' 'tests/**/*.spec.js' | tr '\n' ' ')
          echo "files=$CHANGED" >> "$GITHUB_OUTPUT"
      - name: Run changed tests 10 times
        if: steps.changed-tests.outputs.files != ''
        run: |
          FILES="${{ steps.changed-tests.outputs.files }}"
          echo "Checking flakiness for: $FILES"
          FAIL=0
          for i in {1..10}; do
            if ! npx jest $FILES --passWithNoTests; then
              FAIL=$((FAIL + 1))
            fi
          done
          if [ "$FAIL" -gt 0 ]; then
            echo "::warning::$FAIL/10 runs failed for modified tests. Review for flakiness."
          fi
          # Block if > 30% failure rate on new tests
          if [ "$FAIL" -gt 3 ]; then
            echo "::error::New tests are flaky ($FAIL/10 failures). Fix before merging."
            exit 1
          fi
This catches flaky tests during review rather than after they land on the main branch.
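Ten runs will not catch every flaky test. As a rough worked example, assuming runs are independent (an idealization, since shared state or load can correlate failures), the chance that a test slips through and the number of runs needed to catch it can be estimated as follows. Both function names are hypothetical.

```javascript
// Probability that a test with per-run failure rate `p` passes all `n` runs,
// assuming independent runs.
function chanceOfSlippingThrough(p, n) {
  return Math.pow(1 - p, n);
}

// Smallest number of runs needed to observe at least one failure with the
// given confidence, under the same independence assumption.
function runsNeeded(p, confidence) {
  return Math.ceil(Math.log(1 - confidence) / Math.log(1 - p));
}

// A test that fails 5% of the time, checked with 10 runs:
console.log(chanceOfSlippingThrough(0.05, 10).toFixed(3));
// Runs needed to catch that test with 95% confidence:
console.log(runsNeeded(0.05, 0.95));
```

Under these assumptions, a test that fails 5% of the time passes all 10 runs roughly 60% of the time, and about 59 runs are needed to catch it with 95% confidence. The PR gate is therefore best at catching badly flaky tests; the nightly job and the history tracking pick up the rest.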
Storing Flakiness History
Track flakiness rates over time using GitHub Actions artifacts and a simple aggregator:
- name: Save test results
  uses: actions/upload-artifact@v4
  if: always()
  with:
    name: test-results-${{ github.run_id }}
    path: test-results.json
    retention-days: 30
- name: Update flakiness database
  if: always()
  run: |
    node scripts/update-flakiness-db.js \
      --results test-results.json \
      --run-id ${{ github.run_id }} \
      --commit ${{ github.sha }}
The update-flakiness-db.js script appends to a JSON file stored in a separate branch or gist, building a historical record of which tests fail and how often. Query this weekly to produce a "most flaky tests" report.
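The article does not show update-flakiness-db.js, so the following is only a sketch of what its core might look like. The history shape (`{ runs: [{ runId, commit, failed }] }`), the function names `appendRun` and `mostFlaky`, and the `jest --json` result format (`testResults[].assertionResults[]`) are all assumptions. Reading and writing the JSON file, and pushing it to a branch or gist, are omitted.

```javascript
// Hypothetical core of scripts/update-flakiness-db.js.
// Appends one run's failing test names to the history (returns a new object).
function appendRun(history, results, { runId, commit }) {
  const failed = results.testResults.flatMap((file) =>
    file.assertionResults
      .filter((t) => t.status === "failed")
      .map((t) => t.fullName)
  );
  return { runs: [...history.runs, { runId, commit, failed }] };
}

// Ranks tests by the number of recorded runs in which they failed.
function mostFlaky(history, limit = 10) {
  const counts = new Map(); // test name -> number of runs with a failure
  for (const run of history.runs) {
    for (const name of new Set(run.failed)) {
      counts.set(name, (counts.get(name) || 0) + 1);
    }
  }
  const total = history.runs.length;
  return [...counts.entries()]
    .map(([name, failures]) => ({ name, failures, rate: failures / total }))
    .sort((a, b) => b.failures - a.failures)
    .slice(0, limit);
}

// Usage with in-memory sample data (run IDs and commits are made up).
const result = (...statuses) => ({
  testResults: [{
    assertionResults: [
      { fullName: "checkout processes checkout", status: statuses[0] },
      { fullName: "auth rejects bad token", status: statuses[1] },
    ],
  }],
});
let history = { runs: [] };
history = appendRun(history, result("failed", "passed"), { runId: "101", commit: "abc123" });
history = appendRun(history, result("passed", "passed"), { runId: "102", commit: "def456" });
history = appendRun(history, result("failed", "failed"), { runId: "103", commit: "0a1b2c" });

console.log(mostFlaky(history));
```

Keeping the run ID and commit with each entry makes it possible to trace a spike in failures back to the change that introduced it.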
Combining Annotations, Detection, and Tracking
A mature flakiness management system has three layers:
- Per-PR annotations — failed tests appear inline on the PR diff, no context-switching needed
- New test flakiness gate — new tests must be stable before merge
- Nightly detection — all tests run multiple times, flakiness rate tracked over time
These layers work together: annotations surface failures immediately, the gate prevents new flakiness from entering the codebase, and nightly detection surfaces existing flakiness for prioritization.
The result is a test suite where flakiness is visible, tracked, and actively reduced — rather than something developers learn to ignore.
For teams that want continuous monitoring beyond CI — running tests against production environments 24/7 and distinguishing infrastructure noise from real failures — HelpMeTest provides always-on test execution with AI-powered failure analysis.