Test Trend Analysis: How to Spot Flaky Tests and Coverage Gaps from Reports

Running tests is table stakes. Understanding what the results mean over time — that's where teams actually improve their quality practice.

Test trend analysis is the discipline of reading patterns from test result history: which tests are reliable, which are lying to you, which parts of the codebase have no coverage, and whether quality is improving or degrading across releases.

This guide covers practical techniques for extracting signal from test data, whether you're working with a dedicated analytics platform or writing SQL queries against your own result store.

Why Trend Analysis Matters

A single test run tells you: "These N tests passed, these M failed, right now."

Trend analysis tells you:

  • Is the failure rate getting better or worse over time?
  • Which tests fail frequently vs. occasionally vs. never?
  • Is test coverage growing with the codebase, or falling behind?
  • Are new failures appearing on main, or only on feature branches?
  • Which areas of the application are most risky based on historical test behavior?

The difference between a reactive QA team (fixing broken tests) and a proactive one (preventing regressions) is usually whether they're doing trend analysis or not.

The Four Core Metrics

1. Pass Rate Trend

The overall ratio of passing tests over time. This is your highest-level signal.

Pass rate = (total tests - failures - errors) / total tests
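
The formula can be expressed directly as a small helper (a minimal sketch; it assumes skipped tests are already excluded from the failure and error counts):

```python
def pass_rate(total_tests, failures, errors):
    """Pass rate = (total tests - failures - errors) / total tests."""
    if total_tests == 0:
        return 0.0
    return (total_tests - failures - errors) / total_tests

# e.g. a run of 200 tests with 6 failures and 2 errors
print(round(pass_rate(200, 6, 2) * 100, 1))  # 96.0
```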

What to look for:

  • Stable pass rate (>95%) with occasional dips: healthy. Dips after deploys indicate regressions.
  • Gradually declining pass rate: technical debt accumulating in the test suite — often stale selectors or outdated test data.
  • Volatile pass rate (wild swings): environment instability or heavy test flakiness.
  • Plateau below 90%: known failures being ignored. This is a culture problem, not a technical one.

2. Failure Rate Per Test

How often does each individual test fail, across all runs over a time window?

Failure rate = failures / total executions (last 30 days)

Classify tests into:

  • Green (0-5% failure rate): reliable, trusted
  • Yellow (5-20% failure rate): needs investigation — could be flaky or intermittently broken
  • Red (>20% failure rate): actively broken or severely flaky — high noise, low signal
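
The bucketing above translates into a trivial classifier (a sketch using the thresholds from the list; the names are illustrative):

```python
def classify_test(failures, executions):
    """Bucket a test by its failure rate over a time window."""
    rate = failures / executions if executions else 0.0
    if rate <= 0.05:
        return "green"
    if rate <= 0.20:
        return "yellow"
    return "red"

print(classify_test(1, 30))   # green  (~3.3%)
print(classify_test(3, 30))   # yellow (10%)
print(classify_test(10, 30))  # red    (~33%)
```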

3. Flakiness Score

A flaky test is one that changes result (pass↔fail) across consecutive runs without code changes. Flakiness is distinct from a consistently failing test.

Flakiness = status change count / total executions

A high flakiness score on a currently passing test is just as dangerous as a failing test — it means you can't trust either result.
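
Given a chronological list of statuses for one test, the flip-counting definition above is a few lines (a minimal sketch):

```python
def flakiness_score(statuses):
    """Status change count / total executions, per the definition above."""
    flips = sum(1 for prev, cur in zip(statuses, statuses[1:]) if prev != cur)
    return flips / len(statuses) if statuses else 0.0

history = ["passed", "failed", "passed", "passed", "failed"]
print(flakiness_score(history))  # 0.6 — 3 flips across 5 runs
```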

4. Coverage Trend

Are new features being covered by tests? Coverage trend measures whether the ratio of tested code is growing or shrinking as the codebase grows.

Track via:

  • Line/branch coverage from tools like Istanbul, coverage.py, or JaCoCo
  • Or proxy: test count vs. story points shipped over time

Identifying Flaky Tests

Flakiness is often the biggest source of noise in a test suite. A team with 20% flaky tests loses trust in their entire CI pipeline — engineers start ignoring failures, and real regressions slip through.

Method 1: Status Flip Detection

The most direct approach: count how many times a test changed its result between consecutive runs.

WITH ordered AS (
  SELECT
    test_name,
    status,
    created_at,
    LAG(status) OVER (PARTITION BY test_name ORDER BY created_at) as prev_status
  FROM test_results
  WHERE created_at > NOW() - INTERVAL '30 days'
),
flips AS (
  SELECT
    test_name,
    COUNT(*) as total_runs,
    SUM(CASE WHEN status != prev_status AND prev_status IS NOT NULL THEN 1 ELSE 0 END) as flips,
    AVG(CASE WHEN status = 'passed' THEN 1.0 ELSE 0.0 END) as pass_rate
  FROM ordered
  GROUP BY test_name
  HAVING COUNT(*) >= 5  -- minimum runs to be meaningful
)
SELECT
  test_name,
  total_runs,
  flips,
  ROUND(flips::numeric / total_runs * 100, 1) as flakiness_pct,
  ROUND(pass_rate * 100, 1) as pass_rate_pct
FROM flips
WHERE flips > 0
ORDER BY flips DESC;

Tests with >15% flakiness score are priority candidates for investigation.

Method 2: Re-run Comparison

Run your test suite twice in the same environment with the same code. Tests that produce different results between runs are flaky by definition — no code changed.

This is expensive for large suites but extremely accurate for targeted flakiness audits. Schedule a weekly double-run on the most volatile tests.
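
Comparing the two runs is then a simple set difference. A sketch, assuming each run's results are available as a {test_name: status} mapping:

```python
def diff_reruns(run_a, run_b):
    """Tests whose status differs between two back-to-back runs of identical
    code are flaky by definition."""
    return sorted(name for name in run_a.keys() & run_b.keys()
                  if run_a[name] != run_b[name])

first  = {"login": "passed", "checkout": "failed", "search": "passed"}
second = {"login": "passed", "checkout": "passed", "search": "passed"}
print(diff_reruns(first, second))  # ['checkout']
```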

Method 3: Failure Rate Without Recent Code Changes

If a test fails, check whether any files it covers changed since the last green run. If not, the failure is likely flaky.

This requires linking test results to code changes — achievable by storing commit_sha with each test run and joining against Git history.
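
Once results are linked to commits, the decision rule itself is a set intersection. A hedged sketch, assuming you can obtain the files a test exercises and the files changed since its last green run:

```python
def failure_looks_flaky(covered_files, changed_files):
    """Heuristic from above: a failure with no overlap between the files the
    test covers and the files changed since the last green run is suspect."""
    return not (set(covered_files) & set(changed_files))

changed = {"app/payments/api.py"}
print(failure_looks_flaky({"app/search/index.py"}, changed))   # True  — likely flaky
print(failure_looks_flaky({"app/payments/api.py"}, changed))   # False — plausible regression
```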

Fixing Flaky Tests

Once identified, flaky tests fall into common categories:

Timing-dependent failures:

  • Tests that sleep() for a fixed duration instead of waiting for a condition
  • Fix: replace sleeps with explicit waits (Playwright's waitForSelector, Robot Framework's Wait Until Element Is Visible)
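
Outside of frameworks with built-in waits, the same idea is a condition-polling loop (a framework-agnostic sketch):

```python
import time

def wait_until(condition, timeout=5.0, interval=0.1):
    """Poll a condition until it holds or the timeout expires, instead of
    sleeping for a guessed fixed duration. Returns True if it became true."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# Becomes true on the third poll — no guessed sleep duration needed.
attempts = {"n": 0}
def ready():
    attempts["n"] += 1
    return attempts["n"] >= 3

print(wait_until(ready, timeout=2.0, interval=0.01))  # True
```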

Shared state pollution:

  • Tests that depend on database state left by a previous test
  • Fix: transaction rollback after each test, or dedicated test data setup/teardown
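
The rollback pattern in miniature, using an in-memory SQLite database as a stand-in for the real one (a sketch — with a real ORM you'd wrap this in a fixture):

```python
import sqlite3
from contextlib import contextmanager

@contextmanager
def isolated(conn):
    """Yield a connection for one test, then roll back whatever it wrote."""
    try:
        yield conn
    finally:
        conn.rollback()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.commit()

with isolated(conn) as c:
    c.execute("INSERT INTO users VALUES ('fixture-user')")
    # ... test body runs here and sees its own data ...

# After the test, the write is gone — no state leaks into the next test.
print(conn.execute("SELECT COUNT(*) FROM users").fetchone()[0])  # 0
```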

External service dependencies:

  • Tests that call real APIs, email services, or third-party systems
  • Fix: mock the dependency at the network boundary, or use a stable test environment

Race conditions in async code:

  • Tests that don't properly await async operations
  • Fix: ensure all async operations are properly awaited and assertions run after completion

Order-dependent tests:

  • Tests that pass individually but fail when run in a certain order
  • Fix: enforce test isolation — each test must set up its own preconditions

Spotting Coverage Gaps

Method 1: Module-Level Coverage Heatmap

Line coverage tools (Istanbul for JavaScript, coverage.py for Python, JaCoCo for Java) produce per-file coverage reports. Aggregate this into a heatmap by feature area or directory.

Look for:

  • Directories with <50% coverage that contain business-critical logic
  • New files added in the last 30 days with 0% coverage
  • Recently changed files (from git diff) with declining coverage
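
Aggregating a per-file report into a directory heatmap is a short script. A sketch assuming the shape of coverage.py's JSON report (`coverage json`), which lists per-file summaries under a `files` key:

```python
from collections import defaultdict
from pathlib import PurePosixPath

def directory_heatmap(report):
    """Roll per-file line coverage up into per-top-level-directory percentages."""
    totals = defaultdict(lambda: [0, 0])  # dir -> [covered_lines, num_statements]
    for path, info in report["files"].items():
        top = PurePosixPath(path).parts[0]
        totals[top][0] += info["summary"]["covered_lines"]
        totals[top][1] += info["summary"]["num_statements"]
    return {d: round(100.0 * c / n, 1) if n else 0.0
            for d, (c, n) in sorted(totals.items())}

report = {"files": {
    "billing/invoice.py": {"summary": {"covered_lines": 40, "num_statements": 100}},
    "billing/tax.py":     {"summary": {"covered_lines": 10, "num_statements": 100}},
    "search/query.py":    {"summary": {"covered_lines": 90, "num_statements": 100}},
}}
print(directory_heatmap(report))  # {'billing': 25.0, 'search': 90.0}
```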

Method 2: Risk-Coverage Matrix

Plot feature importance against test coverage. High-importance, low-coverage areas are your biggest risk zones.

           | Low Coverage  | High Coverage
-----------|---------------|---------------
High Risk  | 🔴 Critical   | 🟡 Monitor
Low Risk   | 🟡 Acceptable | 🟢 Good

Risk is a judgment call — typically defined by business criticality, change frequency, and historical defect density.
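
Mapping areas onto the matrix is mechanical once those judgment calls are made (a sketch — the 50% cutoff is illustrative, not a standard):

```python
def risk_zone(high_risk, coverage_pct, low_coverage_threshold=50):
    """Place a feature area into one cell of the risk-coverage matrix above."""
    low_cov = coverage_pct < low_coverage_threshold
    if high_risk and low_cov:
        return "critical"    # 🔴 high risk, low coverage
    if high_risk:
        return "monitor"     # 🟡 high risk, high coverage
    if low_cov:
        return "acceptable"  # 🟡 low risk, low coverage
    return "good"            # 🟢 low risk, high coverage

print(risk_zone(True, 30))   # critical
print(risk_zone(False, 85))  # good
```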

Method 3: Execution Frequency Analysis

If you track which test files cover which parts of the application, you can find areas that are never exercised by any test.

More practically: find parts of the application that have no failing tests but also no passing tests — they might not be covered at all.

Method 4: New Code Coverage Gate

In CI, fail the build if new code drops overall coverage below a threshold:

# GitHub Actions example
- name: Check coverage
  run: |
    pytest --cov=myapp --cov-report=term-missing --cov-fail-under=80

This doesn't find existing gaps, but it prevents new ones from being introduced.

Reading Failure Patterns

Time-Correlated Failures

If tests fail at specific times (e.g., every day at 2am, or every Monday), the cause is usually:

  • Environment-level: scheduled jobs competing with tests, auto-scaling events
  • Time-based: tests that depend on current time/date and break at boundaries
  • Data-based: weekly data refresh that changes expected values

Query for time patterns:

SELECT
  EXTRACT(HOUR FROM created_at) as hour_of_day,
  COUNT(*) FILTER (WHERE status = 'failed') as failures,
  COUNT(*) as total,
  ROUND(COUNT(*) FILTER (WHERE status = 'failed')::numeric / COUNT(*) * 100, 1) as failure_rate
FROM test_results
WHERE test_name = 'specific_flaky_test'
  AND created_at > NOW() - INTERVAL '30 days'
GROUP BY hour_of_day
ORDER BY hour_of_day;

Branch-Correlated Failures

Tests that only fail on feature branches (not main) indicate:

  • The test is testing an incomplete feature
  • The branch has a genuine regression
  • Environment differences between branch and main

Tests that fail on main but not on branches indicate:

  • Integration-time failures (merge introduced a regression)
  • Environment drift between CI environments

Deployment-Correlated Failures

Track which deploy introduced a new failure pattern. If a test that was green for 30 builds suddenly fails after build #31, the change in build #31 is the prime suspect.

SELECT
  tr.build_number,
  tr.commit_sha,
  COUNT(*) FILTER (WHERE res.status = 'failed') as failures
FROM test_runs tr
JOIN test_results res ON tr.run_id = res.run_id
WHERE tr.started_at > NOW() - INTERVAL '14 days'
GROUP BY tr.build_number, tr.commit_sha
ORDER BY tr.build_number DESC;

Sharing Trend Data With Stakeholders

QA trend data is most valuable when shared beyond the QA team. A weekly digest with three numbers gives engineering managers and product teams actionable signal without requiring them to understand test infrastructure:

Weekly Test Quality Digest:

Week of May 12

📊 Pass rate: 94.2% (+1.1% from last week)
🔴 Flaky tests: 8 identified, 3 fixed
🟡 Coverage: 78% (-0.5% — 2 new files without tests)

Top failures this week:
- checkout-flow-safari: 18 failures (timing issue, fix in review)
- user-profile-update: 11 failures (API rate limit in staging)

Action items:
- [eng] Fix Safari timing issue in checkout tests
- [devops] Investigate staging rate limits

This format answers "are we getting better or worse?" without burying non-QA stakeholders in test infrastructure details.

Automating Trend Reports

For teams that want automated digests, a simple script can query your test result database and post to Slack each morning:

import psycopg2
import requests
import os

def generate_weekly_summary(db_url):
    conn = psycopg2.connect(db_url)
    cur = conn.cursor()
    
    # Pass rate over the last 7 days
    cur.execute("""
        SELECT AVG(passed::float / NULLIF(total_tests, 0)) as this_week
        FROM test_runs
        WHERE started_at > NOW() - INTERVAL '7 days'
    """)
    row = cur.fetchone()
    pass_rate = round(row[0] * 100, 1) if row[0] is not None else 0
    
    # Top flaky tests
    cur.execute("""
        WITH ranked AS (
          SELECT test_name, status,
            LAG(status) OVER (PARTITION BY test_name ORDER BY created_at) as prev
          FROM test_results
          WHERE created_at > NOW() - INTERVAL '7 days'
        )
        SELECT test_name, COUNT(*) FILTER (WHERE status != prev AND prev IS NOT NULL) as flips
        FROM ranked
        GROUP BY test_name
        HAVING COUNT(*) FILTER (WHERE status != prev AND prev IS NOT NULL) > 0
        ORDER BY flips DESC
        LIMIT 5
    """)
    flaky = cur.fetchall()
    
    cur.close()
    conn.close()
    
    msg = f"*Weekly Test Quality Report*\n"
    msg += f"Pass rate (7d): *{pass_rate}%*\n"
    if flaky:
        msg += f"\nTop flaky tests:\n"
        for name, flips in flaky:
            msg += f"• `{name}` — {flips} flips\n"
    
    return msg

def post_to_slack(message, webhook_url):
    requests.post(webhook_url, json={"text": message})

if __name__ == '__main__':
    msg = generate_weekly_summary(os.getenv('DB_URL'))
    post_to_slack(msg, os.getenv('SLACK_WEBHOOK'))
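
To run this on a schedule, a CI workflow can invoke the script weekly. A GitHub Actions sketch — the workflow name, script path, and secret names are assumptions for illustration:

```yaml
# .github/workflows/weekly-digest.yml (illustrative)
name: weekly-test-digest
on:
  schedule:
    - cron: "0 7 * * MON"  # every Monday 07:00 UTC
jobs:
  digest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install psycopg2-binary requests
      - run: python scripts/weekly_summary.py
        env:
          DB_URL: ${{ secrets.DB_URL }}
          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
```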

Using HelpMeTest for Built-In Analytics

HelpMeTest tracks test result history automatically and surfaces trends in its dashboard — without requiring you to set up a separate database, write ingestion scripts, or build Grafana panels.

The built-in analytics cover pass rate trends, test history per test case, and run-over-run comparisons. For teams that want test observability without the infrastructure overhead, HelpMeTest's Pro plan at $100/month flat includes unlimited tests with 3-month data retention — covering most of what custom trend analysis delivers out of the box.

Summary

Test trend analysis is the practice that separates teams that run tests from teams that improve software quality. The four metrics that matter:

  1. Pass rate trend — is quality improving?
  2. Per-test failure rate — which tests are broken vs. reliable?
  3. Flakiness score — which tests are lying to you?
  4. Coverage trend — is new code being tested?

Start with the flakiness analysis — a suite with 20% flaky tests gives you 20% noisy signal, and fixing that has immediate, visible impact on CI trust. Then move to coverage gap analysis for the areas where surprises happen. Trend reports follow naturally from having the data; the key is having somewhere to store test results beyond the current build.
