
Flaky Tests Root Cause Analysis: 6 Most Common Causes and Fixes

HelpMeTest

22 May 2026 — 5 min read

A flaky test is a test that passes sometimes and fails sometimes without any change to the code. It fails in CI, you re-run it, it passes. You ship. Three days later it fails again. Nobody changed anything relevant.

Flaky tests are expensive. They erode trust in your test suite: developers learn to ignore failures, which means real bugs get ignored too. They waste CI time on re-runs. They create on-call fatigue. Google has reported that almost 16% of its tests show some level of flakiness, and every flaky failure costs engineering time to investigate before anyone knows whether it points at a real bug.

The good news: most flaky tests fail for one of six reasons. Knowing the reason lets you fix or mitigate the problem systematically.

1. Timing and Asynchrony

The most common cause. A test waits for something — an animation to finish, a network request to return, a database write to complete — and uses a fixed delay or an implicit assumption about timing.

Signs of a timing bug:

Test passes locally but fails in CI (CI machines are slower)
Test fails occasionally on fast machines too
Failure message: "element not found", "timeout", "element not interactable"

Example (bad):

// Playwright
await page.click("#submit-button");
await page.waitForTimeout(2000); // "should be enough"
await expect(page.locator(".success-message")).toBeVisible();

Fix: Wait for the actual condition, not a fixed duration.

await page.click("#submit-button");
// Wait for the element that indicates completion
await expect(page.locator(".success-message")).toBeVisible({ timeout: 10000 });

For backend tests:

# Bad: time.sleep(2)
# Good: poll until condition is true
import time

def wait_for(condition_fn, timeout=10, interval=0.1):
    # A monotonic clock is unaffected by system clock adjustments during the wait
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition_fn():
            return True
        time.sleep(interval)
    raise TimeoutError(f"Condition not met within {timeout} seconds")

wait_for(lambda: order_service.get_order(order_id).status == "shipped")

2. Shared Mutable State

Tests that depend on or modify shared state — a database, a singleton, a global variable — create ordering dependencies. Test A writes data that Test B reads, or Test C corrupts data that Test D expects to be clean.

Signs of shared state:

Test passes when run alone but fails in a suite
Failures depend on which tests ran before
Different failure every time (the order of parallel tests varies)

Fix: Isolate each test's state.

Database: Use transactions that roll back:

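# pytest with SQLAlchemy (db_session is assumed to be a fixture that provides a Session)
import pytest

@pytest.fixture(autouse=True)
def db_transaction(db_session):
    db_session.begin_nested()  # open a SAVEPOINT for this test
    yield db_session
    db_session.rollback()  # discard everything the test wrote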

In-memory state: Reset before each test:

// Jest
beforeEach(() => {
  jest.clearAllMocks();
  cache.clear();
  eventBus.removeAllListeners();
});

Files: Use temp directories:

@pytest.fixture
def tmp_workspace(tmp_path):
    workspace = tmp_path / "workspace"
    workspace.mkdir()
    yield workspace
    # tmp_path is cleaned up automatically

3. Network and External Dependencies

Tests that make real network requests to external services fail when those services are slow, rate-limited, or temporarily unavailable.

Signs of network flakiness:

Failure message: "ECONNRESET", "ETIMEDOUT", "503 Service Unavailable"
Failures correlate with external service incidents
More failures during peak hours

Fix for unit/integration tests: Mock the external dependency.

// Jest
jest.mock("../src/stripe-client");
const mockStripe = require("../src/stripe-client");
mockStripe.charges.create.mockResolvedValue({ id: "ch_123", status: "succeeded" });
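
The same approach works in Python. Below is a minimal sketch using the standard library's unittest.mock. The `charge_customer` function and the `charges.create` shape of the client are hypothetical stand-ins for illustration, not a real payment SDK.

```python
from unittest.mock import Mock


def charge_customer(client, amount_cents):
    """Hypothetical code under test: charges through an injected client."""
    response = client.charges.create(amount=amount_cents, currency="usd")
    return response["status"] == "succeeded"


def test_charge_customer_succeeds():
    # The mock replaces the network call, so the test never leaves the process.
    client = Mock()
    client.charges.create.return_value = {"id": "ch_123", "status": "succeeded"}

    assert charge_customer(client, 2999) is True
    client.charges.create.assert_called_once_with(amount=2999, currency="usd")


test_charge_customer_succeeds()
```

Injecting the client as an argument keeps the test free of module patching, at the cost of requiring the production code to accept the dependency.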

Fix for end-to-end tests that depend on real services: Record and replay. Tools like VCR.py (Python) or Polly.js (Node) record real HTTP responses and replay them in subsequent test runs.

import vcr

@vcr.use_cassette("fixtures/vcr_cassettes/payment.yaml")
def test_payment_processing():
    result = payment_service.charge(amount=2999, currency="usd")
    assert result.status == "succeeded"

The first run makes a real request and records it. Subsequent runs replay the recording — fast and deterministic.
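
To make the mechanism concrete, here is a stdlib-only sketch of the record-and-replay idea. It is not how VCR.py is implemented; `replay_or_record`, the JSON cassette layout, and `fake_fetch` are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def replay_or_record(cassette_path, key, fetch):
    """Return the recorded response for `key`, calling `fetch` only if none exists."""
    path = Path(cassette_path)
    cassette = json.loads(path.read_text()) if path.exists() else {}
    if key not in cassette:
        cassette[key] = fetch()  # first run: the real (slow, flaky) call
        path.write_text(json.dumps(cassette))
    return cassette[key]  # later runs: deterministic replay


calls = []


def fake_fetch():
    # Stand-in for a real HTTP request; records how often it is invoked.
    calls.append(1)
    return {"status": "succeeded"}


with tempfile.TemporaryDirectory() as tmp:
    cassette_file = Path(tmp) / "payment.json"
    first = replay_or_record(cassette_file, "POST /charges", fake_fetch)
    second = replay_or_record(cassette_file, "POST /charges", fake_fetch)
```

Only the first call reaches `fake_fetch`; the second is served from the file. The trade-off is that recordings go stale when the external API changes, so they need to be refreshed deliberately.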

4. Test Order Dependence

Some tests pass only if another test ran first. A setup test creates a user; the target test assumes that user exists. This works while the tests happen to run in a fixed order (alphabetical, for example) but breaks when they are parallelized, shuffled, or run individually.

Signs of order dependence:

Test passes in the full suite but fails in isolation
Failure: "user not found", "resource already exists", or missing preconditions

Fix: Make each test self-contained. Every test creates everything it needs:

// Bad: assumes user created by previous test
it("updates user email", async () => {
  const updated = await userService.update("user-123", { email: "new@example.com" });
  expect(updated.email).toBe("new@example.com");
});

// Good: creates its own preconditions
it("updates user email", async () => {
  const user = await userService.create({ email: "original@example.com", name: "Test" });
  const updated = await userService.update(user.id, { email: "new@example.com" });
  expect(updated.email).toBe("new@example.com");
});

Use factories or fixtures to make test data creation easy and consistent.
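
A factory can be as small as one function. The sketch below is an assumption about what such a helper might look like: `User` and `make_user` are hypothetical names, and a real suite would likely persist the record rather than only build it in memory.

```python
import itertools
from dataclasses import dataclass

_ids = itertools.count(1)


@dataclass
class User:
    id: int
    email: str
    name: str


def make_user(**overrides):
    """Build a valid User with unique defaults; override only what the test cares about."""
    n = next(_ids)
    fields = {"id": n, "email": f"user{n}@example.com", "name": f"Test User {n}"}
    fields.update(overrides)
    return User(**fields)


a = make_user()
b = make_user(email="custom@example.com")
```

Because every call produces a distinct id and email, two tests that both call `make_user()` cannot collide on a uniqueness constraint.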

5. Concurrency and Race Conditions

Parallel tests compete for shared resources: the same port, the same database row, the same file path. When two tests write to the same location simultaneously, results are unpredictable.

Signs of concurrency bugs:

Tests pass with --workers=1 but fail with --workers=4
Failures are different every run
Error messages about locks, conflicts, or duplicate keys

Fix: Use unique resources per test.

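// Bad: all tests use port 3000
const server = app.listen(3000);

// Good: each test gets a random available port
const server = app.listen(0); // OS assigns an available port
// address() is only documented to be valid once the server is listening (run inside an async function)
await new Promise((resolve) => server.once("listening", resolve));
const port = server.address().port;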

# Bad: all tests use the same database name
DB_NAME = "test_db"

# Good: each test worker gets its own database
import os
DB_NAME = f"test_db_{os.getpid()}"

For databases, use schema-level isolation: each test runs in its own schema, which can be dropped after the test.
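
One way to sketch schema-level isolation is a context manager that creates a uniquely named schema and always drops it afterwards. The SQL below assumes PostgreSQL-style syntax, and `isolated_schema` and its `execute` parameter are hypothetical: `execute` stands for any callable that runs a single SQL statement.

```python
import uuid
from contextlib import contextmanager


@contextmanager
def isolated_schema(execute, prefix="test"):
    """Create a uniquely named schema, yield its name, and drop it afterwards.

    `execute` is any callable that runs one SQL statement (an assumption here;
    in practice it would wrap a database cursor or connection).
    """
    name = f"{prefix}_{uuid.uuid4().hex}"
    execute(f'CREATE SCHEMA "{name}"')
    try:
        yield name
    finally:
        # Runs even if the test body raises, so schemas do not accumulate.
        execute(f'DROP SCHEMA "{name}" CASCADE')


# Demonstration with a statement recorder instead of a real database.
statements = []
with isolated_schema(statements.append) as schema_a:
    pass
with isolated_schema(statements.append) as schema_b:
    pass
```

The random suffix means parallel workers never share a schema name, and the `finally` clause keeps a failing test from leaving its schema behind.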

6. Environment and Configuration Differences

A test passes in your local environment and fails in CI because the environments differ: different timezone, different locale, different available memory, different file path conventions.

Signs of environment flakiness:

Passes locally, fails in CI (consistently, not randomly)
Fails only on certain operating systems
Date/time assertions fail inconsistently

Fix: Timezone:

// Bad: depends on local timezone
expect(formatDate(new Date("2026-01-15"))).toBe("Jan 15, 2026");

// Good: explicit timezone
expect(formatDate(new Date("2026-01-15"), { timezone: "UTC" })).toBe("Jan 15, 2026");
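
The underlying problem can be reproduced in a few lines of Python. This sketch uses a hypothetical `format_date` helper and assumes the default C locale for month names; it shows the same instant landing on two different calendar days depending on the timezone used for formatting.

```python
from datetime import datetime, timedelta, timezone


def format_date(moment, tz=timezone.utc):
    """Format an aware datetime like 'Jan 15, 2026' in an explicitly chosen timezone."""
    return moment.astimezone(tz).strftime("%b %d, %Y")


midnight_utc = datetime(2026, 1, 15, tzinfo=timezone.utc)
utc_minus_5 = timezone(timedelta(hours=-5))

# Same instant, two calendar days: this is what a machine-dependent default hides.
in_utc = format_date(midnight_utc)
in_utc_minus_5 = format_date(midnight_utc, utc_minus_5)
```

Making the timezone a parameter with an explicit default means the test result no longer depends on where the machine is configured to be.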

Fix: Date assertions:

// Bad: exact timestamp comparison
expect(user.createdAt).toBe("2026-01-15T10:30:00.000Z");

// Good: approximate comparison (a precision of -3 allows a difference of up to 500 ms)
expect(new Date(user.createdAt).getTime()).toBeCloseTo(Date.now(), -3);
// Or: assert the date is recent, not exact
const ageMs = Date.now() - new Date(user.createdAt).getTime();
expect(ageMs).toBeGreaterThanOrEqual(0);
expect(ageMs).toBeLessThan(5000);

Fix: Docker for CI parity:

# .github/workflows/test.yml
jobs:
  test:
    runs-on: ubuntu-latest
    container:
      image: node:20
      env:
        TZ: UTC
        LANG: en_US.UTF-8

Diagnosing Unknown Flakiness

When you don't know which category a flaky test falls into:

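Run the test 50 times in isolation. With Playwright: npx playwright test my-test.spec.js --repeat-each=50. The --repeat-each flag belongs to Playwright, not Jest, so with Jest re-run the command in a shell loop or script instead.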
- If it always passes: it's order-dependent or state-dependent
- If it sometimes fails: it's timing, concurrency, or environment
Check the failure message and stack trace carefully — the error type usually indicates the root cause
Compare CI and local environments — look for timezone, locale, memory, network differences
Run with verbose logging — add timestamps to log entries to see where tests are waiting
Check recent changes — flakiness that appeared recently often correlates with a specific commit
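
The first step above can be scripted for any test runner. The sketch below is one possible approach, not a standard tool: `count_failures` is a hypothetical helper, and the command shown is a stand-in to be replaced with your own test invocation.

```python
import subprocess
import sys


def count_failures(command, runs=50):
    """Run `command` repeatedly and return how many runs exited non-zero."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(command, capture_output=True)
        if result.returncode != 0:
            failures += 1
    return failures


# Stand-in command that always succeeds; replace it with your test runner invocation.
always_passes = [sys.executable, "-c", "raise SystemExit(0)"]
failures = count_failures(always_passes, runs=3)
```

Zero failures across many isolated runs points toward order or shared-state problems; a nonzero count points toward timing, concurrency, or environment.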

Fixing vs. Quarantining

Not all flaky tests can be fixed immediately. For tests that are difficult to fix but still valuable, quarantine them (see the Jest quarantine pattern post for specifics): mark them as known-flaky, run them separately, and track the flakiness rate over time.

The goal is to make the main test suite reliable — always green on clean code — while not permanently hiding the coverage that flaky tests provide.
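
Tracking the flakiness rate can start very simply. The sketch below assumes you already collect pass/fail outcomes per test from CI; the function names and the 1% threshold are illustrative assumptions, not recommended values.

```python
def flakiness_rate(outcomes):
    """Fraction of runs that failed, given a list of booleans (True = passed)."""
    if not outcomes:
        return 0.0
    return outcomes.count(False) / len(outcomes)


def should_stay_quarantined(outcomes, threshold=0.01):
    """Keep a test quarantined while its failure rate exceeds the threshold."""
    return flakiness_rate(outcomes) > threshold


history = [True] * 47 + [False] * 3  # 3 failures in 50 runs of unchanged code
rate = flakiness_rate(history)
```

A rate that stays at zero over enough runs after a fix is the signal to move the test back into the main suite.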

For teams monitoring test reliability across environments, HelpMeTest provides 24/7 test monitoring with AI-powered analysis of failures, helping separate real regressions from infrastructure noise automatically.
