Flaky Tests Root Cause Analysis: 6 Most Common Causes and Fixes
A flaky test is a test that passes sometimes and fails sometimes without any change to the code. It fails in CI, you re-run it, it passes. You ship. Three days later it fails again. Nobody changed anything relevant.
Flaky tests are expensive. They erode trust in your test suite: developers learn to ignore failures, which means real bugs get ignored too. They waste CI time on re-runs. They create on-call fatigue. Google has reported that about 16% of its tests show some level of flakiness, and each flaky test costs engineers significant time to investigate.
The good news: most flaky tests fail for one of six reasons. Knowing the reason lets you fix or mitigate the problem systematically.
1. Timing and Asynchrony
The most common cause. A test waits for something — an animation to finish, a network request to return, a database write to complete — and uses a fixed delay or an implicit assumption about timing.
Signs of a timing bug:
- Test passes locally but fails in CI (CI machines are slower)
- Test fails occasionally on fast machines too
- Failure message: "element not found", "timeout", "element not interactable"
Example (bad):
// Playwright
await page.click("#submit-button");
await page.waitForTimeout(2000); // "should be enough"
await expect(page.locator(".success-message")).toBeVisible();
Fix: Wait for the actual condition, not a fixed duration.
await page.click("#submit-button");
// Wait for the element that indicates completion
await expect(page.locator(".success-message")).toBeVisible({ timeout: 10000 });
For backend tests:
# Bad: time.sleep(2)
# Good: poll until condition is true
import time

def wait_for(condition_fn, timeout=10, interval=0.1):
    # Use a monotonic clock so system clock adjustments cannot skew the deadline
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition_fn():
            return True
        time.sleep(interval)
    raise TimeoutError("Condition not met")

wait_for(lambda: order_service.get_order(order_id).status == "shipped")
2. Shared Mutable State
Tests that depend on or modify shared state — a database, a singleton, a global variable — create ordering dependencies. Test A writes data that Test B reads, or Test C corrupts data that Test D expects to be clean.
Signs of shared state:
- Test passes when run alone but fails in a suite
- Failures depend on which tests ran before
- Different failure every time (the order of parallel tests varies)
Fix: Isolate each test's state.
Database: Use transactions that roll back:
# pytest with SQLAlchemy
import pytest

@pytest.fixture(autouse=True)
def db_transaction(db_session):
    # Open a SAVEPOINT so everything the test writes can be undone
    db_session.begin_nested()
    yield db_session
    db_session.rollback()
In-memory state: Reset before each test:
// Jest
beforeEach(() => {
  jest.clearAllMocks();
  cache.clear();
  eventBus.removeAllListeners();
});
Files: Use temp directories:
import pytest

@pytest.fixture
def tmp_workspace(tmp_path):
    workspace = tmp_path / "workspace"
    workspace.mkdir()
    yield workspace
    # tmp_path is unique per test; pytest prunes old temp directories automatically
3. Network and External Dependencies
Tests that make real network requests to external services fail when those services are slow, rate-limited, or temporarily unavailable.
Signs of network flakiness:
- Failure message: "ECONNRESET", "ETIMEDOUT", "503 Service Unavailable"
- Failures correlate with external service incidents
- More failures during peak hours
Fix for unit/integration tests: Mock the external dependency.
// Jest
jest.mock("../src/stripe-client");
const mockStripe = require("../src/stripe-client");
mockStripe.charges.create.mockResolvedValue({ id: "ch_123", status: "succeeded" });
Fix for end-to-end tests that must hit real services: Record and replay. Tools like VCR (Python) or Polly.js (Node) record real HTTP responses and replay them in subsequent test runs.
import vcr

@vcr.use_cassette("fixtures/vcr_cassettes/payment.yaml")
def test_payment_processing():
    result = payment_service.charge(amount=2999, currency="usd")
    assert result.status == "succeeded"
The first run makes a real request and records it. Subsequent runs replay the recording, which keeps them fast and deterministic.
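To make the record-and-replay idea concrete, here is a toy sketch of the mechanism using only the standard library. It is not how VCR.py is implemented (real tools match on the HTTP request itself); `record_or_replay`, `cassette_path`, `key`, and `fetch` are hypothetical names chosen for illustration.

```python
import json
from pathlib import Path


def record_or_replay(cassette_path, key, fetch):
    """Return the stored response for `key`, calling `fetch()` only on a miss.

    Hypothetical helper: `fetch` stands in for the real network call and
    must return JSON-serializable data.
    """
    path = Path(cassette_path)
    cassette = json.loads(path.read_text()) if path.exists() else {}
    if key not in cassette:
        # First run: hit the real dependency and persist its response.
        cassette[key] = fetch()
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(cassette, indent=2))
    # Later runs: served from disk, no network involved.
    return cassette[key]
```

Committing the cassette file alongside the tests is what makes later runs deterministic; deleting it forces a fresh recording.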
4. Test Order Dependence
Some tests pass only if another test ran first. A setup test creates a user; the target test assumes that user exists. This works as long as the tests run in a fixed order (alphabetical, say) but breaks when they are parallelized, shuffled, or run alone.
Signs of order dependence:
- Test passes in the full suite but fails in isolation
- Failure: "user not found", "resource already exists", or missing preconditions
Fix: Make each test self-contained. Every test creates everything it needs:
// Bad: assumes user created by previous test
it("updates user email", async () => {
  const updated = await userService.update("user-123", { email: "new@example.com" });
  expect(updated.email).toBe("new@example.com");
});
// Good: creates its own preconditions
it("updates user email", async () => {
  const user = await userService.create({ email: "original@example.com", name: "Test" });
  const updated = await userService.update(user.id, { email: "new@example.com" });
  expect(updated.email).toBe("new@example.com");
});
Use factories or fixtures to make test data creation easy and consistent.
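A factory can be as small as one function that fills in unique defaults. The following is a minimal sketch, assuming an in-memory `User` record; `User`, `make_user`, and the default values are hypothetical and would be replaced by whatever your application's model and persistence layer look like.

```python
import itertools
from dataclasses import dataclass

# Monotonic counter so every generated user gets distinct defaults.
_user_ids = itertools.count(1)


@dataclass
class User:
    id: int
    email: str
    name: str


def make_user(**overrides):
    """Build a user with unique defaults; tests override only what matters."""
    n = next(_user_ids)
    fields = {"id": n, "email": f"user{n}@example.com", "name": f"Test User {n}"}
    fields.update(overrides)
    return User(**fields)
```

A test that only cares about the email would call `make_user(email="original@example.com")` and leave the other fields to the factory, so two tests never collide on a hard-coded ID like "user-123".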
5. Concurrency and Race Conditions
Parallel tests compete for shared resources: the same port, the same database row, the same file path. When two tests write to the same location simultaneously, results are unpredictable.
Signs of concurrency bugs:
- Tests pass with --workers=1 but fail with --workers=4
- Failures are different every run
- Error messages about locks, conflicts, or duplicate keys
Fix: Use unique resources per test.
// Bad: all tests use port 3000
const server = app.listen(3000);
// Good: each test gets a random available port
const server = app.listen(0, () => {
  // Read the OS-assigned port once the server is actually listening
  const port = server.address().port;
});
# Bad: all tests use the same database name
DB_NAME = "test_db"
# Good: each test worker gets its own database
import os
DB_NAME = f"test_db_{os.getpid()}"
For databases, use schema-level isolation: each test runs in its own schema, which can be dropped after the test.
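Schema-level isolation might look roughly like the sketch below. It assumes PostgreSQL-style `CREATE SCHEMA` / `DROP SCHEMA ... CASCADE` statements and takes an `execute` callable (for example a DB-API cursor's `execute` method); `isolated_schema` and its parameters are hypothetical names, not part of any library.

```python
import uuid
from contextlib import contextmanager


@contextmanager
def isolated_schema(execute, prefix="test"):
    """Create a uniquely named schema, yield its name, then drop it.

    `execute` is assumed to run a single SQL statement against the
    test database.
    """
    # A random suffix keeps parallel tests from ever sharing a schema.
    name = f"{prefix}_{uuid.uuid4().hex}"
    execute(f'CREATE SCHEMA "{name}"')
    try:
        yield name
    finally:
        # Drop the schema even if the test body raised.
        execute(f'DROP SCHEMA "{name}" CASCADE')
```

Wrapped in a pytest fixture, the `with isolated_schema(cursor.execute) as schema:` block would surround the `yield`, so cleanup runs whether the test passes or fails.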
6. Environment and Configuration Differences
A test passes in your local environment and fails in CI because the environments differ: different timezone, different locale, different available memory, different file path conventions.
Signs of environment flakiness:
- Passes locally, fails in CI (consistently, not randomly)
- Fails only on certain operating systems
- Date/time assertions fail inconsistently
Fix: Timezone:
// Bad: depends on local timezone
expect(formatDate(new Date("2026-01-15"))).toBe("Jan 15, 2026");
// Good: explicit timezone
expect(formatDate(new Date("2026-01-15"), { timezone: "UTC" })).toBe("Jan 15, 2026");
Fix: Date assertions:
// Bad: exact timestamp comparison
expect(user.createdAt).toBe("2026-01-15T10:30:00.000Z");
// Good: approximate comparison (a precision of -3 allows a difference of up to 500 ms)
expect(new Date(user.createdAt).getTime()).toBeCloseTo(Date.now(), -3);
// Or: assert the date is recent, not exact (the 5-second window is illustrative)
const ageMs = Date.now() - new Date(user.createdAt).getTime();
expect(Math.abs(ageMs)).toBeLessThan(5000);
Fix: Docker for CI parity:
# .github/workflows/test.yml
jobs:
  test:
    runs-on: ubuntu-latest
    container:
      image: node:20
      env:
        TZ: UTC
        LANG: en_US.UTF-8
Diagnosing Unknown Flakiness
When you don't know which category a flaky test falls into:
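- Run the test 50 times in isolation (--repeat-each is a Playwright flag; Jest has no built-in equivalent, so wrap the command in a shell loop there):
  npx playwright test my-test.spec.js --repeat-each=50
  - If it always passes: it's order-dependent or state-dependent
  - If it sometimes fails: it's timing, concurrency, or environment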
- Check the failure message and stack trace carefully — the error type usually indicates the root cause
- Compare CI and local environments — look for timezone, locale, memory, network differences
- Run with verbose logging — add timestamps to log entries to see where tests are waiting
- Check recent changes — flakiness that appeared recently often correlates with a specific commit
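For runners without a repeat flag, the repeat-in-isolation step can be scripted. This is a minimal sketch, assuming the test can be launched as a standalone command that exits non-zero on failure; `failure_rate` is a hypothetical helper name.

```python
import subprocess


def failure_rate(command, runs=50):
    """Run `command` repeatedly and return (failures, runs).

    `command` is an argument list such as ["pytest", "path/to/test.py"];
    a non-zero exit code counts as a failure.
    """
    failures = 0
    for _ in range(runs):
        result = subprocess.run(command, capture_output=True)
        if result.returncode != 0:
            failures += 1
    return failures, runs
```

Zero failures across 50 isolated runs points toward order or state dependence; a handful of failures points toward timing, concurrency, or environment.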
Fixing vs. Quarantining
Not all flaky tests can be fixed immediately. For tests that are difficult to fix but still valuable, quarantine them (see the Jest quarantine pattern post for specifics): mark them as known-flaky, run them separately, and track the flakiness rate over time.
The goal is to make the main test suite reliable — always green on clean code — while not permanently hiding the coverage that flaky tests provide.
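Tracking the flakiness rate can start from nothing more than a log of pass/fail results on unchanged code. The sketch below is one possible shape; the function names and the 5% threshold are assumptions for illustration, not a recommended policy.

```python
from collections import defaultdict


def flakiness_rates(history):
    """Map each test name to its failure rate.

    `history` is an iterable of (test_name, passed) pairs collected
    from runs against unchanged code.
    """
    totals = defaultdict(lambda: [0, 0])  # name -> [failures, runs]
    for name, passed in history:
        totals[name][1] += 1
        if not passed:
            totals[name][0] += 1
    return {name: failures / runs for name, (failures, runs) in totals.items()}


def quarantine_candidates(history, threshold=0.05):
    """Return tests that fail intermittently at or above `threshold`.

    Tests that always fail are excluded: that is a real failure,
    not flakiness.
    """
    rates = flakiness_rates(history)
    return sorted(name for name, rate in rates.items() if threshold <= rate < 1)
```

Re-running the same calculation after a fix shows whether the rate actually dropped, which is the signal for moving a test back out of quarantine.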
For teams monitoring test reliability across environments, HelpMeTest provides 24/7 test monitoring with AI-powered analysis of failures, helping separate real regressions from infrastructure noise automatically.