Flaky Tests: Root Causes and How to Fix Them
Flaky tests fail intermittently — not because the application is broken, but because the test itself has hidden assumptions about timing, state, or environment. The fix is almost never "just retry it." Understanding the root cause and fixing it permanently is the only path to a reliable test suite.
Key Takeaways
Async timing issues cause most flaky E2E tests. The test proceeds before the application is ready. The fix is explicit waits for application state — not sleep timers.
Shared state is the second biggest cause. Tests that modify shared data and run in parallel will collide. The fix is test isolation: each test creates and owns its own data.
Retrying flaky tests masks the problem. A test that passes on the third try is still flaky — you just shipped a slower CI pipeline. Fix the root cause; use retries only as a temporary bandage.
Network dependency failures are a category on their own. Tests that depend on real external services will fail when those services are slow or unavailable. Use mocks or contract tests instead.
Quarantine flaky tests until they are fixed. A test that sometimes fails is worse than no test — it trains your team to ignore red CI builds.
A flaky test is a test that fails intermittently — not because the application has a bug, but because the test has hidden dependencies on timing, environment, or state that vary between runs. It passes on your machine, fails in CI. It passes when run alone, fails in parallel. It passes Monday, fails Tuesday, passes again Wednesday.
Flaky tests are one of the most corrosive problems in software engineering. They slow down CI, generate false alarms, and gradually train teams to ignore test failures — which means real bugs slip through.
This guide covers every major root cause of test flakiness, how to diagnose it, and how to fix it permanently.
Why Flaky Tests Are a Serious Problem
Before diving into causes and fixes, it is worth being explicit about why flaky tests matter.
They erode trust. When developers learn that a failing test "probably just needs another run," they stop treating test failures as signals. This is exactly backwards from what tests are for.
They slow down CI. Each flaky failure requires investigation time. Some teams institute automatic retries, which makes the problem invisible while making CI slower. A suite with 10% test flakiness may be running 30% more test executions than necessary.
They mask real failures. When CI is always orange because of known flaky tests, real failures blend in. The failing test might be the one that actually found a bug — but nobody is looking anymore.
They compound. As teams learn that test failures often require retries, they add retries everywhere. The flaky tests continue failing, the retries multiply, and the underlying problems are never addressed.
The right response to a flaky test is: quarantine it immediately (to preserve signal), investigate the root cause, and fix it within the sprint.
Root Cause 1: Async Timing Issues
This is the most common cause of flaky E2E tests. The test proceeds with the next action before the application has finished the previous one.
The Pattern
// Flaky: clicks button, immediately checks result before API call completes
cy.get('[data-testid="submit"]').click()
cy.get('.success-message').should('be.visible') // Fails intermittently
The test clicks Submit, which triggers an API call. If the API responds quickly, .success-message appears before the assertion runs. If the API is slow (maybe the CI machine was under load), the assertion runs before the success message appears. The test is timing-dependent.
The Fix: Wait for Application State, Not Time
// Bad fix: arbitrary sleep
cy.get('[data-testid="submit"]').click()
cy.wait(2000) // Fails on slow CI, wastes time on fast machines
cy.get('.success-message').should('be.visible')
// Good fix: wait for the actual state change
cy.get('[data-testid="submit"]').click()
cy.get('.success-message').should('be.visible') // Cypress retries until visible
// Better fix: wait for the network request that drives the state change
cy.intercept('POST', '/api/orders').as('createOrder')
cy.get('[data-testid="submit"]').click()
cy.wait('@createOrder') // Wait for the specific API call to complete
cy.get('.success-message').should('be.visible')
In Playwright:
// Wait for the response that drives the state change
const [response] = await Promise.all([
page.waitForResponse('**/api/orders'),
page.click('[data-testid="submit"]'),
]);
expect(response.status()).toBe(201);
await expect(page.locator('.success-message')).toBeVisible();
The rule: never use sleep() or wait(N ms) to solve timing issues. Instead, wait for a specific application condition that signals the operation is complete — a network response, a DOM state change, a URL change, or a specific element appearing.
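This rule generalizes beyond Cypress and Playwright, which retry assertions for you. In plain Node test code there is no built-in retry, so a small polling helper is the equivalent. This is a minimal sketch (the name waitForCondition is hypothetical, not from any library): poll an application-state predicate until it holds or a deadline passes, instead of sleeping a fixed time.

```typescript
// Hypothetical helper: poll an application-state predicate until it
// holds or a deadline passes -- never a fixed sleep.
async function waitForCondition(
  condition: () => boolean | Promise<boolean>,
  { timeoutMs = 5000, intervalMs = 50 } = {}
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await condition()) return; // state reached: stop immediately
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Condition not met within ${timeoutMs}ms`);
}

// Usage: wait for a job's status flag instead of sleeping 2 seconds
// await waitForCondition(() => job.status === 'done', { timeoutMs: 10000 });
```

On fast machines it finishes as soon as the state changes; on slow CI it keeps waiting up to the timeout, which is exactly the behavior a fixed sleep cannot give you.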
Common Timing Patterns and Their Fixes
| Timing Problem | Flaky Pattern | Fix |
|---|---|---|
| API call not complete | Assert result immediately after click | Wait for network response |
| Animation not done | Click element mid-animation | Wait for animation class to be removed |
| Page not loaded | Interact before page ready | Wait for specific content or network idle |
| WebSocket message | Assert before message arrives | Intercept and wait for WebSocket event |
| Debounced input | Assert before debounce fires | Wait for debounce delay or resulting API call |
Root Cause 2: Shared Test State
Tests that share state and run in parallel will collide unpredictably.
The Pattern
// Test 1: Creates user with email "test@example.com"
// Test 2: Also creates user with email "test@example.com"
// Result: One of them gets an "email already exists" error — depending on which runs first
Or more subtly:
// Test 1: Navigates to user list, counts 5 users
// Test 2: Creates a user in parallel
// Test 1 re-runs assertion: now 6 users — FAILS
Shared state failures are particularly insidious because they only appear when tests run in parallel, and the failures are not reproducible when you run the failing test in isolation.
The Fix: Test Isolation
Each test should create its own data and clean up after itself. No test should depend on the state left by another test.
// Instead of a shared test user that all tests assume exists:
beforeEach(async () => {
// Create a unique user for this test run
testUser = await db.createUser({
email: `test-${Date.now()}-${Math.random()}@example.com`,
name: 'Test User',
});
});
afterEach(async () => {
// Clean up
await db.deleteUser(testUser.id);
});
Or use database transactions that are rolled back after each test:
// Jest with transaction isolation
beforeEach(() => db.transaction.begin());
afterEach(() => db.transaction.rollback());
For E2E tests, avoid shared test accounts. Instead, create test users via API in beforeEach, run the test, and delete them in afterEach.
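One caveat on the uniqueness scheme above: a Date.now() plus Math.random() suffix can, in principle, still collide when many parallel workers create users in the same millisecond. Node's built-in crypto.randomUUID() is a simpler guarantee. A sketch, with a hypothetical uniqueEmail helper:

```typescript
import { randomUUID } from 'node:crypto';

// Hypothetical helper: every call yields a distinct address, safe even
// when many test workers create users at the same instant.
function uniqueEmail(prefix = 'test'): string {
  return `${prefix}-${randomUUID()}@example.com`;
}

// e.g. in beforeEach:
// testUser = await db.createUser({ email: uniqueEmail(), name: 'Test User' });
```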
Parallelism-Specific Patterns
| Shared State Problem | Fix |
|---|---|
| Shared test database | Use per-test database transactions or isolated schemas |
| Shared test user account | Create unique users per test run |
| Shared browser session | Use separate browser contexts per test |
| Shared temp files | Use unique temp directories per test |
| Global counters/IDs | Use random or UUID-based IDs |
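The unique-temp-directory row maps directly onto Node's fs.mkdtemp, which appends a random suffix atomically. A sketch, with hypothetical helper names:

```typescript
import { mkdtempSync, rmSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Each test gets its own directory; mkdtemp appends a random suffix,
// so parallel workers never collide on the same path.
function createTestDir(): string {
  return mkdtempSync(join(tmpdir(), 'mytest-'));
}

// In afterEach: remove the whole tree so nothing leaks between runs.
// force: true makes this a no-op if the directory is already gone.
function removeTestDir(dir: string): void {
  rmSync(dir, { recursive: true, force: true });
}
```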
Root Cause 3: Order Dependency
Tests that must run in a specific order to pass indicate design problems.
The Pattern
// Test A: Creates a product
// Test B: Edits the product created by Test A
// If Test A fails, Test B fails too. If they run in a different order, both fail.
Test suites designed this way are extremely fragile. Any parallel execution or shuffled ordering breaks everything.
The Fix: Self-Contained Tests
Every test must be independently executable. This means each test sets up its own preconditions:
describe('Product editing', () => {
let product;
beforeEach(async () => {
// Test B no longer depends on Test A — it creates its own product
product = await api.createProduct({ name: 'Test Widget', price: 9.99 });
});
it('can edit product name', async () => {
await page.goto(`/products/${product.id}/edit`);
await page.fill('[name="product-name"]', 'Updated Widget');
await page.click('[data-testid="save"]');
await expect(page.locator('.product-name')).toHaveText('Updated Widget');
});
});
A useful rule: any test in your suite should be runnable on its own with --grep "test name" and produce a meaningful result.
Root Cause 4: External Service Dependencies
Tests that call real external APIs, databases, or third-party services will fail whenever those services are slow, down, or rate-limiting.
The Pattern
// This test calls the real Stripe API — fails when Stripe is slow or test keys are rate-limited
it('processes a payment', async () => {
await page.fill('[name="card-number"]', '4242424242424242')
await page.click('[data-testid="pay"]')
await expect(page.locator('.payment-success')).toBeVisible() // Fails if Stripe responds slowly
})
The Fix: Mock External Dependencies
For unit and integration tests, mock external services with libraries like msw (Mock Service Worker) or test doubles:
// Mock Stripe API in tests
import { setupServer } from 'msw/node';
import { rest } from 'msw';
const server = setupServer(
rest.post('https://api.stripe.com/v1/charges', (req, res, ctx) => {
return res(ctx.json({ id: 'ch_test123', status: 'succeeded' }));
})
);
beforeAll(() => server.listen());
afterEach(() => server.resetHandlers());
afterAll(() => server.close());
For E2E tests, intercept at the network level:
cy.intercept('POST', '**/stripe/charge', { fixture: 'stripe-success.json' })
For critical payment paths where you must test the real integration, use a separate test suite that runs less frequently (nightly, not on every PR) and is not in the critical CI path.
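As an illustration, assuming GitHub Actions and a hypothetical @real-integration tag on those tests, such a suite could be scheduled nightly rather than run on every PR:

```yaml
# .github/workflows/nightly-integration.yml (hypothetical workflow)
name: nightly-integration
on:
  schedule:
    - cron: '0 3 * * *'   # every night at 03:00 UTC, not on every PR
jobs:
  real-integration-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npx playwright test --grep "@real-integration"
```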
Root Cause 5: Resource Contention
Tests that compete for the same system resources — ports, files, database connections, or CPU — will fail unpredictably when resources are exhausted.
The Pattern
// Two tests both try to start a server on port 3000
// Whichever starts second fails with "port in use"
Or more subtly: tests that work fine in isolation but fail on resource-constrained CI machines (fewer CPUs, less memory, slower disks).
The Fix: Isolate Resources
Use dynamic port allocation:
// Find a free port instead of hardcoding
import { createServer } from 'net';
import type { AddressInfo } from 'net';
async function getFreePort(): Promise<number> {
return new Promise(resolve => {
const server = createServer();
server.listen(0, () => {
const port = (server.address() as AddressInfo).port;
server.close(() => resolve(port));
});
});
}
const port = await getFreePort();
For test databases, use separate schemas per test worker:
// playwright.config.ts
const workerIndex = process.env.TEST_WORKER_INDEX ?? '0';
process.env.DATABASE_URL = `postgresql://localhost/testdb_worker_${workerIndex}`;
Root Cause 6: Environment-Specific Behavior
Tests that pass in one environment and fail in another are testing environmental assumptions instead of application behavior.
Common Environment Issues
| Problem | Symptom | Fix |
|---|---|---|
| Timezone differences | Date assertions fail in different timezones | Use UTC everywhere in tests; mock Date.now() |
| Locale/language | Text assertions fail with different system locales | Set explicit locale in test config |
| Screen resolution | Element positioning assertions | Use viewport-agnostic selectors; set consistent viewport size |
| File path separators | Path assertions fail on Windows vs. Linux | Use path.join() always; normalize in assertions |
| Floating point | Numeric assertions fail cross-platform | Use toBeCloseTo() not toBe() for floats |
The Fix: Control the Environment
// playwright.config.ts — enforce consistent environment
export default defineConfig({
use: {
viewport: { width: 1280, height: 720 },
locale: 'en-US',
timezoneId: 'America/New_York',
},
});
For date/time flakiness, mock the system clock:
// Freeze time at a known point
await page.addInitScript(() => {
const fakeDate = new Date('2025-01-15T10:00:00Z');
Date.now = () => fakeDate.getTime();
globalThis.Date = class extends Date {
constructor(...args: any[]) {
super(args.length ? args[0] : fakeDate.getTime());
}
};
});
Root Cause 7: Test Data Pollution
Tests that rely on a specific dataset that drifts over time will fail when the data changes.
The Pattern
// This test assumes there are exactly 10 products in the database
it('shows all products', () => {
cy.get('.product-card').should('have.length', 10)
})
This test fails whenever a product is added or removed from the test database — which happens over time as the application evolves.
The Fix: Test Relative State, Not Absolute Values
// Better: create the data you need, test what you expect relative to it
it('shows all products', () => {
const productNames = ['Widget A', 'Widget B', 'Widget C'];
// Set up: create exactly these products
// Run: visit the page
cy.get('.product-card').should('have.length', productNames.length)
productNames.forEach(name => {
cy.contains('.product-card', name).should('exist')
})
})
Or even better — test that the products you care about are present, not that the total count is exactly right.
Root Cause 8: Test Infrastructure Issues
Sometimes tests are flaky not because of the test code or application, but because of the test infrastructure itself.
Common Infrastructure Issues
- CI machine overload: Tests time out because the CI runner is CPU-throttled
- Flaky Docker containers: Containers that do not start cleanly or take variable time to be ready
- Inconsistent network: CI environments with variable network latency that exceeds timeout values
- Memory pressure: Tests that pass on 8GB machines fail on 4GB CI runners
- Browser version mismatches: Tests that pass with Chrome 120 fail with Chrome 119 (or vice versa)
The Fix: Stabilize Infrastructure
- Set explicit timeouts that are generous for CI: timeout: 60000 instead of 10000
- Use Docker healthchecks to verify dependencies are actually ready, not just started
- Pin browser versions in your test configuration
- Monitor CI machine resource usage — if tests consistently time out, the machine needs more resources
# Wait for the database to be ready, not just started
services:
postgres:
image: postgres:16
options: >-
--health-cmd pg_isready
--health-interval 10s
--health-timeout 5s
--health-retries 5
Diagnosing Flaky Tests
When a test starts failing intermittently, use this process:
Step 1: Reproduce Locally
Run the test in a loop to reproduce the failure:
# Run the test 20 times and collect results
for i in {1..20}; do npx playwright test --grep "test name" && echo "PASS" || echo "FAIL"; done
If you cannot reproduce locally, the flakiness is likely environment-specific — run the same loop in CI with extra logging enabled.
Step 2: Identify the Failure Mode
Collect the failure messages across multiple runs. Are they:
- Always the same error (suggests timing — one specific operation is too slow)
- Different errors each time (suggests shared state — different collision each time)
- Only in CI (suggests environment or resource differences)
- Only when tests run in parallel (suggests shared state or resource contention)
Step 3: Add Diagnostic Logging
// Add detailed logging around the flaky section
test('flaky checkout test', async ({ page }) => {
await page.goto('/cart');
// Log page state before the flaky action
console.log('Page URL before click:', page.url());
console.log('Submit button state:', await page.locator('[data-testid="submit"]').isEnabled());
await page.click('[data-testid="submit"]');
// Log what happened after the click
await page.waitForTimeout(100); // Diagnostics only: a fixed pause is fine for logging, never for assertions
console.log('Page URL after click:', page.url());
console.log('Error messages:', await page.locator('.error-message').allTextContents());
});
Step 4: Quarantine While You Fix
Add the test to a quarantine list and exclude it from the main CI gate while you investigate:
// playwright.config.ts — exclude known flaky tests from blocking CI
projects: [
{
name: 'stable-tests',
grepInvert: /\[FLAKY\]/, // run everything except tests tagged [FLAKY]
retries: 0,
},
{
name: 'flaky-tests',
grep: /\[FLAKY\]/,
retries: 3, // Temporary retry while investigating
},
]
Tag the test clearly:
test('[FLAKY] checkout flow - timing issue under investigation', async ({ page }) => {
// ...
});
The Retry Trap
It is tempting to fix flaky tests by adding retries. Most test frameworks support this:
// Playwright
test.describe.configure({ retries: 3 });
// Cypress
// cypress.config.js
{ retries: { runMode: 2, openMode: 0 } }
Retries have a legitimate place — occasional infrastructure blips, network hiccups, and timing edge cases that are genuinely rare. But retries are not a fix for systematic flakiness. A test that fails 20% of the time and retries 3 times will pass on CI most of the time — but the underlying problem is still there. The test is still flaky. Your CI is just slower, and you are shipping code with a test suite that does not actually provide the coverage you think it does.
Use retries as a temporary quarantine measure, not as a permanent solution.
Building a Reliable Test Suite
Preventing flakiness is cheaper than fixing it. These practices will dramatically reduce flakiness in new test suites:
- Always wait for application state, never for time. Ban sleep() and wait(N ms) from your test codebase.
- Isolate test data. Every test creates its own data. No test assumes data created by another test exists.
- Use stable selectors. data-testid attributes are your friends. Invest the five minutes to add them to key elements.
- Mock external dependencies. Real API calls in unit and integration tests are a smell.
- Run tests in random order. --randomize-order will surface order dependencies immediately.
- Monitor flakiness rates. Track which tests have the highest retry rates in CI. The top 10 are your priority.
- Fix flakiness when found. Quarantine immediately; fix within the sprint. Flaky tests that stay in quarantine for months become invisible technical debt.
Frequently Asked Questions
What is a flaky test?
A flaky test is an automated test that sometimes passes and sometimes fails when run against the same code, without any changes to the application or test. Flakiness indicates hidden dependencies on timing, state, environment, or external services.
How do you find flaky tests?
Track CI failure history over time and calculate the failure rate per test. Tests with failure rates above 1-2% on unchanged code are likely flaky. Some CI platforms (CircleCI, GitHub Actions, Buildkite) have built-in flakiness detection.
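If your CI platform lacks built-in detection, the calculation itself is simple. A sketch, with a hypothetical run-record shape rather than any platform's real API:

```typescript
interface RunRecord {
  testName: string;
  passed: boolean;
}

// Aggregate CI history into a failure rate per test; on unchanged code,
// anything above roughly 1-2% is a flakiness suspect.
function failureRates(runs: RunRecord[]): Map<string, number> {
  const totals = new Map<string, { runs: number; failures: number }>();
  for (const r of runs) {
    const t = totals.get(r.testName) ?? { runs: 0, failures: 0 };
    t.runs += 1;
    if (!r.passed) t.failures += 1;
    totals.set(r.testName, t);
  }
  return new Map([...totals].map(([name, t]) => [name, t.failures / t.runs]));
}
```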
Should I use test retries to fix flaky tests?
Retries hide flakiness but do not fix it. Use retries as a temporary measure while investigating and fixing the root cause. Permanent retries slow down your CI and erode confidence in your test suite.
Why do tests pass locally but fail in CI?
Common causes: different environment variables or configuration, different system resources (slower CPU/memory), different browser versions, timing differences from CI machine load, and parallelism that exposes shared state issues. Start by comparing the exact versions and environment variables between local and CI.
What percentage of tests should be flaky?
The target is zero. In practice, complex E2E suites on real applications often have 1-3% flakiness. Above 5% flakiness, teams typically start losing trust in the test suite entirely. Any flakiness is a signal that something needs fixing.
How does HelpMeTest help with flaky tests?
HelpMeTest uses AI-powered test execution that interprets natural language instructions against the current UI state at runtime. This eliminates the most common source of E2E test flakiness — brittle selectors that break when UI changes. Tests that describe intent ("click the checkout button") are inherently more resilient than tests that reference specific DOM attributes. The platform also includes self-healing for cases where elements have moved but the intent is still clear.
Summary
Flaky tests are a symptom, not a root cause. The most common underlying causes are:
- Async timing — test proceeds before application is ready. Fix with explicit state-based waits.
- Shared state — tests collide when run in parallel. Fix with test isolation.
- Order dependency — tests assume other tests have already run. Fix with self-contained setup.
- External services — real API calls fail intermittently. Fix with mocks.
- Resource contention — tests compete for ports, files, or CPU. Fix with dynamic resource allocation.
- Environment differences — behavior varies by timezone, locale, or screen size. Fix by controlling the environment.
- Test data drift — assertions about absolute counts or specific records that change. Fix with relative assertions and isolated data.
- Infrastructure issues — CI machines under load, containers not ready. Fix by stabilizing infrastructure.
The process: quarantine immediately, diagnose systematically, fix the root cause, verify with repeated runs, remove the quarantine tag. Every test in your suite should be a reliable signal. When it is, you can trust your CI.