How to Identify and Fix Flaky Tests: A Practical Guide

Flaky tests pass on some runs and fail on others with no code changes in between. They are among the most damaging problems a test suite can have: they erode trust in every result, slow down CI pipelines, and hide real failures behind noise.

If your team regularly reruns tests hoping they'll pass, or dismisses test failures as "probably flaky," you have a flakiness problem that needs systematic attention.

Why Flaky Tests Are Dangerous

When tests flake, developers learn to ignore failures. "It's probably just flaky" becomes the default explanation for any red build. But sometimes tests fail because of a real bug. When your team treats all failures skeptically, real bugs slip through.

Flaky tests also consume disproportionate time. Every flaky test failure triggers an investigation, a rerun, and often a manual review. Multiply this across a team and you're losing hours per week to noise.

Finding Flaky Tests

Detection Strategy 1: Run Tests Multiple Times

The most direct approach — run your test suite multiple times on the same code and look for tests that fail inconsistently:

# Run test suite 10 times, collect results
for i in $(seq 1 10); do
  pytest tests/ --tb=no -q 2>&1 | grep -E "FAILED|PASSED|ERROR" >> flaky_results.txt
done

# Find tests with inconsistent results
sort flaky_results.txt | uniq -c | sort -rn | head -20
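The shell loop above only counts failure lines. A small script can make the classification explicit by comparing each test's outcomes across runs. This is a minimal sketch: `find_flaky` and the run format (a dict of test ID to outcome per run) are assumptions for illustration, not part of pytest.

```python
from collections import defaultdict


def find_flaky(runs):
    """Return the sorted IDs of tests whose outcome differed between runs.

    `runs` is a list of dicts, one per run, mapping test ID -> outcome.
    """
    outcomes = defaultdict(set)
    for run in runs:
        for test_id, outcome in run.items():
            outcomes[test_id].add(outcome)
    # More than one distinct outcome on identical code means the test is flaky.
    return sorted(t for t, seen in outcomes.items() if len(seen) > 1)


runs = [
    {"test_login": "passed", "test_email": "passed", "test_export": "failed"},
    {"test_login": "passed", "test_email": "failed", "test_export": "failed"},
    {"test_login": "passed", "test_email": "passed", "test_export": "failed"},
]
# test_email is flaky; test_export fails every time, so it is broken, not flaky.
flaky = find_flaky(runs)
```

Separating "fails sometimes" from "fails always" matters because the two call for different fixes.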

Detection Strategy 2: CI History Analysis

Most CI systems record test results per job. Analyze historical data to find tests that fail intermittently: those with a pass rate below 100% that are not failing consistently:

# Query your CI API (GitHub Actions example)
# get_recent_test_runs() and all_tests are placeholders: implement them
# against whatever API your CI system exposes.
import requests  # used by get_recent_test_runs (not shown)

def get_test_failure_rate(test_name, days=30):
    runs = get_recent_test_runs(test_name, days)
    failures = sum(1 for r in runs if r['status'] == 'failed')
    return failures / len(runs) if runs else 0

# Tests failing between 5% and 90% of the time are likely flaky
# Tests failing >90% are consistently broken (different problem)
flaky_tests = [
    test for test in all_tests
    if 0.05 < get_test_failure_rate(test) < 0.90
]
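If your CI stores JUnit XML reports (pytest can write them with `--junitxml`), the failure rate can be computed directly from those files. The following is a sketch under assumptions: `failure_rates` is a hypothetical helper, the reports are passed in as strings, and a test counts as failed when its `<testcase>` element has a `<failure>` or `<error>` child.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def failure_rates(junit_reports):
    """Compute per-test failure rates from a list of JUnit XML strings."""
    runs, failures = Counter(), Counter()
    for report in junit_reports:
        root = ET.fromstring(report)
        for case in root.iter("testcase"):
            test_id = f"{case.get('classname')}::{case.get('name')}"
            runs[test_id] += 1
            # A <failure> or <error> child marks a failed test case.
            if case.find("failure") is not None or case.find("error") is not None:
                failures[test_id] += 1
    return {t: failures[t] / runs[t] for t in runs}


passing = '<testsuite><testcase classname="tests.test_mail" name="test_send"/></testsuite>'
failing = (
    '<testsuite><testcase classname="tests.test_mail" name="test_send">'
    '<failure message="timeout"/></testcase></testsuite>'
)
# One failure in four runs of the same test.
rates = failure_rates([passing, passing, passing, failing])
```

In practice you would read the report files from CI artifacts rather than inline strings.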

Detection Strategy 3: Run in Different Orders

Some flaky tests depend on other tests running first (order dependency). Test runners usually run tests in the same order, hiding this problem:

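# Pytest: randomize test order (pytest-randomly shuffles by default once installed)
pip install pytest-randomly
pytest tests/
# Reproduce a failing order with the seed printed in the run header
pytest tests/ --randomly-seed=1234

# Jest: randomize order (shuffles tests within each file)
jest --randomize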

If tests fail with random ordering but pass normally, you have order-dependent tests.

Detection Strategy 4: Run in Parallel

Race conditions surface when tests run concurrently:

# Run with high parallelism to expose races
pytest tests/ -n 8  # pytest-xdist

Root Causes and Fixes

1. Timing Dependencies (Most Common)

Tests that use sleep() or wait for a fixed duration are inherently flaky. If the system is slower than expected — under load in CI, on a different machine — the sleep isn't long enough.

Broken:

def test_email_sent():
    trigger_signup(email='user@example.com')
    time.sleep(2)  # hope the email service processed it
    assert email_service.sent_count() == 1

Fixed:

import time

def wait_for(condition, timeout=10, interval=0.1):
    # Poll until condition() is true or the timeout expires
    deadline = time.monotonic() + timeout  # unaffected by system clock changes
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False

def test_email_sent():
    trigger_signup(email='user@example.com')
    # Wait until the condition is true, up to a timeout
    assert wait_for(
        lambda: email_service.sent_count() == 1,
        timeout=10,
        interval=0.1
    ), "Email not sent within 10 seconds"
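Polling is the right tool when the test can only observe state. When the code under test can signal completion, waiting on that signal avoids the polling interval altogether. A minimal sketch using `threading.Event` from the standard library; `FakeEmailService` is a hypothetical stand-in, not the service from the example above.

```python
import threading


class FakeEmailService:
    """Stand-in for an asynchronous email sender (hypothetical)."""

    def __init__(self):
        self.sent_count = 0
        self.sent = threading.Event()

    def send_async(self, address):
        # Deliver on a background thread, as a real worker might.
        threading.Thread(target=self._deliver, args=(address,)).start()

    def _deliver(self, address):
        self.sent_count += 1
        self.sent.set()  # signal waiters that delivery finished


def test_email_sent():
    service = FakeEmailService()
    service.send_async("user@example.com")
    # Blocks until the event is set; returns False if the timeout expires.
    assert service.sent.wait(timeout=10), "Email not sent within 10 seconds"
    assert service.sent_count == 1


test_email_sent()
```

The timeout is still needed as an upper bound, but the test returns as soon as the work completes.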

2. Shared State Between Tests

Tests that modify shared state and don't clean up affect tests that run after them.

Broken:

# Tests share the same database without cleanup
def test_user_count():
    create_user('alice@example.com')
    assert User.count() == 1  # fails if other tests added users first

Fixed:

@pytest.fixture(autouse=True)
def clean_database():
    yield
    User.delete_all()
    Order.delete_all()

def test_user_count():
    create_user('alice@example.com')
    assert User.count() == 1  # always starts from 0

Or use transactions that roll back after each test:

@pytest.fixture(autouse=True)
def database_transaction(db):
    with db.transaction() as txn:
        yield txn
        txn.rollback()
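The rollback approach can be tried end to end with the standard library's `sqlite3` module. This is a sketch under stated assumptions: the `rolled_back` helper and the `users` table are hypothetical, and it relies on `sqlite3`'s default behavior of implicitly opening a transaction before an INSERT.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def rolled_back(conn):
    """Yield the connection, then undo every uncommitted change."""
    try:
        yield conn
    finally:
        conn.rollback()  # runs even if the test body raises


def user_count(conn):
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT)")
conn.commit()  # the schema is committed and survives the rollback

with rolled_back(conn) as db:
    db.execute("INSERT INTO users VALUES ('alice@example.com')")
    inside = user_count(db)  # the insert is visible inside the transaction

after = user_count(conn)  # the insert was rolled back
```

Rolling back is usually faster than deleting rows table by table, and it cannot miss a table that a later test starts using.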

3. Race Conditions in Async Code

Async code has timing-dependent behavior that's easy to get wrong in tests:

Broken:

test('processes message from queue', async () => {
  await queue.publish({ type: 'order.created', orderId: '123' });
  // Message is processed asynchronously — this may run before processing
  expect(await Order.findById('123')).toHaveProperty('status', 'confirmed');
});

Fixed:

test('processes message from queue', async () => {
  await queue.publish({ type: 'order.created', orderId: '123' });
  
  // Poll until the state changes
  const order = await waitFor(
    async () => {
      const o = await Order.findById('123');
      if (o?.status !== 'confirmed') throw new Error('Not yet');
      return o;
    },
    { timeout: 5000 }
  );
  
  expect(order.status).toBe('confirmed');
});

4. External Service Dependencies

Tests that call real external services fail when those services are down or rate-limited.

Broken:

def test_geocoding():
    result = google_maps.geocode('1600 Amphitheatre Pkwy')
    assert result['lat'] == pytest.approx(37.42, abs=0.01)

Fixed:

from unittest.mock import patch

import pytest

# Patch the HTTP call where the app looks it up, so no network request is made
@patch('myapp.maps.requests.get')
def test_geocoding(mock_get):
    mock_get.return_value.json.return_value = {
        'results': [{'geometry': {'location': {'lat': 37.42, 'lng': -122.08}}}]
    }
    result = google_maps.geocode('1600 Amphitheatre Pkwy')
    assert result['lat'] == pytest.approx(37.42, abs=0.01)

5. Date and Time Dependencies

Tests whose outcome depends on new Date() or datetime.now() are flaky by construction: they behave differently depending on when they run.

Broken:

test('sends birthday email on birthday', () => {
  const user = { name: 'Alice', birthday: new Date() };  // today's date
  sendBirthdayEmails([user]);
  expect(emailService.sentCount()).toBe(1);
  // Fails if run at midnight during date rollover
});

Fixed:

test('sends birthday email on birthday', () => {
  const today = new Date('2024-03-15');
  jest.useFakeTimers();  // setSystemTime only works while fake timers are active
  jest.setSystemTime(today);
  
  const user = { name: 'Alice', birthday: new Date('2024-03-15') };
  sendBirthdayEmails([user], { now: today });
  expect(emailService.sentCount()).toBe(1);
  
  jest.useRealTimers();
});

Pass time as a parameter or use a clock abstraction instead of calling system time directly.
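In Python, the same idea can be sketched by making the current date an optional argument. The `is_birthday` function below is a hypothetical example, not code from the birthday mailer above; it ignores edge cases such as February 29.

```python
from datetime import date


def is_birthday(birthday, today=None):
    """Return True if `today` falls on the anniversary of `birthday`."""
    today = today or date.today()  # production callers can omit `today`
    return (birthday.month, birthday.day) == (today.month, today.day)


def test_is_birthday():
    born = date(1990, 3, 15)
    # The test pins the date, so the result never depends on when it runs.
    assert is_birthday(born, today=date(2024, 3, 15))
    assert not is_birthday(born, today=date(2024, 3, 16))


test_is_birthday()
```

Because the clock is an input, the test needs no fake timers or patching.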

6. Port Conflicts

Tests that start servers on fixed ports fail when something else is using that port.

Broken:

def test_server():
    server = Server(port=3000)
    server.start()
    # Fails if port 3000 is in use

Fixed:

import requests

def test_server():
    server = Server(port=0)  # OS assigns an available port
    server.start()
    try:
        actual_port = server.port  # get the assigned port
        response = requests.get(f'http://localhost:{actual_port}/health')
        assert response.status_code == 200
    finally:
        server.stop()  # release the port even if the assertion fails
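The `Server` class above is a placeholder. For a runnable version of the same technique, the standard library's `http.server` also accepts port 0 and reports the port it actually bound. A minimal sketch; `HealthHandler` is a hypothetical handler written for this example.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # keep test output quiet


def test_health_endpoint():
    server = HTTPServer(("127.0.0.1", 0), HealthHandler)  # port 0: OS picks a free port
    port = server.server_address[1]  # the port actually bound
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", "/health")
        assert conn.getresponse().status == 200
        conn.close()
    finally:
        server.shutdown()      # stop serve_forever, even if the assertion fails
        server.server_close()  # release the socket


test_health_endpoint()
```

Since each test gets its own port, several copies can run in parallel without colliding.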

7. Random Data Without Seeds

Tests that generate random data without fixing the seed fail when they happen to generate edge-case values.

Broken:

def test_price_calculation():
    price = random.uniform(0.01, 1000.00)
    tax = calculate_tax(price)
    assert tax > 0  # fails if price rounds to 0.000 somehow

Fixed:

@pytest.mark.parametrize('price', [0.01, 1.00, 99.99, 999.99])
def test_price_calculation(price):
    tax = calculate_tax(price)
    assert tax > 0

Or set a fixed seed for reproducible random values:

def test_with_random_data():
    random.seed(42)
    price = random.uniform(0.01, 1000.00)
    # same value every run
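One caveat with `random.seed()` is that it resets the global generator, which other tests and libraries share. A private `random.Random` instance gives the same reproducibility without that coupling. A short sketch; `sample_prices` is a hypothetical helper.

```python
import random


def sample_prices(seed, n=5):
    """Generate `n` reproducible prices from a private, seeded generator."""
    rng = random.Random(seed)  # independent of the global `random` state
    return [round(rng.uniform(0.01, 1000.00), 2) for _ in range(n)]


first = sample_prices(seed=42)
random.random()  # unrelated code consuming global randomness has no effect
second = sample_prices(seed=42)
assert first == second
```

If a seeded test does fail, include the seed in the failure message so the exact inputs can be replayed.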

Systematic Remediation

When you have many flaky tests, prioritize by impact:

  1. High-frequency flakes in critical paths — fix immediately
  2. Tests that block deployment — fix before next release
  3. Occasional flakes in peripheral tests — quarantine and fix in next sprint

Track flakiness metrics over time. Set a goal (e.g., <1% flake rate) and measure progress weekly. Treat flakiness as a defect, not a fact of life.
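One way to define the metric, offered as an assumption rather than a standard: count a build as flaky when it failed on the first attempt but passed on a rerun of the same commit. The `flake_rate` function and the record fields below are hypothetical.

```python
def flake_rate(builds):
    """Fraction of builds that failed at first but passed on rerun of the same commit."""
    if not builds:
        return 0.0
    flaked = sum(
        1 for b in builds
        if not b["first_attempt_passed"] and b["rerun_passed"]
    )
    return flaked / len(builds)


week = [
    {"first_attempt_passed": True, "rerun_passed": None},
    {"first_attempt_passed": False, "rerun_passed": True},   # flake
    {"first_attempt_passed": False, "rerun_passed": False},  # genuine failure
    {"first_attempt_passed": True, "rerun_passed": None},
]
rate = flake_rate(week)  # one flaky build out of four
```

Whatever definition you choose, keep it fixed so that week-over-week numbers stay comparable.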

The goal is a test suite where every failure means something. When you achieve that, your team will trust CI and stop dismissing failures.
