How to Identify and Fix Flaky Tests: A Practical Guide
Flaky tests are tests that sometimes pass and sometimes fail without any code changes. They're among the most damaging problems in a test suite: they erode trust in every result, slow down CI pipelines, and hide real failures behind noise.
If your team regularly reruns tests hoping they'll pass, or dismisses test failures as "probably flaky," you have a flakiness problem that needs systematic attention.
Why Flaky Tests Are Dangerous
When tests flake, developers learn to ignore failures. "It's probably just flaky" becomes the default explanation for any red build. But sometimes tests fail because of a real bug. When your team treats all failures skeptically, real bugs slip through.
Flaky tests also consume disproportionate time. Every flaky test failure triggers an investigation, a rerun, and often a manual review. Multiply this across a team and you're losing hours per week to noise.
Finding Flaky Tests
Detection Strategy 1: Run Tests Multiple Times
The most direct approach is to run your test suite multiple times on the same code and look for tests that fail inconsistently:
# Run test suite 10 times, collect results
for i in $(seq 1 10); do
    pytest tests/ --tb=no -q 2>&1 | grep -E "FAILED|PASSED|ERROR" >> flaky_results.txt
done
# Count identical result lines; a test that FAILED in fewer than 10 runs is inconsistent
sort flaky_results.txt | uniq -c | sort -rn | head -20
Detection Strategy 2: CI History Analysis
Most CI systems record test results per job. Analyze that history to find tests that fail intermittently, with a failure rate above 0% but well below 100%:
# Query your CI API (GitHub Actions example)
import requests

def get_test_failure_rate(test_name, days=30):
    # get_recent_test_runs is a placeholder for your own CI API query
    runs = get_recent_test_runs(test_name, days)
    failures = sum(1 for r in runs if r['status'] == 'failed')
    return failures / len(runs) if runs else 0

# Tests failing 5-90% of the time are likely flaky
# Tests failing >90% are consistently broken (different problem)
flaky_tests = [
    test for test in all_tests
    if 0.05 < get_test_failure_rate(test) < 0.90
]
Detection Strategy 3: Run in Different Orders
Some flaky tests depend on other tests running first (order dependency). Test runners usually run tests in the same order, hiding this problem:
# Pytest: randomize test order
pip install pytest-randomly
pytest tests/  # once installed, the plugin shuffles on every run and prints the seed
pytest tests/ --randomly-seed=1234  # replay a specific order using a printed seed
# Jest: randomize order
jest --randomize
If tests fail with random ordering but pass normally, you have order-dependent tests.
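To make the idea concrete, here is a minimal, framework-free sketch of how a seeded shuffle exposes an order dependency and lets you replay it. The two test cases, the shared `cache`, and the helper names are all hypothetical and written only for this illustration; they are not how pytest-randomly or Jest work internally.

```python
import random

# Shared module-level state: the source of the order dependency.
cache = {}

def case_fills_cache():
    cache["user"] = "alice"

def case_reads_cache():
    # Passes only if case_fills_cache has already run.
    assert cache.get("user") == "alice"

def run_in_order(tests):
    """Run tests in the given order from a clean state; return names of failures."""
    cache.clear()
    failures = []
    for test in tests:
        try:
            test()
        except AssertionError:
            failures.append(test.__name__)
    return failures

def shuffled(tests, seed):
    """Return a copy of tests in an order fully determined by seed."""
    order = list(tests)
    random.Random(seed).shuffle(order)
    return order

def find_failing_seed(tests, seeds):
    """Return the first seed whose ordering produces a failure, or None."""
    for seed in seeds:
        if run_in_order(shuffled(tests, seed)):
            return seed
    return None

tests = [case_fills_cache, case_reads_cache]
seed = find_failing_seed(tests, range(50))
print("failing seed:", seed)
```

Because the order is derived only from the seed, logging the seed of a failing run is enough to reproduce the exact ordering later.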
Detection Strategy 4: Run in Parallel
Race conditions surface when tests run concurrently:
# Run with high parallelism to expose races
pytest tests/ -n 8  # pytest-xdist
Root Causes and Fixes
1. Timing Dependencies (Most Common)
Tests that use sleep() or wait for a fixed duration are inherently flaky. If the system is slower than expected — under load in CI, on a different machine — the sleep isn't long enough.
Broken:
import time

def test_email_sent():
    trigger_signup(email='user@example.com')
    time.sleep(2)  # hope the email service processed it
    assert email_service.sent_count() == 1
Fixed:
import time

def wait_for(condition, timeout=10, interval=0.1):
    # Poll until condition() is truthy or the timeout expires
    deadline = time.monotonic() + timeout  # monotonic clock is unaffected by system clock changes
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False

def test_email_sent():
    trigger_signup(email='user@example.com')
    # Wait until the condition is true, up to a timeout
    assert wait_for(
        lambda: email_service.sent_count() == 1,
        timeout=10,
        interval=0.1
    ), "Email not sent within 10 seconds"
2. Shared State Between Tests
Tests that modify shared state and don't clean up affect tests that run after them.
Broken:
# Tests share the same database without cleanup
def test_user_count():
    create_user('alice@example.com')
    assert User.count() == 1  # fails if other tests added users first
Fixed:
import pytest

@pytest.fixture(autouse=True)
def clean_database():
    yield
    # Runs after each test, so the next test starts from an empty database
    User.delete_all()
    Order.delete_all()

def test_user_count():
    create_user('alice@example.com')
    assert User.count() == 1  # always starts from 0
Or use transactions that roll back after each test:
@pytest.fixture(autouse=True)
def database_transaction(db):
    with db.transaction() as txn:
        yield txn
        txn.rollback()
3. Race Conditions in Async Code
Async code has timing-dependent behavior that's easy to get wrong in tests:
Broken:
test('processes message from queue', async () => {
  await queue.publish({ type: 'order.created', orderId: '123' });
  // Message is processed asynchronously, so this may run before processing
  expect(await Order.findById('123')).toHaveProperty('status', 'confirmed');
});
Fixed:
test('processes message from queue', async () => {
  await queue.publish({ type: 'order.created', orderId: '123' });
  // Poll until the state changes
  const order = await waitFor(
    async () => {
      const o = await Order.findById('123');
      if (o?.status !== 'confirmed') throw new Error('Not yet');
      return o;
    },
    { timeout: 5000 }
  );
  expect(order.status).toBe('confirmed');
});
4. External Service Dependencies
Tests that call real external services fail when those services are down or rate-limited.
Broken:
def test_geocoding():
    result = google_maps.geocode('1600 Amphitheatre Pkwy')
    assert result['lat'] == pytest.approx(37.42, abs=0.01)
Fixed:
from unittest.mock import patch

import pytest

@patch('myapp.maps.requests.get')
def test_geocoding(mock_get):
    # Stub the HTTP call so the test never touches the real service
    mock_get.return_value.json.return_value = {
        'results': [{'geometry': {'location': {'lat': 37.42, 'lng': -122.08}}}]
    }
    result = google_maps.geocode('1600 Amphitheatre Pkwy')
    assert result['lat'] == pytest.approx(37.42, abs=0.01)
5. Date and Time Dependencies
Tests that use new Date() or datetime.now() are prone to flakiness because they behave differently depending on when they run.
Broken:
test('sends birthday email on birthday', () => {
  const user = { name: 'Alice', birthday: new Date() }; // today's date
  sendBirthdayEmails([user]);
  expect(emailService.sentCount()).toBe(1);
  // Fails if run at midnight during date rollover
});
Fixed:
test('sends birthday email on birthday', () => {
  const today = new Date('2024-03-15');
  jest.useFakeTimers(); // setSystemTime only takes effect with fake timers enabled
  jest.setSystemTime(today);
  const user = { name: 'Alice', birthday: new Date('2024-03-15') };
  sendBirthdayEmails([user], { now: today });
  expect(emailService.sentCount()).toBe(1);
  jest.useRealTimers();
});
Pass time as a parameter or use a clock abstraction instead of calling system time directly.
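The same idea in Python might look like the sketch below, where the code under test receives a `clock` callable instead of calling `datetime.now()` itself. The function names and the user record shape are hypothetical, chosen only for this example.

```python
from datetime import date, datetime, timezone

def is_birthday(birthday: date, today: date) -> bool:
    """Compare month and day only; the birth year is irrelevant."""
    return (birthday.month, birthday.day) == (today.month, today.day)

def users_with_birthday(users, clock=lambda: datetime.now(timezone.utc)):
    """Return names of users whose birthday is today, according to clock()."""
    today = clock().date()
    return [u["name"] for u in users if is_birthday(u["birthday"], today)]

# In a test, inject a fixed clock so the result never depends on the real date.
def fixed_clock():
    return datetime(2024, 3, 15, 12, 0, tzinfo=timezone.utc)

users = [
    {"name": "Alice", "birthday": date(1990, 3, 15)},
    {"name": "Bob", "birthday": date(1985, 7, 1)},
]
print(users_with_birthday(users, clock=fixed_clock))
```

Production code calls `users_with_birthday(users)` and gets the real time, while tests pass `fixed_clock` and need no global timer mocking.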
6. Port Conflicts
Tests that start servers on fixed ports fail when something else is using that port.
Broken:
def test_server():
    server = Server(port=3000)
    server.start()
    # Fails if port 3000 is in use
Fixed:
import requests

def test_server():
    server = Server(port=0)  # OS assigns an available port
    server.start()
    try:
        actual_port = server.port  # get the assigned port
        response = requests.get(f'http://localhost:{actual_port}/health')
        assert response.status_code == 200
    finally:
        server.stop()  # stop the server even if the assertion fails
7. Random Data Without Seeds
Tests that generate random data without fixing the seed fail when they happen to generate edge-case values.
Broken:
def test_price_calculation():
    price = random.uniform(0.01, 1000.00)
    tax = calculate_tax(price)
    assert tax > 0  # fails if price rounds to 0.000 somehow
Fixed:
import pytest

@pytest.mark.parametrize('price', [0.01, 1.00, 99.99, 999.99])
def test_price_calculation(price):
    tax = calculate_tax(price)
    assert tax > 0
Or set a fixed seed for reproducible random values:
import random

def test_with_random_data():
    random.seed(42)
    price = random.uniform(0.01, 1000.00)
    # same value every run
Systematic Remediation
When you have many flaky tests, prioritize by impact:
- High-frequency flakes in critical paths — fix immediately
- Tests that block deployment — fix before next release
- Occasional flakes in peripheral tests — quarantine and fix in next sprint
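As a rough illustration, the priorities above could be encoded as a small triage function. The `FlakyTest` fields and the 20% cutoff for "high-frequency" are assumptions made for this sketch; pick whatever thresholds match your own suite.

```python
from dataclasses import dataclass

@dataclass
class FlakyTest:
    name: str
    failure_rate: float      # fraction of recent runs that failed
    critical_path: bool      # covers a critical user or business path
    blocks_deployment: bool  # a failure stops the deploy pipeline

HIGH_FREQUENCY = 0.20  # assumed cutoff, not a standard value

def triage(test: FlakyTest) -> str:
    """Map a flaky test to one of the three priority buckets."""
    if test.critical_path and test.failure_rate >= HIGH_FREQUENCY:
        return "fix immediately"
    if test.blocks_deployment:
        return "fix before next release"
    return "quarantine and fix in next sprint"

checkout = FlakyTest("test_checkout", 0.30, critical_path=True, blocks_deployment=True)
tooltip = FlakyTest("test_tooltip", 0.06, critical_path=False, blocks_deployment=False)
print(triage(checkout), "|", triage(tooltip))
```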
Track flakiness metrics over time. Set a goal (e.g., <1% flake rate) and measure progress weekly. Treat flakiness as a defect, not a fact of life.
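One possible way to compute such a metric is sketched below. It assumes a particular definition: a test counts as flaky on a commit when it both passed and failed on that same commit, and the flake rate is the share of (commit, test) pairs that are flaky. The record format and function names are hypothetical; adapt them to whatever your CI system exports.

```python
from collections import defaultdict

def flake_rate(records):
    """records: iterable of (commit, test_name, passed) tuples.

    Returns the fraction of (commit, test) pairs with mixed outcomes.
    """
    outcomes = defaultdict(set)
    for commit, test, passed in records:
        outcomes[(commit, test)].add(passed)
    if not outcomes:
        return 0.0
    flaky = sum(1 for seen in outcomes.values() if seen == {True, False})
    return flaky / len(outcomes)

def meets_goal(rate, goal=0.01):
    """True when the measured flake rate is below the target (default 1%)."""
    return rate < goal

records = [
    ("a1", "test_x", True),
    ("a1", "test_x", False),  # same commit, different outcome: flaky
    ("a1", "test_y", True),
    ("b2", "test_x", True),
    ("b2", "test_y", True),
]
rate = flake_rate(records)
print(f"flake rate: {rate:.2%}, goal met: {meets_goal(rate)}")
```

Running this weekly over the same window of CI history gives a trend line to compare against the goal.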
The goal is a test suite where every failure means something. When you achieve that, your team will trust CI and stop dismissing failures.