Test Retry Patterns: When to Retry Tests and When to Fix Them
Retrying failing tests is a controversial topic. Critics say it hides flakiness and trains developers to ignore failures. Proponents say some level of retry is pragmatic in distributed test environments. Both are right — the answer is about when and how you retry.
The Case Against Retries
Automatic retries mask flaky tests. When CI reruns a test and it passes, the failure disappears from your metrics. Developers never see it. The flaky test stays flaky indefinitely.
Retries also slow down CI. If 5% of tests retry and each retry adds 30 seconds, your pipeline gets meaningfully slower. Multiply across many teams and CI instances, and this adds up.
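To put rough numbers on that, here is a minimal sketch of the arithmetic. The suite size of 2,000 tests and the function name are made up for illustration; the 5% retry share and 30 seconds per retry come from the example above.

```python
def retry_overhead_seconds(total_tests, retry_percent, seconds_per_retry):
    """Estimate extra runner time spent on retries in one pipeline run."""
    retried_tests = total_tests * retry_percent / 100
    return retried_tests * seconds_per_retry


# Hypothetical suite of 2,000 tests; 5% retry, 30 seconds per retry.
overhead = retry_overhead_seconds(2000, 5, 30)
print(f"Extra runner time per run: {overhead / 60:.0f} minutes")
```

This counts total runner time. With parallel runners the wall-clock cost is lower, but the compute is still spent.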
Most importantly, a test that passes on the second try might be hiding a race condition, an infrastructure problem, or a real but intermittent bug. Making that failure invisible is dangerous.
The Case For Limited Retries
Not all test failures are created equal. Some failures are caused by:
- Flaky external services in a staging environment
- Race conditions in third-party infrastructure (not your code)
- Transient network issues between test runners and services
- Garbage collection pauses in JVMs causing timeouts
For these infrastructure-related failures, retrying once is pragmatic. The alternative is constant false alarms that developers learn to ignore — which is arguably worse than masking flakiness.
The key is transparency: retried tests should be visible in your metrics, not silently hidden.
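One way to combine both ideas is sketched below: retry once, only for error types that look like infrastructure problems, and record every retry. The choice of `ConnectionError` and `TimeoutError` as "transient" is an assumption, and `run_with_one_retry` and `retry_log` are hypothetical names, not part of any test framework.

```python
# Assumed set of infrastructure-type errors; adjust to your environment.
TRANSIENT_ERRORS = (ConnectionError, TimeoutError)

retry_log = []  # stand-in for whatever metrics system you report to


def run_with_one_retry(test_name, test_func):
    """Run test_func; retry once on a transient error and record the retry."""
    try:
        return test_func()
    except TRANSIENT_ERRORS as exc:
        # Keep the first failure visible instead of silently discarding it
        retry_log.append({"test": test_name, "error": repr(exc)})
        return test_func()


# Usage: a check that fails once with a network error, then succeeds.
calls = {"count": 0}


def flaky_service_check():
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("staging service unavailable")
    return "ok"


result = run_with_one_retry("test_service_check", flaky_service_check)
print(result, len(retry_log))
```

An `AssertionError` is not in the transient set, so a genuine assertion failure propagates on the first attempt and is never retried.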
Retry Strategies
Strategy 1: Retry Once, Log Always
The most defensible approach. Retry at most once, but always log the initial failure:
# conftest.py (pytest)
import pytest


@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_makereport(item, call):
    outcome = yield
    report = outcome.get_result()
    if report.when != "call":
        return  # only track the test body, not setup/teardown
    if getattr(item, "_first_failure", False):
        # This is a retry: mark the report as flaky, pass or fail
        report.flaky = True
    elif report.failed:
        # First failure: log it and remember it for any retry
        print(f"\nFirst failure: {item.nodeid}")
        item._first_failure = True

# pytest-rerunfailures
pip install pytest-rerunfailures
# Retry once, with 1 second delay
pytest tests/ --reruns 1 --reruns-delay 1

Strategy 2: Category-Based Retries
Apply retries only to tests tagged as potentially flaky for infrastructure reasons:
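# Only retry tests marked as infrastructure-dependent
@pytest.mark.retry(max_attempts=3)
@pytest.mark.integration
def test_external_payment_webhook():
    # This test calls a real webhook endpoint that's sometimes slow
    ...

# conftest.py
import pytest


def pytest_configure(config):
    # Register the custom marker so pytest does not warn about it
    config.addinivalue_line("markers", "retry(max_attempts): retry an infrastructure-dependent test")


def pytest_collection_modifyitems(items):
    # `retry` is a custom marker, not a pytest built-in. Translate it into
    # the `flaky` marker from pytest-rerunfailures, which performs the reruns.
    for item in items:
        retry_mark = item.get_closest_marker('retry')
        if retry_mark:
            max_attempts = retry_mark.kwargs.get('max_attempts', 3)
            item.add_marker(pytest.mark.flaky(reruns=max_attempts - 1))

Strategy 3: Retry with Exponential Backoff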
For tests that fail due to rate limits or eventual consistency:
import time
from functools import wraps


def retry_with_backoff(max_attempts=3, base_delay=1):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as e:  # also covers AssertionError
                    if attempt == max_attempts - 1:
                        raise
                    # Delay doubles after each failed attempt
                    delay = base_delay * (2 ** attempt)
                    print(f"\nTest failed (attempt {attempt + 1}/{max_attempts}). "
                          f"Retrying in {delay}s... Error: {e}")
                    time.sleep(delay)
        return wrapper
    return decorator


@retry_with_backoff(max_attempts=3, base_delay=1)
def test_cache_invalidation_propagates():
    # Eventual consistency: may take 1-3 seconds to propagate
    update_cache_key('user-123', {'name': 'Alice Updated'})
    cache_value = read_from_replica('user-123')
    assert cache_value['name'] == 'Alice Updated'

Framework-Specific Retry Configuration
Jest (JavaScript)
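// jest.config.js
module.exports = {
  testTimeout: 10000,
  // Retries are not a config option. Enable them with jest.retryTimes(),
  // which requires the jest-circus runner (the default since Jest 27).
  // For a suite-wide default, it is commonly called from a setupFilesAfterEnv file.
};

// Per-file retry: applies to every test in this file (or describe block)
jest.retryTimes(3);
test('async operation completes', async () => {
  // This test will retry up to 3 times
  await expect(asyncOperation()).resolves.toBe('success');
});

Cypress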
// cypress.config.js
module.exports = {
  retries: {
    runMode: 2,  // Retry up to 2 times in CI
    openMode: 0  // No retries in interactive mode (you're watching)
  }
};

// Per-test
it('loads the dashboard', { retries: 3 }, () => {
  cy.visit('/dashboard');
  cy.get('[data-testid="dashboard"]').should('be.visible');
});

Playwright
// playwright.config.js
module.exports = {
  retries: process.env.CI ? 2 : 0,
  // Only retry in CI; locally you want to see failures immediately
};

GitHub Actions
# Retry the whole test command (crude but sometimes necessary)
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        uses: nick-fields/retry@v2
        with:
          timeout_minutes: 30
          max_attempts: 3
          command: npm test

Measuring Retry Impact
Track retry metrics to understand your flakiness burden:
# Generate retry report from test results
import xml.etree.ElementTree as ET


def analyze_retries(junit_xml_path):
    """Parse JUnit XML and find retried tests."""
    root = ET.parse(junit_xml_path).getroot()
    retried = []
    total_tests = 0
    for testcase in root.iter('testcase'):
        # Count test cases directly: the root element may be <testsuites>,
        # which does not always carry a `tests` attribute.
        total_tests += 1
        # Assumes each <testcase> carries a `reruns` attribute. Not every
        # runner or plugin emits one, so adapt this to your report format.
        reruns = int(testcase.get('reruns', 0))
        if reruns > 0:
            failed = (testcase.find('failure') is not None
                      or testcase.find('error') is not None)
            retried.append({
                'name': testcase.get('name'),
                'reruns': reruns,
                'final_status': 'failed' if failed else 'passed'
            })
    print(f"Tests that required retries: {len(retried)}")
    for t in retried:
        print(f"  {t['reruns']} retry(ies), final={t['final_status']}: {t['name']}")
    # Alert if retry rate exceeds threshold
    retry_rate = len(retried) / total_tests if total_tests else 0
    if retry_rate > 0.05:
        print(f"WARNING: Retry rate {retry_rate:.1%} exceeds 5% threshold")

Set up a dashboard showing retry trends over time. A rising retry rate usually means your tests or your test infrastructure are degrading.
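Before a dashboard exists, a trend check can be approximated in a few lines. The sketch below compares the average retry rate of the older half of recent CI runs with the newer half. The `total` and `retried` keys and the `min_runs` cutoff are assumptions made for illustration, not the output of any particular tool.

```python
from statistics import mean


def retry_rate_is_rising(history, min_runs=4):
    """Return True if the newer half of runs has a higher mean retry rate.

    `history` is a list of dicts ordered oldest to newest, each with the
    hypothetical keys 'total' (tests run) and 'retried' (tests retried).
    """
    rates = [run["retried"] / run["total"] for run in history if run["total"]]
    if len(rates) < min_runs:
        return False  # not enough data to call it a trend
    midpoint = len(rates) // 2
    return mean(rates[midpoint:]) > mean(rates[:midpoint])


# Made-up history of four CI runs, oldest first.
history = [
    {"total": 1000, "retried": 10},
    {"total": 1000, "retried": 12},
    {"total": 1000, "retried": 25},
    {"total": 1000, "retried": 31},
]
print(retry_rate_is_rising(history))
```

Comparing halves is deliberately crude. It smooths over a single noisy run, but a real dashboard would show the full series.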
When Retries Are Never Acceptable
Some failures should never be retried:
Deterministic business logic tests. If calculate_tax(100.00) returns 8.00 instead of 7.50, retrying won't help. The code is wrong.
Authorization failures. If a user can access a resource they shouldn't, retrying will just confirm the security hole.
Data integrity violations. If an operation creates a duplicate record, retrying might create another one.
Any test in a local development environment. Retries locally mask bugs you should be fixing immediately. Disable retries in development mode.
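These rules can be written down as a small policy function so they are enforced rather than remembered. This is a sketch under stated assumptions: the category names are hypothetical labels for the cases above, and the `CI` environment variable is assumed to be set only on CI runners.

```python
import os

# Hypothetical category names mirroring the cases listed above
NEVER_RETRY_CATEGORIES = {"business_logic", "authorization", "data_integrity"}


def allowed_retries(category, env=None, ci_retries=1):
    """Return how many retries a test in `category` may use."""
    env = os.environ if env is None else env
    if not env.get("CI"):
        return 0  # local development: surface failures immediately
    if category in NEVER_RETRY_CATEGORIES:
        return 0  # a retry cannot fix these and may hide or worsen them
    return ci_retries


print(allowed_retries("authorization", env={"CI": "true"}))
print(allowed_retries("integration", env={"CI": "true"}))
print(allowed_retries("integration", env={}))
```

Passing `env` explicitly keeps the function easy to test. In normal use it falls back to the process environment.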
// Never retry locally
const retries = process.env.CI ? 2 : 0;

The Right Mental Model
Think of retries as a safety valve, not a solution. They're for the gap between "we know this is flaky" and "we've fixed the root cause."
The correct process:
1. Test fails → retry once automatically
2. Retry logged → flaky test tracking system records this failure
3. Threshold exceeded → alert fires (e.g., 3 retries in a week for one test)
4. Team investigates → root cause found, test fixed or quarantined
5. Fix verified → test no longer retries
Without steps 2 through 4, retries are just hiding problems. With them, retries are a reasonable buffer while you work through your flakiness backlog.
The goal is a test suite with zero retries — not because you disabled them, but because your tests are stable enough not to need them.