Root Causes of Flaky Tests in CI: Why Tests Pass Locally but Fail in CI
"It works on my machine" is one of the oldest complaints in software development. With tests, the equivalent is "it passes locally but fails in CI." This pattern has specific, identifiable causes — and fixing them systematically makes your test suite dramatically more reliable.
Why CI Is Different from Local
CI environments differ from developer machines in important ways:
- Different OS (often Linux in CI vs macOS locally)
- No GPU or display — tests that render graphics fail without a virtual display
- Resource constraints — less CPU and RAM per job, shared infrastructure
- Network restrictions — no access to external services, different DNS
- Fresh state every run — no caches, no persistent files, no leftover processes
- Parallelism — multiple test jobs running concurrently on shared infrastructure
- Time zone — usually UTC; your machine might be local time
Understanding these differences lets you anticipate flakiness before it happens.
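One way to make these differences visible is to print a short environment fingerprint at the start of a test run, both locally and in CI, and compare the two outputs. The sketch below uses only the standard library. The function name `environment_fingerprint` and the choice of fields are illustrative assumptions, not part of any tool.

```python
import os
import platform
import time


def environment_fingerprint() -> dict:
    """Collect settings that commonly differ between local and CI runs."""
    return {
        "os": platform.system(),                    # e.g. "Linux" or "Darwin"
        "cpu_count": os.cpu_count(),                # CPUs on the machine (ignores container limits)
        "ci": bool(os.environ.get("CI")),           # many CI providers set CI=true
        "display": os.environ.get("DISPLAY"),       # None when no X display is configured
        "tz_env": os.environ.get("TZ"),             # explicit TZ override, if any
        "utc_offset_hours": -time.timezone / 3600,  # standard-time offset of the local zone
    }


if __name__ == "__main__":
    for key, value in environment_fingerprint().items():
        print(f"{key}: {value}")
```

Diffing this output between a passing local run and a failing CI run often narrows the search to one or two of the categories above.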
Root Cause 1: Resource Starvation
CI machines share resources. Your tests might complete in 200ms locally but take 2000ms in CI when CPU is throttled.
Symptoms: Tests that use fixed timeouts fail intermittently. Database queries time out. Async operations complete too slowly.
How to find it:
# Add timing to your test output
pytest tests/ -v --durations=10
# Compare local vs CI timing
# Local: test_user_creation ... 0.08s
# CI: test_user_creation ... 2.34s ← resource starvation

Fix:
import os
import time

# Bad: hardcoded sleep
def wait_for_email():
    time.sleep(2)  # assumes 2s is enough

# Good: configurable timeout with polling
def wait_for_email(timeout=None):
    if timeout is None:
        timeout = float(os.environ.get('TEST_TIMEOUT', '5'))
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if email_service.has_pending():
            return True
        time.sleep(0.1)
    return False

Set TEST_TIMEOUT=10 in CI environments where resources are constrained.
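The same polling pattern can be factored into a reusable helper so that individual tests do not each reimplement the loop. This is a minimal sketch. The name `wait_until` and its default values are assumptions chosen for illustration.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 5.0, interval: float = 0.1) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    # time.monotonic() is not affected by system clock adjustments
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A test would then call something like `assert wait_until(email_service.has_pending, timeout=10)`, which returns as soon as the condition holds instead of always sleeping for the worst case.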
Root Cause 2: No Display / Headless Environment
Browser tests using Selenium or Playwright fail in CI without a display server. The browser can't render without one.
Symptom: WebDriverException: Message: unknown error: Chrome failed to start: exited normally
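Before applying a fix, it can help to decide in one place whether a run needs headless mode at all. The sketch below is one possible heuristic, not a standard API: the function name `should_run_headless` is hypothetical, and the rule (CI variable set, or Linux with no display variable) is an assumption that may need adjusting for your setup.

```python
import os
import sys
from typing import Mapping, Optional


def should_run_headless(env: Optional[Mapping[str, str]] = None,
                        platform: Optional[str] = None) -> bool:
    """Guess whether a browser must run headless in this environment."""
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform
    if env.get("CI"):
        return True
    if platform.startswith("linux"):
        # No X11 or Wayland display configured means nothing to render to
        return not (env.get("DISPLAY") or env.get("WAYLAND_DISPLAY"))
    return False
```

Passing `env` and `platform` explicitly makes the decision itself easy to unit test without touching the real environment.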
Fix for Selenium:
# GitHub Actions
- name: Start virtual display
  run: |
    Xvfb :99 -screen 0 1920x1080x24 &
- name: Run browser tests
  run: pytest tests/browser/
  env:
    DISPLAY: ":99"

Or use headless mode:
import os
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
if os.environ.get('CI'):
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=options)

Playwright launches browsers headless by default. To keep a visible browser locally, tie the option to the CI variable:
const browser = await chromium.launch({ headless: !!process.env.CI });

Root Cause 3: File System Differences
Linux (CI) and macOS (local) have different file systems:
- Case sensitivity: macOS is case-insensitive by default; Linux is case-sensitive. require('./MyModule') finds mymodule.js on Mac but fails on Linux.
- Path separators: \ vs /. Hardcoded Windows paths break on Unix.
- Line endings: \r\n vs \n. A test that compares file contents can fail.
- Temp directories: /tmp on Linux, possibly a different path on Mac.
Symptoms: Module not found errors in CI that don't happen locally. String comparison tests fail on file content.
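To catch the case-sensitivity symptom on a case-insensitive machine before CI does, one option is to check that a path matches the on-disk names exactly, component by component. This is a sketch using only the standard library. The helper name `exact_case_exists` is hypothetical.

```python
import os
from pathlib import Path


def exact_case_exists(path: str) -> bool:
    """Return True only if every component of `path` matches the on-disk name exactly."""
    absolute = Path(os.path.abspath(path))
    current = Path(absolute.anchor)
    for part in absolute.parts[1:]:
        try:
            names = os.listdir(current)  # listing returns names as stored on disk
        except OSError:
            return False
        if part not in names:
            return False
        current = current / part
    return True
```

On macOS, `os.path.exists('MyModule.js')` can succeed when the file is stored as `mymodule.js`; this check would return False in that situation, which is the behavior Linux would show.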
Fix case sensitivity:
# List tracked paths that collide when case is ignored
git ls-files | tr '[:upper:]' '[:lower:]' | sort | uniq -d

Fix temp directory:
import os
import tempfile

# Bad
tmp_file = '/tmp/test_output.txt'
# Good
tmp_file = os.path.join(tempfile.gettempdir(), 'test_output.txt')

Fix line endings:
# Bad: breaks on different OS
assert file_content == "line1\r\nline2\r\n"
# Good: normalize before comparing
assert file_content.replace('\r\n', '\n') == "line1\nline2\n"

Root Cause 4: Test Ordering and Isolation
When tests run in parallel in CI (common for performance), they share resources that weren't designed for concurrent access: ports, files, database records.
Symptom: Tests pass individually but fail when run together. Failures don't reproduce with serial execution.
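A toy illustration of this symptom, with plain functions standing in for tests and a hypothetical module-level `registry` standing in for shared state. Each check passes on its own, but one of them fails when it runs after the other.

```python
registry = []  # hypothetical shared state, visible to every check


def check_adds_item():
    registry.append("a")
    assert "a" in registry


def check_starts_empty():
    # Implicitly assumes nothing ran before it
    assert registry == []


def run_in_order(checks):
    """Run checks sequentially from a clean start and record pass/fail per check."""
    registry.clear()
    results = {}
    for check in checks:
        try:
            check()
            results[check.__name__] = "pass"
        except AssertionError:
            results[check.__name__] = "fail"
    return results


print(run_in_order([check_starts_empty, check_adds_item]))
print(run_in_order([check_adds_item, check_starts_empty]))
```

The first order passes both checks. In the second, `check_starts_empty` fails because `check_adds_item` left data behind. Real suites hide the same dependency behind databases, files, and global singletons.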
Find ordering issues:
# Run in random order to detect ordering dependencies (requires pytest-randomly)
# The plugin prints the seed it used; rerun with --randomly-seed=<seed> to reproduce a failing order
pytest tests/ -p randomly
# Run in parallel to detect isolation issues (requires pytest-xdist)
pytest tests/ -n 4

Fix shared database state:
# Bad: tests share data
class TestOrders:
    def setup_method(self):
        self.order = Order.create(customer='alice', total=99.99)

    def test_order_total(self):
        # May see orders from other parallel test classes
        assert Order.total_revenue() == 99.99

# Good: each test gets its own isolated data
import pytest

@pytest.fixture
def db_session():
    with begin_transaction() as txn:
        yield txn
        txn.rollback()  # discard everything the test wrote

def test_order_total(db_session):
    order = Order.create(customer='alice', total=99.99, session=db_session)
    assert Order.total_revenue(session=db_session) == 99.99

Fix port conflicts for parallel tests:
import socket
import pytest

def get_free_port():
    # Binding to port 0 asks the OS for any unused port
    with socket.socket() as s:
        s.bind(('', 0))
        return s.getsockname()[1]

@pytest.fixture
def server():
    port = get_free_port()
    s = Server(port=port)
    s.start()
    yield s
    s.stop()

Root Cause 5: Network and External Service Access
CI often blocks outbound network connections for security. Tests that call real APIs fail.
Symptoms: ConnectionRefusedError, requests.exceptions.ConnectionError, DNS lookup failures.
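To surface these failures locally instead of waiting for CI, one option is to block outbound connections while tests run, so any test that reaches for the network fails immediately. The sketch below patches `socket.socket.connect` for the duration of a context manager. The names `block_network` and `NetworkBlockedError` are hypothetical, and the approach is an assumption for illustration. It does not intercept `connect_ex` or code that bypasses Python's `socket` module.

```python
import socket
from contextlib import contextmanager


class NetworkBlockedError(RuntimeError):
    """Raised when code under test tries to open an outbound connection."""


@contextmanager
def block_network():
    """Temporarily make every socket.connect() call fail loudly."""
    original_connect = socket.socket.connect

    def guarded_connect(self, address):
        raise NetworkBlockedError(f"blocked connection attempt to {address!r}")

    socket.socket.connect = guarded_connect
    try:
        yield
    finally:
        # Always restore the real method, even if the test raised
        socket.socket.connect = original_connect
```

Wrapping a test body in `with block_network():` turns a hidden dependency on a real service into an explicit, reproducible error.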
Fix: Mock all external service calls. Never call real external services from unit or integration tests:
from unittest.mock import patch

# Bad: calls real Stripe API
def test_payment_processing():
    result = stripe.Charge.create(amount=999, currency='usd', source='tok_visa')
    assert result['status'] == 'succeeded'

# Good: mock the API client
@patch('myapp.payments.stripe.Charge.create')
def test_payment_processing(mock_charge):
    mock_charge.return_value = {'status': 'succeeded', 'id': 'ch_test'}
    result = process_payment(amount=999)
    assert result['status'] == 'succeeded'

For integration tests that need a real database, run one in a container in CI, for example as a GitHub Actions service container:
# GitHub Actions
services:
  postgres:
    image: postgres:16
    env:
      POSTGRES_DB: test_db
      POSTGRES_USER: test_user
      POSTGRES_PASSWORD: test_pass
    ports:
      - 5432:5432
    options: >-
      --health-cmd pg_isready
      --health-interval 10s
      --health-timeout 5s
      --health-retries 5

Root Cause 6: Time Zone and Clock Issues
CI usually runs in UTC. Local machines run in the developer's local time zone. Tests that use the current time without specifying a time zone can fail.
Symptom: Date-related tests fail in CI but pass locally. Tests fail only at certain hours, typically around midnight UTC, when the UTC date and the developer's local date disagree.
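A short worked example of why the hour matters: the same instant falls on different calendar dates depending on the zone used to read it. The fixed UTC+9 offset here is an arbitrary choice for illustration.

```python
from datetime import datetime, timedelta, timezone

# One instant, late in the evening UTC
instant = datetime(2024, 3, 15, 23, 30, tzinfo=timezone.utc)
utc_plus_9 = timezone(timedelta(hours=9))

date_in_utc = instant.date()
date_in_utc_plus_9 = instant.astimezone(utc_plus_9).date()

print(date_in_utc)         # 2024-03-15
print(date_in_utc_plus_9)  # 2024-03-16
```

Any test that creates a record "now" and then queries for "today" can straddle this boundary if the two steps read the date in different zones.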
Fix:
from datetime import date, datetime, timezone

# Bad: depends on local timezone
def test_shows_today_orders():
    order = create_order()
    orders = get_orders_for_date(date.today())
    assert order in orders  # fails if date.today() differs between order creation and query

# Good: explicit timezone handling
def test_shows_today_orders():
    now = datetime(2024, 3, 15, 12, 0, 0, tzinfo=timezone.utc)
    order = create_order(created_at=now)
    orders = get_orders_for_date(now.date(), tz=timezone.utc)
    assert order in orders

Set TZ=UTC in CI configuration and run your tests locally with the same setting to catch these issues early:
TZ=UTC pytest tests/

Root Cause 7: Caching Side Effects
Local environments accumulate cached data: compiled assets, downloaded dependencies, test data. CI starts fresh each run.
Symptom: Tests that verify "first run" behavior always pass in CI but fail locally (or vice versa). Tests that create cache files and expect them to persist fail in CI.
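One way to make cache behavior identical in both environments is to give each test its own empty cache location rather than a shared one. The sketch below assumes a hypothetical file-based cache. `load_profile`, its `cache_dir` parameter, and the `fetch` callback are illustrative names, not an existing API.

```python
import tempfile
from pathlib import Path
from typing import Callable, Tuple


def load_profile(user_id: str, cache_dir: str, fetch: Callable[[str], str]) -> Tuple[str, str]:
    """Return (value, origin), where origin is 'cache' or 'source'."""
    cache_file = Path(cache_dir) / f"{user_id}.txt"
    if cache_file.exists():
        return cache_file.read_text(), "cache"
    value = fetch(user_id)
    cache_file.write_text(value)  # populate the cache for later calls
    return value, "source"


if __name__ == "__main__":
    # Each run gets a fresh directory, so leftovers from earlier runs cannot leak in
    with tempfile.TemporaryDirectory() as cache_dir:
        print(load_profile("user123", cache_dir, lambda uid: f"profile-{uid}"))
        print(load_profile("user123", cache_dir, lambda uid: f"profile-{uid}"))
```

Because the cache directory is passed in, a test controls whether it starts cold or warm, regardless of what is lying around on the machine.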
Fix: Explicitly set up and tear down any cache state in tests:
import pytest

@pytest.fixture(autouse=True)
def clear_cache():
    cache.clear()
    yield
    cache.clear()

def test_cache_miss_fetches_from_database(clear_cache):
    # No cache populated yet: guaranteed fresh state
    result = get_user_profile('user123')
    assert db.query_count == 1  # fetched from DB, not cache

Diagnosing in Practice
When a test fails in CI but not locally:
- Reproduce in CI: run the same test multiple times in CI to confirm it's flaky, not a one-time infra issue
- Add logging: wrap the failing assertion with detailed logging about state, timestamps, and context
- Check the category: does the failure look like a timing issue? A state pollution issue? A file system issue?
- Run locally with CI constraints: use TZ=UTC, run in Docker with limited CPU/memory, use pytest -n 4 for parallel execution
- Isolate the test: run just the failing test in CI to see if it fails without other tests running alongside it
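For the first step, repeating a test and counting failures gives a rough failure rate to compare before and after a fix. This is a minimal sketch. `failure_rate` and `run_pytest_once` are hypothetical helpers, and the second one assumes pytest is installed in the current interpreter.

```python
import subprocess
import sys
from typing import Callable


def failure_rate(run_once: Callable[[], bool], runs: int = 20) -> float:
    """Call `run_once` repeatedly and return the fraction of runs that failed."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    failures = sum(1 for _ in range(runs) if not run_once())
    return failures / runs


def run_pytest_once(node_id: str) -> bool:
    """Run a single pytest node in a fresh process; True means it passed."""
    result = subprocess.run([sys.executable, "-m", "pytest", node_id, "-q"])
    return result.returncode == 0
```

Usage would look like `failure_rate(lambda: run_pytest_once("tests/path/to/test.py::test_specific_case"))`. A rate of zero over a small number of runs does not prove the test is stable, so treat the number as a rough signal.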
# Run the specific failing test in CI mode
docker run --rm -e TZ=UTC -e CI=true \
  --cpus 1 --memory 512m \
  your-test-image pytest tests/path/to/test.py::test_specific_case -v

Systematic diagnosis beats guessing. The differences between local and CI are finite and knowable; work through them methodically.