Feature Flag Testing: How to Test Applications with Feature Flags
Feature flags are one of the most powerful tools in modern software deployment. They're also one of the most underestimated sources of bugs.
When you add a feature flag, you don't just add a feature: you add a new dimension of application state. Every boolean flag doubles the number of states your application can be in. With 10 flags, that is 2^10 = 1,024 possible combinations. Testing all of them manually isn't feasible. Testing none of them is reckless.
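To make the arithmetic concrete, here is a small Python sketch that counts the full state space and compares it with testing only the flags known to interact. The flag names and the `interacting` group are hypothetical, chosen for illustration only.

```python
from itertools import product

# Hypothetical flag names, for illustration only.
flags = [f"flag-{i}" for i in range(10)]

# Every boolean flag doubles the state space: 2 ** 10 combinations.
all_states = list(product([False, True], repeat=len(flags)))
total_states = len(all_states)

# Assumption: only these two flags are known to interact, so only
# their combinations need dedicated cross-flag tests.
interacting = ["new-checkout", "new-payment-provider"]
critical_states = [
    dict(zip(interacting, values))
    for values in product([False, True], repeat=len(interacting))
]

print(total_states)          # 1024
print(len(critical_states))  # 4
```

The point of the comparison is that exhaustive coverage grows exponentially, while coverage of explicitly identified interactions stays small.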
This guide covers practical strategies for feature flag testing: how to structure tests, what to automate, and how to prevent flag-related incidents.
Why Feature Flags Break Things
Feature flags feel safe because they give you an off switch. In practice, they introduce failure modes that pure code changes don't have:
Flag evaluation at the wrong time: Flags are evaluated at runtime, not deploy time. A flag state can change while a user is mid-session, producing inconsistent behavior within a single user journey.
Stale cache with fresh flag: If your application caches feature flag values, a newly enabled flag may not take effect immediately. Tests pass with the stale cache. Users see the old behavior.
Cross-flag dependencies: Flag A controls the new checkout. Flag B controls the new payment provider. Enabling A without B causes a runtime error no one anticipated because the combination was never tested.
Flag state leaked between test runs: A test enables a flag, then fails before cleaning up. The next test runs with a flag state it didn't expect, producing a cascade of false failures.
Production flag state diverges from test flag state: Your CI environment evaluates flags differently than production (different user IDs, different rules, different rollout percentages). The test passes. Production fails.
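The stale-cache failure mode is easy to reproduce in isolation. The sketch below is a minimal illustration, assuming a hypothetical `CachedFlags` wrapper with a time-to-live and an injectable clock; it is not the caching logic of any particular flag SDK.

```python
import time
from typing import Callable, Dict, Tuple


class CachedFlags:
    """Caches flag lookups for ttl seconds (hypothetical helper)."""

    def __init__(self, fetch: Callable[[str], bool], ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._cache: Dict[str, Tuple[float, bool]] = {}

    def is_enabled(self, name: str) -> bool:
        now = self._clock()
        cached = self._cache.get(name)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # may be stale relative to the provider
        value = self._fetch(name)
        self._cache[name] = (now, value)
        return value


# Simulated provider state and a fake clock keep the example deterministic.
provider = {"new-checkout": False}
fake_now = [0.0]
flags = CachedFlags(provider.__getitem__, ttl=30.0, clock=lambda: fake_now[0])

before = flags.is_enabled("new-checkout")  # fetched from provider: False
provider["new-checkout"] = True            # flag enabled at the provider
stale = flags.is_enabled("new-checkout")   # still False, served from cache
fake_now[0] = 31.0                         # TTL expires
fresh = flags.is_enabled("new-checkout")   # refetched: True
```

Injecting the clock lets a test assert both the stale window and the refresh without sleeping.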
The Four Types of Feature Flag Tests
1. Flag State Tests
Verify that the application behaves correctly when a specific flag is enabled or disabled.
// Jest example: testing flag-gated component rendering
// mockFeatureFlag and CheckoutPage are project-specific; import them from your own modules.
import { render, screen } from '@testing-library/react';
import '@testing-library/jest-dom'; // provides toBeInTheDocument()

describe('NewCheckoutFlow', () => {
  it('shows new checkout when flag is enabled', () => {
    mockFeatureFlag('new-checkout', true);
    render(<CheckoutPage />);
    expect(screen.getByTestId('new-checkout-form')).toBeInTheDocument();
  });

  it('shows legacy checkout when flag is disabled', () => {
    mockFeatureFlag('new-checkout', false);
    render(<CheckoutPage />);
    expect(screen.getByTestId('legacy-checkout-form')).toBeInTheDocument();
  });
});
This is the baseline. Both flag states must have tests.
2. Flag Transition Tests
Verify that changing a flag state mid-session doesn't corrupt application state.
# Pytest example: flag changes during an active session
def test_flag_change_doesnt_corrupt_cart(client, user_session):
    # Start with old checkout
    with flag_state('new-checkout', False):
        response = client.post('/cart/add', json={'product_id': 'abc'})
        assert response.status_code == 200

    # Flag changes while session is active
    with flag_state('new-checkout', True):
        response = client.get('/cart')
        assert response.status_code == 200
        # Cart should still have the item; state not lost
        assert len(response.json()['items']) == 1
3. Cross-Flag Combination Tests
For flags that interact, test the critical combinations explicitly:
import pytest

@pytest.mark.parametrize("new_checkout,new_payment", [
    (True, True),    # Both new: happy path
    (True, False),   # New checkout, old payment: should this work?
    (False, True),   # Old checkout, new payment: should this work?
    (False, False),  # Both old: legacy path
])
def test_checkout_payment_combination(client, new_checkout, new_payment):
    with flag_state('new-checkout', new_checkout):
        with flag_state('new-payment-provider', new_payment):
            response = client.post('/checkout/complete', json=test_order())

            # Define expected behavior for each combination
            if new_checkout and not new_payment:
                # This combination shouldn't be enabled simultaneously,
                # but if it is, it should fail gracefully
                assert response.status_code in [200, 422]
            else:
                assert response.status_code == 200
4. Flag Cleanup Tests
Flags accumulate. Test that old flags are being removed, not just disabled:
# Lint step in CI: fail if any flag is older than 90 days and still disabled
helpmetest flags audit --max-age=90d --status=disabled --fail
This prevents flag debt from silently accumulating in your codebase.
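If your tooling does not provide such an audit, the same rule can be approximated in a few lines. The following is a minimal Python sketch, assuming a hypothetical `Flag` record with a creation date and an enabled status; it does not describe how any CLI implements its audit.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class Flag:
    name: str
    created: date
    enabled: bool


def stale_disabled_flags(flags: List[Flag], today: date,
                         max_age_days: int = 90) -> List[str]:
    """Return names of flags that are disabled and older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [f.name for f in flags if not f.enabled and f.created < cutoff]


# Hypothetical inventory; in practice this would come from your flag provider.
inventory = [
    Flag("new-checkout", date(2026, 1, 5), enabled=False),
    Flag("new-dashboard", date(2026, 5, 20), enabled=False),
    Flag("new-payment-provider", date(2025, 11, 1), enabled=True),
]
offenders = stale_disabled_flags(inventory, today=date(2026, 6, 1))
```

A CI step could exit with a non-zero status whenever `offenders` is non-empty.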
E2E Testing with Feature Flags
Browser-level testing is where feature flags get complicated. You need the browser session to evaluate flags correctly, which means choosing one of the following:
- Forcing flag state via an override cookie, header, or query parameter (safest)
- Targeting the test user with the flag in your flag provider (fragile)
- Using a separate test environment with flags pre-configured (slow)
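The first option can be sketched on the server side as follows. This is a minimal illustration, assuming a hypothetical `x-flag-override-<flag>` header convention and an `environment` string supplied by your application; neither is part of any provider's API.

```python
from typing import Mapping, Optional

OVERRIDE_HEADER_PREFIX = "x-flag-override-"  # hypothetical header naming scheme


def resolve_flag(name: str, provider_value: bool,
                 headers: Mapping[str, str], environment: str) -> bool:
    """Return the flag value, honoring a test override outside production."""
    if environment == "production":
        return provider_value  # never trust client-supplied overrides in prod
    wanted = (OVERRIDE_HEADER_PREFIX + name).lower()
    raw: Optional[str] = None
    for key, value in headers.items():
        if key.lower() == wanted:  # header names are case-insensitive
            raw = value.strip().lower()
            break
    if raw in ("true", "false"):
        return raw == "true"
    return provider_value  # no override, or a malformed one
```

Refusing overrides in production is the important design choice here: an override that any client can send is otherwise a way to switch on unreleased features.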
Override via Query Parameter (LaunchDarkly, PostHog)
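Most flag providers offer some override mechanism for testing, but the details differ. A URL override like the one below typically has to be read and applied by your own application, so treat the ld-flag-new-checkout parameter as illustrative and check your provider's documentation: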
*** Test Cases ***
New Checkout Flag Enabled - Full Purchase Flow
    # Force flag state via URL override
    Navigate To    ${BASE_URL}/checkout?ld-flag-new-checkout=true
    Fill Text    [data-testid="card-number"]    4242424242424242
    Fill Text    [data-testid="card-expiry"]    12/27
    Fill Text    [data-testid="card-cvc"]    123
    Click    [data-testid="complete-purchase"]
    Wait For Elements State    [data-testid="order-confirmation"]    visible

New Checkout Flag Disabled - Legacy Purchase Flow
    Navigate To    ${BASE_URL}/checkout?ld-flag-new-checkout=false
    # Assert legacy flow elements exist
    Wait For Elements State    [data-testid="legacy-payment-form"]    visible
Override via API Before Test
For backends that support it, set the flag state via API before the browser test:
*** Test Cases ***
New Dashboard Flag Enabled
    # Set flag for test user
    ${response}=    POST    ${FLAG_API_URL}/override
    ...    json={"flag": "new-dashboard", "user": "e2e-test-user", "value": true}
    # Now run the browser test
    Navigate To    ${BASE_URL}/dashboard
    Wait For Elements State    [data-testid="new-dashboard-header"]    visible
    [Teardown]    Reset Flag Override    new-dashboard    e2e-test-user
The teardown is critical: leaked flag state breaks subsequent tests.
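The same guarantee applies to the `flag_state` helper used in the pytest examples earlier. One way such a helper might look is sketched below, assuming a hypothetical in-memory `FLAGS` store in place of a real provider; the `try`/`finally` is what ensures cleanup when a test fails partway through.

```python
from contextlib import contextmanager
from typing import Dict, Iterator

# Hypothetical in-memory flag store standing in for a real provider or test double.
FLAGS: Dict[str, bool] = {}
_MISSING = object()


@contextmanager
def flag_state(name: str, value: bool) -> Iterator[None]:
    """Set a flag for the duration of a with-block, then restore it."""
    previous = FLAGS.get(name, _MISSING)
    FLAGS[name] = value
    try:
        yield
    finally:
        # Runs even if the test body raises, so state never leaks.
        if previous is _MISSING:
            FLAGS.pop(name, None)
        else:
            FLAGS[name] = previous


def demo_failure_does_not_leak() -> bool:
    """Simulate a failing test and report whether the flag leaked."""
    try:
        with flag_state("new-dashboard", True):
            raise AssertionError("simulated test failure")
    except AssertionError:
        pass
    return "new-dashboard" in FLAGS
```

Restoring the previous value, rather than always deleting the flag, keeps nested `with` blocks for the same flag well-behaved.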
Testing Feature Flags in Different Providers
LaunchDarkly
LaunchDarkly supports targeting rules that let you force flag values for specific user keys. For testing:
// Use a dedicated test user key with explicit flag targeting.
// Sketch for the client-side JavaScript SDK; clientSideId is your environment's client-side ID.
import * as LDClient from 'launchdarkly-js-client-sdk';

const ldClient = LDClient.initialize(clientSideId, {
  kind: 'user',
  key: 'e2e-test-user-do-not-modify',
});

// In the LaunchDarkly dashboard, create a targeting rule:
// "If user key is 'e2e-test-user-do-not-modify', serve true/false"
In your E2E tests, always authenticate as this test user. The targeting rule guarantees consistent flag evaluation regardless of rollout percentage.
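Why explicit targeting is stable while a percentage rollout is not can be shown with a generic model. The sketch below is a simplified illustration, assuming hypothetical `rollout_bucket` and `evaluate` functions; it is not LaunchDarkly's actual evaluation algorithm.

```python
import hashlib
from typing import Dict


def rollout_bucket(flag: str, user_key: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag}:{user_key}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def evaluate(flag: str, user_key: str, targets: Dict[str, bool],
             rollout_percent: int) -> bool:
    """Explicit per-user targets win; everyone else falls into the rollout."""
    if user_key in targets:
        return targets[user_key]
    return rollout_bucket(flag, user_key) < rollout_percent


targets = {"e2e-test-user-do-not-modify": True}
# The targeted test user gets the same value at 0% and at 100% rollout.
at_zero = evaluate("new-checkout", "e2e-test-user-do-not-modify", targets, 0)
at_full = evaluate("new-checkout", "e2e-test-user-do-not-modify", targets, 100)
```

An untargeted user, by contrast, flips as soon as the rollout percentage crosses their bucket, which is exactly the instability E2E tests need to avoid.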
PostHog
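PostHog SDKs fetch flag values from the /decide endpoint. In tests, you can mock that endpoint's response in a fixture to pin flag values. The host and path may differ by region and SDK version, so match the URL your SDK actually calls:
# pytest fixture (requests_mock is provided by the requests-mock plugin)
import pytest

@pytest.fixture
def posthog_flags(requests_mock):
    requests_mock.post(
        'https://app.posthog.com/decide/',
        json={
            'featureFlags': {
                'new-checkout': True,
                'new-dashboard': False,
            }
        }
    )
Unleash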
Unleash supports context-based targeting. For tests, use a dedicated test context:
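// Sketch for the Node SDK (unleash-client), where context is passed per evaluation
const { initialize } = require('unleash-client');

const unleash = initialize({
  url: process.env.UNLEASH_URL,
  appName: 'e2e-tests',
});

// Dedicated context for E2E runs
const testContext = {
  userId: 'e2e-test-user',
  environment: 'test',
};

const newCheckoutEnabled = unleash.isEnabled('new-checkout', testContext);
Define an activation strategy in Unleash (for example, one constrained to environment=test or to this userId) so each toggle always returns a specific value for the test context.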
Setting Up Flag Testing in CI/CD
A CI pipeline for feature-flagged applications should test both flag states:
name: Feature Flag Tests
on: [push, pull_request]
jobs:
  test-flag-off:
    runs-on: ubuntu-latest
    env:
      FEATURE_NEW_CHECKOUT: "false"
    steps:
      - uses: actions/checkout@v3
      - name: Run tests with flag disabled
        run: npx helpmetest run --suite=checkout --env=FEATURE_NEW_CHECKOUT=false
  test-flag-on:
    runs-on: ubuntu-latest
    env:
      FEATURE_NEW_CHECKOUT: "true"
    steps:
      - uses: actions/checkout@v3
      - name: Run tests with flag enabled
        run: npx helpmetest run --suite=checkout --env=FEATURE_NEW_CHECKOUT=true
Both jobs must pass before merge. If the flag-on tests fail, the feature isn't ready to enable. If the flag-off tests fail, you broke the existing behavior.
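On the application side, the environment variable arrives as a string, so it needs careful parsing. The sketch below assumes a hypothetical `env_flag` helper; the accepted spellings are an illustrative choice, not a standard.

```python
import os
from typing import Mapping

_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def env_flag(name: str, default: bool = False,
             environ: Mapping[str, str] = os.environ) -> bool:
    """Parse a boolean flag from an environment variable."""
    raw = environ.get(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    # Fail loudly: a typo like "ture" should not silently disable a feature.
    raise ValueError(f"{name} must be a boolean string, got {raw!r}")
```

Explicit parsing matters because any non-empty string, including "false", is truthy in Python, so a bare `bool(os.environ[...])` would make both CI jobs run with the flag on.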
Preventing Flag Debt
Feature flags that stay in code indefinitely are technical debt. A flag that's been enabled globally for 6 months is just dead code with extra steps.
Build a flag lifecycle into your process:
1. Define expiry when creating a flag
# Feature flag definition
name: new-checkout
description: "New checkout flow with Stripe Elements"
expires: 2026-08-01 # Must be removed or promoted by this date
owner: checkout-team
2. Audit flags in CI
# Fail CI if any flag is past its expiry date
helpmetest flags check-expiry --fail-on-expired
3. Write removal tests
Before removing a flag, write a test that will fail if the flag code is still present:
const { readFileSync } = require('fs');

it('should not contain feature flag code for new-checkout (flag retired)', () => {
  const source = readFileSync('./src/checkout.js', 'utf8');
  expect(source).not.toContain('new-checkout');
  // Only meaningful if no other flags are evaluated in this file
  expect(source).not.toContain('isEnabled(');
});
This forces the team to actually remove the code, not just delete the flag definition.
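The single-file check can be generalized to a whole source tree. The following Python sketch assumes a hypothetical `find_retired_flag_references` helper and a plain substring match on the flag key; a real project may need a stricter pattern to avoid false positives.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def find_retired_flag_references(
    root: Path,
    retired: Iterable[str],
    suffixes: Iterable[str] = (".js", ".jsx", ".ts", ".tsx"),
) -> Dict[str, List[str]]:
    """Map each retired flag key to the source files that still mention it."""
    keys = list(retired)
    allowed = tuple(suffixes)
    hits: Dict[str, List[str]] = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in allowed:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for key in keys:
            if key in text:
                hits.setdefault(key, []).append(str(path.relative_to(root)))
    return hits
```

A CI job could call this with the list of retired flag keys and fail when the returned mapping is non-empty.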
Monitoring Flag-Gated Features in Production
After enabling a flag for all users, you need to know if it breaks anything. HelpMeTest health checks let you monitor specific user flows 24/7:
# Monitor the new checkout flow after global flag rollout
helpmetest health new-checkout-monitor 5m
Configure the health check to perform the complete checkout flow with a test credit card. If it starts failing after you roll out the flag globally, you know within 5 minutes.
Summary: Feature Flag Testing Checklist
Before enabling a flag in production:
- Unit tests for flag-enabled state
- Unit tests for flag-disabled state
- E2E test for flag-enabled happy path
- E2E test for flag-enabled error paths
- Cross-flag combination tests (if applicable)
- Flag expiry date set
- Flag removal test written
- Health check monitoring configured
After global rollout:
- Verify both CI flag-state test jobs still pass
- Check health check dashboards for first 48 hours
- Schedule flag removal review
Feature flags are powerful because they separate deployment from release. Good feature flag testing is what makes that separation safe.