Feature Flag Testing: How to Test Applications with Feature Flags

Feature flags are one of the most powerful tools in modern software deployment. They're also one of the most underestimated sources of bugs.

When you add a feature flag, you don't just add a feature; you add a new dimension of application state. Every boolean flag doubles the number of states your application can be in. With 10 independent boolean flags, that is 2^10 = 1,024 possible combinations. Testing all of them manually isn't practical. Testing none of them is reckless.
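
To make the arithmetic concrete, here is a small sketch that counts the full state space for ten boolean flags and, for comparison, the number of two-flag states you would enumerate if you only looked at pairs. The flag names are invented for illustration.

```python
from itertools import combinations, product

# Hypothetical flag names, used only for illustration.
flags = [f"flag-{i}" for i in range(1, 11)]

# Every on/off assignment across all boolean flags: 2 ** n states.
all_states = list(product([False, True], repeat=len(flags)))

# Each pair of flags in each of its four on/off states: C(n, 2) * 4.
pairwise_states = [
    (a, b, a_on, b_on)
    for a, b in combinations(flags, 2)
    for a_on, b_on in product([False, True], repeat=2)
]

print(len(all_states))       # 1024
print(len(pairwise_states))  # 180
```

The gap between the two numbers is why the rest of this guide focuses on individual flag states and a handful of critical combinations rather than exhaustive coverage.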

This guide covers practical strategies for feature flag testing: how to structure tests, what to automate, and how to prevent flag-related incidents.

Why Feature Flags Break Things

Feature flags feel safe because they give you an off switch. In practice, they introduce failure modes that pure code changes don't have:

Flag evaluation at wrong time: Flags are evaluated at runtime, not deploy time. A flag state can change while a user is mid-session, producing inconsistent behavior within a single user journey.

Stale cache with fresh flag: If your application caches feature flag values, a newly enabled flag may not take effect immediately. Tests pass with stale cache. Users see the old behavior.

Cross-flag dependencies: Flag A controls the new checkout. Flag B controls the new payment provider. Enabling A without B causes a runtime error no one anticipated because the combination was never tested.

Flag state leaked between test runs: A test enables a flag, then fails before cleaning up. The next test runs with a flag state it didn't expect, producing a cascade of false failures.

Production flag state diverges from test flag state: Your CI environment evaluates flags differently than production (different user IDs, different rules, different rollout percentages). The test passes. Production fails.
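
The stale-cache failure mode is easy to reproduce in a unit test. The sketch below is an assumption-laden illustration, not any provider's SDK: `CachedFlags` is a hypothetical wrapper with a time-to-live, and the clock is injected so the test can move time forward without sleeping.

```python
import time
from typing import Callable, Dict, Tuple


class CachedFlags:
    """Caches flag values for ttl_seconds before asking the source again."""

    def __init__(self, fetch: Callable[[str], bool], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[bool, float]] = {}

    def is_enabled(self, name: str) -> bool:
        now = self._clock()
        cached = self._cache.get(name)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]  # may be stale relative to the source
        value = self._fetch(name)
        self._cache[name] = (value, now)
        return value


# Simulate a flag being turned on while the old value is still cached.
source = {"new-checkout": False}
fake_now = [0.0]
flags = CachedFlags(lambda name: source[name], ttl_seconds=30,
                    clock=lambda: fake_now[0])

before = flags.is_enabled("new-checkout")      # fetches and caches False
source["new-checkout"] = True                  # flag enabled upstream
during_ttl = flags.is_enabled("new-checkout")  # served from the cache
fake_now[0] = 31.0                             # TTL has elapsed
after_ttl = flags.is_enabled("new-checkout")   # refetched from the source
```

A test that only checks the value right after enabling the flag would see the cached `False`, which is the behavior users get until the cache expires.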

The Four Types of Feature Flag Tests

1. Flag State Tests

Verify that the application behaves correctly when a specific flag is enabled or disabled.

// Jest example — testing flag-gated component rendering
describe('NewCheckoutFlow', () => {
  it('shows new checkout when flag is enabled', () => {
    mockFeatureFlag('new-checkout', true);
    render(<CheckoutPage />);
    expect(screen.getByTestId('new-checkout-form')).toBeInTheDocument();
  });

  it('shows legacy checkout when flag is disabled', () => {
    mockFeatureFlag('new-checkout', false);
    render(<CheckoutPage />);
    expect(screen.getByTestId('legacy-checkout-form')).toBeInTheDocument();
  });
});

This is the baseline. Both flag states must have tests.

2. Flag Transition Tests

Verify that changing a flag state mid-session doesn't corrupt application state.

# Pytest example — flag changes during active session
def test_flag_change_doesnt_corrupt_cart(client, user_session):
    # Start with old checkout
    with flag_state('new-checkout', False):
        response = client.post('/cart/add', json={'product_id': 'abc'})
        assert response.status_code == 200

    # Flag changes while session is active
    with flag_state('new-checkout', True):
        response = client.get('/cart')
        assert response.status_code == 200
        # Cart should still have the item — state not lost
        assert len(response.json()['items']) == 1

3. Cross-Flag Combination Tests

For flags that interact, test the critical combinations explicitly:

@pytest.mark.parametrize("new_checkout,new_payment", [
    (True, True),    # Both new — happy path
    (True, False),   # New checkout, old payment — should this work?
    (False, True),   # Old checkout, new payment — should this work?
    (False, False),  # Both old — legacy path
])
def test_checkout_payment_combination(client, new_checkout, new_payment):
    with flag_state('new-checkout', new_checkout):
        with flag_state('new-payment-provider', new_payment):
            response = client.post('/checkout/complete', json=test_order())
            # Define expected behavior for each combination
            if new_checkout and not new_payment:
                # This combination shouldn't be enabled simultaneously
                # but if it is, it should fail gracefully
                assert response.status_code in [200, 422]
            else:
                assert response.status_code == 200
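
Combination tests are easier to maintain when the dependency itself is written down. The following sketch assumes a hypothetical `REQUIRES` map in which the new checkout depends on the new payment provider, mirroring the example above; the actual dependencies are specific to your application.

```python
from typing import Dict, List

# Hypothetical dependency map: a flag may be on only if every flag it requires is on.
REQUIRES: Dict[str, List[str]] = {
    "new-checkout": ["new-payment-provider"],
}


def find_violations(state: Dict[str, bool]) -> List[str]:
    """Return a message for each enabled flag whose required flags are off."""
    violations = []
    for flag, required in REQUIRES.items():
        if not state.get(flag, False):
            continue
        for dependency in required:
            if not state.get(dependency, False):
                violations.append(f"{flag} requires {dependency}")
    return violations


print(find_violations({"new-checkout": True, "new-payment-provider": False}))
# ['new-checkout requires new-payment-provider']
```

A check like this could run at application startup or in CI against the intended flag configuration, so an unsupported combination is reported before it reaches a request handler.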

4. Flag Cleanup Tests

Flags accumulate. Test that old flags are being removed, not just disabled:

# Lint step in CI — fail if any flag is older than 90 days and still disabled
helpmetest flags audit --max-age=90d --status=disabled --fail

This prevents flag debt from silently accumulating in your codebase.

E2E Testing with Feature Flags

Browser-level testing is where feature flags get complicated. You need the browser session to evaluate flags correctly, which means choosing one of three approaches:

  1. Forcing flag state via override cookie/header (safest)
  2. Targeting the test user with the flag in your flag provider (fragile)
  3. Using a separate test environment with flags pre-configured (slow)

Override via Query Parameter (LaunchDarkly, PostHog)

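Most flag providers offer some override mechanism for testing, but the details differ. The ld-flag- query parameter in the example below is a convention that assumes your application (or a wrapper around the flag SDK) reads the parameter and applies it; check your provider's documentation for what it supports out of the box: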

*** Test Cases ***
New Checkout Flag Enabled — Full Purchase Flow
    # Force flag state via URL override
    Navigate To    ${BASE_URL}/checkout?ld-flag-new-checkout=true
    Fill Text    [data-testid="card-number"]    4242424242424242
    Fill Text    [data-testid="card-expiry"]    12/27
    Fill Text    [data-testid="card-cvc"]    123
    Click    [data-testid="complete-purchase"]
    Wait For Elements State    [data-testid="order-confirmation"]    visible

New Checkout Flag Disabled — Legacy Purchase Flow
    Navigate To    ${BASE_URL}/checkout?ld-flag-new-checkout=false
    # Assert legacy flow elements exist
    Wait For Elements State    [data-testid="legacy-payment-form"]    visible
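
For a URL override like the one above to work, something in the application has to read it. A minimal sketch of that logic follows, assuming a hypothetical `resolve_flag` function that your request handling calls; the `ld-flag-` prefix matches the test above, and the rule that production ignores overrides is an assumption about what you would want, not a provider feature.

```python
from typing import Mapping

OVERRIDE_PREFIX = "ld-flag-"  # matches the query parameter used in the tests above


def resolve_flag(name: str, provider_value: bool,
                 query_params: Mapping[str, str], environment: str) -> bool:
    """Return the URL override if one is allowed and valid, else the provider's value."""
    if environment == "production":
        return provider_value  # never honor URL overrides in production
    raw = query_params.get(OVERRIDE_PREFIX + name)
    if raw is None:
        return provider_value
    normalized = raw.strip().lower()
    if normalized == "true":
        return True
    if normalized == "false":
        return False
    return provider_value  # ignore malformed overrides


print(resolve_flag("new-checkout", False, {"ld-flag-new-checkout": "true"}, "staging"))
# True
```

Keeping the production guard in one place matters: an override that works in production lets any visitor switch on unreleased features by editing the URL.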

Override via API Before Test

For backends that support it, set the flag state via API before the browser test:

*** Test Cases ***
New Dashboard Flag Enabled
    # Set flag for test user
    ${response}=    POST    ${FLAG_API_URL}/override
    ...    json={"flag": "new-dashboard", "user": "e2e-test-user", "value": true}
    
    # Now run the browser test
    Navigate To    ${BASE_URL}/dashboard
    Wait For Elements State    [data-testid="new-dashboard-header"]    visible
    
    [Teardown]    Reset Flag Override    new-dashboard    e2e-test-user

The teardown is critical — leaked flag state breaks subsequent tests.

Testing Feature Flags in Different Providers

LaunchDarkly

LaunchDarkly supports targeting rules that let you force flag values for specific user keys. For testing:

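// Use a dedicated test user key with explicit flag targeting.
// Assumes the browser SDK (launchdarkly-js-client-sdk), which takes a
// client-side ID and the user at initialization.
const ldClient = LDClient.initialize(clientSideId, {
  key: 'e2e-test-user-do-not-modify'
});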

// In LaunchDarkly dashboard: create targeting rule
// "If user key is 'e2e-test-user-do-not-modify', serve true/false"

In your E2E tests, always authenticate as this test user. The targeting rule guarantees consistent flag evaluation regardless of rollout percentage.
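
To see why an explicit target is more reliable than depending on a rollout percentage, consider a generic model of percentage rollouts. This is an illustration only, not LaunchDarkly's actual bucketing algorithm; `rollout_bucket` and `evaluate` are hypothetical names.

```python
import hashlib
from typing import Dict


def rollout_bucket(flag: str, user_key: str) -> int:
    """Map a flag/user pair to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(f"{flag}:{user_key}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def evaluate(flag: str, user_key: str, rollout_percent: int,
             targets: Dict[str, bool]) -> bool:
    """Explicit per-user targets win; everyone else falls into the rollout."""
    if user_key in targets:
        return targets[user_key]
    return rollout_bucket(flag, user_key) < rollout_percent


targets = {"e2e-test-user-do-not-modify": True}
# The targeted test user gets the same answer at any rollout percentage.
results = [
    evaluate("new-checkout", "e2e-test-user-do-not-modify", pct, targets)
    for pct in (0, 25, 50, 100)
]
```

Without the target, the test user's result would flip from off to on at whatever percentage happens to cross its bucket, which is exactly the kind of change that makes an E2E suite start failing for no visible reason.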

PostHog

PostHog SDKs fetch evaluated flag values from the /decide endpoint. Mocking that endpoint in your test fixtures pins the flag values your code sees:

# pytest fixture
@pytest.fixture
def posthog_flags(requests_mock):
    requests_mock.post(
        'https://app.posthog.com/decide/',
        json={
            'featureFlags': {
                'new-checkout': True,
                'new-dashboard': False,
            }
        }
    )

Unleash

Unleash supports context-based targeting. For tests, use a dedicated test context:

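const { initialize } = require('unleash-client');

const unleash = initialize({
  url: process.env.UNLEASH_URL,
  appName: 'e2e-tests',
});

// The Node SDK takes the context at evaluation time rather than at initialization
const testContext = { userId: 'e2e-test-user', environment: 'test' };
const newCheckoutEnabled = unleash.isEnabled('new-checkout', testContext);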

Define toggle variants in Unleash that always return specific values for environment=test.

Setting Up Flag Testing in CI/CD

A CI pipeline for feature-flagged applications should test both flag states:

name: Feature Flag Tests
on: [push, pull_request]

jobs:
  test-flag-off:
    runs-on: ubuntu-latest
    env:
      FEATURE_NEW_CHECKOUT: "false"
    steps:
      - uses: actions/checkout@v3
      - name: Run tests with flag disabled
        run: npx helpmetest run --suite=checkout --env=FEATURE_NEW_CHECKOUT=false

  test-flag-on:
    runs-on: ubuntu-latest
    env:
      FEATURE_NEW_CHECKOUT: "true"
    steps:
      - uses: actions/checkout@v3
      - name: Run tests with flag enabled
        run: npx helpmetest run --suite=checkout --env=FEATURE_NEW_CHECKOUT=true

Both jobs must pass before merge. If the flag-on tests fail, the feature isn't ready to enable. If the flag-off tests fail, you broke the existing behavior.
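
The workflow above passes the flag as the strings "true" and "false", so the application has to parse them deliberately. In Python, for example, `bool("false")` is `True`, which would silently turn the flag-off job into a second flag-on job. A minimal sketch of a stricter reader, where `env_flag` is a hypothetical helper name:

```python
import os
from typing import Mapping, Optional

_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def env_flag(name: str, default: bool = False,
             environ: Optional[Mapping[str, str]] = None) -> bool:
    """Read a boolean flag from the environment, rejecting unrecognized values."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    raise ValueError(f"{name} must be a boolean string, got {raw!r}")


print(env_flag("FEATURE_NEW_CHECKOUT", environ={"FEATURE_NEW_CHECKOUT": "false"}))
# False
```

Raising on unrecognized values is a design choice: a typo in the workflow file then fails the job loudly instead of running the suite against the wrong flag state.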

Preventing Flag Debt

Feature flags that stay in code indefinitely are technical debt. A flag that's been enabled globally for 6 months is just dead code with extra steps.

Build a flag lifecycle into your process:

1. Define expiry when creating a flag

# Feature flag definition
name: new-checkout
description: "New checkout flow with Stripe Elements"
expires: 2026-08-01  # Must be removed or promoted by this date
owner: checkout-team

2. Audit flags in CI

# Fail CI if any flag is past its expiry date
helpmetest flags check-expiry --fail-on-expired
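
If you want to see what such a check involves, or need one without a dedicated tool, the logic is small. The sketch below assumes flag definitions have already been loaded into a hypothetical `FlagDefinition` record with the same fields as the definition above, and treats a flag as expired only after its expiry date has passed.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class FlagDefinition:
    name: str
    owner: str
    expires: date


def expired_flags(flags: List[FlagDefinition], today: date) -> List[FlagDefinition]:
    """Return flags whose expiry date is strictly before today."""
    return [flag for flag in flags if flag.expires < today]


definitions = [
    FlagDefinition("new-checkout", "checkout-team", date(2026, 8, 1)),
]

for flag in expired_flags(definitions, date(2026, 8, 2)):
    print(f"{flag.name} expired on {flag.expires} (owner: {flag.owner})")
# new-checkout expired on 2026-08-01 (owner: checkout-team)
```

In CI, the script would pass `date.today()` and exit with a non-zero status when the returned list is not empty.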

3. Write removal tests

Before removing a flag, write a test that will fail if the flag code is still present:

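const { readFileSync } = require('fs');

it('should not contain feature flag code for new-checkout (flag retired)', () => {
  const source = readFileSync('./src/checkout.js', 'utf8');
  // Assert on the flag key only: checking for 'isEnabled(' would also fail
  // for unrelated flags that are still legitimately used in this file
  expect(source).not.toContain('new-checkout');
});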

This forces the team to actually remove the code, not just delete the flag definition.
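
A single-file check can miss references elsewhere in the codebase. As a rough sketch of a wider scan, the hypothetical `find_flag_references` below walks a source tree and reports every line that mentions a retired flag key. It is a plain substring search, so it will also match unrelated strings that happen to contain the key.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List, Tuple


def find_flag_references(root: Path, flag_key: str,
                         suffixes: Iterable[str] = (".js", ".ts", ".py")
                         ) -> List[Tuple[Path, int]]:
    """Return (file, line number) for every line under root that mentions flag_key."""
    wanted = tuple(suffixes)
    hits: List[Tuple[Path, int]] = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in wanted:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if flag_key in line:
                hits.append((path, number))
    return hits


# Demonstration against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "checkout.js").write_text("if (isEnabled('new-checkout')) {}\n", encoding="utf-8")
    (root / "cart.js").write_text("// no flags here\n", encoding="utf-8")
    hits = [(path.name, line) for path, line in find_flag_references(root, "new-checkout")]

print(hits)  # [('checkout.js', 1)]
```

Printing file and line number, rather than just failing, gives whoever picks up the removal task a ready-made list of places to clean up.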

Monitoring Flag-Gated Features in Production

After enabling a flag for all users, you need to know if it breaks anything. HelpMeTest health checks let you monitor specific user flows 24/7:

# Monitor the new checkout flow after global flag rollout
helpmetest health new-checkout-monitor 5m

Configure the health check to perform the complete checkout flow with a test credit card. If it starts failing after you roll out the flag globally, you know within 5 minutes.

Summary: Feature Flag Testing Checklist

Before enabling a flag in production:

  • Unit tests for flag-enabled state
  • Unit tests for flag-disabled state
  • E2E test for flag-enabled happy path
  • E2E test for flag-enabled error paths
  • Cross-flag combination tests (if applicable)
  • Flag expiry date set
  • Flag removal test written
  • Health check monitoring configured

After global rollout:

  • Verify both CI flag-state test jobs still pass
  • Check health check dashboards for first 48 hours
  • Schedule flag removal review

Feature flags are powerful because they separate deployment from release. Good feature flag testing is what makes that separation safe.
