Shift-Left with Feature Flags and Dark Launches: Testing in Production Safely

The most realistic test environment is production. Staging environments, no matter how carefully maintained, diverge from production over time. Real user traffic patterns are different from synthetic load tests. Real data has shapes that test fixtures do not replicate. Real infrastructure behaves differently at scale.

Feature flags and dark launches are the techniques that let you test in production while tightly limiting how many users are exposed to risk. They shift quality validation left into the deployment process itself, rather than treating deployment as a binary switch that either works or causes an incident.

What Feature Flags Actually Do

A feature flag is a conditional in your code that decides whether a user sees a new behaviour. At its simplest:

if (featureFlags.isEnabled('new-checkout-flow', userId)) {
  return renderNewCheckout();
} else {
  return renderLegacyCheckout();
}

The flag state is controlled outside the code — in a dashboard, via an API, or through a configuration service. This decouples deployment from release. You deploy the new checkout code on Monday. You release it to users when you are confident it works.

That gap between deployment and release is where shift-left testing happens in production.
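
To make the deploy/release split concrete, here is a minimal sketch of a flag store whose state lives outside the application code. `FlagStore`, its JSON layout, and the `allow_users` field are hypothetical names chosen for illustration; a real flag service would fetch and refresh this configuration over the network.

```python
import json


class FlagStore:
    """Hypothetical in-memory flag store; a real one would poll a flag service."""

    def __init__(self, raw_config: str) -> None:
        self._flags = json.loads(raw_config)

    def reload(self, raw_config: str) -> None:
        # Changing flag state is a configuration update, not a code deployment.
        self._flags = json.loads(raw_config)

    def is_enabled(self, flag: str, user_id: str) -> bool:
        # Unknown flags resolve to an empty rule, so they default to off.
        rule = self._flags.get(flag, {})
        return bool(rule.get("enabled", False)) or user_id in rule.get("allow_users", [])


# Monday: the code is deployed, but the flag keeps the feature dark.
flags = FlagStore('{"new-checkout-flow": {"enabled": false}}')
deployed_but_dark = flags.is_enabled("new-checkout-flow", "user-42")

# Later: the release is a config change only, with the same code running.
flags.reload('{"new-checkout-flow": {"enabled": true}}')
released = flags.is_enabled("new-checkout-flow", "user-42")
```

The application code never changes between the two evaluations; only the configuration it reads does.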

Dark Launches: Shipping Code Before Exposing It

A dark launch deploys a new code path and runs it in production but does not show results to users. The new path executes, logs, and measures — but the output is discarded and the old path still serves users.

This pattern is valuable for validating backend behaviour at production scale:

def get_product_recommendations(user_id: str) -> list:
    # Legacy algorithm — serves the user
    legacy_results = legacy_recommender.get(user_id)

    # New algorithm — runs in shadow, result discarded
    if feature_flags.is_enabled('shadow-recommender', user_id):
        try:
            new_results = ml_recommender.get(user_id)
            metrics.record('shadow_recommender', {
                'latency_ms': new_results.latency,
                'result_count': len(new_results.items),
                'overlap_with_legacy': overlap(legacy_results, new_results)
            })
        except Exception as e:
            metrics.record('shadow_recommender_error', {'error': str(e)})

    return legacy_results

The new ML recommender runs against production data at production scale. You collect latency metrics, result quality metrics, and error rates. When those metrics meet your thresholds, you flip the flag and serve the new results.
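
The promotion decision can be sketched roughly as follows. This assumes results are plain lists of item IDs (unlike the richer result object in the snippet above), and the threshold values, `ShadowStats`, and `ready_to_promote` are illustrative assumptions rather than recommended numbers or a real API.

```python
from dataclasses import dataclass


def overlap(legacy_items: list, new_items: list) -> float:
    """Fraction of distinct legacy items that also appear in the new results."""
    legacy_set = set(legacy_items)
    if not legacy_set:
        return 0.0
    return len(legacy_set & set(new_items)) / len(legacy_set)


@dataclass
class ShadowStats:
    """Aggregated shadow-run metrics (hypothetical shape)."""
    p95_latency_ms: float
    error_rate: float
    mean_overlap: float


def ready_to_promote(stats: ShadowStats) -> bool:
    # Thresholds are placeholders; pick values that match your own SLOs.
    return (
        stats.p95_latency_ms <= 200
        and stats.error_rate <= 0.001
        and stats.mean_overlap >= 0.5
    )


stats = ShadowStats(p95_latency_ms=150, error_rate=0.0005, mean_overlap=0.6)
decision = ready_to_promote(stats)
```

Writing the thresholds down before the shadow run starts keeps the decision to flip the flag from being made after the fact.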

You have tested the new system against real production data and traffic before any user sees its output.

Canary Deployments: Gradual Traffic Shifting

A canary deployment routes a small percentage of traffic to the new version of a service before shifting all traffic. The name comes from canaries in coal mines — an early warning system.

Feature flags implement canary deployments at the application layer:

// Route 1% of users to the new payment processor
const useNewPaymentProcessor = featureFlags.isEnabledForPercentage(
  'new-payment-processor',
  userId,
  { percentage: 1 }
);

At 1%, a failure affects one in a hundred users. You observe error rates, latency, and conversion metrics. If they look good, increase to 5%, then 20%, then 100%. If they look bad, flip the flag back. No rollback deployment required.

This approach is fundamentally shift-left: you are catching production issues before they affect most users, using real production traffic as your test signal.
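
The ramp-up logic described above could be sketched like this. `RAMP_STEPS`, `next_percentage`, and the 1% error budget are assumptions for illustration; in practice the decision would usually also consider latency and conversion.

```python
# Rollout steps taken from the text: 1% -> 5% -> 20% -> 100%.
RAMP_STEPS = [1, 5, 20, 100]


def next_percentage(current: int, error_rate: float, max_error_rate: float = 0.01) -> int:
    """Return the next rollout percentage, or 0 to flip the flag off."""
    if error_rate > max_error_rate:
        # Metrics look bad: turn the flag off instead of redeploying.
        return 0
    later_steps = [step for step in RAMP_STEPS if step > current]
    # Stay at the current value once the final step is reached.
    return later_steps[0] if later_steps else current


healthy_step = next_percentage(current=1, error_rate=0.002)
unhealthy_step = next_percentage(current=5, error_rate=0.05)
```

Here `healthy_step` advances to the next step in the list, while `unhealthy_step` drops to 0, which corresponds to flipping the flag back.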

Targeting strategies

Different feature flag platforms offer different targeting strategies:

  • Percentage rollout: Random sample of users, controlled by percentage
  • User attribute targeting: Beta users, specific company accounts, specific geographies
  • Ring deployment: Internal users → beta users → early adopters → general availability
  • A/B testing: Two or more variants, commonly split evenly, compared on conversion metrics

LaunchDarkly, Unleash, and Flagsmith all support these strategies with different tradeoffs.
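
As an illustration of ring deployment, the sketch below assigns each user to a ring and enables the feature for every ring up to the one currently released. The ring names, the `@example.com` domain check, and the user dictionary fields are hypothetical and not tied to any of the platforms named above.

```python
# Rings ordered from the smallest blast radius to the largest.
RINGS = ["internal", "beta", "early_adopter", "general"]


def ring_of(user: dict) -> str:
    """Classify a user into a ring based on assumed attributes."""
    if user.get("email", "").endswith("@example.com"):
        return "internal"
    if user.get("beta"):
        return "beta"
    if user.get("early_adopter"):
        return "early_adopter"
    return "general"


def is_enabled_for(user: dict, released_ring: str) -> bool:
    # A feature released to a ring is also on for every earlier ring.
    return RINGS.index(ring_of(user)) <= RINGS.index(released_ring)


employee = {"email": "dev@example.com"}
beta_customer = {"email": "a@customer.test", "beta": True}
regular_customer = {"email": "b@customer.test"}

beta_sees_internal_release = is_enabled_for(beta_customer, "internal")
beta_sees_beta_release = is_enabled_for(beta_customer, "beta")
```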

Kill Switches: The Safety Net

A kill switch is a feature flag whose primary purpose is to turn off a problematic feature without a deployment. Every significant feature that carries production risk should have one.

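from flask import Flask, jsonify, request

app = Flask(__name__)

# feature_flags and generate_export are assumed to be defined elsewhere
@app.route('/api/export')
def export_data():
    if not feature_flags.is_enabled('data-export-feature'):
        return jsonify({'error': 'Feature temporarily unavailable'}), 503

    # Export logic: potentially expensive, risk of timeout
    return generate_export(request.args)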

When the export feature causes unexpected database load at 2am on a Friday, you disable the flag from the dashboard. The feature is off in seconds. No on-call deployment. No rollback. No incident escalation.

Kill switches shift the risk management left by building the off-switch before you need it, rather than scrambling for one during an incident.
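
If many endpoints need an off-switch, the check can be factored into a decorator. This is a framework-free sketch: `ENABLED_FLAGS` stands in for a real flag service, and `kill_switch` is a hypothetical helper rather than part of any SDK.

```python
from functools import wraps

# Hypothetical stand-in for flag state held in an external flag service.
ENABLED_FLAGS = {"data-export-feature"}


def kill_switch(flag: str):
    """Wrap a handler so it returns 503 whenever the flag is off."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            # Evaluated on every call, so a flag change applies to the next request.
            if flag not in ENABLED_FLAGS:
                return {"error": "Feature temporarily unavailable"}, 503
            return handler(*args, **kwargs)
        return wrapper
    return decorator


@kill_switch("data-export-feature")
def export_data():
    return {"status": "export started"}, 200
```

Removing `"data-export-feature"` from `ENABLED_FLAGS` makes the next call to `export_data()` return the 503 response without touching the handler itself.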

LaunchDarkly Patterns

LaunchDarkly is the market leader for enterprise feature flags. Its SDK pattern is consistent across languages:

import os

import ldclient
from ldclient.config import Config

ldclient.set_config(Config(os.environ['LAUNCHDARKLY_SDK_KEY']))
ld_client = ldclient.get()

# Evaluate a flag for a user
context = ldclient.Context.builder(user_id) \
    .set('email', user_email) \
    .set('plan', user_plan) \
    .build()

if ld_client.variation('new-billing-flow', context, False):
    return handle_new_billing()
else:
    return handle_legacy_billing()

Key LaunchDarkly patterns for shift-left testing:

Targeting rules for internal testing: Create a segment for internal users (email domain match or explicit user IDs). Enable new features for internal users weeks before external rollout. Your team becomes the canary.

Experimentation flags: LaunchDarkly's experimentation layer tracks conversion events against flag variants. You can measure whether the new checkout flow has a higher conversion rate than the old one before deciding to roll it out fully.

Flag cleanup governance: Flags accumulate technical debt. LaunchDarkly's Code References feature scans your codebase to find which flags are still referenced and which are dead. Set a policy: flags older than 90 days with 100% rollout get cleaned up.
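
The 90-day policy can be expressed as a small check over flag metadata. The sketch below assumes you can export each flag's creation date and rollout percentage; `FlagRecord` and `stale_flags` are hypothetical names and do not correspond to a LaunchDarkly API.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class FlagRecord:
    """Flag metadata as it might be exported from a flag platform (assumed shape)."""
    key: str
    created: date
    rollout_percentage: int


def stale_flags(flags: list, today: date, max_age_days: int = 90) -> list:
    """Return keys of flags that are fully rolled out and older than the cutoff."""
    cutoff = today - timedelta(days=max_age_days)
    return [
        flag.key
        for flag in flags
        if flag.rollout_percentage == 100 and flag.created < cutoff
    ]


inventory = [
    FlagRecord("new-billing-flow", date(2024, 1, 10), 100),
    FlagRecord("new-checkout-flow", date(2024, 1, 10), 50),
    FlagRecord("shadow-recommender", date(2024, 5, 1), 100),
]
to_clean = stale_flags(inventory, today=date(2024, 6, 1))
```

Only `new-billing-flow` meets both conditions in this sample inventory; the other two are either still ramping or too recent.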

Unleash: The Open-Source Alternative

Unleash is the leading open-source feature flag platform. For teams that prefer to self-host or avoid per-seat pricing, it covers most enterprise flag patterns:

const { initialize } = require('unleash-client');

const unleash = initialize({
  url: 'https://unleash.internal.example.com/api/',
  appName: 'checkout-service',
  customHeaders: { Authorization: process.env.UNLEASH_API_TOKEN },
});

// Gradual rollout strategy
if (unleash.isEnabled('new-checkout', { userId })) {
  return renderNewCheckout();
}

Unleash's gradual rollout strategy uses MurmurHash on the user ID to ensure consistent assignment — the same user always gets the same variant, even across sessions and services.
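
The idea behind consistent assignment can be sketched with a stable hash. Note that this example swaps in SHA-256 from Python's standard library purely for illustration; it is not Unleash's MurmurHash-based implementation and will not produce the same buckets.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket from 1 to 100."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100 + 1


def in_rollout(flag: str, user_id: str, percentage: int) -> bool:
    # A user is included once the rollout percentage reaches their bucket,
    # so raising the percentage never removes users who were already included.
    return bucket(flag, user_id) <= percentage


first_call = bucket("new-checkout", "user-42")
second_call = bucket("new-checkout", "user-42")
```

Because the bucket depends only on the flag name and user ID, any service that runs the same function reaches the same decision without shared state.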

Integrating Feature Flags With Your Test Suite

Feature flags create a testing challenge: your tests need to cover both flag states. A test that only validates the flag-on state will miss regressions in the fallback path.

The solution is to parameterise your tests:

*** Test Cases ***
Checkout Works With New Flow Enabled
    [Setup]    Enable Feature Flag    new-checkout-flow    ${TEST_USER}
    Given the user has items in their cart
    When they complete checkout
    Then the order confirmation is displayed
    [Teardown]    Disable Feature Flag    new-checkout-flow    ${TEST_USER}

Checkout Works With Legacy Flow
    [Setup]    Disable Feature Flag    new-checkout-flow    ${TEST_USER}
    Given the user has items in their cart
    When they complete checkout
    Then the order confirmation is displayed
    [Teardown]    Enable Feature Flag    new-checkout-flow    ${TEST_USER}

With HelpMeTest, you can schedule both test variants to run on a regular cadence. This ensures neither path regresses while a gradual rollout is in progress.
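
The same both-states idea carries over to unit tests. The sketch below uses `unittest` sub-tests to run one assertion under each flag state; `checkout` is a hypothetical stand-in for the code under test, with the flag passed in as a parameter.

```python
import unittest


def checkout(cart: list, new_flow_enabled: bool) -> dict:
    """Hypothetical checkout function; the flag selects the code path."""
    flow = "new" if new_flow_enabled else "legacy"
    return {"status": "confirmed", "flow": flow, "items": len(cart)}


class CheckoutFlagStates(unittest.TestCase):
    def test_checkout_confirms_in_both_flag_states(self):
        # Run the same expectation with the flag on and with it off.
        for enabled in (True, False):
            with self.subTest(new_flow_enabled=enabled):
                result = checkout(["book"], new_flow_enabled=enabled)
                self.assertEqual(result["status"], "confirmed")
                self.assertEqual(result["items"], 1)
```

Passing the flag state as an argument, rather than reading it from a global client inside the function, is what makes both paths easy to exercise in one test.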

The Deployment Pipeline With Flags

A shift-left deployment pipeline with feature flags looks like:

  1. Deploy: New code ships to production with flag off. Zero user impact.
  2. Internal canary: Enable for internal users. Real production validation with no customer risk.
  3. Dark launch (if applicable): Run new backend path in shadow mode. Collect metrics.
  4. 1% rollout: Enable for 1% of external users. Monitor error rates and conversion.
  5. Gradual increase: 5% → 20% → 50% → 100%, with monitoring at each step.
  6. Flag removal: Once fully rolled out and stable, remove the flag from code and clean up.

Each stage is a testing opportunity. Each stage catches different failure modes. By the time you reach 100%, the feature has been validated against production data, production traffic, and production infrastructure.

This is shift-left applied to production — not moving testing to an earlier environment, but moving the risk of deployment to an earlier point in the rollout where the blast radius is small.
