Shadow Testing: Mirroring Production Traffic for Safe Testing

HelpMeTest

22 May 2026 — 5 min read

Shadow testing (also called traffic mirroring or dark launching) sends a copy of real production requests to a new version of your service alongside the production version. The new version handles the request but its response is discarded — users only see the response from the stable version.

This lets you validate a new version against real production traffic patterns without any risk to users. You're testing with real data volumes, real request distributions, and real edge cases that your synthetic test suite never covers.

How Shadow Testing Works

Client Request
    │
    ├──► Stable Service (v1) ──► Response to Client
    │
    └──► Shadow Service (v2) ──► Response discarded
                                      │
                                      └──► Logged/compared

The proxy (typically your API gateway, service mesh, or load balancer) duplicates each incoming request. The stable service handles the original and returns a response to the client. The shadow service handles the copy asynchronously — slower responses and errors from the shadow don't affect the client.

Key properties:

Zero user impact — users only see stable service responses
Real traffic — test with actual production request patterns, not synthetic data
Asynchronous — shadow service performance doesn't need to match production
Reversible — disable mirroring to stop shadow testing instantly
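
The flow above can be sketched in a few lines of asyncio. This is a minimal illustration rather than a production proxy: `handle_with_shadow`, `stable`, and `shadow` are hypothetical names, and the two services are modeled as plain async callables instead of real HTTP backends.

```python
import asyncio


async def handle_with_shadow(request, stable, shadow, background_tasks):
    """Serve from stable; send a copy to shadow without waiting for it."""

    async def run_shadow():
        try:
            await shadow(dict(request))  # the shadow gets its own copy
        except Exception:
            pass  # shadow failures are swallowed here; log them in practice

    # Schedule the shadow call and keep a reference so the task is not
    # garbage-collected before it finishes.
    task = asyncio.create_task(run_shadow())
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)

    # Only the stable response is returned to the client.
    return await stable(request)


async def _demo():
    async def stable(req):
        return {"status": 200, "version": "v1"}

    async def shadow(req):
        raise RuntimeError("v2 is broken")

    tasks = set()
    response = await handle_with_shadow({"path": "/api/orders"}, stable, shadow, tasks)
    await asyncio.gather(*tasks)  # demo only: let the shadow finish before exiting
    return response


demo_response = asyncio.run(_demo())
```

Even though the shadow raises, `demo_response` is the stable service's answer, which is the property that makes mirroring safe for users.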

Use Cases

Shadow testing is valuable for:

Database migration validation — mirror traffic to a service using the new schema before cutting over. Compare query results to catch data access regressions.

Infrastructure changes — moving from one cache implementation, message queue, or database to another? Shadow traffic to the new stack and compare behaviors.

Major refactors — a complete service rewrite should behave identically to the original. Shadow testing proves this at scale, with all the edge cases production traffic provides.

New service versions with behavior changes — validate that performance under real load matches expectations before users experience it.

Setting Up Traffic Mirroring

Nginx

Nginx supports request mirroring via the mirror module:

location / {
    # Route to stable service
    proxy_pass http://stable-service;
    
    # Mirror to shadow service
    mirror /shadow;
    mirror_request_body on;
}

location /shadow {
    internal;
    proxy_pass http://shadow-service$request_uri;
    # Short timeouts (in seconds) bound how long a slow shadow can hold the mirror subrequest open
    proxy_connect_timeout 1;
    proxy_send_timeout 1;
    proxy_read_timeout 1;
}

The internal directive prevents direct access to /shadow. mirror_request_body on copies the request body — required for POST/PUT requests.

Istio (Service Mesh)

Istio's VirtualService supports traffic mirroring:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
    - my-service
  http:
    - route:
        - destination:
            host: my-service-stable
            port:
              number: 8080
          weight: 100
      mirror:
        host: my-service-shadow
        port:
          number: 8080
      mirrorPercentage:
        value: 10  # Mirror 10% of traffic to shadow

The mirrorPercentage lets you start with a small percentage of traffic — useful for shadow testing resource-intensive services where 100% mirroring would double your infrastructure cost.
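
If you build mirroring into your own proxy instead, the same idea can be approximated with deterministic sampling. The sketch below is an assumption about how one might do it, not how Istio implements `mirrorPercentage`: `should_mirror` is a hypothetical helper that hashes a request ID so the same request always gets the same decision.

```python
import hashlib


def should_mirror(request_id: str, percent: float) -> bool:
    """Return True for roughly `percent`% of request IDs, deterministically."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a bucket in [0, 10000).
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    return bucket < percent * 100


# Count how many of 10,000 synthetic request IDs would be mirrored at 10%.
sampled = sum(should_mirror(f"req-{i}", 10) for i in range(10_000))
```

Hashing rather than calling a random number generator makes the decision reproducible, which helps when you later need to explain why a particular request was or was not mirrored.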

AWS API Gateway

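API Gateway has no built-in request mirroring. Its weighted (canary) routing splits traffic between versions rather than copying it, so it is not a substitute. A common workaround is a Lambda integration that forwards each request to the stable backend and sends a copy to the shadow endpoint. If CloudFront sits in front of your API, a Lambda@Edge function can do the duplication there instead.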

Comparing Shadow Responses

Mirroring alone is only half the work. The value comes from comparing shadow responses to stable responses.

Log-Based Comparison

The simplest approach: log shadow service responses and compare them offline.

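Add response logging to your shadow service (and the same logging to the stable service, so there is something to compare against):

import json
import logging
import time

from flask import Flask, jsonify, request

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)  # the root logger drops INFO by default

@app.route('/api/orders', methods=['GET'])
def get_orders():
    result = fetch_orders()  # fetch_orders() stands in for the service's own data access

    # Log the response for comparison
    logging.info(json.dumps({
        'endpoint': '/api/orders',
        'request_id': request.headers.get('X-Request-ID'),
        'response': result,
        'timestamp': time.time()
    }))

    return jsonify(result)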

Then run a comparison job that matches request IDs between stable and shadow logs and diffs the responses.
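
A comparison job along those lines might look like the sketch below. It assumes both services write one JSON object per log line with `request_id` and `response` fields (as in the logging example above) and that the proxy passes the same `X-Request-ID` to both. The function names are illustrative.

```python
import json


def index_by_request_id(log_lines):
    """Parse JSON log lines into {request_id: response}, skipping unusable lines."""
    indexed = {}
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON log line
        if isinstance(entry, dict) and entry.get("request_id") is not None:
            indexed[entry["request_id"]] = entry.get("response")
    return indexed


def compare_logs(stable_lines, shadow_lines):
    """Return (mismatches, unmatched_ids) for requests seen in the stable log."""
    stable = index_by_request_id(stable_lines)
    shadow = index_by_request_id(shadow_lines)
    mismatches, unmatched = [], []
    for request_id, stable_response in stable.items():
        if request_id not in shadow:
            unmatched.append(request_id)  # mirrored copy missing or not sampled
        elif shadow[request_id] != stable_response:
            mismatches.append((request_id, stable_response, shadow[request_id]))
    return mismatches, unmatched


stable_log = [
    '{"request_id": "a1", "response": {"total": 3}}',
    '{"request_id": "b2", "response": {"total": 5}}',
    '{"request_id": "c3", "response": {"total": 7}}',
]
shadow_log = [
    '{"request_id": "a1", "response": {"total": 3}}',
    '{"request_id": "b2", "response": {"total": 6}}',
]
mismatches, unmatched = compare_logs(stable_log, shadow_log)
```

Here `b2` is reported as a mismatch and `c3` as unmatched. Tracking unmatched IDs separately matters because a missing shadow response is a different problem from a wrong one.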

Diffy

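Diffy (originally built at Twitter and later open-sourced) is purpose-built for this kind of response comparison. It acts as a proxy: each incoming request is forwarded to the candidate (your shadow version) and to two instances of the stable version, called primary and secondary, and the responses are compared in real time.

# Run Diffy. Flag names follow the Diffy README; check them against the release you deploy.
# Primary and secondary should both run the stable version.
docker run -p 8880:8880 -p 8881:8881 -p 8888:8888 diffy/diffy \
  -candidate=shadow-service:8080 \
  -master.primary=stable-service:8080 \
  -master.secondary=stable-service:8080 \
  -service.protocol=http \
  -serviceName=my-service \
  -proxy.port=:8880 -admin.port=:8881 -http.port=:8888 \
  -rootUrl=localhost:8888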

Diffy tracks which endpoints have differences and at what rate. Access the dashboard to see:

Which endpoints differ between stable and shadow
Sample diffs for investigation
Noise filtering (to ignore irrelevant differences like timestamps)

Shadow Proxy with Response Comparison

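If you need more control, a small custom proxy can send each request to both services and diff the responses:

import asyncio
import logging

import aiohttp
from deepdiff import DeepDiff

logger = logging.getLogger("shadow")

async def fetch_json(session, url, request_data):
    # Decode the JSON body before the connection is released
    async with session.post(url, json=request_data) as resp:
        return await resp.json()

async def shadow_compare(request_data, stable_url, shadow_url):
    async with aiohttp.ClientSession() as session:
        # Send to both simultaneously; exceptions are returned, not raised
        stable_body, shadow_body = await asyncio.gather(
            fetch_json(session, stable_url, request_data),
            fetch_json(session, shadow_url, request_data),
            return_exceptions=True,
        )

    # A stable failure is a real failure: surface it to the caller
    if isinstance(stable_body, BaseException):
        raise stable_body

    # A shadow failure must never affect the client: log it and move on
    if isinstance(shadow_body, BaseException):
        logger.warning("shadow request failed: %r", shadow_body)
        return stable_body

    diff = DeepDiff(stable_body, shadow_body, ignore_order=True)
    if diff:
        # log_difference is your own reporting hook (not shown here)
        log_difference(request_data, stable_body, shadow_body, diff)

    # Note: gather waits for both calls, so add a timeout on the shadow in production
    return stable_body  # Return stable response to client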

Handling Side Effects

Shadow testing breaks down when the service has side effects: sending emails, writing to databases, charging payment methods. A mirrored request will trigger the side effect twice — or cause conflicts between the stable and shadow database state.

Strategies:

Shadow-safe databases — the shadow service writes to a separate database or schema. Responses may differ (different IDs, different timestamps) but behavior can still be compared.

Read-only mirroring — only mirror GET requests, which typically have no side effects. POST/PUT/DELETE requests are excluded.

Stub side effects — configure the shadow service to stub or disable side effects (email sending, payment processing) while keeping the core logic real.

Idempotency — for services where duplicate writes are safe (because of idempotency keys), full mirroring may work. Verify this carefully before enabling.
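
Two of these strategies, read-only mirroring and stubbed side effects, can be sketched as follows. All names here (`should_mirror_method`, `RecordingEmailSender`, `place_order`) are hypothetical and only illustrate the shape of the approach.

```python
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


def should_mirror_method(method: str) -> bool:
    """Read-only mirroring: copy only methods that should not change state."""
    return method.upper() in SAFE_METHODS


class RecordingEmailSender:
    """Stub for the shadow service: records emails instead of sending them."""

    def __init__(self):
        self.outbox = []

    def send(self, to: str, subject: str, body: str) -> None:
        self.outbox.append({"to": to, "subject": subject, "body": body})


def place_order(order: dict, email_sender) -> dict:
    """Core logic stays real; only the side-effecting dependency is swapped."""
    total = sum(item["price"] * item["qty"] for item in order["items"])
    email_sender.send(order["email"], "Order received", f"Total: {total}")
    return {"status": "accepted", "total": total}


stub = RecordingEmailSender()
result = place_order(
    {"email": "user@example.com", "items": [{"price": 5, "qty": 2}]}, stub
)
```

Because the stub records what would have been sent, the outbox itself can be compared against what the stable service actually sent, which keeps the side-effecting path under test without triggering it twice.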

What to Look For

When comparing shadow and stable responses:

Response structure differences: fields added, removed, or renamed in the shadow version. These are API contract changes that need documentation or client updates.

Value differences: same fields, different values. These often indicate a bug in the new version's business logic.

Error rate differences: a higher 5xx rate from the shadow service than from stable points to a regression.

Performance differences: if shadow requests take far longer than stable ones (10x, for example), the new version has performance problems that users would notice after cutover.

Noise: timestamps, request IDs, and other inherently non-deterministic values will always differ. Configure your comparison to ignore these fields, or the real signal gets buried.
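
To separate the first two categories automatically, a comparison can report structural changes and value changes as distinct lists. The sketch below is one possible approach using only the standard library. `classify_differences` is a hypothetical helper that assumes responses are decoded JSON objects with string keys.

```python
def classify_differences(stable: dict, shadow: dict, path: str = "") -> dict:
    """Split differences into structural changes and value changes."""
    report = {"added": [], "removed": [], "changed": []}
    for key in sorted(stable.keys() | shadow.keys()):
        field = f"{path}.{key}" if path else str(key)
        if key not in stable:
            report["added"].append(field)      # structure: new field in shadow
        elif key not in shadow:
            report["removed"].append(field)    # structure: field dropped by shadow
        elif isinstance(stable[key], dict) and isinstance(shadow[key], dict):
            nested = classify_differences(stable[key], shadow[key], field)
            for kind in report:
                report[kind].extend(nested[kind])
        elif stable[key] != shadow[key]:
            report["changed"].append(field)    # value: same field, different content
    return report


stable_body = {"id": 1, "total": 10.0, "customer": {"name": "Ada", "tier": "gold"}}
shadow_body = {"id": 1, "total": 12.0, "customer": {"name": "Ada"}, "currency": "USD"}
report = classify_differences(stable_body, shadow_body)
```

In this example `currency` is reported as added, `customer.tier` as removed, and `total` as changed, so contract changes and logic bugs end up in separate buckets for triage.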

Noise Filtering

Not all differences are bugs. Common sources of noise:

Timestamps and created-at fields
UUIDs and auto-generated IDs
Session tokens and nonces
Order of items in unordered collections
Floating-point rounding differences

Configure your comparison to ignore these:

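diff = DeepDiff(
    stable_body,
    shadow_body,
    ignore_order=True,  # unordered collections
    exclude_paths=[
        "root['created_at']",
        "root['request_id']",
        "root['timestamp']",
    ],
    ignore_numeric_type_changes=True,  # treat 1 and 1.0 as the same value
    significant_digits=6,  # tolerate floating-point rounding differences
)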

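Diffy has built-in noise reduction: because it sends each request to two instances of the stable version as well as to the candidate, it can measure how often the two stable responses differ from each other and discount candidate differences that occur at a similar rate.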
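
The `exclude_paths` entries in the example above name root-level fields. When noisy fields are nested inside lists or sub-objects, one option is to strip them from both bodies before diffing. The following is a minimal sketch of that idea; `scrub` and `NOISY_KEYS` are hypothetical names, and the key list is an assumption you would adapt to your own payloads.

```python
NOISY_KEYS = {"created_at", "request_id", "timestamp"}


def scrub(value, noisy_keys=NOISY_KEYS):
    """Return a copy of a decoded JSON value with noisy keys removed at any depth."""
    if isinstance(value, dict):
        return {k: scrub(v, noisy_keys) for k, v in value.items() if k not in noisy_keys}
    if isinstance(value, list):
        return [scrub(item, noisy_keys) for item in value]
    return value


stable_body = {
    "request_id": "a",
    "items": [{"sku": "x", "created_at": "2026-05-22T10:00:00Z"}],
}
shadow_body = {
    "request_id": "b",
    "items": [{"sku": "x", "created_at": "2026-05-22T10:00:01Z"}],
}
same_after_scrub = scrub(stable_body) == scrub(shadow_body)
```

The trade-off is that key-based scrubbing is blunt: a field named `timestamp` is dropped everywhere, including places where it might carry business meaning, so keep the list short and explicit.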

Metrics to Track During Shadow Testing

Difference rate — what percentage of requests produce different responses? At launch, expect some differences; over time, this should trend toward zero as you fix discrepancies.

Shadow error rate — how often does the shadow service return 5xx? Compare to stable error rate.

Shadow latency — p50, p95, p99 latency for shadow service. Even though users don't see it, high latency could indicate performance issues that will affect users after cutover.

Difference by endpoint — which API endpoints have the most differences? These are the highest-risk paths for cutover.
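
As a rough illustration, these four metrics can be computed from per-request comparison records. The record fields below (`endpoint`, `differs`, `shadow_status`, `shadow_latency_ms`) are an assumed schema, not something a particular tool emits.

```python
import statistics


def summarize(records):
    """Summarize shadow results; each record describes one mirrored request."""
    total = len(records)
    latencies = [r["shadow_latency_ms"] for r in records]
    # quantiles(n=100) returns 99 cut points: index 49 is p50, 94 is p95, 98 is p99.
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    by_endpoint = {}
    for r in records:
        if r["differs"]:
            by_endpoint[r["endpoint"]] = by_endpoint.get(r["endpoint"], 0) + 1
    return {
        "difference_rate": sum(r["differs"] for r in records) / total,
        "shadow_error_rate": sum(r["shadow_status"] >= 500 for r in records) / total,
        "latency_ms": {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]},
        "differences_by_endpoint": by_endpoint,
    }


records = [
    {"endpoint": "/api/orders", "differs": False, "shadow_status": 200, "shadow_latency_ms": 40},
    {"endpoint": "/api/orders", "differs": True, "shadow_status": 200, "shadow_latency_ms": 55},
    {"endpoint": "/api/users", "differs": False, "shadow_status": 200, "shadow_latency_ms": 30},
    {"endpoint": "/api/users", "differs": True, "shadow_status": 503, "shadow_latency_ms": 900},
]
summary = summarize(records)
```

In practice these numbers would come from your metrics system rather than an in-memory list, but the definitions stay the same.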

Cutting Over After Shadow Testing

Shadow testing is a validation step before cutover, not a replacement for it. You're ready to move traffic when:

Difference rate is below your threshold (e.g., < 0.1%)
Shadow error rate matches stable error rate
Shadow latency is acceptable
All high-difference endpoints have been investigated and either fixed or accepted

Then perform the cutover — either a canary rollout (progressively shift traffic) or a full swap (flip the load balancer). Shadow testing gives you confidence; the cutover is still a live operation that needs monitoring.
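
The readiness criteria above can be encoded as an explicit gate so the decision is repeatable. This is a sketch under assumptions: the 0.1% difference threshold comes from the example above, while the error-rate tolerance and the p99 latency limit are placeholder values you would set yourself.

```python
from dataclasses import dataclass


@dataclass
class ShadowMetrics:
    difference_rate: float      # fraction of requests with differing responses
    shadow_error_rate: float    # fraction of shadow responses that were 5xx
    stable_error_rate: float    # fraction of stable responses that were 5xx
    shadow_p99_ms: float
    open_investigations: int    # high-difference endpoints not yet fixed or accepted


def cutover_blockers(m, max_difference_rate=0.001, error_rate_tolerance=0.0005,
                     max_p99_ms=500.0):
    """Return the reasons cutover should wait; an empty list means ready."""
    blockers = []
    if m.difference_rate >= max_difference_rate:
        blockers.append("difference rate above threshold")
    if m.shadow_error_rate > m.stable_error_rate + error_rate_tolerance:
        blockers.append("shadow error rate exceeds stable")
    if m.shadow_p99_ms > max_p99_ms:
        blockers.append("shadow p99 latency too high")
    if m.open_investigations > 0:
        blockers.append("unresolved high-difference endpoints")
    return blockers


ready = cutover_blockers(ShadowMetrics(0.0004, 0.0010, 0.0010, 320.0, 0))
not_ready = cutover_blockers(ShadowMetrics(0.02, 0.0010, 0.0010, 320.0, 2))
```

Returning the list of blockers rather than a single boolean makes the gate useful in a deploy pipeline, where the failure message should say what still needs attention.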

Shadow testing is most powerful for infrastructure-level changes where you can't easily predict all the ways production traffic will stress the new system. Synthetic test suites test what you think will happen. Shadow testing tests what actually happens.