Shadow Testing: Mirroring Production Traffic for Safe Testing
Shadow testing (also called traffic mirroring or dark launching) sends a copy of real production requests to a new version of your service alongside the production version. The new version handles the request but its response is discarded — users only see the response from the stable version.
This lets you validate a new version against real production traffic patterns with very little risk to users, provided mirrored requests cannot trigger real side effects (covered below). You're testing with real data volumes, real request distributions, and real edge cases that your synthetic test suite never covers.
How Shadow Testing Works
    Client Request
         │
         ├──► Stable Service (v1) ──► Response to Client
         │
         └──► Shadow Service (v2) ──► Response discarded
                                           │
                                           └──► Logged/compared

The proxy (typically your API gateway, service mesh, or load balancer) duplicates each incoming request. The stable service handles the original and returns a response to the client. The shadow service handles the copy asynchronously, so slower responses and errors from the shadow don't affect the client.
Key properties:
- Zero user impact — users only see stable service responses
- Real traffic — test with actual production request patterns, not synthetic data
- Asynchronous — shadow service performance doesn't need to match production
- Reversible — disable mirroring to stop shadow testing instantly
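The "zero user impact" and "asynchronous" properties can be sketched in a few lines of Python. This is a minimal illustration, not how any particular proxy implements mirroring: `mirror_request`, `stable_handler`, and `shadow_handler` are hypothetical names, and the handlers are plain callables standing in for real network calls.

```python
from concurrent.futures import ThreadPoolExecutor
import logging

logger = logging.getLogger("mirror")
_shadow_pool = ThreadPoolExecutor(max_workers=4)


def _run_shadow(shadow_handler, request):
    """Call the shadow and swallow every failure so it never reaches the client."""
    try:
        return shadow_handler(request)
    except Exception as exc:  # shadow errors are logged, never raised
        logger.warning("shadow failed: %r", exc)
        return None


def mirror_request(request, stable_handler, shadow_handler, pool=_shadow_pool):
    """Return the stable response; run the shadow copy in the background."""
    # Shallow-copy the request so the shadow cannot mutate what stable sees.
    shadow_future = pool.submit(_run_shadow, shadow_handler, dict(request))
    stable_response = stable_handler(request)
    # The future is returned only so a comparison step can inspect it later.
    return stable_response, shadow_future
```

The caller answers the client with the first element of the tuple and never waits on the future, which is what keeps a slow or failing shadow invisible to users.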
Use Cases
Shadow testing is valuable for:
Database migration validation — mirror traffic to a service using the new schema before cutting over. Compare query results to catch data access regressions.
Infrastructure changes — moving from one cache implementation, message queue, or database to another? Shadow traffic to the new stack and compare behaviors.
Major refactors — a complete service rewrite should behave identically to the original. Shadow testing proves this at scale, with all the edge cases production traffic provides.
New service versions with behavior changes — validate that performance under real load matches expectations before users experience it.
Setting Up Traffic Mirroring
Nginx
Nginx supports request mirroring via the mirror module:
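    location / {
        # Route to stable service
        proxy_pass http://stable-service;

        # Mirror each request to the internal /shadow location
        mirror /shadow;
        mirror_request_body on;
    }

    location /shadow {
        internal;
        proxy_pass http://shadow-service$request_uri;

        # Keep timeouts short (values are in seconds) so a slow shadow
        # is abandoned quickly
        proxy_connect_timeout 1;
        proxy_send_timeout 1;
        proxy_read_timeout 1;
    }

The internal directive prevents direct access to /shadow. mirror_request_body on copies the request body, which POST/PUT requests need; it is already the default and is shown here for clarity. Nginx ignores the shadow's response, but the mirror runs as a subrequest of the original request, so a slow shadow can still hold up the client connection. The short timeouts bound that delay.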
Istio (Service Mesh)
Istio's VirtualService supports traffic mirroring:
    apiVersion: networking.istio.io/v1beta1
    kind: VirtualService
    metadata:
      name: my-service
    spec:
      hosts:
        - my-service
      http:
        - route:
            - destination:
                host: my-service-stable
                port:
                  number: 8080
              weight: 100
          mirror:
            host: my-service-shadow
            port:
              number: 8080
          mirrorPercentage:
            value: 10  # Mirror 10% of traffic to shadow

The mirrorPercentage lets you start with a small percentage of traffic, which is useful when shadow testing resource-intensive services where 100% mirroring would roughly double your infrastructure cost.
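If your proxy has no equivalent of mirrorPercentage, partial mirroring can be approximated in application code. The sketch below is an assumption-laden illustration, not Istio's mechanism: `should_mirror` is a hypothetical helper that hashes a request ID so the same request always gets the same decision.

```python
import hashlib


def should_mirror(request_id: str, percentage: float) -> bool:
    """Decide deterministically whether a request falls in the mirrored share."""
    if not 0.0 <= percentage <= 100.0:
        raise ValueError("percentage must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes to a bucket in [0, 10000) for 0.01% resolution.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percentage * 100
```

Hashing instead of drawing a random number makes the decision reproducible, which helps when you later need to explain why a given request was or was not mirrored.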
AWS API Gateway
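API Gateway has no dedicated mirroring setting comparable to Nginx's mirror directive, so mirroring is usually built around it. A common approach is a Lambda integration that forwards each request to the stable backend and sends a copy to the shadow endpoint. Weighted routing splits traffic between targets rather than duplicating it, so on its own it gives you a canary, not a shadow. If CloudFront sits in front of the API, a Lambda@Edge function can duplicate requests to a shadow endpoint.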
Comparing Shadow Responses
Mirroring alone is only half the work. The value comes from comparing shadow responses to stable responses.
Log-Based Comparison
The simplest approach: log shadow service responses and compare them offline.
Add to your shadow service:
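    import json
    import logging
    import time

    from flask import Flask, jsonify, request

    app = Flask(__name__)
    # The root logger defaults to WARNING; enable INFO so the lines are emitted
    logging.basicConfig(level=logging.INFO)

    @app.route('/api/orders', methods=['GET'])
    def get_orders():
        result = fetch_orders()  # your existing data-access function
        # Log the response for comparison
        logging.info(json.dumps({
            'endpoint': '/api/orders',
            'request_id': request.headers.get('X-Request-ID'),
            'response': result,
            'timestamp': time.time()
        }))
        return jsonify(result)

Then run a comparison job that matches request IDs between stable and shadow logs and diffs the responses. This only works if the proxy sets the same X-Request-ID on the original and the mirrored request, and if the stable service logs its responses in the same format.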
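A comparison job of the kind described above might look like the following sketch. It assumes both services write one JSON object per log line with `request_id` and `response` keys, as in the logging example; `index_by_request_id` and `compare_logs` are hypothetical helper names.

```python
import json


def index_by_request_id(log_lines):
    """Parse JSON log lines and key them by request_id, skipping unusable lines."""
    indexed = {}
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a structured log line
        if isinstance(record, dict) and record.get("request_id") is not None:
            indexed[record["request_id"]] = record
    return indexed


def compare_logs(stable_lines, shadow_lines):
    """Match stable and shadow records by request ID and report mismatches."""
    stable = index_by_request_id(stable_lines)
    shadow = index_by_request_id(shadow_lines)
    matched = stable.keys() & shadow.keys()
    mismatched = sorted(
        rid for rid in matched
        if stable[rid].get("response") != shadow[rid].get("response")
    )
    return {
        "matched": len(matched),
        "mismatched_ids": mismatched,
        "missing_in_shadow": sorted(stable.keys() - shadow.keys()),
    }
```

Requests that appear only in the stable log are reported separately, because a shadow that silently drops requests is a finding in its own right.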
Diffy
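Diffy (originally from Twitter, now open source) is purpose-built for this kind of response comparison. It acts as a proxy: it receives each request, forwards it to the stable and the new version, and compares the responses in real time. Diffy's documented setup also runs a second instance of the stable code, which it uses to tell non-deterministic noise apart from real regressions. The command below is illustrative: the image name and environment variables are placeholders, so check the Diffy README for the exact image and flags of the version you run.

    # Run Diffy (illustrative; image name and variables are placeholders)
    docker run -p 31900:31900 -p 31903:31903 \
      -e DIFFY_PRIMARY=http://stable-service:8080 \
      -e DIFFY_SECONDARY=http://shadow-service:8080 \
      diffy:latest

Diffy tracks which endpoints have differences and at what rate. Access the dashboard to see: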
- Which endpoints differ between stable and shadow
- Sample diffs for investigation
- Noise filtering (to ignore irrelevant differences like timestamps)
Shadow Proxy with Response Comparison
A custom shadow proxy pattern:
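    import asyncio
    import logging

    import aiohttp
    from deepdiff import DeepDiff

    logger = logging.getLogger("shadow_compare")

    async def shadow_compare(request_data, stable_url, shadow_url):
        async with aiohttp.ClientSession() as session:
            # Send to both simultaneously; failures come back as exception objects
            stable_resp, shadow_resp = await asyncio.gather(
                session.post(stable_url, json=request_data),
                session.post(shadow_url, json=request_data),
                return_exceptions=True,
            )
            # A stable failure is a real failure and must reach the caller
            if isinstance(stable_resp, BaseException):
                raise stable_resp
            stable_body = await stable_resp.json()
            try:
                # A shadow failure is only logged; it must never affect the client
                if isinstance(shadow_resp, BaseException):
                    raise shadow_resp
                shadow_body = await shadow_resp.json()
                diff = DeepDiff(stable_body, shadow_body, ignore_order=True)
                if diff:
                    logger.info("shadow diff for %r: %s", request_data, diff)
            except Exception as exc:
                logger.warning("shadow comparison failed: %r", exc)
            return stable_body  # Return stable response to client

Note that this function waits for both responses before returning, so a slow shadow delays the caller. A production proxy would return the stable response first and run the comparison in a background task.

Handling Side Effects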
Shadow testing breaks down when the service has side effects: sending emails, writing to databases, charging payment methods. A mirrored request will trigger the side effect twice — or cause conflicts between the stable and shadow database state.
Strategies:
Shadow-safe databases — the shadow service writes to a separate database or schema. Responses may differ (different IDs, different timestamps) but behavior can still be compared.
Read-only mirroring — only mirror GET requests, which typically have no side effects. POST/PUT/DELETE requests are excluded.
Stub side effects — configure the shadow service to stub or disable side effects (email sending, payment processing) while keeping the core logic real.
Idempotency — for services where duplicate writes are safe (because of idempotency keys), full mirroring may work. Verify this carefully before enabling.
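Two of the strategies above, read-only mirroring and stubbed side effects, can be sketched as follows. The names `is_mirrorable` and `EmailSender` are hypothetical, and the `transport` callable stands in for whatever actually delivers email in your service.

```python
SAFE_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


def is_mirrorable(method: str) -> bool:
    """Read-only mirroring: only methods that should not change state."""
    return method.upper() in SAFE_METHODS


class EmailSender:
    """Sends email in production; records and suppresses it in shadow mode."""

    def __init__(self, transport, shadow_mode: bool):
        self._transport = transport      # callable(to, subject) doing the real send
        self._shadow_mode = shadow_mode
        self.suppressed = []             # what the shadow would have sent

    def send(self, to: str, subject: str) -> bool:
        if self._shadow_mode:
            self.suppressed.append((to, subject))
            return False
        self._transport(to, subject)
        return True
```

Recording the suppressed calls instead of dropping them keeps the side effect comparable: you can still check that the shadow would have sent the same emails as the stable service. The method filter is only as safe as your API, since a GET handler that writes state is still unsafe to mirror.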
What to Look For
When comparing shadow and stable responses:
Response structure differences — fields added, removed, or renamed in the shadow version. These are API contract changes that need documentation or client updates.
Value differences — same fields, different values. Often indicates a bug in the new version's business logic.
Error rate differences — the shadow service returning more 5xx responses means regressions.
Performance differences — shadow requests taking 10x longer than stable means the new version has performance problems that would be noticed by users after cutover.
Noise — timestamps, request IDs, and other inherently non-deterministic values will always differ. Configure your comparison to ignore these fields, or the signal gets buried in noise.
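To separate structure differences from value differences, a first pass over the top-level fields is often enough. The sketch below uses a hypothetical `classify_differences` helper and only inspects top-level keys of JSON objects; nested changes show up as a changed parent field.

```python
def classify_differences(stable: dict, shadow: dict) -> dict:
    """Split top-level differences into added, removed, and changed fields."""
    stable_keys, shadow_keys = set(stable), set(shadow)
    return {
        # Present only in the shadow response: a contract addition.
        "added": sorted(shadow_keys - stable_keys),
        # Present only in the stable response: a potentially breaking removal.
        "removed": sorted(stable_keys - shadow_keys),
        # Same field, different value: a candidate business-logic bug.
        "changed": sorted(
            key for key in stable_keys & shadow_keys
            if stable[key] != shadow[key]
        ),
    }
```

Added and removed fields usually call for a conversation with API consumers, while changed fields usually call for debugging the new version.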
Noise Filtering
Not all differences are bugs. Common sources of noise:
- Timestamps and created-at fields
- UUIDs and auto-generated IDs
- Session tokens and nonces
- Order of items in unordered collections
- Floating-point rounding differences
Configure your comparison to ignore these:
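    diff = DeepDiff(
        stable_body,
        shadow_body,
        ignore_order=True,                 # order of items in collections
        exclude_paths=[                    # noisy top-level fields
            "root['created_at']",
            "root['request_id']",
            "root['timestamp']",
        ],
        ignore_numeric_type_changes=True,  # treat 1 and 1.0 as equal
        significant_digits=6,              # tolerate rounding beyond 6 decimal places
    )

Note that ignore_numeric_type_changes only covers int-versus-float differences; rounding differences need significant_digits, with a value that suits your data.

Diffy has built-in noise reduction: it also compares two instances of the stable version against each other, and a difference that shows up between those two about as often as between stable and shadow is treated as noise rather than a regression.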
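When noisy fields are nested at arbitrary depth, it can be simpler to normalize both bodies before diffing them. The sketch below is a standard-library-only illustration; `normalize` and `NOISY_KEYS` are hypothetical names, and the key list and rounding precision are assumptions you would adapt to your own payloads.

```python
NOISY_KEYS = frozenset({"created_at", "request_id", "timestamp"})


def normalize(value, noisy_keys=NOISY_KEYS, digits=6):
    """Drop noisy keys at any depth and round floats before comparison."""
    if isinstance(value, dict):
        return {
            key: normalize(item, noisy_keys, digits)
            for key, item in value.items()
            if key not in noisy_keys
        }
    if isinstance(value, list):
        return [normalize(item, noisy_keys, digits) for item in value]
    if isinstance(value, float):
        return round(value, digits)
    return value
```

After normalization, a plain equality check such as `normalize(stable_body) == normalize(shadow_body)` ignores the listed fields. This sketch does not reorder lists, so unordered collections still need separate handling.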
Metrics to Track During Shadow Testing
Difference rate — what percentage of requests produce different responses? At launch, expect some differences; over time, this should trend toward zero as you fix discrepancies.
Shadow error rate — how often does the shadow service return 5xx? Compare to stable error rate.
Shadow latency — p50, p95, p99 latency for shadow service. Even though users don't see it, high latency could indicate performance issues that will affect users after cutover.
Difference by endpoint — which API endpoints have the most differences? These are the highest-risk paths for cutover.
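The four metrics above can be computed from per-request comparison records. The sketch below assumes each record is a dict with the hypothetical keys `endpoint`, `differs`, `shadow_status`, and `shadow_latency_ms`; in practice these would come from your comparison job or proxy logs.

```python
from collections import defaultdict
from statistics import quantiles


def summarize(records):
    """Compute difference rate, shadow error rate, latency percentiles, per-endpoint rates."""
    records = list(records)
    if len(records) < 2:
        raise ValueError("need at least two records to compute percentiles")
    total = len(records)
    per_endpoint = defaultdict(lambda: [0, 0])  # endpoint -> [differing, total]
    for record in records:
        counts = per_endpoint[record["endpoint"]]
        counts[0] += int(record["differs"])
        counts[1] += 1
    # 99 cut points; index i holds the (i + 1)th percentile.
    cuts = quantiles(
        (record["shadow_latency_ms"] for record in records),
        n=100,
        method="inclusive",
    )
    return {
        "difference_rate": sum(r["differs"] for r in records) / total,
        "shadow_error_rate": sum(r["shadow_status"] >= 500 for r in records) / total,
        "latency_ms": {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]},
        "difference_rate_by_endpoint": {
            endpoint: differing / count
            for endpoint, (differing, count) in per_endpoint.items()
        },
    }
```

Sorting `difference_rate_by_endpoint` by value gives a quick list of the endpoints to investigate first.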
Cutting Over After Shadow Testing
Shadow testing is a validation step before cutover, not a replacement for it. You're ready to move traffic when:
- Difference rate is below your threshold (e.g., < 0.1%)
- Shadow error rate matches stable error rate
- Shadow latency is acceptable
- All high-difference endpoints have been investigated and either fixed or accepted
Then perform the cutover — either a canary rollout (progressively shift traffic) or a full swap (flip the load balancer). Shadow testing gives you confidence; the cutover is still a live operation that needs monitoring.
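The readiness checklist above can be turned into an automated gate so the decision is explicit and repeatable. This is a sketch under stated assumptions: `CutoverThresholds` and `cutover_blockers` are hypothetical names, and every default except the 0.1% difference rate is an invented placeholder to replace with your own limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CutoverThresholds:
    max_difference_rate: float = 0.001  # the "< 0.1%" example from the checklist
    max_error_rate_gap: float = 0.001   # assumed tolerance versus stable
    max_p99_latency_ms: float = 500.0   # assumed latency budget


def cutover_blockers(difference_rate, shadow_error_rate, stable_error_rate,
                     shadow_p99_ms, uninvestigated_endpoints,
                     thresholds=CutoverThresholds()):
    """Return reasons not to cut over; an empty list means the checks pass."""
    blockers = []
    if difference_rate >= thresholds.max_difference_rate:
        blockers.append("difference rate above threshold")
    if shadow_error_rate - stable_error_rate > thresholds.max_error_rate_gap:
        blockers.append("shadow error rate exceeds stable")
    if shadow_p99_ms > thresholds.max_p99_latency_ms:
        blockers.append("shadow p99 latency too high")
    if uninvestigated_endpoints:
        blockers.append(
            "uninvestigated endpoints: " + ", ".join(sorted(uninvestigated_endpoints))
        )
    return blockers
```

Returning the list of reasons rather than a single boolean makes it clear what still needs attention when the gate fails.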
Shadow testing is most powerful for infrastructure-level changes where you can't easily predict all the ways production traffic will stress the new system. Synthetic test suites test what you think will happen. Shadow testing tests what actually happens.