Shadow Mode Testing with Diffy: Twitter's Approach to Regression Testing

Twitter's engineering team faced a problem that scales with complexity: every time they deployed a new version of a service, they needed to verify it produced the same responses as the old version across the enormous variety of real-world requests that hit their platform. Writing test cases manually could not keep up. Generating synthetic traffic did not capture the full diversity of production inputs. So they built Diffy: a proxy that automatically compares responses from the current version and the candidate version in real time, using actual production traffic, without any user impact.

Diffy was open-sourced in 2015 and is still used by companies worldwide. It represents a mature approach to shadow mode testing — running a new version of a service in parallel, comparing its output to the current version, and flagging regressions automatically.

How Diffy Works

Diffy sits in front of both the current (primary) and new (candidate) versions of your service. When a request comes in, Diffy:

  1. Sends the request to the primary instance and gets a response
  2. Sends the same request to the candidate instance
  3. Sends the same request to a second primary instance (called the secondary or noise baseline)
  4. Compares the candidate response to the primary response
  5. Uses the primary-vs-secondary comparison to identify noise (natural variation between two identical instances)
  6. Reports only the differences that are not also present in the noise baseline

The noise filtering is critical. Two instances of the same code will sometimes return slightly different responses — timestamps differ, IDs are generated, random elements are included. By comparing two identical primaries, Diffy learns what variation is normal and filters it out of the candidate comparison.
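
The noise-filtering idea can be sketched in a few lines of Python. This is a simplified illustration, not Diffy's actual implementation: it compares only top-level fields of already-parsed responses, and the function names and sample payloads are hypothetical.

```python
from typing import Any

_MISSING = object()  # sentinel for "key absent on this side"


def differing_fields(a: dict[str, Any], b: dict[str, Any]) -> set[str]:
    """Return top-level keys whose values differ or that exist on one side only."""
    return {
        key
        for key in a.keys() | b.keys()
        if a.get(key, _MISSING) != b.get(key, _MISSING)
    }


def real_differences(
    primary: dict[str, Any],
    secondary: dict[str, Any],
    candidate: dict[str, Any],
) -> set[str]:
    """Candidate-vs-primary differences that are not explained by noise."""
    noise = differing_fields(primary, secondary)  # two identical builds
    raw = differing_fields(primary, candidate)    # old build vs new build
    return raw - noise


# Hypothetical responses: request_id is random per instance, name is a regression.
primary = {"id": 123, "name": "Alice", "request_id": "a1"}
secondary = {"id": 123, "name": "Alice", "request_id": "b2"}
candidate = {"id": 123, "name": "alice", "request_id": "c3"}

print(sorted(real_differences(primary, secondary, candidate)))  # ['name']
```

The `request_id` field differs in every response, so it shows up in the primary-vs-secondary comparison and is dropped; only `name` survives as a difference worth investigating.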

Setting Up Diffy

The original Twitter Diffy is a Scala/Finagle service. A more accessible alternative is opendiffy, a maintained fork. There is also diffy-go, a Go implementation.

To build from source (the commands below clone the opendiffy fork):

git clone https://github.com/opendiffy/diffy
cd diffy
./sbt assembly

# Run Diffy
java -jar target/scala-2.12/diffy-server.jar \
  -candidate=localhost:9992 \
  -master.primary=localhost:9990 \
  -master.secondary=localhost:9991 \
  -service.protocol=http \
  -proxy.port=:8880 \
  -admin.port=:8881 \
  -metric.port=:8882 \
  -rootUrl="localhost:8881"

Your services:

  • localhost:9990 — primary (current production version)
  • localhost:9991 — secondary (another current production instance, for noise baseline)
  • localhost:9992 — candidate (new version you want to validate)
  • localhost:8880 — Diffy proxy (send your traffic here)

Now send requests to Diffy's proxy port. Diffy fans out each request to all three, compares, and exposes results at the admin endpoint.

Docker Compose Setup

For local development and testing:

version: '3'
services:
  primary:
    image: my-service:current
    ports:
      - "9990:8080"
    environment:
      DATABASE_URL: postgres://db:5432/mydb
  
  secondary:
    image: my-service:current
    ports:
      - "9991:8080"
    environment:
      DATABASE_URL: postgres://db:5432/mydb
  
  candidate:
    image: my-service:next
    ports:
      - "9992:8080"
    environment:
      DATABASE_URL: postgres://db:5432/mydb
  
  diffy:
    image: opendiffy/diffy:latest
    ports:
      - "8880:8880"
      - "8881:8881"
    command: >
      -candidate=candidate:8080
      -master.primary=primary:8080
      -master.secondary=secondary:8080
      -service.protocol=http
      -proxy.port=:8880
      -admin.port=:8881
      -rootUrl=diffy:8881
    depends_on:
      - primary
      - secondary
      - candidate

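  db:
    image: postgres:15
    environment:
      POSTGRES_DB: mydb
      # The official postgres image will not start without a password or an
      # explicit auth method; "trust" is acceptable for local testing only
      POSTGRES_HOST_AUTH_METHOD: trust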

Sending Traffic to Diffy

With Diffy running, you can send traffic manually, from your CI pipeline, or as a mirror of production traffic:

Manual API testing:

# Any request to Diffy is compared across all three instances
curl http://localhost:8880/api/users/123
curl -X POST http://localhost:8880/api/orders \
  -H "Content-Type: application/json" \
  -d '{"product_id": 456, "quantity": 1}'

# Replay a recorded request log (assumes one URL per line)
while IFS= read -r request; do
  curl -s "$request" > /dev/null
done < test-requests.txt

From CI — replay a test corpus:

#!/bin/bash
# ci-diffy-test.sh

# Abort if the replay, the fetch, or the JSON parsing fails
set -euo pipefail

DIFFY_HOST="localhost:8880"
DIFFY_ADMIN="localhost:8881"

# Replay request corpus
python3 replay_requests.py \
  --target "http://$DIFFY_HOST" \
  --requests corpus/api-requests.json \
  --workers 4

# Wait for comparison to complete
sleep 5

# Fetch results
RESULTS=$(curl -s "http://$DIFFY_ADMIN/api/1/errors")
ERROR_COUNT=$(echo "$RESULTS" | python3 -c "import sys,json; d=json.load(sys.stdin); print(len(d))")

echo "Diffy found $ERROR_COUNT response differences"

if [ "$ERROR_COUNT" -gt 0 ]; then
  echo "Differences found:"
  echo "$RESULTS" | python3 -m json.tool
  exit 1
fi

echo "All responses match; candidate is safe to deploy"
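
The CI script calls a replay_requests.py helper that is not shown. A minimal sketch of its core might look like the following. The corpus format (a JSON list of objects with `method`, `path`, and optional `headers` and `body`) is an assumption made for illustration, as are all function names; only the Python standard library is used.

```python
import json
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def build_request(target: str, entry: dict) -> urllib.request.Request:
    """Turn one corpus entry into a request aimed at the Diffy proxy."""
    body = entry.get("body")
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = dict(entry.get("headers", {}))
    if data is not None:
        headers.setdefault("Content-Type", "application/json")
    return urllib.request.Request(
        url=target.rstrip("/") + entry["path"],
        data=data,
        headers=headers,
        method=entry.get("method", "GET"),
    )


def send(request: urllib.request.Request) -> int:
    """Send one request; return the HTTP status, or 0 if the connection failed."""
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still reached Diffy
    except OSError:
        return 0  # connection refused, timeout, DNS failure


def replay(target: str, corpus: list[dict], workers: int = 4) -> list[int]:
    """Replay the whole corpus through the proxy using a small thread pool."""
    requests = [build_request(target, entry) for entry in corpus]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, requests))


# Building a request does not touch the network, so it is safe to inspect.
example = build_request("http://localhost:8880", {"path": "/api/users/123"})
print(example.get_method(), example.full_url)
# GET http://localhost:8880/api/users/123
```

Wrapping `replay` in an `argparse` entry point would provide the `--target`, `--requests`, and `--workers` options the CI script passes. The responses themselves are discarded here because Diffy, not the replay client, performs the comparison.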

Understanding Diffy's Results

The admin UI at localhost:8881 shows a table of endpoints with:

  • Endpoint: the HTTP method + path pattern
  • Requests: how many requests Diffy has seen
  • Differences: how many responses differed from primary
  • Noise: the noise rate between two primary instances
  • Critical differences: differences minus noise (what you actually need to investigate)

The key metric is critical differences. If primary-vs-secondary has 3% natural variation (because both include a timestamp field) and candidate-vs-primary has 4% variation, the critical difference is only 1%. If candidate-vs-primary has 20% variation, you have a real regression regardless of noise.
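
The arithmetic in that example can be written out as a small helper. This follows the description above (differences minus noise) rather than Diffy's internal formula, and the function name and request counts are illustrative.

```python
def critical_rate(requests: int, candidate_diffs: int, noise_diffs: int) -> float:
    """Share of requests whose difference is not explained by noise."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    # Noise can exceed candidate differences by chance; clamp at zero.
    return max(candidate_diffs - noise_diffs, 0) / requests


# 1000 requests, 4% candidate-vs-primary variation, 3% noise
print(critical_rate(1000, 40, 30))   # 0.01

# 1000 requests, 20% candidate-vs-primary variation, 3% noise
print(critical_rate(1000, 200, 30))  # 0.17
```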

You can drill into any difference and see the exact JSON diff between primary and candidate responses:

{
  "endpoint": "GET /api/users/{id}",
  "primaryResponse": {
    "status": 200,
    "body": {
      "id": 123,
      "name": "Alice",
      "created_at": "2024-01-15T10:30:00Z",
      "plan": "free"
    }
  },
  "candidateResponse": {
    "status": 200,
    "body": {
      "id": 123,
      "name": "Alice",
      "created_at": "2024-01-15T10:30:00Z",
      "plan": "free",
      "plan_expires_at": null
    }
  },
  "diff": {
    "+ plan_expires_at": null
  }
}

This difference — a new field plan_expires_at — is intentional (the new version adds it). You need to mark it as expected. Diffy lets you mark specific field paths as "excluded" or "irrelevant" to filter expected changes.
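
To make the idea of excluding expected changes concrete, here is a minimal sketch of a field-level diff that skips excluded paths. It is not Diffy's diffing code: the dotted-path syntax, the function name, and the "added"/"removed"/"changed" labels are assumptions for illustration.

```python
from typing import Any


def diff_json(
    primary: Any,
    candidate: Any,
    excluded: frozenset[str] = frozenset(),
    path: str = "",
) -> dict[str, str]:
    """Map each differing field path to 'added', 'removed', or 'changed'."""
    if isinstance(primary, dict) and isinstance(candidate, dict):
        changes: dict[str, str] = {}
        for key in sorted(primary.keys() | candidate.keys()):
            child = f"{path}.{key}" if path else key
            if child in excluded:
                continue  # expected or noisy field: ignore it entirely
            if key not in primary:
                changes[child] = "added"
            elif key not in candidate:
                changes[child] = "removed"
            else:
                changes.update(
                    diff_json(primary[key], candidate[key], excluded, child)
                )
        return changes
    # Scalars and lists are compared as whole values in this sketch.
    return {} if primary == candidate else {path or "<root>": "changed"}


primary_body = {
    "id": 123, "name": "Alice",
    "created_at": "2024-01-15T10:30:00Z", "plan": "free",
}
candidate_body = {**primary_body, "plan_expires_at": None}

print(diff_json(primary_body, candidate_body))
# {'plan_expires_at': 'added'}
print(diff_json(primary_body, candidate_body, frozenset({"plan_expires_at"})))
# {}
```

Once `plan_expires_at` is excluded, the two responses compare as equal and the endpoint stops reporting a difference for this intentional change.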

Configuring Noise Filters

Some response fields are inherently non-deterministic and should never be compared:

# Skip HTTP header comparison and allow requests with side effects (POST, PUT, DELETE)
java -jar diffy-server.jar \
  -candidate=localhost:9992 \
  -master.primary=localhost:9990 \
  -master.secondary=localhost:9991 \
  -service.protocol=http \
  -proxy.port=:8880 \
  -admin.port=:8881 \
  -excludeHttpHeadersComparison=true \
  -allowHttpSideEffects=true

For response body fields, use Diffy's exclude list in the admin UI or configure them via the API:

# Exclude 'timestamp' and 'request_id' fields from all comparisons
curl -X POST http://localhost:8881/api/1/excludes \
  -H "Content-Type: application/json" \
  -d '{"path": "timestamp"}'

curl -X POST http://localhost:8881/api/1/excludes \
  -H "Content-Type: application/json" \
  -d '{"path": "request_id"}'

How Twitter and Foursquare Used Diffy

Twitter used Diffy as a standard part of their service deployment pipeline. Before any service could be deployed to production, it had to pass a Diffy validation run with real traffic (or a recorded corpus). The pipeline would:

  1. Deploy the candidate to a shadow pool
  2. Mirror 1-5% of live traffic through Diffy for 30 minutes
  3. Check Diffy's difference rate
  4. If critical differences > threshold, block the deployment
  5. Otherwise, proceed to full deployment
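
The gating decision in the last two steps could be sketched as follows. The `EndpointStats` shape, the threshold value, and the sample numbers are hypothetical; they mirror the columns of the admin table rather than any published Diffy or Twitter API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointStats:
    endpoint: str
    requests: int
    differences: int  # candidate-vs-primary differences
    noise: int        # primary-vs-secondary differences


def blocking_endpoints(stats: list[EndpointStats], threshold: float) -> list[str]:
    """Return the endpoints whose critical difference rate exceeds the threshold."""
    blocked = []
    for item in stats:
        if item.requests == 0:
            continue  # no traffic seen, nothing to judge
        rate = max(item.differences - item.noise, 0) / item.requests
        if rate > threshold:
            blocked.append(item.endpoint)
    return blocked


run = [
    EndpointStats("GET /api/users/{id}", requests=1000, differences=40, noise=30),
    EndpointStats("POST /api/orders", requests=500, differences=100, noise=5),
]

blocked = blocking_endpoints(run, threshold=0.02)
print("block deployment" if blocked else "proceed", blocked)
# block deployment ['POST /api/orders']
```

Evaluating each endpoint separately keeps a regression on a low-traffic endpoint from being averaged away by a high-traffic one that matches perfectly.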

This process caught regressions that integration tests missed — subtle JSON serialization changes, new fields that broke consumer contracts, performance regressions that changed response content.

Foursquare used a similar approach for their venue recommendation APIs, where the response body contains recommendations that change based on algorithms. They used Diffy to verify that a new recommendation algorithm did not degrade response quality in ways that could not easily be expressed as a pass/fail test.

Limitations and Considerations

Diffy has important limitations to understand:

Side effects: POST, PUT, DELETE requests are replayed three times (primary, secondary, candidate). This means three writes, three deletes, or three transactions. For write APIs, you must either use read-only traffic or ensure your test environment can safely handle triplicated writes. Some teams run Diffy only on GET traffic for this reason.
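
One way to restrict a replay to read-only traffic is to filter the request corpus before sending it. The sketch below assumes each corpus entry is a dictionary with a `method` key, which is an illustrative format rather than anything Diffy defines.

```python
# HTTP methods defined as safe (read-only) by the HTTP specification
SAFE_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


def read_only(corpus: list[dict]) -> list[dict]:
    """Keep only requests that should not change server state."""
    return [
        entry for entry in corpus
        if entry.get("method", "GET").upper() in SAFE_METHODS
    ]


corpus = [
    {"method": "GET", "path": "/api/users/123"},
    {"method": "POST", "path": "/api/orders"},
]

print(read_only(corpus))
# [{'method': 'GET', 'path': '/api/users/123'}]
```

This only helps if the service honors HTTP semantics; a GET handler that writes to the database would still be triggered three times.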

Latency: Diffy adds latency when clients call it directly, because the proxy has to wait for all three backends before it can compare their responses. In production mirroring setups the user is still served by the primary, and Diffy only handles a copy of the request, so the extra fan-out stays off the user's request path.

Non-deterministic responses: Anything that changes between requests (pagination cursors, generated IDs, random sampling) needs to be excluded. The noise baseline helps, but complex non-determinism requires manual exclusion configuration.

Database state: The primary and candidate may diverge over time if the candidate makes schema changes or data migrations. Diffy works best when primary and candidate share a read replica database and write paths are tested separately.

Despite these limitations, Diffy remains one of the most effective tools for catching API regressions using real traffic. For services with rich APIs and complex response bodies, the ability to automatically detect behavioral changes across all endpoints — without writing any test assertions — is extremely valuable.
