Distributed Tracing for Microservices Testing: A Practical Guide

A unit test tells you a function works in isolation. A distributed trace tells you whether it worked as part of a real request flowing through ten services. These two views are complements, not replacements — and modern testing stacks need both.

Why Distributed Tracing Changes Testing

In a monolith, a failing integration test points at a file and line. In microservices, a failure might be:

  • Service B timing out waiting for Service C
  • A message consumer silently dropping events
  • A database connection pool exhausted by an upstream retry storm
  • A missing traceparent header breaking correlation

Unit and integration tests won't catch these. Distributed tracing, combined with assertions, will.

The Core Idea: Treat Traces as Test Artifacts

Instead of only asserting on HTTP responses, also assert on the trace that request generated:

User request → [Service A] → [Service B] → [Service C]
                    ↓               ↓              ↓
               span: A.handle  span: B.query  span: C.write
               
Your test asserts:
- All 3 spans present (no service silently skipped)
- No ERROR status on any span
- B.query took < 50ms (SLO check)
- traceparent propagated from A to B to C
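The first three checks fold naturally into one reusable helper. A minimal sketch, assuming Jaeger-shaped span objects (`operationName`, `duration` in microseconds, `process.serviceName`, `tags` as key/value pairs); `assertTraceHealthy` is a hypothetical name, not a library function:

```javascript
// assertTraceHealthy: encode span-presence, error, and latency checks once.
// Assumes Jaeger-style span JSON; adapt field names to your backend.
function assertTraceHealthy(spans, { expectedServices = [], sloMs = {} } = {}) {
  const services = new Set(spans.map(s => s.process.serviceName));
  for (const svc of expectedServices) {
    if (!services.has(svc)) throw new Error(`missing spans from service: ${svc}`);
  }

  const errored = spans.filter(s =>
    (s.tags || []).some(t => t.key === 'error' && t.value === true));
  if (errored.length > 0) {
    throw new Error(`error spans: ${errored.map(s => s.operationName).join(', ')}`);
  }

  for (const [op, limitMs] of Object.entries(sloMs)) {
    const span = spans.find(s => s.operationName === op);
    if (!span) throw new Error(`missing span: ${op}`);
    const ms = span.duration / 1000; // Jaeger durations are microseconds
    if (ms > limitMs) throw new Error(`${op} took ${ms}ms, SLO is ${limitMs}ms`);
  }
}
```

A test can then call `assertTraceHealthy(spans, { expectedServices: ['order-service'], sloMs: { 'B.query': 50 } })` instead of repeating the three loops.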

Setting Up a Test Trace Collector

For integration tests, run a lightweight trace collector that your services export to:

# docker-compose.test.yml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    ports:
      - "4317:4317"  # OTLP gRPC
      - "4318:4318"  # OTLP HTTP
      - "8888:8888"  # Collector self-metrics (Prometheus)
    volumes:
      - ./otel-collector-test.yaml:/etc/otelcol-contrib/config.yaml

  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - "16686:16686"  # UI
      - "14268:14268"  # HTTP thrift

# otel-collector-test.yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  # Newer collector-contrib builds removed the dedicated `jaeger` exporter;
  # export OTLP straight to Jaeger's built-in OTLP receiver instead.
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger, debug]

Querying Traces in Tests

Jaeger exposes a REST API you can query after running a scenario:

const JAEGER_API = 'http://localhost:16686/api';

async function getTracesByOperation(service, operation, since) {
  // Operation names often contain spaces and braces ('GET /inventory/{sku}'),
  // so build the query string with URLSearchParams rather than by hand.
  const params = new URLSearchParams({ service, operation, start: since, limit: 10 });
  const res = await fetch(`${JAEGER_API}/traces?${params}`);
  const data = await res.json();
  return data.data;
}

async function getSpansForTrace(traceId) {
  const res = await fetch(`${JAEGER_API}/traces/${traceId}`);
  const data = await res.json();
  return data.data[0].spans;
}
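Trace export is asynchronous, so a fixed sleep is the fragile option. A small polling helper retries the query until the trace appears (a sketch; `waitFor` is a hypothetical name and the timeout values are illustrative):

```javascript
// waitFor: poll an async probe until it returns a truthy value or time runs out.
async function waitFor(probe, { timeoutMs = 5000, intervalMs = 250 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const result = await probe();
    if (result) return result;
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
  throw new Error(`waitFor: no result within ${timeoutMs}ms`);
}
```

Usage: `const traces = await waitFor(async () => { const t = await getTracesByOperation('order-service', 'POST /orders', beforeMs); return t.length ? t : null; });` — the test proceeds as soon as the trace lands instead of always paying the full sleep.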

Now write assertions:

test('order placement creates complete trace across all services', async () => {
  const beforeMs = Date.now() * 1000; // Jaeger uses microseconds
  
  const response = await placeOrder({ userId: 'u1', sku: 'A' });
  expect(response.status).toBe(201);
  
  await sleep(500); // allow trace export
  
  const traces = await getTracesByOperation('order-service', 'POST /orders', beforeMs);
  expect(traces).toHaveLength(1);
  
  const spans = await getSpansForTrace(traces[0].traceID);
  const serviceNames = [...new Set(spans.map(s => s.process.serviceName))];
  
  // All services participated
  expect(serviceNames).toContain('order-service');
  expect(serviceNames).toContain('inventory-service');
  expect(serviceNames).toContain('payment-service');
  expect(serviceNames).toContain('notification-service');
  
  // No errors
  const errorSpans = spans.filter(s => s.tags.some(t => t.key === 'error' && t.value));
  expect(errorSpans).toHaveLength(0);
});
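Beyond service participation, you can assert the call structure. Jaeger spans carry `references` with a `CHILD_OF` entry pointing at the parent span, so a parent-child walk is straightforward (a sketch; `assertChildOf` is a hypothetical helper, and field names follow Jaeger's JSON API):

```javascript
// Build a childSpanID -> parentSpanID map from Jaeger CHILD_OF references.
function parentMap(spans) {
  const map = {};
  for (const span of spans) {
    const parent = (span.references || []).find(r => r.refType === 'CHILD_OF');
    if (parent) map[span.spanID] = parent.spanID;
  }
  return map;
}

// Assert that `childOp` runs somewhere underneath `parentOp` in the trace tree.
function assertChildOf(spans, childOp, parentOp) {
  const byId = Object.fromEntries(spans.map(s => [s.spanID, s]));
  const parents = parentMap(spans);
  const child = spans.find(s => s.operationName === childOp);
  if (!child) throw new Error(`span not found: ${childOp}`);
  for (let id = parents[child.spanID]; id; id = parents[id]) {
    if (byId[id] && byId[id].operationName === parentOp) return;
  }
  throw new Error(`${childOp} is not a descendant of ${parentOp}`);
}
```

This catches a subtler failure mode: all services participated, but in the wrong shape (for example, the notification call fired before payment instead of after it).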

Asserting on Latency SLOs

test('inventory check completes within SLO', async () => {
  const beforeMs = Date.now() * 1000;
  await checkInventory({ sku: 'A' });
  await sleep(500);
  
  const traces = await getTracesByOperation('inventory-service', 'GET /inventory/{sku}', beforeMs);
  const spans = await getSpansForTrace(traces[0].traceID);
  
  const inventorySpan = spans.find(s => s.operationName === 'GET /inventory/{sku}');
  const durationMs = inventorySpan.duration / 1000; // Jaeger stores microseconds
  
  expect(durationMs).toBeLessThan(100); // 100ms SLO
});
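A single request is a noisy latency sample, so a one-shot SLO assertion will flake. A steadier approach: run the scenario several times, collect the span durations, and assert on a percentile (a sketch; the nearest-rank method and the p95 threshold are illustrative choices):

```javascript
// percentile: nearest-rank percentile of a list of numbers.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// In the test, gather durations over N runs and assert on p95
// rather than on a single sample:
//   const durations = [];
//   for (let i = 0; i < 20; i++) { /* run scenario, push span duration in ms */ }
//   expect(percentile(durations, 95)).toBeLessThan(100);
```

Nearest-rank keeps the check conservative: one outlier out of twenty runs won't fail the build, but a genuine regression will.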

Testing Context Propagation

test('traceparent header propagates from gateway to downstream', async () => {
  const beforeMs = Date.now() * 1000;
  
  // Make a request with a known trace ID
  const traceId = '4bf92f3577b34da6a3ce929d0e0e4736';
  await fetch('http://gateway/orders', {
    method: 'POST',
    headers: {
      'traceparent': `00-${traceId}-00f067aa0ba902b7-01`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ sku: 'A' }),
  });
  
  await sleep(500);
  
  const spans = await getSpansForTrace(traceId);
  const serviceNames = [...new Set(spans.map(s => s.process.serviceName))];
  
  // Propagation worked — all services share the same trace ID
  expect(serviceNames.length).toBeGreaterThan(1);
});
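When propagation fails, the first thing to check is the header itself. A W3C `traceparent` is `version-traceid-parentid-flags` in lowercase hex; a small validator catches malformed values before you blame the instrumentation (a sketch of the format checks, not a full spec implementation):

```javascript
// Parse a W3C traceparent header:
//   version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
function parseTraceparent(header) {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  const [, version, traceId, parentId, flags] = m;
  // All-zero trace or parent IDs are invalid per the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x1) === 1 };
}
```

Useful in a test double that records incoming headers: assert every downstream service received a parseable `traceparent` with the expected `traceId` and the sampled flag set.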

Detecting Silent Failures

The most valuable use of trace-based testing: catching services that swallow errors without surfacing them.

test('payment failure is visible in trace even when order API returns 200', async () => {
  const beforeMs = Date.now() * 1000;
  mockPaymentService.rejectNext('card_declined');

  await placeOrder({ userId: 'u1', sku: 'A' });
  // Some architectures return 200 and handle the failure asynchronously,
  // so the HTTP response alone proves nothing.

  await sleep(500); // allow trace export

  const traces = await getTracesByOperation('order-service', 'POST /orders', beforeMs);
  const spans = await getSpansForTrace(traces[0].traceID);

  const paymentSpan = spans.find(s => s.operationName === 'charge-card');

  expect(paymentSpan).toBeDefined();
  expect(paymentSpan.tags.some(t => t.key === 'error' && t.value === true)).toBe(true);
});
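When an error span does turn up, Jaeger's span `logs` (timestamped events with key/value `fields`) usually hold the exception details, which makes for a far better failure message than "error tag was true". A hypothetical helper to pull them out, assuming Jaeger-style span JSON:

```javascript
// Collect log events from error-tagged spans, e.g. to include in an
// assertion failure message. Assumes Jaeger-style span JSON.
function errorDetails(spans) {
  return spans
    .filter(s => (s.tags || []).some(t => t.key === 'error' && t.value === true))
    .map(s => ({
      operation: s.operationName,
      events: (s.logs || []).flatMap(l => l.fields.map(f => `${f.key}=${f.value}`)),
    }));
}
```

A failing test can then report `errorDetails(spans)` so the CI log shows which operation failed and why, without anyone opening the Jaeger UI.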

Trace-Based Smoke Tests in Production

For production verification after deployment:

#!/bin/bash
# smoke-test.sh — run after deploy

TRACE_API="https://traces.internal/api"

# Trigger a synthetic transaction
curl -X POST https://api.prod.example.com/orders \
  -H "X-Synthetic: true" \
  -d '{"sku":"TEST-SKU"}'

sleep 5

# Assert the trace was exported and is complete (needs Node 18+ for global fetch;
# plain `node -e` has no top-level await, hence the async IIFE)
node -e "
(async () => {
  const traces = await fetch('${TRACE_API}/traces?service=order-service&limit=1').then(r => r.json());
  const spans = traces.data[0].spans;
  const services = [...new Set(spans.map(s => s.process.serviceName))];
  console.assert(services.includes('payment-service'), 'payment service span missing!');
  console.log('Smoke test PASSED. Services:', services);
})();
"

Common Pitfalls

Flaky trace assertions — traces export asynchronously. Always wait 500ms–2s before querying.

Missing baggage vs traceparent — traceparent carries trace context; baggage carries business context. Test both if you use both.

Sampling dropping test traces — configure 100% sampling in test environments. Never sample down in dev/test.

Span name drift — if instrumentation changes the span name, all trace-based assertions break. Treat span names as a contract.
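One way to enforce that contract is a span-name allowlist shared by every trace-based test, so a rename fails one obvious check instead of a dozen scattered assertions (a sketch; the operation names are illustrative):

```javascript
// A span-name "contract": fail fast if instrumentation renames an operation.
const KNOWN_OPERATIONS = new Set([
  'POST /orders',
  'GET /inventory/{sku}',
  'charge-card',
]);

function assertKnownOperations(spans) {
  const unknown = spans
    .map(s => s.operationName)
    .filter(op => !KNOWN_OPERATIONS.has(op));
  if (unknown.length > 0) {
    throw new Error(`unexpected span names (contract drift?): ${unknown.join(', ')}`);
  }
}
```

When instrumentation legitimately changes a name, updating `KNOWN_OPERATIONS` becomes a reviewable, one-line diff rather than a surprise in CI.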

Testing with HelpMeTest

HelpMeTest runs real browser scenarios end-to-end. Combined with a trace collector in your test environment, HelpMeTest scenarios can trigger traces and then your assertions can verify the trace was correct — catching distributed failures that no amount of unit testing would surface.

Key Takeaways

  • Use an in-test trace collector (Jaeger, OTEL Collector) rather than mocking telemetry
  • Assert on span presence, service participation, error status, and latency
  • Test propagation explicitly — traceparent correctness is not obvious
  • Use traces to catch silent failures invisible to API response codes
  • Configure 100% sampling in test environments

Distributed tracing turns your observability stack into a test oracle — one that sees exactly what happened across every service boundary.
