Distributed Tracing for Microservices Testing: A Practical Guide
A unit test tells you a function works in isolation. A distributed trace tells you whether it worked as part of a real request flowing through ten services. These two views are complements, not replacements — and modern testing stacks need both.
Why Distributed Tracing Changes Testing
In a monolith, a failing integration test points at a file and line. In microservices, a failure might be:
- Service B timing out waiting for Service C
- A message consumer silently dropping events
- A database connection pool exhausted by an upstream retry storm
- A missing `traceparent` header breaking correlation
Unit and integration tests won't catch these. Distributed tracing, combined with assertions, will.
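For example, the W3C `traceparent` header that stitches services into one trace has a fixed format, and a single malformed hop breaks correlation. A minimal parser (a sketch only, not a full W3C validator, which would also reject all-zero IDs) shows what must survive each hop:

```javascript
// W3C traceparent format: version-traceid-parentid-flags, all lowercase hex.
// Useful for asserting propagation in tests.
function parseTraceparent(header) {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null; // malformed header: correlation is lost at this hop
  const [, version, traceId, parentId, flags] = m;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
// → { version: '00', traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
//     parentId: '00f067aa0ba902b7', sampled: true }
```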
The Core Idea: Treat Traces as Test Artifacts
Instead of only asserting on HTTP responses, also assert on the trace that request generated:
User request → [Service A] → [Service B] → [Service C]
                    ↓              ↓              ↓
              span: A.handle  span: B.query  span: C.write
Your test asserts:
- All 3 spans present (no service silently skipped)
- No ERROR status on any span
- B.query took < 50ms (SLO check)
- `traceparent` propagated from A to B to C

Setting Up a Test Trace Collector
For integration tests, run a lightweight trace collector that your services export to:
# docker-compose.test.yml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "9999:9999"   # Prometheus metrics
    volumes:
      # the contrib image reads its config from /etc/otelcol-contrib/
      - ./otel-collector-test.yaml:/etc/otelcol-contrib/config.yaml
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686" # UI
      - "14268:14268" # HTTP thrift

# otel-collector-test.yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:
exporters:
  # Recent collector-contrib releases dropped the dedicated `jaeger` exporter;
  # export to Jaeger over OTLP instead (all-in-one accepts OTLP by default).
  otlp:
    endpoint: jaeger:4317
    tls:
      insecure: true
  debug:
    verbosity: detailed
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp, debug]

Querying Traces in Tests
Jaeger exposes a REST API you can query after running a scenario:
const JAEGER_API = 'http://localhost:16686/api';

async function getTracesByOperation(service, operation, since) {
  // operation names like 'POST /orders' contain spaces and slashes — encode them
  const url = `${JAEGER_API}/traces?service=${service}&operation=${encodeURIComponent(operation)}&start=${since}&limit=10`;
  const res = await fetch(url);
  const data = await res.json();
  return data.data;
}

async function getSpansForTrace(traceId) {
  const res = await fetch(`${JAEGER_API}/traces/${traceId}`);
  const data = await res.json();
  const trace = data.data[0];
  // Jaeger's API references processes by ID; attach each span's process
  // so callers can read s.process.serviceName directly.
  return trace.spans.map(s => ({ ...s, process: trace.processes[s.processID] }));
}

Now write assertions:
test('order placement creates complete trace across all services', async () => {
  const beforeUs = Date.now() * 1000; // Jaeger timestamps are in microseconds

  const response = await placeOrder({ userId: 'u1', sku: 'A' });
  expect(response.status).toBe(201);

  await sleep(500); // allow trace export

  const traces = await getTracesByOperation('order-service', 'POST /orders', beforeUs);
  expect(traces).toHaveLength(1);

  const spans = await getSpansForTrace(traces[0].traceID);
  const serviceNames = [...new Set(spans.map(s => s.process.serviceName))];

  // All services participated
  expect(serviceNames).toContain('order-service');
  expect(serviceNames).toContain('inventory-service');
  expect(serviceNames).toContain('payment-service');
  expect(serviceNames).toContain('notification-service');

  // No errors
  const errorSpans = spans.filter(s => s.tags.some(t => t.key === 'error' && t.value));
  expect(errorSpans).toHaveLength(0);
});

Asserting on Latency SLOs
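Jaeger reports span durations in microseconds, while SLOs are usually stated in milliseconds, so convert before comparing. A small helper keeps the conversion in one place (a sketch; the names are illustrative):

```javascript
// Jaeger span durations are in microseconds; SLOs are usually in milliseconds.
const spanDurationMs = (span) => span.duration / 1000;

// Assert that the slowest span with a given operation name stays under an SLO.
function assertSloMs(spans, operationName, sloMs) {
  const matching = spans.filter(s => s.operationName === operationName);
  if (matching.length === 0) throw new Error(`no spans named ${operationName}`);
  const worstMs = Math.max(...matching.map(spanDurationMs));
  if (worstMs >= sloMs) {
    throw new Error(`${operationName}: slowest span ${worstMs}ms exceeds ${sloMs}ms SLO`);
  }
}
```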
test('inventory check completes within SLO', async () => {
  const beforeUs = Date.now() * 1000;
  await checkInventory({ sku: 'A' });
  await sleep(500);

  const traces = await getTracesByOperation('inventory-service', 'GET /inventory/{sku}', beforeUs);
  const spans = await getSpansForTrace(traces[0].traceID);

  const inventorySpan = spans.find(s => s.operationName === 'GET /inventory/{sku}');
  const durationMs = inventorySpan.duration / 1000; // Jaeger stores durations in microseconds
  expect(durationMs).toBeLessThan(100); // 100ms SLO
});

Testing Context Propagation
test('traceparent header propagates from gateway to downstream', async () => {
  // Make a request with a known trace ID
  const traceId = '4bf92f3577b34da6a3ce929d0e0e4736';
  await fetch('http://gateway/orders', {
    method: 'POST',
    headers: {
      'traceparent': `00-${traceId}-00f067aa0ba902b7-01`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ sku: 'A' }),
  });

  await sleep(500);

  const spans = await getSpansForTrace(traceId);
  const serviceNames = [...new Set(spans.map(s => s.process.serviceName))];

  // Propagation worked: all services share the same trace ID
  expect(serviceNames.length).toBeGreaterThan(1);
});

Detecting Silent Failures
The most valuable use of trace-based testing: catching services that swallow errors without surfacing them.
test('payment failure is visible in trace even when order API returns 200', async () => {
  const beforeUs = Date.now() * 1000;
  mockPaymentService.rejectNext('card_declined');

  const response = await placeOrder({ userId: 'u1', sku: 'A' });
  // Some architectures return 200 and handle failures asynchronously
  expect(response.status).toBe(200);

  await sleep(500);

  const traces = await getTracesByOperation('order-service', 'POST /orders', beforeUs);
  const spans = await getSpansForTrace(traces[0].traceID);

  const paymentSpan = spans.find(s => s.operationName === 'charge-card');
  expect(paymentSpan).toBeDefined();
  expect(paymentSpan.tags.some(t => t.key === 'error' && t.value === true)).toBe(true);
});

Trace-Based Smoke Tests in Production
For production verification after deployment:
#!/bin/bash
# smoke-test.sh - run after deploy
TRACE_API="https://traces.internal/api"

# Trigger a synthetic transaction
curl -X POST https://api.prod.example.com/orders \
  -H "X-Synthetic: true" \
  -d '{"sku":"TEST-SKU"}'

sleep 5

# Assert trace was exported and complete
node -e "
(async () => {
  const traces = await fetch('${TRACE_API}/traces?service=order-service&limit=1').then(r => r.json());
  const trace = traces.data[0];
  const services = [...new Set(Object.values(trace.processes).map(p => p.serviceName))];
  console.assert(services.includes('payment-service'), 'payment service span missing!');
  console.log('Smoke test PASSED. Services:', services);
})();
"

Common Pitfalls
- **Flaky trace assertions** — traces export asynchronously. Always wait 500ms–2s (or poll) before querying.
- **Missing baggage vs. traceparent** — `traceparent` carries trace context; `baggage` carries business context. Test both if you use both.
- **Sampling dropping test traces** — configure 100% sampling in test environments. Never sample down in dev/test.
- **Span name drift** — if instrumentation changes the span name, all trace-based assertions break. Treat span names as a contract.
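To avoid the fixed-sleep flakiness described in the first pitfall, tests can poll until the trace is complete instead of sleeping. A generic helper (a sketch; the timeout and interval defaults are illustrative):

```javascript
// Retry an async lookup until a predicate passes, instead of a fixed sleep.
async function waitFor(lookup, predicate, { timeoutMs = 5000, intervalMs = 250 } = {}) {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const result = await lookup();
    if (predicate(result)) return result;
    if (Date.now() >= deadline) {
      throw new Error(`condition not met within ${timeoutMs}ms`);
    }
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
}

// Usage with the Jaeger helpers defined earlier (since = microsecond timestamp
// captured before the request):
// const traces = await waitFor(
//   () => getTracesByOperation('order-service', 'POST /orders', since),
//   ts => ts.length > 0 && ts[0].spans.length >= 4, // all four services reported
// );
```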
Testing with HelpMeTest
HelpMeTest runs real browser scenarios end-to-end. Combined with a trace collector in your test environment, HelpMeTest scenarios can trigger traces and then your assertions can verify the trace was correct — catching distributed failures that no amount of unit testing would surface.
Key Takeaways
- Use an in-test trace collector (Jaeger, OTEL Collector) rather than mocking telemetry
- Assert on span presence, service participation, error status, and latency
- Test propagation explicitly — `traceparent` correctness is not obvious
- Use traces to catch silent failures invisible to API response codes
- Configure 100% sampling in test environments
Distributed tracing turns your observability stack into a test oracle — one that sees exactly what happened across every service boundary.