Distributed Tracing in Microservices Tests: Finding Failures Across Services

When a test fails in a distributed system, the error message tells you what happened but not where. A 500 response from your API gateway might be caused by a timeout in service C, waiting on a database query in service B, initiated by a bad request from service A. Without tracing, you're debugging with a flashlight in a cave.

Distributed tracing gives you a complete picture of every request as it flows through your system — and it's just as valuable in your test environment as in production.

How Distributed Tracing Works

Every request gets a trace ID when it enters the system. Each service creates a span tagged with that trace ID. Spans record start time, duration, errors, and metadata.

Collect all spans for a trace ID and you see the complete call graph: which services were called, in what order, how long each took, and where errors occurred.

Trace: 4f8a2b3c
├── order-service.createOrder (120ms)
│   ├── inventory-service.reserve (45ms) ✓
│   ├── payment-service.charge (68ms) ✗ ERROR: card_declined
│   └── notification-service.send (NEVER CALLED)
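To make the idea concrete, here is a minimal sketch of how a backend could rebuild a call tree like the one above from a flat list of spans. The span fields (`span_id`, `parent_id`, `start_ms`, `duration_ms`, `error`) are simplified assumptions for illustration, not the OpenTelemetry or Jaeger data model.

```python
from collections import defaultdict

def render_trace(spans):
    """Return an indented call tree for spans that share one trace ID."""
    # Group spans by parent so each span's children can be looked up directly
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span)

    lines = []

    def walk(parent_id, depth):
        # Siblings are listed by start time, i.e. in the order they were called
        for span in sorted(children[parent_id], key=lambda s: s["start_ms"]):
            status = f"ERROR: {span['error']}" if span.get("error") else "ok"
            lines.append(f"{'  ' * depth}{span['name']} ({span['duration_ms']}ms) {status}")
            walk(span["span_id"], depth + 1)

    walk(None, 0)  # root spans have no parent
    return "\n".join(lines)

# Hypothetical spans, deliberately out of order, as a collector might receive them
spans = [
    {"span_id": "a", "parent_id": None, "name": "order-service.createOrder",
     "start_ms": 0, "duration_ms": 120},
    {"span_id": "c", "parent_id": "a", "name": "payment-service.charge",
     "start_ms": 50, "duration_ms": 68, "error": "card_declined"},
    {"span_id": "b", "parent_id": "a", "name": "inventory-service.reserve",
     "start_ms": 2, "duration_ms": 45},
]
print(render_trace(spans))
```

The parent reference on each span is what turns a pile of timing records into a call graph; the trace ID only tells you which records belong together.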

OpenTelemetry: The Standard

OpenTelemetry (OTel) is the vendor-neutral standard for distributed tracing. Instrument once, export to any backend.

// Node.js tracing setup
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    // The signal-specific variable holds a full URL including /v1/traces
    url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT || 'http://jaeger:4318/v1/traces',
  }),
  instrumentations: [new HttpInstrumentation()],
});

sdk.start();

# Python tracing setup
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.requests import RequestsInstrumentation

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
RequestsInstrumentation().instrument()

Jaeger in Your Test Environment

Run Jaeger in Docker Compose alongside your services:

# docker-compose.test.yml
services:
  jaeger:
    image: jaegertracing/all-in-one:1.51
    ports:
      - "16686:16686"  # UI
      - "4318:4318"    # OTLP HTTP
    environment:
      COLLECTOR_OTLP_ENABLED: "true"

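  # OTEL_EXPORTER_OTLP_TRACES_ENDPOINT is used as-is. The generic
  # OTEL_EXPORTER_OTLP_ENDPOINT is a base URL to which SDKs append /v1/traces.
  order-service:
    environment:
      OTEL_EXPORTER_OTLP_TRACES_ENDPOINT: http://jaeger:4318/v1/traces
      OTEL_SERVICE_NAME: order-service

  payment-service:
    environment:
      OTEL_EXPORTER_OTLP_TRACES_ENDPOINT: http://jaeger:4318/v1/traces
      OTEL_SERVICE_NAME: payment-service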

Trace-Based Test Assertions

The most powerful use of tracing in tests is programmatically verifying the trace after a test runs.

Return Trace IDs from Your APIs

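// Express middleware: return the trace ID in a response header
const { trace } = require('@opentelemetry/api');

app.use((req, res, next) => {
  // The HTTP instrumentation has already started a span for this request
  const span = trace.getActiveSpan();
  if (span) {
    res.setHeader('X-Trace-Id', span.spanContext().traceId);
  }
  next();
});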

Query Traces in Tests

import requests
import time

def test_checkout_calls_payment_and_inventory():
    response = requests.post('http://order-service/checkout', json={
        'customerId': 'abc',
        'items': [{'productId': '123', 'quantity': 1}]
    })
    trace_id = response.headers['X-Trace-Id']
    
    time.sleep(0.5)  # wait for span export
    
    trace_data = requests.get(
        f'http://localhost:16686/api/traces/{trace_id}'
    ).json()
    
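    # Jaeger's JSON keeps service names in a trace-level 'processes' map;
    # each span points into it through its 'processID'
    processes = trace_data['data'][0]['processes']
    services_called = {
        processes[span['processID']]['serviceName']
        for span in trace_data['data'][0]['spans']
    }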
    assert 'payment-service' in services_called
    assert 'inventory-service' in services_called
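The later tests in this article call a `get_trace` helper that is never shown. One possible version is sketched below. It polls instead of sleeping for a fixed time, because batch span processors export on a schedule (5 seconds by default, configurable through `OTEL_BSP_SCHEDULE_DELAY`), so a 0.5 second sleep can be too short. The helper names, the timeout values, and the assumption that Jaeger answers 404 for a trace it has not stored yet are mine; adjust them to your setup.

```python
import json
import time
import urllib.error
import urllib.request

JAEGER_QUERY_URL = "http://localhost:16686"  # same address the tests above use

def fetch_trace_json(trace_id, base_url=JAEGER_QUERY_URL):
    """Return Jaeger's parsed JSON response, or None if the trace is not stored yet."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/traces/{trace_id}", timeout=5) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:  # assumption: an unknown trace ID yields 404
            return None
        raise

def get_trace(trace_id, fetch=fetch_trace_json, timeout_s=10.0, interval_s=0.2, min_spans=1):
    """Poll until the trace has at least min_spans spans, then return it.

    The returned dict is one element of Jaeger's 'data' list, so it has
    'spans' and 'processes' keys.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        payload = fetch(trace_id)
        traces = (payload or {}).get("data") or []
        if traces and len(traces[0]["spans"]) >= min_spans:
            return traces[0]
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Trace {trace_id} not available after {timeout_s}s")
        time.sleep(interval_s)
```

Spans from different services arrive at different times, so a trace can be present but incomplete. Passing `min_spans` with the number of spans you expect reduces that risk. The `fetch` parameter exists so the polling logic can be exercised without a running Jaeger.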

Verify Execution Order

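def test_payment_called_after_inventory_reserved():
    response = checkout(order_data)
    trace = get_trace(response.headers['X-Trace-Id'])

    def spans_for(service):
        # Resolve each span's service through the trace-level 'processes' map
        return [s for s in trace['spans']
                if trace['processes'][s['processID']]['serviceName'] == service]

    # A service can emit several spans, so compare the latest inventory end
    # with the earliest payment start (Jaeger reports both in microseconds)
    inventory_end = max(s['startTime'] + s['duration']
                        for s in spans_for('inventory-service'))
    payment_start = min(s['startTime'] for s in spans_for('payment-service'))

    assert payment_start >= inventory_end, \
        "Payment was called before inventory was reserved"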

Verify Cache Hits (No DB Calls)

def test_cached_response_skips_database():
    requests.get('http://product-service/products/123')  # warm cache
    
    response = requests.get('http://product-service/products/123')
    trace = get_trace(response.headers['X-Trace-Id'])
    
    operations = [span['operationName'] for span in trace['spans']]
    db_calls = [op for op in operations if 'SELECT' in op]
    
    assert len(db_calls) == 0, f"Unexpected DB queries: {db_calls}"

Custom Business Spans

Auto-instrumentation covers HTTP and DB calls. Add custom spans for business operations:

from opentelemetry import trace
from opentelemetry.trace import StatusCode

tracer = trace.get_tracer(__name__)

def process_payment(order_id, amount):
    with tracer.start_as_current_span('process_payment') as span:
        span.set_attribute('order.id', order_id)
        span.set_attribute('payment.amount', amount)
        
        try:
            result = charge_card(amount)
            span.set_attribute('payment.transaction_id', result['id'])
            return result
        except CardDeclinedException as e:
            span.set_status(StatusCode.ERROR)
            span.record_exception(e)
            span.set_attribute('payment.decline_code', e.code)
            raise

Verify these attributes in tests:

def test_payment_span_records_order_context():
    response = process_order(order_id='ord_123', amount=99.99)
    trace = get_trace(response.trace_id)
    
    payment_span = next(s for s in trace['spans'] 
                        if s['operationName'] == 'process_payment')
    
    tags = {t['key']: t['value'] for t in payment_span['tags']}
    assert tags['order.id'] == 'ord_123'
    assert tags['payment.amount'] == 99.99
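If many tests look up spans this way, two small helpers keep the failure messages readable: a bare `next(...)` raises `StopIteration` when the span is missing, which says nothing about what the trace did contain. The sketch below assumes the Jaeger JSON shape used in the test above (`operationName`, and `tags` as a list of key/value objects). The helper names are hypothetical.

```python
def span_tags(span):
    """Flatten Jaeger's list of {'key': ..., 'value': ...} tags into a dict."""
    return {tag["key"]: tag["value"] for tag in span.get("tags", [])}

def find_span(trace, operation_name):
    """Return the first span with the given operation name, or fail descriptively."""
    matches = [s for s in trace["spans"] if s["operationName"] == operation_name]
    if not matches:
        seen = sorted({s["operationName"] for s in trace["spans"]})
        raise AssertionError(
            f"No span named {operation_name!r}; trace contains: {seen}"
        )
    return matches[0]

# Example with a hand-written trace in Jaeger's shape
example_trace = {
    "spans": [
        {"operationName": "POST /checkout", "tags": []},
        {"operationName": "process_payment", "tags": [
            {"key": "order.id", "type": "string", "value": "ord_123"},
            {"key": "payment.amount", "type": "float64", "value": 99.99},
        ]},
    ]
}
tags = span_tags(find_span(example_trace, "process_payment"))
assert tags["order.id"] == "ord_123"
```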

Trace Assertion Library

Build a reusable assertion helper for your team:

class TraceAssertions {
  constructor(spans) {
    this.spans = spans;
  }

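  static async load(traceId, { waitMs = 500 } = {}) {
    await new Promise(r => setTimeout(r, waitMs));
    const res = await fetch(`http://localhost:16686/api/traces/${traceId}`);
    const data = await res.json();
    const trace = data.data[0];
    // Jaeger's JSON keeps service info in a trace-level `processes` map keyed by
    // each span's `processID`; attach it as `span.process` for the checks below
    const spans = trace.spans.map(s => ({ ...s, process: trace.processes[s.processID] }));
    return new TraceAssertions(spans);
  }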

  serviceCalled(name) {
    const called = this.spans.some(s => s.process.serviceName === name);
    if (!called) throw new Error(`Expected ${name} to be called. Called: ${this._services()}`);
    return this;
  }

  serviceNotCalled(name) {
    const called = this.spans.some(s => s.process.serviceName === name);
    if (called) throw new Error(`Expected ${name} NOT to be called`);
    return this;
  }

  noErrors() {
    const errors = this.spans.filter(s => s.tags.some(t => t.key === 'error' && t.value));
    if (errors.length) {
      const names = errors.map(s => `${s.process.serviceName}:${s.operationName}`);
      throw new Error(`Unexpected errors: ${names.join(', ')}`);
    }
    return this;
  }

  _services() {
    return [...new Set(this.spans.map(s => s.process.serviceName))].join(', ');
  }
}

// In tests:
it('checkout flow calls all required services', async () => {
  const response = await checkout(orderData);
  const trace = await TraceAssertions.load(response.headers['x-trace-id']);
  
  trace
    .serviceCalled('inventory-service')
    .serviceCalled('payment-service')
    .serviceNotCalled('fraud-detection')
    .noErrors();
});

Debugging Failures with Traces

When a test fails, use traces to pinpoint the cause:

  1. Get the trace ID from the failed request header or test output
  2. Open Jaeger UI at http://localhost:16686
  3. Search by trace ID to see the full call graph
  4. Find error spans (shown in red) — click for stack traces and context
  5. Check timing — identify slow spans and unexpected call sequences

This replaces grepping through logs across multiple services.

Log Trace IDs in Test Output

Make trace links available in test reports:

@pytest.fixture(autouse=True)
def print_trace_url(request):
    trace_ids = []
    request.node.trace_ids = trace_ids
    yield
    for tid in trace_ids:
        print(f"\n  Trace: http://localhost:16686/trace/{tid}")

When a CI test fails, the output includes a direct Jaeger link. One click shows exactly what happened.
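The fixture only prints IDs that a test has added to `request.node.trace_ids`, so each test needs to register the traces it creates. A minimal sketch of a helper for that follows. The function names are hypothetical, and the header name matches the `X-Trace-Id` header set by the middleware earlier.

```python
JAEGER_UI_URL = "http://localhost:16686"  # same UI address used throughout

def remember_trace(trace_ids, headers, header_name="X-Trace-Id"):
    """Append the response's trace ID to trace_ids (once) and return it."""
    trace_id = headers.get(header_name)
    if trace_id and trace_id not in trace_ids:
        trace_ids.append(trace_id)
    return trace_id

def trace_url(trace_id, base_url=JAEGER_UI_URL):
    """Build the Jaeger UI link for one trace."""
    return f"{base_url}/trace/{trace_id}"

# Usage inside a test ('request' is pytest's built-in fixture):
#
# def test_checkout(request):
#     response = requests.post("http://order-service/checkout", json=order_data)
#     remember_trace(request.node.trace_ids, response.headers)
```

The header mapping returned by `requests` is case-insensitive, so the lookup works however the server spells the header. With a plain dict the key must match exactly.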

Always Sample in Test Environments

Production typically samples a percentage of traces to manage volume. In tests, sample everything:

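const { NodeSDK } = require('@opentelemetry/sdk-node');
const { AlwaysOnSampler } = require('@opentelemetry/sdk-trace-base');

const sdk = new NodeSDK({
  sampler: new AlwaysOnSampler(),
  // ...plus the traceExporter and instrumentations from the setup above
});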

Distributed tracing turns debugging from log archaeology into visual exploration. For microservices integration tests, it's one of the highest-value tools you can add to your stack.
