Distributed Tracing in Tests: Using OpenTelemetry and Jaeger to Debug Microservices

When a microservices integration test fails with a vague 500 error, you have a problem: the error came from somewhere in a chain of four services, passed through an API gateway, and touched a database and a message queue. The stack trace in your test output tells you where the request entered the system. It tells you nothing about where it broke.

Distributed tracing solves this. When you instrument your services with OpenTelemetry and collect traces in Jaeger, every test run produces a detailed map of what happened across every service, in what order, with what data, and where time was spent. Test failures go from "something returned 500" to "the inventory service's database query timed out after 4.2 seconds because a missing index caused a full table scan."

This guide covers how to instrument services with OpenTelemetry, how to assert trace structure in tests, and how to use Jaeger as a debugging tool when tests fail.

Why Tracing Changes How You Test

Traditional test assertions are binary: the request succeeded or it didn't, the response body matched or it didn't. Tracing adds another dimension, namely how the operation executed. This matters for:

  • Debugging flaky tests: When a test fails 1 run in 20, the cause is often a race condition or an intermittent downstream dependency. The trace shows exactly which call was slow or failed.
  • Performance regression detection: Assert that the span for a database query stays under 50ms. If a code change doubles query time, the trace-based assertion catches it before it ships.
  • Verifying architectural constraints: Assert that service A never calls service C directly and always goes through service B. Trace structure makes these architectural rules testable.
  • Debugging in staging: When an E2E test fails in a CI pipeline, the trace ID in the test output lets you pull up the full distributed trace in Jaeger without reproducing the failure locally.

Instrumenting Services with OpenTelemetry

OpenTelemetry provides a vendor-neutral instrumentation API. Here's how to add it to a Node.js microservice:

npm install @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http \
  @opentelemetry/resources \
  @opentelemetry/semantic-conventions

Create a tracer setup file that runs before everything else:

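// tracing.js: load this before any other imports.
// Written against the 1.x JS SDK. In SDK 2.x, `new Resource(...)` is replaced by
// resourceFromAttributes(...) and SemanticResourceAttributes by ATTR_SERVICE_NAME / ATTR_SERVICE_VERSION.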
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');

const exporter = new OTLPTraceExporter({
  url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://jaeger:4318/v1/traces',
});

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: process.env.SERVICE_NAME || 'unknown-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.SERVICE_VERSION || '0.0.0',
  }),
  traceExporter: exporter,
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-redis': { enabled: true },
    }),
  ],
});

sdk.start();

process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((error) => console.error('Error terminating tracing', error));
});

Load it at startup: node -r ./tracing.js server.js

For custom spans on business-critical operations:

const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('order-service', '1.0.0');

async function processOrder(orderId, customerId) {
  // Start a custom span for this business operation
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttribute('order.id', orderId);
    span.setAttribute('customer.id', customerId);
    
    try {
      const order = await orderRepository.findById(orderId);
      
      // Nested span for the payment step
      const paymentResult = await tracer.startActiveSpan(
        'chargeCustomer',
        async (paymentSpan) => {
          paymentSpan.setAttribute('payment.amount', order.totalAmount);
          paymentSpan.setAttribute('payment.currency', 'USD');
          
          try {
            const result = await paymentService.charge(customerId, order.totalAmount);
            paymentSpan.setAttribute('payment.transaction_id', result.transactionId);
            return result;
          } catch (err) {
            paymentSpan.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
            paymentSpan.recordException(err);
            throw err;
          } finally {
            paymentSpan.end();
          }
        }
      );
      
      span.setAttribute('order.status', 'completed');
      return { success: true, transactionId: paymentResult.transactionId };
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      span.recordException(err);
      throw err;
    } finally {
      span.end();
    }
  });
}

Running Jaeger in Your Test Environment

For local development and CI, run Jaeger all-in-one (stores traces in memory — fine for testing):

# docker-compose.test.yml
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "6831:6831/udp"   # Thrift compact (legacy agents)
      - "16686:16686"      # UI
      - "4317:4317"        # OTLP gRPC
      - "4318:4318"        # OTLP HTTP
    environment:
      COLLECTOR_OTLP_ENABLED: "true"
      SPAN_STORAGE_TYPE: memory
      MEMORY_MAX_TRACES: 10000

  order-service:
    build: ./order-service
    environment:
      SERVICE_NAME: order-service
      OTEL_EXPORTER_OTLP_ENDPOINT: http://jaeger:4318/v1/traces
    depends_on:
      - jaeger
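
The short form of depends_on only orders container startup; it does not wait for Jaeger to accept requests. A minimal sketch of a readiness check a test session could run first, assuming the query port is published on localhost:16686 as in the compose file above. The names jaeger_is_up and wait_until are illustrative, not part of any library:

```python
import time
import urllib.request
from typing import Callable

# Assumption: Jaeger's query/UI port is reachable from the test runner here.
JAEGER_QUERY_URL = "http://localhost:16686"


def jaeger_is_up(base_url: str = JAEGER_QUERY_URL) -> bool:
    """Return True if Jaeger's query API answers on /api/services."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/services", timeout=2) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, DNS failures, timeouts, and HTTP errors.
        return False


def wait_until(
    check: Callable[[], bool],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
) -> bool:
    """Poll `check` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Possible use in a session-scoped pytest fixture:
#     assert wait_until(jaeger_is_up), "Jaeger did not become ready in time"
```

Keeping the polling loop separate from the HTTP check means the same wait_until helper can gate on any other service the tests depend on.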

Querying Traces in Tests

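Jaeger's query service exposes an HTTP JSON API for querying traces. It is the API the Jaeger UI itself uses and is documented as internal, so pin the Jaeger version your tests run against. Use it in your tests to assert on trace structure after triggering operations: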

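# trace_assertions.py
import json
import re
import time
from typing import Dict, List, Optional

import requests

JAEGER_API = "http://localhost:16686/api"

# Seconds per unit for lookback strings such as "30s", "1m", "2h"
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def _lookback_seconds(lookback: str) -> int:
    """Convert a lookback string like "30s" or "1m" to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", lookback)
    if not match:
        raise ValueError(f"Unsupported lookback: {lookback!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


class TraceAssertor:
    def get_traces(
        self,
        service: str,
        operation: Optional[str] = None,
        tags: Optional[Dict[str, str]] = None,
        lookback: str = "1m",
        limit: int = 10
    ) -> List[Dict]:
        """Fetch recent traces from Jaeger's HTTP query API."""
        # The query service filters on start/end (Unix microseconds).
        # "lookback" is a UI-level convenience, so translate it explicitly.
        end_us = int(time.time() * 1_000_000)
        start_us = end_us - _lookback_seconds(lookback) * 1_000_000
        params = {
            "service": service,
            "limit": limit,
            "start": start_us,
            "end": end_us,
        }
        if operation:
            params["operation"] = operation
        if tags:
            # Jaeger expects tags as a JSON object encoded in a string
            params["tags"] = json.dumps(tags)

        resp = requests.get(f"{JAEGER_API}/traces", params=params, timeout=10)
        resp.raise_for_status()
        return resp.json().get("data", [])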

    def get_spans_by_operation(self, trace: Dict, operation: str) -> List[Dict]:
        """Get all spans in a trace matching an operation name."""
        return [
            span for span in trace["spans"]
            if span["operationName"] == operation
        ]

    def get_span_tags(self, span: Dict) -> Dict[str, str]:
        """Extract tags from a span as a simple dict."""
        return {tag["key"]: tag["value"] for tag in span.get("tags", [])}

    def assert_span_exists(self, trace: Dict, operation: str) -> Dict:
        """Assert that a span with the given operation exists and return it."""
        spans = self.get_spans_by_operation(trace, operation)
        assert spans, f"Expected span '{operation}' not found in trace. Found: {[s['operationName'] for s in trace['spans']]}"
        return spans[0]

    def assert_no_error_spans(self, trace: Dict) -> None:
        """Assert that no spans in the trace have error status."""
        error_spans = [
            span for span in trace["spans"]
            if any(
                tag["key"] == "otel.status_code" and tag["value"] == "ERROR"
                for tag in span.get("tags", [])
            )
        ]
        ops = [s["operationName"] for s in error_spans]
        assert not error_spans, f"Found error spans: {ops}"

Now write tests that use these assertions:

# test_order_processing_traces.py
import pytest
import requests
import time
from trace_assertions import TraceAssertor

GATEWAY_URL = "http://localhost:8080"
assertor = TraceAssertor()

class TestOrderProcessingTraces:
    def test_successful_order_creates_expected_trace_structure(self):
        """
        When an order is placed successfully, the trace should show:
        1. The HTTP POST span from the API gateway
        2. The processOrder business span
        3. The chargeCustomer span
        4. A database write span
        No spans should have error status.
        """
        # Trigger the operation
        resp = requests.post(
            f"{GATEWAY_URL}/api/orders",
            json={"customerId": "cust_123", "items": [{"sku": "PROD-1", "quantity": 1}]},
            headers={"X-Trace-Test": "test_successful_order"}
        )
        assert resp.status_code == 201
        
        time.sleep(2)  # Allow traces to propagate to Jaeger
        
        traces = assertor.get_traces(
            service="order-service",
            operation="POST /orders",
            lookback="30s",
            limit=5
        )
        assert traces, "No traces found for order creation"
        
        trace = traces[0]
        
        # Verify business span exists
        process_span = assertor.assert_span_exists(trace, "processOrder")
        span_tags = assertor.get_span_tags(process_span)
        assert "order.id" in span_tags, "processOrder span missing order.id attribute"
        assert "customer.id" in span_tags
        
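        # Verify the payment span is nested under processOrder.
        # Jaeger lists a span's parents in its "references" array.
        charge_span = assertor.assert_span_exists(trace, "chargeCustomer")
        parent_ids = {
            ref["spanID"] for ref in charge_span.get("references", [])
            if ref.get("refType") == "CHILD_OF"
        }
        assert process_span["spanID"] in parent_ids, (
            "chargeCustomer should be a child of processOrder"
        )

        # Verify a database query occurred. The pg instrumentation may append
        # the SQL command to the span name (for example "pg.query:INSERT"),
        # so match on the prefix rather than the exact name.
        assert any(
            s["operationName"].startswith("pg.query") for s in trace["spans"]
        ), "Expected a pg.query span in the trace"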
        
        # No errors
        assertor.assert_no_error_spans(trace)

    def test_payment_failure_is_captured_in_trace(self):
        """
        When payment fails, the chargeCustomer span should have error status
        and an exception event, making root cause immediately visible.
        """
        resp = requests.post(
            f"{GATEWAY_URL}/api/orders",
            json={
                "customerId": "cust_invalid_card",
                "items": [{"sku": "PROD-1", "quantity": 1}]
            }
        )
        assert resp.status_code in [402, 500]
        
        time.sleep(2)
        
        traces = assertor.get_traces(
            service="order-service",
            operation="POST /orders",
            lookback="30s"
        )
        assert traces
        
        trace = traces[0]
        payment_spans = assertor.get_spans_by_operation(trace, "chargeCustomer")
        assert payment_spans, "chargeCustomer span should exist even on failure"
        
        payment_span = payment_spans[0]
        tags = assertor.get_span_tags(payment_span)
        
        assert tags.get("otel.status_code") == "ERROR", (
            "chargeCustomer span should have ERROR status on payment failure"
        )
        
        # Verify exception event was recorded
        events = payment_span.get("logs", [])  # Jaeger calls these "logs"
        exception_events = [
            e for e in events
            if any(f["key"] == "event" and "exception" in str(f["value"]).lower()
                   for f in e.get("fields", []))
        ]
        assert exception_events, "Expected exception event on payment failure span"

    def test_span_duration_within_sla(self):
        """
        The processOrder span should complete within 2 seconds under normal conditions.
        Catches performance regressions before they reach production.
        """
        resp = requests.post(
            f"{GATEWAY_URL}/api/orders",
            json={"customerId": "cust_123", "items": [{"sku": "PROD-1", "quantity": 1}]}
        )
        assert resp.status_code == 201, "Order must succeed before its duration is meaningful"
        
        time.sleep(2)
        
        traces = assertor.get_traces(service="order-service", operation="POST /orders", lookback="30s")
        assert traces
        
        trace = traces[0]
        process_span = assertor.assert_span_exists(trace, "processOrder")
        
        # Duration in microseconds in Jaeger
        duration_ms = process_span["duration"] / 1000
        assert duration_ms < 2000, (
            f"processOrder took {duration_ms:.0f}ms, expected < 2000ms. "
            f"Check for performance regression."
        )
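
The fixed time.sleep(2) calls above are a common source of flakiness: too short on a loaded CI runner, wasted time everywhere else. One alternative, sketched here under the assumption that the fetch function wraps a call such as assertor.get_traces(...), is to poll until a matching trace appears. The names wait_for_trace and has_operation are illustrative:

```python
import time
from typing import Callable, Dict, List, Optional


def wait_for_trace(
    fetch: Callable[[], List[Dict]],
    predicate: Callable[[Dict], bool],
    timeout_s: float = 10.0,
    interval_s: float = 0.25,
) -> Optional[Dict]:
    """Call `fetch` repeatedly; return the first trace matching `predicate`, or None on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        for trace in fetch():
            if predicate(trace):
                return trace
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval_s)


def has_operation(operation: str) -> Callable[[Dict], bool]:
    """Build a predicate that is true when a trace contains a span with this operation name."""
    def check(trace: Dict) -> bool:
        return any(s["operationName"] == operation for s in trace.get("spans", []))
    return check


# Possible use inside a test, replacing time.sleep(2):
#     trace = wait_for_trace(
#         lambda: assertor.get_traces(service="order-service", operation="POST /orders"),
#         has_operation("processOrder"),
#     )
#     assert trace is not None, "processOrder trace never reached Jaeger"
```

Waiting for a specific span rather than for any trace also guards against reading a trace that Jaeger has only partially received.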

Trace-Based Testing Patterns

Beyond debugging, traces enable testing patterns that aren't possible with pure HTTP assertions:

Architectural constraint testing: verify that order-service is the only path to the database and that the frontend never queries it directly:

def test_frontend_service_never_calls_database_directly():
    """
    Frontend requests must go through order-service, never hit the database directly.
    This enforces the architectural boundary.
    """
    requests.get(f"{GATEWAY_URL}/api/orders?customerId=cust_123")
    time.sleep(2)
    
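    traces = assertor.get_traces(service="frontend", lookback="30s")
    assert traces, "No frontend traces found; the check below would pass vacuously"
    
    for trace in traces:
        for span in trace["spans"]:
            # A trace contains spans from every service the request touched,
            # so inspect only the spans emitted by the frontend itself.
            service = trace["processes"][span["processID"]]["serviceName"]
            if service != "frontend":
                continue
            assert "pg.query" not in span["operationName"], (
                f"Frontend service made direct database call: {span['operationName']}"
            )
            assert "SELECT" not in span["operationName"].upper()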

Fan-out verification: when one request should trigger several downstream calls, assert that every expected service appears in the trace:

def test_order_creation_notifies_all_downstream_services():
    """
    Creating an order should trigger calls to inventory, payment, and notification services.
    Missing a downstream call indicates a bug in the orchestration logic.
    """
    order_payload = {"customerId": "cust_123", "items": [{"sku": "PROD-1", "quantity": 1}]}
    requests.post(f"{GATEWAY_URL}/api/orders", json=order_payload)
    time.sleep(3)
    
    traces = assertor.get_traces(service="order-service", lookback="30s")
    assert traces
    
    trace = traces[0]
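    # Spans carry a processID such as "p1"; the service name lives in trace["processes"]
    services_called = {
        trace["processes"][span["processID"]]["serviceName"]
        for span in trace["spans"]
    }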
    
    required_services = {"inventory-service", "payment-service", "notification-service"}
    missing = required_services - services_called
    assert not missing, f"Order creation did not call: {missing}"
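
Both patterns depend on how Jaeger's JSON ties a span to its service and to its parent: each span has a processID that indexes into the trace's processes map, and a references array that names its parent span. The sketch below uses a trimmed, hand-written sample trace (the IDs and the POST /charge operation are illustrative) and a few hypothetical helper names:

```python
from typing import Dict, List, Set

# Hand-written, trimmed example of one element of the API's "data" array.
SAMPLE_TRACE: Dict = {
    "traceID": "abc123",
    "spans": [
        {"spanID": "s1", "operationName": "POST /orders", "processID": "p1",
         "references": []},
        {"spanID": "s2", "operationName": "processOrder", "processID": "p1",
         "references": [{"refType": "CHILD_OF", "traceID": "abc123", "spanID": "s1"}]},
        {"spanID": "s3", "operationName": "POST /charge", "processID": "p2",
         "references": [{"refType": "CHILD_OF", "traceID": "abc123", "spanID": "s2"}]},
    ],
    "processes": {
        "p1": {"serviceName": "order-service"},
        "p2": {"serviceName": "payment-service"},
    },
}


def service_of(trace: Dict, span: Dict) -> str:
    """Resolve the name of the service that emitted a span."""
    return trace["processes"][span["processID"]]["serviceName"]


def services_in(trace: Dict) -> Set[str]:
    """All service names that contributed at least one span to the trace."""
    return {service_of(trace, span) for span in trace["spans"]}


def spans_for_service(trace: Dict, service: str) -> List[Dict]:
    """Only the spans emitted by the given service."""
    return [span for span in trace["spans"] if service_of(trace, span) == service]


def parent_ids(span: Dict) -> Set[str]:
    """Span IDs this span declares as parents via CHILD_OF references."""
    return {
        ref["spanID"]
        for ref in span.get("references", [])
        if ref.get("refType") == "CHILD_OF"
    }
```

With helpers like these, a fan-out check reduces to a set difference against services_in(trace), and a boundary check to scanning spans_for_service(trace, "frontend").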

Integrating Trace Assertions into CI

For CI pipelines, print the Jaeger trace URL in test failure output so developers can immediately open it:

# conftest.py
import pytest

JAEGER_UI = "http://your-jaeger-host:16686"

@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_makereport(item, call):
    outcome = yield
    report = outcome.get_result()
    
    if report.when == "call" and report.failed:
        # Try to get the trace ID if the test stored it
        trace_id = getattr(item, "_trace_id", None)
        if trace_id:
            report.sections.append((
                "Jaeger Trace",
                f"View trace: {JAEGER_UI}/trace/{trace_id}"
            ))

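Store the trace ID in tests that trigger HTTP calls. This assumes the gateway returns the trace ID in a response header such as X-Trace-Id, which OpenTelemetry's auto-instrumentation does not add by default: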

def test_order_creation(request):
    order_payload = {"customerId": "cust_123", "items": [{"sku": "PROD-1", "quantity": 1}]}
    resp = requests.post(f"{GATEWAY_URL}/api/orders", json=order_payload)
    
    # Save trace ID from response header for failure reporting
    trace_id = resp.headers.get("X-Trace-Id")
    if trace_id:
        request.node._trace_id = trace_id
    
    assert resp.status_code == 201
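
If the services do not echo a trace ID header, the test can choose the trace ID itself by sending a W3C traceparent header. This is a sketch rather than a guaranteed recipe: it assumes every hop uses the W3C Trace Context propagator (the OpenTelemetry SDK default), that the gateway forwards the header, and that the sampler honors the parent's sampled flag. The helper names are illustrative:

```python
import re
import secrets
from typing import Tuple

JAEGER_UI = "http://your-jaeger-host:16686"  # same placeholder as in conftest.py

# version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def _random_nonzero_hex(num_bytes: int) -> str:
    """Random lowercase hex string; all-zero IDs are invalid in W3C Trace Context."""
    while True:
        value = secrets.token_hex(num_bytes)
        if int(value, 16) != 0:
            return value


def new_traceparent() -> Tuple[str, str]:
    """Return (trace_id, traceparent header value) with the sampled flag set."""
    trace_id = _random_nonzero_hex(16)   # 32 hex characters
    parent_id = _random_nonzero_hex(8)   # 16 hex characters
    return trace_id, f"00-{trace_id}-{parent_id}-01"


def trace_id_from_traceparent(header: str) -> str:
    """Extract the trace ID from a version-00 traceparent header."""
    match = _TRACEPARENT_RE.fullmatch(header)
    if not match:
        raise ValueError(f"Not a version-00 traceparent header: {header!r}")
    return match.group(1)


def trace_url(trace_id: str) -> str:
    """Jaeger UI link for a trace, in the same form the conftest hook prints."""
    return f"{JAEGER_UI}/trace/{trace_id}"


# Possible use inside a test:
#     trace_id, traceparent = new_traceparent()
#     request.node._trace_id = trace_id
#     resp = requests.post(url, json=payload, headers={"traceparent": traceparent})
```

Because the ID is known before the request is sent, the failure report can link to the trace even when the request itself times out and no response headers come back.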

Getting the Most Out of Trace-Based Testing

The key discipline is correlating test runs to traces. Pass a unique test identifier as a custom header in every request (X-Test-Run-ID), and add it as a span attribute. Then you can filter Jaeger for exactly the spans generated by a specific test run — essential in shared test environments where multiple tests run concurrently.
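
On the test side, that correlation could look like the sketch below. The header name comes from the paragraph above; the span attribute name test.run_id is hypothetical and must match whatever attribute your services copy the header into, which is server-side code not shown here. The query parameters follow the shape Jaeger's HTTP API accepts, with tags passed as a JSON object in a string:

```python
import json
import uuid
from typing import Dict

RUN_ID_HEADER = "X-Test-Run-ID"    # header name used in the text above
RUN_ID_ATTRIBUTE = "test.run_id"   # hypothetical span attribute set by the services


def new_run_id(test_name: str) -> str:
    """Build an identifier that is unique per test invocation."""
    return f"{test_name}-{uuid.uuid4().hex[:12]}"


def run_headers(run_id: str) -> Dict[str, str]:
    """HTTP headers to attach to every request the test sends."""
    return {RUN_ID_HEADER: run_id}


def jaeger_params_for_run(service: str, run_id: str, limit: int = 20) -> Dict[str, str]:
    """Query parameters selecting only traces tagged with this run ID."""
    return {
        "service": service,
        "limit": str(limit),
        "tags": json.dumps({RUN_ID_ATTRIBUTE: run_id}),
    }


# Possible use inside a test:
#     run_id = new_run_id("test_order_creation")
#     requests.post(url, json=payload, headers=run_headers(run_id))
#     requests.get(f"{JAEGER_API}/traces", params=jaeger_params_for_run("order-service", run_id))
```

Filtering by run ID removes the need to take traces[0] and hope it belongs to the current test.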

Start by instrumenting one critical path end-to-end: frontend → API gateway → core service → database. Run your existing integration tests and look at the traces they produce. You'll immediately see things you didn't know were happening — unexpected service calls, unexpectedly slow queries, calls that succeed but take 10x longer than they should.

Traces don't replace test assertions — they complement them. When a test fails, the assertion tells you what went wrong; the trace tells you why. Together, they turn microservices debugging from an art into a process.
