Testing Observability in Microservices: Logs, Metrics, and Traces

Testing observability in microservices is a discipline that teams tend to skip until an incident exposes the gap. You're in the middle of a production outage, you open Grafana, and the dashboard is empty. Or the traces are there but missing the tenant_id attribute that would tell you which customer is affected. Or the alert that should have fired didn't, because someone changed the metric name without updating the alert rule. These are not hypothetical failures; they are common on teams that treat observability as "something we'll configure properly later."

Testing that your logs, metrics, and traces are correct is as important as testing the business logic they instrument. This guide covers how to write those tests using Testcontainers, Prometheus, Jaeger, and OpenTelemetry.

What You're Actually Testing

Observability rests on three signals (logs, metrics, and traces) plus the alert rules built on top of them. Each of these four layers needs a different test approach:

| Layer | What to Test | Failure Mode |
| --- | --- | --- |
| Structured Logs | Required fields present, correct log levels, no PII leakage | Missing trace_id makes log correlation impossible; PII in logs is a compliance violation |
| Metrics | Counters/histograms emitted, correct labels, correct values | Missing metrics = blind dashboards; wrong labels = incorrect aggregation |
| Traces | Spans created with correct attributes, parent-child relationships, sampling | Missing spans = broken distributed traces; wrong attributes = useless trace search |
| Alerts | Rules fire on correct conditions, with correct severity and runbook links | Alert doesn't fire = silent incident; alert fires too eagerly = alert fatigue |

Testing Structured Logs

Structured logs are only useful if they contain the fields your log aggregation platform (Loki, Elasticsearch, CloudWatch Insights) needs for correlation and filtering.

Minimum Required Fields

Every log line in a production microservices system should contain:

  • trace_id — links log to distributed trace
  • span_id — links log to specific span within a trace
  • service — which service emitted the log
  • level — log level (info, warn, error)
  • timestamp — ISO 8601 with timezone
  • tenant_id — for multi-tenant systems
  • request_id — for HTTP request correlation
  • user_id — for user-scoped operations (but NOT in debug logs to avoid PII leakage)
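
The field list above can be turned into a small reusable check. The following is a minimal sketch, not a definitive schema: the function name `validate_log_record` and the choice of which fields are unconditional are assumptions made for illustration (tenant_id and user_id are left out because the list marks them as situational). It also checks that the timestamp carries a timezone, which a plain ISO 8601 parse does not enforce.

```python
from datetime import datetime

# Fields treated as unconditional in this sketch (an assumption; extend as needed).
REQUIRED_FIELDS = ("trace_id", "span_id", "service", "level", "timestamp", "request_id")


def validate_log_record(record: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record passes."""
    problems = [
        f"missing or empty field: {name}"
        for name in REQUIRED_FIELDS
        if not record.get(name)
    ]

    ts = record.get("timestamp")
    if ts:
        try:
            # fromisoformat() on older Python versions rejects a trailing "Z"
            parsed = datetime.fromisoformat(str(ts).replace("Z", "+00:00"))
        except ValueError:
            problems.append(f"timestamp is not ISO 8601: {ts}")
        else:
            if parsed.tzinfo is None:
                problems.append(f"timestamp has no timezone: {ts}")
    return problems
```

Returning a list of problems rather than raising on the first one makes the assertion message in a failing test show every missing field at once.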

Log Field Tests

# test_structured_logging.py
import json
import os
import time

import pytest
import requests


def capture_logs_for_request(endpoint: str, headers: dict) -> tuple[list[dict], requests.Response]:
    """Make a request and return (structured logs it generated, HTTP response)."""
    # Assumes the service under test appends JSON log lines to TEST_LOG_FILE in test mode
    log_file = os.environ.get("TEST_LOG_FILE", "/tmp/service-test.log")
    # Remember where the file ended so only lines written by this request are parsed
    initial_size = os.path.getsize(log_file) if os.path.exists(log_file) else 0

    response = requests.get(endpoint, headers=headers, timeout=10)

    time.sleep(0.1)  # Brief wait for async log writes

    logs = []
    if os.path.exists(log_file):
        # Binary mode, because initial_size is a byte offset
        with open(log_file, "rb") as f:
            f.seek(initial_size)
            for raw in f:
                line = raw.decode("utf-8", errors="replace").strip()
                if line:
                    try:
                        logs.append(json.loads(line))
                    except json.JSONDecodeError:
                        pass  # Non-JSON log lines (startup messages, etc.)

    return logs, response


def test_http_request_logs_contain_required_fields():
    """Every HTTP request should generate a log with required correlation fields."""
    logs, response = capture_logs_for_request(
        "http://localhost:8000/api/orders",
        {"Authorization": "Bearer test-token", "X-Request-Id": "req-abc-123"}
    )

    # Find the access log entry for this request
    access_logs = [l for l in logs if l.get("log_type") == "access" or
                   l.get("message", "").startswith("GET /api/orders")]

    assert len(access_logs) >= 1, "HTTP request should generate at least one access log entry"

    log = access_logs[0]

    # Required fields for log correlation
    assert "trace_id" in log, "Log must include trace_id for distributed trace correlation"
    assert log["trace_id"] != "", "trace_id must not be empty"

    assert "request_id" in log, "Log must include request_id"
    assert log.get("request_id") == "req-abc-123", \
        "Log request_id must match X-Request-Id header"

    assert "service" in log, "Log must include service name"
    assert "level" in log, "Log must include log level"
    assert "timestamp" in log, "Log must include timestamp"

    # Timestamp must be ISO 8601
    from datetime import datetime
    try:
        datetime.fromisoformat(log["timestamp"].replace("Z", "+00:00"))
    except ValueError:
        pytest.fail(f"Timestamp '{log['timestamp']}' is not valid ISO 8601")

    # HTTP-specific fields
    assert "http.method" in log or "method" in log
    assert "http.status_code" in log or "status_code" in log
    assert "http.path" in log or "path" in log


def test_error_logs_contain_exception_details():
    """Error logs must contain stack traces and enough context to debug without reproduction."""
    # Trigger an error by sending an invalid request
    logs, response = capture_logs_for_request(
        "http://localhost:8000/api/orders/invalid-uuid-format",
        {"Authorization": "Bearer test-token"}
    )

    error_logs = [l for l in logs if l.get("level") in ("error", "ERROR")]

    if response.status_code >= 500 and not error_logs:
        pytest.fail("5xx response generated no error logs — impossible to debug")

    if error_logs:
        error_log = error_logs[0]
        assert "error" in error_log or "exception" in error_log, \
            "Error log must include error details"
        assert "stack_trace" in error_log or "traceback" in error_log or \
               "exception.stacktrace" in error_log, \
            "Error log must include stack trace"


def test_logs_do_not_contain_sensitive_fields():
    """Logs must not contain passwords, tokens, or payment card data."""
    # Field names whose values must never be logged unredacted
    SENSITIVE_KEYS = {
        "password", "secret", "token", "api_key",
        "card_number", "cvv", "ssn", "credit_card"
    }
    # Assumed redaction markers; adjust to whatever your logger emits
    REDACTED_VALUES = {"", "[REDACTED]", "***"}
    token = "test-token-12345"

    logs, _ = capture_logs_for_request(
        "http://localhost:8000/api/orders",
        {"Authorization": f"Bearer {token}"}
    )

    for log in logs:
        # The token we sent must not appear anywhere, with or without the "Bearer " prefix
        assert token not in json.dumps(log), \
            f"Token value '{token}' appeared in logs; it must be redacted"

        # Top-level sensitive fields may be present only if their value is redacted
        for key, value in log.items():
            if key.lower() in SENSITIVE_KEYS:
                assert str(value) in REDACTED_VALUES, \
                    f"Sensitive field '{key}' is logged with an unredacted value"
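
Sensitive values often sit inside nested objects or free-text messages, where a top-level key check does not reach. The sketch below walks a parsed log record recursively and also flags strings that contain a Luhn-valid 13 to 19 digit run, which is a common heuristic for payment card numbers. The names `find_leaks`, `SENSITIVE_KEYS`, and the `[REDACTED]` marker are illustrative assumptions, and the card heuristic can produce false positives on long numeric IDs.

```python
import re

SENSITIVE_KEYS = {"password", "secret", "token", "api_key", "card_number", "cvv", "ssn"}
CARD_CANDIDATE = re.compile(r"\b\d{13,19}\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, check mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_leaks(value, path="$"):
    """Return JSON-style paths of values that look like unredacted secrets."""
    leaks = []
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}"
            if str(key).lower() in SENSITIVE_KEYS and child not in ("", None, "[REDACTED]"):
                leaks.append(child_path)
            else:
                leaks.extend(find_leaks(child, child_path))
    elif isinstance(value, list):
        for i, child in enumerate(value):
            leaks.extend(find_leaks(child, f"{path}[{i}]"))
    elif isinstance(value, str):
        # Only strings are scanned; numeric values are ignored in this sketch
        if any(luhn_valid(m) for m in CARD_CANDIDATE.findall(value)):
            leaks.append(path)
    return leaks
```

In a test, `assert find_leaks(log) == []` reports the exact path of each offending value, which is easier to act on than a substring match against the serialized record.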

Testing Prometheus Metrics

Prometheus metrics need to be tested for both existence (the metric is registered and emitted) and correctness (the value and labels are accurate):

// PrometheusMetricsTest.java
import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.exporter.common.TextFormat;
import org.junit.jupiter.api.*;
import java.io.StringWriter;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class PrometheusMetricsTest {

    private CollectorRegistry registry;
    private OrderService orderService;

    @BeforeEach
    void setUp() {
        registry = new CollectorRegistry();
        orderService = new OrderService(registry); // inject test registry
    }

    @Test
    void orderCreatedCounterShouldIncrementOnSuccess() throws Exception {
        double initialCount = getCounterValue("orders_created_total",
            Map.of("status", "success"));

        orderService.createOrder(validOrderRequest());

        double newCount = getCounterValue("orders_created_total",
            Map.of("status", "success"));

        Assertions.assertEquals(initialCount + 1, newCount,
            "orders_created_total{status='success'} should increment by 1");
    }

    @Test
    void orderCreatedCounterShouldTrackFailuresSeparately() throws Exception {
        double initialFailureCount = getCounterValue("orders_created_total",
            Map.of("status", "failure"));

        Assertions.assertThrows(ValidationException.class,
            () -> orderService.createOrder(invalidOrderRequest()));

        double newFailureCount = getCounterValue("orders_created_total",
            Map.of("status", "failure"));

        Assertions.assertEquals(initialFailureCount + 1, newFailureCount,
            "Failure counter should be separate from success counter");
    }

    @Test
    void requestDurationHistogramShouldBeEmitted() throws Exception {
        orderService.createOrder(validOrderRequest());

        // Verify histogram was recorded
        double[] buckets = getHistogramBuckets("http_request_duration_seconds",
            Map.of("method", "POST", "path", "/api/orders", "status", "201"));

        Assertions.assertNotNull(buckets,
            "http_request_duration_seconds histogram must be emitted for POST /api/orders");
        Assertions.assertTrue(buckets[buckets.length - 1] >= 1,
            "At least one request should be recorded in the histogram");
    }

    @Test
    void metricsEndpointShouldExposeAllRequiredMetrics() throws Exception {
        // Make some requests to generate metrics
        orderService.createOrder(validOrderRequest());

        // Scrape the /metrics endpoint
        String metricsOutput = scrapeMetricsEndpoint();

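        List<String> requiredMetrics = List.of(
            "orders_created_total",
            "http_request_duration_seconds",
            "http_requests_in_flight",
            // JVM metric names depend on the library: Micrometer exposes
            // jvm_memory_used_bytes, simpleclient_hotspot exposes jvm_memory_bytes_used
            "jvm_memory_used_bytes",
            "process_cpu_seconds_total"
        );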

        for (String metric : requiredMetrics) {
            Assertions.assertTrue(metricsOutput.contains(metric),
                "Required metric '" + metric + "' not found in /metrics output");
        }
    }

    @Test
    void metricNamesShouldMatchPrometheusNamingConventions() throws Exception {
        String metricsOutput = scrapeMetricsEndpoint();

        // Parse all metric names and validate they follow snake_case convention
        Pattern metricNamePattern = Pattern.compile("^[a-z_:][a-z0-9_:]*$");
        for (String line : metricsOutput.split("\n")) {
            if (line.startsWith("#")) continue;
            String metricName = line.split("[{\\s]")[0];
            if (!metricName.isEmpty()) {
                Assertions.assertTrue(metricNamePattern.matcher(metricName).matches(),
                    "Metric name '" + metricName + "' violates Prometheus naming convention");
            }
        }
    }

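    // getSampleValue returns null when no sample matches; treat that as zero so
    // "before" readings work for label sets that have not been observed yet.
    // Label order must match the order used when the metric was declared.
    private double getCounterValue(String metricName, Map<String, String> labels) {
        Double value = registry.getSampleValue(metricName,
            labels.keySet().toArray(new String[0]),
            labels.values().toArray(new String[0]));
        return value == null ? 0.0 : value;
    }

    // Cumulative bucket counts for one label set, in ascending "le" order
    // (the last entry is the +Inf bucket); null if the histogram has no samples.
    private double[] getHistogramBuckets(String metricName, Map<String, String> labels) {
        java.util.List<Double> counts = new java.util.ArrayList<>();
        for (var family : java.util.Collections.list(registry.metricFamilySamples())) {
            for (var sample : family.samples) {
                if (!sample.name.equals(metricName + "_bucket")) continue;
                boolean matches = labels.entrySet().stream().allMatch(e -> {
                    int i = sample.labelNames.indexOf(e.getKey());
                    return i >= 0 && sample.labelValues.get(i).equals(e.getValue());
                });
                if (matches) counts.add(sample.value);
            }
        }
        return counts.isEmpty() ? null
            : counts.stream().mapToDouble(Double::doubleValue).toArray();
    }

    // Renders the test registry in the Prometheus text exposition format.
    // JVM and process metrics appear only if their collectors are registered here too.
    private String scrapeMetricsEndpoint() throws java.io.IOException {
        StringWriter writer = new StringWriter();
        TextFormat.write004(writer, registry.metricFamilySamples());
        return writer.toString();
    }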
}
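
The substring check in metricsEndpointShouldExposeAllRequiredMetrics passes whenever the required name appears anywhere in the output, including inside a longer metric name or a HELP comment. A stricter check parses the sample lines of the text exposition format and matches on exact name and labels. The following is a rough sketch under that idea; `parse_samples` and `has_sample` are hypothetical helper names, and the parser covers the common line shape rather than every corner of the format.

```python
import re

# name{label="value",...} value [timestamp]
SAMPLE_LINE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+\d+)?$'
)
LABEL_PAIR = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text: str) -> list[tuple[str, dict, float]]:
    """Parse exposition-format text into (metric name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            raise ValueError(f"unparseable sample line: {line!r}")
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_PAIR.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def has_sample(samples, name: str, **labels) -> bool:
    """True if a sample with this exact name carries at least the given labels."""
    return any(
        n == name and labels.items() <= found.items()
        for n, found, _ in samples
    )
```

With this, a test can assert `has_sample(samples, "orders_created_total", status="success")`, which fails if the metric was renamed or if the status label was dropped, two cases a plain `contains` check lets through.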

Testing OpenTelemetry Spans with Testcontainers and Jaeger

Use Testcontainers to spin up a real Jaeger instance and verify that your spans are created with the correct attributes:

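// OpenTelemetryTracingTest.java
// Imports omitted for brevity: JUnit 5, Testcontainers, Spring Boot Test, and the MockMvc static helpers.
// JaegerClient and Span are assumed to be small project-local test helpers that wrap
// the Jaeger query HTTP API; they are not classes shipped by Jaeger or OpenTelemetry.
@Testcontainers
@SpringBootTest
@AutoConfigureMockMvc
public class OpenTelemetryTracingTest {

    @Container
    static GenericContainer<?> jaeger = new GenericContainer<>("jaegertracing/all-in-one:1.55")
        .withExposedPorts(16686, 4317) // UI and OTLP gRPC
        .waitingFor(Wait.forHttp("/").forPort(16686));

    @Autowired
    private MockMvc mockMvc;

    private JaegerClient jaegerClient;

    @BeforeAll
    static void configureOTel() {
        // Configure OpenTelemetry to send spans to the test Jaeger instance.
        // Assumes the SDK reads these properties when the application context starts.
        // Use the container host instead of hard-coding localhost so remote Docker hosts work.
        String otlpEndpoint = "http://" + jaeger.getHost() + ":" + jaeger.getMappedPort(4317);
        System.setProperty("otel.exporter.otlp.endpoint", otlpEndpoint);
        System.setProperty("otel.service.name", "order-service-test");
        System.setProperty("otel.traces.exporter", "otlp");
    }

    @BeforeEach
    void setUp() {
        jaegerClient = new JaegerClient(
            "http://" + jaeger.getHost() + ":" + jaeger.getMappedPort(16686));
    }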

    @Test
    void createOrderShouldGenerateSpanWithRequiredAttributes() throws Exception {
        String requestId = "trace-test-" + UUID.randomUUID();

        // Make the request
        mockMvc.perform(post("/api/orders")
            .header("X-Request-Id", requestId)
            .header("X-Tenant-Id", "tenant-abc")
            .contentType(MediaType.APPLICATION_JSON)
            .content(validOrderJson()))
            .andExpect(status().isCreated());

        // Wait for span to be exported to Jaeger
        Thread.sleep(1000);

        // Query Jaeger for spans from this service
        List<Span> spans = jaegerClient.findTraces("order-service-test",
            Map.of("request.id", requestId));

        Assertions.assertFalse(spans.isEmpty(),
            "No spans found in Jaeger for request " + requestId);

        Span rootSpan = spans.get(0);

        // Required span attributes for observability
        assertSpanAttribute(rootSpan, "http.method", "POST");
        assertSpanAttribute(rootSpan, "http.target", "/api/orders");
        assertSpanAttribute(rootSpan, "http.status_code", "201");
        assertSpanAttribute(rootSpan, "tenant.id", "tenant-abc");
        assertSpanAttribute(rootSpan, "request.id", requestId);

        // Verify the span duration was recorded (not zero)
        Assertions.assertTrue(rootSpan.getDuration() > 0,
            "Span duration must be > 0");
    }

    @Test
    void databaseQueryShouldCreateChildSpan() throws Exception {
        String requestId = "db-span-test-" + UUID.randomUUID();

        mockMvc.perform(get("/api/orders")
            .header("X-Request-Id", requestId)
            .header("Authorization", "Bearer test-token"))
            .andExpect(status().isOk());

        Thread.sleep(1000);

        List<Span> allSpans = jaegerClient.findAllSpansForRequest(requestId);

        // Should have at minimum: HTTP span + DB query span
        Assertions.assertTrue(allSpans.size() >= 2,
            "Expected HTTP span + DB span, got " + allSpans.size() + " spans");

        // Find the database span
        Optional<Span> dbSpan = allSpans.stream()
            .filter(s -> s.getOperationName().contains("db") ||
                         s.getTags().containsKey("db.system"))
            .findFirst();

        Assertions.assertTrue(dbSpan.isPresent(),
            "Database query should create a child span with db.system attribute");

        // DB span should have SQL statement (or at minimum the operation type)
        Span db = dbSpan.get();
        Assertions.assertTrue(
            db.getTags().containsKey("db.statement") ||
            db.getTags().containsKey("db.operation"),
            "DB span must include db.statement or db.operation for query diagnosis"
        );
    }

    @Test
    void spanSamplingRateShouldBeCorrect() throws Exception {
        // Send 100 requests and count how many generate traces
        int requestCount = 100;
        List<String> requestIds = new ArrayList<>();

        for (int i = 0; i < requestCount; i++) {
            String requestId = "sampling-test-" + UUID.randomUUID();
            requestIds.add(requestId);
            mockMvc.perform(get("/api/health")
                .header("X-Request-Id", requestId))
                .andExpect(status().isOk());
        }

        Thread.sleep(2000); // wait for export

        long tracedCount = requestIds.stream()
            .filter(id -> !jaegerClient.findTraces("order-service-test",
                Map.of("request.id", id)).isEmpty())
            .count();

        // For a 10% sampling rate, expect roughly 10 traces (with tolerance)
        double samplingRate = (double) tracedCount / requestCount;
        Assertions.assertTrue(samplingRate >= 0.05 && samplingRate <= 0.20,
            "Sampling rate should be ~10%, got " + (samplingRate * 100) + "%");
    }
}
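
The tolerance band in spanSamplingRateShouldBeCorrect deserves a quick sanity check, because a correctly configured probabilistic sampler can still land outside it by chance. Assuming each request is sampled independently with probability p, the number of traced requests follows a binomial distribution, and the chance of a false test failure can be computed directly. The helper name below is illustrative.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k sampled requests out of n at sampling probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def false_failure_probability(n: int, p: float, min_traced: int, max_traced: int) -> float:
    """Chance that a correct sampler yields a count outside [min_traced, max_traced]."""
    inside = sum(binomial_pmf(k, n, p) for k in range(min_traced, max_traced + 1))
    return 1 - inside


# The test above: 100 requests, 10% sampling, accepted range 5 to 20 traces
flake_100 = false_failure_probability(100, 0.10, 5, 20)

# Same relative band with ten times as many requests
flake_1000 = false_failure_probability(1000, 0.10, 50, 200)
```

Under these assumptions `flake_100` comes out at roughly 2 to 3 percent, so a test with 100 requests would be expected to fail spuriously every few dozen CI runs, while `flake_1000` is negligible. Sending more requests, or configuring the sampler deterministically in the test profile, are two ways to reduce that flakiness.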

Testing Prometheus Alert Rules

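Alert rules are PromQL expressions that Prometheus evaluates on a fixed interval. Pushed values are static, so the approach below suits rules that compare instantaneous values; a rule built on rate() sees no increase in a constant series and will not fire. Test such rules by pushing known metric values and verifying the alerts fire (or don't fire) correctly: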

# test_alert_rules.py
import time
import requests
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

PUSHGATEWAY_URL = "http://localhost:9091"
PROMETHEUS_URL = "http://localhost:9090"

def push_metric(metric_name: str, samples: list[tuple[dict, float]]):
    """Push one metric with several label sets to the Pushgateway in a single request.

    push_to_gateway replaces every metric in the job's group, so pushing each
    label set in a separate call would leave only the last one behind.
    """
    registry = CollectorRegistry()
    label_names = sorted(samples[0][0].keys())
    g = Gauge(metric_name, "Test metric", label_names, registry=registry)
    for labels, value in samples:
        g.labels(**labels).set(value)
    push_to_gateway(PUSHGATEWAY_URL, job="alert-test", registry=registry)

def get_firing_alerts() -> list[dict]:
    """Query Prometheus for currently firing alerts."""
    response = requests.get(f"{PROMETHEUS_URL}/api/v1/alerts", timeout=10)
    response.raise_for_status()
    alerts = response.json()["data"]["alerts"]
    return [a for a in alerts if a["state"] == "firing"]

def test_high_error_rate_alert_fires():
    """The HighErrorRate alert should fire when error rate exceeds 5%."""
    # Push a high error rate: 50 errors out of 100 requests = 50% error rate
    push_metric("http_requests_total", [
        ({"status": "500", "service": "order-service"}, 50),
        ({"status": "200", "service": "order-service"}, 50),
    ])

    # Wait for Prometheus to evaluate the alert rule (depends on evaluation_interval)
    time.sleep(35)  # 30s for_duration + evaluation interval

    alerts = get_firing_alerts()
    high_error_alerts = [a for a in alerts if a["labels"]["alertname"] == "HighErrorRate"]

    assert len(high_error_alerts) > 0, \
        "HighErrorRate alert should fire when error rate is 50%"

    # Verify alert has required labels and annotations
    alert = high_error_alerts[0]
    assert "severity" in alert["labels"], "Alert must have severity label"
    assert alert["labels"]["severity"] in ("warning", "critical")
    assert "runbook_url" in alert["annotations"], \
        "Alert must include runbook_url annotation for on-call engineers"
    assert "summary" in alert["annotations"], "Alert must include summary annotation"

def test_high_error_rate_alert_does_not_fire_under_threshold():
    """The HighErrorRate alert must NOT fire when error rate is below 5%."""
    # Push a normal error rate: 2 errors out of 100 = 2% error rate
    push_metric("http_requests_total", [
        ({"status": "500", "service": "order-service"}, 2),
        ({"status": "200", "service": "order-service"}, 98),
    ])

    time.sleep(35)

    alerts = get_firing_alerts()
    high_error_alerts = [a for a in alerts if a["labels"]["alertname"] == "HighErrorRate"
                         and a["labels"].get("service") == "order-service"]

    assert len(high_error_alerts) == 0, \
        "HighErrorRate alert must not fire when error rate is only 2%"

def test_alert_includes_service_label_for_routing():
    """Alerts must include the service label so AlertManager can route to the correct team."""
    push_metric("http_requests_total", [
        ({"status": "500", "service": "payment-service"}, 100),
    ])

    time.sleep(35)

    alerts = get_firing_alerts()
    payment_alerts = [a for a in alerts if
                      a["labels"].get("service") == "payment-service"]

    # Fail loudly if nothing matched; a silent pass would hide a missing service label
    assert payment_alerts, \
        "Expected a firing alert labeled service='payment-service' for AlertManager routing"

    alert = payment_alerts[0]
    assert "team" in alert["labels"] or "team" in alert["annotations"], \
        "Alert must include team label/annotation for PagerDuty routing"

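Fixed sleeps such as `time.sleep(35)` make these tests slow when Prometheus is fast and flaky when it is slow. One alternative is to poll until the expected condition holds or a deadline passes. The helper below is a minimal sketch of that pattern; `wait_for` is a hypothetical name, and the fetch and readiness functions are supplied by the caller.

```python
import time


def wait_for(fetch, is_ready, timeout_s: float = 60.0, interval_s: float = 1.0):
    """Call fetch() repeatedly until is_ready(result) is true, then return the result.

    Raises TimeoutError if the condition is still false once timeout_s has elapsed.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch()
        if is_ready(result):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s} seconds")
        time.sleep(interval_s)
```

For a firing test this could be used as `wait_for(get_firing_alerts, lambda alerts: any(a["labels"]["alertname"] == "HighErrorRate" for a in alerts), timeout_s=90)`. Polling only helps positive assertions: a test that expects an alert not to fire still has to wait for at least the rule's full pending duration plus one evaluation interval before concluding anything.
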
Observability Test Coverage Matrix

| Test Category | Minimum Coverage |
| --- | --- |
| Structured logs | Required fields present on every HTTP request |
| Log levels | ERROR used for errors, INFO for business events, DEBUG not logged in production |
| PII in logs | No passwords, tokens, card numbers, or SSNs |
| HTTP metrics | Request count, error count, and duration histogram per endpoint |
| Business metrics | At least one counter per business operation (order created, payment processed) |
| OTel spans | Root span per request with correct HTTP attributes |
| OTel child spans | DB queries, external HTTP calls, and queue operations each have child spans |
| Span attributes | tenant_id, user_id, and request_id present on relevant spans |
| Alert rules | Each alert tested in both firing and non-firing conditions |
| Alert metadata | runbook_url and summary annotations on every alert rule |

Observability is only as good as the tests that verify it. Instrumenting your services without testing the instrumentation gives you the illusion of observability: everything looks fine in the dashboard until an incident proves otherwise. Adding observability tests to your CI pipeline, alongside functional tests, is the most reliable way to make sure the signals you depend on during incidents are actually there when you need them.

Platforms like HelpMeTest can integrate observability validation into continuous test runs, ensuring that every deployment is verified not just for functional correctness but for the logging, metrics, and tracing infrastructure that your on-call engineers depend on.
