Jaeger: Distributed Tracing for Microservices

Learn Jaeger for distributed tracing visualization. Covers trace analysis, dependency mapping, and integration with OpenTelemetry.

published: March 22, 2026 reading time: 33 min read author: GeekWorkBench updated: June 17, 2026

Quick Summary

Jaeger visualizes how requests flow through your microservices by collecting trace spans and assembling them into complete request paths. The collector receives spans via OTLP, stores them in Elasticsearch or Cassandra, and the query service exposes them to the UI. This guide walks through deployment options from all-in-one development setups to production clusters with adaptive sampling strategies. You'll also find debugging workflows for slow requests and error tracking, plus guidance on tail-based sampling to ensure you capture the traces that actually matter.

Jaeger is an open-source distributed tracing system for monitoring and troubleshooting microservices. It shows you how requests flow through your services, where latency lives, and how your services depend on each other.

This guide covers Jaeger deployment, trace analysis, and practical debugging workflows. For tracing fundamentals, see our Distributed Tracing guide first.

Introduction

graph TB
    A[Services] -->|OTLP| B[Jaeger Collector]
    B --> C[Jaeger Backend]
    C --> D[Elasticsearch]
    C --> E[Cassandra]
    C --> F[Kafka]
    G[Jaeger Query] --> C
    H[Jaeger UI] --> G

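Jaeger splits span collection, storage, and querying into separate components:

Agent: Sidecar or daemonset that receives spans via UDP and forwards them to the collector (deprecated in recent releases; services instrumented with OTLP send directly to the collector, as in the diagram above)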
Collector: Receives spans, processes them, and stores them
Query: Backend service for trace retrieval
UI: Web interface for trace exploration

Deployment Options

All-in-One Quick Start

For local development:

docker run -d \
  --name jaeger \
  -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 16686:16686 \
  -p 14268:14268 \
  -p 4317:4317 \
  -p 4318:4318 \
  -p 9411:9411 \
  jaegertracing/all-in-one:latest

Access the UI at http://localhost:16686.

Production Deployment

For Kubernetes:

# jaeger-operator.yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
  namespace: observability
spec:
  strategy: production
  collector:
    maxReplicas: 5
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
  storage:
    type: elasticsearch
    elasticsearch:
      name: elasticsearch
      doNotProvision: true
      secretName: jaeger-elasticsearch
  query:
    replicas: 2
    options:
      query:
        base-path: /jaeger

External Elasticsearch Backend

apiVersion: v1
kind: Secret
metadata:
  name: jaeger-elasticsearch
  namespace: observability
type: Opaque
stringData:
  ES_USERNAME: "jaeger"
  # kubectl does not expand variables; substitute this value before applying (for example with envsubst)
  ES_PASSWORD: "${ES_PASSWORD}"
---
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
  namespace: observability
spec:
  strategy: production
  collector:
    autoscale: true
    maxReplicas: 5
  storage:
    type: elasticsearch
    # Credentials come from the Secret above
    secretName: jaeger-elasticsearch
    options:
      es:
        server-urls: https://elasticsearch:9200
    # Daily job that deletes indices older than numberOfDays
    esIndexCleaner:
      enabled: true
      numberOfDays: 14
      schedule: "55 5 * * *"
  query:
    replicas: 2

Jaeger UI

The Jaeger UI has several views for trace analysis.

Search View

The search view is where every trace investigation begins. You start by selecting a service from the dropdown — if you do not know which service is involved, you can search across all services, but narrowing by service first dramatically reduces noise. The service name comes from the service.name attribute your instrumentation sets, so consistency in how you name services matters: “auth-service” and “auth” appear as separate services in Jaeger and will split your results.

Once you have a service selected, the operation dropdown populates with every span operation that service has emitted. Use this to narrow to a specific endpoint or function — if you know the checkout service is slow on the POST /orders path, select that operation instead of seeing every trace from that service. The time range selector defaults to the last hour, which is usually the right starting point for incident investigations. For historical analysis, you can stretch back to 7 days or more depending on your retention configuration.

The tag filters are where searches get precise. You can add multiple tag filters, and all of them must match. A common pattern during incidents: filter by error=true to see only failed traces, combined with a duration filter to find traces exceeding your SLO threshold. Tags match on exact values, so http.status_code=500 will not match a span tagged http.status_code=503, and there is no wildcard for “any 5xx”. If you have a specific trace ID from a user report or log entry, the Trace ID field accepts that directly and takes you straight to the trace without needing service or time context.

Duration filters take a minimum and a maximum, each written as a number with a unit such as 500ms, 1.2s, or 100us. If your p95 latency target is 500ms, set the minimum to 500ms to find traces slower than the target. Combined with an error tag filter, this quickly surfaces the worst-performing traces for a given operation. The results list shows the most recent traces first by default, but you can re-sort by duration to find the slowest traces in your time window.
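The same filters can be expressed as query parameters when you script a search. The sketch below only builds the URL. It assumes the query service's `/api/traces` HTTP endpoint (the internal API the UI uses, not a stable public contract); the host `jaeger-query` and the function name `build_search_url` are illustrative.

```python
import json
from urllib.parse import urlencode


def build_search_url(base, service, operation=None, min_duration=None,
                     tags=None, lookback="1h", limit=20):
    """Build a trace-search URL whose parameters mirror the UI search form."""
    params = {"service": service, "lookback": lookback, "limit": limit}
    if operation:
        params["operation"] = operation
    if min_duration:
        params["minDuration"] = min_duration  # number plus unit, e.g. "500ms"
    if tags:
        # Tags travel as one JSON object; every key/value pair must match.
        params["tags"] = json.dumps(tags, separators=(",", ":"))
    return f"{base}/api/traces?{urlencode(params)}"


url = build_search_url(
    "http://jaeger-query:16686",
    "checkout-service",
    operation="POST /orders",
    min_duration="500ms",
    tags={"error": "true"},
)
print(url)
```

Keeping the URL construction in one helper makes it easy to reuse the same incident query (errors slower than the SLO threshold) across services.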

Trace Detail View

The trace detail view displays a waterfall of spans arranged by start time. Each span appears as a horizontal bar whose width represents its duration relative to the total trace time, making it immediately obvious where time is being spent. Spans are indented under their parents, so you can visually follow the call chain from the root span (typically the incoming request) down through each service and operation.

Reading the waterfall from left to right shows the chronological order of operations. Sibling spans whose bars overlap horizontally ran concurrently: if the auth service and inventory service spans both start at the same timestamp, they executed in parallel. A span nested under another span is a child of that parent; when the child's bar ends before the parent's, the call was typically synchronous and the parent waited for it. A child whose bar extends past the end of its parent usually indicates an asynchronous call where the parent did not wait for the child to complete.

Jaeger color-codes spans by service, which makes it easy to spot which service dominates the timeline at a glance. Spans that carry an error tag are flagged with a red error icon next to the span name. If you see that icon on a wide bar under the payment service, click the span to see what went wrong before digging into other services.

The first bottleneck to look for is any span that takes up a disproportionate share of the total trace time. In the waterfall, this appears as a bar that is visually much wider than its siblings. Drill into that span to see its child spans — the bottleneck is often inside a child call rather than in the parent service itself. External calls to payment gateways, databases, and third-party APIs are the most common culprits for unexplained latency.
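One way to make “disproportionate share” concrete is self time: a span's duration minus the time covered by its direct children. The sketch below assumes spans shaped like Jaeger's JSON export (`spanID`, `startTime` and `duration` in microseconds, `references` with a `refType`). The sample trace and the names `self_times` and `child_of` are made up for illustration.

```python
from collections import defaultdict


def self_times(spans):
    """Map spanID -> self time in microseconds (duration minus child coverage)."""
    children = defaultdict(list)
    for span in spans:
        for ref in span.get("references", []):
            if ref.get("refType") == "CHILD_OF":
                children[ref["spanID"]].append(span)

    result = {}
    for span in spans:
        start = span["startTime"]
        end = start + span["duration"]
        # Clamp child intervals to the parent and merge overlaps so that
        # parallel children are not counted twice.
        intervals = sorted(
            (max(c["startTime"], start), min(c["startTime"] + c["duration"], end))
            for c in children.get(span["spanID"], [])
        )
        covered, cursor = 0, start
        for lo, hi in intervals:
            lo = max(lo, cursor)
            if hi > lo:
                covered += hi - lo
                cursor = hi
        result[span["spanID"]] = span["duration"] - covered
    return result


def child_of(parent_id):
    """Build a CHILD_OF reference list for the sample data."""
    return [{"refType": "CHILD_OF", "spanID": parent_id}]


trace_spans = [
    {"spanID": "a", "startTime": 0, "duration": 100, "references": []},
    {"spanID": "b", "startTime": 10, "duration": 30, "references": child_of("a")},
    {"spanID": "c", "startTime": 20, "duration": 70, "references": child_of("a")},
]
times = self_times(trace_spans)
bottleneck = max(times, key=times.get)
print(bottleneck, times)
```

Here the root span `a` is the widest bar, but most of its time is spent waiting on `c`, which is where the investigation should continue.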

Span Detail Panel

The span detail panel opens when you click any span in the trace waterfall. It groups what Jaeger knows about that operation into several sections, and all of them are useful during debugging.

The Tags tab shows every key-value attribute attached to the span. These are the attributes your instrumentation set via span.setAttribute. The tags you care about most depend on the operation type: for HTTP spans look at http.method, http.url, and http.status_code; for database spans look at db.system, db.statement, and db.operation. If you set custom business attributes like order.id or customer.tier, they appear here too. A missing expected tag usually means the instrumentation did not execute that code path, which tells you something about how the request was processed.

The Logs tab shows timestamped events emitted during the span lifetime. Your span.addEvent calls appear here with their timestamps and any payload data. This is where database query logs, cache hit/miss events, and intermediate step timings appear. If an exception was thrown, the stack trace shows up as a log entry with the error message and a full traceback. When troubleshooting, look for logs that appear near the end of the span timeline — a log entry immediately before a sudden end often indicates where the operation failed.

The References tab shows the parent-child relationships for this span. The parent span ID links back to the span that created this one, confirming the propagation chain is intact. If you see a span with no parent reference in a trace that should have a full chain, the span was created without proper parent context. That is an instrumentation bug, and it explains why the trace looks broken. The References tab also shows linked spans for operations like Kafka message publishing, where the consumer span references the producer span.
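The same check can be run over an exported trace. This is a minimal sketch assuming the span shape of Jaeger's JSON export (`spanID` plus a `references` list with `refType` and `spanID`); the sample spans and function names are hypothetical.

```python
def parent_ids(span):
    """Return the span IDs this span names as its parents."""
    return [
        ref["spanID"]
        for ref in span.get("references", [])
        if ref.get("refType") == "CHILD_OF"
    ]


def find_orphans(spans):
    """Spans whose parent reference points at a span missing from the trace."""
    known = {s["spanID"] for s in spans}
    return [s["spanID"] for s in spans if any(p not in known for p in parent_ids(s))]


def find_roots(spans):
    """Spans with no parent reference at all."""
    return [s["spanID"] for s in spans if not parent_ids(s)]


spans = [
    {"spanID": "root", "references": []},
    {"spanID": "child", "references": [{"refType": "CHILD_OF", "spanID": "root"}]},
    {"spanID": "lost", "references": [{"refType": "CHILD_OF", "spanID": "missing"}]},
    {"spanID": "detached", "references": []},
]
orphans = find_orphans(spans)
roots = find_roots(spans)
print(orphans, roots)
```

An orphan means the parent span was never reported (dropped or unsampled). More than one root in a trace that should have a single entry point suggests that context was not propagated into the second root.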

Jaeger also offers a trace comparison view (Compare), which diffs the structure of two traces and highlights spans that appear in only one of them. This is useful for identifying where two similar requests diverged.

Trace Analysis Workflows

Debugging a Slow Request

Find the bottleneck in a slow trace:

Search for traces with high duration
Identify which service has the longest spans
Check span tags for business context
Review span logs for errors or unusual events

# Search for slow traces via the query service's HTTP API (the internal API used by the UI)
curl -s "http://jaeger-query:16686/api/traces?service=api-gateway&lookback=1h&minDuration=5s&limit=20" | \
  jq '.data[] | {traceID: .traceID, longest_span_us: ([.spans[].duration] | max), services: ([.processes[].serviceName] | unique)}'

Finding Error Sources

Identify which service is causing errors:

Filter traces by error status
Examine the error span and its logs
Follow the trace backward to find root cause

# Find traces with errors (JSON tags have the shape {key, type, value})
curl -s "http://jaeger-query:16686/api/traces?service=checkout-service&lookback=1h" | \
  jq '.data[] | select(any(.spans[].tags[]?; .key == "error" and .value == true))'
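“Following the trace backward” can be approximated in code: among the spans tagged as errors, the ones with no failing child are the likely origin, since errors usually bubble up from callee to caller. The sketch assumes spans in Jaeger's JSON shape (`tags` as `{key, value}` objects, `references` with `refType`); the sample data and function names are illustrative.

```python
def has_error(span):
    """True if the span carries an error tag set to true."""
    return any(
        tag.get("key") == "error" and tag.get("value") in (True, "true")
        for tag in span.get("tags", [])
    )


def error_origins(spans):
    """IDs of error spans that have no failing child of their own."""
    failing = [s for s in spans if has_error(s)]
    parents_of_failing = {
        ref["spanID"]
        for s in failing
        for ref in s.get("references", [])
        if ref.get("refType") == "CHILD_OF"
    }
    return [s["spanID"] for s in failing if s["spanID"] not in parents_of_failing]


def span(span_id, parent=None, error=False):
    """Build a minimal sample span."""
    refs = [{"refType": "CHILD_OF", "spanID": parent}] if parent else []
    tags = [{"key": "error", "type": "bool", "value": True}] if error else []
    return {"spanID": span_id, "references": refs, "tags": tags}


spans = [
    span("gateway", error=True),
    span("checkout", parent="gateway", error=True),
    span("payment", parent="checkout", error=True),
    span("inventory", parent="checkout"),
]
origins = error_origins(spans)
print(origins)
```

This is a heuristic: it points at the deepest failing span, which is where to read logs first, not a proof of root cause.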

Analyzing Service Dependencies

Use the dependency view to understand service relationships:

Navigate to the Dependency graph view
Click on services to see call patterns
Identify services with high fan-out
Spot potential bottlenecks
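Fan-out can also be computed from the dependency data behind the graph. The sketch below assumes each edge is a record with `parent`, `child`, and `callCount` fields, which is the shape the query service's dependencies endpoint is expected to return; the edges and service names here are invented sample data.

```python
from collections import defaultdict


def fan_out(links):
    """Count the distinct downstream services each service calls."""
    downstream = defaultdict(set)
    for link in links:
        downstream[link["parent"]].add(link["child"])
    return {service: len(children) for service, children in downstream.items()}


links = [
    {"parent": "api-gateway", "child": "checkout", "callCount": 120},
    {"parent": "api-gateway", "child": "auth", "callCount": 300},
    {"parent": "checkout", "child": "payment", "callCount": 80},
    {"parent": "checkout", "child": "inventory", "callCount": 95},
    {"parent": "checkout", "child": "auth", "callCount": 10},
]
counts = fan_out(links)
widest = max(counts, key=counts.get)
print(widest, counts)
```

A service with high fan-out is exposed to the latency and failures of every dependency it calls, so it is a natural place to look for timeouts and missing circuit breakers.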

Advanced Features

Adaptive Sampling

Reduce storage by sampling intelligently:

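apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: adaptive-sampling
spec:
  strategy: production  # valid strategies: allInOne, production, streaming
  collector:
    options:
      sampling:
        # Adaptive sampling flags on the Jaeger v1 collector; verify against your version.
        # The collector must also run with SAMPLING_CONFIG_TYPE=adaptive.
        target-samples-per-second: 100  # target applies per operation
        initial-sampling-probability: 0.1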
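The idea behind adaptive sampling is a feedback loop: compare the observed rate of sampled traces with a target and scale the probability accordingly. The function below is a deliberately simplified illustration of that loop, not Jaeger's actual algorithm; the name `adjust_probability` and the clamping bounds are assumptions.

```python
import math


def adjust_probability(current_p, observed_per_sec, target_per_sec,
                       min_p=1e-5, max_p=1.0):
    """Scale the sampling probability toward the target rate, then clamp it."""
    if observed_per_sec <= 0:
        # Nothing was sampled in this window: probe upward.
        proposed = current_p * 2
    else:
        proposed = current_p * (target_per_sec / observed_per_sec)
    return min(max_p, max(min_p, proposed))


halved = adjust_probability(0.1, observed_per_sec=200, target_per_sec=100)
capped = adjust_probability(0.5, observed_per_sec=10, target_per_sec=100)
probing = adjust_probability(0.001, observed_per_sec=0, target_per_sec=100)
print(halved, capped, probing)
assert math.isclose(halved, 0.05)
```

A busy operation is sampled less and a quiet one more, which keeps storage roughly constant while low-traffic endpoints stay visible.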

Monitor View (Service Performance Monitoring)

The Monitor tab, also called Service Performance Monitoring (SPM), is a service-level aggregation view in Jaeger that shifts your perspective from individual trace debugging to fleet-wide operational health. While the Trace Detail View shows you a waterfall of spans for a single request, the Monitor tab zooms out to display aggregate metrics grouped by service and operation. That makes it well suited to spotting degradation patterns and identifying which services are under load. It relies on metrics derived from spans and stored in a Prometheus-compatible backend, so it stays empty until that pipeline is configured.

The view exposes three key dimensions for every instrumented service. Latency per operation gives you a quick read on which endpoints are slowing down, which helps catch regressions before they escalate into p99 breaches. Request rate shows the volume of requests flowing through each service, so you can correlate traffic spikes with latency changes or error rate jumps. Error rate highlights services where things are going wrong: a sudden spike tells you where to start your root-cause investigation without digging through every trace manually. For the connections between services, use the dependency graph: if the checkout service is timing out, its edge to the payment provider is a good first suspect.

To get the most out of this view, set a short time window (last 5–15 minutes) for real-time triage, or stretch it to an hour for broader trend analysis. It works best paired with the Trace Detail View: start here to find the failing service, then open a specific trace to drill into individual span timings and error logs. It is particularly effective during incident response, where the first question is almost always “which service is affected?” rather than “which trace is slow?”

Trace Quality Scoring

Jaeger can score traces based on quality:

# Trace quality indicators
- Missing span tags
- Incomplete trace depth
- High error rate
- Excessive span count

Jaeger vs Zipkin: Comparison and Migration

Understanding how Jaeger relates to Zipkin helps when evaluating or migrating distributed tracing solutions.

Shared Foundations

Jaeger and Zipkin both model a request as a trace made of spans, so their data models map onto each other closely:

Span: Represents a unit of work with name, start time, duration, and attributes
Trace: A collection of spans forming a complete request path
Context propagation: Zipkin uses B3 headers by default and Jaeger's legacy clients used the uber-trace-id header; with OpenTelemetry instrumentation, both can receive traces propagated via W3C Trace Context

This compatibility means you can instrument services using OpenTelemetry and route traces to either system.

Key Differences

Aspect	Jaeger	Zipkin
Architecture	Collector, Query, UI, Agent as separate components	Simple architecture with Collector and Query only
Storage backends	Elasticsearch, Cassandra (plus Kafka as an ingestion buffer)	In-memory, Cassandra, Elasticsearch, MySQL
Sampling strategies	Probabilistic, Adaptive, Tail-based	Probabilistic, Rate-limiting
OTLP support	Native OTLP receiver	Requires Zipkin collector adapter
Service mesh integration	Native Istio and Linkerd support	Limited built-in support
Operations complexity	Higher due to more components	Lower, simpler to operate

Using the Zipkin Receiver

Jaeger includes a Zipkin-compatible receiver for migrations:

# Configure Jaeger collector to accept Zipkin spans on port 9411
docker run -d \
  --name jaeger \
  -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
  -p 16686:16686 \
  -p 9411:9411 \
  jaegertracing/all-in-one:latest

This lets existing Zipkin-instrumented services send traces to Jaeger without re-instrumentation.

Migration Path

When migrating from Zipkin to Jaeger:

Phase 1: Deploy Jaeger alongside Zipkin, configure services to send traces to both
Phase 2: Validate Jaeger data completeness and sampling behavior
Phase 3: Switch primary monitoring to Jaeger, keep Zipkin as fallback
Phase 4: Decommission Zipkin once confidence is established

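# Dual-export configuration for gradual migration
# OpenTelemetry Collector fragment: fan traces out to both backends during the transition
receivers:
  otlp:
    protocols:
      grpc:
exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true  # plaintext inside the cluster; enable TLS otherwise
  zipkin:
    endpoint: http://zipkin:9411/api/v2/spans
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger, zipkin]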

When to Choose Jaeger Over Zipkin

Choose Jaeger when:

You need advanced sampling strategies like tail-based sampling
Your environment uses Kubernetes with service mesh (Istio/Linkerd)
You require long-term trace storage with Elasticsearch or Cassandra
You want native OpenTelemetry protocol support without adapters
Your team needs more sophisticated trace visualization and analysis

Choose Zipkin when:

You have existing Zipkin instrumentation and limited migration budget
Your deployment scale is modest and simple architecture is preferred
You need quick setup with minimal operational overhead
Your organization already has expertise with Zipkin tooling

Integrating with OpenTelemetry

OpenTelemetry is the modern standard for tracing:

Automatic Instrumentation

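# Python auto-instrumentation
from flask import Flask
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

app = Flask(__name__)

# The SDK TracerProvider carries the service.name that Jaeger shows in the service dropdown
provider = TracerProvider(
    resource=Resource.create({"service.name": "my-service"})
)

# Batch spans and send them to the collector's OTLP gRPC port
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://jaeger-collector:4317", insecure=True)
    )
)
trace.set_tracer_provider(provider)

# Instrument inbound Flask requests and outbound calls made with requests
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()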

Manual Span Creation

import { context, trace, Span, SpanStatusCode } from "@opentelemetry/api";

// Order, chargePayment, and fulfillOrder are application code defined elsewhere
const tracer = trace.getTracer("order-service");

async function processOrder(order: Order) {
  const span = tracer.startSpan("OrderService.process");

  try {
    await validateOrder(order, span);
    await chargePayment(order, span);
    await fulfillOrder(order, span);

    span.setStatus({ code: SpanStatusCode.OK });
  } catch (error) {
    span.recordException(error as Error);
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: (error as Error).message,
    });
    throw error;
  } finally {
    span.end();
  }
}

async function validateOrder(order: Order, parentSpan: Span) {
  // The parent is supplied through a Context (third argument), not a span option
  const parentContext = trace.setSpan(context.active(), parentSpan);
  const span = tracer.startSpan("OrderService.validate", {}, parentContext);

  try {
    span.setAttribute("validation.type", "business");

    // Validation logic
  } finally {
    // End the span even if validation throws
    span.end();
  }
}

Performance Analysis

Latency Percentiles

Analyze latency distribution:

# List the services Jaeger knows about
curl -s "http://jaeger-query:16686/api/services" | jq -r '.data[]'

# Get p95 latency for a service (requires Service Performance Monitoring with a metrics backend)
curl -s "http://jaeger-query:16686/api/metrics/latencies?service=api-gateway&quantile=0.95" | jq
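When no metrics backend is available, percentiles can be computed from the durations of the traces a search returns. This is a minimal sketch using the nearest-rank method. The durations are invented sample values in milliseconds, and because they would come from sampled traces, the result describes the sample rather than all traffic.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q% of data at or below it."""
    if not values:
        raise ValueError("percentile of empty data")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[rank - 1]


durations_ms = [120, 80, 450, 95, 2000, 110, 105, 130, 90, 100]
p50 = percentile(durations_ms, 50)
p95 = percentile(durations_ms, 95)
print(p50, p95)
```

The gap between the median and p95 in the sample shows why averages hide tail latency: one slow request dominates the upper percentiles.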

Throughput Analysis

Understand request volume patterns:

# Get traces count over time
curl -s "http://jaeger-query:16686/api/traces?service=api-gateway&start=$(date -d '1 hour ago' +%s000000)&end=$(date +%s000000)&limit=1000" | \
  jq '.data | length'

Error Rate Correlation

Correlate errors across services:

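# Find traces with errors and the services that reported them
# (the API needs a concrete service name; repeat per entry service)
curl -s "http://jaeger-query:16686/api/traces?service=api-gateway&lookback=1h&limit=200" | \
  jq '.data[] | . as $t | {
      traceID: .traceID,
      errors: ([.spans[] | select(any(.tags[]?; .key == "error" and .value == true)) | $t.processes[.processID].serviceName] | unique)
    } | select(.errors | length > 0)'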

Alerting on Trace Data

Prometheus Metrics from Jaeger

# jaeger-metrics-exporter.yaml
# 14269 is the Jaeger v1 collector admin port, which serves /metrics
apiVersion: v1
kind: Service
metadata:
  name: jaeger-metrics
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "14269"
spec:
  ports:
    - port: 14269
      targetPort: 14269
---
# Prometheus scrape config (goes in prometheus.yml, not in the Kubernetes manifest)
scrape_configs:
  - job_name: "jaeger"
    static_configs:
      - targets: ["jaeger-collector:14269"]

Key Metrics to Monitor

Metric	Description
`jaeger_collector_traces_received_total`	Incoming traces count
`jaeger_collector_spans_received_total`	Total spans received
`jaeger_collector_queue_length`	Pending spans in queue
`jaeger_query_latency_bucket`	Query response time (histogram)

Storage Backends

Jaeger supports multiple storage backends.

Elasticsearch

Best for large-scale production deployments:

# Self-provisioned Elasticsearch with explicit shard and replica counts
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
spec:
  strategy: production
  storage:
    type: elasticsearch
    elasticsearch:
      nodeCount: 3
      redundancyPolicy: SingleRedundancy
    options:
      es:
        # Passed to Jaeger as --es.num-shards and --es.num-replicas
        num-shards: 5
        num-replicas: 1

Cassandra

Traditional choice for high volume:

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
spec:
  strategy: production
  storage:
    type: cassandra
    cassandra:
      servers: cassandra:9042
      keyspace: jaeger_v1
      replication_factor: 2

Kafka

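Kafka is a buffer, not a trace store. In the streaming strategy it sits between the collector and the ingester, which absorbs traffic spikes and allows replay, while the ingester writes to the real backend:

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
spec:
  strategy: streaming
  collector:
    maxReplicas: 10
    options:
      kafka:
        producer:
          topic: jaeger-spans
          brokers: kafka:9092
  ingester:
    replicas: 2
    options:
      kafka:
        consumer:
          topic: jaeger-spans
          brokers: kafka:9092
  # The ingester still needs a backend to write to
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: https://elasticsearch:9200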

SLO Integration with Tracing

Correlate trace data with Service Level Objectives for better reliability.

Defining SLOs from Traces

# Sampling settings that keep enough traces per operation to evaluate latency SLOs
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: slo-tracking
spec:
  strategy: production
  collector:
    options:
      sampling:
        # Adaptive sampling is head-based: it steadies per-operation volume
        # but cannot single out slow traces. Use tail-based sampling for that.
        # Flag name is from the Jaeger v1 collector; verify against your version.
        target-samples-per-second: 100

Trace-Based SLO Dashboard

Key trace metrics to visualize:

Metric	SLO Target	Query Method
p99 Latency	< 500ms	Trace duration percentiles
Error Rate	< 0.1%	Error spans / total spans
Availability	> 99.9%	Successful traces / total traces
Trace Completeness	> 95%	Complete traces / total traces
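The ratios in the table reduce to simple counts over a window of traces. The sketch below assumes each trace has already been summarized into a duration and an error flag; the field names `duration_ms` and `error`, the function `slo_report`, and the sample values are hypothetical. Sampled data skews these numbers, so treat them as indicative unless every trace is kept.

```python
def slo_report(traces, latency_target_ms=500.0):
    """Compute error rate, availability, and latency compliance for a window."""
    total = len(traces)
    if total == 0:
        raise ValueError("no traces in window")
    failed = sum(1 for t in traces if t["error"])
    slow = sum(1 for t in traces if t["duration_ms"] > latency_target_ms)
    return {
        "error_rate": failed / total,
        "availability": (total - failed) / total,
        "latency_compliance": (total - slow) / total,
    }


window = [
    {"duration_ms": 120.0, "error": False},
    {"duration_ms": 640.0, "error": False},
    {"duration_ms": 95.0, "error": True},
    {"duration_ms": 210.0, "error": False},
]
report = slo_report(window)
print(report)
```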

Latency Budgets with Traces

Allocate latency budget across services:

# Get latency breakdown by service for an SLO window
curl -s "http://jaeger-query:16686/api/traces?service=api-gateway&lookback=1h" | \
  jq '[.data[] | . as $t | .spans[] | {
    service: $t.processes[.processID].serviceName,
    operation: .operationName,
    duration_ms: (.duration / 1000),
    errors: ([.tags[]? | select(.key == "error" and .value == true)] | length)
  }] | group_by(.service) | map({
    service: .[0].service,
    avg_duration_ms: (map(.duration_ms) | add / length),
    max_duration_ms: (map(.duration_ms) | max),
    error_count: (map(.errors) | add)
  })'

Tail-Based Sampling for SLO Traces

Ensure error and slow traces are always captured:

# Tail-based sampling configuration
# (OpenTelemetry Collector tail_sampling processor placed in front of Jaeger;
# a trace is kept if any policy matches)
processors:
  tail_sampling:
    decision_wait: 10s  # how long to buffer a trace's spans before deciding
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 500
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
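The decision a tail-based sampler makes can be sketched in a few lines: wait until the trace is complete, keep it if it failed or was slow, otherwise keep a fixed fraction. This is an illustration of the technique under assumed span fields (`startTime`, `duration` in microseconds, `tags`), not the implementation of any particular collector. The function name and thresholds are hypothetical.

```python
import hashlib


def keep_trace(trace_id, spans, slow_threshold_us=500_000, baseline_rate=0.1):
    """Decide whether to keep a completed trace."""
    # Policy 1: always keep traces that contain an error span.
    for span in spans:
        for tag in span.get("tags", []):
            if tag.get("key") == "error" and tag.get("value") in (True, "true"):
                return True
    # Policy 2: always keep traces slower than the threshold.
    start = min(s["startTime"] for s in spans)
    end = max(s["startTime"] + s["duration"] for s in spans)
    if end - start >= slow_threshold_us:
        return True
    # Policy 3: keep a stable fraction of the rest. Hashing the trace ID
    # means every sampler instance makes the same choice for the same trace.
    digest = hashlib.sha256(trace_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x1_0000_0000  # uniform in [0, 1)
    return bucket < baseline_rate


fast_ok = [{"startTime": 0, "duration": 20_000, "tags": []}]
slow_ok = [{"startTime": 0, "duration": 900_000, "tags": []}]
fast_err = [{"startTime": 0, "duration": 20_000,
             "tags": [{"key": "error", "value": True}]}]

print(keep_trace("abc123", fast_err), keep_trace("abc123", slow_ok))
```

The cost of this approach is memory and delay: every span has to be buffered until the trace is judged complete, which is why tail-based sampling carries more operational overhead than head-based sampling.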

Multi-Region Deployment Considerations

Deploy Jaeger across regions for global microservices visibility.

Architecture Patterns

graph LR
    A[US-East Services] --> B[US-East Jaeger Collector]
    C[EU-West Services] --> D[EU-West Jaeger Collector]
    E[AP-South Services] --> F[AP-South Jaeger Collector]
    B --> G[Central Storage]
    D --> G
    F --> G
    H[Global Query] --> G

Cross-Region Trace Context

Propagate trace context on calls that cross regions so the downstream spans join the same trace:

# Cross-region trace context propagation
import requests
from opentelemetry import propagate, trace

def forward_to_region(headers: dict, region: str):
    # Tag the current span with the destination region
    current_span = trace.get_current_span()
    current_span.set_attribute("destination.region", region)

    # Inject the current trace context (traceparent/tracestate) into the outgoing headers
    propagate.inject(headers)

    response = requests.post(
        f"https://{region}-api.example.com/process",
        headers=headers,
        timeout=10,
    )
    return response
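When a cross-region trace breaks, it helps to inspect the header that actually arrived. The sketch below parses a W3C `traceparent` header in its version 00 format using only the standard library; the function name `parse_traceparent` is an assumption, and the sample value is the example from the W3C Trace Context specification.

```python
import re

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return the header's fields as a dict, or None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


parsed = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
print(parsed)
```

If the receiving region logs a different `trace_id` than the sender, or no header at all, a proxy or gateway between regions is likely stripping or rewriting it.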

Regional Sampling Strategies

# Regional sampling configuration
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: multi-region-jaeger
  namespace: observability
spec:
  strategy: production
  sampling:
    options:
      # Higher baseline sampling rate in production regions
      default_strategy:
        type: probabilistic
        param: 0.2
      # Remote sampling strategies are keyed by service, not by tag, so
      # cross-region traffic is matched via a hypothetical gateway service name
      service_strategies:
        - service: cross-region-gateway
          type: probabilistic
          param: 1.0

Storage Considerations for Multi-Region

Approach	Pros	Cons
Centralized (single ES)	Simple, consistent	Latency for remote collectors
Distributed (per-region ES)	Low latency	Complex cross-region queries
Hybrid (hot-warm ES)	Balance of both	Operational complexity

Cross-Region Trace Correlation

# Correlate traces across regions (assumes spans carry a "region" tag)
curl -s "http://jaeger-query.global:16686/api/traces?service=api-gateway&lookback=1h" | \
  jq '.data[] | select(any(.spans[].tags[]?; .key == "region" and .value == "us-east")) | {
    traceID: .traceID,
    regions: ([.spans[].tags[]? | select(.key == "region") | .value] | unique),
    duration_ms: ((([.spans[] | .startTime + .duration] | max) - ([.spans[].startTime] | min)) / 1000)
  } | select(.regions | length > 1)'

Production Failure Scenarios

Failure	Impact	Mitigation
Jaeger storage backend degraded	Traces dropped; incomplete debugging data	Configure sampling; scale storage; implement buffer queue
Collector queue overflow	Spans dropped; monitoring gaps	Monitor queue depth; scale collectors; implement backpressure
Query service performance	Slow trace search; UI timeouts	Optimize queries; add caching; scale query replicas
Adaptive sampling too aggressive	Missing important traces	Review sampling rates; ensure error traces always sampled
Trace context propagation broken	Incomplete traces; orphaned spans	Implement proper propagation in all services; test regularly
ES backend slow	Delayed trace availability	Monitor ES cluster; optimize indices; add warm storage tier

Common Pitfalls / Anti-Patterns

1. Not Sampling Tail for Errors

Head-based sampling decides before the request runs, so at low sampling rates it discards most error traces:

# Bad: a flat 1% probabilistic rate (Jaeger Operator sampling options)
sampling:
  options:
    default_strategy:
      type: probabilistic
      param: 0.01  # 1% - misses most errors

# Good: decide after the trace completes and always keep errors
# (OpenTelemetry Collector tail_sampling processor in front of Jaeger)
processors:
  tail_sampling:
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
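The arithmetic behind this pitfall is short. With head-based sampling at rate p, each error trace is kept independently with probability p, so the chance of capturing at least one of n error traces is 1 - (1 - p)^n. The sketch below works that out; the function name is illustrative.

```python
import math


def p_at_least_one(sample_rate, n_errors):
    """Probability that at least one of n error traces is sampled."""
    return 1 - (1 - sample_rate) ** n_errors


single = p_at_least_one(0.01, 1)
ten = p_at_least_one(0.01, 10)
hundred = p_at_least_one(0.01, 100)
print(f"{single:.3f} {ten:.3f} {hundred:.3f}")
assert math.isclose(single, 0.01)
```

At a 1% rate, an error that happens ten times has roughly a one in ten chance of leaving any trace behind, and even a hundred occurrences are missed entirely more than a third of the time.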

2. Missing Semantic Attributes

Traces without standard attributes are hard to query:

// Bad: ad hoc attribute name that nobody else will think to search for
const span = tracer.startSpan("OrderService.process");
span.setAttribute("orderId", order.id); // Non-standard name

// Good: consistent, namespaced names in the style of the semantic conventions
const orderSpan = tracer.startSpan("OrderService.process");
orderSpan.setAttribute("order.id", order.id);
orderSpan.setAttribute("order.total", order.total);
orderSpan.setAttribute("customer.tier", customer.tier);

3. Creating Child Spans Without Parent Context

Orphaned spans break trace continuity:

// Bad: No parent context
async function processOrder(order) {
  const span = tracer.startSpan("OrderService.process");
  await validateOrder(order); // Creates orphaned span
  span.end();
}

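// Good: Propagate parent context
// (assumes `context` and `trace` are imported from "@opentelemetry/api")
async function processOrder(order, parentSpan) {
  const parentContext = trace.setSpan(context.active(), parentSpan);
  const span = tracer.startSpan("OrderService.process", {}, parentContext);
  await validateOrder(order, span); // Child span linked
  span.end();
}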

4. Storing Large Payloads in Span Events

Span events are not a data store:

// Bad: Large payload in span
span.addEvent("response", { body: JSON.stringify(largeResponse) });

// Good: Reference by ID or summary
span.addEvent("response", {
  "response.size_bytes": largeResponse.length,
  "response.status": "success",
});

5. Not Monitoring Jaeger Itself

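If Jaeger itself drops spans or falls behind, you lose visibility without noticing. Scrape its own metrics:

# Prometheus metrics from Jaeger (14269 is the Jaeger v1 collector admin port)
scrape_configs:
  - job_name: "jaeger"
    static_configs:
      - targets: ["jaeger-collector:14269"]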

Real-world Failure Scenarios

Scenario 1: Jaeger Collector OOM During Traffic Spike

What happened: A product launch caused a 20x spike in trace volume. The Jaeger collector’s in-memory queue filled up and the process was killed by the OOM killer.

Root cause: The collector was configured with a fixed in-memory queue size but no dead-letter queue or sampling strategy to handle sudden volume increases.

Impact: Approximately 15 minutes of traces were lost during the product launch window. Engineers could not correlate the elevated error rate with specific services.

Lesson learned: Configure adaptive sampling or tail-based sampling to automatically reduce trace volume during traffic spikes. Set up dead-letter queues for failed trace ingestion. Monitor collector queue depth and set OOM alerts.

Scenario 2: Cassandra Storage Saturation

What happened: Over several months, the Cassandra storage cluster used by Jaeger reached its capacity limit. Trace data began being dropped silently as inserts failed.

Root cause: No capacity planning was done for trace retention. The default 30-day retention was never adjusted as the system scaled.

Impact: Historical traces older than 2 weeks became unavailable, making retrospective analysis of a security incident impossible.

Lesson learned: Plan storage capacity based on expected trace volume and retention requirements. Monitor storage node disk usage and set alerts. Consider down-sampling older traces to reduce storage costs.
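A first-pass capacity estimate is a single multiplication: span rate, average span size, retention, and the number of stored copies. The sketch below uses invented inputs (5,000 spans per second, 500 bytes per stored span, two copies). Measure your own values, because index overhead and compression vary by backend.

```python
SECONDS_PER_DAY = 86_400


def storage_bytes(spans_per_sec, avg_span_bytes, retention_days, copies=1):
    """Estimate raw storage for a retention window, before index overhead."""
    return spans_per_sec * avg_span_bytes * SECONDS_PER_DAY * retention_days * copies


needed = storage_bytes(
    spans_per_sec=5_000, avg_span_bytes=500, retention_days=14, copies=2
)
print(f"{needed / 1e12:.2f} TB")
```

Running the same estimate for 30 days instead of 14 roughly doubles the requirement, which is the kind of gap that goes unnoticed when retention defaults are never revisited.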

Trade-off Analysis

When designing a Jaeger deployment, several key trade-offs require careful consideration.

Storage Backend Selection

Criteria	Elasticsearch	Cassandra	Kafka
Scalability	Horizontal sharding, ILM support	Tunable consistency, wide rows	Partition-based parallelism
Query Performance	Excellent aggregations, full-text search	Fast reads for trace ID lookups	Requires separate consumer
Operational Complexity	High (cluster management, ILM)	Medium (SSTables, compaction)	Medium (brokers, replication)
Cost at Scale	Higher (memory-heavy)	Lower (disk-efficient)	Variable (depends on retention)
Best For	Large teams, advanced analytics	High write throughput	Event-driven replay scenarios

Sampling Strategy Trade-offs

Strategy	Storage Savings	Debug Coverage	Latency Overhead	Complexity
Probabilistic	High	Low (misses rare events)	Minimal	Simple
Adaptive	Medium	High (prioritizes errors)	Low	Medium
Tail-based	Variable	Highest (captures all important)	Higher	Complex
Rate-limiting	Predictable	Medium	Minimal	Simple

Deployment Architecture Comparison

Aspect	All-in-One	Production (Separated)	Multi-Region
Resource Usage	Minimal	High	Very High
Scalability	None	Horizontal collectors/query	Regional collectors
Operational Overhead	Minimal	Medium	High
Fault Isolation	Poor (single point)	Good (component isolation)	Excellent (regional failure domains)
Latency	Low (local only)	Medium (network to collectors)	Higher (cross-region)
Setup Time	Minutes	Hours	Days
Use Case	Development, CI/CD	Production (single region)	Global enterprises

Agent Deployment Models

Model	Pros	Cons	Best Environment
Sidecar (per pod)	Simple injection, local UDP	Resource overhead per pod	Kubernetes with sidecar injection
Daemonset (node-level)	Shared resource pool, lower overhead	Requires host networking	Dense node deployments
Agentless (direct to collector)	No agent maintenance	Higher latency (network hop), firewall rules	Secure environments, small scale

Instrumentation Approach Trade-offs

Approach	Effort	Granularity	Maintenance	Best Stage
Auto-instrumentation	Low	Medium	Low	Initial adoption
Manual instrumentation	High	Fine-grained	Higher	Production hardening
Hybrid (auto + manual)	Medium	Fine-grained	Medium	Mature observability

When to Use Jaeger

Use Jaeger when:

Debugging latency issues across microservice boundaries
Understanding service dependencies and call patterns
Root cause analysis for cascading failures
Optimizing performance by identifying bottlenecks
Validating trace context propagation
Monitoring distributed transactions
Detecting anomalies in request flows

Don’t use Jaeger when:

You have single monolithic applications without service boundaries
You need purely metric-based monitoring
Even minimal tracing overhead on the request path is unacceptable
You need long-term log storage
You only need aggregate analytics (use dashboards)

Interview Questions

1. What is distributed tracing, and how does Jaeger implement it?

Expected answer points:

Distributed tracing tracks requests across service boundaries in microservices architectures
Jaeger implements tracing via the OpenTelemetry collector pattern with agents, collectors, query, and UI components
Spans represent individual operations, and traces are composed of connected spans forming a request path
Jaeger stores traces in backends like Elasticsearch, Cassandra, or Kafka for querying and visualization

2. What are the main components of Jaeger architecture?

Expected answer points:

Jaeger Agent: Sidecar or daemonset that receives spans via UDP and forwards to collectors
Jaeger Collector: Receives spans, processes them, and stores in the configured backend
Jaeger Query: Backend service that retrieves traces from storage for display
Jaeger UI: Web interface for searching and visualizing traces
Supported storage backends: Elasticsearch and Cassandra, with Kafka as an optional buffer between collectors and storage

3. What sampling strategies does Jaeger support, and when would you use each?

Expected answer points:

Probabilistic: Fixed percentage of traces captured, simple but may miss important events
Adaptive: Sampling probabilities are adjusted dynamically per service and endpoint to hold a target trace rate, so low-traffic endpoints stay represented
Tail-based: Makes the decision after the full trace has been seen, ideal for capturing all slow or error traces
Use adaptive sampling for high-traffic production services, and tail-based sampling when every error or slow trace must be kept
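The probabilistic strategy can be sketched as a decision derived from the trace ID, so that every service reaches the same verdict for the same trace without coordinating. This is a minimal illustration, not Jaeger's actual sampler: the function name `shouldSample` and the choice to hash on the last 8 hex digits are assumptions made for the example.

```typescript
// Hypothetical head-based sampler: the decision depends only on the
// trace ID, so all services agree on whether a trace is kept.
function shouldSample(traceIdHex: string, rate: number): boolean {
  if (rate <= 0) return false;
  if (rate >= 1) return true;
  // Interpret the last 8 hex digits as an integer in [0, 2^32).
  const low = parseInt(traceIdHex.slice(-8), 16);
  // Keep the trace when that integer falls in the first `rate` fraction.
  return low < rate * 0x100000000;
}
```

For the trace ID `4bf92f3577b34da6a3ce929d0e0e4736`, the trailing digits map to roughly 5.5% of the range, so the trace is kept at a 10% rate and dropped at a 5% rate.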

4. How do you debug a slow request using Jaeger?

Expected answer points:

Search for traces with high duration using the Jaeger UI or API
Identify which service has the longest spans in the trace waterfall
Check span tags for business context (order ID, user ID, etc.)
Review span logs for errors, database queries, or external calls
Follow the trace backward to find where latency was introduced
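The "longest span" step is easier to reason about in terms of self time: a span's duration minus the time spent in its direct children. The sketch below assumes a simplified span shape (the `Span` interface and its field names are hypothetical, not Jaeger's API model), a non-empty span list, and children that run sequentially rather than in parallel.

```typescript
interface Span {
  spanId: string;
  parentId?: string;
  service: string;
  durationUs: number; // duration in microseconds
}

// Self time = own duration minus the durations of direct children.
// Assumes the children of a span do not overlap each other.
function selfTimes(spans: Span[]): Map<string, number> {
  const self = new Map<string, number>();
  for (const s of spans) self.set(s.spanId, s.durationUs);
  for (const s of spans) {
    if (s.parentId !== undefined && self.has(s.parentId)) {
      self.set(s.parentId, (self.get(s.parentId) as number) - s.durationUs);
    }
  }
  return self;
}

// Returns the span where the most time was actually spent.
function slowestBySelfTime(spans: Span[]): Span {
  const self = selfTimes(spans);
  return spans.reduce((a, b) =>
    (self.get(b.spanId) as number) > (self.get(a.spanId) as number) ? b : a
  );
}
```

With a 500 ms gateway span wrapping a 450 ms order-service span that wraps a 400 ms database span, the gateway and order service each contribute only 50 ms of their own, and the database span is reported as the bottleneck.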

5. What is trace context propagation and why is it important?

Expected answer points:

Trace context propagation passes trace and span IDs across service boundaries via HTTP headers
Without proper propagation, spans become orphaned and traces appear incomplete
OpenTelemetry uses W3C Trace Context headers (traceparent, tracestate)
Context must be injected before outgoing requests and extracted on incoming requests
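To make the header format concrete, here is a small parser and formatter for the W3C `traceparent` header. It is a sketch that handles only version `00` and ignores `tracestate`; the `TraceParent` type and both function names are made up for this example. In a real service the OpenTelemetry propagator does this for you.

```typescript
interface TraceParent {
  traceId: string;      // 32 lowercase hex characters
  parentSpanId: string; // 16 lowercase hex characters
  sampled: boolean;     // lowest bit of the trace-flags byte
}

const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceParent | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (m === null) return null;
  const [, traceId, parentSpanId, flags] = m;
  // All-zero IDs are invalid according to the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  return { traceId, parentSpanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

function formatTraceparent(tp: TraceParent): string {
  return `00-${tp.traceId}-${tp.parentSpanId}-${tp.sampled ? "01" : "00"}`;
}
```

A header that fails to parse should be treated as absent, which starts a new trace rather than attaching spans to a corrupt parent.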

6. How do you integrate Jaeger with OpenTelemetry?

Expected answer points:

Use the OpenTelemetry SDK with the OTLP exporter pointed at the Jaeger collector (the dedicated Jaeger exporters are deprecated in favor of OTLP)
Auto-instrumentation available for Python, Java, Node.js, Go, .NET and other languages
Manual instrumentation using OpenTelemetry API for custom business logic
Configure resource attributes like service.name for proper identification

7. What storage backend would you choose for a large-scale Jaeger deployment and why?

Expected answer points:

Elasticsearch: Best for large-scale production, excellent query performance, ILM support
Cassandra: Traditional choice, good for very high write throughput, tunable consistency
Kafka: Not a queryable store on its own; sits between collectors and storage to enable buffering and replay, useful for event-driven architectures
Consider ingestion rate, query patterns, operational complexity, and existing infrastructure

8. What are common pitfalls when using Jaeger in production?

Expected answer points:

Head-based sampling missing error traces at low sampling rates
Missing semantic attributes making traces hard to query
Orphaned spans from not propagating parent context
Storing large payloads in span events causing performance issues
Not monitoring Jaeger itself leading to observability gaps
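The first pitfall is easy to quantify. Assuming each trace is sampled independently at a fixed head-based rate, the chance that none of the error traces in a window is captured is `(1 - rate)^n`. The helper name below is hypothetical.

```typescript
// Probability that a head-based sampler with the given rate captures
// none of `errorTraces` independent error traces.
function probAllMissed(samplingRate: number, errorTraces: number): number {
  return Math.pow(1 - samplingRate, errorTraces);
}

// At 1% sampling, ten error traces are all missed about 90% of the time.
const missed = probAllMissed(0.01, 10); // ≈ 0.904
```

This is why a low head-based rate on its own is rarely enough for error debugging.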

9. How do you correlate trace data with SLOs and alerting?

Expected answer points:

Define SLO thresholds based on trace latency percentiles and error rates
Use tail-based sampling to ensure error and slow traces are always captured
Export Jaeger metrics to Prometheus for alerting on collector queue depth and ingestion rate
Create dashboards correlating trace data with business-level SLOs
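As a rough illustration of the first point, the sketch below checks a latency and error-rate SLO against a batch of trace summaries. The `TraceSummary` shape, the function names, and the thresholds are assumptions for the example, and the input is assumed to be non-empty. Keep in mind that percentiles computed from sampled traces only describe the sampled population.

```typescript
interface TraceSummary {
  durationMs: number;
  hasError: boolean;
}

// Nearest-rank percentile; p is a whole number between 1 and 100.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

function checkSlo(traces: TraceSummary[], p99TargetMs: number, maxErrorRate: number) {
  const p99 = percentile(traces.map((t) => t.durationMs), 99);
  const errorRate = traces.filter((t) => t.hasError).length / traces.length;
  return { p99, errorRate, met: p99 <= p99TargetMs && errorRate <= maxErrorRate };
}
```

The returned `met` flag is the kind of signal that would feed an SLO dashboard or a burn-rate alert.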

10. What security considerations apply when deploying Jaeger?

Expected answer points:

Authenticate Jaeger UI access to prevent unauthorized trace data exposure
Avoid sensitive data (passwords, tokens, PII) in span tags and logs
Configure TLS for all Jaeger endpoints including gRPC and HTTP
Restrict storage backend access and audit trace data exports
Sanitize trace context headers before external calls to prevent injection
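One way to enforce the second point is to scrub attributes before they are attached to a span. The key pattern and the function name below are illustrative assumptions; a real deployment would maintain its own list and might also redact inside the collector pipeline.

```typescript
// Hypothetical pattern for attribute keys that should never reach storage.
const SENSITIVE_KEY = /password|passwd|secret|token|authorization|api[_-]?key/i;

// Returns a copy of the attributes with sensitive values masked.
function redactAttributes(attrs: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [key, value] of Object.entries(attrs)) {
    out[key] = SENSITIVE_KEY.test(key) ? "[REDACTED]" : value;
  }
  return out;
}
```

Masking the value while keeping the key preserves the fact that the field was present, which is often enough for debugging.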

11. How does Jaeger compare to Zipkin, and what are the migration considerations?

Expected answer points:

Jaeger and Zipkin share the same trace data model (spans and traces) making interoperability possible
Jaeger supports native OTLP ingestion while Zipkin requires additional collectors for OTLP
Jaeger provides more advanced sampling strategies including tail-based sampling
Zipkin has a simpler deployment model suited for smaller deployments
Migration involves re-instrumenting services with OpenTelemetry SDKs (the native Jaeger clients are deprecated) or using the Zipkin receiver in the Jaeger collector
Jaeger stores data in Elasticsearch or Cassandra, optionally buffered through Kafka, while Zipkin supports in-memory, Cassandra, Elasticsearch, and MySQL storage

12. How do you integrate Jaeger with service mesh environments like Istio or Linkerd?

Expected answer points:

Istio's Envoy sidecars generate spans for proxied traffic and can export them to Jaeger
Configure Istio to export traces to Jaeger using the mesh config and tracing options
Linkerd uses its own distributed tracing capability that can export to Jaeger
Sidecars create spans at proxy boundaries, but applications must still forward trace headers from inbound to outbound requests, otherwise traces break into disconnected fragments
Jaeger helps identify service mesh performance issues and proxy overhead
The high volume of mesh-generated spans requires careful sampling strategy configuration

13. What is the difference between head-based and tail-based sampling in Jaeger?

Expected answer points:

Head-based sampling decides at the start of a trace whether to sample, using probabilistic or rate-limiting strategies
Tail-based sampling buffers all spans of a trace and makes the sampling decision at the end based on policy rules
Head-based sampling is simpler and requires fewer resources but may miss important rare events
Tail-based sampling ensures error traces and slow traces are always captured for debugging
Jaeger's adaptive sampling is still head-based: it adjusts sampling probabilities dynamically to hold a target rate rather than combining both approaches
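A tail-based decision can be pictured as a policy evaluated once all spans of a trace have been buffered. The sketch below is a conceptual model with hypothetical names (`TailPolicy`, `keepTrace`); it is not the configuration format of any particular collector. The random source is passed in so the behavior can be tested deterministically.

```typescript
interface FinishedSpan {
  durationMs: number;
  error: boolean;
}

interface TailPolicy {
  latencyThresholdMs: number; // keep any trace containing a span this slow
  baselineRate: number;       // fraction of ordinary traces to keep
}

// Evaluated after the whole trace has been buffered.
function keepTrace(
  spans: FinishedSpan[],
  policy: TailPolicy,
  rand: () => number = Math.random
): boolean {
  if (spans.some((s) => s.error)) return true; // always keep errors
  if (spans.some((s) => s.durationMs >= policy.latencyThresholdMs)) return true; // always keep slow traces
  return rand() < policy.baselineRate; // sample the rest
}
```

The cost of this approach is memory: every span has to be held until its trace is considered complete.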

14. How do you analyze trace flame graphs in Jaeger, and what insights do they provide?

Expected answer points:

Flame graphs visualize trace duration as stacked bars showing time spent in each span
Wide blocks indicate where most time is spent, identifying latency bottlenecks
Deep stacks show trace depth and parent-child relationships across services
Jaeger's trace detail view provides a waterfall representation that functions like a flame graph
Color coding helps distinguish services, errors, and external calls
Compare flame graphs across time periods to detect performance regressions

15. What are the scaling considerations for Jaeger in high-throughput production environments?

Expected answer points:

Scale Jaeger collectors horizontally to handle increased ingestion load
Use Kafka as a buffer between collectors and storage to handle spikes
Configure adaptive sampling to reduce storage requirements without losing critical traces
Elasticsearch index lifecycle management helps control storage costs
Query service can be scaled horizontally with read replicas
Monitor queue depth and span latency to detect scaling needs proactively
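A back-of-the-envelope estimate helps when sizing storage and collectors. All inputs below are placeholders you would replace with measured values; the per-collector capacity in particular is an assumption, not a Jaeger benchmark.

```typescript
interface Load {
  spansPerSecond: number; // spans emitted before sampling
  sampleRate: number;     // fraction of spans that reach storage
  avgSpanBytes: number;   // average stored size of one span
  retentionDays: number;
}

// Raw bytes retained, ignoring replication and index overhead.
function estimateStorageBytes(l: Load): number {
  const secondsPerDay = 86400;
  return l.spansPerSecond * l.sampleRate * l.avgSpanBytes * secondsPerDay * l.retentionDays;
}

// Collector replicas needed for a given per-replica throughput.
function collectorsNeeded(spansPerSecond: number, perCollectorCapacity: number): number {
  return Math.max(1, Math.ceil(spansPerSecond / perCollectorCapacity));
}
```

For example, 10,000 spans per second at 10% sampling, 500 bytes per span, and 7 days of retention comes to roughly 0.3 TB before replication.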

16. How do you implement distributed tracing propagation across asynchronous message queues?

Expected answer points:

Inject trace context into message headers or properties before publishing
Extract trace context on the consumer side and create child spans
OpenTelemetry provides automatic propagation for Kafka, RabbitMQ, and other messaging systems
Ensure message processing spans have proper parent context for complete trace continuity
Batch message processing requires careful span management to avoid orphaned spans
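The inject and extract steps can be sketched with a plain message object standing in for a Kafka or RabbitMQ message. The `Message` and `SpanContext` types and both function names are hypothetical, and the ID generator uses `Math.random` purely for illustration. In practice the OpenTelemetry propagation API performs these steps.

```typescript
interface Message {
  headers: Record<string, string>;
  body: string;
}

interface SpanContext {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
}

function randomHex(length: number): string {
  let out = "";
  for (let i = 0; i < length; i++) out += Math.floor(Math.random() * 16).toString(16);
  return out;
}

// Producer: copy the current span's context into the message headers.
function publishWithContext(current: SpanContext, body: string): Message {
  return { headers: { traceparent: `00-${current.traceId}-${current.spanId}-01` }, body };
}

// Consumer: continue the producer's trace, or start a new one if the
// header is missing or malformed.
function startConsumerSpan(msg: Message): SpanContext {
  const parts = (msg.headers["traceparent"] ?? "").split("-");
  if (parts.length !== 4) return { traceId: randomHex(32), spanId: randomHex(16) };
  return { traceId: parts[1], spanId: randomHex(16), parentSpanId: parts[2] };
}
```

The consumer span keeps the producer's trace ID and records the producer span as its parent, which is what keeps the trace connected across the queue.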

17. What strategies exist for reducing trace cardinality while maintaining debugging capabilities?

Expected answer points:

Use tag allowlisting to restrict which attributes are stored with spans
Implement sampling strategies that capture complete traces for errors and slow requests
Aggregate high-cardinality attributes like user IDs or request IDs into bucketed values
Configure storage backends with appropriate index mappings to handle cardinality
Use trace quality scoring to identify and drop low-value traces
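The first and third points can be combined in a small pre-processing step: keep only allowlisted attributes and replace a high-cardinality ID with a bucket number. The allowlist contents, the `user.id` and `user.bucket` keys, and the hash function are all assumptions chosen for the example.

```typescript
type Attrs = Record<string, string>;

// Hypothetical allowlist: only these keys are stored unchanged.
const ALLOWED_KEYS = new Set(["http.method", "http.status_code", "service.version"]);

// Deterministic string hash folded into a fixed number of buckets.
function bucketOf(value: string, buckets: number): number {
  let h = 0;
  for (let i = 0; i < value.length; i++) {
    h = (h * 31 + value.charCodeAt(i)) >>> 0; // keep within 32 bits
  }
  return h % buckets;
}

function reduceCardinality(attrs: Attrs, userBuckets = 16): Attrs {
  const out: Attrs = {};
  for (const [key, value] of Object.entries(attrs)) {
    if (ALLOWED_KEYS.has(key)) out[key] = value;
  }
  // Replace the raw user ID with a low-cardinality bucket.
  if (attrs["user.id"] !== undefined) {
    out["user.bucket"] = String(bucketOf(attrs["user.id"], userBuckets));
  }
  return out;
}
```

Bucketing keeps the ability to spot problems that affect a subset of users without indexing every individual ID.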

18. How does Jaeger handle clock skew issues in distributed trace timestamps?

Expected answer points:

Spans record a start timestamp and a duration from each host's local clock, so skew between hosts can make a child span appear to start before its parent
Jaeger's query service applies a clock skew adjustment that shifts child spans from other hosts so they fall within their parent's time window
The adjustment is bounded by a configurable maximum, so keeping host clocks synchronized is still necessary
Jaeger UI displays spans using relative offsets from trace start, not absolute timestamps
For cross-region traces, compare span durations rather than absolute timestamps
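One way to picture skew correction is as an offset that moves a child span back inside its parent's time window. The function below is a simplified sketch under the assumption of symmetric network latency; it is not Jaeger's actual adjuster, and the names are hypothetical.

```typescript
interface TimedSpan {
  startUs: number;    // start time in microseconds
  durationUs: number; // duration in microseconds
}

// Returns the offset (in microseconds) to add to the child's start time.
function skewOffset(parent: TimedSpan, child: TimedSpan): number {
  const parentEnd = parent.startUs + parent.durationUs;
  const childEnd = child.startUs + child.durationUs;
  // Child already fits inside the parent: nothing to correct.
  if (child.startUs >= parent.startUs && childEnd <= parentEnd) return 0;
  // Child outlasts the parent (for example async work): only align the starts.
  if (child.durationUs > parent.durationUs) return parent.startUs - child.startUs;
  // Otherwise center the child inside the parent, assuming equal latency each way.
  const slack = parent.durationUs - child.durationUs;
  return parent.startUs + slack / 2 - child.startUs;
}
```

A child reported at 900 µs with a 40 µs duration, under a parent spanning 1000 to 1100 µs, gets an offset of 130 µs and lands at 1030 µs.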

19. What are the operational best practices for running Jaeger in Kubernetes?

Expected answer points:

Use the Jaeger Operator for declarative Kubernetes deployments and upgrades
Deploy the agent as a daemonset for sidecar-less architecture with lower overhead
Configure resource limits based on expected trace volume and processing requirements
Use separate namespaces for observability components to enable proper RBAC
Define PodDisruptionBudgets for collector and query deployments
Monitor Jaeger itself with Prometheus metrics (port 8888 on Jaeger v2; a v1 collector serves them on its admin port, 14269 by default)

20. How do you correlate traces from Jaeger with metrics in Prometheus and logs in ELK?

Expected answer points:

Use consistent service names and labels across Jaeger, Prometheus, and ELK
Export Jaeger metrics to Prometheus for correlation between trace latency and system metrics
Include trace ID in log lines to enable cross-platform correlation
Use the trace ID from Jaeger spans to search corresponding logs in Elasticsearch
Build Grafana dashboards that combine Jaeger trace data with Prometheus metrics
Include trace IDs in error messages so they link directly to the relevant trace in Jaeger UI
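Log correlation mostly comes down to writing the trace ID into every structured log line and being able to turn it back into a link. The helpers below are a sketch: the field names `trace_id` and `span_id` are a common convention rather than a requirement, and the base URL is whatever address your Jaeger UI is served from.

```typescript
// Emit a structured log line carrying the active trace and span IDs.
function logWithTrace(level: string, message: string, traceId: string, spanId: string): string {
  return JSON.stringify({ level, message, trace_id: traceId, span_id: spanId });
}

// Build a link to the trace in the Jaeger UI, which serves traces under /trace/<id>.
function traceUrl(jaegerBaseUrl: string, traceId: string): string {
  return `${jaegerBaseUrl.replace(/\/+$/, "")}/trace/${traceId}`;
}
```

Searching Elasticsearch for the `trace_id` field then returns every log line produced while that request was being handled.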

Conclusion

Jaeger brings distributed tracing to microservices debugging and analysis. Its trace visualization, dependency mapping, and performance analysis help you make sense of complex request flows.

Start with automatic instrumentation for your services, deploy the all-in-one version for development, and scale to a production deployment with Elasticsearch storage when you are ready.

For complete observability, combine tracing with metrics (Prometheus & Grafana) and logs (ELK Stack). The Distributed Tracing guide covers OpenTelemetry integration in more detail.

Quick Recap

Key Takeaways:

Jaeger provides distributed tracing visualization for microservices
Deploy collectors and the query service (which serves the UI) as separate components in production
Use adaptive or tail-based sampling to capture errors
Always propagate trace context through HTTP headers and queues
Monitor Jaeger itself: ingestion rate, queue depth, storage latency
Combine with metrics (Prometheus) and logs (ELK) for complete observability

Copy/Paste Checklist:

# Production Jaeger CRD
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
spec:
  strategy: production
  collector:
    maxReplicas: 5
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
  storage:
    type: elasticsearch
    elasticsearch:
      nodeCount: 3
  query:
    replicas: 2

# Adaptive sampling config (illustrative field names; verify them against the CRD schema of your Jaeger Operator version)
spec:
  sampling:
    type: adaptive
    adaptive:
      max_traces_per_second: 100
      initial_sampling_rate: 10

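# Prometheus metrics scrape
# Port 8888 is the Jaeger v2 metrics port; a Jaeger v1 collector serves /metrics on admin port 14269
scrape_configs:
  - job_name: 'jaeger-collector'
    static_configs:
      - targets: ['jaeger-collector:8888']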

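// Trace context propagation
import { context, propagation, trace, SpanStatusCode } from "@opentelemetry/api";

// Write the active trace context (traceparent, tracestate) into outgoing headers
function injectTraceContext(headers: Record<string, string>) {
  propagation.inject(context.active(), headers);
}

// Build a context from incoming headers; run the request handler inside it
function extractTraceContext(headers: Record<string, string>) {
  return propagation.extract(context.active(), headers);
}

// Error span recording on the currently active span, if there is one
function recordError(error: Error) {
  const span = trace.getSpan(context.active());
  span?.recordException(error);
  span?.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
}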

Reference Checklists

Observability Checklist

Jaeger Infrastructure Metrics

Traces received per second (ingestion rate)
Spans received and processed
Collector queue depth
Backend storage write latency
Query service latency (p50, p95, p99)
Active connections to storage

Trace Coverage Metrics

Services with instrumentation
Trace completeness (spans per trace average)
Error trace percentage
Slow trace percentage (>threshold)
Span attribute coverage

Sampling Configuration

Head-based sampling rate
Adaptive sampling thresholds
Tail sampling policies (errors, slow traces)
Always-sampled tag configuration

Alerting Rules

Collector queue depth > threshold
Storage write latency degraded
Query service down
Ingestion rate drops unexpectedly (possible exporter, network, or collector failure)

Security Checklist

Jaeger UI access authenticated
No sensitive data in span tags or logs (passwords, tokens, PII)
TLS configured for all endpoints
Trace data access logged and audited
Sampling does not inadvertently exclude security-relevant traces
Trace context headers sanitized before external calls
Storage backend access restricted
No internal service names exposed to external trace exports
