Jaeger: Distributed Tracing for Microservices

Learn Jaeger for distributed tracing visualization. Covers trace analysis, dependency mapping, and integration with OpenTelemetry.

Reading time: 28 min | Author: GeekWorkBench

Jaeger is an open-source distributed tracing system for monitoring and troubleshooting microservices. It shows you how requests flow through your services, where latency lives, and how your services depend on each other.

This guide covers Jaeger deployment, trace analysis, and practical debugging workflows. For tracing fundamentals, see our Distributed Tracing guide first.

Introduction

graph TB
    A[Services] -->|OTLP| B[Jaeger Collector]
    B --> C[Jaeger Backend]
    C --> D[Elasticsearch]
    C --> E[Cassandra]
    C --> F[Kafka]
    G[Jaeger Query] --> C
    H[Jaeger UI] --> G

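Jaeger's classic architecture has four components:

  • Agent: Sidecar or daemonset that receives spans via UDP and forwards them to the collector (deprecated in recent Jaeger releases; OpenTelemetry SDKs can send OTLP straight to the collector, as in the diagram above)
  • Collector: Receives spans, processes them, and writes them to storage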
  • Query: Backend service for trace retrieval
  • UI: Web interface for trace exploration

Deployment Options

All-in-One Quick Start

For local development:

docker run -d \
  --name jaeger \
  -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 16686:16686 \
  -p 14268:14268 \
  -p 4317:4317 \
  -p 4318:4318 \
  -p 9411:9411 \
  jaegertracing/all-in-one:latest

Access the UI at http://localhost:16686.

Production Deployment

For Kubernetes:

# jaeger-operator.yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
  namespace: observability
spec:
  strategy: production
  collector:
    maxReplicas: 5
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
  storage:
    type: elasticsearch
    elasticsearch:
      name: elasticsearch
      doNotProvision: true
      secretName: jaeger-elasticsearch
  query:
    replicas: 2
    options:
      query:
        base-path: /jaeger

External Elasticsearch Backend

apiVersion: v1
kind: Secret
metadata:
  name: jaeger-elasticsearch
  namespace: observability
type: Opaque
stringData:
  ELASTICSEARCH_SERVER: "https://elasticsearch:9200"
  ELASTICSEARCH_USERNAME: "jaeger"
  ELASTICSEARCH_PASSWORD: "${ES_PASSWORD}"
---
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
  namespace: observability
spec:
  strategy: production
  collector:
    autoscale: true
    maxReplicas: 5
  storage:
    type: elasticsearch
    elasticsearch:
      nodeCount: 3
      redundancyPolicy: SingleRedundancy
      storage:
        size: 200Gi
        storageClassName: fast-storage
      indexCleaner:
        enabled: true
        numberOfDays: 14
        schedule: "55 5 * * *"
  query:
    replicas: 2

Jaeger UI

The Jaeger UI has several views for trace analysis.

Search View

The search view lets you find traces by:

  • Service name
  • Operation name
  • Trace ID
  • Time range
  • Tag filters
  • Duration range

Trace Detail View

graph TD
    A[API Gateway<br/>Total: 1.2s] --> B[Auth Service<br/>200ms]
    A --> C[Order Service<br/>800ms]
    C --> D[Payment Service<br/>500ms]
    C --> E[Inventory Service<br/>150ms]
    C --> F[Notification<br/>50ms]
    D --> G[External Bank<br/>400ms]

The trace view shows:

  • Parent-child span relationships
  • Timing for each span
  • Span tags and logs
  • Total trace duration

Span Detail Panel

Clicking a span reveals:

  • Operation name and service
  • Start time and duration
  • Tags (key-value attributes)
  • Logs (timestamped events)
  • References (parent-child links)

Trace Analysis Workflows

Debugging a Slow Request

Find the bottleneck in a slow trace:

  1. Search for traces with high duration
  2. Identify which service has the longest spans
  3. Check span tags for business context
  4. Review span logs for errors or unusual events
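
# Search for slow traces via API (minDuration keeps only traces at least this long)
# Durations are in microseconds; service names live in the per-trace "processes" map
curl -s "http://jaeger-query:16686/api/traces?service=api-gateway&lookback=1h&minDuration=5s" | \
  jq '.data[] | {traceID: .traceID, duration_us: ([.spans[].duration] | max), services: ([.processes[].serviceName] | unique)}'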
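
Step 2 can be mechanized once a trace is downloaded. The following is a minimal sketch, assuming the JSON shape returned by the query API (spans with `spanID`, `duration` in microseconds, `processID`, and `references`). The helper name `self_times` and the sample trace are hypothetical. It ranks spans by self time, meaning a span's duration minus the time spent in its direct children, so parents that merely wait on a slow child do not dominate the ranking.

```python
from collections import defaultdict


def self_times(trace: dict) -> list:
    """Return (service, operation, self_time_us) tuples, largest first."""
    # Sum the durations of each span's direct children.
    child_total = defaultdict(int)
    for span in trace["spans"]:
        for ref in span.get("references", []):
            if ref.get("refType") == "CHILD_OF":
                child_total[ref["spanID"]] += span["duration"]

    rows = []
    for span in trace["spans"]:
        service = trace["processes"][span["processID"]]["serviceName"]
        # Clamp at zero: concurrent children can add up to more than the parent.
        self_us = max(span["duration"] - child_total[span["spanID"]], 0)
        rows.append((service, span["operationName"], self_us))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical trace mirroring the gateway -> order -> payment path.
trace = {
    "processes": {
        "p1": {"serviceName": "api-gateway"},
        "p2": {"serviceName": "order-service"},
        "p3": {"serviceName": "payment-service"},
    },
    "spans": [
        {"spanID": "a", "operationName": "GET /checkout", "duration": 1_200_000,
         "processID": "p1", "references": []},
        {"spanID": "b", "operationName": "createOrder", "duration": 800_000,
         "processID": "p2", "references": [{"refType": "CHILD_OF", "spanID": "a"}]},
        {"spanID": "c", "operationName": "charge", "duration": 500_000,
         "processID": "p3", "references": [{"refType": "CHILD_OF", "spanID": "b"}]},
    ],
}

for service, operation, self_us in self_times(trace):
    print(f"{service:16} {operation:14} {self_us / 1000:.0f} ms")
```

The subtraction assumes children run one after another; for heavily parallel fan-out, treat the result as a rough ranking rather than an exact breakdown.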

Finding Error Sources

Identify which service is causing errors:

  1. Filter traces by error status
  2. Examine the error span and its logs
  3. Follow the trace backward to find root cause

# Find traces that contain at least one span tagged error=true
curl -s "http://jaeger-query:16686/api/traces?service=checkout-service&lookback=1h" | \
  jq '.data[] | select(any(.spans[].tags[]?; .key == "error" and .value == true))'
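
Step 3 (following the trace backward) can be approximated in code. This is a sketch under the assumption that errors propagate from child to parent: the error spans with no failing children are the most likely origin. The function names and the sample trace are hypothetical, and the tag shape (`key`, `value`) follows the query API's JSON output.

```python
def has_error(span: dict) -> bool:
    """True when the span carries an error=true tag."""
    return any(
        tag.get("key") == "error" and tag.get("value") is True
        for tag in span.get("tags", [])
    )


def deepest_error_spans(trace: dict) -> list:
    """Return operation names of error spans that have no failing child."""
    parent_of = {}
    for span in trace["spans"]:
        for ref in span.get("references", []):
            if ref.get("refType") == "CHILD_OF":
                parent_of[span["spanID"]] = ref["spanID"]

    error_ids = {s["spanID"] for s in trace["spans"] if has_error(s)}
    # A failing span whose child also failed is only passing the error along.
    propagating = {parent_of[sid] for sid in error_ids if sid in parent_of}
    return [
        s["operationName"]
        for s in trace["spans"]
        if s["spanID"] in error_ids and s["spanID"] not in propagating
    ]


# Hypothetical trace: the payment failure bubbles up to the gateway.
ERR = [{"key": "error", "type": "bool", "value": True}]
sample = {
    "spans": [
        {"spanID": "a", "operationName": "GET /checkout", "tags": ERR, "references": []},
        {"spanID": "b", "operationName": "createOrder", "tags": ERR,
         "references": [{"refType": "CHILD_OF", "spanID": "a"}]},
        {"spanID": "c", "operationName": "charge", "tags": ERR,
         "references": [{"refType": "CHILD_OF", "spanID": "b"}]},
        {"spanID": "d", "operationName": "reserveStock", "tags": [],
         "references": [{"refType": "CHILD_OF", "spanID": "b"}]},
    ]
}

print(deepest_error_spans(sample))
```

Some instrumentation records the error flag as the string "true" instead of a boolean; adjust `has_error` if your spans do.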

Analyzing Service Dependencies

Use the dependency view to understand service relationships:

  1. Navigate to the Dependency graph view
  2. Click on services to see call patterns
  3. Identify services with high fan-out
  4. Spot potential bottlenecks
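
Fan-out (step 3) can also be computed from the data behind the graph. A minimal sketch, assuming dependency links shaped like `{"parent", "child", "callCount"}`, which is how the query service's dependencies endpoint is commonly observed to return them; the `fan_out` helper and the sample links are hypothetical.

```python
from collections import defaultdict


def fan_out(links: list) -> dict:
    """Count how many distinct downstream services each service calls."""
    children = defaultdict(set)
    for link in links:
        children[link["parent"]].add(link["child"])
    return {service: len(called) for service, called in children.items()}


# Hypothetical links for the checkout flow.
links = [
    {"parent": "api-gateway", "child": "auth-service", "callCount": 120},
    {"parent": "api-gateway", "child": "order-service", "callCount": 100},
    {"parent": "order-service", "child": "payment-service", "callCount": 95},
    {"parent": "order-service", "child": "inventory-service", "callCount": 95},
    {"parent": "order-service", "child": "notification", "callCount": 90},
]

# Services sorted by fan-out, highest first.
for service, count in sorted(fan_out(links).items(), key=lambda kv: -kv[1]):
    print(service, count)
```

A service with high fan-out multiplies the chance that one slow dependency delays the whole request, which makes it a natural place to look for bottlenecks.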

Advanced Features

Adaptive Sampling

Reduce storage by sampling intelligently:

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: adaptive-sampling
spec:
  strategy: production
  sampling:
    type: adaptive
    adaptive:
      sampling_server_url: jaeger-agent:5778
      max_traces_per_second: 100
      initial_sampling_rate: 10
      adaptive:
        enabled: true
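
The idea behind adaptive sampling can be shown in a few lines. This is a deliberately simplified sketch, not Jaeger's actual algorithm: it assumes the backend periodically compares the observed rate of sampled traces for an operation against a target and rescales the probability. The function name and default bounds are hypothetical.

```python
def adjust_probability(
    current: float,
    observed_tps: float,
    target_tps: float,
    min_p: float = 1e-5,
    max_p: float = 1.0,
) -> float:
    """Rescale a sampling probability so sampled throughput approaches target_tps."""
    if observed_tps <= 0:
        # Nothing was sampled in the last window: open up fully until traffic appears.
        return max_p
    proposed = current * (target_tps / observed_tps)
    # Keep the result inside sane bounds.
    return min(max(proposed, min_p), max_p)


# Sampling 10% currently yields 200 traces/s but the target is 100: halve the rate.
print(adjust_probability(0.1, observed_tps=200, target_tps=100))
# A quiet operation sampled at 50% yields only 10 traces/s: the cap of 1.0 applies.
print(adjust_probability(0.5, observed_tps=10, target_tps=100))
```

Because the decision is still made when a trace starts, this approach controls volume but cannot know whether a trace will end in an error.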

Service Performance Monitoring

The Monitor tab in the Jaeger UI (Service Performance Monitoring) aggregates span data by service and operation, showing:

  • Average latency per operation
  • Request throughput
  • Error rates
  • Dependency links

Trace Quality Scoring

Jaeger can score traces based on quality:

# Trace quality indicators
- Missing span tags
- Incomplete trace depth
- High error rate
- Excessive span count

Jaeger vs Zipkin: Comparison and Migration

Understanding how Jaeger relates to Zipkin helps when evaluating or migrating distributed tracing solutions.

Shared Foundations

Jaeger and Zipkin use closely related span-based data models:

  • Span: Represents a unit of work with name, start time, duration, and attributes
  • Trace: A collection of spans forming a complete request path
  • Context propagation: Zipkin defaults to B3 headers and Jaeger's legacy clients used the uber-trace-id header; services instrumented with OpenTelemetry can propagate W3C Trace Context and export to either backend

This compatibility means you can instrument services using OpenTelemetry and route traces to either system.

Key Differences

| Aspect | Jaeger | Zipkin |
|---|---|---|
| Architecture | Collector, Query, UI, Agent as separate components | Simple architecture with Collector and Query only |
| Storage backends | Elasticsearch, Cassandra, Kafka (as a buffer) | In-memory, Cassandra, Elasticsearch, MySQL |
| Sampling strategies | Probabilistic, Adaptive, Tail-based | Probabilistic, Rate-limiting |
| OTLP support | Native OTLP receiver | Requires Zipkin collector adapter |
| Service mesh integration | Native Istio and Linkerd support | Limited built-in support |
| Operations complexity | Higher due to more components | Lower, simpler to operate |

Using the Zipkin Receiver

Jaeger includes a Zipkin-compatible receiver for migrations:

# Configure Jaeger collector to accept Zipkin spans on port 9411
docker run -d \
  --name jaeger \
  -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
  -p 6831:6831/udp \
  -p 9411:9411 \
  -p 16686:16686 \
  jaegertracing/all-in-one:latest

This lets existing Zipkin-instrumented services send traces to Jaeger without re-instrumentation.

Migration Path

When migrating from Zipkin to Jaeger:

  1. Phase 1: Deploy Jaeger alongside Zipkin, configure services to send traces to both
  2. Phase 2: Validate Jaeger data completeness and sampling behavior
  3. Phase 3: Switch primary monitoring to Jaeger, keep Zipkin as fallback
  4. Phase 4: Decommission Zipkin once confidence is established

# Dual-export configuration for gradual migration
# Illustrative schema only: map these keys to your SDK or OpenTelemetry Collector exporter settings
instrumentation:
  tracing:
    exporters:
      - jaeger_exporter:
          endpoint: http://jaeger-collector:14250
      - zipkin_exporter:
          endpoint: http://zipkin:9411/api/v2/spans

When to Choose Jaeger Over Zipkin

Choose Jaeger when:

  • You need advanced sampling strategies like tail-based sampling
  • Your environment uses Kubernetes with service mesh (Istio/Linkerd)
  • You require long-term trace storage with Elasticsearch or Cassandra
  • You want native OpenTelemetry protocol support without adapters
  • Your team needs more sophisticated trace visualization and analysis

Choose Zipkin when:

  • You have existing Zipkin instrumentation and limited migration budget
  • Your deployment scale is modest and simple architecture is preferred
  • You need quick setup with minimal operational overhead
  • Your organization already has expertise with Zipkin tooling

Integrating with OpenTelemetry

OpenTelemetry is the current standard for instrumentation, and Jaeger accepts its OTLP data directly.

Automatic Instrumentation

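# Python auto-instrumentation
from flask import Flask
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

app = Flask(__name__)

# The SDK TracerProvider (not the API class) is the one that accepts a Resource
trace.set_tracer_provider(
    TracerProvider(resource=Resource.create({"service.name": "my-service"}))
)

# Export spans in batches to the collector's OTLP gRPC port (plaintext in this example)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://jaeger-collector:4317", insecure=True)
    )
)

# Trace incoming Flask requests and outgoing calls made with requests
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()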

Manual Span Creation

import { context, trace, Span, SpanStatusCode } from "@opentelemetry/api";

// Minimal order shape for this example; replace with your own type
interface Order {
  id: string;
}

const tracer = trace.getTracer("order-service");

async function processOrder(order: Order) {
  const span = tracer.startSpan("OrderService.process");

  try {
    // chargePayment and fulfillOrder follow the same pattern as validateOrder
    await validateOrder(order, span);
    await chargePayment(order, span);
    await fulfillOrder(order, span);

    span.setStatus({ code: SpanStatusCode.OK });
  } catch (error) {
    span.recordException(error as Error);
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: (error as Error).message,
    });
    throw error;
  } finally {
    span.end();
  }
}

async function validateOrder(order: Order, parentSpan: Span) {
  // Link the child to its parent by passing a context that carries the parent span
  const ctx = trace.setSpan(context.active(), parentSpan);
  const span = tracer.startSpan("OrderService.validate", undefined, ctx);

  try {
    span.setAttribute("validation.type", "business");

    // Validation logic
  } finally {
    // End the span even if validation throws
    span.end();
  }
}

Performance Analysis

Latency Percentiles

Analyze latency distribution:

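# List the services known to Jaeger
curl -s "http://jaeger-query:16686/api/services" | jq -r '.data[]'

# The query API has no percentile endpoint; approximate p50/p99 (microseconds)
# from the longest span of each recent trace
curl -s "http://jaeger-query:16686/api/traces?service=api-gateway&lookback=1h&limit=1000" | \
  jq '[.data[] | [.spans[].duration] | max] | sort | {p50: .[(length * 0.5 | floor)], p99: .[(length * 0.99 | floor)]}'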

Throughput Analysis

Understand request volume patterns:

# Get traces count over time
curl -s "http://jaeger-query:16686/api/traces?service=api-gateway&start=$(date -d '1 hour ago' +%s000000)&end=$(date +%s000000)&limit=1000" | \
  jq '.data | length'
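
A single count hides how volume changes inside the window. The sketch below buckets traces by the minute in which their earliest span started, assuming span `startTime` values are microseconds since the Unix epoch as in the query API's JSON. The helper name and the sample data are hypothetical.

```python
from collections import Counter

MICROS_PER_MINUTE = 60_000_000


def traces_per_minute(traces: list) -> dict:
    """Map minute index (minutes since the epoch) to the number of traces started."""
    counts = Counter()
    for trace in traces:
        # A trace starts when its earliest span starts.
        start_us = min(span["startTime"] for span in trace["spans"])
        counts[start_us // MICROS_PER_MINUTE] += 1
    return dict(sorted(counts.items()))


# Hypothetical data: two traces in one minute, one in the next.
BASE = 28_333_333 * MICROS_PER_MINUTE
sample_traces = [
    {"spans": [{"startTime": BASE + 1_000_000}, {"startTime": BASE + 1_200_000}]},
    {"spans": [{"startTime": BASE + 30_000_000}]},
    {"spans": [{"startTime": BASE + 61_000_000}]},
]

print(traces_per_minute(sample_traces))
```

Keep in mind that these counts reflect sampled traces only, so multiply by the inverse of the sampling rate to estimate real request volume.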

Error Rate Correlation

Correlate errors across services:

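# Find traces with errors and the services that reported them
# (the search API needs a concrete service name; wildcards are not supported)
curl -s "http://jaeger-query:16686/api/traces?service=api-gateway&lookback=1h" | \
  jq '.data[] | . as $t | {
      traceID: .traceID,
      errors: ([.spans[] | select(any(.tags[]?; .key == "error" and .value == true)) | $t.processes[.processID].serviceName] | unique)
    } | select(.errors | length > 0)'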

Alerting on Trace Data

Prometheus Metrics from Jaeger

# jaeger-metrics-exporter.yaml
apiVersion: v1
kind: Service
metadata:
  name: jaeger-metrics
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8888"
spec:
  ports:
    - port: 8888
      targetPort: 8888
---
# Prometheus scrape config
scrape_configs:
  - job_name: "jaeger"
    static_configs:
      - targets: ["jaeger-collector:8888"]

Key Metrics to Monitor

| Metric | Description |
|---|---|
| jaeger_collector_traces_received | Incoming traces count |
| jaeger_collector_spans_received | Total spans received |
| jaeger_collector_queue_length | Pending spans in queue |
| jaeger_query_latency | Query response time |

Storage Backends

Jaeger supports multiple storage backends.

Elasticsearch

Best for large-scale production deployments:

# Elasticsearch with ILM
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
spec:
  strategy: production
  storage:
    type: elasticsearch
    elasticsearch:
      nodeCount: 3
      redundancyPolicy: SingleRedundancy
      indexParameters:
        numberOfShards: 5
        numberOfReplicas: 1

Cassandra

Traditional choice for high volume:

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
spec:
  strategy: production
  storage:
    type: cassandra
    cassandra:
      servers: cassandra:9042
      keyspace: jaeger_v1
      replication_factor: 2

Kafka

For buffering and replay:

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
spec:
  strategy: streaming
  collector:
    maxReplicas: 10
  storage:
    type: kafka
    kafka:
      brokers:
        - kafka:9092
      topic: jaeger-spans
      partitions: 10
  ingester:
    replicas: 2

SLO Integration with Tracing

Correlate trace data with Service Level Objectives for better reliability.

Defining SLOs from Traces

# Define SLO thresholds based on trace latency
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: slo-tracking
spec:
  strategy: production
  sampling:
    type: adaptive
    adaptive:
      # Ensure traces for SLO boundary requests are always captured
      sampling_server_url: jaeger-agent:5778
      # Prioritize capturing slow traces that may breach SLOs
      max_traces_per_second: 100

Trace-Based SLO Dashboard

Key trace metrics to visualize:

| Metric | SLO Target | Query Method |
|---|---|---|
| p99 Latency | < 500ms | Trace duration percentiles |
| Error Rate | < 0.1% | Error spans / total spans |
| Availability | > 99.9% | Successful traces / total traces |
| Trace Completeness | > 95% | Complete traces / total traces |
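
The first two rows of the table can be evaluated from per-trace summaries. This is a minimal sketch, assuming each trace has already been reduced to a `(duration_ms, failed)` pair; the function name, thresholds as defaults, and the sample data are hypothetical. It uses the nearest-rank method for p99.

```python
import math


def slo_report(samples: list, p99_target_ms: float = 500, error_target: float = 0.001) -> dict:
    """Compare p99 latency and error rate against SLO targets."""
    if not samples:
        raise ValueError("need at least one trace summary")
    durations = sorted(duration for duration, _ in samples)
    # Nearest-rank percentile: smallest value with at least 99% of samples at or below it.
    rank = math.ceil(99 * len(durations) / 100)
    p99 = durations[rank - 1]
    error_rate = sum(1 for _, failed in samples if failed) / len(samples)
    return {
        "p99_ms": p99,
        "error_rate": error_rate,
        "p99_ok": p99 < p99_target_ms,
        "errors_ok": error_rate < error_target,
    }


# Hypothetical window: 1000 traces, a handful slow, one failed.
window = [(100, False)] * 995 + [(700, False)] * 4 + [(900, True)]
print(slo_report(window))
```

In this sample the latency target holds, while a single failure in 1000 traces already sits exactly at the 0.1% limit and so does not satisfy a strict "less than" target. Numbers computed from sampled traces describe the sampled population, so a sampler that favors slow or failed traces will make both figures look worse than reality.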

Latency Budgets with Traces

Allocate latency budget across services:

# Get latency breakdown by service for an SLO window
# (spans without an error tag count as 0 errors instead of being dropped)
curl -s "http://jaeger-query:16686/api/traces?service=api-gateway&lookback=1h" | \
  jq '[.data[] | . as $t | .spans[] | {
    service: $t.processes[.processID].serviceName,
    operation: .operationName,
    duration_ms: (.duration / 1000),
    errors: ([.tags[]? | select(.key == "error" and .value == true)] | length)
  }] | group_by(.service) | map({
    service: .[0].service,
    avg_duration_ms: (map(.duration_ms) | add / length),
    max_duration_ms: (map(.duration_ms) | max),
    error_count: (map(.errors) | add)
  })'

Tail-Based Sampling for SLO Traces

Ensure error and slow traces are always captured:

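# Tail-based sampling configuration (illustrative schema; tail sampling is commonly
# implemented with the OpenTelemetry Collector's tail_sampling processor)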
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: slo-sampling
spec:
  strategy: production
  collector:
    sampling:
      type: tail-based
      tail-based:
        sampling:
          - type: probabilistic
            probabilistic: 0.1
          - type: latency
            latency:
              lower: 100ms
              upper: 500ms
            probabilistic: 0.5
          - type: always
            category:
              - error
            probabilistic: 1.0
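
The decision logic behind such policies is easier to see in code. The sketch below is a simplified illustration rather than any product's implementation: it assumes the complete trace has been buffered, keeps every trace with an error or a slow span, and keeps a deterministic fraction of the rest by hashing the trace ID. Function names, field names, and thresholds are hypothetical.

```python
import hashlib


def keep_trace(trace_id: str, spans: list, slow_ms: float = 500, baseline: float = 0.1) -> bool:
    """Decide, after the trace has finished, whether to store it."""
    # Policy 1: always keep traces that contain an error.
    if any(span.get("error") for span in spans):
        return True
    # Policy 2: always keep traces with a span at or above the latency threshold.
    if max(span["duration_ms"] for span in spans) >= slow_ms:
        return True
    # Policy 3: keep a stable fraction of everything else. Hashing the trace ID
    # makes every collector instance reach the same decision for the same trace.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") < baseline * 2**64


fast_ok = [{"duration_ms": 40}, {"duration_ms": 12}]
print(keep_trace("trace-1", fast_ok + [{"duration_ms": 30, "error": True}]))
print(keep_trace("trace-2", [{"duration_ms": 900}]))
```

The cost of this approach is memory: every span must be held until its trace is complete, and all spans of a trace must reach the same decision point.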

Multi-Region Deployment Considerations

Deploy Jaeger across regions for global microservices visibility.

Architecture Patterns

graph LR
    A[US-East Services] --> B[US-East Jaeger Collector]
    C[EU-West Services] --> D[EU-West Jaeger Collector]
    E[AP-South Services] --> F[AP-South Jaeger Collector]
    B --> G[Central Storage]
    D --> G
    F --> G
    H[Global Query] --> G

Cross-Region Trace Context

Propagate trace context across regions without losing context:

# Cross-region trace context propagation
import requests
from opentelemetry import propagate, trace

def forward_to_region(headers: dict, region: str):
    # Inject current trace context into headers for cross-region call
    propagate.inject(headers)

    # Add region-specific tags
    current_span = trace.get_current_span()
    current_span.set_attribute("destination.region", region)

    response = requests.post(
        f"https://{region}-api.example.com/process",
        headers=headers,
        timeout=10,  # avoid hanging on a slow remote region
    )
    return response
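
When traces break at a region boundary, it helps to inspect the header that was actually sent. The following sketch parses a W3C Trace Context `traceparent` value (version 00: version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags) using only the standard library. The function name is hypothetical, and real services should rely on the OpenTelemetry propagators rather than hand-rolled parsing.

```python
import re
from typing import Optional

# version - trace-id - parent-id - trace-flags, all lowercase hex
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(value: str) -> Optional[dict]:
    """Return the parsed fields, or None if the header is invalid."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero identifiers.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        # Bit 0 of the flags byte marks the trace as sampled.
        "sampled": bool(int(flags, 16) & 0x01),
    }


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(parse_traceparent(header))
```

If the trace ID differs between the caller's span and the callee's span, a proxy or gateway between regions is probably dropping or rewriting the header.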

Regional Sampling Strategies

# Regional sampling configuration
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: multi-region-jaeger
  namespace: observability
spec:
  strategy: production
  sampling:
    type: adaptive
    adaptive:
      # Higher sampling rate in production regions
      sampling_server_url: jaeger-agent:5778
      max_traces_per_second: 200
      initial_sampling_rate: 20
      # Always sample cross-region calls
      policies:
        - name: cross-region
          type: tag
          tag:
            key: cross_region
            value: "true"
          probabilistic: 1.0

Storage Considerations for Multi-Region

| Approach | Pros | Cons |
|---|---|---|
| Centralized (single ES) | Simple, consistent | Latency for remote collectors |
| Distributed (per-region ES) | Low latency | Complex cross-region queries |
| Hybrid (hot-warm ES) | Balance of both | Operational complexity |

Cross-Region Trace Correlation

# Correlate traces across regions: traces that touch us-east and at least one other region
# (the search API needs a concrete service name; wildcards are not supported)
curl -s "http://jaeger-query.global:16686/api/traces?service=api-gateway&lookback=1h" | \
  jq '.data[] | {
    traceID: .traceID,
    regions: ([.spans[].tags[]? | select(.key == "region") | .value] | unique),
    duration_ms: (([.spans[].duration] | max) / 1000)
  } | select((.regions | length > 1) and (.regions | index("us-east") != null))'

Production Failure Scenarios

| Failure | Impact | Mitigation |
|---|---|---|
| Jaeger storage backend degraded | Traces dropped; incomplete debugging data | Configure sampling; scale storage; implement buffer queue |
| Collector queue overflow | Spans dropped; monitoring gaps | Monitor queue depth; scale collectors; implement backpressure |
| Query service performance | Slow trace search; UI timeouts | Optimize queries; add caching; scale query replicas |
| Adaptive sampling too aggressive | Missing important traces | Review sampling rates; ensure error traces always sampled |
| Trace context propagation broken | Incomplete traces; orphaned spans | Implement proper propagation in all services; test regularly |
| ES backend slow | Delayed trace availability | Monitor ES cluster; optimize indices; add warm storage tier |

Common Pitfalls / Anti-Patterns

1. Not Sampling Tail for Errors

Head-based sampling misses most error traces at low sampling rates:

# Bad: Only probabilistic sampling
strategy: probabilistic
sampling:
  type: probabilistic
  probabilistic:
    sampling_rate: 0.01  # 1% - misses most errors

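# Better: adaptive sampling adjusts rates per operation; because it is still head-based,
# pair it with tail-based sampling to guarantee error traces are kept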
strategy: adaptive
sampling:
  type: adaptive
  adaptive:
    max_traces_per_second: 100
    sampling_server_url: jaeger-agent:5778
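
The size of the gap is easy to quantify. Because a head-based sampler decides before the outcome is known, each error trace is kept with the same probability as any other trace. The sketch below works through the arithmetic for the 1% rate above; the function names and the figure of 50 error traces are hypothetical.

```python
def expected_captured(sampling_rate: float, error_traces: int) -> float:
    """Average number of error traces that survive head-based sampling."""
    return sampling_rate * error_traces


def probability_none_captured(sampling_rate: float, error_traces: int) -> float:
    """Chance that every one of the error traces is discarded."""
    return (1 - sampling_rate) ** error_traces


rate, errors = 0.01, 50
print(f"expected error traces kept: {expected_captured(rate, errors):.1f}")
print(f"chance of keeping none:     {probability_none_captured(rate, errors):.1%}")
```

With 50 failing requests and 1% sampling, the most likely outcome is that not a single error trace is stored, which is why error capture needs a decision made after the trace completes.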

2. Missing Semantic Attributes

Traces without standard attributes are hard to query:

// Bad: Missing semantic attributes
const span = tracer.startSpan("OrderService.process");
span.setAttribute("orderId", order.id); // Non-standard name

// Good: Consistent, namespaced attribute names
const span = tracer.startSpan("OrderService.process");
span.setAttribute("order.id", order.id);
span.setAttribute("order.total", order.total);
span.setAttribute("customer.tier", customer.tier);

3. Creating Child Spans Without Parent Context

Orphaned spans break trace continuity:

// Bad: No parent context
async function processOrder(order) {
  const span = tracer.startSpan("OrderService.process");
  await validateOrder(order); // Creates orphaned span
  span.end();
}

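// Good: Propagate parent context
// Requires: import { context, trace } from "@opentelemetry/api"
async function processOrder(order, parentSpan) {
  // Build a context that carries the parent span and start the child in it
  const ctx = trace.setSpan(context.active(), parentSpan);
  const span = tracer.startSpan("OrderService.process", undefined, ctx);
  await validateOrder(order, span); // Child span linked
  span.end();
}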

4. Storing Large Payloads in Span Events

Span events are not a data store:

// Bad: Large payload in span
span.addEvent("response", { body: JSON.stringify(largeResponse) });

// Good: Reference by ID or summary
span.addEvent("response", {
  "response.size_bytes": largeResponse.length,
  "response.status": "success",
});

5. Not Monitoring Jaeger Itself

If nothing scrapes Jaeger's own metrics, dropped spans and full queues go unnoticed:

# Prometheus metrics from Jaeger
scrape_configs:
  - job_name: "jaeger"
    static_configs:
      - targets: ["jaeger-collector:8888"]

Real-world Failure Scenarios

Scenario 1: Jaeger Collector OOM During Traffic Spike

What happened: A product launch caused a 20x spike in trace volume. The Jaeger collector’s in-memory queue filled up and the process was killed by the OOM killer.

Root cause: The collector was configured with a fixed in-memory queue size but no dead-letter queue or sampling strategy to handle sudden volume increases.

Impact: Approximately 15 minutes of traces were lost during the product launch window. Engineers could not correlate the elevated error rate with specific services.

Lesson learned: Configure adaptive sampling or tail-based sampling to automatically reduce trace volume during traffic spikes. Set up dead-letter queues for failed trace ingestion. Monitor collector queue depth and set OOM alerts.

Scenario 2: Cassandra Storage Saturation

What happened: Over several months, the Cassandra storage cluster used by Jaeger reached its capacity limit. Trace data began being dropped silently as inserts failed.

Root cause: No capacity planning was done for trace retention. The default 30-day retention was never adjusted as the system scaled.

Impact: Historical traces older than 2 weeks became unavailable, making retrospective analysis of a security incident impossible.

Lesson learned: Plan storage capacity based on expected trace volume and retention requirements. Monitor storage node disk usage and set alerts. Consider down-sampling older traces to reduce storage costs.
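
A back-of-the-envelope estimate makes the capacity plan concrete. This sketch multiplies span rate, sampling rate, average span size, and retention; every input value below is hypothetical and should be replaced with measurements from your own cluster. It ignores index overhead and compression, which vary by backend.

```python
def storage_gib(
    spans_per_second: float,
    avg_span_bytes: float,
    retention_days: float,
    sampling_rate: float = 1.0,
    replicas: int = 1,
) -> float:
    """Estimate raw trace storage in GiB, including replica copies."""
    retained_seconds = retention_days * 86_400
    stored_bytes_per_second = spans_per_second * sampling_rate * avg_span_bytes
    # One primary copy plus the configured number of replicas.
    total_bytes = stored_bytes_per_second * retained_seconds * (1 + replicas)
    return total_bytes / 2**30


# 5000 spans/s emitted, 10% sampled, ~500 bytes each, kept 14 days with one replica.
estimate = storage_gib(5000, 500, 14, sampling_rate=0.1, replicas=1)
print(f"{estimate:.0f} GiB")
```

Rerunning the estimate whenever traffic or retention changes, and alerting when disk usage approaches it, catches saturation before inserts start failing.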

Trade-off Analysis

When designing a Jaeger deployment, several key trade-offs require careful consideration.

Storage Backend Selection

| Criteria | Elasticsearch | Cassandra | Kafka |
|---|---|---|---|
| Scalability | Horizontal sharding, ILM support | Tunable consistency, wide rows | Partition-based parallelism |
| Query Performance | Excellent aggregations, full-text search | Fast reads for trace ID lookups | Requires separate consumer |
| Operational Complexity | High (cluster management, ILM) | Medium (SSTables, compaction) | Medium (brokers, replication) |
| Cost at Scale | Higher (memory-heavy) | Lower (disk-efficient) | Variable (depends on retention) |
| Best For | Large teams, advanced analytics | High write throughput | Event-driven replay scenarios |

Sampling Strategy Trade-offs

| Strategy | Storage Savings | Debug Coverage | Latency Overhead | Complexity |
|---|---|---|---|---|
| Probabilistic | High | Low (misses rare events) | Minimal | Simple |
| Adaptive | Medium | High (prioritizes errors) | Low | Medium |
| Tail-based | Variable | Highest (captures all important) | Higher | Complex |
| Rate-limiting | Predictable | Medium | Minimal | Simple |

Deployment Architecture Comparison

| Aspect | All-in-One | Production (Separated) | Multi-Region |
|---|---|---|---|
| Resource Usage | Minimal | High | Very High |
| Scalability | None | Horizontal collectors/query | Regional collectors |
| Operational Overhead | Minimal | Medium | High |
| Fault Isolation | Poor (single point) | Good (component isolation) | Excellent (regional failure domains) |
| Latency | Low (local only) | Medium (network to collectors) | Higher (cross-region) |
| Setup Time | Minutes | Hours | Days |
| Use Case | Development, CI/CD | Production (single region) | Global enterprises |

Agent Deployment Models

| Model | Pros | Cons | Best Environment |
|---|---|---|---|
| Sidecar (per pod) | Simple injection, local UDP | Resource overhead per pod | Kubernetes with sidecar injection |
| Daemonset (node-level) | Shared resource pool, lower overhead | Requires host networking | Dense node deployments |
| Agentless (direct to collector) | No agent maintenance | Higher latency (network hop), firewall rules | Secure environments, small scale |

Instrumentation Approach Trade-offs

| Approach | Effort | Granularity | Maintenance | Best Stage |
|---|---|---|---|---|
| Auto-instrumentation | Low | Medium | Low | Initial adoption |
| Manual instrumentation | High | Fine-grained | Higher | Production hardening |
| Hybrid (auto + manual) | Medium | Fine-grained | Medium | Mature observability |

When to Use Jaeger

Use Jaeger when:

  • Debugging latency issues across microservice boundaries
  • Understanding service dependencies and call patterns
  • Root cause analysis for cascading failures
  • Optimizing performance by identifying bottlenecks
  • Validating trace context propagation
  • Monitoring distributed transactions
  • Detecting anomalies in request flows

Don’t use Jaeger when:

  • You have single monolithic applications without service boundaries
  • You need purely metric-based monitoring
  • Even minimal tracing overhead on the request path is unacceptable
  • You need long-term log storage
  • You only need aggregate analytics (use dashboards)

Interview Questions

1. What is distributed tracing, and how does Jaeger implement it?

Expected answer points:

  • Distributed tracing tracks requests across service boundaries in microservices architectures
  • Jaeger implements tracing with agent, collector, query, and UI components, and accepts spans from OpenTelemetry SDKs over OTLP
  • Spans represent individual operations, and traces are composed of connected spans forming a request path
  • Jaeger stores traces in backends like Elasticsearch, Cassandra, or Kafka for querying and visualization

2. What are the main components of Jaeger architecture?

Expected answer points:

  • Jaeger Agent: Sidecar or daemonset that receives spans via UDP and forwards to collectors
  • Jaeger Collector: Receives spans, processes them, and stores in the configured backend
  • Jaeger Query: Backend service that retrieves traces from storage for display
  • Jaeger UI: Web interface for searching and visualizing traces
  • Supported storage backends: Elasticsearch, Cassandra, Kafka

3. What sampling strategies does Jaeger support, and when would you use each?

Expected answer points:

  • Probabilistic: Fixed percentage of traces captured, simple but may miss important events
  • Adaptive: Dynamic sampling based on traffic, prioritizes rare events and errors
  • Tail-based: Collects full trace after seeing the end, ideal for capturing all slow or error traces
  • Use adaptive for production with high traffic, tail-based for debugging specific issues

4. How do you debug a slow request using Jaeger?

Expected answer points:

  • Search for traces with high duration using the Jaeger UI or API
  • Identify which service has the longest spans in the trace waterfall
  • Check span tags for business context (order ID, user ID, etc.)
  • Review span logs for errors, database queries, or external calls
  • Follow the trace backward to find where latency was introduced

5. What is trace context propagation and why is it important?

Expected answer points:

  • Trace context propagation passes trace and span IDs across service boundaries via HTTP headers
  • Without proper propagation, spans become orphaned and traces appear incomplete
  • OpenTelemetry uses W3C Trace Context headers (traceparent, tracestate)
  • Context must be injected before outgoing requests and extracted on incoming requests

6. How do you integrate Jaeger with OpenTelemetry?

Expected answer points:

  • Use OpenTelemetry SDK with Jaeger exporter or OTLP exporter pointing to Jaeger collector
  • Auto-instrumentation available for Python, Java, Node.js, Go, .NET and other languages
  • Manual instrumentation using OpenTelemetry API for custom business logic
  • Configure resource attributes like service.name for proper identification

7. What storage backend would you choose for a large-scale Jaeger deployment and why?

Expected answer points:

  • Elasticsearch: Best for large-scale production, excellent query performance, ILM support
  • Cassandra: Traditional choice, good for very high write throughput, tunable consistency
  • Kafka: Enables buffering and replay, useful for event-driven architectures
  • Consider ingestion rate, query patterns, operational complexity, and existing infrastructure

8. What are common pitfalls when using Jaeger in production?

Expected answer points:

  • Head-based sampling missing error traces at low sampling rates
  • Missing semantic attributes making traces hard to query
  • Orphaned spans from not propagating parent context
  • Storing large payloads in span events causing performance issues
  • Not monitoring Jaeger itself leading to observability gaps

9. How do you correlate trace data with SLOs and alerting?

Expected answer points:

  • Define SLO thresholds based on trace latency percentiles and error rates
  • Use tail-based sampling to ensure error and slow traces are always captured
  • Export Jaeger metrics to Prometheus for alerting on collector queue depth, ingestion rate
  • Create dashboards correlating trace data with business-level SLOs

10. What security considerations apply when deploying Jaeger?

Expected answer points:

  • Authenticate Jaeger UI access to prevent unauthorized trace data exposure
  • Avoid sensitive data (passwords, tokens, PII) in span tags and logs
  • Configure TLS for all Jaeger endpoints including gRPC and HTTP
  • Restrict storage backend access and audit trace data exports
  • Sanitize trace context headers before external calls to prevent injection

11. How does Jaeger compare to Zipkin, and what are the migration considerations?

Expected answer points:

  • Jaeger and Zipkin share the same trace data model (spans and traces) making interoperability possible
  • Jaeger supports native OTLP ingestion while Zipkin requires additional collectors for OTLP
  • Jaeger provides more advanced sampling strategies including tail-based sampling
  • Zipkin has a simpler deployment model suited for smaller deployments
  • Migration involves either re-instrumenting services with OpenTelemetry SDKs or pointing existing Zipkin reporters at the Zipkin-compatible endpoint on the Jaeger collector
  • Jaeger stores data in Elasticsearch, Cassandra, or Kafka while Zipkin typically uses in-memory or Cassandra

12. How do you integrate Jaeger with service mesh environments like Istio or Linkerd?

Expected answer points:

  • Istio automatically instruments traffic with Jaeger via the OpenTelemetry collector addon
  • Configure Istio to export traces to Jaeger using the mesh config and tracing options
  • Linkerd uses its own distributed tracing capability that can export to Jaeger
  • Service mesh sidecars handle trace context propagation automatically across proxy boundaries
  • Jaeger helps identify service mesh performance issues and proxy overhead
  • High cardinality of mesh-generated traces requires careful sampling strategy configuration

13. What is the difference between head-based and tail-based sampling in Jaeger?

Expected answer points:

  • Head-based sampling decides at the start of a trace whether to sample, using probabilistic or rate-limiting strategies
  • Tail-based sampling collects all traces but makes the sampling decision at the end based on policy rules
  • Head-based sampling is simpler and requires less resources but may miss important rare events
  • Tail-based sampling ensures error traces and slow traces are always captured for debugging
  • Jaeger's adaptive sampling is still head-based: it adjusts per-operation sampling probabilities over time to reach a target rate of sampled traces

14. How do you analyze trace flame graphs in Jaeger, and what insights do they provide?

Expected answer points:

  • Flame graphs visualize trace duration as stacked bars showing time spent in each span
  • Wide blocks indicate where most time is spent, identifying latency bottlenecks
  • Deep stacks show trace depth and parent-child relationships across services
  • Jaeger's trace detail view provides a waterfall representation that functions like a flame graph
  • Color coding helps distinguish services, errors, and external calls
  • Compare flame graphs across time periods to detect performance regressions

15. What are the scaling considerations for Jaeger in high-throughput production environments?

Expected answer points:

  • Scale Jaeger collectors horizontally to handle increased ingestion load
  • Use Kafka as a buffer between collectors and storage to handle spikes
  • Configure adaptive sampling to reduce storage requirements without losing critical traces
  • Elasticsearch index lifecycle management helps control storage costs
  • Query service can be scaled horizontally with read replicas
  • Monitor queue depth and span latency to detect scaling needs proactively

16. How do you implement distributed tracing propagation across asynchronous message queues?

Expected answer points:

  • Inject trace context into message headers or properties before publishing
  • Extract trace context on the consumer side and create child spans
  • OpenTelemetry provides automatic propagation for Kafka, RabbitMQ, and other messaging systems
  • Ensure message processing spans have proper parent context for complete trace continuity
  • Batch message processing requires careful span management to avoid orphaned spans
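
The inject and extract steps can be sketched by hand using the W3C `traceparent` header format. In real services the OpenTelemetry propagation API does this work; the `TraceContext` and `Message` types and both function names below are assumptions for illustration, and the sketch only handles version `00` of the header.

```typescript
// Hypothetical shapes for this sketch.
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  sampled: boolean;
}

interface Message {
  headers: Record<string, string>;
  body: string;
}

// Producer side: write the context into a W3C `traceparent` header
// before the message is published.
function injectContext(ctx: TraceContext, msg: Message): void {
  const flags = ctx.sampled ? "01" : "00";
  msg.headers["traceparent"] = `00-${ctx.traceId}-${ctx.spanId}-${flags}`;
}

// Consumer side: parse the header. Returns undefined when the header is
// missing or malformed, in which case the consumer would start a new trace.
function extractContext(msg: Message): TraceContext | undefined {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(
    msg.headers["traceparent"] ?? ""
  );
  if (!match) return undefined;
  return {
    traceId: match[1],
    spanId: match[2],
    sampled: (parseInt(match[3], 16) & 1) === 1,
  };
}
```

The consumer uses the extracted `spanId` as the parent of its processing span, which is what keeps the trace continuous across the queue.
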
17. What strategies exist for reducing trace cardinality while maintaining debugging capabilities?

Expected answer points:

  • Use tag allowlisting to restrict which attributes are stored with spans
  • Implement sampling strategies that capture complete traces for errors and slow requests
  • Aggregate high-cardinality attributes like user IDs or request IDs into bucketed values
  • Configure storage backends with appropriate index mappings to handle cardinality
  • Use trace quality scoring to identify and drop low-value traces
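
Allowlisting and bucketing can be combined in one small attribute filter. This is a sketch under stated assumptions: the allowlist contents, the `latency_ms` input key, the `latency_bucket` output key, and the bucket boundaries are hypothetical choices, not Jaeger settings.

```typescript
type Attributes = Record<string, string | number | boolean>;

// Hypothetical allowlist; choose the keys your team actually queries on.
const ALLOWED_KEYS = new Set(["http.method", "http.status_code", "service.version"]);

// Map a raw latency onto a small, fixed set of labels.
function latencyBucket(ms: number): string {
  if (ms < 100) return "<100ms";
  if (ms < 500) return "100-500ms";
  if (ms < 2000) return "500ms-2s";
  return ">=2s";
}

// Keep allowlisted keys, collapse the status code to its class, and
// replace an exact latency value with a bucket label.
function reduceCardinality(attrs: Attributes): Attributes {
  const out: Attributes = {};
  for (const [key, value] of Object.entries(attrs)) {
    if (!ALLOWED_KEYS.has(key)) continue;
    out[key] = key === "http.status_code" ? `${Math.floor(Number(value) / 100)}xx` : value;
  }
  const latency = attrs["latency_ms"];
  if (typeof latency === "number") {
    out["latency_bucket"] = latencyBucket(latency);
  }
  return out;
}
```

Unbounded values such as user IDs are dropped because they are not on the allowlist, while the status class and latency bucket keep enough signal to filter for failing or slow requests.
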
18. How does Jaeger handle clock skew issues in distributed trace timestamps?

Expected answer points:

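  • Each span records an absolute start timestamp and a duration taken from the emitting host's clock, so skew between hosts can make a child span appear to start before its parent
  • Jaeger's query service applies a clock skew adjustment when a trace is loaded, shifting child spans from other hosts so that they fit within the parent span's time window
  • The adjustment is a heuristic: it restores a plausible ordering but cannot recover the true offset between clocks
  • Jaeger UI displays spans as offsets relative to the trace start, not as absolute timestamps
  • For cross-region traces, compare span durations rather than timestamps, since durations are measured on a single host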
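
Using the parent span as the baseline can be sketched as a shift that moves a child span back inside its parent's time window. This is a simplified illustration, not Jaeger's actual adjustment algorithm; the `TimedSpan` shape and the `skewDelta` function are assumptions for the example.

```typescript
// Hypothetical span shape for this sketch.
interface TimedSpan {
  spanId: string;
  parentId?: string;
  startMicros: number; // timestamp as reported by the emitting host
  durationMicros: number;
}

// Return the shift (in microseconds) that would move a child span inside its
// parent's time window. Returns 0 if the child already fits, or if it is
// longer than the parent and therefore cannot fit at all.
function skewDelta(parent: TimedSpan, child: TimedSpan): number {
  const parentEnd = parent.startMicros + parent.durationMicros;
  const childEnd = child.startMicros + child.durationMicros;
  if (child.durationMicros > parent.durationMicros) return 0;
  if (child.startMicros < parent.startMicros) {
    // Child appears to start before its parent: shift it forward.
    return parent.startMicros - child.startMicros;
  }
  if (childEnd > parentEnd) {
    // Child appears to end after its parent: shift it backward.
    return parentEnd - childEnd;
  }
  return 0;
}
```

The same delta would then be applied to every span emitted by the child's host, because clock skew is a property of the host rather than of a single span.
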
19. What are the operational best practices for running Jaeger in Kubernetes?

Expected answer points:

  • Use the Jaeger Operator for declarative Kubernetes deployments and upgrades
  • Deploy the agent as a daemonset for sidecar-less architecture with lower overhead
  • Configure resource limits based on expected trace volume and processing requirements
  • Use separate namespaces for observability components to enable proper RBAC
  • Implement pod disruption policies for collector and query deployments
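  • Monitor Jaeger itself with Prometheus metrics; Jaeger v1 components serve them on their admin ports (14269 for the collector), while OpenTelemetry Collector-based deployments default to port 8888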
20. How do you correlate traces from Jaeger with metrics in Prometheus and logs in ELK?

Expected answer points:

  • Use consistent service names and labels across Jaeger, Prometheus, and ELK
  • Export Jaeger metrics to Prometheus for correlation between trace latency and system metrics
  • Include trace ID in log lines to enable cross-platform correlation
  • Use the trace ID from Jaeger spans to search corresponding logs in Elasticsearch
  • Build Grafana dashboards that combine Jaeger trace data with Prometheus metrics
  • Include trace and span IDs in error messages so they link directly to the relevant trace in Jaeger UI
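
Putting the trace ID into log lines and linking back to the trace can be sketched as two small helpers. The `LogLine` field names and both function names are assumptions for illustration, and the default base URL is the all-in-one address from the quick start.

```typescript
// Hypothetical structured log shape; align field names with your ELK mappings.
interface LogLine {
  timestamp: string;
  level: "info" | "warn" | "error";
  service: string;
  message: string;
  trace_id: string;
  span_id: string;
}

// Build a JSON log line that carries the trace and span IDs, so the same
// trace ID can be searched in Elasticsearch.
function buildLogLine(
  service: string,
  level: LogLine["level"],
  message: string,
  traceId: string,
  spanId: string,
  now: Date = new Date()
): string {
  const line: LogLine = {
    timestamp: now.toISOString(),
    level,
    service,
    message,
    trace_id: traceId,
    span_id: spanId,
  };
  return JSON.stringify(line);
}

// Link straight to the trace in the Jaeger UI. The base URL is an assumption;
// replace it with your own Jaeger Query address.
function jaegerTraceUrl(traceId: string, baseUrl = "http://localhost:16686"): string {
  return `${baseUrl}/trace/${traceId}`;
}
```

Keeping the `service` value identical to the service name used in Jaeger and in Prometheus labels is what makes the three signals joinable.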

Conclusion

Jaeger brings distributed tracing to microservices debugging and analysis. Its trace visualization, dependency mapping, and performance analysis help you make sense of complex request flows.

Start with automatic instrumentation for your services, deploy the all-in-one version for development, and scale to a production deployment with Elasticsearch storage when you are ready.

For complete observability, combine tracing with metrics (Prometheus & Grafana) and logs (ELK Stack). The Distributed Tracing guide covers OpenTelemetry integration in more detail.

Quick Recap

Key Takeaways:

  • Jaeger provides distributed tracing visualization for microservices
  • Deploy collectors, query, and UI separately in production
  • Use tail-based sampling to capture errors and slow traces, and adaptive sampling to keep trace volume under control
  • Always propagate trace context through HTTP headers and message queues
  • Monitor Jaeger itself: ingestion rate, queue depth, storage latency
  • Combine with metrics (Prometheus) and logs (ELK) for complete observability

Copy/Paste Checklist:

# Production Jaeger CRD
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-jaeger
spec:
  strategy: production
  collector:
    maxReplicas: 5
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
  storage:
    type: elasticsearch
    elasticsearch:
      nodeCount: 3
  query:
    replicas: 2

# Adaptive sampling config
spec:
  sampling:
    type: adaptive
    adaptive:
      max_traces_per_second: 100
      initial_sampling_rate: 10

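# Prometheus metrics scrape
# Port 8888 is the OpenTelemetry Collector default; Jaeger v1 collectors serve
# metrics on admin port 14269, so adjust the target to match your deployment
scrape_configs:
  - job_name: 'jaeger-collector'
    static_configs:
      - targets: ['jaeger-collector:8888']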
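
// Trace context propagation
import { context, propagation, trace, SpanStatusCode } from "@opentelemetry/api";

// Write the active trace context into outgoing headers
function injectTraceContext(headers: Record<string, string>) {
  propagation.inject(context.active(), headers);
}

// Build a Context from incoming headers to use as the parent of new spans
function extractTraceContext(headers: Record<string, string>) {
  return propagation.extract(context.active(), headers);
}

// Error span recording on the currently active span, if there is one
function recordError(error: Error) {
  const span = trace.getActiveSpan();
  span?.recordException(error);
  span?.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
}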

Reference Checklists

Observability Checklist

Jaeger Infrastructure Metrics

  • Traces received per second (ingestion rate)
  • Spans received and processed
  • Collector queue depth
  • Backend storage write latency
  • Query service latency (p50, p95, p99)
  • Active connections to storage

Trace Coverage Metrics

  • Services with instrumentation
  • Trace completeness (spans per trace average)
  • Error trace percentage
  • Slow trace percentage (>threshold)
  • Span attribute coverage

Sampling Configuration

  • Head-based sampling rate
  • Adaptive sampling thresholds
  • Tail sampling policies (errors, slow traces)
  • Always-sampled tag configuration

Alerting Rules

  • Collector queue depth > threshold
  • Storage write latency degraded
  • Query service down
  • Ingestion rate drop (may indicate a broken exporter, collector, or network path)

Security Checklist

  • Jaeger UI access authenticated
  • No sensitive data in span tags or logs (passwords, tokens, PII)
  • TLS configured for all endpoints
  • Trace data access logged and audited
  • Sampling does not inadvertently exclude security-relevant traces
  • Trace context headers sanitized before external calls
  • Storage backend access restricted
  • No internal service names exposed to external trace exports
