Apache Kafka: Distributed Streaming Platform
Learn how Apache Kafka handles distributed streaming with partitions, consumer groups, exactly-once semantics, and event-driven architecture patterns.
Introduction
Apache Kafka is a distributed streaming platform built for high-throughput, fault-tolerant event streaming at scale. Originally developed at LinkedIn and later open-sourced through the Apache Foundation, Kafka has become the backbone of event-driven architectures across industries — powering real-time analytics pipelines, event sourcing systems, microservices communication, and log aggregation at companies like Netflix, Uber, and Airbnb.
Unlike traditional message queues that delete messages after consumption, Kafka persists messages in immutable logs. Producers write events to topics; consumers read from those topics independently. This durability and the ability to replay events make Kafka well-suited for systems where late-arriving consumers or audit trails matter.
This post covers the core concepts that make Kafka work: topics and partitions, consumer groups, offset management, exactly-once semantics, broker replication, dead letter queues, and backpressure handling. By the end, you’ll understand how Kafka achieves its legendary throughput, how to design consumer groups for parallelism, and how to avoid the common pitfalls in production.
Core Concepts
Topics and partitions
Kafka organizes data into topics. Unlike a queue where messages are consumed and deleted, Kafka topics are logs. Messages are appended and kept for a configurable retention period: hours, days, or indefinitely.
Topic: order-events
Partition 0: [msg1, msg2, msg3, msg5, msg8]
Partition 1: [msg4, msg6, msg7, msg9]
Partition 2: [msg10, msg11, msg12]
Each topic splits into partitions for parallelism. Partitions distribute across brokers. Within a partition, messages have a monotonically increasing offset that uniquely identifies each one.
Message keying
Producers can specify a key when publishing:
producer.send(new ProducerRecord("order-events", orderId, orderJson));
Kafka hashes the key to determine the partition. Messages with the same key always go to the same partition, which means ordering per key. All events for the same order arrive in the same partition, in order.
Partition assignment
Partitions assign to brokers at topic creation time:
Broker 1: Partition 0, Partition 3
Broker 2: Partition 1, Partition 4
Broker 3: Partition 2, Partition 5
The leader broker for each partition handles reads and writes. Followers replicate for fault tolerance.
Consumer groups
Kafka consumers belong to consumer groups. Each message in a topic delivers to one consumer within each group.
graph LR
Producer -->|publish| Topic[Topic with 3 Partitions]
Topic -->|P0| CG1[Consumer Group A]
Topic -->|P1| CG1
Topic -->|P2| CG1
Topic -->|P0| CG2[Consumer Group B]
Topic -->|P1| CG2
Topic -->|P2| CG2
Group A might have one consumer processing all partitions. Group B might have three consumers, each owning one partition. Both groups receive all messages independently.
Rebalancing
When a consumer joins or leaves a group, Kafka rebalances partition assignments. The system redistributes partitions to maintain even load.
Rebalancing has a cost: during rebalancing, no consumer processes messages. If rebalances happen frequently, throughput suffers.
Offset management
Kafka tracks consumption progress via offsets. Consumers commit offsets to mark what they have processed:
while (true) {
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
for (ConsumerRecord<String, String> record : records) {
process(record);
}
consumer.commitSync();
}
If a consumer crashes after processing but before committing, it receives the same messages on restart. This is at-least-once delivery. With idempotent processing, duplicates are harmless.
Exactly-Once Delivery
Overview
Exactly-once is the gold standard: each message processes precisely once, no duplicates, no loss. Kafka offers exactly-once semantics via transactions.
The problem
Exactly-once is hard because processing involves multiple systems:
Kafka -> Consumer -> Database
If the consumer crashes after writing to the database but before committing the Kafka offset, the message reprocesses on restart, causing a duplicate database write.
Kafka transactions
Kafka transactions solve this by atomically committing offsets and output:
producer.initTransactions();
producer.beginTransaction();
producer.send(producerRecord);
producer.sendOffsetsToTransaction(consumer.offsets(), consumer.groupMetadata());
producer.commitTransaction();
If the transaction commits, the message writes and the offset commits atomically. If it aborts, neither happens. The consumer reads the message again.
When you need exactly-once
Most use cases do not need exactly-once. At-least-once with idempotent processing is simpler and performs better. Consider exactly-once only when:
- Duplicate processing has serious consequences (financial transactions, inventory updates)
- Your output system does not support idempotent writes
- The cost is acceptable
For most event processing pipelines, at-least-once plus idempotency is the right choice.
End-to-end exactly-once flow
The exactly-once flow in practice:
sequenceDiagram
participant Producer
participant Kafka as Kafka Cluster
participant Consumer as Consumer App
participant DB as Output DB
participant Coordinator as Transaction Coordinator
Producer->>Kafka: send() with transactional ID
Kafka->>Producer: acknowledged
Consumer->>Kafka: poll() receives record
Consumer->>DB: write to database
Consumer->>Coordinator: sendOffsetsToTransaction()
Coordinator->>Kafka: commit offsets + data atomically
Kafka->>Coordinator: commit confirmed
Coordinator->>Consumer: offset commit confirmed
Note over Consumer,DB: If crash here: replay from committed offset, skip already-written records
The transaction coordinator bundles the database write and the offset commit together. They either both commit or both abort. When the consumer restarts, it resumes from the last committed offset — skipping any records that were already written to the output system.
Without transactions, there is a window between “wrote to database” and “committed offset.” Crash in that window and you get duplicates. Transactions close that window.
Broker Replication and Fault Tolerance
Replication and ISR
Kafka replicates partitions across brokers for fault tolerance. Each partition has a leader and multiple followers. Followers replicate messages from the leader, staying in sync.
graph LR
subgraph Broker-1
L1[Partition 0 Leader]
end
subgraph Broker-2
F1[Partition 0 Follower - ISR]
end
subgraph Broker-3
F2[Partition 0 Follower - ISR]
end
Producer -->|writes| L1
L1 -->|replicate| F1
L1 -->|replicate| F2
L1 -->|consume| Consumer
In-Sync Replicas (ISR) are replicas that have fully caught up with the leader. Only ISR members can become leaders after a failure. The replication.factor setting controls how many replicas exist, and min.insync.replicas defines the minimum ISR size for acknowledging writes.
For example, with replication.factor=3 and min.insync.replicas=2, a partition has 3 replicas. Writes acknowledge when at least 2 replicas (the leader plus 1 follower) have persisted the message.
Key configuration defaults
| Parameter | Default | Description | Production recommendation |
|---|---|---|---|
replication.factor | 1 | Number of replicas per partition | 3 for critical topics |
min.insync.replicas | 1 | Minimum ISR for acknowledge | 2 (requires acks=all) |
retention.ms | 7 days | Message retention period | Based on replay requirements |
acks | 1 (leader) | Acknowledgment required | all for critical data |
compression.type | producer | Compression codec | lz4 or zstd |
max.in.flight.requests.per.connection | 5 | Unacknowledged requests | 1 for exactly-once |
Dead Letter Queues
When processing fails and you cannot retry, messages need somewhere to go. Dead letter queues (DLQs) catch messages that consumer processing repeatedly fails on.
@Bean
public DeadLetterPublishingPostProcessor deadLetterPublishingPostProcessor(
ConcurrentKafkaListenerContainerFactory<String, String> factory) {
DefaultErrorHandler errorHandler = new DefaultErrorHandler(
new FixedBackOff(1000L, 3) // 3 retries, 1 second apart
);
// Send to DLQ after retries exhausted
errorHandler.setBackOffMultiplier(2);
factory.setCommonErrorHandler(errorHandler);
return new DeadLetterPublishingPostProcessor(
factory.getKafkaTemplate(),
(record, exception) -> new TopicPartition(
record.topic() + ".DLQ", // convention: topic.DLQ
record.partition()
)
);
}
This sends failed messages to order-events.DLQ after 3 retry attempts. The DLQ preserves the original topic, partition, and key so you can investigate without losing context.
DLQ design considerations
- Monitoring: Alert when DLQ depth exceeds zero — messages piling up means something is wrong upstream
- Retention: DLQ retention is often shorter than main topic; set a TTL or explicit cleanup job
- Reprocessing: DLQ messages can be reprocessed by a debugging consumer or manually re-published to the original topic after fixing the issue
- Causality tracking: Include the original exception stack trace or error code in the message value for debugging
Backpressure Handling
Kafka producers send messages faster than consumers can process them. Backpressure management prevents unbounded lag growth.
Consumer-side backpressure
max.poll.records limits how many messages a consumer fetches per poll:
config.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100); // process 100, then poll again
config.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300000); // 5 minute max poll interval
fetch.min.bytes and fetch.max.wait.ms control how much data the consumer waits for:
config.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1024 * 1024); // wait for at least 1MB
config.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500); // or 500ms, whichever comes first
Producer-side backpressure
Producer buffer memory (buffer.memory) and batch.size interact to create natural backpressure. If the broker is slow and the send buffer fills, send() blocks or throws an exception:
config.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 32 * 1024 * 1024); // 32 MB buffer
config.put(ProducerConfig.BATCH_SIZE_CONFIG, 16 * 1024); // 16 KB batches
config.put(ProducerConfig.LINGER_MS_CONFIG, 5); // wait up to 5ms to batch
Backpressure signals to watch
| Signal | Meaning | Action |
|---|---|---|
| Consumer lag growing | Producers outpacing consumers | Add consumers or optimize processing |
Producer send() blocking | Broker throughput saturated | Add brokers or reduce producer load |
| Under-replicated partitions > 0 | Followers falling behind leader | Check disk I/O, network, or broker load |
| Request timeout exceptions | Brokers too slow to respond | Increase request.timeout.ms or scale |
Delivery Guarantees
Kafka provides three delivery guarantee levels. Understanding when each applies matters for system design.
At-most-once delivery
Messages may be lost but are never duplicated. This happens when consumers commit offsets before processing:
consumer.commitAsync(); // commit before processing
process(record); // if crash here, message is lost
Use at-most-once when:
- Duplicate messages cost more than missed messages (sensor data aggregation, metrics)
- You need lowest possible latency and can tolerate gaps
At-least-once delivery (default)
Messages are never lost but may be duplicated. Consumer commits after processing:
process(record); // do work first
consumer.commitSync(); // then commit offset
If the consumer crashes after processing but before committing, the same messages reprocess on restart. With idempotent operations (writes with unique keys), duplicates are safe.
Use at-least-once when:
- Missing messages is worse than duplicates (inventory updates, payment processing)
- Your consumers are idempotent
Exactly-once delivery
Each message processes exactly once, no duplicates, no loss. Requires Kafka transactions as shown earlier in this post.
Use exactly-once when:
- Duplicate processing has serious consequences
- Your output system cannot handle idempotent writes
- The performance cost is acceptable
| Guarantee | Message Loss | Duplicates | Latency | Best Use Case |
|---|---|---|---|---|
| At-most-once | Yes | No | Lowest | Telemetry, metrics, gaps acceptable |
| At-least-once | No | Yes | Medium | Most use cases; idempotent consumers |
| Exactly-once | No | No | Highest | Financial, inventory, idempotency-impossible |
Consumer Group Coordination
Consumer group behavior depends on partition assignment strategy and coordination needs.
Sticky assignor
The default sticky assignor minimizes partition movement during rebalances. When a consumer leaves, its partitions stay together when reassigned rather than being scattered across remaining consumers.
config.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
StickyAssignor.class.getName());
Benefit: fewer message reorderings during rebalance since partitions that were together stay together.
Cooperative sticky assignor
For minimal disruption during rebalances, use cooperative sticky assignor. It allows incremental rebalancing without stopping consumption entirely:
config.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
CooperativeStickyAssignor.class.getName());
The consumer continues processing while partition ownership shifts incrementally.
Standalone consumers
Sometimes you need a single consumer not part of a group:
consumer.subscribe(List.of("topic")); // group-based
// vs
consumer.assign(List.of(new TopicPartition("topic", 0))); // standalone, manual assignment
Standalone consumers manually assign partitions and manage their own offsets. Useful for:
- One-off administrative tasks (consuming from beginning to rebuild state)
- Specialized processing pipelines that should not interfere with normal consumers
- Debugging and testing
Maximum parallelism calculation
For a given topic, maximum consumer parallelism equals partition count:
Topic with 12 partitions
→ Up to 12 consumers in the group (each gets 1 partition)
→ Adding more consumers does nothing (no extra partitions)
If you need 20 consumers for parallel processing:
→ Topic must have at least 20 partitions
Partition count is fixed at topic creation. Plan accordingly.
Kafka Streams example
Kafka Streams is a client library for building real-time stream processing applications. The word count example is the canonical demonstration. It counts occurrences of words across an infinite stream of text events:
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import java.util.Arrays;
import java.util.Properties;
public class WordCountApplication {
public static void main(String[] args) {
Properties config = new Properties();
config.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-app");
config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
config.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
config.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
StreamsBuilder builder = new StreamsBuilder();
// Source: read from input topic
KStream<String, String> textLines = builder.stream("text-lines-topic");
// Process: split into words, group, count
textLines
.flatMapValues(textLine -> Arrays.asList(textLine.toLowerCase().split("\\W+")))
.groupBy((key, word) -> word)
.count(Materialized.as("word-counts-store"))
.toStream()
.to("word-counts-output-topic");
KafkaStreams streams = new KafkaStreams(builder.build(), config);
streams.start();
}
}
How it works:
flatMapValuessplits each input line into lowercase wordsgroupBygroups by word, discarding the message keycountmaintains a running count per word in a state storetoStream.towrites results to the output topic
Scaling behavior: each partition processes words independently. The word “kafka” appearing in partitions 0 and 1 produces two separate counts that must be aggregated downstream if you need a global count. For true global word count, either use a single partition or aggregate in a subsequent step.
Fault tolerance: Kafka Streams checkpointing persists state to Kafka topics. If a stream processor crashes, it resumes from the last checkpointed position without data loss.
Topic-Specific Deep Dives
Advanced Partition Sizing
The number of partitions determines parallelism for both producers and consumers. Getting it right at topic creation matters since partition count is immutable.
Partition count sizing
Factors that drive partition count:
| Factor | Impact | Consideration |
|---|---|---|
| Desired consumer parallelism | More partitions means more concurrent consumers | One partition per consumer in a group |
| Producer throughput target | Each partition handles about 10 MB/s | Partitions should be >= target_MB_s / 10 |
| Maximum broker scale | Kafka performance degrades past roughly 4000 partitions per broker | Consider broker count when sizing |
| Key cardinality | High-cardinality keys distribute evenly | Avoid partitions that become hot spots |
Sizing formula:
def calculate_partition_count(
target_throughput_mbps: float,
max_consumer_parallelism: int,
num_brokers: int,
replication_factor: int = 3
) -> dict:
"""
Calculate recommended partition count for a topic.
"""
# Throughput-based: each partition handles ~10 MB/s reliably
partitions_for_throughput = math.ceil(target_throughput_mbps / 10)
# Consumer-based: need at least as many partitions as consumers
partitions_for_consumers = max_consumer_parallelism
# Broker-based: avoid too many partitions per broker
# Guideline: < 4000 partitions per broker for good performance
max_partitions_per_broker = 4000
partitions_for_broker_capacity = num_brokers * max_partitions_per_broker
recommended = max(partitions_for_throughput, partitions_for_consumers)
recommended = min(recommended, partitions_for_broker_capacity)
# Account for replication: total partitions including replicas
total_partitions_in_cluster = recommended * replication_factor
return {
'recommended_partitions': recommended,
'based_on': 'throughput' if partitions_for_throughput >= partitions_for_consumers else 'consumer_parallelism',
'throughput_achievable_mbps': recommended * 10,
'max_consumer_parallelism': recommended,
'total_partition_slots_used': total_partitions_in_cluster,
'partitions_per_broker': recommended / num_brokers
}
# Example: 100 MB/s target, 30 consumers, 6 brokers
result = calculate_partition_count(100, 30, 6)
# Returns: recommended=30 (based on consumer count),
# throughput_achievable=300 MB/s,
# partitions_per_broker=5
Practical guidelines:
- Start conservative: 6-12 partitions for most topics
- Increase when consumer lag appears and more consumers cannot be added
- Monitor per-partition throughput to detect hot spots
- When repartitioning is needed, create a new topic and migrate data with a consumer that reads from both
Producer and consumer implications:
- More partitions means more producer connections and higher client-side memory
- More partitions means longer leader election time after broker failure
- Consumers with many partitions need more heap for offset tracking
Kafka use cases
Kafka excels at specific workloads:
Event streaming
Capture events from many sources and distribute to multiple consumers:
Clickstream -> Kafka -> Analytics
-> Personalization
-> Fraud Detection
-> Audit Log
Message bus replacement
Replace traditional message queues with Kafka for better throughput and replay capability.
Data integration
Connect different systems without point-to-point coupling:
Database CDC -> Kafka -> Search Index
-> Cache
-> Data Warehouse
-> ML Pipeline
Change Data Capture from databases fits this pattern well. Any system can consume the stream without touching the source database.
Kafka vs traditional queues
| Aspect | Kafka | Traditional Queue |
|---|---|---|
| Retention | Days/weeks/forever | Until consumed |
| Replay | Yes | No (usually) |
| Ordering | Per partition | Per queue |
| Throughput | Very high | Moderate |
| Consumer groups | Independent per group | Shared or exclusive |
Kafka’s retention and replay make it unique. You can reprocess historical data if your processing logic changes. Traditional queues cannot do this.
For broader event-driven patterns, see our post on event-driven architecture. For pub/sub patterns that overlap with Kafka’s topic model, see pub/sub patterns.
Trade-off Analysis
When designing Kafka-based systems, understanding trade-offs helps you make informed decisions.
Throughput vs Durability
| Approach | Throughput | Durability | Latency | Complexity |
|---|---|---|---|---|
acks=1 (leader only) | Highest | Low | Lowest | Lowest |
acks=all + min.isr=2 | Medium | High | Medium | Medium |
acks=all + min.isr=3 | Lower | Highest | Higher | Medium |
| Exactly-once mode | Lowest | Highest | Highest | Highest |
Partition Count vs Overhead
| Partitions | Max Consumers | Memory Overhead | Leader Election Time |
|---|---|---|---|
| 6-12 | 6-12 | Low | Fast (< 100ms) |
| 50-100 | 50-100 | Medium | Medium (1-2s) |
| 500+ | 500+ | High | Slow (10+s) |
| 4000+/broker | Depends | Very high | Very slow |
Retention vs Storage Cost
| Retention | Storage Multiplier | Replay Window | Best For |
|---|---|---|---|
| 1 hour | 1x baseline | Limited | Real-time only |
| 7 days | 5-7x baseline | One week | Most applications |
| 30 days | 20-30x baseline | One month | Regulatory compliance |
| Indefinite | Variable | Full history | Event sourcing, audit logs |
Consumer Scaling Constraints
| Scenario | Limiting Factor | Solution |
|---|---|---|
| Lag growing, spare partitions | Consumer count < partitions | Add consumers up to partition count |
| Lag growing, no spare partitions | Partition count limits parallelism | Increase partitions (recreate topic) |
| Lag growing, many consumers but still behind | Processing bottleneck | Optimize logic, scale horizontally with more partitions |
| Rebalances causing throughput drops | Frequent consumer restarts | Use sticky/cooperative sticky assignor |
Exactly-once vs At-least-once Decision Matrix
| Factor | Use At-least-once | Use Exactly-once |
|---|---|---|
| Duplicate processing cost | Low (metrics, analytics) | High (financial, inventory) |
| Output system support | Any | Must support idempotent writes |
| Throughput requirement | High | Moderate to high |
| Operational complexity | Lower | Higher |
| End-to-end latency tolerance | Low | Higher |
Production Failure Scenarios
Understanding how Kafka fails in production helps you design more resilient systems.
Scenario 1: Broker network partition
When a broker loses network connectivity but doesn’t crash:
Broker-2 becomes unreachable
→ Partition 0 leader (on Broker-2) stops responding
→ Followers on Broker-1 and Broker-3 detect leader timeout
→ Kafka controller triggers leader election
→ ISR shrinks to exclude unreachable broker
→ min.insync.replicas check: if remaining < min.insync.replicas, writes fail
→ Producer receives NotEnoughReplicasException
Impact: Temporary unavailability of the partition. If unclean.leader.election=true and no ISR available, potential message loss.
Scenario 2: Zombie consumer problem
Consumer crashes but doesn’t leave group gracefully:
Consumer-1 processes messages and writes to database
Consumer-1 crashes AFTER database write but BEFORE offset commit
→ Consumer-1's partition sits unassigned
→ No consumer processes those messages (lag grows)
→ After session.timeout expires, Kafka triggers rebalance
→ New consumer picks up partition, replays same messages
→ Without idempotent processing: duplicate writes occur
The fix: idempotent consumers handle this gracefully. If you cannot make processing idempotent, use exactly-once semantics.
Scenario 3: Schema Registry mismatch
Producer and consumer evolve schemas independently:
Producer v1: { "userId": "string", "action": "string" }
Producer v2: { "userId": "string", "action": "string", "metadata": { "source": "string" } }
Consumer still expects v1 schema
→ Deserialization fails
→ Message goes to DLQ (if configured) or consumer crashes
→ Backlog of unprocessed messages accumulates
The fix: use Schema Registry with compatibility checking. Never evolve schemas in incompatible ways.
Scenario 4: Clock skew in clustered deployment
Brokers on different machines have clock skew:
Broker-1 clock: 1000
Broker-2 clock: 990 (10 seconds behind)
Broker-3 clock: 1010 (10 seconds ahead)
Leader write at timestamp 1005 (Broker-1 time)
→ Broker-2 sees this as future timestamp (1005 > 990)
→ Log segment index corrupted on Broker-2 replica
→ During leader election, Broker-2's replica deemed invalid
→ ISR shrinks, potential data loss if Broker-1 goes down
The fix: use NTP synchronization across all brokers. Monitor clock drift with automated alerts.
Scenario 5: Partition reassignment during peak load
Operations team rebalances partitions while producers are at full throughput:
Original: Broker-1 has partitions 0,1,2; Broker-2 has 3,4,5
Reassignment triggered:
→ New replicas start copying from leaders (network spike)
→ ISR temporarily expands (old + new replicas receiving)
→ Controller overwhelmed by metadata updates
→ Request latency P99 spikes to 10+ seconds
→ Producer buffers fill, send() blocks
→ Consumer lag grows as brokers struggle to keep up
The fix: schedule reassignments during low-traffic windows. Use Cruise Control for automated, throttled reassignment.
| Failure | Impact | Mitigation |
|---|---|---|
| Broker goes down | Partition leader election; temporary unavailability | Configure replication factor of 3; use ISR configuration |
| Controller failure | Cluster-wide coordination pause | Run multiple brokers; use ZooKeeper/KRaft for controller election |
| Network partition | Followers fall out of ISR; potential data loss | Monitor ISR size; alert when replicas fall behind |
| Producer retry storm | Duplicate messages after transient failures | Enable idempotent producer; design idempotent consumers |
| Consumer rebalance storm | Throughput drops during rebalancing | Use sticky partition assignment; avoid frequent consumer restarts |
| Offset commit failure | Messages reprocessed or skipped | Use transactional producers with exactly-once semantics when needed |
| Partition imbalance | Some brokers overloaded while others idle | Monitor partition distribution; use Cruise Control for rebalancing |
| Data loss on leader change | Under-replicated partitions lose messages | Ensure min.insync.replicas >= 2; acks=all on producers |
Common Pitfalls / Anti-Patterns
Pitfall 1: too many partitions
Each partition increases Kafka’s overhead (file handles, memory, leader elections). Creating thousands of partitions when you only need dozens causes unnecessary complexity. Start with fewer partitions and increase only when needed.
Not planning for partition count
Partition count is fixed at topic creation. If you need more later, you must recreate the topic. Plan partition count based on expected throughput and consumer parallelism requirements.
Pitfall 2: ignoring consumer lag
If consumers cannot keep up with producers, lag grows indefinitely. Monitor lag continuously and scale consumers or optimize processing logic.
Pitfall 3: not using compression
Uncompressed messages waste network bandwidth and storage. Enable compression (LZ4, ZSTD, or Snappy) on producers.
Pitfall 4: auto.offset.reset = earliest without understanding consequences
This setting replays all messages from the beginning of the log. In production with large retention, this can overwhelm consumers. Use it deliberately, not as a default.
Pitfall 5: sending sensitive data unencrypted
Kafka does not encrypt messages by default. If you send PII, credentials, or sensitive data, enable encryption or do not send such data through Kafka.
Interview Questions
Expected answer points:
- Kafka is a distributed streaming platform built around the concept of a durable log, not a queue
- Messages in Kafka topics are retained for a configurable period (hours, days, indefinitely) rather than being deleted upon consumption
- This retention enables replay capability: consumers can re-read historical data from any point in the log
- Traditional queues typically delete messages after consumption and do not support replay
- Kafka's topic model supports multiple independent consumer groups, each reading the same data independently
Expected answer points:
- Within a partition, messages have monotonically increasing offsets that define total order
- Producers can specify a message key; Kafka hashes the key to determine partition assignment
- All messages with the same key go to the same partition, guaranteeing order per key
- Consumers read messages in offset order from their assigned partitions
- Cross-partition ordering is not guaranteed; only per-partition ordering is maintained
Expected answer points:
- At-most-once: consumer commits offsets before processing; messages may be lost but never duplicated (lowest latency)
- At-least-once (default): consumer processes messages then commits offsets; messages may be duplicated but never lost
- Exactly-once: uses Kafka transactions to atomically commit offsets and output writes; no duplicates, no loss (highest latency)
- The choice depends on use case: at-least-once with idempotent consumers handles most scenarios well
- Exactly-once should only be used when duplicate processing has serious consequences and the cost is acceptable
Expected answer points:
- A consumer group is a set of consumers cooperating to consume messages from a topic
- Each partition is delivered to exactly one consumer within a group (ensuring parallelism)
- Different consumer groups each receive all messages independently
- When a consumer joins or leaves, Kafka rebalances partition assignments across the group
- Rebalancing temporarily pauses consumption; frequent rebalances hurt throughput
Expected answer points:
- ISR are replicas that have fully caught up with the partition leader
- Only ISR members can become leader after a failure
- The min.insync.replicas setting determines the minimum ISR size for acknowledging writes
- With replication.factor=3 and min.insync.replicas=2, writes acknowledge when at least 2 replicas persist the message
- If all in-sync replicas fall behind or fail, the partition becomes unavailable for writes
Expected answer points:
- Dead Letter Queues (DLQs) catch messages that exhaust retries
- Configure a DefaultErrorHandler with FixedBackOff for retry behavior (e.g., 3 retries, 1 second apart)
- DLQ preserves original topic, partition, and key for debugging
- Monitor DLQ depth — messages piling up indicate upstream issues
- DLQ messages can be reprocessed after fixing the underlying issue
Expected answer points:
- replication.factor = 3 (or higher) for critical topics to ensure redundancy
- min.insync.replicas = 2 (requires acks=all) so writes persist to multiple replicas
- acks = all (wait for all ISR to acknowledge before confirming write)
- retention.ms configured based on replay requirements (hours, days, or indefinite)
- Enable producer idempotency to prevent duplicates during retries
Expected answer points:
- Partition count determines maximum consumer parallelism (one consumer per partition per group)
- Each partition handles roughly 10 MB/s throughput; size partitions accordingly
- Kafka performance degrades past ~4000 partitions per broker
- Partition count is immutable after topic creation; plan ahead
- More partitions means higher overhead (file handles, memory, leader election time)
Expected answer points:
- ZooKeeper: traditional metadata management (partition leadership, ISR, consumer offsets, ACLs)
- KRaft: Kafka's built-in consensus protocol (Kafka 3.3+) removes ZooKeeper dependency
- KRaft scales better — ZooKeeper struggles with millions of znodes (common with many consumer groups)
- KRaft enables simpler cluster setup and faster controller election
- Kafka 3.3+ supports live migration from ZooKeeper to KRaft without downtime
Expected answer points:
- Consumer-side: limit max.poll.records and configure fetch.min.bytes/fetch.max.wait.ms
- Producer-side: buffer.memory and batch.size create natural backpressure when broker is slow
- Monitor consumer lag — growing lag signals producers outpacing consumers
- Watch for under-replicated partitions, producer send() blocking, and request timeout exceptions
- Scale consumers horizontally (if partitions allow) or optimize processing logic
- Scale brokers if broker throughput is the bottleneck
Expected answer points:
- Log compaction retains the latest message for each key within a partition, discarding older messages with the same key
- Unlike time-based retention which deletes messages after a period, compaction keeps the most recent value for each key indefinitely
- Use cases: maintaining a lookup table or changelog where only the latest state matters (e.g., customer profile updates)
- Enables Kafka as a key-value store or database for event sourcing where you need the current state, not full history
- Compaction runs in the background and does not block normal writes
Expected answer points:
- Schema Registry stores and validates message schemas (Avro or JSON Schema) separately from Kafka
- Producers and consumers register and retrieve schemas by subject name, enabling schema validation at publish time
- Schema compatibility modes control evolution: BACKWARD, FORWARD, FULL, NONE
- BACKWARD compatibility allows consumers reading new data to work with old schemas (most common)
- Without Schema Registry, incompatible schema changes cause deserialization errors or silent data corruption
- Schema Registry also compresses message payloads by storing schema IDs instead of full schemas in each message
Expected answer points:
- Kafka Connect is a framework for scalably and reliably streaming data between Kafka and external systems
- Connectors are plugins that define how to interact with source systems (databases, S3, JDBC) or sink systems
- Workers are the processes that execute connectors; they can run in standalone or distributed mode
- Converters handle serialization/deserialization of record keys and values (Avro, JSON, Parquet)
- Transforms are optional lightweight modifications to records (e.g., filtering, adding fields)
- Offset storage: Connect manages offset tracking internally, typically in a Kafka topic
Expected answer points:
- The Transaction Coordinator is a Kafka broker component that manages the two-phase commit protocol for transactions
- Phase 1 (prepare): producer sends commit request; coordinator writes a prepare marker to all involved partitions
- Phase 2 (commit): coordinator writes a commit marker atomically across all partitions
- If the producer crashes before commit, the coordinator detects and aborts the transaction on restart
- Consumer uses the transaction marker to filter out aborted transactions via isolation.level=read_committed
- The transactional.id ensures exactly-once semantics even across producer restarts
Expected answer points:
- Range assignor: assigns partitions contiguously per topic; can cause imbalance if topic partition counts differ
- RoundRobin assignor: distributes partitions evenly across consumers regardless of topic; better balance
- Sticky assignor: minimizes partition movement during rebalances, reducing reprocessing overhead
- Cooperative sticky assignor: allows incremental rebalancing without full stop-the-world pauses
- Choice affects rebalance duration, message ordering during rebalance, and consumer CPU utilization
- For stateful consumers, sticky assignment prevents costly state migration
Expected answer points:
- Adding consumers only helps if there are spare partitions; max parallelism equals partition count
- Increasing partitions increases parallelism but is permanent and costly: more file handles, memory overhead, longer leader election
- More partitions also increases end-to-end latency (more replication, more producer batching decisions)
- Rebalancing with many partitions takes longer, temporarily impacting throughput
- Rule of thumb: target 10-100 MB/s per partition throughput, start conservative (6-12 partitions)
- If lag is growing and no spare partitions, you must add partitions (requires topic recreation or new topic)
Expected answer points:
- Consumer lag is the difference between the latest offset and the consumer's committed offset
- Kafka exposes lag per partition via KafkaConsumer.metrics() or tools like kafka-consumer-groups
- JMX metrics: ConsumerLag and FetchManager metrics in the kafka.consumer group
- Monitoring tools: Confluent Control Center, Kafka Manager, Prometheus with JMX exporter, Datadog
- Acceptable threshold depends on SLA: for 5-minute SLA, lag should stay under 5 minutes
- Growing lag signals producers outpacing consumers; investigate processing bottlenecks or scale consumers
Expected answer points:
- Retention policy determines how long messages are kept before being eligible for deletion
- Configured via retention.ms (time-based) or retention.bytes (size-based) per topic
- Choose based on use case: event sourcing needs long retention for replay, real-time analytics may need shorter
- Consider downstream consumers needing to reprocess historical data if processing logic changes
- Longer retention increases storage costs; balance between replay window and cost
- For compliance or audit requirements, retention may need to be days or weeks
Expected answer points:
- Partition replication: each partition has a leader and ISR followers on different brokers
- If a broker fails, Kafka automatically elects a new leader from ISR members
- min.insync.replicas ensures writes persist to multiple replicas before acknowledgment
- Controller broker manages partition leadership and cluster coordination (backed by ZooKeeper or KRaft)
- unclean.leader.election setting controls behavior when no ISR is available: can lose messages if set to true
- racks configuration allows placing replicas in different physical locations for rack-aware failure handling
Expected answer points:
- Kafka uses partition-centric model with consumer groups; Pulsar uses subscription types (exclusive, failover, shared, key-shared)
- Pulsar separates storage into tiered storage (BookKeeper for real-time, Apache Tiered Storage for historical)
- RabbitMQ is a traditional broker with exchanges and queues, not a log-based system; no native replay
- Pulsar supports geo-replication out-of-the-box; Kafka requires MirrorMaker
- Kafka has a larger ecosystem and community; Pulsar offers better multi-tenancy and geo-replication
- For pure message queuing with acknowledgments, RabbitMQ excels; for high-throughput event streaming with replay, Kafka excels
Further Reading
- Apache Kafka Documentation - Official Kafka docs covering all core concepts and configuration
- Kafka Streams Documentation - Official guide for building stream processing applications
- Confluent Blog - In-depth Kafka articles, best practices, and case studies
- Kafka Improvement Proposals (KIPs) - Design documents for upcoming Kafka features
- Kafka: The Definitive Guide - Comprehensive book on Kafka architecture and usage
- Designing Event-Driven Systems - Free book covering event-driven architecture patterns with Kafka
Conclusion
Key points
- Kafka is a distributed log, not a queue; messages are retained and can be replayed
- Topics divide into partitions for parallelism; same key goes to same partition
- Consumer groups enable independent consumption by different services
- At-least-once delivery is the default; exactly-once requires Kafka transactions
- Rebalancing happens when consumers join or leave; frequent rebalances hurt throughput
- ZooKeeper (or KRaft in newer versions) manages cluster metadata
Configuration Essentials
Core Setup
- Partition count planned for target throughput and consumer parallelism
- Replication factor set to 3 for critical topics
- min.insync.replicas configured to 2 (requires acks=all)
- Retention policy set based on replay requirements
Producer Configuration
- Idempotent producer enabled
- Compression enabled (LZ4 or ZSTD recommended)
- acks=all for critical data
- retries configured with appropriate backoff
Consumer Configuration
- Consumer group offset reset policy defined
- max.poll.records tuned for processing capacity
- session.timeout appropriate for your use case
- Partition assignment strategy selected (sticky or cooperative sticky for production)
Operations and Monitoring
Observability checklist
Metrics to monitor
- Consumer lag: difference between latest offset and consumer position (critical for SLAs)
- Under-replicated partitions: partitions without full replication
- ISR size: In-Sync Replicas count per partition
- Message throughput: messages and bytes per second per topic
- Request latency: P99 producer and consumer request latencies
- Disk usage: broker disk utilization and growth rate
- Consumer group status: active members and partition assignments
- Controller status: leader elections and controller changes
Logs to capture
- Broker startup and shutdown events
- Partition leader election events
- Consumer group rebalancing events
- Producer acknowledgment failures and retries
- Controller changes and election events
- Under-replicated partition events
- Disk space warnings
Alerts to configure
- Consumer lag exceeds SLA threshold (for example, more than 5 minutes behind)
- Under-replicated partitions greater than 0
- Broker disk usage above 80%
- Producer error rate above 1%
- Consumer group has no active members
- Controller is unavailable
- Messages per second exceeds capacity threshold
- Request latency P99 exceeds threshold
ZooKeeper and KRaft storage requirements
Kafka needs somewhere to store cluster metadata — ZooKeeper traditionally, KRaft in Kafka 3.3+.
What is stored:
- Partition leadership and ISR membership
- Consumer group offsets and membership
- Access control lists (ACLs)
- Topic configurations
- Delegation token information
ZooKeeper/KRaft storage ≈
partitions × (leadership + ISR state)
+ consumer_groups × (member offsets + metadata)
+ topics × (configs + ACLs)
+ delegation_tokens
Typical storage needs:
| Cluster Size | Topics | Partitions | Consumer Groups | ZooKeeper/KRaft Storage |
|---|---|---|---|---|
| Small (3 brokers) | 50 | 200 | 30 | 50-200 MB |
| Medium (6 brokers) | 200 | 1,000 | 150 | 200-500 MB |
| Large (12 brokers) | 500 | 5,000 | 500 | 1-3 GB |
| Very large (24+ brokers) | 1,000+ | 20,000+ | 2,000+ | 5-10 GB |
Storage is not the real bottleneck. Znode count is. ZooKeeper falls over when there are millions of child znodes — which happens when you have many consumer group members or partition replicas. KRaft mode removes ZooKeeper dependency and scales better.
If you are still on ZooKeeper, migrate to KRaft. Kafka 3.3+ supports live migration — no downtime needed.
Deployment Readiness
Pre-deployment checklist
- [ ] Replication factor set to 3 for critical topics
- [ ] min.insync.replicas configured appropriately
- [ ] Consumer lag monitoring configured and alerts set
- [ ] Producer retries and idempotency configured
- [ ] Compression enabled on producers
- [ ] Schema Registry deployed for schema validation
- [ ] ACLs configured for topic access control
- [ ] TLS/SSL encryption enabled for all connections
- [ ] Partition count planned based on throughput requirements
- [ ] Retention policy configured (hours, days, weeks)
- [ ] Dead letter topic configured for failed messages
- [ ] Consumer group offset reset policy defined
- [ ] Backup and disaster recovery plan documented
Security
- SASL/SCRAM or mTLS configured for authentication
- ACLs configured for topic access control
- TLS/SSL encryption enabled for all connections
- Sensitive data encryption configured
Security checklist
- Authentication: use SASL/PLAIN or SCRAM for client authentication; mTLS for certificate-based auth
- Authorization: implement ACLs to restrict topic access; principle of least privilege
- Encryption in transit: enable SSL/TLS for all broker and client connections
- Encryption at rest: use disk encryption or Kafka’s Secret API for sensitive data
- Schema validation: validate message schemas with Confluent Schema Registry
- Data sanitization: sanitize message keys and values to prevent injection
- Audit logging: enable Kafka’s audit logging for admin operations
- Network segmentation: place brokers in private networks; restrict inter-broker communication
Category
Related Posts
Exactly-Once Delivery: The Elusive Guarantee
Explore exactly-once semantics in distributed messaging - why it's hard, how Kafka and SQS approach it, and practical patterns for deduplication.
Ordering Guarantees in Distributed Messaging
Understand how message brokers provide ordering guarantees, from FIFO queues to causal ordering across partitions, and the trade-offs in distributed systems.
Dead Letter Queues: Handling Message Failures Gracefully
Design and implement Dead Letter Queues for reliable message processing. Learn DLQ patterns, retry strategies, monitoring, and recovery workflows.