etcd: Distributed Key-Value Store for Configuration

Deep dive into etcd architecture using Raft consensus, watches for reactive configuration, leader election patterns, and Kubernetes integration.

published: March 24, 2026 reading time: 33 min read author: GeekWorkBench updated: June 17, 2026

Quick Summary

etcd is the Raft-based key-value store that holds all Kubernetes cluster state—pods, services, the works. This post explains how etcd actually works: the replicated log that keeps all nodes in sync, watches that push updates to clients instead of making them poll, and leases that let you claim temporary ownership of keys. The operational reality is that etcd is extremely disk I/O sensitive (get NVMe SSDs or suffer the consequences), the Raft algorithm handles leader election and log replication so you don't have to, and understanding this machinery helps you actually debug what's happening when your cluster misbehaves. For anyone running Kubernetes, knowing etcd is mandatory since most control plane problems trace back to it.

etcd: Distributed Key-Value Store for Service Discovery and Configuration

etcd started at CoreOS as a coordination service built on the Raft consensus algorithm. Its job is storing configuration, detecting failures, and coordinating leader elections in distributed systems. Today etcd is best known as Kubernetes’ backing store — holding all cluster state: pods, services, deployments, the works.

The name follows the Unix convention of naming system services “etc” (for configuration) with a “d” for daemon. etcd was essentially meant to be the “etc” for distributed systems.

Core Architecture

etcd is a distributed, consistent, highly-available key-value store. It is not a general-purpose database — etcd handles specific things well:

Small amounts of critical data (configuration, locks, leader election)
Strong consistency — no eventual consistency surprises
Fast distributed coordination primitives

The architecture is built on Raft to maintain a replicated log. The Raft paper is titled “In Search of an Understandable Consensus Algorithm”, and etcd’s implementation is widely considered the reference implementation.

graph TD
    A[Client] -->|Read/Write| B[Leader]
    B -->|Replicate| C[Follower 1]
    B -->|Replicate| D[Follower 2]
    B -->|Replicate| E[Follower 3]

    C -.->|Heartbeat| B
    D -.->|Heartbeat| B
    E -.->|Heartbeat| B

    B --> F[(Raft Log)]
    C --> G[(Raft Log)]
    D --> H[(Raft Log)]
    E --> I[(Raft Log)]

In a Raft cluster, one node is the leader and handles all writes. Writes go through the leader’s replicated log, committed to a majority of nodes before durability. Reads can go to any node, but followers might return stale data unless you request linearizable reads.

The Raft Consensus Algorithm

Raft achieves consensus by electing a leader and replicating a log of commands. The algorithm has three parts:

Leader Election:

Nodes start as followers
Followers become candidates if they miss leader communication within the election timeout
Candidates request votes from other nodes
Majority votes = new leader

Log Replication:

Leader accepts commands from clients
Appends to local log
Replicates to followers via AppendEntries messages
Command on majority of logs = applied to state machine, client notified

Safety:

Only servers with committed entries can become leader
Committed entries survive minority node failures

// Simplified Raft write flow in etcd
func (s *Server) Propose(ctx context.Context, cmd []byte) error {
    // 1. Submit to Raft log
    ch := make(chan applyFuture)
    s.node.Propose(ctx, cmd, ch)

    // 2. Wait for commit
    result := <-ch
    if result.Error() != nil {
        return result.Error()
    }

    // 3. Apply to state machine
    return s.apply(cmd)
}

The key insight: as long as a majority agrees on log contents, the cluster is consistent. Leader crashes = new leader with most complete log elected.

Data Model and API

etcd stores data in a flat key-value namespace with a hierarchical directory structure (like a filesystem). Keys are strings, values can be arbitrary bytes.

# Store a simple value
etcdctl put /services/api/server1 "10.0.0.1:8080"

# Retrieve the value
etcdctl get /services/api/server1

# Store JSON configuration
etcdctl put /config/database '{"host": "db.example.com", "port": 5432}'

# List all keys under a prefix
etcdctl get --prefix /services/

The etcd API follows the gRPC protocol, which you can call directly or via client libraries:

import "go.etcd.io/etcd/client/v3"

cli, _ := client.NewFromURLs([]string{"http://localhost:2379"})
defer cli.Close()

// Write a value
cli.Put(context.Background(), "/config/featureflags", `{"newUI": true}`)

// Read a value
resp, _ := cli.Get(context.Background(), "/config/featureflags")
fmt.Println(string(resp.Kvs[0].Value))

// Atomic compare-and-swap (requires current value)
cli.Txn(context.Background()).If(
    clientv3.Compare(clientv3.Value("/config/featureflags"), "=", `{"newUI": false}`),
).Then(
    clientv3.OpPut("/config/featureflags", `{"newUI": true}`),
).Else(
    clientv3.OpPut("/config/featureflags", `{"newUI": false}`),
).Commit()

Transactions enable atomic compare-and-swap operations, essential for distributed locks and leader election.

Watches and Reactive Configuration

Native watch support is etcd’s most powerful feature. Subscribe to key changes instead of polling, and get notified immediately when data changes.

// Watch a single key
watchChan := cli.Watch(context.Background(), "/services/api/")
for resp := range watchChan {
    for _, event := range resp.Events {
        fmt.Printf("%s %q: %q\n", event.Type, event.Kv.Key, event.Kv.Value)
    }
}

// Watch an entire prefix recursively
watchChan := cli.Watch(context.Background(), "/services/", clientv3.WithPrefix())

Watches in etcd are cheap and scalable because they are true subscriptions, not polling. The etcd server tracks which clients watch which keys and pushes updates directly.

sequenceDiagram
    participant K as Kubernetes Controller
    participant E as etcd Server
    participant W as Watch Stream

    K->>E: Watch /pods/
    E->>W: Create subscription
    W-->>K: Initial state

    Note over E: Pod created
    E->>W: Push new event
    W-->>K: Pod event notification
    K->>K: Reconcile pod state

This watch mechanism lets Kubernetes controllers react immediately to cluster changes without polling.

Security Configuration

Security is not bolted on after the fact. etcd ships with a layered model that handles authentication, authorization, and encrypted transport. In production, the two areas that matter most are RBAC for authorization, and mutual TLS for encrypting traffic between cluster members and clients. The sections below walk through both so you can harden a production etcd deployment.

Authentication and RBAC

etcd ships with built-in role-based access control. Before enabling auth, you create users and bind them to roles that permit specific key prefixes.

# Create a user with password authentication
etcdctl user add root
# Add user to the root role (full access)
etcdctl user grant-role root root

# Enable authentication
etcdctl auth enable

# Now all requests require credentials
etcdctl --user root:password put /services/api/server1 "10.0.0.1:8080"

Built-in roles:

Role	Permissions
`root`	Full read/write/admin on all keys
`guest`	Read on all keys (default role)
`etcd-cluster`	Read/write on cluster communication keys

Custom roles for least privilege:

# Create a role scoped to a key prefix
etcdctl role add app-config
etcdctl role grant-permission app-config readwrite /config/app/

# Create a user and assign the role
etcdctl user add app-service
etcdctl user grant-role app-service app-config

Without RBAC, any client that can reach etcd port 2379 has full cluster access. In Kubernetes, the etcd RBAC layer is typically bypassed because the API server handles authentication — but for standalone etcd deployments, RBAC is the first line of defense.

TLS/mTLS Setup for Cluster Communication

etcd supports encrypted communication between members and clients via mutual TLS. Each member and client presents a certificate; both sides verify the other’s identity.

Certificate requirements:

Each etcd member needs a server certificate with its hostname/IP as Subject Alternative Names (SANs)
Each client needs a client certificate for authentication
The CA certificate must be trusted by all members and clients

# Generate certificates using cfssl or openssl
# Example using openssl - simplified

# 1. Create the CA
openssl genrsa -out ca-key.pem 2048
openssl req -x509 -new -nodes -key ca-key.pem -days 365 -out ca-crt.pem \
  -subj "/CN=etcd-ca"

# 2. Create member certificate with SANs for all cluster IPs
openssl genrsa -out member-key.pem 2048
# SANs: etcd-1, etcd-2, etcd-3, 192.168.1.1, 192.168.1.2, 192.168.1.3
openssl req -new -key member-key.pem -out member.csr \
  -subj "/CN=etcd-member"
# Sign with CA...

# 3. Client certificates (no SAN required)
openssl genrsa -out client-key.pem 2048
openssl req -new -key client-key.pem -out client.csr \
  -subj "/CN=etcd-client"
# Sign with CA...

Cluster member flags:

# On each member, specify certificates
etcd \
  --cert-file=/etc/etcd/ssl/server.crt \
  --key-file=/etc/etcd/ssl/server.key \
  --trusted-ca-file=/etc/etcd/ssl/ca.crt \
  --peer-cert-file=/etc/etcd/ssl/peer.crt \
  --peer-key-file=/etc/etcd/ssl/peer.key \
  --peer-trusted-ca-file=/etc/etcd/ssl/ca.crt \
  --initial-cluster etcd-1=https://192.168.1.1:2380,etcd-2=https://192.168.1.2:2380 \
  --listen-peer-urls https://0.0.0.0:2380 \
  --listen-client-urls https://0.0.0.0:2379

Client connection with TLS:

# Connect a client with certificates
etcdctl --cacert=/etc/etcd/ssl/ca.crt \
  --cert=/etc/etcd/ssl/client.crt \
  --key=/etc/etcd/ssl/client.key \
  endpoint health

Without TLS in production, network-level attackers can read all cluster state including Kubernetes secrets. TLS also prevents man-in-the-middle attacks that could inject false configuration into your cluster.

Operational Considerations

Production etcd means dealing with edge cases that happy-path docs skip. Leases interact badly with crashes and network partitions — they can silently delete your keys. Multi-tenant environments have no native namespace isolation, so prefix conventions and RBAC roles are the only isolation tool available. And if you are not monitoring the right metrics, you will not know something is wrong until the cluster stops accepting writes. The three subsections below cover these operational realities.

Lease Revocation Edge Cases

etcd leases attach TTLs to keys. When a lease expires, all keys attached to it are deleted atomically. The edge case is what happens when the lease holder crashes before renewal.

Grace period behavior:

// When a lease holder crashes:
// 1. The lease continues running on etcd server
// 2. If no keepalive arrives before --lease-grant-timeout, etcd revokes the lease
// 3. All keys attached to the lease are deleted
// 4. Watchers on those keys receive delete events

// The grace period is approximately: (lease TTL) + (leader election timeout)
// If leader is down, lease cannot be renewed until new leader is elected

Practical implications:

Scenario	Lease Behavior
Holder crashes, lease TTL = 10s	Keys deleted ~10s after crash (plus network delay)
Leader fails during grace period	New leader elected, old holder’s lease revoked, keys deleted
Network partition (holder isolated)	Partition-side holder cannot renew; majority-side lease revoked after TTL expires
Holder JVM pauses (GC)	If pause exceeds lease TTL, keys deleted; use longer TTL + reliable keepalive

Recommendation: set lease TTL at 2-5x your expected renewal interval, and implement retry logic with exponential backoff on the client side.

Namespace Isolation for Multi-Tenant Environments

etcd has no native multi-tenant namespace abstraction. Isolation is achieved through key prefix conventions and RBAC roles.

Prefix-based tenancy:

# Tenant A's keys
etcdctl role add tenant-a
etcdctl role grant-permission tenant-a readwrite /tenants/a/

# Tenant B's keys
etcdctl role add tenant-b
etcdctl role grant-permission tenant-b readwrite /tenants/b/

# Each tenant's application only gets credentials granting access to their prefix
# Kubernetes uses this pattern: /registry/...

For Kubernetes itself, the convention is:

# Kubernetes state lives under these prefixes
/registry/apiregistration.k8s.io/apiservices/
/registry/services/endpoints/
/registry/pods/
/registry/secrets/
/registry/configmaps/

Multi-tenant considerations:

Concern	Mitigation
Tenant key sprawl	Enforce naming conventions; monitor key count per prefix
Resource contention	Separate etcd clusters per tenant at scale
Quota limits	etcd has no per-key-prefix quota; implement application-level checks
RBAC complexity	Use role inheritance where possible; automate role provisioning

At scale (hundreds of tenants), running a dedicated etcd cluster per tenant or using a managed service like etcd Operator becomes more practical than prefix-based isolation on a single cluster.

Key Monitoring Metrics

Monitor these metrics to catch etcd problems before they affect your cluster:

Critical metrics:

# Using Prometheus with etcd metrics endpoint (--listen-metrics-urls)
# Metrics endpoint: http://localhost:2381/metrics

# Commit index lag - how far behind followers are from leader
etcd_server_leader_changes_seen_total
etcd_server_commits_total
# For lag: compare follower applied_index vs leader committed_index

# WAL fsync latency - disk performance indicator
etcd_wal_fsync_duration_seconds_bucket
# P99 should be < 10ms; > 100ms indicates disk I/O problems

# Snapshot size
etcd_server_snapshot_apply_in_progress_total
# Snapshot size growing rapidly indicates write burst or missed snapshots

Prometheus alerting rules:

groups:
  - name: etcd
    rules:
      # Alert if WAL fsync P99 > 500ms
      - alert: EtcdWalFsyncLatency
        expr: histogram_quantile(0.99, rate(etcd_wal_fsync_duration_seconds_bucket[5m])) > 0.5
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "etcd WAL fsync latency is critical"

      # Alert if follower commit index lag is growing
      - alert: EtcdCommitIndexLag
        expr: etcd_server_leader_commits_index - etcd_server_apply_commits_index > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "etcd follower is lagging behind leader"

      # Alert if snapshot restore is in progress (node recovering)
      - alert: EtcdSnapshotRestore
        expr: etcd_server_snapshot_apply_in_progress_total == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "etcd is currently restoring a snapshot"

Cluster Maintenance

etcd clusters do not maintain themselves. Rolling upgrades let you refresh cluster members without downtime, but the procedure has ordering requirements that trip people up. Disk I/O is the most common production bottleneck, and picking the wrong storage — NFS or a slow HDD — can cause leader election timeouts that cascade into cluster unavailability. The two subsections below cover upgrade procedures and disk requirements.

Rolling Upgrade Procedures

etcd supports rolling upgrades where one member is upgraded at a time with no cluster downtime.

Standard rolling upgrade steps:

# 1. Verify cluster health before starting
etcdctl endpoint health --cluster

# 2. Upgrade one member at a time
# Stop etcd on member, replace binary, restart
systemctl stop etcd

# 3. After restart, wait for member to catch up
# Check that the member joins the cluster
etcdctl member list

# 4. Verify the upgraded member is healthy
etcdctl endpoint health

# 5. Proceed to next member

Pre-upgrade checklist:

Review etcd release notes for any breaking changes between versions
Ensure your etcd data directory (—data-dir) has adequate free space (2x current DB size recommended)
Take a snapshot before starting: etcdctl snapshot save /backup/etcd-snap.db
Verify the new binary version is compatible with cluster API version (etcd --version)

Upgrade compatibility matrix:

Cluster Version	Supported Upgrades
3.4	3.4 -> 3.5 (same major version)
3.5	3.5 -> 3.6 (same major version)
3.6	3.6 -> 3.7 (if released)
Note	Major version jumps require full cluster restart (not rolling)

Never upgrade more than one minor version at a time. For example, going from 3.4 to 3.6 should be done as 3.4 -> 3.5 -> 3.6.

Disk I/O Requirements

etcd’s performance is fundamentally bound by disk I/O, particularly fsync latency on the WAL.

Requirements by environment:

Environment	Minimum	Recommended
Development/QA	HDD (7200 RPM) with ext4/xfs	SSD with NVMe
Production	SSD (SATA or NVMe)	NVMe SSD with >50K IOPS
Kubernetes Control Plane	NVMe SSD mandatory	NVMe SSD with <1ms fsync P99
Heavy write load	NVMe SSD with >100K IOPS and >1GB/s throughput	RAID-0 NVMe or dedicated NVMe device

Why SSD/NVMe is mandatory:

# etcd writes to WAL on every committed operation
# Each write must be fsynced before acknowledgment
# HDD fsync latency: 10-50ms (unacceptable)
# SSD fsync latency: 0.1-1ms (acceptable)
# NVMe fsync latency: 0.05-0.2ms (optimal)

# A 10-node cluster with 1 HDD
# fsync P99 = 30ms
# Max theoretical throughput = 1000ms / 30ms = ~33 writes/second
# At 1000 writes/second -> 30+ second commit latency -> leader election

Disk configuration best practices:

# Use --data-dir on dedicated disk or partition
# Do NOT share with application data
etcd --data-dir=/var/lib/etcd

# For NVMe, consider disabling disk I/O scheduler
echo "none" > /sys/block/nvme0n1/queue/scheduler

# Mount with noatime to avoid unnecessary writes
# /etc/fstab entry:
# /dev/nvme0n1 /var/lib/etcd ext4 defaults,noatime 0 2

# Monitor disk I/O
iostat -x 1
# Watch %util and avgqu-sz - if avgqu-sz > 4 for sustained periods, disk is saturated

Don’t use network-attached storage (NAS/NFS) for etcd data. Network latency adds directly to fsync latency and will cause leader election timeouts. Local NVMe or dedicated SSD is the only production-supported configuration.

Leader Election

etcd provides primitives for leader election. Create a dedicated directory and use transactions to ensure only one process leads at a time.

// Simplified leader election using etcd
func ElectLeader(cli *clientv3.Client, leaderName string) {
    // Try to acquire leadership by creating a key
    _, err := cli.Put(context.Background(), "/election/leader", leaderName,
        clientv3.WithLease(clientv3.NewLease(cli)))

    if err != nil {
        // Key exists, check current leader
        resp, _ := cli.Get(context.Background(), "/election/leader")
        fmt.Printf("Current leader: %s\n", resp.Kvs[0].Value)
        return
    }

    fmt.Printf("%s is now the leader!\n", leaderName)

    // Keep leadership by renewing the lease
    // If this process dies, the lease expires and leadership is released
}

The lease mechanism matters. Leader crashes, lease expires, another node takes over. This prevents stale leadership where a dead process appears to still be in charge.

Handling Failures and Recovery

etcd tolerates node failures gracefully:

Minority down: Cluster continues serving reads and writes
Leader down: New leader elected within seconds, brief write pause
Majority down: Cluster unavailable (cannot guarantee consistency without majority)

# Check cluster health
etcdctl endpoint health

# Check member list
etcdctl member list

# Remove a failed member
etcdctl member remove <member-id>

Recovering nodes must catch up on missed updates. etcd supports:

Snapshot recovery: Restore from a recent snapshot and replay WAL entries
Member migration: Replace failed node with a new one that learns from the leader

For production clusters, regular snapshots are essential. Without them, a recovering node might replay years of WAL entries.

WAL and Snapshot Mechanics

etcd uses a Write-Ahead Log (WAL) to ensure durability of committed operations before applying them to the state machine. Every write that passes Raft consensus gets written to the WAL before being applied.

WAL structure:

wal/
├── 0.index              # WAL metadata
├── 1.snap/              # Snapshot of state at index 1
│   ├── db               # BoltDB snapshot file
│   └── metadata
├── 2.wal/               # WAL for entries 2 onwards
│   ├── 0.index          # First WAL segment index
│   ├── 0000000000000000-0000000000000000.wal
│   └── 0000000000000001-0000000000000256.wal

Each WAL segment file contains record batches with:

Type: Entry (raft log entry), State (hard state like term/vote), etc.
Term: Raft term number
Index: Log entry index
Data: The actual write command

graph LR
    A[Client Write] --> B[Raft Propose]
    B --> C[Append to WAL]
    C --> D[Replicate to Followers]
    D --> E[Wait for Majority Ack]
    E --> F[Mark Committed in WAL]
    F --> G[Apply to State Machine]
    G --> H[Write Snapshot]

Snapshotting process:

When the WAL grows too large or during routine snapshots, etcd creates a snapshot:

etcd takes a BoltDB snapshot of the current state
WAL is truncated - old entries are discarded
New member joining can restore from snapshot + replay only recent WAL

# Manually trigger a snapshot
etcdctl snapshot save /tmp/etcd-snap.db

# Check snapshot status
etcdctl snapshot status /tmp/etcd-snap.db

# Restore from snapshot (for disaster recovery)
etcdctl snapshot restore /tmp/etcd-snap.db \
  --name etcd-1 \
  --initial-cluster etcd-1=http://192.168.1.1:2380 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-advertise-peer-urls http://192.168.1.1:2380

Why snapshots matter:

Without Snapshots	With Snapshots
New node must replay ALL WAL entries	New node replays only recent entries
Recovery time = years of entries	Recovery time = minutes of entries
Disk space unbounded	WAL bounded by snapshot interval
Memory pressure from WAL index	Clean, bounded WAL

Operational best practices:

Monitor etcd_server_snapshot_apply_in_progress_total metric
Set --snapshot-count to control how often snapshots are taken (default: 100,000)
Ensure adequate disk space for WAL + snapshots during heavy write periods

Performance Characteristics

etcd is not designed for high throughput on large datasets. It excels at:

Megabytes to gigabytes of critical configuration
Consistent latency operations (1-10ms typically)
Thousands of watches and frequent small updates

Rough numbers for a 3-node cluster on modern hardware:

Write latency: 1-5ms at 10k+ QPS
Read latency: sub-millisecond for cached reads
Watch scalability: 100k+ concurrent watches

These come from etcd team’s benchmarking. Real-world performance depends on network latency, disk I/O, and cluster size.

Kubernetes Integration

Kubernetes stores all cluster state in etcd. kubectl get pods? API server reads from etcd. Deploy a service? API server writes to etcd, controllers watching react.

graph TD
    A[kubectl] -->|HTTP| B[API Server]
    B -->|Read/Write| C[etcd]
    D[Controller] -->|Watch| B
    D -.->|Reconcile| E[Pod]
    B -.->|Notify| D

This tight coupling means etcd failures affect the entire Kubernetes control plane. If etcd is unavailable:

No new pods scheduled
Existing pods keep running (kubelets manage them independently)
Services cannot be created or modified
DNS might stop resolving new entries

Kubernetes deployments typically run etcd on dedicated nodes with redundant storage and monitoring.

Common Pitfalls / Anti-Patterns

When to Use etcd

Pick etcd when your distributed system needs processes to agree on shared state and writing your own consensus layer is not on the table. The question is: what happens during a split? Is it worse for two nodes to both think they hold a lock, or worse for the system to become temporarily unavailable? If availability over divergence is the answer, etcd is the right tool.

Primary use cases:

Use Case	Why etcd fits	Typical alternatives
Service discovery	Watches propagate endpoint changes in milliseconds	DNS + health checks, Consul
Leader election	Lease-based acquisition with automatic expiry on crash	ZooKeeper, Consul
Distributed locking	Compare-and-swap transactions prevent race conditions	Redis (Redlock), ZooKeeper
Feature flags / config	Atomic updates with watches eliminate polling	Config maps in Kubernetes, Consul
Cluster membership	Raft handles join/leave/recovery consistently	Static config files, Consul

Decision checklist — answer yes to most of these:

Do you need strong consistency, where every node sees the same data after any write?
Is your data size modest (MB to low GB), changing frequently and needing instant propagation?
Are you already running Kubernetes? etcd is the natural fit for cluster state.
Is avoiding a separate coordination service important, or do you need guarantees that service mesh tools do not provide?
Is network partition tolerance important, with explicit preference for unavailability over divergence?

Concrete examples from production systems:

Kubernetes uses etcd for all cluster state, which means the moment you deploy pods or services, you inherit proven coordination patterns. In a service mesh setup, etcd handles leader election for your control plane and distributes feature flags that components subscribe to via watches — configuration changes propagate without pod restarts.

For distributed locking, the pattern is: acquire a lease-attached key to claim the lock, renew it periodically while doing work, and let the lease expire if the holder crashes. This avoids the orphaned-lock problem that affects Redis-based locks without a TTL.

// Distributed lock pattern with etcd leases
func AcquireLock(cli *clientv3.Client, lockKey string, ttl int64) error {
    lease := clientv3.NewLease(cli)
    _, err := cli.Grant(context.Background(), ttl)
    if err != nil {
        return err
    }

    // Atomically write only if key doesn't exist
    _, err = cli.Txn(context.Background()).If(
        clientv3.Compare(clientv3.Version(lockKey), "=", 0),
    ).Then(
        clientv3.OpPut(lockKey, "locked", clientv3.WithLease(lease)),
    ).Commit()

    if err != nil {
        return fmt.Errorf("lock already held: %w", err)
    }

    // Keep lock alive for duration of work
    go func() {
        for {
            cli.KeepAlive(context.Background(), lease)
            time.Sleep(time.Duration(ttl/2) * time.Second)
        }
    }()
    return nil
}

When ZooKeeper or Consul might be better:

Consul is worth considering if you need multi-datacenter replication with built-in health checks and a DNS interface — the agent-based model reduces what your team has to operate. ZooKeeper fits naturally if Kafka or HBase are already in your stack, since those systems depend on it anyway. For everyone else on Kubernetes, or anyone who just needs a clean Raft implementation without extra dependencies, etcd is the straightforward choice.

When Not to Use etcd

etcd is not meant for general application data. It’s designed for metadata and coordination, so high-volume application data belongs in Redis or a proper database. High-throughput event streams also push beyond what etcd handles well — while 1-10ms latency at thousands of operations per second works fine for coordination, it won’t meet the demands of event pipelines. For queuing or job scheduling, reach for a dedicated message queue like Kafka or RabbitMQ instead.

Production Failure Scenarios

Failure	Impact	Mitigation
Quorum loss (N/2+ nodes fail)	etcd cluster becomes unavailable; no reads or writes possible	Use 5-node cluster for better fault tolerance; monitor node health; set appropriate `election-timeout`
Disk I/O saturation	etcd is extremely disk I/O sensitive; slow fsync causes leader heartbeats to timeout	Use SSDs (NVMe preferred); isolate etcd disk I/O from other workloads; monitor `fsync` latency
Snapshot restore taking too long	When leader sends snapshots to new nodes, followers are unavailable during download	Pre-provision nodes with snapshot already loaded; ensure fast network between nodes; use pre-voted configured correctly
Lease holder crash	Keys with TTL expire immediately; watchers are notified	Use longer TTL for less-critical leases; implement leasekeepalive heartbeat to renew; handle expired lease gracefully
Raft proposal timeout misconfiguration	Too-short timeouts cause spurious leader elections; too-long causes slow failover	`heartbeat-interval` should be ~100ms; `election-timeout` should be 10x heartbeat; tune based on network RTT
Watch flood during leader election	Thousands of events fire simultaneously during leadership transition	Use filtering watchers where possible; debounce watch event processing; batch watch event handling
Inconsistent snapshot state	If snapshot is taken before WAL is flushed, restored node may have inconsistent state	Use raft log compaction carefully; ensure WAL and snapshot are properly sequenced; test restore procedures

Quick Recap Checklist

Interview Questions

1. How does etcd's Raft implementation achieve consensus, and what happens during a leader election?

Expected answer points:

Leader election: nodes start as followers, become candidates after election timeout, request votes, majority wins
Log replication: leader appends commands to its log, replicates via AppendEntries, commits when majority acks
Safety guarantee: only servers with committed entries can become leader, ensuring no data loss
Leader crash triggers new election; node with most complete log wins

2. What is the difference between linearizable reads and stale reads in etcd?

Expected answer points:

Follower reads without consistency guarantees can return stale (outdated) data
Linearizable reads go through the leader and verify the cluster is still the leader before returning data
Linearizable reads use the `serializable` mode or explicit `WithSerializable` — stronger guarantees but higher latency
Use linearizable reads when your application requires seeing all previously committed writes

3. How do etcd watches work, and why are they more efficient than polling?

Expected answer points:

Watches are true subscriptions — the server tracks which clients watch which keys
No polling interval means zero latency between a key change and notification delivery
Watches are scalable because the server pushes events only to interested subscribers
Can watch single keys or entire prefixes recursively with `WithPrefix()`
Event types include PUT, DELETE, and you receive the old and new values

4. What happens to keys attached to a lease when the lease holder crashes?

Expected answer points:

Lease continues running on the etcd server; client keepalive messages renew it
If no keepalive arrives within the grace period, etcd revokes the lease and deletes all attached keys atomically
Grace period = lease TTL + leader election timeout (if leader is down, lease cannot be renewed until new election)
JVM GC pauses can exceed lease TTL and cause unexpected key expiration
Best practice: set TTL at 2–5x expected renewal interval with exponential backoff retry

5. How does etcd's WAL (Write-Ahead Log) relate to BoltDB snapshots?

Expected answer points:

Every committed write goes to WAL before being applied to the state machine (BoltDB)
WAL ensures durability — writes are not acknowledged until written and fsynced
Snapshots capture the current BoltDB state; WAL is truncated afterward to reclaim space
New members restore from snapshot and replay only recent WAL entries (not years of history)
Without snapshots, recovery would require replaying ALL WAL entries from the beginning

6. Why are NVMe SSDs mandatory for production etcd deployments?

Expected answer points:

etcd writes to WAL on every committed operation; each write must be fsynced before acknowledgment
HDD fsync latency: 10–50ms → max ~33 writes/second (far below production needs)
SSD fsync latency: 0.1–1ms → thousands of writes/second (acceptable)
NVMe fsync latency: 0.05–0.2ms → supports 10k+ QPS consistently
Network-attached storage (NAS/NFS) adds network latency to fsync — never use for etcd data

7. Describe the etcd RBAC model and how prefix-based permissions work.

Expected answer points:

Built-in roles: root (full access), guest (read all keys), etcd-cluster (cluster communication)
Custom roles are created with specific prefix permissions using `role grant-permission`
Users are bound to roles; credentials are passed on every request via `--user` flag or TLS CN
Prefix-based isolation means no native multi-tenant namespace — tenant separation relies on naming conventions
At scale, dedicated etcd clusters per tenant or managed services (etcd Operator) are preferred

8. How would you perform a rolling upgrade of a 3-node etcd cluster from version 3.4 to 3.6?

Expected answer points:

Never skip minor versions — upgrade path is 3.4 → 3.5 → 3.6
Pre-upgrade: take a snapshot, verify free disk space (2x current DB size), review release notes
Upgrade one member at a time: stop etcd, replace binary, restart, wait for member to rejoin and catch up
Check `etcdctl endpoint health --cluster` after each member upgrade
Only proceed to next member when the current one is fully healthy and caught up
Major version jumps require full cluster restart (not rolling)

9. What is the minimum etcd cluster size for quorum, and what happens when quorum is lost?

Expected answer points:

3-node cluster tolerates 1 failure (quorum = 2); 5-node tolerates 2 failures (quorum = 3)
Quorum = majority = N/2 + 1; cluster requires quorum for both reads and writes
When quorum is lost: no new writes can be committed, reads may be stale or fail
Cluster becomes completely unavailable until at least quorum of nodes recover
Best practice: use 5-node clusters in production for better fault tolerance

10. How does Kubernetes use etcd, and what happens to Kubernetes when etcd is unavailable?

Expected answer points:

Kubernetes stores ALL cluster state (pods, services, deployments, configmaps, secrets) in etcd
Every `kubectl` command goes: API server → etcd for reads/writes
Controllers watch etcd and reconcile desired state — when etcd changes, controllers react
When etcd is unavailable: no new pods scheduled, no services created/modified, DNS may stop resolving new entries
Existing pods keep running (kubelets manage them independently without API server)
Typical mitigation: dedicated etcd nodes with redundant NVMe storage and monitoring

11. What is the purpose of the `etcd_server_leader_changes_seen_total` metric and what does a high value indicate?

Expected answer points:

Counts leader elections that have occurred on the cluster
High values indicate frequent leader elections — a symptom, not a direct problem
Causes: network latency, disk I/O saturation causing heartbeats to timeout, misconfigured election-timeout
Every election causes a brief write pause (leader must commit no-op entry to confirm authority)
Normal value is near zero; sustained elevated values indicate underlying infrastructure problems

12. Explain the etcd transaction (compare-and-swap) mechanism and give a use case.

Expected answer points:

Transactions use `If-Then-Else` semantics: condition(s), success operations, failure operations
Conditions can check key existence, value equality, version (mod revision), or lease
All operations in the transaction are applied atomically — either all succeed or all fail
Use case: distributed locks — atomically check a key doesn't exist AND set it to claim ownership
Use case: leader election — atomically claim leadership by checking and setting the election key
Without transactions, race conditions would allow multiple processes to believe they are leader

13. How does mTLS work in etcd cluster communication?

Expected answer points:

Both server and client present certificates; both sides verify the other's identity
Each member needs a server certificate with its hostname/IP as Subject Alternative Names (SANs)
Each client needs a client certificate for authentication — no SAN required
All members and clients must trust the same CA certificate
`--peer-cert-file`/`--peer-key-file` for inter-member communication; `--cert-file`/`--key-file` for client connections
Without TLS, attackers can read cluster state including Kubernetes secrets via man-in-the-middle

14. What are the key differences between how etcd handles minor vs. majority node failures?

Expected answer points:

Minority (N-1 down for 3-node): cluster continues serving reads and writes normally
Leader down: brief write pause (seconds) while new leader elected, then normal operation resumes
Majority down (quorum loss): cluster becomes unavailable — cannot commit writes or guarantee linearizability
Recovery: nodes that were down must catch up by replaying WAL entries from the leader or restoring from snapshot
Minority node rejoining: typically fast (snapshot + recent WAL replay); major rebuild can take much longer

15. Why does etcd recommend setting `heartbeat-interval` ~100ms and `election-timeout` 10x that value?

Expected answer points:

Heartbeat interval: frequency of leader heartbeat messages to followers (should be ~100ms)
Election timeout: randomized window (e.g., 1000–2000ms) before follower becomes candidate
10x ratio ensures heartbeat arrives well before follower times out — prevents spurious elections
If heartbeat-interval is too small, network jitter can cause unnecessary elections
If election-timeout is too small, slow disks or GC pauses trigger false leader elections
Values should be tuned based on network RTT between nodes — must exceed typical RTT by comfortable margin

16. What is snapshot compaction in etcd and why is it necessary?

Expected answer points:

As the WAL grows, replaying all entries on recovery takes longer and consumes more disk
Snapshot compaction periodically captures the current state machine state as a snapshot file
After a snapshot is taken, old WAL entries that are already applied can be discarded
This bounds disk usage and recovery time — new nodes restore from snapshot + replay only recent WAL
Without compaction, WAL grows unbounded and recovery time becomes unacceptable
Controlled by `--snapshot-count` flag (default: 100,000 entries)

17. How does the `etcd_wal_fsync_duration_seconds_bucket` metric help diagnose etcd problems?

Expected answer points:

Measures histogram of WAL fsync latency — how long each fsync call takes
P99 should be under 10ms; P99 above 100ms indicates disk I/O saturation
This is the most critical etcd performance metric — disk latency directly limits write throughput
Slow fsync causes leader heartbeat timeouts → leader elections → cluster instability
If elevated, check: disk %util, avgqu-sz, whether etcd shares disk with other workloads
Mitigation: move to NVMe, isolate etcd disk from other I/O, disable disk I/O scheduler

18. What is the relationship between etcd's lease mechanism and leader election in distributed locks?

Expected answer points:

Distributed locks in etcd use a key with a TTL lease — the lock is held only while the lease is valid
Lock acquisition: atomically create the lock key; if it already exists, someone else holds the lock
Lock release: either explicitly delete the key or let the lease expire (handles crash scenarios)
When the lock holder crashes, the lease expires and the key is deleted automatically — no stale locks
Leader election uses the same pattern: leader creates a key with a lease, others watch and react when it disappears
This eliminates the need for explicit fencing tokens in many scenarios

19. Under what conditions would you recommend a 5-node etcd cluster over a 3-node cluster?

Expected answer points:

3-node cluster tolerates 1 failure = 0 fault tolerance during recovery window
5-node cluster tolerates 2 simultaneous failures = better availability during upgrades and failures
Recommended for: production Kubernetes control planes, multi-region deployments, environments with frequent node maintenance
Trade-off: more nodes means higher resource cost and slightly higher write latency (more acknowledgments needed)
5-node is the sweet spot for production; 7-node used for very large-scale or geographically distributed clusters

20. When would you choose etcd over Consul or Zookeeper for service discovery?

Expected answer points:

etcd: best when you need strong consistency + Kubernetes integration; simple key-value API; excellent Raft implementation
Consul: best for multi-datacenter deployments with built-in service mesh (Consul Connect), DNS interface, and agent-based health checks
ZooKeeper: older and more complex API; used by Kafka, HDFS; requires custom tooling for service discovery patterns
Choose etcd when: you're already running Kubernetes (natural fit), need simple k-v with watches, prioritize consistency
Choose Consul when: you need multi-datacenter replication, built-in health checks, or service mesh capabilities
Choose Zookeeper when: you need a battle-tested coordination layer for Hadoop ecosystem tools (Kafka, HBase)

Conclusion

etcd is purpose-built for coordination in distributed systems:

Strong consistency: Via Raft, no eventual consistency
Reliability: Survives minority failures without data loss
Efficiency: Watches enable reactive patterns without polling
Simplicity: Few primitives that compose well

The Raft implementation is considered a gold standard. Reading etcd source code teaches you more about distributed consensus than any textbook.

For Kubernetes users, understanding etcd is essential for debugging cluster issues. Root cause is often etcd: slow disk I/O causing leader election timeouts, memory pressure from a bloated WAL, or network issues preventing consensus.

For other use cases: do you actually need etcd? Millions of reads per second on terabytes? Use a different database. Need reliable configuration storage with fast updates and watches? etcd (or AWS Parameter Store or Consul) fits.

etcd: Distributed Key-Value Store for Service Discovery and Configuration

Core Architecture

The Raft Consensus Algorithm

Data Model and API

Watches and Reactive Configuration

Security Configuration

Authentication and RBAC

TLS/mTLS Setup for Cluster Communication

Operational Considerations

Lease Revocation Edge Cases

Namespace Isolation for Multi-Tenant Environments

Key Monitoring Metrics

Cluster Maintenance

Rolling Upgrade Procedures

Disk I/O Requirements

Leader Election

Handling Failures and Recovery

WAL and Snapshot Mechanics

Performance Characteristics

Kubernetes Integration

Common Pitfalls / Anti-Patterns

When to Use etcd

When Not to Use etcd

Production Failure Scenarios

Quick Recap Checklist

Interview Questions

Further Reading

Conclusion

Category

Tags

Related Posts

Apache ZooKeeper: Consensus and Coordination

Google Chubby: The Lock Service That Inspired ZooKeeper

Raft Consensus Algorithm