etcd: Distributed Key-Value Store for Configuration

Deep dive into etcd architecture using Raft consensus, watches for reactive configuration, leader election patterns, and Kubernetes integration.

published: reading time: 30 min read author: GeekWorkBench

etcd: Distributed Key-Value Store for Service Discovery and Configuration

etcd started at CoreOS as a coordination service built on the Raft consensus algorithm. Its job is storing configuration, detecting failures, and coordinating leader elections in distributed systems. Today etcd is best known as Kubernetes’ backing store — holding all cluster state: pods, services, deployments, the works.

The name follows the Unix convention of naming system services “etc” (for configuration) with a “d” for daemon. etcd was essentially meant to be the “etc” for distributed systems.


Core Architecture

etcd is a distributed, consistent, highly-available key-value store. It is not a general-purpose database — etcd handles specific things well:

  • Small amounts of critical data (configuration, locks, leader election)
  • Strong consistency — no eventual consistency surprises
  • Fast distributed coordination primitives

The architecture is built on Raft to maintain a replicated log. The Raft paper is titled “In Search of an Understandable Consensus Algorithm”, and etcd’s implementation is widely considered the reference implementation.

graph TD
    A[Client] -->|Read/Write| B[Leader]
    B -->|Replicate| C[Follower 1]
    B -->|Replicate| D[Follower 2]
    B -->|Replicate| E[Follower 3]

    C -.->|Heartbeat| B
    D -.->|Heartbeat| B
    E -.->|Heartbeat| B

    B --> F[(Raft Log)]
    C --> G[(Raft Log)]
    D --> H[(Raft Log)]
    E --> I[(Raft Log)]

In a Raft cluster, one node is the leader and handles all writes. Writes go through the leader’s replicated log, committed to a majority of nodes before durability. Reads can go to any node, but followers might return stale data unless you request linearizable reads.


The Raft Consensus Algorithm

Raft achieves consensus by electing a leader and replicating a log of commands. The algorithm has three parts:

Leader Election:

  • Nodes start as followers
  • Followers become candidates if they miss leader communication within the election timeout
  • Candidates request votes from other nodes
  • Majority votes = new leader

Log Replication:

  • Leader accepts commands from clients
  • Appends to local log
  • Replicates to followers via AppendEntries messages
  • Command on majority of logs = applied to state machine, client notified

Safety:

  • Only servers with committed entries can become leader
  • Committed entries survive minority node failures
// Simplified Raft write flow in etcd
func (s *Server) Propose(ctx context.Context, cmd []byte) error {
    // 1. Submit to Raft log
    ch := make(chan applyFuture)
    s.node.Propose(ctx, cmd, ch)

    // 2. Wait for commit
    result := <-ch
    if result.Error() != nil {
        return result.Error()
    }

    // 3. Apply to state machine
    return s.apply(cmd)
}

The key insight: as long as a majority agrees on log contents, the cluster is consistent. Leader crashes = new leader with most complete log elected.


Data Model and API

etcd stores data in a flat key-value namespace with a hierarchical directory structure (like a filesystem). Keys are strings, values can be arbitrary bytes.

# Store a simple value
etcdctl put /services/api/server1 "10.0.0.1:8080"

# Retrieve the value
etcdctl get /services/api/server1

# Store JSON configuration
etcdctl put /config/database '{"host": "db.example.com", "port": 5432}'

# List all keys under a prefix
etcdctl get --prefix /services/

The etcd API follows the gRPC protocol, which you can call directly or via client libraries:

import "go.etcd.io/etcd/client/v3"

cli, _ := client.NewFromURLs([]string{"http://localhost:2379"})
defer cli.Close()

// Write a value
cli.Put(context.Background(), "/config/featureflags", `{"newUI": true}`)

// Read a value
resp, _ := cli.Get(context.Background(), "/config/featureflags")
fmt.Println(string(resp.Kvs[0].Value))

// Atomic compare-and-swap (requires current value)
cli.Txn(context.Background()).If(
    clientv3.Compare(clientv3.Value("/config/featureflags"), "=", `{"newUI": false}`),
).Then(
    clientv3.OpPut("/config/featureflags", `{"newUI": true}`),
).Else(
    clientv3.OpPut("/config/featureflags", `{"newUI": false}`),
).Commit()

Transactions enable atomic compare-and-swap operations, essential for distributed locks and leader election.


Watches and Reactive Configuration

Native watch support is etcd’s most powerful feature. Subscribe to key changes instead of polling, and get notified immediately when data changes.

// Watch a single key
watchChan := cli.Watch(context.Background(), "/services/api/")
for resp := range watchChan {
    for _, event := range resp.Events {
        fmt.Printf("%s %q: %q\n", event.Type, event.Kv.Key, event.Kv.Value)
    }
}

// Watch an entire prefix recursively
watchChan := cli.Watch(context.Background(), "/services/", clientv3.WithPrefix())

Watches in etcd are cheap and scalable because they are true subscriptions, not polling. The etcd server tracks which clients watch which keys and pushes updates directly.

sequenceDiagram
    participant K as Kubernetes Controller
    participant E as etcd Server
    participant W as Watch Stream

    K->>E: Watch /pods/
    E->>W: Create subscription
    W-->>K: Initial state

    Note over E: Pod created
    E->>W: Push new event
    W-->>K: Pod event notification
    K->>K: Reconcile pod state

This watch mechanism lets Kubernetes controllers react immediately to cluster changes without polling.


Security Configuration

Authentication and RBAC

etcd ships with built-in role-based access control. Before enabling auth, you create users and bind them to roles that permit specific key prefixes.

# Create a user with password authentication
etcdctl user add root
# Add user to the root role (full access)
etcdctl user grant-role root root

# Enable authentication
etcdctl auth enable

# Now all requests require credentials
etcdctl --user root:password put /services/api/server1 "10.0.0.1:8080"

Built-in roles:

RolePermissions
rootFull read/write/admin on all keys
guestRead on all keys (default role)
etcd-clusterRead/write on cluster communication keys

Custom roles for least privilege:

# Create a role scoped to a key prefix
etcdctl role add app-config
etcdctl role grant-permission app-config readwrite /config/app/

# Create a user and assign the role
etcdctl user add app-service
etcdctl user grant-role app-service app-config

Without RBAC, any client that can reach etcd port 2379 has full cluster access. In Kubernetes, the etcd RBAC layer is typically bypassed because the API server handles authentication — but for standalone etcd deployments, RBAC is the first line of defense.


TLS/mTLS Setup for Cluster Communication

etcd supports encrypted communication between members and clients via mutual TLS. Each member and client presents a certificate; both sides verify the other’s identity.

Certificate requirements:

  • Each etcd member needs a server certificate with its hostname/IP as Subject Alternative Names (SANs)
  • Each client needs a client certificate for authentication
  • The CA certificate must be trusted by all members and clients
# Generate certificates using cfssl or openssl
# Example using openssl - simplified

# 1. Create the CA
openssl genrsa -out ca-key.pem 2048
openssl req -x509 -new -nodes -key ca-key.pem -days 365 -out ca-crt.pem \
  -subj "/CN=etcd-ca"

# 2. Create member certificate with SANs for all cluster IPs
openssl genrsa -out member-key.pem 2048
# SANs: etcd-1, etcd-2, etcd-3, 192.168.1.1, 192.168.1.2, 192.168.1.3
openssl req -new -key member-key.pem -out member.csr \
  -subj "/CN=etcd-member"
# Sign with CA...

# 3. Client certificates (no SAN required)
openssl genrsa -out client-key.pem 2048
openssl req -new -key client-key.pem -out client.csr \
  -subj "/CN=etcd-client"
# Sign with CA...

Cluster member flags:

# On each member, specify certificates
etcd \
  --cert-file=/etc/etcd/ssl/server.crt \
  --key-file=/etc/etcd/ssl/server.key \
  --trusted-ca-file=/etc/etcd/ssl/ca.crt \
  --peer-cert-file=/etc/etcd/ssl/peer.crt \
  --peer-key-file=/etc/etcd/ssl/peer.key \
  --peer-trusted-ca-file=/etc/etcd/ssl/ca.crt \
  --initial-cluster etcd-1=https://192.168.1.1:2380,etcd-2=https://192.168.1.2:2380 \
  --listen-peer-urls https://0.0.0.0:2380 \
  --listen-client-urls https://0.0.0.0:2379

Client connection with TLS:

# Connect a client with certificates
etcdctl --cacert=/etc/etcd/ssl/ca.crt \
  --cert=/etc/etcd/ssl/client.crt \
  --key=/etc/etcd/ssl/client.key \
  endpoint health

Without TLS in production, network-level attackers can read all cluster state including Kubernetes secrets. TLS also prevents man-in-the-middle attacks that could inject false configuration into your cluster.


Operational Considerations

Lease Revocation Edge Cases

etcd leases attach TTLs to keys. When a lease expires, all keys attached to it are deleted atomically. The edge case is what happens when the lease holder crashes before renewal.

Grace period behavior:

// When a lease holder crashes:
// 1. The lease continues running on etcd server
// 2. If no keepalive arrives before --lease-grant-timeout, etcd revokes the lease
// 3. All keys attached to the lease are deleted
// 4. Watchers on those keys receive delete events

// The grace period is approximately: (lease TTL) + (leader election timeout)
// If leader is down, lease cannot be renewed until new leader is elected

Practical implications:

ScenarioLease Behavior
Holder crashes, lease TTL = 10sKeys deleted ~10s after crash (plus network delay)
Leader fails during grace periodNew leader elected, old holder’s lease revoked, keys deleted
Network partition (holder isolated)Partition-side holder cannot renew; majority-side lease revoked after TTL expires
Holder JVM pauses (GC)If pause exceeds lease TTL, keys deleted; use longer TTL + reliable keepalive

Recommendation: set lease TTL at 2-5x your expected renewal interval, and implement retry logic with exponential backoff on the client side.


Namespace Isolation for Multi-Tenant Environments

etcd has no native multi-tenant namespace abstraction. Isolation is achieved through key prefix conventions and RBAC roles.

Prefix-based tenancy:

# Tenant A's keys
etcdctl role add tenant-a
etcdctl role grant-permission tenant-a readwrite /tenants/a/

# Tenant B's keys
etcdctl role add tenant-b
etcdctl role grant-permission tenant-b readwrite /tenants/b/

# Each tenant's application only gets credentials granting access to their prefix
# Kubernetes uses this pattern: /registry/...

For Kubernetes itself, the convention is:

# Kubernetes state lives under these prefixes
/registry/apiregistration.k8s.io/apiservices/
/registry/services/endpoints/
/registry/pods/
/registry/secrets/
/registry/configmaps/

Multi-tenant considerations:

ConcernMitigation
Tenant key sprawlEnforce naming conventions; monitor key count per prefix
Resource contentionSeparate etcd clusters per tenant at scale
Quota limitsetcd has no per-key-prefix quota; implement application-level checks
RBAC complexityUse role inheritance where possible; automate role provisioning

At scale (hundreds of tenants), running a dedicated etcd cluster per tenant or using a managed service like etcd Operator becomes more practical than prefix-based isolation on a single cluster.


Key Monitoring Metrics

Monitor these metrics to catch etcd problems before they affect your cluster:

Critical metrics:

# Using Prometheus with etcd metrics endpoint (--listen-metrics-urls)
# Metrics endpoint: http://localhost:2381/metrics

# Commit index lag - how far behind followers are from leader
etcd_server_leader_changes_seen_total
etcd_server_commits_total
# For lag: compare follower applied_index vs leader committed_index

# WAL fsync latency - disk performance indicator
etcd_wal_fsync_duration_seconds_bucket
# P99 should be < 10ms; > 100ms indicates disk I/O problems

# Snapshot size
etcd_server_snapshot_apply_in_progress_total
# Snapshot size growing rapidly indicates write burst or missed snapshots

Prometheus alerting rules:

groups:
  - name: etcd
    rules:
      # Alert if WAL fsync P99 > 500ms
      - alert: EtcdWalFsyncLatency
        expr: histogram_quantile(0.99, rate(etcd_wal_fsync_duration_seconds_bucket[5m])) > 0.5
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "etcd WAL fsync latency is critical"

      # Alert if follower commit index lag is growing
      - alert: EtcdCommitIndexLag
        expr: etcd_server_leader_commits_index - etcd_server_apply_commits_index > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "etcd follower is lagging behind leader"

      # Alert if snapshot restore is in progress (node recovering)
      - alert: EtcdSnapshotRestore
        expr: etcd_server_snapshot_apply_in_progress_total == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "etcd is currently restoring a snapshot"

Cluster Maintenance

Rolling Upgrade Procedures

etcd supports rolling upgrades where one member is upgraded at a time with no cluster downtime.

Standard rolling upgrade steps:

# 1. Verify cluster health before starting
etcdctl endpoint health --cluster

# 2. Upgrade one member at a time
# Stop etcd on member, replace binary, restart
systemctl stop etcd

# 3. After restart, wait for member to catch up
# Check that the member joins the cluster
etcdctl member list

# 4. Verify the upgraded member is healthy
etcdctl endpoint health

# 5. Proceed to next member

Pre-upgrade checklist:

  • Review etcd release notes for any breaking changes between versions
  • Ensure your etcd data directory (—data-dir) has adequate free space (2x current DB size recommended)
  • Take a snapshot before starting: etcdctl snapshot save /backup/etcd-snap.db
  • Verify the new binary version is compatible with cluster API version (etcd --version)

Upgrade compatibility matrix:

Cluster VersionSupported Upgrades
3.43.4 -> 3.5 (same major version)
3.53.5 -> 3.6 (same major version)
3.63.6 -> 3.7 (if released)
NoteMajor version jumps require full cluster restart (not rolling)

Never upgrade more than one minor version at a time. For example, going from 3.4 to 3.6 should be done as 3.4 -> 3.5 -> 3.6.


Disk I/O Requirements

etcd’s performance is fundamentally bound by disk I/O, particularly fsync latency on the WAL.

Requirements by environment:

EnvironmentMinimumRecommended
Development/QAHDD (7200 RPM) with ext4/xfsSSD with NVMe
ProductionSSD (SATA or NVMe)NVMe SSD with >50K IOPS
Kubernetes Control PlaneNVMe SSD mandatoryNVMe SSD with <1ms fsync P99
Heavy write loadNVMe SSD with >100K IOPS and >1GB/s throughputRAID-0 NVMe or dedicated NVMe device

Why SSD/NVMe is mandatory:

# etcd writes to WAL on every committed operation
# Each write must be fsynced before acknowledgment
# HDD fsync latency: 10-50ms (unacceptable)
# SSD fsync latency: 0.1-1ms (acceptable)
# NVMe fsync latency: 0.05-0.2ms (optimal)

# A 10-node cluster with 1 HDD
# fsync P99 = 30ms
# Max theoretical throughput = 1000ms / 30ms = ~33 writes/second
# At 1000 writes/second -> 30+ second commit latency -> leader election

Disk configuration best practices:

# Use --data-dir on dedicated disk or partition
# Do NOT share with application data
etcd --data-dir=/var/lib/etcd

# For NVMe, consider disabling disk I/O scheduler
echo "none" > /sys/block/nvme0n1/queue/scheduler

# Mount with noatime to avoid unnecessary writes
# /etc/fstab entry:
# /dev/nvme0n1 /var/lib/etcd ext4 defaults,noatime 0 2

# Monitor disk I/O
iostat -x 1
# Watch %util and avgqu-sz - if avgqu-sz > 4 for sustained periods, disk is saturated

Don’t use network-attached storage (NAS/NFS) for etcd data. Network latency adds directly to fsync latency and will cause leader election timeouts. Local NVMe or dedicated SSD is the only production-supported configuration.


Leader Election

etcd provides primitives for leader election. Create a dedicated directory and use transactions to ensure only one process leads at a time.

// Simplified leader election using etcd
func ElectLeader(cli *clientv3.Client, leaderName string) {
    // Try to acquire leadership by creating a key
    _, err := cli.Put(context.Background(), "/election/leader", leaderName,
        clientv3.WithLease(clientv3.NewLease(cli)))

    if err != nil {
        // Key exists, check current leader
        resp, _ := cli.Get(context.Background(), "/election/leader")
        fmt.Printf("Current leader: %s\n", resp.Kvs[0].Value)
        return
    }

    fmt.Printf("%s is now the leader!\n", leaderName)

    // Keep leadership by renewing the lease
    // If this process dies, the lease expires and leadership is released
}

The lease mechanism matters. Leader crashes, lease expires, another node takes over. This prevents stale leadership where a dead process appears to still be in charge.


Handling Failures and Recovery

etcd tolerates node failures gracefully:

  • Minority down: Cluster continues serving reads and writes
  • Leader down: New leader elected within seconds, brief write pause
  • Majority down: Cluster unavailable (cannot guarantee consistency without majority)
# Check cluster health
etcdctl endpoint health

# Check member list
etcdctl member list

# Remove a failed member
etcdctl member remove <member-id>

Recovering nodes must catch up on missed updates. etcd supports:

  • Snapshot recovery: Restore from a recent snapshot and replay WAL entries
  • Member migration: Replace failed node with a new one that learns from the leader

For production clusters, regular snapshots are essential. Without them, a recovering node might replay years of WAL entries.


WAL and Snapshot Mechanics

etcd uses a Write-Ahead Log (WAL) to ensure durability of committed operations before applying them to the state machine. Every write that passes Raft consensus gets written to the WAL before being applied.

WAL structure:

wal/
├── 0.index              # WAL metadata
├── 1.snap/              # Snapshot of state at index 1
│   ├── db               # BoltDB snapshot file
│   └── metadata
├── 2.wal/               # WAL for entries 2 onwards
│   ├── 0.index          # First WAL segment index
│   ├── 0000000000000000-0000000000000000.wal
│   └── 0000000000000001-0000000000000256.wal

Each WAL segment file contains record batches with:

  • Type: Entry (raft log entry), State (hard state like term/vote), etc.
  • Term: Raft term number
  • Index: Log entry index
  • Data: The actual write command
graph LR
    A[Client Write] --> B[Raft Propose]
    B --> C[Append to WAL]
    C --> D[Replicate to Followers]
    D --> E[Wait for Majority Ack]
    E --> F[Mark Committed in WAL]
    F --> G[Apply to State Machine]
    G --> H[Write Snapshot]

Snapshotting process:

When the WAL grows too large or during routine snapshots, etcd creates a snapshot:

  1. etcd takes a BoltDB snapshot of the current state
  2. WAL is truncated - old entries are discarded
  3. New member joining can restore from snapshot + replay only recent WAL
# Manually trigger a snapshot
etcdctl snapshot save /tmp/etcd-snap.db

# Check snapshot status
etcdctl snapshot status /tmp/etcd-snap.db

# Restore from snapshot (for disaster recovery)
etcdctl snapshot restore /tmp/etcd-snap.db \
  --name etcd-1 \
  --initial-cluster etcd-1=http://192.168.1.1:2380 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-advertise-peer-urls http://192.168.1.1:2380

Why snapshots matter:

Without SnapshotsWith Snapshots
New node must replay ALL WAL entriesNew node replays only recent entries
Recovery time = years of entriesRecovery time = minutes of entries
Disk space unboundedWAL bounded by snapshot interval
Memory pressure from WAL indexClean, bounded WAL

Operational best practices:

  • Monitor etcd_server_snapshot_apply_in_progress_total metric
  • Set --snapshot-count to control how often snapshots are taken (default: 100,000)
  • Ensure adequate disk space for WAL + snapshots during heavy write periods

Performance Characteristics

etcd is not designed for high throughput on large datasets. It excels at:

  • Megabytes to gigabytes of critical configuration
  • Consistent latency operations (1-10ms typically)
  • Thousands of watches and frequent small updates

Rough numbers for a 3-node cluster on modern hardware:

  • Write latency: 1-5ms at 10k+ QPS
  • Read latency: sub-millisecond for cached reads
  • Watch scalability: 100k+ concurrent watches

These come from etcd team’s benchmarking. Real-world performance depends on network latency, disk I/O, and cluster size.


Kubernetes Integration

Kubernetes stores all cluster state in etcd. kubectl get pods? API server reads from etcd. Deploy a service? API server writes to etcd, controllers watching react.

graph TD
    A[kubectl] -->|HTTP| B[API Server]
    B -->|Read/Write| C[etcd]
    D[Controller] -->|Watch| B
    D -.->|Reconcile| E[Pod]
    B -.->|Notify| D

This tight coupling means etcd failures affect the entire Kubernetes control plane. If etcd is unavailable:

  • No new pods scheduled
  • Existing pods keep running (kubelets manage them independently)
  • Services cannot be created or modified
  • DNS might stop resolving new entries

Kubernetes deployments typically run etcd on dedicated nodes with redundant storage and monitoring.


Common Pitfalls / Anti-Patterns

When to Use etcd

For distributed coordination and locking, etcd excels. Its lease mechanism and compare-and-swap semantics make leader election, distributed locks, and barriers straightforward to implement. Service discovery works naturally too — you can register services, track their health, and discover endpoints dynamically without a separate discovery service. Configuration storage is another strong use case: cluster configuration, feature flags, and dynamic settings that need consistent reads across nodes fit perfectly. When your application needs multiple processes to agree on a single value across machines, etcd provides the consensus guarantees you need.

When Not to Use etcd

etcd is not meant for general application data. It’s designed for metadata and coordination, so high-volume application data belongs in Redis or a proper database. High-throughput event streams also push beyond what etcd handles well — while 1-10ms latency at thousands of operations per second works fine for coordination, it won’t meet the demands of event pipelines. For queuing or job scheduling, reach for a dedicated message queue like Kafka or RabbitMQ instead.

Production Failure Scenarios

FailureImpactMitigation
Quorum loss (N/2+ nodes fail)etcd cluster becomes unavailable; no reads or writes possibleUse 5-node cluster for better fault tolerance; monitor node health; set appropriate election-timeout
Disk I/O saturationetcd is extremely disk I/O sensitive; slow fsync causes leader heartbeats to timeoutUse SSDs (NVMe preferred); isolate etcd disk I/O from other workloads; monitor fsync latency
Snapshot restore taking too longWhen leader sends snapshots to new nodes, followers are unavailable during downloadPre-provision nodes with snapshot already loaded; ensure fast network between nodes; use pre-voted configured correctly
Lease holder crashKeys with TTL expire immediately; watchers are notifiedUse longer TTL for less-critical leases; implement leasekeepalive heartbeat to renew; handle expired lease gracefully
Raft proposal timeout misconfigurationToo-short timeouts cause spurious leader elections; too-long causes slow failoverheartbeat-interval should be ~100ms; election-timeout should be 10x heartbeat; tune based on network RTT
Watch flood during leader electionThousands of events fire simultaneously during leadership transitionUse filtering watchers where possible; debounce watch event processing; batch watch event handling
Inconsistent snapshot stateIf snapshot is taken before WAL is flushed, restored node may have inconsistent stateUse raft log compaction carefully; ensure WAL and snapshot are properly sequenced; test restore procedures

Quick Recap Checklist

  • etcd uses the Raft consensus algorithm — one leader, replicated log, majority commit
  • Writes go through the leader’s replicated log and are committed when a majority acknowledges
  • Reads can be served by followers but require linearizable reads for strong consistency
  • Watches are true subscriptions — no polling, push-based notifications on key changes
  • Leases attach TTL to keys; when the lease expires, all attached keys are deleted atomically
  • RBAC uses prefix-based permissions; without it, anyone reaching port 2379 has full access
  • TLS/mTLS encrypts all inter-node and client-server communication in production
  • WAL fsync latency is the primary performance bottleneck — use NVMe SSDs, never NAS/NFS
  • Snapshots bound WAL growth; without them, recovery can take years of replayed entries
  • Quorum loss (majority down) makes the cluster unavailable — use 5-node clusters for better fault tolerance
  • Rolling upgrades require one minor version at a time with no cluster restart between steps
  • Kubernetes stores all cluster state in etcd — etcd unavailability freezes the entire control plane

Interview Questions

1. How does etcd's Raft implementation achieve consensus, and what happens during a leader election?

Expected answer points:

  • Leader election: nodes start as followers, become candidates after election timeout, request votes, majority wins
  • Log replication: leader appends commands to its log, replicates via AppendEntries, commits when majority acks
  • Safety guarantee: only servers with committed entries can become leader, ensuring no data loss
  • Leader crash triggers new election; node with most complete log wins
2. What is the difference between linearizable reads and stale reads in etcd?

Expected answer points:

  • Follower reads without consistency guarantees can return stale (outdated) data
  • Linearizable reads go through the leader and verify the cluster is still the leader before returning data
  • Linearizable reads use the `serializable` mode or explicit `WithSerializable` — stronger guarantees but higher latency
  • Use linearizable reads when your application requires seeing all previously committed writes
3. How do etcd watches work, and why are they more efficient than polling?

Expected answer points:

  • Watches are true subscriptions — the server tracks which clients watch which keys
  • No polling interval means zero latency between a key change and notification delivery
  • Watches are scalable because the server pushes events only to interested subscribers
  • Can watch single keys or entire prefixes recursively with `WithPrefix()`
  • Event types include PUT, DELETE, and you receive the old and new values
4. What happens to keys attached to a lease when the lease holder crashes?

Expected answer points:

  • Lease continues running on the etcd server; client keepalive messages renew it
  • If no keepalive arrives within the grace period, etcd revokes the lease and deletes all attached keys atomically
  • Grace period = lease TTL + leader election timeout (if leader is down, lease cannot be renewed until new election)
  • JVM GC pauses can exceed lease TTL and cause unexpected key expiration
  • Best practice: set TTL at 2–5x expected renewal interval with exponential backoff retry
5. How does etcd's WAL (Write-Ahead Log) relate to BoltDB snapshots?

Expected answer points:

  • Every committed write goes to WAL before being applied to the state machine (BoltDB)
  • WAL ensures durability — writes are not acknowledged until written and fsynced
  • Snapshots capture the current BoltDB state; WAL is truncated afterward to reclaim space
  • New members restore from snapshot and replay only recent WAL entries (not years of history)
  • Without snapshots, recovery would require replaying ALL WAL entries from the beginning
6. Why are NVMe SSDs mandatory for production etcd deployments?

Expected answer points:

  • etcd writes to WAL on every committed operation; each write must be fsynced before acknowledgment
  • HDD fsync latency: 10–50ms → max ~33 writes/second (far below production needs)
  • SSD fsync latency: 0.1–1ms → thousands of writes/second (acceptable)
  • NVMe fsync latency: 0.05–0.2ms → supports 10k+ QPS consistently
  • Network-attached storage (NAS/NFS) adds network latency to fsync — never use for etcd data
7. Describe the etcd RBAC model and how prefix-based permissions work.

Expected answer points:

  • Built-in roles: root (full access), guest (read all keys), etcd-cluster (cluster communication)
  • Custom roles are created with specific prefix permissions using `role grant-permission`
  • Users are bound to roles; credentials are passed on every request via `--user` flag or TLS CN
  • Prefix-based isolation means no native multi-tenant namespace — tenant separation relies on naming conventions
  • At scale, dedicated etcd clusters per tenant or managed services (etcd Operator) are preferred
8. How would you perform a rolling upgrade of a 3-node etcd cluster from version 3.4 to 3.6?

Expected answer points:

  • Never skip minor versions — upgrade path is 3.4 → 3.5 → 3.6
  • Pre-upgrade: take a snapshot, verify free disk space (2x current DB size), review release notes
  • Upgrade one member at a time: stop etcd, replace binary, restart, wait for member to rejoin and catch up
  • Check `etcdctl endpoint health --cluster` after each member upgrade
  • Only proceed to next member when the current one is fully healthy and caught up
  • Major version jumps require full cluster restart (not rolling)
9. What is the minimum etcd cluster size for quorum, and what happens when quorum is lost?

Expected answer points:

  • 3-node cluster tolerates 1 failure (quorum = 2); 5-node tolerates 2 failures (quorum = 3)
  • Quorum = majority = N/2 + 1; cluster requires quorum for both reads and writes
  • When quorum is lost: no new writes can be committed, reads may be stale or fail
  • Cluster becomes completely unavailable until at least quorum of nodes recover
  • Best practice: use 5-node clusters in production for better fault tolerance
10. How does Kubernetes use etcd, and what happens to Kubernetes when etcd is unavailable?

Expected answer points:

  • Kubernetes stores ALL cluster state (pods, services, deployments, configmaps, secrets) in etcd
  • Every `kubectl` command goes: API server → etcd for reads/writes
  • Controllers watch etcd and reconcile desired state — when etcd changes, controllers react
  • When etcd is unavailable: no new pods scheduled, no services created/modified, DNS may stop resolving new entries
  • Existing pods keep running (kubelets manage them independently without API server)
  • Typical mitigation: dedicated etcd nodes with redundant NVMe storage and monitoring
11. What is the purpose of the `etcd_server_leader_changes_seen_total` metric and what does a high value indicate?

Expected answer points:

  • Counts leader elections that have occurred on the cluster
  • High values indicate frequent leader elections — a symptom, not a direct problem
  • Causes: network latency, disk I/O saturation causing heartbeats to timeout, misconfigured election-timeout
  • Every election causes a brief write pause (leader must commit no-op entry to confirm authority)
  • Normal value is near zero; sustained elevated values indicate underlying infrastructure problems
12. Explain the etcd transaction (compare-and-swap) mechanism and give a use case.

Expected answer points:

  • Transactions use `If-Then-Else` semantics: condition(s), success operations, failure operations
  • Conditions can check key existence, value equality, version (mod revision), or lease
  • All operations in the transaction are applied atomically — either all succeed or all fail
  • Use case: distributed locks — atomically check a key doesn't exist AND set it to claim ownership
  • Use case: leader election — atomically claim leadership by checking and setting the election key
  • Without transactions, race conditions would allow multiple processes to believe they are leader
13. How does mTLS work in etcd cluster communication?

Expected answer points:

  • Both server and client present certificates; both sides verify the other's identity
  • Each member needs a server certificate with its hostname/IP as Subject Alternative Names (SANs)
  • Each client needs a client certificate for authentication — no SAN required
  • All members and clients must trust the same CA certificate
  • `--peer-cert-file`/`--peer-key-file` for inter-member communication; `--cert-file`/`--key-file` for client connections
  • Without TLS, attackers can read cluster state including Kubernetes secrets via man-in-the-middle
14. What are the key differences between how etcd handles minor vs. majority node failures?

Expected answer points:

  • Minority (N-1 down for 3-node): cluster continues serving reads and writes normally
  • Leader down: brief write pause (seconds) while new leader elected, then normal operation resumes
  • Majority down (quorum loss): cluster becomes unavailable — cannot commit writes or guarantee linearizability
  • Recovery: nodes that were down must catch up by replaying WAL entries from the leader or restoring from snapshot
  • Minority node rejoining: typically fast (snapshot + recent WAL replay); major rebuild can take much longer
15. Why does etcd recommend setting `heartbeat-interval` ~100ms and `election-timeout` 10x that value?

Expected answer points:

  • Heartbeat interval: frequency of leader heartbeat messages to followers (should be ~100ms)
  • Election timeout: randomized window (e.g., 1000–2000ms) before follower becomes candidate
  • 10x ratio ensures heartbeat arrives well before follower times out — prevents spurious elections
  • If heartbeat-interval is too small, network jitter can cause unnecessary elections
  • If election-timeout is too small, slow disks or GC pauses trigger false leader elections
  • Values should be tuned based on network RTT between nodes — must exceed typical RTT by comfortable margin
16. What is snapshot compaction in etcd and why is it necessary?

Expected answer points:

  • As the WAL grows, replaying all entries on recovery takes longer and consumes more disk
  • Snapshot compaction periodically captures the current state machine state as a snapshot file
  • After a snapshot is taken, old WAL entries that are already applied can be discarded
  • This bounds disk usage and recovery time — new nodes restore from snapshot + replay only recent WAL
  • Without compaction, WAL grows unbounded and recovery time becomes unacceptable
  • Controlled by `--snapshot-count` flag (default: 100,000 entries)
17. How does the `etcd_wal_fsync_duration_seconds_bucket` metric help diagnose etcd problems?

Expected answer points:

  • Measures histogram of WAL fsync latency — how long each fsync call takes
  • P99 should be under 10ms; P99 above 100ms indicates disk I/O saturation
  • This is the most critical etcd performance metric — disk latency directly limits write throughput
  • Slow fsync causes leader heartbeat timeouts → leader elections → cluster instability
  • If elevated, check: disk %util, avgqu-sz, whether etcd shares disk with other workloads
  • Mitigation: move to NVMe, isolate etcd disk from other I/O, disable disk I/O scheduler
18. What is the relationship between etcd's lease mechanism and leader election in distributed locks?

Expected answer points:

  • Distributed locks in etcd use a key with a TTL lease — the lock is held only while the lease is valid
  • Lock acquisition: atomically create the lock key; if it already exists, someone else holds the lock
  • Lock release: either explicitly delete the key or let the lease expire (handles crash scenarios)
  • When the lock holder crashes, the lease expires and the key is deleted automatically — no stale locks
  • Leader election uses the same pattern: leader creates a key with a lease, others watch and react when it disappears
  • This eliminates the need for explicit fencing tokens in many scenarios
19. Under what conditions would you recommend a 5-node etcd cluster over a 3-node cluster?

Expected answer points:

  • 3-node cluster tolerates 1 failure = 0 fault tolerance during recovery window
  • 5-node cluster tolerates 2 simultaneous failures = better availability during upgrades and failures
  • Recommended for: production Kubernetes control planes, multi-region deployments, environments with frequent node maintenance
  • Trade-off: more nodes means higher resource cost and slightly higher write latency (more acknowledgments needed)
  • 5-node is the sweet spot for production; 7-node used for very large-scale or geographically distributed clusters
20. When would you choose etcd over Consul or Zookeeper for service discovery?

Expected answer points:

  • etcd: best when you need strong consistency + Kubernetes integration; simple key-value API; excellent Raft implementation
  • Consul: best for multi-datacenter deployments with built-in service mesh (Consul Connect), DNS interface, and agent-based health checks
  • ZooKeeper: older and more complex API; used by Kafka, HDFS; requires custom tooling for service discovery patterns
  • Choose etcd when: you're already running Kubernetes (natural fit), need simple k-v with watches, prioritize consistency
  • Choose Consul when: you need multi-datacenter replication, built-in health checks, or service mesh capabilities
  • Choose Zookeeper when: you need a battle-tested coordination layer for Hadoop ecosystem tools (Kafka, HBase)

Further Reading


Conclusion

etcd is purpose-built for coordination in distributed systems:

  • Strong consistency: Via Raft, no eventual consistency
  • Reliability: Survives minority failures without data loss
  • Efficiency: Watches enable reactive patterns without polling
  • Simplicity: Few primitives that compose well

The Raft implementation is considered a gold standard. Reading etcd source code teaches you more about distributed consensus than any textbook.

For Kubernetes users, understanding etcd is essential for debugging cluster issues. Root cause is often etcd: slow disk I/O causing leader election timeouts, memory pressure from a bloated WAL, or network issues preventing consensus.

For other use cases: do you actually need etcd? Millions of reads per second on terabytes? Use a different database. Need reliable configuration storage with fast updates and watches? etcd (or AWS Parameter Store or Consul) fits.

Category

Related Posts

Apache ZooKeeper: Consensus and Coordination

Explore ZooKeeper's Zab consensus protocol, hierarchical znodes, watches, leader election, and practical use cases for distributed coordination.

#distributed-systems #databases #zookeeper

Google Chubby: The Lock Service That Inspired ZooKeeper

Explore Google Chubby's architecture, lock-based coordination, Paxos integration, cell hierarchy, and its influence on distributed systems design.

#distributed-systems #databases #google

Raft Consensus Algorithm

Raft is a consensus algorithm designed to be understandable, decomposing the problem into leader election, log replication, and safety.

#distributed-systems #consensus #raft