Kubernetes High Availability: HPA, Pod Disruption Budgets, Multi-AZ

Build resilient Kubernetes applications with Horizontal Pod Autoscaler, Pod Disruption Budgets, and multi-availability zone deployments for production workloads.

published: reading time: 31 min read author: GeekWorkBench

Kubernetes High Availability: Pod Disruption Budgets, HPA, and Multi-AZ Deployments

Production workloads need to stay available during node failures, cluster maintenance, and traffic spikes. Kubernetes provides mechanisms to handle these scenarios: Horizontal Pod Autoscaler (HPA) scales pods based on demand, Pod Disruption Budgets (PDB) ensure minimum availability during voluntary disruptions, and multi-AZ deployments protect against datacenter failures.

This post covers building resilient applications on Kubernetes using these tools and practices.

When to Use / When Not to Use

HPA suits variable traffic well

Web APIs, user-facing services, anything where load is unpredictable. If you have metrics that correlate with demand, HPA can react faster than you can.

Custom metrics unlock more. Queue depth for worker systems, request latency for latency-sensitive services, business metrics like active users. The autoscaler scales on what matters for your system.

When PDBs matter

Stateful applications need PDBs because their failure modes are harsher. A database cluster that loses quorum stops accepting writes and may need manual recovery, while a stateless service simply restarts.

If you need to drain nodes for maintenance without service blips, PDBs are essential. Cluster upgrades require draining nodes, and without PDBs you can temporarily lose quorum for stateful workloads.

Multi-AZ when single-datacenter is not enough

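If your SLA is 99.99%, you have roughly 52 minutes of downtime budget per year, and a single AZ outage can consume all of it if every replica lives in that zone. Multi-AZ deployments keep an AZ outage from becoming a user-visible outage.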

Compliance sometimes demands it. Data residency regulations may require geographic separation.

HPA is not always the answer

Batch jobs have a beginning and an end. They scale up front, run, scale down. HPA oscillating during a long job wastes resources.

Some services have latency requirements that autoscaling cannot satisfy. New pods take time to schedule, pull images, and pass readiness checks; if that scale-up delay is unacceptable, keep enough minimum replicas running to absorb the spike.

Quick Summary

Use HPA when: variable traffic, meaningful metrics available, automatic capacity management needed.

Use PDBs when: stateful apps, cluster maintenance without downtime, protecting critical services during upgrades.

Use multi-AZ when: SLA demands it, zone failures must be invisible, compliance requires it.

Skip HPA for: batch jobs, latency-critical services requiring fixed capacity.

HA Architecture Flow

flowchart LR
    User --> LB[Load Balancer]
    LB --> HPA[HPA: Scales pods<br/>based on metrics]
    HPA --> AZ1[AZ-1 Pod]
    HPA --> AZ2[AZ-2 Pod]
    HPA --> AZ3[AZ-3 Pod]
    subgraph PDB[Pod Disruption Budget]
        PDB1[minAvailable: 2]
    end
    AZ1 --> PDB1
    AZ2 --> PDB1
    AZ3 --> PDB1

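HPA adjusts the replica count, and the scheduler spreads those replicas across availability zones. The PDB ensures at least 2 replicas stay up during voluntary disruptions. Together they handle traffic spikes and cluster maintenance without downtime. The diagram is conceptual: HPA does not sit in the request path, and the load balancer sends traffic directly to pods.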

HPA Configuration and Scaling Behavior

The Horizontal Pod Autoscaler automatically adjusts the number of pod replicas based on CPU utilization, memory usage, or custom metrics.

Basic HPA configuration

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-frontend-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-frontend
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15

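This HPA targets 70% average CPU utilization and 80% average memory utilization, both measured against the pods' resource requests. When several metrics are configured, HPA computes a replica count for each and uses the largest. It scales between 3 and 20 replicas. The behavior section controls scaling speed:

  • Scale-down stabilization window of 300 seconds prevents rapid flapping
  • Scale-down removes at most 10% of current pods per minute
  • Scale-up may double the pod count every 15 seconds for rapid response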
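For intuition, the replica calculation that the Kubernetes documentation describes for HPA can be written as below. The numbers in the worked line (4 replicas averaging 90% CPU) are made up for illustration, and the sketch ignores the controller's tolerance band and the behavior policies.

```latex
\text{desiredReplicas} = \left\lceil \text{currentReplicas} \times \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil

% Illustrative numbers: 4 replicas averaging 90% CPU against the 70% target
\left\lceil 4 \times \frac{90}{70} \right\rceil = \lceil 5.14 \rceil = 6
```

Going from 4 to 6 pods is within the 100%-per-15-seconds scale-up policy, so in this example the change would be applied in a single step.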

Custom metrics HPA

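For metrics beyond CPU and memory, HPA reads from the custom.metrics.k8s.io API (per-pod and object metrics) or, as in this example, the external.metrics.k8s.io API (metrics from outside the cluster). Both require a metrics adapter to be installed: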

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: queue_depth
          selector:
            matchLabels:
              queue_name: order-processing
        target:
          type: AverageValue
          averageValue: "100"

This scales on message queue depth. With an AverageValue target, HPA divides the total queue depth by the current pod count and adds pods whenever the result exceeds 100 messages per pod.
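A worked example may make the per-pod target more concrete. The queue depth and pod count here are invented for illustration, and the result is still capped by maxReplicas.

```latex
% Illustrative numbers: 1500 queued messages, 5 worker pods, target of 100 per pod
\text{average per pod} = \frac{1500}{5} = 300
\qquad
\text{desiredReplicas} = \left\lceil 5 \times \frac{300}{100} \right\rceil = 15
```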

Checking HPA status

kubectl get hpa -n production
kubectl describe hpa web-frontend-hpa -n production

The describe output shows current metrics, replica counts, and scaling events.

HPA Scaling Policies Trade-off Table

| Scaling Policy | Behavior | Best For | Risk |
| --- | --- | --- | --- |
| Aggressive (scale-up) | Fast scale-up, slow scale-down | Traffic spikes, flash sales | Over-provisioning, higher costs |
| Conservative (scale-down) | Slow scale-up, fast scale-down | Stable workloads, cost-sensitive | Under-provisioning during growth |
| Stable (long stabilization window) | Prevents flapping | Predictable traffic, stateful apps | Slower response to load changes |
| Mixed (separate up/down policies) | Tune each direction independently | Most production workloads | More configuration complexity |

The default HPA behavior is relatively aggressive on scale-up and conservative on scale-down. For stateful services with database connections, use a longer stabilization window to avoid connection churn during brief load fluctuations.

behavior:
  scaleUp:
    stabilizationWindowSeconds: 0 # Immediately scale up
    policies:
      - type: Percent
        value: 100
        periodSeconds: 15
  scaleDown:
    stabilizationWindowSeconds: 300 # 5 minute cooldown
    policies:
      - type: Percent
        value: 10
        periodSeconds: 60

Pod Disruption Budgets for Safe Evictions

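Pod Disruption Budgets (PDB) ensure minimum availability during voluntary disruptions. Voluntary disruptions include node drains for cluster upgrades and node removal by the Cluster Autoscaler, both of which go through the Eviction API. A PDB does not limit pods removed by scaling a Deployment down or by deleting them directly.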

PDB definition

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-frontend-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-frontend

This PDB ensures at least 2 web-frontend pods stay available during voluntary disruptions. If 5 healthy pods exist, Kubernetes allows at most 3 of them to be evicted at once; further evictions are refused until replacement pods become Ready.
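kubectl describe pdb reports this headroom as the number of allowed disruptions. A simplified reading of it, using the PDB status field names (currentHealthy, desiredHealthy, disruptionsAllowed) and the 5-pod example:

```latex
\text{disruptionsAllowed} = \max\bigl(0,\ \text{currentHealthy} - \text{desiredHealthy}\bigr)

% With minAvailable: 2 and 5 healthy pods
\max(0,\ 5 - 2) = 3
```

If only 2 pods were healthy, the result would be 0 and every eviction request would be refused until more pods became Ready.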

Using maxUnavailable instead

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-frontend-pdb
  namespace: production
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: web-frontend

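maxUnavailable: 1 allows at most 1 pod to be unavailable at a time. It is often easier to reason about than minAvailable, and it keeps working as the replica count changes. A single PDB can set either minAvailable or maxUnavailable, not both, so treat this as an alternative to the previous definition.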

Multiple PDBs for complex applications

# PDB for API servers
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
  namespace: production
spec:
  minAvailable: 3
  selector:
    matchLabels:
      tier: api
---
# PDB for frontend
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: frontend-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      tier: frontend

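You can have multiple PDBs for different parts of an application. Keep their selectors from overlapping: a pod matched by more than one PDB cannot be evicted through the Eviction API.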

Checking PDB status

kubectl get pdb -n production
kubectl describe pdb web-frontend-pdb -n production

Pod Priority and Preemption

Pod priority affects scheduling order and eviction decisions during resource pressure. Higher priority pods preempt lower priority pods when the cluster is full.

Priority class definition

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-high
value: 1000
globalDefault: false
description: "Production workloads with high priority"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-medium
value: 500
globalDefault: true
description: "Standard production workloads"

globalDefault: true means pods without an explicit priority class use production-medium by default. Only one PriorityClass in a cluster can be the global default.

Assigning priority to pods

spec:
  priorityClassName: production-high

Critical workloads like payment processing use high priority. Background batch jobs use lower priority and get preempted when needed.
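As a sketch of the low-priority side, the manifest below defines a class for batch work and a Job that uses it. The names batch-low and nightly-report, the value 100, and the container are all hypothetical; preemptionPolicy: Never is an optional choice that keeps batch pods from preempting anything themselves.

```yaml
# Hypothetical low-priority class for batch work (name and value are illustrative)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: 100
globalDefault: false
preemptionPolicy: Never   # batch pods wait in the queue rather than preempting others
description: "Background batch jobs that may be preempted"
---
# A Job whose pods use the class
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report
  namespace: production
spec:
  template:
    spec:
      priorityClassName: batch-low
      restartPolicy: OnFailure
      containers:
        - name: report
          image: busybox:1.36
          command: ["sh", "-c", "echo generating report"]
```

Because 100 is below both production classes, these pods are the first candidates for preemption when a higher-priority pod cannot be scheduled.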

Multi-AZ Deployment Strategies

Distributing pods across availability zones protects against single-datacenter failures. Kubernetes nodes typically run in multiple zones within a region.

Zone labels

Nodes have topology labels:

topology.kubernetes.io/zone: us-east-1a
topology.kubernetes.io/region: us-east-1

Pod anti-affinity for zone spreading

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: "app"
                operator: In
                values:
                  - web-frontend
          topologyKey: topology.kubernetes.io/zone

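This forces web-frontend pods into different zones. If you have 3 replicas and 3 zones, each zone gets one pod. Because the rule is required, it also caps you at one pod per zone: a fourth replica stays Pending, which conflicts with an HPA that scales to 20. For autoscaled workloads, use preferredDuringSchedulingIgnoredDuringExecution or topologySpreadConstraints instead.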
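The same zone spreading can also be expressed with topologySpreadConstraints, which allows more replicas than zones while keeping them balanced. This is a minimal sketch: the image, resource requests, and replica count are placeholders, and whenUnsatisfiable: DoNotSchedule is one possible choice (ScheduleAnyway makes the rule a preference).

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
  namespace: production
spec:
  replicas: 3   # hypothetical starting count; omit when an HPA owns the replica count
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                                # zones may differ by at most one pod
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule          # hard rule; use ScheduleAnyway for a soft one
          labelSelector:
            matchLabels:
              app: web-frontend
      containers:
        - name: web-frontend
          image: registry.example.com/web-frontend:1.0   # placeholder image
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
```

With 3 zones this yields 1/1/1 at three replicas and 2/2/2 at six, so it stays compatible with an HPA that grows the Deployment.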

Storage considerations

Persistent volumes with cloud provider storage may have zone constraints. EBS volumes exist in a single availability zone. If you schedule a pod in us-east-1b but the EBS volume is in us-east-1a, the pod cannot start.

Use volumeBindingMode: WaitForFirstConsumer in your StorageClass to delay volume binding until scheduler placement:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer

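The volume is then provisioned in whichever zone the scheduler picks for the pod. On later restarts, the scheduler places the pod back in the zone where its volume lives.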

StatefulSet with zone awareness

spec:
  template:
    spec:
      # affinity belongs in the pod template, not at the top of the StatefulSet spec
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: postgres
              topologyKey: topology.kubernetes.io/zone
      containers:
        - name: postgres
          image: postgres:15

StatefulSets with zone-spread requirements may fail to schedule all replicas if zones are unavailable. Consider whether your application can operate with reduced replica count.

Cluster Federation Basics

Federation manages multiple Kubernetes clusters as a single logical cluster. You deploy workloads to a federated control plane that distributes them across member clusters.

Federation v2 (KubeFed) provides:

  • Cross-cluster scheduling: Deploy pods to multiple clusters
  • Cross-cluster service discovery: Access services across clusters
  • Replica placement: Distribute workloads based on geography

Federation architecture

┌─────────────────────────────────────────┐
│         Federated Control Plane         │
│  ┌─────────────┐  ┌──────────────────┐  │
│  │ KubeFed     │  │ Federated API    │  │
│  └─────────────┘  └──────────────────┘  │
└─────────────────────────────────────────┘
          │                      │
          ▼                      ▼
┌───────────────────┐  ┌───────────────────┐
│ Cluster us-east-1 │  │ Cluster eu-west-1 │
└───────────────────┘  └───────────────────┘

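Federation is complex and requires careful planning. The KubeFed project has also been archived by its maintainers, so check its status before adopting it. For most use cases, simpler approaches like GitOps with multiple clusters work better.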

Failure Simulation Testing

Testing failure scenarios validates your HA configuration. Tools like chaoskube and Litmus simulate failures to verify resilience.

Using chaoskube

helm install chaoskube chaoskube/chaoskube \
  --set namespaces={production} \
  --set schedule="*/5 * * * *" \
  --set replicas=1

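Chaoskube kills a pod every 5 minutes in the production namespace. If your replica counts and readiness probes are configured correctly, your application stays available. Direct pod deletions like these are not evictions, so they test redundancy rather than your PDB.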

Manual node drain testing

kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data --force

Draining a node simulates cluster maintenance. Verify:

  • PDBs are respected
  • Pods reschedule to other nodes
  • Application remains available behind its Service

Load testing

Verify HPA responds correctly under load:

kubectl run -it --rm load-generator \
  --namespace=production \
  --image=busybox \
  --restart=Never \
  -- /bin/sh -c "while true; do wget -q -O- http://web-frontend; done"

Monitor HPA behavior during the test:

watch kubectl get hpa -n production

Production Failure Scenarios

HPA Flapping

HPA that scales up and down too quickly is worse than no HPA. Pods getting created and destroyed constantly burn resources and generate logs.

This happens when the stabilization window is too short or the scaling threshold is too tight. A Deployment that scales to 20 replicas, triggers scale-down, then bounces back is not doing anyone any favors.

Set stabilizationWindowSeconds on scale-down to give the system time to settle. Five minutes is usually enough.

PDB Blocking Cluster Upgrades

A PDB with minAvailable set to the replica count blocks all drains. If you have 5 pods and minAvailable: 5, no pod ever gets evicted.

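The kubectl drain command keeps retrying and appears to hang. Cluster upgrades stall. Each eviction is rejected with an error like "Cannot evict pod as it would violate the pod's disruption budget."

Set minAvailable to the number of pods your service actually needs to keep serving, leaving room for at least one eviction. For stateless services, percentage-based PDBs like minAvailable: 50% are more practical.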
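A percentage-based budget might look like the sketch below. It reuses the web-frontend-pdb name and labels from earlier, so it would replace that PDB rather than sit alongside it; the 50% figure is the illustrative value from the text, not a recommendation for every service.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-frontend-pdb
  namespace: production
spec:
  minAvailable: "50%"   # percentages round up: with 7 matching pods, 4 must stay available
  selector:
    matchLabels:
      app: web-frontend
```

Because the floor is computed from the current pod count, it moves with the Deployment as an HPA scales it, which a fixed number cannot do.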

All Replicas in One AZ

If you deploy 3 replicas without any topology awareness, nothing prevents all 3 from landing in the same AZ. If that AZ fails, they disappear simultaneously and your service goes down even though you thought you had redundancy.

Use topologySpreadConstraints or podAntiAffinity with topology.kubernetes.io/zone. Explicitly distribute across zones.

Anti-Patterns

minReplicas: 1

A single replica has no availability during restarts, upgrades, or failures. One node reboot or a single eviction takes down your service.

Set minReplicas to at least 2 for anything you care about.

PDBs Set Too Aggressively

A PDB requiring all replicas to stay up blocks cluster maintenance entirely. You cannot drain nodes, you cannot upgrade the cluster, you cannot rotate infrastructure.

Set PDBs to the minimum your service needs, not the maximum.

HPA Without PDB

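During a node drain without a PDB, every pod on that node can be evicted at once. The remaining pods absorb the load, utilization spikes, and HPA scales up to cover a shortage the drain itself created, then scales back down once replacements are Ready. A PDB paces the evictions and avoids this thrash.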

Pair HPA with PDBs for any production workload.

Vertical Pod Autoscaler Integration

VPA adjusts pod resource requests automatically based on actual usage, complementing HPA which adjusts replica counts. Use VPA for workloads where CPU/memory sizing is tricky and you want requests derived from observed usage rather than guesswork.

VPA Recommendation Modes

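| Mode | Behavior | Use Case |
| --- | --- | --- |
| Off | No action, only publishes recommendations | Testing VPA before enabling |
| Initial | Sets resources only at pod creation | New deployments |
| Auto | Sets resources at creation and updates running pods, currently by evicting them like Recreate | Mature workloads in staging |
| Recreate | Evicts pods to apply updated resources | Workloads that tolerate restarts |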

VPA Configuration

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-backend
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: api
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi

VPA with HPA

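VPA handles vertical scaling (resource requests) while HPA handles horizontal scaling (replica count). They can run together as long as they do not both act on the same resource: if HPA scales on CPU utilization, restrict VPA to memory, or drive HPA from custom metrics instead.

# VPA for resource sizing (memory only, because the HPA below scales on CPU)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-backend
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["memory"]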
---
# HPA for replica scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-backend
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60

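Note: HPA utilization targets are computed against resource requests, so when VPA changes a request, the same usage produces a different utilization figure and HPA may rescale. Avoid having VPA and HPA act on the same resource, and set minReplicas high enough that the pod restarts VPA triggers in Auto mode do not drop capacity below what you need.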

Observability Hooks for HA

Key Metrics to Track

# HPA scaling conditions
kube_horizontalpodautoscaler_status_condition{condition="AbleToScale"}

# HPA current vs desired replicas
kube_horizontalpodautoscaler_status_current_replicas / kube_horizontalpodautoscaler_status_desired_replicas

# PDB headroom (0 means evictions are currently blocked)
kube_poddisruptionbudget_status_pod_disruptions_allowed

# Container restarts (crashes and OOM kills; evicted pods are replaced, not restarted)
kube_pod_container_status_restarts_total

# Node readiness (join with node labels to group by zone)
kube_node_status_condition{condition="Ready",status="true"}

Key Events to Log

  • HPA scale-up and scale-down events (check kubectl get events --watch)
  • PDB eviction blocks (pod cannot be evicted due to minAvailable)
  • VPA resource recommendations applied
  • Pod preemption events (higher priority pods evict lower ones)

Key Alerts to Configure

| Alert | Condition | Severity |
| --- | --- | --- |
| HPA at max replicas | kube_horizontalpodautoscaler_status_current_replicas == kube_horizontalpodautoscaler_spec_max_replicas for >5min | Warning |
| HPA flapping | Scale direction changes 4+ times in 10min | Warning |
| PDB blocking evictions | kube_poddisruptionbudget_status_pod_disruptions_allowed == 0 for a sustained period | Critical |
| Pod preemption events | scheduler_preemption_attempts_total increasing | Info |
| VPA recommendations ignored | Pod OOMKilled despite VPA recommendations | Warning |
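Two of these alerts could be sketched as a Prometheus rule file like the one below. It assumes kube-state-metrics is being scraped and uses the metric and label names I understand it to expose; the group name, alert names, and the 15-minute window for the PDB alert are illustrative choices to check against your own setup.

```yaml
groups:
  - name: ha-autoscaling   # hypothetical group name
    rules:
      - alert: HPAAtMaxReplicas
        expr: |
          kube_horizontalpodautoscaler_status_current_replicas
            == kube_horizontalpodautoscaler_spec_max_replicas
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "HPA {{ $labels.horizontalpodautoscaler }} has been at max replicas for 5 minutes"
      - alert: PDBBlockingEvictions
        expr: kube_poddisruptionbudget_status_pod_disruptions_allowed == 0
        for: 15m   # assumed window; short dips to 0 are normal during a drain
        labels:
          severity: critical
        annotations:
          summary: "PDB {{ $labels.poddisruptionbudget }} allows no disruptions"
```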

Debug Commands

# Check HPA current state
kubectl get hpa -n production

# Watch HPA scaling decisions
kubectl get events -n production --watch --field-selector involvedObject.kind=HorizontalPodAutoscaler

# Check PDB status
kubectl get pdb -n production
kubectl describe pdb web-frontend-pdb -n production

# Check VPA recommendations (without applying)
kubectl get vpa -n production -o yaml | grep -A 50 recommendation

# Check for preemption events
kubectl get events -n production --sort-by='.lastTimestamp' | grep -i preempt

Interview Questions

1. Your multi-AZ Kubernetes cluster lost one availability zone. Walk through what happens and how you recover.

Nodes in the lost AZ become unreachable. The node controller in kube-controller-manager stops receiving heartbeats, sets their Ready condition to Unknown, and taints them as unreachable. After the default toleration period (about five minutes), pods on those nodes are marked for deletion; because the kubelets cannot confirm it, the pods show as Terminating or Unknown. Deployment and ReplicaSet controllers create replacements on nodes in the surviving zones, provided those zones have spare capacity or the Cluster Autoscaler can add it. PDBs play no part here, since they only govern voluntary evictions. StatefulSets behave differently: their pods are not replaced until the old pod is confirmed gone, and zonal PersistentVolumes in the lost AZ cannot follow a pod to another zone. Recovery steps: once the AZ returns, verify node and cluster health, check PersistentVolumeClaims, and force-delete any pods stuck in Terminating (kubectl delete pod --grace-period=0 --force) only after confirming the old instance is really gone. Finally, verify that replica counts are back to the desired values and that the autoscalers behaved as expected.

2. You have a 3-node etcd cluster and one node fails. What happens and how do you recover?

With 3-node etcd, quorum is 2, so you can tolerate 1 member failure without cluster unavailability. The cluster continues serving requests with 2 nodes, but it cannot survive a second failure until the member is restored. While it is down, the failed member falls behind the leader's log. Recovery depends on whether the data is recoverable: if the node comes back quickly, etcd automatically rejoins and catches up from the leader. If the node is permanently lost, you must remove it from the etcd cluster using etcdctl member remove, then add a new member using etcdctl member add. For managed Kubernetes (GKE, EKS), the control plane handles this automatically; managed etcd redundancy is one of the main benefits of managed clusters. For self-managed: run etcd on dedicated nodes and monitor etcd cluster health metrics.
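The tolerance figure follows from Raft's majority rule. A short derivation, where n is the number of etcd members:

```latex
\text{quorum}(n) = \left\lfloor \frac{n}{2} \right\rfloor + 1
\qquad
\text{tolerated failures}(n) = n - \text{quorum}(n)

% n = 3: quorum 2, tolerates 1
% n = 4: quorum 3, tolerates 1
% n = 5: quorum 3, tolerates 2
```

A fourth member adds no extra tolerance over three, which is why odd cluster sizes are the usual choice.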

3. How do you design for zero-downtime upgrades of the Kubernetes control plane?

For managed clusters, rely on the platform's control plane upgrade: GKE, EKS, and AKS run redundant control plane instances and replace them in a rolling fashion, so the API stays reachable. For self-managed clusters, take an etcd backup first with etcdctl snapshot save. Then upgrade one control plane node at a time behind the API load balancer, so the remaining API servers keep serving. On each node, upgrade etcd if needed (the API server depends on it), then kube-apiserver, then controller-manager and scheduler; kubelets come last. If control plane nodes also run workloads, cordon and drain them like worker nodes before upgrading. The golden rule: never upgrade all control plane instances simultaneously, and always verify each component is healthy before proceeding to the next. For single-node development clusters, downtime is unavoidable, so schedule upgrades during maintenance windows.

4. How does HPA flapping occur and how do you prevent it?

Expected answer points:

  • HPA flapping is rapid scale-up and scale-down cycles that waste resources and cause instability
  • It happens when the stabilization window is too short or scaling thresholds are too tight
  • A Deployment that scales to 20 replicas, triggers scale-down, then bounces back is a classic symptom
  • Set stabilizationWindowSeconds on scale-down to give the system time to settle (5 minutes is typical)
  • Use scale-up stabilization for stateful services to avoid connection churn during brief load fluctuations

5. What is the difference between voluntary and involuntary disruptions in Kubernetes? Why does this distinction matter for PDB design?

Voluntary disruptions are actions initiated by administrators or automation: node drains for upgrades or repairs, Cluster Autoscaler node removal, rolling updates, scale-downs, and direct pod deletions. Involuntary disruptions are unplanned events outside cluster control: node hardware failures, cloud provider availability zone outages, network partitions, and kernel panics.

PDBs protect only against voluntary disruptions, and only those that go through the Eviction API, such as kubectl drain and Cluster Autoscaler scale-down. Deleting a pod directly or scaling a Deployment down bypasses them. This is why PDB design must account for actual risk: an involuntary AZ outage that takes down nodes is not covered by a PDB, although ReplicaSets recreate the pods on healthy nodes. The distinction matters because PDBs cannot prevent all failure modes; they only ensure minimum availability during deliberate cluster operations. For true multi-AZ protection, combine PDBs with podAntiAffinity or topologySpreadConstraints so that involuntary disruptions also leave minimum replicas available.

6. How do topology spread constraints improve upon podAntiAffinity for multi-AZ high availability?

topologySpreadConstraints provide more granular control than podAntiAffinity alone. Required podAntiAffinity keyed on the zone allows at most one matching pod per zone, so replicas beyond the zone count stay Pending; the preferred form lifts that cap but gives no guarantee about balance. topologySpreadConstraints let you define a maxSkew value that bounds the difference in pod count between topology domains, plus a whenUnsatisfiable setting that chooses between a hard rule (DoNotSchedule) and a soft one (ScheduleAnyway).

Example: with maxSkew=1 and topologyKey=topology.kubernetes.io/zone, the scheduler ensures that no zone has more than one extra pod compared to the least-loaded zone. If you have 3 replicas and 3 zones, you get exactly 1 pod per zone; with 6 replicas you get 2 per zone. If a zone fails and its nodes are removed from the cluster, the replacement pods are balanced across the remaining 2 zones. While the failed zone's nodes are still registered, a DoNotSchedule constraint can leave replacement pods Pending, so choose deliberately between DoNotSchedule and ScheduleAnyway. For StatefulSets where you need N replicas across K zones, this control is essential.
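The skew check can be written out explicitly. This is a simplified form of the rule in the Kubernetes documentation, where pods(z) counts matching pods in zone z and the minimum is taken over eligible zones; the 5-replica distributions are illustrative.

```latex
\text{skew}(z) = \text{pods}(z) - \min_{z'} \text{pods}(z')
\qquad
\text{placement allowed only if } \text{skew}(z) \le \text{maxSkew}

% 5 replicas, 3 zones, maxSkew = 1
% (2, 2, 1): largest skew = 2 - 1 = 1  -> allowed
% (3, 1, 1): largest skew = 3 - 1 = 2  -> rejected
```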

7. Explain how a Pod Disruption Budget interacts with a Horizontal Pod Autoscaler during a node drain operation.

The two interact through capacity, not directly. Draining cordons the node and then evicts its pods; the Deployment's ReplicaSet creates replacements on other nodes. HPA does not react to the missing pods themselves but to the metrics: if many pods are evicted at once, the survivors absorb the load, average utilization jumps, and HPA scales up to compensate for a shortage that the drain created. A few minutes later the extra pods are scaled down again, which is wasted churn.

With a PDB in place, evictions are rate-limited: the PDB prevents more than N pods being unavailable at once, so replacements become Ready before the next eviction and utilization stays near its target. However, if the PDB minAvailable is set too high (equal to the replica count), the drain blocks entirely. Best practice: set HPA minReplicas above the PDB floor so there is always headroom for at least one eviction. Pairing HPA with a properly configured PDB and using percentage-based values (e.g., minAvailable: 50%) keeps the budget proportional as HPA changes the replica count.

8. What are the trade-offs between minAvailable and maxUnavailable in a Pod Disruption Budget? When would you choose one over the other?

minAvailable specifies the minimum number of pods that must remain available — if you set 3, at least 3 pods must be up at all times. maxUnavailable specifies the maximum number of pods that can be unavailable simultaneously — if you set 1, at most 1 pod can be down.

minAvailable is expressed as an absolute number or percentage of total replicas. Use minAvailable when you know the minimum your service needs to handle current load, and you want the PDB to protect that floor. This is useful for stateful services where N replicas are needed for quorum.

maxUnavailable is easier to reason about when you want to express "at most X can be down." A PDB with maxUnavailable: 1 on a 3-replica deployment means at least 2 stay up during voluntary disruptions (67% of capacity). Use maxUnavailable when you care about the pace of disruption rather than an absolute floor, for example to let a drain proceed one pod at a time regardless of replica count.

For services where replica count might change dynamically (HPA-enabled), maxUnavailable as a percentage is safer because it automatically scales with replica count. For stateful services with quorum requirements, minAvailable absolute values are clearer.

9. How does a DaemonSet contribute to high availability in a Kubernetes cluster? What are its HA limitations?

A DaemonSet ensures that exactly one pod runs on each node (or on nodes matching a selector). For HA, this is valuable for log collectors, metrics agents, and networking plugins that must run everywhere. If a node fails, the DaemonSet controller does not recreate the pod on that node — the pod is gone with the node. On new nodes added to the cluster, DaemonSet pods are automatically scheduled. This means DaemonSets do not provide resilience against node failures the way ReplicaSets do.

DaemonSet HA limitations: there is no built-in replica count or scaling — a DaemonSet pod exists once per node, not N times regardless of node count. For a 5-node cluster, you get 5 pods. If 2 nodes fail, you have 3 pods. If your workload needs exactly 3 replicas for HA regardless of node count, a Deployment with 3 replicas is better. DaemonSets are best for cluster-level services (monitoring, logging, networking) rather than application HA.

10. Describe how you would design an ingress controller setup for high availability across availability zones.

An ingress controller (nginx, Traefik, cloud LB) should be deployed as multiple replicas distributed across AZs with podAntiAffinity using topology.kubernetes.io/zone. The cloud provider's load balancer sits in front of the ingress replicas and distributes traffic. For AWS ALB/NLB, enable cross-zone load balancing so the load balancer can route to any ingress pod in any AZ.

Key design points: use a Deployment (not DaemonSet) with at least 3 replicas and podAntiAffinity or topologySpreadConstraints to enforce zone spreading. Configure the cloud LB health check to detect ingress pod failures and remove failed pods from rotation. Set an appropriate readinessProbe so the LB only sends traffic to ready ingress pods. ingress-nginx replicas use leader election (see its election ID setting) so that only one replica writes Ingress status updates, which prevents conflicts. Add a PodDisruptionBudget on the ingress replicas so that node drains during upgrades do not take them all down at once.

11. What is the impact of API server latency on etcd write performance and overall cluster availability?

The causality mostly runs the other way: etcd write latency drives API server latency, because every API server write is persisted to etcd before it is acknowledged. If etcd write latency increases (e.g., from 10ms to 500ms), every kubectl apply, pod create, and deployment update slows with it. For a cluster handling 1000 writes per second, this compounds quickly.

etcd write latency is critical because etcd uses the Raft consensus protocol — writes require quorum from a majority of etcd nodes. Slow writes cause cascading effects: the API server times out waiting for etcd, kubelets report node heartbeats late, the scheduler becomes sluggish, and the controllers (ReplicaSet, Deployment) slow down. In the worst case, repeated API server timeouts cause watch streams to close, forcing controllers to re-list resources and potentially miss reconciliation events.

Best practices: keep etcd on dedicated nodes with SSD storage, monitor etcd_disk_wal_fsync_duration_seconds and etcd_server_leader_changes_total, and ensure etcd instances are in the same region as the API server. For managed Kubernetes, cloud providers handle this optimization.

12. How do you design a node pool strategy for production workloads requiring both compute optimization and high availability?

Node pools allow you to segment nodes by capability (compute optimized, memory optimized, GPU) while maintaining HA. For production, use multiple node pools: a general-purpose pool for standard workloads, a system pool for cluster critical components (DNS, monitoring), and an application pool sized for your workloads.

HA considerations for node pools: distribute nodes across AZs within each pool — a single large node pool with 10 nodes should have nodes in 3 AZs, not all 10 in one AZ. Use Cluster Autoscaler with node pool awareness so that scale-up events add nodes in the right AZ. Pod placement using topologySpreadConstraints and node affinity ensures your workloads use the right pool while maintaining zone balance. Do not mix system pods with application pods on the same nodes if you need guaranteed resource availability during node pressure.

For workloads with specific requirements (e.g., GPU for ML), a dedicated node pool with taints and tolerations isolates those workloads while allowing the autoscaler to scale that pool independently.
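A dedicated GPU pool is usually paired with a taint on its nodes and a matching toleration on the workloads. The sketch below assumes the pool's nodes carry a hypothetical label pool=gpu and a taint with key nvidia.com/gpu and effect NoSchedule; the pod name and image are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-job          # hypothetical workload
  namespace: production
spec:
  nodeSelector:
    pool: gpu                 # hypothetical node label applied to the GPU pool
  tolerations:
    - key: "nvidia.com/gpu"   # assumes the pool's nodes carry this taint
      operator: "Exists"
      effect: "NoSchedule"
  containers:
    - name: trainer
      image: registry.example.com/ml/trainer:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1
```

The taint keeps ordinary pods off the expensive nodes, while the nodeSelector keeps this pod from drifting onto the general-purpose pool.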

13. What is the relationship between PodDisruptionBudgets and readiness probes in maintaining service availability during disruptions?

Readiness probes determine whether a pod receives traffic through a Service endpoint. PDBs determine whether a pod can be evicted during voluntary disruptions. The two are linked: a PDB counts only Ready pods as healthy, so readiness probes decide how much disruption budget is left. A pod that fails its readiness probe is removed from Service endpoints, so it receives no traffic even if it is running, and it no longer counts toward the budget. A pod that passes its readiness probe counts toward the budget and can be evicted only while the budget allows it.

During a disruption, if a pod's readiness probe fails before eviction, the Service already removed it from endpoints — the eviction causes no traffic drop. If the readiness probe is too aggressive and marks pods as not ready during normal load fluctuations, you lose capacity unnecessarily. PDBs protect against eviction of pods that are otherwise healthy. Set readiness probes to detect genuine unavailability (process crash, health endpoint returning 500), not load-related latency spikes, so that pods remain available during normal traffic variation.
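A readiness probe tuned along these lines might look like the fragment below. It is a sketch for a pod template's containers section; the /healthz path, port 8080, image, and timings are illustrative assumptions, not values from any particular service.

```yaml
# Container fragment for a pod template (path, port, and timings are illustrative)
containers:
  - name: web-frontend
    image: registry.example.com/web-frontend:1.0   # placeholder image
    ports:
      - containerPort: 8080
    readinessProbe:
      httpGet:
        path: /healthz         # hypothetical health endpoint
        port: 8080
      periodSeconds: 5
      timeoutSeconds: 2
      failureThreshold: 3      # roughly 15s of consecutive failures before removal from endpoints
      successThreshold: 1
```

Requiring several consecutive failures keeps a single slow response from shrinking both serving capacity and the PDB's healthy count.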

14. Explain how control plane HA in managed Kubernetes differs from self-managed deployments in terms of etcd quorum and API server availability.

Managed Kubernetes (GKE, EKS, AKS) runs the control plane as a managed service with built-in HA. The etcd cluster is replicated across multiple nodes managed by the cloud provider, with automatic failure detection and recovery. API server instances are behind load balancers with health checks, and the platform handles rolling updates of control plane components without cluster downtime. You do not see or manage these nodes — you trust the SLA.

Self-managed HA control plane requires you to provision multiple etcd nodes (odd number for quorum), multiple API server nodes behind a load balancer, and multiple controller-manager and scheduler instances with leader election. The etcd cluster needs explicit quorum management: if you lose the majority, etcd stops accepting writes, so the cluster becomes read-only at best and unavailable at worst. You are responsible for upgrades, backups, and failure recovery. The trade-off is control versus operational burden. For most production workloads, managed Kubernetes with a replicated control plane (e.g., GKE regional clusters) provides sufficient HA without the operational complexity.
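The "odd number for quorum" rule falls out of simple majority arithmetic. The sketch below uses hypothetical helper names to show how many member failures each cluster size survives.

```python
def quorum(members):
    """Smallest majority of an etcd cluster with the given member count."""
    return members // 2 + 1


def failures_tolerated(members):
    """Members that can fail while a majority is still reachable."""
    return members - quorum(members)


for n in range(1, 8):
    print(f"members={n} quorum={quorum(n)} tolerated={failures_tolerated(n)}")
```

Four members tolerate one failure, the same as three, while adding another node that must acknowledge writes. That is why clusters are usually sized at 3 or 5.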

15. How does etcd write performance affect Kubernetes cluster responsiveness, and what metrics would you monitor to detect degradation early?

etcd is the state store for the entire cluster: every object (pods, services, configmaps, secrets) is stored there, so etcd write latency sets a floor on API server write latency. The critical metrics to monitor: etcd_disk_wal_fsync_duration_seconds (write-ahead log sync time; the 99th percentile should stay under 10ms), etcd_server_leader_changes_seen_total (frequent leader changes indicate network or load issues), etcd_server_proposals_failed_total (failed Raft proposals indicate quorum problems), and apiserver_request_duration_seconds (API latency spikes often trace back to etcd).

As a rough guide, once etcd write latency climbs above 50ms, API server request latency rises with it. Kubelet node status and lease updates start timing out, which can cause nodes to be marked NotReady and their pods Unknown. Controller reconciliation loops slow down, causing delayed deployment rollouts and HPA reactions. At extreme latency (>500ms), the cluster can become effectively unresponsive to kubectl commands. Defensive measures: run etcd on SSD-backed dedicated nodes, scrape etcd metrics with Prometheus, and alert on etcd_request_duration_seconds (the API server's view of etcd latency) and etcd_server_leader_changes_seen_total.
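To make the 10ms guideline concrete, here is a minimal sketch that checks the 99th percentile of WAL fsync durations against that target. It assumes you already have raw duration samples in seconds; in practice Prometheus would compute the percentile from histogram buckets. The function names and sample values are illustrative only.

```python
FSYNC_P99_TARGET_SECONDS = 0.010  # the 10 ms guideline quoted above


def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = -(-pct * len(ordered) // 100)  # integer ceiling division
    return ordered[rank - 1]


def fsync_status(samples):
    """Return 'ok' when the p99 fsync duration is under the target."""
    p99 = percentile(samples, 99)
    return "ok" if p99 < FSYNC_P99_TARGET_SECONDS else "degraded"


# Made-up samples: one slow outlier versus a sustained slow tail.
mostly_fast = [0.002] * 98 + [0.004, 0.030]
slow_tail = [0.002] * 90 + [0.020] * 10

print(fsync_status(mostly_fast))  # ok
print(fsync_status(slow_tail))    # degraded
```

Alerting on the percentile rather than the maximum keeps a single slow fsync from paging anyone, while a sustained slow tail still trips the check.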

16. Describe the failure modes of a StatefulSet with PersistentVolumes during an availability zone outage and how to mitigate them.

StatefulSet failure modes during an AZ outage depend heavily on storage binding. StatefulSets with PersistentVolumeClaims bound to zonal volumes lose access to those volumes if the AZ goes down. Pods cannot reschedule into another zone because their storage lives in the failed one. The StatefulSet controller will not create a replacement while the old pod is stuck on an unreachable node, so recovery either waits for the AZ to come back or requires manually force-deleting the stuck pods; even then, a replacement bound to a zonal volume stays Pending until that zone recovers.

Mitigation strategies: use regional storage (e.g., AWS EFS, GCE regional persistent disks) that replicates data across AZs rather than zonal volumes. For databases that need low-latency storage, keep zone-pinned volumes but use topology spread constraints or podAntiAffinity so that replicas never all land in the same AZ; a PDB limits evictions and does not control placement. For critical StatefulSets, deploy at least one replica in each AZ. Accept that during an AZ outage, your StatefulSet will have reduced capacity, and design your application to handle a degraded replica count with reduced throughput.
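A quick way to reason about placement is to ask what survives each single-zone failure. The sketch below is a back-of-the-envelope model with hypothetical zone names and helper functions; it assumes replicas in a failed zone are simply unavailable until the zone returns.

```python
def survivors_after_zone_loss(replicas_by_zone):
    """Map each zone to the replica count left if that zone fails."""
    total = sum(replicas_by_zone.values())
    return {zone: total - count for zone, count in replicas_by_zone.items()}


def keeps_majority(replicas_by_zone):
    """True if any single-zone failure still leaves a majority of replicas."""
    total = sum(replicas_by_zone.values())
    majority = total // 2 + 1
    remaining = survivors_after_zone_loss(replicas_by_zone).values()
    return all(left >= majority for left in remaining)


spread = {"zone-a": 1, "zone-b": 1, "zone-c": 1}
packed = {"zone-a": 2, "zone-b": 1}

print(survivors_after_zone_loss(spread), keeps_majority(spread))
print(survivors_after_zone_loss(packed), keeps_majority(packed))
```

With one replica per zone, any outage leaves two of three members. With two replicas in one zone, losing that zone leaves a single member and a quorum-based database stops accepting writes.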

17. What strategies would you use to perform a zero-downtime Kubernetes version upgrade on a self-managed cluster with multiple control plane nodes?

Zero-downtime control plane upgrades require careful sequencing. First, back up etcd using etcdctl snapshot save. Then upgrade etcd on all control plane nodes (do not upgrade multiple nodes simultaneously — upgrade one, verify health, then proceed). etcd must be upgraded before the API server because the API server depends on etcd. After etcd is healthy, upgrade the API server on one node at a time, verifying that kubectl continues to work after each node's upgrade.

After API server, upgrade controller-manager and scheduler. These use leader election, so the non-leader can be upgraded first with no service impact. The leader upgrade requires failing over leadership — this is handled automatically by the component but should be monitored. Finally, upgrade kubelet and kube-proxy on worker nodes using cordon, drain, upgrade, uncordon sequence. Use kubectl drain with PodDisruptionBudgets to ensure applications remain available throughout the node upgrade process. Never upgrade more than one control plane node at a time, and always verify cluster health after each component upgrade before proceeding.

18. How does pod affinity and anti-affinity interact with the Kubernetes scheduler to distribute workloads across topology domains?

Pod affinity rules tell the scheduler to place pods relative to other pods (same node, same zone, or different zone). Anti-affinity tells the scheduler to avoid placing pods near other pods. During scheduling, the scheduler evaluates these rules alongside resource availability and other constraints.

requiredDuringSchedulingIgnoredDuringExecution enforces hard constraints: if no node satisfies the rule (e.g., no node in the required zone has capacity), the pod stays Pending. preferredDuringSchedulingIgnoredDuringExecution is a soft preference: the scheduler tries to satisfy it but will place the pod elsewhere if it cannot. In both cases the rule applies only at scheduling time; pods that are already running are not moved if conditions change later. For zone spreading, required anti-affinity with topologyKey=topology.kubernetes.io/zone gives strict separation, but it allows at most one matching pod per zone, so any replicas beyond the number of zones stay Pending. Use the preferred form for soft balancing. The scheduler treats each distinct value of the topologyKey label as a separate domain. Anti-affinity only keeps matching pods out of the same domain; it does not balance counts across domains, which is the job of topologySpreadConstraints.

19. You have a Deployment with 5 replicas using a PDB with minAvailable: 4. During a rolling update, what happens when you try to drain a node?

A rolling update and a node drain remove pods through different paths. The Deployment controller deletes old pods directly, within the limits of its own maxSurge and maxUnavailable settings; it does not use the Eviction API, so the PDB never blocks the rollout itself. kubectl drain does use the Eviction API, so every eviction it requests is checked against the PDB. With 5 replicas and the default strategy (maxSurge 25%, rounded up to 2; maxUnavailable 25%, rounded down to 1), the rollout can run up to 7 pods at once and keeps at least 4 of them available.

The PDB counts only Ready pods, old and new alike, as long as they match its selector. With minAvailable: 4, the number of evictions allowed at any moment is the number of Ready pods minus 4. If 6 pods are Ready when the drain starts, 2 can be evicted; the drain removes them from the node and the ReplicaSet controller recreates them on other nodes. If the rollout has already taken the Deployment down to 4 Ready pods, no evictions are allowed, and the drain waits and retries until new pods become Ready. After the rolling update completes, you are back to 5 pods. The two settings therefore need to be coordinated: pods that are unavailable because of the rollout still count against the budget, so a PDB with little headroom (here, one pod in steady state) can stall node drains for the duration of a rollout, and a PDB with no headroom (minAvailable equal to the replica count) blocks drains entirely.
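The arithmetic for this scenario can be checked with a short sketch. It assumes the Deployment uses the default rolling update strategy of 25% maxSurge and 25% maxUnavailable, which the question does not state; the helper names are illustrative.

```python
def rolling_update_bounds(replicas, max_surge_pct=25, max_unavailable_pct=25):
    """Pod-count limits a Deployment observes during a rolling update.

    Percentages follow the Deployment rounding rules: maxSurge rounds up,
    maxUnavailable rounds down. Integer math avoids float rounding.
    """
    max_surge = -(-replicas * max_surge_pct // 100)          # ceiling
    max_unavailable = replicas * max_unavailable_pct // 100  # floor
    return {
        "max_pods": replicas + max_surge,
        "min_available": replicas - max_unavailable,
    }


def evictions_allowed(ready_pods, pdb_min_available):
    """Evictions the PDB permits for a given number of Ready pods."""
    return max(0, ready_pods - pdb_min_available)


bounds = rolling_update_bounds(5)
print(bounds)  # {'max_pods': 7, 'min_available': 4}

# Worst moment of the rollout: only 4 pods Ready, so the drain must wait.
print(evictions_allowed(bounds["min_available"], pdb_min_available=4))  # 0
# Best moment: all surged pods Ready.
print(evictions_allowed(bounds["max_pods"], pdb_min_available=4))       # 3
```

The spread between 0 and 3 is why a drain issued mid-rollout can appear to hang and then suddenly proceed.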

20. What are the specific trade-offs when choosing between cluster-level anti-affinity (spreading across all zones) versus node-level anti-affinity (spreading across nodes within a single zone) for a stateless web application?

Cluster-level anti-affinity with topologyKey=topology.kubernetes.io/zone spreads pods across availability zones. This protects against entire AZ failures but increases network latency — a pod in us-east-1a talking to a pod in us-east-1b has higher latency than two pods in the same AZ. For stateless web applications where requests are short-lived and load-balanced, this latency is usually acceptable noise.

Node-level anti-affinity (topologyKey=kubernetes.io/hostname) spreads pods across individual nodes within a single AZ. This protects against node-level failures (hardware failure, kernel panic) but not AZ failures. Network latency between nodes in the same AZ is minimal. For latency-sensitive applications where pods frequently call each other, node-level spreading may be preferable for lower network latency, accepting the risk of AZ-level failures.

For production internet-facing applications, the best approach combines both: use a hard topologySpreadConstraints entry with maxSkew=1 and topologyKey=topology.kubernetes.io/zone to spread across zones, plus a second, soft entry (whenUnsatisfiable: ScheduleAnyway) with topologyKey=kubernetes.io/hostname to also spread across nodes within a zone when possible. This gives AZ-level failure protection plus best-effort node-level spreading.
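How maxSkew=1 steers placement can be illustrated with a simplified model of the check. This sketch considers pod counts per zone only; it ignores node capacity, taints, and the other filters the real scheduler applies. The zone names and function are hypothetical.

```python
def allowed_zones(pods_by_zone, max_skew=1):
    """Zones where one more matching pod keeps the skew within max_skew.

    A zone qualifies when (its count + 1) minus the lowest count across
    all zones does not exceed max_skew.
    """
    lowest = min(pods_by_zone.values())
    return sorted(
        zone
        for zone, count in pods_by_zone.items()
        if count + 1 - lowest <= max_skew
    )


uneven = {"zone-a": 2, "zone-b": 2, "zone-c": 1}
balanced = {"zone-a": 2, "zone-b": 2, "zone-c": 2}

print(allowed_zones(uneven))    # ['zone-c']
print(allowed_zones(balanced))  # ['zone-a', 'zone-b', 'zone-c']
```

With a hard constraint the next pod can only land in the lagging zone; with a soft one the same calculation just influences scoring.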

Conclusion

High availability on Kubernetes requires multiple layers of protection. HPA handles traffic spikes by scaling pods horizontally. Pod Disruption Budgets ensure minimum availability during cluster operations. Multi-AZ deployments protect against datacenter failures.

Configure HPA with appropriate min and max replica counts and tuning parameters for scale-up and scale-down behavior. Set PDBs for all production workloads. Distribute StatefulSets and Deployments across availability zones using pod anti-affinity rules.

Test your HA setup with chaos engineering tools. Simulate node failures, pod evictions, and traffic spikes to verify your configuration handles real-world scenarios.

For more on advanced Kubernetes patterns, see the Advanced Kubernetes post.
