Health Checks: Liveness, Readiness, and Service Availability

Master health check implementation for microservices including liveness probes, readiness probes, and graceful degradation patterns.

published: March 24, 2026 reading time: 29 min read author: GeekWorkBench updated: June 17, 2026

Quick Summary

Health checks tell Kubernetes when to restart your container, when to route traffic to it, and when to give it time to initialize. Liveness probes catch deadlocks, readiness probes gate traffic routing, and startup probes handle slow-starting applications. The key rule: liveness checks must stay minimal and local, because checking dependencies in a liveness probe creates restart loops. This guide walks through probe configuration, dependency checking patterns, and the probe parameters that actually matter for production reliability.

Introduction

In distributed systems, your services do not exist in isolation. They call each other, depend on databases and caches, and serve traffic through load balancers. When a service starts failing, the rest of the system needs to know quickly so it can stop routing traffic to the failing instance, trigger restarts when appropriate, and alert operators before problems cascade. Health checks provide that visibility.

A properly implemented health check system tells Kubernetes when to route traffic to your pod, tells your load balancer which instances are ready, and gives your monitoring system early warning before problems escalate. Without health checks, you get cascading failures, traffic sent to dead instances, and problems that compound silently until they take down your entire application.

This article covers the three probe types Kubernetes provides, how to implement health endpoints in your services, how to handle deep health checks for dependencies, and the configuration patterns that keep your system resilient when individual services fail.

The Three Probe Types

Kubernetes distinguishes between three states a pod can be in. Each state has a corresponding probe type that determines how Kubernetes manages the pod’s lifecycle and traffic routing.

graph TD
    A[Pod Starting] --> B{Startup Probe}
    B -->|Failing| C[Initializing]
    C --> B
    B -->|Passed| D{Liveness Probe}
    B -->|Passed| F{Readiness Probe}
    D -->|Failing| E[Restarting]
    D -->|Healthy| I[Keep Running]
    F -->|Failing| G[Remove from Traffic]
    F -->|Passing| H[Receive Traffic]
    E --> B
    G --> F

Liveness Probe: Is the Process Alive?

The liveness probe answers a simple question: is the process running and responsive? If the liveness probe fails, Kubernetes restarts the container. This handles situations where the process is alive but stuck in a deadlock or unresponsive state.

A basic liveness probe configuration looks like this:

livenessProbe:
  httpGet:
    path: /live
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3

The liveness probe waits 10 seconds after startup before the first check. Then it checks every 15 seconds. If the check takes more than 5 seconds, it counts as a failure. After 3 consecutive failures, Kubernetes restarts the container.

Keep liveness probes simple. A liveness probe that checks dependencies will restart your service whenever your database is temporarily unavailable, which makes outages worse, not better.

Readiness Probe: Can the Service Accept Traffic?

The readiness probe answers: can this instance handle requests right now? A service might be running but not ready if it is warming up, loading configuration, or recovering from a dependency outage.

When the readiness probe fails, Kubernetes removes the pod from the service endpoint slice. Traffic stops being routed to that instance. The pod keeps running and the probe keeps checking. When the probe passes again, traffic resumes.

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 2

Use readiness probes for checks that verify dependencies. If your service needs a database connection and a cache to serve requests properly, the readiness probe should verify both. Keep the probe fast to avoid removing instances unnecessarily during brief slowdowns.

Startup Probe: The Initialization Grace Period

The startup probe handles applications that need significant time to initialize. If your service takes 30 seconds to start, a liveness probe that starts checking after 10 seconds will kill the container before it is ready.

The startup probe delays all other probes until it succeeds:

startupProbe:
  httpGet:
    path: /started
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 12

With 5-second intervals and 12 failures allowed, the startup probe gives your service up to 60 seconds to initialize. Once the startup probe passes, Kubernetes switches to the liveness and readiness probes.

Startup probes suit applications that load large models, warm up JIT compilers, or perform initial data loads at startup.

Implementing Health Endpoints

Your service needs to expose endpoints that Kubernetes can query. Plan for three endpoints: a liveness endpoint for basic aliveness, a readiness endpoint that checks dependencies, and optionally a startup endpoint for initialization.

Basic Health Endpoint

The liveness endpoint should be trivially simple. It checks nothing except whether the HTTP server can respond:

@app.get("/live")
def liveness():
    return {"status": "alive"}

This endpoint must not check dependencies. If your database is down, this endpoint still returns healthy and Kubernetes keeps the container running, which is what you want because a restart cannot fix the database. If the endpoint itself fails because the process is deadlocked, Kubernetes restarts the container, which is the desired behavior.

Readiness Endpoint with Dependency Checks

The readiness endpoint verifies your service can handle traffic:

@app.get("/ready")
def readiness():
    # HealthCheckFailed is an application-defined exception. In most web
    # frameworks an unhandled exception becomes a 5xx response, which
    # Kubernetes counts as a failed probe.

    # Check database connectivity
    try:
        db.execute("SELECT 1")
    except Exception as e:
        raise HealthCheckFailed("Database unavailable") from e

    # Check cache connectivity
    try:
        cache.ping()
    except Exception as e:
        raise HealthCheckFailed("Cache unavailable") from e

    # Check downstream services
    for service in dependent_services:
        if not service.is_healthy():
            raise HealthCheckFailed(f"{service.name} unavailable")

    return {"status": "ready"}

Keep readiness checks fast. A dependency check that hangs for 5 seconds before timing out keeps the pod in rotation, serving bad responses, for at least those 5 seconds. Set timeouts aggressively and fail fast.
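The readiness handler above raises HealthCheckFailed without showing how that becomes a failing probe. One way to make the mapping explicit, sketched here with only the standard library, is to run each check and translate the outcome into a status code: 200 when everything passes, 503 otherwise. The HealthCheckFailed class, the run_readiness_checks helper, and the check names are assumptions for illustration, not part of any framework.

```python
from typing import Callable, Dict, Tuple


class HealthCheckFailed(Exception):
    """Raised by a check when its dependency is unusable (hypothetical helper)."""


def run_readiness_checks(
    checks: Dict[str, Callable[[], None]],
) -> Tuple[int, dict]:
    """Run every check and return (http_status, json_body).

    Kubernetes treats any status from 200 to 399 as success, so a failed
    check is reported as 503 to take the pod out of rotation.
    """
    failures = {}
    for name, check in checks.items():
        try:
            check()
        except HealthCheckFailed as exc:
            failures[name] = str(exc)
        except Exception as exc:  # unexpected errors also mean "not ready"
            failures[name] = f"unexpected error: {exc}"

    if failures:
        return 503, {"status": "not_ready", "failures": failures}
    return 200, {"status": "ready"}


def failing_database_check() -> None:
    # Stand-in for a real query such as SELECT 1.
    raise HealthCheckFailed("Database unavailable")


status, body = run_readiness_checks(
    {"database": failing_database_check, "cache": lambda: None}
)
print(status, body)
```

Reporting every failing check in the body, rather than stopping at the first, makes the endpoint more useful when a person calls it while debugging.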

Startup Endpoint

The startup endpoint reports whether initialization has finished, and it is only consulted until the startup probe first succeeds:

@app.get("/started")
def startup():
    if not initialization_complete.is_set():
        raise HealthCheckFailed("Still initializing")
    return {"status": "started"}

Once the startup probe succeeds, Kubernetes stops running it for the lifetime of that container and hands over to the liveness and readiness probes, so this endpoint can simply keep returning healthy.
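The handler above reads a flag called initialization_complete but does not show where it comes from. A minimal sketch, assuming the flag is a threading.Event that a background initialization thread sets when it finishes; the initialize and startup_status names are hypothetical:

```python
import threading
from typing import Callable, Iterable, Tuple

# Set once by the initialization thread; read by the /started handler.
initialization_complete = threading.Event()


def initialize(load_steps: Iterable[Callable[[], None]]) -> None:
    """Run each startup step in order, then flip the flag exactly once."""
    for step in load_steps:
        step()
    initialization_complete.set()


def startup_status() -> Tuple[int, dict]:
    """Return (http_status, body) for a hypothetical /started handler."""
    if not initialization_complete.is_set():
        return 503, {"status": "initializing"}
    return 200, {"status": "started"}


before = startup_status()
worker = threading.Thread(target=initialize, args=([lambda: None],))
worker.start()
worker.join()  # a real service would keep serving requests while this runs
after = startup_status()
print(before, after)
```

An Event is safe to read from request threads while another thread sets it, which is why it suits this handoff better than a bare boolean guarded by convention.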

Deep Health Checks

Simple endpoints that just return “healthy” catch process crashes but miss dependency failures. Deep health checks verify your dependencies are actually working.

Database Connectivity

Do not just check if the database process is running. Check if your application can execute queries:

def check_database():
    try:
        with db.connection() as conn:
            cursor = conn.cursor()
            cursor.execute("SELECT 1")
            result = cursor.fetchone()
            if result[0] != 1:
                raise HealthCheckFailed("Database query failed")
    except OperationalError:
        raise HealthCheckFailed("Database connection failed")

For PostgreSQL and MySQL, SELECT 1 works. For MongoDB, run the ping command, for example client.admin.command('ping') with PyMongo.

Cache Verification

Caches fail silently in most configurations. Verify your cache is actually storing and retrieving data:

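import time
import uuid

def check_cache():
    # Unique key per check avoids collisions in shared cache environments
    test_key = f"health_check:{uuid.uuid4()}"
    test_value = str(time.time())
    try:
        cache.set(test_key, test_value, ex=10)  # expires in 10s if delete never runs
        retrieved = cache.get(test_key)
        cache.delete(test_key)
    except Exception as e:
        raise HealthCheckFailed(f"Cache check failed: {e}") from e

    # Clients such as redis-py return bytes unless decode_responses=True
    if isinstance(retrieved, bytes):
        retrieved = retrieved.decode()
    if retrieved != test_value:
        raise HealthCheckFailed("Cache read/write mismatch")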

Use a unique key per check to avoid collisions in shared cache environments.

Service Mesh Health Checks

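When running behind a service mesh like Istio, the kubelet still runs the probes defined in your pod spec. Istio's sidecar injector rewrites HTTP probes by default so that they pass through the sidecar agent, which keeps them working when mutual TLS is enforced. Pod readiness gates are a separate Kubernetes mechanism that lets an external controller contribute a condition to pod readiness. The condition type below is illustrative only; use the name your mesh or controller documents: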

readinessGates:
  - conditionType: "envoy.kubernetes.io/ready"

Your application still needs to expose a health endpoint for orchestration systems and load balancers that do not use Envoy’s sidecar proxy.

Kubernetes Configuration

Probe Configuration Options

Each probe type supports the same configuration parameters.

Parameter	Purpose	Typical Value
`initialDelaySeconds`	Wait before first check	Liveness: 10-30s, Readiness: 5-10s
`periodSeconds`	How often to check	10-15s for liveness, 5-10s for readiness
`timeoutSeconds`	How long to wait for a response before counting a failure	3-5s
`failureThreshold`	Failures before taking action	Liveness: 3, Readiness: 2
`successThreshold`	Consecutive successes to recover	Must be 1 for liveness and startup; 1 or more for readiness

Common Mistakes in Probe Configuration

Setting initialDelaySeconds too low causes premature failures. Your application needs time to start before Kubernetes starts checking. Set this based on your observed startup time, not your desired startup time.

Setting periodSeconds too short causes excessive load from health check requests. Setting it too long delays detection of failures. 10-15 seconds balances quick detection with minimal overhead.

Setting failureThreshold too low causes unnecessary restarts from transient issues. Setting it too high delays failure detection. For liveness probes, 3 failures over 45 seconds is reasonable. For readiness probes, 2 failures over 20 seconds balances sensitivity with stability.

Verifying Probe Configuration

Use kubectl to inspect probe configuration and test probes manually:

# Describe pod probe configuration
kubectl describe pod my-pod | grep -A 10 "Liveness"
kubectl describe pod my-pod | grep -A 10 "Readiness"

# Port-forward to test health endpoints
kubectl port-forward my-pod 8080:8080
curl http://localhost:8080/live
curl http://localhost:8080/ready

# Check pod status
kubectl get pod my-pod -o jsonpath='{.status.conditions[*]}'

Health Check Best Practices

Timeouts and Retries

Health check timeouts should be much shorter than your request timeout. If your service times out requests at 30 seconds and the probe also waits 30 seconds before giving up, every failed probe attempt costs 30 seconds and detection takes minutes instead of seconds.

For readiness probes checking dependencies, set timeouts at 2-3 seconds. Most dependency checks should complete in milliseconds. A 3-second timeout catches genuine problems without false positives from brief slowdowns.
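A deadline can also be enforced inside the endpoint, so that one slow dependency cannot hold the whole readiness response past the probe timeout. The sketch below uses the standard library's thread pool; passes_within is a hypothetical helper name, and the timings in the demonstration are shortened so it runs quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable

# One shared pool so a slow check does not block the request thread.
_check_pool = ThreadPoolExecutor(max_workers=4)


def passes_within(check: Callable[[], None], timeout_seconds: float) -> bool:
    """Return True only if check() finishes without raising before the deadline."""
    future = _check_pool.submit(check)
    try:
        future.result(timeout=timeout_seconds)
        return True
    except FutureTimeout:
        # The worker thread keeps running; only the caller stops waiting.
        return False
    except Exception:
        return False


fast_ok = passes_within(lambda: None, timeout_seconds=0.5)
slow_ok = passes_within(lambda: time.sleep(0.3), timeout_seconds=0.05)
print(fast_ok, slow_ok)
```

Because a Python thread cannot be cancelled from outside, this pattern only bounds how long the caller waits. Where the client library offers its own connect and read timeouts, setting those is usually the better first step.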

Do not implement retry logic in health checks. Kubernetes handles retries at the probe level. If a health check fails, Kubernetes retries based on failureThreshold. Adding your own retry logic inside the health check endpoint adds latency and complexity without benefit.

Fallbacks and Graceful Degradation

When health checks fail, have a plan for degraded operation. If your recommendation service cannot reach its ML model, return popularity-based recommendations instead of errors. If your search service cannot reach Elasticsearch, fall back to database-backed search.

@app.get("/ready")
def readiness():
    try:
        check_database()
    except HealthCheckFailed:
        # Can we serve read-only traffic?
        if not app.allow_read_only_mode():
            raise
        return {"status": "ready", "mode": "read_only"}

    try:
        check_cache()
    except HealthCheckFailed:
        # Cache is optional
        return {"status": "ready", "cache": "degraded"}

    return {"status": "ready"}
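The recommendation example from the start of this section can be sketched the same way on the request path: try the primary source, and fall back to a cheaper answer when it fails. The recommend function, its parameters, and the item names are hypothetical.

```python
from typing import Callable, List, Tuple


def recommend(
    user_id: str,
    model_recommendations: Callable[[str], List[str]],
    popular_items: List[str],
) -> Tuple[List[str], str]:
    """Return (items, source); fall back to popular items if the model call fails."""
    try:
        return model_recommendations(user_id), "model"
    except Exception:
        # Degraded but useful: the caller gets an answer instead of an error.
        return popular_items, "popularity_fallback"


def broken_model(user_id: str) -> List[str]:
    # Stand-in for a model server that cannot be reached.
    raise ConnectionError("model server unreachable")


items, source = recommend("user-42", broken_model, ["item-a", "item-b"])
print(items, source)
```

Returning the source alongside the result lets the caller log or count how often the fallback path is taken, which is a useful signal in its own right.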

What Not to Check in Health Endpoints

Keep liveness probes minimal. The liveness probe exists to detect deadlocks and crashes, not dependency outages. If your liveness probe fails whenever your database is unavailable, you restart into the same situation repeatedly.

Do not implement business logic in health checks. Health checks should verify infrastructure and dependencies, not application state. If you need to check application state, use separate monitoring endpoints with their own alerting.

Do not block health checks on long operations. A health check that takes 30 seconds to complete defeats its purpose. Set aggressive timeouts and fail fast.

Monitoring and Alerting

Health checks generate valuable signals for monitoring. Track health check latency and failure rates alongside application metrics.

# Track health check duration
import time

def measure_health_check(name, check_func):
    """Run check_func and record its duration and outcome under the check name."""
    start = time.monotonic()  # monotonic clock ignores wall-clock adjustments
    outcome = "failure"
    try:
        check_func()
        outcome = "success"
    finally:
        # Runs on both paths; any exception from check_func still propagates
        duration = time.monotonic() - start
        metrics.histogram("health_check_duration_seconds", duration, labels={"check": name})
        metrics.increment(f"health_check_{outcome}_total", labels={"check": name})

Set alerts on health check failures, not just application error rates. A failing health check can precede customer-visible errors, which gives you time to react before users notice.

When to Use / When Not to Use

When to Use Health Checks

Health checks are essential in these scenarios:

Container orchestration (Kubernetes, Docker Swarm) where orchestrators need to know when to restart or route traffic to your service
Load balancer integration where load balancers need to know which instances can receive traffic
Auto-scaling systems where scaling decisions depend on service health
Microservices with dependencies where you need to detect when downstream services are unavailable
Multi-instance deployments where you need to ensure all instances are healthy before serving traffic

When Not to Use Health Checks

Health checks may add unnecessary complexity in these cases:

Single-instance applications with no orchestration and no load balancing
Stateless batch jobs that run to completion and exit (though startup/shutdown hooks may still be useful)
Very short-lived tasks where the overhead of health check implementation outweighs the benefit
Services where failure is acceptable - non-critical background workers that can fail without impact

Probe Selection Guide

Scenario	Startup Probe	Liveness Probe	Readiness Probe
Slow-starting application	Required	Not needed until startup completes	Not needed until startup completes
Depends on external services	Not needed	Minimal only (dependency checks here cause restarts on transient outages)	Required (blocks traffic during dependency issues)
Serves cached data when deps fail	Not needed	Minimal only (no dependency checks)	Optional (can return healthy with degraded status)
Stateless computation	Required if startup time is non-trivial	Optional (process crash = container restart)	Optional
Database-backed API	Required if startup is slow	Minimal only (no dependency checks)	Required

Decision Flow

graph TD
    A[Implementing Health Checks] --> B{Application Slow to Start?}
    B -->|Yes| C[Add Startup Probe]
    B -->|No| D{Service Has Dependencies?}
    D -->|Yes| E{Need to Block Traffic When Deps Unavailable?}
    E -->|Yes| F[Add Readiness Probe]
    E -->|No| G[Add Liveness Probe]
    D -->|No| H{Can Crash Indicate Problem?}
    H -->|Yes| G
    H -->|No| I[No Probes Needed]
    C --> F
    C --> G

Topic Deep Dive: Kubernetes Probe Configuration and Failure Threshold Tuning

Getting probe configuration right is critical for reliability. Too sensitive and you restart healthy services. Too lenient and you route traffic to failing ones.

Initial Delay Calculation

Set initialDelaySeconds based on observed startup time, not desired startup time:

# Check actual startup time first
# kubectl run my-app --image=my-app && kubectl logs -f my-app
# Observe how long until the app is ready to serve

livenessProbe:
  httpGet:
    path: /live
    port: 8080
  initialDelaySeconds: 45 # Allow 45s for startup
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3

A common mistake is setting initialDelaySeconds too low based on optimistic estimates. If your service takes 30 seconds to warm up its database connection pool and load configuration, set initialDelaySeconds to at least 35 seconds.

Period and Failure Threshold Tuning

The right balance depends on your service characteristics:

Service Type	Suggested Period	Failure Threshold	Detection Time
Fast stateless API	5-10s	2-3	10-30s
Database-backed service	10-15s	3	30-45s
Slow initializing service	15s	3	45s
ML model serving	20s	3	60s

Detection time = periodSeconds * failureThreshold. Aim for a 30-60 second window: long enough to ride out transient issues, short enough to catch real problems quickly.
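The formula is easy to check against the configurations used earlier in this article. The helper names below are hypothetical, and the worst-case figure is a rough upper bound that assumes every failing probe runs to its full timeout.

```python
def detection_time(period_seconds: int, failure_threshold: int) -> int:
    """Approximate seconds of consecutive failures before Kubernetes acts."""
    return period_seconds * failure_threshold


def worst_case_detection(period_seconds: int, failure_threshold: int,
                         timeout_seconds: int) -> int:
    """Rough upper bound when the last failing probe waits out its timeout."""
    return detection_time(period_seconds, failure_threshold) + timeout_seconds


for name, period, threshold in [
    ("liveness", 15, 3),
    ("readiness", 10, 2),
    ("startup budget", 5, 12),
]:
    print(f"{name}: {detection_time(period, threshold)}s")
```

The same multiplication gives the startup probe's budget: 5-second periods with 12 allowed failures yield the 60 seconds quoted in the startup probe section.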

Kubernetes Probe Configuration for Different Scenarios

# Fast-starting stateless service
livenessProbe:
  httpGet:
    path: /live
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 2

# Slow-starting service with heavy initialization
startupProbe:
  httpGet:
    path: /started
    port: 8080
  failureThreshold: 12
  periodSeconds: 10  # 120s max startup time

livenessProbe:
  httpGet:
    path: /live
    port: 8080
  # No initialDelaySeconds: liveness checks stay disabled until the startup probe succeeds
  periodSeconds: 15
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  # No initialDelaySeconds: readiness checks also wait for the startup probe
  periodSeconds: 10
  failureThreshold: 2

gRPC Health Checks

For gRPC services, Kubernetes has built-in gRPC probes. They are enabled by default as of 1.24 and became stable in 1.27:

readinessProbe:
  grpc:
    port: 50051
    service: ""
  initialDelaySeconds: 5
  periodSeconds: 10

The empty service field checks overall server health. Specify a service name to check specific service availability.

HTTP/TCP Comparison for Different Service Types

Service Type	Recommended Probe	Why
HTTP REST API	HTTP GET /health	Verifies entire stack
gRPC service	gRPC probe	Native gRPC support
Database	TCP check on port	Fast, verifies network reachability
Cache (Redis)	TCP check or Redis PING	Simple connectivity check
Message queue	HTTP if exposed, else TCP	Depends on exposure

Real-world Failure Scenarios

Scenario	What Happens	Root Cause	Mitigation
initialDelaySeconds too low	Container killed before startup completes	Optimistic configuration	Measure actual startup time
Period too short	Excessive load from probe requests	Overly aggressive polling	Use 10-15s for liveness, 5-10s for readiness
failureThreshold too low	False positives from transient issues	Oversensitive configuration	Require 2-3 consecutive failures before action
Readiness probe checks external deps	Unhealthy when downstream is slow	Tight coupling	Make readiness probe fast, rely on circuit breakers
Liveness probe checks database	Continuous restart loop	Liveness depends on external	Liveness should check only local health

Trade-off Comparison

Strategy	Pros	Cons	Best For
HTTP /health endpoint	Easy to implement, comprehensive	Requires application support	Most REST services
TCP socket check	Simple, no app changes needed	Cannot verify application health	Legacy services, non-HTTP protocols
Exec probe	Can run custom scripts	Slower, more overhead	Complex health verification
gRPC probe	Native for gRPC services	Requires Kubernetes 1.24+	gRPC microservices
Sidecar health check	Decoupled from app	Additional complexity	Service mesh deployments

For more on building resilient systems, see Resilience Patterns, Circuit Breaker Pattern, and Kubernetes.

Quick Recap Checklist

Health checks let Kubernetes and your load balancers make intelligent routing decisions. Use liveness probes to detect crashed or deadlocked processes. Use readiness probes to control traffic routing based on dependency health. Use startup probes to give slow-starting applications time to initialize.

Keep liveness probes simple. Keep readiness probes fast and thorough. Set timeouts short and failure thresholds reasonable. Monitor your health checks and alert on failures.

Interview Questions

1. What is the difference between liveness, readiness, and startup probes in Kubernetes, and when would you use each?

Liveness probes determine if a container should be restarted. If the liveness probe fails, Kubernetes restarts the container. Use liveness probes to detect deadlock situations where the process is running but unresponsive.

Readiness probes determine if a container can receive traffic. If the readiness probe fails, Kubernetes removes the container from the service endpoint slice, stopping traffic to it. Use readiness probes to control traffic routing based on whether the service is ready to serve requests.

Startup probes hold back the liveness and readiness probes until the container has finished starting. Use startup probes for applications that take significant time to initialize, preventing liveness probes from killing the container before it is ready.

2. Why should liveness probes be simple and not check dependencies?

If a liveness probe checks external dependencies like databases or caches, a temporary outage of that dependency causes the liveness probe to fail, restarting the container. The container starts up, the liveness probe checks the dependency again (still down), fails again, and the cycle repeats.

This makes the outage worse instead of better. The service restarts continuously, consuming resources and potentially making the dependency problem worse under the load of restart attempts.

Liveness probes should only check whether the process is running and the HTTP server can respond. Dependency health should be checked by readiness probes, which block traffic without restarting the container.

3. How do you determine the right values for initialDelaySeconds, periodSeconds, timeoutSeconds, and failureThreshold?

Measure actual startup time for initialDelaySeconds. Run the application, observe how long until it is ready, and set initialDelaySeconds slightly above that measured time.

periodSeconds should balance quick detection against probe overhead. 10-15 seconds for liveness and 5-10 seconds for readiness work for most services. Shorter periods add load; longer periods delay detection.

timeoutSeconds should be short enough to fail fast but long enough for legitimate slow responses. 3-5 seconds is typical for HTTP probes.

failureThreshold determines how many consecutive failures trigger action. Multiply periodSeconds by failureThreshold to get detection time. For liveness, 3 failures over 45 seconds is reasonable. For readiness, 2 failures over 20 seconds balances sensitivity with stability.

4. How do startup probes work and when are they necessary?

Startup probes delay the start of liveness and readiness probes until they pass. Until the startup probe succeeds, Kubernetes does not run the liveness or readiness probes at all. This gives slow-starting applications time to initialize.

If you set initialDelaySeconds on liveness probe to wait for initialization, and your initialization takes 60 seconds, you need initialDelaySeconds: 60. But if initialization fails, Kubernetes waits 60 seconds before even starting liveness checks, delaying the detection of the failure.

Startup probes solve this: set periodSeconds and failureThreshold so the total budget (periodSeconds * failureThreshold) covers the maximum startup time. With a 60-second budget, an application that starts in 30 seconds passes the startup probe at about 30 seconds and liveness checks begin right away, while an application that never starts is restarted once the 60 seconds run out.
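The difference between a fixed delay and a startup probe can be worked through numerically. This is a simplified model, assuming probes fire at exact multiples of periodSeconds after the container starts; real kubelet timing includes jitter, and the function names are hypothetical.

```python
import math
from typing import Optional


def liveness_begins_fixed_delay(initial_delay_seconds: int) -> int:
    """With only initialDelaySeconds, liveness checks begin after a fixed wait."""
    return initial_delay_seconds


def liveness_begins_startup_probe(actual_startup_seconds: float,
                                  period_seconds: int,
                                  failure_threshold: int) -> Optional[int]:
    """Second at which the startup probe first passes, or None if the
    budget runs out first and the container is restarted instead.
    """
    budget = period_seconds * failure_threshold
    # First probe tick at or after the moment the application is up.
    ticks = max(1, math.ceil(actual_startup_seconds / period_seconds))
    first_pass = ticks * period_seconds
    return first_pass if first_pass <= budget else None


print(liveness_begins_fixed_delay(60))            # always waits the full delay
print(liveness_begins_startup_probe(30, 5, 12))   # normal start
print(liveness_begins_startup_probe(12, 5, 12))   # faster than usual
print(liveness_begins_startup_probe(90, 5, 12))   # never ready within budget
```

Under this model the fixed delay always costs the full 60 seconds, while the startup probe hands over to liveness checks as soon as the application answers, and still caps a failed start at the same 60-second budget.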

5. What is the difference between active and passive health checks?

Active health checks run periodically from the load balancer or proxy, independent of production traffic. They query health endpoints or test TCP connections and mark instances unhealthy based on responses.

Passive health checks analyze actual request responses. If an instance returns errors or timeouts above a threshold, the load balancer marks it unhealthy without sending additional probe requests. Passive checks catch issues that affect real traffic but might not trigger active probe failures.

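Most production systems use both: active checks to detect failures before traffic arrives, and passive checks to catch failures that only real requests expose. Open source NGINX relies on passive checks (max_fails and fail_timeout), for example, while Envoy combines active health checking with passive outlier detection.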
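A passive check can be sketched as a sliding window over recent request outcomes. This is an illustration of the idea only, not how any particular load balancer implements it; the PassiveHealthTracker class and its thresholds are assumptions.

```python
from collections import deque


class PassiveHealthTracker:
    """Marks an instance unhealthy from real traffic outcomes, not probes."""

    def __init__(self, window_size: int = 10, max_failure_rate: float = 0.5):
        self.outcomes = deque(maxlen=window_size)  # True = request succeeded
        self.max_failure_rate = max_failure_rate

    def record(self, succeeded: bool) -> None:
        """Call once per proxied request with its outcome."""
        self.outcomes.append(succeeded)

    def is_healthy(self) -> bool:
        if len(self.outcomes) < self.outcomes.maxlen:
            return True  # not enough evidence yet
        failures = self.outcomes.count(False)
        return failures / len(self.outcomes) <= self.max_failure_rate


tracker = PassiveHealthTracker(window_size=4, max_failure_rate=0.5)
for succeeded in [False, False, False, True]:
    tracker.record(succeeded)
print(tracker.is_healthy())  # 3 of the last 4 requests failed
```

No extra probe traffic is generated: the tracker only observes requests that were going to be sent anyway, which is the defining property of a passive check.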

6. How do you implement health checks for services with deep dependencies like databases and caches?

Implement a readiness endpoint that verifies connectivity to dependencies. Check database connectivity by executing a simple query like SELECT 1. Check cache connectivity by writing and reading a test value with a unique key to avoid collisions.

Keep checks fast: set aggressive timeouts (2-3 seconds), but base them on observed latency so that a dependency that is slow but working does not fail the check during a brief slowdown.

Consider implementing fallback behavior where the service can operate in degraded mode. If the database is unavailable but cached data suffices, the readiness probe can pass with a "degraded" status and the service continues serving from cache.

7. What happens when health check timeouts are set too high or too low?

If timeouts are too high, failing health checks take too long to detect real problems. A 30-second timeout on a health check that should complete in 100ms means you wait 30 seconds to learn that a single probe attempt failed, and Kubernetes acts only after failureThreshold consecutive failures, so a high timeout multiplies the detection delay.

If timeouts are too low, you get false positives from legitimate slow responses. Under load, a normally responsive service might take 2 seconds for a health check, but a 1-second timeout would mark it unhealthy even though it is working fine.

Set timeouts based on observed response times under normal load, plus a small buffer. Most health checks should complete in under 1 second; 3-5 seconds is a reasonable timeout range.

8. How do health checks interact with autoscaling decisions in Kubernetes?

Horizontal Pod Autoscaler (HPA) scales based on CPU utilization, memory usage, or custom metrics. Unhealthy pods affect these metrics differently depending on whether the unhealthiness is detected by readiness probes.

If a pod is failing readiness probes, it is removed from service endpoints and stops receiving traffic. Its own resource utilization drops while the remaining pods absorb its share of the load.

When scaling on CPU, the HPA sets aside pods that are not ready when it computes average utilization, so you do not need to exclude them yourself. Use the behavior field to set a scale-down stabilization window and avoid aggressive scale-down during recovery. Some teams implement separate scaling based on health check failure rates rather than resource utilization.

9. What is the relationship between health checks and graceful shutdown in Kubernetes?

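When Kubernetes terminates a pod, it sends SIGTERM to the container and, in parallel, removes the pod from service endpoints. The container should stop accepting new connections but finish existing requests. Because the endpoint update propagates asynchronously, some traffic can still arrive after SIGTERM, so failing the readiness check as soon as shutdown begins is a useful extra signal for anything that polls the endpoint directly.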

The terminationGracePeriodSeconds setting controls how long to wait for the container to shut down. Configure the preStop hook to add delay before sending SIGTERM, giving the load balancer time to update its routing and stop sending new traffic.

If the container does not exit within the grace period, Kubernetes sends SIGKILL. Ensure your application handles SIGTERM by stopping listeners, draining connections, and exiting cleanly.
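The link between SIGTERM and the readiness endpoint can be sketched with a shared flag. This assumes the application uses a threading.Event to coordinate shutdown; the handler and function names are hypothetical, and the demonstration calls the handler directly instead of sending a real signal.

```python
import signal
import threading
from typing import Tuple

shutting_down = threading.Event()


def handle_sigterm(signum, frame) -> None:
    """Flip the flag; the server loop should then stop accepting and drain."""
    shutting_down.set()


def install_sigterm_handler() -> None:
    # Must be called from the main thread, typically once at startup.
    signal.signal(signal.SIGTERM, handle_sigterm)


def readiness_status() -> Tuple[int, dict]:
    """Body for a hypothetical /ready handler that fails once shutdown begins."""
    if shutting_down.is_set():
        return 503, {"status": "shutting_down"}
    return 200, {"status": "ready"}


before = readiness_status()
handle_sigterm(signal.SIGTERM, None)  # simulate delivery of SIGTERM
after = readiness_status()
print(before, after)
```

The handler does nothing but set the flag. Keeping signal handlers this small avoids doing blocking work in a context where the interpreter may have interrupted other code.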

10. How do you implement health checks for services behind a service mesh like Istio?

With Istio, the kubelet still runs the probes defined in your pod spec. The sidecar injector rewrites HTTP probes by default so that they are routed through the sidecar agent, which forwards them to the application and keeps them working when mutual TLS is enforced.

Your application still needs to expose a health endpoint. Implement the endpoint as usual, but understand that probe requests may reach it through the sidecar rather than directly from the kubelet.

For multi-container pods where the main container and sidecar have different health characteristics, configure separate probes for each. The sidecar might be ready when the main application is still initializing.

11. What are the differences between TCP, HTTP, and Exec probes, and when would you choose each?

HTTP probes send an HTTP GET request to the health endpoint. They are the most common for web services because they verify the entire application stack including HTTP server functionality. Kubernetes marks the probe as successful for response codes 200-399.

TCP probes attempt to open a TCP socket. Use these for services that do not expose HTTP endpoints, such as databases, mail servers, or legacy protocols. They only verify network connectivity, not application health.

Exec probes run a shell command inside the container. The probe succeeds if the command exits with zero. These are flexible but slower due to process overhead. Use when you need custom health logic that cannot be expressed as an HTTP endpoint.

Choose HTTP probes for most web services. Use TCP for non-HTTP services or when you only care about port availability. Use Exec only when you need custom verification logic that cannot be expressed via HTTP or TCP.
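As an illustration of custom logic for an exec probe, the sketch below checks whether a worker has recently touched a heartbeat file and maps the result to an exit code. The heartbeat-file convention, the function names, and the 30-second threshold are all assumptions; a temporary file stands in for the real heartbeat in the demonstration.

```python
import os
import tempfile
import time


def heartbeat_is_fresh(path: str, max_age_seconds: float) -> bool:
    """True if the file exists and was modified within max_age_seconds."""
    try:
        age = time.time() - os.path.getmtime(path)
    except OSError:
        return False  # missing file means no heartbeat was ever written
    return age <= max_age_seconds


def probe_exit_code(path: str, max_age_seconds: float = 30.0) -> int:
    """Exit code for the probe: 0 is success, anything else is failure."""
    return 0 if heartbeat_is_fresh(path, max_age_seconds) else 1


# Demonstration with a temporary file standing in for the heartbeat.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    demo_path = handle.name
fresh_code = probe_exit_code(demo_path)
os.remove(demo_path)
missing_code = probe_exit_code(demo_path)
print(fresh_code, missing_code)
```

A script used as the probe command would finish with sys.exit(probe_exit_code(path)), since Kubernetes only looks at the exit status.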

12. How do health checks affect deployment rollouts and rolling updates?

During a rolling update, Kubernetes gradually replaces old pods with new ones. Readiness probes determine when a new pod is ready to receive traffic. If the new version fails readiness checks, it stays in the rollout without receiving traffic, allowing you to catch issues before full deployment.

The rollout waits for the readiness probe to pass before the old pod terminates. This ensures continuous availability. Configure `maxUnavailable` and `maxSurge` in your rolling update strategy to control deployment speed versus availability.

Liveness probes do not directly affect rollouts. If a new pod's liveness probe fails, Kubernetes restarts it rather than rolling back. This is why readiness probes should be thorough and liveness probes should be minimal.

13. What happens when a readiness probe returns a 503 or other error code?

When a readiness probe returns a status code outside the 200-399 range, Kubernetes marks the pod condition as `Ready: False`. The pod is removed from the service endpoint slice, stopping new traffic. Existing connections are not terminated, but no new traffic routes to that pod.

The pod continues running and the probe continues checking. When the probe passes again, Kubernetes marks `Ready: True` and adds the pod back to the endpoint slice. Traffic resumes automatically.

For a 503 specifically, your application is explicitly reporting it cannot handle requests. This is different from a timeout (which indicates the application may be stuck) or connection refused (which indicates the application is not listening).

14. How would you design health checks for a multi-tenant service where different tenants have different dependencies?

Multi-tenant services often have per-tenant dependencies. A tenant's database might be down while others are fine. Consider implementing tenant-aware readiness checks that verify connectivity to all tenant resources.

One approach is an aggregated health endpoint that reports status per tenant. The readiness probe aggregates across all tenants: if any tenant's critical dependency is down, mark the pod as not ready. Use a separate monitoring endpoint to expose per-tenant details without affecting traffic routing.

Alternatively, use separate readiness gates per tenant or implement tenant isolation at the ingress layer so unhealthy tenants do not affect others.
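The aggregated approach can be sketched as follows. This is an illustration under stated assumptions: `TenantStatus`, `readiness_response`, and `tenant_report` are hypothetical names, and how each tenant's dependency is actually checked is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantStatus:
    tenant_id: str
    critical_ok: bool  # True when the tenant's critical dependency is reachable
    detail: str = ""   # free-form diagnostic text, only for the monitoring endpoint


def pod_is_ready(statuses: list[TenantStatus]) -> bool:
    """The pod is ready only if every tenant's critical dependency is up."""
    return all(status.critical_ok for status in statuses)


def readiness_response(statuses: list[TenantStatus]) -> tuple[int, dict]:
    """Body for the readiness probe: one aggregate status, no tenant details."""
    ready = pod_is_ready(statuses)
    return (200 if ready else 503), {"status": "ready" if ready else "not_ready"}


def tenant_report(statuses: list[TenantStatus]) -> dict:
    """Body for the separate monitoring endpoint: per-tenant breakdown."""
    return {s.tenant_id: {"ok": s.critical_ok, "detail": s.detail} for s in statuses}
```

Keeping the per-tenant breakdown out of the readiness response means the probe stays small, and the detailed report can sit behind stricter access controls.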

15. What is the relationship between health checks and circuit breakers?

Health checks and circuit breakers serve different but complementary purposes. Health checks detect failure so orchestration platforms can react. Circuit breakers detect failure patterns and prevent your service from calling downstream services that are known to be failing.

A circuit breaker monitors the success rate of downstream calls. When the failure rate exceeds a threshold, the circuit opens and subsequent calls fail immediately without making the actual request. This protects both your service and the downstream one from cascading failure.

Use readiness probes to report your service's overall health to Kubernetes. Use circuit breakers to protect your service from downstream failures. These patterns work together: the readiness probe might report degraded status while the circuit breaker prevents further damage.
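To make the circuit breaker side concrete, here is a minimal sketch of a failure-rate breaker. The class and parameter names are assumptions for illustration, and the half-open state is simplified: after the reset timeout the breaker closes and starts a fresh window. A production service would more likely use an established resilience library.

```python
import time
from collections import deque


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_rate_threshold=0.5, window_size=10,
                 reset_timeout_seconds=30.0, clock=time.monotonic):
        self._threshold = failure_rate_threshold
        self._results = deque(maxlen=window_size)  # True = success, False = failure
        self._reset_timeout = reset_timeout_seconds
        self._clock = clock
        self._opened_at = None

    @property
    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self._reset_timeout:
            # Simplified half-open: close the circuit and start a fresh window.
            self._opened_at = None
            self._results.clear()
            return False
        return True

    def call(self, func, *args, **kwargs):
        if self.is_open:
            # Fail immediately without touching the downstream service.
            raise CircuitOpenError("downstream circuit is open")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record(False)
            raise
        self._record(True)
        return result

    def _record(self, success: bool) -> None:
        self._results.append(success)
        window_full = len(self._results) == self._results.maxlen
        failure_rate = self._results.count(False) / len(self._results)
        if window_full and failure_rate > self._threshold:
            self._opened_at = self._clock()
```

A readiness endpoint could consult `is_open` for a critical dependency and report not ready while the circuit is open, which is one way the two patterns can be wired together.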

16. How do you test health check configurations before deploying to production?

Test health checks under realistic failure conditions. Artificially inject failures for each dependency (database down, cache unavailable, downstream service timeout) and verify the probe behaves as expected.

Measure actual probe latency under load. Your health check might complete in 50ms normally but take 2 seconds under load. Set timeouts high enough to avoid false positives during traffic spikes.

Measure startup time. Force-restart your application several times and record how long it takes until the readiness probe passes. Set `initialDelaySeconds` to 10-20% above the measured time.

Use canary deployments to test new probe configurations with a small percentage of traffic before full rollout.
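The startup-time rule above is simple arithmetic. A small sketch, assuming you have collected several startup measurements in seconds; `recommended_initial_delay` is a hypothetical helper, and basing the result on the slowest run is a conservative choice of this example, not a Kubernetes requirement.

```python
import math


def recommended_initial_delay(measured_startup_seconds, headroom=0.2):
    """Return the slowest measured startup plus headroom, rounded up to whole seconds."""
    if not measured_startup_seconds:
        raise ValueError("at least one startup measurement is required")
    slowest = max(measured_startup_seconds)
    return math.ceil(slowest * (1 + headroom))


# Three restarts measured at 18.2 s, 21.5 s and 19.9 s until readiness passed.
print(recommended_initial_delay([18.2, 21.5, 19.9]))                # 26
print(recommended_initial_delay([18.2, 21.5, 19.9], headroom=0.1))  # 24
```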

17. What are the security considerations for health check endpoints?

Health endpoints should not expose sensitive information. A health check that returns database connection strings, internal IP addresses, or configuration details creates an information disclosure vulnerability.

Keep health responses minimal: just a status indicator. If you need detailed diagnostics for monitoring, use a separate endpoint that is not exposed to the internet and protected by network policies.

Consider rate limiting on health endpoints to prevent abuse. An exposed `/live` endpoint with no rate limiting could be used for amplification attacks or reconnaissance.

18. How do you handle health checks for applications that have warm-up or cool-down periods?

Applications that need warm-up time (JIT compilation, model loading, connection pool initialization) should use a startup probe. Set the failure threshold high enough to cover the maximum warm-up time.

During warm-up, readiness probes should fail. After warm-up completes, readiness probes should pass. This ensures the application does not receive traffic until it is ready.

For cool-down (graceful shutdown), use preStop hooks and SIGTERM handling. The readiness probe should start failing immediately on SIGTERM so the pod is removed from endpoints before shutdown begins.
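The shutdown half of this can be sketched with a flag that the SIGTERM handler sets and the readiness endpoint reads. The names here are hypothetical, and wiring `readiness_status` into your web framework is left out.

```python
import signal
import threading

# Set once SIGTERM arrives; the readiness endpoint reads it on every probe.
shutting_down = threading.Event()


def handle_sigterm(signum, frame):
    """Flip readiness to failing; in-flight requests keep being served."""
    shutting_down.set()


def readiness_status() -> int:
    """HTTP status the readiness endpoint should return."""
    return 503 if shutting_down.is_set() else 200


def install_handlers() -> None:
    """Register the handler. Must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)
```

After the flag is set, the application would typically finish in-flight requests and then exit within the pod's termination grace period.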

19. What is the impact of health check frequency on overall system performance?

Each health check probe consumes resources: CPU for the check, network bandwidth for the request, and memory for the connection. High-frequency probes compound across all pods in your cluster.

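For a cluster with 1000 pods each running one probe every 10 seconds, you have 100 probe requests per second hitting your services. If each probe takes 10ms of CPU, that is 1 second of CPU time per second, or one full core across the cluster, dedicated to health checking. Pods that run both a liveness and a readiness probe double these numbers.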

Balance detection speed against overhead. 10-15 second periods for liveness probes work for most services. Readiness probes can be more frequent (5-10 seconds) because they only affect traffic routing, not restarts.
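The overhead estimate above generalizes to a two-line calculation. This sketch uses a hypothetical `probe_load` helper and treats CPU cost per probe as a constant, which is a simplification.

```python
def probe_load(pod_count, period_seconds, cpu_ms_per_probe, probes_per_pod=1):
    """Return (probe requests per second, CPU cores consumed) across the cluster."""
    requests_per_second = pod_count * probes_per_pod / period_seconds
    cpu_cores = requests_per_second * cpu_ms_per_probe / 1000.0
    return requests_per_second, cpu_cores


# 1000 pods, one probe each every 10 seconds, 10 ms of CPU per probe.
print(probe_load(1000, 10, 10))                     # (100.0, 1.0)
# Same cluster with both liveness and readiness probes configured.
print(probe_load(1000, 10, 10, probes_per_pod=2))   # (200.0, 2.0)
```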

20. How do you monitor the health of health checks themselves?

Track health check latency and failure rates as application metrics. If health check duration increases over time, your application may be degrading. If health check failures spike, investigate immediately.

Set up alerts for health check probe duration exceeding thresholds. A health check that normally takes 5ms taking 500ms indicates resource contention or connection pool exhaustion.

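Monitor probe configuration changes. If someone changes `failureThreshold` or `periodSeconds`, detection time changes with it: lowering `failureThreshold` makes restarts more trigger-happy, while raising `periodSeconds` delays detection. Alert on configuration drift from baseline.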
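A drift check can be as small as comparing two dictionaries. In this sketch the function names are hypothetical, the settings are assumed to have been extracted from the pod spec already, and detection time is approximated as `periodSeconds` multiplied by `failureThreshold`, ignoring probe timeouts.

```python
def detection_time_seconds(period_seconds: int, failure_threshold: int) -> int:
    """Approximate time from first failed probe to Kubernetes acting on it."""
    return period_seconds * failure_threshold


def probe_drift(baseline: dict, current: dict) -> dict:
    """Return {setting: (baseline_value, current_value)} for every changed setting."""
    keys = baseline.keys() | current.keys()
    return {
        key: (baseline.get(key), current.get(key))
        for key in sorted(keys)
        if baseline.get(key) != current.get(key)
    }


baseline = {"periodSeconds": 10, "failureThreshold": 3, "timeoutSeconds": 2}
current = {"periodSeconds": 20, "failureThreshold": 3, "timeoutSeconds": 2}

print(probe_drift(baseline, current))   # {'periodSeconds': (10, 20)}
print(detection_time_seconds(10, 3))    # 30
print(detection_time_seconds(20, 3))    # 60
```

In this example a single edit to `periodSeconds` doubles the approximate detection time, which is the kind of change worth an alert.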

Conclusion

Health checks form the foundation of reliable service discovery and availability detection in distributed systems. Liveness probes identify crashed or dead services that need restarting. Readiness probes determine whether a service can handle traffic after startup, deployments, or temporary degradation. Startup probes accommodate slow-starting applications without forcing overly aggressive defaults.

Probe configuration requires balancing detection speed against false positive risk. Set timeouts based on normal response times under load. Configure failure thresholds to tolerate brief issues without waiting too long to act. Use separate probes for different concerns when your application warrants it.

Health checks integrate deeply with orchestration platforms: Kubernetes uses them to manage pod lifecycle, autoscaling, and traffic routing. Service meshes layer additional health checking through sidecar proxies. External load balancers perform their own checks against your services.

Building reliable health checks means testing them under failure conditions. Verify that your services correctly report unhealthy states. Confirm that orchestration platforms respond appropriately. Measure detection and recovery times under various failure scenarios.

The right health check strategy depends on your application’s characteristics and your tolerance for different failure modes. Start with simple checks and add depth as operational needs demand.
