Kubernetes Resource Limits: CPU, Memory, and Quality of Service

Configure CPU and memory requests and limits to ensure fair scheduling, prevent resource starvation, and achieve predictable performance in Kubernetes clusters.

Author: GeekWorkBench. Reading time: 21 min.

Kubernetes clusters have finite resources. Nodes have CPU and memory that pods consume. Without resource constraints, one pod can starve others, degrade cluster stability, and make scheduling unpredictable. Kubernetes provides mechanisms to declare resource requirements, set hard limits, and classify pods by their importance.

This post explains requests vs limits, Quality of Service classes, and the namespace-level constraints that keep clusters healthy.

If you need Kubernetes fundamentals first, see the Kubernetes fundamentals post. For advanced scheduling patterns, see the Advanced Kubernetes post.

Introduction

Nodes have finite CPU and memory. Pods consume both. Without explicit resource declarations, the scheduler has no idea how much a pod needs, so it can place it on a node that is already busy, and running containers can grab as much as they want. The result is noisy neighbors starving other pods, or node-level OOM events taking down services you did not even know were sharing the machine.

Resource requests and limits fix this. Requests tell the scheduler what a container needs to run. Limits cap what it can consume before Kubernetes steps in. Those declarations also assign your pods a Quality of Service class that determines eviction order when resources run short. Getting these values right is one of the most direct things you can do for cluster stability.

This post walks through requests and limits, the three QoS classes, and the namespace constraints (LimitRange, ResourceQuota) that keep one team from eating the entire cluster. You will also learn how to read actual resource usage with kubectl top and Vertical Pod Autoscaler so you are not guessing in the dark.

Requests vs Limits Explained

Every container in a pod can specify resource requests and resource limits:

apiVersion: v1
kind: Pod
metadata:
  name: web-app
  namespace: production
spec:
  containers:
    - name: web-app
      image: nginx:1.25
      resources:
        requests:
          memory: "256Mi"
          cpu: "250m"
        limits:
          memory: "512Mi"
          cpu: "500m"

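Requests define what a container needs. The scheduler uses requests to decide which node to place the pod on. A node qualifies only if its allocatable resources, minus the requests of the pods already scheduled there, cover the sum of the new pod’s container requests. Actual usage on the node is not part of this check.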

Limits define the maximum resources a container can use. When a container hits its memory limit, Kubernetes terminates and restarts it. When it hits its CPU limit, Kubernetes throttles the container.
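As a rough illustration of the placement rule, the sketch below checks whether a pod fits on a node by comparing its requests with the part of the node's allocatable capacity that is not yet reserved. The `Node` class, the `fits` helper, and all numbers are made up for this example; the real scheduler applies many more filters and scoring steps.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node model: allocatable capacity plus requests already placed."""
    allocatable_cpu_m: int    # millicores
    allocatable_mem_mi: int   # MiB
    placed: list = field(default_factory=list)  # (cpu_m, mem_mi) per scheduled pod

    def free(self):
        used_cpu = sum(cpu for cpu, _ in self.placed)
        used_mem = sum(mem for _, mem in self.placed)
        return self.allocatable_cpu_m - used_cpu, self.allocatable_mem_mi - used_mem


def fits(node, cpu_request_m, mem_request_mi):
    """Only requests matter here; limits and live usage play no part."""
    free_cpu, free_mem = node.free()
    return cpu_request_m <= free_cpu and mem_request_mi <= free_mem


node = Node(allocatable_cpu_m=2000, allocatable_mem_mi=4096,
            placed=[(1500, 1024), (300, 512)])
print(fits(node, 250, 256))  # False: only 200m of CPU is still unreserved
print(fits(node, 200, 256))  # True
```

Note that the node could be completely idle and the first pod would still be rejected, because the check looks at reserved requests rather than measured usage.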

CPU representation

CPU is measured in cores. You can express it as a whole number (1 CPU = 1 core) or millicores (1000m = 1 CPU). Common values:

  • 100m = 0.1 CPU (one-tenth of a core)
  • 250m = 0.25 CPU
  • 500m = 0.5 CPU
  • 1000m = 1 CPU

CPU is compressible. If a container exceeds its CPU limit, Kubernetes throttles it. The container does not get killed for CPU alone.

Memory representation

Memory is measured in bytes. You can use suffixes: Ki (kibibytes), Mi (mebibytes), Gi (gibibytes).

  • 128Mi = 128 mebibytes (~134 MB)
  • 256Mi = 256 mebibytes (~268 MB)
  • 1Gi = 1 gibibyte (~1.07 GB)

Memory is not compressible. If a container exceeds its memory limit, Kubernetes terminates it with an OOMKilled status.
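To make the notation concrete, here is a small sketch that converts CPU and memory quantities into cores and bytes. The helper names are invented for this post, and the parser only covers a subset of the suffixes Kubernetes accepts, so treat it as an illustration rather than a full quantity parser.

```python
from decimal import Decimal

# Binary (power-of-1024) and decimal (power-of-1000) memory suffixes.
_MEM_SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3,
    "k": 1000, "M": 1000**2, "G": 1000**3,
}


def cpu_to_cores(quantity: str) -> Decimal:
    """Convert a CPU quantity such as '250m' or '1' to cores."""
    if quantity.endswith("m"):
        return Decimal(quantity[:-1]) / 1000
    return Decimal(quantity)


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '256Mi' to bytes."""
    for suffix, factor in _MEM_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: a plain number of bytes


print(cpu_to_cores("250m"))      # 0.25
print(memory_to_bytes("256Mi"))  # 268435456
print(memory_to_bytes("1Gi"))    # 1073741824
```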

What happens when limits are exceeded

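A container killed at its memory limit shows up like this in the container status reported by kubectl describe pod:

State: Terminated
Reason: OOMKilled
Exit Code: 137

Exit code 137 is 128 + 9: the kernel's OOM (Out of Memory) killer ended the process with SIGKILL (signal 9). Frequent OOMKilled pods indicate you need to increase memory limits or optimize the application’s memory usage.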

QoS Classes

Kubernetes assigns pods to Quality of Service classes based on their resource requests and limits:

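| QoS Class | Criteria | Behavior |
| --- | --- | --- |
| Guaranteed | Every container sets CPU and memory requests and limits, with requests == limits | Last to be evicted |
| Burstable | At least one request or limit is set, but the pod does not meet the Guaranteed criteria | Evicted after BestEffort, before Guaranteed |
| BestEffort | No requests or limits set on any container | First to be evicted |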

When to Use Each QoS Class

| QoS Class | When to Use | When NOT to Use |
| --- | --- | --- |
| Guaranteed | Critical infrastructure pods, databases, licensed software with strict resource needs | When you do not need eviction priority guarantees |
| Burstable | Most application workloads that benefit from burst capacity | When every pod needs identical resource guarantees |
| BestEffort | Batch jobs, environments where pods are truly disposable | Production workloads, anything that matters |

Rule of thumb: Most workloads should be Burstable. Use Guaranteed for workloads that should be the last candidates for eviction; it lowers the risk but does not make a pod immune to eviction. Never run production workloads as BestEffort.

QoS Decision Flow

flowchart TD
    A[Pod submitted] --> B{Requests == Limits\nfor all containers?}
    B -->|Yes| G[Guaranteed QoS]
    B -->|No| C{Any requests\nor limits set?}
    C -->|Yes| Bu[Burstable QoS]
    C -->|No| BE[BestEffort QoS]
    Bu --> D{Node under\nmemory pressure?}
    BE --> D
    G --> D
    D -->|Guaranteed| L1[Last to evict]
    D -->|Burstable| L2[Middle eviction]
    D -->|BestEffort| L3[First to evict]

Guaranteed pods

containers:
  - name: database
    image: postgres:15
    resources:
      limits:
        memory: "2Gi"
        cpu: "2000m"
      requests:
        memory: "2Gi"
        cpu: "2000m"

Pods with identical requests and limits get the highest QoS. Use this for critical workloads that should not be evicted.

BestEffort pods

containers:
  - name: batch-job
    image: my-batch-job
    resources: {}

No resource specifications means BestEffort. These pods are first in line for eviction when the node runs low on resources.

Burstable pods

containers:
  - name: web-app
    image: nginx:1.25
    resources:
      requests:
        memory: "128Mi"
        cpu: "100m"
      limits:
        memory: "512Mi"
        cpu: "500m"

Most pods fall into Burstable. They have some guaranteed resources but can burst above their requests when available.
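The three examples above follow the classification rules fairly mechanically. The sketch below mirrors those rules in a hypothetical `qos_class` function that takes each container's `resources` section as a plain dict; it is a simplified model, not the kubelet's actual code.

```python
def qos_class(containers):
    """Classify a pod from its containers' `resources` dicts (cpu and memory only).

    Quantities are compared as strings, so "1" and "1000m" would not match;
    a real implementation would parse them first.
    """
    any_set = False
    guaranteed = True
    for resources in containers:
        requests = dict(resources.get("requests", {}))
        limits = resources.get("limits", {})
        any_set = any_set or bool(requests or limits)
        for name in ("cpu", "memory"):
            # A limit with no request is treated as request == limit.
            requests.setdefault(name, limits.get(name))
            if name not in limits or requests[name] != limits[name]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


database = {"requests": {"cpu": "2000m", "memory": "2Gi"},
            "limits": {"cpu": "2000m", "memory": "2Gi"}}
web_app = {"requests": {"cpu": "100m", "memory": "128Mi"},
           "limits": {"cpu": "500m", "memory": "512Mi"}}
batch_job = {}

print(qos_class([database]))           # Guaranteed
print(qos_class([web_app]))            # Burstable
print(qos_class([batch_job]))          # BestEffort
print(qos_class([database, web_app]))  # Burstable: one container breaks the rule
```

The last call shows why the rule says "all containers": a single sidecar with looser settings is enough to move the whole pod out of Guaranteed.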

LimitRange for Namespace Quotas

A LimitRange sets default, minimum, and maximum resource limits for pods and containers in a namespace. Without it, pods without resource specs become BestEffort.

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - max:
        memory: "4Gi"
        cpu: "2000m"
      min:
        memory: "64Mi"
        cpu: "50m"
      default:
        memory: "256Mi"
        cpu: "250m"
      defaultRequest:
        memory: "128Mi"
        cpu: "100m"
      type: Container

This LimitRange:

  • Sets maximum memory and CPU per container
  • Sets minimum memory and CPU per container
  • Applies default limits when containers specify no limits
  • Applies default requests when containers specify no requests

Without this LimitRange, a container with no resource specs gets no guaranteed resources and can be evicted first.
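To show how the defaulting and validation interact, here is a sketch that applies the LimitRange from the manifest above to a single container. The `apply_limit_range` function is hypothetical and works on plain integers (millicores and MiB) to stay short. It assumes the documented defaulting behavior: a container that sets only a limit gets a matching request instead of `defaultRequest`.

```python
LIMIT_RANGE = {  # the values from the manifest above, as millicores and MiB
    "max": {"cpu": 2000, "memory": 4096},
    "min": {"cpu": 50, "memory": 64},
    "default": {"cpu": 250, "memory": 256},
    "defaultRequest": {"cpu": 100, "memory": 128},
}


def apply_limit_range(resources, limit_range=LIMIT_RANGE):
    """Fill in missing requests/limits, then validate against min and max."""
    requests = dict(resources.get("requests", {}))
    limits = dict(resources.get("limits", {}))
    for name in ("cpu", "memory"):
        if name not in requests:
            # An explicit limit takes precedence over defaultRequest.
            requests[name] = limits.get(name, limit_range["defaultRequest"][name])
        if name not in limits:
            limits[name] = limit_range["default"][name]
        if requests[name] < limit_range["min"][name]:
            raise ValueError(f"{name} request is below the namespace minimum")
        if limits[name] > limit_range["max"][name]:
            raise ValueError(f"{name} limit is above the namespace maximum")
        if requests[name] > limits[name]:
            raise ValueError(f"{name} request exceeds its limit")
    return {"requests": requests, "limits": limits}


# A container with no resources section picks up both defaults.
print(apply_limit_range({}))
# A container that sets only limits gets requests equal to those limits.
print(apply_limit_range({"limits": {"cpu": 500, "memory": 512}}))
```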

Applying LimitRange

kubectl apply -f limitrange.yaml
kubectl describe limitrange default-limits -n production

The output shows the actual limits applied:

Type        Resource  Min   Max   Default Request  Default Limit
Container    cpu       50m   2     100m             250m
Container    memory    64Mi  4Gi   128Mi            256Mi

ResourceQuota for Namespace-Wide Limits

A ResourceQuota limits total resource consumption in a namespace. Use it to prevent any single namespace from consuming all cluster resources.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
    pods: "50"
    persistentvolumeclaims: "10"

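This quota caps the production namespace at 10 CPU cores of total requests, 20Gi of memory requests, 20 CPU cores of total limits, 40Gi of memory limits, 50 pods, and 10 persistent volume claims. Once a quota covers CPU or memory requests or limits, every new pod in the namespace must specify those values (or inherit them from a LimitRange); otherwise it is rejected.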

Viewing quota usage

kubectl describe resourcequota production-quota -n production

The output shows current usage against the hard limits. When a quota is exhausted, Kubernetes rejects new resource creation in that namespace.
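The admission check behind that rejection is simple addition. The sketch below models it with a hypothetical `exceeded_quota` helper, using the hard values from the quota above expressed in millicores and MiB; the usage numbers are invented.

```python
HARD = {  # production-quota from above, in millicores and MiB
    "requests.cpu": 10_000,
    "requests.memory": 20 * 1024,
    "limits.cpu": 20_000,
    "limits.memory": 40 * 1024,
    "pods": 50,
}


def exceeded_quota(used, new_pod, hard=HARD):
    """Return the quota keys the new pod would push past their hard limit."""
    return [key for key, limit in hard.items()
            if used.get(key, 0) + new_pod.get(key, 0) > limit]


used = {"requests.cpu": 9_800, "requests.memory": 12 * 1024,
        "limits.cpu": 15_000, "limits.memory": 24 * 1024, "pods": 31}
new_pod = {"requests.cpu": 250, "requests.memory": 256,
           "limits.cpu": 500, "limits.memory": 512, "pods": 1}

print(exceeded_quota(used, new_pod))  # ['requests.cpu']
```

An empty list would mean the pod is admitted. Here the namespace still has plenty of memory and pod headroom, but the CPU request budget is the constraint that blocks the pod.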

Pod Resource Testing and Tuning

Finding the right requests and limits takes measurement. Kubernetes lets you profile pod behavior before setting values in production.

kubectl run with resource specs

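# Recent kubectl releases dropped --requests/--limits from `kubectl run`,
# so the resources are passed through --overrides instead.
kubectl run -it --rm load-generator \
  --image=busybox \
  --restart=Never \
  --overrides='{"spec": {"containers": [{
    "name": "load-generator", "image": "busybox",
    "stdin": true, "tty": true, "command": ["sh"],
    "resources": {
      "requests": {"cpu": "500m", "memory": "128Mi"},
      "limits": {"cpu": "1000m", "memory": "256Mi"}}}]}}'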

Use this to run temporary pods and observe resource consumption with monitoring tools.

Vertical Pod Autoscaler

The Vertical Pod Autoscaler (VPA) is an add-on that usually has to be installed separately; it is not part of core Kubernetes. It analyzes historical resource usage and recommends or automatically applies better resource requests:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Auto"

In Auto mode, VPA evicts and reschedules pods with updated resource specs. In Off mode, it only provides recommendations.

VPA helps you find baseline resource requirements without manual profiling.

Horizontal Pod Autoscaler

The Horizontal Pod Autoscaler (HPA) scales pod replicas based on CPU, memory, or custom metrics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

The HPA scales between 3 and 10 replicas to keep average CPU usage at about 70% of the pods' CPU requests (utilization is measured against requests, not limits). For memory-based scaling:

metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
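The scaling decision itself follows the documented HPA formula: desired replicas = ceil(current replicas × current utilization / target utilization), with a default tolerance of 10% around the target. The sketch below applies that formula to the CPU example above. The function name and the usage figures are invented, and it ignores details such as stabilization windows and pods that are not ready.

```python
import math


def desired_replicas(current_replicas, usage_m, request_m, target_utilization,
                     min_replicas=3, max_replicas=10, tolerance=0.1):
    """Simplified HPA calculation for a Resource metric of type Utilization.

    usage_m: measured CPU per pod in millicores; request_m: CPU request per pod.
    """
    utilization = 100 * sum(usage_m) / (request_m * len(usage_m))
    ratio = utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to the target: no change
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# Three pods requesting 250m each, currently at 96% of their request on average.
print(desired_replicas(3, [240, 230, 250], request_m=250, target_utilization=70))  # 5
# Mostly idle pods: the formula asks for 1 replica, minReplicas keeps 3.
print(desired_replicas(3, [50, 50, 50], request_m=250, target_utilization=70))     # 3
```

Because utilization is a fraction of the request, changing a pod's CPU request also changes when the HPA reacts, even if the application's load stays the same.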

Trade-off Analysis

Requests vs Limits Trade-offs

| Aspect | Requests Only | Requests + Limits |
| --- | --- | --- |
| Scheduling | Predictable node placement | Predictable scheduling + resource capping |
| QoS class | Burstable (best available) | Guaranteed if equal, otherwise Burstable |
| Memory protection | None | Container is OOMKilled at its limit before it can exhaust node memory |
| CPU behavior | No throttle, unlimited burst | CPU throttled at limit |
| Production readiness | Insufficient | Required for production |

QoS Class Trade-offs

| QoS Class | Scheduling guarantee | Eviction priority | Resource efficiency | Use when |
| --- | --- | --- | --- | --- |
| Guaranteed | Highest | Last evicted | Low (reserved full) | Critical infrastructure, licensed software |
| Burstable | Medium | Middle | High (flexible) | Most application workloads |
| BestEffort | None | First evicted | Highest | Batch jobs only |

Trade-off reality: Guaranteed pods reserve their full resource request even when idle. BestEffort pods get whatever is left, making them efficient but risky for production.

LimitRange vs ResourceQuota

| Aspect | LimitRange | ResourceQuota |
| --- | --- | --- |
| Scope | Per namespace, per container | Per namespace, total namespace |
| What it controls | Min/max/default requests/limits | Total CPU, memory, pod count |
| Enforcement | Applied when pod is created | Checked when resources are created |
| Use case | Prevent BestEffort pods, set defaults | Prevent namespace from monopolizing cluster |

Vertical Pod Autoscaler vs Manual Resource Tuning

| Aspect | VPA (Automated) | Manual Tuning |
| --- | --- | --- |
| Effort | Low | High (profiling required) |
| Precision | Based on historical data | Based on engineering judgment |
| Disruption | Pod evictions in Auto mode | None (no changes unless you apply) |
| Best for | Initial baseline discovery | Production fine-tuning |

CPU Throttling Trade-offs

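| Aspect | With CPU Limits | Without CPU Limits |
| --- | --- | --- |
| Latency | p99 spikes from throttling | More consistent latency while the node has idle CPU |
| Resource cost | More efficient node utilization | May over-provision requests |
| Fairness | Prevents one pod dominating CPU | Pod can burst into any idle CPU on the node |
| Recommendation | Avoid for latency-sensitive services | Use for latency-critical services |

A pod without CPU limits is Burstable at best, because Guaranteed QoS requires limits equal to requests.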

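For latency-sensitive services where consistent p99 latency matters, either remove CPU limits entirely and rely on requests for scheduling, or set CPU limits equal to requests for Guaranteed QoS. Keep in mind that a limit equal to the request still throttles at that value, so the second option only helps if the request is sized for peak load.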

Memory Limit Trade-offs

| Aspect | Low memory limit | High memory limit |
| --- | --- | --- |
| OOM risk | Frequent OOMKilled pods | Rare OOMKilled pods |
| Node efficiency | Higher (more pods per node) | Lower (reserved memory) |
| Debugging | Easier to spot (frequent crashes) | May hide memory leaks |
| Recommendation | Set ~20-30% above expected peak | Set for expected peak + headroom |

Production Failure Scenarios

BestEffort Pods Evicted Under Memory Pressure

Pods without resource requests (BestEffort QoS) are the first to be evicted when a node runs low on memory. In a cluster with many BestEffort pods, eviction events can be frequent.

Symptoms: Frequent pod evictions in kubectl get events, pods restart unexpectedly.

Mitigation: Set resource requests for all production pods. Use LimitRange to enforce minimum resource requests per namespace.
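For a feel of why BestEffort pods go first, the sketch below models the ranking the kubelet documents for memory-pressure eviction: pods using more than they requested come first, then lower priority, then the largest overshoot. The `PodStats` class and the sample pods are hypothetical, and real eviction also depends on configured thresholds and grace periods.

```python
from dataclasses import dataclass


@dataclass
class PodStats:
    name: str
    priority: int           # value of the pod's PriorityClass, 0 if unset
    memory_request_mi: int  # 0 for BestEffort pods
    memory_usage_mi: int


def eviction_order(pods):
    """Return pod names in the order a simplified kubelet would evict them."""
    def key(pod):
        over = pod.memory_usage_mi - pod.memory_request_mi
        # 1) pods above their request first, 2) lower priority first,
        # 3) larger overshoot relative to the request first.
        return (over <= 0, pod.priority, -over)
    return [pod.name for pod in sorted(pods, key=key)]


pods = [
    PodStats("db", priority=0, memory_request_mi=2048, memory_usage_mi=1500),
    PodStats("web", priority=0, memory_request_mi=128, memory_usage_mi=400),
    PodStats("batch", priority=0, memory_request_mi=0, memory_usage_mi=300),
    PodStats("cache", priority=0, memory_request_mi=512, memory_usage_mi=300),
]
print(eviction_order(pods))  # ['batch', 'web', 'cache', 'db']
```

A BestEffort pod has no request, so any memory it uses counts as overshoot. That is what puts `batch` at the front of the list.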

OOMKilled Pods from Memory Limit Too Low

When a container exceeds its memory limit, Kubernetes kills it with OOMKilled status. This is one of the most common production issues I see. The fix is often raising the memory limit, but rule out a memory leak first, because a leak will eventually hit any limit you set.

Check actual memory usage first:

kubectl describe pod <pod-name>  # Look for OOMKilled in state
kubectl top pod <pod-name>  # Check actual memory usage

Set limits about 20-30% above what you see in staging to account for traffic spikes.
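As a worked example of that headroom rule, the sketch below turns an observed peak into a suggested limit. The function name, the 25% default, and the rounding step are choices made for this illustration, not a Kubernetes convention.

```python
import math


def suggest_memory_limit_mi(observed_peak_mi, headroom=0.25, step_mi=32):
    """Observed peak plus headroom, rounded up to a multiple of step_mi."""
    raw = observed_peak_mi * (1 + headroom)
    return math.ceil(raw / step_mi) * step_mi


peak = 400  # MiB, e.g. the highest value seen in `kubectl top pod` during a load test
print(f"{suggest_memory_limit_mi(peak)}Mi")  # 512Mi
```

Rounding up to a step keeps the values in manifests tidy; the headroom percentage is the part worth revisiting once you have production data.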

CPU Throttling Impacting Latency

CPU limits throttle containers even when the node has free CPU. For latency-sensitive services, this creates annoying tail latency spikes.

High p99 latency with low average CPU usage usually means CPU throttling. For these workloads, either remove CPU limits or set them equal to requests for a Guaranteed QoS pod.
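A back-of-the-envelope model shows where those spikes come from. CPU limits are enforced as a quota of CPU time per scheduling period (100 ms by default), so a multi-threaded burst can spend the whole quota early in the period and then sit idle until the next one. The function below is a deliberately simplified model with invented numbers: it assumes perfectly parallel work that starts exactly at a period boundary.

```python
import math


def wall_time_ms(cpu_work_ms, threads, cpu_limit_cores=None, period_ms=100):
    """Estimate wall-clock time for a burst of CPU work under a CPU limit."""
    unthrottled = cpu_work_ms / threads
    if cpu_limit_cores is None:
        return unthrottled
    quota_ms = cpu_limit_cores * period_ms  # CPU time allowed per period
    if threads * period_ms <= quota_ms:
        return unthrottled  # the threads cannot exhaust the quota in one period
    # Periods in which the quota is used up and the container then waits.
    full_periods = math.ceil(cpu_work_ms / quota_ms) - 1
    remaining = cpu_work_ms - full_periods * quota_ms
    return full_periods * period_ms + remaining / threads


# A request needing 80 ms of CPU time, spread over 4 threads.
print(wall_time_ms(80, threads=4))                       # 20.0
print(wall_time_ms(80, threads=4, cpu_limit_cores=0.5))  # 107.5
```

In this model the request takes more than five times longer under a 500m limit, even though average CPU usage over a second would still look low. As far as this simple arithmetic goes, it applies equally when the limit equals the request, since QoS class does not change how the quota is enforced.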

Anti-Patterns

Setting Identical Requests and Limits for All Pods

Treating every pod the same wastes resources. A web server handling 1000 req/s has different needs than a batch job processing queues.

Profile each application type separately and set appropriate resource specs.

Not Setting Memory Limits

Memory limits prevent runaway processes from consuming all node memory and causing node-level OOM events. Always set memory limits, especially for applications that can experience memory leaks.

Over-Provisioning CPU Limits

Setting CPU limits very high (like 4 cores for a simple web server) defeats the purpose of limits. The scheduler uses requests for node allocation decisions, not limits.

Set CPU limits based on actual expected peak load, not theoretical maximum.

Interview Questions

1. What is the difference between a resource request and a resource limit in Kubernetes?

Expected answer points:

  • Requests define what a container needs — the scheduler uses requests to decide node placement
  • Limits define the maximum a container can consume — exceeding memory kills with OOMKilled, exceeding CPU gets throttled
  • A node must have enough allocatable resources to satisfy a pod's requests
  • Requests are guaranteed; limits are a ceiling

2. What happens when a container exceeds its memory limit versus its CPU limit?

Expected answer points:

  • Memory: Kubernetes kills the container with OOMKilled status and exit code 137
  • CPU: Kubernetes throttles the container, slowing it down but not killing it (CPU is compressible)
  • Memory is not compressible — exceeding the limit has immediate consequences
  • CPU throttling shows as high p99 latency even with low average CPU usage

3. What are the three Kubernetes QoS classes and how are they assigned?

Expected answer points:

  • Guaranteed: every container sets CPU and memory requests and limits, with requests == limits
  • Burstable: at least one container sets a request or limit, but the pod does not meet the Guaranteed criteria
  • BestEffort: no requests and no limits set on any container
  • QoS determines eviction order when the node runs out of resources: BestEffort first, Burstable next, Guaranteed last

4. When should you use Guaranteed QoS versus Burstable QoS?

Expected answer points:

  • Guaranteed: for critical infrastructure pods, databases, and licensed software with strict resource requirements that should be the last candidates for eviction
  • Burstable: for most application workloads that benefit from burst capacity when extra resources are available
  • Guaranteed wastes resources by reserving the full request value even when idle
  • BestEffort should never be used for production workloads

5. What is a LimitRange and what does it enforce?

Expected answer points:

  • A LimitRange sets default, minimum, and maximum resource limits for containers in a namespace
  • Without it, containers with no resource specs become BestEffort and are first in line for eviction
  • It applies defaults when containers do not specify requests or limits
  • It enforces mins and maxes to prevent over-provisioning or under-provisioning

6. What is a ResourceQuota and how does it differ from a LimitRange?

Expected answer points:

  • ResourceQuota limits total resource consumption across an entire namespace
  • LimitRange sets per-container defaults, minimums, and maximums
  • ResourceQuota prevents any single namespace from consuming all cluster resources
  • Both are namespace-scoped and enforced at pod creation time

7. How does the Vertical Pod Autoscaler help with resource configuration?

Expected answer points:

  • VPA analyzes historical resource usage and recommends or automatically applies better resource requests
  • Auto mode evicts and reschedules pods with updated specs
  • Off mode only provides recommendations without making changes
  • Useful for finding baseline resource requirements without manual profiling

8. How does OOMKilled happen and how do you diagnose it?

Expected answer points:

  • OOMKilled occurs when a container exceeds its memory limit — Linux kernel kills it with exit code 137
  • Diagnose with `kubectl describe pod <pod-name>` and look for OOMKilled in the container state
  • Check actual memory usage with `kubectl top pod <pod-name>`
  • Mitigation: increase memory limit or optimize application memory usage

9. What is CPU throttling and how does it affect application performance?

Expected answer points:

  • CPU throttling occurs when a container hits its CPU limit — Kubernetes enforces the limit via CFS bandwidth control
  • The container runs but gets less CPU time than it wants, causing latency spikes
  • High p99 latency with low average CPU usage is the telltale sign of CPU throttling
  • For latency-sensitive services, either remove CPU limits or set them equal to requests for Guaranteed QoS

10. What is the relationship between resource requests and the Kubernetes scheduler?

Expected answer points:

  • The scheduler uses requests (not limits) to decide which node a pod goes on
  • A node must have at least as much allocatable resources as the pod's requests
  • Limits do not affect scheduling decisions — only requests do
  • Setting requests correctly is essential for cluster stability and fair scheduling

11. Why is setting memory limits more critical than setting CPU limits?

Expected answer points:

  • Memory is not compressible — exceeding the limit kills the container immediately with OOMKilled
  • CPU is compressible — exceeding the limit only throttles the container
  • Memory leaks in one container can consume all node memory without limits
  • Node-level OOM events can affect all pods on the node if memory limits are not set

12. How does the Horizontal Pod Autoscaler work with resource metrics?

Expected answer points:

  • HPA scales replicas based on observed CPU or memory utilization against a target
  • For CPU: `averageUtilization: 70` means scale up when average CPU usage exceeds 70% of the pods' CPU requests (not their limits)
  • HPA can also use custom metrics or external metrics for scaling decisions
  • HPA works alongside resource requests: it scales replica count, not the per-pod resource size

13. What happens when a namespace exceeds its ResourceQuota?

Expected answer points:

  • Kubernetes rejects new resource creation in that namespace
  • Pod creation fails with "exceeded quota" error
  • The quota applies to both requests and limits depending on the quota spec
  • Use `kubectl describe resourcequota <quota-name> -n <namespace>` to see current usage against hard limits

14. What anti-patterns exist around Kubernetes resource configuration?

Expected answer points:

  • Setting identical requests and limits for all pods — different workloads have different needs
  • Not setting memory limits — memory leaks can consume entire nodes
  • Over-provisioning CPU limits — high limits defeat the purpose; base limits on actual peak load
  • Setting only limits without requests: Kubernetes copies each limit into the missing request, so the pod silently reserves its full limit on the node

15. How do you tune resource limits for a latency-sensitive service?

Expected answer points:

  • Monitor actual p99 latency in production to determine if CPU throttling is occurring
  • If throttling is present, either remove CPU limits or set them equal to requests for Guaranteed QoS
  • Set memory limits about 20-30% above observed peak usage to handle traffic spikes
  • Use VPA in Off mode first to understand actual resource consumption before setting production values

16. How does Kubernetes handle resource overcommitment in a cluster?

Expected answer points:

  • Kubernetes allows overcommitment of limits: the total limits of the pods on a node can exceed its capacity, while total requests must fit within the node's allocatable resources
  • Requests are used for scheduling; limits are enforced at runtime
  • When a node comes under resource pressure, the kubelet evicts pods, and BestEffort pods are the first to go
  • Overcommitment increases utilization but risks OOMKilled events if too many pods burst simultaneously

17. How do LimitRange and ResourceQuota work together to enforce resource constraints?

Expected answer points:

  • LimitRange operates at the container level within a namespace: it sets defaults, minimums, and maximums for individual containers
  • ResourceQuota operates at the namespace level: it caps total requests and limits across all pods in the namespace
  • LimitRange prevents BestEffort pods by applying default requests; ResourceQuota prevents any single namespace from consuming all cluster resources
  • Both are enforced at pod creation time; pods that exceed either constraint are rejected

18. When multiple pods share the same QoS class, how does Kubernetes determine eviction order?

Expected answer points:

  • Within the same QoS class, Kubernetes uses priority class and then resource usage to determine eviction order
  • Pods with lower priority class are evicted before pods with higher priority
  • Among pods with equal priority, those using the most memory relative to their requests are evicted first
  • You can set pod priority via PriorityClass to influence eviction decisions beyond basic QoS

19. How do Vertical Pod Autoscaler and Horizontal Pod Autoscaler work together?

Expected answer points:

  • VPA adjusts the resource requests (CPU/memory) of individual pod containers — it changes what each pod needs
  • HPA adjusts the replica count of a Deployment or StatefulSet — it changes how many pods run
  • VPA handles vertical scaling (bigger/smaller pods); HPA handles horizontal scaling (more/fewer pods)
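  • Using both: let VPA size the requests and HPA scale replicas, but avoid pointing both at the same CPU or memory metric for one workload, since they will work against each other; drive HPA from a custom metric in that case

20. What does the Kubernetes notation for CPU and memory values mean in practice?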

Expected answer points:

  • CPU: `100m` = 0.1 CPU (100 millicores), `1` = 1 CPU, and 1000m = 1 CPU. One CPU is one physical core or one virtual core (vCPU), depending on the node
  • Memory: `Ki` = kibibytes (1024), `Mi` = mebibytes (1024^2), `Gi` = gibibytes (1024^3); `1Gi` is approximately 1.07 GB
  • Binary suffixes (Ki, Mi, Gi) are powers of 1024; decimal suffixes (k, M, G) are powers of 1000. Pick one family and use it consistently; binary suffixes are the usual choice in pod specs
  • Precision: CPU cannot be specified more finely than `1m`; memory requests of only a few Mi are rarely realistic because the container's own processes need more than that

Conclusion

Use this checklist when configuring Kubernetes resource limits:

  • Set resource requests for all production containers (scheduler uses these for placement)
  • Set memory limits to prevent runaway processes from impacting node stability
  • Set CPU limits based on actual application needs, not maximum theoretical values
  • Use LimitRange to enforce default requests/limits per namespace
  • Use ResourceQuota to cap total namespace resource consumption
  • Profile applications with VPA or monitoring tools before setting production values
  • Monitor OOMKilled events and adjust memory limits accordingly
  • Monitor CPU throttling metrics and adjust CPU limits if latency issues appear
  • Set Guaranteed QoS (requests == limits) for critical infrastructure pods
  • Avoid BestEffort for any production workload
