Kubernetes Resource Limits: CPU, Memory, and Quality of Service

Configure CPU and memory requests and limits to ensure fair scheduling, prevent resource starvation, and achieve predictable performance in Kubernetes clusters.

published: March 25, 2026 reading time: 29 min read author: GeekWorkBench updated: June 17, 2026

Quick Summary

Kubernetes resource management controls how pods consume CPU and memory in a cluster. Requests reserve resources for scheduling decisions, while limits cap consumption: exceeding a memory limit gets the container OOMKilled and restarted, and exceeding a CPU limit causes throttling. Pods receive QoS classes (Guaranteed, Burstable, BestEffort) that determine eviction priority when nodes run short. LimitRange enforces per-namespace defaults and constraints, while ResourceQuota prevents any single namespace from monopolizing cluster resources. Vertical Pod Autoscaler can auto-size requests based on historical data. Getting these values right directly affects cluster stability and production reliability.

Kubernetes clusters have finite resources. Nodes have CPU and memory that pods consume. Without resource constraints, one pod can starve others, degrade cluster stability, and make scheduling unpredictable. Kubernetes provides mechanisms to declare resource requirements, set hard limits, and classify pods by their importance.

This post explains requests vs limits, Quality of Service classes, and the namespace-level constraints that keep clusters healthy.

If you need Kubernetes fundamentals first, see the Kubernetes fundamentals post. For advanced scheduling patterns, see the Advanced Kubernetes post.

Introduction

Nodes have finite CPU and memory. Pods consume both. Without explicit resource declarations, the scheduler treats a pod as needing nothing and can place it on a node that is already busy, and running containers can grab as much as they want. The result is noisy neighbors starving other pods, or node-level OOM events taking down services you did not even know were sharing the machine.

Resource requests and limits fix this. Requests tell the scheduler what a container needs to run. Limits cap what it can consume before Kubernetes steps in. Those declarations also assign your pods a Quality of Service class that determines eviction order when resources run short. Getting these values right is one of the most direct things you can do for cluster stability.

This post walks through requests and limits, the three QoS classes, and the namespace constraints (LimitRange, ResourceQuota) that keep one team from eating the entire cluster. You will also learn how to read actual resource usage with kubectl top and Vertical Pod Autoscaler so you are not guessing in the dark.

Requests vs Limits Explained

Every container in a pod can specify resource requests and resource limits:

apiVersion: v1
kind: Pod
metadata:
  name: web-app
  namespace: production
spec:
  containers:
    - name: web-app
      image: nginx:1.25
      resources:
        requests:
          memory: "256Mi"
          cpu: "250m"
        limits:
          memory: "512Mi"
          cpu: "500m"

Requests define what a container needs. The scheduler uses requests to decide which node to place the pod on: the node’s allocatable resources, minus the requests of pods already running there, must cover the new pod’s requests.

Limits define the maximum resources a container can use. When a container exceeds its memory limit, the kernel’s OOM killer terminates it and the kubelet restarts it according to the pod’s restart policy. When it hits its CPU limit, the kernel throttles the container.

CPU representation

CPU is measured in cores. You can express it as a whole number (1 CPU = 1 core) or millicores (1000m = 1 CPU). Common values:

100m = 0.1 CPU (one-tenth of a core)
250m = 0.25 CPU
500m = 0.5 CPU
1000m = 1 CPU

CPU is compressible. If a container exceeds its CPU limit, Kubernetes throttles it. The container does not get killed for CPU alone.

Memory representation

Memory is measured in bytes. You can use suffixes: Ki (kibibytes), Mi (mebibytes), Gi (gibibytes).

128Mi = 128 mebibytes (~134 MB)
256Mi = 256 mebibytes (~268 MB)
1Gi = 1 gibibyte (~1.07 GB)

Memory is not compressible. If a container exceeds its memory limit, Kubernetes terminates it with an OOMKilled status.
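To make the two notations concrete, here is a minimal Python sketch that converts the quantity strings above into plain numbers. The function names are my own, and the parser only covers the suffixes discussed in this post, not the full Kubernetes quantity grammar.

```python
from decimal import Decimal

# Binary (power-of-1024) and decimal (power-of-1000) memory suffixes.
_MEMORY_SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}


def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "250m" or "0.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "256Mi" or "1G" to bytes."""
    # Check two-letter suffixes first so "Mi" is not mistaken for "M".
    for suffix in sorted(_MEMORY_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            number = Decimal(quantity[: -len(suffix)])
            return int(number * _MEMORY_SUFFIXES[suffix])
    return int(Decimal(quantity))  # plain number of bytes


print(cpu_to_millicores("250m"))   # 250
print(cpu_to_millicores("1"))      # 1000
print(memory_to_bytes("128Mi"))    # 134217728
print(memory_to_bytes("1Gi"))      # 1073741824
```

Converting everything to millicores and bytes up front makes it easier to compare requests against limits or to add them up per node.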

What happens when limits are exceeded

State: Terminated
Reason: OOMKilled
Exit Code: 137

Exit code 137 is 128 + 9, meaning the process received SIGKILL. Combined with Reason: OOMKilled, it tells you the OOM (Out of Memory) killer ended the container. Frequent OOMKilled pods indicate you need to increase memory limits or optimize the application’s memory usage.
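If you want to decode exit codes in a script, the usual convention is that codes above 128 mean "killed by signal (code - 128)". The helper below is a small sketch of that convention (the function name is mine, and it assumes a Linux host where signal 9 is SIGKILL). Note that SIGKILL alone does not prove an OOM kill. The Reason field is what confirms it.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + signal convention."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    return "exited normally" if code == 0 else f"application error {code}"


print(describe_exit_code(137))  # killed by SIGKILL
```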

QoS Classes

Kubernetes assigns pods to Quality of Service classes based on their resource requests and limits:

QoS Class	Criteria	Behavior
Guaranteed	CPU and memory requests == limits for all containers	Last to be evicted
Burstable	At least one request or limit set, but not Guaranteed	Evicted after BestEffort, before Guaranteed
BestEffort	No requests or limits set	First to be evicted

When to Use Each QoS Class

The table above lays out the criteria, but the decision comes down to three factors: how critical the workload is, whether it needs burst capacity, and whether you can afford to reserve full resources.

Use Guaranteed for anything that cannot go down. Databases with persistent connections, licensed software that checks hardware fingerprints, and control plane components that other pods depend on all fit here. The cost is reservation: a Guaranteed pod with a 2Gi memory request holds that 2Gi even at 2% utilization. If you have 50 Guaranteed pods and most are idle, you are paying for idle capacity across the cluster.

Use Burstable for the majority of application workloads. A web API that sits at 5% CPU most of the day but spikes to 60% during traffic surges belongs in Burstable. It gets enough resources to stay scheduled, can use extra capacity when available, and does not force you to reserve peak-level resources for every hour of the day. Most Kubernetes workloads fall here by default when you set requests and limits based on observed usage.

Use BestEffort only for workloads you can afford to lose at any moment. A batch job that picks up work from a queue, processes it, and reports results can tolerate eviction if the work is idempotent or gets re-queued. CI agents that run ephemeral builds, dev environment pods that developers restart manually, and any workload where the cost of eviction is near zero are candidates. The moment someone cares whether the pod stays running, it should not be BestEffort.

Rule of thumb: Most workloads should be Burstable. Set Guaranteed for workloads that should be the last candidates for eviction. Never run production workloads as BestEffort.

QoS Decision Flow

flowchart TD
    A[Pod submitted] --> B{Requests == Limits\nfor all containers?}
    B -->|Yes| G[Guaranteed QoS]
    B -->|No| C{Any requests\nor limits set?}
    C -->|Yes| Bu[Burstable QoS]
    C -->|No| BE[BestEffort QoS]
    Bu --> D{Node under\nmemory pressure?}
    BE --> D
    G --> D
    D -->|Guaranteed| L1[Last to evict]
    D -->|Burstable| L2[Middle eviction]
    D -->|BestEffort| L3[First to evict]

Guaranteed pods

containers:
  - name: database
    image: postgres:15
    resources:
      limits:
        memory: "2Gi"
        cpu: "2000m"
      requests:
        memory: "2Gi"
        cpu: "2000m"

Pods with identical requests and limits get the highest QoS. Use this for critical workloads that should not be evicted.

BestEffort pods

containers:
  - name: batch-job
    image: my-batch-job
    resources: {}

No resource specifications means BestEffort. These pods are first in line for eviction when the node runs low on resources.

Burstable pods

containers:
  - name: web-app
    image: nginx:1.25
    resources:
      requests:
        memory: "128Mi"
        cpu: "100m"
      limits:
        memory: "512Mi"
        cpu: "500m"

Most pods fall into Burstable. They have some guaranteed resources but can burst above their requests when available.
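The classification rules can be sketched in a few lines of Python. This is an illustration, not the kubelet's implementation: the function name is mine, each container is represented only by its resources block, and quantities are compared as literal strings (so "1" and "1000m" would not be treated as equal here).

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a pod from its containers' resources blocks.

    Each entry is shaped like the `resources` field in a pod spec:
    {"requests": {"cpu": "100m", ...}, "limits": {...}}.
    """
    any_set = False
    all_guaranteed = True
    for resources in containers:
        limits = resources.get("limits", {})
        # A limit without a matching request is copied into the request.
        requests = {**limits, **resources.get("requests", {})}
        if requests or limits:
            any_set = True
        for name in ("cpu", "memory"):
            # Guaranteed needs both resources limited, with request == limit.
            if name not in limits or requests.get(name) != limits[name]:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


print(qos_class([{"requests": {"cpu": "100m", "memory": "128Mi"},
                  "limits": {"cpu": "500m", "memory": "512Mi"}}]))  # Burstable
print(qos_class([{}]))  # BestEffort
```

One consequence worth noticing: a container that sets only limits for both CPU and memory ends up Guaranteed, because the missing requests default to the limits.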

LimitRange for Namespace Quotas

A LimitRange sets default, minimum, and maximum resource limits for pods and containers in a namespace. Without it, pods without resource specs become BestEffort.

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - max:
        memory: "4Gi"
        cpu: "2000m"
      min:
        memory: "64Mi"
        cpu: "50m"
      default:
        memory: "256Mi"
        cpu: "250m"
      defaultRequest:
        memory: "128Mi"
        cpu: "100m"
      type: Container

This LimitRange:

Sets maximum memory and CPU per container
Sets minimum memory and CPU per container
Applies default limits when containers specify no limits
Applies default requests when containers specify no requests

Without this LimitRange, a container with no resource specs gets no guaranteed resources and can be evicted first.
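To see how defaulting and validation interact, here is a simplified Python model of the default-limits LimitRange. The function names are hypothetical and values are plain integers (millicores and MiB). The real admission plugin checks more combinations than this sketch does.

```python
# Values from the default-limits LimitRange above, in millicores and MiB.
LIMIT_RANGE = {
    "min": {"cpu": 50, "memory": 64},
    "max": {"cpu": 2000, "memory": 4096},
    "default": {"cpu": 250, "memory": 256},         # default limits
    "defaultRequest": {"cpu": 100, "memory": 128},  # default requests
}


def apply_defaults(resources: dict) -> dict:
    """Fill in whatever requests and limits the container left out."""
    given_limits = resources.get("limits", {})
    limits = {**LIMIT_RANGE["default"], **given_limits}
    # A request falls back to the container's own limit if it set one,
    # and to defaultRequest otherwise.
    requests = {**LIMIT_RANGE["defaultRequest"], **given_limits,
                **resources.get("requests", {})}
    return {"requests": requests, "limits": limits}


def violations(resources: dict) -> list[str]:
    """Return the reasons this container would be rejected, if any."""
    problems = []
    for name in ("cpu", "memory"):
        request = resources["requests"][name]
        limit = resources["limits"][name]
        if request < LIMIT_RANGE["min"][name]:
            problems.append(f"{name} request below min")
        if limit > LIMIT_RANGE["max"][name]:
            problems.append(f"{name} limit above max")
        if request > limit:
            problems.append(f"{name} request above limit")
    return problems


spec = apply_defaults({})  # a container with no resources block
print(spec)
print(violations(spec))  # []
```

A useful case to try is a container that requests 400m CPU without setting a limit: it inherits the 250m default limit, so its request ends up above its limit and the pod is rejected.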

Applying LimitRange

kubectl apply -f limitrange.yaml
kubectl describe limitrange default-limits -n production

The output shows the actual limits applied:

Type        Resource  Min   Max   Default Request  Default Limit
Container    cpu       50m   2     100m             250m
Container    memory    64Mi  4Gi   128Mi            256Mi

ResourceQuota for Cluster-Wide Limits

A ResourceQuota limits total resource consumption in a namespace. Use it to prevent any single namespace from consuming all cluster resources.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
    pods: "50"
    persistentvolumeclaims: "10"

This quota limits the entire production namespace to 10 CPU requests, 20Gi memory requests, 20 CPU limits, 40Gi memory limits, 50 pods, and 10 persistent volume claims.

Viewing quota usage

kubectl describe resourcequota production-quota -n production

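The output shows current usage against the hard limits. When a quota is exhausted, Kubernetes rejects new resource creation in that namespace. Once a quota tracks CPU or memory requests and limits, pods that leave those values unset can be rejected as well, which is one more reason to pair ResourceQuota with LimitRange defaults.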

The kubectl describe output breaks down usage by resource type, so you can see how much of your 20Gi memory request limit is currently consumed across all pods in the namespace. It also shows pods count, persistentvolumeclaims, and any other objects the quota tracks.

If usage is approaching a hard limit, you have a few options: scale down pods, increase the quota (if cluster headroom exists), or clean up unused resources. Unlike LimitRange (which enforces per-container constraints at pod creation), ResourceQuota enforces aggregate limits across the entire namespace. Both constraints work together — a pod that passes the LimitRange checks can still get rejected by ResourceQuota if the namespace has already consumed its total allocation.
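The aggregate check can be sketched as a simple sum-and-compare. The function below is an illustration with made-up usage numbers, using integers in millicores and MiB. It only models the keys from the production-quota example.

```python
# Hard limits from the production-quota example, in millicores and MiB.
HARD = {"requests.cpu": 10_000, "requests.memory": 20 * 1024,
        "limits.cpu": 20_000, "limits.memory": 40 * 1024, "pods": 50}


def quota_violations(used: dict, pod: dict) -> list[str]:
    """Return the quota keys a new pod would push past their hard limit."""
    exceeded = []
    for key, hard in HARD.items():
        # Each pod counts as 1 toward "pods"; other keys add the pod's totals.
        delta = 1 if key == "pods" else pod.get(key, 0)
        if used.get(key, 0) + delta > hard:
            exceeded.append(key)
    return exceeded


# Hypothetical namespace usage: CPU requests are almost exhausted.
used = {"requests.cpu": 9_900, "requests.memory": 10_000,
        "limits.cpu": 15_000, "limits.memory": 20_000, "pods": 30}
pod = {"requests.cpu": 250, "requests.memory": 256,
       "limits.cpu": 500, "limits.memory": 512}
print(quota_violations(used, pod))  # ['requests.cpu']
```

The pod in this example is well within any per-container LimitRange, yet it is still rejected because the namespace has only 100m of CPU requests left.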

For programmatic monitoring, query the quota status directly:

kubectl get resourcequota production-quota -n production -o jsonpath='{.status}' | jq .

This returns the current usage and hard limits as JSON, which you can feed into alerting tools. Set up an alert when usage exceeds 80% of any hard limit so you catch quota exhaustion before it blocks new deployments.
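As a rough sketch of that 80% check, the Python below takes the status JSON and lists the keys that are over the threshold. The sample payload is invented for illustration, the function names are mine, and the quantity parser handles only the forms used in this post.

```python
import json
from decimal import Decimal

_SUFFIXES = {"m": Decimal("0.001"), "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def to_number(quantity: str) -> Decimal:
    """Parse the quantity forms used in this article: 250m, 64Mi, 20Gi, 50."""
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return Decimal(quantity[: -len(suffix)]) * factor
    return Decimal(quantity)


def quota_alerts(status_json: str, threshold: Decimal = Decimal("0.8")) -> list[str]:
    """List quota keys whose used/hard ratio exceeds the threshold."""
    status = json.loads(status_json)
    alerts = []
    for key, hard in status["hard"].items():
        used = status.get("used", {}).get(key, "0")
        if to_number(used) > threshold * to_number(hard):
            alerts.append(key)
    return sorted(alerts)


# Invented sample in the shape of a ResourceQuota status object.
sample = json.dumps({
    "hard": {"requests.cpu": "10", "requests.memory": "20Gi", "pods": "50"},
    "used": {"requests.cpu": "8500m", "requests.memory": "10Gi", "pods": "12"},
})
print(quota_alerts(sample))  # ['requests.cpu']
```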

Pod Resource Testing and Tuning

Finding the right requests and limits takes measurement. Kubernetes lets you profile pod behavior before setting values in production.

kubectl run with resource specs

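# Older kubectl releases accepted --requests and --limits here; current
# releases need the resources passed through --overrides instead.
kubectl run -it --rm load-generator \
  --image=busybox \
  --restart=Never \
  --overrides='{"spec":{"containers":[{
    "name":"load-generator","image":"busybox",
    "command":["sh"],"stdin":true,"tty":true,
    "resources":{
      "requests":{"cpu":"500m","memory":"128Mi"},
      "limits":{"cpu":"1000m","memory":"256Mi"}}}]}}'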

Use this to run temporary pods and observe resource consumption with monitoring tools.

Vertical Pod Autoscaler

The Vertical Pod Autoscaler (VPA), an add-on you install separately from core Kubernetes, analyzes historical resource usage and recommends or automatically applies better resource requests:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Auto"

In Auto mode, VPA evicts and reschedules pods with updated resource specs. In Off mode, it only provides recommendations.

VPA helps you find baseline resource requirements without manual profiling.

Horizontal Pod Autoscaler

The Horizontal Pod Autoscaler (HPA) scales pod replicas based on CPU, memory, or custom metrics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

The HPA scales between 3 and 10 replicas to keep average CPU usage at 70% of the pods’ CPU requests. For memory-based scaling:

metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
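The scaling decision itself comes down to a ratio. The sketch below follows the formula in the Kubernetes HPA documentation, desired = ceil(current replicas × current utilization / target utilization), with the default 10% tolerance. The function name and the sample numbers are mine, and it leaves out stabilization windows and scaling policies.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float, min_replicas: int,
                     max_replicas: int, tolerance: float = 0.1) -> int:
    """Apply the documented HPA ratio formula, then clamp to min/max."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# 3 replicas averaging 105% of their CPU request, target 70%.
print(desired_replicas(3, 105, 70, min_replicas=3, max_replicas=10))  # 5
```

Because utilization is measured against requests, an undersized request makes the HPA scale out earlier than you might expect.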

Trade-off Analysis

Requests vs Limits Trade-offs

Aspect	Requests Only	Requests + Limits
Scheduling	Predictable node placement	Predictable scheduling + resource capping
QoS class	Burstable (best available)	Guaranteed if equal, otherwise Burstable
Memory protection	None	Container is OOMKilled at its limit before it can exhaust node memory
CPU behavior	No throttle, unlimited burst	CPU throttled at limit
Production readiness	Insufficient	Required for production

QoS Class Trade-offs

QoS Class	Scheduling guarantee	Eviction priority	Resource efficiency	Use when
Guaranteed	Highest	Last evicted	Low (reserved full)	Critical infrastructure, licensed software
Burstable	Medium	Middle	High (flexible)	Most application workloads
BestEffort	None	First evicted	Highest	Batch jobs only

Trade-off reality: Guaranteed pods reserve their full resource request even when idle. BestEffort pods get whatever is left, making them efficient but risky for production.

LimitRange vs ResourceQuota

Aspect	LimitRange	ResourceQuota
Scope	Per namespace, per container	Per namespace, total namespace
What it controls	Min/max/default requests/limits	Total CPU, memory, pod count
Enforcement	Applied when pod is created	Checked when resources are created
Use case	Prevent BestEffort pods, set defaults	Prevent namespace from monopolizing cluster

Vertical Pod Autoscaler vs Manual Resource Tuning

Aspect	VPA (Automated)	Manual Tuning
Effort	Low	High (profiling required)
Precision	Based on historical data	Based on engineered judgment
Disruption	Pod evictions in Auto mode	None (no changes unless you apply)
Best for	Initial baseline discovery	Production fine-tuning

CPU Throttling Trade-offs

Aspect	With CPU Limits	Without CPU Limits
Latency	p99 spikes from throttling	More consistent while the node has idle CPU
Resource cost	More efficient node utilization	May over-provision requests
Fairness	Prevents one pod dominating CPU	Pod can burst into idle CPU; requests set its share under contention
Recommendation	Avoid for latency-sensitive svcs	Consider for latency-sensitive svcs

For latency-sensitive services where consistent p99 latency matters, either remove CPU limits and rely on requests for scheduling, or set CPU limits equal to requests for Guaranteed QoS and size that value for peak load, because a Guaranteed container is still throttled at its limit.

Memory Limit Trade-offs

Aspect	Low memory limit	High memory limit
OOM risk	Frequent OOMKilled pods	Rare OOMKilled pods
Node efficiency	Higher (more pods per node)	Lower (reserved memory)
Debugging	Easier to spot (frequent crashes)	May hide memory leaks
Recommendation	Set ~20-30% above expected peak	Set for expected peak + headroom

Production Failure Scenarios

BestEffort Pods Evicted Under Memory Pressure

BestEffort pods are first in line for eviction when a node runs low on memory. If your cluster has a lot of them, eviction events pile up fast.

The kubelet’s eviction manager handles this. It periodically scans resource usage on the node and evicts pods when available memory drops below a threshold. It ranks pods first by whether their memory usage exceeds their requests, then by pod priority, then by how far usage exceeds requests. A BestEffort pod has no requests, so any usage counts as exceeding them: a BestEffort pod using 10Mi is evicted before a Burstable pod that stays within its 1Gi request. BestEffort pods are unsuitable for any workload where you actually care about uptime.

Eviction order: BestEffort pods and Burstable pods whose usage exceeds their requests go first, then Burstable pods within their requests and Guaranteed pods as a last resort.
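A rough sketch of that ranking, following the order given in the Kubernetes node-pressure eviction documentation (usage over requests first, then pod priority, then the size of the overage). The PodUsage class, the field names, and the numbers are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class PodUsage:
    name: str
    priority: int      # from the pod's PriorityClass; 0 if none
    request_mib: int   # memory request; 0 for BestEffort pods
    usage_mib: int     # current memory usage


def eviction_order(pods: list[PodUsage]) -> list[str]:
    """Rank pods for eviction under memory pressure, first to go first."""
    def rank(pod: PodUsage):
        over = pod.usage_mib - pod.request_mib
        # False sorts before True, so pods over their request come first;
        # then lower priority first; then the largest overage first.
        return (over <= 0, pod.priority, -over)
    return [pod.name for pod in sorted(pods, key=rank)]


pods = [
    PodUsage("db", priority=0, request_mib=2048, usage_mib=1500),    # Guaranteed
    PodUsage("web", priority=0, request_mib=1024, usage_mib=1000),   # Burstable
    PodUsage("batch", priority=0, request_mib=0, usage_mib=10),      # BestEffort
]
print(eviction_order(pods))  # ['batch', 'web', 'db']
```

The BestEffort pod goes first even though it uses the least memory, because with no request every byte it uses counts as overage.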

Symptoms to watch for:

Frequent evictions in kubectl get events --all-namespaces | grep Evicted
Pods restarting with OOMKilled as the last termination reason
One namespace with constant churn while the rest of the cluster looks fine

Mitigation:

Set resource requests for all production pods. Even minimal requests promote a pod to Burstable QoS.
Use LimitRange to enforce minimum resource requests per namespace so new containers cannot become BestEffort.
Monitor the kube-state-metrics kube_pod_status_qos_class metric (if your kube-state-metrics version exposes it) to track how many BestEffort pods exist in each namespace.
For batch workloads that genuinely have no resource requirements, isolate them in a dedicated namespace with their own ResourceQuota.

# Check eviction history in a namespace
kubectl get events --namespace production --field-selector reason=Evicted

# Count pods by QoS class (print only the class so uniq -c can group it)
kubectl get pods --namespace production -o jsonpath='{range .items[*]}{.status.qosClass}{"\n"}{end}' | sort | uniq -c

OOMKilled Pods from Memory Limit Too Low

When a container exceeds its memory limit, Kubernetes kills it with OOMKilled status. This is one of the most common production issues I see. The fix is often raising the memory limit, but first rule out a memory leak, which a higher limit only postpones.

Check actual memory usage first:

kubectl describe pod <pod-name>  # Look for OOMKilled in state
kubectl top pod <pod-name>  # Check actual memory usage

Set limits about 20-30% above what you see in staging to account for traffic spikes.

CPU Throttling Impacting Latency

CPU limits throttle containers even when the node has free CPU. For latency-sensitive services, this is where things go sideways.

Kubernetes enforces CPU limits through the Linux CFS (Completely Fair Scheduler) bandwidth control. When a container hits its CPU limit within a scheduling period, the kernel withholds CPU time until the next period. Kubernetes does not kill the container — it just makes it wait. The container sits there unable to use the CPU sitting idle on the node because its bandwidth limit for this period is already consumed.

You see this in latency metrics. A service that normally responds in 5ms p99 might show 50ms or 200ms p99 under CPU throttling even though the node has idle cores available. Average CPU masks this because the container runs fine most of the time and only slows down during throttled periods.
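A back-of-the-envelope model shows where those extra milliseconds come from. It assumes the default 100ms CFS period, a single busy thread, and work that starts at the beginning of a period. The function name is mine and real workloads are messier than this.

```python
def wall_time_ms(cpu_needed_ms: int, limit_millicores: int, period_ms: int = 100) -> int:
    """Wall-clock time for one busy thread under a CFS quota."""
    quota_ms = limit_millicores * period_ms // 1000  # CPU time allowed per period
    if quota_ms <= 0:
        raise ValueError("limit too small for this simplified model")
    if quota_ms >= period_ms:
        return cpu_needed_ms  # one thread cannot exhaust a quota this large
    # Number of periods in which the quota runs out before the work is done.
    full_periods = -(-cpu_needed_ms // quota_ms) - 1
    return full_periods * period_ms + (cpu_needed_ms - full_periods * quota_ms)


# A 500m limit allows 50ms of CPU time per 100ms period.
print(wall_time_ms(40, 500))   # 40  (fits in one period, no throttling)
print(wall_time_ms(60, 500))   # 110 (50ms of work, 50ms waiting, 10ms of work)
print(wall_time_ms(120, 500))  # 220
```

A request that needs 60ms of CPU takes 110ms of wall time under a 500m limit, no matter how many cores on the node are idle.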

Diagnosing CPU throttling:

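# Check CFS throttling counters inside the container (cgroup v1 path)
kubectl exec <pod-name> -- cat /sys/fs/cgroup/cpu/cpu.stat

# On nodes running cgroup v2, the counters live at the cgroup root
kubectl exec <pod-name> -- cat /sys/fs/cgroup/cpu.stat

The metric to watch is nr_throttled. If it is high relative to nr_periods, the container is regularly hitting its CPU limit. kubectl top reports usage only, not throttling; if you scrape cAdvisor metrics with Prometheus, compare container_cpu_cfs_throttled_periods_total against container_cpu_cfs_periods_total instead.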

High p99 latency with low average CPU usage is the telltale sign. If your latency charts show spikes but average CPU stays below 50%, CPU throttling is a likely cause.
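To turn the cpu.stat counters into a single number, divide throttled periods by total periods. The snippet below is a small sketch with invented sample values. The function name is mine, and the two counter names it reads are present in both the cgroup v1 and v2 versions of the file.

```python
def throttled_ratio(cpu_stat_text: str) -> float:
    """Fraction of CFS periods in which the container was throttled."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


# Invented sample in the cgroup v1 cpu.stat layout.
sample = """nr_periods 1200
nr_throttled 300
throttled_time 4500000000"""
print(f"{throttled_ratio(sample):.0%}")  # 25%
```

The counters are cumulative since the container started, so compare two readings taken a few minutes apart if you want the current rate.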

Workloads where this matters most:

API servers with strict SLA latency targets (p99 < 50ms)
Real-time data pipelines processing events with deadlines
gRPC services where a single slow response blocks a client thread
Any service where request queuing is not an option

Remediation:

Remove CPU limits entirely. The container still respects its request for scheduling, but once scheduled it can burst above the request when free CPU exists.
Set CPU limits equal to requests. With memory matched as well, this promotes the pod to Guaranteed QoS and makes its CPU budget predictable, at the cost of reserved capacity. It does not remove throttling: the container is still capped at the limit, so size the shared value for the service’s peak, not for its normal load.
Right-size the CPU limit based on actual peak measurement. Use VPA in recommendation mode first to understand realistic burst peaks, then set limits to cover those peaks without overprovisioning.

For batch workloads where latency is irrelevant, CPU limits are useful to prevent one job from monopolizing available CPU. For latency-sensitive services, treat CPU limits as a knob to use carefully or not at all.

Anti-Patterns

Setting Identical Requests and Limits for All Pods

Treating every pod the same wastes resources. A web server handling 1000 req/s has different needs than a batch job processing queues.

Profile each application type separately and set appropriate resource specs.

Workload profiles fall into a few distinct categories:

Workload Type	CPU Profile	Memory Profile	Example
Request-driven web servers	Moderate, stable	Moderate, bounded	API gateways, frontend services
Background workers	Low, bursty	Variable, often higher	Queue processors, batch transformers
Data-intensive services	High	High, possible growth	ML inference, image processing
Cached services	Low CPU	High memory, no growth	Redis, in-memory aggregators

A web server handling 1000 requests per second might need 250m CPU and 256Mi memory under normal load. Under peak load with 5000 req/s, it might need 1000m CPU and 512Mi memory. A batch job that runs every hour might need 500m CPU but only 128Mi memory because it processes one message at a time and releases memory immediately after.

If you set identical values for both, the web server gets throttled during traffic spikes while the batch job wastes resources it never uses. Run kubectl top pod over a representative period to establish baselines before setting production values.

Not Setting Memory Limits

Memory limits prevent runaway processes from consuming all node memory and causing node-level OOM events. Always set memory limits, especially for applications that can experience memory leaks.

Without memory limits, a single container can consume all available memory on a node. When that happens, the Linux kernel triggers an OOM killer at the node level. Unlike container-level OOMKilled events (which kill only the offending container), a node-level OOM kill can hit processes in any pod on that node, including pods that were behaving well.

The distinction matters:

Scenario	What Gets Killed	Who Is Affected
Container hits memory limit	Only that container	Just your pod
Node runs out of memory	Whichever processes the kernel OOM killer selects	Potentially every workload sharing the node

Memory leaks are a frequent culprit. A Java app with a growing heap, a Python process that keeps objects around, a Go service whose goroutine stack keeps expanding — all of these can quietly fill gigabytes of node memory over hours or days. Without a container-level memory limit, nothing stops them.

Even well-behaved applications that do not leak memory can cause issues. A service that uses 128Mi during quiet hours and 1Gi at peak traffic absorbs whatever is available, and many runtimes hold on to that memory long after the peak has passed. Other pods on the same node can then be squeezed during the quiet hours because the first pod is still holding memory it is not actively using.

Set memory limits to the expected peak plus 20-30% headroom. If staging shows a service using up to 800Mi under load, set the limit to 1024Mi (1Gi). This gives the container room to handle bursts without being killed, while preventing a single runaway process from consuming the entire node.
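The headroom arithmetic is simple enough to script. This small sketch (the function name is mine) returns the 20% and 30% bounds for an observed peak, using integer math to avoid rounding surprises.

```python
def memory_limit_range_mib(peak_mib: int, low_pct: int = 20, high_pct: int = 30) -> tuple[int, int]:
    """Return the (low, high) limit range for a peak plus 20-30% headroom."""
    low = -(-peak_mib * (100 + low_pct) // 100)    # ceiling division
    high = -(-peak_mib * (100 + high_pct) // 100)
    return low, high


print(memory_limit_range_mib(800))  # (960, 1040)
```

For the 800Mi example, a 1024Mi limit sits inside the 960Mi to 1040Mi range.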

Over-Provisioning CPU Limits

Setting CPU limits very high (like 4 cores for a simple web server) defeats the purpose of limits. The scheduler uses requests for node allocation decisions, not limits.

Set CPU limits based on actual expected peak load, not theoretical maximum.

The scheduler decides pod placement using requests, not limits. Set requests to 100m and limits to 4000m, and the scheduler reserves only 100m of CPU when picking a node. The 4000m limit does not reserve 4000m on any node; it just caps what the container can use if it tries to burst.

This creates a false sense of capacity. Imagine Node A has 4 cores and runs 10 pods with 100m requests each (1000m total). Each pod has a 4000m limit, but those limits are invisible to the scheduler. The node’s limits are overcommitted: it has 4000m of CPU but the pods collectively want up to 40000m if they all hit their limits at once. Kubernetes allows this because only requests matter for placement.

When traffic spikes hit all 10 pods simultaneously, each tries to use close to its limit. The node has only 4 cores to share, so the kernel splits CPU time between the pods in proportion to their requests, and p99 latency spikes because none of them can get the CPU they are trying to use.
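The numbers in this scenario are easy to check with a short script. The sketch below (function and key names are mine) sums requests and limits for one node and estimates each pod's share when all of them are busy at once, assuming CPU time is split in proportion to requests and at least one pod has a non-zero request.

```python
def node_cpu_picture(capacity_m: int, pods: list[tuple[int, int]]) -> dict:
    """Summarize requests, limits, and contended shares for one node.

    `pods` holds (request_millicores, limit_millicores) pairs.
    """
    total_requests = sum(request for request, _ in pods)
    total_limits = sum(limit for _, limit in pods)
    return {
        "fits_scheduler": total_requests <= capacity_m,
        "limit_overcommit": total_limits / capacity_m,
        # If every pod is busy at once, CPU time is split in proportion to
        # requests, and no pod can exceed its own limit.
        "share_under_contention_m": [
            min(limit, capacity_m * request // total_requests)
            for request, limit in pods
        ],
    }


picture = node_cpu_picture(4000, [(100, 4000)] * 10)
print(picture["fits_scheduler"])               # True
print(picture["limit_overcommit"])             # 10.0
print(picture["share_under_contention_m"][0])  # 400
```

The scheduler sees a node that is only a quarter full, while the limits promise ten times the CPU that exists. Under full contention each pod gets roughly 400m, a tenth of its limit.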

Base CPU limits on realistic peak load, not theoretical maximums. A web server that spends 10ms of CPU time per request and handles 100 requests per second at peak needs about 1000m of CPU (100 × 10ms is one CPU-second every second). Setting the limit to 4000m “just in case” provides no scheduling benefit and hides the actual resource picture.

For latency-sensitive services, set CPU limits equal to requests for Guaranteed QoS, or drop CPU limits entirely and let the service use whatever CPU is available. For batch workloads that benefit from bursting, set limits high enough to capture realistic burst peaks but not so high that they hide an undersized request.

Interview Questions

1. What is the difference between a resource request and a resource limit in Kubernetes?

Expected answer points:

Requests define what a container needs — the scheduler uses requests to decide node placement
Limits define the maximum a container can consume — exceeding memory kills with OOMKilled, exceeding CPU gets throttled
A node must have enough allocatable resources to satisfy a pod's requests
Requests are guaranteed; limits are a ceiling

2. What happens when a container exceeds its memory limit versus its CPU limit?

Expected answer points:

Memory: Kubernetes kills the container with OOMKilled status and exit code 137
CPU: Kubernetes throttles the container, slowing it down but not killing it (CPU is compressible)
Memory is not compressible — exceeding the limit has immediate consequences
CPU throttling shows as high p99 latency even with low average CPU usage

3. What are the three Kubernetes QoS classes and how are they assigned?

Expected answer points:

Guaranteed — requests == limits for ALL containers in the pod
Burstable — requests < limits, or some containers have no limits set
BestEffort — no requests and no limits set on any container
QoS determines eviction order when the node runs out of resources: BestEffort first, Burstable next, Guaranteed last

4. When should you use Guaranteed QoS versus Burstable QoS?

Expected answer points:

Guaranteed — for critical infrastructure pods, databases, licensed software with strict resource requirements that must never be evicted
Burstable — for most application workloads that benefit from burst capacity when extra resources are available
Guaranteed wastes resources by reserving the full request value even when idle
BestEffort should never be used for production workloads

5. What is a LimitRange and what does it enforce?

Expected answer points:

A LimitRange sets default, minimum, and maximum resource limits for containers in a namespace
Without it, containers with no resource specs become BestEffort and are first in line for eviction
It applies defaults when containers do not specify requests or limits
It enforces mins and maxes to prevent over-provisioning or under-provisioning

6. What is a ResourceQuota and how does it differ from a LimitRange?

Expected answer points:

ResourceQuota limits total resource consumption across an entire namespace
LimitRange sets per-container defaults, minimums, and maximums
ResourceQuota prevents any single namespace from consuming all cluster resources
Both are namespace-scoped and enforced at pod creation time

7. How does the Vertical Pod Autoscaler help with resource configuration?

Expected answer points:

VPA analyzes historical resource usage and recommends or automatically applies better resource requests
Auto mode evicts and reschedules pods with updated specs
Off mode only provides recommendations without making changes
Useful for finding baseline resource requirements without manual profiling

8. How does OOMKilled happen and how do you diagnose it?

Expected answer points:

OOMKilled occurs when a container exceeds its memory limit — Linux kernel kills it with exit code 137
Diagnose with `kubectl describe pod <pod-name>` and look for OOMKilled in the container state
Check actual memory usage with `kubectl top pod <pod-name>`
Mitigation: increase memory limit or optimize application memory usage

9. What is CPU throttling and how does it affect application performance?

Expected answer points:

CPU throttling occurs when a container hits its CPU limit — Kubernetes enforces the limit via CFS bandwidth control
The container runs but gets less CPU time than it wants, causing latency spikes
High p99 latency with low average CPU usage is the telltale sign of CPU throttling
For latency-sensitive services, either remove CPU limits or set them equal to requests for Guaranteed QoS

10. What is the relationship between resource requests and the Kubernetes scheduler?

Expected answer points:

The scheduler uses requests (not limits) to decide which node a pod goes on
A node must have at least as much allocatable resources as the pod's requests
Limits do not affect scheduling decisions — only requests do
Setting requests correctly is essential for cluster stability and fair scheduling

11. Why is setting memory limits more critical than setting CPU limits?

Expected answer points:

Memory is not compressible — exceeding the limit kills the container immediately with OOMKilled
CPU is compressible — exceeding the limit only throttles the container
Memory leaks in one container can consume all node memory without limits
Node-level OOM events can affect all pods on the node if memory limits are not set

12. How does the Horizontal Pod Autoscaler work with resource metrics?

Expected answer points:

HPA scales replicas based on observed CPU or memory utilization against a target
For CPU: `averageUtilization: 70` means the HPA adds replicas when average CPU usage exceeds 70% of the pods' CPU requests
HPA can also use custom metrics or external metrics for scaling decisions
HPA works alongside resource requests — it scales replica count, not the per-pod resource size

13. What happens when a namespace exceeds its ResourceQuota?

Expected answer points:

Kubernetes rejects new resource creation in that namespace
Pod creation fails with "exceeded quota" error
The quota applies to both requests and limits depending on the quota spec
Use `kubectl describe resourcequota -n <namespace>` to see current usage against hard limits

14. What anti-patterns exist around Kubernetes resource configuration?

Expected answer points:

Setting identical requests and limits for all pods — different workloads have different needs
Not setting memory limits — memory leaks can consume entire nodes
Over-provisioning CPU limits — high limits defeat the purpose; base limits on actual peak load
Setting only limits without requests: Kubernetes copies each limit into the missing request, so the pod reserves its full limit on the node and may become Guaranteed when you expected Burstable

15. How do you tune resource limits for a latency-sensitive service?

Expected answer points:

Monitor actual p99 latency in production to determine if CPU throttling is occurring
If throttling is present, either remove CPU limits or set them equal to requests for Guaranteed QoS
Set memory limits about 20-30% above observed peak usage to handle traffic spikes
Use VPA in Off mode first to understand actual resource consumption before setting production values

16. How does Kubernetes handle resource overcommitment in a cluster?

Expected answer points:

Kubernetes allows overcommitment of limits: the limits of the pods on a node can add up to more than the node's capacity, while their requests must fit within its allocatable resources
Requests are used for scheduling; limits are enforced at runtime
When a node comes under memory pressure, the kubelet evicts pods, with BestEffort pods going first
Overcommitment increases utilization but risks OOMKilled events if too many pods burst simultaneously

17. How do LimitRange and ResourceQuota work together to enforce resource constraints?

Expected answer points:

LimitRange operates at the container level within a namespace — it sets defaults, minimums, and maximums for individual pods
ResourceQuota operates at the namespace level — it caps total requests and limits across all pods in the namespace
LimitRange prevents BestEffort pods by enforcing minimum requests; ResourceQuota prevents any single namespace from consuming all cluster resources
Both are enforced at pod creation time — pods that exceed either constraint are rejected

18. When multiple pods share the same QoS class, how does Kubernetes determine eviction order?

Expected answer points:

Within the same QoS class, Kubernetes uses priority class and then resource usage to determine eviction order
Pods with lower priority class are evicted before pods with higher priority
Among pods with equal priority, those using the most memory relative to their requests are evicted first
You can set pod priority via PriorityClass to influence eviction decisions beyond basic QoS

19. How do Vertical Pod Autoscaler and Horizontal Pod Autoscaler work together?

Expected answer points:

VPA adjusts the resource requests (CPU/memory) of individual pod containers — it changes what each pod needs
HPA adjusts the replica count of a Deployment or StatefulSet — it changes how many pods run
VPA handles vertical scaling (bigger/smaller pods); HPA handles horizontal scaling (more/fewer pods)
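Using both: VPA sets appropriate resource requests, HPA scales replicas based on utilization; avoid driving both from the same CPU or memory metric, because each changes the signal the other reacts to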

20. What does the Kubernetes notation for CPU and memory values mean in practice?

Expected answer points:

CPU: `100m` = 0.1 CPU (100 millicores), `1` = 1 CPU, and 1000m = 1 CPU. One CPU is one physical core or one virtual core (vCPU), depending on the node
Memory: `Ki` = kibibytes (1024), `Mi` = mebibytes (1024^2), `Gi` = gibibytes (1024^3); `1Gi` is approximately 1.07 GB
Binary suffixes (Ki, Mi, Gi) are powers of 1024; decimal suffixes (k, M, G) are powers of 1000. Prefer binary suffixes in pod specs, and note that a lowercase `m` on a memory value means thousandths of a byte, not megabytes
Precision: `1m` is the finest CPU granularity Kubernetes accepts; memory values are plain byte counts, so practical minimums come from the application and the container runtime rather than from the notation

Conclusion

Use this checklist when configuring Kubernetes resource limits:
