Virtualization Basics

Explore hypervisors, virtual machines, containers, and OS-level virtualization — understanding the technologies powering cloud computing.

Reading time: 28 min | Author: GeekWorkBench

The cloud computing revolution runs on virtualization. Every AWS instance, every Docker container, every Kubernetes pod exists because of a fundamental idea: you can make one physical computer look like many computers, or conversely, make many physical computers look like one. Understanding virtualization isn’t optional anymore—it’s foundational infrastructure knowledge for anyone building, deploying, or operating modern software systems.

Whether you’re debugging a container networking issue, architecting a multi-tenant SaaS platform, or trying to understand why your Kubernetes pod behaves differently than expected, virtualization concepts are essential.

Introduction

Virtualization is the simulation of hardware or software resources. In computing, it typically means creating multiple isolated environments on a single physical machine, where each environment believes it has exclusive access to its own set of resources.

The key benefits that drove adoption:

  • Server consolidation — Run multiple “servers” on one physical machine, raising hardware utilization from the roughly 15% often cited for dedicated servers to 70% or more
  • Isolation — A bug or security issue in one VM/container is contained instead of spreading to its neighbors
  • Elasticity — Create and destroy environments on demand
  • Portability — VMs and containers package an environment that runs the same way on any compatible host

Modern virtualization exists on a spectrum, from full hardware emulation (virtual machines) to lightweight process isolation (containers), each with different tradeoffs.

When to Use / When Not to Use

When Virtualization Is Essential

  • Cloud computing — All major cloud providers (AWS, GCP, Azure) run workloads in virtualized environments
  • Multi-tenant SaaS — Isolating customer data and compute resources
  • Legacy application hosting — Running old OS versions on modern hardware
  • Development and testing — Reproducing production environments locally
  • Microservices architecture — Containers provide the process isolation and deployment model microservices need

When Virtualization May Be Overkill

  • Simple scripts — Native execution is faster and simpler
  • Single-tenant high-performance workloads — Bare metal may provide better performance
  • Resource-constrained environments — Guest operating systems and container runtimes add memory and CPU overhead that small devices may not be able to spare
  • Real-time systems with hard latency guarantees — Virtualization introduces unpredictable latency

Virtualization Architecture

Hypervisor Types

graph TB
    subgraph "Type 1 Hypervisor (Bare Metal)"
        A[Hardware] --> B[Xen / VMware ESXi / Hyper-V]
        B --> C[VM 1]
        B --> D[VM 2]
        B --> E[VM N]
    end

    subgraph "Type 2 Hypervisor (Hosted)"
        F[Hardware] --> G[Host OS]
        G --> H[VMware Workstation / VirtualBox]
        H --> I[VM 1]
        H --> J[VM 2]
    end

    style A stroke:#00fff9,stroke-width:2px
    style F stroke:#00fff9,stroke-width:2px

Container Architecture

graph TB
    subgraph "Host Kernel"
        A[Host OS Kernel]
        A --> B[Namespaces]
        A --> C[cgroups]
        A --> D[Overlay Filesystem]
    end

    subgraph "Containers"
        E[Container 1]
        F[Container 2]
        G[Container N]
    end

    subgraph "Container Runtime"
        H[containerd / CRI-O]
        H --> I[runc]
    end

    B --> E
    B --> F
    C --> E
    C --> F
    D --> E
    D --> F

    style A stroke:#ff00ff,stroke-width:2px
    style H stroke:#00fff9,stroke-width:2px

Core Concepts

Type 1 Hypervisors (Bare Metal)

Type 1 hypervisors run directly on hardware without a host operating system. They are the foundation of enterprise virtualization and cloud computing.

VMware ESXi — Commercial hypervisor with vSphere management, known for reliability and enterprise features.

Microsoft Hyper-V — Windows-native hypervisor that also runs Linux VMs; integrated with Windows Server.

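Xen — Open-source hypervisor that powered earlier generations of AWS EC2. AWS’s newer Nitro system is not a Xen variant: it pairs a lightweight KVM-based hypervisor with dedicated hardware that takes over networking, storage, and management tasks.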

KVM — Kernel-based Virtual Machine. Linux kernel module that turns Linux into a Type 1 hypervisor. Combined with QEMU for device emulation, KVM powers many cloud providers and is the foundation of Red Hat Virtualization.

Type 2 Hypervisors (Hosted)

Type 2 hypervisors run as an application within a host operating system. They’re primarily used for desktop virtualization.

VirtualBox — Oracle’s open-source hypervisor, popular for development and testing.

VMware Workstation/Fusion — Commercial products for Windows/Linux (Workstation) and macOS (Fusion).

QEMU — Open-source emulator and hypervisor. Can run as Type 2 or (with KVM) as Type 1.

Virtual Machines vs Containers

Virtual machines emulate entire hardware platforms, including CPU, memory, storage, and network. Each VM runs a complete operating system (the guest OS), making them fully isolated but resource-heavy.

Containers share the host kernel but provide process isolation via Linux namespaces and resource limits via cgroups. They package an application and its dependencies but not a full OS. This makes them:

  • Lighter — No guest OS overhead (typically 10-100MB vs GB for VMs)
  • Faster to start — Seconds vs minutes for VMs
  • More efficient — Higher density per host

The tradeoff is that containers on the same host share the kernel, so a kernel vulnerability can let an attacker break out of container isolation in ways that a VM's separate kernel would prevent.

Linux Namespace Types

Namespaces partition kernel resources so that processes in different namespaces see different views:

| Namespace | Flag            | Isolates                       |
|-----------|-----------------|--------------------------------|
| PID       | CLONE_NEWPID    | Process IDs                    |
| Network   | CLONE_NEWNET    | Network devices, ports, routes |
| Mount     | CLONE_NEWNS     | Mount points, filesystem views |
| UTS       | CLONE_NEWUTS    | Hostname, domain name          |
| IPC       | CLONE_NEWIPC    | System V IPC, POSIX queues     |
| User      | CLONE_NEWUSER   | User and group IDs             |
| Cgroup    | CLONE_NEWCGROUP | Cgroup root directory          |
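
To see which namespaces a process belongs to, Linux exposes one symlink per namespace under /proc/<pid>/ns. The following is a minimal Python sketch, not a definitive tool: the helper names read_namespaces and shared_namespaces and the proc_root parameter are assumptions made for illustration, and reading another process's links may require elevated privileges.

```python
import os


def read_namespaces(pid="self", proc_root="/proc"):
    """Return {namespace_name: identifier} for a process.

    Each entry under <proc_root>/<pid>/ns is a symlink whose target looks
    like "pid:[4026531836]"; two processes share a namespace exactly when
    their targets are identical.
    """
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    namespaces = {}
    for name in sorted(os.listdir(ns_dir)):
        namespaces[name] = os.readlink(os.path.join(ns_dir, name))
    return namespaces


def shared_namespaces(a, b):
    """Names of the namespaces that two read_namespaces() results have in common."""
    return sorted(name for name in a if name in b and a[name] == b[name])


# Example (Linux only): compare this process with PID 1.
#   mine = read_namespaces()
#   init = read_namespaces(1)
#   print(shared_namespaces(mine, init))
```

Run inside a container, a comparison against a host process would typically show few or no shared namespaces; run on the host against PID 1, most entries usually match.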

Control Groups (cgroups)

cgroups limit and isolate resource usage (CPU, memory, I/O, network) for process groups. They prevent any single container from consuming all host resources and ensure fair sharing across containers.

Key controllers:

  • cpu — CPU time allocation
  • memory — Memory limits and swap
  • io — Block device I/O throttling
  • pids — Process count limits
  • cpuset — CPU core pinning
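
On a cgroup v2 host, the cpu and memory controllers expose their limits as small text files (cpu.max and memory.max) inside each cgroup directory. The sketch below shows one way to interpret their contents in Python; the function names are illustrative assumptions, and the functions take the file contents as strings so they can be tried without a real cgroup.

```python
def parse_cpu_max(text):
    """Interpret cgroup v2 cpu.max: "<quota> <period>" or "max <period>".

    Returns the limit in CPU cores (quota / period), or None when unlimited.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def parse_memory_max(text):
    """Interpret cgroup v2 memory.max: a byte count or the word "max".

    Returns the limit in bytes, or None when unlimited.
    """
    value = text.strip()
    return None if value == "max" else int(value)


# Example with literal file contents:
#   parse_cpu_max("50000 100000")   -> half a core
#   parse_memory_max("536870912")   -> 512 MiB in bytes
```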

Container Runtime Standards

OCI (Open Container Initiative) defines standards for container formats and runtimes:

  • runc — Reference implementation of OCI runtime spec; creates and runs containers
  • containerd — Industry-standard container runtime that manages container lifecycle (start, stop, pause)
  • CRI-O — Kubernetes-specific container runtime implementing CRI

Docker Architecture

Docker popularized containerization with its integrated platform:

  1. Docker client — CLI for user commands
  2. Docker daemon (dockerd) — Background service managing images, containers, networks, volumes
  3. containerd — High-level container runtime that manages the container lifecycle and delegates container creation to runc
  4. runc — Low-level container creation

Modern Docker uses containerd as its runtime, with containerd-shim processes that decouple container lifecycle from the daemon.

Production Failure Scenarios

Scenario 1: Container Memory Exhaustion (OOM Kill)

What happens: A container exceeds its memory limit, triggering the kernel’s OOM killer to terminate processes. Application crashes, logs show “Killed” or OOM-related errors.

Detection:

# Check container memory usage
docker stats
# or
crictl stats

# Check kernel OOM logs
dmesg | grep -i "killed process"
journalctl -b | grep -i oom
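
On cgroup v2 hosts, each cgroup also keeps a memory.events file with one "key value" pair per line, including an oom_kill counter. The sketch below is one possible way to watch that counter from Python instead of scraping logs; the function names are assumptions for illustration, and the location of a given container's cgroup directory depends on the runtime and cgroup driver.

```python
def parse_memory_events(text):
    """Parse cgroup v2 memory.events content into a dict of integer counters."""
    events = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            events[key] = int(value)
    return events


def oom_kills_between(previous, current):
    """OOM kills recorded between two parsed snapshots of memory.events."""
    return current.get("oom_kill", 0) - previous.get("oom_kill", 0)


# Example: take two snapshots of the same file a few seconds apart.
#   before = parse_memory_events(open(path).read())
#   ...
#   after = parse_memory_events(open(path).read())
#   if oom_kills_between(before, after) > 0: alert
```

Because the counters only ever increase, comparing snapshots is more reliable than checking for a non-zero value, which would also fire for kills that happened long ago.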

Mitigation:

# docker-compose.yml
services:
  myapp:
    mem_limit: 512m
    mem_reservation: 256m
    deploy:
      resources:
        limits:
          memory: 512M
        reservations:
          memory: 256M

# Kubernetes pod resource limits
resources:
  limits:
    memory: "512Mi"
  requests:
    memory: "256Mi"

Scenario 2: VM Live Migration Failure

What happens: During live migration of a VM between hosts (for maintenance or load balancing), the VM pauses, memory copies to the destination, but the VM fails to resume properly or network connectivity drops.

Mitigation:

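  • Pre-copy migration sends memory pages iteratively while the VM keeps running, then pauses briefly for the last dirty pages (short downtime, longer total migration time)
  • Post-copy migration pauses first, moves minimal state, and pulls memory on demand after resuming (faster migration, but a network failure mid-migration can take the VM down)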
  • Use shared storage (SAN/NFS) to avoid disk migration
  • Test migration during maintenance windows
  • Monitor network latency between hosts

Scenario 3: Container Escape Vulnerability

What happens: A vulnerability in the container runtime or a misconfiguration allows an attacker to escape container isolation and access the host or other containers. Notable examples: containerd CVE-2020-15257, runc CVE-2021-30465.

Mitigation:

# Never run containers with --privileged
# Use read-only root filesystems where possible
docker run --read-only --tmpfs /tmp myapp

# Drop all capabilities, add only what's needed
docker run --cap-drop all --cap-add NET_BIND_SERVICE myapp

# Prevent privilege escalation
docker run --security-opt=no-new-privileges:true myapp

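# Use seccomp to restrict syscalls. Docker applies its default profile
# automatically; pass a stricter custom profile when needed (placeholder path)
docker run --security-opt seccomp=/path/to/profile.json myapp

# Keep container runtimes updated (Debian/Ubuntu; package name varies by repository)
sudo apt update && sudo apt install --only-upgrade containerd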

Trade-off Table

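| Aspect            | VM                     | Container               | Bare Metal                  |
|-------------------|------------------------|-------------------------|-----------------------------|
| Isolation         | Full (separate kernel) | Process (shared kernel) | Full                        |
| Boot time         | Minutes                | Seconds                 | Minutes (full hardware boot) |
| Resource overhead | GB (guest OS)          | MB (app + deps)         | None                        |
| Max density       | Low (10s/hypervisor)   | High (100s/host)        | N/A                         |
| Security boundary | Strong                 | Moderate                | Strongest                   |
| Live migration    | Yes                    | Limited                 | N/A                         |
| Snapshot/clone    | Yes                    | Image layers            | No                          |
| Persistence       | Disk image             | Image + volumes         | Local disk                  |

| Hypervisor | Type | Performance | Management    | Use Case           |
|------------|------|-------------|---------------|--------------------|
| KVM        | 1    | Near-native | Open, complex | Cloud providers    |
| ESXi       | 1    | Near-native | vSphere       | Enterprise         |
| Hyper-V    | 1    | Near-native | SCVMM         | Windows shops      |
| Xen        | 1    | Near-native | Complex       | AWS legacy         |
| QEMU       | 2    | Emulated    | Manual        | Emulation, testing |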

Implementation Snippets

Creating and Running a Simple Container

# Dockerfile for a minimal container
FROM ubuntu:22.04

# Don't run as root
RUN useradd -m appuser
USER appuser

# Only copy what you need
COPY --chown=appuser:appuser app /home/appuser/app

WORKDIR /home/appuser/app

CMD ["./app"]

#!/bin/bash
# Build and run a container
docker build -t myapp:latest .
docker run --rm -it myapp:latest /bin/sh

# Inspect container internals
docker inspect myapp:latest
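# Target the most recently created container; a bare "docker ps -q" breaks when several are running
docker exec -it "$(docker ps -q --latest)" ls /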

# Resource limits
docker run --memory=512m --cpus=0.5 myapp:latest

Working with Linux Namespaces Directly

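#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

// Entry point for the child; it runs inside the new namespaces.
static int child_main(void *arg) {
    (void)arg;
    // The first process in a new PID namespace sees itself as PID 1.
    printf("Child PID: %d (in new namespace)\n", getpid());
    // The parent lives outside the namespace, so getppid() reports 0.
    printf("Parent of child: %d\n", getppid());
    // Sleep and exit
    sleep(60);
    return 0;
}

int main(void) {
    printf("Parent PID: %d\n", getpid());

    // Stack for child
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL) {
        perror("malloc failed");
        return 1;
    }

    // Create child in new PID, network, and mount namespaces.
    // Requires root (CAP_SYS_ADMIN). The stack grows downward on most
    // architectures, so clone() needs a pointer to the top of the buffer.
    pid_t pid = clone(child_main, stack + STACK_SIZE,
                      CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | SIGCHLD,
                      NULL);

    if (pid == -1) {
        perror("clone failed");
        free(stack);
        return 1;
    }

    // From the parent's namespace the child has an ordinary host PID.
    printf("Created child with PID: %d\n", pid);
    waitpid(pid, NULL, 0);
    printf("Child exited\n");
    free(stack);
    return 0;
}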

Inspecting cgroup Hierarchy

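#!/bin/bash
# Explore cgroup structure

echo "=== Cgroup Version ==="
if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
    echo "cgroup2 (unified hierarchy)"
    version=2
else
    echo "cgroup1 (legacy hierarchy)"
    version=1
fi

echo -e "\n=== Current Process Cgroups ==="
cat /proc/self/cgroup

if [ "$version" -eq 2 ]; then
    # cgroup v2: limits live in the process's own cgroup directory;
    # the root cgroup has no memory.max or cpu.max files
    cg="/sys/fs/cgroup$(sed -n 's/^0:://p' /proc/self/cgroup)"

    echo -e "\n=== Memory Cgroup Limits ==="
    cat "$cg/memory.max" "$cg/memory.low" 2>/dev/null

    echo -e "\n=== CPU Cgroup Limits ==="
    cat "$cg/cpu.max" 2>/dev/null   # "<quota> <period>" or "max <period>"
else
    # cgroup v1: one hierarchy per controller
    echo -e "\n=== Memory Cgroup Limits ==="
    cat /sys/fs/cgroup/memory/memory.limit_in_bytes
    cat /sys/fs/cgroup/memory/memory.soft_limit_in_bytes
    cat /sys/fs/cgroup/memory/memory.swappiness

    echo -e "\n=== CPU Cgroup Limits ==="
    cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
    cat /sys/fs/cgroup/cpu/cpu.cfs_period_us
fi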

Kubernetes Pod Spec with Resource Limits

apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: myapp
      image: myapp:1.0.0
      resources:
        limits:
          memory: "512Mi"
          cpu: "500m"
        requests:
          memory: "256Mi"
          cpu: "100m"
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10

Observability Checklist

VM Monitoring

# Guest-level CPU, memory, and I/O pressure (run inside a Linux VM)
vmstat 1
# Host-level view on VMware ESXi
esxtop

# KVM/QEMU
virsh list
virsh dominfo <vm-name>
virsh dommemstat <vm-name>

# VM performance
top -b -n 1 | grep qemu
cat /proc/interrupts

Container Monitoring

# Docker stats (real-time)
docker stats

# Kubernetes pod metrics
kubectl top pods
kubectl top nodes

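# Docker daemon metrics (Prometheus format); requires metrics-addr
# (for example 127.0.0.1:9323) to be set in daemon.json
curl http://localhost:9323/metrics

# cAdvisor for container metrics; set CADVISOR_VERSION to a release tag
docker run \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --publish=8080:8080 \
  --detach \
  gcr.io/cadvisor/cadvisor:"$CADVISOR_VERSION"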

Network Namespace Debugging

#!/bin/bash
# Inspect container networking

# List network namespaces
ip netns list

# Create a network namespace (simulates container)
ip netns add testns
ip netns exec testns ip addr

# Connect namespaces with veth pair
ip link add veth0 type veth peer name veth1
ip link set veth1 netns testns

Security and Compliance Notes

VM Security Considerations

  • Hypervisor vulnerabilities — A flaw in the hypervisor can expose all VMs to attack; keep hypervisors updated
  • Side-channel attacks — Spectre/Meltdown variants affect hypervisors; enable hypervisor-specific mitigations
  • VM escape — Exploits that break out of VM isolation to access host; less common but severe
  • Storage security — VM disks may contain sensitive data; encrypt at rest and in transit

Container Security Best Practices

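# Kubernetes security context examples

# Pod level (spec.securityContext)
securityContext:
  runAsNonRoot: true
  runAsUser: 10000
  fsGroup: 10000
  seccompProfile:
    type: RuntimeDefault

# Container level (spec.containers[].securityContext);
# capabilities can only be set here, fsGroup only at pod level
securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
    add:
      - NET_BIND_SERVICE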

Defense in depth for containers:

  1. Run containers with minimal privileges (drop ALL capabilities)
  2. Use read-only root filesystems where possible
  3. Never expose the Docker socket to containers
  4. Scan images for vulnerabilities (Trivy, Grype)
  5. Implement network policies to restrict pod-to-pod communication
  6. Use admission controllers (like OPA/Gatekeeper) to enforce policies

Compliance Considerations

  • PCI-DSS — Requires that virtualized environments isolate systems handling cardholder data from other systems
  • HIPAA — Virtualized environments must ensure PHI isolation
  • SOC 2 — Requires evidence of proper isolation and access controls

Common Pitfalls / Anti-patterns

  1. Running containers with --privileged — Gives the container all capabilities and full access to host devices, so an attacker can escape to the host. Grant specific capabilities instead

  2. Exposing the Docker socket — Mounting /var/run/docker.sock into a container gives that container root-equivalent control of the host. Use rootless or daemonless build tools instead of sharing the socket

  3. Not setting resource limits — Containers without limits can consume all host memory/CPU, affecting other workloads. Always set explicit limits

  4. Using latest tag — Makes it impossible to roll back or audit which version ran. Always use specific tags

  5. Running as root inside containers — Without user namespace remapping, root in the container is UID 0 on the host, so a successful escape yields host root. Set runAsUser and runAsNonRoot in the security context

  6. Ignoring VM sprawl — Unused VMs consume resources and become security liabilities. Implement VM lifecycle management

  7. Not testing live migration — Assuming migration works without testing can cause production outages. Test during maintenance windows

Quick Recap Checklist

  • Type 1 hypervisors run directly on hardware; Type 2 run within a host OS
  • Virtual machines provide full hardware emulation with isolated guest OS
  • Containers share the host kernel but provide process isolation via namespaces
  • Linux namespaces partition kernel resources (PID, network, mount, UTS, IPC, user, cgroup)
  • cgroups limit and meter resource usage (CPU, memory, I/O) for process groups
  • Docker/containers package applications; VMs package entire operating systems
  • Container security requires defense in depth: minimal privileges, read-only filesystems, capability dropping
  • Kubernetes talks to container runtimes such as containerd or CRI-O through the Container Runtime Interface (CRI)
  • VM live migration enables maintenance without downtime; test migration beforehand
  • OOM kills in containers happen when memory limits are exceeded; always set limits

Interview Questions

1. What is the difference between a VM and a container?

A virtual machine emulates an entire hardware platform—a complete CPU, memory, storage, and network subsystem. Each VM runs a full operating system (guest OS) from its own bootloader. VMs provide strong isolation because each has its own kernel. Starting a VM takes minutes and it consumes gigabytes of RAM for the guest OS alone.

A container shares the host kernel but provides isolated views of process trees, network ports, mount points, and other kernel resources through Linux namespaces. Containers package an application and its dependencies but not a full OS. They start in seconds and consume megabytes because there's no duplicated OS.

The tradeoff is security isolation strength. A kernel vulnerability in a container can potentially affect the host and other containers on the same host—a VM's separate kernel prevents this. For high-security workloads, VMs provide stronger isolation at the cost of more resources.

2. How do Linux namespaces work?

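Linux namespaces partition kernel resources so that processes in different namespaces see different views of what would otherwise be system-wide resources. When a process calls clone() with namespace flags, the child starts in new namespaces; unshare() does the same for the calling process, and setns() joins an existing namespace. The main types: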

  • PID namespace — Processes in the container see different PIDs; PID 1 inside is not PID 1 on the host
  • Network namespace — Each container gets its own network stack with its own interfaces, routing tables, and port numbers
  • Mount namespace — Each container can mount different filesystems; mounts don't propagate to host
  • UTS namespace — Containers can have different hostnames
  • IPC namespace — System V message queues and shared memory are isolated

Namespaces are the fundamental mechanism that makes containers possible—they're what Docker and Kubernetes build on top of.

3. What are cgroups and why do we need them?

Control groups (cgroups) are a Linux kernel feature that limits, accounts for, and isolates the resource usage (CPU, memory, I/O, network) of process groups. While namespaces provide isolation (different views of resources), cgroups provide control (limits on resources).

Without cgroups, a single container could consume all available memory, starving other containers and the host. With cgroups, you can say "this container gets maximum 512MB RAM and 0.5 CPUs." The kernel enforces these limits, killing processes or throttling as needed.

Docker uses cgroups to implement its --memory and --cpus flags. Kubernetes pod resource limits map directly to cgroup settings on the node. Without cgroups, containerization wouldn't be safe for multi-tenant workloads.
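
The mapping from those flags to kernel settings is simple arithmetic. The sketch below is a rough Python illustration and not Docker's actual code: it converts a --memory style size to bytes using binary units and a --cpus value to a cgroup v2 cpu.max string, assuming the common 100 ms scheduling period. The function names are hypothetical.

```python
# Binary multipliers for Docker-style size suffixes (b, k, m, g).
UNITS = {"b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def memory_flag_to_bytes(value):
    """Convert a size such as "512m" to a byte count for memory.max."""
    value = value.strip().lower()
    if value[-1] in UNITS:
        return int(value[:-1]) * UNITS[value[-1]]
    return int(value)  # no suffix: already bytes


def cpus_flag_to_cpu_max(cpus, period_us=100_000):
    """Convert a fractional CPU count to a "<quota> <period>" cpu.max string.

    The quota is the CPU time in microseconds the group may use per period.
    """
    quota_us = round(cpus * period_us)
    return f"{quota_us} {period_us}"


# Example: the limits quoted above.
#   memory_flag_to_bytes("512m")  -> bytes written to memory.max
#   cpus_flag_to_cpu_max(0.5)     -> string written to cpu.max
```

A quota of 50000 against a period of 100000 means the processes in the group can run for at most 50 ms of CPU time in every 100 ms window, after which the kernel throttles them until the next period.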

4. What is a hypervisor and what are the different types?

A hypervisor is the software that creates and runs virtual machines. It sits between the hardware and the VMs, presenting virtual hardware to each VM and managing the actual hardware allocation.

Type 1 (bare metal) hypervisors run directly on hardware without a host OS. Examples: VMware ESXi, Microsoft Hyper-V, KVM, Xen. These are used in data centers and cloud providers because they have minimal overhead and are purpose-built for virtualization.

Type 2 (hosted) hypervisors run as an application within a regular operating system. Examples: VirtualBox, VMware Workstation. These are used primarily for development and testing on desktops where full data center infrastructure isn't needed.

KVM (Kernel-based Virtual Machine) is interesting because it runs as a Linux kernel module but functions as a Type 1 hypervisor—when KVM is loaded, Linux itself becomes the hypervisor. This gives KVM the performance of Type 1 with the flexibility of Linux.

5. What is the Docker architecture and how does containerd relate to Docker?

Docker's architecture has evolved significantly. Early versions used LXC and then Docker's own libcontainer library inside a monolithic daemon; since Docker 1.11 the architecture separates concerns:

The Docker daemon (dockerd) exposes the Docker API, manages images, networks, and volumes, and orchestrates the runtime. The familiar docker CLI that users interact with is a client of that API.

containerd is the industry-standard container runtime that handles the actual container lifecycle—creating, starting, stopping, pausing, and deleting containers. It was donated to CNCF and is the standard runtime that Kubernetes uses.

runc is the low-level container runtime that creates and runs containers according to OCI specifications. It's the actual process that spawns the container—containerd uses runc to do the heavy lifting.

The advantage of this separation is interoperability: Kubernetes doesn't need Docker; it can use containerd or CRI-O directly via the Container Runtime Interface (CRI). This modularity lets different projects use the same container runtime without depending on Docker.

6. What is VM live migration and how does it work?

Live migration moves a running virtual machine from one physical host to another without disconnecting clients. The process typically uses pre-copy migration: the source VM continues running while memory pages are copied iteratively to the destination, with pages modified during copy being re-sent each round. Once memory synchronization nears completion, the VM pauses briefly, the remaining dirty pages are transferred, and the VM resumes on the destination host.

Post-copy migration takes the opposite approach: the VM pauses immediately, minimal state is transferred, and the VM resumes on the destination while memory pages are faulted in on-demand. Pre-copy offers lower downtime but longer total migration time; post-copy offers faster migration but higher runtime overhead during recovery. Shared storage (SAN/NFS) eliminates disk migration, significantly reducing migration complexity.
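
Whether pre-copy converges depends on how fast the guest dirties memory relative to the migration bandwidth. The following Python sketch is a deliberately simplified model under stated assumptions (constant dirty rate, constant bandwidth, a fixed stop threshold); the function name and default values are illustrative and are not taken from any hypervisor.

```python
def simulate_precopy(memory_mb, bandwidth_mb_s, dirty_rate_mb_s,
                     stop_threshold_mb=64, max_rounds=30):
    """Simulate iterative pre-copy rounds.

    Each round sends everything currently dirty; while it is being sent the
    guest dirties more pages, which become the next round's work.
    Returns (rounds, total_seconds, downtime_seconds, converged).
    """
    remaining = float(memory_mb)
    total_time = 0.0
    rounds = 0
    while remaining > stop_threshold_mb and rounds < max_rounds:
        round_time = remaining / bandwidth_mb_s
        total_time += round_time
        # Pages dirtied during this round, capped at total guest memory.
        remaining = min(float(memory_mb), dirty_rate_mb_s * round_time)
        rounds += 1
    converged = remaining <= stop_threshold_mb
    downtime = remaining / bandwidth_mb_s  # final stop-and-copy phase
    return rounds, total_time + downtime, downtime, converged


# Example: a 4 GB guest over a 1000 MB/s link.
#   simulate_precopy(4096, 1000, 100)    -> converges in a few rounds
#   simulate_precopy(4096, 1000, 1200)   -> never converges
```

In this model the remaining data shrinks by the factor dirty rate / bandwidth each round, so a guest that dirties memory faster than the link can carry it never reaches the threshold. That is the situation in which operators fall back to throttling the guest or to post-copy.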

7. What is container escape and what are the primary attack vectors?

Container escape occurs when an attacker breaks out of container isolation to access the host or other containers. Primary vectors include: vulnerabilities in the kernel or the container runtime (such as CVE-2020-15257 in containerd and CVE-2021-30465 in runc), misconfigured capabilities granting too many privileges, mounting the Docker socket into a container (which gives host root access), and dangerous syscalls not blocked by seccomp profiles.

Namespace isolation means containers share the kernel, so kernel exploits that would be contained by VM isolation can escape containers. Defense requires: never running containers with --privileged, dropping all capabilities and adding only what's necessary, using read-only root filesystems, enabling seccomp profiles, keeping container runtimes updated, and using admission controllers in Kubernetes to enforce security policies.

8. What is the difference between cgroup v1 and cgroup v2?

Cgroup v2 (unified hierarchy) addresses fundamental limitations in cgroup v1. In cgroup v1, each controller (cpu, memory, io, pids) maintained its own separate hierarchy, leading to complexity when controllers depended on each other. Cgroup v2 unifies all controllers into a single hierarchy, simplifying resource management and eliminating cross-controller conflicts.

Key differences: v2 uses a single unified tree rather than multiple parallel trees; the cpu controller no longer has a separate rt runtime interface; memory.low and memory.min provide more intuitive memory protection than the v1 hierarchical limits; pids controller is always hierarchical in v2; and v2 provides better delegation to containers through directory ownership. Most modern container runtimes support cgroup v2, which is now the default in recent systemd releases and most current Linux distributions.
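
The difference is visible in /proc/<pid>/cgroup, where each line has the form hierarchy-ID:controller-list:path. On a pure v2 system there is a single line starting with "0::"; on v1 there is one line per mounted hierarchy. The Python sketch below classifies the content of that file; the function names and the "hybrid" label for mixed setups are assumptions made for illustration.

```python
def parse_proc_cgroup(text):
    """Parse /proc/<pid>/cgroup content into a list of dicts."""
    entries = []
    for line in text.splitlines():
        if not line:
            continue
        hierarchy_id, controllers, path = line.split(":", 2)
        entries.append({
            "hierarchy": int(hierarchy_id),
            "controllers": controllers.split(",") if controllers else [],
            "path": path,
        })
    return entries


def cgroup_mode(entries):
    """Return "v2", "v1", or "hybrid" for a parsed /proc/<pid>/cgroup."""
    # The unified (v2) hierarchy always has ID 0 and an empty controller list.
    has_v2 = any(e["hierarchy"] == 0 and not e["controllers"] for e in entries)
    # v1 hierarchies name their controllers, e.g. "cpu,cpuacct" or "memory".
    has_v1 = any(e["controllers"] for e in entries)
    if has_v2 and has_v1:
        return "hybrid"
    return "v2" if has_v2 else "v1"


# Example: cgroup_mode(parse_proc_cgroup(open("/proc/self/cgroup").read()))
```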

9. How does Kubernetes use cgroups and namespaces for pod isolation?

Containers in a Kubernetes pod share the pod's network, IPC, and UTS namespaces, which the container runtime (containerd or CRI-O) typically anchors in a small sandbox ("pause") container. Each container keeps its own mount namespace and, by default, its own PID namespace, so its processes cannot see host processes or those of sibling containers unless shareProcessNamespace is enabled. Every pod gets its own network namespace by default; it uses the host's network stack only when hostNetwork: true is set.

Resource limits defined in pod specs (memory, cpu, hugepages) translate to cgroup settings on the node. Kubelet configures cgroups for each pod and container through the cgroupfs or systemd driver. Security contexts, enforced cluster-wide through Pod Security Admission (which replaced the removed PodSecurityPolicy API), control seccomp profiles, capabilities, SELinux labels, and whether containers run as privileged. The node's kernel enforces the limits regardless of what the processes inside the container attempt.

10. What is nested virtualization and when is it useful?

Nested virtualization runs a hypervisor inside a VM that is itself running on a hypervisor. For example, running VirtualBox inside a KVM VM, or running KVM inside an ESXi VM. This requires hardware support (Intel VT-x or AMD-V exposed to the guest) and is often disabled by default, depending on the hypervisor and its version.

Nested virtualization is useful for development and testing where you need to run VM environments but lack physical hardware access, for running older hypervisor software that requires bare-metal installation for licensing, for CI/CD pipelines that need to test hypervisor-specific behavior, and for certain security research scenarios. It adds performance overhead since there are multiple translation layers (VM exits nested inside VM exits), which makes it a poor fit for performance-sensitive production workloads.
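
On an x86 Linux host running KVM, two pieces of information indicate whether nesting is possible: the vmx (Intel) or svm (AMD) CPU flag in /proc/cpuinfo, and the nested parameter of the kvm_intel or kvm_amd module under /sys/module. The Python sketch below interprets the contents of those files; the function names are hypothetical, and other hypervisors expose this setting differently.

```python
def cpu_virt_extension(cpuinfo_text):
    """Return "vmx" (Intel VT-x), "svm" (AMD-V), or None.

    Expects the text of /proc/cpuinfo from an x86 Linux system, where each
    CPU has a "flags" line listing its feature flags.
    """
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags = line.split(":", 1)[1].split()
            if "vmx" in flags:
                return "vmx"
            if "svm" in flags:
                return "svm"
    return None


def nested_enabled(param_text):
    """Interpret /sys/module/kvm_intel/parameters/nested (or kvm_amd).

    The file holds "Y" or "1" when nesting is on, "N" or "0" when off.
    """
    return param_text.strip() in ("Y", "1")


# Example:
#   cpu_virt_extension(open("/proc/cpuinfo").read())
#   nested_enabled(open("/sys/module/kvm_intel/parameters/nested").read())
```

Inside a guest, seeing vmx or svm in the flags is a sign that the outer hypervisor has exposed the extension, which is the precondition for running a nested hypervisor there.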

11. What is the overlay filesystem and how does it work with containers?

Overlay filesystems (OverlayFS in the kernel, used by Docker's overlay2 storage driver) layer multiple directories into a single merged view. For containers, the pieces are: one or more lower directories (image layers, read-only), an upper directory (container-specific changes, writable), a work directory the kernel uses internally, and the merged mount point (what the container sees). When a file exists in both layers, the upper layer version shadows the lower.

Copy-on-write behavior means when a container modifies an image file, the entire file is copied to the upper layer before modification, preserving the original image layer unchanged. This allows many containers to share the same image layers while having independent writable layers. Overlay2 uses inodes efficiently and handles many layers better than the older overlay driver, making it the default for most container runtimes on modern kernels.
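
The lookup and copy-on-write rules can be modeled in a few lines. The Python class below is a toy model only: it represents each layer as a dictionary from path to content and is not how OverlayFS is implemented. The class and method names are assumptions, and deletions are modeled with a marker object standing in for the whiteout entries a real overlay records in its upper layer.

```python
class OverlayView:
    """Toy overlay mount: read-only lower layers plus one writable upper layer."""

    WHITEOUT = object()  # marks a path deleted in the upper layer

    def __init__(self, lower_layers):
        self.lower_layers = lower_layers  # list of dicts, topmost layer first
        self.upper = {}

    def read(self, path):
        # The upper layer shadows every lower layer.
        if path in self.upper:
            if self.upper[path] is self.WHITEOUT:
                raise FileNotFoundError(path)
            return self.upper[path]
        for layer in self.lower_layers:
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        # Writes only ever touch the upper layer; lower layers stay unchanged.
        self.upper[path] = data

    def delete(self, path):
        self.read(path)  # raises FileNotFoundError if the path is not visible
        self.upper[path] = self.WHITEOUT


# Example: two containers sharing one image layer.
#   image = {"/etc/app.conf": "v1"}
#   a, b = OverlayView([image]), OverlayView([image])
#   a.write("/etc/app.conf", "v2")   # b still reads "v1"
```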

12. How do container security scanning tools work?

Container security scanners like Trivy, Grype, and Clair inspect container images for known vulnerabilities. They extract the image's software package manifest (apt, rpm, pip, npm packages, binaries) by parsing the image layers and filesystem contents, then compare against vulnerability databases (NVD, distros' security advisories, GitHub Security Advisories). They report CVEs matching the packages found, with severity ratings and fix versions.

Scanners operate in different modes: static analysis of image contents without running containers, live scanning of running containers for runtime vulnerabilities, and admission control in Kubernetes rejecting deployments with critical vulnerabilities. Some scanners also check for secrets, misconfigurations (Docker CIS benchmarks), and supply chain risks like malicious base images. Scanning should happen both at build time (CI pipeline) and continuously for deployed images as new CVEs are published.

13. What is the difference between user namespaces and host UID/GID mapping?

User namespaces map UIDs inside the container to different UIDs on the host, providing true isolation: container root (UID 0 inside) can map to an unprivileged UID on the host (like 100000). This means a container escape does not automatically give root access to host resources because the container's root is not actually root on the host.

Host UID/GID mapping (the default without user namespaces) maps all container UIDs directly to the same host UIDs, so container UID 0 is host UID 0. This creates security risks if UID collisions occur or if container processes can escape their namespace. User namespaces are the foundation of rootless containers, though they require kernel 3.8+ and have some limitations with certain capabilities and device access.
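
The kernel exposes the active mapping in /proc/<pid>/uid_map (and gid_map), one range per line in the form "inside-start outside-start length". The Python sketch below translates an ID inside the namespace to the host ID using that format; the function names are illustrative assumptions.

```python
def parse_id_map(text):
    """Parse uid_map or gid_map content into (inside, outside, length) tuples."""
    return [
        tuple(int(field) for field in line.split())
        for line in text.splitlines()
        if line.strip()
    ]


def to_host_id(container_id, id_map):
    """Translate an ID inside the namespace to the host ID.

    Returns None when the ID falls outside every mapped range.
    """
    for inside, outside, length in id_map:
        if inside <= container_id < inside + length:
            return outside + (container_id - inside)
    return None


# Example: a user-namespaced container mapping 65536 IDs starting at 100000.
#   id_map = parse_id_map("0 100000 65536")
#   to_host_id(0, id_map)   # container root as seen by the host
```

Without a user namespace the file contains the identity mapping (0 0 4294967295), which is the case where container UID 0 and host UID 0 are the same user.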

14. What is the performance difference between VMs and containers for CPU-intensive workloads?

For CPU-intensive workloads, VMs typically incur 1-5% overhead from virtualization (hypervisor scheduling, emulated or paravirtualized devices), while containers introduce near-zero CPU overhead since they are just processes with namespace isolation. The difference is most noticeable in workloads with high rates of context switching, system calls, or I/O operations, where the VM's additional hypervisor layer adds latency.

However, the performance story changes when considering CPU limits. A cgroup-limited container is throttled at the kernel level, which can cause latency spikes when the container exhausts its CPU quota. A VM with dedicated CPUs has no such throttling but shares physical cores according to the hypervisor scheduler. For latency-sensitive real-time workloads, bare metal or VMs with CPU pinning often outperform containers due to more predictable scheduling.
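
Throttling of this kind shows up in the cgroup's cpu.stat file, which on cgroup v2 includes nr_periods, nr_throttled, and throttled_usec. The Python sketch below computes the share of scheduling periods in which the group hit its quota; the function names are hypothetical, and what counts as a worrying ratio depends on the workload.

```python
def parse_cpu_stat(text):
    """Parse cgroup v2 cpu.stat ("key value" per line) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            stats[key] = int(value)
    return stats


def throttle_ratio(stats):
    """Fraction of enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods


# Example: read a container's cpu.stat and flag heavy throttling.
#   stats = parse_cpu_stat(open(path_to_cpu_stat).read())
#   if throttle_ratio(stats) > 0.1: investigate the CPU limit
```

A container can show a high throttle ratio while its average CPU usage looks low, because short bursts exhaust the quota early in a period; this is the latency-spike pattern described above.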

15. How does QEMU work as both a Type 2 hypervisor and an emulator?

QEMU (Quick Emulator) operates in two modes. As a pure emulator, it translates guest instructions to host instructions dynamically using binary translation, emulating CPU, memory, and devices entirely in software. This allows running ARM binaries on x86 hosts, for example, but is slow due to software emulation.

When paired with KVM (Kernel-based Virtual Machine), QEMU becomes a Type 1 hypervisor. KVM runs the guest CPU directly on hardware (VMX/SVM virtualization extensions), treating most guest instructions as native execution. QEMU handles device emulation and I/O for the guest, creating a division of labor: KVM handles CPU/memory virtualization, QEMU handles everything else. This combination delivers near-native CPU performance while still supporting a wide variety of emulated and paravirtualized devices.

16. What is the purpose of the device mapper in Linux storage virtualization?

The device mapper is a kernel framework that underpins LVM, dm-crypt, and the older devicemapper storage driver for Docker. It creates virtual block devices by mapping requests through a series of targets (linear, snapshot, mirror, crypt, raid). Each target transforms I/O in different ways: a linear target simply maps a region to another device, a snapshot target tracks changes against an origin, and a crypt target encrypts/decrypts transparently.

Docker's devicemapper driver (now deprecated in favor of overlay2) used thin provisioning with snapshot targets: each container's writable layer was a snapshot of a thin pool, and images were backing snapshots. This allowed fast container creation but had issues with write amplification and garbage collection. Understanding device mapper is still relevant for understanding how LVM thin pools, dm-verity, and encrypted containers work at a low level.

17. How do memory overcommit and OOM killer interact in containerized environments?

The Linux kernel's OOM killer activates when the system runs out of allocatable memory and cannot reclaim enough by dropping caches or swapping. It selects a process to kill based on an oom_score derived mainly from how much memory the process uses (resident memory, swap, and page tables), shifted by the per-process oom_score_adj value. In containerized environments the OOM killer also operates at the cgroup level: if a container exceeds its cgroup memory limit and the kernel cannot reclaim memory within that cgroup, it kills a process inside that cgroup rather than the highest-memory process on the system.

This distinction matters because a container hitting its memory limit may lose an unexpected process (a small helper rather than the main workload), or several processes in succession. Kubernetes pod resource requests and limits control the cgroup settings, but the OOM killer still chooses a victim within the cgroup using its own scoring. Proper tuning involves setting appropriate memory limits, understanding which process is likely to be killed, and using memory reservations (soft limits) to encourage reclaim before the hard limit is reached.
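To see why the "wrong" process can be chosen, it helps to model the scoring. The sketch below is a deliberately simplified approximation of the kernel heuristic (memory footprint plus an adjustment scaled by the memory available to the cgroup); it is not the kernel's code, and `Proc`, `badness`, and `pick_victim` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Proc:
    name: str
    rss_pages: int          # memory footprint in pages (simplified to RSS only)
    oom_score_adj: int = 0  # -1000 (never kill) .. 1000 (kill first)


def badness(proc: Proc, limit_pages: int) -> Optional[int]:
    """Approximate score; None means the process is exempt from OOM kills."""
    if proc.oom_score_adj == -1000:
        return None
    # The adjustment is scaled by the cgroup's memory limit, in thousandths.
    return proc.rss_pages + proc.oom_score_adj * limit_pages // 1000


def pick_victim(procs: list[Proc], limit_pages: int) -> Optional[Proc]:
    """Return the process with the highest score inside one cgroup."""
    scored = [(badness(p, limit_pages), p) for p in procs]
    eligible = [(score, p) for score, p in scored if score is not None]
    if not eligible:
        return None
    return max(eligible, key=lambda pair: pair[0])[1]


limit = 65536  # 256 MiB expressed in 4 KiB pages
procs = [Proc("app", rss_pages=60000), Proc("helper", rss_pages=5000, oom_score_adj=900)]
print(pick_victim(procs, limit).name)  # helper
```

With the adjustment left at 0 the large `app` process is selected; raising the helper's `oom_score_adj` flips the outcome even though the helper uses far less memory.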

18. What is a veth pair and how does container networking work?

A veth (virtual Ethernet) pair is a virtual network cable connecting two network namespaces. Data sent on one end appears on the other, like a physical Ethernet cable joining two devices. Containers use veth pairs: one end is placed inside the container's network namespace, while the other end remains in the host's root namespace, typically attached to a bridge (docker0, cni0).

When a container sends a packet, it travels through the container's veth end to the host bridge. The bridge forwards frames between local containers by MAC address, and traffic bound for other networks is handed to the host's routing table. For external traffic, NAT (Network Address Translation), typically an iptables or nftables masquerade rule, rewrites the container's internal source IP to the host's external IP. This model allows containers to have their own network stacks (separate IP addresses, routing tables, firewall rules) while sharing the host's physical network interfaces.
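The wiring described above can be reproduced by hand with iproute2. The sketch below only builds the list of `ip` commands rather than running them, since executing them requires root; the namespace, bridge, interface names, and address are hypothetical, and the bridge is assumed to exist already.

```python
def veth_setup_plan(netns: str, bridge: str, host_if: str,
                    ctr_if: str, ctr_cidr: str) -> list[list[str]]:
    """Return the iproute2 commands that connect a namespace to a bridge."""
    in_ns = ["ip", "netns", "exec", netns]
    return [
        # Create the isolated network namespace (the "container").
        ["ip", "netns", "add", netns],
        # Create both ends of the virtual cable in the root namespace.
        ["ip", "link", "add", host_if, "type", "veth", "peer", "name", ctr_if],
        # Move one end into the container's namespace.
        ["ip", "link", "set", ctr_if, "netns", netns],
        # Plug the host end into the bridge and bring it up.
        ["ip", "link", "set", host_if, "master", bridge],
        ["ip", "link", "set", host_if, "up"],
        # Give the container end an address and bring it up.
        in_ns + ["ip", "addr", "add", ctr_cidr, "dev", ctr_if],
        in_ns + ["ip", "link", "set", ctr_if, "up"],
    ]


plan = veth_setup_plan("demo-ns", "br0", "veth-host", "veth-ctr", "172.18.0.2/24")
for argv in plan:
    print(" ".join(argv))
```

Container runtimes and CNI plugins perform equivalent steps through netlink rather than by shelling out, and they add the default route and NAT rules on top.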

19. What is Kata Containers and how does it differ from traditional containers?

Kata Containers is a container runtime that runs each container (or, under Kubernetes, each pod) inside a lightweight VM, aiming to combine the workflow of containers with the isolation of VMs. It uses hardware virtualization (like KVM) to create a VM that boots a minimal kernel and runs the container workload inside. This provides a strong security boundary because the workload talks to its own guest kernel rather than the host kernel: a kernel exploit compromises only the guest, and an attacker would also have to break out of the hypervisor to reach the host.

Unlike traditional containers that share the host kernel, each Kata sandbox runs its own guest kernel. This trades some performance (VM boot time, memory overhead per sandbox) and density (typically 10-40% less dense than namespace containers) for dramatically improved isolation. It is particularly valuable for multi-tenant environments where namespace-based isolation is insufficient but full VMs are too heavy. Kata integrates with containerd and Kubernetes through the shim-v2 architecture.

20. How does resource quota enforcement work for pods in Kubernetes?

Kubernetes enforces resource limits through cgroups on each node. When a pod is scheduled, the kubelet configures cgroup parameters according to the pod's QoS class (Guaranteed, Burstable, or BestEffort), which is derived from the requests and limits specified. Memory limits map to memory.limit_in_bytes (cgroup v1) or memory.max (cgroup v2). CPU limits map to cpu.cfs_quota_us and cpu.cfs_period_us (cgroup v1) or cpu.max (cgroup v2).
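The mapping from pod spec values to cgroup settings is mostly unit conversion. The sketch below assumes the common default CFS period of 100 ms and handles only a subset of Kubernetes quantity suffixes; the function names are illustrative and are not kubelet APIs.

```python
def cpu_limit_to_cfs_quota_us(cpu_limit: str, period_us: int = 100_000) -> int:
    """Convert a CPU limit such as '500m' or '2' to a CFS quota in microseconds."""
    if cpu_limit.endswith("m"):
        millicores = int(cpu_limit[:-1])
    else:
        millicores = int(round(float(cpu_limit) * 1000))
    return millicores * period_us // 1000


def cpu_max_v2(cpu_limit: str, period_us: int = 100_000) -> str:
    """Format the same limit as a cgroup v2 cpu.max value: '<quota> <period>'."""
    return f"{cpu_limit_to_cfs_quota_us(cpu_limit, period_us)} {period_us}"


# Binary suffixes only; decimal suffixes (k, M, G) are left out of this sketch.
_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def memory_limit_to_bytes(memory_limit: str) -> int:
    """Convert a memory limit such as '256Mi' to the byte count written to the cgroup."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if memory_limit.endswith(suffix):
            return int(memory_limit[: -len(suffix)]) * factor
    return int(memory_limit)  # plain byte count


print(cpu_limit_to_cfs_quota_us("500m"))  # 50000
print(cpu_max_v2("500m"))                 # 50000 100000
print(memory_limit_to_bytes("256Mi"))     # 268435456
```

A limit of `500m` therefore allows 50 ms of CPU time in every 100 ms period, which is where the throttling behavior of CPU-limited containers comes from.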

ResourceQuota objects cap the aggregate CPU requests, memory requests, storage, and object counts within a single namespace. Kubernetes admission controllers reject pods that would push the namespace over these quotas. LimitRange objects set default values for containers that don't specify resource requirements. Together, cgroups enforce limits at runtime, while admission controllers enforce quotas at deployment time.
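The admission-time check amounts to a comparison between current usage plus the new pod and the quota's hard caps. The sketch below is a simplified model of that decision, not the Kubernetes implementation; the function name and the resource keys (CPU in millicores, memory in bytes) are assumptions made for illustration.

```python
def quota_admits(hard: dict[str, int], used: dict[str, int],
                 pod_requests: dict[str, int]) -> bool:
    """Return True if admitting the pod keeps every tracked resource within quota."""
    for resource, cap in hard.items():
        projected = used.get(resource, 0) + pod_requests.get(resource, 0)
        if projected > cap:
            return False  # this pod would exceed the namespace quota
    return True


hard = {"requests.cpu": 4000, "requests.memory": 8 * 1024**3, "pods": 10}
used = {"requests.cpu": 3500, "requests.memory": 6 * 1024**3, "pods": 7}

small_pod = {"requests.cpu": 250, "requests.memory": 512 * 1024**2, "pods": 1}
large_pod = {"requests.cpu": 1000, "requests.memory": 512 * 1024**2, "pods": 1}

print(quota_admits(hard, used, small_pod))  # True
print(quota_admits(hard, used, large_pod))  # False: CPU requests would reach 4500m
```

The rejected pod never reaches a node, so no cgroup is created for it; the runtime limits only apply to pods that pass this check.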

Conclusion

Virtualization is the foundation of modern cloud computing, enabling the elastic, multi-tenant infrastructure that powers everything from startup MVPs to enterprise-scale Kubernetes clusters. Understanding hypervisors, containers, namespaces, and cgroups gives you the mental model needed to debug container issues, optimize resource utilization, and design secure multi-tenant systems.

The distinction between VMs (strong isolation, higher overhead) and containers (lightweight, shared kernel) informs architectural decisions about security boundaries and performance requirements. As you continue learning, explore Kubernetes internals, container networking models, and container security scanning to build comprehensive expertise in cloud-native infrastructure.
