Virtual Memory

How operating systems use disk as an extension of RAM through demand paging, and the page replacement algorithms — LRU, Clock, Working Set — that decide what gets evicted.

published: reading time: 34 min read author: GeekWorkBench

Virtual memory is the engineering feat that makes a computer with 8 GB of RAM run applications that collectively demand 64 GB without crashing. It is the OS’s most visible memory management technique — a layer of indirection that allows the logical address space to exceed the physical address space, using disk as a backing store for the pages that do not fit in RAM. The elegance is that application code does not know and does not care whether its pages are resident in physical memory or have been evicted to swap.

The magic happens through demand paging: pages are loaded into physical memory only when a process accesses them. A program that calls malloc(1 GB) pays no cost at allocation time — the OS creates page table entries for the range but marks them as not present. Physical frames are allocated only when a page fault triggers. This lazy allocation extends to disk-backed pages, heap, and stack alike.

Introduction

When to Use / When Not to Use

Virtual memory is always active — you cannot disable it on any modern general-purpose OS. However, you make implicit choices about how aggressively your workload uses it.

When understanding virtual memory matters:

  • Debugging memory-related crashes (segmentation faults, OOM kills, thrashing)
  • Optimizing applications where working set > physical memory (large databases, ML training)
  • Configuring swap space appropriately (or deciding to use no swap at all)
  • Understanding why certain programs behave differently under memory pressure

When virtual memory is a liability:

  • Write-heavy workloads on spinning disks (swap thrashing destroys I/O)
  • Latency-sensitive real-time applications (page faults introduce unbounded latency)
  • Embedded systems with limited storage (swap consumes precious flash endurance)

On latency-sensitive systems (trading platforms, real-time control systems), administrators often disable swap entirely (swapoff -a) to eliminate the possibility of a page fault causing a millisecond-scale pause that violates latency SLAs.

Architecture or Flow Diagram

Virtual memory combines demand paging with page replacement to seamlessly spill pages to disk and retrieve them on demand. The flow below shows the decision tree the OS follows when a page fault occurs.

flowchart TD
    PF["Page Fault<br/>Triggered by CPU"]
    HANDLER["Page Fault<br/>Handler"]
    VALID["Is the address<br/>valid and permitted?"]
    PRESENT["Is the page in<br/>physical memory?"]
    SWAPPED["Is the page<br/>in swap?"]
    LOAD["Load page from<br/>swap device to frame"]
    ZERO["Allocate zero-filled<br/>frame from free pool"]
    UPDATE["Update page table entry<br/>Mark page as present"]
    SEGV["Send SIGSEGV<br/>Terminate process"]
    RESUME["Resume interrupted<br/>instruction"]

    PF --> HANDLER
    HANDLER --> VALID
    VALID -->|"No"| SEGV
    VALID -->|"Yes"| PRESENT
    PRESENT -->|"No, in swap"| LOAD
    PRESENT -->|"No, never allocated"| ZERO
    PRESENT -->|"Yes, already in RAM"| RESUME
    LOAD --> UPDATE
    ZERO --> UPDATE
    UPDATE --> RESUME

    style SEGV stroke:#ff6b6b
    style LOAD stroke:#ff9f43
    style ZERO stroke:#00fff9

The OS maintains a page replacement algorithm to choose which physical frame to evict when a newly faulted page needs a physical frame and none are free. This is where LRU, Clock, and Working Set algorithms diverge.
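The decision tree in the flowchart can be sketched in a few lines of Python. This is an illustrative model rather than kernel code: `PageState`, `Outcome`, and `handle_fault` are hypothetical names, and the model ignores file-backed pages, copy-on-write, and the swap cache.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Outcome(Enum):
    SIGSEGV = auto()    # invalid address or forbidden access
    RESUME = auto()     # page already resident, just retry the instruction
    SWAP_IN = auto()    # page must be read back from swap
    ZERO_FILL = auto()  # first touch, hand out a zero-filled frame


@dataclass
class PageState:
    """Hypothetical per-page bookkeeping, not a real kernel structure."""
    writable: bool = True
    present: bool = False
    in_swap: bool = False


def handle_fault(mappings: dict[int, PageState], page: int, is_write: bool) -> Outcome:
    """Classify a fault on `page` and update the toy page table."""
    state = mappings.get(page)
    # Unmapped address, or a write to a read-only page: the process gets SIGSEGV.
    if state is None or (is_write and not state.writable):
        return Outcome.SIGSEGV
    if state.present:
        return Outcome.RESUME
    outcome = Outcome.SWAP_IN if state.in_swap else Outcome.ZERO_FILL
    # Either path ends the same way: the PTE is marked present.
    state.present = True
    state.in_swap = False  # simplification: the swap slot is released at once
    return outcome
```

Calling `handle_fault` twice on the same never-touched page returns `ZERO_FILL` and then `RESUME`, which mirrors the two lower branches of the diagram.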

Core Concepts

Demand Paging

Demand paging is the mechanism by which pages are loaded into physical memory only upon first access. The page table entry for a not-yet-loaded page has the present (P) bit cleared. When a process accesses that page:

  1. CPU triggers page fault
  2. OS determines the fault address (CR2 register on x86)
  3. OS allocates a physical frame
  4. OS reads the page from its backing store (executable, page cache, or swap)
  5. OS updates the page table entry with the frame number and sets P=1
  6. CPU retries the instruction

The first access to a large mmap region is dramatically slower than subsequent access because of this fault storm. Tools like mincore (on Linux) can reveal which pages of a mapped region are resident.
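The first-touch cost can be observed from user space by counting minor faults around two passes over a fresh anonymous mapping. The sketch below assumes a Unix-like system where Python's `resource` module is available; `count_first_touch_faults` is a hypothetical helper name, and the exact numbers depend on the kernel, page size, and huge page settings.

```python
import mmap
import resource


def count_first_touch_faults(num_pages: int = 256) -> tuple[int, int]:
    """Return (minor faults on the first pass, minor faults on the second pass)."""
    size = num_pages * mmap.PAGESIZE
    region = mmap.mmap(-1, size)  # anonymous mapping, nothing resident yet
    try:
        def touch() -> int:
            before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
            for offset in range(0, size, mmap.PAGESIZE):
                region[offset] = 1  # one write per page
            after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
            return after - before

        return touch(), touch()
    finally:
        region.close()


if __name__ == "__main__":
    first, second = count_first_touch_faults()
    print(f"first pass: {first} minor faults, second pass: {second}")
```

On a typical Linux system the first pass reports roughly one fault per page and the second pass reports close to none, because the pages are already mapped.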

Swap Space

Swap space is a disk partition or file that backs pages that have been evicted from physical memory. When the OS needs a physical frame and the free frame list is empty, it runs the page replacement algorithm to select a victim page — one that has not been recently used — writes it to swap if dirty, and reclaims the frame.

Linux swap space is managed in 4 KB blocks called swap slots. Each slot can hold one evicted page. The swapon command activates a swap partition; swapoff deactivates it. The kernel maintains a swap map — an array tracking reference counts for each swap slot.

Modern kernels support swap files (not just partitions), which are easier to add, remove, or replace with a larger file than a partition is. The performance of swap files on SSDs is generally acceptable; on HDDs it is poor due to random seek requirements.

Page Replacement Algorithms

When physical memory fills up, the OS must evict some pages to make room for newly needed ones. The choice of which page to evict is the domain of page replacement algorithms.

LRU (Least Recently Used): LRU evicts the page whose last access is furthest in the past. It is not the theoretical ideal (that is Belady's optimal algorithm, which needs future knowledge), but it is usually a good approximation because recent use tends to predict near-future use. Exact LRU requires tracking every page access. Doing that in software means updating a timestamp or queue on every memory reference, which is far too slow. Doing it in hardware would need a per-page counter or timestamp written on each access, and mainstream MMUs provide only a single accessed bit per page table entry.

Clock (Second Chance): Clock (also called the clock algorithm or approximated LRU) is a practical LRU approximation. Pages are arranged in a circular list. A pointer (the “clock hand”) scans pages; if a page’s accessed bit is set, it is cleared and the hand moves on. If not set, the page is evicted. This avoids per-access software work: the hardware sets the accessed bit whenever the page is touched, and the OS only reads and clears it while scanning for a victim. The “second chance” name comes from the fact that pages with the accessed bit set get one additional time around the clock before eviction.

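Linux does not implement textbook Clock. Its reclaim code keeps active and inactive page lists and uses the accessed bit to give referenced pages a second chance, much as Clock does. Clock-Pro, which tracks hot and cold pages, was proposed for Linux but is not the mainline algorithm; kernels from 6.1 onward can optionally use the Multi-Gen LRU instead.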

Working Set Model: The working set of a process is the set of pages it actively uses over a time window (e.g., the last 10 seconds). Pages outside the working set can be evicted without significantly impacting the process’s hit rate. The working set model aims to keep the working set resident, evicting pages outside it first. The concept was formalized by Peter Denning in the 1960s and is still referenced in modern memory management literature.
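The definition above can be made concrete with a short sketch. It measures time in references (virtual time) rather than seconds, which is an assumption made to keep the example deterministic; `working_set` and `working_set_sizes` are hypothetical helper names.

```python
def working_set(trace: list[int], t: int, delta: int) -> set[int]:
    """Pages referenced in the window of `delta` references ending at index t."""
    start = max(0, t - delta + 1)
    return set(trace[start:t + 1])


def working_set_sizes(trace: list[int], delta: int) -> list[int]:
    """Working set size after each reference in the trace."""
    return [len(working_set(trace, t, delta)) for t in range(len(trace))]


if __name__ == "__main__":
    trace = [1, 2, 1, 3, 4, 4, 4, 5]
    print(working_set_sizes(trace, delta=3))  # → [1, 2, 2, 3, 3, 2, 1, 2]
```

The size shrinks to 1 while the process loops on page 4 and grows again when it moves on, which is the signal a working set policy uses to decide how many frames a process deserves.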

A variant called WSClock (Working Set Clock) combines the clock algorithm with working set timestamps, providing both efficiency and accuracy.

Thrashing

Thrashing is the pathological state where the system spends more time swapping pages in and out than executing useful work. The classic scenario: total working sets of all runnable processes exceed physical memory. Every process needs pages, the OS evicts other processes’ pages to make room, those processes fault back in, and the cycle repeats at disk I/O speeds rather than CPU speeds.

The cure is either reducing the number of competing processes, adding physical RAM, or tuning memory limits (cgroups, container memory caps). The vm.swappiness sysctl influences the kernel’s tendency to swap out process pages versus reclaiming page cache.
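The cliff-edge nature of thrashing shows up even in a toy model. The sketch below replays a loop over five pages through an LRU cache and reports the fault rate; `lru_fault_rate` is a hypothetical helper and the trace is synthetic.

```python
from collections import OrderedDict


def lru_fault_rate(trace: list[int], num_frames: int) -> float:
    """Fraction of accesses that fault under LRU with `num_frames` frames."""
    resident: OrderedDict[int, bool] = OrderedDict()
    faults = 0
    for page in trace:
        if page in resident:
            resident.move_to_end(page)  # hit: mark as most recently used
        else:
            faults += 1
            if len(resident) >= num_frames:
                resident.popitem(last=False)  # evict the least recently used page
            resident[page] = True
    return faults / len(trace)


if __name__ == "__main__":
    loop = [0, 1, 2, 3, 4] * 20  # 100 accesses over a 5-page working set
    for frames in (5, 4):
        print(f"{frames} frames: fault rate {lru_fault_rate(loop, frames):.2f}")
```

With 5 frames only the 5 compulsory faults occur (rate 0.05). With 4 frames, one short of the working set, every single access faults (rate 1.0), because LRU always evicts the page that is needed next.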

Production Failure Scenarios

Thrashing Under Memory Pressure

Failure: A node running multiple Java microservices with large heaps experiences constant page faults. The GC in each JVM runs frequently, but heap pages are swapped out between GC cycles. GC pause times spike from 200 ms to 20 seconds. Response times balloon.

Mitigation: Set explicit memory limits for each container (docker run --memory=512m). Reduce vm.swappiness to 10-20 on hosts running latency-sensitive services. Consider disabling swap on database hosts (swapoff) to force the OOM killer to make hard decisions rather than slowly swapping. Move large heaps to hosts with ample headroom. Use memory requests/limits in Kubernetes to prevent any single pod from consuming the node’s memory.

Swap Storm from fork() Followed by exec()

Failure: A web server forks a child process for each request, and the child immediately exec()s a new program. After fork(), the parent and child share all pages (copy-on-write). The child then calls exec(), which replaces the entire address space with the new program — discarding all COW pages. If many requests arrive simultaneously, the OS allocates and discards huge numbers of pages, generating massive swap activity even though the total memory footprint is small.

Mitigation: Use prefork or thread-pool models (Apache MPM worker, Nginx worker processes) instead of fork-per-request. Use vfork() on Linux (which shares the parent’s address space without COW overhead until exec()). Use container memory limits to isolate services and reduce cross-service swap pressure.

Transparent Huge Page Defragmentation Pauses

Failure: Linux’s transparent huge page (THP) feature defragments memory to coalesce 4 KB pages into 2 MB huge pages, both in the background and, depending on settings, directly in the allocation path. This defragmentation work can cause latency spikes of several milliseconds in latency-sensitive workloads.

Mitigation: Disable THP for latency-sensitive applications: echo never > /sys/kernel/mm/transparent_hugepage/enabled. Use explicit huge page allocation (mmap(MAP_HUGETLB)) for services that need it (PostgreSQL, JVM). On kernel 6.1+, use madvise(MADV_COLLAPSE) to request huge page backing for specific regions.

Trade-off Table

| Algorithm | Implementation Cost | Eviction Accuracy | Hardware Support | Linux Implementation |
| --- | --- | --- | --- | --- |
| Optimal (OPT) | Impossible (requires future knowledge) | Perfect (fewest possible faults) | None | Not used |
| LRU (true) | Very high (per-access timestamp update) | High | Specialized hardware | Not used (too expensive) |
| Clock (Second Chance) | Low (circular buffer, accessed bit check) | Moderate | Accessed bit (R-bit) | Second-chance idea used in the active/inactive list scan |
| Clock-Pro | Low-moderate (multiple hands, hot/cold tracking) | Good | Accessed bit | Proposed, not the mainline default |
| Working Set / WSClock | Moderate (per-page timestamps) | Very good | Accessed bit plus timestamps | Historical |
| Random | Near zero | Poor (uninformed) | None | Not used for page reclaim |
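The optimal row is useful as a yardstick even though no OS can implement it. The sketch below computes the fault count of Belady's optimal policy for a known trace, which gives a lower bound to compare other algorithms against; `opt_page_faults` is a hypothetical helper, and its quadratic lookahead is chosen for clarity rather than speed.

```python
def opt_page_faults(accesses: list[int], num_frames: int) -> int:
    """Count page faults when the victim is the page whose next use is furthest away."""
    resident: set[int] = set()
    faults = 0
    for i, page in enumerate(accesses):
        if page in resident:
            continue
        faults += 1
        if len(resident) >= num_frames:
            future = accesses[i + 1:]

            def next_use(p: int) -> float:
                # Pages never referenced again are the best victims.
                return future.index(p) if p in future else float("inf")

            resident.remove(max(resident, key=next_use))
        resident.add(page)
    return faults


if __name__ == "__main__":
    trace = [7, 0, 1, 2, 0, 3, 0, 4, 1, 2, 3, 4, 7, 0, 1]
    print(opt_page_faults(trace, 3))  # → 10
```

Any practical algorithm run on the same trace with the same number of frames will report at least this many faults.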

Implementation Snippets

Simulating LRU Page Replacement (Python)

#!/usr/bin/env python3
"""LRU Page Replacement Simulator.

Given a sequence of page accesses and a number of physical frames,
simulate the LRU algorithm and count page faults.

LRU: On each access, track when each page was last used.
When a replacement is needed, evict the page with the oldest
last-use timestamp (the least recently used page).
"""

from collections import OrderedDict

def lru_page_faults(access_sequence: list[int], num_frames: int) -> int:
    """Return the number of page faults using LRU replacement."""
    # Use an OrderedDict: order = recency of access (tail = most recent)
    resident_pages = OrderedDict()  # page -> last_access_order
    faults = 0
    access_counter = 0

    for page in access_sequence:
        access_counter += 1

        if page in resident_pages:
            # Page hit: move to end (most recently used)
            resident_pages.move_to_end(page)
        else:
            faults += 1
            if len(resident_pages) >= num_frames:
                # Evict LRU page (first item in OrderedDict)
                resident_pages.popitem(last=False)
            # Load new page and mark as most recently used
            resident_pages[page] = access_counter

    return faults

def clock_page_faults(access_sequence: list[int], num_frames: int) -> int:
    """Approximated LRU using the Clock (Second Chance) algorithm."""
    # resident_pages[i] = (page_number, referenced_bit)
    resident_pages = []  # List of (page, r_bit)
    faults = 0
    hand = 0  # Clock hand position

    for page in access_sequence:
        # Check if page is already resident
        found = False
        for i, (p, r) in enumerate(resident_pages):
            if p == page:
                resident_pages[i] = (p, 1)  # Set referenced bit
                found = True
                break

        if found:
            continue

        # Page fault — need to load page
        faults += 1

        # Find a frame to use
        while True:
            if len(resident_pages) < num_frames:
                resident_pages.append((page, 1))
                break

            # Clock algorithm: scan for victim with r=0
            _, r = resident_pages[hand]
            if r == 0:
                # Evict this page
                resident_pages[hand] = (page, 1)
                hand = (hand + 1) % num_frames
                break
            else:
                # Second chance: clear bit, move on
                resident_pages[hand] = (resident_pages[hand][0], 0)
                hand = (hand + 1) % num_frames

    return faults

# Example from textbook: access sequence 7, 0, 1, 2, 0, 3, 0, 4, 1, 2, 3, 4, 7, 0, 1
accesses = [7, 0, 1, 2, 0, 3, 0, 4, 1, 2, 3, 4, 7, 0, 1]

print("Page access sequence:", accesses)
print("\nLRU algorithm:")
for frames in range(1, 7):
    faults = lru_page_faults(accesses, frames)
    print(f"  {frames} frame(s): {faults} faults")

print("\nClock algorithm:")
for frames in range(1, 7):
    faults = clock_page_faults(accesses, frames)
    print(f"  {frames} frame(s): {faults} faults")

Monitoring Virtual Memory (bash)

#!/bin/bash
# Comprehensive virtual memory monitoring

echo "=== VM Statistics ==="
vmstat 1 5

echo ""
echo "=== Swap Usage ==="
swapon --show
echo ""
free -h

echo ""
echo "=== Top processes by major page faults ==="
ps -eo pid,comm,majflt,minflt,rss,vsz --sort=-majflt | head -10

echo ""
echo "=== Current swappiness ==="
cat /proc/sys/vm/swappiness

echo ""
echo "=== Recent OOM Kills ==="
dmesg | grep -i "oom\|out of memory" | tail -5

Observability Checklist

  • Overall swap activity: vmstat 1, si (swap in) and so (swap out) columns in KB/s
  • Major page fault rate per process: ps -eo pid,comm,majflt --sort=-majflt | head
  • Minor page fault rate: ps -eo pid,comm,minflt (high minflt = first-touch allocation, copy-on-write, or page cache hits; no disk I/O involved)
  • Dentry and inode cache pressure: cat /proc/sys/vm/vfs_cache_pressure (default 100; higher = reclaim dentry and inode caches more aggressively relative to page cache and swap)
  • Swappiness: cat /proc/sys/vm/swappiness (0-100, or 0-200 on kernel 5.8+; higher = more willing to swap process pages)
  • Available swap: swapon --show and free -h
  • Per-process swap usage: cat /proc/PID/status | grep -i swap
  • OOM killer events: dmesg | grep "Out of memory" | tail or journalctl -k | grep oom
  • Transparent huge page status: cat /sys/kernel/mm/transparent_hugepage/enabled
  • Active vs inactive memory: cat /proc/meminfo | grep -E "Active:|Inactive:" (inactive pages are the first candidates for reclaim; touching one again after reclaim costs a page fault)

Security Considerations

Swapping sensitive data to disk: When pages containing sensitive data (cryptographic keys, passwords, private keys) are evicted to swap, they reside in plaintext on the swap device. If an attacker gains physical access or reads the swap device after the system crashes, they can recover this data. Mitigations include:

  • Encrypted swap: LUKS-encrypted swap partitions, or a dm-crypt swap device keyed with a random key that is generated at each boot and held only in RAM
  • Memory locking (mlock/mlockall): Prevent specific pages from being swapped by calling mlock() on the containing region
  • Key management: Hardware security modules (HSMs) keep keys outside host memory entirely, and Intel SGX enclaves keep enclave pages encrypted so the OS cannot read them in plaintext even when it pages them out

Offline attacks on swap: An attacker who removes the disk or boots another OS can image the swap partition and recover sensitive data that was paged out. Encrypted swap renders this ineffective as long as the key itself is not recovered from RAM, which is what a cold boot attack attempts.

Memory disclosure via /proc filesystem: The /proc/PID/maps, /proc/PID/smaps, and /proc/PID/pagemap interfaces expose memory layout information that can aid exploits. Most production environments restrict access to /proc/PID/ to the owner or disable it via hidepid=2 mount option on /proc.

Common Pitfalls / Anti-patterns

Pitfall: Configuring too much swap on systems with SSDs. If you set swap equal to RAM on an SSD-backed system, you may never hit OOM: the system will happily swap instead of invoking the OOM killer. This creates a death by a thousand cuts: applications slow down due to page fault latency, but do not crash. The system stays up but degrades gradually. Better practice: set vm.swappiness=1 for swap-averse workloads (on kernels 3.5+, a value of 0 avoids swapping almost entirely until memory is nearly exhausted), and use memory limits to enforce hard caps.

Pitfall: Disabling swap entirely on systems with overcommitted memory. Disabling swap (swapoff -a) works well when you have more than enough physical RAM for all workloads. On a system with memory overcommit (common in containerized environments), disabling swap means the first process to exhaust memory triggers the OOM killer immediately, with no graceful degradation. You want some swap as a pressure valve.

Pitfall: Confusing virtual memory size with physical memory usage. pmap -X <pid> shows the size and resident set size (RSS) of each mapping, and ps reports the process totals as VSZ and RSS. The gap between VSZ and RSS can be enormous: virtual memory includes mapped but never-touched pages, memory-mapped file pages that were faulted in and later reclaimed, pages that have been swapped out, and stack guard pages. A process with VSZ=100 GB and RSS=200 MB is using 200 MB of physical memory, not 100 GB.
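The same numbers are available per process in /proc/PID/status as the VmSize, VmRSS, and VmSwap fields. The sketch below parses that text format; `memory_fields` is a hypothetical helper, and the sample text is made up to match the 100 GB versus 200 MB example above.

```python
def memory_fields(status_text: str) -> dict[str, int]:
    """Extract VmSize, VmRSS and VmSwap (in kB) from /proc/<pid>/status text."""
    wanted = {"VmSize", "VmRSS", "VmSwap"}
    fields: dict[str, int] = {}
    for line in status_text.splitlines():
        name, _, rest = line.partition(":")
        if name in wanted:
            fields[name] = int(rest.split()[0])  # the value is followed by "kB"
    return fields


# Illustrative sample, not captured from a real process.
SAMPLE_STATUS = (
    "Name:\tdemo\n"
    "VmSize:\t104857600 kB\n"
    "VmRSS:\t   204800 kB\n"
    "VmSwap:\t        0 kB\n"
)

if __name__ == "__main__":
    fields = memory_fields(SAMPLE_STATUS)
    print(f"virtual: {fields['VmSize'] / 1024**2:.0f} GB, "
          f"resident: {fields['VmRSS'] / 1024:.0f} MB")
```

On a Linux host the function can be fed the contents of /proc/self/status to inspect the running interpreter.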

Anti-pattern: Large allocations without considering swap. On 32-bit systems (or 32-bit processes on 64-bit kernels), a single large malloc can exhaust the virtual address space. On 64-bit systems, a large malloc or mmap reserves virtual address space only; physical pages are committed on first write (glibc malloc serves large requests with mmap internally). The danger is code that touches everything it allocates: combined with heavy swap usage, that can drive the system into thrashing or cause the OOM killer to terminate processes unexpectedly.

Quick Recap Checklist

  • Virtual memory allows logical address spaces larger than physical memory by using disk as backing store
  • Demand paging loads pages into RAM only on first access (page fault), not at allocation time
  • Page replacement algorithms (LRU, Clock, Working Set) decide which physical pages to evict when RAM is full
  • Thrashing occurs when the OS spends more time swapping than executing — total working sets exceed physical RAM
  • The OOM killer is invoked when physical memory + swap is exhausted, not when virtual address space is exhausted
  • Linux approximates LRU with active/inactive page lists and the hardware accessed bit, in the spirit of Clock; Multi-Gen LRU is an option since kernel 6.1
  • vm.swappiness controls kernel preference: higher = more willing to swap process pages, lower = prefer page cache reclaim
  • Huge pages (2 MB, 1 GB) reduce TLB pressure for large working sets — important for databases and JVMs
  • Encrypted swap prevents cold-boot and physical-access attacks from reading sensitive data from swap space
  • Tools: vmstat 1 (swap activity), ps -o majflt,minflt (fault rates), swapon --show (swap size), /proc/meminfo (memory state)

Interview Questions

1. What is thrashing, what causes it, and how do you detect and resolve it?

Thrashing occurs when the total demand for physical memory from all runnable processes exceeds what is available. The OS spends most of its time swapping pages in and out — every time a process gets CPU time, it generates page faults, loads pages from disk, and then the OS evicts other processes' pages to make room. Throughput collapses because the CPU is idle waiting for disk I/O more than it is running code.

Detection: vmstat 1 shows high si (swap in) and so (swap out) columns. iostat -x 1 shows high disk utilization. ps shows processes with high majflt rates. The system appears sluggish despite moderate CPU usage.

Resolution: Add physical RAM, reduce the number of competing processes, set memory limits (cgroups, containers), reduce vm.swappiness to prefer page cache reclaim, or move workloads to dedicated hosts with adequate headroom.

2. Explain the difference between LRU and the Clock page replacement algorithm.

True LRU (Least Recently Used) requires tracking the last access time of every page in memory. On each memory access, software must update a timestamp or reorder a structure, at massive overhead, since memory accesses are the most frequent events in a running program. Mainstream MMUs offer no hardware assist beyond a single accessed bit per page, which records that a page was touched but not when.

The Clock algorithm (Second Chance) approximates LRU by using a referenced (accessed) bit per page. Pages are arranged in a circular list. A pointer sweeps the list; if a page's accessed bit is 0, it is evicted immediately. If it is 1, the bit is cleared and the page gets a "second chance" — the pointer moves on. Pages that have been recently accessed retain their accessed bit and survive multiple sweeps. This approximates LRU without per-access overhead: the accessed bit is updated by hardware on every memory access, but only checked during eviction.

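Linux's actual algorithm is neither true LRU nor textbook Clock: it keeps active and inactive lists and uses the accessed bit to promote referenced pages or give them a second chance. Clock-Pro, which separates hot and cold pages using multiple clock hands, was proposed for Linux but is not the mainline default.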

3. What happens when a process tries to access a page that has been swapped out to disk?

The CPU generates a page fault because the page table entry shows the page as not present. The OS page fault handler examines the faulting address to verify the access is valid (within a valid VMA and respecting protection bits). It then locates the page in the swap device using the swap offset stored in the PTE. The handler allocates a physical frame, initiates a disk I/O read to load the page from swap, and puts the faulting thread to sleep until the read completes. Meanwhile, other processes can run. Once the page is loaded, the PTE is updated with the frame number, the present bit is set, and the instruction is retried, transparently resuming the process as if the page had always been in RAM.

The key insight is that the instruction is restartable: page faults on x86 are precise exceptions, so the CPU reports the address of the faulting instruction and re-executes it from the start once the handler returns.

4. What is the working set of a process, and why does it matter for memory management?

The working set W(t, delta) is the set of pages referenced by a process during the time interval (t minus delta, t). It is the pages the process actively needs — those accessed within the last delta time units. Pages outside the working set can be evicted without significantly impacting the process's hit rate.

The working set model matters because it quantifies how much physical memory a process truly needs versus how much it has allocated. If the sum of all processes' working sets exceeds physical memory, thrashing is inevitable. OSes approximate the working set rather than compute it exactly (Linux's active and inactive page lists are one rough approximation) and use the result to prioritize which pages to keep resident. Monitoring the working set size of a database or JVM over time reveals whether its heap size is appropriate or excessive.

5. Why might you disable swap on a production database server?

Database servers (PostgreSQL, MySQL, Oracle) manage their own buffer pool and typically have a well-tuned working set that fits in physical memory. If swap is enabled, the OS may begin swapping out database buffer pool pages under memory pressure, even when the database's own internal eviction policy would have made a smarter choice. This causes database performance to degrade slowly rather than fail fast. By disabling swap, you force the OOM killer to make hard decisions: when physical memory is exhausted, some process gets killed immediately. This is preferable to slow, unpredictable swap thrashing. Swap can also introduce latency unpredictability: a major page fault taken by a thread that holds a critical lock stalls every thread waiting on that lock, and the resulting pause of milliseconds or more can violate real-time SLAs in trading or control systems.

6. What is the difference between anonymous pages and file-backed pages in the page cache?

Anonymous pages are process heap, stack, and data pages that have no filesystem backing; their only backing store is swap space. When evicted, they must be written to the swap device. File-backed pages are mapped from files (e.g., code, mmap'd data): when evicted, they can be dropped immediately if clean (matching the file on disk) or written back to the filesystem if dirty. The difference matters for page replacement: clean file-backed pages are the cheapest to reclaim because they need no write and can be re-read from the file, while anonymous pages cost a swap write and cannot be reclaimed at all when no swap is configured. The vm.swappiness sysctl shifts the balance between the two. The shmem filesystem creates file-backed pages that behave like anonymous pages (stored in swap, no on-disk filesystem).

7. What is OOM scoring and how does the OOM killer select its victim?

The Linux OOM killer uses an oom_score_adj value (-1000 to +1000) per process to influence selection. Higher values increase the likelihood of being killed; lower values (including negative) make a process more protected, and -1000 exempts it entirely. The kernel calculates a badness score from the process's resident set size (RSS), swap usage, and page table size, then shifts it by oom_score_adj scaled to total memory. When physical memory is exhausted, the killer traverses all processes and selects the one with the highest score, which is typically the largest memory consumer. The current score is visible in /proc/PID/oom_score.

Containers add complexity: in cgroup v2, setting memory.oom.group makes an OOM kill take down every process in the cgroup together rather than a single victim. memory.min and memory.low protect a cgroup's memory from reclaim, while memory.high throttles the cgroup and forces reclaim before it reaches the hard limit in memory.max.
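The scoring step can be sketched as follows. This is a simplified model of the kernel's badness heuristic, assuming it sums resident, swapped, and page table pages and then adds oom_score_adj scaled by one thousandth of total memory; `oom_badness` and `pick_victim` are hypothetical Python names, and the real kernel applies further rules (cgroup scope, unkillable tasks) that are omitted here.

```python
from typing import Optional


def oom_badness(rss_pages: int, swap_pages: int, pgtable_pages: int,
                oom_score_adj: int, total_pages: int) -> Optional[int]:
    """Return a badness score in pages, or None when the process is exempt."""
    if oom_score_adj == -1000:
        return None  # -1000 means "never kill this process"
    points = rss_pages + swap_pages + pgtable_pages
    # Each unit of oom_score_adj is worth 0.1% of total memory.
    points += oom_score_adj * (total_pages // 1000)
    return points


def pick_victim(scores: dict[str, Optional[int]]) -> Optional[str]:
    """Pick the process name with the highest score among non-exempt processes."""
    candidates = {name: s for name, s in scores.items() if s is not None}
    return max(candidates, key=candidates.get) if candidates else None


if __name__ == "__main__":
    total = 1_000_000  # illustrative machine size in pages
    scores = {
        "web": oom_badness(200_000, 0, 500, 0, total),
        "db": oom_badness(300_000, 0, 800, -500, total),
        "sshd": oom_badness(2_000, 0, 50, -1000, total),
    }
    print(scores, "->", pick_victim(scores))
```

In this made-up example the database uses more memory than the web server but is protected by its negative adjustment, so the web server is chosen.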

8. What is the difference between vmalloc and mmap with MAP_ANONYMOUS?

vmalloc is a kernel-space allocator: it returns memory that is virtually contiguous but may be physically scattered, mapped into the kernel's dedicated vmalloc area (tens of terabytes on x86-64, but only around 128 MB by default on 32-bit x86). It is suitable for large kernel buffers (multi-MB) that don't require physical contiguity. mmap with MAP_ANONYMOUS is a user-space system call that creates a new zero-filled, virtually contiguous region in the calling process's address space, populated lazily by page faults. Both return virtually contiguous memory whose individual pages ultimately come from the buddy allocator. The key differences: vmalloc memory lives outside the kernel's direct-mapped range, so it needs its own page table setup, and its pages are allocated eagerly and never swapped; anonymous mmap memory is demand-paged and swappable. For DMA or I/O buffers requiring physical contiguity, neither is suitable: kernel code needs alloc_pages() (buddy system) with a high-order allocation, or the DMA API. For very large user-space allocations that don't need contiguity, mmap with MAP_ANONYMOUS is the usual choice.

9. What is the working set clock algorithm (WSClock) and how does it work?

The WSClock algorithm combines the clock (second chance) algorithm with per-page timestamps (working set model). Each page has a reference bit and a timestamp of the last reference time. The clock hand scans pages: if a page's reference bit is set, it is cleared and the timestamp updated; if not set, the page's age (time since last reference) is compared to the working set window. Pages older than the window are evicted, after being written back if dirty. This approximates the accuracy of LRU timestamps while keeping the efficiency of the clock algorithm's accessed-bit updates (hardware-managed, no per-access software overhead). Linux does not use WSClock; its active/inactive lists rely on the accessed bit alone, without per-page timestamps.
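A compact simulation makes the scan rule concrete. The sketch below is a simplified WSClock that ignores dirty pages and measures time in references; `Frame`, `wsclock_page_faults`, and the parameter `tau` (the working set window) are hypothetical names, and the fallback of evicting the oldest frame when nothing has aged out is one of several reasonable choices.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    page: int
    referenced: bool
    last_use: int  # virtual time of the last recorded reference


def wsclock_page_faults(accesses: list[int], num_frames: int, tau: int) -> int:
    """Count page faults under a simplified WSClock (no dirty-page handling)."""
    frames: list[Frame] = []
    where: dict[int, int] = {}  # page -> index into frames
    hand = 0
    faults = 0
    for now, page in enumerate(accesses):
        if page in where:
            frames[where[page]].referenced = True  # what the MMU would do
            continue
        faults += 1
        if len(frames) < num_frames:
            where[page] = len(frames)
            frames.append(Frame(page, True, now))
            continue
        victim = None
        oldest = hand
        for _ in range(num_frames):  # at most one full sweep
            frame = frames[hand]
            if frame.referenced:
                # Recently used: clear the bit and record the time.
                frame.referenced = False
                frame.last_use = now
            elif now - frame.last_use > tau:
                victim = hand  # outside the working set window
                break
            if frame.last_use < frames[oldest].last_use:
                oldest = hand
            hand = (hand + 1) % num_frames
        if victim is None:
            victim = oldest  # nothing aged out: fall back to the oldest frame
        del where[frames[victim].page]
        frames[victim] = Frame(page, True, now)
        where[page] = victim
        hand = (victim + 1) % num_frames
    return faults
```

Calling `wsclock_page_faults(trace, num_frames, tau)` with a small `tau` makes the policy behave more like plain Clock, while a large `tau` protects more of each process's recent pages.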

10. What causes memory overcommit and how does the Linux kernel handle it?

Memory overcommit occurs when the sum of all processes' virtual memory allocations exceeds physical RAM + swap. Linux's default behavior (overcommit mode 0, heuristic) refuses only obviously excessive requests and does not reserve physical pages at allocation time. This is why a process on a 16 GB machine can successfully malloc far more than it will ever touch, and why many processes together can hold allocations that sum to well over 16 GB: physical pages are only allocated when a process actually touches them (demand paging). Overcommit is necessary for fork()+COW where the child shares all pages with the parent; without it, fork() would fail on large memory footprints. When physical memory is exhausted, the OOM killer selects and kills a process. Setting vm.overcommit_memory=0 enables heuristic overcommit; =1 always allows overcommit, so even a 100 GB malloc on a 16 GB machine succeeds; =2 denies allocations that would push committed memory past swap plus overcommit_ratio percent of RAM.
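The three modes can be compared with a small model. This is a sketch under stated assumptions: the strict limit is taken as swap plus `overcommit_ratio` percent of RAM (default 50), huge page and admin reserves are ignored, and the heuristic mode is approximated as refusing only a single request larger than RAM plus swap. `commit_limit_kb` and `allocation_allowed` are hypothetical names.

```python
GB = 1024 * 1024  # one gigabyte expressed in kB


def commit_limit_kb(ram_kb: int, swap_kb: int, overcommit_ratio: int = 50) -> int:
    """Approximate CommitLimit as used by vm.overcommit_memory=2."""
    return swap_kb + ram_kb * overcommit_ratio // 100


def allocation_allowed(request_kb: int, committed_kb: int, ram_kb: int,
                       swap_kb: int, mode: int, overcommit_ratio: int = 50) -> bool:
    """Would a new allocation be accepted under the given overcommit mode?"""
    if mode == 1:
        return True  # always overcommit
    if mode == 2:
        limit = commit_limit_kb(ram_kb, swap_kb, overcommit_ratio)
        return committed_kb + request_kb <= limit
    if mode == 0:
        # Heuristic: only obvious overcommit is refused (approximation).
        return request_kb <= ram_kb + swap_kb
    raise ValueError(f"unknown overcommit mode {mode}")


if __name__ == "__main__":
    ram, swap = 16 * GB, 0
    for mode in (0, 1, 2):
        ok = allocation_allowed(12 * GB, 0, ram, swap, mode)
        print(f"mode {mode}: 12 GB request on a 16 GB host allowed = {ok}")
```

The live values for a host are reported as CommitLimit and Committed_AS in /proc/meminfo.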

11. What is the difference between major page faults and minor page faults with respect to swap?

A major page fault occurs when the page must be read from swap (or a memory-mapped file on disk): this requires actual disk I/O and takes from tens of microseconds on fast SSDs to milliseconds on spinning disks. A minor page fault occurs when the page can be mapped without disk I/O, for example the first touch of freshly allocated memory, a file page already in the page cache, or a copy-on-write fault after fork(); it costs only a page table update and possibly a page copy. The majflt column in ps counts faults that needed disk reads. The minflt column counts the rest. A process can accumulate millions of minflt (e.g., fork-heavy workloads) with little performance impact. A sustained high majflt rate after startup usually indicates memory pressure: either the process's working set exceeds RAM or a memory leak is causing gradual exhaustion. A burst of major faults at startup is normal, since code and data are being read from disk for the first time.

12. What is memory compaction and how does it help with large page allocations?

Memory compaction (Linux 2.6.35+) is the kernel's mechanism for de-fragmenting physical memory to create large contiguous blocks for huge page allocations. Within each zone, one scanner walks up from the bottom looking for movable pages while another walks down from the top looking for free pages; movable pages are migrated (using the page migration mechanism) into the free slots so that the space they leave behind coalesces into larger blocks. This is important because huge page allocations (2 MB or 1 GB) require physically contiguous memory: without compaction, a system with 1 GB free but fragmented across 4 KB chunks cannot satisfy a 2 MB huge page request. Compaction runs in the background in the kcompactd kernel threads and can also run directly in the allocation path; khugepaged is the separate thread that collapses small pages into transparent huge pages. Compaction is CPU-intensive and can cause latency spikes, which is why some database administrators disable transparent huge pages.

13. How does the page cache interact with the buffer cache in modern Linux?

In modern Linux, the page cache is the universal disk cache, unified for both files and block devices. The separate buffer cache of early kernels, which managed individual disk blocks, was folded into the page cache in the 2.4 series. What you see as "Buffers" in the free command output is the part of the page cache that holds block device data, mostly filesystem metadata (superblocks, inodes, bitmaps) and raw block device I/O. The page cache caches file content in 4 KB pages; the VFS layer maps files to these pages. When you read a file, the page cache is checked first; if the page is present, the read is served from RAM at DRAM speed. The shmem filesystem creates tmpfs pages that live entirely in the page cache backed by swap. The kernel also maintains an active_list and inactive_list to implement a simplified LRU for the page cache.

14. What is the difference between madvise(MADV_WILLNEED) and madvise(MADV_DONTNEED)?

madvise(MADV_WILLNEED) hints to the kernel that a memory region will be accessed soon, triggering asynchronous readahead: the kernel prefetches pages from disk into the page cache before the application requests them. This is useful for sequential access patterns on large files: calling madvise(buf, len, MADV_WILLNEED) before reading causes the kernel to issue disk I/O proactively, reducing page fault latency during the read loop. madvise(MADV_DONTNEED) tells the kernel that the pages in the region are no longer needed. For private anonymous memory, the kernel can free the physical frames immediately without writing them to swap, even if the pages are dirty. This is used by some malloc implementations to return unused memory to the OS. On Linux, madvise(MADV_DONTNEED) actually drops the pages from the process's page tables; accessing a private anonymous page again causes a page fault and returns a zero-filled page (not the old content), while a file-backed page is re-read from the file. This differs from POSIX, where posix_madvise(POSIX_MADV_DONTNEED) is purely advisory and the content is preserved.
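The Linux behaviour of MADV_DONTNEED can be observed from Python 3.8+, whose mmap objects expose madvise. The sketch below assumes Linux and a private anonymous mapping; `dontneed_discards_contents` is a hypothetical helper name.

```python
import mmap


def dontneed_discards_contents() -> bytes:
    """Write to a private anonymous page, advise DONTNEED, then read it back."""
    # fileno=-1 requests an anonymous mapping; MAP_PRIVATE keeps it process-local.
    region = mmap.mmap(-1, mmap.PAGESIZE, flags=mmap.MAP_PRIVATE)
    try:
        region[:5] = b"hello"
        # Covers the whole mapping when no start and length are given.
        region.madvise(mmap.MADV_DONTNEED)
        return region[:5]  # re-faults the page
    finally:
        region.close()


if __name__ == "__main__":
    print(dontneed_discards_contents())
```

On Linux the function returns five zero bytes rather than `b"hello"`, because the read after the advice faults in a fresh zero-filled page. Other platforms may treat the advice as a hint and keep the data.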

15. What is the relationship between swap space and encrypted swap for security?

When pages containing sensitive data (passwords, cryptographic keys, private data) are swapped out to disk, they reside in plaintext on the swap device, which is a security risk if an attacker gains physical access or reads the swap partition after a crash. Encrypted swap (typically dm-crypt configured through cryptsetup, in plain mode or with LUKS) ensures swapped pages are ciphertext on disk, so physical access to the disk alone does not expose sensitive data. A common configuration generates a random key at every boot; that key exists only in RAM, so once the machine is powered off the old swap contents are unrecoverable. This protects data at rest. It does not stop a cold boot attack that reads the key out of RAM, and a random per-boot key is incompatible with hibernation, which needs a key that survives the power cycle. Additionally, mlock() / mlockall() can lock specific pages into RAM, preventing them from being swapped out at all, which is useful for security-critical data that must never hit disk. Hardware security modules (HSMs) keep keys outside host memory entirely, and Intel SGX enclaves keep enclave memory encrypted so the OS cannot inspect it, even when it pages enclave memory out.

16. What is the performance impact of transparent huge pages and when might you disable them?

Transparent huge pages (THP) allow the kernel to automatically back anonymous memory (heap, mmap) with 2 MB huge pages instead of 4 KB pages, without application involvement. Benefits: reduced TLB pressure, fewer page table entries, and less page table overhead for large working sets. Drawbacks: the memory compaction needed to assemble huge pages, whether it runs in the background or directly in the page fault path, can cause latency spikes of several milliseconds on latency-sensitive workloads, and internal fragmentation increases because a sparsely used huge page still occupies the full 2 MB. THP works best with large, densely used anonymous memory (e.g., malloc'd arrays, JVM heaps). Databases and latency-sensitive services often disable THP (echo never > /sys/kernel/mm/transparent_hugepage/enabled) to prevent compaction pauses, and use explicit huge pages with mmap(MAP_HUGETLB) instead. PostgreSQL users have reported noticeably shorter pauses with THP disabled on some workloads.

17. How does swapin/swapout priority work in Linux and why does it matter for NUMA systems?

Every swap area has a priority. The administrator can set it with the pri= option in /etc/fstab or with swapon -p; areas without an explicit priority receive negative priorities in the order they are activated, so earlier areas are preferred. The kernel does not measure device speed: it fills the highest-priority area first and moves to a lower-priority one only when the higher one is full, while areas with equal priority are used round-robin. Placing fast storage (SSD) at a higher priority than slow storage (HDD) is therefore the administrator's job. On NUMA systems, kernels since 4.14 also keep a per-node ordering of swap devices: among areas whose priority was assigned automatically, a device attached to the node doing the swapping is preferred over a device on a remote node, which avoids cross-node I/O. In a multi-node system with a local SSD on each node, leaving those areas at their default priority lets each node swap out to its own SSD. Swap-in offers no such choice, because a page must be read back from wherever it was written. The si and so columns in vmstat 1 show how much memory is swapped in and out per second.

18. What is a page frame reclaiming algorithm and what does Linux use?

A page frame reclaiming algorithm (PFRA) determines which physical frames to reclaim when the system needs free frames. Linux classically keeps two LRU lists per page type: active pages (frequently accessed) sit on the active list and are moved to the inactive list when they appear less active; pages at the tail of the inactive list are the first candidates for eviction. (Kernels from 6.1 onward can optionally use the multi-generational LRU, which replaces the two lists with several generations.) The algorithm is not pure LRU: rather than recording every access, it samples the hardware referenced bit when it scans the lists, in the spirit of the Clock algorithm and descendants such as 2Q and CLOCK-Pro. The swappiness sysctl controls the balance between reclaiming file-backed page cache and swapping out anonymous pages (heap, stack): low values (~10) favor dropping page cache and avoid swapping, while higher values (the default is 60) make the kernel more willing to swap anonymous pages. The vfs_cache_pressure sysctl is a separate knob that controls how eagerly the kernel reclaims dentry and inode caches: the default is 100, lower values retain those caches longer, and higher values reclaim them sooner. Within each list, LRU order still applies: pages at the tail of the inactive list are the least recently referenced and are evicted first.

19. What is the difference between VMA (Virtual Memory Area) and a page?

A VMA (Virtual Memory Area) is a contiguous range of virtual addresses with a uniform set of attributes (read/write/execute permissions, backing store). Each process has multiple VMAs: one for the code segment (r-x), one for the data segment (rw-), one for the heap (rw-), one for the stack (rw-), and additional VMAs for shared libraries, mmap regions, and thread stacks. The kernel indexes VMAs by address in a balanced search structure: a red-black tree for most of Linux's history (an AVL tree in very early kernels) and a maple tree since Linux 6.1. When a process faults on an address, the kernel first looks up which VMA contains that address; if there is none (and the fault is not a legitimate stack growth), the process receives SIGSEGV. If a VMA is found, the kernel checks whether the access is permitted by the VMA's permissions. Pages, on the other hand, are the unit of physical memory management: fixed-size blocks (4 KB on x86-64) of virtual and physical address space that can be individually mapped, swapped, and protected. A VMA contains many pages, but the VMA is a software structure; the pages are what actually get mapped to frames.
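
The lookup-then-check sequence can be sketched with a sorted list and binary search standing in for the kernel's tree. The VMA class, the function names, and the addresses below are hypothetical and chosen only for illustration.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class VMA:
    start: int   # inclusive
    end: int     # exclusive
    perms: str   # e.g. "r-x", "rw-"
    name: str


def find_vma(vmas, addr):
    """Return the VMA containing addr, or None. vmas must be sorted by start."""
    # Rebuilt on every call for brevity; a real index would be kept up to date.
    starts = [v.start for v in vmas]
    i = bisect.bisect_right(starts, addr) - 1
    if i >= 0 and vmas[i].start <= addr < vmas[i].end:
        return vmas[i]
    return None


def check_access(vmas, addr, kind):
    """kind is 'r', 'w' or 'x'. Returns 'ok' or 'SIGSEGV'."""
    vma = find_vma(vmas, addr)
    if vma is None or kind not in vma.perms:
        return "SIGSEGV"
    return "ok"


# Hypothetical layout; the addresses are illustrative only.
layout = [
    VMA(0x400000, 0x401000, "r-x", "text"),
    VMA(0x600000, 0x602000, "rw-", "data"),
    VMA(0x7FFD0000, 0x7FFF0000, "rw-", "stack"),
]
print(check_access(layout, 0x400010, "x"))  # ok: executing code
print(check_access(layout, 0x400010, "w"))  # SIGSEGV: writing to r-x
print(check_access(layout, 0x500000, "r"))  # SIGSEGV: no VMA covers it
```

Both failure cases end in the same signal, but for different reasons: one address is unmapped, the other is mapped with the wrong permissions.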

20. How does the OS distinguish between a legitimate stack growth and a stack overflow using VMAs?

The OS uses the stack VMA, a guard region, and a size limit to enforce stack boundaries. When a process is created, the kernel creates a stack VMA flagged as growable (on most architectures it grows downward, toward lower addresses) and keeps a guard region of unmapped address space beyond its growing end. Linux has enforced a one-page guard since 2010 and, since 2017, a gap of 256 pages (1 MB with 4 KB pages) by default; thread stacks created by pthreads get an explicit guard page with no permissions instead. The guard region is not backed by any physical frame, so touching it triggers a page fault. The page fault handler then checks whether the faulting address lies just beyond the growing end of the stack VMA, whether the enlarged stack would stay within the maximum stack size (RLIMIT_STACK), and whether the guard gap to the neighboring mapping would survive. If so, the handler extends the stack VMA to cover the address, maps a new physical page, and lets the access succeed. Otherwise it sends SIGSEGV, which kills the process by default. This mechanism allows the stack to grow dynamically on demand (lazy allocation) while catching genuine stack overflows before they corrupt adjacent memory regions like the heap.
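
The grow-or-kill decision can be modelled in a few lines. This is a simplified sketch, not the kernel's code path: it assumes a downward-growing stack, 4 KB pages, and a 256-page guard gap, and the function name and addresses are hypothetical.

```python
PAGE = 4096


def handle_stack_fault(addr, stack_start, stack_end, rlimit_stack,
                       lower_mapping_end, guard_gap=256 * PAGE):
    """Decide what a fault near a downward-growing stack should do.

    Returns the (possibly lowered) stack_start if the access may proceed,
    or None when the process should receive SIGSEGV.
    """
    if addr >= stack_end:
        return None                        # above the stack: not a stack fault
    if addr >= stack_start:
        return stack_start                 # already inside the stack VMA
    new_start = addr - (addr % PAGE)       # round down to a page boundary
    if stack_end - new_start > rlimit_stack:
        return None                        # would exceed RLIMIT_STACK
    if new_start - guard_gap < lower_mapping_end:
        return None                        # would intrude on the guard gap
    return new_start


STACK_END, STACK_START = 0x80000000, 0x7FFE0000   # 128 KB stack today
LIMIT = 8 * 1024 * 1024                           # 8 MB RLIMIT_STACK

grown = handle_stack_fault(0x7FFDF123, STACK_START, STACK_END, LIMIT, 0x40000000)
too_big = handle_stack_fault(0x7F000000, STACK_START, STACK_END, LIMIT, 0x40000000)
crowded = handle_stack_fault(0x7FFDF123, STACK_START, STACK_END, LIMIT, 0x7FFD0000)
print(hex(grown), too_big, crowded)  # 0x7ffdf000 None None
```

The first fault is ordinary growth by one page; the second would make the stack 16 MB and exceeds the limit; the third fits the limit but would land within the guard gap above another mapping.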

Further Reading

Swap Space Configuration and Performance

Swap size guidelines:

  • Traditional rule: RAM × 1.5 to 2.0
  • For swap-averse workloads (databases, in-memory caches): 0 or minimal
  • For memory overcommit scenarios: at least RAM × 0.5 as pressure valve
  • For hibernation: at least RAM (Linux hibernate saves memory to swap)

Swap performance on different storage:

Storage  | Swap Latency | Sequential Throughput | Random IOPS | Recommendation
NVMe SSD | ~100 μs      | 3-7 GB/s              | 100k-1M     | Good for swap
SATA SSD | ~100 μs      | 0.5-1 GB/s            | 10k-100k    | Acceptable
HDD      | ~10 ms       | 100-200 MB/s          | 100-500     | Avoid for active swap

Swap area priorities (Linux swapon): Multiple swap files/partitions can have priorities assigned. Higher-priority swap is used first, and areas with equal priority are used round-robin. Set the priority with the pri= option in /etc/fstab or with swapon --priority.
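
To see the order in which areas will be filled, the Priority column of /proc/swaps can be sorted. The sketch below parses a sample in that file's column layout instead of reading the live file, so it runs anywhere; the device names and sizes are invented, and the naive whitespace split assumes paths without spaces.

```python
SAMPLE_PROC_SWAPS = """\
Filename                Type        Size      Used  Priority
/swapfile               file        4194300   0     -2
/dev/nvme0n1p3          partition   8388604   0     10
/dev/sda2               partition   16777212  0     5
"""


def swap_order(proc_swaps_text):
    """Return swap areas sorted by priority, highest (used first) at the front."""
    areas = []
    for line in proc_swaps_text.splitlines()[1:]:   # skip the header row
        fields = line.split()
        if len(fields) == 5:
            name, _type, size_kib, used_kib, prio = fields
            areas.append((name, int(prio), int(size_kib), int(used_kib)))
    return sorted(areas, key=lambda a: a[1], reverse=True)


for name, prio, size_kib, used_kib in swap_order(SAMPLE_PROC_SWAPS):
    print(f"{prio:>3}  {name}  {used_kib}/{size_kib} KiB used")
```

On a real system the same function could be given the contents of /proc/swaps.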

ZSwap: Compressed Swap Cache

ZSwap (Linux 3.11+) is a lightweight compressed cache for pages that would be swapped out. Instead of writing to disk, ZSwap compresses pages and stores them in a memory pool. If the compressed pool fills up, the least-recently-used pages are written to disk.

Benefits:

  • Reduces disk I/O for workloads with good compression ratios
  • Improves performance for intermittent swap usage
  • Especially effective for workloads where swapped pages are soon accessed again

Configuration: echo 1 > /sys/module/zswap/parameters/enabled (varies by distribution)
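
The pool-then-disk behaviour described in this section can be modelled with a toy class. This is a sketch under stated assumptions, not how the kernel implements it: ToyZswap is a hypothetical name, zlib stands in for the kernel's compressors, and a plain dict stands in for the swap device.

```python
import zlib
from collections import OrderedDict


class ToyZswap:
    """Toy model: compressed in-RAM pool with LRU writeback to a 'disk' dict."""

    def __init__(self, pool_limit_bytes):
        self.pool_limit = pool_limit_bytes
        self.pool = OrderedDict()   # page_id -> compressed bytes, oldest first
        self.disk = {}              # page_id -> raw page (stand-in for swap device)

    def pool_bytes(self):
        return sum(len(blob) for blob in self.pool.values())

    def swap_out(self, page_id, data):
        self.pool[page_id] = zlib.compress(data)
        self.pool.move_to_end(page_id)
        # Pool over its limit: write the oldest entries back to "disk".
        while self.pool_bytes() > self.pool_limit and len(self.pool) > 1:
            old_id, blob = self.pool.popitem(last=False)
            self.disk[old_id] = zlib.decompress(blob)

    def swap_in(self, page_id):
        if page_id in self.pool:                 # fast path: no disk I/O
            return zlib.decompress(self.pool.pop(page_id))
        return self.disk.pop(page_id)            # slow path: read from "disk"


z = ToyZswap(pool_limit_bytes=1 << 20)
page = b"A" * 4096                               # highly compressible page
z.swap_out(0, page)
print(len(page), "bytes stored as", z.pool_bytes(), "compressed bytes")
print(z.swap_in(0) == page)                      # True: round trip is lossless
```

Pages that compress well take a small fraction of a frame in the pool, which is why a swap-in from the pool avoids disk I/O at a modest memory cost.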

Page Replacement Algorithm Variants

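Clock-style scanning: Rather than timestamping every access, Linux samples the hardware accessed bit when it scans its page lists, following the idea behind Clock and refinements such as CLOCK-Pro that distinguish hot from cold pages. Pages that are referenced repeatedly keep being found with the accessed bit set and stay resident; pages that were accessed once and not again get evicted first.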

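Two-List Strategy: Linux maintains active and inactive page lists. A newly read page cache page starts on the inactive list; if it is referenced again it is promoted to the active list, and if not it remains a candidate for eviction. Active pages that stop being referenced are demoted back to the inactive list. This simple LRU approximation prevents a single reference, such as one pass over a large file, from pushing frequently used pages out of memory.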
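
A two-list policy is small enough to simulate. The sketch below assumes that a page enters the inactive list on first touch and is promoted on its second; the class name, the capacity parameters, and the demotion rule are illustrative choices, not the kernel's exact behaviour.

```python
from collections import OrderedDict


class TwoListLRU:
    """Toy two-list policy: first touch -> inactive, second touch -> active."""

    def __init__(self, capacity, max_active):
        self.capacity = capacity
        self.max_active = max_active
        self.active = OrderedDict()     # oldest first
        self.inactive = OrderedDict()   # oldest first; eviction takes the front

    def access(self, page):
        """Touch a page; return the page evicted to make room, if any."""
        if page in self.active:
            self.active.move_to_end(page)
            return None
        if page in self.inactive:
            del self.inactive[page]
            self.active[page] = True            # promoted on second reference
            if len(self.active) > self.max_active:
                demoted, _ = self.active.popitem(last=False)
                self.inactive[demoted] = True   # demote the oldest active page
            return None
        evicted = None
        if len(self.active) + len(self.inactive) >= self.capacity:
            victims = self.inactive if self.inactive else self.active
            evicted, _ = victims.popitem(last=False)
        self.inactive[page] = True
        return evicted


cache = TwoListLRU(capacity=4, max_active=2)
trace = ["A", "A", "B", "B", "C", "D", "E", "F", "G"]   # hot A, B; then a scan
evicted = [p for p in (cache.access(x) for x in trace) if p is not None]
print(evicted, list(cache.active))  # ['C', 'D', 'E'] ['A', 'B']
```

The one-time scan over C through G only ever evicts other scan pages; the twice-referenced pages A and B survive on the active list.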

Key Takeaways

  • Virtual memory extends physical RAM by using disk as a backing store for evicted pages
  • Demand paging loads pages only on first access — virtual allocation costs nothing until used
  • Page replacement algorithms (in Linux, a two-list, Clock-style scheme) approximate LRU without per-access overhead
  • Thrashing occurs when total working sets exceed physical memory — the system spends time swapping instead of computing
  • OOM killer is invoked when physical memory plus swap is exhausted
  • Encrypted swap prevents an attacker with physical access to the disk from recovering swapped data

Conclusion

Virtual memory extends physical memory by using disk as a backing store for pages not currently in RAM. Demand paging loads pages only when accessed, so allocating 1 GB of virtual memory costs nothing until the pages are actually used. This enables systems to run workloads larger than installed RAM without crashing.

Page replacement algorithms decide which physical pages to evict when RAM fills up. True LRU is too expensive to implement, so Linux uses a Clock-style approximation built on active and inactive lists, which separates hot from cold pages using accessed bits. Thrashing occurs when working sets collectively exceed physical memory, collapsing throughput as the system spends more time swapping than executing.

The OOM killer intervenes when physical memory plus swap is exhausted, choosing a process to terminate. For latency-sensitive workloads, administrators often disable swap entirely to force fast failure rather than slow degradation. Encrypted swap protects sensitive data from anyone who obtains the disk after the system is powered down while pages reside in swap space.

For your next step, explore paging and page tables to understand the data structures that map virtual pages to physical frames, or process scheduling to see how the OS decides which runnable process gets CPU time when memory is not the bottleneck.
