Cache & Buffer Management

How the CPU cache hierarchy works, what cache coherence protocols do, and how the OS manages buffer cache and write policies in the page cache.

published: reading time: 36 min read author: GeekWorkBench

Cache & Buffer Management

Your CPU does not access RAM directly. Between the ALU and main memory sits a ladder of SRAM caches — the fastest storage in your system, measured in single-digit nanoseconds. But these caches introduce problems that do not exist in a simple flat memory model: coherence across multiple cores, replacement policies that determine what gets evicted, and the OS’s role in managing how disk blocks and memory-mapped files flow through the cache hierarchy.

Understanding cache and buffer management is the key to predicting performance bottlenecks, explaining why multi-threaded programs sometimes run slower than single-threaded ones, and making sense of I/O performance anomalies that defy naive analysis.

Introduction

When to Use / When Not to Use

Cache and buffer management is always running — it is built into the hardware and kernel. However, you make high-level choices about write policy and buffering that shape performance.

When understanding this helps:

  • Writing multi-threaded code that shares data (cache coherence overhead becomes the bottleneck)
  • Optimizing sequential I/O (write-back vs write-through dramatically changes throughput)
  • Database tuning (buffer pool size, checkpoint behavior, WAL writing)
  • Understanding O_DIRECT vs buffered I/O trade-offs
  • Analyzing performance profiles showing cache-miss-dominant hotspots

When this is less relevant:

  • Purely sequential, single-threaded workloads with ample working set fit in cache
  • Event-driven I/O-bound services (network polling, async I/O) where CPU caches are rarely the bottleneck

Architecture or Flow Diagram

The CPU cache hierarchy and the OS page cache interact through separate but coupled mechanisms. The diagram below shows the flow of a write through both hierarchies.

flowchart TD
    CPU["CPU Core 0<br/>Issues store to addr X"]
    L1D["L1 Data Cache<br/>64-byte lines<br/>Write-back policy"]
    L2["L2 Cache<br/>Inclusive<br/>256 KB - 1 MB"]
    L3["L3 Cache<br/>Shared<br/>8-64 MB per socket"]
    DRAM["DRAM"]
    PGFAULT["Page Cache<br/>4 KB pages<br/>mmap'd or buffered I/O"]
    DISK["Storage Device<br/>NVMe / HDD"]

    CPU --> L1D
    L1D -->|"Write-back<br/>not flushed"| L2
    L2 --> L3
    L3 --> DRAM
    DRAM --> PGFAULT
    PGFAULT -->|"dirty page<br/>periodic flush"| DISK

    subgraph Coherence["Cache Coherence (MESI)"]
        CPU1["CPU Core 1<br/>L1 Cache"]
        L1D -->|"MESI invalidation<br/>on write"| CPU1
    end

    style PGFAULT stroke:#ff00ff,stroke-width:2px
    style L1D stroke:#ff00ff,stroke-width:2px

On a write, the store goes to L1D (write-back). If the line is shared with another core’s L1 (MESI Shared state), the write issues an invalidation to all other caches before updating the line (MESI Modified state). The line is not written to DRAM or disk immediately — it sits in the cache until evicted or explicitly flushed.

Core Concepts

CPU Cache Architecture

Modern CPU caches are organized in a hierarchy that reflects the physics of wire delay and die area constraints. The L1 cache is split into instruction (L1 I) and data (L1 D) caches, each small (~32-48 KB per core) and fast (~4 cycles). The L2 cache is unified and larger (~256-512 KB per core, ~12 cycles). The L3 cache (also called LLC — Last Level Cache) is shared across all cores on the socket (~8-64 MB, ~20-40 cycles).

Cache sets and associativity: A cache is divided into sets (buckets), and each set has ways (slots for cache lines). A cache line maps to a specific set determined by its physical address. If a set has 8 ways (8-way set associative), it can hold 8 different cache lines for that set index before evicting one.

With a 64 KB L1 data cache and 64-byte cache lines:

  • Number of sets = 64 KB / 64 bytes / 8 ways = 128 sets
  • Set index = (address / 64) % 128

If 129 different addresses map to the same set, the 129th access evicts the oldest line regardless of recency — this is cache set thrashing.

Cache Coherence: MESI and Derivatives

When multiple CPU cores have caches, and a line is cached in more than one core’s L1, writes by one core must be visible to readers in other cores immediately — otherwise the program produces incorrect results. This is the cache coherence problem.

The MESI protocol (Modified, Exclusive, Shared, Invalid) is the canonical write-invalidate coherence protocol:

  • Modified (M): This cache has the only valid copy; memory is stale; the line is “owned” and will be written back on eviction
  • Exclusive (E): This cache has the only valid copy; memory is up-to-date; the line is clean
  • Shared (S): This cache and possibly others have a valid copy; memory is up-to-date; no writing allowed without invalidating others
  • Invalid (I): This cache does not have a valid copy of the line

MESI transitions on a write by core 0 to a line currently Shared:

  1. Core 0 sends invalidate to all other caches
  2. Other caches respond with acknowledge
  3. Core 0 upgrades the line to Modified
  4. All other caches mark their copies Invalid

This invalidation traffic on every write to a shared cache line is why false sharing — two unrelated variables on the same cache line, modified by different threads — collapses multi-core scaling.

Modern derivatives:

  • MOESI (AMD): Adds an Owned state where a core can provide data to another core without writing to memory first
  • MESI with Forward (Intel): Similar optimization; the core with Modified state supplies the line to requesters

Cache Replacement Policies

When a cache set is full and a new line needs to be placed, the cache must choose which line to evict. Common policies:

LRU (Least Recently Used): Evicts the least recently accessed line. Ideal but expensive to track — true LRU requires updating ordering on every access. Hardware caches typically use approximations.

Pseudo-LRU (PLRU): A tree-based approximation using binary flags. Faster to update than true LRU, with acceptable hit rate for most workloads. Used in many Intel and AMD L2/L3 caches.

Random: Evicts a random line. No tracking overhead, surprisingly competitive hit rate, used in ARM L2 caches in some configurations.

RRIP (Re-Reference Interval Prediction): Used in Intel Skylake+ L2/L3 caches. Each line has a ReReference Value (RRV). On insertion, lines get a high RRV (not recently used). On hit, they get low RRV. Eviction targets lines with highest RRV. This provides a form of dead block tracking — lines that have not been used recently take priority for eviction.

Buffer Cache and Page Cache

The page cache is the kernel’s main disk cache. When a file is read, the kernel checks if the requested blocks are already in the page cache. If so, the read is served from RAM — microseconds. If not, the kernel reads from disk and caches the pages for future reads.

The page cache is built on top of the buffer cache (which manages individual disk blocks) and the inode/dentry caches (which manage filesystem metadata). This multi-tier caching means that after reading a file once, subsequent reads are served from the page cache at DRAM speed.

Write-back (delayed write): By default, writes to files are stored in the page cache (marked dirty) and written to disk later. This batches multiple small writes into larger, more efficient I/O operations. The sync() system call forces all dirty pages to disk; the fsync() call forces a specific file’s dirty pages to disk.

Write-through: Some filesystems (or mount options like sync) write through the page cache directly to disk. This is slower but safer — a system crash does not lose the write. Databases typically use O_DIRECT to bypass the page cache entirely and write directly to disk, controlling their own durability guarantees.

Write Policies and Their Trade-offs

Write-back: Write to cache, mark line dirty. Only written to main memory on eviction. Fastest for repeated writes to the same location, but risks losing writes on crash until the dirty line is flushed.

Write-through: Write to cache and main memory simultaneously. Every store hits main memory, consuming bandwidth. Safe — memory always reflects the latest write. Used for write-combining memory regions (I/O devices, graphics RAM).

Write-combining: Buffers multiple writes and issues them as a burst. Used by GPU drivers and memory-mapped I/O regions where strict ordering is handled separately. Combines multiple small stores into one larger bus transaction.

Production Failure Scenarios

False Sharing Destroying Multi-Core Scalability

Failure: A thread-safe counter implemented as two volatile 64-bit counters on the same cache line (adjacent in memory), each updated by different threads, shows near-linear scaling degradation beyond 4 cores. The cache coherence protocol invalidates one core’s cache line on every increment of the other thread’s counter. Thread A writes to its counter, invalidating thread B’s cache line copy; thread B writes to its counter, invalidating thread A’s copy. Every cross-thread update generates a round-trip of invalidation traffic.

Mitigation: Align critical data to cache line boundaries (64 bytes). Use __attribute__((aligned(64))) in C or padding fields in Java. Use per-thread accumulators (thread-local counters that are merged periodically) to reduce cross-thread sharing. Profile with perf stat events: bus_trashes and l2_lines_out.all reveal coherence traffic.

Cache Stampede (Thrashing the Last-Level Cache)

Failure: A service that processes items from a working set larger than LLC by repeatedly scanning a large hash table. Each cache line that fits in the LLC is quickly evicted by the next access. Effective cache hit rate approaches zero; the workload becomes DRAM-bandwidth-bound, not compute-bound. Throughput does not improve with more CPU cores.

Mitigation: Use huge pages (2 MB) to reduce TLB pressure in addition to cache pressure. Redesign the data structure to have better spatial locality (one cache line per item instead of scattered across many pages). Consider partitioning the working set across cores and merging results rather than sharing a single structure.

Dirty Page Cache Pressure Causing Write Latency Spikes

Failure: A batch job that generates many small writes overwrites the page cache with dirty pages. When the dirty-write threshold (/proc/sys/vm/dirty_ratio, default 10% of total memory) is reached, the kernel blocks further writes to background writeback (pdflush/flush/kworker). The batch job stalls while the kernel writes accumulated pages to disk.

Mitigation: Call fsync() or use sync() periodically to acknowledge write durability. Use O_DIRECT for the batch writer if it manages its own durability (databases). Tune vm.dirty_ratio, vm.dirty_background_ratio, and vm.dirty_writeback_centisecs to match workload characteristics.

Trade-off Table

Write PolicyWrite LatencyRead Hit AdvantageCrash SafetyBandwidth Efficiency
Write-back cacheLow (hits cache)Excellent (line in cache)Poor (dirty line lost on crash)High (batched eviction)
Write-through cacheHigh (must write memory)GoodPartial (memory has latest)Low (every write to memory)
Write-combiningVery low (buffered)None (no caching)PoorVery high (burst)
Buffered I/O (page cache)Low (hitting page cache)ExcellentRequires sync/fsyncHigh
O_DIRECTHigh (bypasses cache)NoneUser-controlled (fdatasync)Low (no read caching)

Implementation Snippets

Detecting False Sharing in Practice (C + pthread)

#define _GNU_SOURCE
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <sys/time.h>

/* False sharing demonstration.
 * Two atomic counters on the same cache line (adjacent in memory)
 * vs. separated by a cache line (padding approach). */

#define ITERATIONS 100000000

/* Cache line size on x86/x64 is 64 bytes */
#define CACHE_LINE_SIZE 64

/* Method 1: Adjacent variables — same cache line, FALSE SHARING */
struct shared_counter_adjacent {
    atomic_long counter_a;  /* Written by thread 0 */
    atomic_long counter_b;  /* Written by thread 1 */
};

/* Method 2: Separated by cache line — NO false sharing */
struct shared_counter_separated {
    atomic_long counter_a;  /* Written by thread 0 */
} __attribute__((aligned(CACHE_LINE_SIZE)));

struct shared_counter_separated_b {
    atomic_long counter_b;  /* Written by thread 1 */
} __attribute__((aligned(CACHE_LINE_SIZE)));

static void *thread_func_adjacent(void *arg) {
    atomic_long *counter = (atomic_long *)arg;
    for (atomic_long i = 0; i < ITERATIONS; i++) {
        atomic_fetch_add(counter, 1);
    }
    return NULL;
}

static void *thread_func_adjacent_split(void *arg) {
    atomic_long *counter = (atomic_long *)arg;
    for (atomic_long i = 0; i < ITERATIONS; i++) {
        atomic_fetch_add(counter, 1);
    }
    return NULL;
}

double measure_throughput(const char *name, void *data_a, void *data_b,
                          void *(*thread_func)(void *)) {
    pthread_t t0, t1;
    struct timeval start, end;

    gettimeofday(&start, NULL);
    pthread_create(&t0, NULL, thread_func, data_a);
    pthread_create(&t1, NULL, thread_func, data_b);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    gettimeofday(&end, NULL);

    double elapsed = (end.tv_sec - start.tv_sec) +
                    (end.tv_usec - start.tv_usec) / 1e6;
    double ops_per_sec = (2.0 * ITERATIONS) / elapsed;
    printf("%s: %.2f million ops/sec (%.3f sec)\n", name, ops_per_sec / 1e6, elapsed);
    return ops_per_sec;
}

int main(void) {
    printf("=== False sharing detection ===\n");
    printf("Iterations per thread: %d\n", ITERATIONS);
    printf("Cache line size assumed: %d bytes\n\n", CACHE_LINE_SIZE);

    /* Method 1: Same cache line (adjacent) */
    struct shared_counter_adjacent adjacent = {0};
    measure_throughput("Adjacent (FALSE SHARING)", &adjacent.counter_a,
                       &adjacent.counter_b, thread_func_adjacent);

    /* Method 2: Cache-line-separated */
    struct shared_counter_separated counter_a_obj = {0};
    struct shared_counter_separated_b counter_b_obj = {0};
    measure_throughput("Separated (no false sharing)", &counter_a_obj.counter_a,
                       &counter_b_obj.counter_b, thread_func_adjacent_split);

    return 0;
}

Examining Page Cache Statistics (bash)

#!/bin/bash
# Page cache and buffer management monitoring

echo "=== Page Cache Statistics ==="
cat /proc/meminfo | grep -E "^(Buffers|Cached|Active|Inactive|Dirty|Writeback)"


echo ""
echo "=== Buffer Cache (block device) Statistics ==="
cat /proc/meminfo | grep -E "^( Buffers|Cached)"

echo ""
echo "=== Memory and Cache Details ==="
cat /proc/meminfo | grep -E "^(MemFree|MemAvailable|MemTotal|Shmem)"

echo ""
echo "=== Writeback Daemon Activity ==="
cat /proc/vmstat | grep -E "^(nr_dirty|nr_writeback|nr_writeback_temp|nr_pages_mapped)"

echo ""
echo "=== Per-CPU Cache Statistics (L1/L2/L3 hits) ==="
# Requires perf or likwid-tools
if command -v likwid &> /dev/null; then
    echo "Using likwid..."
    # likwid-topology -c 0 -M  # Uncomment if likwid is available
elif command -v perf &> /dev/null; then
    echo "Using perf stat (run as root for cache metrics)..."
    echo "perf stat -e cache-misses,cache-references, L1-dcache-load-misses ..."
fi

Observability Checklist

  • CPU cache hit/miss rates: perf stat -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-loads ./program
  • Cache coherence traffic: perf stat -e bus_transactions,bus_trashes,l2_lines_out.all
  • Page cache hit rate: cat /proc/vmstat | grep -E "^(pgpgin|pgpgout|pgactivate|deactivate)" — high pgactivate without corresponding pgdeactivate suggests working set not fitting
  • Dirty page accumulation: cat /proc/meminfo | grep Dirty and cat /proc/vmstat | grep nr_dirty
  • Writeback daemon activity: cat /proc/vmstat | grep -E "^(nr_writeback|writeback)" — persistent writeback activity indicates write-heavy workload
  • LLC (last-level cache) occupancy: perf stat -e LLC-references,LLC-misses ./program or likwid -M
  • TLB miss rate: perf stat -e dTLB-load-misses,dTLB-store-misses,iTLB-load-misses
  • Per-socket memory bandwidth: likwid -M -C 0 or intel_pcm_bandwidth
  • Cache line alignment check: perf c2c (cache-to-cache transfer analysis) to detect false sharing

Common Pitfalls / Anti-Patterns

Cache timing side-channel attacks: Spectre, Meltdown, and related variants exploit the timing difference between a cache hit and a cache miss. By measuring access time to a cache line, an attacker can infer whether another process accessed a particular line — leaking cryptographic keys, passwords, or any data that influences memory access patterns. Mitigations include:

  • IBRS/STIBP: Microcode patches that serialize branch prediction and address translation, closing the timing oracle
  • L1D flush on context switch: l1d_flush=on kernel parameter forces L1 cache flush on each context switch, reducing cross-process cache residue
  • SpecCtrl/IBRS: Kernel and hypervisor settings that isolate speculative execution windows

Cache coherence traffic as information leakage: Even without malicious timing attacks, cache coherence invalidation patterns can reveal which memory addresses another process is accessing — a form of reverse-engineering. In high-security environments, cache activity should be monitored for anomalies.

Huge pages and secure enclaves (Intel SGX): SGX enclaves have restricted cache access — enclave pages cannot be cached in the LLC (last-level cache) to prevent cross-enclave cache snooping. This means enclave workloads have higher cache miss rates than equivalent non-enclave workloads, which is a performance consideration for SGX-based applications.

Common Pitfalls / Anti-patterns

Pitfall: Assuming cache line size is 64 bytes on all architectures. Cache line sizes vary: ARM Cortex-A72 uses 64 bytes; some IBM POWER chips use 128 bytes; older embedded ARM may use 32 bytes. Code that relies on CACHE_LINE_SIZE must be defined per-architecture. On x86 and ARM64, 64 bytes is correct. If you use __attribute__((aligned(64))) on an ARM with 128-byte lines, you still get false sharing — the compiler can only align to what you specify.

Pitfall: Over-padding in multi-threaded data structures. Adding a full cache line of padding between every field in a shared structure wastes memory bandwidth and cache capacity. Padding should be added only between fields that are written by different threads. Two fields that are always accessed by the same thread can share a cache line without false sharing — the coherence protocol does not invalidate on reads.

Pitfall: Ignoring the write-combining buffer in I/O-heavy code. Memory-mapped I/O regions (device registers) use write-combining, not cache. A store to an MMIO address does not update a cache line — it goes to a write-combining buffer. If you write multiple values to the same MMIO register in quick succession, each write may be dropped if the buffer is not flushed. Use wmb() (write memory barrier) or mmio_write() functions that handle the barrier.

Anti-pattern: Disabling cache entirely with uncached memory accesses. Setting the MTRR (Memory Type Range Register) to uncached (UC) for a memory region forces all accesses to bypass the cache. This is occasionally necessary for hardware registers but devastating for performance on any frequently-accessed data. Use write-combining (WC) or write-back (WB) instead.

Quick Recap Checklist

  • CPU cache hierarchy (L1/L2/L3) provides 10-100x faster access than DRAM; cache lines are 64 bytes on x86/ARM
  • MESI protocol maintains cache coherence across cores: Modified, Exclusive, Shared, Invalid states govern when writes must invalidate other copies
  • False sharing occurs when two unrelated variables on the same cache line are written by different threads, triggering constant invalidation traffic
  • Cache replacement policies (LRU approximations, RRIP) determine which line to evict when a set is full
  • The page cache buffers disk I/O; write-back delays writes to disk for batching, while write-through writes immediately
  • O_DIRECT bypasses the page cache for user-controlled durability; databases use this to avoid double-buffering
  • perf stat -e cache-misses,cache-references reveals how well your workload utilizes the cache hierarchy
  • Huge pages (2 MB) improve cache efficiency by reducing TLB pressure and page table overhead
  • Dirty page thresholds (vm.dirty_ratio) control when the kernel blocks writers to flush dirty pages to disk
  • Cache timing side channels (Spectre/Meltdown) exploit the timing difference between cache hits and misses to leak data
  • Tools: perf stat (cache events), perf c2c (false sharing detection), slabtop (kernel slab cache stats)

Interview Questions

1. What is false sharing and why does it destroy multi-core scalability?

False sharing occurs when two threads modify variables that happen to reside on the same cache line but are otherwise logically unrelated. The cache coherence protocol (MESI) invalidates each core's copy of the cache line when the other core writes. If thread A on core 0 writes to variable X and thread B on core 1 writes to variable Y, and both variables occupy the same 64-byte cache line, every cross-thread write generates an invalidation round-trip between core 0 and core 1. The cores spend their time propagating invalidations rather than doing useful work.

Detection: Profile with perf c2c (cache-to-cache transfer analysis) which shows lines with high cross-socket or cross-core traffic. Look for bus_trashes or l2_lines_out.all events in perf stat. Fixing it: pad critical data structures with __attribute__((aligned(64))) so each variable occupies its own cache line, or use per-thread accumulators that are merged only at synchronization points.

2. What is the MESI cache coherence protocol and how does it handle a write to a shared cache line?

MESI governs how multiple cores keep their caches consistent with each other and with main memory. Each cache line lives in one of four states: Modified (M — this core has the only valid copy, memory is stale), Exclusive (E — this core has the only valid copy and memory is up-to-date), Shared (S — this core and possibly others have valid copies), or Invalid (I — this core has no valid copy).

When core 0 writes to a line in Shared state, it cannot simply update its copy — it must first ensure no other core has a valid copy. Core 0 broadcasts an invalidate message to all other cores. Each other core marks its copy Invalid and sends an acknowledgment. Once core 0 receives all acknowledgments, it upgrades its line to Modified and proceeds with the store. Any other core that later needs this line will find it Invalid in its cache and issue a read request — core 0 will respond with the modified data (or write it back to memory first).

3. What is the difference between write-back and write-through cache policies?

Write-back (also called write-behind) stores data in the cache on write and marks the line as dirty. The line is written back to main memory only when it is evicted or explicitly flushed. This is the default for L1/L2/L3 data caches on modern CPUs. It minimizes memory bandwidth consumption — if the same location is written multiple times before eviction, only one write to memory occurs.

Write-through updates both the cache and main memory simultaneously on every store. Memory always reflects the latest value. This is safer — a crash does not lose writes — but consumes significantly more memory bandwidth. Write-through is used for write-combining regions (memory-mapped I/O, GPU VRAM) where the hardware handles ordering separately.

The OS page cache uses a form of write-back: dirty pages accumulate in the page cache and are flushed to disk by background writeback daemons at intervals. The fsync() system call forces a specific file's dirty pages to disk immediately, giving the application control over durability.

4. What is the role of the page cache vs the CPU cache in the overall memory hierarchy?

CPU caches (L1/L2/L3) are hardware-managed, automatic storage between the CPU and main memory. They operate transparently on every memory access — your code never calls a cache API. They cache physical memory (DRAM) contents and are organized by physical address.

The page cache is a kernel-managed disk cache that sits between the filesystem and the disk device. It caches disk blocks in RAM, serving reads from memory instead of disk. It is managed at the granularity of 4 KB pages and is populated explicitly through read/write system calls or mmap(). It operates on virtual addresses (file offsets mapped to memory pages).

When you read a file, the page cache serves the data. When you read a memory location, the CPU cache serves it. These caches are independent — a page can be in the page cache (disk-backed) and also be in the CPU cache (if a process has mapped that memory). They serve different purposes: CPU caches accelerate CPU-memory bandwidth; the page cache accelerates disk-memory bandwidth.

5. What is cache set thrashing and how do you diagnose it?

Cache set thrashing occurs when a workload's working set maps to the same cache set index repeatedly, causing constant eviction even though total cache usage is far below capacity. Each cache set has a fixed number of ways (slots). If your working set has more than N addresses that map to the same set (where N is the associativity), the (N+1)th line evicts the first — regardless of recency. Diagnosing it: Profile with perf stat -e l2_lines_in.all,l2_lines_out.all,l2_reject.l2_hits. High l2_lines_out.all with low l2_lines_in.all suggests the L2 cache is thrashing. Mitigation: change the data layout (different stride, power-of-two plus one for hash table probing), increase cache associativity (not possible in software), or use huge pages to reduce TLB pressure and improve spatial locality.

6. What is the difference between a write-through and write-back cache at the CPU cache level?

Write-back caches store data in the cache on write and mark the line as dirty. The line is written back to main memory only when it is evicted or explicitly flushed. This is the default for L1/L2/L3 data caches on modern CPUs and is highly efficient for repeated writes to the same location — only one DRAM write occurs per dirty line regardless of how many times it was written before eviction. Write-through caches write to both the cache and main memory simultaneously on every store. Memory always holds the latest value, but every store consumes memory bandwidth. L1 instruction caches are typically write-through; L1 data caches are write-back. Write-through is safer (no writes lost on crash) but wastes bandwidth. Write-back is faster but risks losing writes if the dirty line is never flushed before a crash.

7. How does the page cache differ from the CPU cache in terms of management and purpose?

CPU caches (L1/L2/L3) are fully hardware-managed — the CPU automatically loads data into caches, evicts lines based on hardware replacement policies (PLRU, RRIP), and maintains coherence via MESI. Software never explicitly calls a cache API. The page cache, by contrast, is software-managed by the kernel. When you read a file via read() or mmap(), the kernel checks if the data is already in the page cache; if not, it reads from disk and adds it to the page cache. The kernel's page replacement algorithm (Clock-Pro) decides which pages to evict. CPU caches operate on physical addresses; the page cache operates on virtual addresses (file offsets mapped to pages). CPU caches store any memory content; the page cache specifically stores disk block contents. These are completely independent caches — the same data can exist in both simultaneously.

8. What is write-combining and when is it used?

Write-combining is a memory optimization that buffers multiple small stores into a single larger bus transaction. It is used for memory-mapped I/O (MMIO) regions — device registers that do not behave like normal DRAM. When you write to an MMIO address, the store does not update a cache line; it goes to a write-combining buffer. Multiple writes to the same MMIO register accumulate in the buffer; the hardware eventually flushes them as a burst to the device. If you write multiple values to different registers in quick succession, each may be dropped if the buffer is not flushed between them. Use wmb() (write memory barrier) or mmio_write() functions that handle the barrier. GPUs and high-performance network cards use write-combining to batch writes to VRAM or ring buffers. Regular DRAM accesses should use write-back caching, not write-combining.

9. What is the difference between O_DIRECT and buffered I/O and when would you choose each?

Buffered I/O (the default, e.g., fopen, fread, write) uses the OS page cache — data is copied from the user buffer to a kernel page cache page, then written to disk by the writeback daemon at a later time. This batches small writes, improves performance for repeated writes to the same region, and allows the kernel to manage durability (fsync). However, it double-buffers: the data is in both the user buffer and the page cache simultaneously, consuming extra memory.

O_DIRECT bypasses the page cache entirely — data goes directly from the user buffer to the disk (or RAID controller) without being stored in the page cache. This eliminates double-buffering, giving databases full control over their own caching and durability (via fdatasync). The tradeoff is that every read is a disk read; every write is a disk write — no batching, no readahead, no write-behind. Databases (PostgreSQL InnoDB, Oracle) use O_DIRECT to avoid the page cache interfering with their own buffer pool management and to prevent double-buffering.

10. What is the MESI protocol's Modified state and why does it matter for cache coherence?

In the MESI protocol, the Modified state (M) means this cache has the only valid copy of a cache line AND the memory copy is stale (out of date). This core "owns" the line and is responsible for writing it back to memory before any other core can read it. When another core issues a read request for that address, the owner must supply the data directly (not from memory) before the line can be transferred — or write it back to memory first. The Modified state ensures that at most one core has a writable (dirty) copy at any time, maintaining write atomicity for cache lines. If the owner evicts the line, it must write it back to memory first, making the line available to other cores. The M state is critical for ensuring that coherent writes are properly serialized and no writes are lost during cache line transfers.

11. What is the difference between inclusive and exclusive L3 cache hierarchies?

In an inclusive cache hierarchy (Intel typically), every line present in L1/L2 is also present in L3. This means L3 serves as a backup for data that has been evicted from L1/L2 — if a line was in L1 and got displaced, it might still be in L3 and can be retrieved without going to DRAM. The cost is that L3 capacity includes L1/L2 content, so effective capacity is less than the stated L3 size. In an exclusive cache hierarchy (AMD Zen), L3 does NOT duplicate L1/L2 content — when a line is evicted from L1, it goes to L3 if there's room; if L3 is full, something from L3 is evicted to make room. This means total effective cache capacity = L1 + L2 + L3 (more than an inclusive hierarchy), but losing L1/L2 content requires fetching from L3 or DRAM. Exclusive hierarchies have higher hit rates for workloads with good temporal locality across L1/L2 evictions.

12. What is cache line alignment and why does it matter for data structure design?

Cache line alignment means designing data structures so that frequently accessed fields fall within a single 64-byte cache line. Misaligned data — a field that straddles two cache lines — requires loading two cache lines instead of one, doubling memory bandwidth consumption for that access. In multi-threaded code, alignment also matters for preventing false sharing: if two fields written by different threads happen to fall in the same cache line, every write by one thread invalidates the other's copy. Using __attribute__((aligned(64))) (C) or padding fields ensures each hot field occupies its own cache line. The tradeoff is increased memory usage — if every field is padded to 64 bytes, a struct that could fit in one cache line expands to many.

13. What is a memory fence (memory barrier) and why is it needed in multi-threaded code?

A memory fence (memory barrier) is an instruction that enforces ordering of memory operations around it. Without fences, the compiler and CPU can reorder loads and stores for optimization — the CPU can issue a store before a preceding load if the addresses don't overlap. In multi-threaded code, this can cause a thread to observe stale data or a partially-constructed object. For example, a thread might observe a pointer to a newly allocated object before the constructor's stores are visible. A mfence (x86) or dmb (ARM) ensures all loads/stores before the barrier complete before any loads/stores after the barrier start. In C11/C++11, std::atomic with memory_order_seq_cst provides sequential consistency including implicit barriers. In the Linux kernel, smp_mb(), wmb(), and rmb() provide explicit ordering for SMP code.

14. What is the difference between perf c2c and perf stat for cache analysis?

perf stat provides aggregate counts: total cache misses, total cache references, cache miss rate. It tells you "you have X% cache miss rate" but not "which data structure is causing the misses." perf c2c (cache-to-cache transfer analysis) identifies specific cache lines that are being transferred between cores — i.e., lines where one core wrote and another core read, generating coherence traffic. It shows the offsets within cache lines, the virtual addresses, the sharing nodes, and the number of hits and misses per line. This is the definitive tool for detecting false sharing: if two threads are fighting over a cache line but not actually sharing the data, perf c2c will show a high transfer count for that line with different instruction addresses from each thread.

15. What is the performance impact of cache coherency traffic in NUMA systems?

In a NUMA system, cache coherency traffic between cores on different sockets must cross the inter-socket interconnect (Intel QPI/UPI, AMD Infinity Fabric). A write by a core on socket 0 to a cache line that is cached by a core on socket 1 generates: (1) an invalidate from socket 0 to socket 1, (2) an ack from socket 1 back to socket 0, (3) the data transfer of the modified line from socket 0 to socket 1 (or a snoop filter lookup to locate it). All of this crosses the interconnect at memory bandwidth cost. In a 4-socket system, the coherency traffic for a hot cache line accessed by all sockets can saturate the inter-socket links and become the bottleneck, not compute or memory bandwidth. This is why NUMA-aware applications that share data across threads prefer local memory and avoid cross-socket data sharing. The snoop filter in modern CPUs optimizes this by tracking which sockets have copies of which lines.

16. What is a cache way and how does associativity affect cache hit rate?

A cache way is one slot (set-associative slot) within a cache set. A 4-way set associative cache has 4 ways per set — 4 different cache lines can be cached simultaneously for each set index before eviction occurs. Higher associativity reduces conflict misses (evictions due to set collision) at the cost of more complex hardware for the replacement policy (must track LRU or approximation across ways). A direct-mapped cache (1-way) has the fastest lookup but the highest collision rate — two addresses that map to the same set will always evict each other. An 8-way or 16-way cache behaves nearly as well as fully associative (any line can be in any way) but with a simpler, faster lookup than true fully associative (which requires comparing all tags simultaneously). Most L1/L2 caches are 8-way; L3 caches are often 12- or 16-way. The performance difference between 8-way and 16-way is typically small (<5%) for most workloads.

17. What is the difference between hardware cache coherence and software cache coherence?

Hardware cache coherence (MESI, MOESI, etc.) is implemented by the CPU hardware and the cache coherence protocol — it operates automatically and transparently on every memory access. The hardware snoops on all bus transactions and updates cache line states without software involvement. Software cache coherence (used in some embedded and accelerator architectures) requires the software to explicitly manage which caches hold valid data — the hardware provides no automatic invalidation. Software coherence is sometimes used in GPUs, DSPs, and custom accelerators where the hardware overhead of a full MESI protocol would be too expensive. Programmers or runtime systems must issue explicit invalidate/flush operations when data is shared. Hardware coherence is standard on all general-purpose CPUs (x86, ARM, RISC-V) because the alternative (manual software coherence) is too error-prone for general-purpose computing.

18. What is the dirty bit in a page table entry and why is it important for the page cache?

The dirty bit (D, bit 6 in a PTE on x86) is set by the CPU hardware whenever a write to that page occurs. It indicates that the page's content has been modified relative to its backing store (the file on disk or the swap partition). Before evicting a dirty page, the kernel must write it back to its backing store — otherwise the modifications are lost. Clean pages (dirty bit = 0) can be evicted simply by removing the page table mapping, because the backing store already has the correct content. In the page cache, dirty pages are tracked by nr_dirty in /proc/vmstat and are periodically flushed to disk by the writeback daemon (pdflush/flush). Calling fsync() forces all dirty pages for a specific file to be written immediately and the dirty bit cleared. This is how databases achieve durability: they call fdatasync() to ensure all modified data reaches the disk surface.

19. What is the performance difference between a cache hit and a TLB hit when both hit in the same access?

A TLB hit and a cache hit are independent but sequential in the memory access path: the virtual address goes to the TLB first (1 cycle) for translation, then the physical address goes to the cache. A TLB hit that also hits in L1 cache: ~4-5 cycles total (TLB + L1). A TLB miss that hits in L1 cache: 4 (TLB lookup) + 12 (L2 for page table) + 4 (L1 access) ≈ 20-25 cycles if the page table pages are in L2. A TLB hit that misses in L1 and hits in L2: 1 + 12 ≈ 13 cycles. The key insight is that even with a TLB hit, if the data misses in L1, the access still takes L2 or L3 latency. Modern out-of-order cores can hide some of this latency by running other instructions, but the critical path of dependent loads still stalls. For hot data that is both TLB-resident and in L1, you get maximum throughput.

20. How does O_DIRECT interact with the page cache and what are the trade-offs compared to buffered I/O?

O_DIRECT bypasses the page cache entirely — data is transferred directly from the user buffer to the storage device (or RAID controller) without passing through kernel page cache pages. This eliminates double-buffering: the same data is not held simultaneously in the user buffer and a kernel page, reducing memory consumption. Databases use O_DIRECT to manage their own buffer pools (e.g., InnoDB's buffer pool) without interference from the OS page cache — the database's own cache replacement policy may be superior to Linux's Clock-Pro for the database's access patterns. The trade-offs: every read is a disk read, every write is a disk write — there is no read-ahead, no write-behind batching, and no page cache hits for recently read data. fsync() or fdatasync() must be called explicitly to ensure durability. On systems without a battery-backed write cache, O_DIRECT writes may actually be slower because the device must commit to disk before returning, whereas buffered I/O benefits from the OS's background flush daemon batching dirty pages. O_DIRECT also requires aligned buffers (512-byte or 4 KB alignment depending on the filesystem) — misaligned buffers cause extra copies or failures.

Further Reading

MESI Protocol Variants

MOESI (AMD) — adds the Owned state:

  • Owned: this cache has the only valid copy AND memory is stale; this core can supply data to requesters without writing to memory first
  • Reduces memory bandwidth by eliminating unnecessary writes-back when sharing data

MESI with Forward (Intel) — similar to MOESI but the line is written to memory before being supplied to other cores

Dragon Protocol (research) — used in some ARM implementations:

  • Uses update (write-broadcast) instead of invalidate
  • Better for workloads that read more than write to shared lines
  • Higher bandwidth cost for writes but fewer invalidation storms

Cache Replacement Policy Deep Dive

LRU approximations:

  • True LRU: O(n) per access for n lines in a set — used only in small, software-managed structures
  • PLRU (Pseudo-LRU): Tree-based, ~3 comparisons per access, used in many Intel L2/L3 caches
  • RRIP (Re-Reference Interval Prediction): Intel Skylake+ L2/L3 — each line has an N-bit counter; recently used lines have low values, idle lines have high values; eviction targets highest counter

Thrashing-resistant policies:

  • SIP (Set-dueling Predictive): Some sets use LRU, others use RRIP; the policy that performs better wins for the whole cache
  • MLP-aware replacement: Lines accessed in a memory-level-parallel manner get higher priority for retention

Page Cache and Buffer Cache Relationship

The page cache is the primary disk cache in modern Linux. The buffer cache (legacy, pre-2.6) managed individual disk blocks and was eventually subsumed by the page cache.

LayerUnitPurpose
Page cache4 KB pagesCaches file content (page-mapped files)
Buffer cache512 B - 4 KB blocksCaches metadata and raw block device I/O
dentry cacheVFS structuresCaches directory entry lookups
inode cacheFilesystem metadataCaches inode structures

free command shows -/+ buffers/cache line which combines page cache + buffer cache + slab reclaimable.

Key Takeaways

  • CPU caches (L1/L2/L3) are hardware-managed transparently; page cache is kernel-managed for disk I/O
  • MESI (Modified, Exclusive, Shared, Invalid) maintains coherence across cores — writes trigger invalidations
  • False sharing (unrelated variables on the same cache line, written by different threads) collapses multi-core scaling
  • Cache replacement policies (PLRU, RRIP) approximate LRU without per-access overhead
  • Page cache write-back batches writes to disk; fsync() forces durability for specific files
  • O_DIRECT bypasses the page cache for user-controlled durability and to avoid double-buffering
  • perf c2c detects false sharing; perf stat reveals cache hit/miss rates

Conclusion

Cache and buffer management represents the invisible infrastructure that makes modern computing fast. The CPU cache hierarchy — L1/L2/L3 — operates automatically on every memory access, applying the MESI coherence protocol to maintain consistency across cores. What appears to your code as a simple memory read actually involves cache line lookups, potential invalidation traffic, and sophisticated replacement policies like RRIP.

The page cache extends this thinking to disk I/O, batching writes and serving reads from RAM. Write-back delivers maximum throughput but requires explicit fsync calls for durability. O_DIRECT bypasses the page cache entirely, giving databases control over their own buffering — essential for applications that cannot afford the double-buffering penalty of both the OS page cache and their own in-memory structures.

For deeper exploration, study how memory-mapped I/O interacts with write-combining buffers, why SGX enclaves have higher cache miss rates, and how the L1D flush on context switch mitigates cross-process cache residue. The performance implications of cache behavior become most visible in NUMA systems, where cross-socket memory access latencies dwarf local cache hit times.

Category

Related Posts

ASLR & Stack Protection

Address Space Layout Randomization, stack canaries, and exploit mitigation techniques

#operating-systems #aslr-stack-protection #computer-science

Assembly Language Basics: Writing Code the CPU Understands

Learn to read and write simple programs in x86 and ARM assembly, understanding registers, instructions, and the art of thinking in low-level operations.

#operating-systems #assembly-language-basics #computer-science

Boolean Logic & Gates

Understanding AND, OR, NOT gates and how they combine into arithmetic logic units — the building blocks of every processor.

#operating-systems #boolean-logic-gates #computer-science