Memory Allocation: Kernel Allocators, Slab, Buddy System
How the Linux kernel allocates memory internally — from the slab allocator and buddy system to memory zones and the subtle differences between kmalloc and vmalloc.
User-space programs call malloc and free without ever thinking about who fulfills the request. The kernel faces the same problem — it needs to allocate and free memory constantly, for structures as small as a task_struct (a few KB) and as large as a kernel buffer (many MB). But the kernel cannot use the same allocator as user programs. It runs in a privileged context where page faults are catastrophic, memory fragmentation is permanent (no virtual memory backing for the kernel heap), and allocation latency directly affects system performance.
Linux solves this with a layered allocator architecture: the buddy system at the low end handles physical page allocation, and the slab allocator sits above it to satisfy small, frequent kernel allocations efficiently.
Introduction
When to Use / When Not to Use
Kernel memory allocators are used exclusively by the kernel and kernel modules — user-space code does not call kmalloc or alloc_pages(). However, understanding these allocators helps in several scenarios:
When this knowledge is relevant:
- Writing Linux kernel modules or device drivers
- Debugging kernel OOM conditions or memory leaks
- Reading kernel crash dumps (oops/panic output showing slab cache names)
- Performance tuning with vmstat, /proc/slabinfo, slabtop
- Operating system course projects
When to use user-space equivalents:
- malloc/free for C applications in user space
- mmap, brk for larger or memory-mapped allocations
- High-level language allocators (Go’s make/new, Python’s garbage collector) for managed runtime languages
Architecture or Flow Diagram
The kernel’s memory allocation stack flows from high-level requests (kmalloc, vmalloc) down to the buddy system’s physical page allocator, with per-CPU caches and memory zones in between.
flowchart TD
KMALLOC["kmalloc()<br/>~8 bytes to 128 KB"]
VMALLOC["vmalloc()<br/>Arbitrary size, non-contiguous"]
SLAB["Slab Allocator<br/>Object caches (task_struct, etc.)"]
SLUB["SLUB Allocator<br/>Default Linux slab implementation"]
SLOB["SLOB Allocator<br/>Embedded/small system allocator"]
PAGE["alloc_pages()<br/>Buddy System<br/>Physical page allocation"]
ZONES["Memory Zones<br/>DMA, Normal, HighMem"]
PHYMEM["Physical Memory<br/>DRAM"]
KMALLOC --> SLAB
VMALLOC --> PAGE
SLAB -->|"default implementation"| SLUB
SLAB -->|"embedded systems"| SLOB
SLUB --> PAGE
SLOB --> PAGE
PAGE --> ZONES
ZONES --> PHYMEM
style KMALLOC stroke:#ff00ff,stroke-width:2px
style VMALLOC stroke:#000,stroke-width:1px
style PAGE stroke:#ff00ff,stroke-width:2px
The DMA zone (first 16 MB on x86) exists for devices that cannot address full physical memory. The Normal zone (16 MB to 896 MB on 32-bit x86) is directly mapped. The HighMem zone (above 896 MB on 32-bit x86) requires explicit mapping before use; 64-bit systems do not need HighMem because the kernel can direct-map all physical memory.
Core Concepts
The Buddy System Allocator
The buddy system is the foundational physical page allocator in Linux. It maintains a free list for each order (a power-of-two page count): order 0 = 1 page (4 KB), order 1 = 2 pages (8 KB), order 2 = 4 pages (16 KB), and so on up to the maximum order, which is 10 on a default x86 configuration (1024 pages, or 4 MB with 4 KB pages).
When a request for n pages comes in:
- Round n up to the next power of two to get the order
- Check the free list for that order
- If a block is available, remove it from the list and return it
- If no block is available, take a block from the next higher order that has one, split it in half repeatedly, and put each unused half on the free list for its order until a block of the requested size remains
The “buddy” is the adjacent half of the split block — when a block is freed, the allocator checks if its buddy is also free and, if so, coalesces them back into a larger block. This coalescing is what gives the buddy system its name and its resistance to external fragmentation.
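The order and buddy arithmetic described above can be sketched in user-space C. This is a minimal illustration, not kernel code: the names pages_to_order, buddy_pfn, merged_pfn, and MAX_ORDER_SKETCH are hypothetical, and a maximum order of 10 is assumed. The XOR step mirrors the idea that two buddies differ only in the bit that corresponds to their order.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for this sketch: the largest block is order 10 (1024 pages). */
#define MAX_ORDER_SKETCH 10

/* Smallest order whose block (2^order pages) can hold n pages.
 * Returns -1 when the request exceeds the largest block. */
int pages_to_order(size_t n)
{
    int order = 0;
    size_t block = 1;

    assert(n > 0);
    while (block < n) {
        if (order == MAX_ORDER_SKETCH)
            return -1;
        block <<= 1;
        order++;
    }
    return order;
}

/* Page frame number of the buddy of the block that starts at pfn.
 * Flipping bit 'order' selects the other half of the parent block. */
unsigned long buddy_pfn(unsigned long pfn, int order)
{
    return pfn ^ (1UL << order);
}

/* Start of the order+1 block produced when a block and its buddy coalesce:
 * the lower of the two start addresses. */
unsigned long merged_pfn(unsigned long pfn, int order)
{
    return pfn & ~(1UL << order);
}
```

For example, a 3-page request maps to order 2, and the order-2 block starting at page frame 12 has its buddy at page frame 8; if both are free they merge into the order-3 block starting at 8.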
The buddy system operates at the page level. Requests for arbitrary byte counts (like kmalloc) cannot be served directly by the buddy system — they need an intermediate allocator that carves page-sized chunks into smaller objects.
Slab Allocator: Motivations
The kernel allocates and frees many objects of the same type repeatedly: task_struct when a process is created, struct file when a file is opened, struct dentry for each directory entry, struct buffer_head for block I/O buffers. Allocating and freeing these through the buddy system would be prohibitively expensive:
- Fragmentation: An object of about 2 KB allocated from an 8 KB buddy block leaves about 6 KB wasted
- Cache effects: The buddy system aligns whole blocks to page boundaries but has no notion of placing small objects on cache-line boundaries, which critical kernel data structures benefit from
- Latency: The buddy system requires searching free lists, potentially triggering page allocation from the zone allocator
The slab allocator solves this by maintaining per-type caches (called slab caches). Each cache holds objects of one type. When a cache needs objects, it obtains pages from the buddy system and carves them into equal-sized objects. When an object is freed, it is returned to the cache (not the buddy system) — the next allocation of the same type reuses the cached object without touching the buddy system.
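The reuse behaviour described above can be sketched as a tiny user-space object cache. This is an illustration under stated assumptions rather than how SLUB is implemented: struct obj_cache and the cache_* functions are hypothetical names, malloc stands in for the buddy system, one "slab" is a single 4 KB page, and the cache is capped at 16 pages. Free objects are linked through their own first bytes, so the free list needs no extra memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096
#define SKETCH_MAX_PAGES 16   /* assumed cap, keeps the sketch simple */

struct obj_cache {
    size_t obj_size;                 /* every object in the cache has this size */
    void *freelist;                  /* LIFO list threaded through free objects */
    void *pages[SKETCH_MAX_PAGES];   /* backing pages, kept so they can be released */
    size_t npages;
};

void cache_init(struct obj_cache *c, size_t obj_size)
{
    /* A free object must be able to hold an aligned next-pointer. */
    assert(obj_size >= sizeof(void *));
    assert(obj_size % sizeof(void *) == 0);
    assert(obj_size <= SKETCH_PAGE_SIZE);
    c->obj_size = obj_size;
    c->freelist = NULL;
    c->npages = 0;
}

/* Obtain one page and carve it into equal-sized objects on the free list. */
static int cache_grow(struct obj_cache *c)
{
    char *page;
    size_t off;

    if (c->npages == SKETCH_MAX_PAGES)
        return -1;
    page = malloc(SKETCH_PAGE_SIZE);   /* stand-in for a buddy-system page */
    if (!page)
        return -1;
    c->pages[c->npages++] = page;
    for (off = 0; off + c->obj_size <= SKETCH_PAGE_SIZE; off += c->obj_size) {
        *(void **)(page + off) = c->freelist;
        c->freelist = page + off;
    }
    return 0;
}

/* Pop a free object; only touches the page source when the list is empty. */
void *cache_alloc(struct obj_cache *c)
{
    void *obj;

    if (!c->freelist && cache_grow(c) != 0)
        return NULL;
    obj = c->freelist;
    c->freelist = *(void **)obj;
    return obj;
}

/* Return an object to the cache, not to the page source. */
void cache_free(struct obj_cache *c, void *obj)
{
    *(void **)obj = c->freelist;
    c->freelist = obj;
}

/* Release all backing pages; outstanding objects become invalid. */
void cache_destroy(struct obj_cache *c)
{
    while (c->npages > 0)
        free(c->pages[--c->npages]);
    c->freelist = NULL;
}
```

Allocating, freeing, and allocating again returns the same address without requesting a new page, which is the effect the slab layer relies on for frequently recycled kernel objects.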
SLUB: The Default Slab Allocator
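Linux has had three slab implementations (SLOB was removed in 6.4 and the original SLAB in 6.8, leaving SLUB as the only one in current kernels):
- Slab (original): Heavier per-CPU and per-node queue overhead on large systems, but good debugging support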
- SLUB (Unqueued Slab Allocator, default since 2.6.23): Simplified, better performance, excellent for large NUMA systems
- SLOB (Simple List of Blocks): Minimal allocator for embedded systems with very limited memory
SLUB is the default on all desktop and server kernels. It removes the per-CPU queues of Slab and uses page structs directly, reducing overhead. It is highly NUMA-aware, distributing slab caches across memory nodes.
Key SLUB concepts:
- Partial slabs: A page with some objects allocated and some free
- Per-CPU freelists: Each CPU has a private list of free objects, eliminating locking for the common case
- Object alignment: Caches created with SLAB_HWCACHE_ALIGN place objects on cache-line boundaries, which reduces false sharing; other caches only guarantee the alignment the object type needs
kmalloc vs vmalloc
kmalloc() allocates from the kernel’s direct-mapped linear address range (the Normal zone on 32-bit x86). Addresses returned by kmalloc are contiguous in both physical and virtual memory. The maximum size is KMALLOC_MAX_SIZE: historically 128 KB (an order-5 block of 32 pages on x86) and typically 4 MB on current x86 kernels, although large requests become increasingly likely to fail as memory fragments. It is fast because small requests are served from slab caches of fixed object sizes.
vmalloc() allocates from the virtual address space reserved for vmalloc (VMALLOC_START to VMALLOC_END on x86-64). The allocated regions are contiguous in virtual address space but may be fragmented across multiple non-contiguous physical pages. This is necessary for allocating large buffers (multi-page) that do not need physical contiguity.
| Property | kmalloc | vmalloc |
|---|---|---|
| Physical memory | Contiguous | Non-contiguous (scatter-gather) |
| Virtual address space | Contiguous | Contiguous |
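| Maximum allocation | KMALLOC_MAX_SIZE (historically 128 KB; typically 4 MB on current x86) | Limited by the vmalloc area (128 MB by default on 32-bit x86, tunable; 32 TB on x86-64) |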
| Latency | Low | Higher (page table updates needed) |
| Use case | Small, frequent kernel objects | Large buffers (module code, large I/O) |
| Fault context | Safe (directly mapped) | Not safe from interrupt context (may sleep) |
Memory Zones
Linux divides physical memory into zones, each serving different purposes:
ZONE_DMA (first 16 MB on x86): Required for ISA DMA, that is, legacy devices that can only address the first 16 MB of RAM. The zone is small and easily exhausted, so it should be used only when a device actually requires it.
ZONE_NORMAL (16 MB to 896 MB on 32-bit x86): Directly mapped to the kernel’s linear address space. Allocations here are the fastest — no special mapping needed. Most kernel allocations come from this zone.
ZONE_HIGHMEM (above 896 MB on 32-bit x86): Not directly mapped. Pages must be explicitly mapped (using kmap/kunmap) before the kernel can access them. 64-bit systems typically have no HIGHMEM because they can map the entire physical address space.
On NUMA systems, each node has its own set of zones. The kernel prefers allocating from the node local to the CPU making the request, falling back to remote nodes when local memory is exhausted.
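The 32-bit x86 zone boundaries above amount to a simple range check on the physical address. The sketch below is illustrative only: enum zone_sketch and zone_for_phys_addr_x86_32 are hypothetical names, and the 16 MB and 896 MB boundaries are taken from the text rather than read from a running kernel.

```c
#include <stdint.h>

/* Hypothetical zone labels for this sketch (not the kernel's enum zone_type). */
enum zone_sketch {
    ZONE_SK_DMA,
    ZONE_SK_NORMAL,
    ZONE_SK_HIGHMEM
};

#define SKETCH_MIB (1024ULL * 1024ULL)

/* Classify a physical address using the 32-bit x86 boundaries from the text. */
enum zone_sketch zone_for_phys_addr_x86_32(uint64_t phys)
{
    if (phys < 16 * SKETCH_MIB)
        return ZONE_SK_DMA;       /* reachable by legacy ISA DMA */
    if (phys < 896 * SKETCH_MIB)
        return ZONE_SK_NORMAL;    /* permanently direct-mapped */
    return ZONE_SK_HIGHMEM;       /* needs kmap() before the kernel can touch it */
}
```

On a 64-bit kernel the last case never applies, because all physical memory is direct-mapped.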
Production Failure Scenarios
Kernel Memory Leak (kmalloc without free)
Failure: A kernel module allocates memory via kmalloc or kzalloc and never frees it before unloading. Over time, leaked memory accumulates, reducing the amount available for legitimate kernel allocations. Eventually the kernel’s memory allocator exhausts available pages, triggering the OOM killer — which may kill user-space processes unpredictably.
Mitigation: Always pair allocations with frees, and pair every kmap() with a kunmap(). Use the module_init/module_exit lifecycle hooks to clean up. During development, use the kernel’s kmemleak tool: on a kernel built with CONFIG_DEBUG_KMEMLEAK, run echo scan > /sys/kernel/debug/kmemleak, then cat /sys/kernel/debug/kmemleak to read the results. Run slabtop to watch for slab caches that grow without bound.
Slab Fragmentation Under Heavy Module Loading
Failure: Loading and unloading many kernel modules of different sizes creates slab fragmentation — many partially-filled slab pages with objects of one type, while other object types have full slabs. Physical memory becomes inefficiently used despite reasonable overall free memory counts.
Mitigation: Use the slabinfo tool to analyze slab utilization. Consider keeping modules loaded rather than repeatedly loading and unloading them. In /proc/slabinfo, watch for caches where active_objs falls well below num_objs, which indicates many partially filled slabs. On very small embedded systems running older kernels, SLOB was an option for reducing allocator overhead, at the cost of worse fragmentation.
vmalloc Exhaustion
Failure: A driver or module requests a very large vmalloc allocation (e.g., for a frame buffer or scatter-gather buffer). The vmalloc area is finite: 128 MB by default on 32-bit x86, and far larger (tens of terabytes) on x86-64, where running out of physical memory is the more likely limit. When the request cannot be satisfied, vmalloc fails and returns NULL. Calling code that does not check for NULL then dereferences a NULL pointer.
Mitigation: Check all vmalloc return values. On 32-bit x86, the vmalloc area can be enlarged with the vmalloc= kernel command-line parameter (for example, vmalloc=256M). When a large buffer must be physically contiguous, use alloc_pages() (the buddy system) rather than vmalloc.
Trade-off Table
| Allocator Aspect | Buddy System | Slab/SLUB | vmalloc |
|---|---|---|---|
| Allocation unit | Power-of-2 pages | Individual objects (bytes to KB) | Virtual pages |
| Physical contiguity | Always guaranteed | Always (via kmalloc) | Not guaranteed |
| Internal fragmentation | Up to 50% per allocation | Low (per-object size matching) | Low |
| Allocation latency | Moderate | Very low (per-CPU caches) | Higher (page table setup) |
| Can allocate from interrupt context | Yes (with GFP_ATOMIC) | Yes (with GFP_ATOMIC) | No (may sleep) |
| Suitable for | Page-level requests | Frequent small kernel objects | Large buffers, module memory |
| NUMA awareness | Partial | Full (SLUB) | Yes |
Implementation Snippets
Kernel Module with Proper Allocation/Free (C)
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/slab.h>    /* for kmalloc, kfree, kzalloc */
#include <linux/vmalloc.h> /* for vmalloc */
#include <linux/mm.h>      /* for kvfree */
#include <linux/gfp.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Example");
MODULE_DESCRIPTION("Memory allocation example: proper kmalloc/kfree pairing");

struct my_data {
    unsigned long id;
    char name[64];
    void *buffer;
};

/* Module parameter: buffer size in KB */
static int buffer_size_kb = 64;
module_param(buffer_size_kb, int, 0644);

/* Kept at file scope so my_module_exit() can free what init allocated */
static struct my_data *data;

static int __init my_module_init(void)
{
    size_t bytes;

    /* Reject nonsensical sizes before they reach the allocator */
    if (buffer_size_kb <= 0)
        return -EINVAL;
    bytes = (size_t)buffer_size_kb * 1024;

    printk(KERN_INFO "Loading my_module: buffer_size=%d KB\n", buffer_size_kb);

    /* kzalloc is kmalloc plus zero-fill, so id, name and buffer start cleared */
    data = kzalloc(sizeof(*data), GFP_KERNEL);
    if (!data) {
        printk(KERN_ERR "my_module: kzalloc failed\n");
        return -ENOMEM;
    }

    /* Allocate the buffer separately with the size from the parameter.
     * Above 128 KB (the conservative kmalloc limit used in this article)
     * fall back to vmalloc, which needs no physical contiguity. */
    if (buffer_size_kb > 128) {
        printk(KERN_WARNING "my_module: size %d KB exceeds kmalloc max, using vmalloc\n",
               buffer_size_kb);
        data->buffer = vmalloc(bytes);
    } else {
        data->buffer = kmalloc(bytes, GFP_KERNEL);
    }
    if (!data->buffer) {
        printk(KERN_ERR "my_module: buffer allocation failed\n");
        kfree(data);
        data = NULL;
        return -ENOMEM;
    }

    printk(KERN_INFO "my_module: allocated %d KB buffer at %p\n",
           buffer_size_kb, data->buffer);
    return 0; /* success */
}

static void __exit my_module_exit(void)
{
    /* Free what init allocated, in reverse order. kvfree() accepts both
     * kmalloc and vmalloc addresses, so it matches either branch above. */
    printk(KERN_INFO "Unloading my_module\n");
    kvfree(data->buffer);
    kfree(data);
}

module_init(my_module_init);
module_exit(my_module_exit);
/* GFP_KERNEL: can sleep, suitable for process context
* GFP_ATOMIC: cannot sleep, for interrupt context
* GFP_USER: for user-space-requested kernel allocations
* __GFP_DMA: force DMA zone
* __GFP_HIGHMEM: prefer HighMem zone
*/
Inspecting Kernel Slab Caches (bash)
#!/bin/bash
# Show top slab caches by memory consumption
echo "=== Top 15 slab caches by memory usage ==="
# Reading /proc/slabinfo requires root privileges.
# Skip the two header lines; columns are: name active_objs num_objs objsize ...
# Memory per cache is approximated as num_objs * objsize.
awk 'NR>2 {printf "%-28s %10d KB\n", $1, $3*$4/1024}' /proc/slabinfo 2>/dev/null | \
sort -k2,2nr | \
head -15
echo ""
echo "=== Detailed view of specific caches ==="
echo "--- task_struct cache ---"
grep "^task_struct" /proc/slabinfo
echo "--- buffer_head cache ---"
grep "^buffer_head" /proc/slabinfo
echo ""
echo "=== Using slabtop (live view, needs root) ==="
echo "Run 'slabtop -o' to display once, or 'slabtop' for live updates"
Observability Checklist
- Slab cache statistics: /proc/slabinfo or slabtop -o shows active objects, total objects, and object size per cache
- Kernel memory overall: cat /proc/meminfo | grep -E "MemFree|MemAvailable|Slab|Cached|Active|Inactive"
- Allocation failures: these appear in dmesg as "page allocation failure" warnings that include the order and GFP mode of the failed request; on SLUB kernels, the slub_debug boot parameter enables additional slab consistency checks
- OOM killer activity: dmesg | grep -i "out of memory\|oom" | tail
- Buddy info per zone: cat /proc/buddyinfo shows free blocks per zone per NUMA node for each order
- vmalloc usage: cat /proc/vmallocinfo shows all active vmalloc allocations (useful for tracking down a large vmalloc consumer)
- Per-NUMA node memory: numactl --hardware or cat /sys/devices/system/node/node*/meminfo
- High memory usage (32-bit only): cat /proc/meminfo | grep High
Common Pitfalls / Anti-Patterns
Use-After-Free (UAF) in Kernel Modules: UAF occurs when kernel code frees an object (returns it to the slab cache) but a dangling pointer in another part of the code still references it. A subsequent allocation of the same type can reuse that memory, and the original code can read/write the new object’s data, corrupting kernel state. UAF in the kernel is more severe than in user space — it can lead to privilege escalation. Mitigations include:
- UAF detectors: KASAN (Kernel Address Sanitizer) detects use-after-free at runtime by poisoning freed memory
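- SLAB_FREELIST_HARDENED: CONFIG_SLAB_FREELIST_HARDENED obfuscates freelist pointers, and CONFIG_SLAB_FREELIST_RANDOM randomizes the order of objects in a new slab; together they make UAF exploitation harder
- Lockdep: Does not detect UAF itself, but flags locking-order bugs that often cause the races behind UAF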
kmalloc with GFP_USER flag: Allocations made on behalf of user-triggered paths can be exploited to drain kernel memory (denial of service) if nothing bounds them. The kernel contains this with cgroup-based accounting: allocations marked for accounting (for example with GFP_KERNEL_ACCOUNT) are charged to the requesting task’s memory cgroup and count against its limit.
Speculative execution leaks (Spectre/Meltdown) and kernel memory: The Meltdown exploit (CVE-2017-5754) allowed user-space code to read kernel memory by exploiting speculative execution. The fix was KPTI (Kernel Page Table Isolation), which separates user and kernel page tables entirely but adds overhead to every system call and interrupt. Spectre variant 2 is a separate issue, mitigated with retpolines and microcode features such as IBRS.
Pitfall: Using vmalloc in interrupt context.
vmalloc may sleep: it allocates page-table pages and backing pages, and those allocations can block under memory pressure. Calling vmalloc from an interrupt handler, softirq, or any other context where sleeping is forbidden can deadlock or trigger a "scheduling while atomic" bug. Use kmalloc with GFP_ATOMIC for memory that must be allocated from atomic context.
Pitfall: Confusing kmalloc’s GFP flags.
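GFP_KERNEL can sleep, so it is appropriate for process-context allocations where waiting is acceptable. GFP_ATOMIC cannot sleep, so it is appropriate for interrupt context, softirq, or any atomic section. Using GFP_KERNEL from atomic context is a bug: the allocation may sleep while a spinlock is held or interrupts are disabled, which can deadlock the system, and kernels built with CONFIG_DEBUG_ATOMIC_SLEEP report it as "sleeping function called from invalid context". Overusing GFP_ATOMIC is also harmful: it draws on emergency reserves and fails outright when memory is tight, because it cannot wait for page reclaim.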
Pitfall: Not accounting for slab cache overhead in cgroup memory limits.
Kernel memory (slab allocations) counts against the cgroup’s memory limit; in cgroup v1, memory.kmem.usage_in_bytes reports it separately from user pages. A container with memory.limit_in_bytes=512m may have its slab grow to 200 MB, leaving only 312 MB for actual process pages. In cgroup v2, kernel memory is charged against memory.max, and the slab fields in memory.stat show how much of the usage is slab.
Anti-pattern: Allocating with GFP_HIGHUSER without understanding HighMem.
On 32-bit systems, GFP_HIGHUSER allocates from HighMem (above 896 MB). Accessing HighMem pages requires kmap(), which has a performance cost. On 64-bit systems, HighMem does not exist, so GFP_HIGHUSER behaves like GFP_USER. Using GFP_HIGHUSER unnecessarily can add overhead without benefit.
Quick Recap Checklist
- The kernel memory allocator is layered: SLUB → page allocator (buddy system) → memory zones
- The buddy system allocates physical pages in power-of-two sizes (orders); adjacent free blocks (buddies) can be coalesced when freed
- Slab allocators (SLUB is the default) maintain per-type object caches above the buddy system, reducing fragmentation and allocation latency for frequent small allocations
- kmalloc allocates from the direct-mapped kernel address range (physically and virtually contiguous); the limit was historically ~128 KB on x86 and is KMALLOC_MAX_SIZE on current kernels
- vmalloc allocates from the vmalloc region (virtually contiguous but physically scattered), suitable for large buffers
- Memory zones (DMA, Normal, HighMem on 32-bit; no HighMem on 64-bit) organize physical memory by capability and mapping requirements
- kmalloc with GFP_KERNEL can sleep (process context); GFP_ATOMIC cannot sleep (interrupt context)
- Kernel memory leaks from modules accumulate in slab caches; use kmemleak to detect them
- UAF vulnerabilities in kernel code are severe; use KASAN during development for detection
- Tools: /proc/slabinfo (slab stats), /proc/buddyinfo (buddy free lists per zone), /proc/vmallocinfo (vmalloc usage)
Interview Questions
The buddy system operates at the page level. It manages physical memory in power-of-two page blocks (orders). When a request for 3 pages comes in, the buddy system rounds up to 4 pages (order 2) and allocates from the order-2 free list. If no order-2 block exists, it splits a larger block (order 3) in half, placing one half on the order-2 list and using the other half. The buddy system guarantees physically contiguous pages and coalesces adjacent free blocks on free.
The slab allocator operates above the buddy system, at the object level. It carves pages obtained from the buddy system into equal-sized objects (e.g., task_struct objects of a few KB each, or 64-byte generic kmalloc objects). Each slab cache is dedicated to one object type. When the kernel allocates a task_struct, the slab allocator usually returns a cached object, with no buddy system involvement and no free-list search. Freed objects are returned to the slab cache, not the buddy system, so the next allocation reuses the freed object immediately.
Use vmalloc when you need a large buffer (larger than ~128 KB, the kmalloc limit on x86) that does not require physically contiguous memory. The kernel's module loading mechanism uses vmalloc for code and data segments — module size constraints are governed by the vmalloc area size.
Use kmalloc when you need physically contiguous memory (e.g., for DMA to devices with addressing limitations), when allocation latency matters (vmalloc involves setting up page tables), or when you are allocating from atomic context (vmalloc can sleep; kmalloc with GFP_ATOMIC cannot).
A typical split: a driver uses kmalloc for small control structures and for buffers a device will DMA into, and vmalloc for a large table or firmware image that only the CPU reads, since that data need not be physically contiguous.
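The selection rules in this answer can be condensed into a small decision helper. This is a sketch of the reasoning, not a kernel API: choose_allocator, enum alloc_choice, and KMALLOC_LIMIT_SKETCH are hypothetical names, and the 128 KB threshold is simply the conservative figure used in the text.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed threshold: the conservative 128 KB kmalloc limit quoted in the text. */
#define KMALLOC_LIMIT_SKETCH ((size_t)128 * 1024)

enum alloc_choice {
    CHOOSE_KMALLOC,      /* slab-backed, physically contiguous, fast */
    CHOOSE_ALLOC_PAGES,  /* buddy system directly, physically contiguous */
    CHOOSE_VMALLOC       /* virtually contiguous only, may sleep */
};

/* Pick an allocator family from the request's size and constraints. */
enum alloc_choice choose_allocator(size_t bytes, bool needs_phys_contig,
                                   bool atomic_context)
{
    /* Small requests always go to kmalloc (GFP_ATOMIC in atomic context). */
    if (bytes <= KMALLOC_LIMIT_SKETCH)
        return CHOOSE_KMALLOC;
    /* vmalloc may sleep and gives no physical contiguity, so either
     * constraint forces the page allocator for large requests. */
    if (needs_phys_contig || atomic_context)
        return CHOOSE_ALLOC_PAGES;
    return CHOOSE_VMALLOC;
}
```

A 1 MB lookup table that only the CPU reads lands on vmalloc, while a 1 MB DMA buffer lands on the page allocator.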
Memory zones are regions of physical memory with different properties, defined in the kernel's architecture-specific code. ZONE_DMA contains the first 16 MB of physical memory on x86 — the addressing range of the legacy ISA bus. Some ancient hardware (ISA devices, some RAID controllers) can only perform DMA within this range. The kernel places such buffers in ZONE_DMA.
ZONE_NORMAL (16 MB to 896 MB on 32-bit x86) is directly mapped to the kernel's linear address space. Allocations here are the fastest. ZONE_HIGHMEM (above 896 MB on 32-bit x86) is not directly mapped — the kernel must use kmap() to temporarily map these pages before accessing them. On 64-bit systems, the entire physical address space is within the direct-mapped range, so HIGHMEM is unnecessary.
SLAB is the original Linux slab allocator, introduced in 2.2, refined through 2.6. It maintains per-CPU and per-node queues of free objects and has extensive debugging features (red zoning, object poisoning, sanity checks). It has significant per-CPU overhead and does not scale as well on large NUMA systems.
SLUB (Unqueued Slab Allocator, default since 2.6.23) removed the complex queue structures and uses page structs directly. It merges per-node lists into simpler structures and uses per-CPU "freelist" arrays instead. It scales dramatically better on large systems, has lower metadata overhead per object, and is the default allocator on all mainstream kernels.
SLOB (Simple List of Blocks) is a minimal allocator designed for embedded systems with very limited memory (sub-16 MB). It uses first-fit allocation rather than slab caches. It trades performance and fragmentation for minimal code size. Used when CONFIG_SLOB is set in the kernel config.
Kernel OOM occurs when the kernel's memory allocator (kmalloc, page allocator, slab allocator) cannot satisfy a memory request even after page reclaim and slab cache shrinking. It is relatively rare, because most kernel allocations are written so that they can fail and return an error. The OOM killer is invoked when physical memory plus swap is exhausted. It computes a "badness" score for each process, based mainly on the memory the process currently uses (resident pages, swap, and page tables) and adjusted by oom_score_adj; the current score is visible in /proc/PID/oom_score. The killer terminates the highest-scoring process to reclaim its pages. You can influence OOM behavior via /proc/PID/oom_score_adj ($-1000$ to $+1000$; negative values make a process less likely to be killed, positive values more likely).
On 32-bit x86, physical memory is divided into zones based on hardware constraints. ZONE_DMA contains the first 16 MB of physical memory, required for legacy ISA devices that can only perform DMA within this range. ZONE_NORMAL (16 MB to 896 MB) is directly mapped to the kernel's linear address space, so allocations here are the fastest because no special mapping is needed. ZONE_HIGHMEM (above 896 MB) is not directly mapped: the kernel must use kmap() to temporarily map these pages before accessing them, which adds overhead. On 64-bit systems, the entire physical address space falls within the direct-mapped range, so HIGHMEM does not exist; x86-64 uses ZONE_DMA, ZONE_DMA32 (memory below 4 GB, for devices limited to 32-bit addressing), and ZONE_NORMAL.
The buddy system splits larger blocks in half when satisfying a request, placing each half on the appropriate free list. When a block is freed, the allocator checks whether its "buddy" (the adjacent half of the split block) is also free and on the same free list. If so, the two halves are coalesced into a larger block and placed on the next-higher-order free list. This process repeats up the chain until no further coalescing is possible. Coalescing is important because it counteracts fragmentation — as blocks of various sizes are allocated and freed, physical memory can become fragmented (many small holes). The buddy system naturally merges adjacent free blocks, keeping larger contiguous regions available for future page-level allocations. This makes the buddy system highly resistant to external fragmentation at the page level.
kmalloc returns addresses from the kernel's direct-mapped linear address range (the Normal zone on 32-bit x86). The translation from virtual to physical is a simple fixed offset (PAGE_OFFSET). No page table entries need to be set up, because the kernel's page tables already map this entire range to physical memory. This is why kmalloc is fast and, when called with GFP_ATOMIC, usable from atomic context.
vmalloc allocates from the vmalloc area (VMALLOC_START to VMALLOC_END), which is NOT in the kernel's direct-mapped range. Each page of a vmalloc allocation requires individual page table entries to be set up (pointing to the underlying physical pages). This page table setup involves the buddy system and page allocator, which may sleep — making vmalloc unsafe from atomic context. The overhead of vmalloc is primarily this page table setup, not the allocation itself.
A slab cache is a per-type object pool maintained by the slab allocator. Each cache holds objects of one type (e.g., task_struct, buffer_head, inode). When the cache needs objects, it obtains whole pages from the buddy system and carves them into equal-sized objects. Freed objects are returned to the cache (not the buddy system), so the next allocation reuses the freed object immediately without any buddy system involvement. This approach eliminates fragmentation (objects are exactly the right size), ensures cache line alignment (objects are aligned to prevent false sharing), and dramatically reduces allocation latency (no searching free lists, no fragmentation checking). The alternative — using the buddy system for every kernel object — would be prohibitively slow for the thousands of small allocations per second in a running kernel.
The historical kmalloc limit of 128 KB on x86 corresponds to an order-5 block (2^5 × 4 KB = 128 KB); order 7 would be 512 KB. The limit exists because kmalloc memory is backed by physically contiguous page groups. Finding larger physically contiguous regions becomes increasingly unlikely as the size grows: the buddy system may have many free pages but not enough contiguous ones to satisfy a high-order allocation. Current kernels accept requests up to KMALLOC_MAX_SIZE (typically 4 MB on x86), but large requests are still prone to fail on a fragmented system. For large allocations that require physical contiguity (for DMA, for example), use alloc_pages() (the buddy system directly) with an appropriate order. For large allocations that do NOT require physical contiguity, use vmalloc(), which only guarantees virtual contiguity.
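The relationship between byte sizes and buddy orders is easy to get wrong, so a short worked sketch may help. It assumes 4 KB pages; order_for_bytes and bytes_for_order are hypothetical user-space helpers that mimic the rounding the page allocator needs, not kernel functions.

```c
#include <stddef.h>

/* Assumption for this sketch: 4 KB pages, as on x86. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((size_t)1 << SKETCH_PAGE_SHIFT)

/* Smallest order whose block (2^order pages) covers 'bytes'. */
int order_for_bytes(size_t bytes)
{
    /* Round up to whole pages without overflowing on huge inputs. */
    size_t pages = (bytes >> SKETCH_PAGE_SHIFT) +
                   ((bytes & (SKETCH_PAGE_SIZE - 1)) != 0);
    int order = 0;

    while (((size_t)1 << order) < pages)
        order++;
    return order;
}

/* Size in bytes of one block of the given order. */
size_t bytes_for_order(int order)
{
    return SKETCH_PAGE_SIZE << order;
}
```

With these helpers, 128 KB maps to order 5, 512 KB to order 7, and 4 MB to order 10, while a request of 4097 bytes already needs an order-1 block (8 KB).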
GFP_KERNEL allocations can sleep (block) while waiting for memory to become available. If the buddy system has no free pages, the allocator triggers page reclaim and waits for pages to be freed. This makes GFP_KERNEL appropriate for process context code where sleeping is acceptable. GFP_ATOMIC allocations cannot sleep: if no free pages are available, the allocation fails immediately rather than waiting. This is required for interrupt context, softirq, tasklet, and any code path where the scheduler cannot be invoked. The trade-off is that GFP_ATOMIC can fail, so callers must check the return value. GFP_ATOMIC requests may dip into the emergency reserves kept below the normal watermarks, but they never invoke the OOM killer themselves, since doing so would require waiting.
On NUMA systems, each node has its own set of memory zones (DMA, Normal, HighMem). The kernel prefers to allocate memory from the node local to the CPU making the request — local memory has lower latency and higher bandwidth than remote node memory. The allocation path calls alloc_pages_node() which first tries the local node's appropriate zone. If that fails (e.g., local node is out of memory), it falls back to remote nodes — but remote access costs 30-50% more latency. On a two-socket system, process A running on socket 0 accessing memory on socket 1 pays the Infinity Fabric / QPI inter-socket latency. The numactl tool, mbind(), and set_mempolicy() system calls allow fine-grained control over memory placement. Large database workloads often explicitly bind memory allocation to specific nodes to avoid cross-socket traffic.
The page allocator (the buddy system) operates at the page level — it manages physical pages (4 KB chunks) and their allocation to any caller. The slab allocator sits above the page allocator and uses it to obtain pages for its caches. When a slab cache needs more objects, it calls alloc_pages() to obtain a batch of pages from the buddy system. These pages are then subdivided into equal-sized objects and placed in the cache. When objects are freed, they go back to the slab cache — not back to the buddy system — until the cache decides to release excess pages back to the buddy system. This layered approach means the buddy system only deals with page-sized allocations, while the slab layer handles the byte-to-KB allocations that the kernel needs.
Use-after-free occurs when kernel code frees an object (returns it to the slab cache) but a dangling pointer in another part of the code still references it. A subsequent allocation of the same type reuses that memory, and the original code reads/writes the new object's data — corrupting kernel state or potentially escalating privileges.
KASAN (Kernel Address Sanitizer) detects UAF at runtime using shadow memory: each 8 bytes of kernel memory has one shadow byte recording whether those bytes are valid to access. When an object is freed, KASAN marks its shadow as freed and holds the object in a quarantine for a while so it is not reused immediately. Compiler-inserted checks consult the shadow on every memory access, and an access to freed memory produces a report with the allocation and free stack traces. (The 0x6B fill pattern belongs to SLUB's own poisoning debug option, not to KASAN.) Generic KASAN costs roughly one eighth of memory for the shadow plus a noticeable slowdown, and is enabled with CONFIG_KASAN=y in the kernel config. It detects out-of-bounds accesses, use-after-free, and double-free bugs during development and testing.
Linux cgroups (v1 and v2) allow per-cgroup memory limits enforced by the kernel's memory controller. When a cgroup's memory usage hits its limit (memory.limit_in_bytes in v1, memory.max in v2) and reclaim cannot free enough, the kernel invokes the OOM killer within that cgroup, killing one of its processes to reclaim memory. This is independent of the system-wide OOM killer: the victim is chosen from among the processes in the cgroup, not system-wide. Critical system services that must never be killed should run with oom_score_adj set to -1000 or outside the memory-constrained cgroup. Container runtimes (Docker, Kubernetes) use cgroups to enforce memory limits, so when a container hits its limit, the cgroup OOM killer terminates a process within that container, not elsewhere on the system.
vmalloc() allocates a virtually contiguous region backed by non-contiguous physical pages, returning a virtual address range. vmap() takes an existing array of page pointers and maps them into a contiguous virtual address range; the pages must already be allocated. vmalloc is for when you need memory and don't care about physical contiguity; vmap is for when you already have pages (from alloc_pages or I/O) and need them to appear as one contiguous virtual region. The vmalloc result can be used directly; vmap is typically used during kernel initialization or for mapping I/O buffers. Both create page table entries dynamically and may sleep, so neither should be called from atomic context.
The kernel's memory pool (mempool) maintains a reserve of pre-allocated objects that can be used when normal allocation fails (e.g., during memory pressure). Mempools were originally designed for block I/O where an allocation failure during an I/O operation could cause deadlock — if the system is low on memory and needs to allocate a buffer to complete the I/O that would free more memory, a deadlock is possible. Mempools solve this by always keeping a minimum pool of allocated objects. They are used in the block layer (bio pools), SCSI mid-layer, and some filesystem code. For most kernel code, mempool is overkill — a failed allocation usually means genuine memory exhaustion and the OOM killer should handle it. Using mempool for regular allocations just delays the inevitable and reduces memory available for other uses.
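The reserve-then-fallback behaviour can be illustrated with a small user-space model. It is a sketch only: struct toy_mempool and the toy_ functions are hypothetical names, malloc stands in for the underlying allocator, and the simulate_pressure flag is a test hook for pretending that normal allocation fails. The kernel's mempool can additionally sleep until an object is returned, which this model does not attempt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_MIN 4 /* toy reserve size */

struct toy_mempool {
    void *reserve[POOL_MIN]; /* pre-allocated objects held back for emergencies */
    int nr_reserved;
    size_t obj_size;
    bool simulate_pressure;  /* test hook: pretend the normal allocator fails */
};

/* Normal allocation path; fails on demand to model memory pressure. */
static void *toy_backing_alloc(struct toy_mempool *p)
{
    return p->simulate_pressure ? NULL : malloc(p->obj_size);
}

/* Fill the reserve up front, while memory is still plentiful. */
static int toy_mempool_init(struct toy_mempool *p, size_t obj_size)
{
    assert(obj_size > 0);
    p->obj_size = obj_size;
    p->nr_reserved = 0;
    p->simulate_pressure = false;
    while (p->nr_reserved < POOL_MIN) {
        void *obj = malloc(obj_size);
        if (!obj) {
            while (p->nr_reserved > 0)
                free(p->reserve[--p->nr_reserved]);
            return -1;
        }
        p->reserve[p->nr_reserved++] = obj;
    }
    return 0;
}

/* Try the normal allocator first; dip into the reserve only on failure. */
static void *toy_mempool_alloc(struct toy_mempool *p)
{
    void *obj = toy_backing_alloc(p);
    if (obj)
        return obj;
    if (p->nr_reserved > 0)
        return p->reserve[--p->nr_reserved];
    return NULL; /* reserve exhausted */
}

/* Top the reserve back up before releasing anything to the allocator. */
static void toy_mempool_free(struct toy_mempool *p, void *obj)
{
    if (p->nr_reserved < POOL_MIN)
        p->reserve[p->nr_reserved++] = obj;
    else
        free(obj);
}
```

The design point is the order of operations: the reserve is untouched while the normal path works, and a freed object refills the reserve first. That guarantees forward progress for up to POOL_MIN in-flight operations, at the cost of memory that sits idle the rest of the time.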
The buddy system (page allocator) manages physical pages; its fragmentation is at the page level and is managed through the order system. The slab allocator (used by kmalloc) manages sub-page objects carved from pages obtained from the buddy system. For kmalloc, fragmentation is mostly internal fragmentation within objects: a 64-byte request in a 64-byte slot wastes nothing, while a 65-byte request is rounded up to the next size class and wastes the difference. The SLUB allocator bounds this waste by maintaining size-specific kmalloc caches (typically 8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, and 8192 bytes), so a 65-byte request lands in the 96-byte cache and wastes 31 bytes. The buddy system can become externally fragmented over time, with many order-0 and order-1 free pages but no higher-order blocks available; compaction, run by the kcompactd kernel thread or directly by a failing allocation, migrates pages to rebuild larger free blocks.
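The two kinds of rounding can be worked through numerically. The sketch below assumes a size-class table that includes the 96- and 192-byte classes found in typical kernel builds (the exact set depends on architecture and configuration) and a 4 KB page; the toy_ function names are hypothetical. With that table, a 65-byte request rounds up to a 96-byte slot.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed kmalloc size classes; the real set varies by configuration. */
static const size_t toy_size_classes[] = {
    8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
};

/* Smallest size class that fits the request, or 0 if none does. */
static size_t toy_size_class(size_t req)
{
    assert(req > 0);
    size_t n = sizeof(toy_size_classes) / sizeof(toy_size_classes[0]);
    for (size_t i = 0; i < n; i++)
        if (req <= toy_size_classes[i])
            return toy_size_classes[i];
    return 0; /* larger requests are handled outside these caches */
}

/* Internal fragmentation: bytes in the slot that the caller did not ask for. */
static size_t toy_internal_waste(size_t req)
{
    size_t slot = toy_size_class(req);
    return slot ? slot - req : 0;
}

/* Smallest buddy order whose block (4 KB << order) covers nbytes. */
static unsigned toy_order_for(size_t nbytes)
{
    unsigned order = 0;
    size_t block = 4096;
    while (block < nbytes) {
        block <<= 1;
        order++;
    }
    return order;
}
```

Under these assumptions, toy_internal_waste(64) is 0 and toy_internal_waste(65) is 31. At the page level the rounding is coarser: toy_order_for(20000) is 3, so a 20,000-byte request occupies a 32 KB block.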
Kernel code must handle allocation failures because kernel allocations can genuinely fail. In user space, malloc usually succeeds under the default overcommit policy, and the memory is only committed when it is touched; the kernel has no such cushion. The kernel provides several strategies: (1) check return values: every kmalloc, vmalloc, and alloc_pages call can return NULL, and well-written code checks and propagates or handles the failure. (2) OOM killer: when memory is genuinely exhausted, the OOM killer terminates a process to free memory so the system can continue. (3) mempools: pre-allocated reserves for critical paths where allocation failure would be catastrophic. (4) memory cgroups and limits: preventing any single service from consuming all memory. (5) boot-time reservations: reserving memory for specific uses so critical allocations succeed. Failing to check for allocation failures is one of the most common kernel bugs; a NULL dereference from a failed kmalloc can crash the system.
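Strategy (1) is usually written with the goto-based unwind idiom common in kernel code. The sketch below models it in user space: toy_alloc and toy_free are hypothetical wrappers around malloc and free with a fault-injection hook (toy_fail_at), and struct toy_dev is an invented example structure. Each allocation is checked, and a failure jumps to a label that releases only what was already acquired.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_dev {
    char *rx_buf;
    char *tx_buf;
};

/* Fault-injection hooks: fail the Nth allocation (0 means never fail). */
static int toy_fail_at;
static int toy_alloc_count;
static int toy_live_allocs; /* outstanding allocations, for leak checking */

static void *toy_alloc(size_t n)
{
    if (toy_fail_at && ++toy_alloc_count == toy_fail_at)
        return NULL;
    void *p = malloc(n);
    if (p)
        toy_live_allocs++;
    return p;
}

static void toy_free(void *p)
{
    if (p) {
        toy_live_allocs--;
        free(p);
    }
}

/* Returns 0 on success or -ENOMEM; *out is written only on success. */
static int toy_dev_create(struct toy_dev **out)
{
    assert(out != NULL);
    struct toy_dev *d = toy_alloc(sizeof(*d));
    if (!d)
        goto err;
    d->rx_buf = toy_alloc(4096);
    if (!d->rx_buf)
        goto err_free_dev;
    d->tx_buf = toy_alloc(4096);
    if (!d->tx_buf)
        goto err_free_rx;
    *out = d;
    return 0;

err_free_rx:          /* unwind in reverse order of acquisition */
    toy_free(d->rx_buf);
err_free_dev:
    toy_free(d);
err:
    return -ENOMEM;
}

static void toy_dev_destroy(struct toy_dev *d)
{
    toy_free(d->tx_buf);
    toy_free(d->rx_buf);
    toy_free(d);
}
```

Setting toy_fail_at to 3 makes the third allocation fail; toy_dev_create then returns -ENOMEM with toy_live_allocs back at 0, showing that the partial state was fully unwound rather than leaked or dereferenced.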
Further Reading
Slab Cache Debugging and Analysis
Analyzing slab cache utilization:
# Top 20 slab caches by memory usage (num_objs * objsize, in KB)
sudo awk 'NR>2 {printf "%d KB %s\n", $3*$4/1024, $1}' /proc/slabinfo | sort -rn | head -20
# Watch cache size over time
sudo watch -n1 'grep -E "^(task_struct|buffer_head|ext4_inode_cache) " /proc/slabinfo'
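SLUB debug features (require CONFIG_SLUB_DEBUG=y; enabled at boot with the slub_debug parameter):
- slub_debug=FZPU: F = sanity checks, Z = red zoning, P = poisoning, U = allocation/free user tracking
- slub_debug=FZPU,<cache name>: restrict debugging to specific caches
- slub_debug=O: switch debugging off for caches where it would increase the minimum slab order
- Red zoning: fills unused space with a pattern to detect buffer overflows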
- Object poisoning: fills freed objects with a pattern to detect use-after-free
KASAN (Kernel Address Sanitizer):
- Detects out-of-bounds and use-after-free at runtime
- Requires ~2x memory overhead
- Supported in modern kernels (4.0+)
- Enable: CONFIG_KASAN=y in kernel config
Memory Zones and the DMA Zone on x86
On 32-bit x86, the DMA zone exists because of the ISA bus 16 MB limitation:
| Zone | Address Range (32-bit x86) | Size | Purpose |
|---|---|---|---|
| ZONE_DMA | 0 - 16 MB | 16 MB | ISA DMA devices |
| ZONE_NORMAL | 16 MB - 896 MB | 880 MB | Kernel direct-mapped |
| ZONE_HIGHMEM | 896 MB - end | Varies | Must use kmap() to access |
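On 64-bit x86, the entire physical address space is within the direct-mapped range, so ZONE_HIGHMEM is unnecessary and does not exist. The zones there are ZONE_DMA, ZONE_DMA32 (memory below 4 GB, for devices limited to 32-bit addressing), and ZONE_NORMAL.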
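The 32-bit x86 boundaries from the table can be expressed as a small classifier. This is an illustrative sketch with hypothetical names (toy_zone_of, toy_needs_kmap); the kernel derives zone boundaries at boot from the actual memory map rather than from hard-coded constants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum toy_zone { TOY_ZONE_DMA, TOY_ZONE_NORMAL, TOY_ZONE_HIGHMEM };

#define MB(x) ((uint64_t)(x) << 20)

/* Classify a physical address using the classic 32-bit x86 boundaries. */
static enum toy_zone toy_zone_of(uint64_t phys)
{
    if (phys < MB(16))
        return TOY_ZONE_DMA;    /* reachable by ISA DMA devices */
    if (phys < MB(896))
        return TOY_ZONE_NORMAL; /* covered by the kernel's direct mapping */
    return TOY_ZONE_HIGHMEM;    /* no permanent kernel mapping */
}

/* Highmem pages need a temporary mapping before the kernel can touch them. */
static bool toy_needs_kmap(uint64_t phys)
{
    return toy_zone_of(phys) == TOY_ZONE_HIGHMEM;
}
```

For example, a page at physical address 1 GB falls in the highmem zone and would need kmap() on a 32-bit kernel, while a page at 512 MB is directly addressable.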
Key Takeaways
- The kernel memory allocator is layered: SLUB (object cache) → buddy system (page allocator) → memory zones
- The buddy system allocates physical pages in power-of-two sizes (orders) with coalescing on free
- Slab allocators (SLUB is default) maintain per-type object caches above the buddy system
- kmalloc allocates from the direct-mapped region (physically and virtually contiguous); the maximum size is typically 4 MB on current x86 kernels (128 KB on much older ones), and large requests become unreliable as memory fragments
- vmalloc allocates from the vmalloc region (virtually contiguous, physically fragmented)
- Memory zones organize physical memory by capability and mapping requirements
- kmalloc with GFP_KERNEL can sleep (process context only); GFP_ATOMIC never sleeps and is the flag to use in interrupt context
Conclusion
The kernel memory architecture is a layered system that transforms physical RAM into the virtual address spaces your applications use. At the foundation sits the buddy system, allocating physical pages in power-of-two blocks. Above it, slab allocators like SLUB carve those pages into cached objects, reducing allocation overhead for the kernel’s most frequently-created structures.
Understanding kmalloc versus vmalloc is essential for driver and module development. The former gives you physically contiguous memory from the direct-mapped zone; the latter provides virtually contiguous regions at the cost of potentially scattered physical pages. Memory zones reflect the hardware reality: the DMA zone serves legacy devices that cannot address full memory, while HighMem exists only on 32-bit systems where the kernel’s linear address range cannot cover all physical RAM.
For continued learning, explore how the kernel handles out-of-memory conditions through the OOM killer, and how cgroup-based memory limits contain memory growth in containerized environments. The intersection of kernel memory allocation and security, particularly KASAN for use-after-free detection and KPTI for Meltdown mitigation, represents an advanced frontier in systems programming.