Shared Memory

Learn about mmap, shmget/shmat, memory-mapped I/O, and the cache coherence challenges that come with zero-copy inter-process communication.

published: May 19, 2026 reading time: 41 min read author: GeekWorkBench

If you need to move large amounts of data between processes, every mechanism we’ve looked at so far — pipes, message queues, sockets — requires the kernel to copy data at least once. Each read() from a pipe copies data from the kernel’s pipe buffer into user space. Each msgsnd() copies a message into the kernel’s queue. This is fine for small messages but becomes a bottleneck when you are sharing megabytes or gigabytes of data between processes at high frequency.

Shared memory fixes this by letting multiple processes map the same physical memory pages into their address spaces. No kernel copy is involved — one process writes to an address and another process reads from a different address that maps to the same physical memory. This is the fastest form of IPC available, and understanding it is essential for building high-performance systems like databases, video processing pipelines, and real-time data distribution engines.

Introduction

Shared memory is a region of memory that multiple processes can access. Unlike pipes where data flows in one direction and is consumed, shared memory is a bidirectional, persistent data space that all participants share. The operating system coordinates access through page table entries and optional synchronization primitives (typically semaphores or mutexes).

There are three primary mechanisms for shared memory on Unix/Linux systems:

Memory-mapped files (mmap()) — Maps a file (or a file-backed region) into memory. Changes are eventually written back to the file. The file provides persistence and a way to back the shared region.

System V shared memory (shmget()/shmat()/shmdt()/shmctl()) — A pure shared memory mechanism backed by kernel memory (not a file). Data does not persist across system reboots. Offers more control over memory attributes.

POSIX shared memory (shm_open()/mmap()) — A hybrid approach using filesystem-like paths but backed by kernel memory rather than a file. More modern API than System V.

All three approaches work by mapping the same physical memory pages into multiple processes’ page tables. The CPU’s memory management unit (MMU) handles the translation from virtual addresses (different per process) to physical addresses (shared).

When to Use / When Not to Use

Use shared memory when:

You need maximum throughput between processes (zero-copy data sharing)
You are sharing large data structures (buffers, matrices, databases)
You are implementing a producer-consumer pattern where the data volume is high
Multiple processes need simultaneous read/write access to the same data
You want to avoid the kernel overhead of repeated copy operations

Do not use shared memory when:

You need data persistence across system reboots (System V and POSIX segments live in RAM; use a file-backed mmap() mapping instead)
Your data sizes are small and simple (pipes or message queues are simpler)
You cannot tolerate the complexity of synchronization
You are on a platform where shared memory is not well-supported
You need security isolation between processes (use separate address spaces)

Architecture or Flow Diagram

Virtual Address Translation (Per Process)

Each process running on a Unix-like system operates within its own virtual address space. The CPU’s Memory Management Unit (MMU) translates virtual addresses used by a process into physical addresses that refer to actual RAM. This translation happens on every memory access, and the translation rules are stored in a per-process data structure called the page table.

The page table maps virtual page numbers to physical page frames. A virtual address is split into a virtual page number (the high bits) and an offset within that page (the low bits). The MMU looks up the virtual page number in the page table, retrieves the corresponding physical page frame number, and concatenates it with the offset to form the physical address. The page table entry also stores access permissions (read/write/execute) and status bits (present, dirty, accessed).
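
The split described above can be sketched in a few lines of C. This is only an illustration of the arithmetic, assuming a power-of-two page size such as 4096; the function names `split_address` and `join_address` are hypothetical, and a real MMU does this in hardware using the page table rather than a function call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split a virtual address into a virtual page number and an in-page offset.
 * page_size must be a power of two (for example 4096). */
void split_address(uintptr_t vaddr, size_t page_size,
                   uintptr_t *vpn, size_t *offset) {
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    *vpn = vaddr / page_size;                    /* high bits */
    *offset = (size_t)(vaddr & (page_size - 1)); /* low bits  */
}

/* Rebuild a physical address from the frame number the page table supplied
 * and the offset, which passes through translation unchanged. */
uintptr_t join_address(uintptr_t frame, size_t page_size, size_t offset) {
    return frame * page_size + offset;
}
```

With 4KB pages, the address 0x12345 splits into page number 0x12 and offset 0x345. If the page table maps that page to frame 0x99, the physical address is 0x99345.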

What makes shared memory work is that multiple processes can have page table entries that map different virtual pages to the same physical page frame. When Process A maps a shared memory segment, the kernel creates page table entries for A that point to the physical pages backing the segment. When Process B maps the same segment, the kernel creates entries for B that point to the same physical pages. The virtual addresses may differ — A might map at 0x7f0000000000 and B at 0x556000000000 — but both translate to the same physical memory.

The diagram below traces this translation for a single process. The virtual address space contains the shared memory region at some virtual address. The page table entry for that region points to the physical page frame. The MMU uses this entry to translate every subsequent access. The key insight is that the page table is per-process, but the physical pages it references can be shared.

graph LR
    A_VA[Virtual Address Space<br/>Process A] --> A_PT[A's Page Table]
    A_PT --> A_MMU[MMU Translation]
    A_MMU --> PHYS[Physical Memory Page<br/>Shared Physical Page]

Two Processes Mapping Same Physical Page

graph TD
    subgraph Process_A["Process A"]
        A_VA[Virtual Address<br/>0x7f...] --> A_PT[A's Page Table]
        A_PT --> A_MMU[MMU]
    end

    subgraph Process_B["Process B"]
        B_VA[Virtual Address<br/>0x556...] --> B_PT[B's Page Table]
        B_PT --> B_MMU[MMU]
    end

    A_MMU --> PHYS[Shared Physical Page]
    B_MMU --> PHYS
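
The same effect can be observed inside a single process by mapping one file twice. The sketch below is an illustration under assumptions, not a canonical recipe: the scratch path under /tmp and the function name `two_views_share_page` are hypothetical. It maps the same page at two virtual addresses and checks that a write through one view is readable through the other.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if a write through the first mapping is visible through the
 * second, 0 if the views disagree, -1 if setup failed. */
int two_views_share_page(void) {
    char path[] = "/tmp/shm_views_XXXXXX";  /* hypothetical scratch file */
    int fd = mkstemp(path);
    if (fd == -1) return -1;
    unlink(path);  /* the file stays alive while fd or mappings exist */

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || ftruncate(fd, page) == -1) { close(fd); return -1; }

    size_t len = (size_t)page;
    char *view_a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *view_b = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (view_a == MAP_FAILED || view_b == MAP_FAILED) {
        if (view_a != MAP_FAILED) munmap(view_a, len);
        if (view_b != MAP_FAILED) munmap(view_b, len);
        return -1;
    }
    assert(view_a != view_b);  /* two distinct virtual addresses */

    strcpy(view_a, "written via A");
    int same = strcmp(view_b, "written via A") == 0;

    munmap(view_a, len);
    munmap(view_b, len);
    return same;
}
```

Two processes mapping the same object behave the same way; the only difference is that each view lives in a different page table.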

Memory Mapping Flow (mmap / shmget)

The sequence from creating a shared memory segment to having it accessible in two processes follows a consistent pattern regardless of whether you use mmap(), shmget()/shmat(), or shm_open()/mmap(). The differences are in the API and the backing store, but the kernel mechanism is identical: physical pages are allocated and page table entries are created pointing to those pages in each attaching process.

The flow starts with a system call that tells the kernel to create or open a shared memory region. For shmget(), this allocates a System V shared memory segment in kernel memory. For shm_open(), this creates a POSIX shared memory object, which on Linux is a file in the tmpfs filesystem mounted at /dev/shm/. For mmap() with MAP_SHARED | MAP_ANONYMOUS, the kernel creates a region with no file behind it; its pages live in RAM and can be moved to swap under memory pressure. Because an anonymous region has no name, only processes that inherit the mapping across fork() can share it. In all three cases, the result is a region of memory that multiple processes can map.

After creation, each process attaches the region to its address space. shmat() returns a virtual address where the segment is mapped. mmap() with a file or POSIX shm object maps the object into the process’s virtual address space. The kernel allocates physical page frames for the region on first touch — not at creation time — and creates page table entries in the process’s page table that map the virtual pages to those physical frames.

The diagram below shows the flow. When Process A first writes to its virtual address in the shared region, the MMU triggers a page fault because the page table entry is not yet valid. The kernel handles this by allocating a physical page, filling the page table entry with the physical frame number, and resuming the process. When Process B attaches and touches its virtual address (which is different from A’s), it also takes a page fault, but this time the kernel finds the page that already backs the region and points B’s page table entry at that same frame instead of allocating a new one.

After both processes have attached, writes from A appear in physical memory and are immediately visible when B reads from its virtual address, because both virtual addresses map to the same physical pages. No data is copied through kernel space during the access — it flows directly between the CPU cache and physical memory, with the MMU handling the address translation on each access.

graph LR
    C[shmget / mmap] --> D[Kernel allocates<br/>physical pages]
    D --> E[Kernel maps same physical pages<br/>into both processes' page tables]
    E --> F[Process A writes to address 0x7f...<br/>Process B reads from 0x556...]
    F --> G[Both virtual addresses<br/>→ same physical address]
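
A minimal sketch of this flow, assuming an anonymous shared mapping inherited across fork() (one of several ways to set the region up). The function name `value_seen_by_parent` is hypothetical. The child writes through its own page table, and the parent reads the result after the child exits.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* The child stores `value` in a shared page; the parent reads it back once
 * the child has exited. Returns the value the parent saw, or -1 on failure. */
int value_seen_by_parent(int value) {
    assert(value >= 0);  /* -1 is reserved as the failure marker */

    int *shared = mmap(NULL, sizeof *shared, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED) return -1;
    *shared = -1;  /* first touch: the kernel allocates the physical page */

    pid_t pid = fork();
    if (pid == -1) { munmap(shared, sizeof *shared); return -1; }
    if (pid == 0) {
        *shared = value;  /* same physical page, reached via the child's page table */
        _exit(0);
    }

    int seen = -1;
    if (waitpid(pid, NULL, 0) == pid)  /* child exit orders its write before our read */
        seen = *shared;
    munmap(shared, sizeof *shared);
    return seen;
}
```

Here waitpid() doubles as the synchronization point. Processes that run concurrently need a real primitive (a mutex, semaphore, or atomic) instead.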

Cache Coherence Protocol

Shared memory works at the page level, but modern CPUs operate at the cache line level (typically 64 bytes). When two processes running on different CPU cores access the same memory location, their respective CPU caches may hold the same cache line. This creates cache coherence challenges.

Modern CPUs implement MESI (Modified, Exclusive, Shared, Invalid) or MOESI cache coherence protocols. When one CPU writes to a cache line that another CPU holds in a shared state, the write causes an invalidation on the other CPU’s cache line, forcing the other CPU to re-fetch the data from memory on its next access.

In a high-contention shared memory scenario with many writers on different cores, cache line bouncing can severely degrade performance. This is why shared memory with high write contention often benefits from careful data layout — separating frequently written fields to different cache lines to avoid false sharing.

When two CPU cores share data through physical memory, they do not share a direct connection — each core has its own cache that sits between it and main memory. When Core 1 writes to a cache line that Core 2 also has cached, Core 1 cannot simply overwrite the data in its cache. The cache coherence protocol coordinates this. The MESI protocol gives each cache line one of four states: Modified (the data is dirty and only this core has it), Exclusive (the data is clean and only this core has it), Shared (the data is clean and other cores may have it), or Invalid (the cache line holds no valid data).

When Core 1 writes to a cache line in the Shared state, it must first broadcast an invalidation to all other cores that hold a copy of that line. Those cores mark their copies as Invalid. Core 1 then upgrades its copy to the Modified state. On the next access from Core 2, Core 1 must write the data back to main memory so Core 2 can fetch a fresh copy. This invalidation and write-back traffic is the coherence overhead.

MOESI adds an Owned state that reduces write-back traffic. When a core holding a modified line answers a read request from another core, it supplies the data directly and moves its own copy to Owned rather than writing it back to memory first. The requesting core holds the line as Shared, and the owner remains responsible for eventually writing the dirty data back. This avoids the memory write that MESI requires, at the cost of slightly more complex state management. AMD processors use MOESI; Intel processors use MESI-based variants such as MESIF.

The diagram below shows the invalidation sequence. CPU 1 writes to a shared cache line, triggering an invalidation on CPU 2’s copy. The coherence protocol (MESI or MOESI) handles the coordination. The programmer does not control this directly, but understanding it explains why false sharing (two unrelated variables on the same cache line) causes dramatic slowdowns.

graph TD
    H[CPU 1 writes to shared cache line] --> I[Cache line invalidated<br/>in CPU 2's cache]
    I --> J[MESI or MOESI protocol<br/>handles coherence]
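
The data-layout advice above can be made concrete with C11 alignment specifiers. This is a sketch that assumes a 64-byte cache line (typical, but not universal); the struct and field names are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed line size; check your target CPU */

/* Both counters fit in one cache line, so a producer and a consumer on
 * different cores keep invalidating each other's copy (false sharing). */
struct packed_counters {
    uint64_t produced;
    uint64_t consumed;
};

/* Each counter starts on its own cache line, so writes to one do not
 * invalidate the line that holds the other. */
struct padded_counters {
    alignas(CACHE_LINE) uint64_t produced;
    alignas(CACHE_LINE) uint64_t consumed;
};

/* Fail the build if the layout ever regresses. */
static_assert(offsetof(struct padded_counters, consumed) -
              offsetof(struct padded_counters, produced) >= CACHE_LINE,
              "counters must not share a cache line");
```

The padded layout costs 128 bytes instead of 16. That trade is usually worth it only for fields that different cores write frequently.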

Core Concepts

Memory-Mapped Files (mmap)

The mmap() function creates a mapping between a file (or anonymous region) and a process’s virtual address space:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_SIZE (1024 * 1024)  // 1MB

int main(void) {
    // Open or create a file
    int fd = open("/tmp/shared_data.bin", O_RDWR | O_CREAT, 0666);
    if (fd == -1) {
        perror("open");
        exit(1);
    }

    // Extend file to needed size
    if (ftruncate(fd, MAP_SIZE) == -1) {
        perror("ftruncate");
        exit(1);
    }

    // Map the file into memory
    void *addr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }

    // Now read/write like regular memory (copy the string including its '\0')
    const char msg[] = "Hello shared world!";
    memcpy(addr, msg, sizeof msg);

    // Cleanup
    munmap(addr, MAP_SIZE);
    close(fd);
    return 0;
}

Key points:

PROT_READ | PROT_WRITE defines the access mode
MAP_SHARED means changes are visible to other processes and written to the underlying file
MAP_PRIVATE creates a copy-on-write mapping (changes go to a private copy)
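MAP_ANONYMOUS creates a mapping with no file behind it (pass -1 as the fd); the pages are zero-filled, live in RAM, and may be swapped out. Combine it with MAP_SHARED and fork() to share it with child processes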
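
The difference between MAP_SHARED and MAP_PRIVATE is easy to check empirically. The sketch below is illustrative only; the scratch path and the function name `file_byte_after_mapped_write` are assumptions. It writes one byte through a mapping and then reads the file itself to see whether the change reached it.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* The file starts out holding 'A'. Write `newval` to byte 0 through a mapping
 * created with `flags`, then return byte 0 as read from the file itself.
 * Returns -1 on failure. */
int file_byte_after_mapped_write(int flags, char newval) {
    assert(flags == MAP_SHARED || flags == MAP_PRIVATE);

    char path[] = "/tmp/mmap_flags_XXXXXX";  /* hypothetical scratch file */
    int fd = mkstemp(path);
    if (fd == -1) return -1;
    unlink(path);

    int result = -1;
    if (pwrite(fd, "A", 1, 0) == 1) {
        char *p = mmap(NULL, 1, PROT_READ | PROT_WRITE, flags, fd, 0);
        if (p != MAP_FAILED) {
            p[0] = newval;         /* private mappings copy the page on this write */
            msync(p, 1, MS_SYNC);  /* flush shared changes; private ones are never written */
            char on_disk;
            if (pread(fd, &on_disk, 1, 0) == 1)
                result = on_disk;
            munmap(p, 1);
        }
    }
    close(fd);
    return result;
}
```

Called with MAP_SHARED the function returns the new byte; called with MAP_PRIVATE it still returns 'A', because the write went to a private copy of the page.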

System V Shared Memory

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void) {
    // Create a shared memory segment (1MB).
    // IPC_PRIVATE always creates a new segment; other processes need the
    // shmid (for example, inherited across fork()) to attach to it.
    int shmid = shmget(IPC_PRIVATE, 1024 * 1024, IPC_CREAT | 0666);
    if (shmid == -1) {
        perror("shmget");
        exit(1);
    }

    // Attach the segment to our address space
    void *addr = shmat(shmid, NULL, 0);
    if (addr == (void *)-1) {
        perror("shmat");
        exit(1);
    }

    // Use the shared memory
    strcpy((char *)addr, "Hello from System V shared memory!");

    // Detach when done
    shmdt(addr);

    // Clean up (only when no one needs it anymore)
    shmctl(shmid, IPC_RMID, NULL);
    return 0;
}

POSIX Shared Memory

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    // Create a POSIX shared memory object
    int fd = shm_open("/my_shm", O_CREAT | O_RDWR, 0666);
    if (fd == -1) {
        perror("shm_open");
        exit(1);
    }

    // Set the size
    if (ftruncate(fd, 1024 * 1024) == -1) {
        perror("ftruncate");
        exit(1);
    }

    // Map it into address space
    void *addr = mmap(NULL, 1024 * 1024, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }

    // Use it...

    munmap(addr, 1024 * 1024);
    close(fd);
    shm_unlink("/my_shm");  // Remove the name from /dev/shm
    return 0;
}

Production Failure Scenarios

Race Conditions Without Synchronization

The most common shared memory failure is two processes writing to the same data simultaneously. Without synchronization primitives, one write overwrites the other, or partial writes create garbage data. A process reads a half-updated data structure and crashes.

Consider a shared counter as a plain integer. Process A reads 42, increments it to 43 in its register, and gets preempted before writing back. Meanwhile, Process B reads 42, increments to 43, and stores. When A resumes it writes its stale 43 back, silently dropping one increment. The counter ends up at 43 instead of 44. This is the classic lost-update problem, and it shows up even for the simplest operations.

Worse is the torn-write case with multi-field structs. A writer updates sequence_number to 5 and gets interrupted before touching data_buffer. A reader sees sequence_number == 5 but data_buffer still holding stale data from sequence 4. The reader treats the mismatched fields as a valid reading and passes corrupted data downstream.

Under heavy load the failures stop being predictable. Four processes hitting a shared queue at once can produce lost messages (overwritten slots), duplicates (two processes writing the same slot with the same sequence number), and crashes when readers hit broken invariants. None of these reproduce reliably, which is what makes them so frustrating to debug.

Mitigation: Always use synchronization primitives with shared memory. Options include System V semaphores (semget/semop), POSIX mutexes (with pthread_mutexattr_setpshared), or atomic operations for single-word values. Note that msync() only flushes a mapping to its backing file; it does not order accesses between processes. Design your data layout to minimize contention and false sharing.

Data Corruption from Partial Writes

If a writer is interrupted mid-write (preempted by the scheduler) and another process reads the data structure, it sees a partially updated state. This is especially dangerous with multi-field data structures that need atomic updates.

The scheduler does not wait for convenient moments. On a preemptive Unix kernel, your process can be interrupted between any two instructions, not just at explicit yield points. A struct assignment touching multiple fields of a shared struct is not atomic just because it is one C statement. The compiler may emit multiple store instructions, and the scheduler can interrupt between any of them.

Picture a sensor_data struct with timestamp, pressure, and temperature fields, written by a producer at 100Hz. If the producer is interrupted after writing timestamp and pressure but before temperature, the consumer reads a pressure from the new sample paired with a temperature from the previous one. It then computes derived values like density or flow rate using mismatched inputs. The error is silent and persists indefinitely if there is no checksum or version field catching it.

The larger the struct, the worse the window. A 20-field struct updated field by field has 20 potential interruption points. Every extra field you write in sequence without atomicity widens the window for torn reads.

This is not only a writer problem. A reader polling a data_ready flag without a lock can see the flag set to true while the data fields are still being written. On relaxed-memory-ordering CPUs (ARM, POWER), the CPU can reorder stores, so a writer might set data_ready = 1 before the actual data values land in memory. A reader on another core sees the flag as true and reads uninitialized or stale data.

Mitigation: Use atomic operations for simple values, use reader-writer locks for complex data, or structure updates so they can be done in a single write (e.g., write to a versioning field last, so readers can check version before trusting data).
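
The "write the version field last" idea is usually implemented as a sequence lock. The sketch below is one way to express it with C11 atomics, assuming a single writer and any number of readers. The struct `sensor_sample_t`, its integer fields, and the function names are hypothetical stand-ins for the sensor example above.

```c
#include <assert.h>
#include <stdatomic.h>

/* Atomics are only safe across processes if they are lock-free. */
static_assert(ATOMIC_INT_LOCK_FREE == 2 && ATOMIC_LONG_LOCK_FREE == 2,
              "lock-free atomics required for cross-process use");

typedef struct {
    atomic_uint version;      /* even = stable, odd = write in progress */
    atomic_long pressure;     /* payload fields (hypothetical units) */
    atomic_long temperature;
} sensor_sample_t;

/* Single writer: bump to odd, update the fields, bump to even. */
void sample_write(sensor_sample_t *s, long pressure, long temperature) {
    unsigned v = atomic_load_explicit(&s->version, memory_order_relaxed);
    atomic_store_explicit(&s->version, v + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&s->pressure, pressure, memory_order_relaxed);
    atomic_store_explicit(&s->temperature, temperature, memory_order_relaxed);
    atomic_store_explicit(&s->version, v + 2, memory_order_release);
}

/* Readers retry until they see the same even version before and after. */
void sample_read(sensor_sample_t *s, long *pressure, long *temperature) {
    unsigned before, after;
    do {
        before = atomic_load_explicit(&s->version, memory_order_acquire);
        *pressure = atomic_load_explicit(&s->pressure, memory_order_relaxed);
        *temperature = atomic_load_explicit(&s->temperature, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        after = atomic_load_explicit(&s->version, memory_order_relaxed);
    } while ((before & 1u) || before != after);
}
```

Readers never block the writer, which suits a 100Hz producer with occasional consumers. With multiple writers, the writers still need a lock among themselves.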

Lost Updates from Non-Atomic Operations

Even simple operations like counter++ are not atomic on most architectures — they involve a read, an increment, and a write. If two processes do this simultaneously on a shared counter, one update is lost.

The read-modify-write cycle is where things break down. On x86_64, an unoptimized counter++ compiles to three instructions: mov reg, [addr], inc reg, mov [addr], reg. Between the load and the store, the register holds a local copy. (An optimizing compiler may emit a single add [addr], 1, but without the lock prefix that instruction is still a separate read and write as far as other cores are concerned.) If Process A loads 5, gets preempted, and Process B loads 5 before A stores, both CPUs hold 5 independently. When A stores 6 and then B stores 6, the final value is 6 instead of 7. One increment vanishes without any error or warning.

This is not some exotic race that only happens under pathological load. It can fire any time two processes touch the same location within the same read-modify-write window. How often that happens depends on the update rate: processes that increment once per second collide rarely, while several cores incrementing in a tight loop lose updates constantly. Either way the counter drifts, and nothing reports the loss.

The same window exists in any read-modify-write operation: buf[i++] = value (read i, increment, write i, then write the value), flag &= MASK, semaphore-- in userspace before a syscall. All of these have the same lost-update exposure.

ARM and POWER make it worse. These architectures allow the CPU to reorder memory operations for performance, and store buffers can hold writes that are not yet visible to other cores. A plain counter++ is still an ordinary load, add, and store there. An atomic increment is typically built from a load-exclusive/store-exclusive pair (LDXR/STXR on AArch64): the store-exclusive reports failure if another core wrote to the location after the load, and the code loops to retry. The failure is a status result, not a fault, but a badly designed retry loop under heavy contention can still livelock.

One more thing worth knowing: an atomic builtin such as __sync_fetch_and_add() protects only the single word it operates on. If a logical update spans several fields, making each field atomic does not make the combined update atomic. Readers can still observe a mix of old and new values, so multi-field updates need a lock or a version-counter scheme.

Mitigation: Use atomic operations (__sync_fetch_and_add() GCC builtin, or C11 <stdatomic.h>), or use mutex-protected critical sections.
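
As a sketch of the atomic-operations option, the function below forks several children that all increment one counter in a shared anonymous mapping using C11 `atomic_fetch_add`. The function name `count_with_atomics` and the process and iteration counts are arbitrary choices for illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdatomic.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Atomics are only safe across processes if they are lock-free. */
static_assert(ATOMIC_LONG_LOCK_FREE == 2, "lock-free atomic_long required");

/* Fork `nproc` children that each add 1 to a shared counter `iters` times.
 * Returns the final count, or -1 if setup failed. */
long count_with_atomics(int nproc, int iters) {
    assert(nproc > 0 && iters >= 0);

    atomic_long *counter = mmap(NULL, sizeof *counter, PROT_READ | PROT_WRITE,
                                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (counter == MAP_FAILED) return -1;
    atomic_init(counter, 0);

    int started = 0;
    for (int p = 0; p < nproc; p++) {
        pid_t pid = fork();
        if (pid == -1) break;
        if (pid == 0) {
            for (int i = 0; i < iters; i++)
                atomic_fetch_add(counter, 1);  /* one indivisible read-modify-write */
            _exit(0);
        }
        started++;
    }
    for (int p = 0; p < started; p++)
        wait(NULL);  /* reap every child we started */

    long total = (started == nproc) ? atomic_load(counter) : -1;
    munmap(counter, sizeof *counter);
    return total;
}
```

Replacing `atomic_fetch_add(counter, 1)` with a plain increment on a non-atomic long is a quick way to watch updates disappear on a multi-core machine.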

Resource Leaks from Improper Cleanup

If a process exits without detaching from shared memory (shmdt() or munmap()), the kernel marks the mapping as detached when the process’s page table entries are destroyed. However, the shared memory segment itself may persist if IPC_RMID or shm_unlink() has not been called.

A System V shared memory segment moves through three stages: creation (shmget with IPC_CREAT), attachment (shmat), and detachment (shmdt). The segment lives in kernel memory and is reference-counted. Each shmat bumps the count; each shmdt drops it. When the count hits zero and the segment has been marked for destruction with IPC_RMID, the kernel frees it.

But IPC_RMID does not destroy the segment right away if processes are still attached. The kernel marks it for destruction and waits until the last process detaches. This trips people up. If one process calls IPC_RMID while five others are still attached, the segment stays usable for those five. On Linux its key no longer resolves through shmget(), so new processes cannot look it up, but the existing ones keep going normally.

Orphaned segments pile up when processes crash or get killed with SIGKILL. SIGKILL cannot be caught, so signal handlers never run. A process killed this way never calls shmdt() or shmctl(IPC_RMID). The kernel cleans up the page table entries, which drops the attachment count, but if IPC_RMID was never called the segment just sits there. Running ipcs -m shows segments with nattach = 0 that were never cleaned up.

POSIX shared memory behaves differently but not necessarily better: shm_unlink() removes the name immediately, and the underlying memory is freed when all file descriptors and mappings are closed. Mappings held by a killed process are torn down by the kernel when the process exits, so they are never the leak. The leak is the name: if every process dies before anyone calls shm_unlink(), the object stays in /dev/shm/ and keeps its memory until it is removed by hand or the system reboots.

Long-running processes that repeatedly create and fail to clean up shared memory segments are the most likely to hit this. A connection pool that creates a new segment per connection and skips cleanup on error will slowly consume kernel memory until the system hits SHMALL limits and new allocations start failing.

Mitigation: Implement robust cleanup in signal handlers (SIGTERM), use wrapper frameworks that track shared memory lifecycle, and monitor for orphaned shared memory segments with ipcs -m.
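
One common way to make System V cleanup survive SIGKILL is to call IPC_RMID as soon as every participant has attached, relying on the deferred destruction described above. The sketch below assumes Linux behavior; the function name `early_rmid_demo` is hypothetical, and a real program would attach all of its processes before the removal step.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Returns 1 if the segment stayed usable after IPC_RMID and disappeared
 * after the last detach, 0 if not, -1 if setup failed. */
int early_rmid_demo(size_t size) {
    assert(size > 0);

    int shmid = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (shmid == -1) return -1;

    char *addr = shmat(shmid, NULL, 0);
    if (addr == (void *)-1) {
        shmctl(shmid, IPC_RMID, NULL);
        return -1;
    }

    /* Mark for destruction now. The kernel frees the segment when the last
     * attachment goes away, including when a process is killed. */
    if (shmctl(shmid, IPC_RMID, NULL) == -1) {
        shmdt(addr);
        return -1;
    }

    strcpy(addr, "still mapped");  /* existing attachments keep working */
    int usable = strcmp(addr, "still mapped") == 0;

    shmdt(addr);  /* last detach: the segment is destroyed here */
    struct shmid_ds ds;
    int gone = shmctl(shmid, IPC_STAT, &ds) == -1;

    return usable && gone;
}
```

With this ordering there is nothing left for a signal handler to clean up, so nothing shows up later in ipcs -m with nattach = 0.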

Page Fault Overhead on First Access

When a shared memory segment is first attached, the pages may not be in physical memory. A page fault occurs for each page, which adds latency on first access.

When a process first touches a virtual address whose page table entry is not valid, the page fault handler kicks in. For a newly attached shared memory segment, the page table entries exist — shmat or mmap set those up — but the physical pages have not been allocated yet. The first access to each page triggers a minor page fault. The kernel allocates a physical page, initializes it (zero-fills for anonymous memory, reads from the backing file for file-backed mappings), updates the page table entry, and resumes the process.

For a 1MB segment (256 pages at 4KB each), touching every page for the first time triggers up to 256 minor page faults. At roughly 1-5 microseconds each (the figure varies with hardware and memory pressure), that adds up to about 0.25-1.3 milliseconds, which is meaningful for real-time code that expects shared memory access to be sub-microsecond.

How you access the pages matters. Sequential access lets the CPU’s hardware prefetcher and the kernel’s prefaulting heuristics hide some of the cost. Random access patterns — common in database buffer pools — defeat prefetching entirely and force each page fault to be handled independently.

Huge pages change the shape of the first-fault cost rather than removing it. They cut the number of faults and TLB misses, but each fault is more expensive: faulting in one 1GB huge page means zero-filling as much memory as 262,144 4KB pages in a single operation. The total work is similar, but it arrives as one long stall instead of many short ones.

mlock() helps with the fault problem but has limits. A successful mlock() faults in every page in the range and pins it in RAM, so later accesses do not fault and the pages cannot be swapped out. The catch is the per-process locked-memory limit (RLIMIT_MEMLOCK): on Linux a call that would exceed it fails with ENOMEM, and the call can also fail with EAGAIN if some pages could not be locked. Default limits are small on many systems, so a 100MB shared memory segment can already be too large for the default setting on shared hosting.

Prefaulting sidesteps the problem by paying the cost upfront. After calling shmat or mmap, touch every page before any worker threads start:

volatile char *p = addr;
for (size_t i = 0; i < size; i += 4096)
    p[i] = p[i];  /* read and write back: faults the page in, keeps its contents */

This triggers all page faults during initialization rather than in the hot path. For latency-sensitive applications, this is almost always the right trade.

Mitigation: Prefault the pages by touching each page immediately after attachment to force them into memory. Use mlock() to lock pages in RAM and prevent paging to disk.
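
To see whether prefaulting pays off on a given system, the fault count can be measured with getrusage(). The sketch below is a rough measurement aid, not a benchmark. The function names are hypothetical, and the counts depend on the kernel, page size, and huge page settings, so no specific numbers are claimed.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Minor page faults this process has taken so far. */
static long minor_faults(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

/* Touch one byte per page of [base, base + size) and return how many minor
 * faults the process took while doing so. */
long faults_while_touching(volatile char *base, size_t size) {
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    long before = minor_faults();
    for (size_t off = 0; off < size; off += (size_t)page)
        base[off] = base[off];  /* read and write back, contents unchanged */
    return minor_faults() - before;
}
```

Calling the function twice on a freshly mapped region typically shows the faults concentrated in the first pass, which is the cost that prefaulting moves out of the hot path.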

Trade-off Table

Feature	mmap (file-backed)	System V shm	POSIX shm_open	Pipe/Message Queue
Data persistence	Yes (file)	No (kernel RAM)	No (kernel RAM)	No
Zero-copy semantics	Yes (after initial map)	Yes (direct access)	Yes (direct access)	No (kernel copy)
Access model	Random access, file-like	Direct memory access	Direct memory access	Sequential (FIFO; by type for message queues)
Typical use case	Memory-mapped I/O, file sharing	High-speed IPC	High-speed IPC	Task distribution
Synchronization	Optional (file locks or external)	Semaphores (sysv)	POSIX mutex/sem	Built-in (send/recv)
Maximum size	Limited by disk space	Limited by kernel limits	Limited by kernel limits	Limited by buffer size
Cleanup model	munmap + file remains	shmctl(IPC_RMID)	shm_unlink()	Auto when all close
Portable	Very portable	Portable	POSIX (widely available)	Very portable

Implementation Snippet(s)

C: Shared Memory with POSIX Mutex Synchronization

// Build with: cc -pthread prod_cons.c (older glibc versions also need -lrt)
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <pthread.h>
#include <time.h>
#include <unistd.h>

#define SHM_NAME "/prod_cons_shm"
#define BUF_SIZE 1024

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    int data_ready;
    char data[BUF_SIZE];
} shared_data_t;

int main(void) {
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0666);
    if (fd == -1) {
        perror("shm_open");
        exit(1);
    }
    if (ftruncate(fd, sizeof(shared_data_t)) == -1) {
        perror("ftruncate");
        exit(1);
    }

    shared_data_t *shm = mmap(NULL, sizeof(shared_data_t),
                              PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
    if (shm == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }
    close(fd);

    // Initialize synchronization primitives (once, before fork)
    pthread_mutexattr_t mattr;
    pthread_condattr_t cattr;
    pthread_mutexattr_init(&mattr);
    pthread_mutexattr_setpshared(&mattr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&shm->mutex, &mattr);
    pthread_mutexattr_destroy(&mattr);
    pthread_condattr_init(&cattr);
    pthread_condattr_setpshared(&cattr, PTHREAD_PROCESS_SHARED);
    pthread_cond_init(&shm->cond, &cattr);
    pthread_condattr_destroy(&cattr);
    shm->data_ready = 0;

    pid_t pid = fork();
    if (pid == -1) {
        perror("fork");
        exit(1);
    }
    if (pid == 0) {
        // Child: producer
        while (1) {
            pthread_mutex_lock(&shm->mutex);
            snprintf(shm->data, BUF_SIZE, "Message at %ld", (long)time(NULL));
            shm->data_ready = 1;
            pthread_cond_signal(&shm->cond);
            pthread_mutex_unlock(&shm->mutex);
            sleep(1);
        }
    } else {
        // Parent: consumer
        while (1) {
            pthread_mutex_lock(&shm->mutex);
            while (!shm->data_ready) {
                pthread_cond_wait(&shm->cond, &shm->mutex);
            }
            printf("Received: %s\n", shm->data);
            shm->data_ready = 0;
            pthread_mutex_unlock(&shm->mutex);
        }
    }

    return 0;
}

Python: Using mmap for Shared Memory

import mmap
import os

# Memory-mapped file as shared memory
FILE_PATH = "/tmp/shared_mmap.bin"
SIZE = 1024 * 1024  # 1MB

# Create the file and map it
fd = os.open(FILE_PATH, os.O_RDWR | os.O_CREAT)
os.ftruncate(fd, SIZE)

mmap_obj = mmap.mmap(fd, SIZE, mmap.MAP_SHARED, mmap.PROT_READ | mmap.PROT_WRITE)

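# Write data
message = b"Shared data via mmap!"
mmap_obj.seek(0)
mmap_obj.write(message)

# Read back exactly the bytes written (readline() would scan the whole
# zero-filled mapping looking for a newline)
mmap_obj.seek(0)
print(mmap_obj.read(len(message)))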

mmap_obj.close()
os.close(fd)
os.unlink(FILE_PATH)

Bash: Using shared memory monitoring tools

# Check existing System V shared memory segments
ipcs -m

# Show detailed info about a specific segment
ipcs -m -i <shmid>

# Remove a stuck segment
ipcrm -m <shmid>

# Check POSIX shared memory
ls /dev/shm/

# Remove POSIX shared memory object
# (call shm_unlink("/name") from C code, or on Linux delete the file directly)
rm /dev/shm/<name>

# Watch shared memory consumption
watch -n 1 'ipcs -m'

Observability Checklist

Shared memory segments: ipcs -m shows all System V shared memory segments with IDs, sizes, attached processes
POSIX shared memory: ls -la /dev/shm/ lists POSIX shared memory objects
Attachment count: ipcs -m shows number of attached processes per segment — 0 means orphaned if not marked for destruction
Page fault analysis: Use perf stat -e page-faults -e minor-faults to measure page fault overhead when first accessing shared segments
Cache coherence metrics: Use perf c2c (where the hardware supports it) or perf stat with cache-miss counters to detect cache line bouncing between cores
Memory usage: Monitor RSS and shared memory size via /proc/<pid>/status and /proc/<pid>/smaps
strace: Trace mmap, munmap, shmget, shmat, shmdt, shmctl system calls to debug mapping issues

Common Pitfalls / Anti-Patterns

No Access Control Beyond Permissions: Shared memory provides no access control beyond Unix permissions — any process with read/write permission to the underlying file or segment can read and modify all data. Use appropriate file permissions (chmod 0600 for exclusive access), set restrictive permissions (0660 or 0664) for System V/POSIX segments with dedicated groups, encrypt sensitive data before placing it in shared memory, use mlock() to prevent paging to disk, and implement application-level authentication if untrusted processes may access segments.

Compliance: Shared memory segments may contain sensitive data in memory dumps (core files). For PCI-DSS, HIPAA, or other compliance regimes, ensure sensitive data is properly protected and core dumps are handled appropriately.

Forgetting synchronization — the most dangerous anti-pattern. Without mutexes, semaphores, or atomic operations protecting shared data, you get race conditions, data corruption, and crashes. Always pair shared memory with synchronization.
False sharing — when two frequently-modified variables happen to share the same cache line, each modification invalidates the other’s cached copy, causing massive performance degradation. Pad your data structures to align frequently-written fields to separate cache lines (64-byte boundaries).
Assuming memory visibility is immediate — when one process writes to shared memory, the other process might not see that write immediately due to CPU caching, compiler optimizations, or store buffers. Use pthread_mutex_lock() which provides memory barriers, or use __sync_synchronize() for explicit barriers.
Checking shmat() against NULL: shmat() returns (void *)-1 on error, not NULL. Check for this correctly: if (addr == (void *)-1) not if (!addr).
Leaving synchronization primitives in inconsistent state: if a process crashes while holding a mutex, other processes block on it forever. Use robust mutexes (PTHREAD_MUTEX_ROBUST): the next pthread_mutex_lock() returns EOWNERDEAD instead of hanging, and the caller can repair the shared data and call pthread_mutex_consistent() to keep using the lock.
Not accounting for different page sizes — on systems with huge pages (2MB or 1GB pages), shared memory alignment requirements may differ. Using huge pages can significantly improve performance for large shared memory regions.
Mismatched MAP_SHARED vs MAP_PRIVATE — two processes mapping the same file with MAP_PRIVATE each get a private copy of the data. Changes made by one process are invisible to the other. Use MAP_SHARED for true sharing.
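
The crashed-lock-holder pitfall above can be reproduced and handled in a few lines. This is a sketch under assumptions: it uses a process-shared robust mutex in an anonymous shared mapping, the function name `lock_after_owner_died` is hypothetical, and the "repair" step is left as a comment because it depends on the application's data.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* A child locks the mutex and exits without unlocking. Returns what the
 * parent's pthread_mutex_lock() reported afterwards, or -1 on setup failure. */
int lock_after_owner_died(void) {
    pthread_mutex_t *m = mmap(NULL, sizeof *m, PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED) return -1;

    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    int err = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    if (err != 0) { munmap(m, sizeof *m); return -1; }

    pid_t pid = fork();
    if (pid == -1) { munmap(m, sizeof *m); return -1; }
    if (pid == 0) {
        pthread_mutex_lock(m);
        _exit(0);  /* simulate a crash while holding the lock */
    }
    waitpid(pid, NULL, 0);

    int rc = pthread_mutex_lock(m);  /* does not hang: reports the dead owner */
    if (rc == EOWNERDEAD) {
        /* Repair the shared data here, then mark the mutex usable again. */
        int fixed = pthread_mutex_consistent(m);
        assert(fixed == 0);
    }
    if (rc == 0 || rc == EOWNERDEAD)
        pthread_mutex_unlock(m);

    pthread_mutex_destroy(m);
    munmap(m, sizeof *m);
    return rc;
}
```

With a non-robust mutex the parent's lock call would block forever. Build with -pthread on systems where the threading functions are not part of the main C library.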

Quick Recap Checklist

Shared memory provides zero-copy IPC by mapping the same physical memory pages into multiple processes’ address spaces
Three mechanisms: mmap() with file backing, System V shmget()/shmat(), and POSIX shm_open()/mmap()
Synchronization is mandatory — shared memory alone does not prevent race conditions. Use mutexes, semaphores, or atomic operations.
Cache coherence between CPU cores is managed by MESI/MOESI protocols; false sharing causes severe performance degradation
mmap() with MAP_SHARED | MAP_ANONYMOUS creates shared memory with no backing file; it is shared with child processes across fork()
System V shared memory persists until explicitly removed with IPC_RMID; POSIX objects persist until shm_unlink()
Memory-mapped files provide persistence and are useful for memory-mapped I/O where writes are eventually flushed to disk
For high-performance inter-process data sharing, shared memory is the fastest option but requires careful synchronization design

Interview Questions

1. How does shared memory work at the hardware level?

When a process calls shmat() or mmap() with a shared mapping, the kernel creates or updates page table entries that map the process's virtual address to the same physical page frames. Multiple processes thus have different virtual addresses that map to the same physical memory. The CPU's MMU (Memory Management Unit) performs the translation from virtual address to physical address on every memory access. When processes on different CPU cores access the same physical memory location simultaneously, their respective CPU caches may both hold the cache line. Cache coherence protocols (MESI or MOESI) ensure that writes from one core invalidate the stale copy in another core's cache. The kernel is involved only in setting up the page table entries — after that, all data movement happens directly between the CPU cache and main memory, without kernel intervention.

2. What is false sharing and how can you avoid it?

False sharing occurs when two processes (or threads) modify different variables that happen to live on the same CPU cache line (typically 64 bytes). When one process writes to its variable, the entire cache line is invalidated in the other core's cache. When the other process next reads or writes its own variable, it must re-fetch the line from the writer's cache or from memory, incurring significant latency. The line bounces between cores on every write, which can make a shared memory program run slower than if the data were not shared at all.

To avoid false sharing:

Pad data structures to ensure frequently-written fields are on different cache lines
Use compiler attributes or manual alignment (__attribute__((aligned(64)))) to control placement
Separate hot fields into their own structures that can be placed on independent cache lines
Use profiling tools (Intel VTune, perf) to identify cache line bouncing in high-contention scenarios
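The padding advice above can be expressed directly in C11. This sketch assumes a 64-byte cache line (typical on x86-64, but worth verifying for your CPU); the struct and field names are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size; check your target CPU */

/* Aligning the member to 64 bytes also pads the struct out to 64 bytes,
 * so each counter occupies a cache line of its own. */
struct padded_counter {
    alignas(CACHE_LINE) atomic_ulong value;
};

/* Hypothetical shared layout: one writer per field, no line shared. */
struct shared_stats {
    struct padded_counter produced;   /* written only by the producer */
    struct padded_counter consumed;   /* written only by the consumer */
};

/* Compile-time checks, so a later edit cannot silently undo the padding. */
static_assert(sizeof(struct padded_counter) == CACHE_LINE,
              "counter must fill exactly one cache line");
static_assert(offsetof(struct shared_stats, consumed) -
              offsetof(struct shared_stats, produced) == CACHE_LINE,
              "fields must sit on different cache lines");
```

Because mmap() and shmat() return page-aligned addresses, placing struct shared_stats at the start of the region preserves the 64-byte alignment.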

3. What is the difference between MAP_SHARED and MAP_PRIVATE in mmap?

MAP_SHARED creates a mapping where modifications are visible to other processes that have mapped the same region and are written back to the underlying file or backing store. For file-backed mappings, changes eventually reach the disk. This is the mode you want for true shared memory IPC.

MAP_PRIVATE creates a copy-on-write mapping — modifications are only visible to the current process and affect a private copy of the data. The underlying file or backing store is not modified. Other processes mapping the same file see the original content. Private mappings are useful for loading program code or data without affecting the original file, but they are useless for IPC.

Using MAP_PRIVATE when you intended MAP_SHARED is a common bug — you will see processes modifying the same file but never seeing each other's changes.
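The difference is easy to observe with a forked child. The sketch below uses an anonymous mapping for brevity (the same behaviour applies to file-backed mappings); the function name child_write_visible is hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS under strict -std=c11 */
#endif
#include <assert.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a write made by a forked child is visible to the parent,
 * 0 if it is not, and -1 on setup failure.
 * share_flag must be MAP_SHARED or MAP_PRIVATE. */
static int child_write_visible(int share_flag)
{
    assert(share_flag == MAP_SHARED || share_flag == MAP_PRIVATE);

    int *cell = mmap(NULL, sizeof *cell, PROT_READ | PROT_WRITE,
                     share_flag | MAP_ANONYMOUS, -1, 0);
    if (cell == MAP_FAILED)
        return -1;
    *cell = 0;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(cell, sizeof *cell);
        return -1;
    }
    if (pid == 0) {            /* child: modify the mapping and exit */
        *cell = 1;
        _exit(0);
    }
    waitpid(pid, NULL, 0);     /* parent: wait, then inspect its own view */

    int seen = *cell;
    munmap(cell, sizeof *cell);
    return seen;
}
```

Calling it with MAP_SHARED reports the child's write; calling it with MAP_PRIVATE does not, because the child's write went to its own copy-on-write page.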

4. Why is synchronization still necessary with shared memory?

Shared memory provides a communication channel but does not inherently prevent concurrent access. Without synchronization:

Read-modify-write races: Two processes reading the same counter value, incrementing it locally, and writing back — losing one increment
Partial write visibility: A process interrupted mid-write leaves the data structure in a partially updated state that another process reads
CPU cache coherency issues: Writes may sit in a store buffer or cache and not be immediately visible to another core reading from its own cache

Synchronization primitives (mutexes, semaphores, atomic operations) provide both mutual exclusion (only one process in the critical section at a time) and memory barriers (ensuring writes are visible before the lock is released). The combination prevents all the failure modes above. Use pthread_mutex with PTHREAD_PROCESS_SHARED attribute, or System V semaphores, or GCC atomics for lock-free algorithms.
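For the read-modify-write race on a counter, an atomic increment is often enough. The sketch below assumes atomic_long is lock-free on the target (true on mainstream 64-bit platforms), which is what makes it safe to use across processes; the function name count_with_atomics is hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS under strict -std=c11 */
#endif
#include <assert.h>
#include <stdatomic.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks nprocs children; each adds 1 to a shared counter iters times.
 * Returns the final count, or -1 on failure. */
static long count_with_atomics(int nprocs, int iters)
{
    assert(nprocs > 0 && iters > 0);

    atomic_long *counter = mmap(NULL, sizeof *counter, PROT_READ | PROT_WRITE,
                                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (counter == MAP_FAILED)
        return -1;
    atomic_init(counter, 0);

    int started = 0;
    for (; started < nprocs; started++) {
        pid_t pid = fork();
        if (pid < 0)
            break;
        if (pid == 0) {
            for (int i = 0; i < iters; i++)
                atomic_fetch_add(counter, 1);   /* indivisible increment */
            _exit(0);
        }
    }
    for (int i = 0; i < started; i++)
        wait(NULL);                             /* reap every child */

    long total = (started == nprocs) ? atomic_load(counter) : -1;
    munmap(counter, sizeof *counter);
    return total;
}
```

Replacing atomic_fetch_add with a plain load, add, and store would reintroduce the lost-update race described above. Atomics only cover single-variable updates; multi-field invariants still need a lock or a versioning scheme.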

5. How do you clean up orphaned shared memory segments?

Orphaned shared memory segments accumulate when processes terminate without properly detaching and removing the segment. To clean them up:

System V shared memory:

List all segments: ipcs -m
Find segments with nattch = 0 (no attached processes). A segment already marked for removal shows dest in the status column and is destroyed automatically once the last process detaches
Remove manually: ipcrm -m <shmid>

POSIX shared memory:

List objects: ls /dev/shm/
Remove with shm_unlink("/name") from a C program, or delete the corresponding file under /dev/shm/ with rm; removing an object owned by another user requires that user's permissions or root

Prevention: Use signal handlers (SIGTERM, SIGINT) that call cleanup functions, use wrapper frameworks that track shared memory lifecycle, and always test for orphaned segments at application startup. For long-running services, implement a startup check that cleans stale segments.

6. What is the difference between System V shared memory and POSIX shared memory?

System V shared memory uses the shmget(), shmat(), shmdt(), and shmctl() API and is identified by an integer ID (shmid). It persists in the kernel until explicitly removed with IPC_RMID. POSIX shared memory uses shm_open() and shm_unlink() with filesystem-like path names (e.g., /my_shm), making it feel more familiar to developers used to file-based APIs.

Key differences: POSIX is generally considered more modern and easier to use, with better integration with mmap(). System V is older, slightly more portable to legacy Unix systems, and offers more granular control via shmctl(). POSIX shared memory objects appear in /dev/shm/ on Linux. Both are kernel-backed and do not persist across reboots.

7. How does mmap with MAP_ANONYMOUS differ from file-backed mmap for shared memory?

MAP_ANONYMOUS creates a mapping backed by swap space (or RAM) with no underlying file — the fd parameter to mmap() is ignored. This is useful for pure inter-process communication where persistence is not needed. The data never touches disk except when the system swaps.

File-backed mmap() associates the mapping with a file, so changes are eventually flushed to the filesystem. This provides persistence but adds filesystem overhead (dirty-page writeback, plus journal updates if the filesystem has one). Anonymous mappings are typically faster because they bypass the filesystem entirely. Both can be shared across processes with MAP_SHARED, with one difference: an anonymous mapping has no name, so it can only be shared with child processes that inherit it across fork(), whereas unrelated processes can open and map the same file.

8. What are the security risks of shared memory and how do you mitigate them?

Shared memory has no built-in access control beyond basic Unix file permissions — any process that can open the backing file or reach the shared memory ID can read and modify all data. This creates several risks: data leakage between processes running different privilege levels, tampering by untrusted processes, and denial of service if a malicious process corrupts shared state.

Mitigations: use restrictive permissions (0600) on backing files and on POSIX and System V shm objects; when several accounts must share a region, use 0660 with a dedicated group rather than granting world access; encrypt sensitive data at the application layer before placing it in shared memory; use mlock() to prevent paging to swap where sensitive data could be recovered; implement process authentication within the shared memory protocol itself.

9. What is the difference between MESI and MOESI cache coherence protocols?

MESI (Modified, Exclusive, Shared, Invalid) is a four-state protocol. A cache line can be: Modified (dirty, exclusive to this core, needs write-back), Exclusive (clean, exclusive to this core), Shared (clean, potentially in other cores' caches), or Invalid (not present or stale). When a core writes to a Shared line, it must first send invalidations to all other cores holding that line.

MOESI adds an Owned state. A line in the Owned state is dirty (main memory is stale and the line still needs write-back), while other cores may hold up-to-date copies of it in the Shared state. This avoids the need to write back to memory before responding to a read from another core: the owner supplies the data directly and remains responsible for the eventual write-back. AMD processors use MOESI; Intel historically used MESI (with modifications). Both protocols solve the same problem: keeping multiple CPU caches coherent for the same physical memory location.

10. How does a page fault occur when accessing shared memory for the first time?

When a process first attaches a shared memory segment, the virtual addresses are mapped but the physical pages may not yet be in RAM. On first access to a page, the MMU triggers a page fault because the page table entry is not present or not valid. The kernel handles this via the page fault handler: it allocates a physical page, fills it with data (from the backing file or zero-initializes for anonymous memory), updates the page table entry, and resumes the process. This happens transparently.

For large shared memory segments, the first-access page fault overhead can be significant (one interrupt and kernel allocation per page). Prefaulting — touching every page immediately after attachment to force them into memory — eliminates this latency spike at the cost of upfront time. mlock() can lock pages to prevent them from being swapped out after faulting in.
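Prefaulting as described above can be sketched in a few lines. The helper names prefault and map_prefaulted are hypothetical, and the 4096-byte fallback page size is an assumption used only if sysconf() fails.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS under strict -std=c11 */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Touches one byte in every page of [addr, addr + len) so the kernel
 * allocates the pages now rather than on first real use.
 * Returns the number of pages touched. Call it before other processes
 * start writing: the read-then-write-back below is not atomic. */
static size_t prefault(void *addr, size_t len)
{
    assert(addr != NULL);

    long page = sysconf(_SC_PAGESIZE);
    size_t step = page > 0 ? (size_t)page : 4096;   /* fallback is an assumption */
    volatile unsigned char *p = addr;
    size_t touched = 0;

    for (size_t off = 0; off < len; off += step) {
        p[off] = p[off];   /* a write fault forces a real page to be allocated */
        touched++;
    }
    return touched;
}

/* Maps len bytes of shared anonymous memory and prefaults all of it.
 * Stores the page count in *pages; returns NULL on failure. */
static void *map_prefaulted(size_t len, size_t *pages)
{
    assert(pages != NULL);

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    *pages = prefault(p, len);
    return p;
}
```

On Linux, passing MAP_POPULATE to mmap() asks the kernel to do the same work at mapping time, which avoids the user-space loop.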

11. What is the impact of huge pages (hugetlb) on shared memory performance?

Huge pages (2MB or 1GB on x86_64) reduce Translation Lookaside Buffer (TLB) pressure for large shared memory regions. Each TLB entry covers one page, so a 1GB region with 4KB pages needs 262,144 TLB entries — far more than most TLBs hold — causing TLB misses that require expensive page table walks. With 1GB huge pages, the same region needs only 1 TLB entry.

However, huge pages are harder to allocate (they require contiguous physical memory, which becomes scarce as memory fragments over time) and require explicit configuration (sysctl vm.nr_hugepages). For database buffer pools and other large shared memory use cases, huge pages can improve performance by reducing TLB miss overhead; gains in the 10-20% range are plausible but workload-dependent, so measure before committing. Use MAP_HUGETLB with mmap() to request huge pages for shared memory.
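Because a MAP_HUGETLB request fails when no huge pages are reserved, code that wants them usually needs a fallback. The following is a Linux-specific sketch; the function name map_shared_prefer_huge is hypothetical and the 2MB huge page size is an assumption that holds for the x86_64 default.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes MAP_HUGETLB and MAP_ANONYMOUS (Linux) */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#define HUGE_2MB ((size_t)2 << 20)   /* assumed default huge page size */

/* Maps at least len bytes of shared anonymous memory, preferring huge
 * pages and falling back to normal pages if none are available.
 * *mapped_len receives the rounded-up length to pass to munmap();
 * *used_huge reports which path succeeded. Returns NULL on failure. */
static void *map_shared_prefer_huge(size_t len, size_t *mapped_len,
                                    int *used_huge)
{
    assert(len > 0 && mapped_len != NULL && used_huge != NULL);

    /* hugetlb mappings must be a multiple of the huge page size */
    size_t rounded = (len + HUGE_2MB - 1) / HUGE_2MB * HUGE_2MB;
    int base = MAP_SHARED | MAP_ANONYMOUS;

    void *p = mmap(NULL, rounded, PROT_READ | PROT_WRITE,
                   base | MAP_HUGETLB, -1, 0);
    *used_huge = (p != MAP_FAILED);
    if (p == MAP_FAILED)   /* no huge pages reserved, or not permitted */
        p = mmap(NULL, rounded, PROT_READ | PROT_WRITE, base, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    *mapped_len = rounded;
    return p;
}
```

The arithmetic behind the TLB argument is the same rounding in reverse: a 1GB region is 1GB / 4KB = 262,144 small pages but only 1GB / 2MB = 512 huge pages of the 2MB size.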

12. How do you handle partial updates to multi-field data structures in shared memory?

Partial writes are dangerous because a reader may see an inconsistent state if a writer is interrupted mid-update. Solutions:

Atomic fields: use atomic types (C11 _Atomic, GCC __sync intrinsics) for individual fields. This does not protect multi-field updates.
Version numbering: the writer increments a version counter before it starts (making it odd) and again when it finishes (making it even); readers check the version before and after reading to detect mid-update reads.
Copy-on-write: writers copy the entire structure, modify the copy, then atomically swap pointers.
Double buffering: maintain two buffers; writers always write to the inactive buffer, then atomically switch the active pointer.

The version-number approach is widely used (the Linux kernel calls it a seqlock). The reader records the version, copies the data, then checks the version again; if the version was odd or changed during the copy, it retries. This handles both torn reads and writers interrupted mid-update, provided there is only one writer at a time.

13. What is the difference between robust mutexes and regular mutexes in shared memory contexts?

A regular pthread_mutex in a shared memory segment becomes undefined if the process that holds it terminates without releasing it — other processes waiting on that mutex will wait forever. A robust mutex (PTHREAD_MUTEX_ROBUST) handles this: if the owning process dies, the next call to pthread_mutex_lock() returns EOWNERDEAD instead of deadlocking, allowing the caller to recover the mutex state.

To use robust mutexes across processes, initialize the mutex with PTHREAD_MUTEX_ROBUST attribute and PTHREAD_PROCESS_SHARED. Always check the return value of pthread_mutex_lock() — EOWNERDEAD means the previous owner died and you should call pthread_mutex_consistent() to make the mutex consistent again before continuing.
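The initialization and recovery steps can be sketched as follows. The helper names are hypothetical, and the demo simulates a crash by having a child exit while holding the lock. On older glibc versions this needs to be linked with -pthread.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes MAP_ANONYMOUS and the robust mutex API */
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Initialises a robust, process-shared mutex in already-shared memory. */
static int init_robust_shared_mutex(pthread_mutex_t *m)
{
    assert(m != NULL);
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    if (rc == 0)
        rc = pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Locks m. Returns 0 for a normal lock, EOWNERDEAD if the previous owner
 * died (the caller now holds the lock and must repair the protected
 * data), or another error number on failure. */
static int lock_with_recovery(pthread_mutex_t *m)
{
    int rc = pthread_mutex_lock(m);
    if (rc == EOWNERDEAD) {
        int fix = pthread_mutex_consistent(m);   /* make it usable again */
        if (fix != 0)
            return fix;
    }
    return rc;
}

/* Demo: a child takes the lock and exits without unlocking.
 * Returns what the parent's lock attempt reported, or -1 on setup failure. */
static int demo_owner_death(void)
{
    pthread_mutex_t *m = mmap(NULL, sizeof *m, PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED)
        return -1;
    int rc = -1;
    pid_t pid = init_robust_shared_mutex(m) == 0 ? fork() : -1;
    if (pid == 0) {
        pthread_mutex_lock(m);
        _exit(0);                /* simulated crash while holding the lock */
    }
    if (pid > 0) {
        waitpid(pid, NULL, 0);
        rc = lock_with_recovery(m);
        if (rc == 0 || rc == EOWNERDEAD)
            pthread_mutex_unlock(m);
        pthread_mutex_destroy(m);
    }
    munmap(m, sizeof *m);
    return rc;
}
```

Marking the mutex consistent only repairs the lock. Whatever data the dead process was updating may still be half-written, so the recovering process has to validate or rebuild it before unlocking.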

14. How does NUMA awareness affect shared memory performance on multi-socket systems?

On NUMA (Non-Uniform Memory Access) systems, memory attached to socket A is faster to access from cores on socket A than from cores on socket B. When multiple processes on different sockets share memory, accesses from the "wrong" socket incur cross-socket memory latency (100+ nanoseconds vs ~50 nanoseconds for local access).

To optimize: use mbind() and set_mempolicy() to bind shared memory pages to a specific node, or use libnuma for easier control. The Linux kernel can also migrate pages automatically (automatic NUMA balancing, controlled by kernel.numa_balancing), but explicit placement gives more control. For latency-critical shared memory (trading systems, real-time databases), NUMA-aware placement is essential for predictable performance.

15. What is the purpose of shmget IPC_PRIVATE and when would you use it?

IPC_PRIVATE as the key argument to shmget() creates a shared memory segment that is not accessible by any other process through the System V IPC key mechanism. The returned shmid is the only way to access the segment. This is useful when you want to create a shared memory segment that is only inherited or passed explicitly — for example, after fork(), the child inherits access to segments the parent created.

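The typical pattern: parent calls shmget(IPC_PRIVATE, size, IPC_CREAT | 0600), then shares the resulting segment with child processes via fork() inheritance or by passing the shmid over a pipe. No other process can look the segment up, because there is no well-known key. The shmid itself is not a secret, though: any process that learns or guesses it can still call shmat() if the permission bits allow, which is why the mode should be 0600 rather than 0666. With restrictive permissions, this is more secure than a named segment for parent-child-only sharing.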
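A minimal sketch of the parent-child pattern follows. The function name private_segment_roundtrip is hypothetical. Marking the segment with IPC_RMID right after attaching is a common idiom: existing attachments stay valid, and the kernel destroys the segment once the last process detaches, so a crash cannot leak it.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* keeps <sys/shm.h> usable under strict -std=c11 */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

/* Parent creates a private segment, forks, and reads what the child wrote.
 * Returns the value the child stored, or -1 on failure. */
static int private_segment_roundtrip(int value)
{
    assert(value >= 0);   /* -1 is reserved for errors */

    int shmid = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600);
    if (shmid == -1)
        return -1;

    int *cell = shmat(shmid, NULL, 0);
    /* Mark for removal now; the segment lives until the last detach. */
    shmctl(shmid, IPC_RMID, NULL);
    if (cell == (void *)-1)
        return -1;
    *cell = 0;

    pid_t pid = fork();
    if (pid < 0) {
        shmdt(cell);
        return -1;
    }
    if (pid == 0) {            /* child inherits the attachment */
        *cell = value;
        shmdt(cell);
        _exit(0);
    }
    waitpid(pid, NULL, 0);

    int seen = *cell;
    shmdt(cell);               /* last detach: the kernel frees the segment */
    return seen;
}
```

The child never calls shmget() or shmat(); it simply uses the address it inherited across fork().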

16. Can shared memory be used for communication between a parent process and its forked child without IPC_PRIVATE?

Yes. System V segments are not file descriptors, but attachments are inherited: after fork(), the child has the same segments attached at the same addresses as the parent. So if the parent creates a shared memory segment with shmget() and attaches it with shmat() before forking, the child automatically has access to the same segment; no explicit sharing mechanism is needed beyond the fork itself.

For mmap()-based shared memory, the mapping is also inherited across fork(). What happens on write depends on the mapping type. With MAP_SHARED, parent and child keep referring to the same physical pages, and writes by either are visible to the other. With MAP_PRIVATE, the pages are shared copy-on-write: the first write by either process creates a private copy for that process, so later changes are not visible to the other. Only MAP_SHARED mappings are usable for parent-child communication.

17. How does msync differ from regular writes to shared memory, and when should you use it?

For mmap() file-backed shared memory, writes to the mapped address go through the CPU cache and are eventually propagated to the page cache in the kernel — but not necessarily to the disk immediately. msync() forces the kernel to flush changes to the underlying file, either synchronously (MS_SYNC, which blocks until all data is written) or asynchronously (MS_ASYNC, which returns immediately while the kernel writes in the background).

Use msync(MS_SYNC) when you need durability guarantees, for example at a checkpoint that must survive a crash or power loss. It is not needed for visibility: other processes that map the same file with MAP_SHARED see your writes through the shared pages without any msync(). MS_ASYNC is useful for periodic checkpoints where you want to push data to disk eventually without blocking. Without msync(), data may live only in the page cache and can be lost on a system crash or power failure. For pure IPC where the file is just the backing store, not the primary storage, msync() is rarely needed.
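A checkpoint-style write with msync(MS_SYNC) can be sketched as follows. The function name write_durably is hypothetical, and the sketch truncates and rewrites the whole file for simplicity.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes ftruncate() under strict -std=c11 */
#endif
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Writes msg (including its terminating NUL) into the file at path
 * through a shared mapping, then blocks until the kernel has written
 * the data back to storage. Returns 0 on success, -1 on failure. */
static int write_durably(const char *path, const char *msg)
{
    assert(path != NULL && msg != NULL);
    size_t len = strlen(msg) + 1;

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;
    if (ftruncate(fd, (off_t)len) == -1) {   /* file must cover the mapping */
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                               /* the mapping keeps the file open */
    if (map == MAP_FAILED)
        return -1;

    memcpy(map, msg, len);                   /* lands in the page cache */
    int rc = msync(map, len, MS_SYNC);       /* blocks until written back */
    munmap(map, len);
    return rc;
}
```

Swapping MS_SYNC for MS_ASYNC turns the call into a non-blocking request, at the cost of not knowing when the data has actually reached storage.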

18. What is the maximum size of shared memory on Linux and how is it configured?

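System V shared memory limits: SHMMAX (max size of a single segment in bytes) and SHMALL (total system-wide limit in pages). Since Linux 3.16 both default to values so large that they are effectively unlimited; older kernels shipped with a much smaller SHMMAX (32MB), which is why legacy tuning guides often tell you to raise it. POSIX shared memory is limited by the size of the tmpfs mounted at /dev/shm (by default half of physical RAM). The System V limits are tunable via /proc/sys/kernel/shmmax and /proc/sys/kernel/shmall.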

To check current limits: ipcs -l shows System V limits. To increase: sysctl -w kernel.shmmax=<bytes> and sysctl -w kernel.shmall=<pages>. The physical RAM and swap size constrain the total — you cannot allocate more shared memory than available RAM + swap. On 32-bit systems, the address space itself (typically 3-4GB) may be the limiting factor before the kernel settings.

19. How do you implement a lock-free producer-consumer pattern using shared memory?

Lock-free designs use atomic operations instead of locks. For a bounded buffer in shared memory, use a ring buffer with atomic head and tail indices: producers atomically increment the tail index to claim a slot, write data, then mark the slot as valid. Consumers atomically increment the head index to claim a slot, read data, then mark the slot as empty. Compare-and-swap (CAS) operations handle the claim phase safely across processes.

In C, GCC provides __sync_bool_compare_and_swap() for CAS. C11 provides atomic_compare_exchange_strong() in <stdatomic.h>. The key is ensuring the valid/empty marking is also atomic — either use separate atomic flags per slot, or make the "empty" value a special sentinel that both head and tail understand. Lock-free algorithms are complex and must be carefully verified; they are generally faster than mutex-based approaches under high contention but much harder to implement correctly.
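The multi-producer design described above needs CAS and per-slot flags. A simpler variant, shown here as a sketch, restricts the ring to one producer and one consumer, which removes the claim phase entirely: each index has exactly one writer, so acquire and release ordering is sufficient. The names spsc_ring, ring_push, and ring_pop are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 8u   /* must be a power of two for the index mask */

struct spsc_ring {
    atomic_size_t head;       /* next slot to read; written only by the consumer */
    atomic_size_t tail;       /* next slot to write; written only by the producer */
    int slots[RING_SLOTS];
};

static void ring_init(struct spsc_ring *r)
{
    assert(r != NULL);
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side. Returns false if the ring is full. */
static bool ring_push(struct spsc_ring *r, int value)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_SLOTS)
        return false;
    r->slots[tail & (RING_SLOTS - 1)] = value;
    /* release: the slot contents become visible before the new tail */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side. Returns false if the ring is empty. */
static bool ring_pop(struct spsc_ring *r, int *out)
{
    assert(out != NULL);
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->slots[head & (RING_SLOTS - 1)];
    /* release: the slot is handed back only after it has been read */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

Placed in a MAP_SHARED region, the struct works between two processes as long as size_t atomics are lock-free on the target. In a real deployment head and tail would also be padded onto separate cache lines to avoid false sharing.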

20. What happens when two processes try to attach the same shared memory segment at different virtual addresses?

This is completely normal. Each process can call shmat() with a NULL address (letting the kernel choose) or with an explicit address hint, and the kernel maps the same physical pages at whatever virtual address results in each process. The virtual addresses are independent; what matters is that they both map to the same physical page frames.

For example, Process A might attach at 0x7f0000000000 and Process B at 0x556000000000. Both point to the same physical memory. The MMU handles translation independently for each process. One practical consequence: a raw pointer stored inside the segment is only meaningful in the process that wrote it, so shared data structures should store offsets from the start of the segment instead. To confirm that two processes really share the same pages, compare the page frame numbers in /proc/<pid>/pagemap (reading frame numbers requires root on modern kernels); /proc/<pid>/smaps shows each mapping and its shared and private page counts but not physical addresses.
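The same effect can be observed inside a single process by mapping one file page twice. This is a sketch for illustration; the function name two_views_one_page and the scratch file template under /tmp are assumptions.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes mkstemp() and ftruncate() under -std=c11 */
#endif
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Maps the same file page at two virtual addresses and checks that a
 * write through one is visible through the other.
 * Returns 1 if so, 0 if not, and -1 on setup failure. */
static int two_views_one_page(void)
{
    char path[] = "/tmp/shm_views_XXXXXX";   /* hypothetical scratch file */
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);   /* the file lives on until the last mapping is gone */

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || ftruncate(fd, (off_t)page) == -1) {
        close(fd);
        return -1;
    }
    size_t len = (size_t)page;

    char *a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *b = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (a == MAP_FAILED || b == MAP_FAILED) {
        if (a != MAP_FAILED) munmap(a, len);
        if (b != MAP_FAILED) munmap(b, len);
        return -1;
    }

    a[0] = 'X';                                /* write through the first view */
    int ok = (a != b) && (b[0] == 'X');        /* read through the second */

    munmap(a, len);
    munmap(b, len);
    return ok;
}
```

The two pointers differ, yet both refer to the same page, which is exactly the situation two separate processes are in after attaching the same segment.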

Conclusion

Shared memory delivers the highest throughput of any IPC mechanism by eliminating kernel-mediated data copies. Multiple processes access the same physical memory pages through their own virtual address spaces, with the MMU handling address translation and CPU cache coherence protocols maintaining consistency. This raw speed comes at the cost of programmer-managed synchronization — there are no locks or queues to serialize access, so you must provide your own.

Related topics that build on shared memory include memory-mapped I/O, NUMA-aware data placement, and persistent memory (PMEM) architectures. Database systems like PostgreSQL and Oracle use shared memory extensively for buffer pools, while real-time trading systems use it for low-latency data distribution between processes.

For continued learning, explore how the Linux kernel implements shared memory under the hood (the shmem filesystem), study NUMA-aware memory allocation strategies for multi-socket systems, and investigate persistent memory programming models where shared regions survive system reboots.
