Fork & Exec System Calls
fork() duplicates a running process, then exec() replaces it with a new program. Together they power every shell, web server, and daemon on Unix-like systems.
Every time you type a command in a terminal, run a background service, or spawn a worker process in a web server, two Unix system calls are doing the heavy lifting: fork() and exec(). Together, they form the foundation of process creation in every Unix-like operating system — Linux, macOS, BSD, and even the Android kernel.
If you have been running programs without understanding these calls, you are missing a big piece of the picture. Once you see how fork() duplicates a running program and how exec() swaps it for something new, the entire process model makes sense. Engineers who understand these calls debug systems properly. Everyone else just restarts services and hopes for the best.
The Anatomy of fork()
fork() is a strange beast. It takes zero arguments and returns twice — once in the original process (the parent) and once in the brand-new process (the child). That is not a typo.
The way this works is that fork() creates a duplicate of the calling process. The child gets an exact copy of the parent’s address space — its memory, its open files, its variables. Everything. The child starts executing at the exact instruction where fork() returned in the parent. The only difference is the return value.
Return Values Tell You Who You Are
This is the key to understanding every fork()-based program:
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
int main() {
    pid_t pid = fork();
    if (pid < 0) {
        // fork() failed: process limit reached or out of memory
        perror("fork failed");
        return 1;
    } else if (pid == 0) {
        // We are the child process
        printf("I am the child. My PID is %d\n", getpid());
    } else {
        // We are the parent process
        printf("I am the parent. My child's PID is %d\n", pid);
    }
    return 0;
}
The return value in the child is 0. The return value in the parent is the child’s PID — a positive integer. The parent needs the child’s PID to manage it (wait for it, send signals to it). The child does not know its own PID from the return value — it calls getpid() if it needs to know.
If fork() returns -1, the call failed entirely and no process was created. Common failure reasons include hitting the process limit (EAGAIN) or running out of virtual memory.
Copy-on-Write: Why fork() Does Not Destroy Performance
A naive reading of fork() suggests it must copy the entire address space — all memory pages — from parent to child. For a process with hundreds of megabytes of memory, that would be devastatingly slow and would waste enormous amounts of RAM.
Unix kernels solve this with copy-on-write (COW). At the moment of fork(), the kernel does not copy the memory pages at all. Instead, both processes share the same physical pages. The kernel marks those pages as read-only. As long as both processes only read from their memory, nothing needs to be copied.
The moment either process tries to write to a page, the write is trapped by the CPU. The kernel then allocates a new physical page, copies the original content there, and remaps the writing process’s page table to point at the new page. From that point on, the two processes have independent copies of that page.
This means fork() is fast — it does not need to copy all the data immediately. It only needs to copy the page tables and mark pages as read-only. The actual copying is deferred until a write is necessary, and in many programs, large portions of memory are never written at all.
The Address Space Duplication
When fork() returns in the child, the child has an identical but independent copy of:
- The code segment (the program’s executable instructions)
- The data segment (global and static variables)
- The heap (dynamically allocated memory)
- The stack (local variables and return addresses)
- Open file descriptors (pointing to the same file table entries)
The child does not share memory with the parent — it has its own separate address space. But the contents at the moment of fork() are identical. If you want to understand what goes into a Process Control Block and how the OS tracks all this state, see the Process Concept post.
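To make the independence concrete, here is a minimal sketch in which the child overwrites a variable and the parent checks its own copy afterwards. The helper name `parent_value_after_child_write` is hypothetical, chosen only for this illustration.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the parent's view of `value` after the
 * child has overwritten its own copy, or -1 if fork()/waitpid() fails. */
int parent_value_after_child_write(void) {
    int value = 1;                 /* on the stack, duplicated by fork() */

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        value = 99;                /* lands in the child's private copy */
        _exit(value == 99 ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return value;                  /* still 1 in the parent */
}
```

The parent waits for the child before reading `value`, so the result does not depend on scheduling order.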
The exec() Family: Replacing the Program Image
fork() alone is not enough to run a different program. It only duplicates the calling process. To run an actual different program (say, invoking ls from a shell) the child process needs to replace its address space with the new program’s code and data. That is the job of exec().
There are six standard functions in the exec() family, all front ends to the same kernel service:
| Function | Arguments | Example |
|---|---|---|
| execl() | list | execl("/bin/ls", "ls", "-la", NULL); |
| execv() | array | execv("/bin/ls", argv); |
| execle() | list + env | execle("/bin/ls", "ls", NULL, envp); |
| execve() | array + env | execve("/bin/ls", argv, envp); |
| execlp() | list + PATH | execlp("ls", "ls", "-la", NULL); |
| execvp() | array + PATH | execvp("ls", argv); |
The p variants search PATH so you can run ls without typing /bin/ls. The e variants pass a custom environment instead of inheriting the parent’s.
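As a rough illustration of the "v" and "e" suffixes together, the sketch below runs a shell script through execve() with an explicit argument array and an explicit environment. The helper `run_script_with_env` and the variable name `GREETING` are made up for this example, and it assumes a POSIX shell at /bin/sh.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs `sh -c script` with exactly the environment in
 * envp and returns the script's exit code, or -1 on failure. */
int run_script_with_env(const char *script, char *const envp[]) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        char *const argv[] = { "sh", "-c", (char *)script, NULL };
        execve("/bin/sh", argv, envp);   /* "v": argv array, "e": explicit env */
        _exit(127);                      /* reached only if execve() failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling it with `envp` set to `{ "GREETING=hi", NULL }` and the script `test "$GREETING" = hi` should return 0, because the child sees only the variables passed in `envp`.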
What exec() Actually Does
Calling exec() does not create a new process. It replaces the current process’s address space with the code and data of the executable file. The PID does not change. Open file descriptors remain open unless they carry the close-on-exec flag (FD_CLOEXEC, usually set by opening with O_CLOEXEC). The process simply stops running its old program and starts running the new one from the entry point.
If exec() succeeds, the function never returns, because the old program is gone. If exec() returns at all, it failed: it returns -1 with errno set, and execution continues in the old program.
#include <stdio.h>
#include <unistd.h>
int main() {
    printf("About to exec ls...\n");
    // Flush now: exec discards any output still sitting in the stdio buffer
    fflush(stdout);
    execlp("ls", "ls", "-la", NULL);
    // If we reach here, exec failed
    perror("exec failed");
    return 1;
}
The fork()+exec() Pattern: How Shells and Servers Work
Every time you run a command in a shell, the shell performs the classic fork()+exec() sequence:
- The shell calls fork() to create a child process.
- The child calls exec() to replace its own address space with the program you requested.
- The shell calls wait() to suspend itself until the child finishes.
The shell does not run your command directly — it first duplicates itself, then swaps the duplicate for your program. This separation is deliberate. It means the shell’s own address space stays intact, ready to parse the next command.
A Complete fork()+exec()+wait() Example
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
int main() {
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork failed");
        return 1;
    }
    if (pid == 0) {
        // Child process: replace with ls
        execlp("ls", "ls", "-la", NULL);
        // exec failed if we reach here
        perror("exec failed");
        // _exit() avoids flushing stdio buffers copied from the parent
        _exit(127);
    }
    // Parent process: wait for child to finish
    int status;
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid failed");
        return 1;
    }
    if (WIFEXITED(status)) {
        printf("Child exited with code %d\n", WEXITSTATUS(status));
    }
    return 0;
}
How Web Servers Spawn Workers
Web servers like Apache and Nginx use the same building block, just at a larger scale. At startup, the master process binds to port 80 and then calls fork() to create worker processes. Each worker inherits the listening socket, so it does not need to rebind and can call accept() on it directly. Classic forking servers call fork() once per incoming connection, while Apache’s prefork MPM and Nginx create a pool of workers up front and reuse them across requests. After fork(), the child either calls exec() to run a different program or, as in many modern servers, keeps running the same server code in worker mode.
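The inheritance of a listening socket can be sketched as follows. This is not how Apache or Nginx are actually written; it is a simplified illustration in which a hypothetical `prefork_workers` function binds a loopback socket on a kernel-chosen port, forks a few workers, and has each worker merely confirm that the descriptor it inherited is a listening socket.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical sketch: bind a loopback listening socket, fork `n` workers,
 * and return how many workers confirmed they inherited it (-1 on error). */
int prefork_workers(int n) {
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (listen_fd < 0)
        return -1;

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                       /* let the kernel pick a port */
    if (bind(listen_fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(listen_fd, 16) < 0) {
        close(listen_fd);
        return -1;
    }

    int started = 0;
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;
        if (pid == 0) {
            /* Worker: a real server would loop on accept(listen_fd, ...).
             * Here we only check that the inherited fd is listening. */
            int listening = 0;
            socklen_t len = sizeof listening;
            int ok = getsockopt(listen_fd, SOL_SOCKET, SO_ACCEPTCONN,
                                &listening, &len) == 0 && listening;
            _exit(ok ? 0 : 1);
        }
        started++;
    }

    int healthy = 0;
    for (int i = 0; i < started; i++) {      /* master reaps every worker */
        int status;
        if (wait(&status) > 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0)
            healthy++;
    }
    close(listen_fd);
    return healthy;
}
```

No worker calls bind() or listen(); the descriptor number and the socket behind it arrive through fork().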
wait() and Zombie Processes
When a child process terminates, it does not disappear immediately. The kernel keeps certain information about it — its exit status, resource usage statistics — until the parent retrieves it. A process in this state is called a zombie.
The parent retrieves this information using wait() or waitpid(). Until the parent calls one of these, the child’s entry in the process table remains, consuming a slot. If the parent exits before the child, the child is adopted by the init process (PID 1), which always calls wait() on its children. This is why orphan processes do not become zombies — init cleans them up.
A Practical wait() Example
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
int main() {
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork failed");
        return 1;
    }
    if (pid == 0) {
        // Child does some work
        sleep(2);
        printf("Child finishing\n");
        exit(42); // Exit with status 42
    }
    // Parent waits specifically for this child
    int status;
    waitpid(pid, &status, 0);
    if (WIFEXITED(status)) {
        printf("Child exited with status %d\n", WEXITSTATUS(status));
    }
    return 0;
}
What Happens When a Parent Exits
If a parent exits before its child, the child is reparented to init. init calls wait() whenever one of its adopted children terminates, collecting the exit status and removing the entry from the process table. This means orphaned processes do not linger as zombies in a properly functioning system; zombies persist only when a parent that is still running fails to call wait().
A common mistake is for a long-running parent to fork and then continue doing other work without ever calling wait(). Over time, the accumulated zombies can exhaust the process table (each zombie still occupies a PID, and the number of PIDs is limited). The solution is to call wait() or waitpid() in the parent. If the parent cannot afford to block, it can call waitpid(-1, NULL, WNOHANG) in a loop, typically from a SIGCHLD handler. If it does not care about the exit status at all, it can set SIGCHLD to SIG_IGN so that children are reaped automatically.
Fork Bombs and Process Limits
A fork bomb is a denial-of-service attack where a process keeps calling fork() as fast as possible, creating exponentially growing numbers of child processes until the system runs out of process table entries, memory, or both. The classic form:
// Do not run this
while(1) fork();
Modern systems mitigate this with per-user and per-process resource limits. The ulimit command (in the shell) and the setrlimit() system call control the maximum number of processes a user can create. When a process hits its limit, fork() returns -1 with errno set to EAGAIN.
System administrators can also use control groups (cgroups) on Linux to limit the number of processes a container or user can spawn. For more on how threads relate to processes in this context, see Threads & Lightweight Processes.
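The per-user process limit mentioned above can be read from inside a program with getrlimit(). The sketch below assumes a system that defines RLIMIT_NPROC (Linux and the BSDs do, but it is not part of POSIX); the helper names are hypothetical.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: fetch the process-count limit for the calling user.
 * Returns 0 on success, -1 on failure (errno set by getrlimit()). */
int get_nproc_limit(struct rlimit *out) {
    return getrlimit(RLIMIT_NPROC, out);
}

/* Print one limit value, using "unlimited" for RLIM_INFINITY. */
static void print_limit(const char *label, rlim_t value) {
    if (value == RLIM_INFINITY)
        printf("%s: unlimited\n", label);
    else
        printf("%s: %llu\n", label, (unsigned long long)value);
}

/* Print the soft and hard limits, similar in spirit to `ulimit -u`. */
void print_nproc_limit(void) {
    struct rlimit rl;
    if (get_nproc_limit(&rl) != 0) {
        perror("getrlimit");
        return;
    }
    print_limit("soft limit", rl.rlim_cur);
    print_limit("hard limit", rl.rlim_max);
}
```

A process may lower its own soft limit with setrlimit(), which is one way to contain a component that is allowed to fork.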
Interview Questions
All open file descriptors are duplicated in the child process. The child shares the same file table entries as the parent, meaning they reference the same open file description. This is why the fork()+exec() pattern works for I/O redirection — the child can close or duplicate file descriptors before calling exec(), affecting what the new program reads from or writes to.
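A minimal sketch of that redirection step: the child points its standard output at a pipe with dup2() before calling exec(), and the parent reads whatever the new program prints. The helper `capture_echo` is hypothetical and assumes an `echo` program is available on PATH.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run `echo <word>` with its stdout redirected into a
 * pipe and copy what it printed into buf (NUL-terminated).
 * Returns the number of bytes captured, or -1 on failure. */
ssize_t capture_echo(const char *word, char *buf, size_t buflen) {
    int fds[2];
    if (buflen == 0 || pipe(fds) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);                       /* child only writes */
        dup2(fds[1], STDOUT_FILENO);         /* stdout now refers to the pipe */
        close(fds[1]);
        execlp("echo", "echo", word, (char *)NULL);
        _exit(127);                          /* exec failed */
    }

    close(fds[1]);                           /* parent only reads */
    size_t total = 0;
    ssize_t n;
    while (total < buflen - 1 &&
           (n = read(fds[0], buf + total, buflen - 1 - total)) > 0)
        total += (size_t)n;
    buf[total] = '\0';
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return (ssize_t)total;
}
```

The `echo` program knows nothing about the pipe. It writes to descriptor 1 as usual, and the redirection set up between fork() and exec() decides where that output goes.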
vfork() was introduced to avoid the cost of copying the parent's address space during fork(), at a time before copy-on-write was common. Like fork(), vfork() creates a child process, but it does not copy the parent's address space: the child runs in the parent's address space until it calls exec() or _exit(), and the parent is suspended in the meantime. Because the two share memory, the child must not modify variables or return from the calling function. vfork() is largely obsolete now that copy-on-write makes fork() efficient enough for most use cases, and POSIX.1-2008 removed it from the standard. On modern Linux, vfork() is still a distinct call, implemented as clone() with the CLONE_VM and CLONE_VFORK flags.
When exec() replaces the process image, every signal that has a custom handler is reset to its default disposition. Handlers installed with signal() or sigaction() cannot carry over, because their code no longer exists in the new program. Signals set to SIG_IGN stay ignored, and signals already at SIG_DFL stay at the default. The signal mask (which signals are blocked) and the set of pending signals are also preserved across exec().
The kernel implements fork() by duplicating the current process and scheduling both the parent and child to continue running. From the kernel's perspective, both processes exist and both need to resume execution. The return value differs because the kernel detects which process is running — it sets the return value to 0 in the child (the newly created process) and to the child's PID in the parent. This design lets both processes determine their role and branch accordingly. It is the fundamental mechanism that makes the parent-child relationship explicit in the code.
When a child terminates, its exit status and resource usage are preserved in the kernel until the parent retrieves them via wait(). Until the parent calls wait() or waitpid(), the child remains in the process table as a zombie. If the parent never calls wait() (a programming error), the zombie entry persists and can eventually exhaust process table slots. If the parent exits before the child, init inherits the child and automatically calls wait(), so zombies from orphaned children are cleaned up promptly. SIGCHLD handling with SA_NOCLDWAIT can also prevent zombies by instructing the kernel to discard child exit information.
When a process calls exec(), signals with custom handlers are reset to their default dispositions. The new program starts without any handlers and installs its own, if it wants them, once it is running.
However, the signal mask (which signals are currently blocked) is preserved across exec(). Also, if a signal's disposition is set to SIG_IGN or SIG_DFL before exec(), it is preserved.
This is one reason the fork()+exec() split is useful: between fork() and exec(), the child can adjust the signal mask or ignore specific signals, and the new program inherits those settings without inheriting the parent's custom handlers.
O_CLOEXEC is a flag passed to open() when creating a file descriptor (on Linux, sockets use the analogous SOCK_CLOEXEC flag with socket(), and pipes use pipe2() with O_CLOEXEC). It sets the close-on-exec flag at creation time: the descriptor will automatically close when any exec() call replaces the process image.
FD_CLOEXEC is the flag itself, set on an existing descriptor with fcntl(fd, F_SETFD, FD_CLOEXEC) and read back with fcntl(fd, F_GETFD). It's how you achieve the same effect for descriptors you did not create yourself, such as ones inherited from a parent process.
Both achieve the same result: preventing file descriptor leakage across exec(). O_CLOEXEC is safer in multi-threaded programs because it is atomic. It closes the window between creating the descriptor and setting FD_CLOEXEC, during which another thread could call fork() and exec().
posix_spawn() combines fork() and exec() into a single call. It creates a child process and replaces it with a new program, handling file descriptor inheritance and signal management through attribute parameters.
Use posix_spawn() when:
- Forking in a multi-threaded program: fork() copies only the calling thread, so a lock held by any other thread stays locked forever in the child
- You need to control the child's signal handling, file descriptor table, or scheduling parameters atomically
- Implementing a shell or command executor where you need predictable fork+exec behavior
posix_spawn() is effectively a standardized interface that handles the tricky parts of fork()+exec() in a way that works correctly in multi-threaded programs.
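A minimal sketch of the simplest use, with no file actions and no attributes, might look like this. It uses posix_spawnp(), the PATH-searching variant; the wrapper name `spawn_and_wait` is hypothetical.

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;   /* the caller's environment, passed through as-is */

/* Hypothetical helper: spawn argv[0] (looked up via PATH) and wait for it.
 * Returns the exit code, or -1 if spawning or waiting failed. */
int spawn_and_wait(char *const argv[]) {
    pid_t pid;
    /* NULL file actions and attributes: inherit descriptors and defaults. */
    int err = posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ);
    if (err != 0)
        return -1;       /* posix_spawnp() returns an errno value directly */

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For redirection or signal setup, the two NULL arguments would be replaced by a posix_spawn_file_actions_t and a posix_spawnattr_t prepared beforehand.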
At fork(), the kernel marks all pages in the parent's address space as read-only in both parent's and child's page tables. Both processes share the same physical pages. No actual memory copying happens at fork() time.
When either process tries to write to a shared page:
- CPU raises a page fault (write to read-only page)
- Kernel intercepts the fault, allocates a new physical page
- Kernel copies the original page content to the new page
- Kernel updates the writing process's page table to point to the new page
- Kernel marks the new page as writable
- Other process's page table entry still points to the original (read-only) page
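The other process's mapping stays read-only, so its first write to that page faults as well. If it is by then the only remaining user of the original page, the kernel can typically just make the page writable again instead of copying it.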
fork() can fail and return -1 with errno set to:
- EAGAIN: The process's RLIMIT_NPROC limit has been reached, or the system's process table is full. This is the most common failure in containerized or resource-constrained environments.
- ENOMEM: Not enough kernel memory to allocate the child's PCB or page tables. Very rare on modern systems with swap.
For EAGAIN, the application should throttle fork attempts or increase ulimit -u. For ENOMEM, the system itself is in trouble and needs intervention.
Note: fork() can succeed even if there isn't enough memory for the child to run, because COW means physical memory is only allocated on write and Linux overcommits memory by default. The child may be killed by the OOM killer later if it writes heavily and no memory is available.
fork() succeeds immediately because COW defers memory copying — at fork() time, no actual memory needs to be allocated. The child starts with a copy-on-write mapping of the parent's memory.
If both parent and child write heavily (triggering COW for many pages), they may collectively allocate significantly more memory than either would have alone. If the system runs out of memory, the OOM killer selects a process to terminate.
The OOM killer primarily targets the processes using the most memory, adjusted by each process's oom_score_adj setting. A fork bomb or heavy COW activity can trigger it. cgroups v2 allows controlling OOM behavior per container.
When fork() is called, memory mappings (created by mmap) are inherited by the child. For MAP_PRIVATE mappings, whether file-backed or anonymous (MAP_ANONYMOUS), the COW mechanism applies: both processes initially share pages until either writes, and the writer then gets its own copy.
MAP_SHARED mappings behave differently: writes go directly to the shared pages (and, for file-backed mappings, eventually to the underlying file). These are NOT copied on write, so modifications are immediately visible to any process sharing the mapping.
After fork(), mmap() calls in either process are independent — new mappings in one process don't appear in the other.
After fork(), the child starts with utime and stime both set to 0. The child has not yet used any CPU time.
Accounting begins when the child is first scheduled. On Linux, the scheduler records the time when a process is scheduled in and calculates the delta when the process is scheduled out.
This means that if you fork a child and immediately wait() for it, you may see non-zero CPU times if the child ran briefly between fork and wait (even for a moment).
Nice value is inherited across fork(). The child starts with the same nice value as the parent. However, the child can then call setpriority() or nice() to adjust its own priority independently.
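This inheritance applies to scheduler properties in general: the child gets the same scheduler policy (SCHED_OTHER by default), the same CPU affinity mask, and the same nice value as the parent. On Linux, a process can opt out by setting the SCHED_RESET_ON_FORK flag, which resets the policy and priority in its children.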
This is why a background job started with `nice -n 10 ./job &` from a shell with default nice has nice=10, even though the shell forked and exec'd the job process.
exit() is a standard C library function that performs cleanup before terminating: flushing stdio buffers, calling atexit() handlers, then calling _exit().
_exit() is a raw system call that terminates immediately without cleanup. It does not flush buffers, does not call atexit handlers, and does not invoke C++ destructors.
In a fork()+exec() context, if the child needs to terminate without running the new program (e.g., error handling after fork() before exec()), it should call _exit() rather than exit(). The child holds a copy of the parent's unflushed stdio buffers and registered atexit() handlers, so calling exit() in the child can write buffered output a second time and run cleanup code that was meant to run once, in the parent.
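The duplicate flush can be reproduced with a temporary file. In this sketch the parent buffers one byte, forks, and then counts how many bytes end up in the file depending on how the child terminates. The helper name `bytes_written_after_fork` is hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: buffer one byte in a stdio stream, fork, and let the
 * child terminate with exit() (use_exit != 0) or _exit() (use_exit == 0).
 * Returns the number of bytes that finally reach the file, or -1. */
long bytes_written_after_fork(int use_exit) {
    FILE *f = tmpfile();
    if (f == NULL)
        return -1;
    fputc('x', f);                 /* stays in the user-space buffer for now */

    pid_t pid = fork();
    if (pid < 0) {
        fclose(f);
        return -1;
    }
    if (pid == 0) {
        if (use_exit)
            exit(0);               /* flushes the child's copy of the buffer */
        _exit(0);                  /* discards the child's copy */
    }
    waitpid(pid, NULL, 0);

    fflush(f);                     /* parent writes its own copy */
    struct stat st;
    long size = fstat(fileno(f), &st) == 0 ? (long)st.st_size : -1;
    fclose(f);
    return size;
}
```

With _exit() the file holds the single byte the parent wrote. With exit() the child flushes its inherited copy first, and because parent and child share the file offset, the parent's flush lands after it, leaving two bytes.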
wait() blocks until any child terminates, then returns its PID and status. If there are no children to wait for, it returns -1 with errno ECHILD.
waitpid(pid, status, 0) waits for a specific child (or any child if pid=-1). By default it blocks if the child hasn't exited.
waitpid(pid, status, WNOHANG) is non-blocking — if the specified child hasn't exited yet, it returns 0 immediately. This is essential in event-driven programs or signal handlers where blocking would be unacceptable.
A common pattern: in a SIGCHLD handler, use waitpid(-1, NULL, WNOHANG) in a loop to reap all terminated children without blocking the handler.
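The return value of a WNOHANG call on a child that is still running can be seen in a small experiment. Here the child is held on a pipe read until the parent releases it; the helper name `poll_running_child` is hypothetical.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns what waitpid(..., WNOHANG) reports while the
 * child is still running, then reaps the child with a blocking call.
 * Returns -1 on failure. */
int poll_running_child(void) {
    int gate[2];                      /* child blocks until parent closes it */
    if (pipe(gate) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(gate[0]);
        close(gate[1]);
        return -1;
    }
    if (pid == 0) {
        char c;
        close(gate[1]);
        if (read(gate[0], &c, 1) < 0) /* returns 0 (EOF) once parent closes */
            _exit(1);
        _exit(0);
    }
    close(gate[0]);

    int status;
    pid_t early = waitpid(pid, &status, WNOHANG);   /* child still blocked */

    close(gate[1]);                   /* release the child */
    if (waitpid(pid, &status, 0) != pid)            /* blocking reap */
        return -1;
    return (int)early;
}
```

The non-blocking call returns 0 rather than an error, which is how an event loop distinguishes "not finished yet" from "no such child".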
When a process's parent exits before it, the kernel re-parents the child to init (PID 1, which is systemd on most modern Linux systems). On Linux, if an ancestor has marked itself as a subreaper with prctl(PR_SET_CHILD_SUBREAPER), the child goes to the nearest such ancestor instead.
The reparenting happens immediately upon the parent's termination. The child continues running exactly as before, just with a different parent PID.
init/systemd calls wait() on adopted children as they terminate to prevent zombie accumulation. This is why orphaned processes do not linger as zombies: they get reaped automatically.
Zombie and defunct are the same thing. A zombie (or defunct) process is one that has terminated but whose PCB entry remains in the process table because the parent has not yet read its exit status via wait().
Once the parent calls wait() and reads the exit status, the zombie is cleaned up. If the parent never calls wait() (a programming bug), the zombie persists indefinitely — or until the parent is killed, at which point the zombie is adopted and reaped by init.
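On Linux, fork() is implemented on top of clone() with none of the sharing flags set; the C library passes little more than SIGCHLD as the termination signal. A thread, by contrast, is created with CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM (and a few others). CLONE_VM is the key flag: it causes the child to share the parent's memory map.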
When CLONE_VM is set (shared address space), writes by either process are visible to the other. When CLONE_VM is NOT set (fork semantics), the child's page tables point to COW copies of the parent's pages.
vfork() sets CLONE_VM but also CLONE_VFORK and blocks the parent until the child calls exec() or exit(). The combination of shared VM and parent blocking is what allows vfork() to work without COW overhead.
ptrace() allows a tracer process to observe and control another process's execution, including intercepting system calls. When ptrace attaches to a child after fork(), the tracer can intercept every system call the child makes.
seccomp (secure computing mode) filters system calls. In strict mode, a process can only call read(), write(), _exit(), and sigreturn(); any other syscall results in SIGKILL. In filter mode, a BPF program decides per syscall whether to allow it, return an error, kill the process, or notify a tracer.
Together, ptrace and seccomp form the basis of many sandboxes: a process can fork a child, have the child install a seccomp filter (narrowing its available syscalls), then execute untrusted code, while the parent uses ptrace to monitor the child. strace relies on the ptrace half of this; sandboxing tools typically combine both.
Further Reading
- Process Concept — PCB architecture, process states, and lifecycle
- Process Scheduling — How schedulers manage process execution
- Threads & Lightweight Processes — Thread models and shared address space
This post is part of the Operating Systems roadmap, Section 3.3: Process & Thread Management.
Quick Recap Checklist
- fork() creates a child process; returns 0 in child, child’s PID in parent, -1 on error
- exec() family replaces the current process image with a new program; never returns on success
- fork()+exec() is the universal pattern for creating and running new programs
- wait()/waitpid() retrieves a child’s exit status and removes its zombie entry
- Copy-on-write defers memory copying at fork() until a write actually occurs
- File descriptors are duplicated (shared) across fork(); use O_CLOEXEC to auto-close before exec()
- vfork() is an obsolete optimization that shares the address space until exec()
- Fork bombs exploit the fact that fork() succeeds until process limits are hit; mitigate with ulimit and cgroups
Conclusion
- Process Concept — How the OS represents and manages running programs
- System Calls Interface — How programs talk to the kernel
- Memory Allocation — How the kernel manages heap memory for processes