System Calls Interface
System calls are the boundary between user programs and the kernel. They are the mechanism by which user-space applications request services from the operating system — opening files, creating processes, allocating memory, and more. Understanding syscalls reveals how the OS enforces isolation and provides safe access to hardware.
The sandbox your code runs in is real. Not metaphorical — enforced by the CPU, in hardware. Your program can calculate, loop, allocate structs on the stack, and thrash around in cache all day without ever asking permission. But open a file, send a packet, or allocate a chunk of heap memory, and you hit a wall. The kernel sits on the other side of that wall, and the only way your code can cross is through system calls.
Syscalls are the handshake between user space and kernel space. They are the only entry points through which unprivileged code reaches privileged operations. Without them, any bug in any program could overwrite your disk or hijack the CPU. With them, the kernel can inspect every request, enforce security policies, and share hardware across thousands of competing processes.
This post covers what syscalls are, how the CPU flips privilege levels to make them work, the calls you’ll hit most often in practice, and the security machinery built to contain them.
What Is a System Call
A system call is a controlled ring transition — from user space into kernel space — triggered by user-level code that needs something the kernel provides. The kernel runs in ring 0 (on x86), the highest privilege. Your application runs in ring 3. When your code fires a syscall instruction, the CPU bumps up to ring 0, jumps into kernel code, and the kernel takes over from there.
What kinds of operations cross that boundary? A rough taxonomy:
- File I/O — open, read, write, close
- Process management — fork, execve, exit, wait
- Memory allocation — brk, mmap
- Networking — send, recv on sockets
- Inter-process communication — pipes, message queues, shared memory
- Time and scheduling — clock_gettime, nanosleep, sched_yield
The C library (glibc, musl, uClibc) wraps these into functions you already know: open(), read(), fork(), malloc(). Worth knowing: malloc() itself is not a syscall. Under the hood it calls brk() or mmap() to get memory from the kernel. The syscall is the primitive; the library is ergonomics on top.
The User/Kernel Boundary
The ring model is simple. x86 has four rings (0–3). Linux and Windows use ring 0 for kernel code, ring 3 for everything else. Nothing in between.
When a syscall fires, the sequence is:
- CPU switches to ring 0.
- Program counter jumps to a fixed kernel address — the syscall entry point.
- Kernel reads the syscall number from a register (eax on x86, rax on x86-64).
- Kernel dispatches to the handler for that syscall number.
- On return, privilege drops back to ring 3 and execution resumes in user code.
graph LR
A["User Space (Ring 3)"] -->|"syscall instruction"| B["Kernel Space (Ring 0)"]
B -->|"return value in rax"| A
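No user program can jump to an arbitrary kernel address. The syscall instruction itself is unprivileged (ring 3 code is meant to execute it), but the entry point it lands on is fixed by the kernel at boot in a CPU register that user code cannot modify (the LSTAR model-specific register on x86-64). User code chooses which syscall to request, never where in the kernel execution begins. This is the bedrock of process isolation.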
How the Syscall Instruction Works
Same idea across architectures, different instructions.
x86 (32-bit): int 0x80
On 32-bit Linux the entry path is the int 0x80 interrupt. Syscall number goes in eax, arguments in ebx, ecx, edx, esi, edi, ebp.
; Read 10 bytes from fd 3 into buffer
mov eax, 3 ; read = 3
mov ebx, 3 ; fd = 3
mov ecx, buffer ; buffer address
mov edx, 10 ; byte count
int 0x80 ; kernel entry
x86-64: syscall
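On 64-bit Linux, the native entry path is the syscall instruction, which is faster than a software interrupt. int 0x80 still works on kernels built with 32-bit emulation, but it dispatches through the 32-bit syscall table and truncates arguments to 32 bits, so 64-bit code should not use it. Syscall number in rax, arguments in rdi, rsi, rdx, r10, r8, r9. The instruction clobbers rcx and r11, and the result comes back in rax.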
; Write 14 bytes to stdout
mov rax, 1 ; write = 1
mov rdi, 1 ; fd = 1 (stdout)
mov rsi, msg ; buffer
mov rdx, 14 ; byte count
syscall
Both conventions index into a syscall table — the kernel’s dispatch array. It’s architecture-specific and lives in the kernel binary itself.
Common Syscalls
These come up constantly in real code:
| Syscall | Number (x86-64) | What it does |
|---|---|---|
| read | 0 | Read from a file descriptor |
| write | 1 | Write to a file descriptor |
| open | 2 | Open a file, get a descriptor back |
| close | 3 | Close a file descriptor |
| mmap | 9 | Map files or anonymous memory into the address space |
| brk | 12 | Move the program break (adjust heap size) |
| fork | 57 | Clone the current process |
| execve | 59 | Replace the current process image with a new program |
| exit | 60 | Terminate the calling process |
File Descriptors
Most file-related syscalls operate on file descriptors — small integers that index into the per-process fd table. Call open("/etc/passwd", O_RDONLY), the kernel allocates an fd (say, 3), stores a reference to the open file, and returns 3. Every subsequent read(fd, ...) looks up that 3 in the table.
File descriptors are process-local. When a process forks, children inherit the parent’s fd table — which is why a forked child can keep reading from the same files the parent had open.
Return Values and Error Handling
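Syscall wrappers use a consistent convention: a non-negative return means success, -1 means error. At the raw kernel interface on Linux, a failing syscall returns a negative error number (for example -ENOENT); the libc wrapper converts that into a return value of -1 and stores the positive error code in errno, which is a thread-local variable rather than a true global.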
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
int main(void) {
int fd = open("/nonexistent/file", O_RDONLY);
if (fd < 0) {
fprintf(stderr, "open failed: %s\n", strerror(errno));
return 1;
}
close(fd);
return 0;
}
This is why -1 instead of exceptions — it maps directly onto C’s error-checking idiom. Python wraps it as OSError, Go as error, but the contract underneath is identical.
Library Wrappers: glibc, musl
You almost never call syscalls directly. You call library functions, which may:
- Wrap a syscall one-to-one: musl's open() is essentially syscall(SYS_open, ...).
- Add buffering and semantics: fopen() wraps open() and gives you a FILE* with an internal buffer.
- Build something larger: printf() wraps write() and adds formatting.
The difference matters when you are tracing behavior. fopen() and open() end up in the same syscall, but fopen() does extra work around it: parsing the mode string, allocating a FILE struct, and setting up an internal buffer that later reads and writes go through.
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
int main(void) {
// Raw syscall wrapper: file descriptor
int fd = open("data.txt", O_RDONLY);
if (fd < 0) return 1;
char buf[128];
ssize_t n = read(fd, buf, sizeof(buf));
close(fd);
// Buffered standard I/O wrapper: FILE*
FILE* fp = fopen("data.txt", "r");
if (!fp) return 1;
fgets(buf, sizeof(buf), fp);
fclose(fp);
return 0;
}
fopen() is better for sequential text reads — buffering cuts syscall traffic. But if you need non-blocking I/O, specific file offsets, or precise control, work with the raw fd.
VDSO and vsyscall
Every syscall has overhead: the privilege switch, the kernel entry, the return. Some syscalls are so cheap the overhead dominates. Linux has two optimizations for these.
vsyscall (legacy)
A fixed memory region mapped at a known address that exposes certain syscall implementations without actually entering the kernel. Early Linux used it for gettimeofday(). The problem: fixed address, no ASLR, easy target for exploits. Modern Linux deprecates it.
VDSO (Virtual Dynamic Shared Object)
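A shared object the kernel maps into every process at a randomized address. Unlike vsyscall, VDSO is subject to ASLR. It contains small user-mode implementations of a few read-only calls such as clock_gettime() and gettimeofday(), alongside a read-only data page that the kernel keeps updated with timekeeping parameters. Libc looks these functions up and calls them like ordinary code, with no kernel entry in the common case.
clock_gettime() is the canonical example. The VDSO code reads a hardware counter, combines it with the kernel-maintained parameters, and returns the time without ever executing int 0x80 or syscall. If the current clock source cannot be read from user space, the VDSO function falls back to a real syscall.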
Check it out: cat /proc/self/maps | grep vdso. The [vdso] region is the shared code page available without a syscall.
A Practical Example: open/read/write
Here is a complete program that opens a file, reads it, and writes the content to stdout.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#define BUFFER_SIZE 256
int main(int argc, char* argv[]) {
if (argc != 2) {
fprintf(stderr, "Usage: %s <file>\n", argv[0]);
return 1;
}
const char* filepath = argv[1];
// open is syscall number 2 on x86-64 (recent glibc issues openat instead)
int fd = open(filepath, O_RDONLY);
if (fd < 0) {
fprintf(stderr, "open('%s') failed: %s (errno=%d)\n",
filepath, strerror(errno), errno);
return 1;
}
char buffer[BUFFER_SIZE];
ssize_t bytes_read = read(fd, buffer, BUFFER_SIZE - 1);
if (bytes_read < 0) {
fprintf(stderr, "read() failed: %s\n", strerror(errno));
close(fd);
return 1;
}
buffer[bytes_read] = '\0';
// syscall number 1 on x86-64
ssize_t bytes_written = write(STDOUT_FILENO, buffer, bytes_read);
if (bytes_written < 0) {
fprintf(stderr, "write() failed: %s\n", strerror(errno));
close(fd);
return 1;
}
close(fd);
fprintf(stderr, "Processed %zd bytes from '%s'\n", bytes_read, filepath);
return 0;
}
Compile with gcc -o fileutil fileutil.c and run with ./fileutil somefile.txt. Trace the syscalls with strace -c ./fileutil somefile.txt and you will see read, write, and close in the summary, along with the open call, which recent glibc versions issue as openat rather than open.
Security Implications: Syscall Filtering and seccomp
The syscall interface is the kernel’s attack surface. Every syscall is a potential vector — privilege escalation, data exfiltration, DoS. Linux gives processes two ways to shrink that surface.
seccomp (Secure Computing Mode)
seccomp restricts which syscalls a process may call. Once strict mode is enabled, the process may use only read, write, _exit, and sigreturn; any other syscall kills it with SIGKILL. It was designed for long-running servers that only need a handful of operations after initialization.
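#include <linux/seccomp.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
int main(void) {
    // Enter strict mode: from here on only read, write, exit, and sigreturn work
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) < 0) {
        perror("prctl seccomp");
        return 1;
    }
    write(STDOUT_FILENO, "Hello via seccomp\n", 18);
    // glibc's _exit() and a return from main both use exit_group, which strict
    // mode forbids, so invoke the plain exit syscall directly
    syscall(SYS_exit, 0);
    // Any other syscall (open, getpid, anything) kills the process with SIGKILL
    return 0;
}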
seccomp-BPF
The modern seccomp interface uses Berkeley Packet Filter programs to filter syscalls by number, argument values, or both. Docker and container runtimes use seccomp-BPF to disable dangerous calls inside containers — mount(), syslog(), and similar — while leaving harmless ones available.
Why This Matters
If a web server gets breached, the attacker should not be able to format the disk or spawn a root shell. seccomp makes that a property of the process rather than a property of the network boundary. Containers and minimal container images depend on this: restrict the syscall surface, restrict the blast radius of any future exploit.
Interview Questions
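On Linux, glibc's open() wrapper loads the registers and fires the syscall instruction. For the open syscall (number 2 on x86-64), the syscall number goes in rax and the path pointer, flags, and mode go in rdi, rsi, and rdx. Recent glibc versions implement open() with openat (number 257) instead, which shifts those arguments over by one to make room for a directory fd. The CPU transitions to ring 0 and jumps to the syscall entry point. The kernel's handler copies the path from user memory (rejecting pointers that do not refer to valid user addresses), resolves it and checks file permissions, creates an open file description, allocates a file descriptor, and returns the fd number. On error, the kernel returns a negative error code, which the wrapper turns into -1 with errno set. The calling program checks the return value and handles the error if negative.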
int 0x80 is the legacy 32-bit Linux syscall mechanism: a software interrupt that transfers control to kernel mode via interrupt descriptor table entry 0x80. It was the only mechanism on early Linux. syscall is the instruction used by 64-bit code on x86-64; it performs a direct transition to kernel mode without the cost of interrupt dispatch, and the Linux convention passes arguments in rdi, rsi, rdx, r10, r8, and r9. The kernel maintains separate syscall tables for the two ABIs. A 64-bit kernel built with 32-bit emulation still accepts int 0x80, even from a 64-bit process, but it dispatches through the 32-bit table and truncates arguments to 32 bits, so 64-bit code should not rely on it.
malloc() is a library function in glibc/musl, not a syscall. It manages a heap within a process's address space by calling brk() or mmap() to request memory pages from the kernel — those are the actual syscalls. This design separates concerns: the kernel manages page-level allocation, while the C library manages byte-level heap semantics (coalescing free blocks, binning by size, alignment). The library also caches freed memory rather than returning it immediately to the kernel, which amortizes syscall overhead. Most malloc() calls never touch the kernel at all.
VDSO (Virtual Dynamic Shared Object) is a small shared library the kernel maps into every process's address space at a randomized location. It contains user-mode implementations of a few calls, notably clock_gettime(), that read kernel-maintained data from a read-only page mapped next to it. Instead of crossing the user/kernel boundary with a syscall instruction, libc calls the VDSO function, which combines that data with a hardware counter read to compute the current time. This eliminates the privilege switch overhead for these "get" operations in the common case. You can verify VDSO is present with cat /proc/self/maps | grep vdso.
fork() is syscall number 57 on x86-64. The kernel's fork handler creates a new process control block (PCB/struct task_struct), duplicates the parent's memory mappings (copy-on-write — pages are shared until either parent or child writes to them), copies the file descriptor table (each fd points to the same open file description, reference count incremented), and sets up the child process's registers so it returns 0 from fork() while the parent gets the child's PID. The key insight is that the kernel doesn't copy the entire memory — just the page table entries, which makes fork fast even for large processes.
seccomp (Secure Computing Mode) is a Linux kernel feature that restricts which syscalls a process may invoke. In strict mode, the only allowed syscalls are read, write, _exit, and sigreturn — anything else kills the process with SIGKILL. Docker uses seccomp-BPF (an extension) to define fine-grained filters based on syscall number and arguments. By default, Docker blocks around 44 syscalls that are not needed for normal container operation — including mount(), syslog(), and module loading. This dramatically reduces the attack surface: even if an attacker compromises the container process, they cannot escalate to host-level access through syscalls the container isn't allowed to make.
C has no exceptions as a language feature (setjmp/longjmp provide non-local jumps, but they are not an exception mechanism and syscall wrappers do not use them). Unix was designed in the 1970s alongside C, and the error model reflects this: syscall wrappers return a signed integer, where non-negative values are success (including 0 from read, which signals end of file) and -1 indicates error. On failure the libc wrapper stores a positive code in the thread-local errno variable identifying which error occurred (EFAULT, ENOENT, EPERM, etc.); the kernel itself reports the error as a negative return value. This maps naturally onto C's error-checking idiom: if (fd < 0) handle_error(). Python, Go, and other languages wrap this underlying contract in their own exception or error-return patterns, but the kernel interface stays the same.
A file descriptor is a small non-negative integer (typically 0, 1, 2 for stdin/stdout/stderr, then 3 upward) that indexes into the per-process file descriptor table. An open file description (also called an open file table entry) is a kernel data structure storing the current file offset, access mode, and a reference to the inode. When you fork(), the child has its own fd table but the entries point to the same open file descriptions — this is why both parent and child see updates to the file offset. When you dup() an fd, both point to the same open file description. File descriptors are process-local; open file descriptions are reference-counted kernel objects that outlive individual fds pointing to them.
dup() returns a new file descriptor that points to the same open file description as the original fd; the kernel allocates the lowest available fd number. dup2(oldfd, newfd) assigns the duplicate to a specific descriptor number you choose, closing newfd first if it's already open. The main use case for dup2() is redirecting standard input, output, or error before executing a child process, for example making STDIN_FILENO refer to a file you just opened. dup2() performs the close of newfd and the duplication as one atomic step, which matters in multithreaded programs: with a separate close() followed by dup(), another thread could grab the freed descriptor number in between.
pipe(int fds[2]) creates a unidirectional communication channel and returns two file descriptors: fds[0] for reading, fds[1] for writing. Data written to fds[1] can be read from fds[0]. The canonical pattern is: call pipe(), then fork(). After fork, the child typically closes the read end and writes; the parent closes the write end and reads. The kernel's pipe implementation uses a circular buffer in kernel memory. If a process calls pipe() without forking, both ends stay in one process. That is occasionally useful on its own (the self-pipe trick for waking an event loop from a signal handler is one example), but the usual purpose is communication between related processes.
fork() creates a new process — it duplicates the calling process, giving the child a copy of the parent's memory, file descriptors, and register state (except the child gets 0 return from fork while the parent gets the child's PID). execve() does not create a new process — it replaces the current process's address space with a new program image, loading the executable from disk and resetting registers, stack, and heap. The typical sequence is: fork() to create a child process, then the child calls execve() to run a different program while inheriting the parent's open file descriptors (which is why pipes redirecting stdin/stdout work with subprocesses). Without execve, fork alone just gives you a clone of the same program running in two processes.
brk() sets the program break (the end of the heap) to a new address, and its companion sbrk() moves it up or down by a requested amount; together they are a simple interface for adjusting how much heap the process has. mmap() is more general: it creates a new memory mapping in the process's virtual address space. Mappings can be file-backed (mapping a file's contents into memory) or anonymous (zero-filled pages with no file behind them). malloc() uses both: in glibc, small allocations come from the brk-managed heap, while large allocations (by default, above a threshold of 128 KiB) call mmap() to get whole pages from the kernel without fragmenting the heap. The key difference is that brk grows or shrinks a single contiguous heap region while mmap creates independent mappings that need not be adjacent.
On fork(), the child receives a copy of the parent's file descriptor table. Each entry points to the same open file description in the kernel, with the reference count incremented, so the descriptor numbers are copied but the underlying kernel object is shared. The per-descriptor flag FD_CLOEXEC determines behavior on execve(): if it is set, the fd is automatically closed when the process calls exec. Programs set FD_CLOEXEC (or open with O_CLOEXEC) on descriptors that a new program should not inherit, so private files and sockets do not leak into the programs they exec.
access(path, mode) checks whether the calling process may access a file at a given path, testing read, write, or execute permission without opening the file. The mode flags are R_OK, W_OK, X_OK, and F_OK (existence check). You'd use access() when you want to test permissions before taking an action that might fail, for example checking if a file is readable before loading it. Unlike open(), it does not create a file descriptor, does not add anything to the process's fd table, and doesn't update the file's access time. The check is made against the process's real UID and GID rather than the effective ones, which lets a set-user-ID program ask whether the invoking user could access the file. Be careful about using access() as a gate before a later open(): the file can change between the two calls, so the safer pattern is usually to attempt the operation and handle the error.
ioctl(fd, request, ...) is a catch-all syscall for device-specific operations that don't fit the standard read/write/open/close model. It takes a file descriptor (often from opening a device node like /dev/tty or a socket) and a request code that uniquely identifies the operation — the third argument is polymorphic, depending on the request. Examples: TIOCGWINSZ to get terminal window size, FIONREAD to get bytes available to read on a file descriptor, SIOCGIFCONF to get network interface addresses. The request codes are typically defined in header files like <sys/ioctl.h> or device-specific headers. It's "miscellaneous" because it's the syscall that doesn't fit neatly into any other category — each device driver implements its own ioctl operations.
prctl(option, arg2, arg3, arg4, arg5) is a multi-purpose syscall for configuring process-level kernel features. Its behavior depends entirely on the option argument. Common uses include: PR_SET_SECCOMP to enable seccomp (with the mode, such as SECCOMP_MODE_STRICT, as the second argument), PR_SET_PDEATHSIG to set a signal delivered to the process when its parent dies, PR_SET_NAME to set the name of the calling thread as shown in /proc/self/status, and PR_GET_DUMPABLE to query whether the process is allowed to produce core dumps. It's the primary interface for process self-configuration that doesn't fit elsewhere. seccomp-BPF extends this by using prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bpf_program) to load BPF filter programs, which an unprivileged process may do only after setting PR_SET_NO_NEW_PRIVS.
File descriptors are preserved across execve(): they remain open and point to the same open file descriptions they referenced before the exec. The file descriptor table is not rebuilt at exec time; the child received its own copy of the table at fork(), and that copy carries over into the new program. The key exception is a descriptor with the FD_CLOEXEC flag set, which the kernel automatically closes during exec as a safety measure. This design supports the common Unix pattern where a parent sets up pipes before forking and execing a child: the child inherits the pipe fds and they remain open during exec so the new program can use them. Without this, redirecting stdin/stdout in subprocesses would require the subprocess to explicitly reopen file descriptors after starting.
read() and write() are library functions (in glibc, musl) that wrap the underlying read and write syscalls. syscall() is itself a generic libc function taking a syscall number as its first argument, followed by up to six arguments passed to the kernel. You can call syscall(SYS_read, fd, buf, count) directly, bypassing the dedicated wrapper, and it still converts a kernel error into -1 with errno set. Normally you wouldn't, because the dedicated wrappers are type-checked by the compiler, paper over differences between architectures and kernel versions, and integrate with libc features such as thread cancellation. Using syscall() also bypasses tools that interpose on the libc wrapper functions (some debugging and sandboxing environments do this). In normal application code, use the wrappers; reach for syscall() mainly when libc provides no wrapper for the call you need.
Each syscall has a unique number (e.g., read = 0, write = 1, open = 2, fork = 57 on x86-64) that serves as an index into the kernel's syscall table. To make a call, user code (usually the libc wrapper) loads the syscall number into rax and the arguments into specific registers, then executes the syscall instruction. The kernel's entry point (written in assembly) bounds-checks that number and uses it to select a handler, conceptually by indexing an array of function pointers, sys_call_table[rax]. This indirection is why syscall numbers are architecture-specific: the table for x86-64 has different numbers and different handler addresses than for arm64 or x86-32. Userspace libc knows these numbers and passes them correctly when invoking syscall.
The kernel must treat syscall arguments as potentially malicious because pointer arguments refer to memory in user space: a malicious program could pass a kernel address to try to read or overwrite kernel memory, or an unmapped address to try to crash the kernel. Every pointer argument is validated before use: the kernel checks that the address falls within the user portion of the address space (access_ok() in Linux), and dereferencing must go through functions like copy_from_user() or get_user() that copy data safely, returning an error if the copy faults. A common attack is passing a kernel-space address, which the kernel must reject. Non-pointer arguments get the same suspicion. seccomp filters can add a further check on syscall numbers and scalar argument values, though they cannot dereference pointers. This validation is why write() returns EBADF for an invalid fd and EFAULT for a bad buffer pointer rather than crashing the kernel.
Interview Questions
On Linux, open() from glibc places the syscall number for open (5 on x86-64) in rax, the flags and mode arguments in rdi, rsi, rdx, and fires the syscall instruction. The CPU transitions to ring 0 and jumps to the syscall entry point. The kernel's handler validates the path pointer (ensuring it doesn't point into kernel memory), checks file permissions, allocates a file descriptor, creates an open file description, and returns the fd number. On error, it returns -1 and sets errno. The calling program checks the return value and handles the error if negative.
int 0x80 is the legacy 32-bit Linux syscall mechanism — a software interrupt that transfers control to kernel mode via interrupt descriptor table entry 0x80. It was the only mechanism on early Linux. syscall is the modern 64-bit instruction introduced with x86-64 that performs a direct CPU transition to kernel mode without interrupt overhead — it's faster and has dedicated registers for syscall arguments (rdi, rsi, rdx, r10, r8, r9). The kernel maintains separate syscall tables for each mode, and int 0x80 cannot enter the 64-bit kernel.
malloc() is a library function in glibc/musl, not a syscall. It manages a heap within a process's address space by calling brk() or mmap() to request memory pages from the kernel — those are the actual syscalls. This design separates concerns: the kernel manages page-level allocation, while the C library manages byte-level heap semantics (coalescing free blocks, binning by size, alignment). The library also caches freed memory rather than returning it immediately to the kernel, which amortizes syscall overhead. Most malloc() calls never touch the kernel at all.
VDSO (Virtual Dynamic Shared Object) is a shared library the kernel maps into every process's address space at a randomized location. It contains kernel code that has been pre-compiled and exposes pre-computed values for certain syscalls — notably clock_gettime(). Instead of crossing the user/kernel boundary with a syscall instruction, the libc simply reads the VDSO page, which contains the time value already computed by the kernel. This eliminates the privilege switch overhead entirely for these "get" operations. You can verify VDSO is present with cat /proc/self/maps | grep vdso.
fork() is syscall number 57 on x86-64. The kernel's fork handler creates a new process control block (PCB/struct task_struct), duplicates the parent's memory mappings (copy-on-write — pages are shared until either parent or child writes to them), copies the file descriptor table (each fd points to the same open file description, reference count incremented), and sets up the child process's registers so it returns 0 from fork() while the parent gets the child's PID. The key insight is that the kernel doesn't copy the entire memory — just the page table entries, which makes fork fast even for large processes.
seccomp (Secure Computing Mode) is a Linux kernel feature that restricts which syscalls a process may invoke. In strict mode, the only allowed syscalls are read, write, _exit, and sigreturn — anything else kills the process with SIGKILL. Docker uses seccomp-BPF (an extension) to define fine-grained filters based on syscall number and arguments. By default, Docker blocks around 44 syscalls that are not needed for normal container operation — including mount(), syslog(), and module loading. This dramatically reduces the attack surface: even if an attacker compromises the container process, they cannot escalate to host-level access through syscalls the container isn't allowed to make.
C doesn't have exceptions as a language feature (it was added later as a library convention with setjmp/longjmp). Unix was designed in the 1970s with C, and the error model reflects this: syscalls return a signed integer — non-negative values are success (including 0 for read returning no bytes, which is valid), -1 indicates error. The kernel sets the global errno variable to a positive value encoding which error occurred (EFAULT, ENOENT, EPERM, etc.). This maps naturally onto C's error-checking idiom: if (fd < 0) handle_error(). Python, Go, and other languages wrap this underlying contract in their own exception or error-return patterns, but the kernel interface stays the same.
A file descriptor is a small non-negative integer (typically 0, 1, 2 for stdin/stdout/stderr, then 3 upward) that indexes into the per-process file descriptor table. An open file description (also called an open file table entry) is a kernel data structure storing the current file offset, access mode, and a reference to the inode. When you fork(), the child has its own fd table but the entries point to the same open file descriptions — this is why both parent and child see updates to the file offset. When you dup() an fd, both point to the same open file description. File descriptors are process-local; open file descriptions are reference-counted kernel objects that outlive individual fds pointing to them.
dup() returns a new file descriptor that points to the same open file description as the original fd — the kernel allocates the lowest available fd number. dup2(oldfd, newfd) explicitly assigns the duplicated fd to a specific descriptor number you choose, closing newfd first if it's already open. The main use case for dup2() is redirecting standard input, output, or error before executing a child process — for example, redirecting STDIN_FILENO to a file without worrying about what fd number you get back. It also atomically handles the close-of-old-fd case, which matters in multithreaded programs where a fd being closed in another thread could be reused between the close() and dup() in a non-atomic sequence.
pipe(int fds[2]) creates a unidirectional communication channel and returns two file descriptors — fds[0] for reading, fds[1] for writing. Data written to fds[1] can be read from fds[0]. The canonical pattern is: call pipe(), then fork(). After fork, the child typically closes the read end and writes; the parent closes the write end and reads. The kernel's pipe implementation uses a circular buffer in kernel memory. If a process calls pipe() without forking, you have a one-directional channel within a single process — useful for subprocess communication in the same process tree.
fork() creates a new process — it duplicates the calling process, giving the child a copy of the parent's memory, file descriptors, and register state (except the child gets 0 return from fork while the parent gets the child's PID). execve() does not create a new process — it replaces the current process's address space with a new program image, loading the executable from disk and resetting registers, stack, and heap. The typical sequence is: fork() to create a child process, then the child calls execve() to run a different program while inheriting the parent's open file descriptors (which is why pipes redirecting stdin/stdout work with subprocesses). Without execve, fork alone just gives you a clone of the same program running in two processes.
brk() moves the program break (the end of the heap) up or down by a requested amount — it's a simple interface for adjusting how much heap the kernel has allocated to the process. mmap() is more general: it creates a new memory mapping in the process's virtual address space. Mappings can be file-backed (mapping a file's contents into memory) or anonymous (just allocated pages with no backing store). malloc() uses both — small allocations come from the heap managed by brk-style logic inside glibc, while large allocations (typically > 128KB) call mmap() directly to get whole pages directly from the kernel without fragmenting the heap. The key difference is that brk reuses a single heap region while mmap creates entirely new non-contiguous virtual memory regions.
On fork(), the child receives a copy of the parent's file descriptor table. Each entry points to the same open file description in the kernel, whose reference count is incremented. The descriptors are therefore not independent copies: parent and child share the underlying kernel object, including the file offset and status flags. The FD_CLOEXEC descriptor flag controls behavior on execve(): if it is set, the fd is automatically closed when the process calls exec. This is how a program keeps private descriptors (log files, listening sockets, unused pipe ends) from leaking into the programs it launches: it sets FD_CLOEXEC with fcntl(), or opens the file with O_CLOEXEC in the first place, so the new program doesn't inherit them.
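Setting and querying the flag takes two fcntl() calls. The sketch below uses hypothetical helper names (set_cloexec, is_cloexec); the fcntl() commands themselves are the standard F_GETFD and F_SETFD.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: set FD_CLOEXEC on fd, preserving any other
 * descriptor flags. Returns 0 on success, -1 on error. */
int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1 ? -1 : 0;
}

/* Hypothetical helper: returns 1 if FD_CLOEXEC is set on fd,
 * 0 if it is clear, -1 on error. */
int is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) != 0;
}
```

FD_CLOEXEC belongs to the descriptor, not to the shared open file description, so a descriptor created with dup() starts with the flag clear even if the original had it set.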
access(path, mode) checks whether the calling process has permission to access the file at a given path. It tests read, write, or execute permission without actually opening the file. The mode flags are R_OK, W_OK, X_OK, and F_OK (existence check). You'd use access() when you want to test permissions before taking an action that might fail, for example checking that a file is readable before loading it, or checking execute permission before running a subprocess. Unlike open(), it does not create a file descriptor, does not track the file in the process's fd table, and doesn't update the file's access time. Note that the check is made against the process's real UID/GID, not the effective ones: the call exists so that set-user-ID programs can ask whether the invoking user could access the file. Be careful with the check-then-act pattern, too. The file can change between access() and the later open(), a time-of-check to time-of-use race, so in most code it is safer to attempt the operation and handle the error.
ioctl(fd, request, ...) is a catch-all syscall for device-specific operations that don't fit the standard read/write/open/close model. It takes a file descriptor (often from opening a device node like /dev/tty or a socket) and a request code that uniquely identifies the operation — the third argument is polymorphic, depending on the request. Examples: TIOCGWINSZ to get terminal window size, FIONREAD to get bytes available to read on a file descriptor, SIOCGIFCONF to get network interface addresses. The request codes are typically defined in header files like <sys/ioctl.h> or device-specific headers. It's "miscellaneous" because it's the syscall that doesn't fit neatly into any other category — each device driver implements its own ioctl operations.
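Two of the requests mentioned above can be sketched in a few lines. The wrapper names bytes_pending and terminal_size are hypothetical; FIONREAD, TIOCGWINSZ, and struct winsize come from the system headers on Linux.

```c
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper: number of bytes waiting to be read on fd,
 * or -1 if the descriptor does not support FIONREAD. */
int bytes_pending(int fd)
{
    int n = 0;
    if (ioctl(fd, FIONREAD, &n) == -1)
        return -1;
    return n;
}

/* Hypothetical helper: fetch the terminal size for fd.
 * Returns 0 on success, -1 if fd is not a terminal (errno is ENOTTY). */
int terminal_size(int fd, unsigned short *rows, unsigned short *cols)
{
    struct winsize ws;
    if (ioctl(fd, TIOCGWINSZ, &ws) == -1)
        return -1;
    *rows = ws.ws_row;
    *cols = ws.ws_col;
    return 0;
}
```

Note how the third argument changes type with the request: a pointer to int for FIONREAD, a pointer to struct winsize for TIOCGWINSZ. The compiler cannot check this, which is part of why ioctl() has a reputation for being error-prone.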
prctl(option, arg2, arg3, arg4, arg5) is a multi-purpose syscall for configuring process-level kernel features. Its behavior depends entirely on the option argument. Common uses include: PR_SET_SECCOMP to enable seccomp (with SECCOMP_MODE_STRICT as the second argument for strict mode), PR_SET_PDEATHSIG to set a signal delivered to the process when its parent dies, PR_SET_NAME to set the name of the calling thread as shown in /proc/self/status, and PR_GET_DUMPABLE to query whether the process is allowed to produce core dumps (a flag the kernel clears after set-user-ID transitions). It's the primary interface for process self-configuration that doesn't fit elsewhere. seccomp-BPF builds on this: prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bpf_program) loads a BPF filter program, which requires either the no_new_privs bit (set with PR_SET_NO_NEW_PRIVS) or CAP_SYS_ADMIN.
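As a small, harmless example of the option-driven interface, the sketch below sets the calling thread's name and reads it back. The helper name rename_thread is an assumption; PR_SET_NAME and PR_GET_NAME are real options, and the kernel silently truncates names to 15 bytes plus the terminating NUL.

```c
#include <sys/prctl.h>

/* Hypothetical helper: set the calling thread's name, then read the
 * name the kernel actually stored into out (16 bytes is enough).
 * Returns 0 on success, -1 on error. */
int rename_thread(const char *name, char out[16])
{
    /* prctl() is variadic; unused arguments are passed as 0UL. */
    if (prctl(PR_SET_NAME, (unsigned long)name, 0UL, 0UL, 0UL) == -1)
        return -1;
    if (prctl(PR_GET_NAME, (unsigned long)out, 0UL, 0UL, 0UL) == -1)
        return -1;
    return 0;
}
```

After rename_thread("demo-worker", buf), the new name shows up in the Name: field of /proc/self/status and in tools such as ps and top.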
File descriptors are preserved across execve(): they remain open and point to the same open file descriptions they referenced before the exec. The file descriptor table is not rebuilt either. The child received its own copy of the table at fork(), and that same table carries over into the new program image at exec. The one exception is a descriptor with the FD_CLOEXEC flag set, which the kernel automatically closes during exec as a safety measure. This design supports the common Unix pattern where a parent sets up pipes before forking and execing a child: the child inherits the pipe fds, and they stay open across exec so the new program can use them. Without this, redirecting stdin/stdout in subprocesses would require the subprocess to explicitly reopen file descriptors after starting.
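The redirect pattern depends on exactly this behavior. In the sketch below, the child points its stdout at a pipe with dup2() and then execs; the new program writes to fd 1 without knowing a pipe is behind it. The helper name capture_stdout is hypothetical, and error handling is kept to a minimum.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: run path with argv and collect what it prints on
 * stdout into out (at most cap - 1 bytes, NUL-terminated).
 * Returns the number of bytes captured, or -1 on error. */
ssize_t capture_stdout(const char *path, char *const argv[],
                       char *out, size_t cap)
{
    int fds[2];
    if (cap == 0 || pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        close(fds[0]);
        if (fds[1] != STDOUT_FILENO) {
            /* Make fd 1 refer to the pipe's write end. */
            if (dup2(fds[1], STDOUT_FILENO) == -1)
                _exit(126);
            close(fds[1]);
        }
        execve(path, argv, environ);    /* fd 1 survives the exec */
        _exit(127);
    }

    close(fds[1]);
    size_t total = 0;
    ssize_t n;
    while (total < cap - 1 &&
           (n = read(fds[0], out + total, cap - 1 - total)) > 0)
        total += (size_t)n;
    out[total] = '\0';
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return (ssize_t)total;
}
```

This is roughly what a shell does for command substitution, and what higher-level subprocess libraries do when asked to capture output.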
read() and write() are library functions (in glibc, musl) that wrap the underlying system calls. syscall() is a generic libc function that takes a syscall number as its first argument, followed by up to six arguments that are passed to the kernel. You can call syscall(SYS_read, fd, buf, count) directly and bypass the dedicated wrapper, though normally you wouldn't: the wrapper gives you a typed prototype, hides per-architecture differences in syscall numbers and argument layout, and handles details such as thread cancellation points. Both paths report failure the same way, by returning -1 and setting errno. Buffering is a separate layer: fopen() belongs to stdio, which adds user-space buffering on top of open(), read(), and write(). Calling syscall() directly also bypasses libc-level interception (for example, LD_PRELOAD shims that override read() or write() for debugging or sandboxing). In normal application code, use the wrappers; they are more ergonomic and more portable.
Each syscall has a unique number (e.g., read = 0, write = 1, open = 2, fork = 57 on x86-64) that serves as an index into the kernel's syscall table. The CPU does not fill in that number itself. Before executing the syscall instruction, user code (normally the libc wrapper) places the syscall number in rax and the arguments in rdi, rsi, rdx, r10, r8, and r9. The kernel's entry point (written in assembly) checks that the number is in range, then uses it to index into an array of function pointers, sys_call_table[rax], and calls the corresponding handler function. Out-of-range numbers fail with ENOSYS. Syscall numbers are architecture-specific: the table for x86-64 has different numbers and different handler addresses than the one for arm64 or 32-bit x86. Userspace libc knows these numbers and passes the right one when it invokes a syscall.
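The dispatch step is easy to model in user space. The following is a toy, not kernel code: every name in it (toy_table, toy_dispatch, the two handlers, and their numbers) is invented for illustration. Only the shape, a bounds check followed by an indexed call through a table of function pointers, mirrors what the paragraph describes.

```c
#include <errno.h>
#include <stddef.h>

typedef long (*toy_handler)(long a0, long a1, long a2);

/* Two made-up "system calls" for the toy table. */
static long toy_answer(long a0, long a1, long a2)
{
    (void)a0; (void)a1; (void)a2;
    return 42;
}

static long toy_add(long a0, long a1, long a2)
{
    (void)a2;
    return a0 + a1;
}

/* The table: the call number is the array index. */
static const toy_handler toy_table[] = {
    [0] = toy_answer,
    [1] = toy_add,
};

#define TOY_NR_CALLS (sizeof toy_table / sizeof toy_table[0])

/* Bounds-check the number, then call through the table.
 * Unknown numbers get -ENOSYS, as in the real kernel. */
long toy_dispatch(unsigned long nr, long a0, long a1, long a2)
{
    if (nr >= TOY_NR_CALLS)
        return -ENOSYS;
    return toy_table[nr](a0, a1, a2);
}
```

The bounds check is the important part. Without it, a caller-supplied number would index past the end of the table and turn into an arbitrary jump inside privileged code.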
The kernel must treat syscall arguments as potentially malicious because they come from user space and a pointer among them may point anywhere. A malicious program could pass a pointer to kernel memory to trick the kernel into reading or overwriting it, or an unmapped pointer such as NULL in the hope of crashing it. Every pointer argument is validated before use: the kernel checks whether the address range falls within the user portion of the address space (access_ok() in Linux), and dereferencing must go through functions like copy_from_user() or get_user() that copy data safely from user to kernel memory, returning an error code if the copy fails. A common attack is passing a kernel-space address, and the kernel must reject it. seccomp filters can inspect syscall arguments as well, though only the raw register values; a BPF filter cannot dereference user pointers. Pointer validation is why passing a bad buffer pointer to write() returns EFAULT rather than crashing the kernel. An invalid file descriptor is caught by a separate lookup in the fd table and returns EBADF.
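The kernel-side checks are not visible from user space, but their effect is. The sketch below observes it through errno; the helper name write_errno is an assumption. A pipe is used as the target in the test because some devices, /dev/null among them, never read the buffer and so never notice a bad pointer.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: attempt a write and report the outcome as an errno
 * value. Returns 0 if the write succeeded, otherwise the errno the kernel
 * reported (for example EFAULT for a bad buffer, EBADF for a bad fd). */
int write_errno(int fd, const void *buf, size_t len)
{
    if (write(fd, buf, len) == -1)
        return errno;
    return 0;
}
```

With a valid pipe descriptor, a pointer into the kernel half of the address space comes back as EFAULT, and so does NULL. With a valid buffer but fd = -1, the result is EBADF. Neither mistake reaches the point where the kernel would dereference something it should not.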
Conclusion
- Syscalls are the only controlled entry from user space to kernel space.
- The CPU enforces rings; syscall (or int 0x80) is the instruction that crosses the boundary.
- read, write, open, close, fork, execve, mmap, brk are the ones you hit most.
- Success: non-negative return. Failure: -1 + errno. Python, Go, and other languages wrap this but don't change it.
- The C library wraps raw syscalls into convenient functions; fopen() vs open() is the classic example.
- VDSO lets the kernel pre-compute results for lightweight calls, so clock_gettime() often never enters the kernel at all.
- seccomp and seccomp-BPF let you filter the syscall surface per process.
Further Reading
- Kernel Architecture — How the kernel is structured internally
- Process Concept — How the OS represents and manages running programs
- Memory Allocation — How the kernel manages heap memory for processes