System Calls Interface

System calls are the boundary between user programs and the kernel. They are the mechanism by which user-space applications request services from the operating system — opening files, creating processes, allocating memory, and more. Understanding syscalls reveals how the OS enforces isolation and provides safe access to hardware.

published: May 20, 2026 reading time: 27 min read author: GeekWorkBench

Quick Summary

The sandbox your code runs in is real. Not metaphorical — enforced by the CPU, in hardware. Your program can calculate, loop, allocate structs on the stack, and thrash around in cache all day without ever asking permission. But open a file, send a packet, or allocate a chunk of heap memory, and you hit a wall. The kernel sits on the other side of that wall, and the only way your code can cross is through system calls.

Syscalls are the handshake between user space and kernel space. They are the only entry points through which unprivileged code reaches privileged operations. Without them, any bug in any program could overwrite your disk or hijack the CPU. With them, the kernel can inspect every request, enforce security policies, and share hardware across thousands of competing processes.

This post covers what syscalls are, how the CPU flips privilege levels to make them work, the calls you’ll hit most often in practice, and the security machinery built to contain them.

What Is a System Call

A system call is a controlled ring transition — from user space into kernel space — triggered by user-level code that needs something the kernel provides. The kernel runs in ring 0 (on x86), the highest privilege. Your application runs in ring 3. When your code fires a syscall instruction, the CPU bumps up to ring 0, jumps into kernel code, and the kernel takes over from there.

What kinds of operations cross that boundary? A rough taxonomy:

File I/O — open, read, write, close
Process management — fork, execve, exit, wait
Memory allocation — brk, mmap
Networking — send, recv on sockets
Inter-process communication — pipes, message queues, shared memory
Time and scheduling — clock_gettime, nanosleep, sched_yield

The C library (glibc, musl, uClibc) wraps these into functions you already know: open(), read(), fork(), malloc(). Worth knowing: malloc() itself is not a syscall. Under the hood it calls brk() or mmap() to get memory from the kernel. The syscall is the primitive; the library is ergonomics on top.

The User/Kernel Boundary

The ring model is simple. x86 has four rings (0–3). Linux and Windows use ring 0 for kernel code, ring 3 for everything else. Nothing in between.

When a syscall fires, the sequence is:

CPU switches to ring 0.
Program counter jumps to a fixed kernel address — the syscall entry point.
Kernel reads the syscall number from a register (eax on x86, rax on x86-64).
Kernel dispatches to the handler for that syscall number.
On return, privilege drops back to ring 3 and execution resumes in user code.

graph LR
    A["User Space (Ring 3)"] -->|"syscall instruction"| B["Kernel Space (Ring 0)"]
    B -->|"return value in rax"| A

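No user program can jump into kernel code arbitrarily. The syscall instruction can be executed from ring 3, but it always lands at the single entry point the kernel registered at boot (on x86-64, an address held in a model-specific register that only ring 0 can write). User code chooses which service to request, never which kernel address to run. This is the bedrock of process isolation.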

How the Syscall Instruction Works

Same idea across architectures, different instructions.

x86 (32-bit): `int 0x80`

On 32-bit Linux the entry path is the int 0x80 interrupt. Syscall number goes in eax, arguments in ebx, ecx, edx, esi, edi, ebp.

; Read 10 bytes from fd 3 into buffer
mov eax, 3        ; read = 3
mov ebx, 3        ; fd = 3
mov ecx, buffer   ; buffer address
mov edx, 10       ; byte count
int 0x80          ; kernel entry

x86-64: `syscall`

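On 64-bit Linux, the native entry path is the syscall instruction, which is faster than a software interrupt. int 0x80 still works on kernels built with 32-bit emulation, but it enters the 32-bit compatibility table and truncates arguments to 32 bits, so 64-bit code should not use it. Syscall number in rax, arguments in rdi, rsi, rdx, r10, r8, r9. The instruction itself clobbers rcx and r11.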

; Write 14 bytes to stdout
mov rax, 1        ; write = 1
mov rdi, 1        ; fd = 1 (stdout)
mov rsi, msg      ; buffer
mov rdx, 14       ; byte count
syscall

Both conventions index into a syscall table — the kernel’s dispatch array. It’s architecture-specific and lives in the kernel binary itself.

Common Syscalls

These come up constantly in real code:

Syscall	Number (x86-64)	What it does
`read`	0	Read from a file descriptor
`write`	1	Write to a file descriptor
`open`	2	Open a file, get a descriptor back
`close`	3	Close a file descriptor
`fork`	57	Clone the current process
`execve`	59	Replace the current process image with a new program
`exit`	60	Terminate the calling process
`mmap`	9	Map files or anonymous memory into the address space
`brk`	12	Move the program break (adjust heap size)

File Descriptors

Most file-related syscalls operate on file descriptors — small integers that index into the per-process fd table. Call open("/etc/passwd", O_RDONLY), the kernel allocates an fd (say, 3), stores a reference to the open file, and returns 3. Every subsequent read(fd, ...) looks up that 3 in the table.

File descriptors are process-local. When a process forks, children inherit the parent’s fd table — which is why a forked child can keep reading from the same files the parent had open.

Return Values and Error Handling

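Syscalls use a consistent convention at the C level: non-negative means success, -1 means error, and errno holds a positive code describing what went wrong. Underneath, the kernel returns a negative error number in the return register (for example -ENOENT); the libc wrapper converts that to -1 and stores the positive value in errno, which is thread-local rather than a true global.

#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <fcntl.h>    // open, O_RDONLY
#include <unistd.h>   // close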

int main(void) {
    int fd = open("/nonexistent/file", O_RDONLY);
    if (fd < 0) {
        fprintf(stderr, "open failed: %s\n", strerror(errno));
        return 1;
    }
    close(fd);
    return 0;
}

This is why -1 instead of exceptions — it maps directly onto C’s error-checking idiom. Python wraps it as OSError, Go as error, but the contract underneath is identical.

Library Wrappers: glibc, musl

You almost never call syscalls directly. You call library functions, which may:

Wrap a syscall one-to-one — musl’s open() is essentially syscall(SYS_open, ...).
Add buffering and semantics — fopen() wraps open() and gives you a FILE* with an internal buffer.
Build something larger — printf() wraps write() and adds formatting.

The difference matters when you’re tracing behavior. fopen() and open() end up in the same syscall, but fopen() does extra work around it: parsing the mode string, allocating a FILE struct, and associating a buffer that is filled on the first read rather than at open time.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    // Raw syscall wrapper: file descriptor
    int fd = open("data.txt", O_RDONLY);
    if (fd < 0) return 1;

    char buf[128];
    ssize_t n = read(fd, buf, sizeof(buf));
    close(fd);

    // Buffered standard I/O wrapper: FILE*
    FILE* fp = fopen("data.txt", "r");
    if (!fp) return 1;

    fgets(buf, sizeof(buf), fp);
    fclose(fp);

    return 0;
}

fopen() is better for sequential text reads — buffering cuts syscall traffic. But if you need non-blocking I/O, specific file offsets, or precise control, work with the raw fd.

VDSO and vsyscall

Every syscall has overhead: the privilege switch, the kernel entry, the return. Some syscalls are so cheap the overhead dominates. Linux has two optimizations for these.

vsyscall (legacy)

Before VDSO, Linux needed a way to avoid syscall overhead for cheap read-only operations; gettimeofday() was the main one. The kernel mapped a page at a fixed, predictable address (0xffffffffff600000 on x86-64), placed implementations of these getter functions there, and let libc call them as ordinary function pointers. No interrupt, no privilege switch.

The problem is that fixed address. ASLR cannot move it, so every process has executable code at a location an attacker knows in advance, which makes the page a convenient source of ROP gadgets. Modern Linux deprecates vsyscall: newer kernels trap calls into the page and emulate them in the kernel instead of executing code there, and some configurations remove the page entirely. VDSO replaces it by mapping at a randomized address, subject to ASLR.

The vsyscall page shows up in /proc/self/maps on older systems as a single execute-mapped page at that static address. If you are debugging a security issue on an old container image, that mapping is one of the first things to check — it is a predictable code execution primitive sitting in every process’s address space.

You can inspect this yourself:

# On a system with vsyscall enabled
cat /proc/self/maps | grep vsyscall
# Typical output on x86-64 (permission bits vary with the kernel's vsyscall mode):
# ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]

VDSO replaced vsyscall because it solves the address predictability problem at its root. The next section covers how VDSO works and why clock_gettime() is effectively free on modern Linux.

VDSO (Virtual Dynamic Shared Object)

A shared object the kernel maps into every process at a randomized address. Unlike vsyscall, VDSO is subject to ASLR. It contains user-mode implementations of a few cheap syscalls, such as clock_gettime(), next to a read-only data page that the kernel keeps updated. Libc locates the VDSO through the auxiliary vector and calls those functions directly, with no kernel entry needed in the common case.

clock_gettime() is the canonical example. The kernel publishes timekeeping data in the shared page, and the VDSO code combines it with the CPU's cycle counter to compute the current time entirely in user mode. If the clock source cannot be read from user space, the VDSO function falls back to a real syscall.

Check it out: cat /proc/self/maps | grep vdso. The [vdso] region is the shared code page available without a syscall.

A Practical Example: open/read/write

Here is a complete program that opens a file, reads it, and writes the content to stdout.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>

#define BUFFER_SIZE 256

int main(int argc, char* argv[]) {
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <file>\n", argv[0]);
        return 1;
    }

    const char* filepath = argv[1];

    // open is syscall 2 on x86-64; current glibc issues openat (257) for it
    int fd = open(filepath, O_RDONLY);
    if (fd < 0) {
        fprintf(stderr, "open('%s') failed: %s (errno=%d)\n",
                filepath, strerror(errno), errno);
        return 1;
    }

    char buffer[BUFFER_SIZE];
    ssize_t bytes_read = read(fd, buffer, BUFFER_SIZE - 1);
    if (bytes_read < 0) {
        fprintf(stderr, "read() failed: %s\n", strerror(errno));
        close(fd);
        return 1;
    }
    buffer[bytes_read] = '\0';

    // syscall number 1 on x86-64
    ssize_t bytes_written = write(STDOUT_FILENO, buffer, bytes_read);
    if (bytes_written < 0) {
        fprintf(stderr, "write() failed: %s\n", strerror(errno));
        close(fd);
        return 1;
    }

    close(fd);

    fprintf(stderr, "Processed %zd bytes from '%s'\n", bytes_read, filepath);
    return 0;
}

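Compile with gcc -o fileutil fileutil.c and run with ./fileutil somefile.txt. Trace the syscalls with strace -c ./fileutil somefile.txt. You’ll see openat (or open with older libcs), read, write, and close in the summary, alongside the calls the dynamic loader makes at startup. Note that the program makes a single read() call, so it handles at most 255 bytes of input.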

Security Implications: Syscall Filtering and seccomp

The syscall interface is the kernel’s attack surface. Every syscall is a potential vector — privilege escalation, data exfiltration, DoS. Linux gives processes two ways to shrink that surface.

seccomp (Secure Computing Mode)

seccomp restricts which syscalls a process may call. Once enabled in strict mode, any disallowed syscall kills the process with SIGKILL. It was designed for long-running servers that only need a handful of operations after initialization.

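#include <stdio.h>
#include <unistd.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

int main(void) {
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) < 0) {
        perror("prctl seccomp");
        return 1;
    }

    // Only read, write, exit, and sigreturn are allowed after this point
    write(STDOUT_FILENO, "Hello via seccomp\n", 18);

    // glibc's _exit() issues exit_group, which strict mode does not allow,
    // so invoke the plain exit syscall directly.
    syscall(SYS_exit, 0);

    // Any other syscall (open, mmap, exit_group, ...) kills the process
    return 0;
}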

seccomp-BPF

The modern seccomp interface uses Berkeley Packet Filter programs to filter syscalls by number, argument values, or both. Docker and container runtimes use seccomp-BPF to disable dangerous calls inside containers — mount(), syslog(), and similar — while leaving harmless ones available.

BPF filter programs attached to a process inspect each syscall before it reaches the kernel handler. A filter can examine the syscall number and all six arguments, then return ALLOW, KILL, or TRAP — kill the process or deliver a signal. More granular filters look at argument values: block open() when the flags contain O_CREAT, block write() to a specific fd range.

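Docker’s default seccomp profile is a JSON allowlist: syscalls not on the list fail with an error, which in practice disables around 44 syscalls. Docker compiles this into a BPF program that runs on every syscall. The kernel evaluates the filter at syscall entry, before the handler runs, so a blocked call such as mount() is refused before any kernel code for it executes. The default profile relaxes some rules when a container is granted extra capabilities, so the profile and the capability set should be reviewed together.

These filters use the same classic BPF instruction set that tcpdump filters compile to, but they run in a different place and see different data. Packet filters run in the network stack; seccomp filters run at every syscall entry and see only a small struct holding the syscall number, the architecture, the instruction pointer, and the six raw argument values. They cannot loop (jumps only go forward), cannot call functions, and cannot dereference pointers, so a filter can test a flags argument but cannot read the path string behind a pointer argument. The kernel checks each program when it is loaded and rejects anything with out-of-range jumps or loads outside that struct. This is why seccomp-BPF is safe to run in a production path: every filter is guaranteed to terminate after a bounded number of instructions.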

The filter return values work like a firewall:

SECCOMP_RET_ALLOW: let the syscall through
SECCOMP_RET_KILL: kill the calling thread immediately, with no chance to clean up (SECCOMP_RET_KILL_PROCESS, on newer kernels, takes down the whole process)
SECCOMP_RET_TRAP: send SIGSYS to the process, which you can catch and handle
SECCOMP_RET_ERRNO: return a custom error code to userspace without executing the syscall
SECCOMP_RET_TRACE: notify an attached ptrace tracer (such as a debugger) so it can inspect the call before it proceeds

Container runtimes usually return an error (ERRNO) for disallowed syscalls rather than killing the process; stricter sandboxes use KILL. Tracers attached with ptrace can use TRACE to intercept selected syscalls. You can attach a seccomp-BPF filter with prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) after writing a BPF program (the process must first set PR_SET_NO_NEW_PRIVS or hold CAP_SYS_ADMIN), though most people use libseccomp or Docker’s JSON profile, which compiles down to the binary format automatically.

Why This Matters

If a web server gets breached, the attacker should not be able to format the disk or spawn a root shell. seccomp makes that a property of the process rather than a property of the network boundary. Containers and minimal container images depend on this: restrict the syscall surface, restrict the blast radius of any future exploit.

The real-world impact shows up in container escapes. CVE-2022-0492 is a good example: a missing capability check in the kernel's cgroups v1 release_agent handling let a process inside a container run code on the host. The known exploit path relied on unshare() and mount() to reach a writable cgroup hierarchy, and containers running under Docker's default seccomp profile were not exploitable that way. In other words, seccomp is a defense-in-depth layer that protects against vulnerabilities, including unknown ones, in the kernel code a compromised process can reach.

Minimal container images like scratch and distroless ship without shells or package managers so that an attacker who gains code execution has few tools to work with. The image does not restrict syscalls by itself: a process running in scratch can still attempt any syscall, and it is the seccomp profile that decides which ones succeed. The two measures complement each other and follow the same principle: if your process does not need something, it should not be available to it.

This is also why Chrome runs its renderer processes under seccomp-BPF, and why systems like OpenSSH, vsftpd, and Dan York’s securespawn library use it for privilege separation. In each case the goal is the same: whatever the exploited process can do, the syscall filter limits it to a narrow set of operations. An attacker who compromises the renderer process in Chrome cannot pivot to making arbitrary syscalls on the host — the filter kills the process before it gets anywhere close.

The questions that follow dig into the mechanics of how syscalls and seccomp work at the level you need to explain in an interview. They cover the syscall dispatch mechanism, the error model, fork/execve behavior, and the specifics of memory allocation — all of which connect back to the security properties covered here.

Interview Questions

1. What happens when a program calls `open()` in C on Linux?

On Linux, glibc's open() loads a syscall number into rax and the arguments into registers, then fires the syscall instruction. For the open syscall (number 2 on x86-64) the path pointer, flags, and mode go in rdi, rsi, and rdx; current glibc actually issues openat (257), which takes a directory fd first. The CPU transitions to ring 0 and jumps to the syscall entry point. The kernel's handler copies the path in from user memory (rejecting pointers that do not refer to valid user addresses), resolves it, checks file permissions, creates an open file description, allocates a file descriptor, and returns the fd number. On error the kernel returns a negative error code, which the libc wrapper turns into -1 with errno set. The calling program checks the return value and handles the error if negative.

2. What is the difference between `int 0x80` and `syscall`?

int 0x80 is the legacy 32-bit Linux syscall mechanism: a software interrupt that transfers control to kernel mode via interrupt descriptor table entry 0x80. It was the only mechanism on early Linux. syscall is the dedicated fast-entry instruction that x86-64 Linux uses; it skips the interrupt descriptor lookup and jumps to an address preloaded in a model-specific register, which makes it faster. The two paths use different argument registers (ebx, ecx, edx, esi, edi, ebp versus rdi, rsi, rdx, r10, r8, r9) and separate syscall tables with different numbering. A 64-bit kernel built with 32-bit emulation still accepts int 0x80, but routes it to the 32-bit table and truncates arguments to 32 bits, so 64-bit programs should always use syscall.

3. Why does `malloc()` call `brk()` or `mmap()` instead of being a syscall itself?

malloc() is a library function in glibc/musl, not a syscall. It manages a heap within a process's address space by calling brk() or mmap() to request memory pages from the kernel — those are the actual syscalls. This design separates concerns: the kernel manages page-level allocation, while the C library manages byte-level heap semantics (coalescing free blocks, binning by size, alignment). The library also caches freed memory rather than returning it immediately to the kernel, which amortizes syscall overhead. Most malloc() calls never touch the kernel at all.

4. What is VDSO and why does `clock_gettime()` often avoid entering the kernel?

VDSO (Virtual Dynamic Shared Object) is a small shared library the kernel maps into every process's address space at a randomized location. It contains user-mode implementations of a few syscalls, notably clock_gettime(), alongside a read-only data page that the kernel keeps up to date. Instead of crossing the user/kernel boundary with a syscall instruction, libc calls the VDSO function, which combines the kernel-published timekeeping data with the CPU's cycle counter to compute the time in user mode. This eliminates the privilege switch for these "get" operations in the common case; if the clock source cannot be read from user space, the function falls back to a real syscall. You can verify VDSO is present with cat /proc/self/maps | grep vdso.

5. How does `fork()` work at the syscall level?

fork() is syscall number 57 on x86-64. The kernel's fork handler creates a new process control block (PCB/struct task_struct), duplicates the parent's memory mappings (copy-on-write — pages are shared until either parent or child writes to them), copies the file descriptor table (each fd points to the same open file description, reference count incremented), and sets up the child process's registers so it returns 0 from fork() while the parent gets the child's PID. The key insight is that the kernel doesn't copy the entire memory — just the page table entries, which makes fork fast even for large processes.

6. What does seccomp do, and why does Docker use it?

seccomp (Secure Computing Mode) is a Linux kernel feature that restricts which syscalls a process may invoke. In strict mode, the only allowed syscalls are read, write, _exit, and sigreturn — anything else kills the process with SIGKILL. Docker uses seccomp-BPF (an extension) to define fine-grained filters based on syscall number and arguments. By default, Docker blocks around 44 syscalls that are not needed for normal container operation — including mount(), syslog(), and module loading. This dramatically reduces the attack surface: even if an attacker compromises the container process, they cannot escalate to host-level access through syscalls the container isn't allowed to make.

7. Why do syscalls return -1 on error rather than throwing an exception?

C has no exceptions as a language feature (setjmp/longjmp provide non-local jumps, but nothing like structured exception handling). Unix was designed in the 1970s with C, and the error model reflects this: syscall wrappers return a signed integer, where non-negative values are success (including 0 from read at end of file, which is valid) and -1 indicates error. The kernel itself returns a negative error number; the libc wrapper stores the positive value in the thread-local errno variable (EFAULT, ENOENT, EPERM, etc.) and returns -1. This maps naturally onto C's error-checking idiom: if (fd < 0) handle_error(). Python, Go, and other languages wrap this underlying contract in their own exception or error-return patterns, but the kernel interface stays the same.

8. What is the relationship between a file descriptor and an open file description?

A file descriptor is a small non-negative integer (typically 0, 1, 2 for stdin/stdout/stderr, then 3 upward) that indexes into the per-process file descriptor table. An open file description (also called an open file table entry) is a kernel data structure storing the current file offset, access mode, and a reference to the inode. When you fork(), the child has its own fd table but the entries point to the same open file descriptions — this is why both parent and child see updates to the file offset. When you dup() an fd, both point to the same open file description. File descriptors are process-local; open file descriptions are reference-counted kernel objects that outlive individual fds pointing to them.

9. How does `dup2()` differ from `dup()`, and why would you use it?

dup() returns a new file descriptor that points to the same open file description as the original fd; the kernel allocates the lowest available fd number. dup2(oldfd, newfd) explicitly assigns the duplicate to a descriptor number you choose, closing newfd first if it's already open. The main use case for dup2() is redirecting standard input, output, or error before executing a child process, for example pointing STDIN_FILENO at a file without worrying about what fd number you get back. It also closes newfd and reuses it in one atomic step, which matters in multithreaded programs: with a separate close() followed by dup(), another thread could open a file in between and take the freed descriptor number.

10. What is the purpose of the `pipe()` syscall and how does it relate to fork?

pipe(int fds[2]) creates a unidirectional communication channel and returns two file descriptors: fds[0] for reading, fds[1] for writing. Data written to fds[1] can be read from fds[0]. The canonical pattern is: call pipe(), then fork(). After fork, each side closes the end it does not use; if the child is the producer, it closes the read end and writes, while the parent closes the write end and reads. The kernel's pipe implementation uses a circular buffer in kernel memory. If a process calls pipe() without forking, it has a one-directional channel to itself, which is still occasionally useful, for example in the self-pipe trick for waking an event loop from a signal handler.

11. How does `execve()` differ from `fork()`, and why are they often used together?

fork() creates a new process — it duplicates the calling process, giving the child a copy of the parent's memory, file descriptors, and register state (except the child gets 0 return from fork while the parent gets the child's PID). execve() does not create a new process — it replaces the current process's address space with a new program image, loading the executable from disk and resetting registers, stack, and heap. The typical sequence is: fork() to create a child process, then the child calls execve() to run a different program while inheriting the parent's open file descriptors (which is why pipes redirecting stdin/stdout work with subprocesses). Without execve, fork alone just gives you a clone of the same program running in two processes.

12. What is the difference between `mmap()` and `brk()` for memory allocation?

brk() moves the program break (the end of the heap) up or down by a requested amount — it's a simple interface for adjusting how much heap the kernel has allocated to the process. mmap() is more general: it creates a new memory mapping in the process's virtual address space. Mappings can be file-backed (mapping a file's contents into memory) or anonymous (just allocated pages with no backing store). malloc() uses both — small allocations come from the heap managed by brk-style logic inside glibc, while large allocations (typically > 128KB) call mmap() directly to get whole pages directly from the kernel without fragmenting the heap. The key difference is that brk reuses a single heap region while mmap creates entirely new non-contiguous virtual memory regions.

13. How do child processes inherit file descriptors across fork, and what happens to them on exec?

On fork(), the child receives a copy of the parent's file descriptor table, where each entry points to the same open file description in the kernel, with the reference count incremented. File descriptors are not duplicated by value; they share the underlying kernel object. The file descriptor flag FD_CLOEXEC determines behavior on execve(): if it is set, the fd is automatically closed when the process calls exec. This is how programs avoid leaking descriptors into the programs they launch: they open files with O_CLOEXEC (or set FD_CLOEXEC with fcntl) so that only the descriptors meant to be inherited, typically stdin, stdout, and stderr, survive the exec.

14. What does the `access()` syscall do and when would you use it over `open()`?

access(path, mode) checks whether the calling process has permission to access a file at a given path; it tests read, write, or execute permissions without actually opening the file. The mode flags are R_OK, W_OK, X_OK, and F_OK (existence check). Unlike open(), it does not create a file descriptor, does not track the file in the process's fd table, and doesn't update the file's access time. The check is made against the process's real UID/GID, not the effective IDs that open() uses, which is what lets a setuid program ask whether the invoking user could access the file. Because the file can change between the access() call and a later open(), the check is inherently racy; in most code it is safer to call open() and handle the error.

15. What is `ioctl()` and why is it considered a "miscellaneous" syscall?

ioctl(fd, request, ...) is a catch-all syscall for device-specific operations that don't fit the standard read/write/open/close model. It takes a file descriptor (often from opening a device node like /dev/tty or a socket) and a request code that uniquely identifies the operation — the third argument is polymorphic, depending on the request. Examples: TIOCGWINSZ to get terminal window size, FIONREAD to get bytes available to read on a file descriptor, SIOCGIFCONF to get network interface addresses. The request codes are typically defined in header files like <sys/ioctl.h> or device-specific headers. It's "miscellaneous" because it's the syscall that doesn't fit neatly into any other category — each device driver implements its own ioctl operations.

16. How does `prctl()` enable process control features like seccomp mode?

prctl(option, arg2, arg3, arg4, arg5) is a multi-purpose syscall for configuring process-level kernel features. Its behavior depends entirely on the option argument. Common uses include: PR_SET_SECCOMP to enable seccomp (the mode, such as SECCOMP_MODE_STRICT, is the second argument), PR_SET_PDEATHSIG to set a signal delivered to the process when its parent dies, PR_SET_NAME to set the calling thread's name (visible in /proc/self/status), and PR_GET_DUMPABLE to query whether the process may produce core dumps, a flag the kernel clears after setuid transitions. It's the primary interface for process self-configuration that doesn't fit elsewhere. seccomp-BPF extends this by using prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bpf_program) to load BPF filter programs.

17. What happens to a process's file descriptors when it calls `execve()`?

File descriptors are preserved across execve(): they remain open and point to the same open file descriptions they referenced before the exec. The file descriptor table is one of the few pieces of process state that survives while the address space is replaced. The key exception is that any fd with the FD_CLOEXEC flag set is closed automatically by the kernel during exec. This design supports the common Unix pattern where a parent sets up pipes before forking and execing a child: the child inherits the pipe fds and they remain open during exec so the new program can use them. Without this, redirecting stdin/stdout in subprocesses would require the subprocess to explicitly reopen file descriptors after starting.

18. How does `syscall()` differ from calling library wrappers like `read()` and `write()`?

read() and write() are library functions (in glibc, musl) that wrap the corresponding kernel syscalls. syscall() itself is a generic libc function taking a syscall number as its first argument, followed by up to six arguments passed to the kernel. You can call syscall(SYS_read, fd, buf, count) directly, bypassing the dedicated wrapper. Like the wrappers, syscall() returns -1 and sets errno on failure, but it gives up type checking of the arguments and any extra work the wrapper does, such as thread-cancellation handling or choosing the right syscall variant for the architecture. It is mainly useful for syscalls that libc does not wrap. Using syscall() also bypasses tools that interpose on the named libc wrappers (some debugging and sandboxing setups do this). In normal application code, use the wrappers: they are more ergonomic and more portable.

19. What is the role of `syscall` numbers and how does the kernel dispatch them?

Each syscall has a unique number (e.g., read = 0, write = 1, open = 2, fork = 57 on x86-64) that serves as an index into the kernel's syscall table. Before executing the syscall instruction, user code (normally libc) loads the syscall number into rax and the arguments into specific registers. The kernel's entry point (written in assembly) bounds-checks that number and uses it to index into an array of function pointers, sys_call_table[rax], calling the corresponding handler function. This indirection is why syscall numbers are architecture-specific: the table for x86-64 has different numbers and different handler addresses than for arm64 or x86-32. Userspace libc knows these numbers and passes them correctly when invoking syscall.

20. How does the kernel handle syscall arguments that are pointers to user memory?

The kernel must treat syscall arguments as potentially malicious because they point to memory in user space. A malicious program could pass a pointer to kernel memory to try to read it, or a NULL pointer to try to crash the kernel. Every pointer argument is validated before use: the kernel checks whether the address falls within the user portion of the address space (access_ok() in Linux), and dereferencing must go through functions like copy_from_user() or get_user() that copy data safely from user to kernel memory, returning an error code if the copy fails. A common attack is passing a kernel-space address, which the kernel must reject. Seccomp filters, by contrast, see only the raw argument values and cannot decide anything based on what a pointer refers to. The error codes reflect which check failed: write() with an invalid fd returns EBADF, while write() with a bad buffer pointer returns EFAULT, and in neither case does the kernel crash.

Conclusion

Kernel Architecture — How the kernel is structured internally
Process Concept — How the OS represents and manages running programs
Memory Allocation — How the kernel manages heap memory for processes
