Kernel Module Development
A comprehensive guide to writing Linux kernel modules, character drivers, and ioctl interfaces for extending kernel functionality.
Introduction
Kernel module development sits at the deepest level of Linux system programming. Unlike user-space applications that execute in a protected environment, kernel modules run with full system privileges, making them capable of extending the kernel’s functionality at runtime. This is how device drivers, file systems, and kernel extensions are implemented in Linux.
A kernel module is a loadable object file (.ko) that can be linked into the running kernel without rebooting the system. This capability is what makes Linux so adaptable: you can add support for new hardware or new features without restarting your machine.
When to Use / When Not to Use
Kernel modules are appropriate when you need to:
- Interact with hardware directly — Writing drivers for custom or proprietary hardware that has no existing Linux support
- Intercept system calls — Building security modules, auditing systems, or sandboxing solutions
- Implement file systems — Creating custom file system drivers for specialized storage backends
- Add kernel-level performance optimizations — Offloading computation that would otherwise suffer from user/kernel context switches
Kernel modules are NOT appropriate when:
- User-space alternatives exist — FUSE allows file systems in user space with less risk
- Your driver only needs to communicate with existing hardware — The kernel already has thousands of drivers; write a kernel module only if no suitable existing driver exists
- You need debugging flexibility — Kernel bugs crash the entire system; user-space debugging is far safer
Architecture or Flow Diagram
flowchart TB
subgraph "User Space"
APP[Application Program]
IOCTL[ioctl System Call]
end
subgraph "Kernel Space"
VFS[Virtual File System Layer]
DRIVER[Char Driver Layer]
MODULE[Kernel Module]
KSU[Kernel Subsystems]
end
subgraph "Hardware"
HW[Hardware Device]
end
APP --> IOCTL
IOCTL --> VFS
VFS --> DRIVER
DRIVER --> MODULE
MODULE --> KSU
KSU --> HW
style MODULE stroke:#ff6b6b,stroke-width:3px
style DRIVER stroke:#ffa94d,stroke-width:3px
Core Concepts
Module Lifecycle
Every kernel module follows a standard lifecycle pattern:
#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Your Name");
MODULE_DESCRIPTION("A simple example kernel module");
MODULE_VERSION("1.0");
/* Called when module is loaded */
static int __init my_module_init(void)
{
printk(KERN_INFO "my_module: initializer called\n");
/* Initialize your device, register character device, etc. */
return 0;
}
/* Called when module is unloaded */
static void __exit my_module_exit(void)
{
printk(KERN_INFO "my_module: cleanup called\n");
/* Release resources, unregister device, etc. */
}
module_init(my_module_init);
module_exit(my_module_exit);
Character Driver Registration
Character drivers provide a file-based interface to kernel functionality. The registration process involves:
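#include <linux/cdev.h>
#include <linux/fs.h>
#include <linux/ioctl.h>
#include <linux/module.h>
#include <linux/uaccess.h> /* copy_to_user(), copy_from_user(), put_user() */
#define DEVICE_NAME "my_device"
#define MAX_BUFFER_SIZE 1024
/* Must match the definitions used by the user-space program */
#define MY_IOCTL_RESET _IO('M', 0)
#define MY_IOCTL_GET_STATUS _IOR('M', 1, int)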
static dev_t dev_number;
static struct cdev my_cdev;
static char kernel_buffer[MAX_BUFFER_SIZE];
static int device_open(struct inode *inode, struct file *file)
{
printk(KERN_INFO "my_device: opened\n");
return 0;
}
static int device_release(struct inode *inode, struct file *file)
{
printk(KERN_INFO "my_device: closed\n");
return 0;
}
static ssize_t device_read(struct file *file, char __user *user_buffer,
size_t len, loff_t *offset)
{
size_t bytes_to_copy;
/* At or past the end: report EOF instead of letting the subtraction wrap */
if (*offset >= MAX_BUFFER_SIZE)
return 0;
bytes_to_copy = min(len, (size_t)(MAX_BUFFER_SIZE - *offset));
if (copy_to_user(user_buffer, kernel_buffer + *offset, bytes_to_copy))
return -EFAULT;
*offset += bytes_to_copy;
return bytes_to_copy;
}
static ssize_t device_write(struct file *file, const char __user *user_buffer,
size_t len, loff_t *offset)
{
size_t bytes_to_copy;
/* No room left in the buffer */
if (*offset >= MAX_BUFFER_SIZE)
return -ENOSPC;
bytes_to_copy = min(len, (size_t)(MAX_BUFFER_SIZE - *offset));
if (copy_from_user(kernel_buffer + *offset, user_buffer, bytes_to_copy))
return -EFAULT;
*offset += bytes_to_copy;
return bytes_to_copy;
}
static long device_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
int ret = 0;
switch (cmd) {
case MY_IOCTL_RESET:
memset(kernel_buffer, 0, MAX_BUFFER_SIZE);
break;
case MY_IOCTL_GET_STATUS:
ret = put_user(0, (int __user *)arg); /* Dummy status */
break;
default:
ret = -ENOTTY; /* conventional error for an unrecognized ioctl command */
}
return ret;
}
static struct file_operations fops = {
.owner = THIS_MODULE,
.open = device_open,
.release = device_release,
.read = device_read,
.write = device_write,
.unlocked_ioctl = device_ioctl,
};
static int __init my_driver_init(void)
{
int ret;
/* Dynamic device number allocation */
ret = alloc_chrdev_region(&dev_number, 0, 1, DEVICE_NAME);
if (ret < 0) {
printk(KERN_ERR "my_device: failed to allocate device number\n");
return ret; /* propagate the real error code instead of -1 */
}
cdev_init(&my_cdev, &fops);
my_cdev.owner = THIS_MODULE;
ret = cdev_add(&my_cdev, dev_number, 1);
if (ret < 0) {
unregister_chrdev_region(dev_number, 1);
return ret;
}
printk(KERN_INFO "my_device: registered with major %d\n", MAJOR(dev_number));
return 0;
}
static void __exit my_driver_exit(void)
{
cdev_del(&my_cdev);
unregister_chrdev_region(dev_number, 1);
printk(KERN_INFO "my_device: unregistered\n");
}
module_init(my_driver_init);
module_exit(my_driver_exit);
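The read and write handlers size each transfer from the file offset, and that arithmetic is easy to get wrong once the offset reaches the end of the buffer. The following is a minimal user-space sketch of the bounds calculation only; `sketch_clamp_len` and `SKETCH_BUFFER_SIZE` are hypothetical names, and `long long` stands in for the kernel's `loff_t`.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_BUFFER_SIZE 1024 /* mirrors MAX_BUFFER_SIZE in the driver */

/*
 * Hypothetical helper: number of bytes that may be transferred starting at
 * `offset` without running past the end of the buffer. Returns 0 for
 * negative offsets and for offsets at or beyond the end.
 */
static size_t sketch_clamp_len(size_t len, long long offset)
{
    size_t remaining;

    if (offset < 0 || offset >= SKETCH_BUFFER_SIZE)
        return 0; /* nothing left to transfer */

    remaining = SKETCH_BUFFER_SIZE - (size_t)offset;
    assert(remaining > 0 && remaining <= SKETCH_BUFFER_SIZE);
    return len < remaining ? len : remaining;
}
```

Checking the offset before subtracting matters because the result is converted to an unsigned `size_t`: without the check, an offset past the end would turn into a very large length rather than zero.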
User-Space Interaction via ioctl
The ioctl interface allows user applications to send custom commands to kernel modules:
/* User-space side (example.c) */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h> /* close() */
#define MY_IOCTL_RESET _IO('M', 0)
#define MY_IOCTL_GET_STATUS _IOR('M', 1, int)
int main(void)
{
int fd = open("/dev/my_device", O_RDWR);
if (fd < 0) {
perror("open");
exit(EXIT_FAILURE);
}
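/* Send custom commands and check each result */
if (ioctl(fd, MY_IOCTL_RESET) < 0)
perror("ioctl(MY_IOCTL_RESET)");
int status = 0;
if (ioctl(fd, MY_IOCTL_GET_STATUS, &status) < 0)
perror("ioctl(MY_IOCTL_GET_STATUS)");
else
printf("status: %d\n", status);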
close(fd);
return 0;
}
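The `_IO` and `_IOR` macros used above pack several fields into one command number. As a rough illustration, the sketch below re-implements that packing in portable C, assuming the generic layout (8-bit number, 8-bit type, 14-bit size, 2-bit direction); some architectures use different field widths, and every `SK_`/`sk_` name here is made up for the example rather than taken from the kernel headers.

```c
#include <assert.h>
#include <stdint.h>

/* Field widths of the generic layout (an assumption; not universal). */
#define SK_NRBITS   8
#define SK_TYPEBITS 8
#define SK_SIZEBITS 14

#define SK_NRSHIFT   0
#define SK_TYPESHIFT (SK_NRSHIFT + SK_NRBITS)     /* 8  */
#define SK_SIZESHIFT (SK_TYPESHIFT + SK_TYPEBITS) /* 16 */
#define SK_DIRSHIFT  (SK_SIZESHIFT + SK_SIZEBITS) /* 30 */

/* Direction is seen from user space: READ means the kernel writes to arg. */
enum sk_dir { SK_NONE = 0, SK_WRITE = 1, SK_READ = 2 };

/* Pack direction, type ("magic"), number and argument size into a command. */
static uint32_t sk_ioc(enum sk_dir dir, unsigned type, unsigned nr, unsigned size)
{
    assert(nr < (1u << SK_NRBITS) && type < (1u << SK_TYPEBITS));
    assert(size < (1u << SK_SIZEBITS));
    return ((uint32_t)dir << SK_DIRSHIFT) | ((uint32_t)size << SK_SIZESHIFT) |
           ((uint32_t)type << SK_TYPESHIFT) | ((uint32_t)nr << SK_NRSHIFT);
}

/* Decoders: extract each field again from a packed command. */
static unsigned sk_ioc_nr(uint32_t cmd)   { return (cmd >> SK_NRSHIFT) & 0xffu; }
static unsigned sk_ioc_type(uint32_t cmd) { return (cmd >> SK_TYPESHIFT) & 0xffu; }
static unsigned sk_ioc_size(uint32_t cmd) { return (cmd >> SK_SIZESHIFT) & 0x3fffu; }
static enum sk_dir sk_ioc_dir(uint32_t cmd) { return (enum sk_dir)(cmd >> SK_DIRSHIFT); }
```

Under these assumptions, `sk_ioc(SK_READ, 'M', 1, 4)` yields 0x80044D01. Because the size and direction are part of the number, a driver and a user program that disagree on the argument type end up with different command values, which is why both sides should share one header.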
Production Failure Scenarios
Scenario 1: Module Crashes During Load
Problem: A bug in the module's __init function causes a kernel oops or panic while the module is loading, which can freeze the system.
Mitigation:
- Always validate all allocations and operations in init functions
- Use __init annotations to free memory after initialization
- Implement proper error rollback in init failure paths
- Test modules thoroughly in a VM before production deployment
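The error-rollback advice above is usually implemented with a chain of `goto` labels that release resources in reverse order of acquisition. Here is a user-space sketch of that control flow; the three "resources" are just bits in a mask, and `sk_acquire`, `sk_release`, `sk_init`, and `sk_exit` are hypothetical stand-ins for real registration calls.

```c
#include <assert.h>
#include <errno.h>

/* Bitmask of "resources" currently held; stands in for real kernel state. */
static unsigned sk_held;

enum { SK_REGION = 1u << 0, SK_CDEV = 1u << 1, SK_CLASS = 1u << 2 };

/* Hypothetical acquire step: fails when `step` equals `fail_at`. */
static int sk_acquire(unsigned res, int step, int fail_at)
{
    if (step == fail_at)
        return -ENOMEM;
    sk_held |= res;
    return 0;
}

static void sk_release(unsigned res)
{
    assert(sk_held & res); /* releasing something never taken is a bug */
    sk_held &= ~res;
}

/* fail_at: 1..3 forces that step to fail; 0 lets every step succeed. */
static int sk_init(int fail_at)
{
    int ret;

    ret = sk_acquire(SK_REGION, 1, fail_at);
    if (ret)
        return ret;
    ret = sk_acquire(SK_CDEV, 2, fail_at);
    if (ret)
        goto err_region;
    ret = sk_acquire(SK_CLASS, 3, fail_at);
    if (ret)
        goto err_cdev;
    return 0;

err_cdev:
    sk_release(SK_CDEV);
err_region:
    sk_release(SK_REGION);
    return ret;
}

/* Exit mirrors a fully successful init, in reverse order. */
static void sk_exit(void)
{
    sk_release(SK_CLASS);
    sk_release(SK_CDEV);
    sk_release(SK_REGION);
}
```

Whichever step fails, the function returns with nothing held, so a failed load leaves no half-registered state behind.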
Scenario 2: Race Conditions in Multi-threaded Access
Problem: Concurrent read/write operations corrupt shared kernel buffer.
Mitigation:
- Use kernel synchronization primitives (spinlocks, mutexes, RCU)
- Always grab locks in consistent order to prevent deadlocks
- Use nonseekable_open() for character devices that don’t support seeking
- Consider using the kernel’s lockdep tool to detect lock ordering issues
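One common way to get a consistent lock order when two locks of the same kind must be held together is to always take the one at the lower address first. The sketch below shows only that ordering rule in user-space C; `struct sk_lock` is a stand-in that merely records whether it is held, not a real mutex, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for a kernel lock: only records whether it is held. */
struct sk_lock { bool held; };

static void sk_take(struct sk_lock *l) { assert(!l->held); l->held = true; }
static void sk_drop(struct sk_lock *l) { assert(l->held);  l->held = false; }

/* The lock that must be taken first: the one at the lower address. */
static struct sk_lock *sk_first(struct sk_lock *a, struct sk_lock *b)
{
    return (uintptr_t)a < (uintptr_t)b ? a : b;
}

/* Take both locks in address order, whatever order the caller passed. */
static void sk_lock_pair(struct sk_lock *a, struct sk_lock *b)
{
    struct sk_lock *first = sk_first(a, b);
    struct sk_lock *second = (first == a) ? b : a;

    assert(a != b);
    sk_take(first);
    sk_take(second);
}

/* Release in the reverse of the acquisition order. */
static void sk_unlock_pair(struct sk_lock *a, struct sk_lock *b)
{
    struct sk_lock *first = sk_first(a, b);
    struct sk_lock *second = (first == a) ? b : a;

    sk_drop(second);
    sk_drop(first);
}
```

Because `sk_lock_pair(&x, &y)` and `sk_lock_pair(&y, &x)` acquire in the same order, two threads calling them with swapped arguments cannot each end up holding one lock while waiting for the other.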
Scenario 3: Module Version Mismatch After Kernel Update
Problem: Module compiled for one kernel version fails to load after kernel update.
Mitigation:
- Use modinfo to compare the module's vermagic string with the running kernel version (uname -r)
- Rebuild modules after kernel updates
- Consider using the “staging” tree for drivers that need rapid iteration
- Sign modules with the same key used by the kernel (for secure boot systems)
Trade-off Table
| Aspect | In-Tree Kernel Module | User-Space Driver (FUSE) | Out-of-Tree Module |
|---|---|---|---|
| Performance | Highest—no context switches for I/O | Lower—FUSE adds overhead | Same as in-tree |
| Safety | Risky—bugs crash kernel | Safe—crashes only affect process | Risky, no upstream review |
| Maintenance | Mainlined = long-term support | Easier—standard debugging | High—manual tracking |
| Development Speed | Slower—kernel build cycles | Faster—familiar debugging | Varies |
Implementation Snippet: Device Class Registration
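A modern approach to device registration uses the device model. The snippet below registers a platform driver that binds to a device tree node through its compatible string: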
#include <linux/device.h>
#include <linux/mod_devicetable.h> /* struct of_device_id */
#include <linux/module.h>
#include <linux/platform_device.h>
static int my_probe(struct platform_device *pdev)
{
struct device *dev = &pdev->dev;
dev_info(dev, "probing\n");
/* Use devm_* for automatic resource cleanup */
return 0;
}
/* Newer kernels declare .remove as returning void; match your kernel's prototype */
static int my_remove(struct platform_device *pdev)
{
dev_info(&pdev->dev, "removing\n");
return 0;
}
static const struct of_device_id my_of_match[] = {
{ .compatible = "vendor,my-device", },
{ }
};
MODULE_DEVICE_TABLE(of, my_of_match);
static struct platform_driver my_driver = {
.probe = my_probe,
.remove = my_remove,
.driver = {
.name = "my_device",
.of_match_table = my_of_match,
},
};
module_platform_driver(my_driver);
MODULE_LICENSE("GPL");
Observability Checklist
For production kernel modules, monitor these aspects:
- dmesg output: Kernel log messages via printk with appropriate log levels
- /proc/modules: Check module loaded state and memory usage
- /sys/module/: Parameter values and statistics via module parameters
- ftrace: Trace function calls within the module during debugging
- perf: Profile CPU usage attributed to module functions
Security and Compliance Notes
- Never trust user input: Always validate data, and move it between user and kernel space only with copy_from_user() and copy_to_user()
- Avoid arbitrary ioctl commands: Design ioctl interfaces carefully to prevent privilege escalation
- Module signing: On systems that enforce module signatures (common with secure boot), unsigned modules will not load
- GPL module license: Only modules with a GPL-compatible license can use symbols exported with EXPORT_SYMBOL_GPL, which covers many internal kernel interfaces
- Minimize attack surface: Export only necessary symbols; hide internal interfaces
Common Pitfalls / Anti-patterns
- Forgetting to unregister: Always pair registration with proper cleanup in the exit function
- Blocking in atomic context: Don’t use sleeping locks or GFP_KERNEL allocations in atomic context
- Memory leaks: Every kmalloc must have a corresponding kfree; devm_* functions automate this for resources tied to a device
- Not checking return values: Ignore the return value of copy_to_user and your driver will report success for transfers that partially or completely failed
- Deadlocks: Acquiring locks in inconsistent order or holding locks while calling user callbacks
Quick Recap Checklist
- Kernel modules extend kernel functionality at runtime without rebooting
- Character drivers provide file-based interfaces via file_operations
- ioctl commands allow custom user-kernel communication
- Always validate and copy data between user and kernel space
- Use proper synchronization primitives for concurrent access
- Test extensively in VMs before production deployment
- Declare a GPL-compatible MODULE_LICENSE if the module needs GPL-only kernel symbols
- Use devm_* functions for automatic resource management
Real-World Case Study: USB Driver Development
Consider developing a USB device driver for a custom hardware device. The driver must:
- Identify the device using USB vendor/product IDs via USB_DEVICE() entries in the driver's ID table
- Locate the interrupt/bulk endpoints from the interface's endpoint descriptors
- Implement probe/remove for device lifecycle management
- Handle URBs (USB Request Blocks) for async I/O
- Manage device power states with usb_autopm_get_interface()
The complexity of USB driver development demonstrates why understanding the kernel’s driver model abstraction is essential—driver authors work at a high level while the kernel handles the messy details of descriptor parsing, bandwidth management, and hotplug coordination.
Advanced Topic: Kernel Symbols and EXPORT_SYMBOL
When one kernel module needs to call functions from another, the callee must explicitly export its symbols:
/* Symbol export example */
void internal_helper_function(int arg)
{
/* Implementation */
}
EXPORT_SYMBOL(internal_helper_function);
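Only exported symbols can be resolved by other modules at load time. /proc/kallsyms is broader: it can also list static and otherwise non-exported symbols, so appearing there does not make a symbol callable from another module. Use EXPORT_SYMBOL_GPL instead of EXPORT_SYMBOL to restrict a symbol to GPL-compatible modules. This mechanism allows the kernel to hide internal implementation details while exposing a deliberate interface. Always document which symbols are part of your module’s public API versus internal implementation.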
Interview Questions
What is the difference between copy_from_user and get_user? When would you use each?
get_user retrieves a single simple value (a char, int, long, or pointer, up to 8 bytes), while copy_from_user copies an arbitrary number of bytes. Use get_user for simple values like integers in ioctl handlers; use copy_from_user for buffers and larger data structures. Their return conventions differ: get_user returns 0 on success or -EFAULT on failure, whereas copy_from_user returns the number of bytes that could not be copied (0 on success).
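The two return conventions are easy to mix up: `copy_from_user()` reports the number of bytes left uncopied, while `get_user()` reports 0 or `-EFAULT`. The following user-space model illustrates how a handler typically normalizes both; the `sk_` functions are stand-ins that pretend only the first `readable` bytes of the source are accessible, and they are not the kernel implementations.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Stand-in for copy_from_user(): copies at most `readable` bytes (the part
 * of the source we pretend is mapped) and returns the bytes NOT copied.
 */
static size_t sk_copy_from_user(void *dst, const void *src, size_t n, size_t readable)
{
    size_t ok = n < readable ? n : readable;

    memcpy(dst, src, ok);
    return n - ok;
}

/* Stand-in for get_user() on an int: 0 on success, -EFAULT on failure. */
static int sk_get_user(int *dst, const int *src, size_t readable)
{
    if (readable < sizeof(*src))
        return -EFAULT;
    *dst = *src;
    return 0;
}

/* Typical handler pattern: any short copy becomes -EFAULT for the caller. */
static long sk_handler(void *dst, const void *src, size_t n, size_t readable)
{
    assert(dst != NULL && src != NULL);
    if (sk_copy_from_user(dst, src, n, readable))
        return -EFAULT;
    return (long)n;
}
```

Note that the handler never returns the raw "bytes not copied" value: callers expect either a byte count or a negative errno.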
When modprobe fails, check dmesg | tail for the kernel's error message. Common failures include missing symbols (unresolved dependencies), version magic mismatch, or init function errors. Use modinfo to inspect module metadata, depmod to rebuild module dependencies, and modprobe -v or a direct insmod of the .ko file to narrow down which step fails.
Explain the __init and __exit annotations. Why are they important?
__init marks functions called only during module initialization; the kernel can free their code memory after they complete. __exit marks functions called only during module unload; the kernel can omit them entirely when the code is built in or module unloading is disabled. These annotations optimize memory usage, which is especially important in embedded systems with limited RAM.
Always acquire locks in a consistent global order across all code paths. Use lockdep to detect potential deadlocks during development. Avoid calling user callbacks while holding locks—user code might try to acquire the same lock. Consider using mutex_lock_nested() with subclasses when the same lock type is used in nested contexts.
What is the MODULE_LICENSE macro and why does it matter?
MODULE_LICENSE declares the license terms of the module. Symbols exported with EXPORT_SYMBOL_GPL are available only to modules that declare a GPL-compatible license ("GPL", "GPL v2", "Dual BSD/GPL", etc.). A module that declares "Proprietary", or omits the macro altogether, taints the kernel and cannot use those GPL-only APIs. This mechanism ensures community contributions remain open while allowing vendors to ship closed-source modules with limited functionality.
What is the difference between alloc_chrdev_region and register_chrdev? When would you use each?
register_chrdev is the legacy interface: it registers a character driver with a fixed major number (or requests dynamic allocation when passed 0) and claims all 256 minors under that major. alloc_chrdev_region always dynamically allocates a major number together with a range of minors and provides a dev_t for use with the cdev interface. Use alloc_chrdev_region with cdev_init/cdev_add for new code. Neither call creates device nodes by itself; pair it with class_create() and device_create() so that udev can create the /dev entry.
Many kernel objects embed a struct kref for reference counting. When a consumer needs to keep an object alive, it calls kref_get. When done, it calls kref_put, which calls a release function when the count reaches zero. struct file objects follow the same pattern with their own counter: each holder takes a reference, and fput drops it. A memory mapping holds its own reference, so the file cannot disappear while it is still mapped.
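The get/put pattern described here can be modelled in a few lines of user-space C. This is a sketch in the spirit of kref using C11 atomics, not the kernel's implementation; `struct sk_ref`, `struct sk_session`, and the helper names are invented for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Minimal reference counter, modelled loosely on struct kref. */
struct sk_ref { atomic_int count; };

/* Example object that embeds the counter as its first member. */
struct sk_session {
    struct sk_ref ref;
    int released; /* set by the release callback, for demonstration */
};

/* A new object starts with one reference, owned by its creator. */
static void sk_ref_init(struct sk_ref *r) { atomic_init(&r->count, 1); }

static void sk_ref_get(struct sk_ref *r)
{
    int old = atomic_fetch_add(&r->count, 1);

    assert(old > 0); /* taking a reference on a dead object is a bug */
    (void)old;
}

/* Drops one reference; returns 1 if this call released the object. */
static int sk_ref_put(struct sk_ref *r, void (*release)(struct sk_ref *))
{
    assert(release != NULL);
    if (atomic_fetch_sub(&r->count, 1) == 1) {
        release(r);
        return 1;
    }
    return 0;
}

static void sk_session_release(struct sk_ref *r)
{
    /* ref is the first member, so the cast recovers the container. */
    struct sk_session *s = (struct sk_session *)r;

    s->released = 1;
}
```

The release callback runs exactly once, from whichever `sk_ref_put()` call drops the last reference, so no holder needs to know whether it is the final one.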
A spinlock busy-waits (spins) while waiting for the lock, which suits short critical sections in atomic context. A mutex can sleep (block) while waiting, which suits longer sections or code where sleeping is allowed. Never use a mutex in interrupt context; never sleep while holding a spinlock. Use spinlocks for short, quick operations (like updating a linked list). Use mutexes for operations that might take longer or require memory allocation.
The Linux Device Model (DM) provides a hierarchical tree of devices, buses, and drivers that enables: (1) power management coordination through the driver core, (2) dynamic device enumeration through sysfs, (3) driver binding and unbinding at runtime, (4) proper cleanup on device removal through reference counting. Before the DM, device handling was ad-hoc and led to duplicate code. The device model introduced struct device, struct bus_type, and struct class as the canonical abstractions for representing hardware in the kernel.
User memory must not be touched from interrupt context: (1) get_user, put_user, copy_from_user, and copy_to_user may fault and sleep, and in an interrupt handler there is no guarantee that the current process is the one that owns the pointer. (2) Copy the data into a kernel buffer from process context (for example in the write or ioctl handler) before the interrupt path needs it, and have the handler work only on that kernel buffer. (3) To return data, have the interrupt handler fill a kernel buffer and wake the reader, whose read handler then calls copy_to_user in process context. access_ok() only checks that the address range lies in user space; it does not make the access safe in atomic context.
Tasklets (built on top of softirqs) run in interrupt context: they are atomic, cannot sleep, and should execute quickly. Workqueues run in process context: they can sleep and suit longer-running deferred work. Key differences: (1) a tasklet runs on the CPU that scheduled it; work items run on whichever CPU the workqueue is configured to use; (2) tasklets cannot block (no sleeping memory allocation); work items can use the full kernel API; (3) tasklets are per-CPU by nature; workqueues can be per-CPU or shared. Use tasklets for quick, atomic work like updating statistics. Use workqueues for anything that might sleep (memory allocation, I/O, waiting for locks). schedule_work() queues to the system workqueue; for delayed execution, initialize the item with INIT_DELAYED_WORK() and queue it with schedule_delayed_work().
Modules have their own reference count rather than a struct kref. try_module_get() increments it (and fails if the module is being unloaded); module_put() decrements it. A module cannot be unloaded (rmmod fails) while the count is non-zero, which prevents unloading while in use. Example: for a character device whose file_operations sets .owner, the kernel calls try_module_get() when the device is opened and module_put() when it is closed. __module_get() is available when the caller already holds a reference. lsmod shows the current count in its "Used by" column.
Character devices (register_chrdev or alloc_chrdev_region + cdev_add) require major/minor number allocation and a separate step to create the device node (mknod, or class_create() plus device_create() together with udev). Misc devices (misc_register with MISC_DYNAMIC_MINOR) are a simplified wrapper: they share the misc major number (10), can dynamically assign a minor number, get their device node in /dev created automatically, and provide a simpler API. Use misc devices for simple drivers that expose a single control node (/dev/kvm and /dev/fuse are examples). For multiple instances of the same device type, character devices with dynamic allocation are more appropriate.
When would you use GFP_ATOMIC vs GFP_KERNEL for memory allocation in modules?
GFP_KERNEL: can sleep while the kernel reclaims or compacts memory. It is safe in process context and must not be used in atomic context (interrupt handlers, while holding a spinlock). GFP_ATOMIC: never sleeps and may dip into emergency memory reserves; if no memory is immediately available, the allocation fails. It is required in interrupt handlers, softirqs, tasklets, and while holding a spinlock. Trade-offs: GFP_ATOMIC fails more readily under memory pressure or fragmentation, and overusing it depletes the reserves. For modules that may run in any context, keep GFP_ATOMIC allocations in top-half handlers small and defer larger allocations to a bottom half that runs in process context (a workqueue), where GFP_KERNEL is safe.
Module parameters are declared with module_param(name, type, perm) and optionally described with MODULE_PARM_DESC(). Values can be set at load time on the command line (insmod module.ko param=value) or at runtime through /sys/module/ (if the perm bits include write). Perm bits: 0 (no sysfs entry), 0644 (world-readable, root-only write), 0600 (root-only). Each parameter with non-zero perm bits is exposed as a file under /sys/module/. For boolean parameters, use module_param() with the bool type; module_param_named() exposes a variable under a different parameter name. Arrays use module_param_array() with a separate count variable.
What is the purpose of try_then_request_module() in the kernel API?
try_then_request_module() is used to handle optional module dependencies. It is a macro that evaluates a lookup expression; if the result is NULL (or zero), it asks userspace to load the named module, waits for the load to finish, and evaluates the expression once more. Pattern: entry = try_then_request_module(find_entry(name), "module_name"), where find_entry() stands for whatever lookup the subsystem provides. Used for: (1) optional kernel features that can be compiled as modules; (2) backing device drivers that may be loadable. The macro combines a lookup with synchronous, on-demand module loading.
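The try, load, retry shape of this pattern can be modelled in user space. In the sketch below, a one-slot registry stands in for a table that a module would fill in when it loads, and `sk_request_module()` stands in for the real loader; the macro name and every `sk_` identifier are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*sk_handler_fn)(int);

/* The handler that the pretend module provides once "loaded". */
static int sk_double(int v) { return 2 * v; }

/* One-slot registry standing in for a table a module fills on load. */
static const char *sk_registered_name;
static sk_handler_fn sk_registered_fn;
static int sk_load_requests; /* how often the loader was invoked */

/* Lookup: returns the handler, or NULL if nothing is registered yet. */
static sk_handler_fn sk_lookup(const char *name)
{
    if (sk_registered_name && strcmp(sk_registered_name, name) == 0)
        return sk_registered_fn;
    return NULL;
}

/* Stand-in for the module loader: "loading" registers the handler. */
static int sk_request_module(const char *name)
{
    assert(name != NULL);
    sk_load_requests++;
    sk_registered_name = name;
    sk_registered_fn = sk_double;
    return 0;
}

/* Try first; only on a NULL result request the module and try once more. */
#define SK_TRY_THEN_REQUEST(expr, name) \
    ((expr) ? (expr) : (sk_request_module(name), (expr)))
```

The first `SK_TRY_THEN_REQUEST(sk_lookup("codec"), "codec")` triggers one load request and then succeeds; later calls find the handler immediately and never reach the loader.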
fasync (asynchronous notification) tells userspace that a device has data available without the process having to poll. The driver calls fasync_helper() from its fasync file operation and kill_fasync() when data arrives, which delivers a signal (SIGIO by default) to the registered process. select/poll (the driver's poll file operation calling poll_wait()) lets userspace wait for file descriptor readiness (readable, writable, error). Key difference: fasync is push-style, with the kernel signalling the process; poll is pull-style, with userspace calling select/poll and blocking. Use fasync for low-latency notification of asynchronous events (like incoming network packets); use poll for request-response style I/O. The poll method returns a bitmask of currently ready events; a fasync notification is sent only when the driver reports a state change.
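The "bitmask of currently ready events" can be illustrated with a small user-space function. This sketch assumes a driver that tracks how many bytes of a fixed-size buffer are in use; it uses the POSIX `POLLIN` and `POLLOUT` constants from `<poll.h>`, whereas a real kernel poll method would first call `poll_wait()` and return the kernel's own event flags. `sk_poll_mask` is a hypothetical name.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>

/*
 * Hypothetical readiness calculation for a buffer with `used` bytes
 * queued out of `capacity`.
 */
static short sk_poll_mask(size_t used, size_t capacity)
{
    short mask = 0;

    assert(used <= capacity);
    if (used > 0)
        mask |= POLLIN;  /* data waiting: a read would not block */
    if (used < capacity)
        mask |= POLLOUT; /* space left: a write would not block */
    return mask;
}
```

The mask describes the state at the moment of the call, which is why poll suits callers that ask when they are ready to act, while fasync suits callers that want to be interrupted when the state changes.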
What is the purpose of the owner field in struct file_operations?
The .owner field (set to THIS_MODULE) tells the kernel which module implements these operations. When the device is opened, the kernel takes a reference on that module and drops it when the file is released, so rmmod cannot succeed while a process has the device open. Without .owner, unloading a module while its device is in use causes use-after-free bugs: processes with open file descriptors would call into freed memory. Always set .owner = THIS_MODULE in your file_operations structure. For built-in drivers there is no module to unload, so the field has no effect.
Coherent DMA (dma_alloc_coherent(), dma_free_coherent()): allocates memory the device and CPU can access simultaneously. The kernel maintains the mapping for the lifetime of the allocation. This is simple, but on platforms without hardware cache coherency the memory is typically mapped uncached, which makes CPU access slower. Streaming DMA (dma_map_single(), dma_unmap_single()): maps a buffer only for the duration of a DMA transfer. The CPU must not access the buffer during that time unless it uses the dma_sync_* helpers. This is more efficient because the buffer can stay cached normally. Use coherent DMA for frequent, small accesses (descriptor rings); use streaming DMA for bulk transfers. For scatter-gather, use dma_map_sg() + dma_unmap_sg(). The dma_mask field in struct device controls addressing capabilities (32-bit vs 64-bit).
What is the container_of macro in kernel development and when is it used?
container_of(ptr, type, member) obtains the containing structure given a pointer to one of its members. Syntax: container_of(pointer_to_field, struct_type, field_name). Example: given a struct device *dev that is embedded in struct my_driver_data as device_member, container_of(dev, struct my_driver_data, device_member) recovers the struct my_driver_data pointer. This works because the macro uses offsetof() to find the member's position within the struct, then subtracts that offset from the pointer. Uses: (1) converting between layers of the device model: helpers such as to_platform_device(dev) are container_of wrappers (dev_get_drvdata(dev), by contrast, simply returns a stored pointer); (2) list traversal: list_entry(pos, struct type, member) in linked lists is container_of; (3) callbacks: when given a pointer to an embedded struct, recover the containing structure.
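The offset arithmetic behind container_of can be tried out in ordinary user-space C. The sketch below defines a simplified `sk_container_of` (without the type checking the kernel macro adds) and uses it on two made-up structs, `sk_device` and `sk_driver_data`.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified container_of: subtract the member's offset from its address. */
#define sk_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct sk_device { int id; };

struct sk_driver_data {
    int irq_count;
    struct sk_device dev; /* embedded by value, not a pointer */
};

/* Recover the driver data from a pointer to its embedded device. */
static struct sk_driver_data *sk_to_driver_data(struct sk_device *dev)
{
    assert(dev != NULL);
    return sk_container_of(dev, struct sk_driver_data, dev);
}
```

This only works when the member is embedded in the outer struct; if `dev` were a pointer field, the pointed-to device would live elsewhere and the subtraction would produce a bogus address.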
Further Reading
- Linux Device Drivers, 3rd Edition - Jonathan Corbet, Alessandro Rubini, Greg Kroah-Hartman
- Linux Kernel Documentation - Official kernel docs for subsystems
- LWN.net Kernel Articles - Weekly kernel development news
- Kernel.org Bugzilla - Track kernel issues and patches
- Linux Driver Verification - Tools for driver testing and validation
Conclusion
Kernel module development sits at the deepest level of Linux system programming, offering the ability to extend kernel functionality at runtime without rebooting. The core pattern involves implementing init/exit functions with proper lifecycle management, character driver registration with file operations, and careful user/kernel data transfer.
Security must be paramount: always validate and copy data between address spaces, design ioctl interfaces carefully to prevent privilege escalation, and follow GPL licensing requirements to access internal kernel symbols. Use devm_* functions for automatic resource management and proper synchronization primitives (spinlocks, mutexes, RCU) to handle concurrent access safely.
For continued learning, explore Linux Device Driver Model (platform drivers, PCI, USB), kernel debugging techniques (KGDB, ftrace, lockdep), and advanced topics like writing file system drivers or implementing network protocol handlers in kernel space.