CPU Affinity & Real-Time Operating Systems
CPU affinity binds processes to specific cores for cache warmth and latency control. RTOS adds deterministic scheduling with bounded latency for industrial, medical, and automotive systems.
Modern systems run dozens of processes simultaneously across multiple CPU cores. By default, the kernel scheduler makes placement decisions on your behalf, and it does a reasonable job. But there are times when you know better. When latency matters down to the microsecond, when cache warmth is measured in nanoseconds, when a missed deadline means a heart monitor fails, the default scheduler is not enough.
CPU affinity gives you direct control over which cores your processes inhabit. Real-time operating systems go further: they guarantee that work completes within a defined time bound. Together, these are the foundation of deterministic computing — systems that behave predictably, not just probabilistically.
This post is part of the Operating Systems Roadmap — specifically Section 3.3 on Process and Thread Management. If you are new to scheduling concepts, start with Process Scheduling and Process Concept for the fundamentals.
Introduction
CPU Affinity: Taking Control of Placement
CPU affinity is a property that lets you bind a process or thread to a specific subset of CPU cores. When you set affinity, the scheduler respects your preference — it will only place your work on the cores you specify.
The Linux API
Linux exposes affinity through a straightforward API built around bitmasks. Each bit in the mask corresponds to a logical CPU. If bit 3 is set, the process is eligible to run on core 3.
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h> /* getpid() */
int main(void) {
cpu_set_t cpuset;
pid_t pid = getpid();
/* Clear the mask — start fresh */
CPU_ZERO(&cpuset);
/* Set bits for cores 0 and 2 */
CPU_SET(0, &cpuset);
CPU_SET(2, &cpuset);
/* Apply the mask to the current process */
if (sched_setaffinity(pid, sizeof(cpu_set_t), &cpuset) == -1) {
perror("sched_setaffinity");
exit(EXIT_FAILURE);
}
/* Read it back to confirm */
if (sched_getaffinity(pid, sizeof(cpu_set_t), &cpuset) == -1) {
perror("sched_getaffinity");
exit(EXIT_FAILURE);
}
printf("Running on cores: ");
for (int i = 0; i < CPU_SETSIZE; i++) {
if (CPU_ISSET(i, &cpuset))
printf("%d ", i);
}
printf("\n");
return 0;
}
CPU_SETSIZE is typically 1024, meaning you can address up to 1024 logical CPUs. CPU_ZERO, CPU_SET, CPU_CLR, and CPU_ISSET are the basic operations on these masks.
The shell command taskset wraps this API for convenient command-line use:
# Restrict a process to cores 0 and 2
taskset -c 0,2 ./my_real_time_app
# Pin a worker to core 0 only (this does not keep other tasks off core 0)
taskset -c 0 ./latency_sensitive_worker
# Check the affinity of a running process
taskset -p $(pgrep -f my_real_time_app)
The -c flag accepts a comma-separated list or range. taskset -c 0-3 binds to cores 0 through 3.
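To make the list syntax concrete, here is a minimal sketch of how a comma-and-range list could be turned into a cpu_set_t. The function name parse_cpu_list is hypothetical and this is not how taskset itself is implemented; it only handles plain numbers and ranges, which is an assumption made to keep the example short.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdlib.h>

/* Hypothetical helper: parse a list such as "0,2,4-7" into *set.
 * Returns 0 on success, -1 on malformed input or an out-of-range CPU. */
int parse_cpu_list(const char *list, cpu_set_t *set)
{
    const char *p = list;

    CPU_ZERO(set);
    if (*p == '\0')
        return -1;                      /* empty list is rejected */

    while (*p != '\0') {
        char *end;
        long first, last;

        errno = 0;
        first = strtol(p, &end, 10);
        if (end == p || errno != 0 || first < 0 || first >= CPU_SETSIZE)
            return -1;
        last = first;

        if (*end == '-') {              /* range of the form first-last */
            p = end + 1;
            errno = 0;
            last = strtol(p, &end, 10);
            if (end == p || errno != 0 || last < first || last >= CPU_SETSIZE)
                return -1;
        }

        for (long cpu = first; cpu <= last; cpu++)
            CPU_SET((int)cpu, set);

        if (*end == ',') {
            end++;
            if (*end == '\0')
                return -1;              /* trailing comma */
        } else if (*end != '\0') {
            return -1;                  /* unexpected character */
        }
        p = end;
    }
    return 0;
}
```

The resulting set can be passed straight to sched_setaffinity, which is roughly the job a command-line wrapper has to do before it launches the target program.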
Why Bind to Specific Cores?
The default scheduler spreads work across all available cores. This is generally efficient, but it introduces variability that real-time and latency-sensitive workloads cannot tolerate.
Cache warmth is the primary motivation. Each CPU core has its own L1 cache and, on most designs, its own L2 cache. When a process runs on core 0 and then migrates to core 5, it starts with cold caches on the new core and must refill them one miss at a time. Each miss costs tens to hundreds of cycles, and a working set of any size takes many misses to reload. Keeping a latency-sensitive thread on a single core preserves its cache state and removes that source of jitter.
NUMA systems amplify this concern. On a multi-socket server, memory access latency depends on which socket the CPU belongs to. Binding a process to cores on the same socket as its working memory avoids cross-socket memory traffic, which is often 1.5-2x slower than local access and can be worse, depending on the platform.
Isolation matters enormously in production. If you dedicate cores 4-7 to a real-time signal processing thread, you can keep other processes from fragmenting cache state on those cores. You can even set those cores aside entirely with the isolcpus kernel parameter at boot, leaving them exclusively for your workload.
Determinism follows from isolation. Fewer variables means fewer sources of unexpected delay. A thread pinned to one core does not experience scheduler-induced migration — and that alone removes a significant variable from your latency budget.
Soft Affinity vs Hard Affinity
Linux implements two tiers of affinity.
Soft affinity (also called natural affinity) is what the scheduler applies by default — a preference for keeping a process on the same core where it ran previously. This is a hint, not a requirement. The scheduler will migrate the process if needed, for example when a core becomes overcommitted.
Hard affinity is what you get when you explicitly call sched_setaffinity. The kernel respects your mask strictly. If you mark core 2 as eligible, the process will never run on core 5. Hard affinity is enforced at the scheduler level, not merely preferred.
You can combine these. A thread can have hard affinity for a subset of cores while the scheduler within that subset applies soft affinity to keep the thread on one particular core as much as possible.
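Hard affinity can also be applied per thread rather than per process. The sketch below uses the glibc extensions pthread_getaffinity_np and pthread_setaffinity_np to narrow the calling thread's mask to a single core. The helper name pin_self_to_first_allowed_cpu is hypothetical, and picking the lowest-numbered allowed core is an arbitrary choice for illustration.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Hypothetical helper: pin the calling thread to the lowest-numbered CPU
 * it is already allowed to use. Returns that CPU number, or -1 on failure. */
int pin_self_to_first_allowed_cpu(void)
{
    cpu_set_t allowed;

    /* Start from the current mask so we never ask for a forbidden CPU
     * (for example, one excluded by a container's cpuset). */
    if (pthread_getaffinity_np(pthread_self(), sizeof(allowed), &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        cpu_set_t one;

        if (!CPU_ISSET(cpu, &allowed))
            continue;

        CPU_ZERO(&one);
        CPU_SET(cpu, &one);
        if (pthread_setaffinity_np(pthread_self(), sizeof(one), &one) != 0)
            return -1;
        return cpu;
    }
    return -1;                          /* no allowed CPU found */
}
```

A real application would more likely pin each worker to a core chosen from the machine's topology; the point here is only that the mask is a strict constraint once set.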
Real-Time Scheduling Policies
Standard process scheduling in Linux is designed for throughput and fairness. The Completely Fair Scheduler (CFS) allocates CPU time to maximize aggregate work done, not to meet deadlines. For workloads where meeting a deadline is more important than throughput, Linux provides real-time scheduling policies.
SCHED_FIFO
SCHED_FIFO is the simplest real-time policy. Processes under this policy have no time slice: they run until they voluntarily yield, block, or are preempted by a higher-priority real-time process. When a SCHED_FIFO process becomes runnable, it immediately preempts any lower-priority process.
Priority levels range from 1 to 99, with 99 being the highest. Setting a real-time priority requires privilege: root (more precisely, the CAP_SYS_NICE capability) or a nonzero RLIMIT_RTPRIO resource limit.
struct sched_param sp = { .sched_priority = 90 };
if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
perror("sched_setscheduler");
}
With SCHED_FIFO, if two processes at the same priority are both runnable, the one that has been waiting longer runs first — a simple FIFO queue. There is no time slicing whatsoever. A misbehaving SCHED_FIFO process that never yields will lock out all other processes at that priority level or lower.
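Because the valid priority range and the required privileges can both cause the call to fail, it helps to check them explicitly. The following sketch queries the range with sched_get_priority_min and sched_get_priority_max before switching policy. The wrapper name try_enter_fifo is hypothetical; returning the errno value is just one possible error convention.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>

/* Hypothetical helper: try to move the calling thread to SCHED_FIFO.
 * Returns 0 on success, otherwise an errno value such as EINVAL
 * (priority out of range) or EPERM (insufficient privilege). */
int try_enter_fifo(int priority)
{
    int lo = sched_get_priority_min(SCHED_FIFO);
    int hi = sched_get_priority_max(SCHED_FIFO);
    struct sched_param sp = { .sched_priority = priority };

    if (lo == -1 || hi == -1)
        return errno;
    if (priority < lo || priority > hi)
        return EINVAL;                  /* reject before calling the kernel */

    if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1)
        return errno;
    return 0;
}
```

An unprivileged caller would typically see EPERM here, which is a useful signal to fall back to the default policy rather than abort.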
SCHED_RR
SCHED_RR is identical to SCHED_FIFO except that processes at the same priority level share CPU time through round-robin rotation. Each process gets a fixed time quantum before the scheduler rotates to the next process at that priority.
Round-robin rotation prevents a single process from monopolizing a priority level, but it introduces some scheduling latency: once a process exhausts its quantum, the wait before it runs again is bounded by the number of other runnable processes at that priority times the quantum duration, assuming no higher-priority process intervenes.
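The bound on the wait can be written down directly. The sketch below pairs that arithmetic with sched_rr_get_interval, the POSIX call that reports the round-robin quantum. Both function names are hypothetical, and the bound assumes every peer uses its full quantum and that no higher-priority task runs in between.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <time.h>

/* Upper bound on the wait before a SCHED_RR thread runs again when
 * `runnable` threads (including itself) share its priority.
 * Assumes no higher-priority work intervenes. */
int64_t rr_worst_case_wait_ns(int runnable, int64_t quantum_ns)
{
    if (runnable <= 1)
        return 0;                       /* nobody to rotate with */
    return (int64_t)(runnable - 1) * quantum_ns;
}

/* Quantum the kernel reports for the calling process, in nanoseconds,
 * or -1 on error. The value is only meaningful as a round-robin quantum
 * when the process actually runs under SCHED_RR. */
int64_t rr_quantum_ns(void)
{
    struct timespec ts;

    if (sched_rr_get_interval(0, &ts) == -1)
        return -1;
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
```

For example, with four runnable threads at one priority and a 100 ms quantum, a thread that has just been rotated out may wait up to 300 ms for its next turn.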
SCHED_DEADLINE
SCHED_DEADLINE is the most sophisticated real-time policy, introduced in Linux 3.14. It implements the Earliest Deadline First (EDF) algorithm with bandwidth reservation.
Each task specifies:
- Runtime — the maximum CPU time it needs per period
- Deadline — the time by which it must complete
- Period — the interval between successive invocations
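/* Layout expected by the sched_setattr(2) system call. Recent kernel and
   glibc headers may already define this struct; declare it only if yours
   do not. The __u32/__u64 types come from <linux/types.h>. */
struct sched_attr {
    __u32 size;
    __u32 sched_policy;
    __u64 sched_flags;
    __s32 sched_nice;      /* SCHED_OTHER / SCHED_BATCH only */
    __u32 sched_priority;  /* SCHED_FIFO / SCHED_RR only */
    __u64 sched_runtime;   /* nanoseconds */
    __u64 sched_deadline;  /* nanoseconds */
    __u64 sched_period;    /* nanoseconds */
};
struct sched_attr attr = {
    .size = sizeof(attr),
    .sched_policy = SCHED_DEADLINE,
    .sched_runtime = 10 * 1000 * 1000,  /* 10 ms */
    .sched_deadline = 20 * 1000 * 1000, /* 20 ms */
    .sched_period = 50 * 1000 * 1000,   /* 50 ms */
};
/* Older glibc versions ship no sched_setattr() wrapper, so invoke the
   system call directly (needs <unistd.h> and <sys/syscall.h>). */
if (syscall(SYS_sched_setattr, 0, &attr, 0) == -1) {
    perror("sched_setattr");
}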
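A task that uses more runtime than it reserved is throttled until its next period. When a task requests SCHED_DEADLINE, the kernel runs an admission test and rejects the request if the total reserved bandwidth (runtime divided by period, summed over all deadline tasks) would exceed the capacity it allows. On a single CPU with deadlines equal to periods, EDF theory says every admitted task meets its deadlines. With shorter deadlines or multiple CPUs the guarantee is weaker, so validate it by measurement.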
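The bandwidth check is simple arithmetic, so it can be sketched outside the kernel. The code below is an illustration of a utilization test, not the kernel's implementation. The names dl_task and dl_admissible are hypothetical, and the per-CPU cap is a parameter because the usable fraction is configurable (Linux reserves 95% for real-time tasks by default).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one deadline task, all values in ns. */
struct dl_task {
    uint64_t runtime_ns;
    uint64_t deadline_ns;
    uint64_t period_ns;
};

/* Returns 1 if the task set passes a simple utilization test:
 * sum(runtime / period) <= max_util_per_cpu * cpus, else 0.
 * Also rejects tasks that violate runtime <= deadline <= period. */
int dl_admissible(const struct dl_task *tasks, size_t n,
                  unsigned cpus, double max_util_per_cpu)
{
    double total = 0.0;

    for (size_t i = 0; i < n; i++) {
        if (tasks[i].runtime_ns == 0 ||
            tasks[i].runtime_ns > tasks[i].deadline_ns ||
            tasks[i].deadline_ns > tasks[i].period_ns)
            return 0;                   /* malformed reservation */
        total += (double)tasks[i].runtime_ns / (double)tasks[i].period_ns;
    }
    return total <= max_util_per_cpu * (double)cpus;
}
```

With the 10 ms / 50 ms reservation used above, each task consumes 20% of a core, so four such tasks fit under a 95% cap on one CPU and a fifth does not.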
Comparing Scheduling Policies
| Policy | Time Slicing | Priority Range | Deadline Support | Use Case |
|---|---|---|---|---|
| SCHED_OTHER (CFS) | Yes, fair share | 0 only (nice values adjust the share) | No | General computing |
| SCHED_FIFO | No | 1-99 | No | Highest-priority work that runs until it yields, blocks, or is preempted by a higher priority |
| SCHED_RR | Yes, fixed quantum | 1-99 | No | Shared priority real-time tasks |
| SCHED_DEADLINE | Bandwidth reserved | None (EDF order, scheduled ahead of FIFO/RR) | Yes | Periodic tasks with explicit deadlines |
The Priority Inversion Problem
Real-time scheduling introduces a subtle failure mode that haunted the Mars Pathfinder mission in 1997. A high-priority task was blocked waiting for a low-priority task to release a resource — but medium-priority tasks kept preempting the low-priority one, delaying the release indefinitely. The high-priority task missed its deadline and triggered a system reset.
This scenario has a name: priority inversion. A high-priority task is indirectly blocked by a medium-priority task, because the medium-priority task is preempting a low-priority task that holds a lock needed by the high-priority task.
The classic sequence:
- Low-priority task L acquires a mutex
- L is preempted by medium-priority task M
- High-priority task H starts, tries to acquire the mutex, and blocks
- H waits while M runs
- H starves, and its deadline may be missed
The window of vulnerability lasts as long as L holds the mutex while M is able to preempt L. In the worst case, H never runs.
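The sequence can be replayed in a tiny tick-by-tick simulation. This is a toy model under stated assumptions, not a scheduler: L needs 4 ticks and holds the mutex throughout, M needs 10 ticks and never touches the mutex, H needs 2 ticks with the mutex, and they are released at ticks 0, 1, and 2. The function name simulate_h_finish and all the numbers are made up for illustration; the flag switches on a simple form of priority inheritance so the two outcomes can be compared.

```c
#include <stdbool.h>

enum { TASK_L = 0, TASK_M = 1, TASK_H = 2, NTASKS = 3 };

/* Returns the tick at which H completes, or -1 if it never does. */
int simulate_h_finish(bool inherit)
{
    const int release[NTASKS] = { 0, 1, 2 };
    const int base_prio[NTASKS] = { 1, 2, 3 };
    const bool needs_lock[NTASKS] = { true, false, true };
    int remaining[NTASKS] = { 4, 10, 2 };
    int owner = -1;                     /* task holding the mutex, -1 if free */

    for (int t = 0; t < 100; t++) {
        int best = -1, best_prio = -1;

        for (int i = 0; i < NTASKS; i++) {
            int prio = base_prio[i];

            if (t < release[i] || remaining[i] == 0)
                continue;               /* not released yet, or finished */
            if (needs_lock[i] && owner != -1 && owner != i)
                continue;               /* blocked on the mutex */

            /* With inheritance, the owner runs at the priority of the
             * highest-priority task currently waiting for the mutex. */
            if (inherit && i == owner)
                for (int w = 0; w < NTASKS; w++)
                    if (w != i && needs_lock[w] && t >= release[w] &&
                        remaining[w] > 0 && base_prio[w] > prio)
                        prio = base_prio[w];

            if (prio > best_prio) {
                best = i;
                best_prio = prio;
            }
        }
        if (best == -1)
            continue;                   /* idle tick */

        if (needs_lock[best])
            owner = best;               /* acquire (or keep) the mutex */
        if (--remaining[best] == 0) {
            if (owner == best)
                owner = -1;             /* release on completion */
            if (best == TASK_H)
                return t + 1;
        }
    }
    return -1;
}
```

In this model H finishes at tick 16 without inheritance, because M's entire 10 ticks are inserted ahead of it, and at tick 7 when the lock holder is boosted.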
Priority Inheritance: The Solution
The kernel solves priority inversion through priority inheritance for mutexes. When a high-priority task blocks on a mutex held by a lower-priority task, the kernel temporarily raises the holding task’s priority to match that of the blocked task.
pthread_mutex_t mutex;
pthread_mutexattr_t attr;
pthread_mutexattr_init(&attr);
/* Request priority inheritance for the mutex */
pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
pthread_mutex_init(&mutex, &attr);
With PTHREAD_PRIO_INHERIT, the mutex records the priority of the highest-priority waiter. When the owning thread releases the mutex, its priority is restored to its original level. This propagation is transitive — if the owning thread itself holds another mutex with a waiter of even higher priority, the inheritance chain extends.
Priority inheritance is local to the mutex. It does not permanently change thread priorities. The kernel applies it only for the duration of the lock acquisition.
Linux mutexes (futex-based) support priority inheritance when configured with PTHREAD_PRIO_INHERIT. Not all kernel mutex implementations do. When designing real-time systems, verify that all synchronization primitives use priority inheritance.
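Since each pthread call in the snippet above can fail, a production version would normally check return codes and release the attribute object. The sketch below wraps that in one function. The name init_pi_mutex is hypothetical, and returning the pthread error number is just one reasonable convention.

```c
#define _GNU_SOURCE
#include <pthread.h>

/* Hypothetical helper: initialize *m as a priority-inheritance mutex.
 * Returns 0 on success, or the error number from the failing call
 * (for example ENOTSUP where the protocol is unavailable). */
int init_pi_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);

    if (rc != 0)
        return rc;

    rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);

    /* The attribute object is no longer needed once the mutex exists. */
    pthread_mutexattr_destroy(&attr);
    return rc;
}
```

Checking the result at startup makes it obvious when a platform silently lacks priority inheritance, which is easier to handle than a missed deadline in the field.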
PREEMPT_RT: Making Stock Linux Real-Time Capable
The standard Linux kernel is not a real-time operating system. It is preemptible at many levels, but it has regions of code, particularly in interrupt handlers and critical sections, where preemption is disabled. The longest such non-preemptible section sets the ceiling on worst-case scheduling latency. On a stock kernel, that ceiling can range from hundreds of microseconds to several milliseconds.
The PREEMPT_RT patch set, maintained by a small team of kernel developers and merged incrementally into mainline over the past two decades, addresses this by converting the remaining non-preemptible regions into preemptible ones.
Key transformations in PREEMPT_RT:
- Most interrupt handlers become threaded: they run as kernel threads rather than in hardirq context, which makes them preemptible and lets them be prioritized.
- Ordinary spinlocks become sleeping locks, allowing the holder to be preempted. Only raw spinlocks keep true spinning semantics.
- Critical sections that cannot tolerate preemption are isolated and minimized.
- The kernel’s sleeping locks use priority inheritance.
The result is a kernel whose preemption latency is bounded and measurable: typically tens of microseconds on well-tuned modern hardware, rather than milliseconds.
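To use PREEMPT_RT, you either run a distribution that ships a pre-built real-time kernel (Ubuntu and RHEL, for example, offer one), or you build your own: apply the patch set to a matching kernel version, or, on recent mainline kernels that include PREEMPT_RT for your architecture, enable it in the kernel configuration.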
Once running, you can verify latency with cyclictest from the rt-tests suite:
# Measure scheduling latency on all cores
cyclictest -l 1000000 -m -S -p 90
# Measure latency with per-thread histogram
cyclictest -l 1000000 -m -h 40 -q
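The -p 90 option sets the test thread to priority 90. -h 40 prints a latency histogram that tracks values up to 40 microseconds; samples above that are reported as overflows. On a healthy PREEMPT_RT system, the vast majority of samples, and ideally the maximum as well, stay within tens of microseconds even under heavy load. The maximum is the number that matters for a deadline.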
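The measurement idea behind such a tool can be sketched in a few lines: sleep until an absolute deadline, then see how late the wake-up was. The code below is a simplified illustration and not cyclictest's implementation. The function name max_wakeup_latency_ns is hypothetical, and it assumes interval_ns is shorter than one second.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <time.h>

/* Sleep to `iterations` absolute deadlines spaced interval_ns apart and
 * return the largest wake-up lateness seen, in ns, or -1 on error. */
int64_t max_wakeup_latency_ns(int iterations, long interval_ns)
{
    struct timespec next, now;
    int64_t worst = 0;

    if (clock_gettime(CLOCK_MONOTONIC, &next) == -1)
        return -1;

    for (int i = 0; i < iterations; i++) {
        /* Advance the absolute deadline by one interval. */
        next.tv_nsec += interval_ns;
        while (next.tv_nsec >= 1000000000L) {
            next.tv_nsec -= 1000000000L;
            next.tv_sec++;
        }

        /* TIMER_ABSTIME avoids drift from accumulating across cycles. */
        if (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL) != 0)
            return -1;
        if (clock_gettime(CLOCK_MONOTONIC, &now) == -1)
            return -1;

        int64_t late = (int64_t)(now.tv_sec - next.tv_sec) * 1000000000LL
                     + (now.tv_nsec - next.tv_nsec);
        if (late > worst)
            worst = late;
    }
    return worst;
}
```

Run under SCHED_FIFO on an isolated core, a loop like this gives a rough first look at wake-up jitter; a dedicated tool remains the better choice for real qualification runs.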
Real-Time Operating Systems: Purpose-Built for Determinism
While Linux with PREEMPT_RT can meet soft real-time requirements, many embedded and safety-critical domains need formal certification, formally verified worst-case latency bounds, or minimal footprint. For these, dedicated RTOS solutions exist that are smaller and more deterministic than a general-purpose kernel.
FreeRTOS
FreeRTOS is one of the most widely deployed RTOSes in the world, running on billions of microcontrollers in IoT devices, wearables, and industrial sensors. It is MIT licensed and runs on over 40 microcontroller architectures.
A FreeRTOS application consists of a small kernel and application tasks:
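#include "FreeRTOS.h"
#include "task.h"
/* Application-specific work, assumed to be defined elsewhere */
extern void process_sensor_data(void);
/* Task function */
void vTaskFunction(void *pvParameters) {
    const char *task_name = (const char *)pvParameters;
    (void)task_name; /* available for logging; unused here */
    TickType_t wake_time = xTaskGetTickCount();
    while (1) {
        /* Do real-time work */
        process_sensor_data();
        /* Delay until next period, 100 ms after the previous wake-up */
        vTaskDelayUntil(&wake_time, pdMS_TO_TICKS(100));
    }
}
int main(void) {
    /* Stack depth is in words, not bytes; a higher number is a higher priority */
    xTaskCreate(vTaskFunction, "Sensor", 1024, (void *)"Sensor", 2, NULL);
    xTaskCreate(vTaskFunction, "Logger", 1024, (void *)"Logger", 1, NULL);
    vTaskStartScheduler();
    /* Should never reach here */
    for (;;);
}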
FreeRTOS supports priority-based preemptive scheduling with optional time slicing. Tasks can communicate through queues, semaphores, and mutexes with priority inheritance.
VxWorks
VxWorks is a mature, certifiable RTOS used in aerospace, defense, medical devices, and automotive systems. It powers the Curiosity and Perseverance rovers on Mars, and it is used in many safety-critical automotive applications.
VxWorks provides a POSIX-compliant interface, deterministic scheduling, and an architecture with memory protection. Its certification heritage means it comes with extensive documentation for DO-178C (aerospace) and ISO 26262 (automotive) compliance.
Zephyr
The Zephyr Project is a Linux Foundation-hosted RTOS designed for resource-constrained devices. It scales from microcontrollers with only a few kilobytes of RAM up to more capable systems. Zephyr uses a small monolithic kernel that is linked with the application into a single image, and kernel features, drivers, and protocol stacks are optional components you enable as needed.
Zephyr supports multiple architectures including x86, ARM, RISC-V, and Tensilica Xtensa. Its configuration system lets you build minimal images without unnecessary components — critical for MCUs with tight memory budgets.
QNX
QNX Neutrino is a POSIX-certified microkernel RTOS used in automotive infotainment systems, medical devices, and industrial control. The kernel itself is just a few tens of kilobytes, and most OS services run as user-space processes. This design means a fault in a driver or filesystem does not crash the kernel.
QNX has a long track record in safety-critical applications and carries certifications for medical (IEC 62304), automotive (ISO 26262), and industrial (IEC 61508) standards.
Scheduling Hierarchy: A Visual Map
Understanding how these concepts relate helps frame the design decisions. The following diagram maps the landscape from bare hardware through kernel scheduling to real-time policies.
graph TD
Hardware["Hardware<br/>CPU Cores"]
Kernel["Linux Kernel"]
Scheduler["Scheduler"]
CFS["CFS<br/>SCHED_OTHER"]
RT["Real-Time Policies<br/>SCHED_FIFO · SCHED_RR · SCHED_DEADLINE"]
Affinity["CPU Affinity API<br/>sched_setaffinity · taskset"]
Preempt["PREEMPT_RT Patch"]
RTOS["Dedicated RTOS<br/>FreeRTOS · VxWorks · Zephyr · QNX"]
Hardware --> Kernel
Kernel --> Scheduler
Scheduler --> CFS
Scheduler --> RT
Scheduler --> Affinity
Affinity --> Preempt
Scheduler --> Preempt
Hardware --> RTOS
CFS -.->|"Best effort<br/>No latency guarantee"| CFS_label["Throughput-optimized"]
RT -.->|"Priority-based<br/>Bounded latency"| RT_label["Real-time workloads"]
Affinity -.->|"Core pinning<br/>Cache warmth"| Affinity_label["Latency-sensitive"]
Where Determinism Matters: Use Cases
Industrial Robots
Modern manufacturing robots operate on tight control loops — a pick-and-place arm may need to complete a cycle within 500 microseconds to maintain throughput on an assembly line. The robot controller runs a real-time task that must guarantee this deadline regardless of what other tasks on the system are doing. FreeRTOS or a PREEMPT_RT Linux system with a dedicated core handles this.
Medical Devices
An infusion pump that delivers medication must complete its bolus delivery within a specified time window. A pacemaker must deliver stimulation pulses within a few milliseconds of detecting an arrhythmia. These are high-risk devices (a pacemaker, for example, is an FDA Class III device), and regulators such as the Food and Drug Administration expect rigorous evidence that timing requirements are met. Certifiable RTOS platforms like VxWorks or QNX provide the paper trail that regulatory submissions require.
Audio and Digital Signal Processing
Professional audio workstations demand sub-millisecond latency to avoid audible artifacts. A buffer underrun — when the audio thread fails to fill the buffer in time — produces a pop or drop that breaks the performance. Audio threads are scheduled with real-time policies, often pinned to isolated cores to prevent any interference from the operating system.
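The deadline an audio thread faces follows directly from the buffer size and sample rate. The small helper below makes that arithmetic explicit. The function name buffer_period_us is hypothetical, and the figures in the note are illustrative rather than taken from any particular workstation.

```c
#include <stdint.h>

/* Time, in microseconds, covered by one buffer of audio. This is how long
 * the audio thread has to produce the next buffer before an underrun. */
double buffer_period_us(uint32_t frames, uint32_t sample_rate_hz)
{
    if (sample_rate_hz == 0)
        return 0.0;                     /* avoid dividing by zero */
    return (double)frames * 1e6 / (double)sample_rate_hz;
}
```

At 48 kHz, a 48-frame buffer gives the thread exactly 1000 microseconds per cycle, and a 32-frame buffer gives about 667 microseconds, which is why scheduling jitter of even a few hundred microseconds can be audible.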
High-Frequency Trading
In financial markets, a few microseconds of latency can represent millions of dollars in missed arbitrage opportunities. Trading systems use kernel bypass networking (via technologies like DPDK) combined with real-time scheduling to minimize jitter in order execution. Hard affinity ensures the trading engine never leaves its designated cores.
Quick Recap Checklist
- CPU affinity binds a process to specific cores via sched_setaffinity or taskset
- Soft affinity is a scheduler preference; hard affinity is a strict mask
- Cache warmth, NUMA locality, and isolation are primary reasons for affinity
- SCHED_FIFO has no time slicing: runs until yield or block
- SCHED_RR adds round-robin rotation at the same priority level
- SCHED_DEADLINE uses EDF with bandwidth reservation (runtime/deadline/period)
- Priority inversion: H blocked by L, M preempts L → H starves
- Priority inheritance: L’s priority temporarily raised to unblock H
- PREEMPT_RT makes stock Linux real-time capable (threaded IRQs, sleeping spinlocks)
- FreeRTOS, VxWorks, Zephyr, QNX are purpose-built RTOS options
- cyclictest measures scheduling latency on PREEMPT_RT systems
Interview Questions
Soft affinity is a preference that the scheduler applies naturally — the kernel tries to keep a process on the same core where it last ran, but will migrate it if necessary for load balancing or core availability. Hard affinity is an explicit mask set via sched_setaffinity or taskset that the scheduler must respect. With hard affinity, the process will never run on a core outside the specified mask.
Priority inversion occurs when a high-priority task is blocked waiting for a resource held by a low-priority task, while a medium-priority task preempts the low-priority one and runs instead — effectively starving the high-priority task. Priority inheritance temporarily raises the holding task's priority to match the blocked task's priority, ensuring it runs and releases the lock quickly. The kernel restores the original priority once the lock is released.
SCHED_FIFO has no time slicing: a runnable process runs until it voluntarily yields, blocks, or is preempted by a higher-priority process. SCHED_RR is identical but rotates among same-priority processes using a fixed time quantum. SCHED_DEADLINE uses Earliest Deadline First with bandwidth reservation: each task declares a runtime, deadline, and period, and the kernel admits the task only if total reserved bandwidth stays within the CPU capacity it allows.
PREEMPT_RT converts non-preemptible regions of the kernel — primarily interrupt handlers and certain critical sections — into preemptible ones. Interrupt handlers become threaded kernel threads, spinlocks become sleeping locks, and internal locking uses priority inheritance. This reduces the maximum preemption latency from milliseconds down to tens of microseconds, making stock Linux capable of soft real-time workloads.
Choose a dedicated RTOS when you need formal certification for safety standards (DO-178C, ISO 26262, IEC 62304), when memory footprint must be extremely small (FreeRTOS can run in kilobytes of RAM), when you need a formally verified worst-case latency bound, or when the hardware platform is not well-supported by Linux. Choose PREEMPT_RT Linux when you need the Linux ecosystem, drivers, and toolchain while achieving soft real-time latency in the range of tens to low hundreds of microseconds.
Soft affinity (also called natural affinity) is a preference — the scheduler tries to keep a process on the same CPU but can migrate it when needed (e.g., load balancing). Hard affinity is a mandatory constraint — you pin a process to a specific set of cores using sched_setaffinity(), and the kernel will never migrate it outside that set unless you explicitly remove the constraint. Hard affinity is used in real-time and performance-critical workloads where cache eviction cost matters more than load balancing.
On NUMA systems, each CPU socket has its own memory bank. If a process migrates to a remote socket, every memory access pays a ~100ns penalty for remote access versus ~60ns for local. Effective real-time performance requires binding to both the correct CPU core AND the correct memory node using set_mempolicy(MPOL_BIND) or numactl --cpunodebind --membind. Linux's libnuma API lets applications query and set NUMA affinity programmatically. The PREEMPT_RT kernel does not automatically handle NUMA — the application or deployment tooling must.
Modern multi-core CPUs use MESI (Modified, Exclusive, Shared, Invalid) cache coherence. When a process runs on Core 0, its data lives in Core 0's L1/L2 caches. If the process migrates to Core 1, the cache line must be invalidated on Core 0 and transferred to Core 1 — costing 50-200 cycles depending on the cache state. On a properly pinned process, cache hits can approach 100% for working sets that fit in L1/L2. With frequent migration, effective CPI degrades significantly even if the CPU appears "idle."
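Explain sched_setaffinity() and its flags.
sched_setaffinity(pid_t pid, size_t cpusetsize, const cpu_set_t *mask) sets the CPU affinity mask for a thread (a pid of 0 means the caller). It takes no flags argument; its behavior is controlled entirely by the mask. Key macros: CPU_SETSIZE is the capacity of a static cpu_set_t (1024 CPUs in glibc), and larger systems use dynamically sized sets via CPU_ALLOC. CPU_ZERO clears a mask. CPU_SET/CPU_CLR add/remove individual cores. CPU_COUNT returns the number of set bits. The call returns 0 on success and -1 with errno set on failure. Reading affinity back uses sched_getaffinity(). The call is Linux-specific and is not on the POSIX list of async-signal-safe functions, so avoid relying on it inside signal handlers.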
CPU hotplug allows removing a CPU core from the scheduler at runtime (via echo 0 > /sys/devices/system/cpu/cpuN/online). When a core is offlined, the kernel migrates all tasks away from it. A task whose hard affinity mask contains no remaining online CPU is not killed: the kernel widens its mask so it can run elsewhere and logs a message, which silently breaks the pinning you set up. For real-time workloads, avoid offlining cores that host pinned threads, and have deployment tooling re-check affinity (for example with sched_getaffinity) after any hotplug event.
The cgroup v2 CPU controller limits CPU usage per control group using CFS bandwidth control (cpu.max for hard limits, cpu.weight for proportional shares). Unlike affinity (which controls where a process runs), the CPU controller controls how much CPU time a process gets. They compose: a process in a cgroup with a 50% CPU limit can still have its affinity restricted to cores 0-3. For containers, cgroup CPU limits are typically the primary mechanism while taskset is used for finer cache control.
In containerized environments, CPU affinity and cpusets interact in layered ways: 1) The container runtime (Docker, containerd) may set default cpuset cgroups limiting visible CPUs. 2) Inside the container, sched_setaffinity() is constrained by the cgroup cpuset subsystem. 3) Kubernetes' topologyManager and static CPU manager can coordinate NUMA and socket-level affinity for pods requesting guaranteed resources. 4) On cgroup v2 systems, cpuset.cpus.effective determines the allowed mask visible to processes.
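A process can check what it has actually been granted by comparing its affinity mask with the machine's online CPU count. The sketch below does that with sched_getaffinity and sysconf. The function names are hypothetical, and the comparison is only a rough indicator, since some container setups also change what sysconf reports.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

/* Number of CPUs the calling thread may run on, or -1 on error.
 * Inside a container this reflects any cpuset restriction. */
int allowed_cpu_count(void)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof(set), &set) == -1)
        return -1;
    return CPU_COUNT(&set);
}

/* Number of CPUs currently online, as reported by the C library. */
long online_cpu_count(void)
{
    return sysconf(_SC_NPROCESSORS_ONLN);
}
```

Sizing thread pools from the allowed count rather than the online count avoids oversubscribing the few cores a container is permitted to use.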
What is the difference between CPU isolation (isolcpus) and CPU affinity?
isolcpus is a kernel boot parameter that removes a CPU from the scheduler's default load balancing: the kernel will not automatically place ordinary tasks on that CPU. You must explicitly assign work to it using affinity. This is a common setup for real-time work: reserve core 3 for your real-time thread, and no other user task will land on it unless it is pinned there too. Per-CPU kernel threads and interrupts can still run on an isolated core unless you move them as well. Affinity alone restricts where your task runs but does nothing to keep other tasks off those cores; isolcpus supplies that other half.
When a process loses cache affinity and migrates to another CPU, it suffers: L1/L2 cache misses (roughly 50-100 cycles each), possible L3 misses (100-200 cycles or more if the data now sits on a remote socket), and a cold TLB on the new core, where each miss triggers a page-table walk. Even with hard affinity, interrupts and kernel threads can still run on the pinned core, and the soft lockup detector and perf profiling can cause jitter even for pinned processes. Use perf stat -e cache-misses to quantify migration cost.
What is the difference between SCHED_FIFO and SCHED_RR at the assembly level?
At the kernel level, SCHED_FIFO processes run until they yield, block, or are preempted by a higher-priority process. The only difference in SCHED_RR is a time slice counter: when an RR task exhausts its quantum, it is moved to the end of its priority queue. In pseudocode: if (policy == SCHED_RR && --time_slice == 0) requeue_at_tail(current, rq); pick_next(rq);. For single-threaded real-time work, FIFO is typically preferred to avoid quantum expiration overhead.
Linux's PI-futex (priority inheritance fast userspace mutex) extends the basic futex with a priority queue of waiters. When a high-priority process blocks on a futex owned by a low-priority process, the kernel temporarily boosts the owner's priority to match the waiter's. The owner runs at elevated priority until it releases the lock, then reverts. Implementation: rt_mutex_setprio() is called to update the owner's effective priority. This prevents unbounded priority inversion, but it adds kernel overhead on the contended path, so PI-futexes are generally slower than plain futexes when locks are contended.
Latency is the time from an event (interrupt, system call, network packet) to the response (scheduled execution of the handler). Throughput is the volume of work completed per unit time. Real-time systems care about bounded latency (must meet deadline), not throughput (might actually decrease under strict real-time constraints). For example, an automotive brake-by-wire ECU must respond within 1ms of a sensor event — latency is life safety. A web server maximizing requests/second optimizes throughput. PREEMPT_RT improves latency at some throughput cost due to increased preemption points.
What is the difference between CONFIG_PREEMPT_NONE vs CONFIG_PREEMPT_VOLUNTARY vs CONFIG_PREEMPT?
These are kernel preemption configuration options. CONFIG_PREEMPT_NONE (server): kernel code is never forcibly preempted; it runs until it returns to user space, blocks, or explicitly yields. CONFIG_PREEMPT_VOLUNTARY (desktop): adds explicit preemption points in long kernel code paths. CONFIG_PREEMPT (preemptible kernel): kernel code can be preempted anywhere except in interrupt context and in sections that disable preemption, such as while holding a spinlock. CONFIG_PREEMPT_RT (real-time): builds on CONFIG_PREEMPT with threaded IRQs, sleeping spinlocks, and far fewer non-preemptible sections.
Traditional interrupt handlers run to completion in atomic interrupt context — no sleeping, no rescheduling. Threaded IRQs (request_threaded_irq) run as kernel threads, allowing blocking operations and preemption. Trade-offs: threaded IRQs add scheduling latency (the thread must be scheduled), but they reduce interrupt handler duration, improve parallelism, and make latency more predictable. PREEMPT_RT requires threaded IRQs to achieve deterministic preemption. The downside is ~10-50us extra latency for IRQ handling due to context switch.
On a well-tuned PREEMPT_RT system, worst-case latency can be held below 100 microseconds for most workloads, and below 50 microseconds for latency-sensitive applications on modern hardware with proper CPU isolation. The limiting factors are: 1) hardware interrupt routing and APIC timer latency, 2) memory allocation and page-fault latency (lock memory with mlockall and preallocate before entering the time-critical path), 3) interrupt controller latency on virtualized systems, 4) device driver atomic sections. Use cyclictest from the rt-tests suite to measure your actual worst-case latency.
Further Reading
- Process Scheduling — General scheduling theory and context switching
- Process Scheduling Algorithms — Deep dive into CFS, MLFQ, and real-time scheduling algorithms
Conclusion
CPU affinity and real-time scheduling are two levels of the same idea: taking control away from the default scheduler and making explicit placement and timing decisions. Affinity tells the kernel where to run work. Real-time policies tell the kernel which work runs first and, with SCHED_DEADLINE, how much CPU time it is promised in each period.
For most applications, the default scheduler is perfectly adequate. But when cache warmth matters, when NUMA topology dictates placement, when deadlines are non-negotiable — these mechanisms give you the control needed to build systems that behave predictably under pressure.
The next step from here is to explore Kernel Architecture to understand how the scheduler fits into the broader kernel design, or to dig deeper into Process Scheduling Algorithms if you want to understand the algorithmic foundations of CFS.