CPU Affinity & Real-Time Operating Systems

CPU affinity binds processes to specific cores for cache warmth and latency control. RTOS adds deterministic scheduling with bounded latency for industrial, medical, and automotive systems.

Published: May 20, 2026 · Reading time: 34 min · Author: GeekWorkBench

Modern systems run dozens of processes simultaneously across multiple CPU cores. By default, the kernel scheduler makes placement decisions on your behalf, and it does a reasonable job. But there are times when you know better. When latency matters down to the microsecond, when cache warmth is measured in nanoseconds, when a missed deadline means a heart monitor fails, the default scheduler is not enough.

CPU affinity gives you direct control over which cores your processes inhabit. Real-time operating systems go further: they guarantee that work completes within a defined time bound. Together, these are the foundation of deterministic computing — systems that behave predictably, not just probabilistically.

This post is part of the Operating Systems Roadmap — specifically Section 3.3 on Process and Thread Management. If you are new to scheduling concepts, start with Process Scheduling and Process Concept for the fundamentals.

Introduction

CPU affinity and real-time operating systems sit at the far end of the scheduling spectrum — where squeezing performance or guaranteeing deadlines matters more than raw throughput. Most software never needs either. But when you are debugging a trading system that misses arbitrage windows by microseconds, or writing firmware for a pacemaker that must deliver a pulse within milliseconds of detecting an arrhythmia, the defaults stop being good enough.

This post covers two related capabilities. CPU affinity binds a process or thread to specific CPU cores, controlling where the scheduler places work. The gains are real: warm L1/L2 caches (and a warm share of the L3), NUMA-local memory access, and isolation from noisy neighbors on co-located systems. Real-time scheduling policies change how the scheduler decides what runs next: priority-based preemption and bandwidth reservation replace fair-share time slicing, which turns worst-case latency into something you can bound and measure.

Real-time workloads almost always need affinity too. Migrate a real-time thread to a different core mid-execution and caches go cold, which makes latency unpredictable. Binding the thread to a dedicated core and applying a real-time scheduling policy is the standard combination for latency-critical work.

After this post, you will know how to pin processes to cores with sched_setaffinity and taskset, how SCHED_FIFO, SCHED_RR, and SCHED_DEADLINE differ and when to use each, what priority inversion is and how priority inheritance solves it, how PREEMPT_RT makes stock Linux work for soft real-time, and how FreeRTOS, VxWorks, Zephyr, and QNX compare for different embedded contexts.

If you are coming from the scheduling algorithms post, you already know how the scheduler picks which runnable process gets CPU time. Here we cover how to override those decisions and control where your work runs and for how long.

CPU Affinity: Taking Control of Placement

CPU affinity is a property that lets you bind a process or thread to a specific subset of CPU cores. When you set affinity, the scheduler respects your preference — it will only place your work on the cores you specify.

The Linux API

Linux exposes affinity through a straightforward API built around bitmasks. Each bit in the mask corresponds to a logical CPU. If bit 3 is set, the process is eligible to run on core 3.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>   /* getpid */

int main(void) {
    cpu_set_t cpuset;
    pid_t pid = getpid();

    /* Clear the mask — start fresh */
    CPU_ZERO(&cpuset);

    /* Set bits for cores 0 and 2 */
    CPU_SET(0, &cpuset);
    CPU_SET(2, &cpuset);

    /* Apply the mask to the current process */
    if (sched_setaffinity(pid, sizeof(cpu_set_t), &cpuset) == -1) {
        perror("sched_setaffinity");
        exit(EXIT_FAILURE);
    }

    /* Read it back to confirm */
    if (sched_getaffinity(pid, sizeof(cpu_set_t), &cpuset) == -1) {
        perror("sched_getaffinity");
        exit(EXIT_FAILURE);
    }

    printf("Running on cores: ");
    for (int i = 0; i < CPU_SETSIZE; i++) {
        if (CPU_ISSET(i, &cpuset))
            printf("%d ", i);
    }
    printf("\n");

    return 0;
}

CPU_SETSIZE is 1024 in glibc, so a fixed-size cpu_set_t can address logical CPUs 0 through 1023; larger machines need the dynamically sized CPU_ALLOC family of macros. CPU_ZERO, CPU_SET, CPU_CLR, and CPU_ISSET are the basic operations on these masks.

The shell command taskset wraps this API for convenient command-line use:

# Restrict a process to cores 0 and 2
taskset -c 0,2 ./my_real_time_app

# Pin a process to core 0 (other tasks can still run there unless the core is isolated)
taskset -c 0 ./latency_sensitive_worker

# Check the affinity of a running process
taskset -p $(pgrep -f my_real_time_app)

The -c flag accepts a comma-separated list or range. taskset -c 0-3 binds to cores 0 through 3.

Why Bind to Specific Cores?

The default scheduler spreads work across all available cores. This is generally efficient, but it introduces variability that real-time and latency-sensitive workloads cannot tolerate.

Cache warmth is the primary motivation. Each CPU core has its own L1 and L2 caches. When a process runs on core 0 then migrates to core 5, its working set has to be reloaded into the new core's caches. Each miss costs tens to hundreds of cycles, and refilling a working set takes many of them, so the total penalty is far larger than any single miss. Keeping a latency-sensitive thread on a single core preserves its cache state, which eliminates that source of jitter.

NUMA systems amplify this concern. On a multi-socket server, memory access latency depends on which socket the CPU belongs to. Binding a process to cores on the same socket as its working memory avoids cross-socket memory traffic, which is commonly measured at roughly 1.5-2x the latency of local access, depending on the platform.

Isolation matters enormously in production. If you dedicate cores 4-7 to a real-time signal processing thread, you can keep other processes from fragmenting cache state on those cores. You can even set those cores aside entirely with the isolcpus kernel parameter at boot, leaving them exclusively for your workload.

Determinism follows from isolation. Fewer variables means fewer sources of unexpected delay. A thread pinned to one core does not experience scheduler-induced migration — and that alone removes a significant variable from your latency budget.

Soft Affinity vs Hard Affinity

Linux implements two tiers of affinity.

Soft affinity (also called natural affinity) is what the scheduler applies by default — a preference for keeping a process on the same core where it ran previously. This is a hint, not a requirement. The scheduler will migrate the process if needed, for example when a core becomes overcommitted.

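Hard affinity is what you get when you explicitly call sched_setaffinity. The kernel respects your mask strictly. If the mask contains only core 2, the thread will never run on core 5. Hard affinity is enforced at the scheduler level, not merely preferred. Note that the call applies to a single thread: pid 0 means the caller, and other threads of the process keep their own masks unless you set them too.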

You can combine these. A thread can have hard affinity for a subset of cores while the scheduler within that subset applies soft affinity to keep the thread on one particular core as much as possible.

Real-Time Scheduling Policies

Standard process scheduling in Linux is designed for throughput and fairness. The Completely Fair Scheduler (CFS), and the EEVDF scheduler that replaced it in Linux 6.6, divide CPU time fairly among runnable tasks rather than trying to meet deadlines. For workloads where meeting a deadline is more important than throughput, Linux provides real-time scheduling policies.

SCHED_FIFO

SCHED_FIFO is the simplest real-time policy. Processes under this policy have no time slice: they run until they block, yield voluntarily, or are preempted by a higher-priority real-time task. When a SCHED_FIFO process becomes runnable, it immediately preempts any lower-priority process.

Priority levels range from 1 to 99, with 99 being the highest. Setting a real-time policy requires privilege: root, the CAP_SYS_NICE capability, or an RLIMIT_RTPRIO resource limit that covers the requested priority.

#include <sched.h>
#include <stdio.h>

/* pid 0 means the calling thread */
struct sched_param sp = { .sched_priority = 90 };

if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
    perror("sched_setscheduler");
}

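With SCHED_FIFO, if two processes at the same priority are both runnable, the one that has been waiting longer runs first, a simple FIFO queue. There is no time slicing whatsoever. A misbehaving SCHED_FIFO process that never yields will lock out every process at the same or lower priority on its core. By default Linux softens this with real-time throttling: the sched_rt_runtime_us and sched_rt_period_us sysctls reserve 5% of each second for non-real-time tasks.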

SCHED_RR

SCHED_RR is identical to SCHED_FIFO except that processes at the same priority level share CPU time through round-robin rotation. Each process gets a fixed time quantum before the scheduler rotates to the next process at that priority.

This prevents a single process from monopolizing a priority level, but it introduces some scheduling latency: once a process exhausts its quantum, the wait until it runs again is bounded by the number of same-priority runnable processes times the quantum duration (assuming no higher-priority task intervenes). On Linux the quantum defaults to 100 ms and is tunable through /proc/sys/kernel/sched_rr_timeslice_ms.

SCHED_DEADLINE

SCHED_DEADLINE is the most sophisticated real-time policy, introduced in Linux 3.14. It implements the Earliest Deadline First (EDF) algorithm, combined with the Constant Bandwidth Server (CBS) for bandwidth reservation.

Each task specifies:

Runtime: the maximum CPU time it needs per period
Deadline: how long after the start of each period the work must be finished
Period: the interval between successive invocations

The kernel requires runtime <= deadline <= period.

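#define _GNU_SOURCE
#include <linux/sched.h>    /* SCHED_DEADLINE */
#include <linux/types.h>    /* __u32, __u64, __s32 */
#include <stdio.h>
#include <sys/syscall.h>    /* SYS_sched_setattr */
#include <unistd.h>         /* syscall */

/* Must match the kernel's layout. sched_nice and sched_priority are
 * unused by SCHED_DEADLINE but are still part of the struct. */
struct sched_attr {
    __u32 size;
    __u32 sched_policy;
    __u64 sched_flags;
    __s32 sched_nice;
    __u32 sched_priority;
    __u64 sched_runtime;   /* nanoseconds */
    __u64 sched_deadline;  /* nanoseconds */
    __u64 sched_period;    /* nanoseconds */
};

int main(void) {
    struct sched_attr attr = {
        .size = sizeof(attr),
        .sched_policy = SCHED_DEADLINE,
        .sched_runtime  = 10 * 1000 * 1000,  /* 10 ms */
        .sched_deadline = 20 * 1000 * 1000,  /* 20 ms */
        .sched_period   = 50 * 1000 * 1000,  /* 50 ms */
    };

    /* Older glibc versions have no wrapper, so invoke the system call directly */
    if (syscall(SYS_sched_setattr, 0, &attr, 0) == -1) {
        perror("sched_setattr");
        return 1;
    }
    return 0;
}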

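A task that tries to use more runtime than it reserved is throttled until its next period. When a task requests SCHED_DEADLINE, the kernel runs an admission test and rejects the request with EBUSY if the total reserved bandwidth (the sum of runtime/period over all deadline tasks) would exceed the allowed limit, which is 95% of CPU time by default. On a single CPU, passing that test means every task receives its reserved runtime before its deadline. On multiprocessor systems the guarantee is weaker: tasks may finish late, but only by a bounded amount.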

Comparing Scheduling Policies

Policy	Time Slicing	Priority Range	Deadline Support	Use Case
`SCHED_OTHER` (CFS)	Yes, fair share	0 only (nice values apply)	No	General computing
`SCHED_FIFO`	No	1-99	No	Fixed-priority tasks that run until they block or a higher priority preempts them
`SCHED_RR`	Yes, fixed quantum	1-99	No	Shared priority real-time tasks
`SCHED_DEADLINE`	Bandwidth reserved	None (EDF, ranks above FIFO and RR)	Yes	Periodic tasks with explicit deadlines

The Priority Inversion Problem

Real-time scheduling introduces a subtle failure mode that haunted the Mars Pathfinder mission in 1997. A high-priority task was blocked waiting for a low-priority task to release a resource — but medium-priority tasks kept preempting the low-priority one, delaying the release indefinitely. The high-priority task missed its deadline and triggered a system reset.

This scenario has a name: priority inversion. A high-priority task is indirectly blocked by a medium-priority task, because the medium-priority task is preempting a low-priority task that holds a lock needed by the high-priority task.

The classic sequence:

Low-priority task L acquires mutex X
L is preempted by medium-priority task M
High-priority task H starts, tries to acquire X, and blocks
H waits while M runs, because L cannot run to release X
H starves and may miss its deadline

The window of vulnerability is the time L holds X while a medium-priority task is able to preempt it. In the worst case, H never runs.

Priority Inheritance: The Solution

The kernel solves priority inversion through priority inheritance for mutexes. When a high-priority task blocks on a mutex held by a lower-priority task, the kernel temporarily raises the holding task’s priority to match that of the blocked task.

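#include <pthread.h>

pthread_mutex_t mutex;
pthread_mutexattr_t attr;

pthread_mutexattr_init(&attr);
/* Request priority inheritance for the mutex */
pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);

pthread_mutex_init(&mutex, &attr);
/* The attribute object is no longer needed once the mutex exists */
pthread_mutexattr_destroy(&attr);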

With PTHREAD_PRIO_INHERIT, the mutex records the priority of the highest-priority waiter. When the owning thread releases the mutex, its priority is restored to its original level. This propagation is transitive — if the owning thread itself holds another mutex with a waiter of even higher priority, the inheritance chain extends.

Priority inheritance is local to the mutex. It does not permanently change thread priorities. The kernel applies it only for the duration of the lock acquisition.

Linux pthread mutexes support priority inheritance when created with PTHREAD_PRIO_INHERIT; glibc implements them on top of PI futexes. Other primitives, such as POSIX semaphores and condition variables, have no owner to boost and offer no such protection. When designing real-time systems, check every lock that is shared between threads of different priority.

PREEMPT_RT: Making Stock Linux Real-Time Capable

The standard Linux kernel is not a real-time operating system. It is preemptible at many levels, but it has regions of code, particularly interrupt handlers and spinlock-protected critical sections, where preemption is disabled. The longest such region sets the worst-case scheduling latency. On a stock kernel, that worst case can range from hundreds of microseconds to several milliseconds.

The PREEMPT_RT patch set, maintained by a small team of kernel developers and merged incrementally into mainline over the past two decades, addresses this by converting most of the remaining non-preemptible regions into preemptible ones. The last pieces landed in Linux 6.12, which can be built as a real-time kernel (CONFIG_PREEMPT_RT) on x86-64, arm64, and RISC-V without external patches.

Key transformations in PREEMPT_RT:

Most interrupt handlers become threaded: they run as kernel threads rather than in hardirq context, which makes them preemptible and lets you assign them priorities.
Ordinary spinlocks (spinlock_t) become sleeping locks, allowing the holder to be preempted. Only raw_spinlock_t keeps true spinning semantics.
Critical sections that cannot tolerate preemption are isolated and minimized.
The kernel’s sleeping locks use priority inheritance.

The result is a kernel that is preemptible almost everywhere. The maximum latency becomes bounded and measurable, typically in the range of tens of microseconds on modern hardware rather than milliseconds.

To use PREEMPT_RT, you can run a distribution that ships a pre-built real-time kernel (Ubuntu, RHEL, and Debian all offer one), build a mainline kernel (6.12 or later) with CONFIG_PREEMPT_RT enabled, or apply the patch set to a matching older kernel version and compile it yourself.

Once running, you can verify latency with cyclictest from the rt-tests suite:

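# Measure scheduling latency on all cores
cyclictest -l 1000000 -m -S -p 90

# Same measurement, printing a latency histogram instead of the live display
cyclictest -l 1000000 -m -S -p 90 -h 40 -q

The -p 90 sets the test threads to SCHED_FIFO priority 90, -m locks memory to avoid page faults, and -S starts one pinned thread per core. -h 40 prints a histogram with one-microsecond buckets up to 40 microseconds (larger samples are counted as overflows), and -q suppresses the live display. Judge the result by the maximum, not the average: a healthy PREEMPT_RT system keeps worst-case latency in the tens of microseconds on modern hardware even under heavy load, so run the test alongside a stress workload.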

Real-Time Operating Systems: Purpose-Built for Determinism

While Linux with PREEMPT_RT can meet soft real-time requirements, many embedded and safety-critical domains need formal certification, formally verified worst-case latency bounds, or minimal footprint. For these, dedicated RTOS solutions exist that are smaller and more deterministic than a general-purpose kernel.

FreeRTOS

FreeRTOS is the most widely deployed RTOS in the world, running on billions of microcontrollers in IoT devices, wearables, and industrial sensors. It is MIT licensed and runs on over 40 microcontroller architectures.

A FreeRTOS application consists of a small kernel and application tasks:

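#include "FreeRTOS.h"
#include "task.h"

/* Application-specific work, defined elsewhere */
extern void process_sensor_data(void);

/* Task function */
void vTaskFunction(void *pvParameters) {
    const char *task_name = (const char *)pvParameters;
    TickType_t wake_time = xTaskGetTickCount();
    (void)task_name; /* available for logging */

    while (1) {
        /* Do real-time work */
        process_sensor_data();

        /* Delay until next period */
        vTaskDelayUntil(&wake_time, pdMS_TO_TICKS(100));
    }
}

int main(void) {
    /* Stack depth is given in words, not bytes; a higher number means a higher priority */
    xTaskCreate(vTaskFunction, "Sensor", 1024, (void *)"Sensor", 2, NULL);
    xTaskCreate(vTaskFunction, "Logger", 1024, (void *)"Logger", 1, NULL);

    vTaskStartScheduler();
    /* Reached only if the scheduler could not start, e.g. for lack of heap */
    for (;;);
}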

FreeRTOS supports priority-based preemptive scheduling with optional time slicing among tasks of equal priority. Tasks communicate through queues and semaphores, and FreeRTOS mutexes implement priority inheritance (plain binary semaphores do not).

VxWorks

VxWorks is a mature, certifiable RTOS used in aerospace, defense, medical devices, and automotive systems. It powers the Curiosity and Perseverance rovers on Mars, and it is used in many safety-critical automotive applications.

VxWorks provides a POSIX-compliant interface, deterministic scheduling, and an architecture with memory protection. Its certification heritage means it comes with extensive documentation for DO-178C (aerospace) and ISO 26262 (automotive) compliance.

Under the hood, VxWorks uses a priority-based preemptive scheduler similar in concept to SCHED_FIFO: tasks at higher priority levels immediately preempt those at lower levels. The key difference from Linux is scale. The wind kernel is compact and modular, and a configured image contains only the components the project selects, so the scheduler has far fewer paths to traverse and fewer sources of non-determinism. Unlike QNX, VxWorks is not a strict microkernel: device drivers and the networking stack normally run in kernel space. The kernel is designed for fast interrupt handling with bounded latency, which supports the timing analysis that DO-178C Level A certification demands.

Memory protection in VxWorks is enforced through the MMU. Applications can run as real-time processes (RTPs), each with its own protected address space, so a null pointer dereference in one RTP cannot corrupt another's memory. Kernel tasks, by contrast, share the kernel address space. By default, FreeRTOS runs all tasks in a single flat address space. For automotive ECUs targeting ISO 26262 ASIL-D, VxWorks can run on dual-core lockstep processors, where two cores execute identical code and comparison hardware flags any divergence within a few clock cycles. This is a hardware-level fault detection mechanism that software alone cannot provide.

When you deploy VxWorks, you build a project file (usually via the Wind River Workbench IDE) that defines which components are included: which network stack, which file system, which interrupt controllers. The result is a single firmware image tuned to the hardware. This build-time configuration is both a strength (no bloat, every byte accounted for) and a limitation (safety-certified configurations are typically frozen at build time, so you give up the runtime flexibility of Linux module loading). For a brake-by-wire ECU that must pass ISO 26262 ASIL-D, this predictability is worth the development overhead.

Zephyr

The Zephyr Project is a Linux Foundation-hosted RTOS designed for resource-constrained devices. It scales from footprints of a few kilobytes of RAM up to more capable systems. Zephyr is built around a small, highly configurable kernel: features such as synchronization primitives, memory protection, and power management are build-time options you include as needed.

Zephyr supports multiple architectures including x86, ARM, RISC-V, and Tensilica Xtensa. Its configuration system lets you build minimal images without unnecessary components — critical for MCUs with tight memory budgets.

Early Zephyr releases split the kernel into a “nanokernel” for the smallest devices and a “microkernel” for more capable ones. Since the unified kernel arrived in version 1.6, the same trade-off is a matter of configuration. In the smallest configurations there is no concept of a traditional process: all code runs in a single privileged address space. This sounds dangerous, but it removes the cost of crossing protection boundaries and keeps context switches and interrupt entry very cheap. The trade-off is that a buggy driver can corrupt kernel state, so this mode suits constrained hardware where you control every line of code.

On hardware with an MPU or MMU, enabling user mode (CONFIG_USERSPACE) adds memory-protected threads and a more traditional kernel boundary while still being orders of magnitude smaller than Linux. Threads in Zephyr have an integer priority where a lower number means a higher priority: negative values denote cooperative threads, non-negative values denote preemptible ones, and the number of levels is configurable. Zephyr's documentation describes the scheduling behavior in detail, and the source tree includes latency benchmarks that you can run on your own board.

A practical difference from FreeRTOS is Zephyr’s devicetree support. Rather than hardcoding peripheral base addresses in driver code, Zephyr describes the hardware in a devicetree, which the build system turns into compile-time constants. This means the same driver source can serve different boards and hardware revisions by swapping the devicetree file rather than editing the driver, a concept borrowed from Linux that cuts down the maintenance burden significantly. For IoT devices that ship in millions of units across several hardware revisions, this is a real engineering advantage.

Building Zephyr is done via its West meta-tool. You set up a Python virtual environment, fetch the Zephyr sources with west init and west update, and run west build -b <board> on your application. The resulting image is a single binary with no runtime dependency on an external file system or loader. This simplicity makes Zephyr popular for BLE SoCs, smart sensors, and industrial IO-Link masters where firmware updates are OTA but bandwidth is limited.

QNX

QNX Neutrino is a POSIX-certified microkernel RTOS used in automotive infotainment systems, medical devices, and industrial control. The kernel itself is just a few tens of kilobytes, and most OS services run as user-space processes. This design means a fault in a driver or filesystem does not crash the kernel.

QNX has a long track record in safety-critical applications and carries certifications for medical (IEC 62304), automotive (ISO 26262), and industrial (IEC 61508) standards.

The microkernel architecture is the core design choice. The QNX kernel handles only four things: scheduling, inter-process communication (IPC), interrupt handling, and low-level synchronization. Everything else (file systems, networking stacks, device drivers, graphical subsystems) runs as user-space processes. When a network driver crashes, it can be restarted without affecting the kernel or other processes. This fault isolation matters enormously in automotive head units where a Bluetooth stack crash cannot be allowed to disable the instrument cluster.

IPC in QNX uses a synchronous message-passing model. A process sends a message to a server process (like a file system) and blocks until it receives a reply. The kernel copies the message directly between the two address spaces, and by default the server thread that receives it runs at the priority of the sending client. This model is fundamentally different from Linux, where a system call executes kernel code directly in the caller's context. Priority inheritance across messages is a large part of why QNX can make strong determinism guarantees: a server working on behalf of a high-priority client cannot be starved by medium-priority work.

For automotive applications, QNX offers QNX OS for Safety, which ships with a safety manual and certification evidence, alongside board support packages that simplify porting to different SoCs. It is pre-certified to ISO 26262 ASIL D, which means you can reuse the vendor's evidence for the OS layer and concentrate your own certification effort on the application and system integration. In practice this can cut months off the certification timeline for a brake-by-wire ECU or an ADAS control unit.

Scheduling Hierarchy: A Visual Map

Understanding how these concepts relate helps frame the design decisions. The following diagram maps the landscape from bare hardware through kernel scheduling to real-time policies.

graph TD
    Hardware["Hardware<br/>CPU Cores"]
    Kernel["Linux Kernel"]
    Scheduler["Scheduler"]
    CFS["CFS<br/>SCHED_OTHER"]
    RT["Real-Time Policies<br/>SCHED_FIFO · SCHED_RR · SCHED_DEADLINE"]
    Affinity["CPU Affinity API<br/>sched_setaffinity · taskset"]
    Preempt["PREEMPT_RT Patch"]
    RTOS["Dedicated RTOS<br/>FreeRTOS · VxWorks · Zephyr · QNX"]

    Hardware --> Kernel
    Kernel --> Scheduler
    Scheduler --> CFS
    Scheduler --> RT
    Scheduler --> Affinity
    Affinity --> Preempt
    Scheduler --> Preempt
    Hardware --> RTOS

    CFS -.->|"Best effort<br/>No latency guarantee"| CFS_label["Throughput-optimized"]
    RT -.->|"Priority-based<br/>Bounded latency"| RT_label["Real-time workloads"]
    Affinity -.->|"Core pinning<br/>Cache warmth"| Affinity_label["Latency-sensitive"]

Where Determinism Matters: Use Cases

Industrial Robots

Modern manufacturing robots operate on tight control loops — a pick-and-place arm may need to complete a cycle within 500 microseconds to maintain throughput on an assembly line. The robot controller runs a real-time task that must guarantee this deadline regardless of what other tasks on the system are doing. FreeRTOS or a PREEMPT_RT Linux system with a dedicated core handles this.

The typical architecture is a dual-CPU design: a real-time core runs the motion control loop at 1-10 kHz while a Linux side handles HMI, networking, and configuration. The motion control core runs FreeRTOS or bare metal with no operating system at all. Communication between the two sides uses a shared memory region with a dual-port RAM or a high-speed SPI link, and the interface is designed so that the Linux side never blocks the real-time side.

Jitter tolerance on a pick-and-place line is tight: it has to be a small fraction of the loop period, which is only 100 microseconds to 1 millisecond at 1-10 kHz. The arm trajectory is computed ahead of time, so the real-time task is not doing complex math. It mostly reads encoder positions, runs a PID loop, and writes PWM duty cycles to motor drivers. A late sample means the arm overshoots position and either misses the target or applies force that damages the fixture.

Medical Devices

An infusion pump that delivers medication must complete its bolus delivery within a specified time window. A pacemaker must deliver stimulation pulses within a few milliseconds of detecting an arrhythmia. Pacemakers are Class III medical devices, the FDA's highest risk classification, and most infusion pumps are Class II; in both cases regulators expect rigorous evidence that timing requirements are met. Certifiable RTOS platforms like VxWorks or QNX provide the paper trail that regulatory submissions require.

The governing software standard is IEC 62304, which assigns software whose failure could cause death or serious injury to safety Class C, its most demanding level (DO-178C plays the equivalent role in avionics). This is not merely a documentation exercise. It requires requirements traceability, thorough verification including coverage and timing analysis, and risk management evidence that reviewers examine before the device is approved. RTOS vendors supply certification kits with these artifacts for the OS layer.

For an infusion pump, the real-time requirement is the peristaltic pump stepper motor. It must deliver exactly X milliliters per minute. Missing a deadline means delivering too much (overdose risk) or too little (underdose risk). Either can lead to a Class I recall, the FDA's most serious category. The software stack is typically a small RTOS kernel with no file system, no networking stack, and no dynamic memory allocation. Static allocation only means no heap fragmentation or allocation latency surprises at runtime.

Audio and Digital Signal Processing

Professional audio workstations demand sub-millisecond latency to avoid audible artifacts. A buffer underrun — when the audio thread fails to fill the buffer in time — produces a pop or drop that breaks the performance. Audio threads are scheduled with real-time policies, often pinned to isolated cores to prevent any interference from the operating system.

The chain is: audio interface hardware fires an interrupt every 128 or 256 samples (at 48 kHz, that is 2.7 ms or 5.3 ms per period). The audio callback must fill the buffer before the next interrupt fires. On a standard Linux system with PulseAudio or JACK, the callback runs in userspace and is triggered by a timer. With SCHED_FIFO and a core set aside with isolcpus (plus nohz_full to silence the periodic timer tick), the callback thread is the only thing running on that core. No scheduler preemption, no context switch mid-buffer.

Buffer size is a direct latency versus stability trade-off. A 128-sample buffer at 48 kHz adds 2.67 ms per buffer, which is imperceptible to humans; round-trip latency is at least twice that, since input and output are each buffered. Under heavy load, if the callback overruns its buffer period, you get a buffer underrun. Professional systems typically use 256 or 512 sample buffers as a safety margin, accepting 5-10 ms latency in exchange for not dropping frames. The audio engineer tunes buffer size to the workload.

High-Frequency Trading

In financial markets, a few microseconds of latency can represent millions of dollars in missed arbitrage opportunities. Trading systems use kernel bypass networking (via technologies like DPDK) combined with real-time scheduling to minimize jitter in order execution. Hard affinity ensures the trading engine never leaves its designated cores.

The architecture of a modern HFT system is a layered approach to latency minimization. At the bottom is a network interface card with an FPGA that parses incoming market data packets (Ethernet frames carrying feeds such as NASDAQ ITCH, with orders going back out over an order-entry protocol such as OUCH) and writes them directly into user-space memory via DMA, completely bypassing the kernel network stack. The kernel never sees these packets: no interrupts, no socket buffers, no TCP/IP stack.

DPDK (Data Plane Development Kit) provides the userspace driver framework for this. A DPDK application polls a network interface continuously rather than waiting for interrupts. The trading strategy runs as a SCHED_FIFO process at priority 99, pinned to a dedicated core that never handles any other work. When an FPGA NIC receives a packet, it DMA-writes it into a locked-down memory region and signals the polling thread via a write to a specific memory address. The thread processes the packet, generates an order, and submits it, all within single-digit microseconds.

The latency budget is tight. A typical trade decision (receive market data, apply strategy, submit order) must complete in under 10 microseconds to be competitive in areas such as options market making. The run-to-completion polling approach avoids context switches, and the cache misses that follow them, between network handling and strategy processing. The trading engine thread never migrates: hard affinity keeps it on a specific physical core with its L1/L2 cache hot. The FPGA handles the network; the CPU handles the math.

Co-location is the physical counterpart to software optimization. HFT firms pay exchanges for servers housed in the same data centers as the exchange matching engines, cutting network transit time to microseconds. The combination of kernel bypass, real-time scheduling, CPU affinity, and co-location is what makes the microsecond-level latency achievable.

Interview Questions

1. What is the difference between soft affinity and hard affinity in Linux?

Soft affinity is a preference that the scheduler applies naturally — the kernel tries to keep a process on the same core where it last ran, but will migrate it if necessary for load balancing or core availability. Hard affinity is an explicit mask set via sched_setaffinity or taskset that the scheduler must respect. With hard affinity, the process will never run on a core outside the specified mask.

2. What is priority inversion and how does priority inheritance solve it?

Priority inversion occurs when a high-priority task is blocked waiting for a resource held by a low-priority task, while a medium-priority task preempts the low-priority one and runs instead — effectively starving the high-priority task. Priority inheritance temporarily raises the holding task's priority to match the blocked task's priority, ensuring it runs and releases the lock quickly. The kernel restores the original priority once the lock is released.

3. Explain the difference between SCHED_FIFO, SCHED_RR, and SCHED_DEADLINE.

SCHED_FIFO has no time slicing: a runnable process runs until it blocks, yields voluntarily, or is preempted by a higher-priority task. SCHED_RR is identical but rotates among same-priority processes using a fixed time quantum. SCHED_DEADLINE uses Earliest Deadline First with bandwidth reservation: each task declares a runtime, deadline, and period, and the kernel admits the task only if total reserved bandwidth stays within the configured limit.

4. What does the PREEMPT_RT patch set do to the Linux kernel?

PREEMPT_RT converts most non-preemptible regions of the kernel, primarily interrupt handlers and lock-protected critical sections, into preemptible ones. Interrupt handlers run as kernel threads, most spinlocks become sleeping locks, and internal locking uses priority inheritance. On well-tuned hardware this reduces the maximum preemption latency from milliseconds down to tens of microseconds, making Linux capable of soft real-time workloads.

5. When would you choose a dedicated RTOS like FreeRTOS over a PREEMPT_RT Linux system?

Choose a dedicated RTOS when you need formal certification for safety standards (DO-178C, ISO 26262, IEC 62304), when memory footprint must be extremely small (FreeRTOS can run in kilobytes of RAM), when you need a formally verified worst-case latency bound, or when the hardware platform is not well-supported by Linux. Choose PREEMPT_RT Linux when you need the Linux ecosystem, drivers, and toolchain while achieving soft real-time latency in the range of tens to low hundreds of microseconds.

6. What is the difference between soft affinity and hard affinity?

Soft affinity (also called natural affinity) is a preference — the scheduler tries to keep a process on the same CPU but can migrate it when needed (e.g., load balancing). Hard affinity is a mandatory constraint — you pin a process to a specific set of cores using sched_setaffinity(), and the kernel will never migrate it outside that set unless you explicitly remove the constraint. Hard affinity is used in real-time and performance-critical workloads where cache eviction cost matters more than load balancing.

7. How does NUMA awareness interact with CPU affinity?

On NUMA systems, each CPU socket has its own local memory. If a process migrates to a remote socket, memory accesses that miss the caches go to remote memory, which costs roughly ~100ns versus ~60ns for local access on typical hardware. Effective real-time performance requires binding to both the correct CPU cores and the correct memory node, using set_mempolicy(MPOL_BIND) or numactl --cpunodebind --membind. Linux's libnuma API lets applications query and set NUMA affinity programmatically. The PREEMPT_RT kernel does not handle NUMA placement automatically; the application or deployment tooling must.
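A minimal sketch of the CPU half of NUMA binding follows. It assumes the usual Linux sysfs layout (/sys/devices/system/node/nodeN/cpulist); the helper names parse_cpulist and node_cpus are hypothetical. Memory binding is not covered by the Python standard library and is left to numactl or libnuma.

```python
from pathlib import Path

def parse_cpulist(text: str) -> set[int]:
    """Parse a kernel CPU list such as '0-3,8-11' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue                          # empty list, e.g. a node with no CPUs
        low, _, high = part.partition("-")
        cpus.update(range(int(low), int(high or low) + 1))
    return cpus

def node_cpus(node: int) -> set[int]:
    """CPUs that belong to a NUMA node, or an empty set if sysfs lacks it."""
    path = Path(f"/sys/devices/system/node/node{node}/cpulist")
    return parse_cpulist(path.read_text()) if path.exists() else set()

print(sorted(parse_cpulist("0-3,8-11")))      # [0, 1, 2, 3, 8, 9, 10, 11]
print("node 0 CPUs:", sorted(node_cpus(0)))
```

A caller could intersect node_cpus(0) with os.sched_getaffinity(0) and pass the result to os.sched_setaffinity to keep a process on one node's cores.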

8. What is the relationship between CPU affinity and cache coherence?

Modern multi-core CPUs use MESI (Modified, Exclusive, Shared, Invalid) cache coherence or a variant of it. When a process runs on Core 0, its data lives in Core 0's L1/L2 caches. If the process migrates to Core 1, each cache line it touches must be transferred to Core 1, and modified lines must be invalidated on Core 0, costing 50-200 cycles per line depending on the cache state. For a properly pinned process, cache hit rates can approach 100% for working sets that fit in L1/L2. With frequent migration, cycles per instruction (CPI) rise noticeably because the task keeps refilling cold caches, even when overall CPU utilization looks low.

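9. Describe the behavior of sched_setaffinity() and its helper macros.

sched_setaffinity(pid_t pid, size_t cpusetsize, const cpu_set_t *mask) sets the CPU affinity mask of the thread identified by pid (0 means the calling thread). It takes no flags argument; the mask is built with helper macros. CPU_SETSIZE (1024 in glibc) is the number of CPUs a fixed-size cpu_set_t can describe; larger systems use dynamically sized sets via CPU_ALLOC. CPU_ZERO clears a mask. CPU_SET/CPU_CLR add/remove individual cores. CPU_COUNT returns the number of set bits. The call returns 0 on success and -1 with errno set on failure (for example EINVAL when the mask contains no CPU the thread is allowed to use). Reading affinity back uses sched_getaffinity(). The call is not on the POSIX list of async-signal-safe functions, so avoid relying on it inside signal handlers.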
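The mask itself is just a bit set with one bit per CPU. The sketch below mirrors CPU_SET and CPU_COUNT with a plain integer, which is also the hexadecimal form that taskset accepts as a mask argument. The helper names are hypothetical.

```python
def cpus_to_mask(cpus) -> int:
    """Build an integer bitmask with one bit set per CPU number."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu                  # analogous to CPU_SET(cpu, &mask)
    return mask

def mask_to_cpus(mask: int) -> set[int]:
    """Recover the CPU numbers whose bits are set in the mask."""
    return {bit for bit in range(mask.bit_length()) if (mask >> bit) & 1}

mask = cpus_to_mask({0, 2, 3})
print(hex(mask))                          # 0xd
print(bin(mask).count("1"))               # 3, analogous to CPU_COUNT
print(sorted(mask_to_cpus(mask)))         # [0, 2, 3]
```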

10. What is CPU hotplug and how does it affect real-time processes?

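CPU hotplug allows removing a CPU core from the scheduler at runtime (via echo 0 > /sys/devices/system/cpu/cpuN/online). When a core is offlined, the kernel migrates its tasks to the remaining online CPUs. A task whose hard affinity mask contained only that core is not killed or signaled; the kernel widens its allowed CPUs so it can keep running, which silently breaks the pinning. For real-time workloads, avoid offlining cores that host pinned threads, and re-check affinity with sched_getaffinity() after any hotplug event. Inside the kernel, drivers can register hotplug callbacks (cpuhp_setup_state() in current kernels, which replaced the older CPU_DOWN_PREPARE notifier) to move per-CPU work before the core goes offline.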

11. What is cgroup CPU controller and how does it relate to CPU affinity?

The cgroup v2 CPU controller limits CPU time per control group: cpu.weight sets a proportional share under contention, and cpu.max sets a hard quota per period, enforced by CFS bandwidth throttling. Unlike affinity (which controls where a process runs), the CPU controller controls how much CPU time a process gets. They compose: a process in a cgroup with a 50% CPU limit can still have its affinity restricted to cores 0-3. For containers, cgroup CPU limits are typically the primary mechanism, while cpusets or taskset are used for finer cache control.
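In cgroup v2 the bandwidth limit lives in a file named cpu.max, which holds a quota and a period in microseconds, or the word max for no limit. The sketch below parses that format; the function name parse_cpu_max is hypothetical, and the file's location depends on where the cgroup hierarchy is mounted (commonly under /sys/fs/cgroup).

```python
def parse_cpu_max(text: str):
    """Return the limit in units of CPUs, or None when the cgroup is unlimited."""
    quota, period = text.split()
    if quota == "max":
        return None                       # no bandwidth limit configured
    return int(quota) / int(period)       # e.g. 50000 / 100000 = half a CPU

print(parse_cpu_max("50000 100000"))      # 0.5
print(parse_cpu_max("200000 100000"))     # 2.0
print(parse_cpu_max("max 100000"))        # None
```

A limit of 0.5 says nothing about which cores are used. The group may spend its half-CPU budget on any core its affinity mask allows.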

12. How does the scheduler handle affinity in a containerized environment?

In containerized environments, CPU affinity and cpusets interact in layers: 1) The container runtime (Docker, containerd) may set cpuset cgroups that limit the CPUs a container can use. 2) Inside the container, sched_setaffinity() is constrained by the cpuset; a mask with no permitted CPU is rejected. 3) Kubernetes' Topology Manager and the CPU Manager's static policy can coordinate NUMA and socket-level affinity for pods requesting guaranteed resources. 4) On cgroup v2 systems, cpuset.cpus.effective shows the CPUs actually available to processes in the group.
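One practical consequence is that the number of online CPUs can overstate what a containerized process may use. A small sketch, assuming Linux and using only the standard library, compares the two figures:

```python
import os

online = os.cpu_count()                   # CPUs the system reports as online
allowed = os.sched_getaffinity(0)         # CPUs this process may actually run on
print(f"{len(allowed)} of {online} CPUs usable: {sorted(allowed)}")

# Size worker pools from the allowed set, not from the machine-wide count.
workers = len(allowed)
```

Sizing thread or process pools from the allowed set avoids oversubscribing a container restricted to a few cores by a cpuset.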

13. What is the difference between CPU isolation (isolcpus) and CPU affinity?

isolcpus is a kernel boot parameter that removes CPUs from the scheduler's general load balancing, so the kernel will not place ordinary tasks on them automatically. You must explicitly assign work using affinity. This is a common setup for real-time: reserve core 3 for your real-time thread, and no other user task lands on it unless it is pinned there too. The two mechanisms are complementary: affinity restricts where your task may run but does not keep other tasks off that core, while isolcpus keeps other tasks away but places nothing there by itself. Per-CPU kernel threads and interrupts can still run on an isolated core unless IRQ affinity is moved elsewhere as well.

14. What are the performance implications of migrating between CPUs, and what jitter remains with hard affinity?

When a process migrates to another CPU it loses its warm state: L1/L2 contents must be refetched (roughly 50-100 cycles per miss), L3 misses cost more again (100-200 cycles or more if the data now sits on a remote socket), and the TLB on the new core starts cold, so early accesses pay for page-table walks. Hard affinity removes the migration cost but not all jitter: interrupts, kernel threads, the soft lockup detector, and perf sampling can still run on the pinned core. Use perf stat -e cache-misses,cpu-migrations to quantify the effect.

15. What is the difference between SCHED_FIFO and SCHED_RR at the kernel level?

At the kernel level, SCHED_FIFO tasks run until they yield, block, or are preempted by a higher-priority task. The only difference in SCHED_RR is a time slice counter: when an RR task exhausts its quantum, it is moved to the end of the queue for its priority level. In pseudocode: if (policy == SCHED_RR && counter == 0) enqueue(current, rq); pick_next(rq); For single-threaded real-time work, FIFO is typically preferred to avoid quantum expiration overhead.
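The difference shows up only when several tasks share a priority level. The toy model below, a sketch rather than kernel code, prints the tick-by-tick execution order for equal-priority tasks on one CPU. The task names, run lengths, and the two-tick quantum are hypothetical.

```python
from collections import deque

def run_order(policy: str, work: dict[str, int], quantum: int = 2) -> list[str]:
    """Tick-by-tick execution order for same-priority tasks on one CPU."""
    queue = deque(work.items())           # (name, remaining ticks), arrival order
    timeline = []
    while queue:
        name, left = queue.popleft()
        # FIFO runs the task to completion; RR runs at most one quantum.
        ticks = left if policy == "FIFO" else min(left, quantum)
        timeline += [name] * ticks
        if left > ticks:
            queue.append((name, left - ticks))   # RR: back of the queue
    return timeline

work = {"A": 3, "B": 3}
print("FIFO:", "".join(run_order("FIFO", work)))   # AAABBB
print("RR:  ", "".join(run_order("RR", work)))     # AABBAB
```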

16. How does the kernel implement priority inheritance for futexes?

Linux's PI-futex (priority inheritance fast userspace mutex) extends the basic futex with a priority-ordered queue of waiters. When a high-priority task blocks on a futex owned by a low-priority task, the kernel temporarily boosts the owner's priority to match the highest-priority waiter. The owner runs at elevated priority until it releases the lock, then reverts. Implementation: the kernel backs the futex with an rt_mutex and calls rt_mutex_setprio() to update the owner's effective priority. This bounds priority inversion, but it adds kernel overhead on the contended path; the uncontended lock and unlock are still handled with atomic operations in user space, as with ordinary futexes.

17. What is the difference between real-time latency and real-time throughput?

Latency is the time from an event (interrupt, system call, network packet) to the response (scheduled execution of the handler). Throughput is the volume of work completed per unit time. Real-time systems care about bounded latency (must meet deadline), not throughput (might actually decrease under strict real-time constraints). For example, an automotive brake-by-wire ECU must respond within 1ms of a sensor event — latency is life safety. A web server maximizing requests/second optimizes throughput. PREEMPT_RT improves latency at some throughput cost due to increased preemption points.

18. What is CONFIG_PREEMPT_NONE vs CONFIG_PREEMPT_VOLUNTARY vs CONFIG_PREEMPT?

These are kernel preemption configuration options. CONFIG_PREEMPT_NONE (server): kernel code is not preempted; rescheduling happens only when the kernel returns to user space or blocks voluntarily. CONFIG_PREEMPT_VOLUNTARY (desktop): adds explicit preemption points in long kernel code paths. CONFIG_PREEMPT (preemptible kernel): kernel code can be preempted anywhere except in interrupt context and in sections that disable preemption, such as those holding spinlocks. CONFIG_PREEMPT_RT (real-time): builds on CONFIG_PREEMPT with threaded IRQs and sleeping spinlocks, shrinking the non-preemptible sections to a small core protected by raw spinlocks.

19. What are the trade-offs between threaded IRQs and interrupt handlers running in interrupt context?

Traditional interrupt handlers run to completion in atomic interrupt context: no sleeping, no rescheduling. Threaded IRQs (request_threaded_irq) run the bulk of the handler in a kernel thread, allowing blocking operations and preemption. Trade-offs: threaded IRQs add scheduling latency (the thread must be woken and scheduled), but they shorten the time spent in hard interrupt context, let handlers be prioritized like any other thread, and make latency more predictable. PREEMPT_RT forces most handlers into threads to achieve deterministic preemption. The cost is extra latency before the handler body runs, roughly one wake-up and context switch, which the figures above put at ~10-50us depending on hardware and load.

20. What is the worst-case latency achievable on a PREEMPT_RT kernel and what limits it?

On a well-tuned PREEMPT_RT system, worst-case latency can be held below 100 microseconds for most workloads, and below 50 microseconds for latency-sensitive applications on modern hardware with proper CPU isolation. The limiting factors are: 1) hardware interrupt routing and APIC timer latency, 2) memory allocation and page faults on the real-time path (preallocate and lock memory with mlockall() so the task never waits on the page allocator), 3) interrupt controller latency on virtualized systems, 4) device driver atomic sections. Use cyclictest from the rt-tests suite to measure your actual worst-case latency.
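The measurement idea behind cyclictest, sleeping until a periodic deadline and recording how late each wake-up is, can be sketched in a few lines. This is only an illustration: the function name and parameters are hypothetical, and interpreter overhead makes the numbers far coarser than what cyclictest reports.

```python
import time

def wakeup_latencies(interval_us: int = 1000, loops: int = 200) -> list[float]:
    """Microseconds by which each periodic wake-up misses its target time."""
    latencies = []
    next_wake = time.monotonic_ns() + interval_us * 1000
    for _ in range(loops):
        delay_ns = next_wake - time.monotonic_ns()
        if delay_ns > 0:
            time.sleep(delay_ns / 1e9)            # sleep until the target time
        now = time.monotonic_ns()
        latencies.append((now - next_wake) / 1000)
        next_wake += interval_us * 1000           # fixed period, no drift
    return latencies

samples = wakeup_latencies()
print(f"min {min(samples):.1f} us, max {max(samples):.1f} us over {len(samples)} loops")
```

The maximum is the figure that matters for real-time work. Comparing it before and after pinning the process or isolating a core gives a rough view of how much jitter each step removes.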

Conclusion

CPU affinity and real-time scheduling are two levels of the same idea: taking control away from the default scheduler and making explicit placement and timing decisions. Affinity tells the kernel where to run work. Real-time policies tell the kernel how long the work can run and what happens when deadlines loom.

For most applications, the default scheduler is perfectly adequate. But when cache warmth matters, when NUMA topology dictates placement, when deadlines are non-negotiable — these mechanisms give you the control needed to build systems that behave predictably under pressure.

The next step from here is to explore Kernel Architecture to understand how the scheduler fits into the broader kernel design, or to dig deeper into Process Scheduling Algorithms if you want to understand the algorithmic foundations of CFS.
