I/O Scheduling Algorithms
Master FCFS, SSTF, SCAN, C-SCAN, and elevator algorithms for disk arm scheduling in modern operating systems.
Introduction
Every time your application writes a file, reads from a database, or loads a web page, the operating system must orchestrate movement of the disk’s read/write head—or manage the complex flash translation layer in an SSD. I/O schedulers are the OS components responsible for ordering and timing these disk operations to maximize throughput while minimizing latency and starvation.
The problem seems simple: given a queue of disk requests, in what order should they be serviced? In practice, this becomes a nuanced optimization problem involving physical constraints (mechanical latency in HDDs), fairness (preventing request starvation), and throughput (keeping the disk busy).
Understanding these algorithms matters beyond just disk I/O. The fundamental scheduling problem—optimizing request ordering under constraints—appears identically in CPU task scheduling, network packet scheduling, and cloud resource allocation.
When to Use / When Not to Use
When I/O Scheduling Algorithms Matter
- Rotating storage (HDD): Mechanical constraints make seek time dominant; scheduling provides 10-100x improvements over naive FCFS
- Mixed workload servers: Web servers handling both small random reads (user requests) and large sequential reads (log analysis) need scheduling to prevent starvation
- Database workloads: PostgreSQL and MySQL benefit significantly from deadline-aware scheduling to prevent write starvation
- Virtual machines: Hypervisors use I/O scheduling to provide fair disk access to multiple VMs
When Scheduling Matters Less
- NVMe SSDs with deep queues: The device’s internal scheduling and parallelism make OS-level scheduling redundant
- Direct attached storage with single application: If one application owns the disk, scheduling complexity provides diminishing returns
- Very low-latency requirements: Applications requiring sub-millisecond latency should bypass spinning storage entirely
- Write-heavy workloads with battery backup: Battery-backed RAID controllers can reorder writes aggressively because data is protected
Scheduler Selection Guidelines
| Workload | Recommended Scheduler | Reason |
|---|---|---|
| Desktop/interactive | mq-deadline or BFQ | Low latency for interactive tasks |
| Database (MySQL, PostgreSQL) | deadline | Write deadline guarantees |
| File server | mq-deadline | Balanced throughput and latency |
| VM host | mq-deadline or none | Fairness between guests |
| NVMe SSD | none (or mq-deadline minimal) | Device handles its own scheduling |
| Heavy sequential (video streaming) | BFQ | Guarantees bandwidth for real-time flows |
Architecture or Flow Diagram
The following diagram shows how a modern block I/O scheduler processes requests through its pipeline:
flowchart TB
subgraph "Application Layer"
A1["App 1\nwrite"]
A2["App 2\nread"]
A3["App 3\nwrite"]
end
subgraph "Kernel Block Layer"
Q["Request Queue\n(Linux: elevator queue)"]
S["Scheduler Algorithm\n(deadline / cfq / mq-deadline / bfq)"]
D["Dispatch Queue\n(per-hardware queue)"]
end
subgraph "Hardware"
H1["NVMe Queue 0"]
H2["NVMe Queue 1"]
HN["Disk\nController"]
end
A1 --> Q
A2 --> Q
A3 --> Q
Q --> S
S --> D
D --> H1
D --> H2
H1 --> HN
H2 --> HN
style Q stroke:#00fff9
style S stroke:#ff00ff
style D stroke:#00fff9
The flow works as follows: applications submit bio structures representing I/O requests. These bios are queued in the elevator (request queue). The scheduler algorithm reorders these requests based on its optimization criteria (seek distance, deadline, or fairness). When the hardware is ready, the dispatch layer pulls requests from the scheduler and submits them to the hardware queue.
Core Concepts
First-Come, First-Served (FCFS)
The simplest possible scheduler—requests are processed in arrival order. No optimization whatsoever.
// Conceptual FCFS implementation
void fcfs_add_request(struct request_queue *q, struct request *req)
{
list_add_tail(&req->queuelist, &q->queue_head);
}
struct request *fcfs_next_request(struct request_queue *q)
{
if (list_empty(&q->queue_head))
return NULL;
return list_first_entry(&q->queue_head, struct request, queuelist);
}
FCFS is trivially fair and has zero scheduling overhead, but performs terribly on rotating disks because random seeks dominate service time.
Shortest Seek Time First (SSTF)
SSTF services the request closest to the current head position, minimizing total seek distance.
struct request *sstf_next_request(struct request_queue *q)
{
struct request *req, *best = NULL;
sector_t head = disk_head_position(q);
sector_t best_dist = ULONG_MAX;
list_for_each_entry(req, &q->queue_head, queuelist) {
sector_t dist = abs(req->sector - head);
if (dist < best_dist) {
best_dist = dist;
best = req;
}
}
if (best) {
list_del(&best->queuelist);
}
return best;
}
SSTF dramatically reduces average seek time but risks starvation—requests far from the current head may never be serviced if new requests keep arriving closer to the head.
SCAN Algorithm (Elevator)
The SCAN algorithm moves the head in one direction (say, from innermost to outermost cylinder) servicing all requests along the way, then reverses direction. Like an elevator, it serves floors in order.
struct request *scan_next_request(struct request_queue *q)
{
struct request *req;
sector_t head = disk_head_position(q);
enum direction dir = current_direction(q); // IN or OUT
// Find closest request in current direction
list_for_each_entry(req, &q->queue_head, queuelist) {
if (req->sector >= head && dir == OUT)
goto found;
if (req->sector <= head && dir == IN)
goto found;
}
// No request in current direction, reverse
reverse_direction(q);
return scan_next_request(q); // Recurse once
found:
list_del(&req->queuelist);
return req;
}
SCAN provides low variance in service time and avoids starvation, but requests immediately behind the head must wait for the full sweep.
C-SCAN (Circular SCAN)
C-SCAN only services requests in one direction, then quickly returns to the beginning without servicing requests on the return trip. This provides more uniform service time than SCAN.
SCAN pattern: ←←←←←←←←←←← →→→→→→→→→→→ ←←←←←←←←←←
service return (no service) service
C-SCAN pattern: ←←←←←←←←←←← ================== ←←←←←←←←
service fast return (empty) service
Deadline Scheduler
The deadline scheduler adds time constraints to prevent starvation. Each request gets a deadline (typically 500ms for reads, 5s for writes). The scheduler normally services requests by seek distance, but enforces deadlines to prevent indefinite postponement.
struct request *deadline_next_request(struct request_queue *q)
{
struct request *req;
// Check read FIFO for expired deadlines
req = read_fifo_peek(q);
if (req && deadline_expired(req)) {
return read_fifo_pop(q); // Force service
}
// Check write FIFO
req = write_fifo_peek(q);
if (req && deadline_expired(req)) {
return write_fifo_pop(q);
}
// Normal dispatch: closest request (elevator)
return elevator_dispatch(q);
}
Linux’s mq-deadline is the modern incarnation, optimized for multi-queue block devices like NVMe.
Completely Fair Queueing (CFQ)
CFQ allocates disk time slices fairly across processes, giving each process its own queue. This prevents a single aggressive process from monopolizing the disk.
// Per-process queue structure
struct cfq_queue {
pid_t pid;
unsigned int sector_weight; // Based on ioprio
unsigned int time_slice;
struct list_head cfq_list; // In service tree
};
// Dispatch: pick queue with oldest slice, service its requests
struct cfq_queue *cfq_pick_queue(struct cfq_data *cfqd)
{
// Find non-empty queue with lowest virtual runtime
return rb_first(&cfqd->service_tree);
}
CFQ was the default Linux scheduler for years but was deprecated in favor of mq-deadline and BFQ because it introduced latency for certain workloads and didn’t scale well to high-IOPS devices.
Production Failure Scenarios
Scenario 1: Starvation in SSTF Causing Application Hang
What happened: A backup application continuously wrote large files in sequential order from cylinder 0 upward. Meanwhile, an interactive database application issued small reads scattered across the outer cylinders. The database’s requests were always farther from the backup stream’s current position, so they were perpetually deferred. Database queries began timing out.
Detection: Application monitoring showed I/O latency spikes exceeding 10 seconds. iostat showed disk utilization at 100% but with very low transaction rates.
Mitigation: Switch from SSTF-based scheduler to deadline scheduler:
echo deadline > /sys/block/sda/queue/scheduler
Deadline ensures database reads hit their 500ms deadline even if the disk is busy with sequential writes.
Scenario 2: Write Starvation Under CFQ
What happened: A web server handling mostly read requests gave writes such low priority that write buffers filled and application blocked on write(), causing apparent hangs despite low disk utilization.
Detection: iotop showed minimal write I/O but free showed swap usage increasing—writeback couldn’t keep up.
Mitigation: Use mq-deadline instead of CFQ, which separates read and write queues with proper batching:
echo mq-deadline > /sys/block/nvme0n1/queue/scheduler
Scenario 3: NVMe Queue Depth Mismanagement
What happened: Application configured for spinning disk submitted requests one at a time, waiting for each to complete. On NVMe, this left 32,000+ entry queues empty, achieving only 400 IOPS instead of 200,000+ possible.
Detection: nvme list showed queue temperatures at 0.1% utilization. Latency was dominated by application-level synchronization.
Mitigation: Submit multiple requests asynchronously:
# BAD: Synchronous submission
for chunk in file_chunks:
fd.write(chunk) # Waits for each write before next
# GOOD: Async batch submission
async with aioopen('/dev/nvme0n1', 'O_DIRECT') as fd:
await fd.write_all(file_chunks) # Batch of 128+ requests
Trade-off Table
| Algorithm | Avg Seek Distance | Starvation Risk | Throughput | Complexity |
|---|---|---|---|---|
| FCFS | Very High (random access) | None (trivially fair) | Low | Trivial |
| SSTF | Minimal | High (far requests ignored) | Highest (HDD) | Low |
| SCAN | Moderate | Low (guaranteed service) | High | Medium |
| C-SCAN | Moderate-High | Very Low (uniform guarantee) | High | Medium |
| Deadline | Moderate | Very Low (deadline guarantee) | High | Medium |
| CFQ | Variable | Very Low (fair queuing) | Moderate | High |
Implementation Snippets
Simulating SCAN Scheduling
#!/usr/bin/env python3
"""
Simulates the SCAN (elevator) disk scheduling algorithm.
"""
from dataclasses import dataclass
from typing import List, Optional
@dataclass
class Request:
sector: int
pid: int = 0
arrival_time: float = 0.0
class SCANScheduler:
def __init__(self, disk_size: int = 10000, head_start: int = 0):
self.disk_size = disk_size
self.head = head_start
self.direction = 1 # 1 = toward end, -1 = toward start
self.queue: List[Request] = []
self.total_seek = 0
def add_request(self, req: Request):
self.queue.append(req)
def _sort_key(self, req: Request) -> tuple:
"""Sort by direction-aware distance from head."""
if self.direction == 1:
# Moving toward end: prioritize requests ahead of head
if req.sector >= self.head:
return (0, req.sector) # In direction, sort by sector
else:
return (1, self.disk_size - req.sector) # Behind, sort by reverse distance
else:
# Moving toward start
if req.sector <= self.head:
return (0, -req.sector) # In direction, sort by reverse sector
else:
return (1, req.sector) # Ahead, sort by distance
def run(self) -> List[tuple]:
"""Run SCAN until queue empty, return list of (sector, seek_distance)."""
serviced = []
while self.queue:
# Sort by direction-aware distance
self.queue.sort(key=self._sort_key)
req = self.queue.pop(0)
seek = abs(req.sector - self.head)
self.total_seek += seek
serviced.append((req.sector, seek))
self.head = req.sector
# At disk edge, reverse direction
if self.head >= self.disk_size - 1:
self.direction = -1
elif self.head <= 0:
self.direction = 1
return serviced
if __name__ == "__main__":
import random
random.seed(42)
scheduler = SCANScheduler(disk_size=10000, head_start=5000)
# Add random requests
for i in range(20):
scheduler.add_request(Request(sector=random.randint(0, 9999)))
print(f"Initial head position: 5000")
print(f"Running SCAN scheduler...\n")
serviced = scheduler.run()
for i, (sector, seek) in enumerate(serviced):
print(f" Step {i+1:2d}: Service sector {sector:5d} (seek={seek:5d})")
print(f"\nTotal seek distance: {scheduler.total_seek}")
print(f"Average seek: {scheduler.total_seek / len(serviced):.1f}")
Deadline Scheduler Verification
#!/bin/bash
# Benchmark deadline scheduler performance
echo "=== Setting deadline scheduler ==="
echo deadline | sudo tee /sys/block/sda/queue/scheduler
echo ""
echo "=== Current scheduler ==="
cat /sys/block/sda/queue/scheduler
echo ""
echo "=== Tunable parameters ==="
cat /sys/block/sda/queue/iosched/read_expire_ms
cat /sys/block/sda/queue/iosched/write_expire_ms
cat /sys/block/sda/queue/iosched/fifo_batch
echo ""
echo "=== Benchmark with fio ==="
sudo fio --name=randread --filename=/dev/sda --ioengine=libaio \
--rw=randread --bs=4k --numjobs=4 --runtime=30 --time_based=1 \
--readonly --iodepth=32 2>/dev/null | grep -E "IOPS|lat"
Observability Checklist
Linux Block Layer Metrics
# View queue depth and scheduler behavior
cat /sys/block/sda/queue/nr_requests # Queue depth
cat /sys/block/sda/queue/read_expire_ms # Deadline read timeout
cat /sys/block/sda/queue/write_expire_ms # Deadline write timeout
# iostat for throughput and queue length
iostat -xz 1 # Per-device utilization and queue
iostat -xdy 1 # Per-partition detailed stats
# blktrace for request-level tracing
sudo blktrace -d /dev/sda -o - | blkparse -i -
Key Metrics to Monitor
| Metric | Healthy Range | Alert Threshold | Indicates |
|---|---|---|---|
avgqu-sz | < 4 | > 16 | Queue saturation |
util | < 70% | > 90% | Disk bottleneck |
rkB/s + wkB/s | Workload dependent | Sudden drops | Controller or disk issue |
await | < 10ms HDD, < 1ms SSD | > 100ms | Latency issues |
svctm | < 10ms HDD | > 20ms | Device service degradation |
Common Pitfalls / Anti-Patterns
I/O Scheduling Security Implications
-
Timing Attacks: Predictable scheduling can leak information through access patterns. Adversaries measuring I/O timing might infer what data other processes access.
-
Denial of Service: A malicious process could submit enormous numbers of tiny requests, monopolizing scheduler time and starving legitimate processes. Use cgroups I/O controllers to enforce fairness:
# Limit cgroup to 50MB/s write bandwidth echo "8:0 52428800" > /sys/fs/cgroup/blkio/user.slice/io.max -
Data Lifetime: Scheduling decisions affect when data reaches stable storage. Synchronous writes must bypass caches—verify
sync()calls actually flush before returning.
Common Pitfalls / Anti-patterns
| Anti-pattern | Why It’s Bad | Correct Approach |
|---|---|---|
| Using SSTF on multi-process systems | Starvation causes unpredictable latency | Use deadline or mq-deadline |
| Ignoring scheduler on NVMe | Scheduler overhead adds latency without benefit | Set none or mq-deadline |
| Submitting requests synchronously | Underutilizes hardware queues | Batch async submissions |
| Not aligning to physical block size | Causes read-modify-write cycles | Use fdisk alignment, strace misalignments |
| Mixing SSD and HDD scheduling logic | Assumes seek time matters on flash | Different schedulers per device type |
Quick Recap Checklist
- I/O schedulers order disk requests to minimize seek time and prevent starvation
- FCFS is fair but performs poorly; SSTF minimizes seeks but risks starvation
- SCAN (elevator) sweeps across the disk; C-SCAN provides uniform service times
- Deadline scheduling adds time guarantees to prevent starvation
- CFQ provided fairness but was deprecated; mq-deadline is the modern replacement
- NVMe SSDs benefit less from OS scheduling—use
noneor minimal scheduling - Monitor
iostat -xfor queue depth, utilization, and latency metrics - Use cgroups I/O controllers for multi-tenant fairness guarantees
- Scheduler choice should match workload: interactive = mq-deadline, database = deadline, real-time = BFQ
Interview Questions
The elevator algorithm (formally SCAN) solves the problem of arbitrary request ordering in FCFS. It services requests in cylinder order as the head sweeps across the disk—moving in one direction, serving all requests it encounters, then reversing and sweeping back. This dramatically reduces total seek distance compared to random access, while guaranteeing every request will be serviced (no starvation). The name comes from the similarity to how an elevator services floor requests.
CFQ was deprecated for several reasons: First, it didn't scale well to high-IOPS devices like NVMe—its per-process queue management overhead became significant. Second, it introduced latency for certain workloads because "fair" doesn't mean "appropriate"—a 1KB metadata read and a 1MB sequential write shouldn't get equal time slices. Third, it interacted poorly with the block layer multi-queue redesign for modern storage. Finally, it was complex (thousands of lines of code) for marginal benefit over simpler deadline-based approaches. mq-deadline and BFQ replaced it for most use cases.
Starvation occurs when a request is perpetually deferred because others keep getting priority—typically with SSTF, where requests far from the current head position never get serviced if new closer requests keep arriving. FCFS has no starvation (trivially fair). SCAN prevents starvation because every request in the sweep direction gets serviced before the head reverses. Deadline prevents starvation by enforcing maximum wait times—expired requests jump the queue. CFQ/BFQ prevent starvation by giving each process bounded time slices.
The deadline scheduler maintains separate read and write FIFOs, each with associated deadlines (default: 500ms for reads, 5 seconds for writes). Normally it dispatches requests sorted by location (like SSTF). However, before each dispatch, it checks if any request's deadline has expired—if so, it immediately services that request from the FIFO regardless of position. This ensures writes don't starve even when reads keep arriving, because eventually the write deadline expires and forces the write to be serviced. The write deadline is intentionally longer because applications typically block longer on reads.
NVMe SSDs have fundamentally different characteristics: No mechanical seek time (NAND flash has no moving parts). Internal massive parallelism (thousands of concurrent operations). Deep command queues (up to 64,000 commands per queue). The device knows its own internal topology and state far better than the OS scheduler can predict. The OS scheduler typically adds overhead without meaningful optimization. Modern practice is to either use none scheduler for NVMe, or use mq-deadline only to batch and merge requests before handing them to the device's own internal scheduler. The benefit of OS-level scheduling diminishes dramatically when the device handles its own scheduling with perfect information.
Anticipatory scheduling was based on the observation that after a read completes, the process that issued it typically issues another read nearby (sequential pattern). The scheduler would "anticipate" this and wait a few milliseconds after a read completes to see if the same process issues another read nearby, maximizing sequential reads. This improved throughput for sequential-heavy workloads but introduced latency for random I/O because requests waited while the scheduler anticipated. When the workload changed (more random reads, more mixed workloads), anticipatory scheduling caused performance degradation. The Linux kernel deprecated anticipatory scheduling and removed it from mainline in favor of CFQ (which provided fairness) and deadline (which provided latency guarantees). Anticipatory scheduling illustrated that optimizing purely for throughput can harm fairness and latency.
Modern NVMe and SATA SSDs have multiple hardware queues for commands. The older single-queue block layer (with a global elevator) became a bottleneck—all I/O passed through one queue, even when the device could handle many concurrent operations. The multi-queue block layer creates per-CPU software queues (mapped to hardware queues) so that each CPU can submit I/O without contention. The device's own internal scheduler handles ordering across hardware queues. This design reduces lock contention at the OS level (each CPU touches its own queue) and better utilizes the device's internal parallelism. The `mq-deadline` and `BFQ` schedulers are designed for this multi-queue environment. The block layer also handles request batching and merging before submission to reduce per-command overhead.
The Noop scheduler is the simplest possible scheduler—it implements a FIFO queue with no reordering whatsoever. Requests are processed in the order they arrive. This seems terrible for HDDs (random access is slow), but Noop has specific use cases. For SSDs with internal wear-leveling and command queuing, the OS-level scheduling adds overhead without benefit—the device does its own optimal scheduling. For virtualized environments where the hypervisor's storage layer is the real scheduler, an OS-level scheduler might conflict with or duplicate the hypervisor's scheduling. For storage that is essentially a ramdisk or battery-backed, sequential ordering provides no benefit. Noop has the lowest CPU overhead of any scheduler because it does almost nothing, making it attractive for embedded or high-IOPS scenarios where CPU overhead of scheduling matters more than disk efficiency.
Elevator merging combines adjacent I/O requests into a single larger request before submission to the hardware. If two read requests are for consecutive sectors (e.g., sector 100 and sector 101), the scheduler merges them into one request spanning sectors 100-101. This reduces the number of I/O operations the device must process, which is especially important for HDDs where each operation has seek overhead. Merging is particularly effective for sequential write patterns where multiple small writes are combined into larger, more efficient writes. The deadline scheduler includes merge logic before applying deadline constraints. Without merging, a process writing one byte at a time would generate one I/O operation per byte—a disaster on rotating media. Even on SSDs, merging reduces command processing overhead.
I/O scheduler decisions directly affect filesystem fragmentation. When the scheduler groups and reorders requests, it changes the order in which file blocks are written to disk. An SCAN-style scheduler that sweeps across the disk tends to write file blocks in roughly sequential order, producing less fragmentation. An SSTF-based scheduler (if not properly deadline-constrained) might interleave blocks from different files, increasing fragmentation over time. Filesystems like ext4 use extent-based allocation to reduce fragmentation sensitivity, but heavy random-write workloads will eventually fragment even extent-based filesystems. The relationship is: more intelligent I/O scheduling leads to more sequential writes, which leads to lower fragmentation, which in turn improves performance because reads are more sequential. This is a feedback loop—better scheduling reduces fragmentation, which reduces the need for seeking, which improves effective throughput.
Network storage doesn't go through the block I/O scheduler because network protocols (NFS, CIFS) use the network stack, not the block layer. The OS-level I/O scheduler only applies to block devices (disks, SSDs, RAID arrays). For NFS, the network packet scheduler (qdisc) in the network stack handles ordering, and NFS client readahead/writethrough caching determines when data is sent over the network. For NAS, the "scheduling" effectively happens at the NFS server level—the client sends requests and the server orders them on its attached storage. This means that choosing `deadline` versus `cfq` on a Linux NFS client has no effect on the actual storage performance—only on local caching behavior and how requests are batched before being sent. For local storage, the scheduler matters. For remote storage, you tune the network qdisc and NFS mount options.
blk-mq (Block Multi-Queue) is the redesigned block layer introduced in Linux 3.13 for modern storage. It creates per-CPU submission queues that map to hardware submission/completion queues, leveraging multi-core parallelism for high-IOPS devices. The traditional single-queue approach serialized all I/O through one lock, which became a bottleneck at very high IOPS. With blk-mq, each CPU has its own queue and can submit I/O without lock contention—the hardware (NVMe) handles parallel processing internally. blk-mq also enables hardware queue affinity (steering I/O to specific queues based on NUMA locality or device topology). The `mq-deadline` and `BFQ` schedulers sit on top of blk-mq. Most modern SSDs (NVMe and SATA AHCI) are supported via blk-mq. The blk-mq redesign was one of the most significant kernel changes for storage performance in years.
I/O scheduler latency directly impacts QoS because different workloads require different latency guarantees. A database synchronous write (fsync) needs to complete as fast as possible—the scheduler's deadline mechanism ensures it has a bounded maximum wait. A background backup write is latency-tolerant and can wait. The scheduler's job is to differentiate these: deadline ensures reads are bounded to ~500ms by default; BFQ provides bandwidth guarantees ensuring each process gets its share. Without QoS differentiation, a heavy background backup can starve a small interactive read, causing user-facing latency spikes. Linux's `blk-throttle` mechanism extends this with cgroup-based I/O bandwidth limits, allowing per-cgroup QoS for multi-tenant workloads. The scheduler is the mechanism that enforces the QoS policy the operator defines.
Request merging combines adjacent requests, reducing the total number of I/O operations issued to the disk. For an HDD, fewer operations means fewer seeks—each seek takes 5-15ms, so merging adjacent requests can eliminate seeks entirely, dramatically increasing effective bandwidth. If the scheduler merges 8 requests for consecutive sectors into 1 request, the disk handles 1 seek instead of 8, potentially cutting latency by an order of magnitude. For SSDs, merging reduces command processing overhead and allows the drive's internal NAND to handle larger, more efficient write operations. The effective bandwidth improvement from merging depends on the workload: sequential workloads benefit most (nearly all requests are mergeable), while random I/O workloads see little merging benefit. The block layer's merge logic is fundamental to all I/O schedulers.
blkcg provides I/O control via cgroups, allowing disk I/O bandwidth and IOPS limits per cgroup. The `io.max` interface in cgroups v2 limits a cgroup to specific bandwidth (bytes per second) or IOPS. Unlike the `blkio` cgroup v1 throttling which operated at the scheduler level, `io.max` works at the device level and is scheduler-agnostic—it works with deadline, mq-deadline, BFQ, or even no scheduler. The throttling happens before the scheduler, making it effective even for NVMe where the scheduler matters less. This enables true per-cgroup guarantees in multi-tenant environments. For example, in Kubernetes, you can set `storage.io.throttle.read_bps` to guarantee that a specific pod cannot exceed a certain read rate. Without blkcg, a noisy neighbor pod could monopolize disk bandwidth.
SCAN reverses direction at each end of the disk—moving from innermost to outermost, then back. This creates a pattern where requests at the edges of the disk have higher latency because they wait for the full sweep before being serviced. After SCAN processes all inner requests and reaches the innermost cylinder, it reverses and starts processing outer requests. If a request arrives at the outermost cylinder after the sweep has passed, it must wait for the full sweep back. This creates asymmetric latency: requests near the center of the disk tend to be serviced more frequently than those at the extremes. C-SCAN addresses this by servicing requests only in one direction and then jumping back to the start (not servicing requests on the return trip), giving every cylinder equal service time. The tradeoff is slightly higher total seek distance (no benefit from the return sweep).
Throughput is the total amount of data transferred per second (MB/s); latency is the time from request to completion for a single operation (ms). A large sequential write has high throughput but potentially moderate latency. A small random read has high latency but contributes little to throughput. A batch analytics workload cares about throughput—getting as much data processed as possible. A database OLTP workload cares about latency—each individual transaction needs to complete quickly. The scheduler must balance these: pure SSTF maximizes throughput on HDDs but creates latency variance for random requests. Deadline scheduling sacrifices some throughput for bounded latency. BFQ tries to guarantee bandwidth for each process, balancing both. The scheduler configuration for a given system should reflect the workload priority—if latency is critical (interactive), use deadline; if throughput is critical (batch), use mq-deadline with appropriate tuning.
Full-disk encryption (LUKS, BitLocker) operates below the filesystem layer in the block device layer, encrypting each disk sector on write and decrypting on read. This happens in the bio path after the I/O scheduler has merged and ordered requests but before the data reaches the storage medium. The encryption overhead is CPU-bound and pipelineable—modern CPUs with AES-NI can encrypt at several GB/s with negligible overhead. The key interaction with the I/O scheduler is that encryption does not change request ordering or merging behavior—the scheduler operates on logical block addresses, which are encrypted to physical addresses transparently. However, encrypted storage has higher CPU overhead per I/O operation, which can become a bottleneck at very high IOPS when the CPU's crypto unit becomes saturated. Hardware encryption accelerators (Intel QAT, AMD PSP) offload this to dedicated silicon, eliminating the bottleneck.
Many modern HDDs and some SSDs report 4KB physical sectors internally while exposing 512-byte logical sectors to the OS (512e). A write that is not aligned to the physical block size triggers read-modify-write cycles: the device must read the full 4KB block, modify the 512-byte portion, and write the block back. The I/O scheduler can reduce the occurrence of these misaligned operations by merging adjacent requests and ensuring that requests are aligned to physical boundaries. An 512-byte write at sector 7 will cause a 4KB read-modify-write cycle at the physical block level; merging it with adjacent writes to reach 4KB alignment can eliminate this overhead. Modern storage devices with native 4KB sectors (4Kn) avoid this complexity. The scheduler's merge logic is particularly important for 512e drives because it can reduce the penalty of misaligned writes.
The kernel can record I/O latency at the block layer using the blktrace infrastructure, capturing timestamps for each request from submission through dispatch to completion. The btt (blktrace analysis tool) converts this into histograms showing latency distribution. If the histogram shows a long tail (99th percentile latency much higher than median), the scheduler might be prioritizing throughput over latency, or the disk might be heavily seek-bound. A bimodal distribution (requests clustered at low and high latency) suggests request-dependent behavior—some requests are being serviced while others wait excessively. Comparing latency histograms across different schedulers (deadline vs mq-deadline vs BFQ) reveals which scheduler best controls tail latency for your workload. You can also identify "I/O storms"—periods where latency spikes dramatically, often correlating with specific workloads or cron jobs.
Further Reading
- Linux Kernel Documentation: Block Layer and I/O Schedulers — Official documentation covering the block layer, request queues, and scheduler interfaces
- Linux Kernel Documentation: mq-deadline scheduler — Detailed description of the modern multi-queue deadline scheduler
- Linux Kernel Documentation: BFQ Scheduler — Guide to BFQ (Budget Fair Queueing) for interactive and real-time workloads
- Storage Performance Per Pin (SPPP) paper — Academic research on NVMe scheduling optimization
- The SCAN Algorithm ( elevator ) — Wikipedia explanation of elevator scheduling with historical context
- fio - Flexible I/O Tester — The standard tool for benchmarking I/O schedulers and storage performance
Conclusion
I/O schedulers transform arbitrary disk request streams into optimized service orders, minimizing mechanical seek time and preventing request starvation. The evolution from FCFS through SSTF, SCAN, C-SCAN, and deadline-based algorithms reflects a maturing understanding of the tradeoff between throughput, fairness, and implementation complexity.
The shift from CFQ to mq-deadline and BFQ in modern Linux reflects the hardware reality of NVMe SSDs with internal scheduling and multi-queue architectures. For rotating storage, deadline scheduling remains the pragmatic choice for balancing throughput with write fairness. For interactive workloads, BFQ’s bandwidth-guarantee approach shines.
Looking forward, as NVMe dominates storage and storage-class memory reduces latency gaps further, OS-level scheduling will continue diminishing in importance. The future belongs to device-managed scheduling with application-level hints, not kernel-managed command reordering.
Category
Related Posts
ASLR & Stack Protection
Address Space Layout Randomization, stack canaries, and exploit mitigation techniques
Assembly Language Basics: Writing Code the CPU Understands
Learn to read and write simple programs in x86 and ARM assembly, understanding registers, instructions, and the art of thinking in low-level operations.
Boolean Logic & Gates
Understanding AND, OR, NOT gates and how they combine into arithmetic logic units — the building blocks of every processor.