I/O Scheduling Algorithms

Master FCFS, SSTF, SCAN, C-SCAN, and elevator algorithms for disk arm scheduling in modern operating systems.

published: reading time: 27 min read author: GeekWorkBench

Introduction

Every time your application writes a file, reads from a database, or loads a web page, the operating system must orchestrate movement of the disk’s read/write head—or manage the complex flash translation layer in an SSD. I/O schedulers are the OS components responsible for ordering and timing these disk operations to maximize throughput while minimizing latency and starvation.

The problem seems simple: given a queue of disk requests, in what order should they be serviced? In practice, this becomes a nuanced optimization problem involving physical constraints (mechanical latency in HDDs), fairness (preventing request starvation), and throughput (keeping the disk busy).

Understanding these algorithms matters beyond just disk I/O. The fundamental scheduling problem—optimizing request ordering under constraints—appears identically in CPU task scheduling, network packet scheduling, and cloud resource allocation.

When to Use / When Not to Use

When I/O Scheduling Algorithms Matter

  • Rotating storage (HDD): Mechanical constraints make seek time dominant; scheduling provides 10-100x improvements over naive FCFS
  • Mixed workload servers: Web servers handling both small random reads (user requests) and large sequential reads (log analysis) need scheduling to prevent starvation
  • Database workloads: PostgreSQL and MySQL benefit significantly from deadline-aware scheduling to prevent write starvation
  • Virtual machines: Hypervisors use I/O scheduling to provide fair disk access to multiple VMs

When Scheduling Matters Less

  • NVMe SSDs with deep queues: The device’s internal scheduling and parallelism make OS-level scheduling redundant
  • Direct attached storage with single application: If one application owns the disk, scheduling complexity provides diminishing returns
  • Very low-latency requirements: Applications requiring sub-millisecond latency should bypass spinning storage entirely
  • Write-heavy workloads with battery backup: Battery-backed RAID controllers can reorder writes aggressively because data is protected

Scheduler Selection Guidelines

WorkloadRecommended SchedulerReason
Desktop/interactivemq-deadline or BFQLow latency for interactive tasks
Database (MySQL, PostgreSQL)deadlineWrite deadline guarantees
File servermq-deadlineBalanced throughput and latency
VM hostmq-deadline or noneFairness between guests
NVMe SSDnone (or mq-deadline minimal)Device handles its own scheduling
Heavy sequential (video streaming)BFQGuarantees bandwidth for real-time flows

Architecture or Flow Diagram

The following diagram shows how a modern block I/O scheduler processes requests through its pipeline:

flowchart TB
    subgraph "Application Layer"
        A1["App 1\nwrite"]
        A2["App 2\nread"]
        A3["App 3\nwrite"]
    end

    subgraph "Kernel Block Layer"
        Q["Request Queue\n(Linux: elevator queue)"]
        S["Scheduler Algorithm\n(deadline / cfq / mq-deadline / bfq)"]
        D["Dispatch Queue\n(per-hardware queue)"]
    end

    subgraph "Hardware"
        H1["NVMe Queue 0"]
        H2["NVMe Queue 1"]
        HN["Disk\nController"]
    end

    A1 --> Q
    A2 --> Q
    A3 --> Q
    Q --> S
    S --> D
    D --> H1
    D --> H2
    H1 --> HN
    H2 --> HN

    style Q stroke:#00fff9
    style S stroke:#ff00ff
    style D stroke:#00fff9

The flow works as follows: applications submit bio structures representing I/O requests. These bios are queued in the elevator (request queue). The scheduler algorithm reorders these requests based on its optimization criteria (seek distance, deadline, or fairness). When the hardware is ready, the dispatch layer pulls requests from the scheduler and submits them to the hardware queue.

Core Concepts

First-Come, First-Served (FCFS)

The simplest possible scheduler—requests are processed in arrival order. No optimization whatsoever.

// Conceptual FCFS implementation
void fcfs_add_request(struct request_queue *q, struct request *req)
{
    list_add_tail(&req->queuelist, &q->queue_head);
}

struct request *fcfs_next_request(struct request_queue *q)
{
    if (list_empty(&q->queue_head))
        return NULL;
    return list_first_entry(&q->queue_head, struct request, queuelist);
}

FCFS is trivially fair and has zero scheduling overhead, but performs terribly on rotating disks because random seeks dominate service time.

Shortest Seek Time First (SSTF)

SSTF services the request closest to the current head position, minimizing total seek distance.

struct request *sstf_next_request(struct request_queue *q)
{
    struct request *req, *best = NULL;
    sector_t head = disk_head_position(q);
    sector_t best_dist = ULONG_MAX;

    list_for_each_entry(req, &q->queue_head, queuelist) {
        sector_t dist = abs(req->sector - head);
        if (dist < best_dist) {
            best_dist = dist;
            best = req;
        }
    }

    if (best) {
        list_del(&best->queuelist);
    }
    return best;
}

SSTF dramatically reduces average seek time but risks starvation—requests far from the current head may never be serviced if new requests keep arriving closer to the head.

SCAN Algorithm (Elevator)

The SCAN algorithm moves the head in one direction (say, from innermost to outermost cylinder) servicing all requests along the way, then reverses direction. Like an elevator, it serves floors in order.

struct request *scan_next_request(struct request_queue *q)
{
    struct request *req;
    sector_t head = disk_head_position(q);
    enum direction dir = current_direction(q);  // IN or OUT

    // Find closest request in current direction
    list_for_each_entry(req, &q->queue_head, queuelist) {
        if (req->sector >= head && dir == OUT)
            goto found;
        if (req->sector <= head && dir == IN)
            goto found;
    }

    // No request in current direction, reverse
    reverse_direction(q);
    return scan_next_request(q);  // Recurse once

found:
    list_del(&req->queuelist);
    return req;
}

SCAN provides low variance in service time and avoids starvation, but requests immediately behind the head must wait for the full sweep.

C-SCAN (Circular SCAN)

C-SCAN only services requests in one direction, then quickly returns to the beginning without servicing requests on the return trip. This provides more uniform service time than SCAN.

SCAN pattern:  ←←←←←←←←←←←  →→→→→→→→→→→  ←←←←←←←←←←
                 service      return (no service)    service

C-SCAN pattern: ←←←←←←←←←←←  ==================  ←←←←←←←←
                 service      fast return (empty)   service

Deadline Scheduler

The deadline scheduler adds time constraints to prevent starvation. Each request gets a deadline (typically 500ms for reads, 5s for writes). The scheduler normally services requests by seek distance, but enforces deadlines to prevent indefinite postponement.

struct request *deadline_next_request(struct request_queue *q)
{
    struct request *req;

    // Check read FIFO for expired deadlines
    req = read_fifo_peek(q);
    if (req && deadline_expired(req)) {
        return read_fifo_pop(q);  // Force service
    }

    // Check write FIFO
    req = write_fifo_peek(q);
    if (req && deadline_expired(req)) {
        return write_fifo_pop(q);
    }

    // Normal dispatch: closest request (elevator)
    return elevator_dispatch(q);
}

Linux’s mq-deadline is the modern incarnation, optimized for multi-queue block devices like NVMe.

Completely Fair Queueing (CFQ)

CFQ allocates disk time slices fairly across processes, giving each process its own queue. This prevents a single aggressive process from monopolizing the disk.

// Per-process queue structure
struct cfq_queue {
    pid_t pid;
    unsigned int sector_weight;  // Based on ioprio
    unsigned int time_slice;
    struct list_head cfq_list;  // In service tree
};

// Dispatch: pick queue with oldest slice, service its requests
struct cfq_queue *cfq_pick_queue(struct cfq_data *cfqd)
{
    // Find non-empty queue with lowest virtual runtime
    return rb_first(&cfqd->service_tree);
}

CFQ was the default Linux scheduler for years but was deprecated in favor of mq-deadline and BFQ because it introduced latency for certain workloads and didn’t scale well to high-IOPS devices.

Production Failure Scenarios

Scenario 1: Starvation in SSTF Causing Application Hang

What happened: A backup application continuously wrote large files in sequential order from cylinder 0 upward. Meanwhile, an interactive database application issued small reads scattered across the outer cylinders. The database’s requests were always farther from the backup stream’s current position, so they were perpetually deferred. Database queries began timing out.

Detection: Application monitoring showed I/O latency spikes exceeding 10 seconds. iostat showed disk utilization at 100% but with very low transaction rates.

Mitigation: Switch from SSTF-based scheduler to deadline scheduler:

echo deadline > /sys/block/sda/queue/scheduler

Deadline ensures database reads hit their 500ms deadline even if the disk is busy with sequential writes.

Scenario 2: Write Starvation Under CFQ

What happened: A web server handling mostly read requests gave writes such low priority that write buffers filled and application blocked on write(), causing apparent hangs despite low disk utilization.

Detection: iotop showed minimal write I/O but free showed swap usage increasing—writeback couldn’t keep up.

Mitigation: Use mq-deadline instead of CFQ, which separates read and write queues with proper batching:

echo mq-deadline > /sys/block/nvme0n1/queue/scheduler

Scenario 3: NVMe Queue Depth Mismanagement

What happened: Application configured for spinning disk submitted requests one at a time, waiting for each to complete. On NVMe, this left 32,000+ entry queues empty, achieving only 400 IOPS instead of 200,000+ possible.

Detection: nvme list showed queue temperatures at 0.1% utilization. Latency was dominated by application-level synchronization.

Mitigation: Submit multiple requests asynchronously:

# BAD: Synchronous submission
for chunk in file_chunks:
    fd.write(chunk)  # Waits for each write before next

# GOOD: Async batch submission
async with aioopen('/dev/nvme0n1', 'O_DIRECT') as fd:
    await fd.write_all(file_chunks)  # Batch of 128+ requests

Trade-off Table

AlgorithmAvg Seek DistanceStarvation RiskThroughputComplexity
FCFSVery High (random access)None (trivially fair)LowTrivial
SSTFMinimalHigh (far requests ignored)Highest (HDD)Low
SCANModerateLow (guaranteed service)HighMedium
C-SCANModerate-HighVery Low (uniform guarantee)HighMedium
DeadlineModerateVery Low (deadline guarantee)HighMedium
CFQVariableVery Low (fair queuing)ModerateHigh

Implementation Snippets

Simulating SCAN Scheduling

#!/usr/bin/env python3
"""
Simulates the SCAN (elevator) disk scheduling algorithm.
"""
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    sector: int
    pid: int = 0
    arrival_time: float = 0.0


class SCANScheduler:
    def __init__(self, disk_size: int = 10000, head_start: int = 0):
        self.disk_size = disk_size
        self.head = head_start
        self.direction = 1  # 1 = toward end, -1 = toward start
        self.queue: List[Request] = []
        self.total_seek = 0

    def add_request(self, req: Request):
        self.queue.append(req)

    def _sort_key(self, req: Request) -> tuple:
        """Sort by direction-aware distance from head."""
        if self.direction == 1:
            # Moving toward end: prioritize requests ahead of head
            if req.sector >= self.head:
                return (0, req.sector)  # In direction, sort by sector
            else:
                return (1, self.disk_size - req.sector)  # Behind, sort by reverse distance
        else:
            # Moving toward start
            if req.sector <= self.head:
                return (0, -req.sector)  # In direction, sort by reverse sector
            else:
                return (1, req.sector)  # Ahead, sort by distance

    def run(self) -> List[tuple]:
        """Run SCAN until queue empty, return list of (sector, seek_distance)."""
        serviced = []

        while self.queue:
            # Sort by direction-aware distance
            self.queue.sort(key=self._sort_key)
            req = self.queue.pop(0)

            seek = abs(req.sector - self.head)
            self.total_seek += seek
            serviced.append((req.sector, seek))
            self.head = req.sector

            # At disk edge, reverse direction
            if self.head >= self.disk_size - 1:
                self.direction = -1
            elif self.head <= 0:
                self.direction = 1

        return serviced


if __name__ == "__main__":
    import random
    random.seed(42)

    scheduler = SCANScheduler(disk_size=10000, head_start=5000)

    # Add random requests
    for i in range(20):
        scheduler.add_request(Request(sector=random.randint(0, 9999)))

    print(f"Initial head position: 5000")
    print(f"Running SCAN scheduler...\n")

    serviced = scheduler.run()
    for i, (sector, seek) in enumerate(serviced):
        print(f"  Step {i+1:2d}: Service sector {sector:5d} (seek={seek:5d})")

    print(f"\nTotal seek distance: {scheduler.total_seek}")
    print(f"Average seek: {scheduler.total_seek / len(serviced):.1f}")

Deadline Scheduler Verification

#!/bin/bash
# Benchmark deadline scheduler performance

echo "=== Setting deadline scheduler ==="
echo deadline | sudo tee /sys/block/sda/queue/scheduler

echo ""
echo "=== Current scheduler ==="
cat /sys/block/sda/queue/scheduler

echo ""
echo "=== Tunable parameters ==="
cat /sys/block/sda/queue/iosched/read_expire_ms
cat /sys/block/sda/queue/iosched/write_expire_ms
cat /sys/block/sda/queue/iosched/fifo_batch

echo ""
echo "=== Benchmark with fio ==="
sudo fio --name=randread --filename=/dev/sda --ioengine=libaio \
    --rw=randread --bs=4k --numjobs=4 --runtime=30 --time_based=1 \
    --readonly --iodepth=32 2>/dev/null | grep -E "IOPS|lat"

Observability Checklist

Linux Block Layer Metrics

# View queue depth and scheduler behavior
cat /sys/block/sda/queue/nr_requests     # Queue depth
cat /sys/block/sda/queue/read_expire_ms   # Deadline read timeout
cat /sys/block/sda/queue/write_expire_ms  # Deadline write timeout

# iostat for throughput and queue length
iostat -xz 1                              # Per-device utilization and queue
iostat -xdy 1                             # Per-partition detailed stats

# blktrace for request-level tracing
sudo blktrace -d /dev/sda -o - | blkparse -i -

Key Metrics to Monitor

MetricHealthy RangeAlert ThresholdIndicates
avgqu-sz< 4> 16Queue saturation
util< 70%> 90%Disk bottleneck
rkB/s + wkB/sWorkload dependentSudden dropsController or disk issue
await< 10ms HDD, < 1ms SSD> 100msLatency issues
svctm< 10ms HDD> 20msDevice service degradation

Common Pitfalls / Anti-Patterns

I/O Scheduling Security Implications

  1. Timing Attacks: Predictable scheduling can leak information through access patterns. Adversaries measuring I/O timing might infer what data other processes access.

  2. Denial of Service: A malicious process could submit enormous numbers of tiny requests, monopolizing scheduler time and starving legitimate processes. Use cgroups I/O controllers to enforce fairness:

    # Limit cgroup to 50MB/s write bandwidth
    echo "8:0 52428800" > /sys/fs/cgroup/blkio/user.slice/io.max
  3. Data Lifetime: Scheduling decisions affect when data reaches stable storage. Synchronous writes must bypass caches—verify sync() calls actually flush before returning.

Common Pitfalls / Anti-patterns

Anti-patternWhy It’s BadCorrect Approach
Using SSTF on multi-process systemsStarvation causes unpredictable latencyUse deadline or mq-deadline
Ignoring scheduler on NVMeScheduler overhead adds latency without benefitSet none or mq-deadline
Submitting requests synchronouslyUnderutilizes hardware queuesBatch async submissions
Not aligning to physical block sizeCauses read-modify-write cyclesUse fdisk alignment, strace misalignments
Mixing SSD and HDD scheduling logicAssumes seek time matters on flashDifferent schedulers per device type

Quick Recap Checklist

  • I/O schedulers order disk requests to minimize seek time and prevent starvation
  • FCFS is fair but performs poorly; SSTF minimizes seeks but risks starvation
  • SCAN (elevator) sweeps across the disk; C-SCAN provides uniform service times
  • Deadline scheduling adds time guarantees to prevent starvation
  • CFQ provided fairness but was deprecated; mq-deadline is the modern replacement
  • NVMe SSDs benefit less from OS scheduling—use none or minimal scheduling
  • Monitor iostat -x for queue depth, utilization, and latency metrics
  • Use cgroups I/O controllers for multi-tenant fairness guarantees
  • Scheduler choice should match workload: interactive = mq-deadline, database = deadline, real-time = BFQ

Interview Questions

1. What is the "elevator algorithm" and what problem does it solve?

The elevator algorithm (formally SCAN) solves the problem of arbitrary request ordering in FCFS. It services requests in cylinder order as the head sweeps across the disk—moving in one direction, serving all requests it encounters, then reversing and sweeping back. This dramatically reduces total seek distance compared to random access, while guaranteeing every request will be serviced (no starvation). The name comes from the similarity to how an elevator services floor requests.

2. Why did Linux deprecate CFQ (Completely Fair Queueing)?

CFQ was deprecated for several reasons: First, it didn't scale well to high-IOPS devices like NVMe—its per-process queue management overhead became significant. Second, it introduced latency for certain workloads because "fair" doesn't mean "appropriate"—a 1KB metadata read and a 1MB sequential write shouldn't get equal time slices. Third, it interacted poorly with the block layer multi-queue redesign for modern storage. Finally, it was complex (thousands of lines of code) for marginal benefit over simpler deadline-based approaches. mq-deadline and BFQ replaced it for most use cases.

3. What is request starvation and which algorithms prevent it?

Starvation occurs when a request is perpetually deferred because others keep getting priority—typically with SSTF, where requests far from the current head position never get serviced if new closer requests keep arriving. FCFS has no starvation (trivially fair). SCAN prevents starvation because every request in the sweep direction gets serviced before the head reverses. Deadline prevents starvation by enforcing maximum wait times—expired requests jump the queue. CFQ/BFQ prevent starvation by giving each process bounded time slices.

4. How does the deadline scheduler prevent write starvation?

The deadline scheduler maintains separate read and write FIFOs, each with associated deadlines (default: 500ms for reads, 5 seconds for writes). Normally it dispatches requests sorted by location (like SSTF). However, before each dispatch, it checks if any request's deadline has expired—if so, it immediately services that request from the FIFO regardless of position. This ensures writes don't starve even when reads keep arriving, because eventually the write deadline expires and forces the write to be serviced. The write deadline is intentionally longer because applications typically block longer on reads.

5. Why does NVMe make I/O scheduling less important?

NVMe SSDs have fundamentally different characteristics: No mechanical seek time (NAND flash has no moving parts). Internal massive parallelism (thousands of concurrent operations). Deep command queues (up to 64,000 commands per queue). The device knows its own internal topology and state far better than the OS scheduler can predict. The OS scheduler typically adds overhead without meaningful optimization. Modern practice is to either use none scheduler for NVMe, or use mq-deadline only to batch and merge requests before handing them to the device's own internal scheduler. The benefit of OS-level scheduling diminishes dramatically when the device handles its own scheduling with perfect information.

6. What is anticipatory scheduling and why was it deprecated in favor of deadline and CFQ?

Anticipatory scheduling was based on the observation that after a read completes, the process that issued it typically issues another read nearby (sequential pattern). The scheduler would "anticipate" this and wait a few milliseconds after a read completes to see if the same process issues another read nearby, maximizing sequential reads. This improved throughput for sequential-heavy workloads but introduced latency for random I/O because requests waited while the scheduler anticipated. When the workload changed (more random reads, more mixed workloads), anticipatory scheduling caused performance degradation. The Linux kernel deprecated anticipatory scheduling and removed it from mainline in favor of CFQ (which provided fairness) and deadline (which provided latency guarantees). Anticipatory scheduling illustrated that optimizing purely for throughput can harm fairness and latency.

7. How does the Linux block layer multi-queue design improve performance on modern storage devices?

Modern NVMe and SATA SSDs have multiple hardware queues for commands. The older single-queue block layer (with a global elevator) became a bottleneck—all I/O passed through one queue, even when the device could handle many concurrent operations. The multi-queue block layer creates per-CPU software queues (mapped to hardware queues) so that each CPU can submit I/O without contention. The device's own internal scheduler handles ordering across hardware queues. This design reduces lock contention at the OS level (each CPU touches its own queue) and better utilizes the device's internal parallelism. The `mq-deadline` and `BFQ` schedulers are designed for this multi-queue environment. The block layer also handles request batching and merging before submission to reduce per-command overhead.

8. What is the NOOP (Noop) scheduler and when is it appropriate?

The Noop scheduler is the simplest possible scheduler—it implements a FIFO queue with no reordering whatsoever. Requests are processed in the order they arrive. This seems terrible for HDDs (random access is slow), but Noop has specific use cases. For SSDs with internal wear-leveling and command queuing, the OS-level scheduling adds overhead without benefit—the device does its own optimal scheduling. For virtualized environments where the hypervisor's storage layer is the real scheduler, an OS-level scheduler might conflict with or duplicate the hypervisor's scheduling. For storage that is essentially a ramdisk or battery-backed, sequential ordering provides no benefit. Noop has the lowest CPU overhead of any scheduler because it does almost nothing, making it attractive for embedded or high-IOPS scenarios where CPU overhead of scheduling matters more than disk efficiency.

9. How does elevator merging work in the Linux block layer and why does it reduce I/O overhead?

Elevator merging combines adjacent I/O requests into a single larger request before submission to the hardware. If two read requests are for consecutive sectors (e.g., sector 100 and sector 101), the scheduler merges them into one request spanning sectors 100-101. This reduces the number of I/O operations the device must process, which is especially important for HDDs where each operation has seek overhead. Merging is particularly effective for sequential write patterns where multiple small writes are combined into larger, more efficient writes. The deadline scheduler includes merge logic before applying deadline constraints. Without merging, a process writing one byte at a time would generate one I/O operation per byte—a disaster on rotating media. Even on SSDs, merging reduces command processing overhead.

10. What is the relationship between I/O scheduler decisions and filesystem fragmentation?

I/O scheduler decisions directly affect filesystem fragmentation. When the scheduler groups and reorders requests, it changes the order in which file blocks are written to disk. An SCAN-style scheduler that sweeps across the disk tends to write file blocks in roughly sequential order, producing less fragmentation. An SSTF-based scheduler (if not properly deadline-constrained) might interleave blocks from different files, increasing fragmentation over time. Filesystems like ext4 use extent-based allocation to reduce fragmentation sensitivity, but heavy random-write workloads will eventually fragment even extent-based filesystems. The relationship is: more intelligent I/O scheduling leads to more sequential writes, which leads to lower fragmentation, which in turn improves performance because reads are more sequential. This is a feedback loop—better scheduling reduces fragmentation, which reduces the need for seeking, which improves effective throughput.

11. How does the Linux kernel handle I/O scheduling for network storage (NAS, NFS) versus local block devices?

Network storage doesn't go through the block I/O scheduler because network protocols (NFS, CIFS) use the network stack, not the block layer. The OS-level I/O scheduler only applies to block devices (disks, SSDs, RAID arrays). For NFS, the network packet scheduler (qdisc) in the network stack handles ordering, and NFS client readahead/writethrough caching determines when data is sent over the network. For NAS, the "scheduling" effectively happens at the NFS server level—the client sends requests and the server orders them on its attached storage. This means that choosing `deadline` versus `cfq` on a Linux NFS client has no effect on the actual storage performance—only on local caching behavior and how requests are batched before being sent. For local storage, the scheduler matters. For remote storage, you tune the network qdisc and NFS mount options.

12. What is the blk-mq (block multi-queue) architecture and why was it necessary for NVMe?

blk-mq (Block Multi-Queue) is the redesigned block layer introduced in Linux 3.13 for modern storage. It creates per-CPU submission queues that map to hardware submission/completion queues, leveraging multi-core parallelism for high-IOPS devices. The traditional single-queue approach serialized all I/O through one lock, which became a bottleneck at very high IOPS. With blk-mq, each CPU has its own queue and can submit I/O without lock contention—the hardware (NVMe) handles parallel processing internally. blk-mq also enables hardware queue affinity (steering I/O to specific queues based on NUMA locality or device topology). The `mq-deadline` and `BFQ` schedulers sit on top of blk-mq. Most modern SSDs (NVMe and SATA AHCI) are supported via blk-mq. The blk-mq redesign was one of the most significant kernel changes for storage performance in years.

13. What is the relationship between I/O scheduler latency and quality of service (QoS) guarantees?

I/O scheduler latency directly impacts QoS because different workloads require different latency guarantees. A database synchronous write (fsync) needs to complete as fast as possible—the scheduler's deadline mechanism ensures it has a bounded maximum wait. A background backup write is latency-tolerant and can wait. The scheduler's job is to differentiate these: deadline ensures reads are bounded to ~500ms by default; BFQ provides bandwidth guarantees ensuring each process gets its share. Without QoS differentiation, a heavy background backup can starve a small interactive read, causing user-facing latency spikes. Linux's `blk-throttle` mechanism extends this with cgroup-based I/O bandwidth limits, allowing per-cgroup QoS for multi-tenant workloads. The scheduler is the mechanism that enforces the QoS policy the operator defines.

14. How does the elevator algorithm's request merging affect the effective bandwidth of a disk?

Request merging combines adjacent requests, reducing the total number of I/O operations issued to the disk. For an HDD, fewer operations means fewer seeks—each seek takes 5-15ms, so merging adjacent requests can eliminate seeks entirely, dramatically increasing effective bandwidth. If the scheduler merges 8 requests for consecutive sectors into 1 request, the disk handles 1 seek instead of 8, potentially cutting latency by an order of magnitude. For SSDs, merging reduces command processing overhead and allows the drive's internal NAND to handle larger, more efficient write operations. The effective bandwidth improvement from merging depends on the workload: sequential workloads benefit most (nearly all requests are mergeable), while random I/O workloads see little merging benefit. The block layer's merge logic is fundamental to all I/O schedulers.

15. What is the blkcg (block cgroup) infrastructure and how does it enable per-cgroup I/O throttling?

blkcg provides I/O control via cgroups, allowing disk I/O bandwidth and IOPS limits per cgroup. The `io.max` interface in cgroups v2 limits a cgroup to specific bandwidth (bytes per second) or IOPS. Unlike the `blkio` cgroup v1 throttling which operated at the scheduler level, `io.max` works at the device level and is scheduler-agnostic—it works with deadline, mq-deadline, BFQ, or even no scheduler. The throttling happens before the scheduler, making it effective even for NVMe where the scheduler matters less. This enables true per-cgroup guarantees in multi-tenant environments. For example, in Kubernetes, you can set `storage.io.throttle.read_bps` to guarantee that a specific pod cannot exceed a certain read rate. Without blkcg, a noisy neighbor pod could monopolize disk bandwidth.

16. How does the SCAN algorithm's direction reversal (elevator reversal) affect request latency variance?

SCAN reverses direction at each end of the disk—moving from innermost to outermost, then back. This creates a pattern where requests at the edges of the disk have higher latency because they wait for the full sweep before being serviced. After SCAN processes all inner requests and reaches the innermost cylinder, it reverses and starts processing outer requests. If a request arrives at the outermost cylinder after the sweep has passed, it must wait for the full sweep back. This creates asymmetric latency: requests near the center of the disk tend to be serviced more frequently than those at the extremes. C-SCAN addresses this by servicing requests only in one direction and then jumping back to the start (not servicing requests on the return trip), giving every cylinder equal service time. The tradeoff is slightly higher total seek distance (no benefit from the return sweep).

17. What is the difference between I/O throughput and I/O latency, and why does the scheduler prioritize them differently for different workloads?

Throughput is the total amount of data transferred per second (MB/s); latency is the time from request to completion for a single operation (ms). A large sequential write has high throughput but potentially moderate latency. A small random read has high latency but contributes little to throughput. A batch analytics workload cares about throughput—getting as much data processed as possible. A database OLTP workload cares about latency—each individual transaction needs to complete quickly. The scheduler must balance these: pure SSTF maximizes throughput on HDDs but creates latency variance for random requests. Deadline scheduling sacrifices some throughput for bounded latency. BFQ tries to guarantee bandwidth for each process, balancing both. The scheduler configuration for a given system should reflect the workload priority—if latency is critical (interactive), use deadline; if throughput is critical (batch), use mq-deadline with appropriate tuning.

18. How does I/O scheduler selection interact with storage encryption (LUKS, BitLocker) and what overhead does encryption add?

Full-disk encryption (LUKS, BitLocker) operates below the filesystem layer in the block device layer, encrypting each disk sector on write and decrypting on read. This happens in the bio path after the I/O scheduler has merged and ordered requests but before the data reaches the storage medium. The encryption overhead is CPU-bound and pipelineable—modern CPUs with AES-NI can encrypt at several GB/s with negligible overhead. The key interaction with the I/O scheduler is that encryption does not change request ordering or merging behavior—the scheduler operates on logical block addresses, which are encrypted to physical addresses transparently. However, encrypted storage has higher CPU overhead per I/O operation, which can become a bottleneck at very high IOPS when the CPU's crypto unit becomes saturated. Hardware encryption accelerators (Intel QAT, AMD PSP) offload this to dedicated silicon, eliminating the bottleneck.

19. What is the impact of physical block size (512e vs 4Kn) on I/O scheduler performance?

Many modern HDDs and some SSDs report 4KB physical sectors internally while exposing 512-byte logical sectors to the OS (512e). A write that is not aligned to the physical block size triggers read-modify-write cycles: the device must read the full 4KB block, modify the 512-byte portion, and write the block back. The I/O scheduler can reduce the occurrence of these misaligned operations by merging adjacent requests and ensuring that requests are aligned to physical boundaries. An 512-byte write at sector 7 will cause a 4KB read-modify-write cycle at the physical block level; merging it with adjacent writes to reach 4KB alignment can eliminate this overhead. Modern storage devices with native 4KB sectors (4Kn) avoid this complexity. The scheduler's merge logic is particularly important for 512e drives because it can reduce the penalty of misaligned writes.

20. How does the kernel's I/O latency histogram (available in blktrace and btt) help diagnose scheduler problems?

The kernel can record I/O latency at the block layer using the blktrace infrastructure, capturing timestamps for each request from submission through dispatch to completion. The btt (blktrace analysis tool) converts this into histograms showing latency distribution. If the histogram shows a long tail (99th percentile latency much higher than median), the scheduler might be prioritizing throughput over latency, or the disk might be heavily seek-bound. A bimodal distribution (requests clustered at low and high latency) suggests request-dependent behavior—some requests are being serviced while others wait excessively. Comparing latency histograms across different schedulers (deadline vs mq-deadline vs BFQ) reveals which scheduler best controls tail latency for your workload. You can also identify "I/O storms"—periods where latency spikes dramatically, often correlating with specific workloads or cron jobs.

Further Reading

Conclusion

I/O schedulers transform arbitrary disk request streams into optimized service orders, minimizing mechanical seek time and preventing request starvation. The evolution from FCFS through SSTF, SCAN, C-SCAN, and deadline-based algorithms reflects a maturing understanding of the tradeoff between throughput, fairness, and implementation complexity.

The shift from CFQ to mq-deadline and BFQ in modern Linux reflects the hardware reality of NVMe SSDs with internal scheduling and multi-queue architectures. For rotating storage, deadline scheduling remains the pragmatic choice for balancing throughput with write fairness. For interactive workloads, BFQ’s bandwidth-guarantee approach shines.

Looking forward, as NVMe dominates storage and storage-class memory reduces latency gaps further, OS-level scheduling will continue diminishing in importance. The future belongs to device-managed scheduling with application-level hints, not kernel-managed command reordering.

Category

Related Posts

ASLR & Stack Protection

Address Space Layout Randomization, stack canaries, and exploit mitigation techniques

#operating-systems #aslr-stack-protection #computer-science

Assembly Language Basics: Writing Code the CPU Understands

Learn to read and write simple programs in x86 and ARM assembly, understanding registers, instructions, and the art of thinking in low-level operations.

#operating-systems #assembly-language-basics #computer-science

Boolean Logic & Gates

Understanding AND, OR, NOT gates and how they combine into arithmetic logic units — the building blocks of every processor.

#operating-systems #boolean-logic-gates #computer-science