B-Trees and B+ Trees: Database Index Structures

Understand B-trees and B+ trees for disk-based storage, database indexing, and efficient range queries with O(log n) operations.

published: reading time: 16 min read author: GeekWorkBench

B-Trees and B+ Trees: Database Index Structures

B-trees are the workhorse of database indexes and file systems. While balanced binary trees give O(log n) search, they perform terribly on disk—each node access might require a disk read. B-trees solve this by storing hundreds of keys per node, matching disk block sizes and reducing tree height from ~30 (for billion records) to ~3.

The B+ tree variant that’s even more common in databases adds a crucial optimization: all records are stored in leaf nodes, while internal nodes only store keys. The leaf nodes are linked sequentially, enabling efficient range queries by traversing the linked list rather than revisiting parent nodes. MySQL’s InnoDB, PostgreSQL, and most RDBMSes use B+ trees as their primary index structure.

Introduction

B-trees are the workhorse of database indexes and file systems, yet their design makes perfect sense once you understand the problem they solve: disk I/O is expensive (milliseconds versus nanoseconds for memory), and balanced binary trees with two children per node require too many disk reads to be practical for large datasets. A B-tree node typically holds hundreds of key-pointer pairs, matching disk block sizes so that each node access costs at most one disk read. With t=100 (200 keys per node), a billion records requires only ~3 levels instead of ~30 for a binary tree.

B+ trees are the dominant variant in relational databases. They optimize B-trees for read-heavy workloads by storing all data only in leaf nodes, while internal nodes store only keys for routing. The leaf nodes are linked sequentially, making range queries efficient by traversing the linked list rather than bouncing back to parent nodes repeatedly. MySQL’s InnoDB, PostgreSQL, Oracle, and SQL Server all use B+ trees as their primary index structure.

This guide covers B-tree and B+ tree structure and operations, node split and merge mechanics, and the critical differences that make B+ trees preferable for most database workloads. You’ll understand why page size and fill factor matter in practice, how clustered versus non-clustered indexes affect performance, and why sequential ID generation (auto-increment) causes less index bloat than random UUIDs.

When to Use

B-trees and B+ trees apply when:

  • Database indexing — Primary access path for most relational databases
  • Filesystem indexing — NTFS, ext4, and HFS+ use B+ tree variants
  • Large datasets on disk — Minimizing disk I/O matters more than memory efficiency
  • Range queries — B+ tree linked leaves enable sequential access
  • Ordered iteration — Keys can be retrieved in sorted order efficiently

When Not to Use

  • In-memory databases (red-black trees or skip lists may be faster due to cache locality)
  • Write-heavy workloads (B-tree rebalancing involves page splits that can be expensive)
  • Key-value stores only needing point queries (hash indexes are O(1))

Architecture: Multi-Way Balanced Trees

graph TD
    subgraph Internal["B+ Tree Internal Node"]
        I1["[20 | 40 | 60]"]
    end

    subgraph Leaf["B+ Tree Leaf Nodes (linked)"]
        L1["[10-15] -> [20-30] -> [40-50] -> [60-70]"]
    end

    I1 --> L1

Internal nodes store keys for routing; leaf nodes store actual data pointers and are linked for range access.

Implementation

B-Tree Node

class BTreeNode:
    def __init__(self, t, leaf=False):
        self.t = t  # Minimum degree (defines order)
        self.leaf = leaf
        self.keys = []  # Key values
        self.children = []  # Child pointers
        self.n = 0  # Current number of keys


class BTree:
    """
    B-tree implementation for educational purposes.
    In practice, use database-provided implementations.
    """

    def __init__(self, t):
        self.t = t  # Minimum degree (order = 2t)
        self.root = BTreeNode(t, leaf=True)

    def search(self, k, node=None):
        """Search for key k, return node if found."""
        if node is None:
            node = self.root

        i = 0
        while i < node.n and k > node.keys[i]:
            i += 1

        if i < node.n and k == node.keys[i]:
            return node, i
        if node.leaf:
            return None
        return self.search(k, node.children[i])

    def insert(self, k):
        """Insert key k into B-tree."""
        root = self.root
        if len(root.keys) == 2 * self.t - 1:
            new_root = BTreeNode(self.t, leaf=False)
            new_root.children.append(root)
            self._split_child(new_root, 0, root)
            self.root = new_root
        self._insert_nonfull(self.root, k)

    def _split_child(self, parent, i, child):
        """Split a full child node."""
        t = self.t
        new_child = BTreeNode(t, leaf=child.leaf)
        new_child.n = t - 1

        # Copy keys to new child
        for j in range(t - 1):
            new_child.keys.append(child.keys[j + t])
        if not child.leaf:
            for j in range(t):
                new_child.children.append(child.children[j + t])

        child.n = t - 1
        child.keys = child.keys[:t - 1]
        child.children = child.children[:t]

        # Insert median key into parent
        parent.keys.insert(i, child.keys.pop(t - 1))
        parent.children.insert(i + 1, new_child)

    def _insert_nonfull(self, node, k):
        """Insert into non-full node."""
        i = len(node.keys) - 1
        if node.leaf:
            node.keys.append(None)
            while i >= 0 and k < node.keys[i]:
                node.keys[i + 1] = node.keys[i]
                i -= 1
            node.keys[i + 1] = k
        else:
            while i >= 0 and k < node.keys[i]:
                i -= 1
            i += 1
            if len(node.children[i].keys) == 2 * self.t - 1:
                self._split_child(node, i, node.children[i])
                if k > node.keys[i]:
                    i += 1
            self._insert_nonfull(node.children[i], k)

Common Pitfalls / Anti-Patterns

  1. Confusing t (minimum degree) with order — B-tree of order m has max m children; t = ceil(m/2)
  2. Not handling root split — Special case when root is full and needs to become internal
  3. B+ tree leaf vs internal keys — Internal nodes store separator keys for routing only
  4. Disk block alignment — Real implementations align to filesystem blocks (4KB typically)

Quick Recap Checklist

  • B-trees store up to 2t-1 keys per node, t-1 to 2t-1 children
  • All leaves at same depth (balanced)
  • B+ tree: only leaves store data, internal nodes store routing keys only
  • B+ tree leaves are linked for efficient range scans
  • Used by virtually all relational databases for index access

Trade-off Analysis

AspectB-TreeB+ TreeNotes
Search depthO(log n) to leafO(log n) to leafSame asymptotic
Range queriesMust revisit parentSequential leaf linksB+ tree wins
Insert/SplitSplit may affect all levelsSameSimilar complexity
Storage efficiencyAll nodes store dataOnly leaves store dataB+ tree wins
Pointer overheadKeys + data in all nodesInternal nodes only have keysB+ tree wins
Point query speedCan find at leaf earlierAlways to leafB-tree may be faster
Leaf scan speedNon-sequentialSequential via linked listB+ tree wins

When to prefer B-tree: Write-heavy workloads, point queries where early termination helps, embedded databases with limited RAM.

When to prefer B+ tree: Read-heavy workloads with range scans, database indexes, filesystem metadata.

Real-world Failure Scenarios

  1. PostgreSQL index bloat: Frequent updates leave empty space in pages, causing index size inflation. Periodic REINDEX or VACUUM needed.

  2. MySQL InnoDB page splits: Random inserts cause many page splits at intermediate levels, degrading insert performance. Bulk loading or sorted inserts perform better.

  3. SQLite B-tree depth growth: Small cache + large dataset = cache misses. With default page size, deep trees cause 10-20 I/Os per query on cold cache.

  4. MongoDB index selectivity: Low-selectivity indexes (e.g., boolean field) cause full index scans that are slower than collection scans.

  5. Hot spot nodes in distributed B+ trees: Cassandra/ScyllaDB partition splits create hot nodes. Partition strategy matters more than tree algorithm.

Interview Questions

1. Why are B-trees better for disk-based storage than binary trees?

Disk I/O is expensive (milliseconds vs nanoseconds for memory). A binary tree with 1 billion records has height ~30—worst case 30 disk reads. A B-tree with t=100 (200 keys per node) has height only 3—maximum 3 disk reads. B-trees maximize fanout to minimize tree height and disk accesses per lookup.

2. What's the difference between B-tree and B+ tree?

In a B-tree, all nodes (including leaves) can store data and keys. In a B+ tree, only leaf nodes store data; internal nodes store only routing keys. This means: B+ trees have higher fan-out (more keys per internal node), all leaves at same depth, and efficient range queries via linked leaves. B-trees can find a key without traversing to a leaf.

3. What happens during B-tree node split?

When a node is full (2t-1 keys) and needs to insert, it splits at the median key. The median key goes to the parent; the left half stays in original node, right half goes to new node. This may cascade—parent might also be full, requiring recursive split. Splits propagate toward root at most O(tree height), typically 3-4 levels for large trees.

4. How does B+ tree support efficient range queries?

B+ tree leaves are linked via next/prev pointers. After finding the starting leaf via the tree, range queries traverse the linked list without returning to parent nodes. This enables sequential leaf access with minimal disk seeks—one I/O per leaf page vs. O(log n) parent revisits for a standard B-tree.

5. What is the relationship between B-tree order (m) and minimum degree (t)?

A B-tree of order m has a maximum of m children per node. The minimum degree t defines the node capacity as: minimum children = t, maximum children = 2t. Relationship: m = 2t (or close to it depending on implementation). For example, with t=50, each node holds up to 99 keys and up to 100 children pointers.

6. Why do database indexes use B+ trees instead of red-black trees?

Red-black trees have height O(log n) but low fanout (~2 children per node), meaning ~30 levels for 1 billion records. Each level may be a disk I/O. B+ trees pack hundreds of keys per node (matching 4KB disk blocks), reducing height to 3-4 levels. Fewer levels = fewer disk seeks = orders of magnitude faster for storage-backed systems.

7. What is write amplification in B-trees and when does it matter?

Write amplification = actual disk writes per user insert. B-trees cause writes when: (1) updating leaf pages, (2) splitting pages (write parent and two children), (3) updating parent keys. In write-heavy workloads, page splits cause multiple disk writes per insert. LSM trees reduce write amplification by batching writes in memory before sequential SSTable writes, at the cost of read amplification.

8. How does key size affect B-tree performance?

Larger keys reduce fanout: fewer keys fit per node, increasing tree height and I/O per search. For a 4KB page with 8-byte pointers, a 16-byte key allows ~200 keys per node (height ~3 for 1B records), while a 64-byte key drops to ~50 keys (height ~5). Composite indexes (key1, key2) compound this. Covered indexes and prefix compression mitigate the problem for index-only scans.

9. How does B+ tree handle deletions and what is page reclamation?

Expected answer points:

  • Deletions merge sibling nodes when keys fall below minimum threshold (t-1)
  • Coalescence or redistribution handles underfull nodes
  • Page reclamation: deleted pages return to free list for reuse
  • PostgreSQL VACUUM reclaims dead tuples; MySQL InnoDB purge thread handles deleted records
  • Lazy deletion: pages marked deleted but not immediately reused until VACUUM/purge runs
10. What is the difference between clustered and non-clustered indexes in the context of B+ trees?

Expected answer points:

  • Clustered index: B+ tree leaf pages ARE the data pages; table rows stored in order of index key
  • Non-clustered index: B+ tree leaf pages contain indexed key + row pointer to data
  • Each table has one clustered index (primary key typically); multiple non-clustered indexes
  • Clustered: range queries faster (data already ordered), insert/delete more expensive
  • Non-clustered: extra I/O for data lookup via pointer chase
11. What is index bloat and how does it affect B+ tree performance?

Expected answer points:

  • Index bloat: accumulation of dead or reusable pages due to updates/deletes
  • Causes: random inserts/deletes leave gaps, update-in-place rewrites pages
  • Symptoms: index size larger than necessary, degraded scan performance, increased cache pressure
  • Solutions: REINDEX rebuilds index from scratch; VACUUM (PostgreSQL) reclaims dead space
  • MySQL InnoDB: OPTIMIZE TABLE reorganizes pages;ibuffg purge thread handles deletions
12. How does B+ tree height stay low even with billions of records?

Expected answer points:

  • High fanout: each internal node holds hundreds of children pointers
  • 4KB page with 8-byte pointers allows ~500 children per internal node
  • Height calculation: log_500(1 billion) ≈ 3.4 — only 4 levels needed
  • Root node typically cached in memory; path from root to leaf is 2-3 disk I/Os
  • B-tree variants with larger pages (e.g., 64KB) can achieve height of 2 for same dataset
13. What is the fill factor in B+ tree indexes and why does it matter?

Expected answer points:

  • Fill factor: percentage of page space to reserve for future inserts (default 90%)
  • Higher fill factor: more storage efficient, fewer pages, but more page splits on insert
  • Lower fill factor: leaves room for in-page inserts, reduces splits but wastes space
  • Use low fill factor for heavily updated tables, high fill factor for read-mostly
  • PostgreSQL: CREATE INDEX WITH (fillfactor=70); MySQL InnoDB:innodb_fill_factor
14. How does B+ tree support unique indexes vs non-unique indexes?

Expected answer points:

  • Unique index: each key appears exactly once in tree; enforced at insert time
  • Non-unique index: duplicate keys allowed; typically handled by appending row IDs
  • B+ tree stores (key, row_pointer) tuples; for duplicates, uses (key, transaction_id) or list pages
  • Unique constraint check traverses tree O(log n) then verifies no match at leaf
  • Non-unique: may need to scan all matching entries for query execution
15. What is prefix compression in B+ tree leaf pages and how does it work?

Expected answer points:

  • Leaf pages store sorted keys; adjacent keys often share common prefixes
  • Prefix compression: store first key fully, subsequent keys store only differing suffix
  • Example: ["apple", "apricot", "banana"] becomes ["apple", "-ricot", "banana"]
  • Reduces leaf page space by 20-40% depending on key prefix similarity
  • Decompression needed on read; trade-off between CPU and I/O savings
16. Compare B+ trees with LSM trees (Log-Structured Merge Trees).

Expected answer points:

  • LSM trees: writes go to memory (MemTable), then flushed to SSTables on disk
  • B+ trees: in-place writes; page splits cause random I/O
  • LSM tree write amplification: 1 (data written once), B+ tree: 3-10x
  • B+ tree read amplification: O(log n), LSM tree: O(log n) + K (K = number of SSTables)
  • LSM trees: better for write-heavy workloads; B+ trees: better for read-heavy or mixed
17. What is a covering index and how does it eliminate table lookups in B+ trees?

Expected answer points:

  • Covering index: query can be satisfied entirely from index pages without touching table
  • All referenced columns included in index key or stored as included columns
  • Query executor does index lookup, reads leaf nodes only — no row pointer chase to heap
  • PostgreSQL: "Index Only Scan"; MySQL: "Using index" in EXPLAIN
  • Trade-offs: larger index, slower inserts/updates, but faster reads for covered queries
18. How does the buffer pool/cache affect B+ tree performance in databases?

Expected answer points:

  • Buffer pool (innodb_buffer_pool_size, shared_buffers) caches B+ tree pages in memory
  • Hot pages: root, intermediate nodes cached; leaf pages loaded on demand
  • Cache hit ratio determines effective I/O cost; 99% hit ratio = 1% disk reads
  • LRU or clock-sweep eviction policies manage buffer pool pages
  • Prefetching: sequential leaf scans pre-load adjacent pages into buffer pool
19. What are the advantages and disadvantages of using GUIDs vs sequential IDs as B+ tree keys?

Expected answer points:

  • Sequential IDs (auto-increment): inserts go to rightmost leaf page — minimal page splits
  • GUIDs (UUIDs): random insertion point causes page splits throughout tree
  • Random GUIDs: write amplification high, index bloat higher, cache efficiency lower
  • Solutions: UUID v7 (time-ordered), UUID v1 with node ID, COMB (sequential GUIDs)
  • PostgreSQL: gen_random_uuid(); MySQL: UUID() — both random by default
20. How does B+ tree behave under high concurrency with multiple readers and writers?

Expected answer points:

  • Latch coupling: B+ tree operations acquire locks on nodes being modified
  • Read-write locks: readers share, writers exclusive; prevents dirty reads
  • Lock escalation: fine-grained node locks scale better than table-level locks
  • Write-write: page split or merge requires parent lock during modification
  • PostgreSQL: Heap Only Tuple (HOT) updates avoid index maintenance when possible
  • MySQL InnoDB: gap locks and next-key locks for isolation; reduces deadlocks but adds contention

Further Reading

Conclusion

B-trees and B+ trees are the backbone of virtually every relational database you’ve ever used. The core idea is simple: keep trees shallow by cramming more keys into each node, so disk I/O stays low. B+ trees go a step further by chaining leaf nodes together, which makes range scans actually bearable. If you want to see where this leads next, look into composite indexes, covering indexes, and how query planners decide which index to use — Database Indexing Deep Dive is a good starting point.

Category

Related Posts

Arrays vs Linked Lists: Understanding the Trade-offs

Compare arrays and linked lists in terms of access time, insertion/deletion efficiency, memory usage, and cache performance.

#arrays #linked-lists #data-structures

Arrays: 1D, 2D, and Multi-dimensional Mastery

Master array operations, traversal, search, common patterns like two-pointer and sliding window, and when to use multi-dimensional arrays.

#arrays #1d-arrays #2d-arrays

AVL Trees: Self-Balancing Binary Search Trees

Master AVL tree rotations, balance factors, and rebalancing logic. Learn when to use AVL vs Red-Black trees for your use case.

#avl-tree #self-balancing-bst #binary-search-tree