Directory Implementation
How file systems implement directories through linear search, hash tables, and B-trees, enabling efficient path resolution and file lookup operations.
Directory Implementation
When you type ls /home/user/documents/report.pdf, something has to figure out where that file actually lives on disk. That something is the directory implementation. The directory is the mapping between human-readable names and the underlying inodes that point to file data. How that mapping is structured determines how fast you can find files, how many files a directory can hold, and how the file system behaves under heavy load.
Most people think directories are simple folders. Under the hood, they’re sophisticated data structures that make the difference between a file system that screams and one that crawls.
Introduction
When to Use / When Not to Use
Understanding directory implementation helps with system design and troubleshooting.
When directory implementation matters:
- Systems with millions of files in single directories
- High-throughput workloads that create/delete files frequently
- Backup systems that must scan directory structures
- Troubleshooting slow file system operations
When you can ignore it:
- Normal desktop/server use with moderate file counts
- Cloud storage abstracted behind network file systems
- Applications using higher-level APIs (databases, etc.)
Architecture or Flow Diagram
graph TD
A[Path: /home/user/docs/file.txt] --> B[Root Directory]
B --> C[Look up "home"]
C --> D[User Directory]
D --> E[Look up "docs"]
E --> F[Docs Directory]
F --> G[Look up "file.txt"]
G --> H[Return Inode Number]
style A stroke:#ff00ff,stroke-width:2px
style H stroke:#ff00ff,stroke-width:2px
Path resolution walks through each component. The efficiency of each lookup step determines overall performance.
Core Concepts
Directory Entry Structure
At the basic level, a directory contains entries. Each entry maps a name to an inode:
// Classic Unix directory entry (simplified)
struct old_dirent {
unsigned long d_ino; // Inode number
unsigned short d_reclen; // Record length
unsigned char d_type; // File type
char d_name[256]; // File name (variable length)
};
// Modern directory entry with type
struct linux_dirent64 {
unsigned long d_ino; // Inode number
unsigned long d_off; // Offset to next entry
unsigned short d_reclen; // Record length
unsigned char d_type; // File type
char d_name[]; // File name (null-terminated)
};
The d_reclen field enables entries of variable length. Older systems used fixed-size entries (often 16 or 32 bytes), which limited file names and wasted space.
Linear Search (Simple List)
The most basic directory implementation: entries stored as a simple list, searched sequentially.
graph TD
A[Directory Entry Array] --> B[.]
A --> C[..]
A --> D[file1.txt]
A --> E[file2.txt]
A --> F[file3.txt]
A --> G[fileN.txt]
H[Search for "file3.txt"] --> I[Compare with file1.txt: NO]
I --> J[Compare with file2.txt: NO]
J --> K[Compare with file3.txt: YES - Return inode]
style H stroke:#00fff9
Characteristics:
- O(n) lookup time where n is number of entries
- No memory overhead for indexing
- Simple implementation
- Performance degrades linearly with directory size
def linear_search(directory, target_name):
"""Simulate linear directory search."""
for entry in directory.entries:
if entry.name == target_name:
return entry.inode
return None # Not found
# Complexity: O(n) where n = number of entries
# For 1 million files in a directory: up to 1 million comparisons
Hash Table Directories
Hash tables provide O(1) average lookup by computing a hash of the filename:
graph TD
A[Filename: "document.txt"] --> B[Hash Function]
B --> C[Bucket Index: 7]
C --> D[Bucket 7]
D --> E[Entry: document.txt -> inode 1234]
F[Filename: "report.pdf"] --> G[Hash Function]
G --> H[Bucket Index: 3]
H --> I[Bucket 3]
I --> J[Entry: report.pdf -> inode 5678]
style A stroke:#ff00ff
style F stroke:#ff00ff
Advantages:
- O(1) average lookup regardless of directory size
- Handles collisions through chaining or open addressing
- Good for large directories
Disadvantages:
- Hash computation overhead
- Collisions require additional handling
- Memory overhead for hash table structure
- Cannot efficiently list entries in sorted order
// Hash table directory implementation (conceptual)
typedef struct dir_entry {
char *name;
unsigned long inode;
struct dir_entry *next; // Collision chain
} DirEntry;
typedef struct {
DirEntry *buckets[TABLE_SIZE];
unsigned int (*hash_fn)(const char *name, int len);
} HashDirectory;
// Insert operation
int hash_insert(HashDirectory *dir, const char *name, unsigned long inode) {
unsigned int index = dir->hash_fn(name, strlen(name));
DirEntry *new_entry = malloc(sizeof(DirEntry));
new_entry->name = strdup(name);
new_entry->inode = inode;
new_entry->next = dir->buckets[index];
dir->buckets[index] = new_entry;
return 0;
}
B-Tree and B+ Tree Directories
Modern file systems (ext4, NTFS, XFS) use B-trees for directories to maintain sorted order and handle millions of entries:
graph TD
A[Root Node] --> B[Intermediate Node]
A --> C[Intermediate Node]
B --> D[Leaf: A-M]
B --> E[Leaf: N-Z]
C --> F[Leaf: A-G]
C --> G[Leaf: H-N]
style A stroke:#ff00ff,stroke-width:3px
style D stroke:#00fff9
style E stroke:#00fff9
style F stroke:#00fff9
style G stroke:#00fff9
B-Tree characteristics:
- O(log n) lookup time
- Maintains sorted order (range queries efficient)
- Handles insertions/deletions without complete restructuring
- Efficient for both small and large directories
ext4 HTree (Indexed Directory): ext4 uses a special B-tree called HTree (a hybrid tree) for directory indexing. It supports:
- O(log n) lookups via two-level indexing
- Linear directory entries as fallback for small directories
- Automatic upgrade to HTree when directory grows large
graph TD
A[Hash Index Root] --> B[Index Node 0]
A --> C[Index Node 1]
B --> D[Leaf: files 0-999]
B --> E[Leaf: files 1000-1999]
C --> F[Leaf: files 2000-2999]
C --> G[Leaf: files 3000-3999]
style A stroke:#ff00ff,stroke-width:3px
Path Resolution
Path resolution walks the directory tree, looking up each component:
// Simplified path resolution algorithm
inode_t resolve_path(const char *path) {
if (path[0] == '/') {
// Absolute path - start at root
current = root_inode;
path = path + 1;
} else {
// Relative path - start at current directory
current = cwd_inode;
}
while (*path) {
// Find next path component
component = next_component(&path);
// Skip empty components (e.g., "a//b")
if (strlen(component) == 0) continue;
// Look up component in current directory
inode = lookup_entry(current, component);
if (!inode) return -ENOENT; // Component doesn't exist
// Handle ".." and "."
if (strcmp(component, "..") == 0) {
current = get_parent(current);
} else if (strcmp(component, ".") == 0) {
// Stay at current
} else {
current = inode;
}
}
return current;
}
Directory Caching
The kernel caches directory lookups to avoid repeated I/O:
# Check dentry cache statistics
cat /proc/sys/fs/dentry-state
# View cached entries
ls -la /proc/slabinfo | grep dentry
# Clear dentry cache (requires root)
echo 3 > /proc/sys/vm/drop_caches
Production Failure Scenarios
Scenario 1: Million-File Directory Meltdown
What happened: A continuous integration system stored build artifacts in a single directory that grew to over 3 million files. Operations like ls and find became unusable. Creating new files took 10+ seconds. The file system was ext3 with linear directory search.
Why it happened: ext3 uses linear directory search by default. With 3 million files, finding any file required scanning millions of entries. The system spent all its time searching directories instead of doing useful work.
Detection:
# Count files in a directory
ls -la /path/to/directory | wc -l
# Or faster
find /path/to/directory -maxdepth 1 | wc -l
# Check directory entry type
debugfs -R "dir /path/to/directory" /dev/sda1
Mitigation:
-
Upgrade to ext4 which enables HTree indexing automatically
-
Use
e2fsckto convert directory to indexed format:sudo tune2fs -O dir_index /dev/sda1 sudo e2fsck -fD /dev/sda1 # Convert directories -
Restructure storage to spread files across multiple directories
-
Implement automatic directory sharding (e.g., using first letters)
Scenario 2: Directory Entry Corruption
What happened: A server crashed mid-write to a directory. On reboot, the directory contained entries pointing to non-existent inodes. Some file names were garbled. The kernel marked the file system dirty.
Why it happened: Without journaling (ext3), directory modifications weren’t atomic. A partial write left the directory in an inconsistent state.
Detection:
# Run file system check in read-only mode
sudo fsck.ext4 -n /dev/sda1
# Use debugfs to examine directory structure
sudo debugfs -w /dev/sda1
debugfs: dirstat /path/to/directory
Mitigation:
- Use journaling file systems (ext4, XFS)
- Convert ext3 to ext4:
tune2fs -O extents,uninit_bg /dev/sda1 - Run periodic consistency checks
- Monitor kernel logs for file system errors
Scenario 3: Case-Insensitive Lookup Failure
What happened: A developer created a file named Config.php and pushed to a case-insensitive Windows file share mounted on Linux. Another developer on Linux couldn’t find config.php. The share appeared to lose the file.
Why it happened: Different file systems have different case sensitivity rules:
| File System | Case Sensitive? |
|---|---|
| ext2/3/4 | Yes |
| XFS | Yes |
| NTFS | No |
| FAT32 | No |
| CIFS/SMB default | Often no |
Detection:
# Check mount options for case sensitivity
mount | grep cifs
# Test case sensitivity
touch TESTFILE
ls testfile 2>/dev/null && echo "Case insensitive" || echo "Case sensitive"
Mitigation:
-
Use consistent naming conventions (all lowercase)
-
Specify
case sensitivitymount option for CIFS:mount -o case insensitive //server/share /mnt # Or mount -o case sensitive //server/share /mnt -
Document case sensitivity rules for cross-platform projects
Trade-off Table
| Method | Lookup Time | Insert Time | Memory | Use Case |
|---|---|---|---|---|
| Linear List | O(n) | O(1) at end | Minimal | Small directories (<1000 files) |
| Hash Table | O(1) avg | O(1) avg | Medium | Large directories, random access |
| B-Tree | O(log n) | O(log n) | Higher | Very large, sorted access needed |
| ext4 HTree | O(log n) | O(log n) | Medium | Modern Linux default |
Implementation Snippet
Directory Entry Parser
#!/usr/bin/env python3
"""Parse Linux ext4 directory entries."""
import struct
def parse_ext4_directory(data, block_size=4096):
"""Parse directory entries from an ext4 directory block."""
entries = []
offset = 0
while offset < block_size:
# Directory entry header (8 bytes)
header = struct.unpack('<IHHHHH', data[offset:offset+struct.calcsize('<IHHHHH')])
inode = header[0]
rec_len = header[1]
name_len = header[3]
if inode == 0:
# Entry not in use
offset += rec_len
continue
# Extract filename
name_start = offset + struct.calcsize('<IHHHHH')
name = data[name_start:name_start + name_len].decode('utf-8', errors='replace')
# File type (stored in record length high bits in ext4)
file_type = (rec_len >> 12) if rec_len > 255 else 1
entries.append({
'inode': inode,
'name': name,
'rec_len': rec_len & 0xFFF, # Lower 12 bits
'type': ['unknown', 'regular', 'directory', 'char', 'block',
'fifo', 'sock', 'symlink'][file_type]
})
offset += rec_len
return entries
def list_directory(device, inode_num):
"""List entries in a directory given its inode."""
import subprocess
result = subprocess.run([
'debugfs', '-R', f'rdump / {inode_num}',
device
], capture_output=True, text=True)
return result.stdout
Simulating Directory Search
#!/usr/bin/env python3
"""Compare different directory search strategies."""
import time
import random
import string
class DirectoryEntry:
def __init__(self, name, inode):
self.name = name
self.inode = inode
class LinearDirectory:
"""Simple list-based directory."""
def __init__(self):
self.entries = []
def insert(self, name, inode):
self.entries.append(DirectoryEntry(name, inode))
def lookup(self, name):
for entry in self.entries:
if entry.name == name:
return entry.inode
return None
def list_all(self):
return [e.name for e in self.entries]
class HashDirectory:
"""Hash table based directory."""
def __init__(self, size=256):
self.buckets = [[] for _ in range(size)]
self.size = size
def _hash(self, name):
h = 0
for c in name:
h = (h * 31 + ord(c)) % self.size
return h
def insert(self, name, inode):
bucket = self._hash(name)
for entry in self.buckets[bucket]:
if entry.name == name:
entry.inode = inode
return
self.buckets[bucket].append(DirectoryEntry(name, inode))
def lookup(self, name):
bucket = self._hash(name)
for entry in self.buckets[bucket]:
if entry.name == name:
return entry.inode
return None
def benchmark(directory, names, iterations=1000):
"""Benchmark directory lookup performance."""
# Insert all names
for name in names:
directory.insert(name, random.randint(1, 1000000))
# Benchmark lookups
start = time.time()
for _ in range(iterations):
for name in random.sample(names, min(100, len(names))):
directory.lookup(name)
elapsed = time.time() - start
return elapsed
if __name__ == "__main__":
# Generate test names
names = [''.join(random.choices(string.ascii_lowercase, k=10))
for _ in range(10000)]
linear = LinearDirectory()
hash_dir = HashDirectory()
linear_time = benchmark(linear, names)
hash_time = benchmark(hash_dir, names)
print(f"Linear search: {linear_time:.4f}s")
print(f"Hash table: {hash_time:.4f}s")
print(f"Speedup: {linear_time/hash_time:.2f}x")
Observability Checklist
Monitoring directory performance:
- ls -la /path: Check entry count and modification
- getfattr -d file: Check extended attributes
- debugfs -R “dirstat /path” /dev/sdX: Directory statistics on ext4
- xfs_info /mount: Show XFS directory block configuration
# Monitor directory operation times
strace -c ls /path/to/directory
# Check for problematic directories
for dir in /var /home /tmp; do
count=$(find "$dir" -maxdepth 1 -type f | wc -l)
if [ $count -gt 10000 ]; then
echo "WARNING: $dir has $count files"
fi
done
# Check ext4 directory indexing
sudo tune2fs -l /dev/sda1 | grep -E "Directory"
# Should show: "Directory ACLs", "Directory Indexed"
Common Pitfalls / Anti-Patterns
Directory Permissions
Directory permissions work differently than file permissions:
- Read (r): List directory contents (use
ls) - Write (w): Create, rename, or delete files within directory
- Execute (x): Access directory contents (traverse, use
cd)
# r--: Can see file names but not metadata or content
ls /directory # Yes
ls -l /directory # No
cat /file # No
# -w-: Can create/delete files but can't read them
touch /directory/newfile # Yes
rm /directory/file # Yes
ls /directory # No
# --x: Can access known files but not list directory
cd /directory # Yes
ls /directory # No
cat /file # Yes (if you know exact path)
Sticky Bit on Shared Directories
# Prevents users from deleting others' files in shared directories
chmod +t /shared/directory
# View sticky bit
ls -la /shared
# drwxrwxrwt 2 root root 4096 May 19 10:00 /shared
# ^ sticky bit
Directory Access Control Lists (ACLs)
# Set default ACL for new files in directory
setfacl -d -m u:alice:rw /shared/directory
# Set ACL for specific user
setfacl -m u:bob:r /shared/directory
# View ACLs
getfacl /shared/directory
# Remove specific ACL entry
setfacl -x u:alice /shared/directory
Common Pitfalls / Anti-patterns
1. Millions of Files in One Directory
# BAD: Storing millions of files in single directory
# This kills performance on most file systems
find /logs -name "*.log" | head -1000000 > /tmp/all_logs
# GOOD: Use subdirectories sharded by date/hash
find /logs -type d | head
# /logs/2024/01/01
# /logs/2024/01/02
# ...
# GOOD: Use hierarchical sharding
for i in $(seq 0 99); do
mkdir -p /data/$i
done
mv file_0001.dat /data/$(($RANDOM % 100))/
2. Long File Names
# Path length limits (older systems):
# - Maximum component: 255 bytes
# - Maximum total path: 4096 bytes
# BAD: Extremely long file names
touch "this_is_a_very_very_long_file_name_that_might_cause_problems.txt"
# GOOD: Use reasonable naming conventions
touch "config_v2_final_REVISED.txt"
3. Special Characters in File Names
# BAD: Special characters that break scripts
touch "file with spaces.txt"
touch "file;with;semicolons.txt"
touch "file*with*asterisks.txt"
# GOOD: Alphanumeric and underscores
touch "config_settings.txt"
touch "user_data_backup.txt"
Quick Recap Checklist
- Directories map file names to inode numbers
- Linear search is O(n), good for small directories (<1000 entries)
- Hash tables provide O(1) average lookup but no sorted iteration
- B-trees give O(log n) and maintain sorted order
- ext4 uses HTree for large directories; convert with
e2fsck -D - Path resolution walks through each component, looking up each name
- Directory caching (dentry cache) dramatically speeds repeated lookups
- Different file systems have different case sensitivity rules
- Watch for inode exhaustion (not just space) when many small files exist
Interview Questions
Path resolution is a multi-step process:
- Parse: Split the path into components:
/,home,user,docs,file.txt - Start: Begin at the root inode (for absolute paths starting with
/) or current directory (for relative paths) - Lookup each component:
- Look up
homein the root directory → get inode for/home - Look up
userin/home→ get inode for/home/user - Continue until
file.txtis found
- Look up
- Validate permissions: At each step, verify execute permission on the directory
- Return inode: The inode number for
file.txt
The kernel uses the dentry cache to speed repeated lookups. Each path component lookup first checks the cache before doing I/O.
ext3 uses linear directory search: To find any file, ext3 scans entries one by one from the beginning. With a million files, the average lookup requires scanning 500,000 entries. This is O(n) complexity.
ext4 uses HTree indexing: When a directory grows beyond ~100 entries, ext4 automatically converts it to use a two-level hash tree. Lookups become O(log n) — finding a file in a million-entry directory takes ~20 comparisons instead of 500,000.
Additional factors:
- ext4 uses larger directory block sizes (4KB vs 1KB)
- ext4 has delayed allocation (better block placement)
- ext4 journal ensures directory modifications are atomic
You can manually convert ext3 directories to indexed format with e2fsck -D.
The dentry cache (Directory Entry cache) caches recently accessed directory entries — the mappings between file names and inode numbers. It's a tree-structured cache that lives in kernel memory.
Importance:
- Avoids repeated disk I/O for frequently accessed paths
- Enables fast path resolution (check cache → hit vs miss)
- Stores validity and aging information
- Helps maintain consistency across mount points
When you access /home/user/docs/file.txt:
- Check cache for
/→ get root inode - Check cache for
/home→ get inode - Check cache for
/home/user→ get inode - And so on...
Without the dentry cache, every path component lookup would require disk I/O.
Directory permissions control what you can do with the directory itself:
- Read (r): List the contents — you can see what files exist (via
ls) - Write (w): Create, rename, or delete files within the directory (even files you don't own, if directory is writable)
- Execute (x): Enter the directory and access files within it — you need this to
cdinto the directory or access any file by path
File permissions control what you can do with the file itself:
- Read (r): View file contents
- Write (w): Modify file contents
- Execute (x): Run the file as a program
Key insight: you need execute permission on a directory to access any file within it, regardless of the file's own permissions. To read /home/user/secret.txt, you need x on /, /home, and /home/user, plus r on secret.txt.
On Unix/Linux systems, deleting an open file works due to how the kernel handles file references:
- Directory entry removed: The file name is removed from its parent directory immediately
- Inode reference held: The file's inode remains allocated as long as any process has it open
- Data blocks intact: The actual file data stays on disk until the last reference (file descriptor or hard link) is released
- Process continues: The process with the open file descriptor can still read/write normally
- Cleanup on close: When the last process closes the file, the kernel frees the inode and data blocks
This behavior is used by package managers and software updates: they create a new version of a file, atomically replace the old one, and the old version persists until processes using it exit.
You can verify this with: lsof +L1 — shows files that are deleted but still open.
B-trees store keys in sorted order at each level. For directory lookups, this means:
- Range queries are efficient: Listing all files starting with "a" can stop early
- Prefix matching: Finding all files in a range is a tree traversal, not a full scan
- Insert/delete balanced: Tree stays balanced, guaranteeing O(log n) even after many modifications
ext4's HTree is a hash-indexed B-tree: the hash of the filename determines which branch to follow. This maintains O(log n) lookup while distributing entries evenly across branches (avoiding the "hot spot" problem of sorted B-trees with sequential inserts).
The dentry cache (dcache) stores directory entry structures—mappings from filename component to inode. Each dentry also holds references to its parent dentry and children, forming the directory tree in memory.
The inode cache stores the actual inode structures loaded from disk. When VFS needs to read a file's metadata or data blocks, it uses the inode cache to avoid disk I/O.
The relationship: dentries point to inodes. When you resolve /home/user/file.txt, the dentry for "user" points to the inode for /home/user. Both caches are separately managed but work together—dentries provide fast path resolution, inodes provide fast metadata access.
On ext3 (linear search), listing a directory with millions of files requires reading every entry sequentially. The O(n) complexity means average lookup is n/2 comparisons—catastrophic for large directories.
On ext4 with HTree, `ls` should be O(log n) for each entry lookup during readdir(), but there are still bottlenecks:
- readdir() must return entries in sorted order—HTree lookups still traverse the tree
- Each directory entry requires a dentry allocation in kernel memory
- Stale dentry invalidation if files are being created/deleted concurrently
- Directory block reads are sequential but still I/O-bound
The real cost is often not the lookup but the memory allocation and userspace processing for millions of entries.
When accessing /home/user/documents/report.txt:
- You need execute (x) permission on `/` to reach any subdirectory
- You need execute (x) on `/home` to enter it
- You need execute (x) on `/home/user` to enter it
- You need execute (x) on `/home/user/documents` to enter it
- Finally, you need the appropriate permission on `report.txt` itself (r for read, w for write, etc.)
The directory permissions don't control access to files inside—they control access to the directory itself. Even if you have full permission on a file, you can't access it if you lack execute permission on any parent directory.
A negative dentry is a cached lookup result indicating "this name does not exist." When you look up /nonexistent once, the kernel caches the negative result, preventing repeated lookup attempts.
Benefits:
- Avoids repeated failed lookups for missing files
- Particularly important for shared libraries (/lib/libfoo.so not found)—avoids filesystem hits on every program load
- TTL-based invalidation prevents stale negatives
Negative dentries can cause issues when a file is created after a negative cache: the cache must be invalidated. This happens through directory notifications (inotify) or timeout-based expiry.
Hard links: Multiple directory entries (directory entries are names in a directory) point to the same inode. The inode maintains a link count — how many directory entries reference it. Deleting a hard link decrements the count; when it reaches zero, the inode is freed. Hard links cannot span file systems (same inode numbers are local to each filesystem), and cannot reference directories (would create cycles).
Symbolic links: A special file type containing a path string. The path can point anywhere (even to nonexistent targets). Resolving a symbolic link requires following the path — if the target is removed, the symlink becomes "dangling." Symlinks can span file systems and reference directories. Directory entries for symlinks store the link's target path, not an inode.
At implementation: hard links share inode references directly; symlinks store path strings that must be recursively resolved.
VFS (Virtual File System) provides a unified interface that all file systems implement. For directory operations:
struct inode_operations: Contains function pointers forlookup,create,mkdir,renamestruct dentry_operations: Containsd_hash,d_compare,d_deletestruct file_operations: Containsreaddirfunction for iterating directory entries
When you call readdir(), VFS routes to the concrete filesystem's iterate() or readdir() implementation. ext4 uses HTree for indexed lookup; NFS may send lookups to remote server; tmpfs stores entries in RAM. VFS handles caching, locking, and pathname translation — filesystem code only deals with its own structures.
Page cache stores disk blocks (file data and metadata) in memory. Dentry cache stores parsed pathname components. They interact:
- When resolving
/home/user/file.txt, dentry cache holds "/" dentry, "home" dentry, "user" dentry - Each dentry points to an inode — inodes are cached in the inode cache
- When a file is opened, its inode's pages (block mapping + data blocks) are in page cache
- Modifying a file: dentry → inode → page cache → mark page dirty → writeback
The relationship: dentries are parsed from directory file data (which lives in page cache). When directory contents change, d_invalidate() marks relevant dentries invalid — they will be re-parsed on next lookup, pulling fresh data from page cache.
Journaling records directory operations in a journal (separate log area) before applying them to the actual directory:
- Journal transaction: Record intent (transaction begin, operation details)
- Commit: Write operation to journal on disk (durable)
- Checkpoint: Apply operation to actual directory structure
- Journal cleanup: Remove committed transactions from journal
On crash recovery: filesystem replays journal entries, completing any committed transactions. Directory structure is never left in partial-write state. ext4 uses journaling for metadata (configurable: ordered mode journals only metadata, writeback allows data journal). XFS journals both data and metadata in different modes.
Local rename (same filesystem, different directory): It's a meta-operation — the inode doesn't change, only the parent directory entry. The kernel:
- Lock both source and destination parent directories (inode mutex)
- Verify destination doesn't already exist (if it does, fail with EEXIST)
- Remove source filename entry, decrement link count of source inode
- Create destination filename entry pointing to same inode, increment link count
- Unlock directories
This is atomic in the sense that no intermediate state is visible: either the old name or the new name exists, never both simultaneously. The rename is cheap because no data moves — only directory entries are updated.
inotify (inode notification) is a Linux kernel subsystem for filesystem event monitoring. When you inotify_init(), you get a file descriptor; inotify_add_watch() registers interest in specific files/directories.
Events are stored in an inotify_event structure and can be read via read(). For directories, events include: IN_CREATE, IN_DELETE, IN_MOVED_FROM, IN_MOVED_TO, IN_MODIFY. The kernel hooks into VFS: when VFS modifies a directory entry, it calls inotify_d_instantiate() which queues events.
Performance: inotify watches consume kernel memory. Large number of watches or rapid events can overwhelm buffers — events may be coalesced or lost (read size determines how many events you retrieve per read).
File descriptor limits (ulimit -n, /proc/sys/fs/file-max) control how many open file descriptors a process can have. Each open file, directory traversal, and memory-mapped file consumes a file descriptor.
During directory traversal (e.g., readdir() loop), each directory entry may trigger:
stat()calls for metadata — these use fd indirectly via dentry cache- Opening files — each open consumes a file descriptor
- Processing large directories without closing descriptors — causes leaks
When FD limit is reached: EMFILE (Too many open files) — new file opens fail, but existing operations may continue if they have open descriptors. Always close directory handles with closedir() after use.
fsck recovery process:
- Phase 1: Check inodes — verify inode structure, link counts, block allocation bitmap consistency
- Phase 2: Check directory structure — verify directory entries point to valid inodes, parent/child relationships
- Phase 3: Reconnect orphaned inodes — inodes with zero directory links are reattached to lost+found
- Phase 4: Check reference counts — verify link counts match actual directory references
- Phase 5: Check cylinder groups — verify superblock and group descriptors
Limitations: fsck cannot recover data content (only structure), cannot repair hardware errors, cannot restore deleted files from backups. Orphaned files in lost+found have numeric names — content must be manually identified. Running fsck on mounted filesystem is dangerous — use recovery mode.
Case insensitivity implementation: When creating or looking up a filename, the filesystem normalizes the name (converts to uppercase in NTFS, or stores comparison flags). Lookup uses case-folded comparison instead of byte comparison.
NTFS specifics: Stores names in Unicode (UTF-16), uses RtlUpcaseUnicodeString() for comparisons. The volume mount option case sensitive can enable case-preserving-but-sensitive behavior for WSL/developers.
FAT32 specifics: 8.3 filename format (max 8 chars + 3 extension), all uppercase internally. Long filename (VFAT) extension stores names as UTF-16 but normalizes for lookup.
Cross-platform issues: A file created as "README.TXT" on Windows may appear as "readme.txt" on Linux — the filesystem stores the original case but lookup is case-insensitive. Applications expecting case-sensitive behavior may fail.
Directory quotas limit the number of entries or total size within a directory tree. Implementation options:
- Project quotas (XFS): Hierarchical quota tracking, each directory tree has a project ID with enforced limits
- Directory-based limits: Some filesystems (ReiserFS, ext4 with patches) support per-directory entry limits
- External quota systems: Quota daemon (quotacheck, quotaon) scans filesystem and enforces limits via inode count
Interaction with entry allocation:
- User attempts to create file in directory with quota
- Kernel checks current entry count vs soft/hard limits
- If hard limit reached, return
EDQUOT(disc quota exceeded) - Quota is checked before actual allocation — prevents partial operations
Quota enforcement for directories is more complex than for files because directories can contain many files — the quota applies to the tree, not individual entries.
Further Reading
Topic-Specific Deep Dives:
-
HTree Implementation: The ext4 HTree is a specialized B-tree variant optimized for directory lookups. Studying the index node format, hash function used for bucket selection, and how leaf nodes compress filename entries reveals why it handles millions of entries efficiently.
-
Distributed Directory Systems: Ceph’s CRUSH algorithm and GlusterFS’s elastic hash algorithm distribute directory metadata across cluster nodes. Understanding how these scale beyond single-machine limits helps design large-scale storage systems.
-
Unicode Normalization: Case-insensitive file systems face Unicode challenges. How does NTFS handle case insensitivity across different locales? Study how normalization forms (NFC, NFD) affect file system behavior.
-
Directory Entry Caching: The dentry cache isn’t just a performance optimization—it maintains consistency. Explore how
d_lookup(),d_invalidate(), and namespace semaphores ensure cache coherence across threads. -
FUSE Directory Implementations: Implementing a FUSE directory index reveals the full complexity. Handling
readdir()with offset, implementingmkdir()with proper error codes, and ensuring atomic rename operations are all non-trivial.
Conclusion
Directory implementations transform flat name spaces into the hierarchical structures we navigate daily. The evolution from linear search through hash tables to B-trees reflects the industry learning to handle millions of files per directory without sacrificing performance. ext4’s HTree index makes million-file directories practical — a task that would bring ext3 to its knees.
Path resolution leverages the dentry cache to avoid repeated disk I/O for frequently accessed paths. Each lookup checks the cache before triggering disk reads, making repeated file access dramatically faster than initial access. This caching strategy applies not just to file content but to the directory structure itself.
For deeper study, explore how distributed file systems like Ceph and GlusterFS implement directory metadata at scale, how case-insensitive file systems handle Unicode normalization, and how the trade-offs between directory implementation strategies affect benchmark results. The principles learned here apply broadly to any system that needs to map keys to values at scale.
Category
Related Posts
ASLR & Stack Protection
Address Space Layout Randomization, stack canaries, and exploit mitigation techniques
Assembly Language Basics: Writing Code the CPU Understands
Learn to read and write simple programs in x86 and ARM assembly, understanding registers, instructions, and the art of thinking in low-level operations.
Boolean Logic & Gates
Understanding AND, OR, NOT gates and how they combine into arithmetic logic units — the building blocks of every processor.