Virtual File Systems (VFS)
Understanding how Linux abstracts multiple file systems through a common interface, enabling transparent access to ext4, NTFS, FAT, and network file systems.
Every time you access a file on Linux, whether it lives on an ext4 partition, a USB drive formatted with FAT32, an NFS share, or even /proc, the same interface handles it. That interface is the Virtual File System (VFS) layer, also called the Virtual Filesystem Switch. Without VFS, every application would need to know how to talk to each specific file system type. With VFS, applications speak one universal language while the kernel translates to whatever file system actually stores the data.
VFS is one of the most elegant abstractions in operating systems. It enables the illusion of a unified file tree while simultaneously supporting dozens of radically different file system implementations. Understanding VFS helps you troubleshoot mounting issues, optimize file system performance, and understand how Linux achieves its legendary flexibility.
Introduction
When to Use / When Not to Use
Understanding VFS helps with system administration and troubleshooting.
When VFS knowledge is essential:
- Mounting and configuring various file systems
- Troubleshooting “mount succeeded but files not accessible” issues
- Working with network file systems (NFS, CIFS, FUSE)
- Understanding why some operations are slower on certain file systems
- Container storage and volume mounting
When you can rely on defaults:
- Standard server configuration with single file system type
- Desktop usage with built-in file system support
- Simple container workloads with default storage
Architecture or Flow Diagram
graph TD
A[Application] --> B[POSIX System Calls]
B --> C[VFS Layer]
C --> D[ext4 Driver]
C --> E[NTFS Driver]
C --> F[FAT/VFAT Driver]
C --> G[NFS Client]
C --> H[CIFS/SMB Client]
C --> I[procfs Driver]
C --> J[tmpfs Driver]
D --> K[Block Device Layer]
E --> K
F --> K
G --> L[Network]
H --> L
K --> M[Storage Device]
L --> N[NFS Server]
L --> O[SMB Server]
style A stroke:#ff00ff,stroke-width:2px
style C stroke:#ff00ff,stroke-width:3px
The VFS layer sits between applications and the actual file system implementations. Each file system type implements the VFS interface, making them interchangeable from the application’s perspective.
Core Concepts
VFS Data Structures
The VFS layer is built on four key data structures that every file system must implement:
// Superblock - file system level metadata
struct super_block {
unsigned long s_blocksize; // Block size in bytes
struct super_operations *s_op; // Superblock operations
struct dentry *s_root; // Root directory entry
struct list_head s_inodes; // All in-memory inodes of this file system
void *s_fs_info; // File system specific info
// ... many more fields
};
// Inode - represents a file (similar to on-disk inode)
struct inode {
unsigned long i_ino; // Inode number
umode_t i_mode; // File type and permissions
struct inode_operations *i_op; // Inode operations
struct file_operations *i_fop; // File operations
struct super_block *i_sb; // Superblock reference
// ... many more fields
};
// Dentry - directory entry (name to inode mapping)
struct dentry {
struct qstr d_name; // Name component (string, length, and hash)
struct inode *d_inode; // Associated inode
struct dentry *d_parent; // Parent directory
struct list_head d_subdirs; // Child entries
// ... many more fields
};
// File - open file instance
struct file {
struct path f_path; // Path to file
struct file_operations *f_op; // File operations
loff_t f_pos; // Current position
unsigned int f_flags; // Open flags
// ... many more fields
};
The key insight: these are generic structures. Specific file systems fill them in with their own implementations of standard operations.
VFS Operations
Each data structure has an associated operations table:
// Superblock operations - file system level
struct super_operations {
struct inode *(*alloc_inode)(struct super_block *);
void (*destroy_inode)(struct inode *);
void (*dirty_inode)(struct inode *, int);
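int (*write_inode)(struct inode *, struct writeback_control *);
void (*evict_inode)(struct inode *); // Called when the inode leaves memory (older kernels used put_inode/delete_inode)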
void (*put_super)(struct super_block *);
// ... more
};
// Inode operations - file/directory specific
struct inode_operations {
int (*create)(struct inode *, struct dentry *, umode_t, bool);
struct dentry *(*lookup)(struct inode *, struct dentry *, unsigned int);
int (*link)(struct dentry *, struct inode *, struct dentry *);
int (*unlink)(struct inode *, struct dentry *);
int (*mkdir)(struct inode *, struct dentry *, umode_t);
int (*rmdir)(struct inode *, struct dentry *);
// ... more
};
// File operations - file access
struct file_operations {
loff_t (*llseek)(struct file *, loff_t, int);
ssize_t (*read)(struct file *, char __user *, size_t, loff_t *);
ssize_t (*write)(struct file *, const char __user *, size_t, loff_t *);
int (*open)(struct inode *, struct file *);
int (*release)(struct inode *, struct file *);
// ... more
};
Each file system (ext4, XFS, NTFS, etc.) implements these operations for its own data structures and semantics.
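The function-pointer pattern can be modeled outside the kernel. The following is a minimal Python sketch, not kernel code: the class names mirror the C structures above, while `disk_read`, `proc_read`, `vfs_open`, and `vfs_read` are hypothetical helpers invented for illustration. It shows a generic layer calling through an operations table without knowing which file system supplied it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FileOperations:
    """Toy stand-in for struct file_operations: a table of callables."""
    read: Callable[["OpenFile", int], bytes]


@dataclass
class Inode:
    """Toy stand-in for struct inode; the file system chooses i_fop."""
    i_ino: int
    i_fop: FileOperations
    data: bytes = b""


@dataclass
class OpenFile:
    """Toy stand-in for struct file: per-open state plus the ops table."""
    inode: Inode
    f_op: FileOperations
    f_pos: int = 0


def disk_read(f: OpenFile, size: int) -> bytes:
    """A disk-like file system returns stored bytes."""
    chunk = f.inode.data[f.f_pos:f.f_pos + size]
    f.f_pos += len(chunk)
    return chunk


def proc_read(f: OpenFile, size: int) -> bytes:
    """A proc-like file system generates its content on demand."""
    text = f"inode {f.inode.i_ino}\n".encode()
    chunk = text[f.f_pos:f.f_pos + size]
    f.f_pos += len(chunk)
    return chunk


def vfs_open(inode: Inode) -> OpenFile:
    """Opening copies the inode's ops table into the open-file object."""
    return OpenFile(inode, inode.i_fop)


def vfs_read(f: OpenFile, size: int) -> bytes:
    """Generic layer: no file-system knowledge, just call through the table."""
    return f.f_op.read(f, size)
```

Both `vfs_read` calls go through identical generic code; only the table attached to the inode differs, which is the same polymorphism the kernel gets from C function pointers.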
File System Registration
File system drivers register with VFS when the kernel boots (built-in drivers) or when their module is loaded:
// Register a file system type
register_filesystem(&ext4_fs_type);
register_filesystem(&xfs_fs_type);
register_filesystem(&vfat_fs_type);
register_filesystem(&nfs_fs_type);
// File system type structure
struct file_system_type {
const char *name; // "ext4", "xfs", "ntfs"
int fs_flags; // FS_REQUIRES_DEV, FS_BINARY_MOUNTDATA, etc.
struct dentry *(*mount)(struct file_system_type *, int,
const char *, void *);
void (*kill_sb)(struct super_block *);
struct module *owner;
// ...
};
This registration makes the file system available for mounting.
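A rough Python model of the registry, assuming a plain dictionary in place of the kernel's internal list. The names `_registered` and `do_mount` are hypothetical; the error codes follow the documented behavior that a duplicate registration yields -EBUSY and mounting an unknown type fails with ENODEV.

```python
import errno

# Hypothetical stand-in for the kernel's list of file_system_type entries.
_registered = {}


def register_filesystem(name, mount_fn):
    """Return 0 on success, or -EBUSY if the name is already taken."""
    if name in _registered:
        return -errno.EBUSY
    _registered[name] = mount_fn
    return 0


def do_mount(fstype, source, target):
    """Look up the type by name and delegate to its mount callback."""
    mount_fn = _registered.get(fstype)
    if mount_fn is None:
        raise OSError(errno.ENODEV, f"unknown filesystem type '{fstype}'")
    return mount_fn(source, target)
```

The lookup-by-name step is why a missing module shows up as an "unknown filesystem type" error at mount time rather than at boot.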
Mount Chain
When you mount a device, the chain looks like:
graph TD
A["mount -t ext4 /dev/sda1 /mnt"] --> B[VFS receives mount request]
B --> C[Find ext4 in registered file systems]
C --> D[Call ext4 mount function]
D --> E[Read superblock from device]
E --> F[Create super_block structure]
F --> G[Create root dentry and inode]
G --> H[Link /mnt to VFS mount tree]
style A stroke:#ff00ff,stroke-width:2px
style H stroke:#00fff9
The mount creates the VFS structures that represent the mounted file system in the unified namespace.
Path Resolution in VFS
When an application accesses /home/user/file.txt:
sequenceDiagram
participant App as Application
participant VFS as VFS Layer
participant Cache as Dentry Cache
participant FS as ext4 Driver
participant Disk as Disk
App->>VFS: open("/home/user/file.txt")
VFS->>Cache: lookup dentry for "/"
Cache-->>VFS: root inode
VFS->>Cache: lookup dentry for "home"
Cache-->>VFS: cached or inode
VFS->>FS: read dir, find "user"
FS->>Disk: read directory blocks
Disk-->>FS: directory entries
FS-->>VFS: inode for "user"
VFS->>Cache: cache dentry
VFS->>FS: lookup "file.txt"
FS-->>VFS: inode for file.txt
VFS-->>App: file descriptor
The dentry cache dramatically speeds repeated path lookups.
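To make the effect of the cache concrete, here is a small Python simulation. It is a sketch under simplifying assumptions: `ToyFS` and `PathWalker` are invented names, the "disk" is a dictionary, and negative dentries, mount points, and symlinks are not modeled. It counts how many slow-path lookups reach the file system on a first and a repeated path walk.

```python
class ToyFS:
    """Hypothetical backing store: directory path -> {name: inode number}."""

    def __init__(self, tree):
        self.tree = tree
        self.lookups = 0  # number of slow-path lookups that hit the "disk"

    def lookup(self, parent, name):
        self.lookups += 1
        return self.tree.get(parent, {}).get(name)


class PathWalker:
    """Resolves paths one component at a time, consulting a cache first."""

    def __init__(self, fs):
        self.fs = fs
        self.dcache = {}  # (parent path, name) -> inode number

    def resolve(self, path):
        parent, ino = "/", 2  # assume the root is always known
        for name in [part for part in path.split("/") if part]:
            key = (parent, name)
            if key not in self.dcache:
                found = self.fs.lookup(parent, name)  # cache miss
                if found is None:
                    raise FileNotFoundError(path)
                self.dcache[key] = found
            ino = self.dcache[key]
            parent = parent.rstrip("/") + "/" + name
        return ino
```

Resolving the same three-component path twice costs three file system lookups the first time and none the second, which is the behavior the sequence diagram describes.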
Core Concepts: File System Types
Disk-Based File Systems
These work with block devices:
- ext2/ext3/ext4: The Linux standard; journaling since ext3, extents since ext4
- XFS: High-performance, scalable, used in enterprise
- Btrfs: Copy-on-write, snapshots, checksums
- NTFS: Windows file system (via the FUSE-based ntfs-3g driver, or the in-kernel ntfs3 driver since Linux 5.15)
- FAT32/exFAT: Universal compatibility, no journaling
Network File Systems
These access remote servers:
- NFS (Network File System): Unix/Linux standard
- CIFS/SMB: Windows interoperability
- SSHFS: File system over SSH
- FTPFS: FTP-backed file system
# Mount NFS
sudo mount -t nfs4 server:/share /mnt/nfs
# Mount CIFS
sudo mount -t cifs //server/share /mnt/cifs -o username=user
# Mount SSHFS
sshfs user@server:/path /mnt/sshfs
Virtual/Proc File Systems
These don’t store data on disk:
# proc - process information
ls /proc
# 1/ 1234/ self/
# sys - system information
ls /sys
# block/ bus/ class/ devices/
# tmpfs - RAM-based file system
mount -t tmpfs tmpfs /tmp
# devpts - terminal devices
ls /dev/pts
Union/Mount Namespace File Systems
# overlay - union mount (container storage)
mount -t overlay overlay -o \
lowerdir=/base,upperdir=/changes,workdir=/work /merged
# bind - bind mount (reuse subtree elsewhere)
mount --bind /old/location /new/location
Production Failure Scenarios
Scenario 1: File System Not Registered
What happened: An administrator tried to mount an ext4 partition but got “unknown file system type ‘ext4’.” The system had kernel support for ext4 as a module, but the module wasn’t loaded.
Detection:
# Check loaded file system modules
lsmod | grep -E "ext4|xfs|btrfs"
# Check available file systems
cat /proc/filesystems
# Try loading the module
sudo modprobe ext4
Mitigation:
- Ensure file system modules are built into kernel or loaded
- For embedded systems, include necessary FS support in kernel config
- Use modprobe for a one-off load, or add the module to /etc/modules (or a file under /etc/modules-load.d/ on systemd-based systems) for persistent loading
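The /proc/filesystems check is easy to script. The sketch below assumes the usual two-column layout, where an optional "nodev" marker precedes the file system name; the function names are hypothetical.

```python
def parse_proc_filesystems(text):
    """Map file system name -> True if it needs a block device.

    Each line is either "nodev <name>" or just "<name>" (with leading
    whitespace) for file systems that require a device.
    """
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        is_nodev = len(parts) == 2 and parts[0] == "nodev"
        result[parts[-1]] = not is_nodev
    return result


def is_registered(fstype, path="/proc/filesystems"):
    """Check whether the running kernel currently knows this type."""
    with open(path) as handle:
        return fstype in parse_proc_filesystems(handle.read())
```

A type that is absent here may still be available as an unloaded module, so a negative result is a reason to try modprobe, not proof that support is missing.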
Scenario 2: VFS Cache Pressure Causing Memory Issues
What happened: A system with 64GB RAM showed 58GB used by page cache, leaving little for applications. The system started swapping even though most of that memory was reclaimable cache.
Detection:
# Check memory usage breakdown
free -h
# Check page cache, writeback, and reclaimable slab (dentry/inode caches)
grep -E "^(Cached|Dirty|Writeback|SReclaimable):" /proc/meminfo
# Drop clean caches to confirm the memory is reclaimable (run as root)
sync
echo 3 > /proc/sys/vm/drop_caches
free -h
Mitigation:
- Adjust vm.vfs_cache_pressure, which controls how eagerly the kernel reclaims dentry and inode caches relative to page cache:
# Default is 100; lower values keep more dentry/inode cache
sysctl -w vm.vfs_cache_pressure=50
# Or make it persistent in /etc/sysctl.conf
vm.vfs_cache_pressure = 50
- Use drop_caches for immediate relief during maintenance
- Monitor and alert on cache vs application memory balance
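For the monitoring item, a small parser over /proc/meminfo is often enough. This is a sketch with hypothetical function names; it reports page cache, reclaimable slab (where dentry and inode caches live), and MemAvailable as percentages of total memory. MemAvailable is usually the more useful alerting signal, because cache that can be reclaimed does not count against it.

```python
def parse_meminfo(text):
    """Return {field: value} for lines such as 'Cached:   123456 kB'."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def cache_summary(info):
    """Express cache-related fields as a percentage of MemTotal."""
    total = info["MemTotal"]

    def pct(field):
        return round(100 * info.get(field, 0) / total, 1)

    return {
        "page_cache_pct": pct("Cached"),
        "reclaimable_slab_pct": pct("SReclaimable"),
        "available_pct": pct("MemAvailable"),
    }
```

On a live system it can be fed with `parse_meminfo(open("/proc/meminfo").read())`.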
Scenario 3: Overlay Mount Inconsistency
What happened: A container runtime used overlay file system. Applications inside containers saw stale files, files that existed in the lower layer weren’t visible, and some files showed old content despite being updated in the base image.
Why it happened: Overlay file systems have specific requirements for showing/hiding files. Incorrect lowerdir/upperdir configuration or copying files instead of using the union semantics caused visibility issues.
Detection:
# Check overlay mount options
mount | grep overlay
# View overlay layers
cat /proc/mounts | grep overlay
# Check which layers files come from
ls -la /merged/file # upper has whiteout?
Mitigation:
- Ensure proper overlay mount options:
mount -t overlay overlay \
-o lowerdir=/lower1:/lower2,upperdir=/upper,workdir=/work \
/merged
- Understand whiteout files (markers in the upper layer that hide files deleted from a lower layer)
- Use chattr -i for immutable flag handling in overlay
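The visibility rules are easier to reason about with a toy model. In this Python sketch each layer is a dictionary and `WHITEOUT` is a hypothetical sentinel standing in for a whiteout entry; real overlayfs also handles opaque directories and directory merging, which are left out here.

```python
WHITEOUT = object()  # hypothetical marker standing in for a whiteout entry


def overlay_lookup(path, upper, lowers):
    """Return the content visible at `path` in the merged view.

    `lowers` is ordered as in lowerdir=: the leftmost layer is the topmost
    lower layer. The upper layer is always consulted first.
    """
    for layer in [upper, *lowers]:
        if path in layer:
            entry = layer[path]
            if entry is WHITEOUT:
                # A whiteout hides the name in every layer beneath it.
                raise FileNotFoundError(path)
            return entry
    raise FileNotFoundError(path)
```

The first layer that mentions a name decides the outcome, which is why a stale copy in the upper layer or a misordered lowerdir list produces "old content" symptoms.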
Scenario 4: Lost Connection to Network File System
What happened: An NFS server became unreachable. Client systems with NFS mounts hung—any command accessing /mnt/nfs would block indefinitely. The mount point couldn’t be unmounted.
Detection:
# Check NFS mount status
mount | grep nfs
cat /proc/mounts | grep nfs
# Check NFS daemon status (run this on the server, not the client)
systemctl status nfs-server
# Monitor for network issues
netstat -an | grep 2049
Mitigation:
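- Choose between hard and soft mount options:
# Hard mount (the default): retry indefinitely (can hang)
mount -t nfs server:/share /mnt -o hard
# Soft mount: time out and return an error (timeo is in tenths of a second; a timed-out write can corrupt data)
mount -t nfs server:/share /mnt -o soft,timeo=50
- The intr option (allow signals to interrupt) only has an effect on kernels older than 2.6.25; newer kernels accept it but ignore it, and SIGKILL can always interrupt a pending NFS operation:
mount -t nfs server:/share /mnt -o hard,intr
- Use autofs for on-demand mounting
- Set up monitoring for NFS connectivity
- Unmount with the lazy option when hung:
sudo umount -l /mnt/nfs # lazy unmount
sudo umount -f /mnt/nfs # forced unmount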
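For the monitoring item, a probe has to avoid hanging itself. One possible approach, sketched here with a hypothetical `mount_responds` helper, is to run the stat call in a background thread and give up waiting after a timeout. The caveat is that on a truly hung hard mount the worker thread stays blocked in the kernel, so this is best run from a short-lived monitoring process.

```python
import os
import threading


def mount_responds(path, timeout=2.0):
    """Return True if os.stat(path) succeeds within `timeout` seconds.

    Returns False when the call fails or has not finished in time.
    """
    result = {}

    def worker():
        try:
            os.stat(path)
            result["ok"] = True
        except OSError:
            result["ok"] = False

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)  # stop waiting, even if the worker is still blocked
    return result.get("ok", False)
```

A caller might loop over the NFS entries in /proc/mounts and raise an alert for every mount point where this returns False.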
Trade-off Table
| File System | VFS Support | Performance | Features | Complexity |
|---|---|---|---|---|
| ext4 | Native | Good | Journal, extents | Low |
| XFS | Native | Excellent | Journal, quota | Medium |
| Btrfs | Native | Good | COW, snapshots | High |
| NTFS | Via ntfs-3g | Moderate | Windows compat | Medium |
| NFSv4 | Native | Network limited | Stateful | Medium |
| CIFS/SMB | Native | Network limited | Windows compat | Low |
| tmpfs | Native | Excellent (RAM) | Dynamic sizing | Low |
| overlay | Native | Good | Union mount | Medium |
Implementation Snippet
Implementing a Simple FUSE File System
#!/usr/bin/env python3
"""Simple in-memory FUSE file system using Python (fusepy)."""
import argparse
import errno
import os
import time
from fuse import FUSE, FuseOSError, Operations
class SimpleFS(Operations):
    """A simple in-memory file system demonstrating VFS concepts."""
    def __init__(self):
        # In-memory storage: path -> {'type', 'content', 'time'}
        self.files = {'/': self._entry('directory')}
    @staticmethod
    def _entry(kind):
        """Create a new file or directory record."""
        return {'type': kind, 'content': b'', 'time': time.time()}
    def _stat(self, path):
        """Generate stat information for an existing path."""
        node = self.files[path]
        is_dir = node['type'] == 'directory'
        return {
            'st_mode': 0o40755 if is_dir else 0o100644,
            'st_nlink': 2 if is_dir else 1,
            'st_size': len(node['content']),
            'st_ctime': node['time'],
            'st_mtime': node['time'],
            'st_atime': node['time'],
        }
    def getattr(self, path, fh=None):
        if path not in self.files:
            raise FuseOSError(errno.ENOENT)
        return self._stat(path)
    def readdir(self, path, fh):
        entries = ['.', '..']
        for name in self.files:
            # dirname('/a') is '/', dirname('/d/a') is '/d'
            if name != '/' and os.path.dirname(name) == path:
                entries.append(os.path.basename(name))
        return entries
    def read(self, path, size, offset, fh):
        if path not in self.files:
            raise FuseOSError(errno.ENOENT)
        return self.files[path]['content'][offset:offset + size]
    def write(self, path, data, offset, fh):
        # Create the file on first write if it does not exist yet
        node = self.files.setdefault(path, self._entry('file'))
        current = node['content'].ljust(offset, b'\0')  # zero-fill any gap
        # Keep the bytes after the written range instead of truncating them
        node['content'] = current[:offset] + data + current[offset + len(data):]
        node['time'] = time.time()
        return len(data)
    def truncate(self, path, length, fh=None):
        # Needed for opens with O_TRUNC (for example, shell redirection)
        node = self.files[path]
        node['content'] = node['content'][:length].ljust(length, b'\0')
    def create(self, path, mode, fi=None):
        self.files[path] = self._entry('file')
        return 0
    def mkdir(self, path, mode):
        self.files[path] = self._entry('directory')
    def unlink(self, path):
        self.files.pop(path, None)
    def rmdir(self, path):
        if path in self.files and self.files[path]['type'] == 'directory':
            del self.files[path]
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('mount_point', help='Where to mount')
    parser.add_argument('-f', '--foreground', action='store_true')
    args = parser.parse_args()
    FUSE(SimpleFS(), args.mount_point, foreground=args.foreground)
Checking VFS Statistics
#!/bin/bash
# vfs_stats.sh - Display VFS statistics
echo "=== VFS Statistics ==="
echo ""
echo "--- File Systems Registered ---"
cat /proc/filesystems
echo ""
echo "--- Mount Points ---"
mount | column -t
echo ""
echo "--- Dentry Cache Stats ---"
cat /proc/sys/fs/dentry-state
echo ""
echo "--- Inode Stats ---"
cat /proc/sys/fs/inode-state
echo ""
echo "--- File Handle Limits ---"
echo "System max: $(cat /proc/sys/fs/file-max)"
echo "Current used: $(awk '{print $1}' /proc/sys/fs/file-nr)"
echo "Per-process limit: $(ulimit -n)"
echo ""
echo "--- VFS Cache Pressure ---"
cat /proc/sys/vm/vfs_cache_pressure
echo ""
echo "--- Dentry and Inode Slab Caches ---"
# /proc/slabinfo is usually readable only by root
grep -E "^(dentry|inode_cache) " /proc/slabinfo 2>/dev/null || echo "Info not available"
Observability Checklist
Monitoring VFS and file system health:
- mount: Show all mounted file systems with options
- cat /proc/filesystems: List supported file system types
- cat /proc/mounts: Detailed mount information including bind mounts
- df -h: Show space usage for all mounted file systems
- du -sh /path: Check space usage of specific directories
#!/bin/bash
# Comprehensive VFS monitoring script
echo "=== VFS Health Report ==="
echo "Generated: $(date)"
echo ""
echo "--- Active Mounts with FS Type ---"
mount | grep -v "tmpfs\|proc\|sys\|devpts\|cgroup" | awk '{print $3, $5}' | sort
echo ""
echo "--- Mount Options Security Check ---"
for mount_point in $(mount | awk '{print $3}'); do
# Skip virtual mounts
[[ "$mount_point" =~ ^(proc|sys|dev|/sys|/proc|/dev) ]] && continue
opts=$(mount | grep " $mount_point " | awk '{print $6}' | tr -d '()')
if [[ "$opts" == *"noexec"* ]]; then
echo "$mount_point: noexec set (good for security)"
fi
if [[ "$opts" == *"nosuid"* ]]; then
echo "$mount_point: nosuid set (good for security)"
fi
done
echo ""
echo "--- NFS Mounts Status ---"
mount | grep -E "nfs|cifs" | while read line; do
echo "$line"
# Check for hung mounts
mount_point=$(echo "$line" | awk '{print $3}')
timeout 1 ls "$mount_point" >/dev/null 2>&1
if [ $? -ne 0 ]; then
echo " WARNING: $mount_point not responding!"
fi
done
echo ""
echo "--- File Descriptor Usage ---"
current=$(cat /proc/sys/fs/file-nr | awk '{print $1}')
max=$(cat /proc/sys/fs/file-max)
pct=$((current * 100 / max))
echo "Used: $current / $max ($pct%)"
if [ $pct -gt 80 ]; then
echo "WARNING: File descriptor usage above 80%"
fi
Security Notes
Secure Mount Options
# security mount options for various scenarios
# /var (data partition)
# - noexec: prevents binary execution from this partition
# - nosuid: ignores setuid bit
# - nodev: no device files
UUID=xxx /var ext4 defaults,noexec,nosuid,nodev 0 2
# /tmp (temporary files)
# Consider tmpfs with size limit
tmpfs /tmp tmpfs defaults,noexec,nosuid,nodev,size=2G 0 0
# Network mounts
# Prevent execution of remote binaries
mount -t nfs server:/share /mnt -o noexec,nosuid,hard
File System Hardening
# Disable access-time updates (reduces writes; useful on SSDs)
mount -o noatime /dev/sda1 /mnt
# Read-only mounting
mount -o remount,ro /mnt
# Prevent setuid execution (on an already mounted file system)
mount -o remount,nosuid /mnt
# No device files
mount -o remount,nodev /mnt
# No binary execution
mount -o remount,noexec /mnt
Common Pitfalls / Anti-patterns
1. Ignoring Bind Mount Flags
# BAD: Bind mount without considering security
mount --bind /home /mnt/shared
# GOOD: Use appropriate flags
mount --bind /home /mnt/shared
mount -o remount,bind,nosuid,nodev,ro /mnt/shared
2. Network File System Without Timeout
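# BAD: Hard mount with no plan for server outages (processes block until the server returns)
mount -t nfs server:/share /mnt -o hard
# BETTER for read-only or non-critical data: soft mount with timeout
# (a timed-out write on a soft mount can corrupt data; intr is ignored on modern kernels)
mount -t nfs server:/share /mnt -o soft,timeo=50,retrans=3
# BEST for critical systems: autofs
# /etc/auto.master entry: mount directory, then the map file (the map file name is an example)
/mnt /etc/auto.nfs --timeout=60
# /etc/auto.nfs entry: key, options, location (mounted at /mnt/nfs on first access)
nfs -fstype=nfs4,ro server:/share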
3. Union Mount Misconfiguration
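# BAD: Lower layers listed in the wrong order
mount -t overlay overlay \
-o lowerdir=/base:/lower,upperdir=/upper,workdir=/work /merged
# The leftmost lowerdir is the topmost lower layer, so /base shadows /lower here.
# upperdir always sits on top; the order of options inside -o does not matter.
# GOOD: Topmost lower layer first
mount -t overlay overlay \
-o lowerdir=/lower:/base,upperdir=/upper,workdir=/work /merged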
4. Assuming VFS Caches Are Always Safe
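# BAD: Unplugging or detaching a device without unmounting it
# (dirty data still in the page cache is lost)
# GOOD: Unmount before removal; umount flushes dirty data, and a prior sync shortens the wait
sync
umount /mnt
# Lazy unmount detaches a busy mount immediately, but writes may still be pending
# until the last open file is closed, so the device is not yet safe to remove
umount -l /mnt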
Quick Recap Checklist
- VFS provides the common interface all Linux file systems implement
- Key structures: super_block, inode, dentry, file
- Each file system implements operations through function pointers
- Dentry cache dramatically speeds repeated path lookups
- Network file systems add latency but enable sharing
- Virtual file systems (proc, sys, tmpfs) provide kernel interfaces
- Mount options control security and performance
- VFS is why you can cat /proc/cpuinfo and read files from a mount -t nfs server:/share mount through the same API
Interview Questions
The Virtual File System (VFS) is an abstraction layer in the Linux kernel that provides a unified interface to different file system implementations. Before VFS, applications would need to know how to communicate with each specific file system type.
VFS was created to solve the problem of file system heterogeneity. When you have ext4, XFS, Btrfs, NTFS, NFS, CIFS, and dozens of other file systems, applications shouldn't need separate code paths for each one.
The key insight: all file systems present the same API through VFS. Applications call open(), read(), write(), and close(). VFS translates these to whatever the underlying file system understands. The application has no idea—and doesn't care—what's underneath.
The dentry cache (Directory Entry cache) and inode cache work together to speed file system operations:
Dentry cache stores the mapping between directory entry names and inode numbers. When you access /home/user/file.txt, the dentry cache remembers:
- "/" maps to the root inode
- "home" maps to the inode for /home
- "user" maps to the inode for /home/user
Inode cache stores the actual inode structures (metadata about files) including permissions, timestamps, and pointers to data blocks.
The relationship: dentries point to inodes. When you resolve a path, you use the dentry cache to quickly find each component, which gives you the inode number, which the inode cache can then provide the full inode structure.
Without these caches, every file access would require disk I/O to read directory entries and inodes.
When you execute mount -t ext4 /dev/sda1 /mnt, the process involves:
- Parse mount options: VFS extracts file system type (ext4) and target (/mnt)
- Locate file system driver: Looks up "ext4" in registered file systems
- Call mount function: Invokes ext4's mount() function
- Read superblock: ext4 driver reads the file system's superblock from the device
- Create VFS structures: Allocates super_block, inode for root directory
- Link to mount tree: Adds the mount to the VFS mount namespace
- Return success: Now /mnt represents the root of the ext4 file system
After mounting, any file operation in /mnt goes through the ext4 driver's VFS operations to the underlying blocks on /dev/sda1.
The kernel supports multiple file system types through registration and operation vectors:
Each file system driver registers with VFS using register_filesystem(), providing:
- Its name (e.g., "ext4", "xfs", "nfs")
- Its mount function
- Its operation vectors (super_operations, inode_operations, file_operations)
When a mount is requested, VFS looks up the file system by name and calls the registered mount function. Each file system implements the same interface but with its own logic.
At runtime, you can have ext4 on /, XFS on /home, tmpfs on /tmp, and NFS on /mnt/nfs simultaneously. Applications see all as part of the unified namespace, but VFS routes each operation to the appropriate driver.
Symbolic link is a file type (stored in directory entries, has its own inode, contains a path string). When you access a symlink, VFS performs path resolution on the target path, which may cross mount points.
Bind mount is a VFS concept where the same directory entry (same dentry/inode) appears in multiple places in the mount tree. The underlying data is identical—they share the same VFS structures.
Key differences:
- Symlinks can cross mount boundaries; bind mounts stay within the same file system view
- Bind mounts show the actual data, not a path that could be modified
- Deleting through a bind mount affects the original (they're the same inode)
- Symlinks have their own inode; bind mounts share the same inode
In container contexts, bind mounts are used to expose host directories into containers. The container sees the same data as the host because it's the same VFS entry, just accessed from a different mount point.
When you execute mount -t ext4 /dev/sda1 /mnt:
- sys_mount() system call: Triggers VFS mount logic
- Parse mount options: VFS extracts file system type and flags
- Locate file system driver: Looks up "ext4" in the registered file system list
- Call mount function: Invokes ext4's mount(), which reads the superblock
- Create super_block: Allocates kernel structure, reads superblock from disk
- Create root dentry and inode: Represents / of the new file system
- Link to mount tree: Adds the mount to the calling process's mount namespace
- Return: Now /mnt paths route through ext4 driver
Mount namespaces are what make container isolation possible. Every process belongs to one; processes share their parent's namespace unless a new one is created, so processes in different namespaces may see different mounts.
struct super_operations is a function pointer table that defines callbacks for file system-level operations. Each file system implements these to provide its specific behavior:
- alloc_inode / destroy_inode: Create/free inode structures
- dirty_inode: Called when inode is modified
- write_inode: Persist inode to disk
- put_super: Clean up during unmount
- remount_fs: Handle mount option changes
This is the VFS polymorphism pattern: VFS calls these functions without knowing if it's ext4, XFS, or NTFS. Each driver fills in its own implementations, and VFS calls through the function pointers.
VFS abstraction leaks when the unified interface doesn't fully mask differences:
- Extended attributes: ext4 supports ACLs via xattrs; FAT32 doesn't. Copying files between them loses permissions.
- Case sensitivity: ext4 is case-sensitive, NTFS/FAT are case-insensitive. A file created on Linux may be invisible on Windows mounts.
- Symbolic links on FAT: FAT doesn't support symlinks. Creating one on a CIFS mount backed by FAT might create a shortcut (.lnk) file instead—or fail silently.
- Special files: /proc and /sys aren't real directories. Tools like find behave differently on them.
Understanding these leaks helps diagnose cross-filesystem issues like "my permissions don't work on NAS."
Containers use mount namespaces (CLONE_NEWNS) to create isolated mount views:
- Clone with new namespace: clone(CLONE_NEWNS) creates a process with a copy of the parent's mount namespace
- Private mount: The container's root is initially a copy of the host's
- Bind mounts: mount --bind /host/path /container/path exposes host directories at container paths
- Overlay mount: Upperdir/lowerdir layers implement copy-on-write for container changes
- Pivot_root or chroot: Changes the container's view of "/" to the container's rootfs
The container sees only its mounts—a process in the container cannot see or affect host mounts (unless explicitly shared). This isolation is entirely a VFS concept.
Page cache: Stores actual file data content. When you read() a file, data goes into the page cache. When you write(), data is written to page cache first and flushed to disk later.
Dentry cache: Stores directory entry metadata—filename to inode mappings. When you resolve a path, you traverse dentry cache (cached lookups) to find the inode number. Dentries also cache child dentries for fast subtree traversal.
Key differences:
- Page cache stores data; dentry cache stores structure
- Page cache is page-granularity (4KB typically); dentries are variable-size
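- Both live in kernel RAM and are reclaimed under memory pressure; clean page cache pages are dropped and dirty ones are written back to their file system, not to swap (tmpfs pages are the exception)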
- Dentries implement directory tree structure; page cache is linear file content
Both are critical for performance—dentries speed path resolution, page cache speeds file content access.
For NFS, each VFS operation triggers network I/O:
- open(): NFS client sends an OPEN call to the NFS server and receives a file handle (NFSv4; NFSv3 is stateless and has no OPEN or CLOSE)
- read(): Client sends READ request with file handle, offset, count; server responds with data
- write(): Client sends WRITE request with data; server acknowledges
- close(): Client sends CLOSE; server releases file state (NFSv4 only)
NFS client caches aggressively:
- Attribute cache: Stores inode metadata locally
- Data cache: Pages cached locally with weak consistency
- Dentry cache: Path component lookups cached
The trade-off: network latency (milliseconds) vs local disk (microseconds). NFS performance depends on cache hit rates.
Path resolution in VFS follows a systematic traversal:
- Starting at root: VFS starts with the root dentry (always cached)
- Component lookup: For each path component ("home", "user", "file.txt"), VFS calls lookup() on the parent directory's inode
- Dentry cache check: Before calling the file system's lookup(), VFS checks the dentry cache. If the dentry is already cached, return it immediately
- FS-specific lookup: If not cached, call the file system's inode->i_op->lookup() function which reads directory entries from disk
- Cache the result: The newly found dentry is cached for future lookups
- Repeat: Continue until the final component is resolved
Each cached dentry also caches the dentries of its children, so deep path traversal after the first access is mostly cache hits. The d_lookup() function handles the hash-table lookup in the dentry cache.
Inode (struct inode): Represents a file on disk. There is exactly one inode per file (identified by inode number). It contains metadata (permissions, timestamps, size, block pointers) and points to the file's data blocks. Inodes are persistent—they exist on disk and are loaded into memory when needed.
File struct (struct file): Represents an open file handle. It exists only in memory for as long as the file is open. It contains the current file position (f_pos), open flags (f_flags), and points to the inode. Multiple processes can have the same file open, each with their own struct file but sharing the same inode.
Key difference: One inode per file on disk; one file struct per open file handle per process. If two processes open the same file, you have 2 file structs but 1 inode. If one process opens the same file twice, you have 2 file structs but 1 inode.
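The distinction is observable from user space. The following Python sketch (the function name is hypothetical) opens one file twice and duplicates one descriptor: separate opens keep separate offsets, while a duplicated descriptor shares the offset of the open it was copied from, and all of them report the same inode number.

```python
import os
import tempfile


def demo_open_files():
    """Show per-open offsets versus the shared inode for one file."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "data.txt")
        with open(path, "wb") as handle:
            handle.write(b"abcdefgh")

        fd1 = os.open(path, os.O_RDONLY)
        fd2 = os.open(path, os.O_RDONLY)  # second open: a new open-file object
        fd3 = os.dup(fd1)                 # dup: same open-file object as fd1
        try:
            first = os.read(fd1, 4)   # advances the offset shared by fd1 and fd3
            second = os.read(fd2, 4)  # independent offset, starts at 0
            third = os.read(fd3, 4)   # continues where fd1 stopped
            same_inode = os.fstat(fd1).st_ino == os.fstat(fd2).st_ino
        finally:
            for fd in (fd1, fd2, fd3):
                os.close(fd)
        return first, second, third, same_inode
```

The dup case is a useful refinement of the counting rule: the number of open-file objects follows the number of open() calls, not the number of descriptors.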
Unmounting involves several steps:
- Sync: sync() flushes all dirty data and metadata to disk
- Reference count check: VFS checks that no files are open and no processes have chdir'd into the mount point
- Call put_super(): The file system's put_super() is called to release the super_block
put_super()is called to release the super_block - Free inodes: All inodes associated with the mount are freed (or marked for destruction)
- Remove from mount tree: The mount entry is removed from the VFS mount namespace
- Release resources: Filesystem-specific cleanup (close block device, free private data)
If files are still open or processes are using the mount, umount fails with "Device or resource busy" (unless umount -l lazy unmount is used, which detaches immediately and cleans up later).
Rename within the same file system is straightforward: VFS calls inode->i_op->rename(), which updates directory entries to point to the same inode under a new name.
Rename across file systems is not permitted at the VFS level. The operation:
- Check source and target: VFS verifies both paths resolve within the same mount
- Fail if different mounts: Cross-mount renames (e.g., /mnt/drive1/file to /mnt/drive2/file) return EXDEV ("Cross-device link")
This is a fundamental VFS constraint. Applications must implement cross-device rename as copy + delete: read source, write to destination, then unlink source. This preserves data integrity but loses metadata like timestamps and permissions unless explicitly preserved.
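The copy-then-delete fallback looks like this in Python. It is a minimal sketch with a hypothetical `move_file` helper that handles regular files only; the standard library's `shutil.move` already implements the same idea more completely.

```python
import errno
import os
import shutil


def move_file(src, dst):
    """Rename when possible; fall back to copy + unlink across devices."""
    try:
        os.rename(src, dst)
        return "renamed"
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise  # unrelated failure: do not mask it
    # Cross-device case: copy2 also copies timestamps and permission bits.
    shutil.copy2(src, dst)
    os.unlink(src)
    return "copied"
```

Unlike a true rename, the fallback is not atomic: a crash between the copy and the unlink leaves both files in place.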
struct inode_operations defines callbacks for file and directory operations that act on inodes. Each file system implements these for its specific semantics:
- create: Create a regular file in a directory (e.g., open(filename, O_CREAT))
- lookup: Find a directory entry by name, returning its inode
- link: Create a hard link (same inode, new directory entry)
- unlink: Remove a directory entry pointing to an inode
- mkdir: Create a subdirectory
- rmdir: Remove an empty subdirectory
- rename: Change a file's name or move it to another directory on the same file system
- setattr: Change inode attributes (permissions, timestamps)
The VFS layer calls these function pointers without knowing the underlying file system. ext4, XFS, and NTFS each have their own implementations with different algorithms and on-disk structures.
tmpfs is a file system implemented entirely in memory; it has no disk backing. It stores files in virtual memory (RAM) and can optionally use swap space when RAM is low.
tmpfs registers with VFS just like disk-based file systems (in the kernel source its file_system_type is shmem_fs_type, registered under the name "tmpfs"). It implements the same VFS operations: inode_operations, file_operations, super_operations.
Key characteristics:
- No on-disk structure—files vanish on reboot
- Dynamic sizing: uses available RAM/swap up to a configured limit
- Fast: no disk I/O for reads/writes
- Commonly used for /dev/shm (shared memory), /tmp, and container mounts
From VFS perspective, tmpfs looks like any other file system. Applications access it via the same open(), read(), write() calls. The difference is purely in the implementation—the tmpfs driver never touches a block device.
Reads flow through VFS to the page cache:
- Application calls read(fd, buf, size)
- The generic read path (generic_file_read_iter() in current kernels, generic_file_read() in older ones) checks the page cache first
- If the page is cached, copy data from page cache to userspace buffer
- If not cached, allocate a page, call the file system's readpage() (read_folio() in recent kernels), then copy
Writes go through the same page cache, using write-back caching:
- Application calls write(fd, buf, size)
- VFS writes to the page cache (marking pages as dirty)
- write() returns to the application without waiting for disk I/O
- Kernel writeback threads (pdflush in older kernels, per-device flusher threads since 2.6.32) later write dirty pages to disk
The page cache is unified—ext4, XFS, and all other file systems share it. When ext4 writes a block, it goes into the same page cache that XFS uses. This maximizes cache utilization across file systems.
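The read and write paths described above can be modeled with a small toy cache. This is a userspace sketch under simplifying assumptions, not the kernel's data structure: ToyPageCache and its counters are hypothetical, the "disk" is a bytearray whose length is a multiple of PAGE_SIZE, and all accesses stay within it.

```python
PAGE_SIZE = 4096


class ToyPageCache:
    """Toy write-back page cache in front of a bytearray 'disk'."""

    def __init__(self, disk: bytearray):
        self.disk = disk
        self.pages = {}      # page index -> bytearray of PAGE_SIZE bytes
        self.dirty = set()   # indexes of pages modified since the last flush
        self.disk_reads = 0
        self.disk_writes = 0

    def _get_page(self, index: int) -> bytearray:
        page = self.pages.get(index)
        if page is None:  # cache miss: fill the page from the backing store
            start = index * PAGE_SIZE
            page = bytearray(self.disk[start:start + PAGE_SIZE])
            self.pages[index] = page
            self.disk_reads += 1
        return page

    def read(self, offset: int, size: int) -> bytes:
        out = bytearray()
        while size > 0:
            index, within = divmod(offset, PAGE_SIZE)
            chunk = min(size, PAGE_SIZE - within)
            out += self._get_page(index)[within:within + chunk]
            offset += chunk
            size -= chunk
        return bytes(out)

    def write(self, offset: int, data: bytes) -> None:
        pos = 0
        while pos < len(data):
            index, within = divmod(offset + pos, PAGE_SIZE)
            chunk = min(len(data) - pos, PAGE_SIZE - within)
            self._get_page(index)[within:within + chunk] = data[pos:pos + chunk]
            self.dirty.add(index)  # only marked dirty; the disk is untouched
            pos += chunk

    def flush(self) -> None:
        """Write dirty pages back, as kernel writeback would do later."""
        for index in sorted(self.dirty):
            start = index * PAGE_SIZE
            self.disk[start:start + PAGE_SIZE] = self.pages[index]
            self.disk_writes += 1
        self.dirty.clear()
```

Reading the same range twice increments disk_reads only once, and a write() changes the disk only after flush() runs. In a real program, os.fsync(fd) is the call that asks the kernel to perform that flush for one file before returning.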
struct file_operations defines callbacks for file I/O operations—things you do on an open file handle:
- llseek: Change file position
- read / write: Data I/O
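- readdir: Iterate directory entries (backs readdir(), which libc implements on top of the getdents() system call; the callback is named iterate_shared in current kernels)
- mmap: Memory-map the file
- fsync: Force dirty pages to disk
- lock / flock: File locking (fcntl() locks and flock() locks, respectively)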
struct inode_operations defines operations on the inode itself—metadata and name-level operations:
- create: Create a new file
- lookup: Find a file in a directory
- mkdir, rmdir: Directory operations
- rename: Change file name
- link, unlink: Hard link operations
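File operations act on an open file (struct file); inode operations act on the file itself (struct inode). Multiple opens of the same file share one inode, but each open gets its own struct file with its own offset and flags. Those struct file objects normally point to the same file_operations table.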
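The split between per-open state and per-file state is visible from userspace. In the sketch below (the function name demo_two_opens is hypothetical), one temporary file is opened twice: both descriptors report the same inode number, but reading through one does not move the other's offset.

```python
import os
import tempfile


def demo_two_opens() -> dict:
    """Open one file twice and compare inode numbers and file offsets."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"0123456789")
        os.close(fd)

        fd_a = os.open(path, os.O_RDONLY)
        fd_b = os.open(path, os.O_RDONLY)
        try:
            os.read(fd_a, 4)  # advances only fd_a's offset
            return {
                "same_inode": os.fstat(fd_a).st_ino == os.fstat(fd_b).st_ino,
                "offset_a": os.lseek(fd_a, 0, os.SEEK_CUR),
                "offset_b": os.lseek(fd_b, 0, os.SEEK_CUR),
            }
        finally:
            os.close(fd_a)
            os.close(fd_b)
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(demo_two_opens())
    # {'same_inode': True, 'offset_a': 4, 'offset_b': 0}
```

Descriptors created with dup() or inherited across fork() behave differently: they share a single open file description, so they also share the offset.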
Mount namespaces (CLONE_NEWNS) give each process or container group an independent view of the mount table. Mount propagation determines how mounts in one namespace affect others:
Propagation types:
- Private (kernel default): Mounts and unmounts do not propagate to or from other namespaces
- Shared: Mounts and unmounts propagate in both directions among the members of a peer group
- Slave: Mounts from the master propagate to the slave, but not vice versa
- Unbindable: Like private, and the mount also cannot be used as the source of a bind mount
When a container is created with its own mount namespace:
- The new namespace starts with a copy of the parent's mount table; each copied mount keeps its propagation type, so shared mounts join the same peer group and private mounts stay private
- Bind mounts (like mounting host directories into containers) can be marked shared so changes are visible to the host, or private for isolation
- Unmounting inside the container (e.g., /proc) does not affect the host's mount table
The /proc/self/mountinfo file shows the propagation type and peer relationships for each mount. Container runtimes configure propagation deliberately: Docker bind mounts, for example, default to rprivate and can be switched to a shared or slave variant when mount events need to cross the container boundary.
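Propagation state can be read out of a mountinfo line by looking at its optional fields, which sit between the mount options and a lone "-" separator. The parser below is a sketch based on the field layout documented in proc(5); the function name parse_propagation and the returned dictionary shape are hypothetical.

```python
def parse_propagation(mountinfo_line: str) -> dict:
    """Extract mount point, fs type, and propagation from one mountinfo line.

    Layout: id parent major:minor root mount_point options [optional...] -
            fs_type source super_options
    """
    fields = mountinfo_line.split()
    sep = fields.index("-", 6)  # the separator ends the optional fields

    tags = {}
    for item in fields[6:sep]:  # e.g. "shared:12", "master:3", "unbindable"
        key, _, value = item.partition(":")
        tags[key] = value

    if "unbindable" in tags:
        propagation = "unbindable"
    elif "shared" in tags and "master" in tags:
        propagation = "shared+slave"
    elif "shared" in tags:
        propagation = "shared"
    elif "master" in tags:
        propagation = "slave"
    else:
        propagation = "private"  # no optional tags at all

    return {
        "mount_point": fields[4],
        "fs_type": fields[sep + 1],
        "propagation": propagation,
        "tags": tags,
    }
```

For a line such as "36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue", the function reports /mnt2 as a slave mount of peer group 1. Running it over every line of /proc/self/mountinfo inside and outside a container is a quick way to compare how propagation was set up.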
Further Reading
Topic-Specific Deep Dives:
- Page Cache Deep Dive: Explore how the page cache (formerly buffer cache) interacts with VFS. The page cache stores file data pages in memory, and mmap(), read(), and write() operations all flow through it. Study the address_space structure and how writeback works.
- Container Storage Drivers: Overlay file systems are just one layer. Investigate how devicemapper (thinp), Btrfs, and VFS interact in container runtimes. Understand why overlay2 is preferred over overlay for Docker.
- Linux Page Cache Eviction: The kernel uses an LRU (Least Recently Used) list with active/inactive pages. Study shrink_page_list() and how vm.vfs_cache_pressure affects dentry and inode cache eviction versus page cache.
- Mount Namespaces: The mount namespace isolation that containers rely on is a VFS concept. Explore how clone() with CLONE_NEWNS creates isolated mount tables per process.
- FUSE in Userspace: VFS supports user-space file systems through FUSE (Filesystem in Userspace). This enables creative implementations like sshfs, gocryptfs, and borgbackup, all without kernel code.
Conclusion
The Virtual File System layer is what makes Linux’s unified namespace possible. By defining a standard set of operations (super_operations, inode_operations, file_operations) that each file system implements, VFS allows applications to interact with ext4, XFS, NFS, CIFS, and even virtual file systems like procfs through the same POSIX API. The dentry cache and inode cache are the performance keys, dramatically reducing disk I/O for repeated path resolutions.
When working with file systems, understanding VFS helps you troubleshoot mount issues, choose appropriate mount options for security and performance, and design storage for containers and networked environments. The mount namespace isolation that containers rely on is fundamentally a VFS concept. Remember that network file systems (NFS, CIFS) add latency and failure modes that local file systems do not have—design for timeouts and retry logic in production.
For continued learning, explore the page cache and buffer cache interactions with VFS, study container storage drivers (overlay, devicemapper, btrfs) and how they build on VFS, and examine the Linux page cache eviction policies (LRU, active/inactive lists) that determine file system performance under memory pressure.