Git LFS for Large Files: Binary Asset Management at Scale
Master Git Large File Storage for managing binaries, media, and datasets in Git repositories. Learn pointer files, migration strategies, and production patterns for large file workflows.
Introduction
Git was designed for source code — small text files that change incrementally. When you try to version large binary files (images, videos, datasets, compiled binaries) with plain Git, things fall apart. Repository sizes balloon, clone times stretch to minutes, and every operation becomes sluggish. The problem isn’t Git’s fault; it’s using the wrong tool for the job.
Git Large File Storage (Git LFS) solves this by replacing large files with lightweight pointer files in your repository while storing the actual file contents on a remote server. Your Git history stays lean, clones stay fast, and large files are downloaded on demand. It’s the standard solution for game assets, machine learning datasets, design files, and any repository that needs to version binaries.
This post covers Git LFS architecture, setup workflows, migration strategies for existing repositories, and production patterns for managing large files at scale. If your repository contains files larger than a few megabytes, this is essential reading.
When to Use / When Not to Use
Use Git LFS when:
- Your repository contains files larger than 10MB
- You version binary assets (images, videos, audio, 3D models)
- You manage datasets or compiled artifacts
- Your clone/push times are dominated by large files
- You need to track specific file types across the repository
Avoid Git LFS when:
- All files are small text files (< 1MB)
- You can use external storage (S3, CDN) instead
- Your Git hosting doesn’t support LFS
- You need to search or diff file contents (LFS files are opaque blobs)
- You’re storing files that change frequently (LFS bandwidth costs add up)
Core Concepts
Git LFS works by replacing large files with pointer files:
flowchart TD
A[Large File<br/>image.png 50MB] --> B[Git LFS Filter]
B --> C[Pointer File<br/>128 bytes]
C --> D[Git Repository<br/>Lightweight]
B --> E[LFS Server<br/>Actual file stored]
F[git clone] --> G[Download pointers]
G --> H[Checkout triggers]
H --> I[Download actual files<br/>from LFS server]
I --> J[Working directory<br/>with real files]
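A pointer file is tiny and human-readable. Per the Git LFS pointer spec, it records the spec version, a SHA-256 OID, and the file size — the 50MB image above would be replaced in Git by something like this (the OID shown is illustrative):

```
version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17b2b4f
size 52428800
```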
Architecture and Flow Diagram
sequenceDiagram
participant Dev as Developer
participant Git as Git CLI
participant LFS as Git LFS
participant pointer as Pointer File
participant LFSRemote as LFS Server
participant GitRemote as Git Remote
Dev->>Git: git add large-file.bin
Git->>LFS: Clean filter intercepts large file
LFS->>LFS: Compute SHA-256 hash
LFS->>pointer: Create pointer file
LFS->>Git: Stage pointer file
Dev->>Git: git push
LFS->>LFSRemote: Upload actual file (pre-push hook)
Git->>GitRemote: Push pointer (small)
Dev->>Git: git clone
Git->>GitRemote: Download pointers
Git->>LFS: Checkout triggers smudge
LFS->>LFSRemote: Download actual files
LFSRemote-->>LFS: Large file content
LFS->>Dev: Working directory with real files
Step-by-Step Guide
1. Install Git LFS
# macOS
brew install git-lfs
# Ubuntu/Debian
sudo apt install git-lfs
# Windows
# Download from git-lfs.github.com
# Initialize (run once per user)
git lfs install
2. Track File Types
# Track specific file types
git lfs track "*.psd"
git lfs track "*.png"
git lfs track "*.mp4"
git lfs track "datasets/*.csv"
# Track specific large files
git lfs track "models/model-v1.bin"
# View tracked patterns
git lfs track
# This creates/updates .gitattributes
cat .gitattributes
# *.psd filter=lfs diff=lfs merge=lfs -text
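To confirm a path will actually go through LFS, you can ask Git directly. A quick sketch in a throwaway repo — `git check-attr` reads `.gitattributes` the same way the LFS filters do:

```shell
# Demo: verify that *.psd paths resolve to the LFS filter
repo=$(mktemp -d)
git init -q "$repo"
printf '*.psd filter=lfs diff=lfs merge=lfs -text\n' > "$repo/.gitattributes"
# prints: art/hero.psd: filter: lfs
git -C "$repo" check-attr filter art/hero.psd
rm -rf "$repo"
```

Run this before committing large assets if you are unsure whether a tracking pattern matches the paths you expect.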
3. Commit and Push
# Commit the .gitattributes file (important!)
git add .gitattributes
git commit -m "chore: configure Git LFS tracking"
# Add large files normally
git add large-file.bin
git commit -m "feat: add large asset"
# Push (LFS files upload automatically)
git push origin main
4. Clone LFS Repository
# Standard clone (downloads LFS files)
git clone https://github.com/user/repo.git
# Clone without downloading LFS content (faster)
GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/user/repo.git
# Pull LFS files later
git lfs pull
# Pull specific paths
git lfs pull --include="images/*"
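When most developers only need a subset of the binaries, the default fetch scope can be pinned per clone with the `lfs.fetchinclude` / `lfs.fetchexclude` config keys (the path patterns below are examples). A sketch in a throwaway repo:

```shell
# Limit which LFS objects `git lfs fetch` / `git lfs pull` grab by default
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" config lfs.fetchinclude "images/**"
git -C "$repo" config lfs.fetchexclude "archives/**"
# prints: images/**
git -C "$repo" config --get lfs.fetchinclude
rm -rf "$repo"
```

This is a per-clone setting, so artists and backend developers can scope their downloads differently against the same repository.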
5. Migration: Convert Existing Repository
# git lfs migrate ships with git-lfs -- no separate install needed
# Migrate specific file types (rewrites history)
git lfs migrate import --include="*.psd,*.png,*.mp4" --everything
# Back up first (mirror clone), then migrate with verbose output
git clone --mirror . ../repo-backup.git
git lfs migrate import --include="*.bin" --everything --verbose
# Push migrated history
git push --force --all
git push --force --tags
⚠️ Warning: History rewriting requires force push and coordination with all collaborators.
6. LFS Management Commands
# List LFS files in repository
git lfs ls-files
# Show LFS file details
git lfs ls-files --long
# Check LFS status
git lfs status
# Prune old LFS objects
git lfs prune
# Fetch LFS objects for all refs
git lfs fetch --all
# Verify LFS integrity
git lfs fsck
Production Failure Scenarios + Mitigations
| Scenario | Impact | Mitigation |
|---|---|---|
| LFS server unavailable | Can’t checkout files | Cache LFS files locally; use git lfs fetch --all |
| Bandwidth limits exceeded | Push/clone fails | Monitor usage; compress files; use external storage for archives |
| Pointer file committed without LFS | Large file in Git history | Use pre-commit hook to catch; migrate with git lfs migrate |
| LFS file corruption | Checkout fails | Run git lfs fsck; re-fetch from server |
| Migration breaks collaboration | Team confusion | Communicate migration; provide re-clone instructions |
| Storage limits reached | Can’t push new LFS files | Prune old objects; upgrade storage plan; archive unused files |
Trade-offs
| Aspect | Git LFS | External Storage |
|---|---|---|
| Integration | Seamless with Git | Manual sync required |
| Versioning | Full Git history | Separate versioning |
| Cost | LFS storage fees | S3/CDN costs |
| Clone speed | Fast (on-demand download) | Manual download |
| Collaboration | Native Git workflow | Extra steps |
| Backup | Git + LFS server | Separate backup strategy |
Implementation Snippets
Pre-commit hook to catch missing LFS tracking:
#!/bin/bash
# .husky/pre-commit
large_files=$(git diff --cached --name-only | while IFS= read -r file; do
  if [ -f "$file" ]; then
    size=$(stat -f%z "$file" 2>/dev/null || stat -c%s "$file" 2>/dev/null)
    if [ "${size:-0}" -gt 1048576 ] && ! git check-attr filter "$file" | grep -q "lfs"; then
      echo "$file"
    fi
  fi
done)

if [ -n "$large_files" ]; then
  echo "Error: Large files not tracked by LFS:"
  echo "$large_files"
  echo "Run: git lfs track '<pattern>'"
  exit 1
fi
LFS configuration for specific paths:
# .gitattributes
# Images
*.png filter=lfs diff=lfs merge=lfs -text
*.jpg filter=lfs diff=lfs merge=lfs -text
*.gif filter=lfs diff=lfs merge=lfs -text
# Design files
*.psd filter=lfs diff=lfs merge=lfs -text
*.ai filter=lfs diff=lfs merge=lfs -text
*.sketch filter=lfs diff=lfs merge=lfs -text
# Media
*.mp4 filter=lfs diff=lfs merge=lfs -text
*.mp3 filter=lfs diff=lfs merge=lfs -text
# Datasets
*.csv filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
# Binaries
*.bin filter=lfs diff=lfs merge=lfs -text
*.exe filter=lfs diff=lfs merge=lfs -text
*.dll filter=lfs diff=lfs merge=lfs -text
CI configuration with LFS:
# GitHub Actions
- uses: actions/checkout@v4
with:
lfs: true # Automatically pulls LFS files
# Or manual LFS pull
- uses: actions/checkout@v4
- run: git lfs pull
Observability Checklist
- Logs: Log LFS upload/download operations and failures
- Metrics: Track LFS storage usage, bandwidth, and file counts
- Alerts: Alert on storage limits, bandwidth thresholds, and fetch failures
- Dashboards: Monitor LFS adoption and repository size trends
- Traces: Trace LFS file lifecycle from add to checkout
Security/Compliance Notes
- LFS files are stored on remote servers; verify encryption at rest
- Access controls apply to LFS objects; configure repository permissions
- For regulated data, ensure LFS provider meets compliance requirements
- LFS pointer files don’t contain file contents; safe to commit publicly
- Audit LFS access logs for unauthorized downloads
- Consider encrypting sensitive LFS files before committing
Common Pitfalls / Anti-Patterns
| Anti-Pattern | Why It’s Bad | Fix |
|---|---|---|
| Forgetting to commit .gitattributes | LFS not configured for team | Always commit .gitattributes with tracking changes |
| Tracking too many file types | Unnecessary LFS overhead | Track only files > 1MB |
| Not pruning old LFS objects | Disk space waste | Run git lfs prune regularly |
| LFS files in submodule | Complex LFS handling | Keep LFS files in main repository |
| Migrating without team coordination | Broken clones for collaborators | Communicate migration; provide re-clone steps |
| Using LFS for frequently changing files | High bandwidth costs | Use external storage for volatile large files |
Quick Recap Checklist
- Install Git LFS and run git lfs install
- Configure file type tracking in .gitattributes
- Commit .gitattributes before adding large files
- Set up pre-commit hooks to catch missing LFS tracking
- Configure CI to pull LFS files automatically
- Monitor LFS storage usage and bandwidth
- Set up regular pruning schedule
- Document LFS workflow for team members
Interview Q&A
Q: How does Git LFS work internally?
When you add a tracked file, Git LFS intercepts the operation via clean/smudge filters. The clean filter replaces the file with a pointer (containing the spec version, a SHA-256 OID, and the file size) in the Git repository; the actual content is uploaded to the LFS server during git push via a pre-push hook. On checkout, the smudge filter replaces the pointer with the actual file downloaded from the LFS server.
Q: How does Git LFS differ from Git submodules?
Git LFS replaces large files with pointers while keeping them in the same repository. Submodules reference external repositories. LFS is better for large files within a project; submodules are better for separate projects with independent lifecycles. LFS provides seamless integration; submodules require explicit initialization and updating.
Q: How do you migrate an existing repository to Git LFS?
Use git lfs migrate import --include="*.ext" --everything to rewrite history and replace existing large files with LFS pointers. This requires a force push and coordination with all collaborators, who must re-clone the repository. Always back up before migration and test on a copy first.
Q: What happens when you hit your LFS storage limit?
You can't push new LFS files until you free up space or upgrade your plan. Existing LFS files remain accessible. Solutions include pruning old objects (git lfs prune), removing unused LFS files from history, or migrating to external storage for archival files. Monitor usage proactively to avoid surprises.
Q: Which hosting providers support Git LFS?
Most major providers support LFS: GitHub, GitLab, Bitbucket, Azure DevOps. However, storage limits, bandwidth policies, and pricing vary significantly. Some providers include LFS in their plans; others charge extra. Always verify LFS support and limits before committing to a hosting provider for large-file repositories.
Extended Production Failure Scenarios
LFS Quota Exceeded
A team’s Git LFS storage quota (e.g., GitHub’s 1GB free tier) is exceeded during a large asset push. The push fails with batch response: Storage quota exceeded. The Git commits succeed (pointer files are small), but the actual LFS objects are rejected. The repository now contains pointer files that reference objects that don’t exist on the LFS server. Anyone who clones gets broken checkouts — files are replaced with 128-byte pointer text instead of actual content.
Mitigation: Monitor LFS storage usage proactively and set alerts at 80% of quota. Before pushing large batches, check remaining quota in your provider's dashboard (git lfs status shows what will be pushed, not how much quota remains). If quota is exceeded, either upgrade the plan or move the oversized assets out of the push and store them externally.
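When a checkout like this goes wrong, it helps to inspect what is actually on disk. A hypothetical helper (pointer_info is not a git-lfs command) that parses the published pointer format:

```shell
# Print the oid and size recorded in an LFS pointer file
pointer_info() {
  awk '/^oid sha256:/ { sub("sha256:", "", $2); print "oid=" $2 }
       /^size /      { print "size=" $2 }' "$1"
}

# Usage: a file that is secretly still a pointer reveals itself immediately
f=$(mktemp)
printf 'version https://git-lfs.github.com/spec/v1\noid sha256:deadbeef\nsize 52428800\n' > "$f"
pointer_info "$f"
# prints:
# oid=deadbeef
# size=52428800
rm -f "$f"
```

If the reported size is a few hundred bytes while the pointer claims tens of megabytes, the object was never uploaded and needs to be re-pushed from a machine that still has it.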
Pointer Files Without Actual Content
A developer adds a new file type to .gitattributes but forgets to run git lfs install on their machine. Large files are committed as regular Git blobs instead of LFS pointers. The repository bloats, and other developers who have LFS installed see the files as regular Git objects — they can’t use git lfs pull to fetch them because they were never stored on the LFS server.
Mitigation: Use a pre-commit hook that checks for large files not tracked by LFS. Run git lfs migrate import to fix any files that slipped through. Add CI validation that rejects pushes containing large non-LFS blobs.
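The CI validation mentioned above can be sketched as a shell function. check_large_non_lfs is a hypothetical helper, simplified to scan a directory tree rather than the git index, with a default 1MB threshold:

```shell
# Return 1 if any file over $max bytes is not a Git LFS pointer
check_large_non_lfs() {
  local dir=${1:-.} max=${2:-1048576} status=0 file size
  while IFS= read -r -d '' file; do
    size=$(wc -c < "$file" | tr -d ' ')
    if [ "$size" -gt "$max" ]; then
      # genuine pointers are ~130 bytes and start with the version line
      if ! head -c 40 "$file" | grep -q "version https://git-lfs"; then
        echo "Large non-LFS file: $file ($size bytes)"
        status=1
      fi
    fi
  done < <(find "$dir" -type f -print0)
  return $status
}
```

A production version would walk `git diff --cached --name-only` (or the pushed range in CI) instead of `find`, but the detection logic is the same.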
Extended Trade-offs
| Aspect | Git LFS | Git Annex | External Storage (S3) |
|---|---|---|---|
| Cost | Provider-dependent (often paid after free tier) | Free — self-hosted | Pay-per-use, can be cheaper at scale |
| Accessibility | Native Git workflow | Complex — separate toolchain | Manual — separate download step |
| Versioning | Full Git history of pointers | Full history tracking | Separate versioning system needed |
| Setup | Simple — git lfs track | Complex — init, configure remotes | Moderate — SDK or CLI integration |
| Collaboration | Seamless — works with any Git host | Requires all users to install git-annex | Requires separate access management |
| Best for | Game assets, design files, ML datasets | Academic archives, personal backups | Static assets, distribution files |
Security and Compliance: LFS Object Access Control
- Access control: LFS objects inherit repository permissions. If a user can clone the repo, they can download all LFS objects. For sensitive binaries, use private repositories with strict access controls.
- Bandwidth costs: LFS bandwidth is often billed separately from storage. Large teams cloning frequently can incur significant costs. Use GIT_LFS_SKIP_SMUDGE=1 git clone to defer LFS downloads until needed.
- Regional storage: Some LFS providers store objects in specific regions. For compliance (GDPR, data residency), verify where LFS objects are stored. GitHub LFS uses the same region as the repository; self-hosted GitLab allows region configuration.
- Encryption: LFS objects are encrypted in transit (HTTPS) but may not be encrypted at rest depending on the provider. For sensitive binaries, encrypt files before adding them to LFS and manage decryption keys separately.
- Audit logging: Track LFS download events for compliance. Most providers log LFS access separately from Git access. Review logs periodically for unusual download patterns.
Cross-Roadmap References
- Object Storage — System Design roadmap: S3-compatible blob storage for large binary assets
- System Design Learning Roadmap — Broader architecture context for large file handling
Related Posts
Centralized vs Distributed VCS: Architecture, Trade-offs, and When to Use Each
Compare centralized (SVN, CVS) vs distributed (Git, Mercurial) version control systems — their architectures, trade-offs, and when to use each approach.
Automated Changelog Generation: From Commit History to Release Notes
Build automated changelog pipelines from git commit history using conventional commits, conventional-changelog, and semantic-release. Learn parsing, templating, and production patterns.
Choosing a Git Team Workflow: Decision Framework for Branching Strategies
Decision framework for selecting the right Git branching strategy based on team size, release cadence, project type, and organizational maturity. Compare Git Flow, GitHub Flow, and more.