Longest Common Subsequence: String Alignment Problems
Master the classic DP problem for string similarity, diff tools, and bioinformatics with O(mn) time and space optimization techniques.
Longest Common Subsequence: String Alignment Problems
The longest common subsequence (LCS) problem asks: given two sequences, what is the longest subsequence they share while preserving order? Unlike substrings, subsequences need not be contiguous—it’s about finding the common “skeleton” when you delete zero or more elements from each sequence. This problem underlies file diff tools like Git, plagiarism detection, bioinformatics sequence alignment, and countless other applications where measuring similarity matters.
The DP solution builds a table where dp[i][j] represents the LCS length of the first i characters of sequence A and first j characters of sequence B. The recurrence is elegant: if characters match, it’s 1 + dp[i-1][j-1]; otherwise, it’s max(dp[i-1][j], dp[i][j-1]). With space optimization from O(mn) to O(min(m,n)), this becomes surprisingly practical.
Introduction
LCS is one of the most foundational problems in dynamic programming. Given two sequences, LCS finds the longest subsequence present in both in the same relative order. It’s not just an academic exercise—it powers Git diff, DNA sequence alignment in bioinformatics, plagiarism detection in academic systems, and fuzzy matching in autocomplete features. The key insight that makes it tractable is that the problem exhibits optimal substructure: the LCS of two strings depends only on the LCS of their prefixes.
Understanding LCS matters for several reasons. First, it establishes the string alignment paradigm that underlies many DP problems. Second, the space-optimized O(min(m,n)) solution teaches an important lesson about reducing memory footprint while preserving correctness. Third, variations like edit distance and shortest common supersequence build directly on LCS concepts. In technical interviews, LCS appears regularly—not just as a direct “find LCS length” question, but as a blueprint for solving subsequence problems in disguise.
This guide covers the O(mn) DP table approach, how to reconstruct the actual subsequence string (not just its length), and the space optimization that reduces memory to O(min(m,n)). You’ll also learn Hirschberg’s algorithm for linear-space reconstruction, bit-parallel techniques for small alphabets, and how Hunt-Szymanski speeds up LCS when matches are sparse. By the end, you’ll recognize LCS patterns in new problems and know exactly which optimization applies.
When to Use
LCS applies when:
- Comparing versions or detecting changes — Git diff, version control
- Plagiarism detection — Measuring textual similarity between documents
- Bioinformatics — DNA sequence alignment (BLAST, CLUSTAL)
- Minimum edit distance problems — Levenshtein distance is a close cousin
- Auto-completion and fuzzy matching — Suggesting corrections based on similarity
When Not to Use
- When you need contiguous matches (use longest common substring instead)
- Very large sequences with small alphabet (consider suffix arrays or FM-index)
- Real-time streaming scenarios (batch processing assumes full input)
Architecture: DP Table Construction
graph TD
A["dp[i-1][j-1] if match, else max(dp[i-1][j], dp[i][j-1])"] --> B[Fill dp table]
B --> C["Trace back from dp[m][n]"]
C --> D[LCS characters]
subgraph Table
E["s1[0] s1[1] s1[2]..."]
F["s2[0]"]
G["dp[0][*] = 0"]
H["dp[*][0] = 0"]
end
The first row and column are zeros (empty sequence has LCS length 0). Fill row by row, then trace diagonally when characters match.
Implementation
Standard LCS (O(mn) space)
def lcs_length(s1, s2):
"""
Find LCS length between two strings.
Time: O(mn), Space: O(mn)
"""
m, n = len(s1), len(s2)
dp = [[0] * (n + 1) for _ in range(m + 1)]
for i in range(1, m + 1):
for j in range(1, n + 1):
if s1[i - 1] == s2[j - 1]:
dp[i][j] = dp[i - 1][j - 1] + 1
else:
dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
return dp[m][n]
Space-Optimized LCS (O(n) space)
def lcs_length_optimized(s1, s2):
"""
Space-optimized LCS using rolling array.
Time: O(mn), Space: O(min(m, n))
"""
# Ensure s2 is the shorter one for space efficiency
if len(s1) < len(s2):
s1, s2 = s2, s1
m, n = len(s1), len(s2)
prev = [0] * (n + 1)
curr = [0] * (n + 1)
for i in range(1, m + 1):
for j in range(1, n + 1):
if s1[i - 1] == s2[j - 1]:
curr[j] = prev[j - 1] + 1
else:
curr[j] = max(prev[j], curr[j - 1])
prev, curr = curr, [0] * (n + 1)
return prev[n]
Reconstruct LCS String
def lcs_string(s1, s2):
"""
Reconstruct the actual LCS string, not just its length.
"""
m, n = len(s1), len(s2)
dp = [[0] * (n + 1) for _ in range(m + 1)]
for i in range(1, m + 1):
for j in range(1, n + 1):
if s1[i - 1] == s2[j - 1]:
dp[i][j] = dp[i - 1][j - 1] + 1
else:
dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
# Trace back to reconstruct
lcs = []
i, j = m, n
while i > 0 and j > 0:
if s1[i - 1] == s2[j - 1]:
lcs.append(s1[i - 1])
i -= 1
j -= 1
elif dp[i - 1][j] > dp[i][j - 1]:
i -= 1
else:
j -= 1
return ''.join(reversed(lcs))
Common Pitfalls / Anti-Patterns
- Off-by-one indexing — DP table has (m+1)×(n+1) cells; character indices offset by 1
- Confusing subsequence with substring — LCS allows gaps; substring requires contiguity
- Space optimization with mutation — When swapping, don’t accidentally share references
- Trace-back boundary conditions — While loop needs both i > 0 and j > 0 to avoid index errors
Quick Recap Checklist
- Characters must match AND be in same relative order
- DP table: match = diagonal + 1, no match = max of left/top
- First row/column always 0 (empty sequence)
- Space optimization: use two rows (prev/curr) since only previous row needed
- Test with: empty strings, single char, identical strings, no common chars
Interview Questions
The recurrence captures the two possibilities for the optimal solution ending at position (i,j):
If s1[i-1] == s2[j-1], we extend the LCS of prefixes by one. Otherwise,
the optimal solution must omit one of these characters, so we take the max of excluding s1[i-1]
(dp[i-1][j]) or excluding s2[j-1] (dp[i][j-1]). By induction, these subproblems are
already solved optimally when we compute dp[i][j].
Start from dp[m][n] (bottom-right of table) and trace backwards. When characters match, move diagonally (up-left) and append the character. When they don't match, move in the direction of the larger value (up if dp[i-1][j] > dp[i][j-1], else left). This reconstructs the LCS by reversing the decisions made during table construction.
Both are DP string alignment problems. LCS finds the longest common subsequence,
maximizing matches. Edit distance (Levenshtein) finds minimum operations to transform
one string to another, minimizing cost. The relationship: edit_distance = |s1| + |s2| - 2 × LCS_length
when insertions/deletions cost 1 and substitutions cost 2. They optimize opposite objectives.
Yes, using a rolling array technique. Since each dp[i][j] only depends on dp[i-1][j-1], dp[i-1][j], and dp[i][j-1], we only need the previous row. Maintain two arrays of length n+1 (prev and curr). After processing each row, swap them. This reduces space from O(mn) to O(min(m,n)). The trade-off is that you lose the ability to reconstruct the actual LCS string — only the length is available.
To enumerate all LCS strings, modify the traceback. When both
dp[i-1][j] == dp[i][j-1] and they exceed the diagonal value, two valid paths exist.
Use backtracking with memoization to collect all distinct LCS strings. The
worst-case number of LCS strings is exponential — strings of all identical
characters like "AAA" and "AAA" yield C(2n,n) possible LCS strings — so
complete enumeration is only practical for small inputs.
The standard O(mn) DP becomes impractical due to both time and memory. A 100K x 100K table requires 10 billion cells. Workarounds exist: use Hirschberg's algorithm for O(min(m,n)) space LCS reconstruction, suffix automaton or bit-parallel techniques for small alphabets, Myers' O(ND) diff algorithm for nearly identical strings, or heuristic approximations when exact LCS is not required.
LCS allows gaps between matched characters; longest common substring (LCSubstr)
requires characters to be contiguous in both strings. LCS uses DP with
max-over-neighbors: if match: dp[i][j] = dp[i-1][j-1] + 1, else: dp[i][j] = max(dp[i-1][j], dp[i][j-1]).
LCSubstr resets to zero on a miss: if match: dp[i][j] = dp[i-1][j-1] + 1, else: dp[i][j] = 0.
LCSubstr can be solved in O(n+m) with suffix automaton or rolling
hash, while LCS is inherently O(mn) in general.
Extend the 2D DP to 3D. When all three characters match,
dp[i][j][k] = 1 + dp[i-1][j-1][k-1]. Otherwise take the max of
dp[i-1][j][k], dp[i][j-1][k], and dp[i][j][k-1].
Time complexity becomes O(mnp) and space O(mnp). Generalizing to k strings
blows up to O(n^k) time, which becomes intractable quickly — multi-string LCS is NP-hard
for arbitrary k.
Compute the LCS between two documents (tokenized as sentences, words, or n-grams).
The plagiarism ratio is LCS_length / min(|A|, |B|). A high ratio
signals significant overlap. In practice: normalize text by removing
whitespace and punctuation, use sentence-level tokens for better semantics,
combine with shingling and MinHash for large corpora,
and set a threshold (e.g., 0.8) to flag potential plagiarism.
Hirschberg's algorithm computes the LCS in O(mn) time but only O(min(m,n)) space while still reconstructing the full LCS string. The trick is divide-and-conquer: compute the DP forward from the start and backward from the end for the middle row, find where forward + backward is maximized to get the split point, then recursively solve the two halves. The significance is that LCS reconstruction doesn't need the full O(mn) DP table — this makes it feasible for large strings where memory is the real bottleneck.
Bit-parallel LCS (Myers algorithm) encodes the DP row as machine words, exploiting parallelism in hardware. For alphabet size Σ, store Σ bitmasks for string B, then use bitwise operations to compute the entire row in O(n/W) where W is the machine word size (e.g., 64 bits). Per-character work drops from O(1) to O(1/W) amortized. This only works for small alphabets where the bitmask fits in a word, and performs best when |B| < W. For Unicode or large alphabets, bitmasks become impractical and you fall back to standard DP.
The Shortest Common Supersequence (SCS) of two strings is the shortest string that
has both as subsequences. The connection to LCS is direct:
SCS_length = |s1| + |s2| - LCS_length. To build an SCS, run the
LCS traceback but interleave non-matching characters from both strings.
SCS is essentially the LCS plus the gaps between matched pairs.
For more than two strings, SCS stays NP-hard — the same intractability as multi-string LCS.
Normalize both strings before running LCS. For case-insensitive matching, convert both to lowercase (or uppercase) upfront — a one-time O(m+n) cost. For Unicode, NFC/NFD normalization handles characters with diacritics that visually match but have different code points. Consider grapheme clusters (e.g., é as e +́) if visual equality matters. The critical rule: preprocessing must be consistent across both strings before any comparison, never done as a post-check during DP, because the DP recurrence depends on character equality being uniform everywhere.
Git uses a variation of Myers' diff algorithm, which borrows from LCS concepts. When merging branches, git finds the LCS between the original file and each branch version to separate non-conflicting changes from conflicting ones. Segments that differ in only one branch merge automatically; segments changed in both produce merge conflicts. The LCS pinpoints the common base, enabling a three-way merge: changes from the common ancestor to branch A and to branch B are identified independently, then combined. Conflicts only appear where the LCS-based alignment breaks because both branches modified the same region in incompatible ways.
Standard LCS needs both strings in full, but streaming has its workarounds. Keep the current DP row for the first string against the streaming second string, updating it row-by-row as each chunk arrives. The full table never materializes. If the first string also streams in piece by piece, you're dealing with a different problem — true online LCS requires a bounded buffer strategy. When memory is tight, compute LCS on windows or fall back to approximation algorithms like ukkonen's that allow an error threshold ε.
LPS finds the longest subsequence that reads the same forwards and backwards in a single string.
The connection: LPS(s) = LCS(s, reverse(s)).
Reverse the string and find LCS with the original — the result is the longest palindrome.
A palindrome's first and last characters must match, so aligning s with its reverse
captures that symmetric structure. Complexity stays O(n²) with O(n) space possible.
This relationship shows up in problems like minimum insertions to form a palindrome.
If the first strings share structure (like a common prefix), build a suffix automaton or trie of all first-strings to share DP work. When strings share the same alphabet, pre-compute character positions and reuse row computations. N pairs processed naively take O(N·m·n). You can parallelize at the pair level (each pair is independent) or at the cell level (trickier to sync). If one string is shared across all pairs (e.g., one query against N database entries), build the DP table once and reuse partial results. Sorting pairs by length and processing shortest first helps when using space-optimized O(n) methods.
Hunt-Szymanski is a divide-and-conquer LCS tuned for strings with few matching pairs. First find all matches (positions where s1[i] == s2[j]), then treat these as a DAG longest path problem. Building the full O(m·n) table is wasteful when matches are sparse. Sort match positions by row, then find the longest increasing subsequence in that sorted list. Complexity is O((m+n)·log(m) + R·log(R)) where R is the match count. This beats O(mn) DP when R << mn — which is exactly the scenario git faces with large, divergent files.
Constrained LCS (CLCS) allows up to k mismatched positions without breaking the subsequence.
Extend the DP state to dp[i][j][t] = LCS length with exactly t mismatches used so far.
On a character match, take the diagonal and optionally reset the mismatch counter.
On a miss, either move without counting (free gap) or increment the mismatch counter (counted gap).
This runs in O(m·n·k) time with O(m·n) or O(m·n·k) space depending on pruning.
When k is small relative to m and n, it's tractable; when k grows large,
you're essentially computing string similarity anyway.
Hirschberg maintains two DP rows (forward and backward) instead of the full table.
The forward pass goes left to right up to the middle column; the backward pass goes
right to left from the end. At the middle row, forward[i][j] + backward[i][j]
gives the LCS length crossing that boundary. Find the split point where
forward[mid][j] + backward[mid][j] == LCS_length,
then recursively reconstruct the LCS string by splitting at that mid point.
Standard traceback walks from dp[m][n] backward through the entire O(mn) matrix.
Hirschberg achieves O(min(m,n)) space by keeping only two rows at each recursion level —
a fundamental difference in how traceback information is stored.
Further Reading
- Edit Distance and Levenshtein — The sister problem to LCS, covering minimum edit distance and Levenshtein distance with similar DP techniques.
- Hirschberg’s Algorithm — Linear-space LCS reconstruction combining divide-and-conquer with DP for O(min(m,n)) space.
- Myers’ Diff Algorithm — The greedy O(ND) algorithm behind Git diff, optimized for nearly identical strings.
- CLRS: Introduction to Algorithms (Chapter 15) — Dynamic programming fundamentals with formal proof of optimal substructure.
- LeetCode 1143: Longest Common Subsequence — Classic interview problem with test cases and community solutions.
- Rosalind: Finding a Shared Spliced Motif — Bioinformatics application using LCS on DNA sequences.
Conclusion
LCS comes down to building a DP table where a match gives you 1 + dp[i-1][j-1] and a miss gives you max(dp[i-1][j], dp[i][j-1]). Space-optimize to two rows if you only need the length, or keep the full table if you need to reconstruct the actual subsequence string. Watch for the off-by-one trap: your table indices are offset by one from your string indices. For related string problems, see Edit Distance and Levenshtein.
Category
Related Posts
Edit Distance: Transforming One String to Another
Solve the minimum edit distance problem using dynamic programming with applications in spell checking, DNA alignment, and autocomplete.
1D Dynamic Programming Problems: Classic Patterns
Learn to solve common 1D dynamic programming problems including climbing stairs, house robber, and coin change with optimized space solutions.
2D Dynamic Programming: Matrix Chain and Beyond
Master 2D DP problems with two state variables for string manipulation, matrix chain multiplication, and optimal game strategies.