NoSQL Databases: Document, Key-Value, Column-Family, Graph

Explore NoSQL database types, CAP theorem implications, and when to choose MongoDB, Cassandra, DynamoDB, or graph databases over relational systems.

published: March 22, 2026 reading time: 40 min read author: GeekWorkBench updated: June 17, 2026

Quick Summary

NoSQL databases trade away ACID transactions and SQL flexibility to handle specific access patterns at massive scale. Document databases like MongoDB handle flexible schemas and embedded data, key-value stores like Redis deliver sub-millisecond lookups for sessions and caching, column-family databases like Cassandra excel at write-heavy workloads with predictable queries, and graph databases like Neo4j handle relationship traversal efficiently. The CAP theorem forces a choice during partitions: choose consistency (CP systems like HBase) or availability (AP systems like DynamoDB and Cassandra). Sharding strategies—hash-based, consistent hashing, and range-based—determine how data distributes across nodes, and getting partition keys wrong creates hotspots that kill performance. For most teams, relational databases still win unless you have specific scale requirements that NoSQL actually solves.

NoSQL Databases: Document, Key-Value, Column-Family, and Graph Databases

Introduction

NoSQL databases emerged to solve problems relational systems handle poorly. Some data is naturally document-structured, not tabular. Some access patterns require single-key lookups at massive scale. Some workloads need graph traversal rather than filtered aggregations.

The common thread is flexibility. NoSQL systems typically trade away features to get it. You might lose ACID transactions across documents, or find yourself deciding between consistency and availability. Understanding these trade-offs matters more than the marketing around any particular database.

Topic-Specific Deep Dives

Sharding Fundamentals

Sharding distributes data across multiple nodes so no single machine handles everything. Write throughput and storage grow as you add nodes, which sounds simple because it is. The catch is that your application now needs to figure out which node has the data it wants, and queries that span multiple shards get expensive fast. You trade operational complexity for horizontal scalability.

The distribution strategies each make different tradeoffs. Hash-based sharding spreads writes evenly across nodes, but range queries scatter across every shard. Range-based sharding keeps related data together for efficient range scans, but you can end up with hot spots if your access patterns match your partition key. Consistent hashing reduces data movement when nodes join or leave, but the implementation is more involved. Picking a shard key is one of those decisions that is hard to change later, so it pays to think about access patterns before you commit.

The next few sections break down each strategy, starting with hash-based approaches and how most databases handle those automatically, then moving into range-based sharding and where time-series workloads fit.

Sharding Patterns Deep Dive

Sharding strategies differ more than the marketing suggests. Here is how they actually work.

Hash-Based Sharding

The basic formula: shard_key = hash(primary_key) % num_shards. This distributes data evenly and avoids hotspots, but range queries scatter across every shard — expensive at scale.

# Application-level hash sharding
def get_shard(key, num_shards=16):
    return abs(hash(key)) % num_shards

shards = [connect(f"shard_{i}.db") for i in range(num_shards)]
key = "user:12345"
shard = shards[get_shard(key)]

MongoDB and Cassandra handle hash-based sharding automatically. MongoDB uses a variant called “consistent hashing” with chunk-based reassignment. Cassandra uses the partitioner’s hash output to assign tokens to nodes.

Consistent Hashing

The idea is to minimize shuffling when nodes join or leave. Instead of modulo on the hash, data goes to the first node clockwise on the ring.

graph LR
    subgraph "Hash Ring"
        H1["hash(A)"];
        H2["hash(B)"];
        H3["hash(C)"];
        H4["hash(D)"];
    end
    A["Node A<br/>Range: H1→H2"] --> B["Node B<br/>Range: H2→H3"];
    B --> C["Node C<br/>Range: H3→H4"];
    C --> D["Node D<br/>Range: H4→H1"];
    D --> A;

When a node fails, only its range moves to the next node. Adding a node takes ranges from neighbors rather than rehashing everything.

Advanced Sharding Strategies

Range-Based Sharding

Partition by key ranges instead of hash. This makes range queries within a partition efficient, but you risk hotspots if your access patterns correlate with the shard key.

Cassandra excels with range-based sharding using partition keys. A time-series table might shard by month, enabling efficient queries for a given time window:

-- Cassandra: range sharding for time-series
CREATE TABLE sensor_readings (
    device_id uuid,
    month text,  -- '2026-03'
    reading_time timestamp,
    temperature float,
    PRIMARY KEY ((device_id, month), reading_time)
);

Application-Level Sharding

Some systems push sharding entirely to the application. Redis Cluster and memcached use client-side routing — the application figures out which node owns a key and talks to it directly.

# Application-level sharding with Redis Cluster-like routing
class ShardedRedis:
    def __init__(self, nodes):
        self.nodes = nodes
        self.slot_map = self._build_slot_map(16384, nodes)

    def _slot_for_key(self, key):
        return crc16(key) % 16384

    def get(self, key):
        slot = self._slot_for_key(key)
        node = self._node_for_slot(slot)
        return node.get(key)

Sharding Anti-Patterns

Hot partition keys: If you shard by something low-cardinality — timestamps, status codes, a popular entity — one partition absorbs all the traffic. A celebrity user’s ID as partition key will saturate a single shard even in a 100-node cluster.

Unbounded partitions: Time-based or counter-based keys without bucketing grow without limit. In Cassandra, a single conversation with no end date eventually exceeds memory limits. Always bucket your time series.

Cross-partition transactions: When an operation touches multiple shards, you need distributed coordination — slower, more failure-prone. Design queries to keep related data on the same partition.

Trade-off Analysis

When Relational Still Wins

NoSQL is not always the answer — and honestly, most of the time it probably is not.

If your data fits a tabular model with clear relationships, relational databases offer more features and better tooling. ACID transactions across multiple tables are hard to match. SQL provides powerful querying for analytical workloads.

Operational simplicity matters. Most developers know SQL. Most teams can debug a MySQL or PostgreSQL issue quickly. The ecosystem around relational databases is mature.

NoSQL databases often require more operational expertise. You need to understand the CAP trade-offs, figure out capacity planning for sharding, and learn operational procedures specific to whichever system you chose.

Start with relational unless you have specific reasons not to. The flexibility argument for NoSQL is often overstated.

When to Use / When Not to Use

When to Use NoSQL Databases:

Document databases: Your data is document-structured, varies in schema, or benefits from embedding related data
Key-value stores: You need simple lookups at extreme speed for caching or sessions
Column-family databases: You have write-heavy workloads with predictable query patterns
Graph databases: Relationship traversal is the core of your workload
You need horizontal write scaling beyond what relational databases offer
Your access patterns are simple and predictable at scale

When Not to Use NoSQL Databases:

You need ACID transactions across multiple documents or entities
Your data has complex relationships requiring multi-table joins
You need ad-hoc querying capabilities or powerful aggregation
Your team lacks operational expertise for the specific NoSQL system
You are using the “flexibility” argument without specific requirements

Production Failure Scenarios

Actual production incidents teach you more than any architecture diagram. Here is how real systems break.

DynamoDB Thundering Herd

DynamoDB adaptive capacity handles uneven access patterns, but only up to a point. If a popular item suddenly goes viral — a flash sale, viral post, or celebrity endorsement — the partition serving that item receives 100x normal traffic. Adaptive capacity splits that partition, but splitting takes time. During the split window, requests throttle. DynamoDB returns ProvisionedThroughputExceededException. Your application retries with exponential backoff, which queues more requests, which makes the overload worse.

For thundering herd, scatter-gather helps: cache popular items at the application layer, distribute reads across multiple keys that all map to the same underlying item, or add a random suffix to partition keys to spread load. Do not rely on DynamoDB adaptive capacity as your primary strategy — it is a backstop.

What this looks like in practice: A product page goes viral and receives 50,000 requests per second against a single partition key. DynamoDB begins splitting the partition, but the split takes 30-60 seconds to complete. During that window, every request to that partition key returns ProvisionedThroughputExceededException. Your application retries with exponential backoff starting at 100ms, doubling each retry. By the fifth retry, you are looking at 100ms × 2^5 = 3,200ms of queued delays. The retry queue grows faster than DynamoDB can drain it. The viral content eventually cools off before the split finishes, so the overload is self-limiting, but users experience degraded latency the entire time.

How to detect it early: Watch ConsumedCapacityUnits and ThrottledRequests per partition in CloudWatch. Set an alert when any single partition exceeds 80% of its consumed capacity for more than 30 seconds. Once you hit 100%, you are already in the throttling window. You can also monitor GetItemThrottledRequests and QueryScannedCount per partition to identify which partition key is the culprit before it cascades.

Architectural fixes: The real fix is redesigning your access patterns so popular items do not map to a single partition key. Add a random suffix, for example product:12345:A or product:12345:B, to spread reads across 10-100 partition keys. Your application handles the fan-out, but each partition now receives 1/10th to 1/100th of the load. Alternatively, cache aggressively at the application layer or API gateway. Serve the viral item from CloudFront, Cloudflare, or an in-memory cache for 30-60 seconds, and DynamoDB never sees the spike. For flash sales with known start times, pre-warm the cache before the sale begins.

Cassandra Compaction Storms

Cassandra compacts SSTables periodically to remove tombstones and overwrite dead data. Under heavy write workloads, compactions can fall behind. When compaction finally runs, it reads and rewrites large amounts of data, consuming disk I/O and CPU. This creates a feedback loop: writes accumulate faster, compaction falls further behind, and read performance degrades because the database is scanning past dead data.

The usual fixes: size your cluster for 50% headroom above peak write throughput. Use Size-Tiered Compaction Strategy (STCS) for read-heavy workloads or Time Window Compaction Strategy (TWCS) for time-series. Watch compaction queue depth and alert before it grows unbounded. When a compaction storm hits, reduce write throughput temporarily or add nodes to spread the I/O load.

What triggers compaction storms: Compaction kicks in when the number of SSTables on disk crosses a threshold. With STCS, that means 4 or more SSTables of similar size. Under heavy write load, Cassandra can produce SSTables faster than compaction can drain them. The tipping point sneaks up on you: if your write throughput stays above what compaction can handle, the queue grows until reads start scanning past dead data on every query. TWCS avoids this by only compacting within time windows, so SSTable counts stay bounded for time-series data.

How to detect it early: Watch CompactionTaskCount and PendingCompactions in nodetool cfstats. A growing pending compactions count is the first warning. DiskSpaceUsed anomalies help too — a sudden jump means compaction has started rewriting large amounts of data. Alert when pending compactions exceeds 10 for any table under write load. During the storm itself, nodetool tpstats shows the compaction thread pool queue depth growing without bound.

Why reducing writes helps: The fastest way to break the feedback loop is to cut write throughput by 30-50%. This gives compaction room to catch up. You can do this at the application layer (batch writes), at the Cassandra level (reduce concurrent_writes), or by adding nodes to spread the I/O across more disks. Adding nodes mid-storm is slower than cutting writes — the new node needs to bootstrap before it contributes.

MongoDB Shard Imbalance After Chunk Migration

MongoDB’s chunk balancer moves data between shards to maintain even distribution. During a migration, the source shard has reduced capacity for regular operations. If the balancer migrates too aggressively during peak traffic, client queries time out because the source shard is overloaded. After migration completes, the balancer may decide the distribution is still uneven and trigger another migration immediately.

Use sh.balancerWindow to schedule chunk migrations during off-peak windows. Set concurrentMergeThreads to limit compaction impact. Monitor mongos latency and shardChunkDistribution to catch imbalanced distributions before they degrade user-facing latency.

Why chunk migrations cause latency spikes: During a chunk migration, MongoDB streams documents from source to destination shard, then deletes from source. The source shard handles both regular traffic and the migration drain at the same time. With default settings, MongoDB allows up to 2 concurrent migrations per shard. Under load, each migration soaks up roughly 10-20 MB/s of network bandwidth between shards. If your shards are already near capacity, migration bandwidth fights with client query bandwidth for the same pipe, and source shard latency climbs.

How to detect an imbalanced cluster before it degrades users: Run sh.status() in the MongoDB shell to see chunk distribution across shards. Any shard with more than 30% of total chunks is the threshold worth investigating. The balancer logs its decisions in the config server’s log with entries like balance is already even, skipping. Repeated balancing rounds with no net improvement point to something blocking proper distribution: a missing index on the shard key, very large documents, or a hot spot that keeps chunking on one shard.

The migration latency window: A single 64 MB chunk migration takes 30-60 seconds on a healthy cluster with enough network bandwidth. The cutover from source to destination is atomic, but the final delete from source runs after the cutover, so there is a brief window where the same data lives on both shards. Under normal load this is invisible. Under high concurrency, a query landing on the destination during this window may not find the data yet.

The immediate fix: If you are already in a storm, sh.stopBalancer() halts all migrations immediately. It does not roll back in-progress migrations, but it stops new ones from starting. Then cut application load to give the overloaded shard breathing room. Once things stabilize, db.currentOp() shows long-running migrations and db.adminCommand({_flushChunkForSharding: ...}) lets you manually force distribution if needed.

Redis Dataset Eviction Under Memory Pressure

Redis is memory-resident by design. When Redis approaches its maxmemory limit, it evicts keys using the configured policy (LRU, LFU, TTL, or random). The application suddenly starts missing cache entries it expected to exist. If the application assumes data is always in Redis — not designed to handle cache misses — every miss hits the database, potentially causing a thundering herd on the backend.

Set maxmemory to 70-80% of available RAM to leave headroom for Redis internal operations. Watch mem_fragmentation_ratio — a value above 1.5 indicates memory fragmentation eating into available RAM. Build your application to handle cache misses gracefully with database fallbacks, not as a hard dependency.

Why eviction feels sudden: Redis does not evict until it actually hits maxmemory. But the tipping point is abrupt because the database uses memory in two places: user data and the allocator’s internal overhead. On a system with 100 GB RAM and maxmemory set to 80 GB, you might sit at 78 GB for weeks, then a batch job or traffic spike pushes you to 82 GB in minutes. When Redis crosses the limit, it evicts keys aggressively on every write — sometimes thousands per second — until usage drops below the threshold. LRU and LFU policies scan the entire keyspace to find the best eviction candidate, and that full scan itself causes a brief latency spike on the way down.

The four eviction policies and when each applies: noeviction rejects writes and is rarely useful in production. allkeys-lru is the most common choice for cache workloads — it removes the least recently used keys regardless of TTL. allkeys-lfu removes the least frequently used keys, which works better when you have popular items that should stay and unpopular items that should go. volatile-lru and volatile-lfu restrict eviction to keys with a TTL set, useful if you want to preserve keys without expiration. allkeys-random and volatile-random are for uniform sampling without tracking access patterns. The wrong policy for your workload can cause unnecessary cache misses — LRU when your hits are on recently written data makes sense; LFU when access frequencies follow a power law distribution is better.

Early warning signs: Watch evicted_keys and expired_keys in the Redis INFO output. Any eviction above zero means you are operating at or near the memory limit. Alert when used_memory exceeds 75% of maxmemory. Also watch mem_fragmentation_ratio — above 1.5 means your allocator has fragmented that 80 GB reservation into something consuming 120 GB of physical RAM, so your actual available headroom is much smaller than configured. MEMORY DEFRAG helps temporarily, but restarting Redis before fragmentation gets extreme is the reliable fix.

Network Partitions and Split-Brain Scenarios

When a network partition divides a Cassandra cluster, nodes in one partition can still talk to each other but not to the other partition. If your consistency level is ONE or LOCAL_ONE, both sides accept writes independently. When the partition heals, you have divergent data with no automatic way to determine which writes should win. This is not hypothetical — it has caused data loss in production systems that believed their configured consistency level was sufficient.

Use QUORUM consistency for critical data. Test network partition scenarios explicitly in staging. Run nodetool repair regularly to reconcile divergent data. For truly critical data, accept that NoSQL consistency guarantees have limits and design your application accordingly.

Common Pitfalls / Anti-Patterns

Using NoSQL without clear access patterns: NoSQL requires understanding your query patterns upfront. Without this, you either over-engineer or end up with hot spots and inefficient access.
Ignoring eventual consistency implications: Reading from replicas before writes propagate means you might get stale data. Consider your consistency requirements per operation.
Unbounded partition key growth: Time-based or counter-based partition keys create unbounded partitions that will haunt you. Use natural keys with fixed cardinality.
Over-reliance on secondary indexes: Secondary indexes often cause full cluster scans. Design your primary access patterns around partition keys instead.
Ignoring data modeling for access patterns: In relational DBs you model data. In NoSQL you model queries. Design tables for your query patterns, not your entity structure.
Skipping backup and recovery testing: Many NoSQL systems have complex backup mechanisms. Test recovery procedures regularly or you will lose data eventually.
Underestimating operational complexity: Each NoSQL system has its own operational quirks. Make sure your team has training and runbooks for your specific system.
Using wide partitions for “flexibility”: Stuffing too much data into single partitions kills performance. Keep partition size bounded.
Assuming linear scalability: Adding nodes does not instantly double your capacity. Rebalancing takes time and causes temporary load spikes you did not plan for.

Real-World Case Studies

Netflix runs one of the largest Cassandra deployments in the world. At peak they handle over 14 million concurrent streams, each backed by metadata queries to their Cassandra clusters. Their architecture uses Cassandra for its linear scalability — adding nodes directly increases throughput without redesigning queries or rebalancing clusters.

Their specific pattern: they partition data by user ID, so each user’s viewing history, preferences, and recommendations live on a small set of partitions. This gives them predictable latency per user and avoids the scatter-gather problem that kills performance when one query touches half the cluster. The lesson is that Cassandra rewards thoughtful partition design upfront. A hot partition — say, a popular show — can saturate a single node even in a 100-node cluster. Netflix mitigates this with virtual nodes and by spreading popular content across many partition keys.

Amazon DynamoDB was designed with a different constraint: partition the problem or lose availability. Their 2012 paper described how they partition by key range, with each partition responsible for a slice of the key space. If a partition receives more traffic than it can handle, DynamoDB splits it automatically. This adaptive splitting is what makes DynamoDB feel infinite — you never hit a ceiling, the service redistributes before you notice.

The tradeoff is that DynamoDB requires upfront access pattern planning. In DynamoDB you define your primary key structure and secondary indexes, and queries are efficient only within those structures. Ad-hoc queries that would be trivial in PostgreSQL require either scanning the entire table (expensive) or redesigning your indexes. This is the right tradeoff for Amazon’s use case — their services have well-defined access patterns. It is the wrong tradeoff for a startup that does not yet know how users will query their data.

Replication Topologies

How you replicate across regions determines what happens when partitions occur, how consistent your data is, and how much latency your writes tolerate.

Multi-Region Replication Topologies

Single-Leader Replication

One region is primary — all writes go there and replicate asynchronously to secondaries. Reads can come from any replica.

graph LR
    Primary["Primary Region<br/>US-East"] --> Replica1["Replica<br/>US-West"];
    Primary --> Replica2["Replica<br/>EU-West"];
    Client1["Client<br/>US-East"] --> Primary;
    Client2["Client<br/>US-West"] --> Replica1;

Simple model, strong write consistency, reads are fast locally. The downside: writes from other regions are slow since everything goes to the primary, and failover requires manual intervention. MongoDB replica sets and PostgreSQL streaming replication use this by default.

Multi-Leader Replication

Any region can accept writes and propagates them to all others. Write latency drops significantly for distributed users, but you now have conflict resolution problems — the same write can succeed differently in each region.

graph LR
    L1["Leader 1<br/>US-East"] <--> L2["Leader 2<br/>US-West"];
    L2 <--> L3["Leader 3<br/>EU-West"];
    L1 <--> L3;

CouchDB and Cassandra both support multi-datacenter replication. DynamoDB Global Tables use multi-leader replication.

Leaderless Replication

No primary at all — any replica accepts reads and writes. Quorum-based coordination keeps things consistent (or at least tunable).

graph LR
    Client --> R1["Replica 1<br/>US-East"];
    Client --> R2["Replica 2<br/>US-West"];
    Client --> R3["Replica 3<br/>EU-West"];
    R1 <--> R2;
    R2 <--> R3;
    R3 <--> R1;

No single point of failure, which sounds great until you try to tune quorum without hurting either latency or consistency. DynamoDB and Cassandra both use variants of this.

Consistency and Failover

Consistency vs Latency Trade-offs

Consistency Level	Writes	Reads	Latency Impact
Strong (quorum)	Wait for N/2+1	Wait for N/2+1	Highest
Local quorum	Wait for local quorum	Wait for local	Medium
Eventual (one)	Immediate local	May read stale	Lowest

Cassandra’s CONSISTENCY ONE returns immediately after local write. CONSISTENCY QUORUM waits for replicas across datacenters — adding cross-region latency.

Conflict Resolution Strategies

When writes can happen in multiple places, something has to decide which one wins:

Last-writer-wins (LWW): DynamoDB and Cassandra use timestamps — simple, but you can lose updates. Application-defined: Your code decides based on business logic — take the larger counter value, keep the more recent order, whatever makes sense for your domain. CRDTs: Data structures designed to merge without conflicts — sets, counters, registers that converge automatically regardless of order. Manual resolution: Flag conflicts for a human to sort out, which does not scale.

# Example: LWW conflict resolution
def resolve_conflict(local_value, remote_value, local_ts, remote_ts):
    return remote_value if remote_ts > local_ts else local_value

Cross-Region Failover Considerations

When an entire region disappears, what happens next depends on your topology:

Single-leader: Promote a replica in another region. Update DNS to point at the new primary. Multi-leader: Other regions keep accepting writes. When the failed region comes back, conflict resolution merges whatever diverged. Leaderless: Everything keeps working — reads and writes go to available regions. The failed region rejoins via hinted handoff or repair when it recovers.

Consistency Models and Data Modeling

Eventual Consistency vs Strong Consistency

NoSQL systems expose consistency as a dial, not a binary switch. The choice you make per operation determines latency, availability, and whether you read stale data.

flowchart TD
    subgraph "Consistency Spectrum"
        E["Eventual Consistency<br/>Writes propagate async<br/>Reads may return stale data"]
        S["Strong Consistency<br/>Writes sync to quorum<br/>Reads return latest write"]
    end
    E -->|Quorum reads/writes| S

Eventual consistency means reads may not immediately reflect the latest write. The window of inconsistency is typically milliseconds. DynamoDB, Cassandra, and CouchDB default to eventual consistency.

Strong consistency guarantees reads see all acknowledged writes. MongoDB replica sets with majority read concern, and Cassandra with quorum consistency, provide strong consistency.

The CAP theorem: during a partition, you choose between consistency (CP systems) or availability (AP systems). Most NoSQL systems are AP by default, tunable to CP via consistency levels.

The consistency spectrum in practice: Most NoSQL databases do not actually offer just two modes. Cassandra has six consistency levels (ONE, LOCAL_ONE, QUORUM, LOCAL_QUORUM, EACH_QUORUM, ALL). DynamoDB has three (Strong, Conditional, Eventual). MongoDB has five read concern levels. Each level sits somewhere on the latency-consistency curve. Writing with CONSISTENCY ONE in Cassandra is effectively eventual — the write succeeds as soon as the local node acknowledges it. Writing with CONSISTENCY QUORUM waits for a majority of replicas, adding cross-region latency if your cluster spans datacenters. Most reads can tolerate eventual consistency; critical writes like charging a credit card or decrementing inventory usually need quorum.

Stale reads in eventual consistency: With eventual consistency, a read can return data older than what you just wrote. Here is the scenario: you write key K with value “processed” to node N1. Before N1 replicates to N2, a client reads from N2 and gets the old value “pending”. The stale window is your replication lag — milliseconds normally, seconds during network congestion or node failures. Applications that read-then-write without re-reading after the write end up with read-before-write bugs. The fix is to read from a quorum of replicas, or to use a read repair mechanism that validates freshness on read.

Tunable consistency in DynamoDB: DynamoDB offers three consistency options per read. Eventual reads go to whichever replica responds first — fastest but potentially stale. Strong reads use a quorum-like mechanism that waits for writes to propagate before reading — adds roughly 10-20ms latency. Consistent reads in DynamoDB are per-item and only available when you specify a partition key and sort key together. The strong read path requires coordination across partitions, which is why DynamoDB does not default to it. For most use cases, eventual reads are fine. For financial or inventory operations, strong reads are worth the latency cost.

NoSQL Data Modeling Patterns

Materialized Path Pattern

Store the full path to enable efficient tree traversal.

# MongoDB: Materialized Path
{
  "_id": "electronics/tvs/large-tvs",
  "name": "65 inch TV",
  "path": "/electronics/tvs/large-tvs/",
  "ancestors": ["electronics", "electronics/tvs"]
}

# Query all descendants of electronics
db.categories.find({ path: /^electronics\// })

Tree Structure with Nested Sets

Alternative for read-heavy tree operations.

// Nested set model - efficient subtree reads
{
  "_id": "electronics",
  "left": 1,
  "right": 14
}
{
  "_id": "tvs",
  "left": 2,
  "right": 13
}

Tradeoff: updates require rebalancing the entire tree. Use materialized path for write-heavy workloads.

Document Versioning Pattern

Track change history without schema migrations.

// Current version
{ "_id": "doc-123", "version": 3, "data": {...} }

// Version history (separate collection)
{ "doc_id": "doc-123", "version": 1, "data": {...}, "updated_at": ... }
{ "doc_id": "doc-123", "version": 2, "data": {...}, "updated_at": ... }

Interview Questions

1. What are the main differences between MongoDB and DynamoDB when deciding which to use at scale?

MongoDB gives you flexible querying — ad-hoc filters, secondary indexes, aggregation pipelines — and you can change your schema without rewriting queries. DynamoDB requires you to plan access patterns upfront, and changing your key structure means migrating data or maintaining multiple indexes. At very small scale, MongoDB is simpler. At massive scale, DynamoDB's predictable performance and automatic partitioning win. The deciding factor is usually whether your access patterns are known and stable. If they are, DynamoDB avoids operational complexity. If they are not, MongoDB's flexibility saves you from constant schema migrations.

2. You are designing a Cassandra keyspace for a messaging application. What partition key would you use and why?

A user-centric partition key works well: (recipient_id, created_at) or (conversation_id, message_id). The goal is keeping related messages co-located on the same partition so reads for a conversation are a single partition query, not a scatter-gather across hundreds of nodes. The risk is unbounded partitions — a prolific user's conversation can grow without limit and eventually saturate a single node. The mitigation is bucketing: using (conversation_id, bucket_id, message_id) where bucket is a time window (e.g., monthly), so any single partition stays bounded. The partition key design is the most consequential decision in Cassandra — get it wrong and you either have hot spots or inefficient queries.

3. When would you choose Redis over a disk-based database for a given workload?

Redis when your working set fits in memory and you need sub-millisecond latency. Redis is single-threaded (with some parallelization for persistence) and O(1) for key operations, so latency is extremely consistent. It is the right choice for session storage, rate limiting, caching with TTLs, and leaderboards. It is the wrong choice when your data does not fit in memory (it will swap and performance collapses), when you need durability guarantees that survive crashes without careful appendfsync configuration, or when your access patterns require complex queries that Redis cannot serve.

4. Your team is considering a NewSQL database (CockroachDB or TiDB) over PostgreSQL with read replicas. What questions would you ask before making that decision?

The key question is whether you need distributed writes. If reads are the bottleneck, PostgreSQL replicas solve that problem cheaply. If writes need to scale horizontally across regions, NewSQL earns its complexity. Ask specifically: do you need multi-region write latency below what eventual-consistency NoSQL offers? Can you tolerate 2-4x write latency compared to single-node PostgreSQL? Is your team prepared for distributed SQL operational quirks — rebalancing, zone failures, distributed transaction contention? For most teams, PostgreSQL with proper indexing and connection pooling handles far more than they expect. NewSQL is for when you have exhausted PostgreSQL's vertical scalability and need geographic distribution with strong consistency.

5. Explain how consistent hashing works and why it reduces data movement when nodes are added or removed compared to traditional hash-based sharding.

Consistent hashing assigns data to nodes based on the position on a hash ring rather than modulo of the hash. Each node gets a position on the ring, and data is assigned to the first node clockwise from its hash. When a node is added, it only takes over the range from its predecessor — other nodes are unaffected. When a node fails, its range moves to the next node. Traditional modulo hashing requires rehashing nearly all keys when the number of shards changes. Consistent hashing minimizes remapping to roughly 1/n of keys for n nodes, making it ideal for distributed caches and some NoSQL databases like Cassandra's virtual nodes.

6. How would you design a partition key for a Cassandra table that stores user activity events to avoid hot partitions while keeping related data co-located?

Design the partition key around the most frequent query pattern. For user activity events, partitioning by user_id keeps all a single user's activities on one partition — efficient for "get all activities for this user." The risk is unbounded growth if one user generates millions of events. The solution is bucketing: use (user_id, day_bucket, event_id) as a composite key, where day_bucket is a date or week number. This bounds partition size while keeping a user's data relatively co-located for time-range queries. Always consider cardinality — a partition key with too many possible values creates many small partitions; too few creates hotspots.

7. What are the trade-offs between single-leader, multi-leader, and leaderless replication topologies, and when would you choose each?

Single-leader is simplest — writes go to one region and replicate asynchronously. Choose it when writes are geographically concentrated or when you need strong consistency. Multi-leader allows writes in any region — choose it when users are globally distributed and write latency matters more than conflict complexity. Leaderless avoids a primary entirely, using quorum for reads and writes — choose it for maximum availability when you can tolerate eventual consistency. The failure mode differs significantly: single-leader requires failover management; multi-leader risks write conflicts that need resolution; leaderless requires careful quorum tuning to avoid split-brain scenarios.

8. Describe how you would migrate a large MySQL workload to MongoDB while maintaining application availability during the cutover.

Use a change-data-capture approach: keep both systems running in parallel. First, run a historical backfill from MySQL to MongoDB — do this in batches to avoid locking. Enable change streams or binlog capture to capture incremental changes during the backfill. Run dual-write during the transition period, reading from MongoDB for verified data. DNS or feature flag routes traffic to MongoDB incrementally, starting with read-only traffic. Rollback capability is critical — keep MySQL accessible until MongoDB reads are verified. Monitor for query pattern mismatches — MongoDB's aggregation pipeline and indexing work differently than SQL JOINs.

9. How does the BASE model differ from ACID, and what types of applications are better suited for each?

ACID prioritizes consistency — transactions are isolated, all-or-nothing, and durable immediately. BASE prioritizes availability and partition tolerance, accepting that data will be eventually consistent. ACID fits applications where correctness is paramount: financial transactions, inventory management, anything where reading stale data causes business problems. BASE fits high-throughput, globally distributed applications where temporary inconsistency is acceptable: social media feeds, analytics, caching layers, IoT sensor data. Most production systems use both — ACID for core business logic, BASE for auxiliary data stores and caches.

10. Your DynamoDB table is experiencing throttling on a specific partition despite having low overall table utilization. What could be causing this and how would you diagnose it?

DynamoDB partitions are independent — each has its own 1000 WCU and 3000 RCU limit. Throttling on a single partition indicates a hot partition key. Common causes: using a low-cardinality attribute as the partition key (e.g., status codes, geographic regions), viral content hitting a single item's access patterns, or skewed secondary indexes. Diagnose with CloudWatch partition metrics: GetItemThrottledRequests and QueryScannedCount per partition. Solutions: introduce a high-cardinality suffix to the partition key, scatter-gather across multiple keys and aggregate in application, or redesign the access pattern to distribute load. DynamoDB adaptive capacity provides temporary relief but cannot fix fundamentally skewed access patterns.

11. What are CRDTs and how do they solve conflict resolution in distributed databases without coordination?

CRDTs (Conflict-free Replicated Data Types) are data structures mathematically designed to merge concurrent changes without coordination between nodes. Operations are commutative and idempotent, so the order of application does not matter. Examples: G-Counters (merge by taking max of each node's count), LWW-Registers (last-writer-wins based on timestamp), OR-Sets (adds and removes tracked separately). CRDTs enable availability-first architectures — writes succeed locally even during network partitions, and state converges automatically when nodes communicate. The trade-off is memory overhead and the types of data that can be represented. Not all data structures have CRDT equivalents — arbitrary linked data structures cannot be represented as CRDTs without significant compromises.

12. How would you approach capacity planning for a Cassandra cluster handling 50,000 writes per second with 2TB of data growing at 100GB per day?

Start with write throughput: 50,000 writes/second at typical 1KB per write = ~50MB/s write throughput. Each Cassandra node handles roughly 50MB/s with proper tuning on NVMe SSD. Plan for 3x growth headroom — capacity planning for 150,000 writes/second means minimum 3-4 data nodes plus replication factor. For storage: 100GB/day growth with RF=3 means 300GB/day of storage consumption. Over 3 years that is ~110TB raw storage — plan accordingly. Network matters: cross-datacenter replication doubles network egress. Model failure scenarios: losing one node with RF=3 still meets quorum. Leave 50% capacity headroom for compaction and repairs. Consider using Time Window Compaction Strategy for time-series workloads to manage tombstones efficiently.

13. Compare the operational complexity of running self-hosted Cassandra versus using a managed service like DataStax Astra or Amazon Keyspaces.

Self-hosted Cassandra gives you full control: you manage node provisioning, configuration tuning, repairs, compactions, and failure handling. Operational complexity is significant — you need engineers who understand Cassandra internals, repair schedules, and compaction strategies. DataStax Astra reduces this burden dramatically (30-50% cost premium) but has limitations on custom configurations and can hit scaling ceilings at extreme throughput. Amazon Keyspaces is serverless and fully managed but uses a Cassandra-compatible API with limited tuning options and higher per-operation costs at scale. For most teams, managed Cassandra (Astra) is the right balance. Self-hosted makes sense when you have dedicated ops expertise, need extreme customization, or are running at such scale that managed pricing becomes prohibitive.

14. What monitoring metrics are most critical for detecting impending failures in a MongoDB replica set before they cause downtime?

Watch these early warning indicators: replication lag — if secondary lag exceeds your consistency window, reads may return stale data and writes are at risk if primary fails. Connection pool utilization — approaching 100% means application requests queue and latency spikes. Disk IOPS and queue depth — when disk cannot keep up with write workload, operations back up. Memory pressure — MongoDB's wiredTiger cache should fit in RAM; swapping causes catastrophic latency spikes. Lock contention — high lock wait times indicate write congestion. Network metrics — packet loss, latency variance, and connection resets between replica set members. Oplog size and generation rate — if oplog is filling faster than secondaries can apply, lag grows. Set alerts at 70% of thresholds that would cause failures, not at failure points.

15. What are the most common causes of data loss in NoSQL databases and how would you prevent or recover from each?

The most common causes: accidental DELETE or DROP operations without proper backups — prevention is point-in-time recovery enabled and tested restores. Replication lag losing writes during failover — use proper write concern levels and monitor lag aggressively. Corrupt on-disk data from disk failures or kernel bugs —prevention is RAID, checksums, and regular repair commands (nodetool repair in Cassandra, db.repairDatabase() in MongoDB). Hot partition overload causing throttling and dropped requests — fix partition key design and use adaptive capacity as a backstop. Network partitions causing divergent data — use quorum consistency and conflict resolution. Accidental batch deletes with TTLs — model TTLs carefully and test with small batches first. Recovery from each: point-in-time restore from backups for catastrophic loss; distributed repair for divergent replicas; failover to new primary for failed nodes.

16. How does the choice between synchronous and asynchronous replication affect consistency, availability, and latency in a distributed database?

Synchronous replication writes wait for acknowledgment from all replicas before responding to the client. This guarantees strong consistency — every read sees writes that have been confirmed — but adds write latency proportional to the slowest replica. If a replica is slow or unreachable, writes block or fail. Asynchronous replication returns success immediately after the primary writes, propagating to replicas in the background. Write latency is low, but reads may see stale data until replication catches up. Availability is higher since writes never wait for slow replicas. The tradeoff: synchronous suits environments where consistency matters more than write latency (financial transactions); asynchronous suits environments where write throughput and availability matter more than immediate consistency (social feeds, analytics). Most production systems use semi-synchronous approaches — wait for a quorum of replicas but not all.

17. What are tombstones in Cassandra and MongoDB, why do they cause problems at scale, and how do you manage them?

Tombstones are deletion markers — when you delete a record in Cassandra or MongoDB, the data is marked as deleted but not immediately removed. This allows replication to propagate deletes reliably across nodes. The problem: during reads, Cassandra must scan past tombstones to find live data. If a partition accumulates many tombstones (from time-series data with deletes, or high churn), reads can become slow — reading nothing but ghosts. At scale, this is called "tombstone rain" and can cause timeouts even though the partition is logically empty. Mitigation: use TTLs instead of manual deletes where possible so expired data is purged automatically. Use Time Window Compaction Strategy (TWCS) in Cassandra for time-series workloads — it compacts and drops tombstones together with the data window they cover. Avoid wide partitions with many deleted rows. Run nodetool garbagecollect to force tombstone cleanup on specific partitions.

18. Describe the differences between MongoDB's WiredTiger storage engine and the in-memory engine. When would you choose one over the other?

WiredTiger is the default storage engine for MongoDB — it is disk-based with LSM tree indexes optimized for write-heavy workloads. It uses compression (Snappy or Zstd) to fit more data in RAM and reduces disk I/O. It supports document-level concurrency (multicutx-transactions), which is more efficient than collection-level locking in MMAPv1. The in-memory engine stores all data in RAM with optional persistence of a journal — no disk I/O for reads or writes. It provides the lowest possible latency and predictable performance, but is limited by RAM size and is significantly more expensive at scale. Choose WiredTiger for production workloads where data volume exceeds RAM, where you need compression to reduce costs, and where writes dominate. Choose in-memory engine for latency-sensitive workloads where all active data fits comfortably in RAM and the operational simplicity of predictable latency outweighs the cost premium — caching layers, session stores, real-time analytics dashboards.

19. What is the difference between optimistic locking and pessimistic locking in the context of NoSQL databases? Give an example of when each applies.

Optimistic locking assumes conflicts are rare and lets transactions proceed without acquiring locks — it detects conflicts at commit time by checking a version number or timestamp. If the version changed since you read the data, the transaction aborts and you retry. Pessimistic locking assumes conflicts are likely and acquires a lock before reading or modifying data — other transactions block waiting for the lock. Optimistic locking in NoSQL: MongoDB's findAndModify with a version field; DynamoDB's conditional writes. Use when conflicts are rare, throughput matters more than latency, and clients can retry. Pessimistic locking: Redis WATCH/MULTI/EXEC for critical sections; Cassandra's Lightweight Transactions (LWT) using IF conditions. Use when conflicts are frequent, the cost of a conflict is high (overwriting financial data), and you need guaranteed exclusive access. The tradeoff: optimistic locking has lower latency but more retries under contention; pessimistic locking has higher latency but fewer retries.

20. What is the "dual write" problem in distributed databases, and how do you solve it when migrating between data stores or maintaining multiple data sources?

The dual write problem occurs when your application writes to two data systems simultaneously — for example, writing to PostgreSQL and Elasticsearch at the same time. If the first write succeeds and the second fails, your data stores diverge with no easy way to know which is correct. You cannot easily roll back the first write, and detecting the inconsistency requires a full reconciliation scan. Solutions: change data capture (CDC) — instead of writing to both systems, write to one and use a stream processor (Kafka Connect, Debezium) to replicate changes to the second. This separates the write path from the replication path and guarantees at-least-once delivery. Transactional outbox pattern — write to a local outbox table within the same transaction as your primary write, then use a relay process to publish outbox entries to the second system. This guarantees exactly-once semantics within the transaction boundary. The relay tracks its position in the outbox so it resumes correctly after failures. Both approaches avoid the "write-then-fail-silently" failure mode of naive dual writes.

Conclusion

Key takeaways:

NoSQL databases are purpose-built for specific access patterns and scale requirements
CAP theorem trade-offs force you to choose between consistency and availability during partitions
Document databases work well for flexible schemas and embedded data patterns
Key-value stores give you extreme speed for simple lookups
Column-family databases handle write-heavy workloads with predictable queries
Graph databases are built for relationship traversal workloads

Copy/Paste Checklist:

# MongoDB connection with monitoring
from pymongo import MongoClient
client = MongoClient("mongodb://user:pass@host:27017/?replicaSet=rs0")
db = client.admin
# Check replica set status
print(db.command('replSetGetStatus'))

# Cassandra connection with monitoring
from cassandra.cluster import Cluster
cluster = Cluster(['host1', 'host2', 'host3'])
session = cluster.connect('mykeyspace')
# Check cluster health
session.execute("SELECT * FROM system.local")
session.execute("SELECT * FROM system.peers")

# Redis security and monitoring
redis-cli INFO clients  # Client connections
redis-cli INFO stats     # Command statistics
redis-cli CONFIG GET *   # Configuration audit

Observability Checklist

Metrics to Monitor:

Track request latency (p50, p95, p99) by operation type and request throughput (reads/writes per second). Watch node availability and cluster health, disk usage per node and cluster-wide, and network I/O between nodes. Replication lag and pending writes matter for consistency. Compaction progress and queue depth affect performance unpredictably. If you use caching, track cache hit ratios. Connection pool utilization catches exhaustion before it becomes an outage.

Logs to Capture:

Log node failures and restarts, compaction events with I/O patterns, authentication failures and denied permissions. Slow query logs (set a threshold that makes sense for your workload) are essential. For JVM-based systems like Cassandra, GC pauses and heap pressure show up at the worst times. Network partition events are worth capturing for post-mortems.

Alerts to Set:

Alert on node down or cluster minority partition immediately. Set thresholds for latency spikes that exceed your SLA, disk usage above 80% on any node, and replication lag beyond your consistency requirements. Watch for failed request rate increases and compaction queue depth spiking. Connection pool exhaustion will take down your app before you notice anything else.

# Example: Cassandra nodetool commands for monitoring
# nodetool status           # Cluster health and node status
# nodetool tpstats          # Thread pool statistics
# nodetool cfstats          # Keyspace and table statistics
# nodetool proxyhistograms  # Client request latency