NoSQL Databases: Document, Key-Value, Column-Family, Graph

Explore NoSQL database types, CAP theorem implications, and when to choose MongoDB, Cassandra, DynamoDB, or graph databases over relational systems.

published: reading time: 32 min read author: GeekWorkBench

NoSQL Databases: Document, Key-Value, Column-Family, and Graph Databases

Introduction

NoSQL databases emerged to solve problems relational systems handle poorly. Some data is naturally document-structured, not tabular. Some access patterns require single-key lookups at massive scale. Some workloads need graph traversal rather than filtered aggregations.

The common thread is flexibility. NoSQL systems typically trade away features to get it. You might lose ACID transactions across documents, or find yourself deciding between consistency and availability. Understanding these trade-offs matters more than the marketing around any particular database.

Topic-Specific Deep Dives

Sharding Fundamentals

Sharding Patterns Deep Dive

Sharding strategies differ more than the marketing suggests. Here is how they actually work.

Hash-Based Sharding

The basic formula: shard_key = hash(primary_key) % num_shards. This distributes data evenly and avoids hotspots, but range queries scatter across every shard — expensive at scale.

# Application-level hash sharding
def get_shard(key, num_shards=16):
    return abs(hash(key)) % num_shards

shards = [connect(f"shard_{i}.db") for i in range(num_shards)]
key = "user:12345"
shard = shards[get_shard(key)]

MongoDB and Cassandra handle hash-based sharding automatically. MongoDB uses a variant called “consistent hashing” with chunk-based reassignment. Cassandra uses the partitioner’s hash output to assign tokens to nodes.

Consistent Hashing

The idea is to minimize shuffling when nodes join or leave. Instead of modulo on the hash, data goes to the first node clockwise on the ring.

graph LR
    subgraph "Hash Ring"
        H1["hash(A)"];
        H2["hash(B)"];
        H3["hash(C)"];
        H4["hash(D)"];
    end
    A["Node A<br/>Range: H1→H2"] --> B["Node B<br/>Range: H2→H3"];
    B --> C["Node C<br/>Range: H3→H4"];
    C --> D["Node D<br/>Range: H4→H1"];
    D --> A;

When a node fails, only its range moves to the next node. Adding a node takes ranges from neighbors rather than rehashing everything.

Advanced Sharding Strategies

Range-Based Sharding

Partition by key ranges instead of hash. This makes range queries within a partition efficient, but you risk hotspots if your access patterns correlate with the shard key.

Cassandra excels with range-based sharding using partition keys. A time-series table might shard by month, enabling efficient queries for a given time window:

-- Cassandra: range sharding for time-series
CREATE TABLE sensor_readings (
    device_id uuid,
    month text,  -- '2026-03'
    reading_time timestamp,
    temperature float,
    PRIMARY KEY ((device_id, month), reading_time)
);

Application-Level Sharding

Some systems push sharding entirely to the application. Redis Cluster and memcached use client-side routing — the application figures out which node owns a key and talks to it directly.

# Application-level sharding with Redis Cluster-like routing
class ShardedRedis:
    def __init__(self, nodes):
        self.nodes = nodes
        self.slot_map = self._build_slot_map(16384, nodes)

    def _slot_for_key(self, key):
        return crc16(key) % 16384

    def get(self, key):
        slot = self._slot_for_key(key)
        node = self._node_for_slot(slot)
        return node.get(key)

Sharding Anti-Patterns

Hot partition keys: If you shard by something low-cardinality — timestamps, status codes, a popular entity — one partition absorbs all the traffic. A celebrity user’s ID as partition key will saturate a single shard even in a 100-node cluster.

Unbounded partitions: Time-based or counter-based keys without bucketing grow without limit. In Cassandra, a single conversation with no end date eventually exceeds memory limits. Always bucket your time series.

Cross-partition transactions: When an operation touches multiple shards, you need distributed coordination — slower, more failure-prone. Design queries to keep related data on the same partition.

Trade-off Analysis

When Relational Still Wins

NoSQL is not always the answer — and honestly, most of the time it probably is not.

If your data fits a tabular model with clear relationships, relational databases offer more features and better tooling. ACID transactions across multiple tables are hard to match. SQL provides powerful querying for analytical workloads.

Operational simplicity matters. Most developers know SQL. Most teams can debug a MySQL or PostgreSQL issue quickly. The ecosystem around relational databases is mature.

NoSQL databases often require more operational expertise. You need to understand the CAP trade-offs, figure out capacity planning for sharding, and learn operational procedures specific to whichever system you chose.

Start with relational unless you have specific reasons not to. The flexibility argument for NoSQL is often overstated.

When to Use / When Not to Use

When to Use NoSQL Databases:

  • Document databases: Your data is document-structured, varies in schema, or benefits from embedding related data
  • Key-value stores: You need simple lookups at extreme speed for caching or sessions
  • Column-family databases: You have write-heavy workloads with predictable query patterns
  • Graph databases: Relationship traversal is the core of your workload
  • You need horizontal write scaling beyond what relational databases offer
  • Your access patterns are simple and predictable at scale

When Not to Use NoSQL Databases:

  • You need ACID transactions across multiple documents or entities
  • Your data has complex relationships requiring multi-table joins
  • You need ad-hoc querying capabilities or powerful aggregation
  • Your team lacks operational expertise for the specific NoSQL system
  • You are using the “flexibility” argument without specific requirements

Production Failure Scenarios

Actual production incidents teach you more than any architecture diagram. Here is how real systems break.

DynamoDB Thundering Herd

DynamoDB adaptive capacity handles uneven access patterns, but only up to a point. If a popular item suddenly goes viral — a flash sale, viral post, or celebrity endorsement — the partition serving that item receives 100x normal traffic. Adaptive capacity splits that partition, but splitting takes time. During the split window, requests throttle. DynamoDB returns ProvisionedThroughputExceededException. Your application retries with exponential backoff, which queues more requests, which makes the overload worse.

For thundering herd, scatter-gather helps: cache popular items at the application layer, distribute reads across multiple keys that all map to the same underlying item, or add a random suffix to partition keys to spread load. Do not rely on DynamoDB adaptive capacity as your primary strategy — it is a backstop.

Cassandra Compaction Storms

Cassandra compacts SSTables periodically to remove tombstones and overwrite dead data. Under heavy write workloads, compactions can fall behind. When compaction finally runs, it reads and rewrites large amounts of data, consuming disk I/O and CPU. This creates a feedback loop: writes accumulate faster, compaction falls further behind, and read performance degrades because the database is scanning past dead data.

The usual fixes: size your cluster for 50% headroom above peak write throughput. Use Size-Tiered Compaction Strategy (STCS) for read-heavy workloads or Time Window Compaction Strategy (TWCS) for time-series. Watch compaction queue depth and alert before it grows unbounded. When a compaction storm hits, reduce write throughput temporarily or add nodes to spread the I/O load.

MongoDB Shard Imbalance After Chunk Migration

MongoDB’s chunk balancer moves data between shards to maintain even distribution. During a migration, the source shard has reduced capacity for regular operations. If the balancer migrates too aggressively during peak traffic, client queries time out because the source shard is overloaded. After migration completes, the balancer may decide the distribution is still uneven and trigger another migration immediately.

Use sh.balancerWindow to schedule chunk migrations during off-peak windows. Set concurrentMergeThreads to limit compaction impact. Monitor mongos latency and shardChunkDistribution to catch imbalanced distributions before they degrade user-facing latency.

Redis Dataset Eviction Under Memory Pressure

Redis is memory-resident by design. When Redis approaches its maxmemory limit, it evicts keys using the configured policy (LRU, LFU, TTL, or random). The application suddenly starts missing cache entries it expected to exist. If the application assumes data is always in Redis — not designed to handle cache misses — every miss hits the database, potentially causing a thundering herd on the backend.

Set maxmemory to 70-80% of available RAM to leave headroom for Redis internal operations. Watch mem_fragmentation_ratio — a value above 1.5 indicates memory fragmentation eating into available RAM. Build your application to handle cache misses gracefully with database fallbacks, not as a hard dependency.

Network Partitions and Split-Brain Scenarios

When a network partition divides a Cassandra cluster, nodes in one partition can still talk to each other but not to the other partition. If your consistency level is ONE or LOCAL_ONE, both sides accept writes independently. When the partition heals, you have divergent data with no automatic way to determine which writes should win. This is not hypothetical — it has caused data loss in production systems that believed their configured consistency level was sufficient.

Use QUORUM consistency for critical data. Test network partition scenarios explicitly in staging. Run nodetool repair regularly to reconcile divergent data. For truly critical data, accept that NoSQL consistency guarantees have limits and design your application accordingly.

Common Pitfalls / Anti-Patterns

  1. Using NoSQL without clear access patterns: NoSQL requires understanding your query patterns upfront. Without this, you either over-engineer or end up with hot spots and inefficient access.

  2. Ignoring eventual consistency implications: Reading from replicas before writes propagate means you might get stale data. Consider your consistency requirements per operation.

  3. Unbounded partition key growth: Time-based or counter-based partition keys create unbounded partitions that will haunt you. Use natural keys with fixed cardinality.

  4. Over-reliance on secondary indexes: Secondary indexes often cause full cluster scans. Design your primary access patterns around partition keys instead.

  5. Ignoring data modeling for access patterns: In relational DBs you model data. In NoSQL you model queries. Design tables for your query patterns, not your entity structure.

  6. Skipping backup and recovery testing: Many NoSQL systems have complex backup mechanisms. Test recovery procedures regularly or you will lose data eventually.

  7. Underestimating operational complexity: Each NoSQL system has its own operational quirks. Make sure your team has training and runbooks for your specific system.

  8. Using wide partitions for “flexibility”: Stuffing too much data into single partitions kills performance. Keep partition size bounded.

  9. Assuming linear scalability: Adding nodes does not instantly double your capacity. Rebalancing takes time and causes temporary load spikes you did not plan for.

Real-World Case Studies

Netflix runs one of the largest Cassandra deployments in the world. At peak they handle over 14 million concurrent streams, each backed by metadata queries to their Cassandra clusters. Their architecture uses Cassandra for its linear scalability — adding nodes directly increases throughput without redesigning queries or rebalancing clusters.

Their specific pattern: they partition data by user ID, so each user’s viewing history, preferences, and recommendations live on a small set of partitions. This gives them predictable latency per user and avoids the scatter-gather problem that kills performance when one query touches half the cluster. The lesson is that Cassandra rewards thoughtful partition design upfront. A hot partition — say, a popular show — can saturate a single node even in a 100-node cluster. Netflix mitigates this with virtual nodes and by spreading popular content across many partition keys.

Amazon DynamoDB was designed with a different constraint: partition the problem or lose availability. Their 2012 paper described how they partition by key range, with each partition responsible for a slice of the key space. If a partition receives more traffic than it can handle, DynamoDB splits it automatically. This adaptive splitting is what makes DynamoDB feel infinite — you never hit a ceiling, the service redistributes before you notice.

The tradeoff is that DynamoDB requires upfront access pattern planning. In DynamoDB you define your primary key structure and secondary indexes, and queries are efficient only within those structures. Ad-hoc queries that would be trivial in PostgreSQL require either scanning the entire table (expensive) or redesigning your indexes. This is the right tradeoff for Amazon’s use case — their services have well-defined access patterns. It is the wrong tradeoff for a startup that does not yet know how users will query their data.

Replication Topologies

How you replicate across regions determines what happens when partitions occur, how consistent your data is, and how much latency your writes tolerate.

Multi-Region Replication Topologies

Single-Leader Replication

One region is primary — all writes go there and replicate asynchronously to secondaries. Reads can come from any replica.

graph LR
    Primary["Primary Region<br/>US-East"] --> Replica1["Replica<br/>US-West"];
    Primary --> Replica2["Replica<br/>EU-West"];
    Client1["Client<br/>US-East"] --> Primary;
    Client2["Client<br/>US-West"] --> Replica1;

Simple model, strong write consistency, reads are fast locally. The downside: writes from other regions are slow since everything goes to the primary, and failover requires manual intervention. MongoDB replica sets and PostgreSQL streaming replication use this by default.

Multi-Leader Replication

Any region can accept writes and propagates them to all others. Write latency drops significantly for distributed users, but you now have conflict resolution problems — the same write can succeed differently in each region.

graph LR
    L1["Leader 1<br/>US-East"] <--> L2["Leader 2<br/>US-West"];
    L2 <--> L3["Leader 3<br/>EU-West"];
    L1 <--> L3;

CouchDB and Cassandra both support multi-datacenter replication. DynamoDB Global Tables use multi-leader replication.

Leaderless Replication

No primary at all — any replica accepts reads and writes. Quorum-based coordination keeps things consistent (or at least tunable).

graph LR
    Client --> R1["Replica 1<br/>US-East"];
    Client --> R2["Replica 2<br/>US-West"];
    Client --> R3["Replica 3<br/>EU-West"];
    R1 <--> R2;
    R2 <--> R3;
    R3 <--> R1;

No single point of failure, which sounds great until you try to tune quorum without hurting either latency or consistency. DynamoDB and Cassandra both use variants of this.

Consistency and Failover

Consistency vs Latency Trade-offs

Consistency LevelWritesReadsLatency Impact
Strong (quorum)Wait for N/2+1Wait for N/2+1Highest
Local quorumWait for local quorumWait for localMedium
Eventual (one)Immediate localMay read staleLowest

Cassandra’s CONSISTENCY ONE returns immediately after local write. CONSISTENCY QUORUM waits for replicas across datacenters — adding cross-region latency.

Conflict Resolution Strategies

When writes can happen in multiple places, something has to decide which one wins:

Last-writer-wins (LWW): DynamoDB and Cassandra use timestamps — simple, but you can lose updates. Application-defined: Your code decides based on business logic — take the larger counter value, keep the more recent order, whatever makes sense for your domain. CRDTs: Data structures designed to merge without conflicts — sets, counters, registers that converge automatically regardless of order. Manual resolution: Flag conflicts for a human to sort out, which does not scale.

# Example: LWW conflict resolution
def resolve_conflict(local_value, remote_value, local_ts, remote_ts):
    return remote_value if remote_ts > local_ts else local_value

Cross-Region Failover Considerations

When an entire region disappears, what happens next depends on your topology:

Single-leader: Promote a replica in another region. Update DNS to point at the new primary. Multi-leader: Other regions keep accepting writes. When the failed region comes back, conflict resolution merges whatever diverged. Leaderless: Everything keeps working — reads and writes go to available regions. The failed region rejoins via hinted handoff or repair when it recovers.

Consistency Models and Data Modeling

Eventual Consistency vs Strong Consistency

Understanding consistency models is critical for NoSQL systems.

flowchart TD
    subgraph "Consistency Spectrum"
        E["Eventual Consistency<br/>Writes propagate async<br/>Reads may return stale data"]
        S["Strong Consistency<br/>Writes sync to quorum<br/>Reads return latest write"]
    end
    E -->|Quorum reads/writes| S

Eventual consistency means reads may not immediately reflect the latest write. The window of inconsistency is typically milliseconds. DynamoDB, Cassandra, and CouchDB default to eventual consistency.

Strong consistency guarantees reads see all acknowledged writes. MongoDB replica sets with majority read concern, and Cassandra with quorum consistency, provide strong consistency.

The CAP theorem: during a partition, you choose between consistency (CP systems) or availability (AP systems). Most NoSQL systems are AP by default, tunable to CP via consistency levels.

NoSQL Data Modeling Patterns

Materialized Path Pattern

Store the full path to enable efficient tree traversal.

# MongoDB: Materialized Path
{
  "_id": "electronics/tvs/large-tvs",
  "name": "65 inch TV",
  "path": "/electronics/tvs/large-tvs/",
  "ancestors": ["electronics", "electronics/tvs"]
}

# Query all descendants of electronics
db.categories.find({ path: /^electronics\// })

Tree Structure with Nested Sets

Alternative for read-heavy tree operations.

// Nested set model - efficient subtree reads
{
  "_id": "electronics",
  "left": 1,
  "right": 14
}
{
  "_id": "tvs",
  "left": 2,
  "right": 13
}

Tradeoff: updates require rebalancing the entire tree. Use materialized path for write-heavy workloads.

Document Versioning Pattern

Track change history without schema migrations.

// Current version
{ "_id": "doc-123", "version": 3, "data": {...} }

// Version history (separate collection)
{ "doc_id": "doc-123", "version": 1, "data": {...}, "updated_at": ... }
{ "doc_id": "doc-123", "version": 2, "data": {...}, "updated_at": ... }

Interview Questions

1. What are the main differences between MongoDB and DynamoDB when deciding which to use at scale?

MongoDB gives you flexible querying — ad-hoc filters, secondary indexes, aggregation pipelines — and you can change your schema without rewriting queries. DynamoDB requires you to plan access patterns upfront, and changing your key structure means migrating data or maintaining multiple indexes. At very small scale, MongoDB is simpler. At massive scale, DynamoDB's predictable performance and automatic partitioning win. The deciding factor is usually whether your access patterns are known and stable. If they are, DynamoDB avoids operational complexity. If they are not, MongoDB's flexibility saves you from constant schema migrations.

2. You are designing a Cassandra keyspace for a messaging application. What partition key would you use and why?

A user-centric partition key works well: (recipient_id, created_at) or (conversation_id, message_id). The goal is keeping related messages co-located on the same partition so reads for a conversation are a single partition query, not a scatter-gather across hundreds of nodes. The risk is unbounded partitions — a prolific user's conversation can grow without limit and eventually saturate a single node. The mitigation is bucketing: using (conversation_id, bucket_id, message_id) where bucket is a time window (e.g., monthly), so any single partition stays bounded. The partition key design is the most consequential decision in Cassandra — get it wrong and you either have hot spots or inefficient queries.

3. When would you choose Redis over a disk-based database for a given workload?

Redis when your working set fits in memory and you need sub-millisecond latency. Redis is single-threaded (with some parallelization for persistence) and O(1) for key operations, so latency is extremely consistent. It is the right choice for session storage, rate limiting, caching with TTLs, and leaderboards. It is the wrong choice when your data does not fit in memory (it will swap and performance collapses), when you need durability guarantees that survive crashes without careful appendfsync configuration, or when your access patterns require complex queries that Redis cannot serve.

4. Your team is considering a NewSQL database (CockroachDB or TiDB) over PostgreSQL with read replicas. What questions would you ask before making that decision?

The key question is whether you need distributed writes. If reads are the bottleneck, PostgreSQL replicas solve that problem cheaply. If writes need to scale horizontally across regions, NewSQL earns its complexity. Ask specifically: do you need multi-region write latency below what eventual-consistency NoSQL offers? Can you tolerate 2-4x write latency compared to single-node PostgreSQL? Is your team prepared for distributed SQL operational quirks — rebalancing, zone failures, distributed transaction contention? For most teams, PostgreSQL with proper indexing and connection pooling handles far more than they expect. NewSQL is for when you have exhausted PostgreSQL's vertical scalability and need geographic distribution with strong consistency.

5. Explain how consistent hashing works and why it reduces data movement when nodes are added or removed compared to traditional hash-based sharding.

Consistent hashing assigns data to nodes based on the position on a hash ring rather than modulo of the hash. Each node gets a position on the ring, and data is assigned to the first node clockwise from its hash. When a node is added, it only takes over the range from its predecessor — other nodes are unaffected. When a node fails, its range moves to the next node. Traditional modulo hashing requires rehashing nearly all keys when the number of shards changes. Consistent hashing minimizes remapping to roughly 1/n of keys for n nodes, making it ideal for distributed caches and some NoSQL databases like Cassandra's virtual nodes.

6. How would you design a partition key for a Cassandra table that stores user activity events to avoid hot partitions while keeping related data co-located?

Design the partition key around the most frequent query pattern. For user activity events, partitioning by user_id keeps all a single user's activities on one partition — efficient for "get all activities for this user." The risk is unbounded growth if one user generates millions of events. The solution is bucketing: use (user_id, day_bucket, event_id) as a composite key, where day_bucket is a date or week number. This bounds partition size while keeping a user's data relatively co-located for time-range queries. Always consider cardinality — a partition key with too many possible values creates many small partitions; too few creates hotspots.

7. What are the trade-offs between single-leader, multi-leader, and leaderless replication topologies, and when would you choose each?

Single-leader is simplest — writes go to one region and replicate asynchronously. Choose it when writes are geographically concentrated or when you need strong consistency. Multi-leader allows writes in any region — choose it when users are globally distributed and write latency matters more than conflict complexity. Leaderless avoids a primary entirely, using quorum for reads and writes — choose it for maximum availability when you can tolerate eventual consistency. The failure mode differs significantly: single-leader requires failover management; multi-leader risks write conflicts that need resolution; leaderless requires careful quorum tuning to avoid split-brain scenarios.

8. Describe how you would migrate a large MySQL workload to MongoDB while maintaining application availability during the cutover.

Use a change-data-capture approach: keep both systems running in parallel. First, run a historical backfill from MySQL to MongoDB — do this in batches to avoid locking. Enable change streams or binlog capture to capture incremental changes during the backfill. Run dual-write during the transition period, reading from MongoDB for verified data. DNS or feature flag routes traffic to MongoDB incrementally, starting with read-only traffic. Rollback capability is critical — keep MySQL accessible until MongoDB reads are verified. Monitor for query pattern mismatches — MongoDB's aggregation pipeline and indexing work differently than SQL JOINs.

9. How does the BASE model differ from ACID, and what types of applications are better suited for each?

ACID prioritizes consistency — transactions are isolated, all-or-nothing, and durable immediately. BASE prioritizes availability and partition tolerance, accepting that data will be eventually consistent. ACID fits applications where correctness is paramount: financial transactions, inventory management, anything where reading stale data causes business problems. BASE fits high-throughput, globally distributed applications where temporary inconsistency is acceptable: social media feeds, analytics, caching layers, IoT sensor data. Most production systems use both — ACID for core business logic, BASE for auxiliary data stores and caches.

10. Your DynamoDB table is experiencing throttling on a specific partition despite having low overall table utilization. What could be causing this and how would you diagnose it?

DynamoDB partitions are independent — each has its own 1000 WCU and 3000 RCU limit. Throttling on a single partition indicates a hot partition key. Common causes: using a low-cardinality attribute as the partition key (e.g., status codes, geographic regions), viral content hitting a single item's access patterns, or skewed secondary indexes. Diagnose with CloudWatch partition metrics: GetItemThrottledRequests and QueryScannedCount per partition. Solutions: introduce a high-cardinality suffix to the partition key, scatter-gather across multiple keys and aggregate in application, or redesign the access pattern to distribute load. DynamoDB adaptive capacity provides temporary relief but cannot fix fundamentally skewed access patterns.

11. What are CRDTs and how do they solve conflict resolution in distributed databases without coordination?

CRDTs (Conflict-free Replicated Data Types) are data structures mathematically designed to merge concurrent changes without coordination between nodes. Operations are commutative and idempotent, so the order of application does not matter. Examples: G-Counters (merge by taking max of each node's count), LWW-Registers (last-writer-wins based on timestamp), OR-Sets (adds and removes tracked separately). CRDTs enable availability-first architectures — writes succeed locally even during network partitions, and state converges automatically when nodes communicate. The trade-off is memory overhead and the types of data that can be represented. Not all data structures have CRDT equivalents — arbitrary linked data structures cannot be represented as CRDTs without significant compromises.

12. How would you approach capacity planning for a Cassandra cluster handling 50,000 writes per second with 2TB of data growing at 100GB per day?

Start with write throughput: 50,000 writes/second at typical 1KB per write = ~50MB/s write throughput. Each Cassandra node handles roughly 50MB/s with proper tuning on NVMe SSD. Plan for 3x growth headroom — capacity planning for 150,000 writes/second means minimum 3-4 data nodes plus replication factor. For storage: 100GB/day growth with RF=3 means 300GB/day of storage consumption. Over 3 years that is ~110TB raw storage — plan accordingly. Network matters: cross-datacenter replication doubles network egress. Model failure scenarios: losing one node with RF=3 still meets quorum. Leave 50% capacity headroom for compaction and repairs. Consider using Time Window Compaction Strategy for time-series workloads to manage tombstones efficiently.

13. Compare the operational complexity of running self-hosted Cassandra versus using a managed service like DataStax Astra or Amazon Keyspaces.

Self-hosted Cassandra gives you full control: you manage node provisioning, configuration tuning, repairs, compactions, and failure handling. Operational complexity is significant — you need engineers who understand Cassandra internals, repair schedules, and compaction strategies. DataStax Astra reduces this burden dramatically (30-50% cost premium) but has limitations on custom configurations and can hit scaling ceilings at extreme throughput. Amazon Keyspaces is serverless and fully managed but uses a Cassandra-compatible API with limited tuning options and higher per-operation costs at scale. For most teams, managed Cassandra (Astra) is the right balance. Self-hosted makes sense when you have dedicated ops expertise, need extreme customization, or are running at such scale that managed pricing becomes prohibitive.

14. What monitoring metrics are most critical for detecting impending failures in a MongoDB replica set before they cause downtime?

Watch these early warning indicators: replication lag — if secondary lag exceeds your consistency window, reads may return stale data and writes are at risk if primary fails. Connection pool utilization — approaching 100% means application requests queue and latency spikes. Disk IOPS and queue depth — when disk cannot keep up with write workload, operations back up. Memory pressure — MongoDB's wiredTiger cache should fit in RAM; swapping causes catastrophic latency spikes. Lock contention — high lock wait times indicate write congestion. Network metrics — packet loss, latency variance, and connection resets between replica set members. Oplog size and generation rate — if oplog is filling faster than secondaries can apply, lag grows. Set alerts at 70% of thresholds that would cause failures, not at failure points.

15. What are the most common causes of data loss in NoSQL databases and how would you prevent or recover from each?

The most common causes: accidental DELETE or DROP operations without proper backups — prevention is point-in-time recovery enabled and tested restores. Replication lag losing writes during failover — use proper write concern levels and monitor lag aggressively. Corrupt on-disk data from disk failures or kernel bugs —prevention is RAID, checksums, and regular repair commands (nodetool repair in Cassandra, db.repairDatabase() in MongoDB). Hot partition overload causing throttling and dropped requests — fix partition key design and use adaptive capacity as a backstop. Network partitions causing divergent data — use quorum consistency and conflict resolution. Accidental batch deletes with TTLs — model TTLs carefully and test with small batches first. Recovery from each: point-in-time restore from backups for catastrophic loss; distributed repair for divergent replicas; failover to new primary for failed nodes.

16. How does the choice between synchronous and asynchronous replication affect consistency, availability, and latency in a distributed database?

Synchronous replication writes wait for acknowledgment from all replicas before responding to the client. This guarantees strong consistency — every read sees writes that have been confirmed — but adds write latency proportional to the slowest replica. If a replica is slow or unreachable, writes block or fail. Asynchronous replication returns success immediately after the primary writes, propagating to replicas in the background. Write latency is low, but reads may see stale data until replication catches up. Availability is higher since writes never wait for slow replicas. The tradeoff: synchronous suits environments where consistency matters more than write latency (financial transactions); asynchronous suits environments where write throughput and availability matter more than immediate consistency (social feeds, analytics). Most production systems use semi-synchronous approaches — wait for a quorum of replicas but not all.

17. What are tombstones in Cassandra and MongoDB, why do they cause problems at scale, and how do you manage them?

Tombstones are deletion markers — when you delete a record in Cassandra or MongoDB, the data is marked as deleted but not immediately removed. This allows replication to propagate deletes reliably across nodes. The problem: during reads, Cassandra must scan past tombstones to find live data. If a partition accumulates many tombstones (from time-series data with deletes, or high churn), reads can become slow — reading nothing but ghosts. At scale, this is called "tombstone rain" and can cause timeouts even though the partition is logically empty. Mitigation: use TTLs instead of manual deletes where possible so expired data is purged automatically. Use Time Window Compaction Strategy (TWCS) in Cassandra for time-series workloads — it compacts and drops tombstones together with the data window they cover. Avoid wide partitions with many deleted rows. Run nodetool garbagecollect to force tombstone cleanup on specific partitions.

18. Describe the differences between MongoDB's WiredTiger storage engine and the in-memory engine. When would you choose one over the other?

WiredTiger is the default storage engine for MongoDB — it is disk-based with LSM tree indexes optimized for write-heavy workloads. It uses compression (Snappy or Zstd) to fit more data in RAM and reduces disk I/O. It supports document-level concurrency (multicutx-transactions), which is more efficient than collection-level locking in MMAPv1. The in-memory engine stores all data in RAM with optional persistence of a journal — no disk I/O for reads or writes. It provides the lowest possible latency and predictable performance, but is limited by RAM size and is significantly more expensive at scale. Choose WiredTiger for production workloads where data volume exceeds RAM, where you need compression to reduce costs, and where writes dominate. Choose in-memory engine for latency-sensitive workloads where all active data fits comfortably in RAM and the operational simplicity of predictable latency outweighs the cost premium — caching layers, session stores, real-time analytics dashboards.

19. What is the difference between optimistic locking and pessimistic locking in the context of NoSQL databases? Give an example of when each applies.

Optimistic locking assumes conflicts are rare and lets transactions proceed without acquiring locks — it detects conflicts at commit time by checking a version number or timestamp. If the version changed since you read the data, the transaction aborts and you retry. Pessimistic locking assumes conflicts are likely and acquires a lock before reading or modifying data — other transactions block waiting for the lock. Optimistic locking in NoSQL: MongoDB's findAndModify with a version field; DynamoDB's conditional writes. Use when conflicts are rare, throughput matters more than latency, and clients can retry. Pessimistic locking: Redis WATCH/MULTI/EXEC for critical sections; Cassandra's Lightweight Transactions (LWT) using IF conditions. Use when conflicts are frequent, the cost of a conflict is high (overwriting financial data), and you need guaranteed exclusive access. The tradeoff: optimistic locking has lower latency but more retries under contention; pessimistic locking has higher latency but fewer retries.

20. What is the "dual write" problem in distributed databases, and how do you solve it when migrating between data stores or maintaining multiple data sources?

The dual write problem occurs when your application writes to two data systems simultaneously — for example, writing to PostgreSQL and Elasticsearch at the same time. If the first write succeeds and the second fails, your data stores diverge with no easy way to know which is correct. You cannot easily roll back the first write, and detecting the inconsistency requires a full reconciliation scan. Solutions: change data capture (CDC) — instead of writing to both systems, write to one and use a stream processor (Kafka Connect, Debezium) to replicate changes to the second. This separates the write path from the replication path and guarantees at-least-once delivery. Transactional outbox pattern — write to a local outbox table within the same transaction as your primary write, then use a relay process to publish outbox entries to the second system. This guarantees exactly-once semantics within the transaction boundary. The relay tracks its position in the outbox so it resumes correctly after failures. Both approaches avoid the "write-then-fail-silently" failure mode of naive dual writes.

Further Reading

Official Documentation

Books

  • “Designing Data-Intensive Applications” by Martin Kleppmann — The definitive text on distributed systems and data storage trade-offs
  • “Cassandra: The Definitive Guide” by Jeff Carpenter and Eben Hewitt — Practical Cassandra operations and modeling
  • “MongoDB Applied Design Patterns” by Rick Copeland — Document modeling patterns for common use cases

Papers and References

Conclusion

Key takeaways:

  • NoSQL databases are purpose-built for specific access patterns and scale requirements
  • CAP theorem trade-offs force you to choose between consistency and availability during partitions
  • Document databases work well for flexible schemas and embedded data patterns
  • Key-value stores give you extreme speed for simple lookups
  • Column-family databases handle write-heavy workloads with predictable queries
  • Graph databases are built for relationship traversal workloads

Copy/Paste Checklist:

# MongoDB connection with monitoring
from pymongo import MongoClient
client = MongoClient("mongodb://user:pass@host:27017/?replicaSet=rs0")
db = client.admin
# Check replica set status
print(db.command('replSetGetStatus'))

# Cassandra connection with monitoring
from cassandra.cluster import Cluster
cluster = Cluster(['host1', 'host2', 'host3'])
session = cluster.connect('mykeyspace')
# Check cluster health
session.execute("SELECT * FROM system.local")
session.execute("SELECT * FROM system.peers")

# Redis security and monitoring
redis-cli INFO clients  # Client connections
redis-cli INFO stats     # Command statistics
redis-cli CONFIG GET *   # Configuration audit

Observability Checklist

Metrics to Monitor:

Track request latency (p50, p95, p99) by operation type and request throughput (reads/writes per second). Watch node availability and cluster health, disk usage per node and cluster-wide, and network I/O between nodes. Replication lag and pending writes matter for consistency. Compaction progress and queue depth affect performance unpredictably. If you use caching, track cache hit ratios. Connection pool utilization catches exhaustion before it becomes an outage.

Logs to Capture:

Log node failures and restarts, compaction events with I/O patterns, authentication failures and denied permissions. Slow query logs (set a threshold that makes sense for your workload) are essential. For JVM-based systems like Cassandra, GC pauses and heap pressure show up at the worst times. Network partition events are worth capturing for post-mortems.

Alerts to Set:

Alert on node down or cluster minority partition immediately. Set thresholds for latency spikes that exceed your SLA, disk usage above 80% on any node, and replication lag beyond your consistency requirements. Watch for failed request rate increases and compaction queue depth spiking. Connection pool exhaustion will take down your app before you notice anything else.

# Example: Cassandra nodetool commands for monitoring
# nodetool status           # Cluster health and node status
# nodetool tpstats          # Thread pool statistics
# nodetool cfstats          # Keyspace and table statistics
# nodetool proxyhistograms  # Client request latency

Security Checklist

  • Enable authentication (internal authentication or external like Kerberos)
  • Implement authorization with role-based access control
  • Encrypt client-to-node and node-to-node traffic
  • Encrypt data at rest (filesystem or application-level encryption)
  • Use TLS for all inter-node communication
  • Restrict network access to database nodes (firewall rules)
  • Implement keyspace-level or table-level permissions
  • Audit log sensitive operations and access patterns
  • Sanitize all user inputs to prevent injection attacks
  • Use secure secret management for database credentials
  • Regularly rotate credentials and encryption keys
  • Test security configuration with penetration testing

Category

Related Posts

Apache Cassandra: Distributed Column Store Built for Scale

Explore Apache Cassandra's peer-to-peer architecture, CQL query language, tunable consistency, compaction strategies, and use cases at scale.

#distributed-systems #databases #cassandra

Column-Family Databases: Cassandra and HBase Architecture

Cassandra and HBase data storage explained. Learn partition key design, column families, time-series modeling, and consistency tradeoffs.

#database #nosql #column-family

Document Databases: MongoDB and CouchDB Data Modeling

Learn MongoDB and CouchDB data modeling, embedding vs referencing, schema validation, and when document stores fit better than relational databases.

#database #nosql #document-database