Column-Family Databases: Cassandra and HBase Architecture

Cassandra and HBase data storage explained. Learn partition key design, column families, time-series modeling, and consistency tradeoffs.

published: reading time: 34 min read author: GeekWorkBench

Column-family databases feel like spreadsheets but behave nothing like relational systems. They organize data into column families, rows, and columns—optimized for write-heavy workloads, massive scale, and queries that fetch many columns from a single row. The two major players in this space are Cassandra and HBase. They share the wide-column model but differ significantly in architecture and operational characteristics.

The Wide-Column Model

Despite the name, column-family databases are not column-oriented like analytical databases (ClickHouse, Redshift). They store data row by row, but group columns into families.

-- Cassandra CQL looks like SQL but isn't
CREATE TABLE users (
    user_id uuid PRIMARY KEY,
    profile frozen<profile_type>,
    preferences map<text, text>,
    created_at timestamp
);

Under the hood, data is stored as:

RowKey: user-123
|
+-- Column Family: profile
|   +-- Column: first_name, value: "Alice", timestamp: 1709256000
|   +-- Column: last_name, value: "Chen", timestamp: 1709256000
|   +-- Column: bio, value: "Engineer", timestamp: 1709256000
|
+-- Column Family: preferences
    +-- Column: theme, value: "dark", timestamp: 1709256000
    +-- Column: language, value: "en", timestamp: 1709256000

Each cell has a value and a timestamp. The timestamp resolves conflicting writes (in Cassandra, the cell with the newest timestamp wins) and, in HBase, lets a cell retain multiple versions.
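
To make the role of the cell timestamp concrete, here is a minimal Python sketch of last-write-wins reconciliation between two copies of the same row. The `Cell` type and `merge_cells` function are illustrative names, not part of any driver, and timestamp ties are ignored for simplicity.

```python
from typing import NamedTuple

class Cell(NamedTuple):
    value: str
    timestamp: int  # write time, e.g. seconds since the epoch

def merge_cells(a: dict, b: dict) -> dict:
    """Merge two copies of a row (column name -> Cell), keeping the newest cell per column."""
    merged = dict(a)
    for column, cell in b.items():
        current = merged.get(column)
        if current is None or cell.timestamp > current.timestamp:
            merged[column] = cell
    return merged

# Two replicas that have seen different writes for the same row
replica_1 = {"first_name": Cell("Alice", 1709256000), "bio": Cell("Engineer", 1709256000)}
replica_2 = {"bio": Cell("Staff Engineer", 1709259600), "theme": Cell("dark", 1709256000)}

row = merge_cells(replica_1, replica_2)
print(row["bio"].value)  # the later write to "bio" wins
```

The merge works per column rather than per row, which is why two clients updating different columns of the same row do not overwrite each other.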


Cassandra: Distributed by Default

Cassandra is designed as a distributed, masterless database. Every node can serve any request. Data is replicated across nodes based on a configurable replication factor.

Partition Key Design

The partition key determines data distribution. All rows with the same partition key reside on the same node(s).

-- Good: high cardinality column as partition key
CREATE TABLE orders_by_user (
    user_id uuid,
    order_id timeuuid,
    total decimal,
    status text,
    items list<text>,
    PRIMARY KEY (user_id, order_id)
);

-- Query: get all orders for a user (efficient - single partition lookup)
SELECT * FROM orders_by_user WHERE user_id = ?;

-- Bad: low cardinality as partition key
CREATE TABLE orders_by_status (
    user_id uuid,
    order_id timeuuid,
    status text,
    PRIMARY KEY (status, order_id)
);

-- Query: get all orders with status = 'pending' (every pending order lives in one huge, hot partition)
SELECT * FROM orders_by_status WHERE status = 'pending';
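
The effect of key cardinality on partition sizes can be checked with a small simulation. This is a sketch over made-up order data (the status values and row count are assumptions), counting how many rows would land in each partition under the two keys.

```python
import random
import uuid
from collections import Counter

random.seed(7)  # fixed seed so the run is repeatable
statuses = ["pending", "shipped", "delivered"]
orders = [
    {"user_id": uuid.UUID(int=random.getrandbits(128)), "status": random.choice(statuses)}
    for _ in range(10_000)
]

def partition_sizes(rows, key):
    """Count rows per partition for a given partition-key column."""
    return Counter(row[key] for row in rows)

by_user = partition_sizes(orders, "user_id")
by_status = partition_sizes(orders, "status")

print(len(by_user), max(by_user.values()))      # many partitions, each tiny
print(len(by_status), max(by_status.values()))  # three partitions, each with thousands of rows
```

With a status key, the number of partitions stays fixed while each one grows with total order volume, which is exactly the growth pattern a partition key should avoid.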

Data Modeling for Query Patterns

In Cassandra, you model for queries, not entities. You create separate tables for each access pattern.

-- Table for user profile lookups
CREATE TABLE user_profiles (
    user_id uuid PRIMARY KEY,
    email text,
    name text,
    created_at timestamp
);

-- Table for looking up users by email
CREATE TABLE user_emails (
    email text PRIMARY KEY,
    user_id uuid
);

-- Table for getting all users (for admin views)
CREATE TABLE all_users (
    tenant_id uuid,
    user_id uuid,
    email text,
    name text,
    PRIMARY KEY (tenant_id, user_id)
);

-- Table for time-based activity queries
CREATE TABLE user_activity (
    user_id uuid,
    activity_date date,
    activity_hour int,
    event_type text,
    event_data map<text, text>,
    PRIMARY KEY ((user_id), activity_date, activity_hour, event_type)
);

Consistency Levels

Cassandra offers tunable consistency. You can trade consistency for availability.

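import uuid

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['cassandra-node-1', 'cassandra-node-2', 'cassandra-node-3'])
session = cluster.connect('mykeyspace')

# Example values for illustration
user_id, email, name = uuid.uuid4(), 'alice@example.com', 'Alice'

# Write with quorum consistency (most common).
# The consistency level is set on the statement, not passed to execute().
insert = SimpleStatement(
    "INSERT INTO user_profiles (user_id, email, name) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM
)
session.execute(insert, [user_id, email, name])

# Read with quorum
select = SimpleStatement(
    "SELECT * FROM user_profiles WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM
)
result = session.execute(select, [user_id])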

# Consistency levels:
# ONE: fastest, weakest consistency
# TWO: two replicas must respond
# THREE: three replicas must respond
# QUORUM: majority of replicas (floor(RF / 2) + 1)
# ALL: strongest, slowest, unavailable if any replica is down
# LOCAL_ONE: one replica in the local datacenter
# LOCAL_QUORUM: quorum in local datacenter
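
The QUORUM arithmetic is easy to get wrong for even replication factors, so here is a small sketch. The helper names `quorum` and `tolerated_failures` are made up for illustration.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond at QUORUM: a strict majority."""
    return replication_factor // 2 + 1

def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be down while QUORUM requests still succeed."""
    return replication_factor - quorum(replication_factor)

for rf in (1, 3, 4, 5):
    print(f"RF={rf}: quorum={quorum(rf)}, tolerates {tolerated_failures(rf)} down")
```

Note that RF=4 tolerates no more failures than RF=3, which is one reason odd replication factors are the usual choice.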

Time-Series Modeling

Cassandra handles time-series data well when modeled correctly.

-- Time-series table with bucketing
CREATE TABLE sensor_readings (
    sensor_id text,
    bucket date,  -- daily bucket
    reading_time timestamp,
    temperature float,
    humidity float,
    PRIMARY KEY ((sensor_id, bucket), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);

-- Query a specific sensor for a day
SELECT * FROM sensor_readings
WHERE sensor_id = 'temp-sensor-001'
  AND bucket = '2024-03-15';

-- Query for a time range within a bucket
SELECT * FROM sensor_readings
WHERE sensor_id = 'temp-sensor-001'
  AND bucket = '2024-03-15'
  AND reading_time > '2024-03-15T10:00:00'
  AND reading_time < '2024-03-15T14:00:00';

The bucket prevents unbounded partition growth. Each partition holds one day’s worth of data for one sensor.
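
On the application side, bucketing means computing the bucket on every write and listing the buckets a read has to visit. A minimal sketch, assuming daily buckets as in the table above (`bucket_for` and `buckets_between` are hypothetical helper names):

```python
from datetime import date, datetime, timedelta

def bucket_for(reading_time: datetime) -> date:
    """Daily bucket for a reading: the calendar date of its timestamp."""
    return reading_time.date()

def buckets_between(start: datetime, end: datetime) -> list:
    """Every daily bucket that the range [start, end] touches, oldest first."""
    days = (end.date() - start.date()).days
    return [start.date() + timedelta(days=offset) for offset in range(days + 1)]

start = datetime(2024, 3, 14, 22, 0)
end = datetime(2024, 3, 16, 2, 0)
for bucket in buckets_between(start, end):
    print(bucket.isoformat())  # issue one single-partition query per bucket
```

A range that crosses midnight needs one query per bucket, so pick the bucket width with the typical query range in mind.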

flowchart TD
    W["Write Request"]
    WL["WAL<br/>(Write-Ahead Log)"]
    MS["MemStore<br/>(In-Memory)"]
    HF["HFiles<br/>(Disk)"]
    RS["Read Request"]
    BC["BlockCache"]
    Merge["HFiles<br/>(Merged)"]

    W --> WL --> MS -->|Flush when full| HF
    RS --> BC
    BC --> MS
    Merge -->|Compaction| HF
    HF -.->|"On read"| Merge

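The diagram above shows the storage path of HBase, which is covered in detail in the next section. HBase’s write path goes through the WAL (durability) and then the MemStore (speed). Reads check the BlockCache first, then the MemStore, then merged HFiles. Compaction runs in the background to merge HFiles and remove deleted cells.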


HBase: Hadoop’s Column Store

HBase is built on top of HDFS (Hadoop Distributed File System). It integrates with the Hadoop ecosystem and is optimized for sequential reads and writes.

Architecture

HBase stores data in HFiles on HDFS. Regions serve ranges of row keys and split as they grow. HBase uses ZooKeeper for coordination.

Client -> ZooKeeper -> Meta Table (lists regions) -> Region Server
                                      |
                                      v
                              HFile (stored on HDFS)

Data Model

HBase uses a sparse, multi-dimensional sorted map.

Key: (Row Key, Column Family, Column Qualifier, Timestamp)
Value: byte[]

from datetime import datetime

from happybase import Connection

connection = Connection('hbase-host')
table = connection.table('user_events')

# Store events (HBase stores raw bytes, so encode keys and values)
now = datetime.now().isoformat()
row_key = f'user-123#{now}'.encode()
table.put(row_key, {
    b'cf:event_type': b'purchase',
    b'cf:amount': b'129.99',
    b'cf:product_id': b'SKU-001',
    b'cf:timestamp': now.encode()
})

# Scan with filters
for key, data in table.scan(row_start=b'user-123',
                             row_stop=b'user-124',
                             filter="SingleColumnValueFilter('cf', 'event_type', =, 'binary:purchase')"):
    print(key, data)

# Get specific row
row = table.row(b'user-123#2024-03-15T10:30:00')

Row Key Design

Row keys in HBase determine locality. Sequential keys cause write hotspots.

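# Bad: sequential keys cause hotspot on one region server
order_id = 1
row_key = f'order-{order_id:010d}'  # order-0000000001, order-0000000002...

# Better: salted keys (a short hash-derived prefix) distribute writes,
# but scans across keys must then visit every salt bucket

# Random prefixes also distribute writes, but the row can no longer be
# looked up by user unless the generated key is stored somewhere
import uuid
row_key = f'{uuid.uuid4().hex}#user-123'

# For time-series: lead with the entity, then a reversed timestamp so the
# newest rows sort first. Avoid a bare date prefix, which sends all of
# today's writes to a single region.
import time
reverse_ts = 2**63 - 1 - int(time.time() * 1000)
row_key = f'user-123#{reverse_ts:019d}'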
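
Salting is the option above that has no code, so here is one way it could look. This is a sketch under assumptions: the bucket count of 16, the MD5-based salt, and the helper names are all illustrative choices, not HBase APIs.

```python
import hashlib

NUM_SALT_BUCKETS = 16  # assumed bucket count; in practice, tie it to the number of regions

def salted_key(entity_id: str, suffix: str) -> bytes:
    """Prefix the row key with a stable salt derived from the entity ID."""
    digest = hashlib.md5(entity_id.encode()).digest()
    salt = digest[0] % NUM_SALT_BUCKETS
    return f"{salt:02d}#{entity_id}#{suffix}".encode()

def salt_prefixes() -> list:
    """All salt prefixes; a scan that spans entities must visit each one."""
    return [f"{salt:02d}#".encode() for salt in range(NUM_SALT_BUCKETS)]

key = salted_key("user-123", "2024-03-15T10:30:00")
print(key)
print(len(salt_prefixes()), "prefix scans needed for a cross-entity range read")
```

Because the salt is derived from the entity ID rather than chosen at random, a lookup for one user can recompute the prefix and still hit a single region. Only scans across users pay the fan-out cost.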

Read/Write Paths

HBase writes go through a Write-Ahead Log (WAL) and MemStore before being flushed to HFiles.

Write: Client -> WAL -> MemStore (in memory) -> Flush to HFile
                                                              |
Read: Client -> BlockCache -> MemStore -> HFiles (merged) <--+

Compaction merges HFiles, removing deleted or overwritten cells.
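
The write, flush, read, and compact steps can be modeled in a few lines. The toy class below is a simplified sketch, not HBase code: it leaves out the BlockCache, deletes, and timestamps, and the name `TinyStore` and the flush threshold of 2 are assumptions chosen to keep the example short.

```python
class TinyStore:
    """Toy model of one HBase store: a WAL, a MemStore, and immutable HFiles."""

    def __init__(self, flush_threshold: int = 2):
        self.wal = []        # append-only log, written before the MemStore
        self.memstore = {}   # newest values, held in memory
        self.hfiles = []     # flushed files, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))   # durability first
        self.memstore[key] = value      # then the in-memory write
        if len(self.memstore) >= self.flush_threshold:
            self.hfiles.append(dict(sorted(self.memstore.items())))  # flush
            self.memstore = {}

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):  # newest file wins
            if key in hfile:
                return hfile[key]
        return None

    def compact(self):
        merged = {}
        for hfile in self.hfiles:  # oldest first, so newer values overwrite older ones
            merged.update(hfile)
        self.hfiles = [dict(sorted(merged.items()))]

store = TinyStore()
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("d", 5)]:
    store.put(key, value)

print(len(store.hfiles), store.get("a"))  # two HFiles before compaction
store.compact()
print(len(store.hfiles), store.get("a"))  # one HFile after, same answer
```

The read result is identical before and after compaction. Only the number of files a read may have to consult changes, which is the point of compacting.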


Comparing Cassandra and HBase

Characteristic         | Cassandra                    | HBase
Architecture           | Masterless, every node equal | Master-slave with ZooKeeper
Consistency            | Tunable per-query            | Strong consistency only
Write Performance      | Excellent                    | Good
Query Language         | CQL (SQL-like)               | No native SQL (requires Phoenix)
Secondary Indexes      | Limited, local               | Limited
Hadoop Integration     | Limited                      | Native MapReduce, Spark
Operational Complexity | Lower (no ZooKeeper)         | Higher (ZooKeeper, HDFS)
Global Scans           | Possible but expensive       | Efficient via MapReduce
Use Case Fit           | Write-heavy, multi-datacenter | Hadoop ecosystem, sequential access

When to Use / When Not to Use Column-Family Databases

Use them when:

  • You need to handle millions of writes per second with horizontal scaling
  • Your access patterns are known upfront and map to partition keys
  • You need multi-datacenter replication with tunable consistency (Cassandra)
  • Time-series data with bucketing fits your workload
  • You’re in the Hadoop ecosystem and need native Spark/Hive integration (HBase)

Do not use them when:

  • You need ad-hoc queries on arbitrary columns — CQL is not SQL and joins are not supported
  • Your team can’t manage ZooKeeper, HDFS, and region management (HBase)
  • You need strong consistency across wide range queries with complex aggregation
  • Your data model changes frequently — wide-column schemas require rethinking partition keys
  • You need real-time secondary indexing across non-key columns

Partition Key Strategies

The partition key is the most critical design decision.

Goals for Partition Keys

  1. Distribute writes evenly across partitions
  2. Keep related data in the same partition
  3. Avoid partitions that grow too large
  4. Enable efficient range queries when needed

-- E-commerce: multiple access patterns

-- Pattern 1: User-centric
CREATE TABLE user_orders (
    user_id uuid,
    order_id timeuuid,
    total decimal,
    items list<text>,
    PRIMARY KEY (user_id, order_id)
);

-- Pattern 2: Product-centric (for "who bought this?")
CREATE TABLE product_orders (
    product_id uuid,
    order_id timeuuid,
    user_id uuid,
    quantity int,
    PRIMARY KEY (product_id, order_id)
);

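-- Pattern 3: Time-based (for analytics)
-- Note: every sale on a given date lands in one partition; add a bucket
-- to the partition key if daily volume is high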
CREATE TABLE daily_sales (
    date date,
    product_id uuid,
    order_id timeuuid,
    quantity int,
    PRIMARY KEY (date, product_id, order_id)
) WITH CLUSTERING ORDER BY (product_id ASC, order_id DESC);

Avoiding Hotspots

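-- Single-row partitions: sequential event_ids do not hotspot in Cassandra,
-- because the default Murmur3Partitioner hashes the partition key.
-- The cost is that events can only be fetched one id at a time.
CREATE TABLE events (
    event_id bigint,
    data text,
    PRIMARY KEY (event_id)
);

-- Hotspot risk: a partition key with few distinct current values
-- (for example today's date) sends all current writes to one partition.
-- Mitigation: add a hash-derived bucket to the partition key
CREATE TABLE events_bucketed (
    partition_key text,  -- e.g. '<date>#<hash(event_id) % N>'
    event_id bigint,
    data text,
    PRIMARY KEY ((partition_key), event_id)
);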

-- Better: use a natural key that has cardinality
CREATE TABLE user_events (
    user_id uuid,
    event_id timeuuid,
    data text,
    PRIMARY KEY ((user_id), event_id)
);

Time-Series Data Modeling

Time-series is a natural fit for wide-column stores.

-- Cassandra: bucket by time, query by partition
CREATE TABLE metrics (
    metric_name text,
    bucket text,  -- YYYY-MM-DD
    timestamp timestamp,
    value double,
    tags map<text, text>,
    PRIMARY KEY ((metric_name, bucket), timestamp)
);

-- Query: get all CPU usage metrics for a day
SELECT * FROM metrics
WHERE metric_name = 'cpu.usage'
  AND bucket = '2024-03-15'
  AND timestamp > '2024-03-15T00:00:00'
  AND timestamp < '2024-03-15T23:59:59';

# HBase: store metrics with composite row key
# Row key: metric_name#hostname#timestamp

from happybase import Connection
import time

def store_metric(connection, metric_name, hostname, value, tags=None):
    table = connection.table('metrics')
    timestamp = int(time.time() * 1000)
    # Zero-pad the timestamp so row keys sort chronologically as bytes
    row_key = f'{metric_name}#{hostname}#{timestamp:013d}'.encode()

    # HBase stores raw bytes, so encode column names and values
    data = {b'cf:value': str(value).encode(), b'cf:timestamp': str(timestamp).encode()}
    if tags:
        for k, v in tags.items():
            data[f'cf:tag.{k}'.encode()] = str(v).encode()

    table.put(row_key, data)

def query_metrics(connection, metric_name, hostname, start_time, end_time):
    # start_time and end_time are epoch milliseconds; row_stop is exclusive
    table = connection.table('metrics')
    start_row = f'{metric_name}#{hostname}#{start_time:013d}'.encode()
    end_row = f'{metric_name}#{hostname}#{end_time:013d}'.encode()

    return table.scan(row_start=start_row, row_stop=end_row)

Consistency Tradeoffs

Wide-column databases make interesting consistency tradeoffs.

Cassandra’s Tunable Consistency

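With RF=3, QUORUM means 2 replicas:

  • Write: 2 of 3 nodes must acknowledge
  • Read: 2 of 3 nodes must respond
  • Because 2 + 2 > 3, every read quorum overlaps every write quorum, so a QUORUM read sees the latest acknowledged QUORUM write. Stale reads become possible when the read and write levels together cover RF replicas or fewer (for example, write at ONE and read at ONE)

# Read repair: the coordinator compares replica responses during a read
# and repairs any mismatched replicas

# For linearizable operations: lightweight transactions (IF clauses) run Paxos
# at a serial consistency level, set on the statement rather than on execute()
statement = SimpleStatement(..., serial_consistency_level=ConsistencyLevel.SERIAL)
session.execute(statement)
# The commit phase of the write still uses the regular consistency level (e.g. QUORUM)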
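
Whether a read is guaranteed to see the latest write comes down to whether the replicas contacted by the read must overlap the replicas that acknowledged the write. The sketch below checks this by brute force over every possible replica choice (`always_overlaps` is an illustrative name, not a driver function).

```python
from itertools import combinations

def always_overlaps(rf: int, write_replicas: int, read_replicas: int) -> bool:
    """True if every possible read set shares a replica with every possible write set."""
    nodes = range(rf)
    return all(
        set(w) & set(r)
        for w in combinations(nodes, write_replicas)
        for r in combinations(nodes, read_replicas)
    )

print(always_overlaps(3, 2, 2))  # QUORUM write + QUORUM read -> True
print(always_overlaps(3, 1, 1))  # ONE write + ONE read -> False
print(always_overlaps(3, 3, 1))  # ALL write + ONE read -> True
```

The result matches the usual rule of thumb: reads are consistent when read replicas plus write replicas exceed the replication factor.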

HBase’s Strong Consistency

HBase guarantees strong consistency. Every read sees the latest write. This comes at the cost of potential latency during region splits or failover.

Handling Inconsistency

import time

from cassandra import WriteTimeout

# Application-level handling of timed-out writes: retry with linear backoff.
# Retrying is safe here because a plain INSERT is idempotent.
def write_with_retry(session, key, value, max_retries=3):
    for attempt in range(max_retries):
        try:
            session.execute(
                "INSERT INTO events (key, value) VALUES (%s, %s)",
                [key, value]
            )
            return
        except WriteTimeout:
            if attempt == max_retries - 1:
                raise
            time.sleep(0.1 * (attempt + 1))

# Conflict resolution strategies:
# 1. Last-write-wins (use TTL for automatic cleanup)
# 2. Application-defined merge (examine timestamps, values)
# 3. CRDTs (Conflict-free Replicated Data Types)
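
Strategy 3 can be illustrated with the simplest CRDT, a grow-only counter. This is a generic sketch of the data structure, not a Cassandra feature, and the replica names are made up.

```python
class GCounter:
    """Grow-only counter: one slot per replica, merged by element-wise max."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts = {}

    def increment(self, amount: int = 1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other: "GCounter"):
        for replica, count in other.counts.items():
            self.counts[replica] = max(self.counts.get(replica, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())

a, b = GCounter("dc1"), GCounter("dc2")
a.increment(3)   # three events counted in dc1
b.increment(2)   # two events counted in dc2
a.merge(b)
b.merge(a)
print(a.value(), b.value())  # both replicas converge on the same total
```

Merging is commutative and idempotent, so replicas can exchange state in any order, any number of times, and still agree without timestamps.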

Capacity Estimation: Partition Sizing and Bloom Filter Memory

Cassandra partitions grow until they hit two limits: 2 billion cells per partition (a hard internal limit) or 100MB of data per partition (a practical soft limit for read performance). Beyond 100MB, reads become slow because deserializing a large partition consumes significant heap.

The formula for estimating partition size: average row size times rows per partition. If each row is 1KB and you expect 50,000 rows per partition, that is 50MB — manageable. If rows grow to 10KB over time (more columns get added), you are at 500MB per partition and reads start timing out. Budget headroom: size partitions for peak row count, not average.
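
The estimate is simple enough to keep as a helper next to your schema. A minimal sketch, using decimal units (1KB = 1,000 bytes) and the 100MB guideline from above (the function names are illustrative):

```python
SOFT_LIMIT_BYTES = 100 * 1000 * 1000  # the 100MB guideline from the text

def partition_bytes(avg_row_bytes: int, rows_per_partition: int) -> int:
    """Estimated partition size: average row size times rows per partition."""
    return avg_row_bytes * rows_per_partition

def max_rows(avg_row_bytes: int) -> int:
    """Largest row count that keeps a partition within the soft limit."""
    return SOFT_LIMIT_BYTES // avg_row_bytes

print(partition_bytes(1_000, 50_000) / 1e6)   # 1KB rows  -> 50.0 MB
print(partition_bytes(10_000, 50_000) / 1e6)  # 10KB rows -> 500.0 MB
print(max_rows(10_000))                       # row budget at 10KB per row
```

Running the check against peak row counts and the largest plausible row size, rather than averages, is what provides the headroom.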

Cassandra uses bloom filters to avoid reading SSTables that do not contain the requested partition key. Each SSTable has its own bloom filter, and bloom filter memory is roughly 1-2 bytes per key per SSTable. For a table with 100GB of data and 1 billion partition keys, that is roughly 1-2GB of bloom filter memory if each key appears in only one SSTable, and more when keys are spread across several SSTables. The filters are kept in memory, which is why they are effective at reducing unnecessary disk reads, so include them in your memory budget.
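
The bytes-per-key figure follows from the textbook sizing formula for an optimally configured bloom filter. The sketch below applies that formula. It is an approximation, since Cassandra's actual allocation may round differently, and the function names are illustrative.

```python
import math

def bloom_bits_per_key(fp_chance: float) -> float:
    """Bits per key for an optimally sized bloom filter with the given false positive chance."""
    return -math.log(fp_chance) / (math.log(2) ** 2)

def bloom_bytes(keys: int, fp_chance: float = 0.01) -> float:
    """Approximate filter size in bytes for a number of keys."""
    return keys * bloom_bits_per_key(fp_chance) / 8

print(round(bloom_bits_per_key(0.01), 2))   # bits per key at a 1% false positive chance
print(round(bloom_bytes(1_000_000_000) / 1e9, 2), "GB for one billion keys")
```

Loosening the false positive chance from 1% to 10% halves the memory, which is the tradeoff behind the per-table bloom filter setting.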

HBase also uses bloom filters at the StoreFile level. With a row-level bloom filter, HBase can skip an entire StoreFile when the row key is confirmed absent. Without one, HBase reads the block index and then a data block (64KB by default) from every candidate StoreFile.

For capacity planning: estimate your partition count by dividing total data by the target partition size (100MB). With 1TB of data and a 100MB target partition size, you need at least 10,000 partitions. With replication factor 3, that is 30,000 partition replicas across your cluster. Size your cluster so that each node's share of those replicas fits comfortably within its disk, memory, and compaction capacity.

Observability Hooks: Cassandra Nodetool and HBase Master Metrics

For Cassandra, nodetool is the primary operational interface. The commands that matter in production: nodetool status shows which nodes are up and how much data each owns. nodetool tpstats shows thread pool statistics; if the dropped message counts are non-zero, your cluster is overloaded and messages are being dropped. nodetool tablestats gives per-table read/write latency, tombstone counts, and bloom filter false positive rates. nodetool proxyhistograms shows the latency distribution for requests coordinated by the node you run it on.

The bloom filter false positive ratio tells you whether bloom filters are working. A ratio well above the table's bloom_filter_fp_chance setting (0.01 by default for most compaction strategies) means too many unnecessary disk reads are occurring: the filters are not eliminating enough absent-key lookups. This usually means the bloom filter was misconfigured at table creation.

For HBase, the HBase Master UI exposes region assignment, load distribution, and compaction queue depth. The RegionServer metrics are more critical: MemStoreSize (if it approaches hbase.regionserver.global.memstore.size, writes will be blocked), StoreFileCount (too many StoreFiles indicate compaction is lagging), and compactionQueueSize. If the compaction queue grows faster than it is consumed, read latency degrades.

HBase also exposes metrics via JMX: HRegionServer.metrics shows per-region read/write latencies. The SlowGet metric tracks get operations taking longer than hbase.slow.get.threshold (default 1000ms). If SlowGet is climbing, your BlockCache is likely thrashing or HFiles have grown too large.

For both systems, monitor your compaction impact. Cassandra’s nodetool compactionhistory shows compaction throughput over time. HBase’s admin.getCompactionState() per table shows whether minor or major compactions are running. Compaction I/O competes with normal read/write I/O — if latency spikes coincide with compactions, consider throttling compaction throughput or scheduling it during off-peak hours.

Real-World Case Study: Instagram’s Cassandra at Scale

Instagram’s Cassandra deployment is one of the largest in the world, storing hundreds of terabytes across their cluster. Their primary use case was the activity log — who commented on your post, who followed you, who liked your photo. The problem was that at Instagram’s scale, writing to a user’s activity log meant writing to a single wide partition keyed by user ID, and celebrity accounts (accounts with millions of followers) created partitions that dwarfed regular user partitions.

Their first approach used Cassandra’s wide rows directly: each user had one massive partition containing all their notifications. This worked until partitions grew too large and read latencies spiked for users with large follower counts. Their fix was bucketing the partition key: instead of one partition per user, they used bucketed keys like user_id#0 through user_id#15 to split the data into 16 smaller partitions. This kept partition size manageable but added query complexity: to get all notifications, they had to query 16 buckets.

The operational lesson from Instagram’s experience: Cassandra’s partition model works until it does not, and when it breaks, it breaks at the worst time (when you are already at scale). Instagram’s team recommends planning partition sizes with 10x headroom over your expected growth: if a partition can comfortably hold 10,000 rows, design the model so that you expect about 1,000, leaving room for the unexpected.

Interview Questions

1. You are designing a Cassandra table for an IoT sensor network. Each sensor sends 100 readings per minute, and you need to query readings by sensor and time range. How do you model the partition key?

Model with a composite partition key (sensor_id, bucket) where bucket is a daily or hourly time bucket depending on retention requirements. A daily bucket for each sensor keeps partition size bounded: 100 readings/minute × 60 minutes × 24 hours = 144,000 rows per partition per day, which at 200 bytes per row is roughly 29MB per partition per day, comfortably under the 100MB guideline. If rows are larger or the reporting rate climbs, switch to hourly bucketing. The alternative is using the sensor ID as the sole partition key and using clustering columns for time, but that puts all days for one sensor on one partition, which means unbounded growth.

2. Your Cassandra cluster has one node showing 3x higher latency than other nodes despite similar data sizes. What do you investigate first?

Check nodetool tpstats on the slow node for non-zero DroppedMessages, which indicates the node is overloaded and cannot keep up with requests. Then check nodetool compactionstats — if compactions are running constantly, the node is I/O bound. Also check nodetool tablestats for that node specifically to see if one table has unusually large partitions or a high bloom filter false positive rate. A single large partition forces Cassandra to read more data per query, and if that partition is hosted on one node, that node bears the extra load while others do not.

3. When would you choose HBase over Cassandra?

HBase makes sense when you are already in the Hadoop ecosystem and need native integration with MapReduce, Spark, or Hive for bulk processing of your data. HBase also makes sense when you need strong consistency guarantees — Cassandra's tunable consistency means you can accidentally read stale data if your consistency level does not match your requirements. HBase is also preferable when your access patterns are predominantly sequential reads and writes (which HDFS handles well) and when you need efficient wide scans across large tables. If you do not have ZooKeeper expertise and are not in the Hadoop ecosystem, HBase's operational complexity is a liability.

4. What happens internally when a Cassandra partition becomes too large?

Cassandra does not split a partition. A partition always lives in full on each of its replicas, so an oversized partition simply keeps growing on the same nodes. Reads on oversized partitions must deserialize more data from disk into memory, which increases read latency and heap pressure, and compaction and repair of that partition take longer. Cassandra logs a warning when it compacts a partition above a configurable threshold (100MB by default). The 2 billion cell limit is a hard ceiling, but performance degrades long before a partition reaches it. The practical fix is to redesign the data model to split the partition using bucketing (adding a bucket number to the partition key), which forces Cassandra to create separate physical partitions for what was previously one logical partition.

5. What is the difference between Cassandra's Leveled Compaction Strategy (LCS) and Size-Tiered Compaction Strategy (STCS)? When would you choose each?

STCS groups SSTables by size: when 4 SSTables of similar size exist (the default threshold), they merge into one larger SSTable. It has low write amplification, so it suits write-heavy workloads, but a read may have to check many SSTables and tombstones can linger until the large SSTables are compacted. LCS organizes SSTables into levels, each level 10x larger than the previous, with non-overlapping SSTables inside each level above level 0; compaction merges SSTables from one level into the overlapping SSTables of the next. A read touches only a few SSTables, so LCS is better for read-heavy and update-heavy workloads, at the cost of more compaction I/O per write. Choose STCS for write-heavy or archival data where read amplification is acceptable. Choose LCS when read latency matters and the cluster has I/O headroom for the extra compaction. TWCS (Time Window Compaction Strategy) is preferred for time-series specifically: it compacts data within time windows together, so SSTables whose data has fully expired can be dropped whole.

6. How do Cassandra's lightweight transactions (LWT) work, and what is the performance cost?

Cassandra LWT uses the IF clause in CQL (e.g., UPDATE ... IF condition) to provide compare-and-set semantics. Under the hood, it uses Paxos consensus to achieve linearizable consistency — a majority of replicas must agree before the write succeeds. This is different from normal Cassandra writes which are eventually consistent. The cost: each LWT requires a full Paxos round trip to a quorum of replicas, adding 2-4x latency compared to a normal write. Use LWT sparingly — only for operations where correctness matters more than latency (e.g., maintaining unique constraints, conditional updates on financial data). For high-throughput use cases, implement optimistic locking at the application level instead.

7. How do you handle Cassandra tombstone accumulation and the resulting read performance degradation?

Tombstones accumulate when you delete data: the deletion is recorded as a tombstone, and the data is not removed immediately. During reads, Cassandra must scan past tombstones to find live data. With many tombstones, read latency increases even if the partition is logically empty. Mitigation strategies: First, prefer TTLs to manual deletes for data that expires on a schedule. Expired cells are still treated as tombstones, but they expire predictably and pair well with the next strategy. Second, use Time Window Compaction Strategy (TWCS) for time-series data: it compacts data within time windows together, and once everything in an SSTable has expired, the entire SSTable is dropped at once. Third, avoid wide partitions with high delete rates. Fourth, if tombstones have already accumulated, run nodetool garbagecollect on the affected table to force tombstone cleanup. Monitor the tombstones-per-slice figures in nodetool tablestats and the tombstone warnings in the Cassandra log; frequent warnings indicate tombstone-related slow reads.

8. Compare HBase's HFile structure to Cassandra's SSTable. How does the storage format affect read/write performance?

Both HBase HFiles and Cassandra SSTables are immutable storage files with sorted key-value pairs, and the write paths mirror each other: HBase writes go through a MemStore and are flushed to HFiles when full, with compaction merging HFiles and removing deleted cells; Cassandra writes go through a memtable and are flushed to SSTables, with compaction merging SSTables and purging tombstones. The main structural difference is where the files live: HFiles sit on HDFS, which replicates them, while SSTables sit on each node's local disk and are replicated by Cassandra itself. Read performance in both depends on bloom filters and block indexes stored at the file level, which determine whether a file must be read for a given row key. HBase's BlockCache caches data blocks in memory; Cassandra's key cache caches the positions of partition keys within SSTables. The compaction strategy in both systems determines the tradeoff between write amplification and read performance.

9. How does the HBase write path differ from Cassandra's, and what are the implications for durability?

HBase write path: Client → WAL (Write-Ahead Log) on HDFS → MemStore (in-memory) → acknowledgment. Data is durable as soon as it is written to the WAL, which is replicated across HDFS DataNodes. Cassandra write path: Client → WAL (commitlog) → MemTable → acknowledgment. Data is durable when written to the commitlog on disk. The difference: HBase's WAL is on HDFS, which provides replication and durability guarantees beyond a single node. Cassandra's commitlog is local disk — for true durability you need replication to other nodes or proper fsync configuration. HBase's architecture means data survives node failures more gracefully (HDFS handles replication and rebalancing). Cassandra requires explicit replication factor settings and repair operations for durability.

10. What is the difference between anti-entropy repair and incremental repair in Cassandra?

Anti-entropy repair (run via nodetool repair) is the standard mechanism for synchronizing data between replicas. A full repair builds Merkle trees over all data on each replica, compares them, and streams the differences. The problem: for large tables, full repair is expensive and can take days. nodetool repair -pr repairs only each node's primary token ranges, so running it on every node covers the ring without repairing the same range once per replica. Incremental repair only examines SSTables that have not yet been marked as repaired, using repair metadata stored with each SSTable, so it is significantly faster on very large datasets. (The flag depends on the version: -inc in Cassandra 2.1, while from 2.2 on incremental is the default and -full requests a full repair.) The tradeoff: repair must complete on every node within gc_grace_seconds so that tombstones reach all replicas before they are purged. If you run it less often, deleted data can reappear.
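
The Merkle tree comparison at the heart of repair can be sketched in plain Python. This is a simplified model, not Cassandra's implementation: each leaf stands for the hash of one token range's data, the leaf count is assumed to be a power of two, and the function names are illustrative.

```python
import hashlib

def h(data: str) -> str:
    return hashlib.sha256(data.encode()).hexdigest()

def build_tree(leaves: list) -> list:
    """Return tree levels from the leaves (level 0) up to the root."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def differing_leaves(tree_a: list, tree_b: list) -> list:
    """Walk down from the root, descending only into subtrees whose hashes differ."""
    top = len(tree_a) - 1
    candidates = [0] if tree_a[top][0] != tree_b[top][0] else []
    for level in range(top - 1, -1, -1):
        candidates = [
            child
            for parent in candidates
            for child in (2 * parent, 2 * parent + 1)
            if tree_a[level][child] != tree_b[level][child]
        ]
    return candidates

# Hypothetical replicas: identical except for one token range
replica_a = [h(f"range-{i}") for i in range(8)]
replica_b = list(replica_a)
replica_b[5] = h("range-5-with-a-missed-write")

diff = differing_leaves(build_tree(replica_a), build_tree(replica_b))
print(diff)  # [5]: only this range needs to be streamed
```

When the roots match, the comparison stops immediately, which is why repair between replicas that are already in sync transfers almost nothing.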

11. How do you design a Cassandra schema for a messaging application where you need to query by conversation and by time, and handle unlimited message growth?

Design with bucketing to handle unlimited growth. Use ((conversation_id, bucket_id), message_id) as the primary key, where bucket_id is a time bucket (e.g., monthly) and message_id is a timeuuid clustering column. This keeps a conversation's messages for one bucket together in one partition, sorted by time, and bounds partition size to one bucket's worth of messages. For querying all messages in a conversation within a time range, you query the specific buckets that cover the time range. Each partition is bounded: even a busy conversation stores only one month of messages per partition. Because a timeuuid embeds its creation time, no separate timestamp clustering column is needed for ordering. Use a separate table for the latest message per conversation (updated on each send) for inbox listing queries, so you do not need to scan all buckets to show the conversation list.

12. What are the tradeoffs between HBase's ZooKeeper-based coordination versus Cassandra's gossip-based coordination?

ZooKeeper provides a centralized, consistent coordination service — all cluster state changes go through ZooKeeper consensus. This gives HBase strong consistency guarantees on cluster operations (region assignment, master election) but creates a dependency: ZooKeeper is a single point of failure if not properly replicated, and ZooKeeper performance limits cluster operation speed. Cassandra uses gossip protocol — nodes exchange state information peer-to-peer, no central coordinator. This is more resilient to node failures (no single point of failure) and scales better for large clusters, but achieving consistency takes longer (must wait for gossip to propagate through the cluster). Cassandra's approach is operationally simpler (no ZooKeeper cluster to manage) but requires more nodes for quorum decisions. ZooKeeper's approach is more predictable but adds operational complexity.

13. How does Cassandra's hint replay work, and what happens to writes during a node's downtime?

When a Cassandra node is down, coordinators store hints (small records of the writes the node missed) for up to the hint window (max_hint_window_in_ms, 3 hours by default). When the down node comes back, the nodes holding hints replay them, and the writes are applied as if they had arrived normally. This lets the recovered node catch up on writes it missed during the downtime. The implications: if a node is down longer than the hint window, hints stop being recorded, so the node misses those writes until anti-entropy repair restores them. The node that was down will be out of sync until repair runs. Use nodetool repair after any significant downtime to synchronize data.

14. What is the difference between LOCAL_ONE and LOCAL_QUORUM consistency levels, and when would you use each?

LOCAL_ONE requires a response from a single replica in the local datacenter: fastest, but with the weakest consistency (may read stale data). LOCAL_QUORUM requires a majority of the replicas in the local datacenter, which gives good consistency for multi-datacenter deployments without cross-datacenter latency. Use LOCAL_ONE when speed matters more than consistency (dashboards, non-critical reads). Use LOCAL_QUORUM when you need strong consistency within a datacenter but cannot tolerate cross-datacenter latency (user-facing writes that must be durable). QUORUM waits for a majority of all replicas across every datacenter combined, which gives the highest consistency of the three but is slow for global deployments.

15. How do you handle the Cassandra 2 billion cells per partition limit in practice?

The 2 billion cell limit is a hard internal limit. Hitting it typically manifests as write timeouts on a specific partition. The fix: redesign the partition key to split the data across more partitions. Bucketing is the standard approach: instead of (sensor_id, timestamp) as the primary key (unbounded), use ((sensor_id, bucket), timestamp) where bucket is a time bucket (daily, monthly). Each sensor's data splits across multiple buckets, each with bounded row count. Monitor partition size via nodetool tablestats and alert on partitions approaching 100MB (the soft limit for read performance). When designing, calculate worst-case cell count: rows per partition times columns per row. A partition with 1 million rows each having 100 columns has 100 million cells — still far from 2B but approaching performance concerns.

16. How do Cassandra's bloom filters work, and how would you diagnose a bloom filter false positive rate that has climbed to 5%?

Bloom filters are space-efficient probabilistic data structures that tell Cassandra whether a partition might exist in an SSTable without reading the disk. They return false positives (partition marked as existing when it does not) but never false negatives. Each SSTable has its own bloom filter stored in memory, sized at roughly 1-2 bytes per key per SSTable. A 5% false positive rate means 1 in 20 lookups for a partition that is absent from an SSTable still reads that SSTable from disk, which adds latency and I/O overhead. Causes of an elevated false positive rate include: partition key cardinality change (much higher than expected), SSTable count growth without compaction, or misconfigured bloom filter space allocation (the table's bloom_filter_fp_chance setting). Fix: run nodetool scrub to rebuild bloom filters for problematic tables, or run major compaction to reduce SSTable count. Monitor BloomFilterFalsePositives and BloomFilterFalseRatio via nodetool tablestats to catch degradation early.
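To make the "might exist, never wrongly absent" behavior concrete, here is a small generic Bloom filter. It is a sketch under stated assumptions: SHA-256 with double hashing stands in for whatever hash Cassandra uses internally, and the sizes (10 bits per key, 7 hash positions) are chosen for illustration rather than taken from any Cassandra default.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive k bit positions from two halves of one digest (double hashing).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # True can be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


bf = BloomFilter(num_bits=10_000, num_hashes=7)
stored = [f"user-{i}" for i in range(1_000)]
for key in stored:
    bf.add(key)

absent = [f"ghost-{i}" for i in range(10_000)]
false_positives = sum(bf.might_contain(k) for k in absent)
print(f"observed false positive ratio: {false_positives / len(absent):.4f}")
```

Adding many more keys to the same 10,000 bits drives the observed ratio up, which mirrors the cardinality-growth cause described above.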

17. What are virtual nodes (vnodes) in Cassandra, and what are the tradeoffs of using 256 vnodes versus 8 vnodes per node?

Virtual nodes divide each physical node's share of the ring into multiple token ranges (256 was the default before Cassandra 4.0, which lowered it to 16), allowing data to be distributed more evenly across the cluster and simplifying operational tasks. With vnodes, when a node fails, its token ranges are scattered across the remaining nodes rather than being reassigned as a single contiguous block. This speeds up rebalancing and improves cluster recovery. The tradeoff: more vnodes mean more per-node overhead, because repair builds Merkle trees per token range, which increases memory usage and network traffic during repair operations. With 256 vnodes, repair operations generate significantly more Merkle tree comparisons than with 8 vnodes. Use 256 vnodes when you want the most granular data distribution and fast recovery. Use fewer vnodes (8-16) when operational simplicity and reduced repair overhead matter more than fine-grained distribution.
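The effect of vnode count on balance can be seen with a toy ring. This sketch assumes tokens are assigned uniformly at random, which is a simplification (newer Cassandra versions can use a smarter allocation algorithm), and the ring size, node count, and seed are arbitrary illustration values.

```python
import random
from collections import defaultdict

RING_SIZE = 2**64  # size of the token space used for this sketch


def ownership(num_nodes: int, vnodes_per_node: int, seed: int = 7) -> dict:
    """Fraction of the ring owned by each node when tokens are assigned at random."""
    rng = random.Random(seed)
    tokens = sorted(
        (rng.randrange(RING_SIZE), node)
        for node in range(num_nodes)
        for _ in range(vnodes_per_node)
    )
    owned = defaultdict(int)
    # Each token owns the range from the previous token up to itself.
    prev = tokens[-1][0] - RING_SIZE  # wrap around the ring
    for token, node in tokens:
        owned[node] += token - prev
        prev = token
    return {node: size / RING_SIZE for node, size in owned.items()}


def spread(shares: dict) -> float:
    """Gap between the most and least loaded node."""
    return max(shares.values()) - min(shares.values())


for vnodes in (8, 256):
    shares = ownership(num_nodes=6, vnodes_per_node=vnodes)
    print(f"{vnodes:>3} vnodes: spread {spread(shares):.4f}")
```

With random tokens, more vnodes per node average out the range sizes, so the gap between the busiest and idlest node shrinks; the repair overhead discussed above is the cost of that smoothing.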

18. How does HBase's major versus minor compaction differ, and when would you intentionally delay major compaction?

Minor compaction merges a few smaller HFiles into a larger one, reducing the number of files that must be read on each query. Major compaction merges all HFiles in a store (one column family of a region) into a single file and physically removes deleted cells along with their tombstone markers. By default, HBase runs minor compactions continuously and major compactions on a timer (hbase.hregion.majorcompaction, 7 days by default in current releases). Delaying major compaction is sometimes intentional when: you have time-sensitive workloads that cannot tolerate the I/O spike (major compaction is I/O intensive on large tables), you rely on TTLs rather than explicit deletes so there are few tombstones to purge, or you are in a period of high write load and prefer to defer compaction overhead. The risk: deleted data stays on disk, hidden from reads by its tombstones, until major compaction runs, and read performance may degrade as HFile count grows. Monitor compactionQueueSize and StoreFileCount on RegionServers; a growing queue signals compaction is falling behind.
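The difference between the two compaction types can be sketched with dictionaries standing in for HFiles. This is a toy model, not HBase code: each "file" maps a row key to a (timestamp, value) pair, None marks a delete, and the file selection rule (the two smallest files) is a simplifying assumption.

```python
TOMBSTONE = None  # marker value for a delete in this sketch


def merge(files):
    """Merge HFile-like dicts of row_key -> (timestamp, value); newest timestamp wins."""
    merged = {}
    for hfile in files:
        for row_key, cell in hfile.items():
            if row_key not in merged or cell[0] > merged[row_key][0]:
                merged[row_key] = cell
    return merged


def minor_compact(files, count=2):
    """Merge the `count` smallest files; delete markers are kept."""
    by_size = sorted(files, key=len)
    return [merge(by_size[:count])] + by_size[count:]


def major_compact(files):
    """Merge every file into one and drop delete markers with the cells they mask."""
    merged = merge(files)
    return [{k: c for k, c in merged.items() if c[1] is not TOMBSTONE}]


store = [
    {"row1": (1, "a"), "row2": (1, "b"), "row3": (1, "c")},
    {"row2": (2, TOMBSTONE)},
    {"row4": (3, "d")},
]
after_minor = minor_compact(store)
after_major = major_compact(store)
print(len(after_minor), "files after minor:", after_minor)
print(len(after_major), "file after major:", after_major)
```

The minor compaction has to keep the delete marker for row2 because the older value it masks lives in a file that was not part of the merge; only the major compaction sees every file and can safely drop both.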

19. What is the role of Cassandra's snitch, and how does the GossipingPropertyFileSnitch differ from SimpleSnitch in a multi-datacenter deployment?

The snitch determines the network topology of a Cassandra cluster — it tells Cassandra which datacenter and rack a node belongs to, so the replication strategy can place replicas correctly. SimpleSnitch treats all nodes as if they are in the same datacenter and rack — use it only for single-datacenter single-rack clusters. GossipingPropertyFileSnitch is the standard for multi-datacenter deployments: each node gossips its own location (datacenter and rack) to other nodes, and this information propagates cluster-wide via the gossip protocol. This allows Cassandra's replication strategy to place replicas in different datacenters and racks for fault tolerance. Using SimpleSnitch in a multi-datacenter cluster means Cassandra ignores topology — replicas may end up in the same datacenter when they should be spread across datacenters, defeating the purpose of multi-datacenter replication. Always configure endpoint_snitch: GossipingPropertyFileSnitch and set dc and rack in the cassandra-rackdc.properties file for any cluster with more than one datacenter.

20. You need to build an analytics dashboard on Cassandra that aggregates data across all partitions. What are the tradeoffs between materialized views, manual denormalization tables, and Spark integration for this use case?

Materialized views (MV) in Cassandra automatically maintain a denormalized table keyed by a different partition key — useful for shifting the partition key to a new shape while keeping data in sync. Tradeoff: MV writes add overhead to every write in the source table, and MV builds asynchronously so you may read stale data. For analytics aggregating across all partitions, MVs help but do not solve the fundamental problem: Cassandra is not designed for cross-partition scans. Manual denormalization (separate tables per access pattern) gives you full control and predictable performance, but requires your application to keep multiple tables in sync — more code, more complexity, risk of inconsistency. Spark integration (using Spark Cassandra Connector) allows parallel scans across the entire cluster by leveraging Spark's distributed execution model. This is the most scalable approach for analytics — you read all data in parallel, aggregate in Spark, and write results to a separate table or external system. Tradeoff: Spark jobs have startup overhead (not for sub-second queries), and the connector must carefully manage partition pruning to avoid scanning the entire cluster for every query. For dashboard use cases with moderate latency requirements (seconds to minutes), Spark is the right tool; for real-time dashboards, pre-aggregate into a separate table using scheduled Spark jobs.

Failure Scenarios and Trade-off Analysis

Compaction Strategy Trade-offs

Strategy | Best For                     | Avoid When                                       | Tombstone Handling
LCS      | Read-heavy, frequent updates | Write-heavy ingestion (high write amplification) | Faster
STCS     | Write-heavy, general purpose | Read-heavy or high delete rates                  | Slower
TWCS     | Time-series with TTLs        | Non-time-ordered data                            | Clean per window

Write Amplification Analysis

Wide-column databases suffer from write amplification — one logical write can become multiple physical writes during compaction.

# Rough write amplification factor by compaction strategy.
# The numbers are illustrative orders of magnitude, not measurements.
WRITE_AMPLIFICATION = {
    "STCS": 4,   # size-tiered rewrites data a few times as tiers merge
    "LCS": 10,   # leveled rewrites data on each promotion between levels
    "TWCS": 2,   # time-window compacts only within a window, then leaves it alone
}


def estimate_write_amplification(compaction_strategy):
    """Return the rough factor for a strategy, or 1 if the strategy is unknown."""
    return WRITE_AMPLIFICATION.get(compaction_strategy, 1)

Common Production Failures

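Monotonically increasing partition key causing write hotspot: You use event_id bigint as the primary key and insert events with increasing IDs. Under an order-preserving partitioner such as ByteOrderedPartitioner, consecutive keys map to adjacent tokens, so all new writes land on the same node and throughput collapses under load. The default Murmur3Partitioner hashes each key and avoids this case, but the same hotspot appears whenever all current writes share one partition key value, such as a single date bucket. The solution is a partition key with natural cardinality: something like (user_id, event_time), with user_id as the partition key, spreads writes across nodes.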

Unbounded partition growing past Cassandra’s 2B cell limit: You model IoT sensor data with (sensor_id, timestamp) as the primary key but forget to bucket by date. One sensor running for 10 years generates millions of columns in a single partition, eventually hitting internal limits and causing read timeouts. Bucket time-series data by day or week — this is standard practice for a reason.

HBase region hotspot from sequential row keys: You use sequential IDs like order-00000001 as row keys. All recent orders cluster on one region server, which gets overwhelmed while others sit idle. UUIDs, hash-based keys, or reversed timestamps all spread writes more evenly.

Cassandra consistency misconfigured for the replication factor: You set consistency_level=ONE with a replication factor of 3, thinking you are getting good coverage. With ONE, you read from the closest replica, which might have stale data while another replica holds the latest write. QUORUM with replication_factor=3 requires 2 of 3 replicas to acknowledge; when both reads and writes use QUORUM, every read overlaps at least one replica that holds the latest write.

HBase compaction starving compute resources: A scheduled major compaction starts overnight and merges hundreds of HFiles. On large tables it saturates disk I/O and network bandwidth and is still running when your users wake up, so reads slow to a crawl. Trigger major compactions manually in an off-peak window long enough for them to finish, and throttle or stagger them across regions.

Wrong partition key preventing efficient time-range queries: You model (user_id, order_id) as the primary key, then need to query all orders within a date range regardless of user. Cassandra scatters these across partitions, so range scans hit every node. If time-range access is a real use case, put the date or time bucket in the partition key from the start.
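For the HBase sequential row key failure above, one common remedy is a salted key: a short hash-derived prefix that scatters neighbouring IDs across regions. The sketch below assumes 16 salt buckets and a `NN-` prefix format; both are hypothetical choices for illustration.

```python
import hashlib
from collections import Counter

NUM_SALT_BUCKETS = 16  # hypothetical; usually matched to the number of pre-split regions


def salted_row_key(order_id: str, buckets: int = NUM_SALT_BUCKETS) -> str:
    """Prefix the key with a stable, hash-derived bucket so sequential IDs scatter."""
    digest = hashlib.sha256(order_id.encode("utf-8")).digest()
    salt = digest[0] % buckets
    return f"{salt:02d}-{order_id}"


sequential_ids = [f"order-{i:08d}" for i in range(1, 1001)]
salted = [salted_row_key(order_id) for order_id in sequential_ids]

# Count how many keys landed in each salt bucket.
per_bucket = Counter(key.split("-", 1)[0] for key in salted)
print(salted[:3])
print(sorted(per_bucket.items()))
```

The salt is derived from the ID itself, so a point lookup can recompute it. The cost is that a range scan over consecutive order IDs now has to fan out across all buckets.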


Security Checklist

  • Enable Cassandra’s authenticator and authorizer to enforce role-based access control on keyspaces and tables
  • Use TLS for all inter-node communication (Cassandra’s server_encryption_options) to prevent traffic sniffing between nodes
  • For HBase: enable Kerberos authentication via hbase.security.authentication = kerberos; use ACLs on namespace and table levels
  • Restrict HBase’s Thrift and REST ports to internal networks; disable the HBase Master UI port on public interfaces
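  • Encrypt data at rest: open-source Apache Cassandra does not encrypt SSTables itself (its transparent_data_encryption_options cover commit logs and hints), so use disk- or volume-level encryption or a distribution that provides SSTable encryption; use HDFS encryption at rest for HBase
  • Scope Cassandra permissions with GRANT/REVOKE on roles; these apply to keyspaces and tables, not to individual rows or partition ranges, so enforce row-level restrictions in the application or by splitting data into separately permissioned tables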

Common Pitfalls and Anti-Patterns

Unbounded partition size: A partition with no upper size limit causes read and write performance to degrade as it grows. In Cassandra the hard ceiling is 2 billion cells per partition, but read performance suffers long before that, at around 100MB; HBase has no explicit limit but large rows cause memory pressure. Fix: use bucketed partition keys to split large partitions, and monitor partition statistics continuously.

Tombstone overload during reads: Cassandra marks deleted data as tombstones; queries that must scan many tombstones become slow. Fix: avoid querying partitions with high delete rates, and rebuild problematic tables with nodetool garbagecollect.

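Using sequential keys in Cassandra: Under the default Murmur3Partitioner, a timeuuid or monotonic counter used as the partition key is hashed and spreads evenly. The hot spot appears when such keys are used with an order-preserving partitioner, or when a coarse time bucket is the only partition key, so that all current writes go to one replica set. Fix: keep a hashing partitioner and add a high-cardinality element (such as an entity ID) to the partition key.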

HBase pre-splitting misconfiguration: Starting with a single region causes all initial writes to hit one region server until a split occurs. Fix: pre-split tables using a hex-split strategy or hash-based split keys.

Ignoring bloom filter memory: Bloom filters in Cassandra can consume significant memory for high-cardinality partitions. Fix: monitor BloomFilterFalsePositives and BloomFilterFalseRatio metrics.
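For the pre-splitting pitfall above, split points for hex-encoded row keys can be computed by dividing the key space evenly. This is a sketch in the spirit of a hex-split strategy, assuming row keys begin with a fixed-width, uniformly distributed hex prefix; the function name and the 8-character width are illustrative.

```python
def hex_split_keys(num_regions: int, width: int = 8) -> list:
    """Evenly spaced split points over a fixed-width hex key space.

    Returns num_regions - 1 boundaries, which yields num_regions regions.
    """
    if num_regions < 2:
        return []
    key_space = 16 ** width
    return [format(i * key_space // num_regions, f"0{width}x") for i in range(1, num_regions)]


print(hex_split_keys(4))
print(len(hex_split_keys(16)))
```

Even splits only help if the leading characters of real row keys are roughly uniform, for example when the key starts with a hash; sequential keys would still pile into the last region.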

Quick Recap Checklist

  • Wide-column databases suit write-heavy workloads with predictable access patterns and horizontal scaling needs
  • Design partition keys for even write distribution; avoid sequential keys and unbounded partition sizes
  • Cassandra offers tunable consistency per query; HBase provides strong consistency with ZooKeeper coordination
  • Add Cassandra secondary indexes sparingly, since CQL has no CONCURRENTLY option and index builds compete with live traffic; use bulk load tools for HBase, on pre-split tables, to bypass the write path during ingestion
  • Monitor tombstone counts, partition sizes, and bloom filter metrics continuously

Conclusion

Wide-column databases trade query flexibility for write scalability and operational simplicity at scale. The data model requires understanding your access patterns upfront, but when modeled correctly, these databases handle massive write volumes efficiently.

Cassandra suits applications needing multi-datacenter replication, tunable consistency, and a SQL-like query interface. HBase suits applications deeply integrated with Hadoop, needing strong consistency and sequential access patterns.

Partition key design determines everything. A poor partition key causes hotspots, inefficient queries, or unbounded partition growth. Spend time on your query patterns before you design your schema — it is the one decision you cannot easily reverse.

For more on NoSQL databases, see the NoSQL overview. For time-series specific databases, see Time-Series Databases. For scaling strategies, see Horizontal Sharding.
