Search Scaling: Sharding, Routing, and Horizontal Growth

Learn how to scale search systems horizontally with index sharding strategies, query routing, replication patterns, and cluster management techniques.

published: March 22, 2026 reading time: 34 min read author: GeekWorkBench updated: June 17, 2026

Quick Summary

Search systems scale horizontally by distributing index shards across nodes to handle increasing data volumes and query loads. Query routing determines which shards to search and how results get merged, while replication adds copies to absorb read traffic. Index Lifecycle Management automates transitions through hot, warm, and cold storage tiers based on access patterns, moving older data to cheaper storage as it ages. These techniques let search clusters grow from single nodes to distributed architectures handling millions of queries per day.

Search Scaling: Sharding, Routing, and Horizontal Growth

Search queries are CPU-intensive, latency-sensitive, and often return large result sets. A single node can only handle so much before query times degrade. At that point, you need to distribute the load across multiple nodes while maintaining fast queries and consistent results.

This post covers sharding strategies, query routing, replication for read scalability, and cluster management.

Core Concepts

Search queries are CPU-bound for relevance scoring and aggregation. A query that scores a million documents uses significantly more CPU than a simple key-value lookup. Memory matters too, but for inverted indexes, the bottleneck is often CPU during the scoring phase.

Search is also partition-sensitive. A query for “recent blog posts” might hit different shards depending on how data is distributed. If your sharding strategy does not match your query patterns, you end up hitting all shards for every query, which defeats the purpose of distribution.

Sharding Strategies

Sharding splits your index into smaller pieces distributed across nodes. Choosing the right sharding strategy is the most important decision in search scaling.

Hash-Based Sharding

The default approach in Elasticsearch: hash the document ID and modulo by shard count.

shard = hash(document_id) % num_primary_shards;

This distributes documents evenly but has a downside: queries without a filter on document_id must query all shards. You cannot target a specific shard.

// Every query hits all shards - no optimization possible
GET /my-index/_search
{
  "query": { "match": { "content": "search term" } }
}

Range-Based Sharding

Shard by a field value, like date or category:

{
  "settings": {
    "index": {
      "sort.field": "publish_date",
      "sort.order": "desc"
    }
  }
}

Range sharding lets you target subsets of data. A query for recent posts might only hit the relevant time-based shards. The tradeoff is hot spots: if most queries hit recent data, those shards become overloaded.

graph LR
    subgraph TimeBasedShards
        Shard1[2024-Q1]
        Shard2[2024-Q2]
        Shard3[2024-Q3]
        Shard4[2024-Q4]
    end

    Query1["query: recent posts"] --> Shard4
    Query2["query: old posts"] --> Shard1

Custom Routing

Elasticsearch allows custom routing to target specific shards:

PUT /my-index/_doc/1?routing=category:tutorials
{
  "title": "Getting Started",
  "category": "tutorials"
}

Queries can then target specific routing values:

GET /my-index/_search?routing=category:tutorials
{
  "query": { "match": { "title": "search" } }
}

This reduces the number of shards queried from N to 1, cutting latency significantly for targeted queries.

Query Routing

Once you have multiple shards, query routing determines which shards get queried and how results are merged.

Coordination Node

When a client sends a search request to Elasticsearch, any node can receive it. That node becomes the coordination node for that request. It broadcasts the query to all relevant shards, collects the results, and merges them.

graph TD
    Client --> Coord[Coordinator Node]
    Coord -->|broadcast| Shard1[Shard 1]
    Coord -->|broadcast| Shard2[Shard 2]
    Coord -->|broadcast| Shard3[Shard 3]
    Shard1 -->|top 10| Coord
    Shard2 -->|top 10| Coord
    Shard3 -->|top 10| Coord
    Coord -->|merged top 10| Client

The coordinator does not do the full search on each shard. Shards return their top N results, and the coordinator merges and re-ranks. This is efficient as long as N is small relative to the total matches.

Adaptive Replica Selection

By default, Elasticsearch routes queries to the nearest replica with available capacity. In a multi-datacenter setup, this means queries go to the replica in the same datacenter as the client, reducing network latency.

You can control replica selection:

GET /my-index/_search?preference=_only_local

The _only_local preference ensures the query hits the local replica only. This is useful when you know the data is local and want to avoid cross-node traffic.

Replication for Read Scalability

Replication adds copies of your data to handle read traffic. Each replica is a full copy of the primary shard, capable of serving searches.

Read-Through Scaling

Adding replicas linearly scales read throughput. With 3 primary shards and 2 replicas, you have 9 total shard copies. Under heavy read load, the coordinator distributes requests across all 9 shards.

PUT /my-index/_settings
{
  "number_of_replicas": 2
}

The tradeoff is write amplification: every write must be indexed on the primary and all replicas. More replicas mean slower writes.

Consistency Considerations

With replication, queries may return stale data if reads hit a replica that has not yet caught up with the primary. Elasticsearch offers consistency controls:

GET /my-index/_search?consistency=quorum

The quorum setting ensures at least half the replicas have acknowledged the write before the read, reducing the chance of reading stale data.

Index Lifecycle Management

As indices grow and age, their access patterns change. A blog post from three years ago gets almost no traffic compared to posts from last week. ILM automates the transition of indices through lifecycle stages, optimizing cost and performance.

Hot-Warm-Cold Architecture

The classic tiered architecture separates nodes by role:

graph TD
    subgraph HotTier[Hot Nodes - SSD]
        HN1[Node 1]
        HN2[Node 2]
        HN3[Node 3]
    end

    subgraph WarmTier[Warm Nodes - High Capacity]
        WN1[Node 4]
        WN2[Node 5]
    end

    subgraph ColdTier[Cold Nodes - Archive Storage]
        CN1[Node 6]
        CN2[Node 7]
    end

    Ingest[Ingest Pipeline] --> HN1
    HN1 -->|recent indices| WN1
    WN1 -->|90 days+| CN1

Hot nodes are where all the writes happen and where most reads land. Fast NVMe SSDs and plenty of RAM for the file system cache.

Warm nodes sit in the middle — indices that have cooled off but still get queried occasionally. SATA SSDs do the job.

Cold nodes are for archived data that almost nobody touches. Spinning disks or object store, whatever’s cheapest. Query latency is higher but that’s fine for old logs.

ILM Policy Definition

PUT _ilm/policy/search-indices-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "7d"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "set_priority": {
            "priority": 50
          },
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          },
          "allocate": {
            "require": {
              "data": "warm"
            }
          }
        }
      },
      "cold": {
        "min_age": "90d",
        "actions": {
          "set_priority": {
            "priority": 0
          },
          "allocate": {
            "require": {
              "data": "cold"
            }
          }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

The rollover action creates a new index when the current one hits 50GB or is older than 7 days. This creates time-based indices like search-indices-000001, search-indices-000002, etc.

ILM Takeaways

Build hot-warm-cold in early — retrofitting under traffic is painful
Set rollover on hot indices so new indices spin up automatically when limits hit
Run forcemerge in the warm phase to cut segment count and speed up reads
Check that phases are actually triggering — monitor index age distribution across tiers
Frozen indices in the cold tier can drop memory usage by ~90% on archived data

Cross-Cluster Search

For organizations with multiple Elasticsearch clusters — whether for multi-tenancy, geographic distribution, or isolation between teams — cross-cluster search enables federated queries across cluster boundaries.

Use Cases

Multi-datacenter replication: Query data in both primary and DR clusters simultaneously
Team isolation: Each team manages their own cluster while a central cluster federates queries
Data sovereignty: Sensitive data stays in a specific region while being queryable globally

Configuration

On the coordinating cluster that federates queries:

PUT _cluster/settings
{
  "persistent": {
    "cluster.remote": {
      "cluster_one": {
        "seeds": ["192.168.1.1:9300", "192.168.1.2:9300"]
      },
      "cluster_two": {
        "seeds": ["192.168.2.1:9300", "192.168.2.2:9300"]
      }
    }
  }
}

Query across clusters using the fully qualified index name:

GET cluster_one:my-index,cluster_two:my-index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "search term" } },
        { "term": { "cluster": "cluster_one" } }
      ]
    }
  }
}

Cross-Cluster Replication vs Cross-Cluster Search

Feature	Cross-Cluster Search	Cross-Cluster Replication (CCR)
Data movement	Query-only, no replication	Real-time index replication
Latency	Higher (network round-trip)	Lower (local reads)
Consistency	Eventually consistent	Consistently follower reads
Failover	Manual query redirection	Automatic failover
Use case	Ad-hoc federation, read scaling	Active-passive DR

Cross-Cluster Takeaways

Cross-cluster search adds network hops — budget 10-30ms of extra latency
Use CCR if you need automatic failover for disaster recovery
Watch seed node health — failed seeds mean failed queries
Encrypt and authenticate cross-cluster traffic; don’t let it traverse networks untrusted

Search Capacity Planning

Before scaling, you need to know how much capacity you have and how much you need. Capacity planning prevents both over-engineering (wasted cost) and under-engineering (performance degradation).

Estimating Storage Requirements

Storage per shard depends on several factors:

{
  "index": {
    "number_of_shards": 5,
    "number_of_replicas": 2,
    "refresh_interval": "1s"
  },
  "analysis": {
    "analyzer": "standard"
  }
}

Rough estimation formula:

raw_data_size * (1 + overhead) * (1 + replica_factor) / num_primary_shards

Where:
- overhead = ~1.1-1.3x (segment merge, field data, deleted docs)
- replica_factor = 1 + number_of_replicas

For 100GB of raw JSON data with 2 replicas and 1.2x overhead: 100GB * 1.2 * 3 / 5 = 72GB per primary shard

Estimating Query Throughput

Single-shard QPM varies across a 3x range depending on query complexity, hardware, and segment count. A simple term match against a small index with few segments runs fast on modest hardware. A multi-match query with nested aggregations across 50 segments on a CPU-constrained node can easily drop to the low end. Network topology also matters — shards that cross datacenters add latency from inter-node communication during fetch phases.

To measure your actual shard capacity, run representative queries against a single-node benchmark cluster under controlled load. Use a tool like rally (from Elasticsearch’s benchmarking toolkit) or Bombardier for HTTP-level load testing. Vary concurrency from 10 to 500 parallel clients and plot throughput versus P99 latency. The inflection point where latency spikes while throughput flatlines is your practical shard ceiling. Target 70% of that ceiling in your capacity formula to leave headroom for traffic spikes.

A single shard can handle roughly 5,000-15,000 queries per minute depending on query complexity and hardware. For P99 latency under 100ms, target the lower end.

Needed QPS * 60 / (shards * replicas * 5000 QPM_shard) = num_nodes

Node Sizing Guidelines

Node Role	CPU Cores	RAM (GB)	Storage	Typical Use
Hot	16-32	64-128	1-2TB NVMe	Primary + recent indices
Warm	8-16	32-64	2-4TB SATA SSD	Older indices
Cold	4-8	16-32	4-8TB Spinning	Frozen/archive indices
Master	4-8	8-16	64GB SSD	Cluster coordination only
Coordinating	8-16	16-32	Local SSD	Query routing only

Capacity Planning Checklist

Baseline current QPS, P50/P95/P99 latency, and throughput per shard
Pull storage growth rate (GB/day) from historical data
Project out 6-12 months, multiply by 1.5-3x for growth
Size for N+2 redundancy on critical clusters
Run stress tests with simulated peak load before going live
Leave 30% headroom when adding nodes — rebalancing chews through resources

Index vs Shard Strategy

A common question: should I add more shards to an existing index or create new indices? The answer depends on your access patterns and data lifecycle.

When to Add Shards to an Existing Index

Document volume is increasing: More shards distribute the write load
Query latency is increasing: More shards allow more parallel processing
Single index is approaching size limits: Elasticsearch recommends 50GB per shard as a guideline

POST my-index/_shrink
{
  "settings": {
    "index.number_of_shards": 3,
    "index.number_of_replicas": 1
  }
}

When to Create New Indices

Time-based data: Create daily, weekly, or monthly indices (e.g., logs-2026-04-01)
Rolling retention: Easy to delete old indices without reindexing
Different mappings: Separate indices for different document types

PUT logs-2026-04-17
{
  "aliases": {
    "logs-current": {}
  }
}

Index-Per-Day vs Index-Per-Month Trade-offs

Strategy	Shard Count	Index Management	Deletion	Best For
Index/day	High (365/year)	Complex rollover	Granular	High-volume logs
Index/week	Medium (52/year)	Moderate	Less granular	Moderate volume
Index/month	Low (12/year)	Simple	Coarse	Search over archives

Index vs Shard Takeaways

Index-per-day is the standard for time-series — makes rollover, deletion, and tiering straightforward
Don’t let shards blow past 50GB — Elasticsearch itself recommends staying under that
For non-time-series data, index aliases let you change routing without touching application code
An index template keeps all new indices consistent on settings and mappings

Vector Search Scaling

With the rise of AI-powered search (semantic search, retrieval-augmented generation), many search systems now handle vector embeddings alongside text. Scaling vector search has different characteristics than text search.

ANN Index Structures

Approximate Nearest Neighbor (ANN) indexes like HNSW (Hierarchical Navigable Small World) trade recall for speed. HNSW builds a multi-layer graph where searching starts from the top layer and narrows down.

graph TD
    subgraph Layer2[Layer 2 - Sparse]
        L2_1((A)) --> L2_2((B))
        L2_2 --> L2_3((C))
        L2_3 --> L2_4((D))
    end

    subgraph Layer1[Layer 1 - Medium]
        L1_1((E)) --> L1_2((F))
        L1_3((G)) --> L1_4((H))
    end

    subgraph Layer0[Layer 0 - Dense]
        L0_1((I)) --> L0_2((J))
        L0_3((K)) --> L0_4((L))
    end

    L2_1 -->|down| L1_1
    L1_2 -->|down| L0_2
    L0_2 -->|down| L0_3

HNSW parameters to tune:

m: number of connections per node (higher = better recall, more memory)
ef_construction: depth of search during indexing (higher = better recall, slower indexing)

Vector Sharding Challenges

In low-dimensional spaces (2D or 3D), you can partition cleanly — a query in quadrant A only touches quadrant A’s shard. In 768D space (typical for text embeddings) or 1536D (OpenAI ada-002), that guarantee breaks down. As dimensions increase, distances between any two points converge. Every vector ends up roughly equidistant from every other. This is the curse of dimensionality in vector search: partitioning boundaries become meaningless because a vector near the edge of shard 1 is just as close to points in shard 2.

You cannot route vector queries the way you route text queries. With text, a filter on publish_date or user_id tells Elasticsearch exactly which shard to hit. Vector search has no equivalent — there is no “vector routing key” that maps an embedding to a specific shard. You must query all shards to guarantee you find the true nearest neighbors. Skip one shard and the best match for your query might live there.

This fan-out has a real cost. With 8 primary shards, a single vector query fires 8 parallel ANN searches and waits for the slowest to respond. Compare that to a text query using custom routing, which hits exactly one shard. The latency multiplier is 8x from fan-out alone, before any scoring or merging overhead. And unlike text where you can improve routing with better keys, vector fan-out is built into the data structure — no routing trick eliminates it.

The coordination node collects and merges 8 or 16 or 32 sets of top-k results from every shard. Each shard still runs a full HNSW traversal. If you are running hybrid search (text + vector in a single query), the text filter narrows candidate documents, but the vector portion still fans out because the coordinator doesn’t know which shards hold the matching documents. This is why vector search clusters lean harder on replicas than equivalent text clusters: you cannot shard your way out of the fan-out, so the only scalable path is more copies serving parallel queries.

Scaling Strategies

Coordinating node pre-filters: Route queries to shards that likely contain relevant vectors
Semantic routing: Use a lightweight classifier to route queries to topic-specific indices
Ensemble retrieval: Query all shards with shorter ef_construction and merge, vs query fewer shards with higher ef_construction

Vector Search Takeaways

HNSW is hungry for memory: plan on ~1GB per million vectors at 768 dimensions
Quantization (int8 or product quantization) cuts memory use by 4-8x with acceptable recall loss
For hybrid search (text + vector), run knn alongside match in a bool filter
Pre-filtering kills vector search performance — design your index structure around the filter, not the other way around

Trade-off Analysis

Search scaling involves fundamental trade-offs between competing concerns. Understanding these helps you make informed decisions for your specific workload.

Core Trade-off Dimensions

Consistency vs Performance

Distributed search systems must balance consistency against query performance. Strong consistency (ensuring all replicas reflect the latest writes) requires coordination overhead that increases latency. Most search systems opt for eventual consistency, where queries may briefly return stale results in exchange for lower latency. Use consistency controls like quorum writes only when your application genuinely requires them.

Write Throughput vs Data Safety

More replicas improve read scalability but amplify write overhead. Each write must be indexed on the primary and all replicas, multiplying I/O and CPU consumption. For write-heavy workloads, minimize replica count. For read-heavy workloads, add replicas strategically — do not over-provision.

Query Latency vs Recall

Approximate Nearest Neighbor (ANN) algorithms like HNSW trade recall accuracy for query speed. Higher ef_construction and m parameters improve recall at the cost of memory usage and indexing time. For latency-sensitive production systems, benchmark acceptable recall thresholds before settling on parameters.

Shard Count vs Overhead

More shards enable greater parallelism but each shard incurs overhead: heap usage for the shard object, segment metadata, and coordination overhead. Elasticsearch recommends keeping shard size under 50GB and total shard count per node under 1,000. Over-sharding (too many tiny shards) is a common mistake that degrades cluster stability.

Operational Complexity vs Capability

Features like cross-cluster search, custom routing, and hot-warm-cold architectures add capabilities but also operational complexity. Cross-cluster search introduces network latency. Custom routing requires application-level awareness of shard keys. Hot-warm-cold requires careful capacity planning across tiers. Evaluate whether the added complexity delivers genuine value for your use case.

Operations and Reliability

Node Failure Handling

When a node fails, Elasticsearch automatically promotes replicas to primaries. This is automatic but can cause a brief period of reduced capacity while promotion completes.

For critical workloads, keep at least 2 replicas of every shard. Single replicas create single points of failure.

Rebalancing

When you add a new node, Elasticsearch rebalances shards across the cluster automatically. This involves moving shard copies from overloaded nodes to underloaded ones.

Rebalancing can impact query performance. For sensitive workloads, use shadow replicas or temporarily disable rebalancing during maintenance windows.

Frozen Indices

Elasticsearch 7.x introduced frozen indices. These are indices explicitly marked as frozen, consuming less memory because their shard data is not cached. Queries against frozen indices are slower but functional.

POST /my-index/_freeze
POST /my-index/_unfreeze

If you have time-based data that is rarely queried (like logs from two years ago), freezing can save significant memory.

When to Use / When Not to Use

When to Use Search Scaling Techniques

Single-node search hits a wall at two distinct bottlenecks. The first is CPU — relevance scoring, which runs for every matching document during query execution, becomes the limiting factor once your index fits in memory and disk I/O is no longer the bottleneck. The second is memory pressure from inverted index data structures: a large index with high cardinality fields can consume tens of gigabytes even for moderate document counts. When your working set no longer fits in RAM, page faults add double-digit milliseconds to every query.

Monitor query latency P99, CPU utilization on the search nodes, and JVM heap pressure (if using Elasticsearch/OpenSearch). If P99 latency degrades by 2-3x under peak sustained load while CPU stays below 70%, the JVM is spending time in GC pauses — adding more nodes will not fix that. If CPU is consistently above 80% during normal traffic, you need more compute. If memory is high but CPU is low, segment the data differently or add warm nodes. Scale when latency degradation pairs with one of these resource constraints, not when some round data-size threshold is crossed.

Data volume exceeds single-node capacity (typically > 500GB per node for search workloads)
Query latency degrades under load despite indexing optimization
Read throughput requirements exceed single-node capabilities
High availability is mandatory — zero tolerance for single points of failure
Multi-datacenter deployments requiring geographic distribution of queries

When Not to Use Search Scaling Techniques

Small to medium datasets (< 100GB) where a single well-tuned node suffices
Low query concurrency — a single node handling < 100 QPS rarely needs sharding
Simple key-value lookups that do not benefit from distributed search
Write-heavy workloads with simple queries — replication does not scale writes
Cost-constrained projects where operational complexity of multi-node clusters outweighs benefits

Production Failure Scenarios

Failure	Impact	Mitigation
Shard hotspot due to skewed routing	Some nodes overloaded while others idle	Use routing keys that distribute evenly; monitor shard sizes
Coordinating node bottleneck	All search requests fail if coordination node crashes	Use dedicated coordinating nodes; do not route client traffic to data nodes
Network partition causing split-brain	Conflicting primaries accept divergent writes	Use quorum-based writes (`consistency=quorum`), monitor cluster state
Replica lag during heavy indexing	Stale results returned from lagging replicas	Set `read_slience` timeout, throttle indexing rate during peak read hours
Shard relocation timeout	Rebalancing stalls mid-transfer, data temporarily unavailable	Increase `recovery_time_window`; avoid rebalancing during high load
Memory pressure from aggregations	Facet/aggregation queries cause OOM	Set `max_buckets` limits; pre-compute aggregations with background jobs

Common Pitfalls / Anti-Patterns

Anti-Patterns

Premature Sharding

Sharding adds complexity without benefits for small datasets. A single shard performs scatter-gather across segments within that shard anyway. If your total index size is under 50GB and latency is acceptable, do not shard.

Rule: Shard when a single node cannot handle the workload or data volume.

Ignoring Rebalancing Impact

Adding a new node triggers automatic rebalancing. During rebalance, the cluster moves shard copies, consuming network bandwidth and CPU. This degrades query performance for running requests.

Fix: Disable automatic rebalancing during maintenance windows:

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.rebalance.enable": "none"
  }
}

Not Using Circuit Breakers

Complex queries with deep aggregations or large page sizes can cause OOM on coordinating nodes. Without circuit breakers, one bad query can crash an entire cluster.

Fix: Configure and monitor circuit breakers:

{
  "indices.breaker.total.limit": "70%",
  "indices.breaker.request.limit": "40%",
  "indices.breaker.fielddata.limit": "60%"
}

Targeting Wrong Shards with Routing

Custom routing narrows the shards queried, but if the routing key is unevenly distributed, it creates hotspots. A common mistake is routing by user ID when a small subset of users generates most traffic.

Fix: Choose routing keys with high cardinality and even distribution. Monitor shard sizes per routing value.

Over-Allocating Replicas

Replicas multiply storage and indexing overhead. Six replicas of three shards means six copies of every document. For write-heavy workloads, this slows indexing dramatically.

Fix: Start with one replica. Add more only when read scalability demands it.

Quick Reference

Key Bullets

Hash-based sharding queries all shards by default
Range-based or custom routing targets subsets when access patterns allow
Replicas scale reads but amplify writes and storage
The coordination node broadcasts to all shards and merges results
Adaptive replica selection routes queries to the least-loaded replica
Frozen indices trade query speed for memory savings on rarely-accessed data
Rebalancing is automatic but can degrade performance during heavy load

Copy/Paste Checklist

# Check shard distribution across nodes
GET /_cat/shards?v&h=index,shard,prirep,state,node

# Force rebalancing after adding a node
POST /_cluster/reroute?retry_failed=true

# Disable rebalancing during maintenance
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.rebalance.enable": "none"
  }
}

# Update replica count for a specific index
PUT /my-index/_settings
{
  "number_of_replicas": 2
}

# Freeze a rarely-used index
POST /my-index/_freeze

# Set custom routing on a document
PUT /my-index/_doc/1?routing=user_123
{
  "title": "My Post",
  "user_id": "user_123"
}

# Query with custom routing
GET /my-index/_search?routing=user_123
{
  "query": { "match": { "title": "search term" } }
}

# Monitor pending tasks
GET /_cluster/pending_tasks

# Check circuit breaker status
GET /_nodes/stats/breaker

Observability Checklist

Metrics to Monitor

{
  "cluster_level": {
    "cluster_health": "green/yellow/red",
    "unassigned_shards": "should be 0 in healthy cluster",
    "number_of_pending_tasks": "< 50 typically",
    "shard_balance_variance": "< 20% across nodes"
  },
  "per_node": {
    "heap_used_percent": "< 85% sustained",
    "cpu_iowait_percent": "< 20% sustained",
    "search_latency_p99_per_node": "< 500ms",
    "indexing_latency_p99": "< 1s per batch"
  },
  "per_shard": {
    "query_count_per_shard": "high variance indicates hotspot",
    "doc_count_per_shard": "rebalance if one shard > 2x average",
    "segment_count": "< 50; force merge if higher"
  }
}

Key Logs to Capture

graph LR
    subgraph LogSources
        ES[Elasticsearch Logs] --> Cluster[Cluster Events]
        ES --> Shard[Shard Allocation]
        ES --> Search[Slow Search]
        ES --> Index[Indexing Operations]
    end

    Cluster --> |node join/leave| Monitor[Monitoring]
    Shard --> |allocation fail| Monitor
    Search --> |query > threshold| Monitor
    Index --> |bulk reject| Monitor

Cluster events: node additions/removals, master elections, cluster state transitions
Shard allocation logs: failed allocations, delayed decisions, recovery progress
Slow search logs: queries exceeding search.slowlog.threshold
Indexing logs: bulk request rejections, mapping conflicts, refresh intervals

# Enable cluster-level allocation logging
PUT /_cluster/settings
{
  "transient": {
    "logger.index_allocation": "DEBUG",
    "logger.index_recovery": "INFO"
  }
}

Alerts to Configure

Alert	Condition	Severity
Cluster health yellow	`status == yellow` for > 5 min	Warning
Cluster health red	`status == red`	Critical
Shard imbalance	`shard_count_max - shard_count_min > 5`	Warning
Coordinating node queue	`search_queue > 100`	Warning
Recovery stalled	`recovering_shards > 0` for > 10 min	Warning
Disk watermark low	`disk.low watermark` exceeded	Critical

Security Checklist

Network-level isolation — place search cluster on internal network; never expose directly to internet
Node authentication — use TLS certificates for inter-node communication
Role-based access — define read-only, read-write, and admin roles; apply per index
Index-level privileges — restrict write access to pipelines and ingestion services only
Audit logging — enable security audit logs for all write operations and access attempts
Field-level security — use document-level security if certain fields must be hidden from some users
Query restrictions — disable expensive aggregations for public-facing endpoints (max_buckets)
API key rotation — rotate service account keys quarterly; use secret management system
Input validation — sanitize query strings to prevent injection via special characters

# Example: Create a read-only role for production traffic
POST /_security/role/production_reader
{
  "indices": [
    {
      "names": ["production-*"],
      "privileges": ["read", "view_index_metadata"]
    }
  ],
  "field_security": {
    "grant": ["title", "content", "publish_date", "author"]
  }
}

Interview Questions

1. How does hash-based sharding work in Elasticsearch, and what is its main limitation?

Hash-based sharding computes a hash of the document ID and modulo by the number of primary shards to determine which shard owns the document. The main limitation is that queries without a filter on document_id must query all shards (scatter-gather), because the routing cannot be determined without the hash. This means even targeted queries like a term filter on a specific field still hit every shard.

2. What is the role of a coordination node in Elasticsearch, and how does it handle distributed queries?

The coordination node receives search requests from clients and acts as a load balancer. It broadcasts the query to all relevant shards (or all shards if routing can't narrow it down), collects the top N results from each shard, then merges and re-ranks these results to return the final response. It does not perform the full search itself—shards return only their top N matches to reduce data transfer.

3. Explain the difference between hot, warm, and cold nodes in an Elasticsearch cluster. When would you use each?

Hot nodes handle all writes and recent read traffic, using the fastest storage (NVMe SSD) and most RAM. Warm nodes hold indices that are accessed less frequently, typically after 7-30 days, using cheaper storage (SATA SSD). Cold nodes store rarely-accessed data, often frozen, using the cheapest storage (spinning disk or object store). Use hot nodes for active indexing and recent data, warm for historical but queryable data, and cold for archival data that rarely gets queried.

4. What is custom routing in Elasticsearch, and when should you use it?

Custom routing allows you to specify a routing key at index time (e.g., `routing=category:tutorials`) so that queries can target specific shards without querying all shards. Use it when you have high-cardinality access patterns where a subset of documents is queried much more frequently (e.g., queries by tenant_id or category). The tradeoff is potential hotspotting if the routing key is not evenly distributed.

5. How does Index Lifecycle Management (ILM) help optimize search infrastructure costs?

ILM automates the transition of indices through lifecycle stages (hot → warm → cold → delete) based on age or size thresholds. This optimizes costs by moving older, less-frequently accessed data to cheaper storage tiers and eventually deleting it, rather than keeping all data on expensive hot-node storage. It also automates maintenance actions like rollover, shrink, and forcemerge to keep indices performing well.

6. What is the tradeoff between replication for read scalability and write performance?

Replication linearly scales read throughput because queries can be distributed across all replica copies. However, every write must be indexed on the primary shard and all replicas, causing write amplification. More replicas mean slower indexing throughput and higher storage costs. The optimal replica count depends on your read-to-write ratio: read-heavy workloads benefit from more replicas, while write-heavy workloads should minimize replicas.

7. What are circuit breakers in Elasticsearch, and why are they important?

Circuit breakers prevent runaway queries from consuming too much memory and causing OOM errors. Elasticsearch has several circuit breakers (parent, request, fielddata, inbox breaker) each with configurable limits. When a query would exceed a circuit breaker's limit, it fails fast with a Phase 1 execution exception rather than consuming memory until the node crashes. They are critical for cluster stability, especially with complex aggregations or large page sizes.

8. Explain cross-cluster search versus cross-cluster replication. When would you use each?

Cross-cluster search (CCS) federates queries across multiple clusters in real-time without data movement, useful for multi-tenancy or ad-hoc multi-datacenter queries. Cross-cluster replication (CCR) provides active-passive DR by continuously replicating indices from a primary to follower cluster with automatic failover. Use CCS when you need unified read access across isolated clusters. Use CCR when you need true disaster recovery with minimal RPO.

9. What is the "curse of dimensionality" in vector search, and how does it affect sharding strategy?

In high-dimensional vector spaces, the distance between any two points becomes similar as dimensions increase, making it hard to distinguish true nearest neighbors. For sharding, this means a query must theoretically hit all shards to guarantee finding true nearest neighbors (unlike text search where routing can narrow the scope). Practical solutions include semantic routing, pre-filtering to candidate subsets, or ensemble retrieval with approximate results.

10. How would you plan capacity for a new Elasticsearch cluster serving search traffic?

Capacity planning involves: (1) estimating storage: raw data × overhead (1.2-1.3x) × replica factor, keeping shards under 50GB; (2) estimating throughput: single shard handles 5,000-15,000 QPM, scale nodes linearly; (3) node sizing: hot nodes need 16-32 cores, 64-128GB RAM, NVMe storage; (4) accounting for N+2 redundancy and 30% extra capacity for rebalancing. Always stress test with simulated peak load before production deployment.

11. What causes replica lag in Elasticsearch, and how does it affect query consistency?

Replica lag occurs when replicas fall behind the primary shard during indexing, often due to heavy write loads overwhelming replica indexing threads or network bottlenecks. When a query hits a lagging replica, it returns stale or missing documents. This is called eventual consistency—inconsistency window depends on how far behind replicas are. Mitigation strategies include: reducing indexing bulk size, increasing replica threads, using async replication for non-critical data, or routing reads to primaries during consistency-sensitive operations.

12. Explain the difference between `read_before_write` and `wait_for_active_shards` consistency modes.

`wait_for_active_shards` (quorum) ensures writes are acknowledged by a minimum number of shard copies before returning success, providing durability guarantees. `read_before_write` (or `refresh=true`) forces a refresh before the read so that recently indexed documents are visible, at the cost of higher latency. The former protects against reading unacknowledged data; the latter ensures read-your-writes consistency by guaranteeing the indexing operation is searchable before the read proceeds.

13. How does shard rebalancing work in Elasticsearch, and what are the performance implications during rebalance?

Elasticsearch rebalances shards automatically when nodes are added or removed, using allocation filters and awareness attributes to optimize physical placement. The process moves shard copies from overloaded to underloaded nodes, throttled by `indices.recovery.max_bytes_per_sec` and `cluster.routing.allocation.cluster_concurrent_rebalance`. During rebalance, network bandwidth and disk I/O are consumed for data transfer, and primary promotion causes brief unavailability. For sensitive workloads, disable automatic rebalancing during maintenance windows using `cluster.routing.rebalance.enable: none`.

14. What is the purpose of circuit breakers in Elasticsearch, and how do you configure them appropriately?

Circuit breakers prevent runaway queries from causing OOM by failing fast when memory estimates exceed thresholds. Elasticsearch has: parent circuit breaker (total across all breakers), request circuit breaker (per-query memory for aggregations/sorts), fielddata circuit breaker (field data loading), and inbox breaker (coordinator queue). Recommended configuration: parent at 70%, request at 40%, fielddata at 60%. Monitor `/_nodes/stats/breaker` and adjust based on workload—too tight causes false positives, too loose risks actual OOM crashes.

15. How would you handle a "hotspot" problem where certain shards receive disproportionately high traffic?

Hotspots occur when shard distribution doesn't match query traffic patterns—common with time-based data where recent shards get hammered. Solutions include: (1) split your index to add more shards on the hot tier, distributing load; (2) adjust `index.routing.allocation` to move hot shards to dedicated hot nodes with more CPU; (3) use custom routing with a high-cardinality key that distributes traffic evenly; (4) scale horizontally by adding hot nodes which Elasticsearch will rebalance across; (5) implement read-throttling or query queueing at the application layer during extreme spikes.

16. What are the trade-offs between increasing replicas versus adding more primary shards for read scalability?

Adding replicas scales reads linearly (more copies to distribute queries across) but amplifies writes and storage. Adding primary shards increases parallelism for both reads and writes but requires reindexing to change shard count and increases coordination overhead. General guidance: if read latency is the bottleneck, add replicas first (simpler, no reindexing). If indexing throughput is the bottleneck, add primary shards (more parallelism for indexing). If data volume exceeds single-shard capacity limits, you need more primary shards anyway. Over-sharding causes metadata overhead; under-sharding limits parallelism.

17. Describe the steps to perform a rolling restart of an Elasticsearch cluster without downtime.

Rolling restarts require careful sequencing: (1) Disable shard allocation: `PUT /_cluster/settings {"transient": {"cluster.routing.allocation.enable": "none"}}` to prevent rebalancing when nodes go down. (2) Stop indexing and pause ILM if needed to avoid rollover during restart. (3) Gracefully stop one node (SIGTERM), wait for its shards to be allocated elsewhere. (4) Perform your upgrade or configuration change. (5) Restart the node and verify it joins the cluster. (6) Re-enable allocation: `{"transient": {"cluster.routing.allocation.enable": "none"}}` to `{"transient": {"cluster.routing.allocation.enable": "all"}}`. (7) Wait for cluster health to return to green/yellow before proceeding to the next node.

18. How does the `dfs_query_then_fetch` search type differ from `query_then_fetch`, and when would you use it?

`query_then_fetch` (default) distributes the query to relevant shards, each returns its top N results, and the coordinator merges and re-ranks. `dfs_query_then_fetch` first executes a distributed frequency search across all shards to compute global document frequencies for scoring, then executes the actual search with corrected scores. This produces more accurate relevance ranking when documents are unevenly distributed across shards, at the cost of an extra network round-trip. Use `dfs` when you have few shards with heavily skewed document distribution and relevance scoring accuracy matters more than latency.

19. What is field data cache and why can it cause heap pressure issues?

Field data cache loads all field values into heap memory to support sorting and aggregations on field types like keyword, numeric, and geo. Without bounds, a single large aggregation (like a terms bucket on high-cardinality fields) can load millions of values and exhaust heap, causing OOM. Mitigation: use `doc_values` instead of `field_data=true` for most use cases (doc_values are on-disk, not heap). Set `indices.breaker.fielddata.limit` to cap field data memory. Consider `eager_global_ordinals` for frequently-aggregated fields to pre-load during refresh rather than on first query. Monitor `fielddata` size in `/_nodes/stats/indices/fielddata`.

20. How would you design a multi-tenant search architecture where each tenant's data must be isolated?

Multi-tenant isolation strategies: (1) Index-per-tenant—each tenant gets their own index, simplest isolation, easiest to manage per-tenant ILM and scaling, but thousands of indices can stress cluster state. (2) Document-level field with filtered aliases—add `tenant_id` field, create filtered aliases per tenant, query through alias. Lower overhead per tenant but requires careful filter bounds to prevent cross-tenant leaks. (3) Searchable snapshots with cold storage for rarely-accessed tenants. Choose based on tenant count and isolation requirements. For strict compliance isolation, index-per-tenant with dedicated clusters for high-security tenants. Always use role-based access and document-level security for shared indices.

Conclusion

Three things matter most for search scaling: sharding, replication, and routing. Hash-based sharding is simple but queries all shards. Range-based or custom routing targets subsets when your access patterns allow. Replication scales reads at the cost of writes. The coordination node handles the scatter-gather that makes distributed search work.

The right configuration depends on your query patterns. If most queries target recent data, time-based sharding with more replicas on recent shards makes sense. If queries span your entire dataset, hash-based sharding with many replicas handles the load.

Measure your query latency distribution before optimizing. It is easy to over-engineer sharding for workloads that do not need it.

Search Scaling: Sharding, Routing, and Horizontal Growth

Core Concepts

Sharding Strategies

Hash-Based Sharding

Range-Based Sharding

Custom Routing

Query Routing

Coordination Node

Adaptive Replica Selection

Replication for Read Scalability

Read-Through Scaling

Consistency Considerations

Index Lifecycle Management

Hot-Warm-Cold Architecture

ILM Policy Definition

ILM Takeaways

Cross-Cluster Search

Use Cases

Configuration

Cross-Cluster Replication vs Cross-Cluster Search

Cross-Cluster Takeaways

Search Capacity Planning

Estimating Storage Requirements

Estimating Query Throughput

Node Sizing Guidelines

Capacity Planning Checklist

Index vs Shard Strategy

When to Add Shards to an Existing Index

When to Create New Indices

Index-Per-Day vs Index-Per-Month Trade-offs

Index vs Shard Takeaways

Vector Search Scaling

ANN Index Structures

Vector Sharding Challenges

Scaling Strategies

Vector Search Takeaways

Trade-off Analysis

Core Trade-off Dimensions

Consistency vs Performance

Write Throughput vs Data Safety

Query Latency vs Recall

Shard Count vs Overhead

Operational Complexity vs Capability

Operations and Reliability

Node Failure Handling

Rebalancing

Frozen Indices

When to Use / When Not to Use

When to Use Search Scaling Techniques

When Not to Use Search Scaling Techniques

Production Failure Scenarios

Common Pitfalls / Anti-Patterns

Anti-Patterns

Premature Sharding

Ignoring Rebalancing Impact

Not Using Circuit Breakers

Targeting Wrong Shards with Routing

Over-Allocating Replicas

Quick Reference

Key Bullets

Copy/Paste Checklist

Observability Checklist

Metrics to Monitor

Key Logs to Capture

Alerts to Configure

Security Checklist

Interview Questions

Further Reading

Conclusion

Category

Tags

Related Posts

Database Capacity Planning: A Practical Guide

Read/Write Splitting

Database Scaling: Vertical, Horizontal, and Read Replicas