Widhian Bramantya

coding is an art form

Maintaining Super Large Datasets in Elasticsearch

Posted on October 5, 2025 by admin

Elasticsearch can handle millions or even billions of documents. It is fast and scalable, but only if you manage it correctly. When your data grows very large, bad shard planning or poor data balance can make the cluster slow or unstable.

This article explains how to maintain very large datasets in Elasticsearch, including the trade-offs between many vs. few shards, Index Lifecycle Management (ILM), and how to prevent hot nodes.

The Challenge of Large Data

Elasticsearch splits every index into smaller parts called shards. Each shard is like a small Lucene database that stores part of your documents.
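The number of primary shards is fixed when an index is created; changing it later requires a shrink, split, or reindex. A minimal sketch of setting it explicitly at creation time (the index name logs-2025.10 and the values here are just illustrative):

PUT logs-2025.10
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}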

When your data grows:

  • Queries become slower
  • Nodes use too much memory and CPU
  • Backups take longer
  • Cluster recovery after restart can be very slow

So, shard sizing and balancing are critical for performance and stability.

Ideal Shard Size and Trade-Offs

There is no perfect shard size; it depends on your data type and query pattern. Most clusters work well when each shard is around 10–50 GB.

| Type of Data              | Recommended Shard Size |
|---------------------------|------------------------|
| Logs / time-based data    | 10–50 GB               |
| Analytics / numeric data  | 50–100 GB              |
| Text-heavy search         | 10–30 GB               |

Trade-Off: Many Small Shards vs. Few Large Shards

| Case              | Pros                                                  | Cons                                                                                         |
|-------------------|-------------------------------------------------------|----------------------------------------------------------------------------------------------|
| Many small shards | Better parallelism; faster recovery of small pieces   | High memory overhead; cluster state becomes large; GC pressure; slow searches due to coordination |
| Few large shards  | Less overhead; simpler cluster state                  | Long recovery times; a single shard can become a bottleneck; risk of a "hot shard"           |

Each shard consumes heap memory even when it is small, so thousands of tiny shards can waste resources. On the other hand, very large shards (for example, over 100 GB) make merges and queries slow.


Rule of thumb:
Keep each shard between 20 GB and 50 GB, and keep the total shard count per node under a few thousand (ideally < 2000). For example, a 1 TB index at a ~30 GB shard target works out to roughly 34 primary shards.

You can check shard sizes with:

GET _cat/shards?v
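The _cat API also accepts h (select columns) and s (sort) parameters, which makes it easy to list the largest shards first — useful when hunting for oversized shards:

GET _cat/shards?v&h=index,shard,prirep,store,node&s=store:desc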

Use Index Lifecycle Management (ILM)

ILM automatically manages the size, age, and location of your data. It helps prevent oversized indices and automates clean-up.

Example Policy

PUT _ilm/policy/logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "30gb", "max_age": "7d" }
        }
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "allocate": { "include": { "box_type": "warm" } },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}

This keeps your indices small, moves old data to cheaper nodes, and deletes data after 90 days.
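A policy does nothing on its own: it must be attached to new indices, typically through an index template, and the rollover action needs a write alias. A sketch, assuming a logs-* naming scheme and a write alias named logs (both are illustrative names, not fixed by the policy above):

PUT _index_template/logs_template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "logs_policy",
      "index.lifecycle.rollover_alias": "logs"
    }
  }
}

PUT logs-000001
{
  "aliases": {
    "logs": { "is_write_index": true }
  }
}

After this bootstrap index exists, ILM rolls over to logs-000002, logs-000003, and so on as the policy's size and age thresholds are reached.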

Hot–Warm–Cold Architecture

Not all data is equal: recent data is queried all the time (hot), while old data is rarely touched (cold). You can separate them by node tiers.

| Tier   | Role                                       | Hardware                |
|--------|--------------------------------------------|-------------------------|
| Hot    | Active data, frequent queries and indexing | Fast CPU, SSD           |
| Warm   | Older data, less indexing                  | Medium CPU, HDD         |
| Cold   | Rarely accessed archives                   | Large but slow storage  |
| Frozen | Archived snapshots                         | Low-cost object storage |

ILM can automatically move indices between these tiers.

"allocate": { "include": { "box_type": "warm" } }
Category: ElasticSearch
