Widhian Bramantya

coding is an art form

Maintaining Super Large Datasets in Elasticsearch

Posted on October 5, 2025 by admin

Elasticsearch can handle millions or even billions of documents. It is fast and scalable, but only if you manage it correctly. When your data grows very large, bad shard planning or poor data balance can make the cluster slow or unstable.

This article explains how to maintain very large datasets in Elasticsearch, including the trade-offs between many vs. few shards, Index Lifecycle Management (ILM), and how to prevent hot nodes.

The Challenge of Large Data

Elasticsearch splits every index into smaller parts called shards. Each shard is like a small Lucene database that stores part of your documents.
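The number of primary shards is fixed when an index is created; changing it later requires a shrink, split, or reindex. A minimal sketch of setting it explicitly at creation time (the index name logs-2025.10 and the values here are just illustrative):

PUT logs-2025.10
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}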

When your data grows:

  • Queries become slower
  • Nodes use too much memory and CPU
  • Backups take longer
  • Cluster recovery after restart can be very slow

So, shard sizing and balancing are critical for performance and stability.

Ideal Shard Size and Trade-Offs

There is no perfect shard size; it depends on your data type and query pattern. Most clusters work well when each shard is around 10–50 GB.

| Type of Data              | Recommended Shard Size |
|---------------------------|------------------------|
| Logs / time-based data    | 10–50 GB               |
| Analytics / numeric data  | 50–100 GB              |
| Text-heavy search         | 10–30 GB               |

Trade-Off: Many Small Shards vs. Few Large Shards

| Case              | Pros                                                  | Cons                                                                                         |
|-------------------|-------------------------------------------------------|----------------------------------------------------------------------------------------------|
| Many small shards | Better parallelism; faster recovery of small pieces   | High memory overhead; cluster state becomes large; GC pressure; slow searches due to coordination |
| Few large shards  | Less overhead; simpler cluster state                  | Long recovery times; a single shard can become a bottleneck; risk of a "hot shard"           |

Each shard consumes heap memory even when it is small, so thousands of tiny shards can waste resources. On the other hand, very large shards (for example, over 100 GB) make merges and queries slow.


Rule of thumb:
Keep each shard between 20 GB and 50 GB, and keep the total shard count per node under a few thousand (ideally < 2000). For example, a 1 TB index at a ~30 GB shard target works out to roughly 34 primary shards.

You can check shard sizes with:

GET _cat/shards?v
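The _cat API also accepts h (select columns) and s (sort) parameters, which makes it easy to list the largest shards first — useful when hunting for oversized shards:

GET _cat/shards?v&h=index,shard,prirep,store,node&s=store:desc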

Use Index Lifecycle Management (ILM)

ILM automatically manages the size, age, and location of your data. It helps prevent oversized indices and automates clean-up.

Example Policy

PUT _ilm/policy/logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "30gb", "max_age": "7d" }
        }
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "allocate": { "include": { "box_type": "warm" } },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}

This keeps your indices small, moves old data to cheaper nodes, and deletes data after 90 days.
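A policy does nothing on its own: it must be attached to new indices, typically through an index template, and the rollover action needs a write alias. A sketch, assuming a logs-* naming scheme and a write alias named logs (both are illustrative names, not fixed by the policy above):

PUT _index_template/logs_template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "logs_policy",
      "index.lifecycle.rollover_alias": "logs"
    }
  }
}

PUT logs-000001
{
  "aliases": {
    "logs": { "is_write_index": true }
  }
}

After this bootstrap index exists, ILM rolls over to logs-000002, logs-000003, and so on as the policy's size and age thresholds are reached.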

Hot–Warm–Cold Architecture

Not all data is equal: recent data is queried all the time (hot), while old data is rarely touched (cold). You can separate them by node tiers.

| Tier   | Role                                       | Hardware                |
|--------|--------------------------------------------|-------------------------|
| Hot    | Active data, frequent queries and indexing | Fast CPU, SSD           |
| Warm   | Older data, less indexing                  | Medium CPU, HDD         |
| Cold   | Rarely accessed archives                   | Large but slow storage  |
| Frozen | Archived snapshots                         | Low-cost object storage |

ILM can automatically move indices between these tiers.

"allocate": { "include": { "box_type": "warm" } }
Category: ElasticSearch
