
🔍 Introduction

Running Elasticsearch in production requires deep visibility into CPU, memory, shards, and cluster health.

One of the most confusing scenarios DevOps engineers face is:

⚠️ High CPU alerts, but CPU usage looks normal

In this blog, I’ll walk you through a real production incident where:

Elasticsearch triggered CPU alerts
But the actual root cause was a disk watermark breach that left shards unassigned after a node left the cluster, compounded by JVM memory pressure

We’ll cover:

Core Elasticsearch concepts
Real logs and debugging steps
Root cause analysis
Production fix

📘 Important Elasticsearch Concepts

Before diving into the issue, let’s understand some key building blocks.

📦 How Elasticsearch Stores Data

Elasticsearch stores data as documents, grouped into an index.

However, when data grows large (billions or trillions of records), a single index can no longer be stored or searched efficiently on one node.

🔹 What is an Index?

An Index is:

A collection of documents
Logical partition of data
Similar to a database

👉 Example:

metricbeat-*
.monitoring-*
user-data

🔹 What are Shards?

To scale horizontally, Elasticsearch splits an index into shards.

Each shard is a small unit of data
Stored across multiple nodes
Acts like a mini-index

⚙️ Why Shards Matter
✅ Scalability → Data distributed across nodes
✅ Performance → Parallel query execution
✅ Availability → Supports failover
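
To make the distribution idea concrete, here is a minimal sketch of how a document can be mapped to a primary shard by hashing its routing key. It is an illustration only: Elasticsearch uses a Murmur3 hash internally, and CRC32 is swapped in here as a stand-in so the example needs nothing beyond the standard library. The names `pick_shard` and `distribute` are hypothetical.

```python
import zlib


def pick_shard(routing_key: str, num_primary_shards: int) -> int:
    """Map a routing key to a primary shard number.

    CRC32 is a stand-in for the Murmur3 hash Elasticsearch really uses.
    """
    if num_primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


def distribute(doc_ids, num_primary_shards):
    """Count how many documents land on each shard."""
    counts = [0] * num_primary_shards
    for doc_id in doc_ids:
        counts[pick_shard(doc_id, num_primary_shards)] += 1
    return counts


# Spread 1000 hypothetical document IDs over 5 primary shards.
counts = distribute([f"doc-{i}" for i in range(1000)], 5)
print(counts)
```

Because the shard number depends on the primary shard count, that count is fixed when the index is created, which is why shard sizing has to be planned up front.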

🔁 Primary vs Replica Shards

Primary Shard → Original data
Replica Shard → Copy for fault tolerance

🚨 Cluster Health Status
🟢 Green → All shards assigned
🟡 Yellow → All primary shards assigned, but one or more replica shards unassigned
🔴 Red → One or more primary shards unassigned
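
The colour rules above can be sketched as a small function. This is a simplified model, assuming each shard is described only by its `prirep` flag and state; the real health calculation also accounts for initializing shards. `cluster_status` is a hypothetical name.

```python
def cluster_status(shards):
    """Derive the health colour from (prirep, state) pairs.

    prirep is "p" for a primary and "r" for a replica, matching the
    column of the same name in the _cat/shards output.
    """
    unassigned = [prirep for prirep, state in shards if state == "UNASSIGNED"]
    if "p" in unassigned:
        return "red"      # at least one primary has no home
    if "r" in unassigned:
        return "yellow"   # data is complete, redundancy is not
    return "green"


print(cluster_status([("p", "STARTED"), ("r", "UNASSIGNED")]))
```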

🧠 JVM & Memory Basics

Elasticsearch runs on the JVM:

Heap memory is critical
High usage → Garbage Collection (GC)
GC → CPU spikes

⚠️ Production Issue Overview

We received alerts for:

🔴 High CPU usage
⚠️ Cluster health degraded
📉 Slow search performance

📊 Investigation & Debugging

🔍 Step 1: Cluster Health Check

[ec2-user@ip-x-x-x-x ~]$ curl -X GET "localhost:9200/_cluster/health?pretty"
{
 "cluster_name" : "web-test",
 "status" : "yellow",
 "timed_out" : false,
 "number_of_nodes" : 5,
 "number_of_data_nodes" : 5,
 "active_primary_shards" : 247,
 "active_shards" : 343,
 "relocating_shards" : 0,
 "initializing_shards" : 0,
 "unassigned_shards" : 193,
 "delayed_unassigned_shards" : 0,
 "number_of_pending_tasks" : 0,
 "number_of_in_flight_fetch" : 0,
 "task_max_waiting_in_queue_millis" : 0,
 "active_shards_percent_as_number" : 63.99253731343284
}

[ec2-user@ip-x-x-x-x ~]$ curl -X GET "localhost:9200/_cluster/health?filter_path=status,*_shards&pretty"
{
 "status" : "yellow",
 "active_primary_shards" : 247,
 "active_shards" : 343,
 "relocating_shards" : 0,
 "initializing_shards" : 0,
 "unassigned_shards" : 193,
 "delayed_unassigned_shards" : 0
}

👉 Key Insight:

193 of 536 shards (about 36%) unassigned → Major issue
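
As a quick cross-check of the health output, the sketch below recomputes the shard totals from the fields shown above. It assumes the active percentage is simply active shards divided by all shards, which matches the reported value here; `summarize_health` is a hypothetical helper.

```python
import json

# Shard counters copied from the health response above.
HEALTH_JSON = """
{
  "status": "yellow",
  "active_primary_shards": 247,
  "active_shards": 343,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 193
}
"""


def summarize_health(health):
    """Derive totals that the raw response does not spell out."""
    total = (health["active_shards"]
             + health["initializing_shards"]
             + health["unassigned_shards"])
    return {
        "total_shards": total,
        "active_replicas": health["active_shards"] - health["active_primary_shards"],
        "active_percent": 100.0 * health["active_shards"] / total,
    }


summary = summarize_health(json.loads(HEALTH_JSON))
print(summary)
```

All 247 primaries are active, so every one of the 193 unassigned shards is a replica, which is exactly why the status is yellow rather than red.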

🔍 Step 2: Node Resource Usage

[ec2-user@ip-x-x-x-x ~]$ curl -X GET "localhost:9200/_cat/nodes?v=true&s=cpu:desc&pretty"
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
1x.x.x.2x9 73 97 3 0.19 0.16 0.11 cdfhilmrstw - node-5
1x.x.x.8x 77 90 2 0.03 0.06 0.03 cdfhilmrstw * node-1
1x.x.x.x 60 84 1 0.22 0.65 0.72 cdfhilmrstw - node-3
1x.x.x.x 46 90 1 0.03 0.06 0.01 cdfhilmrstw - node-4
1x.x.x.x 65 91 0 0.01 0.03 0.00 cdfhilmrstw - node-2

Observation:

CPU: 0–3% (low)
Heap: 46–77%
RAM: 84–97% (very high)

👉 This is critical:

CPU alert was misleading — actual issue was memory pressure
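
The ranges can be read straight off the `_cat/nodes` table. The following sketch parses the output shown above (pasted in as a string) and reports the minimum and maximum per column; `parse_cat_table` and `resource_ranges` are hypothetical helper names.

```python
# The _cat/nodes output from this step, pasted verbatim.
CAT_NODES = """\
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
1x.x.x.2x9 73 97 3 0.19 0.16 0.11 cdfhilmrstw - node-5
1x.x.x.8x 77 90 2 0.03 0.06 0.03 cdfhilmrstw * node-1
1x.x.x.x 60 84 1 0.22 0.65 0.72 cdfhilmrstw - node-3
1x.x.x.x 46 90 1 0.03 0.06 0.01 cdfhilmrstw - node-4
1x.x.x.x 65 91 0 0.01 0.03 0.00 cdfhilmrstw - node-2
"""


def parse_cat_table(text):
    """Turn whitespace-separated _cat output into a list of dicts."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


def resource_ranges(nodes, columns=("cpu", "heap.percent", "ram.percent")):
    """Return (min, max) for each numeric column across all nodes."""
    ranges = {}
    for column in columns:
        values = [int(node[column]) for node in nodes]
        ranges[column] = (min(values), max(values))
    return ranges


nodes = parse_cat_table(CAT_NODES)
print(resource_ranges(nodes))
```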

🔍 Step 3: OS-Level Analysis

top

[ec2-user@ip-x-x-x-xx ~]$ top
top - 10:57:46 up 13 days, 22:42, 1 user, load average: 0.77, 0.73, 0.60
Tasks: 114 total, 1 running, 64 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.3 us, 0.1 sy, 0.0 ni, 97.6 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 7863696 total, 744000 free, 5938932 used, 1180764 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 2202220 avail Mem

 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 3743 elastic+ 20 0 48.0g 4.9g 36368 S 8.7 65.7 7078:50 java
 1 root 20 0 117520 5144 3408 S 0.0 0.1 22:27.92 systemd
 2 root 20 0 0 0 0 S 0.0 0.0 0:00.25 kthreadd
 4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/0:0H
 6 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 mm_percpu_wq
 7 root 20 0 0 0 0 S 0.0 0.0 0:13.95 ksoftirqd/0
 8 root 20 0 0 0 0 I 0.0 0.0 2:29.56 rcu_sched
 9 root 20 0 0 0 0 I 0.0 0.0 0:00.00 rcu_bh
 10 root rt 0 0 0 0 S 0.0 0.0 0:02.68 migration/0
 11 root rt 0 0 0 0 S 0.0 0.0 0:01.54 watchdog/0
 12 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuhp/0
 13 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuhp/1
 14 root rt 0 0 0 0 S 0.0 0.0 0:01.63 watchdog/1

Findings:
Java process:

~4.9 GB memory usage
~65% system memory

👉 Elasticsearch was consuming most of the memory on this node
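
When reading these numbers it helps to separate memory that is merely "not free" from memory that is actually unavailable, because Linux counts reclaimable page cache as used. The sketch below computes both percentages from the `KiB Mem` and `avail Mem` values in the `top` output above; `memory_breakdown` is a hypothetical helper, and the interpretation is an assumption about how to read `top`, not something taken from Elasticsearch itself.

```python
def memory_breakdown(total_kib, free_kib, avail_kib):
    """Return (percent not free, percent not available) of physical RAM."""
    not_free_pct = 100.0 * (total_kib - free_kib) / total_kib
    unavailable_pct = 100.0 * (total_kib - avail_kib) / total_kib
    return round(not_free_pct, 1), round(unavailable_pct, 1)


# Values taken from the top output above (all in KiB).
not_free, unavailable = memory_breakdown(
    total_kib=7863696, free_kib=744000, avail_kib=2202220
)
print(f"not free: {not_free}%  not available: {unavailable}%")
```

The gap between the two figures is memory the kernel can hand back on demand, so a high RAM percentage on its own is a weaker signal than heap usage or GC activity.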

🔍 Step 4: JVM Memory Pressure

curl -X GET "localhost:9200/_nodes/stats?filter_path=nodes.*.jvm.mem.pools.old"

Observation:

High old-gen memory usage
Frequent GC cycles
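
The response to that request nests the old-generation pool under each node ID. A minimal sketch of turning it into a per-node percentage follows. The byte values and node IDs in `SAMPLE_STATS` are made up for illustration (the original output was not captured in this post); only the field names `used_in_bytes` and `max_in_bytes` come from the node stats API.

```python
# Hypothetical response shape; numbers and node IDs are illustrative only.
SAMPLE_STATS = {
    "nodes": {
        "node-id-a": {"jvm": {"mem": {"pools": {"old": {
            "used_in_bytes": 3_000_000_000, "max_in_bytes": 4_000_000_000}}}}},
        "node-id-b": {"jvm": {"mem": {"pools": {"old": {
            "used_in_bytes": 1_000_000_000, "max_in_bytes": 4_000_000_000}}}}},
    }
}


def old_gen_usage(stats):
    """Return old-gen usage per node as a percentage of the pool maximum."""
    usage = {}
    for node_id, node in stats["nodes"].items():
        old = node["jvm"]["mem"]["pools"]["old"]
        if old["max_in_bytes"] > 0:  # skip pools that report no maximum
            usage[node_id] = round(
                100.0 * old["used_in_bytes"] / old["max_in_bytes"], 1
            )
    return usage


print(old_gen_usage(SAMPLE_STATS))
```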

🔍 Step 5: Unassigned Shards Analysis

Unassigned shards have a state of UNASSIGNED. The prirep value is p for primary shards and r for replicas.

curl -X GET "localhost:9200/_cat/shards?v=true&h=index,shard,prirep,state,node,unassigned.reason&s=state&pretty"

[ec2-user@ip-x-x-x-xx ~]$ curl -X GET "localhost:9200/_cat/shards?v=true&h=index,shard,prirep,state,node,unassigned.reason&s=state&pretty"
index shard prirep state unassigned.reason
product_search_tab_data 0 r UNASSIGNED NODE_LEFT
metricbeat-7.10.2-2023.02.08-000024 0 r UNASSIGNED NODE_LEFT
metricbeat-7.17.0-2022.12.04-000004 0 r UNASSIGNED NODE_LEFT
.monitoring-es-7-mb-2023.04.16 0 r UNASSIGNED REPLICA_ADDED
.monitoring-es-7-mb-2023.04.14 0 r UNASSIGNED REPLICA_ADDED
apm-7.9.2-span-000002 0 r UNASSIGNED NODE_LEFT
metricbeat-7.10.2-2021.12.29-000012 0 r UNASSIGNED NODE_LEFT
product_search_analytics 0 r UNASSIGNED NODE_LEFT
product_search_analytics 0 r UNASSIGNED NODE_LEFT
product_search_analytics 0 r UNASSIGNED NODE_LEFT
product_search_analytics 0 r UNASSIGNED NODE_LEFT
product_fap_model_item 0 r UNASSIGNED NODE_LEFT
metricbeat-7.10.2-2021.11.29-000011 0 r UNASSIGNED NODE_LEFT
metricbeat-7.17.1-2022.12.07-000008 0 r UNASSIGNED NODE_LEFT
.kibana-event-log-7.9.2-000024 0 r UNASSIGNED NODE_LEFT
.kibana-event-log-7.17.1-000010 0 r UNASSIGNED NODE_LEFT
.monitoring-kibana-7-2023.04.16 0 r UNASSIGNED REPLICA_ADDED
.kibana-event-log-7.9.2-000026 0 r UNASSIGNED INDEX_CREATED
product_fap_price 0 r UNASSIGNED NODE_LEFT
.ds-.logs-deprecation.elasticsearch-default-2022.12.12-000020 0 r UNASSIGNED NODE_LEFT
ilm-history-2-000025 0 r UNASSIGNED NODE_LEFT
metricbeat-7.17.1-2022.10.08-000006 0 r UNASSIGNED NODE_LEFT
ilm-history-2-000023 0 r UNASSIGNED NODE_LEFT
product_product_hierarchy 0 r UNASSIGNED NODE_LEFT
product_product_hierarchy 0 r UNASSIGNED NODE_LEFT
product_product_hierarchy 0 r UNASSIGNED NODE_LEFT
product_product_hierarchy 0 r UNASSIGNED NODE_LEFT

Key Finding:

UNASSIGNED → NODE_LEFT

👉 Meaning:

A node left the cluster
Replica shards not reassigned
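
With hundreds of rows, counting the reasons is easier than scanning them by eye. The sketch below tallies unassigned shards by type and reason; it uses a four-row excerpt of the listing above as input, and `unassigned_reasons` is a hypothetical helper name.

```python
from collections import Counter

# A short excerpt of the _cat/shards output shown above.
CAT_SHARDS = """\
index shard prirep state unassigned.reason
product_search_tab_data 0 r UNASSIGNED NODE_LEFT
.monitoring-es-7-mb-2023.04.16 0 r UNASSIGNED REPLICA_ADDED
.kibana-event-log-7.9.2-000026 0 r UNASSIGNED INDEX_CREATED
product_fap_price 0 r UNASSIGNED NODE_LEFT
"""


def unassigned_reasons(text):
    """Count unassigned shards grouped by (prirep, reason)."""
    counts = Counter()
    for line in text.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Unassigned rows have an empty node column, so the reason is last.
        if len(fields) >= 5 and fields[3] == "UNASSIGNED":
            counts[(fields[2], fields[-1])] += 1
    return counts


reasons = unassigned_reasons(CAT_SHARDS)
for (prirep, reason), count in reasons.most_common():
    print(prirep, reason, count)
```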

🔍 Step 6: Allocation Explain for an Unassigned Shard

To understand why an unassigned shard is not being assigned, and what action you must take before Elasticsearch can assign it, use the cluster allocation explain API.

curl -X GET "localhost:9200/_cluster/allocation/explain?filter_path=index,node_allocation_decisions.node_name,node_allocation_decisions.deciders.*&pretty"

[ec2-user@ip-x-x-x-xx ~]$ curl -X GET "localhost:9200/_cluster/allocation/explain?filter_path=index,node_allocation_decisions.node_name,node_allocation_decisions.deciders.*&pretty"
{
 "index" : "product_search_tab_data",
 "node_allocation_decisions" : [
 {
 "node_name" : "node-1",
 "deciders" : [
 {
 "decider" : "same_shard",
 "decision" : "NO",
 "explanation" : "a copy of this shard is already allocated to this node [[product_search_tab_data][0], node[EQ6QyUbhQZCZRqP78rMIIQ], [P], s[STARTED], a[id=7vBWLesZQAS4zYjt_ER2bw]]"
 },
 {
 "decider" : "disk_threshold",
 "decision" : "NO",
 "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [10.42130719712077%]"
 }
 ]
 },
 {
 "node_name" : "node-5",
 "deciders" : [
 {
 "decider" : "disk_threshold",
 "decision" : "NO",
 "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [9.907598002066106%]"
 }
 ]
 },
 {
 "node_name" : "node-2",
 "deciders" : [
 {
 "decider" : "disk_threshold",
 "decision" : "NO",
 "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [11.010075893021023%]"
 }
 ]
 },
 {
 "node_name" : "node-3",
 "deciders" : [
 {
 "decider" : "disk_threshold",
 "decision" : "NO",
 "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [10.938318653211446%]"
 }
 ]
 },
 {
 "node_name" : "node-4",
 "deciders" : [
 {
 "decider" : "disk_threshold",
 "decision" : "NO",
 "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [12.273611767876893%]"
 }
 ]
 }
 ]
}
[ec2-user@ip-x-x-x-xx ~]$
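
The explain output is verbose, so a short summary per node is handy. The sketch below lists every decider that answered NO and pulls the free-disk percentage out of the explanation text. `EXPLAIN` holds two of the five nodes from the response above with the explanation strings shortened; the regular expression assumes the `actual free: [..%]` wording seen in this output, and `blocking_deciders` is a hypothetical name.

```python
import re

# Two nodes from the response above; explanation text shortened.
EXPLAIN = {
    "index": "product_search_tab_data",
    "node_allocation_decisions": [
        {"node_name": "node-1", "deciders": [
            {"decider": "same_shard", "decision": "NO",
             "explanation": "a copy of this shard is already allocated to this node"},
            {"decider": "disk_threshold", "decision": "NO",
             "explanation": "the node is above the low watermark, actual free: [10.42130719712077%]"},
        ]},
        {"node_name": "node-5", "deciders": [
            {"decider": "disk_threshold", "decision": "NO",
             "explanation": "the node is above the low watermark, actual free: [9.907598002066106%]"},
        ]},
    ],
}

FREE_RE = re.compile(r"actual free: \[([0-9.]+)%\]")


def blocking_deciders(explain):
    """Map node name -> list of (decider, free disk % or None) that said NO."""
    report = {}
    for decision in explain["node_allocation_decisions"]:
        blockers = []
        for decider in decision["deciders"]:
            if decider["decision"] != "NO":
                continue
            match = FREE_RE.search(decider["explanation"])
            free = round(float(match.group(1)), 1) if match else None
            blockers.append((decider["decider"], free))
        report[decision["node_name"]] = blockers
    return report


for node, blockers in blocking_deciders(EXPLAIN).items():
    print(node, blockers)
```

Every node is blocked by `disk_threshold`, so no amount of waiting would have placed this replica; node-1 is additionally ruled out because it already holds the primary.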

🧠 Root Cause Analysis (RCA)

After correlating all logs, metrics, and cluster behavior, we identified multiple layered issues contributing to the problem.

🔴 1. Large Number of Unassigned Shards
193 shards were unassigned

Majority had reason:

UNASSIGNED → NODE_LEFT

👉 Impact:

Continuous shard allocation attempts
Increased cluster overhead
Memory and thread pressure

🔴 2. Node Failure (NODE_LEFT)

One or more nodes temporarily left the cluster
Replica shards lost their assigned nodes

👉 Result:

Cluster moved to YELLOW state
Elasticsearch tried to reallocate the missing replicas to the remaining nodes

🔴 3. Disk Watermark Threshold Breach (Critical Finding 🚨)

During shard allocation analysis, we found:

{
 "index" : "search",
 "node_allocation_decisions" : [
  {
   "node_name" : "node-3",
   "deciders" : [
    {
     "decider" : "disk_threshold",
     "decision" : "NO",
     "explanation" : "node above low watermark (85%), free: ~7.6%"
    }
   ]
  },
  {
   "node_name" : "node-5",
   "deciders" : [
    {
     "decider" : "disk_threshold",
     "decision" : "NO",
     "explanation" : "node above low watermark (85%), free: ~9.6%"
    }
   ]
  },
  {
   "node_name" : "node-4",
   "deciders" : [
    {
     "decider" : "disk_threshold",
     "decision" : "NO",
     "explanation" : "node above low watermark (85%), free: ~10.7%"
    }
   ]
  }
 ]
}

👉 Key Insight:

Elasticsearch refused to allocate shards on nodes
Because disk usage crossed:

cluster.routing.allocation.disk.watermark.low = 85%

👉 Actual situation:

Nodes had only ~7%–12% free disk space (88%–93% used)
Allocation decision = ❌ NO

⚠️ Why This Is Critical

When disk watermark is breached:

Elasticsearch stops allocating replica shards to any node above the low watermark
Unassigned shards remain stuck
Cluster cannot rebalance
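
The low watermark is only the first of three thresholds. A minimal sketch of classifying a node by its free disk percentage follows, assuming the default settings of 85% (low), 90% (high) and 95% (flood stage); a cluster with customised watermarks would need different constants, and `watermark_state` is a hypothetical name.

```python
# Default disk watermarks, expressed as percent of disk used.
LOW, HIGH, FLOOD = 85.0, 90.0, 95.0


def watermark_state(free_pct):
    """Classify a node by the highest watermark its disk usage exceeds."""
    used = 100.0 - free_pct
    if used > FLOOD:
        return "flood_stage"  # indices with shards here become read-only
    if used > HIGH:
        return "high"         # Elasticsearch tries to move shards away
    if used > LOW:
        return "low"          # no new replica shards are allocated here
    return "ok"


# Free-disk values reported by the allocation explain output.
for node, free in [("node-1", 10.4), ("node-5", 9.9), ("node-4", 12.3)]:
    print(node, watermark_state(free))
```

Under these assumptions, a node with under 10% free disk is past the high watermark as well, even though the explain output only mentions the low one.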

👉 This directly caused:

Persistent unassigned shards
Memory pressure
Internal retries → CPU spikes

🔴 4. High JVM Memory Pressure

Heap usage consistently high
JVM old-gen heavily utilized

👉 Result:

Frequent Garbage Collection (GC)
CPU spikes during GC cycles

🔴 5. Thread Pool Pressure

Even though CPU looked low:

Threads were blocked due to:
Allocation retries
Memory pressure

👉 Why this matters:

Busy or saturated thread pools can surface as CPU-related alerts even when average CPU utilization looks low

🧩 Final Root Cause Summary

The issue was NOT just CPU-related.

It was a combination of:

❌ Disk space exhaustion (Watermark breach)
❌ Unassigned shards (allocation blocked)
❌ Node failure (NODE_LEFT)
❌ High JVM memory pressure
❌ Continuous allocation retries

🛠️ Final Fix Implemented

After complete analysis, we identified that:

👉 Insufficient disk space was the primary blocker

🔧 Solution Steps
✅ 1. Increased Disk Capacity

Added 50 GB of storage to every Elasticsearch node

👉 Result:

Disk usage dropped below watermark threshold
Shard allocation resumed

monitoring-kibana-7-2023.04.17 0 p STARTED node-5
catelog-7.9.2-span-000010 0 p STARTED node-1
catelog-7.9.2-span-000010 0 r STARTED node-3
product_fragments 0 p STARTED node-3
packetbeat-7.9.3-2023.04.14-000019 0 p STARTED node-5
metricbeat-7.10.2-2022.04.14-000014 0 p STARTED node-3
.ds-.logs-deprecation.elasticsearch-default-2022.09.19-000014 0 p STARTED node-1
.ds-ilm-history-5-2023.04.09-000028 0 p STARTED node-5
catelog-7.9.2-profile-000010 0 p STARTED node-2
catelog-7.9.2-profile-000010 0 r STARTED node-3
packetbeat-7.9.3-2022.09.16-000012 0 p STARTED node-2
metricbeat-7.13.3-2021.07.11-000001 0 p STARTED node-2
logstash 0 p STARTED node-3
.monitoring-es-7-mb-2023.04.12 0 p STARTED node-4
.catelog-custom-link 0 p STARTED node-1
.catelog-custom-link 0 r STARTED node-3
catelog-7.9.2-metric-000015 0 p STARTED node-1
catelog-7.9.2-metric-000015 0 r STARTED node-3
catelog-7.9.2-profile-000017 0 r STARTED node-3
catelog-7.9.2-profile-000017 0 p STARTED node-5
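
For sizing a disk expansion like the one in step 1, a rough calculation is enough. The sketch below works out how much capacity to add so that usage falls to a chosen target below the low watermark. The disk sizes in the example are hypothetical, since the post does not state the original volume size, and the 80% target is an assumption chosen to leave headroom under 85%.

```python
import math


def extra_capacity_gb(used_gb, total_gb, target_used_pct=80.0):
    """Smallest whole number of GB to add so usage drops to the target."""
    required_total = used_gb * 100.0 / target_used_pct
    return max(0, math.ceil(required_total - total_gb))


def used_pct(used_gb, total_gb):
    """Disk usage as a percentage of capacity."""
    return 100.0 * used_gb / total_gb


# Hypothetical node: 180 GB used on a 200 GB volume (90% used).
needed = extra_capacity_gb(used_gb=180, total_gb=200)
print(needed, used_pct(180, 200 + 50))
```

Adding more than the minimum, as the 50 GB expansion did, buys time before the watermark is reached again.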

✅ 2. Rolling Restart

Restarted nodes one by one (rolling restart)

👉 Ensured:

No downtime
Safe cluster recovery
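
For reference, the commonly documented rolling-restart sequence can be sketched as an ordered plan of API calls. This is not necessarily the exact procedure used in this incident: it assumes the usual pattern of restricting allocation to primaries, flushing, restarting one node, re-enabling allocation, and waiting for green. The `RESTART` step is a placeholder for whatever service manager restarts the node, and `rolling_restart_plan` is a hypothetical name.

```python
import json


def rolling_restart_plan(node_names):
    """Build the ordered (method, target, body) steps for a rolling restart."""
    disable = {"persistent": {"cluster.routing.allocation.enable": "primaries"}}
    # A null value resets the setting to its default (all allocation enabled).
    enable = {"persistent": {"cluster.routing.allocation.enable": None}}
    steps = []
    for name in node_names:
        steps.append(("PUT", "/_cluster/settings", disable))
        steps.append(("POST", "/_flush", None))
        steps.append(("RESTART", name, None))  # placeholder, not an API call
        steps.append(("PUT", "/_cluster/settings", enable))
        steps.append(
            ("GET", "/_cluster/health?wait_for_status=green&timeout=60s", None)
        )
    return steps


plan = rolling_restart_plan(["node-1", "node-2"])
for method, target, body in plan:
    print(method, target, json.dumps(body) if body is not None else "")
```

Waiting for green before moving to the next node is what keeps the restart free of downtime: at no point is more than one copy of a shard offline.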

✅ 3. Automatic Shard Reallocation

Elasticsearch started assigning shards automatically
Cluster began stabilizing

🎯 Final Result
✅ Unassigned shards → 0
✅ Cluster status → GREEN
✅ Memory pressure reduced
✅ CPU spikes eliminated

[ec2-user@ip-x-x-x-xx ~]$ curl -X GET "localhost:9200/_cluster/health?pretty"
{
 "cluster_name" : "web-test",
 "status" : "green",
 "timed_out" : false,
 "number_of_nodes" : 5,
 "number_of_data_nodes" : 5,
 "active_primary_shards" : 247,
 "active_shards" : 536,
 "relocating_shards" : 0,
 "initializing_shards" : 0,
 "unassigned_shards" : 0,
 "delayed_unassigned_shards" : 0,
 "number_of_pending_tasks" : 0,
 "number_of_in_flight_fetch" : 0,
 "task_max_waiting_in_queue_millis" : 0,
 "active_shards_percent_as_number" : 100.0
}
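
A before-and-after comparison confirms that every previously unassigned shard is now active. The sketch below uses the counters from the two health responses in this post; `recovery_report` is a hypothetical helper.

```python
# Counters copied from the health output before and after the fix.
BEFORE = {"status": "yellow", "active_shards": 343, "unassigned_shards": 193}
AFTER = {"status": "green", "active_shards": 536, "unassigned_shards": 0}


def recovery_report(before, after):
    """Check that all previously unassigned shards became active."""
    expected_active = before["active_shards"] + before["unassigned_shards"]
    return {
        "recovered_shards": after["active_shards"] - before["active_shards"],
        "all_accounted_for": (after["active_shards"] == expected_active
                              and after["unassigned_shards"] == 0),
        "status_change": f'{before["status"]} -> {after["status"]}',
    }


report = recovery_report(BEFORE, AFTER)
print(report)
```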

💡 Key Learning (Very Important 🚀)

🔥 Disk space is directly linked to cluster stability in Elasticsearch

Even if:

CPU looks fine
Memory seems manageable

👉 If disk crosses watermark:

Shards won’t allocate
Cluster will degrade

✍️ Conclusion

This incident was a great reminder that Elasticsearch performance issues are rarely straightforward.

What initially appeared as a high CPU problem turned out to be a cascading failure caused by:

Disk watermark threshold breaches
Unassigned shards
Node failure (NODE_LEFT)
JVM memory pressure
Continuous shard allocation retries

👉 The most critical takeaway:

🔥 Disk space is not just a storage concern in Elasticsearch — it directly impacts shard allocation, memory usage, and overall cluster stability.

Even when CPU usage looks normal, underlying factors like:

Heap pressure
Disk utilization
Cluster health

can silently degrade the system until it reaches a breaking point.

🚀 Final Thoughts for DevOps Engineers

In production environments, always think beyond surface-level alerts:

Don’t trust CPU metrics alone
Correlate memory, disk, and cluster state
Monitor unassigned shards and disk watermarks proactively
Design clusters with proper shard sizing and capacity planning.
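
The checklist above could be turned into a simple correlation rule that looks at shards, disk, and heap together instead of CPU alone. The sketch below is one possible shape: the thresholds (75% heap, 80% disk used) are assumptions to tune per cluster, and `cluster_findings` is a hypothetical name. The sample values combine the heap percentages and free-disk figures reported earlier in this post.

```python
def cluster_findings(unassigned_shards, nodes, heap_limit=75, disk_used_limit=80):
    """Return human-readable findings for conditions worth alerting on."""
    findings = []
    if unassigned_shards > 0:
        findings.append(f"{unassigned_shards} unassigned shards")
    for node in nodes:
        if node["disk_used_pct"] >= disk_used_limit:
            findings.append(f'{node["name"]}: disk {node["disk_used_pct"]}% used')
        if node["heap_pct"] >= heap_limit:
            findings.append(f'{node["name"]}: heap {node["heap_pct"]}%')
    return findings


# Heap from _cat/nodes, disk derived from the allocation explain output.
sample_nodes = [
    {"name": "node-1", "heap_pct": 77, "disk_used_pct": 89.6},
    {"name": "node-4", "heap_pct": 46, "disk_used_pct": 87.7},
]
findings = cluster_findings(193, sample_nodes)
for finding in findings:
    print(finding)
```

Alerting on disk usage a few points below the low watermark gives time to add capacity before allocation is blocked.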

URL: https://dev.to/alok_shankar/elasticsearch-high-cpu-issue-due-to-memory-pressure-real-production-incident-fix-3c8k
