Health Checks - Infrastructure

Node Stats Check: Real-time Metrics and Performance Indicators

Analyze real-time node statistics, track performance trends, and predict capacity needs through comprehensive monitoring of system and application metrics.

December 6, 2024
15 min read
ElasticDoctor Team

Real-time Cluster Pulse

Node stats provide the real-time heartbeat of your cluster. This check monitors live performance metrics, resource utilization, and operational statistics to predict issues before they impact users.

While node info tells you about hardware capabilities and settings show configuration, node stats reveal what's actually happening right now. This dynamic check monitors real-time metrics, tracks performance trends, and helps predict capacity needs before problems occur.

What You'll Learn

Performance Metrics

  • Real-time resource utilization
  • JVM and garbage collection statistics
  • Index and search performance metrics
  • Network and I/O throughput

Trend Analysis

  • Capacity planning indicators
  • Performance degradation detection
  • Resource exhaustion prediction
  • Operational efficiency measurement

Node Stats API Deep Dive

GET Request
All ES Versions (5.x - 9.x)
GET /_nodes/stats

Simple English Explanation

Think of this API as asking each server: "What are you doing right now? How busy are you? How much memory are you using? How many requests have you handled?"

It's like checking your car's dashboard while driving - fuel level, engine temperature, RPM, speed - all the live information you need to know if everything is running smoothly.
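Extending the dashboard analogy, here is a minimal sketch of pulling the "gauges" out of a `/_nodes/stats` response. The nested field paths follow the documented node stats layout; the sample payload (node name, values) is illustrative, not from a live cluster.

```python
# Illustrative /_nodes/stats payload, trimmed to a few dashboard fields.
SAMPLE_STATS = {
    "nodes": {
        "abc123": {
            "name": "node-1",
            "jvm": {"mem": {"heap_used_percent": 62}},
            "os": {"cpu": {"percent": 35, "load_average": {"1m": 2.1}}},
            "process": {"open_file_descriptors": 412},
        }
    }
}

def node_dashboard(stats):
    """Return one {node_name: key_metrics} row per node."""
    out = {}
    for node in stats["nodes"].values():
        out[node["name"]] = {
            "heap_pct": node["jvm"]["mem"]["heap_used_percent"],
            "cpu_pct": node["os"]["cpu"]["percent"],
            "load_1m": node["os"]["cpu"]["load_average"]["1m"],
            "open_fds": node["process"]["open_file_descriptors"],
        }
    return out

print(node_dashboard(SAMPLE_STATS))
```

In practice you would feed this the parsed JSON from `GET /_nodes/stats` and emit one row per node.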

📊 Metric Categories

  • JVM Stats: Memory, GC, threads, uptime
  • Process Stats: CPU, memory, file descriptors
  • OS Stats: System load, memory, disk I/O
  • Index Stats: Documents, operations, storage

⏱️ Update Frequency

  • Real-time: Current resource usage
  • Cumulative: Total operations since startup
  • Rate-based: Operations per second
  • Averaged: Load averages and trends

Critical Performance Metrics

JVM Performance Metrics

🚨 Critical: Heap Memory Usage

Key Metrics
  • heap_used_percent: Current heap utilization
  • heap_max_in_bytes: Maximum heap size
  • heap_committed_in_bytes: Committed memory
  • non_heap_used_in_bytes: Non-heap usage
Warning Thresholds
  • Warning: ≥75% heap usage
  • Critical: ≥90% heap usage
  • Emergency: ≥95% heap usage
  • Track heap trends over time
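The heap thresholds above translate directly into a severity ladder. A minimal sketch, checking tiers from most to least severe:

```python
def heap_severity(heap_used_percent: float) -> str:
    """Map heap utilization to the warning tiers described above."""
    if heap_used_percent >= 95:
        return "emergency"
    if heap_used_percent >= 90:
        return "critical"
    if heap_used_percent >= 75:
        return "warning"
    return "ok"
```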

⚡ Important: Garbage Collection

GC Metrics
  • collection_count: Total GC cycles
  • collection_time_in_millis: Time spent in GC
  • gc_frequency: Collections per minute (derived)
  • avg_collection_time: Average GC pause (derived)
Health Indicators
  • GC pause time <100ms (good)
  • GC frequency <1/minute (stable)
  • GC time <5% of total uptime
  • No frequent full GCs
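The derived GC indicators can be computed from the cumulative counters the API does expose. A sketch, assuming you pass in `collection_count`, `collection_time_in_millis`, and the node's JVM uptime:

```python
def gc_health(collection_count, collection_time_in_millis, uptime_in_millis):
    """Derive the GC health indicators listed above from cumulative stats."""
    avg_pause_ms = collection_time_in_millis / max(collection_count, 1)
    gc_time_pct = 100.0 * collection_time_in_millis / max(uptime_in_millis, 1)
    gc_per_minute = collection_count / (uptime_in_millis / 60_000)
    return {
        "avg_pause_ms": avg_pause_ms,    # target: < 100 ms
        "gc_time_pct": gc_time_pct,      # target: < 5% of uptime
        "gc_per_minute": gc_per_minute,  # target: < 1/minute for old-gen
    }
```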

🧵 Thread Pool Statistics

"thread_pool": {
  "search": {
    "threads": 13,
    "queue": 0,
    "active": 0,
    "rejected": 0,
    "largest": 13,
    "completed": 1234567
  },
  "write": {
    "threads": 8,
    "queue": 0,
    "active": 1,
    "rejected": 0,
    "completed": 987654
  }
}

Monitor queue sizes and rejection counts to identify bottlenecks in search and indexing operations.
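That bottleneck check can be sketched in a few lines. The `search` and `write` numbers mirror the sample response above; the `bulk_backlog` pool is a hypothetical saturated pool added for contrast:

```python
THREAD_POOLS = {
    "search": {"threads": 13, "queue": 0, "active": 0, "rejected": 0},
    "write": {"threads": 8, "queue": 0, "active": 1, "rejected": 0},
    # Hypothetical overloaded pool, not from the sample response:
    "bulk_backlog": {"threads": 8, "queue": 450, "active": 8, "rejected": 37},
}

def bottlenecked_pools(pools):
    """Any rejection is a warning sign; a non-empty queue means work is piling up."""
    return sorted(name for name, p in pools.items()
                  if p["rejected"] > 0 or p["queue"] > 0)
```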

System Resource Metrics

💻 CPU and Load

CPU Metrics
  • cpu_percent: Current CPU usage
  • load_average: System load (1m, 5m, 15m)
  • cpu_total_in_millis: Total CPU time
Load Analysis
  • Load < CPU cores (healthy)
  • Load = CPU cores (fully utilized)
  • Load > CPU cores (overloaded)
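The load analysis above reduces to a comparison against core count. A minimal sketch:

```python
def load_status(load_average: float, cpu_cores: int) -> str:
    """Classify a load average against core count, per the rules above."""
    if load_average < cpu_cores:
        return "healthy"
    if load_average > cpu_cores:
        return "overloaded"
    return "fully utilized"
```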

💾 Memory and Storage

Memory Stats
  • total_in_bytes: Total system memory
  • free_in_bytes: Available memory
  • used_in_bytes: Memory in use
  • free_percent: Available percentage
File System
  • total_in_bytes: Total disk space
  • free_in_bytes: Total free space on the filesystem
  • available_in_bytes: Space actually usable by Elasticsearch
  • Monitor disk usage trends
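Disk usage percent follows directly from those filesystem fields, and is what the allocation watermarks key off. A sketch using the default Elasticsearch watermark levels (low 85%, high 90%, flood_stage 95%):

```python
def disk_usage(total_in_bytes: int, available_in_bytes: int):
    """Percent of disk used, plus which default allocation watermark (if any) is breached."""
    used_pct = 100.0 * (total_in_bytes - available_in_bytes) / total_in_bytes
    status = ("flood_stage" if used_pct >= 95 else
              "high" if used_pct >= 90 else
              "low" if used_pct >= 85 else "ok")
    return used_pct, status
```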

Index and Search Operations

📝 Indexing Performance

  • index_total: Total documents indexed
  • index_time_in_millis: Time spent indexing
  • index_current: Currently indexing
  • index_failed: Failed indexing operations
  • docs/sec: Indexing throughput (derived from index_total over time)

🔍 Search Performance

  • query_total: Total search queries
  • query_time_in_millis: Time spent searching
  • query_current: Currently executing
  • fetch_total: Fetch operations
  • avg_query_time: Average query latency (derived: query_time_in_millis / query_total)
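The derived numbers in both lists come from the cumulative counters: throughput is the delta between two snapshots divided by the interval, and average latency is total time divided by total operations. A minimal sketch:

```python
def indexing_rate(index_total_t0, index_total_t1, interval_s):
    """docs/sec between two node-stats snapshots taken interval_s apart."""
    return (index_total_t1 - index_total_t0) / interval_s

def avg_query_latency_ms(query_time_in_millis, query_total):
    """Average milliseconds per query over the node's lifetime."""
    return query_time_in_millis / max(query_total, 1)
```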

Performance Trend Analysis

Why Trends Matter More Than Snapshots

A single measurement can be misleading. Trends reveal patterns, predict problems, and help you understand your cluster's behavior over time. ElasticDoctor analyzes statistical trends to provide actionable insights.

Capacity Planning Indicators

🚨 Memory Growth

  • Heap usage trending upward
  • Consistent memory pressure
  • Frequent GC activity
  • Reduced cache efficiency

⚠️ CPU Saturation

  • Load average increasing
  • CPU usage >80% sustained
  • Thread pool rejections
  • Query latency increases

💾 Storage Growth

  • Disk usage trending up
  • Index size growth rate
  • Approaching disk watermarks
  • Shard size increases

Performance Degradation Detection

🔍 How ElasticDoctor Detects Performance Trends

Heap Usage Trends

Analyzes heap usage patterns over time to predict memory exhaustion before it occurs. Tracks growth rates and calculates estimated time to capacity.

Query Latency Analysis

Monitors query response time trends to detect performance degradation. Identifies patterns that indicate resource contention or capacity issues.

Predictive Alerts

Uses statistical analysis to predict when thresholds will be reached, enabling proactive capacity planning and performance optimization.
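One way to sketch this kind of prediction (a simple stand-in, not ElasticDoctor's actual model) is a least-squares slope over timestamped heap samples, extrapolated to a threshold:

```python
def seconds_until(samples, threshold_pct):
    """samples: list of (t_seconds, heap_used_percent) pairs.

    Fits a least-squares line and returns the estimated seconds until
    threshold_pct is crossed, or None if heap is flat or shrinking.
    """
    n = len(samples)
    mt = sum(t for t, _ in samples) / n
    my = sum(y for _, y in samples) / n
    num = sum((t - mt) * (y - my) for t, y in samples)
    den = sum((t - mt) ** 2 for t, _ in samples)
    slope = num / den  # growth rate in %/second
    if slope <= 0:
        return None    # no upward trend: no estimated time-to-capacity
    last_t, last_y = samples[-1]
    return (threshold_pct - last_y) / slope
```

For example, heap rising from 50% to 60% over two minutes projects to the 90% critical line about six minutes later.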

Monitoring Best Practices

✅ Proactive Monitoring

  • Monitor trends, not just current values
  • Set alerts based on rate of change
  • Track performance baselines
  • Monitor during different load patterns
  • Use percentile-based thresholds

💡 Key Metrics to Watch

  • Heap usage percentage and growth rate
  • GC frequency and pause times
  • CPU load average and utilization
  • Thread pool queue sizes and rejections
  • Search and indexing latencies

❌ Monitoring Anti-Patterns

  • Only checking metrics during incidents
  • Ignoring gradual performance degradation
  • Setting alerts too late (95% thresholds)
  • Not correlating metrics across nodes
  • Focusing on averages instead of outliers

⚠️ Alert Thresholds

  • Heap usage: 75% warning, 85% critical
  • CPU load: >cores warning, >1.5×cores critical
  • GC pause: >100ms warning, >1s critical
  • Thread rejections: >0 warning
  • Disk usage: 80% warning, 90% critical
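These tiers can be wired up as a small table-driven evaluator. The metric names and limits below come straight from the list above, not from any library:

```python
# (limit, level) pairs, checked from most to least severe.
THRESHOLDS = {
    "heap_pct":    [(85, "critical"), (75, "warning")],
    "gc_pause_ms": [(1000, "critical"), (100, "warning")],
    "disk_pct":    [(90, "critical"), (80, "warning")],
}

def alert_level(metric, value):
    """Return the alert level for a metric value, or 'ok' below all tiers."""
    for limit, level in THRESHOLDS.get(metric, []):
        if value >= limit:
            return level
    return "ok"
```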

Real-time Performance Excellence

Monitoring Principles

  • Proactive: Predict issues before they occur
  • Trend-based: Focus on patterns, not snapshots
  • Comprehensive: Monitor all critical subsystems
  • Actionable: Connect metrics to specific actions

Implementation Steps

  • Set up continuous node stats monitoring
  • Establish performance baselines
  • Configure trend-based alerting
  • Create capacity planning processes