Performance is Everything
Node performance monitoring is critical to Elasticsearch stability. This check validates JVM memory usage, garbage collection, CPU utilization, and system resources to prevent outages and keep performance healthy.
The Node Performance check is your early warning system for resource exhaustion and performance degradation. It monitors five areas: heap memory, garbage collection, CPU utilization, system memory, and file descriptors. Acting on these metrics early prevents node crashes and keeps the cluster performing well.
API Endpoint and Data Sources
GET /_nodes/stats
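The full response is large; if you only need the groups this check reads, the node stats API accepts a metric filter, for example restricting the request to the categories listed below:
GET /_nodes/stats/jvm,os,process,fs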
📊 Data Categories
- • JVM Stats: Heap, non-heap, GC performance
- • OS Stats: CPU, memory, swap usage
- • Process Stats: File descriptors, uptime
- • System Resources: Load averages, I/O
🔄 Update Frequency
- • Real-time: Current resource usage
- • Cumulative: GC counts, total operations
- • Averages: Calculated performance metrics
- • Peak values: Maximum observed usage
Five Critical Performance Checks
1. Heap Memory Usage (Most Critical)
Why It Matters
Heap exhaustion causes OutOfMemoryError, leading to node crashes and potential data loss, which makes heap usage the single most important metric for Elasticsearch stability.
Default Thresholds
- • Warning: ≥75% heap usage
- • Critical: ≥90% heap usage
"jvm": { "mem": { "heap_used_in_bytes": 7516192768, "heap_used_percent": 89, "heap_max_in_bytes": 8589934592 } }
⚡ Immediate Actions for High Heap Usage
- 1. Increase heap size: -Xms8g -Xmx8g (up to 50% of RAM)
- 2. Monitor field data: GET /_cat/fielddata?v
- • 3. Check query cache: Clear it if evictions are high (see the example requests after this list)
- 4. Review mappings: Reduce unnecessary field storage
- 5. Scale horizontally: Add more nodes if vertical scaling isn't sufficient
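For steps 2 and 3, the fielddata cat API shows per-field memory by node, and the clear cache API can drop the fielddata or query caches. Clearing across all indices, as sketched below, is a blunt instrument; scope it to specific indices where possible:
GET /_cat/fielddata?v
POST /_cache/clear?fielddata=true
POST /_cache/clear?query=true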
Integration with Other Checks
Performance as Infrastructure Foundation
Node performance metrics influence nearly every other diagnostic check. Poor performance can mask other issues or indicate root causes of problems detected elsewhere.
Dependent Checks
- • Hot Threads: CPU issue investigation (example requests follow the Performance Indicators list below)
- • Cluster Tasks: Resource impact on operations
- • Index Stats: Performance correlation with usage
- • Cache Performance: Memory pressure effects
Performance Indicators
- • Memory pressure affects query performance
- • CPU saturation slows all operations
- • GC pauses cause cluster communication issues
- • Resource exhaustion leads to node failures
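If CPU saturation or GC pauses are suspected, two follow-up requests usually give enough to correlate with the checks above: the hot threads API for CPU hotspots, and the GC collector counters from node stats (the filter_path value assumes the default response layout):
GET /_nodes/hot_threads
GET /_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.gc.collectors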
Best Practices & Recommendations
✅ Performance Optimization
- • Set heap to at most 50% of RAM (see the configuration sketch after this list)
- • Monitor GC frequency and duration regularly
- • Increase file descriptor limits to 65536+
- • Disable swap or use memory locking
- • Use G1GC for heaps >8GB
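A minimal configuration sketch covering the items above, assuming a Linux host with 16 GB of RAM, an 8 GB heap, and a service user named elasticsearch (adjust paths and values for your environment; depending on version, custom JVM flags go in config/jvm.options or a file under config/jvm.options.d/, and newer releases may already default to G1, in which case the flag is unnecessary):
# config/jvm.options.d/heap.options (or config/jvm.options)
-Xms8g
-Xmx8g
-XX:+UseG1GC
# config/elasticsearch.yml
bootstrap.memory_lock: true
# /etc/security/limits.conf
elasticsearch  -  nofile   65536
elasticsearch  -  memlock  unlimited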
💡 Monitoring Tips
- • Track trends, not just current values (a compact snapshot request is sketched after this list)
- • Set alerts at warning levels for proactive action
- • Monitor during peak usage periods
- • Correlate performance with application metrics
- • Baseline normal performance for comparison
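One low-effort way to build trends and baselines is to poll a compact snapshot of the key metrics on a schedule and store the results; a sketch of such a snapshot request, with filter_path values that assume the default node stats layout:
GET /_nodes/stats/jvm,os,process?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent,nodes.*.os.cpu.percent,nodes.*.process.open_file_descriptors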
❌ Performance Killers
- • Ignoring gradual memory leaks
- • Setting heap size above ~32GB, which disables compressed OOPs
- • Running with swap enabled (verification requests follow this list)
- • Allowing sustained high CPU usage
- • Insufficient file descriptor limits
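Two of these are straightforward to verify from the APIs: node stats expose OS swap usage, and node info reports whether memory locking (mlockall) actually took effect:
GET /_nodes/stats/os?filter_path=nodes.*.name,nodes.*.os.swap
GET /_nodes?filter_path=nodes.*.name,nodes.*.process.mlockall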
⚠️ Common Mistakes
- • Reactive instead of proactive monitoring
- • Ignoring GC tuning opportunities
- • Overlooking system-level resource limits
- • Not monitoring performance during scaling
Performance Monitoring Essentials
Critical Metrics
- • Heap usage: Most critical for stability
- • GC performance: Impacts all operations
- • CPU utilization: Indicates processing capacity
- • System memory: Prevents swapping issues
Action Plan
- • Implement continuous performance monitoring
- • Set up proactive alerting at warning thresholds
- • Create performance baselines for your workload
- • Develop capacity planning based on trends