Why Cluster Health Matters
The cluster health check is the foundation of Elasticsearch diagnostics. It provides the most critical overview of your cluster's operational status and is often the first indicator of problems that need immediate attention.
The /_cluster/health API is your first line of defense against Elasticsearch outages. This check validates core cluster health status, shard distribution, and operational stability. Understanding how to interpret its output can mean the difference between preventing an outage and experiencing data loss.
API Endpoint and Compatibility
GET /_cluster/health
✅ Version Compatibility
- • Elasticsearch 5.x - 9.x
- • OpenSearch 1.x - 2.x
- • Consistent API across versions
- • No breaking changes
🔧 Optional Parameters
- • level=shards - Detailed shard info
- • wait_for_status=green - Wait for status
- • timeout=30s - Request timeout
- • local=true - Local node only
Key Metrics Analyzed
1. Cluster Status
- • GREEN: All shards are active and allocated. Cluster is fully operational.
- • YELLOW: Some replica shards are unassigned. Reduced redundancy but functional.
- • RED: Primary shards are unassigned. Data loss or unavailability possible.
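As a rough illustration of how a diagnostic tool can turn the color into a severity, here is a minimal sketch; the mapping, labels, and function name are illustrative, not ElasticDoctor's actual code.

```python
# Minimal sketch: map the "status" field of a /_cluster/health response
# to a severity label. Names are illustrative.
STATUS_SEVERITY = {"green": "ok", "yellow": "warning", "red": "critical"}

def status_severity(health: dict) -> str:
    return STATUS_SEVERITY.get(health.get("status", ""), "unknown")
```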
2. Node Count Validation
"number_of_nodes": 3, "number_of_data_nodes": 3
Production Minimums
- • Total nodes: 3+ (split-brain prevention)
- • Data nodes: 2+ (replica allocation)
- • Master nodes: 3+ (odd number)
ElasticDoctor Thresholds
- • Critical: < 1 total node
- • Warning: < 3 total nodes
- • Warning: < 2 data nodes
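A minimal sketch of a node-count check against the minimums above, assuming a parsed /_cluster/health response as input; the function name and messages are illustrative:

```python
# Sketch: validate node counts against the production minimums listed above.
def check_node_counts(health: dict) -> list[str]:
    findings = []
    total = health.get("number_of_nodes", 0)
    data = health.get("number_of_data_nodes", 0)
    if total < 1:
        findings.append("CRITICAL: no nodes reported")
    elif total < 3:
        findings.append(f"WARNING: {total} total node(s); 3+ recommended for production")
    if data < 2:
        findings.append(f"WARNING: {data} data node(s); 2+ needed for replica allocation")
    return findings
```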
3. Shard Distribution
"active_primary_shards": 15, "active_shards": 30, "relocating_shards": 0, "initializing_shards": 0, "unassigned_shards": 0, "delayed_unassigned_shards": 0, "active_shards_percent_as_number": 100.0
Health Indicators
- • Active shards: Functional data segments
- • Unassigned: Shards without node allocation
- • Relocating: Shards moving between nodes
- • Initializing: Shards being created/recovered
Warning Thresholds
- • Unassigned shards: > 1
- • Active shard %: < 95%
- • High relocating: > 10
- • High initializing: > 20
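The same idea applies to the shard metrics; a minimal sketch using the warning thresholds above (the field names come straight from the health response, while the messages are illustrative):

```python
# Sketch: flag shard-distribution issues using the warning thresholds above.
def check_shards(health: dict) -> list[str]:
    findings = []
    if health.get("unassigned_shards", 0) > 1:
        findings.append("WARNING: unassigned shards present")
    if health.get("active_shards_percent_as_number", 100.0) < 95:
        findings.append("WARNING: active shard percentage below 95%")
    if health.get("relocating_shards", 0) > 10:
        findings.append("INFO: high relocation activity")
    if health.get("initializing_shards", 0) > 20:
        findings.append("INFO: high initialization activity")
    return findings
```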
Common Issues Detected
🚨 Critical: RED Cluster Status
Primary shards are unassigned, indicating potential data loss or unavailability.
Immediate Actions:
- 1. Check allocation explain: GET /_cluster/allocation/explain (see the sketch after this list)
- 2. Verify node availability and connectivity
- 3. Check disk space on data nodes
- 4. Review recent cluster changes or failures
- 5. Consider shard reallocation if nodes are healthy
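A sketch of the first step, assuming the requests package and a cluster on localhost:9200; aside from the two documented endpoints, the details are illustrative:

```python
# Sketch: if the cluster is RED, ask allocation explain why a shard is unassigned.
import requests

ES = "http://localhost:9200"

def explain_if_red():
    health = requests.get(f"{ES}/_cluster/health", timeout=10).json()
    if health.get("status") != "red":
        return None
    # With no request body, allocation explain reports on the first
    # unassigned shard it finds.
    return requests.get(f"{ES}/_cluster/allocation/explain", timeout=10).json()
```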
⚠️ Warning: Single Node Deployment
Running on a single node provides no redundancy and is unsuitable for production.
Recommended Actions:
- • Add additional nodes for high availability
- • Implement minimum 3-node setup for production
- • Configure dedicated master nodes for larger clusters
- • Review backup strategies for disaster recovery
ℹ️ Info: Cluster Activity
Shards are relocating or initializing, indicating cluster maintenance or scaling.
Monitoring Points:
- • Monitor relocation progress and completion
- • Check if activity is planned maintenance
- • Verify performance impact during activity
- • Consider limiting concurrent relocations if needed
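To watch relocation progress, a simple poll of the health endpoint is usually enough. A sketch assuming the requests package, with the interval and output format purely illustrative:

```python
# Sketch: poll /_cluster/health until relocation/initialization settles.
import time
import requests

def watch_activity(es="http://localhost:9200", interval=30):
    while True:
        health = requests.get(f"{es}/_cluster/health", timeout=10).json()
        moving = health["relocating_shards"]
        init = health["initializing_shards"]
        print(f"status={health['status']} relocating={moving} initializing={init}")
        if moving == 0 and init == 0:
            break
        time.sleep(interval)
```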
Threshold Configuration
ElasticDoctor Default Thresholds
```python
# From shared/utils/constants.py
CLUSTER_THRESHOLDS = {
    "min_production_nodes": 3,
    "min_data_nodes": 2,
    "unassigned_shards_warning": 1,
    "unassigned_shards_critical": 10,
    "active_shards_percent_warning": 90,
    "active_shards_percent_critical": 85,
    "high_relocating_shards": 10,
    "high_initializing_shards": 20,
}
```
Customizing Thresholds
Adjust thresholds based on your cluster size and requirements:
- • Small clusters: Lower node count minimums
- • Large clusters: Higher shard activity tolerance
- • Development: Relaxed production rules
- • Critical systems: Stricter availability requirements
Configuration Methods
- • Config file: Update constants.py
- • Environment variables: Override defaults (see the sketch below this list)
- • API parameters: Runtime customization
- • Per-cluster settings: Cluster-specific rules
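As one possible shape for the environment-variable method, here is a sketch; the ELASTICDOCTOR_* variable names are hypothetical examples, not documented settings:

```python
# Sketch: allow environment variables to override the defaults in constants.py.
# The ELASTICDOCTOR_* names are hypothetical.
import os

def load_thresholds(defaults: dict) -> dict:
    thresholds = dict(defaults)
    for key, value in thresholds.items():
        override = os.environ.get(f"ELASTICDOCTOR_{key.upper()}")
        if override is not None:
            thresholds[key] = type(value)(override)
    return thresholds
```

For example, setting a hypothetical ELASTICDOCTOR_MIN_PRODUCTION_NODES=5 would tighten the node minimum for a critical cluster.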
Integration with Other Checks
Cluster Health as Foundation
The cluster health check provides essential metrics used by other diagnostic phases. Its results influence the execution and interpretation of subsequent checks.
Dependent Checks
- • Node Info/Stats: Node count validation
- • Cat Shards: Detailed shard analysis
- • Allocation Explain: Unassigned shard investigation
- • Index Health: Per-index status correlation
Shared Metrics
- • Node count for capacity planning
- • Shard distribution for performance analysis
- • Cluster status for severity assessment
- • Activity indicators for timing considerations
Implementation Examples
Basic Health Check
```bash
# Basic health check
curl -X GET "localhost:9200/_cluster/health?pretty"
```

Response:

```json
{
  "cluster_name": "my-cluster",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 3,
  "number_of_data_nodes": 3,
  "active_primary_shards": 15,
  "active_shards": 30,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100.0
}
```
ElasticDoctor Implementation
🔍 How ElasticDoctor Analyzes Cluster Health
Status Validation
ElasticDoctor automatically categorizes cluster status (GREEN/YELLOW/RED) and maps each status to appropriate severity levels with actionable recommendations.
Node Count Analysis
Validates production readiness by checking minimum node requirements and assessing deployment architecture against best practices.
Shard Health Assessment
Analyzes shard distribution patterns, identifies allocation issues, and monitors cluster activity levels to detect potential problems.
Advanced Health Monitoring
```bash
# Wait for specific status with timeout
curl -X GET "localhost:9200/_cluster/health?wait_for_status=green&timeout=30s"

# Get detailed shard-level information
curl -X GET "localhost:9200/_cluster/health?level=shards&pretty"

# Monitor specific indices
curl -X GET "localhost:9200/_cluster/health/my-index?pretty"

# Check for no relocating shards
curl -X GET "localhost:9200/_cluster/health?wait_for_no_relocating_shards=true&timeout=5m"
```
Best Practices & Recommendations
✅ Do
- • Monitor cluster health continuously
- • Set up alerts for status changes
- • Maintain minimum 3-node production clusters
- • Plan for shard allocation during maintenance
- • Use wait_for_status for deployment scripts
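For the deployment-script case, a minimal gate might look like the sketch below, using the documented wait_for_status and timeout parameters; the host and exit behavior are illustrative:

```python
# Sketch: block until the cluster is green, then fail the pipeline if it is not.
import sys
import requests

resp = requests.get(
    "http://localhost:9200/_cluster/health",
    params={"wait_for_status": "green", "timeout": "60s"},
    timeout=90,
)
health = resp.json()
if health.get("timed_out") or health.get("status") != "green":
    print(f"cluster not green: {health.get('status')}", file=sys.stderr)
    sys.exit(1)
print("cluster is green, proceeding with deployment")
```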
💡 Tips
- • Yellow status is often acceptable for dev environments
- • High initialization activity after node additions is normal
- • Use allocation explain for detailed shard investigation
- • Monitor trends, not just current status
❌ Don't
- • Ignore RED status - investigate immediately
- • Run single-node clusters in production
- • Forget to check after cluster changes
- • Assume YELLOW is always a problem
- • Make allocation changes during high activity
⚠️ Warnings
- • Health can change rapidly during failures
- • Network partitions can cause isolated nodes to report misleading health
- • Some operations require cluster stability
- • Resource constraints can cause status changes
Key Takeaways
Critical Points
- • Cluster health is the foundation of all diagnostics
- • RED status requires immediate investigation
- • Node count affects availability and performance
- • Shard distribution impacts cluster stability
Next Steps
- • Implement automated health monitoring
- • Learn allocation explain for troubleshooting
- • Review node and shard check guides
- • Set up alerting for status changes