Why Cluster Health Matters
The cluster health check is the foundation of Elasticsearch diagnostics. It provides the most critical overview of your cluster's operational status and is often the first indicator of problems that need immediate attention.
The /_cluster/health API is your first line of defense against Elasticsearch outages. This check validates core cluster health status, shard distribution, and operational stability. Understanding how to interpret its output can mean the difference between preventing an outage and experiencing data loss.
API Endpoint and Compatibility
GET /_cluster/health
✅ Version Compatibility
- • Elasticsearch 5.x - 9.x
- • OpenSearch 1.x - 2.x
- • Consistent API across versions
- • No breaking changes
🔧 Optional Parameters
- • level=shards - Detailed shard info
- • wait_for_status=green - Wait for status
- • timeout=30s - Request timeout
- • local=true - Local node only
Key Metrics Analyzed
1. Cluster Status
- • GREEN: All shards are active and allocated. Cluster is fully operational.
- • YELLOW: Some replica shards are unassigned. Reduced redundancy but functional.
- • RED: Primary shards are unassigned. Data loss or unavailability possible.
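As a rough illustration of how a diagnostic tool can turn the color into a severity, here is a minimal sketch; the mapping, labels, and function name are illustrative, not ElasticDoctor's actual code.

```python
# Minimal sketch: map the "status" field of a /_cluster/health response
# to a severity label. Names are illustrative.
STATUS_SEVERITY = {"green": "ok", "yellow": "warning", "red": "critical"}

def status_severity(health: dict) -> str:
    return STATUS_SEVERITY.get(health.get("status", ""), "unknown")
```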
2. Node Count Validation
"number_of_nodes": 3, "number_of_data_nodes": 3
Production Minimums
- • Total nodes: 3+ (split-brain prevention)
- • Data nodes: 2+ (replica allocation)
- • Master nodes: 3+ (odd number)
ElasticDoctor Thresholds
- • Critical: < 1 total node
- • Warning: < 3 total nodes
- • Warning: < 2 data nodes
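A minimal sketch of a node-count check against the minimums above, assuming a parsed /_cluster/health response as input; the function name and messages are illustrative:

```python
# Sketch: validate node counts against the production minimums listed above.
def check_node_counts(health: dict) -> list[str]:
    findings = []
    total = health.get("number_of_nodes", 0)
    data = health.get("number_of_data_nodes", 0)
    if total < 1:
        findings.append("CRITICAL: no nodes reported")
    elif total < 3:
        findings.append(f"WARNING: {total} total node(s); 3+ recommended for production")
    if data < 2:
        findings.append(f"WARNING: {data} data node(s); 2+ needed for replica allocation")
    return findings
```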
3. Shard Distribution
"active_primary_shards": 15, "active_shards": 30, "relocating_shards": 0, "initializing_shards": 0, "unassigned_shards": 0, "delayed_unassigned_shards": 0, "active_shards_percent_as_number": 100.0
Health Indicators
- • Active shards: Functional data segments
- • Unassigned: Shards without node allocation
- • Relocating: Shards moving between nodes
- • Initializing: Shards being created/recovered
Warning Thresholds
- • Unassigned shards: > 1
- • Active shard %: < 95%
- • High relocating: > 10
- • High initializing: > 20
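The same idea applies to the shard metrics; a minimal sketch using the warning thresholds above (the field names come straight from the health response, while the messages are illustrative):

```python
# Sketch: flag shard-distribution issues using the warning thresholds above.
def check_shards(health: dict) -> list[str]:
    findings = []
    if health.get("unassigned_shards", 0) > 1:
        findings.append("WARNING: unassigned shards present")
    if health.get("active_shards_percent_as_number", 100.0) < 95:
        findings.append("WARNING: active shard percentage below 95%")
    if health.get("relocating_shards", 0) > 10:
        findings.append("INFO: high relocation activity")
    if health.get("initializing_shards", 0) > 20:
        findings.append("INFO: high initialization activity")
    return findings
```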
Common Issues Detected
🚨 Critical: RED Cluster Status
Primary shards are unassigned, indicating potential data loss or unavailability.
Immediate Actions:
- 1. Check allocation explain: GET /_cluster/allocation/explain (see the sketch after this list)
- 2. Verify node availability and connectivity
- 3. Check disk space on data nodes
- 4. Review recent cluster changes or failures
- 5. Consider shard reallocation if nodes are healthy
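A sketch of the first step, assuming the requests package and a cluster on localhost:9200; aside from the two documented endpoints, the details are illustrative:

```python
# Sketch: if the cluster is RED, ask allocation explain why a shard is unassigned.
import requests

ES = "http://localhost:9200"

def explain_if_red():
    health = requests.get(f"{ES}/_cluster/health", timeout=10).json()
    if health.get("status") != "red":
        return None
    # With no request body, allocation explain reports on the first
    # unassigned shard it finds.
    return requests.get(f"{ES}/_cluster/allocation/explain", timeout=10).json()
```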
⚠️ Warning: Single Node Deployment
Running on a single node provides no redundancy and is unsuitable for production.
Recommended Actions:
- • Add additional nodes for high availability
- • Implement minimum 3-node setup for production
- • Configure dedicated master nodes for larger clusters
- • Review backup strategies for disaster recovery
ℹ️ Info: Cluster Activity
Shards are relocating or initializing, indicating cluster maintenance or scaling.
Monitoring Points:
- • Monitor relocation progress and completion
- • Check if activity is planned maintenance
- • Verify performance impact during activity
- • Consider limiting concurrent relocations if needed
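To watch relocation progress, a simple poll of the health endpoint is usually enough. A sketch assuming the requests package, with the interval and output format purely illustrative:

```python
# Sketch: poll /_cluster/health until relocation/initialization settles.
import time
import requests

def watch_activity(es="http://localhost:9200", interval=30):
    while True:
        health = requests.get(f"{es}/_cluster/health", timeout=10).json()
        moving = health["relocating_shards"]
        init = health["initializing_shards"]
        print(f"status={health['status']} relocating={moving} initializing={init}")
        if moving == 0 and init == 0:
            break
        time.sleep(interval)
```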
Threshold Configuration
ElasticDoctor Default Thresholds
```python
# From shared/utils/constants.py
CLUSTER_THRESHOLDS = {
    "min_production_nodes": 3,
    "min_data_nodes": 2,
    "unassigned_shards_warning": 1,
    "unassigned_shards_critical": 10,
    "active_shards_percent_warning": 90,
    "active_shards_percent_critical": 85,
    "high_relocating_shards": 10,
    "high_initializing_shards": 20,
}
```
Customizing Thresholds
Adjust thresholds based on your cluster size and requirements:
- • Small clusters: Lower node count minimums
- • Large clusters: Higher shard activity tolerance
- • Development: Relaxed production rules
- • Critical systems: Stricter availability requirements
Configuration Methods
- • Config file: Update constants.py
- • Environment variables: Override defaults (see the sketch below this list)
- • API parameters: Runtime customization
- • Per-cluster settings: Cluster-specific rules
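As one possible shape for the environment-variable method, here is a sketch; the ELASTICDOCTOR_* variable names are hypothetical examples, not documented settings:

```python
# Sketch: allow environment variables to override the defaults in constants.py.
# The ELASTICDOCTOR_* names are hypothetical.
import os

def load_thresholds(defaults: dict) -> dict:
    thresholds = dict(defaults)
    for key, value in thresholds.items():
        override = os.environ.get(f"ELASTICDOCTOR_{key.upper()}")
        if override is not None:
            thresholds[key] = type(value)(override)
    return thresholds
```

For example, setting a hypothetical ELASTICDOCTOR_MIN_PRODUCTION_NODES=5 would tighten the node minimum for a critical cluster.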
Integration with Other Checks
Cluster Health as Foundation
The cluster health check provides essential metrics used by other diagnostic phases. Its results influence the execution and interpretation of subsequent checks.
Dependent Checks
- • Node Info/Stats: Node count validation
- • Cat Shards: Detailed shard analysis
- • Allocation Explain: Unassigned shard investigation
- • Index Health: Per-index status correlation
Shared Metrics
- • Node count for capacity planning
- • Shard distribution for performance analysis
- • Cluster status for severity assessment
- • Activity indicators for timing considerations
Implementation Examples
Basic Health Check
```bash
# Basic health check
curl -X GET "localhost:9200/_cluster/health?pretty"
```

Response:

```json
{
  "cluster_name": "my-cluster",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 3,
  "number_of_data_nodes": 3,
  "active_primary_shards": 15,
  "active_shards": 30,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100.0
}
```
ElasticDoctor Implementation
🔍 How ElasticDoctor Analyzes Cluster Health
Status Validation
ElasticDoctor automatically categorizes cluster status (GREEN/YELLOW/RED) and maps each status to appropriate severity levels with actionable recommendations.
Node Count Analysis
Validates production readiness by checking minimum node requirements and assessing deployment architecture against best practices.
Shard Health Assessment
Analyzes shard distribution patterns, identifies allocation issues, and monitors cluster activity levels to detect potential problems.
Advanced Health Monitoring
```bash
# Wait for specific status with timeout
curl -X GET "localhost:9200/_cluster/health?wait_for_status=green&timeout=30s"

# Get detailed shard-level information
curl -X GET "localhost:9200/_cluster/health?level=shards&pretty"

# Monitor specific indices
curl -X GET "localhost:9200/_cluster/health/my-index?pretty"

# Check for no relocating shards
curl -X GET "localhost:9200/_cluster/health?wait_for_no_relocating_shards=true&timeout=5m"
```
Best Practices & Recommendations
✅ Do
- • Monitor cluster health continuously
- • Set up alerts for status changes
- • Maintain minimum 3-node production clusters
- • Plan for shard allocation during maintenance
- • Use wait_for_status for deployment scripts
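For the deployment-script case, a minimal gate might look like the sketch below, using the documented wait_for_status and timeout parameters; the host and exit behavior are illustrative:

```python
# Sketch: block until the cluster is green, then fail the pipeline if it is not.
import sys
import requests

resp = requests.get(
    "http://localhost:9200/_cluster/health",
    params={"wait_for_status": "green", "timeout": "60s"},
    timeout=90,
)
health = resp.json()
if health.get("timed_out") or health.get("status") != "green":
    print(f"cluster not green: {health.get('status')}", file=sys.stderr)
    sys.exit(1)
print("cluster is green, proceeding with deployment")
```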
💡 Tips
- • Yellow status is often acceptable for dev environments
- • High initialization activity after node additions is normal
- • Use allocation explain for detailed shard investigation
- • Monitor trends, not just current status
❌ Don't
- • Ignore RED status - investigate immediately
- • Run single-node clusters in production
- • Forget to check after cluster changes
- • Assume YELLOW is always a problem
- • Make allocation changes during high activity
⚠️ Warnings
- • Health can change rapidly during failures
- • Network partitions can cause isolated nodes to report misleading health
- • Some operations require cluster stability
- • Resource constraints can cause status changes
Key Takeaways
Critical Points
- • Cluster health is the foundation of all diagnostics
- • RED status requires immediate investigation
- • Node count affects availability and performance
- • Shard distribution impacts cluster stability
Next Steps
- • Implement automated health monitoring
- • Learn allocation explain for troubleshooting
- • Review node and shard check guides
- • Set up alerting for status changes