ElasticDoctor - Elasticsearch Health Diagnostics

The Elasticsearch Health Score: How ElasticDoctor Calculates Your Cluster's Fitness

Beyond Green, Yellow, Red

Elasticsearch reports cluster status as one of three colors: green, yellow, or red. ElasticDoctor goes further with a 0-100 health score built from 22 diagnostic checks across 4 phases, each phase weighted by its real-world impact on cluster stability and performance.

A single number that summarizes your Elasticsearch cluster's health: that's the promise of ElasticDoctor's health score algorithm. Behind this simple metric is a weighted scoring system that considers infrastructure stability, performance metrics, data layer health, and operational efficiency.

What You'll Learn

Algorithm Fundamentals

• How the 0-100 score is calculated
• Why different checks have different weights
• The 4-phase diagnostic framework
• Score interpretation and thresholds

Practical Application

• Reading your cluster's health score
• Understanding score trends over time
• Prioritizing improvements based on impact
• Customizing weights for your environment

The 4-Phase Diagnostic Framework

Simple English Explanation

Think of ElasticDoctor like a doctor examining a patient. Just as a doctor checks different body systems (heart, lungs, blood, etc.) with different levels of concern, ElasticDoctor examines your cluster in 4 phases, each with different importance levels.

Some problems (like a heart attack) are more serious than others (like a minor cut), so they get more weight in the overall health assessment.

Foundation Phase (Weight: 35%)

These are the "life support" checks. If these fail, your cluster is in serious trouble.

Critical Checks

• Cluster Health: Is the cluster responding?
• Cluster Info: Can we connect and get basic info?
• License Status: Are features available?
• Cluster Settings: Are configurations safe?

Why High Weight?

These issues can cause immediate outages, data loss, or complete cluster failure. They're like checking if the patient is breathing - nothing else matters if these fail.
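The article does not say which Elasticsearch APIs ElasticDoctor calls for these checks. As a rough sketch, the four Foundation check names line up naturally with standard Elasticsearch REST endpoints. The mapping below is an assumption made for illustration, and `FOUNDATION_ENDPOINTS` and `build_urls` are hypothetical names, not part of ElasticDoctor.

```python
# Assumed mapping from Foundation check names to standard Elasticsearch
# REST endpoints. ElasticDoctor's actual requests are not documented here.
FOUNDATION_ENDPOINTS = {
    "Cluster Health": "/_cluster/health",
    "Cluster Info": "/",
    "License Status": "/_license",
    "Cluster Settings": "/_cluster/settings",
}


def build_urls(base_url):
    """Return the full URL for each Foundation check (no requests are sent)."""
    base = base_url.rstrip("/")
    return {check: base + path for check, path in FOUNDATION_ENDPOINTS.items()}


for check, url in build_urls("http://localhost:9200").items():
    print(f"{check}: GET {url}")
```

The sketch only builds URLs. Sending the requests, handling authentication, and turning each response into a 0-100 check score are left out because the article does not describe them.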

Infrastructure Phase (Weight: 30%)

These checks validate that your servers have adequate resources and are configured properly.

Resource Checks

• Node Performance: CPU, memory, disk usage
• Node Info: Hardware specifications
• Node Settings: JVM and OS configuration
• Node Stats: Real-time resource metrics
• Hot Threads: Performance bottlenecks

Why Important?

Resource problems cause performance issues, instability, and eventual failures. It's like checking the patient's vital signs - heart rate, blood pressure, temperature.

Data Layer Phase (Weight: 25%)

These checks ensure your data is properly stored, distributed, and accessible.

Data Checks

• Cat Indices: Index health overview
• Index Settings: Configuration optimization
• Index Stats: Performance metrics
• Cat Shards: Shard distribution
• Allocation Explain: Shard assignment issues

Why Moderate Weight?

Data issues affect performance and availability but usually don't cause immediate total failure. Like checking specific organs - important but not immediately life-threatening.

Operations Phase (Weight: 10%)

These are optimization and maintenance checks that improve long-term health and efficiency.

Operational Checks

• Cluster Tasks: Long-running operations
• Pending Tasks: Queue management
• Ingest Pipelines: Data processing
• Snapshots: Backup strategies
• Deprecations: Future-proofing
• ILM Policies: Lifecycle management
• Data Tiers: Storage optimization
• Mappings: Field optimization

Why Lower Weight?

These are optimization opportunities, not critical problems. Like recommending exercise or vitamins - good for long-term health but not urgent.
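To see the whole framework at a glance, the four phases can be written down as plain data. The weights and check names come from the sections above. The dictionary layout and the `summarize` helper are assumptions for illustration, not ElasticDoctor's internal format.

```python
# Phase weights and check names are taken from the framework described above.
# The dictionary layout itself is an assumption made for illustration.
PHASES = {
    "Foundation": {
        "weight": 0.35,
        "checks": ["Cluster Health", "Cluster Info", "License Status",
                   "Cluster Settings"],
    },
    "Infrastructure": {
        "weight": 0.30,
        "checks": ["Node Performance", "Node Info", "Node Settings",
                   "Node Stats", "Hot Threads"],
    },
    "Data Layer": {
        "weight": 0.25,
        "checks": ["Cat Indices", "Index Settings", "Index Stats",
                   "Cat Shards", "Allocation Explain"],
    },
    "Operations": {
        "weight": 0.10,
        "checks": ["Cluster Tasks", "Pending Tasks", "Ingest Pipelines",
                   "Snapshots", "Deprecations", "ILM Policies",
                   "Data Tiers", "Mappings"],
    },
}


def summarize(phases):
    """Return (total_weight, total_checks) so the framework can be sanity-checked."""
    total_weight = sum(p["weight"] for p in phases.values())
    total_checks = sum(len(p["checks"]) for p in phases.values())
    return total_weight, total_checks


total_weight, total_checks = summarize(PHASES)
print(round(total_weight, 6), total_checks)  # 1.0 22
```

The weights add up to 100% and the check lists add up to the 22 diagnostic checks mentioned in the introduction.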

Score Calculation Method

The Formula

Final Score = (Foundation × 0.35) + (Infrastructure × 0.30) + (Data Layer × 0.25) + (Operations × 0.10)

Each phase score is the average of the individual check scores within that phase. The final 0-100 score is the weighted sum of the four phase scores.
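The formula translates almost directly into code. The following is a minimal sketch, assuming each check has already been scored from 0 to 100. The function names and the sample inputs are hypothetical and chosen only to show the arithmetic.

```python
# Phase weights from the formula above.
WEIGHTS = {
    "Foundation": 0.35,
    "Infrastructure": 0.30,
    "Data Layer": 0.25,
    "Operations": 0.10,
}


def phase_score(check_scores):
    """Average the 0-100 scores of the checks in one phase."""
    if not check_scores:
        raise ValueError("a phase needs at least one check score")
    return sum(check_scores) / len(check_scores)


def final_score(phase_scores):
    """Weighted sum of the four phase scores."""
    return sum(WEIGHTS[name] * phase_scores[name] for name in WEIGHTS)


# Hypothetical input: one failing and three passing Foundation checks,
# with every other phase at 100.
scores = {
    "Foundation": phase_score([0, 100, 100, 100]),  # 75.0
    "Infrastructure": 100,
    "Data Layer": 100,
    "Operations": 100,
}
print(round(final_score(scores), 2))  # 91.25
```

Because Foundation carries 35% of the weight, a 25-point drop in that phase alone costs 8.75 points on the final score.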

🚨 Critical (0-50)

Your cluster needs immediate attention.

• Major issues in Foundation phase
• Risk of data loss or outages
• Immediate action required

⚠️ Warning (51-75)

Performance issues or risks present.

• Infrastructure or data layer problems
• Degraded performance likely
• Plan improvements soon

✅ Healthy (76-100)

Your cluster is performing well.

• All critical systems functioning
• Good performance and stability
• Focus on optimization opportunities
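The three bands above can be expressed as a small lookup. This sketch assumes fractional scores are rounded to the nearest whole number before the band is chosen, as the worked example in this article does when it rounds 84.26 to 84. Rounding halves upward is a choice made here, and `classify` is a hypothetical helper name.

```python
import math


def classify(score):
    """Map a 0-100 health score to the status bands listed above."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    # Assumption: round to the nearest integer (halves round up) first.
    rounded = math.floor(score + 0.5)
    if rounded <= 50:
        return "Critical"
    if rounded <= 75:
        return "Warning"
    return "Healthy"


print(classify(42))     # Critical
print(classify(75))     # Warning
print(classify(84.26))  # Healthy
```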

Real-World Example

Sample Cluster: "production-logs"

Let's walk through how ElasticDoctor calculated a score of 84 for a real production cluster.

Phase 1: Foundation (Weight: 35%)

Individual Scores

• Cluster Health: 100 (Green status)
• Cluster Info: 100 (All info available)
• License Status: 85 (Expires in 45 days)
• Cluster Settings: 90 (Minor optimization needed)

Calculation

Foundation = (100 + 100 + 85 + 90) / 4 = 93.75

Phase 2: Infrastructure (Weight: 30%)

Individual Scores

• Node Performance: 75 (High memory usage)
• Node Info: 85 (Adequate hardware)
• Node Settings: 80 (Some tuning needed)
• Node Stats: 70 (CPU pressure)
• Hot Threads: 60 (Some bottlenecks)

Calculation

Infrastructure = (75 + 85 + 80 + 70 + 60) / 5 = 74
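The two phase averages above can be reproduced in a few lines. The check scores are the ones listed for the sample cluster. `phase_average` is a hypothetical helper name used only for this sketch.

```python
def phase_average(check_scores):
    """Plain average of the check scores in one phase."""
    if not check_scores:
        raise ValueError("a phase needs at least one check score")
    return sum(check_scores.values()) / len(check_scores)


# Check scores from the "production-logs" example.
foundation = {
    "Cluster Health": 100,
    "Cluster Info": 100,
    "License Status": 85,
    "Cluster Settings": 90,
}
infrastructure = {
    "Node Performance": 75,
    "Node Info": 85,
    "Node Settings": 80,
    "Node Stats": 70,
    "Hot Threads": 60,
}

print(phase_average(foundation))      # 93.75
print(phase_average(infrastructure))  # 74.0
```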

Final Score Calculation

The Data Layer and Operations phases are scored the same way; for this cluster they come to 86 and 77.5, respectively.

Final Score = (Foundation × 0.35) + (Infrastructure × 0.30) + (Data Layer × 0.25) + (Operations × 0.10)

Final Score = (93.75 × 0.35) + (74 × 0.30) + (86 × 0.25) + (77.5 × 0.10)
Final Score = 32.81 + 22.20 + 21.50 + 7.75
Final Score = 84.26 → 84

Score: 84

Status: Healthy - Good performance with some optimization opportunities
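As a check on the arithmetic, the final score for this cluster can be recomputed from the four phase scores used in the calculation. This is a sketch of the published formula, not ElasticDoctor's source code, and the variable names are illustrative.

```python
# Phase weights from the formula.
WEIGHTS = {
    "Foundation": 0.35,
    "Infrastructure": 0.30,
    "Data Layer": 0.25,
    "Operations": 0.10,
}

# Phase scores for the "production-logs" example.
phase_scores = {
    "Foundation": 93.75,
    "Infrastructure": 74.0,
    "Data Layer": 86.0,
    "Operations": 77.5,
}

# Each phase contributes weight * score points to the final score.
contributions = {name: WEIGHTS[name] * phase_scores[name] for name in WEIGHTS}
final = sum(contributions.values())

print(round(final))  # 84
```

Infrastructure contributes about 22.2 of a possible 30 points, which is where this cluster loses the most ground.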

Understanding Your Cluster's Health

Key Insights

• Weighted Intelligence: Not all issues are equal - critical problems get more attention
• Holistic View: Considers all aspects of cluster health, not just basic status
• Actionable Prioritization: Higher-weighted issues should be addressed first
• Trend Monitoring: Track scores over time to identify patterns
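One way to turn the prioritization idea into something concrete is to rank phases by how many points each one is currently costing the final score, that is, weight × (100 − phase score). This ranking rule is an assumption for illustration and is not described as an ElasticDoctor feature. The phase scores are those of the sample cluster.

```python
WEIGHTS = {
    "Foundation": 0.35,
    "Infrastructure": 0.30,
    "Data Layer": 0.25,
    "Operations": 0.10,
}


def rank_by_points_lost(phase_scores):
    """Sort phases by weight * (100 - score), largest loss first."""
    lost = {name: WEIGHTS[name] * (100 - phase_scores[name]) for name in WEIGHTS}
    return sorted(lost.items(), key=lambda item: item[1], reverse=True)


# Phase scores from the "production-logs" example.
example = {
    "Foundation": 93.75,
    "Infrastructure": 74.0,
    "Data Layer": 86.0,
    "Operations": 77.5,
}

ranking = rank_by_points_lost(example)
print([name for name, _ in ranking])
# ['Infrastructure', 'Data Layer', 'Operations', 'Foundation']
```

For this cluster, Infrastructure accounts for roughly 7.8 of the 15.7 points lost, so it is the first place to look even though Foundation has the higher weight.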

Best Practices

• Monitor health scores regularly, not just during incidents
• Focus on Foundation and Infrastructure phases first
• Use score trends to predict and prevent issues
• Customize weights based on your specific requirements
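The article does not describe how ElasticDoctor exposes custom weights. As a hedged sketch of what customization involves, the code below rescales a set of hypothetical weights so they still sum to 1, which keeps the final score on the 0-100 scale. `normalize_weights`, `final_score`, and the custom values are illustrative assumptions.

```python
PHASE_NAMES = ("Foundation", "Infrastructure", "Data Layer", "Operations")


def normalize_weights(raw):
    """Rescale non-negative raw weights so they sum to 1."""
    if set(raw) != set(PHASE_NAMES):
        raise ValueError("weights must cover exactly the four phases")
    if any(value < 0 for value in raw.values()):
        raise ValueError("weights must be non-negative")
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: raw[name] / total for name in PHASE_NAMES}


def final_score(phase_scores, weights):
    """Weighted sum of phase scores using the supplied weights."""
    return sum(weights[name] * phase_scores[name] for name in PHASE_NAMES)


# Hypothetical profile for a cluster where data availability matters more
# than operational housekeeping.
custom = normalize_weights(
    {"Foundation": 35, "Infrastructure": 25, "Data Layer": 35, "Operations": 5}
)

# Phase scores from the "production-logs" example.
example = {
    "Foundation": 93.75,
    "Infrastructure": 74.0,
    "Data Layer": 86.0,
    "Operations": 77.5,
}
print(round(final_score(example, custom), 1))  # 85.3
```

With these weights the same cluster scores slightly higher than 84, because its relatively strong Data Layer phase counts for more. Scores computed under different weightings are not directly comparable, so keep one weighting fixed when tracking trends.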
