Beyond Green, Yellow, Red
Elasticsearch reports cluster status as one of three colors: green, yellow, or red. ElasticDoctor adds a 0-100 health score that combines 22 diagnostic checks across 4 phases, with each phase weighted by its real-world impact on cluster stability and performance.
A single number that summarizes your Elasticsearch cluster's health: that's the promise of ElasticDoctor's health score algorithm. Behind this simple metric is a weighted scoring system that covers foundation checks, infrastructure resources, data layer health, and operational efficiency.
What You'll Learn
Algorithm Fundamentals
- How the 0-100 score is calculated
- Why different checks have different weights
- The 4-phase diagnostic framework
- Score interpretation and thresholds
Practical Application
- Reading your cluster's health score
- Understanding score trends over time
- Prioritizing improvements based on impact
- Customizing weights for your environment
The 4-Phase Diagnostic Framework
Simple English Explanation
Think of ElasticDoctor like a doctor examining a patient. Just as a doctor checks different body systems (heart, lungs, blood, etc.) with different levels of concern, ElasticDoctor examines your cluster in 4 phases, each with different importance levels.
Some problems (like a heart attack) are more serious than others (like a minor cut), so they get more weight in the overall health assessment.
Foundation Phase (Weight: 35%)
These are the "life support" checks. If these fail, your cluster is in serious trouble.
Critical Checks
- Cluster Health: Is the cluster responding?
- Cluster Info: Can we connect and get basic info?
- License Status: Are features available?
- Cluster Settings: Are configurations safe?
Why High Weight?
These issues can cause immediate outages, data loss, or complete cluster failure. It's like checking whether the patient is breathing: nothing else matters if these fail.
Infrastructure Phase (Weight: 30%)
These checks validate that your servers have adequate resources and are configured properly.
Resource Checks
- Node Performance: CPU, memory, disk usage
- Node Info: Hardware specifications
- Node Settings: JVM and OS configuration
- Node Stats: Real-time resource metrics
- Hot Threads: Performance bottlenecks
Why Important?
Resource problems cause performance issues, instability, and eventual failures. It's like checking the patient's vital signs: heart rate, blood pressure, temperature.
Data Layer Phase (Weight: 25%)
These checks ensure your data is properly stored, distributed, and accessible.
Data Checks
- Cat Indices: Index health overview
- Index Settings: Configuration optimization
- Index Stats: Performance metrics
- Cat Shards: Shard distribution
- Allocation Explain: Shard assignment issues
Why Moderate Weight?
Data issues affect performance and availability but usually don't cause immediate total failure. It's like examining specific organs: important, but not immediately life-threatening.
Operations Phase (Weight: 10%)
These are optimization and maintenance checks that improve long-term health and efficiency.
Operational Checks
- Cluster Tasks: Long-running operations
- Pending Tasks: Queue management
- Ingest Pipelines: Data processing
- Snapshots: Backup strategies
- Deprecations: Future-proofing
- ILM Policies: Lifecycle management
- Data Tiers: Storage optimization
- Mappings: Field optimization
Why Lower Weight?
These are optimization opportunities, not critical problems. It's like recommending exercise or vitamins: good for long-term health, but not urgent.
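The four phases, their weights, and their checks can be written down as a small lookup table. The sketch below is illustrative only: the dictionary layout and the snake_case check names are assumptions made for this example, not ElasticDoctor's actual data model. It simply confirms that the weights sum to 100% and that the phase lists add up to 22 checks.

```python
# Hypothetical registry transcribed from the phase descriptions in this article.
# Key names and structure are assumptions, not ElasticDoctor's internals.
PHASES = {
    "foundation": {
        "weight": 0.35,
        "checks": ["cluster_health", "cluster_info", "license_status", "cluster_settings"],
    },
    "infrastructure": {
        "weight": 0.30,
        "checks": ["node_performance", "node_info", "node_settings", "node_stats", "hot_threads"],
    },
    "data_layer": {
        "weight": 0.25,
        "checks": ["cat_indices", "index_settings", "index_stats", "cat_shards", "allocation_explain"],
    },
    "operations": {
        "weight": 0.10,
        "checks": [
            "cluster_tasks", "pending_tasks", "ingest_pipelines", "snapshots",
            "deprecations", "ilm_policies", "data_tiers", "mappings",
        ],
    },
}

# Sanity checks: weights should cover 100% and the check count should match the article.
total_weight = sum(phase["weight"] for phase in PHASES.values())
total_checks = sum(len(phase["checks"]) for phase in PHASES.values())

print(f"total weight: {total_weight:.2f}, total checks: {total_checks}")
```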
Score Calculation Method
The Formula
Final Score = (Foundation × 0.35) + (Infrastructure × 0.30) + (Data Layer × 0.25) + (Operations × 0.10)
Each phase score is the average of the individual check scores within that phase. The weighted sum of the four phase scores then gives the final 0-100 score.
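As a minimal sketch of this formula, assuming each check has already been scored on a 0-100 scale (the names `WEIGHTS`, `phase_score`, and `final_score` are hypothetical and not part of any ElasticDoctor API):

```python
from statistics import mean

# Phase weights as stated in the formula above.
WEIGHTS = {
    "foundation": 0.35,
    "infrastructure": 0.30,
    "data_layer": 0.25,
    "operations": 0.10,
}


def phase_score(check_scores):
    """Average the 0-100 scores of the checks in one phase."""
    return mean(check_scores)


def final_score(phase_scores, weights=WEIGHTS):
    """Weighted sum of phase scores; stays within 0-100 when the weights sum to 1."""
    return sum(phase_scores[name] * weight for name, weight in weights.items())


# Made-up phase scores, purely to exercise the function.
example = {"foundation": 100, "infrastructure": 80, "data_layer": 60, "operations": 40}
print(final_score(example))
```

Because the weights sum to 1, a cluster that scores 100 in every phase scores 100 overall, and a cluster that scores 0 everywhere scores 0.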
🚨 Critical (0-50)
Your cluster needs immediate attention.
- Major issues in Foundation phase
- Risk of data loss or outages
- Immediate action required
⚠️ Warning (51-75)
Performance issues or risks present.
- Infrastructure or data layer problems
- Degraded performance likely
- Plan improvements soon
✅ Healthy (76-100)
Your cluster is performing well.
- All critical systems functioning
- Good performance and stability
- Focus on optimization opportunities
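The three bands above map directly onto a small lookup function. This is a sketch under one assumption that the article does not state explicitly: the score is rounded to a whole number before the band is chosen. The function name `classify` is hypothetical.

```python
def classify(score: float) -> str:
    """Map a 0-100 health score to the Critical / Warning / Healthy bands.

    Assumption: the score is rounded to an integer first. Note that Python's
    round() rounds exact halves to the nearest even integer.
    """
    rounded = round(score)
    if not 0 <= rounded <= 100:
        raise ValueError(f"score out of range: {score}")
    if rounded <= 50:
        return "critical"
    if rounded <= 75:
        return "warning"
    return "healthy"


for sample in (42, 51, 75, 76, 84.26):
    print(sample, classify(sample))
```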
Real-World Example
Sample Cluster: "production-logs"
Let's walk through how ElasticDoctor calculated a score of 84 for a real production cluster.
Phase 1: Foundation (Weight: 35%)
Individual Scores
- Cluster Health: 100 (Green status)
- Cluster Info: 100 (All info available)
- License: 85 (Expires in 45 days)
- Settings: 90 (Minor optimization needed)
Calculation
Foundation = (100 + 100 + 85 + 90) / 4 = 93.75
Phase 2: Infrastructure (Weight: 30%)
Individual Scores
- Node Performance: 75 (High memory usage)
- Node Info: 85 (Adequate hardware)
- Node Settings: 80 (Some tuning needed)
- Node Stats: 70 (CPU pressure)
- Hot Threads: 60 (Some bottlenecks)
Calculation
Infrastructure = (75 + 85 + 80 + 70 + 60) / 5 = 74
Final Score Calculation
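With a Data Layer score of 86 and an Operations score of 77.5, each averaged from its own checks in the same way as the two phases above:
Final Score = (Foundation × 0.35) + (Infrastructure × 0.30) + (Data Layer × 0.25) + (Operations × 0.10)
Final Score = (93.75 × 0.35) + (74 × 0.30) + (86 × 0.25) + (77.5 × 0.10)
Final Score = 32.81 + 22.2 + 21.5 + 7.75
Final Score = 84.26 → 84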
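The arithmetic can be reproduced in a few lines. In this sketch the Foundation and Infrastructure check scores come from the lists above, while the Data Layer (86) and Operations (77.5) values are taken as given phase scores. Variable names are illustrative.

```python
from statistics import mean

WEIGHTS = {"foundation": 0.35, "infrastructure": 0.30, "data_layer": 0.25, "operations": 0.10}

# Per-check scores for the two phases broken down in the example.
foundation_checks = [100, 100, 85, 90]        # health, info, license, settings
infrastructure_checks = [75, 85, 80, 70, 60]  # performance, info, settings, stats, hot threads

phase_scores = {
    "foundation": mean(foundation_checks),
    "infrastructure": mean(infrastructure_checks),
    "data_layer": 86,    # phase score as given in the example
    "operations": 77.5,  # phase score as given in the example
}

# Weighted sum of the four phase scores, then rounded for display.
final = sum(phase_scores[name] * weight for name, weight in WEIGHTS.items())

print(phase_scores)
print(f"final score: {final:.2f} -> {round(final)}")
```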
Understanding Your Cluster's Health
Key Insights
- Weighted Intelligence: Not all issues are equal; critical problems get more attention
- Holistic View: Considers all aspects of cluster health, not just basic status
- Actionable Prioritization: Higher-weighted issues should be addressed first
- Trend Monitoring: Track scores over time to identify patterns
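One possible way to make prioritization concrete is to rank phases by how many points each one costs the final score, that is, weight × (100 − phase score). This ranking rule is an assumption chosen for illustration and is not described as ElasticDoctor's own method. The phase scores are those of the "production-logs" example.

```python
WEIGHTS = {"foundation": 0.35, "infrastructure": 0.30, "data_layer": 0.25, "operations": 0.10}

# Phase scores from the "production-logs" example.
phase_scores = {"foundation": 93.75, "infrastructure": 74, "data_layer": 86, "operations": 77.5}


def points_lost(scores, weights):
    """Points each phase subtracts from a perfect 100, after weighting."""
    return {name: weights[name] * (100 - score) for name, score in scores.items()}


lost = points_lost(phase_scores, WEIGHTS)

# Largest loss first: these phases offer the biggest gain in the final score.
ranked = sorted(lost.items(), key=lambda item: item[1], reverse=True)

for name, points in ranked:
    print(f"{name:15s} {points:5.2f} points lost")
```

In this example Infrastructure accounts for the largest share of the lost points, while Foundation, despite its higher weight, costs the least because its checks are already close to 100.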
Best Practices
- Monitor health scores regularly, not just during incidents
- Focus on Foundation and Infrastructure phases first
- Use score trends to predict and prevent issues
- Customize weights based on your specific requirements
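If you experiment with custom weights, they should still sum to 1 so that the result stays on the 0-100 scale. The sketch below shows one way to guarantee that by normalizing whatever relative weights you supply. How ElasticDoctor actually exposes weight customization is not covered here, so `normalize_weights` and the sample weights are purely hypothetical.

```python
def normalize_weights(raw):
    """Scale non-negative relative weights so they sum to 1."""
    if any(weight < 0 for weight in raw.values()):
        raise ValueError("weights must be non-negative")
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: weight / total for name, weight in raw.items()}


# Hypothetical profile that shifts emphasis from Operations to the Data Layer.
custom = normalize_weights(
    {"foundation": 35, "infrastructure": 30, "data_layer": 30, "operations": 5}
)

# Phase scores from the "production-logs" example, rescored with the custom weights.
phase_scores = {"foundation": 93.75, "infrastructure": 74, "data_layer": 86, "operations": 77.5}
custom_score = sum(phase_scores[name] * weight for name, weight in custom.items())

print(custom)
print(f"custom score: {custom_score:.2f}")
```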