Technical Deep Dive

The Elasticsearch Health Score: How ElasticDoctor Calculates Your Cluster's Fitness

Deep dive into the weighted scoring algorithm behind ElasticDoctor's 0-100 health score and why different checks have different importance levels.

December 10, 2024
5 min read
ElasticDoctor Team

Beyond Green, Yellow, Red

While Elasticsearch gives you basic cluster status colors, ElasticDoctor provides a comprehensive 0-100 health score that considers 22 diagnostic checks across 4 critical phases, each weighted by real-world impact on cluster stability and performance.

A single number that summarizes your Elasticsearch cluster's health: that's the promise of ElasticDoctor's health score. Behind this simple metric lies a weighted scoring system that considers infrastructure stability, performance metrics, data layer health, and operational efficiency.

What You'll Learn

Algorithm Fundamentals

  • How the 0-100 score is calculated
  • Why different checks have different weights
  • The 4-phase diagnostic framework
  • Score interpretation and thresholds

Practical Application

  • Reading your cluster's health score
  • Understanding score trends over time
  • Prioritizing improvements based on impact
  • Customizing weights for your environment

The 4-Phase Diagnostic Framework

Simple English Explanation

Think of ElasticDoctor as a doctor examining a patient. Just as a doctor checks different body systems (heart, lungs, blood, etc.) with different levels of concern, ElasticDoctor examines your cluster in 4 phases, each with a different level of importance.

Some problems (like a heart attack) are more serious than others (like a minor cut), so they get more weight in the overall health assessment.

Phase 1: Foundation (Weight: 35%)

These are the "life support" checks. If these fail, your cluster is in serious trouble.

Critical Checks

  • Cluster Health: Is the cluster responding?
  • Cluster Info: Can we connect and get basic info?
  • License Status: Are features available?
  • Cluster Settings: Are configurations safe?

Why High Weight?

These issues can cause immediate outages, data loss, or complete cluster failure. They're like checking if the patient is breathing - nothing else matters if these fail.

Phase 2: Infrastructure (Weight: 30%)

These checks validate that your servers have adequate resources and are configured properly.

Resource Checks

  • Node Performance: CPU, memory, disk usage
  • Node Info: Hardware specifications
  • Node Settings: JVM and OS configuration
  • Node Stats: Real-time resource metrics
  • Hot Threads: Performance bottlenecks

Why Important?

Resource problems cause performance issues, instability, and eventual failures. It's like checking the patient's vital signs - heart rate, blood pressure, temperature.

Phase 3: Data Layer (Weight: 25%)

These checks ensure your data is properly stored, distributed, and accessible.

Data Checks

  • Cat Indices: Index health overview
  • Index Settings: Configuration optimization
  • Index Stats: Performance metrics
  • Cat Shards: Shard distribution
  • Allocation Explain: Shard assignment issues

Why Moderate Weight?

Data issues affect performance and availability but usually don't cause immediate total failure. Like checking specific organs - important but not immediately life-threatening.

Phase 4: Operations (Weight: 10%)

These are optimization and maintenance checks that improve long-term health and efficiency.

Operational Checks

  • Cluster Tasks: Long-running operations
  • Pending Tasks: Queue management
  • Ingest Pipelines: Data processing
  • Snapshots: Backup strategies
  • Deprecations: Future-proofing
  • ILM Policies: Lifecycle management
  • Data Tiers: Storage optimization
  • Mappings: Field optimization

Why Lower Weight?

These are optimization opportunities, not critical problems. Like recommending exercise or vitamins - good for long-term health but not urgent.
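The four phases described above can be written down as a small data structure. The following is only an illustrative sketch in Python: the `Phase` class and the `FRAMEWORK` name are hypothetical and are not taken from ElasticDoctor itself. The phase names, weights, and check names are copied from the lists above.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Phase:
    """One diagnostic phase: its name, its weight, and the checks it contains."""
    name: str
    weight: float
    checks: Tuple[str, ...]


# Hypothetical representation; names and weights are copied from the article.
FRAMEWORK = (
    Phase("Foundation", 0.35, (
        "Cluster Health", "Cluster Info", "License Status", "Cluster Settings",
    )),
    Phase("Infrastructure", 0.30, (
        "Node Performance", "Node Info", "Node Settings", "Node Stats",
        "Hot Threads",
    )),
    Phase("Data Layer", 0.25, (
        "Cat Indices", "Index Settings", "Index Stats", "Cat Shards",
        "Allocation Explain",
    )),
    Phase("Operations", 0.10, (
        "Cluster Tasks", "Pending Tasks", "Ingest Pipelines", "Snapshots",
        "Deprecations", "ILM Policies", "Data Tiers", "Mappings",
    )),
)

total_checks = sum(len(phase.checks) for phase in FRAMEWORK)
total_weight = sum(phase.weight for phase in FRAMEWORK)

# Sanity checks: 22 checks in total, and the weights cover 100% of the score.
assert total_checks == 22
assert math.isclose(total_weight, 1.0)
```

Writing the framework down this way makes it easy to confirm that the four lists add up to the 22 checks mentioned in the introduction and that the weights sum to 100%.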

Score Calculation Method

The Formula

Final Score = (Foundation × 0.35) + (Infrastructure × 0.30) + (Data Layer × 0.25) + (Operations × 0.10)

Each phase score is the plain average of the individual check scores within that phase. The final 0-100 score is the weighted sum of the four phase scores. Because the weights add up to 1.0, the result always stays within the 0-100 range.
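The formula can be expressed in a few lines of Python. This is a minimal sketch of the arithmetic described here, not ElasticDoctor's actual source code; the function names `phase_score` and `final_score` and the dictionary keys are assumptions made for illustration. The input values are the ones used in the production-logs example later in this article.

```python
from statistics import mean

# Phase weights as stated in the article (key names are illustrative).
PHASE_WEIGHTS = {
    "foundation": 0.35,
    "infrastructure": 0.30,
    "data_layer": 0.25,
    "operations": 0.10,
}


def phase_score(check_scores):
    """Average the 0-100 scores of the checks in one phase."""
    if not check_scores:
        raise ValueError("a phase needs at least one check score")
    return mean(check_scores)


def final_score(phase_scores, weights=PHASE_WEIGHTS):
    """Weighted sum of the phase scores; every weighted phase must be present."""
    return sum(phase_scores[name] * weight for name, weight in weights.items())


scores = {
    "foundation": phase_score([100, 100, 85, 90]),         # 93.75
    "infrastructure": phase_score([75, 85, 80, 70, 60]),   # 74
    "data_layer": 86,     # phase average taken from the article's example
    "operations": 77.5,   # phase average taken from the article's example
}

print(round(final_score(scores), 2))  # 84.26
```

A missing phase raises a `KeyError` here rather than silently counting as zero, which seems a safer default for a health metric.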

🚨 Critical (0-50)

Your cluster needs immediate attention.

  • Major issues in Foundation phase
  • Risk of data loss or outages
  • Immediate action required

⚠️ Warning (51-75)

Performance issues or risks present.

  • Infrastructure or data layer problems
  • Degraded performance likely
  • Plan improvements soon

✅ Healthy (76-100)

Your cluster is performing well.

  • All critical systems functioning
  • Good performance and stability
  • Focus on optimization opportunities
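The three bands above map naturally onto a small lookup function. The sketch below assumes that the score is rounded to the nearest whole number before the band is chosen (the article's example rounds 84.26 to 84); the rounding rule for exact halves and the function name `classify` are assumptions, not documented ElasticDoctor behavior.

```python
import math


def classify(score):
    """Map a 0-100 health score to the status bands listed above."""
    # Assumption: round half up to a whole number before choosing a band.
    rounded = math.floor(score + 0.5)
    if not 0 <= rounded <= 100:
        raise ValueError(f"score out of range: {score}")
    if rounded <= 50:
        return "Critical"
    if rounded <= 75:
        return "Warning"
    return "Healthy"


print(classify(84.26))  # Healthy
print(classify(62))     # Warning
print(classify(40))     # Critical
```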

Real-World Example

Sample Cluster: "production-logs"

Let's walk through how ElasticDoctor calculated a score of 84 for a real production cluster.

Phase 1: Foundation (Weight: 35%)

Individual Scores
  • Cluster Health: 100 (Green status)
  • Cluster Info: 100 (All info available)
  • License Status: 85 (Expires in 45 days)
  • Cluster Settings: 90 (Minor optimization needed)

Calculation
Foundation = (100 + 100 + 85 + 90) / 4 = 93.75

Phase 2: Infrastructure (Weight: 30%)

Individual Scores
  • Node Performance: 75 (High memory usage)
  • Node Info: 85 (Adequate hardware)
  • Node Settings: 80 (Some tuning needed)
  • Node Stats: 70 (CPU pressure)
  • Hot Threads: 60 (Some bottlenecks)

Calculation
Infrastructure = (75 + 85 + 80 + 70 + 60) / 5 = 74

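Final Score Calculation

The Data Layer and Operations phases are scored the same way as the two phases above. Their individual check scores are omitted here; the resulting phase averages are 86 and 77.5.

Final Score = (Foundation × 0.35) + (Infrastructure × 0.30) + (Data Layer × 0.25) + (Operations × 0.10)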

Final Score = (93.75 × 0.35) + (74 × 0.30) + (86 × 0.25) + (77.5 × 0.10)
Final Score = 32.81 + 22.2 + 21.5 + 7.75
Final Score = 84.26 → 84
Score: 84
Status: Healthy - Good performance with some optimization opportunities

Understanding Your Cluster's Health

Key Insights

  • Weighted Intelligence: Not all issues are equal - critical problems get more attention
  • Holistic View: Considers all aspects of cluster health, not just basic status
  • Actionable Prioritization: Higher-weighted issues should be addressed first
  • Trend Monitoring: Track scores over time to identify patterns
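One way to turn the prioritization idea into numbers is to compute how many points each phase currently costs the final score: weight × (100 − phase score). The sketch below applies this to the production-logs example; the helper name `points_lost` and the dictionary keys are illustrative assumptions, and this ranking is an interpretation of the scoring formula rather than a documented ElasticDoctor feature.

```python
PHASE_WEIGHTS = {
    "foundation": 0.35,
    "infrastructure": 0.30,
    "data_layer": 0.25,
    "operations": 0.10,
}

# Phase scores from the production-logs example.
PHASE_SCORES = {
    "foundation": 93.75,
    "infrastructure": 74,
    "data_layer": 86,
    "operations": 77.5,
}


def points_lost(phase_scores, weights):
    """Return (phase, points lost) pairs, largest loss first."""
    losses = [
        (name, weights[name] * (100 - score))
        for name, score in phase_scores.items()
    ]
    return sorted(losses, key=lambda item: item[1], reverse=True)


for name, lost in points_lost(PHASE_SCORES, PHASE_WEIGHTS):
    print(f"{name}: {lost:.2f} points below a perfect score")
```

In this example Infrastructure costs about 7.8 points, more than twice as much as any other phase, even though Foundation carries the highest weight. Weight and current score together determine where an improvement pays off most.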

Best Practices

  • Monitor health scores regularly, not just during incidents
  • Focus on Foundation and Infrastructure phases first
  • Use score trends to predict and prevent issues
  • Customize weights based on your specific requirements
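This article does not describe how ElasticDoctor accepts custom weights, so the following is purely a hypothetical sketch of the underlying idea. Whatever relative importance you assign to the phases, the weights should be rescaled to sum to 1.0 so that the final score stays on the 0-100 scale. The function name `normalize_weights` and the example values are assumptions.

```python
import math


def normalize_weights(raw_weights):
    """Rescale non-negative relative weights so that they sum to 1.0."""
    if any(weight < 0 for weight in raw_weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(raw_weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: weight / total for name, weight in raw_weights.items()}


# Hypothetical example: a team that treats data layer issues as seriously
# as foundation issues, expressed as relative importance values.
custom = normalize_weights({
    "foundation": 35,
    "infrastructure": 30,
    "data_layer": 35,
    "operations": 10,
})

assert math.isclose(sum(custom.values()), 1.0)
```

Scores computed with different weight sets are not directly comparable, so it makes sense to keep one set fixed when tracking trends over time.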