ElasticDoctor - Elasticsearch Health Diagnostics

The Troubleshooter's Best Friend

When shards won't allocate, the allocation explain API is your diagnostic superpower. It reveals exactly why Elasticsearch made each allocation decision, which points you directly at the fix for a distribution problem.

You've seen the red cluster status. You know shards are unassigned. But why? The allocation explain API is Elasticsearch's built-in detective, revealing the exact reasons behind shard allocation decisions. This check transforms cryptic allocation failures into clear, actionable troubleshooting guidance.

What You'll Learn

Core Concepts

• How Elasticsearch makes allocation decisions
• Understanding allocation constraints and filters
• Reading allocation explain responses
• Common allocation failure patterns

Practical Skills

• Diagnosing unassigned shard root causes
• Resolving disk space allocation issues
• Fixing node attribute constraints
• Optimizing allocation performance

Allocation Explain API Deep Dive

GET Request (ES 5.x+, enhanced in 6.x+)

GET /_cluster/allocation/explain

Simple English Explanation

Think of this API as asking Elasticsearch: "Hey, why didn't you put this shard on that node?" Elasticsearch then explains its decision-making process, like a teacher showing their work on a math problem.

It's especially helpful when you have unassigned shards (shards that aren't placed on any node) and need to understand what's blocking the allocation.

📋 What It Reveals

• Allocation decisions: Why shards are placed or rejected
• Constraint violations: What rules prevent allocation
• Node-by-node analysis: Per-node allocation feasibility
• Timing information: When the shard became unassigned

🔧 Usage Modes

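• Automatic: With no request body, explains the first unassigned shard Elasticsearch finds
• Specific shard: Analyze a particular shard by passing index, shard, and primary in the request body
• Primary vs replica: Different allocation rules apply, so set primary to true or false to match the copy you care about
• Include decisions: Add include_yes_decisions=true to see every decider in the node-by-node breakdown, not only the ones that said no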
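
The usage modes above map onto a URL plus an optional JSON body. The following is a minimal Python sketch of that mapping, not part of ElasticDoctor itself. The helper name build_explain_request and the localhost base URL are assumptions made for illustration. The query parameters are the ones used in the curl examples in this article, and the requirement that index, shard, and primary travel together is an assumption worth checking against your Elasticsearch version.

```python
import json
from urllib.parse import urlencode


def build_explain_request(base_url="http://localhost:9200", index=None, shard=None,
                          primary=None, include_yes_decisions=False,
                          include_disk_info=False):
    """Return (url, body) for a GET /_cluster/allocation/explain call.

    body is None in automatic mode, where Elasticsearch picks an
    unassigned shard itself.
    """
    params = {}
    if include_yes_decisions:
        params["include_yes_decisions"] = "true"
    if include_disk_info:
        params["include_disk_info"] = "true"

    url = f"{base_url}/_cluster/allocation/explain"
    if params:
        url += "?" + urlencode(params)

    if index is None and shard is None and primary is None:
        return url, None  # automatic mode: no request body

    # Assumption: a specific-shard request needs all three fields together.
    if index is None or shard is None or primary is None:
        raise ValueError("index, shard and primary must be given together")
    body = json.dumps({"index": index, "shard": shard, "primary": primary})
    return url, body


# Specific-shard mode for a replica, with the full decider breakdown.
url, body = build_explain_request(index="my-index", shard=0, primary=False,
                                  include_yes_decisions=True)
print(url)
print(body)
```

Keeping request construction separate from the HTTP call makes the modes easy to test without a running cluster; the returned pair can be handed to whichever HTTP client you already use.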

Understanding Allocation Decisions

How Elasticsearch Thinks About Allocation

Elasticsearch considers allocation like a careful librarian organizing books. It has rules about:

• Space: Is there enough room on the shelf (disk space)?
• Balance: Are books evenly distributed across shelves (shard balance)?
• Rules: Are there special requirements (allocation filters)?
• Safety: Don't put the original and its copy on the same shelf (same node)

Allocation Decision Process

Can Allocate Check

Basic eligibility: Does the node have enough disk space and meet the basic requirements, such as not already holding a copy of the shard?

Allocation Filters

Specific rules: Are there constraints like "only allocate to hot tier" or "avoid this node"?

Balance and Optimization

Performance considerations: Which node gives the best distribution and performance?
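
The three stages above can be illustrated with a toy model. This is a deliberately simplified sketch and not Elasticsearch's actual implementation: the Node class, the three decider functions, and pick_node are hypothetical names, the real cluster runs many more deciders (and can also answer "throttle"), and the real balancer weighs far more than a shard count.

```python
from dataclasses import dataclass, field

LOW_WATERMARK = 85.0  # percent of disk used; the default low watermark


@dataclass
class Node:
    name: str
    disk_used_percent: float
    attributes: dict = field(default_factory=dict)
    shards: set = field(default_factory=set)  # (index, shard_number) copies held


def disk_threshold(node, shard, required):
    # Stage 1: basic eligibility, is the node below the low watermark?
    ok = node.disk_used_percent <= LOW_WATERMARK
    return ("disk_threshold", "YES" if ok else "NO")


def same_shard(node, shard, required):
    # Stage 1: a node may not hold two copies of the same shard.
    return ("same_shard", "NO" if shard in node.shards else "YES")


def filter_decider(node, shard, required):
    # Stage 2: every required attribute must match exactly (toy rule).
    ok = all(node.attributes.get(k) == v for k, v in required.items())
    return ("filter", "YES" if ok else "NO")


DECIDERS = [disk_threshold, same_shard, filter_decider]


def node_decision(node, shard, required):
    """A single NO from any decider makes the whole node decision 'no'."""
    results = [decider(node, shard, required) for decider in DECIDERS]
    blocked_by = [name for name, decision in results if decision == "NO"]
    return {"node_name": node.name,
            "node_decision": "no" if blocked_by else "yes",
            "blocked_by": blocked_by}


def pick_node(nodes, shard, required):
    """Stage 3: among eligible nodes, prefer the one holding the fewest shards."""
    eligible = [n for n in nodes
                if node_decision(n, shard, required)["node_decision"] == "yes"]
    if not eligible:
        return None  # the shard stays unassigned
    return min(eligible, key=lambda n: len(n.shards)).name
```

The useful takeaway from the model is the veto rule: one "NO" is enough to rule a node out, which is why the explain output lists deciders per node.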

Common Allocation Failures & Solutions

Disk Space Issues

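Most common reason: The node has exceeded a disk usage threshold. By default, Elasticsearch stops allocating new shards to a node above the 85% low watermark, starts relocating shards away from it above the 90% high watermark, and marks its indices read-only at the 95% flood stage.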

Quick Fix:

• Delete old indices or increase disk capacity
• Adjust watermark settings temporarily
• Move shards to nodes with more space
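
As a rough sketch of the "adjust watermark settings temporarily" fix, the helpers below classify a disk usage figure against the default watermarks and build request bodies for PUT /_cluster/settings. The function names and the raised percentages (90/95/97) are illustrative assumptions, not recommendations; the three setting keys are the standard disk watermark settings. Setting a key to null restores its default, which is how the temporary change gets undone.

```python
import json

WATERMARK_KEYS = {
    "low": "cluster.routing.allocation.disk.watermark.low",
    "high": "cluster.routing.allocation.disk.watermark.high",
    "flood_stage": "cluster.routing.allocation.disk.watermark.flood_stage",
}


def watermark_status(used_percent, low=85.0, high=90.0, flood_stage=95.0):
    """Name the highest watermark that a disk usage percentage has reached."""
    if used_percent >= flood_stage:
        return "flood_stage"
    if used_percent >= high:
        return "high"
    if used_percent >= low:
        return "low"
    return "ok"


def raise_watermarks_body(low="90%", high="95%", flood_stage="97%"):
    """Body for PUT /_cluster/settings that loosens the watermarks."""
    values = {"low": low, "high": high, "flood_stage": flood_stage}
    return {"persistent": {WATERMARK_KEYS[k]: v for k, v in values.items()}}


def reset_watermarks_body():
    """Body for PUT /_cluster/settings that restores the defaults (null values)."""
    return {"persistent": {key: None for key in WATERMARK_KEYS.values()}}


print(watermark_status(87.5))
print(json.dumps(raise_watermarks_body(), indent=2))
print(json.dumps(reset_watermarks_body(), indent=2))
```

Raising watermarks only buys time. Pair it with a reminder to send the reset body once space has been freed.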

Same Shard Same Node

Two copies of the same shard (a primary and its replica, or two replicas) can't live on the same node, because losing that node would lose both copies.

Quick Fix:

• Add more nodes to the cluster
• Reduce replica count if redundancy allows
• Check if node filters are too restrictive
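
The arithmetic behind this failure is simple enough to sketch: with N data nodes, at most N - 1 replicas of a shard can be assigned, because each copy needs its own node. The helper names below are hypothetical; the settings body is the shape accepted by PUT /<index>/_settings for changing number_of_replicas.

```python
def max_assignable_replicas(data_node_count):
    """Each shard copy needs its own node, and the primary takes one."""
    return max(data_node_count - 1, 0)


def unassignable_replicas(configured_replicas, data_node_count):
    """How many replicas per shard can never be placed on this cluster."""
    return max(configured_replicas - max_assignable_replicas(data_node_count), 0)


def replica_settings_body(replicas):
    """Body for PUT /<index>/_settings that changes the replica count."""
    return {"index": {"number_of_replicas": replicas}}


# A 2-node cluster with 2 replicas configured: one replica per shard is stuck.
stuck = unassignable_replicas(configured_replicas=2, data_node_count=2)
print(stuck)
print(replica_settings_body(max_assignable_replicas(2)))
```

If the stuck count is above zero, either add that many nodes or lower the replica count to the value max_assignable_replicas returns.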

Allocation Filtering

Custom rules prevent allocation (e.g., "only hot nodes" but no hot nodes available).

Quick Fix:

• Review and adjust allocation filters
• Add nodes with required attributes
• Temporarily relax filtering rules
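
To check whether a filter can be satisfied at all, compare it with the attributes your nodes actually advertise. The sketch below assumes a custom node attribute called box_type and hypothetical node names; it only handles exact matches, whereas real filters also accept comma-separated values and wildcards. The clearing body uses the index.routing.allocation.require.<attribute> setting, where a null value removes the filter.

```python
def nodes_satisfying(require, node_attributes):
    """Names of nodes whose attributes match every required key exactly."""
    return sorted(name for name, attrs in node_attributes.items()
                  if all(attrs.get(key) == value for key, value in require.items()))


def clear_require_filter_body(attribute):
    """Body for PUT /<index>/_settings that removes a 'require' filter."""
    return {f"index.routing.allocation.require.{attribute}": None}


# Hypothetical cluster: the index requires hot nodes, but none exist.
node_attributes = {
    "elasticsearch-node-1": {"box_type": "warm"},
    "elasticsearch-node-2": {"box_type": "warm"},
}
require = {"box_type": "hot"}

matches = nodes_satisfying(require, node_attributes)
if not matches:
    print("No node satisfies", require)
    print("To relax the rule, send:", clear_require_filter_body("box_type"))
```

An empty match list means the shard can never be allocated under the current rule, so the choice is between adding a matching node and relaxing the filter.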

Practical Troubleshooting Examples

1. Basic Allocation Explain

# Get explanation for first unassigned shard
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"

# Response shows why shard can't be allocated
{
  "index": "my-index",
  "shard": 0,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "CLUSTER_RECOVERED",
    "at": "2024-12-15T10:30:00.000Z"
  },
  "can_allocate": "no",
  "allocate_explanation": "cannot allocate because all nodes are above the disk watermark"
}
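
A response like the one above can be condensed into a one-line summary for logs or alerts. This is a small sketch that parses the same sample JSON; the summarize helper is a hypothetical name, and the field names are taken from the sample response rather than from an exhaustive reading of the API.

```python
import json

SAMPLE_RESPONSE = """
{
  "index": "my-index",
  "shard": 0,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "CLUSTER_RECOVERED",
    "at": "2024-12-15T10:30:00.000Z"
  },
  "can_allocate": "no",
  "allocate_explanation": "cannot allocate because all nodes are above the disk watermark"
}
"""


def summarize(explanation):
    """Condense an allocation explain response into a single line."""
    copy = "primary" if explanation["primary"] else "replica"
    info = explanation.get("unassigned_info", {})
    return (f'{explanation["index"]}[{explanation["shard"]}] {copy} is '
            f'{explanation["current_state"]} '
            f'(reason: {info.get("reason", "unknown")}, '
            f'can_allocate: {explanation.get("can_allocate", "n/a")})')


explanation = json.loads(SAMPLE_RESPONSE)
print(summarize(explanation))
print(explanation["allocate_explanation"])
```

The .get calls keep the summary from failing when a field is absent, for example when the shard being explained is already assigned.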

Translation:

"I can't place this shard anywhere because all your nodes are running out of disk space. You need to either free up space or add more nodes with available storage."

2. Detailed Node-by-Node Analysis

# Get detailed explanation with node decisions
curl -X GET "localhost:9200/_cluster/allocation/explain?include_disk_info=true&include_yes_decisions=true&pretty"

# Response includes per-node analysis
{
  "index": "logs-2024",
  "shard": 1,
  "primary": false,
  "current_state": "unassigned",
  "node_allocation_decisions": [
    {
      "node_id": "node-1",
      "node_name": "elasticsearch-node-1",
      "node_decision": "no",
      "deciders": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node has exceeded the low disk watermark [85%]"
        }
      ]
    },
    {
      "node_id": "node-2", 
      "node_name": "elasticsearch-node-2",
      "node_decision": "no",
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO", 
          "explanation": "the shard cannot be allocated to the same node on which a copy of the shard already exists"
        }
      ]
    }
  ]
}
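
With many nodes, the per-node list gets long, so it helps to reduce it to "which decider blocked which node". The sketch below does that for a trimmed copy of the sample response above; blocking_deciders and decider_counts are hypothetical helper names.

```python
from collections import Counter

SAMPLE = {
    "index": "logs-2024",
    "shard": 1,
    "primary": False,
    "current_state": "unassigned",
    "node_allocation_decisions": [
        {"node_name": "elasticsearch-node-1", "node_decision": "no",
         "deciders": [{"decider": "disk_threshold", "decision": "NO"}]},
        {"node_name": "elasticsearch-node-2", "node_decision": "no",
         "deciders": [{"decider": "same_shard", "decision": "NO"}]},
    ],
}


def blocking_deciders(explanation):
    """Map each node name to the deciders that answered NO for it."""
    blockers = {}
    for node in explanation.get("node_allocation_decisions", []):
        nos = [d["decider"] for d in node.get("deciders", [])
               if d["decision"] == "NO"]
        if nos:
            blockers[node["node_name"]] = nos
    return blockers


def decider_counts(explanation):
    """Count how many nodes each decider blocked, most frequent first."""
    counts = Counter(name for names in blocking_deciders(explanation).values()
                     for name in names)
    return counts.most_common()


for node_name, deciders in blocking_deciders(SAMPLE).items():
    print(f"{node_name}: blocked by {', '.join(deciders)}")
```

When one decider dominates the counts, fix that cause first: a cluster where every node fails disk_threshold needs space, not more filter tuning.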

Translation:

"Node-1 is above the 85% low disk watermark, and Node-2 already has a copy of this shard. I need either more free disk space on Node-1 or a third node to place this replica."

Allocation Best Practices & Prevention

✅ Proactive Prevention

• Monitor disk usage and set alerts at 75%
• Maintain at least 3 nodes for replica allocation
• Use allocation awareness for rack/zone distribution
• Implement automated disk cleanup policies
• Test allocation filters before applying
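
The 75% disk alert can be sketched against the JSON form of the cat allocation API (GET /_cat/allocation?format=json). The rows below are made-up sample data, and the field names node and disk.percent are given as I recall them from that API, so verify them against your version. The UNASSIGNED row, which carries no disk figures, is skipped.

```python
def nodes_over_threshold(rows, threshold=75):
    """Return (node, percent) pairs at or above the threshold, fullest first."""
    over = []
    for row in rows:
        percent = row.get("disk.percent")
        if percent is None:
            continue  # e.g. the UNASSIGNED row has no disk figures
        percent = int(percent)  # the cat API returns numbers as strings
        if percent >= threshold:
            over.append((row["node"], percent))
    return sorted(over, key=lambda pair: pair[1], reverse=True)


# Made-up rows in the shape returned by the cat allocation API.
rows = [
    {"node": "elasticsearch-node-1", "disk.percent": "87"},
    {"node": "elasticsearch-node-2", "disk.percent": "42"},
    {"node": "elasticsearch-node-3", "disk.percent": "76"},
    {"node": "UNASSIGNED", "disk.percent": None},
]

for node, percent in nodes_over_threshold(rows):
    print(f"ALERT: {node} is at {percent}% disk usage")
```

Alerting at 75% leaves a ten-point margin before the default low watermark starts blocking new shards.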

💡 Troubleshooting Tips

• Always start with allocation explain for unassigned shards
• Check cluster health before making changes
• Use include_disk_info for space-related issues
• Monitor allocation during cluster changes
• Document allocation filter decisions

❌ Common Mistakes

• Ignoring disk watermark warnings
• Setting overly restrictive allocation filters
• Not having enough nodes for replica placement
• Making allocation changes without understanding impact
• Forgetting to monitor after "quick fixes"

⚠️ Emergency Procedures

• For red status: Address primary shard issues first
• Use reroute API for manual allocation if needed
• As a last resort, consider the reroute command allocate_empty_primary, which accepts losing all data in that shard
• Document all emergency allocation decisions
• Plan proper recovery after emergency fixes
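
For the reroute-based procedures above, the request shapes look roughly like this. The helper names and the node name node-3 are hypothetical. The sketch refuses to build the destructive command unless the caller explicitly confirms data loss, mirroring the accept_data_loss flag that the reroute API itself requires for allocate_empty_primary.

```python
import json


def retry_failed_url(base_url="http://localhost:9200"):
    """URL for POST /_cluster/reroute that retries previously failed allocations."""
    return f"{base_url}/_cluster/reroute?retry_failed=true"


def allocate_empty_primary_body(index, shard, node, accept_data_loss=False):
    """Body for POST /_cluster/reroute that creates an EMPTY primary.

    Any data previously held by this shard is gone for good.
    """
    if not accept_data_loss:
        raise ValueError("refusing to build the command without accept_data_loss=True")
    return {
        "commands": [
            {
                "allocate_empty_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "accept_data_loss": True,
                }
            }
        ]
    }


# Try the harmless option first, then the destructive one only if it is truly needed.
print(retry_failed_url())
print(json.dumps(allocate_empty_primary_body("my-index", 0, "node-3",
                                             accept_data_loss=True), indent=2))
```

Retrying failed allocations is safe to run at any time, so it belongs before any command that discards data.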

Integration with ElasticDoctor Health Checks

Allocation Explain in the Diagnostic Flow

The allocation explain check is triggered automatically when ElasticDoctor detects unassigned shards during the cluster health check. It provides the detailed "why" behind allocation failures.

Triggers This Check

• Cluster Health: Detects unassigned shards
• Cat Shards: Identifies specific unassigned shards
• Manual Request: Troubleshooting specific allocation issues
• Scheduled Diagnostics: Regular allocation health validation
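
The trigger logic might look something like the sketch below. ElasticDoctor's real implementation is not shown in this article, so the function name and the flow are assumptions for illustration only. The inputs are a cluster health response (its unassigned_shards field) and rows from GET /_cat/shards?format=json; the output is one explain request body per unassigned shard, primaries first.

```python
def explain_bodies(health, shard_rows):
    """Build allocation explain request bodies for every unassigned shard.

    Returns an empty list when cluster health reports nothing unassigned.
    """
    if health.get("unassigned_shards", 0) == 0:
        return []
    bodies = []
    for row in shard_rows:
        if row.get("state") == "UNASSIGNED":
            bodies.append({
                "index": row["index"],
                "shard": int(row["shard"]),      # cat output uses strings
                "primary": row["prirep"] == "p",  # "p" = primary, "r" = replica
            })
    # Unassigned primaries cause red status, so explain them first.
    bodies.sort(key=lambda body: not body["primary"])
    return bodies


# Made-up inputs in the shape of the two API responses.
health = {"status": "red", "unassigned_shards": 2}
shard_rows = [
    {"index": "logs-2024", "shard": "1", "prirep": "r", "state": "UNASSIGNED"},
    {"index": "my-index", "shard": "0", "prirep": "p", "state": "UNASSIGNED"},
    {"index": "my-index", "shard": "1", "prirep": "p", "state": "STARTED"},
]

for body in explain_bodies(health, shard_rows):
    print(body)
```

Explaining each shard individually avoids relying on automatic mode, which reports on only one unassigned shard per call.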

Informs Other Checks

• Node Stats: Validates disk space concerns
• Cluster Settings: Reviews allocation-related settings
• Index Settings: Checks index-specific allocation rules
• Recommendations: Provides specific remediation actions

Mastering Allocation Troubleshooting

Key Takeaways

• Diagnostic Power: Allocation explain reveals exact reasons for shard placement decisions
• Prevention Focus: Monitor disk usage and allocation constraints proactively
• Systematic Approach: Use structured troubleshooting for allocation issues
• Documentation: Record allocation decisions for future reference

Next Steps

• Set up automated monitoring for unassigned shards
• Create allocation explain automation for red clusters
• Document your allocation filtering strategies
• Practice allocation troubleshooting in test environments


Allocation Explain Check: The Ultimate Guide to Debugging Shard Assignment Issues
