The Troubleshooter's Best Friend
When shards won't allocate, the allocation explain API is your diagnostic superpower. It reveals exactly why Elasticsearch made allocation decisions and provides actionable solutions to fix distribution problems.
You've seen the red or yellow cluster status. You know shards are unassigned. But why? The allocation explain API is Elasticsearch's built-in detective: it reports the exact reasons behind each shard allocation decision. This check turns cryptic allocation failures into clear, actionable troubleshooting guidance.
What You'll Learn
Core Concepts
- How Elasticsearch makes allocation decisions
- Understanding allocation constraints and filters
- Reading allocation explain responses
- Common allocation failure patterns
Practical Skills
- Diagnosing unassigned shard root causes
- Resolving disk space allocation issues
- Fixing node attribute constraints
- Optimizing allocation performance
Allocation Explain API Deep Dive
GET /_cluster/allocation/explain
Simple English Explanation
Think of this API as asking Elasticsearch: "Hey, why didn't you put this shard on that node?" Elasticsearch then explains its decision-making process, like a teacher showing their work on a math problem.
It's especially helpful when you have unassigned shards (shards that aren't placed on any node) and need to understand what's blocking the allocation.
📋 What It Reveals
- Allocation decisions: Why shards are placed or rejected
- Constraint violations: What rules prevent allocation
- Node-by-node analysis: Per-node allocation feasibility
- Timing information: When the shard became unassigned (unassigned_info.at)
🔧 Usage Modes
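- Automatic: With no request body, explains the first unassigned shard it finds
- Specific shard: Pass index, shard, and primary in the request body to analyze a particular shard
- Primary vs replica: Set primary to true or false; primaries and replicas follow different allocation rules
- Include decisions: Add include_yes_decisions=true for the full node-by-node breakdown, including YES decisions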
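The usage modes above map onto the request body and query parameters. The following is a minimal sketch, not an official client: `build_explain_request` is a hypothetical helper, while the body fields (`index`, `shard`, `primary`) and the query parameters (`include_yes_decisions`, `include_disk_info`) are the API's own names.

```python
from typing import Optional


def build_explain_request(
    index: Optional[str] = None,
    shard: Optional[int] = None,
    primary: Optional[bool] = None,
    include_yes_decisions: bool = False,
    include_disk_info: bool = False,
) -> dict:
    """Return the path, query parameters, and JSON body for an explain call."""
    params = {}
    if include_yes_decisions:
        params["include_yes_decisions"] = "true"
    if include_disk_info:
        params["include_disk_info"] = "true"

    # A specific-shard request needs all three fields; none means "automatic" mode.
    given = [index is not None, shard is not None, primary is not None]
    if any(given) and not all(given):
        raise ValueError("index, shard, and primary must be given together")

    body = {"index": index, "shard": shard, "primary": primary} if all(given) else None
    return {"path": "/_cluster/allocation/explain", "params": params, "body": body}


automatic = build_explain_request()
specific = build_explain_request("my-index", 0, primary=False, include_yes_decisions=True)
print(automatic)
print(specific)
```

The returned dictionary can be handed to whichever HTTP client you already use; sending the request is left out so the sketch stays independent of any particular library.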
Understanding Allocation Decisions
How Elasticsearch Thinks About Allocation
Elasticsearch considers allocation like a careful librarian organizing books. It has rules about:
- Space: Is there enough room on the shelf (disk space)?
- Balance: Are books evenly distributed across shelves (shard balance)?
- Rules: Are there special requirements (allocation filters)?
- Safety: Don't put original and copy on same shelf (same node)
Allocation Decision Process
Can Allocate Check
Basic eligibility: Does the node have enough disk space and meet basic requirements, such as not already holding a copy of the shard?
Allocation Filters
Specific rules: Are there constraints like "only allocate to hot tier" or "avoid this node"?
Balance and Optimization
Performance considerations: Which node gives the best distribution and performance?
Common Allocation Failures & Solutions
Disk Space Issues
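Most common reason: The node has exceeded a disk usage threshold. By default, Elasticsearch stops allocating new shards to a node at the 85% low watermark, starts relocating shards away from it at the 90% high watermark, and marks affected indices read-only at the 95% flood-stage watermark.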
Quick Fix:
- Delete old indices or increase disk capacity
- Adjust watermark settings temporarily
- Move shards to nodes with more space
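To make the thresholds concrete, here is a small sketch that classifies a node's disk usage against the default watermarks and builds a settings body for a temporary adjustment. The helper names are hypothetical; the setting keys under `cluster.routing.allocation.disk.watermark.*` and the default percentages are Elasticsearch's own.

```python
# Default percentage values of cluster.routing.allocation.disk.watermark.*
DEFAULT_WATERMARKS = {"low": 85.0, "high": 90.0, "flood_stage": 95.0}


def watermark_status(used_percent: float, watermarks: dict = DEFAULT_WATERMARKS) -> str:
    """Name the highest watermark that the given disk usage has exceeded."""
    for name in ("flood_stage", "high", "low"):
        if used_percent > watermarks[name]:
            return name
    return "ok"


def watermark_settings(low: str, high: str) -> dict:
    """Body for PUT /_cluster/settings that overrides the low and high watermarks."""
    prefix = "cluster.routing.allocation.disk.watermark."
    return {"persistent": {prefix + "low": low, prefix + "high": high}}


for usage in (72.0, 87.5, 92.0, 97.0):
    print(usage, watermark_status(usage))
print(watermark_settings("88%", "92%"))
```

Raising the watermarks only buys time. Once space is freed, setting the same keys to null in a follow-up request restores the defaults.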
Same Shard Same Node
Primary and replica can't be on the same node for data safety.
Quick Fix:
- Add more nodes to the cluster
- Reduce replica count if redundancy allows
- Check if node filters are too restrictive
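The arithmetic behind this rule is simple: every copy of a shard (one primary plus its replicas) needs its own node. A rough sketch, using a hypothetical helper and ignoring filters, awareness, and disk limits, which can reduce the usable node count further:

```python
def unassignable_replicas(data_nodes: int, number_of_replicas: int) -> int:
    """Replicas per shard that can never be assigned because each copy needs its own node."""
    copies_needed = 1 + number_of_replicas  # the primary plus every replica
    return max(0, copies_needed - data_nodes)


print(unassignable_replicas(data_nodes=2, number_of_replicas=2))
print(unassignable_replicas(data_nodes=3, number_of_replicas=2))
```

A non-zero result means the cluster stays yellow until you add nodes or lower the replica count.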
Allocation Filtering
Custom rules prevent allocation (e.g., "only hot nodes" but no hot nodes available).
Quick Fix:
- Review and adjust allocation filters
- Add nodes with required attributes
- Temporarily relax filtering rules
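A quick way to reason about a filter is to compare it against each node's attributes by hand. The sketch below is a deliberately simplified model, not Elasticsearch's matching logic: it assumes one exact value per setting (no wildcards, no comma-separated lists, no built-in attributes such as `_name`), and `box_type` is a hypothetical custom node attribute.

```python
PREFIX = "index.routing.allocation."


def blocking_filters(node_attrs: dict, index_settings: dict) -> list:
    """Return the index-level filter settings that reject a node (simplified model)."""
    blocked, include_keys, include_matched = [], [], False
    for key, wanted in index_settings.items():
        if not key.startswith(PREFIX):
            continue
        kind, _, attr = key[len(PREFIX):].partition(".")
        actual = node_attrs.get(attr)
        if kind == "require" and actual != wanted:
            blocked.append(key)       # node lacks a required attribute value
        elif kind == "exclude" and actual == wanted:
            blocked.append(key)       # node carries an excluded attribute value
        elif kind == "include":
            include_keys.append(key)  # include passes if any one include setting matches
            include_matched = include_matched or actual == wanted
    if include_keys and not include_matched:
        blocked.extend(include_keys)
    return blocked


settings = {"index.routing.allocation.require.box_type": "hot"}
print(blocking_filters({"box_type": "warm"}, settings))
print(blocking_filters({"box_type": "hot"}, settings))
```

If every node returns a non-empty list, the filter can never be satisfied, which is the "only hot nodes, but no hot nodes available" case described above.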
Practical Troubleshooting Examples
1. Basic Allocation Explain
    # Get an explanation for the first unassigned shard
    curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"

    # The response shows why the shard can't be allocated
    {
      "index": "my-index",
      "shard": 0,
      "primary": true,
      "current_state": "unassigned",
      "unassigned_info": {
        "reason": "CLUSTER_RECOVERED",
        "at": "2024-12-15T10:30:00.000Z"
      },
      "can_allocate": "no",
      "allocate_explanation": "cannot allocate because all nodes are above the disk watermark"
    }
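When you check many clusters, it helps to reduce a response like the one above to a single line. This is a minimal sketch: `summarize` is a hypothetical helper, and the sample data repeats the fields from the response shown above.

```python
import json

RESPONSE = json.loads("""
{
  "index": "my-index",
  "shard": 0,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {"reason": "CLUSTER_RECOVERED", "at": "2024-12-15T10:30:00.000Z"},
  "can_allocate": "no"
}
""")


def summarize(explain: dict) -> str:
    """Condense the top-level fields of an allocation explain response."""
    kind = "primary" if explain["primary"] else "replica"
    info = explain.get("unassigned_info", {})
    return (
        f"{explain['index']}[{explain['shard']}] ({kind}) is {explain['current_state']}: "
        f"reason={info.get('reason', 'n/a')}, can_allocate={explain.get('can_allocate', 'n/a')}"
    )


print(summarize(RESPONSE))
```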
Translation:
"I can't place this shard anywhere because all your nodes are running out of disk space. You need to either free up space or add more nodes with available storage."
2. Detailed Node-by-Node Analysis
    # Get a detailed explanation with node decisions
    curl -X GET "localhost:9200/_cluster/allocation/explain?include_disk_info=true&include_yes_decisions=true&pretty"

    # The response includes a per-node analysis
    {
      "index": "logs-2024",
      "shard": 1,
      "primary": false,
      "current_state": "unassigned",
      "node_allocation_decisions": [
        {
          "node_id": "node-1",
          "node_name": "elasticsearch-node-1",
          "node_decision": "no",
          "deciders": [
            {
              "decider": "disk_threshold",
              "decision": "NO",
              "explanation": "the node has exceeded the low disk watermark [85%]"
            }
          ]
        },
        {
          "node_id": "node-2",
          "node_name": "elasticsearch-node-2",
          "node_decision": "no",
          "deciders": [
            {
              "decider": "same_shard",
              "decision": "NO",
              "explanation": "the shard cannot be allocated to the same node on which a copy of the shard already exists"
            }
          ]
        }
      ]
    }
Translation:
"Node-1 has crossed the 85% low disk watermark, so it accepts no new shards, and Node-2 already holds a copy of this shard. I need either free disk space on Node-1 or a third node to place this replica."
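On a larger cluster the per-node list gets long, so it is useful to pull out only the deciders that said NO. The sketch below assumes the response shape shown above; `blocking_deciders` is a hypothetical helper, and the sample is trimmed to the fields it reads.

```python
EXPLAIN = {
    "index": "logs-2024",
    "shard": 1,
    "primary": False,
    "node_allocation_decisions": [
        {
            "node_name": "elasticsearch-node-1",
            "node_decision": "no",
            "deciders": [{"decider": "disk_threshold", "decision": "NO"}],
        },
        {
            "node_name": "elasticsearch-node-2",
            "node_decision": "no",
            "deciders": [{"decider": "same_shard", "decision": "NO"}],
        },
    ],
}


def blocking_deciders(explain: dict) -> dict:
    """Map each node name to the deciders that rejected the shard on that node."""
    blockers = {}
    for node in explain.get("node_allocation_decisions", []):
        names = [d["decider"] for d in node.get("deciders", []) if d["decision"] == "NO"]
        if names:
            blockers[node["node_name"]] = names
    return blockers


for node_name, deciders in blocking_deciders(EXPLAIN).items():
    print(f"{node_name}: blocked by {', '.join(deciders)}")
```

Filtering on NO keeps the output useful even when include_yes_decisions=true adds every passing decider to the response.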
Allocation Best Practices & Prevention
✅ Proactive Prevention
- Monitor disk usage and set alerts at 75%
- Maintain at least 3 nodes for replica allocation
- Use allocation awareness for rack/zone distribution
- Implement automated disk cleanup policies
- Test allocation filters before applying
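The 75% disk alert can be sketched as a filter over the rows returned by `GET /_cat/allocation?format=json`. This assumes each row carries `node` and `disk.percent` fields, with the percentage as a string and a null value on the row that counts unassigned shards; `nodes_over_threshold` and the sample rows are hypothetical.

```python
def nodes_over_threshold(rows: list, threshold: int = 75) -> list:
    """Return (node, disk percent) pairs at or above the alert threshold, fullest first."""
    flagged = []
    for row in rows:
        percent = row.get("disk.percent")
        if percent is None:  # the UNASSIGNED row carries no disk figures
            continue
        if int(percent) >= threshold:
            flagged.append((row["node"], int(percent)))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


sample_rows = [
    {"node": "elasticsearch-node-1", "disk.percent": "86"},
    {"node": "elasticsearch-node-2", "disk.percent": "61"},
    {"node": "UNASSIGNED", "disk.percent": None},
]
print(nodes_over_threshold(sample_rows))
```

Alerting at 75% leaves a ten-point margin before the 85% low watermark starts blocking new allocations.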
💡 Troubleshooting Tips
- Always start with allocation explain for unassigned shards
- Check cluster health before making changes
- Use include_disk_info for space-related issues
- Monitor allocation during cluster changes
- Document allocation filter decisions
❌ Common Mistakes
- Ignoring disk watermark warnings
- Setting overly restrictive allocation filters
- Not having enough nodes for replica placement
- Making allocation changes without understanding impact
- Forgetting to monitor after "quick fixes"
⚠️ Emergency Procedures
- For red status: Address primary shard issues first
- Use the cluster reroute API (POST /_cluster/reroute) for manual allocation if needed
- Consider the allocate_empty_primary command only as a last resort: it requires accept_data_loss set to true and discards the shard's existing data
- Document all emergency allocation decisions
- Plan proper recovery after emergency fixes
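For reference, this is roughly what the body of an emergency reroute request looks like. The `allocate_empty_primary` command and its `index`, `shard`, `node`, and `accept_data_loss` fields belong to the cluster reroute API; the helper name and the explicit opt-in guard are assumptions added here to make accidental use harder.

```python
def empty_primary_command(index: str, shard: int, node: str, accept_data_loss: bool = False) -> dict:
    """Build the body for POST /_cluster/reroute that forces an empty primary."""
    if not accept_data_loss:
        # Refuse to build the command unless the caller opts in explicitly.
        raise ValueError("allocate_empty_primary discards shard data; pass accept_data_loss=True")
    return {
        "commands": [
            {
                "allocate_empty_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "accept_data_loss": True,
                }
            }
        ]
    }


body = empty_primary_command("my-index", 0, "elasticsearch-node-1", accept_data_loss=True)
print(body)
```

Treat this as a last resort: the shard comes back empty, so any documents it held must be restored from a snapshot or reindexed from the source.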
Integration with ElasticDoctor Health Checks
Allocation Explain in the Diagnostic Flow
The allocation explain check is triggered automatically when ElasticDoctor detects unassigned shards during the cluster health check. It provides the detailed "why" behind allocation failures.
Triggers This Check
- Cluster Health: Detects unassigned shards
- Cat Shards: Identifies specific unassigned shards
- Manual Request: Troubleshooting specific allocation issues
- Scheduled Diagnostics: Regular allocation health validation
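The hand-off from Cat Shards to allocation explain can be sketched as follows. This is an illustration only, not ElasticDoctor's actual implementation: it assumes rows from `GET /_cat/shards?format=json` with `index`, `shard`, `prirep`, and `state` fields, and `explain_targets` is a hypothetical helper.

```python
def explain_targets(cat_shards_rows: list) -> list:
    """Turn unassigned _cat/shards rows into allocation explain request bodies, primaries first."""
    targets = [
        {"index": row["index"], "shard": int(row["shard"]), "primary": row["prirep"] == "p"}
        for row in cat_shards_rows
        if row["state"] == "UNASSIGNED"
    ]
    # sorted() is stable, so shards keep their original order within each group.
    return sorted(targets, key=lambda target: not target["primary"])


sample_rows = [
    {"index": "logs-2024", "shard": "1", "prirep": "r", "state": "UNASSIGNED"},
    {"index": "my-index", "shard": "0", "prirep": "p", "state": "UNASSIGNED"},
    {"index": "my-index", "shard": "0", "prirep": "r", "state": "STARTED"},
]
for request_body in explain_targets(sample_rows):
    print(request_body)
```

Putting primaries first matches the emergency guidance above: unassigned primaries are what turn a cluster red.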
Informs Other Checks
- Node Stats: Validates disk space concerns
- Cluster Settings: Reviews allocation-related settings
- Index Settings: Checks index-specific allocation rules
- Recommendations: Provides specific remediation actions
Mastering Allocation Troubleshooting
Key Takeaways
- Diagnostic Power: Allocation explain reveals exact reasons for shard placement decisions
- Prevention Focus: Monitor disk usage and allocation constraints proactively
- Systematic Approach: Use structured troubleshooting for allocation issues
- Documentation: Record allocation decisions for future reference
Next Steps
- Set up automated monitoring for unassigned shards
- Create allocation explain automation for red clusters
- Document your allocation filtering strategies
- Practice allocation troubleshooting in test environments