Data Protection is Not Optional
Elasticsearch snapshots are your safety net against data loss, corruption, and disasters. A well-designed snapshot strategy can mean the difference between a quick recovery and permanent data loss. This check validates your backup policies and identifies gaps in your disaster recovery preparation.
The snapshots check evaluates your backup strategy end to end, from repository configuration to snapshot frequency, retention policies, and recovery procedures. It confirms you are prepared for both planned maintenance and unexpected disasters.
Snapshot APIs and Monitoring
GET /_snapshot - List repositories
GET /_snapshot/_all - Repository details
GET /_snapshot/repo/_all - All snapshots
GET /_snapshot/_status - Active snapshots
✅ What This Check Validates
- • Repository configuration and health
- • Snapshot frequency and consistency
- • Retention policies and cleanup
- • Recovery readiness and testing
- • Cross-region backup strategies
- • Automation and monitoring setup
🔧 Repository Types
- • S3: AWS S3 bucket storage
- • GCS: Google Cloud Storage
- • Azure: Azure Blob Storage
- • HDFS: Hadoop Distributed File System
- • Shared file system: NFS/SMB mounts
Critical Snapshot Validations
1. Repository Configuration
{
  "my-s3-repo": {
    "type": "s3",
    "settings": {
      "bucket": "my-es-backups",
      "region": "us-west-2",
      "base_path": "elasticsearch/snapshots",
      "compress": true,
      "server_side_encryption": true
    }
  }
}
Health Checks
- • Repository accessibility and permissions
- • Storage bucket/path availability
- • Compression and encryption settings
- • Cross-region replication setup
ElasticDoctor Thresholds
- • Critical: No repositories configured
- • Critical: Repository inaccessible
- • Warning: Single repository only
- • Warning: No encryption enabled
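The thresholds above can be sketched as a small function over the body returned by GET /_snapshot. This is an illustration only, not ElasticDoctor's actual implementation: the function name `evaluate_repositories`, the `inaccessible` parameter (repository names that failed verification, collected separately), and the choice to check encryption only on S3 repositories are all assumptions.

```python
from typing import Any, Dict, FrozenSet, List, Tuple


def evaluate_repositories(
    repositories: Dict[str, Dict[str, Any]],
    inaccessible: FrozenSet[str] = frozenset(),
) -> List[Tuple[str, str]]:
    """Map a GET /_snapshot response body to (severity, message) findings."""
    if not repositories:
        return [("critical", "No repositories configured")]

    findings: List[Tuple[str, str]] = []
    for name in sorted(repositories):
        if name in inaccessible:
            findings.append(("critical", f"Repository inaccessible: {name}"))
    if len(repositories) == 1:
        findings.append(("warning", "Single repository only"))
    for name in sorted(repositories):
        repo = repositories[name]
        settings = repo.get("settings", {})
        # Setting values may come back as strings, so normalize before comparing.
        encrypted = str(settings.get("server_side_encryption", "false")).lower() == "true"
        # Assumption: only S3 repositories are checked here, because
        # server_side_encryption is an S3 repository setting.
        if repo.get("type") == "s3" and not encrypted:
            findings.append(("warning", f"No encryption enabled: {name}"))
    return findings


if __name__ == "__main__":
    repos = {
        "my-s3-repo": {
            "type": "s3",
            "settings": {"bucket": "my-es-backups", "server_side_encryption": "true"},
        }
    }
    for severity, message in evaluate_repositories(repos):
        print(severity, message)
```

Keeping the evaluation separate from the HTTP call makes the rules easy to unit test against saved responses.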
2. Backup Frequency and Consistency
{
  "snapshot": "daily-2024-12-15",
  "uuid": "abc123...",
  "state": "SUCCESS",
  "start_time": "2024-12-15T02:00:00.000Z",
  "end_time": "2024-12-15T02:45:00.000Z",
  "duration_in_millis": 2700000,
  "indices": ["logs-2024.12.15", "metrics-2024.12.15"],
  "shards": {
    "total": 15,
    "successful": 15,
    "failed": 0
  }
}
Daily Backups
Automated daily snapshots for production data with retention policies.
Weekly Archives
Longer-term weekly snapshots for compliance and historical data.
Pre-Change Backups
Manual snapshots before major changes or upgrades.
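A snapshot record like the one above can be checked programmatically. The following is a minimal sketch that assumes the record has already been fetched and decoded into a dict; the helper names `snapshot_problems` and `duration_minutes` are hypothetical.

```python
from datetime import datetime
from typing import Any, Dict, List


def parse_timestamp(value: str) -> datetime:
    """Parse a timestamp such as 2024-12-15T02:00:00.000Z."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def snapshot_problems(snapshot: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in one snapshot record."""
    problems = []
    state = snapshot.get("state")
    if state != "SUCCESS":
        problems.append(f"state is {state}")
    shards = snapshot.get("shards", {})
    failed = shards.get("failed", 0)
    if failed:
        problems.append(f"{failed} of {shards.get('total', '?')} shards failed")
    return problems


def duration_minutes(snapshot: Dict[str, Any]) -> float:
    """Compute the snapshot duration from its start and end timestamps."""
    delta = parse_timestamp(snapshot["end_time"]) - parse_timestamp(snapshot["start_time"])
    return delta.total_seconds() / 60


if __name__ == "__main__":
    example = {
        "snapshot": "daily-2024-12-15",
        "state": "SUCCESS",
        "start_time": "2024-12-15T02:00:00.000Z",
        "end_time": "2024-12-15T02:45:00.000Z",
        "shards": {"total": 15, "successful": 15, "failed": 0},
    }
    print(snapshot_problems(example), duration_minutes(example))
```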
3. Retention Policies
# Snapshot Lifecycle Management Policy (Elasticsearch cron schedule: daily at 02:00)
{
  "policy": {
    "name": "daily-snapshots",
    "schedule": "0 0 2 * * ?",
    "repository": "my-s3-repo",
    "config": {
      "indices": ["logs-*", "metrics-*"],
      "ignore_unavailable": true,
      "include_global_state": false
    },
    "retention": {
      "expire_after": "30d",
      "min_count": 5,
      "max_count": 50
    }
  }
}
Retention Strategy
- • Daily snapshots: 30 days retention
- • Weekly snapshots: 12 weeks retention
- • Monthly snapshots: 12 months retention
- • Yearly snapshots: 7 years retention
Cleanup Monitoring
- • Automated old snapshot deletion
- • Storage cost optimization
- • Compliance with data retention laws
- • Failed snapshot cleanup
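To see how `expire_after`, `min_count`, and `max_count` interact, the sketch below models the retention rules in plain Python. It is a simplified illustration, not Elasticsearch's implementation: it ignores snapshot state, and the function name `snapshots_to_delete` and the `(name, start_time)` tuple shape are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple


def snapshots_to_delete(
    snapshots: List[Tuple[str, datetime]],
    now: datetime,
    expire_after: timedelta,
    min_count: int,
    max_count: int,
) -> List[str]:
    """Return the names a retention run would remove, newest first."""
    newest_first = sorted(snapshots, key=lambda item: item[1], reverse=True)
    doomed = []
    for position, (name, started) in enumerate(newest_first):
        if position < min_count:
            continue  # the newest min_count snapshots are kept even if expired
        if position >= max_count or now - started > expire_after:
            doomed.append(name)  # over the cap, or older than expire_after
    return doomed


if __name__ == "__main__":
    now = datetime(2024, 12, 15, tzinfo=timezone.utc)
    daily = [(f"snap-{age}", now - timedelta(days=age)) for age in range(60)]
    removed = snapshots_to_delete(daily, now, timedelta(days=30), min_count=5, max_count=50)
    print(len(removed), "snapshots would be removed")
```

With one snapshot per day, the 30-day expiry is the binding rule and `max_count` never triggers; `max_count` only matters when snapshots are taken more often than the expiry window can absorb.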
Common Snapshot Issues
🚨 Critical: No Snapshot Repository
No snapshot repositories are configured, meaning no backups are possible. This is a critical data protection gap.
Immediate Actions:
- 1. Set up a snapshot repository immediately
- 2. Configure automated daily snapshots
- 3. Test restore procedures
- 4. Document backup and recovery processes
- 5. Set up monitoring and alerting
⚠️ Warning: Stale Snapshots
The latest snapshot is older than expected, which points to backup failures or scheduling problems.
Investigation Steps:
- • Check snapshot job logs and scheduling
- • Verify repository accessibility and permissions
- • Review cluster resources during backup windows
- • Test manual snapshot creation
- • Update monitoring and alerting thresholds
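A staleness check can be sketched as follows, assuming the snapshot list has already been fetched and decoded. The 26-hour default (a daily schedule plus some slack) and the function names are illustrative assumptions, not ElasticDoctor's actual thresholds.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List, Optional


def parse_timestamp(value: str) -> datetime:
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def latest_success_age(snapshots: List[Dict[str, Any]], now: datetime) -> Optional[timedelta]:
    """Age of the most recent successful snapshot, or None if there is none."""
    finished = [
        parse_timestamp(s["end_time"]) for s in snapshots if s.get("state") == "SUCCESS"
    ]
    return now - max(finished) if finished else None


def is_stale(
    snapshots: List[Dict[str, Any]],
    now: datetime,
    max_age: timedelta = timedelta(hours=26),  # assumed threshold for a daily schedule
) -> bool:
    age = latest_success_age(snapshots, now)
    return age is None or age > max_age


if __name__ == "__main__":
    history = [
        {"snapshot": "daily-2024-12-15", "state": "SUCCESS", "end_time": "2024-12-15T02:45:00.000Z"},
    ]
    print(is_stale(history, datetime(2024, 12, 16, 12, 0, tzinfo=timezone.utc)))
```

Failed and partial snapshots are ignored on purpose, so a run of failing jobs still shows up as a stale backup.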
ℹ️ Info: Snapshot Performance
Snapshots are taking longer than expected or consuming significant cluster resources during creation.
Optimization Options:
- • Schedule snapshots during low-traffic periods
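- • Keep compression enabled in repository settings (it applies to metadata files; index data is already compressed)
- • Take snapshots more often: they are incremental, so each run copies only segments not already in the repository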
- • Consider multiple smaller snapshots vs. one large snapshot
- • Monitor and tune network and storage performance
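One way to notice that snapshots are "taking longer than expected" is to compare each run against the median of recent runs. The sketch below assumes each record carries the `duration_in_millis` field shown earlier; the factor of 2 and the minimum history of three runs are arbitrary illustrative choices.

```python
from statistics import median
from typing import Any, Dict, List


def slow_snapshots(snapshots: List[Dict[str, Any]], factor: float = 2.0) -> List[str]:
    """Names of snapshots that took more than `factor` times the median duration."""
    durations = [s["duration_in_millis"] for s in snapshots]
    if len(durations) < 3:
        return []  # too little history for a meaningful baseline
    baseline = median(durations)
    return [s["snapshot"] for s in snapshots if s["duration_in_millis"] > factor * baseline]


if __name__ == "__main__":
    history = [
        {"snapshot": "daily-2024-12-12", "duration_in_millis": 2_600_000},
        {"snapshot": "daily-2024-12-13", "duration_in_millis": 2_800_000},
        {"snapshot": "daily-2024-12-14", "duration_in_millis": 9_000_000},
        {"snapshot": "daily-2024-12-15", "duration_in_millis": 2_700_000},
    ]
    print(slow_snapshots(history))
```

The median is used rather than the mean so that a single very slow run does not inflate its own baseline.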
Snapshot Strategy Best Practices
✅ Essential Practices
- • Automate daily snapshots for all critical data
- • Use multiple repositories for redundancy
- • Enable compression and encryption
- • Implement proper retention policies
- • Test restore procedures regularly
- • Monitor snapshot success and failures
💡 Advanced Strategies
- • Cross-region backup replication
- • Snapshot lifecycle management (SLM)
- • Selective index backup strategies
- • Hot-warm-cold backup tiers
- • Disaster recovery runbooks
❌ Common Pitfalls
- • Assuming snapshots work without testing restores
- • Using only local storage for backups
- • Ignoring snapshot failures and alerts
- • Not documenting recovery procedures
- • Inadequate retention policies
- • Missing encryption for sensitive data
⚠️ Monitoring Points
- • Snapshot completion time and success rate
- • Repository storage usage and costs
- • Backup window performance impact
- • Failed snapshot cleanup and alerts
- • Recovery time objective (RTO) testing
Snapshot Configuration Examples
S3 Repository Setup
# Create S3 repository
PUT /_snapshot/my-s3-backup
{
  "type": "s3",
  "settings": {
    "bucket": "my-elasticsearch-backups",
    "region": "us-west-2",
    "base_path": "prod-cluster/snapshots",
    "compress": true,
    "server_side_encryption": true,
    "storage_class": "standard_ia"
  }
}
# Verify repository
POST /_snapshot/my-s3-backup/_verify
Automated Snapshot Policy
# Create snapshot lifecycle policy (schedule uses Elasticsearch cron syntax: daily at 02:00)
PUT /_slm/policy/daily-snapshots
{
  "schedule": "0 0 2 * * ?",
  "name": "<daily-snap-{now/d}>",
  "repository": "my-s3-backup",
  "config": {
    "indices": ["logs-*", "metrics-*"],
    "ignore_unavailable": true,
    "include_global_state": false
  },
  "retention": {
    "expire_after": "30d",
    "min_count": 5,
    "max_count": 50
  }
}
# Run the policy once now to test it (scheduled runs need no separate start)
POST /_slm/policy/daily-snapshots/_execute
Restore Operations
# List available snapshots
GET /_snapshot/my-s3-backup/_all
# Restore specific indices (use the exact snapshot name from the listing; SLM appends a unique suffix to the names it generates)
POST /_snapshot/my-s3-backup/daily-snap-2024-12-15/_restore
{
  "indices": "logs-2024.12.15",
  "ignore_unavailable": true,
  "include_global_state": false,
  "rename_pattern": "(.+)",
  "rename_replacement": "restored-$1"
}
# Monitor restore progress
GET /restored-logs-2024.12.15/_recovery
Disaster Recovery Planning
🚨 Recovery Scenarios
Total Cluster Loss
- • Hardware failure or data center outage
- • Requires complete cluster rebuild
- • RTO: 2-4 hours depending on data size
- • RPO: Last successful snapshot
Partial Data Loss
- • Corrupted or deleted indices
- • Selective restore operations
- • RTO: 30 minutes to 2 hours
- • RPO: Last snapshot of affected indices
📋 Recovery Checklist
Pre-Disaster Preparation
- • ✅ Automated snapshot schedules
- • ✅ Multiple repository locations
- • ✅ Documented recovery procedures
- • ✅ Regular restore testing
- • ✅ Contact information and escalation
During Disaster
- • ✅ Assess scope of data loss
- • ✅ Identify last good snapshot
- • ✅ Rebuild cluster infrastructure
- • ✅ Restore from snapshots
- • ✅ Validate data integrity
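The "identify last good snapshot" step, and the data loss window (RPO) it implies, can be sketched as below. The sketch assumes the snapshot list is already decoded and treats a snapshot as reflecting data up to roughly its `start_time`; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List, Optional


def parse_timestamp(value: str) -> datetime:
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def last_good_snapshot(
    snapshots: List[Dict[str, Any]], incident_time: datetime
) -> Optional[Dict[str, Any]]:
    """Newest successful snapshot that finished before the incident, if any."""
    candidates = [
        s
        for s in snapshots
        if s.get("state") == "SUCCESS" and parse_timestamp(s["end_time"]) <= incident_time
    ]
    return max(candidates, key=lambda s: parse_timestamp(s["start_time"]), default=None)


def data_loss_window(snapshot: Dict[str, Any], incident_time: datetime) -> timedelta:
    """Approximate span of writes not covered by the snapshot."""
    return incident_time - parse_timestamp(snapshot["start_time"])


if __name__ == "__main__":
    history = [
        {"snapshot": "daily-2024-12-15", "state": "SUCCESS",
         "start_time": "2024-12-15T02:00:00.000Z", "end_time": "2024-12-15T02:45:00.000Z"},
        {"snapshot": "daily-2024-12-16", "state": "PARTIAL",
         "start_time": "2024-12-16T02:00:00.000Z", "end_time": "2024-12-16T02:30:00.000Z"},
    ]
    incident = datetime(2024, 12, 16, 10, 0, tzinfo=timezone.utc)
    best = last_good_snapshot(history, incident)
    if best is not None:
        print(best["snapshot"], data_loss_window(best, incident))
```

Snapshots that were still running or ended in a non-SUCCESS state at the time of the incident are skipped, since restoring from them may yield incomplete data.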
Protecting Your Elasticsearch Data
Critical Points
- • Snapshots are your primary defense against data loss
- • Automated, tested backup strategies are essential
- • Multiple repositories provide redundancy
- • Recovery procedures must be documented and tested
Action Items
- • Set up automated snapshot policies immediately
- • Test restore procedures monthly
- • Document disaster recovery runbooks
- • Monitor snapshot health and success rates