Data Protection is Not Optional
Elasticsearch snapshots are your safety net against data loss, corruption, and disasters. A well-designed snapshot strategy can mean the difference between a quick recovery and permanent data loss. This check validates your backup policies and identifies gaps in your disaster recovery preparation.
The snapshots check evaluates your backup strategy comprehensively - from repository configuration to snapshot frequency, retention policies, and recovery procedures. It ensures you're prepared for both planned maintenance and unexpected disasters.
Snapshot APIs and Monitoring
GET /_snapshot - List repositories
GET /_snapshot/_all - Repository details
GET /_snapshot/repo/_all - All snapshots
GET /_snapshot/_status - Active snapshots
✅ What This Check Validates
- • Repository configuration and health
- • Snapshot frequency and consistency
- • Retention policies and cleanup
- • Recovery readiness and testing
- • Cross-region backup strategies
- • Automation and monitoring setup
🔧 Repository Types
- • S3: AWS S3 bucket storage
- • GCS: Google Cloud Storage
- • Azure: Azure Blob Storage
- • HDFS: Hadoop Distributed File System
- • Shared file system: NFS/SMB mounts
Critical Snapshot Validations
1. Repository Configuration
{
  "my-s3-repo": {
    "type": "s3",
    "settings": {
      "bucket": "my-es-backups",
      "region": "us-west-2",
      "base_path": "elasticsearch/snapshots",
      "compress": true,
      "server_side_encryption": true
    }
  }
}
Health Checks
- • Repository accessibility and permissions
- • Storage bucket/path availability
- • Compression and encryption settings
- • Cross-region replication setup
ElasticDoctor Thresholds
- • Critical: No repositories configured
- • Critical: Repository inaccessible
- • Warning: Single repository only
- • Warning: No encryption enabled
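The thresholds above can be sketched as a small function over the parsed response of GET /_snapshot/_all. This is an illustrative sketch, not ElasticDoctor's actual implementation: the function name `check_repositories`, the `inaccessible` argument (for example, repositories whose `_verify` call failed), and the use of the S3 `server_side_encryption` setting as the encryption signal are all assumptions. Other repository types handle encryption differently.

```python
from typing import Dict, Iterable, List, Tuple


def _is_true(value) -> bool:
    # Repository settings often come back as strings ("true"), so accept both.
    return str(value).lower() == "true"


def check_repositories(
    repos: Dict[str, dict],
    inaccessible: Iterable[str] = (),
) -> List[Tuple[str, str]]:
    """Return (severity, message) findings for a GET /_snapshot/_all response.

    `inaccessible` is a hypothetical input: names of repositories that
    failed a separate accessibility test.
    """
    if not repos:
        return [("critical", "No repositories configured")]

    findings: List[Tuple[str, str]] = []
    for name in sorted(inaccessible):
        if name in repos:
            findings.append(("critical", f"Repository inaccessible: {name}"))
    if len(repos) == 1:
        findings.append(("warning", "Single repository only"))
    for name, repo in repos.items():
        settings = repo.get("settings", {})
        # Assumption: only the S3-style setting is inspected here.
        if not _is_true(settings.get("server_side_encryption", False)):
            findings.append(("warning", f"No encryption enabled: {name}"))
    return findings
```

Passing the example repository shown earlier would yield only the "Single repository only" warning, since encryption is enabled there.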
2. Backup Frequency and Consistency
{
  "snapshot": "daily-2024-12-15",
  "uuid": "abc123...",
  "state": "SUCCESS",
  "start_time": "2024-12-15T02:00:00.000Z",
  "end_time": "2024-12-15T02:45:00.000Z",
  "duration_in_millis": 2700000,
  "indices": ["logs-2024.12.15", "metrics-2024.12.15"],
  "shards": { "total": 15, "successful": 15, "failed": 0 }
}
Daily Backups
Automated daily snapshots for production data with retention policies.
Weekly Archives
Longer-term weekly snapshots for compliance and historical data.
Pre-Change Backups
Manual snapshots before major changes or upgrades.
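To make the snapshot record shown in this section easier to evaluate, the sketch below reduces one such record to the few fields a consistency check would likely care about. The function name `summarize_snapshot` and the choice of fields are assumptions made for illustration; the input keys are the ones that appear in the record above.

```python
def summarize_snapshot(info: dict) -> dict:
    """Condense one snapshot record into a small health summary."""
    shards = info.get("shards", {})
    return {
        "name": info["snapshot"],
        # Healthy only if the snapshot finished and no shard failed.
        "ok": info.get("state") == "SUCCESS" and shards.get("failed", 0) == 0,
        # duration_in_millis -> minutes (60,000 ms per minute).
        "duration_minutes": info.get("duration_in_millis", 0) / 60000,
        "index_count": len(info.get("indices", [])),
    }
```

For the record above, the 2,700,000 ms duration works out to 45 minutes, matching the gap between its start and end times.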
3. Retention Policies
# Snapshot Lifecycle Management Policy
# Note: SLM schedules use Elasticsearch cron syntax, which starts with a seconds field.
{
  "policy": {
    "name": "daily-snapshots",
    "schedule": "0 0 2 * * ?",
    "repository": "my-s3-repo",
    "config": {
      "indices": ["logs-*", "metrics-*"],
      "ignore_unavailable": true,
      "include_global_state": false
    },
    "retention": {
      "expire_after": "30d",
      "min_count": 5,
      "max_count": 50
    }
  }
}
Retention Strategy
- • Daily snapshots: 30 days retention
- • Weekly snapshots: 12 weeks retention
- • Monthly snapshots: 12 months retention
- • Yearly snapshots: 7 years retention
Cleanup Monitoring
- • Automated old snapshot deletion
- • Storage cost optimization
- • Compliance with data retention laws
- • Failed snapshot cleanup
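How `expire_after`, `min_count`, and `max_count` interact is easy to misread, so here is a simplified model of the retention block in the policy above. It is a sketch under stated assumptions, not SLM's actual implementation: the function name `snapshots_to_delete` is hypothetical, snapshots are represented as (name, start_time) pairs, and details such as SLM counting only successful snapshots toward `min_count` are ignored.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def snapshots_to_delete(
    snapshots: List[Tuple[str, datetime]],
    now: datetime,
    expire_after: timedelta,
    min_count: int,
    max_count: int,
) -> List[str]:
    """Return names of snapshots a retention pass would remove (simplified)."""
    newest_first = sorted(snapshots, key=lambda s: s[1], reverse=True)
    to_delete: List[str] = []
    kept = 0
    for name, started in newest_first:
        expired = now - started > expire_after
        if kept >= max_count:
            # Over the hard cap: remove even if not yet expired.
            to_delete.append(name)
        elif expired and kept >= min_count:
            # Expired, and the minimum is already satisfied by newer snapshots.
            to_delete.append(name)
        else:
            kept += 1
    return to_delete
```

With daily snapshots and the values from the policy (30 days, minimum 5, maximum 50), only snapshots older than 30 days are removed, and the five newest would survive even if scheduling stopped for months.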
Common Snapshot Issues
🚨 Critical: No Snapshot Repository
No snapshot repositories are configured, meaning no backups are possible. This is a critical data protection gap.
Immediate Actions:
- 1. Set up a snapshot repository immediately
- 2. Configure automated daily snapshots
- 3. Test restore procedures
- 4. Document backup and recovery processes
- 5. Set up monitoring and alerting
⚠️ Warning: Stale Snapshots
Latest snapshot is older than expected, indicating potential backup failures or scheduling issues.
Investigation Steps:
- • Check snapshot job logs and scheduling
- • Verify repository accessibility and permissions
- • Review cluster resources during backup windows
- • Test manual snapshot creation
- • Update monitoring and alerting thresholds
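A staleness check like the one described above can be sketched as follows. The 26-hour default (a daily schedule plus some slack) and the function names are assumptions for illustration, not ElasticDoctor's documented threshold; the input is a list of snapshot records with the `state` and `end_time` fields shown earlier.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def latest_success_end(snapshots: List[dict]) -> Optional[datetime]:
    """Return the end time of the most recent successful snapshot, if any."""
    ends = [
        # Python's fromisoformat does not accept a trailing "Z" before 3.11.
        datetime.fromisoformat(s["end_time"].replace("Z", "+00:00"))
        for s in snapshots
        if s.get("state") == "SUCCESS" and s.get("end_time")
    ]
    return max(ends, default=None)


def is_stale(
    snapshots: List[dict],
    now: datetime,
    max_age: timedelta = timedelta(hours=26),  # assumed threshold
) -> bool:
    """True if there is no successful snapshot newer than max_age."""
    latest = latest_success_end(snapshots)
    return latest is None or now - latest > max_age
```

Failed or partial snapshots are deliberately ignored, so a run of failures shows up as staleness rather than being masked by recent attempts.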
ℹ️ Info: Snapshot Performance
Snapshots are taking longer than expected or consuming significant cluster resources during creation.
Optimization Options:
- • Schedule snapshots during low-traffic periods
- • Enable compression in repository settings
- • Snapshot more frequently: snapshots are already incremental, so each run copies only data not yet in the repository
- • Consider multiple smaller snapshots vs. one large snapshot
- • Monitor and tune network and storage performance
Snapshot Strategy Best Practices
✅ Essential Practices
- • Automate daily snapshots for all critical data
- • Use multiple repositories for redundancy
- • Enable compression and encryption
- • Implement proper retention policies
- • Test restore procedures regularly
- • Monitor snapshot success and failures
💡 Advanced Strategies
- • Cross-region backup replication
- • Snapshot lifecycle management (SLM)
- • Selective index backup strategies
- • Hot-warm-cold backup tiers
- • Disaster recovery runbooks
❌ Common Pitfalls
- • Assuming snapshots work without testing restores
- • Using only local storage for backups
- • Ignoring snapshot failures and alerts
- • Not documenting recovery procedures
- • Inadequate retention policies
- • Missing encryption for sensitive data
⚠️ Monitoring Points
- • Snapshot completion time and success rate
- • Repository storage usage and costs
- • Backup window performance impact
- • Failed snapshot cleanup and alerts
- • Recovery time objective (RTO) testing
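Two of the monitoring points above, success rate and completion time, can be derived directly from a list of snapshot records. The sketch below assumes records shaped like the one shown in the frequency section; the function name `snapshot_stats` and the output keys are hypothetical.

```python
from typing import List, Optional


def snapshot_stats(snapshots: List[dict]) -> dict:
    """Compute success rate and mean duration (minutes) of successful runs."""
    total = len(snapshots)
    succeeded = [s for s in snapshots if s.get("state") == "SUCCESS"]
    durations = [
        s["duration_in_millis"] for s in succeeded if "duration_in_millis" in s
    ]
    avg_minutes: Optional[float] = (
        sum(durations) / len(durations) / 60000 if durations else None
    )
    return {
        "success_rate": len(succeeded) / total if total else 0.0,
        "avg_duration_minutes": avg_minutes,
    }
```

Tracking these two numbers over time is usually enough to notice both failing jobs and backup windows that are slowly growing.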
Snapshot Configuration Examples
S3 Repository Setup
# Create S3 repository
PUT /_snapshot/my-s3-backup
{
  "type": "s3",
  "settings": {
    "bucket": "my-elasticsearch-backups",
    "region": "us-west-2",
    "base_path": "prod-cluster/snapshots",
    "compress": true,
    "server_side_encryption": true,
    "storage_class": "standard_ia"
  }
}

# Verify repository
POST /_snapshot/my-s3-backup/_verify
Automated Snapshot Policy
# Create snapshot lifecycle policy
# (SLM schedules use Elasticsearch cron syntax, which starts with a seconds field)
PUT /_slm/policy/daily-snapshots
{
  "schedule": "0 0 2 * * ?",
  "name": "<daily-snap-{now/d}>",
  "repository": "my-s3-backup",
  "config": {
    "indices": ["logs-*", "metrics-*"],
    "ignore_unavailable": true,
    "include_global_state": false
  },
  "retention": {
    "expire_after": "30d",
    "min_count": 5,
    "max_count": 50
  }
}

# Run the policy once immediately (it otherwise runs on its schedule)
POST /_slm/policy/daily-snapshots/_execute
Restore Operations
# List available snapshots
GET /_snapshot/my-s3-backup/_all

# Restore specific indices
# (SLM appends a unique suffix to generated snapshot names; copy the exact name from the listing)
POST /_snapshot/my-s3-backup/daily-snap-2024-12-15/_restore
{
  "indices": "logs-2024.12.15",
  "ignore_unavailable": true,
  "include_global_state": false,
  "rename_pattern": "(.+)",
  "rename_replacement": "restored-$1"
}

# Monitor restore progress
GET /restored-logs-2024.12.15/_recovery
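Before running a restore, it can help to preview what `rename_pattern` and `rename_replacement` will do to each index name. The sketch below approximates that with Python's `re` module. It is an approximation only: Elasticsearch applies Java regular expressions, the helper name `preview_restored_names` is hypothetical, and only simple `$1`-style group references are translated.

```python
import re
from typing import List


def preview_restored_names(
    indices: List[str], rename_pattern: str, rename_replacement: str
) -> List[str]:
    """Approximate the index names a restore with renaming would create."""
    # Translate Java-style group references ($1) to Python's form (\g<1>).
    py_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", rename_replacement)
    return [re.sub(rename_pattern, py_replacement, name) for name in indices]
```

With the pattern and replacement from the restore request above, `logs-2024.12.15` maps to `restored-logs-2024.12.15`, so the restored copy does not collide with a live index of the same name.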
Disaster Recovery Planning
🚨 Recovery Scenarios
Total Cluster Loss
- • Hardware failure or data center outage
- • Requires complete cluster rebuild
- • RTO: 2-4 hours depending on data size
- • RPO: Last successful snapshot
Partial Data Loss
- • Corrupted or deleted indices
- • Selective restore operations
- • RTO: 30 minutes to 2 hours
- • RPO: Last snapshot of affected indices
📋 Recovery Checklist
Pre-Disaster Preparation
- • ✅ Automated snapshot schedules
- • ✅ Multiple repository locations
- • ✅ Documented recovery procedures
- • ✅ Regular restore testing
- • ✅ Contact information and escalation
During Disaster
- • ✅ Assess scope of data loss
- • ✅ Identify last good snapshot
- • ✅ Rebuild cluster infrastructure
- • ✅ Restore from snapshots
- • ✅ Validate data integrity
Protecting Your Elasticsearch Data
Critical Points
- • Snapshots are your primary defense against data loss
- • Automated, tested backup strategies are essential
- • Multiple repositories provide redundancy
- • Recovery procedures must be documented and tested
Action Items
- • Set up automated snapshot policies immediately
- • Test restore procedures monthly
- • Document disaster recovery runbooks
- • Monitor snapshot health and success rates