Health Checks - Operations

Snapshots Check: Backup Strategy and Disaster Recovery

Ensure robust backup strategies, validate snapshot policies, and prepare for disaster recovery with comprehensive snapshot monitoring.

November 26, 2024
15 min read
ElasticDoctor Team

Data Protection is Not Optional

Elasticsearch snapshots are your safety net against data loss, corruption, and disasters. A well-designed snapshot strategy can mean the difference between a quick recovery and permanent data loss. This check validates your backup policies and identifies gaps in your disaster recovery preparation.

The snapshots check evaluates your backup strategy comprehensively - from repository configuration to snapshot frequency, retention policies, and recovery procedures. It ensures you're prepared for both planned maintenance and unexpected disasters.

Snapshot APIs and Monitoring

Multiple Endpoints (ES 5.x - 9.x)
GET /_snapshot - List repositories
GET /_snapshot/_all - Repository details
GET /_snapshot/repo/_all - All snapshots
GET /_snapshot/_status - Active snapshots

✅ What This Check Validates

  • Repository configuration and health
  • Snapshot frequency and consistency
  • Retention policies and cleanup
  • Recovery readiness and testing
  • Cross-region backup strategies
  • Automation and monitoring setup

🔧 Repository Types

  • S3: AWS S3 bucket storage
  • GCS: Google Cloud Storage
  • Azure: Azure Blob Storage
  • HDFS: Hadoop Distributed File System
  • Shared file system: NFS/SMB mounts

Critical Snapshot Validations

1. Repository Configuration

{
  "my-s3-repo": {
    "type": "s3",
    "settings": {
      "bucket": "my-es-backups",
      "region": "us-west-2",
      "base_path": "elasticsearch/snapshots",
      "compress": true,
      "server_side_encryption": true
    }
  }
}

Health Checks

  • Repository accessibility and permissions
  • Storage bucket/path availability
  • Compression and encryption settings
  • Cross-region replication setup

ElasticDoctor Thresholds

  • Critical: No repositories configured
  • Critical: Repository inaccessible
  • Warning: Single repository only
  • Warning: No encryption enabled
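
The thresholds above can be sketched as a small Python function that inspects a repository listing shaped like the GET /_snapshot/_all response. This is an illustration rather than ElasticDoctor's actual implementation: the function name check_repositories and the unreachable parameter (repository names whose verification call failed) are assumptions, and only the S3 server_side_encryption flag is examined.

```python
def _is_true(value):
    # Repository settings may come back as strings ("true") or booleans.
    return str(value).lower() == "true"


def check_repositories(repos, unreachable=()):
    """Return (severity, message) findings for a repository listing.

    repos: dict of repository name -> {"type": ..., "settings": {...}}
    unreachable: hypothetical set of names whose _verify call failed
    """
    if not repos:
        return [("critical", "No repositories configured")]

    findings = []
    if len(repos) == 1:
        findings.append(("warning", "Single repository only"))

    for name, repo in repos.items():
        if name in unreachable:
            findings.append(("critical", f"Repository inaccessible: {name}"))
        settings = repo.get("settings", {})
        # Assumption: encryption is only checked for S3 repositories here.
        if repo.get("type") == "s3" and not _is_true(
            settings.get("server_side_encryption")
        ):
            findings.append(("warning", f"No encryption enabled: {name}"))
    return findings
```

Passing the repository shown earlier yields a single "Single repository only" warning, since it is encrypted but has no second repository alongside it.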

2. Backup Frequency and Consistency

{
  "snapshot": "daily-2024-12-15",
  "uuid": "abc123...",
  "state": "SUCCESS",
  "start_time": "2024-12-15T02:00:00.000Z",
  "end_time": "2024-12-15T02:45:00.000Z",
  "duration_in_millis": 2700000,
  "indices": ["logs-2024.12.15", "metrics-2024.12.15"],
  "shards": {
    "total": 15,
    "successful": 15,
    "failed": 0
  }
}

Daily Backups

Automated daily snapshots for production data with retention policies.

Weekly Archives

Longer-term weekly snapshots for compliance and historical data.

Pre-Change Backups

Manual snapshots before major changes or upgrades.
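
A snapshot record like the one above can be screened for consistency with a few lines of Python. The sketch below is a minimal example, not part of any Elasticsearch client: the function name snapshot_issues and the 60-minute default duration budget are assumptions chosen for illustration.

```python
def snapshot_issues(info, max_duration_minutes=60):
    """Return a list of human-readable problems found in one snapshot record."""
    issues = []

    state = info.get("state")
    if state != "SUCCESS":
        issues.append(f"state is {state}")

    failed = info.get("shards", {}).get("failed", 0)
    if failed > 0:
        issues.append(f"{failed} shard(s) failed")

    # duration_in_millis is reported by the snapshot API; convert to minutes.
    duration_minutes = info.get("duration_in_millis", 0) / 60000
    if duration_minutes > max_duration_minutes:
        issues.append(f"took {duration_minutes:.0f} min")

    return issues
```

For the record above (state SUCCESS, no failed shards, 2,700,000 ms or 45 minutes), the function returns an empty list under the default budget.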

3. Retention Policies

# Snapshot Lifecycle Management Policy
{
  "policy": {
    "name": "daily-snapshots",
    "schedule": "0 0 2 * * ?",
    "repository": "my-s3-repo",
    "config": {
      "indices": ["logs-*", "metrics-*"],
      "ignore_unavailable": true,
      "include_global_state": false
    },
    "retention": {
      "expire_after": "30d",
      "min_count": 5,
      "max_count": 50
    }
  }
}
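
To see how expire_after, min_count, and max_count interact, here is a simplified Python model of the retention rules. It is a sketch under stated assumptions, not Elasticsearch's implementation: the function name snapshots_to_delete is hypothetical, every snapshot is treated as successful, and a snapshot is considered expired only when it is strictly older than expire_after.

```python
from datetime import datetime, timedelta


def snapshots_to_delete(start_times, now, expire_after, min_count, max_count):
    """Return the start times a retention run would remove (simplified model)."""
    newest_first = sorted(start_times, reverse=True)
    to_delete = []
    for position, started in enumerate(newest_first):
        if position < min_count:
            continue  # the newest min_count snapshots are kept even if expired
        over_limit = position >= max_count  # beyond max_count, even if not expired
        expired = now - started > expire_after
        if over_limit or expired:
            to_delete.append(started)
    return to_delete


if __name__ == "__main__":
    now = datetime(2024, 12, 15, 2, 0)
    daily = [now - timedelta(days=d) for d in range(60)]
    doomed = snapshots_to_delete(daily, now, timedelta(days=30), 5, 50)
    print(len(doomed))
```

With 60 daily snapshots and the policy above, this model removes everything older than 30 days and keeps the rest; with only a handful of old snapshots, min_count prevents the repository from being emptied.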

Retention Strategy

  • Daily snapshots: 30 days retention
  • Weekly snapshots: 12 weeks retention
  • Monthly snapshots: 12 months retention
  • Yearly snapshots: 7 years retention

Cleanup Monitoring

  • Automated old snapshot deletion
  • Storage cost optimization
  • Compliance with data retention laws
  • Failed snapshot cleanup

Common Snapshot Issues

🚨 Critical: No Snapshot Repository

No snapshot repositories are configured, meaning no backups are possible. This is a critical data protection gap.

Immediate Actions:

  1. Set up a snapshot repository immediately
  2. Configure automated daily snapshots
  3. Test restore procedures
  4. Document backup and recovery processes
  5. Set up monitoring and alerting

⚠️ Warning: Stale Snapshots

Latest snapshot is older than expected, indicating potential backup failures or scheduling issues.

Investigation Steps:

  • Check snapshot job logs and scheduling
  • Verify repository accessibility and permissions
  • Review cluster resources during backup windows
  • Test manual snapshot creation
  • Update monitoring and alerting thresholds
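
A staleness check of this kind can be approximated in Python from a list of snapshot records such as those returned by GET /_snapshot/repo/_all. The sketch below is only an illustration: the names latest_success_age and is_stale are hypothetical, and the 26-hour default (a daily schedule plus some slack) is an assumed threshold, not an ElasticDoctor value.

```python
from datetime import datetime, timedelta, timezone


def _parse_time(value):
    # Snapshot times look like "2024-12-15T02:45:00.000Z" (UTC).
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def latest_success_age(snapshots, now):
    """Age of the most recent SUCCESS snapshot, or None if there is none."""
    end_times = [
        _parse_time(s["end_time"])
        for s in snapshots
        if s.get("state") == "SUCCESS" and "end_time" in s
    ]
    if not end_times:
        return None
    return now - max(end_times)


def is_stale(snapshots, now, max_age=timedelta(hours=26)):
    """True when no successful snapshot finished within max_age."""
    age = latest_success_age(snapshots, now)
    return age is None or age > max_age
```

Failed or partial snapshots are ignored on purpose, so a run of recent failures still shows up as a stale backup.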

ℹ️ Info: Snapshot Performance

Snapshots are taking longer than expected or consuming significant cluster resources during creation.

Optimization Options:

  • Schedule snapshots during low-traffic periods
  • Keep repository compression enabled (it applies to metadata files; index data is already compressed)
  • Take snapshots more often: snapshots are always incremental, so frequent runs copy less data each time
  • Consider multiple smaller snapshots vs. one large snapshot
  • Monitor and tune network and storage performance

Snapshot Strategy Best Practices

✅ Essential Practices

  • Automate daily snapshots for all critical data
  • Use multiple repositories for redundancy
  • Enable compression and encryption
  • Implement proper retention policies
  • Test restore procedures regularly
  • Monitor snapshot success and failures

💡 Advanced Strategies

  • Cross-region backup replication
  • Snapshot lifecycle management (SLM)
  • Selective index backup strategies
  • Hot-warm-cold backup tiers
  • Disaster recovery runbooks

❌ Common Pitfalls

  • Assuming snapshots work without testing restores
  • Using only local storage for backups
  • Ignoring snapshot failures and alerts
  • Not documenting recovery procedures
  • Inadequate retention policies
  • Missing encryption for sensitive data

⚠️ Monitoring Points

  • Snapshot completion time and success rate
  • Repository storage usage and costs
  • Backup window performance impact
  • Failed snapshot cleanup and alerts
  • Recovery time objective (RTO) testing
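
Success rate and completion time can be derived from the same snapshot records the list API returns. The following Python sketch shows one way to compute them; the function name snapshot_stats and the shape of the returned dictionary are assumptions made for this example.

```python
def snapshot_stats(snapshots):
    """Summarize success rate and average duration for a list of snapshot records."""
    total = len(snapshots)
    if total == 0:
        return {"total": 0, "success_rate": None, "avg_duration_minutes": None}

    successful = [s for s in snapshots if s.get("state") == "SUCCESS"]
    durations = [
        s["duration_in_millis"] for s in successful if "duration_in_millis" in s
    ]
    # Average only over successful snapshots that report a duration.
    avg_minutes = sum(durations) / len(durations) / 60000 if durations else None

    return {
        "total": total,
        "success_rate": len(successful) / total,
        "avg_duration_minutes": avg_minutes,
    }
```

Tracking these two numbers over time makes it easier to spot a slowly growing backup window before it collides with peak traffic.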

Snapshot Configuration Examples

S3 Repository Setup

# Create S3 repository
PUT /_snapshot/my-s3-backup
{
  "type": "s3",
  "settings": {
    "bucket": "my-elasticsearch-backups",
    "region": "us-west-2",
    "base_path": "prod-cluster/snapshots",
    "compress": true,
    "server_side_encryption": true,
    "storage_class": "standard_ia"
  }
}

# Verify repository
POST /_snapshot/my-s3-backup/_verify

Automated Snapshot Policy

# Create snapshot lifecycle policy
PUT /_slm/policy/daily-snapshots
{
  "schedule": "0 0 2 * * ?",
  "name": "<daily-snap-{now/d}>",
  "repository": "my-s3-backup",
  "config": {
    "indices": ["logs-*", "metrics-*"],
    "ignore_unavailable": true,
    "include_global_state": false
  },
  "retention": {
    "expire_after": "30d",
    "min_count": 5,
    "max_count": 50
  }
}

# Run the policy once now to test it (the schedule takes effect as soon as the policy is created)
POST /_slm/policy/daily-snapshots/_execute

Restore Operations

# List available snapshots
GET /_snapshot/my-s3-backup/_all

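# Restore specific indices (use the exact snapshot name from the list call above; SLM appends a unique suffix to generated names)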
POST /_snapshot/my-s3-backup/daily-snap-2024-12-15/_restore
{
  "indices": "logs-2024.12.15",
  "ignore_unavailable": true,
  "include_global_state": false,
  "rename_pattern": "(.+)",
  "rename_replacement": "restored-$1"
}

# Monitor restore progress
GET /restored-logs-2024.12.15/_recovery
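
Before running a restore, it can help to preview which index names rename_pattern and rename_replacement will produce. The Python sketch below approximates that mapping. It is an assumption-laden illustration: the function name preview_restored_name is hypothetical, Elasticsearch evaluates the pattern with Java regular expressions, and Python's re module may differ on advanced syntax, so treat the result as a preview only.

```python
import re


def preview_restored_name(index_name, rename_pattern, rename_replacement):
    """Approximate the index name a restore would create after renaming."""
    # Elasticsearch uses Java-style group references ($1); Python expects \g<1>.
    python_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", rename_replacement)
    return re.sub(rename_pattern, python_replacement, index_name)


if __name__ == "__main__":
    print(preview_restored_name("logs-2024.12.15", "(.+)", "restored-$1"))
```

For the restore request above, the preview matches the index name used in the recovery call: restored-logs-2024.12.15.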

Disaster Recovery Planning

🚨 Recovery Scenarios

Total Cluster Loss

  • Hardware failure or data center outage
  • Requires complete cluster rebuild
  • RTO: 2-4 hours depending on data size
  • RPO: Last successful snapshot

Partial Data Loss

  • Corrupted or deleted indices
  • Selective restore operations
  • RTO: 30 minutes to 2 hours
  • RPO: Last snapshot of affected indices
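
Because the RTO figures above depend on data size, a rough back-of-the-envelope estimate can be useful when planning. The Python sketch below divides data size by sustained restore throughput; the function name estimated_restore_hours is hypothetical, the throughput is something you would need to measure in your own restore tests, and the estimate ignores cluster rebuild time and shard recovery overhead.

```python
def estimated_restore_hours(data_size_gb, throughput_mb_per_s):
    """Rough restore time in hours: data size divided by sustained throughput."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput_mb_per_s must be positive")
    seconds = data_size_gb * 1024 / throughput_mb_per_s
    return seconds / 3600


if __name__ == "__main__":
    # Example inputs only; substitute measurements from a real restore test.
    print(f"{estimated_restore_hours(1800, 512):.1f} h")
```

Comparing this estimate with the duration of an actual test restore shows whether the stated RTO is realistic for the current data volume.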

📋 Recovery Checklist

Pre-Disaster Preparation

  • ✅ Automated snapshot schedules
  • ✅ Multiple repository locations
  • ✅ Documented recovery procedures
  • ✅ Regular restore testing
  • ✅ Contact information and escalation

During Disaster

  • ✅ Assess scope of data loss
  • ✅ Identify last good snapshot
  • ✅ Rebuild cluster infrastructure
  • ✅ Restore from snapshots
  • ✅ Validate data integrity

Protecting Your Elasticsearch Data

Critical Points

  • Snapshots are your primary defense against data loss
  • Automated, tested backup strategies are essential
  • Multiple repositories provide redundancy
  • Recovery procedures must be documented and tested

Action Items

  • Set up automated snapshot policies immediately
  • Test restore procedures monthly
  • Document disaster recovery runbooks
  • Monitor snapshot health and success rates