ElasticDoctor - Elasticsearch Health Diagnostics

Data Protection is Not Optional

Elasticsearch snapshots are your safety net against data loss, corruption, and disasters. A well-designed snapshot strategy can mean the difference between a quick recovery and permanent data loss. This check validates your backup policies and identifies gaps in your disaster recovery preparation.

The snapshots check evaluates your backup strategy comprehensively - from repository configuration to snapshot frequency, retention policies, and recovery procedures. It ensures you're prepared for both planned maintenance and unexpected disasters.

Snapshot APIs and Monitoring

Multiple Endpoints (ES 5.x - 9.x)

GET /_snapshot - List repositories

GET /_snapshot/_all - Repository details

GET /_snapshot/<repository>/_all - All snapshots in a repository

GET /_snapshot/_status - Active snapshots
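
The four read-only endpoints above can be assembled for any cluster and repository. The following is a minimal Python sketch, not ElasticDoctor's own code; the helper name snapshot_endpoints and the localhost URL are assumptions made for illustration.

```python
from urllib.parse import quote


def snapshot_endpoints(base_url, repository):
    """Return the four read-only snapshot URLs listed above for one repository."""
    base = base_url.rstrip("/")
    repo = quote(repository, safe="")  # escape any unusual characters in the name
    return {
        "repositories": f"{base}/_snapshot",
        "repository_details": f"{base}/_snapshot/_all",
        "snapshots": f"{base}/_snapshot/{repo}/_all",
        "active": f"{base}/_snapshot/_status",
    }


# Hypothetical cluster address and repository name.
urls = snapshot_endpoints("http://localhost:9200/", "my-s3-repo")
for label, url in urls.items():
    print(label, url)
```

Each URL is meant for a plain GET request; authentication and TLS settings are left out of the sketch.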

✅ What This Check Validates

• Repository configuration and health
• Snapshot frequency and consistency
• Retention policies and cleanup
• Recovery readiness and testing
• Cross-region backup strategies
• Automation and monitoring setup

🔧 Repository Types

• S3: AWS S3 bucket storage
• GCS: Google Cloud Storage
• Azure: Azure Blob Storage
• HDFS: Hadoop Distributed File System
• Shared file system: NFS/SMB mounts

Critical Snapshot Validations

1. Repository Configuration

{
  "my-s3-repo": {
    "type": "s3",
    "settings": {
      "bucket": "my-es-backups",
      "region": "us-west-2",
      "base_path": "elasticsearch/snapshots",
      "compress": true,
      "server_side_encryption": true
    }
  }
}

Health Checks

• Repository accessibility and permissions
• Storage bucket/path availability
• Compression and encryption settings
• Cross-region replication setup

ElasticDoctor Thresholds

• Critical: No repositories configured
• Critical: Repository inaccessible
• Warning: Single repository only
• Warning: No encryption enabled
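
One way the thresholds above could be applied to a repository listing is sketched below. This is an illustrative approximation, not ElasticDoctor's actual implementation: the function name repository_findings is hypothetical, the set of inaccessible repositories is assumed to be gathered separately (for example by verifying each repository), and only the S3-style server_side_encryption flag is inspected.

```python
def repository_findings(repositories, inaccessible=()):
    """Classify a repository listing into (severity, message) pairs.

    `repositories` is a dict shaped like the repository configuration shown
    earlier. `inaccessible` holds names of repositories that failed
    verification; collecting that set is left to the caller.
    """
    if not repositories:
        return [("critical", "No repositories configured")]

    findings = []
    for name in sorted(repositories):
        if name in inaccessible:
            findings.append(("critical", f"Repository inaccessible: {name}"))

    if len(repositories) == 1:
        findings.append(("warning", "Single repository only"))

    for name in sorted(repositories):
        settings = repositories[name].get("settings", {})
        # Assumption: settings may arrive as booleans or as strings ("true").
        flag = str(settings.get("server_side_encryption", "false")).lower()
        if flag != "true":
            findings.append(("warning", f"No encryption enabled: {name}"))
    return findings


example = {
    "my-s3-repo": {
        "type": "s3",
        "settings": {"bucket": "my-es-backups", "server_side_encryption": True},
    }
}
print(repository_findings(example))
```

For the example repository the only finding is the single-repository warning, since encryption is enabled.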

2. Backup Frequency and Consistency

{
  "snapshot": "daily-2024-12-15",
  "uuid": "abc123...",
  "state": "SUCCESS",
  "start_time": "2024-12-15T02:00:00.000Z",
  "end_time": "2024-12-15T02:45:00.000Z",
  "duration_in_millis": 2700000,
  "indices": ["logs-2024.12.15", "metrics-2024.12.15"],
  "shards": {
    "total": 15,
    "successful": 15,
    "failed": 0
  }
}

Daily Backups

Automated daily snapshots for production data with retention policies.

Weekly Archives

Longer-term weekly snapshots for compliance and historical data.

Pre-Change Backups

Manual snapshots before major changes or upgrades.
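
Consistency of a daily schedule can be checked by looking for gaps between successful snapshots. The sketch below assumes snapshot entries shaped like the example above (snapshot, state, start_time); the helper names and the 26-hour tolerance are illustrative choices, not values taken from ElasticDoctor.

```python
from datetime import datetime, timedelta


def parse_time(value):
    """Parse an ISO 8601 timestamp such as 2024-12-15T02:00:00.000Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def find_gaps(snapshots, max_gap=timedelta(hours=26)):
    """Return (previous_name, next_name, gap) for each gap longer than max_gap.

    Only snapshots in state SUCCESS count; failed or partial runs are
    treated as if no backup had been taken.
    """
    ok = sorted(
        (s for s in snapshots if s["state"] == "SUCCESS"),
        key=lambda s: parse_time(s["start_time"]),
    )
    gaps = []
    for prev, nxt in zip(ok, ok[1:]):
        gap = parse_time(nxt["start_time"]) - parse_time(prev["start_time"])
        if gap > max_gap:
            gaps.append((prev["snapshot"], nxt["snapshot"], gap))
    return gaps


history = [
    {"snapshot": "daily-2024-12-13", "state": "SUCCESS", "start_time": "2024-12-13T02:00:00.000Z"},
    {"snapshot": "daily-2024-12-14", "state": "FAILED", "start_time": "2024-12-14T02:00:00.000Z"},
    {"snapshot": "daily-2024-12-15", "state": "SUCCESS", "start_time": "2024-12-15T02:00:00.000Z"},
]
for prev, nxt, gap in find_gaps(history):
    print(f"{gap} without a successful snapshot between {prev} and {nxt}")
```

The tolerance is slightly longer than 24 hours so that normal variation in start time does not raise a false alarm.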

3. Retention Policies

# Snapshot Lifecycle Management Policy
{
  "policy": {
    "name": "<daily-snap-{now/d}>",
    "schedule": "0 0 2 * * ?",
    "repository": "my-s3-repo",
    "config": {
      "indices": ["logs-*", "metrics-*"],
      "ignore_unavailable": true,
      "include_global_state": false
    },
    "retention": {
      "expire_after": "30d",
      "min_count": 5,
      "max_count": 50
    }
  }
}

Retention Strategy

• Daily snapshots: 30 days retention
• Weekly snapshots: 12 weeks retention
• Monthly snapshots: 12 months retention
• Yearly snapshots: 7 years retention

Cleanup Monitoring

• Automated old snapshot deletion
• Storage cost optimization
• Compliance with data retention laws
• Failed snapshot cleanup
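
The interaction of expire_after, min_count, and max_count in the policy above is easier to see with a small simulation. The sketch below is a simplified approximation of retention behaviour under the stated assumptions (every snapshot belongs to one policy and all are successful); the function name snapshots_to_delete is hypothetical and this is not Elasticsearch's own logic.

```python
from datetime import datetime, timedelta


def snapshots_to_delete(snapshots, now, expire_after, min_count, max_count):
    """Return names of snapshots a retention pass would remove, oldest first.

    `snapshots` is a list of (name, start_time) tuples.
    """
    newest_first = sorted(snapshots, key=lambda item: item[1], reverse=True)
    doomed = []
    for position, (name, started) in enumerate(newest_first):
        if position < min_count:
            continue  # the newest min_count snapshots are kept even if expired
        if position >= max_count or now - started > expire_after:
            doomed.append(name)  # over the count limit, or past expire_after
    return list(reversed(doomed))


now = datetime(2025, 1, 31)
snaps = [
    ("snap-a", now - timedelta(days=1)),
    ("snap-b", now - timedelta(days=40)),
    ("snap-c", now - timedelta(days=50)),
]
# With min_count=2, the 40-day-old snapshot survives although it has expired.
print(snapshots_to_delete(snaps, now, timedelta(days=30), min_count=2, max_count=50))
```

The point of min_count is visible here: if backups stop running, expiry alone never removes the last few snapshots.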

Common Snapshot Issues

🚨 Critical: No Snapshot Repository

No snapshot repositories are configured, meaning no backups are possible. This is a critical data protection gap.

Immediate Actions:

1. Set up a snapshot repository immediately
2. Configure automated daily snapshots
3. Test restore procedures
4. Document backup and recovery processes
5. Set up monitoring and alerting

⚠️ Warning: Stale Snapshots

The latest snapshot is older than expected, which points to backup failures or scheduling problems.

Investigation Steps:

• Check snapshot job logs and scheduling
• Verify repository accessibility and permissions
• Review cluster resources during backup windows
• Test manual snapshot creation
• Update monitoring and alerting thresholds
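
A staleness check of this kind can be sketched in a few lines. The example assumes snapshot entries with state and end_time fields as in the snapshot example earlier; the function names and the 26-hour limit are illustrative assumptions, not ElasticDoctor's actual threshold.

```python
from datetime import datetime, timedelta, timezone


def latest_success_age(snapshots, now):
    """Return the age of the newest SUCCESS snapshot, or None if there is none."""
    ends = [
        datetime.fromisoformat(s["end_time"].replace("Z", "+00:00"))
        for s in snapshots
        if s.get("state") == "SUCCESS"
    ]
    return now - max(ends) if ends else None


def is_stale(snapshots, now, max_age=timedelta(hours=26)):
    """True when no successful snapshot exists or the newest one is too old."""
    age = latest_success_age(snapshots, now)
    return age is None or age > max_age


snapshots = [
    {"snapshot": "daily-2024-12-15", "state": "SUCCESS", "end_time": "2024-12-15T02:45:00.000Z"},
]
check_time = datetime(2024, 12, 17, 3, 0, tzinfo=timezone.utc)
print(is_stale(snapshots, check_time))
```

Treating "no successful snapshot at all" as stale keeps the check from passing silently on a repository that has never been written to.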

ℹ️ Info: Snapshot Performance

Snapshots are taking longer than expected or consuming significant cluster resources during creation.

Optimization Options:

• Schedule snapshots during low-traffic periods
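• Leave metadata compression enabled in repository settings (the compress option does not affect index data files, which are already compressed)
• Rely on the incremental nature of snapshots: frequent snapshots copy only the segments that changed since the previous one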
• Consider multiple smaller snapshots vs. one large snapshot
• Monitor and tune network and storage performance

Snapshot Strategy Best Practices

✅ Essential Practices

• Automate daily snapshots for all critical data
• Use multiple repositories for redundancy
• Enable compression and encryption
• Implement proper retention policies
• Test restore procedures regularly
• Monitor snapshot success and failures

💡 Advanced Strategies

• Cross-region backup replication
• Snapshot lifecycle management (SLM)
• Selective index backup strategies
• Hot-warm-cold backup tiers
• Disaster recovery runbooks

❌ Common Pitfalls

• Assuming snapshots work without testing restores
• Using only local storage for backups
• Ignoring snapshot failures and alerts
• Not documenting recovery procedures
• Inadequate retention policies
• Missing encryption for sensitive data

⚠️ Monitoring Points

• Snapshot completion time and success rate
• Repository storage usage and costs
• Backup window performance impact
• Failed snapshot cleanup and alerts
• Recovery time objective (RTO) testing
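
Several of these monitoring points can be derived from the snapshot listing alone. The sketch below assumes entries with state, duration_in_millis, and shards.failed fields as in the snapshot example earlier; the function name snapshot_metrics is hypothetical.

```python
def snapshot_metrics(snapshots):
    """Summarise success rate, average duration, and failed shards."""
    finished = [s for s in snapshots if s["state"] != "IN_PROGRESS"]
    succeeded = [s for s in finished if s["state"] == "SUCCESS"]
    durations = [s["duration_in_millis"] for s in succeeded]
    return {
        # Share of finished snapshots that completed with state SUCCESS.
        "success_rate": len(succeeded) / len(finished) if finished else None,
        # Mean duration of successful snapshots, in minutes.
        "avg_duration_minutes": (
            sum(durations) / len(durations) / 60000 if durations else None
        ),
        # Total failed shards across all finished snapshots.
        "failed_shards": sum(s["shards"]["failed"] for s in finished),
    }


listing = [
    {"state": "SUCCESS", "duration_in_millis": 2700000, "shards": {"failed": 0}},
    {"state": "PARTIAL", "duration_in_millis": 3000000, "shards": {"failed": 2}},
]
print(snapshot_metrics(listing))
```

Tracking these numbers over time, rather than per run, makes it easier to see a backup window that is slowly growing.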

Snapshot Configuration Examples

S3 Repository Setup

# Create S3 repository
PUT /_snapshot/my-s3-backup
{
  "type": "s3",
  "settings": {
    "bucket": "my-elasticsearch-backups",
    "region": "us-west-2",
    "base_path": "prod-cluster/snapshots",
    "compress": true,
    "server_side_encryption": true,
    "storage_class": "standard_ia"
  }
}

# Verify repository
POST /_snapshot/my-s3-backup/_verify
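
The same repository registration can be prepared from a script. The following sketch builds the PUT request with Python's standard library without sending it; the function name repository_request and the localhost URL are assumptions, and authentication and TLS are omitted.

```python
import json
from urllib.request import Request


def repository_request(base_url, name, settings):
    """Build (but do not send) the PUT request that registers an S3 repository."""
    body = json.dumps({"type": "s3", "settings": settings}).encode("utf-8")
    return Request(
        url=f"{base_url.rstrip('/')}/_snapshot/{name}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


req = repository_request(
    "http://localhost:9200",  # hypothetical cluster address
    "my-s3-backup",
    {
        "bucket": "my-elasticsearch-backups",
        "base_path": "prod-cluster/snapshots",
        "server_side_encryption": True,
    },
)
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would send it; keeping construction separate makes the payload easy to inspect or log first.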

Automated Snapshot Policy

# Create snapshot lifecycle policy
PUT /_slm/policy/daily-snapshots
{
  "schedule": "0 0 2 * * ?",
  "name": "<daily-snap-{now/d}>",
  "repository": "my-s3-backup",
  "config": {
    "indices": ["logs-*", "metrics-*"],
    "ignore_unavailable": true,
    "include_global_state": false
  },
  "retention": {
    "expire_after": "30d",
    "min_count": 5,
    "max_count": 50
  }
}

# Run the policy once now to test it (the schedule is active as soon as the policy is created)
POST /_slm/policy/daily-snapshots/_execute

Restore Operations

# List available snapshots
GET /_snapshot/my-s3-backup/_all

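# Restore specific indices
# Use the exact snapshot name from the listing above: SLM resolves {now/d}
# to a date such as 2024.12.15 and appends a unique suffix
POST /_snapshot/my-s3-backup/daily-snap-2024.12.15-<suffix>/_restore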
{
  "indices": "logs-2024.12.15",
  "ignore_unavailable": true,
  "include_global_state": false,
  "rename_pattern": "(.+)",
  "rename_replacement": "restored-$1"
}

# Monitor restore progress
GET /restored-logs-2024.12.15/_recovery
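
Restore progress can be condensed into one figure per index from the index recovery API response. The sketch below assumes the response lists shards with a type of SNAPSHOT and byte counters under index.size (total_in_bytes, recovered_in_bytes); check these field names against your Elasticsearch version. The function name restore_progress is hypothetical.

```python
def restore_progress(recovery):
    """Map each index to the fraction of bytes recovered from snapshot.

    Returns None for an index with no snapshot-type shard recoveries.
    """
    progress = {}
    for index, body in recovery.items():
        shards = [s for s in body.get("shards", []) if s.get("type") == "SNAPSHOT"]
        total = sum(s["index"]["size"]["total_in_bytes"] for s in shards)
        done = sum(s["index"]["size"]["recovered_in_bytes"] for s in shards)
        progress[index] = done / total if total else None
    return progress


# Hand-written sample in the assumed response shape.
sample = {
    "restored-logs-2024.12.15": {
        "shards": [
            {"type": "SNAPSHOT", "stage": "INDEX",
             "index": {"size": {"total_in_bytes": 1000, "recovered_in_bytes": 250}}},
            {"type": "SNAPSHOT", "stage": "DONE",
             "index": {"size": {"total_in_bytes": 1000, "recovered_in_bytes": 1000}}},
        ]
    }
}
print(restore_progress(sample))
```

Summing bytes across shards gives a progress figure weighted by shard size, which is usually more informative than counting finished shards.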

Disaster Recovery Planning

🚨 Recovery Scenarios

Total Cluster Loss

• Hardware failure or data center outage
• Requires complete cluster rebuild
• RTO: 2-4 hours depending on data size
• RPO: Last successful snapshot

Partial Data Loss

• Corrupted or deleted indices
• Selective restore operations
• RTO: 30 minutes to 2 hours
• RPO: Last snapshot of affected indices
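
The RPO and RTO figures above depend on snapshot timing and restore throughput, so it helps to compute them for your own cluster. The sketch below is simple arithmetic under stated assumptions: the data size, the sustained restore throughput, and the fixed overhead for rebuilding infrastructure are hypothetical inputs that must be measured in a real restore test.

```python
from datetime import datetime, timedelta


def rpo(last_snapshot_end, incident_time):
    """Data-loss window: time between the last good snapshot and the incident."""
    return incident_time - last_snapshot_end


def estimated_restore_hours(data_gib, throughput_mib_per_s, overhead_hours=0.0):
    """Rough restore time: transfer time plus a fixed overhead for rebuild steps."""
    transfer_seconds = data_gib * 1024 / throughput_mib_per_s
    return transfer_seconds / 3600 + overhead_hours


# Hypothetical incident 12 hours after the 02:45 snapshot finished.
window = rpo(datetime(2024, 12, 15, 2, 45), datetime(2024, 12, 15, 14, 45))
print("RPO:", window)

# Hypothetical 1800 GiB of data restored at a sustained 256 MiB/s,
# plus one hour to rebuild cluster infrastructure.
print("RTO estimate (hours):", estimated_restore_hours(1800, 256, overhead_hours=1.0))
```

If the estimate exceeds the target RTO, the levers are restore throughput, the amount of data restored first, or how much infrastructure is kept on standby.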

📋 Recovery Checklist

Pre-Disaster Preparation

• ✅ Automated snapshot schedules
• ✅ Multiple repository locations
• ✅ Documented recovery procedures
• ✅ Regular restore testing
• ✅ Contact information and escalation

During Disaster

• ✅ Assess scope of data loss
• ✅ Identify last good snapshot
• ✅ Rebuild cluster infrastructure
• ✅ Restore from snapshots
• ✅ Validate data integrity

Protecting Your Elasticsearch Data

Critical Points

• Snapshots are your primary defense against data loss
• Automated, tested backup strategies are essential
• Multiple repositories provide redundancy
• Recovery procedures must be documented and tested

Action Items

• Set up automated snapshot policies immediately
• Test restore procedures monthly
• Document disaster recovery runbooks
• Monitor snapshot health and success rates


Snapshots Check: Backup Strategy and Disaster Recovery