Backup & Recovery - Mid-Depth
You’ve set up basic backups. They run nightly. You trust they work. Then your database crashes at 3 PM, and you discover the backups are corrupt. Or they’re there, but recovery takes 12 hours when you needed 2. Or ransomware encrypted everything, including your backups.
This mid-depth layer solves the problems that surface when backups meet reality:
- Recovery takes too long. Your RTO says 2 hours but reality is 8+ hours. Nobody tested the full restore process.
- You’re losing too much data. Daily backups mean 24 hours of transactions disappear. Your RPO doesn’t match business needs.
- Ransomware got your backups. All your backup copies were network-accessible. Attackers encrypted them before you noticed.
- Choosing backup types is confusing. Full? Incremental? Differential? Each has trade-offs nobody explained.
- Cloud vs on-premises decision paralysis. You need off-site storage but don’t know whether to build it or buy it.
We’ll fix these with patterns from backup veterans like W. Curtis Preston (30+ years experience) and enterprise frameworks from Veeam and AWS.
When Surface Level Isn’t Enough
You’ve shipped basic backups. Now you’re hitting real problems:
- Your team can’t articulate what “4 hour RTO” actually means or whether it’s achievable
- You have backups but no verification they’re recoverable
- Recovery procedures exist only in one person’s head
- Backup storage grows uncontrollably with no lifecycle policy
- Nobody knows which systems can tolerate downtime and which can’t
This guide covers practical patterns that matter when backups become a business-critical system.
Understanding RTO and RPO as Business Decisions
RTO (Recovery Time Objective) and RPO (Recovery Point Objective) sound technical. They’re not. They’re business decisions with dollar signs attached.
What RTO Actually Means
RTO is how long you can be down before the pain becomes unacceptable.
Curtis Preston calls out the reality:
“Want zero downtime and zero data loss? Sure, we can do that—for about a billion dollars. Once costs attach to requirements, objectives become significantly more realistic.”
Breaking down a “4-hour RTO”:
Most organizations set 4-hour RTOs without understanding what “recovery” includes:
- Detection time (10 mins - 2 hours): When did the failure happen? When did someone notice?
- Decision time (5 mins - 1 hour): Is this worth triggering DR? Who approves?
- Retrieval time (10 mins - 8 hours): Get backup media from storage (instant for cloud, hours for tape vaults)
- Restore time (30 mins - 6 hours): Copy data from backup to production (depends on data size and network)
- Verification time (15 mins - 2 hours): Does the restored data work? Can users log in?
- Cutover time (10 mins - 1 hour): Switch traffic to restored system
Your “4-hour RTO” might actually require 12+ hours when you account for everything.
Calculating achievable RTO:
True RTO = Detection + Decision + Retrieval + Restore + Verification + Cutover
Example:
Detection: 30 mins (monitoring alerts)
Decision: 15 mins (on-call approves)
Retrieval: 10 mins (S3 bucket download)
Restore: 2 hours (500 GB database restore)
Verification: 30 mins (smoke tests)
Cutover: 15 mins (DNS change)
Total: 3 hours 40 minutes
This is your realistic RTO, not "4 hours."
Talking to business stakeholders:
Don’t ask: “What’s your RTO requirement?”
Ask instead:
- “How much revenue do we lose per hour of downtime?”
- “At what point do customers start leaving for competitors?”
- “What’s the cost of achieving 1-hour recovery vs 4-hour recovery?”
- “Which systems must be up first, and which can wait?”
Attach costs to RTOs. A 1-hour RTO might cost $200K/year in infrastructure. A 4-hour RTO might cost $20K/year. That’s a business decision, not a technical one.
What RPO Actually Means
RPO is how much data you can afford to lose, measured in time.
Daily backups at midnight mean your RPO is 24 hours. If the database crashes at 11:59 PM, you lose 23 hours and 59 minutes of transactions.
Calculating data loss cost:
E-commerce platform:
Average transactions per hour: 500
Average order value: $85
RPO of 24 hours = potential loss of 12,000 orders = $1,020,000
RPO of 1 hour = potential loss of 500 orders = $42,500
Difference: $977,500 in worst-case data loss
Is reducing RPO from 24 hours to 1 hour worth the infrastructure cost? That depends on how often failures happen and what you spend on prevention.
Realistic RPO factors:
Your RPO isn’t just backup frequency. It’s:
True RPO = Backup Frequency + Detection Time + Data Reconstruction
Example:
Backup frequency: Every 4 hours
Detection time: 30 minutes (before triggering recovery)
Data reconstruction: 2 hours (re-entering missing transactions)
True RPO: Up to 6.5 hours of data loss
The Dangerous Gap: RTO vs RTA
Preston emphasizes distinguishing between:
- RTO (Recovery Time Objective): What you plan to achieve
- RTA (Recovery Time Actual): What you actually achieve
Organizations set 4-hour RTOs but never test them. When disaster strikes, RTA is 12+ hours. This gap exists because:
- Nobody timed the full recovery process
- Backup compression slows restores (untested)
- Network bandwidth insufficient for large restores
- Missing prerequisites (hardware, credentials, DNS configuration)
- Key person unavailable (knowledge in one person’s head)
Close the gap with testing:
The only way to know your RTA is to measure it. Test recovery regularly and document the time.
Backup Strategy Selection: Full vs Incremental vs Differential
You need to choose a backup type. Each has trade-offs between storage cost and recovery speed.
The Three Backup Types
Assume you have a 100 GB database with 10% daily change (10 GB of modifications per day).
Full Backup
Every backup copies all 100 GB.
Sunday: 100 GB full backup
Monday: 100 GB full backup
Tuesday: 100 GB full backup
Weekly storage: 300 GB (3 × 100 GB)
Recovery process (Tuesday crash):
- Restore Tuesday’s full backup (100 GB)
- Done
Recovery time: Fastest (single restore operation)
Incremental Backup
Only backs up changes since the last backup (any type).
Sunday: 100 GB full backup
Monday: 10 GB incremental (changes since Sunday)
Tuesday: 10 GB incremental (changes since Monday)
Weekly storage: 120 GB (100 GB + 10 GB + 10 GB)
Recovery process (Tuesday crash):
- Restore Sunday’s full backup (100 GB)
- Apply Monday’s incremental (10 GB)
- Apply Tuesday’s incremental (10 GB)
- Done
Recovery time: Slowest (requires sequential chain)
Differential Backup
Backs up all changes since the last full backup.
Sunday: 100 GB full backup
Monday: 10 GB differential (changes since Sunday)
Tuesday: 20 GB differential (all changes since Sunday)
Weekly storage: 130 GB (100 GB + 10 GB + 20 GB)
Recovery process (Tuesday crash):
- Restore Sunday’s full backup (100 GB)
- Apply Tuesday’s differential (20 GB)
- Done
Recovery time: Fast (only two restore operations)
Comparison Table
| Aspect | Full Backup | Incremental | Differential |
|---|---|---|---|
| Backup speed | Slowest (copies everything) | Fastest (minimal data) | Medium (growing data) |
| Backup storage | Most expensive | Least expensive | Medium cost |
| Recovery speed | Fastest (1 step) | Slowest (chain required) | Fast (2 steps) |
| Recovery complexity | Simple | Complex (chain breaks if any part corrupted) | Moderate |
| Best for | Critical systems, fast RTO | Cost-conscious, slower RTO acceptable | Balanced approach |
Decision Framework
Use full backups when:
- Recovery speed matters more than cost
- RTO is aggressive (1-2 hours)
- Data size is manageable (under 500 GB)
- Weekly full backups for compliance/audit
Use incremental backups when:
- Storage cost is a major concern
- Data volume is massive (multi-terabyte)
- Slower recovery is acceptable (4+ hour RTO)
- You can tolerate sequential restore complexity
Use differential backups when:
- You need balanced cost and speed
- RTO is moderate (2-4 hours)
- Daily change rate is predictable
- Most common choice for databases
Common Pattern: Full + Differential
Many organizations combine strategies:
Weekly schedule:
Sunday: Full backup (100 GB)
Monday: Differential (10 GB since Sunday)
Tuesday: Differential (20 GB since Sunday)
Wednesday: Differential (30 GB since Sunday)
Thursday: Differential (40 GB since Sunday)
Friday: Differential (50 GB since Sunday)
Saturday: Differential (60 GB since Sunday)
Next Sunday: Full backup (reset cycle)
Recovery at any point:
- Restore most recent full backup (Sunday’s 100 GB)
- Apply most recent differential (e.g., Thursday’s 40 GB)
- Two steps maximum
Storage growth:
- Week 1: 100 + 10 + 20 + 30 + 40 + 50 + 60 = 310 GB
- Manageable with retention policies (keep 4 weeks = ~1.2 TB)
Storage Architecture: On-Premises vs Cloud vs Hybrid
Where you store backups affects cost, recovery speed, and resilience.
On-Premises Backup Storage
Architecture:
- Local NAS or SAN storage
- Tape libraries for long-term retention
- Physical transportation for off-site copies
Advantages:
- Fast local recovery (LAN speeds: 1 GB/sec+)
- No ongoing cloud costs
- Full control over infrastructure
- Good for multi-terabyte datasets
Disadvantages:
- Capital expenditure (hardware purchase)
- Physical off-site requires tape transport
- Geographic disaster risk (fire, flood destroys both primary and backup)
- Manual operational overhead
Cost model:
Initial investment:
NAS hardware (10 TB): $5,000
Tape library: $15,000
Total CAPEX: $20,000
Annual costs:
Power and cooling: $1,200/year
Tape media replacement: $1,000/year
Off-site tape transport: $2,400/year
Maintenance: $1,000/year
Total OPEX: $5,600/year
5-year TCO: $48,000
When to use:
- Large on-premises datasets (5+ TB)
- Fast recovery required (RTO < 2 hours)
- Existing data center infrastructure
- Regulatory data residency requirements
Cloud Backup Storage
Architecture:
- AWS S3, Azure Blob Storage, Google Cloud Storage
- Automatic geographic replication
- Lifecycle policies for tiering (Standard → Glacier)
Advantages:
- No capital expenditure (pay-as-you-go)
- Automatic off-site redundancy
- AWS S3: 99.999999999% (11 nines) durability
- Scales automatically with data growth
- Built-in immutability (S3 Object Lock)
Disadvantages:
- Network bandwidth limits recovery speed
- Egress costs for large restores
- Ongoing monthly costs scale with data
- Internet dependency
Cost model:
Cloud storage (AWS S3):
10 TB S3 Standard: $230/month
10 TB S3 Glacier (archived): $40/month
Retrieval cost (full restore):
10 TB egress: $920 (one-time, disaster scenario)
Annual cost (Standard): $2,760/year
Annual cost (Glacier): $480/year + retrieval when needed
5-year TCO (Standard): $13,800
5-year TCO (Glacier): $2,400 + occasional retrieval costs
When to use:
- Variable or growing backup volumes
- Geographic redundancy required
- Small to medium datasets (under 5 TB)
- Organizations without on-premises infrastructure
- Startups and cloud-native applications
Hybrid Approach (Recommended)
Most resilient: combine local and cloud storage.
Architecture:
Primary backup: Local NAS (fast recovery)
↓
Replication: AWS S3 (off-site protection)
↓
Archive: S3 Glacier (long-term retention)
↓
Air-gap: Offline tape vault (ransomware defense)
Example implementation:
Daily workflow:
1. Differential backup to local NAS (100 GB, 15 mins)
2. Automatic replication to S3 (background, 2 hours)
3. Weekly full backup (500 GB) to NAS
4. Monthly full backup to tape, stored off-site
Recovery scenarios:
- Quick recovery (database restore): Use local NAS (30 mins RTO)
- Disaster (data center fire): Restore from S3 (4 hours RTO)
- Ransomware: Restore from air-gapped tape (24 hours RTO)
Cost model:
Hybrid infrastructure:
Local NAS (10 TB): $5,000 one-time
AWS S3 (10 TB Standard): $230/month
Tape library (long-term): $5,000 one-time
Annual costs:
Local power/maintenance: $2,000/year
AWS S3: $2,760/year
Tape transport: $2,400/year
Total: $7,160/year + $10K CAPEX
5-year TCO: $45,800
Benefits:
- Fast operational recovery (local NAS)
- Geographic disaster protection (S3)
- Ransomware defense (offline tape)
- Flexible cost optimization
Immutable Backups vs Air-Gapped Backups
Modern ransomware targets backups before encrypting production data. You need backups attackers can’t delete.
Immutable Backups
How it works: Backup objects protected by retention locks. Even administrators can’t modify or delete until retention expires.
AWS S3 Object Lock example:
{
"ObjectLockConfiguration": {
"ObjectLockEnabled": "Enabled",
"Rule": {
"DefaultRetention": {
"Mode": "COMPLIANCE",
"Days": 30
}
}
}
}
No user, administrator, or automated process can delete objects for 30 days. Even AWS root account can’t override compliance mode.
Advantages:
- Always online (fast recovery)
- Prevents accidental or malicious deletion
- Works with automated backups
- Scales in cloud environments
Disadvantages:
- Still accessible via network (vulnerable if credentials compromised)
- Retention period must be set carefully
- Storage costs for full retention period
When to use:
- Primary ransomware defense
- Fast recovery requirements (hours)
- Cloud-native architectures
- Automated continuous backups
Air-Gapped Backups
How it works: Physical disconnection from network. Requires human access to retrieve.
Implementation:
- Tape library with offline storage
- USB drives in locked safe
- Dedicated backup repository with no network connection
Advantages:
- Maximum isolation from cyberattacks
- No network credential compromise can reach it
- Excellent for compliance and long-term retention
- Ultimate insurance policy
Disadvantages:
- Slow recovery (hours to days)
- Manual operations required
- Higher operational overhead
- Less suitable for frequent backups
When to use:
- Final backup layer (last resort)
- High ransomware risk
- Compliance long-term retention
- Monthly/quarterly full backups
Combined Strategy (Veeam Recommendation)
Use both for maximum protection:
Backup tiers:
1. Local NAS (fast recovery): RTO 30 mins
2. Immutable S3 (ransomware defense): RTO 2 hours
3. Air-gapped tape (ultimate protection): RTO 24 hours
Recovery decision tree:
Database crashed?
├─ No ransomware suspected → Restore from local NAS (30 mins)
└─ Ransomware suspected → Assess damage:
├─ NAS intact → Restore from NAS, verify integrity
├─ NAS compromised → Restore from immutable S3 (2 hours)
└─ S3 credentials compromised → Restore from air-gapped tape (24 hours)
This layered approach means you can recover from:
- Accidental deletion (local NAS)
- Ransomware (immutable S3)
- Sophisticated attack (air-gapped tape)
Point-in-Time Recovery (PITR) for Databases
Standard backups capture snapshots at specific times. PITR lets you recover to any moment.
How PITR Works
The mechanism:
- Take a base backup (full copy of database)
- Continuously archive transaction logs (every change recorded)
- To recover: Restore base backup, then replay transactions to desired point
PostgreSQL example:
Transaction timeline:
3:00 AM Sunday: Base backup taken
10:30 AM Monday: Customer deposits $5,000 (transaction logged)
2:45 PM Tuesday: Ransomware corrupts database
Recovery: Restore to 10:31 AM Monday (after deposit, before ransomware)
PostgreSQL PITR Configuration
Enable WAL archiving:
-- postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'cp %p /backup/wal_archive/%f'
Create base backup:
pg_basebackup -D /backup/base -Ft -z -P
This creates a compressed base backup.
Archive location:
/backup/
├── base/ (full database snapshot)
└── wal_archive/ (continuous transaction logs)
├── 000000010000000000000001
├── 000000010000000000000002
└── ...
Recovery process:
# 1. Stop PostgreSQL
systemctl stop postgresql
# 2. Clear data directory
rm -rf /var/lib/postgresql/14/main/*
# 3. Restore base backup
tar -xzf /backup/base/base.tar.gz -C /var/lib/postgresql/14/main/
# 4. Create recovery configuration
cat > /var/lib/postgresql/14/main/recovery.signal << EOF
restore_command = 'cp /backup/wal_archive/%f %p'
recovery_target_time = '2024-11-16 10:31:00'
EOF
# 5. Start PostgreSQL (automatic recovery)
systemctl start postgresql
PostgreSQL automatically:
- Restores base backup
- Replays WAL files up to 10:31 AM Monday
- Stops at target time
- Makes database available
Result: Database at exact state from 10:31 AM Monday.
PITR Benefits and Trade-offs
Benefits:
- RPO near-zero (recover to any second)
- Surgical recovery (restore to exact point before corruption)
- No need for frequent full backups
Trade-offs:
- More complex setup
- Requires continuous WAL archiving
- More storage (WAL files grow over time)
- Slower recovery than recent full backup
When to use PITR:
- Financial systems (transaction precision required)
- Compliance requirements (audit trail)
- Low RPO requirements (< 1 hour)
- Databases where data loss is expensive
When full backups are sufficient:
- Internal tools (data loss acceptable)
- Infrequently changing data
- Non-critical systems
Backup Testing and Validation
Untested backups are worse than no backups. They create false confidence.
Preston’s observation:
“Fewer than 50% of companies with disaster recovery plans test them at all, which is described as a huge mistake.”
Why Backups Fail Untested
Real failure scenarios discovered during actual disasters:
- Backup compression slowed restores, making RTO unachievable (Preston’s personal experience)
- Missing dependencies (database version incompatible with backup)
- Credentials expired or incorrect
- Backup files corrupt but verification never caught it
- Network bandwidth insufficient for large restores
- Person who knew recovery procedure left company
Progressive Testing Strategy
Preston’s recommendation: Start small, build up.
Don’t try full-scale disaster recovery first. Build confidence incrementally.
Level 1: Basic File Restore (Monthly)
# Pick a random file from backup
# Restore it to test environment
# Verify contents match original
# Example
restore_file --backup-id=daily-2024-11-16 \
--file=/var/app/config.json \
--destination=/tmp/test-restore/
diff /var/app/config.json /tmp/test-restore/config.json
# No differences = success
Time: 5 minutes Risk: Zero (not touching production) Value: Confirms basic restore mechanics work
Level 2: Database Restore (Monthly)
# Restore database to separate test instance
# Verify data integrity
# Document time taken
# Example (PostgreSQL)
pg_restore -d test_recovery /backup/latest.dump
# Verify row counts match
psql test_recovery -c "SELECT COUNT(*) FROM users;"
# Compare to production count
# Time the restore
# Document: "500 GB restore took 2 hours 15 minutes"
Time: 30-60 minutes Risk: Low (separate test database) Value: Validates database backups, measures restore time
Level 3: Application Recovery (Quarterly)
Full application stack restore:
1. Provision clean infrastructure
2. Restore database from backup
3. Restore application files
4. Start services
5. Run smoke tests
6. Time entire process
Example checklist:
□ Spin up test EC2 instance
□ Restore database (time: ___ hours)
□ Restore application code from Git tag
□ Restore configuration files from backup
□ Start application services
□ Smoke tests:
□ Can users log in?
□ Can users view their data?
□ Can users create new records?
□ Total time: ___ hours ___ minutes
Time: 2-4 hours Risk: Medium (requires coordination) Value: Validates full recovery process, uncovers missing steps
Level 4: Disaster Recovery Exercise (Annually)
Simulate actual disaster:
1. Declare test disaster (communication drills)
2. Full recovery to production-like environment
3. Different team member leads recovery
4. Stakeholder involvement
5. Post-mortem: what went wrong, what to improve
Requirements:
- Recovery runbooks (documented procedures)
- Multiple people capable of executing recovery
- Communication plan (who to notify, escalation paths)
- Success criteria defined in advance
Time: 1 full day Risk: High coordination effort Value: Validates RTO/RPO, uncovers gaps in planning
Automated Verification
Manual testing is good. Automated testing is better.
Daily automated checks:
#!/bin/bash
# backup-verify.sh
# Check backup completed successfully
if [ ! -f /backup/latest.flag ]; then
alert "Backup did not complete"
exit 1
fi
# Check backup file exists and has content
BACKUP_FILE="/backup/db-$(date +%Y%m%d).sql.gz"
if [ ! -s "$BACKUP_FILE" ]; then
alert "Backup file missing or empty"
exit 1
fi
# Check backup file size (should be > 100MB)
SIZE=$(stat -f%z "$BACKUP_FILE")
if [ "$SIZE" -lt 104857600 ]; then
alert "Backup file suspiciously small: ${SIZE} bytes"
exit 1
fi
# Test restore to temp database (sample check)
gunzip -c "$BACKUP_FILE" | psql -d temp_restore_check
# Verify row count
COUNT=$(psql temp_restore_check -t -c "SELECT COUNT(*) FROM users;")
if [ "$COUNT" -lt 1000 ]; then
alert "Restored database has too few rows: ${COUNT}"
exit 1
fi
echo "Backup verified successfully"
Run this daily. If it fails, you know immediately.
Disaster Recovery Strategies by RTO/RPO
Different systems need different recovery strategies.
Cold Standby (RTO: Days, RPO: 24+ hours)
Architecture:
- Backup resources exist but are offline
- Manual activation required
- Minimal ongoing cost
Example:
Production database crashes
↓
1. Provision new hardware (1-2 days if ordering required)
2. Install and configure database software (4 hours)
3. Restore from latest backup (6 hours)
4. Reconfigure application to new database (2 hours)
5. Validate and switch traffic (1 hour)
Total RTO: 2-3 days
When to use:
- Non-critical internal tools
- Systems with minimal business impact
- Budget-constrained scenarios
Cost: Low (backups only, no standby infrastructure)
Warm Standby (RTO: Hours, RPO: 1-4 hours)
Architecture:
- Secondary infrastructure provisioned but not fully active
- Database replication with lag
- Manual or automated activation
Example:
Production region fails
↓
1. Detect failure (10 mins)
2. Approve failover (15 mins)
3. Activate standby region:
- Promote read replica to primary (5 mins)
- Start application servers (10 mins)
- Update DNS to standby region (15 mins propagation)
4. Validate services (30 mins)
Total RTO: ~90 minutes
When to use:
- Business-critical systems
- Acceptable brief outages
- Moderate budget
Cost: Medium (standby infrastructure + replication)
Hot Standby (RTO: Minutes, RPO: Near-zero)
Architecture:
- Active-passive: standby actively replicating, instant failover
- Real-time data replication
Example:
Production database fails
↓
1. Automatic detection (30 seconds)
2. Automatic failover to standby (2 mins)
3. Validation (5 mins)
Total RTO: ~8 minutes
Total RPO: < 1 minute (replication lag)
When to use:
- Mission-critical systems (financial, healthcare)
- Zero-tolerance for data loss
- High budget
Cost: High (duplicate infrastructure running continuously)
Multi-Region Active-Active (RTO: Seconds, RPO: Near-zero)
Architecture:
- Multiple regions serving traffic simultaneously
- Automatic failover, users don’t notice
Example:
AWS us-east-1 region fails
↓
1. Load balancer detects failure (5 seconds)
2. Routes traffic to us-west-2 (automatic)
3. Users experience no downtime
Total RTO: < 10 seconds
Total RPO: Near-zero (real-time replication)
When to use:
- Global services requiring high availability
- E-commerce platforms
- SaaS applications
Cost: Very high (full infrastructure in multiple regions)
Decision Framework
| RTO Requirement | RPO Requirement | Strategy | Example Use Case |
|---|---|---|---|
| Days | 24+ hours | Cold standby | Internal wikis, dev environments |
| 4-8 hours | 4-12 hours | Warm standby | CRM, reporting tools |
| 1-2 hours | 1-4 hours | Hot standby | Customer-facing apps |
| Minutes | Near-zero | Active-active | E-commerce, financial systems |
Cost increases dramatically with tighter RTO/RPO:
- Cold: ~$500/month (backups only)
- Warm: ~$5,000/month (standby + replication)
- Hot: ~$15,000/month (full duplicate infrastructure)
- Active-active: ~$30,000/month (multi-region full deployment)
Cost-Benefit Analysis
Backup infrastructure costs money. Make informed decisions.
Time Investment
Initial setup:
- Basic automated backups: 4-8 hours
- Hybrid backup strategy: 16-24 hours
- PITR configuration: 8-16 hours
- Disaster recovery testing: 8 hours (first time)
Ongoing maintenance:
- Monitoring backup health: 2 hours/month
- Testing restores: 4 hours/month
- Updating runbooks: 2 hours/month
- Total: ~8 hours/month
Return on Investment
Immediate benefits:
- Sleep better (backups exist and are tested)
- Faster recovery from accidental deletion
- Compliance checkbox (backups required for most standards)
Medium-term (3-6 months):
- Confidence in recovery procedures
- Reduced recovery time (tested and optimized)
- Team knowledge distributed (not single person)
Long-term (1+ year):
- Prevented data loss incidents
- Successful disaster recovery (when needed)
- Reduced insurance costs (better risk profile)
When to Skip Advanced Backup Strategies
Not every system needs enterprise-grade backups.
Skip complex backup infrastructure if:
- Data is easily recreated (development environments)
- System is non-critical (internal demo apps)
- Budget is very limited
- Data volume is tiny (< 10 GB)
Simple approach for these cases:
# Daily cron job
0 2 * * * pg_dump mydb | gzip > /backup/mydb-$(date +\%Y\%m\%d).sql.gz
# Keep 7 days
find /backup -name "mydb-*.sql.gz" -mtime +7 -delete
# Weekly copy to S3
0 3 * * 0 aws s3 cp /backup/ s3://my-backups/ --recursive
Good enough for non-critical systems.
Progressive Enhancement Path
Build backup infrastructure incrementally.
Month 1-2: Foundation
Week 1-2:
- Identify critical systems requiring backup
- Define RTO/RPO requirements (business conversation)
- Set up automated daily backups (database, files)
- Configure backup monitoring/alerting
Week 3-4:
- Document recovery procedures (runbook)
- Perform first test restore (Level 1: file restore)
- Calculate actual storage requirements
- Set up backup retention policy
Outcome: Basic backups running and verified
Month 3-4: Optimization
Week 5-6:
- Implement off-site backup replication (cloud)
- Test database restore (Level 2)
- Measure actual restore time (compare to RTO)
- Set up immutable backups (ransomware defense)
Week 7-8:
- Optimize backup schedule (full + differential)
- Implement automated verification
- Train additional team members on recovery
- Conduct tabletop exercise (discuss disaster scenarios)
Outcome: Tested, optimized backup strategy with off-site protection
Month 5-6: Advanced
Week 9-10:
- Implement PITR for critical databases (if needed)
- Set up air-gapped backup tier (tape or offline)
- Full disaster recovery exercise (Level 4)
- Document gaps and improvements
Week 11-12:
- Automate recovery procedures (infrastructure-as-code)
- Implement monitoring dashboard (backup health)
- Review and update RTO/RPO based on tests
- Create quarterly testing schedule
Outcome: Enterprise-grade backup and disaster recovery capability
Summary
Key takeaways:
-
RTO and RPO are business decisions, not technical ones. Attach costs to recovery objectives. Four-hour RTO might actually require 12+ hours when you account for all steps.
-
Test your backups or they don’t exist. Start with simple file restores, build up to full disaster recovery exercises. Untested backups create false confidence.
-
Choose backup types based on trade-offs. Full backups are fast to restore but expensive. Incremental backups are cheap but slow to restore. Differential balances both.
-
Hybrid storage combines speed and resilience. Local NAS for fast recovery, cloud for off-site protection, air-gapped tape for ransomware defense.
-
Immutable backups defend against ransomware. Combined with air-gapped backups, you can survive sophisticated attacks.
-
PITR reduces RPO to near-zero. For critical databases, point-in-time recovery enables surgical restoration to exact moments.
Start here:
- Define RTO/RPO with business stakeholders (attach dollar values)
- Set up automated backups with monitoring
- Perform your first test restore this week
For deeper understanding:
- Deep Water covers advanced PITR implementation, chaos engineering for backup validation, and multi-region disaster recovery architectures