Backup & Recovery - Mid-Depth

You’ve set up basic backups. They run nightly. You trust they work. Then your database crashes at 3 PM, and you discover the backups are corrupt. Or they’re there, but recovery takes 12 hours when you needed 2. Or ransomware encrypted everything, including your backups.

This mid-depth layer solves the problems that surface when backups meet reality:

Recovery takes too long. Your RTO says 2 hours but reality is 8+ hours. Nobody tested the full restore process.
You’re losing too much data. Daily backups mean 24 hours of transactions disappear. Your RPO doesn’t match business needs.
Ransomware got your backups. All your backup copies were network-accessible. Attackers encrypted them before you noticed.
Choosing backup types is confusing. Full? Incremental? Differential? Each has trade-offs nobody explained.
Cloud vs on-premises decision paralysis. You need off-site storage but don’t know whether to build it or buy it.

We’ll fix these with patterns from backup veterans like W. Curtis Preston (30+ years of experience) and enterprise frameworks from Veeam and AWS.

When Surface Level Isn’t Enough

You’ve shipped basic backups. Now you’re hitting real problems:

Your team can’t articulate what a “4-hour RTO” actually means or whether it’s achievable
You have backups but no verification they’re recoverable
Recovery procedures exist only in one person’s head
Backup storage grows uncontrollably with no lifecycle policy
Nobody knows which systems can tolerate downtime and which can’t

This guide covers practical patterns that matter when backups become a business-critical system.

Understanding RTO and RPO as Business Decisions

RTO (Recovery Time Objective) and RPO (Recovery Point Objective) sound technical. They’re not. They’re business decisions with dollar signs attached.

What RTO Actually Means

RTO is how long you can be down before the pain becomes unacceptable.

Curtis Preston calls out the reality:

“Want zero downtime and zero data loss? Sure, we can do that—for about a billion dollars. Once costs attach to requirements, objectives become significantly more realistic.”

Breaking down a “4-hour RTO”:

Most organizations set 4-hour RTOs without understanding what “recovery” includes:

Detection time (10 mins - 2 hours): When did the failure happen? When did someone notice?
Decision time (5 mins - 1 hour): Is this worth triggering DR? Who approves?
Retrieval time (10 mins - 8 hours): Get backup media from storage (instant for cloud, hours for tape vaults)
Restore time (30 mins - 6 hours): Copy data from backup to production (depends on data size and network)
Verification time (15 mins - 2 hours): Does the restored data work? Can users log in?
Cutover time (10 mins - 1 hour): Switch traffic to restored system

Your “4-hour RTO” might actually require 12+ hours when you account for everything.

Calculating achievable RTO:

True RTO = Detection + Decision + Retrieval + Restore + Verification + Cutover

Example:
  Detection: 30 mins (monitoring alerts)
  Decision: 15 mins (on-call approves)
  Retrieval: 10 mins (S3 bucket download)
  Restore: 2 hours (500 GB database restore)
  Verification: 30 mins (smoke tests)
  Cutover: 15 mins (DNS change)

  Total: 3 hours 40 minutes

This is your realistic RTO under good conditions. It leaves only 20 minutes of slack against a 4-hour target, so one slow step breaks it.
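
The same arithmetic is easy to keep in a small script so the estimate can be re-run whenever a phase changes. This is a minimal sketch: the phase names and minute values are copied from the example above, and `estimate_rto` and `slack` are hypothetical helpers, not part of any backup tool.

```python
from datetime import timedelta

# Phase durations in minutes, taken from the worked example above.
PHASES = {
    "detection": 30,
    "decision": 15,
    "retrieval": 10,
    "restore": 120,
    "verification": 30,
    "cutover": 15,
}


def estimate_rto(phases: dict[str, int]) -> timedelta:
    """Sum every recovery phase into one end-to-end estimate."""
    return timedelta(minutes=sum(phases.values()))


def slack(phases: dict[str, int], target_hours: float) -> timedelta:
    """Time left against the stated RTO (negative means the target is missed)."""
    return timedelta(hours=target_hours) - estimate_rto(phases)


if __name__ == "__main__":
    print("Estimated RTO:", estimate_rto(PHASES))        # 3:40:00
    print("Slack vs 4-hour target:", slack(PHASES, 4))   # 0:20:00
```

Replacing the example values with times measured in your own restore tests turns this from an estimate into a record.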

Talking to business stakeholders:

Don’t ask: “What’s your RTO requirement?”

Ask instead:

“How much revenue do we lose per hour of downtime?”
“At what point do customers start leaving for competitors?”
“What’s the cost of achieving 1-hour recovery vs 4-hour recovery?”
“Which systems must be up first, and which can wait?”

Attach costs to RTOs. A 1-hour RTO might cost $200K/year in infrastructure. A 4-hour RTO might cost $20K/year. That’s a business decision, not a technical one.

What RPO Actually Means

RPO is how much data you can afford to lose, measured in time.

Daily backups at midnight mean your RPO is 24 hours. If the database crashes at 11:59 PM, you lose 23 hours and 59 minutes of transactions.

Calculating data loss cost:

E-commerce platform:
  Average transactions per hour: 500
  Average order value: $85

  RPO of 24 hours = potential loss of 12,000 orders = $1,020,000
  RPO of 1 hour = potential loss of 500 orders = $42,500

  Difference: $977,500 in worst-case data loss

Is reducing RPO from 24 hours to 1 hour worth the infrastructure cost? That depends on how often failures happen and what you spend on prevention.
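
One way to frame that trade-off is to weight the worst-case loss by how often you expect a failure. The sketch below uses the order volume and order value from the example; the failure rate of one incident every two years is an assumed figure for illustration only, and both function names are hypothetical.

```python
def worst_case_loss(rpo_hours: float, orders_per_hour: float, order_value: float) -> float:
    """Revenue at risk if a failure happens right before the next backup."""
    return rpo_hours * orders_per_hour * order_value


def expected_annual_loss(
    rpo_hours: float,
    orders_per_hour: float,
    order_value: float,
    failures_per_year: float,
) -> float:
    """Worst-case loss weighted by an assumed failure frequency."""
    return worst_case_loss(rpo_hours, orders_per_hour, order_value) * failures_per_year


if __name__ == "__main__":
    print(f"24-hour RPO: ${worst_case_loss(24, 500, 85):,.0f}")  # $1,020,000
    print(f"1-hour RPO:  ${worst_case_loss(1, 500, 85):,.0f}")   # $42,500

    # Assumption: one data-loss incident every two years (0.5 per year).
    saving = expected_annual_loss(24, 500, 85, 0.5) - expected_annual_loss(1, 500, 85, 0.5)
    print(f"Expected annual saving from the tighter RPO: ${saving:,.0f}")  # $488,750
```

If the tighter RPO costs less per year than the expected saving, it likely pays for itself. The answer moves with the assumed failure rate, so it is worth running the numbers with a pessimistic and an optimistic value.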

Realistic RPO factors:

Your RPO isn’t just backup frequency. It’s:

True RPO = Backup Frequency + Detection Time + Data Reconstruction

Example:
  Backup frequency: Every 4 hours
  Detection time: 30 minutes (before triggering recovery)
  Data reconstruction: 2 hours (re-entering missing transactions)

  True RPO: Up to 6.5 hours of data loss

The Dangerous Gap: RTO vs RTA

Preston emphasizes distinguishing between:

RTO (Recovery Time Objective): What you plan to achieve
RTA (Recovery Time Actual): What you actually achieve

Organizations set 4-hour RTOs but never test them. When disaster strikes, RTA is 12+ hours. This gap exists because:

Nobody timed the full recovery process
Backup compression slows restores (untested)
Network bandwidth insufficient for large restores
Missing prerequisites (hardware, credentials, DNS configuration)
Key person unavailable (knowledge in one person’s head)

Close the gap with testing:

The only way to know your RTA is to measure it. Test recovery regularly and document the time.
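
A lightweight way to document the time is to wrap each drill step in a timer. The following is a sketch, not a prescribed tool: `DrillTimer` is a hypothetical class, and the `time.sleep` calls stand in for whatever restore and smoke-test commands you actually run.

```python
import time
from contextlib import contextmanager


class DrillTimer:
    """Records how long each recovery step takes during a test restore."""

    def __init__(self, rto_seconds: float):
        self.rto_seconds = rto_seconds
        self.steps: list[tuple[str, float]] = []

    @contextmanager
    def step(self, name: str):
        start = time.monotonic()
        try:
            yield
        finally:
            # Record the duration even if the step raised an error.
            self.steps.append((name, time.monotonic() - start))

    @property
    def rta_seconds(self) -> float:
        """Recovery Time Actual: the sum of all measured steps."""
        return sum(duration for _, duration in self.steps)

    def report(self) -> str:
        lines = [f"{name}: {duration:.1f}s" for name, duration in self.steps]
        verdict = "within" if self.rta_seconds <= self.rto_seconds else "OVER"
        lines.append(f"RTA {self.rta_seconds:.1f}s is {verdict} RTO {self.rto_seconds:.0f}s")
        return "\n".join(lines)


if __name__ == "__main__":
    timer = DrillTimer(rto_seconds=4 * 3600)
    with timer.step("restore"):
        time.sleep(0.1)    # stand-in for the real restore command
    with timer.step("verification"):
        time.sleep(0.05)   # stand-in for smoke tests
    print(timer.report())
```

Saving each report alongside the runbook gives you a history of RTA over time, which makes regressions visible.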

Backup Strategy Selection: Full vs Incremental vs Differential

You need to choose a backup type. Each has trade-offs between storage cost and recovery speed.

The Three Backup Types

Assume you have a 100 GB database with 10% daily change (10 GB of modifications per day).

Full Backup

Every backup copies all 100 GB.

Sunday: 100 GB full backup
Monday: 100 GB full backup
Tuesday: 100 GB full backup

Storage after three days: 300 GB (3 × 100 GB)

Recovery process (Tuesday crash):

Restore Tuesday’s full backup (100 GB)
Done

Recovery time: Fastest (single restore operation)

Incremental Backup

Only backs up changes since the last backup (any type).

Sunday: 100 GB full backup
Monday: 10 GB incremental (changes since Sunday)
Tuesday: 10 GB incremental (changes since Monday)

Storage after three days: 120 GB (100 GB + 10 GB + 10 GB)

Recovery process (Tuesday crash):

Restore Sunday’s full backup (100 GB)
Apply Monday’s incremental (10 GB)
Apply Tuesday’s incremental (10 GB)
Done

Recovery time: Slowest (requires sequential chain)

Differential Backup

Backs up all changes since the last full backup.

Sunday: 100 GB full backup
Monday: 10 GB differential (changes since Sunday)
Tuesday: 20 GB differential (all changes since Sunday)

Storage after three days: 130 GB (100 GB + 10 GB + 20 GB)

Recovery process (Tuesday crash):

Restore Sunday’s full backup (100 GB)
Apply Tuesday’s differential (20 GB)
Done

Recovery time: Fast (only two restore operations)

Comparison Table

Aspect	Full Backup	Incremental	Differential
Backup speed	Slowest (copies everything)	Fastest (minimal data)	Medium (growing data)
Backup storage	Most expensive	Least expensive	Medium cost
Recovery speed	Fastest (1 step)	Slowest (chain required)	Fast (2 steps)
Recovery complexity	Simple	Complex (chain breaks if any part corrupted)	Moderate
Best for	Critical systems, fast RTO	Cost-conscious, slower RTO acceptable	Balanced approach
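
The difference in recovery complexity comes down to which backups must be restored, and in what order. The sketch below models that for the Sunday-to-Tuesday example; `restore_chain` is a hypothetical helper and the day names are only labels.

```python
def restore_chain(strategy: str, backups: list[str]) -> list[str]:
    """Return the backups to restore, in order, to reach the newest state.

    `backups` is ordered oldest to newest. For "incremental" and
    "differential" the first entry is the full backup; for "full" every
    entry is a full backup.
    """
    if not backups:
        raise ValueError("no backups available")
    if strategy == "full":
        return [backups[-1]]          # the newest full is self-contained
    if strategy == "incremental":
        return list(backups)          # full plus every incremental, in order
    if strategy == "differential":
        if len(backups) == 1:
            return [backups[0]]       # only the full exists so far
        return [backups[0], backups[-1]]  # full plus the newest differential
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    days = ["Sunday", "Monday", "Tuesday"]
    for strategy in ("full", "incremental", "differential"):
        print(f"{strategy}: {restore_chain(strategy, days)}")
```

The incremental chain grows by one entry per day, which is why a single corrupt incremental can block recovery of everything after it.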

Decision Framework

Use full backups when:

Recovery speed matters more than cost
RTO is aggressive (1-2 hours)
Data size is manageable (under 500 GB)
Compliance or audit rules require periodic (for example, weekly) full backups

Use incremental backups when:

Storage cost is a major concern
Data volume is massive (multi-terabyte)
Slower recovery is acceptable (4+ hour RTO)
You can tolerate sequential restore complexity

Use differential backups when:

You need balanced cost and speed
RTO is moderate (2-4 hours)
Daily change rate is predictable
You are backing up databases (the most common choice there)

Common Pattern: Full + Differential

Many organizations combine strategies:

Weekly schedule:
  Sunday: Full backup (100 GB)
  Monday: Differential (10 GB since Sunday)
  Tuesday: Differential (20 GB since Sunday)
  Wednesday: Differential (30 GB since Sunday)
  Thursday: Differential (40 GB since Sunday)
  Friday: Differential (50 GB since Sunday)
  Saturday: Differential (60 GB since Sunday)
  Next Sunday: Full backup (reset cycle)

Recovery at any point:

Restore most recent full backup (Sunday’s 100 GB)
Apply most recent differential (e.g., Thursday’s 40 GB)
Two steps maximum

Storage growth:

Week 1: 100 + 10 + 20 + 30 + 40 + 50 + 60 = 310 GB
Manageable with retention policies (keep 4 weeks = ~1.2 TB)
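
The storage figures above can be reproduced for your own database size and change rate. This sketch assumes, as the example does, that each day changes a fresh 10% of the data, so differentials grow linearly; both function names are illustrative.

```python
def weekly_storage_gb(full_gb: float, daily_change_gb: float, differential_days: int = 6) -> float:
    """One full backup plus a differential that grows by daily_change_gb each day."""
    differentials = sum(daily_change_gb * day for day in range(1, differential_days + 1))
    return full_gb + differentials


def retained_storage_gb(weeks_kept: int, full_gb: float, daily_change_gb: float) -> float:
    """Total storage when whole weekly cycles are kept for weeks_kept weeks."""
    return weeks_kept * weekly_storage_gb(full_gb, daily_change_gb)


if __name__ == "__main__":
    print(weekly_storage_gb(100, 10))        # 310
    print(retained_storage_gb(4, 100, 10))   # 1240
```

In practice the same blocks are often modified repeatedly, so real differentials tend to be smaller than this linear estimate.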

Storage Architecture: On-Premises vs Cloud vs Hybrid

Where you store backups affects cost, recovery speed, and resilience.

On-Premises Backup Storage

Architecture:

Local NAS or SAN storage
Tape libraries for long-term retention
Physical transportation for off-site copies

Advantages:

Fast local recovery (LAN speeds: roughly 1 GB/sec on 10 GbE)
No ongoing cloud costs
Full control over infrastructure
Good for multi-terabyte datasets

Disadvantages:

Capital expenditure (hardware purchase)
Physical off-site requires tape transport
Geographic disaster risk (fire, flood destroys both primary and backup)
Manual operational overhead

Cost model:

Initial investment:
  NAS hardware (10 TB): $5,000
  Tape library: $15,000
  Total CAPEX: $20,000

Annual costs:
  Power and cooling: $1,200/year
  Tape media replacement: $1,000/year
  Off-site tape transport: $2,400/year
  Maintenance: $1,000/year
  Total OPEX: $5,600/year

5-year TCO: $48,000

When to use:

Large on-premises datasets (5+ TB)
Fast recovery required (RTO < 2 hours)
Existing data center infrastructure
Regulatory data residency requirements

Cloud Backup Storage

Architecture:

AWS S3, Azure Blob Storage, Google Cloud Storage
Automatic replication across multiple availability zones (cross-region replication is optional)
Lifecycle policies for tiering (Standard → Glacier)

Advantages:

No capital expenditure (pay-as-you-go)
Automatic off-site redundancy
AWS S3: 99.999999999% (11 nines) durability
Scales automatically with data growth
Built-in immutability (S3 Object Lock)

Disadvantages:

Network bandwidth limits recovery speed
Egress costs for large restores
Ongoing monthly costs scale with data
Internet dependency

Cost model:

Cloud storage (AWS S3):
  10 TB S3 Standard: $230/month
  10 TB S3 Glacier (archived): $40/month

  Retrieval cost (full restore):
    10 TB egress: $920 (one-time, disaster scenario)

  Annual cost (Standard): $2,760/year
  Annual cost (Glacier): $480/year + retrieval when needed

5-year TCO (Standard): $13,800
5-year TCO (Glacier): $2,400 + occasional retrieval costs

When to use:

Variable or growing backup volumes
Geographic redundancy required
Small to medium datasets (under 5 TB)
Organizations without on-premises infrastructure
Startups and cloud-native applications

Hybrid Approach (Recommended)

Most resilient: combine local and cloud storage.

Architecture:

Primary backup: Local NAS (fast recovery)
     ↓
Replication: AWS S3 (off-site protection)
     ↓
Archive: S3 Glacier (long-term retention)
     ↓
Air-gap: Offline tape vault (ransomware defense)

Example implementation:

Daily workflow:
1. Differential backup to local NAS (100 GB, 15 mins)
2. Automatic replication to S3 (background, 2 hours)
3. Weekly full backup (500 GB) to NAS
4. Monthly full backup to tape, stored off-site

Recovery scenarios:
  - Quick recovery (database restore): Use local NAS (30 mins RTO)
  - Disaster (data center fire): Restore from S3 (4 hours RTO)
  - Ransomware: Restore from air-gapped tape (24 hours RTO)

Cost model:

Hybrid infrastructure:
  Local NAS (10 TB): $5,000 one-time
  AWS S3 (10 TB Standard): $230/month
  Tape library (long-term): $5,000 one-time

  Annual costs:
    Local power/maintenance: $2,000/year
    AWS S3: $2,760/year
    Tape transport: $2,400/year
    Total: $7,160/year + $10K CAPEX

5-year TCO: $45,800
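
The three cost models use the same formula, so they can be compared side by side. The figures below are the ones quoted in this section; `five_year_tco` is a hypothetical helper, and the cloud figure leaves out egress and retrieval charges, as the cost model above does.

```python
def five_year_tco(capex: float, annual_opex: float, years: int = 5) -> float:
    """Up-front hardware cost plus running costs over the period."""
    return capex + annual_opex * years


# (CAPEX, annual OPEX) in dollars, copied from the cost models in this section.
OPTIONS = {
    "on-premises": (20_000, 5_600),
    "cloud (S3 Standard)": (0, 2_760),
    "hybrid": (10_000, 7_160),
}

if __name__ == "__main__":
    for name, (capex, opex) in OPTIONS.items():
        print(f"{name}: ${five_year_tco(capex, opex):,.0f}")
```

Changing `years` shows where the lines cross: options with high CAPEX and low OPEX look better over longer horizons.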

Benefits:

Fast operational recovery (local NAS)
Geographic disaster protection (S3)
Ransomware defense (offline tape)
Flexible cost optimization

Immutable Backups vs Air-Gapped Backups

Modern ransomware targets backups before encrypting production data. You need backups attackers can’t delete.

Immutable Backups

How it works: Backup objects protected by retention locks. Even administrators can’t modify or delete until retention expires.

AWS S3 Object Lock example:

{
  "ObjectLockConfiguration": {
    "ObjectLockEnabled": "Enabled",
    "Rule": {
      "DefaultRetention": {
        "Mode": "COMPLIANCE",
        "Days": 30
      }
    }
  }
}

No user, administrator, or automated process can delete locked object versions for 30 days. Even the AWS root account can’t override compliance mode.
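
The same default retention rule can be applied from code. The sketch below assumes the boto3 SDK and a bucket that was created with Object Lock enabled. The bucket name is a placeholder, and `object_lock_request` and `apply_object_lock` are illustrative helpers, not AWS functions.

```python
def object_lock_request(bucket: str, days: int, mode: str = "COMPLIANCE") -> dict:
    """Build keyword arguments for boto3's S3 put_object_lock_configuration call."""
    if mode not in ("COMPLIANCE", "GOVERNANCE"):
        raise ValueError("mode must be COMPLIANCE or GOVERNANCE")
    if days < 1:
        raise ValueError("retention must be at least one day")
    return {
        "Bucket": bucket,
        "ObjectLockConfiguration": {
            "ObjectLockEnabled": "Enabled",
            "Rule": {"DefaultRetention": {"Mode": mode, "Days": days}},
        },
    }


def apply_object_lock(bucket: str, days: int) -> None:
    """Apply the default retention rule. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    s3 = boto3.client("s3")
    s3.put_object_lock_configuration(**object_lock_request(bucket, days))


if __name__ == "__main__":
    # "my-backup-bucket" is a placeholder; this only prints, no AWS call is made.
    print(object_lock_request("my-backup-bucket", days=30))
```

Object versions locked in compliance mode cannot have their retention shortened, so it may be worth trying the rule in GOVERNANCE mode or with a short period first.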

Advantages:

Always online (fast recovery)
Prevents accidental or malicious deletion
Works with automated backups
Scales in cloud environments

Disadvantages:

Still reachable over the network (stolen credentials can read backups or tamper with anything not under a lock)
Retention period must be set carefully
Storage costs for full retention period

When to use:

Primary ransomware defense
Fast recovery requirements (hours)
Cloud-native architectures
Automated continuous backups

Air-Gapped Backups

How it works: Physical disconnection from network. Requires human access to retrieve.

Implementation:

Tape library with offline storage
USB drives in locked safe
Dedicated backup repository with no network connection

Advantages:

Maximum isolation from cyberattacks
No network credential compromise can reach it
Excellent for compliance and long-term retention
Ultimate insurance policy

Disadvantages:

Slow recovery (hours to days)
Manual operations required
Higher operational overhead
Less suitable for frequent backups

When to use:

Final backup layer (last resort)
High ransomware risk
Compliance long-term retention
Monthly/quarterly full backups

Combined Strategy (Veeam Recommendation)

Use both for maximum protection:

Backup tiers:
1. Local NAS (fast recovery): RTO 30 mins
2. Immutable S3 (ransomware defense): RTO 2 hours
3. Air-gapped tape (ultimate protection): RTO 24 hours

Recovery decision tree:

Database crashed?
  ├─ No ransomware suspected → Restore from local NAS (30 mins)
  └─ Ransomware suspected → Assess damage:
      ├─ NAS intact → Restore from NAS, verify integrity
      ├─ NAS compromised → Restore from immutable S3 (2 hours)
      └─ S3 credentials compromised → Restore from air-gapped tape (24 hours)

This layered approach means you can recover from:

Accidental deletion (local NAS)
Ransomware (immutable S3)
Sophisticated attack (air-gapped tape)
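
Encoding the decision tree keeps the choice out of one person’s head during an incident. This is a direct, minimal translation of the tree above; the function name and its boolean inputs are assumptions about how you would capture the assessment.

```python
def choose_restore_source(
    ransomware_suspected: bool,
    nas_intact: bool = True,
    s3_credentials_compromised: bool = False,
) -> str:
    """Pick the fastest backup tier that is still trustworthy."""
    if not ransomware_suspected:
        return "local NAS"
    if nas_intact:
        return "local NAS (verify integrity first)"
    if not s3_credentials_compromised:
        return "immutable S3"
    return "air-gapped tape"


if __name__ == "__main__":
    print(choose_restore_source(ransomware_suspected=False))
    print(choose_restore_source(True, nas_intact=False))
    print(choose_restore_source(True, nas_intact=False, s3_credentials_compromised=True))
```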

Point-in-Time Recovery (PITR) for Databases

Standard backups capture snapshots at specific times. PITR lets you recover to any moment.

How PITR Works

The mechanism:

Take a base backup (full copy of database)
Continuously archive transaction logs (every change recorded)
To recover: Restore base backup, then replay transactions to desired point

PostgreSQL example:

Transaction timeline:

3:00 AM Sunday: Base backup taken
10:30 AM Monday: Customer deposits $5,000 (transaction logged)
2:45 PM Tuesday: Ransomware corrupts database

Recovery: Restore to 10:31 AM Monday (after deposit, before ransomware)
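
The mechanism can be illustrated with a toy model: a snapshot plus an ordered log of changes, replayed up to a target time. This is not how PostgreSQL stores WAL; it is a sketch of the idea only. The calendar dates, the starting balance of $1,000, and the `recover_to` function are made up for the illustration.

```python
from datetime import datetime


def recover_to(base_snapshot: dict, log: list[tuple[datetime, str, object]], target: datetime) -> dict:
    """Restore the base snapshot, then replay logged changes up to `target`."""
    state = dict(base_snapshot)  # step 1: restore the base backup
    for timestamp, key, value in sorted(log, key=lambda record: record[0]):
        if timestamp > target:
            break                # step 2: stop replay at the recovery target
        state[key] = value
    return state


if __name__ == "__main__":
    base = {"balance": 1000}  # state at the Sunday 3:00 AM base backup
    log = [
        (datetime(2024, 11, 18, 10, 30), "balance", 6000),        # Monday deposit of $5,000
        (datetime(2024, 11, 19, 14, 45), "balance", "GARBAGE"),   # Tuesday corruption
    ]
    print(recover_to(base, log, datetime(2024, 11, 18, 10, 31)))  # {'balance': 6000}
```

Choosing a target one minute after the deposit keeps the legitimate transaction and drops everything the attack wrote afterwards.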

PostgreSQL PITR Configuration

Enable WAL archiving:

# postgresql.conf
wal_level = replica
archive_mode = on
# Refuse to overwrite a WAL file that is already in the archive
archive_command = 'test ! -f /backup/wal_archive/%f && cp %p /backup/wal_archive/%f'

Create base backup:

pg_basebackup -D /backup/base -Ft -z -P

This creates a compressed base backup.

Archive location:

/backup/
  ├── base/               (full database snapshot)
  └── wal_archive/        (continuous transaction logs)
      ├── 000000010000000000000001
      ├── 000000010000000000000002
      └── ...

Recovery process:

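# 1. Stop PostgreSQL
systemctl stop postgresql

# 2. Clear data directory
rm -rf /var/lib/postgresql/14/main/*

# 3. Restore base backup
tar -xzf /backup/base/base.tar.gz -C /var/lib/postgresql/14/main/

# 4. Add recovery settings. PostgreSQL 12+ reads these from the server
#    configuration; recovery.signal is only an empty marker file.
cat >> /var/lib/postgresql/14/main/postgresql.auto.conf << EOF
restore_command = 'cp /backup/wal_archive/%f %p'
recovery_target_time = '2024-11-18 10:31:00'
recovery_target_action = 'promote'
EOF
touch /var/lib/postgresql/14/main/recovery.signal
chown -R postgres:postgres /var/lib/postgresql/14/main

# 5. Start PostgreSQL (automatic recovery)
systemctl start postgresql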

On startup, PostgreSQL automatically:

Starts from the restored base backup
Replays WAL files up to 10:31 AM Monday
Stops at the target time
Makes the database available

Result: Database at exact state from 10:31 AM Monday.

PITR Benefits and Trade-offs

Benefits:

RPO near-zero (recover to any second covered by archived WAL)
Surgical recovery (restore to exact point before corruption)
No need for frequent full backups

Trade-offs:

More complex setup
Requires continuous WAL archiving
More storage (WAL files grow over time)
Slower recovery than recent full backup

When to use PITR:

Financial systems (transaction precision required)
Compliance requirements (audit trail)
Low RPO requirements (< 1 hour)
Databases where data loss is expensive

When full backups are sufficient:

Internal tools (data loss acceptable)
Infrequently changing data
Non-critical systems

Backup Testing and Validation

Untested backups are worse than no backups. They create false confidence.

Preston’s observation:

“Fewer than 50% of companies with disaster recovery plans test them at all, which is described as a huge mistake.”

Why Backups Fail Untested

Real failure scenarios discovered during actual disasters:

Backup compression slowed restores, making RTO unachievable (Preston’s personal experience)
Missing dependencies (database version incompatible with backup)
Credentials expired or incorrect
Backup files corrupt but verification never caught it
Network bandwidth insufficient for large restores
Person who knew recovery procedure left company

Progressive Testing Strategy

Preston’s recommendation: Start small, build up.

Don’t try full-scale disaster recovery first. Build confidence incrementally.

Level 1: Basic File Restore (Monthly)

# Pick a random file from backup
# Restore it to test environment
# Verify contents match original

# Example (restore_file is a placeholder for your backup tool's restore command)
restore_file --backup-id=daily-2024-11-16 \
  --file=/var/app/config.json \
  --destination=/tmp/test-restore/

diff /var/app/config.json /tmp/test-restore/config.json
# No differences = success

Time: 5 minutes
Risk: Zero (not touching production)
Value: Confirms basic restore mechanics work

Level 2: Database Restore (Monthly)

# Restore database to separate test instance
# Verify data integrity
# Document time taken

# Example (PostgreSQL)
pg_restore -d test_recovery /backup/latest.dump

# Verify row counts match
psql test_recovery -c "SELECT COUNT(*) FROM users;"
# Compare to production count

# Time the restore
# Document: "500 GB restore took 2 hours 15 minutes"

Time: 30-60 minutes
Risk: Low (separate test database)
Value: Validates database backups, measures restore time

Level 3: Application Recovery (Quarterly)

Full application stack restore:
1. Provision clean infrastructure
2. Restore database from backup
3. Restore application files
4. Start services
5. Run smoke tests
6. Time entire process

Example checklist:

□ Spin up test EC2 instance
□ Restore database (time: ___ hours)
□ Restore application code from Git tag
□ Restore configuration files from backup
□ Start application services
□ Smoke tests:
  □ Can users log in?
  □ Can users view their data?
  □ Can users create new records?
□ Total time: ___ hours ___ minutes

Time: 2-4 hours
Risk: Medium (requires coordination)
Value: Validates full recovery process, uncovers missing steps

Level 4: Disaster Recovery Exercise (Annually)

Simulate actual disaster:
1. Declare test disaster (communication drills)
2. Full recovery to production-like environment
3. Different team member leads recovery
4. Stakeholder involvement
5. Post-mortem: what went wrong, what to improve

Requirements:

Recovery runbooks (documented procedures)
Multiple people capable of executing recovery
Communication plan (who to notify, escalation paths)
Success criteria defined in advance

Time: 1 full day
Risk: High coordination effort
Value: Validates RTO/RPO, uncovers gaps in planning

Automated Verification

Manual testing is good. Automated testing is better.

Daily automated checks:

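#!/bin/bash
# backup-verify.sh
set -u

# Replace the body with your paging or chat integration
alert() {
  echo "ALERT: $*" >&2
}

# Check backup completed successfully
if [ ! -f /backup/latest.flag ]; then
  alert "Backup did not complete"
  exit 1
fi

# Check backup file exists and has content
BACKUP_FILE="/backup/db-$(date +%Y%m%d).sql.gz"
if [ ! -s "$BACKUP_FILE" ]; then
  alert "Backup file missing or empty"
  exit 1
fi

# Check backup file size (should be > 100MB)
# wc -c works on both Linux and macOS; stat flags differ between them
SIZE=$(wc -c < "$BACKUP_FILE" | tr -d '[:space:]')
if [ "$SIZE" -lt 104857600 ]; then
  alert "Backup file suspiciously small: ${SIZE} bytes"
  exit 1
fi

# Check the gzip stream is intact before restoring
if ! gunzip -t "$BACKUP_FILE"; then
  alert "Backup file failed gzip integrity check"
  exit 1
fi

# Test restore to a fresh temp database (sample check)
dropdb --if-exists temp_restore_check
createdb temp_restore_check
if ! gunzip -c "$BACKUP_FILE" | psql -q -v ON_ERROR_STOP=1 -d temp_restore_check > /dev/null; then
  alert "Test restore failed"
  exit 1
fi

# Verify row count (-A -t prints the bare number)
COUNT=$(psql -d temp_restore_check -At -c "SELECT COUNT(*) FROM users;")
if [ "${COUNT:-0}" -lt 1000 ]; then
  alert "Restored database has too few rows: ${COUNT}"
  exit 1
fi

echo "Backup verified successfully"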

Run this daily. If it fails, you know immediately.
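
Size and row-count checks do not catch a file that was silently corrupted after it was written. One common complement is to record a checksum when the backup is created and compare it before every restore. The sketch below assumes a sidecar file named `<backup>.sha256`; that naming convention and the function names are choices made for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash the file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum(backup: Path) -> Path:
    """Store the checksum next to the backup, right after it is created."""
    sidecar = backup.with_name(backup.name + ".sha256")
    sidecar.write_text(sha256_of(backup) + "\n")
    return sidecar


def verify_checksum(backup: Path) -> bool:
    """Return True only if the sidecar exists and still matches the file."""
    sidecar = backup.with_name(backup.name + ".sha256")
    if not sidecar.exists():
        return False
    return sidecar.read_text().strip() == sha256_of(backup)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        backup = Path(tmp) / "db.sql.gz"
        backup.write_bytes(b"pretend this is a database dump")
        write_checksum(backup)
        print("intact:", verify_checksum(backup))      # True
        backup.write_bytes(b"bit rot")
        print("after corruption:", verify_checksum(backup))  # False
```

Keeping the checksum somewhere the backup host cannot overwrite makes it harder for an attacker to alter both the file and its recorded hash.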

Disaster Recovery Strategies by RTO/RPO

Different systems need different recovery strategies.

Cold Standby (RTO: Days, RPO: 24+ hours)

Architecture:

Backup resources exist but are offline
Manual activation required
Minimal ongoing cost

Example:

Production database crashes
↓
1. Provision new hardware (1-2 days if ordering required)
2. Install and configure database software (4 hours)
3. Restore from latest backup (6 hours)
4. Reconfigure application to new database (2 hours)
5. Validate and switch traffic (1 hour)

Total RTO: 2-3 days

When to use:

Non-critical internal tools
Systems with minimal business impact
Budget-constrained scenarios

Cost: Low (backups only, no standby infrastructure)

Warm Standby (RTO: Hours, RPO: 1-4 hours)

Architecture:

Secondary infrastructure provisioned but not fully active
Database replication with lag
Manual or automated activation

Example:

Production region fails
↓
1. Detect failure (10 mins)
2. Approve failover (15 mins)
3. Activate standby region:
   - Promote read replica to primary (5 mins)
   - Start application servers (10 mins)
   - Update DNS to standby region (15 mins propagation)
4. Validate services (30 mins)

Total RTO: ~90 minutes

When to use:

Business-critical systems
Acceptable brief outages
Moderate budget

Cost: Medium (standby infrastructure + replication)

Hot Standby (RTO: Minutes, RPO: Near-zero)

Architecture:

Active-passive: standby actively replicating, instant failover
Real-time data replication

Example:

Production database fails
↓
1. Automatic detection (30 seconds)
2. Automatic failover to standby (2 mins)
3. Validation (5 mins)

Total RTO: ~8 minutes
Total RPO: < 1 minute (replication lag)

When to use:

Mission-critical systems (financial, healthcare)
Zero-tolerance for data loss
High budget

Cost: High (duplicate infrastructure running continuously)

Multi-Region Active-Active (RTO: Seconds, RPO: Near-zero)

Architecture:

Multiple regions serving traffic simultaneously
Automatic failover, users don’t notice

Example:

AWS us-east-1 region fails
↓
1. Load balancer detects failure (5 seconds)
2. Routes traffic to us-west-2 (automatic)
3. Users experience no downtime

Total RTO: < 10 seconds
Total RPO: Near-zero (real-time replication)

When to use:

Global services requiring high availability
E-commerce platforms
SaaS applications

Cost: Very high (full infrastructure in multiple regions)

Decision Framework

RTO Requirement	RPO Requirement	Strategy	Example Use Case
Days	24+ hours	Cold standby	Internal wikis, dev environments
Hours	1-4 hours	Warm standby	CRM, reporting tools
Minutes	Near-zero	Hot standby	Customer-facing apps
Seconds	Near-zero	Active-active	E-commerce, financial systems

Cost increases dramatically with tighter RTO/RPO:

Cold: ~$500/month (backups only)
Warm: ~$5,000/month (standby + replication)
Hot: ~$15,000/month (full duplicate infrastructure)
Active-active: ~$30,000/month (multi-region full deployment)
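
Because cost rises with each tier, a reasonable default is the cheapest strategy that still meets the required RTO. The sketch below does that lookup. The achievable RTO for each tier is an assumption taken from the worked examples in this section (2 days, about 90 minutes, about 8 minutes, under 10 seconds), and the monthly costs are the rough figures listed above.

```python
from datetime import timedelta

# (strategy, assumed achievable RTO, approximate monthly cost in dollars),
# ordered from cheapest to most expensive.
STRATEGIES = [
    ("cold standby", timedelta(days=2), 500),
    ("warm standby", timedelta(minutes=90), 5_000),
    ("hot standby", timedelta(minutes=8), 15_000),
    ("active-active", timedelta(seconds=10), 30_000),
]


def cheapest_strategy(required_rto: timedelta) -> tuple[str, int]:
    """Return the lowest-cost tier whose assumed RTO fits the requirement."""
    for name, achievable_rto, monthly_cost in STRATEGIES:
        if achievable_rto <= required_rto:
            return name, monthly_cost
    raise ValueError("no listed strategy meets this RTO")


if __name__ == "__main__":
    print(cheapest_strategy(timedelta(hours=4)))     # ('warm standby', 5000)
    print(cheapest_strategy(timedelta(minutes=30)))  # ('hot standby', 15000)
```

A fuller version would also check RPO; RTO alone is used here to keep the example short.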

Cost-Benefit Analysis

Backup infrastructure costs money. Make informed decisions.

Time Investment

Initial setup:

Basic automated backups: 4-8 hours
Hybrid backup strategy: 16-24 hours
PITR configuration: 8-16 hours
Disaster recovery testing: 8 hours (first time)

Ongoing maintenance:

Monitoring backup health: 2 hours/month
Testing restores: 4 hours/month
Updating runbooks: 2 hours/month
Total: ~8 hours/month

Return on Investment

Immediate benefits:

Sleep better (backups exist and are tested)
Faster recovery from accidental deletion
Compliance checkbox (backups required for most standards)

Medium-term (3-6 months):

Confidence in recovery procedures
Reduced recovery time (tested and optimized)
Team knowledge distributed (not single person)

Long-term (1+ year):

Prevented data loss incidents
Successful disaster recovery (when needed)
Reduced insurance costs (better risk profile)

When to Skip Advanced Backup Strategies

Not every system needs enterprise-grade backups.

Skip complex backup infrastructure if:

Data is easily recreated (development environments)
System is non-critical (internal demo apps)
Budget is very limited
Data volume is tiny (< 10 GB)

Simple approach for these cases:

# Daily cron job (dump at 02:00)
0 2 * * * pg_dump mydb | gzip > /backup/mydb-$(date +\%Y\%m\%d).sql.gz

# Keep 7 days (prune at 02:30)
30 2 * * * find /backup -name "mydb-*.sql.gz" -mtime +7 -delete

# Weekly copy to S3
0 3 * * 0 aws s3 cp /backup/ s3://my-backups/ --recursive

Good enough for non-critical systems.

Progressive Enhancement Path

Build backup infrastructure incrementally.

Month 1-2: Foundation

Weeks 1-4:

Identify critical systems requiring backup
Define RTO/RPO requirements (business conversation)
Set up automated daily backups (database, files)
Configure backup monitoring/alerting

Weeks 5-8:

Document recovery procedures (runbook)
Perform first test restore (Level 1: file restore)
Calculate actual storage requirements
Set up backup retention policy

Outcome: Basic backups running and verified

Month 3-4: Optimization

Weeks 9-12:

Implement off-site backup replication (cloud)
Test database restore (Level 2)
Measure actual restore time (compare to RTO)
Set up immutable backups (ransomware defense)

Weeks 13-16:

Optimize backup schedule (full + differential)
Implement automated verification
Train additional team members on recovery
Conduct tabletop exercise (discuss disaster scenarios)

Outcome: Tested, optimized backup strategy with off-site protection

Month 5-6: Advanced

Weeks 17-20:

Implement PITR for critical databases (if needed)
Set up air-gapped backup tier (tape or offline)
Full disaster recovery exercise (Level 4)
Document gaps and improvements

Weeks 21-24:

Automate recovery procedures (infrastructure-as-code)
Implement monitoring dashboard (backup health)
Review and update RTO/RPO based on tests
Create quarterly testing schedule

Outcome: Enterprise-grade backup and disaster recovery capability

Summary

Key takeaways:

RTO and RPO are business decisions, not technical ones. Attach costs to recovery objectives. A four-hour RTO might actually require 12+ hours when you account for all steps.
Test your backups or they don’t exist. Start with simple file restores, build up to full disaster recovery exercises. Untested backups create false confidence.
Choose backup types based on trade-offs. Full backups are fast to restore but expensive. Incremental backups are cheap but slow to restore. Differential balances both.
Hybrid storage combines speed and resilience. Local NAS for fast recovery, cloud for off-site protection, air-gapped tape for ransomware defense.
Immutable backups defend against ransomware. Combined with air-gapped backups, you can survive sophisticated attacks.
PITR reduces RPO to near-zero. For critical databases, point-in-time recovery enables surgical restoration to exact moments.

Start here:

Define RTO/RPO with business stakeholders (attach dollar values)
Set up automated backups with monitoring
Perform your first test restore this week

For deeper understanding:

Deep Water covers advanced PITR implementation, chaos engineering for backup validation, and multi-region disaster recovery architectures
