Deployment Strategy
Deployment strategy determines how you move code from development to production. The strategy you pick affects downtime risk, rollback speed, infrastructure cost, and how many users get hit if something breaks.
According to research from Waze, canary releases prevent about 25% of all incidents on their services. Your choice of deployment strategy has measurable business impact.
The Core Question
When you deploy, you’re deciding: should everyone get the new version at once, or should some smaller group test it first?
There are seven main approaches. Each trades off cost, speed, and safety differently.
Seven Deployment Strategies
1. Blue-Green Deployment
How it works: Run two identical production environments. Blue is live. Deploy to Green, test it, then switch all traffic at once.
When to use: You can afford 2× infrastructure and want instant rollback.
- Downtime: None
- Rollback: Flip traffic back (instant)
- Cost: High (doubles infrastructure)
Real scenario: Payment processing system that can’t have any downtime during deployment. Deploy to Green, run final checks, switch traffic. If Green breaks, switch back to Blue in seconds.
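The switch itself can be pictured as moving a single pointer between two environments. The sketch below is a toy in-memory model, not a real load balancer API; the class `BlueGreenRouter` and its method names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BlueGreenRouter:
    """Tracks which of two identical environments receives live traffic."""
    live: str = "blue"
    versions: dict = field(default_factory=lambda: {"blue": "v1", "green": "v1"})

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def deploy_to_idle(self, version: str) -> None:
        # New code only ever lands on the environment that serves no users.
        self.versions[self.idle] = version

    def switch(self) -> None:
        # Cutover and rollback are the same operation: flip the pointer.
        self.live = self.idle

    def serving_version(self) -> str:
        return self.versions[self.live]


router = BlueGreenRouter()
router.deploy_to_idle("v2")               # Green runs v2, Blue still serves users
router.switch()                           # all traffic moves to Green at once
after_cutover = router.serving_version()
router.switch()                           # rollback: flip back to Blue
after_rollback = router.serving_version()
```

Because the old environment is left untouched after the cutover, rolling back needs no rebuild or redeploy, which is why the rollback is listed as instant.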
2. Rolling Deployment
How it works: Replace old instances with new ones gradually, one or a few at a time.
When to use: Limited infrastructure budget but still want zero downtime.
- Downtime: None
- Rollback: Reverse the rolling update (5-15 minutes)
- Cost: Low (same infrastructure)
Real scenario: API service with 10 instances. Update 2 instances, watch them, update 2 more, repeat. Old and new versions run side-by-side for a while.
Catch: Old and new versions run simultaneously. If v2 frontend can’t talk to v1 backend, users see errors.
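As a rough sketch of the batch-by-batch loop described above, assuming instances can be modeled as a list of version strings and that a caller-supplied `is_healthy` check exists (both are simplifications made up for this example):

```python
def rolling_update(instances, new_version, batch_size, is_healthy):
    """Update instances in place, one batch at a time.

    Stops at the first batch that fails its health check and returns False,
    leaving the remaining instances on the old version.
    """
    for start in range(0, len(instances), batch_size):
        batch = range(start, min(start + batch_size, len(instances)))
        for i in batch:
            instances[i] = new_version
        if not all(is_healthy(instances[i]) for i in batch):
            return False
    return True


fleet = ["v1"] * 10
completed = rolling_update(fleet, "v2", batch_size=2, is_healthy=lambda v: True)

broken_fleet = ["v1"] * 10
halted = rolling_update(
    broken_fleet, "v2-bad", batch_size=2, is_healthy=lambda v: v != "v2-bad"
)
```

In the failing case only the first batch of 2 instances carries the bad version; the other 8 keep serving the old one, which is the mixed-version state the catch above warns about.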
3. Canary Deployment
How it works: Send 1-5% of traffic to new version. Monitor. If metrics look good, gradually increase to 100%.
When to use: Catching problems early with minimal user impact.
- Downtime: None
- Rollback: Stop sending traffic to canary (instant)
- Cost: Medium (1.05× infrastructure)
Real scenario: Social media app with 1 million users. Deploy to 5% (50,000 users). If error rate spikes, only 50,000 people saw the problem. Roll back before it hits everyone.
What to monitor: Error rate, latency, failed health checks. Compare canary metrics to stable version metrics.
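A minimal sketch of the two moving parts, weighted routing and a gradual ramp, might look like this. The function names, the ramp steps (5, 25, 50, 100), and the `metrics_ok` callback are assumptions for illustration; in practice the split is usually configured in a load balancer or service mesh rather than in application code.

```python
import random


def pick_backend(canary_percent, rng=random.random):
    """Send roughly canary_percent of requests to the canary, the rest to stable."""
    return "canary" if rng() * 100 < canary_percent else "stable"


def ramp(steps, metrics_ok):
    """Walk through increasing traffic percentages.

    Returns the final percentage if every step looked healthy, or 0 if a step
    failed, meaning traffic to the canary is stopped (the rollback).
    """
    for percent in steps:
        if not metrics_ok(percent):
            return 0
    return steps[-1]


promoted = ramp((5, 25, 50, 100), metrics_ok=lambda percent: True)
rolled_back = ramp((5, 25, 50, 100), metrics_ok=lambda percent: percent < 50)
```

Here `metrics_ok` stands in for the comparison of canary metrics against the stable version at each step.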
4. Recreate Deployment
How it works: Stop all old instances, deploy new ones.
When to use: Batch jobs or off-hours deployments where downtime is acceptable.
- Downtime: Yes (during deployment)
- Rollback: Restart old version (5-30 minutes)
- Cost: Lowest (no extra infrastructure)
Real scenario: Nightly data processing job. Stop it at 2am, deploy new version, restart. Users aren’t using it anyway.
Real-World Example: Single-Server Docker Compose Pattern
Not every application needs Kubernetes on day one. A single-server Docker Compose deployment can handle 100+ concurrent users and is appropriate when validating product-market fit.
A dispatch management application used this pattern at launch:
Deployment Architecture:
- Single EC2 t3.medium (2 vCPU, 4GB RAM)
- Docker Compose with 4 containers
- Cost: ~$50/month
docker-compose.yml:
version: '3.8'
services:
  nginx:
    image: nginx:alpine
    ports:
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./certs:/etc/nginx/certs:ro
      - ./static:/usr/share/nginx/html:ro
    depends_on:
      - flask-app
  flask-app:
    build: ./backend
    environment:
      - DATABASE_URL=postgresql://dispatch:${DB_PASSWORD}@postgres:5432/dispatch
      - KEYCLOAK_URL=https://keycloak:8443
    secrets:
      - db_password
      - keycloak_client_secret
    depends_on:
      - postgres
      - keycloak
  keycloak:
    image: quay.io/keycloak/keycloak:latest
    environment:
      - KC_DB=postgres
      - KC_DB_URL=jdbc:postgresql://postgres:5432/keycloak
      - KC_DB_USERNAME=keycloak
      - KC_DB_PASSWORD=${KEYCLOAK_DB_PASSWORD}
    secrets:
      - keycloak_db_password
    depends_on:
      - postgres
  postgres:
    image: postgres:15-alpine
    environment:
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    secrets:
      - postgres_password
volumes:
  postgres_data:
secrets:
  db_password:
    file: ./secrets/db_password.txt
  keycloak_client_secret:
    file: ./secrets/keycloak_client_secret.txt
  keycloak_db_password:
    file: ./secrets/keycloak_db_password.txt
  postgres_password:
    file: ./secrets/postgres_password.txt
Deployment Process:
# Deploy new version
git pull origin main
docker-compose build
docker-compose up -d
# Verify deployment
docker-compose ps
curl -k https://localhost/health
# Rollback if needed
git checkout previous-tag
docker-compose build
docker-compose up -d
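A single health check right after the containers start can fail simply because the application is still booting. One way to make the verify step less flaky is to retry the probe a few times before deciding to roll back. The helper below is a sketch under that assumption; `wait_until_healthy` and its parameters are hypothetical, and `probe` is any callable that returns True when the health endpoint answers successfully.

```python
import time


def wait_until_healthy(probe, attempts=10, delay_seconds=3.0, sleep=time.sleep):
    """Call probe() until it returns True or attempts run out.

    Returns True as soon as one probe succeeds, False otherwise.
    The sleep function is injectable so the helper can be tested without waiting.
    """
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

If the helper returns False, the rollback commands above are the next step.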
When This Pattern Is Appropriate:
- Product-market fit validation (0-100 users)
- Known user base who understands it’s not enterprise-grade
- Downtime of 2-5 minutes during deployment is acceptable
- Recovery time of 4 hours from backup is acceptable
- Cost optimization is important (saves $400-900/month vs Kubernetes)
Evolution Triggers (when to move beyond this):
- Customer contracts require 99%+ uptime SLAs
- Traffic consistently exceeds single-server capacity (response times >5 seconds)
- Deployments become daily (current approach causes too much downtime)
- Need geographic distribution for global users
- Team size grows beyond 5 engineers
Key Insight: This isn’t a stopgap solution. It’s the right architecture for the stage. The team ran this for 6 months, acquired 50+ customers, and validated product-market fit before adding complexity. The $400-900/month saved by not running Kubernetes went toward customer development instead of infrastructure.
📌 See Complete Architecture: Dispatch Management - Surface Level
5. Shadow Deployment (Dark Launch)
How it works: Deploy new version, send it a copy of production traffic, but don’t return its responses to users. Stable version continues serving real responses.
When to use: Testing under real load without any user-facing risk.
- Downtime: None
- Rollback: Stop shadow traffic (immediate)
- Cost: High (2× infrastructure)
Real scenario: New search algorithm. Route copy of all search queries to new version, compare response times and results, but keep showing users the old search results. When confident the new algorithm is faster and better, switch.
Limitation: Can’t test writes/mutations (billing, payments). Only works for read-heavy or idempotent operations.
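The core rule of shadowing, that users only ever receive the stable response, can be sketched as follows. The handler arguments and the `mismatches` list are illustrative assumptions; a production setup would normally mirror traffic asynchronously at the proxy layer so the shadow call adds no latency.

```python
def handle_with_shadow(request, stable_handler, shadow_handler, mismatches):
    """Serve from stable; send a copy to the shadow version and record differences."""
    response = stable_handler(request)
    try:
        shadow_response = shadow_handler(request)
        if shadow_response != response:
            mismatches.append((request, response, shadow_response))
    except Exception as exc:
        # A shadow failure is recorded but must never reach the user.
        mismatches.append((request, response, repr(exc)))
    return response


def broken_shadow(query):
    raise RuntimeError("shadow down")


mismatches = []
same = handle_with_shadow("shoes", str.upper, str.upper, mismatches)
differs = handle_with_shadow("boots", str.upper, str.lower, mismatches)
survives = handle_with_shadow("hats", str.upper, broken_shadow, mismatches)
```

The recorded mismatches are what you review before deciding the new version is ready to serve real responses.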
6. A/B Testing Deployment
How it works: Route different users to different versions based on user ID, geography, or other criteria. Compare behavior.
When to use: Measuring whether new feature actually improves user behavior.
- Downtime: None
- Rollback: Remove traffic split (instant)
- Cost: Medium
- Duration: Days or weeks (need statistical significance)
Real scenario: E-commerce site testing two checkout flows. 50% of users see version A, 50% see version B. After 2 weeks, measure which version has higher conversion rate.
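For the comparison to be meaningful, each user should see the same variant on every visit. A common way to get that is to hash the user ID into a bucket. The sketch below assumes string user IDs and a hypothetical experiment name; it illustrates the idea rather than any particular experimentation platform.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, percent_b: int = 50) -> str:
    """Deterministically map a user to variant 'A' or 'B'.

    Hashing experiment + user ID keeps assignment stable across requests
    and independent between experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "B" if bucket < percent_b else "A"


first_visit = assign_variant("user-42", "checkout-flow")
second_visit = assign_variant("user-42", "checkout-flow")
```

Removing the traffic split then amounts to setting `percent_b` to 0.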
7. Ring Deployment
How it works: Roll out to concentric rings: Ring 0 (employees), Ring 1 (early adopters), Ring 2 (all users).
When to use: Large organizations where you can afford staged rollout over days.
- Downtime: None
- Rollback: Stop progression to next ring
- Cost: Medium-High
- Duration: 1-2 weeks
Real scenario: Enterprise software with millions of users. Deploy to 100 employees first. If stable after 24 hours, deploy to 10,000 early adopters. If stable after 3 days, deploy to all users.
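Ring progression can be modeled as an ordered list where each ring must stay healthy for a soak period before the next one opens. In this sketch the ring names and the `advance` function are hypothetical; the soak times follow the scenario above (24 hours, then 3 days).

```python
from dataclasses import dataclass


@dataclass
class Ring:
    name: str
    soak_hours: int  # how long the ring must stay stable before moving on


RINGS = [
    Ring("ring-0", 24),   # employees
    Ring("ring-1", 72),   # early adopters
    Ring("ring-2", 0),    # all users
]


def advance(rings, stable_hours, healthy):
    """Return the names of rings that may receive the release, in order.

    Progression stops at the first ring that is unhealthy or has not yet
    been stable for its full soak period.
    """
    released = []
    for ring in rings:
        released.append(ring.name)
        if not healthy.get(ring.name, False):
            break
        if stable_hours.get(ring.name, 0) < ring.soak_hours:
            break
    return released


early = advance(RINGS, {"ring-0": 5}, {"ring-0": True})
after_day_one = advance(RINGS, {"ring-0": 24}, {"ring-0": True})
```

Stopping progression is just the loop exiting early, which matches the rollback behavior listed above.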
Quick Decision Framework
| Your Situation | Use This Strategy |
|---|---|
| Absolutely can’t have downtime | Blue-green or canary |
| Limited infrastructure budget | Rolling deployment |
| Afraid to deploy | Feature flags + canary |
| Database schema changes | Blue-green (with expand-contract pattern) |
| Testing with real production load | Shadow deployment |
| Measuring user behavior impact | A/B testing |
| Large organization, many teams | Ring deployment + feature flags |
| Batch job, downtime OK | Recreate |
Deployment vs. Release: Critical Difference
Deployment = Technical: Getting code into production. Release = Business: Making features visible to users.
You can separate these with feature toggles (feature flags):
- Deploy code with feature hidden (toggle = off)
- Test in production while users don’t see it
- Turn toggle on for beta users
- Turn toggle on for everyone
- If problems: turn toggle off (instant rollback, no deployment)
This is how companies deploy multiple times per day while releasing features on a business schedule.
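The toggle sequence above can be sketched with a small in-memory flag store. The class `FeatureFlags` and the flag name are made up for this example; real systems typically back the store with a configuration service or a database so changes apply without a deployment.

```python
class FeatureFlags:
    """In-memory toggle store used only to illustrate the release sequence."""

    def __init__(self):
        self._flags = {}

    def set(self, name, enabled=False, beta_users=()):
        self._flags[name] = {"enabled": enabled, "beta_users": set(beta_users)}

    def is_enabled(self, name, user_id):
        flag = self._flags.get(name)
        if flag is None:
            return False  # unknown flags default to off
        return flag["enabled"] or user_id in flag["beta_users"]


flags = FeatureFlags()
flags.set("new-checkout")                            # deployed, hidden
hidden = flags.is_enabled("new-checkout", "alice")
flags.set("new-checkout", beta_users={"alice"})      # on for beta users
beta_alice = flags.is_enabled("new-checkout", "alice")
beta_bob = flags.is_enabled("new-checkout", "bob")
flags.set("new-checkout", enabled=True)              # on for everyone
released_bob = flags.is_enabled("new-checkout", "bob")
flags.set("new-checkout")                            # problems: turn it off
after_rollback = flags.is_enabled("new-checkout", "bob")
```

None of these state changes touches the deployed code, which is what separates the release from the deployment.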
What to Watch During Deployment
Monitor these metrics continuously:
Critical metrics:
- Error rate - Should stay constant or decrease
- Request latency (p99) - Shouldn’t rise more than 50% above baseline
- Failed health checks - Application-level checks
- Resource usage - CPU, memory shouldn’t degrade
When to roll back:
- Error rate increases >2-3% above baseline
- Latency p99 increases >50% above baseline
- Failed health checks exceed threshold
- New error messages in logs
Wait 5-10 minutes after deployment before analyzing. Systems often show a startup spike that would otherwise trigger a false alarm.
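These thresholds translate fairly directly into a check that can run against your metrics. The sketch below makes several assumptions that you should adjust: the error-rate threshold is read as 2 percentage points above baseline, the warm-up is 5 minutes, any failed health check counts, and metrics arrive as plain dictionaries. The function and key names are hypothetical.

```python
def rollback_reasons(baseline, current, minutes_since_deploy, warmup_minutes=5,
                     max_error_increase=0.02, max_latency_ratio=1.5,
                     max_failed_checks=0):
    """Return a list of reasons to roll back; an empty list means keep going.

    baseline / current: dicts with 'error_rate' (fraction of requests),
    'p99_ms', and 'failed_health_checks'.
    """
    if minutes_since_deploy < warmup_minutes:
        return []  # ignore the startup spike
    reasons = []
    if current["error_rate"] - baseline["error_rate"] > max_error_increase:
        reasons.append("error rate")
    if current["p99_ms"] > baseline["p99_ms"] * max_latency_ratio:
        reasons.append("p99 latency")
    if current["failed_health_checks"] > max_failed_checks:
        reasons.append("health checks")
    return reasons


baseline = {"error_rate": 0.01, "p99_ms": 200, "failed_health_checks": 0}
degraded = {"error_rate": 0.05, "p99_ms": 350, "failed_health_checks": 0}
during_warmup = rollback_reasons(baseline, degraded, minutes_since_deploy=2)
after_warmup = rollback_reasons(baseline, degraded, minutes_since_deploy=10)
```

New error messages in logs are left out here because detecting them depends on your logging setup.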
Common Deployment Disasters
| Disaster | Cause | Prevention |
|---|---|---|
| Big-bang failure | All changes at once; can’t isolate issues | Deploy small changes frequently |
| Friday outage | Deploying right before the team becomes unavailable | Deploy during business hours, early enough in the week that the team can respond |
| Database incompatibility | Old code can’t work with new schema | Use expand-contract pattern |
| Session loss | Users logged out during traffic shift | Use distributed session store (Redis) |
| No rollback | No way back when a deployment fails | Always keep the previous version running or ready to redeploy |
Rollback Decision: 5-Minute Framework
Deploy
  ↓
Wait 5-10 minutes (avoid false positives)
  ↓
Check metrics
  ↓
Metrics degraded >2-3%?
  ├─ YES → Affecting >10% users? → ROLLBACK
  │        Getting worse? → ROLLBACK
  │        Stable? → Monitor longer
  └─ NO → Continue deployment
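Written as a function, the decision tree above is only a few branches. The parameter names and the string values for `trend` are assumptions chosen for this sketch.

```python
def rollback_decision(degraded, affected_fraction, trend):
    """Map the decision tree to one of 'continue', 'rollback', or 'monitor'.

    degraded: True if metrics are worse than baseline beyond the threshold.
    affected_fraction: share of users affected, between 0.0 and 1.0.
    trend: 'worse' if metrics keep degrading, 'stable' otherwise.
    """
    if not degraded:
        return "continue"
    if affected_fraction > 0.10 or trend == "worse":
        return "rollback"
    return "monitor"
```

Encoding the tree ahead of time means the on-call engineer is not inventing the criteria in the middle of an incident.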
Rollback speed by strategy:
- Blue-green: <1 minute (flip traffic)
- Canary: <1 minute (stop canary traffic)
- Rolling: 5-15 minutes (redeploy old version)
- Feature flag: Milliseconds (disable flag)
Database Schema Changes: The Problem
You can’t just add a required column (NOT NULL with no default) to your database. If you do, old code breaks: its INSERT statements don’t supply a value for the new column, so they fail. But you can’t switch all running code to the new version instantaneously either.
Solution: Expand-Contract Pattern
Three deployments:
Deployment 1 - Expand:
- Add new column as nullable or with a default (old code ignores it)
Deployment 2 - Migrate:
- Deploy code that writes to both old and new columns
- Backfill existing data
- Deploy code that reads from new column
Deployment 3 - Contract:
- Remove old column
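The expand and migrate deployments can each be rolled back independently, and no downtime is required. The contract step is the exception: dropping a column destroys data, so run it only after nothing reads or writes the old column.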
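The three phases can be walked through end to end with an in-memory SQLite database. This is a sketch, not a migration tool: the table `users` and the columns `fullname` and `display_name` are invented for the example, and the contract step rebuilds the table because SQLite versions before 3.35 have no DROP COLUMN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT NOT NULL)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")

# Deployment 1 - Expand: the new column is nullable, so old INSERTs keep working.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
conn.execute("INSERT INTO users (fullname) VALUES ('Alan Turing')")  # old code path

# Deployment 2 - Migrate: new code writes both columns, then old rows are backfilled.
conn.execute(
    "INSERT INTO users (fullname, display_name) VALUES (?, ?)",
    ("Grace Hopper", "Grace Hopper"),
)
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")
unmigrated = conn.execute(
    "SELECT COUNT(*) FROM users WHERE display_name IS NULL"
).fetchone()[0]

# Deployment 3 - Contract: only once nothing reads or writes the old column.
conn.executescript("""
    CREATE TABLE users_new (id INTEGER PRIMARY KEY, display_name TEXT NOT NULL);
    INSERT INTO users_new (id, display_name) SELECT id, display_name FROM users;
    DROP TABLE users;
    ALTER TABLE users_new RENAME TO users;
""")
columns = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
names = [row[0] for row in conn.execute("SELECT display_name FROM users ORDER BY id")]
```

Checking that `unmigrated` is zero before the contract step is the guard that makes dropping the old column safe.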
Starting Point: Where You Are Now
Most teams start simple and evolve:
Stage 1: Manual deployments (high risk) → Automate with CI/CD pipeline
Stage 2: Rolling deployments (low cost) → Add monitoring and rollback procedures
Stage 3: Blue-green + feature toggles (safer) → Implement SLO monitoring and automated rollback
Stage 4: Canary + chaos engineering (mature) → Progressive delivery platforms, service mesh
You don’t need Stage 4 on day one. Start where your team can handle the complexity.
Key Takeaway
The deployment strategy you choose determines what happens when something breaks:
- Recreate: Everyone sees the problem, significant downtime
- Rolling: Problems spread gradually, 5-15 minute rollback
- Canary: 5% of users see problems, instant rollback
- Blue-green: Problems can be caught on Green before anyone sees them; anything that slips through hits everyone at the switch, but rollback is instant
Pick the strategy that matches your tolerance for things going wrong and your infrastructure budget.
Next Steps
- Assess current state: What strategy do you use now? (Most start with rolling or recreate)
- Identify constraints: Infrastructure budget? Downtime tolerance? Team maturity?
- Pick one improvement: If using recreate, move to rolling. If using rolling, add basic monitoring.
- Practice rollback: Test your rollback procedure in staging before you need it in production.
The goal isn’t perfection. The goal is knowing what will happen when deployment fails, and having a plan to recover quickly.
Real-Life Case Studies
Dispatch Management: Progressive Architecture
A B2B SaaS application showing deployment strategy evolution: Single-server Docker Compose (Surface) → Multi-instance Kubernetes (Mid-Depth) → Multi-region with geographic distribution (Deep-Water). Demonstrates when simple is sufficient and when to add complexity.
Topics covered: Docker Compose single-server pattern, Recreate deployment for PMF validation, Evolution triggers for Kubernetes adoption, Cost analysis at each deployment stage, Acceptable downtime at different scales
Deployment Focus: See Surface Level architecture for complete Docker Compose deployment pattern handling 100+ users for ~$50/month.