Mid-Depth Level: Production SaaS
Overview
Goal: Deliver reliable, scalable multi-tenant SaaS with enterprise features and compliance readiness
Target Scale:
- 1,000 concurrent users across multiple organizations
- 200+ concurrent dispatchers
- 10,000+ dispatch operations per day
- 50+ paying customers
Timeline: 6-12 months after Surface Level launch
Investment: 3-5 developers, $1,000-3,000/month infrastructure
Success Criteria
Quantitative Metrics
- 99% uptime over 90-day rolling period
- <2 second API response time at the 95th percentile (p95)
- <1% error rate
- Support 50+ active tenants
- Zero data loss incidents
- <1 hour RTO (Recovery Time Objective)
- <5 minute RPO (Recovery Point Objective)
Business Metrics
- 50+ paying organizations
- <5% monthly churn rate
- Net Promoter Score (NPS) >30
- 90%+ of feature requests tracked and prioritized
- Enterprise pipeline established: prospects requesting compliance attestations (SOC 2, ISO 27001)
Operational Metrics
- Deployment frequency: Multiple per week (without downtime)
- Incident response time: <15 minutes
- Security patch cycle: <48 hours for critical CVEs
- Customer support response time: <4 hours
Architecture Evolution
From Monolith to Enhanced Monolith
Still a Monolith, But with:
- Multi-tenant data isolation
- Horizontal scaling (multiple backend instances)
- External services integration (Redis, S3, managed Keycloak)
- High availability for critical components
- Professional monitoring and observability
Why Not Microservices Yet?
- Monolith can handle 1,000 concurrent users easily
- Microservices add operational complexity without clear benefit at this scale
- Team size doesn’t justify distributed system overhead
- Transaction boundaries still align with module boundaries
What Changes:
- Database: Single instance → Primary + replicas
- Queue: In-memory → Redis-backed
- Storage: Filesystem → S3
- Certificates: Self-signed → Let’s Encrypt
- Deployment: Single server → Kubernetes or multi-container
- Auth: Self-hosted Keycloak → Support external Keycloak integration
Evolution from Surface Level
This table maps Surface Level limitations to Mid-Depth solutions, with the business driver for each change.
| Surface Level Limitation | Mid-Depth Solution | Business Driver |
|---|---|---|
| Single-tenant database | Schema-per-tenant isolation | Multiple customers require data separation for security and compliance |
| 30-second polling | WebSocket real-time updates | Dispatchers need instant visibility for time-critical decisions |
| 5-10 users max | 1,000+ concurrent users | Growth from PMF validation to production SaaS |
| Self-signed certificates | Let’s Encrypt auto-renewal | Enterprise customers won’t trust self-signed certs; compliance requires trusted CAs |
| In-memory queue (lost on restart) | Redis-backed persistent queue | Queue data too valuable to lose; enables horizontal scaling |
| Local filesystem storage | S3 object storage | Scalability, redundancy, 7-year compliance retention |
| Basic logging (local files) | ELK stack centralized logging | Troubleshooting across multiple instances; audit trail requirements |
| No MFA | Keycloak MFA enforcement | Security compliance (ISO 27001) requires MFA for privileged users |
| Single DB instance (no HA) | RDS Multi-AZ with read replicas | 99% uptime SLA; heavy reporting queries impact live operations |
| Manual deployments | CI/CD pipeline with rollback | Multiple weekly deployments require automation; reduce human error |
| Best-effort support | Defined incident response SLAs | Enterprise customers expect guaranteed response times |
| No rate limiting | Per-user and per-tenant limits | Prevent abuse, ensure fair usage across tenants |
| Basic RBAC | ABAC (attribute-based) | Regional restrictions, department-level access control |
Key Persona Impact
| Persona | What Changes for Them |
|---|---|
| Dispatcher | Real-time updates, faster UI, filtering for large datasets |
| Equipment Support | Utilization reports, maintenance scheduling, certification tracking |
| Dev/Ops Team | CI/CD automation, centralized logs, defined on-call rotation |
Infrastructure Architecture
Deployment Options
Option A: Kubernetes (Recommended for Growth)
┌─────────────────────────────────────────────┐
│ AWS ALB / GKE Load Balancer │
│ (SSL Termination - Let's Encrypt) │
└──────────────┬──────────────────────────────┘
│
┌──────────┴──────────┐
│ │
▼ ▼
┌──────────┐ ┌──────────┐
│ Frontend │ │ Backend │
│ Pod │ │ Pods │
│ (nginx) │ │ (3-5x) │
└──────────┘ └─────┬────┘
│
┌─────────────────┼─────────────────┐
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Redis │ │ RDS │ │ S3 │
│ Cluster │ │ Postgres│ │ Bucket │
│ (Queue) │ │ Primary │ │ │
└─────────┘ │ +Replica│ └─────────┘
└─────────┘
┌──────────────────┐
│ Keycloak Cluster │
│ (2+ pods) │
└──────────────────┘
Kubernetes Resources:
- Namespace: dispatch-production
- Backend Deployment: 3-5 replicas (HPA enabled, scale 3-10; see the HPA sketch below)
- Frontend Deployment: 2 replicas (static files, minimal resources)
- Redis: Single pod initially, cluster later
- PostgreSQL: External managed service (AWS RDS, Google Cloud SQL)
- S3: Managed object storage
- Keycloak: StatefulSet with 2 replicas
Resource Allocation:
Backend Pod:
requests:
cpu: 500m
memory: 1Gi
limits:
cpu: 2000m
memory: 2Gi
Frontend Pod:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
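HPA Sketch:
A minimal HorizontalPodAutoscaler for the 3-10 backend scale range noted above; the 70% CPU target and object names are assumptions, not settled values:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: dispatch-backend
  namespace: dispatch-production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: dispatch-backend
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # Scale out when average pod CPU exceeds 70%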
Monthly Cost Estimate (AWS):
- EKS cluster: $75
- EC2 nodes (3x t3.large): $180
- RDS PostgreSQL (db.t3.large): $120
- Redis (cache.t3.medium): $50
- S3 storage (500 GB): $12
- Load Balancer: $20
- Data transfer: $50
- Total: ~$500-600/month
Option B: Managed Container Service (Simpler)
AWS ECS / Google Cloud Run / Azure Container Apps
┌─────────────────────────────────────────────┐
│ Application Load Balancer │
└──────────────┬──────────────────────────────┘
│
┌──────────┴──────────┐
│ │
▼ ▼
┌──────────┐ ┌──────────┐
│ Frontend │ │ Backend │
│ Service │ │ Service │
│ (2 tasks)│ │ (3 tasks)│
└──────────┘ └─────┬────┘
│
┌─────────────────┼─────────────────┐
│ │ │
▼ ▼ ▼
┌───────────┐  ┌─────────┐  ┌─────────┐
│ElastiCache│  │   RDS   │  │   S3    │
│   Redis   │  │ Postgres│  │         │
└───────────┘  └─────────┘  └─────────┘
Pros:
- Simpler than Kubernetes
- Less operational overhead
- Easier to manage for small teams
Cons:
- Less portable (vendor lock-in)
- Fewer ecosystem tools
- May need migration to K8s later
Monthly Cost Estimate: Similar to Kubernetes (~$500-600)
Database Architecture
PostgreSQL Configuration:
Primary Instance:
- Size: db.t3.large (2 vCPU, 8 GB RAM) initially
- Storage: 500 GB SSD (auto-scaling to 2 TB)
- IOPS: Provisioned IOPS SSD for consistent performance
- Multi-AZ: Enabled (automatic failover to standby)
Read Replicas:
- Count: 1-2 replicas
- Purpose:
- Reports and analytics (offload primary)
- Dashboard queries (stale data acceptable)
- Lag: Typically <1 second
- Cross-region: Optional (for disaster recovery)
Connection Management:
pgBouncer (connection pooler):
Backend Pods (50 connections each)
↓
pgBouncer (transaction mode)
↓
PostgreSQL (max 100 connections)
Benefits:
- Reuse connections efficiently
- Handle connection spikes
- Dedicated dispatch_ro entry lets the application route reads to replicas (pgBouncer only pools connections; it does not route queries itself; see the sketch below)
Configuration:
[databases]
dispatch = host=postgres-primary.rds.amazonaws.com port=5432 dbname=dispatch
dispatch_ro = host=postgres-replica.rds.amazonaws.com port=5432 dbname=dispatch
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
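Read Routing (Application Side):
Because pgBouncer only pools, the application chooses the pool per query. A minimal sketch with SQLAlchemy, assuming pgBouncer listens on port 6432 and the credentials shown are placeholders:
from sqlalchemy import create_engine, text

# Both engines point at pgBouncer; the logical database name selects
# the primary ("dispatch") or the replica ("dispatch_ro") pool
write_engine = create_engine('postgresql://app:secret@pgbouncer:6432/dispatch')
read_engine = create_engine('postgresql://app:secret@pgbouncer:6432/dispatch_ro')

def run_report_query(sql, params=None):
    # Reports tolerate the ~1 second replica lag, so send them to the replica
    with read_engine.connect() as conn:
        return conn.execute(text(sql), params or {}).fetchall()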
Multi-Tenancy Implementation:
Schema-Per-Tenant:
-- Tenant management schema
CREATE SCHEMA tenant_management;
CREATE TABLE tenant_management.tenants (
id UUID PRIMARY KEY,
name VARCHAR(255) NOT NULL,
subdomain VARCHAR(63) UNIQUE NOT NULL,
schema_name VARCHAR(63) UNIQUE NOT NULL,
keycloak_realm VARCHAR(255), -- External Keycloak support
status VARCHAR(50) DEFAULT 'active', -- active, suspended, trial
tier VARCHAR(50) DEFAULT 'standard', -- standard, premium, enterprise
created_at TIMESTAMP DEFAULT NOW(),
settings JSONB -- Tenant-specific configuration
);
-- Each tenant gets identical schema structure
CREATE SCHEMA tenant_acme;
CREATE SCHEMA tenant_contoso;
-- Automated schema creation function
CREATE OR REPLACE FUNCTION create_tenant_schema(tenant_schema VARCHAR)
RETURNS VOID AS $$
BEGIN
EXECUTE format('CREATE SCHEMA %I', tenant_schema);
-- Create all tables in new schema
EXECUTE format('CREATE TABLE %I.work_orders (...)', tenant_schema);
EXECUTE format('CREATE TABLE %I.equipment (...)', tenant_schema);
-- ... all other tables
END;
$$ LANGUAGE plpgsql;
Tenant Resolution:
# Middleware determines tenant from subdomain or JWT
def get_tenant():
# Option 1: Subdomain
subdomain = request.host.split('.')[0]
tenant = Tenant.query.filter_by(subdomain=subdomain).first()
# Option 2: JWT claim
if not tenant:
token = validate_token(request)
tenant_id = token.get('tenant_id')
tenant = Tenant.query.get(tenant_id)
return tenant
# Set search_path for all queries in this request
@app.before_request
def set_tenant_context():
    tenant = get_tenant()
    if tenant is None:
        abort(404)  # Unknown subdomain and no tenant claim in the token
    # schema_name comes from the tenants table, never from raw user input,
    # so interpolating the identifier here is safe
    db.session.execute(f"SET search_path TO {tenant.schema_name}")
Row-Level Security (Backup):
-- Even if search_path fails, RLS prevents cross-tenant access
ALTER TABLE work_orders ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON work_orders
USING (tenant_id = current_setting('app.current_tenant')::UUID);
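Setting the RLS Context:
For the policy above to see the tenant, each request must set app.current_tenant on its connection. A sketch that could be folded into set_tenant_context above:
from sqlalchemy import text

@app.before_request
def set_rls_tenant():
    tenant = get_tenant()
    # set_config(..., true) scopes the value to the current transaction,
    # so pooled connections cannot leak it between tenants
    db.session.execute(
        text("SELECT set_config('app.current_tenant', :tid, true)"),
        {'tid': str(tenant.id)}
    )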
Redis Architecture
Purpose:
- Dispatch queue (durable, survives restarts)
- Session cache (optional, reduce Keycloak calls)
- Dashboard data cache (reduce database load)
- Rate limiting counters
Configuration:
Single Instance (Sufficient Initially):
- Size: cache.t3.medium (2 vCPU, 3.09 GB RAM)
- Persistence: AOF (Append-Only File) enabled
- Backup: Daily snapshots
Redis Cluster (When Scaling):
- 3 primary nodes + 3 replicas
- Automatic sharding
- High availability
Queue Implementation:
import redis
import json
class RedisQueueManager:
def __init__(self, redis_client):
self.redis = redis_client
def add_to_queue(self, tenant_id, queue_entry):
# Priority queue: urgent items in separate list
key = f"dispatch_queue:{tenant_id}:{queue_entry['priority']}"
value = json.dumps(queue_entry)
self.redis.rpush(key, value) # Add to end (FIFO)
    def pop_from_queue(self, tenant_id, equipment_type_id):
        # Check urgent first, then standard
        for priority in ['urgent', 'standard']:
            key = f"dispatch_queue:{tenant_id}:{priority}"
            # Atomic pop from left (oldest item)
            value = self.redis.lpop(key)
            if value:
                entry = json.loads(value)
                if entry['equipment_type_id'] == equipment_type_id:
                    return entry
                # Not matching: push back to the FRONT (lpush) so FIFO order
                # is preserved; note only the head of each list is inspected
                self.redis.lpush(key, value)
        return None
def get_queue_depth(self, tenant_id):
urgent = self.redis.llen(f"dispatch_queue:{tenant_id}:urgent")
standard = self.redis.llen(f"dispatch_queue:{tenant_id}:standard")
return {'urgent': urgent, 'standard': standard}
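Usage sketch; hostname, tenant, and IDs are illustrative:
r = redis.Redis(host='redis', port=6379, decode_responses=True)
queue = RedisQueueManager(r)

queue.add_to_queue('acme', {
    'priority': 'urgent',
    'equipment_type_id': 'excavator',
    'work_order_id': 'WO-1001'
})
entry = queue.pop_from_queue('acme', 'excavator')  # -> the WO-1001 entry
print(queue.get_queue_depth('acme'))               # -> {'urgent': 0, 'standard': 0}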
Cache Implementation:
# Module-level client (the bare `redis` module is not itself a client)
redis_client = redis.Redis(host='redis', port=6379, decode_responses=True)

def get_dashboard_data(tenant_id):
    cache_key = f"dashboard:{tenant_id}"
    # Check cache first
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)
    # Cache miss, query database
    data = {
        'active_dispatches': get_active_dispatches(tenant_id),
        'available_equipment': get_available_equipment(tenant_id),
        'queue': get_queue_entries(tenant_id)
    }
    # Cache for 30 seconds
    redis_client.setex(cache_key, 30, json.dumps(data))
    return data
Object Storage (S3)
Bucket Structure:
s3://dispatch-app-production/
/tenants
/tenant-{id}
/config-files
/{work_order_id}
/config_v1.json
/config_v2.pdf
/reports
/{dispatch_assignment_id}
/status_report.pdf
/backups
/database
/tenant-{id}
/dispatch_20250115.backup
/static
/frontend-build
/assets
/index.js
/styles.css
Lifecycle Policies:
{
"Rules": [
{
"Id": "TransitionOldReports",
"Status": "Enabled",
"Transitions": [
{
"Days": 90,
"StorageClass": "STANDARD_IA"
},
{
"Days": 365,
"StorageClass": "GLACIER"
}
]
},
{
"Id": "DeleteOldBackups",
"Status": "Enabled",
"Expiration": {
"Days": 90
},
"Filter": {
"Prefix": "backups/"
}
}
]
}
Direct Upload Flow (Efficient):
# Backend generates pre-signed POST URL
import boto3
s3_client = boto3.client('s3')
def generate_upload_url(tenant_id, work_order_id, filename):
key = f"tenants/{tenant_id}/config-files/{work_order_id}/{filename}"
presigned_post = s3_client.generate_presigned_post(
Bucket='dispatch-app-production',
Key=key,
Fields={'acl': 'private'},
Conditions=[
{'acl': 'private'},
['content-length-range', 1, 10485760] # 1 byte to 10 MB
],
ExpiresIn=3600 # 1 hour
)
return presigned_post
# Frontend uploads directly to S3
# After upload, frontend notifies backend to create database record
Security:
- Bucket not publicly accessible
- All access via pre-signed URLs (time-limited)
- Server-side encryption: SSE-S3 or SSE-KMS
- Versioning enabled (can recover deleted files)
- Cross-region replication (optional, for DR)
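Direct Download Flow:
Downloads follow the same pre-signed pattern as uploads. A sketch using the boto3 client defined above:
def generate_download_url(tenant_id, work_order_id, filename):
    key = f"tenants/{tenant_id}/config-files/{work_order_id}/{filename}"
    # Time-limited GET URL; the bucket itself stays private
    return s3_client.generate_presigned_url(
        'get_object',
        Params={'Bucket': 'dispatch-app-production', 'Key': key},
        ExpiresIn=300  # 5 minutes covers a browser download
    )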
Authentication & Authorization Enhancements
Scenario A: Self-Hosted Keycloak Cluster
High Availability Setup:
┌──────────────────┐
│ Load Balancer │
└────────┬─────────┘
│
┌────┴────┐
│ │
▼ ▼
┌────────┐ ┌────────┐
│Keycloak│ │Keycloak│
│Node 1 │ │Node 2 │
└───┬────┘ └───┬────┘
│ │
└────┬─────┘
▼
┌──────────┐
│PostgreSQL│
│(Shared) │
└──────────┘
Configuration:
- 2+ Keycloak instances behind load balancer
- Shared PostgreSQL database (Keycloak backend)
- Session replication via JGroups or external cache
- Let’s Encrypt certificates (automated renewal)
MFA Enforcement:
// Keycloak realm configuration
{
"realm": "dispatch-app",
"requiredActions": ["CONFIGURE_TOTP"], // Force MFA setup on first login
"otpPolicyType": "totp",
"otpPolicyAlgorithm": "HmacSHA1",
"otpPolicyDigits": 6,
"otpPolicyPeriod": 30,
"browserFlow": "browser-with-mfa", // Custom flow requiring MFA for admin/dispatcher
"roles": {
"realm": [
{
"name": "admin",
"requiredOtp": true // Always require MFA
},
{
"name": "dispatcher",
"requiredOtp": true
}
]
}
}
Scenario B: External Keycloak Integration
Customer Provides Their Keycloak:
Configuration Handoff:
- Customer creates realm: dispatch-{customer-name}
- Customer creates two clients:
  - dispatch-web (public, OIDC, for frontend)
  - dispatch-api (confidential, for backend token validation)
- Customer defines roles: dispatch-admin, dispatch-dispatcher, dispatch-driver, dispatch-viewer
- Customer provides:
  - Keycloak URL
  - Realm name
  - Client IDs and secrets
  - JWKS endpoint URL
Application Configuration:
# Per-tenant Keycloak settings stored in database
tenants:
- id: acme-corp
keycloak_url: https://sso.acme.com
keycloak_realm: dispatch-acme
keycloak_client_id: dispatch-api
keycloak_client_secret: <secret>
keycloak_jwks_url: https://sso.acme.com/realms/dispatch-acme/protocol/openid-connect/certs
- id: contoso
keycloak_url: https://keycloak.contoso.com
keycloak_realm: dispatch-production
keycloak_client_id: dispatch-backend
keycloak_client_secret: <secret>
keycloak_jwks_url: https://keycloak.contoso.com/realms/dispatch-production/protocol/openid-connect/certs
Token Validation (Cached JWKS):
import requests
import jwt
import json
from datetime import datetime, timedelta

class MultiTenantAuth:
    def __init__(self):
        self.jwks_cache = {}  # tenant_id -> (jwks, expiry)

    def get_jwks(self, tenant):
        # Check cache
        if tenant.id in self.jwks_cache:
            jwks, expiry = self.jwks_cache[tenant.id]
            if datetime.utcnow() < expiry:
                return jwks
        # Fetch JWKS
        response = requests.get(tenant.keycloak_jwks_url, timeout=5)
        jwks = response.json()
        # Cache for 1 hour
        self.jwks_cache[tenant.id] = (jwks, datetime.utcnow() + timedelta(hours=1))
        return jwks

    def get_signing_key(self, token, tenant):
        # PyJWT needs a key object, not the raw JWKS document: pick the
        # entry whose 'kid' matches the token header and convert it
        kid = jwt.get_unverified_header(token)['kid']
        jwks = self.get_jwks(tenant)
        for key in jwks['keys']:
            if key['kid'] == kid:
                return jwt.algorithms.RSAAlgorithm.from_jwk(json.dumps(key))
        raise Unauthorized("Unknown signing key")

    def validate_token(self, token, tenant):
        try:
            signing_key = self.get_signing_key(token, tenant)
            decoded = jwt.decode(
                token,
                signing_key,
                algorithms=['RS256'],
                audience=tenant.keycloak_client_id,
                options={'verify_exp': True}
            )
            return decoded
        except jwt.ExpiredSignatureError:
            raise Unauthorized("Token expired")
        except jwt.InvalidTokenError:
            raise Unauthorized("Invalid token")
Fallback Mechanism (Graceful Degradation):
def validate_token_with_fallback(token, tenant, auth):
    # `auth` is the MultiTenantAuth instance defined above
    try:
        # Try to validate with live JWKS
        return auth.validate_token(token, tenant)
    except requests.ConnectionError:
        # Customer's Keycloak unreachable:
        # fall back to the cached JWKS (even past its refresh time)
        logger.warning(f"Keycloak unreachable for tenant {tenant.id}, using cached JWKS")
        if tenant.id not in auth.jwks_cache:
            raise ServiceUnavailable("Authentication service unavailable")
        jwks, _ = auth.jwks_cache[tenant.id]  # Ignore cache expiry
        kid = jwt.get_unverified_header(token)['kid']
        key = next((k for k in jwks['keys'] if k['kid'] == kid), None)
        if key is None:
            raise Unauthorized("Unknown signing key")
        signing_key = jwt.algorithms.RSAAlgorithm.from_jwk(json.dumps(key))
        # leeway=900 grants a 15-minute grace period on token expiry
        # instead of disabling expiry verification entirely
        return jwt.decode(token, signing_key, algorithms=['RS256'],
                          audience=tenant.keycloak_client_id, leeway=900)
Advanced Authorization (ABAC)
Attribute-Based Access Control:
Example Use Case: Dispatcher can only dispatch equipment in their region
Token Claims:
{
"sub": "user-123",
"preferred_username": "john.dispatcher",
"realm_access": {
"roles": ["dispatcher"]
},
"custom_attributes": {
"region": "west-coast",
"department": "logistics"
}
}
Policy Enforcement:
def authorize_dispatch(user, equipment):
# Check role
if 'dispatcher' not in user.roles and 'admin' not in user.roles:
raise Forbidden("Not authorized to dispatch")
# Check attribute (region)
if 'admin' not in user.roles: # Admins bypass region restriction
user_region = user.custom_attributes.get('region')
equipment_region = equipment.region
if user_region != equipment_region:
raise Forbidden(f"Cannot dispatch equipment outside your region ({user_region})")
return True
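To keep routes thin, the policy check can live in a decorator. A sketch, assuming g.user is populated by the auth middleware and Equipment is the SQLAlchemy model used elsewhere in this document:
from functools import wraps
from flask import g

def require_dispatch_access(view):
    @wraps(view)
    def wrapper(equipment_id, *args, **kwargs):
        equipment = Equipment.query.get_or_404(equipment_id)
        authorize_dispatch(g.user, equipment)  # raises Forbidden on role/region mismatch
        return view(equipment_id, *args, **kwargs)
    return wrapper

@app.route('/api/equipment/<equipment_id>/dispatch', methods=['POST'])
@require_dispatch_access
def dispatch_equipment(equipment_id):
    ...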
Real-Time Updates (WebSockets)
Upgrade from Polling to WebSockets:
Architecture:
Frontend (Socket.io client)
↓ WSS
Backend (Flask-SocketIO)
↓ Pub/Sub
Redis (Message Broker)
Benefits:
- Real-time dashboard updates (<1 second latency)
- Reduced server load (no constant polling)
- Lower bandwidth usage
- Better user experience
Implementation:
Backend (Flask-SocketIO):
from flask_socketio import SocketIO, emit, join_room

# Restrict cors_allowed_origins to the real frontend origin in production
socketio = SocketIO(app, cors_allowed_origins="*", message_queue='redis://redis:6379')

@socketio.on('connect')
def handle_connect(auth):
    # Socket.IO delivers the client's `auth` payload to the connect handler
    token = (auth or {}).get('token')
    user = validate_token(token)
    # Join tenant-specific room
    tenant_id = get_tenant_from_user(user)
    join_room(f"tenant_{tenant_id}")
    emit('connected', {'message': 'Welcome to dispatch updates'})
@socketio.on('subscribe_dispatches')
def handle_subscribe():
# User subscribes to active dispatch updates
pass
# When dispatch status changes
def notify_dispatch_update(tenant_id, dispatch_data):
socketio.emit('dispatch_updated', dispatch_data, room=f"tenant_{tenant_id}")
# When queue changes
def notify_queue_update(tenant_id, queue_data):
socketio.emit('queue_updated', queue_data, room=f"tenant_{tenant_id}")
Frontend (React):
import { useState, useEffect } from 'react';
import { io } from 'socket.io-client';
import { useAuth } from '@/features/auth/useAuth';
export function useDispatchUpdates() {
const { token } = useAuth();
const [dispatches, setDispatches] = useState([]);
useEffect(() => {
const socket = io('wss://api.dispatchapp.com', {
auth: { token }
});
socket.on('connect', () => {
console.log('Connected to dispatch updates');
socket.emit('subscribe_dispatches');
});
socket.on('dispatch_updated', (data) => {
// Update local state
setDispatches(prev => {
const index = prev.findIndex(d => d.id === data.id);
if (index >= 0) {
const updated = [...prev];
updated[index] = data;
return updated;
} else {
return [...prev, data];
}
});
});
socket.on('queue_updated', (data) => {
// Update queue display
});
return () => socket.disconnect();
}, [token]);
return { dispatches };
}
Scaling WebSockets:
- Redis pub/sub allows multiple backend instances to share WebSocket state
- Each backend instance handles subset of connections
- Load balancer sticky sessions keep each client pinned to the same backend instance (see the ingress sketch below)
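A sketch of cookie-based session affinity with the NGINX ingress controller; the host, service name, and port are assumptions:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: dispatch-websocket
  annotations:
    # Pin each Socket.IO client to one backend pod
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "dispatch-affinity"
    # Keep idle WebSocket connections open
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
spec:
  ingressClassName: nginx
  rules:
    - host: api.dispatchapp.com
      http:
        paths:
          - path: /socket.io
            pathType: Prefix
            backend:
              service:
                name: dispatch-backend
                port:
                  number: 5000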
Performance Optimizations
Database Optimization
Query Optimization:
-- Slow query (before)
SELECT d.*, wo.order_number, e.identifier, u.full_name
FROM dispatch_assignments d
JOIN work_orders wo ON d.work_order_id = wo.id
JOIN equipment e ON d.equipment_id = e.id
JOIN users u ON d.driver_id = u.id
WHERE d.status IN ('dispatched', 'en_route', 'on_site')
AND d.dispatched_at > NOW() - INTERVAL '7 days';
-- Optimized (after)
-- 1. Add covering index
CREATE INDEX idx_dispatch_active_covering ON dispatch_assignments(status, dispatched_at)
INCLUDE (work_order_id, equipment_id, driver_id);
-- 2. Denormalize frequently accessed data
ALTER TABLE dispatch_assignments
ADD COLUMN work_order_number VARCHAR,
ADD COLUMN equipment_identifier VARCHAR,
ADD COLUMN driver_name VARCHAR;
-- Updated on create/update via trigger or application logic
-- Now query is much faster (index-only scan possible)
SELECT * FROM dispatch_assignments
WHERE status IN ('dispatched', 'en_route', 'on_site')
AND dispatched_at > NOW() - INTERVAL '7 days';
Partitioning (for large tables):
-- Partition dispatch_assignments by month
CREATE TABLE dispatch_assignments (
id UUID NOT NULL,
dispatched_at TIMESTAMP NOT NULL,
completed_at TIMESTAMP,
-- ... other columns
) PARTITION BY RANGE (dispatched_at);
-- Create partitions
CREATE TABLE dispatch_assignments_2025_01 PARTITION OF dispatch_assignments
FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE dispatch_assignments_2025_02 PARTITION OF dispatch_assignments
FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
-- Automatic partition creation via cron job or pg_partman
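A minimal sketch of a cron-driven helper that creates next month's partition (pg_partman automates the same more robustly):
CREATE OR REPLACE FUNCTION create_next_month_partition()
RETURNS VOID AS $$
DECLARE
    start_date DATE := date_trunc('month', NOW() + INTERVAL '1 month');
    end_date   DATE := start_date + INTERVAL '1 month';
    part_name  TEXT := 'dispatch_assignments_' || to_char(start_date, 'YYYY_MM');
BEGIN
    EXECUTE format(
        'CREATE TABLE IF NOT EXISTS %I PARTITION OF dispatch_assignments
         FOR VALUES FROM (%L) TO (%L)',
        part_name, start_date, end_date
    );
END;
$$ LANGUAGE plpgsql;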
Materialized Views (for dashboard metrics):
-- Expensive query: Equipment utilization rate
CREATE MATERIALIZED VIEW equipment_utilization AS
SELECT
    e.id,
    e.identifier,
    COUNT(CASE WHEN d.status IN ('dispatched', 'en_route', 'on_site') THEN 1 END) as active_count,
    COUNT(d.id) as total_dispatches,
    AVG(EXTRACT(EPOCH FROM (d.completed_at - d.dispatched_at))) as avg_duration_seconds
FROM equipment e
LEFT JOIN dispatch_assignments d
    ON e.id = d.equipment_id
    AND d.dispatched_at > NOW() - INTERVAL '30 days'  -- filter inside the JOIN so idle equipment still appears
GROUP BY e.id, e.identifier;
-- Unique index enables non-blocking refresh
CREATE UNIQUE INDEX idx_equipment_utilization_id ON equipment_utilization(id);
-- Refresh every 5 minutes via cron:
-- REFRESH MATERIALIZED VIEW CONCURRENTLY equipment_utilization;
Application-Level Caching
Cache Strategy:
What to Cache:
- Dashboard data (30 seconds TTL)
- Available equipment list (30 seconds TTL)
- User profiles (5 minutes TTL)
- Tenant settings (1 hour TTL)
What NOT to Cache:
- Active dispatch status (real-time via WebSocket)
- Queue depth (real-time via Redis)
- Transaction-critical data
Cache Implementation:
from flask_caching import Cache
cache = Cache(app, config={
'CACHE_TYPE': 'redis',
'CACHE_REDIS_HOST': 'redis',
'CACHE_REDIS_PORT': 6379,
'CACHE_DEFAULT_TIMEOUT': 300
})
def available_equipment_cache_key():
    # Key must include the tenant, or cached results leak across tenants
    return f"available_equipment:{get_tenant().id}"

@app.route('/api/equipment/available')
@cache.cached(timeout=30, key_prefix=available_equipment_cache_key)
def get_available_equipment():
    # search_path already points at the tenant schema (set in before_request)
    equipment = Equipment.query.filter_by(status='available').all()
    return jsonify(EquipmentSchema(many=True).dump(equipment))

# Cache invalidation on equipment status change
def update_equipment_status(equipment_id, new_status):
    equipment = Equipment.query.get(equipment_id)
    equipment.status = new_status
    db.session.commit()
    # Invalidate this tenant's cache entry only
    cache.delete(available_equipment_cache_key())
API Rate Limiting
Per-User Rate Limits:
from flask import g
from flask_limiter import Limiter

limiter = Limiter(
    app,
    key_func=lambda: g.user['sub'],  # Rate limit by user ID
    storage_uri='redis://redis:6379',
    default_limits=["1000 per hour", "100 per minute"]
)
# Stricter limits on expensive endpoints
@app.route('/api/reports/generate', methods=['POST'])
@limiter.limit("10 per minute")
def generate_report():
# Report generation is expensive
pass
# Exempt admins from rate limiting
@app.route('/api/admin/bulk-import', methods=['POST'])
@limiter.exempt
def bulk_import():
if 'admin' not in g.user['roles']:
abort(403)
# Admin bulk operations
pass
Per-Tenant Rate Limits:
# Different limits based on subscription tier
def get_tenant_limit():
tenant = get_tenant()
limits = {
'trial': "100 per hour",
'standard': "1000 per hour",
'premium': "5000 per hour",
'enterprise': "20000 per hour"
}
return limits.get(tenant.tier, "1000 per hour")
@app.route('/api/dispatch/request', methods=['POST'])
@limiter.limit(get_tenant_limit)
def request_dispatch():
# Dispatch request
pass
Monitoring & Observability
Logging Infrastructure
Centralized Logging (ELK Stack):
Application Logs (JSON)
↓
Filebeat (log shipper)
↓
Logstash (parsing, enrichment)
↓
Elasticsearch (storage, indexing)
↓
Kibana (visualization, search)
Log Format:
{
"timestamp": "2025-01-15T10:30:45.123Z",
"level": "INFO",
"service": "dispatch-backend",
"module": "dispatch_module.service",
"tenant_id": "acme-corp",
"user_id": "user-123",
"action": "dispatch_requested",
"work_order_id": "WO-54321",
"equipment_id": "eq-789",
"driver_id": "driver-456",
"result": "success",
"duration_ms": 45,
"request_id": "req-abc-123" # Trace through all services
}
Log Retention:
- Hot tier (Elasticsearch): 30 days
- Warm tier (S3): 90 days
- Cold tier (S3 Glacier): 1 year
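A sketch of the hot-tier half as an Elasticsearch ILM policy (the warm and cold copies in S3 would be produced by a snapshot job outside ILM); the rollover thresholds are assumptions:
PUT _ilm/policy/dispatch-logs
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}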
Structured Logging:
import structlog
logger = structlog.get_logger()
def request_dispatch(work_order_id, equipment_type_id, dispatcher_id):
logger.info(
"dispatch_requested",
work_order_id=work_order_id,
equipment_type_id=equipment_type_id,
dispatcher_id=dispatcher_id
)
try:
# Dispatch logic
result = create_dispatch_assignment(...)
logger.info(
"dispatch_completed",
assignment_id=result.id,
duration_ms=(datetime.utcnow() - start_time).total_seconds() * 1000
)
return result
except Exception as e:
logger.error(
"dispatch_failed",
error=str(e),
error_type=type(e).__name__,
exc_info=True
)
raise
Metrics & Alerting (Prometheus + Grafana)
Prometheus Metrics:
Application Metrics:
from prometheus_client import Counter, Histogram, Gauge
# Request metrics
request_count = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
request_duration = Histogram('http_request_duration_seconds', 'HTTP request duration', ['method', 'endpoint'])
# Business metrics
dispatch_requests_total = Counter('dispatch_requests_total', 'Total dispatch requests', ['tenant_id', 'result'])
dispatch_queue_depth = Gauge('dispatch_queue_depth', 'Current queue depth', ['tenant_id', 'priority'])
active_dispatches = Gauge('active_dispatches', 'Number of active dispatches', ['tenant_id'])
# Database metrics
db_query_duration = Histogram('db_query_duration_seconds', 'Database query duration', ['query_type'])
db_connection_pool_size = Gauge('db_connection_pool_size', 'Connection pool size')
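Wiring the request metrics above into Flask; a sketch using before/after-request hooks:
import time
from flask import g, request

@app.before_request
def start_timer():
    g.start_time = time.monotonic()

@app.after_request
def record_request_metrics(response):
    duration = time.monotonic() - g.start_time
    endpoint = request.endpoint or 'unknown'
    request_count.labels(request.method, endpoint, str(response.status_code)).inc()
    request_duration.labels(request.method, endpoint).observe(duration)
    return response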
Grafana Dashboards:
Dashboard 1: System Health
- Request rate (requests/second)
- Error rate (%)
- Response time (p50, p95, p99)
- CPU and memory usage per pod
- Database connection pool utilization
Dashboard 2: Business Metrics
- Dispatches per hour
- Queue depth over time
- Average dispatch duration
- Equipment utilization rate
- Top tenants by activity
Dashboard 3: Tenant-Specific
- Per-tenant request rate
- Per-tenant error rate
- Per-tenant active dispatches
- Per-tenant queue depth
Alerting Rules
Critical Alerts (PagerDuty, immediate response):
groups:
- name: critical
interval: 1m
rules:
- alert: HighErrorRate
expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
for: 2m
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value }} (>5%) for 2 minutes"
- alert: DatabaseDown
expr: up{job="postgres"} == 0
for: 1m
annotations:
summary: "PostgreSQL is down"
- alert: QueueDepthCritical
expr: dispatch_queue_depth{priority="urgent"} > 50
for: 5m
annotations:
summary: "Urgent queue depth exceeds 50"
Warning Alerts (Slack, investigate during business hours):
- name: warnings
interval: 5m
rules:
- alert: SlowResponseTime
expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2.0
for: 10m
annotations:
summary: "95th percentile response time > 2 seconds"
- alert: HighMemoryUsage
expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.85
for: 10m
annotations:
summary: "Memory usage > 85%"
- alert: CertificateExpiringSoon
expr: ssl_certificate_expiry_days < 30
annotations:
summary: "SSL certificate expires in {{ $value }} days"
Security Enhancements
Certificate Management
Let’s Encrypt (Automated):
cert-manager (Kubernetes):
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: letsencrypt-prod
spec:
acme:
server: https://acme-v02.api.letsencrypt.org/directory
email: ops@dispatchapp.com
privateKeySecretRef:
name: letsencrypt-prod
solvers:
- http01:
ingress:
class: nginx
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: dispatch-tls
spec:
secretName: dispatch-tls
issuerRef:
name: letsencrypt-prod
kind: ClusterIssuer
dnsNames:
- dispatchapp.com
- "*.dispatchapp.com"
renewBefore: 720h # Renew 30 days before expiry
Automated Renewal:
- cert-manager checks expiry daily
- Renews when <30 days remaining
- Zero-downtime certificate rotation
- Monitoring alert if renewal fails
Internal Services (Vault PKI):
HashiCorp Vault as Internal CA:
# Enable PKI secrets engine
vault secrets enable pki
vault secrets tune -max-lease-ttl=87600h pki # 10 years
# Generate root CA
vault write pki/root/generate/internal \
common_name="Dispatch App Internal CA" \
ttl=87600h
# Create role for service certificates
vault write pki/roles/internal-services \
    allowed_domains="dispatch.internal,dispatch.svc.cluster.local" \
    allow_subdomains=true \
    max_ttl=2160h  # 90 days
# Backend requests certificate on startup
vault write pki/issue/internal-services \
common_name="backend.dispatch.svc.cluster.local" \
ttl=2160h
Certificate Rotation:
# Application checks certificate expiry on startup
def ensure_valid_certificate():
cert_path = '/var/run/secrets/tls/tls.crt'
cert = load_certificate(cert_path)
days_until_expiry = (cert.not_valid_after - datetime.utcnow()).days
if days_until_expiry < 30:
logger.warning(f"Certificate expires in {days_until_expiry} days, requesting new certificate")
new_cert = request_certificate_from_vault()
replace_certificate(new_cert)
reload_app() # Graceful reload
Secrets Management (Vault)
Dynamic Database Credentials:
# Configure database secrets engine
vault secrets enable database
vault write database/config/dispatch-postgres \
plugin_name=postgresql-database-plugin \
allowed_roles="dispatch-backend" \
connection_url="postgresql://{{username}}:{{password}}@postgres:5432/dispatch" \
username="vault" \
password="vault_password"
# Create role that generates 1-hour credentials
vault write database/roles/dispatch-backend \
db_name=dispatch-postgres \
creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}' IN ROLE dispatch_app;" \
default_ttl="1h" \
max_ttl="24h"
# Backend requests credentials on startup
vault read database/creds/dispatch-backend
# Returns: { "username": "v-token-12345", "password": "A1B2C3..." }
Application Integration:
import hvac
vault_client = hvac.Client(url='https://vault.internal:8200')
# Authenticate (Kubernetes service account, AppRole, etc.)
vault_client.auth.kubernetes.login(
role='dispatch-backend',
jwt=open('/var/run/secrets/kubernetes.io/serviceaccount/token').read()
)
# Get database credentials
db_creds = vault_client.secrets.database.generate_credentials(name='dispatch-backend')
db_user = db_creds['data']['username']
db_pass = db_creds['data']['password']
# Connect to database
engine = create_engine(f'postgresql://{db_user}:{db_pass}@postgres:5432/dispatch')
# Credentials automatically revoked after 1 hour
# Application renews before expiry or reconnects with new credentials
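A sketch of background lease renewal with hvac, keeping renewals well inside the 1-hour default TTL; the 30-minute interval is an assumption:
import threading
import time

def renew_db_lease_periodically(lease_id, interval_seconds=1800):
    # Renew every 30 minutes; if renewal fails the app should fall back
    # to requesting fresh credentials and reconnecting
    def loop():
        while True:
            time.sleep(interval_seconds)
            vault_client.sys.renew_lease(lease_id=lease_id)
    threading.Thread(target=loop, daemon=True).start()

renew_db_lease_periodically(db_creds['lease_id'])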
Compliance Preparation (ISO 27001)
ISO 27001 Alignment:
A.9 Access Control:
- ✅ Keycloak provides centralized identity management
- ✅ RBAC enforced at API level
- ✅ MFA for privileged accounts (admin, dispatcher)
- ✅ Regular access reviews (quarterly audit of user roles)
A.10 Cryptography:
- ✅ TLS 1.3 for all data in transit
- ✅ Encryption at rest for S3 (SSE-KMS)
- Database encryption (TDE): planned for Deep-Water (documented gap)
- ✅ Key management via Vault
A.12 Operations Security:
- ✅ Automated security patching (Dependabot + weekly review)
- ✅ Vulnerability scanning (Snyk integrated in CI/CD)
- ✅ Change management process (Git, PR reviews, staging environment)
- ✅ Capacity monitoring (Prometheus alerts)
A.14 System Acquisition, Development, Maintenance:
- ✅ Security requirements in design phase (documented)
- ✅ Code review process (all PRs reviewed)
- ✅ Penetration testing (quarterly, external firm)
- ✅ Secure coding guidelines documented
A.16 Incident Management:
- ✅ Incident response plan documented
- ✅ Security event logging (ELK stack)
- ✅ 24/7 on-call rotation (PagerDuty)
- ✅ Post-incident reviews (within 48 hours)
A.18 Compliance:
- ✅ Regular internal audits (monthly)
- ✅ External audit (annual, preparation for SOC 2 Type I)
- ✅ Documentation of all controls
- ✅ Gap analysis against ISO 27001 requirements
Documentation Required:
- Information Security Policy
- Risk Assessment Register
- Asset Inventory
- Incident Response Procedures
- Business Continuity Plan
- Disaster Recovery Plan
- Access Control Policy
- Change Management Process
Deployment & DevOps
CI/CD Pipeline
Tools: GitHub Actions, GitLab CI, or Jenkins
Pipeline Stages:
1. Build:
# .github/workflows/build.yml
name: Build and Test
on: [push, pull_request]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: 3.11
- name: Install dependencies
run: |
pip install -r requirements.txt
pip install -r requirements-dev.txt
- name: Lint
run: |
flake8 app/
black --check app/
- name: Unit tests
run: pytest tests/unit --cov=app
- name: Integration tests
run: pytest tests/integration
- name: Security scan
run: |
bandit -r app/
safety check
build-image:
needs: test
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Build Docker image
run: docker build -t dispatch-backend:${{ github.sha }} .
- name: Scan image
run: trivy image dispatch-backend:${{ github.sha }}
- name: Push to registry
run: |
echo ${{ secrets.DOCKER_PASSWORD }} | docker login -u ${{ secrets.DOCKER_USERNAME }} --password-stdin
docker push dispatch-backend:${{ github.sha }}
2. Deploy to Staging:
deploy-staging:
needs: build-image
runs-on: ubuntu-latest
steps:
- name: Update Kubernetes manifests
run: |
kubectl set image deployment/dispatch-backend \
backend=dispatch-backend:${{ github.sha }} \
-n dispatch-staging
- name: Wait for rollout
run: kubectl rollout status deployment/dispatch-backend -n dispatch-staging
- name: Smoke tests
run: |
curl -f https://staging.dispatchapp.com/api/health
pytest tests/smoke --base-url=https://staging.dispatchapp.com
3. Deploy to Production (manual approval):
deploy-production:
needs: deploy-staging
runs-on: ubuntu-latest
environment:
name: production
url: https://dispatchapp.com
steps:
- name: Update Kubernetes manifests
run: |
kubectl set image deployment/dispatch-backend \
backend=dispatch-backend:${{ github.sha }} \
-n dispatch-production
- name: Wait for rollout
run: kubectl rollout status deployment/dispatch-backend -n dispatch-production
- name: Notify Slack
run: |
curl -X POST ${{ secrets.SLACK_WEBHOOK_URL }} \
-H 'Content-Type: application/json' \
-d '{"text":"Production deployment successful: ${{ github.sha }}"}'
Zero-Downtime Deployments
Rather than maintaining two full environments (classic blue-green), a Kubernetes rolling update delivers zero downtime with less infrastructure:
Kubernetes Rolling Update:
apiVersion: apps/v1
kind: Deployment
metadata:
name: dispatch-backend
spec:
replicas: 5
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1 # Only 1 pod down at a time
maxSurge: 2 # Up to 2 extra pods during update
template:
spec:
containers:
- name: backend
image: dispatch-backend:v1.2.3
readinessProbe:
httpGet:
path: /api/health
port: 5000
initialDelaySeconds: 10
periodSeconds: 5
livenessProbe:
httpGet:
path: /api/health
port: 5000
initialDelaySeconds: 30
periodSeconds: 10
Process:
- New pods spin up (v1.2.3)
- Readiness probe passes → added to load balancer
- Old pod (v1.2.2) removed from load balancer
- Old pod allowed to finish in-flight requests (30s grace period; see the fragment below)
- Old pod terminated
- Repeat for all pods
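The 30-second grace period maps to pod settings. A sketch of the relevant deployment fragment; the preStop sleep is a common (assumed) pattern for letting the load balancer deregister the pod first:
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: backend
          lifecycle:
            preStop:
              exec:
                # Delay SIGTERM so the endpoint is removed from the
                # load balancer before the process starts shutting down
                command: ["sleep", "10"]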
Rollback (if deployment fails):
# Automatic rollback on failed readiness probes
# Or manual rollback
kubectl rollout undo deployment/dispatch-backend -n dispatch-production
Database Migrations
Zero-Downtime Migration Strategy:
Phase 1: Additive Changes Only
# Migration: Add new column
def upgrade():
op.add_column('dispatch_assignments',
sa.Column('new_field', sa.String(), nullable=True))
# Deploy code that writes to both old and new fields
# Old code still works (new_field is nullable)
Phase 2: Backfill Data
# Background job fills new_field for existing rows, in batches to keep
# transactions short and memory bounded
def backfill_new_field(batch_size=1000):
    while True:
        assignments = (DispatchAssignment.query
                       .filter_by(new_field=None)
                       .limit(batch_size)
                       .all())
        if not assignments:
            break
        for assignment in assignments:
            assignment.new_field = calculate_new_field(assignment)
        db.session.commit()  # Commit per batch
Phase 3: Make Required
# After all rows backfilled
def upgrade():
op.alter_column('dispatch_assignments', 'new_field', nullable=False)
# Deploy code that relies on new_field
Phase 4: Remove Old Field (if applicable)
# After old code no longer deployed
def upgrade():
op.drop_column('dispatch_assignments', 'old_field')
Backup & Disaster Recovery
Backup Strategy
Database Backups:
- Frequency: Continuous (WAL archiving) + daily snapshots
- Retention: 35 days (RDS maximum; the default is 7)
- Location: Same region + cross-region replication
- Encryption: AES-256 (at rest)
Automated Snapshots (RDS):
# Automated daily snapshots at 3 AM
# Manual snapshot before major changes
aws rds create-db-snapshot \
--db-instance-identifier dispatch-production \
--db-snapshot-identifier pre-migration-2025-01-15
Point-in-Time Recovery (PITR):
# Restore to any second in last 35 days
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier dispatch-production \
--target-db-instance-identifier dispatch-restored \
--restore-time 2025-01-15T10:30:00Z
Object Storage Backups:
- S3 Versioning: Enabled (recover deleted/overwritten files)
- Cross-Region Replication: Enabled (DR)
- Lifecycle Policy: Reports archived to Glacier after 1 year
Disaster Recovery Plan
Scenarios & Procedures:
1. Database Failure:
- Detection: Automatic (RDS Multi-AZ failover)
- RTO: <2 minutes (automatic failover to standby)
- RPO: 0 (synchronous replication)
- Procedure: No action required (automatic)
2. Regional Outage (AWS region down):
- Detection: Manual (monitoring shows all services down)
- RTO: 1 hour (manual failover to backup region)
- RPO: 5 minutes (asynchronous replication lag)
- Procedure:
- Promote read replica in backup region to primary
- Update DNS to point to backup region
- Restart backend pods in backup region
- Communicate status to customers
3. Data Corruption (accidental deletion):
- Detection: Customer report or internal audit
- RTO: 2 hours (restore from backup)
- RPO: Up to 24 hours from daily snapshots; PITR (above) can shrink this to minutes
- Procedure:
- Identify corruption extent (which tenants affected)
- Restore affected tenant schemas from backup
- Validate data integrity
- Notify affected customers
4. Security Breach (unauthorized access):
- Detection: Anomaly detection, SIEM alerts
- RTO: N/A (incident response, not recovery)
- Procedure:
- Isolate affected systems
- Revoke compromised credentials
- Investigate attack vector
- Patch vulnerability
- Notify affected customers (within 72 hours, GDPR)
- Conduct post-mortem
DR Testing
Quarterly DR Drills:
1. Planned Failover Test (Q1, Q3):
   - Schedule maintenance window
   - Perform manual failover to backup region
   - Verify all services operational
   - Fail back to primary region
   - Document any issues
2. Backup Restore Test (Q2, Q4):
   - Create test environment
   - Restore latest backup
   - Run smoke tests
   - Verify data integrity
   - Document restore time
Cost Optimization
Cost Breakdown (Estimated Monthly)
| Category | Cost |
|---|---|
| Compute | |
| - Kubernetes nodes (3x t3.large) | $180 |
| - EKS control plane | $75 |
| Database | |
| - RDS PostgreSQL (primary) | $120 |
| - RDS read replica | $120 |
| Cache & Queue | |
| - ElastiCache Redis | $50 |
| Storage | |
| - S3 (1 TB) | $23 |
| - EBS volumes | $30 |
| Networking | |
| - Load Balancer | $20 |
| - Data transfer | $100 |
| Monitoring & Logging | |
| - CloudWatch | $30 |
| - ELK stack (self-hosted) | Included in compute |
| Security | |
| - Vault (self-hosted) | Included in compute |
| Backups | $25 |
| Total | ~$773/month |
Cost Per Customer (at 50 customers): ~$15/month
Suggested Pricing: $50-100/month per customer (3-6x cost)
Cost Optimization Strategies
1. Right-Size Resources:
- Monitor actual usage vs. provisioned capacity
- Downsize underutilized nodes
- Use auto-scaling for variable workloads
2. Reserved Instances (for predictable workload):
- 1-year reserved instances: 30-40% discount
- 3-year reserved instances: 50-60% discount
- Apply to: Database, cache, baseline compute
3. Spot Instances (for non-critical workloads):
- Use for batch jobs (report generation)
- Dev/staging environments
- 70-90% discount vs. on-demand
4. Storage Lifecycle Policies:
- Move old reports to Glacier (10x cheaper)
- Delete old logs after retention period
- Compress backups
5. Data Transfer Optimization:
- Use CloudFront CDN for static assets
- Keep traffic within same region when possible
- Compress API responses
Transition to Deep-Water
Triggers to Advance
You’re ready for Deep-Water when:
- 500+ concurrent users (mid-depth infrastructure stressed)
- Geographic expansion required (customers in multiple regions)
- Enterprise customers demanding 99.9% SLA
- Compliance certifications achieved (ISO 27001, SOC 2)
- Microservices extraction justified (specific bottlenecks identified)
- Team grown to 10+ engineers (can manage distributed systems)
- Revenue justifies investment ($500k+ ARR)
Timeline: 12-18 months after Mid-Depth
Conclusion
Mid-Depth is about professionalization and reliability. Goals:
- Multi-tenant at scale: Support 50+ customers with data isolation
- Enterprise-ready: Compliance preparation, professional security
- Reliable: 99% uptime, fast incident response
- Observable: Comprehensive monitoring and logging
- Performant: <2 second response times under load
What Success Looks Like:
- Customers trust you with their operations
- Minimal support burden (system “just works”)
- Predictable costs and performance
- Clear path to enterprise scale (Deep-Water)
Key Mindset: Build for reliability and maintainability, not premature optimization. The modular monolith can take you far—don’t rush to microservices.
When enterprise customers demand geographic distribution and 99.9% SLAs, you’re ready for Deep-Water.