Mid-Depth Level: Production SaaS
Overview
Goal: Deliver reliable, scalable multi-tenant SaaS with enterprise features and compliance readiness
Target Scale:
- 1,000 concurrent users across multiple organizations
- 200+ concurrent dispatchers
- 10,000+ dispatch operations per day
- 50+ paying customers
Timeline: 6-12 months after Surface Level launch
Investment: 3-5 developers, $1,000-3,000/month infrastructure
Success Criteria
Quantitative Metrics
- 99% uptime over 90-day rolling period
- <2 second API response time at the 95th percentile (p95)
- <1% error rate
- Support 50+ active tenants
- Zero data loss incidents
- <1 hour RTO (Recovery Time Objective)
- <5 minute RPO (Recovery Point Objective)
Business Metrics
- 50+ paying organizations
- <5% monthly churn rate
- Net Promoter Score (NPS) >30
- 90%+ of feature requests tracked and prioritized
- Enterprise pipeline established: prospects requesting compliance attestations (SOC 2, ISO 27001)
Operational Metrics
- Deployment frequency: Multiple per week (without downtime)
- Incident response time: <15 minutes
- Security patch cycle: <48 hours for critical CVEs
- Customer support response time: <4 hours
Architecture Evolution
From Monolith to Enhanced Monolith
Still a Monolith, But with:
- Multi-tenant data isolation
- Horizontal scaling (multiple backend instances)
- External services integration (Redis, S3, managed Keycloak)
- High availability for critical components
- Professional monitoring and observability
Why Not Microservices Yet?
- Monolith can handle 1,000 concurrent users easily
- Microservices add operational complexity without clear benefit at this scale
- Team size doesn’t justify distributed system overhead
- Transaction boundaries still align with module boundaries
What Changes:
- Database: Single instance → Primary + replicas
- Queue: In-memory → Redis-backed
- Storage: Filesystem → S3
- Certificates: Self-signed → Let’s Encrypt
- Deployment: Single server → Kubernetes or multi-container
- Auth: Self-hosted Keycloak → Support external Keycloak integration
Evolution from Surface Level
This table maps Surface Level limitations to Mid-Depth solutions, with the business driver for each change.
| Surface Level Limitation | Mid-Depth Solution | Business Driver |
|---|---|---|
| Single-tenant database | Schema-per-tenant isolation | Multiple customers require data separation for security and compliance |
| 30-second polling | WebSocket real-time updates | Dispatchers need instant visibility for time-critical decisions |
| 5-10 users max | 1,000+ concurrent users | Growth from PMF validation to production SaaS |
| Self-signed certificates | Let’s Encrypt auto-renewal | Enterprise customers won’t trust self-signed certs; compliance requires trusted CAs |
| In-memory queue (lost on restart) | Redis-backed persistent queue | Queue data too valuable to lose; enables horizontal scaling |
| Local filesystem storage | S3 object storage | Scalability, redundancy, 7-year compliance retention |
| Basic logging (local files) | ELK stack centralized logging | Troubleshooting across multiple instances; audit trail requirements |
| No MFA | Keycloak MFA enforcement | Security compliance (ISO 27001) requires MFA for privileged users |
| Single DB instance (no HA) | RDS Multi-AZ with read replicas | 99% uptime SLA; heavy reporting queries impact live operations |
| Manual deployments | CI/CD pipeline with rollback | Multiple weekly deployments require automation; reduce human error |
| Best-effort support | Defined incident response SLAs | Enterprise customers expect guaranteed response times |
| No rate limiting | Per-user and per-tenant limits | Prevent abuse, ensure fair usage across tenants |
| Basic RBAC | ABAC (attribute-based) | Regional restrictions, department-level access control |
Key Persona Impact
| Persona | What Changes for Them |
|---|---|
| Dispatcher | Real-time updates, faster UI, filtering for large datasets |
| Equipment Support | Utilization reports, maintenance scheduling, certification tracking |
| Dev/Ops Team | CI/CD automation, centralized logs, defined on-call rotation |
Infrastructure Architecture
Deployment Options
Option A: Kubernetes (Recommended for Growth)
┌─────────────────────────────────────────────┐
│ AWS ALB / GKE Load Balancer │
│ (SSL Termination - Let's Encrypt) │
└──────────────┬──────────────────────────────┘
│
┌──────────┴──────────┐
│ │
▼ ▼
┌──────────┐ ┌──────────┐
│ Frontend │ │ Backend │
│ Pod │ │ Pods │
│ (nginx) │ │ (3-5x) │
└──────────┘ └─────┬────┘
│
┌─────────────────┼─────────────────┐
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Redis │ │ RDS │ │ S3 │
│ Cluster │ │ Postgres│ │ Bucket │
│ (Queue) │ │ Primary │ │ │
└─────────┘ │ +Replica│ └─────────┘
└─────────┘
┌──────────────────┐
│ Keycloak Cluster │
│ (2+ pods) │
└──────────────────┘
Kubernetes Resources:
- Namespace: dispatch-production
- Backend Deployment: 3-5 replicas (HPA enabled, scale 3-10; see the HPA sketch below)
- Frontend Deployment: 2 replicas (static files, minimal resources)
- Redis: Single pod initially, cluster later
- PostgreSQL: External managed service (AWS RDS, Google Cloud SQL)
- S3: Managed object storage
- Keycloak: StatefulSet with 2 replicas
Resource Allocation:
Backend Pod:
requests:
cpu: 500m
memory: 1Gi
limits:
cpu: 2000m
memory: 2Gi
Frontend Pod:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
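HPA Sketch:
A minimal HorizontalPodAutoscaler for the 3-10 backend scale range noted above; the 70% CPU target and object names are assumptions, not settled values:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: dispatch-backend
  namespace: dispatch-production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: dispatch-backend
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # Scale out when average pod CPU exceeds 70%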
Monthly Cost Estimate (AWS):
- EKS cluster: $75
- EC2 nodes (3x t3.large): $180
- RDS PostgreSQL (db.t3.large): $120
- Redis (cache.t3.medium): $50
- S3 storage (500 GB): $12
- Load Balancer: $20
- Data transfer: $50
- Total: ~$500-600/month
Option B: Managed Container Service (Simpler)
AWS ECS / Google Cloud Run / Azure Container Apps
┌─────────────────────────────────────────────┐
│ Application Load Balancer │
└──────────────┬──────────────────────────────┘
│
┌──────────┴──────────┐
│ │
▼ ▼
┌──────────┐ ┌──────────┐
│ Frontend │ │ Backend │
│ Service │ │ Service │
│ (2 tasks)│ │ (3 tasks)│
└──────────┘ └─────┬────┘
│
┌─────────────────┼─────────────────┐
│ │ │
▼ ▼ ▼
┌───────────┐  ┌─────────┐  ┌─────────┐
│ElastiCache│  │   RDS   │  │   S3    │
│   Redis   │  │ Postgres│  │         │
└───────────┘  └─────────┘  └─────────┘
Pros:
- Simpler than Kubernetes
- Less operational overhead
- Easier to manage for small teams
Cons:
- Less portable (vendor lock-in)
- Fewer ecosystem tools
- May need migration to K8s later
Monthly Cost Estimate: Similar to Kubernetes (~$500-600)
Database Architecture
PostgreSQL Configuration:
Primary Instance:
- Size: db.t3.large (2 vCPU, 8 GB RAM) initially
- Storage: 500 GB SSD (auto-scaling to 2 TB)
- IOPS: Provisioned IOPS SSD for consistent performance
- Multi-AZ: Enabled (automatic failover to standby)
Read Replicas:
- Count: 1-2 replicas
- Purpose:
- Reports and analytics (offload primary)
- Dashboard queries (stale data acceptable)
- Lag: Typically <1 second
- Cross-region: Optional (for disaster recovery)
Connection Management:
pgBouncer (connection pooler):
Backend Pods (50 connections each)
↓
pgBouncer (transaction mode)
↓
PostgreSQL (max 100 connections)
Benefits:
- Reuse connections efficiently
- Handle connection spikes
- Dedicated dispatch_ro entry lets the application route reads to replicas (pgBouncer only pools connections; it does not route queries itself; see the sketch below)
Configuration:
[databases]
dispatch = host=postgres-primary.rds.amazonaws.com port=5432 dbname=dispatch
dispatch_ro = host=postgres-replica.rds.amazonaws.com port=5432 dbname=dispatch
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
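Read Routing (Application Side):
Because pgBouncer only pools, the application chooses the pool per query. A minimal sketch with SQLAlchemy, assuming pgBouncer listens on port 6432 and the credentials shown are placeholders:
from sqlalchemy import create_engine, text

# Both engines point at pgBouncer; the logical database name selects
# the primary ("dispatch") or the replica ("dispatch_ro") pool
write_engine = create_engine('postgresql://app:secret@pgbouncer:6432/dispatch')
read_engine = create_engine('postgresql://app:secret@pgbouncer:6432/dispatch_ro')

def run_report_query(sql, params=None):
    # Reports tolerate the ~1 second replica lag, so send them to the replica
    with read_engine.connect() as conn:
        return conn.execute(text(sql), params or {}).fetchall()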
Multi-Tenancy Implementation:
Schema-Per-Tenant:
-- Tenant management schema
CREATE SCHEMA tenant_management;
CREATE TABLE tenant_management.tenants (
id UUID PRIMARY KEY,
name VARCHAR(255) NOT NULL,
subdomain VARCHAR(63) UNIQUE NOT NULL,
schema_name VARCHAR(63) UNIQUE NOT NULL,
keycloak_realm VARCHAR(255), -- External Keycloak support
status VARCHAR(50) DEFAULT 'active', -- active, suspended, trial
tier VARCHAR(50) DEFAULT 'standard', -- standard, premium, enterprise
created_at TIMESTAMP DEFAULT NOW(),
settings JSONB -- Tenant-specific configuration
);
-- Each tenant gets identical schema structure
CREATE SCHEMA tenant_acme;
CREATE SCHEMA tenant_contoso;
-- Automated schema creation function
CREATE OR REPLACE FUNCTION create_tenant_schema(tenant_schema VARCHAR)
RETURNS VOID AS $$
BEGIN
EXECUTE format('CREATE SCHEMA %I', tenant_schema);
-- Create all tables in new schema
EXECUTE format('CREATE TABLE %I.work_orders (...)', tenant_schema);
EXECUTE format('CREATE TABLE %I.equipment (...)', tenant_schema);
-- ... all other tables
END;
$$ LANGUAGE plpgsql;
Tenant Resolution:
# Middleware determines tenant from subdomain or JWT
def get_tenant():
# Option 1: Subdomain
subdomain = request.host.split('.')[0]
tenant = Tenant.query.filter_by(subdomain=subdomain).first()
# Option 2: JWT claim
if not tenant:
token = validate_token(request)
tenant_id = token.get('tenant_id')
tenant = Tenant.query.get(tenant_id)
return tenant
# Set search_path for all queries in this request
@app.before_request
def set_tenant_context():
    tenant = get_tenant()
    if tenant is None:
        abort(404)  # Unknown subdomain and no tenant claim in the token
    # schema_name comes from the tenants table, never from raw user input,
    # so interpolating the identifier here is safe
    db.session.execute(f"SET search_path TO {tenant.schema_name}")
Row-Level Security (Backup):
-- Even if search_path fails, RLS prevents cross-tenant access
ALTER TABLE work_orders ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON work_orders
USING (tenant_id = current_setting('app.current_tenant')::UUID);
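Setting the RLS Context:
For the policy above to see the tenant, each request must set app.current_tenant on its connection. A sketch that could be folded into set_tenant_context above:
from sqlalchemy import text

@app.before_request
def set_rls_tenant():
    tenant = get_tenant()
    # set_config(..., true) scopes the value to the current transaction,
    # so pooled connections cannot leak it between tenants
    db.session.execute(
        text("SELECT set_config('app.current_tenant', :tid, true)"),
        {'tid': str(tenant.id)}
    )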
Redis Architecture
Purpose:
- Dispatch queue (durable, survives restarts)
- Session cache (optional, reduce Keycloak calls)
- Dashboard data cache (reduce database load)
- Rate limiting counters
Configuration:
Single Instance (Sufficient Initially):
- Size: cache.t3.medium (2 vCPU, 3.09 GB RAM)
- Persistence: AOF (Append-Only File) enabled
- Backup: Daily snapshots
Redis Cluster (When Scaling):
- 3 primary nodes + 3 replicas
- Automatic sharding
- High availability
Queue Implementation:
import redis
import json
class RedisQueueManager:
def __init__(self, redis_client):
self.redis = redis_client
def add_to_queue(self, tenant_id, queue_entry):
# Priority queue: urgent items in separate list
key = f"dispatch_queue:{tenant_id}:{queue_entry['priority']}"
value = json.dumps(queue_entry)
self.redis.rpush(key, value) # Add to end (FIFO)
    def pop_from_queue(self, tenant_id, equipment_type_id):
        # Check urgent first, then standard
        for priority in ['urgent', 'standard']:
            key = f"dispatch_queue:{tenant_id}:{priority}"
            # Atomic pop from left (oldest item)
            value = self.redis.lpop(key)
            if value:
                entry = json.loads(value)
                if entry['equipment_type_id'] == equipment_type_id:
                    return entry
                # Not matching: push back to the FRONT (lpush) so FIFO order
                # is preserved; note only the head of each list is inspected
                self.redis.lpush(key, value)
        return None
def get_queue_depth(self, tenant_id):
urgent = self.redis.llen(f"dispatch_queue:{tenant_id}:urgent")
standard = self.redis.llen(f"dispatch_queue:{tenant_id}:standard")
return {'urgent': urgent, 'standard': standard}
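Usage sketch; hostname, tenant, and IDs are illustrative:
r = redis.Redis(host='redis', port=6379, decode_responses=True)
queue = RedisQueueManager(r)

queue.add_to_queue('acme', {
    'priority': 'urgent',
    'equipment_type_id': 'excavator',
    'work_order_id': 'WO-1001'
})
entry = queue.pop_from_queue('acme', 'excavator')  # -> the WO-1001 entry
print(queue.get_queue_depth('acme'))               # -> {'urgent': 0, 'standard': 0}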
Cache Implementation:
# Module-level client (the bare `redis` module is not itself a client)
redis_client = redis.Redis(host='redis', port=6379, decode_responses=True)

def get_dashboard_data(tenant_id):
    cache_key = f"dashboard:{tenant_id}"
    # Check cache first
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)
    # Cache miss, query database
    data = {
        'active_dispatches': get_active_dispatches(tenant_id),
        'available_equipment': get_available_equipment(tenant_id),
        'queue': get_queue_entries(tenant_id)
    }
    # Cache for 30 seconds
    redis_client.setex(cache_key, 30, json.dumps(data))
    return data
Object Storage (S3)
Bucket Structure:
s3://dispatch-app-production/
/tenants
/tenant-{id}
/config-files
/{work_order_id}
/config_v1.json
/config_v2.pdf
/reports
/{dispatch_assignment_id}
/status_report.pdf
/backups
/database
/tenant-{id}
/dispatch_20250115.backup
/static
/frontend-build
/assets
/index.js
/styles.css
Lifecycle Policies:
{
"Rules": [
{
"Id": "TransitionOldReports",
"Status": "Enabled",
"Transitions": [
{
"Days": 90,
"StorageClass": "STANDARD_IA"
},
{
"Days": 365,
"StorageClass": "GLACIER"
}
]
},
{
"Id": "DeleteOldBackups",
"Status": "Enabled",
"Expiration": {
"Days": 90
},
"Filter": {
"Prefix": "backups/"
}
}
]
}
Direct Upload Flow (Efficient):
# Backend generates pre-signed POST URL
import boto3
s3_client = boto3.client('s3')
def generate_upload_url(tenant_id, work_order_id, filename):
key = f"tenants/{tenant_id}/config-files/{work_order_id}/{filename}"
presigned_post = s3_client.generate_presigned_post(
Bucket='dispatch-app-production',
Key=key,
Fields={'acl': 'private'},
Conditions=[
{'acl': 'private'},
['content-length-range', 1, 10485760] # 1 byte to 10 MB
],
ExpiresIn=3600 # 1 hour
)
return presigned_post
# Frontend uploads directly to S3
# After upload, frontend notifies backend to create database record
Security:
- Bucket not publicly accessible
- All access via pre-signed URLs (time-limited)
- Server-side encryption: SSE-S3 or SSE-KMS
- Versioning enabled (can recover deleted files)
- Cross-region replication (optional, for DR)
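Direct Download Flow:
Downloads follow the same pre-signed pattern as uploads. A sketch using the boto3 client defined above:
def generate_download_url(tenant_id, work_order_id, filename):
    key = f"tenants/{tenant_id}/config-files/{work_order_id}/{filename}"
    # Time-limited GET URL; the bucket itself stays private
    return s3_client.generate_presigned_url(
        'get_object',
        Params={'Bucket': 'dispatch-app-production', 'Key': key},
        ExpiresIn=300  # 5 minutes covers a browser download
    )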
Authentication & Authorization Enhancements
Scenario A: Self-Hosted Keycloak Cluster
High Availability Setup:
┌──────────────────┐
│ Load Balancer │
└────────┬─────────┘
│
┌────┴────┐
│ │
▼ ▼
┌────────┐ ┌────────┐
│Keycloak│ │Keycloak│
│Node 1 │ │Node 2 │
└───┬────┘ └───┬────┘
│ │
└────┬─────┘
▼
┌──────────┐
│PostgreSQL│
│(Shared) │
└──────────┘
Configuration:
- 2+ Keycloak instances behind load balancer
- Shared PostgreSQL database (Keycloak backend)
- Session replication via JGroups or external cache
- Let’s Encrypt certificates (automated renewal)
MFA Enforcement:
// Keycloak realm configuration
{
"realm": "dispatch-app",
"requiredActions": ["CONFIGURE_TOTP"], // Force MFA setup on first login
"otpPolicyType": "totp",
"otpPolicyAlgorithm": "HmacSHA1",
"otpPolicyDigits": 6,
"otpPolicyPeriod": 30,
"browserFlow": "browser-with-mfa", // Custom flow requiring MFA for admin/dispatcher
"roles": {
"realm": [
{
"name": "admin",
"requiredOtp": true // Always require MFA
},
{
"name": "dispatcher",
"requiredOtp": true
}
]
}
}
Scenario B: External Keycloak Integration
Customer Provides Their Keycloak:
Configuration Handoff:
- Customer creates realm: dispatch-{customer-name}
- Customer creates two clients:
  - dispatch-web (public, OIDC, for frontend)
  - dispatch-api (confidential, for backend token validation)
- Customer defines roles: dispatch-admin, dispatch-dispatcher, dispatch-driver, dispatch-viewer
- Customer provides:
  - Keycloak URL
  - Realm name
  - Client IDs and secrets
  - JWKS endpoint URL
Application Configuration:
# Per-tenant Keycloak settings stored in database
tenants:
- id: acme-corp
keycloak_url: https://sso.acme.com
keycloak_realm: dispatch-acme
keycloak_client_id: dispatch-api
keycloak_client_secret: <secret>
keycloak_jwks_url: https://sso.acme.com/realms/dispatch-acme/protocol/openid-connect/certs
- id: contoso
keycloak_url: https://keycloak.contoso.com
keycloak_realm: dispatch-production
keycloak_client_id: dispatch-backend
keycloak_client_secret: <secret>
keycloak_jwks_url: https://keycloak.contoso.com/realms/dispatch-production/protocol/openid-connect/certs
Token Validation (Cached JWKS):
import requests
import jwt
import json
from datetime import datetime, timedelta

class MultiTenantAuth:
    def __init__(self):
        self.jwks_cache = {}  # tenant_id -> (jwks, expiry)

    def get_jwks(self, tenant):
        # Check cache
        if tenant.id in self.jwks_cache:
            jwks, expiry = self.jwks_cache[tenant.id]
            if datetime.utcnow() < expiry:
                return jwks
        # Fetch JWKS
        response = requests.get(tenant.keycloak_jwks_url, timeout=5)
        jwks = response.json()
        # Cache for 1 hour
        self.jwks_cache[tenant.id] = (jwks, datetime.utcnow() + timedelta(hours=1))
        return jwks

    def get_signing_key(self, token, tenant):
        # PyJWT needs a key object, not the raw JWKS document: pick the
        # entry whose 'kid' matches the token header and convert it
        kid = jwt.get_unverified_header(token)['kid']
        jwks = self.get_jwks(tenant)
        for key in jwks['keys']:
            if key['kid'] == kid:
                return jwt.algorithms.RSAAlgorithm.from_jwk(json.dumps(key))
        raise Unauthorized("Unknown signing key")

    def validate_token(self, token, tenant):
        try:
            signing_key = self.get_signing_key(token, tenant)
            decoded = jwt.decode(
                token,
                signing_key,
                algorithms=['RS256'],
                audience=tenant.keycloak_client_id,
                options={'verify_exp': True}
            )
            return decoded
        except jwt.ExpiredSignatureError:
            raise Unauthorized("Token expired")
        except jwt.InvalidTokenError:
            raise Unauthorized("Invalid token")
Fallback Mechanism (Graceful Degradation):
def validate_token_with_fallback(token, tenant, auth):
    # `auth` is the MultiTenantAuth instance defined above
    try:
        # Try to validate with live JWKS
        return auth.validate_token(token, tenant)
    except requests.ConnectionError:
        # Customer's Keycloak unreachable:
        # fall back to the cached JWKS (even past its refresh time)
        logger.warning(f"Keycloak unreachable for tenant {tenant.id}, using cached JWKS")
        if tenant.id not in auth.jwks_cache:
            raise ServiceUnavailable("Authentication service unavailable")
        jwks, _ = auth.jwks_cache[tenant.id]  # Ignore cache expiry
        kid = jwt.get_unverified_header(token)['kid']
        key = next((k for k in jwks['keys'] if k['kid'] == kid), None)
        if key is None:
            raise Unauthorized("Unknown signing key")
        signing_key = jwt.algorithms.RSAAlgorithm.from_jwk(json.dumps(key))
        # leeway=900 grants a 15-minute grace period on token expiry
        # instead of disabling expiry verification entirely
        return jwt.decode(token, signing_key, algorithms=['RS256'],
                          audience=tenant.keycloak_client_id, leeway=900)
Advanced Authorization (ABAC)
Attribute-Based Access Control:
Example Use Case: Dispatcher can only dispatch equipment in their region
Token Claims:
{
"sub": "user-123",
"preferred_username": "john.dispatcher",
"realm_access": {
"roles": ["dispatcher"]
},
"custom_attributes": {
"region": "west-coast",
"department": "logistics"
}
}
Policy Enforcement:
def authorize_dispatch(user, equipment):
# Check role
if 'dispatcher' not in user.roles and 'admin' not in user.roles:
raise Forbidden("Not authorized to dispatch")
# Check attribute (region)
if 'admin' not in user.roles: # Admins bypass region restriction
user_region = user.custom_attributes.get('region')
equipment_region = equipment.region
if user_region != equipment_region:
raise Forbidden(f"Cannot dispatch equipment outside your region ({user_region})")
return True
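To keep routes thin, the policy check can live in a decorator. A sketch, assuming g.user is populated by the auth middleware and Equipment is the SQLAlchemy model used elsewhere in this document:
from functools import wraps
from flask import g

def require_dispatch_access(view):
    @wraps(view)
    def wrapper(equipment_id, *args, **kwargs):
        equipment = Equipment.query.get_or_404(equipment_id)
        authorize_dispatch(g.user, equipment)  # raises Forbidden on role/region mismatch
        return view(equipment_id, *args, **kwargs)
    return wrapper

@app.route('/api/equipment/<equipment_id>/dispatch', methods=['POST'])
@require_dispatch_access
def dispatch_equipment(equipment_id):
    ...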
Real-Time Updates (WebSockets)
Upgrade from Polling to WebSockets:
Architecture:
Frontend (Socket.io client)
↓ WSS
Backend (Flask-SocketIO)
↓ Pub/Sub
Redis (Message Broker)
Benefits:
- Real-time dashboard updates (<1 second latency)
- Reduced server load (no constant polling)
- Lower bandwidth usage
- Better user experience
Implementation:
Backend (Flask-SocketIO):
from flask_socketio import SocketIO, emit, join_room

# Restrict cors_allowed_origins to the real frontend origin in production
socketio = SocketIO(app, cors_allowed_origins="*", message_queue='redis://redis:6379')

@socketio.on('connect')
def handle_connect(auth):
    # Socket.IO delivers the client's `auth` payload to the connect handler
    token = (auth or {}).get('token')
    user = validate_token(token)
    # Join tenant-specific room
    tenant_id = get_tenant_from_user(user)
    join_room(f"tenant_{tenant_id}")
    emit('connected', {'message': 'Welcome to dispatch updates'})
@socketio.on('subscribe_dispatches')
def handle_subscribe():
# User subscribes to active dispatch updates
pass
# When dispatch status changes
def notify_dispatch_update(tenant_id, dispatch_data):
socketio.emit('dispatch_updated', dispatch_data, room=f"tenant_{tenant_id}")
# When queue changes
def notify_queue_update(tenant_id, queue_data):
socketio.emit('queue_updated', queue_data, room=f"tenant_{tenant_id}")
Frontend (React):
import { useState, useEffect } from 'react';
import { io } from 'socket.io-client';
import { useAuth } from '@/features/auth/useAuth';
export function useDispatchUpdates() {
const { token } = useAuth();
const [dispatches, setDispatches] = useState([]);
useEffect(() => {
const socket = io('wss://api.dispatchapp.com', {
auth: { token }
});
socket.on('connect', () => {
console.log('Connected to dispatch updates');
socket.emit('subscribe_dispatches');
});
socket.on('dispatch_updated', (data) => {
// Update local state
setDispatches(prev => {
const index = prev.findIndex(d => d.id === data.id);
if (index >= 0) {
const updated = [...prev];
updated[index] = data;
return updated;
} else {
return [...prev, data];
}
});
});
socket.on('queue_updated', (data) => {
// Update queue display
});
return () => socket.disconnect();
}, [token]);
return { dispatches };
}
Scaling WebSockets:
- Redis pub/sub allows multiple backend instances to share WebSocket state
- Each backend instance handles subset of connections
- Load balancer sticky sessions keep each client pinned to the same backend instance (see the ingress sketch below)
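A sketch of cookie-based session affinity with the NGINX ingress controller; the host, service name, and port are assumptions:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: dispatch-websocket
  annotations:
    # Pin each Socket.IO client to one backend pod
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "dispatch-affinity"
    # Keep idle WebSocket connections open
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
spec:
  ingressClassName: nginx
  rules:
    - host: api.dispatchapp.com
      http:
        paths:
          - path: /socket.io
            pathType: Prefix
            backend:
              service:
                name: dispatch-backend
                port:
                  number: 5000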
Performance Optimizations
Database Optimization
Query Optimization:
-- Slow query (before)
SELECT d.*, wo.order_number, e.identifier, u.full_name
FROM dispatch_assignments d
JOIN work_orders wo ON d.work_order_id = wo.id
JOIN equipment e ON d.equipment_id = e.id
JOIN users u ON d.driver_id = u.id
WHERE d.status IN ('dispatched', 'en_route', 'on_site')
AND d.dispatched_at > NOW() - INTERVAL '7 days';
-- Optimized (after)
-- 1. Add covering index
CREATE INDEX idx_dispatch_active_covering ON dispatch_assignments(status, dispatched_at)
INCLUDE (work_order_id, equipment_id, driver_id);
-- 2. Denormalize frequently accessed data
ALTER TABLE dispatch_assignments
ADD COLUMN work_order_number VARCHAR,
ADD COLUMN equipment_identifier VARCHAR,
ADD COLUMN driver_name VARCHAR;
-- Updated on create/update via trigger or application logic
-- Now query is much faster (index-only scan possible)
SELECT * FROM dispatch_assignments
WHERE status IN ('dispatched', 'en_route', 'on_site')
AND dispatched_at > NOW() - INTERVAL '7 days';
Partitioning (for large tables):
-- Partition dispatch_assignments by month
CREATE TABLE dispatch_assignments (
id UUID NOT NULL,
dispatched_at TIMESTAMP NOT NULL,
completed_at TIMESTAMP,
-- ... other columns
) PARTITION BY RANGE (dispatched_at);
-- Create partitions
CREATE TABLE dispatch_assignments_2025_01 PARTITION OF dispatch_assignments
FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE dispatch_assignments_2025_02 PARTITION OF dispatch_assignments
FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
-- Automatic partition creation via cron job or pg_partman
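A minimal sketch of a cron-driven helper that creates next month's partition (pg_partman automates the same more robustly):
CREATE OR REPLACE FUNCTION create_next_month_partition()
RETURNS VOID AS $$
DECLARE
    start_date DATE := date_trunc('month', NOW() + INTERVAL '1 month');
    end_date   DATE := start_date + INTERVAL '1 month';
    part_name  TEXT := 'dispatch_assignments_' || to_char(start_date, 'YYYY_MM');
BEGIN
    EXECUTE format(
        'CREATE TABLE IF NOT EXISTS %I PARTITION OF dispatch_assignments
         FOR VALUES FROM (%L) TO (%L)',
        part_name, start_date, end_date
    );
END;
$$ LANGUAGE plpgsql;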
Materialized Views (for dashboard metrics):
-- Expensive query: Equipment utilization rate
CREATE MATERIALIZED VIEW equipment_utilization AS
SELECT
    e.id,
    e.identifier,
    COUNT(CASE WHEN d.status IN ('dispatched', 'en_route', 'on_site') THEN 1 END) as active_count,
    COUNT(d.id) as total_dispatches,
    AVG(EXTRACT(EPOCH FROM (d.completed_at - d.dispatched_at))) as avg_duration_seconds
FROM equipment e
LEFT JOIN dispatch_assignments d
    ON e.id = d.equipment_id
    AND d.dispatched_at > NOW() - INTERVAL '30 days'  -- filter inside the JOIN so idle equipment still appears
GROUP BY e.id, e.identifier;
-- Unique index enables non-blocking refresh
CREATE UNIQUE INDEX idx_equipment_utilization_id ON equipment_utilization(id);
-- Refresh every 5 minutes via cron:
-- REFRESH MATERIALIZED VIEW CONCURRENTLY equipment_utilization;
Application-Level Caching
Cache Strategy:
What to Cache:
- Dashboard data (30 seconds TTL)
- Available equipment list (30 seconds TTL)
- User profiles (5 minutes TTL)
- Tenant settings (1 hour TTL)
What NOT to Cache:
- Active dispatch status (real-time via WebSocket)
- Queue depth (real-time via Redis)
- Transaction-critical data
Cache Implementation:
from flask_caching import Cache
cache = Cache(app, config={
'CACHE_TYPE': 'redis',
'CACHE_REDIS_HOST': 'redis',
'CACHE_REDIS_PORT': 6379,
'CACHE_DEFAULT_TIMEOUT': 300
})
def available_equipment_cache_key():
    # Key must include the tenant, or cached results leak across tenants
    return f"available_equipment:{get_tenant().id}"

@app.route('/api/equipment/available')
@cache.cached(timeout=30, key_prefix=available_equipment_cache_key)
def get_available_equipment():
    # search_path already points at the tenant schema (set in before_request)
    equipment = Equipment.query.filter_by(status='available').all()
    return jsonify(EquipmentSchema(many=True).dump(equipment))

# Cache invalidation on equipment status change
def update_equipment_status(equipment_id, new_status):
    equipment = Equipment.query.get(equipment_id)
    equipment.status = new_status
    db.session.commit()
    # Invalidate this tenant's cache entry only
    cache.delete(available_equipment_cache_key())
API Rate Limiting
Per-User Rate Limits:
from flask import g
from flask_limiter import Limiter

limiter = Limiter(
    app,
    key_func=lambda: g.user['sub'],  # Rate limit by user ID
    storage_uri='redis://redis:6379',
    default_limits=["1000 per hour", "100 per minute"]
)
# Stricter limits on expensive endpoints
@app.route('/api/reports/generate', methods=['POST'])
@limiter.limit("10 per minute")
def generate_report():
# Report generation is expensive
pass
# Exempt admins from rate limiting
@app.route('/api/admin/bulk-import', methods=['POST'])
@limiter.exempt
def bulk_import():
if 'admin' not in g.user['roles']:
abort(403)
# Admin bulk operations
pass
Per-Tenant Rate Limits:
# Different limits based on subscription tier
def get_tenant_limit():
tenant = get_tenant()
limits = {
'trial': "100 per hour",
'standard': "1000 per hour",
'premium': "5000 per hour",
'enterprise': "20000 per hour"
}
return limits.get(tenant.tier, "1000 per hour")
@app.route('/api/dispatch/request', methods=['POST'])
@limiter.limit(get_tenant_limit)
def request_dispatch():
# Dispatch request
pass
Monitoring & Observability
Logging Infrastructure
Centralized Logging (ELK Stack):
Application Logs (JSON)
↓
Filebeat (log shipper)
↓
Logstash (parsing, enrichment)
↓
Elasticsearch (storage, indexing)
↓
Kibana (visualization, search)
Log Format:
{
"timestamp": "2025-01-15T10:30:45.123Z",
"level": "INFO",
"service": "dispatch-backend",
"module": "dispatch_module.service",
"tenant_id": "acme-corp",
"user_id": "user-123",
"action": "dispatch_requested",
"work_order_id": "WO-54321",
"equipment_id": "eq-789",
"driver_id": "driver-456",
"result": "success",
"duration_ms": 45,
"request_id": "req-abc-123" # Trace through all services
}
Log Retention:
- Hot tier (Elasticsearch): 30 days
- Warm tier (S3): 90 days
- Cold tier (S3 Glacier): 1 year
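A sketch of the hot-tier half as an Elasticsearch ILM policy (the warm and cold copies in S3 would be produced by a snapshot job outside ILM); the rollover thresholds are assumptions:
PUT _ilm/policy/dispatch-logs
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}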
Structured Logging:
import structlog
logger = structlog.get_logger()
def request_dispatch(work_order_id, equipment_type_id, dispatcher_id):
logger.info(
"dispatch_requested",
work_order_id=work_order_id,
equipment_type_id=equipment_type_id,
dispatcher_id=dispatcher_id
)
try:
# Dispatch logic
result = create_dispatch_assignment(...)
logger.info(
"dispatch_completed",
assignment_id=result.id,
duration_ms=(datetime.utcnow() - start_time).total_seconds() * 1000
)
return result
except Exception as e:
logger.error(
"dispatch_failed",
error=str(e),
error_type=type(e).__name__,
exc_info=True
)
raise
Metrics & Alerting (Prometheus + Grafana)
Prometheus Metrics:
Application Metrics:
from prometheus_client import Counter, Histogram, Gauge
# Request metrics
request_count = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
request_duration = Histogram('http_request_duration_seconds', 'HTTP request duration', ['method', 'endpoint'])
# Business metrics
dispatch_requests_total = Counter('dispatch_requests_total', 'Total dispatch requests', ['tenant_id', 'result'])
dispatch_queue_depth = Gauge('dispatch_queue_depth', 'Current queue depth', ['tenant_id', 'priority'])
active_dispatches = Gauge('active_dispatches', 'Number of active dispatches', ['tenant_id'])
# Database metrics
db_query_duration = Histogram('db_query_duration_seconds', 'Database query duration', ['query_type'])
db_connection_pool_size = Gauge('db_connection_pool_size', 'Connection pool size')
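Wiring the request metrics above into Flask; a sketch using before/after-request hooks:
import time
from flask import g, request

@app.before_request
def start_timer():
    g.start_time = time.monotonic()

@app.after_request
def record_request_metrics(response):
    duration = time.monotonic() - g.start_time
    endpoint = request.endpoint or 'unknown'
    request_count.labels(request.method, endpoint, str(response.status_code)).inc()
    request_duration.labels(request.method, endpoint).observe(duration)
    return response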
Grafana Dashboards:
Dashboard 1: System Health
- Request rate (requests/second)
- Error rate (%)
- Response time (p50, p95, p99)
- CPU and memory usage per pod
- Database connection pool utilization
Dashboard 2: Business Metrics
- Dispatches per hour
- Queue depth over time
- Average dispatch duration
- Equipment utilization rate
- Top tenants by activity
Dashboard 3: Tenant-Specific
- Per-tenant request rate
- Per-tenant error rate
- Per-tenant active dispatches
- Per-tenant queue depth
Alerting Rules
Critical Alerts (PagerDuty, immediate response):
groups:
- name: critical
interval: 1m
rules:
- alert: HighErrorRate
expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
for: 2m
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value }} (>5%) for 2 minutes"
- alert: DatabaseDown
expr: up{job="postgres"} == 0
for: 1m
annotations:
summary: "PostgreSQL is down"
- alert: QueueDepthCritical
expr: dispatch_queue_depth{priority="urgent"} > 50
for: 5m
annotations:
summary: "Urgent queue depth exceeds 50"
Warning Alerts (Slack, investigate during business hours):
- name: warnings
interval: 5m
rules:
- alert: SlowResponseTime
expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2.0
for: 10m
annotations:
summary: "95th percentile response time > 2 seconds"
- alert: HighMemoryUsage
expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.85
for: 10m
annotations:
summary: "Memory usage > 85%"
- alert: CertificateExpiringSoon
expr: ssl_certificate_expiry_days < 30
annotations:
summary: "SSL certificate expires in {{ $value }} days"
Security Enhancements
Certificate Management
Let’s Encrypt (Automated):
cert-manager (Kubernetes):
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: letsencrypt-prod
spec:
acme:
server: https://acme-v02.api.letsencrypt.org/directory
email: ops@dispatchapp.com
privateKeySecretRef:
name: letsencrypt-prod
solvers:
- http01:
ingress:
class: nginx
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: dispatch-tls
spec:
secretName: dispatch-tls
issuerRef:
name: letsencrypt-prod
kind: ClusterIssuer
dnsNames:
- dispatchapp.com
- "*.dispatchapp.com"
renewBefore: 720h # Renew 30 days before expiry
Automated Renewal:
- cert-manager checks expiry daily
- Renews when <30 days remaining
- Zero-downtime certificate rotation
- Monitoring alert if renewal fails
Internal Services (Vault PKI):
HashiCorp Vault as Internal CA:
# Enable PKI secrets engine
vault secrets enable pki
vault secrets tune -max-lease-ttl=87600h pki # 10 years
# Generate root CA
vault write pki/root/generate/internal \
common_name="Dispatch App Internal CA" \
ttl=87600h
# Create role for service certificates
vault write pki/roles/internal-services \
    allowed_domains="dispatch.internal,dispatch.svc.cluster.local" \
    allow_subdomains=true \
    max_ttl=2160h  # 90 days
# Backend requests certificate on startup
vault write pki/issue/internal-services \
common_name="backend.dispatch.svc.cluster.local" \
ttl=2160h
Certificate Rotation:
# Application checks certificate expiry on startup
def ensure_valid_certificate():
cert_path = '/var/run/secrets/tls/tls.crt'
cert = load_certificate(cert_path)
days_until_expiry = (cert.not_valid_after - datetime.utcnow()).days
if days_until_expiry < 30:
logger.warning(f"Certificate expires in {days_until_expiry} days, requesting new certificate")
new_cert = request_certificate_from_vault()
replace_certificate(new_cert)
reload_app() # Graceful reload
Secrets Management (Vault)
Dynamic Database Credentials:
# Configure database secrets engine
vault secrets enable database
vault write database/config/dispatch-postgres \
plugin_name=postgresql-database-plugin \
allowed_roles="dispatch-backend" \
connection_url="postgresql://{{username}}:{{password}}@postgres:5432/dispatch" \
username="vault" \
password="vault_password"
# Create role that generates 1-hour credentials
vault write database/roles/dispatch-backend \
db_name=dispatch-postgres \
creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}' IN ROLE dispatch_app;" \
default_ttl="1h" \
max_ttl="24h"
# Backend requests credentials on startup
vault read database/creds/dispatch-backend
# Returns: { "username": "v-token-12345", "password": "A1B2C3..." }
Application Integration:
import hvac
vault_client = hvac.Client(url='https://vault.internal:8200')
# Authenticate (Kubernetes service account, AppRole, etc.)
vault_client.auth.kubernetes.login(
role='dispatch-backend',
jwt=open('/var/run/secrets/kubernetes.io/serviceaccount/token').read()
)
# Get database credentials
db_creds = vault_client.secrets.database.generate_credentials(name='dispatch-backend')
db_user = db_creds['data']['username']
db_pass = db_creds['data']['password']
# Connect to database
engine = create_engine(f'postgresql://{db_user}:{db_pass}@postgres:5432/dispatch')
# Credentials automatically revoked after 1 hour
# Application renews before expiry or reconnects with new credentials
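A sketch of background lease renewal with hvac, keeping renewals well inside the 1-hour default TTL; the 30-minute interval is an assumption:
import threading
import time

def renew_db_lease_periodically(lease_id, interval_seconds=1800):
    # Renew every 30 minutes; if renewal fails the app should fall back
    # to requesting fresh credentials and reconnecting
    def loop():
        while True:
            time.sleep(interval_seconds)
            vault_client.sys.renew_lease(lease_id=lease_id)
    threading.Thread(target=loop, daemon=True).start()

renew_db_lease_periodically(db_creds['lease_id'])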
Compliance Preparation (ISO 27001)
ISO 27001 Alignment:
A.9 Access Control:
- ✅ Keycloak provides centralized identity management
- ✅ RBAC enforced at API level
- ✅ MFA for privileged accounts (admin, dispatcher)
- ✅ Regular access reviews (quarterly audit of user roles)
A.10 Cryptography:
- ✅ TLS 1.3 for all data in transit
- ✅ Encryption at rest for S3 (SSE-KMS)
- Database encryption (TDE): planned for Deep-Water (documented gap)
- ✅ Key management via Vault
A.12 Operations Security:
- ✅ Automated security patching (Dependabot + weekly review)
- ✅ Vulnerability scanning (Snyk integrated in CI/CD)
- ✅ Change management process (Git, PR reviews, staging environment)
- ✅ Capacity monitoring (Prometheus alerts)
A.14 System Acquisition, Development, Maintenance:
- ✅ Security requirements in design phase (documented)
- ✅ Code review process (all PRs reviewed)
- ✅ Penetration testing (quarterly, external firm)
- ✅ Secure coding guidelines documented
A.16 Incident Management:
- ✅ Incident response plan documented
- ✅ Security event logging (ELK stack)
- ✅ 24/7 on-call rotation (PagerDuty)
- ✅ Post-incident reviews (within 48 hours)
A.18 Compliance:
- ✅ Regular internal audits (monthly)
- ✅ External audit (annual, preparation for SOC 2 Type I)
- ✅ Documentation of all controls
- ✅ Gap analysis against ISO 27001 requirements
Documentation Required:
- Information Security Policy
- Risk Assessment Register
- Asset Inventory
- Incident Response Procedures
- Business Continuity Plan
- Disaster Recovery Plan
- Access Control Policy
- Change Management Process
Deployment & DevOps
CI/CD Pipeline
Tools: GitHub Actions, GitLab CI, or Jenkins
Pipeline Stages:
1. Build:
# .github/workflows/build.yml
name: Build and Test
on: [push, pull_request]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: 3.11
- name: Install dependencies
run: |
pip install -r requirements.txt
pip install -r requirements-dev.txt
- name: Lint
run: |
flake8 app/
black --check app/
- name: Unit tests
run: pytest tests/unit --cov=app
- name: Integration tests
run: pytest tests/integration
- name: Security scan
run: |
bandit -r app/
safety check
build-image:
needs: test
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Build Docker image
run: docker build -t dispatch-backend:${{ github.sha }} .
- name: Scan image
run: trivy image dispatch-backend:${{ github.sha }}
- name: Push to registry
run: |
echo ${{ secrets.DOCKER_PASSWORD }} | docker login -u ${{ secrets.DOCKER_USERNAME }} --password-stdin
docker push dispatch-backend:${{ github.sha }}
2. Deploy to Staging:
deploy-staging:
needs: build-image
runs-on: ubuntu-latest
steps:
- name: Update Kubernetes manifests
run: |
kubectl set image deployment/dispatch-backend \
backend=dispatch-backend:${{ github.sha }} \
-n dispatch-staging
- name: Wait for rollout
run: kubectl rollout status deployment/dispatch-backend -n dispatch-staging
- name: Smoke tests
run: |
curl -f https://staging.dispatchapp.com/api/health
pytest tests/smoke --base-url=https://staging.dispatchapp.com
3. Deploy to Production (manual approval):
deploy-production:
needs: deploy-staging
runs-on: ubuntu-latest
environment:
name: production
url: https://dispatchapp.com
steps:
- name: Update Kubernetes manifests
run: |
kubectl set image deployment/dispatch-backend \
backend=dispatch-backend:${{ github.sha }} \
-n dispatch-production
- name: Wait for rollout
run: kubectl rollout status deployment/dispatch-backend -n dispatch-production
- name: Notify Slack
run: |
curl -X POST ${{ secrets.SLACK_WEBHOOK_URL }} \
-H 'Content-Type: application/json' \
-d '{"text":"Production deployment successful: ${{ github.sha }}"}'
Zero-Downtime Deployments
Rather than maintaining two full environments (classic blue-green), a Kubernetes rolling update delivers zero downtime with less infrastructure:
Kubernetes Rolling Update:
apiVersion: apps/v1
kind: Deployment
metadata:
name: dispatch-backend
spec:
replicas: 5
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1 # Only 1 pod down at a time
maxSurge: 2 # Up to 2 extra pods during update
template:
spec:
containers:
- name: backend
image: dispatch-backend:v1.2.3
readinessProbe:
httpGet:
path: /api/health
port: 5000
initialDelaySeconds: 10
periodSeconds: 5
livenessProbe:
httpGet:
path: /api/health
port: 5000
initialDelaySeconds: 30
periodSeconds: 10
Process:
- New pods spin up (v1.2.3)
- Readiness probe passes → added to load balancer
- Old pod (v1.2.2) removed from load balancer
- Old pod allowed to finish in-flight requests (30s grace period; see the fragment below)
- Old pod terminated
- Repeat for all pods
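The 30-second grace period maps to pod settings. A sketch of the relevant deployment fragment; the preStop sleep is a common (assumed) pattern for letting the load balancer deregister the pod first:
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: backend
          lifecycle:
            preStop:
              exec:
                # Delay SIGTERM so the endpoint is removed from the
                # load balancer before the process starts shutting down
                command: ["sleep", "10"]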
Rollback (if deployment fails):
# Automatic rollback on failed readiness probes
# Or manual rollback
kubectl rollout undo deployment/dispatch-backend -n dispatch-production
Database Migrations
Zero-Downtime Migration Strategy:
Phase 1: Additive Changes Only
# Migration: Add new column
def upgrade():
op.add_column('dispatch_assignments',
sa.Column('new_field', sa.String(), nullable=True))
# Deploy code that writes to both old and new fields
# Old code still works (new_field is nullable)
Phase 2: Backfill Data
# Background job fills new_field for existing rows, in batches to keep
# transactions short and memory bounded
def backfill_new_field(batch_size=1000):
    while True:
        assignments = (DispatchAssignment.query
                       .filter_by(new_field=None)
                       .limit(batch_size)
                       .all())
        if not assignments:
            break
        for assignment in assignments:
            assignment.new_field = calculate_new_field(assignment)
        db.session.commit()  # Commit per batch
Phase 3: Make Required
# After all rows backfilled
def upgrade():
op.alter_column('dispatch_assignments', 'new_field', nullable=False)
# Deploy code that relies on new_field
Phase 4: Remove Old Field (if applicable)
# After old code no longer deployed
def upgrade():
op.drop_column('dispatch_assignments', 'old_field')
Backup & Disaster Recovery
Backup Strategy
Database Backups:
- Frequency: Continuous (WAL archiving) + daily snapshots
- Retention: 35 days (RDS maximum; the default is 7)
- Location: Same region + cross-region replication
- Encryption: AES-256 (at rest)
Automated Snapshots (RDS):
# Automated daily snapshots at 3 AM
# Manual snapshot before major changes
aws rds create-db-snapshot \
--db-instance-identifier dispatch-production \
--db-snapshot-identifier pre-migration-2025-01-15
Point-in-Time Recovery (PITR):
# Restore to any second in last 35 days
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier dispatch-production \
--target-db-instance-identifier dispatch-restored \
--restore-time 2025-01-15T10:30:00Z
Object Storage Backups:
- S3 Versioning: Enabled (recover deleted/overwritten files)
- Cross-Region Replication: Enabled (DR)
- Lifecycle Policy: Reports archived to Glacier after 1 year
Disaster Recovery Plan
Scenarios & Procedures:
1. Database Failure:
- Detection: Automatic (RDS Multi-AZ failover)
- RTO: <2 minutes (automatic failover to standby)
- RPO: 0 (synchronous replication)
- Procedure: No action required (automatic)
2. Regional Outage (AWS region down):
- Detection: Manual (monitoring shows all services down)
- RTO: 1 hour (manual failover to backup region)
- RPO: 5 minutes (asynchronous replication lag)
- Procedure:
- Promote read replica in backup region to primary
- Update DNS to point to backup region
- Restart backend pods in backup region
- Communicate status to customers
3. Data Corruption (accidental deletion):
- Detection: Customer report or internal audit
- RTO: 2 hours (restore from backup)
- RPO: Up to 24 hours from daily snapshots; PITR (above) can shrink this to minutes
- Procedure:
- Identify corruption extent (which tenants affected)
- Restore affected tenant schemas from backup
- Validate data integrity
- Notify affected customers
4. Security Breach (unauthorized access):
- Detection: Anomaly detection, SIEM alerts
- RTO: N/A (incident response, not recovery)
- Procedure:
- Isolate affected systems
- Revoke compromised credentials
- Investigate attack vector
- Patch vulnerability
- Notify affected customers (within 72 hours, GDPR)
- Conduct post-mortem
DR Testing
Quarterly DR Drills:
1. Planned Failover Test (Q1, Q3):
   - Schedule maintenance window
   - Perform manual failover to backup region
   - Verify all services operational
   - Fail back to primary region
   - Document any issues
2. Backup Restore Test (Q2, Q4):
   - Create test environment
   - Restore latest backup
   - Run smoke tests
   - Verify data integrity
   - Document restore time
Cost Optimization
Cost Breakdown (Estimated Monthly)
| Category | Cost |
|---|---|
| Compute | |
| - Kubernetes nodes (3x t3.large) | $180 |
| - EKS control plane | $75 |
| Database | |
| - RDS PostgreSQL (primary) | $120 |
| - RDS read replica | $120 |
| Cache & Queue | |
| - ElastiCache Redis | $50 |
| Storage | |
| - S3 (1 TB) | $23 |
| - EBS volumes | $30 |
| Networking | |
| - Load Balancer | $20 |
| - Data transfer | $100 |
| Monitoring & Logging | |
| - CloudWatch | $30 |
| - ELK stack (self-hosted) | Included in compute |
| Security | |
| - Vault (self-hosted) | Included in compute |
| Backups | $25 |
| Total | ~$773/month |
Cost Per Customer (at 50 customers): ~$15/month
Suggested Pricing: $50-100/month per customer (3-6x cost)
Cost Optimization Strategies
1. Right-Size Resources:
- Monitor actual usage vs. provisioned capacity
- Downsize underutilized nodes
- Use auto-scaling for variable workloads
2. Reserved Instances (for predictable workload):
- 1-year reserved instances: 30-40% discount
- 3-year reserved instances: 50-60% discount
- Apply to: Database, cache, baseline compute
3. Spot Instances (for non-critical workloads):
- Use for batch jobs (report generation)
- Dev/staging environments
- 70-90% discount vs. on-demand
4. Storage Lifecycle Policies:
- Move old reports to Glacier (10x cheaper)
- Delete old logs after retention period
- Compress backups
5. Data Transfer Optimization:
- Use CloudFront CDN for static assets
- Keep traffic within same region when possible
- Compress API responses
Transition to Deep-Water
Triggers to Advance
You’re ready for Deep-Water when:
- 500+ concurrent users (mid-depth infrastructure stressed)
- Geographic expansion required (customers in multiple regions)
- Enterprise customers demanding 99.9% SLA
- Compliance certifications achieved (ISO 27001, SOC 2)
- Microservices extraction justified (specific bottlenecks identified)
- Team grown to 10+ engineers (can manage distributed systems)
- Revenue justifies investment ($500k+ ARR)
Timeline: 12-18 months after Mid-Depth
Conclusion
Mid-Depth is about professionalization and reliability. Goals:
- Multi-tenant at scale: Support 50+ customers with data isolation
- Enterprise-ready: Compliance preparation, professional security
- Reliable: 99% uptime, fast incident response
- Observable: Comprehensive monitoring and logging
- Performant: <2 second response times under load
What Success Looks Like:
- Customers trust you with their operations
- Minimal support burden (system “just works”)
- Predictable costs and performance
- Clear path to enterprise scale (Deep-Water)
Key Mindset: Build for reliability and maintainability, not premature optimization. The modular monolith can take you far—don’t rush to microservices.
When enterprise customers demand geographic distribution and 99.9% SLAs, you’re ready for Deep-Water.