Case Study logistics 40 min read

Mid-Depth Level: Production SaaS

Overview

Goal: Deliver reliable, scalable multi-tenant SaaS with enterprise features and compliance readiness

Target Scale:

  • 1,000 concurrent users across multiple organizations
  • 200+ concurrent dispatchers
  • 10,000+ dispatch operations per day
  • 50+ paying customers

Timeline: 6-12 months after Surface Level launch

Investment: 3-5 developers, $1,000-3,000/month infrastructure


Success Criteria

Quantitative Metrics

  • 99% uptime over 90-day rolling period
  • <2 second average API response time (p95)
  • <1% error rate
  • Support 50+ active tenants
  • Zero data loss incidents
  • <1 hour RTO (Recovery Time Objective)
  • <5 minute RPO (Recovery Point Objective)

Business Metrics

  • 50+ paying organizations
  • <5% monthly churn rate
  • Net Promoter Score (NPS) >30
  • 90%+ of feature requests tracked and prioritized
  • Enterprise customers demanding compliance (SOC 2, ISO 27001)

Operational Metrics

  • Deployment frequency: Multiple per week (without downtime)
  • Incident response time: <15 minutes
  • Security patch cycle: <48 hours for critical CVEs
  • Customer support response time: <4 hours

Architecture Evolution

From Monolith to Enhanced Monolith

Still a Monolith, But with:

  • Multi-tenant data isolation
  • Horizontal scaling (multiple backend instances)
  • External services integration (Redis, S3, managed Keycloak)
  • High availability for critical components
  • Professional monitoring and observability

Why Not Microservices Yet?

  • Monolith can handle 1,000 concurrent users easily
  • Microservices add operational complexity without clear benefit at this scale
  • Team size doesn’t justify distributed system overhead
  • Transaction boundaries still align with module boundaries

What Changes:

  • Database: Single instance → Primary + replicas
  • Queue: In-memory → Redis-backed
  • Storage: Filesystem → S3
  • Certificates: Self-signed → Let’s Encrypt
  • Deployment: Single server → Kubernetes or multi-container
  • Auth: Self-hosted Keycloak → Support external Keycloak integration

Evolution from Surface Level

This table maps Surface Level limitations to Mid-Depth solutions, with the business driver for each change.

Surface Level LimitationMid-Depth SolutionBusiness Driver
Single-tenant databaseSchema-per-tenant isolationMultiple customers require data separation for security and compliance
30-second pollingWebSocket real-time updatesDispatchers need instant visibility for time-critical decisions
5-10 users max1,000+ concurrent usersGrowth from PMF validation to production SaaS
Self-signed certificatesLet’s Encrypt auto-renewalEnterprise customers won’t trust self-signed certs; compliance requires trusted CAs
In-memory queue (lost on restart)Redis-backed persistent queueQueue data too valuable to lose; enables horizontal scaling
Local filesystem storageS3 object storageScalability, redundancy, 7-year compliance retention
Basic logging (local files)ELK stack centralized loggingTroubleshooting across multiple instances; audit trail requirements
No MFAKeycloak MFA enforcementSecurity compliance (ISO 27001) requires MFA for privileged users
Single DB instance (no HA)RDS Multi-AZ with read replicas99% uptime SLA; heavy reporting queries impact live operations
Manual deploymentsCI/CD pipeline with rollbackMultiple weekly deployments require automation; reduce human error
Best-effort supportDefined incident response SLAsEnterprise customers expect guaranteed response times
No rate limitingPer-user and per-tenant limitsPrevent abuse, ensure fair usage across tenants
Basic RBACABAC (attribute-based)Regional restrictions, department-level access control

Key Persona Impact

PersonaWhat Changes for Them
DispatcherReal-time updates, faster UI, filtering for large datasets
Equipment SupportUtilization reports, maintenance scheduling, certification tracking
Dev/Ops TeamCI/CD automation, centralized logs, defined on-call rotation

Infrastructure Architecture

Deployment Options

┌─────────────────────────────────────────────┐
│ AWS ALB / GKE Load Balancer                 │
│ (SSL Termination - Let's Encrypt)           │
└──────────────┬──────────────────────────────┘

    ┌──────────┴──────────┐
    │                     │
    ▼                     ▼
┌──────────┐         ┌──────────┐
│ Frontend │         │ Backend  │
│ Pod      │         │ Pods     │
│ (nginx)  │         │ (3-5x)   │
└──────────┘         └─────┬────┘

         ┌─────────────────┼─────────────────┐
         │                 │                 │
         ▼                 ▼                 ▼
    ┌─────────┐      ┌─────────┐      ┌─────────┐
    │ Redis   │      │ RDS     │      │ S3      │
    │ Cluster │      │ Postgres│      │ Bucket  │
    │ (Queue) │      │ Primary │      │         │
    └─────────┘      │ +Replica│      └─────────┘
                     └─────────┘
    ┌──────────────────┐
    │ Keycloak Cluster │
    │ (2+ pods)        │
    └──────────────────┘

Kubernetes Resources:

  • Namespace: dispatch-production
  • Backend Deployment: 3-5 replicas (HPA enabled, scale 3-10)
  • Frontend Deployment: 2 replicas (static files, minimal resources)
  • Redis: Single pod initially, cluster later
  • PostgreSQL: External managed service (AWS RDS, Google Cloud SQL)
  • S3: Managed object storage
  • Keycloak: StatefulSet with 2 replicas

Resource Allocation:

Backend Pod:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    cpu: 2000m
    memory: 2Gi

Frontend Pod:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

Monthly Cost Estimate (AWS):

  • EKS cluster: $75
  • EC2 nodes (3x t3.large): $180
  • RDS PostgreSQL (db.t3.large): $120
  • Redis (cache.t3.medium): $50
  • S3 storage (500 GB): $12
  • Load Balancer: $20
  • Data transfer: $50
  • Total: ~$500-600/month

Option B: Managed Container Service (Simpler)

AWS ECS / Google Cloud Run / Azure Container Apps

┌─────────────────────────────────────────────┐
│ Application Load Balancer                   │
└──────────────┬──────────────────────────────┘

    ┌──────────┴──────────┐
    │                     │
    ▼                     ▼
┌──────────┐         ┌──────────┐
│ Frontend │         │ Backend  │
│ Service  │         │ Service  │
│ (2 tasks)│         │ (3 tasks)│
└──────────┘         └─────┬────┘

         ┌─────────────────┼─────────────────┐
         │                 │                 │
         ▼                 ▼                 ▼
    ┌─────────┐      ┌─────────┐      ┌─────────┐
    │ElastiCache│     │ RDS     │      │ S3      │
    │ Redis    │     │ Postgres│      │         │
    └─────────┘      └─────────┘      └─────────┘

Pros:

  • Simpler than Kubernetes
  • Less operational overhead
  • Easier to manage for small teams

Cons:

  • Less portable (vendor lock-in)
  • Fewer ecosystem tools
  • May need migration to K8s later

Monthly Cost Estimate: Similar to Kubernetes (~$500-600)


Database Architecture

PostgreSQL Configuration:

Primary Instance:

  • Size: db.t3.large (2 vCPU, 8 GB RAM) initially
  • Storage: 500 GB SSD (auto-scaling to 2 TB)
  • IOPS: Provisioned IOPS SSD for consistent performance
  • Multi-AZ: Enabled (automatic failover to standby)

Read Replicas:

  • Count: 1-2 replicas
  • Purpose:
    • Reports and analytics (offload primary)
    • Dashboard queries (stale data acceptable)
  • Lag: Typically <1 second
  • Cross-region: Optional (for disaster recovery)

Connection Management:

pgBouncer (connection pooler):

Backend Pods (50 connections each)

    pgBouncer (transaction mode)

PostgreSQL (max 100 connections)

Benefits:

  • Reuse connections efficiently
  • Handle connection spikes
  • Route reads to replicas automatically

Configuration:

[databases]
dispatch = host=postgres-primary.rds.amazonaws.com port=5432 dbname=dispatch
dispatch_ro = host=postgres-replica.rds.amazonaws.com port=5432 dbname=dispatch

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25

Multi-Tenancy Implementation:

Schema-Per-Tenant:

-- Tenant management schema
CREATE SCHEMA tenant_management;

CREATE TABLE tenant_management.tenants (
    id UUID PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    subdomain VARCHAR(63) UNIQUE NOT NULL,
    schema_name VARCHAR(63) UNIQUE NOT NULL,
    keycloak_realm VARCHAR(255),  -- External Keycloak support
    status VARCHAR(50) DEFAULT 'active',  -- active, suspended, trial
    tier VARCHAR(50) DEFAULT 'standard',  -- standard, premium, enterprise
    created_at TIMESTAMP DEFAULT NOW(),
    settings JSONB  -- Tenant-specific configuration
);

-- Each tenant gets identical schema structure
CREATE SCHEMA tenant_acme;
CREATE SCHEMA tenant_contoso;

-- Automated schema creation function
CREATE OR REPLACE FUNCTION create_tenant_schema(tenant_schema VARCHAR)
RETURNS VOID AS $$
BEGIN
    EXECUTE format('CREATE SCHEMA %I', tenant_schema);
    -- Create all tables in new schema
    EXECUTE format('CREATE TABLE %I.work_orders (...)', tenant_schema);
    EXECUTE format('CREATE TABLE %I.equipment (...)', tenant_schema);
    -- ... all other tables
END;
$$ LANGUAGE plpgsql;

Tenant Resolution:

# Middleware determines tenant from subdomain or JWT
def get_tenant():
    # Option 1: Subdomain
    subdomain = request.host.split('.')[0]
    tenant = Tenant.query.filter_by(subdomain=subdomain).first()
    
    # Option 2: JWT claim
    if not tenant:
        token = validate_token(request)
        tenant_id = token.get('tenant_id')
        tenant = Tenant.query.get(tenant_id)
    
    return tenant

# Set search_path for all queries in this request
@app.before_request
def set_tenant_context():
    tenant = get_tenant()
    db.session.execute(f"SET search_path TO {tenant.schema_name}")

Row-Level Security (Backup):

-- Even if search_path fails, RLS prevents cross-tenant access
ALTER TABLE work_orders ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON work_orders
    USING (tenant_id = current_setting('app.current_tenant')::UUID);

Redis Architecture

Purpose:

  • Dispatch queue (durable, survives restarts)
  • Session cache (optional, reduce Keycloak calls)
  • Dashboard data cache (reduce database load)
  • Rate limiting counters

Configuration:

Single Instance (Sufficient Initially):

  • Size: cache.t3.medium (2 vCPU, 3.09 GB RAM)
  • Persistence: AOF (Append-Only File) enabled
  • Backup: Daily snapshots

Redis Cluster (When Scaling):

  • 3 primary nodes + 3 replicas
  • Automatic sharding
  • High availability

Queue Implementation:

import redis
import json

class RedisQueueManager:
    def __init__(self, redis_client):
        self.redis = redis_client
    
    def add_to_queue(self, tenant_id, queue_entry):
        # Priority queue: urgent items in separate list
        key = f"dispatch_queue:{tenant_id}:{queue_entry['priority']}"
        value = json.dumps(queue_entry)
        self.redis.rpush(key, value)  # Add to end (FIFO)
    
    def pop_from_queue(self, tenant_id, equipment_type_id):
        # Check urgent first, then standard
        for priority in ['urgent', 'standard']:
            key = f"dispatch_queue:{tenant_id}:{priority}"
            
            # Atomic pop from left (oldest item)
            value = self.redis.lpop(key)
            if value:
                entry = json.loads(value)
                if entry['equipment_type_id'] == equipment_type_id:
                    return entry
                else:
                    # Not matching, put back
                    self.redis.rpush(key, value)
        
        return None
    
    def get_queue_depth(self, tenant_id):
        urgent = self.redis.llen(f"dispatch_queue:{tenant_id}:urgent")
        standard = self.redis.llen(f"dispatch_queue:{tenant_id}:standard")
        return {'urgent': urgent, 'standard': standard}

Cache Implementation:

def get_dashboard_data(tenant_id):
    cache_key = f"dashboard:{tenant_id}"
    
    # Check cache first
    cached = redis.get(cache_key)
    if cached:
        return json.loads(cached)
    
    # Cache miss, query database
    data = {
        'active_dispatches': get_active_dispatches(),
        'available_equipment': get_available_equipment(),
        'queue': get_queue_entries()
    }
    
    # Cache for 30 seconds
    redis.setex(cache_key, 30, json.dumps(data))
    return data

Object Storage (S3)

Bucket Structure:

s3://dispatch-app-production/
  /tenants
    /tenant-{id}
      /config-files
        /{work_order_id}
          /config_v1.json
          /config_v2.pdf
      /reports
        /{dispatch_assignment_id}
          /status_report.pdf
  /backups
    /database
      /tenant-{id}
        /dispatch_20250115.backup
  /static
    /frontend-build
      /assets
        /index.js
        /styles.css

Lifecycle Policies:

{
  "Rules": [
    {
      "Id": "TransitionOldReports",
      "Status": "Enabled",
      "Transitions": [
        {
          "Days": 90,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 365,
          "StorageClass": "GLACIER"
        }
      ]
    },
    {
      "Id": "DeleteOldBackups",
      "Status": "Enabled",
      "Expiration": {
        "Days": 90
      },
      "Filter": {
        "Prefix": "backups/"
      }
    }
  ]
}

Direct Upload Flow (Efficient):

# Backend generates pre-signed POST URL
import boto3

s3_client = boto3.client('s3')

def generate_upload_url(tenant_id, work_order_id, filename):
    key = f"tenants/{tenant_id}/config-files/{work_order_id}/{filename}"
    
    presigned_post = s3_client.generate_presigned_post(
        Bucket='dispatch-app-production',
        Key=key,
        Fields={'acl': 'private'},
        Conditions=[
            {'acl': 'private'},
            ['content-length-range', 1, 10485760]  # 1 byte to 10 MB
        ],
        ExpiresIn=3600  # 1 hour
    )
    
    return presigned_post

# Frontend uploads directly to S3
# After upload, frontend notifies backend to create database record

Security:

  • Bucket not publicly accessible
  • All access via pre-signed URLs (time-limited)
  • Server-side encryption: SSE-S3 or SSE-KMS
  • Versioning enabled (can recover deleted files)
  • Cross-region replication (optional, for DR)

Authentication & Authorization Enhancements

Scenario A: Self-Hosted Keycloak Cluster

High Availability Setup:

┌──────────────────┐
│ Load Balancer    │
└────────┬─────────┘

    ┌────┴────┐
    │         │
    ▼         ▼
┌────────┐ ┌────────┐
│Keycloak│ │Keycloak│
│Node 1  │ │Node 2  │
└───┬────┘ └───┬────┘
    │          │
    └────┬─────┘

   ┌──────────┐
   │PostgreSQL│
   │(Shared)  │
   └──────────┘

Configuration:

  • 2+ Keycloak instances behind load balancer
  • Shared PostgreSQL database (Keycloak backend)
  • Session replication via JGroups or external cache
  • Let’s Encrypt certificates (automated renewal)

MFA Enforcement:

// Keycloak realm configuration
{
  "realm": "dispatch-app",
  "requiredActions": ["CONFIGURE_TOTP"],  // Force MFA setup on first login
  "otpPolicyType": "totp",
  "otpPolicyAlgorithm": "HmacSHA1",
  "otpPolicyDigits": 6,
  "otpPolicyPeriod": 30,
  "browserFlow": "browser-with-mfa",  // Custom flow requiring MFA for admin/dispatcher
  "roles": {
    "realm": [
      {
        "name": "admin",
        "requiredOtp": true  // Always require MFA
      },
      {
        "name": "dispatcher",
        "requiredOtp": true
      }
    ]
  }
}

Scenario B: External Keycloak Integration

Customer Provides Their Keycloak:

Configuration Handoff:

  1. Customer creates realm: dispatch-{customer-name}
  2. Customer creates two clients:
    • dispatch-web (public, OIDC, for frontend)
    • dispatch-api (confidential, for backend token validation)
  3. Customer defines roles: dispatch-admin, dispatch-dispatcher, dispatch-driver, dispatch-viewer
  4. Customer provides:
    • Keycloak URL
    • Realm name
    • Client IDs and secrets
    • JWKS endpoint URL

Application Configuration:

# Per-tenant Keycloak settings stored in database
tenants:
  - id: acme-corp
    keycloak_url: https://sso.acme.com
    keycloak_realm: dispatch-acme
    keycloak_client_id: dispatch-api
    keycloak_client_secret: <secret>
    keycloak_jwks_url: https://sso.acme.com/realms/dispatch-acme/protocol/openid-connect/certs

  - id: contoso
    keycloak_url: https://keycloak.contoso.com
    keycloak_realm: dispatch-production
    keycloak_client_id: dispatch-backend
    keycloak_client_secret: <secret>
    keycloak_jwks_url: https://keycloak.contoso.com/realms/dispatch-production/protocol/openid-connect/certs

Token Validation (Cached JWKS):

import requests
import jwt
from functools import lru_cache
from datetime import datetime, timedelta

class MultiTenantAuth:
    def __init__(self):
        self.jwks_cache = {}  # tenant_id -> (jwks, expiry)
    
    def get_jwks(self, tenant):
        # Check cache
        if tenant.id in self.jwks_cache:
            jwks, expiry = self.jwks_cache[tenant.id]
            if datetime.utcnow() < expiry:
                return jwks
        
        # Fetch JWKS
        response = requests.get(tenant.keycloak_jwks_url)
        jwks = response.json()
        
        # Cache for 1 hour
        self.jwks_cache[tenant.id] = (jwks, datetime.utcnow() + timedelta(hours=1))
        return jwks
    
    def validate_token(self, token, tenant):
        try:
            jwks = self.get_jwks(tenant)
            
            # Decode and validate
            decoded = jwt.decode(
                token,
                jwks,
                algorithms=['RS256'],
                audience=tenant.keycloak_client_id,
                options={'verify_exp': True}
            )
            
            return decoded
        except jwt.ExpiredSignatureError:
            raise Unauthorized("Token expired")
        except jwt.InvalidTokenError:
            raise Unauthorized("Invalid token")

Fallback Mechanism (Graceful Degradation):

def validate_token_with_fallback(token, tenant):
    try:
        # Try to validate with live JWKS
        return validate_token(token, tenant)
    except requests.ConnectionError:
        # Customer's Keycloak unreachable
        # Use cached JWKS (even if expired) with warning
        logger.warning(f"Keycloak unreachable for tenant {tenant.id}, using cached JWKS")
        
        if tenant.id in jwks_cache:
            jwks, _ = jwks_cache[tenant.id]  # Ignore expiry
            # Validate with cached JWKS (accept 15-minute grace period)
            return jwt.decode(token, jwks, algorithms=['RS256'], options={'verify_exp': False})
        else:
            raise ServiceUnavailable("Authentication service unavailable")

Advanced Authorization (ABAC)

Attribute-Based Access Control:

Example Use Case: Dispatcher can only dispatch equipment in their region

Token Claims:

{
  "sub": "user-123",
  "preferred_username": "john.dispatcher",
  "realm_access": {
    "roles": ["dispatcher"]
  },
  "custom_attributes": {
    "region": "west-coast",
    "department": "logistics"
  }
}

Policy Enforcement:

def authorize_dispatch(user, equipment):
    # Check role
    if 'dispatcher' not in user.roles and 'admin' not in user.roles:
        raise Forbidden("Not authorized to dispatch")
    
    # Check attribute (region)
    if 'admin' not in user.roles:  # Admins bypass region restriction
        user_region = user.custom_attributes.get('region')
        equipment_region = equipment.region
        
        if user_region != equipment_region:
            raise Forbidden(f"Cannot dispatch equipment outside your region ({user_region})")
    
    return True

Real-Time Updates (WebSockets)

Upgrade from Polling to WebSockets:

Architecture:

Frontend (Socket.io client)
        ↓ WSS
Backend (Flask-SocketIO)
        ↓ Pub/Sub
Redis (Message Broker)

Benefits:

  • Real-time dashboard updates (<1 second latency)
  • Reduced server load (no constant polling)
  • Lower bandwidth usage
  • Better user experience

Implementation:

Backend (Flask-SocketIO):

from flask_socketio import SocketIO, emit, join_room

socketio = SocketIO(app, cors_allowed_origins="*", message_queue='redis://redis:6379')

@socketio.on('connect')
def handle_connect():
    # Authenticate user
    token = request.args.get('token')
    user = validate_token(token)
    
    # Join tenant-specific room
    tenant_id = get_tenant_from_user(user)
    join_room(f"tenant_{tenant_id}")
    
    emit('connected', {'message': 'Welcome to dispatch updates'})

@socketio.on('subscribe_dispatches')
def handle_subscribe():
    # User subscribes to active dispatch updates
    pass

# When dispatch status changes
def notify_dispatch_update(tenant_id, dispatch_data):
    socketio.emit('dispatch_updated', dispatch_data, room=f"tenant_{tenant_id}")

# When queue changes
def notify_queue_update(tenant_id, queue_data):
    socketio.emit('queue_updated', queue_data, room=f"tenant_{tenant_id}")

Frontend (React):

import { io } from 'socket.io-client';
import { useAuth } from '@/features/auth/useAuth';

export function useDispatchUpdates() {
  const { token } = useAuth();
  const [dispatches, setDispatches] = useState([]);
  
  useEffect(() => {
    const socket = io('wss://api.dispatchapp.com', {
      auth: { token }
    });
    
    socket.on('connect', () => {
      console.log('Connected to dispatch updates');
      socket.emit('subscribe_dispatches');
    });
    
    socket.on('dispatch_updated', (data) => {
      // Update local state
      setDispatches(prev => {
        const index = prev.findIndex(d => d.id === data.id);
        if (index >= 0) {
          const updated = [...prev];
          updated[index] = data;
          return updated;
        } else {
          return [...prev, data];
        }
      });
    });
    
    socket.on('queue_updated', (data) => {
      // Update queue display
    });
    
    return () => socket.disconnect();
  }, [token]);
  
  return { dispatches };
}

Scaling WebSockets:

  • Redis pub/sub allows multiple backend instances to share WebSocket state
  • Each backend instance handles subset of connections
  • Load balancer sticky sessions ensure client connects to same instance

Performance Optimizations

Database Optimization

Query Optimization:

-- Slow query (before)
SELECT d.*, wo.order_number, e.identifier, u.full_name
FROM dispatch_assignments d
JOIN work_orders wo ON d.work_order_id = wo.id
JOIN equipment e ON d.equipment_id = e.id
JOIN users u ON d.driver_id = u.id
WHERE d.status IN ('dispatched', 'en_route', 'on_site')
  AND d.dispatched_at > NOW() - INTERVAL '7 days';

-- Optimized (after)
-- 1. Add covering index
CREATE INDEX idx_dispatch_active_covering ON dispatch_assignments(status, dispatched_at) 
    INCLUDE (work_order_id, equipment_id, driver_id);

-- 2. Denormalize frequently accessed data
ALTER TABLE dispatch_assignments 
    ADD COLUMN work_order_number VARCHAR,
    ADD COLUMN equipment_identifier VARCHAR,
    ADD COLUMN driver_name VARCHAR;

-- Updated on create/update via trigger or application logic

-- Now query is much faster (index-only scan possible)
SELECT * FROM dispatch_assignments
WHERE status IN ('dispatched', 'en_route', 'on_site')
  AND dispatched_at > NOW() - INTERVAL '7 days';

Partitioning (for large tables):

-- Partition dispatch_assignments by month
CREATE TABLE dispatch_assignments (
    id UUID NOT NULL,
    dispatched_at TIMESTAMP NOT NULL,
    completed_at TIMESTAMP,
    -- ... other columns
) PARTITION BY RANGE (dispatched_at);

-- Create partitions
CREATE TABLE dispatch_assignments_2025_01 PARTITION OF dispatch_assignments
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

CREATE TABLE dispatch_assignments_2025_02 PARTITION OF dispatch_assignments
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');

-- Automatic partition creation via cron job or pg_partman

Materialized Views (for dashboard metrics):

-- Expensive query: Equipment utilization rate
CREATE MATERIALIZED VIEW equipment_utilization AS
SELECT 
    e.id,
    e.identifier,
    COUNT(CASE WHEN d.status IN ('dispatched', 'en_route', 'on_site') THEN 1 END) as active_count,
    COUNT(d.id) as total_dispatches,
    AVG(EXTRACT(EPOCH FROM (d.completed_at - d.dispatched_at))) as avg_duration_seconds
FROM equipment e
LEFT JOIN dispatch_assignments d ON e.id = d.equipment_id
WHERE d.dispatched_at > NOW() - INTERVAL '30 days'
GROUP BY e.id, e.identifier;

-- Refresh every 5 minutes via cron
-- SELECT refresh_materialized_view('equipment_utilization');

Application-Level Caching

Cache Strategy:

What to Cache:

  • Dashboard data (30 seconds TTL)
  • Available equipment list (30 seconds TTL)
  • User profiles (5 minutes TTL)
  • Tenant settings (1 hour TTL)

What NOT to Cache:

  • Active dispatch status (real-time via WebSocket)
  • Queue depth (real-time via Redis)
  • Transaction-critical data

Cache Implementation:

from flask_caching import Cache

cache = Cache(app, config={
    'CACHE_TYPE': 'redis',
    'CACHE_REDIS_HOST': 'redis',
    'CACHE_REDIS_PORT': 6379,
    'CACHE_DEFAULT_TIMEOUT': 300
})

@app.route('/api/equipment/available')
@cache.cached(timeout=30, key_prefix='available_equipment')
def get_available_equipment():
    tenant = get_tenant()
    equipment = Equipment.query.filter_by(
        schema_name=tenant.schema_name,
        status='available'
    ).all()
    return jsonify(EquipmentSchema(many=True).dump(equipment))

# Cache invalidation on equipment status change
def update_equipment_status(equipment_id, new_status):
    equipment = Equipment.query.get(equipment_id)
    equipment.status = new_status
    db.session.commit()
    
    # Invalidate cache
    cache.delete('available_equipment')

API Rate Limiting

Per-User Rate Limits:

from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

limiter = Limiter(
    app,
    key_func=lambda: g.user['sub'],  # Rate limit by user ID
    storage_uri='redis://redis:6379',
    default_limits=["1000 per hour", "100 per minute"]
)

# Stricter limits on expensive endpoints
@app.route('/api/reports/generate', methods=['POST'])
@limiter.limit("10 per minute")
def generate_report():
    # Report generation is expensive
    pass

# Exempt admins from rate limiting
@app.route('/api/admin/bulk-import', methods=['POST'])
@limiter.exempt
def bulk_import():
    if 'admin' not in g.user['roles']:
        abort(403)
    # Admin bulk operations
    pass

Per-Tenant Rate Limits:

# Different limits based on subscription tier
def get_tenant_limit():
    tenant = get_tenant()
    limits = {
        'trial': "100 per hour",
        'standard': "1000 per hour",
        'premium': "5000 per hour",
        'enterprise': "20000 per hour"
    }
    return limits.get(tenant.tier, "1000 per hour")

@app.route('/api/dispatch/request', methods=['POST'])
@limiter.limit(get_tenant_limit)
def request_dispatch():
    # Dispatch request
    pass

Monitoring & Observability

Logging Infrastructure

Centralized Logging (ELK Stack):

Application Logs (JSON)

Filebeat (log shipper)

Logstash (parsing, enrichment)

Elasticsearch (storage, indexing)

Kibana (visualization, search)

Log Format:

{
  "timestamp": "2025-01-15T10:30:45.123Z",
  "level": "INFO",
  "service": "dispatch-backend",
  "module": "dispatch_module.service",
  "tenant_id": "acme-corp",
  "user_id": "user-123",
  "action": "dispatch_requested",
  "work_order_id": "WO-54321",
  "equipment_id": "eq-789",
  "driver_id": "driver-456",
  "result": "success",
  "duration_ms": 45,
  "request_id": "req-abc-123"  # Trace through all services
}

Log Retention:

  • Hot tier (Elasticsearch): 30 days
  • Warm tier (S3): 90 days
  • Cold tier (S3 Glacier): 1 year

Structured Logging:

import structlog

logger = structlog.get_logger()

def request_dispatch(work_order_id, equipment_type_id, dispatcher_id):
    logger.info(
        "dispatch_requested",
        work_order_id=work_order_id,
        equipment_type_id=equipment_type_id,
        dispatcher_id=dispatcher_id
    )
    
    try:
        # Dispatch logic
        result = create_dispatch_assignment(...)
        
        logger.info(
            "dispatch_completed",
            assignment_id=result.id,
            duration_ms=(datetime.utcnow() - start_time).total_seconds() * 1000
        )
        
        return result
    except Exception as e:
        logger.error(
            "dispatch_failed",
            error=str(e),
            error_type=type(e).__name__,
            exc_info=True
        )
        raise

Metrics & Alerting (Prometheus + Grafana)

Prometheus Metrics:

Application Metrics:

from prometheus_client import Counter, Histogram, Gauge

# Request metrics
request_count = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
request_duration = Histogram('http_request_duration_seconds', 'HTTP request duration', ['method', 'endpoint'])

# Business metrics
dispatch_requests_total = Counter('dispatch_requests_total', 'Total dispatch requests', ['tenant_id', 'result'])
dispatch_queue_depth = Gauge('dispatch_queue_depth', 'Current queue depth', ['tenant_id', 'priority'])
active_dispatches = Gauge('active_dispatches', 'Number of active dispatches', ['tenant_id'])

# Database metrics
db_query_duration = Histogram('db_query_duration_seconds', 'Database query duration', ['query_type'])
db_connection_pool_size = Gauge('db_connection_pool_size', 'Connection pool size')

Grafana Dashboards:

Dashboard 1: System Health

  • Request rate (requests/second)
  • Error rate (%)
  • Response time (p50, p95, p99)
  • CPU and memory usage per pod
  • Database connection pool utilization

Dashboard 2: Business Metrics

  • Dispatches per hour
  • Queue depth over time
  • Average dispatch duration
  • Equipment utilization rate
  • Top tenants by activity

Dashboard 3: Tenant-Specific

  • Per-tenant request rate
  • Per-tenant error rate
  • Per-tenant active dispatches
  • Per-tenant queue depth

Alerting Rules

Critical Alerts (PagerDuty, immediate response):

groups:
  - name: critical
    interval: 1m
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 2m
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} (>5%) for 2 minutes"
      
      - alert: DatabaseDown
        expr: up{job="postgres"} == 0
        for: 1m
        annotations:
          summary: "PostgreSQL is down"
      
      - alert: QueueDepthCritical
        expr: dispatch_queue_depth{priority="urgent"} > 50
        for: 5m
        annotations:
          summary: "Urgent queue depth exceeds 50"

Warning Alerts (Slack, investigate during business hours):

  - name: warnings
    interval: 5m
    rules:
      - alert: SlowResponseTime
        expr: histogram_quantile(0.95, http_request_duration_seconds) > 2.0
        for: 10m
        annotations:
          summary: "95th percentile response time > 2 seconds"
      
      - alert: HighMemoryUsage
        expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.85
        for: 10m
        annotations:
          summary: "Memory usage > 85%"
      
      - alert: CertificateExpiringSoon
        expr: ssl_certificate_expiry_days < 30
        annotations:
          summary: "SSL certificate expires in {{ $value }} days"

Security Enhancements

Certificate Management

Let’s Encrypt (Automated):

cert-manager (Kubernetes):

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@dispatchapp.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - http01:
          ingress:
            class: nginx

---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: dispatch-tls
spec:
  secretName: dispatch-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - dispatchapp.com
    - "*.dispatchapp.com"
  renewBefore: 720h  # Renew 30 days before expiry

Automated Renewal:

  • cert-manager checks expiry daily
  • Renews when <30 days remaining
  • Zero-downtime certificate rotation
  • Monitoring alert if renewal fails

Internal Services (Vault PKI):

HashiCorp Vault as Internal CA:

# Enable PKI secrets engine
vault secrets enable pki
vault secrets tune -max-lease-ttl=87600h pki  # 10 years

# Generate root CA
vault write pki/root/generate/internal \
    common_name="Dispatch App Internal CA" \
    ttl=87600h

# Create role for service certificates
vault write pki/roles/internal-services \
    allowed_domains="*.dispatch.internal,*.dispatch.svc.cluster.local" \
    allow_subdomains=true \
    max_ttl=2160h  # 90 days

# Backend requests certificate on startup
vault write pki/issue/internal-services \
    common_name="backend.dispatch.svc.cluster.local" \
    ttl=2160h

Certificate Rotation:

# Application checks certificate expiry on startup
def ensure_valid_certificate():
    cert_path = '/var/run/secrets/tls/tls.crt'
    cert = load_certificate(cert_path)
    
    days_until_expiry = (cert.not_valid_after - datetime.utcnow()).days
    
    if days_until_expiry < 30:
        logger.warning(f"Certificate expires in {days_until_expiry} days, requesting new certificate")
        new_cert = request_certificate_from_vault()
        replace_certificate(new_cert)
        reload_app()  # Graceful reload

Secrets Management (Vault)

Dynamic Database Credentials:

# Configure database secrets engine
vault secrets enable database

vault write database/config/dispatch-postgres \
    plugin_name=postgresql-database-plugin \
    allowed_roles="dispatch-backend" \
    connection_url="postgresql://{{username}}:{{password}}@postgres:5432/dispatch" \
    username="vault" \
    password="vault_password"

# Create role that generates 1-hour credentials
vault write database/roles/dispatch-backend \
    db_name=dispatch-postgres \
    creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}' IN ROLE dispatch_app;" \
    default_ttl="1h" \
    max_ttl="24h"

# Backend requests credentials on startup
vault read database/creds/dispatch-backend
# Returns: { "username": "v-token-12345", "password": "A1B2C3..." }

Application Integration:

import hvac

vault_client = hvac.Client(url='https://vault.internal:8200')

# Authenticate (Kubernetes service account, AppRole, etc.)
vault_client.auth.kubernetes.login(
    role='dispatch-backend',
    jwt=open('/var/run/secrets/kubernetes.io/serviceaccount/token').read()
)

# Get database credentials
db_creds = vault_client.secrets.database.generate_credentials(name='dispatch-backend')
db_user = db_creds['data']['username']
db_pass = db_creds['data']['password']

# Connect to database
engine = create_engine(f'postgresql://{db_user}:{db_pass}@postgres:5432/dispatch')

# Credentials automatically revoked after 1 hour
# Application renews before expiry or reconnects with new credentials

Compliance Preparation (ISO 27001)

ISO 27001 Alignment:

A.9 Access Control:

  • ✅ Keycloak provides centralized identity management
  • ✅ RBAC enforced at API level
  • ✅ MFA for privileged accounts (admin, dispatcher)
  • ✅ Regular access reviews (quarterly audit of user roles)

A.10 Cryptography:

  • ✅ TLS 1.3 for all data in transit
  • ✅ Encryption at rest for S3 (SSE-KMS)
  • ✅ Database encryption (TDE) - enable at Deep-Water
  • ✅ Key management via Vault

A.12 Operations Security:

  • ✅ Automated security patching (Dependabot + weekly review)
  • ✅ Vulnerability scanning (Snyk integrated in CI/CD)
  • ✅ Change management process (Git, PR reviews, staging environment)
  • ✅ Capacity monitoring (Prometheus alerts)

A.14 System Acquisition, Development, Maintenance:

  • ✅ Security requirements in design phase (documented)
  • ✅ Code review process (all PRs reviewed)
  • ✅ Penetration testing (quarterly, external firm)
  • ✅ Secure coding guidelines documented

A.16 Incident Management:

  • ✅ Incident response plan documented
  • ✅ Security event logging (ELK stack)
  • ✅ 24/7 on-call rotation (PagerDuty)
  • ✅ Post-incident reviews (within 48 hours)

A.18 Compliance:

  • ✅ Regular internal audits (monthly)
  • ✅ External audit (annual, preparation for SOC 2 Type I)
  • ✅ Documentation of all controls
  • ✅ Gap analysis against ISO 27001 requirements

Documentation Required:

  • Information Security Policy
  • Risk Assessment Register
  • Asset Inventory
  • Incident Response Procedures
  • Business Continuity Plan
  • Disaster Recovery Plan
  • Access Control Policy
  • Change Management Process

Deployment & DevOps

CI/CD Pipeline

Tools: GitHub Actions, GitLab CI, or Jenkins

Pipeline Stages:

1. Build:

# .github/workflows/build.yml
name: Build and Test

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.11
      
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install -r requirements-dev.txt
      
      - name: Lint
        run: |
          flake8 app/
          black --check app/
      
      - name: Unit tests
        run: pytest tests/unit --cov=app
      
      - name: Integration tests
        run: pytest tests/integration
      
      - name: Security scan
        run: |
          bandit -r app/
          safety check
  
  build-image:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      
      - name: Build Docker image
        run: docker build -t dispatch-backend:${{ github.sha }} .
      
      - name: Scan image
        run: trivy image dispatch-backend:${{ github.sha }}
      
      - name: Push to registry
        run: |
          echo ${{ secrets.DOCKER_PASSWORD }} | docker login -u ${{ secrets.DOCKER_USERNAME }} --password-stdin
          docker push dispatch-backend:${{ github.sha }}

2. Deploy to Staging:

  deploy-staging:
    needs: build-image
    runs-on: ubuntu-latest
    steps:
      - name: Update Kubernetes manifests
        run: |
          kubectl set image deployment/dispatch-backend \
            backend=dispatch-backend:${{ github.sha }} \
            -n dispatch-staging
      
      - name: Wait for rollout
        run: kubectl rollout status deployment/dispatch-backend -n dispatch-staging
      
      - name: Smoke tests
        run: |
          curl -f https://staging.dispatchapp.com/api/health
          pytest tests/smoke --base-url=https://staging.dispatchapp.com

3. Deploy to Production (manual approval):

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment:
      name: production
      url: https://dispatchapp.com
    steps:
      - name: Update Kubernetes manifests
        run: |
          kubectl set image deployment/dispatch-backend \
            backend=dispatch-backend:${{ github.sha }} \
            -n dispatch-production
      
      - name: Wait for rollout
        run: kubectl rollout status deployment/dispatch-backend -n dispatch-production
      
      - name: Notify Slack
        run: |
          curl -X POST ${{ secrets.SLACK_WEBHOOK_URL }} \
            -H 'Content-Type: application/json' \
            -d '{"text":"Production deployment successful: ${{ github.sha }}"}'

Blue-Green Deployments

Zero-Downtime Deployments:

Kubernetes Rolling Update:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dispatch-backend
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1  # Only 1 pod down at a time
      maxSurge: 2        # Up to 2 extra pods during update
  template:
    spec:
      containers:
        - name: backend
          image: dispatch-backend:v1.2.3
          readinessProbe:
            httpGet:
              path: /api/health
              port: 5000
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /api/health
              port: 5000
            initialDelaySeconds: 30
            periodSeconds: 10

Process:

  1. New pods spin up (v1.2.3)
  2. Readiness probe passes → added to load balancer
  3. Old pod (v1.2.2) removed from load balancer
  4. Old pod allowed to finish in-flight requests (30s grace period)
  5. Old pod terminated
  6. Repeat for all pods

Rollback (if deployment fails):

# Automatic rollback on failed readiness probes
# Or manual rollback
kubectl rollout undo deployment/dispatch-backend -n dispatch-production

Database Migrations

Zero-Downtime Migration Strategy:

Phase 1: Additive Changes Only

# Migration: Add new column
def upgrade():
    op.add_column('dispatch_assignments', 
        sa.Column('new_field', sa.String(), nullable=True))

# Deploy code that writes to both old and new fields
# Old code still works (new_field is nullable)

Phase 2: Backfill Data

# Background job fills new_field for existing rows
def backfill_new_field():
    assignments = DispatchAssignment.query.filter_by(new_field=None).all()
    for assignment in assignments:
        assignment.new_field = calculate_new_field(assignment)
    db.session.commit()

Phase 3: Make Required

# After all rows backfilled
def upgrade():
    op.alter_column('dispatch_assignments', 'new_field', nullable=False)

# Deploy code that relies on new_field

Phase 4: Remove Old Field (if applicable)

# After old code no longer deployed
def upgrade():
    op.drop_column('dispatch_assignments', 'old_field')

Backup & Disaster Recovery

Backup Strategy

Database Backups:

  • Frequency: Continuous (WAL archiving) + daily snapshots
  • Retention: 35 days (RDS default)
  • Location: Same region + cross-region replication
  • Encryption: AES-256 (at rest)

Automated Snapshots (RDS):

# Automated daily snapshots at 3 AM
# Manual snapshot before major changes
aws rds create-db-snapshot \
    --db-instance-identifier dispatch-production \
    --db-snapshot-identifier pre-migration-2025-01-15

Point-in-Time Recovery (PITR):

# Restore to any second in last 35 days
aws rds restore-db-instance-to-point-in-time \
    --source-db-instance-identifier dispatch-production \
    --target-db-instance-identifier dispatch-restored \
    --restore-time 2025-01-15T10:30:00Z

Object Storage Backups:

  • S3 Versioning: Enabled (recover deleted/overwritten files)
  • Cross-Region Replication: Enabled (DR)
  • Lifecycle Policy: Reports archived to Glacier after 1 year

Disaster Recovery Plan

Scenarios & Procedures:

1. Database Failure:

  • Detection: Automatic (RDS Multi-AZ failover)
  • RTO: <2 minutes (automatic failover to standby)
  • RPO: 0 (synchronous replication)
  • Procedure: No action required (automatic)

2. Regional Outage (AWS region down):

  • Detection: Manual (monitoring shows all services down)
  • RTO: 1 hour (manual failover to backup region)
  • RPO: 5 minutes (asynchronous replication lag)
  • Procedure:
    1. Promote read replica in backup region to primary
    2. Update DNS to point to backup region
    3. Restart backend pods in backup region
    4. Communicate status to customers

3. Data Corruption (accidental deletion):

  • Detection: Customer report or internal audit
  • RTO: 2 hours (restore from backup)
  • RPO: Up to 24 hours (daily snapshots)
  • Procedure:
    1. Identify corruption extent (which tenants affected)
    2. Restore affected tenant schemas from backup
    3. Validate data integrity
    4. Notify affected customers

4. Security Breach (unauthorized access):

  • Detection: Anomaly detection, SIEM alerts
  • RTO: N/A (incident response, not recovery)
  • Procedure:
    1. Isolate affected systems
    2. Revoke compromised credentials
    3. Investigate attack vector
    4. Patch vulnerability
    5. Notify affected customers (within 72 hours, GDPR)
    6. Conduct post-mortem

DR Testing

Quarterly DR Drills:

  1. Planned Failover Test (Q1, Q3):

    • Schedule maintenance window
    • Perform manual failover to backup region
    • Verify all services operational
    • Failback to primary region
    • Document any issues
  2. Backup Restore Test (Q2, Q4):

    • Create test environment
    • Restore latest backup
    • Run smoke tests
    • Verify data integrity
    • Document restore time

Cost Optimization

Cost Breakdown (Estimated Monthly)

CategoryCost
Compute
- Kubernetes nodes (3x t3.large)$180
- EKS control plane$75
Database
- RDS PostgreSQL (primary)$120
- RDS read replica$120
Cache & Queue
- ElastiCache Redis$50
Storage
- S3 (1 TB)$23
- EBS volumes$30
Networking
- Load Balancer$20
- Data transfer$100
Monitoring & Logging
- CloudWatch$30
- ELK stack (self-hosted)Included in compute
Security
- Vault (self-hosted)Included in compute
Backups$25
Total~$773/month

Cost Per Customer (at 50 customers): ~$15/month
Suggested Pricing: $50-100/month per customer (3-6x cost)


Cost Optimization Strategies

1. Right-Size Resources:

  • Monitor actual usage vs. provisioned capacity
  • Downsize underutilized nodes
  • Use auto-scaling for variable workloads

2. Reserved Instances (for predictable workload):

  • 1-year reserved instances: 30-40% discount
  • 3-year reserved instances: 50-60% discount
  • Apply to: Database, cache, baseline compute

3. Spot Instances (for non-critical workloads):

  • Use for batch jobs (report generation)
  • Dev/staging environments
  • 70-90% discount vs. on-demand

4. Storage Lifecycle Policies:

  • Move old reports to Glacier (10x cheaper)
  • Delete old logs after retention period
  • Compress backups

5. Data Transfer Optimization:

  • Use CloudFront CDN for static assets
  • Keep traffic within same region when possible
  • Compress API responses

Transition to Deep-Water

Triggers to Advance

You’re ready for Deep-Water when:

  • 500+ concurrent users (mid-depth infrastructure stressed)
  • Geographic expansion required (customers in multiple regions)
  • Enterprise customers demanding 99.9% SLA
  • Compliance certifications achieved (ISO 27001, SOC 2)
  • Microservices extraction justified (specific bottlenecks identified)
  • Team grown to 10+ engineers (can manage distributed systems)
  • Revenue justifies investment ($500k+ ARR)

Timeline: 12-18 months after Mid-Depth


Conclusion

Mid-Depth is about professionalization and reliability. Goals:

  1. Multi-tenant at scale: Support 50+ customers with data isolation
  2. Enterprise-ready: Compliance preparation, professional security
  3. Reliable: 99% uptime, fast incident response
  4. Observable: Comprehensive monitoring and logging
  5. Performant: <2 second response times under load

What Success Looks Like:

  • Customers trust you with their operations
  • Minimal support burden (system “just works”)
  • Predictable costs and performance
  • Clear path to enterprise scale (Deep-Water)

Key Mindset: Build for reliability and maintainability, not premature optimization. The modular monolith can take you far—don’t rush to microservices.

When enterprise customers demand geographic distribution and 99.9% SLAs, you’re ready for Deep-Water.