Performance & Scalability Design - Deep Water

The surface content covered the fundamentals: cache what doesn’t change, index what you query, use a CDN. These practices get you to thousands of users without breaking a sweat.

This content covers what happens when those patterns aren’t enough. When you’re handling millions of requests per day. When cache invalidation becomes a distributed systems problem. When a single database can’t handle the load. When geographic distribution matters not just for static assets, but for dynamic content.

The systems described here are what companies like Discord, Stripe, Netflix, and Shopify built after their basic infrastructure stopped scaling. They’re complex. They’re expensive to operate. They solve real problems that only appear at scale.

You probably don’t need most of this right now. But when you do need it, you’ll know exactly which problem you’re solving.

Understanding the Scalability Cliff

Applications don’t slow down linearly. They hit cliffs.

With 1,000 users, response times average 200ms. With 10,000 users, still 200ms. With 50,000 users, 250ms. Then you hit 75,000 users and response times spike to 5 seconds. Your database connection pool is exhausted. Every request waits for a connection. The system is at a cliff.

These cliffs appear at different points depending on your architecture:

Connection pool exhaustion - You can only handle as many concurrent requests as you have database connections. Most apps start with 10-20. When concurrent requests exceed that, new requests queue. The queue builds up faster than it drains. Response times spiral.

Memory exhaustion - Application servers have finite RAM. Each request uses memory. When you run out, the OS starts swapping to disk. Swap is 1000x slower than RAM. The system grinds to a halt.

CPU saturation - When CPU usage hits 80-90%, response times increase exponentially. A server running at 70% CPU handles requests in 100ms. The same server at 95% CPU handles requests in 2 seconds.

Network bandwidth - A 1 Gbps connection seems huge until you’re serving video or high-resolution images to thousands of users simultaneously. You hit the limit and throughput plummets.

Lock contention - Databases use locks to maintain consistency. When many requests try to update the same rows, they wait for locks. One slow transaction holding a lock can block hundreds of requests.

The key insight: you need to identify which cliff you’re approaching before you hit it. That requires monitoring with specific metrics.

Global CDN Architecture Beyond Static Files

Surface-level CDN usage is straightforward: put images and JavaScript files on edge servers. At scale, CDNs become compute platforms that run application logic.

Edge Computing Patterns

Cloudflare Workers, AWS Lambda@Edge, and Vercel Edge Functions let you run code at CDN edge locations. This moves computation closer to users.

Authentication at the Edge Instead of sending every request to your origin server to check authentication, verify JWT tokens at the edge. A request from Tokyo hits the Tokyo edge server, verifies the token locally, and either serves cached content or proxies to origin.

// Runs at edge locations worldwide
async function handleRequest(request) {
  const token = request.headers.get('Authorization');

  // Verify JWT locally at edge - no origin request needed
  const user = await verifyToken(token);
  if (!user) {
    return new Response('Unauthorized', { status: 401 });
  }

  // Serve from edge cache if available
  const cache = caches.default;
  const cacheKey = new Request(request.url);
  let response = await cache.match(cacheKey);

  if (!response) {
    // Only hit origin if not cached
    response = await fetch(request);
    await cache.put(cacheKey, response.clone());
  }

  return response;
}

This pattern reduced Discord’s authentication overhead by 70% when they moved JWT verification to the edge. Origin servers no longer process millions of authentication checks.

Personalization at the Edge You can’t cache personalized content the traditional way - every user sees different data. But you can cache the template and inject personalization at the edge.

Stripe does this for their dashboard. The HTML structure is cached at the edge. User-specific data (account balance, recent transactions) comes from a fast API call. The edge server combines them before sending to the client.

async function personalizedResponse(request, userId) {
  // Cache the template - same for all users
  const template = await getFromCache('dashboard-template');

  // Fetch user-specific data - fast API call
  const userData = await fetch(`https://api.example.com/user/${userId}/summary`);

  // Combine at edge
  return renderTemplate(template, userData);
}

Response time from San Francisco: template from local edge (20ms) + API call to origin (80ms) = 100ms total. Without edge caching, the full HTML render at origin takes 300ms plus network latency.

Regional Caching Tiers

Large-scale systems use multiple caching layers with different eviction policies.

Edge Cache (Short TTL) Lives on CDN edge servers. Hundreds of locations. Very small capacity per location. Stores frequently accessed content with 30-60 second TTL.

Regional Cache (Medium TTL) Lives in major regions. Dozens of locations. Larger capacity. Stores popular content with 5-15 minute TTL.

Origin Cache (Long TTL) Lives near application servers. Few locations. Large capacity. Stores everything with hours to days TTL.

Shopify’s architecture looks like this:

User Request → Edge Cache (miss)
  → Regional Cache (miss)
    → Origin Cache (miss)
      → Application Server

Most requests hit at edge. Popular but not ultra-popular content hits at regional. Everything else hits origin cache. Only uncached content reaches application servers.

The benefit: origin servers handle 5% of requests instead of 100%. When you’re handling 100,000 requests per second, that’s the difference between 100 application servers and 5,000.

Cache Invalidation at Scale

“Clear the cache when data changes” sounds simple. At scale, it’s distributed systems complexity.

The Thundering Herd Problem Your cache has product pricing with 5-minute TTL. It expires. At exactly that moment, 10,000 requests arrive. All see cache miss. All query the database. The database falls over.

Solution 1: Locking When cache misses, acquire a lock before hitting the database. Other requests wait for the lock holder to populate the cache.

def get_product_price(product_id):
    cache_key = f"price:{product_id}"

    # Try cache first
    price = cache.get(cache_key)
    if price:
        return price

    # Cache miss - acquire lock
    lock_key = f"lock:{cache_key}"
    if cache.add(lock_key, "locked", ttl=10):
        # We got the lock - fetch from database
        price = db.query("SELECT price FROM products WHERE id = ?", product_id)
        cache.set(cache_key, price, ttl=300)
        cache.delete(lock_key)
        return price
    else:
        # Someone else has the lock - wait and retry
        time.sleep(0.1)
        return get_product_price(product_id)

This works but creates latency spikes. The first request takes 200ms (database query). Subsequent requests wait 100ms (sleep and retry). Not ideal.

Solution 2: Probabilistic Early Expiration Don’t wait for cache to expire. Refresh it probabilistically before expiration.

import random

def get_product_price(product_id):
    cache_key = f"price:{product_id}"

    # Cache stores (value, timestamp)
    cached = cache.get(cache_key)
    if cached:
        value, cached_at = cached
        age = time.time() - cached_at
        ttl = 300  # 5 minutes

        # Probabilistically refresh when getting old
        # When age = 0.8 * TTL, 20% chance to refresh
        # When age = 0.9 * TTL, 30% chance to refresh
        refresh_probability = max(0, (age / ttl) - 0.5) * 2

        if random.random() > refresh_probability:
            return value

        # Fall through to refresh

    # Refresh cache
    price = db.query("SELECT price FROM products WHERE id = ?", product_id)
    cache.set(cache_key, (price, time.time()), ttl=300)
    return price

Requests spread out cache refreshes. No thundering herd. No latency spikes. This is what Reddit uses for their comment caching.

Solution 3: Cache Warming Pre-populate cache before it’s needed. When you deploy new pricing, push it to the cache before announcing it.

def deploy_new_pricing(pricing_data):
    # Write to database
    db.batch_update("products", pricing_data)

    # Immediately warm cache
    for product in pricing_data:
        cache_key = f"price:{product['id']}"
        cache.set(cache_key, product['price'], ttl=300)

    # Now announce the change
    notify_users("Pricing updated")

No cache misses. No database load spike. Users see new pricing immediately.

Solution 4: Versioned Cache Keys Include a version in the cache key. Invalidation becomes “increment the version.”

# Global version stored in fast distributed store
PRICING_VERSION = redis.get("pricing_version")  # e.g., "v47"

def get_product_price(product_id):
    cache_key = f"price:{PRICING_VERSION}:{product_id}"

    price = cache.get(cache_key)
    if price:
        return price

    price = db.query("SELECT price FROM products WHERE id = ?", product_id)
    cache.set(cache_key, price, ttl=3600)  # Long TTL is fine
    return price

def update_all_pricing():
    # Update database
    db.execute("UPDATE products SET price = price * 1.1")

    # Invalidate all cached prices by bumping version
    redis.incr("pricing_version")  # Now "v48"

    # Old cache keys ("price:v47:*") are orphaned
    # New requests use new keys ("price:v48:*")

Instantaneous invalidation across all edge servers worldwide. No cache purge API calls. No distributed invalidation coordination. The old cache entries expire naturally.

This is what Cloudflare’s CDN uses internally. When you purge cache, they’re bumping a version number, not actually deleting entries.

Database Performance at Scale

A single PostgreSQL or MySQL instance can handle 10,000-50,000 queries per second on good hardware. That’s enough for most applications. When it’s not, you have options.

Connection Pooling Done Right

Applications don’t connect directly to the database. They connect to a pool. The pool maintains persistent connections and reuses them.

Why connection pooling matters Creating a database connection takes 20-50ms. Reusing an existing connection takes under 1ms. For queries that take 5ms, connection overhead doubles your latency.

Pool sizing math This seems simple but most teams get it wrong. Too few connections and requests queue. Too many and you overwhelm the database.

The formula: connections = ((core_count * 2) + effective_spindle_count)

For modern SSDs, effective spindle count is essentially infinite, but the CPU formula holds: a server with 8 cores should have about 16-32 database connections.

But that’s for the database server. Your application pool needs to account for the number of application servers.

If you have:

5 application servers
Each with a pool of 20 connections
That’s 100 connections to the database

Your database needs to handle 100 connections. If it has 8 cores, that’s fine (it can handle ~200). If it has 2 cores, you’re overloaded.

PgBouncer for connection pooling Application servers connect to PgBouncer. PgBouncer maintains a smaller pool to the actual database.

[databases]
myapp = host=db.internal port=5432

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25

1,000 application connections multiplexed into 25 database connections. The database doesn’t see the thundering herd.

Pool modes matter:

Session mode: One database connection per client for entire session. Safe but doesn’t pool effectively.
Transaction mode: Database connection released after each transaction. Pools well. Works for most apps.
Statement mode: Database connection released after each statement. Maximum pooling. Breaks prepared statements and temp tables.

Discord runs PgBouncer in transaction mode. They have thousands of application servers connecting to dozens of database connections.

Read Replicas and Separation

Typical web apps are 90% reads, 10% writes. Send reads to replicas. Keep the primary for writes only.

# Simple read/write split
class DatabaseRouter:
    def db_for_read(self):
        # Round-robin across replicas
        return random.choice([
            'replica-1',
            'replica-2',
            'replica-3'
        ])

    def db_for_write(self):
        return 'primary'

# Usage
user = db.query('SELECT * FROM users WHERE id = ?', user_id, db=db_for_read())
db.execute('UPDATE users SET last_login = NOW() WHERE id = ?', user_id, db=db_for_write())

This works until you hit replication lag. You write to primary, immediately read from replica, and don’t see your write because replication is 200ms behind.

Solutions:

Sticky reads after writes - After a write, read from primary for 2-3 seconds
Read your writes - Track last write timestamp, wait for replica to catch up
Eventual consistency - Accept that some reads are slightly stale

GitHub uses approach #1. After you star a repository, your profile shows the change immediately (read from primary). Other users see it 1-2 seconds later (read from replicas).

CQRS: Command Query Responsibility Segregation

At extreme scale, separate databases for reads and writes.

Write Database (PostgreSQL)

Normalized schema
ACID transactions
Authoritative source of truth
Handles inserts, updates, deletes

Read Database (Elasticsearch or denormalized PostgreSQL)

Denormalized, optimized for queries
Eventually consistent
Can be minutes behind
Handles all reads

When you update data, it goes to write database. A background process streams changes to read database. Reads are fast because the schema is optimized for queries. Writes are safe because the write database maintains integrity.

Stack Overflow works this way. They write to SQL Server. The read layer is Elasticsearch with denormalized documents. Searching is instant because Elasticsearch is optimized for full-text search. Data integrity is guaranteed because SQL Server is the source of truth.

The trade-off: complexity. You’re managing two databases, keeping them in sync, and accepting eventual consistency. Only worth it at scale.

Query Optimization Beyond Indexes

EXPLAIN ANALYZE is your best friend

EXPLAIN ANALYZE
SELECT p.*, u.name, COUNT(l.id) as like_count
FROM posts p
JOIN users u ON u.id = p.author_id
LEFT JOIN likes l ON l.post_id = p.id
WHERE p.created_at > NOW() - INTERVAL '7 days'
GROUP BY p.id, u.name
ORDER BY like_count DESC
LIMIT 10;

The output shows:

Which indexes were used (or not used)
How many rows were scanned
Where time was spent
Whether the query planner made good choices

Common findings:

Sequential scan when index exists - The planner thinks a full table scan is faster. Often true for small tables, bad for large ones. Force index usage or adjust planner statistics.

Nested loop on large tables - Nested loop joins work for small datasets. For large ones, hash joins or merge joins are better. Adjust work_mem or rewrite the query.

Sort on disk - If sort data exceeds work_mem, PostgreSQL sorts on disk (slow). Increase work_mem or reduce result set before sorting.

Partitioning Large Tables

A table with 100 million rows slows down even with perfect indexes. Partitioning splits it into smaller tables.

-- Partition by month
CREATE TABLE events (
    id BIGSERIAL,
    user_id INTEGER,
    event_type VARCHAR(50),
    created_at TIMESTAMP
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2024_11 PARTITION OF events
    FOR VALUES FROM ('2024-11-01') TO ('2024-12-01');

CREATE TABLE events_2024_12 PARTITION OF events
    FOR VALUES FROM ('2024-12-01') TO ('2025-01-01');

Queries that filter by created_at only scan relevant partitions. A query for November data scans 1 partition with 3 million rows, not 1 table with 100 million rows.

Queries go from 5 seconds to 200ms.

The trick: partition on columns you always filter by. Partitioning by user_id helps if you always query one user at a time. Partitioning by created_at helps if you always query date ranges. Partitioning by random ID doesn’t help anyone.

Multi-Layer Caching Strategies

You already know about caching. This is about making caches work together without stampeding.

The Full Stack

Browser Cache (HTTP headers)
  ↓ miss
CDN Edge Cache (Cloudflare, Cloudfront)
  ↓ miss
Application Cache (Redis)
  ↓ miss
Database Query Cache
  ↓ miss
Database (PostgreSQL, MySQL)

Each layer has different characteristics:

Browser Cache

Capacity: ~100MB per site
Latency: 0ms (already on disk)
TTL: Hours to months
Invalidation: Hard (requires cache busting)
Best for: Versioned assets (app.v123.js)

CDN Edge Cache

Capacity: ~100GB per POP
Latency: 10-50ms
TTL: Minutes to hours
Invalidation: API calls
Best for: Popular content

Application Cache (Redis)

Capacity: Limited by RAM (typically 8-64GB)
Latency: 1-5ms
TTL: Seconds to hours
Invalidation: Precise control
Best for: Database query results, computed data

Database Cache

Capacity: 25-50% of database size
Latency: Sub-millisecond
TTL: Managed by database
Invalidation: Automatic
Best for: Hot data you can’t control

Cache Coherence Strategies

With multiple cache layers, keeping them consistent is hard.

Approach 1: Write-through Update database and cache simultaneously.

def update_user_profile(user_id, new_data):
    # Update database
    db.execute("UPDATE users SET name = ? WHERE id = ?",
               new_data['name'], user_id)

    # Update cache immediately
    cache_key = f"user:{user_id}"
    cache.set(cache_key, new_data, ttl=300)

Pros: Cache never stale. Cons: Two writes for every update. If cache write fails, they diverge.

Approach 2: Write-invalidate Update database, delete from cache.

def update_user_profile(user_id, new_data):
    # Update database
    db.execute("UPDATE users SET name = ? WHERE id = ?",
               new_data['name'], user_id)

    # Invalidate cache
    cache_key = f"user:{user_id}"
    cache.delete(cache_key)

Pros: Simple. Cache can’t be stale. Cons: Next read is a cache miss.

Approach 3: Write-behind Update cache, asynchronously update database.

def update_user_profile(user_id, new_data):
    # Update cache immediately
    cache_key = f"user:{user_id}"
    cache.set(cache_key, new_data, ttl=300)

    # Queue database update
    queue.enqueue(lambda: db.execute(
        "UPDATE users SET name = ? WHERE id = ?",
        new_data['name'], user_id
    ))

Pros: Fast writes. Cons: Database might be behind. If cache evicts before queue processes, data is lost.

Most systems use write-invalidate. It’s simple and correct. Write-through makes sense for critical data. Write-behind is rare because the risks usually outweigh the benefits.

Dealing with Cache Stampedes

Covered locking earlier. Here’s a different approach: stale-while-revalidate.

def get_product_price(product_id):
    cache_key = f"price:{product_id}"

    # Cache stores (value, timestamp, stale)
    cached = cache.get(cache_key)

    if cached:
        value, timestamp, is_stale = cached

        if not is_stale:
            # Fresh data - return it
            return value

        # Stale data - return it but refresh in background
        threading.Thread(target=refresh_cache,
                        args=(product_id, cache_key)).start()
        return value

    # No cache - fetch synchronously
    price = db.query("SELECT price FROM products WHERE id = ?", product_id)
    cache.set(cache_key, (price, time.time(), False), ttl=300)
    return price

def refresh_cache(product_id, cache_key):
    price = db.query("SELECT price FROM products WHERE id = ?", product_id)
    cache.set(cache_key, (price, time.time(), False), ttl=300)

When cache expires, mark it stale but keep serving it. Refresh in the background. Users never see the cache miss latency.

This is what Varnish HTTP cache does with stale-while-revalidate header.

Async Processing Architecture

Background jobs are simple at small scale: queue a job, worker picks it up, processes it. At scale, you need sophisticated job handling.

Message Queues vs Event Streams

Message Queues (RabbitMQ, SQS, Cloud Tasks)

Jobs are consumed once
Worker acknowledges completion
Failed jobs retry
Order not guaranteed (usually)

Event Streams (Kafka, Kinesis, Pulsar)

Events are broadcast to subscribers
Multiple consumers can read the same event
Events stored for days/weeks
Order guaranteed within partition

Use queues for “do this work once” tasks: send email, generate PDF, process payment.

Use streams for “these events happened” notifications: user signed up, order placed, payment completed. Multiple systems care about these events.

Backpressure and Rate Limiting

A sudden spike in jobs can overwhelm workers. Your API receives 100,000 new orders in 10 minutes (flash sale). Each order queues 3 jobs (confirmation email, inventory update, shipping label). That’s 300,000 jobs.

Your workers process 100 jobs/second. That’s 50 minutes to clear the queue. But jobs keep coming. The queue grows to millions. Workers fall further behind.

Backpressure solution: Queue limits

def enqueue_job(job):
    queue_size = redis.llen('job_queue')

    if queue_size > 10000:
        # Queue is too large - reject new jobs
        raise QueueFullError("Job queue at capacity")

    redis.rpush('job_queue', job)

Better to reject new jobs than to accept them and process them hours late. The API can return 503 Service Unavailable. The client retries later or the user gets an error message.

Rate limiting solution: Token bucket

class RateLimiter:
    def __init__(self, rate, capacity):
        self.rate = rate  # tokens per second
        self.capacity = capacity
        self.tokens = capacity
        self.last_update = time.time()

    def allow(self):
        now = time.time()
        elapsed = now - self.last_update

        # Refill tokens
        self.tokens = min(self.capacity,
                         self.tokens + elapsed * self.rate)
        self.last_update = now

        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Limit to 100 jobs per second
limiter = RateLimiter(rate=100, capacity=100)

def enqueue_job(job):
    if not limiter.allow():
        raise RateLimitError("Too many jobs")

    redis.rpush('job_queue', job)

Jobs enqueue at sustainable rate. Queue doesn’t grow unbounded. System stays responsive.

Dead Letter Queues

Some jobs fail permanently. Email address doesn’t exist. Payment card declined. External API returned 400 Bad Request (client error, not worth retrying).

Retrying these jobs wastes resources. After N failures, move them to a dead letter queue (DLQ) for manual review.

def process_job(job):
    try:
        result = perform_work(job)
        acknowledge_job(job)
    except RetryableError as e:
        # Temporary failure - retry
        job.retries += 1
        if job.retries < 5:
            requeue_job(job)
        else:
            # Too many retries - move to DLQ
            dead_letter_queue.add(job)
            acknowledge_job(job)  # Don't requeue
    except PermanentError as e:
        # Permanent failure - don't retry
        dead_letter_queue.add(job)
        acknowledge_job(job)

Ops teams review DLQ weekly. Most are legitimate failures (invalid email). Some reveal bugs (malformed data from API). Fix the bugs, requeue the jobs.

Stripe processes billions of jobs. Their DLQ catches edge cases that QA never found. It’s a debugging goldmine.

Load Balancing Deep Dive

Surface content mentioned load balancing. This covers the real complexity.

Layer 4 vs Layer 7

Layer 4 (Transport Layer) Load balancer looks at IP addresses and TCP/UDP ports. Fast. Doesn’t inspect HTTP headers or application data.

Client → Load Balancer (sees: TCP connection to port 443)
           → Server A, B, or C

Advantages:

Very fast (minimal processing)
Protocol-agnostic (works for HTTP, WebSockets, gRPC, databases)
Low latency overhead (microseconds)

Disadvantages:

Can’t route based on URL path
Can’t inspect cookies or headers
Can’t do SSL termination easily

Layer 7 (Application Layer) Load balancer inspects HTTP requests. Can route based on URL, headers, cookies.

Client → Load Balancer (sees: GET /api/users, Host: api.example.com)
           → Routes to API servers

Client → Load Balancer (sees: GET /images/logo.png, Host: cdn.example.com)
           → Routes to asset servers

Advantages:

Intelligent routing (send /api to API servers, /images to asset servers)
SSL termination (decrypt once at load balancer, plain HTTP to backends)
Can inspect cookies for sticky sessions

Disadvantages:

Slower (milliseconds of overhead)
More CPU intensive
Must understand the protocol

Most systems use L7 for HTTP traffic, L4 for everything else.

Load Balancing Algorithms

Round Robin Each request goes to the next server in the list. Simple. Fair. Ignores server load.

Server A gets request 1, 4, 7, 10… Server B gets request 2, 5, 8, 11… Server C gets request 3, 6, 9, 12…

Works great if requests have similar cost. Breaks down if request 1 is “fetch user profile” (10ms) and request 2 is “generate annual report” (30 seconds).

Least Connections Route to server with fewest active connections. Better load distribution if request cost varies.

Servers:

A: 5 connections
B: 3 connections
C: 8 connections

Next request goes to B.

Least Response Time Track how long each server takes to respond. Route to the fastest.

Servers:

A: avg 150ms response time
B: avg 80ms response time
C: avg 200ms response time

Next request goes to B.

Useful when servers have different hardware or some are degraded.

Weighted Distribution Some servers are more powerful. Give them more traffic.

Servers:

A: 2x CPUs, weight 2
B: 4x CPUs, weight 4
C: 2x CPUs, weight 2

Server B gets 50% of traffic. A and C split the remaining 50%.

IP Hash Hash the client IP, always send to the same server. Provides stickiness without cookies.

def select_server(client_ip, servers):
    hash_value = hash(client_ip)
    index = hash_value % len(servers)
    return servers[index]

User from IP 192.168.1.100 always hits the same server. Good for caching per-server state. Bad if one server is overloaded (doesn’t redistribute).

Sticky Sessions and Why They Matter

Some applications store session state in server memory. User logs in, session data lives on Server A. Next request must go to Server A or user appears logged out.

Cookie-based stickiness

Client requests login
  → Load balancer routes to Server A
  → Server A sets cookie: lb-server=A

Client makes next request with cookie: lb-server=A
  → Load balancer sees cookie, routes to Server A

Works until Server A crashes. User loses session.

Better solution: Shared session storage Don’t store sessions in application memory. Store in Redis or database.

# Bad - in-memory sessions
sessions = {}

@app.route('/login')
def login():
    session_id = generate_id()
    sessions[session_id] = {'user_id': 123}
    return set_cookie('session_id', session_id)

# Good - shared session storage
@app.route('/login')
def login():
    session_id = generate_id()
    redis.set(f'session:{session_id}', {'user_id': 123}, ttl=3600)
    return set_cookie('session_id', session_id)

Now any server can handle any request. No sticky sessions needed. Server crashes don’t lose sessions.

This is how stateless applications scale. GitHub, Netflix, Stripe all use shared session storage.

Health Checks That Actually Work

Load balancers remove unhealthy servers from rotation. But “health check” needs to verify actual functionality, not just “is the server running.”

Bad health check:

@app.route('/health')
def health():
    return "OK", 200

This returns 200 even if:

Database is down
Cache is unreachable
Disk is full
Application logic is broken

Better health check:

@app.route('/health')
def health():
    checks = {
        'database': check_database(),
        'cache': check_cache(),
        'disk': check_disk_space(),
    }

    if all(checks.values()):
        return {'status': 'healthy', 'checks': checks}, 200
    else:
        return {'status': 'unhealthy', 'checks': checks}, 503

def check_database():
    try:
        db.execute('SELECT 1')
        return True
    except:
        return False

def check_cache():
    try:
        redis.ping()
        return True
    except:
        return False

def check_disk_space():
    usage = shutil.disk_usage('/').percent
    return usage < 90

Load balancer calls /health every 10 seconds. Server returns 503 if dependencies are down. Load balancer removes it from rotation. Other servers handle traffic while this one recovers.

Auto-Scaling Strategy

Manually adding servers when traffic spikes is too slow. Auto-scaling does it automatically.

Horizontal vs Vertical Scaling

Vertical (Scale Up) Add CPU/RAM to existing servers. 2-core server becomes 8-core server.

Advantages:

Simple (no code changes)
No distributed systems complexity
Good for databases

Disadvantages:

Expensive (bigger servers cost disproportionately more)
Limited (can’t infinitely scale one machine)
Single point of failure

Horizontal (Scale Out) Add more servers. 2 servers become 10 servers.

Advantages:

Cost-effective (many small servers cheaper than one huge server)
No limits (add servers until you run out of money)
Redundancy (losing one server doesn’t kill the system)

Disadvantages:

Application must be stateless
Coordination overhead
More complex deployment

Web applications scale horizontally. Databases often scale vertically (until they can’t, then you shard).

Scaling Triggers

Auto-scaling needs to know when to add/remove servers.

CPU-based: Add server when CPU > 70%. Remove server when CPU < 30%.

Simple but flawed. CPU might be high because one slow request is blocking. Adding more servers doesn’t help.

Request queue depth: Add server when queue depth > 100 requests. Remove server when queue depth < 10.

Better. Directly measures demand. Works for request-response systems.

Custom metrics: Add server when [business metric] exceeds threshold.

Examples:

Stream processing: Add server when lag > 10 seconds
API: Add server when p95 latency > 500ms
Background jobs: Add server when queue > 1000 jobs

These tie scaling to actual user impact.

Predictive Scaling

Reactive scaling adds servers after load increases. Users already experienced slowness. Predictive scaling adds servers before load increases.

# Historical pattern: traffic spikes at 9am weekdays
scaling_schedule = {
    'weekday_morning': {
        'time': '8:45',
        'action': 'scale_to',
        'instances': 20
    },
    'weekday_evening': {
        'time': '18:00',
        'action': 'scale_to',
        'instances': 5
    }
}

E-commerce sites scale up before Black Friday. News sites scale up before elections. SaaS apps scale up at 9am Monday when everyone logs in.

AWS Auto Scaling and GCP Autoscaler support scheduled scaling.

Cost Optimization

Auto-scaling can get expensive. Strategies to control costs:

Spot instances / Preemptible VMs Cloud providers sell excess capacity at 70-90% discount. They can reclaim it with 2 minutes notice.

Use for:

Batch processing
Background jobs
Non-critical workloads

Don’t use for:

Databases
Stateful services
Anything where interruption causes data loss

Reserved instances Commit to running servers for 1-3 years. Get 40-60% discount.

Use for baseline capacity. Your app always needs 5 servers minimum. Reserve those. Scale beyond 5 with on-demand.

Right-sizing Most servers are overprovisioned. Monitor actual usage. If CPU averages 20%, use smaller instance types.

Tools: AWS Compute Optimizer, GCP Recommender, Datadog rightsizing.

Scale-to-zero for dev environments Development and staging environments don’t need to run 24/7. Scale them to zero after business hours.

# Scale down at 7pm
0 19 * * * scale-to-zero dev-environment

# Scale up at 7am
0 7 * * * scale-up dev-environment

Saves 60% of dev environment costs (12 hours per day vs 24).

Performance Monitoring and Profiling

You can’t optimize what you don’t measure. At scale, monitoring moves from “nice to have” to “critical infrastructure.”

Application Performance Monitoring (APM)

APM tools trace requests through your system:

Request hits load balancer (3ms)
Routed to application server (2ms)
Application queries database (45ms)
Application calls external API (120ms)
Application renders response (8ms)
Total: 178ms

You immediately see the bottleneck: external API call (120ms, 67% of request time).

Popular APM tools:

New Relic: Full-featured, expensive
Datadog: Strong infrastructure monitoring + APM
Elastic APM: Open source, integrates with ELK stack
Honeycomb: Modern, excellent for microservices
Sentry: Primarily error tracking, has basic performance monitoring

Flame Graphs for Profiling

Flame graphs visualize where CPU time is spent. Each horizontal bar is a function. Width is how much time it used.

main()                          [=====================================] 100%
  ├─ handle_request()           [============================] 70%
  │   ├─ fetch_user()           [=======] 20%
  │   │   └─ database_query()   [======] 18%
  │   └─ render_template()      [===================] 50%
  │       ├─ load_template()    [=====] 12%
  │       └─ render()           [=============] 38%
  └─ send_response()            [========] 30%

You instantly see render() takes 38% of CPU time. Optimize that first.

Generate flame graphs with profilers:

Python: py-spy
Node.js: 0x, clinic.js
Ruby: stackprof
Go: pprof
Java: async-profiler

Detecting N+1 Queries

APM tools track database queries per request. Bullet (Ruby), Django Debug Toolbar (Python), and similar tools alert on N+1 queries.

Example alert:

⚠️ N+1 Query Detected
GET /posts
  SELECT * FROM posts LIMIT 10          [1 query]
  SELECT * FROM users WHERE id = ?      [10 queries]

Suggestion: Use eager loading
  posts = Post.includes(:author).limit(10)

These alerts appear in development before they hit production.

Real User Monitoring (RUM)

Server-side monitoring shows backend performance. RUM shows what users actually experience.

// Measure page load performance
window.addEventListener('load', () => {
  const perfData = performance.getEntriesByType('navigation')[0];

  const metrics = {
    dns: perfData.domainLookupEnd - perfData.domainLookupStart,
    tcp: perfData.connectEnd - perfData.connectStart,
    ttfb: perfData.responseStart - perfData.requestStart,
    download: perfData.responseEnd - perfData.responseStart,
    domProcessing: perfData.domComplete - perfData.domLoading,
    total: perfData.loadEventEnd - perfData.fetchStart
  };

  // Send to analytics
  sendMetrics(metrics);
});

This captures real-world performance including:

Network latency (varies by geography)
Client device performance (desktop vs mobile)
Browser differences
Third-party scripts slowing pages

You discover things server metrics miss: “Our app is fast in the US but slow in India because our CDN has no edge servers in Asia.”

Core Web Vitals

Google’s performance metrics that affect SEO:

Largest Contentful Paint (LCP): How long until main content is visible

Good: < 2.5s
Poor: > 4s

First Input Delay (FID): How long until page responds to interaction

Good: < 100ms
Poor: > 300ms

Cumulative Layout Shift (CLS): How much content jumps around while loading

Good: < 0.1
Poor: > 0.25

Measure with:

new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    if (entry.name === 'LCP') {
      console.log('LCP:', entry.startTime);
    }
  }
}).observe({entryTypes: ['largest-contentful-paint']});

Tools: Google PageSpeed Insights, Lighthouse, Chrome DevTools.

Case Studies: Scaling Real Systems

Discord: Handling 5 Million Concurrent Users

Problem: Message indexes growing too large. Billions of messages per channel. Queries slowing down.

Original Architecture: MongoDB with messages in single collection per channel. Indexes on message_id. As channels aged, indexes grew to gigabytes.

Solution: Moved to Cassandra partitioned by time buckets. Messages from 2024-11-01 to 2024-11-07 in one partition. Queries always include time range.

-- Old query (slow on large channels)
SELECT * FROM messages WHERE channel_id = ? ORDER BY id DESC LIMIT 50

-- New query (fast - hits one partition)
SELECT * FROM messages
WHERE channel_id = ?
  AND bucket = '2024-11-01'
  AND timestamp > ?
LIMIT 50

Result: Queries went from 2 seconds to 30ms. System now handles 850 million messages per day.

Key insight: Time-series data naturally partitions by time. Stop fighting it.

Stripe: Processing Billions of API Requests

Problem: Payment processing can’t have downtime. Every request must succeed or fail gracefully.

Architecture:

Idempotency keys on all write operations
Multi-region active-active deployment
Request retries with exponential backoff
Circuit breakers on external dependencies

Idempotency:

@app.route('/v1/charges', methods=['POST'])
def create_charge(amount, idempotency_key):
    # Check if we've processed this request before
    existing = db.query(
        "SELECT * FROM charges WHERE idempotency_key = ?",
        idempotency_key
    )
    if existing:
        # Return the previous result
        return existing

    # Process the charge
    charge = process_payment(amount)

    # Store with idempotency key
    db.insert("charges", {
        'id': charge.id,
        'amount': amount,
        'idempotency_key': idempotency_key
    })

    return charge

Client retries on network failure. Server returns same result. No duplicate charges.

Result: 99.999% uptime processing $640 billion annually.

Key insight: Make operations idempotent. Everything becomes retry-safe.

Netflix: Streaming to 200 Million Subscribers

Problem: Serving video globally with minimal buffering.

Architecture:

Open Connect CDN with servers in ISP data centers
Pre-positioned content based on viewing predictions
Adaptive bitrate streaming
Chaos engineering (deliberately break things to test resilience)

Content Pre-positioning: Netflix predicts what users will watch based on viewing history. They push popular content to edge servers during off-peak hours.

When you hit play on “Stranger Things,” it streams from a server physically located at your ISP. Not from AWS. From a box in your city.

Chaos Engineering: They randomly kill servers in production. Services must gracefully degrade. This is the origin of Chaos Monkey.

Result: 99.99% availability streaming petabytes daily.

Key insight: For latency-sensitive workloads, physics matters more than clever code. Get data closer to users.

Shopify: Black Friday Flash Sales

Problem: 10,000 requests per second baseline. 75,000 requests per second during flash sales.

Architecture:

Read replicas for database (20+ replicas during flash sales)
Redis cache with 99% hit rate
Job queues for order processing
Rate limiting to prevent store overload

Adaptive Scaling:

# Scale based on queue depth and error rate
def scale_decision
  queue_depth = redis.llen('order_queue')
  error_rate = statsd.get('http.errors.rate')

  if queue_depth > 5000 || error_rate > 0.05
    scale_up
  elsif queue_depth < 500 && error_rate < 0.01
    scale_down
  end
end

Result: Processed $2.9 billion on Black Friday 2023 without downtime.

Key insight: Plan for 10x your normal traffic. Then test at 20x.

When to Actually Use These Techniques

Most of this content is overkill for most apps. Guidelines:

Use CDN + basic caching: From day one. It’s free or cheap and simple to implement.

Use read replicas: When database CPU consistently > 60% from read queries.

Use connection pooling: From day one if using traditional databases. It’s standard practice.

Use CQRS: When read and write patterns are fundamentally different and causing contention. Probably never.

Use multi-layer caching: When your application cache hit rate is high (>80%) but you want to reduce origin load further.

Use auto-scaling: When traffic patterns are predictable (time-of-day) or spiky (events, sales).

Use async processing: When operations take >2 seconds or use significant resources.

Use edge computing: When you have global users and latency is a competitive advantage.

The pattern: start simple, measure, add complexity only when needed.

Discord didn’t start with Cassandra. They started with MongoDB. When it stopped scaling, they migrated. Stripe didn’t start with multi-region deployment. They started with a single region and added regions as they grew globally.

Build for today’s scale. Plan for tomorrow’s. Don’t build for a future that might not come.

Trade-offs at Every Layer

Every optimization has costs. Here’s what you’re signing up for:

Optimization Technique	Costs & Drawbacks	When It’s Worth It
Caching	Added complexity, stale data risks, cache invalidation bugs	Database-bound workloads where reads far outnumber writes
Read replicas	Eventual consistency issues, replication lag, operational overhead	Primary database is CPU-bound from read queries
Auto-scaling	Can get expensive if tuned poorly, adds complexity, slower than over-provisioning	Traffic patterns are unpredictable or highly variable
Edge computing	Harder to debug, limited runtime environment, cold start latency	Global user base with low-latency requirements
Async processing	Increased complexity, eventual consistency model, harder to debug failures	Operations take >200ms or involve external services

The unifying theme: complexity is a cost. Only pay it when the benefit exceeds the cost.

A 200ms response time that’s simple to maintain beats a 100ms response time that requires five engineers to understand.

Performance & Scalability Design - Deep Water

Understanding the Scalability Cliff

Global CDN Architecture Beyond Static Files

Edge Computing Patterns

Regional Caching Tiers

Cache Invalidation at Scale

Database Performance at Scale

Connection Pooling Done Right

Read Replicas and Separation

CQRS: Command Query Responsibility Segregation

Query Optimization Beyond Indexes

Partitioning Large Tables

Multi-Layer Caching Strategies

The Full Stack

Cache Coherence Strategies

Dealing with Cache Stampedes

Async Processing Architecture

Message Queues vs Event Streams

Backpressure and Rate Limiting

Dead Letter Queues

Load Balancing Deep Dive

Layer 4 vs Layer 7

Load Balancing Algorithms

Sticky Sessions and Why They Matter

Health Checks That Actually Work

Auto-Scaling Strategy

Horizontal vs Vertical Scaling

Scaling Triggers

Predictive Scaling

Cost Optimization

Performance Monitoring and Profiling

Application Performance Monitoring (APM)

Flame Graphs for Profiling

Detecting N+1 Queries

Real User Monitoring (RUM)

Core Web Vitals

Case Studies: Scaling Real Systems

Discord: Handling 5 Million Concurrent Users

Stripe: Processing Billions of API Requests

Netflix: Streaming to 200 Million Subscribers

Shopify: Black Friday Flash Sales

When to Actually Use These Techniques

Trade-offs at Every Layer

Want to Go Deeper?

Related Topics