Infrastructure as Code - Deep Water
Advanced State Management Architectures
Multi-Account State Strategy
Large organizations don’t put all infrastructure in one AWS account or state file. They split by blast radius.
Pattern: Account per environment + workload
organization/
├── terraform-state-production/
│ ├── compute/
│ │ └── terraform.tfstate (web servers, workers)
│ ├── data/
│ │ └── terraform.tfstate (databases, caches)
│ └── network/
│ └── terraform.tfstate (VPC, subnets, routing)
├── terraform-state-staging/
│ └── ... (same structure)
└── terraform-state-dev/
└── ... (same structure)
Why split state:
- Blast radius containment - Corrupted state in compute doesn’t affect databases
- Parallel operations - Different teams work on different state files simultaneously
- Different update cadence - Network changes monthly, compute changes daily
- Access control - Database team can’t accidentally destroy compute infrastructure
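That access-control split can be enforced at the backend itself. A minimal sketch of an IAM policy that limits a team to its own state prefix (bucket and path names are illustrative, following the layout above):
data "aws_iam_policy_document" "data_team_state" {
  statement {
    actions   = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
    resources = [
      "arn:aws:s3:::terraform-state-production",
      "arn:aws:s3:::terraform-state-production/data/*",
    ]
  }
  statement {
    effect    = "Deny"
    actions   = ["s3:*"]
    resources = ["arn:aws:s3:::terraform-state-production/compute/*"]
  }
}
resource "aws_iam_policy" "data_team_state" {
  name   = "data-team-terraform-state"
  policy = data.aws_iam_policy_document.data_team_state.json
}
Attached to the data team's role, a stray terraform apply in the wrong directory fails at the state read, before it can touch compute resources.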
Trade-offs:
- More complexity (multiple backends to configure)
- Cross-state dependencies require data sources or manual coordination
- More state files to backup and monitor
State Locking at Scale
When 50 engineers might run Terraform simultaneously, plain DynamoDB locking isn’t enough: a run that can’t acquire the lock simply fails rather than waiting its turn.
Enhanced locking with Terraform Cloud:
terraform {
cloud {
organization = "acme-corp"
workspaces {
name = "production-compute"
}
}
}
Terraform Cloud provides:
- Queue-based locking - Runs wait in queue rather than failing
- Speculative plans - Run plans without locking (read-only)
- Run history - See who ran what and when
- State rollback - Restore previous state versions
- Team-based access - Fine-grained permissions per workspace
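Workspace permissions themselves can be managed as code with the tfe provider. A minimal sketch (it assumes the production-compute workspace is also managed in this configuration as tfe_workspace.production_compute):
resource "tfe_team" "database" {
  name         = "database-team"
  organization = "acme-corp"
}
resource "tfe_team_access" "database_plan_only" {
  access       = "plan"
  team_id      = tfe_team.database.id
  workspace_id = tfe_workspace.production_compute.id
}
With access = "plan", the database team can review speculative plans on the compute workspace but cannot apply to it.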
Alternative: Atlantis for self-hosted
Atlantis runs Terraform in response to PR comments:
- Engineer opens PR with Terraform changes
- Comment atlantis plan on the PR
- Atlantis acquires the lock, runs the plan, and comments the output
- Team reviews plan in PR
- Comment atlantis apply to execute
- Atlantis applies, releases the lock, and comments the results
Configuration (.atlantis.yaml):
version: 3
projects:
  - name: production-compute
    dir: environments/production/compute
    workspace: production
    apply_requirements: [approved, mergeable]
    workflow: production
workflows:
  production:
    plan:
      steps:
        - init
        - plan:
            extra_args: ["-lock-timeout=10m"]
    apply:
      steps:
        - apply:
            extra_args: ["-lock-timeout=10m"]
Atlantis enforces:
- PR must be approved before apply
- PR must be mergeable (passes CI)
- Only one apply runs at a time per workspace
- Full audit trail in PR comments
State Migration and Versioning
Challenge: Moving from local state to S3, or changing state structure.
Migration process:
- Backup current state:
cp terraform.tfstate terraform.tfstate.backup
- Configure new backend:
terraform {
backend "s3" {
bucket = "new-terraform-state"
key = "prod/terraform.tfstate"
region = "us-east-1"
}
}
- Initialize and migrate:
terraform init -migrate-state
Terraform prompts: “Do you want to copy existing state to the new backend?”
- Verify migration:
terraform plan # Should show no changes
State version conflicts:
Terraform state format changes between versions. State created with Terraform 1.5 might not work with 1.3.
Solution: Version pinning
terraform {
required_version = ">= 1.5.0, < 2.0.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
Team uses same Terraform version → no state format incompatibilities.
State versioning in S3:
resource "aws_s3_bucket_versioning" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
versioning_configuration {
status = "Enabled"
}
}
Corrupt state? Restore previous version:
aws s3api list-object-versions --bucket terraform-state --prefix prod/terraform.tfstate
# Get version ID from output
aws s3api get-object --bucket terraform-state \
--key prod/terraform.tfstate \
--version-id <version-id> \
terraform.tfstate
Advanced Module Patterns
Dynamic Module Composition
Pattern: Conditional resource creation
Some environments need a bastion host, others don’t:
# modules/network/main.tf
variable "create_bastion" {
type = bool
default = false
}
resource "aws_instance" "bastion" {
count = var.create_bastion ? 1 : 0
ami = data.aws_ami.ubuntu.id
instance_type = "t2.micro"
subnet_id = aws_subnet.public[0].id
tags = {
Name = "bastion"
}
}
output "bastion_ip" {
value = var.create_bastion ? aws_instance.bastion[0].public_ip : null
}
Usage:
module "production_network" {
source = "../../modules/network"
create_bastion = false # Production uses VPN, not bastion
}
module "dev_network" {
source = "../../modules/network"
create_bastion = true # Dev uses bastion for convenience
}
Module Dependency Injection
Pattern: Externalize dependencies
Instead of hardcoding which VPC to use, inject it:
# modules/app/main.tf
variable "vpc_id" {
description = "VPC to deploy into"
type = string
}
variable "subnet_ids" {
description = "Subnets for load balancer"
type = list(string)
}
resource "aws_lb" "app" {
name = "app-lb"
subnets = var.subnet_ids
load_balancer_type = "application"
# ... config ...
}
Usage with different VPC modules:
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
# ... VPC config ...
}
module "app" {
source = "../../modules/app"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.public_subnets
}
Benefits:
- App module works with any VPC implementation
- Easy to test (inject a throwaway VPC, as sketched after this list)
- Reusable across different network architectures
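For testing, the injected VPC can be a disposable one created alongside the module under test. A hedged sketch of a hypothetical examples/app-test configuration using the public VPC module (names and CIDRs are illustrative):
module "test_vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name           = "app-module-test"
  cidr           = "10.99.0.0/16"
  azs            = ["us-east-1a", "us-east-1b"]
  public_subnets = ["10.99.1.0/24", "10.99.2.0/24"]
}
module "app_under_test" {
  source     = "../../modules/app"
  vpc_id     = module.test_vpc.vpc_id
  subnet_ids = module.test_vpc.public_subnets
}
Example configurations like this are exactly what the Terratest pattern below points TerraformDir at.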
Module Testing with Terratest
Real infrastructure testing:
package test
import (
"fmt"
"testing"
"time"
"github.com/gruntwork-io/terratest/modules/terraform"
"github.com/gruntwork-io/terratest/modules/aws"
"github.com/gruntwork-io/terratest/modules/http-helper"
"github.com/stretchr/testify/assert"
)
func TestWebServerModule(t *testing.T) {
t.Parallel()
awsRegion := "us-east-1"
terraformOptions := &terraform.Options{
TerraformDir: "../examples/web-server",
Vars: map[string]interface{}{
"region": awsRegion,
"instance_type": "t2.micro",
},
EnvVars: map[string]string{
"AWS_DEFAULT_REGION": awsRegion,
},
}
defer terraform.Destroy(t, terraformOptions)
terraform.InitAndApply(t, terraformOptions)
// Get outputs
instanceID := terraform.Output(t, terraformOptions, "instance_id")
publicIP := terraform.Output(t, terraformOptions, "public_ip")
// Verify instance exists
assert.NotEmpty(t, instanceID)
// Verify instance is running
instanceState := aws.GetEc2InstanceState(t, awsRegion, instanceID)
assert.Equal(t, "running", instanceState)
// Verify HTTP endpoint responds
url := fmt.Sprintf("http://%s", publicIP)
http_helper.HttpGetWithRetry(t, url, nil, 200, "Hello World", 30, 5*time.Second)
}
What this test validates:
- Terraform code is syntactically valid
- Infrastructure actually deploys
- Resources are configured correctly (instance running)
- Application is functional (HTTP endpoint responds)
CI/CD integration:
name: Module Tests
on: push
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-go@v4
        with:
          go-version: '1.21'
      - name: Run Terratest
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          cd test
          go test -v -timeout 30m
Cost control: Tests create real infrastructure. Costs add up.
Solutions:
- Run tests in a separate AWS account with billing alerts (see the budget sketch after this list)
- Use t2.micro/smallest instances
- Destroy immediately after tests (automated in defer)
- Run on PR only, not every commit
- Use spot instances where possible
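The billing-alert guardrail from the first bullet can itself live in Terraform. A minimal sketch using AWS Budgets (the limit, email address, and names are illustrative):
resource "aws_budgets_budget" "terratest_account" {
  name         = "terratest-monthly"
  budget_type  = "COST"
  limit_amount = "100"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["platform-team@example.com"]
  }
}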
Policy as Code: Advanced Patterns
Sentinel: Complex Policy Logic
Pattern: Environment-specific policies
# policies/instance-size.sentinel
import "tfplan/v2" as tfplan
import "strings"
# Get environment from workspace name
env = strings.split(tfplan.workspace, "-")[0]
# Instance size limits by environment
size_limits = {
"prod": ["t3.medium", "t3.large", "m5.large"],
"staging": ["t3.small", "t3.medium"],
"dev": ["t2.micro", "t2.small"],
}
# Find all EC2 instances
instances = filter tfplan.resource_changes as _, rc {
rc.type is "aws_instance" and
rc.mode is "managed" and
(rc.change.actions contains "create" or rc.change.actions contains "update")
}
# Check each instance
instance_size_valid = rule {
all instances as _, instance {
instance.change.after.instance_type in size_limits[env]
}
}
main = rule {
instance_size_valid
}
Result:
- Dev environment can only create t2.micro/small (cost control)
- Staging limited to t3.small/medium (realistic but not expensive)
- Production can use larger instances (performance matters)
OPA: Advanced Compliance
Pattern: Multi-cloud compliance
package terraform.compliance
import future.keywords
# All S3 buckets must have encryption, versioning, and logging
deny[msg] {
resource := input.resource.aws_s3_bucket[name]
not resource.server_side_encryption_configuration
msg := sprintf("S3 bucket %s must have encryption enabled", [name])
}
deny[msg] {
resource := input.resource.aws_s3_bucket[name]
not resource.versioning[_].enabled
msg := sprintf("S3 bucket %s must have versioning enabled", [name])
}
# All databases must be in private subnets
deny[msg] {
resource := input.resource.aws_db_instance[name]
subnet_group := input.resource.aws_db_subnet_group[resource.db_subnet_group_name]
subnet := input.resource.aws_subnet[subnet_group.subnet_ids[_]]
subnet.map_public_ip_on_launch == true
msg := sprintf("Database %s must not be in public subnet", [name])
}
# Require specific tags
required_tags := ["Environment", "Owner", "CostCenter"]
deny[msg] {
resource := input.resource[type][name]
resource_types := ["aws_instance", "aws_db_instance", "aws_s3_bucket"]
type in resource_types
tag := required_tags[_]
not resource.tags[tag]
msg := sprintf("%s %s missing required tag: %s", [type, name, tag])
}
Testing policies:
# Create test Terraform plan
terraform plan -out=tfplan.binary
terraform show -json tfplan.binary > tfplan.json
# Test policies
conftest test tfplan.json
FAIL - tfplan.json
S3 bucket 'logs' must have versioning enabled
Database 'primary' must not be in public subnet
aws_instance 'web' missing required tag: CostCenter
3 tests, 0 passed, 3 failed
CI/CD enforcement:
- name: Policy Check
  run: |
    terraform plan -out=tfplan.binary
    terraform show -json tfplan.binary > tfplan.json
    if ! conftest test tfplan.json; then
      echo "Policy violations detected. Fix before merge."
      exit 1
    fi
CloudFormation Guard
Pattern: AWS-specific compliance
# All S3 buckets must block public access
AWS::S3::Bucket {
Properties.PublicAccessBlockConfiguration.BlockPublicAcls == true
Properties.PublicAccessBlockConfiguration.BlockPublicPolicy == true
Properties.PublicAccessBlockConfiguration.IgnorePublicAcls == true
Properties.PublicAccessBlockConfiguration.RestrictPublicBuckets == true
}
# RDS instances must be encrypted
AWS::RDS::DBInstance {
Properties.StorageEncrypted == true
Properties.KmsKeyId exists
}
# Security groups must not allow 0.0.0.0/0 on port 22 (SSH)
AWS::EC2::SecurityGroup {
Properties.SecurityGroupIngress[*] {
when IpProtocol == "tcp" and (FromPort <= 22 and ToPort >= 22) {
CidrIp != "0.0.0.0/0"
CidrIpv6 != "::/0"
}
}
}
Running Guard:
cfn-guard validate --data template.yaml --rules rules.guard
Summary Report for template.yaml
FAILED: 2 rules failed
SKIPPED: 0 rules skipped
Rule [AWS::EC2::SecurityGroup]
Status: FAIL
Location: Resources/WebServerSecurityGroup/Properties/SecurityGroupIngress/0/CidrIp
Message: Expected "0.0.0.0/0" to not equal "0.0.0.0/0"
Multi-Cloud and Hybrid Infrastructure
Cross-Cloud Resource Management
Pattern: Application in AWS, observability in GCP, CI/CD in Azure
# providers.tf
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
google = {
source = "hashicorp/google"
version = "~> 5.0"
}
azuredevops = {
source = "microsoft/azuredevops"
version = "~> 0.11"
}
}
}
provider "aws" {
region = "us-east-1"
}
provider "google" {
project = "my-project"
region = "us-central1"
}
provider "azuredevops" {
org_service_url = "https://dev.azure.com/myorg"
}
# Application infrastructure in AWS
resource "aws_instance" "app_server" {
ami = data.aws_ami.ubuntu.id
instance_type = "t3.medium"
# ... config ...
}
# Logs sent to Google Cloud Logging
resource "google_logging_sink" "aws_logs" {
name = "aws-application-logs"
destination = "bigquery.googleapis.com/projects/my-project/datasets/aws_logs"
filter = "resource.type=aws_instance"
}
# CI/CD pipeline in Azure DevOps
resource "azuredevops_build_definition" "app_pipeline" {
project_id = azuredevops_project.project.id
name = "app-deployment-pipeline"
repository {
repo_type = "GitHub"
repo_id = "myorg/myapp"
branch_name = "main"
}
ci_trigger {
use_yaml = true
}
}
Challenge: Cross-cloud networking
AWS VPC can’t directly peer with GCP VPC. Solutions:
- VPN connections:
resource "aws_vpn_gateway" "aws_gateway" {
vpc_id = aws_vpc.main.id
}
resource "google_compute_vpn_gateway" "gcp_gateway" {
name = "aws-vpn"
network = google_compute_network.main.id
}
resource "google_compute_vpn_tunnel" "tunnel" {
name = "aws-tunnel"
peer_ip = aws_vpn_connection.main.tunnel1_address
shared_secret = random_string.vpn_secret.result
target_vpn_gateway = google_compute_vpn_gateway.gcp_gateway.id
# ... routing ...
}
- Cloud interconnect (expensive but fast):
- AWS Direct Connect + Google Cloud Interconnect
- Dedicated physical connection between clouds
- $1,000+ monthly, but < 10ms latency
- Public internet with service mesh:
- No VPN needed
- Services communicate over HTTPS
- mTLS for authentication and encryption
- Istio or Linkerd handle routing
Hybrid Cloud (On-Prem + Cloud)
Pattern: Gradual cloud migration
# On-prem data center (managed via custom provider or API)
resource "vmware_virtual_machine" "legacy_database" {
name = "legacy-db-01"
# ... on-prem config ...
}
# Cloud-based application servers
resource "aws_instance" "app_server" {
count = 3
ami = data.aws_ami.app.id
instance_type = "t3.medium"
# Connect to on-prem database via VPN
user_data = templatefile("${path.module}/user-data.sh", {
db_host = vmware_virtual_machine.legacy_database.default_ip_address
db_port = 5432
})
}
# VPN connection between AWS and on-prem
resource "aws_vpn_connection" "onprem" {
vpn_gateway_id = aws_vpn_gateway.main.id
customer_gateway_id = aws_customer_gateway.onprem.id
type = "ipsec.1"
static_routes_only = true
}
Migration phases:
- Phase 1: On-prem database, cloud app servers (connected via VPN)
- Phase 2: Cloud database (replica), on-prem database (primary), cloud app servers
- Phase 3: Cloud database (primary), on-prem database (retired), cloud app servers
Terraform manages all three phases with same code, different variable values.
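A hedged sketch of how a single variable can drive those phases (the phase numbering and resource names are illustrative):
variable "migration_phase" {
  description = "1 = on-prem primary only, 2 = cloud replica added, 3 = cloud primary"
  type        = number
  default     = 1
}
# The cloud database only exists from phase 2 onward
resource "aws_db_instance" "cloud_db" {
  count          = var.migration_phase >= 2 ? 1 : 0
  engine         = "postgres"
  instance_class = "db.t3.medium"
  # ... config ...
}
locals {
  # App servers keep pointing at on-prem until phase 3 cuts over
  db_host = var.migration_phase >= 3 ? aws_db_instance.cloud_db[0].address : vmware_virtual_machine.legacy_database.default_ip_address
}
Promoting to phase 3 then becomes a variable change plus a plan review, not a rewrite of the configuration.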
Large-Scale Terraform: 10,000+ Resources
Workspace and Directory Structure
Anti-pattern: Monolithic state
One Terraform directory with 10,000 resources:
- terraform plan takes 30 minutes
- One typo risks destroying everything
- 50 engineers fighting over state locks
Solution: Hierarchical workspaces
infrastructure/
├── global/ # Resources shared across regions
│ ├── route53/ # DNS (global)
│ ├── iam/ # IAM roles (global)
│ └── cloudfront/ # CDN (global)
├── us-east-1/
│ ├── network/ # VPC, subnets, routing
│ ├── compute/
│ │ ├── web/ # Web servers
│ │ ├── api/ # API servers
│ │ └── workers/ # Background workers
│ ├── data/
│ │ ├── rds/ # Databases
│ │ ├── elasticache/ # Redis
│ │ └── s3/ # Object storage
│ └── observability/ # CloudWatch, logs
└── eu-west-1/
└── ... (same structure)
Each directory = separate state file:
- network state: 50 resources, plan takes 30 seconds
- compute/web state: 100 resources, plan takes 1 minute
- Teams work independently
Cross-directory dependencies:
compute/web/main.tf:
data "terraform_remote_state" "network" {
backend = "s3"
config = {
bucket = "terraform-state"
key = "us-east-1/network/terraform.tfstate"
region = "us-east-1"
}
}
resource "aws_instance" "web" {
# Use VPC from network state
subnet_id = data.terraform_remote_state.network.outputs.public_subnets[0]
vpc_security_group_ids = [
data.terraform_remote_state.network.outputs.web_security_group_id
]
# ... config ...
}
Trade-off: More complex (explicit dependencies) but faster and safer.
Terragrunt for DRY Configuration
Problem: Each environment duplicates backend config:
# environments/dev/main.tf
terraform {
backend "s3" {
bucket = "terraform-state"
key = "dev/terraform.tfstate"
region = "us-east-1"
dynamodb_table = "terraform-locks"
}
}
# environments/prod/main.tf
terraform {
backend "s3" {
bucket = "terraform-state"
key = "prod/terraform.tfstate" # Only difference
region = "us-east-1"
dynamodb_table = "terraform-locks"
}
}
Terragrunt solution:
# terragrunt.hcl (root)
remote_state {
backend = "s3"
generate = {
path = "backend.tf"
if_exists = "overwrite"
}
config = {
bucket = "terraform-state"
key = "${path_relative_to_include()}/terraform.tfstate"
region = "us-east-1"
dynamodb_table = "terraform-locks"
encrypt = true
}
}
# environments/dev/terragrunt.hcl
include "root" {
path = find_in_parent_folders()
}
# environments/prod/terragrunt.hcl
include "root" {
path = find_in_parent_folders()
}
Run terragrunt plan instead of terraform plan. Backend configured automatically.
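In environments/dev, for example, the generated backend.tf would look roughly like this (the key comes from path_relative_to_include() and depends on where the root terragrunt.hcl sits):
terraform {
  backend "s3" {
    bucket         = "terraform-state"
    key            = "environments/dev/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}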
Terragrunt for dependencies:
# compute/terragrunt.hcl
dependency "network" {
config_path = "../network"
}
inputs = {
vpc_id = dependency.network.outputs.vpc_id
subnet_ids = dependency.network.outputs.private_subnets
}
terragrunt run-all apply automatically:
- Applies network first
- Waits for network to complete
- Applies compute using network outputs
Parallelization and Performance
Terraform parallelism:
terraform apply -parallelism=20 # Default is 10
Creates/updates 20 resources simultaneously. Faster, but riskier (harder to roll back).
For large deployments:
terraform apply -parallelism=50 # Aggressive
Monitor for API rate limits (AWS, GCP, Azure all have limits).
Targeted applies for speed:
Only updating web servers:
terraform apply -target=module.web_servers
Runs in seconds instead of minutes. But loses dependency tracking temporarily.
Plan file caching:
Generate plan once, apply multiple times:
terraform plan -out=tfplan
# Review tfplan
terraform apply tfplan # No need to re-plan
Drift Detection and Reconciliation at Scale
Continuous Drift Detection
Challenge: 10,000 resources, drift happens daily.
Solution: Scheduled drift detection per workspace
# .github/workflows/drift-detection.yml
name: Nightly Drift Detection
on:
  schedule:
    - cron: '0 3 * * *'  # 3 AM daily
jobs:
  detect-drift:
    strategy:
      matrix:
        workspace:
          - global/route53
          - us-east-1/network
          - us-east-1/compute/web
          - us-east-1/compute/api
          - us-east-1/data/rds
          # ... all workspaces
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
      - name: Detect Drift in ${{ matrix.workspace }}
        working-directory: infrastructure/${{ matrix.workspace }}
        run: |
          terraform init
          terraform plan -detailed-exitcode > plan.txt || true
          if grep -q "will be" plan.txt; then
            echo "::error::Drift detected in ${{ matrix.workspace }}"
            cat plan.txt
            # Send to Slack
            curl -X POST $SLACK_WEBHOOK -H 'Content-Type: application/json' \
              -d "{\"text\":\"Drift detected in ${{ matrix.workspace }}\"}"
          fi
Each workspace checked independently. Drifted workspaces alert team via Slack.
Automated Drift Reconciliation
Pattern: Auto-fix approved drift types
# drift-reconciliation.py
import subprocess
import json
def detect_drift(workspace):
    cwd = f'infrastructure/{workspace}'
    # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present
    plan = subprocess.run(
        ['terraform', 'plan', '-out=tfplan.binary', '-detailed-exitcode'],
        cwd=cwd, capture_output=True)
    if plan.returncode != 2:
        return None  # No drift (or an error that needs separate handling)
    # show -json on the saved plan yields the structured resource_changes list
    show = subprocess.run(['terraform', 'show', '-json', 'tfplan.binary'],
                          cwd=cwd, capture_output=True)
    return json.loads(show.stdout)

def is_safe_to_auto_fix(change):
    """Only auto-fix in-place updates that touch nothing but tags"""
    actions = change['change']['actions']
    if actions != ['update']:
        return False  # Creates, deletes, and replacements need human approval
    before = change['change']['before'] or {}
    after = change['change']['after'] or {}
    changed = {k for k in after if before.get(k) != after.get(k)}
    # Anything beyond tags / tags_all requires human review
    return all(attr.startswith('tag') for attr in changed)
def reconcile_workspace(workspace):
    drift = detect_drift(workspace)
    if not drift:
        return  # No drift
    for resource_change in drift['resource_changes']:
        if is_safe_to_auto_fix(resource_change):
            print(f"Auto-fixing drift in {resource_change['address']}")
            subprocess.run([
                'terraform', 'apply', '-auto-approve',
                f'-target={resource_change["address"]}'
            ], cwd=f'infrastructure/{workspace}')
        else:
            print(f"Manual review required for {resource_change['address']}")
            # Create GitHub issue or Jira ticket

workspaces = [
    'us-east-1/network',
    'us-east-1/compute/web',
    # ... all workspaces
]

for workspace in workspaces:
    reconcile_workspace(workspace)
Safety guardrails:
- Only auto-fix tag changes
- Resource updates require human review
- Deletions always blocked (manual approval required)
- Drift exceeding threshold triggers incident
Cost Optimization via IaC
FinOps Integration
Pattern: Cost tagging and allocation
locals {
common_tags = {
Environment = var.environment
Owner = var.owner
CostCenter = var.cost_center
ManagedBy = "terraform"
Project = var.project_name
}
}
resource "aws_instance" "web" {
# ... config ...
tags = merge(local.common_tags, {
Name = "web-server-${var.environment}"
Role = "web"
})
}
resource "aws_db_instance" "database" {
# ... config ...
tags = merge(local.common_tags, {
Name = "database-${var.environment}"
Role = "data"
})
}
AWS Cost Explorer can now break down costs by:
- Environment (dev vs prod)
- Owner (team1 vs team2)
- CostCenter (engineering vs marketing)
- Project (project-alpha vs project-beta)
Policy enforcement:
# Require cost tags
required_cost_tags := ["CostCenter", "Owner", "Project"]
deny[msg] {
resource := input.resource[type][name]
tag := required_cost_tags[_]
not resource.tags[tag]
msg := sprintf("Resource %s missing cost tag: %s", [name, tag])
}
Scheduled Resource Management
Pattern: Auto-shutdown dev environments at night
resource "aws_instance" "dev_server" {
count = var.environment == "dev" ? var.instance_count : 0
# ... config ...
tags = merge(local.common_tags, {
AutoShutdown = "true"
ShutdownTime = "19:00"
StartupTime = "08:00"
})
}
Lambda function reads tags, stops instances at 7 PM, starts at 8 AM:
import boto3
from datetime import datetime
ec2 = boto3.client('ec2')
def lambda_handler(event, context):
    current_hour = datetime.now().hour
    # Find instances with the AutoShutdown tag
    instances = ec2.describe_instances(Filters=[
        {'Name': 'tag:AutoShutdown', 'Values': ['true']}
    ])
    for reservation in instances['Reservations']:
        for instance in reservation['Instances']:
            # EC2 tags come back as a list of {'Key': ..., 'Value': ...} dicts
            tags = {t['Key']: t['Value'] for t in instance.get('Tags', [])}
            shutdown_hour = int(tags['ShutdownTime'].split(':')[0])
            startup_hour = int(tags['StartupTime'].split(':')[0])
            if current_hour == shutdown_hour:
                ec2.stop_instances(InstanceIds=[instance['InstanceId']])
            elif current_hour == startup_hour:
                ec2.start_instances(InstanceIds=[instance['InstanceId']])
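The hourly trigger for that function can be managed in the same Terraform code. A minimal sketch (it assumes the handler above is packaged as aws_lambda_function.scheduler; names are illustrative):
resource "aws_cloudwatch_event_rule" "hourly" {
  name                = "instance-scheduler-hourly"
  schedule_expression = "rate(1 hour)"
}
resource "aws_cloudwatch_event_target" "scheduler" {
  rule = aws_cloudwatch_event_rule.hourly.name
  arn  = aws_lambda_function.scheduler.arn
}
resource "aws_lambda_permission" "allow_events" {
  statement_id  = "AllowExecutionFromEventBridge"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.scheduler.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.hourly.arn
}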
Savings: Dev/staging environments run 50 hours/week instead of 168 hours/week → 70% cost reduction.
Right-Sizing Analysis
Pattern: Terraform + CloudWatch metrics
resource "aws_instance" "web" {
instance_type = var.instance_type
# Enable detailed monitoring for right-sizing
monitoring = true
tags = merge(local.common_tags, {
RightSizingCandidate = "true"
})
}
resource "aws_cloudwatch_metric_alarm" "cpu_low" {
alarm_name = "web-server-underutilized"
comparison_operator = "LessThanThreshold"
evaluation_periods = "24" # 24 hours
metric_name = "CPUUtilization"
namespace = "AWS/EC2"
period = "3600" # 1 hour
statistic = "Average"
threshold = "10" # Less than 10% CPU
dimensions = {
InstanceId = aws_instance.web.id
}
alarm_actions = [aws_sns_topic.rightsizing_alerts.arn]
}
When CPU < 10% for 24 hours, alert triggers → investigate downgrading from t3.medium to t3.small.
Advanced Secrets Management
Dynamic Secrets with Vault
Pattern: Terraform creates infrastructure, Vault manages secrets
# Create database
resource "aws_db_instance" "app_db" {
# ... config ...
username = "vaultadmin"
password = random_password.vault_master.result
}
# Configure Vault to manage database credentials
resource "vault_database_secret_backend_connection" "postgres" {
backend = vault_mount.db.path
name = "app-database"
allowed_roles = ["app-role"]
postgresql {
connection_url = "postgresql://{{username}}:{{password}}@${aws_db_instance.app_db.endpoint}/app"
username = "vaultadmin"
password = random_password.vault_master.result
}
}
resource "vault_database_secret_backend_role" "app" {
backend = vault_mount.db.path
name = "app-role"
db_name = vault_database_secret_backend_connection.postgres.name
creation_statements = [
"CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';",
"GRANT SELECT, INSERT, UPDATE ON ALL TABLES IN SCHEMA public TO \"{{name}}\";"
]
default_ttl = 3600 # 1 hour
max_ttl = 86400 # 24 hours
}
Application retrieves database credentials from Vault:
- Credentials are unique per instance
- Expire after 1 hour
- Automatically rotated
- Compromised credentials limited blast radius
Secrets Rotation with Terraform
Challenge: Rotating secrets without downtime.
Pattern: Blue-green secret rotation
resource "random_password" "db_password_v1" {
length = 32
special = true
lifecycle {
create_before_destroy = true
}
}
resource "random_password" "db_password_v2" {
length = 32
special = true
keepers = {
rotation_date = var.rotation_date # Change to trigger rotation
}
lifecycle {
create_before_destroy = true
}
}
resource "aws_secretsmanager_secret_version" "db_password" {
secret_id = aws_secretsmanager_secret.db_password.id
secret_string = var.use_password_v2 ? random_password.db_password_v2.result : random_password.db_password_v1.result
}
resource "aws_db_instance" "database" {
# ... config ...
password = var.use_password_v2 ? random_password.db_password_v2.result : random_password.db_password_v1.result
}
Rotation process:
- var.use_password_v2 = false (still using v1)
- Create the v2 secret
- Update the application to read v2 from Secrets Manager
- Deploy the application (gradual rollout)
- Once all instances use v2: set var.use_password_v2 = true
- Database password updated to v2
- v1 can be destroyed
No downtime. Both secrets valid during transition.
Compliance and Audit
Terraform Cloud Audit Logs
Every Terraform operation logged:
{
"timestamp": "2025-11-16T14:23:01Z",
"user": "alice@example.com",
"action": "apply",
"workspace": "production-compute",
"resources_changed": 3,
"resources_created": 0,
"resources_destroyed": 0,
"resources_updated": 3,
"plan_id": "plan-abc123",
"apply_id": "run-def456",
"status": "applied",
"duration_seconds": 47
}
Compliance questions answered:
- Who made this change? (user)
- When? (timestamp)
- What changed? (plan_id links to detailed diff)
- Was it reviewed? (approval workflow)
- Did it succeed? (status)
Git-Based Audit Trail
Every infrastructure change = git commit:
$ git log --oneline infrastructure/
a1b2c3d (HEAD -> main) Increase web server instance type to t3.large
d4e5f6g Add Redis cache to production
g7h8i9j Update RDS to db.t3.medium for better performance
j1k2l3m Initial production infrastructure
Each commit shows:
- What changed (git diff a1b2c3d)
- Who changed it (git log --format="%an <%ae>" a1b2c3d)
- Why (git log --format="%B" a1b2c3d, the commit message)
- When (git log --format="%ai" a1b2c3d)
For SOC 2 / ISO 27001:
- Auditors can review complete change history
- Every change tied to person and reason
- No undocumented infrastructure modifications
- Rollback history preserved
Summary: When to Use Advanced Patterns
| Pattern | Use When | Overkill If |
|---|---|---|
| Multi-account state | 100+ resources, multiple teams | < 50 resources, single team |
| Terragrunt | Lots of duplication across environments | Simple, single environment |
| Policy as code | Compliance requirements, large team | Small team with code review |
| Module testing | Reusable modules shared across org | One-off infrastructure |
| Drift reconciliation | Resources frequently modified manually | Strict IaC-only policy |
| Cost optimization | $10k+/month cloud bill | < $1k/month |
| Vault dynamic secrets | High-security requirements | Static credentials acceptable |
The goal isn’t using every advanced pattern. It’s using the right patterns for your scale, security requirements, and team maturity. Start simple, add complexity only when you feel the pain it solves.