From Single Region to Production-Grade Global Infrastructure

Day 27 of the 30-Day Terraform Challenge — and today I built something that can survive an entire AWS region going offline.

Yesterday I built a scalable web app in one region. Today I built infrastructure that spans two regions, with automatic failover, cross-region database replication, and zero single points of failure.

This is what production-grade looks like.

The Architecture

 ┌─────────────────────────────────────────────────────────────┐
 │                    Route53 Failover DNS                     │
 │                       app.example.com                       │
 └─────────────────────┬───────────────┬───────────────────────┘
                       │               │
 ┌─────────────────────▼───────────────▼───────────────────────┐
 │                                                             │
 │  PRIMARY REGION (us-east-1)   SECONDARY REGION (us-west-2)  │
 │                                                             │
 │       ┌─────────────┐               ┌─────────────┐         │
 │       │     ALB     │               │     ALB     │         │
 │       └──────┬──────┘               └──────┬──────┘         │
 │              │                             │                │
 │       ┌──────▼──────┐               ┌──────▼──────┐         │
 │       │     ASG     │               │     ASG     │         │
 │       │  (2-4 EC2)  │               │  (2-4 EC2)  │         │
 │       └──────┬──────┘               └──────┬──────┘         │
 │              │                             │                │
 │       ┌──────▼──────┐               ┌──────▼──────┐         │
 │       │ RDS Multi-AZ│──────────────►│ RDS Replica │         │
 │       │  (Primary)  │  Replication  │ (Read-only) │         │
 │       └─────────────┘               └─────────────┘         │
 └─────────────────────────────────────────────────────────────┘

What's happening:

Route53 health checks monitor both regions
If primary fails, DNS automatically routes to secondary
An RDS cross-region read replica asynchronously copies data to the secondary region
Each region has its own VPC, ALB, and Auto Scaling Group

The Project Structure

day27-multi-region-ha/
├── modules/
│   ├── vpc/          # VPC, subnets, NAT gateways
│   ├── alb/          # Load balancer, target group
│   ├── asg/          # Auto Scaling, CloudWatch alarms
│   ├── rds/          # RDS instance with Multi-AZ and replicas
│   └── route53/      # DNS failover routing
├── envs/
│   └── prod/
│       ├── main.tf
│       ├── variables.tf
│       └── terraform.tfvars
├── backend.tf
└── provider.tf

Five modules, each with a single responsibility. The VPC module doesn't know about the ALB. The ALB module doesn't know about the ASG. The calling configuration wires them together.

The VPC Module (Network Foundation)

# modules/vpc/main.tf
resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_support   = true
  enable_dns_hostnames = true
}

# Public subnets, one per AZ, for the ALB and the NAT gateways
resource "aws_subnet" "public" {
  count                   = length(var.public_subnet_cidrs)
  vpc_id                  = aws_vpc.main.id
  cidr_block              = var.public_subnet_cidrs[count.index]
  availability_zone       = var.availability_zones[count.index]
  map_public_ip_on_launch = true
}

# Private subnets, one per AZ, for the EC2 instances
resource "aws_subnet" "private" {
  count             = length(var.private_subnet_cidrs)
  vpc_id            = aws_vpc.main.id
  cidr_block        = var.private_subnet_cidrs[count.index]
  availability_zone = var.availability_zones[count.index]
}

# One Elastic IP per NAT gateway, referenced below.
# Assumes AWS provider v5+, where `domain = "vpc"` replaces `vpc = true`.
resource "aws_eip" "nat" {
  count  = length(var.public_subnet_cidrs)
  domain = "vpc"
}

# One NAT gateway per public subnet
resource "aws_nat_gateway" "main" {
  count         = length(var.public_subnet_cidrs)
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
}

Why two subnet types:

Public subnets → ALB (needs internet access)
Private subnets → EC2 instances (no direct internet access)
NAT Gateways → allow instances to download packages while remaining private
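The excerpt above stops at the NAT gateways, so the routing that makes the public/private split work is not shown. Here is a minimal sketch of what it might look like. It assumes the same number of public and private subnets in the same AZ order, and the resource names are illustrative, not taken from the original module.

```hcl
# Hypothetical additions to modules/vpc/main.tf

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

# Public subnets send internet-bound traffic through the internet gateway.
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table_association" "public" {
  count          = length(var.public_subnet_cidrs)
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

# Each private subnet gets its own route table pointing at the NAT gateway
# with the same index, so outbound traffic stays within one AZ.
resource "aws_route_table" "private" {
  count  = length(var.private_subnet_cidrs)
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main[count.index].id
  }
}

resource "aws_route_table_association" "private" {
  count          = length(var.private_subnet_cidrs)
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private[count.index].id
}
```

With one NAT gateway per AZ, losing a single AZ does not cut off outbound access for instances in the others, at the cost of paying for each gateway.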

The ALB Module (Traffic Distribution)

# modules/alb/main.tf
resource "aws_lb" "web" {
  name               = "${var.name}-alb-${var.region}"
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id] # defined elsewhere in the module
  subnets            = var.subnet_ids              # public subnets from the VPC module
}

resource "aws_lb_target_group" "web" {
  name     = "${var.name}-tg-${var.region}"
  port     = 80
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  # An instance is marked healthy or unhealthy after 2 consecutive results
  health_check {
    path                = "/health"
    interval            = 30
    healthy_threshold   = 2
    unhealthy_threshold = 2
  }
}

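The health check endpoint (/health) does double duty. The ALB uses it to decide which instances receive traffic, and the Route53 health check requests the same path through the ALB to decide whether the whole region is healthy.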
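The excerpt shows the load balancer and the target group but not the listener that connects them. A minimal sketch, assuming plain HTTP on port 80 to match the target group (the resource name `http` is illustrative):

```hcl
# Hypothetical addition to modules/alb/main.tf
resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.web.arn
  port              = 80
  protocol          = "HTTP"

  # Forward every request to the target group defined above.
  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.web.arn
  }
}
```

Without a listener, the ALB accepts no connections, so requests to /health would never reach an instance.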

The ASG Module (Auto Scaling)

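# modules/asg/main.tf
# Excerpt: the launch template and vpc_zone_identifier are omitted here.
resource "aws_autoscaling_group" "web" {
  min_size          = var.min_size
  max_size          = var.max_size
  desired_capacity  = var.desired_capacity
  target_group_arns = var.target_group_arns # registers instances with the ALB
  health_check_type = "ELB"                 # replace instances the ALB reports unhealthy
}

# Excerpt: metric_name, namespace, statistic, period, evaluation_periods,
# and comparison_operator are also required on this resource.
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  alarm_name    = "web-cpu-high-${var.environment}-${var.region}"
  threshold     = 70
  alarm_actions = [aws_autoscaling_policy.scale_out.arn]
}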

The connection: target_group_arns links the ASG to the ALB. Without this, instances launch but never receive traffic.
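The alarm above points at `aws_autoscaling_policy.scale_out`, which the excerpt does not show. The following is a sketch of how the policy and a fully specified alarm could fit together. The period, evaluation count, cooldown, and step size are assumed values chosen for illustration, not the original settings.

```hcl
# Hypothetical scale-out policy: add one instance each time the alarm fires.
resource "aws_autoscaling_policy" "scale_out" {
  name                   = "web-scale-out-${var.environment}-${var.region}"
  autoscaling_group_name = aws_autoscaling_group.web.name
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = 1
  cooldown               = 300 # assumed value, in seconds
}

# Fires when average CPU across the group stays above 70%
# for two consecutive 2-minute periods (assumed timing).
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  alarm_name          = "web-cpu-high-${var.environment}-${var.region}"
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  comparison_operator = "GreaterThanThreshold"
  period              = 120
  evaluation_periods  = 2
  threshold           = 70
  alarm_actions       = [aws_autoscaling_policy.scale_out.arn]

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.web.name
  }
}
```

A matching scale-in policy with a lower threshold would normally accompany this so the group shrinks again once load drops.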

The RDS Module (Database Tier)

# modules/rds/main.tf
resource "aws_db_instance" "main" {
  identifier        = var.identifier
  engine            = "mysql"
  instance_class    = "db.t3.micro"
  multi_az          = var.multi_az # synchronous standby in a second AZ
  storage_encrypted = true

  # For a cross-region replica: ARN of the source instance (null for the primary)
  replicate_source_db = var.replicate_source_db
}

Multi-AZ (primary region): Synchronous replication to a standby in another AZ. Failover within minutes.

Cross-region replica (secondary region): Asynchronous replication. Used for disaster recovery, not automatic failover: the replica stays read-only until it is manually promoted.
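Because one module serves both roles, it has to set different arguments for a primary and a replica. The sketch below shows one way to do that with conditionals. The variables `allocated_storage`, `db_username`, `db_password`, and `kms_key_id` are assumptions for illustration; `is_replica` is the flag the calling configuration passes.

```hcl
# Hypothetical variant of modules/rds/main.tf
resource "aws_db_instance" "main" {
  identifier        = var.identifier
  instance_class    = "db.t3.micro"
  multi_az          = var.multi_az
  storage_encrypted = true

  # Primary only: a replica inherits engine, storage, and credentials
  # from its source, so these are left unset (null) for replicas.
  engine            = var.is_replica ? null : "mysql"
  allocated_storage = var.is_replica ? null : var.allocated_storage
  username          = var.is_replica ? null : var.db_username
  password          = var.is_replica ? null : var.db_password

  # The source must have automated backups enabled before it can have replicas.
  backup_retention_period = var.is_replica ? null : 7

  # Replica only: source ARN, plus a KMS key that lives in the replica's region.
  # An encrypted cross-region replica cannot reuse the source region's key.
  replicate_source_db = var.is_replica ? var.replicate_source_db : null
  kms_key_id          = var.is_replica ? var.kms_key_id : null
}
```

Splitting this into two separate modules is an equally reasonable design if the conditionals become hard to read.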

The Route53 Module (DNS Failover)

# modules/route53/main.tf
resource "aws_route53_health_check" "primary" {
  fqdn              = var.primary_alb_dns_name
  port              = 80
  type              = "HTTP"
  resource_path     = "/health"
  request_interval  = 30 # seconds between checks (the default, shown for clarity)
  failure_threshold = 3  # consecutive failures before marking unhealthy
}

resource "aws_route53_record" "primary" {
  zone_id         = var.hosted_zone_id
  name            = var.domain_name
  type            = "A"
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }

  # Alias to the primary region's ALB
  alias {
    name                   = var.primary_alb_dns_name
    zone_id                = var.primary_alb_zone_id
    evaluate_target_health = true
  }
}

How failover works:

Route53 health checkers request the ALB's /health endpoint every 30 seconds
After 3 consecutive failures (about 90 seconds), the health check marks the primary region as unhealthy
Route53 stops answering queries with the primary record and answers with the secondary instead
Detection (~90 seconds) plus the DNS TTL (60 seconds) adds up to roughly 150 seconds, hence the ~2-3 minute failover
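The excerpt shows only the PRIMARY record. Failover routing also needs a SECONDARY record with the same name and type. A sketch of what it could look like, mirroring the primary record; `var.secondary_alb_zone_id` is an assumed variable name, by analogy with `var.primary_alb_zone_id`:

```hcl
# Hypothetical addition to modules/route53/main.tf
resource "aws_route53_record" "secondary" {
  zone_id        = var.hosted_zone_id
  name           = var.domain_name
  type           = "A"
  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }

  # Alias to the secondary region's ALB. With evaluate_target_health enabled,
  # Route53 also takes the ALB's own target health into account.
  alias {
    name                   = var.secondary_alb_dns_name
    zone_id                = var.secondary_alb_zone_id
    evaluate_target_health = true
  }
}
```

Route53 answers with this record only while the primary's health check is failing.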

The Calling Configuration (Wiring Everything Together)

# envs/prod/main.tf
module "vpc_primary" {
  source = "../../modules/vpc"
  region = "us-east-1"
  # ... VPC config
}

module "alb_primary" {
  source     = "../../modules/alb"
  vpc_id     = module.vpc_primary.vpc_id
  subnet_ids = module.vpc_primary.public_subnet_ids
}

module "asg_primary" {
  source              = "../../modules/asg"
  target_group_arns   = [module.alb_primary.target_group_arn]
  launch_template_ami = var.primary_ami_id
}

module "rds_primary" {
  source   = "../../modules/rds"
  multi_az = true
}

module "rds_replica" {
  source = "../../modules/rds"

  # Create the replica in us-west-2. Assumes an aliased provider named
  # "secondary" is declared in provider.tf; without it, the replica
  # lands in the same region as the primary.
  providers = { aws = aws.secondary }

  is_replica          = true
  replicate_source_db = module.rds_primary.db_instance_arn
}

# The secondary VPC, ALB, and ASG modules mirror the primary ones (not shown).
module "route53" {
  source                 = "../../modules/route53"
  primary_alb_dns_name   = module.alb_primary.alb_dns_name
  secondary_alb_dns_name = module.alb_secondary.alb_dns_name
  # ... hosted zone ID, domain name, and ALB zone IDs
}
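The route53 module references `module.alb_secondary`, which the excerpt does not define, and nothing in it selects the second region. One common way to fill that gap is an aliased provider passed to each secondary module. This is a sketch under that assumption; the alias name `secondary` and the variable `secondary_ami_id` are illustrative.

```hcl
# provider.tf (assumed contents)
provider "aws" {
  region = "us-east-1"
}

provider "aws" {
  alias  = "secondary"
  region = "us-west-2"
}

# envs/prod/main.tf (assumed secondary-region counterparts)
module "vpc_secondary" {
  source    = "../../modules/vpc"
  providers = { aws = aws.secondary }
  region    = "us-west-2"
  # ... VPC config
}

module "alb_secondary" {
  source     = "../../modules/alb"
  providers  = { aws = aws.secondary }
  vpc_id     = module.vpc_secondary.vpc_id
  subnet_ids = module.vpc_secondary.public_subnet_ids
}

module "asg_secondary" {
  source            = "../../modules/asg"
  providers         = { aws = aws.secondary }
  target_group_arns = [module.alb_secondary.target_group_arn]

  # AMI IDs are region-specific, so the secondary region needs its own.
  launch_template_ami = var.secondary_ami_id
}
```

The `providers` map is what actually places a module's resources in us-west-2. A `region` input variable on its own only affects naming.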

The data flow:

VPC module outputs subnet IDs
ALB module uses those to place the load balancer
ASG module uses ALB's target group ARN to register instances
RDS replica uses primary's ARN to set up replication
Route53 uses both ALB DNS names for failover
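Each step in that flow depends on a module output. A sketch of the outputs the calling configuration relies on follows. The names `vpc_id`, `public_subnet_ids`, `target_group_arn`, `alb_dns_name`, and `db_instance_arn` come from the configuration above; `alb_zone_id` and the file layout are assumptions.

```hcl
# modules/vpc/outputs.tf
output "vpc_id" {
  value = aws_vpc.main.id
}

output "public_subnet_ids" {
  value = aws_subnet.public[*].id
}

# modules/alb/outputs.tf
output "target_group_arn" {
  value = aws_lb_target_group.web.arn
}

output "alb_dns_name" {
  value = aws_lb.web.dns_name
}

# Assumed output: the Route53 alias block needs the ALB's hosted zone ID.
output "alb_zone_id" {
  value = aws_lb.web.zone_id
}

# modules/rds/outputs.tf
output "db_instance_arn" {
  value = aws_db_instance.main.arn
}
```

Because modules talk to each other only through outputs and input variables, Terraform can infer the creation order from these references.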

The Deployment

$ terraform apply -auto-approve

Apply complete! Resources: 19 added, 1 changed, 0 destroyed.

Outputs:
alb_url = "http://alb-us-east-1-234339925.eu-north-1.elb.amazonaws.com"

The Result

What works:

ALB distributes traffic to healthy instances
ASG maintains 2-4 instances based on CPU
CloudWatch alarms trigger scaling at 70% CPU
RDS Multi-AZ protects against AZ failure
Cross-region replica asynchronously copies data to the secondary region

What happens during a region outage:

Health checks fail (90 seconds)
Route53 stops sending traffic to primary
Traffic shifts to secondary region
Users can reach the application again after a 2-3 minute interruption; database writes resume once the replica is promoted

What I Learned

Multi-AZ ≠ cross-region. Multi-AZ protects against AZ failure within a region. Cross-region replicas protect against full regional outages. You need both for true high availability.

Health checks are critical. Without them, Route53 has no way to know a region is down. The application behind every ALB needs to serve a /health endpoint.

Modules must be focused. The VPC module shouldn't know about the ALB. The ALB module shouldn't know about the ASG. Each module does one thing well.

The calling configuration is the "glue." All the wiring happens in envs/prod/main.tf. The modules stay generic and reusable.

The Bottom Line

Component              Protects Against    Failover Time
Multi-AZ RDS           AZ failure          Minutes
Cross-region replica   Regional outage     Manual promotion
Auto Scaling Group     Instance failure    Minutes
Route53 failover       Regional outage     2-3 minutes

This is what production-grade infrastructure looks like. No single points of failure. Automatic failover. Cross-region replication.

One terraform apply. Two regions. Minutes of failover instead of a regional outage.

URL: https://dev.to/tink-origami/building-a-3-tier-multi-region-high-availability-architecture-with-terraform-5eki