URL: https://dev.to/tink-origami/building-a-3-tier-multi-region-high-availability-architecture-with-terraform-5eki

Building a 3-Tier Multi-Region High Availability Architecture with Terraform


From Single Region to Production-Grade Global Infrastructure


Day 27 of the 30-Day Terraform Challenge — and today I built something that can survive an entire AWS region going offline.

Yesterday I built a scalable web app in one region. Today I built infrastructure that spans two regions, with automatic failover, cross-region database replication, and zero single points of failure.

This is what production-grade looks like.


The Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Route53 Failover DNS                     │
│                       app.example.com                       │
└──────────────┬─────────────────────────────┬────────────────┘
               │                             │
┌──────────────▼─────────────────────────────▼────────────────┐
│                                                             │
│  PRIMARY REGION (us-east-1)    SECONDARY REGION (us-west-2) │
│                                                             │
│       ┌─────────────┐               ┌─────────────┐         │
│       │     ALB     │               │     ALB     │         │
│       └──────┬──────┘               └──────┬──────┘         │
│              │                             │                │
│       ┌──────▼──────┐               ┌──────▼──────┐         │
│       │     ASG     │               │     ASG     │         │
│       │  (2-4 EC2)  │               │  (2-4 EC2)  │         │
│       └──────┬──────┘               └──────┬──────┘         │
│              │                             │                │
│       ┌──────▼──────┐               ┌──────▼──────┐         │
│       │ RDS Multi-AZ│──────────────►│ RDS Replica │         │
│       │  (Primary)  │  Replication  │ (Read-only) │         │
│       └─────────────┘               └─────────────┘         │
└─────────────────────────────────────────────────────────────┘

What's happening:

  • Route53 health checks monitor both regions
  • If primary fails, DNS automatically routes to secondary
  • RDS cross-region replica asynchronously copies data to the secondary region
  • Each region has its own VPC, ALB, and Auto Scaling Group

The Project Structure

day27-multi-region-ha/
├── modules/
│   ├── vpc/        # VPC, subnets, NAT gateways
│   ├── alb/        # Load balancer, target group
│   ├── asg/        # Auto Scaling, CloudWatch alarms
│   ├── rds/        # RDS instance with Multi-AZ and replicas
│   └── route53/    # DNS failover routing
├── envs/
│   └── prod/
│       ├── main.tf
│       ├── variables.tf
│       └── terraform.tfvars
├── backend.tf
└── provider.tf

Five modules, each with a single responsibility. The VPC module doesn't know about the ALB. The ALB module doesn't know about the ASG. The calling configuration wires them together.
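Modules stay decoupled by exposing only outputs. As a rough sketch, the VPC module's outputs might look like the following. The resource names match the VPC module shown in the next section, and vpc_id and public_subnet_ids are the names the calling configuration reads later; private_subnet_ids and the file name are assumptions for illustration.

```hcl
# modules/vpc/outputs.tf (assumed file name)

output "vpc_id" {
  description = "ID of the VPC, consumed by the ALB module"
  value       = aws_vpc.main.id
}

output "public_subnet_ids" {
  description = "Public subnet IDs, used to place the load balancer"
  value       = aws_subnet.public[*].id
}

# Hypothetical output: the excerpt never shows which subnets the ASG uses
output "private_subnet_ids" {
  description = "Private subnet IDs, a likely home for the EC2 instances"
  value       = aws_subnet.private[*].id
}
```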


The VPC Module (Network Foundation)

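# modules/vpc/main.tf
resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_support   = true
  enable_dns_hostnames = true
}

# Public subnets, one per availability zone
resource "aws_subnet" "public" {
  count                   = length(var.public_subnet_cidrs)
  vpc_id                  = aws_vpc.main.id
  cidr_block              = var.public_subnet_cidrs[count.index]
  availability_zone       = var.availability_zones[count.index]
  map_public_ip_on_launch = true
}

# Private subnets, one per availability zone
resource "aws_subnet" "private" {
  count             = length(var.private_subnet_cidrs)
  vpc_id            = aws_vpc.main.id
  cidr_block        = var.private_subnet_cidrs[count.index]
  availability_zone = var.availability_zones[count.index]
}

# Elastic IP for each NAT gateway. The original excerpt references
# aws_eip.nat without showing it; this definition is an assumption.
resource "aws_eip" "nat" {
  count  = length(var.public_subnet_cidrs)
  domain = "vpc" # AWS provider v5+; older versions use vpc = true
}

# One NAT gateway per public subnet
resource "aws_nat_gateway" "main" {
  count         = length(var.public_subnet_cidrs)
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
}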

Why two subnet types:

  • Public subnets → ALB (must be reachable from the internet)
  • Private subnets → EC2 instances (no direct internet access)
  • NAT Gateways → allow instances to download packages while remaining private
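The subnets only behave as public or private once routing is in place, and the excerpt above doesn't show it. Here is a minimal sketch of that routing, assuming one private subnet per public subnet (so each private subnet can use the NAT gateway in its own AZ). The internet gateway, the route table names, and the file name are illustrative, not taken from the original module.

```hcl
# modules/vpc/routes.tf (hypothetical file name)

# Internet gateway: gives the public subnets a path to the internet
resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

# Public route table: default route goes straight to the internet gateway
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table_association" "public" {
  count          = length(var.public_subnet_cidrs)
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

# Private route tables: one per AZ, default route goes through that AZ's NAT gateway
resource "aws_route_table" "private" {
  count  = length(var.private_subnet_cidrs)
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main[count.index].id
  }
}

resource "aws_route_table_association" "private" {
  count          = length(var.private_subnet_cidrs)
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private[count.index].id
}
```

Keeping one private route table per AZ means losing a single NAT gateway only affects outbound traffic from that AZ.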

The ALB Module (Traffic Distribution)

# modules/alb/main.tf
resource "aws_lb" "web" {
  name               = "${var.name}-alb-${var.region}"
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = var.subnet_ids
}

resource "aws_lb_target_group" "web" {
  name     = "${var.name}-tg-${var.region}"
  port     = 80
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path                = "/health"
    interval            = 30
    healthy_threshold   = 2
    unhealthy_threshold = 2
  }
}

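The health check endpoint (/health) is critical: the target group uses it to decide which instances receive traffic, and the Route53 health check probes the same path to determine if the region is healthy.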
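The excerpt leaves out the listener that connects the load balancer to its target group, and the outputs other modules consume. A minimal sketch, assuming plain HTTP on port 80 to match the target group: target_group_arn and alb_dns_name are the output names the calling configuration uses later, while alb_zone_id is a guess at what feeds the Route53 alias record.

```hcl
# Listener: forwards incoming HTTP traffic to the target group
resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.web.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.web.arn
  }
}

# Consumed by the ASG module to register instances
output "target_group_arn" {
  value = aws_lb_target_group.web.arn
}

# Consumed by the Route53 module for health checks and alias records
output "alb_dns_name" {
  value = aws_lb.web.dns_name
}

# Assumed output name: the hosted zone ID an alias record needs
output "alb_zone_id" {
  value = aws_lb.web.zone_id
}
```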


The ASG Module (Auto Scaling)

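# modules/asg/main.tf
resource "aws_autoscaling_group" "web" {
  min_size          = var.min_size
  max_size          = var.max_size
  desired_capacity  = var.desired_capacity
  target_group_arns = var.target_group_arns # registers instances with the ALB
  health_check_type = "ELB"                 # replace instances the ALB marks unhealthy
  # launch_template and vpc_zone_identifier are omitted from this excerpt
}

resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  alarm_name          = "web-cpu-high-${var.environment}-${var.region}"
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 70
  period              = 120 # seconds; assumed value, not from the original
  evaluation_periods  = 2   # assumed value, not from the original
  alarm_actions       = [aws_autoscaling_policy.scale_out.arn]

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.web.name
  }
}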

The connection: target_group_arns links the ASG to the ALB. Without this, instances launch but never receive traffic.
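The alarm's alarm_actions points at aws_autoscaling_policy.scale_out, which the excerpt doesn't show. One plausible shape for it is a simple scaling policy that adds a single instance each time the alarm fires. The policy name, adjustment size, and cooldown below are assumptions rather than the author's values.

```hcl
# Scale-out policy triggered by the cpu_high alarm
resource "aws_autoscaling_policy" "scale_out" {
  name                   = "web-scale-out-${var.environment}-${var.region}"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "SimpleScaling"
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = 1   # add one instance per alarm (assumed)
  cooldown               = 300 # seconds to wait before scaling again (assumed)
}
```

The ASG's max_size still caps the total, so repeated alarms cannot push the group past the 2-4 instance range.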


The RDS Module (Database Tier)

# modules/rds/main.tf
resource "aws_db_instance" "main" {
  identifier        = var.identifier
  engine            = "mysql"
  instance_class    = "db.t3.micro"
  multi_az          = var.multi_az
  storage_encrypted = true

  # For cross-region replica
  replicate_source_db = var.replicate_source_db
}

Multi-AZ (primary region): Synchronous replication to a standby in another AZ. Failover within minutes.

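Cross-region replica (secondary region): Asynchronous replication, so the replica can lag slightly behind the primary. It stays read-only until it is manually promoted, which makes it a disaster recovery tool rather than an automatic database failover.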
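The same module serves as both primary and replica because replicate_source_db can be left unset. A sketch of how the module's inputs and output might be declared follows. The defaults and file name are assumptions; db_instance_arn is the output name the calling configuration reads later. In Terraform, a null argument is treated as if it were omitted, so the primary simply passes nothing.

```hcl
# modules/rds/variables.tf (sketch; defaults are assumptions)

variable "identifier" {
  description = "Name of the DB instance"
  type        = string
}

variable "multi_az" {
  description = "Whether to keep a synchronous standby in another AZ"
  type        = bool
  default     = false
}

variable "replicate_source_db" {
  description = "ARN of the source instance for a cross-region replica; null for a standalone primary"
  type        = string
  default     = null
}

# The replica module call needs the primary's ARN, not just its identifier,
# because the source lives in a different region
output "db_instance_arn" {
  value = aws_db_instance.main.arn
}
```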


The Route53 Module (DNS Failover)

# modules/route53/main.tf
resource "aws_route53_health_check" "primary" {
  fqdn              = var.primary_alb_dns_name
  port              = 80
  type              = "HTTP"
  resource_path     = "/health"
  request_interval  = 30 # seconds between checks (the default)
  failure_threshold = 3  # consecutive failures before marking unhealthy
}

resource "aws_route53_record" "primary" {
  zone_id         = var.hosted_zone_id
  name            = var.domain_name
  type            = "A"
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = var.primary_alb_dns_name
    zone_id                = var.primary_alb_zone_id
    evaluate_target_health = true
  }
}

How failover works:

  1. Route53 health checks ping the ALB's /health endpoint every 30 seconds
  2. After 3 failures (90 seconds), health check marks region as unhealthy
  3. Route53 stops sending traffic to primary, starts sending to secondary
  4. Detection time (about 90 seconds) + DNS TTL (60 seconds) = ~2-3 minute failover
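Failover routing needs a matching SECONDARY record for Route53 to fall back to, and the excerpt only shows the PRIMARY one. A minimal sketch of the counterpart, mirroring the primary record: var.secondary_alb_dns_name appears in the calling configuration, while var.secondary_alb_zone_id is a hypothetical variable named by analogy with the primary.

```hcl
# Secondary failover record: served only while the primary is unhealthy
resource "aws_route53_record" "secondary" {
  zone_id        = var.hosted_zone_id
  name           = var.domain_name
  type           = "A"
  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = var.secondary_alb_dns_name
    zone_id                = var.secondary_alb_zone_id # hypothetical variable
    evaluate_target_health = true
  }
}
```

Both records share the same name and type; the set_identifier and failover type are what distinguish them.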

The Calling Configuration (Wiring Everything Together)

# envs/prod/main.tf
module "vpc_primary" {
  source = "../../modules/vpc"
  region = "us-east-1"
  # ... VPC config
}

module "alb_primary" {
  source     = "../../modules/alb"
  vpc_id     = module.vpc_primary.vpc_id
  subnet_ids = module.vpc_primary.public_subnet_ids
}

module "asg_primary" {
  source              = "../../modules/asg"
  target_group_arns   = [module.alb_primary.target_group_arn]
  launch_template_ami = var.primary_ami_id
}

module "rds_primary" {
  source   = "../../modules/rds"
  multi_az = true
}

module "rds_replica" {
  source              = "../../modules/rds"
  is_replica          = true
  replicate_source_db = module.rds_primary.db_instance_arn
}

# vpc_secondary, alb_secondary, and asg_secondary follow the same pattern
# as the primary modules and are omitted from this excerpt.

module "route53" {
  source                 = "../../modules/route53"
  primary_alb_dns_name   = module.alb_primary.alb_dns_name
  secondary_alb_dns_name = module.alb_secondary.alb_dns_name
}

The data flow:

  1. VPC module outputs subnet IDs
  2. ALB module uses those to place the load balancer
  3. ASG module uses ALB's target group ARN to register instances
  4. RDS replica uses primary's ARN to set up replication
  5. Route53 uses both ALB DNS names for failover
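One detail the wiring above leaves implicit: a region variable passed to a module does not decide where its resources are created. The AWS provider configuration does. A common way to target two regions from one configuration is a pair of aliased providers handed to each module through the providers argument. The sketch below assumes alias names primary and secondary, which are not from the original provider.tf.

```hcl
# provider.tf (sketch; alias names are assumptions)
provider "aws" {
  alias  = "primary"
  region = "us-east-1"
}

provider "aws" {
  alias  = "secondary"
  region = "us-west-2"
}

# envs/prod/main.tf (sketch): each module call names the provider it should use
module "vpc_primary" {
  source = "../../modules/vpc"

  providers = {
    aws = aws.primary
  }
  # ... VPC config
}

module "vpc_secondary" {
  source = "../../modules/vpc"

  providers = {
    aws = aws.secondary
  }
  # ... VPC config
}
```

The same providers block would go on the ALB, ASG, and RDS module calls for each region, with the replica using aws.secondary.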

The Deployment

$ terraform apply -auto-approve

Apply complete! Resources: 19 added, 1 changed, 0 destroyed.

Outputs:
alb_url = "http://alb-us-east-1-234339925.eu-north-1.elb.amazonaws.com"

Note that the hostname above ends in eu-north-1 even though the ALB is named for us-east-1. An ALB's DNS name reflects the region of the provider that created it, so check the provider configuration if the two disagree.

The Result

What works:

  • ALB distributes traffic to healthy instances
  • ASG maintains 2-4 instances based on CPU
  • CloudWatch alarms trigger scaling at 70% CPU
  • RDS Multi-AZ protects against AZ failure
  • Cross-region replica asynchronously copies data to the secondary region

What happens during a region outage:

  1. Health checks fail (90 seconds)
  2. Route53 stops sending traffic to primary
  3. Traffic shifts to secondary region
  4. Users continue accessing the application with minimal interruption; database writes resume once the replica is promoted

What I Learned

Multi-AZ ≠ cross-region. Multi-AZ protects against AZ failure within a region. Cross-region replicas protect against full regional outages. You need both for true high availability.

Health checks are critical. Without them, Route53 has no way to know a region is down. The application behind every ALB needs to serve a /health endpoint.

Modules must be focused. The VPC module shouldn't know about the ALB. The ALB module shouldn't know about the ASG. Each module does one thing well.

The calling configuration is the "glue." All the wiring happens in envs/prod/main.tf. The modules stay generic and reusable.


The Bottom Line

Component            | Protects Against | Failover Time
---------------------|------------------|-----------------
Multi-AZ RDS         | AZ failure       | Minutes
Cross-region replica | Regional outage  | Manual promotion
Auto Scaling Group   | Instance failure | Minutes
Route53 failover     | Regional outage  | 2-3 minutes

This is what production-grade infrastructure looks like. No single region as a point of failure. Automatic DNS failover. Cross-region replication.

One terraform apply. Two regions. Failover in minutes.