Building a Multi-Region API Platform at Scale - Murtaza Ahmed

Problem

SecurePKI, a SaaS platform serving enterprise clients in North America, Europe, and APAC, was running from a single US-East region. European customers experienced 400ms+ API latencies, violating SLAs. Two major outages in Q1 (totaling 4 hours of downtime) triggered contract penalty clauses and threatened client renewals worth ₹40L ARR.

Architecture

Deployed active-active API clusters in us-east-1, eu-west-1, and ap-southeast-1. Route 53 latency-based routing directs users to the nearest region. Each region runs an identical Go API stack behind an ALB, with a regional Redis cluster for caching and session state. PostgreSQL uses a primary-replica setup with cross-region async replication. Static assets served via CloudFront. All infrastructure managed with Terraform modules.

Infrastructure Layout

graph TD Route53["Route 53 Latency Routing"] --> US Route53 --> EU Route53 --> AP subgraph US["us-east-1"] ALB1["ALB"] --> API1["API"] API1 --> Redis1[("Redis")] API1 --> PG1[("PostgreSQL Primary")] end subgraph EU["eu-west-1"] ALB2["ALB"] --> API2["API"] API2 --> Redis2[("Redis")] API2 --> PG2[("PostgreSQL Replica")] end subgraph AP["ap-southeast-1"] ALB3["ALB"] --> API3["API"] API3 --> Redis3[("Redis")] API3 --> PG3[("PostgreSQL Replica")] end PG1 -.->|Async Replication| PG2 PG1 -.->|Async Replication| PG3

Key Decisions

Chose active-active over active-passive. The latency SLAs required serving from the nearest region, not just failing over

Used async replication for PostgreSQL instead of synchronous. Accepted eventual consistency (sub-second lag) to avoid cross-region write latency penalties

Implemented a conflict resolution strategy using last-writer-wins with vector clocks for the small subset of data that could be written from multiple regions

Chose Terraform over Pulumi. The ops team had existing Terraform expertise and the module ecosystem was more mature

Deployed Redis Cluster per region rather than a global cache. Network hops to a central cache would negate the latency gains

Results

p99 API latency reduced from 400ms to 120ms for all regions

Achieved 99.9% uptime over 12 months (vs. 99.5% prior year)

Zero SLA penalty events since deployment

Retained all at-risk enterprise contracts (₹40L ARR preserved)

Infrastructure cost increased only 15% despite tripling regional presence. Optimized with spot instances and reserved capacity

Lessons & Trade-offs

Multi-region is not just infrastructure. Application code must be region-aware for data locality and conflict handling

Async replication lag is manageable but must be visible. We built a real-time lag monitoring dashboard that alerts at 500ms

Terraform state management across regions requires careful design. We used a dedicated S3 backend per region with a global state orchestrator

Load testing must simulate realistic geographic distribution, not just throughput. Latency profiles differ dramatically by region