Cutting Cloud Infrastructure Costs by 35% - Murtaza Ahmed

Problem

A Series C startup's AWS bill had grown from ₹8L/month to ₹15L/month in 18 months with no corresponding increase in traffic. Engineering had over-provisioned during a rapid scaling phase and never right-sized afterwards. The CFO flagged infrastructure as the fastest-growing expense line item, one that threatened to shorten runway by 6 months.

Architecture

Conducted a comprehensive cost audit using AWS Cost Explorer, Trusted Advisor, and custom CloudWatch metrics. Identified four categories of waste, listed with their share of the total identified savings: over-provisioned EC2 instances (40%), suboptimal RDS configurations (25%), missing reserved instance coverage (20%), and unused EBS volumes and snapshots (15%). Implemented the changes in three phases over 8 weeks, using Terraform to keep them reproducible.

Cost Breakdown: Before & After

Category          Before (₹)       After (₹)      Savings
─────────────────────────────────────────────────────────
EC2 Compute        ₹6,80,000      ₹4,08,000       40%
RDS Databases      ₹3,75,000      ₹2,62,500       30%
EBS Storage        ₹2,25,000      ₹1,57,500       30%
Data Transfer      ₹1,25,000      ₹90,000         28%
Other              ₹95,000        ₹57,000         40%
─────────────────────────────────────────────────────────
TOTAL             ₹15,00,000     ₹9,75,000       35%
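
The percentages in the table can be re-derived from the before and after figures. The snippet below is only an arithmetic check on the numbers shown above; the names `COSTS`, `savings_pct`, and `summarize` are illustrative and not part of any tooling used in the project.

```python
# Monthly figures in rupees, copied from the table above.
COSTS = {
    "EC2 Compute": (680_000, 408_000),
    "RDS Databases": (375_000, 262_500),
    "EBS Storage": (225_000, 157_500),
    "Data Transfer": (125_000, 90_000),
    "Other": (95_000, 57_000),
}


def savings_pct(before, after):
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100


def summarize(costs):
    """Return (total_before, total_after, total_savings_pct)."""
    total_before = sum(before for before, _ in costs.values())
    total_after = sum(after for _, after in costs.values())
    return total_before, total_after, savings_pct(total_before, total_after)


if __name__ == "__main__":
    for name, (before, after) in COSTS.items():
        print(f"{name:<15} {savings_pct(before, after):5.1f}%")
    total_before, total_after, pct = summarize(COSTS)
    print(f"{'TOTAL':<15} {pct:5.1f}%")
    # Twelve months of the monthly difference gives the annual figure.
    print(f"Annualized savings: {(total_before - total_after) * 12:,}")
```

The per-category rows and the total agree with the table, and twelve times the monthly difference of ₹5.25L gives ₹63L per year.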

Implementation Phases

Phase 1 (Weeks 1-2): Audit and cleanup. Deleted unused resources, snapshots, and orphaned volumes. Immediate ₹1.5L/month savings.
Phase 2 (Weeks 3-5): Right-sizing and Graviton migration. Resized instances based on actual utilization metrics.
Phase 3 (Weeks 6-8): Reserved Instances and Spot. Purchased RIs for baseline, migrated batch to Spot.
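
The Phase 1 search for orphaned volumes can be sketched roughly as follows. This is a minimal illustration, not the audit tooling actually used. It assumes volume records shaped like the entries under `Volumes` in an EC2 DescribeVolumes response (for example, as returned by boto3's `describe_volumes`). The function names and the 14-day `min_age_days` threshold are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_orphaned_volumes(volumes, now, min_age_days=14):
    """Return volumes that are unattached and older than `min_age_days`.

    A volume in state "available" is not attached to any instance.
    Age since creation is only a rough proxy for "unused": it does not
    tell us when the volume was last detached.
    """
    cutoff = now - timedelta(days=min_age_days)
    orphaned = []
    for vol in volumes:
        unattached = vol["State"] == "available" and not vol.get("Attachments")
        if unattached and vol["CreateTime"] <= cutoff:
            orphaned.append(vol)
    return orphaned


def total_size_gib(volumes):
    """Sum of volume sizes; EC2 reports `Size` in GiB."""
    return sum(vol["Size"] for vol in volumes)


if __name__ == "__main__":
    now = datetime(2024, 1, 31, tzinfo=timezone.utc)
    # Made-up sample records for illustration only.
    sample = [
        {"VolumeId": "vol-a", "State": "available", "Size": 100,
         "Attachments": [], "CreateTime": now - timedelta(days=90)},
        {"VolumeId": "vol-b", "State": "in-use", "Size": 50,
         "Attachments": [{"InstanceId": "i-1"}],
         "CreateTime": now - timedelta(days=200)},
        {"VolumeId": "vol-c", "State": "available", "Size": 20,
         "Attachments": [], "CreateTime": now - timedelta(days=2)},
    ]
    candidates = find_orphaned_volumes(sample, now)
    print([v["VolumeId"] for v in candidates], total_size_gib(candidates))
```

Treating the result as a candidate list for review, rather than deleting directly, keeps recently created or intentionally detached volumes safe.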

Key Decisions

Migrated compute-heavy workloads to Graviton (ARM) instances, which were 20% cheaper with equivalent or better performance for Go and Python services

Implemented aggressive auto-scaling policies based on actual CPU and memory metrics rather than the flat over-provisioned baselines

Moved non-critical batch processing to Spot Instances with a fallback to on-demand. Achieved 70% savings on batch compute

Consolidated 12 underutilized RDS instances into 4 right-sized instances with connection pooling via PgBouncer

Purchased 1-year Reserved Instances for baseline compute after establishing stable usage patterns over 3 months
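
One way to reason about how much baseline compute to reserve is sketched below. It is an assumption-laden illustration, not the sizing method used in the project: the function names, the 10% quantile, and the hourly rates are all hypothetical placeholders (the rates are not AWS prices). The idea is to reserve only the instance count that is in use during nearly every hour, and to leave the rest to on-demand or Spot capacity.

```python
import math


def baseline_instance_count(hourly_counts, quantile=0.10):
    """Instance count that is met or exceeded in roughly (1 - quantile) of hours."""
    if not hourly_counts:
        raise ValueError("hourly_counts must not be empty")
    ordered = sorted(hourly_counts)
    index = math.floor(quantile * (len(ordered) - 1))
    return ordered[index]


def blended_cost(hourly_counts, reserved, on_demand_rate, reserved_rate):
    """Total cost when `reserved` instances are paid for every hour
    (whether used or not) and any excess runs at the on-demand rate."""
    cost = 0.0
    for count in hourly_counts:
        cost += reserved * reserved_rate
        cost += max(count - reserved, 0) * on_demand_rate
    return cost


if __name__ == "__main__":
    # Made-up hourly instance counts; real input would span months.
    usage = [4, 4, 5, 6, 8, 10, 6, 5, 4, 4]
    reserved = baseline_instance_count(usage)
    all_on_demand = blended_cost(usage, 0, on_demand_rate=1.0, reserved_rate=0.6)
    with_reserved = blended_cost(usage, reserved, on_demand_rate=1.0, reserved_rate=0.6)
    print(reserved, all_on_demand, with_reserved)
```

Because reserved capacity is billed whether or not it is used, reserving above the stable baseline can cost more than staying on-demand, which is why several months of usage data matter before committing.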

Results

Monthly AWS bill reduced from ₹15L to ₹9.75L (35% reduction)

Annual savings of ₹63L - extended runway by 8 months

API performance unchanged - p99 latency remained at 120ms

Auto-scaling now handles 3x traffic spikes without manual intervention

Infrastructure-as-code coverage increased from 60% to 95%

Lessons & Trade-offs

Cost optimization is an ongoing process, not a one-time project. We set up monthly cost review meetings and Slack alerts for anomalies

Graviton migration is a nearly free price-performance gain, but test thoroughly: some native dependencies don't have ARM builds

Spot Instances are excellent for fault-tolerant workloads but require proper interruption handling. We lost 2 batch runs before implementing checkpointing

Reserved Instances require commitment. Don't purchase until you have 3+ months of stable usage data
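
The checkpointing mentioned in the Spot lesson above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the project: `run_batch`, `load_checkpoint`, `save_checkpoint`, and the `should_stop` hook are hypothetical names, and the checkpoint is a local JSON file for simplicity. On a real Spot instance the checkpoint would need to live on storage that survives the instance, and `should_stop` could check for a Spot interruption notice.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the index of the next unprocessed item (0 if no checkpoint)."""
    try:
        with open(path) as f:
            return json.load(f)["next_index"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path, next_index):
    """Write to a temp file, then rename, so an interruption mid-write
    cannot leave a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp_path, path)


def run_batch(items, process, checkpoint_path, should_stop=lambda: False):
    """Process items in order, resuming from the last checkpoint.

    Returns True if the batch finished, False if it stopped early.
    """
    start = load_checkpoint(checkpoint_path)
    for i in range(start, len(items)):
        if should_stop():
            return False
        process(items[i])
        save_checkpoint(checkpoint_path, i + 1)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        checkpoint = os.path.join(workdir, "batch.json")
        done = run_batch(["a", "b", "c"], print, checkpoint)
        print("finished:", done)
```

If the instance is lost after an item is processed but before its checkpoint is saved, that item runs again on resume, so `process` should be idempotent.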