Every cost review we run ends up at the same place: a third of the instance fleet is two sizes too big, and nobody wants to touch it because the last person who tried caused a 2am page. Rightsizing is not a spreadsheet exercise. It is a change-management exercise that happens to save money.
We measure before we touch anything
We pull at least 14 days of data before forming an opinion. The CloudWatch agent gives us memory utilization, which the default EC2 metrics do not include, and AWS Compute Optimizer factors it in once the agent is reporting. The headline number we care about is the p95 of CPU and memory over the window, not the average. An instance that sits at 8% average CPU but spikes to 90% at the daily batch run is not oversized - it is correctly sized for the spike.
- p95 CPU and p95 memory, separately, over 14-30 days
- Network throughput, because some m-family boxes are really there for bandwidth
- Whether the workload is stateful - a stateless web tier has a very different risk profile from a database
- The actual SLO, not the imagined one
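To make the average-versus-p95 point concrete, here is a minimal sketch using a nearest-rank percentile. The readings are invented for illustration (an instance idling at 8% with a two-hour daily batch run at 90%), and `percentile` is a helper written for this example, not part of any AWS tooling.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct% of all samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical hourly CPU readings for one day:
# 22 idle hours at 8%, then a 2-hour batch run at 90%.
hourly_cpu = [8.0] * 22 + [90.0] * 2

# Repeat the same day across a 14-day window.
window = hourly_cpu * 14

average = sum(window) / len(window)
p95 = percentile(window, 95)

print(f"average={average:.1f}% p95={p95:.1f}%")
# prints: average=14.8% p95=90.0%
```

One caveat the arithmetic exposes: p95 only sees load that occupies more than 5% of the window. If the same batch ran for one hour a day (about 4% of samples), p95 would report 8% and miss it, so it is worth looking at the maximum alongside p95.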
One size down, then wait
When we cut, we cut a single size and hold for a full business cycle before the next move. A jump from m5.4xlarge straight to m5.large looks great on the bill and terrible at the next traffic peak. Stepping down one notch (4xlarge to 2xlarge) typically recovers 40-50% of the waste with almost none of the risk, and we can repeat it next month if the numbers still say so.
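As a rough illustration of the one-step rule, the sketch below turns a tempting big jump into a sequence of single-size moves. The size ladder is a simplified assumption (real families differ in which sizes they offer), and both function names are hypothetical.

```python
# Simplified size ladder. This is an assumption for the sketch:
# check it against the sizes the family in question actually offers.
SIZE_LADDER = ["large", "xlarge", "2xlarge", "4xlarge", "8xlarge"]

def one_size_down(instance_type):
    """Return the next size down in the same family, or None at the bottom."""
    family, size = instance_type.split(".", 1)
    if size not in SIZE_LADDER:
        raise ValueError(f"size {size!r} is not on the ladder")
    idx = SIZE_LADDER.index(size)
    if idx == 0:
        return None  # already the smallest size on this ladder
    return f"{family}.{SIZE_LADDER[idx - 1]}"

def step_plan(current, target):
    """List every single-size step from current down to target.
    Each entry is meant to be held for a full business cycle."""
    steps = []
    while current != target:
        current = one_size_down(current)
        if current is None:
            raise ValueError("target is not below the starting size")
        steps.append(current)
    return steps

print(step_plan("m5.4xlarge", "m5.large"))
# prints: ['m5.2xlarge', 'm5.xlarge', 'm5.large']
```

Read this way, the m5.4xlarge to m5.large jump is three separate changes, each with its own observation period, not one.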
Rightsizing that causes one incident sets the whole programme back six months. Slow and boring beats clever and broken.
Switch families, not just sizes
Half the easy wins are not about getting smaller; they are about getting newer. Moving a stable workload from m5 to m7g (Graviton) often lands 15-20% cheaper per vCPU with better performance, assuming the build is arm64-clean. We test on a canary node first - native dependencies are where this bites.
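The per-vCPU comparison is simple arithmetic, sketched here with placeholder hourly prices. The figures are assumptions for illustration, not quotes; substitute current pricing for your region and purchase model. `per_vcpu_saving` is a hypothetical helper.

```python
def per_vcpu_saving(old_hourly, old_vcpus, new_hourly, new_vcpus):
    """Fractional saving per vCPU when moving from the old type to the new one."""
    old_rate = old_hourly / old_vcpus
    new_rate = new_hourly / new_vcpus
    return (old_rate - new_rate) / old_rate

# Placeholder prices (assumed): a 2-vCPU m5-class instance at $0.096/hour
# versus a 2-vCPU m7g-class instance at $0.0816/hour.
saving = per_vcpu_saving(0.096, 2, 0.0816, 2)

print(f"saving per vCPU: {saving:.0%}")
# prints: saving per vCPU: 15%
```

The price delta is only half the picture: an m5 vCPU is a hyperthread while a Graviton vCPU is a full core, which is one reason the canary test matters more than the calculation.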
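To see what Compute Optimizer currently flags as over-provisioned, along with the first instance type it recommends for each:

    aws compute-optimizer get-ec2-instance-recommendations \
      --query 'instanceRecommendations[?finding==`OVER_PROVISIONED`].[instanceArn,currentInstanceType,recommendationOptions[0].instanceType]' \
      --output table

The rollback plan is the deliverable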
We never ship a rightsizing change without a documented way back. For an Auto Scaling group that means keeping the old launch template version. For a managed database it means a tested resize window and a snapshot taken five minutes before. Done this way, a typical first pass on a neglected account recovers 20-35% of compute spend over a quarter, and the on-call team stops flinching when they see our calendar invite.
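One way to enforce the "no change without a way back" rule is a pre-flight check that refuses to proceed until the rollback pieces exist. This is a sketch under assumptions: the `RightsizingChange` record, its field names, and the `"asg"`/`"database"` labels are all invented for illustration and do not correspond to any AWS API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RightsizingChange:
    """Hypothetical record of a planned change and its rollback evidence."""
    resource: str
    kind: str  # "asg" or "database" (labels invented for this sketch)
    previous_launch_template_version: Optional[str] = None
    snapshot_id: Optional[str] = None
    resize_window_tested: bool = False

def missing_rollback_items(change):
    """Return the rollback prerequisites still missing; empty means good to go."""
    missing = []
    if change.kind == "asg":
        if not change.previous_launch_template_version:
            missing.append("previous launch template version")
    elif change.kind == "database":
        if not change.snapshot_id:
            missing.append("pre-change snapshot")
        if not change.resize_window_tested:
            missing.append("tested resize window")
    else:
        # Unknown resource kinds have no documented way back yet.
        missing.append(f"rollback procedure for kind {change.kind!r}")
    return missing

change = RightsizingChange(resource="web-asg", kind="asg")
print(missing_rollback_items(change))
# prints: ['previous launch template version']
```

A check like this could sit in the change ticket template or a CI step; the point is that the rollback evidence is collected before the resize, not reconstructed during an incident.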