Rightsizing compute without breaking things

Every cost review we run ends up at the same place: a third of the instance fleet is two sizes too big, and nobody wants to touch it because the last person who tried caused a 2am page. Rightsizing is not a spreadsheet exercise. It is a change-management exercise that happens to save money.

We measure before we touch anything

We pull at least 14 days of data before proposing any change. AWS Compute Optimizer and the CloudWatch agent give us memory utilisation, which the default EC2 metrics do not. The headline number we care about is the p95 of CPU and memory over the window, not the average. An instance that sits at 8% average CPU but spikes to 90% during the daily batch run is not oversized - it is correctly sized for the spike.

p95 CPU and p95 memory, separately, over 14-30 days
Network throughput, because some m-family boxes are really there for bandwidth
Whether the workload is stateful - a stateless web tier is a totally different risk profile from a database
The actual SLO, not the imagined one
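
To make the gap between average and p95 concrete, here is a minimal sketch in plain bash and awk. It assumes the utilisation samples have already been exported one per line (from CloudWatch, for example). The function names and the sample series are illustrative, and the percentile uses the nearest-rank definition, which is only one of several in common use.

```bash
#!/usr/bin/env bash
# Sketch: compare mean and p95 for utilisation samples, one value per line.
# Function names and the sample series are illustrative assumptions.

# Arithmetic mean of the numbers on stdin, to one decimal place.
mean() {
  awk '{ sum += $1; n++ } END { if (n) printf "%.1f\n", sum / n }'
}

# Nearest-rank percentile: sort ascending, take the value at ceil(p/100 * n).
percentile() {
  local p="$1"
  sort -n | awk -v p="$p" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit
      rank = p * NR / 100
      idx = int(rank)
      if (idx < rank) idx++    # round up to the next whole rank
      if (idx < 1) idx = 1
      print v[idx]
    }'
}

# Hypothetical day of hourly CPU samples: idle except for a two-hour batch run.
samples() {
  local i
  for ((i = 0; i < 22; i++)); do echo 5; done
  echo 88
  echo 90
}

echo "mean: $(samples | mean)%"
echo "p95:  $(samples | percentile 95)%"
```

On this made-up series the mean is 12.0% while the p95 is 88%, which is the difference between "shrink it" and "leave it alone".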

One size down, then wait

When we cut, we cut a single size and hold for a full business cycle before the next move. A jump from m5.4xlarge straight to m5.large looks great on the bill and terrible at the next traffic peak. Stepping down one notch (4xlarge to 2xlarge) typically recovers 40-50% of the waste with almost none of the risk, and we can repeat it next month if the numbers still say so.
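
The one-notch rule is simple enough to encode, so nobody proposes a multi-size jump by accident. The sketch below assumes a simplified EC2 size ladder; the function name and the ladder are illustrative, and real families skip some sizes or add others (12xlarge, 24xlarge, metal).

```bash
#!/usr/bin/env bash
# Sketch: given an instance type, print the type exactly one size smaller.
# SIZE_LADDER is an assumption: common EC2 sizes only, not every family offers all of them.

SIZE_LADDER=(large xlarge 2xlarge 4xlarge 8xlarge 16xlarge)

step_down() {
  local type="$1"
  local family="${type%%.*}"   # text before the first dot, e.g. m5
  local size="${type#*.}"      # text after the first dot, e.g. 4xlarge
  local i
  for i in "${!SIZE_LADDER[@]}"; do
    if [[ "${SIZE_LADDER[$i]}" == "$size" ]]; then
      if (( i == 0 )); then
        echo "$type is already the smallest size on the ladder" >&2
        return 1
      fi
      echo "${family}.${SIZE_LADDER[$((i - 1))]}"
      return 0
    fi
  done
  echo "unknown size: $size" >&2
  return 1
}

step_down m5.4xlarge   # prints m5.2xlarge
```

The function deliberately has no option to skip a size: a second step means a second call, ideally a full business cycle later.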

Rightsizing that causes one incident sets the whole programme back six months. Slow and boring beats clever and broken.

Switch families, not just sizes

Half the easy wins are not about getting smaller, they are about getting newer. Moving a stable workload from m5 to m7g (Graviton) often lands 15-20% cheaper per core with better performance, assuming the build is arm64-clean. We test on a canary node first - native dependencies are where this bites.

```bash
aws compute-optimizer get-ec2-instance-recommendations \
  --query 'instanceRecommendations[?finding==`OVER_PROVISIONED`].[instanceArn,currentInstanceType,recommendationOptions[0].instanceType]' \
  --output table
```
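
For the family switch, the per-core saving is worth computing rather than assuming. This is a rough sketch: the function name is hypothetical, and the prices in the example call are assumed on-demand figures for m5.large and m7g.large, so substitute current prices for your region before relying on the result.

```bash
#!/usr/bin/env bash
# Sketch: percentage saved per vCPU when moving between two instance types.
# Arguments: current hourly price, current vCPUs, candidate hourly price, candidate vCPUs.
# The prices in the example call are assumed figures, not a quote.

per_vcpu_saving() {
  local cur_price="$1" cur_vcpus="$2" cand_price="$3" cand_vcpus="$4"
  awk -v cp="$cur_price" -v cv="$cur_vcpus" -v np="$cand_price" -v nv="$cand_vcpus" '
    BEGIN {
      if (cp <= 0 || cv <= 0 || nv <= 0) exit 1   # reject nonsensical input
      cur = cp / cv     # current price per vCPU-hour
      cand = np / nv    # candidate price per vCPU-hour
      printf "%.1f\n", (cur - cand) / cur * 100
    }'
}

# Example: m5.large versus m7g.large, both with 2 vCPUs (prices assumed).
per_vcpu_saving 0.096 2 0.0816 2
```

The number only covers price. Whether the workload performs as well on arm64 is what the canary node is for.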

The rollback plan is the deliverable

We never ship a rightsizing change without a documented way back. For an Auto Scaling group that means keeping the old launch template version. For a managed database it means a tested resize window and a snapshot taken five minutes before. Done this way, a typical first pass on a neglected account recovers 20-35% of compute spend over a quarter, and the on-call team stops flinching when they see our calendar invite.
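
For the Auto Scaling case, the way back can be written down as a script before the change goes out. The sketch below assumes the group references a launch template directly (not through a mixed instances policy). The group name, template name and version number are hypothetical. By default it only prints the AWS CLI commands, and it runs them only when `APPLY=1` is set.

```bash
#!/usr/bin/env bash
# Sketch: roll an Auto Scaling group back to a previous launch template version.
# Group, template and version values are hypothetical placeholders.
# Commands are printed by default; set APPLY=1 to execute them with the AWS CLI.

run() {
  if [[ "${APPLY:-0}" == "1" ]]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

rollback_asg() {
  local asg="$1" template="$2" old_version="$3"
  # Point the group back at the template version that was live before the resize.
  run aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name "$asg" \
    --launch-template "LaunchTemplateName=${template},Version=${old_version}"
  # Replace running instances so they pick up the restored version.
  run aws autoscaling start-instance-refresh \
    --auto-scaling-group-name "$asg"
}

rollback_asg web-tier web-tier-lt 7
```

Printing the commands first gives the reviewer something concrete to attach to the change ticket, and the on-call engineer something to paste at 2am.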


João Matos

