Cost anomaly detection that doesn't cry wolf

We have walked into accounts with cost anomaly alerts wired to a Slack channel that everyone muted in month one. The alerts were not wrong, exactly - they fired on normal weekly seasonality and on a 30-euro test environment. A detector that cries wolf is worse than none, because it trains people to ignore the one alert that matters.

Segment before you threshold

The biggest single fix is monitoring per service, account, or tag - not the whole bill. A 5% jump in the total can hide a doubling of one team's spend, while a 5% wobble at the top level might just be a billing-day artifact. AWS Cost Anomaly Detection lets you define monitors per dimension, and that segmentation is what turns noise into signal.

Monitor by service and by linked account, not the consolidated total
Set an absolute floor - we ignore anomalies under ~50-100 euros of impact
Account for known seasonality: weekends, month-end batch, marketing campaigns
Route by ownership so the alert reaches whoever can act, not a shared firehose
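For the linked-account half of that list, a minimal sketch, assuming a CUSTOM monitor whose specification filters on the LINKED_ACCOUNT dimension. The helper name, the monitor name, and the account ID 111122223333 are placeholders for illustration, not values from a real setup.

```shell
# Build the JSON for a CUSTOM monitor scoped to a single linked account.
# The "account-<id>" naming scheme is an assumption made for this sketch.
linked_account_monitor_json() {
  account_id="$1"
  printf '{"MonitorName":"account-%s","MonitorType":"CUSTOM","MonitorSpecification":{"Dimensions":{"Key":"LINKED_ACCOUNT","Values":["%s"]}}}\n' \
    "$account_id" "$account_id"
}

# Usage (needs AWS credentials with Cost Explorer permissions):
#   aws ce create-anomaly-monitor \
#     --anomaly-monitor "$(linked_account_monitor_json 111122223333)"
```

One monitor per account keeps a small account's doubling from being averaged away by the large ones.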

Two thresholds, not one

We tune on both a percentage and an absolute amount, and an alert has to clear both. A 300% spike on a 5-euro resource is noise; a 12% rise on a 40,000-euro service is a real problem. Requiring both conditions kills the long tail of trivial-but-loud alerts that cause people to mute the channel in the first place.
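The both-conditions rule is easy to sanity-check outside any tool. A small sketch, assuming a 50-euro absolute floor (the low end of our range) and a 10% relative floor; the 10% figure and the function name are illustrative choices, not AWS defaults.

```shell
# Illustrative thresholds; override via the environment if needed.
MIN_ABS="${MIN_ABS:-50}"   # absolute floor, in the bill's currency
MIN_PCT="${MIN_PCT:-10}"   # percentage floor (assumed value)

# should_alert BASELINE ACTUAL -> prints "alert" or "skip"
should_alert() {
  awk -v base="$1" -v actual="$2" -v min_abs="$MIN_ABS" -v min_pct="$MIN_PCT" 'BEGIN {
    delta = actual - base
    # A percentage is undefined on a zero baseline, so in that case the
    # absolute check alone decides.
    pct = (base > 0) ? 100 * delta / base : min_pct
    result = (delta >= min_abs && pct >= min_pct) ? "alert" : "skip"
    print result
  }'
}

should_alert 5 20         # 300% spike, 15 of impact  -> skip
should_alert 40000 44800  # 12% rise, 4800 of impact  -> alert
```

The two calls mirror the examples above: the loud-but-tiny spike fails the absolute floor, while the modest rise on a large service clears both.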

The goal is not to catch every anomaly. It is to make sure that when the channel pings, people still look.

Every alert needs an owner and a next step

An anomaly that lands in a channel with no owner dies there. We route each monitor to the team that owns that service and include the likely culprits in the message - which usage type moved, by how much, since when. The reviewer should be able to confirm or dismiss in under two minutes; otherwise the alert quietly joins the ignored pile.

bash

# Create one dimensional monitor that tracks each AWS service separately.
aws ce create-anomaly-monitor \
  --anomaly-monitor '{"MonitorName":"per-service","MonitorType":"DIMENSIONAL","MonitorDimension":"SERVICE"}'
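A monitor on its own notifies nobody; the routing lives in a subscription. The sketch below builds one that points a monitor at its owning team's SNS topic and requires both an absolute and a percentage impact before it fires. The subscription name, both ARNs, and the threshold values are placeholders; note also that the absolute impact is expressed in the currency Cost Explorer reports, which may not be euros.

```shell
# subscription_json MONITOR_ARN SNS_TOPIC_ARN MIN_ABS MIN_PCT
# Emits the JSON for a subscription that alerts only when BOTH
# thresholds are met (the "And" expression).
subscription_json() {
  cat <<EOF
{
  "SubscriptionName": "per-service-owner",
  "MonitorArnList": ["$1"],
  "Subscribers": [{"Type": "SNS", "Address": "$2"}],
  "Frequency": "IMMEDIATE",
  "ThresholdExpression": {
    "And": [
      {"Dimensions": {"Key": "ANOMALY_TOTAL_IMPACT_ABSOLUTE",
                      "MatchOptions": ["GREATER_THAN_OR_EQUAL"],
                      "Values": ["$3"]}},
      {"Dimensions": {"Key": "ANOMALY_TOTAL_IMPACT_PERCENTAGE",
                      "MatchOptions": ["GREATER_THAN_OR_EQUAL"],
                      "Values": ["$4"]}}
    ]
  }
}
EOF
}

# Usage (MONITOR_ARN and TOPIC_ARN are hypothetical variables you set):
#   aws ce create-anomaly-subscription \
#     --anomaly-subscription "$(subscription_json "$MONITOR_ARN" "$TOPIC_ARN" 100 10)"
```

An SNS topic per owning team is one way to get the alert into that team's own channel rather than a shared one.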

Review the misses, not just the hits

Once a quarter we look back at the cost spikes that happened and check whether the detector caught them. False negatives - the real anomaly that never fired - are far more dangerous than a few false positives, and they only show up if you deliberately audit for them. A detector worth keeping fires maybe a handful of times a month, each one worth a glance. If it is firing daily, it is not protecting the budget - it is just background noise with a price tag.
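The quarterly audit can start as a plain set difference. A rough sketch, assuming two hypothetical text files with one YYYY-MM-DD date per line: spikes you identified by hand from the bill, and dates on which the detector actually alerted (for example, exported from `aws ce get-anomalies` and normalized to dates).

```shell
# missed_spikes SPIKES_FILE ALERTS_FILE
# Prints every date that appears in SPIKES_FILE but not in ALERTS_FILE,
# i.e. the candidate false negatives.
missed_spikes() {
  # -F fixed strings, -x whole-line match, -v invert, -f patterns from file.
  # grep exits non-zero when nothing is printed; "|| true" treats an empty
  # result (no misses) as success.
  grep -vxFf "$2" "$1" || true
}

# Usage:
#   missed_spikes spikes.txt alerts.txt
```

Exact-date matching is deliberately crude: an alert that fired a day late will show up as a miss, which is usually worth a look anyway.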

João Matos

Other notes from the team.

The FinOps audit: where the money actually goes

Rightsizing compute without breaking things

Commitments done right: reserved instances, savings plans, and spot
