Catching infrastructure drift before it catches you

Drift is the gap between what Terraform thinks exists and what actually exists. It happens the moment someone clicks something in a console during an incident, which is to say it happens constantly. The teams who get burned are not the ones with drift - everyone has drift - they are the ones who only discover it during an unrelated apply, when Terraform offers to revert a hotfix that is keeping production alive.

Detect on a schedule, not at apply time

We run `terraform plan -detailed-exitcode` on a nightly schedule against every state. Exit code 2 means the plan succeeded and found a diff; when the code on the main branch has already been applied, that diff is drift. We pipe that into a daily report rather than failing loudly, because a noisy drift alarm gets muted in a week. The goal is a short, reviewed list every morning, not a wall of red.

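bash

# -lock=false: a read-only nightly plan should not block a real apply
terraform plan -detailed-exitcode -lock=false -out=tfplan
status=$?  # capture immediately, before any other command overwrites $?
# exit 0 = no changes, 1 = error, 2 = drift
if [ "$status" -eq 2 ]; then
  # print every resource whose planned action is anything other than no-op
  terraform show -json tfplan | jq '.resource_changes[] | select(.change.actions != ["no-op"])'
fi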
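
To run the same check "against every state", the single-directory snippet can be wrapped in a loop that records a label per directory instead of stopping at the first failure. The following is a minimal sketch, not our actual pipeline: it assumes each root module lives in its own directory, and the names `classify_plan_status`, `check_dir`, `drift_report` and the `TERRAFORM` override variable are hypothetical.

```shell
# Sketch: nightly drift sweep over several root modules.
# Assumption: TERRAFORM may be overridden to point at a wrapper or a stub.
TERRAFORM="${TERRAFORM:-terraform}"

# Map a `terraform plan -detailed-exitcode` status to a report label.
classify_plan_status() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Plan one directory and print "<dir> <label>". The function itself never
# fails, so one broken state does not hide the rest of the report.
check_dir() {
  local dir="$1" status=0
  (
    cd "$dir" &&
      "$TERRAFORM" plan -detailed-exitcode -lock=false -input=false \
        -out=tfplan >plan.log 2>&1
  ) || status=$?
  echo "$dir $(classify_plan_status "$status")"
}

# Sweep every directory passed as an argument, one report line each.
drift_report() {
  local dir
  for dir in "$@"; do
    check_dir "$dir"
  done
}
```

Something like `drift_report envs/*/ > drift-report.txt` (the `envs/` layout is an assumption) would then produce the short list for the morning review, with the full plan output left in each directory's `plan.log`.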

Not all drift deserves the same response

This is the part people skip. Some drift you adopt, some you revert, some you ignore on purpose. A tag added by a cost-allocation Lambda is not a problem to fix, it is reality to absorb - so you either codify the tag or tell the provider to ignore it.

Emergency console change during an incident - adopt it back into code the same week, while the context is fresh.
Out-of-band automation (autoscaling, backup tooling) writing tags or capacity - use `lifecycle { ignore_changes = [...] }`, do not fight it.
Someone manually widening a security group - revert immediately, this is the dangerous kind.
Provider default that shifted under you - pin it, then decide deliberately.
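
The four cases above can be turned into a first-pass triage hint for the daily report. This is only a sketch under stated assumptions: the function name `suggest_response` is hypothetical, the resource types and attribute names are AWS provider examples chosen for illustration, and a script cannot tell an incident hotfix from a shifted provider default, so everything it does not recognise is left for a human to decide.

```shell
# Sketch: suggest a response for one drifted resource.
#   $1 = resource type (e.g. aws_security_group)
#   $2 = comma-separated names of the attributes that changed
# The type and attribute lists are illustrative assumptions, not a policy.
suggest_response() {
  # Security group drift is the dangerous kind: flag it for revert first.
  case "$1" in
    aws_security_group|aws_security_group_rule)
      echo "revert"
      return
      ;;
  esac
  # Tags and capacity are typically written by out-of-band automation.
  case "$2" in
    tags|tags_all|tags,tags_all|desired_capacity)
      echo "ignore_changes"
      ;;
    *)
      # Adopt or pin: needs a person with context.
      echo "review"
      ;;
  esac
}
```

For example, `suggest_response aws_autoscaling_group desired_capacity` prints `ignore_changes`, while an unrecognised change such as `suggest_response aws_instance instance_type` prints `review`.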

Make the right path the easy path

The durable fix is not better detection, it is removing the reason people drift in the first place. When the only way to bump a memory limit at 3am is the console, people will use the console. We invest in a break-glass apply path - a pre-approved pipeline that can run a targeted change fast - so the incident response and the code stay in sync. Detection tells you where you have a process gap. Closing the gap is the actual work.
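
A break-glass path can be as small as a wrapper that insists on an incident reference, limits the change to one resource, and leaves a trail. The sketch below is an assumption about what such a wrapper might look like, not a description of our pipeline: `break_glass_apply`, the `TERRAFORM` override, the plan file name and `breakglass.log` are all hypothetical.

```shell
# Sketch: targeted emergency apply with an audit line.
# Assumption: TERRAFORM may be overridden to point at a wrapper or a stub.
TERRAFORM="${TERRAFORM:-terraform}"

# Usage: break_glass_apply <incident-id> <resource-address>
break_glass_apply() {
  local incident="$1" target="$2"
  if [ -z "$incident" ] || [ -z "$target" ]; then
    echo "usage: break_glass_apply <incident-id> <resource-address>" >&2
    return 64
  fi
  local planfile="breakglass-${incident}.tfplan"

  # Plan only the named resource so the blast radius stays small.
  "$TERRAFORM" plan -input=false -target="$target" -out="$planfile" || return 1

  # Apply exactly the plan that was just produced, nothing newer.
  "$TERRAFORM" apply -input=false "$planfile" || return 1

  # Record who-knows-what-happened-at-3am for the follow-up review.
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) incident=$incident target=$target" \
    >> breakglass.log
}
```

A call such as `break_glass_apply INC-1 aws_instance.web` (both values are placeholders) applies the committed change to that one resource. Terraform itself warns that `-target` is meant for exceptional situations, which is exactly the situation this path exists for; a full plan afterwards confirms nothing else was left behind.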

João Matos

Other notes from the team.

Why we still write Terraform in 2026

Terraform state at scale: locking, remote backends, and shrinking blast radius

Designing Terraform modules that survive three years