Drift is the gap between what Terraform thinks exists and what actually exists. It happens the moment someone clicks something in a console during an incident, which is to say it happens constantly. The teams who get burned are not the ones with drift - everyone has drift - they are the ones who only discover it during an unrelated apply, when Terraform offers to revert a hotfix that is keeping production alive.
Detect on a schedule, not at apply time
We run `terraform plan -detailed-exitcode` on a nightly schedule against every state. Exit code 2 means there is a diff. We pipe that into a daily report rather than failing loudly, because a noisy drift alarm gets muted in a week. The goal is a short, reviewed list every morning, not a wall of red.
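One way to cover every state is a loop that records a status per state and always exits cleanly, so the result lands in a report instead of a failed job. This is a minimal sketch, not our exact tooling: it assumes each state lives in its own already-initialized directory under a single root, and the function names `classify_exit` and `drift_report` are hypothetical.

```sh
# Sketch: summarize drift across states without failing the job.
# Assumption: one directory per state under a common root (hypothetical layout),
# each already initialized with `terraform init`.

# Map a `terraform plan -detailed-exitcode` exit code to a report label.
classify_exit() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Print one "status<TAB>directory" line per state directory under $1.
drift_report() {
  local root="$1" dir code
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue
    code=0
    # `|| code=$?` keeps the loop alive even when the caller uses `set -e`.
    terraform -chdir="$dir" plan -detailed-exitcode -lock=false \
      -input=false -no-color >/dev/null 2>&1 || code=$?
    printf '%s\t%s\n' "$(classify_exit "$code")" "${dir%/}"
  done
  return 0
}
```

Calling `drift_report ./stacks > drift-report.tsv` (with `./stacks` standing in for wherever your states live) yields one line per state. Errors are reported alongside drift rather than hidden, since a state that cannot be planned is also worth a look in the morning.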
```sh
terraform plan -detailed-exitcode -lock=false -out=tfplan
status=$?  # exit 0 = no changes, 1 = error, 2 = drift
if [ "$status" -eq 2 ]; then
  # Print every resource whose planned action is something other than a no-op.
  terraform show -json tfplan | jq '.resource_changes[] | select(.change.actions != ["no-op"])'
fi
```
Not all drift deserves the same response
This is the part people skip. Some drift you adopt, some you revert, some you ignore on purpose. A tag added by a cost-allocation Lambda is not a problem to fix, it is reality to absorb - so you either codify the tag or tell the provider to ignore it.
- Emergency console change during an incident - adopt it back into code the same week, while the context is fresh.
- Out-of-band automation (autoscaling, backup tooling) writing tags or capacity - use `lifecycle { ignore_changes = [...] }`, do not fight it.
- Someone manually widening a security group - revert immediately, this is the dangerous kind.
- Provider default that shifted under you - pin the provider version (or set the attribute explicitly), then decide deliberately.
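The triage above can be partly mechanized by routing drifted resources on their type, so the dangerous kind surfaces at the top of the morning list. The sketch below is illustrative only: the type-to-action mapping is a hypothetical policy that assumes AWS provider resource types, and the action labels are made up for this example.

```sh
# Sketch: route drifted resources to a response by resource type.
# The mapping is a hypothetical policy; adjust it to your own providers and risks.
triage_action() {
  case "$1" in
    aws_security_group|aws_security_group_rule|aws_vpc_security_group_*_rule)
      echo "revert-now" ;;            # manually widened access is the dangerous kind
    aws_autoscaling_group)
      echo "expected-automation" ;;   # candidate for lifecycle ignore_changes
    *)
      echo "review" ;;                # a human decides: adopt, revert, or ignore
  esac
}

# Read "type address" lines on stdin, print "action<TAB>address" lines.
triage_report() {
  local type address
  while read -r type address; do
    [ -n "$type" ] || continue
    printf '%s\t%s\n' "$(triage_action "$type")" "$address"
  done
}
```

To feed it, pipe the plan JSON through a filter that emits one "type address" pair per changed resource, for example `terraform show -json tfplan | jq -r '.resource_changes[] | select(.change.actions != ["no-op"]) | "\(.type) \(.address)"' | triage_report`. Type alone cannot tell an incident hotfix from a mistake, so everything outside the explicit rules still goes to a person.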
Make the right path the easy path
The durable fix is not better detection, it is removing the reason people drift in the first place. When the only way to bump a memory limit at 3am is the console, people will use the console. We invest in a break-glass apply path - a pre-approved pipeline that can run a targeted change fast - so the incident response and the code stay in sync. Detection tells you where you have a process gap. Closing the gap is the actual work.
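A break-glass path can be as small as a wrapper that plans and applies one targeted resource and ties the run to an incident. What follows is a sketch under stated assumptions rather than a description of our pipeline: the function name `break_glass_apply`, the incident ID argument, and the plan file naming are all hypothetical, and approval is assumed to happen in whatever pipeline invokes it.

```sh
# Sketch of a break-glass wrapper: one targeted change, tied to an incident.
# Assumptions: the code change is already committed, and the pipeline that
# calls this function handles approval and audit logging.
break_glass_apply() {
  local incident="$1" target="$2" planfile
  if [ -z "$incident" ] || [ -z "$target" ]; then
    echo "usage: break_glass_apply INCIDENT_ID RESOURCE_ADDRESS" >&2
    return 64
  fi
  planfile="breakglass-${incident}.tfplan"
  # Plan only the named resource so the blast radius stays small.
  terraform plan -input=false -target="$target" -out="$planfile" || return 1
  # Applying a saved plan file does not prompt for confirmation.
  terraform apply -input=false "$planfile" || return 1
  echo "break-glass apply for ${incident}: ${target}"
}
```

A call might look like `break_glass_apply INC-1234 aws_ecs_service.api`, where both arguments are placeholders. Because `-target` skips everything else in the configuration, a full plan afterwards is still worth running to confirm nothing else has moved.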