Terraform state at scale: locking, remote backends, and shrinking blast radius

The first sign that a Terraform state has outgrown itself is rarely an error. It is the silence after you type `terraform plan` and go make coffee. We inherited a state on S3 with roughly 1,400 resources in it, and a clean plan took 38 minutes because every refresh hit every API. The team had stopped running plans before merging. That is the real cost: people route around slow tooling, and then drift shows up in production.

Remote state is table stakes, locking is the actual point

Everyone moves state off the laptop eventually. The part people underinvest in is locking. This year we walked into two engagements where two pipelines could run `apply` against the same state at once, because the DynamoDB lock table had never been created or had been deleted during a cost cleanup. Both had corrupted state at least once. With the newer S3 backend (Terraform 1.10 and later) you no longer need a separate table - native S3 locking via a `.tflock` object covers it - but you have to opt in with `use_lockfile = true`, and most older configs still point at DynamoDB.

hcl

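terraform {
  backend "s3" {
    bucket       = "veritech-tfstate-prod"
    key          = "network/terraform.tfstate" # one key per state
    region       = "eu-west-1"
    use_lockfile = true # opt in to native S3 locking via a .tflock object
    encrypt      = true # encrypt the state object at rest
  }
}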
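
For comparison, here is a minimal sketch of the older DynamoDB-based setup that many configs still carry. The table name `terraform-locks` and the idea of a separate bootstrap configuration are assumptions for illustration; the `LockID` string hash key is what the S3 backend expects from a lock table.

```hcl
# Legacy locking: the backend points at a DynamoDB table that must already exist.
terraform {
  backend "s3" {
    bucket         = "veritech-tfstate-prod"
    key            = "network/terraform.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "terraform-locks" # hypothetical table name
    encrypt        = true
  }
}

# The lock table itself, assumed to live in a separate bootstrap configuration.
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # the S3 backend requires this exact key name

  attribute {
    name = "LockID"
    type = "S"
  }

  # Two guards against an accidental cleanup: one at the API level,
  # one against a Terraform-driven destroy.
  deletion_protection_enabled = true

  lifecycle {
    prevent_destroy = true
  }
}
```

Neither guard stops someone who is determined to delete the table, but both turn a silent cleanup into an explicit decision.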

Splitting state by blast radius, not by team org chart

The instinct is to split state along team lines. That feels tidy and is usually wrong. We split along blast radius: what is the worst thing that happens if this state is corrupted or this apply goes sideways at 2am? Networking, IAM, and DNS go in their own small, rarely-changing states. The application layer that ships ten times a day gets its own. The rule we use: if two resource groups have wildly different change frequencies, they do not belong in the same state.

Foundation (accounts, IAM, org policy) - changes monthly, blast radius enormous
Network (VPCs, transit gateway, DNS zones) - changes weekly, blast radius high
Platform (clusters, databases, shared queues) - changes weekly
Application (services, autoscaling, app config) - changes many times a day, blast radius local

Split state by what hurts when it breaks, not by who owns the Jira board.
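
As a sketch of what the four layers above can look like on disk: each layer is its own root module with its own backend block, and only the state key differs. The directory and file names are hypothetical; the `network` key matches the backend example earlier in this post.

```hcl
# foundation/backend.tf (hypothetical path)
terraform {
  backend "s3" {
    bucket       = "veritech-tfstate-prod"
    key          = "foundation/terraform.tfstate"
    region       = "eu-west-1"
    use_lockfile = true
    encrypt      = true
  }
}

# The other layers repeat the block with a different key:
#   network/backend.tf      -> key = "network/terraform.tfstate"
#   platform/backend.tf     -> key = "platform/terraform.tfstate"
#   application/backend.tf  -> key = "application/terraform.tfstate"
```

Because each key has its own lock, a long-running apply in one layer does not block plans in another.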

Reading across states without coupling them

Once you split, the next trap is `terraform_remote_state` data sources everywhere, which quietly recouples your states into one big graph. We allow exactly one direction of dependency - lower layers expose outputs, higher layers read them - and we prefer publishing a handful of values to SSM Parameter Store or a data source lookup over `remote_state` when the consumer is in a different account. It is a little more boilerplate and a lot less foot-gun.
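
A minimal sketch of the two options, assuming a network layer that owns an `aws_vpc.main` and an application layer that needs its ID. The parameter path, resource names, and security group are hypothetical. The snippets belong to different configurations and are shown together only for comparison.

```hcl
# --- Network layer (producer): publish only the values consumers need ---
resource "aws_ssm_parameter" "vpc_id" {
  name  = "/infra/network/vpc_id" # hypothetical naming scheme
  type  = "String"
  value = aws_vpc.main.id # assumes aws_vpc.main is defined in this layer
}

# --- Application layer (consumer), option A: read the published value ---
data "aws_ssm_parameter" "vpc_id" {
  name = "/infra/network/vpc_id"
}

resource "aws_security_group" "service" {
  name   = "example-service"
  vpc_id = data.aws_ssm_parameter.vpc_id.value
}

# --- Application layer (consumer), option B: read the lower layer's outputs ---
# Only ever points downward: application -> network, never the reverse.
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "veritech-tfstate-prod"
    key    = "network/terraform.tfstate"
    region = "eu-west-1"
  }
}

# Usage: data.terraform_remote_state.network.outputs.vpc_id
# (assumes the network layer declares an output named "vpc_id")
```

Option A needs only read access to one parameter, while option B needs read access to the whole network state object. For a consumer in a different account, option A also needs some way to reach the parameter, such as a provider alias that assumes a role in the producing account; that wiring is left out of this sketch.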

On the migration itself: `terraform state mv` works but is slow and error-prone across hundreds of resources. For the 1,400-resource monolith we scripted the moves, ran them against a copy of the state first, and kept the old state read-only for two weeks before deleting it. Nobody enjoyed it, but plans dropped from 38 minutes to under 90 seconds per layer, and people started reviewing plans again.
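
For readers on recent Terraform versions, there is a config-driven alternative to scripted `state mv` calls. This is a different technique from the scripted moves described above: a `removed` block (Terraform 1.7 and later) drops a resource from the old state without destroying it, and an `import` block (Terraform 1.5 and later) adopts it into the new one. The resource address and VPC ID below are placeholders.

```hcl
# --- New network-layer configuration: adopt the existing VPC ---
import {
  to = aws_vpc.main
  id = "vpc-0123456789abcdef0" # placeholder ID
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16" # must match the real VPC, or the plan shows a diff
}

# --- Old monolith configuration: forget the VPC without destroying it ---
# The original resource "aws_vpc" "main" block is deleted and replaced by this.
removed {
  from = aws_vpc.main

  lifecycle {
    destroy = false # remove from state only; the real VPC is left untouched
  }
}
```

Both sides go through an ordinary plan, so the move can be reviewed like any other change. Applying the import in the new layer first, then the removal in the old one, avoids a window in which no state tracks the resource.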

João Matos

Other notes from the team.

Why we still write Terraform in 2026

Designing Terraform modules that survive three years

Catching infrastructure drift before it catches you
