The first sign that a Terraform state has outgrown itself is rarely an error. It is the silence after you type `terraform plan` and go make coffee. We inherited a state on S3 with roughly 1,400 resources in it, and a clean plan took 38 minutes because every refresh hit every API. The team had stopped running plans before merging. That is the real cost: people route around slow tooling, and then drift shows up in production.
Remote state is table stakes, locking is the actual point
Everyone moves state off the laptop eventually. The part people underinvest in is locking. We have walked into two engagements this year where two pipelines could run `apply` against the same state at once because the DynamoDB lock table was never created, or had been deleted during a cost cleanup. Both had corrupted state at least once. With the newer S3 backend (Terraform 1.10 and later) you no longer need a separate table - native S3 locking via a `.tflock` object covers it - but you have to opt in with `use_lockfile = true`, and most older configs still point at DynamoDB.
```hcl
terraform {
  backend "s3" {
    bucket       = "veritech-tfstate-prod"
    key          = "network/terraform.tfstate"
    region       = "eu-west-1"
    use_lockfile = true
    encrypt      = true
  }
}
```

Splitting state by blast radius, not by team org chart
The instinct is to split state along team lines. That feels tidy and is usually wrong. We split along blast radius: what is the worst thing that happens if this state is corrupted or this apply goes sideways at 2am? Networking, IAM, and DNS go in their own small, rarely-changing states. The application layer that ships ten times a day gets its own. The rule we use: if two resource groups have wildly different change frequencies, they do not belong in the same state.
- Foundation (accounts, IAM, org policy) - changes monthly, blast radius enormous
- Network (VPCs, transit gateway, DNS zones) - changes weekly, blast radius high
- Platform (clusters, databases, shared queues) - changes weekly
- Application (services, autoscaling, app config) - changes many times a day, blast radius local
Split state by what hurts when it breaks, not by who owns the Jira board.
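As a rough sketch of what the layering above can look like on disk, each layer becomes its own root module with its own state key. The directory names and every key other than `network/terraform.tfstate` are illustrative assumptions, not the layout we actually shipped. Backend blocks cannot interpolate variables, so the key is spelled out literally in each layer.

```hcl
# Hypothetical layout: one root module (and one state object) per layer.
#
#   foundation/backend.tf   -> key = "foundation/terraform.tfstate"
#   network/backend.tf      -> key = "network/terraform.tfstate"
#   platform/backend.tf     -> key = "platform/terraform.tfstate"
#   application/backend.tf  -> key = "application/terraform.tfstate"

# application/backend.tf (assumed file name)
terraform {
  backend "s3" {
    bucket       = "veritech-tfstate-prod"
    key          = "application/terraform.tfstate" # assumed key, one per layer
    region       = "eu-west-1"
    use_lockfile = true # each state object gets its own lock
    encrypt      = true
  }
}
```

Because each key has its own lock, a busy application pipeline never waits on, or collides with, an apply against the network or foundation state.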
Reading across states without coupling them
Once you split, the next trap is `terraform_remote_state` data sources everywhere, which quietly recouples your states into one big graph. We allow exactly one direction of dependency: lower layers expose outputs, higher layers read them. When the consumer is in a different account, we prefer publishing a handful of values to SSM Parameter Store, or looking them up with an ordinary data source, over `terraform_remote_state`. It is a little more boilerplate and a lot less foot-gun.
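A minimal sketch of the publish-and-read pattern, assuming the network layer owns a resource named `aws_vpc.main` and that the parameter path `/platform/network/vpc_id` is a convention you pick yourself (both are hypothetical). The two halves live in different root modules; they are shown together only for brevity.

```hcl
# --- network layer (lower): publishes the value ---
resource "aws_ssm_parameter" "vpc_id" {
  name  = "/platform/network/vpc_id" # hypothetical naming convention
  type  = "String"
  value = aws_vpc.main.id # assumes aws_vpc.main is defined in this layer
}

# --- application layer (higher): reads it, with no access to network state ---
data "aws_ssm_parameter" "vpc_id" {
  name = "/platform/network/vpc_id"
}

resource "aws_security_group" "service" {
  name   = "orders-service" # hypothetical service name
  vpc_id = data.aws_ssm_parameter.vpc_id.value
}
```

The consumer needs read access to one parameter rather than to the whole network state file. For a consumer in another account you would additionally need a provider configuration that can read the parameter, which this sketch leaves out.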
On the migration itself: `terraform state mv` works but is slow and error-prone across hundreds of resources. For the 1,400-resource monolith we scripted the moves, ran them against a copy of the state first, and kept the old state read-only for two weeks before deleting it. Nobody enjoyed it, but plans dropped from 38 minutes to under 90 seconds per layer, and people started reviewing plans again.
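For comparison, here is a declarative alternative to scripted `terraform state mv`. This is not what we ran on the migration described above; it is a sketch that assumes Terraform 1.7 or later (for `removed` blocks) and uses a placeholder resource address and ID. The old root module forgets the resource without destroying it, and the new root module adopts it on its next apply.

```hcl
# --- old monolith: stop managing the resource, but leave it running ---
removed {
  from = aws_vpc.main # placeholder address

  lifecycle {
    destroy = false # drop from state only; do not delete the real VPC
  }
}

# --- new network layer: adopt the existing resource into this state ---
import {
  to = aws_vpc.main
  id = "vpc-0123456789abcdef0" # placeholder ID, look up the real one
}

# The matching resource block must also exist in the new layer,
# with arguments that match the live resource.
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16" # placeholder; must match the existing VPC
}
```

Both halves show up in `terraform plan` before anything changes, so the move can be reviewed like any other change. Whichever approach you use, rehearsing against a copy of the state first still applies.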