Every few months a client asks us to make their platform multi-cloud so an outage at one provider cannot take them down. It is a reasonable instinct and almost always the wrong first move. The cheaper resilience is sitting one layer down: a second region with the same provider, which you can actually staff, test, and afford.
What an outage usually looks like
When we pull the post-mortems, the failures that hurt are rarely a whole cloud going dark. They are a single region degrading, a control plane throttling, an availability zone losing power, or a bad deploy in one place. A well-built multi-region setup on AWS handles every one of those. A full provider outage that spans all regions is real but rare, and the cost of defending against it with true multi-cloud is enormous.
- Region-level event: Route 53 health checks detect the failure and DNS failover routing shifts traffic to your standby region in minutes.
- Zonal failure: an Auto Scaling group across three AZs absorbs it with no human in the loop.
- Control-plane throttling: you ride it out if your data plane keeps serving without API calls.
- Bad deploy: a region-by-region rollout limits the blast radius to one place.
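The bad-deploy case in the list above can be sketched as a staged rollout. This is a minimal Python sketch, not a real deployment tool: `staged_rollout`, `RolloutResult`, and the three callbacks are hypothetical names, and the region list is only an example. It assumes you already have some way to deploy to, health-check, and roll back a single region.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RolloutResult:
    deployed: List[str] = field(default_factory=list)  # regions that passed their health check
    halted_at: Optional[str] = None                    # region whose health check failed, if any


def staged_rollout(
    regions: List[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> RolloutResult:
    """Deploy one region at a time and stop at the first unhealthy one."""
    result = RolloutResult()
    for region in regions:
        deploy(region)
        if not is_healthy(region):
            # Undo only the region that just failed; later regions were never touched.
            rollback(region)
            result.halted_at = region
            break
        result.deployed.append(region)
    return result


# Stand-in callbacks (assumptions): pretend the new build breaks in eu-west-1.
touched: List[str] = []
outcome = staged_rollout(
    regions=["us-east-2", "eu-west-1", "ap-southeast-2"],
    deploy=touched.append,
    is_healthy=lambda region: region != "eu-west-1",
    rollback=lambda region: None,
)
print(outcome.deployed, outcome.halted_at, touched)
```

In this run the third region never receives the bad build, which is the whole point: the damage stays in the one place where the health check failed.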
Why multi-cloud is so much harder than it sounds
Multi-cloud means two of everything that is not portable. Two IAM models, two networking stacks, two managed database flavours, two observability pipelines, two on-call rotations that each need real depth. The lowest common denominator becomes your architecture, so you give up the managed services that made the cloud worth using. We have watched teams spend a year building a portability layer they never failed over to once.
The cost of multi-cloud is paid every single day in complexity; the benefit shows up maybe once a decade.
There are honest reasons to run multi-cloud: a contractual requirement, a true regulatory mandate, or using one provider for a service the other lacks. Pure availability is rarely one of them, because the marginal nine you buy is expensive and the one you skipped (region failover you never tested) is cheap.
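To put rough numbers on the "marginal nine", here is a toy calculation in Python. Everything in it is an assumption for illustration: the 99.9% figure is not a published SLA, the deployments are treated as failing independently, and failover is treated as instant and flawless. Real systems do worse on all three counts, so read it only as a picture of diminishing returns.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` independent deployments where any one can serve."""
    return 1 - (1 - single) ** copies


def downtime_minutes_per_year(availability: float) -> float:
    """Expected unavailable minutes in a 365-day year."""
    return (1 - availability) * MINUTES_PER_YEAR


REGION_AVAILABILITY = 0.999  # illustrative assumption, not a provider SLA

for copies in (1, 2, 3):
    a = combined_availability(REGION_AVAILABILITY, copies)
    print(f"{copies} independent deployment(s): {a:.9f} -> "
          f"{downtime_minutes_per_year(a):.4f} min/year")
```

Under these assumptions the second deployment takes expected downtime from roughly 526 minutes a year to roughly half a minute. A third copy can only shave that remaining half minute, and if it lives at another provider it is by far the most expensive copy to run.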
What we actually recommend
Start with two regions in one provider and an honest active-passive design. Replicate state with the managed tooling you already trust, such as Aurora global databases or DynamoDB global tables, and keep your standby warm enough that failover is a switch, not a rebuild. Then test it on a schedule, in business hours, with the team watching.
- Pick a primary and a standby region in the same legal jurisdiction so data residency does not move.
- Automate failover in code rather than leaving it as steps in a runbook, and rehearse it quarterly.
- Keep config and secrets replicated so the standby is never more than minutes stale.
- Only after this works should you weigh whether a second provider buys anything real.
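One way to make "failover is a switch" testable is to encode the preconditions from the checklist above as a pre-flight check that every rehearsal runs. The Python sketch below is hypothetical throughout: `StandbyStatus`, `failover_blockers`, `rehearse_failover`, and both thresholds are invented for illustration, and in practice the status values would come from your own monitoring.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class StandbyStatus:
    region: str
    healthy: bool             # standby passed its own health checks
    replication_lag_s: float  # seconds the standby's data trails the primary
    config_age_s: float       # seconds since config and secrets were last synced


# Assumed thresholds; set them from your own recovery point objective.
MAX_REPLICATION_LAG_S = 60.0
MAX_CONFIG_AGE_S = 300.0


def failover_blockers(standby: StandbyStatus) -> List[str]:
    """Return every reason the standby is not ready to take traffic."""
    blockers: List[str] = []
    if not standby.healthy:
        blockers.append("standby is failing health checks")
    if standby.replication_lag_s > MAX_REPLICATION_LAG_S:
        blockers.append(f"replication lag {standby.replication_lag_s:.0f}s exceeds limit")
    if standby.config_age_s > MAX_CONFIG_AGE_S:
        blockers.append(f"config is {standby.config_age_s:.0f}s stale")
    return blockers


def rehearse_failover(standby: StandbyStatus, promote: Callable[[str], None]) -> bool:
    """Promote the standby only if nothing blocks it; report what does."""
    blockers = failover_blockers(standby)
    if blockers:
        for reason in blockers:
            print(f"[{standby.region}] blocked: {reason}")
        return False
    promote(standby.region)
    return True


promoted: List[str] = []
ready = StandbyStatus("eu-central-1", healthy=True, replication_lag_s=4.0, config_age_s=90.0)
stale = StandbyStatus("eu-central-1", healthy=True, replication_lag_s=4.0, config_age_s=3600.0)
print(rehearse_failover(ready, promoted.append))  # standby within limits
print(rehearse_failover(stale, promoted.append))  # config too old, so no promotion
```

A rehearsal that ends in a list of blockers is still a useful result: it tells you which bullet in the checklist has quietly stopped being true since the last drill.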
Resilience is a property you can demonstrate, not a logo on an architecture diagram. Get region failover working and tested first. Multi-cloud can wait until you have a reason that is not just fear of a headline.