Disaster recovery: setting RTO and RPO honestly and testing them

Disaster recovery is where optimism goes to get tested. Almost every plan we audit has confident RTO and RPO numbers on paper and no evidence anyone has ever met them. The plan reads well and would fail at 3am during an actual incident, because the gap between a documented procedure and a rehearsed one is enormous.

Define the two numbers honestly

RTO is how long you can be down. RPO is how much data you can afford to lose, measured as the window of time before the failure whose writes you accept losing. Both are business decisions, not technical ones, and the honest version is expensive to say out loud. A four-hour RTO and a five-minute RPO sound modest until you price them: tighter numbers mean a warmer standby and more frequent replication, and both cost money. The job is to match the spend to what the business genuinely needs, not to what sounds impressive.

RTO: the maximum tolerable time to restore service, set by what an outage actually costs the business.
RPO: the maximum tolerable data loss, set by how much rework or lost revenue a gap creates.
Both vary by system: the billing database is not the marketing site, so do not give them the same target.
Tighter numbers cost real money, so write down what you will actually pay for, not the wish.
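
One way to keep the targets honest is to record them per system, next to the outage cost that is supposed to justify them. The sketch below is illustrative only: the `RecoveryTarget` type, the `exposure_at_rto` helper, the system names, and every figure are assumptions invented for the example, not numbers from a real audit.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTarget:
    """Per-system recovery targets (hypothetical structure)."""
    system: str
    rto: timedelta               # maximum tolerable time to restore service
    rpo: timedelta               # maximum tolerable window of lost data
    outage_cost_per_hour: float  # assumed business estimate, any currency


def exposure_at_rto(target: RecoveryTarget) -> float:
    """Cost of an outage that lasts exactly as long as the RTO allows."""
    return target.outage_cost_per_hour * (target.rto.total_seconds() / 3600)


# Made-up figures: billing and the marketing site get different targets.
TARGETS = [
    RecoveryTarget("billing-db", timedelta(hours=4), timedelta(minutes=5), 20_000.0),
    RecoveryTarget("marketing-site", timedelta(hours=24), timedelta(hours=24), 150.0),
]

for t in TARGETS:
    print(f"{t.system}: RTO {t.rto}, RPO {t.rpo}, exposure {exposure_at_rto(t):,.0f}")
```

Putting the exposure next to each target makes it harder to write down a number nobody has priced.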

Pick the strategy that fits the number

The DR strategy follows directly from the RTO and RPO. Backup and restore is the cheapest and slowest option, fine for an RTO measured in hours. Pilot light keeps the data layer running and replicated while the rest of the stack stays switched off until it is needed. Warm standby runs a scaled-down copy of the full stack, ready to take over and scale up. Active-active serves from both sites all the time and gives near-zero RTO at the highest cost. Do not pay for active-active when the business is genuinely fine with a two-hour recovery.

Every DR tier you climb roughly multiplies the cost; buy only the tier the business will actually pay for.
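
The rule of buying the cheapest tier that still meets the targets can be sketched as a lookup. Everything here is an assumption made for illustration: the `ACHIEVABLE` table, the `cheapest_strategy` function, and especially the recovery times attached to each tier, which are placeholders to be replaced with times measured in your own drills.

```python
from datetime import timedelta
from enum import Enum
from typing import Optional


class Strategy(Enum):
    BACKUP_AND_RESTORE = "backup and restore"
    PILOT_LIGHT = "pilot light"
    WARM_STANDBY = "warm standby"
    ACTIVE_ACTIVE = "active-active"


# Ordered cheapest first. The (achievable RTO, achievable RPO) pairs are
# placeholder assumptions, not benchmarks for any real platform.
ACHIEVABLE = [
    (Strategy.BACKUP_AND_RESTORE, timedelta(hours=4), timedelta(hours=24)),
    (Strategy.PILOT_LIGHT, timedelta(hours=1), timedelta(minutes=15)),
    (Strategy.WARM_STANDBY, timedelta(minutes=10), timedelta(minutes=1)),
    (Strategy.ACTIVE_ACTIVE, timedelta(minutes=1), timedelta(0)),
]


def cheapest_strategy(rto: timedelta, rpo: timedelta) -> Optional[Strategy]:
    """Return the first (cheapest) tier that meets both targets, if any."""
    for strategy, achievable_rto, achievable_rpo in ACHIEVABLE:
        if achievable_rto <= rto and achievable_rpo <= rpo:
            return strategy
    return None  # no tier in the table can meet these targets


choice = cheapest_strategy(rto=timedelta(hours=2), rpo=timedelta(hours=1))
print(choice.value if choice else "no tier meets these targets")
```

With these placeholder figures, a two-hour RTO and a one-hour RPO land on pilot light, not active-active. A `None` result means the targets are tighter than anything in the table can deliver, which is a conversation to have with the business rather than a value to paper over.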

An untested plan is fiction

The single most common failure is a DR plan that has never been exercised. The backups exist but nobody has restored from them, so no one knows the restore takes nine hours, or that a credential expired, or that the runbook references a server that was decommissioned last year. The first real test of an untested plan is the disaster itself, which is the worst possible time to discover the gaps.

Run a real failover on a schedule, not a tabletop walkthrough that touches nothing.
Time it and compare the measured RTO and RPO against the numbers you promised.
Restore from backups regularly, because a backup you have never restored is just storage cost.
Rotate who runs the drill so recovery does not depend on one person being awake.
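
Timing a drill and comparing it against the promise takes three timestamps. The sketch below assumes you can record when the failure was injected, when service came back, and the timestamp of the newest data present after recovery; the `DrillResult` type, the `missed_targets` function, and the example timestamps are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrillResult:
    failure_injected: datetime        # when the primary was taken down
    service_restored: datetime        # when the recovered service took traffic
    last_recoverable_write: datetime  # newest data present after recovery

    @property
    def measured_rto(self) -> timedelta:
        return self.service_restored - self.failure_injected

    @property
    def measured_rpo(self) -> timedelta:
        return self.failure_injected - self.last_recoverable_write


def missed_targets(drill: DrillResult, rto: timedelta, rpo: timedelta) -> list[str]:
    """List every promised number the drill failed to meet."""
    misses = []
    if drill.measured_rto > rto:
        misses.append(f"RTO missed: measured {drill.measured_rto}, promised {rto}")
    if drill.measured_rpo > rpo:
        misses.append(f"RPO missed: measured {drill.measured_rpo}, promised {rpo}")
    return misses


# Invented drill: a nine-hour restore against a four-hour promise.
drill = DrillResult(
    failure_injected=datetime(2024, 1, 10, 9, 0),
    service_restored=datetime(2024, 1, 10, 18, 0),
    last_recoverable_write=datetime(2024, 1, 10, 8, 50),
)
for miss in missed_targets(drill, rto=timedelta(hours=4), rpo=timedelta(minutes=5)):
    print(miss)
```

Keeping the results of each drill over time shows whether the measured numbers are drifting away from the promised ones.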

A disaster recovery plan is only as good as the last time you ran it. Set the numbers from the business, buy the tier those numbers justify, and then prove it by failing over on purpose while everyone is rested and watching. The teams that recover calmly are the ones that have already done it a dozen times when it did not count.

João Matos

