Disaster Recovery Tabletops That Find Real Problems

2026-04-30 · DR, Resilience, Process

Most disaster recovery exercises are theater. A facilitator reads a scenario, participants describe what they would do, a scribe takes notes, and at the end everyone agrees the company is well-prepared. The DR plan is filed. Six months later, when an actual incident hits, the team discovers that the runbook references a snapshot bucket that was deleted in a cost-cutting exercise.

We run tabletops every quarter. They surface real issues every time. This is the format we settled on after several iterations of running ineffective ones.

The Test of a Good Scenario

A good tabletop scenario has three properties:

Plausible: the failure mode is something that has either happened to a peer company or that our own dependency graph admits as possible.
Specific: it names actual systems, regions, and time-of-day. "AWS goes down" is not a scenario. "us-east-1 EBS control plane is degraded between 14:00 and 17:30 UTC on a Tuesday" is.
Inconvenient: it occurs during a realistic operational moment. The on-call has just paged out, two senior engineers are on vacation, and the deploy that introduced a related change went out 90 minutes ago.

The third property is the one most exercises skip. A scenario that assumes the full team is available, well-rested, and focused tests almost nothing about our actual capacity to recover.

The Format

A session runs 90 minutes with one facilitator, one scribe, and 5-8 participants drawn from on-call, platform, and the affected service teams. We strictly avoid making it a leadership presentation; the most valuable participants are the engineers who would actually execute the recovery.

The facilitator opens with the scenario in narrative form, written in present tense:

It is 16:42 UTC on a Wednesday. PagerDuty alerts fire for elevated 5xx
errors on the checkout service. The on-call engineer (you, Maria)
acknowledges and opens the dashboards. Latency is spiking but the
backend services look healthy.

At 16:47, monitoring confirms us-east-1 RDS Proxy is failing over.
The failover is taking unusually long — three minutes elapsed and
no progress. Twitter shows other companies reporting AWS issues.
The senior database engineer is on PTO until Monday.

What do you do next?

Then the facilitator advances the timeline in 5-15 minute increments, introducing new information as it would have arrived during a real incident: a new alert, a customer complaint forwarded by support, a vendor status-page update that turns out to be wrong.

The Discipline of "Show Me"

The single most important rule: when a participant says they would do X, the facilitator says "show me." Not metaphorically — literally pull up the dashboard, find the runbook, open the terminal.

This is where real problems surface. A participant says "I would page the database team." Show me the rotation. The rotation page shows three names, all on the team that was reorganized last quarter. The actual database expertise now sits in a different group. The page would go nowhere.

Or: "I would restore from the latest snapshot." Show me the snapshot. The snapshot exists but is from 14 hours ago, not the four hours the RPO document claims. Someone changed the backup schedule during a Terraform refactor and nobody noticed.

Or: "I would fail over to the secondary region." Show me the failover runbook. The runbook references aws cli commands that have been deprecated for two years. The replacement commands have different flag semantics. The runbook would fail at step 3.

Every one of these examples is a real finding from our tabletops. They share a pattern: documents and beliefs about our DR capability that did not survive contact with the actual systems.

Recording Findings, Not Vibes

The scribe captures findings in a structured format, not free-form notes:

FINDING-2026Q1-04
Severity: high
Category: stale-runbook
Description: Cross-region failover runbook uses deprecated CLI commands;
             step 3 would error out with no clear recovery path.
Owner: platform-storage
SLA: 14 days
Verification: re-execute steps in DR drill, document outcome

Findings have owners, SLAs, and explicit verification criteria. They live in the same issue tracker as production bugs and have the same accountability. A finding cannot be closed by editing a document — it must be verified by re-running the step that exposed it.

We track the lifecycle of findings across quarters. If a finding from Q1 is still open in Q3, it is escalated. Three consecutive quarters of open findings in a category triggers a leadership conversation about resource allocation. This conversation has happened twice. Both times it resulted in a project getting actual staffing.

Scenarios That Worked Best

Some scenarios produce more findings than others. The ones that have been consistently valuable:

Cascading dependency failure: a SaaS provider you depend on (CDN, auth provider, payment processor) has a partial outage that affects only some of your traffic. Tests routing, fallback policies, and observability of third-party state.
Slow leak: a database is being slowly corrupted over hours; symptoms ramp gradually. Tests detection thresholds, escalation timing, and the human bias toward "let me wait one more cycle to see if it self-resolves."
Bad deploy + dependent change: a feature deploy at hour zero, a dependency change at hour two, an incident at hour four. Tests the ability to bisect without simply reverting everything.
Loss of key personnel: the principal engineer for system X is unreachable. Tests documentation completeness and bus-factor reality.

Scenarios that turned out to be theater: anything where the team is "the AWS region goes down completely." This is too coarse to drive specific findings. Either the multi-region failover plays out as designed and the exercise ends in 20 minutes, or it does not work at all and nobody can recover the conversation.

The Skeptic Role

We rotate one participant into the role of skeptic. Their job is to challenge any claim that has not been demonstrated. "We monitor that" — show me the alert. "The on-call would notice" — show me the past incident where the on-call did notice that. "We have automation for that" — show me it succeeding in a test within the last 90 days.

This role is uncomfortable to play. We added it explicitly because the natural flow of a tabletop pulls toward consensus and reassurance. A designated skeptic gives someone permission to push back against the comfortable narrative.

What Improved

Over two years of quarterly tabletops:

Mean time to detect dropped 47% across the scenarios we have rehearsed, primarily because tabletop findings drove specific alert improvements rather than generic "more monitoring."
Three runbooks that nobody had executed in over a year were either updated or deleted as obsolete.
Two service teams discovered they had no on-call rotation for their critical service. They now do.
One backup strategy was completely redesigned after a tabletop showed the existing one had an RPO three times worse than documented.

None of these findings would have surfaced in a routine review. Each one required a specific scenario that forced a specific claim to be tested. That is the work tabletops do. If yours are not producing concrete, owner-assigned, verified findings, they are not exercises — they are meetings.