Runbook Automation: From Markdown to Executable Recovery
Every team has a folder full of runbooks. Most of them have not been touched in months. The author left. The CLI flags changed. The fourth step references a tool that has been replaced. When an incident actually requires the runbook, an on-call engineer is reading stale Markdown in the middle of the night and translating it to commands by hand.
We spent the last year migrating from Confluence runbooks to executable recovery procedures. The result is not a single tool but a layered approach: human-readable intent at the top, executable code at the bottom, and a contract that keeps the two from drifting apart.
The Drift Problem
Runbooks rot for a predictable reason: they are documentation, and documentation has no consumer until it is needed. CI does not test them. PRs do not update them. A new engineer joining the team has no signal that a runbook is wrong until they execute it during an outage.
Our internal audit covered 174 runbooks. Of those, 89 referenced commands that produced errors when run as-is. Forty-one referenced services that had been renamed or decommissioned. Eleven were so out-of-date that the entire incident class they covered no longer existed.
The fix is not "write better documentation." The fix is to make runbooks executable artifacts that fail loudly when reality changes.
The Three-Layer Structure
Every runbook now has three layers, each in version control:
- Intent (Markdown): a short narrative explaining what the runbook fixes, when to invoke it, and what success looks like.
- Procedure (YAML): a list of steps with explicit pre-conditions, executable commands, and post-conditions.
- Tests (shell): a synthetic environment that exercises the procedure against a sandbox cluster on every merge to main.
The Markdown layer answers "should I run this?" The YAML layer answers "what does running this actually do?" The test layer answers "does it still work?"
Procedure Format
The YAML format is intentionally minimal. We resisted the urge to build a full DSL:
    name: postgres-replica-promote
    description: Promote a PostgreSQL replica to primary when Patroni fails to elect.
    triggers:
      - alert: PostgresPrimaryDown
        duration: 5m
    preconditions:
      - name: patroni-quorum-lost
        check: test "$(patronictl list | grep -c Replica)" -ge 2
      - name: primary-unreachable
        check: pg_isready -h ${PRIMARY_HOST} -t 5; test $? -ne 0
    steps:
      - id: identify-best-replica
        command: patronictl list --format json | jq -r 'sort_by(.Lag)[0].Member'
        capture: TARGET_REPLICA
      - id: promote
        command: patronictl failover --candidate ${TARGET_REPLICA} --force
        timeout: 60s
      - id: verify-writes
        command: psql -h ${TARGET_REPLICA} -c "CREATE TEMP TABLE t (x int); DROP TABLE t;"
    postconditions:
      - name: new-primary-writable
        check: psql -h ${TARGET_REPLICA} -At -c "SELECT pg_is_in_recovery()" | grep -qx f
    rollback: |
      patronictl reinit ${TARGET_REPLICA} --force
Three properties matter. First, every step is a real shell command that an operator can copy and paste once captured variables such as ${TARGET_REPLICA} are filled in. Second, preconditions and postconditions are checked automatically, so a runbook will not execute if the environment is in a state it does not understand. Third, the rollback block is required: a procedure that cannot be undone must say so explicitly.
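The second and third properties can be sketched as a small validator and check runner. This is an illustrative sketch, not the executor described in this article: it assumes the YAML has already been parsed into a Python dict, and the names `validate_procedure`, `run_check`, and `failed_checks` are hypothetical.

```python
import subprocess

# Assumption: these are the top-level keys a procedure file must carry.
REQUIRED_KEYS = ("name", "preconditions", "steps", "postconditions", "rollback")


def validate_procedure(proc: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the file is usable."""
    problems = [f"missing required key: {key}" for key in REQUIRED_KEYS if key not in proc]
    for index, step in enumerate(proc.get("steps", [])):
        for field in ("id", "command"):
            if not step.get(field):
                problems.append(f"step {index}: missing {field}")
    # An irreversible procedure must say so in words rather than leave rollback blank.
    if "rollback" in proc and not str(proc["rollback"]).strip():
        problems.append("rollback is empty; state explicitly that the procedure cannot be undone")
    return problems


def run_check(command: str, timeout: float = 30.0) -> bool:
    """A check passes when its shell command exits with status 0 before the timeout."""
    try:
        done = subprocess.run(command, shell=True, timeout=timeout, capture_output=True)
    except subprocess.TimeoutExpired:
        return False
    return done.returncode == 0


def failed_checks(checks: list[dict], check_runner=run_check) -> list[str]:
    """Names of the checks that did not pass, in declaration order."""
    return [c["name"] for c in checks if not check_runner(c["check"])]
```

An executor built this way would refuse to start when `failed_checks(proc["preconditions"])` is non-empty, and would report the procedure as failed when the same call on `proc["postconditions"]` is non-empty after the last step.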
Execution Modes
The same procedure file is consumed by three different execution modes, depending on operator trust:
- Dry-run: prints what would be executed, evaluates preconditions only. Used during incident triage to confirm the runbook applies.
- Interactive: prompts before each step, shows the command, captures output. Used when the on-call wants the executor for argument substitution but not full automation.
- Auto: executes the entire procedure non-interactively. Reserved for procedures with a long track record and explicit auto-approval in the metadata.
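One way the three modes could share a single code path is sketched below. It is a simplified illustration under stated assumptions: steps are already parsed into dicts, `${NAME}` references are filled in with Python's `string.Template`, and the names `Mode`, `shell`, and `execute` are hypothetical. Precondition checks, timeouts, and audit logging are left out.

```python
import subprocess
from enum import Enum
from string import Template


class Mode(Enum):
    DRY_RUN = "dry-run"
    INTERACTIVE = "interactive"
    AUTO = "auto"


def shell(command: str) -> tuple[int, str]:
    """Run a command through the shell and return (exit code, stripped stdout)."""
    done = subprocess.run(command, shell=True, capture_output=True, text=True)
    return done.returncode, done.stdout.strip()


def execute(steps, mode, variables=None, run=shell, confirm=input, log=print):
    """Walk the steps in order; return the ids of the steps that actually ran."""
    variables = dict(variables or {})
    ran = []
    for step in steps:
        # safe_substitute leaves ${NAME} untouched when NAME is not known yet,
        # which is what a dry run shows for values captured by earlier steps.
        command = Template(step["command"]).safe_substitute(variables)
        log(f"[{step['id']}] {command}")
        if mode is Mode.DRY_RUN:
            continue  # print only, never execute
        if mode is Mode.INTERACTIVE and confirm("run this step? [y/N] ").strip().lower() != "y":
            log("stopped by operator")
            break
        code, output = run(command)
        if code != 0:
            raise RuntimeError(f"step {step['id']} exited with status {code}")
        if "capture" in step:
            variables[step["capture"]] = output  # make stdout available to later steps
        ran.append(step["id"])
    return ran
```

Passing `run`, `confirm`, and `log` as parameters keeps the loop testable without a real shell or terminal.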
Auto mode is gated. A procedure must run successfully in interactive mode at least 20 times across at least 5 distinct operators before it is eligible for auto-promotion. The metric is collected automatically from the executor's audit log.
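The run-count gate can be expressed in a few lines. The sketch below assumes a simplified audit-log record; the `AuditEntry` fields and the `auto_eligible` name are hypothetical, and only the two thresholds come from the rule above. The explicit auto-approval flag in the procedure metadata is a separate check that is not shown.

```python
from dataclasses import dataclass

MIN_SUCCESSFUL_RUNS = 20
MIN_DISTINCT_OPERATORS = 5


@dataclass(frozen=True)
class AuditEntry:
    procedure: str
    operator: str
    mode: str        # "dry-run", "interactive", or "auto"
    succeeded: bool


def auto_eligible(procedure: str, audit_log: list[AuditEntry]) -> bool:
    """True when the procedure has enough successful interactive runs by enough operators."""
    runs = [
        entry for entry in audit_log
        if entry.procedure == procedure and entry.mode == "interactive" and entry.succeeded
    ]
    operators = {entry.operator for entry in runs}
    return len(runs) >= MIN_SUCCESSFUL_RUNS and len(operators) >= MIN_DISTINCT_OPERATORS
```

Counting only successful interactive runs means dry runs and failed attempts never move a procedure closer to auto mode.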
Synthetic Testing
The test layer is where most teams give up. We accepted that not every runbook can be tested cheaply, and tiered them:
- Tier 1 (must test): procedures that auto-execute, procedures that touch production data, procedures invoked more than once per month.
- Tier 2 (best effort): procedures with deterministic preconditions that can be simulated.
- Tier 3 (untested): procedures dependent on physical infrastructure (DC power, network hardware) where simulation is impractical.
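The tier rules above can be written down as a small classifier. This is a sketch with hypothetical names (`ProcedureFacts`, `assign_tier`), and the precedence is an assumption: it lets the tier 1 criteria win over everything else, which the list above does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcedureFacts:
    auto_executes: bool
    touches_production_data: bool
    invocations_per_month: float
    preconditions_simulable: bool
    needs_physical_infrastructure: bool


def assign_tier(facts: ProcedureFacts) -> int:
    """Return 1 (must test), 2 (best effort), or 3 (untested)."""
    # Assumption: any tier 1 criterion takes precedence over the others.
    if facts.auto_executes or facts.touches_production_data or facts.invocations_per_month > 1:
        return 1
    if facts.needs_physical_infrastructure:
        return 3
    if facts.preconditions_simulable:
        return 2
    # Nothing can be simulated cheaply, so record it as explicitly untested.
    return 3
```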
For tier 1, we maintain a kind cluster pre-populated with the services we care about. The test harness spins it up, induces the failure (kills a pod, partitions a network, fills a disk), runs the procedure in auto mode, and asserts the postconditions. A failing test blocks the PR that broke it.
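The shape of one such test can be sketched as follows. The function name `run_recovery_test` and its callable parameters are hypothetical; cluster creation and the actual failure injection are assumed to live behind those callables and are not shown.

```python
def run_recovery_test(induce_failure, run_procedure, postconditions, check, teardown=lambda: None):
    """Break the sandbox, run the procedure, and return the names of failed postconditions."""
    try:
        induce_failure()   # e.g. kill a pod, partition a network, fill a disk
        run_procedure()    # the procedure under test, executed in auto mode
        return [p["name"] for p in postconditions if not check(p["check"])]
    finally:
        teardown()         # always clean up the sandbox, even when a step raises
```

A CI job would fail the build whenever the returned list is non-empty or the call raises.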
Connecting to Alerts
Each procedure declares its triggers. The alert routing layer reads these declarations and adds a link to the appropriate procedure in the page notification. The on-call sees:
[PAGE] PostgresPrimaryDown - shard-4 unreachable for 6m
Suggested procedure: postgres-replica-promote
Last test: PASS (3h ago)
Auto-eligible: yes (24 successful interactive runs)
[ack] [dry-run] [interactive] [auto]
The "last test" and "auto-eligible" fields are essential. They tell the on-call whether the procedure is current and whether someone trusts it enough to let it run unattended.
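A rough sketch of how trigger declarations could be turned into that notification follows. The function names and the shape of the `status` mapping are assumptions made for illustration; the real routing layer and its data sources are not described here, and the action buttons are omitted.

```python
def index_by_alert(procedures):
    """Map each alert name to the procedures that declare it as a trigger."""
    index = {}
    for proc in procedures:
        for trigger in proc.get("triggers", []):
            index.setdefault(trigger["alert"], []).append(proc["name"])
    return index


def annotate_page(alert, summary, index, status):
    """Render the page text, adding test and eligibility status for each suggested procedure."""
    lines = [f"[PAGE] {alert} - {summary}"]
    for name in index.get(alert, []):
        info = status[name]
        eligible = "yes" if info["auto_eligible"] else "no"
        lines.append(f"Suggested procedure: {name}")
        lines.append(f"Last test: {info['last_test']}")
        lines.append(f"Auto-eligible: {eligible} ({info['interactive_runs']} successful interactive runs)")
    return "\n".join(lines)
```

An alert with no matching procedure produces only the `[PAGE]` line, so pages for uncovered alerts still go out unchanged.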
What We Got Wrong
The first iteration tried to be clever. We built a full workflow engine with branching logic, conditional steps, and parallel execution. Nobody used it. The runbooks that mattered were linear, and the cognitive cost of debugging a branching procedure during an incident exceeded the value of the abstraction.
The second iteration overcorrected and made procedures so dumb that they could not represent common patterns like "try option A, fall back to option B if A fails." We added a single fallback primitive — a step can declare an on_failure command — and stopped there.
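The fallback primitive is small enough to sketch in full. The `on_failure` key comes from the description above; the function name `run_step`, the returned labels, and the choice to try the fallback exactly once are assumptions for illustration.

```python
def run_step(step, run):
    """Run a step; if it fails and declares on_failure, run that command once.

    `run` takes a command string and returns its exit code.
    """
    if run(step["command"]) == 0:
        return "ok"
    fallback = step.get("on_failure")
    if fallback is None:
        return "failed"
    # Option A failed, so try option B; there is no further nesting or branching.
    return "recovered" if run(fallback) == 0 else "failed"
```

Because the fallback is a plain command rather than another step, a procedure stays linear and can still be read top to bottom during an incident.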
We also initially required every procedure to have a test. This produced low-quality tests that exercised happy paths and missed real failure modes. Marking tier 3 procedures as explicitly untested was healthier than pretending all procedures had coverage.
Outcomes
Twelve months in, the numbers we care about:
- Mean time to recovery for incidents covered by auto-eligible procedures dropped 64%.
- Of 174 original runbooks, 92 were ported, 51 were deleted as obsolete, and 31 are pending. The deletion rate alone justified the effort.
- Time-to-first-action during pages dropped from 4 minutes (read wiki, find runbook, parse Markdown) to under 30 seconds (click link, dry-run, decide).
- Three near-misses were caught by failing CI tests on PRs that would have silently broken recovery procedures.
The point of runbook automation is not to remove humans from incident response. It is to make sure that when a human shows up at 3 AM, the tools in front of them actually work. Markdown alone cannot make that promise.