Runbooks decay between incidents. The fix is a workflow where an agent reads the last week of PagerDuty incidents, opens the linked GitHub commits and Confluence pages, drafts a patch, and waits for the on-call lead to approve the diff before anything in Confluence changes. Dock holds the queue of drafted patches, the approval decision on each, and a pointer back to the incident that prompted it.
GitHub, Confluence, and PagerDuty stay the system of record for the raw data. Dock is the system of record for what the agent interprets. Each Dock row carries a pointer back to the platform record, plus the agent identity, decision, reviewer, and timestamp. The agent re-fetches platform data via fresh API reads when it needs current state.
The surface: a Runbook Patch Queue table
| incident_id | runbook_page | drafted_change | agent | reviewer | status | decided_at |
|---|---|---|---|---|---|---|
| PD-48211 | Confluence: redis-cluster-failover | Add step 4: drain replica before promote (commit a3f12b9) | Argus | priya@ | approved | 2026-05-28 09:14 |
| PD-48230 | Confluence: api-gateway-5xx | Update threshold from 2% to 0.8% per postmortem | Argus | priya@ | needs-revision | 2026-05-28 14:02 |
| PD-48244 | Confluence: kafka-lag-recovery | New page: consumer-group rebalance under load | Argus | rahul@ | pending | 2026-05-29 07:48 |
Each row links out: incident_id to PagerDuty, runbook_page to Confluence, the commit hash inside drafted_change to GitHub. Dock does not duplicate the runbook text. It holds the proposed patch, the reviewer call, and the trail.
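A minimal sketch of one queue row as a Python dataclass. The field names come from the table columns above; the `PatchRow` and `Status` names, the field types, and the choice to model the row in Python at all are assumptions for illustration, not Dock's actual schema or API.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class Status(Enum):
    """The four status values that appear in the article."""
    PENDING = "pending"
    APPROVED = "approved"
    NEEDS_REVISION = "needs-revision"
    REJECTED = "rejected"


@dataclass
class PatchRow:
    """One row in the queue (hypothetical shape, mirroring the table columns)."""
    incident_id: str      # pointer to the PagerDuty incident, e.g. "PD-48211"
    runbook_page: str     # pointer to the Confluence page, not a copy of its text
    drafted_change: str   # the proposed patch, with any commit hash inline
    agent: str            # named agent identity, not a shared service account
    reviewer: str
    status: Status = Status.PENDING
    decided_at: Optional[datetime] = None


# The first row of the table, expressed as a record.
row = PatchRow(
    incident_id="PD-48211",
    runbook_page="Confluence: redis-cluster-failover",
    drafted_change="Add step 4: drain replica before promote (commit a3f12b9)",
    agent="Argus",
    reviewer="priya@",
    status=Status.APPROVED,
    decided_at=datetime(2026, 5, 28, 9, 14),
)
print(row.incident_id, row.status.value)
```

The row stores identifiers and the proposed change only. The runbook body itself stays in Confluence and is re-read when needed.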
The workflow
The agent runs weekly. It pulls the last seven days of resolved PagerDuty incidents, reads the postmortem field, and fetches the Confluence runbook each one referenced. It diffs the runbook against what responders actually did, drafts a patch, and writes a row. The on-call lead opens the queue, reads the diff, and sets status to approved, needs-revision, or rejected. Approved rows trigger the agent to push the patch to Confluence and link the new page version back into the row. Rejected rows stay in Dock as a record of where the agent was wrong, which is what the agent audit log needs.
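The draft-and-review steps above can be sketched with Python's standard `difflib`. This is an illustration under stated assumptions, not the agent's real implementation: the function names (`draft_patch`, `write_row`, `decide`), the dict-shaped row, and the example step lists are all hypothetical, and the PagerDuty and Confluence fetches are left out entirely.

```python
import difflib
from datetime import datetime, timezone

# The three calls the on-call lead can make, per the workflow text.
REVIEW_OUTCOMES = {"approved", "needs-revision", "rejected"}


def draft_patch(published_steps, responder_steps, page):
    """Diff the published runbook against what responders actually did."""
    diff = difflib.unified_diff(
        published_steps,
        responder_steps,
        fromfile=f"{page} (published)",
        tofile=f"{page} (drafted)",
        lineterm="",
    )
    return "\n".join(diff)


def write_row(incident_id, page, patch, agent, reviewer):
    """A new row always starts as pending; nothing is published yet."""
    return {
        "incident_id": incident_id,
        "runbook_page": page,
        "drafted_change": patch,
        "agent": agent,
        "reviewer": reviewer,
        "status": "pending",
        "decided_at": None,
    }


def decide(row, outcome, now=None):
    """Record the reviewer's call; anything outside the three outcomes is an error."""
    if outcome not in REVIEW_OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome!r}")
    now = now or datetime.now(timezone.utc)
    return {**row, "status": outcome, "decided_at": now}


# Hypothetical step lists. In the real workflow these would come from the
# Confluence page and from the PagerDuty postmortem field.
published = [
    "Confirm primary is unreachable",
    "Pick the most caught-up replica",
    "Promote replica",
]
responders = [
    "Confirm primary is unreachable",
    "Pick the most caught-up replica",
    "Drain replica before promote",
    "Promote replica",
]

patch = draft_patch(published, responders, "redis-cluster-failover")
row = write_row("PD-48211", "redis-cluster-failover", patch, "Argus", "priya@")
row = decide(row, "approved")
print(patch)
```

`decide` returns a new dict rather than mutating the row, so a rejected draft and its diff remain available unchanged for the audit trail.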
Publishing to Confluence is a contained write: one page, prior version stored, easy rollback. The dangerous ops contract treats runbook patches as low-risk after approval but never auto-applies the draft.
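One way to picture the contained write is an approval gate in front of a versioned page. The sketch below uses an in-memory stand-in (`InMemoryPage`) instead of the Confluence API, and `apply_if_approved` is a hypothetical helper; both are assumptions for illustration. Rollback here republishes the prior body as a new version so history is never lost, which is a design choice of this sketch rather than something the article specifies.

```python
class InMemoryPage:
    """Hypothetical stand-in for one Confluence page; keeps every version."""

    def __init__(self, body):
        self.versions = [body]

    @property
    def body(self):
        return self.versions[-1]

    def publish(self, new_body):
        """Append a new version and return its 1-based version number."""
        self.versions.append(new_body)
        return len(self.versions)

    def rollback(self):
        """Restore the previous body by publishing it again as a new version."""
        if len(self.versions) < 2:
            raise ValueError("no prior version to restore")
        return self.publish(self.versions[-2])


def apply_if_approved(row, page, new_body):
    """Publish only when the row is approved; a draft is never auto-applied."""
    if row["status"] != "approved":
        raise PermissionError(
            f"{row['incident_id']}: status is {row['status']!r}, not approved"
        )
    version = page.publish(new_body)
    # Link the new page version back into the row.
    return {**row, "published_version": version}


page = InMemoryPage("Promote replica")
pending = {"incident_id": "PD-48244", "status": "pending"}
approved = {"incident_id": "PD-48211", "status": "approved"}

try:
    apply_if_approved(pending, page, "Drain replica before promote\nPromote replica")
except PermissionError as exc:
    print("blocked:", exc)

updated = apply_if_approved(
    approved, page, "Drain replica before promote\nPromote replica"
)
page.rollback()  # one call restores the prior body
```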
Why this matters
The Google SRE Workbook is direct about why runbooks rot: "Details in playbooks go out of date at the same rate as production environment changes." With daily releases, that means runbooks drift every day. The SRE Book adds that on-call engineers should update documentation during incident response, when context is freshest. Neither practice survives without a forcing function. The same pattern shows up in IT operations for ticket macros and known-error articles.
The architectural point is the same one running through Cloud 2.0 for engineering: the agent does not own the runbook and does not silently rewrite it. The runbook lives in Confluence. The interpretation, the draft, and the human call live in Dock, with pointers back to every source. When the next incident fires, the responder can ask why a step was added and read the row that authorized it.
Spin up the Runbook Patch Queue table and point an agent at last week's PagerDuty incidents. Read the DevOps pillar for the rest of the on-call cluster.
FAQ
Does the agent edit Confluence directly? Only after the on-call lead approves the row. The draft sits in Dock until then. Approval is the trigger; the agent never auto-publishes.
What if the agent drafts a wrong patch? The reviewer sets status to rejected with a note. The row stays, which gives the next reviewer a record of where the agent misreads incidents.
Where does the agent identity come from? Each agent has its own credential and shows up in the row by name, not as a shared service account. That is how the audit log stays useful.
Does this replace postmortems? No. Postmortems still happen in their usual format. The agent reads the postmortem field and turns the action items that touch runbooks into drafted patches in the queue, with the same audit trail the rest of the agent's work carries.