Use Cases · May 30

Dock + PagerDuty: on-call runbooks with attributed engineer review

Dock turns PagerDuty incidents into agent-drafted runbooks that an on-call engineer signs off, with every step linked back to the PagerDuty incident and Datadog signal.

By mei · 4 min read · from trydock.ai

When PagerDuty pages an on-call engineer at 2am, the agent should already have a runbook draft waiting. Dock is where that draft lives. The agent reads the PagerDuty incident, pulls the relevant Datadog metrics, drafts remediation steps, and posts a row for the on-call engineer to approve before anything touches production. The page still wakes a human. The runbook arrives pre-written.

PagerDuty and Datadog stay the systems of record for the raw data. Dock is the system of record for what the agent interprets. Each Dock row carries a pointer back to the platform record, plus the agent identity, decision, reviewer, and timestamp. The agent re-fetches platform data via fresh API reads when it needs current state.
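As a rough illustration of what such a row might hold, here is a minimal Python sketch. The `RunbookRow` class, its field names, and the `record_decision` helper are all hypothetical and are not Dock's actual schema or API; they only mirror the fields listed above.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class RunbookRow:
    """Hypothetical row shape; field names are assumptions, not Dock's schema."""
    incident_ref: str                 # pointer to the PagerDuty incident
    signal_ref: str                   # pointer to the Datadog snapshot
    agent_id: str                     # identity of the drafting agent
    diagnosis: str
    proposed_action: str
    decision: str = "pending"         # "pending", "approved", or "escalated"
    reviewer: Optional[str] = None
    decided_at: Optional[datetime] = None


def record_decision(row: RunbookRow, decision: str, reviewer: str,
                    now: Optional[datetime] = None) -> RunbookRow:
    """Return a new row carrying the reviewer's decision and a timestamp."""
    if decision not in {"approved", "escalated"}:
        raise ValueError(f"unsupported decision: {decision!r}")
    return replace(row, decision=decision, reviewer=reviewer,
                   decided_at=now or datetime.now(timezone.utc))
```

The row is frozen so that a decision produces a new record rather than mutating the draft, which keeps the agent's original diagnosis intact next to the reviewer's call.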

The runbook surface

Incident | Service | Datadog signal | Agent diagnosis | Proposed action | On-call review
--- | --- | --- | --- | --- | ---
PD-48211 | checkout-api | p99 latency 4.2s, 12x baseline | DB connection pool exhausted after deploy 7c2a | Roll back deploy 7c2a, drain pool | Approved by @priya 02:14
PD-48213 | search-indexer | error rate 8%, Kafka lag 42k | Consumer group rebalance stuck on broker-3 | Restart consumer group, no data loss expected | Approved by @priya 02:31
PD-48217 | auth-service | 503s from us-east-1 only | Regional NLB health check failing, instances healthy | Page network on-call, do not auto-remediate | Escalated by @priya 03:02

Each row links back to the PagerDuty incident URL and the Datadog dashboard snapshot. The agent re-reads both before proposing the action. Stale state is the failure mode that ends careers, so the agent never relies on what it cached five minutes ago.
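One way to picture the re-read step is a guard that refuses to propose against anything but a fresh read. The sketch below is an assumption about how that could look, not Dock's implementation: `propose_action`, `refetch`, and `build_proposal` are hypothetical names, and the choice to send the row back for a redraft when the state has changed is an illustrative policy.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]


def propose_action(
    drafted_against: State,
    refetch: Callable[[], State],
    build_proposal: Callable[[State], str],
) -> Dict[str, Any]:
    """Re-read platform state and only propose if the draft still matches it."""
    current = refetch()  # always a fresh read, never the cached draft-time state
    if current != drafted_against:
        # The world moved since the draft: do not propose, redraft instead.
        return {"status": "redraft", "state": current}
    return {
        "status": "proposed",
        "state": current,
        "action": build_proposal(current),
    }
```

Passing `refetch` in as a callable keeps the sketch independent of any particular PagerDuty or Datadog client.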

One workflow: PD-48211 walked through

PagerDuty fires at 02:11. The agent picks up the webhook, reads incident PD-48211, queries Datadog for checkout-api p99 and pool metrics, and finds a sharp inflection at deploy time. It drafts a runbook row proposing rollback of deploy 7c2a. Priya, on-call, gets a Slack ping with the Dock row link. She opens it, sees the agent's diagnosis with the Datadog graph embedded, the proposed rollback command, and the dangerous-ops contract it would execute under. She approves. The agent runs the rollback through the same CI pipeline a human would use, then re-checks Datadog and closes the PagerDuty incident with the resolution note. The row stays as the postmortem starting point.
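The walkthrough above can be sketched as an approval gate followed by a verify-then-close step. This is a simplified illustration under stated assumptions: `run_approved_row` and its callable parameters are hypothetical, and the return labels are invented for the example.

```python
from typing import Callable


def run_approved_row(
    decision: str,
    remediate: Callable[[], None],
    signal_recovered: Callable[[], bool],
    close_incident: Callable[[str], None],
    resolution_note: str,
) -> str:
    """Run remediation only after approval, then verify before closing."""
    if decision != "approved":
        return "not_run"                  # nothing touches production
    remediate()                           # e.g. rollback via the CI pipeline
    if not signal_recovered():            # fresh Datadog check after the change
        return "signal_still_firing"      # leave the incident open for the human
    close_incident(resolution_note)
    return "closed"
```

The incident is closed only after the post-remediation check passes, so a failed rollback never produces a resolved incident.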

Why this matters

Runbooks rot. The agent maintains them by reading every incident it touches, but it does not edit production runbooks unilaterally. It proposes diffs in Dock that the on-call engineer or the service owner approves. The agent identity on the row is the same identity that ran the remediation, which means audit and compliance reviewers can trace any production change back to a named agent, a named reviewer, and a named incident. This is the cloud 2.0 shift for engineering: the agent does the toil, the human owns the call, and the substrate records both.
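A proposed runbook change can be represented as a plain unified diff that a reviewer reads before approving. The sketch below uses Python's standard `difflib`; the `propose_runbook_diff` function and the runbook path in the usage note are hypothetical, and how Dock actually stores or renders diffs is not specified here.

```python
import difflib


def propose_runbook_diff(current: str, proposed: str, path: str) -> str:
    """Build a unified diff of a runbook change for human review."""
    diff = difflib.unified_diff(
        current.splitlines(keepends=True),
        proposed.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)
```

For example, calling it with the current and proposed text of a hypothetical `runbooks/checkout-api.md` yields a diff whose added lines start with `+`. Identical inputs yield an empty string, which is a cheap way to skip rows that propose no change.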

The PagerDuty Incident Response guide (response.pagerduty.com) treats the runbook as something a responder reads under stress. Dock treats it as something an agent drafts and a responder approves. Google's SRE chapter on blameless postmortem culture argues the postmortem is a learning artifact, not a blame artifact. When the agent's diagnosis is wrong, the row shows the wrong diagnosis next to the human override. That is the learning loop.

This pattern also covers IT operations incidents that page a different on-call rotation but follow the same draft-then-approve shape.

See the full Dock for DevOps overview for the SLO review, change-management, and chaos-game-day surfaces that sit alongside this one.

FAQ

Does the agent ever auto-remediate without review?
Only for actions explicitly pre-approved in the dangerous-ops contract, like restarting a stateless pod. Anything touching data, deploys, or network config requires a named on-call sign-off on the Dock row.
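A minimal sketch of that rule, assuming the contract can be reduced to an allowlist of action names plus a set of categories that always need review. The constant names, the action identifiers, and the `requires_signoff` function are hypothetical; the real dangerous-ops contract format is not described in this article.

```python
from typing import FrozenSet, Iterable

# Hypothetical contract contents, mirroring the examples in the answer above.
PREAPPROVED_ACTIONS: FrozenSet[str] = frozenset({"restart_stateless_pod"})
ALWAYS_REVIEW: FrozenSet[str] = frozenset({"data", "deploy", "network_config"})


def requires_signoff(action: str, touches: Iterable[str]) -> bool:
    """Return True when a named on-call engineer must approve the action."""
    if ALWAYS_REVIEW.intersection(touches):
        return True                        # data, deploys, network: always reviewed
    return action not in PREAPPROVED_ACTIONS  # default is review, not auto-run
```

The default is deliberately conservative: an action the contract has never heard of requires sign-off.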

What if Datadog and PagerDuty disagree on incident state?
The agent re-fetches both before posting the row. If PagerDuty shows resolved but Datadog still shows the signal, the agent flags the mismatch and asks the on-call to reconcile before closing.
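The mismatch check described above reduces to a small predicate. In this sketch the function name and parameters are hypothetical; `"resolved"` is one of PagerDuty's incident statuses, and whether the Datadog signal is still active is assumed to come from a fresh read done elsewhere.

```python
def needs_reconciliation(pagerduty_status: str, datadog_signal_active: bool) -> bool:
    """True when PagerDuty says resolved but the Datadog signal is still firing."""
    return pagerduty_status == "resolved" and datadog_signal_active
```

Only the one combination named in the answer is flagged; how other combinations are handled is outside what this article specifies.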

How does the runbook get into git?
After the on-call approves the row, the agent opens a PR against the runbook repo with the diff. The PR links the Dock row, the PagerDuty incident, and the Datadog snapshot. The service owner reviews the PR on its own schedule.

Can the agent participate in the postmortem?
Yes. The agent posts its diagnosis, the human override if any, and the actual root cause once known. The row is the postmortem's first draft, not its conclusion.
