---
title: "Dangerous agent operations: the contract that stops them"
excerpt: "When an agent does something an agent shouldn't, the failure usually has a specific shape. Not malicious. Not philosophically off the rails. Mundane: the model found a path where the locally-optimal next step was destructive, took it, and now the cost is real. The protection is a stable, audited list of operations the agent cannot run alone."
author: scout
category: Engineering
date: "2026-05-19"
---

When an agent does something an agent shouldn't, the failure rarely looks like a movie villain. It looks mundane. The agent found a path through your prompt where the locally-optimal next step was a destructive action, took it, and now the cost is real: a refund that shouldn't have processed, a row deleted instead of marked archived, a Slack message sent to a customer at 3am, a credential rotated without a rollback path.

The model wasn't malicious. The orchestration wasn't broken. The agent did exactly what the optimization gradient pointed at. That is the problem. Optimization gradients don't carry the context that says "this specific class of operations requires a human to look at it before it fires."

This essay is the pillar for the design pattern that fixes it. The shape: a short, stable, **audited list of dangerous operations the agent cannot run without a human-in-the-loop confirmation**. The contract is in code, not in a prompt. The gate is enforced server-side, not left to the model's promise. The audit log keeps the gate honest.

## Why prompt-based guardrails don't work

The first instinct most teams have is to put guardrails in the prompt. "Don't refund customers without confirming first." "Don't delete rows ending in -prod." "Always ask before sending email."

These guardrails fail predictably. Three reasons:

**Prompts are advisory, not enforced.** The model can ignore a system instruction when the user's framing pulls against it. The model can be jailbroken. The model can simply hallucinate that it confirmed when it didn't. None of these are model bugs; they're the cost of using natural-language instructions as a security boundary.

**Guardrails accumulate without bound.** Every incident produces a new "remember to do X" line in the prompt. After a year, the prompt is 4,000 tokens of accumulated incident history, the model's attention budget is shot, and the agent's main job suffers because it's spending capacity on dodging the last fire.

**The blast-radius is wrong.** A prompt-level rule says "ask before refunding." The right-shaped rule says "the refund API call requires a confirmation token before it fires." The first is a suggestion to the model. The second is a fact about the system. Only the second survives when the model has a bad day.

The fix is not better prompts. The fix is to move the gate out of the prompt and into the API surface itself.

## The four pieces of agent-safe operations

Once the gate moves out of the prompt, four pieces compose into a real architecture:

- **The contract.** A short, named list of operations the agent cannot run without confirmation. Lives in code, audited, change-controlled.
- **The consent gate.** The technical mechanism that enforces the contract: the first call to a dangerous op returns a confirmation token, the agent surfaces it to its principal, the principal confirms, the agent re-calls with the token.
- **The two-key handshake.** A stronger variant for the highest-stakes operations: two distinct principals must approve before the operation fires. Used for the small set of changes where a single compromised agent should not be enough.
- **The scope boundary.** The OAuth-style permission boundary that says which tools an agent can even attempt to call. The contract gates the dangerous calls; the scope rules out the irrelevant ones.

The four are sequenced. Scope rules out what the agent shouldn't call. Contract gates what the agent can call but shouldn't run alone. Consent gate is the mechanism that implements the contract. Two-key handshake hardens the contract for the few cases that warrant it.

Skip any one and the architecture has a hole. Scope without contract means the agent can do anything inside its scope. Contract without consent gate means the rule isn't enforced. Consent gate without two-key handshake means a single compromised principal is enough to approve anything in the contract. Two-key handshake without scope means the agent could in principle call operations that should never have been on the menu.

## The contract: what's on the list

The contract is the short list of operations that require human confirmation. The list should be **small** (a few items, not a few dozen), **stable** (changes are reviewable events), and **audited** (every gated call is recorded). For Dock, the contract currently has two operations on it: `upgrade_plan` and `downgrade_plan`. Each one moves real money or permanently changes org state. Each one has a confirmation summary template, a fast-path, and an audit row per invocation.

The rule for what's on the list is concrete. An operation belongs on the contract if:

1. **It's irreversible or expensive to reverse.** A refund moves money out. A deletion drops data. A subscription downgrade changes plan caps that the team relied on. If the cost of "agent did it incorrectly" is higher than the cost of "agent had to wait for a human," the op belongs on the contract.
2. **The agent has no in-context way to verify intent.** If the model has to ask, it has to ask outside its own session. The contract is the structural way to ask.
3. **The blast radius exceeds the agent's current scope.** Operations that escalate (raise plan caps, grant access, mint credentials) belong on the contract because their downstream effects outlive the immediate session.

Every operation outside the contract runs without a gate. That's the design. The contract's value is that it stays short — if every operation needed confirmation, the team would be in the loop for every call and the agent's autonomy would be worthless. The contract enumerates the ones where waiting is cheap and being wrong is expensive.
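As a rough illustration of "the contract lives in code," the list can be a reviewed constant that the dispatcher consults before running any tool. This is a sketch under assumptions: only the two operation names come from Dock's contract; the `ContractEntry` fields, the gate levels assigned to each op, and the `gateFor` helper are hypothetical.

```typescript
// Hypothetical sketch: the contract as a reviewed constant in code.
// Only the two operation names come from the article; the field names,
// gate levels, and gateFor helper are assumptions for illustration.

type GateLevel = "single" | "two-key";

interface ContractEntry {
  gate: GateLevel;
  reason: string; // why the op is on the list, for reviewers
}

// A Map avoids accidental hits on Object.prototype keys like "constructor".
const DANGEROUS_OPS: ReadonlyMap<string, ContractEntry> = new Map<string, ContractEntry>([
  ["upgrade_plan", { gate: "single", reason: "moves real money" }],
  ["downgrade_plan", { gate: "single", reason: "changes plan caps the team relied on" }],
]);

// Anything not on the list runs ungated; that is the design.
function gateFor(operation: string): GateLevel | "none" {
  const entry = DANGEROUS_OPS.get(operation);
  return entry ? entry.gate : "none";
}

console.log(gateFor("upgrade_plan")); // prints: single
console.log(gateFor("update_row")); // prints: none
```

Because the list is a single constant, adding an operation shows up as a one-line diff, which is what makes changes to the contract reviewable events.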

We've written the worked example at length in [the dangerous-ops contract](/blog/dangerous-ops-contract): the two operations currently on Dock's list, the template for each confirmation, and the fast-paths.

## The consent gate: how the contract is enforced

The consent gate is the mechanism. It implements the contract at the API surface.

The shape that works in practice has four steps:

- **First call**: the agent invokes the dangerous op. The handler returns a confirmation token plus a human-readable summary. No side effects fire.
- **Surface**: the agent shows the summary to its principal (human, or another agent in a two-key flow). The summary describes what is about to happen in concrete terms ("Upgrade dock/team-acme from Pro ($19/mo) to Scale ($49/mo), recurring").
- **Confirm**: the principal confirms the action. The confirmation produces a signed token specific to (org, principal, operation, params) with a short TTL (typically 60s).
- **Second call**: the agent re-calls the same operation with the token. The handler validates the token's (org, principal, operation, params) match the current call, then fires the side effect. The token is consumed; it cannot be replayed.

The design eliminates several classes of failure. The agent can't run the op without showing the summary (because the first call doesn't fire). The agent can't replay a stale confirmation (because the token is single-use and TTL'd). The agent can't substitute different params after confirmation (because the token is bound to the params). The audit log records both the gate-open and the gate-fire moments.
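The four steps above can be sketched as a small in-memory gate. This is a minimal illustration, not Dock's `billing-consent.ts`: the `ConsentGate` class, its method names, and the `Call` shape are all hypothetical. One deliberate simplification: instead of minting a signed token at confirmation time, the sketch keeps the binding server-side in a map keyed by an opaque token, which gives the same binding, TTL, and single-use properties within one process.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical shapes; none of these names come from a real library.
type Params = Record<string, string | number>;
interface Call { org: string; principal: string; operation: string; params: Params }
interface Pending { binding: string; principal: string; confirmedAt: number | null }
type Result = { ok: true } | { ok: false; reason: string };

// Canonical string for (org, principal, operation, params). Keys are
// sorted so that param order does not change the binding.
function bindingOf(c: Call): string {
  const sorted = Object.keys(c.params).sort().map((k) => [k, c.params[k]]);
  return JSON.stringify([c.org, c.principal, c.operation, sorted]);
}

class ConsentGate {
  private pending = new Map<string, Pending>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number = 60000, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // First call: record the request and return token + summary. No side effect.
  request(call: Call, summary: string): { token: string; summary: string } {
    const token = randomUUID();
    this.pending.set(token, {
      binding: bindingOf(call),
      principal: call.principal,
      confirmedAt: null,
    });
    return { token, summary };
  }

  // Confirm: meant to be reached from the principal's session, not the agent's.
  confirm(token: string, principal: string): boolean {
    const p = this.pending.get(token);
    if (!p || p.confirmedAt !== null || p.principal !== principal) return false;
    p.confirmedAt = this.now(); // the TTL starts at confirmation
    return true;
  }

  // Second call: validate, consume the token, then fire the side effect.
  execute(call: Call, token: string, effect: () => void): Result {
    const p = this.pending.get(token);
    if (!p) return { ok: false, reason: "unknown or already-used token" };
    this.pending.delete(token); // single use, even when validation fails
    if (p.confirmedAt === null) return { ok: false, reason: "not confirmed" };
    if (this.now() - p.confirmedAt > this.ttlMs) return { ok: false, reason: "expired" };
    if (p.binding !== bindingOf(call)) return { ok: false, reason: "call does not match confirmation" };
    effect();
    return { ok: true };
  }
}
```

Consuming the token on any `execute` attempt, including failed ones, is a conservative choice: a mismatched second call forces a fresh confirmation. A production version would also need to expire requests that are never confirmed and persist state across processes.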

For the depth on token binding, TTL choice, and the consent-gate library shape, see [consent gates for dangerous operations](/blog/consent-gates-for-dangerous-ops).

## The two-key handshake: when one principal isn't enough

For the highest-stakes class of operations, single-principal consent isn't enough. A compromised agent + a compromised principal can still get one approval through.

The two-key handshake adds a second principal. The operation fires only when two distinct principals (typically: the agent's owning human and a second human, or two humans in different roles) both produce confirmation tokens, within a shared TTL window, for the same operation and params. The handler verifies both tokens before firing.
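The handler-side check described above could look roughly like this. It is a sketch, not Dock's implementation: the `Approval` shape, the `paramsKey` field (a canonical serialization of the params), and the `twoKeySatisfied` function are hypothetical, and the TTL is left as a parameter because the right window depends on how quickly two humans can realistically respond.

```typescript
// Hypothetical record of one principal's confirmation.
interface Approval {
  principal: string;
  operation: string;
  paramsKey: string; // canonical serialization of the params
  approvedAt: number; // epoch milliseconds
}

// True only when two distinct principals approved the same operation with
// the same params, and both approvals are still inside the shared window.
function twoKeySatisfied(
  a: Approval,
  b: Approval,
  operation: string,
  paramsKey: string,
  now: number,
  ttlMs: number,
): boolean {
  const distinct = a.principal !== b.principal;
  const sameTarget = [a, b].every(
    (x) => x.operation === operation && x.paramsKey === paramsKey,
  );
  const fresh = [a, b].every(
    (x) => x.approvedAt <= now && now - x.approvedAt <= ttlMs,
  );
  return distinct && sameTarget && fresh;
}
```

The `distinct` check carries the weight here: without it, one compromised principal could supply both keys.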

Two-key is the right shape for:

- Org-level changes (deleting an org, transferring ownership)
- Mass mutations (anything that affects more than N rows or more than M dollars)
- Credential operations (rotating root tokens, granting admin)
- Production data writes from staging principals (the cross-environment hop is the failure mode)

Two-key is the wrong shape for everything else. Every operation that requires two humans to confirm is an operation that slows down to human speed. The two-key list has to stay even smaller than the single-principal contract: it is for the operations where two-human pace is the right pace.

For the full pattern, including the cross-principal coordination mechanics, see [two-key handshakes for irreversible agent actions](/blog/two-key-handshakes-irreversible).

## The scope boundary: OAuth for agents

The contract gates the dangerous calls. The scope rules out the irrelevant ones.

OAuth scopes are the inherited shape for declaring what an agent can attempt to call. They were designed for users granting limited access to apps; they work just as well for users granting limited access to agents. The categories an agent's scope should cover:

- **Resource scope**: which workspaces, which orgs, which docs.
- **Operation scope**: which tools (`read_row`, `update_row`, `delete_row` separately, not "manage rows" as a bundle).
- **Time scope**: token expiry, scope-specific TTLs for sensitive operations.
- **Sub-principal scope**: which agents owned by which humans, in which orgs.

The mistake to avoid is **scope inheritance from the user without modification**. If an agent runs with the user's OAuth token, the agent has every scope the user has. The right shape is per-agent scope assignment: each agent gets its own token, with scopes narrower than the user's. The owning user delegates a slice of their access, not all of it.
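One way to picture per-agent scope assignment is as an intersection: the agent gets what it requested, capped by what the owning user actually holds. The sketch below is illustrative only. `AgentScope`, `narrowScope`, and `canAttempt` are hypothetical names, and the flat string arrays stand in for whatever scope representation your token format really uses.

```typescript
// Hypothetical scope shape covering resource, operation, and time scope.
interface AgentScope {
  resources: string[]; // e.g. workspace or org identifiers
  operations: string[]; // individual tools, not bundles
  expiresAt: number; // epoch milliseconds
}

// The agent's scope is the request capped by the user's own scope, so the
// agent can never hold more than the user, and usually holds less.
function narrowScope(user: AgentScope, requested: AgentScope): AgentScope {
  return {
    resources: requested.resources.filter((r) => user.resources.includes(r)),
    operations: requested.operations.filter((o) => user.operations.includes(o)),
    expiresAt: Math.min(user.expiresAt, requested.expiresAt),
  };
}

// Checked before the contract: can the agent even attempt this call?
function canAttempt(
  scope: AgentScope,
  resource: string,
  operation: string,
  now: number,
): boolean {
  return (
    now < scope.expiresAt &&
    scope.resources.includes(resource) &&
    scope.operations.includes(operation)
  );
}
```

Listing `read_row`, `update_row`, and `delete_row` as separate operations is what lets a grant include the first two without the third.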

The deeper pattern (and the specific failures we've hit with OAuth as it stands) is in [OAuth scopes for agents](/blog/oauth-scopes-for-agents).

## How to build a dangerous-ops contract

If you're designing the safety layer for an agent platform, walk through these six steps in order. Each step is a fork that affects the others.

1. **Enumerate the operations whose worst case is unacceptable.** Write the list. For each, ask: "if the agent ran this incorrectly, what would it cost to recover?" If the answer is "we couldn't fully recover," the op belongs on the contract. If the answer is "we'd just retry," it doesn't. Most platforms end up with 2–6 operations on the initial contract; growth past 10 is a smell.

2. **Pick the consent-gate library shape.** First-call returns token, principal confirms, second-call consumes token. Avoid building two different consent mechanisms across two teams; pick one, share the library, every gated op routes through it. We use a single `billing-consent.ts` shape in Dock; new contract ops extend it.

3. **Define the confirmation summary template for each op.** The summary is what the principal sees at confirmation time. It must show the concrete params, not just the operation name. "Upgrade plan" is wrong; "Upgrade dock/team-acme from Pro ($19/mo) to Scale ($49/mo) effective immediately" is right. Templates live next to the gate handler, not in the prompt.

4. **Identify the fast-paths.** Some gated ops have circumstances where they can fire without confirmation: same-day retries, cron-triggered restorations, idempotency-key matches. Document each fast-path explicitly. Fast-paths that are "the model decided this is safe" don't count; the fast-path must be an objective property of the call.

5. **Promote the two-key handshake for the small subset.** Walk the contract. For each item, ask: "would one compromised principal be enough?" The ops where the answer is no get promoted to two-key. Two-key adds friction; promote sparingly. Org-deletion, credential rotation, and cross-environment writes are the usual candidates.

6. **Wire the audit log.** Every gated call records: who called, what was called, what params, what summary was shown, whether confirmation happened, who confirmed, when the second call fired, and whether the token validated. Six months from now this is how you reconstruct what happened. Two months from now this is how you prove you didn't mis-fire a customer-facing action.

The team that answers all six concretely has a real dangerous-ops architecture. The team that answers four is shipping prompt-based guardrails with extra steps.
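Step 3 is the easiest to make concrete. A sketch of summary templates that live next to the gate handler might look like the following; the `PlanChange` fields, the `SUMMARY_TEMPLATES` map, and the wording of the downgrade template are assumptions, with only the upgrade wording taken from the example in step 3.

```typescript
// Hypothetical params for the two plan operations.
interface PlanChange {
  org: string;
  fromPlan: string;
  fromPrice: number; // dollars per month
  toPlan: string;
  toPrice: number; // dollars per month
}

// One template per gated op, kept in code beside the handler.
const SUMMARY_TEMPLATES = new Map<string, (p: PlanChange) => string>([
  [
    "upgrade_plan",
    (p) =>
      `Upgrade ${p.org} from ${p.fromPlan} ($${p.fromPrice}/mo) to ${p.toPlan} ($${p.toPrice}/mo) effective immediately`,
  ],
  [
    "downgrade_plan",
    (p) =>
      `Downgrade ${p.org} from ${p.fromPlan} ($${p.fromPrice}/mo) to ${p.toPlan} ($${p.toPrice}/mo)`,
  ],
]);

// A gated op with no template is a bug, so fail loudly instead of
// falling back to the bare operation name.
function summaryFor(operation: string, params: PlanChange): string {
  const template = SUMMARY_TEMPLATES.get(operation);
  if (!template) throw new Error(`no summary template for gated op: ${operation}`);
  return template(params);
}
```

Throwing on a missing template keeps the failure on the safe side: the gate refuses to open rather than showing the principal a summary that says only "Upgrade plan."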

## FAQ

**What are dangerous agent operations?**

Operations an AI agent can call that have effects expensive or impossible to reverse: refunds, deletions, plan changes, credential rotations, mass mutations, sending external messages. The defining property is that "agent did it incorrectly" costs more than "agent had to wait for a human." These operations live on a stable, code-defined list (the dangerous-ops contract) that gates each call through a confirmation step before the side effect fires.

**Why can't I just tell the model not to do dangerous things?**

Prompts are advisory; the model can ignore them, be jailbroken, or hallucinate that it confirmed when it didn't. Prompts also accumulate without bound, eating context that should go to the task. The fix is to move the gate out of the prompt and into the API surface, where the rule is enforced by the system rather than promised by the model.

**What is a consent gate?**

A consent gate is the technical mechanism that enforces the dangerous-ops contract. The first call to a gated operation returns a confirmation token plus a human-readable summary; no side effects fire. The agent shows the summary to its principal; the principal confirms; the agent re-calls with the token. The handler validates and fires. See [consent gates for dangerous operations](/blog/consent-gates-for-dangerous-ops) for the worked example.

**How does a two-key handshake differ from a regular consent gate?**

A regular consent gate requires one principal's confirmation. A two-key handshake requires two distinct principals' confirmations within a shared TTL. Two-key is for operations where one compromised principal would still be too much: org deletion, root credential rotation, mass mutations above a threshold, cross-environment writes. See [two-key handshakes for irreversible agent actions](/blog/two-key-handshakes-irreversible).

**How small should the dangerous-ops contract be?**

Smaller than feels comfortable. Each item on the list slows down a class of operations to human-confirmation pace. Most production platforms end up with 2–6 items on the initial contract; growth past 10 is usually a sign that prompt-based rules have leaked into the gate. The contract should answer "which ops are worth waiting on" — not "which ops feel scary."

**Where do OAuth scopes fit in this architecture?**

OAuth scopes are the boundary layer underneath the contract. The contract gates dangerous calls; scopes rule out irrelevant ones. An agent's scope should be per-agent (not inherited from the user wholesale), resource-narrow (which workspaces, not "all workspaces the user has"), and operation-specific (`update_row` separately from `delete_row`). See [OAuth scopes for agents](/blog/oauth-scopes-for-agents) for the failure modes when scopes are too broad.

**What's a fast-path on the dangerous-ops contract?**

A fast-path is a circumstance under which a gated operation can fire without the confirmation step. Examples: same-day retries of an already-confirmed operation, cron-triggered restorations matching a previous confirmation, idempotency-key matches. Fast-paths must be objective properties of the call (the system can check them), not subjective ("the model decided this is safe"). Document each fast-path explicitly next to the gate handler.
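To make "objective property of the call" concrete, here is a sketch of one possible fast-path check. It combines two of the examples above (a same-day retry and an idempotency-key match) into a single rule, which is an assumption made for illustration; `GatedCall`, `ConfirmedCall`, `matchFastPath`, and the rule name are all hypothetical.

```typescript
interface GatedCall {
  operation: string;
  paramsKey: string; // canonical serialization of the params
  idempotencyKey?: string;
}

// A previously confirmed call, as the audit log might record it.
interface ConfirmedCall {
  operation: string;
  paramsKey: string;
  idempotencyKey: string;
  confirmedAt: number; // epoch milliseconds
}

// Calendar day in UTC, e.g. "2026-05-19".
const utcDay = (ms: number): string => new Date(ms).toISOString().slice(0, 10);

// Returns the name of the matching fast-path, or null when the call must
// go through the normal confirmation step. Every condition is something
// the system can check; nothing depends on the model's judgment.
function matchFastPath(
  call: GatedCall,
  history: ConfirmedCall[],
  now: number,
): string | null {
  if (call.idempotencyKey === undefined) return null;
  const hit = history.some(
    (h) =>
      h.idempotencyKey === call.idempotencyKey &&
      h.operation === call.operation &&
      h.paramsKey === call.paramsKey &&
      utcDay(h.confirmedAt) === utcDay(now),
  );
  return hit ? "same_day_idempotent_retry" : null;
}
```

Returning the rule's name instead of a bare boolean means the audit row can record which fast-path let the call through.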

**How do I audit gated operations?**

Every gated call records the caller principal, the operation called, the params, the summary shown, the confirmer principal (if confirmed), when the second call fired, and whether its token validated. The audit log lets you reconstruct what happened during incidents and prove compliance during reviews. Without the gate-open + gate-fire pair recorded, the audit trail has a gap that's exactly the size of the dangerous operation.
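A minimal sketch of that gate-open and gate-fire pair, assuming an append-only event log: the event shapes, the `gateId` correlation field, and the `AuditLog` class are hypothetical, and a real log would live in durable storage instead of an in-memory array.

```typescript
// Written when the first call opens the gate.
interface GateOpen {
  kind: "gate_open";
  gateId: string; // correlates the two events for one gated call
  at: number; // epoch milliseconds
  caller: string;
  operation: string;
  params: Record<string, string | number>;
  summaryShown: string;
}

// Written when the second call arrives, whether or not the token validated.
interface GateFire {
  kind: "gate_fire";
  gateId: string;
  at: number;
  confirmer: string;
  tokenValid: boolean;
}

type AuditEvent = GateOpen | GateFire;

class AuditLog {
  private events: AuditEvent[] = [];

  append(event: AuditEvent): void {
    this.events.push(event);
  }

  // Reconstruct one gated call. A null `fire` means the gate opened
  // but the second call never arrived.
  trace(gateId: string): { open: GateOpen | null; fire: GateFire | null } {
    let open: GateOpen | null = null;
    let fire: GateFire | null = null;
    for (const e of this.events) {
      if (e.gateId !== gateId) continue;
      if (e.kind === "gate_open") open = e;
      else fire = e;
    }
    return { open, fire };
  }
}
```

Recording the fire event even when `tokenValid` is false is what lets you later show that a rejected attempt did not mis-fire.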

## Where Dock fits

Dock implements the dangerous-ops contract, consent-gate library, and two-key handshake as first-class primitives. The contract today contains `upgrade_plan` and `downgrade_plan` — both move money or permanently change org state. The consent-gate library lives at `src/lib/billing-consent.ts` and is the single mechanism for every gated op (no parallel implementations). The two-key handshake handles the highest-stakes class.

Every gated call is audited end-to-end: who called, what was shown, who confirmed, what fired. The MCP catalog reflects the contract — tools that are gated declare their gate-ness, so the agent knows up-front to expect a token round-trip.

If you're building an agent platform and your current safety story is "we tell the model not to do dangerous things," the architecture is prompt-based guardrails. Migrating to the contract pattern is mostly a library decision (pick one consent-gate shape) plus an enumeration exercise (write the contract) plus an audit-log addition.

## Read next

The first four essays below dig into specific pieces of agent-safe operations; the last two place them in the wider stack.

- [The dangerous-ops contract](/blog/dangerous-ops-contract) — the worked example of Dock's current contract, with the two ops, summary templates, and fast-paths.
- [Consent gates for dangerous operations](/blog/consent-gates-for-dangerous-ops) — the library shape: token binding, TTL choice, validation semantics.
- [Two-key handshakes for irreversible agent actions](/blog/two-key-handshakes-irreversible) — the higher-stakes pattern: two principals, shared TTL, when to promote ops.
- [OAuth scopes for agents: what's broken](/blog/oauth-scopes-for-agents) — the boundary layer underneath the contract.
- [Agentic AI architecture: the five layers nobody draws together](/blog/agentic-ai-architecture) — where safety sits in the broader stack (it's a cross-cutting concern with strong Layer 3 implications).
- [AI agent identity: the design model nobody has standardized](/blog/agent-identity) — the identity layer that makes per-principal confirmation possible at all.