---
title: "Automate your inbox triage with an agent"
excerpt: "10-step playbook to put an agent on your inbox: triage, label, draft replies, surface the 5 emails that need you, without ever auto-sending."
category: "Template"
---

# Automate your inbox triage with an agent

A 10-step playbook. Open in Dock and you'll get four surfaces seeded:

- **Steps** (table): the 10 gates as rows, with owner + due + status
- **Pointers** (table): every official Gmail / Outlook / Anthropic doc linked from this playbook
- **Brief** (doc): the canonical write-up covering your label taxonomy, your agent's system prompt, and your daily-digest format
- **Triage log** (table): one row per email triaged, with classification, action, and override

Open `Steps` first. The most important step is the 1-week shadow run; don't skip it.

## Outcome

An agent triaging your inbox daily: it labels every incoming email by your taxonomy, drafts replies on the routine ones, surfaces the 5 that need human attention in a morning digest, and never sends without a human click.

**Estimated time:** 3-5 days build, 1 week shadow run before live  
**Difficulty:** intermediate  
**For:** Operators + founders + execs with 100-500 emails/day.

## What you'll need

Register for or install these before you start.

- **[Gmail](https://mail.google.com/)** _(Free (personal), Workspace from $7/seat/mo)_ — The inbox the agent triages. Outlook works too with the equivalent Microsoft Graph API.
- **[Gmail API](https://developers.google.com/gmail/api)** _(Free (subject to quota: 1B quota units/day, ~250 sends/day for personal))_ — OAuth-scoped programmatic access to read, label, and draft emails.
- **[Claude Pro or API](https://www.anthropic.com/pricing)** _(Claude Pro $20/mo OR API metered (~$3/M input, $15/M output for Sonnet))_ — The model that classifies + drafts. Sonnet is the sweet spot for cost vs. quality.
- **[Gmail MCP server (community)](https://github.com/modelcontextprotocol/servers/tree/main/src/gmail)** _(Free, community-maintained)_ — Drop-in MCP server that exposes Gmail's API as MCP tools for any compatible client.
- **[Cron host](https://render.com/docs/cron-jobs)** _(Free tier on Render / Cloudflare Workers / Vercel cron)_ — Run the triage script every 30-60 min, or trigger from a Gmail push notification.

---

# The template · 10 steps

## Step 1: Audit the last 30 days of your inbox

_Estimated time: 2-3 hr_

You can't triage what you don't categorize. Pull the last 30 days of email and pattern-match: how many are newsletters, how many are customer emails, how many are 1:1 from real humans, how many are notifications from SaaS tools. The categories you find here become the agent's label taxonomy.

### Tasks

- [ ] Export 30 days of inbox via Gmail Takeout OR run a script via the Gmail API
- [ ] Cluster emails by sender domain + subject pattern
- [ ] Identify the top 8-12 categories by volume
- [ ] Note which categories are 'reply needed' vs. 'read + file' vs. 'auto-archive'
- [ ] Eyeball the false-positive risk of each category (newsletters mis-classified as customer email = real harm)

### Pointers

- **[Official]** [Gmail Takeout](https://takeout.google.com/)
- **[Official]** [Gmail API users.messages.list](https://developers.google.com/gmail/api/reference/rest/v1/users.messages/list)

> [!CAUTION]
> **Gotchas**
>
> - Don't trust your guess on category mix. Most people overestimate 1:1 human emails by 3-5x.
> - Automating categories under 1% of volume is a waste: the false-positive cost outweighs the time saved.
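
The clustering pass doesn't need a model; plain string normalization gets you most of the way. A minimal sketch, assuming you've exported message headers into a list of dicts (the field names are illustrative; adapt to your Takeout or API export):

```python
import re
from collections import Counter

def cluster(emails):
    """Group emails by (sender domain, normalized subject pattern)."""
    buckets = Counter()
    for e in emails:
        domain = e["from"].split("@")[-1].lower()
        # Collapse variable parts (invoice numbers, dates) so templated
        # subjects land in the same bucket.
        subject = re.sub(r"\d+", "#", e["subject"].lower())
        buckets[(domain, subject)] += 1
    return buckets.most_common()  # top categories by volume first

emails = [
    {"from": "billing@stripe.com", "subject": "Your invoice #1041"},
    {"from": "billing@stripe.com", "subject": "Your invoice #1042"},
    {"from": "jane@customer.io",   "subject": "Quick question"},
]
print(cluster(emails)[0])  # → (('stripe.com', 'your invoice ##'), 2)
```

Run it over the full 30-day export, then hand the top clusters to the agent prompt below for naming and action recommendations.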

### Agent prompt for this step

```text
Read the last 30 days of email from the user's Gmail inbox.

Cluster the emails by:
1. Sender domain (e.g. all @stripe.com)
2. Subject pattern (e.g. "Your invoice for...", "Action required:")
3. Body fingerprint (templated newsletters vs. real human writing)

Output as a Brief section titled "Inbox audit (30 days)":
1. Top 12 categories by volume, each with: name, sender pattern, subject pattern, example, action recommendation (reply / file / archive / flag).
2. Recommended label taxonomy (8-12 labels).
3. Risk flags: categories where misclassification has real consequences.

Constraints: be honest about volume. The user wants the truth about their inbox, not a flattering summary.
```

## Step 2: Define the label taxonomy + actions

_Estimated time: 1 hr_

Labels are the agent's vocabulary. Too few (3-4) and you can't take different actions per category. Too many (20+) and the model misclassifies. Aim for 8-12 labels, each with a defined action: reply / file / flag / archive. Auto-archive only on labels you trust 99%+ on; everything else gets a human-visible flag.

### Tasks

- [ ] Pick 8-12 label names (kebab-case or human-readable, your choice)
- [ ] For each label: action (reply-needed / read-and-file / flag-for-review / auto-archive)
- [ ] For each label with action=reply-needed: is it a routine reply (agent drafts) or a human reply (agent flags only)?
- [ ] Add labels to Gmail (Settings -> Labels -> Create new)
- [ ] Document the taxonomy in the Brief

### Pointers

- **[Official]** [Gmail label management](https://support.google.com/mail/answer/118708)

> [!CAUTION]
> **Gotchas**
>
> - Don't use Gmail's built-in 'Important' or 'Starred' for the agent; those are user-facing signals you'll want for your own use.
> - Auto-archive on a label is dangerous: the user never sees those emails again. Reserve it for newsletters + automated SaaS notifications, never for anything customer-facing.
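
The taxonomy is just data. A sketch of encoding it so the auto-archive rule is enforced in code rather than by model obedience (label names are examples from a hypothetical audit, not prescriptions):

```python
# Illustrative taxonomy: label -> action. Adapt names to your own audit.
TAXONOMY = {
    "customer-reply":     "reply",
    "customer-escalate":  "flag",
    "newsletter":         "archive",
    "saas-notification":  "archive",
    "needs-human-review": "flag",
}

# Auto-archive is opt-in per label: only labels that cleared 99%+ accuracy
# in shadow review (step 8) belong here.
ARCHIVE_ALLOWLIST = {"newsletter", "saas-notification"}

def resolve_action(label):
    action = TAXONOMY.get(label)
    if action is None:
        return "flag"  # unknown label -> always human review
    if action == "archive" and label not in ARCHIVE_ALLOWLIST:
        return "flag"  # archive without shadow-proven accuracy -> flag instead
    return action

assert resolve_action("newsletter") == "archive"
assert resolve_action("made-up-label") == "flag"
```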

## Step 3: Set up Gmail API access (OAuth)

_Estimated time: 1-2 hr_

The agent needs to read, label, and draft email programmatically. That means OAuth-scoped access to the Gmail API: gmail.readonly + gmail.modify (label + draft, not send). The send scope is intentionally never requested; the agent never sends without a human click.

### Tasks

- [ ] Create a Google Cloud project at console.cloud.google.com
- [ ] Enable the Gmail API
- [ ] Create OAuth 2.0 credentials (type: Desktop app or Web app depending on host)
- [ ] Configure consent screen: scopes gmail.readonly + gmail.modify (NOT gmail.send)
- [ ] Run the OAuth flow once, store the refresh token securely (env var on the cron host)
- [ ] Test: a script reads 1 email and applies a label, confirm it works

### Pointers

- **[Official]** [Gmail API OAuth setup](https://developers.google.com/gmail/api/quickstart/python)
- **[Official]** [Gmail API scopes reference](https://developers.google.com/gmail/api/auth/scopes)

> [!CAUTION]
> **Gotchas**
>
> - Don't request the gmail.send scope. The agent should never have it; this is the safety rail. (Note: gmail.modify technically authorizes the send endpoint too, so also make sure your code never calls users.messages.send.)
> - OAuth refresh tokens for apps left in 'Testing' publishing status expire after 7 days. Either verify the app (Google's review can take days to weeks for restricted Gmail scopes) or run it as a Workspace internal app.
> - Store the refresh token in an env var, not in code. A leaked refresh token = full inbox access until revoked.
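
A minimal sketch of the one-time setup, assuming `google-api-python-client` and `google-auth-oauthlib` are installed and `credentials.json` is the OAuth client file downloaded from the Cloud console. The scope list is the point: no gmail.send.

```python
# Scopes are the safety rail: read + modify (labels, drafts), never gmail.send.
SCOPES = [
    "https://www.googleapis.com/auth/gmail.readonly",
    "https://www.googleapis.com/auth/gmail.modify",
]
assert not any(s.endswith("gmail.send") for s in SCOPES)

def main():
    # Imports live here so the scope check above runs without the libraries.
    from google_auth_oauthlib.flow import InstalledAppFlow
    from googleapiclient.discovery import build

    flow = InstalledAppFlow.from_client_secrets_file("credentials.json", SCOPES)
    creds = flow.run_local_server(port=0)  # one-time browser consent
    # creds.refresh_token is what you store (env var on the cron host).

    gmail = build("gmail", "v1", credentials=creds)
    # Smoke test: read one message, apply a label.
    msg = gmail.users().messages().list(userId="me", maxResults=1).execute()["messages"][0]
    gmail.users().messages().modify(
        userId="me", id=msg["id"], body={"addLabelIds": ["STARRED"]}
    ).execute()
    print("label applied to", msg["id"])

# Run main() once locally, confirm the star shows up in Gmail, then move
# the refresh token to the cron host.
```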

## Step 4: Pick the runtime: cron script or MCP server

_Estimated time: 30 min decision, 4-8 hr build_

Two patterns work. (a) A cron script that runs every 30 min, calls the Gmail API, sends each email to Claude / GPT for classification, and applies labels + creates drafts. (b) An MCP server (Gmail MCP) that exposes Gmail as tools to any MCP client; you ask Claude to 'triage my inbox' and it loops. Pattern (a) is more reliable for unattended automation; pattern (b) is more flexible for interactive workflows.

### Tasks

- [ ] Pick: cron script (unattended) vs. MCP server (interactive)
- [ ] If cron: pick a host (Render, Cloudflare Workers, Vercel cron, your own VPS)
- [ ] If MCP: install the community Gmail MCP server + configure auth
- [ ] Either way: pick the model (Claude Sonnet is the sweet spot, Haiku for high-volume cheap classification, Opus for high-stakes drafting)

### Pointers

- **[Official]** [Render cron jobs](https://render.com/docs/cron-jobs)
- **[Official]** [Cloudflare Workers cron triggers](https://developers.cloudflare.com/workers/configuration/cron-triggers/)
- **[Code]** [Gmail MCP server](https://github.com/modelcontextprotocol/servers/tree/main/src/gmail)

> [!CAUTION]
> **Gotchas**
>
> - Cron every 5 min on a 200-email/day inbox is overkill and burns tokens. Every 30 min is the sweet spot.
> - If you go the MCP-server route, you have to remember to run the triage prompt manually, which is easy to forget. Cron wins for unattended triage.
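
If you pick the cron route, the schedule itself is one line. An illustrative crontab entry (paths and hours are placeholders): every 30 minutes, weekdays, work hours only:

```text
*/30 8-18 * * 1-5 /usr/bin/python3 /opt/triage/loop.py >> /var/log/triage.log 2>&1
```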

## Step 5: Write the classification prompt

_Estimated time: 2-4 hr_

The classification prompt is the heart of the agent. Given email metadata + body, output: label, confidence, action, draft (if reply-needed and routine). The prompt should be deterministic on routine emails and conservative on edge cases (when in doubt, flag for human review).

### Tasks

- [ ] Draft the system prompt: identity, taxonomy (paste the labels), action rules, JSON output schema
- [ ] Include 5-10 few-shot examples (real emails from your audit, with the correct label + action)
- [ ] Output schema: { label, confidence: 0-1, action, draft_subject?, draft_body? }
- [ ] Add a 'when in doubt, label as needs-human-review' instruction explicitly
- [ ] Test on 50 emails from your audit, check classification accuracy

### Pointers

- **[Official]** [Anthropic prompt engineering for classification](https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/use-examples)
- **[Official]** [OpenAI structured output](https://platform.openai.com/docs/guides/structured-outputs)

> [!CAUTION]
> **Gotchas**
>
> - Without a confidence threshold, the model hallucinates labels at 5-10% rate on edge cases.
> - Including the user's full email signature in the draft is a tell. Strip it; let the user add their own.
> - Drafts longer than ~100 words feel obviously AI-generated. Shorter is more authentic for routine replies.

### Agent prompt for this step

```text
Read the label taxonomy from the Brief.

Draft a classification system prompt with:
1. Identity: "You are an inbox triage agent for [user]."
2. Taxonomy (paste the labels + actions verbatim)
3. Output: strict JSON { label: string, confidence: 0-1, action: "reply" | "file" | "flag" | "archive", draft_subject?: string, draft_body?: string }
4. Rules:
   - When in doubt, label "needs-human-review" with action "flag".
   - Only draft a reply when the email matches a routine pattern AND confidence > 0.85.
   - Never include the user's signature in the draft.
   - Keep drafts under 100 words for routine replies.

Then add 5-10 few-shot examples from the user's audit (real emails + correct classifications).

Output as a Brief section titled "Classification prompt v1".
```
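
Whatever the prompt promises, validate the model's output before acting on it. A sketch, assuming the JSON schema above; anything malformed, unknown, or under-confident falls back to human review:

```python
import json

VALID_ACTIONS = {"reply", "file", "flag", "archive"}
DRAFT_THRESHOLD = 0.85  # mirrors the prompt rule; tune during shadow week

def parse_classification(raw, taxonomy):
    """Validate model output; anything doubtful -> flag for human review."""
    fallback = {"label": "needs-human-review", "confidence": 0.0, "action": "flag"}
    try:
        out = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if out.get("label") not in taxonomy or out.get("action") not in VALID_ACTIONS:
        return fallback
    if out["action"] == "reply" and out.get("confidence", 0) <= DRAFT_THRESHOLD:
        # Not confident enough to draft: keep the classification, drop the draft.
        out.pop("draft_subject", None)
        out.pop("draft_body", None)
        out["action"] = "flag"
    return out

taxonomy = {"invoice", "customer-reply", "needs-human-review"}
ok = parse_classification('{"label": "invoice", "confidence": 0.97, "action": "file"}', taxonomy)
assert ok["action"] == "file"
assert parse_classification("not json at all", taxonomy)["action"] == "flag"
```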

## Step 6: Implement the triage loop (label, draft, log)

_Estimated time: 4-8 hr_

The loop: list new emails since last run, send each to the model with the classification prompt, apply the label via Gmail API, create a draft (NOT a sent email) if action is reply, log to the Triage log surface. Keep state in a flat file or a tiny SQLite DB so you don't double-process.

### Tasks

- [ ] Implement: fetch unread emails since last_run_at
- [ ] For each: call the model with the email body + classification prompt
- [ ] Apply the returned label via gmail.users.messages.modify (addLabelIds)
- [ ] If action is reply + draft is set: create a draft via gmail.users.drafts.create
- [ ] Append a row to the Triage log surface with: msg_id, from, subject, label, confidence, action, draft_id?, override?
- [ ] Update last_run_at

### Pointers

- **[Official]** [gmail.users.drafts.create](https://developers.google.com/gmail/api/reference/rest/v1/users.drafts/create)
- **[Official]** [gmail.users.messages.modify](https://developers.google.com/gmail/api/reference/rest/v1/users.messages/modify)

> [!CAUTION]
> **Gotchas**
>
> - Use users.drafts.create, not users.messages.send. Drafts are reviewable in Gmail; sends are gone.
> - Inbox automation hits Gmail's 250 send/day limit fast (personal) or 2000/day (Workspace). Drafts don't count, but if you ever flip to send, you'll burn the quota in an hour during catch-up runs.
> - Always store the original email content in your log, not just the classification. When you debug a misclassification, you'll need the source.
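
The Gmail calls are covered in the steps above; the part worth sketching is the loop's bookkeeping. A minimal version with the classifier injected as a function so the dedupe logic is testable on its own (the state-file name is illustrative):

```python
import json
import pathlib

STATE = pathlib.Path("triage_state.json")  # flat-file state; SQLite works too

def load_seen():
    return set(json.loads(STATE.read_text())["seen"]) if STATE.exists() else set()

def save_seen(seen):
    STATE.write_text(json.dumps({"seen": sorted(seen)}))

def triage(emails, classify, seen):
    """One pass: skip already-processed ids, classify the rest, log decisions.

    `classify` is your model call from step 5, injected as a function.
    """
    log = []
    for e in emails:
        if e["id"] in seen:
            continue  # never double-process a message
        log.append({"msg_id": e["id"], **classify(e)})
        seen.add(e["id"])
    return log, seen

emails = [{"id": "m1", "body": "..."}, {"id": "m2", "body": "..."}]
stub = lambda e: {"label": "newsletter", "confidence": 0.99, "action": "archive"}
log, seen = triage(emails, stub, seen={"m1"})  # m1 was handled last run
assert [r["msg_id"] for r in log] == ["m2"]
```

In the real loop, each `log` row also gets its label applied via `messages.modify` and, for reply actions, a draft via `drafts.create` before the row is appended to the Triage log surface.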

## Step 7: Build the morning digest format

_Estimated time: 1-2 hr_

The agent's daily output to you is the digest: 5 emails that need human attention today, ranked by urgency. Format it for 30-second skim: 1 line per email (sender, subject, why it needs you, suggested action). Send the digest as a doc update, a Slack message, or an email to yourself, whatever surface you check first thing.

### Tasks

- [ ] Decide the surface: doc (Brief), Slack DM, or self-email
- [ ] Format: top 5 emails, 1 line each (from, subject, why, suggested action)
- [ ] Add a footer: 'X emails labeled, Y drafts created since last digest'
- [ ] Schedule the digest for your wake-up time (8am local is common)
- [ ] Test for 3 days, refine the format based on what you actually act on

### Pointers

- **[Official]** [Slack incoming webhooks](https://api.slack.com/messaging/webhooks)

> [!CAUTION]
> **Gotchas**
>
> - Digests longer than 5 emails get scanned, not read. Keep to 5, push the rest to a 'later' queue.
> - If the agent surfaces 5 emails and 4 of them aren't actually urgent, your importance ranking is wrong. Tweak.
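
The ranking can start dumber than you'd think. A sketch of the digest formatter over Triage-log rows; the sender-importance tier and footer counts are simplified here, and the row fields are illustrative:

```python
URGENCY_WORDS = ("asap", "eod", "urgent", "tomorrow")

def rank_key(row):
    # Lower sorts first: flagged before reply-needed, urgent wording first.
    flagged = 0 if row["action"] == "flag" else 1
    urgent = 0 if any(w in row["subject"].lower() for w in URGENCY_WORDS) else 1
    return (flagged, urgent)

def digest(rows, limit=5):
    top = sorted((r for r in rows if r["action"] in ("flag", "reply")), key=rank_key)[:limit]
    lines = [f"- {r['from']} -- {r['subject']} -- {r['why']}" for r in top]
    lines.append(f"({len(rows)} emails triaged since last digest)")
    return "\n".join(lines)

rows = [
    {"from": "a@vendor.com",   "subject": "Renewal quote",    "why": "quote expires Friday", "action": "reply"},
    {"from": "b@customer.com", "subject": "Need this by EOD", "why": "escalation",           "action": "flag"},
    {"from": "news@x.com",     "subject": "Weekly roundup",   "why": "",                     "action": "archive"},
]
print(digest(rows).splitlines()[0])
# → - b@customer.com -- Need this by EOD -- escalation
```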

### Agent prompt for this step

```text
Read the last 24 hours of the Triage log surface.

Pick the 5 emails that need human attention today. Ranking criteria:
1. action = "flag" or "needs-human-review"
2. Sender importance (paying customer > vendor > newsletter > SaaS notification)
3. Time-sensitivity in the email body ("by EOD", "tomorrow", "ASAP")

For each: 1 line in the format:
`[Importance icon] [From] -- [Subject] -- [Why this needs you] -- [Suggested action]`

Include a footer: total count of triaged emails + drafts created since last digest.

Output as a Brief section titled "Morning digest <date>".
```

## Step 8: Run a 1-week shadow pass: log only, don't act

_Estimated time: 1 week (passive)_

Before the agent applies a single label, run for 1 week in shadow mode: the loop runs, classifies, decides actions, writes to the Triage log. You manually review each decision. This is non-negotiable. Without a shadow week, you'll find out about misclassifications when a customer email gets auto-archived to Newsletters.

### Tasks

- [ ] Disable label application + draft creation in the loop
- [ ] Keep the loop running with full classification + logging only
- [ ] Each morning: review the Triage log surface for the previous day
- [ ] Mark each row as 'agree' or 'override' (and what it should have been)
- [ ] On day 7: compute classification accuracy + identify the labels with > 5% override rate

### Pointers

- **[Guide]** [Shadow-mode pattern](https://en.wikipedia.org/wiki/Canary_release) — Same pattern as canary deploys: log + observe before mutate.

> [!CAUTION]
> **Gotchas**
>
> - Skipping shadow mode is the #1 source of inbox-agent disasters. A 95% accurate classifier on a 200/day inbox = 10 misclassifications per day. Some will be customer-facing.
> - Use the shadow week to calibrate confidence thresholds, not to add more labels. Add labels later, after the existing ones are stable.
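
Day 7's accuracy math is a fold over the log. A sketch, assuming each shadow-log row carries your morning review verdict (field names are illustrative):

```python
from collections import defaultdict

def override_rates(rows):
    """Per-label override rate from the shadow-week Triage log."""
    totals, overrides = defaultdict(int), defaultdict(int)
    for r in rows:
        totals[r["label"]] += 1
        if r["review"] == "override":
            overrides[r["label"]] += 1
    return {lbl: overrides[lbl] / totals[lbl] for lbl in totals}

rows = (
    [{"label": "newsletter", "review": "agree"}] * 97
    + [{"label": "newsletter", "review": "override"}] * 3
    + [{"label": "customer-reply", "review": "agree"}] * 18
    + [{"label": "customer-reply", "review": "override"}] * 2
)
not_ready = {lbl for lbl, rate in override_rates(rows).items() if rate > 0.05}
assert not_ready == {"customer-reply"}  # 10% override rate: keep it in shadow
```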

## Step 9: Go live, with a kill switch and a daily review

_Estimated time: 30 min to flip, 5 min/day for 4 weeks, then 30 min/week_

After shadow week + tuning, flip on label application + draft creation. Add a kill switch (an env var or a Dock workspace flag) you can toggle to instantly stop the agent. Review the Triage log for 5 minutes every morning for the first 4 weeks; after that, weekly reviews are enough.

### Tasks

- [ ] Re-enable label application + draft creation in the loop
- [ ] Add a kill switch: env var TRIAGE_AGENT_ENABLED=true|false (loop bails on false)
- [ ] Set up a daily 5-min review on your calendar for the first 4 weeks
- [ ] Each review: scan Triage log, mark overrides, tweak prompt for systematic errors
- [ ] After 4 weeks: switch to weekly reviews
- [ ] Set up an alert if override rate exceeds 10% in any 24 hr window

### Pointers

- **[Guide]** [Operational kill switch pattern](https://en.wikipedia.org/wiki/Feature_toggle)

> [!CAUTION]
> **Gotchas**
>
> - Override rates spike when senders change patterns (a customer's email signature changes, a SaaS tool restructures their notifications). Watch the trend.
> - Don't tweak the prompt mid-day over a single misclassification; you'll thrash. Wait for the daily review.
> - Drafts that sit untouched for > 7 days should be deleted by the agent on a weekly cleanup pass, otherwise they pile up and the inbox gets messier than before automation.
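
The kill switch needs no infrastructure beyond an env var read at the top of every run. A sketch (the variable name matches the task above):

```python
import os

def agent_enabled():
    """Kill switch: the loop does nothing unless explicitly enabled."""
    return os.environ.get("TRIAGE_AGENT_ENABLED", "false").lower() == "true"

# At the top of the triage loop:
#   if not agent_enabled():
#       return  # flipped off: log nothing, label nothing, draft nothing

os.environ.pop("TRIAGE_AGENT_ENABLED", None)
assert not agent_enabled()  # unset defaults to off: fail safe
```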

## Step 10: Iterate weekly: add labels, refine drafts, expand scope

_Estimated time: Ongoing, 30 min/week_

Once stable, the agent has room to grow: add a label for a new sender pattern, train it to draft replies on a category that was previously flag-only, expand to calendar invites or expense receipts. Resist scope creep: every new responsibility is a new failure surface. Add at most one capability per week.

### Tasks

- [ ] Each week: review override patterns from the previous week
- [ ] Pick one improvement: new label, new draft template, new automation
- [ ] Implement, test in shadow mode for the week, ship if accuracy > 95%
- [ ] Update the Brief with the new capability + the date it shipped
- [ ] Track the agent's labor savings: hours/week of email triage saved (a real metric, not vibes)

### Pointers

- **[Guide]** [Email triage time-savings benchmarks](https://hbr.org/2019/01/how-to-spend-way-less-time-on-email-every-day)

> [!CAUTION]
> **Gotchas**
>
> - An agent with 25 labels has 25 ways to misclassify. The marginal label past 12-15 usually doesn't pay back.
> - Don't expand to send-on-behalf-of-user without a hard human-confirmation step. The risk-reward isn't worth it for any volume of email.

---

## Hand the template to your agent

Paste the prompt below into your agent's permanent system prompt so the agent reads, writes, and maintains this workspace as you work through the steps.

```text
You are the inbox-triage agent on the workspace at your-org/automate-your-inbox-with-an-agent.

Your role: triage incoming emails, draft replies on routine ones, surface the 5 that need human attention.

Cadence:
- Run every 30 min during work hours.
- For each new email: classify by the taxonomy (in the Brief), apply the matching label, decide an action (reply / file / flag / archive).
- If the action is reply AND the email matches a routine pattern: draft a reply (do NOT send).
- Append every triage decision to the Triage log surface.
- At 8am local time: post a morning digest in the Brief with the 5 emails that need human attention today.

NEVER:
- Send an email without a human clicking Send.
- Apply a label not in the taxonomy.
- Forward an email outside the user's domain without explicit instruction.

First MCP tool calls:
1. list_workspaces()
2. get_doc(workspace_slug="automate-your-inbox-with-an-agent", surface_slug="brief")
3. list_rows(workspace_slug="automate-your-inbox-with-an-agent", surface_slug="triage-log")
```

---

## FAQ

### Is it safe to let an AI agent read my email?

Safer than letting it send. The pattern in this playbook grants only the gmail.readonly + gmail.modify scopes, enough to read, label, and draft; the gmail.send scope is never requested, and the agent never auto-sends: a human clicks Send on every reply. Your email content does go to the model API (Anthropic / OpenAI) for classification; both offer data-handling terms under which API inputs aren't used for training, but verify those terms apply to your account tier.

### Won't the agent miss something important?

It will. The mitigation is the morning digest + flag-when-in-doubt rule. The agent labels emails it's confident about and flags everything else for human review. After 1 week of shadow running, you'll know your classification accuracy on each label; tune the confidence threshold so 'flag for review' captures the edge cases. A 95% accurate agent on a 200-email/day inbox flags ~10 emails for human attention, exactly the volume you can review in 5 min.

### How much does this cost in API tokens?

For a 200-email/day inbox using Claude Sonnet: ~$10-20/month in token cost (each email is ~500-1000 input tokens for classification + a 50-200 token output). Switch to Haiku for ~$2-4/month if you don't need draft quality. Heavy drafters (5+ replies/day) push toward $30-50/mo. Cron + hosting is $0-10/mo on free tiers. Total: $15-60/mo on top of your existing inbox subscription.

### Why not use Gmail's built-in filters and Smart Reply?

Filters are pattern-matching (sender, subject keyword); they can't reason about content. Smart Reply suggests 3 short replies but doesn't classify or take batch action. An agent reads each email, applies your custom taxonomy, drafts contextual replies, and produces a daily digest. Use filters for the truly mechanical cases (auto-archive a specific newsletter), use the agent for everything that requires reading the email.

### What's the worst-case failure?

An agent that auto-archives a customer email by misclassifying it as a newsletter, and you don't see it for a week. Mitigations: (1) shadow mode for week 1, no actions, (2) auto-archive only on labels with > 99% accuracy in shadow review, (3) the Triage log surface preserves every email's classification + content forever, so misses are recoverable, (4) a kill switch you can flip if anything looks off.

### Can my AI agents help build the agent?

Yes. The playbook ships agent prompts for the slow parts: the 30-day inbox audit, the label taxonomy, the classification prompt with few-shot examples, the morning digest format, and the weekly override-pattern analysis. The Triage log surface is the canonical record the agent reads to learn what it's doing well and where it's drifting.

