An inbox-triage agent is the highest-leverage personal automation you can build, but the failure modes are real: an agent that auto-sends a wrong reply costs you a customer; an agent that hallucinates a calendar invite breaks trust; an agent that hits Gmail's send rate limit silently drops half your replies. This playbook walks through the 10 steps to build a triage agent that does the work without the disasters: a strict label taxonomy, OAuth-scoped auth, draft-but-don't-send safety, an opinionated daily digest, and the 1-week shadow run before you trust it.
Outcome
An agent triaging your inbox daily: labels every incoming email by your taxonomy, drafts replies on the routine ones, surfaces the 5 that need human attention in a morning digest, never auto-sends without a human click.
Time: 3-5 days build, 1 week shadow run before live. Difficulty: intermediate. For: operators + founders + execs with 100-500 emails/day.
Top to bottom. Each step has tasks, pointers, gotchas.
01 / 10
Audit the last 30 days of your inbox
2-3 hr
You can't triage what you don't categorize. Pull the last 30 days of email and pattern-match: how many are newsletters, how many are customer emails, how many are 1:1 from real humans, how many are notifications from SaaS tools. The categories you find here become the agent's label taxonomy.
Tasks
Export 30 days of inbox via Gmail Takeout OR run a script via the Gmail API
Cluster emails by sender domain + subject pattern
Identify the top 8-12 categories by volume
Note which categories are 'reply needed' vs. 'read + file' vs. 'auto-archive'
Eyeball the false-positive risk of each category (newsletters mis-classified as customer email = real harm)
Don't trust your guess on category mix. Most people overestimate 1:1 human emails by 3-5x.
Auto-categorizing categories under 1% of volume is a waste: the false-positive cost outweighs the time saved.
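The clustering pass above can be sketched in a few lines. This assumes you've already pulled messages into a list of dicts (via Takeout export or the API); the field names and normalization rules are illustrative, not a fixed schema:

```python
# Cluster emails by sender domain + normalized subject pattern.
import re
from collections import Counter

def sender_domain(from_header: str) -> str:
    """Extract the domain from a From: header like 'Stripe <receipts@stripe.com>'."""
    match = re.search(r"@([\w.-]+)", from_header)
    return match.group(1).lower() if match else "unknown"

def subject_pattern(subject: str) -> str:
    """Normalize a subject so templated sends collapse into one bucket."""
    s = re.sub(r"\d+", "#", subject)           # numbers -> placeholder
    s = re.sub(r"\s+", " ", s).strip().lower()
    return s[:60]

def cluster(messages: list[dict]) -> Counter:
    """Count (domain, subject pattern) pairs; the top entries become label candidates."""
    return Counter(
        (sender_domain(m["from"]), subject_pattern(m["subject"]))
        for m in messages
    )

msgs = [
    {"from": "Stripe <receipts@stripe.com>", "subject": "Your invoice for $42"},
    {"from": "Stripe <receipts@stripe.com>", "subject": "Your invoice for $108"},
    {"from": "jane@customer.io", "subject": "Quick question about pricing"},
]
top = cluster(msgs).most_common(12)
# templated senders surface immediately: two Stripe invoices collapse into one bucket
```

Templated email collapses into a handful of buckets fast; the long tail of one-off human email is exactly what should end up in a flag-for-review label.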
Agent prompt for this step
Read the last 30 days of email from the user's Gmail inbox.
Cluster the emails by:
1. Sender domain (e.g. all @stripe.com)
2. Subject pattern (e.g. "Your invoice for...", "Action required:")
3. Body fingerprint (templated newsletters vs. real human writing)
Output as a Brief section titled "Inbox audit (30 days)":
1. Top 12 categories by volume, each with: name, sender pattern, subject pattern, example, action recommendation (reply / file / archive / flag).
2. Recommended label taxonomy (8-12 labels).
3. Risk flags: categories where misclassification has real consequences.
Constraints: be honest about volume. The user wants the truth about their inbox, not a flattering summary.
02 / 10
Define the label taxonomy + actions
1 hr
Labels are the agent's vocabulary. Too few (3-4) and you can't take different actions per category. Too many (20+) and the model misclassifies. Aim for 8-12 labels, each with a defined action: reply / file / flag / archive. Auto-archive only on labels you trust 99%+ on; everything else gets a human-visible flag.
Tasks
Pick 8-12 label names (kebab-case or human-readable, your choice)
For each label: action (reply-needed / read-and-file / flag-for-review / auto-archive)
For each label with action=reply-needed: is it a routine reply (agent drafts) or a human reply (agent flags only)?
Add labels to Gmail (Settings -> Labels -> Create new)
Don't use Gmail's built-in 'Important' or 'Starred' for the agent; those are user-facing signals you'll want for your own use.
Auto-archive on a label is dangerous. The user never sees those emails again. Reserve for newsletters + automated SaaS notifications, never customer-facing.
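One way to encode this taxonomy as data the triage loop can enforce. The label names and the per-label flags here are illustrative; the invariant worth copying is that auto-archive is opt-in per label, and unknown labels fall back to a human-visible flag, never to archive:

```python
# Taxonomy as data: label -> (action, agent_drafts_reply).
from enum import Enum

class Action(Enum):
    REPLY_NEEDED = "reply-needed"
    READ_AND_FILE = "read-and-file"
    FLAG_FOR_REVIEW = "flag-for-review"
    AUTO_ARCHIVE = "auto-archive"

TAXONOMY: dict[str, tuple[Action, bool]] = {
    "customer-question":   (Action.REPLY_NEEDED,    True),   # routine: agent drafts
    "customer-escalation": (Action.FLAG_FOR_REVIEW, False),  # human reply only
    "vendor-invoice":      (Action.READ_AND_FILE,   False),
    "newsletter":          (Action.AUTO_ARCHIVE,    False),  # 99%+ trusted only
    "saas-notification":   (Action.AUTO_ARCHIVE,    False),
    "needs-human-review":  (Action.FLAG_FOR_REVIEW, False),  # the safe default
}

def action_for(label: str) -> Action:
    """Unknown labels fall back to flag-for-review, never to archive."""
    return TAXONOMY.get(label, (Action.FLAG_FOR_REVIEW, False))[0]
```

Keeping the taxonomy in one structure means the prompt, the loop, and the log all read from the same source of truth.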
03 / 10
Set up Gmail API access (OAuth)
1-2 hr
The agent needs to read, label, and draft email programmatically. That means OAuth-scoped access to the Gmail API: gmail.readonly + gmail.modify (label + draft, not send). Send scope is intentionally NOT requested; the agent never sends without a human click.
Tasks
Create a Google Cloud project at console.cloud.google.com
Enable the Gmail API
Create OAuth 2.0 credentials (type: Desktop app or Web app depending on host)
Don't request gmail.send scope. The agent should never have it; this is the safety rail.
OAuth refresh tokens for unverified apps expire after 7 days. Either verify the app (~3-day Google review) or run as a Workspace internal app.
Store the refresh token in an env var, not in code. A leaked refresh token = full inbox access until revoked.
04 / 10
Pick the runtime: cron script or MCP server
30 min decision, 4-8 hr build
Two patterns work. (a) A cron script that runs every 30 min, calls the Gmail API, sends each email to Claude / GPT for classification, and applies labels + creates drafts. (b) An MCP server (Gmail MCP) that exposes Gmail as tools to any MCP client; you ask Claude 'triage my inbox' and it loops. Pattern (a) is more reliable for unattended automation; pattern (b) is more flexible for interactive workflows.
Tasks
Pick: cron script (unattended) vs. MCP server (interactive)
If cron: pick a host (Render, Cloudflare Workers, Vercel cron, your own VPS)
If MCP: install the community Gmail MCP server + configure auth
Either way: pick the model (Claude Sonnet is the sweet spot, Haiku for high-volume cheap classification, Opus for high-stakes drafting)
Cron-every-5-min on a 200-email/day inbox is overkill and burns tokens. Every 30 min is the sweet spot.
If you go the MCP-server route, you must remember to run the triage prompt manually, which is easy to forget. Cron wins for unattended operation.
05 / 10
Write the classification prompt
2-4 hr
The classification prompt is the heart of the agent. Given email metadata + body, output: label, confidence, action, draft (if reply-needed and routine). The prompt should be deterministic on routine emails and conservative on edge cases (when in doubt, flag for human review).
Tasks
Draft the system prompt: identity, taxonomy (paste the labels), action rules, JSON output schema
Include 5-10 few-shot examples (real emails from your audit, with the correct label + action)
Without a confidence threshold, the model hallucinates labels at a 5-10% rate on edge cases.
Including the user's full email signature in the draft is a tell. Strip it; let the user add their own.
Drafts longer than ~100 words feel obviously AI-generated. Shorter is more authentic for routine replies.
Agent prompt for this step
Read the label taxonomy from the Brief.
Draft a classification system prompt with:
1. Identity: "You are an inbox triage agent for [user]."
2. Taxonomy (paste the labels + actions verbatim)
3. Output: strict JSON { label: string, confidence: 0-1, action: "reply" | "file" | "flag" | "archive", draft_subject?: string, draft_body?: string }
4. Rules:
- When in doubt, label "needs-human-review" with action "flag".
- Only draft a reply when the email matches a routine pattern AND confidence > 0.85.
- Never include the user's signature in the draft.
- Keep drafts under 100 words for routine replies.
Then add 5-10 few-shot examples from the user's audit (real emails + correct classifications).
Output as a Brief section titled "Classification prompt v1".
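Whatever the model returns, the loop should validate it before acting. A minimal validator under the rules above: malformed JSON, unknown labels, and low-confidence replies all degrade to a human-visible flag rather than raising. The `LABELS` set is a stand-in for your real taxonomy; field names match the JSON schema in the prompt:

```python
# Validate the model's classification output; fail safe to "flag".
import json

LABELS = {"customer-question", "vendor-invoice", "newsletter", "needs-human-review"}
DRAFT_CONFIDENCE = 0.85  # below this, never draft, only flag

FLAG = {"label": "needs-human-review", "confidence": 0.0, "action": "flag"}

def validate(raw: str) -> dict:
    try:
        out = json.loads(raw)
    except json.JSONDecodeError:
        return dict(FLAG)
    if not isinstance(out, dict) or out.get("label") not in LABELS:
        return dict(FLAG)
    if out.get("action") == "reply" and out.get("confidence", 0.0) < DRAFT_CONFIDENCE:
        out["action"] = "flag"           # degrade: don't trust the draft
        out.pop("draft_subject", None)
        out.pop("draft_body", None)
    return out

ok = validate('{"label": "customer-question", "confidence": 0.92, '
              '"action": "reply", "draft_body": "Thanks -- shipping today."}')
shaky = validate('{"label": "customer-question", "confidence": 0.6, '
                 '"action": "reply", "draft_body": "..."}')
# ok keeps its draft; shaky is downgraded to a flag with the draft dropped
```

The threshold belongs in code, not only in the prompt: models drift, and the validator is the part you control deterministically.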
06 / 10
Implement the triage loop (label, draft, log)
4-8 hr
The loop: list new emails since last run, send each to the model with the classification prompt, apply the label via Gmail API, create a draft (NOT a sent email) if action is reply, log to the Triage log surface. Keep state in a flat file or a tiny SQLite DB so you don't double-process.
Tasks
Implement: fetch unread emails since last_run_at
For each: call the model with the email body + classification prompt
Apply the returned label via gmail.users.messages.modify (addLabelIds)
If action is reply + draft is set: create a draft via gmail.users.drafts.create
Append a row to the Triage log surface with: msg_id, from, subject, label, confidence, action, draft_id?, override?
Use users.drafts.create, not users.messages.send. Drafts are reviewable in Gmail; sends are gone.
Inbox automation hits Gmail's send limits fast: roughly 500/day (personal) or 2,000/day (Workspace). Drafts don't count, but if you ever flip to send, you'll burn the quota in an hour during catch-up runs.
Always store the original email content in your log, not just the classification. When you debug a misclassification, you'll need the source.
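The don't-double-process state and the log row can share one SQLite table. A sketch, with column names mirroring the Triage log row above (everything else is illustrative):

```python
# Dedup + triage log in one SQLite table; msg_id is the primary key.
import sqlite3

def open_log(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS triage_log (
        msg_id TEXT PRIMARY KEY, sender TEXT, subject TEXT, label TEXT,
        confidence REAL, action TEXT, draft_id TEXT, override TEXT,
        raw_body TEXT  -- keep the source email for debugging misclassifications
    )""")
    return db

def already_processed(db: sqlite3.Connection, msg_id: str) -> bool:
    return db.execute(
        "SELECT 1 FROM triage_log WHERE msg_id = ?", (msg_id,)
    ).fetchone() is not None

def log_decision(db: sqlite3.Connection, row: dict) -> None:
    db.execute(
        "INSERT OR IGNORE INTO triage_log "
        "(msg_id, sender, subject, label, confidence, action, draft_id, raw_body) "
        "VALUES (:msg_id, :sender, :subject, :label, :confidence, :action, :draft_id, :raw_body)",
        row,
    )
    db.commit()

db = open_log()
row = {"msg_id": "18f2a", "sender": "jane@customer.io", "subject": "Pricing?",
       "label": "customer-question", "confidence": 0.92, "action": "reply",
       "draft_id": "r-1", "raw_body": "Hi, quick pricing question..."}
log_decision(db, row)
log_decision(db, row)  # second run is a no-op: INSERT OR IGNORE on msg_id
```

`INSERT OR IGNORE` on the message ID makes catch-up runs idempotent: re-fetching a window you've already processed can't double-label or double-draft.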
07 / 10
Build the morning digest format
1-2 hr
The agent's daily output to you is the digest: 5 emails that need human attention today, ranked by urgency. Format it for a 30-second skim: 1 line per email (sender, subject, why it needs you, suggested action). Send the digest as a doc update, a Slack message, or an email to yourself: whatever surface you check first thing.
Tasks
Decide the surface: doc (Brief), Slack DM, or self-email
Format: top 5 emails, 1 line each (from, subject, why, suggested action)
Add a footer: 'X emails labeled, Y drafts created since last digest'
Schedule the digest for your wake-up time (8am local is common)
Test for 3 days, refine the format based on what you actually act on
Digests longer than 5 emails get scanned, not read. Keep to 5, push the rest to a 'later' queue.
If the agent surfaces 5 emails and 4 of them aren't actually urgent, your importance ranking is wrong. Tweak the ranking criteria.
Agent prompt for this step
Read the last 24 hours of the Triage log surface.
Pick the 5 emails that need human attention today. Ranking criteria:
1. action = "flag" or "needs-human-review"
2. Sender importance (paying customer > vendor > newsletter > SaaS notification)
3. Time-sensitivity in the email body ("by EOD", "tomorrow", "ASAP")
For each: 1 line in the format:
`[Importance icon] [From] -- [Subject] -- [Why this needs you] -- [Suggested action]`
Include a footer: total count of triaged emails + drafts created since last digest.
Output as a Brief section titled "Morning digest <date>".
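For the cron variant, the digest can be assembled in code instead of by the agent prompt above. The ranking weights and field names here are illustrative; the shape (top 5, one line each, footer with counts) is the part to keep:

```python
# Rank flagged rows by sender importance, then time-sensitivity; emit top 5 + footer.
SENDER_RANK = {"customer": 0, "vendor": 1, "newsletter": 2, "saas": 3}

def digest(rows: list[dict], drafts_created: int) -> str:
    flagged = [r for r in rows if r["action"] == "flag"]
    flagged.sort(key=lambda r: (SENDER_RANK.get(r.get("sender_type"), 9),
                                not r.get("time_sensitive", False)))
    lines = [f"{r['sender']} -- {r['subject']} -- {r['why']} -- {r['suggested']}"
             for r in flagged[:5]]
    footer = f"{len(rows)} emails labeled, {drafts_created} drafts created since last digest"
    return "\n".join(lines + [footer])

rows = [
    {"action": "file", "sender": "news@weekly.dev", "subject": "Issue #88",
     "why": "", "suggested": ""},
    {"action": "flag", "sender_type": "vendor", "sender": "billing@acme.com",
     "subject": "Invoice due", "why": "needs approval", "suggested": "approve"},
    {"action": "flag", "sender_type": "customer", "time_sensitive": True,
     "sender": "jane@customer.io", "subject": "Pricing?",
     "why": "asked for EOD answer", "suggested": "review draft + send"},
]
out = digest(rows, drafts_created=1)
# the time-sensitive customer email ranks first; filed rows only count in the footer
```

A deterministic ranker is easier to debug than a prompted one; you can always hand the ranked rows back to the model just for the "why this needs you" phrasing.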
08 / 10
Run a 1-week shadow pass: log only, don't act
1 week (passive)
Before the agent applies a single label, run for 1 week in shadow mode: the loop runs, classifies, decides actions, writes to the Triage log. You manually review each decision. This is non-negotiable. Without a shadow week, you'll find out about misclassifications when a customer email gets auto-archived to Newsletters.
Tasks
Disable label application + draft creation in the loop
Keep the loop running with full classification + logging only
Each morning: review the Triage log surface for the previous day
Mark each row as 'agree' or 'override' (and what it should have been)
On day 7: compute classification accuracy + identify the labels with > 5% override rate
Skipping shadow mode is the #1 source of inbox-agent disasters. A 95% accurate classifier on a 200/day inbox = 10 misclassifications per day. Some will be customer-facing.
Use the shadow week to calibrate confidence thresholds, not to add more labels. Add labels later, after the existing ones are stable.
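The day-7 math from the tasks above fits in a few lines. Rows carry the 'agree'/'override' mark from the morning reviews; field names are illustrative:

```python
# Per-label override rate from reviewed shadow-week log rows.
from collections import defaultdict

def override_rates(rows: list[dict]) -> dict[str, float]:
    seen, overridden = defaultdict(int), defaultdict(int)
    for r in rows:
        seen[r["label"]] += 1
        if r["review"] == "override":
            overridden[r["label"]] += 1
    return {label: overridden[label] / n for label, n in seen.items()}

rows = (
    [{"label": "newsletter", "review": "agree"}] * 97
    + [{"label": "newsletter", "review": "override"}] * 3
    + [{"label": "customer-question", "review": "agree"}] * 18
    + [{"label": "customer-question", "review": "override"}] * 2
)
rates = override_rates(rows)
unstable = [l for l, r in rates.items() if r > 0.05]  # candidates for prompt fixes
# newsletter: 3% override (acceptable); customer-question: 10% (fix before go-live)
```

Labels over the 5% line get prompt work (more few-shot examples, tighter rules); labels near zero are candidates for more aggressive actions later.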
09 / 10
Go live, with a kill switch and a daily review
30 min to flip, 5 min/day for 4 weeks, then 30 min/week
After shadow week + tuning, flip on label application + draft creation. Add a kill switch (an env var or a Dock workspace flag) you can toggle to instantly stop the agent. Review the Triage log for 5 minutes every morning for the first 4 weeks; after that, weekly reviews are enough.
Tasks
Re-enable label application + draft creation in the loop
Add a kill switch: env var TRIAGE_AGENT_ENABLED=true|false (loop bails on false)
Set up a daily 5-min review on your calendar for the first 4 weeks
Each review: scan Triage log, mark overrides, tweak prompt for systematic errors
After 4 weeks: switch to weekly reviews
Set up an alert if override rate exceeds 10% in any 24 hr window
Override rates spike when senders change patterns (a customer's email signature changes, a SaaS tool restructures their notifications). Watch the trend.
Don't tweak the prompt mid-day over a single misclassification; you'll thrash. Wait for the daily review.
Drafts that sit untouched for > 7 days should be deleted by the agent on a weekly cleanup pass; otherwise they pile up and the inbox gets messier than before automation.
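The kill-switch task above is a few lines of code. The env var name comes from the task list; the loop body here is a placeholder sketch. Defaulting to off means a fresh deploy can't start triaging until you flip it deliberately:

```python
# Kill switch: the loop bails unless TRIAGE_AGENT_ENABLED is exactly "true".
import os

def agent_enabled() -> bool:
    # default off: a fresh deploy never starts triaging until you flip it
    return os.environ.get("TRIAGE_AGENT_ENABLED", "false").lower() == "true"

def run_triage_loop() -> str:
    if not agent_enabled():
        return "disabled"  # bail before touching Gmail
    # ... fetch, classify, label, draft, log ...
    return "ran"
```

Check the flag at the top of every run, not once at startup, so flipping the env var stops the agent on the next cycle without a redeploy.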
10 / 10
Expand scope, one capability per week
Once stable, the agent has room to grow: add a label for a new sender pattern, train it to draft replies on a category that was previously flag-only, expand to calendar invites or expense receipts. Resist scope creep; every new responsibility is a new failure surface. Add one capability per week, max.
Tasks
Each week: review override patterns from the previous week
Pick one improvement: new label, new draft template, new automation
Implement, test in shadow mode for the week, ship if accuracy > 95%
Update the Brief with the new capability + the date it shipped
Track the agent's labor savings: hours/week of email triage saved (a real metric, not vibes)
An agent with 25 labels has 25 ways to misclassify. The marginal label past 12-15 usually doesn't pay back.
Don't expand to send-on-behalf-of-user without a hard human-confirmation step. The risk-reward isn't worth it for any volume of email.
Hand the template to your agent
Workspace-wide agent prompt.
Paste this into your agent's permanent system prompt so the agent reads, writes, and maintains the template's surfaces as you work through the steps.
Agent system prompt
You are the inbox-triage agent on the workspace at your-org/automate-your-inbox-with-an-agent.
Your role: triage incoming emails, draft replies on routine ones, surface the 5 that need human attention.
Cadence:
- Run every 30 min during work hours.
- For each new email: classify by the taxonomy (in the Brief), apply the matching label, decide an action (reply / file / flag / archive).
- If the action is reply AND the email matches a routine pattern: draft a reply (do NOT send).
- Append every triage decision to the Triage log surface.
- At 8am local time: post a morning digest in the Brief with the 5 emails that need human attention today.
NEVER:
- Send an email without a human clicking Send.
- Apply a label not in the taxonomy.
- Forward an email outside the user's domain without explicit instruction.
First MCP tool calls:
1. list_workspaces()
2. get_doc(workspace_slug="automate-your-inbox-with-an-agent", surface_slug="brief")
3. list_rows(workspace_slug="automate-your-inbox-with-an-agent", surface_slug="triage-log")
FAQ
Common questions on this template.
Is it safe to let an AI agent read my email?
Safer than letting it send. The pattern in this playbook is narrowly scoped API access (gmail.readonly + gmail.modify) for labels and drafts; no gmail.send scope, ever. The agent never auto-sends; a human clicks Send on every reply. Your email content does go to the model API (Anthropic / OpenAI) for classification; both offer enterprise data agreements that don't train on your data, but verify those terms apply to your account tier.
Won't the agent miss something important?
It will. The mitigation is the morning digest + the flag-when-in-doubt rule. The agent labels emails it's confident about and flags everything else for human review. After 1 week of shadow running, you'll know your classification accuracy on each label; tune the confidence threshold so 'flag for review' captures the edge cases. A 95% accurate agent on a 200-email/day inbox flags ~10 emails for human attention: exactly the volume you can review in 5 min.
How much does this cost in API tokens?
For a 200-email/day inbox using Claude Sonnet: ~$10-20/month in token cost (each email is ~500-1000 input tokens for classification + a 50-200 token output). Switch to Haiku for ~$2-4/month if you don't need draft quality. Heavy drafters (5+ replies/day) push toward $30-50/mo. Cron + hosting is $0-10/mo on free tiers. Total: $15-60/mo on top of your existing inbox subscription.
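The back-of-envelope behind those figures, as a worked example. The per-token prices below are assumed placeholders, not current published pricing; check the model provider's rates before trusting the output:

```python
# Monthly token cost estimate for classification on a 200-email/day inbox.
EMAILS_PER_DAY = 200
IN_TOKENS, OUT_TOKENS = 500, 100   # per-email averages, low end of the FAQ range
PRICE_IN, PRICE_OUT = 3.00, 15.00  # assumed $ per million tokens -- placeholder

monthly_in = EMAILS_PER_DAY * 30 * IN_TOKENS / 1e6    # 3.0M input tokens / month
monthly_out = EMAILS_PER_DAY * 30 * OUT_TOKENS / 1e6  # 0.6M output tokens / month
monthly_cost = monthly_in * PRICE_IN + monthly_out * PRICE_OUT
# roughly $18/month at these assumed prices, before any drafting overhead
```

Input volume dominates: halving the email body you send (strip quoted threads, signatures, HTML boilerplate) roughly halves the bill.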
Why not use Gmail's built-in filters and Smart Reply?
Filters are pattern-matching (sender, subject keyword); they can't reason about content. Smart Reply suggests 3 short replies but doesn't classify or take batch action. An agent reads each email, applies your custom taxonomy, drafts contextual replies, and produces a daily digest. Use filters for the truly mechanical cases (auto-archive a specific newsletter), use the agent for everything that requires reading the email.
What's the worst-case failure?
An agent that auto-archives a customer email by misclassifying it as a newsletter, and you don't see it for a week. Mitigations: (1) shadow mode for week 1, no actions, (2) auto-archive only on labels with > 99% accuracy in shadow review, (3) the Triage log surface preserves every email's classification + content forever, so misses are recoverable, (4) a kill switch you can flip if anything looks off.
Can my AI agents help build the agent?
Yes. The playbook ships agent prompts for the slow parts: the 30-day inbox audit, the label taxonomy, the classification prompt with few-shot examples, the morning digest format, and the weekly override-pattern analysis. The Triage log surface is the canonical record the agent reads to learn what it's doing well and where it's drifting.
Open this template as a workspace.
We mint a fresh copy in your org with the steps as table rows, the pointers as a separate table, and the brief as a doc. Bring your agents, start checking off boxes.