
Run a research agent on your codebase

9-step playbook to put a read-only research agent on a codebase: maps the architecture, answers questions, ships weekly written explainers, no commits.

16 min read · from trydock.ai


A 9-step playbook. Open in Dock and you'll get four surfaces seeded:

- **Steps** (table): the 9 gates as rows, with owner + due + status
- **Pointers** (table): every official Claude Code / Cursor / MCP doc linked from this playbook
- **Brief** (doc): the canonical architecture map the agent maintains
- **Question log** (table): one row per research question, with answer + cited file paths + confidence

Open `Steps` first. The most important rule: read-only scope. No commits, no edits, no exceptions.

Outcome

A read-only research agent that maps the codebase, answers architecture questions in chat with cited file paths, and ships a weekly written explainer that grows into durable team knowledge.

Estimated time: 1-2 days build, ongoing weekly use
Difficulty: intermediate
For: Engineers joining a new codebase, on-call rotations, or maintainers of legacy code.

What you'll need

Pre-register or install before you start.

  • Claude Code (free CLI; metered Anthropic API usage, ~$0.10-0.50 per research session) — Terminal-native agent with full filesystem access. Best default for codebase research.
  • Cursor (Free tier, Pro $20/mo) — IDE-native agent with codebase indexing. Good for in-editor research while you read code.
  • Aider (Free, metered API usage on whichever model you wire it to) — Open-source CLI alternative if you want full control over the agent loop.
  • Anthropic API, Claude Sonnet or Opus (Sonnet ~$3/M input, Opus ~$15/M input) — The model. Sonnet for cost, Opus for the hardest architectural questions.
  • ripgrep (rg) (Free) — Fast text search. Every codebase research agent needs this installed locally.

The template · 9 steps

Step 1: Pick the codebase + define the read-only scope

Estimated time: 30 min

Decide what 'the codebase' means: one repo, a monorepo with N packages, or a multi-repo system. Be explicit; the agent's job description depends on it. Then write the read-only rule in plain text: the agent reads, summarizes, cites, and writes only to the workspace docs. It does not edit code, does not run scripts that modify state, does not commit, does not push. Lock this in before you give the agent a runtime.

Tasks

  • Name the repo (or set of repos) the agent has access to
  • Define 'read-only': read code, read git log, read tests, run static analysis tools, no mutations
  • Confirm the agent's runtime supports a read-only mode (Claude Code: --no-write flag; Cursor: ask-mode; Aider: --read-only)
  • Write the scope into the Brief as the canonical reference

Pointers

[!CAUTION] Gotchas

  • An agent that's allowed to run scripts can mutate state (DB writes, API calls, file changes) unintentionally. Ban shell access if you don't trust the model on git checkout vs git reset --hard.
  • Read-only doesn't mean 'no tools'. The agent can run static analyzers, parse ASTs, query git log; it just can't change anything.
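The 'read-only but not tool-less' boundary can be enforced mechanically rather than by prompt alone. A minimal sketch of a default-deny tool gate; the tool names here are illustrative, not any runtime's real API:

```python
# Every tool call the agent proposes passes through this check before
# execution. Tool names are hypothetical placeholders.
READ_ONLY_TOOLS = {"read_file", "grep", "git_log", "list_dir", "run_static_analysis"}
MUTATING_TOOLS = {"write_file", "edit_file", "git_commit", "git_push", "run_shell"}

def allow_tool_call(tool_name: str) -> bool:
    """Return True only for tools on the explicit read-only allowlist."""
    if tool_name in MUTATING_TOOLS:
        return False
    # Default-deny: any tool not on the allowlist is treated as mutating.
    return tool_name in READ_ONLY_TOOLS
```

The key design choice is default-deny: a tool the gate has never heard of (say, `deploy`) is refused, so a runtime update can't silently widen the agent's scope.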

Step 2: Seed the architecture map with a top-down pass

Estimated time: 1-2 hr

The biggest mistake is to ask the agent specific questions before it has a high-level map. The agent will answer with whatever files it grep'd, often missing the canonical implementation. Instead, ask the agent to produce a top-down architecture map first: services, modules, key abstractions, data flow. Save that to the Brief, refer back to it in every subsequent question.

Tasks

  • Ask the agent to produce a top-down architecture map: services, modules, entry points, key abstractions
  • For each module: 1-paragraph purpose + 3-5 key files + relationships to other modules
  • Identify the 5-10 'load-bearing' files (hit constantly, refactor-resistant)
  • Save the map to the Brief as the canonical reference
  • Spot-check 5 claims in the map by reading the cited files yourself

Pointers

[!CAUTION] Gotchas

  • Agents over-index on the first files they read. Force a breadth-first pass before depth dives.
  • An architecture map full of confident claims with no file paths is hallucinated. Demand citations.
  • The first map will be 60-70% right. Spot-check, correct, ship v2.

Agent prompt for this step

Read the codebase top-down.

Output an architecture map as a Brief section titled "Architecture map v1":
1. System overview (1 paragraph: what this codebase is, who uses it)
2. Modules / services (5-15 entries, each with: name, purpose, entry-point files)
3. Data flow (how a typical request / event moves through the system)
4. Load-bearing files (5-10 files that are hit constantly + a 1-line note on why)
5. Open questions (things you couldn't determine from reading alone)

Constraints:
- Cite file paths for every claim. If you can't cite a path, mark it "(inferred)".
- Don't editorialize. Describe what's there, not what should be.
- Confidence level per module: high / medium / low. Low-confidence modules go in "Open questions" too.
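The breadth-first rule from the gotchas can also be enforced mechanically: hand the agent a level-by-level directory inventory as its first input, so it sees every module before any deep dive. A minimal stdlib sketch; the skip list is an assumption to tune per repo:

```python
import os

# Assumption: vendored / generated trees that drown out the real structure.
SKIP = {".git", "node_modules", "dist", "__pycache__"}

def breadth_first_inventory(root, max_depth=2):
    """Walk the tree and return (relative_path, depth, n_files) tuples,
    shallowest first, so every module surfaces before any deep dive."""
    rows = []
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        dirnames[:] = [d for d in dirnames if d not in SKIP]
        if depth >= max_depth:
            dirnames[:] = []            # cap the descent at max_depth
        rows.append((rel, depth, len(filenames)))
    rows.sort(key=lambda r: r[1])       # stable sort -> breadth-first order
    return rows
```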

Step 3: Set up the question pipeline + log

Estimated time: 1 hr

Every research question the agent answers should land in a structured log: question, answer, file paths cited, confidence (high / medium / low), open follow-ups. The log is the durable artifact: chat history disappears; the log persists. Make the agent write to the log on every answer.

Tasks

  • Define the Question log schema: question, answer (markdown), files_cited (array), confidence, follow_ups, asked_by, asked_at
  • Update the agent's system prompt to write each answer to the log
  • Add a 'confidence' rule: high = direct quote from a file, medium = inferred from 2+ files, low = pattern match without specific evidence
  • Spot-check: ask 5 questions, confirm 5 rows appear in the log with correct citations
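The schema above can be pinned down as a dataclass so malformed rows fail loudly instead of silently polluting the log. A sketch, assuming Python 3.9+; the validation in `__post_init__` encodes this step's confidence rule:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class QuestionLogRow:
    question: str
    answer: str                       # markdown
    files_cited: list[str]
    confidence: str                   # "high" | "medium" | "low"
    follow_ups: list[str] = field(default_factory=list)
    asked_by: str = ""
    asked_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def __post_init__(self):
        if self.confidence not in {"high", "medium", "low"}:
            raise ValueError(f"bad confidence: {self.confidence!r}")
        # The confidence rule, enforced mechanically: "high" means a
        # direct quote from a file, so it requires at least one citation.
        if self.confidence == "high" and not self.files_cited:
            raise ValueError("high confidence requires cited files")
```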

Pointers

[!CAUTION] Gotchas

  • Agents will skip the log step if it's awkward in the prompt. Make it the FINAL step in the answer flow, not optional.
  • Confidence calibration drifts. Spot-check 1 in 10 'high confidence' answers; if accuracy is below 95%, recalibrate the prompt.

Step 4: Define the question taxonomy: what the agent will + won't answer

Estimated time: 30 min

Not every question is a research question. The agent should answer: 'where is this implemented?', 'why is this designed this way?', 'what's the data flow for X?', 'what are the test patterns?'. The agent should refuse: 'fix this bug', 'add a feature', 'refactor X', 'update the schema'. Refusal isn't lazy; it's the boundary that keeps the read-only scope intact.

Tasks

  • Write the in-scope list (5-10 question types the agent answers well)
  • Write the out-of-scope list (5-10 question types the agent refuses + suggests a coding agent for)
  • Add the lists to the agent's system prompt as explicit refuse patterns
  • Document in the Brief so users (you + teammates) know what to ask
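The in-scope / out-of-scope lists can double as a cheap pre-filter in front of the model. A sketch with illustrative patterns (tune them to your team's actual question mix); out-of-scope wins ties, because refusing is safer than drifting:

```python
import re

# Illustrative pattern lists, not exhaustive.
IN_SCOPE = [
    r"\bwhere is\b", r"\bwhy (is|was|does)\b", r"\bdata flow\b",
    r"\bhow does .* work\b", r"\btest pattern", r"\barchitecture\b",
]
OUT_OF_SCOPE = [
    r"\bfix\b", r"\brefactor\b", r"\badd (a )?feature\b",
    r"\bupdate the schema\b", r"\bimplement\b", r"\bcommit\b",
]

REFUSAL = ("This is a coding task. Spawn a Cursor / Claude Code agent "
           "with write access in the repo to do it.")

def route(question: str) -> str:
    q = question.lower()
    # Check refusals first: safety beats helpfulness on ambiguous asks.
    if any(re.search(p, q) for p in OUT_OF_SCOPE):
        return "refuse: " + REFUSAL
    if any(re.search(p, q) for p in IN_SCOPE):
        return "answer"
    return "clarify"   # neither list matched: ask the user to rephrase
```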

Pointers

[!CAUTION] Gotchas

  • Without explicit out-of-scope rules, the agent drifts toward 'helpful' edits. The user asks 'where is X?' and the agent ends up suggesting 5 changes. Refuse early.
  • Refusals should suggest the right tool: 'this is a coding task, spawn a Cursor / Claude Code agent in the repo to do it.'

Step 5: Run a 1-day pilot: answer 10 questions, calibrate confidence

Estimated time: 4-6 hr

Pick 10 real questions you have about the codebase. Run them through the agent. Spot-check every answer against the source files. Compute accuracy: how many high-confidence answers were actually right? How many flagged confidence levels matched the actual evidence quality? This calibration loop is what separates a useful research agent from a confident-bullshit machine.

Tasks

  • Write 10 real questions, mix of architecture / implementation / why / where
  • Ask each, log to the Question log surface
  • For each answer: open the cited files, verify the claim
  • Mark each as: correct, partial, hallucinated
  • Compute accuracy by confidence bucket (high / medium / low)
  • If high-confidence accuracy < 95%: tighten the system prompt's confidence definition
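Computing accuracy by bucket is a few lines once the verdicts are logged. A sketch; counting 'partial' as half credit is an assumption, not a rule from this playbook:

```python
from collections import defaultdict

def accuracy_by_bucket(rows):
    """rows: (confidence, verdict) pairs, where verdict is
    'correct' | 'partial' | 'hallucinated'.
    Assumption: partial answers count as half credit."""
    score = {"correct": 1.0, "partial": 0.5, "hallucinated": 0.0}
    totals = defaultdict(lambda: [0.0, 0])   # bucket -> [points, count]
    for confidence, verdict in rows:
        totals[confidence][0] += score[verdict]
        totals[confidence][1] += 1
    return {bucket: pts / n for bucket, (pts, n) in totals.items()}
```

If `accuracy_by_bucket(...)["high"]` comes back under 0.95, that's the signal to tighten the system prompt's confidence definition before going live.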

Pointers

[!CAUTION] Gotchas

  • Confidence drift is real: an agent that says 'high confidence' on an inferred answer breaks trust fast. Calibrate before going live.
  • Don't pilot only on questions you already know the answer to. Mix knowns and unknowns; the unknowns are the real test.

Agent prompt for this step

Run the calibration pass on the 10 pilot questions.

For each question + answer pair in the Question log:
1. Re-read the cited file paths.
2. Verify the answer matches the source code.
3. Update the row with: verified (correct / partial / hallucinated), notes on what was wrong.

Then output a calibration report as a Brief section titled "Pilot calibration v1":
- Accuracy by confidence bucket (high / med / low)
- Common failure modes (wrong file, missing context, over-confident)
- 3 specific system-prompt tweaks to apply, ordered by expected lift

Constraint: be brutal in the assessment. A confidently wrong answer is worse than an 'I don't know.'

Step 6: Wire the agent into your team's workflow

Estimated time: 1-2 hr

An agent nobody asks questions of is dead. Wire it into the workflow: a Slack channel, a recurring #ask-the-codebase ritual, a 'before-you-ask-a-senior-engineer' default. Give it a name, give it a place. The teams that get the most out of research agents treat them as a junior team member with infinite patience for 'newbie' questions.

Tasks

  • Create a Slack channel #codebase-research (or equivalent in your team's chat)
  • Wire the agent to listen to the channel (custom Slack app + Claude API, OR Cursor / Claude Code with a Slack MCP)
  • Document the channel in your team's onboarding: 'ask any codebase question here, the agent answers + a human confirms'
  • Set the agent to log every question to the Question log surface
  • Encourage 5-10 teammates to ask 1 question each in the first week

Pointers

[!CAUTION] Gotchas

  • An agent in a channel without humans replying becomes a soliloquy. Set the expectation: agent answers first, a human replies if the agent missed something, both are visible.
  • Don't auto-DM teammates with daily summaries. They'll mute the bot. Pull-based (a channel) > push-based (a DM).

Step 7: Ship the weekly architecture explainer

Estimated time: 30 min/week to review, 0 min for the agent to draft

Every Friday, the agent produces a written explainer covering the architectural patterns that emerged from the week's questions. Over time, these explainers become the de facto codebase documentation: ground truth, structured, version-controlled. Better than the README that hasn't been updated since 2022.

Tasks

  • Schedule the agent to produce a weekly explainer every Friday at 4pm
  • Format: 1 architectural theme (the most-asked-about one this week), 3-5 file paths, 2-3 paragraphs of explanation
  • Save each weekly explainer to the Brief under a 'Weekly explainers' heading
  • On Monday morning: review last week's explainer, fact-check, edit, and link to it from the team docs
  • After 4-6 weeks: review the cumulative explainer, restructure into a proper architecture doc
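The theme-selection question from the agent prompt below ("what were the top 3 file paths cited this week?") is easy to answer deterministically from the Question log. A sketch, assuming log rows are dicts with ISO `asked_at` timestamps and a `files_cited` list as in the Step 3 schema:

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

def weekly_theme(rows, now=None):
    """rows: dicts with 'asked_at' (ISO timestamp) and 'files_cited' (list).
    Returns the top 3 file paths cited in the last 7 days: the raw signal
    the explainer's theme is picked from."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=7)
    counts = Counter()
    for row in rows:
        if datetime.fromisoformat(row["asked_at"]) >= cutoff:
            counts.update(row["files_cited"])
    return counts.most_common(3)
```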

Pointers

[!CAUTION] Gotchas

  • Weekly explainers that just summarize the questions are useless. They have to surface the underlying pattern, not enumerate the week's Q&A.
  • Don't accumulate 12 weekly explainers without restructuring. By month 2, refactor them into proper architecture docs.

Agent prompt for this step

Read the last 7 days of the Question log surface.

Identify the most-asked-about architectural theme. ("What were the top 3 file paths cited this week? What domain did the questions cluster around?")

Write a weekly explainer as a Brief section titled "Weekly explainer <week-of>":
1. Theme (1 sentence)
2. Why it came up this week (1 paragraph)
3. The architectural pattern (2-3 paragraphs, with file path citations)
4. Open questions / known gaps
5. Suggested next read-through if the team wants to deepen on this

Constraints:
- Cite real file paths for every claim. If you can't cite, drop the claim.
- Write for a senior engineer joining the team, not for marketing.
- Length: 300-600 words, no longer.

Step 8: Spot-check + edit the explainers; treat them as living docs

Estimated time: 30 min/week

The agent's explainers are the first draft, not the final word. Every Monday, spend 30 min reading last week's explainer with the source code in the other window. Correct hallucinations, sharpen claims, add the institutional context the agent doesn't have. Then check the edited version into the team's docs repo. Over time, this is how you build the codebase documentation that always seemed too expensive to maintain.

Tasks

  • Monday: read last week's explainer side-by-side with cited files
  • Mark hallucinations / over-claims / missing context
  • Edit the explainer in the Brief surface
  • Copy the edited version into the team's docs repo (e.g. /docs/architecture/.md)
  • Update the Question log: link to the canonical doc on rows that contributed to the explainer

Pointers

[!CAUTION] Gotchas

  • Explainers checked into docs without spot-check propagate hallucinations into the team's mental model. Don't skip the Monday review.
  • Editing in a chat tab loses the changes when the tab closes. Edit in the Brief surface or the docs repo, not in chat.

Step 9: Iterate: add static analysis, expand to multi-repo, retire when redundant

Estimated time: Ongoing, 1-2 hr/month

Once stable, expand: wire in static analysis tools the agent can run (cloc, semgrep, dependency graph), grow from one repo to a monorepo or multi-repo cluster, and integrate with the team's existing docs site. Eventually the agent's job will shrink; the docs the team builds from explainers will replace many of the questions. That's success, not failure.

Tasks

  • Add tool access: rg, cloc, semgrep, ts-node / pyright for type info, git log for history
  • Expand to additional repos in a monorepo or multi-repo system
  • Integrate the explainer output with the team's docs site (Vercel-hosted, mkdocs, Docusaurus, whatever)
  • Retire question categories the explainers have permanently answered, route to the docs
  • After 3 months: review the volume of new questions per week. If it's under 5, the agent has done its job for this codebase
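The retirement check in the last task can run straight off the Question log's `asked_at` column. A sketch; the thresholds are the ones suggested above, exposed as parameters:

```python
from collections import Counter
from datetime import datetime

def questions_per_week(timestamps):
    """timestamps: ISO strings from the log's asked_at column.
    Buckets them by ISO calendar week, oldest first."""
    weeks = Counter()
    for ts in timestamps:
        year, week, _ = datetime.fromisoformat(ts).isocalendar()
        weeks[(year, week)] += 1
    return dict(sorted(weeks.items()))

def ready_to_retire(timestamps, threshold=5, recent_weeks=4):
    """True when every recent week is under the question-volume threshold,
    i.e. the docs have absorbed the agent's job."""
    counts = list(questions_per_week(timestamps).values())
    recent = counts[-recent_weeks:]
    return bool(recent) and all(n < threshold for n in recent)
```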

Pointers

[!CAUTION] Gotchas

  • Multi-repo agents need careful read-only enforcement: one repo's read-only scope might be another team's active coding area. Lock the scope per repo.
  • Don't keep an agent running on a codebase with low question volume. Tokens cost real money; agents that aren't earning their keep should sleep.

Hand the template to your agent

Paste the prompt below into your agent's permanent system prompt so the agent reads, writes, and maintains this workspace as you work through the steps.

You are the research agent on the workspace at your-org/run-a-research-agent-on-your-codebase.

Your role: read the codebase + answer questions about it. NEVER edit files. NEVER commit. NEVER write to the filesystem outside the workspace docs.

Cadence:
- For each question the user asks: read the relevant files, answer with cited file paths + line numbers, log the question + answer + confidence to the Question log.
- Each Friday: produce a weekly architecture explainer in the Brief, summarising the questions answered that week + the architectural patterns that emerged.

Read-only scope is the hard rule. If the user asks you to fix a bug or refactor: refuse, suggest spawning a coding agent in a separate workspace.

First MCP tool calls:
1. list_workspaces()
2. get_doc(workspace_slug="run-a-research-agent-on-your-codebase", surface_slug="brief")
3. list_rows(workspace_slug="run-a-research-agent-on-your-codebase", surface_slug="question-log")

FAQ

Why a separate research agent vs. just using Claude Code or Cursor for everything?

Scope discipline. A multipurpose 'do anything in the repo' agent will edit a file you didn't ask it to edit, eventually. A research agent in read-only mode never can. The boundary is the safety. You can always spawn a separate coding agent (with write access) when you've identified the change to make; the research agent's job is to find it without changing it.

How do I keep the agent from hallucinating module relationships?

Three rails: (1) demand a citation (file path + line number) for every claim, (2) calibrate confidence buckets in the system prompt and spot-check the high-confidence claims weekly, (3) run a top-down architecture pass first, refer back to it instead of letting each question be a fresh exploration. The combination keeps hallucination rates under 5% on high-confidence answers in practice.

Should the agent commit its own explainers to the docs repo?

No. Edit + commit is a human step. The agent drafts in the Brief surface; you review, edit, and commit. Auto-committing AI-drafted docs is how teams end up with 200 docs nobody trusts. Manual commit is the trust gate.

What runtime should I pick: Claude Code, Cursor, or a custom MCP-based agent?

Claude Code is the simplest default: terminal-native, full filesystem access, supports a --no-write flag for read-only mode. Cursor is the right choice if you're already in the editor and want in-context Q&A. A custom MCP-based agent (with the workspace + Slack + git MCPs wired in) is more work but gives you a permanent agent that lives in chat, not a per-session conversation.

What does this cost in API tokens?

A typical research session (5-10 questions, 1 architecture pass) is ~$0.10-0.50 on Claude Sonnet. A weekly explainer is ~$0.20-1. Total monthly cost for a team of 5-10 asking the agent 20-50 questions/week: $5-30/month. Heavy users with the agent in a Slack channel: $20-80/month. Negligible compared to the time saved.
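The back-of-envelope behind those numbers, as a sketch. The per-session token counts are assumptions, not measurements, and the Sonnet output rate is an assumed figure (only the input rates appear in this playbook); plug in your own numbers from the API usage dashboard:

```python
# Rates in $/M tokens. Input rate is from the pricing above;
# the output rate is an assumption.
SONNET_INPUT_PER_M = 3.00
SONNET_OUTPUT_PER_M = 15.00

def session_cost(input_tokens, output_tokens):
    """Dollar cost of one research session at the rates above."""
    return (input_tokens / 1e6 * SONNET_INPUT_PER_M
            + output_tokens / 1e6 * SONNET_OUTPUT_PER_M)

# A 10-question session reading ~40k tokens of code and writing ~4k:
cost = session_cost(40_000, 4_000)   # ~= $0.18, inside the $0.10-0.50 range
```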

Can my AI agents help build the agent?

Yes. The playbook ships agent prompts for the slow parts: the top-down architecture map, the confidence calibration pass, the weekly explainer drafts, and the Monday-review process. The Question log surface is the canonical record, every question logged with citations + confidence, every weekly explainer linked back to the questions it summarized.

Remix this into Dock

Make this yours. Edit, extend, run agents on it.

Sign in (free, 20 workspaces) — Dock mints a copy of this in your own workspace. The original stays untouched.


No Dock account? Sign-in is signup. Magic-link in 30 seconds.