A 9-step playbook for putting a read-only research agent on a codebase: it maps the architecture, answers questions, ships weekly written explainers, and never commits.
A research agent on a codebase is the fastest way to get from 'I just cloned this repo' to 'I can answer architecture questions confidently.' Done right, the agent reads the code, answers your questions in chat, and ships a weekly written explainer that becomes durable knowledge for the team. Done wrong, the agent hallucinates module relationships and you propagate the wrong mental model. This playbook walks through the 9 steps that keep the agent honest: read-only scope (no commits), high-leverage seeding (top-down architecture map first, not random file reads), question logging, and the weekly grounded-write-up loop.
Outcome
A read-only research agent that maps the codebase, answers architecture questions in chat with cited file paths, and ships a weekly written explainer that grows into durable team knowledge.
Time: 1-2 days to build, then ongoing weekly use. Difficulty: intermediate. For: engineers joining a new codebase, on-call rotations, or maintainers of legacy code.
Top to bottom. Each step has tasks, pointers, gotchas.
01 / 09
Pick the codebase + define the read-only scope
30 min
Decide what 'the codebase' means: one repo, a monorepo with N packages, or a multi-repo system. Be explicit: the agent's job description depends on it. Then write the read-only rule in plain text: the agent reads, summarizes, cites, and writes only to the workspace docs. It does not edit code, does not run scripts that modify state, does not commit, does not push. Lock this in before you give the agent a runtime.
Tasks
Name the repo (or set of repos) the agent has access to
Define 'read-only': read code, read git log, read tests, run static analysis tools, no mutations
Confirm the agent's runtime supports a read-only or ask-only mode (Claude Code's permission settings, Cursor's ask mode, Aider's read-only files)
Write the scope into the Brief as the canonical reference
An agent that's allowed to run scripts can mutate state (DB writes, API calls, file changes) unintentionally. Ban shell access if you don't trust the model on `git checkout` vs `git reset --hard`.
Read-only doesn't mean 'no tools'. The agent can run static analyzers, parse ASTs, query git log; it just can't change anything.
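If your runtime lets you intercept shell commands, the read-only rule can be enforced mechanically instead of by trust. A minimal sketch of an allowlist guard; the command lists and the `is_read_only` helper are illustrative, not part of any particular runtime:

```python
import shlex

# Commands the research agent may run: read and analyze, never mutate.
# Tune these lists to your repo's tooling.
READ_ONLY_COMMANDS = {"rg", "grep", "cat", "ls", "cloc", "semgrep", "wc"}
READ_ONLY_GIT_SUBCOMMANDS = {"log", "show", "blame", "diff", "ls-files"}

def is_read_only(command: str) -> bool:
    """Return True only for commands on the read-only allowlist."""
    parts = shlex.split(command)
    if not parts:
        return False
    if parts[0] == "git":
        return len(parts) > 1 and parts[1] in READ_ONLY_GIT_SUBCOMMANDS
    return parts[0] in READ_ONLY_COMMANDS

# `git log` passes, `git reset --hard` does not.
assert is_read_only("git log --oneline -20")
assert not is_read_only("git reset --hard HEAD~1")
```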
02 / 09
Seed the architecture map with a top-down pass
1-2 hr
The biggest mistake is to ask the agent specific questions before it has a high-level map. The agent will answer with whatever files it grep'd, often missing the canonical implementation. Instead, ask the agent to produce a top-down architecture map first: services, modules, key abstractions, data flow. Save that to the Brief, refer back to it in every subsequent question.
Tasks
Ask the agent to produce a top-down architecture map: services, modules, entry points, key abstractions
For each module: 1-paragraph purpose + 3-5 key files + relationships to other modules
Identify the 5-10 'load-bearing' files (hit constantly, refactor-resistant)
Save the map to the Brief as the canonical reference
Spot-check 5 claims in the map by reading the cited files yourself
Agents over-index on the first files they read. Force a breadth-first pass before depth dives.
An architecture map full of confident claims with no file paths is hallucinated. Demand citations.
The first map will be 60-70% right. Spot-check, correct, ship v2.
Agent prompt for this step
Read the codebase top-down.
Output an architecture map as a Brief section titled "Architecture map v1":
1. System overview (1 paragraph: what this codebase is, who uses it)
2. Modules / services (5-15 entries, each with: name, purpose, entry-point files)
3. Data flow (how a typical request / event moves through the system)
4. Load-bearing files (5-10 files that are hit constantly + a 1-line note on why)
5. Open questions (things you couldn't determine from reading alone)
Constraints:
- Cite file paths for every claim. If you can't cite a path, mark it "(inferred)".
- Don't editorialize. Describe what's there, not what should be.
- Confidence level per module: high / medium / low. Low-confidence modules go in "Open questions" too.
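A cheap mechanical pass before the manual spot-check: pull the file paths out of the map and confirm they exist in the repo. A sketch that assumes the map was exported to a local markdown file; the filename and the path regex are illustrative:

```python
import re
from pathlib import Path

REPO_ROOT = Path(".")                      # run from the repo root
MAP_FILE = Path("architecture-map-v1.md")  # wherever you exported the Brief section

# Rough heuristic for paths like src/foo/bar.py or packages/api/index.ts.
PATH_PATTERN = re.compile(r"\b[\w./-]+\.(?:py|ts|tsx|js|go|rs|java|rb)\b")

cited_paths = sorted(set(PATH_PATTERN.findall(MAP_FILE.read_text())))
missing = [p for p in cited_paths if not (REPO_ROOT / p).exists()]

print(f"{len(cited_paths)} paths cited, {len(missing)} missing from the repo:")
for p in missing:
    print(f"  MISSING: {p}")
```

This only catches citations that point at nonexistent files; whether a claim about a real file is accurate still needs the manual read.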
03 / 09
Set up the question pipeline + log
1 hr
Every research question the agent answers should land in a structured log: question, answer, file paths cited, confidence (high / medium / low), open follow-ups. The log is the durable artifact: the chat history disappears, the log persists. Make the agent write to the log on every answer.
Agents will skip the log step if it's awkward in the prompt. Make it the FINAL step in the answer flow, not optional.
Confidence calibration drifts. Spot-check 1 in 10 'high confidence' answers; if accuracy is below 95%, recalibrate the prompt.
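The exact shape of the log matters less than having one the agent writes to every time. A minimal sketch of a row schema, assuming a JSONL file as the log surface; the field names and the example row are illustrative:

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import date

@dataclass
class QuestionLogRow:
    asked_on: str               # ISO date
    question: str
    answer_summary: str         # 1-3 sentences, not the full chat transcript
    cited_paths: list[str]      # file paths (ideally path:line) backing the answer
    confidence: str             # "high" | "medium" | "low"
    follow_ups: list[str] = field(default_factory=list)
    verified: str = ""          # filled in later: "correct" | "partial" | "hallucinated"

row = QuestionLogRow(
    asked_on=date.today().isoformat(),
    question="Where is request authentication implemented?",
    answer_summary="A gateway middleware validates the session token before routing.",
    cited_paths=["gateway/middleware/auth.py:42"],  # hypothetical path
    confidence="high",
    follow_ups=["Why are there two token formats?"],
)

with open("question-log.jsonl", "a") as f:
    f.write(json.dumps(asdict(row)) + "\n")
```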
04 / 09
Define the question taxonomy: what the agent will + won't answer
30 min
Not every question is a research question. The agent should answer: 'where is this implemented?', 'why is this designed this way?', 'what's the data flow for X?', 'what are the test patterns?'. The agent should refuse: 'fix this bug', 'add a feature', 'refactor X', 'update the schema'. Refusal isn't lazy, it's the boundary that keeps the read-only scope intact.
Tasks
Write the in-scope list (5-10 question types the agent answers well)
Write the out-of-scope list (5-10 question types the agent refuses + suggests a coding agent for)
Add the lists to the agent's system prompt as explicit refuse patterns
Document in the Brief so users (you + teammates) know what to ask
Without explicit out-of-scope rules, the agent drifts toward 'helpful' edits. The user asks 'where is X?' and the agent ends up suggesting 5 changes. Refuse early.
Refusals should suggest the right tool: 'this is a coding task, spawn a Cursor / Claude Code agent in the repo to do it.'
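If you want the scope boundary enforced before the prompt even sees the question, a crude keyword router can pre-filter obvious coding requests. A sketch; the hint list is an illustrative starting point, and the prompt-level refusal rules are still needed for anything this misses:

```python
# Route obvious out-of-scope (coding) requests away from the research agent.
OUT_OF_SCOPE_HINTS = ("fix ", "refactor", "add a feature", "update the schema", "implement", "rename")

def route(question: str) -> str:
    q = question.lower()
    if any(hint in q for hint in OUT_OF_SCOPE_HINTS):
        return "refuse: coding task -- spawn a coding agent with write access in the repo"
    return "answer: treat as a research question; refuse if it turns into an edit request"

print(route("Where is the retry logic for webhook delivery?"))   # answer
print(route("Refactor the webhook retry logic to use a queue"))  # refuse
```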
05 / 09
Run a 1-day pilot: answer 10 questions, calibrate confidence
4-6 hr
Pick 10 real questions you have about the codebase. Run them through the agent. Spot-check every answer against the source files. Compute accuracy: how many high-confidence answers were actually right? How many flagged confidence levels matched the actual evidence quality? This calibration loop is what separates a useful research agent from a confident-bullshit machine.
Tasks
Write 10 real questions, mix of architecture / implementation / why / where
Ask each, log to the Question log surface
For each answer: open the cited files, verify the claim
Mark each as: correct, partial, hallucinated
Compute accuracy by confidence bucket (high / medium / low)
If high-confidence accuracy < 95%: tighten the system prompt's confidence definition
Confidence drift is real: an agent that says 'high confidence' on an inferred answer breaks trust fast. Calibrate before going live.
Don't pilot only on questions you already know the answer to. Mix known and unknown questions; the unknowns are the real test.
Agent prompt for this step
Run the calibration pass on the 10 pilot questions.
For each question + answer pair in the Question log:
1. Re-read the cited file paths.
2. Verify the answer matches the source code.
3. Update the row with: verified (correct / partial / hallucinated), notes on what was wrong.
Then output a calibration report as a Brief section titled "Pilot calibration v1":
- Accuracy by confidence bucket (high / med / low)
- Common failure modes (wrong file, missing context, over-confident)
- 3 specific system-prompt tweaks to apply, ordered by expected lift
Constraint: be brutal in the assessment. A confidently wrong answer is worse than an 'I don't know.'
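If you'd rather compute the bucket accuracy yourself than trust the agent's own report, a minimal sketch that assumes the JSONL log format sketched in step 03:

```python
import json
from collections import defaultdict

# Tally verified outcomes per confidence bucket from the question log.
totals = defaultdict(lambda: {"correct": 0, "partial": 0, "hallucinated": 0})

with open("question-log.jsonl") as f:
    for line in f:
        row = json.loads(line)
        if row.get("verified"):
            totals[row["confidence"]][row["verified"]] += 1

for bucket in ("high", "medium", "low"):
    counts = totals[bucket]
    n = sum(counts.values())
    if n == 0:
        continue
    print(f"{bucket:>6}: {n} answers, {counts['correct'] / n:.0%} correct, "
          f"{counts['hallucinated']} hallucinated")
```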
06 / 09
Wire the agent into your team's workflow
1-2 hr
An agent nobody asks questions of is dead. Wire it into the workflow: a Slack channel, a recurring #ask-the-codebase ritual, a 'before-you-ask-a-senior-engineer' default. Give it a name, give it a place. The teams that get the most out of research agents treat them as a junior team member with infinite patience for 'newbie' questions.
Tasks
Create a Slack channel #codebase-research (or equivalent in your team's chat)
Wire the agent to listen to the channel (custom Slack app + Claude API, OR Cursor / Claude Code with a Slack MCP)
Document the channel in your team's onboarding: 'ask any codebase question here, the agent answers + a human confirms'
Set the agent to log every question to the Question log surface
Encourage 5-10 teammates to ask 1 question each in the first week
An agent in a channel without humans replying becomes a soliloquy. Set the expectation: agent answers first, a human replies if the agent missed something, both are visible.
Don't auto-DM teammates with daily summaries. They'll mute the bot. Pull-based (a channel) > push-based (a DM).
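One way to wire the channel: a small Slack Bolt app that forwards mentions to the model and replies in a thread so a human can confirm underneath. A minimal sketch, assuming a Socket Mode Slack app and the Anthropic Python SDK; the tokens, model name, and prompt file are placeholders, and on its own this gives the model no repo access (pair it with your agent runtime or repo tools for grounded answers):

```python
import os

import anthropic
from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

app = App(token=os.environ["SLACK_BOT_TOKEN"])
claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = open("research-agent-system-prompt.txt").read()

@app.event("app_mention")
def answer_question(event, say):
    """Answer a codebase question posted in #codebase-research."""
    response = claude.messages.create(
        model="claude-sonnet-4-20250514",  # pick your model
        max_tokens=1500,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": event["text"]}],
    )
    # Reply in a thread so a human can confirm or correct the answer.
    say(text=response.content[0].text, thread_ts=event["ts"])

if __name__ == "__main__":
    SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"]).start()
```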
07 / 09
Ship the weekly architecture explainer
30 min/week to review, 0 min for the agent to draft
Every Friday, the agent produces a written explainer covering the architectural patterns that emerged from the week's questions. Over time, these explainers become the de facto codebase documentation: ground truth, structured, version-controlled. Better than the README that hasn't been updated since 2022.
Tasks
Schedule the agent to produce a weekly explainer every Friday at 4pm
Format: 1 architectural theme (the most-asked-about one this week), 3-5 file paths, 2-3 paragraphs of explanation
Save each weekly explainer to the Brief under a 'Weekly explainers' heading
On Monday morning: review last week's explainer, fact-check, edit, and link to it from the team docs
After 4-6 weeks: review the cumulative explainer, restructure into a proper architecture doc
Weekly explainers that just summarize the questions are useless. They have to surface the underlying pattern, not just enumerate the questions.
Don't accumulate 12 weekly explainers without restructuring. By month 2, refactor them into proper architecture docs.
Agent prompt for this step
Read the last 7 days of the Question log surface.
Identify the most-asked-about architectural theme. ("What were the top 3 file paths cited this week? What domain did the questions cluster around?")
Write a weekly explainer as a Brief section titled "Weekly explainer <week-of>":
1. Theme (1 sentence)
2. Why it came up this week (1 paragraph)
3. The architectural pattern (2-3 paragraphs, with file path citations)
4. Open questions / known gaps
5. Suggested next read-through if the team wants to go deeper on this
Constraints:
- Cite real file paths for every claim. If you can't cite, drop the claim.
- Write for a senior engineer joining the team, not for marketing.
- Length: 300-600 words, no longer.
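Scheduling the Friday draft can be as simple as a cron entry or a CI scheduled job; a minimal sketch with the Python `schedule` library, where `run_explainer_prompt` stands in for however you invoke your agent:

```python
import time

import schedule

def run_explainer_prompt():
    """Send the weekly-explainer prompt to the agent and save the draft to the Brief."""
    # Placeholder: invoke your runtime here (CLI call, API request, or MCP tool call).
    print("Drafting weekly explainer from the last 7 days of the Question log...")

# Friday at 16:00, ahead of the Monday review.
schedule.every().friday.at("16:00").do(run_explainer_prompt)

while True:
    schedule.run_pending()
    time.sleep(60)
```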
08 / 09
Spot-check + edit the explainers; treat them as living docs
30 min/week
The agent's explainers are the first draft, not the final word. Every Monday, spend 30 min reading last week's explainer with the source code in the other window. Correct hallucinations, sharpen claims, add the institutional context the agent doesn't have. Then check the edited version into the team's docs repo. Over time, this is how you build the codebase documentation that always seemed too expensive to maintain.
Tasks
Monday: read last week's explainer side-by-side with cited files
Mark hallucinations / over-claims / missing context
Edit the explainer in the Brief surface
Copy the edited version into the team's docs repo (e.g. /docs/architecture/<theme>.md)
Update the Question log: link to the canonical doc on rows that contributed to the explainer
Explainers checked into docs without spot-check propagate hallucinations into the team's mental model. Don't skip the Monday review.
Editing in a chat tab loses the changes when the tab closes. Edit in the Brief surface or the docs repo, not in chat.
09 / 09
Iterate: add static analysis, expand to multi-repo, retire when redundant
Ongoing, 1-2 hr/month
Once stable, expand: wire in static analysis tools the agent can run (cloc, semgrep, dependency graph), grow from one repo to a monorepo or a multi-repo cluster, and integrate with the team's existing docs site. Eventually the agent's job will shrink: the docs the team builds from the explainers replace many of the questions. That's success, not failure.
Tasks
Add tool access: rg, cloc, semgrep, ts-node / pyright for type info, git log for history
Expand to additional repos in a monorepo or multi-repo system
Integrate the explainer output with the team's docs site (Vercel-hosted, mkdocs, Docusaurus, whatever)
Retire question categories the explainers have permanently answered, route to the docs
After 3 months: review the volume of new questions per week. If it's under 5, the agent has done its job for this codebase
Multi-repo agents need to be careful with read-only enforcement: one repo's read-only scope might be another's coding work area. Lock the scope per repo.
Don't keep an agent running on a codebase with low question volume. Tokens cost real money; agents that aren't earning their keep should sleep.
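Feeding static-analysis output to the agent can be as simple as capturing tool stdout and dropping it into the context (or exposing it via a tool). A sketch for cloc and git history, assuming both are installed; the 'hot files' heuristic is illustrative:

```python
import subprocess
from collections import Counter

def run(cmd: list[str]) -> str:
    """Run a read-only command and return its stdout."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# Language / line-count breakdown for the repo.
cloc_summary = run(["cloc", ".", "--quiet"])

# Most-touched files in the last 3 months: a cheap proxy for 'load-bearing'.
log_output = run(["git", "log", "--since=3 months ago", "--name-only", "--pretty=format:"])
touch_counts = Counter(f for f in log_output.splitlines() if f)
hot_files = [path for path, _ in touch_counts.most_common(20)]

context_block = (
    "## cloc\n" + cloc_summary +
    "\n## hot files (last 3 months)\n" + "\n".join(hot_files)
)
print(context_block)  # paste into the agent's context, or serve via a tool call
```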
Hand the template to your agent
Workspace-wide agent prompt.
Paste this into your agent's permanent system prompt so the agent reads, writes, and maintains the template's surfaces as you work through the steps.
Agent system prompt
You are the research agent on the workspace at your-org/run-a-research-agent-on-your-codebase.
Your role: read the codebase + answer questions about it. NEVER edit files. NEVER commit. NEVER write to the filesystem outside the workspace docs.
Cadence:
- For each question the user asks: read the relevant files, answer with cited file paths + line numbers, log the question + answer + confidence to the Question log.
- Each Friday: produce a weekly architecture explainer in the Brief, summarizing the questions answered that week + the architectural patterns that emerged.
Read-only scope is the hard rule. If the user asks you to fix a bug or refactor: refuse, suggest spawning a coding agent in a separate workspace.
First MCP tool calls:
1. list_workspaces()
2. get_doc(workspace_slug="run-a-research-agent-on-your-codebase", surface_slug="brief")
3. list_rows(workspace_slug="run-a-research-agent-on-your-codebase", surface_slug="question-log")
FAQ
Common questions on this template.
Why a separate research agent vs. just using Claude Code or Cursor for everything?
Scope discipline. A multipurpose 'do anything in the repo' agent will edit a file you didn't ask it to edit, eventually. A research agent in read-only mode never can. That boundary is the safety mechanism. You can always spawn a separate coding agent (with write access) when you've identified the change to make; the research agent's job is to find it without changing it.
How do I keep the agent from hallucinating module relationships?
Three rails: (1) demand a citation (file path + line number) for every claim, (2) calibrate confidence buckets in the system prompt and spot-check the high-confidence claims weekly, (3) run a top-down architecture pass first, refer back to it instead of letting each question be a fresh exploration. The combination keeps hallucination rates under 5% on high-confidence answers in practice.
Should the agent commit its own explainers to the docs repo?
No. Edit + commit is a human step. The agent drafts in the Brief surface; you review, edit, and commit. Auto-committing AI-drafted docs is how teams end up with 200 docs nobody trusts. Manual commit is the trust gate.
What runtime should I pick: Claude Code, Cursor, or a custom MCP-based agent?
Claude Code is the simplest default: terminal-native, full filesystem access, and it can be locked down to read-only via its permission settings. Cursor is the right choice if you're already in the editor and want in-context Q&A. A custom MCP-based agent (with the workspace + Slack + git MCPs wired in) is more work but gives you a permanent agent that lives in chat, not a per-session conversation.
What does this cost in API tokens?
A typical research session (5-10 questions, 1 architecture pass) is ~$0.10-0.50 on Claude Sonnet. A weekly explainer is ~$0.20-1. Total monthly cost for a team of 5-10 asking the agent 20-50 questions/week: $5-30/month. Heavy users with the agent in a Slack channel: $20-80/month. Negligible compared to the time saved.
Can my AI agents help build the agent?
Yes. The playbook ships agent prompts for the slow parts: the top-down architecture map, the confidence calibration pass, the weekly explainer drafts, and the Monday-review process. The Question log surface is the canonical record: every question logged with citations + confidence, every weekly explainer linked back to the questions it summarized.
Open this template as a workspace.
We mint a fresh copy in your org with the steps as table rows, the pointers as a separate table, and the brief as a doc. Bring your agents, start checking off boxes.