
Decommission a legacy service safely

11-step playbook for the last 5% of a migration: traffic-zero verification, the 30-day silence test, decom day, and the dependency archaeology in between.

For tech leads + platform engineers

Building the new system is the easy half. Killing the old system is the half that drags on for a year. This playbook is the second half: how to confirm zero traffic, run the dark-launch silence test, find every dependency you missed, run a brownout, archive the data and the runbooks, then actually delete the code, the database, the DNS, and the alerts. Worked-example failure modes inline. End state: the legacy service is gone from prod, gone from the codebase, gone from the DNS, gone from on-call, and the team's mental load goes with it.

Outcome

The legacy service is fully removed: traffic at zero for 30 days, code deleted, database archived and dropped, DNS records removed, on-call rotation updated, runbooks archived. The team's cognitive load drops by one service.

Time: 4-8 weeks (most of it observation periods) · Difficulty: advanced · For: tech leads decommissioning a service or API after a migration.
The template · 11 steps

Top to bottom. Each step has tasks, pointers, gotchas.

Write the decom brief and pick a target decom date

2-3 hr

The brief is the document the rest of the org will reference. It names the service, the replacement, the decom date, the impact, and the rollback path. Pick the date 6-8 weeks out: time enough for the brownout, the silence test, and any straggling callers to be caught and migrated.

Tasks
  • Name the service and the replacement (links to runbook for each)
  • Set the target decom date (6-8 weeks out is typical)
  • Identify the impact: which teams own callers, which features depend, what breaks if we miss one
  • Define the rollback plan: how do we restore the service if decom day is bad?
  • Get sign-off from the SREs / platform team / on-call lead
Gotchas
  • Decom dates that aren't on a wall calendar slip indefinitely. Pick a Thursday (not Friday) and announce it widely.
  • If you can't name the rollback plan, you're not ready. 'We restore from backup' is a plan; 'we figure it out' is not.

Inventory every known caller

Half a day to a day

The starting point. Search the org's code, Slack history, runbooks, on-call docs, and customer-facing API consumer lists. Every team that's ever talked about the service is a caller candidate. Errors of omission compound.

Tasks
  • Grep every internal repo for the service hostname, the API base URL, the SDK package name
  • Search the org's Slack for the service name (last 12 months)
  • Read the on-call runbooks - any reference is a caller
  • If public API: pull the list of API consumer org IDs from your gateway logs
  • List each caller in Callers log: name, owner, migration status
Gotchas
  • Code search misses callers that build the URL dynamically (`'https://' + hostname + path`). Grep for the hostname AND for any string-format call that touches the hostname constant - see the sketch below.
  • The longest tail is one-off scripts in someone's home directory or Notion page. Ask the team in Slack: 'who calls /api/v1/foo?' a week before the silence test.
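
A rough sketch of the repo sweep, assuming the repos are already cloned locally under one directory and the hostname constant is named `LEGACY_HOST` - every hostname, path, and package name below is a placeholder:

```python
# caller_sweep.py - sweep local repo clones for references to the legacy service.
# Assumes repos are cloned under ~/repos; all needles below are placeholders.
import pathlib

REPO_ROOT = pathlib.Path.home() / "repos"
NEEDLES = [
    "legacy-billing.internal.example.com",  # literal hostname
    "/api/v1/foo",                          # known legacy path fragment
    "legacy-billing-sdk",                   # SDK package name
    "LEGACY_HOST",                          # dynamic URL builders touch the constant
]

hits = []
for path in REPO_ROOT.rglob("*"):
    if not path.is_file() or path.stat().st_size > 2_000_000:
        continue  # skip directories and anything too big to be source
    try:
        text = path.read_text(errors="ignore")
    except OSError:
        continue
    for lineno, line in enumerate(text.splitlines(), start=1):
        if any(needle in line for needle in NEEDLES):
            hits.append((path.relative_to(REPO_ROOT), lineno, line.strip()[:120]))

# One row per hit - feed these into the Callers log with an owner and migration status.
for repo_file, lineno, snippet in hits:
    print(f"{repo_file}:{lineno}\t{snippet}")
```
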
Agent prompt for this step
Inventory every known caller of this legacy service.

Read the org's repos (search via codebase tools), the on-call runbooks, and any internal docs you can access.

For each caller you find:
1. Caller name (team or service).
2. Where the call lives (file:line or doc reference).
3. Migration status if obvious (already migrated / partially migrated / pending / unknown).
4. Owner (team or human, your best guess).

Output as rows for the Callers log surface, and a summary count: known migrated, known pending, unknown.

Flag any caller you can't determine the owner for - those are the highest-risk gaps.

Add deprecation warnings to every response

2-3 hr

Sunset and Deprecation HTTP headers tell callers the endpoint is going away and when. They're cheap to add and they generate the noise that drives the laggard callers to migrate.

Tasks
  • Add the `Deprecation` header on every response (RFC 9745) with the deprecation date
  • Add the `Sunset` header (RFC 8594) with the planned removal date
  • Add a `Link` header pointing at the migration guide URL
  • If you have a logger middleware, log every request to the deprecated path with caller IP / user agent
  • Send a one-pager email to known caller teams: 'this service sunsets on DATE, here's the migration guide'
Gotchas
  • Headers don't reach SDKs that don't surface them. Pair with explicit comms; don't rely on header-only deprecation.
  • Log every deprecated request - the caller IP and user-agent are your best lead on undocumented callers.
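
A minimal sketch of the header middleware, assuming a Flask-style service. Every date and URL is a placeholder; the `Deprecation` value uses the RFC 9745 structured-field date form (`@` plus Unix seconds) and `Sunset` uses an RFC 8594 HTTP-date.

```python
# deprecation_headers.py - stamp every response with Deprecation / Sunset / Link headers.
# Assumes a Flask app; every date and URL here is a placeholder.
import logging

from flask import Flask, request

app = Flask(__name__)
log = logging.getLogger("deprecated-traffic")

DEPRECATION = "@1735689600"                    # RFC 9745: date the API was deprecated
SUNSET = "Sat, 01 Mar 2025 00:00:00 GMT"       # RFC 8594: planned removal date
MIGRATION_GUIDE = "https://docs.example.com/migrate-off-legacy-billing"

@app.after_request
def add_deprecation_headers(response):
    response.headers["Deprecation"] = DEPRECATION
    response.headers["Sunset"] = SUNSET
    response.headers["Link"] = f'<{MIGRATION_GUIDE}>; rel="deprecation"'
    # Every request to a deprecated path is a lead on an undocumented caller.
    log.info("deprecated call: %s %s ip=%s ua=%s",
             request.method, request.path,
             request.remote_addr, request.user_agent.string)
    return response
```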

Run a brownout: short, scheduled outages

2-3 weeks of brownouts spread out

A brownout is a planned 5-15 minute outage during business hours. Callers that haven't migrated will scream. The screams identify the long-tail callers your inventory missed. Schedule 3 brownouts of escalating duration: 5 min, 15 min, 60 min. Communicate each one in advance.

Tasks
  • Announce brownout schedule 1 week ahead in #eng-announcements (or wherever)
  • First brownout: 5 minutes during a quiet hour - return 503 from the service
  • Capture every 503 in your error tracker; cross-reference with Callers log
  • Add any new caller you discover to Callers log; ping the owner
  • Second brownout, 1 week later: 15 minutes
  • Third brownout, 2 weeks later: 60 minutes
  • After third brownout, no new callers should be appearing
Gotchas
  • Brownouts that wake on-call by accident burn the team's trust. Communicate VERY clearly to on-call: 'this is intentional, do not page'.
  • If you can't tolerate any brownout-driven outage (because the service is in the critical path of revenue), you're not ready to decom. Migrate first.
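
One way to run the brownout windows without shipping a deploy per outage, again sketched against a Flask-style service; the window times are placeholders and assume UTC.

```python
# brownout.py - serve 503 during announced brownout windows, normal traffic otherwise.
# Assumes a Flask app; the windows below are placeholders (5 min, 15 min, 60 min).
from datetime import datetime, timezone

from flask import Flask, jsonify

app = Flask(__name__)

BROWNOUT_WINDOWS = [
    (datetime(2025, 2, 6, 10, 0, tzinfo=timezone.utc), datetime(2025, 2, 6, 10, 5, tzinfo=timezone.utc)),
    (datetime(2025, 2, 13, 10, 0, tzinfo=timezone.utc), datetime(2025, 2, 13, 10, 15, tzinfo=timezone.utc)),
    (datetime(2025, 2, 27, 10, 0, tzinfo=timezone.utc), datetime(2025, 2, 27, 11, 0, tzinfo=timezone.utc)),
]

@app.before_request
def maybe_brownout():
    now = datetime.now(timezone.utc)
    if any(start <= now < end for start, end in BROWNOUT_WINDOWS):
        # Returning a value from before_request short-circuits the real handler.
        body = jsonify(error="brownout",
                       detail="Planned brownout - this service sunsets soon; see the migration guide.")
        return body, 503, {"Retry-After": "900"}
    return None
```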

Drive the long-tail callers off

2-3 weeks (depends on caller count)

After the brownouts, you'll have a short list of callers who appeared. Each one needs a personal nudge to migrate. This is the slow, manual step. Schedule 30-min calls with each owner team if you have to. The cost of this step is the cost of the decom; it's worth doing properly.

Tasks
  • For each caller still in 'pending' status: name an owner team and a target migration date
  • Send a calendar invite + 1-page migration guide to each owner team
  • Track migration completion in Callers log; mark 'done' when their traffic goes to zero
  • Escalate to the owner team's manager if a caller doesn't migrate within 2 weeks
  • Final unmigrated caller? Decide: extend the decom date OR force-migrate by adding a server-side rewrite (see the shim sketch below)
Gotchas
  • The last caller is always 'just one team' that 'will get to it next sprint' for 6 months. Set a hard date and stick to it; the cost of one team's pain is less than the cost of carrying a deprecated service forever.
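
If you do force-migrate the last caller, the server-side rewrite can be a thin forwarder on the legacy endpoint. A sketch assuming a Flask-style service and the `requests` library; hostnames and paths are placeholders.

```python
# legacy_shim.py - the legacy endpoint becomes a thin forwarder to the replacement service.
# Assumes a Flask app and the requests library; hostnames and paths are placeholders.
import requests
from flask import Flask, Response, request

app = Flask(__name__)
NEW_BASE = "https://billing-v2.internal.example.com"

@app.route("/api/v1/foo", methods=["GET", "POST"])
def forward_foo():
    upstream = requests.request(
        method=request.method,
        url=f"{NEW_BASE}/api/v2/foo",          # the new service owns the real logic
        params=request.args,
        data=request.get_data(),
        headers={"Content-Type": request.headers.get("Content-Type", "application/json")},
        timeout=5,
    )
    # The caller keeps its old URL; the legacy service shrinks to this forwarder.
    return Response(upstream.content,
                    status=upstream.status_code,
                    content_type=upstream.headers.get("Content-Type"))
```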

Verify zero traffic for 7 consecutive days

7 days observation

Once the Callers log is all green, watch the access logs. You're looking for 7 consecutive days of zero traffic to the legacy endpoints - not 'low traffic', zero. Any non-zero traffic means you missed a caller. Find them.

Tasks
  • Set up a daily report: total requests to legacy endpoints over last 24 hr
  • If non-zero: pull the access logs for that day, identify the caller, add to Callers log
  • Reset the 7-day clock when a new caller is found
  • When you hit 7 consecutive days at zero: advance to the silence test
Gotchas
  • Health-check pings count as traffic. Disable any monitoring tool that's still polling the service before you start the zero-traffic count.
  • Bots and security scanners hit endpoints they have no business hitting. Distinguish 'real caller' (came from internal IP, has a business path) vs 'random scanner' (came from a known scanning IP, hits 30 endpoints in a row).
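
A sketch of the daily report, assuming you can export the last 24 hours of access logs as JSON lines with `path`, `client_ip`, and `user_agent` fields - the field names, endpoint prefixes, and health-check filters are all assumptions to adjust.

```python
# daily_zero_traffic_report.py - count real requests to legacy endpoints in a day of logs.
# Assumes JSON-lines access logs piped on stdin; field names and filters are placeholders.
import json
import sys
from collections import Counter

LEGACY_PREFIXES = ("/api/v1/foo", "/api/v1/bar")             # legacy endpoints under decom
MONITORING_AGENTS = ("kube-probe", "Pingdom", "Datadog")     # health-check noise to exclude

real = Counter()
for line in sys.stdin:
    entry = json.loads(line)
    if not entry["path"].startswith(LEGACY_PREFIXES):
        continue
    if any(agent in entry.get("user_agent", "") for agent in MONITORING_AGENTS):
        continue                                             # pings don't reset the 7-day clock
    real[entry["client_ip"]] += 1                            # anything left is a missed caller

print(f"real legacy requests: {sum(real.values())} from {len(real)} client IPs")
for ip, count in real.most_common():
    print(f"  {ip}: {count}  <- identify, add to the Callers log, reset the clock")
```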

Run the 30-day silence test

30 days observation

The keystone step. After 7 days of zero traffic, set the service to return 410 Gone for 30 days. If anyone screams in those 30 days, you missed them. Most teams who skip this step end up doing an emergency restore on decom day.

Tasks
  • Reroute the service to return HTTP 410 Gone with a JSON body pointing at the migration guide
  • Keep the underlying database and code intact - you're testing silence, not destruction yet
  • Daily: check the error response count - any 410 served?
  • If a 410 is served, contact the caller, get them migrated, restart the 30-day clock
  • After 30 silent days, you're cleared for decom day
Gotchas
  • Don't return 404 instead of 410. 404 implies 'not found, maybe try again'; 410 means 'gone, do not try again'. Caching layers and clients react differently.
  • The 30-day window exists for monthly cron jobs and end-of-quarter reports. The 7-day zero-traffic check already catches weekly cycles; only the full 30 days catches the cron that runs once a month that you didn't know existed.
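
A sketch of the 410 reroute, again assuming a Flask-style service; the catch-all route and migration-guide URL are placeholders, and the database and code behind it stay untouched.

```python
# gone.py - during the 30-day silence test, every legacy route answers 410 Gone.
# Assumes a Flask app; the route prefix and URL are placeholders.
from flask import Flask, jsonify

app = Flask(__name__)
MIGRATION_GUIDE = "https://docs.example.com/migrate-off-legacy-billing"

@app.route("/api/v1/<path:rest>", methods=["GET", "POST", "PUT", "DELETE", "PATCH"])
def gone(rest):
    # 410, not 404: "gone, do not retry" - and every hit is a caller you missed.
    return jsonify(error="gone",
                   detail="This endpoint has been removed and will not return.",
                   migration_guide=MIGRATION_GUIDE), 410
```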

Archive the database and the runbooks

1-2 days

Before you delete anything, archive everything. Take a final database snapshot, copy the runbooks to an `/archive` folder, dump the dashboards as JSON. Some of this is legal retention (data); some is institutional memory (runbooks). All of it costs almost nothing to keep.

Tasks
  • Run a final pg_dump (or equivalent) of the database; store in long-term cold storage (S3 Glacier or similar)
  • Document the data retention policy on the snapshot (encrypted, retained N years per legal)
  • Move the runbooks to /archive/decommissioned/<service>/ in the docs repo
  • Export the dashboards as JSON, save to the archive
  • Document the decom date + the location of every archived artifact in the Brief
Gotchas
  • Postgres pg_dump takes only ACCESS SHARE locks, but it holds them for the dump's duration, blocking DDL and adding load on a busy primary. For large DBs, use `pg_dump --jobs` with the directory format for a parallel dump, or run against a read replica to avoid impacting prod.
  • If the data is subject to GDPR or CCPA, the archive retention period and access controls matter. Talk to legal before archiving.
  • Dashboards in Grafana, Datadog, or elsewhere are easy to forget. Export the JSON; if you ever need to re-bootstrap a similar service, the dashboards save weeks of design.
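
A sketch of the snapshot-and-archive step, assuming Postgres, boto3 credentials in the environment, and placeholder database, bucket, and path names; point it at a read replica if the database is large.

```python
# archive_snapshot.py - final pg_dump, then push the dump into S3 Glacier for retention.
# Assumes Postgres, boto3 with credentials configured, and placeholder names throughout.
import subprocess
from datetime import date

import boto3

DB_URL = "postgresql://readonly@replica.internal.example.com/legacy_billing"
DUMP_DIR = f"legacy_billing_final_{date.today().isoformat()}"
BUCKET = "org-decom-archives"

# Directory format (-Fd) is what enables --jobs for a parallel dump on large databases.
subprocess.run(
    ["pg_dump", "--format=directory", "--jobs=4", "--file", DUMP_DIR, DB_URL],
    check=True,
)
subprocess.run(["tar", "-czf", f"{DUMP_DIR}.tar.gz", DUMP_DIR], check=True)

# Upload straight into a cold storage class; retention period and encryption per the Brief.
boto3.client("s3").upload_file(
    f"{DUMP_DIR}.tar.gz",
    BUCKET,
    f"decommissioned/legacy-billing/{DUMP_DIR}.tar.gz",
    ExtraArgs={"StorageClass": "GLACIER"},
)
```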

Decom day: delete code, drop database, remove DNS

Half a day

The actual deletion. Do it on a Thursday morning with the team online. Have the rollback plan loaded in another tab. Steps are mechanical: stop the service, drop the database (after final snapshot confirmed), delete the code, remove DNS records, remove monitoring, update on-call.

Tasks
  • Confirm last 30 days of silent test were clean
  • Stop the service in production (scale to zero)
  • Drop the database (with the snapshot already in cold storage)
  • Delete the code from the repo (one PR, well-commented)
  • Remove DNS records pointing at the service
  • Remove monitors, alerts, dashboards from observability stack
  • Update on-call rotation in PagerDuty / Opsgenie - remove the service
  • Send the celebration message to the team
Gotchas
  • Don't drop the database before confirming the snapshot is restorable. Test the restore path on a staging instance first (see the sketch below).
  • DNS caches outside your control. After removing the DNS record, traffic from clients with old DNS caches will hit nothing for hours. That's expected.
  • If the service had its own TLS certificate, revoke it - leaving an unused certificate around is a minor security smell.
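
Before the production `DROP DATABASE`, prove the snapshot restores. A sketch against a staging Postgres instance; the connection string, dump directory, and table names are placeholders.

```python
# verify_restore.py - restore the archived dump into staging and spot-check row counts.
# Assumes a staging Postgres instance and the directory-format dump from the archive step;
# connection string and table names are placeholders.
import subprocess

STAGING_URL = "postgresql://admin@staging-db.internal.example.com/legacy_billing_restore_test"
DUMP_DIR = "legacy_billing_final_2025-03-06"

# Restore the snapshot into an empty staging database.
subprocess.run(["pg_restore", "--jobs=4", "--dbname", STAGING_URL, DUMP_DIR], check=True)

# Spot-check the tables that matter against the counts recorded in the Brief.
for table in ["invoices", "customers", "audit_log"]:
    out = subprocess.run(
        ["psql", STAGING_URL, "-At", "-c", f"SELECT count(*) FROM {table};"],
        check=True, capture_output=True, text=True,
    )
    print(f"{table}: {out.stdout.strip()} rows restored")
```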

Watch for 7 days post-decom for shadow traffic

7 days observation

Some clients have hardcoded IPs or stale DNS. After decom, watch the firewall logs and the replacement service's logs for any inbound traffic that smells like the legacy service. Most decoms have a small post-decom traffic tail; capture and resolve it.

Tasks
  • Monitor the replacement service for any unexpected request patterns
  • Check infrastructure logs (firewall, load balancer) for inbound to the decom'd hostname / IP
  • Investigate any shadow traffic; trace the caller and ping their owner
  • After 7 days clean, mark the decom complete in the Brief
Gotchas
  • Hardcoded IPs in third-party SaaS integrations are a classic post-decom surprise. The third party hardcoded an IP from your old service three years ago and never updated it.
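
A sketch for spotting shadow traffic in load-balancer logs once the DNS record is gone, assuming JSON-lines logs with `host`, `client_ip`, and `path` fields - the field names, hostname, and IP are placeholders.

```python
# shadow_traffic.py - find clients still addressing the legacy hostname or its old IP.
# Assumes JSON-lines load-balancer logs on stdin; names and fields are placeholders.
import json
import sys
from collections import defaultdict

LEGACY_TARGETS = {"legacy-billing.internal.example.com", "10.0.42.17"}  # old hostname + old IP

callers = defaultdict(set)
for line in sys.stdin:
    entry = json.loads(line)
    if entry.get("host") in LEGACY_TARGETS:
        callers[entry["client_ip"]].add(entry.get("path", "?"))

# Each client IP here has stale DNS or a hardcoded address - trace it back to an owner.
for ip, paths in sorted(callers.items()):
    print(f"{ip} -> {', '.join(sorted(paths))}")
```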

Run a post-decom retro and update the team's playbook

30 min retro

Last step. 30-min retro: what worked, what surprised us, what would we do differently next time? Each decom teaches the team something; capture the lesson in the team's standing decom playbook so the next service is easier.

Tasks
  • Schedule the 30-min team retro within 2 weeks of decom completion
  • Sections: what worked, what surprised us, would-do-differently, total elapsed time vs estimate
  • Update the team's standing 'how we decom services' playbook with the lessons
  • Celebrate. Removing a system is a real win and the team should feel it
Gotchas
  • Decom retros get skipped because the work feels 'done'. The institutional knowledge is exactly what compounds; the team that nails the next decom is the team that did the retro on this one.
Hand the template to your agent

Workspace-wide agent prompt.

Paste this into your agent's permanent system prompt so the agent reads, writes, and maintains the template's surfaces as you work through the steps.

Agent system prompt
You are an agent on the "Decommission a legacy service" playbook workspace.

Your role: maintain the four surfaces (Steps, Pointers, Brief, Callers log) as the team retires a service.

Cadence:
- When the user adds a known caller to Callers log, prompt for owner + migration status + ETA.
- Daily: read access logs (or ask the user to paste them), update Callers log for any new caller spotted, flag in the Brief.
- When traffic to the service goes to zero, start the 30-day silence countdown in the Brief.
- If new traffic appears during silence countdown, raise a blocking note in the Brief.

First MCP tool calls:
1. list_surfaces(workspace_slug="decommission-a-legacy-service")
2. list_rows(workspace_slug="decommission-a-legacy-service", surface_slug="callers-log")
3. get_doc(workspace_slug="decommission-a-legacy-service", surface_slug="brief")

Never propose to delete code or drop the database without an explicit human "go" - your role is preparation and tracking, not execution.
FAQ

Common questions on this template.

How long does a decom realistically take?
From 'we have a replacement, time to kill the old' to 'service is gone': 6-12 weeks for most services. The work itself is small; the wait windows (zero-traffic week, 30-day silence test, post-decom watch) consume most of the calendar time. Compress the windows at your peril; they're what makes the decom safe.
Why not just delete the service when traffic is zero?
Because traffic-zero today doesn't mean traffic-zero tomorrow. A monthly cron job, an end-of-quarter report, an annual compliance pull, a customer who only logs in once a quarter - none of those show in a 7-day zero-traffic window. The 30-day silence test catches them. Skipping it is the most common reason decom day goes bad.
What if I find a caller I can't migrate?
Two paths. First: server-side rewrite. Make the legacy endpoint a thin shim that forwards to the new service. The caller doesn't change; the legacy code shrinks to a few lines. Second: extend the decom date and migrate them properly. The 'just one team' pattern is real; commit to either rewriting their dependency yourself or accepting a longer timeline.
Should I delete the database immediately or archive first?
Always archive first. Final pg_dump (or equivalent) into long-term cold storage (S3 Glacier costs around $0.004/GB/month). The marginal cost is trivial; the value of being able to restore data 18 months later when legal has a question is high. Set a retention period (most companies do 7 years for financial-adjacent data, less for ops-only data) and document it.
Can my AI agents help with a decom?
Yes. Agents are useful for: hunting callers in the codebase and Slack history, drafting deprecation comms, summarising the daily access logs into a Callers log update, drafting the post-decom retro doc from the workspace history. The judgement calls (decom date, brownout schedule, go / no-go on decom day) need humans. The playbook ships agent prompts inline for the inventory and tracking steps.

Open this template as a workspace.

We mint a fresh copy in your org with the steps as table rows, the pointers as a separate table, and the brief as a doc. Bring your agents, start checking off boxes.