Run a feature-flag rollout
A 10-step playbook from "code merged behind a flag" to "flag removed from the codebase". Open in Dock and you'll get four surfaces seeded:
- **Steps** (table) - the 10 stages, owner + status + percentage
- **Pointers** (table) - linked feature-flag service docs, dashboards, kill-switch URL
- **Brief** (doc) - the rollout brief: what shipped, why this metric, what we'll do if guardrails drift
- **Cohorts log** (table) - one row per cohort exposed (1%, 10%, etc.), date, success metric reading, decision
Read `Steps` top-to-bottom. Don't advance to the next percentage until guardrails are clean.
Outcome
A new feature live for 100% of users with the success metric measurably moved, no guardrail regressions, and the flag fully removed from the codebase. Total time from 1% to flag-deleted: 2-4 weeks for most features.
Estimated time: 2-4 weeks (most of it observation between cohorts)
Difficulty: intermediate
For: Tech leads and senior engineers running rollouts.
What you'll need
Pre-register or install before you start.
- LaunchDarkly ($0 for solo developer, $20/seat/mo Foundation for teams) — Hosted feature-flag service with percentage rollouts, targeting, kill switch.
- Statsig (Free tier 1M events/mo, usage-based after) — Flag service with built-in experimentation and metric pipeline.
- ConfigCat (Free up to 10 flags, $99/mo Pro) — Lightweight flag service, good for small teams.
- Flagsmith (Free open source, $45/mo hosted Startup) — Open-source flag service, self-host or hosted.
- Your metrics backend (whatever you already run) — Measures the success metric and guardrail metrics per cohort.
The template · 10 steps
Step 1: Write the one-page rollout brief before flipping the flag
Estimated time: 1 hr
The brief lives in the Brief surface and answers four questions: what feature, what success metric, what guardrails, what kill criteria. If you can't answer all four in writing, you're not ready to roll out. Most rollouts that go bad skipped this step.
Tasks
- Name the feature in plain language ('new checkout flow', not 'feature_47')
- Name the flag identifier exactly as it appears in code ('checkout.flow_v2.enabled')
- Name the success metric ('checkout completion rate, 5-min window')
- Name 2-3 guardrail metrics ('checkout error rate', 'page load p95', 'support tickets per 1k sessions')
- Define kill criteria ('roll back if guardrail X drops more than Y% for 30 min')
- Identify the dashboard URL that shows all four numbers per cohort
[!CAUTION] Gotchas
- If your kill criteria say 'we'll know it when we see it', you don't have kill criteria. Write a number (one way to make that literal is sketched after these gotchas).
- A feature without a guardrail metric is a feature you can't responsibly roll out. The success metric tells you it works; the guardrail tells you it didn't break something else.
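To make 'write a number' literal, one option is to encode the kill criteria as data that the on-call engineer (or an agent) can evaluate mechanically. A minimal sketch; the metric names and thresholds are hypothetical stand-ins for whatever your brief specifies:

```python
from dataclasses import dataclass

@dataclass
class KillCriterion:
    metric: str                # guardrail metric name in your metrics backend
    max_regression_pct: float  # relative regression vs control that triggers rollback
    sustained_min: int         # how long the drift must persist before killing

# Hypothetical criteria matching the examples in this step:
KILL_CRITERIA = [
    KillCriterion("checkout_error_rate", max_regression_pct=20.0, sustained_min=30),
    KillCriterion("page_load_p95_ms",    max_regression_pct=15.0, sustained_min=30),
]

def should_kill(c: KillCriterion, cohort: float, control: float,
                minutes_sustained: int) -> bool:
    """Mechanical kill decision: a number, not a debate. Assumes control > 0."""
    regression_pct = (cohort - control) / control * 100
    return (regression_pct > c.max_regression_pct
            and minutes_sustained >= c.sustained_min)
```

Checked in next to the flag, this turns the 2 AM rollback call into a lookup instead of a debate.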
Agent prompt for this step
Read the feature spec or PR description. Draft a 1-page rollout brief covering:
1. Feature in plain language (1 sentence).
2. Flag identifier (exact code reference).
3. Success metric: what we expect to move and by how much, over what window.
4. Guardrails: 2-3 metrics that should NOT regress, with thresholds.
5. Kill criteria: under what condition we flip back to 0%, in writing.
6. Dashboard URL where all metrics are visible per cohort.
Constraints: no marketing language. Be specific. Numbers, not adjectives.
Output as the Brief surface.
Step 2: Verify the kill switch actually works in production
Estimated time: 30 min
Before you roll out to anyone, prove that flipping the flag to off in production reverts behavior in seconds. This sounds obvious. It's the single most-skipped step. You only get one chance to discover the kill switch is broken, and it's the moment you need it.
Tasks
- In production, set the flag to 0% (or to a target user that's only you)
- Verify the new code path is NOT executed for that target
- Set the flag to 100% for the target
- Verify the new code path IS executed
- Set it back to 0%
- Time the flip-and-effect cycle: it should be under 60 seconds
Pointers
- [Official] LaunchDarkly: kill switch playbook
[!CAUTION] Gotchas
- Some flag SDKs cache values for 60+ seconds by default. Verify your propagation latency in production, not in dev.
- Server-side rendered pages can serialize the flag value into HTML, and that HTML gets CDN-cached. The kill switch flips the flag but the cached page still shows the new flow until cache expires. Solve this with a cache-buster header tied to the flag.
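One way to time the flip-and-effect cycle is to poll the evaluated flag value from production and measure how long the dashboard change takes to land. A minimal sketch; `get_flag` stands in for whatever evaluation call your SDK exposes (for LaunchDarkly's Python SDK that would be `client.variation(...)` against your own context):

```python
import time

def time_flag_propagation(get_flag, expected: bool, timeout_s: int = 120) -> float:
    """Poll the evaluated flag value until it matches `expected`;
    return the seconds elapsed since the call started."""
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        if get_flag() == expected:
            return time.monotonic() - start
        time.sleep(1)
    raise TimeoutError(f"flag did not reach {expected} within {timeout_s}s")

# Flip the flag off in the dashboard, then immediately run something like:
#   elapsed = time_flag_propagation(
#       lambda: client.variation("checkout.flow_v2.enabled", my_context, False),
#       expected=False)
# Anything over your 60-second budget is SDK caching or CDN caching to hunt down.
```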
Step 3: Choose the cohort scheme: random, sticky, attribute
Estimated time: 1 hr
Random rollouts assign each request independently (a 1% / 99% split with no memory between visits). Sticky rollouts hash a stable key, usually the user ID, so a user gets the same experience every visit. Attribute-based rollouts target by segment (account age, plan tier, geography). Sticky is the default for user-facing features; random is fine for stateless A/B at the request level.
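Your flag SDK implements sticky bucketing for you (each service uses its own hash and salt), but the mechanics are worth seeing once. A minimal sketch of stable percentage bucketing:

```python
import hashlib

def in_cohort(user_id: str, flag_key: str, rollout_pct: float) -> bool:
    """Hash (flag, user) into one of 10,000 buckets. The same user always
    lands in the same bucket, so raising the percentage only ever adds
    users to the cohort -- it never re-rolls existing ones."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000      # 0..9999
    return bucket < rollout_pct * 100          # 1% -> buckets 0..99

# Stable across visits, which is the whole point of sticky-by-user:
assert in_cohort("user-42", "checkout.flow_v2.enabled", 10) == \
       in_cohort("user-42", "checkout.flow_v2.enabled", 10)
```

The bucketing key is user_id; swap in session_id and you get exactly the reload bug described in the gotchas below.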
Tasks
- Decide: random / sticky-by-user / sticky-by-account / attribute-based
- If sticky: confirm the SDK uses a stable hash (user_id, not session_id)
- If attribute-based: write the targeting rule and test it on a sample of 10 users
- Document the choice + rationale in the Brief
Pointers
- [Official] Statsig: rollout strategies
[!CAUTION] Gotchas
- Sticky-by-session sounds fine until a user reloads and gets a different experience. Use sticky-by-user for anything user-facing.
- If you start sticky-by-user and switch to random mid-rollout, every existing user re-rolls and many land in a different cohort. The team will field bug reports. Decide upfront and stick.
Step 4: Roll out to 1% and watch for 24 hours
Estimated time: 24 hr observation window
1% is your noise floor. You're not measuring success here; you're measuring 'does the feature crash, error, or melt the database'. After 24 hours of clean guardrails, advance.
Tasks
- Set the flag to 1% in production
- Verify the dashboard now shows two cohorts (control + 1%)
- Watch error rate, latency, and saturation for 24 hours
- If guardrails are clean: advance to 10%
- If anything trips: kill switch to 0%, investigate, restart at 1% after fix
- Append a row to Cohorts log: date, 1%, metrics, decision
[!CAUTION] Gotchas
- 1% of total traffic is often too small to see anything statistically meaningful. The 24-hour wait is for catching crash bugs, not for measuring impact. Don't try to measure success at 1%.
- Weekend traffic is different from weekday. If you start a rollout on Friday at 1%, hold over the weekend and reassess on Monday morning.
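If you'd rather the 24-hour watch be a scheduled script than a person refreshing a dashboard, here's a minimal sketch that reuses the kill-criteria sketch from Step 1 (`read_metric` is a stand-in for a query against your metrics backend):

```python
def check_guardrails(read_metric, minutes_sustained: int) -> list[str]:
    """Return the names of guardrails whose drift crosses kill criteria.
    An empty list means clean; anything else means flip to 0% now."""
    tripped = []
    for c in KILL_CRITERIA:                       # defined in the Step 1 sketch
        control = read_metric(c.metric, cohort="control")
        treated = read_metric(c.metric, cohort="flag_on")
        if should_kill(c, treated, control, minutes_sustained):
            tripped.append(c.metric)
    return tripped
```

Run it on a cron or CI schedule during the window; a non-empty result means kill switch to 0%, not a discussion thread.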
Step 5: Advance to 10% and start measuring
Estimated time: 48-72 hr observation
10% is where the success metric starts to be measurable. Watch for 48-72 hours. If the success metric is moving in the right direction and the guardrails are clean, advance. If the success metric is flat after a full week at 10%, your hypothesis is probably wrong; consider rolling back and rethinking.
Tasks
- Advance to 10% on a Monday or Tuesday (avoid Friday)
- Watch for 48-72 hours: success metric, guardrails, support tickets
- Compare cohort vs control in your dashboard
- If the success metric shows a statistically significant move and guardrails are clean: advance to 50%
- If success metric is flat after a week at 10%, kill the rollout and rethink
- Append a row to Cohorts log
[!CAUTION] Gotchas
- Avoid advancing on Fridays. If something breaks Friday night, you're debugging at 2 AM Saturday with no team support.
- Don't advance based on vibes. If your dashboard says success metric is up 0.3% with a 95% CI of -2% to +2.5%, that's noise. Wait for signal.
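To make 'signal, not vibes' concrete: for a rate-style success metric, the stdlib is enough for a quick two-proportion confidence interval (the counts below are invented for illustration):

```python
import math

def diff_ci_95(conv_t: int, n_t: int, conv_c: int, n_c: int) -> tuple[float, float]:
    """95% CI for the difference in conversion rate (treatment - control),
    normal approximation. A CI that straddles zero is noise, not signal."""
    p_t, p_c = conv_t / n_t, conv_c / n_c
    se = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    diff = p_t - p_c
    return diff - 1.96 * se, diff + 1.96 * se

lo, hi = diff_ci_95(conv_t=1_180, n_t=21_000, conv_c=1_090, n_c=21_000)
print(f"Δ completion rate: [{lo:+.3%}, {hi:+.3%}]")  # advance only if lo > 0
```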
Step 6: Advance to 50%, then 100% with a 24-hour gap
Estimated time: 24-48 hr at 50%, then straight to 100%
50% is your final guardrail check. Half your users now see the feature; if there's a load issue or a tail-case bug, this is where it surfaces. Hold for 24-48 hours, then go to 100%. Some teams hold longer if the feature is high-risk.
Tasks
- Advance to 50% on a weekday morning
- Watch for 24-48 hours: any new error patterns? any latency drift? any support tickets?
- If clean: advance to 100%
- Verify the control cohort no longer exists in the dashboard
- Append rows to Cohorts log for both 50% and 100% transitions
[!CAUTION] Gotchas
- Tail-case bugs that didn't show at 10% sometimes appear at 50% because the absolute number of edge-case users crosses a threshold. Don't skip the 50% checkpoint.
- Database load and cache hit rates can shift at 50%. Watch for slow queries that didn't matter before.
Step 7: Communicate the rollout to the team and customer success
Estimated time: 1-2 hr
By 100%, the feature is live for everyone. CS, support, and sales need to know what changed, what to tell customers, and how to handle reported issues. A 5-minute Loom + a one-page changelog entry is enough. Skipping this step means support fields confused tickets for weeks.
Tasks
- Write a 1-paragraph changelog entry for internal team (link to spec, screenshots, before/after)
- Record a 3-5 min Loom showing the new behavior
- Post in #eng + #cs + #support Slack channels with both
- Update the public-facing changelog or release notes if customer-visible
- If support volume warrants it, draft a help-center article
Pointers
- [Tool] Loom: async demos
[!CAUTION] Gotchas
- Customer support that learns about a feature change from a customer ticket is a bad day for everyone. Loop them in before 50%.
- If the feature changes pricing, billing behavior, or data retention, legal and finance need to know too.
Step 8: Wait 7-14 days at 100% before declaring victory
Estimated time: 7-14 days observation
100% rollout is not the win. The win is 7-14 days at 100% with success metric stable, no guardrail regressions, no spike in support tickets. Some bugs only show after long-tail user behavior (someone hits the new flow on day 12 and discovers it doesn't handle their weird account state).
Tasks
- Hold at 100% for at least 7 days
- Daily check: success metric stable? guardrails stable? support tickets normal?
- If a bug surfaces, decide: hotfix at 100%, or kill back to 0% and fix?
- After 7-14 clean days, declare the rollout successful
- Move on to the cleanup phase
[!CAUTION] Gotchas
- Most teams celebrate at 100% and immediately start the next feature. Don't. The cleanup phase below is where this rollout converts from 'feature shipped' to 'tech debt avoided'.
- Subscription billing bugs, monthly cron job bugs, and end-of-quarter reporting bugs only show on cycles longer than your observation window. For high-risk features, observe for one full billing cycle.
Step 9: Remove the flag from the codebase
Estimated time: Half a day to a day
The cleanup PR removes every reference to the flag, deletes the dead code path (the old behavior), and retires the flag in the flag service. This is the step almost everyone skips. The result: a codebase littered with dead flags and dead branches that compound the next engineer's confusion. Don't be that team.
Tasks
- Search the codebase for the flag identifier; list every file referencing it
- Open a PR that removes the flag check + deletes the old branch (the 'flag = false' code)
- Update tests: any test that exercised the flag-off path is now dead; delete it
- Get the PR reviewed and merged - this should be a small, mostly mechanical change
- After merge, archive the flag in the flag service (or delete if your service supports it)
[!CAUTION] Gotchas
- Flags that touch many files end up with cleanup PRs that are 30+ files of mechanical changes. Get a quick review and merge fast; the longer it sits, the more rebase pain.
- Don't 'archive for later'. The flag service charge keeps ticking, the code rot keeps spreading, and the cleanup gets harder. Delete now.
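`rg -nF 'checkout.flow_v2.enabled'` gives you the same list in one line, but a small script is handy when an agent needs to post-process the matches. A sketch, with the file extensions as guesses to adjust for your stack:

```python
import pathlib

FLAG = "checkout.flow_v2.enabled"   # the exact identifier from the Brief
EXTS = {".py", ".ts", ".tsx", ".yaml", ".yml", ".json", ".md"}

# Print file:line for every reference so nothing survives the cleanup PR.
for path in pathlib.Path(".").rglob("*"):
    if not path.is_file() or path.suffix not in EXTS:
        continue
    try:
        text = path.read_text(encoding="utf-8", errors="ignore")
    except OSError:
        continue
    for i, line in enumerate(text.splitlines(), start=1):
        if FLAG in line:
            print(f"{path}:{i}: {line.strip()}")
```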
Agent prompt for this step
Read the codebase. Find every reference to the flag identifier (search for the flag string in code, tests, configs, docs).
Output:
1. A list of file:line references for every match.
2. A proposed cleanup diff that removes the flag check and inlines the new behavior. Mark which old-behavior branches are now dead.
3. A list of tests that would become dead after the cleanup (tests for the old behavior).
4. A draft PR title + description.
Constraints: do not touch behavior. Pure removal of the flag and its dead branches. If a flag check guards multiple sites, propose them all in one PR.
Step 10: Run a 30-day post-rollout review
Estimated time: 30 min retro
30 days after 100%, look back. Did the feature deliver the success metric move you predicted? What surprised you? What would you do differently next rollout? A 20-minute team retro converts each rollout into compounding institutional knowledge.
Tasks
- Schedule a 20-min team meeting at the 30-day mark
- Compare predicted success metric movement vs actual
- List 1 thing that went well, 1 thing that was harder than expected, 1 thing to do differently
- If success metric did NOT move as predicted, decide: keep, refine, or revert?
- Update the team's Brief doc or playbook with lessons learned
[!CAUTION] Gotchas
- It's easy to skip the retro on a rollout that 'worked'. Don't. The lessons from a smooth rollout are how you scale the team's rollout muscle.
- If success metric didn't move, this is the moment to revert (yes, even though it's been live for a month). Carrying a feature that doesn't deliver is worse than removing it.
Hand the template to your agent
Paste the prompt below into your agent's permanent system prompt so the agent reads, writes, and maintains this workspace as you work through the steps.
You are an agent on the "Run a feature-flag rollout" playbook workspace.
Your role: maintain the four surfaces (Steps, Pointers, Brief, Cohorts log) as the rollout progresses.
Cadence:
- When the user advances to a new cohort percentage, append a row to Cohorts log with date, cohort %, success metric reading, guardrail readings.
- When a guardrail crosses kill criteria, mark the row red and propose a kill switch action in the Brief.
- When the rollout reaches 100%, propose the cleanup PR title and a list of files referencing the flag.
First MCP tool calls:
1. list_surfaces(workspace_slug="run-a-feature-flag-rollout")
2. list_rows(workspace_slug="run-a-feature-flag-rollout", surface_slug="cohorts-log")
3. get_doc(workspace_slug="run-a-feature-flag-rollout", surface_slug="brief")
Always read the dashboard before advancing a cohort. Never auto-advance without explicit human confirmation.
FAQ
How long should a typical rollout take from 1% to 100%?
For most features: 1-2 weeks of cohort observation, plus a 7-14 day soak at 100%, plus 1-2 days for cleanup. Call it 2-4 weeks end-to-end. Faster is possible for low-risk changes (UI polish, copy tweaks) where you can compress the cohort gates to a few hours each. Slower (4-8 weeks) is right for high-risk areas: payments, auth, anything touching customer data.
Do I need a paid feature-flag service for solo or small teams?
No. LaunchDarkly is free for individual developers. Statsig and ConfigCat have generous free tiers. Flagsmith is open source and self-hostable. The hosted services pay off when you have multiple teams shipping concurrently and need centralized targeting + audit; for one-team rollouts, the free tier or self-hosted is fine.
What's the most common reason rollouts go bad?
No kill criteria written down before the rollout starts. Without explicit kill criteria, the team debates whether the metric drift is 'real' for hours while the bad cohort grows. With kill criteria written down, the decision is mechanical: drift exceeds threshold, flip to 0%, investigate. The second-most-common: cleanup never happens, so the codebase accumulates dead flags.
Should I run an A/B test instead of a rollout?
Different goals. A rollout is for shipping a feature you've decided to launch; you're managing risk, not measuring choice. An A/B test is for deciding between two options you haven't committed to. Most teams do both: rollouts for everything, A/B tests when there's a real product question (different copy, different price, different flow). The infrastructure is the same; the framing is different.
Can my AI agents help run rollouts?
Yes. Agents are useful for: drafting the rollout brief from the spec, drafting the cleanup PR by finding every flag reference in the codebase, summarizing daily metric reads as a one-line update, and detecting flag rot (flags at 100% with no cleanup PR after 14 days). The playbook ships agent prompts inline for the brief and the cleanup steps.