---
title: "Migrate a Postgres database without downtime"
excerpt: "10-step playbook from 'we need to change this table' to 'migration shipped, zero downtime, no on-call paged' using the expand-contract pattern."
category: "Template"
---

# Migrate a Postgres database without downtime

A 10-step playbook. Open in Dock and you'll get four surfaces seeded:

- **Steps** (table) — the 10 expand-contract gates as rows, owner + due + status
- **Migrations log** (table) — every SQL migration shipped with timing + lock duration
- **Brief** (doc) — the migration plan + the schema before/after
- **Rollback plan** (doc) — the explicit rollback for every step (NOT optional)

Read `Steps` top-to-bottom. The rule that catches most teams: never deploy app code that requires the new schema until the migration has run AND been verified. Step ordering matters more than step content.

## Outcome

A non-trivial Postgres schema or data migration shipped to production with zero downtime, dual-write transition, verified backfill, and a rollback plan that's been tested.

**Estimated time:** 3-7 days for a typical schema change; 2-4 weeks for a large data migration  
**Difficulty:** advanced  
**For:** Engineers + tech leads on production Postgres at small / mid scale.

## What you'll need

Pre-register or install before you start.

- **[Postgres](https://www.postgresql.org/docs/current/)** _(Free (self-hosted) or managed (RDS, Neon, Supabase, etc.))_ — The DB. Read the lock semantics docs before touching ALTER TABLE in production.
- **[pg_repack](https://github.com/reorg/pg_repack)** _(Free (open source))_ — Rebuild tables online without long locks. Essential for large-table operations.
- **[pgroll](https://github.com/xataio/pgroll)** _(Free (open source))_ — Tooling for zero-downtime Postgres migrations using the expand-contract pattern.
- **[pgbouncer](https://www.pgbouncer.org/)** _(Free (open source))_ — Connection pooler. Helps survive brief migration locks by queuing connections.
- **[Datadog (or your APM)](https://www.datadoghq.com/)** _(From $15/host/month)_ — Observability of query latency + lock waits during the migration.
- **[Strong migrations (Rails)](https://github.com/ankane/strong_migrations)** _(Free (open source))_ — Catches dangerous Rails migrations in CI before they hit production.

---

# The template · 10 steps

## Step 1: Decide if you actually need a migration

_Estimated time: 1-2 hr_

The cheapest migration is the one you don't ship. Before reaching for ALTER TABLE, ask: can the change live in application code? Can it be a virtual / generated column? Can the rename happen at the API layer without renaming the underlying column? Many 'migrations' are really naming preferences that don't need a schema change.
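
Two of those alternatives sketched in SQL, with hypothetical names; the point is that neither needs dual-writes or a backfill script:

```sql
-- Generated column (Postgres 12+): derived automatically from existing
-- columns. Note: adding a STORED generated column rewrites the table, so
-- on very large tables the view below is the cheaper option.
ALTER TABLE users
  ADD COLUMN full_name text
  GENERATED ALWAYS AS (first_name || ' ' || last_name) STORED;

-- A view gives readers the new shape without touching the table at all.
CREATE VIEW users_named AS
  SELECT id, first_name, last_name,
         first_name || ' ' || last_name AS full_name
  FROM users;
```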

### Tasks

- [ ] Document the WHY: what business or technical need drives this migration
- [ ] Consider the alternatives: app-layer rename, generated column, view, no-op
- [ ] Estimate the cost of NOT migrating (tech debt, query performance, correctness)
- [ ] Estimate the cost of migrating (engineer time, risk, downtime risk)
- [ ] If the cost of migrating is higher than the cost of waiting, defer
- [ ] Document the decision in the Brief

### Pointers

- **[Official]** [Postgres CREATE VIEW (often sufficient)](https://www.postgresql.org/docs/current/sql-createview.html)

> [!CAUTION]
> **Gotchas**
>
> - Renames are the most common 'migration' that doesn't need to be a migration. An API-layer rename gives the new name to consumers without touching the DB.
> - Generated columns (Postgres 12+) let you derive new fields from existing data without a backfill. Often the right answer for 'we need this computed value.'
> - Migrations that are 'because the schema is ugly' rarely justify the production risk. Ugly works; broken doesn't.

## Step 2: Pick the migration strategy

_Estimated time: 2-4 hr_

Three strategies cover most cases: (1) expand-contract for schema changes (add new, dual-write, backfill, swap reads, drop old), (2) shadow-table for big rewrites or partition changes (build new table from scratch, switch over atomically), (3) chunked backfill for pure data migrations (UPDATE in 1k-row batches over hours/days). Pick based on the change size + acceptable risk.

### Tasks

- [ ] Categorize the migration: schema change / data migration / structural rewrite
- [ ] Estimate row count + table size (small <1M, medium 1M-100M, large 100M+)
- [ ] Estimate write rate during migration (low <100/s, medium 100-1000/s, high >1000/s)
- [ ] Pick the strategy based on the matrix in the Brief
- [ ] Document the strategy + rationale in the Brief

### Pointers

- **[Guide]** [Expand-contract migration pattern (Martin Fowler)](https://martinfowler.com/bliki/ParallelChange.html)
- **[Guide]** [Shopify's database migration playbook](https://shopify.engineering/safer-database-migrations)
- **[Guide]** [GitHub's gh-ost (online schema migration)](https://github.com/github/gh-ost) — MySQL but the pattern translates; great write-up.

> [!CAUTION]
> **Gotchas**
>
> - Expand-contract is the right default for schema changes. The temptation to skip dual-write and 'just ALTER' on a large table is exactly when you cause an outage.
> - Shadow tables are the right answer for partitioning, type changes on huge tables, and total restructures. Don't try expand-contract for those.
> - Chunked backfills under high write rates need a 'don't lap the running migration' mechanic. UPDATE rows by primary key range, not by status, to avoid revisiting rows that other writes touched.

## Step 3: Write the expand step (additive only)

_Estimated time: 1 day_

The expand step adds the new schema without touching the old. Add columns as nullable, add indexes CONCURRENTLY, add new tables. NOT NULL constraints, defaults that rewrite existing rows on add, and indexes built without CONCURRENTLY all take locks that block writes; defer those until the backfill has run and been verified.
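
A minimal expand sketch, assuming a hypothetical `users.email` → `email_normalized` migration (table, column, and constraint names are illustrative, not from the Brief):

```sql
-- Nullable add with no default: metadata-only, brief lock.
ALTER TABLE users ADD COLUMN email_normalized text;

-- CONCURRENTLY builds without blocking writes. Slower, and it cannot run
-- inside a transaction block, so keep it in its own migration.
CREATE INDEX CONCURRENTLY idx_users_email_normalized
  ON users (email_normalized);

-- Foreign keys: NOT VALID skips the full-table validation scan at add time;
-- VALIDATE CONSTRAINT afterwards takes only SHARE UPDATE EXCLUSIVE.
ALTER TABLE orders
  ADD CONSTRAINT fk_orders_user
  FOREIGN KEY (user_id) REFERENCES users (id) NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_user;
```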

### Tasks

- [ ] Write the expand SQL: ADD COLUMN (nullable), CREATE INDEX CONCURRENTLY, CREATE TABLE
- [ ] Verify: no NOT NULL on add, no DEFAULT on add for existing rows (Postgres 11+ handles this fast; 10 and earlier lock the table)
- [ ] Verify: indexes use CONCURRENTLY (slower but no lock)
- [ ] Run the migration in a clone of production (a recent restore is closest)
- [ ] Measure: actual lock duration, actual time-to-complete
- [ ] Save the expand SQL in Migrations log + the rollback DROP statement in Rollback plan

### Pointers

- **[Official]** [Postgres ALTER TABLE (lock level per operation)](https://www.postgresql.org/docs/current/sql-altertable.html)
- **[Official]** [Postgres CREATE INDEX CONCURRENTLY](https://www.postgresql.org/docs/current/sql-createindex.html#SQL-CREATEINDEX-CONCURRENTLY)

> [!CAUTION]
> **Gotchas**
>
> - ALTER TABLE ADD COLUMN with a non-volatile DEFAULT is fast on Postgres 11+ (metadata-only). On Postgres 10 and earlier, it rewrites the table under ACCESS EXCLUSIVE — minutes-to-hours of lock on a large table. Check your version.
> - CREATE INDEX without CONCURRENTLY locks the table for the duration of the build. On a 100M-row table that's 30-60 minutes of write blockage.
> - Adding a foreign key with default validation locks both tables. Use ADD CONSTRAINT ... NOT VALID + a separate VALIDATE CONSTRAINT to avoid the lock.

### Agent prompt for this step

```text
Read this Prisma / Sequelize / SQLAlchemy / raw SQL schema and produce the expand SQL for the migration described in the Brief.

Output:
1. The expand SQL: only additive operations (ADD COLUMN nullable, CREATE INDEX CONCURRENTLY, CREATE TABLE)
2. The corresponding rollback SQL: DROP COLUMN, DROP INDEX, DROP TABLE
3. Estimated lock duration per statement (research the lock level: ACCESS EXCLUSIVE / SHARE UPDATE EXCLUSIVE / etc.)
4. Run order (some statements depend on others; order matters)
5. Pre-flight verification queries (e.g. "verify column doesn't already exist, verify index doesn't already exist")

Output to Migrations log as a row + the SQL into a versioned migration file in the codebase.

CRITICAL: do NOT include NOT NULL on ADD COLUMN. Do NOT include DEFAULT on ADD COLUMN for tables >1M rows on Postgres <11. These force a full table rewrite under ACCESS EXCLUSIVE lock.
```

## Step 4: Ship the dual-write app code

_Estimated time: 1-2 days_

Once the schema is expanded, deploy app code that writes to BOTH the old and new columns. The old column stays the source of truth; the new column accumulates writes for verification. Reads still hit the old column. The dual-write phase typically lives for 1-7 days while you verify + backfill.

### Tasks

- [ ] Find every place in the codebase that writes the old column (grep for column name + ORM model usage)
- [ ] Add a dual-write to the new column at every write site (in code, not in a trigger — easier to roll back)
- [ ] Deploy with the dual-write + a feature flag to disable it if needed
- [ ] Verify: new writes land in BOTH columns; old writes stay in old only (see the query sketch after this list)
- [ ] Watch query latency + error rates for 24-48 hours (any drift means rolling back)
- [ ] Document the deploy time + rollback procedure in the Rollback plan
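
One hedged sanity check, continuing the hypothetical `email` → `email_normalized` example and assuming the table has a `created_at` timestamp:

```sql
-- Expect zero: every row created after the dual-write deploy should have
-- the new column populated. The timestamp is a placeholder deploy time.
SELECT count(*) AS missed_dual_writes
FROM users
WHERE created_at > '2024-06-01 12:00:00+00'
  AND email_normalized IS NULL;
```

A non-zero count points at a write site the dual-write missed; the gotchas below list the usual suspects.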

### Pointers

- **[Guide]** [Stripe's database migrations post](https://stripe.com/blog/online-migrations)

> [!CAUTION]
> **Gotchas**
>
> - Triggers for dual-write look easier than app code but are harder to debug + harder to roll back. Prefer app-layer dual-write.
> - Dual-write at the application layer means you must catch every write site. ORM models with implicit writes (timestamps, callbacks, audit logs) are the common miss.
> - A failed write to the new column must never break the write to the old column. Wrap the new-column write in try/catch + log it; never let it bubble up.

## Step 5: Backfill the existing data

_Estimated time: Hours to days depending on table size_

After dual-write is landing new data in both columns, the backfill copies historical data from old to new. Do this in chunks (1k-10k rows per batch) with sleeps between batches to avoid locking the table or filling WAL. Track progress and verify checksums; this is where most migrations spend the most wall-clock time.
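
A sketch of one batch, continuing the hypothetical example and assuming a `backfill_checkpoint` bookkeeping table; a small driver script (Python / Ruby / Node, per the agent prompt below) runs this in a loop with a sleep between iterations and stops when the batch comes back empty:

```sql
WITH batch AS (
  -- Walk the primary key upward from the last checkpoint; ordering by id
  -- keeps the scan index-friendly and the progress resumable.
  SELECT id
  FROM users
  WHERE id > :last_id              -- driver supplies this from the checkpoint
  ORDER BY id
  LIMIT 5000
),
filled AS (
  UPDATE users u
  SET email_normalized = lower(trim(u.email))
  FROM batch b
  WHERE u.id = b.id
    AND u.email_normalized IS NULL -- skip rows dual-write already filled
  RETURNING u.id
)
-- Advance the high-water mark so a crashed run resumes where it left off.
UPDATE backfill_checkpoint
SET last_id = (SELECT max(id) FROM batch),
    updated_at = now()
WHERE job = 'users_email_normalized'
  AND EXISTS (SELECT 1 FROM batch);
```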

### Tasks

- [ ] Write the backfill: UPDATE table SET new_col = derived(old_col) WHERE id BETWEEN ? AND ? AND new_col IS NULL
- [ ] Run in chunks of 1k-10k rows with a sleep(100ms-1s) between batches
- [ ] Track progress in a log: rows updated, time elapsed, current id range
- [ ] Run during off-peak hours if possible (still write-safe but reduces lag)
- [ ] Monitor: query latency, replication lag, lock waits
- [ ] Verify on completion: COUNT(*) WHERE new_col IS NULL = 0 (excluding rows added during backfill that dual-write covers)
- [ ] Run a checksum: random sample 1000 rows, verify new_col matches derived(old_col)

### Pointers

- **[Tool]** [pgroll (Xata, automates this pattern)](https://github.com/xataio/pgroll)

> [!CAUTION]
> **Gotchas**
>
> - Backfilling without WHERE new_col IS NULL re-writes rows that dual-write has already filled. Mostly harmless but burns I/O and can mask dual-write bugs.
> - Backfills that don't track progress in a checkpoint table can't resume after a crash. Always checkpoint the last id processed.
> - Massive single-transaction backfills (no chunking) hold locks + fill WAL + can OOM the leader. Chunk to 1k-10k rows + commit each chunk.

### Agent prompt for this step

```text
Read the Brief + the dual-write code + the table size and produce the backfill script.

Output:
1. A backfill script (Python / Ruby / Node) that runs in chunks
2. Chunk size: 1k for tables <10M rows, 5k for 10-100M, 10k for 100M+
3. Sleep duration between batches: 100ms-1s based on observed write rate
4. Progress logging (rows / minute, total elapsed, ETA)
5. Resume logic (reads the last committed id from a checkpoint table)
6. Verification step (COUNT WHERE NULL + random-sample checksum)

CRITICAL: the script must idempotently skip rows that dual-write has already filled (WHERE new_col IS NULL). Don't re-derive rows that the live application has already written.
```

## Step 6: Verify the new column is correct

_Estimated time: 1-2 days_

Before reading from the new column, prove it's correct. Run consistency checks: count of NULL values, checksum sample, row-by-row diff for a random subset. If checks fail, fix the bug + re-backfill. Don't rush past this gate; reading from a wrong column in production is the catastrophic failure mode.
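
The three checks, sketched against the hypothetical example (here `derived(old_col)` is just `lower(trim(email))`):

```sql
-- Check 1: NULL count. Expect 0, or near-0 while backfill catches up.
SELECT count(*) FROM users WHERE email_normalized IS NULL;

-- Check 2: sampled content diff. TABLESAMPLE draws ~0.1% of rows at random;
-- IS DISTINCT FROM compares NULLs sanely. Expect zero rows back.
SELECT id, email, email_normalized
FROM users TABLESAMPLE BERNOULLI (0.1)
WHERE email_normalized IS DISTINCT FROM lower(trim(email));

-- Check 3, small tables only: the same diff over every row.
SELECT count(*) FROM users
WHERE email_normalized IS DISTINCT FROM lower(trim(email));
```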

### Tasks

- [ ] Run COUNT(*) WHERE new_col IS NULL — expect 0 (or very small if backfill is still catching up)
- [ ] Run a random-sample diff: pick 1000 random rows, compute expected new_col from old_col, compare
- [ ] Run a full diff if table is small enough (under ~10M rows): SELECT id WHERE new_col != derived(old_col)
- [ ] If diffs exist: investigate root cause (dual-write bug? backfill bug? race condition?)
- [ ] Fix the bug; re-backfill the affected rows
- [ ] Re-verify until 0 diffs
- [ ] Document verification results in the Brief

### Pointers

- **[Official]** [Postgres aggregate functions (count, etc.)](https://www.postgresql.org/docs/current/functions-aggregate.html)

> [!CAUTION]
> **Gotchas**
>
> - Verification that checks 'is non-null' but doesn't checksum content is incomplete. The new column can be populated AND wrong (a buggy dual-write at every write site).
> - Race conditions during dual-write produce small but non-zero diffs. If your diff count is non-zero but small (<0.01%), investigate the race before proceeding.
> - Don't read from the new column in app code until verification is GREEN. Reading from a populated-but-wrong column is the catastrophic failure mode.

## Step 7: Switch reads to the new column

_Estimated time: 1-2 days_

Now flip the reads. Deploy app code that reads from the new column instead of the old. The dual-write keeps both columns up to date during this phase, so you can roll back fast. Run the new-read code for 1-7 days to confirm production behavior matches before contracting.

### Tasks

- [ ] Find every place in the codebase that READS the old column
- [ ] Switch each read site to the new column
- [ ] Behind a feature flag if possible, so you can roll back without a deploy
- [ ] Deploy + monitor: query latency, error rates, customer complaints
- [ ] Run for 1-7 days under normal traffic before proceeding
- [ ] Verify: app behavior matches pre-migration baseline (no errors, no missing data, no wrong values)

### Pointers

- **[Guide]** [Feature flag patterns (LaunchDarkly)](https://launchdarkly.com/blog/migrating-databases-with-feature-flags/)

> [!CAUTION]
> **Gotchas**
>
> - Forgetting one read site is the common bug. Run a static analysis tool or a code-grep to verify zero references to the old column in active code paths.
> - If you flip 100% of reads at once + the new column has subtle bugs, you can't easily diff against the old column. Use a feature flag for gradual rollout (1% → 10% → 50% → 100%).
> - Reports + analytics queries often hit the DB outside the app code path. Audit those too (see the query sketch below); they can read the wrong column for weeks before someone notices.
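
One way to catch readers that never show up in a code grep, assuming the pg_stat_statements extension is enabled (the column name is the hypothetical `email` again):

```sql
-- \m and \M are word boundaries, so this matches `email` but not
-- `email_normalized`. Analytics + cron queries surface here even when
-- they bypass the application entirely.
SELECT calls, left(query, 120) AS query_head
FROM pg_stat_statements
WHERE query ~* '\memail\M'
ORDER BY calls DESC
LIMIT 20;
```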

## Step 8: Stop dual-writes (the contract step, part 1)

_Estimated time: 1 day_

Once reads have been on the new column for at least a week with no issues, remove the dual-write to the old column. The old column goes stale immediately; the new column is now the source of truth. Past this point, rolling back requires a reverse backfill, so take the step only AFTER confirming every reader has switched cleanly.

### Tasks

- [ ] Remove the dual-write code (writes to old column)
- [ ] Keep writes only to the new column
- [ ] Deploy + monitor for 24-48 hours
- [ ] Verify: writes are landing in new only; old column is no longer changing (see the fingerprint sketch after this list)
- [ ] Document the cutover time in the Brief (after this, rollback requires a backfill in reverse)
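
A hedged way to prove the old column is actually frozen, continuing the hypothetical example: fingerprint the pre-cutover rows once, re-run 24-48 hours later, and compare.

```sql
-- Any change in (n, fingerprint) between runs means some write path still
-- touches the old column. hashtext() is Postgres's internal hash function;
-- md5() works too. The timestamp is a placeholder cutover time.
SELECT count(*) AS n,
       sum(hashtext(coalesce(email, ''))) AS fingerprint
FROM users
WHERE created_at < '2024-06-08 12:00:00+00';
```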

### Pointers

- **[Guide]** [Stripe's online migrations post](https://stripe.com/blog/online-migrations)

> [!CAUTION]
> **Gotchas**
>
> - Removing dual-write while there's still a code path that reads the old column = silent data drift. Verify NO ACTIVE READS first.
> - Background jobs, cron tasks, and analytics pipelines often read the old column. Stop dual-writes only after auditing every consumer.
> - Once dual-writes stop, rolling back requires a reverse backfill (new → old) under whatever traffic is hitting the table. That takes hours-to-days. Be sure before this step.

## Step 9: Drop the old column (the contract step, part 2)

_Estimated time: 1 day to drop, weeks of waiting beforehand_

Days-to-weeks after stopping dual-writes, drop the old column. Most teams keep it for 30-90 days as a safety net before fully dropping. DROP COLUMN takes an ACCESS EXCLUSIVE lock, but only briefly; on most tables it completes in milliseconds. Big tables may need pg_repack to reclaim space.
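
A sketch of the drop itself, with a lock_timeout guard; if the ALTER queues behind a long-running transaction, every other query on the table queues behind the ALTER, so it's safer to fail fast and retry:

```sql
-- Abort rather than queue if the ACCESS EXCLUSIVE lock isn't available
-- within 5 seconds; retry during a quieter window instead.
SET lock_timeout = '5s';
ALTER TABLE users DROP COLUMN email;

-- Disk space comes back lazily. To rewrite the table online afterwards
-- (names are the hypothetical example's):
--   pg_repack --table=users mydb
```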

### Tasks

- [ ] Wait at least 7-30 days from stopping dual-writes (longer for important data)
- [ ] Verify NO references to the old column remain (grep, static analysis, prod query log)
- [ ] Run ALTER TABLE DROP COLUMN in a maintenance window if the table is huge (or always for safety)
- [ ] On large tables: schedule pg_repack to reclaim space (DROP COLUMN doesn't reclaim space immediately)
- [ ] Verify: old column is gone, new column is complete, app behaves correctly
- [ ] Update the Brief: migration complete

### Pointers

- **[Tool]** [pg_repack](https://github.com/reorg/pg_repack)

> [!CAUTION]
> **Gotchas**
>
> - Adding foreign keys with immediate validation on huge tables locks writes for the duration of the scan; add the constraint NOT VALID, then run VALIDATE CONSTRAINT as a separate step. The same applies to CHECK constraints.
> - DROP COLUMN doesn't reclaim disk space. The column data lives on disk until VACUUM FULL or pg_repack rewrites the table. Plan for this if you're disk-constrained.
> - Backups taken before the drop still contain the old column data. Plan accordingly if you're dropping for compliance / privacy reasons (you may need to also expire old backups).

## Step 10: Postmortem the migration + extract patterns

_Estimated time: 1-2 hr_

Every non-trivial migration teaches the team something — patterns to reuse, traps to avoid. Run a 30-min retro after the migration completes: what went well, what went poorly, what would you do differently next time. The output is a section in your team's database playbook so the next migration is faster.

### Tasks

- [ ] Hold a 30-min retro with everyone who touched the migration
- [ ] Identify what went well + what didn't (timeline issues, surprises, near-misses)
- [ ] Extract reusable patterns: the expand SQL template, the backfill script template, the verification queries
- [ ] Save the patterns in your team's database playbook (Confluence, Notion, your shared docs)
- [ ] Identify any tooling investments that would help (auto-detection of dangerous migrations in CI, etc.)
- [ ] Schedule any follow-up work (improve monitoring, fix issues found during the migration)

### Pointers

- **[Tool]** [Strong migrations (Rails dangerous-migration linter)](https://github.com/ankane/strong_migrations) — CI-time linter that catches dangerous migrations before they ship.

> [!CAUTION]
> **Gotchas**
>
> - Skipping the retro means the next migration repeats the same mistakes. Half an hour of retro saves days on the next migration.
> - Patterns extracted from one migration should become reusable templates (or even auto-generated by tooling). Don't write the expand-contract checklist from scratch every time.
> - The most valuable retro insight is often 'we should have caught this in CI.' Invest in static analysis / linters that flag dangerous patterns before they reach production.

---

## Hand the template to your agent

Paste the prompt below into your agent's permanent system prompt so the agent reads, writes, and maintains this workspace as you work through the steps.

```text
You are an agent on the "Migrate a Postgres database without downtime" playbook workspace at your-org/migrate-database-without-downtime.

Your role: maintain the four surfaces (Steps, Migrations log, Brief, Rollback plan) as the team executes the migration.

Cadence:
- When the user marks a step Done, append a line to the Brief with the timing + outcome.
- When a new SQL migration ships, add a row to Migrations log with timing, lock duration, row count affected.
- When a step has no rollback path, BLOCK the step — flag in Rollback plan as "rollback not possible, decision required."
- Daily during the migration window: surface query latency + error rate trends from Datadog.

First MCP tool calls:
1. list_surfaces(workspace_slug="migrate-database-without-downtime")
2. list_rows(workspace_slug="migrate-database-without-downtime", surface_slug="migrations-log")
3. get_doc(workspace_slug="migrate-database-without-downtime", surface_slug="rollback-plan")

Do NOT recommend skipping the dual-write or backfill steps for "speed." The cost of a bad migration is hours of customer impact + a postmortem; the time saved by skipping is minutes.
```

---

## FAQ

### When do I need expand-contract vs just ALTER TABLE?

Use ALTER TABLE directly only when the operation is fast + safe: ADD COLUMN nullable on Postgres 11+, CREATE INDEX CONCURRENTLY, ADD CONSTRAINT NOT VALID. Use expand-contract for anything that changes the meaning of an existing column, requires backfilling existing rows, or needs application code changes that can't ship simultaneously with the SQL. The rule of thumb: if the migration takes more than a few seconds OR requires app code coordination, do expand-contract.
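
A quick contrast, with hypothetical tables:

```sql
-- Usually safe to run directly (short locks, no rewrite):
ALTER TABLE users ADD COLUMN phone text;                   -- nullable add
CREATE INDEX CONCURRENTLY idx_users_phone ON users (phone);
ALTER TABLE orders ADD CONSTRAINT fk_orders_user
  FOREIGN KEY (user_id) REFERENCES users (id) NOT VALID;

-- Needs expand-contract on a big table (rewrite or long lock):
ALTER TABLE users ALTER COLUMN id TYPE bigint;             -- full table rewrite
ALTER TABLE users ALTER COLUMN email SET NOT NULL;         -- full table scan
```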

### How long should the dual-write phase last?

Minimum 24 hours to verify production-traffic dual-writes are correct. Typical 3-7 days for medium-importance data. For high-stakes data (billing, account info), 2-4 weeks is reasonable. The cost of dual-writes is small (extra column writes); the cost of premature contraction is unrecoverable data.

### What's the most common zero-downtime migration mistake?

Three failures dominate: (1) ALTER TABLE with NOT NULL or DEFAULT on a large table on Postgres 10 and earlier, locking writes for hours. (2) CREATE INDEX without CONCURRENTLY, locking writes for the build duration. (3) Stopping dual-writes before reads have switched, causing silent data drift while the app reads stale data.

### How do I roll back a partially-completed migration?

Each step has its own rollback. Rollback discipline: (1) During expand: just DROP the new schema additions. (2) During dual-write: revert the app deploy. (3) During backfill: stop the backfill; the new column has partial data but isn't read. (4) After read switch: revert the app deploy to read from old. (5) After contract: this is where rollback gets expensive — you need a reverse backfill new → old. Document each rollback path BEFORE starting; the rollback plan is non-optional.

### Can my AI agents help with database migrations?

Yes, carefully. Agents are particularly useful for: scaffolding the expand SQL + rollback SQL, generating the backfill script with chunking + checkpointing, finding every read/write site of the affected column in the codebase, drafting the verification queries, monitoring lock waits + query latency during the migration. Do NOT auto-execute migrations with agents; humans gate every SQL run. The playbook ships agent prompts inline.

### What about MySQL or other databases?

The expand-contract pattern is database-agnostic. The lock semantics differ: the MySQL ecosystem has mature online schema-change tools (gh-ost, pt-online-schema-change), while Postgres relies on careful use of CONCURRENTLY + the patterns in this playbook. The 10 steps map cleanly to any RDBMS. Adapt the SQL specifics; keep the discipline.

