- When do I need expand-contract vs just ALTER TABLE?
- Use ALTER TABLE directly only when the operation is fast and safe: ADD COLUMN for a nullable column (on Postgres 11+, also a column with a constant DEFAULT), CREATE INDEX CONCURRENTLY, ADD CONSTRAINT ... NOT VALID. Use expand-contract for anything that changes the meaning of an existing column, requires backfilling existing rows, or needs application code changes that can't ship simultaneously with the SQL. The rule of thumb: if the migration takes more than a few seconds OR requires app code coordination, do expand-contract.
- How long should the dual-write phase last?
- Minimum 24 hours to verify that dual-writes are correct under production traffic. Typically 3-7 days for medium-importance data. For high-stakes data (billing, account info), 2-4 weeks is reasonable. The cost of dual-writes is small (extra column writes); the cost of premature contraction is unrecoverable data loss.
- What's the most common zero-downtime migration mistake?
- Three failures dominate: (1) ALTER TABLE with NOT NULL or DEFAULT on a large table on Postgres 10 or earlier, which holds an ACCESS EXCLUSIVE lock (blocking reads as well as writes) while the table is rewritten or scanned, potentially for hours. (2) CREATE INDEX without CONCURRENTLY, which blocks writes for the duration of the build. (3) Stopping dual-writes before reads have switched, which causes silent data drift while the app keeps reading stale data.
- How do I roll back a partially-completed migration?
- Each phase has its own rollback: (1) During expand: DROP the new schema additions. (2) During dual-write: revert the app deploy. (3) During backfill: stop the backfill; the new column holds partial data but isn't read yet. (4) After read switch: revert the app deploy so reads come from the old schema again. (5) After contract: this is where rollback gets expensive, because the old schema is gone; you need to recreate it and run a reverse backfill (new → old). Document each rollback path BEFORE starting; the rollback plan is non-optional.
- Can my AI agents help with database migrations?
- Yes, carefully. Agents are particularly useful for scaffolding the expand SQL and rollback SQL, generating the backfill script with chunking and checkpointing, finding every read/write site of the affected column in the codebase, drafting the verification queries, and monitoring lock waits and query latency during the migration. Do NOT let agents auto-execute migrations; a human gates every SQL run. The playbook ships agent prompts inline.
- What about MySQL or other databases?
- The expand-contract pattern is database-agnostic. The lock semantics differ: MySQL teams have long done online schema changes with external tools such as gh-ost and pt-online-schema-change; Postgres relies on careful use of CONCURRENTLY plus the patterns in this playbook. The 10 steps map cleanly to any RDBMS. Adapt the SQL specifics; keep the discipline.
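
The "backfill script with chunking and checkpointing" mentioned in the AI-agents answer could look roughly like the sketch below. It is a minimal illustration, not the playbook's own script: the `users` table, the `full_name` (old) and `display_name` (new) columns, and the `backfill_checkpoint` table are all hypothetical names, and SQLite stands in for the real database so the example runs anywhere. On Postgres the same shape applies, with the driver and upsert syntax adapted.

```python
import sqlite3

JOB_NAME = "display_name_backfill"  # hypothetical job identifier


def backfill_in_chunks(conn: sqlite3.Connection, chunk_size: int = 1000) -> int:
    """Copy users.full_name into users.display_name in primary-key order.

    Progress is stored in a checkpoint table so an interrupted run can
    resume where it stopped. Returns the number of rows updated by this run.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS backfill_checkpoint ("
        "job TEXT PRIMARY KEY, last_id INTEGER NOT NULL)"
    )
    row = conn.execute(
        "SELECT last_id FROM backfill_checkpoint WHERE job = ?", (JOB_NAME,)
    ).fetchone()
    last_id = row[0] if row else 0
    updated = 0

    while True:
        # Find the id range of the next chunk.
        ids = [
            r[0]
            for r in conn.execute(
                "SELECT id FROM users WHERE id > ? ORDER BY id LIMIT ?",
                (last_id, chunk_size),
            )
        ]
        if not ids:
            break
        chunk_end = ids[-1]

        # One short transaction per chunk: the data change and the
        # checkpoint commit together, or neither does.
        with conn:
            cur = conn.execute(
                "UPDATE users SET display_name = full_name "
                "WHERE id > ? AND id <= ? AND display_name IS NULL",
                (last_id, chunk_end),
            )
            conn.execute(
                "INSERT OR REPLACE INTO backfill_checkpoint (job, last_id) "
                "VALUES (?, ?)",
                (JOB_NAME, chunk_end),
            )
        updated += cur.rowcount
        last_id = chunk_end

    return updated
```

The `display_name IS NULL` guard means rows already populated by dual-writes are left alone, and re-running the function after completion is a no-op. In production you would likely also sleep between chunks and watch lock waits, which this sketch omits.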
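
The dual-write window and the "silent data drift" failure described above both come down to one check: do the old and new columns still agree? A hedged sketch of such a verification query follows, again with hypothetical `users.full_name` (old) and `users.display_name` (new) columns and SQLite as a stand-in. It assumes the new column is a straight copy of the old one; a migration that transforms values would compare against the transformed expression instead.

```python
import sqlite3


def find_drift(conn: sqlite3.Connection, limit: int = 20):
    """Return (mismatch_count, sample_ids) for rows where old and new disagree.

    SQLite's `IS NOT` is a NULL-safe inequality; on Postgres the
    equivalent operator is `IS DISTINCT FROM`.
    """
    count = conn.execute(
        "SELECT COUNT(*) FROM users WHERE display_name IS NOT full_name"
    ).fetchone()[0]
    sample_ids = [
        r[0]
        for r in conn.execute(
            "SELECT id FROM users WHERE display_name IS NOT full_name "
            "ORDER BY id LIMIT ?",
            (limit,),
        )
    ]
    return count, sample_ids
```

A reasonable gate is to require a mismatch count of zero, sustained across the whole dual-write window, before switching reads or contracting. The sample ids give a starting point for tracing which write path skipped the new column.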