Set up multi-region for a Postgres app
A 10-step playbook. Open in Dock and you'll get four surfaces seeded:
- **Steps** (table) - the 10 steps as rows, owner + status
- **Pointers** (table) - linked Postgres docs, hosting provider docs, replication patterns
- **Brief** (doc) - the architecture brief: what you're building, where, with what tradeoffs
- **Latency log** (table) - one row per region pair, baseline + post-deploy p95 latency
Read `Steps` top-to-bottom on first build. Don't skip the read-replica step (step 4); most multi-region pain is solved there.
Outcome
An app where users in your secondary region see read latency drop from 200-300ms to 5-30ms, writes still work correctly, replica lag is monitored, and you have a documented failover plan when a region goes down.
Estimated time: 2-4 weeks
Difficulty: advanced
For: Tech leads serving users across multiple continents.
What you'll need
Pre-register or install before you start.
- Neon (Free tier, $19/mo Launch, then usage-based) — Serverless Postgres with read replicas in multiple regions; lowest-friction multi-region option for small teams.
- Supabase (Free tier, $25/mo Pro, $599/mo Team for read replicas) — Managed Postgres with read replicas (Team plan and above) across regions.
- AWS RDS for Postgres (Per-instance + per-GB storage + cross-region transfer; ~$200-500/mo for a 2-region small setup) — Cross-region read replicas; Aurora Global Database is the step up if you need sub-second cross-region replication.
- PgBouncer (Free open source) — Connection pooler that sits between app and Postgres; essential at scale, useful for routing.
- Postgres logical replication (Free open source) — Built-in publication / subscription replication; the underpinning of cross-region read replicas.
The template · 10 steps
Step 1: Confirm you actually need multi-region (not just CDN + caching)
Estimated time: 1 day of profiling
Most 'we need multi-region' problems are solved by a CDN, edge caching, and a well-designed read path. Multi-region Postgres adds operational complexity that takes years to fully amortize. Before you start, profile: are slow pages actually waiting on cross-region DB hits, or are they waiting on uncached static assets, slow client JS, or N+1 queries?
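To put a number on the DB round-trip specifically, a minimal probe along these lines helps before you commit. Run it from an app server in the secondary region against the primary; `DATABASE_URL` is a placeholder, Node 18+ assumed. Because `SELECT 1` does no real work, elapsed time is almost pure network RTT.

```ts
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function probe(samples = 20): Promise<void> {
  await pool.query("SELECT 1"); // warm up: first query pays connection setup
  const timings: number[] = [];
  for (let i = 0; i < samples; i++) {
    const start = performance.now();
    await pool.query("SELECT 1"); // trivial query, so elapsed ~= network RTT
    timings.push(performance.now() - start);
  }
  timings.sort((a, b) => a - b);
  const median = timings[Math.floor(samples / 2)];
  const p95 = timings[Math.ceil(samples * 0.95) - 1]; // nearest-rank p95
  console.log(`DB RTT: median ${median.toFixed(1)}ms, p95 ${p95.toFixed(1)}ms`);
  await pool.end();
}

probe().catch(console.error);
```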
Tasks
- Pick 5 representative slow page loads from your secondary-region users
- Profile each: network waterfall, server-side timing, query-level breakdown
- Confirm the P95 page-load bottleneck is DB round-trip, not asset CDN miss or N+1 queries
- Estimate the latency win from a regional replica (rule of thumb: 80% of cross-region RTT)
- If the bottleneck is anything else, fix that first
Pointers
- [Guide] Web Vitals + LCP profiling
- [Tool] Cloudflare Argo + edge cache
[!CAUTION] Gotchas
- A page that does 30 sequential DB queries will be slow no matter how close the DB is. Fix the N+1 first; your replicas won't save you.
- If your bottleneck is asset delivery, a CDN buys you 90% of the latency win for 10% of the operational cost. Always try CDN first.
Step 2: Pick the topology: primary + read replicas (start here)
Estimated time: 1 hr
Three topology options: (1) primary + read replicas (one writeable region, multiple read regions), (2) sharded by tenant (each region owns a slice of users), (3) true multi-master (every region writes, conflicts resolve). Start with (1). It covers 90% of needs at 10% of the operational cost. Only escalate if profiling shows write latency is the bottleneck.
Tasks
- Identify your primary region (the one that does ~70%+ of writes)
- Identify the secondary regions (top 1-3 by user count, after primary)
- Commit to primary + read replicas, period, for the first iteration
- Document the choice + why in the Brief, with a 'when we'd reconsider' note
Pointers
- [Official] Postgres replication overview
- [Official] Aurora Global Database
[!CAUTION] Gotchas
- True multi-master Postgres in 2024 is still operationally hard. BDR, Active-Active, and similar require deep expertise. If a vendor is selling you 'just turn on multi-master', read the conflict resolution docs carefully.
- Sharding by tenant region works great until a customer expands across regions. Plan for that case before you commit to sharding.
Step 3: Provision read replicas in target regions
Estimated time: 2-3 hr
On Neon, Supabase Team+, AWS RDS, or whatever managed Postgres you use, spin up a read replica per target region. Cross-region replication lag is typically 50-500ms depending on provider; verify the actual lag before you write code that assumes it.
Tasks
- Pick the regions (typically: primary in us-east-1, replicas in eu-west-1 and ap-southeast-1)
- Provision the replica via your provider's console or CLI
- Wait for the initial sync to complete (minutes to hours depending on data size)
- Verify you can connect from the replica region with the connection string
- Test a known-fresh write on the primary, then a SELECT against the replica; measure the lag
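A minimal sketch of that last probe task, assuming a one-off `replication_probe` table and placeholder `PRIMARY_URL` / `REPLICA_URL` env vars:

```ts
import { Pool } from "pg";

const primary = new Pool({ connectionString: process.env.PRIMARY_URL });
const replica = new Pool({ connectionString: process.env.REPLICA_URL });

// One-time setup on the primary:
//   CREATE TABLE replication_probe (id bigserial PRIMARY KEY, wrote_at timestamptz);

async function measureLag(): Promise<void> {
  const { rows } = await primary.query(
    "INSERT INTO replication_probe (wrote_at) VALUES (now()) RETURNING id"
  );
  const start = performance.now();
  for (;;) {
    const res = await replica.query(
      "SELECT 1 FROM replication_probe WHERE id = $1",
      [rows[0].id]
    );
    if (res.rowCount) break; // marker is now visible on the replica
    await new Promise((r) => setTimeout(r, 50)); // poll every 50ms
  }
  console.log(`observed replication lag ~${(performance.now() - start).toFixed(0)}ms`);
  await primary.end();
  await replica.end();
}

measureLag().catch(console.error);
```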
Pointers
- [Official] Neon read replicas across regions
- [Official] AWS RDS cross-region read replicas
[!CAUTION] Gotchas
- Initial sync of a multi-GB database across regions can take hours and cost real bandwidth. Provision during off-peak.
- Cross-region transfer charges add up fast. AWS charges $0.02/GB out for cross-region; a 100GB DB initial sync is $2 plus ongoing replication traffic.
Step 4: Implement read/write splitting in the app
Estimated time: Half a day to a day
Without app-level routing, every request hits the primary regardless of region. Split your DB connection: one for writes (always primary), one or more for reads (region-local replica). Most ORMs support multiple connections; route reads to the closest replica based on request region.
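A minimal sketch of the split with node-postgres; the names (`PRIMARY_URL`, `REPLICA_URL`, `USE_REPLICA_READS`) are placeholders, and most ORMs expose an equivalent hook rather than raw pools:

```ts
import { Pool, PoolClient, QueryResult } from "pg";

const primary = new Pool({ connectionString: process.env.PRIMARY_URL });
const replica = new Pool({ connectionString: process.env.REPLICA_URL }); // region-local

const useReplicaReads = process.env.USE_REPLICA_READS === "true"; // instant-rollback flag

// Writes always hit the primary.
export function write(text: string, values?: unknown[]): Promise<QueryResult> {
  return primary.query(text, values);
}

// Plain reads go to the region-local replica, else fall back to primary.
export function read(text: string, values?: unknown[]): Promise<QueryResult> {
  return (useReplicaReads ? replica : primary).query(text, values);
}

// Read-then-write transactions stay on primary end to end; a replica read
// inside a write transaction sees stale data.
export async function inWriteTxn<T>(fn: (c: PoolClient) => Promise<T>): Promise<T> {
  const client = await primary.connect();
  try {
    await client.query("BEGIN");
    const result = await fn(client);
    await client.query("COMMIT");
    return result;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```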
Tasks
- Configure two DB connection objects in app code: `primary` (writes) and `replica` (reads, region-local)
- Determine the request's region (from your CDN's geo header, e.g. `CF-IPCountry`, or env-pinned per deployment)
- Route each query: writes always to `primary`; reads to `replica` if available, else `primary`
- For transactions that read-then-write, route ALL queries in the transaction to `primary` (replica reads inside a write txn = stale data bugs)
- Keep a feature flag, `use_replica_reads`, so you can roll back instantly
Pointers
- [Official] Prisma multiple data sources
- [Official] Vercel Edge geo headers
[!CAUTION] Gotchas
- Read-then-write transactions on a replica = stale read = data corruption. Always route the entire transaction to primary.
- Some ORMs are sneaky about locking. A seemingly side-effect-free `SELECT ... FOR UPDATE` takes row locks and fails outright on a read-only replica; route it to primary.
- If your app uses connection pooling at the DB level (pgbouncer transaction mode), make sure the routing decision happens at the app, before the pool selects a connection.
Agent prompt for this step
Read every DB query in the codebase. Classify each as:
- READ (SELECT, no side effects).
- WRITE (INSERT, UPDATE, DELETE, or any function that mutates state).
- TRANSACTION_READ (a SELECT inside a transaction that also writes).
Output:
1. A list of file:line references with classification.
2. A proposed code change for each READ to use the replica connection.
3. A list of TRANSACTION_READ queries to leave alone (must stay on primary).
4. Any query that's borderline (e.g. SELECT FOR UPDATE) - flag for human review.
Step 5: Handle the read-after-write inconsistency
Estimated time: Half a day
Replicas lag, even fast ones. A user who creates a record on the primary and immediately reads from the replica may get the pre-create state. Three patterns: (1) read-your-own-writes by routing the next N seconds of reads to primary, (2) wait-for-replication if your provider exposes it, (3) accept the inconsistency for non-critical reads.
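A sketch of the default strategy (pattern 1). The in-memory Map is a single-process stand-in; across multiple app instances you'd put the timestamp in a shared store like Redis:

```ts
// Pattern 1: pin a session's reads to the primary for a window after it writes.
const STICKY_MS = 5_000; // the 5-second default from the tasks below

// Single-process stand-in; use Redis or similar across multiple app instances.
const lastWriteAt = new Map<string, number>(); // sessionId -> epoch ms

export function recordWrite(sessionId: string): void {
  lastWriteAt.set(sessionId, Date.now());
}

export function shouldReadFromPrimary(sessionId: string): boolean {
  const ts = lastWriteAt.get(sessionId);
  return ts !== undefined && Date.now() - ts < STICKY_MS;
}
```

Wire it into the router from step 4: call `recordWrite` after every write, and have `read` check `shouldReadFromPrimary` before picking a pool.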
Tasks
- Identify the workflows where read-after-write matters (creates that show up immediately, account changes that affect auth)
- For each: pick a strategy. Default: route reads to primary for 5 seconds after a write from the same user/session
- Implement a per-session 'recent write' timestamp; if now - timestamp < 5s, route reads to primary
- For non-critical reads (analytics, dashboards), accept replica lag; the dashboard from 1 second ago is fine
- Test by hand: create a record, immediately list, verify the new record appears
Pointers
[!CAUTION] Gotchas
- The 'always read from primary' fallback is the safe default but defeats the purpose of replicas. Be deliberate about which reads can tolerate lag.
- User-visible inconsistencies (you posted a comment, refresh, comment is gone) destroy trust faster than slow pages. Default to primary for self-reads.
Step 6: Monitor replica lag and set alerts
Estimated time: 2-3 hr
Replica lag is the silent killer. A replica 30 seconds behind serves reads that look stale; one 5 minutes behind looks broken to users. Monitor lag, alert on sustained lag over 5 seconds, and fail reads over to the primary if lag exceeds your tolerance.
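A replica-side lag check plus circuit breaker, assuming the step-4 router consults `replicaHealthy`; thresholds mirror the alert rule in the tasks below. One caveat: `pg_last_xact_replay_timestamp()` reads high on an idle primary even when the replica is caught up, so compare LSNs if your write volume is bursty.

```ts
import { Pool } from "pg";

const replica = new Pool({ connectionString: process.env.REPLICA_URL });

export let replicaHealthy = true; // the step-4 router reads this flag
let lagSinceMs: number | null = null; // when the current lag episode started

const LAG_THRESHOLD_S = 5;   // matches the alert rule below
const SUSTAINED_MS = 60_000; // don't trip on momentary spikes

async function checkLag(): Promise<void> {
  // Time-based lag as seen from the replica itself.
  const { rows } = await replica.query(
    "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()) AS lag_s"
  );
  const lagS = Number(rows[0].lag_s ?? 0);
  if (lagS > LAG_THRESHOLD_S) {
    lagSinceMs ??= Date.now();
    if (Date.now() - lagSinceMs > SUSTAINED_MS) replicaHealthy = false; // trip: reads go to primary
  } else {
    lagSinceMs = null;
    replicaHealthy = true; // caught up: resume replica reads
  }
}

setInterval(checkLag, 10_000); // check every 10s
```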
Tasks
- Query `pg_stat_replication` (on primary) or `pg_last_wal_replay_lsn()` (on replica) for lag in bytes or time
- Add the metric to your observability stack (Prometheus, Datadog, etc.)
- Alert on replica lag > 5 seconds for > 1 minute
- Add a circuit breaker in app code: if replica is too far behind, route reads to primary until it catches up
- Test the circuit breaker by force-pausing the replica
Pointers
- [Official] Postgres pg_stat_replication
- [Official] Postgres pg_last_wal_replay_lsn
[!CAUTION] Gotchas
- Replica lag during heavy writes is normal and self-recovers. Alert on sustained lag (over 1 minute), not on momentary spikes.
- A replica with multiple GB of lag may take 10+ minutes to catch up after the write storm ends. Have the circuit breaker.
Step 7: Handle writes that need low latency in secondary regions
Estimated time: Half a day to a day, only if needed
Some writes are 'fire and forget' (analytics, audit logs, write-only feeds). For those, you don't need to wait on the round-trip to primary. Async write patterns: queue locally, flush to primary in the background. Your secondary-region users see fast page loads even on write-heavy interactions.
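A write-behind sketch for the queue-in-Postgres option from the tasks below. The table shape and env names are assumptions, and the queued statements here are trusted app-generated SQL, never user input:

```ts
import { Pool } from "pg";

// LOCAL_URL points at a small region-local *writable* Postgres (a read
// replica can't host the queue table). Assumed shape:
//   CREATE TABLE writes_pending (id bigserial PRIMARY KEY,
//     sql text NOT NULL, params jsonb NOT NULL DEFAULT '[]',
//     created_at timestamptz NOT NULL DEFAULT now());
const local = new Pool({ connectionString: process.env.LOCAL_URL });
const primary = new Pool({ connectionString: process.env.PRIMARY_URL });

// Enqueue is a fast local insert; the user never waits on the cross-region hop.
export async function enqueueWrite(sql: string, params: unknown[]): Promise<void> {
  await local.query(
    "INSERT INTO writes_pending (sql, params) VALUES ($1, $2)",
    [sql, JSON.stringify(params)]
  );
}

// Background drain: flush oldest-first to the primary, delete on success.
async function drain(): Promise<void> {
  const { rows } = await local.query(
    "SELECT id, sql, params FROM writes_pending ORDER BY id LIMIT 100"
  );
  for (const row of rows) {
    try {
      await primary.query(row.sql, row.params); // params comes back as a JS array
      await local.query("DELETE FROM writes_pending WHERE id = $1", [row.id]);
    } catch {
      break; // leave in queue and retry next cycle; alert if depth keeps growing
    }
  }
}

setInterval(drain, 1_000);
```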
Tasks
- Identify writes where the user doesn't need the result immediately (events, analytics, audit logs)
- Implement a queue (Redis, SQS, or a `writes_pending` table in a small region-local writable Postgres; a read replica is read-only, so the queue table can't live on the replica itself)
- Background worker drains the queue to the primary, with retry on failure
- Monitor the queue depth; alert if it grows unbounded
- Document which writes are async vs sync in the Brief
Pointers
[!CAUTION] Gotchas
- Async writes complicate the user's mental model. 'I clicked save, did it save?' is the question. Show optimistic UI but design for the failure case.
- Don't async-queue writes that need transactional guarantees with reads (e.g. inventory decrements). Those need primary-region sync.
Step 8: Plan and test the failover scenario
Estimated time: Half a day to test, ongoing to maintain
When primary goes down, what happens? Most managed Postgres providers offer automatic failover (promote a replica to primary). Test it. Run a fire drill: pick a quiet window, manually trigger the failover, time how long the app is down, capture the runbook.
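During the drill, a small write probe like this can time the outage window objectively. `DATABASE_URL` is the app's normal connection string and the `failover_probe` table is a throwaway:

```ts
import { Client } from "pg";

// Attempts one write; false means the write path is currently down.
async function writeProbe(): Promise<boolean> {
  const client = new Client({
    connectionString: process.env.DATABASE_URL, // same string the app uses
    connectionTimeoutMillis: 2_000,
  });
  try {
    await client.connect();
    await client.query("CREATE TABLE IF NOT EXISTS failover_probe (ts timestamptz)");
    await client.query("INSERT INTO failover_probe VALUES (now())");
    return true;
  } catch {
    return false; // connection refused, read-only, or timed out
  } finally {
    await client.end().catch(() => {});
  }
}

async function drill(): Promise<void> {
  let downSince: number | null = null;
  for (;;) {
    const ok = await writeProbe();
    if (!ok && downSince === null) downSince = Date.now();
    if (ok && downSince !== null) {
      console.log(`writes recovered after ${((Date.now() - downSince) / 1000).toFixed(0)}s`);
      downSince = null;
    }
    await new Promise((r) => setTimeout(r, 1_000)); // probe once a second
  }
}

drill();
```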
Tasks
- Document the failover runbook: how to manually promote a replica, where DNS / connection string change
- Schedule a fire drill: weekend morning, force a failover, time recovery
- Verify the app reconnects to the new primary within X minutes
- Document the data loss window (writes the dead primary acknowledged but didn't replicate before death)
- Update the runbook with anything that surprised you in the drill
Pointers
- [Official] AWS RDS failover documentation
- [Code] Patroni (HA Postgres)
[!CAUTION] Gotchas
- Auto-failover sometimes elects the wrong replica (the one with the most lag). Verify the election rules with your provider.
- After failover, the old primary cannot rejoin without resync. Plan for the cost of the resync (downtime or read-only mode for hours).
- DNS-based failover takes minutes due to TTL. Faster failover needs client-side awareness: libpq-style multi-host connection strings with `target_session_attrs=read-write` let the driver locate the new primary without waiting on DNS (libpq-based drivers and JDBC support this; check yours).
Step 9: Handle migrations carefully (zero-downtime + cross-region)
Estimated time: Ongoing - per migration
Schema migrations are the most dangerous operation in multi-region. A migration on the primary must replicate cleanly to all replicas. Backwards-incompatible migrations break replicas mid-deploy. Always: add columns nullable first, deploy code that doesn't require them, backfill, then deploy code that uses them.
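A sketch of the expand-then-contract sequence for a hypothetical rename of `users.fullname` to `users.display_name`; each phase ships as its own deploy:

```ts
import { Pool } from "pg";

const db = new Pool({ connectionString: process.env.PRIMARY_URL });

async function migrate(): Promise<void> {
  // Phase 1 (expand): add the new column nullable. Metadata-only in modern
  // Postgres, so it's instant and replicates cleanly.
  await db.query("ALTER TABLE users ADD COLUMN IF NOT EXISTS display_name text");

  // Phase 2 happens outside this script: deploy app code that writes both
  // columns but still reads the old one.

  // Phase 3 (backfill): batched so one huge UPDATE doesn't stall replicas.
  // The fullname IS NOT NULL guard keeps NULL rows from looping forever.
  let updated = 0;
  do {
    const res = await db.query(`
      UPDATE users SET display_name = fullname
      WHERE id IN (SELECT id FROM users
                   WHERE display_name IS NULL AND fullname IS NOT NULL
                   LIMIT 1000)`);
    updated = res.rowCount ?? 0;
  } while (updated > 0);

  // Phase 4: deploy app code that reads and writes only display_name.
  // Phase 5 (contract), in a later migration once phase 4 is stable:
  //   ALTER TABLE users DROP COLUMN fullname;
}

migrate().then(() => db.end());
```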
Tasks
- Adopt expand-then-contract migration patterns: add new column nullable, deploy reading code, backfill, deploy writing code, drop old column in a later migration
- For large tables, use logical replication for zero-downtime migrations rather than `pg_dump` (which locks tables briefly under default settings)
- Test every migration on a staging replica that mirrors prod topology
- Schedule risky migrations during a quiet window and monitor replica lag
- Document the rollback plan per migration
Pointers
- [Official] Postgres logical replication for migrations
- [Code] Strong migrations gem (Rails, but principles apply)
[!CAUTION] Gotchas
- Postgres pg_dump locks tables briefly under default settings. For large migrations and zero-downtime data moves, use logical replication, not pg_dump.
- DDL replication has subtle gotchas. ALTER TABLE on a heavily-replicated table can stall replicas while the change propagates.
- Backwards-incompatible migrations + rolling deploy = some app instances run new code, some run old, and they all see the same DB. Always expand-then-contract.
Step 10: Monitor latency from end users and iterate
Estimated time: 1 week observation + ongoing
After deploy, measure. Track p50/p95/p99 page-load latency from each region. Compare against the baseline you measured in step 1. The goal: secondary-region users see latency within 30% of primary-region users. If you're not there, find the remaining bottleneck.
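RUM tells you what real users saw; a synthetic probe run from a host in each region (per the gotcha below) gives you a controlled cross-check. `URLS` and `PROBE_REGION` are placeholders; Node 18+ assumed for global `fetch`:

```ts
const URLS = ["https://app.example.com/", "https://app.example.com/dashboard"]; // top pages
const REGION = process.env.PROBE_REGION ?? "unknown"; // label for the Latency log

async function timeFetch(url: string): Promise<number> {
  const start = performance.now();
  await fetch(url, { redirect: "follow" });
  return performance.now() - start;
}

async function run(samples = 10): Promise<void> {
  for (const url of URLS) {
    const timings: number[] = [];
    for (let i = 0; i < samples; i++) timings.push(await timeFetch(url));
    timings.sort((a, b) => a - b);
    const p95 = timings[Math.ceil(samples * 0.95) - 1]; // nearest-rank p95
    console.log(`${REGION} ${url} p95=${p95.toFixed(0)}ms over ${samples} runs`);
  }
}

run().catch(console.error);
```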
Tasks
- Set up RUM (real-user monitoring) per region: Sentry, LogRocket, or Vercel Analytics
- Track p50/p95/p99 page-load latency per region for the top 5 pages
- Compare against baseline (pre-multi-region) latency
- Append rows to Latency log: region pair, query type, baseline, post-deploy
- Iterate on remaining bottlenecks: more reads to push to replicas, edge caching, async write paths
Pointers
- [Official] Vercel Analytics (RUM)
- [Official] Sentry performance monitoring
[!CAUTION] Gotchas
- RUM data is biased toward users with fast browsers. Cross-reference with synthetic tests from each region.
- If post-deploy latency is still high, check whether your CDN is also region-aware. Read replicas don't help if your assets still come from the primary region.
Hand the template to your agent
Paste the prompt below into your agent's permanent system prompt so the agent reads, writes, and maintains this workspace as you work through the steps.
You are an agent on the "Set up multi-region for a Postgres app" playbook workspace.
Your role: maintain the four surfaces (Steps, Pointers, Brief, Latency log) as the team builds out multi-region.
Cadence:
- When the team measures p95 latency from a region, append a row to Latency log: region pair, query type, baseline, post-deploy.
- When the team adds a new query to a region-pinned path, update the Brief with the rationale.
- Daily: read replication-lag metrics; if any replica lag exceeds 5 seconds, raise it in the Brief.
First MCP tool calls:
1. list_surfaces(workspace_slug="set-up-multi-region-postgres")
2. list_rows(workspace_slug="set-up-multi-region-postgres", surface_slug="latency-log")
3. get_doc(workspace_slug="set-up-multi-region-postgres", surface_slug="brief")
Never propose true multi-master replication without explicit user opt-in - the operational cost is high enough that it should be a conscious decision, not an automated one.
FAQ
Do I really need multi-region or is a CDN enough?
For most small SaaS apps, a CDN + edge caching solves 80% of the latency problem at 10% of the operational cost. Profile your slow pages first. If the bottleneck is asset delivery (LCP, JS bundles, images), use a CDN. If the bottleneck is genuinely DB round-trip from a secondary region, add read replicas. Don't pre-optimize multi-region; the operational cost is real.
What does multi-region Postgres actually cost?
On managed Postgres providers, expect roughly 2x your single-region bill. AWS RDS with one cross-region replica adds the replica's per-instance cost plus cross-region transfer (~$0.02/GB on AWS). For a small app: $200-500/month for a 2-region setup. Neon and Supabase have flatter pricing structures; Supabase's read replicas require Team plan ($599/month) which is the bigger jump.
What's the biggest mistake teams make going multi-region?
Two: (1) jumping to multi-master replication when read replicas would have sufficed - the operational cost of true multi-master is enormous and most teams don't actually need it; (2) forgetting that replicas lag, and routing reads to a stale replica right after a write produces 'I clicked save and it disappeared' bugs. Plan read-after-write consistency from day one.
Should I shard by tenant region instead?
Sharding-by-region is great when each customer is firmly anchored to one region (e.g. a B2B SaaS where each company's data lives in their headquarters region). It's painful when customers expand across regions (a US company hires in EU). For most SaaS, primary-with-replicas is more flexible. Shard if your tenants are clearly partitionable AND your data growth is per-tenant rather than per-event.
Can my AI agents help operate multi-region Postgres?
Yes. Agents are useful for: classifying every query in the codebase as read vs write, drafting the failover runbook, summarising daily replica lag, drafting expand-then-contract migrations from a single ALTER intent, and detecting when a query's region routing has regressed. The judgement calls (topology choice, failover trigger conditions) need humans. The playbook ships agent prompts inline for the read/write classification step.