10-step playbook for going multi-region with Postgres without painting yourself into a corner: read replicas first, smart routing, and the tradeoffs nobody warns you about.
Most teams hit multi-region pain when their app starts serving users across continents and a 200ms transatlantic round-trip per query becomes the cause of every slow page. The right answer is rarely a multi-master setup; it's read replicas in the secondary regions, app-level routing for reads, and a clear understanding of which writes are latency-tolerant. This playbook walks through the 10 steps to ship the cheap, boring, working version of multi-region Postgres without committing to the operational nightmare of true multi-master.
Outcome
An app where users in your secondary region see read latency drop from 200-300ms to 5-30ms, writes still work correctly, replica lag is monitored, and you have a documented failover plan when a region goes down.
Time: 2-4 weeks · Difficulty: advanced · For: tech leads serving users across multiple continents.
Top to bottom. Each step has tasks, pointers, gotchas.
01 / 10
Confirm you actually need multi-region (not just CDN + caching)
1 day of profiling
Most 'we need multi-region' problems are solved by a CDN, edge caching, and a well-designed read path. Multi-region Postgres adds operational complexity that takes years to fully amortize. Before you start, profile: are slow pages actually waiting on cross-region DB hits, or are they waiting on uncached static assets, slow client JS, or N+1 queries?
Tasks
Pick 5 representative slow page loads from your secondary-region users
A page that does 30 sequential DB queries will be slow no matter how close the DB is. Fix the N+1 first; your replicas won't save you.
If your bottleneck is asset delivery, a CDN buys you 90% of the latency win for 10% of the operational cost. Always try CDN first.
02 / 10
Pick the topology: primary + read replicas (start here)
1 hr
Three topology options: (1) primary + read replicas (one writeable region, multiple read regions), (2) sharded by tenant (each region owns a slice of users), (3) true multi-master (every region writes, conflicts resolve). Start with (1). It covers 90% of needs at 10% of the operational cost. Only escalate if profiling shows write latency is the bottleneck.
Tasks
Identify your primary region (the one that does ~70%+ of writes)
Identify the secondary regions (top 1-3 by user count, after primary)
Pick the topology: primary + read replicas, period, for first iteration
Document the choice + why in the Brief, with a 'when we'd reconsider' note
True multi-master Postgres in 2024 is still operationally hard. BDR, Active-Active, and similar require deep expertise. If a vendor is selling you 'just turn on multi-master', read the conflict resolution docs carefully.
Sharding by tenant region works great until a customer expands across regions. Plan for that case before you commit to sharding.
03 / 10
Provision read replicas in target regions
2-3 hr
On Neon, Supabase Team+, AWS RDS, or whatever managed Postgres you use, spin up a read replica per target region. Cross-region replication lag is typically 50-500ms depending on provider; verify the actual lag before you write code that assumes it.
Tasks
Pick the regions (typically: primary in us-east-1, replicas in eu-west-1 and ap-southeast-1)
Provision the replica via your provider's console or CLI
Wait for the initial sync to complete (minutes to hours depending on data size)
Verify you can connect from the replica region with the connection string
Test a known-fresh write on the primary, then a SELECT against the replica; measure the lag
Initial sync of a multi-GB database across regions can take hours and cost real bandwidth. Provision during off-peak.
Cross-region transfer charges add up fast. AWS charges $0.02/GB out for cross-region; a 100GB DB initial sync is $2 plus ongoing replication traffic.
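When measuring lag, it helps to know what the numbers mean. Postgres prints LSNs as a `high/low` hex pair, and the byte position is `high * 2**32 + low`; the sketch below computes byte lag between a primary LSN (e.g. from `pg_current_wal_lsn()`) and a replica LSN (from `pg_last_wal_replay_lsn()`) client-side. This is an illustrative helper, not a library API; server-side you would just call `pg_wal_lsn_diff`.

```python
# Hedged sketch: byte lag between two Postgres LSNs in textual form.
# An LSN like "16/B374D848" encodes the WAL byte position as
# high * 2**32 + low, both halves in hex.

def lsn_to_bytes(lsn: str) -> int:
    """Convert a textual 'high/low' LSN to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)

def replication_lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """How many WAL bytes the replica still has to replay."""
    return lsn_to_bytes(primary_lsn) - lsn_to_bytes(replica_lsn)
```

A lag of a few MB during a write burst is normal; track the trend, not single samples.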
04 / 10
Implement read/write splitting in the app
Half a day to a day
Without app-level routing, every request hits the primary regardless of region. Split your DB connection: one for writes (always primary), one or more for reads (region-local replica). Most ORMs support multiple connections; route reads to the closest replica based on request region.
Tasks
Configure two DB connection objects in app code: `primary` (writes) and `replica` (reads, region-local)
Determine the request's region (from your CDN's geo header, e.g. CF-IPCountry, or env-pinned per deployment)
Route each query: writes always to `primary`; reads to `replica` if available, else `primary`
For transactions that read-then-write, route ALL queries in the transaction to `primary` (replica reads inside a write txn = stale data bugs)
Keep a feature flag: 'use_replica_reads' so you can roll back instantly
Read-then-write transactions on a replica = stale read = data corruption. Always route the entire transaction to primary.
Some ORMs are sneaky about transactions. A seemingly side-effect-free SELECT FOR UPDATE still locks rows; route it to primary.
If your app uses connection pooling at the DB level (pgbouncer transaction mode), make sure the routing decision happens at the app, before the pool selects a connection.
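The routing decision above can be sketched as a pure function; a minimal, illustrative version (the connection names `primary`/`replica` and the regex classifier are hypothetical, not a real ORM API — most ORMs expose their own multi-database hooks):

```python
import re

# Statements that must go to the primary: any DML/DDL, plus locking
# reads like SELECT ... FOR UPDATE / FOR SHARE.
WRITE_RE = re.compile(
    r"^\s*(insert|update|delete|create|alter|drop|truncate)\b"
    r"|\bfor\s+(update|share)\b",
    re.IGNORECASE,
)

def pick_connection(sql: str, in_write_txn: bool, replica_ok: bool) -> str:
    """Route a single statement to 'primary' or 'replica'."""
    if in_write_txn:
        return "primary"   # never mix replica reads into a write txn
    if WRITE_RE.search(sql):
        return "primary"   # writes and locking reads go to primary
    if not replica_ok:
        return "primary"   # feature flag off or circuit breaker tripped
    return "replica"
```

The `replica_ok` flag doubles as the `use_replica_reads` kill switch: flip it off and every query falls back to the primary instantly.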
Agent prompt for this step
Read every DB query in the codebase. Classify each as:
- READ (SELECT, no side effects).
- WRITE (INSERT, UPDATE, DELETE, or any function that mutates state).
- TRANSACTION_READ (a SELECT inside a transaction that also writes).
Output:
1. A list of file:line references with classification.
2. A proposed code change for each READ to use the replica connection.
3. A list of TRANSACTION_READ queries to leave alone (must stay on primary).
4. Any query that's borderline (e.g. SELECT FOR UPDATE) - flag for human review.
05 / 10
Handle the read-after-write inconsistency
Half a day
Replicas lag, even fast ones. A user who creates a record on the primary and immediately reads from the replica may get the pre-create state. Three patterns: (1) read-your-own-writes by routing the next N seconds of reads to primary, (2) wait-for-replication if your provider exposes it, (3) accept the inconsistency for non-critical reads.
Tasks
Identify the workflows where read-after-write matters (creates that show up immediately, account changes that affect auth)
For each: pick a strategy. Default: route reads to primary for 5 seconds after a write from the same user/session
Implement a per-session 'recent write' timestamp; if now - timestamp < 5s, route reads to primary
For non-critical reads (analytics, dashboards), accept replica lag; the dashboard from 1 second ago is fine
Test by hand: create a record, immediately list, verify the new record appears
The 'always read from primary' fallback is the safe default but defeats the purpose of replicas. Be deliberate about which reads can tolerate lag.
User-visible inconsistencies (you posted a comment, refresh, comment is gone) destroy trust faster than slow pages. Default to primary for self-reads.
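The default strategy (route reads to primary for 5 seconds after a write) fits in a few lines. A minimal sketch, assuming an in-memory session store — in a real multi-process app the timestamp would live in a cookie or Redis:

```python
import time

STICKY_SECONDS = 5.0           # matches the suggested default above
_last_write = {}               # session_id -> time of that session's last write

def record_write(session_id, now=None):
    """Call after any write from this session."""
    _last_write[session_id] = now if now is not None else time.monotonic()

def read_target(session_id, now=None):
    """'primary' for sessions that wrote recently, else 'replica'."""
    now = now if now is not None else time.monotonic()
    last = _last_write.get(session_id)
    if last is not None and now - last < STICKY_SECONDS:
        return "primary"       # read-your-own-writes window
    return "replica"
```

Analytics and dashboard reads can skip `read_target` entirely and always hit the replica.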
06 / 10
Monitor replica lag and set alerts
2-3 hr
Replica lag is the silent killer. A replica 30 seconds behind serves reads that look stale; a replica 5 minutes behind looks broken. Monitor lag, alert on sustained lag over 5 seconds, and fail reads over to primary if lag exceeds your tolerance.
Tasks
Query `pg_stat_replication` on the primary (lag in bytes, via `pg_wal_lsn_diff`) or `now() - pg_last_xact_replay_timestamp()` on the replica (lag in time)
Add the metric to your observability stack (Prometheus, Datadog, etc.)
Alert on replica lag > 5 seconds for > 1 minute
Add a circuit breaker in app code: if replica is too far behind, route reads to primary until it catches up
Test the circuit breaker by force-pausing the replica
Replica lag during heavy writes is normal and self-recovers. Alert on sustained lag (over 1 minute), not on momentary spikes.
A replica with multiple GB of lag may take 10+ minutes to catch up after the write storm ends. Have the circuit breaker.
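One way to sketch the circuit breaker: trip when sampled lag exceeds the alert threshold, and only close again once lag drops well below it, so the breaker doesn't flap while the replica catches up. Names and thresholds here are illustrative; feed it the lag you sample from `pg_stat_replication`.

```python
class ReplicaBreaker:
    """Hysteresis-based circuit breaker for replica reads."""

    def __init__(self, trip_lag_s=5.0, recover_lag_s=1.0):
        self.trip_lag_s = trip_lag_s        # open the breaker above this
        self.recover_lag_s = recover_lag_s  # close it again below this
        self.tripped = False

    def observe(self, lag_s):
        """Call on every lag sample from your monitoring loop."""
        if lag_s > self.trip_lag_s:
            self.tripped = True
        elif lag_s < self.recover_lag_s:
            self.tripped = False
        # between the two thresholds: keep current state (hysteresis)

    def read_target(self):
        return "primary" if self.tripped else "replica"
```

The gap between the trip and recover thresholds prevents the replica from oscillating in and out of rotation during a write storm.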
07 / 10
Handle writes that need low latency in secondary regions
Half a day to a day, only if needed
Some writes are 'fire and forget' (analytics, audit logs, write-only feeds). For those, you don't need to wait on the round-trip to primary. Async write patterns: queue locally, flush to primary in the background. Your secondary-region users see fast page loads even on write-heavy interactions.
Tasks
Identify writes where the user doesn't need the result immediately (events, analytics, audit logs)
Implement a queue (Redis, SQS, the DB itself with a 'writes_pending' table in the local replica)
Background worker drains the queue to the primary, with retry on failure
Monitor the queue depth; alert if it grows unbounded
Document which writes are async vs sync in the Brief
Async writes complicate the user's mental model. 'I clicked save, did it save?' is the question. Show optimistic UI but design for the failure case.
Don't async-queue writes that need transactional guarantees with reads (e.g. inventory decrements). Those need primary-region sync.
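The queue-and-drain pattern above can be sketched as follows; `flush_to_primary` is a stand-in for your real write call, and the in-memory deque stands in for Redis/SQS or a local `writes_pending` table:

```python
import collections

class AsyncWriteQueue:
    """Fire-and-forget writes: enqueue locally, drain to primary with retry."""

    def __init__(self, flush_to_primary, max_attempts=3):
        self.flush = flush_to_primary
        self.max_attempts = max_attempts
        self.pending = collections.deque()  # (payload, attempts_so_far)
        self.dead = []                      # gave up after retries; alert on growth

    def enqueue(self, payload):
        self.pending.append((payload, 0))

    def drain_once(self):
        """One background-worker pass over the queue."""
        for _ in range(len(self.pending)):
            payload, attempts = self.pending.popleft()
            try:
                self.flush(payload)
            except Exception:
                attempts += 1
                if attempts < self.max_attempts:
                    self.pending.append((payload, attempts))  # retry later
                else:
                    self.dead.append(payload)                 # dead-letter
```

Monitor both `len(pending)` (queue depth) and `len(dead)` (permanently failed writes) per the task above.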
08 / 10
Plan and test the failover scenario
Half a day to test, ongoing to maintain
When primary goes down, what happens? Most managed Postgres providers offer automatic failover (promote a replica to primary). Test it. Run a fire drill: pick a quiet window, manually trigger the failover, time how long the app is down, capture the runbook.
Tasks
Document the failover runbook: how to manually promote a replica, where DNS / connection string change
Schedule a fire drill: weekend morning, force a failover, time recovery
Verify the app reconnects to the new primary within X minutes
Document the data loss window (writes the dead primary acknowledged but didn't replicate before death)
Update the runbook with anything that surprised you in the drill
09 / 10
Ship schema migrations safely across replicas
Schema migrations are the most dangerous operation in multi-region. A migration on the primary must replicate cleanly to all replicas. Backwards-incompatible migrations break replicas mid-deploy. Always: add columns nullable first, deploy code that doesn't require them, backfill, then deploy code that uses them.
Tasks
Adopt expand-then-contract migration patterns: add new column nullable, deploy reading code, backfill, deploy writing code, drop old column in a later migration
For large tables, use logical replication for zero-downtime migrations rather than `pg_dump` (which locks tables briefly under default settings)
Test every migration on a staging replica that mirrors prod topology
Schedule risky migrations during a quiet window and monitor replica lag
Postgres pg_dump locks tables briefly under default settings. For large migrations and zero-downtime data moves, use logical replication, not pg_dump.
DDL replication has subtle gotchas. ALTER TABLE on a heavily-replicated table can stall replicas while the change propagates.
Backwards-incompatible migrations + rolling deploy = some app instances run new code, some run old, and they all see the same DB. Always expand-then-contract.
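An expand-then-contract sequence might look like this (table and column names are hypothetical, and the contract step ships in a later release):

```sql
-- 1. Expand: add the new column nullable (no table rewrite, replicates cleanly)
ALTER TABLE users ADD COLUMN display_name text;

-- 2. Deploy code that reads display_name but tolerates NULL.

-- 3. Backfill in small batches to keep replica lag bounded:
UPDATE users SET display_name = legacy_name
  WHERE display_name IS NULL AND id BETWEEN 1 AND 10000;

-- 4. Deploy code that writes display_name on every change.

-- 5. Contract, in a later migration, once nothing reads the old column:
ALTER TABLE users DROP COLUMN legacy_name;
```

Each step is individually safe under a rolling deploy: old and new code can coexist against the schema at every point in the sequence.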
10 / 10
Monitor latency from end users and iterate
1 week observation + ongoing
After deploy, measure. Track p50/p95/p99 page-load latency from each region. Compare against the baseline you measured in step 1. The goal: secondary-region users see latency within 30% of primary-region users. If you're not there, find the remaining bottleneck.
Tasks
Set up RUM (real-user monitoring) per region: Sentry, LogRocket, or Vercel Analytics
Track p50/p95/p99 page-load latency per region for the top 5 pages
Compare against baseline (pre-multi-region) latency
Append rows to Latency log: region pair, query type, baseline, post-deploy
Iterate on remaining bottlenecks: more reads to push to replicas, edge caching, async write paths
RUM data is biased toward users with fast browsers. Cross-reference with synthetic tests from each region.
If post-deploy latency is still high, check whether your CDN is also region-aware. Read replicas don't help if your assets still come from the primary region.
Hand the template to your agent
Workspace-wide agent prompt.
Paste this into your agent's permanent system prompt so the agent reads, writes, and maintains the template's surfaces as you work through the steps.
Agent system prompt
You are an agent on the "Set up multi-region for a Postgres app" playbook workspace.
Your role: maintain the four surfaces (Steps, Pointers, Brief, Latency log) as the team builds out multi-region.
Cadence:
- When the team measures p95 latency from a region, append a row to Latency log: region pair, query type, baseline, post-deploy.
- When the team adds a new query to a region-pinned path, update the Brief with the rationale.
- Daily: read replication-lag metrics; if any replica lag exceeds 5 seconds, raise it in the Brief.
First MCP tool calls:
1. list_surfaces(workspace_slug="set-up-multi-region-postgres")
2. list_rows(workspace_slug="set-up-multi-region-postgres", surface_slug="latency-log")
3. get_doc(workspace_slug="set-up-multi-region-postgres", surface_slug="brief")
Never propose true multi-master replication without explicit user opt-in - the operational cost is high enough that it should be a conscious decision, not an automated one.
FAQ
Common questions on this template.
Do I really need multi-region or is a CDN enough?
For most small SaaS apps, a CDN + edge caching solves 80% of the latency problem at 10% of the operational cost. Profile your slow pages first. If the bottleneck is asset delivery (LCP, JS bundles, images), use a CDN. If the bottleneck is genuinely DB round-trip from a secondary region, add read replicas. Don't pre-optimize multi-region; the operational cost is real.
What does multi-region Postgres actually cost?
On managed Postgres providers, expect roughly 2x your single-region bill. AWS RDS with one cross-region replica adds the replica's per-instance cost plus cross-region transfer (~$0.02/GB on AWS). For a small app: $200-500/month for a 2-region setup. Neon and Supabase have flatter pricing structures; Supabase's read replicas require Team plan ($599/month) which is the bigger jump.
What's the biggest mistake teams make going multi-region?
Two: (1) jumping to multi-master replication when read replicas would have sufficed - the operational cost of true multi-master is enormous and most teams don't actually need it; (2) forgetting that replicas lag, and routing reads to a stale replica right after a write produces 'I clicked save and it disappeared' bugs. Plan read-after-write consistency from day one.
Should I shard by tenant region instead?
Sharding-by-region is great when each customer is firmly anchored to one region (e.g. a B2B SaaS where each company's data lives in their headquarters region). It's painful when customers expand across regions (a US company hires in EU). For most SaaS, primary-with-replicas is more flexible. Shard if your tenants are clearly partitionable AND your data growth is per-tenant rather than per-event.
Can my AI agents help operate multi-region Postgres?
Yes. Agents are useful for: classifying every query in the codebase as read vs write, drafting the failover runbook, summarising daily replica lag, drafting expand-then-contract migrations from a single ALTER intent, and detecting when a query's region routing has regressed. The judgement calls (topology choice, failover trigger conditions) need humans. The playbook ships agent prompts inline for the read/write classification step.
Open this template as a workspace.
We mint a fresh copy in your org with the steps as table rows, the pointers as a separate table, and the brief as a doc. Bring your agents, start checking off boxes.