Set up multi-region for a Postgres app
Every step in the template

Outcome

An app where users in your secondary region see read latency drop from 200-300ms to 5-30ms, writes still work correctly, replica lag is monitored, and you have a documented failover plan when a region goes down.

Time: 2-4 weeks · Difficulty: advanced · For: tech leads serving users across multiple continents.
How this works

Open it, hand it to your agent, walk the steps.

Paste this to your agent (Claude / Cursor / Codex)
You are an agent on the "Set up multi-region for a Postgres app" template workspace.

Your role: maintain the four surfaces (Steps, Pointers, Brief, Latency log) as the team builds out multi-region.

Cadence:
- When the team measures p95 latency from a region, append a row to Latency log: region pair, query type, baseline, post-deploy.
- When the team adds a new query to a region-pinned path, update the Brief with the rationale.
- Daily: read replication-lag metrics; if any replica lag exceeds 5 seconds, raise it in the Brief.

First MCP tool calls:
1. list_surfaces(workspace_slug="set-up-multi-region-postgres")
2. list_rows(workspace_slug="set-up-multi-region-postgres", surface_slug="latency-log")
3. get_doc(workspace_slug="set-up-multi-region-postgres", surface_slug="brief")

Never propose true multi-master replication without explicit user opt-in - the operational cost is high enough that it should be a conscious decision, not an automated one.
The template · 10 steps

Top to bottom. Each step has tasks, pointers, gotchas.

Confirm you actually need multi-region (not just CDN + caching)

1 day of profiling

Most 'we need multi-region' problems are solved by a CDN, edge caching, and a well-designed read path. Multi-region Postgres adds operational complexity that takes years to fully amortize. Before you start, profile: are slow pages actually waiting on cross-region DB hits, or are they waiting on uncached static assets, slow client JS, or N+1 queries?
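
A quick way to see whether a page is actually DB-bound, as a hedged sketch (Express + node-postgres assumed; the route and query are placeholders, not part of the template): emit a Server-Timing header so the browser's network waterfall breaks out DB wait per request.

```ts
import express from "express";
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const app = express();

app.get("/dashboard", async (_req, res) => {
  // Time only the DB round trip, so the waterfall separates it from
  // rendering, asset, and network time.
  const t0 = performance.now();
  const { rows } = await pool.query("SELECT now() AS ts"); // stand-in for the real page query
  const dbMs = performance.now() - t0;
  // Shows up per-request in browser devtools -> Network -> Timing.
  res.setHeader("Server-Timing", `db;dur=${dbMs.toFixed(1)}`);
  res.json(rows);
});

app.listen(3000);
```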

Tasks
  • Pick 5 representative slow page loads from your secondary-region users
  • Profile each: network waterfall, server-side timing, query-level breakdown
  • Confirm the P95 page-load bottleneck is DB round-trip, not asset CDN miss or N+1 queries
  • Estimate the latency win from a regional replica (rule of thumb: ~80% of the cross-region RTT per sequential query; e.g. a 200ms us-east-1 to ap-southeast-1 RTT means roughly 160ms saved per read)
  • If the bottleneck is anything else, fix that first
Gotchas
  • A page that does 30 sequential DB queries will be slow no matter how close the DB is. Fix the N+1 first; your replicas won't save you.
  • If your bottleneck is asset delivery, a CDN buys you 90% of the latency win for 10% of the operational cost. Always try CDN first.

Pick the topology: primary + read replicas (start here)

1 hr

Three topology options: (1) primary + read replicas (one writeable region, multiple read regions), (2) sharded by tenant (each region owns a slice of users), (3) true multi-master (every region writes, conflicts resolve). Start with (1). It covers 90% of needs at 10% of the operational cost. Only escalate if profiling shows write latency is the bottleneck.

Tasks
  • Identify your primary region (the one that does ~70%+ of writes)
  • Identify the secondary regions (top 1-3 by user count, after primary)
  • Pick the topology: primary + read replicas, period, for the first iteration
  • Document the choice + why in the Brief, with a 'when we'd reconsider' note
Gotchas
  • True multi-master Postgres in 2024 is still operationally hard. BDR, Active-Active, and similar require deep expertise. If a vendor is selling you 'just turn on multi-master', read the conflict resolution docs carefully.
  • Sharding by tenant region works great until a customer expands across regions. Plan for that case before you commit to sharding.

Provision read replicas in target regions

2-3 hr

On Neon, Supabase Team+, AWS RDS, or whatever managed Postgres you use, spin up a read replica per target region. Cross-region replication lag is typically 50-500ms depending on provider; verify the actual lag before you write code that assumes it.
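
A minimal lag probe for that last verification step, as a sketch: it assumes a small `replication_probe (id serial primary key, written_at timestamptz)` table you create for this purpose (table and env var names are illustrative). Write a marker on the primary, then poll the replica until it shows up.

```ts
import { Pool } from "pg";

const primary = new Pool({ connectionString: process.env.PRIMARY_URL });
const replica = new Pool({ connectionString: process.env.REPLICA_URL });

async function measureLagMs(): Promise<number> {
  // Insert a marker row on the primary...
  const { rows } = await primary.query(
    "INSERT INTO replication_probe (written_at) VALUES (now()) RETURNING id"
  );
  const start = Date.now();
  // ...then poll the replica until the row has replicated.
  for (;;) {
    const seen = await replica.query(
      "SELECT 1 FROM replication_probe WHERE id = $1",
      [rows[0].id]
    );
    if (seen.rowCount) return Date.now() - start;
    await new Promise((r) => setTimeout(r, 25));
  }
}

measureLagMs().then((ms) => console.log(`observed replication lag: ~${ms}ms`));
```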

Tasks
  • Pick the regions (typically: primary in us-east-1, replicas in eu-west-1 and ap-southeast-1)
  • Provision the replica via your provider's console or CLI
  • Wait for the initial sync to complete (minutes to hours depending on data size)
  • Verify you can connect from the replica region with the connection string
  • Test a known-fresh write on the primary, then a SELECT against the replica; measure the lag
Gotchas
  • Initial sync of a multi-GB database across regions can take hours and cost real bandwidth. Provision during off-peak.
  • Cross-region transfer charges add up fast. AWS charges $0.02/GB out for cross-region; a 100GB DB initial sync is $2 plus ongoing replication traffic.

Implement read/write splitting in the app

Half a day to a day

Without app-level routing, every request hits the primary regardless of region. Split your DB connection: one for writes (always primary), one or more for reads (region-local replica). Most ORMs support multiple connections; route reads to the closest replica based on request region.
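
A sketch of the routing decision, assuming node-postgres, env-provided connection strings, and a region value you've already extracted from your geo header; none of these names come from the template.

```ts
import { Pool } from "pg";

const primary = new Pool({ connectionString: process.env.PRIMARY_URL });
const replicas: Record<string, Pool> = {
  "eu-west-1": new Pool({ connectionString: process.env.EU_REPLICA_URL }),
  "ap-southeast-1": new Pool({ connectionString: process.env.AP_REPLICA_URL }),
};

// The 'use_replica_reads' feature flag from the tasks below: flip it off
// and every read goes back to the primary.
const useReplicaReads = process.env.USE_REPLICA_READS === "1";

function poolFor(kind: "read" | "write", region: string): Pool {
  // Writes (and whole transactions) always hit the primary.
  if (kind === "write" || !useReplicaReads) return primary;
  // Reads go to the region-local replica when one exists.
  return replicas[region] ?? primary;
}

// usage: await poolFor("read", "eu-west-1").query("SELECT ...");
```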

Tasks
  • Configure two DB connection objects in app code: `primary` (writes) and `replica` (reads, region-local)
  • Determine the request's region (from your CDN's geo header, e.g. CF-IPCountry, or env-pinned per deployment)
  • Route each query: writes always to `primary`; reads to `replica` if available, else `primary`
  • For transactions that read-then-write, route ALL queries in the transaction to `primary` (replica reads inside a write txn = stale data bugs)
  • Keep a feature flag: 'use_replica_reads' so you can roll back instantly
Gotchas
  • Read-then-write transactions on a replica = stale read = data corruption. Always route the entire transaction to primary.
  • Some ORMs are sneaky about transactions. A 'side-effect-free' SELECT FOR UPDATE takes row locks and will error on a read-only replica; route it to primary.
  • If your app uses connection pooling at the DB level (pgbouncer transaction mode), make sure the routing decision happens at the app, before the pool selects a connection.
Agent prompt for this step
Read every DB query in the codebase. Classify each as:
- READ (SELECT, no side effects).
- WRITE (INSERT, UPDATE, DELETE, or any function that mutates state).
- TRANSACTION_READ (a SELECT inside a transaction that also writes).

Output:
1. A list of file:line references with classification.
2. A proposed code change for each READ to use the replica connection.
3. A list of TRANSACTION_READ queries to leave alone (must stay on primary).
4. Any query that's borderline (e.g. SELECT FOR UPDATE) - flag for human review.

Handle the read-after-write inconsistency

Half a day

Replicas lag, even fast ones. A user who creates a record on the primary and immediately reads from the replica may get the pre-create state. Three patterns: (1) read-your-own-writes by routing the next N seconds of reads to primary, (2) wait-for-replication if your provider exposes it, (3) accept the inconsistency for non-critical reads.
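
A sketch of pattern (1), read-your-own-writes, layered on the `poolFor()` router from the previous step; the in-memory map is a stand-in for whatever session store or signed cookie you actually use.

```ts
// Remember when each session last wrote (swap the Map for your session
// store or a signed cookie if you run more than one app instance).
const lastWriteAt = new Map<string, number>();
const STICKY_MS = 5_000; // route this session's reads to primary for 5s

function recordWrite(sessionId: string): void {
  lastWriteAt.set(sessionId, Date.now());
}

function readPoolFor(sessionId: string, region: string) {
  const t = lastWriteAt.get(sessionId);
  const wroteRecently = t !== undefined && Date.now() - t < STICKY_MS;
  // Fresh writers read from primary; everyone else gets the local replica.
  return wroteRecently ? poolFor("write", region) : poolFor("read", region);
}
```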

Tasks
  • Identify the workflows where read-after-write matters (creates that show up immediately, account changes that affect auth)
  • For each: pick a strategy. Default: route reads to primary for 5 seconds after a write from the same user/session
  • Implement a per-session 'recent write' timestamp; if now - timestamp < 5s, route reads to primary
  • For non-critical reads (analytics, dashboards), accept replica lag; the dashboard from 1 second ago is fine
  • Test by hand: create a record, immediately list, verify the new record appears
Gotchas
  • The 'always read from primary' fallback is the safe default but defeats the purpose of replicas. Be deliberate about which reads can tolerate lag.
  • User-visible inconsistencies (you posted a comment, refresh, comment is gone) destroy trust faster than slow pages. Default to primary for self-reads.

Monitor replica lag and set alerts

2-3 hr

Replica lag is the silent killer. A replica 30 seconds behind serves reads that look stale; a replica 5 minutes behind looks broken. Monitor lag, alert on sustained lag over 5 seconds, and fall back to primary reads if lag exceeds your tolerance.
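
A sketch of the lag check plus circuit breaker, run against the replica. `pg_last_xact_replay_timestamp()` is standard Postgres; the 5-second threshold mirrors the alert rule below, and `replicas` is the pool map from the read/write-splitting sketch above.

```ts
import { Pool } from "pg";

let replicaHealthy = true; // circuit breaker consulted by the read router

async function checkReplicaLag(replica: Pool): Promise<void> {
  const { rows } = await replica.query(
    "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_s"
  );
  const lagS = Number(rows[0].lag_s ?? 0);
  // Caveat: on an idle primary this value keeps growing with nothing to
  // replay; cross-check with pg_stat_replication.replay_lag on the primary.
  replicaHealthy = lagS < 5;
}

// Poll every 15s; treat an unreachable replica as unhealthy.
setInterval(() => {
  checkReplicaLag(replicas["eu-west-1"]).catch(() => {
    replicaHealthy = false;
  });
}, 15_000);
```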

Tasks
  • Query `pg_stat_replication` (on primary) or `pg_last_wal_replay_lsn()` (on replica) for lag in bytes or time
  • Add the metric to your observability stack (Prometheus, Datadog, etc.)
  • Alert on replica lag > 5 seconds for > 1 minute
  • Add a circuit breaker in app code: if replica is too far behind, route reads to primary until it catches up
  • Test the circuit breaker by force-pausing the replica
Gotchas
  • Replica lag during heavy writes is normal and self-recovers. Alert on sustained lag (over 1 minute), not on momentary spikes.
  • A replica with multiple GB of lag may take 10+ minutes to catch up after the write storm ends. Have the circuit breaker.

Handle writes that need low latency in secondary regions

Half a day to a day, only if needed

Some writes are 'fire and forget' (analytics, audit logs, write-only feeds). For those, you don't need to wait on the round-trip to primary. Async write patterns: queue locally, flush to primary in the background. Your secondary-region users see fast page loads even on write-heavy interactions.
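
A sketch of the queue-and-drain pattern with a 'writes_pending' table. Note the queue must live in a writable region-local store (a small local Postgres here; a streaming replica is read-only and can't host it). Table, column, and env var names are invented; `payload` is assumed jsonb.

```ts
import { Pool } from "pg";

const local = new Pool({ connectionString: process.env.LOCAL_QUEUE_URL });
const primary = new Pool({ connectionString: process.env.PRIMARY_URL });

// Fast local enqueue on the hot path; the user never waits on the primary.
async function enqueue(payload: object): Promise<void> {
  await local.query("INSERT INTO writes_pending (payload) VALUES ($1)", [payload]);
}

// Background worker: flush in order, delete only after the primary acks,
// so a crash mid-drain re-sends rather than drops (make the target write
// idempotent).
async function drainOnce(): Promise<void> {
  const { rows } = await local.query(
    "SELECT id, payload FROM writes_pending ORDER BY id LIMIT 100"
  );
  for (const row of rows) {
    await primary.query("INSERT INTO events (payload) VALUES ($1)", [row.payload]);
    await local.query("DELETE FROM writes_pending WHERE id = $1", [row.id]);
  }
}

setInterval(() => drainOnce().catch(console.error), 1_000);
```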

Tasks
  • Identify writes where the user doesn't need the result immediately (events, analytics, audit logs)
  • Implement a queue (Redis, SQS, or a 'writes_pending' table in a separate region-local database; a streaming replica is read-only, so it can't host the queue)
  • Background worker drains the queue to the primary, with retry on failure
  • Monitor the queue depth; alert if it grows unbounded
  • Document which writes are async vs sync in the Brief
Gotchas
  • Async writes complicate the user's mental model. 'I clicked save, did it save?' is the question. Show optimistic UI but design for the failure case.
  • Don't async-queue writes that need transactional guarantees with reads (e.g. inventory decrements). Those need primary-region sync.

Plan and test the failover scenario

Half a day to test, ongoing to maintain

When primary goes down, what happens? Most managed Postgres providers offer automatic failover (promote a replica to primary). Test it. Run a fire drill: pick a quiet window, manually trigger the failover, time how long the app is down, capture the runbook.
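
One hedged building block for the runbook: a libpq-style multi-host connection string, so clients can find the promoted primary without waiting on a DNS cutover. libpq-based clients (psql, psycopg) understand this; verify your driver before depending on it. Hostnames are placeholders.

```ts
// Both hosts listed; target_session_attrs=read-write makes the client keep
// trying hosts until it finds the one that accepts writes (i.e. the
// promoted primary).
const failoverAwareUrl =
  "postgresql://app@db-use1.example.com:5432,db-euw1.example.com:5432/app" +
  "?target_session_attrs=read-write";
```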

Tasks
  • Document the failover runbook: how to manually promote a replica and where DNS / connection strings change
  • Schedule a fire drill: weekend morning, force a failover, time recovery
  • Verify the app reconnects to the new primary within X minutes
  • Document the data loss window (writes the dead primary acknowledged but didn't replicate before death)
  • Update the runbook with anything that surprised you in the drill
Gotchas
  • Auto-failover sometimes elects the wrong replica (the one with the most lag). Verify the election rules with your provider.
  • After failover, the old primary cannot rejoin without resync. Plan for the cost of the resync (downtime or read-only mode for hours).
  • DNS-based failover takes minutes because of TTLs. Faster failover needs client-side support: multi-host connection strings with target_session_attrs=read-write (libpq-based drivers handle this; check non-libpq drivers).

Handle migrations carefully (zero-downtime + cross-region)

Ongoing - per migration

Schema migrations are the most dangerous operation in multi-region. A migration on the primary must replicate cleanly to all replicas. Backwards-incompatible migrations break replicas mid-deploy. Always: add columns nullable first, deploy code that doesn't require them, backfill, then deploy code that uses them.
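
The expand-then-contract sequence as a sketch, using an illustrative rename of `full_name` to `display_name` (column names invented for the example); each numbered phase is a separate deploy.

```ts
// Phase 1 (expand): additive and backwards-compatible; nullable with no
// default, so it's a metadata-only change with no table rewrite.
const expand = "ALTER TABLE users ADD COLUMN display_name text";

// Phase 2: deploy code that writes both columns but still reads full_name.

// Phase 3 (backfill): small batches keep replica lag and lock time bounded.
const backfill =
  "UPDATE users SET display_name = full_name " +
  "WHERE id BETWEEN $1 AND $2 AND display_name IS NULL";

// Phase 4: deploy code that reads display_name. Only then, in a later
// migration, contract:
const contract = "ALTER TABLE users DROP COLUMN full_name";
```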

Tasks
  • Adopt expand-then-contract migration patterns: add new column nullable, deploy reading code, backfill, deploy writing code, drop old column in a later migration
  • For large tables, use logical replication for zero-downtime data moves rather than `pg_dump` (which holds locks for the duration of the dump that block schema changes)
  • Test every migration on a staging replica that mirrors prod topology
  • Schedule risky migrations during a quiet window and monitor replica lag
  • Document the rollback plan per migration
Gotchas
  • Postgres pg_dump holds ACCESS SHARE locks for the whole dump, blocking concurrent schema changes. For large migrations and zero-downtime data moves, use logical replication, not pg_dump.
  • DDL replication has subtle gotchas. ALTER TABLE on a heavily-replicated table can stall replicas while the change propagates.
  • Backwards-incompatible migrations + rolling deploy = some app instances run new code, some run old, and they all see the same DB. Always expand-then-contract.

Monitor latency from end users and iterate

1 week observation + ongoing

After deploy, measure. Track p50/p95/p99 page-load latency from each region. Compare against the baseline you measured in step 1. The goal: secondary-region users see latency within 30% of primary-region users. If you're not there, find the remaining bottleneck.
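
If you're reducing raw RUM samples yourself, here's a nearest-rank percentile sketch (sample numbers invented) for producing the p50/p95/p99 rows the Latency log expects:

```ts
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest-rank: the smallest value with at least p% of samples at or below it.
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length, Math.max(1, rank)) - 1];
}

// e.g. page-load times (ms) exported from RUM for one region:
const euWest = [180, 210, 95, 640, 120, 150, 300];
console.log({
  p50: percentile(euWest, 50), // 180
  p95: percentile(euWest, 95), // 640
  p99: percentile(euWest, 99), // 640
});
```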

Tasks
  • Set up RUM (real-user monitoring) per region: Sentry, LogRocket, or Vercel Analytics
  • Track p50/p95/p99 page-load latency per region for the top 5 pages
  • Compare against baseline (pre-multi-region) latency
  • Append rows to Latency log: region pair, query type, baseline, post-deploy
  • Iterate on remaining bottlenecks: more reads to push to replicas, edge caching, async write paths
Gotchas
  • RUM data is biased toward users with fast browsers. Cross-reference with synthetic tests from each region.
  • If post-deploy latency is still high, check whether your CDN is also region-aware. Read replicas don't help if your assets still come from the primary region.
FAQ

Common questions on this template.

Do I really need multi-region or is a CDN enough?
For most small SaaS apps, a CDN + edge caching solves 80% of the latency problem at 10% of the operational cost. Profile your slow pages first. If the bottleneck is asset delivery (LCP, JS bundles, images), use a CDN. If the bottleneck is genuinely DB round-trip from a secondary region, add read replicas. Don't pre-optimize multi-region; the operational cost is real.
What does multi-region Postgres actually cost?
On managed Postgres providers, expect roughly 2x your single-region bill. AWS RDS with one cross-region replica adds the replica's per-instance cost plus cross-region transfer (~$0.02/GB on AWS). For a small app: $200-500/month for a 2-region setup. Neon and Supabase have flatter pricing structures; Supabase's read replicas require Team plan ($599/month) which is the bigger jump.
What's the biggest mistake teams make going multi-region?
Two: (1) jumping to multi-master replication when read replicas would have sufficed - the operational cost of true multi-master is enormous and most teams don't actually need it; (2) forgetting that replicas lag, and routing reads to a stale replica right after a write produces 'I clicked save and it disappeared' bugs. Plan read-after-write consistency from day one.
Should I shard by tenant region instead?
Sharding-by-region is great when each customer is firmly anchored to one region (e.g. a B2B SaaS where each company's data lives in their headquarters region). It's painful when customers expand across regions (a US company hires in EU). For most SaaS, primary-with-replicas is more flexible. Shard if your tenants are clearly partitionable AND your data growth is per-tenant rather than per-event.
Can my AI agents help operate multi-region Postgres?
Yes. Agents are useful for: classifying every query in the codebase as read vs write, drafting the failover runbook, summarising daily replica lag, drafting expand-then-contract migrations from a single ALTER intent, and detecting when a query's region routing has regressed. The judgement calls (topology choice, failover trigger conditions) need humans. The template ships agent prompts inline for the read/write classification step.

Open this template as a workspace.

We mint a fresh copy in your org with the steps as table rows, the pointers as a separate table, and the brief as a doc. Bring your agents, start checking off boxes.