Set up multi-region for a Postgres app
Every step in the template

Outcome

An app where users in your secondary region see read latency drop from 200-300ms to 5-30ms, writes still work correctly, replica lag is monitored, and you have a documented failover plan when a region goes down.

Time: 2-4 weeks · Difficulty: advanced · For: tech leads serving users across multiple continents.
How this works

Open it, hand it to your agent, walk the steps.

Paste this to your agent (Claude / Cursor / Codex)
You are an agent on the "Set up multi-region for a Postgres app" template workspace.

Your role: maintain the four surfaces (Steps, Pointers, Brief, Latency log) as the team builds out multi-region.

Cadence:
- When the team measures p95 latency from a region, append a row to Latency log: region pair, query type, baseline, post-deploy.
- When the team adds a new query to a region-pinned path, update the Brief with the rationale.
- Daily: read replication-lag metrics; if any replica lag exceeds 5 seconds, raise it in the Brief.

First MCP tool calls:
1. list_surfaces(workspace_slug="set-up-multi-region-postgres")
2. list_rows(workspace_slug="set-up-multi-region-postgres", surface_slug="latency-log")
3. get_doc(workspace_slug="set-up-multi-region-postgres", surface_slug="brief")

Never propose true multi-master replication without explicit user opt-in - the operational cost is high enough that it should be a conscious decision, not an automated one.
The template · 10 steps

Top to bottom. Each step has tasks, pointers, gotchas.

Confirm you actually need multi-region (not just CDN + caching)

1 day of profiling

Most 'we need multi-region' problems are solved by a CDN, edge caching, and a well-designed read path. Multi-region Postgres adds operational complexity that takes years to fully amortize. Before you start, profile: are slow pages actually waiting on cross-region DB hits, or are they waiting on uncached static assets, slow client JS, or N+1 queries?
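
A quick way to see whether a page is actually DB-bound, as a hedged sketch (Express + node-postgres assumed; the route and query are placeholders, not part of the template): emit a Server-Timing header so the browser's network waterfall breaks out DB wait per request.

```ts
import express from "express";
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const app = express();

app.get("/dashboard", async (_req, res) => {
  // Time only the DB round trip, so the waterfall separates it from
  // rendering, asset, and network time.
  const t0 = performance.now();
  const { rows } = await pool.query("SELECT now() AS ts"); // stand-in for the real page query
  const dbMs = performance.now() - t0;
  // Shows up per-request in browser devtools -> Network -> Timing.
  res.setHeader("Server-Timing", `db;dur=${dbMs.toFixed(1)}`);
  res.json(rows);
});

app.listen(3000);
```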

Tasks
  • Pick 5 representative slow page loads from your secondary-region users
  • Profile each: network waterfall, server-side timing, query-level breakdown
  • Confirm the P95 page-load bottleneck is DB round-trip, not asset CDN miss or N+1 queries
  • Estimate the latency win from a regional replica (rule of thumb: ~80% of the cross-region RTT per sequential query; e.g. a 200ms us-east-1 to ap-southeast-1 RTT means roughly 160ms saved per read)
  • If the bottleneck is anything else, fix that first
Gotchas
  • A page that does 30 sequential DB queries will be slow no matter how close the DB is. Fix the N+1 first; your replicas won't save you.
  • If your bottleneck is asset delivery, a CDN buys you 90% of the latency win for 10% of the operational cost. Always try CDN first.

Pick the topology: primary + read replicas (start here)

1 hr

Three topology options: (1) primary + read replicas (one writeable region, multiple read regions), (2) sharded by tenant (each region owns a slice of users), (3) true multi-master (every region writes, conflicts resolve). Start with (1). It covers 90% of needs at 10% of the operational cost. Only escalate if profiling shows write latency is the bottleneck.

Tasks
  • Identify your primary region (the one that does ~70%+ of writes)
  • Identify the secondary regions (top 1-3 by user count, after primary)
  • Pick the topology: primary + read replicas, period, for the first iteration
  • Document the choice + why in the Brief, with a 'when we'd reconsider' note
Gotchas
  • True multi-master Postgres in 2024 is still operationally hard. BDR, Active-Active, and similar require deep expertise. If a vendor is selling you 'just turn on multi-master', read the conflict resolution docs carefully.
  • Sharding by tenant region works great until a customer expands across regions. Plan for that case before you commit to sharding.

Provision read replicas in target regions

2-3 hr

On Neon, Supabase Team+, AWS RDS, or whatever managed Postgres you use, spin up a read replica per target region. Cross-region replication lag is typically 50-500ms depending on provider; verify the actual lag before you write code that assumes it.
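
A minimal lag probe for that last verification step, as a sketch: it assumes a small `replication_probe (id serial primary key, written_at timestamptz)` table you create for this purpose (table and env var names are illustrative). Write a marker on the primary, then poll the replica until it shows up.

```ts
import { Pool } from "pg";

const primary = new Pool({ connectionString: process.env.PRIMARY_URL });
const replica = new Pool({ connectionString: process.env.REPLICA_URL });

async function measureLagMs(): Promise<number> {
  // Insert a marker row on the primary...
  const { rows } = await primary.query(
    "INSERT INTO replication_probe (written_at) VALUES (now()) RETURNING id"
  );
  const start = Date.now();
  // ...then poll the replica until the row has replicated.
  for (;;) {
    const seen = await replica.query(
      "SELECT 1 FROM replication_probe WHERE id = $1",
      [rows[0].id]
    );
    if (seen.rowCount) return Date.now() - start;
    await new Promise((r) => setTimeout(r, 25));
  }
}

measureLagMs().then((ms) => console.log(`observed replication lag: ~${ms}ms`));
```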

Tasks
  • Pick the regions (typically: primary in us-east-1, replicas in eu-west-1 and ap-southeast-1)
  • Provision the replica via your provider's console or CLI
  • Wait for the initial sync to complete (minutes to hours depending on data size)
  • Verify you can connect from the replica region with the connection string
  • Test a known-fresh write on the primary, then a SELECT against the replica; measure the lag
Gotchas
  • Initial sync of a multi-GB database across regions can take hours and cost real bandwidth. Provision during off-peak.
  • Cross-region transfer charges add up fast. AWS charges $0.02/GB out for cross-region; a 100GB DB initial sync is $2 plus ongoing replication traffic.

Implement read/write splitting in the app

Half a day to a day

Without app-level routing, every request hits the primary regardless of region. Split your DB connection: one for writes (always primary), one or more for reads (region-local replica). Most ORMs support multiple connections; route reads to the closest replica based on request region.
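
A sketch of the routing decision, assuming node-postgres, env-provided connection strings, and a region value you've already extracted from your geo header; none of these names come from the template.

```ts
import { Pool } from "pg";

const primary = new Pool({ connectionString: process.env.PRIMARY_URL });
const replicas: Record<string, Pool> = {
  "eu-west-1": new Pool({ connectionString: process.env.EU_REPLICA_URL }),
  "ap-southeast-1": new Pool({ connectionString: process.env.AP_REPLICA_URL }),
};

// The 'use_replica_reads' feature flag from the tasks below: flip it off
// and every read goes back to the primary.
const useReplicaReads = process.env.USE_REPLICA_READS === "1";

function poolFor(kind: "read" | "write", region: string): Pool {
  // Writes (and whole transactions) always hit the primary.
  if (kind === "write" || !useReplicaReads) return primary;
  // Reads go to the region-local replica when one exists.
  return replicas[region] ?? primary;
}

// usage: await poolFor("read", "eu-west-1").query("SELECT ...");
```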

Tasks
  • Configure two DB connection objects in app code: `primary` (writes) and `replica` (reads, region-local)
  • Determine the request's region (from your CDN's geo header, e.g. CF-IPCountry, or env-pinned per deployment)
  • Route each query: writes always to `primary`; reads to `replica` if available, else `primary`
  • For transactions that read-then-write, route ALL queries in the transaction to `primary` (replica reads inside a write txn = stale data bugs)
  • Keep a feature flag: 'use_replica_reads' so you can roll back instantly
Gotchas
  • Read-then-write transactions on a replica = stale read = data corruption. Always route the entire transaction to primary.
  • Some ORMs are sneaky about transactions. A 'side-effect-free' SELECT FOR UPDATE takes row locks and will error on a read-only replica; route it to primary.
  • If your app uses connection pooling at the DB level (pgbouncer transaction mode), make sure the routing decision happens at the app, before the pool selects a connection.
Agent prompt for this step
Read every DB query in the codebase. Classify each as:
- READ (SELECT, no side effects).
- WRITE (INSERT, UPDATE, DELETE, or any function that mutates state).
- TRANSACTION_READ (a SELECT inside a transaction that also writes).

Output:
1. A list of file:line references with classification.
2. A proposed code change for each READ to use the replica connection.
3. A list of TRANSACTION_READ queries to leave alone (must stay on primary).
4. Any query that's borderline (e.g. SELECT FOR UPDATE) - flag for human review.

Handle the read-after-write inconsistency

Half a day

Replicas lag, even fast ones. A user who creates a record on the primary and immediately reads from the replica may get the pre-create state. Three patterns: (1) read-your-own-writes by routing the next N seconds of reads to primary, (2) wait-for-replication if your provider exposes it, (3) accept the inconsistency for non-critical reads.
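
A sketch of pattern (1), read-your-own-writes, layered on the `poolFor()` router from the previous step; the in-memory map is a stand-in for whatever session store or signed cookie you actually use.

```ts
// Remember when each session last wrote (swap the Map for your session
// store or a signed cookie if you run more than one app instance).
const lastWriteAt = new Map<string, number>();
const STICKY_MS = 5_000; // route this session's reads to primary for 5s

function recordWrite(sessionId: string): void {
  lastWriteAt.set(sessionId, Date.now());
}

function readPoolFor(sessionId: string, region: string) {
  const t = lastWriteAt.get(sessionId);
  const wroteRecently = t !== undefined && Date.now() - t < STICKY_MS;
  // Fresh writers read from primary; everyone else gets the local replica.
  return wroteRecently ? poolFor("write", region) : poolFor("read", region);
}
```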

Tasks
  • Identify the workflows where read-after-write matters (creates that show up immediately, account changes that affect auth)
  • For each: pick a strategy. Default: route reads to primary for 5 seconds after a write from the same user/session
  • Implement a per-session 'recent write' timestamp; if now - timestamp < 5s, route reads to primary
  • For non-critical reads (analytics, dashboards), accept replica lag; the dashboard from 1 second ago is fine
  • Test by hand: create a record, immediately list, verify the new record appears
Gotchas
  • The 'always read from primary' fallback is the safe default but defeats the purpose of replicas. Be deliberate about which reads can tolerate lag.
  • User-visible inconsistencies (you posted a comment, refresh, comment is gone) destroy trust faster than slow pages. Default to primary for self-reads.

Monitor replica lag and set alerts

2-3 hr

Replica lag is the silent killer. A replica 30 seconds behind serves reads that look stale; a replica 5 minutes behind looks broken. Monitor lag, alert on sustained lag over 5 seconds, and fall back to primary reads if lag exceeds your tolerance.
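
A sketch of the lag check plus circuit breaker, run against the replica. `pg_last_xact_replay_timestamp()` is standard Postgres; the 5-second threshold mirrors the alert rule below, and `replicas` is the pool map from the read/write-splitting sketch above.

```ts
import { Pool } from "pg";

let replicaHealthy = true; // circuit breaker consulted by the read router

async function checkReplicaLag(replica: Pool): Promise<void> {
  const { rows } = await replica.query(
    "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_s"
  );
  const lagS = Number(rows[0].lag_s ?? 0);
  // Caveat: on an idle primary this value keeps growing with nothing to
  // replay; cross-check with pg_stat_replication.replay_lag on the primary.
  replicaHealthy = lagS < 5;
}

// Poll every 15s; treat an unreachable replica as unhealthy.
setInterval(() => {
  checkReplicaLag(replicas["eu-west-1"]).catch(() => {
    replicaHealthy = false;
  });
}, 15_000);
```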

Tasks
  • Query `pg_stat_replication` (on primary) or `pg_last_wal_replay_lsn()` (on replica) for lag in bytes or time
  • Add the metric to your observability stack (Prometheus, Datadog, etc.)
  • Alert on replica lag > 5 seconds for > 1 minute
  • Add a circuit breaker in app code: if replica is too far behind, route reads to primary until it catches up
  • Test the circuit breaker by force-pausing the replica
Gotchas
  • Replica lag during heavy writes is normal and self-recovers. Alert on sustained lag (over 1 minute), not on momentary spikes.
  • A replica with multiple GB of lag may take 10+ minutes to catch up after the write storm ends. Have the circuit breaker.

Handle writes that need low latency in secondary regions

Half a day to a day, only if needed

Some writes are 'fire and forget' (analytics, audit logs, write-only feeds). For those, you don't need to wait on the round-trip to primary. Async write patterns: queue locally, flush to primary in the background. Your secondary-region users see fast page loads even on write-heavy interactions.
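
A sketch of the queue-and-drain pattern with a 'writes_pending' table. Note the queue must live in a writable region-local store (a small local Postgres here; a streaming replica is read-only and can't host it). Table, column, and env var names are invented; `payload` is assumed jsonb.

```ts
import { Pool } from "pg";

const local = new Pool({ connectionString: process.env.LOCAL_QUEUE_URL });
const primary = new Pool({ connectionString: process.env.PRIMARY_URL });

// Fast local enqueue on the hot path; the user never waits on the primary.
async function enqueue(payload: object): Promise<void> {
  await local.query("INSERT INTO writes_pending (payload) VALUES ($1)", [payload]);
}

// Background worker: flush in order, delete only after the primary acks,
// so a crash mid-drain re-sends rather than drops (make the target write
// idempotent).
async function drainOnce(): Promise<void> {
  const { rows } = await local.query(
    "SELECT id, payload FROM writes_pending ORDER BY id LIMIT 100"
  );
  for (const row of rows) {
    await primary.query("INSERT INTO events (payload) VALUES ($1)", [row.payload]);
    await local.query("DELETE FROM writes_pending WHERE id = $1", [row.id]);
  }
}

setInterval(() => drainOnce().catch(console.error), 1_000);
```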

Tasks
  • Identify writes where the user doesn't need the result immediately (events, analytics, audit logs)
  • Implement a queue (Redis, SQS, or a 'writes_pending' table in a separate region-local database; a streaming replica is read-only, so it can't host the queue)
  • Background worker drains the queue to the primary, with retry on failure
  • Monitor the queue depth; alert if it grows unbounded
  • Document which writes are async vs sync in the Brief
Gotchas
  • Async writes complicate the user's mental model. 'I clicked save, did it save?' is the question. Show optimistic UI but design for the failure case.
  • Don't async-queue writes that need transactional guarantees with reads (e.g. inventory decrements). Those need primary-region sync.

Plan and test the failover scenario

Half a day to test, ongoing to maintain

When primary goes down, what happens? Most managed Postgres providers offer automatic failover (promote a replica to primary). Test it. Run a fire drill: pick a quiet window, manually trigger the failover, time how long the app is down, capture the runbook.
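
One hedged building block for the runbook: a libpq-style multi-host connection string, so clients can find the promoted primary without waiting on a DNS cutover. libpq-based clients (psql, psycopg) understand this; verify your driver before depending on it. Hostnames are placeholders.

```ts
// Both hosts listed; target_session_attrs=read-write makes the client keep
// trying hosts until it finds the one that accepts writes (i.e. the
// promoted primary).
const failoverAwareUrl =
  "postgresql://app@db-use1.example.com:5432,db-euw1.example.com:5432/app" +
  "?target_session_attrs=read-write";
```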

Tasks
  • Document the failover runbook: how to manually promote a replica and where DNS / connection strings change
  • Schedule a fire drill: weekend morning, force a failover, time recovery
  • Verify the app reconnects to the new primary within X minutes
  • Document the data loss window (writes the dead primary acknowledged but didn't replicate before death)
  • Update the runbook with anything that surprised you in the drill
Gotchas
  • Auto-failover sometimes elects the wrong replica (the one with the most lag). Verify the election rules with your provider.
  • After failover, the old primary cannot rejoin without resync. Plan for the cost of the resync (downtime or read-only mode for hours).
  • DNS-based failover takes minutes because of TTLs. Faster failover needs client-side support: multi-host connection strings with target_session_attrs=read-write (libpq-based drivers handle this; check non-libpq drivers).

Handle migrations carefully (zero-downtime + cross-region)

Ongoing - per migration

Schema migrations are the most dangerous operation in multi-region. A migration on the primary must replicate cleanly to all replicas. Backwards-incompatible migrations break replicas mid-deploy. Always: add columns nullable first, deploy code that doesn't require them, backfill, then deploy code that uses them.
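
The expand-then-contract sequence as a sketch, using an illustrative rename of `full_name` to `display_name` (column names invented for the example); each numbered phase is a separate deploy.

```ts
// Phase 1 (expand): additive and backwards-compatible; nullable with no
// default, so it's a metadata-only change with no table rewrite.
const expand = "ALTER TABLE users ADD COLUMN display_name text";

// Phase 2: deploy code that writes both columns but still reads full_name.

// Phase 3 (backfill): small batches keep replica lag and lock time bounded.
const backfill =
  "UPDATE users SET display_name = full_name " +
  "WHERE id BETWEEN $1 AND $2 AND display_name IS NULL";

// Phase 4: deploy code that reads display_name. Only then, in a later
// migration, contract:
const contract = "ALTER TABLE users DROP COLUMN full_name";
```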

Tasks
  • Adopt expand-then-contract migration patterns: add new column nullable, deploy reading code, backfill, deploy writing code, drop old column in a later migration
  • For large tables, use logical replication for zero-downtime data moves rather than `pg_dump` (which holds locks for the duration of the dump that block schema changes)
  • Test every migration on a staging replica that mirrors prod topology
  • Schedule risky migrations during a quiet window and monitor replica lag
  • Document the rollback plan per migration
Gotchas
  • Postgres pg_dump holds ACCESS SHARE locks for the whole dump, blocking concurrent schema changes. For large migrations and zero-downtime data moves, use logical replication, not pg_dump.
  • DDL replication has subtle gotchas. ALTER TABLE on a heavily-replicated table can stall replicas while the change propagates.
  • Backwards-incompatible migrations + rolling deploy = some app instances run new code, some run old, and they all see the same DB. Always expand-then-contract.

Monitor latency from end users and iterate

1 week observation + ongoing

After deploy, measure. Track p50/p95/p99 page-load latency from each region. Compare against the baseline you measured in step 1. The goal: secondary-region users see latency within 30% of primary-region users. If you're not there, find the remaining bottleneck.
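
If you're reducing raw RUM samples yourself, here's a nearest-rank percentile sketch (sample numbers invented) for producing the p50/p95/p99 rows the Latency log expects:

```ts
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest-rank: the smallest value with at least p% of samples at or below it.
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length, Math.max(1, rank)) - 1];
}

// e.g. page-load times (ms) exported from RUM for one region:
const euWest = [180, 210, 95, 640, 120, 150, 300];
console.log({
  p50: percentile(euWest, 50), // 180
  p95: percentile(euWest, 95), // 640
  p99: percentile(euWest, 99), // 640
});
```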

Tasks
  • Set up RUM (real-user monitoring) per region: Sentry, LogRocket, or Vercel Analytics
  • Track p50/p95/p99 page-load latency per region for the top 5 pages
  • Compare against baseline (pre-multi-region) latency
  • Append rows to Latency log: region pair, query type, baseline, post-deploy
  • Iterate on remaining bottlenecks: more reads to push to replicas, edge caching, async write paths
Gotchas
  • RUM data is biased toward users with fast browsers. Cross-reference with synthetic tests from each region.
  • If post-deploy latency is still high, check whether your CDN is also region-aware. Read replicas don't help if your assets still come from the primary region.
FAQ

Common questions on this template.

Do I really need multi-region or is a CDN enough?
For most small SaaS apps, a CDN + edge caching solves 80% of the latency problem at 10% of the operational cost. Profile your slow pages first. If the bottleneck is asset delivery (LCP, JS bundles, images), use a CDN. If the bottleneck is genuinely DB round-trip from a secondary region, add read replicas. Don't pre-optimize multi-region; the operational cost is real.
What does multi-region Postgres actually cost?
On managed Postgres providers, expect roughly 2x your single-region bill. AWS RDS with one cross-region replica adds the replica's per-instance cost plus cross-region transfer (~$0.02/GB on AWS). For a small app: $200-500/month for a 2-region setup. Neon and Supabase have flatter pricing structures; Supabase's read replicas require Team plan ($599/month) which is the bigger jump.
What's the biggest mistake teams make going multi-region?
Two: (1) jumping to multi-master replication when read replicas would have sufficed - the operational cost of true multi-master is enormous and most teams don't actually need it; (2) forgetting that replicas lag, and routing reads to a stale replica right after a write produces 'I clicked save and it disappeared' bugs. Plan read-after-write consistency from day one.
Should I shard by tenant region instead?
Sharding-by-region is great when each customer is firmly anchored to one region (e.g. a B2B SaaS where each company's data lives in their headquarters region). It's painful when customers expand across regions (a US company hires in EU). For most SaaS, primary-with-replicas is more flexible. Shard if your tenants are clearly partitionable AND your data growth is per-tenant rather than per-event.
Can my AI agents help operate multi-region Postgres?
Yes. Agents are useful for: classifying every query in the codebase as read vs write, drafting the failover runbook, summarising daily replica lag, drafting expand-then-contract migrations from a single ALTER intent, and detecting when a query's region routing has regressed. The judgement calls (topology choice, failover trigger conditions) need humans. The template ships agent prompts inline for the read/write classification step.

Open this template as a workspace.

We mint a fresh copy in your org with the steps as table rows, the pointers as a separate table, and the brief as a doc. Bring your agents, start checking off boxes.