Set up a status page for your SaaS
An 8-step playbook. Open in Dock and you'll get four surfaces seeded:
- **Steps** (table) — the 8 setup gates as rows, owner + due + status
- **Components** (table) — the customer-facing components your status page exposes
- **Brief** (doc) — the canonical write-up + the incident message templates
- **Subscribers** (table) — the list of subscribed customers + admin contacts
Read `Steps` top-to-bottom on first open. Status pages must NEVER be hosted on the same infra as the product they monitor — that's gate one and it's non-negotiable.
Outcome
status.yourdomain.com live with the right components, auto-probes wired, customer subscribers receiving incident emails, and an incident playbook so updates ship within 30 min of a Sev-1.
Estimated time: 1 week
Difficulty: beginner
For: Founders + first SREs of SaaS companies with paying customers.
What you'll need
Pre-register or install before you start.
- Statuspage.io (Free for 2 components; from $29/mo Hobby; $99/mo Starter) — The market-leader hosted status page. Component-based, integrates with PagerDuty + Datadog.
- Instatus (Free up to 25 subscribers; $20/mo Hobby; $50/mo Pro) — Lighter alternative to Statuspage.io with comparable features. Faster page loads.
- Cachet (self-hosted) (Free; you host it on infra separate from your product) — Open-source self-hosted status page. PHP + MySQL.
- Better Uptime (Free for 1 monitor; $18/mo Freelancer; $50/mo Team) — Status page + uptime monitoring + on-call in one product.
- Pingdom (or Datadog Synthetics) (From $15/mo for Pingdom Synthetics) — External uptime probes that auto-update the status page when components fail.
The template · 8 steps
Step 1: Pick a status page platform
Estimated time: 2-4 hr
The platform decision is 80% done by your team's existing tools. If you use PagerDuty + Atlassian, Statuspage.io is the natural fit (same vendor, deep integration). If you use Better Stack for monitoring, Better Uptime bundles status page + monitoring. If you have a strict no-SaaS rule, Cachet is the open-source path. The tradeoff is hosting cost (Statuspage's $99/mo) vs setup time (Cachet's day-or-two of self-hosting).
Tasks
- Decide: hosted (Statuspage.io / Instatus / Better Uptime) vs self-hosted (Cachet)
- If hosted: sign up; pick the tier that matches your subscriber + component count
- Set up the team accounts + roles (admin, IC, support)
- Document the decision in the Brief
Pointers
- [Tool] Statuspage.io
- [Tool] Instatus
- [Tool] Cachet
[!CAUTION] Gotchas
- Statuspage.io free tier limits you to 2 components — useful for trying it but not for a real product. Budget for the $29/mo or $99/mo tier from week 1.
- Self-hosted Cachet means YOU operate the status page. If the host is on the same infra as your product, when the product is down the status page is too — defeating the point. Host it on a separate cloud account or a different region.
- Better Uptime's combined product is great for early teams but creates a single point of failure: if Better Uptime is down, you lose both monitoring and your status page at once. Larger teams diversify.
Step 2: Decide on customer-facing components
Estimated time: 2-4 hr
Components are what users see on the status page: 'API,' 'Web app,' 'Database,' 'Webhooks.' The mistake: listing internal components customers don't care about ('Kafka cluster 3,' 'Background job queue C'). Make the components map to user-visible product surfaces, not your internal architecture.
Tasks
- List every customer-visible product surface (login, signup, web app, API, mobile app, webhooks, integrations)
- Group internal services into customer-facing components (don't expose the internal architecture)
- Decide on the granularity: too few = vague ('Service'); too many = noisy (15 components)
- Add the components in the platform; they default to 'Operational'
- Document the component-to-internal-service mapping in the Components surface
Pointers
- [Official] Atlassian Statuspage components guide
[!CAUTION] Gotchas
- 5-8 components is the sweet spot. Below 5 the page looks unprofessional; above 10 it becomes a wall of mostly-green that nobody reads.
- Don't expose 'database' as a component if customers can't tell it's down separately from the app. If 'DB down' = 'app down' to users, just have one 'Web app' component.
- Region-specific components ('US East', 'EU West') only matter if your customers know about regions. For most SaaS that's a no.
Agent prompt for this step
Read this codebase + the marketing site + the docs and produce a customer-facing component list for the status page.
For each component, output:
1. Component name (e.g. "API", "Web app", "Webhooks")
2. What customers do with it (1-line description)
3. Internal services that contribute (the underlying microservices or AWS resources)
4. Failure surface (what users see when it's degraded vs down)
5. Probe strategy (HTTP probe to which URL? Synthetic test? Datadog metric?)
Output to the Components surface. Aim for 5-8 components total — too few feels vague, too many is noisy.
Step 3: Wire automated probes
Estimated time: 1-2 days
Manual status updates are too slow — by the time the IC posts the status, customers have already complained on Twitter. Automated probes (Pingdom, Datadog Synthetics, AWS Route53 health checks) hit your endpoints every 60s and update the status page automatically when failures cross the threshold. Wire them on day one.
Tasks
- Pick the probing tool (Pingdom, Datadog Synthetics, AWS Route53, Better Stack)
- Set up one synthetic probe per customer-facing component
- Configure the probe: hit a public health endpoint every 60-120s, alert on 3 consecutive failures
- Wire the probe alert to the status page (Statuspage.io has direct integrations)
- Configure the escalation mapping: 1 failure = Investigating; 2-4 = Degraded; 5+ = Major Outage
- Test: take down a staging endpoint; verify the status page updates within 5 min
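The quorum and escalation logic above can be sketched in a few lines. Status strings here are illustrative, not any specific provider's API values, and the thresholds follow the task list's mapping:

```python
def region_quorum_down(region_results: list[bool], quorum: int = 2) -> bool:
    """True when at least `quorum` probe regions report failure.

    region_results: one bool per region, True = probe succeeded.
    Requiring 2-of-3 regions filters out single-region internet hiccups.
    """
    return sum(1 for ok in region_results if not ok) >= quorum

def status_for(consecutive_failures: int) -> str:
    """Map consecutive multi-region failures to a component status."""
    if consecutive_failures >= 5:
        return "major_outage"
    if consecutive_failures >= 2:
        return "degraded_performance"
    if consecutive_failures >= 1:
        return "investigating"
    return "operational"

# One region flapping is not an incident; two of three is.
print(region_quorum_down([True, True, False]))   # prints: False
print(region_quorum_down([True, False, False]))  # prints: True
print(status_for(5))                             # prints: major_outage
```

Whatever the thresholds, the IC override from the gotchas below still applies: this logic proposes a status, a human can always pin a different one.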
Pointers
- [Official] Datadog Synthetics
- [Tool] Pingdom synthetic transactions
- [Official] Statuspage automation
[!CAUTION] Gotchas
- Probes from a single region give false positives during regional internet hiccups. Use multi-region probing (3+ regions) and require 2 of 3 to fail before triggering.
- Probes that check 'is the homepage 200?' miss application-level failures. Add deeper synthetic checks: log in, hit an authenticated endpoint, verify the response body.
- Auto-updating components is great UNTIL the probe is wrong. Always allow IC override; never let the status page lock in a state that contradicts what the IC knows.
Step 4: Configure custom domain (status.yourdomain.com)
Estimated time: 1 hr (plus DNS propagation)
Customers Google 'is yourcompany down' and expect a page at status.yourdomain.com. Setting up the custom domain is a CNAME + a few minutes; do it on day one. Leaving the status page on a yourcompany.statuspage.io URL hurts trust and SEO.
Tasks
- Add the custom domain in the platform (Statuspage.io / Instatus / Cachet)
- Set up the CNAME: status.yourdomain.com → platform's CNAME target
- Configure SSL (most platforms auto-issue Let's Encrypt)
- Wait 1-24 hr for DNS propagation
- Verify HTTPS works + the page loads at status.yourdomain.com
- Add status.yourdomain.com to your sitemap + footer link
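The verify step can be partly scripted. A stdlib-only sketch: resolve both hosts, confirm the status subdomain lands on different IPs than the apex (a rough signal it's on the provider's infra, not yours), and handshake TLS. The hostname is a placeholder; substitute your own:

```python
import socket
import ssl

def resolve(host: str, port: int = 443) -> set[str]:
    """Resolve a hostname to its current IP addresses (CNAMEs are followed)."""
    return {info[4][0] for info in socket.getaddrinfo(host, port)}

def points_offsite(apex_ips: set[str], status_ips: set[str]) -> bool:
    """True when status.* resolves to entirely different hosts than the apex --
    a rough check that the page lives on the platform's infra, not yours."""
    return bool(status_ips) and apex_ips.isdisjoint(status_ips)

def cert_expiry(host: str) -> str:
    """TLS-handshake the host and return the certificate's notAfter string."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, 443), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]

# Example (network calls -- run against your real domains):
# ok = points_offsite(resolve("yourdomain.com"), resolve("status.yourdomain.com"))
```

If `points_offsite` is False, the CNAME is probably being proxied through your own CDN or pointed at your own infra — see the Cloudflare gotcha below.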
Pointers
- [Official] Statuspage.io custom domain setup
[!CAUTION] Gotchas
- CNAME means status.yourdomain.com points to a different server than yourdomain.com — that's intentional. Status page is on the platform's infra, not yours.
- Don't forget HTTPS. Customers visiting an HTTP-only status page during an incident bounce instantly. Most platforms auto-provision SSL but verify.
- If you use Cloudflare proxy mode, switch your status page CNAME to DNS-only mode — proxying routes status-page traffic through the same CDN configuration as your product, a shared failure mode that defeats the point.
Step 5: Set up incident communication templates
Estimated time: 2-4 hr
During a Sev-1, the IC has 30 seconds to compose every status update. Templates eliminate the 'what do I say' freeze. Pre-write four templates: Investigating, Identified, Monitoring, Resolved. Each is 2-3 sentences with placeholders for the specifics.
Tasks
- Write the 'Investigating' template ('We're investigating reports of [ISSUE]. Next update in 30 min.')
- Write the 'Identified' template ('We've identified the cause: [CAUSE]. We're working on a fix. ETA [TIME].')
- Write the 'Monitoring' template ('A fix has been deployed. We're monitoring the recovery. Next update in 30 min if conditions hold.')
- Write the 'Resolved' template ('The incident has been resolved as of [TIME]. A postmortem will be posted within 5 business days.')
- Save templates in the platform's template library
- Save templates in the Brief for reference + version control
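Placeholder templates are only safe if nothing can ship with an empty slot. A minimal sketch of fill-and-verify, using the four templates above verbatim:

```python
import re

TEMPLATES = {
    "investigating": "We're investigating reports of [ISSUE]. Next update in 30 min.",
    "identified": "We've identified the cause: [CAUSE]. We're working on a fix. ETA [TIME].",
    "monitoring": "A fix has been deployed. We're monitoring the recovery. Next update in 30 min if conditions hold.",
    "resolved": "The incident has been resolved as of [TIME]. A postmortem will be posted within 5 business days.",
}

def fill(template_key: str, **values: str) -> str:
    """Fill [PLACEHOLDER] slots; refuse to ship a message with empty slots."""
    msg = TEMPLATES[template_key]
    for key, value in values.items():
        msg = msg.replace(f"[{key.upper()}]", value)
    leftover = re.findall(r"\[[A-Z]+\]", msg)
    if leftover:
        raise ValueError(f"unfilled placeholders: {leftover}")
    return msg

print(fill("investigating", issue="login failures for ~5% of users"))
# prints: We're investigating reports of login failures for ~5% of users. Next update in 30 min.
```

The `ValueError` is the point: a half-filled 'Identified' message with a literal `[TIME]` in it is exactly the kind of thing a rushed IC pastes at 3am.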
Pointers
[!CAUTION] Gotchas
- Templates that are too vague ('investigating an issue') burn customer trust. Be specific: 'login failures for ~5% of users.'
- Promising an ETA you can't keep is worse than not promising one. Say 'next update in 30 min' rather than 'fixed in 30 min.'
- Customers don't want a copy of the postmortem in the resolved message. Say 'postmortem in 5 business days' and let them subscribe to a separate email.
Agent prompt for this step
Draft 4 incident communication templates for this SaaS's status page.
Output: one template each for Investigating / Identified / Monitoring / Resolved.
Each template:
1. 2-3 sentences max
2. Placeholders in [BRACKETS] for specifics: [ISSUE], [CAUSE], [TIME], [ETA], [SCOPE]
3. Honest about uncertainty ("We're investigating" not "We've fixed")
4. Specific where possible ("checkout failures for ~5% of users" not "issues with the platform")
5. No blame ("we're investigating an upstream issue" not "AWS is having problems")
6. Cadence promise ("Next update in 30 min")
Plus 2 customer-facing email templates: Initial customer email (for affected paid customers within 4 hours of resolution) and follow-up postmortem email (within 5 business days).
Output to the Brief surface as a versioned templates section.
Step 6: Set up subscriptions (email, SMS, RSS, Slack)
Estimated time: 2-4 hr
Customers want push notifications when components break — they don't want to refresh status.yourdomain.com. Most platforms offer email, SMS, RSS, and Slack/Webhook subscriptions out of the box. Configure them all and advertise them in your docs.
Tasks
- Enable email subscriptions (most platforms ON by default)
- Enable SMS subscriptions if you have enterprise customers (priced per SMS)
- Enable RSS / Atom feeds (free; trivial; some customers prefer them)
- Enable Slack / Webhook subscriptions (your team subscribes their internal Slack)
- Configure: customers can subscribe to specific components vs all (most platforms default to all)
- Add a 'Subscribe to status updates' link in your docs + email footer + customer portal
- Pre-subscribe your support team + leadership so they get the same updates customers do
Pointers
- [Official] Statuspage subscriber notifications
[!CAUTION] Gotchas
- SMS subscriptions cost $0.04-$0.08 per message. A Sev-1 with 4 status updates × 1000 subscribers = $160-$320 per incident. Reserve SMS for enterprise tiers.
- Email subscriptions can land in spam. Verify SPF / DKIM / DMARC on the sending domain (most platforms use their own; some let you bring your own).
- Slack subscriptions for the customer's own team are high-value. Promote them in your enterprise sales — a simple 'subscribe via Slack' is a procurement-friendly affordance.
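The SMS math above is worth keeping as a one-liner when you're deciding which tiers get SMS. A sketch; the per-message rate is an assumption that varies by carrier and country:

```python
def sms_cost_per_incident(subscribers: int, updates: int = 4,
                          cost_per_sms: float = 0.06) -> float:
    """Rough SMS bill for one incident: every update fans out to every subscriber.

    cost_per_sms is an assumed mid-range rate ($0.04-$0.08 is typical);
    check your platform's actual per-message pricing.
    """
    return subscribers * updates * cost_per_sms

# 1000 SMS subscribers, 4 status updates during a Sev-1:
print(f"${sms_cost_per_incident(1000):.2f}")  # prints: $240.00
```

Running the same function at $0.04 and $0.08 reproduces the $160-$320 range from the gotcha above.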
Step 7: Document the incident operating procedure
Estimated time: 1 day
The status page is the artifact; the operating procedure is the discipline. Document: who can update the page, how fast updates ship after an incident is declared, how often updates ship during an incident, who writes the resolved message, who emails affected customers. Without the procedure, status updates ship inconsistently — which burns customer trust faster than the outage itself.
Tasks
- Write the operating procedure: who declares, who updates, who resolves, who post-mortems
- Set the SLAs: first update within 15 min of incident declared, then every 30 min until resolved
- Decide who writes the customer email post-incident (typically the IC + a support lead)
- Decide on retro cadence: every Sev-1 + Sev-2 gets a postmortem on the status page within 5 days
- Document the procedure in the Runbooks doc
- Train the IC rotation on the procedure (1-hour walk-through with a tabletop drill)
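The SLA in the second task (first update within 15 min, then every 30 min) can be turned into a due-time list the IC tooling pings against. A sketch under those assumed defaults:

```python
from datetime import datetime, timedelta

def update_schedule(declared_at: datetime, first_update_sla_min: int = 15,
                    cadence_min: int = 30, horizon_hours: int = 2) -> list[datetime]:
    """Due times for status updates: first one within the SLA window,
    then one every `cadence_min` until the planning horizon."""
    due = [declared_at + timedelta(minutes=first_update_sla_min)]
    end = declared_at + timedelta(hours=horizon_hours)
    while due[-1] + timedelta(minutes=cadence_min) <= end:
        due.append(due[-1] + timedelta(minutes=cadence_min))
    return due

declared = datetime(2025, 1, 1, 12, 0)
print([d.strftime("%H:%M") for d in update_schedule(declared)])
# prints: ['12:15', '12:45', '13:15', '13:45']
```

The horizon is just a planning window; a real incident extends it until resolved.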
Pointers
[!CAUTION] Gotchas
- Status update cadence is the trust-or-burn lever. 30 minutes between updates feels long during an outage; 60 minutes feels abandoned. Aim for 30 min during active incidents.
- Customer emails for incidents take longer than status page updates (review, soft tone, no internal jargon). Default cadence: 4 hours after resolution for an initial email; 5 business days for the postmortem.
- Don't update the status page unless something has actually changed. 'Still investigating' every 30 min for 4 hours feels worse than 'next update in 1 hour' once.
Step 8: Operate it: weekly audit + quarterly retrospective
Estimated time: Weekly, 15 min; quarterly, 1 hr
Status pages decay. Components drift from the actual product (you ship a feature; nobody adds it). Probes drift as endpoints change. The subscriber list grows stale. Run a weekly 15-min audit + a quarterly retrospective: is the page still accurate, are subscribers happy, and what's the average time-to-first-update during incidents?
Tasks
- Weekly: verify the components match the current product surface; verify probes are firing
- Weekly: review the subscribers list; remove obvious test accounts
- Quarterly: pull the average time-to-first-update from the last quarter's incidents (target: <15 min)
- Quarterly: pull subscriber growth rate (a healthy status page grows subscribers as customers do)
- Quarterly: review the templates — any phrasing that misfired during an incident?
- Annually: post a 'year in review' postmortem (uptime numbers, biggest incidents, learnings) — customers respect this
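The quarterly time-to-first-update pull is a two-column computation. A sketch over a hypothetical incident log (the timestamps below are invented for illustration):

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident log: (incident declared, first status update posted)
incidents = [
    (datetime(2025, 1, 3, 9, 2), datetime(2025, 1, 3, 9, 14)),
    (datetime(2025, 2, 11, 22, 40), datetime(2025, 2, 11, 23, 1)),
    (datetime(2025, 3, 7, 14, 15), datetime(2025, 3, 7, 14, 23)),
]

def avg_time_to_first_update(log) -> float:
    """Mean minutes from incident declaration to the first status update."""
    return mean((first - declared).total_seconds() / 60
                for declared, first in log)

print(f"avg TTFU: {avg_time_to_first_update(incidents):.1f} min (target: <15)")
# prints: avg TTFU: 13.7 min (target: <15)
```

Track the max as well as the mean — one 45-minute silence does more trust damage than a slightly elevated average.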
Pointers
- [Official] Statuspage uptime metric
[!CAUTION] Gotchas
- Status pages without operating discipline become decorative. The page lists 'Operational' green dots while customers report outages on Twitter — instant trust collapse.
- Subscribers churn when the page is too noisy (component flapping, false alerts) or too quiet (real outages with no updates). Tune the probe thresholds + the IC discipline together.
- Annual 'year in review' posts are surprisingly high-leverage marketing. Cloudflare's quarterly reports + Stripe's reliability posts both convert procurement.
Hand the template to your agent
Paste the prompt below into your agent's permanent system prompt so the agent reads, writes, and maintains this workspace as you work through the steps.
You are an agent on the "Set up a status page for your SaaS" playbook workspace at your-org/set-up-status-page-for-saas.
Your role: maintain the four surfaces (Steps, Components, Brief, Subscribers) as the team sets up + operates the status page.
Cadence:
- When the user marks a step Done, append a line to the Brief.
- When a new customer-facing component ships in the product, prompt the user to add it to Components + the status page itself.
- When an incident is declared, draft the first customer-facing status update from the IC's internal Slack thread.
- Weekly: audit Subscribers — who's subscribed but hasn't received an email in 90 days (drift signal)?
First MCP tool calls:
1. list_surfaces(workspace_slug="set-up-status-page-for-saas")
2. list_rows(workspace_slug="set-up-status-page-for-saas", surface_slug="components")
3. get_doc(workspace_slug="set-up-status-page-for-saas", surface_slug="brief")
Do NOT publish a status update without the IC's approval. Customer-facing comms during an incident are a trust-or-burn moment; agents draft, humans approve.
FAQ
Do I really need a status page if I'm pre-launch?
Probably not. Status pages serve customers who are paying you and need to plan around outages. Pre-launch, you have no one to communicate to. Set it up the week you take your first paying customer; from then on it's a trust artifact your sales team can point to.
What's the right number of status page components?
5-8 customer-visible components is the sweet spot. Below 5 looks unprofessional; above 10 becomes a wall of mostly-green that nobody reads. Group internal services into customer-facing components — 'API,' 'Web app,' 'Webhooks' is canonical. Don't expose Kafka cluster 3.
How fast should the first status update ship after an incident?
15 min from incident declaration is the industry target. Faster looks responsive; slower looks asleep. Cloudflare and Stripe routinely hit 5-10 min for major incidents. The trick is making the first update non-binding ('We're investigating reports of [X]; next update in 30 min') so you don't wait for a clean diagnosis.
Should the status page be on the same domain as my product?
On a subdomain (status.yourdomain.com via CNAME to your status page provider) — yes. On the same INFRA as your product — absolutely not. The whole point is that when your product is down, the status page is up. Hosting both on the same AWS account is the canonical own-goal.
Can my AI agents help operate the status page?
Yes. Agents are particularly useful for: drafting the first customer-facing status update from the IC's internal Slack thread, drafting the customer email post-incident, drafting quarterly retrospective posts from the incident log, auditing components against the actual product surface for drift. The playbook ships agent prompts inline.
What's the most common status page mistake?
Three failures dominate: (1) Hosting the status page on the same infra as the product (so when the product is down the page is too). (2) Vague status updates that burn customer trust ('investigating an issue' for 4 hours). (3) Auto-probes alone with no IC override, so the page locks into a wrong state (e.g. probe says 'green' while customers report failures).