
Set up observability: logs, metrics, traces

A 9-step playbook for the three pillars of observability (logs, metrics, traces), scoped so a 1-5 person team can adopt it in a week.


Observability used to mean enterprise vendors and an SRE team. The OpenTelemetry standard plus 2024-era tooling makes it cheap and tractable for a small team. This playbook walks through the 9 steps to wire up structured logging, the four golden signals as metrics, distributed tracing with OTel, an SLO the team actually believes in, and alerts that don't page on noise. End state: when production breaks, you see why in 10 seconds, not 2 hours.

Outcome

A production app where every request is traceable end-to-end, the four golden signals are dashboarded, on-call gets paged only on real customer-impacting issues, and a 'why is the app slow?' question gets a definitive answer in 10 seconds.

Time: 1 week · Difficulty: intermediate · For: Small engineering teams with a production app and no SRE.
The template · 9 steps

Top to bottom. Each step has tasks, pointers, gotchas.

Pick a vendor before you instrument anything

2-3 hr of research

Picking the backend first prevents the worst pattern: instrument everything, then realize the vendor's pricing model penalizes the cardinality you committed to. Logs go in Loki / Datadog Logs / Sentry. Metrics go in Prometheus / Datadog. Traces go in Tempo / Honeycomb / Datadog APM. Pick a stack that covers all three; resist the urge to mix-and-match across vendors on day one.

Tasks
  • List your top 3 constraints (monthly spend, team size, on-call sophistication)
  • Compare three options: Grafana Cloud (cheapest), Honeycomb (best for traces), Datadog (best UX, priciest)
  • Estimate volume: log events per day, metrics cardinality, trace spans per second (see the worked example below)
  • Sign up for the free tier of your pick and start there
Gotchas
  • Datadog's bill explodes with custom metrics cardinality. A single tag with 10,000 unique values can cost more than your hosting.
  • Sentry is great for errors but a poor fit for general logging. Don't pipe all your logs into Sentry; you'll exhaust the free tier in a week.
  • Vendor switching is painful once you have dashboards, alerts, and runbooks. Pick deliberately, not by default.
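Worked example for this step
A back-of-envelope volume estimate, as referenced in the tasks above; a minimal Python sketch where every input number is a made-up placeholder - swap in your own traffic figures.

# All inputs are hypothetical placeholders.
requests_per_second = 50
log_lines_per_request = 3
log_events_per_day = requests_per_second * log_lines_per_request * 86_400
print(f"log events/day: {log_events_per_day:,}")  # 12,960,000

# Metric cardinality = product of unique label values per metric.
routes, status_codes, histogram_buckets = 40, 6, 12
print(f"series for one duration histogram: {routes * status_codes * histogram_buckets}")  # 2,880

# Trace volume = spans per request x requests per second x sample rate.
spans_per_request, sample_rate = 15, 1.0
print(f"spans/s: {requests_per_second * spans_per_request * sample_rate:.0f}")  # 750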

Switch to structured logging (JSON, with a schema)

Half a day

Plain text logs are uncorrelatable. Structured JSON logs with a known schema are queryable, alertable, and aggregatable. Adopt a logging library that emits JSON by default, define 5-10 standard fields (timestamp, level, request_id, user_id, route, latency_ms), and prohibit free-form messages from carrying load-bearing data.

Tasks
  • Pick a structured-logging lib for your language (pino in Node, zap in Go, structlog in Python)
  • Define a JSON schema with 5-10 standard fields every log line carries
  • Add request_id as a header that propagates through every middleware and downstream call
  • Replace string-template log lines with structured fields ('user logged in' becomes `{ event: 'login', user_id, method }`)
  • Verify a sample of logs in the backend and confirm they're queryable by user_id
Gotchas
  • Don't log full request bodies. Production logs leak PII into your vendor and your retention policy. Log the request shape, not the payload.
  • Free-form string logs are an anti-pattern at scale. The day you need to query 'all 500s on /checkout in the last hour by user', regex over strings is a 3-hour grep; structured fields are a 30-second query.
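Example for this step
A minimal sketch of the schema with structlog in Python (one of the libraries named above); the field values and the contextvars-based request_id propagation are illustrative - adapt to your framework's middleware.

import structlog

# Emit JSON with ISO timestamps and the log level on every line.
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,    # pulls in request-scoped fields
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)
log = structlog.get_logger()

# In request middleware: bind the per-request fields once so every log line carries them.
structlog.contextvars.bind_contextvars(request_id="req-123", user_id="u-42", route="/checkout")

# 'user logged in' as a structured event rather than a string template.
log.info("login", method="password", latency_ms=87)
# -> {"event": "login", "method": "password", "latency_ms": 87, "request_id": "req-123",
#     "user_id": "u-42", "route": "/checkout", "level": "info", "timestamp": "..."}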
Agent prompt for this step
Read the codebase. Find every log call (console.log, logger.info, log.Info, etc.).

For each, propose a refactor that:
1. Uses the team's structured-logging library (ask the user which one).
2. Carries the standard fields (timestamp, level, request_id, user_id, route, event_name).
3. Replaces string templates with structured fields ("user logged in" becomes event: "login" + user_id).

Output as a list of file:line references with proposed replacements. Flag any log that contains PII (email, full name, IP) for review under your data classification.

Capture the four golden signals as metrics

Half a day

Latency, traffic, errors, saturation. Every service needs them. Most observability tools give them to you for free if you instrument correctly: latency from request duration histograms, traffic from request counters, errors from status code counters, saturation from CPU / memory / queue depth. Build a dashboard with these four panels per service and call it the home base.

Tasks
  • Add a request-duration histogram (p50, p95, p99) keyed by route + status_code
  • Add a request-count counter keyed by route + status_code
  • Add an error-count counter for 5xx responses
  • Add saturation metrics: CPU, memory, DB connection pool usage (vs. its max), queue depth (vs. its max)
  • Build a 4-panel dashboard per service: Latency, Traffic, Errors, Saturation
  • Link the dashboard URL in your team docs
Gotchas
  • Latency p99 over a 1-minute window is noisy with low traffic. Use 5-min windows for p99 if your service does under 100 req/s.
  • Don't use averages for latency. Average latency hides the long tail. Always p50 / p95 / p99.
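Example for this step
A sketch of the latency, traffic, and error instruments with the Prometheus Python client; the metric names, buckets, and after-request hook are illustrative assumptions. Saturation (CPU, memory) usually comes from the node or container exporter rather than app code.

from prometheus_client import Counter, Histogram

# Latency: request-duration histogram keyed by route and status code (p50/p95/p99 come from queries).
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds", "HTTP request duration",
    ["route", "status_code"],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
)
# Traffic: request counter with the same labels.
REQUEST_COUNT = Counter("http_requests_total", "HTTP requests", ["route", "status_code"])
# Errors: 5xx counter (can also be derived from REQUEST_COUNT in the dashboard query).
ERROR_COUNT = Counter("http_errors_total", "HTTP 5xx responses", ["route"])

def record_request(route: str, status_code: int, duration_s: float) -> None:
    # Call from your framework's after-request hook.
    REQUEST_DURATION.labels(route, str(status_code)).observe(duration_s)
    REQUEST_COUNT.labels(route, str(status_code)).inc()
    if status_code >= 500:
        ERROR_COUNT.labels(route).inc()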

Wire up distributed tracing with OpenTelemetry

1 day

Tracing is the single biggest investigative tool you can add. A trace shows the path of one request through every service it touches, with timing per hop. OpenTelemetry's auto-instrumentation libraries cover most popular frameworks; you usually only need to add 5-10 manual spans for the important business logic.

Tasks
  • Install the OpenTelemetry SDK + exporter for your language
  • Enable auto-instrumentation for HTTP server, HTTP client, DB driver
  • Add manual spans around the 5 most important business operations (checkout, signup, cron job)
  • Configure the exporter to send to your chosen backend (OTLP endpoint)
  • Verify a request shows up as a complete trace in the backend, with all hops
Gotchas
  • Sampling defaults shipped by vendor distros and collector configs are often too low to be useful. For a small app, sample 100% of traces. Drop to head-based sampling at 1% only when volume justifies it.
  • Manual spans without `try/finally` (or context manager) leak when an exception fires. Use the SDK's recommended pattern for your language.
  • Trace context propagation breaks across queue boundaries (SQS, Kafka, Redis pub/sub) unless you serialize the trace headers in the message envelope.
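Example for this step
A minimal OpenTelemetry Python sketch: OTLP export plus one manual span, using the context-manager pattern from the gotcha above so the span ends even when the body raises. Assumes the standard opentelemetry-sdk and opentelemetry-exporter-otlp-proto-http packages; the span name and attributes are illustrative.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# One-time setup: export spans over OTLP/HTTP to the backend you picked in step 1.
# The endpoint is read from OTEL_EXPORTER_OTLP_ENDPOINT.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout")

def charge_card(order_id: str, amount_cents: int) -> None:
    # start_as_current_span is a context manager: the span is ended and the
    # exception recorded even if the body raises.
    with tracer.start_as_current_span("checkout.charge_card") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.amount_cents", amount_cents)
        ...  # call the payment provider here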
Agent prompt for this step
Read the codebase and propose 10 manual span instrumentations.

For each:
1. The function or block to wrap.
2. The span name (use convention: "service.operation", e.g. "checkout.charge_card").
3. The attributes to set on the span (user_id, order_id, amount, etc.).

Prioritize: payment flows, signup, third-party API calls, slow DB queries, background jobs. Skip pure-CPU functions.

Output as a list with file:line references and a code snippet showing the instrumentation in the team's language.

Define one or two SLOs the team actually believes in

Half a day to define, ongoing to refine

An SLO is a target for how often the service does the right thing. The classic shape: '99.9% of /checkout requests complete in under 500ms over a rolling 30 days.' That gives you an error budget (0.1% over 30 days = ~43 minutes). When you burn through the budget, you slow feature velocity. When you have lots of budget, you ship aggressively. SLOs only work if the team treats them as real.

Tasks
  • Pick the 1-2 user-facing flows that matter most (signup, checkout, primary product action)
  • For each, write the SLO: 'X% of [request type] complete in under [latency] / without error, over [window]'
  • Calculate the error budget (1 - target = budget; over the window = absolute minutes/requests)
  • Set up the metric query that measures it; verify the historical baseline meets the target
  • Decide the policy: what do we DO when we burn budget? (slow features, freeze deploys, etc.)
Gotchas
  • Don't pick 99.99% just because it sounds good. A 99.99% SLO over 30 days is 4 minutes of error budget. Most small teams can't operate at that level without dedicated SRE.
  • An SLO without a written 'what we do when we burn budget' policy is decoration. Write the policy.
  • SLOs measured against availability without latency miss the silent slowdown. Always pair availability with latency.
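Worked example for this step
The budget arithmetic from the example SLO above, as a tiny Python sketch; the 2 req/s traffic figure is a placeholder.

# "99.9% of /checkout requests complete in under 500ms over a rolling 30 days"
target, window_days = 0.999, 30

budget_fraction = 1 - target                                # 0.1% of requests may miss
budget_minutes = budget_fraction * window_days * 24 * 60
print(f"error budget: {budget_minutes:.0f} min / {window_days} days")        # ~43 minutes

# The same budget in request terms, assuming ~2 req/s on the flow:
requests_in_window = 2 * 86_400 * window_days
print(f"allowed bad requests: {budget_fraction * requests_in_window:,.0f}")  # 5,184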

Set up alerts on burn rate, not on threshold

2-3 hr

Threshold alerts ('latency over 500ms') page constantly on noise. Burn-rate alerts ('we'll exhaust this month's error budget in 6 hours at the current rate') page when something is actually breaking. Multi-window burn rate (fast: 1 hour, slow: 6 hours) catches both fast outages and slow degradations.

Tasks
  • For each SLO, configure a fast-burn alert (5% of monthly budget in 1 hour) - pages on-call immediately
  • Configure a slow-burn alert (10% of monthly budget in 6 hours) - opens a ticket, no page
  • Route alerts to a single on-call rotation (PagerDuty / Opsgenie / on-call-bot in Slack)
  • Run a synthetic alert drill: trigger a fake burn, verify the page reaches the on-call
  • Document the response runbook for each alert
Gotchas
  • Alerts on individual host CPU, individual instance latency, etc. are an anti-pattern in 2024. Alert on user-facing SLOs, investigate via dashboards.
  • If your on-call is woken up more than once a quarter for something that didn't actually impact users, your alerts are wrong. Tune them down.
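Worked example for this step
How the fast- and slow-burn alerts above translate into burn-rate thresholds - a small Python sketch of the arithmetic; the exact alert query syntax depends on your backend.

# burn rate = (observed error rate over the window) / (1 - SLO target).
# "X% of the monthly budget in W hours" becomes a threshold on that ratio.
def burn_rate_threshold(budget_fraction_consumed: float, window_hours: float,
                        period_hours: float = 30 * 24) -> float:
    return budget_fraction_consumed * period_hours / window_hours

fast = burn_rate_threshold(0.05, 1)   # 36.0 -> page: 5% of the budget gone in 1 hour
slow = burn_rate_threshold(0.10, 6)   # 12.0 -> ticket: 10% of the budget gone in 6 hours

# The alert fires when the measured ratio crosses the threshold, e.g. (pseudocode):
#   error_rate_over_1h / (1 - 0.999) > 36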

Build the four runbooks: latency, errors, saturation, third-party down

Half a day

Runbooks turn a 2 AM page into a 15-minute fix instead of a 2-hour panic. Write four for the most common failure modes: high latency, error rate spike, saturation (out-of-memory / connection pool exhausted), third-party dependency down. Each is a numbered list of 'first 5 minutes' steps.

Tasks
  • Latency runbook: which dashboard to open, which trace to look at, common causes (cold start, slow DB query, third-party slowdown)
  • Errors runbook: which logs to query, how to find the deploying commit, the rollback procedure
  • Saturation runbook: which metric to check (memory, DB connections, queue depth), how to scale, how to drain a stuck queue
  • Third-party-down runbook: which third party (auth provider, payment processor, email provider), the status page URL, the in-app degradation behavior
  • Link each runbook from the matching alert
Gotchas
  • Runbooks rot when no one runs them. Run a tabletop drill quarterly: pick a runbook, execute every step, find the broken link or stale screenshot, fix it.
  • Runbooks that just say 'investigate' aren't runbooks. They need specific dashboards, specific commands, specific decision trees.

Run a postmortem on the next real incident

2-4 hr per incident

Postmortems compound. The first one feels like overhead; the tenth one is your team's most valuable institutional document. Write the postmortem within 48 hours of the incident, blameless tone, focused on the system gaps not the individual mistakes.

Tasks
  • Within 48 hr: write the postmortem in a shared doc, blameless tone
  • Sections: timeline, impact, root cause, contributing factors, what worked, what didn't, action items
  • For each action item: name a single owner and a due date
  • Review the postmortem in a 30-min team meeting; capture follow-up questions
  • Track action items in your team's tracker until done
Gotchas
  • Blameful postmortems destroy team trust. A human error that the system allowed through is a system bug, not a person bug.
  • Action items without owners or dates are decorative. Track them like P0 work; close them or formally drop them.

Review the dashboard weekly; iterate on signal vs noise

30 min/week ongoing

Observability is not done; it drifts. Set a weekly 30-min slot to walk through the dashboards, check SLO health, prune alerts that paged but weren't real, and add metrics for the new things you shipped. Twelve months in, the team that does this is 10x better at responding to incidents than the team that doesn't.

Tasks
  • Pick a weekly recurring slot (Monday morning works for most teams)
  • Review SLO compliance over the last 7 days; flag any near burn
  • Review every page from the last 7 days; ask 'was this real?'
  • For each false page, tune the alert (raise threshold, change window, mute by tag)
  • For each new feature shipped, add the metric or trace span that lets you debug it
Gotchas
  • Most teams skip the weekly review for 'busier' work. Six months later they have 200 alerts firing weekly and on-call is hated. Don't skip it.
  • If you can't reproduce the metric query for an alert by hand, the alert is not maintainable. Document the query in the runbook.
Hand the template to your agent

Workspace-wide agent prompt.

Paste this into your agent's permanent system prompt so the agent reads, writes, and maintains the template's surfaces as you work through the steps.

Agent system prompt
You are an agent on the "Set up observability" playbook workspace.

Your role: maintain the four surfaces (Steps, Pointers, Brief, SLO log) as the team rolls out observability.

Cadence:
- When a step is marked Done, append to the Brief doc what shipped (schema, dashboard URL, runbook link).
- When the user defines an SLO, capture it as a row in SLO log with target / window / error budget / alert link.
- When an incident happens, link the postmortem to the relevant SLO row in SLO log.

First MCP tool calls:
1. list_surfaces(workspace_slug="set-up-observability")
2. list_rows(workspace_slug="set-up-observability", surface_slug="steps")
3. get_doc(workspace_slug="set-up-observability", surface_slug="brief")

When proposing instrumentation, always read the user's codebase first before suggesting span names - they should match real code paths.
FAQ

Common questions on this template.

Do I need all three pillars (logs, metrics, traces) on day one?
Structured logs and the four golden signals as metrics, yes. Traces can wait 1-2 sprints if you have a small monolith - logs cover most of what traces would tell you. Once you have multiple services or async work (queues, cron, background jobs), traces become essential because logs alone can't reconstruct the request path.
What does observability cost for a small team?
Grafana Cloud's free tier covers most teams under ~10 services: 10k metrics, 50GB logs, 50GB traces per month. Sentry's free Developer tier handles 5k errors. OpenTelemetry itself is free open source. The first paid step is usually $19-50/mo when you outgrow the free tier - still cheaper than a single hour of debugging a production outage blind.
Why OpenTelemetry instead of a vendor SDK?
Vendor lock-in. OpenTelemetry is the vendor-neutral CNCF standard for instrumentation (trace propagation follows the W3C Trace Context spec); switching backends becomes a config change, not a rewrite. Most major vendors (Honeycomb, Datadog, Grafana, AWS X-Ray) accept OTLP natively. Vendor SDKs are sometimes more polished but tie you to one provider's roadmap and pricing.
How many alerts should we have?
Far fewer than most teams have. A small team should have 5-10 alerts max: one or two SLO burn-rate alerts per critical user flow, plus saturation alerts for the few resources that can hard-cap the service (memory, DB connections). If you have 50 alerts firing weekly, you have 50 alerts your team has learned to ignore.
Can my AI agents help maintain the observability stack?
Yes. Agents are useful for: drafting the structured-logging schema from existing log calls, proposing trace spans by reading the codebase, refreshing dashboards when new metrics are added, summarizing weekly SLO health, and triaging which incidents deserve a full postmortem vs a one-line note. The playbook ships agent prompts inline for the logging refactor and trace instrumentation steps.

Open this template as a workspace.

We mint a fresh copy in your org with the steps as table rows, the pointers as a separate table, and the brief as a doc. Bring your agents, start checking off boxes.