
Set up observability: logs, metrics, traces

A 9-step playbook for the three pillars of observability (logs, metrics, traces), scoped so a 1-5 person team can adopt it in a week.


Observability used to mean enterprise vendors and an SRE team. The OpenTelemetry standard plus 2024-era tooling makes it cheap and tractable for a small team. This playbook walks through the 9 steps to wire up structured logging, the four golden signals as metrics, distributed tracing with OTel, an SLO the team actually believes in, and alerts that don't page on noise. End state: when production breaks, you see why in 10 seconds, not 2 hours.

Outcome

A production app where every request is traceable end-to-end, the four golden signals are dashboarded, on-call gets paged only on real customer-impacting issues, and a 'why is the app slow?' question gets a definitive answer in 10 seconds.

Time: 1 week · Difficulty: intermediate · For: Small engineering teams with a production app and no SRE.
The template · 9 steps

Top to bottom. Each step has tasks, pointers, gotchas.

Pick a vendor before you instrument anything

2-3 hr of research

Picking the backend first prevents the worst pattern: instrument everything, then realize the vendor's pricing model penalizes the cardinality you committed to. Logs go in Loki / Datadog Logs / Sentry. Metrics go in Prometheus / Datadog. Traces go in Tempo / Honeycomb / Datadog APM. Pick a stack that covers all three; resist the urge to mix-and-match across vendors on day one.

Tasks
  • List your top 3 constraints (monthly spend, team size, on-call sophistication)
  • Compare three options: Grafana Cloud (cheapest), Honeycomb (best for traces), Datadog (best UX, priciest)
  • Estimate volume: log events per day, metrics cardinality, trace spans per second (see the worked example below)
  • Sign up for the free tier of your pick and start there
Gotchas
  • Datadog's bill explodes with custom metrics cardinality. A single tag with 10,000 unique values can cost more than your hosting.
  • Sentry is great for errors but a poor fit for general logging. Don't pipe all your logs into Sentry; you'll exhaust the free tier in a week.
  • Vendor switching is painful once you have dashboards, alerts, and runbooks. Pick deliberately, not by default.
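Worked example for this step
A back-of-envelope volume estimate, as referenced in the tasks above; a minimal Python sketch where every input number is a made-up placeholder - swap in your own traffic figures.

# All inputs are hypothetical placeholders.
requests_per_second = 50
log_lines_per_request = 3
log_events_per_day = requests_per_second * log_lines_per_request * 86_400
print(f"log events/day: {log_events_per_day:,}")  # 12,960,000

# Metric cardinality = product of unique label values per metric.
routes, status_codes, histogram_buckets = 40, 6, 12
print(f"series for one duration histogram: {routes * status_codes * histogram_buckets}")  # 2,880

# Trace volume = spans per request x requests per second x sample rate.
spans_per_request, sample_rate = 15, 1.0
print(f"spans/s: {requests_per_second * spans_per_request * sample_rate:.0f}")  # 750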

Switch to structured logging (JSON, with a schema)

Half a day

Plain text logs are uncorrelatable. Structured JSON logs with a known schema are queryable, alertable, and aggregatable. Adopt a logging library that emits JSON by default, define 5-10 standard fields (timestamp, level, request_id, user_id, route, latency_ms), and prohibit free-form messages from carrying load-bearing data.

Tasks
  • Pick a structured-logging lib for your language (pino in Node, zap in Go, structlog in Python)
  • Define a JSON schema with 5-10 standard fields every log line carries
  • Add request_id as a header that propagates through every middleware and downstream call
  • Replace string-template log lines with structured fields ('user logged in' becomes `{ event: 'login', user_id, method }`)
  • Verify a sample of logs in the backend and confirm they're queryable by user_id
Gotchas
  • Don't log full request bodies. Production logs leak PII into your vendor and your retention policy. Log the request shape, not the payload.
  • Free-form string logs are an anti-pattern at scale. The day you need to query 'all 500s on /checkout in the last hour by user', regex over strings is a 3-hour grep; structured fields are a 30-second query.
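Example for this step
A minimal sketch of the schema with structlog in Python (one of the libraries named above); the field values and the contextvars-based request_id propagation are illustrative - adapt to your framework's middleware.

import structlog

# Emit JSON with ISO timestamps and the log level on every line.
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,    # pulls in request-scoped fields
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)
log = structlog.get_logger()

# In request middleware: bind the per-request fields once so every log line carries them.
structlog.contextvars.bind_contextvars(request_id="req-123", user_id="u-42", route="/checkout")

# 'user logged in' as a structured event rather than a string template.
log.info("login", method="password", latency_ms=87)
# -> {"event": "login", "method": "password", "latency_ms": 87, "request_id": "req-123",
#     "user_id": "u-42", "route": "/checkout", "level": "info", "timestamp": "..."}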
Agent prompt for this step
Read the codebase. Find every log call (console.log, logger.info, log.Info, etc.).

For each, propose a refactor that:
1. Uses the team's structured-logging library (ask the user which one).
2. Carries the standard fields (timestamp, level, request_id, user_id, route, event_name).
3. Replaces string templates with structured fields ("user logged in" becomes event: "login" + user_id).

Output as a list of file:line references with proposed replacements. Flag any log that contains PII (email, full name, IP) for review under your data classification.

Capture the four golden signals as metrics

Half a day

Latency, traffic, errors, saturation. Every service needs them. Most observability tools give them to you for free if you instrument correctly: latency from request duration histograms, traffic from request counters, errors from status code counters, saturation from CPU / memory / queue depth. Build a dashboard with these four panels per service and call it the home base.

Tasks
  • Add a request-duration histogram (p50, p95, p99) keyed by route + status_code
  • Add a request-count counter keyed by route + status_code
  • Add an error-count counter for 5xx responses
  • Add saturation metrics: CPU, memory, DB connection pool usage (vs. its max), queue depth (vs. its max)
  • Build a 4-panel dashboard per service: Latency, Traffic, Errors, Saturation
  • Link the dashboard URL in your team docs
Gotchas
  • Latency p99 over a 1-minute window is noisy with low traffic. Use 5-min windows for p99 if your service does under 100 req/s.
  • Don't use averages for latency. Average latency hides the long tail. Always p50 / p95 / p99.
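Example for this step
A sketch of the latency, traffic, and error instruments with the Prometheus Python client; the metric names, buckets, and after-request hook are illustrative assumptions. Saturation (CPU, memory) usually comes from the node or container exporter rather than app code.

from prometheus_client import Counter, Histogram

# Latency: request-duration histogram keyed by route and status code (p50/p95/p99 come from queries).
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds", "HTTP request duration",
    ["route", "status_code"],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
)
# Traffic: request counter with the same labels.
REQUEST_COUNT = Counter("http_requests_total", "HTTP requests", ["route", "status_code"])
# Errors: 5xx counter (can also be derived from REQUEST_COUNT in the dashboard query).
ERROR_COUNT = Counter("http_errors_total", "HTTP 5xx responses", ["route"])

def record_request(route: str, status_code: int, duration_s: float) -> None:
    # Call from your framework's after-request hook.
    REQUEST_DURATION.labels(route, str(status_code)).observe(duration_s)
    REQUEST_COUNT.labels(route, str(status_code)).inc()
    if status_code >= 500:
        ERROR_COUNT.labels(route).inc()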

Wire up distributed tracing with OpenTelemetry

1 day

Tracing is the single biggest investigative tool you can add. A trace shows the path of one request through every service it touches, with timing per hop. OpenTelemetry's auto-instrumentation libraries cover most popular frameworks; you usually only need to add 5-10 manual spans for the important business logic.

Tasks
  • Install the OpenTelemetry SDK + exporter for your language
  • Enable auto-instrumentation for HTTP server, HTTP client, DB driver
  • Add manual spans around the 5 most important business operations (checkout, signup, cron job)
  • Configure the exporter to send to your chosen backend (OTLP endpoint)
  • Verify a request shows up as a complete trace in the backend, with all hops
Gotchas
  • Sampling defaults shipped by vendor distros and collector configs are often too low to be useful. For a small app, sample 100% of traces. Drop to head-based sampling at 1% only when volume justifies it.
  • Manual spans without `try/finally` (or context manager) leak when an exception fires. Use the SDK's recommended pattern for your language.
  • Trace context propagation breaks across queue boundaries (SQS, Kafka, Redis pub/sub) unless you serialize the trace headers in the message envelope.
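Example for this step
A minimal OpenTelemetry Python sketch: OTLP export plus one manual span, using the context-manager pattern from the gotcha above so the span ends even when the body raises. Assumes the standard opentelemetry-sdk and opentelemetry-exporter-otlp-proto-http packages; the span name and attributes are illustrative.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# One-time setup: export spans over OTLP/HTTP to the backend you picked in step 1.
# The endpoint is read from OTEL_EXPORTER_OTLP_ENDPOINT.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout")

def charge_card(order_id: str, amount_cents: int) -> None:
    # start_as_current_span is a context manager: the span is ended and the
    # exception recorded even if the body raises.
    with tracer.start_as_current_span("checkout.charge_card") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.amount_cents", amount_cents)
        ...  # call the payment provider here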
Agent prompt for this step
Read the codebase and propose 10 manual span instrumentations.

For each:
1. The function or block to wrap.
2. The span name (use convention: "service.operation", e.g. "checkout.charge_card").
3. The attributes to set on the span (user_id, order_id, amount, etc.).

Prioritize: payment flows, signup, third-party API calls, slow DB queries, background jobs. Skip pure-CPU functions.

Output as a list with file:line references and a code snippet showing the instrumentation in the team's language.

Define one or two SLOs the team actually believes in

Half a day to define, ongoing to refine

An SLO is a target for how often the service does the right thing. The classic shape: '99.9% of /checkout requests complete in under 500ms over a rolling 30 days.' That gives you an error budget (0.1% over 30 days = ~43 minutes). When you burn through the budget, you slow feature velocity. When you have lots of budget, you ship aggressively. SLOs only work if the team treats them as real.

Tasks
  • Pick the 1-2 user-facing flows that matter most (signup, checkout, primary product action)
  • For each, write the SLO: 'X% of [request type] complete in under [latency] / without error, over [window]'
  • Calculate the error budget (1 - target = budget; over the window = absolute minutes/requests)
  • Set up the metric query that measures it; verify the historical baseline meets the target
  • Decide the policy: what do we DO when we burn budget? (slow features, freeze deploys, etc.)
Gotchas
  • Don't pick 99.99% just because it sounds good. A 99.99% SLO over 30 days is 4 minutes of error budget. Most small teams can't operate at that level without dedicated SRE.
  • An SLO without a written 'what we do when we burn budget' policy is decoration. Write the policy.
  • SLOs measured against availability without latency miss the silent slowdown. Always pair availability with latency.
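Worked example for this step
The budget arithmetic from the example SLO above, as a tiny Python sketch; the 2 req/s traffic figure is a placeholder.

# "99.9% of /checkout requests complete in under 500ms over a rolling 30 days"
target, window_days = 0.999, 30

budget_fraction = 1 - target                                # 0.1% of requests may miss
budget_minutes = budget_fraction * window_days * 24 * 60
print(f"error budget: {budget_minutes:.0f} min / {window_days} days")        # ~43 minutes

# The same budget in request terms, assuming ~2 req/s on the flow:
requests_in_window = 2 * 86_400 * window_days
print(f"allowed bad requests: {budget_fraction * requests_in_window:,.0f}")  # 5,184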

Set up alerts on burn rate, not on threshold

2-3 hr

Threshold alerts ('latency over 500ms') page constantly on noise. Burn-rate alerts ('we'll exhaust this month's error budget in 6 hours at the current rate') page when something is actually breaking. Multi-window burn rate (fast: 1 hour, slow: 6 hours) catches both fast outages and slow degradations.

Tasks
  • For each SLO, configure a fast-burn alert (5% of monthly budget in 1 hour) - pages on-call immediately
  • Configure a slow-burn alert (10% of monthly budget in 6 hours) - opens a ticket, no page
  • Route alerts to a single on-call rotation (PagerDuty / Opsgenie / on-call-bot in Slack)
  • Run a synthetic alert drill: trigger a fake burn, verify the page reaches the on-call
  • Document the response runbook for each alert
Gotchas
  • Alerts on individual host CPU, individual instance latency, etc. are an anti-pattern in 2024. Alert on user-facing SLOs, investigate via dashboards.
  • If your on-call is woken up more than once a quarter for something that didn't actually impact users, your alerts are wrong. Tune them down.
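Worked example for this step
How the fast- and slow-burn alerts above translate into burn-rate thresholds - a small Python sketch of the arithmetic; the exact alert query syntax depends on your backend.

# burn rate = (observed error rate over the window) / (1 - SLO target).
# "X% of the monthly budget in W hours" becomes a threshold on that ratio.
def burn_rate_threshold(budget_fraction_consumed: float, window_hours: float,
                        period_hours: float = 30 * 24) -> float:
    return budget_fraction_consumed * period_hours / window_hours

fast = burn_rate_threshold(0.05, 1)   # 36.0 -> page: 5% of the budget gone in 1 hour
slow = burn_rate_threshold(0.10, 6)   # 12.0 -> ticket: 10% of the budget gone in 6 hours

# The alert fires when the measured ratio crosses the threshold, e.g. (pseudocode):
#   error_rate_over_1h / (1 - 0.999) > 36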

Build the four runbooks: latency, errors, saturation, third-party down

Half a day

Runbooks turn a 2 AM page into a 15-minute fix instead of a 2-hour panic. Write four for the most common failure modes: high latency, error rate spike, saturation (out-of-memory / connection pool exhausted), third-party dependency down. Each is a numbered list of 'first 5 minutes' steps.

Tasks
  • Latency runbook: which dashboard to open, which trace to look at, common causes (cold start, slow DB query, third-party slowdown)
  • Errors runbook: which logs to query, how to find the deploying commit, the rollback procedure
  • Saturation runbook: which metric to check (memory, DB connections, queue depth), how to scale, how to drain a stuck queue
  • Third-party-down runbook: which third party (auth provider, payment processor, email provider), the status page URL, the in-app degradation behavior
  • Link each runbook from the matching alert
Gotchas
  • Runbooks rot when no one runs them. Run a tabletop drill quarterly: pick a runbook, execute every step, find the broken link or stale screenshot, fix it.
  • Runbooks that just say 'investigate' aren't runbooks. They need specific dashboards, specific commands, specific decision trees.

Run a postmortem on the next real incident

2-4 hr per incident

Postmortems compound. The first one feels like overhead; the tenth one is your team's most valuable institutional document. Write the postmortem within 48 hours of the incident, blameless tone, focused on the system gaps not the individual mistakes.

Tasks
  • Within 48 hr: write the postmortem in a shared doc, blameless tone
  • Sections: timeline, impact, root cause, contributing factors, what worked, what didn't, action items
  • For each action item: name a single owner and a due date
  • Review the postmortem in a 30-min team meeting; capture follow-up questions
  • Track action items in your team's tracker until done
Gotchas
  • Blameful postmortems destroy team trust. A human error that the system allowed through is a system bug, not a person bug.
  • Action items without owners or dates are decorative. Track them like P0 work; close them or formally drop them.

Review the dashboard weekly; iterate on signal vs noise

30 min/week ongoing

Observability is not done; it drifts. Set a weekly 30-min slot to walk through the dashboards, check SLO health, prune alerts that paged but weren't real, and add metrics for the new things you shipped. Twelve months in, the team that does this is 10x better at responding to incidents than the team that doesn't.

Tasks
  • Pick a weekly recurring slot (Monday morning works for most teams)
  • Review SLO compliance over the last 7 days; flag any near burn
  • Review every page from the last 7 days; ask 'was this real?'
  • For each false page, tune the alert (raise threshold, change window, mute by tag)
  • For each new feature shipped, add the metric or trace span that lets you debug it
Gotchas
  • Most teams skip the weekly review for 'busier' work. Six months later they have 200 alerts firing weekly and on-call is hated. Don't skip it.
  • If you can't reproduce the metric query for an alert by hand, the alert is not maintainable. Document the query in the runbook.
Hand the template to your agent

Workspace-wide agent prompt.

Paste this into your agent's permanent system prompt so the agent reads, writes, and maintains the template's surfaces as you work through the steps.

Agent system prompt
You are an agent on the "Set up observability" playbook workspace.

Your role: maintain the four surfaces (Steps, Pointers, Brief, SLO log) as the team rolls out observability.

Cadence:
- When a step is marked Done, append to the Brief doc what shipped (schema, dashboard URL, runbook link).
- When the user defines an SLO, capture it as a row in SLO log with target / window / error budget / alert link.
- When an incident happens, link the postmortem to the relevant SLO row in SLO log.

First MCP tool calls:
1. list_surfaces(workspace_slug="set-up-observability")
2. list_rows(workspace_slug="set-up-observability", surface_slug="steps")
3. get_doc(workspace_slug="set-up-observability", surface_slug="brief")

When proposing instrumentation, always read the user's codebase first before suggesting span names - they should match real code paths.
FAQ

Common questions on this template.

Do I need all three pillars (logs, metrics, traces) on day one?
Structured logs and the four golden signals as metrics, yes. Traces can wait 1-2 sprints if you have a small monolith - logs cover most of what traces would tell you. Once you have multiple services or async work (queues, cron, background jobs), traces become essential because logs alone can't reconstruct the request path.
What does observability cost for a small team?
Grafana Cloud's free tier covers most teams under ~10 services: 10k metrics, 50GB logs, 50GB traces per month. Sentry's free Developer tier handles 5k errors. OpenTelemetry itself is free open source. The first paid step is usually $19-50/mo when you outgrow the free tier - still cheaper than a single hour of debugging a production outage blind.
Why OpenTelemetry instead of a vendor SDK?
Vendor lock-in. OpenTelemetry is the vendor-neutral CNCF standard for instrumentation (trace propagation follows the W3C Trace Context spec); switching backends becomes a config change, not a rewrite. Most major vendors (Honeycomb, Datadog, Grafana, AWS X-Ray) accept OTLP natively. Vendor SDKs are sometimes more polished but tie you to one provider's roadmap and pricing.
How many alerts should we have?
Far fewer than most teams have. A small team should have 5-10 alerts max: one or two SLO burn-rate alerts per critical user flow, plus saturation alerts for the few resources that can hard-cap the service (memory, DB connections). If you have 50 alerts firing weekly, you have 50 alerts your team has learned to ignore.
Can my AI agents help maintain the observability stack?
Yes. Agents are useful for: drafting the structured-logging schema from existing log calls, proposing trace spans by reading the codebase, refreshing dashboards when new metrics are added, summarizing weekly SLO health, and triaging which incidents deserve a full postmortem vs a one-line note. The playbook ships agent prompts inline for the logging refactor and trace instrumentation steps.

Open this template as a workspace.

We mint a fresh copy in your org with the steps as table rows, the pointers as a separate table, and the brief as a doc. Bring your agents, start checking off boxes.