Use Cases · May 30

Dock for data analytics: A/B experiment review with attributed analyst

Dock turns an Eppo or Statsig experiment readout into a structured decision memo with an attributed analyst-lead reviewer, so ship-or-kill calls carry a name, a timestamp, and a trail back to the underlying dbt model.

By mei · 3 min read · from trydock.ai

A/B experiments stall between "results look green" and "we shipped it." The readout lives in Eppo or Statsig, the metric definition lives in dbt, the decision lives in someone's head. Dock seats an experiment-review agent next to the analyst-lead. The agent reads the readout, drafts a decision memo, and the analyst-lead signs off. The row records who decided, what they saw, and which dbt model fed the numbers. See Dock for data analytics.

Eppo, Statsig, and dbt stay the system of record for the raw data. Dock is the system of record for the interpretation: what the agent recommended and what the reviewer decided. Each Dock row carries a pointer back to the platform record, plus the agent identity, decision, reviewer, and timestamp. The agent re-fetches platform data through fresh API reads when it needs current state.

The Dock surface: Experiment decisions table

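| experiment       | platform         | primary_metric            | lift  | p_value | agent_call | analyst_lead_review          | decision | dbt_model                   |
|------------------|------------------|---------------------------|-------|---------|------------|------------------------------|----------|-----------------------------|
| checkout_v3      | eppo/exp_8821    | gross_revenue_per_visitor | +2.4% | 0.018   | ship       | priya@ (approved 2026-05-28) | shipped  | marts.fct_checkout_sessions |
| onboarding_nudge | statsig/exp_4410 | d7_activation             | +0.6% | 0.21    | hold       | priya@ (rejected, SRM flag)  | rerun    | marts.fct_activation        |
| pricing_banner   | eppo/exp_9102    | trial_starts              | -1.1% | 0.04    | kill       | dan@ (approved 2026-05-29)   | killed   | marts.fct_trial_funnel      |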

Each row is one decision. The agent_call column is the draft. The analyst_lead_review column is the binding sign-off. Both are preserved, so a later auditor can ask why the agent said "hold" and the human said "rerun."
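As a rough illustration of that row shape, here is a minimal Python sketch. The `DecisionRow` class and its field names are assumptions made for this example and mirror the table columns; they are not Dock's actual schema or API. The `approved` flag is a simplification of the analyst_lead_review column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionRow:
    """One row of the decisions table (hypothetical schema, not Dock's API)."""
    experiment: str
    platform_ref: str    # pointer back to the Eppo/Statsig record
    primary_metric: str
    lift_pct: float
    p_value: float
    agent_call: str      # the agent's draft: "ship", "hold", or "kill"
    reviewer: str        # who signed the row
    approved: bool       # True if the reviewer accepted the agent's draft
    review_note: str     # sign-off date or rejection reason
    decision: str        # the binding outcome
    dbt_model: str       # model pinned at decision time


# The three example rows from the table above.
rows = [
    DecisionRow("checkout_v3", "eppo/exp_8821", "gross_revenue_per_visitor",
                2.4, 0.018, "ship", "priya@", True, "2026-05-28",
                "shipped", "marts.fct_checkout_sessions"),
    DecisionRow("onboarding_nudge", "statsig/exp_4410", "d7_activation",
                0.6, 0.21, "hold", "priya@", False, "SRM flag",
                "rerun", "marts.fct_activation"),
    DecisionRow("pricing_banner", "eppo/exp_9102", "trial_starts",
                -1.1, 0.04, "kill", "dan@", True, "2026-05-29",
                "killed", "marts.fct_trial_funnel"),
]

# Rows where the human did not accept the agent's draft.
overridden = [r.experiment for r in rows if not r.approved]
print(overridden)  # ['onboarding_nudge']
```

Keeping the draft and the sign-off as separate fields is what lets an auditor list disagreements with a one-line filter.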

The workflow

When Eppo or Statsig marks a test as readout-ready, the agent pulls the result and the variant assignment counts from the platform, and the metric definition from the corresponding dbt model. It writes a draft memo into a Dock row: observed lift, confidence interval, sample size per arm, sample-ratio-mismatch (SRM) check, and a ship/hold/kill recommendation. Priya, the analyst-lead, gets the row in her queue. She opens the linked Eppo dashboard, confirms the cut, and either approves the agent's call or overrides it. Her sign-off is what makes the row binding. Shipped decisions flow to the eng-lead through the same row, with the dbt model pinned for post-launch monitoring. The agent acts under its own identity, not Priya's seat. See agent identity.
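The drafting step can be pictured as a small decision rule. The sketch below is an assumption for illustration only: the source does not say how the agent arrives at ship/hold/kill, and both the `draft_call` function and the 0.05 significance level are hypothetical. The rule happens to reproduce the three agent_call values in the example table.

```python
def draft_call(lift_pct: float, p_value: float, srm_ok: bool,
               alpha: float = 0.05) -> str:
    """Return a draft recommendation: "ship", "hold", or "kill".

    Hypothetical rule, not Dock's actual logic:
    - a failed SRM check always yields "hold";
    - a non-significant result yields "hold";
    - a significant positive lift yields "ship", a significant negative one "kill".
    """
    if not srm_ok:
        return "hold"
    if p_value >= alpha:
        return "hold"
    if lift_pct > 0:
        return "ship"
    if lift_pct < 0:
        return "kill"
    return "hold"  # significant but exactly zero lift: leave it to the human


# Values taken from the example table.
print(draft_call(2.4, 0.018, srm_ok=True))    # ship
print(draft_call(0.6, 0.21, srm_ok=False))    # hold
print(draft_call(-1.1, 0.04, srm_ok=True))    # kill
```

The output is only a draft. In the workflow described above, the analyst-lead's review decides what is recorded as the binding decision.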

Why it matters

Experiment platforms are good at math and bad at memory. Six months later, no one can reconstruct why a borderline test shipped or why a winner got killed. Dock keeps the interpretation. The pinned dbt model means a definition change later does not silently invalidate the decision. This is the audit layer in agent audit and compliance, and the same shape we use for engineering and product decisions.

Statsig's guidance on multiple-comparison corrections [1] and Eppo's writing on sample-ratio mismatch [2] point at the same gap: stat-sig alone is not a decision.

Try it

Point Dock at your Eppo or Statsig workspace and your dbt project. The first experiment readout shows up in your analyst-lead's queue with a draft memo attached.

FAQ

Does the agent decide which experiments ship? No. The agent drafts a recommendation. The analyst-lead's signature is the binding decision. Per agent identity, the agent acts under its own credentials, so draft and override are both attributable.

What if the dbt metric definition changes after the decision? The row pins the model reference at decision time. On re-fetch, the agent flags any drift from the pinned definition and routes affected experiments back to the analyst-lead.
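One way to picture drift detection is to fingerprint the model definition at decision time and compare it on each re-fetch. This is a sketch under assumptions: the source does not describe how Dock pins or compares dbt models, and the `fingerprint` and `has_drifted` helpers are hypothetical names.

```python
import hashlib


def fingerprint(model_sql: str) -> str:
    """Hash a model definition, ignoring differences in whitespace only.

    Caveat: collapsing whitespace also touches whitespace inside SQL string
    literals, which is acceptable for a sketch but not for production use.
    """
    normalized = " ".join(model_sql.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_drifted(pinned_fingerprint: str, current_sql: str) -> bool:
    """True if the current definition no longer matches the pinned one."""
    return fingerprint(current_sql) != pinned_fingerprint


# Pin at decision time (illustrative SQL, not a real model).
pinned = fingerprint("select session_id,  revenue\nfrom sessions")

print(has_drifted(pinned, "select session_id, revenue from sessions"))       # False
print(has_drifted(pinned, "select session_id, net_revenue from sessions"))   # True
```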

Does this replace Eppo or Statsig? No. Those platforms remain the system of record for assignment, variance, and stat tests. Dock records what a named agent said and what a named human decided.

How does this handle sample-ratio mismatch? The agent runs the SRM check on every readout and refuses to recommend "ship" when the chi-square test indicates a mismatch. The resulting hold routes to the analyst-lead with the flag on the row.
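For readers who have not seen an SRM check, here is a minimal sketch of a chi-square goodness-of-fit test for two arms, using only the standard library. The function names, the 50/50 expected split, and the 0.001 threshold are assumptions for illustration; the source does not state which threshold the agent uses.

```python
import math


def srm_p_value(control_n: int, treatment_n: int,
                expected_control_share: float = 0.5) -> float:
    """P-value of a chi-square goodness-of-fit test on two arm counts.

    With two arms there is 1 degree of freedom, and the chi-square survival
    function for 1 degree of freedom equals erfc(sqrt(x / 2)).
    """
    total = control_n + treatment_n
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (treatment_n - expected_treatment) ** 2 / expected_treatment)
    return math.erfc(math.sqrt(chi2 / 2.0))


def srm_ok(control_n: int, treatment_n: int, threshold: float = 0.001) -> bool:
    """True if the observed split is consistent with the expected split."""
    return srm_p_value(control_n, treatment_n) >= threshold


print(srm_ok(5000, 5050))  # True: a small imbalance, consistent with 50/50
print(srm_ok(5000, 5400))  # False: imbalance too large to be chance
```

A strict threshold keeps false alarms rare when the check runs on every readout. A failed check says the assignment itself is suspect, which is why it blocks a "ship" draft regardless of the measured lift.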

Footnotes

  1. Statsig, "Correct me if I'm wrong: Navigating multiple comparison corrections in A/B Testing," statsig.com/blog.

  2. Eppo, "What to Do When You Encounter Sample Ratio Mismatch in A/B Testing," geteppo.com/blog.
