hosted · API-first · agent-operable

A/B testing your agents can operate.

Estimand is a feature-flag and experimentation platform where every action is one API call or CLI command — create a flag, launch an experiment, poll the results, ship the winner. Agents run the loop; the dashboard is where humans audit it.

$ pip install estimand && estimand login
agent session — compono/production
$ estimand features create trial_offer \
    --type experiment --unit telegram_id \
    --variants control:50,trial_on:50 --metrics paid
created  trial_offer  production

$ estimand results trial_offer --metric paid --json
{ "variants": [
  { "key": "control",  "exposures": 420, "conversions": 29,
    "rate": 0.069 },
  { "key": "trial_on", "exposures": 418, "conversions": 47,
    "rate": 0.112, "uplift": 0.63, "p_value": 0.03,
    "significant": true } ] }

$ estimand ship trial_offer --variant trial_on
trial_offer → 100% trial_on  

Built to be driven by machines, defended to humans.

Most experimentation tools assume a person clicking through a console. Estimand assumes a person is reviewing what an agent already did.

POST /v1/features · estimand features create

Agent-operable end to end

Create, target, ship, conclude — each is one authenticated call with a machine-readable response. No step in the experiment lifecycle requires a browser.

flag + metric = experiment

Flags and experiments, one primitive

A flag with variants and a metric is an experiment. One key, one SDK call, one lifecycle — stop wiring two systems together.

telegram_id · firebase_uid · anything

Any unit of randomization

Use whatever identity your product already has, whether it is a Telegram bot, an iOS app, a web dashboard, or a workspace. The unit type is yours to define.

raw counts, always

Statistics an agent can act on

Every rate ships with its raw counts, uplift vs control, and a p-value — plus an explicit significant flag, so an agent never ships on vibes.
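To see how the raw counts relate to the derived numbers, here is a minimal sketch that recomputes rate, uplift, and a p-value from exposures and conversions. It assumes a two-sided, pooled two-proportion z-test and a 0.05 threshold; the page does not say which test Estimand runs server-side, so treat both as illustrative assumptions. The function name `two_proportion_summary` is hypothetical.

```python
import math

ALPHA = 0.05  # assumed significance threshold, not stated in the source


def two_proportion_summary(control, treatment, alpha=ALPHA):
    """Compare a treatment variant against control from raw counts.

    Each argument is an (exposures, conversions) pair.
    Assumes a pooled two-proportion z-test with a two-sided p-value.
    """
    n_c, x_c = control
    n_t, x_t = treatment
    rate_c = x_c / n_c
    rate_t = x_t / n_t

    # Relative uplift of the treatment rate over the control rate.
    uplift = rate_t / rate_c - 1.0

    # Pooled standard error under the null hypothesis of equal rates.
    pooled = (x_c + x_t) / (n_c + n_t)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_c + 1.0 / n_t))
    z = (rate_t - rate_c) / se

    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))

    return {
        "rate": rate_t,
        "uplift": uplift,
        "p_value": p_value,
        "significant": p_value < alpha,
    }


# Counts taken from the trial_offer example on this page.
summary = two_proportion_summary(control=(420, 29), treatment=(418, 47))
print({k: round(v, 3) if isinstance(v, float) else v for k, v in summary.items()})
```

Under these assumptions the counts from the `trial_offer` example round to the same rate, uplift, and p-value shown in the sample response, which is the point of shipping raw counts: anyone can re-derive the conclusion.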

the API is the product

Everything the dashboard shows, one GET away.

Results are computed server-side and served as plain JSON — exposures, conversions, rate, uplift, p-value per variant, per metric. An agent polls, checks significant, and decides.

Auth is two key scopes: sdk keys evaluate flags from your app, admin keys manage them from your agent or CI. Keys are shown once and stored hashed.

HTTP
GET /v1/projects/compono/envs/production/
    features/trial_offer/results
Authorization: Bearer eak_1b7c…

{
  "feature": "trial_offer",
  "computed_at": "2026-07-03T18:10:00Z",
  "metrics": [ {
    "metric": "paid",
    "variants": [
      { "key": "control",  "exposures": 420,
        "conversions": 29, "rate": 0.069 },
      { "key": "trial_on", "exposures": 418,
        "conversions": 47, "rate": 0.112,
        "uplift": 0.63, "p_value": 0.03,
        "significant": true }
    ] } ]
}
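
The poll, check, decide loop can be sketched in a few lines of Python. This is an illustration, not an official client: the host in `BASE_URL` is a placeholder, the helper names are hypothetical, and the rule "ship only a significant variant with positive uplift" is one reasonable policy an agent might adopt. Only the path, the `Authorization: Bearer` header, and the response fields come from the example above.

```python
import json
import urllib.request

BASE_URL = "https://estimand.example"  # placeholder host, not a real endpoint


def results_request(project, env, feature, api_key):
    """Build the GET request for a feature's results (path as shown above)."""
    url = f"{BASE_URL}/v1/projects/{project}/envs/{env}/features/{feature}/results"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


def fetch_results(project, env, feature, api_key):
    """Send the request and decode the JSON body."""
    with urllib.request.urlopen(results_request(project, env, feature, api_key)) as resp:
        return json.load(resp)


def pick_winner(results, metric):
    """Return the key of the best significant variant for a metric, or None.

    A variant qualifies only if it is flagged significant and its uplift
    is positive, so a significantly worse variant is never shipped.
    """
    for block in results["metrics"]:
        if block["metric"] != metric:
            continue
        candidates = [
            v for v in block["variants"]
            if v.get("significant") and v.get("uplift", 0) > 0
        ]
        if candidates:
            return max(candidates, key=lambda v: v["rate"])["key"]
    return None


# The sample response from this page, used here instead of a live call.
SAMPLE = json.loads("""
{ "feature": "trial_offer",
  "metrics": [ { "metric": "paid", "variants": [
    { "key": "control",  "exposures": 420, "conversions": 29, "rate": 0.069 },
    { "key": "trial_on", "exposures": 418, "conversions": 47, "rate": 0.112,
      "uplift": 0.63, "p_value": 0.03, "significant": true } ] } ] }
""")

print(pick_winner(SAMPLE, "paid"))
```

When `pick_winner` returns `None`, the agent keeps polling; when it returns a key, that key is what gets passed to `estimand ship`.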

Hosted for you — self-host when you want it.

Sign up and you're testing in minutes: we run the platform, scale it, and keep it patched. Multi-tenant from the first request — organizations, projects, environments — so one account covers every product you ship.

Hosted, multi-tenant

Sign up, create a project, get your keys. Orgs → projects → environments, with separate SDK keys per environment — we operate it, you ship.

Cached in the SDK

Flag evaluation is deterministic and cached client-side, so a network blip never blocks your app. Assignment never round-trips to us on the hot path.
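
Deterministic assignment without a network call is usually done by hashing the unit ID into a bucket. The sketch below shows that general technique, assuming SHA-256 over `feature_key:unit_id` and integer variant weights like the `control:50,trial_on:50` split above. The page does not document the SDK's actual hashing scheme, so the function name and every detail here are hypothetical.

```python
import hashlib


def assign_variant(feature_key, unit_id, variants):
    """Map a unit to a variant deterministically, with no network call.

    variants is a list of (key, weight) pairs with non-negative integer
    weights, e.g. [("control", 50), ("trial_on", 50)].
    """
    total = sum(weight for _, weight in variants)
    if total <= 0:
        raise ValueError("variants must have a positive total weight")

    # Hash feature and unit together so each feature splits units independently.
    digest = hashlib.sha256(f"{feature_key}:{unit_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total

    # Walk the cumulative weights until the bucket falls inside a variant.
    cumulative = 0
    for key, weight in variants:
        cumulative += weight
        if bucket < cumulative:
            return key
    raise AssertionError("unreachable: bucket is always below total weight")


split = [("control", 50), ("trial_on", 50)]
first = assign_variant("trial_offer", "telegram:12345", split)
second = assign_variant("trial_offer", "telegram:12345", split)
print(first, first == second)
```

Because the result depends only on the feature key, the unit ID, and the weights, the same unit gets the same variant on every device and after every restart, which is what lets evaluation stay off the hot path.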

Self-host later

A single-container distribution against your own Postgres is on the roadmap — bring Estimand fully in-house when you need to.