bench

crabbox bench records and reports local benchmark timing observations. It is a local evidence workflow, not a provider leaderboard.

Benchmark rows are append-only JSONL records. They wrap the existing TimingReport payload with local comparison context such as command fingerprint, command display text, provider family/kind, repo fingerprint, and cold/warm state when known. Crabbox only writes this ledger when explicitly requested.

Default store:

<CrabboxStateDir()>/timings.jsonl

CrabboxStateDir() uses $XDG_STATE_HOME/crabbox when set; otherwise it uses the user config directory's crabbox/state directory.

#Record from a run

Use run --timing-record to append the final timing report from a real run:

crabbox run --timing-record=default -- pnpm test
crabbox run --timing-record ./bench/timings.jsonl --provider aws -- pnpm test

Plain crabbox run remains non-recording by default.

#Run and record a benchmark

bench run runs the same command for each selected provider and repeat, then records observations in the benchmark store:

crabbox bench run --providers aws,hetzner --repeats 3 -- pnpm test
crabbox bench run --provider aws --cold -- go test ./...

The command uses the same execution path as crabbox run, so provider setup, sync, command streaming, timing, and normal cleanup behavior stay centralized. If any provider or repeat fails, Crabbox continues the remaining attempts and exits non-zero after printing the number of recorded observations. Use --store <path> to write a specific JSONL store; the default is the local state store.

#Record an existing timing JSON payload

bench record ingests one saved TimingReport JSON object from a file or stdin:

crabbox bench record --timing-json timing.json --command "pnpm test" --cold

--command or command args after -- set the command display text and command fingerprint used for grouping:

crabbox bench record --timing-json timing.json -- pnpm test

#Report local observations

bench report reads the local store and groups observations by provider, provider family/kind, machine type, command fingerprint, and cold/warm bucket.

crabbox bench report
crabbox bench report --since 7d --providers aws,hetzner
crabbox bench report --command-fingerprint sha256:... --json

Human output includes successful sample count (n), median total duration, p95 total duration when enough samples exist, median sync and command duration, failure count, and an evidence marker. bench report marks groups as insufficient_successful_samples until the group has at least --min-samples successful observations. The default is 2.

JSON output uses the same grouped data:

{
  "schemaVersion": 1,
  "storePath": ".../timings.jsonl",
  "filters": {
    "since": "7d",
    "providers": ["aws"],
    "minSamples": 2
  },
  "groups": [
    {
      "provider": "aws",
      "providerFamily": "aws",
      "providerKind": "ssh-lease",
      "machineType": "c7a.large",
      "commandFingerprint": "sha256:...",
      "n": 2,
      "medianTotalMs": 64000,
      "medianSyncMs": 12000,
      "medianCommandMs": 45000,
      "failureCount": 0,
      "insufficientEvidence": false,
      "evidence": "sufficient_local_samples"
    }
  ]
}

#Privacy and interpretation

Timing records can include repo paths, workdirs, command display text, labels, artifact paths, and lease metadata because they preserve the existing TimingReport payload. Treat the store as local private state. Delete it when you no longer want the observations.

Reports say what happened locally for matching workloads. They do not claim that one provider is fastest globally, do not publish measurements, and do not infer cost unless a future record includes a defensible cost basis.

#Flags

bench run:
--store default|path        JSONL store destination (default: default)
--provider <name>           run one provider
--providers a,b             run comma-separated providers
--repeats <n>               repeat count per provider (default: 1)
--cold                      mark observations as cold runs
--warm                      mark observations as warm/reused runs

bench record:
--store default|path        JSONL store destination (default: default)
--timing-json path|-        TimingReport JSON input, or stdin with -
--source <label>            record source label
--command <text>            command display text for grouping
--cold                      mark observation as a cold run
--warm                      mark observation as a warm/reused run
--repeat-index <n>          one-based repeat index when known

bench report:
--store default|path        JSONL store to read (default: default)
--provider <name>           include one provider
--providers a,b             include comma-separated providers
--command-fingerprint <id>  include one command fingerprint
--since <duration>          include records since 7d, 24h, etc.
--min-samples <n>           successful samples required for sufficient evidence
--json                      print machine-readable report JSON