# Bench Command
Run performance benchmarks for a Homeboy component and surface regression
deltas against a stored baseline.
## Synopsis
```bash
homeboy bench <component> [options] [-- <runner-args>]
```
## Description
The `bench` command invokes the extension's bench runner, which measures
one or more scenarios over N iterations and emits a structured JSON
results file. Homeboy parses the results, compares declared numeric
metrics against a saved baseline, and returns a structured report plus
an exit code suitable for CI gates.
`bench` is a sibling of `test`, `lint`, and `build` under Homeboy's
extension capability model. The runner contract, manifest shape, and
baseline primitive (`homeboy.json` → `baselines.bench`) are shared with
the other capabilities.
## Arguments
- `<component>`: Component to benchmark. Auto-detected from the current
working directory if omitted. The component must have a linked
extension that declares a `bench` capability.
## Options
- `--iterations <N>`: Iterations per scenario (default `10`). Forwarded
to the runner via `$HOMEBOY_BENCH_ITERATIONS`. Extensions may clamp.
- `--baseline`: Save the current run as the new baseline under
`homeboy.json` → `baselines.bench`.
- `--ignore-baseline`: Run without comparing to any saved baseline.
- `--ratchet`: When scenarios improve, auto-update the saved baseline so
the improvement "sticks". Ignored when the run regresses.
- `--regression-threshold <PERCENT>`: Legacy p95 regression tolerance
(default `5.0`) used when the runner does not declare `metric_policies`.
A p95 scenario regresses when its current `p95_ms` exceeds
`baseline.p95_ms * (1 + threshold/100)`.
- `--setting <key=value>`: Override component settings (may be repeated).
- `--path <PATH>`: Override the component's `local_path` for this run.
- `--json-summary`: Include a compact machine-readable summary in the
JSON output envelope (for CI wrappers).
- `--shared-state <DIR>`: Mount a stable storage directory across
iterations (and across parallel runner instances when combined with
`--concurrency`). Exposed to the runner as
`$HOMEBOY_BENCH_SHARED_STATE`. Created if it doesn't exist; never
cleaned up by homeboy. See "Shared State and Concurrency" below.
- `--concurrency <N>`: Number of parallel runner instances to spawn
(default `1`). When `> 1`, `--shared-state` is required. Each instance
receives a distinct `$HOMEBOY_BENCH_INSTANCE_ID` (`0..N-1`) plus
`$HOMEBOY_BENCH_CONCURRENCY=N`.
- `--rig <RIG_ID[,RIG_ID...]>`: Pin the run to one or more rigs. A single
rig pins the run to that rig and stores its baseline under a rig-scoped key.
Multiple rigs (comma-separated) run the same component + workload +
iteration count against each rig in sequence and emit a cross-rig
comparison envelope. See "Cross-rig comparison" below.
Arguments after `--` are passed verbatim to the extension's bench runner
script (e.g., `--filter=scenario_id` for selective execution).
## Examples
```bash
# Benchmark a component with defaults (10 iterations, 5% regression threshold)
homeboy bench my-component
# 50 iterations, stricter 2% regression threshold
homeboy bench my-component --iterations 50 --regression-threshold 2.0
# Save a new baseline
homeboy bench my-component --baseline
# Run with auto-ratchet on improvement
homeboy bench my-component --ratchet
# Select a single scenario via passthrough args
homeboy bench my-component -- --filter=hot_path
# Concurrent-writer stress test: 4 parallel instances against a shared
# on-disk state directory. All four runners see the same SQLite +
# markdown files, surfacing lock contention and write loss.
homeboy bench my-component \
--shared-state /tmp/bench-shared \
--concurrency 4
# Crash-recovery / durability test: single instance, persistent state.
# Workload kills mid-stream on iteration N; iteration N+1 boots fresh
# against the same on-disk state and audits integrity.
homeboy bench my-component --shared-state /tmp/bench-durability
# Pin to a single rig — preflight + rig-scoped baseline
homeboy bench studio --rig studio-trunk
# Cross-rig comparison: same workload, two rigs, side-by-side report.
# First rig (`studio-trunk`) is the reference; the diff table expresses
# every other rig's metrics as percent deltas vs the reference.
homeboy bench studio --rig studio-trunk,studio-combined-fixes --iterations 10
# Three-rig comparison to isolate one PR's contribution.
homeboy bench studio \
--rig trunk,combined-fixes,combined-fixes-without-3120 \
--iterations 20
```
## Shared State and Concurrency
Two workload classes need state that is shared across runtime instances
or that survives a kill:
- **Concurrent writers** — N parallel processes writing against the
same site, surfacing lock contention and write loss under load.
- **Crash recovery** — Start a write stream, kill mid-stream, boot a
fresh runtime against the same on-disk state, audit integrity.
Both fit cleanly under `--shared-state <DIR>`:
| Mode | `--concurrency` | `--shared-state` | Behavior |
| --- | --- | --- | --- |
| Cold-iteration (default) | `1` | unset | Per-iteration cold boot, no shared state. The original bench design. |
| Persistent single | `1` | `<DIR>` | Single runtime, but state in `<DIR>` survives across iterations. Crash-recovery workloads. |
| Concurrent | `> 1` | `<DIR>` (required) | N parallel runners, all pointed at `<DIR>`. Lock-contention workloads. |
| Concurrent without state | `> 1` | unset | **Rejected** — N parallel cold-boots without shared state are N independent runs. The validation error points you at `--shared-state`. |
Per-instance scenarios are merged with `:i<n>` suffixed IDs in the
aggregated output (`shared_counter:i0`, `shared_counter:i1`, …) so each
instance's measurements stay distinguishable. The baseline ratchet works
unchanged — a regression in instance 2 surfaces as a regression on
`<id>:i2`, not as silent averaging across instances.
### Runner contract additions
When shared-state and concurrency flags are set, three additional env
vars flow into the runner:
- `HOMEBOY_BENCH_SHARED_STATE` — absolute path to the shared directory
(or empty string when not set). Workloads that opt into shared state
read or write files under this path.
- `HOMEBOY_BENCH_INSTANCE_ID` — `0..N-1` for parallel runs, `0` for
single-instance.
- `HOMEBOY_BENCH_CONCURRENCY` — `N` for parallel runs, `1` for
single-instance.
Per-instance results are written to `bench-results-i<n>.json` under the
run dir; homeboy core merges them into the unified `BenchResults`
envelope before applying baseline comparison.
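A minimal sketch of how a workload script might consume these variables.
The exported values at the top simulate what homeboy would inject for
instance 2 of 4; the `writers.log` filename is hypothetical:

```bash
#!/usr/bin/env bash
# Simulated injection — under homeboy these are set by the CLI, not here.
set -euo pipefail
export HOMEBOY_BENCH_SHARED_STATE="$(mktemp -d)"
export HOMEBOY_BENCH_INSTANCE_ID=2
export HOMEBOY_BENCH_CONCURRENCY=4

shared="${HOMEBOY_BENCH_SHARED_STATE:-}"
if [ -n "$shared" ]; then
  # Every instance appends to the same file, so lock contention and
  # write loss surface as missing or interleaved lines.
  echo "instance ${HOMEBOY_BENCH_INSTANCE_ID} of ${HOMEBOY_BENCH_CONCURRENCY}" \
    >> "$shared/writers.log"
fi
```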
## Cross-rig comparison
`--rig <a>,<b>[,<c>...]` runs the same component + workload + iteration
count against each rig in sequence and emits a single comparison
envelope. Useful for "is my fix actually faster than trunk?" — same
question, two rigs differing only in component commit state.
### How it runs
For each rig, in input order:
1. Load the rig spec and run `rig check`. Failure aborts the entire
comparison — comparing against an unhealthy rig would produce
garbage numbers.
2. Snapshot rig state (each component's git SHA + branch) into the
per-rig output entry.
3. Run bench against the resolved component with the rig pinned.
After every rig finishes, results are aggregated into a
`BenchComparisonOutput` envelope with `comparison: "cross_rig"`. The
**first rig in the list is the reference**: per-metric percent deltas
in the `diff` table express each subsequent rig as `(current -
reference) / reference * 100`.
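The arithmetic can be reproduced directly; the numbers here match the
example envelope on this page (reference `p95_ms` 31200, current 19400):

```bash
# delta_percent = (current - reference) / reference * 100
delta=$(awk 'BEGIN { printf "%.2f", (19400.0 - 31200.0) / 31200.0 * 100 }')
echo "$delta"   # -37.82
```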
### What's intentionally not done
- **No baseline writes.** `--baseline` and `--ratchet` are rejected on
cross-rig invocations. Baselines are per-rig; writing one from a
comparison would silently bless one rig over the others. Run `homeboy
bench --rig <id> --baseline` once per rig to ratchet individually.
- **No statistical-significance gating.** Two rigs with overlapping
`p95_ms` distributions still produce a numeric delta. Treat single-digit
percent moves with skepticism. Confidence intervals are a v2 question.
- **No matrix × rig composition.** `--matrix` and multi-`--rig` together
is not yet supported; pick one axis per invocation. Single-rig +
matrix continues to work.
### Output shape (cross-rig)
```json
{
"comparison": "cross_rig",
"passed": true,
"component": "studio",
"exit_code": 0,
"iterations": 10,
"rigs": [
{
"rig_id": "studio-trunk",
"passed": true,
"status": "passed",
"exit_code": 0,
"results": { ... },
"rig_state": { "rig_id": "studio-trunk", "captured_at": "...", "components": { ... } }
},
{
"rig_id": "studio-combined-fixes",
"passed": true,
"status": "passed",
"exit_code": 0,
"results": { ... },
"rig_state": { ... }
}
],
"diff": {
"by_scenario": {
"agent_boot": {
"p95_ms": {
"studio-combined-fixes": {
"reference": 31200.0,
"current": 19400.0,
"delta_percent": -37.82
}
}
}
}
},
"hints": [ ... ]
}
```
The reference rig is omitted from the inner `diff.by_scenario.<id>.<metric>`
map — its delta against itself would always be zero. A scenario or
metric missing from a non-reference rig is silently skipped (no
synthetic zeros).
### Exit code
`exit_code` is `0` only when every rig passed; otherwise the first
non-zero rig exit code wins. `passed` follows the same rule: `true` only
when every rig passed.
## Baseline Ratchet Semantics
The bench baseline is a list of per-scenario snapshots stored in
`homeboy.json` under the `baselines.bench` key. Each snapshot records
`{ id, metrics }` plus the iteration count at capture time.
On every run without `--baseline` or `--ignore-baseline`:
1. Each current scenario is matched against the baseline by `id`.
2. If the runner declares `metric_policies`, only those metrics are
compared. Each policy declares whether lower or higher values are
better and optional percent/absolute tolerances.
3. If the runner omits `metric_policies`, Homeboy keeps the historical
default: compare `p95_ms` as lower-is-better with the CLI threshold.
4. A scenario improves when any compared metric moves in the better
direction.
5. Scenarios present in one run but not the other are flagged as
`new_scenario_ids` / `removed_scenario_ids`. Neither state triggers
a regression by itself — they're informational.
6. If any scenario regressed, the command exits `1` regardless of the
runner's own exit code.
7. If any scenario improved and `--ratchet` is set, the baseline is
overwritten with the current snapshot.
p95 remains the default for legacy latency benchmarks because it is less
sensitive than mean to one-off GC pauses but more sensitive than p99 to
genuine regressions. Runners that care about non-latency signals should
declare `metric_policies` instead.
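The legacy p95 gate can be sketched with illustrative numbers: a
baseline `p95_ms` of 145, the default 5% threshold, and a current run
at 155 ms. The limit is `145 * 1.05 = 152.25`, so this run regresses:

```bash
# Hypothetical values, not taken from a real run.
verdict=$(awk -v b=145.0 -v t=5.0 -v c=155.0 'BEGIN {
  limit = b * (1 + t / 100)           # 152.25
  if (c > limit) print "regressed"; else print "ok"
}')
echo "$verdict"
```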
## Runner Contract
The extension's bench script must:
1. Read `$HOMEBOY_BENCH_ITERATIONS` to determine iteration count.
2. Write its JSON output to `$HOMEBOY_BENCH_RESULTS_FILE`.
3. Exit with a non-zero status only on runner-level failure (script
error, workload crash) — regressions are homeboy's domain.
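The three points above fit in a short skeleton. The workload itself is
elided and the component id is a placeholder; the `mktemp` fallback is
only so the sketch runs standalone (under homeboy the results-file
variable is always set):

```bash
#!/usr/bin/env bash
set -euo pipefail

# 1. Honor the requested iteration count.
n="${HOMEBOY_BENCH_ITERATIONS:-10}"
# 2. Write JSON where homeboy expects it.
out="${HOMEBOY_BENCH_RESULTS_FILE:-$(mktemp)}"

# ... run the workload $n times, collect per-iteration timings ...

cat > "$out" <<JSON
{ "component_id": "my-component", "iterations": $n, "scenarios": [] }
JSON
# 3. Implicit exit 0 — regression detection is homeboy's job.
```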
### JSON output schema
```json
{
"component_id": "string",
"iterations": 10,
"metric_policies": {
"error_rate": {
"direction": "lower_is_better",
"regression_threshold_absolute": 0.01
},
"requests_per_second": {
"direction": "higher_is_better",
"regression_threshold_percent": 5.0
}
},
"scenarios": [
{
"id": "scenario_slug",
"file": "tests/bench/some-workload.ext",
"iterations": 10,
"metrics": {
"mean_ms": 120.3,
"p50_ms": 118.0,
"p95_ms": 145.0,
"p99_ms": 160.0,
"min_ms": 110.0,
"max_ms": 172.0,
"error_rate": 0.0,
"requests_per_second": 180.5,
"status_500_count": 0
},
"memory": { "peak_bytes": 41943040 }
}
]
}
```
- Top-level keys are strict — unknown top-level fields are rejected to
keep the contract honest.
- `metrics` is an arbitrary map of numeric values. Homeboy core does not
attach domain meaning to metric names.
- `metric_policies` is optional. If omitted, Homeboy compares `p95_ms`
using the legacy lower-is-better latency policy.
- Policy `direction` accepts `lower_is_better` / `lower` and
`higher_is_better` / `higher`.
- Policy thresholds are optional. `regression_threshold_percent` compares
relative movement; `regression_threshold_absolute` compares raw numeric
movement. If both are present, a metric must exceed both tolerances to
regress.
- Scenario-level unknown keys are **tolerated**, so extensions can emit
additional metadata (tags, environment info, warmup counts) without
breaking parsing.
- `memory` is optional. Extensions that can't measure peak memory omit it.
- `file` is optional but recommended for diagnostics.
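The both-tolerances rule can be checked with illustrative numbers: an
`error_rate` moving from 0.002 to 0.0025 is +25% relative but only
+0.0005 absolute, so with `regression_threshold_percent: 5` and
`regression_threshold_absolute: 0.01` only one tolerance is exceeded
and the metric does not regress:

```bash
verdict=$(awk 'BEGIN {
  base = 0.002; cur = 0.0025
  pct_tol = 5.0; abs_tol = 0.01
  pct_move = (cur - base) / base * 100   # 25
  abs_move = cur - base                  # 0.0005
  if (pct_move > pct_tol && abs_move > abs_tol) print "regressed"
  else print "ok"
}')
echo "$verdict"
```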
### Environment variables injected
Bench scripts receive the standard runner contract plus bench-specific
variables:
- `HOMEBOY_BENCH_RESULTS_FILE` — where to write JSON output.
- `HOMEBOY_BENCH_ITERATIONS` — iteration count to use.
- `HOMEBOY_RUN_DIR` — per-run directory (shared with test/lint/build).
- `HOMEBOY_EXTENSION_ID`, `HOMEBOY_COMPONENT_ID`, `HOMEBOY_COMPONENT_PATH`,
and the usual execution-context vars.
- `HOMEBOY_SETTINGS_JSON` — component settings as JSON.
## Component Requirements
For a component to be benchmarkable, it must have:
- A linked extension whose manifest declares a `bench` capability.
- A bench-runner script provided by the extension.
Extension manifest:
```json
{
"bench": {
"extension_script": "scripts/bench/bench-runner.sh"
}
}
```
## Exit Codes
- `0` — All scenarios passed, no regressions detected (or no baseline
exists yet).
- `1` — At least one scenario regressed beyond the threshold, or the
runner itself failed.
- Other non-zero — Runner exit code passthrough (extension-specific).
## Related
- [test](./test.md) — Test sibling capability; bench mirrors its flag
conventions for `--baseline`, `--ignore-baseline`, and `--ratchet`.
- [lint](./lint.md) — Lint sibling capability.
- [build](./build.md) — Build sibling capability.