Module agent_bench

§VelaBench v0.26 — agent state-update scoring

Compares a candidate frontier (typically agent-generated) against a gold frontier (curator-validated) and produces a reproducible score.

Unlike the legacy benchmark module — which scores literature extraction quality — this scorer reads two frontiers as data artifacts and judges how well one approximates the other. Determinism is the doctrine: sort by vf_id, no wall-clock, no RNG. Same inputs → same numbers.

Substrate stays dumb: this is pure data comparison. No LLM call, no network, no agent invocation. The scorer never spawns claude or anything else; it operates on already-emitted FindingBundles and StateProposals.
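The determinism doctrine above can be sketched as a pure comparison function. This is an illustrative assumption, not the crate's actual metric: it treats each frontier as `vf_id → claim` pairs and uses exact claim equality as the matching rule. A `BTreeMap` keeps iteration ordered by `vf_id`, so the score is a pure function of the inputs: no wall-clock, no RNG.

```rust
use std::collections::BTreeMap;

/// Fraction of gold entries whose claim the candidate reproduces verbatim.
/// Hypothetical sketch: the real scorer's matching rule is richer than
/// string equality.
fn claim_match(gold: &BTreeMap<String, String>, cand: &BTreeMap<String, String>) -> f64 {
    if gold.is_empty() {
        // Empty gold frontier: vacuously perfect.
        return 1.0;
    }
    let hits = gold
        .iter()
        .filter(|(id, claim)| cand.get(*id) == Some(*claim))
        .count();
    hits as f64 / gold.len() as f64
}

fn main() {
    let gold = BTreeMap::from([
        ("vf-001".to_string(), "A".to_string()),
        ("vf-002".to_string(), "B".to_string()),
    ]);
    let cand = BTreeMap::from([("vf-001".to_string(), "A".to_string())]);
    // One of two gold claims matched exactly.
    assert_eq!(claim_match(&gold, &cand), 0.5);
}
```

Because the inputs are already-emitted data artifacts, the same pair of frontiers always produces the same number.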

Structs§

BenchInput
Inputs to a single VelaBench run.
BenchReport
Full bench report. Serializable to JSON for --json mode and for checking in as expected.json regression bands.
MetricResult
One metric’s worth of result. pass is purely informational (target met) — the binary’s exit code is driven by the composite, not by individual metrics.
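The exit-code contract described for MetricResult can be sketched as follows. Field names other than `pass` and `composite` are invented for illustration and are not the crate's actual definitions; only the rule (individual `pass` is informational, the composite drives the exit code) comes from the docs above.

```rust
/// Hypothetical shape of one metric's result.
struct MetricResult {
    name: &'static str,
    score: f64,  // 0.0..=1.0
    target: f64, // per-metric threshold
    pass: bool,  // informational only: score >= target
}

/// Hypothetical shape of the full report.
struct BenchReport {
    metrics: Vec<MetricResult>,
    composite: f64, // weighted sum; this alone drives the exit code
}

impl BenchReport {
    /// Exit code comes from the composite, never from individual metrics.
    fn exit_code(&self, threshold: f64) -> i32 {
        if self.composite >= threshold { 0 } else { 1 }
    }
}

fn main() {
    let report = BenchReport {
        metrics: vec![MetricResult {
            name: "claim_match",
            score: 0.4,
            target: 0.8,
            pass: false,
        }],
        composite: 0.91,
    };
    // A failing individual metric does not fail the run.
    assert!(!report.metrics[0].pass);
    assert_eq!(report.exit_code(0.9), 0);
}
```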

Constants§

W_CLAIM_MATCH
Composite score weights, summing to 1.0 — locked here so the formula is auditable in one line. Adjust deliberately.
W_CONTRADICTION_RECALL
W_DOWNSTREAM_LINK
W_DUPLICATE_INV
W_EVIDENCE_FIDELITY
W_SCOPE
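The constants above can be sketched like this. The individual weight values here are invented placeholders; only the invariant stated in the docs (the weights sum to 1.0 and the composite is their weighted combination) is taken from the module.

```rust
// Placeholder weights: values are illustrative, not the crate's.
// The invariant they must uphold: the six weights sum to exactly 1.0.
const W_CLAIM_MATCH: f64 = 0.30;
const W_CONTRADICTION_RECALL: f64 = 0.20;
const W_DOWNSTREAM_LINK: f64 = 0.15;
const W_DUPLICATE_INV: f64 = 0.10;
const W_EVIDENCE_FIDELITY: f64 = 0.15;
const W_SCOPE: f64 = 0.10;

const WEIGHTS: [f64; 6] = [
    W_CLAIM_MATCH,
    W_CONTRADICTION_RECALL,
    W_DOWNSTREAM_LINK,
    W_DUPLICATE_INV,
    W_EVIDENCE_FIDELITY,
    W_SCOPE,
];

/// Composite score: the weighted sum, auditable in one line.
fn composite(scores: &[f64; 6]) -> f64 {
    WEIGHTS.iter().zip(scores.iter()).map(|(w, s)| w * s).sum()
}

fn main() {
    let total: f64 = WEIGHTS.iter().sum();
    assert!((total - 1.0).abs() < 1e-9);
    // Perfect scores on every metric yield a composite of 1.0.
    assert!((composite(&[1.0; 6]) - 1.0).abs() < 1e-9);
}
```

Locking the weights as constants (rather than config) is what makes the formula auditable at a glance; any change to them is a deliberate, reviewable diff.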

Functions§

render_pretty
Render a human-readable report. JSON callers serialize BenchReport directly.
run
Run a complete bench. Loads both frontiers, computes every metric, and returns the report. Caller decides what to do with the exit code.
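A caller of run might look like the sketch below. The `BenchInput` fields, the stub body, and the 0.9 threshold are all invented for illustration; only the contract documented above (run loads both frontiers, computes every metric, returns the report, and the caller owns the exit code) comes from this module.

```rust
/// Hypothetical input shape: paths to the two frontier artifacts.
struct BenchInput {
    gold_path: String,
    candidate_path: String,
}

/// Hypothetical report shape, reduced to the field the caller needs.
struct BenchReport {
    composite: f64,
}

/// Stub standing in for the real `run`: no LLM call, no network,
/// no agent invocation — pure data comparison.
fn run(_input: &BenchInput) -> BenchReport {
    BenchReport { composite: 0.95 }
}

fn main() {
    let input = BenchInput {
        gold_path: "gold.json".into(),
        candidate_path: "candidate.json".into(),
    };
    let report = run(&input);
    // The caller, not the scorer, maps the composite to an exit code.
    let code = if report.composite >= 0.9 { 0 } else { 1 };
    assert_eq!(code, 0);
}
```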