§VelaBench v0.26 — agent state-update scoring
Compares a candidate frontier (typically agent-generated) against a gold frontier (curator-validated) and produces a reproducible score.
Unlike the legacy benchmark module — which scores literature
extraction quality — this scorer reads two frontiers as data
artifacts and judges how well one approximates the other.
Determinism is the doctrine: sort by vf_id, no wall-clock,
no RNG. Same inputs → same numbers.
Substrate stays dumb: this is pure data comparison. No LLM
call, no network, no agent invocation. The scorer never spawns
claude or anything else; it operates on already-emitted
FindingBundles and StateProposals.
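The determinism rule above can be sketched as a tiny scoring helper. This is a hypothetical illustration, not the crate's code: the `Claim` struct, `matched` field, and `claim_match_rate` function are invented here; only the "sort by `vf_id`, no wall-clock, no RNG" discipline comes from the docs.

```rust
// Illustrative sketch: sort every collection by a stable key (`vf_id`)
// before scoring, so the same inputs always produce the same number.
#[derive(Debug, Clone)]
struct Claim {
    vf_id: String,
    matched: bool,
}

/// Deterministic match rate: sorting by `vf_id` first means iteration
/// order (and any downstream tie-breaking) never depends on input order.
fn claim_match_rate(mut claims: Vec<Claim>) -> f64 {
    claims.sort_by(|a, b| a.vf_id.cmp(&b.vf_id));
    if claims.is_empty() {
        return 0.0;
    }
    let hits = claims.iter().filter(|c| c.matched).count();
    hits as f64 / claims.len() as f64
}

fn main() {
    let a = vec![
        Claim { vf_id: "vf-2".into(), matched: true },
        Claim { vf_id: "vf-1".into(), matched: false },
    ];
    let b = vec![
        Claim { vf_id: "vf-1".into(), matched: false },
        Claim { vf_id: "vf-2".into(), matched: true },
    ];
    // Same inputs (as a set), different order: same score.
    assert_eq!(claim_match_rate(a), claim_match_rate(b));
}
```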
Structs§
- `BenchInput`: Inputs to a single VelaBench run.
- `BenchReport`: Full bench report. Serializable to JSON for `--json` mode and for checking in as `expected.json` regression bands.
- `MetricResult`: One metric's worth of result. `pass` is purely informational (target met); the binary's exit code is driven by the composite, not by individual metrics.
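The `pass`-is-informational rule can be shown in a few lines. This is a hypothetical sketch: the fields on `MetricResult`, the `exit_code` helper, and the 0.75 threshold are all invented for illustration; the docs only guarantee that the composite, not any single metric, drives the exit code.

```rust
// Illustrative sketch of the `pass` semantics: a metric records whether
// it met its own target, but only the composite decides the exit code.
struct MetricResult {
    name: &'static str,
    score: f64,
    target: f64,
    pass: bool, // informational only
}

fn exit_code(composite: f64, threshold: f64) -> i32 {
    if composite >= threshold { 0 } else { 1 }
}

fn main() {
    let m = MetricResult { name: "claim_match", score: 0.4, target: 0.6, pass: false };
    assert_eq!(m.name, "claim_match");
    assert!(m.score < m.target && !m.pass);
    // One failing metric does not force a non-zero exit; the composite does.
    assert_eq!(exit_code(0.82, 0.75), 0);
}
```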
Constants§
- `W_CLAIM_MATCH`: Composite score weights, summing to 1.0, locked here so the formula is auditable in one line. Adjust deliberately.
- `W_CONTRADICTION_RECALL`
- `W_DOWNSTREAM_LINK`
- `W_DUPLICATE_INV`
- `W_EVIDENCE_FIDELITY`
- `W_SCOPE`
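The composite formula implied by these constants can be sketched as a weighted sum. The constant names come from the listing above, but the numeric values and the `composite` function below are assumptions for illustration; the real weights live in the crate and only their sum-to-1.0 invariant is documented.

```rust
// Assumed weight values for illustration only; the crate pins the real
// ones. The documented invariant is that they sum to exactly 1.0.
const W_CLAIM_MATCH: f64 = 0.30;
const W_CONTRADICTION_RECALL: f64 = 0.20;
const W_DOWNSTREAM_LINK: f64 = 0.15;
const W_DUPLICATE_INV: f64 = 0.10;
const W_EVIDENCE_FIDELITY: f64 = 0.15;
const W_SCOPE: f64 = 0.10;

const WEIGHTS: [f64; 6] = [
    W_CLAIM_MATCH, W_CONTRADICTION_RECALL, W_DOWNSTREAM_LINK,
    W_DUPLICATE_INV, W_EVIDENCE_FIDELITY, W_SCOPE,
];

/// Composite score: the weighted sum of the six per-metric scores.
fn composite(scores: [f64; 6]) -> f64 {
    WEIGHTS.iter().zip(scores.iter()).map(|(w, s)| w * s).sum()
}

fn main() {
    // The weights are auditable in one line: they must sum to 1.0.
    let sum: f64 = WEIGHTS.iter().sum();
    assert!((sum - 1.0).abs() < 1e-12);
    // Perfect per-metric scores yield a perfect composite.
    assert!((composite([1.0; 6]) - 1.0).abs() < 1e-12);
}
```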
Functions§
- `render_pretty`: Render a human-readable report. JSON callers serialize `BenchReport` directly.
- `run`: Run a complete bench. Loads both frontiers, computes every metric, and returns the report. Caller decides what to do with the exit code.
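A driver sketch can make the intended call pattern concrete. Every signature here is an assumption — the real `run` and `render_pretty` live in the crate and take their own argument types; only the shape of the flow (run, then render or serialize, then the caller picks the exit code) comes from the docs.

```rust
// Hypothetical driver: `run` produces a report, `render_pretty` (or JSON
// serialization) handles output, and the caller owns the exit code.
struct BenchReport {
    composite: f64,
    pass: bool,
}

// Stand-in for the real `run`, which loads both frontiers and computes
// every metric. Signature and body are invented for this sketch.
fn run(candidate: &str, gold: &str) -> BenchReport {
    let _ = (candidate, gold);
    BenchReport { composite: 0.9, pass: true }
}

fn render_pretty(report: &BenchReport) -> String {
    format!(
        "composite: {:.2} ({})",
        report.composite,
        if report.pass { "pass" } else { "fail" }
    )
}

fn main() {
    let report = run("candidate.json", "gold.json");
    println!("{}", render_pretty(&report));
    // A real binary would now derive its exit code from `report`,
    // per the docs: the scorer returns the report, the caller decides.
}
```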