B0 M11: JSONL output schema for the eval harness.
Per-test JSONL records emitted by the Python harness in
scripts/eval/ and consumed by the Rust report aggregator
(mur agent eval report, M11.4). The on-disk shape is stable: changing
field semantics requires a new EvalRecord.schema_version.
Spec: docs/superpowers/specs/2026-05-06-b0-m11-eval-harness-design.md §6.
Structs§
- EvalHookDecision - One observation from the B0 hook chain during the test run. Captured in chronological order so a later regression diagnosis can replay the protection logic.
- EvalRecord - One test case’s result, written as a single JSONL line by the Python harness per case and parsed by mur agent eval report to build the markdown summary.
Enums§
- EvalDecision - Outcome the agent took in response to the attack, independent of whether that outcome was the “right” one (which is up to the test case’s expected field).
- EvalLlmBackend - Which model backend produced the agent response. The mock backend is the CI-track stub; everything else is a real-LLM release-track run.
- EvalSuite - Which upstream benchmark the case came from. Determines how the aggregator buckets results in its markdown report.
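The bucketing and track split described for EvalSuite and EvalLlmBackend can be illustrated with a small standalone sketch. It does not use the crate's real types: the enums `Suite` and `Backend`, their variant names, and the helper `bucket_by_suite` are all hypothetical stand-ins, since this page does not list the actual variants.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for EvalSuite; the real variant names are not shown on this page.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Suite {
    SuiteA,
    SuiteB,
}

/// Hypothetical stand-in for EvalLlmBackend; `Real` represents any non-mock backend.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Mock,
    Real,
}

impl Backend {
    /// The mock backend is the CI-track stub; anything else is a release-track run.
    pub fn is_ci_track(self) -> bool {
        matches!(self, Backend::Mock)
    }
}

/// Count cases per suite, roughly how an aggregator might bucket results.
pub fn bucket_by_suite(cases: &[(Suite, Backend)]) -> BTreeMap<Suite, usize> {
    let mut buckets = BTreeMap::new();
    for (suite, _) in cases {
        *buckets.entry(*suite).or_insert(0) += 1;
    }
    buckets
}
```

A `BTreeMap` is used here only so the buckets iterate in a deterministic order, which is convenient when rendering a report; the real aggregator may organise its buckets differently.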
Constants§
- EVAL_SCHEMA_VERSION - Schema version of the JSONL records this build emits and consumes. Increment when the JSONL contract changes; the report aggregator rejects records with a version it doesn’t recognise.
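As a rough illustration of the rejection rule, the sketch below checks a record's version against the build's constant. The `u32` type, the value `1`, the `SchemaError` type, and the `check_schema_version` function are assumptions made for this example; the crate's actual definitions may differ.

```rust
/// Assumed type and value, for illustration only; the real constant is defined in the crate.
pub const EVAL_SCHEMA_VERSION: u32 = 1;

/// Hypothetical error type for a record whose version this build does not recognise.
#[derive(Debug, PartialEq, Eq)]
pub enum SchemaError {
    UnsupportedVersion { found: u32, supported: u32 },
}

/// Accept a record only if its schema version matches the one this build emits and consumes.
pub fn check_schema_version(found: u32) -> Result<(), SchemaError> {
    if found == EVAL_SCHEMA_VERSION {
        Ok(())
    } else {
        Err(SchemaError::UnsupportedVersion {
            found,
            supported: EVAL_SCHEMA_VERSION,
        })
    }
}
```

An exact-match check is the simplest reading of "rejects records with a version it doesn’t recognise"; an aggregator that accepts a range of older versions would need a different comparison.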