Module eval

B0 M11: JSONL output schema for the eval harness.

Per-test JSONL records emitted by the Python harness in scripts/eval/ and consumed by the Rust report aggregator (mur agent eval report, M11.4). The on-disk shape is stable: changing field semantics requires a new EvalRecord.schema_version.
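As a rough illustration of the JSONL contract (one record per line), the sketch below splits an input stream into raw record lines and keeps the 1-based line number for error reporting. The function name `read_jsonl_lines` is hypothetical and not part of this module. Skipping blank lines is an assumption, and decoding each line into an EvalRecord is deliberately left out because the record's fields are not listed on this page.

```rust
use std::io::{BufRead, BufReader, Read};

/// Hypothetical helper: collects the non-empty lines of a JSONL stream.
/// Each entry is (1-based line number, raw JSON text of one record).
/// Blank lines are skipped here, which is an assumption, not documented behaviour.
pub fn read_jsonl_lines<R: Read>(reader: R) -> std::io::Result<Vec<(usize, String)>> {
    let mut records = Vec::new();
    for (idx, line) in BufReader::new(reader).lines().enumerate() {
        let line = line?; // propagate I/O and UTF-8 errors
        let trimmed = line.trim();
        if trimmed.is_empty() {
            continue;
        }
        records.push((idx + 1, trimmed.to_string()));
    }
    Ok(records)
}
```

Keeping the line number next to each raw record lets a consumer point at the offending line when a later decoding step fails.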

Spec: docs/superpowers/specs/2026-05-06-b0-m11-eval-harness-design.md §6.

Structs§

EvalHookDecision: One observation from the B0 hook chain during the test run. Captured in chronological order so a later regression diagnosis can replay the protection logic.
EvalRecord: One test case’s result. The Python harness writes one JSONL line per case, and mur agent eval report parses those lines to build the markdown summary.

Enums§

EvalDecision: Outcome the agent took in response to the attack — independent of whether that outcome was the “right” one (which is up to the test case’s expected field).
EvalLlmBackend: Which model backend produced the agent response. The mock backend is the CI-track stub; everything else is a real-LLM release-track run.
EvalSuite: Which upstream benchmark the case came from. Determines how the aggregator buckets results in its markdown report.
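The split between the mock backend (CI track) and every other backend (release track) can be sketched as a simple match. The types below are hypothetical stand-ins, not the real EvalLlmBackend: its actual variant names are not shown on this page, so `BackendSketch`, `Track`, and `track_of` are all assumptions made for illustration.

```rust
/// Hypothetical stand-in for `EvalLlmBackend`; the real variants are not shown here.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum BackendSketch {
    /// The CI-track stub backend.
    Mock,
    /// Any real-LLM backend, identified by an assumed free-form name.
    Real(String),
}

/// Hypothetical label for the two tracks described in the docs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Track {
    Ci,
    Release,
}

/// Maps a backend to its track: mock is CI, everything else is release.
pub fn track_of(backend: &BackendSketch) -> Track {
    match backend {
        BackendSketch::Mock => Track::Ci,
        BackendSketch::Real(_) => Track::Release,
    }
}
```

An exhaustive match means that adding a variant to the enum forces every such classification site to be revisited at compile time.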

Constants§

EVAL_SCHEMA_VERSION: Schema version of the JSONL records this build emits / consumes. Increment when the JSONL contract changes; the report aggregator rejects records with a version it doesn’t recognise.
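A minimal sketch of the version gate described above, assuming the version is an unsigned integer and that only an exact match is accepted. Both are assumptions: the constant's real type and value are not given on this page, so `SKETCH_SCHEMA_VERSION` stands in for EVAL_SCHEMA_VERSION, and `VersionError` and `check_schema_version` are hypothetical names.

```rust
/// Placeholder for `EVAL_SCHEMA_VERSION`; the value 1 is assumed for illustration.
pub const SKETCH_SCHEMA_VERSION: u32 = 1;

/// Hypothetical error returned when a record carries an unrecognised version.
#[derive(Debug, PartialEq, Eq)]
pub enum VersionError {
    Unsupported { found: u32, expected: u32 },
}

/// Accepts a record's version only if it matches the version this build understands.
pub fn check_schema_version(found: u32) -> Result<(), VersionError> {
    if found == SKETCH_SCHEMA_VERSION {
        Ok(())
    } else {
        Err(VersionError::Unsupported {
            found,
            expected: SKETCH_SCHEMA_VERSION,
        })
    }
}
```

Rejecting unknown versions outright, rather than attempting a best-effort parse, keeps a stale aggregator from silently misreading records whose field semantics have changed.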
