Module eval


B0 M11: JSONL output schema for the eval harness.

Per-test JSONL records are emitted by the Python harness in scripts/eval/ and consumed by the Rust report aggregator (mur agent eval report, M11.4). The on-disk shape is stable: changing field semantics requires a new EvalRecord.schema_version.

Spec: docs/superpowers/specs/2026-05-06-b0-m11-eval-harness-design.md §6.

Structs

EvalHookDecision
One observation from the B0 hook chain during the test run. Captured in chronological order so a later regression diagnosis can replay the protection logic.
EvalRecord
One test case’s result — written as a single JSONL line by the Python harness per case, parsed by mur agent eval report to build the markdown summary.

Enums

EvalDecision
Outcome the agent took in response to the attack — independent of whether that outcome was the “right” one (which is up to the test case’s expected field).
EvalLlmBackend
Which model backend produced the agent response. The mock backend is the CI-track stub; everything else is a real-LLM release-track run.
EvalSuite
Which upstream benchmark the case came from. Determines how the aggregator buckets results in its markdown report.
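The distinction between the outcome the agent took and the outcome the case expected, together with the per-suite bucketing, can be sketched as follows. This is a minimal illustration, not the aggregator's actual code: the `Decision` and `Suite` enums, their variants, the `Case` struct, and `bucket_by_suite` are all hypothetical stand-ins, since this page does not list the real variants or fields.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins: the real EvalDecision and EvalSuite variants are
// not listed on this page, so these names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Decision {
    Blocked,
    Allowed,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Suite {
    SuiteA,
    SuiteB,
}

// Hypothetical, reduced view of one record: which suite it came from,
// what the agent did, and what the test case expected.
struct Case {
    suite: Suite,
    decision: Decision,
    expected: Decision,
}

/// Returns per-suite `(passed, total)` counts. A case passes when the
/// observed decision equals the expected one; the decision alone says
/// nothing about correctness.
fn bucket_by_suite(cases: &[Case]) -> BTreeMap<Suite, (usize, usize)> {
    let mut buckets = BTreeMap::new();
    for case in cases {
        let entry = buckets.entry(case.suite).or_insert((0usize, 0usize));
        if case.decision == case.expected {
            entry.0 += 1;
        }
        entry.1 += 1;
    }
    buckets
}
```

A `BTreeMap` is used here so that the buckets iterate in a deterministic order, which is convenient when rendering a report; the real aggregator may organise this differently.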

Constants

EVAL_SCHEMA_VERSION
Schema version of the JSONL records this build emits / consumes. Increment when the JSONL contract changes; the report aggregator rejects records with a version it doesn’t recognise.
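The rejection behaviour described for EVAL_SCHEMA_VERSION might look roughly like the sketch below. The constant's real type and value are not shown on this page, so `ASSUMED_SCHEMA_VERSION` (a `u32` set to 1) and the `check_schema_version` helper are assumptions made purely for illustration.

```rust
// Assumed type and value; the real EVAL_SCHEMA_VERSION is not shown on this page.
const ASSUMED_SCHEMA_VERSION: u32 = 1;

/// Hypothetical helper: accepts a record only if its schema version
/// matches the version this build understands.
fn check_schema_version(found: u32) -> Result<(), String> {
    if found == ASSUMED_SCHEMA_VERSION {
        Ok(())
    } else {
        Err(format!(
            "unrecognised schema version {found}, expected {ASSUMED_SCHEMA_VERSION}"
        ))
    }
}
```

Under these assumptions, `check_schema_version(ASSUMED_SCHEMA_VERSION)` returns `Ok(())`, and any other version yields an `Err` carrying a message that names both versions.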