harn eval coding-agent — empirical preset/provider benchmark for a
small coding-agent fixture suite.
.harn dispatch (W7 partial port; see harn#2307)
The matrix execution pipeline (fixture resolution, model
discovery, per-cell execute_run invocation, Ollama
snapshot/cleanup, scoring, rollups, native/text comparisons, follow-up
generation, baseline diff) stays in Rust. Every cell drives the
embedded coding_agent_suite.harn driver through execute_run,
which itself reaches into VM internals (commands::run,
harn_vm::llm, commands::local::runtime) that aren’t reachable
from script-land today — the same constraint that shaped W5 / W6.
The rendering layer (the summary.md body, the followups.md
body, the one-line human stdout summary, the --json pretty form)
is delegated to
crates/harn-stdlib/src/stdlib/cli/eval/coding_agent.harn. The
Rust shim pre-serialises the assembled EvalSummary to JSON,
forwards it via CODING_AGENT_SUMMARY_ENV, dispatches four
times (markdown for summary.md, followups for followups.md,
then either the summary line or the --json pretty form for
stdout), and writes the captured payloads to disk / real stdout.
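The routing of rendered payloads described above can be sketched as a small lookup. This is an illustrative model only, not the shim's actual code: the `RenderMode` and `Sink` types and the `render_plan` function are hypothetical names, and the sketch covers only the render targets and destinations the paragraph names.

```rust
/// Render modes named in the text. The variant names are hypothetical;
/// only "markdown" and "followups" appear verbatim in the description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RenderMode {
    Markdown,
    Followups,
    SummaryLine,
    JsonPretty,
}

/// Where the Rust side writes a captured payload (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Sink {
    File(&'static str),
    Stdout,
}

/// Pair each render mode with its destination. `json` stands in for
/// the `--json` flag: it selects which form goes to stdout.
pub fn render_plan(json: bool) -> Vec<(RenderMode, Sink)> {
    let stdout_mode = if json {
        RenderMode::JsonPretty
    } else {
        RenderMode::SummaryLine
    };
    vec![
        (RenderMode::Markdown, Sink::File("summary.md")),
        (RenderMode::Followups, Sink::File("followups.md")),
        (stdout_mode, Sink::Stdout),
    ]
}
```

Keeping the plan as plain data separates the question of what gets rendered from how the script is invoked, so the destinations can be checked without running the VM.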
The on-disk JSON artifacts (summary.json, per_run.jsonl,
local_readiness.json) stay on the serde-driven Rust path because
Harn’s json_stringify_pretty sorts dict keys alphabetically and
the on-disk format is consumed by the experiment driver in
experiments/step-judge/run.sh, the local-readiness regression
check, and hosted ingestion — all of which depend on the serde
struct-field byte order.
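To see why key ordering matters for byte-exact consumers, the following standard-library sketch renders the same fields once in declaration order and once sorted alphabetically. It uses neither serde nor Harn, and the field names in the usage note are made up for illustration; it only demonstrates that the two orderings produce different bytes for identical data.

```rust
use std::collections::BTreeMap;

/// Render key/value pairs as a compact JSON object, in the order given.
/// Keys are assumed to need no escaping (illustration only).
fn render_object<'a>(pairs: impl IntoIterator<Item = (&'a str, u64)>) -> String {
    let body: Vec<String> = pairs
        .into_iter()
        .map(|(k, v)| format!("\"{k}\":{v}"))
        .collect();
    format!("{{{}}}", body.join(","))
}

/// Keeps the caller's order, analogous to struct-field order.
pub fn declaration_order(fields: &[(&str, u64)]) -> String {
    render_object(fields.iter().copied())
}

/// Sorts keys alphabetically: a BTreeMap iterates in key order.
pub fn alphabetical_order(fields: &[(&str, u64)]) -> String {
    let sorted: BTreeMap<&str, u64> = fields.iter().copied().collect();
    render_object(sorted)
}
```

For the hypothetical fields `[("total_runs", 4), ("passed", 3), ("failed", 1)]`, `declaration_order` yields `{"total_runs":4,"passed":3,"failed":1}` while `alphabetical_order` yields `{"failed":1,"passed":3,"total_runs":4}`. The two parse to the same object but are not byte-identical, which is enough to break a consumer that diffs or hashes the file.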
HARN_CLI_IMPL=rust keeps the legacy direct-render path for the
parity-snapshot harness (#2299) until the C1 ratchet (#2314) lands.
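A minimal sketch of how such an environment switch might be read follows. The `RenderImpl` enum and both function names are hypothetical. The exact matching rules are also assumptions: here the match is case-sensitive, and anything other than `rust` falls through to the .harn path.

```rust
/// Which rendering path to use (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RenderImpl {
    /// Delegate rendering to the .harn script.
    Harn,
    /// Legacy direct-render path kept for the parity-snapshot harness.
    LegacyRust,
}

/// Pure selection logic, split out so it can be tested without
/// touching the process environment.
pub fn select_impl(value: Option<&str>) -> RenderImpl {
    match value {
        Some("rust") => RenderImpl::LegacyRust,
        // Assumption: unset or unrecognised values use the .harn path.
        _ => RenderImpl::Harn,
    }
}

/// Read HARN_CLI_IMPL from the environment and select a path.
pub fn select_impl_from_env() -> RenderImpl {
    select_impl(std::env::var("HARN_CLI_IMPL").ok().as_deref())
}
```

Splitting the decision from the environment read keeps the override easy to exercise in tests, which is useful for a switch that exists mainly to support a snapshot harness.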