
Module eval_coding_agent


harn eval coding-agent — empirical preset/provider benchmark for a small coding-agent fixture suite.

.harn dispatch (W7 partial port; see harn#2307)

The matrix execution pipeline (fixture resolution, model discovery, per-cell execute_run invocation, Ollama snapshot/cleanup, scoring, rollups, native/text comparisons, follow-up generation, baseline diff) stays in Rust. Every cell drives the embedded coding_agent_suite.harn driver through execute_run, which itself reaches into VM internals (commands::run, harn_vm::llm, commands::local::runtime) that aren’t reachable from script-land today. This is the same constraint that shaped W5 / W6.

The rendering layer (the summary.md body, the followups.md body, the one-line human stdout summary, the --json pretty form) is delegated to crates/harn-stdlib/src/stdlib/cli/eval/coding_agent.harn. The Rust shim pre-serialises the assembled EvalSummary to JSON, forwards it via CODING_AGENT_SUMMARY_ENV, dispatches four times (markdown for summary.md, followups for followups.md, then either the summary line or the --json pretty form for stdout), and writes the captured payloads to disk / real stdout.

The on-disk JSON artifacts (summary.json, per_run.jsonl, local_readiness.json) stay on the serde-driven Rust path because Harn’s json_stringify_pretty sorts dict keys alphabetically and the on-disk format is consumed by the experiment driver in experiments/step-judge/run.sh, the local-readiness regression check, and hosted ingestion — all of which depend on the serde struct-field byte order.
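The key-ordering concern above can be illustrated with a minimal, std-only sketch. It is not the module's code: the field names (total_runs, passed, failed) are hypothetical and do not come from EvalSummary, and the hand-rolled renderer only stands in for a real serializer. It shows that emitting the same fields in declaration order (as a derived serde Serialize impl does for struct fields) and in alphabetical order (as a key-sorting pretty-printer would) yields different bytes, which is enough to break a consumer that compares or parses by byte position.

```rust
use std::collections::BTreeMap;

/// Renders key/value pairs as a compact JSON object in the order given.
/// Values are plain integers here, so no string escaping is needed.
fn render_object<'a>(fields: impl IntoIterator<Item = (&'a str, i64)>) -> String {
    let body: Vec<String> = fields
        .into_iter()
        .map(|(key, value)| format!("\"{key}\":{value}"))
        .collect();
    format!("{{{}}}", body.join(","))
}

/// Declaration order: the order in which the (hypothetical) struct fields are listed.
fn declaration_order() -> String {
    render_object([("total_runs", 4), ("passed", 3), ("failed", 1)])
}

/// Alphabetical order: a BTreeMap iterates its keys sorted, mimicking a
/// serializer that sorts dict keys before writing them.
fn sorted_order() -> String {
    let sorted: BTreeMap<&str, i64> =
        [("total_runs", 4), ("passed", 3), ("failed", 1)].into_iter().collect();
    render_object(sorted)
}
```

Both functions describe the same data, yet the two strings differ, so a byte-for-byte snapshot or diff taken against one ordering would fail against the other.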

HARN_CLI_IMPL=rust keeps the legacy direct-render path for the parity-snapshot harness (#2299) until the C1 ratchet (#2314) lands.
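A switch of this kind might be read as in the sketch below. Only the variable name HARN_CLI_IMPL and the value rust come from the description above; the RenderPath enum, both function names, and the rule that any other value (or an unset variable) selects the .harn path are assumptions made for illustration, not the module's actual logic.

```rust
/// Hypothetical enum naming the two rendering paths described above.
#[derive(Debug, PartialEq)]
enum RenderPath {
    /// Rendering delegated to the .harn script.
    HarnScript,
    /// Legacy direct rendering in Rust.
    LegacyRust,
}

/// Pure selection logic, kept separate from the environment lookup so it
/// can be tested without mutating process state.
/// Assumption: only the exact value "rust" selects the legacy path.
fn select_render_path(cli_impl: Option<&str>) -> RenderPath {
    match cli_impl {
        Some("rust") => RenderPath::LegacyRust,
        _ => RenderPath::HarnScript,
    }
}

/// Reads HARN_CLI_IMPL from the process environment and applies the rule above.
/// An unset or non-UTF-8 value is treated as "not set".
fn render_path_from_env() -> RenderPath {
    let value = std::env::var("HARN_CLI_IMPL").ok();
    select_render_path(value.as_deref())
}
```

Splitting the lookup from the decision keeps the decision deterministic, which suits a parity-snapshot harness that needs to exercise both paths in one test run.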

Functions

run