
Module eval_coding_agent

harn eval coding-agent - empirical preset/provider benchmark for a small coding-agent fixture suite.

§Dispatch boundary

The matrix execution pipeline (fixture resolution, model discovery, per-cell execute_run invocation, Ollama snapshot/cleanup, scoring, rollups, native/text comparisons, follow-up generation, baseline diff) stays in Rust. Every cell drives the embedded coding_agent_suite.harn driver through execute_run, which itself reaches into VM internals (commands::run, harn_vm::llm, commands::local::runtime) that are not exposed as script capabilities.

The rendering layer (the summary.md body, the followups.md body, the one-line human stdout summary, the --json pretty form) is delegated to crates/harn-stdlib/src/stdlib/cli/eval/coding_agent.harn. The Rust shim pre-serialises the assembled EvalSummary to JSON, forwards it via [CODING_AGENT_SUMMARY_ENV], dispatches four times (markdown for summary.md, followups for followups.md, then either the summary line or the --json pretty form for stdout), and writes the captured payloads to disk / real stdout.
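As a rough illustration of the routing described above, the sketch below maps each named render output to its destination. It models only the three destinations the paragraph names and does not try to account for the stated count of four dispatches. `RenderMode`, `Sink`, and `plan_dispatches` are hypothetical names invented for this example; they are not taken from the shim's source.

```rust
/// Hypothetical render modes mirroring the outputs named in the prose.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RenderMode {
    Markdown,
    Followups,
    SummaryLine,
    JsonPretty,
}

/// Where a captured payload is written (assumed shape, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Sink {
    File(&'static str),
    Stdout,
}

/// Pair each render mode with its destination. `json` stands in for the
/// `--json` flag: it swaps the stdout payload from the one-line summary
/// to the pretty JSON form.
fn plan_dispatches(json: bool) -> Vec<(RenderMode, Sink)> {
    let stdout_mode = if json {
        RenderMode::JsonPretty
    } else {
        RenderMode::SummaryLine
    };
    vec![
        (RenderMode::Markdown, Sink::File("summary.md")),
        (RenderMode::Followups, Sink::File("followups.md")),
        (stdout_mode, Sink::Stdout),
    ]
}

fn main() {
    for (mode, sink) in plan_dispatches(false) {
        println!("{:?} -> {:?}", mode, sink);
    }
}
```

Keeping the plan as plain data separates the choice of what to render from the act of dispatching, so the routing can be checked without running the script.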

The on-disk JSON artifacts (summary.json, per_run.jsonl, local_readiness.json) stay on the serde-driven Rust path. Harn’s json_stringify_pretty sorts dict keys alphabetically, whereas the on-disk format is consumed by the experiment driver in experiments/step-judge/run.sh, the local-readiness regression check, and hosted ingestion, all of which depend on the byte order of the serde struct fields.
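The key-ordering concern can be seen in a minimal, std-only sketch. The field names and values below are invented for illustration and are not the real EvalSummary fields. The hand-rolled serialisers only stand in for serde's declaration-order output and for an alphabetically sorting stringifier.

```rust
use std::collections::BTreeMap;

/// Emit fields in the order given, standing in for struct-field order.
fn to_json_in_order(fields: &[(&str, u32)]) -> String {
    let body: Vec<String> = fields
        .iter()
        .map(|(k, v)| format!("\"{}\":{}", k, v))
        .collect();
    format!("{{{}}}", body.join(","))
}

/// Emit the same fields with keys sorted alphabetically.
fn to_json_sorted(fields: &[(&str, u32)]) -> String {
    // A BTreeMap iterates its keys in sorted order.
    let sorted: BTreeMap<&str, u32> = fields.iter().copied().collect();
    let body: Vec<String> = sorted
        .iter()
        .map(|(k, v)| format!("\"{}\":{}", k, v))
        .collect();
    format!("{{{}}}", body.join(","))
}

fn main() {
    // Hypothetical fields, declared in a non-alphabetical order.
    let fields = [("total_runs", 12), ("passed", 9), ("failed", 3)];
    println!("{}", to_json_in_order(&fields)); // {"total_runs":12,"passed":9,"failed":3}
    println!("{}", to_json_sorted(&fields)); // {"failed":3,"passed":9,"total_runs":12}
}
```

Both strings describe the same object, but their bytes differ, which is enough to break a consumer that compares or diffs the raw file contents.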

HARN_CLI_IMPL=rust keeps the direct-render path available for parity snapshot coverage.
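A minimal sketch of such a switch follows. It assumes the variable is compared exactly against the string rust and that any other value, or an unset variable, selects the script-backed renderer. `RenderPath` and `select_render_path` are hypothetical names, and the real matching rules may differ.

```rust
/// Hypothetical selector for the two rendering paths.
#[derive(Debug, PartialEq, Eq)]
enum RenderPath {
    /// Rendering delegated to the Harn script.
    HarnScript,
    /// Direct rendering in Rust, kept for parity snapshots.
    DirectRust,
}

/// Choose a path from the raw value of HARN_CLI_IMPL, if set.
/// Assumption: only the exact value "rust" selects the direct path.
fn select_render_path(cli_impl: Option<&str>) -> RenderPath {
    match cli_impl {
        Some("rust") => RenderPath::DirectRust,
        _ => RenderPath::HarnScript,
    }
}

fn main() {
    // Read the variable once and pass it in, so the selector stays pure.
    let value = std::env::var("HARN_CLI_IMPL").ok();
    let path = select_render_path(value.as_deref());
    println!("{:?}", path);
}
```

Passing the value in as an argument, rather than reading the environment inside the selector, lets a parity test exercise both paths without mutating process state.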

Functions§

run