# car-builder NL→workflow eval set
Fixtures + live harness for Part 1 of
`docs/proposals/h2-builder-discovery-acceptance.md` (acceptance spec v2).
The fixtures and their pure shape validator landed first (the spec's
define-done-first mandate, enforced against self-grading); the live
pass-rate harness landed with the Part 1 implementation.
## Files
- `nl2workflow.jsonl` — 30 NL requests in 3 difficulty bands.
- `../tests/eval_fixtures.rs` — pure loader + validator (band counts, unique
ids, `refusal_ok` band scoping, repair-forcing floor, well-formed `must_have`).
No live model calls.
- `../tests/live_nl2workflow_eval.rs` — the secret-gated live harness (the
acceptance gate) plus non-live unit tests for its deterministic core.
## Running the live eval (the acceptance gate)
The live suite is `#[ignore]`d — it hits a real model API and spends tokens.
It needs `ANTHROPIC_API_KEY` (the pinned model's remote path); without it the
test prints `[SKIP]` with these instructions and exits.
```bash
ANTHROPIC_API_KEY=sk-... cargo test -p car-builder --test live_nl2workflow_eval \
-- --ignored --nocapture
```
Eval config (shipped defaults, all printed by the run — no test-only knobs):
- **Model pinned** to `anthropic/claude-sonnet-4-6:latest`, routed explicitly
(never through the adaptive router, so a router change can't silently change
the eval). Override with `CAR_BUILDER_EVAL_MODEL=<catalog id>` — **changing
the pin or overriding it is a spec-visible event**: record it in the h2
acceptance doc; numbers are only comparable against the same pin.
- Temperature 0, `max_tokens` 4096, `max_attempts` 3 (the daemon
`builder.build` default), N=3 repeat runs with **worst-run counts** asserted.
- Catalog: the seven commodity built-in tools the engine registers
(`read_file`, `list_dir`, `find_files`, `grep_files`, `calculate`,
`write_file`, `edit_file`) + the engine's unified model registry — the same
catalog shape the daemon's `builder.build` assembles.
Pass per case = manifest validates via `car-workflow` AND
`car_verify::verify_workflow_graph` reports no structural defects AND
`must_have` holds. A typed refusal passes only on `refusal_ok: true` cases,
capped at 40% of the band; an empty goal is refused before any model call
(mirroring the daemon's `missing 'goal'` guard). The run asserts the targets
below, so a green live run IS Part 1 acceptance.
The harness's deterministic pieces (must_have checker, refusal cap,
worst-of-3 aggregation, workflow→graph bridge, repair statistics) are
unit-tested without a model by the same file's non-ignored tests:
```bash
cargo test -p car-builder --test live_nl2workflow_eval
```
## Case schema
```
{
"id": "...",
"must_have": {
"stage_count": [min, max], // inclusive range
"required_tools": [...], // real commodity tool names
"required_edge_conditions": [...] // substrings expected in edge condition keys/operators
},
"refusal_ok": true|false, // band-3 (adversarial) only
"repair_forcing": true, // seeded to fail first-pass generation
"notes": "..." // repair_forcing cases state HOW they force failure
}
```
Bands: 10 `single_stage`, 10 `multi_stage` (conditional edges), 10 `adversarial`
(underspecified). 6 cases are `repair_forcing` (≥5 required).
## Pass targets (from the acceptance spec, verbatim)
> **Targets** (pinned model): ≥ 90% band 1, ≥ 80% band 2, ≥ 70% band 3
> (pass = manifest validates via `car-workflow` AND `verify_workflow_graph`
> reports no structural defects AND `must_have` holds; band-3 refusals per the
> labels). ≤ 1 repair iteration median. Zero panics/unparseable outputs (typed
> failure only).
> a typed refusal passes only where labeled, and the harness enforces a
> **refusal cap of 40%** on band 3 overall (an always-refuse builder must not
> ace the hardest band). At least **5 cases across bands are seeded to fail
> first-pass generation** (assertion-verified) so the repair loop is exercised:
> the harness asserts repair recovers ≥ 4 of 5 — otherwise "repair-loop
> statistics" measures nothing.
> **Determinism discipline**: model id pinned in the eval config (a router
> change must not silently change the eval), temperature 0, **N=3 repeat runs
> with worst-run counts reported**.