# car-builder NL→workflow eval set

Fixtures and live harness for Part 1 of
`docs/proposals/h2-builder-discovery-acceptance.md` (acceptance spec v2).
The fixtures and their pure shape validator landed first, as the spec's
define-done-first mandate requires (a guard against self-grading); the live
pass-rate harness landed with the Part 1 implementation.

## Files

- `nl2workflow.jsonl` — 30 NL requests in 3 difficulty bands.
- `../tests/eval_fixtures.rs` — pure loader + validator (band counts, unique
  ids, `refusal_ok` band scoping, repair-forcing floor, well-formed `must_have`).
  No live model calls.
- `../tests/live_nl2workflow_eval.rs` — the secret-gated live harness (the
  acceptance gate) plus non-live unit tests for its deterministic core.

## Running the live eval (the acceptance gate)

The live suite is `#[ignore]`d because it hits a real model API and spends
tokens. It needs `ANTHROPIC_API_KEY` (the pinned model is reached over its
remote path); without the key, the test prints `[SKIP]` with these
instructions and exits.

```bash
ANTHROPIC_API_KEY=sk-... cargo test -p car-builder --test live_nl2workflow_eval \
    -- --ignored --nocapture
```

Eval config (shipped defaults, all printed by the run — no test-only knobs):

- **Model pinned** to `anthropic/claude-sonnet-4-6:latest`, routed explicitly
  (never through the adaptive router, so a router change can't silently change
  the eval). Override with `CAR_BUILDER_EVAL_MODEL=<catalog id>` — **changing
  the pin or overriding it is a spec-visible event**: record it in the h2
  acceptance doc; numbers are only comparable against the same pin.
- Temperature 0, `max_tokens` 4096, `max_attempts` 3 (the daemon
  `builder.build` default), N=3 repeat runs with **worst-run counts** asserted.
- Catalog: the seven commodity built-in tools the engine registers
  (`read_file`, `list_dir`, `find_files`, `grep_files`, `calculate`,
  `write_file`, `edit_file`) + the engine's unified model registry — the same
  catalog shape the daemon's `builder.build` assembles.
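
As a rough illustration of the shipped defaults above, the config could be modeled like this. It is a minimal sketch, not the harness's actual types: `EvalConfig`, `with_override`, and `from_env` are hypothetical names, and only the model id, the override variable, and the numeric defaults come from this README.

```rust
use std::env;

/// Model id the eval is pinned to (taken from this README).
pub const PINNED_MODEL: &str = "anthropic/claude-sonnet-4-6:latest";
/// Environment variable that overrides the pin (taken from this README).
pub const MODEL_OVERRIDE_VAR: &str = "CAR_BUILDER_EVAL_MODEL";

/// Hypothetical config struct; the real harness may shape this differently.
#[derive(Debug, Clone, PartialEq)]
pub struct EvalConfig {
    pub model: String,
    /// True when the pin was overridden, so the run can print it loudly.
    pub model_overridden: bool,
    pub temperature: f32,
    pub max_tokens: u32,
    pub max_attempts: u32,
    pub repeat_runs: u32,
}

impl EvalConfig {
    /// Builds the defaults, applying an optional model override.
    pub fn with_override(override_id: Option<String>) -> Self {
        let model_overridden = override_id.is_some();
        EvalConfig {
            model: override_id.unwrap_or_else(|| PINNED_MODEL.to_string()),
            model_overridden,
            temperature: 0.0,
            max_tokens: 4096,
            max_attempts: 3,
            repeat_runs: 3,
        }
    }

    /// Reads the override from the environment; unset means "use the pin".
    pub fn from_env() -> Self {
        Self::with_override(env::var(MODEL_OVERRIDE_VAR).ok())
    }
}
```

Keeping the environment lookup in `from_env` and the defaults in `with_override` lets the defaults be tested without touching process state.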

Pass per case = manifest validates via `car-workflow` AND
`car_verify::verify_workflow_graph` reports no structural defects AND
`must_have` holds. A typed refusal passes only on `refusal_ok: true` cases,
capped at 40% of the band; an empty goal is refused before any model call
(mirroring the daemon's `missing 'goal'` guard). The run asserts the targets
below, so a green live run IS Part 1 acceptance.
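
The per-case rule and the refusal cap can be sketched as follows. This is an illustration under stated assumptions, not the harness's code: `Outcome`, `case_passes`, and `within_refusal_cap` are hypothetical names, and the cap is assumed to compare the number of refusals against 40% of the band size.

```rust
/// Outcome of one eval case in one run (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    /// The builder produced a manifest; each flag is one pass condition.
    Manifest {
        validates: bool,
        graph_clean: bool,
        must_have_holds: bool,
    },
    /// The builder returned a typed refusal.
    Refusal,
}

/// A manifest passes only if all three checks hold; a refusal passes only
/// on cases labeled `refusal_ok: true`.
pub fn case_passes(outcome: Outcome, refusal_ok: bool) -> bool {
    match outcome {
        Outcome::Manifest {
            validates,
            graph_clean,
            must_have_holds,
        } => validates && graph_clean && must_have_holds,
        Outcome::Refusal => refusal_ok,
    }
}

/// True when refusals stay within 40% of the band (integer arithmetic,
/// so there is no floating-point rounding at the boundary).
pub fn within_refusal_cap(refusals: usize, band_size: usize) -> bool {
    refusals * 100 <= band_size * 40
}
```

With a 10-case band, 4 refusals sit exactly at the cap and 5 exceed it.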

The harness's deterministic pieces (must_have checker, refusal cap,
worst-of-3 aggregation, workflow→graph bridge, repair statistics) are
unit-tested without a model by the same file's non-ignored tests:

```bash
cargo test -p car-builder --test live_nl2workflow_eval
```
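
For orientation, worst-of-3 aggregation might look like the sketch below. The function name and the array-of-three-bands layout are assumptions, and so is the reading of "worst-run counts" as a per-band minimum across runs (which is at least as strict as picking one worst run).

```rust
/// Per-band pass counts for one run: [single_stage, multi_stage, adversarial].
pub type BandCounts = [usize; 3];

/// Returns the per-band minimum across all runs, or `None` if there are no runs.
pub fn worst_run_counts(runs: &[BandCounts]) -> Option<BandCounts> {
    let mut iter = runs.iter();
    // Start from the first run; an empty slice has no worst run.
    let mut worst = *iter.next()?;
    for run in iter {
        for (w, &count) in worst.iter_mut().zip(run.iter()) {
            *w = (*w).min(count);
        }
    }
    Some(worst)
}
```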

## Case schema

```
{
  "id": "...",
  "band": "single_stage" | "multi_stage" | "adversarial",
  "nl_request": "...",
  "must_have": {
    "stage_count": [min, max],      // inclusive range
    "required_tools": [...],         // real commodity tool names
    "required_edge_conditions": [...] // substrings expected in edge condition keys/operators
  },
  "refusal_ok": true|false,          // band-3 (adversarial) only
  "repair_forcing": true,            // seeded to fail first-pass generation
  "notes": "..."                     // repair_forcing cases state HOW they force failure
}
```

Bands: 10 `single_stage`, 10 `multi_stage` (conditional edges), 10 `adversarial`
(underspecified). Six cases are `repair_forcing`; the validator requires at
least 5.
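
The shape checks on the fixture set can be sketched in plain Rust. This is a simplified illustration, not the code in `eval_fixtures.rs`: the `Case` struct below is a hypothetical, already-parsed subset of the schema, and "well-formed `must_have`" is reduced to an assumed `min <= max` check on `stage_count`.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Band {
    SingleStage,
    MultiStage,
    Adversarial,
}

/// Hypothetical parsed case holding only the fields the checks need.
#[derive(Debug, Clone)]
pub struct Case {
    pub id: String,
    pub band: Band,
    pub stage_count: (u32, u32),
    pub refusal_ok: bool,
    pub repair_forcing: bool,
}

/// Checks unique ids, `refusal_ok` scoping, stage ranges, band counts
/// (10 each), and the repair-forcing floor (at least 5).
pub fn validate(cases: &[Case]) -> Result<(), String> {
    let mut ids = HashSet::new();
    let mut per_band = [0usize; 3];
    let mut repair_forcing = 0usize;

    for c in cases {
        if !ids.insert(c.id.as_str()) {
            return Err(format!("duplicate id: {}", c.id));
        }
        if c.refusal_ok && c.band != Band::Adversarial {
            return Err(format!("{}: refusal_ok outside adversarial band", c.id));
        }
        let (min, max) = c.stage_count;
        if min > max {
            return Err(format!("{}: stage_count min exceeds max", c.id));
        }
        per_band[c.band as usize] += 1;
        if c.repair_forcing {
            repair_forcing += 1;
        }
    }

    if per_band != [10, 10, 10] {
        return Err(format!("band counts {:?}, expected 10 each", per_band));
    }
    if repair_forcing < 5 {
        return Err(format!("{} repair_forcing cases, need at least 5", repair_forcing));
    }
    Ok(())
}
```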

## Pass targets (from the acceptance spec, verbatim)

> **Targets** (pinned model): ≥ 90% band 1, ≥ 80% band 2, ≥ 70% band 3
> (pass = manifest validates via `car-workflow` AND `verify_workflow_graph`
> reports no structural defects AND `must_have` holds; band-3 refusals per the
> labels). ≤ 1 repair iteration median. Zero panics/unparseable outputs (typed
> failure only).

> a typed refusal passes only where labeled, and the harness enforces a
> **refusal cap of 40%** on band 3 overall (an always-refuse builder must not
> ace the hardest band). At least **5 cases across bands are seeded to fail
> first-pass generation** (assertion-verified) so the repair loop is exercised:
> the harness asserts repair recovers ≥ 4 of 5 — otherwise "repair-loop
> statistics" measures nothing.

> **Determinism discipline**: model id pinned in the eval config (a router
> change must not silently change the eval), temperature 0, **N=3 repeat runs
> with worst-run counts reported**.
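
The numeric targets quoted above can be checked with a few lines of integer arithmetic. The sketch below is illustrative only: the function names are hypothetical, the band order is assumed to be band 1, band 2, band 3, and treating an empty repair sample as a failure is an assumption rather than something the spec states.

```rust
/// True when `passed / total` is at least `min_percent` percent.
pub fn meets_rate(passed: usize, total: usize, min_percent: usize) -> bool {
    total > 0 && passed * 100 >= total * min_percent
}

/// Median of the repair-iteration counts, or `None` for an empty sample.
pub fn median(values: &[u32]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_unstable();
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 1 {
        f64::from(sorted[mid])
    } else {
        // Even length: average the two middle values.
        (f64::from(sorted[mid - 1]) + f64::from(sorted[mid])) / 2.0
    })
}

/// Applies the 90/80/70 band targets to worst-run counts and requires a
/// median of at most 1 repair iteration.
pub fn targets_met(worst: [usize; 3], band_size: usize, repair_iterations: &[u32]) -> bool {
    meets_rate(worst[0], band_size, 90)
        && meets_rate(worst[1], band_size, 80)
        && meets_rate(worst[2], band_size, 70)
        // Assumption: no repair data at all counts as not meeting the target.
        && median(repair_iterations).map_or(false, |m| m <= 1.0)
}
```

With 10 cases per band, the targets translate to worst-run counts of at least 9, 8, and 7.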