yardlet 0.8.0

Yardlet: a local AI workbench. Plan, queue, route, validate, and hand off long-running work using your already-installed Codex and Claude Code CLIs as hidden workers.
# Worker Routing + Telemetry Loop — Design

> Status: implemented (all four phases)
> Decides: which worker (Codex CLI / Claude Code CLI) runs each task, and how
> that decision stays correct as models, costs, and task mixes change.

## Problem

Yardlet routes work between two subscription-backed CLIs. The "right" worker
varies three ways, so any fixed rule rots:

- by **task** (a tight diff edit vs a multi-file refactor want different engines),
- over **time** (relative strength flips as the CLIs/models evolve),
- by **cost** (the user's tolerance for the pricier engine).

Today's gaps: the planner gets no guidance so it picks one worker for
everything; `routing.implementation` is dead config; there is no run-time
readiness fallback.

## Principles (from a survey of modern routers)

1. **Two-layer binding.** Plan-time picks the *intended* worker per task
   (capability/role). Run-time validates readiness and falls back. This is the
   production consensus (OpenRouter, Claude Code, LiteLLM).
2. **No learned router.** With two workers there is nothing to train. The
   planner LLM *is* the router; it only needs a rubric.
3. **Policy in config, mechanism in code.** Which worker suits which task and
   the cost dial are policy (editable; the planner reads them). Readiness
   checks, the fallback walk, telemetry, and the execution gate are mechanism
   (code).
4. **Express guidance in model-independent terms.** Route on task
   characteristics + a cost dial, never on "Claude > Codex" constants, so the
   rubric survives model changes.
5. **Hard capability rules beat rubric preferences.** If one worker has a
   concrete capability the other lacks, make that a deterministic mechanism,
   not a planner preference. Current rule: image/asset generation routes to
   Codex; if Codex is not ready, do not fall back to Claude Code.
6. **Telemetry never binds at run-time.** Run-time stays deterministic and
   auditable. Telemetry only produces *suggestions* to update the policy, which
   a human approves.

## What rots, and who maintains it

Not everything rots at the same rate — split it:

| Fact | Maintained by | Cadence |
| --- | --- | --- |
| Concrete model version a CLI uses | the CLI itself | automatic; Yardlet never tracks it |
| Relative strength per task type | **Yardlet telemetry → human-approved suggestion** | occasional |
| `cost_bias` (cheap vs quality) | the human (a preference) | rarely; never automated |

Only the middle row needs a loop.

## Policy surface: worker profiles (editable config)

Extend `.agents/workers.yaml` (the worker SOT). New, human-editable fields:

```yaml
workers:
  - id: codex
    # ...existing invocation/limits...
    best_for: "image/asset generation, issue-to-patch implementation, test-driven bugfixes, shell-heavy build/test/debug loops, visual UI implementation, mechanical transforms, schema/format constrained output"
    cost_weight: low
  - id: claude-code
    # ...existing...
    best_for: "ambiguity reduction, acceptance criteria, PRDs/strategy briefs, evidence synthesis, long-form writing/editing, broad exploration, architecture planning, policy-bound reasoning"
    cost_weight: high

routing:
  cost_bias: balanced          # cheap | balanced | quality   (human dial)
  default_worker: codex
  fallback_order: [codex, claude-code]
  planning_gate: { primary: claude-code, fallback: codex }
```

`best_for` is written as task characteristics, not model names. Updating it when
trends shift is a one-line edit — no code change, no retrain.

## Layer 1 — Plan-time selection (guided planner)

The planning packet gains a **Worker selection** section listing each worker's
`best_for` + the current `cost_bias`, instructing the planner: for each task,
emit `preferred_worker` *and* a one-line `worker_rationale`, weighing the task's
characteristics against `best_for` and the cost dial.

`planning-result.json` task shape gains:

```json
{ "preferred_worker": "codex", "worker_rationale": "tight single-file edit; cheap worker per profile" }
```

Yardlet stores `preferred_worker` on the task (already does) and keeps
`worker_rationale` for the audit log. This fixes the "picks codex for
everything" bug: the judge now has a rubric and its choice is explainable.

## Layer 2 — Run-time resolution (deterministic, code)

```
resolve_worker(task, routing) -> (worker_id, reason):
  candidate = run_override ?? hard_capability_rule ?? task.preferred_worker ?? routing.default_worker
  if probe(candidate).ready:        return (candidate, "preferred")
  if hard_capability_rule:          error "required worker not ready"
  for w in routing.fallback_order:
      if w != candidate and probe(w).ready:
          return (w, "fallback: {candidate} not ready")
  error "no ready worker"           # surfaced as a hard stop / NeedsUser
```

Run-time does **not** consult telemetry (keeps it predictable). The chosen
worker + reason + the planner's rationale are recorded on the run.

## The telemetry loop

### Collect (mechanism)

On every completed run, append one line to `.agents/telemetry/runs.jsonl`
(append-only):

```json
{
  "ts": "...", "task_id": "YARD-002", "kind": "implementation", "risk": "medium",
  "worker": "codex", "chosen_reason": "preferred",
  "result_status": "done", "eval_state": "done",
  "wall_seconds": 95, "retries": 0,
  "user_override": null            // e.g. "codex->claude" when the user forced the other
}
```

All of these already exist in `run.yaml` / `evaluation.json`; this is a compact
projection for analysis.

### Aggregate + suggest (mechanism; output is advice, not action)

`yardlet routing review` reads the telemetry and aggregates per `(kind, worker)`:
success rate (`eval_state == done` / total), avg wall time, override count.
Thresholds (config-tunable) turn deltas into suggestions:

- non-preferred worker's success rate beats the default by margin `M` over `>= K`
  samples for a kind → suggest flipping that kind's preference,
- override rate for a kind exceeds a threshold → suggest aligning the profile
  with what the user keeps choosing,
- a worker's success drops sharply after its CLI version changed → flag it.

Output: human-readable findings **plus a proposed diff to `workers.yaml`**. It
never edits config itself.

### Human gate (policy change)

Applying a suggestion edits `best_for` / `default_worker` / `fallback_order` —
that is a policy change, so a human approves it (`yardlet routing apply` stages the
diff for confirmation, or you edit the file). This matches Yardlet's "human
approval for shared-state changes" rule and the observe → candidate → review →
promote learning lifecycle.

### Triggers (event-based, not calendar)

- `yardlet status` shows a one-line nudge when suggestions are pending
  (`routing: 2 suggestions — run \`yardlet routing review\``).
- a worker's CLI version change (Yardlet stores last-seen versions) flags a
  re-evaluation,
- an override spike for a kind.

### Cold start

Early on there is little data, so suggestions stay silent until `>= K` samples
per kind accrue. The profile is human-seeded (the `best_for` defaults above);
the loop refines it once real runs exist. `cost_bias` is always manual.

## Loop

```
profile (seeded)
   -> planner reads rubric -> task.preferred_worker + rationale
   -> run-time resolve: preferred -> ready? -> fallback        (deterministic)
   -> execute -> evaluate
   -> append telemetry (kind, worker, outcome, override)
   -> aggregate -> thresholds -> suggestion (proposed profile diff)
   -> human reviews/approves -> profile updated
   ^________________________________________________________|
```

## Policy vs mechanism (explicit)

- **Mechanism (code):** readiness probe, hard capability rules, fallback walk,
  telemetry collection, aggregation + thresholds, suggestion computation,
  version-change detection, the worker execution gate.
- **Policy (editable config / human):** `best_for`, `cost_bias`,
  `default_worker`, `fallback_order`, the thresholds, and the planner's per-task
  choice derived from the rubric. Applying a suggestion is a human-gated policy
  edit.

## Implementation phases

1. **Run-time fallback + telemetry collection** (mechanism). `resolve_worker`
   with readiness + `fallback_order`; append `runs.jsonl`. Fixes the missing
   fallback immediately and starts gathering data. Low risk.
2. **Guided planner.** `best_for` + `cost_bias` into the planning packet;
   `worker_rationale` in the result. Fixes the codex-default bug.
3. **Aggregate + `yardlet routing review`.** Suggestions, the status nudge,
   version-change flag.
4. **`yardlet routing apply` + override detection.** Assisted, human-confirmed
   profile edits.

Phase 1 is shippable on its own and is the highest-value first slice.