yarli 0.6.1

CLI, stream mode renderer, interactive TUI, scheduler, store, and API
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
# YARLI Operations

This runbook covers the baseline local operator workflow for durable mode, migrations, and test execution.
For final verification and release acceptance decisions, use `docs/ACCEPTANCE_RUBRIC.md`.

For an exhaustive, command-by-command CLI usage guide, see `docs/CLI.md`.
For memory behavior (when YARLI stores/queries memories), see `docs/MEMORY_POLICY.md`.

## Requirements

- Rust toolchain compatible with workspace `rust-version` (currently 1.75+).
- PostgreSQL accessible from the local machine.
- `psql` client for manual migration execution.
- CI/CD pipeline integrations for API-triggered runs are documented in
  `docs/CI_CD_INTEGRATION_EXAMPLES.md` with GitHub Actions, GitLab CI, Jenkins, and wait-for-completion scripts.

## Production Deployment Prerequisites

- Kubernetes:
  - Kubernetes `v1.27+` cluster.
  - `kubectl` matching cluster version.
- Container tooling:
  - `helm` `v3+`.
  - Container runtime tooling for image signing/build reproducibility.
- Event store:
  - PostgreSQL `16+` for production deployments.
  - Stable connectivity from all scheduler pods to the database.
  - Database role policy allowing migration execution and backup tooling.
  - Backups and restore drills included in runbook cadence.
- Runtime:
  - Secrets mounted from platform vault/secret manager for DB credentials.
  - At least one namespace boundary around scheduler and operator tooling.

Deployment prerequisites before first rollout:

- Set `core.backend = "postgres"` in effective config.
- Set one of `DATABASE_URL` or `postgres.database_url_file`.
- Run `yarli migrate status` and confirm no unapplied mandatory migrations.
- Validate `cargo test --workspace` in a clean environment before release packaging.

## Zero-Drift Deployment Hygiene

- Use immutable image tags for scheduler and api images.
- Keep migration execution and scheduler startup in separate rollout steps.
- Preserve `run` and `task` identifiers in event store evidence.

## Required Environment Variables

- `YARLI_TEST_DATABASE_URL`
  - Required when running Postgres integration tests locally.
  - CI uses: `postgres://postgres:postgres@localhost:5432/postgres`.
  - Must point to an admin database that can create and drop temporary databases.
- `DATABASE_URL`
  - Optional for durable runtime.
  - If set, overrides `postgres.database_url` and `postgres.database_url_file`.
- `YARLI_REQUIRE_POSTGRES_TESTS`
  - Set to `1` for strict mode.
  - In strict mode, missing `YARLI_TEST_DATABASE_URL` is a hard failure (never a skip-success).

## One-Command Acceptance Verification

From repository root:

```bash
export YARLI_TEST_DATABASE_URL=postgres://postgres:postgres@localhost:55432/postgres
export YARLI_REQUIRE_POSTGRES_TESTS=1
bash scripts/verify_acceptance_rubric.sh r8
```

Expected behavior:

- Exit `0` and print `PASS` only when every rubric check is proven.
- Print `UNVERIFIED` and exit non-zero for any missing or failing proof.
- Write evidence to `evidence/r8/` (or generally `evidence/<loop-id>/`) with command logs and `README.md`.

## Fresh-Clone and Clean-Shell Repro Path

Use this sequence to prove reproducibility from a fresh clone and a clean shell.

```bash
git clone <repo-url> yarli
cd yarli
docker rm -f yarli-r8-postgres >/dev/null 2>&1 || true
docker run -d --name yarli-r8-postgres \
  -e POSTGRES_USER=postgres \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=postgres \
  -p 55432:5432 \
  postgres:16
until pg_isready -h 127.0.0.1 -p 55432 -U postgres -d postgres >/dev/null 2>&1; do sleep 1; done
env -i HOME="$HOME" PATH="$PATH" bash --noprofile --norc -lc '
  export YARLI_TEST_DATABASE_URL=postgres://postgres:postgres@localhost:55432/postgres
  export YARLI_REQUIRE_POSTGRES_TESTS=1
  bash scripts/verify_acceptance_rubric.sh r8
'
docker rm -f yarli-r8-postgres
```

Expected verification signals:

- Command output prints `PASS`.
- Exit code is `0`.
- `evidence/r8/README.md` records command blocks with `Exit Code:` and `Key Output:`.

## Packaging and Deployment Smoke Verification (Loop-8)

This sequence is the deterministic "can we deploy" proof path for API and durable CLI behavior:

```bash
export YARLI_TEST_DATABASE_URL=postgres://postgres:postgres@localhost:55432/postgres
export YARLI_REQUIRE_POSTGRES_TESTS=1

# Packaging/build step
cargo build --release -p yarli --bin yarli

# API smoke checks (in-process and persistent read paths)
cargo test --workspace
cargo test -p yarli --test yarli_api_postgres_integration -- --nocapture

# Durable CLI write/read roundtrip
cargo test -p yarli --test yarli_cli_postgres_integration -- --nocapture
```

Expected smoke signals:

- Release build command exits `0`.
- API logs contain `health_endpoint_returns_ok` and `run_status_endpoint_replays_persisted_events`.
- API deploy check log contains `run_and_task_status_reflect_writes_from_postgres`.
- CLI strict-postgres smoke log contains both:
  - `run_start_and_status_roundtrip_against_postgres`
  - `merge_request_and_status_roundtrip_against_postgres`
- These commands are tracked in `evidence/<loop-id>/` when run via `scripts/verify_acceptance_rubric.sh`.

## Durable Local Configuration

Create `yarli.toml` using `yarli.example.toml` as a starting point. Durable write commands require:

```toml
[core]
backend = "postgres"

[postgres]
database_url = "postgres://postgres:postgres@localhost:5432/yarli"
# or
database_url_file = "/run/secrets/yarli-postgres-url"
```

`core.backend = "in-memory"` is supported only for explicit ephemeral workflows, and write commands are blocked unless `core.allow_in_memory_writes = true`.

### Secret rotation for durable deployments

To avoid rebuilding images when credentials rotate:

- Set `DATABASE_URL` in the runtime environment and rely on deployment restart/reload.
- Or mount credentials at `postgres.database_url_file` and rotate the mounted secret.

### API key management and rotation

API keys control optional API authentication when `YARLI_API_KEYS` is set in the runtime environment.

- Set `YARLI_API_KEYS` to a comma-separated list of allowed API keys.
- Set `YARLI_API_RATE_LIMIT_PER_MINUTE` (default: `120`) to tune per-key request throttling.
- Rotate keys by updating the environment value and restarting or reloading the API process.

Authentication is required when `YARLI_API_KEYS` is non-empty and applies to all non-health/metrics routes.

### Zero-downtime scheduler upgrades (blue/green)

Use this runbook for minor/patch upgrades and scheduler-safe rollouts:

1. Build and validate the upgraded image as `yarli-scheduler:<new-version>`.
2. Deploy the upgraded scheduler as a green replica (same queue/postgres view, no traffic yet).
3. Pause active work on the blue scheduler: `yarli run pause --all-active --reason "pre-upgrade drain"`.
4. Wait for in-flight tasks to reach a natural checkpoint (`TaskComplete` or explicit operator-reviewed pause state).
5. Stop the blue scheduler process after queue drain progress indicates no additional claims.
6. Shift traffic to green (`kubectl rollout`, service selector swap, or equivalent).
7. Resume paused runs: `yarli run resume --all-paused --reason "upgrade complete"`.

If green shows stable health and queue progress resumes, proceed. If not, swap traffic back to blue and use
`yarli run pause`/`yarli run cancel` to contain impact while diagnostics run.

Operational checkpoint before rollout:

- `yarli run status --all` shows bounded backlog growth and no unexpected `TaskFailed` spikes.
- Existing active run IDs remain visible and resumable from `run status` and `run explain-exit`.
- Queue lease owner identities should only be from the active scheduler deployment.

### Event schema compatibility matrix

YARLI does not require coordinated downtime for additive event-schema changes.

| Producer schema | Consumer schema | Additive fields only | Backward reads | Breaking reads |
| --- | --- | --- | --- | --- |
| 1.0 | 1.0 | allowed | allowed | blocked |
| 1.1 (schema extension) | 1.0 | allowed | allowed | blocked |
| 1.0 | 1.1 | allowed | allowed | blocked |
| 1.1 | 1.2 | allowed | allowed with defaults | blocked |
| 1.2 | 1.1 | required | blocked | blocked |

Guidance:

- Keep event consumers resilient to optional/new fields.
- For breaking payload changes, land a compatibility shim before enabling mixed-version rollouts.
- Record any approved schema bump in `docs/ACCEPTANCE_RUBRIC.md` and rollback plan.

## Execution Backend Selection

Native process execution remains default:

```toml
[execution]
runner = "native"
```

## Default `yarli run` (Config-First)

To avoid repeating long `--cmd ...` lists, YARLI is opinionated: `yarli run` resolves prompt context in this order, expands `@include <path>`, and drives execution from config + plan state.

1. `yarli run --prompt-file <path>`
2. `yarli.toml` `[run].prompt_file`
3. fallback lookup of `PROMPT.md` by walking upward from the current directory

Execution behavior:

- Discover incomplete tranches from `IMPLEMENTATION_PLAN.md` in plan order.
- Dispatch each tranche as its own Yarli task via `[cli]` command settings.
- Optional grouping: set `[run].enable_plan_tranche_grouping = true` and annotate related plan lines with `tranche_group=<name>` to dispatch shared tranches.
- Optional per-tranche file-scope policy: annotate plan lines with `allowed_paths=path/a,path/b` and set `[run].enforce_plan_tranche_allowed_paths = true` to inject explicit scope constraints into tranche prompts.
- Optional tranche hardening: configure `[run.tranche_contract]` to require `verify`, require `done_when`, and fail merge finalization on out-of-scope edits. These checks are opt-in and preserve legacy behavior when unset.
- Optional run-spec defaults can live in `yarli.toml` (`[run]`, `[[run.tasks]]`, `[[run.tranches]]`, `[run.plan_guard]`), with `PROMPT.md` `yarli-run` blocks used only for per-prompt overrides/backward compatibility.
- Optional backend startup write roots can be configured with `[execution].trusted_backend_write_roots`; these are added only to native runner enforcement and do not expand tranche merge scope.
- Parallel mode defaults to enabled (`[features].parallel = true`) and requires `[execution].worktree_root`.
- In parallel mode, YARLI prepares one workspace copy per task under `execution.worktree_root` before execution.
- After `RunCompleted`, YARLI auto-merges task workspace changes into the source repo using `git apply --3way`.
- If a workspace merge conflicts, YARLI emits `run.parallel_merge_failed`, exits non-zero, and leaves conflict markers for manual resolution.
- Append a verification task automatically.
- If no embedded run spec exists and no incomplete tranches are found, dispatch the full prompt text as one task.
- `[cli].env_unset` can remove parent-session environment variables before CLI launch (for example `CLAUDECODE`).

Control model clarifications:

- `yarli run` is the authoritative execution entry point.
- Built-in Yarli policy gates are code-defined checks; verification command chains come from plan/config/script content.
- Observer integrations emit telemetry only and must not gate active run progression.
- Use explicit operator controls for run state transitions: `yarli run pause`, `yarli run resume`, `yarli run cancel`.

```bash
yarli run --stream
```

Notes:

- `yarli run start ... --cmd ...` remains available for ad-hoc runs.
- `yarli run batch` and `[run]` paces are supported for backward compatibility but are no longer the primary workflow.

Optional Overwatch-backed execution is opt-in:

```toml
[execution]
runner = "overwatch"

[execution.overwatch]
service_url = "http://127.0.0.1:8089"
profile = "default"
soft_timeout_seconds = 900
silent_timeout_seconds = 300
max_log_bytes = 131072
```

Behavior notes:

- `runner = "native"` keeps current local process behavior unchanged.
- `runner = "overwatch"` submits commands to Overwatch (`/run`), polls `/status/{task_id}`, reads final logs from `/output/{task_id}`, and maps terminal state into YARLI command transitions.
- Scheduler shutdown cancellation propagates to command execution; Overwatch runner calls `/cancel/{task_id}` for in-flight tasks.
- Cancellation transitions emit structured provenance (`run.cancel_provenance` / `task.cancel_provenance`) including signal identity, receiver/parent PID context, actor kind/detail, and stage attribution.
- `ui.verbose_output` affects stream verbosity only. Provenance capture is independent; enable `ui.cancellation_diagnostics = true` to append extra diagnostic context in provenance `actor_detail`.
- Durable provenance history requires Postgres backend. In-memory backend keeps provenance only for the lifetime of the current process (plus any existing `.yarli/continuation.json` artifact).

## Runner Hardening Telemetry and cgroup Delegation

Use these signals when validating the runner hardening path in production-like environments:

- `yarli_enforcement_outcomes_total{mechanism,outcome,reason}`
  - `mechanism="rlimit"` confirms pre-exec `setrlimit(2)` application.
  - `mechanism="pidfd"` distinguishes race-free pidfd control from `raw_pid` fallback.
  - `mechanism="cgroup"` reports `attached` success or `rlimits_only_*` fallback reasons.
  - `mechanism="pid_termination"` records bounded failure reasons during cancellation escalation.
- `yarli_command_overhead_duration_seconds`
  - Useful for spotting expensive spawn/resource-capture phases during runner rollout.
- `/proc/self/cgroup`
  - Confirms whether the YARLI process is operating inside the expected delegated subtree.

Delegation requirements for cgroup v2:

- YARLI needs a writable delegated cgroup subtree, not blanket write access to the root hierarchy.
- In systemd environments, run the service in a delegated unit/slice so it can create leaf cgroups below its assigned boundary.
- In container/Kubernetes environments, verify the container runtime exposes a writable subtree for the service user before expecting cgroup enforcement.
- Non-root execution is supported; if the subtree is read-only or not delegated, YARLI falls back to rlimits-only mode and emits `rlimits_only_*` telemetry.

Operator checklist before enabling cgroup enforcement:

1. Confirm the service user is non-root and can create/remove a test directory under its delegated cgroup subtree.
2. Start YARLI and hit `/metrics`, then verify `yarli_enforcement_outcomes_total` emits either `attached` or an explicit `rlimits_only_*` fallback.
3. Run a bounded verification command and confirm `yarli debug resource-usage <run-id>` and `/proc/self/cgroup` match expectations.
4. If metrics show repeated `permission_denied` or `read_only` fallback, keep the deployment in rlimits-only mode until delegation is fixed.

## sw4rm Fallback Behavior

`yarli run sw4rm` is intentionally fail-closed around transport setup and stream loss:

- Invalid merged sw4rm run-spec configuration falls back to a minimal verification suite (`cargo build`) so the agent still has a bounded verification path.
- Router/report client initialization, runtime init, or registry registration failures stop startup immediately and return a non-zero error to the operator.
- If the response correlation is dropped because shutdown/preemption cancels the in-flight stream, the orchestrator surfaces `Cancelled` rather than fabricating a result.
- If the correlation stays pending past `sw4rm.llm_response_timeout_secs`, the orchestrator cleans up the pending entry and returns `LlmTimeout`.

Recommended operator response:

1. Treat startup failures as connectivity/configuration issues, not partial success.
2. Retry only after router/registry reachability and credentials are confirmed.
3. Treat `Cancelled` as an interrupted orchestration and resume or requeue explicitly.
4. Treat `LlmTimeout` as an execution-budget or transport-latency issue and inspect router/runtime health before retrying.

## Runtime Resource and Token Budgets

YARLI can enforce explicit per-task and per-run budgets from `yarli.toml`:

```toml
[budgets]
max_task_total_tokens = 25000
max_run_total_tokens = 250000
max_task_rss_bytes = 1073741824
max_run_peak_rss_bytes = 2147483648
```

Behavior:

- Budget breaches are fail-closed (`task.failed` with `reason = "budget_exceeded"`).
- Breach events include observed metric values, limits, command resource usage, token usage, and run usage totals.
- Token usage currently uses deterministic `char_count_div4_estimate_v1` estimation and is recorded in command/task events.

Operator-visible command surfaces for resource_usage and token_usage:

- `yarli run status <run-id>`: shows per-task `budget_exceeded`, `token_usage` (prompt_tokens, completion_tokens, total_tokens), and `resource_usage` (max_rss_bytes).
- `yarli run explain-exit <run-id>`: shows `Budget breaches:` section with task name and breach detail, plus per-task token and resource usage.
- `yarli task explain <task-id>`: shows `budget_exceeded` reason, token usage, and resource usage for individual tasks.
- `yarli task output <task-id>`: prints the raw captured stdout/stderr for a task when durable output events are available.

Durability note:

- `core.backend = "in-memory"` is ephemeral; captured task output and local audit history are lost when the process exits.
- `core.backend = "postgres"` preserves run/task state and captured output for later diagnosis.

## Sequence Deterioration Observer

YARLI emits rolling deterioration analysis events (`run.observer.deterioration`) during run execution.
The observer consumes event deltas incrementally using event-store cursor reads (`after_event_id`), not full-history rescans.
YARLI also emits heartbeat progress events (`run.observer.progress`) so stream output remains active during long-running tasks.

Signals included in the score:

- Runtime drift for repeated command keys.
- Retry inflation and blocker churn.
- Failure-rate drift by reason bucket.
- Budget headroom erosion (when observed/limit values are present).

Operator surfaces:

- `yarli run status <run-id>` includes latest deterioration score/trend/factors when available.
- `yarli run explain-exit <run-id>` includes a sequence deterioration section.
- `yarli-api` (if you run it) exposes the latest report as an optional `deterioration` field on `GET /v1/runs/<run-id>/status`.

## Runtime Guard Telemetry and Operator Playbook

Runtime guards now combine hard guardrails and quality gating:

- Budget guardrails from `[budgets]` stop execution on token/usage overages.
- Task-health and soft-token-cap guardrails produce continuation guidance (`checkpoint-now`, `force-pivot`, or `stop-and-summarize`) on completion.
- Guard telemetry is observability-first and does not itself gate active scheduler semantics.

Telemetry to monitor:

- `run.observer.deterioration` and `run.observer.deterioration_cycle`
  - Detect recurring instability patterns before automatic continuation paths drift too far.
- `run.observer.progress`
  - Confirms progress visibility for long-running tasks and supports hang triage.
- `task.failed` with `reason = "budget_exceeded"`
  - Includes observed/limit metrics, command usage, token usage, and aggregate run usage snapshots.
- `run.continuation` (`quality_gate` payload)
  - Captures `task_health_action`, `reason`, and optional `trend` for operator follow-up.

Operator playbook:

Quick diagnosis cheat sheet:

- Why did the run stop? -> `yarli run explain-exit <run-id>`
- What did the task actually print? -> `yarli task output <task-id>`
- Why is a task blocked or failing? -> `yarli task explain <task-id>`
- What policy or gate decided this? -> `yarli audit query --task-id <task-id> --category gate_evaluation`

- Budget breach (`budget_exceeded`) response:
  1. Pause if active: `yarli run pause <run-id>`
  2. Inspect: `yarli run status <run-id>`, `yarli run explain-exit <run-id>`, `yarli task explain <task-id>`, `yarli task output <task-id>`
  3. Decide: lower scope, raise limits, or cancel and rerun (`yarli run cancel <run-id>`)
- Deterioration-cycle response:
  1. Inspect current state with `yarli run status <run-id>` and `yarli run explain-exit <run-id>`
  2. If guidance is `force-pivot` or `stop-and-summarize`, pause/stop before continuing
  3. Re-run with narrower scope, then continue via `yarli run continue` only if the continuation snapshot still matches current open tranches
- Soft-cap continuation guidance (`checkpoint-now`):
  1. Treat as a review checkpoint and avoid blind continuation
  2. Evaluate remaining objective and current usage headroom
  3. If safe, continue; otherwise cancel and restart with adjusted `[run]` strategy

Continuation semantics:

- `yarli run continue` replays the prior continuation snapshot and is the right tool for retry/unfinished/planned-next work that already belongs to that snapshot.
- `yarli run --fresh-from-tranches` rebuilds from the current prompt/plan/tranches state and is the right tool after you enqueue new tranches or otherwise change `.yarli/tranches.toml`.
- When current `.yarli/tranches.toml` contains open tranche keys not represented in the continuation snapshot, `yarli run continue` now refuses with an actionable drift message instead of silently replaying stale scope.

## Merge Conflict Policy and Incident Response

- `run.parallel_merge_failed` is emitted for unresolved or unrecoverable conflicts during parallel workspace merge finalization.
- `run.parallel_merge_succeeded` confirms successful parallel merge finalization.
- Merge telemetry events are also emitted as part of merge apply tracking:
  - `merge.apply.started`
  - `merge.apply.conflict`
  - `merge.apply.finalized`
  - `merge.repair.succeeded`
  - `merge.repair.failed`

Merge policy modes (`[run].merge_conflict_resolution`) are:

- `fail` (default): stop on conflict, preserve workspace for operator review.
- `manual`: preserve workspace and emit explicit recovery instructions for manual intervention.
- `auto-repair`: attempt deterministic patch-side repair before falling back to manual recovery.

Incident response workflow:

- Pull latest failure context:
  1. `yarli run status <run-id>`
  2. `yarli run explain-exit <run-id>`
  3. `yarli task output <task-id>` for raw command stdout/stderr from the failing task.
  4. `yarli audit tail` or `yarli audit query` for merge telemetry, policy decisions, and task-level context.
- Open `PARALLEL_MERGE_RECOVERY.txt` in the preserved workspace root reported by `run.parallel_merge_failed`.
- Run recovery commands from the note in order (status, patch diff stats, manual `git apply` retry).
- Resolve conflicts and re-run with your normal operator decision path:
  - `yarli run continue` when continuation is available and still matches current open tranches, or
  - rerun the target tranche after scope and policy review.
- If recovery requires policy change, update `run.merge_conflict_resolution` (`fail`, `manual`, or `auto-repair`) and resume according to post-incident policy.

Evidence capture for incident records should include:

- `run.parallel_merge_failed` payload (including `task_key`, `patch_path`, `workspace_path`, `conflicted_files`, `repo_status`, `recovery_hints`)
- associated `merge.apply.*` and `merge.repair.*` telemetry events
- `PARALLEL_MERGE_RECOVERY.txt` and operator remediation steps.

Policy defaults:

- Keep guard-related telemetry in evidence (`evidence/<loop-id>/`) to preserve trend history.
- Avoid runtime-guard bypass shortcuts; use continuation/replan actions and explicit operator commands.
- Treat `yarli audit tail` as a local JSONL view by default; in durable deployments it complements, but does not replace, the persisted event-store record.

## External Agent Skill Contract

The external `yarli-execution-loop` skill is expected to treat Yarli as the durable control plane:

- Inspect Yarli state first (`.yarli/continuation.json`, `.yarli/tranches.toml`, `yarli run status`, `yarli run explain-exit`).
- Enqueue newly discovered work durably through `yarli plan tranche add --idempotent`.
- Choose `yarli run continue` only for snapshot-owned work with no drift.
- Choose `yarli run --fresh-from-tranches` after new tranche enqueue or other live-plan changes.
- End each agent cycle with a `YARLI_DECISION_V1` block containing `status`, `reason`, `enqueued_tranches`, and `next_command`.

### Idempotent Tranche Enqueue

`yarli plan tranche add --idempotent` is designed for safe repeated invocation in
agent loops. The semantics are:

- **Matching fields → no-op.** If a tranche with the given key already exists and
  all effective fields (summary, group, allowed_paths, verify, done_when, max_tokens)
  match the request, the command prints a confirmation message and returns success
  without modifying the file.
- **Mismatched fields → error.** If the key exists but any effective field differs,
  the command fails with a message listing which fields differ. This prevents
  accidental overwrites while making the conflict visible to the caller.
- **New key → normal add.** If no tranche with the key exists, the tranche is
  appended exactly as with a non-idempotent `add`.
- **Safe for repeated invocation.** Because identical calls are no-ops and
  conflicting calls are errors, agents can call `--idempotent` unconditionally at the
  start of every cycle without side effects or silent data loss.

### Strict Tranche Contract

Use `[run.tranche_contract]` when agent consumers need stronger execution guarantees without breaking legacy repositories:

- `strict = true` turns on all checks below.
- `require_verify = true` rejects open tranches that omit `verify`.
- `require_done_when = true` rejects open tranches that omit `done_when`.
- `enforce_allowed_paths_on_merge = true` fail-closes merge finalization when a tranche edits paths outside its declared `allowed_paths`.

Behavior notes:

- These checks apply to open tranches only; completed legacy tranches are left alone.
- When merge-time path enforcement is enabled, YARLI automatically surfaces `allowed_paths` in tranche prompts even if `[run].enforce_plan_tranche_allowed_paths` is otherwise false.
- Control-plane files such as `IMPLEMENTATION_PLAN.md` and `.yarli/*` remain exempt so normal plan/evidence bookkeeping does not trip the scope gate.

## Reproducing Budget Stress Checks Locally

Run these commands to verify budget governance under parallel workload:

```bash
# Single-task budget breach: task exceeds token limit, fails without retry
cargo test -p yarli test_budget_exceeded_fails_task_without_retry -- --nocapture

# Run-level budget breach: cumulative tokens across tasks exceed run limit
cargo test -p yarli test_run_token_budget_exceeded_across_tasks -- --nocapture

# Parallel-task budget stress: 4 concurrent tasks with tight run budget,
# proves accounting consistency and no silent continuation after breach
cargo test -p yarli test_parallel_tasks_budget_accounting_consistency -- --nocapture

# Command execution resource capture (exec layer)
cargo test -p yarli-exec -- --nocapture
```

Expected behavior:

- All commands exit `0`.
- Budget failure events include `reason = "budget_exceeded"` with observed/limit metrics.
- Run does not reach `RunCompleted` after any budget breach.

## Migration Workflow

YARLI schema SQL lives under `crates/yarli-store/migrations/`.

Use CLI migration workflow:

```bash
yarli migrate status
yarli migrate up
yarli migrate backup --label <label>   # optional before maintenance windows
yarli migrate down --target 0002         # optional rollback (creates backup)
yarli migrate restore --label <label>     # rollback recovery
```

Recommended:

- Use a dedicated database for YARLI runtime data.
- Apply migrations before running durable CLI write commands.

## Event Store Backup and Restore for Production

YARLI’s source of truth is the Postgres event store and queue state.

Recommended production backup policy:

- Daily validated dump.
- Timely incremental backup or PITR according to retention policy.
- Quarterly restore drill in a non-production cluster.

Example full backup command:

```bash
export YARLI_DB_URL="postgres://postgres:postgres@postgres.example.internal:5432/yarli"
export BACKUP_FILE="/var/backups/yarli-eventstore-$(date -u +%Y%m%dT%H%M%SZ).dump"
pg_dump -Fc "$YARLI_DB_URL" > "$BACKUP_FILE"
sha256sum "$BACKUP_FILE" > "${BACKUP_FILE}.sha256"
```

Example restore validation workflow:

```bash
export YARLI_RESTORE_DB_URL="postgres://postgres:postgres@postgres.example.internal:5432/yarli_restore"
createdb -h postgres.example.internal -U postgres yarli_restore
pg_restore --clean --if-exists --no-owner --no-privileges --dbname "$YARLI_RESTORE_DB_URL" "$BACKUP_FILE"
export DATABASE_URL="$YARLI_RESTORE_DB_URL"
yarli migrate status
```

Validation rules:

- Restore commands complete without ownership/permission warnings.
- `yarli migrate status` shows a healthy, readable event store.
- `yarli run status --all` works against restored data.

## Scaling Strategy for Queue and Scheduler Capacity

- Use `yarli run status --all` and run-level backlog patterns as first indicators.
- Horizontal scale first when sustained backlog grows across multiple active runs:
  - `helm upgrade <release> deploy/helm/yarli --set scheduler.replicas=<N>`
- Vertical tuning when backlog remains with healthy pod counts:
  - update scheduler CPU/memory in `deploy/helm/yarli/values.yaml` (`scheduler.resources`).
- Throughput tuning for event-driven claim bursts:
  - `queue.per_run_cap`
  - `queue.io_cap`
  - `queue.cpu_cap`
  - `queue.git_cap`
  - `queue.tool_cap`
- Vertical and horizontal changes should be paired with post-scale verification.

## Incident Response Playbook

Use this flow for production incidents requiring operator action without code changes.

### 1) Stuck runs

1. `yarli run status --all`
2. For the affected run:
   - `yarli run explain-exit <run-id>`
   - `yarli task list <run-id>`
   - `yarli task explain <task-id>`
3. Pull recent operator telemetry:
   - `yarli audit tail`
4. If the run must be stopped:
   - `yarli run pause --all-active --reason "incident triage"`
   - fix root cause, then `yarli run resume --all-paused --reason "triage complete"`

### 2) Queue backlogs

1. Confirm queue behavior:
   - `yarli run status --all`
2. Correlate with `run` and `task` telemetry:
   - `yarli audit tail`
3. Contain and recover:
   - reduce new claim pressure if needed
   - increase scheduler replicas
   - pause and drain if the backlog is destabilizing critical workloads
4. Resume in a controlled mode after scheduler and DB signals stabilize.

### 3) Budget breaches

1. Detect breach and isolate impact:
   - `yarli task explain <task-id>`
   - `yarli run explain-exit <run-id>`
2. Preserve incident evidence:
   - `yarli run status <run-id>`
   - `yarli run explain-exit <run-id>`
   - `yarli audit tail`
3. Corrective actions:
   - `yarli run pause <run-id>` if mitigation needs immediate human hold
   - adjust `[budgets]` for rerun
   - `yarli run continue` after intent review.

## Deterministic Local Strict Postgres Verification Workflow

Run this exact sequence from repository root on a clean shell.

Start and prepare a local Postgres 16 instance:

```bash
docker rm -f yarli-r4v-postgres >/dev/null 2>&1 || true
docker run -d --name yarli-r4v-postgres \
  -e POSTGRES_USER=postgres \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=postgres \
  -p 55432:5432 \
  postgres:16
until pg_isready -h 127.0.0.1 -p 55432 -U postgres -d postgres >/dev/null 2>&1; do sleep 1; done
```

Sanity-check strict mode fails fast when DB env is missing:

```bash
unset YARLI_TEST_DATABASE_URL
export YARLI_REQUIRE_POSTGRES_TESTS=1
cargo test -p yarli --test yarli_store_postgres_integration -- --nocapture
```

Expected result: command fails and includes
`postgres integration tests require YARLI_TEST_DATABASE_URL when YARLI_REQUIRE_POSTGRES_TESTS=1`.

Run strict Postgres integration suites (real execution path):

```bash
export YARLI_TEST_DATABASE_URL=postgres://postgres:postgres@localhost:55432/postgres
export YARLI_REQUIRE_POSTGRES_TESTS=1
cargo test -p yarli --test yarli_store_postgres_integration -- --nocapture
cargo test -p yarli --test yarli_queue_postgres_integration -- --nocapture
cargo test -p yarli --test yarli_cli_postgres_integration -- --nocapture
```

True execution signals:

- Each command exits with status `0`.
- Output contains `test result: ok`.
- Output does not contain `skipping postgres integration test`.

Cleanup:

```bash
docker rm -f yarli-r4v-postgres
```

## Baseline Workspace Verification

```bash
cargo fmt --all
cargo clippy --workspace --all-targets
cargo test --workspace
```

## Consistency and Governance Contract

See `docs/CONSISTENCY_CONTRACT.md` for the canonical strong-consistency and runtime-governance contract, including:

- Per-aggregate transition ordering and linearizable state.
- Durable-state-before-side-effects invariant.
- Idempotency key replay behavior.
- Queue lease single-owner invariant.
- Resource and token accounting fields and budget breach outcomes.

## Acceptance Decision

Use `docs/ACCEPTANCE_RUBRIC.md` to determine a binary outcome:

- `PASS`: all required checks are proven with tracked evidence.
- `UNVERIFIED`: any required check is missing, failing, or not evidenced.

## Telemetry and Observability

YARLI exports structured telemetry via OpenTelemetry (OTLP) when configured. This enables deep introspection into scheduler performance, command execution latency, and resource utilization.

### OTLP Collector Setup (Local Example)

To capture and visualize telemetry locally using Jaeger (traces) and Prometheus (metrics):

1. **Start the collector stack (Jaeger + Prometheus + OTel Collector):**

   ```bash
   docker run -d --name jaeger \
     -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
     -p 5775:5775/udp \
     -p 6831:6831/udp \
     -p 6832:6832/udp \
     -p 5778:5778 \
     -p 16686:16686 \
     -p 14268:14268 \
     -p 14250:14250 \
     -p 9411:9411 \
     jaegertracing/all-in-one:1.60

   # (Optional) Prometheus and OTel Collector would go here for a full stack.
   # For basic tracing, Jaeger all-in-one accepts OTLP over gRPC at 4317 if configured,
   # or you can point YARLI directly to a collector.
   ```

2. **Configure YARLI to export telemetry:**

   Set environment variables before running `yarli run`:

   ```bash
   export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317"
   export OTEL_SERVICE_NAME="yarli-scheduler"
   export RUST_LOG="info,yarli=debug" # Optional: enable debug logs for correlation
   ```

3. **Visualize Traces:**
   Open http://localhost:16686/ in your browser.

### Metric and Trace Conventions

**Metrics (Prometheus format):**

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `yarli_queue_depth` | Gauge | - | Current tasks in queue (pending + leased). |
| `yarli_runs_total` | Counter | `state` | Run state transitions. |
| `yarli_tasks_total` | Counter | `state`, `command_class` | Task state transitions. |
| `yarli_commands_total` | Counter | `command_class`, `exit_reason` | Command execution outcomes. |
| `yarli_command_duration_seconds` | Histogram | `command_class` | End-to-end command duration. |
| `yarli_command_overhead_duration_seconds` | Histogram | `command_class`, `phase` | Internal overhead (spawn, capture). |
| `yarli_scheduler_tick_duration_seconds` | Histogram | `stage` | Scheduler loop timing (`scan`, `claim`, etc). |
| `yarli_store_duration_seconds` | Histogram | `operation` | Event store latency (`append`, `query`). |
| `yarli_store_slow_queries` | Counter | `operation` | Database operations exceeding 1s. |

**Traces (Spans):**

- `run_execution`: Root span for a `yarli run` invocation.
- `scheduler.tick`: Wraps a single scheduler loop iteration.
- `scheduler.execute_task`: Wraps a task execution attempt.
- `command.execute`: Wraps the low-level process spawn and wait.
- `store.append`, `store.query`: Wraps database operations.

### Recommended Alerting Thresholds

For production deployments, monitor these signals:

1. **Stalled Queue:**
   - Alert if `yarli_queue_depth > 0` and `yarli_scheduler_tick_duration_seconds_count` stops increasing for 5m.
   - Indicates scheduler hang or crash.

2. **High Failure Rate:**
   - Alert if `rate(yarli_commands_total{exit_reason!="success"}[5m]) / rate(yarli_commands_total[5m]) > 0.1`.
   - Indicates systemic failure (bad config, network down).

3. **Slow Database:**
   - Alert if `rate(yarli_store_slow_queries[5m]) > 0`.
   - Indicates DB performance degradation affecting scheduler throughput.

4. **Resource Saturation:**
   - Alert if `yarli_run_resource_usage` approaches configured budget limits.

### Troubleshooting Telemetry

**Missing Exports:**
- Verify `OTEL_EXPORTER_OTLP_ENDPOINT` is reachable.
- Check logs for "OpenTelemetry trace error" or "metrics export failed".
- Ensure the collector accepts OTLP/gRPC (port 4317) or OTLP/HTTP (port 4318) matching your endpoint scheme.

**High Cardinality:**
- YARLI metrics use bounded label sets (enums).
- Avoid adding unbounded labels (like `run_id` or `task_id`) to long-lived metrics.
- `run_resource_usage` is an exception; it is labelled by `run_id` but should be transient or aggregated.