# YARLI Operations
This runbook covers the baseline local operator workflow for durable mode, migrations, and test execution.
For final verification and release acceptance decisions, use `docs/ACCEPTANCE_RUBRIC.md`.
For an exhaustive, command-by-command CLI usage guide, see `docs/CLI.md`.
For memory behavior (when YARLI stores/queries memories), see `docs/MEMORY_POLICY.md`.
## Requirements
- Rust toolchain compatible with workspace `rust-version` (currently 1.75+).
- PostgreSQL accessible from the local machine.
- `psql` client for manual migration execution.
- CI/CD pipeline integrations for API-triggered runs are documented in
`docs/CI_CD_INTEGRATION_EXAMPLES.md` with GitHub Actions, GitLab CI, Jenkins, and wait-for-completion scripts.
## Production Deployment Prerequisites
- Kubernetes:
- Kubernetes `v1.27+` cluster.
- `kubectl` matching cluster version.
- Container tooling:
- `helm` `v3+`.
- Container runtime tooling for image signing/build reproducibility.
- Event store:
- PostgreSQL `16+` for production deployments.
- Stable connectivity from all scheduler pods to the database.
- Database role policy allowing migration execution and backup tooling.
- Backups and restore drills included in runbook cadence.
- Runtime:
- Secrets mounted from platform vault/secret manager for DB credentials.
- At least one namespace boundary around scheduler and operator tooling.
Deployment prerequisites before first rollout:
- Set `core.backend = "postgres"` in effective config.
- Set one of `DATABASE_URL` or `postgres.database_url_file`.
- Run `yarli migrate status` and confirm no unapplied mandatory migrations.
- Validate `cargo test --workspace` in a clean environment before release packaging.
## Zero-Drift Deployment Hygiene
- Use immutable image tags for scheduler and api images.
- Keep migration execution and scheduler startup in separate rollout steps.
- Preserve `run` and `task` identifiers in event store evidence.
## Required Environment Variables
- `YARLI_TEST_DATABASE_URL`
- Required when running Postgres integration tests locally.
- CI uses: `postgres://postgres:postgres@localhost:5432/postgres`.
- Must point to an admin database that can create and drop temporary databases.
- `DATABASE_URL`
- Optional for durable runtime.
- If set, overrides `postgres.database_url` and `postgres.database_url_file`.
- `YARLI_REQUIRE_POSTGRES_TESTS`
- Set to `1` for strict mode.
- In strict mode, missing `YARLI_TEST_DATABASE_URL` is a hard failure (never a skip-success).
## One-Command Acceptance Verification
From repository root:
```bash
export YARLI_TEST_DATABASE_URL=postgres://postgres:postgres@localhost:55432/postgres
export YARLI_REQUIRE_POSTGRES_TESTS=1
bash scripts/verify_acceptance_rubric.sh r8
```
Expected behavior:
- Exit `0` and print `PASS` only when every rubric check is proven.
- Print `UNVERIFIED` and exit non-zero for any missing or failing proof.
- Write evidence to `evidence/r8/` (or generally `evidence/<loop-id>/`) with command logs and `README.md`.
## Fresh-Clone and Clean-Shell Repro Path
Use this sequence to prove reproducibility from a fresh clone and a clean shell.
```bash
git clone <repo-url> yarli
cd yarli
-e POSTGRES_USER=postgres \
-e POSTGRES_PASSWORD=postgres \
-e POSTGRES_DB=postgres \
-p 55432:5432 \
postgres:16
until pg_isready -h 127.0.0.1 -p 55432 -U postgres -d postgres >/dev/null 2>&1; do sleep 1; done
env -i HOME="$HOME" PATH="$PATH" bash --noprofile --norc -lc '
export YARLI_TEST_DATABASE_URL=postgres://postgres:postgres@localhost:55432/postgres
export YARLI_REQUIRE_POSTGRES_TESTS=1
bash scripts/verify_acceptance_rubric.sh r8
'
docker rm -f yarli-r8-postgres
```
Expected verification signals:
- Command output prints `PASS`.
- Exit code is `0`.
- `evidence/r8/README.md` records command blocks with `Exit Code:` and `Key Output:`.
## Packaging and Deployment Smoke Verification (Loop-8)
This sequence is the deterministic "can we deploy" proof path for API and durable CLI behavior:
```bash
export YARLI_TEST_DATABASE_URL=postgres://postgres:postgres@localhost:55432/postgres
export YARLI_REQUIRE_POSTGRES_TESTS=1
# Packaging/build step
cargo build --release -p yarli --bin yarli
# API smoke checks (in-process and persistent read paths)
cargo test --workspace
cargo test -p yarli --test yarli_api_postgres_integration -- --nocapture
# Durable CLI write/read roundtrip
cargo test -p yarli --test yarli_cli_postgres_integration -- --nocapture
```
Expected smoke signals:
- Release build command exits `0`.
- API logs contain `health_endpoint_returns_ok` and `run_status_endpoint_replays_persisted_events`.
- API deploy check log contains `run_and_task_status_reflect_writes_from_postgres`.
- CLI strict-postgres smoke log contains both:
- `run_start_and_status_roundtrip_against_postgres`
- `merge_request_and_status_roundtrip_against_postgres`
- These commands are tracked in `evidence/<loop-id>/` when run via `scripts/verify_acceptance_rubric.sh`.
## Durable Local Configuration
Create `yarli.toml` using `yarli.example.toml` as a starting point. Durable write commands require:
```toml
[core]
backend = "postgres"
[postgres]
database_url = "postgres://postgres:postgres@localhost:5432/yarli"
# or
database_url_file = "/run/secrets/yarli-postgres-url"
```
`core.backend = "in-memory"` is supported only for explicit ephemeral workflows, and write commands are blocked unless `core.allow_in_memory_writes = true`.
### Secret rotation for durable deployments
To avoid rebuilding images when credentials rotate:
- Set `DATABASE_URL` in the runtime environment and rely on deployment restart/reload.
- Or mount credentials at `postgres.database_url_file` and rotate the mounted secret.
### API key management and rotation
API keys control optional API authentication when `YARLI_API_KEYS` is set in the runtime environment.
- Set `YARLI_API_KEYS` to a comma-separated list of allowed API keys.
- Set `YARLI_API_RATE_LIMIT_PER_MINUTE` (default: `120`) to tune per-key request throttling.
- Rotate keys by updating the environment value and restarting or reloading the API process.
Authentication is required when `YARLI_API_KEYS` is non-empty and applies to all non-health/metrics routes.
### Zero-downtime scheduler upgrades (blue/green)
Use this runbook for minor/patch upgrades and scheduler-safe rollouts:
1. Build and validate the upgraded image as `yarli-scheduler:<new-version>`.
2. Deploy the upgraded scheduler as a green replica (same queue/postgres view, no traffic yet).
3. Pause active work on the blue scheduler: `yarli run pause --all-active --reason "pre-upgrade drain"`.
4. Wait for in-flight tasks to reach a natural checkpoint (`TaskComplete` or explicit operator-reviewed pause state).
5. Stop the blue scheduler process after queue drain progress indicates no additional claims.
6. Shift traffic to green (`kubectl rollout`, service selector swap, or equivalent).
7. Resume paused runs: `yarli run resume --all-paused --reason "upgrade complete"`.
If green shows stable health and queue progress resumes, proceed. If not, swap traffic back to blue and use
`yarli run pause`/`yarli run cancel` to contain impact while diagnostics run.
Operational checkpoint before rollout:
- `yarli run status --all` shows bounded backlog growth and no unexpected `TaskFailed` spikes.
- Existing active run IDs remain visible and resumable from `run status` and `run explain-exit`.
- Queue lease owner identities should only be from the active scheduler deployment.
### Event schema compatibility matrix
YARLI does not require coordinated downtime for additive event-schema changes.
| 1.0 | 1.0 | allowed | allowed | blocked |
| 1.1 (schema extension) | 1.0 | allowed | allowed | blocked |
| 1.0 | 1.1 | allowed | allowed | blocked |
| 1.1 | 1.2 | allowed | allowed with defaults | blocked |
| 1.2 | 1.1 | required | blocked | blocked |
Guidance:
- Keep event consumers resilient to optional/new fields.
- For breaking payload changes, land a compatibility shim before enabling mixed-version rollouts.
- Record any approved schema bump in `docs/ACCEPTANCE_RUBRIC.md` and rollback plan.
## Execution Backend Selection
Native process execution remains default:
```toml
[execution]
runner = "native"
```
## Default `yarli run` (Config-First)
To avoid repeating long `--cmd ...` lists, YARLI is opinionated: `yarli run` resolves prompt context in this order, expands `@include <path>`, and drives execution from config + plan state.
1. `yarli run --prompt-file <path>`
2. `yarli.toml` `[run].prompt_file`
3. fallback lookup of `PROMPT.md` by walking upward from the current directory
Execution behavior:
- Discover incomplete tranches from `IMPLEMENTATION_PLAN.md` in plan order.
- Dispatch each tranche as its own Yarli task via `[cli]` command settings.
- Optional grouping: set `[run].enable_plan_tranche_grouping = true` and annotate related plan lines with `tranche_group=<name>` to dispatch shared tranches.
- Optional per-tranche file-scope policy: annotate plan lines with `allowed_paths=path/a,path/b` and set `[run].enforce_plan_tranche_allowed_paths = true` to inject explicit scope constraints into tranche prompts.
- Optional tranche hardening: configure `[run.tranche_contract]` to require `verify`, require `done_when`, and fail merge finalization on out-of-scope edits. These checks are opt-in and preserve legacy behavior when unset.
- Optional run-spec defaults can live in `yarli.toml` (`[run]`, `[[run.tasks]]`, `[[run.tranches]]`, `[run.plan_guard]`), with `PROMPT.md` `yarli-run` blocks used only for per-prompt overrides/backward compatibility.
- Parallel mode defaults to enabled (`[features].parallel = true`) and requires `[execution].worktree_root`.
- In parallel mode, YARLI prepares one workspace copy per task under `execution.worktree_root` before execution.
- After `RunCompleted`, YARLI auto-merges task workspace changes into the source repo using `git apply --3way`.
- If a workspace merge conflicts, YARLI emits `run.parallel_merge_failed`, exits non-zero, and leaves conflict markers for manual resolution.
- Append a verification task automatically.
- If no embedded run spec exists and no incomplete tranches are found, dispatch the full prompt text as one task.
- `[cli].env_unset` can remove parent-session environment variables before CLI launch (for example `CLAUDECODE`).
Control model clarifications:
- `yarli run` is the authoritative execution entry point.
- Built-in Yarli policy gates are code-defined checks; verification command chains come from plan/config/script content.
- Observer integrations emit telemetry only and must not gate active run progression.
- Use explicit operator controls for run state transitions: `yarli run pause`, `yarli run resume`, `yarli run cancel`.
```bash
yarli run --stream
```
Notes:
- `yarli run start ... --cmd ...` remains available for ad-hoc runs.
- `yarli run batch` and `[run]` paces are supported for backward compatibility but are no longer the primary workflow.
Optional Overwatch-backed execution is opt-in:
```toml
[execution]
runner = "overwatch"
[execution.overwatch]
service_url = "http://127.0.0.1:8089"
profile = "default"
soft_timeout_seconds = 900
silent_timeout_seconds = 300
max_log_bytes = 131072
```
Behavior notes:
- `runner = "native"` keeps current local process behavior unchanged.
- `runner = "overwatch"` submits commands to Overwatch (`/run`), polls `/status/{task_id}`, reads final logs from `/output/{task_id}`, and maps terminal state into YARLI command transitions.
- Scheduler shutdown cancellation propagates to command execution; Overwatch runner calls `/cancel/{task_id}` for in-flight tasks.
- Cancellation transitions emit structured provenance (`run.cancel_provenance` / `task.cancel_provenance`) including signal identity, receiver/parent PID context, actor kind/detail, and stage attribution.
- `ui.verbose_output` affects stream verbosity only. Provenance capture is independent; enable `ui.cancellation_diagnostics = true` to append extra diagnostic context in provenance `actor_detail`.
- Durable provenance history requires Postgres backend. In-memory backend keeps provenance only for the lifetime of the current process (plus any existing `.yarli/continuation.json` artifact).
## Runner Hardening Telemetry and cgroup Delegation
Use these signals when validating the runner hardening path in production-like environments:
- `yarli_enforcement_outcomes_total{mechanism,outcome,reason}`
- `mechanism="rlimit"` confirms pre-exec `setrlimit(2)` application.
- `mechanism="pidfd"` distinguishes race-free pidfd control from `raw_pid` fallback.
- `mechanism="cgroup"` reports `attached` success or `rlimits_only_*` fallback reasons.
- `mechanism="pid_termination"` records bounded failure reasons during cancellation escalation.
- `yarli_command_overhead_duration_seconds`
- Useful for spotting expensive spawn/resource-capture phases during runner rollout.
- `/proc/self/cgroup`
- Confirms whether the YARLI process is operating inside the expected delegated subtree.
Delegation requirements for cgroup v2:
- YARLI needs a writable delegated cgroup subtree, not blanket write access to the root hierarchy.
- In systemd environments, run the service in a delegated unit/slice so it can create leaf cgroups below its assigned boundary.
- In container/Kubernetes environments, verify the container runtime exposes a writable subtree for the service user before expecting cgroup enforcement.
- Non-root execution is supported; if the subtree is read-only or not delegated, YARLI falls back to rlimits-only mode and emits `rlimits_only_*` telemetry.
Operator checklist before enabling cgroup enforcement:
1. Confirm the service user is non-root and can create/remove a test directory under its delegated cgroup subtree.
2. Start YARLI and hit `/metrics`, then verify `yarli_enforcement_outcomes_total` emits either `attached` or an explicit `rlimits_only_*` fallback.
3. Run a bounded verification command and confirm `yarli debug resource-usage <run-id>` and `/proc/self/cgroup` match expectations.
4. If metrics show repeated `permission_denied` or `read_only` fallback, keep the deployment in rlimits-only mode until delegation is fixed.
## sw4rm Fallback Behavior
`yarli run sw4rm` is intentionally fail-closed around transport setup and stream loss:
- Invalid merged sw4rm run-spec configuration falls back to a minimal verification suite (`cargo build`) so the agent still has a bounded verification path.
- Router/report client initialization, runtime init, or registry registration failures stop startup immediately and return a non-zero error to the operator.
- If the response correlation is dropped because shutdown/preemption cancels the in-flight stream, the orchestrator surfaces `Cancelled` rather than fabricating a result.
- If the correlation stays pending past `sw4rm.llm_response_timeout_secs`, the orchestrator cleans up the pending entry and returns `LlmTimeout`.
Recommended operator response:
1. Treat startup failures as connectivity/configuration issues, not partial success.
2. Retry only after router/registry reachability and credentials are confirmed.
3. Treat `Cancelled` as an interrupted orchestration and resume or requeue explicitly.
4. Treat `LlmTimeout` as an execution-budget or transport-latency issue and inspect router/runtime health before retrying.
## Runtime Resource and Token Budgets
YARLI can enforce explicit per-task and per-run budgets from `yarli.toml`:
```toml
[budgets]
max_task_total_tokens = 25000
max_run_total_tokens = 250000
max_task_rss_bytes = 1073741824
max_run_peak_rss_bytes = 2147483648
```
Behavior:
- Budget breaches are fail-closed (`task.failed` with `reason = "budget_exceeded"`).
- Breach events include observed metric values, limits, command resource usage, token usage, and run usage totals.
- Token usage currently uses deterministic `char_count_div4_estimate_v1` estimation and is recorded in command/task events.
Operator-visible command surfaces for resource_usage and token_usage:
- `yarli run status <run-id>`: shows per-task `budget_exceeded`, `token_usage` (prompt_tokens, completion_tokens, total_tokens), and `resource_usage` (max_rss_bytes).
- `yarli run explain-exit <run-id>`: shows `Budget breaches:` section with task name and breach detail, plus per-task token and resource usage.
- `yarli task explain <task-id>`: shows `budget_exceeded` reason, token usage, and resource usage for individual tasks.
- `yarli task output <task-id>`: prints the raw captured stdout/stderr for a task when durable output events are available.
Durability note:
- `core.backend = "in-memory"` is ephemeral; captured task output and local audit history are lost when the process exits.
- `core.backend = "postgres"` preserves run/task state and captured output for later diagnosis.
## Sequence Deterioration Observer
YARLI emits rolling deterioration analysis events (`run.observer.deterioration`) during run execution.
The observer consumes event deltas incrementally using event-store cursor reads (`after_event_id`), not full-history rescans.
YARLI also emits heartbeat progress events (`run.observer.progress`) so stream output remains active during long-running tasks.
Signals included in the score:
- Runtime drift for repeated command keys.
- Retry inflation and blocker churn.
- Failure-rate drift by reason bucket.
- Budget headroom erosion (when observed/limit values are present).
Operator surfaces:
- `yarli run status <run-id>` includes latest deterioration score/trend/factors when available.
- `yarli run explain-exit <run-id>` includes a sequence deterioration section.
- `yarli-api` (if you run it) exposes the latest report as an optional `deterioration` field on `GET /v1/runs/<run-id>/status`.
## Runtime Guard Telemetry and Operator Playbook
Runtime guards now combine hard guardrails and quality gating:
- Budget guardrails from `[budgets]` stop execution on token/usage overages.
- Task-health and soft-token-cap guardrails produce continuation guidance (`checkpoint-now`, `force-pivot`, or `stop-and-summarize`) on completion.
- Guard telemetry is observability-first and does not itself gate active scheduler semantics.
Telemetry to monitor:
- `run.observer.deterioration` and `run.observer.deterioration_cycle`
- Detect recurring instability patterns before automatic continuation paths drift too far.
- `run.observer.progress`
- Confirms progress visibility for long-running tasks and supports hang triage.
- `task.failed` with `reason = "budget_exceeded"`
- Includes observed/limit metrics, command usage, token usage, and aggregate run usage snapshots.
- `run.continuation` (`quality_gate` payload)
- Captures `task_health_action`, `reason`, and optional `trend` for operator follow-up.
Operator playbook:
Quick diagnosis cheat sheet:
- Why did the run stop? -> `yarli run explain-exit <run-id>`
- What did the task actually print? -> `yarli task output <task-id>`
- Why is a task blocked or failing? -> `yarli task explain <task-id>`
- What policy or gate decided this? -> `yarli audit query --task-id <task-id> --category gate_evaluation`
- Budget breach (`budget_exceeded`) response:
1. Pause if active: `yarli run pause <run-id>`
2. Inspect: `yarli run status <run-id>`, `yarli run explain-exit <run-id>`, `yarli task explain <task-id>`, `yarli task output <task-id>`
3. Decide: lower scope, raise limits, or cancel and rerun (`yarli run cancel <run-id>`)
- Deterioration-cycle response:
1. Inspect current state with `yarli run status <run-id>` and `yarli run explain-exit <run-id>`
2. If guidance is `force-pivot` or `stop-and-summarize`, pause/stop before continuing
3. Re-run with narrower scope, then continue via `yarli run continue` only if the continuation snapshot still matches current open tranches
- Soft-cap continuation guidance (`checkpoint-now`):
1. Treat as a review checkpoint and avoid blind continuation
2. Evaluate remaining objective and current usage headroom
3. If safe, continue; otherwise cancel and restart with adjusted `[run]` strategy
Continuation semantics:
- `yarli run continue` replays the prior continuation snapshot and is the right tool for retry/unfinished/planned-next work that already belongs to that snapshot.
- `yarli run --fresh-from-tranches` rebuilds from the current prompt/plan/tranches state and is the right tool after you enqueue new tranches or otherwise change `.yarli/tranches.toml`.
- When current `.yarli/tranches.toml` contains open tranche keys not represented in the continuation snapshot, `yarli run continue` now refuses with an actionable drift message instead of silently replaying stale scope.
## Merge Conflict Policy and Incident Response
- `run.parallel_merge_failed` is emitted for unresolved or unrecoverable conflicts during parallel workspace merge finalization.
- `run.parallel_merge_succeeded` confirms successful parallel merge finalization.
- Merge telemetry events are also emitted as part of merge apply tracking:
- `merge.apply.started`
- `merge.apply.conflict`
- `merge.apply.finalized`
- `merge.repair.succeeded`
- `merge.repair.failed`
Merge policy modes (`[run].merge_conflict_resolution`) are:
- `fail` (default): stop on conflict, preserve workspace for operator review.
- `manual`: preserve workspace and emit explicit recovery instructions for manual intervention.
- `auto-repair`: attempt deterministic patch-side repair before falling back to manual recovery.
Incident response workflow:
- Pull latest failure context:
1. `yarli run status <run-id>`
2. `yarli run explain-exit <run-id>`
3. `yarli task output <task-id>` for raw command stdout/stderr from the failing task.
4. `yarli audit tail` or `yarli audit query` for merge telemetry, policy decisions, and task-level context.
- Open `PARALLEL_MERGE_RECOVERY.txt` in the preserved workspace root reported by `run.parallel_merge_failed`.
- Run recovery commands from the note in order (status, patch diff stats, manual `git apply` retry).
- Resolve conflicts and re-run with your normal operator decision path:
- `yarli run continue` when continuation is available and still matches current open tranches, or
- rerun the target tranche after scope and policy review.
- If recovery requires policy change, update `run.merge_conflict_resolution` (`fail`, `manual`, or `auto-repair`) and resume according to post-incident policy.
Evidence capture for incident records should include:
- `run.parallel_merge_failed` payload (including `task_key`, `patch_path`, `workspace_path`, `conflicted_files`, `repo_status`, `recovery_hints`)
- associated `merge.apply.*` and `merge.repair.*` telemetry events
- `PARALLEL_MERGE_RECOVERY.txt` and operator remediation steps.
Policy defaults:
- Keep guard-related telemetry in evidence (`evidence/<loop-id>/`) to preserve trend history.
- Avoid runtime-guard bypass shortcuts; use continuation/replan actions and explicit operator commands.
- Treat `yarli audit tail` as a local JSONL view by default; in durable deployments it complements, but does not replace, the persisted event-store record.
## External Agent Skill Contract
The external `yarli-execution-loop` skill is expected to treat Yarli as the durable control plane:
- Inspect Yarli state first (`.yarli/continuation.json`, `.yarli/tranches.toml`, `yarli run status`, `yarli run explain-exit`).
- Enqueue newly discovered work durably through `yarli plan tranche add --idempotent`.
- Choose `yarli run continue` only for snapshot-owned work with no drift.
- Choose `yarli run --fresh-from-tranches` after new tranche enqueue or other live-plan changes.
- End each agent cycle with a `YARLI_DECISION_V1` block containing `status`, `reason`, `enqueued_tranches`, and `next_command`.
### Idempotent Tranche Enqueue
`yarli plan tranche add --idempotent` is designed for safe repeated invocation in
agent loops. The semantics are:
- **Matching fields → no-op.** If a tranche with the given key already exists and
all effective fields (summary, group, allowed_paths, verify, done_when, max_tokens)
match the request, the command prints a confirmation message and returns success
without modifying the file.
- **Mismatched fields → error.** If the key exists but any effective field differs,
the command fails with a message listing which fields differ. This prevents
accidental overwrites while making the conflict visible to the caller.
- **New key → normal add.** If no tranche with the key exists, the tranche is
appended exactly as with a non-idempotent `add`.
- **Safe for repeated invocation.** Because identical calls are no-ops and
conflicting calls are errors, agents can call `--idempotent` unconditionally at the
start of every cycle without side effects or silent data loss.
### Strict Tranche Contract
Use `[run.tranche_contract]` when agent consumers need stronger execution guarantees without breaking legacy repositories:
- `strict = true` turns on all checks below.
- `require_verify = true` rejects open tranches that omit `verify`.
- `require_done_when = true` rejects open tranches that omit `done_when`.
- `enforce_allowed_paths_on_merge = true` fail-closes merge finalization when a tranche edits paths outside its declared `allowed_paths`.
Behavior notes:
- These checks apply to open tranches only; completed legacy tranches are left alone.
- When merge-time path enforcement is enabled, YARLI automatically surfaces `allowed_paths` in tranche prompts even if `[run].enforce_plan_tranche_allowed_paths` is otherwise false.
- Control-plane files such as `IMPLEMENTATION_PLAN.md` and `.yarli/*` remain exempt so normal plan/evidence bookkeeping does not trip the scope gate.
## Reproducing Budget Stress Checks Locally
Run these commands to verify budget governance under parallel workload:
```bash
# Single-task budget breach: task exceeds token limit, fails without retry
cargo test -p yarli test_budget_exceeded_fails_task_without_retry -- --nocapture
# Run-level budget breach: cumulative tokens across tasks exceed run limit
cargo test -p yarli test_run_token_budget_exceeded_across_tasks -- --nocapture
# Parallel-task budget stress: 4 concurrent tasks with tight run budget,
# proves accounting consistency and no silent continuation after breach
cargo test -p yarli test_parallel_tasks_budget_accounting_consistency -- --nocapture
# Command execution resource capture (exec layer)
cargo test -p yarli-exec -- --nocapture
```
Expected behavior:
- All commands exit `0`.
- Budget failure events include `reason = "budget_exceeded"` with observed/limit metrics.
- Run does not reach `RunCompleted` after any budget breach.
## Migration Workflow
YARLI schema SQL lives under `crates/yarli-store/migrations/`.
Use CLI migration workflow:
```bash
yarli migrate status
yarli migrate up
yarli migrate backup --label <label> # optional before maintenance windows
yarli migrate down --target 0002 # optional rollback (creates backup)
yarli migrate restore --label <label> # rollback recovery
```
Recommended:
- Use a dedicated database for YARLI runtime data.
- Apply migrations before running durable CLI write commands.
## Event Store Backup and Restore for Production
YARLI’s source of truth is the Postgres event store and queue state.
Recommended production backup policy:
- Daily validated dump.
- Timely incremental backup or PITR according to retention policy.
- Quarterly restore drill in a non-production cluster.
Example full backup command:
```bash
export YARLI_DB_URL="postgres://postgres:postgres@postgres.example.internal:5432/yarli"
export BACKUP_FILE="/var/backups/yarli-eventstore-$(date -u +%Y%m%dT%H%M%SZ).dump"
pg_dump -Fc "$YARLI_DB_URL" > "$BACKUP_FILE"
sha256sum "$BACKUP_FILE" > "${BACKUP_FILE}.sha256"
```
Example restore validation workflow:
```bash
export YARLI_RESTORE_DB_URL="postgres://postgres:postgres@postgres.example.internal:5432/yarli_restore"
createdb -h postgres.example.internal -U postgres yarli_restore
pg_restore --clean --if-exists --no-owner --no-privileges --dbname "$YARLI_RESTORE_DB_URL" "$BACKUP_FILE"
export DATABASE_URL="$YARLI_RESTORE_DB_URL"
yarli migrate status
```
Validation rules:
- Restore commands complete without ownership/permission warnings.
- `yarli migrate status` shows a healthy, readable event store.
- `yarli run status --all` works against restored data.
## Scaling Strategy for Queue and Scheduler Capacity
- Use `yarli run status --all` and run-level backlog patterns as first indicators.
- Horizontal scale first when sustained backlog grows across multiple active runs:
- `helm upgrade <release> deploy/helm/yarli --set scheduler.replicas=<N>`
- Vertical tuning when backlog remains with healthy pod counts:
- update scheduler CPU/memory in `deploy/helm/yarli/values.yaml` (`scheduler.resources`).
- Throughput tuning for event-driven claim bursts:
- `queue.per_run_cap`
- `queue.io_cap`
- `queue.cpu_cap`
- `queue.git_cap`
- `queue.tool_cap`
- Vertical and horizontal changes should be paired with post-scale verification.
## Incident Response Playbook
Use this flow for production incidents requiring operator action without code changes.
### 1) Stuck runs
1. `yarli run status --all`
2. For the affected run:
- `yarli run explain-exit <run-id>`
- `yarli task list <run-id>`
- `yarli task explain <task-id>`
3. Pull recent operator telemetry:
- `yarli audit tail`
4. If the run must be stopped:
- `yarli run pause --all-active --reason "incident triage"`
- fix root cause, then `yarli run resume --all-paused --reason "triage complete"`
### 2) Queue backlogs
1. Confirm queue behavior:
- `yarli run status --all`
2. Correlate with `run` and `task` telemetry:
- `yarli audit tail`
3. Contain and recover:
- reduce new claim pressure if needed
- increase scheduler replicas
- pause and drain if the backlog is destabilizing critical workloads
4. Resume in a controlled mode after scheduler and DB signals stabilize.
### 3) Budget breaches
1. Detect breach and isolate impact:
- `yarli task explain <task-id>`
- `yarli run explain-exit <run-id>`
2. Preserve incident evidence:
- `yarli run status <run-id>`
- `yarli run explain-exit <run-id>`
- `yarli audit tail`
3. Corrective actions:
- `yarli run pause <run-id>` if mitigation needs immediate human hold
- adjust `[budgets]` for rerun
- `yarli run continue` after intent review.
## Deterministic Local Strict Postgres Verification Workflow
Run this exact sequence from repository root on a clean shell.
Start and prepare a local Postgres 16 instance:
```bash
-e POSTGRES_USER=postgres \
-e POSTGRES_PASSWORD=postgres \
-e POSTGRES_DB=postgres \
-p 55432:5432 \
postgres:16
until pg_isready -h 127.0.0.1 -p 55432 -U postgres -d postgres >/dev/null 2>&1; do sleep 1; done
```
Sanity-check strict mode fails fast when DB env is missing:
```bash
unset YARLI_TEST_DATABASE_URL
export YARLI_REQUIRE_POSTGRES_TESTS=1
cargo test -p yarli --test yarli_store_postgres_integration -- --nocapture
```
Expected result: command fails and includes
`postgres integration tests require YARLI_TEST_DATABASE_URL when YARLI_REQUIRE_POSTGRES_TESTS=1`.
Run strict Postgres integration suites (real execution path):
```bash
export YARLI_TEST_DATABASE_URL=postgres://postgres:postgres@localhost:55432/postgres
export YARLI_REQUIRE_POSTGRES_TESTS=1
cargo test -p yarli --test yarli_store_postgres_integration -- --nocapture
cargo test -p yarli --test yarli_queue_postgres_integration -- --nocapture
cargo test -p yarli --test yarli_cli_postgres_integration -- --nocapture
```
True execution signals:
- Each command exits with status `0`.
- Output contains `test result: ok`.
- Output does not contain `skipping postgres integration test`.
Cleanup:
```bash
docker rm -f yarli-r4v-postgres
```
## Baseline Workspace Verification
```bash
cargo fmt --all
cargo clippy --workspace --all-targets
cargo test --workspace
```
## Consistency and Governance Contract
See `docs/CONSISTENCY_CONTRACT.md` for the canonical strong-consistency and runtime-governance contract, including:
- Per-aggregate transition ordering and linearizable state.
- Durable-state-before-side-effects invariant.
- Idempotency key replay behavior.
- Queue lease single-owner invariant.
- Resource and token accounting fields and budget breach outcomes.
## Acceptance Decision
Use `docs/ACCEPTANCE_RUBRIC.md` to determine a binary outcome:
- `PASS`: all required checks are proven with tracked evidence.
- `UNVERIFIED`: any required check is missing, failing, or not evidenced.
## Telemetry and Observability
YARLI exports structured telemetry via OpenTelemetry (OTLP) when configured. This enables deep introspection into scheduler performance, command execution latency, and resource utilization.
### OTLP Collector Setup (Local Example)
To capture and visualize telemetry locally using Jaeger (traces) and Prometheus (metrics):
1. **Start the collector stack (Jaeger + Prometheus + OTel Collector):**
```bash
docker run -d --name jaeger \
-e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
-p 5775:5775/udp \
-p 6831:6831/udp \
-p 6832:6832/udp \
-p 5778:5778 \
-p 16686:16686 \
-p 14268:14268 \
-p 14250:14250 \
-p 9411:9411 \
jaegertracing/all-in-one:1.60
```
2. **Configure YARLI to export telemetry:**
Set environment variables before running `yarli run`:
```bash
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317"
export OTEL_SERVICE_NAME="yarli-scheduler"
export RUST_LOG="info,yarli=debug" ```
3. **Visualize Traces:**
Open http://localhost:16686/ in your browser.
### Metric and Trace Conventions
**Metrics (Prometheus format):**
| `yarli_queue_depth` | Gauge | - | Current tasks in queue (pending + leased). |
| `yarli_runs_total` | Counter | `state` | Run state transitions. |
| `yarli_tasks_total` | Counter | `state`, `command_class` | Task state transitions. |
| `yarli_commands_total` | Counter | `command_class`, `exit_reason` | Command execution outcomes. |
| `yarli_command_duration_seconds` | Histogram | `command_class` | End-to-end command duration. |
| `yarli_command_overhead_duration_seconds` | Histogram | `command_class`, `phase` | Internal overhead (spawn, capture). |
| `yarli_scheduler_tick_duration_seconds` | Histogram | `stage` | Scheduler loop timing (`scan`, `claim`, etc). |
| `yarli_store_duration_seconds` | Histogram | `operation` | Event store latency (`append`, `query`). |
| `yarli_store_slow_queries` | Counter | `operation` | Database operations exceeding 1s. |
**Traces (Spans):**
- `run_execution`: Root span for a `yarli run` invocation.
- `scheduler.tick`: Wraps a single scheduler loop iteration.
- `scheduler.execute_task`: Wraps a task execution attempt.
- `command.execute`: Wraps the low-level process spawn and wait.
- `store.append`, `store.query`: Wraps database operations.
### Recommended Alerting Thresholds
For production deployments, monitor these signals:
1. **Stalled Queue:**
- Alert if `yarli_queue_depth > 0` and `yarli_scheduler_tick_duration_seconds_count` stops increasing for 5m.
- Indicates scheduler hang or crash.
2. **High Failure Rate:**
- Alert if `rate(yarli_commands_total{exit_reason!="success"}[5m]) / rate(yarli_commands_total[5m]) > 0.1`.
- Indicates systemic failure (bad config, network down).
3. **Slow Database:**
- Alert if `rate(yarli_store_slow_queries[5m]) > 0`.
- Indicates DB performance degradation affecting scheduler throughput.
4. **Resource Saturation:**
- Alert if `yarli_run_resource_usage` approaches configured budget limits.
### Troubleshooting Telemetry
**Missing Exports:**
- Verify `OTEL_EXPORTER_OTLP_ENDPOINT` is reachable.
- Check logs for "OpenTelemetry trace error" or "metrics export failed".
- Ensure the collector accepts OTLP/gRPC (port 4317) or OTLP/HTTP (port 4318) matching your endpoint scheme.
**High Cardinality:**
- YARLI metrics use bounded label sets (enums).
- Avoid adding unbounded labels (like `run_id` or `task_id`) to long-lived metrics.
- `run_resource_usage` is an exception; it is labelled by `run_id` but should be transient or aggregated.