yore-cli 0.4.0

Fast document indexer for finding duplicates and searching content
# IMPLEMENTATION_PLAN

## Project: yore vocabulary feature (for faster-whisper prompting)

This plan is split into **small, YARLI-friendly tranches** so we can execute work in parallel where safe.

## Global objective
Add a `yore vocabulary` command that:
- derives vocabulary from index coverage (no full-file re-index),
- returns unstemmed, domain-relevant terms with stable ranking,
- defaults to one term per line,
- supports `--format json`,
- supports `--format prompt` (comma-separated for Whisper),
- supports `--limit N` (default 100),
- resolves `--index`,
- filters stop/common words (default dictionary stop set + optional custom set).

## Tranche model
- **Priority:** earlier tranches unblock later tranches.
- **Parallelism:** tasks without hard dependency can run concurrently in separate YARLI runs.
- **Execution guard:** keep each tranche focused to avoid token churn.

### Tranche T1 — Contract and dispatch
**Priority:** high  
**Parallelizable:** yes (with T2 once option names are agreed)  
**Status:** ✅ Implemented (2026-02-21 run completed for `YORE-CORE-01`; command wiring, argument surface, and dispatch stub are now present in `src/main.rs`; runtime behavior remains no-op placeholder by design until later tranches)  
**Goal:** Make the feature discoverable and runnable.

- Add `Commands::Vocabulary` to `src/main.rs` with:
  - `--index <path>` (default `.yore`)
  - `--limit <usize>` (default `100`)
  - `--format [lines|json|prompt]` (default `lines`)
  - `--json` alias for `--format json`
  - optional `--stopwords <path>` (custom stop-list file)
  - optional `--include-stemming` (if we decide to retain stem-only terms as fallback)
- Add dispatch branch in `run()` to call `cmd_vocabulary`.
- Add `cmd_vocabulary` signature and output/result structs (lightweight).
- Add help text + examples.

**Acceptance criteria**
- `yore help` shows `vocabulary` command and flags.
- Build continues to compile with existing features untouched.

Evidence:
- Added `Commands::Vocabulary` to `src/main.rs` with `--index`, `--limit`, `--format [lines|json|prompt]`, `--json`, `--stopwords`, and `--include-stemming`.
- Added `Commands::Vocabulary` dispatch branch in `run()` and a `cmd_vocabulary` stub that currently emits empty results in the requested format.
- Added `VocabularyResult` output struct for staged JSON path and format validation in `cmd_vocabulary` (`lines`, `json`, `prompt`).
- Verification for this tranche run:
  - `cargo check -q`
  - `cargo run --quiet -- vocabulary --help`
  - `cargo run --quiet -- vocabulary --format json --limit 3`

### Tranche T2 — Candidate extraction (reverse-index intake)
**Status:** ✅ complete
**Priority:** high  
**Parallelizable:** **only** with T1’s type definitions finalized  
**Goal:** Harvest candidate terms from reverse-index-backed content.

- Read existing reverse-index postings (`stem -> doc_ids`) and supporting index metadata.
- Keep stem keys stable and deterministic during extraction.
- Track term and document frequency maps with stable ordering by first-seen and frequency.

**Acceptance criteria**
- Terms missing from index gracefully degrade to empty list.
- No stem-only output leaks into the default CLI path.

**Status update**
- `YORE-CORE-02` completed in `src/main.rs` by loading `reverse_index.json` in `cmd_vocabulary`, aggregating candidate terms across postings, tracking doc/term frequencies, and producing deterministic ordering by frequency plus deterministic first-seen tuple.
- The command now returns extracted terms instead of an empty result set, and missing `reverse_index.json` is handled as an empty candidate set.
- Evidence command run: `cargo test --quiet`

### Tranche T2A — Stem-to-surface mapping
**Priority:** high
**Parallelizable:** no (depends on T2)
**Goal:** Resolve stems to representative surface forms.

- Map stemmed candidates back to source surface forms using reverse/forward evidence.
- Preserve deterministic casing behavior for equal candidates.
- Document first-seen precedence and tie-break semantics.

**Status:** ✅ complete

Status update:
- `YORE-CORE-02A` completed in `src/main.rs` by adding a deterministic stem-to-surface resolver inside `cmd_vocabulary`.
- Surface candidates are sourced from heading evidence in `ReverseEntry` first, then forward index term evidence as fallback; ties are resolved deterministically by source rank, path, line, token order, and lexical casing.
- When no surface match is available, output remains stem-only only when `--include-stemming` is enabled; otherwise that candidate is skipped.

Verification for this tranche run:
- `cargo run --quiet -- vocabulary --index .yore --format lines --limit 8`
- `cargo run --quiet -- vocabulary --index .yore --format json --limit 8`
- `cargo run --quiet -- vocabulary --index .yore --format prompt --include-stemming --limit 8`
  - observed on `2026-02-21`:
    - lines output: `yore`, `output`, `command`, `similar`, `json`, `indexer`, `build`, `files`
    - json output: total=`873`, terms=`["yore","output","command","similar","json","indexer","build","files"]`
    - prompt output: `yore, output, command, similar, json, indexer, build, files`

**Acceptance criteria**
- Surface output is stable across runs with identical index data.

### Tranche T2B — Ranking and truncation
**Priority:** high  
**Parallelizable:** no (depends on T2A)  
**Goal:** Enforce deterministic ranking and limit behavior.

- Apply stable ranking and tie-break rules after frequency scoring.
- Enforce `--limit` against deterministic ordered candidates.

**Status:** ✅ complete

Status update:
- `YORE-CORE-03B` completed in `src/main.rs` by adding an explicit first-heading tie-break into `cmd_vocabulary` ranking.
- Candidate order now resolves ties using first-seen file, line, heading, and lexical term after frequency scores.

**Acceptance criteria**
- Ranked output remains stable and deterministic.

Validation for this tranche run:
- `cargo test --quiet`

### Tranche T3 — Filtering and scoring
**Priority:** high  
**Depends on:** T1, T2  
**Goal:** Deliver only domain-relevant vocabulary.

- **Implemented in current tranche run (2026-02-21 run):**
  - Added default English/common stop-word filtering in `cmd_vocabulary`.
  - Added optional `--stopwords` file merging with default stop words.
  - Added hygiene filtering for term length, numeric-heavy tokens, and punctuation-only tokens.
  - Enforced these filters before `--limit` truncation while preserving deterministic ranking.
  - Fixed `cmd_vocabulary` output construction to preserve custom stop-word path metadata correctly (avoid accidental `HashSet::map` call).

Validation evidence:
- `cargo run --quiet -- vocabulary --index .yore --format lines --limit 12`
- `cargo run --quiet -- vocabulary --index .yore --format json --limit 12`
- `cargo run --quiet -- vocabulary --index .yore --stopwords /tmp/yore-vocabulary-stopwords.txt --limit 12`

**Acceptance criteria**
- Common-language terms are filtered by default.
- Output ordering is stable and deterministic.

### Tranche T4A — JSON output contract
**Priority:** medium  
**Depends on:** T2B, T3  
**Goal:** Finalize structured JSON output schema.

- Define a stable JSON payload including term, score, and count fields.
- Ensure schema remains constant for unchanged index state.

**Status:** ✅ complete

Implemented in this tranche:
- Changed `VocabularyResult` JSON payload to emit term objects with `term`, `score`, and `count` fields.
- Kept `lines` and `prompt` output renderers compatible while sourcing rendered values from the same structured vocabulary model.

Validation evidence:
- `cargo run --quiet -- vocabulary --index .yore --format json --limit 12`

**Acceptance criteria**
- `--format json` emits schema-compatible, deterministic payloads.

### Tranche T4 — Renderers and formatting
**Priority:** medium  
**Depends on:** T2B, T3  
**Goal:** Produce required output modes.

**Status:** ✅ complete

- Implement formatter path:
  - `lines` (default): one term per line.
  - `json`: list of terms or objects (`{ term, score, count }`) as decided.
  - `prompt`: comma-separated string (no trailing separator).
- Validate `--json` alias behavior and format validation.

Status update:
- `YORE-CORE-04` completed in `src/main.rs` by adding dedicated `lines` and `prompt` renderers, plus prompt-term normalization for whitespace/control-character cleanup.
- Verification for this tranche run:
  - `cargo run --quiet -- vocabulary --index .yore --format lines --limit 12`
  - `cargo run --quiet -- vocabulary --index .yore --format prompt --limit 12`
  - `cargo run --quiet -- vocabulary --index .yore --format json --limit 12`

**Acceptance criteria**
- `--format json` emits valid JSON.
- `--format prompt` is whitespace-normalized and LLM-safe.

### Tranche T5A — CLI integration tests
**Priority:** medium  
**Depends on:** T1, T2, T2A, T2B, T3, T4  
**Status:** ✅ complete  

Status update:
- Implemented in this tranche in `tests/vocabulary_cli_integration.rs`:
  - Added command-level wiring checks for `yore vocabulary --help`.
  - Added end-to-end format coverage for `lines`, `json`, and `prompt` output using the same fixture index.
  - Added `--limit` and default filtering assertions.
  - Added custom stop-word flow and `--json` alias coverage.
  
Validation command for this tranche:
- `cargo test --test vocabulary_cli_integration`

**Goal:** Validate command-level behavior for all formats.

- Add integration coverage for:
- `yore vocabulary` wiring and flags,
  - JSON prompt and line output consistency,
  - `--limit` and custom stopword flows.

**Acceptance criteria**
- `cargo test --test` level checks for CLI command behavior pass.

### Tranche T5 — Unit tests
**Priority:** medium  
**Depends on:** T1, T2, T2A, T2B, T3, T4  
**Goal:** Prevent regressions in core data transforms.

- Add/extend unit tests:
  - stem-to-surface mapping,
  - stopword/common-word filtering,
  - output formatting for each mode,
  - `--limit` bounds.
- Add/refresh integration tests:
  - command wiring for `yore vocabulary ...`
  - JSON schema shape check.

**Acceptance criteria**
- `cargo test` passes for new and existing vocabulary-related paths.
- Command returns expected structures on fixture corpus.

**Status:** ✅ complete (2026-02-21 run)

Status update:
- Implemented in this tranche:
  - Added dedicated vocabulary helpers in `src/main.rs`:
    - `apply_vocabulary_limit` (preserves total candidate count)
    - `render_vocabulary_lines`
    - `resolve_vocabulary_surface`
  - Added unit coverage for:
    - `apply_vocabulary_limit` bounds and total-count behavior
    - `render_vocabulary_lines` output
    - `render_vocabulary_prompt` normalization
    - `VocabularyResult` JSON schema shape
    - deterministic `resolve_vocabulary_surface` heading-vs-forward and fallback behavior
- Refactored vocabulary internals so helpers are separately testable without changing CLI output semantics.
- Verification command run:
  - `cargo test`

### Tranche T6 — Docs + release notes
**Priority:** low  
**Depends on:** T1..T5, T5A  
**Goal:** Publish feature with clear user guidance.

**Status:** ✅ complete

- Update `README.md` command reference and examples.
- Add changelog entry for the release.
- Update any hook/task docs that reference command coverage.

Status update:
- `YORE-DOCS-01` completed by adding query diagnostics usage in `README.md` (text + JSON payload explanation), adding a full `yore vocabulary` command reference and examples, and updating `CHANGELOG.md`.
- `yore query --explain` documentation now references JSON diagnostics fields (`tokens`, `stems`, `missing_terms`, `idf`, `bm25`, `index_path`, `doc_count`).
- `yore vocabulary` is now documented with format modes (`lines`, `json`, `prompt`), stop-word controls, and example prompts.

**Acceptance criteria**
- New command discoverable in docs and examples.
- Release note explicitly states what changed.

Validation command for this tranche:
- `rg -q "vocabulary|query diagnostics" README.md CHANGELOG.md`

### Tranche YORE-RELEASE-01 — Release workflow and version semantics
**Priority:** medium  
**Depends on:** T6, YARLI onboarding  
**Goal:** Align release workflow and version semantics.

**Status:** ✅ complete

Implemented in this tranche:
- Changed publish workflow trigger to explicit semver-tag patterns (`v*.*.*`) instead of wildcard tags.
- Added an early secret/token guard for `CARGO_REGISTRY_TOKEN` using repository secret context.
- Tightened tag/version validation to require explicit `v`-prefixed SemVer-like tags and exact match to `Cargo.toml` version.

Validation command for this tranche:
- `bash -n .github/workflows/publish-crates.yml`

### Tranche Hook-01 — Sensitive-content scan hardening
**Priority:** high
**Depends on:** baseline hook wiring
**Status:** ✅ complete (2026-02-21 run)

- Expanded pre-commit sensitive scan patterns for additional high-risk secrets:
  - AWS session/access keys (`ASIA...`, `A3T...`)
  - GitHub token variants (`ghr_...`)
  - Slack, OpenAI, Google, Stripe, and Hugging Face API-style tokens
  - Encrypted/private key variant coverage (`BEGIN ENCRYPTED PRIVATE KEY`)
- Kept existing bypass and failure behavior unchanged; only detection coverage changed.

Validation for this tranche:
- `bash scripts/abs-path-leak-scan.sh`

### Tranche Hook-01A — Absolute-path disclosure scan hardening
**Priority:** high
**Depends on:** Hook-01
**Status:** ✅ complete (2026-02-21 run)

- Expanded absolute-path detection in `scripts/abs-path-leak-scan.sh` to include:
  - explicit `file://` URI paths,
  - UNC path forms,
  - expanded POSIX absolute-location coverage for `/tmp`, `/opt`, `/var`, `/etc`, `/mnt`, and `/Users`/`/home` home-style paths.
- Added clearer scanner guidance with optional debug visibility via `YORE_ABS_PATH_MASK=0` while still masking by default.
- Preserved existing skip behavior (`YORE_SKIP_ABSOLUTE_PATH_CHECK=1`) and kept scan strictness non-regressive for home/repo path cases.
- Added masking coverage in diagnostics for `/tmp`, `/opt`, `/var`, `/etc`, and `/mnt`.

Validation for this tranche:
- `bash scripts/abs-path-leak-scan.sh`

### Tranche Hook-02 — Finalize squash-scope reporting UX
**Priority:** medium
**Depends on:** Hook-01A
**Status:** ✅ complete (2026-02-21 run)

- Added richer heuristic candidate rendering in `scripts/squash-scope-review.sh` by including commit subjects for each flagged pair.
- Added explicit rewrite guidance and next-action recommendation (interactive rebase command and `squash`/`fixup` direction) when candidates are found.
- Preserved existing `SQUASH_SCOPE_NEEDS_REWRITE` machine-readable signal and heuristic disclaimer.

Validation for this tranche:
- `bash scripts/squash-scope-review.sh`

### Tranche Hook-02A — Optional AI-backed squash recommendations
**Priority:** medium  
**Depends on:** Hook-02  
**Status:** ✅ complete (2026-02-21 run)

- Added AI-assisted mode to `scripts/squash-scope-review.sh` gated by `YORE_SQUASH_AI_ENABLED=1`.
- Added required helper configuration via `YORE_AI_HELPER`, optional `YORE_AI_ARGS`, and `YORE_AI_INPUT_MODE` support.
- Captured structured heuristic candidate payload (`score`, `overlap`, `reason`) and passed it to the AI helper for actionable recommendations.
- Added graceful degradation:
  - helper unavailable or command missing -> `AI review skipped` message
  - helper failures or empty output -> fallback to heuristic report only

Validation for this tranche:
- `bash scripts/squash-scope-review.sh`

### Tranche YARLI-01 — Finalize YARLI execution onboarding map
**Priority:** medium  
**Status:** ✅ complete (2026-02-21 run)

- Confirmed `yarli-orchestration` is represented as its own onboarding tranche in the execution map.
- Finalized and validated onboarding metadata in `.yarl/tranches.toml`.

Validation for this tranche:
- `yarli plan validate`

## YARLI execution map
- **Wave 1:** T1 + T2 (parallel if safe within dependencies).
- **Wave 2:** T2A + T2B + T3 (depends on Wave 1).
- **Wave 3:** T4 + T4A + T5.
- **Wave 4:** T5A + T6 + final PR/docs sign-off.

## Onboarding: active YARLI tranches
These work packages are mirrored in `.yarli/tranches.toml` for machine execution:
- `vocabulary-feature` (`YORE-CORE-01`, `YORE-CORE-02*`, `YORE-CORE-03*`, `YORE-CORE-04*`, `YORE-CORE-05*`)
- `commit-hardening` (`YORE-HOOK-01`, `YORE-HOOK-01A`, `YORE-HOOK-02`, `YORE-HOOK-02A`)
- `yarli-orchestration` (YORE-YARLI-01)
- `release-and-docs` (YORE-DOCS-01..YORE-RELEASE-01)

Use `yarli plan validate` to verify syntax and `yarli plan tranche list` to inspect state.

## Exit criteria
- New command works: `yore vocabulary --index .yore --limit 200 --format prompt`.
- Existing CLI behavior remains stable.
- Change is ready for review, squashing review, and PR merge.

## Next tranche map draft (post-release hardening)
This is the next execution plan, split so each YARLI task stays narrow.

### Wave A: hook hardening cleanup
- **YORE-MAINT-01 — Cross-project hook onboarding**
  - Validate and document `.githooks` versus shared `/home/<user>/Documents/commithooks` behavior for mixed clone layouts.
  - Add clarifying examples for `YORE_SKIP_*` and `YORE_ABS_PATH_MASK` in a shared hook setup note.
- **YORE-MAINT-02 — Scanner coverage hardening**
  - Add/refresh regression fixtures for absolute-path detection and masking outputs.
  - Keep false positives low while still catching home/repo paths, file:// URIs, and mount-style locations.

### Wave B: docs and operational checks
- **YORE-MAINT-03 — Workflow behavior**
  - Verify publish workflow still requires explicit tags and tag/version matching before `cargo publish`.
- **YORE-MAINT-04 — Runbook update**
  - Capture the exact PR-to-publish flow (feature branch -> PR -> merge -> tag -> publish) in docs for operators.

### Wave C: release readiness
- **YORE-MAINT-05 — Reconcile commit scope**
  - Produce a concise commit-scope report for the above waves and confirm whether any work should be squashed before merge.

Planned tranches are tracked as draft work items and can be materialized into `.yarl/tranches.toml` in the next cycle when you want YARLI to execute them.