spool-memory 0.2.3

# Phase 4 Round 15 Plan — Measure & Excerpt

## Summary

Phase 4 (Retrieval Intelligence) has five eventual deliverables —
stale detection, contradiction detection, confidence scoring,
semantic retrieval, and better excerpt extraction. Round 15 is the
**measurement-first** entry round: it expands what we can see about
the existing heuristic ranker before deciding which deliverable
deserves Round 16.

`docs/NEXT_STEPS.md` explicitly forbids jumping to embeddings before
"current heuristic / debuggability limits are measured". Round 15
honors that constraint by:

1. promoting structured score breakdown to a first-class output
2. centering the excerpt window on the actual matched term
3. extending the criterion bench to a larger vault size

The aim is to make one question cheap to answer for every retrieval
failure: *which contributions added up to this score, and which
section did the excerpt come from*. Once that is the case, choosing
between "stale detection" and "semantic retrieval" for the next
round becomes evidence-based.

## Sub-tasks

| # | Scope | Status |
|---|-------|--------|
| 1 | `ScoreContribution` / `ScoreSource` data model in `src/domain/note.rs` | ✅ |
| 2 | `score_note` returns `(i32, Vec<String>, Vec<ScoreContribution>)` via an `Accumulator` that keeps reasons + breakdown in sync with `score` | ✅ |
| 3 | `CandidateNote` carries `score_breakdown`; explain markdown shows nested per-contribution rows | ✅ |
| 4 | `cli_smoke` json case asserts `breakdown.weights.sum() == score` (the ground-truth contract) | ✅ |
| 5 | `excerpt_for_input` adds wikilink dimension to section scoring; `build_section_excerpt_for_terms` centers the body window on the first hit + ellipsizes edges | ✅ |
| 6 | bench widens to `[250, 1000, 5000]` notes across `scan_notes` / `build_context` / `build_bundle` | ✅ |
| 7 | this plan + `docs/PHASE_4_R1_BASELINE.md` baseline numbers | ✅ |
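
The centering behaviour in sub-task 5 can be sketched as follows. This is a minimal illustration, not the actual `build_section_excerpt_for_terms`: the name `centered_excerpt`, its signature, the character-based `width`, and the ASCII-only case folding are all assumptions made for the example.

```rust
/// Hypothetical helper (not the real `build_section_excerpt_for_terms`):
/// returns at most `width` characters of `body`, centered on the earliest
/// ASCII-case-insensitive hit of any term, with `…` marking truncated edges.
fn centered_excerpt(body: &str, terms: &[&str], width: usize) -> String {
    let chars: Vec<char> = body.chars().collect();
    if chars.len() <= width {
        return body.to_string(); // nothing to trim
    }

    // ASCII lowercasing keeps byte offsets identical between `body` and `haystack`.
    let haystack = body.to_ascii_lowercase();
    let hit = terms
        .iter()
        .filter(|t| !t.is_empty())
        .filter_map(|t| {
            haystack
                .find(t.to_ascii_lowercase().as_str())
                .map(|pos| (pos, t.len()))
        })
        .min_by_key(|&(pos, _)| pos);

    let (start, end) = match hit {
        Some((pos, len)) => {
            // Convert byte offsets to character indices before windowing.
            let hit_start = body[..pos].chars().count();
            let hit_len = body[pos..pos + len].chars().count();
            let center = hit_start + hit_len / 2;
            let start = center.saturating_sub(width / 2).min(chars.len() - width);
            (start, start + width)
        }
        // Assumption: with no hit, fall back to the head of the body.
        None => (0, width),
    };

    let mut out = String::new();
    if start > 0 {
        out.push('…');
    }
    out.extend(chars[start..end].iter());
    if end < chars.len() {
        out.push('…');
    }
    out
}
```

With the body `"alpha beta gamma delta epsilon zeta eta theta"`, the term `"epsilon"`, and a width of 11, this sketch returns `…a epsilon z…`; with no matching term it returns the head of the body followed by a single trailing `…`.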

## Why this order

- **Breakdown before excerpt before bench**: every change through
  sub-task 5 is local to the scorer and excerpt code and adds zero
  runtime cost, so the bench numbers recorded at the end (sub-task 6)
  already reflect the new surface. Round 15 does not need a
  "before/after" comparison to ship: the contract is *new
  visibility*, not faster code.
- **Reasons string stays**: it is still emitted alongside breakdown.
  Markdown renderers, existing CLI users, and any external scripts
  that grep for reason text keep working. `score_breakdown` is the
  preferred surface for new code.
- **Sum-equals-score** is a load-bearing invariant. Breakdown is
  only useful if it accounts for the score completely; the cli_smoke
  assertion guards against any future contributor adding a `score
  +=` that bypasses `Accumulator::add`.
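
The sum-equals-score invariant can be illustrated with a small sketch. The field names, the `ScoreSource` variants, the weights, and `score_note_sketch` are all hypothetical; only the type names `ScoreContribution`, `ScoreSource`, `Accumulator`, and the `add` method come from this plan. The point is that a single `add` is the only place that mutates the score, so the reasons and the breakdown cannot drift from it.

```rust
/// Hypothetical variants; the real `ScoreSource` in `src/domain/note.rs` may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ScoreSource {
    TitleMatch,
    BodyMatch,
    Wikilink,
}

/// Assumed shape of one breakdown row.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ScoreContribution {
    source: ScoreSource,
    weight: i32,
    reason: String,
}

/// Keeps score, reasons, and breakdown in sync by construction.
#[derive(Debug, Default)]
struct Accumulator {
    score: i32,
    reasons: Vec<String>,
    breakdown: Vec<ScoreContribution>,
}

impl Accumulator {
    /// The only place where `score` changes.
    fn add(&mut self, source: ScoreSource, weight: i32, reason: &str) {
        self.score += weight;
        self.reasons.push(reason.to_string());
        self.breakdown.push(ScoreContribution {
            source,
            weight,
            reason: reason.to_string(),
        });
    }

    fn finish(self) -> (i32, Vec<String>, Vec<ScoreContribution>) {
        (self.score, self.reasons, self.breakdown)
    }
}

/// Toy scorer with made-up weights, returning the same tuple shape as sub-task 2.
fn score_note_sketch(
    title: &str,
    body: &str,
    term: &str,
) -> (i32, Vec<String>, Vec<ScoreContribution>) {
    let mut acc = Accumulator::default();
    if title.contains(term) {
        acc.add(ScoreSource::TitleMatch, 5, "title match");
    }
    if body.contains(term) {
        acc.add(ScoreSource::BodyMatch, 2, "body match");
    }
    if body.contains(format!("[[{term}]]").as_str()) {
        acc.add(ScoreSource::Wikilink, 3, "wikilink match");
    }
    acc.finish()
}
```

A check in the spirit of the `cli_smoke` assertion is then `breakdown.iter().map(|c| c.weight).sum::<i32>() == score`; a stray `score += n` outside `add` would make that comparison fail.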

## Test Plan

- `cargo test -q --lib` (355+ unit tests)
- `cargo test --test cli_smoke` — 63 cases including the new
  breakdown-sum assertion
- `cargo test --test mcp_smoke` — 10 cases (unchanged but must stay
  green because the sampling pipeline reads the same scoring core)
- `cargo bench --bench retrieval` (baseline numbers are recorded in
  `docs/PHASE_4_R1_BASELINE.md`)

## Out of Scope (carry into Round 16+)

- ❌ Embedding-based semantic retrieval. Decision deferred until
  the breakdown shows the heuristic genuinely runs out of signal.
- ❌ Stale-memory detection. Needs a `last_used` source on the
  ledger which we do not collect today; punted to Round 17.
- ❌ Contradiction detection. Wants the same sampling reverse-call
  channel that R4b shipped; tooling-ready, scope-deferred.
- ❌ Confidence score on memories. Will likely be Round 16 because
  we already have heuristic / sampling / accepted as natural
  confidence tiers.
- ❌ Persistent retrieval index. Needs evidence the in-memory scan
  is the bottleneck; the new bench at 5000 notes is the input to
  that decision.

## Round 16 Candidate Order (proposal, not commitment)

Based on what Round 15 makes visible:

1. **Confidence scoring** — lowest risk, plugs into existing
   wakeup ranking, and the breakdown row format already supports
   adding a `Confidence` source.
2. **Stale detection** — needs a small ledger schema addition for
   `last_referenced_at`, then a decay function over the same
   breakdown surface.
3. **Contradiction detection** — reuses the sampling reverse-call
   from R4b; do not start before users have actually exercised the
   sampling path.
4. **Semantic retrieval** — only if the new bench shows the
   heuristic ranker plateauing despite breakdown-driven tuning.
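
As a rough illustration of what "a decay function over the same breakdown surface" in candidate 2 could look like, here is a sketch using a half-life. Nothing in it is decided: the function name, the half-life model, and the rounding are assumptions, and the age would have to be derived from the not-yet-existing `last_referenced_at` field.

```rust
/// Hypothetical stale-decay helper. Scales `base_weight` by
/// 0.5^(age_days / half_life_days) and returns the adjustment that would be
/// recorded as its own breakdown row, so the sum-equals-score invariant holds.
/// For a non-negative `base_weight` the adjustment is never positive.
fn stale_adjustment(base_weight: i32, age_days: f64, half_life_days: f64) -> i32 {
    // Assumption: invalid or zero ages/half-lives mean "no decay".
    if half_life_days <= 0.0 || age_days <= 0.0 {
        return 0;
    }
    let factor = 0.5f64.powf(age_days / half_life_days);
    let decayed = (f64::from(base_weight) * factor).round() as i32;
    decayed - base_weight
}
```

Returning the adjustment rather than the decayed score keeps the penalty visible as a separate contribution: a note with base weight 8 and an age of one half-life would show a `-4` row next to its original rows.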

## Completion Status

Last checked: `2026-05-08`