# Phase 4 Round 15 Plan — Measure & Excerpt
## Summary
Phase 4 (Retrieval Intelligence) has five eventual deliverables —
stale detection, contradiction detection, confidence scoring,
semantic retrieval, and better excerpt extraction. Round 15 is the
**measurement-first** entry round: it expands what we can see about
the existing heuristic ranker before deciding which deliverable
deserves Round 16.
`docs/NEXT_STEPS.md` explicitly forbids jumping to embeddings before
"current heuristic / debuggability limits are measured". Round 15
honors that constraint by:
1. promoting structured score breakdown to a first-class output
2. centering the excerpt window on the actual matched term
3. extending the criterion bench to a larger vault size
The aim is that every retrieval failure can be diagnosed by answering a
single question: *which contributions added up to this score, and which
section did the excerpt come from?* Once that question is cheap to
answer, choosing between "stale detection" and "semantic retrieval" for
the next round becomes evidence-based.
## Sub-tasks
| # | Sub-task | Status |
|---|---|---|
| 1 | `ScoreContribution` / `ScoreSource` data model in `src/domain/note.rs` | ✅ |
| 2 | `score_note` returns `(i32, Vec<String>, Vec<ScoreContribution>)` via an `Accumulator` that keeps reasons + breakdown in sync with `score` | ✅ |
| 3 | `CandidateNote` carries `score_breakdown`; explain markdown shows nested per-contribution rows | ✅ |
| 4 | `cli_smoke` json case asserts `breakdown.weights.sum() == score` (the ground-truth contract) | ✅ |
| 5 | `excerpt_for_input` adds wikilink dimension to section scoring; `build_section_excerpt_for_terms` centers the body window on the first hit + ellipsizes edges | ✅ |
| 6 | bench widens to `[250, 1000, 5000]` notes across `scan_notes` / `build_context` / `build_bundle` | ✅ |
| 7 | this plan + `docs/PHASE_4_R1_BASELINE.md` baseline numbers | ✅ |
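
The centering behavior in sub-task 5 can be pictured with a minimal sketch. It is not the body of `build_section_excerpt_for_terms`: the name `centered_excerpt`, its parameters, the character-based window, and the ASCII-only case folding are all assumptions made for illustration.

```rust
/// Hypothetical helper: returns at most `width` characters of `body`,
/// centered on the earliest case-insensitive hit of any term, with `…`
/// marking each edge that was cut. Falls back to the start of the body
/// when no term matches.
pub fn centered_excerpt(body: &str, terms: &[&str], width: usize) -> String {
    let chars: Vec<char> = body.chars().collect();
    if chars.len() <= width {
        return body.to_string();
    }

    // ASCII-only folding maps one char to one char, so indices into
    // `lowered` are also valid indices into `chars`.
    let lowered: Vec<char> = chars.iter().map(|c| c.to_ascii_lowercase()).collect();

    // Earliest hit across all terms, as (start index, length) in chars.
    let hit = terms
        .iter()
        .filter_map(|t| {
            let needle: Vec<char> = t.chars().map(|c| c.to_ascii_lowercase()).collect();
            if needle.is_empty() {
                return None; // `windows(0)` would panic
            }
            lowered
                .windows(needle.len())
                .position(|w| w == needle.as_slice())
                .map(|i| (i, needle.len()))
        })
        .min_by_key(|&(i, _)| i);

    // Center the window on the middle of the hit, clamped to the body.
    let start = match hit {
        Some((i, len)) => {
            let center = i + len / 2;
            center.saturating_sub(width / 2).min(chars.len() - width)
        }
        None => 0,
    };
    let end = start + width;

    let mut out = String::new();
    if start > 0 {
        out.push('…');
    }
    out.extend(chars[start..end].iter());
    if end < chars.len() {
        out.push('…');
    }
    out
}
```

Working in `char` units rather than bytes keeps the slice boundaries valid for non-ASCII note bodies, at the cost of one extra allocation per excerpt.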
## Why this order
- **Breakdown before excerpt before bench**: sub-tasks 1 through 5
are local to the scorer and excerpt code and add negligible runtime
cost, and the bench runs last, so the numbers we record already
reflect the new surface. We do not need a "before/after" comparison
to ship Round 15: the contract is *new visibility*, not faster code.
- **Reasons string stays**: it is still emitted alongside breakdown.
Markdown renderers, existing CLI users, and any external scripts
that grep for reason text keep working. `score_breakdown` is the
preferred surface for new code.
- **Sum-equals-score** is a load-bearing invariant. Breakdown is
only useful if it accounts for the score completely; the cli_smoke
assertion guards against any future contributor adding a `score
+=` that bypasses `Accumulator::add`.
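
The sum-equals-score invariant can be illustrated with a small sketch. The type names `ScoreContribution`, `ScoreSource`, and `Accumulator` come from this plan, but everything else here is assumed: the enum variants, the field names, the weights, the reason format, and the toy `score_note_sketch` function are hypothetical and do not mirror `src/domain/note.rs`.

```rust
/// Hypothetical source tags; the real `ScoreSource` variants are not listed here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ScoreSource {
    TitleMatch,
    BodyMatch,
    Wikilink,
}

/// One row of the breakdown (field names are assumptions).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ScoreContribution {
    pub source: ScoreSource,
    pub weight: i32,
    pub detail: String,
}

#[derive(Debug, Default)]
pub struct Accumulator {
    score: i32,
    reasons: Vec<String>,
    breakdown: Vec<ScoreContribution>,
}

impl Accumulator {
    /// The only place `score` is mutated, so score, reasons, and
    /// breakdown cannot drift apart.
    pub fn add(&mut self, source: ScoreSource, weight: i32, detail: &str) {
        self.score += weight;
        self.reasons.push(format!("{detail} ({weight:+})"));
        self.breakdown.push(ScoreContribution {
            source,
            weight,
            detail: detail.to_string(),
        });
    }

    pub fn finish(self) -> (i32, Vec<String>, Vec<ScoreContribution>) {
        (self.score, self.reasons, self.breakdown)
    }
}

/// Toy scorer with illustrative weights; not the real `score_note`.
pub fn score_note_sketch(
    title: &str,
    body: &str,
    term: &str,
) -> (i32, Vec<String>, Vec<ScoreContribution>) {
    let mut acc = Accumulator::default();
    let term = term.to_lowercase();
    let body = body.to_lowercase();
    if title.to_lowercase().contains(&term) {
        acc.add(ScoreSource::TitleMatch, 10, "title contains term");
    }
    if body.contains(&term) {
        acc.add(ScoreSource::BodyMatch, 3, "body contains term");
    }
    if body.contains(&format!("[[{term}]]")) {
        acc.add(ScoreSource::Wikilink, 5, "body links to term");
    }
    acc.finish()
}
```

Because the `score` field is private and `add` is its only writer, a stray `score +=` inside the scorer would not compile against this shape. The smoke-test assertion then only has to catch code that computes a score outside the accumulator entirely.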
## Test Plan
- `cargo test -q --lib` (355+ unit tests)
- `cargo test --test cli_smoke` — 63 cases including the new
breakdown-sum assertion
- `cargo test --test mcp_smoke` — 10 cases (unchanged but must stay
green because the sampling pipeline reads the same scoring core)
- `cargo bench --bench retrieval`; baseline numbers are recorded in
`docs/PHASE_4_R1_BASELINE.md`
## Out of Scope (carry into Round 16+)
- ❌ Embedding-based semantic retrieval. Decision deferred until
the breakdown shows the heuristic genuinely runs out of signal.
- ❌ Stale-memory detection. Needs a `last_used` source on the
ledger which we do not collect today; punted to Round 17.
- ❌ Contradiction detection. Wants the same sampling reverse-call
channel that R4b shipped; tooling-ready, scope-deferred.
- ❌ Confidence score on memories. Will likely be Round 16 because
we already have heuristic / sampling / accepted as natural
confidence tiers.
- ❌ Persistent retrieval index. Needs evidence the in-memory scan
is the bottleneck; the new bench at 5000 notes is the input to
that decision.
## Round 16 Candidate Order (proposal, not commitment)
Based on what Round 15 makes visible:
1. **Confidence scoring** — lowest risk, plugs into existing
wakeup ranking, and the breakdown row format already supports
adding a `Confidence` source.
2. **Stale detection** — needs a small ledger schema addition for
`last_referenced_at`, then a decay function over the same
breakdown surface.
3. **Contradiction detection** — reuses the sampling reverse-call
from R4b; do not start before users have actually exercised the
sampling path.
4. **Semantic retrieval** — only if the new bench shows the
heuristic ranker plateauing despite breakdown-driven tuning.
## Completion Status
Last checked: `2026-05-08`