# Phase 4 Round 16 Plan — Confidence Scoring
## Summary
Round 15 made retrieval scoring inspectable (`score_breakdown`) and
established a 5000-note baseline (`docs/PHASE_4_R1_BASELINE.md`).
Round 16 layers a **confidence tier** on top of the same surface so
that callers (wakeup output, explain JSON, future eval harnesses)
can distinguish *"high-trust canonical fact"* from *"heuristic body
match"* without re-deriving it from frontmatter every time.
Confidence is **derived, not stored**. We do not extend the ledger
schema. The tier is computed at scoring time from signals already
present:
- note path: `frontmatter.source_of_truth`, `frontmatter.sensitivity`,
  `frontmatter.memory_type`, `frontmatter.retrieval_priority`
- lifecycle path: `MemoryRecord.state`, `MemoryRecord.origin.source_kind`,
  `MemoryRecord.sensitivity`, `MemoryRecord.memory_type`

This keeps Round 16 strictly additive: no migration, no projection
rebuild, no hot-path structural change.
## Tier Derivation
`ConfidenceTier` has three variants: `high`, `medium`, `low`.
### Note candidates (`score_note`)
| Condition | Tier |
| --- | --- |
| `source_of_truth = true` | force `high` |
| `retrieval_priority = high` AND `memory_type ∈ {constraint, decision, project}` | `high` |
| `retrieval_priority = low` OR `sensitivity = secret` | `low` |
| otherwise | `medium` |
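
The note rules can be sketched as a small pure function. This is only an illustration: the `Frontmatter`, `Priority`, `Sensitivity`, and `MemoryType` types and the `derive_note_tier` name are hypothetical stand-ins, and the sketch assumes the rows are evaluated top to bottom with the first match winning (so `source_of_truth` overrides a `secret` sensitivity).

```rust
// Hypothetical stand-ins for the frontmatter fields named above; the real
// types in the codebase may look different.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConfidenceTier {
    High,
    Medium,
    Low,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Priority {
    High,
    Normal,
    Low,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Sensitivity {
    Internal,
    Secret,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MemoryType {
    Constraint,
    Decision,
    Project,
    Other,
}

#[derive(Debug, Clone, Copy)]
pub struct Frontmatter {
    pub source_of_truth: bool,
    pub retrieval_priority: Priority,
    pub sensitivity: Sensitivity,
    pub memory_type: MemoryType,
}

/// Derives the tier from frontmatter alone; nothing is read from or written
/// to the ledger. Assumption: first matching row wins.
pub fn derive_note_tier(fm: &Frontmatter) -> ConfidenceTier {
    // Row 1: an explicit source of truth forces `high`.
    if fm.source_of_truth {
        return ConfidenceTier::High;
    }
    // Row 2: high priority on one of the trusted memory types.
    let trusted_type = matches!(
        fm.memory_type,
        MemoryType::Constraint | MemoryType::Decision | MemoryType::Project
    );
    if fm.retrieval_priority == Priority::High && trusted_type {
        return ConfidenceTier::High;
    }
    // Row 3: low priority or secret content.
    if fm.retrieval_priority == Priority::Low || fm.sensitivity == Sensitivity::Secret {
        return ConfidenceTier::Low;
    }
    // Row 4: everything else.
    ConfidenceTier::Medium
}
```

If the intended precedence differs (for example, `secret` should win over row 2), only the order of the `if` blocks needs to change.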
### Lifecycle candidates (`score_lifecycle_candidate`)
| Condition | Tier |
| --- | --- |
| `state = Canonical` | `high` |
| `state = Accepted` AND `origin.source_kind = Manual` | `high` |
| `state = Accepted` AND `origin.source_kind ∈ {AI, Distill, …}` | `medium` |
| `state = Candidate` | `low` |
| `sensitivity = secret` | clamp down one tier (high → medium → low) |

`Archived` records do not currently flow into the candidate set, so
no rule is needed for that state.
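
A minimal sketch of the lifecycle rules, under stated assumptions: the `State`, `SourceKind`, and `Sensitivity` enums and the `derive_lifecycle_tier` and `clamp_down` names are hypothetical, every non-`Manual` source kind is treated like `AI` and `Distill`, and the `secret` clamp is applied once, after the state-based tier has been chosen.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConfidenceTier {
    High,
    Medium,
    Low,
}

impl ConfidenceTier {
    /// One step down: high -> medium -> low. `low` has nowhere to go.
    pub fn clamp_down(self) -> Self {
        match self {
            ConfidenceTier::High => ConfidenceTier::Medium,
            ConfidenceTier::Medium | ConfidenceTier::Low => ConfidenceTier::Low,
        }
    }
}

// `Archived` is left out on purpose: such records never reach the candidate set.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Candidate,
    Accepted,
    Canonical,
}

// Only the kinds named in the table; the real enum has more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SourceKind {
    Manual,
    Ai,
    Distill,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Sensitivity {
    Normal,
    Secret,
}

pub fn derive_lifecycle_tier(
    state: State,
    source_kind: SourceKind,
    sensitivity: Sensitivity,
) -> ConfidenceTier {
    // State-based tier first.
    let base = match (state, source_kind) {
        (State::Canonical, _) => ConfidenceTier::High,
        (State::Accepted, SourceKind::Manual) => ConfidenceTier::High,
        // Assumption: any non-manual origin is treated like AI / Distill.
        (State::Accepted, _) => ConfidenceTier::Medium,
        (State::Candidate, _) => ConfidenceTier::Low,
    };
    // Then the sensitivity clamp (assumed to apply exactly once).
    if sensitivity == Sensitivity::Secret {
        base.clamp_down()
    } else {
        base
    }
}
```

Under these assumptions a secret `Canonical` record lands on `medium`, and a secret `Accepted` record from a non-manual origin lands on `low`.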
## Scoring Surface Change
`ScoreSource` gets one new variant: `Confidence`. The accumulator
emits exactly one `ScoreContribution` with that source per note.

Weights are intentionally small so they don't dominate routing:

| Tier | Note weight | Lifecycle weight |
| --- | --- | --- |
| high | +6 | +5 |
| medium | 0 | 0 |
| low | -4 | -3 |

Why a contribution and not a free-form annotation: the `cli_smoke`
sum-equals-score invariant from Round 15 stays meaningful. Any
ranking effect of confidence is auditable in `score_breakdown`.
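
To illustrate why a contribution keeps the invariant intact, here is a rough sketch. The field layout of `ScoreContribution`, the `i32` weight type, the `Heuristic` placeholder variant, and the helper names are assumptions; the weights are the first column of the table (+6 / 0 / -4), which the test plan ties to `score_note`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConfidenceTier {
    High,
    Medium,
    Low,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ScoreSource {
    // Placeholder for the sources that already exist after Round 15.
    Heuristic,
    // The new Round 16 variant.
    Confidence,
}

// Assumed shape: a source tag plus an integer weight.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ScoreContribution {
    pub source: ScoreSource,
    pub weight: i32,
}

/// Note-path weights from the table above.
pub fn note_confidence_weight(tier: ConfidenceTier) -> i32 {
    match tier {
        ConfidenceTier::High => 6,
        ConfidenceTier::Medium => 0,
        ConfidenceTier::Low => -4,
    }
}

/// Always pushes exactly one row, even for `medium` (weight 0), so every
/// note carries a visible `Confidence` entry in its breakdown.
pub fn push_confidence(breakdown: &mut Vec<ScoreContribution>, tier: ConfidenceTier) {
    breakdown.push(ScoreContribution {
        source: ScoreSource::Confidence,
        weight: note_confidence_weight(tier),
    });
}

/// The score is defined as the sum of the breakdown, so the invariant
/// holds by construction.
pub fn total_score(breakdown: &[ScoreContribution]) -> i32 {
    breakdown.iter().map(|c| c.weight).sum()
}
```

Because the score is recomputed from the breakdown rather than adjusted on the side, any ranking shift caused by confidence shows up as a single row that can be inspected or subtracted out.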
## DTO Changes
- `domain::note::ConfidenceTier` (new enum, `#[derive(TS, Serialize)]`)
- `domain::note::ScoreSource::Confidence` (new variant)
- `domain::note::CandidateNote.confidence: ConfidenceTier`
- `domain::note::ScoredNote.confidence: ConfidenceTier`
- `domain::lifecycle_candidate::LifecycleCandidate.confidence: ConfidenceTier`
- `domain::wakeup::WakeupMemoryItem.confidence: ConfidenceTier`
- `domain::wakeup::WakeupRecommendedNote.confidence: ConfidenceTier`

Frontend bindings will be regenerated via `cargo test --lib export_bindings`.
## Wakeup Behavior
Round 16 **does not change** wakeup ordering. Confidence is
displayed alongside each memory item but the existing
`memory_type → section` mapping and per-section `score`-based
truncation remain in charge of ranking. A future round can use
confidence as a tiebreaker once we have evidence that confidence
disagrees usefully with score.
Rationale: changing both score and ranking in one round makes
regressions hard to attribute. Round 16 ships *visibility*; ranking
adjustments are a follow-up.
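
The display-only role of confidence in wakeup can be pictured with a small sketch. `WakeupItem` is a simplified, hypothetical stand-in for `WakeupMemoryItem` (only the `confidence` field comes from this plan), and `truncate_section` is an assumed name for the per-section truncation step.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConfidenceTier {
    High,
    Medium,
    Low,
}

// Simplified stand-in; `title` and `score` are assumed fields.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct WakeupItem {
    pub title: String,
    pub score: i32,
    pub confidence: ConfidenceTier,
}

/// Keeps the `limit` highest-scoring items of one section.
/// `confidence` is carried along for display and never consulted here.
pub fn truncate_section(mut items: Vec<WakeupItem>, limit: usize) -> Vec<WakeupItem> {
    // Stable sort, descending by score: ties keep their incoming order.
    items.sort_by(|a, b| b.score.cmp(&a.score));
    items.truncate(limit);
    items
}
```

In this sketch a `low`-confidence item with a higher score still outranks a `high`-confidence one; turning confidence into a tiebreaker later would only mean extending the comparator.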
## Out of Scope
- ❌ No ledger schema changes (`last_referenced_at` etc. stay for
Round 17 stale detection).
- ❌ No new wakeup sections — `confidence` is a per-item attribute.
- ❌ No re-ranking on confidence; only the small breakdown weight
feeds into `score`.
- ❌ No frontend display changes beyond regenerated TS bindings.
## Test Plan
1. **Unit (scorer)**:
   - `score_note` with `source_of_truth=true` returns `confidence=high`
     and emits a `Confidence` breakdown row with `+6`.
   - `score_note` with `sensitivity=secret` returns `confidence=low`.
   - `score_lifecycle_candidate` with `state=Canonical` returns `high`.
   - `score_lifecycle_candidate` with `state=Candidate` returns `low`.
   - Sum invariant: `breakdown.iter().map(|c| c.weight).sum() == score`
     still holds, including the new contribution.
2. **Wakeup**:
   - `build_packet` propagates per-item `confidence` from `ScoredNote`
     into `WakeupMemoryItem` / `WakeupRecommendedNote`.
3. **CLI smoke** (`tests/cli_smoke.rs`):
   - JSON output of `spool get … --format json` includes
     `confidence` per candidate.
   - High-trust seed note (with `source_of_truth: true`) reports
     `confidence: "high"`.
4. **MCP smoke** (`tests/mcp_smoke.rs`):
   - Existing assertions still pass; sampling pipeline is unchanged.
5. **Bench** (`cargo bench --bench retrieval`):
   - Re-run to confirm `build_bundle@5000` stays under ~200 ms (the
     red line set by `docs/PHASE_4_R1_BASELINE.md`). Update the
     baseline file if the new contribution adds measurable cost.
## Round 17 Candidate Order (proposal, not commitment)
After Round 16 ships:
1. **Stale detection** — needs `last_referenced_at` ledger field,
then a decay function feeding the same `score_breakdown`.
2. **Contradiction detection** — only after sampling has been
exercised on real workflows.
3. **Semantic retrieval** — only if heuristic + confidence
plateau in real-vault evaluation.
## Completion Status
Last checked: `2026-05-08`