spool-memory 0.2.3

# Phase 4 Round 16 Plan — Confidence Scoring

## Summary

Round 15 made retrieval scoring inspectable (`score_breakdown`) and
established a 5000-note baseline (`docs/PHASE_4_R1_BASELINE.md`).
Round 16 layers a **confidence tier** on top of the same surface so
that callers (wakeup output, explain JSON, future eval harnesses)
can distinguish *"high-trust canonical fact"* from *"heuristic body
match"* without re-deriving it from frontmatter every time.

Confidence is **derived, not stored**. We do not extend the ledger
schema. The tier is computed at scoring time from signals already
present:

- note path: `frontmatter.source_of_truth`, `frontmatter.sensitivity`,
  `frontmatter.memory_type`, `frontmatter.retrieval_priority`
- lifecycle path: `MemoryRecord.state`, `MemoryRecord.origin.source_kind`,
  `MemoryRecord.sensitivity`, `MemoryRecord.memory_type`

This keeps Round 16 strictly additive: no migration, no projection
rebuild, no hot-path structural change.

## Tier Derivation

`ConfidenceTier` has three variants: `high`, `medium`, `low`.

### Note candidates (`score_note`)

| Signal | Effect |
|---|---|
| `source_of_truth = true` | force `high` |
| `retrieval_priority = high` AND `memory_type ∈ {constraint, decision, project}` | `high` |
| `retrieval_priority = low` OR `sensitivity = secret` | `low` |
| otherwise | `medium` |
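
The note rules above can be sketched as a single function. This is a minimal illustration, not the actual `score_note` code: the `NoteSignals` struct, the `Priority`, `MemoryType` and `Sensitivity` enums, and the function name `note_confidence` are hypothetical stand-ins for the frontmatter fields, and the sketch assumes the table rows are evaluated top to bottom with the first match winning.

```rust
/// Confidence tier with the three variants named in this plan.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConfidenceTier {
    High,
    Medium,
    Low,
}

// Hypothetical stand-ins for frontmatter values; the real types may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Priority {
    High,
    Normal,
    Low,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MemoryType {
    Constraint,
    Decision,
    Project,
    Other,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Sensitivity {
    Public,
    Secret,
}

/// Hypothetical bundle of the four frontmatter signals used by the table.
#[derive(Debug, Clone, Copy)]
pub struct NoteSignals {
    pub source_of_truth: bool,
    pub retrieval_priority: Priority,
    pub memory_type: MemoryType,
    pub sensitivity: Sensitivity,
}

/// Derives the tier for a note candidate.
/// Assumption: rows are checked in table order and the first match wins.
pub fn note_confidence(s: &NoteSignals) -> ConfidenceTier {
    if s.source_of_truth {
        return ConfidenceTier::High;
    }
    let trusted_type = matches!(
        s.memory_type,
        MemoryType::Constraint | MemoryType::Decision | MemoryType::Project
    );
    if s.retrieval_priority == Priority::High && trusted_type {
        return ConfidenceTier::High;
    }
    if s.retrieval_priority == Priority::Low || s.sensitivity == Sensitivity::Secret {
        return ConfidenceTier::Low;
    }
    ConfidenceTier::Medium
}
```

Under the first-match assumption, a note with `source_of_truth = true` stays `high` even when it is also marked secret. If the intended precedence is different, only the order of the checks needs to change.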

### Lifecycle candidates (`score_lifecycle_candidate`)

| Signal | Effect |
|---|---|
| `state = Canonical` | `high` |
| `state = Accepted` AND `origin.source_kind = Manual` | `high` |
| `state = Accepted` AND `origin.source_kind ∈ {AI, Distill, …}` | `medium` |
| `state = Candidate` | `low` |
| `sensitivity = secret` | clamp down one tier (high → medium → low) |

`Archived` records do not currently flow into the candidate set, so
no rule is needed for that state.
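
A possible shape for the lifecycle derivation, including the secret clamp, is sketched below. The enum and function names (`State`, `SourceKind`, `base_tier`, `clamp_down`, `lifecycle_confidence`) are illustrative assumptions. `SourceKind` lists only the variants named in the table; any non-`Manual` kind is treated as `medium`, and the `secret` flag stands in for `MemoryRecord.sensitivity`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConfidenceTier {
    High,
    Medium,
    Low,
}

// Hypothetical mirror of the lifecycle states that reach the candidate set.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Canonical,
    Accepted,
    Candidate,
}

// Only the source kinds named in the table; the real enum may have more.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SourceKind {
    Manual,
    Ai,
    Distill,
}

/// Tier from state and origin alone, before the sensitivity clamp.
pub fn base_tier(state: State, kind: SourceKind) -> ConfidenceTier {
    match (state, kind) {
        (State::Canonical, _) => ConfidenceTier::High,
        (State::Accepted, SourceKind::Manual) => ConfidenceTier::High,
        // Assumption: every non-Manual origin of an Accepted record is medium.
        (State::Accepted, _) => ConfidenceTier::Medium,
        (State::Candidate, _) => ConfidenceTier::Low,
    }
}

/// Moves a tier down one step; `Low` is already the floor.
pub fn clamp_down(tier: ConfidenceTier) -> ConfidenceTier {
    match tier {
        ConfidenceTier::High => ConfidenceTier::Medium,
        ConfidenceTier::Medium | ConfidenceTier::Low => ConfidenceTier::Low,
    }
}

/// Full derivation: base tier, then one clamp step when the record is secret.
pub fn lifecycle_confidence(state: State, kind: SourceKind, secret: bool) -> ConfidenceTier {
    let tier = base_tier(state, kind);
    if secret {
        clamp_down(tier)
    } else {
        tier
    }
}
```

Keeping the clamp as a separate step means a secret `Canonical` record lands on `medium` rather than being forced straight to `low`, which matches the "one tier" wording of the table.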

## Scoring Surface Change

`ScoreSource` gets one new variant: `Confidence`. The accumulator
emits exactly one `ScoreContribution` per scored candidate, for
notes and lifecycle candidates alike:

```text
{ source: Confidence, field: "confidence", term: "high|medium|low", weight: ±N }
```

Weights are intentionally small so they don't dominate routing:

| Tier | Note weight | Lifecycle weight |
|---|---:|---:|
| high | +6 | +5 |
| medium | 0 | 0 |
| low | -4 | -3 |

Why a contribution rather than a free-form annotation: it keeps the
`cli_smoke` sum-equals-score invariant from Round 15 meaningful, and
any ranking effect of confidence is auditable in `score_breakdown`.
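
To make the weight table and the sum invariant concrete, here is a small sketch. The field types (`i32` weights, `String` terms), the `CandidateKind` enum, the placeholder `ScoreSource::Other` variant, and the helper names are assumptions for illustration; only the weights and the `{ source, field, term, weight }` shape come from this plan.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConfidenceTier {
    High,
    Medium,
    Low,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ScoreSource {
    Confidence,
    // Placeholder for the sources that already exist; not a real variant name.
    Other,
}

// Hypothetical discriminator between the two scoring paths.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CandidateKind {
    Note,
    Lifecycle,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ScoreContribution {
    pub source: ScoreSource,
    pub field: &'static str,
    pub term: String,
    pub weight: i32, // assumed integer weights
}

/// Weight table from the plan: note and lifecycle columns.
pub fn confidence_weight(kind: CandidateKind, tier: ConfidenceTier) -> i32 {
    match (kind, tier) {
        (CandidateKind::Note, ConfidenceTier::High) => 6,
        (CandidateKind::Lifecycle, ConfidenceTier::High) => 5,
        (_, ConfidenceTier::Medium) => 0,
        (CandidateKind::Note, ConfidenceTier::Low) => -4,
        (CandidateKind::Lifecycle, ConfidenceTier::Low) => -3,
    }
}

/// Builds the single confidence row that joins the breakdown.
pub fn confidence_contribution(kind: CandidateKind, tier: ConfidenceTier) -> ScoreContribution {
    let term = match tier {
        ConfidenceTier::High => "high",
        ConfidenceTier::Medium => "medium",
        ConfidenceTier::Low => "low",
    };
    ScoreContribution {
        source: ScoreSource::Confidence,
        field: "confidence",
        term: term.to_string(),
        weight: confidence_weight(kind, tier),
    }
}

/// Sum invariant: the score is the sum of all breakdown weights.
pub fn total_score(breakdown: &[ScoreContribution]) -> i32 {
    breakdown.iter().map(|c| c.weight).sum()
}
```

Because the score is always recomputed from the breakdown in this sketch, adding the confidence row cannot break the invariant; a `medium` tier still produces a row, just with weight `0`.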

## DTO Changes

- `domain::note::ConfidenceTier` (new enum, `#[derive(TS, Serialize)]`)
- `domain::note::ScoreSource::Confidence` (new variant)
- `domain::note::CandidateNote.confidence: ConfidenceTier`
- `domain::note::ScoredNote.confidence: ConfidenceTier`
- `domain::lifecycle_candidate::LifecycleCandidate.confidence: ConfidenceTier`
- `domain::wakeup::WakeupMemoryItem.confidence: ConfidenceTier`
- `domain::wakeup::WakeupRecommendedNote.confidence: ConfidenceTier`

Frontend bindings will be regenerated via `cargo test --lib export_bindings`.

## Wakeup Behavior

Round 16 **does not change** how wakeup orders items. Confidence is
displayed alongside each memory item, but the existing
`memory_type → section` mapping and per-section `score`-based
truncation remain in charge of ranking. Confidence reaches ordering
only indirectly, through the small weight it adds to `score`. A
future round can use confidence as a tiebreaker once we have
evidence that it disagrees usefully with score.

Rationale: changing both score and ranking in one round makes
regressions hard to attribute. Round 16 ships *visibility*; ranking
adjustments are a follow-up.
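
As an illustration of "visibility without re-ranking", the sketch below truncates one wakeup section by `score` alone while carrying `confidence` through untouched. `ScoredItem` and `truncate_section` are hypothetical names, not the real `build_packet` internals, and the integer score type is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConfidenceTier {
    High,
    Medium,
    Low,
}

// Hypothetical per-section item; the real DTOs carry more fields.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ScoredItem {
    pub title: String,
    pub score: i32, // assumed integer score
    pub confidence: ConfidenceTier,
}

/// Keeps the `limit` highest-scoring items of one section.
/// Confidence is passed through for display and is not a sort key,
/// so items with equal scores keep their incoming order.
pub fn truncate_section(mut items: Vec<ScoredItem>, limit: usize) -> Vec<ScoredItem> {
    // `sort_by` is stable, which preserves the order of equal-score items.
    items.sort_by(|a, b| b.score.cmp(&a.score));
    items.truncate(limit);
    items
}
```

A later round that adopts confidence as a tiebreaker would only need to extend the comparator; the item shape would stay the same.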

## Out of Scope

- ❌ No ledger schema changes (`last_referenced_at` etc. are deferred
  to Round 17 stale detection).
- ❌ No new wakeup sections — `confidence` is a per-item attribute.
- ❌ No re-ranking on confidence; only the small breakdown weight
  feeds into `score`.
- ❌ No frontend display changes beyond regenerated TS bindings.

## Test Plan

1. **Unit (scorer)**:
   - `score_note` with `source_of_truth=true` returns `confidence=high`
     and emits a `Confidence` breakdown row with `+6`.
   - `score_note` with `sensitivity=secret` returns `confidence=low`.
   - `score_lifecycle_candidate` with `state=Canonical` returns `high`.
   - `score_lifecycle_candidate` with `state=Candidate` returns `low`.
   - Sum invariant: `breakdown.iter().map(|c| c.weight).sum() ==
     score` still holds, including the new contribution.

2. **Wakeup**:
   - `build_packet` propagates per-item `confidence` from `ScoredNote`
     into `WakeupMemoryItem` / `WakeupRecommendedNote`.

3. **CLI smoke** (`tests/cli_smoke.rs`):
   - JSON output of `spool get … --format json` includes
     `confidence` per candidate.
   - High-trust seed note (with `source_of_truth: true`) reports
     `confidence: "high"`.

4. **MCP smoke** (`tests/mcp_smoke.rs`):
   - Existing assertions still pass; sampling pipeline is unchanged.

5. **Bench** (`cargo bench --bench retrieval`):
   - Re-run to confirm `build_bundle@5000` stays under ~200 ms (the
     red line set by `docs/PHASE_4_R1_BASELINE.md`). Update the
     baseline file if the new contribution adds measurable cost.

## Round 17 Candidate Order (proposal, not commitment)

After Round 16 ships:

1. **Stale detection** — needs `last_referenced_at` ledger field,
   then a decay function feeding the same `score_breakdown`.
2. **Contradiction detection** — only after sampling has been
   exercised on real workflows.
3. **Semantic retrieval** — only if heuristic + confidence
   plateau in real-vault evaluation.

## Completion Status

Last checked: `2026-05-08`