Entity canonicalization - resolve_or_create (gap-catalog gap 04).
This module implements the collapse-or-create decision for a string
query against an HNSW-indexed population of already-known nodes.
The design follows research/gap-catalog/04-entity-canonicalization/
R1-R6, in particular:
- Distribution-derived collapse threshold tau_n, computed from a k=2 Gaussian Mixture over the HNSW-local cosine sample. No global magic cosine constant; the threshold tracks the corpus geometry. tau_n = max(mu_same - 2*sigma_same, mu_diff + sigma_diff).
- Two-of-three consensus collapse gate: at least two of (cosine, normalized_levenshtein, namespace/trust) must agree for two nodes to be merged. Single-signal collapses are refused.
- Commit-id-derived HNSW seed: the HNSW walk seed is BLAKE3(commit_cid || domain_sep)[..8]; two runs against the same commit get the same seed, and different commits get independent seeds. Bootstrap fallback 0xCANO_N_0001_u64 when the commit CID is the zero CID.
- CommitBudgetGuard wiring: caller passes latency_budget_ms: Option<u32> and the module opens a guard at the RESOLVE_OR_CREATE_P99_MS hard wall; exhaustion returns ResolveResult::BudgetExhausted carrying the best-effort candidate.
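The threshold formula in the first bullet can be sketched as follows. This is a minimal illustration, not the module's implementation: ComponentStats and collapse_threshold are hypothetical names, the mixture fit itself is assumed to have already produced the per-component mean and standard deviation, and the multiplier clamp mirrors the [1.5, 3.0] range documented for SIGMA_MULTIPLIER_FOR_COLLAPSE.

```rust
/// Mean and standard deviation of one Gaussian mixture component.
/// Hypothetical type for this sketch; the module's LocalThreshold
/// may carry different fields.
#[derive(Debug, Clone, Copy)]
struct ComponentStats {
    mu: f64,
    sigma: f64,
}

/// tau_n = max(mu_same - k * sigma_same, mu_diff + sigma_diff),
/// where k is the same-class sigma multiplier (2.0 in the formula above).
fn collapse_threshold(same: ComponentStats, diff: ComponentStats, sigma_multiplier: f64) -> f64 {
    // Assumption: the multiplier is clamped to [1.5, 3.0] before use.
    let k = sigma_multiplier.clamp(1.5, 3.0);
    // Lower edge of the "same entity" cosine cluster.
    let lower_edge_of_same = same.mu - k * same.sigma;
    // Upper edge of the "different entity" cosine cluster.
    let upper_edge_of_diff = diff.mu + diff.sigma;
    // Taking the max keeps the threshold above the "different" cluster
    // even when the two components overlap.
    lower_edge_of_same.max(upper_edge_of_diff)
}
```

With well-separated components (for example same = (0.9, 0.05) and diff = (0.3, 0.1)) the first term dominates; when the components overlap, the second term acts as a floor so the threshold does not drop into the "different" cluster.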
§p99 floor-c apparatus (R6)
RESOLVE_OR_CREATE_P99_MS is a tunable floor-c constant:
- Reference standard: p95_hnsw_walk_ms + consensus_overhead_ms + p99_headroom = 50 ms on the reference repo (|V|=1M, avg_degree=12).
- Gauge: mnem_resolve_or_create_p99_breach_total.
- Proptest: [tests::resolve_or_create_hits_50ms_hard_wall].
- Unit tests: [tests::resolve_creates_below_threshold], [tests::resolve_merges_above_threshold], [tests::threshold_derived_from_local_samples], [tests::commit_budget_guard_cuts_off].
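A rough sketch of how a hard-wall budget guard might behave is shown below. It is an illustration under stated assumptions, not the module's CommitBudgetGuard: BudgetGuard, Outcome, and best_within_budget are hypothetical names, the constant's u32 type is assumed, and the rule that a caller budget can only tighten the 50 ms wall (never extend it) is an assumption of this sketch.

```rust
use std::time::{Duration, Instant};

/// Hard wall in milliseconds; 50 follows the reference standard above.
/// The u32 type is an assumption of this sketch.
const RESOLVE_OR_CREATE_P99_MS: u32 = 50;

/// Hypothetical stand-in for the module's budget guard.
struct BudgetGuard {
    started: Instant,
    budget: Duration,
}

impl BudgetGuard {
    /// Assumption: a caller budget can tighten the hard wall but not extend it.
    fn open(latency_budget_ms: Option<u32>) -> Self {
        let ms = latency_budget_ms
            .unwrap_or(RESOLVE_OR_CREATE_P99_MS)
            .min(RESOLVE_OR_CREATE_P99_MS);
        Self {
            started: Instant::now(),
            budget: Duration::from_millis(u64::from(ms)),
        }
    }

    /// True once the elapsed wall-clock time has reached the budget.
    fn exhausted(&self) -> bool {
        self.started.elapsed() >= self.budget
    }
}

/// Simplified outcome; the real ResolveResult has more variants.
#[derive(Debug, PartialEq)]
enum Outcome {
    Finished { best: Option<u64> },
    BudgetExhausted { best_effort: Option<u64> },
}

/// Scan (id, cosine) candidates, stopping as soon as the guard is exhausted
/// and returning whatever best candidate has been seen so far.
fn best_within_budget(guard: &BudgetGuard, candidates: &[(u64, f32)]) -> Outcome {
    let mut best: Option<(u64, f32)> = None;
    for &(id, cosine) in candidates {
        if guard.exhausted() {
            return Outcome::BudgetExhausted {
                best_effort: best.map(|(id, _)| id),
            };
        }
        if best.map_or(true, |(_, c)| cosine > c) {
            best = Some((id, cosine));
        }
    }
    Outcome::Finished {
        best: best.map(|(id, _)| id),
    }
}
```

Checking the guard once per candidate keeps the overshoot bounded by the cost of a single comparison; a real implementation would presumably also check inside the HNSW walk itself.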
§Rollback template (see scripts/rollback-gap-04.sql)
Rolling canonicalization back uses the following idempotent SQL template, kept here as a comment so readers don’t have to chase the script file:
-- scripts/rollback-gap-04.sql
-- Rollback entity canonicalization emitted after <ROLLBACK_CID>.
-- Invocation: mnem admin rollback --feature=canonicalization --after=<CID>
-- Idempotent: re-running is safe (second run is a no-op).
BEGIN TRANSACTION;
-- 1. Drop canonical_cid props from nodes committed after the point.
UPDATE nodes
SET props = json_remove(props, '$.canonical_cid')
WHERE commit_cid > :ROLLBACK_CID
AND json_extract(props, '$.canonical_cid') IS NOT NULL;
-- 2. Drop the canonical cluster manifest rows.
DROP TABLE IF EXISTS canonical_manifest_staging;
DELETE FROM canonical_manifest
WHERE commit_cid > :ROLLBACK_CID;
-- 3. Cache-flush NOTIFY handled post-SQL by mnem admin rollback:
-- posts INTERNAL ResetCanonicalCache event to runtime, which
-- drains AppState::canonical_cache + rebuilds lazily.
NOTIFY canonical_cache_flush, :ROLLBACK_CID;
-- 4. Reset rolling-telemetry derived counters so SLO alerting
-- does not attribute post-rollback baselines to rolled commits.
UPDATE rolling_stats
SET p50_canonicalize_ms = NULL,
p99_canonicalize_ms = NULL
WHERE last_updated_commit_cid > :ROLLBACK_CID;
COMMIT;
Structs§
- Candidate - A sampled (candidate_id, cosine_to_query, name_for_edit_dist, namespace, trust) tuple. Lifetime-free for testability: a real caller pulls these from the HNSW walk.
- LocalThreshold - Per-node distribution-derived threshold and its component stats.
- ResolveOutcome - Full outcome of a resolve call, including the guard’s report for embedding in the commit envelope and the (seed, source) pair used for the HNSW walk.
- ResolveRequest - Request payload for resolve_or_create.
Enums§
- HnswSeedSource - Origin of the HNSW build seed used for this run.
- RefusalReason - Reasons a resolve call was refused (not merged, not created).
- ResolveResult - Outcome of resolve_or_create.
Constants§
- EDIT_DISTANCE_TAU - R3 same-class edit-distance tau (embedder-calibrated). Max 25% normalized Levenshtein distance qualifies as an edit-dist collapse signal.
- EF_SEARCH_CANONICAL - R4 pinned ef_search for canonicalization HNSW handle. Separate from retrieve ef_search to avoid cross-path drift. Reference standard: Malkov-Yashunin 2016 §4 recall-vs-latency envelope (ef=128 yields recall >= 0.95 at p95 latency < 20ms for 768-dim).
- HNSW_SEED_FALLBACK - R5 bootstrap-only HNSW seed fallback for when commit_cid is the zero CID (e.g. the first commit in an empty repo).
- MIN_SAMPLE_SIZE - R4 minimum HNSW neighbourhood size below which threshold derivation refuses to emit canonical_cid.
- RESOLVE_OR_CREATE_P99_MS - R5 numeric p99 SLO for mnem_resolve_or_create.
- SIGMA_MULTIPLIER_FOR_COLLAPSE - R3 same-class sigma multiplier for collapse threshold. Derivation: DBSCAN-/HDBSCAN-style inlier boundary mean - 2*sigma. Clamped to [1.5, 3.0] at manifest-load time.
Functions§
- derive_local_threshold - Run k=2 Gaussian Mixture on a pre-computed HNSW-local cosine sample and return the distribution-derived collapse threshold.
- normalized_levenshtein - Normalized Levenshtein distance in [0, 1]. 0 = identical, 1 = maximally different. Used by the edit-distance consensus signal. Implementation is the classic O(m*n) DP matrix, pure-Rust, no extra crate. Short names dominate here so memory is a non-issue.
- resolve_hnsw_seed - Resolve the commit-derived HNSW build seed.
- resolve_or_create - Resolve a query string onto an existing canonical node, or decide that a new node should be created.
- resolve_or_create_simple - Tight (query, threshold) -> ResolveResult shape from the gap brief.
- two_of_three_consensus - Two-of-three consensus: returns (signals_passed, per_signal) where per_signal = [cosine_ok, edit_ok, namespace_ok].