
Module resolve

Entity canonicalization - resolve_or_create (gap-catalog gap 04).

This module implements the collapse-or-create decision for a string query against an HNSW-indexed population of already-known nodes. The design follows research/gap-catalog/04-entity-canonicalization/ R1-R6, in particular:

  • Distribution-derived collapse threshold tau_n computed from a k=2 Gaussian Mixture over the HNSW-local cosine sample. No global magic cosine constant; the threshold tracks the corpus geometry. tau_n = max(mu_same - 2*sigma_same, mu_diff + sigma_diff).
  • Two-of-three consensus collapse gate: at least two of (cosine, normalized_levenshtein, namespace/trust) must agree for two nodes to be merged. Single-signal collapses are refused.
  • Commit-id-derived HNSW seed: the HNSW walk seed is BLAKE3(commit_cid || domain_sep)[..8], so two runs against the same commit get the same seed and different commits get independent seeds. When the commit CID is the zero CID, the bootstrap fallback 0xCANO_N_0001_u64 is used instead.
  • CommitBudgetGuard wiring: caller passes latency_budget_ms: Option<u32> and the module opens a guard at RESOLVE_OR_CREATE_P99_MS hard wall; exhaustion returns ResolveResult::BudgetExhausted carrying the best-effort candidate.
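The threshold formula in the first bullet can be illustrated with a minimal sketch. It assumes the two Gaussian Mixture components have already been fitted and labelled "same" and "diff"; `ComponentStats` and `collapse_threshold` are hypothetical names, not this module's API. Treating the factor 2 as the `SIGMA_MULTIPLIER_FOR_COLLAPSE` value and clamping it to [1.5, 3.0] inside the function is also an assumption made for illustration.

```rust
/// Mean and standard deviation of one fitted mixture component.
/// Hypothetical type for this sketch; not the module's LocalThreshold.
#[derive(Debug, Clone, Copy)]
pub struct ComponentStats {
    pub mu: f64,
    pub sigma: f64,
}

/// tau_n = max(mu_same - k * sigma_same, mu_diff + sigma_diff).
///
/// `sigma_multiplier` plays the role of the factor 2 in the formula.
/// Clamping it to [1.5, 3.0] here is an assumption of this sketch.
pub fn collapse_threshold(
    same: ComponentStats,
    diff: ComponentStats,
    sigma_multiplier: f64,
) -> f64 {
    let k = sigma_multiplier.clamp(1.5, 3.0);
    // Lower edge of the "same entity" component.
    let inlier_floor = same.mu - k * same.sigma;
    // Upper edge of the "different entity" component.
    let outlier_ceiling = diff.mu + diff.sigma;
    // Taking the larger of the two keeps the threshold above the
    // "different" component even when the two components overlap.
    inlier_floor.max(outlier_ceiling)
}
```

With well-separated components (mu_same = 0.9, sigma_same = 0.03, mu_diff = 0.4, sigma_diff = 0.1) the first term wins and the threshold is about 0.84. When the components overlap, the second term takes over and the threshold sits just above the "different" component.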

§p99 floor-c apparatus (R6)

RESOLVE_OR_CREATE_P99_MS is a tunable floor-c constant:

  • Reference standard: p95_hnsw_walk_ms + consensus_overhead_ms + p99_headroom = 50 on the reference repo (|V|=1M, avg_degree=12).
  • Gauge: mnem_resolve_or_create_p99_breach_total.
  • Proptest: [tests::resolve_or_create_hits_50ms_hard_wall].
  • Unit tests: [tests::resolve_creates_below_threshold], [tests::resolve_merges_above_threshold], [tests::threshold_derived_from_local_samples], [tests::commit_budget_guard_cuts_off].
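The hard-wall behaviour can be sketched as follows. This is an illustration under stated assumptions, not the real CommitBudgetGuard: `BudgetGuard`, `Outcome`, `best_within_budget`, and `HARD_WALL_MS` are hypothetical names, the value 50 is taken from the reference standard above, and the rule that a caller-supplied budget can only tighten the wall (never extend it) is an assumption.

```rust
use std::time::{Duration, Instant};

/// Assumed hard wall in milliseconds, mirroring the reference standard.
const HARD_WALL_MS: u32 = 50;

/// Hypothetical stand-in for the real guard type.
pub struct BudgetGuard {
    started: Instant,
    budget: Duration,
}

impl BudgetGuard {
    /// Assumption: the caller's budget may tighten the wall but not extend it.
    pub fn open(latency_budget_ms: Option<u32>) -> Self {
        let ms = latency_budget_ms.map_or(HARD_WALL_MS, |b| b.min(HARD_WALL_MS));
        Self {
            started: Instant::now(),
            budget: Duration::from_millis(u64::from(ms)),
        }
    }

    pub fn budget(&self) -> Duration {
        self.budget
    }

    pub fn exhausted(&self) -> bool {
        self.started.elapsed() >= self.budget
    }
}

/// Simplified outcome: both variants carry the best candidate id seen so far.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    Completed(Option<u64>),
    BudgetExhausted(Option<u64>),
}

/// Scan (id, cosine) pairs, checking the guard before each step so that
/// exhaustion still returns the best-effort candidate found so far.
pub fn best_within_budget(candidates: &[(u64, f32)], guard: &BudgetGuard) -> Outcome {
    let mut best: Option<(u64, f32)> = None;
    for &(id, cosine) in candidates {
        if guard.exhausted() {
            return Outcome::BudgetExhausted(best.map(|(id, _)| id));
        }
        if best.map_or(true, |(_, c)| cosine > c) {
            best = Some((id, cosine));
        }
    }
    Outcome::Completed(best.map(|(id, _)| id))
}
```

A zero budget exhausts immediately and yields `BudgetExhausted(None)`, which is a convenient deterministic case for a cut-off test.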

§Rollback template (see scripts/rollback-gap-04.sql)

Rolling canonicalization back uses the following idempotent SQL template, kept here as a comment so readers don’t have to chase the script file:

-- scripts/rollback-gap-04.sql
-- Rollback entity canonicalization emitted after <ROLLBACK_CID>.
-- Invocation: mnem admin rollback --feature=canonicalization --after=<CID>
-- Idempotent: re-running is safe (second run is a no-op).

BEGIN TRANSACTION;

-- 1. Drop canonical_cid props from nodes committed after the point.
UPDATE nodes
   SET props = json_remove(props, '$.canonical_cid')
 WHERE commit_cid > :ROLLBACK_CID
   AND json_extract(props, '$.canonical_cid') IS NOT NULL;

-- 2. Drop the staging table and delete the canonical cluster manifest rows.
DROP TABLE IF EXISTS canonical_manifest_staging;
DELETE FROM canonical_manifest
 WHERE commit_cid > :ROLLBACK_CID;

-- 3. Cache-flush NOTIFY handled post-SQL by mnem admin rollback:
--    posts INTERNAL ResetCanonicalCache event to runtime, which
--    drains AppState::canonical_cache + rebuilds lazily.
NOTIFY canonical_cache_flush, :ROLLBACK_CID;

-- 4. Reset rolling-telemetry derived counters so SLO alerting
--    does not attribute post-rollback baselines to rolled commits.
UPDATE rolling_stats
   SET p50_canonicalize_ms = NULL,
       p99_canonicalize_ms = NULL
 WHERE last_updated_commit_cid > :ROLLBACK_CID;

COMMIT;

Structs§

Candidate
A sampled (candidate_id, cosine_to_query, name_for_edit_dist, namespace, trust) tuple. Lifetime-free for testability: a real caller pulls these from the HNSW walk.
LocalThreshold
Per-node distribution-derived threshold and its component stats.
ResolveOutcome
Full outcome of a resolve call, including the guard’s report for embedding in the commit envelope and the (seed, source) pair used for the HNSW walk.
ResolveRequest
Request payload for resolve_or_create.

Enums§

HnswSeedSource
Origin of the HNSW build seed used for this run.
RefusalReason
Reasons a resolve call was refused (not merged, not created).
ResolveResult
Outcome of resolve_or_create.

Constants§

EDIT_DISTANCE_TAU
R3 same-class edit-distance tau (embedder-calibrated). A normalized Levenshtein distance of at most 25% qualifies as an edit-distance collapse signal.
EF_SEARCH_CANONICAL
R4 pinned ef_search for canonicalization HNSW handle. Separate from retrieve ef_search to avoid cross-path drift. Reference standard: Malkov-Yashunin 2016 §4 recall-vs-latency envelope (ef=128 yields recall >= 0.95 at p95 latency < 20ms for 768-dim).
HNSW_SEED_FALLBACK
R5 bootstrap-only HNSW seed fallback for when commit_cid is the zero CID (e.g. the first commit in an empty repo).
MIN_SAMPLE_SIZE
R4 minimum HNSW neighbourhood size below which threshold derivation refuses to emit canonical_cid.
RESOLVE_OR_CREATE_P99_MS
R5 numeric p99 SLO for mnem_resolve_or_create.
SIGMA_MULTIPLIER_FOR_COLLAPSE
R3 same-class sigma multiplier for collapse threshold. Derivation: DBSCAN-/HDBSCAN-style inlier boundary mean - 2*sigma. Clamped to [1.5, 3.0] at manifest-load time.

Functions§

derive_local_threshold
Run k=2 Gaussian Mixture on a pre-computed HNSW-local cosine sample and return the distribution-derived collapse threshold.
normalized_levenshtein
Normalized Levenshtein distance in [0, 1]. 0 = identical, 1 = maximally different. Used by the edit-distance consensus signal. Implementation is the classic O(m*n) DP matrix, pure-Rust, no extra crate. Short names dominate here so memory is a non-issue.
resolve_hnsw_seed
Resolve the commit-derived HNSW build seed.
resolve_or_create
Resolve a query string onto an existing canonical node, or decide that a new node should be created.
resolve_or_create_simple
Tight (query, threshold) -> ResolveResult shape from the gap brief.
two_of_three_consensus
Two-of-three consensus: returns (signals_passed, per_signal) where per_signal = [cosine_ok, edit_ok, namespace_ok].
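The edit-distance signal and the consensus gate described above can be sketched together. The signatures here are hypothetical and simplified, not the module's own. The distance function uses a two-row variant of the dynamic program rather than the full matrix, compares Unicode scalar values, and normalizes by the longer input's length; all three are assumptions of this sketch. The 0.25 bound in the usage note comes from the EDIT_DISTANCE_TAU description.

```rust
/// Normalized Levenshtein distance in [0, 1]: 0 = identical,
/// 1 = maximally different. Sketch only; see assumptions in the lead-in.
pub fn normalized_levenshtein(a: &str, b: &str) -> f64 {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let max_len = a.len().max(b.len());
    if max_len == 0 {
        return 0.0; // two empty strings are identical
    }
    // prev[j] holds the distance between a[..i] and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0usize; b.len() + 1];
    for (i, ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, cb) in b.iter().enumerate() {
            let cost = usize::from(ca != cb);
            curr[j + 1] = (prev[j] + cost) // substitute or match
                .min(prev[j + 1] + 1) // delete from a
                .min(curr[j] + 1); // insert into a
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()] as f64 / max_len as f64
}

/// Hypothetical gate: returns (signals_passed, [cosine_ok, edit_ok, namespace_ok]).
/// The caller merges only when signals_passed >= 2.
pub fn two_of_three(
    cosine: f64,
    tau: f64,
    edit_distance: f64,
    edit_tau: f64,
    namespace_ok: bool,
) -> (u8, [bool; 3]) {
    let per_signal = [cosine >= tau, edit_distance <= edit_tau, namespace_ok];
    let passed = per_signal.iter().filter(|&&ok| ok).count() as u8;
    (passed, per_signal)
}
```

For example, "Acme Corp" and "Acme Corp." differ by one insertion over ten characters, giving a distance of 0.1. With a cosine of 0.9 against a threshold of 0.84 and an edit bound of 0.25, two signals pass even when the namespace signal fails, so the pair qualifies for collapse. A high cosine alone does not.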