Module sparse_set

SparseBucket - per-node leaf object inside the Prolly sidecar that lifts the sparse embedding out of the Node canonical bytes.

§Why this exists

When the sparse embedding lives inline on Node:

NodeCid = blake3(canonical_bytes(Node)) // includes sparse_embed

Different sparse encoders and vocabularies produce different byte representations, so two machines indexing the same logical source text with different encoder versions produce different NodeCid values. That breaks mnem’s federated-dedup promise.

Fix: sparse embeddings live in a separate Prolly tree referenced by Commit.sparse: Option<Cid> (the sibling slot to Commit.embeddings). The tree is keyed by the 16-byte truncated blake3 hash of the NodeCid wire form; values are SparseBuckets, each carrying one (vocab_id, SparseEmbed) pair per indexed vocabulary. Identity bytes (Node) and derived bytes (SparseEmbed) are content-addressed independently. Vocab differences no longer leak into Node CIDs.
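
The key derivation and the identity/derived split can be sketched roughly as follows. This is an illustration under stated assumptions, not the crate's actual code: Digest, toy_hash, canonical_bytes, node_cid, tree_key, and the Node field layout are all hypothetical names, toy_hash is a non-cryptographic stand-in for blake3 so the sketch builds without external crates, and truncation is assumed to keep the leading 16 bytes.

```rust
/// 32-byte digest, standing in for a blake3 output (assumption).
pub type Digest = [u8; 32];

/// Toy, NON-cryptographic stand-in for blake3 so the sketch needs no crates.
/// Real code would call a blake3 implementation here instead.
pub fn toy_hash(bytes: &[u8]) -> Digest {
    let mut out = [0u8; 32];
    // Four FNV-1a lanes with different seeds, 8 output bytes per lane.
    for lane in 0..4u64 {
        let mut h: u64 = 0xcbf29ce484222325 ^ lane.wrapping_mul(0x9e3779b97f4a7c15);
        for &b in bytes {
            h ^= b as u64;
            h = h.wrapping_mul(0x100000001b3);
        }
        let i = lane as usize * 8;
        out[i..i + 8].copy_from_slice(&h.to_le_bytes());
    }
    out
}

/// Hypothetical node holding identity bytes only; no sparse embedding inline.
pub struct Node {
    pub text: String,
}

/// Hypothetical canonical encoding; the real wire form is not shown in the docs.
pub fn canonical_bytes(node: &Node) -> Vec<u8> {
    node.text.as_bytes().to_vec()
}

/// Node identity depends only on the canonical bytes of the node.
pub fn node_cid(node: &Node) -> Digest {
    toy_hash(&canonical_bytes(node))
}

/// Key into the sparse tree: hash of the NodeCid bytes, truncated to 16 bytes.
/// Keeping the leading 16 bytes is an assumption of this sketch.
pub fn tree_key(cid: &Digest) -> [u8; 16] {
    let full = toy_hash(cid);
    let mut key = [0u8; 16];
    key.copy_from_slice(&full[..16]);
    key
}
```

Because node_cid never sees a sparse embedding, two machines that build a Node from the same text get the same CID and the same tree key, whichever encoder version they use to fill the bucket stored under that key.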

§Pattern source

Mirrors the EmbeddingBucket shape from G16 and the AdjacencyBucket shape from the existing IndexSet sidecar: sorted entry list inside each leaf, hand-rolled Serialize/Deserialize carrying a _kind discriminator and a #[serde(flatten)] extra forward-compat carrier so unrelated schema bumps stay round-trippable.
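
A minimal sketch of the "sorted entry list inside each leaf" part of that pattern, assuming entries are ordered by vocab_id and that vocab_id is a string. The field layouts, the upsert/get/vocab_ids helpers, and the SparseEmbed representation are hypothetical. The hand-rolled Serialize/Deserialize with the _kind discriminator and the flattened extra carrier is left out so the sketch builds without serde.

```rust
/// Hypothetical sparse embedding: parallel index/weight lists.
#[derive(Debug, Clone, PartialEq)]
pub struct SparseEmbed {
    pub indices: Vec<u32>,
    pub weights: Vec<f32>,
}

/// One (vocab_id, SparseEmbed) pair. The String type for vocab_id is an assumption.
#[derive(Debug, Clone, PartialEq)]
pub struct SparseEntry {
    pub vocab_id: String,
    pub embed: SparseEmbed,
}

/// Per-node bucket; entries are kept sorted by vocab_id at all times.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct SparseBucket {
    entries: Vec<SparseEntry>,
}

impl SparseBucket {
    /// Insert a new entry or replace the existing one for the same vocab_id,
    /// preserving sorted order via binary search.
    pub fn upsert(&mut self, entry: SparseEntry) {
        let pos = self
            .entries
            .binary_search_by(|e| e.vocab_id.as_str().cmp(entry.vocab_id.as_str()));
        match pos {
            Ok(i) => self.entries[i] = entry,
            Err(i) => self.entries.insert(i, entry),
        }
    }

    /// Look up the embedding for one vocabulary, if present.
    pub fn get(&self, vocab_id: &str) -> Option<&SparseEmbed> {
        self.entries
            .binary_search_by(|e| e.vocab_id.as_str().cmp(vocab_id))
            .ok()
            .map(|i| &self.entries[i].embed)
    }

    /// Vocabulary ids in stored (sorted) order.
    pub fn vocab_ids(&self) -> Vec<&str> {
        self.entries.iter().map(|e| e.vocab_id.as_str()).collect()
    }
}
```

Keeping the list sorted means two buckets holding the same entries compare equal regardless of insertion order, which is presumably what lets the serialized leaf hash to the same Cid on every machine.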

Structs§

SparseBucket
Per-node bucket of sparse embeddings inside the Commit.sparse Prolly tree.
SparseEntry
One (vocab_id, SparseEmbed) pair inside a SparseBucket.