SparseBucket - per-node leaf object inside the Prolly sidecar
that lifts the sparse embedding out of the
Node canonical bytes.
§Why this exists
When the sparse embedding lives inline on Node:
NodeCid = blake3(canonical_bytes(Node)) // includes sparse_embed
Different sparse encoders and vocabulary differences produce
different byte representations, so two machines indexing the same
logical source text with different encoder versions produce different
NodeCid values. That breaks mnem’s federated-dedup promise.
Fix: sparse embeddings live in a separate Prolly tree referenced by
Commit.sparse: Option<Cid> (the sibling slot to
Commit.embeddings). The tree is keyed by 16-byte truncated blake3
of the NodeCid wire form; values are SparseBuckets carrying one
(vocab_id, SparseEmbed) pair per indexed vocabulary. Identity bytes
(Node) and derived bytes (SparseEmbed) are content-addressed
independently. Vocab differences no longer leak into Node CIDs.
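The layout described above can be sketched roughly as follows. This is a minimal illustration, not mnem's actual code: `VocabId`, the fields of `SparseEmbed`, `truncate_key`, `SparseTree`, and `index_node` are hypothetical stand-ins, a `BTreeMap` takes the place of the Prolly tree, and the 32-byte blake3 digest of the NodeCid wire form is assumed to be computed elsewhere and passed in. It also assumes that "truncated" means keeping the leading 16 bytes.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins for types this page names but does not define.
type VocabId = String;

#[derive(Debug, Clone, PartialEq)]
struct SparseEmbed {
    indices: Vec<u32>,
    values: Vec<f32>,
}

// One bucket per node, holding one (vocab_id, SparseEmbed) pair per vocabulary.
#[derive(Debug, Clone, PartialEq, Default)]
struct SparseBucket {
    entries: Vec<(VocabId, SparseEmbed)>,
}

// Assumption: truncation keeps the first 16 bytes of the 32-byte digest.
fn truncate_key(digest: &[u8; 32]) -> [u8; 16] {
    let mut key = [0u8; 16];
    key.copy_from_slice(&digest[..16]);
    key
}

// A BTreeMap stands in for the Prolly tree referenced by Commit.sparse.
type SparseTree = BTreeMap<[u8; 16], SparseBucket>;

// Record an embedding for a node without touching the Node bytes themselves.
// `node_cid_digest` is assumed to be blake3 of the NodeCid wire form.
fn index_node(
    tree: &mut SparseTree,
    node_cid_digest: &[u8; 32],
    vocab_id: &str,
    embed: SparseEmbed,
) {
    let bucket = tree.entry(truncate_key(node_cid_digest)).or_default();
    bucket.entries.push((vocab_id.to_string(), embed));
}
```

Because the embedding is stored under a key derived from the NodeCid rather than inside the Node, indexing the same node under a second vocabulary adds an entry to its bucket and leaves the node's identity bytes unchanged.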
§Pattern source
Mirrors the EmbeddingBucket shape from
G16 and the AdjacencyBucket shape from
the existing IndexSet sidecar: sorted entry
list inside each leaf, hand-rolled Serialize/Deserialize
carrying a _kind discriminator and a #[serde(flatten)] extra
forward-compat carrier so unrelated schema bumps stay
round-trippable.
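The sorted entry list mentioned above might look like the sketch below. It is an assumption-laden illustration only: the field names on `SparseEntry`, the choice of `vocab_id` as the sort key, and the `upsert`/`get` helpers are hypothetical, and the hand-rolled serde impls with the `_kind` discriminator and flattened `extra` carrier are omitted to keep the example dependency-free.

```rust
// Hypothetical field layout; the real SparseEntry is not shown on this page.
#[derive(Debug, Clone, PartialEq)]
struct SparseEntry {
    vocab_id: String,
    indices: Vec<u32>,
    values: Vec<f32>,
}

#[derive(Debug, Default)]
struct SparseBucket {
    // Invariant (assumed): sorted by vocab_id, at most one entry per vocabulary.
    entries: Vec<SparseEntry>,
}

impl SparseBucket {
    // Insert a new entry at its sorted position, or replace the existing
    // entry for the same vocabulary.
    fn upsert(&mut self, entry: SparseEntry) {
        let pos = self
            .entries
            .binary_search_by(|e| e.vocab_id.as_str().cmp(entry.vocab_id.as_str()));
        match pos {
            Ok(i) => self.entries[i] = entry,
            Err(i) => self.entries.insert(i, entry),
        }
    }

    // Look up the entry for one vocabulary via binary search.
    fn get(&self, vocab_id: &str) -> Option<&SparseEntry> {
        self.entries
            .binary_search_by(|e| e.vocab_id.as_str().cmp(vocab_id))
            .ok()
            .map(|i| &self.entries[i])
    }
}
```

Keeping the entries sorted means the bucket's contents do not depend on insertion order, which is what a content-addressed leaf needs in order to serialize to the same bytes on every machine.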
Structs§
- SparseBucket - Per-node bucket of sparse embeddings inside the Commit.sparse Prolly tree.
- SparseEntry - One (vocab_id, SparseEmbed) pair inside a SparseBucket.