Module embedding_set

EmbeddingBucket - per-node leaf object inside the Prolly sidecar that lifts the embedding vector out of the Node canonical bytes.

§Why this exists

When the embedding vector lives inline on Node:

NodeCid = blake3(canonical_bytes(Node)) // includes embed.vector

ORT changes the order of f32 summations depending on the thread count. Because f32 addition is not associative, TBB-style work-stealing reductions can round differently for different thread counts, so two machines that re-derive the embedding for the same source text on different core counts can produce vectors that differ in the last bit. A different vector means different Node bytes, and therefore a different NodeCid for embed-bearing chunks. That breaks mnem’s federated-dedup promise that “two machines indexing the same logical event produce identical Node CIDs” as soon as the runtime uses available_parallelism() instead of a single thread.
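The order dependence can be reproduced with plain f32 arithmetic. The sketch below is not ORT code: `sum_sequential` and `sum_chunked` are hypothetical helpers that imitate a single-threaded sum and a reduction split across workers, and the input is chosen to exaggerate an effect that in practice usually shows up only in the last bit.

```rust
/// Sum left to right, as a single thread would.
fn sum_sequential(xs: &[f32]) -> f32 {
    xs.iter().fold(0.0_f32, |acc, x| acc + x)
}

/// Sum each fixed-size chunk separately, then combine the partial sums.
/// This imitates a reduction split across workers; `chunk_len` must be non-zero.
fn sum_chunked(xs: &[f32], chunk_len: usize) -> f32 {
    xs.chunks(chunk_len)
        .map(sum_sequential)
        .fold(0.0_f32, |acc, partial| acc + partial)
}

/// Same values, two groupings. At magnitude 1e8 the spacing between
/// adjacent f32 values is 8, so adding 1.0 to 1e8 is rounded away.
fn grouping_demo() -> (f32, f32) {
    let xs = [1.0e8_f32, 1.0, -1.0e8, 1.0];
    (sum_sequential(&xs), sum_chunked(&xs, 2))
}
```

Here `grouping_demo()` returns `(1.0, 0.0)`: the same four values summed in two groupings give different results, and any hash over the resulting bytes would differ too.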

Fix: vectors live in a separate Prolly tree referenced by Commit.embeddings: Option<Cid> (the sibling slot to Commit.indexes). The tree is keyed by 32-byte NodeCid digest; values are EmbeddingBuckets carrying one (model, Embedding) pair per simultaneously-indexed embedder. Identity bytes (Node) and derived bytes (Embedding) are content-addressed independently. Multi-thread ORT no longer leaks into Node CIDs.
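A minimal sketch of that layout follows, under loud assumptions: a `BTreeMap` stands in for the Prolly tree, `EmbeddingTree` is a hypothetical wrapper, and the field names on `EmbeddingBucket` and `EmbeddingEntry` (`entries`, `model`, `vector`) are guesses for illustration, not the crate's actual definitions. It shows only the keying scheme: a 32-byte node digest maps to a bucket holding one entry per model, kept sorted by model name.

```rust
use std::collections::BTreeMap;

/// 32-byte NodeCid digest used as the tree key.
type NodeDigest = [u8; 32];

/// Hypothetical layout for one (model, Embedding) pair.
#[derive(Debug, Clone, PartialEq)]
struct EmbeddingEntry {
    model: String,
    vector: Vec<f32>,
}

/// Hypothetical per-node bucket; `entries` stays sorted by model name.
#[derive(Debug, Default)]
struct EmbeddingBucket {
    entries: Vec<EmbeddingEntry>,
}

impl EmbeddingBucket {
    /// Insert a new model's vector or overwrite the existing one.
    fn upsert(&mut self, model: &str, vector: Vec<f32>) {
        match self.entries.binary_search_by(|e| e.model.as_str().cmp(model)) {
            Ok(i) => self.entries[i].vector = vector,
            Err(i) => self.entries.insert(
                i,
                EmbeddingEntry { model: model.to_string(), vector },
            ),
        }
    }

    /// Look up the vector stored for `model`, if any.
    fn get(&self, model: &str) -> Option<&[f32]> {
        self.entries
            .binary_search_by(|e| e.model.as_str().cmp(model))
            .ok()
            .map(|i| self.entries[i].vector.as_slice())
    }
}

/// Stand-in for the Prolly tree referenced by `Commit.embeddings`.
#[derive(Debug, Default)]
struct EmbeddingTree {
    buckets: BTreeMap<NodeDigest, EmbeddingBucket>,
}

impl EmbeddingTree {
    fn upsert(&mut self, node: NodeDigest, model: &str, vector: Vec<f32>) {
        self.buckets.entry(node).or_default().upsert(model, vector);
    }

    fn get(&self, node: &NodeDigest, model: &str) -> Option<&[f32]> {
        self.buckets.get(node)?.get(model)
    }
}
```

Because the key is derived from the Node alone, re-embedding a node with a slightly different vector rewrites the bucket under the same key and leaves the Node's identity untouched.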

§Pattern source

Mirrors the AdjacencyBucket shape from the existing IndexSet sidecar: sorted entry list inside each leaf, hand-rolled Serialize/Deserialize carrying a _kind discriminator and a #[serde(flatten)] extra forward-compat carrier so unrelated schema bumps stay round-trippable.
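The round-trip property can be illustrated without serde. The sketch below is a std-only imitation, not the module's actual Serialize/Deserialize code: the `Bucket` type, the string-map wire format, and the `"embedding_bucket"` discriminator value are all assumptions. What it shares with the described pattern is the shape: check `_kind`, pull out the known fields, and keep every unrecognized key in `extra` so it is written back unchanged.

```rust
use std::collections::BTreeMap;

/// Hypothetical discriminator value; the real one is not shown in the source.
const KIND: &str = "embedding_bucket";

/// Toy bucket: one known field plus a carrier for unknown keys.
#[derive(Debug, Clone, PartialEq)]
struct Bucket {
    entries: Vec<String>,
    /// Keys this version does not understand, preserved verbatim.
    extra: BTreeMap<String, String>,
}

/// Write known fields and the discriminator on top of the preserved extras.
fn encode(bucket: &Bucket) -> BTreeMap<String, String> {
    let mut map = bucket.extra.clone();
    map.insert("_kind".to_string(), KIND.to_string());
    map.insert("entries".to_string(), bucket.entries.join(","));
    map
}

/// Validate the discriminator, take known fields, keep the rest in `extra`.
fn decode(mut map: BTreeMap<String, String>) -> Result<Bucket, String> {
    let kind = map.remove("_kind");
    if kind.as_deref() != Some(KIND) {
        return Err(format!("unexpected _kind: {:?}", kind));
    }
    let raw = map.remove("entries").ok_or("missing entries")?;
    let entries = if raw.is_empty() {
        Vec::new()
    } else {
        raw.split(',').map(str::to_string).collect()
    };
    Ok(Bucket { entries, extra: map })
}
```

A map written by a newer schema with an additional key decodes into `extra` and is re-emitted by `encode` unchanged, which is the behavior the `#[serde(flatten)]` carrier is described as providing.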

Structs§

EmbeddingBucket
Per-node bucket of embeddings inside the Commit.embeddings Prolly tree.
EmbeddingEntry
One (model, Embedding) pair inside an EmbeddingBucket.