EmbeddingBucket - per-node leaf object inside the Prolly sidecar
that lifts the embedding vector out of the
Node canonical bytes.
§Why this exists
When the embedding vector lives inline on Node:
    NodeCid = blake3(canonical_bytes(Node)) // includes embed.vector
ORT reorders f32 sums across thread counts (TBB-style work-stealing
reductions are not associative on f32), so two machines re-deriving
the same source text on different core counts produce vectors that
differ in the last bit. Different vector → different Node bytes →
different NodeCid for embed-bearing chunks. That breaks mnem’s
“two machines indexing the same logical event produce identical
Node CIDs” federated-dedup promise as soon as the runtime uses
available_parallelism() instead of a single thread.
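The non-associativity claim above can be checked with a small, self-contained sketch. It does not use ORT; the helper names `sum_sequential` and `sum_chunked` are hypothetical, and the input values are chosen to make the difference large and obvious rather than a single last-bit flip.

```rust
/// Sum left to right, as a single-threaded reduction would.
fn sum_sequential(xs: &[f32]) -> f32 {
    xs.iter().fold(0.0f32, |acc, x| acc + x)
}

/// Sum each chunk on its own, then combine the partial sums,
/// roughly as a reduction split across several workers might.
fn sum_chunked(xs: &[f32], chunk_len: usize) -> f32 {
    xs.chunks(chunk_len)
        .map(sum_sequential)
        .fold(0.0f32, |acc, partial| acc + partial)
}

fn main() {
    // Same inputs, two groupings of the same additions.
    let xs = [1.0e8f32, 1.0, -1.0e8, 1.0];
    let one_worker = sum_sequential(&xs);
    let two_workers = sum_chunked(&xs, 2);

    println!("{one_worker} vs {two_workers}");

    // Different bits in the result mean different bytes in any
    // structure that embeds the result, and so a different hash.
    assert_ne!(one_worker.to_bits(), two_workers.to_bits());
}
```

Any value hashed together with such a sum inherits the dependence on how the work was split, which is the failure mode described for inline vectors.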
Fix: vectors live in a separate Prolly tree referenced by
Commit.embeddings: Option<Cid> (the sibling slot to
Commit.indexes). The tree is keyed by 32-byte NodeCid digest;
values are EmbeddingBuckets carrying one (model, Embedding)
pair per simultaneously-indexed embedder. Identity bytes (Node)
and derived bytes (Embedding) are content-addressed independently.
Multi-threaded ORT nondeterminism no longer leaks into Node CIDs.
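A minimal sketch of the separation, under loud assumptions: `toy_digest` is a placeholder built on the standard library hasher and is not blake3, a `BTreeMap` stands in for the Prolly tree, and the field layouts and the `index` helper are hypothetical rather than the crate's actual definitions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// 32-byte digest used as the sidecar key.
type Digest = [u8; 32];

/// Placeholder digest (NOT blake3): four 64-bit std hashes side by side.
fn toy_digest(bytes: &[u8]) -> Digest {
    let mut out = [0u8; 32];
    for (i, chunk) in out.chunks_mut(8).enumerate() {
        let mut h = DefaultHasher::new();
        (i as u64).hash(&mut h);
        bytes.hash(&mut h);
        chunk.copy_from_slice(&h.finish().to_be_bytes());
    }
    out
}

/// Identity bytes only; the vector is deliberately absent.
struct Node {
    text: String,
}

fn node_cid(node: &Node) -> Digest {
    toy_digest(node.text.as_bytes())
}

#[derive(Debug, Clone, PartialEq)]
struct Embedding {
    vector: Vec<f32>,
}

/// One (model, Embedding) pair per embedder (assumed layout).
#[derive(Debug, Clone, PartialEq)]
struct EmbeddingBucket {
    entries: Vec<(String, Embedding)>,
}

/// Stand-in for the tree that Commit.embeddings points at.
type Sidecar = BTreeMap<Digest, EmbeddingBucket>;

/// Hypothetical helper: store a vector under the node's digest.
fn index(node: &Node, model: &str, vector: Vec<f32>, sidecar: &mut Sidecar) -> Digest {
    let cid = node_cid(node);
    let entry = (model.to_string(), Embedding { vector });
    sidecar.insert(cid, EmbeddingBucket { entries: vec![entry] });
    cid
}

fn main() {
    let node = Node { text: "same source text".to_string() };
    // Simulate a last-bit difference between two machines.
    let last_bit_off = f32::from_bits(0.5f32.to_bits() + 1);

    let (mut a, mut b) = (Sidecar::new(), Sidecar::new());
    let cid_a = index(&node, "model-x", vec![0.25, 0.5], &mut a);
    let cid_b = index(&node, "model-x", vec![0.25, last_bit_off], &mut b);

    assert_eq!(cid_a, cid_b); // identity agrees across machines
    assert_ne!(a[&cid_a], b[&cid_b]); // derived bytes may still differ
}
```

Both machines agree on the key even though the stored buckets differ, so deduplication on Node CIDs is unaffected by the embedding runtime.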
§Pattern source
Mirrors the AdjacencyBucket shape from
the existing IndexSet sidecar: sorted entry
list inside each leaf, hand-rolled Serialize/Deserialize
carrying a _kind discriminator and a #[serde(flatten)] extra
forward-compat carrier so unrelated schema bumps stay
round-trippable.
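The sorted entry list can be sketched with the standard library alone. This assumes entries are ordered by model name, which the text above does not state, and it leaves out serialization entirely (the `_kind` discriminator and the flattened extra carrier); the `upsert` and `get` methods are hypothetical.

```rust
#[derive(Debug, Clone, PartialEq)]
struct EmbeddingEntry {
    model: String,
    vector: Vec<f32>,
}

#[derive(Debug, Default)]
struct EmbeddingBucket {
    /// Kept sorted by `model` so equal contents give equal bytes.
    entries: Vec<EmbeddingEntry>,
}

impl EmbeddingBucket {
    /// Insert a new entry at its sorted position, or replace the
    /// existing entry for the same model.
    fn upsert(&mut self, entry: EmbeddingEntry) {
        let pos = self
            .entries
            .binary_search_by(|e| e.model.as_str().cmp(entry.model.as_str()));
        match pos {
            Ok(i) => self.entries[i] = entry,
            Err(i) => self.entries.insert(i, entry),
        }
    }

    /// Look up the entry for one model by binary search.
    fn get(&self, model: &str) -> Option<&EmbeddingEntry> {
        self.entries
            .binary_search_by(|e| e.model.as_str().cmp(model))
            .ok()
            .map(|i| &self.entries[i])
    }
}

fn main() {
    let mut bucket = EmbeddingBucket::default();
    bucket.upsert(EmbeddingEntry { model: "zeta".into(), vector: vec![1.0] });
    bucket.upsert(EmbeddingEntry { model: "alpha".into(), vector: vec![2.0] });
    bucket.upsert(EmbeddingEntry { model: "zeta".into(), vector: vec![3.0] });

    let order: Vec<&str> = bucket.entries.iter().map(|e| e.model.as_str()).collect();
    assert_eq!(order, ["alpha", "zeta"]);
    assert_eq!(bucket.get("zeta").unwrap().vector, vec![3.0]);
}
```

Keeping the list sorted means insertion order never shows up in the serialized leaf, so two writers that add the same embedders in different orders would produce the same bucket.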
Structs§
- EmbeddingBucket - Per-node bucket of embeddings inside the Commit.embeddings Prolly tree.
- EmbeddingEntry - One (model, Embedding) pair inside an EmbeddingBucket.