The embedding stage: candle XLM-RoBERTa FP16 (CandleEmbedder) plus
the batch-oriented EmbedWorker that fills messages.vector /
messages.embedding_model (spec.md#search). One message produces one
vector - there is no chunking.
LazyEmbedder caches a loaded backend for pond mcp / pond serve
and drops it after DEFAULT_IDLE_EVICTION without use. The drop is
clean under macOS phys_footprint (the footprint falls to ~107 MiB afterwards,
regardless of backend), so time-weighted RSS over an interactive MCP
session stays well under the per-instance budget despite the macOS
Metal buffer pool’s iokit_mapped retention during active queries.
The worker accumulates messages and calls the model once per fixed-size
batch, never once per message, and writes each batch’s vectors to
messages in one column-update commit.
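A minimal sketch of the batching loop described above. It assumes a hypothetical Embedder trait with a single embed_batch method, an instrumented fake (CountingEmbedder), and a commit callback standing in for the real column-update write; none of these names or signatures are taken from pond's actual API.

```rust
/// Hypothetical stand-in for the embedding seam: texts in, one vector per text out.
trait Embedder {
    fn embed_batch(&mut self, texts: &[String]) -> Vec<Vec<f32>>;
}

/// Instrumented fake that records how many texts each model call received.
struct CountingEmbedder {
    calls: Vec<usize>,
}

impl Embedder for CountingEmbedder {
    fn embed_batch(&mut self, texts: &[String]) -> Vec<Vec<f32>> {
        self.calls.push(texts.len());
        // A dummy one-dimensional "vector" per text; a real backend returns model output.
        texts.iter().map(|t| vec![t.len() as f32]).collect()
    }
}

/// Embed `backlog` in fixed-size batches. `commit` is called once per batch with
/// (message id, vector) pairs, standing in for one column-update commit.
fn embed_backlog<E: Embedder>(
    embedder: &mut E,
    backlog: &[(u64, String)],
    batch_size: usize,
    mut commit: impl FnMut(Vec<(u64, Vec<f32>)>),
) {
    assert!(batch_size > 0, "batch_size must be non-zero");
    for batch in backlog.chunks(batch_size) {
        let texts: Vec<String> = batch.iter().map(|(_, text)| text.clone()).collect();
        // One model call per batch, never one per message.
        let vectors = embedder.embed_batch(&texts);
        let rows: Vec<(u64, Vec<f32>)> =
            batch.iter().map(|(id, _)| *id).zip(vectors).collect();
        // One write per batch.
        commit(rows);
    }
}
```

With a 70-message backlog and a batch size of 32, this sketch makes three model calls (32, 32 and 6 messages) and three matching commits.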
Structs
- BatchProgress - Per-batch stats handed to a progress callback. Lets pond embed drive an indicatif bar without leaking the crate into this module’s API.
- CandleEmbedder - The candle e5 backend: XLM-RoBERTa FP16 weights on the GPU (Metal on macOS, CUDA on a cuda-feature non-macOS build, CPU otherwise). forward is &self, so no interior mutability is needed.
- EmbedSummary - Outcome of an EmbedWorker::run pass.
- EmbedWorker - Fills messages.vector / messages.embedding_model for the backlog of un-embedded messages. Reads messages.search_text directly, batches it through the backend to get one vector per message, and writes each batch back to messages by primary key.
- LazyEmbedder - Lazy holder for an Embedder with idle eviction. The model isn’t loaded until the first hybrid/vector call asks for it - idle pond mcp / pond serve processes pay nothing while no vector queries land. After idle_threshold of inactivity the cached backend is dropped on the next get call; under macOS phys_footprint the drop reclaims ~365-585 MiB cleanly (the post-drop floor is ~107 MiB regardless of backend). Reload cost is one synchronous model-load (300-500 ms), absorbed inside the human-paced gap between MCP queries.
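The load-on-first-use and evict-on-next-get behavior can be sketched roughly as follows. LazyHolder, get_at, and the loads counter are hypothetical names invented for this illustration; the real LazyEmbedder's fields and signatures are not shown in this page. The clock is passed in explicitly so the eviction rule can be exercised without sleeping.

```rust
use std::time::{Duration, Instant};

/// Hypothetical lazy holder: loads `T` on first use and drops it once it has
/// sat unused for `idle_threshold`.
struct LazyHolder<T> {
    cached: Option<T>,
    last_used: Option<Instant>,
    idle_threshold: Duration,
    loads: usize, // instrumentation only: how many times the loader ran
}

impl<T> LazyHolder<T> {
    fn new(idle_threshold: Duration) -> Self {
        Self { cached: None, last_used: None, idle_threshold, loads: 0 }
    }

    /// Return the cached value, evicting it first if it sat idle too long.
    /// Eviction happens only inside this call; there is no background timer.
    fn get_at(&mut self, now: Instant, load: impl FnOnce() -> T) -> &T {
        if let Some(last) = self.last_used {
            if now.duration_since(last) >= self.idle_threshold {
                self.cached = None; // drop the stale backend
            }
        }
        self.last_used = Some(now);
        if self.cached.is_none() {
            self.loads += 1;
            self.cached = Some(load()); // synchronous (re)load
        }
        self.cached.as_ref().expect("populated just above")
    }
}
```

In this sketch a call one minute after the previous one reuses the cached value, while a call five minutes after the previous one drops it and pays one reload.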
Constants
- DEFAULT_BATCH_SIZE - Messages per model-inference + write batch. e5 truncates at 512 tokens, so a 32-row batch’s padded attention transient stays bounded.
- DEFAULT_IDLE_EVICTION - How long the cached backend can sit unused before LazyEmbedder::get drops it. Five minutes matches typical interactive-MCP conversational pauses: short enough that a model that’s been unused for a turn or two is gone before the next quiet window, long enough that ordinary query bursts never pay the reload cost.
- DEFAULT_MODEL_ID - Default embedding model pond ships a loader for (spec.md#search). Used when [embeddings].model is absent. pond embed stamps the runtime model id (see model_id) into messages.embedding_model with every vector. e5-small (384-dim) is the default; the paraphrase benchmark set showed no statistically-significant quality loss vs e5-base while halving vector storage and ~halving model RSS.
- DEFAULT_SORT_WINDOW - Messages buffered and length-sorted before being cut into model batches. The tokenizer pads every batch to its longest member, so a batch mixing a short and a long message embeds the short one at the long one’s length. Sorting a window first clusters similar-length messages, so each batch pads near its own longest, not the corpus worst case. Bounded so peak memory stays one window, not the whole backlog. See EmbedWorker::with_sort_window.
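To make the padding argument concrete, here is a small sketch that counts padded tokens with and without window sorting. The function padded_tokens and its parameters are hypothetical and only model the cost (longest member times rows per batch); it does not reproduce the worker's actual implementation.

```rust
/// Cut `lengths` (token counts, one per message) into batches of `batch_size`,
/// optionally length-sorting each `window` first, and return the total padded
/// token count: each batch costs (its longest member) * (its row count).
fn padded_tokens(lengths: &[usize], window: usize, batch_size: usize, sort: bool) -> usize {
    assert!(window > 0 && batch_size > 0, "window and batch_size must be non-zero");
    let mut total = 0;
    for chunk in lengths.chunks(window) {
        // Only one window is buffered at a time, so memory stays bounded.
        let mut buf = chunk.to_vec();
        if sort {
            buf.sort_unstable();
        }
        for batch in buf.chunks(batch_size) {
            let longest = batch.iter().copied().max().unwrap_or(0);
            total += longest * batch.len();
        }
    }
    total
}
```

For the lengths [10, 500, 12, 480, 9, 510, 11, 490] with a window of 8 and batches of 4, the unsorted order pads to 4040 tokens (500 × 4 + 510 × 4), while sorting the window first pads to 2088 (12 × 4 + 510 × 4).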
Traits
- Embedder - The embedding seam (spec.md#search): text in, vectors out. The real backend is CandleEmbedder; tests substitute an instrumented fake to assert batching behavior. The vector width is checked at the write boundary and the model id is whatever model_id returns at the time of the write.
Functions
- format_passage - Format a document (one message’s search_text) for the embedder - the passage: half of the pair documented on format_query. Used by EmbedWorker when batching messages for pond embed.
- format_query - Format a search query for the embedder. e5 is an asymmetric retriever: its model card prescribes query: on the search side, passage: on documents. Used by pond_search to prepare the query text before the candle/Metal embed call.
- init_model_id - Seed model_id from config. First call wins; later calls with a different id are silently ignored - the process loads its config once.
- model_id - The active model id. Returns the value installed by init_model_id, or DEFAULT_MODEL_ID when nothing has installed one (tests, ad-hoc tooling).
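One plausible shape for these four functions, shown only as a sketch. The signatures and bodies are assumptions, the prefix strings follow the e5 convention of "query: " and "passage: ", the first-call-wins rule is modelled with std::sync::OnceLock, and the DEFAULT_MODEL_ID value is a placeholder rather than pond's real default.

```rust
use std::sync::OnceLock;

/// Assumed body: prepend the e5 search-side prefix.
fn format_query(text: &str) -> String {
    format!("query: {text}")
}

/// Assumed body: prepend the e5 document-side prefix.
fn format_passage(text: &str) -> String {
    format!("passage: {text}")
}

// Placeholder value for illustration; not pond's real default model id.
const DEFAULT_MODEL_ID: &str = "example/default-model";

static MODEL_ID: OnceLock<String> = OnceLock::new();

/// First call wins: `OnceLock::set` returns `Err` once a value is installed,
/// and discarding that error silently ignores the later id.
fn init_model_id(id: &str) {
    let _ = MODEL_ID.set(id.to_string());
}

/// The installed id, or the default when nothing has installed one.
fn model_id() -> &'static str {
    MODEL_ID.get().map(String::as_str).unwrap_or(DEFAULT_MODEL_ID)
}
```

Reading through model_id() at write time, rather than caching the id elsewhere, keeps the stamped value consistent with whatever was seeded first.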