Local embedding generation (LLM-only, one-shot per invocation). Embedding generation for the GraphRAG memory.
v1.0.76: the default build is LLM-only — the binary does NOT bundle
fastembed / ort / ndarray / tokenizers. All embeddings are produced
by a headless invocation of claude code or codex (OAuth, no MCP,
no hooks) and stored as a BLOB in memory_embeddings(memory_id, embedding, source). Vector similarity is computed in pure Rust at query time.
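The storage and query path described above can be illustrated with a minimal sketch. It makes several assumptions that the text does not confirm: the BLOB is taken to be a flat sequence of little-endian f32 values, the similarity metric is taken to be cosine similarity, and the names `to_blob`, `from_blob`, and `cosine` are hypothetical helpers, not this module's API.

```rust
/// Encodes a vector as a byte blob.
/// Assumption: little-endian f32, 4 bytes per component.
fn to_blob(v: &[f32]) -> Vec<u8> {
    v.iter().flat_map(|x| x.to_le_bytes()).collect()
}

/// Decodes a blob produced by `to_blob`; trailing bytes that do not
/// form a full 4-byte group are ignored.
fn from_blob(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

/// Cosine similarity in plain Rust (assumed metric).
/// Returns 0.0 when either vector has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}
```

At query time, a caller would decode each stored blob with `from_blob` and rank rows by `cosine` against the query vector.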
§Workload classification (G42/S3, BLOCO 1, mandatory)
LLM embedding is I/O-bound + subprocess-bound: each call waits
5-60s on a network round-trip through a headless claude -p /
codex exec subprocess while the local CPU stays idle. Concurrency
therefore uses tokio (async I/O concurrency) and NEVER rayon
(reserved for CPU-bound work).
§Permit formula (G42/S3, BLOCO 2)
    permits = clamp(--llm-parallelism, 1, 32)
        .min(available_parallelism())
        .min(available_ram_mb * 0.5 / LLM_WORKER_RSS_MB)

LLM_WORKER_RSS_MB = 350 (crate::constants): claude -p and
codex exec are node processes with a typical maximum RSS of
200-400 MB (measured via /usr/bin/time -l on macOS /
/usr/bin/time -v on Linux), so the RAM bound is pertinent.
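The formula above can be written out as a small function. This is a sketch under stated assumptions, not the crate's `effective_permits`: the name `permits_sketch` is hypothetical, the core count and available RAM are passed in as plain arguments, the RAM term is floored by integer division, and the final `.max(1)` floor is an added assumption so that a machine with very little free RAM still gets one worker.

```rust
/// Worker RSS estimate from the text above (350 MB).
const LLM_WORKER_RSS_MB: u64 = 350;

/// Hypothetical sketch of the permit formula.
/// `requested` stands for the --llm-parallelism value, `cores` for the
/// result of available_parallelism(), `available_ram_mb` for free RAM.
fn permits_sketch(requested: usize, cores: usize, available_ram_mb: u64) -> usize {
    // clamp(--llm-parallelism, 1, 32)
    let by_flag = requested.clamp(1, 32);
    // available_ram_mb * 0.5 / LLM_WORKER_RSS_MB, floored
    let by_ram = (available_ram_mb / 2 / LLM_WORKER_RSS_MB) as usize;
    // Assumption: never return fewer than one permit.
    by_flag.min(cores).min(by_ram).max(1)
}
```

For example, with `--llm-parallelism 8`, 16 cores, and 2000 MB of free RAM, the RAM term is the binding one: half of 2000 MB fits two 350 MB workers.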
§Locking contract (G42/A3 fix)
The process-wide Mutex<LlmEmbedding> protects ONLY the cheap clone
of the client configuration (flavour + binary path + model + shared
schema tempfiles). It is NEVER held across network I/O — the
v1.0.76-v1.0.78 flush_group held it for the whole sequential
embedding loop, which is why --llm-parallelism 8 measured an
effective parallelism of 1.
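The contract above amounts to a "clone under the lock, work outside it" pattern. The following sketch uses only the standard library; `ClientConfig`, `snapshot_config`, and `embed_without_holding_lock` are hypothetical names, the struct is a simplified stand-in for `LlmEmbedding`, and the slow subprocess call is replaced by a string format.

```rust
use std::sync::Mutex;

/// Hypothetical, simplified stand-in for the client configuration.
#[derive(Clone, Debug, PartialEq)]
struct ClientConfig {
    flavour: String,
    binary_path: String,
    model: String,
}

/// Holds the mutex only long enough to clone the configuration.
fn snapshot_config(shared: &Mutex<ClientConfig>) -> ClientConfig {
    let guard = shared.lock().expect("config mutex poisoned");
    (*guard).clone()
    // `guard` is dropped here, so the lock is released on return.
}

/// Returns a label for the simulated call and whether the mutex was
/// free while the "slow" step ran.
fn embed_without_holding_lock(shared: &Mutex<ClientConfig>, text: &str) -> (String, bool) {
    let cfg = snapshot_config(shared);
    // Stand-in for the 5-60s subprocess round-trip: the lock is not held,
    // so other workers could take their own snapshot at this point.
    let lock_was_free = shared.try_lock().is_ok();
    let label = format!("{} {} {} {}", cfg.flavour, cfg.binary_path, cfg.model, text.len());
    (label, lock_was_free)
}
```

Holding the guard across the slow step instead would serialise every worker on the mutex, which matches the effective parallelism of 1 described above.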
Constants§
- CHUNK_EMBED_BATCH_SIZE - Calibration base: chunk (long-text) batch size per LLM call at the calibration dimensionality (G42/S2). Use chunk_embed_batch_size for the dim-adaptive value (G44).
- EMBED_BATCH_CALIBRATION_DIM - Dimensionality the batch bases above were calibrated against (G44).
- ENTITY_EMBED_BATCH_SIZE - Calibration base: entity-name (short-text) batch size per LLM call at the calibration dimensionality (G42/S2). Use entity_embed_batch_size for the dim-adaptive value (G44).
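The relationship between a calibration base and a dim-adaptive batch size is not spelled out on this page. One plausible reading, shown here purely as an assumption, is that the batch shrinks in inverse proportion to the active dimensionality so that the number of floats per LLM call stays roughly constant. The function name `scaled_batch_size` and the scaling rule itself are hypothetical; the crate's chunk_embed_batch_size and entity_embed_batch_size may compute something different.

```rust
/// Hypothetical dim-adaptive scaling: keeps `batch * dim` near
/// `base * calibration_dim`, and never returns less than 1.
fn scaled_batch_size(base: usize, calibration_dim: usize, dim: usize) -> usize {
    if dim == 0 {
        // Degenerate input: fall back to the base (at least 1).
        return base.max(1);
    }
    (base * calibration_dim / dim).max(1)
}
```

With a placeholder base of 8 calibrated at 64 dimensions, doubling the dimensionality to 128 would halve the batch to 4 under this rule.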
Functions§
- bytes_to_f32
- chunk_embed_batch_size - Dim-adaptive batch size for chunk (long-text) embedding calls (G44).
- effective_permits - G42/S3 BLOCO 2: effective permit count.
- embed_passage - Embeds a single passage for storage. Delegates to the configured headless LLM (claude code / codex). Returns a vector of the active dimensionality.
- embed_passage_local
- embed_passages_controlled - Embeds a batch of passages with token-count-aware batching.
- embed_passages_controlled_local
- embed_passages_parallel_local - G42/S3: embeds texts through the bounded parallel fan-out and returns vectors in input order.
- embed_query - Embeds a single query for similarity search. Same model and dim as embed_passage; the only difference is the LLM-side prompt prefix that the headless invocation uses to disambiguate.
- embed_query_local
- embed_texts_parallel - G42/S3 core: bounded parallel batch embedding.
- embed_texts_parallel_with - Like embed_texts_parallel but invokes on_result as soon as each embedding arrives (BLOCO 5, incremental persistence: a kill loses at most the in-flight batches, never the already-delivered items).
- embedding_dim - Returns the dimensionality of the embedding space. Used to validate LLM responses and to size the in-memory cache.
- entity_embed_batch_size - Dim-adaptive batch size for entity-name (short-text) embedding calls (G44).
- f32_to_bytes
- get_embedder - Initialises the LLM-embedding client on first use and returns it.
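The behaviour described for embed_texts_parallel and embed_texts_parallel_with (a bounded fan-out, results returned in input order, and a callback fired as each result arrives) can be sketched with the standard library alone. This is not the crate's implementation: the module uses tokio, whereas this sketch swaps in scoped OS threads and an mpsc channel so it runs without dependencies. `fake_embed` and `embed_in_order` are hypothetical names, and the fake embedding is just the text length.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for one slow embedding call.
fn fake_embed(text: &str) -> Vec<f32> {
    vec![text.len() as f32]
}

/// Runs at most `permits` workers, calls `on_result` as each vector
/// arrives, and returns the vectors in input order.
fn embed_in_order<F>(texts: &[&str], permits: usize, mut on_result: F) -> Vec<Vec<f32>>
where
    F: FnMut(usize, &[f32]),
{
    let next = AtomicUsize::new(0); // index of the next unclaimed text
    let (tx, rx) = mpsc::channel::<(usize, Vec<f32>)>();
    let mut out: Vec<Vec<f32>> = vec![Vec::new(); texts.len()];

    thread::scope(|scope| {
        for _ in 0..permits.max(1) {
            let tx = tx.clone();
            let next = &next;
            scope.spawn(move || loop {
                // Each worker claims the next index until none remain.
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= texts.len() {
                    break;
                }
                let _ = tx.send((i, fake_embed(texts[i])));
            });
        }
        // Drop the original sender so the receive loop ends once
        // every worker has finished.
        drop(tx);
        for (i, vector) in rx {
            on_result(i, &vector); // incremental delivery, arrival order
            out[i] = vector; // slot by index restores input order
        }
    });
    out
}
```

The callback sees results in arrival order, which is what makes incremental persistence possible, while the returned vector is indexed by input position.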