Module embedder

Local embedding generation for the GraphRAG memory (LLM-only, one-shot per invocation).

v1.0.76: the default build is LLM-only — the binary does NOT bundle fastembed / ort / ndarray / tokenizers. All embeddings are produced by a headless invocation of claude code or codex (OAuth, no MCP, no hooks) and stored as a BLOB in memory_embeddings(memory_id, embedding, source). Vector similarity is computed in pure Rust at query time.
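The storage and query path described above can be sketched in plain Rust. This is a minimal illustration, not the crate's code: the `_sketch` helper names, the little-endian byte layout, and the `Option` return types are assumptions, and the real `f32_to_bytes` / `bytes_to_f32` signatures may differ.

```rust
/// Hypothetical helper: serialise an embedding as little-endian f32 bytes,
/// suitable for storing in a BLOB column. The byte order is an assumption.
fn f32_to_bytes_sketch(v: &[f32]) -> Vec<u8> {
    v.iter().flat_map(|x| x.to_le_bytes()).collect()
}

/// Hypothetical helper: decode a BLOB back into f32 values.
/// Returns None when the length is not a multiple of 4 bytes.
fn bytes_to_f32_sketch(b: &[u8]) -> Option<Vec<f32>> {
    if b.len() % 4 != 0 {
        return None;
    }
    Some(
        b.chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}

/// Cosine similarity in pure Rust.
/// Returns None on empty input, a dimension mismatch, or a zero-norm vector.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}
```

Rejecting mismatched dimensions explicitly, rather than silently truncating in `zip`, keeps a stale embedding of a different dimensionality from producing a plausible-looking score.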

§Workload classification (G42/S3, BLOCO 1, MANDATORY)

LLM embedding is I/O-bound + subprocess-bound: each call waits 5-60s on a network round-trip through a headless claude -p / codex exec subprocess while the local CPU stays idle. Concurrency therefore uses tokio (async I/O concurrency) and NEVER rayon (reserved for CPU-bound work).

§Permit formula (G42/S3, BLOCO 2)

permits = clamp(--llm-parallelism, 1, 32)
          .min(available_parallelism())
          .min(available_ram_mb * 0.5 / LLM_WORKER_RSS_MB)

LLM_WORKER_RSS_MB = 350 (crate::constants): claude -p and codex exec are Node.js processes with a typical maximum RSS of 200-400 MB (measured via /usr/bin/time -l on macOS or /usr/bin/time -v on Linux), so the RAM bound matters in practice and can be the limiting term on memory-constrained hosts.
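The permit formula can be written out as a small function. This is a sketch under stated assumptions, not the body of `effective_permits`: the function names are hypothetical, available RAM is passed in as a parameter because the source does not say how it is measured, and the final floor of one permit is an assumption added so that a low-RAM host still makes progress.

```rust
use std::thread::available_parallelism;

/// Value quoted in the module docs (crate::constants).
const LLM_WORKER_RSS_MB: u64 = 350;

/// Hypothetical sketch of the permit formula.
/// `requested` stands for the --llm-parallelism value.
fn permits_sketch(requested: usize, cores: usize, available_ram_mb: u64) -> usize {
    // Half of the available RAM, divided by the per-worker RSS estimate.
    let ram_bound = (available_ram_mb / 2 / LLM_WORKER_RSS_MB) as usize;
    requested
        .clamp(1, 32)
        .min(cores)
        .min(ram_bound)
        .max(1) // assumption: never drop below one permit
}

/// Same formula, with the core count read from the standard library.
fn permits_for_this_host(requested: usize, available_ram_mb: u64) -> usize {
    let cores = available_parallelism().map(|n| n.get()).unwrap_or(1);
    permits_sketch(requested, cores, available_ram_mb)
}
```

For example, with `--llm-parallelism 8`, 16 cores, and 4096 MB of available RAM, the RAM term is 4096 / 2 / 350 = 5 (integer division), so the sketch yields 5 permits rather than 8.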

§Locking contract (G42/A3 fix)

The process-wide Mutex<LlmEmbedding> protects ONLY the cheap clone of the client configuration (flavour + binary path + model + shared schema tempfiles). It is NEVER held across network I/O — the v1.0.76-v1.0.78 flush_group held it for the whole sequential embedding loop, which is why --llm-parallelism 8 measured an effective parallelism of 1.
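The locking contract above boils down to a clone-then-release pattern. The sketch below uses only the standard library (OS threads and `std::sync::Mutex`) as a stand-in for the crate's tokio-based code; `ClientConfigSketch`, `slow_embed_stub`, and the other names are hypothetical, and the sleep merely simulates the slow subprocess round-trip.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the cheap-to-clone client configuration.
#[derive(Clone)]
struct ClientConfigSketch {
    model: String,
}

/// Simulates the slow call; the real one waits on a headless subprocess.
fn slow_embed_stub(cfg: &ClientConfigSketch, text: &str) -> Vec<f32> {
    thread::sleep(Duration::from_millis(20));
    vec![text.len() as f32, cfg.model.len() as f32]
}

/// Holds the mutex only long enough to clone the configuration.
fn embed_without_holding_lock(shared: &Arc<Mutex<ClientConfigSketch>>, text: &str) -> Vec<f32> {
    let cfg = {
        let guard = shared.lock().expect("config mutex poisoned");
        guard.clone()
    }; // guard dropped here, before any slow work starts
    slow_embed_stub(&cfg, text)
}

/// Runs one worker per text; results are joined in input order.
fn embed_all(shared: Arc<Mutex<ClientConfigSketch>>, texts: Vec<String>) -> Vec<Vec<f32>> {
    let handles: Vec<_> = texts
        .into_iter()
        .map(|t| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || embed_without_holding_lock(&shared, &t))
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}
```

If the guard were instead kept alive across `slow_embed_stub`, every worker would queue on the mutex and the calls would run one at a time, which matches the effective parallelism of 1 described for the earlier `flush_group`.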

Constants§

CHUNK_EMBED_BATCH_SIZE
Calibration base: chunk (long-text) batch size per LLM call at the calibration dimensionality (G42/S2). Use chunk_embed_batch_size for the dim-adaptive value (G44).
EMBED_BATCH_CALIBRATION_DIM
Dimensionality the batch bases above were calibrated against (G44).
ENTITY_EMBED_BATCH_SIZE
Calibration base: entity-name (short-text) batch size per LLM call at the calibration dimensionality (G42/S2). Use entity_embed_batch_size for the dim-adaptive value (G44).
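One plausible way to derive a dim-adaptive batch size from a calibration base is inverse scaling with the dimensionality. The source does not state the actual rule used by `chunk_embed_batch_size` or `entity_embed_batch_size`, so the formula, the function name, and the numbers in the usage note below are all assumptions for illustration only.

```rust
/// Hypothetical scaling rule: shrink the batch in proportion to how much
/// larger the active dimensionality is than the calibration one, and grow
/// it when the active dimensionality is smaller. Never returns zero.
fn dim_adaptive_batch_size_sketch(base: usize, calibration_dim: usize, dim: usize) -> usize {
    if dim == 0 {
        // Assumption: treat an unknown dimensionality as the most
        // conservative case and send one text per call.
        return 1;
    }
    (base * calibration_dim / dim).max(1)
}
```

Under this assumed rule, a base of 8 calibrated at 1024 dimensions would give 4 texts per call at 2048 dimensions and 16 at 512, keeping the size of each LLM response roughly constant.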

Functions§

bytes_to_f32
chunk_embed_batch_size
Dim-adaptive batch size for chunk (long-text) embedding calls (G44).
effective_permits
G42/S3 BLOCO 2: effective permit count.
embed_passage
Embeds a single passage for storage. Delegates to the configured headless LLM (claude code / codex). Returns a vector of the active dimensionality.
embed_passage_local
embed_passages_controlled
Embeds a batch of passages with token-count-aware batching.
embed_passages_controlled_local
embed_passages_parallel_local
G42/S3: embeds texts through the bounded parallel fan-out and returns vectors in input order.
embed_query
Embeds a single query for similarity search. Uses the same model and dimensionality as embed_passage; the only difference is the LLM-side prompt prefix, which the headless invocation uses to tell queries apart from passages.
embed_query_local
embed_texts_parallel
G42/S3 core: bounded parallel batch embedding.
embed_texts_parallel_with
Like embed_texts_parallel but invokes on_result as soon as each embedding arrives (BLOCO 5: incremental persistence — a kill loses at most the in-flight batches, never the already-delivered items).
embedding_dim
Returns the dimensionality of the embedding space. Used to validate LLM responses and to size the in-memory cache.
entity_embed_batch_size
Dim-adaptive batch size for entity-name (short-text) embedding calls (G44).
f32_to_bytes
get_embedder
Initialises the LLM-embedding client on first use and returns it.
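The behaviour documented for embed_texts_parallel_with (a callback fired as each embedding arrives, with the final vectors returned in input order) can be sketched with a channel that carries each result together with its input index. This is a standard-library illustration only: the names are hypothetical, `embed_stub` replaces the real LLM call, and the sketch spawns one thread per text instead of using the crate's tokio tasks and permit bound.

```rust
use std::sync::mpsc;
use std::thread;

/// Stand-in for the real embedding call.
fn embed_stub(text: &str) -> Vec<f32> {
    vec![text.len() as f32]
}

/// Hypothetical sketch: calls `on_result(index, embedding)` as soon as each
/// result arrives, then returns all embeddings in input order.
fn embed_in_order_with<F>(texts: Vec<String>, mut on_result: F) -> Vec<Vec<f32>>
where
    F: FnMut(usize, &[f32]),
{
    let n = texts.len();
    let (tx, rx) = mpsc::channel::<(usize, Vec<f32>)>();
    let mut handles = Vec::with_capacity(n);
    for (i, text) in texts.into_iter().enumerate() {
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            let v = embed_stub(&text);
            // The receiver outlives all workers, so a send error is ignored.
            let _ = tx.send((i, v));
        }));
    }
    drop(tx); // the receive loop ends once every worker's sender is gone

    let mut slots: Vec<Option<Vec<f32>>> = vec![None; n];
    for (i, v) in rx {
        on_result(i, &v); // arrival order; a caller could persist here
        slots[i] = Some(v); // the index restores input order
    }
    for h in handles {
        h.join().expect("worker panicked");
    }
    slots
        .into_iter()
        .map(|s| s.expect("every index is delivered exactly once"))
        .collect()
}
```

Because the callback runs before the function returns, anything it persists survives an interruption, while results still in flight are the only ones lost.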