Module local_embed

Local ONNX embedding backend (all-MiniLM-L6-v2) driven directly through ort.

Replaces the fastembed crate. We own the ORT session so that we can cap intra-op threads: fastembed hardcoded with_intra_threads(all cores), which pegged every core during indexing (the sustained-CPU complaint). We cap at num_cpus / 2, which an earlier measurement showed to be both faster (1.7x) and far lighter (3.5x less CPU) than oversubscribing all cores.
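
The thread cap described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the function names are hypothetical, the core count is read with std::thread::available_parallelism (the module may use a different source such as the num_cpus crate), and the floor of one thread on single-core machines is an assumption.

```rust
use std::thread;

/// Half the logical cores, never less than one.
/// (Hypothetical helper; the floor of 1 is an assumption.)
fn capped_intra_threads(logical_cores: usize) -> usize {
    (logical_cores / 2).max(1)
}

/// Reads the logical core count from the standard library and applies the cap.
/// Falls back to a single core if the count cannot be determined.
fn default_intra_threads() -> usize {
    let cores = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    capped_intra_threads(cores)
}
```

The resulting value is what would be handed to the session builder's intra-op thread setting in place of the all-cores value that fastembed used.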

The pipeline reproduces fastembed’s MiniLM path byte-for-byte (verified: cosine 1.000000 vs fastembed across code + prose), so existing semantic indexes remain valid with no re-embed:

- tokenizer.json, truncation forced to max_length=512 (the Qdrant tokenizer ships an embedded max_length=128 that fastembed overrides), add_special_tokens=true
- ONNX inputs input_ids / attention_mask / token_type_ids (i64) → output last_hidden_state [batch, seq, dim]
- mean pool: sum(mask · tok, over seq) / max(sum(mask), 1)
- L2 normalize: v / (||v|| + 1e-12)
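
The last two steps, mean pooling and L2 normalization, can be sketched in plain Rust for a single sequence. This is an illustrative sketch only: the function names and the Vec-based layout are assumptions, and the real backend presumably operates on the ONNX output tensor rather than nested vectors.

```rust
/// Mean-pools one sequence.
/// `tokens` is [seq][dim] (one row of last_hidden_state per token),
/// `mask` is the attention mask [seq] with 1 for real tokens and 0 for padding.
fn mean_pool(tokens: &[Vec<f32>], mask: &[i64]) -> Vec<f32> {
    let dim = tokens.first().map_or(0, |t| t.len());
    let mut sum = vec![0.0f32; dim];
    let mut count = 0.0f32;
    for (tok, &m) in tokens.iter().zip(mask) {
        let m = m as f32;
        count += m;
        // Accumulate mask * token, so padded positions contribute nothing.
        for (s, &x) in sum.iter_mut().zip(tok) {
            *s += m * x;
        }
    }
    // max(sum(mask), 1) guards against division by zero on an all-zero mask.
    let denom = count.max(1.0);
    sum.iter().map(|s| s / denom).collect()
}

/// L2-normalizes a vector: v / (||v|| + 1e-12).
fn l2_normalize(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    v.iter().map(|x| x / (norm + 1e-12)).collect()
}
```

For example, pooling the rows [1, 2], [3, 4], [100, 100] with mask [1, 1, 0] ignores the padded third row and yields [2, 3]; normalizing [3, 4] yields approximately [0.6, 0.8].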

Structs

LocalEmbedder