Local ONNX embedding backend (all-MiniLM-L6-v2) driven directly through
ort.
Replaces the fastembed crate. We own the ORT session so that we can cap
intra-op threads: fastembed hardcoded with_intra_threads(all cores),
which pegged every core during indexing (the sustained-CPU complaint). We
cap at num_cpus / 2, which an earlier measurement showed to be both faster
(1.7x) and far lighter (3.5x less CPU) than oversubscribing all cores.
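The thread cap can be sketched as follows. This is a minimal illustration, not the module's actual code: the helper names `capped_intra_threads` and `default_intra_threads` are hypothetical, the floor of one thread is an assumption added so that single-core machines still get a worker, and the core count here comes from the standard library rather than from whatever the module really uses.

```rust
use std::thread;

/// Hypothetical helper: half the logical cores (the num_cpus / 2 rule),
/// with an assumed floor of 1 so the result is never zero.
fn capped_intra_threads(logical_cores: usize) -> usize {
    (logical_cores / 2).max(1)
}

/// Hypothetical helper: reads the core count from the standard library and
/// falls back to 1 core if the platform cannot report it.
fn default_intra_threads() -> usize {
    let cores = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    capped_intra_threads(cores)
}
```

The resulting value is what would be handed to the session builder's with_intra_threads setting in place of the all-cores value that fastembed used.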
The pipeline reproduces fastembed’s MiniLM path byte-for-byte (verified: cosine 1.000000 vs fastembed across code + prose), so existing semantic indexes remain valid with no re-embed:
- tokenizer.json, truncation forced to max_length=512 (the Qdrant tokenizer ships an embedded max_length=128 that fastembed overrides), add_special_tokens=true
- ONNX inputs input_ids / attention_mask / token_type_ids (i64) → output last_hidden_state [batch, seq, dim]
- mean pool: sum(mask · tok, over seq) / max(sum(mask), 1)
- L2 normalize: v / (||v|| + 1e-12)
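
The last two steps of the list above (mean pooling and L2 normalization) can be sketched for a single sequence as below. This is an illustrative sketch under stated assumptions, not the module's implementation: the function names `mean_pool` and `l2_normalize` are hypothetical, `hidden` is assumed to be one row of last_hidden_state flattened row-major as [seq * dim], and `dim` is assumed to be non-zero.

```rust
/// Hypothetical helper: masked mean over the sequence axis.
/// `hidden` holds seq * dim values (row-major), `mask` holds seq entries (0 or 1).
/// Computes sum(mask * tok, over seq) / max(sum(mask), 1).
fn mean_pool(hidden: &[f32], mask: &[i64], dim: usize) -> Vec<f32> {
    let mut sum = vec![0.0f32; dim];
    let mut count = 0.0f32;
    for (tok, &m) in hidden.chunks_exact(dim).zip(mask) {
        let w = m as f32;
        count += w;
        for (s, &x) in sum.iter_mut().zip(tok) {
            *s += w * x;
        }
    }
    // Guard against an all-zero mask, as in max(sum(mask), 1).
    let denom = count.max(1.0);
    sum.iter().map(|s| s / denom).collect()
}

/// Hypothetical helper: v / (||v|| + 1e-12).
fn l2_normalize(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    v.iter().map(|x| x / (norm + 1e-12)).collect()
}
```

Padding tokens carry a mask of 0, so they contribute to neither the sum nor the count; the epsilon in the normalization only matters for an all-zero vector, where it avoids a division by zero.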