Pluggable embedding backend for the semantic axis.
The default TfIdfEmbedder implements smoothed TF-IDF cosine over
the corpus of texts being compared — production-quality for lexical
similarity, but blind to paraphrase (“yes” vs. “I agree” scores 0).
For paraphrase-robust similarity, callers supply an Embedder
that produces dense vectors per text. The crate stays free of heavy
ML dependencies; users bring their own embedding source via:
- BoxedEmbedder::new(|texts| { ... }) — a closure returning Vec<Vec<f32>> for any external source (ONNX runtime, HF Inference API, OpenAI embeddings, in-house service, …); see the sketch after this list.
- A direct impl of Embedder for stateful adapters that need to hold model handles, HTTP clients, or tokenizer state.
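A minimal sketch of both paths, assuming BoxedEmbedder and Embedder are in scope from this module, that the trait exposes an embed method with the signature used below (this page does not show the trait's method; check its definition), and using a hypothetical fetch_vector helper in place of a real external source:

```rust
// Hypothetical stand-in for an external embedding source (ONNX session,
// HF Inference API, OpenAI embeddings, in-house service, ...). Returns a
// fixed-dimension dummy vector so the sketch is self-contained.
fn fetch_vector(text: &str) -> Vec<f32> {
    vec![text.len() as f32, 1.0, 0.0]
}

// 1. Closure adapter: wrap any `Fn(&[&str]) -> Vec<Vec<f32>>`.
let embedder = BoxedEmbedder::new(|texts: &[&str]| {
    texts.iter().map(|t| fetch_vector(t)).collect::<Vec<Vec<f32>>>()
});

// 2. Direct impl: for adapters that hold state (HTTP client, model
//    handle, tokenizer) across calls. The `embed` method name below is
//    an assumption, not confirmed by this page.
struct ServiceEmbedder {
    endpoint: String, // e.g. an in-house embedding service URL
}

impl Embedder for ServiceEmbedder {
    fn embed(&self, texts: &[&str]) -> Vec<Vec<f32>> {
        // A real adapter would send `texts` to `self.endpoint`;
        // stubbed here to keep the sketch runnable.
        texts.iter().map(|t| fetch_vector(t)).collect()
    }
}
```

The closure form is the quickest route for stateless sources; the direct impl fits when a model handle or client must outlive individual calls.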
Cross-language consistency: the cosine similarity computation happens in Rust regardless of where vectors come from. As long as two embedders produce comparable vectors (same dimensionality, similar magnitudes), their semantic-axis output stays meaningful.
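As an illustration, assuming the cosine function listed below takes two &[f32] slices (its exact signature is not shown on this page):

```rust
// Vectors from two different sources are interchangeable on the
// semantic axis as long as they agree on dimensionality.
let from_onnx: Vec<f32> = vec![0.12, 0.80, 0.33]; // e.g. a local ONNX model
let from_api: Vec<f32> = vec![0.10, 0.79, 0.35];  // e.g. a hosted API
assert_eq!(from_onnx.len(), from_api.len()); // same dimensionality
let sim = cosine(&from_onnx, &from_api);     // computed in Rust either way
assert!(sim > 0.99); // near-identical directions score near 1.0
```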
§Why no built-in ONNX backend
Bundling ort + tokenizers + a real embedding model would
either blow past PyPI’s per-wheel size limit (~100 MB) or force
users to download the model file on first use — both create
friction for the 99% of Shadow users whose semantic-axis needs
are already met by TF-IDF over response text. The trait keeps the
door open for users with paraphrase-heavy workloads to plug in
whatever embedding source they already run, without forcing the
cost on the default install.
Structs§
- BoxedEmbedder - Adapter that wraps any Fn(&[&str]) -> Vec<Vec<f32>> closure into an Embedder.
Traits§
- Embedder - A backend that produces dense embedding vectors for a slice of input texts.
Functions§
- cosine - Cosine similarity between two equal-length dense vectors.
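For orientation, cosine similarity over equal-length vectors is dot(a, b) / (‖a‖ · ‖b‖). A self-contained sketch of that computation, assuming the same &[f32] signature as above; this is an illustration, not the crate's implementation:

```rust
// Reference sketch of cosine similarity over equal-length dense vectors.
// Returning 0.0 for a zero vector is an assumption made here; the
// crate's `cosine` may handle that case differently.
fn cosine_sketch(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must be equal length");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}
```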