Module embedder

Pluggable embedding backend for the semantic axis.

The default TfIdfEmbedder implements smoothed TF-IDF cosine over the corpus of texts being compared: production-quality for lexical similarity, but blind to paraphrase (“yes” vs “I agree” score 0, since they share no terms).
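
That zero follows directly from the vector-space picture: texts sharing no terms occupy disjoint dimensions, so their dot product, and hence their cosine, is exactly 0 under any TF-IDF weighting. A toy illustration with invented weights:

```rust
// Toy TF-IDF vectors over the vocabulary ["yes", "i", "agree"];
// the weights here are invented purely for illustration.
let yes     = [1.0_f32, 0.0, 0.0]; // "yes"
let i_agree = [0.0_f32, 0.7, 0.7]; // "I agree"

// Disjoint non-zero dimensions: dot product 0, hence cosine 0.
let dot: f32 = yes.iter().zip(&i_agree).map(|(a, b)| a * b).sum();
assert_eq!(dot, 0.0);
```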

For paraphrase-robust similarity, callers supply an Embedder that produces dense vectors per text. The crate stays free of heavy ML dependencies; users bring their own embedding source via:

  • BoxedEmbedder::new(|texts| { ... }): a closure returning Vec<Vec<f32>> for any external source (ONNX runtime, HF Inference API, OpenAI embeddings, in-house service, …).
  • A direct impl of Embedder for stateful adapters that need to hold model handles, HTTP clients, or tokenizer state. Both forms are sketched below.
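
A minimal sketch of both forms, assuming the module path shadow::embedder, an embed method whose signature matches what BoxedEmbedder wraps, and a hypothetical call_external_service helper (none of these are confirmed by this page; check the Embedder trait docs for the real signature):

```rust
use shadow::embedder::{BoxedEmbedder, Embedder}; // module path assumed

/// Hypothetical stand-in for an external embedding source (ONNX session,
/// HTTP client, hosted API, ...); returns fixed-size placeholder vectors.
fn call_external_service(texts: &[&str]) -> Vec<Vec<f32>> {
    texts.iter().map(|_| vec![0.0_f32; 384]).collect()
}

// Closure form: BoxedEmbedder adapts any Fn(&[&str]) -> Vec<Vec<f32>>.
let boxed = BoxedEmbedder::new(|texts: &[&str]| call_external_service(texts));

// Stateful form: a direct impl that owns whatever state it needs.
struct ServiceEmbedder {
    endpoint: String, // an HTTP client or model handle would live here
}

impl Embedder for ServiceEmbedder {
    // Assumed method name and signature.
    fn embed(&self, texts: &[&str]) -> Vec<Vec<f32>> {
        let _ = &self.endpoint; // a real adapter would call its service here
        call_external_service(texts)
    }
}
```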

Cross-language consistency: the cosine similarity computation happens in Rust regardless of where vectors come from. As long as two embedders produce comparable vectors (same dimensionality, similar magnitudes), their semantic-axis output stays meaningful.
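
To make that invariant concrete, here is a standalone reimplementation of what cosine computes (the crate’s cosine function, listed under Functions below, is the one actually used; its exact signature is an assumption here):

```rust
/// Illustrative cosine: dot(a, b) / (|a| * |b|); mirrors the crate's
/// `cosine`, whose exact signature is assumed for this sketch.
fn cosine_demo(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must be equal length");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
    let (na, nb) = (norm(a), norm(b));
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

// Cosine depends only on the angle between vectors, so two embedders
// whose outputs differ by a uniform scale factor score identically.
let a = [0.1_f32, 0.2, 0.3];
let b = [0.2_f32, 0.4, 0.6]; // `a` scaled by 2
assert!((cosine_demo(&a, &b) - 1.0).abs() < 1e-6);
```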

§Why no built-in ONNX backend

Bundling ort + tokenizers + a real embedding model would either blow past PyPI’s per-wheel size limit (~100 MB) or force users to download the model file on first use; either way creates friction for the 99% of Shadow users whose semantic-axis needs are already met by TF-IDF over response text. The trait keeps the door open for users with paraphrase-heavy workloads to plug in whatever embedding source they already run, without forcing that cost onto the default install.

Structs§

BoxedEmbedder
Adapter that wraps any Fn(&[&str]) -> Vec<Vec<f32>> closure into an Embedder.

Traits§

Embedder
A backend that produces dense embedding vectors for a slice of input texts.

Functions§

cosine
Cosine similarity between two equal-length dense vectors.