§symproj
Symbolic projection and embeddings.
Maps discrete symbols to continuous vectors using a Codebook.
Naming note: this crate was previously named proj, but proj is already taken on crates.io
by GeoRust’s PROJ bindings (geospatial). We publish this crate as symproj.
§Intuition First
Imagine a library where every book has a call number. The call number
isn’t just a label; it tells you where the book sits in a 3D space.
symproj is the system that maps “book names” (tokens) to “library coordinates” (vectors).
§Provenance (minimal citations)
What this crate implements is the long-lived primitive: \[ (t_1,\dots,t_n)\mapsto \mathbb{R}^d \] via (1) embedding lookup (a codebook) and (2) pooling (mean).
- Word embeddings / lookup tables: Mikolov et al. (word2vec), 2013. arXiv:1301.3781
- Subword tokenization:
  - BPE for NMT: Sennrich et al., 2016. P16-1162
  - SentencePiece / Unigram LM: Kudo, 2018. arXiv:1808.06226
- Sentence embeddings baseline: Arora et al. (SIF), 2017. ICLR OpenReview
- Modern sentence embedding fine-tuning:
  - SBERT: Reimers & Gurevych, 2019. D19-1410
  - SimCSE: Gao et al., 2021. EMNLP 2021
- Retrieval context (token vectors + pooling/compression):
  - ColBERT (late interaction): Khattab & Zaharia, 2020. arXiv:2004.12832
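The lookup-then-pool primitive cited above can be sketched in a few lines. This is a minimal illustration, not the crate's implementation: `ToyCodebook` and `mean_pool` are hypothetical names, and skipping unknown token IDs is an assumption made here for simplicity.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a codebook: token ID -> dense vector.
/// This is NOT the crate's `Codebook` type; it only illustrates the idea.
struct ToyCodebook {
    dim: usize,
    vectors: HashMap<u32, Vec<f32>>,
}

impl ToyCodebook {
    fn new(dim: usize) -> Self {
        ToyCodebook { dim, vectors: HashMap::new() }
    }

    /// Register a vector for a token ID; all vectors must share `dim`.
    fn insert(&mut self, id: u32, v: Vec<f32>) {
        assert_eq!(v.len(), self.dim, "vector has wrong dimension");
        self.vectors.insert(id, v);
    }

    /// Step (1): embedding lookup.
    fn get(&self, id: u32) -> Option<&[f32]> {
        self.vectors.get(&id).map(|v| v.as_slice())
    }
}

/// Step (2): mean pooling over the vectors of all known tokens.
/// Unknown IDs are skipped (an assumption of this sketch).
/// Returns `None` when no token in `ids` has a vector.
fn mean_pool(codebook: &ToyCodebook, ids: &[u32]) -> Option<Vec<f32>> {
    let mut sum = vec![0.0f32; codebook.dim];
    let mut count = 0usize;
    for &id in ids {
        if let Some(v) = codebook.get(id) {
            for (s, x) in sum.iter_mut().zip(v) {
                *s += *x;
            }
            count += 1;
        }
    }
    if count == 0 {
        return None;
    }
    let inv = 1.0 / count as f32;
    for s in &mut sum {
        *s *= inv;
    }
    Some(sum)
}
```

With vectors `[1, 0]` and `[0, 1]` registered for IDs 0 and 1, pooling the sequence `[0, 1, 99]` ignores the unknown ID 99 and averages the two known vectors.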
§Nearby Rust ecosystem crates (context, not dependencies)
- tokenizers (Hugging Face tokenization): https://docs.rs/tokenizers/
- sentencepiece (SentencePiece model loading): https://crates.io/crates/sentencepiece
- finalfusion / rust2vec (word embedding formats): https://docs.rs/finalfusion/ / https://docs.rs/rust2vec/
- fastembed (embedding generation via ONNX): https://docs.rs/fastembed/
- candle (Rust ML runtime): https://github.com/huggingface/candle
Structs§
- Codebook
- A Codebook maps token IDs to dense vectors.
- Projection
- A Projection combines a Tokenizer and a Codebook.
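To make the tokenizer-plus-codebook composition concrete, here is a rough sketch under stated assumptions. `ToyTokenizer`, `WhitespaceTokenizer`, and `ToyProjection` are all hypothetical; the crate's real `Projection` and tokenizer types may be shaped quite differently. The table is a plain `Vec` indexed by token ID, and out-of-vocabulary words are dropped.

```rust
use std::collections::HashMap;

/// Hypothetical tokenizer interface: text in, token IDs out.
trait ToyTokenizer {
    fn encode(&self, text: &str) -> Vec<u32>;
}

/// Splits on whitespace and drops words missing from the vocabulary.
struct WhitespaceTokenizer {
    vocab: HashMap<String, u32>,
}

impl ToyTokenizer for WhitespaceTokenizer {
    fn encode(&self, text: &str) -> Vec<u32> {
        text.split_whitespace()
            .filter_map(|w| self.vocab.get(w).copied())
            .collect()
    }
}

/// Hypothetical projection: a tokenizer paired with a lookup table
/// in which row `i` holds the vector for token ID `i`.
struct ToyProjection<T: ToyTokenizer> {
    tokenizer: T,
    table: Vec<Vec<f32>>,
}

impl<T: ToyTokenizer> ToyProjection<T> {
    /// Tokenize `text`, then look up one vector per known token ID.
    fn token_vectors(&self, text: &str) -> Vec<&[f32]> {
        self.tokenizer
            .encode(text)
            .into_iter()
            .filter_map(|id| self.table.get(id as usize).map(|v| v.as_slice()))
            .collect()
    }
}
```

Keeping the tokenizer behind a trait lets the same lookup code work with any scheme that produces integer IDs, which is one plausible reason to separate the two pieces.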
Enums§
Functions§
- l2_normalize_in_place
- L2-normalize a vector in place.
- remove_component_in_place
- Remove a (unit) component direction \(u\) from a vector \(v\): \[ v \leftarrow v - u\,(u \cdot v) \]
- sif_weight
- SIF (Smooth Inverse Frequency) weight from Arora et al. (2017): \[ w(p) = \frac{a}{a + p} \] where \(p\) is token probability and \(a\) is a small smoothing constant (often \(10^{-3}\)).
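The three formulas above are small enough to write out directly. The functions below are standalone re-implementations for illustration only; the `_sketch` names, the `f32` slices, and the choice to leave a zero vector unchanged are assumptions, and the crate's actual signatures may differ.

```rust
/// Scale `v` to unit L2 norm in place.
/// A zero vector is left unchanged (an assumption of this sketch).
fn l2_normalize_sketch(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}

/// Apply v <- v - u (u . v), where `u` is assumed to have unit norm.
/// Afterwards `v` is orthogonal to `u`.
fn remove_component_sketch(v: &mut [f32], u: &[f32]) {
    assert_eq!(v.len(), u.len(), "dimension mismatch");
    let dot: f32 = v.iter().zip(u).map(|(a, b)| a * b).sum();
    for (a, b) in v.iter_mut().zip(u) {
        *a -= b * dot;
    }
}

/// SIF weight w(p) = a / (a + p): frequent tokens (large p) get small weights.
fn sif_weight_sketch(p: f32, a: f32) -> f32 {
    a / (a + p)
}
```

For example, normalizing `[3, 4]` yields `[0.6, 0.8]`, and removing the direction `[1, 0]` from `[2, 5]` leaves `[0, 5]`. With `a = 1e-3`, a token of probability `1e-3` gets weight 0.5, while a token of probability 0.1 gets roughly 0.01.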