# symproj

Symbolic projection (embeddings) for Tekne. Maps discrete symbols to continuous vectors using a `Codebook`.

Naming note: this crate was previously named `proj`, but `proj` is already taken on crates.io by GeoRust's PROJ bindings (geospatial), so we publish it as `symproj`.
## Intuition First
Imagine a library where every book has a call number. The call number isn't just a label; it tells you where the book sits in space. `symproj` is the system that maps "book names" (tokens) to "library coordinates" (vectors).
## Provenance (minimal citations)

What this crate implements is a long-lived primitive: \[ (t_1,\dots,t_n)\mapsto \mathbb{R}^d \] via (1) embedding lookup (a codebook) and (2) pooling (mean).
- Word embeddings / lookup tables: Mikolov et al. (word2vec), 2013. arXiv:1301.3781
- Subword tokenization:
  - BPE for NMT: Sennrich et al., 2016. ACL P16-1162
  - SentencePiece / Unigram LM: Kudo, 2018. arXiv:1808.06226
- Sentence embeddings baseline: Arora et al. (SIF), 2017. ICLR (OpenReview)
- Modern sentence embedding fine-tuning:
  - SBERT: Reimers & Gurevych, 2019. ACL D19-1410
  - SimCSE: Gao et al., 2021. EMNLP 2021
- Retrieval context (token vectors + pooling/compression):
  - ColBERT (late interaction): Khattab & Zaharia, 2020. arXiv:2004.12832
## Nearby Rust ecosystem crates (context, not dependencies)

- `tokenizers` (Hugging Face tokenization): https://docs.rs/tokenizers/
- `sentencepiece` (SentencePiece model loading): https://crates.io/crates/sentencepiece
- `finalfusion` / `rust2vec` (word embedding formats): https://docs.rs/finalfusion/ / https://docs.rs/rust2vec/
- `fastembed` (embedding generation via ONNX): https://docs.rs/fastembed/
- `candle` (Rust ML runtime): https://github.com/huggingface/candle