Crate symproj

§symproj

Symbolic projection and embeddings.

Maps discrete symbols to continuous vectors using a Codebook.

Naming note: this crate was previously named proj. That name is already taken on crates.io by GeoRust’s PROJ bindings (geospatial), so the crate is published as symproj.

§Intuition First

Imagine a library where every book has a call number. The call number isn’t just a label; it tells you where the book sits in a 3D space. symproj is the system that maps “book names” (tokens) to “library coordinates” (vectors).

§Provenance (minimal citations)

This crate implements a long-lived primitive, mapping a token sequence to a single vector: \[ (t_1,\dots,t_n)\mapsto \mathbb{R}^d \] via (1) embedding lookup (a codebook) and (2) pooling (mean).

  • Word embeddings / lookup tables: Mikolov et al. (word2vec), 2013. arXiv:1301.3781
  • Subword tokenization:
  • Sentence embeddings baseline: Arora et al. (SIF), 2017. ICLR OpenReview
  • Modern sentence embedding fine-tuning:
  • Retrieval context (token vectors + pooling/compression):
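
The lookup-then-pool primitive described above can be sketched in a few lines. This is a minimal illustration, not the crate's actual API: the `HashMap` stands in for a codebook, and the function name `mean_pool`, its signature, and the policy of skipping unknown token IDs are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of symproj): look up each token ID in a
/// codebook-like map and average the vectors that were found.
///
/// Assumptions of this sketch:
/// - every stored vector has length `dim`;
/// - unknown token IDs are skipped rather than treated as errors;
/// - `None` is returned when no token could be looked up.
fn mean_pool(
    codebook: &HashMap<u32, Vec<f32>>,
    tokens: &[u32],
    dim: usize,
) -> Option<Vec<f32>> {
    let mut sum = vec![0.0f32; dim];
    let mut found = 0usize;

    for id in tokens {
        // Step (1): embedding lookup.
        if let Some(row) = codebook.get(id) {
            for (s, x) in sum.iter_mut().zip(row) {
                *s += *x;
            }
            found += 1;
        }
    }

    if found == 0 {
        return None;
    }

    // Step (2): mean pooling over the vectors that were found.
    let n = found as f32;
    for s in sum.iter_mut() {
        *s /= n;
    }
    Some(sum)
}
```

With a two-entry map holding `[1.0, 0.0]` for token 0 and `[0.0, 1.0]` for token 1, pooling the sequence `[0, 1]` gives `[0.5, 0.5]`.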

§Nearby Rust ecosystem crates (context, not dependencies)

Structs§

Codebook
A Codebook maps token IDs to dense vectors.
Projection
A Projection combines a Tokenizer and a Codebook.

Enums§

Error

Functions§

l2_normalize_in_place
L2-normalize a vector in place.
remove_component_in_place
Remove a (unit) component direction \(u\) from a vector \(v\): \[ v \leftarrow v - u\,(u \cdot v) \]
sif_weight
SIF (Smooth Inverse Frequency) weight from Arora et al. (2017): \[ w(p) = \frac{a}{a + p} \] where \(p\) is token probability and \(a\) is a small smoothing constant (often \(10^{-3}\)).
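
The formulas behind the three functions listed above can be sketched as follows. These are stand-alone illustrations, not the crate's implementations: the names `normalize`, `remove_direction`, and `sif`, the `f32` slice signatures, and the choice to leave a zero vector untouched are assumptions of this sketch.

```rust
/// Hypothetical stand-in for L2 normalization in place.
/// A zero vector is left unchanged (an assumption; the crate may differ).
fn normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}

/// Hypothetical stand-in for component removal: v <- v - u (u . v).
/// `u` is assumed to be unit length and the same length as `v`.
fn remove_direction(v: &mut [f32], u: &[f32]) {
    let dot: f32 = v.iter().zip(u).map(|(a, b)| a * b).sum();
    for (x, d) in v.iter_mut().zip(u) {
        *x -= d * dot;
    }
}

/// Hypothetical stand-in for the SIF weight: w(p) = a / (a + p).
/// `p` is the token probability, `a` the smoothing constant.
fn sif(p: f32, a: f32) -> f32 {
    a / (a + p)
}
```

For instance, normalizing `[3.0, 4.0]` yields approximately `[0.6, 0.8]`, removing the direction `[1.0, 0.0]` from `[1.0, 1.0]` leaves `[0.0, 1.0]`, and a token with probability equal to `a` gets weight 0.5. Rare tokens (small `p`) receive weights close to 1, while frequent tokens are down-weighted.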

Type Aliases§

Result