tf-idf-vectorizer-0.7.8 has been yanked.
Supports the full pipeline: corpus construction → TF calculation → IDF calculation → TF‑IDF vectorization / similarity search.
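As a rough illustration of what the pipeline stages compute, here is a minimal standalone sketch that does not use this crate's API. The function names and the smoothed IDF formula `ln((1 + N) / (1 + df)) + 1` are assumptions chosen for the example; the crate's engine may weight terms differently.

```rust
use std::collections::HashMap;

/// Term frequency of each token within one document (count / document length).
fn term_frequency(tokens: &[&str]) -> HashMap<String, f64> {
    let mut counts: HashMap<String, f64> = HashMap::new();
    for t in tokens {
        *counts.entry(t.to_string()).or_insert(0.0) += 1.0;
    }
    let len = tokens.len().max(1) as f64;
    for v in counts.values_mut() {
        *v /= len;
    }
    counts
}

/// Smoothed inverse document frequency over a corpus of tokenized documents.
/// Assumed formula: ln((1 + N) / (1 + df)) + 1.
fn inverse_document_frequency(docs: &[Vec<&str>]) -> HashMap<String, f64> {
    let n = docs.len() as f64;
    let mut df: HashMap<String, f64> = HashMap::new();
    for doc in docs {
        // Count each token at most once per document.
        let mut seen: Vec<&str> = doc.clone();
        seen.sort_unstable();
        seen.dedup();
        for t in seen {
            *df.entry(t.to_string()).or_insert(0.0) += 1.0;
        }
    }
    df.into_iter()
        .map(|(t, d)| (t, ((1.0 + n) / (1.0 + d)).ln() + 1.0))
        .collect()
}

/// TF-IDF weights for one document: tf(t) * idf(t).
/// Tokens missing from the IDF table get weight 0.
fn tf_idf(tokens: &[&str], idf: &HashMap<String, f64>) -> HashMap<String, f64> {
    term_frequency(tokens)
        .into_iter()
        .map(|(t, tf)| {
            let w = tf * idf.get(&t).copied().unwrap_or(0.0);
            (t, w)
        })
        .collect()
}
```

A token that occurs in every document gets the lowest IDF, so rarer tokens dominate the resulting vector.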
Features
- Engine with generic parameters (f32 / f64 / unsigned integers, etc.) and quantization support
- All core structs are serializable / deserializable (TFIDFData) for persistence
- Similarity utilities (SimilarityAlgorithm, Hits) for search use cases
- No index-building step: immediate add/remove, real-time updates
- Thread-safe
- Corpus information separated and replaceable with respect to the index
- Restorability — retains document statistics
Setup
Cargo.toml
[dependencies]
tf-idf-vectorizer = "0.7" # This README is targeted at `v0.7.x`
Basic usage
use std::sync::Arc;
use ;
Serialization / Restoration
TFIDFVectorizer contains references and cannot be deserialized directly.
On serialization it is converted to TFIDFData; to restore, call into_tf_idf_vectorizer(&Corpus) on the deserialized TFIDFData.
The corpus provided at restoration can be any corpus; terms not present in the corpus index will be ignored.
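// Save (assumes serde_json as the serde format and an existing `vectorizer`)
let dump = serde_json::to_string(&vectorizer)?;
// Restore (`corpus` is the Corpus to attach the restored vectorizer to)
let data: TFIDFData = serde_json::from_str(&dump)?;
let restored = data.into_tf_idf_vectorizer(&corpus);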
Similarity search (concept)
- Vectorize input tokens into a query vector (SimilarityAlgorithm)
- Compare with each document using dot product / cosine, etc.
- Return all results as Hits
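The steps above can be sketched in plain Rust without the crate. This is only a conceptual illustration: `SparseVec`, `dot`, `cosine`, and `rank` are hypothetical names, and the crate's SimilarityAlgorithm and Hits types will have their own interfaces.

```rust
use std::collections::HashMap;

/// Sparse vector: dimension index -> weight (zero entries are simply absent).
type SparseVec = HashMap<usize, f64>;

/// Dot product over the dimensions both vectors share.
fn dot(a: &SparseVec, b: &SparseVec) -> f64 {
    // Iterate over the smaller map and look up in the larger one.
    let (small, large) = if a.len() <= b.len() { (a, b) } else { (b, a) };
    small
        .iter()
        .filter_map(|(dim, w)| large.get(dim).map(|v| w * v))
        .sum()
}

/// Cosine similarity; returns 0.0 when either vector is all zeros.
fn cosine(a: &SparseVec, b: &SparseVec) -> f64 {
    let norm = (dot(a, a) * dot(b, b)).sqrt();
    if norm == 0.0 {
        0.0
    } else {
        dot(a, b) / norm
    }
}

/// Scores every document against the query and returns (document index, score)
/// pairs, highest score first.
fn rank(query: &SparseVec, docs: &[SparseVec]) -> Vec<(usize, f64)> {
    let mut hits: Vec<(usize, f64)> = docs
        .iter()
        .enumerate()
        .map(|(i, d)| (i, cosine(query, d)))
        .collect();
    hits.sort_by(|a, b| b.1.total_cmp(&a.1));
    hits
}
```

Because every document is scored, the result has one entry per document, including those with a score of zero.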
Performance guidelines
- Cache token dictionaries (token_dim_sample / token_dim_set) to avoid reconstruction
- Sparse TF to omit zeros
- Integer scaled types (u16/u32) reduce memory use; their values are normalized by multiplying by 1/max, so floating-point types are slightly faster to compute with
- Generate reverse index immediately
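To make the integer-scaling trade-off concrete, the following sketch stores weights as u16 and restores them with a single multiplication by a precomputed factor. The function names and the choice to scale the largest weight to `u16::MAX` are assumptions for illustration, not the crate's internal scheme.

```rust
/// Quantizes non-negative weights to u16 so that the largest weight maps to
/// u16::MAX. Returns the quantized values and the original maximum.
fn quantize_u16(weights: &[f32]) -> (Vec<u16>, f32) {
    let max = weights.iter().copied().fold(0.0_f32, f32::max);
    if max == 0.0 {
        return (vec![0; weights.len()], 0.0);
    }
    let scale = u16::MAX as f32 / max;
    let q = weights.iter().map(|w| (w * scale).round() as u16).collect();
    (q, max)
}

/// Restores approximate f32 weights from the quantized values.
fn dequantize_u16(q: &[u16], max: f32) -> Vec<f32> {
    // Computed once, so decoding each value is a multiply rather than a divide.
    let inv = max / u16::MAX as f32;
    q.iter().map(|&v| v as f32 * inv).collect()
}
```

Each stored value takes 2 bytes instead of 4, at the cost of a small rounding error and the extra multiplication on every read.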
Type overview
| Type | Role |
|---|---|
| Corpus | Document collection metadata / frequency lookup |
| TokenFrequency | Token frequency within a single document |
| TFVector | TF sparse vector for a single document |
| IDFVector | Global IDF and metadata |
| TFIDFVectorizer | TF/IDF management and search entry point |
| TFIDFData | Intermediate form for serialization |
| DefaultTFIDFEngine | TF/IDF computation backend |
| SimilarityAlgorithm / Hits | Search query and results |
Customization
- Switch numeric types (f32/f64/u16/u32, etc.)
- Replace TFIDFEngine to experiment with different weighting schemes
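The idea of swapping weighting schemes can be illustrated with a small trait. `IdfScheme`, `SmoothIdf`, `ProbabilisticIdf`, and `weight` below are hypothetical names invented for this sketch; the actual TFIDFEngine trait in the crate may have a different shape, so consult the crate documentation for its real signature.

```rust
/// Hypothetical trait for illustration only; not the crate's TFIDFEngine.
trait IdfScheme {
    /// `n_docs`: total number of documents, `df`: documents containing the term.
    fn idf(&self, n_docs: u64, df: u64) -> f64;
}

/// Smoothed IDF: ln((1 + N) / (1 + df)) + 1.
struct SmoothIdf;

/// Probabilistic IDF: ln((N - df) / df), clamped at zero.
struct ProbabilisticIdf;

impl IdfScheme for SmoothIdf {
    fn idf(&self, n_docs: u64, df: u64) -> f64 {
        ((1.0 + n_docs as f64) / (1.0 + df as f64)).ln() + 1.0
    }
}

impl IdfScheme for ProbabilisticIdf {
    fn idf(&self, n_docs: u64, df: u64) -> f64 {
        // Terms in no document or in every document carry no signal here.
        if df == 0 || df >= n_docs {
            return 0.0;
        }
        ((n_docs - df) as f64 / df as f64).ln().max(0.0)
    }
}

/// Combines a term frequency with whichever scheme is plugged in.
fn weight<S: IdfScheme>(scheme: &S, tf: f64, n_docs: u64, df: u64) -> f64 {
    tf * scheme.idf(n_docs, df)
}
```

Keeping the scheme behind a trait means the surrounding vectorization code stays unchanged while the weighting formula varies.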
Examples (examples/)
Run the minimal example with cargo run --example basic.
Contributions via pull requests are welcome.