tf-idf-vectorizer-0.6.3 has been yanked.
Supports everything from corpus construction → TF calculation → IDF calculation → TF-IDF vectorization / similarity search.
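As a rough illustration of what that pipeline computes, here is a minimal std-only sketch. It does not use this crate's API: the function names (term_frequency, inverse_document_frequency, tf_idf) and the smoothed IDF formula ln((1 + N) / (1 + df)) + 1 are assumptions chosen for the example, and the crate's engine may weight terms differently.

```rust
use std::collections::HashMap;

/// Term frequency of each token in one document (count / document length).
/// Hypothetical helper, not part of the crate.
fn term_frequency(tokens: &[&str]) -> HashMap<String, f64> {
    let mut counts: HashMap<String, f64> = HashMap::new();
    for t in tokens {
        *counts.entry(t.to_string()).or_insert(0.0) += 1.0;
    }
    let len = tokens.len().max(1) as f64;
    counts.values_mut().for_each(|c| *c /= len);
    counts
}

/// Smoothed IDF over a corpus: ln((1 + N) / (1 + df)) + 1 (assumed formula).
fn inverse_document_frequency(docs: &[Vec<&str>]) -> HashMap<String, f64> {
    let n = docs.len() as f64;
    let mut df: HashMap<String, f64> = HashMap::new();
    for doc in docs {
        // Count each token at most once per document.
        let mut seen: Vec<&str> = doc.clone();
        seen.sort_unstable();
        seen.dedup();
        for t in seen {
            *df.entry(t.to_string()).or_insert(0.0) += 1.0;
        }
    }
    df.into_iter()
        .map(|(t, d)| (t, ((1.0 + n) / (1.0 + d)).ln() + 1.0))
        .collect()
}

/// Sparse TF-IDF vector for one document; tokens missing from the IDF table are skipped.
fn tf_idf(tokens: &[&str], idf: &HashMap<String, f64>) -> HashMap<String, f64> {
    term_frequency(tokens)
        .into_iter()
        .filter_map(|(t, tf)| {
            let w = *idf.get(&t)?;
            Some((t, tf * w))
        })
        .collect()
}
```

Calling inverse_document_frequency once over all documents and then tf_idf per document yields one sparse vector per document; a token that occurs in every document gets the lowest weight.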
Features
- Generic parameter engine (f32 / f64 / unsigned integer quantization)
- Full struct serialization / deserialization (TFIDFData) for persistence
- Similarity calculation utilities (SimilarityAlgorithm, Hits) for search
- No index build step: instant add/remove, real-time operation
- Thread-safe
- Corpus info separated: index can be swapped independently
- Restorable: keeps document statistics
Setup
Cargo.toml
[dependencies]
tf-idf-vectorizer = "0.4.3" # This README is for v0.4.x
Basic Usage
use ;
use ;
Serialization / Restoration
TFIDFVectorizer contains references and cannot be deserialized directly.
Serialize as TFIDFData, and restore with into_tf_idf_vectorizer(Arc<Corpus>).
You can use any corpus for restoration; if the index contains tokens not in the corpus, they are ignored.
// Save
let dump = to_string?;
// Restore
let data: TFIDFData = from_str?;
let restored = data.into_tf_idf_vectorizer;
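The save/restore pattern described above can be sketched with std types only. Everything in this block is hypothetical (Corpus, IndexData, Index, and into_index are stand-ins, not the crate's TFIDFData or TFIDFVectorizer); it only shows why a plain-data snapshot is re-attached to a shared corpus handle, and how tokens unknown to that corpus can be dropped during restoration.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for a shared corpus: token -> document frequency.
struct Corpus {
    doc_freq: HashMap<String, u32>,
}

/// Hypothetical plain-data snapshot. It owns all of its fields, so in a real
/// project it could derive serialization traits.
struct IndexData {
    docs: Vec<(String, HashMap<String, u32>)>, // (document id, token counts)
}

/// Hypothetical live index. It holds a shared handle to the corpus, which is
/// why it is rebuilt from a snapshot rather than deserialized directly.
struct Index {
    corpus: Arc<Corpus>,
    docs: Vec<(String, HashMap<String, u32>)>,
}

impl IndexData {
    /// Rebuild a live index against any corpus.
    /// Tokens the corpus does not know are dropped.
    fn into_index(self, corpus: Arc<Corpus>) -> Index {
        let docs: Vec<(String, HashMap<String, u32>)> = self
            .docs
            .into_iter()
            .map(|(id, counts)| {
                let kept: HashMap<String, u32> = counts
                    .into_iter()
                    .filter(|(token, _)| corpus.doc_freq.contains_key(token))
                    .collect();
                (id, kept)
            })
            .collect();
        Index { corpus, docs }
    }
}
```

Because the snapshot carries no corpus of its own, the same saved data can be attached to a different corpus at restore time.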
Similarity Search (Concept)
- Convert input tokens to query vector (SimilarityAlgorithm)
- Compare with each document using dot product / cosine similarity / bm25, etc.
- Return all results as Hits
You can inject your own scoring function by implementing the Compare trait and using it in place of DefaultCompare.
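To show how a pluggable scoring function changes the ranking, here is a minimal std-only sketch. The Scorer trait, the DotProduct and Cosine structs, and the search function are hypothetical names for this example; they are not the crate's Compare trait, DefaultCompare, or Hits, whose signatures are not reproduced here.

```rust
use std::collections::HashMap;

/// Sparse vector: token id -> weight (zeros are simply absent).
type SparseVec = HashMap<u32, f64>;

/// Hypothetical scoring interface for this sketch.
trait Scorer {
    fn score(&self, query: &SparseVec, doc: &SparseVec) -> f64;
}

struct DotProduct;
struct Cosine;

/// Sparse dot product: iterate over the smaller map and look up in the larger one.
fn dot(a: &SparseVec, b: &SparseVec) -> f64 {
    let (small, large) = if a.len() <= b.len() { (a, b) } else { (b, a) };
    small
        .iter()
        .filter_map(|(k, x)| large.get(k).map(|y| x * y))
        .sum()
}

impl Scorer for DotProduct {
    fn score(&self, query: &SparseVec, doc: &SparseVec) -> f64 {
        dot(query, doc)
    }
}

impl Scorer for Cosine {
    fn score(&self, query: &SparseVec, doc: &SparseVec) -> f64 {
        let denom = dot(query, query).sqrt() * dot(doc, doc).sqrt();
        if denom == 0.0 { 0.0 } else { dot(query, doc) / denom }
    }
}

/// Score every document and return (document index, score), best first.
fn search(scorer: &dyn Scorer, query: &SparseVec, docs: &[SparseVec]) -> Vec<(usize, f64)> {
    let mut hits: Vec<(usize, f64)> = docs
        .iter()
        .enumerate()
        .map(|(i, d)| (i, scorer.score(query, d)))
        .collect();
    hits.sort_by(|a, b| b.1.total_cmp(&a.1));
    hits
}
```

Swapping the scorer passed to search is enough to change the ranking: a long document with one large matching weight can win under the dot product yet lose under cosine similarity, which normalizes by vector length.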
Performance Tips
- Cache token dictionary (token_dim_sample/token_dim_set) to avoid rebuilding
- Sparse TF representation omits zeros
- Using integer scale types (u16/u32) compresses memory (normalization is just 1/max multiplication; float ops are slightly faster)
- Combine iterators to avoid temporary Vec allocation (tf.zip(idf).map(...))
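The integer-scale and iterator tips can be sketched as follows. This is an illustration under assumptions, not the crate's internals: quantize and weighted_sum are hypothetical helpers, and the weights are assumed to be already normalized to the range [0, 1] before quantization.

```rust
/// Quantize weights in [0, 1] to u16 codes (2 bytes each instead of 4 for f32).
/// Hypothetical helper; out-of-range inputs are clamped.
fn quantize(weights: &[f32]) -> Vec<u16> {
    weights
        .iter()
        .map(|&w| (w.clamp(0.0, 1.0) * u16::MAX as f32).round() as u16)
        .collect()
}

/// Sum of tf * idf computed directly from the u16 codes.
/// The zip/map chain dequantizes on the fly, so no temporary Vec is allocated.
fn weighted_sum(tf: &[u16], idf: &[f32]) -> f32 {
    // Dequantization is a single multiplication by 1/max.
    let inv_max = 1.0 / u16::MAX as f32;
    tf.iter()
        .zip(idf)
        .map(|(&t, &w)| t as f32 * inv_max * w)
        .sum()
}
```

The trade-off is a small rounding error per weight in exchange for halving the storage relative to f32.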
Type Overview
| Type | Role |
|---|---|
| Corpus | Document set meta / frequency getter |
| TokenFrequency | Token frequency in a single document |
| TFVector | Sparse TF vector for one document |
| IDFVector | Global IDF and meta |
| TFIDFVectorizer | TF/IDF management and search entry |
| TFIDFData | Intermediate for serialization |
| DefaultTFIDFEngine | TF/IDF calculation backend |
| SimilarityAlgorithm / Hits | Search query and results |
Customization
- Switch numeric type: f32/f64/u16/u32, etc.
- Extend scoring by implementing the Compare trait
- Swap out TFIDFEngine for different weighting schemes
Examples (examples/)
Run the minimal example with:
cargo run --example basic