Supports the full pipeline: corpus building → TF calculation → IDF calculation → TF‑IDF vectorization / similarity search.
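To make the pipeline stages concrete, here is a minimal, crate-independent sketch using only the standard library. It does not use this crate's API: the function names (`term_frequency`, `inverse_document_frequency`, `tf_idf`, `cosine`) are hypothetical, and the smoothed IDF formula is one common variant chosen for illustration, not necessarily the one the default engine uses.

```rust
use std::collections::HashMap;

/// TF: token count divided by document length.
pub fn term_frequency(tokens: &[&str]) -> HashMap<String, f64> {
    let mut tf: HashMap<String, f64> = HashMap::new();
    for t in tokens {
        *tf.entry(t.to_string()).or_insert(0.0) += 1.0;
    }
    let len = tokens.len() as f64;
    for v in tf.values_mut() {
        *v /= len;
    }
    tf
}

/// IDF (smoothed variant, an assumption): ln((1 + N) / (1 + df)) + 1.
pub fn inverse_document_frequency(docs: &[Vec<&str>]) -> HashMap<String, f64> {
    let n = docs.len() as f64;
    let mut df: HashMap<String, f64> = HashMap::new();
    for doc in docs {
        // Count each token at most once per document.
        let mut unique: Vec<&str> = doc.clone();
        unique.sort_unstable();
        unique.dedup();
        for t in unique {
            *df.entry(t.to_string()).or_insert(0.0) += 1.0;
        }
    }
    df.into_iter()
        .map(|(t, d)| (t, ((1.0 + n) / (1.0 + d)).ln() + 1.0))
        .collect()
}

/// TF-IDF: multiply each TF entry by its IDF; tokens without an IDF are skipped.
pub fn tf_idf(tf: &HashMap<String, f64>, idf: &HashMap<String, f64>) -> HashMap<String, f64> {
    tf.iter()
        .filter_map(|(t, w)| idf.get(t).map(|i| (t.clone(), w * i)))
        .collect()
}

/// Cosine similarity between two sparse vectors; 0.0 if either is empty.
pub fn cosine(a: &HashMap<String, f64>, b: &HashMap<String, f64>) -> f64 {
    let dot: f64 = a.iter().filter_map(|(t, x)| b.get(t).map(|y| x * y)).sum();
    let norm = |v: &HashMap<String, f64>| v.values().map(|x| x * x).sum::<f64>().sqrt();
    let denom = norm(a) * norm(b);
    if denom == 0.0 { 0.0 } else { dot / denom }
}
```

Documents that share more rare tokens score higher under `cosine`, which is the basis of the similarity search step.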
Features
- Engine with generic parameters (f32 / f64 / unsigned integers, etc.) and quantization support
- Serde support
- Similarity computation utilities (SimilarityAlgorithm, Hits, Query) for search use cases
- No index build step; supports immediate add/delete in real time
- Thread-safe
- Corpus information is kept separate from the index, so the corpus can be swapped
- Restorable: preserves document statistics
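As an illustration of how add/delete can work without a separate build step, the sketch below keeps document statistics up to date incrementally behind a lock. It is an assumption about one possible design, not this crate's implementation: `CorpusStats` and its methods are hypothetical names, and the real Corpus may use a different synchronization strategy.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::RwLock;

#[derive(Default)]
struct Inner {
    doc_count: u64,
    doc_freq: HashMap<String, u64>,
}

/// Hypothetical shared corpus statistics, safe to use from several threads.
#[derive(Default)]
pub struct CorpusStats {
    inner: RwLock<Inner>,
}

impl CorpusStats {
    /// Adding a document only bumps counters; nothing is rebuilt.
    pub fn add_document(&self, tokens: &[&str]) {
        let unique: HashSet<&str> = tokens.iter().copied().collect();
        let mut guard = self.inner.write().unwrap();
        let inner = &mut *guard;
        inner.doc_count += 1;
        for t in unique {
            *inner.doc_freq.entry(t.to_string()).or_insert(0) += 1;
        }
    }

    /// Removing a document reverses the same counter updates.
    pub fn remove_document(&self, tokens: &[&str]) {
        let unique: HashSet<&str> = tokens.iter().copied().collect();
        let mut guard = self.inner.write().unwrap();
        let inner = &mut *guard;
        inner.doc_count = inner.doc_count.saturating_sub(1);
        for t in unique {
            let now_zero = match inner.doc_freq.get_mut(t) {
                Some(c) => {
                    *c = c.saturating_sub(1);
                    *c == 0
                }
                None => false,
            };
            if now_zero {
                inner.doc_freq.remove(t);
            }
        }
    }

    pub fn doc_count(&self) -> u64 {
        let guard = self.inner.read().unwrap();
        guard.doc_count
    }

    pub fn doc_frequency(&self, token: &str) -> u64 {
        let guard = self.inner.read().unwrap();
        guard.doc_freq.get(token).copied().unwrap_or(0)
    }
}
```

Because IDF can be derived from `doc_count` and `doc_frequency` at query time, updates become visible to searches immediately.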
Setup
Cargo.toml
[]
= "0.9" # This README targets `v0.9.x`
Basic usage
use std::sync::Arc;
use ;
Serialization / Restore
TFIDFVectorizer holds references (for example, to its Corpus), so it cannot be deserialized directly.
On serialization it is converted into TFIDFData, which can be restored with into_tf_idf_vectorizer(&Corpus).
Any corpus can be passed at restore time, not only the original one; tokens that exist in the index but not in the corpus are ignored.
// save
let dump = to_string?;
// restore
let data: TFIDFData = from_str?;
let restored = data.into_tf_idf_vectorizer(&corpus);
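The owned-snapshot pattern described above can be sketched with the standard library alone. Everything here is a hypothetical stand-in (`Corpus`, `Index`, `IndexData`, `to_data`, `into_index`), not this crate's types, and dropping unknown tokens at restore time is only one possible way to "ignore" them.

```rust
use std::collections::HashMap;

/// Stand-in for corpus statistics (hypothetical).
pub struct Corpus {
    pub doc_freq: HashMap<String, u64>,
}

/// Owned snapshot: holds no references, so it is the form that can be
/// written out and read back (for example with serde derives).
pub struct IndexData {
    pub docs: Vec<(String, Vec<(String, u32)>)>,
}

/// Live index: borrows a corpus, so it cannot be built from bytes alone.
pub struct Index<'a> {
    pub corpus: &'a Corpus,
    pub docs: Vec<(String, Vec<(String, u32)>)>,
}

impl<'a> Index<'a> {
    /// Copy out the owned part for saving.
    pub fn to_data(&self) -> IndexData {
        IndexData { docs: self.docs.clone() }
    }
}

impl IndexData {
    /// Re-attach a corpus. Tokens unknown to that corpus are dropped here.
    pub fn into_index(self, corpus: &Corpus) -> Index<'_> {
        let docs: Vec<_> = self
            .docs
            .into_iter()
            .map(|(id, terms)| {
                let kept: Vec<(String, u32)> = terms
                    .into_iter()
                    .filter(|(t, _)| corpus.doc_freq.contains_key(t))
                    .collect();
                (id, kept)
            })
            .collect();
        Index { corpus, docs }
    }
}
```

Splitting the type this way keeps the borrowed corpus out of the serialized form, which is why a different corpus can be supplied when restoring.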
Performance guidelines
- Cache token dictionaries (token_dim_sample / token_dim_set) to avoid rebuilding them
- Sparsify TF vectors to omit zeros
- Integer-scaled types (u16/u32) reduce memory usage; normalization then needs only a single multiplication by 1/max, though floats are slightly faster to compute with
- Build the inverted index on the fly
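The sparsification and integer-scaling guidelines can be illustrated together. The sketch below is an assumption about how such a representation might look, not the crate's internal layout: `SparseTf`, `quantize`, and `dequantize` are hypothetical names, and "max" is taken to mean the integer type's maximum value (`u16::MAX`).

```rust
/// Sparse TF vector: only non-zero entries, as (dimension, scaled value) pairs.
pub struct SparseTf {
    pub entries: Vec<(u32, u16)>,
}

/// Scale values so the largest maps to u16::MAX, and drop zeros.
pub fn quantize(dense: &[f32]) -> SparseTf {
    let max = dense.iter().cloned().fold(0.0_f32, f32::max);
    if max <= 0.0 {
        return SparseTf { entries: Vec::new() };
    }
    let scale = u16::MAX as f32 / max;
    let entries = dense
        .iter()
        .enumerate()
        .filter_map(|(i, &v)| (v > 0.0).then(|| (i as u32, (v * scale).round() as u16)))
        .collect();
    SparseTf { entries }
}

/// Recover values normalized to [0, 1] with one multiplication by 1/max.
pub fn dequantize(v: &SparseTf, dim: usize) -> Vec<f32> {
    let inv = 1.0 / u16::MAX as f32;
    let mut out = vec![0.0; dim];
    for &(i, q) in &v.entries {
        out[i as usize] = q as f32 * inv;
    }
    out
}
```

Each stored entry takes 6 bytes of payload here (a `u32` dimension plus a `u16` value) instead of one `f32` per dimension, which pays off when most dimensions are zero.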
Type overview
| Type | Role |
|---|---|
| Corpus | Document set metadata / frequency lookup |
| TokenFrequency | Token frequencies within a single document |
| TFVector | TF sparse vector for one document |
| IDFVector | Global IDF and metadata |
| TFIDFVectorizer | TF/IDF management and search entry point |
| TFIDFData | Intermediate type for serialization |
| DefaultTFIDFEngine | Backend for TF/IDF computation |
| SimilarityAlgorithm / Hits / Query | Search query and results |
Customization
- Switch numeric type to f16/f32/f64/u16/u32, etc.
- Replace TFIDFEngine to use different weighting schemes
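To show what a pluggable weighting scheme can look like in general, here is a small trait-based sketch. The trait `WeightingScheme` and the `Classic` and `Sublinear` types are hypothetical and do not reflect the actual TFIDFEngine trait or its method signatures; consult the crate documentation for the real interface.

```rust
/// Hypothetical weighting interface: a TF part, an IDF part, and their product.
pub trait WeightingScheme {
    fn tf(&self, count: u32, doc_len: u32) -> f64;
    fn idf(&self, doc_count: u64, doc_freq: u64) -> f64;

    /// Default combination: plain product of TF and IDF.
    fn weight(&self, count: u32, doc_len: u32, doc_count: u64, doc_freq: u64) -> f64 {
        self.tf(count, doc_len) * self.idf(doc_count, doc_freq)
    }
}

/// Smoothed IDF shared by both schemes below: ln((1 + N) / (1 + df)) + 1.
fn smoothed_idf(doc_count: u64, doc_freq: u64) -> f64 {
    ((1.0 + doc_count as f64) / (1.0 + doc_freq as f64)).ln() + 1.0
}

/// Length-normalized TF.
pub struct Classic;

impl WeightingScheme for Classic {
    fn tf(&self, count: u32, doc_len: u32) -> f64 {
        if doc_len == 0 { 0.0 } else { count as f64 / doc_len as f64 }
    }
    fn idf(&self, doc_count: u64, doc_freq: u64) -> f64 {
        smoothed_idf(doc_count, doc_freq)
    }
}

/// Sublinear TF: 1 + ln(count), which dampens very frequent tokens.
pub struct Sublinear;

impl WeightingScheme for Sublinear {
    fn tf(&self, count: u32, _doc_len: u32) -> f64 {
        if count == 0 { 0.0 } else { 1.0 + (count as f64).ln() }
    }
    fn idf(&self, doc_count: u64, doc_freq: u64) -> f64 {
        smoothed_idf(doc_count, doc_freq)
    }
}
```

Code that is generic over such a trait can switch schemes by changing a single type parameter.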
Examples (examples/)
Run the minimal example with cargo run --example basic.