§TF-IDF Vectorizer
This crate provides a document analysis engine based on a highly customizable TF-IDF vectorizer.
It is designed for:
- Full-text search engines
- Document similarity analysis
- Large-scale corpus processing
§Architecture Overview
The crate is composed of the following core concepts:
- Corpus: Global document-frequency statistics (IDF base)
- TokenFrequency: Per-document token statistics (TF base)
- TFIDFVectorizer: Converts documents into sparse TF-IDF vectors
- TFIDFEngine: Pluggable TF / IDF calculation strategy
- SimilarityAlgorithm: Multiple scoring algorithms (Cosine, Dot, BM25-like)
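To make the split between per-document statistics (TF) and corpus-wide statistics (IDF) concrete, here is a minimal standard-library sketch of TF-IDF weighting and cosine scoring over sparse vectors. It does not use this crate's API and is not its implementation: the function names (`term_frequency`, `inverse_document_frequency`, `tfidf`, `cosine`) and the smoothed IDF formula are assumptions chosen for illustration, and the crate's TFIDFEngine may compute TF and IDF differently.

```rust
use std::collections::{HashMap, HashSet};

/// Term frequency: occurrences of each token divided by the document length.
fn term_frequency(tokens: &[&str]) -> HashMap<String, f64> {
    let mut counts: HashMap<String, f64> = HashMap::new();
    for t in tokens {
        *counts.entry(t.to_string()).or_insert(0.0) += 1.0;
    }
    let total = tokens.len() as f64;
    for v in counts.values_mut() {
        *v /= total;
    }
    counts
}

/// Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1.
/// This is one common variant, assumed here for illustration only.
fn inverse_document_frequency(docs: &[Vec<&str>]) -> HashMap<String, f64> {
    let n = docs.len() as f64;
    let mut df: HashMap<String, f64> = HashMap::new();
    for doc in docs {
        // Count each token at most once per document.
        let unique: HashSet<&str> = doc.iter().copied().collect();
        for t in unique {
            *df.entry(t.to_string()).or_insert(0.0) += 1.0;
        }
    }
    df.into_iter()
        .map(|(t, d)| (t, ((1.0 + n) / (1.0 + d)).ln() + 1.0))
        .collect()
}

/// Sparse TF-IDF vector: only tokens present in the document are stored.
fn tfidf(tokens: &[&str], idf: &HashMap<String, f64>) -> HashMap<String, f64> {
    term_frequency(tokens)
        .into_iter()
        .map(|(t, tf)| {
            let w = tf * idf.get(&t).copied().unwrap_or(0.0);
            (t, w)
        })
        .collect()
}

/// Cosine similarity between two sparse vectors (0.0 if either is all zeros).
fn cosine(a: &HashMap<String, f64>, b: &HashMap<String, f64>) -> f64 {
    let dot: f64 = a.iter().filter_map(|(t, x)| b.get(t).map(|y| x * y)).sum();
    let norm = |v: &HashMap<String, f64>| v.values().map(|x| x * x).sum::<f64>().sqrt();
    let denom = norm(a) * norm(b);
    if denom == 0.0 { 0.0 } else { dot / denom }
}

fn main() {
    let docs = vec![
        vec!["rust", "fast", "parallel", "rust"],
        vec!["rust", "flexible", "safe", "rust"],
    ];
    let idf = inverse_document_frequency(&docs);
    let v1 = tfidf(&docs[0], &idf);
    let v2 = tfidf(&docs[1], &idf);
    println!("similarity(doc1, doc2) = {:.4}", cosine(&v1, &v2));
}
```

In this sketch the IDF table depends only on the document collection, while each TF map depends only on one document. That separation mirrors the Corpus / TokenFrequency split described in the list above.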
§Example
use std::sync::Arc;
use tf_idf_vectorizer::{Corpus, SimilarityAlgorithm, TFIDFVectorizer, TokenFrequency, vectorizer::evaluate::query::Query};

fn main() {
    // build corpus
    let corpus = Arc::new(Corpus::new());

    // make token frequencies
    // (the non-ASCII tokens are Japanese: 高速 "fast", 並列 "parallel", 柔軟 "flexible", 安全 "safe")
    let mut freq1 = TokenFrequency::new();
    freq1.add_tokens(&["rust", "高速", "並列", "rust"]);
    let mut freq2 = TokenFrequency::new();
    freq2.add_tokens(&["rust", "柔軟", "安全", "rust"]);

    // add documents to vectorizer
    let mut vectorizer: TFIDFVectorizer<u16> = TFIDFVectorizer::new(corpus);
    vectorizer.add_doc("doc1".to_string(), &freq1);
    vectorizer.add_doc("doc2".to_string(), &freq2);

    // delete doc1, then add the same token frequencies again as doc3
    vectorizer.del_doc(&"doc1".to_string());
    vectorizer.add_doc("doc3".to_string(), &freq1);

    // build an AND query over two tokens and search with cosine similarity
    let query = Query::and(Query::token("rust"), Query::token("安全"));
    let algorithm = SimilarityAlgorithm::CosineSimilarity;
    let mut result = vectorizer.search(&algorithm, query);
    result.sort_by_score_desc();

    // print result
    println!("Search Results: \n{}", result);

    // debug
    println!("result count: {}", result.list.len());
    println!("{:?}", vectorizer);
}
§Thread Safety
- Corpus is thread-safe and can be shared across vectorizers
- Designed for parallel indexing and search workloads
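The sharing pattern described here can be illustrated with the standard library alone. The sketch below is not the crate's Corpus: `SharedCorpus`, `add_document`, `doc_count`, and `index_in_parallel` are hypothetical stand-ins, and the use of a single `RwLock` is an assumption made for brevity (the real Corpus may use a different synchronization strategy). It only shows how one set of document-frequency statistics behind an `Arc` can be updated from several indexing threads.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::{Arc, RwLock};
use std::thread;

/// Hypothetical stand-in for a shared corpus: document-frequency counts behind a lock.
#[derive(Default)]
struct SharedCorpus {
    doc_freq: RwLock<HashMap<String, u64>>,
}

impl SharedCorpus {
    /// Record one document: each distinct token raises its document frequency by one.
    fn add_document(&self, tokens: &[&str]) {
        let unique: HashSet<&str> = tokens.iter().copied().collect();
        let mut df = self.doc_freq.write().unwrap();
        for t in unique {
            *df.entry(t.to_string()).or_insert(0) += 1;
        }
    }

    /// Number of recorded documents that contain `token`.
    fn doc_count(&self, token: &str) -> u64 {
        self.doc_freq.read().unwrap().get(token).copied().unwrap_or(0)
    }
}

/// Index each batch on its own thread; all threads write to the same corpus.
fn index_in_parallel(corpus: &Arc<SharedCorpus>, batches: Vec<Vec<Vec<&'static str>>>) {
    let handles: Vec<_> = batches
        .into_iter()
        .map(|batch| {
            // Each worker gets its own Arc handle to the shared statistics.
            let corpus = Arc::clone(corpus);
            thread::spawn(move || {
                for doc in &batch {
                    corpus.add_document(doc);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}

fn main() {
    let corpus = Arc::new(SharedCorpus::default());
    let batches = vec![
        vec![vec!["rust", "fast", "rust"], vec!["rust", "safe"]],
        vec![vec!["rust", "parallel"], vec!["safe", "flexible"]],
    ];
    index_in_parallel(&corpus, batches);
    println!("rust appears in {} documents", corpus.doc_count("rust"));
}
```

With the crate itself, the analogous step is cloning the `Arc` returned by `Arc::new(Corpus::new())` and passing a clone to each `TFIDFVectorizer::new`, as the example above does for a single vectorizer.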
§Serialization
- TFIDFVectorizer and TFIDFData support serialization
- TFIDFData does not hold a Corpus reference and is suitable for storage
§Re-exports
pub use vectorizer::TFIDFVectorizer;
pub use vectorizer::serde::TFIDFData;
pub use vectorizer::corpus::Corpus;
pub use vectorizer::token::TokenFrequency;
pub use vectorizer::tfidf::DefaultTFIDFEngine;
pub use vectorizer::tfidf::TFIDFEngine;
pub use vectorizer::evaluate::scoring::SimilarityAlgorithm;
pub use vectorizer::evaluate::query::Query;
pub use vectorizer::evaluate::scoring::Hits;
pub use vectorizer::evaluate::scoring::HitEntry;