Crate tf_idf_vectorizer

§TF-IDF Vectorizer

This crate provides a document analysis engine based on a highly customizable TF-IDF vectorizer.

It is designed for:

  • Full-text search engines
  • Document similarity analysis
  • Large-scale corpus processing

§Architecture Overview

The crate is composed of the following core concepts:

  • Corpus: Global document-frequency statistics (IDF base)
  • TokenFrequency: Per-document token statistics (TF base)
  • TFIDFVectorizer: Converts documents into sparse TF-IDF vectors
  • TFIDFEngine: Pluggable TF / IDF calculation strategy
  • SimilarityAlgorithm: Multiple scoring algorithms (Cosine, Dot, BM25-like)
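As a rough illustration of how the TF side (per-document statistics) and the IDF side (corpus-wide statistics) combine, here is a minimal standard-library sketch of textbook TF-IDF weighting. It does not use this crate's API. The function names and the smoothed IDF formula `ln((1 + N) / (1 + df)) + 1` are assumptions chosen for illustration; the formulas the crate actually applies are determined by its TFIDFEngine and are not shown here.

```rust
use std::collections::HashMap;

/// Term frequency: occurrences of each token divided by the document length.
/// (Hypothetical helper, not part of the crate.)
fn term_frequencies(tokens: &[&str]) -> HashMap<String, f64> {
    let mut counts: HashMap<String, f64> = HashMap::new();
    for token in tokens {
        *counts.entry((*token).to_string()).or_insert(0.0) += 1.0;
    }
    let total = tokens.len() as f64;
    if total > 0.0 {
        for value in counts.values_mut() {
            *value /= total;
        }
    }
    counts
}

/// Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1.
/// This is one common variant, assumed here for illustration only.
fn inverse_document_frequency(doc_count: usize, docs_with_token: usize) -> f64 {
    ((1.0 + doc_count as f64) / (1.0 + docs_with_token as f64)).ln() + 1.0
}

/// Sparse TF-IDF vector for one document, given the whole corpus.
fn tf_idf(tokens: &[&str], corpus: &[Vec<&str>]) -> HashMap<String, f64> {
    term_frequencies(tokens)
        .into_iter()
        .map(|(token, tf)| {
            // Document frequency: number of documents containing the token.
            let df = corpus
                .iter()
                .filter(|doc| doc.iter().any(|t| *t == token.as_str()))
                .count();
            let weight = tf * inverse_document_frequency(corpus.len(), df);
            (token, weight)
        })
        .collect()
}
```

Under this variant, a token that appears in every document gets an IDF of exactly 1, while rarer tokens are weighted more heavily.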

§Example

use std::sync::Arc;

use tf_idf_vectorizer::{Corpus, SimilarityAlgorithm, TFIDFVectorizer, TokenFrequency, vectorizer::evaluate::query::Query};

fn main() {
    // Build the corpus; it is wrapped in an Arc so it can be shared.
    let corpus = Arc::new(Corpus::new());

    // Collect per-document token frequencies.
    let mut freq1 = TokenFrequency::new();
    freq1.add_tokens(&["rust", "高速", "並列", "rust"]);
    let mut freq2 = TokenFrequency::new();
    freq2.add_tokens(&["rust", "柔軟", "安全", "rust"]);

    // Add documents to the vectorizer.
    let mut vectorizer: TFIDFVectorizer<u16> = TFIDFVectorizer::new(corpus);
    vectorizer.add_doc("doc1".to_string(), &freq1);
    vectorizer.add_doc("doc2".to_string(), &freq2);
    // Documents can be deleted, and the same frequencies re-added under a new key.
    vectorizer.del_doc(&"doc1".to_string());
    vectorizer.add_doc("doc3".to_string(), &freq1);

    // Search for documents containing both "rust" and "安全", ranked by cosine similarity.
    let query = Query::and(Query::token("rust"), Query::token("安全"));
    let algorithm = SimilarityAlgorithm::CosineSimilarity;
    let mut result = vectorizer.search(&algorithm, query);
    result.sort_by_score_desc();

    // Print the ranked results.
    println!("Search Results: \n{}", result);
    // Debug output: hit count and the vectorizer's internal state.
    println!("result count: {}", result.list.len());
    println!("{:?}", vectorizer);
}
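For readers unfamiliar with the scoring used in the example, the following is a minimal sketch of standard cosine similarity over sparse vectors, written against the standard library only. The `cosine_similarity` function and the `HashMap<String, f64>` representation are assumptions for illustration; how the crate stores its vectors and implements `SimilarityAlgorithm::CosineSimilarity` is not shown in this documentation.

```rust
use std::collections::HashMap;

/// Cosine similarity between two sparse vectors keyed by token.
/// Returns 0.0 when either vector has zero length, to avoid dividing by zero.
/// (Hypothetical helper, not part of the crate.)
fn cosine_similarity(a: &HashMap<String, f64>, b: &HashMap<String, f64>) -> f64 {
    // Only tokens present in both vectors contribute to the dot product.
    let dot: f64 = a
        .iter()
        .filter_map(|(token, va)| b.get(token).map(|vb| va * vb))
        .sum();
    let norm_a = a.values().map(|v| v * v).sum::<f64>().sqrt();
    let norm_b = b.values().map(|v| v * v).sum::<f64>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}
```

Because the score is normalized by both vector lengths, it depends on the direction of the vectors rather than their magnitude, which is why it is a common choice for comparing documents of different sizes.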

§Thread Safety

  • Corpus is thread-safe and can be shared across vectorizers
  • Designed for parallel indexing and search workloads
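The sharing pattern implied above can be sketched with the standard library alone. In the sketch below, `SharedStats` is a hypothetical stand-in for a corpus-like structure guarded by a `Mutex`; it is not the crate's Corpus, whose internal synchronization is not described here. The point is only to show several worker threads updating one set of document-frequency statistics through an `Arc`.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for shared document-frequency statistics.
#[derive(Default)]
struct SharedStats {
    doc_freq: Mutex<HashMap<String, usize>>,
}

impl SharedStats {
    /// Count each distinct token once per document.
    fn add_document(&self, tokens: &[&str]) {
        let unique: HashSet<&str> = tokens.iter().copied().collect();
        let mut doc_freq = self.doc_freq.lock().unwrap();
        for token in unique {
            *doc_freq.entry(token.to_string()).or_insert(0) += 1;
        }
    }

    /// Number of documents that contain the token.
    fn document_frequency(&self, token: &str) -> usize {
        self.doc_freq.lock().unwrap().get(token).copied().unwrap_or(0)
    }
}

/// Index every document on its own thread, all sharing one SharedStats.
fn index_in_parallel(docs: Vec<Vec<&'static str>>) -> Arc<SharedStats> {
    let stats = Arc::new(SharedStats::default());
    let handles: Vec<_> = docs
        .into_iter()
        .map(|doc| {
            let stats = Arc::clone(&stats);
            thread::spawn(move || stats.add_document(&doc))
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    stats
}
```

A single `Mutex` keeps the sketch short; a real implementation aimed at heavy parallel workloads might prefer finer-grained or lock-free structures.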

§Serialization

  • TFIDFVectorizer and TFIDFData support serialization
  • TFIDFData does not hold a Corpus reference and is suitable for storage

§Re-exports

pub use vectorizer::TFIDFVectorizer;
pub use vectorizer::serde::TFIDFData;
pub use vectorizer::corpus::Corpus;
pub use vectorizer::token::TokenFrequency;
pub use vectorizer::tfidf::DefaultTFIDFEngine;
pub use vectorizer::tfidf::TFIDFEngine;
pub use vectorizer::evaluate::scoring::SimilarityAlgorithm;
pub use vectorizer::evaluate::query::Query;
pub use vectorizer::evaluate::scoring::Hits;
pub use vectorizer::evaluate::scoring::HitEntry;

§Modules

utils
vectorizer