tf_idf_vectorizer/lib.rs
//! This crate is a document analysis engine built around a TF-IDF vectorizer.
pub mod vectorizer;
pub mod utils;

/// TF-IDF Vectorizer
///
/// The top-level struct of this crate, providing the main TF-IDF vectorizer features.
/// It converts a document collection into TF-IDF vectors and supports similarity
/// computation and search.
///
/// Internally, it holds:
/// - The corpus vocabulary
/// - Sparse TF vectors for each document
/// - A token map for TF vectors
/// - An IDF vector cache
/// - A TF-IDF calculation engine
/// - An inverted index of documents
///
/// `TFIDFVectorizer<N, K, E>` has the following generic parameters:
/// - `N`: Vector parameter type (e.g., `f32`, `f64`, `u8`, `u16`, `u32`)
/// - `K`: Document key type (e.g., `String`, `usize`)
/// - `E`: TF-IDF calculation engine type (e.g., `DefaultTFIDFEngine`)
///
/// When creating an instance, you must pass a corpus reference as `Arc<Corpus>`.
/// The `Corpus` can be replaced if needed, and can be shared among multiple
/// `TFIDFVectorizer` instances.
///
/// # Serialization
/// Supported. The serialized form includes the `Corpus` reference.
/// Alternatively, use `TFIDFData`, a serializable data structure that does not hold
/// a `Corpus` reference and can therefore be stored separately from the `Corpus`.
///
/// # Deserialization
/// Supported, including data expansion/unpacking.
pub use vectorizer::TFIDFVectorizer;

/// TF-IDF Vectorizer Data Structure for Serialization
///
/// A serializable data structure that, unlike `TFIDFVectorizer`, does not hold a
/// `Corpus` reference.
/// To convert it into a `TFIDFVectorizer`, pass an `Arc<Corpus>` to
/// `into_tf_idf_vectorizer`.
///
/// It has a smaller footprint than `TFIDFVectorizer`.
///
/// # Serialization
/// Supported.
///
/// # Deserialization
/// Supported, including data expansion/unpacking.
pub use vectorizer::serde::TFIDFData;

/// Corpus for TF-IDF Vectorizer
///
/// This struct manages statistics for a collection of documents.
/// It does not store document text or IDs; it only tracks:
/// - The number of documents
/// - For each token, the number of documents in which it appears
///
/// These counts are the base data for IDF (Inverse Document Frequency) calculation.
///
/// When creating a `TFIDFVectorizer`, you must pass a corpus reference as
/// `Arc<Corpus>`.
/// `Corpus` can be shared among multiple `TFIDFVectorizer` instances.
///
/// For statistics/analysis, `TokenFrequency` may be more suitable.
/// You can convert to `TokenFrequency` if needed, but note that the two carry
/// fundamentally different statistical meanings: `Corpus` counts the documents
/// containing each token, while `TokenFrequency` counts token occurrences.
///
/// # Thread Safety
/// This struct is thread-safe and can be accessed concurrently from multiple threads.
/// It is implemented using `DashMap` and atomics.
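///
/// # Example
/// A minimal sketch of how document counts like these can feed an IDF calculation.
/// It does not use this crate's API: `CorpusStats` and its methods are hypothetical
/// stand-ins, and the plain `ln(N / df)` formula is an assumption, not necessarily
/// what `DefaultTFIDFEngine` computes.
///
/// ```rust
/// use std::collections::{HashMap, HashSet};
///
/// // Hypothetical stand-in for the statistics a corpus tracks
/// // (not this crate's `Corpus`).
/// struct CorpusStats {
///     doc_count: u64,
///     doc_freq: HashMap<String, u64>,
/// }
///
/// impl CorpusStats {
///     fn new() -> Self {
///         Self { doc_count: 0, doc_freq: HashMap::new() }
///     }
///
///     // Registers one document; each distinct token is counted once.
///     fn add_document(&mut self, tokens: &[&str]) {
///         self.doc_count += 1;
///         let unique: HashSet<&str> = tokens.iter().copied().collect();
///         for token in unique {
///             *self.doc_freq.entry(token.to_string()).or_insert(0) += 1;
///         }
///     }
///
///     // Assumed textbook IDF: ln(N / df). Unseen tokens get 0.0.
///     fn idf(&self, token: &str) -> f64 {
///         match self.doc_freq.get(token) {
///             Some(&df) if df > 0 => (self.doc_count as f64 / df as f64).ln(),
///             _ => 0.0,
///         }
///     }
/// }
///
/// fn main() {
///     let mut stats = CorpusStats::new();
///     stats.add_document(&["rust", "is", "fast", "rust"]);
///     stats.add_document(&["python", "is", "popular"]);
///     // "is" appears in every document, so its IDF is ln(2 / 2) = 0.
///     assert_eq!(stats.idf("is"), 0.0);
///     // "rust" appears in one of two documents, so its IDF is ln(2 / 1).
///     assert!((stats.idf("rust") - 2f64.ln()).abs() < 1e-12);
/// }
/// ```
///
/// Note that the repeated "rust" in the first document still counts as one document
/// for that token, which is what separates document frequency from term frequency.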
pub use vectorizer::corpus::Corpus;

/// Token Frequency structure
///
/// A struct for analyzing and managing token occurrence frequency within a document.
/// It manages:
/// - The number of occurrences of each token
/// - The total number of tokens in the document
///
/// It is used as the base data for TF (Term Frequency) calculation.
///
/// It supports adding tokens, setting and getting counts, and retrieving statistics.
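///
/// # Example
/// A minimal sketch of how per-token counts and a total token count yield a term
/// frequency. `TokenCounts`, `add_token`, and `tf` are hypothetical names used for
/// illustration only; they are not the methods of `TokenFrequency`.
///
/// ```rust
/// use std::collections::HashMap;
///
/// // Hypothetical stand-in for per-document token counts
/// // (not this crate's `TokenFrequency`).
/// #[derive(Default)]
/// struct TokenCounts {
///     counts: HashMap<String, u64>,
///     total: u64,
/// }
///
/// impl TokenCounts {
///     // Records one occurrence of `token`.
///     fn add_token(&mut self, token: &str) {
///         *self.counts.entry(token.to_string()).or_insert(0) += 1;
///         self.total += 1;
///     }
///
///     // Term frequency: occurrences of `token` divided by the total token count.
///     fn tf(&self, token: &str) -> f64 {
///         if self.total == 0 {
///             return 0.0;
///         }
///         let count = self.counts.get(token).copied().unwrap_or(0);
///         count as f64 / self.total as f64
///     }
/// }
///
/// fn main() {
///     let mut doc = TokenCounts::default();
///     for token in ["rust", "is", "fast", "rust"] {
///         doc.add_token(token);
///     }
///     assert_eq!(doc.tf("rust"), 0.5); // 2 of 4 tokens
///     assert_eq!(doc.tf("python"), 0.0); // never seen
/// }
/// ```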
pub use vectorizer::token::TokenFrequency;

/// TF-IDF Calculation Engine Trait
///
/// A trait that defines the behavior of a TF-IDF calculation engine.
///
/// By implementing this trait, you can plug different TF-IDF calculation strategies
/// into `TFIDFVectorizer<N, K, E>`.
/// A default implementation, `DefaultTFIDFEngine`, is provided and performs
/// textbook-style TF-IDF calculation.
///
/// The default implementation supports the following parameter types for quantization:
/// - `f16`
/// - `f32`
/// - `f64`
/// - `u8`
/// - `u16`
/// - `u32`
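///
/// # Example
/// A rough sketch of the pluggable-engine idea and of quantizing a weight to an
/// integer type. Everything here is hypothetical: `WeightEngine`, `Textbook`,
/// `LogTf`, and `quantize_u16` are illustrative names, not the actual `TFIDFEngine`
/// methods, and the scale-to-full-range quantization is an assumed scheme.
///
/// ```rust
/// // Hypothetical trait, loosely mirroring the idea of a pluggable engine.
/// trait WeightEngine {
///     fn weight(&self, tf: f64, idf: f64) -> f64;
/// }
///
/// // Textbook TF-IDF: the product of TF and IDF.
/// struct Textbook;
/// impl WeightEngine for Textbook {
///     fn weight(&self, tf: f64, idf: f64) -> f64 {
///         tf * idf
///     }
/// }
///
/// // Alternative strategy: dampen TF with ln(1 + tf) before multiplying.
/// struct LogTf;
/// impl WeightEngine for LogTf {
///     fn weight(&self, tf: f64, idf: f64) -> f64 {
///         (1.0 + tf).ln() * idf
///     }
/// }
///
/// // Generic over the engine, so strategies can be swapped at compile time.
/// fn score<E: WeightEngine>(engine: &E, tf: f64, idf: f64) -> f64 {
///     engine.weight(tf, idf)
/// }
///
/// // Assumed quantization scheme: map a value in [0, 1] onto the full u16 range.
/// fn quantize_u16(x: f64) -> u16 {
///     (x.clamp(0.0, 1.0) * u16::MAX as f64).round() as u16
/// }
///
/// fn main() {
///     assert_eq!(score(&Textbook, 0.5, 2.0), 1.0);
///     assert_eq!(score(&LogTf, 0.0, 2.0), 0.0);
///     assert_eq!(quantize_u16(0.0), 0);
///     assert_eq!(quantize_u16(1.0), u16::MAX);
/// }
/// ```
///
/// Integer parameter types trade precision for a smaller memory footprint, which is
/// the usual motivation for quantizing sparse vectors.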
pub use vectorizer::tfidf::{DefaultTFIDFEngine, TFIDFEngine};

/// Similarity Algorithm for TF-IDF Vectorizer
///
/// The `SimilarityAlgorithm` enum defines the similarity-scoring algorithms used by
/// the TF-IDF vectorizer.
///
/// Currently, the following algorithms are supported:
/// - Contains: simple containment check (whether the document contains the token)
/// - Dot: dot product (suitable for long-document search)
/// - Cosine Similarity: cosine similarity (suitable for proper-noun search)
/// - BM25 Like: BM25-like scoring (suitable for general document search)
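///
/// # Example
/// To make the difference between the dot-product and cosine variants concrete, here
/// is a minimal sketch over sparse vectors. It is independent of this crate:
/// `SparseVec`, `dot`, `norm`, and `cosine` are hypothetical helpers, and the
/// crate's own scoring may differ in detail.
///
/// ```rust
/// use std::collections::HashMap;
///
/// // Hypothetical sparse vector: token -> weight.
/// type SparseVec = HashMap<&'static str, f64>;
///
/// // Dot product: sum of weight products over shared tokens.
/// fn dot(a: &SparseVec, b: &SparseVec) -> f64 {
///     a.iter()
///         .filter_map(|(token, wa)| b.get(token).map(|wb| wa * wb))
///         .sum()
/// }
///
/// // Euclidean norm of a sparse vector.
/// fn norm(v: &SparseVec) -> f64 {
///     v.values().map(|w| w * w).sum::<f64>().sqrt()
/// }
///
/// // Cosine similarity: dot product divided by the product of the norms.
/// fn cosine(a: &SparseVec, b: &SparseVec) -> f64 {
///     let denom = norm(a) * norm(b);
///     if denom == 0.0 { 0.0 } else { dot(a, b) / denom }
/// }
///
/// fn main() {
///     let query: SparseVec = HashMap::from([("rust", 1.0)]);
///     let short_doc: SparseVec = HashMap::from([("rust", 2.0)]);
///     let long_doc: SparseVec = HashMap::from([("rust", 3.0), ("fast", 4.0)]);
///
///     // The dot product favors the document with the larger matching weight.
///     assert_eq!(dot(&query, &short_doc), 2.0);
///     assert_eq!(dot(&query, &long_doc), 3.0);
///
///     // Cosine normalizes by vector length, so the focused document wins.
///     assert_eq!(cosine(&query, &short_doc), 1.0);
///     assert!((cosine(&query, &long_doc) - 0.6).abs() < 1e-12);
/// }
/// ```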
pub use vectorizer::evaluate::scoring::SimilarityAlgorithm;

/// Query Structure for TF-IDF Vectorizer
///
/// Represents a search query used by the TF-IDF vectorizer.
/// It provides a flexible way to filter documents by combining logical conditions.
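///
/// # Example
/// One way to picture combined logical conditions is a small expression tree that is
/// evaluated against a document's token set. The `Filter` enum below is a
/// hypothetical illustration; it is not the `Query` type or its constructors.
///
/// ```rust
/// use std::collections::HashSet;
///
/// // Hypothetical query tree (not this crate's `Query` type).
/// enum Filter {
///     Token(&'static str),
///     And(Box<Filter>, Box<Filter>),
///     Or(Box<Filter>, Box<Filter>),
///     Not(Box<Filter>),
/// }
///
/// impl Filter {
///     // Returns true if the document's token set satisfies the condition.
///     fn matches(&self, doc: &HashSet<&str>) -> bool {
///         match self {
///             Filter::Token(t) => doc.contains(*t),
///             Filter::And(a, b) => a.matches(doc) && b.matches(doc),
///             Filter::Or(a, b) => a.matches(doc) || b.matches(doc),
///             Filter::Not(inner) => !inner.matches(doc),
///         }
///     }
/// }
///
/// fn main() {
///     let doc: HashSet<&str> = HashSet::from(["rust", "is", "fast"]);
///
///     // rust AND NOT python
///     let q = Filter::And(
///         Box::new(Filter::Token("rust")),
///         Box::new(Filter::Not(Box::new(Filter::Token("python")))),
///     );
///     assert!(q.matches(&doc));
///
///     // python OR go
///     let q2 = Filter::Or(
///         Box::new(Filter::Token("python")),
///         Box::new(Filter::Token("go")),
///     );
///     assert!(!q2.matches(&doc));
/// }
/// ```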
pub use vectorizer::evaluate::query::Query;

/// Search Hits and Hit Entry structures
///
/// Data structures for managing search results.
/// - `Hits`: holds a list of search results and provides features such as sorting by score
/// - `HitEntry`: represents a single result entry, containing the document key and score
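///
/// # Example
/// A minimal sketch of a result list sorted by descending score. `Entry` and
/// `ResultList` are hypothetical stand-ins; the real `Hits` and `HitEntry` fields
/// and methods may be named and typed differently.
///
/// ```rust
/// // Hypothetical result entry: a document key and its score.
/// struct Entry {
///     key: String,
///     score: f64,
/// }
///
/// // Hypothetical result list.
/// struct ResultList {
///     entries: Vec<Entry>,
/// }
///
/// impl ResultList {
///     // Sorts entries from highest to lowest score.
///     // `total_cmp` gives a total order on f64, so no unwrap is needed.
///     fn sort_by_score_desc(&mut self) {
///         self.entries.sort_by(|a, b| b.score.total_cmp(&a.score));
///     }
///
///     // Returns the document keys in their current order.
///     fn keys(&self) -> Vec<&str> {
///         self.entries.iter().map(|e| e.key.as_str()).collect()
///     }
/// }
///
/// fn main() {
///     let mut hits = ResultList {
///         entries: vec![
///             Entry { key: "doc-a".to_string(), score: 0.2 },
///             Entry { key: "doc-b".to_string(), score: 0.9 },
///             Entry { key: "doc-c".to_string(), score: 0.5 },
///         ],
///     };
///     hits.sort_by_score_desc();
///     assert_eq!(hits.keys(), ["doc-b", "doc-c", "doc-a"]);
/// }
/// ```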
pub use vectorizer::evaluate::scoring::{Hits, HitEntry};