tf_idf_vectorizer/lib.rs
//! # TF-IDF Vectorizer
//!
//! This crate provides a **document analysis engine** based on a highly customizable
//! **TF-IDF vectorizer**.
//!
//! It is designed for:
//! - Full-text search engines
//! - Document similarity analysis
//! - Large-scale corpus processing
//!
//! ## Architecture Overview
//!
//! The crate is composed of the following core concepts:
//!
//! - **Corpus**: Global document-frequency statistics (IDF base)
//! - **TermFrequency**: Per-document term statistics (TF base)
//! - **TFIDFVectorizer**: Converts documents into sparse TF-IDF vectors
//! - **TFIDFEngine**: Pluggable TF / IDF calculation strategy
//! - **SimilarityAlgorithm**: Multiple scoring algorithms (Cosine, Dot, BM25-like)
//!
//! ## Example
//!
//! ```rust
//! use std::sync::Arc;
//!
//! use half::f16;
//! use tf_idf_vectorizer::{Corpus, Query, SimilarityAlgorithm, TFIDFVectorizer, TermFrequency};
//!
//! fn main() {
//!     // Build the shared corpus (global document-frequency statistics).
//!     let corpus = Arc::new(Corpus::new());
//!
//!     // Collect per-document term frequencies.
//!     // Terms can be in any language: 高速 = "fast", 並列 = "parallel".
//!     let mut freq1 = TermFrequency::new();
//!     freq1.add_terms(&["rust", "高速", "並列", "rust"]);
//!     // 柔軟 = "flexible", 安全 = "safe".
//!     let mut freq2 = TermFrequency::new();
//!     freq2.add_terms(&["rust", "柔軟", "安全", "rust"]);
//!
//!     // Add documents to the vectorizer; documents can be deleted and new ones added later.
//!     let mut vectorizer: TFIDFVectorizer<f16> = TFIDFVectorizer::new(corpus);
//!     vectorizer.add_doc("doc1".to_string(), &freq1);
//!     vectorizer.add_doc("doc2".to_string(), &freq2);
//!     vectorizer.del_doc(&"doc1".to_string());
//!     vectorizer.add_doc("doc3".to_string(), &freq1);
//!
//!     // Search with a query that combines two terms with a logical AND.
//!     let query = Query::and(Query::term("rust"), Query::term("安全"));
//!     let algorithm = SimilarityAlgorithm::CosineSimilarity;
//!     let mut result = vectorizer.search(&algorithm, query);
//!     result.sort_by_score_desc();
//!
//!     // Print the ranked results.
//!     println!("Search Results: \n{}", result);
//!     // Debug output.
//!     println!("result count: {}", result.list.len());
//!     println!("{:?}", vectorizer);
//! }
//! ```
//!
//! ## Thread Safety
//!
//! - `Corpus` is thread-safe and can be shared across vectorizers
//! - Designed for parallel indexing and search workloads
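/*!
The sharing pattern described above can be illustrated with a minimal, std-only sketch.
It does not use this crate's API: `SharedStats`, `add_document`, and `index_in_parallel`
are hypothetical names, and a `Mutex<HashMap>` is an assumed stand-in for whatever
concurrent map the real `Corpus` uses internally.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in for shared document-frequency statistics.
struct SharedStats {
    doc_count: AtomicU64,
    doc_freq: Mutex<HashMap<String, u64>>,
}

impl SharedStats {
    fn new() -> Self {
        SharedStats {
            doc_count: AtomicU64::new(0),
            doc_freq: Mutex::new(HashMap::new()),
        }
    }

    // Register one document, given its distinct terms.
    // Takes `&self`, so it can be called through a shared `Arc`.
    fn add_document(&self, unique_terms: &[&str]) {
        self.doc_count.fetch_add(1, Ordering::Relaxed);
        let mut df = self.doc_freq.lock().unwrap();
        for term in unique_terms {
            *df.entry((*term).to_string()).or_insert(0) += 1;
        }
    }
}

// Spawn `threads` workers that each register `docs_per_thread` copies of a
// two-term document against the same shared statistics.
fn index_in_parallel(threads: usize, docs_per_thread: usize) -> Arc<SharedStats> {
    let stats = Arc::new(SharedStats::new());
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let stats = Arc::clone(&stats);
            thread::spawn(move || {
                for _ in 0..docs_per_thread {
                    stats.add_document(&["rust", "safe"]);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    stats
}
```

Because every mutation goes through `&self`, each worker only needs its own `Arc` clone;
no worker ever takes exclusive ownership of the statistics.
*/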
//!
//! ## Serialization
//!
//! - `TFIDFVectorizer` and `TFIDFData` support serialization
//! - `TFIDFData` does **not** hold a `Corpus` reference and is suitable for storage

pub mod vectorizer;
pub mod utils;

#[doc = "## Core Vectorizer"]
/// TF-IDF Vectorizer
///
/// The top-level struct of this crate, providing the main TF-IDF vectorizer features.
///
/// It converts a document collection into TF-IDF vectors and supports similarity
/// computation and search functionality.
///
/// ### Internals
/// - Corpus vocabulary
/// - Sparse TF vectors per document
/// - Term index mapping
/// - Cached IDF vector
/// - Pluggable TF-IDF engine
/// - Inverted document index
///
/// ### Type Parameters
/// - `N`: Vector parameter type (e.g., `f32`, `f64`, `u16`)
/// - `K`: Document key type (e.g., `String`, `usize`)
/// - `E`: TF-IDF calculation engine
///
/// ### Notes
/// - Requires an `Arc<Corpus>` on construction
/// - `Corpus` can be shared across multiple vectorizers
///
/// ### Serialization
/// Supported.
/// Serialized data includes the `Corpus` reference.
///
/// For corpus-independent storage, use [`TFIDFData`].
pub use vectorizer::TFIDFVectorizer;

#[doc = "## Serializable Data Structures"]
/// TF-IDF Vectorizer Data Structure (Corpus-free)
///
/// A compact, serializable representation of a TF-IDF vectorizer.
///
/// Unlike [`TFIDFVectorizer`], this struct does **not** hold a `Corpus` reference.
/// It can be converted back into a `TFIDFVectorizer` by providing an `Arc<Corpus>`.
///
/// ### Use Cases
/// - Persistent storage
/// - Network transfer
/// - Memory-efficient snapshots
///
/// ### Serialization
/// Supported.
///
/// ### Deserialization
/// Supported, including internal data expansion.
pub use vectorizer::serde::TFIDFData;

#[doc = "## Corpus & Statistics"]
/// Corpus for TF-IDF Vectorizer
///
/// Manages global document-frequency statistics required for IDF calculation.
///
/// This struct does **not** store document text or identifiers.
/// It only tracks:
/// - Total number of documents
/// - Number of documents containing each term
///
/// ### Thread Safety
/// - Fully thread-safe
/// - Implemented using `DashMap` and atomics
///
/// ### Notes
/// - Must be shared via `Arc<Corpus>`
/// - Can be reused across multiple vectorizers
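/**
To make the role of the two tracked statistics concrete, here is a minimal, std-only
sketch of how an IDF value can be derived from them. It does not use this crate's API:
`CorpusStats`, `add_document`, and `idf` are hypothetical names, and the smoothed formula
`ln((1 + N) / (1 + df)) + 1` is one common choice assumed for illustration. The engine
actually plugged into the vectorizer may compute IDF differently.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-in for the two statistics a corpus tracks.
struct CorpusStats {
    doc_count: u64,                  // total number of documents (N)
    doc_freq: HashMap<String, u64>,  // documents containing each term (df)
}

impl CorpusStats {
    fn new() -> Self {
        CorpusStats { doc_count: 0, doc_freq: HashMap::new() }
    }

    // Register one document. Each distinct term counts once,
    // however often it repeats inside the document.
    fn add_document(&mut self, terms: &[&str]) {
        self.doc_count += 1;
        let unique: HashSet<&str> = terms.iter().copied().collect();
        for term in unique {
            *self.doc_freq.entry(term.to_string()).or_insert(0) += 1;
        }
    }

    // Smoothed IDF (assumed formula): ln((1 + N) / (1 + df)) + 1.
    // Terms never seen get df = 0 and therefore the largest weight.
    fn idf(&self, term: &str) -> f64 {
        let df = self.doc_freq.get(term).copied().unwrap_or(0);
        ((1.0 + self.doc_count as f64) / (1.0 + df as f64)).ln() + 1.0
    }
}
```

Note that no document text or key is stored: only the counters change when a document
is registered, which is why such statistics stay small even for large collections.
*/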
pub use vectorizer::corpus::Corpus;

/// Term Frequency Structure
///
/// Manages per-document term statistics used for TF calculation.
///
/// Tracks:
/// - Term occurrence counts
/// - Total term count in the document
///
/// ### Use Cases
/// - TF calculation
/// - Term-level statistics
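/**
The two tracked values are enough to compute a relative term frequency. The following
std-only sketch shows the idea; it does not use this crate's API. `TermCounts` and `tf`
are hypothetical names, `add_terms` merely mirrors the method name used in the crate
example, and `count / total` is an assumed TF definition that a real engine may replace.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for per-document term statistics.
#[derive(Default)]
struct TermCounts {
    counts: HashMap<String, u32>, // occurrences of each term
    total: u32,                   // total number of terms in the document
}

impl TermCounts {
    // Count every term occurrence, including repeats.
    fn add_terms(&mut self, terms: &[&str]) {
        for term in terms {
            *self.counts.entry((*term).to_string()).or_insert(0) += 1;
            self.total += 1;
        }
    }

    // Relative term frequency (assumed definition): count / total.
    // Returns 0.0 for an empty document or an unseen term.
    fn tf(&self, term: &str) -> f64 {
        if self.total == 0 {
            return 0.0;
        }
        let count = self.counts.get(term).copied().unwrap_or(0);
        f64::from(count) / f64::from(self.total)
    }
}
```
*/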
pub use vectorizer::term::TermFrequency;

#[doc = "## TF-IDF Engines"]
/// TF-IDF Calculation Engine Trait
///
/// Defines the behavior of a TF-IDF calculation engine.
///
/// Custom engines can be implemented and plugged into
/// [`TFIDFVectorizer`].
///
/// A default implementation, [`DefaultTFIDFEngine`], is provided.
///
/// ### Supported Numeric Types
/// - `f16`
/// - `f32`
/// - `f64`
/// - `u8`
/// - `u16`
/// - `u32`
pub use vectorizer::tfidf::{DefaultTFIDFEngine, TFIDFEngine};

#[doc = "## Similarity & Search"]
/// Similarity Algorithm
///
/// Defines scoring algorithms used during search.
///
/// ### Variants
/// - `Contains`: Term containment check
/// - `Dot`: Dot product (suited to long documents)
/// - `Cosine`: Cosine similarity (suited to proper nouns)
/// - `BM25Like`: BM25-inspired scoring
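/**
For readers unfamiliar with the dot-product and cosine scores named above, here is a
minimal, std-only sketch over sparse vectors. It does not reflect this crate's internal
representation: `SparseVec`, `dot`, `norm`, and `cosine` are hypothetical names, and
a `HashMap` from term index to weight is an assumed stand-in for a sparse TF-IDF vector.

```rust
use std::collections::HashMap;

// Hypothetical sparse vector: term index -> TF-IDF weight.
type SparseVec = HashMap<u32, f32>;

// Dot product: sum of weight products over terms present in both vectors.
fn dot(a: &SparseVec, b: &SparseVec) -> f32 {
    // Iterate over the smaller map and probe the larger one.
    let (small, large) = if a.len() <= b.len() { (a, b) } else { (b, a) };
    small
        .iter()
        .filter_map(|(index, w)| large.get(index).map(|v| w * v))
        .sum()
}

// Euclidean length of a sparse vector.
fn norm(v: &SparseVec) -> f32 {
    v.values().map(|w| w * w).sum::<f32>().sqrt()
}

// Cosine similarity: dot product divided by both lengths.
// Defined as 0.0 when either vector is empty or all-zero.
fn cosine(a: &SparseVec, b: &SparseVec) -> f32 {
    let denom = norm(a) * norm(b);
    if denom == 0.0 {
        0.0
    } else {
        dot(a, b) / denom
    }
}
```

The dot product grows with vector magnitude, while cosine divides that magnitude out,
so the two scores can rank documents of very different lengths differently.
*/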
pub use vectorizer::evaluate::scoring::SimilarityAlgorithm;

/// Query Structure
///
/// Represents a search query with logical filtering conditions.
pub use vectorizer::evaluate::query::Query;

/// Search Results
///
/// - `Hits`: A collection of ranked search results
/// - `HitEntry`: A single search result entry
pub use vectorizer::evaluate::scoring::{Hits, HitEntry};