//! # TF-IDF Vectorizer
//!
//! This crate provides a **document analysis engine** based on a highly customizable
//! **TF-IDF vectorizer**.
//!
//! It is designed for:
//! - Full-text search engines
//! - Document similarity analysis
//! - Large-scale corpus processing
//!
//! ## Architecture Overview
//!
//! The crate is built around the following core concepts:
//!
//! - **Corpus**: Global document-frequency statistics (IDF base)
//! - **TermFrequency**: Per-document term statistics (TF base)
//! - **TFIDFVectorizer**: Converts documents into sparse TF-IDF vectors
//! - **TFIDFEngine**: Pluggable TF / IDF calculation strategy
//! - **SimilarityAlgorithm**: Multiple scoring algorithms (Cosine, Dot, BM25-like)
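//!
//! To make the TF / IDF split concrete, here is a minimal standalone sketch of the
//! weighting, using only the standard library. It assumes a textbook smoothed IDF,
//! `ln((1 + N) / (1 + df)) + 1`; the strategy actually used by this crate's
//! `TFIDFEngine` may differ. The function names `term_frequency`,
//! `inverse_document_frequency`, and `tf_idf` are hypothetical and are not part of
//! this crate's API.
//!
//! ```rust
//! use std::collections::HashMap;
//!
//! /// Term frequency: occurrences of each term divided by the document length.
//! fn term_frequency(tokens: &[&str]) -> HashMap<String, f64> {
//!     let mut counts: HashMap<String, f64> = HashMap::new();
//!     for token in tokens {
//!         *counts.entry((*token).to_string()).or_insert(0.0) += 1.0;
//!     }
//!     let total = tokens.len() as f64;
//!     for value in counts.values_mut() {
//!         *value /= total;
//!     }
//!     counts
//! }
//!
//! /// Smoothed inverse document frequency (an assumed formula, not the crate's):
//! /// ln((1 + doc_count) / (1 + doc_freq)) + 1.
//! fn inverse_document_frequency(doc_count: usize, doc_freq: usize) -> f64 {
//!     ((1.0 + doc_count as f64) / (1.0 + doc_freq as f64)).ln() + 1.0
//! }
//!
//! /// TF-IDF weights of one document, given corpus-wide document frequencies.
//! fn tf_idf(
//!     tokens: &[&str],
//!     doc_count: usize,
//!     doc_freqs: &HashMap<String, usize>,
//! ) -> HashMap<String, f64> {
//!     term_frequency(tokens)
//!         .into_iter()
//!         .map(|(term, tf)| {
//!             // Terms unseen in the corpus get a document frequency of 0.
//!             let df = doc_freqs.get(&term).copied().unwrap_or(0);
//!             let weight = tf * inverse_document_frequency(doc_count, df);
//!             (term, weight)
//!         })
//!         .collect()
//! }
//! ```
//!
//! In this sketch the per-document side (`term_frequency`) and the corpus-wide side
//! (`inverse_document_frequency`) stay independent, which mirrors the
//! `TermFrequency` / `Corpus` separation described above.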
//!
//! ## Example
//!
//! ```rust
//! use std::sync::Arc;
//!
//! use half::f16;
//! use tf_idf_vectorizer::{Corpus, SimilarityAlgorithm, TFIDFVectorizer, TermFrequency, vectorizer::evaluate::query::Query};
//!
//! fn main() {
//!     // Build a corpus; the `Arc` lets several vectorizers share it.
//!     let corpus = Arc::new(Corpus::new());
//!
//!     // Collect per-document term frequencies. Terms need not be ASCII
//!     // ("高速" = fast, "並列" = parallel, "柔軟" = flexible, "安全" = safe).
//!     let mut freq1 = TermFrequency::new();
//!     freq1.add_terms(&["rust", "高速", "並列", "rust"]);
//!     let mut freq2 = TermFrequency::new();
//!     freq2.add_terms(&["rust", "柔軟", "安全", "rust"]);
//!
//!     // Add documents to the vectorizer; documents can also be deleted and re-added.
//!     let mut vectorizer: TFIDFVectorizer<f16> = TFIDFVectorizer::new(corpus);
//!     vectorizer.add_doc("doc1".to_string(), &freq1);
//!     vectorizer.add_doc("doc2".to_string(), &freq2);
//!     vectorizer.del_doc(&"doc1".to_string());
//!     vectorizer.add_doc("doc3".to_string(), &freq1);
//!
//!     // Query for both terms and rank the matches by cosine similarity.
//!     let query = Query::and(Query::term("rust"), Query::term("安全"));
//!     let algorithm = SimilarityAlgorithm::CosineSimilarity;
//!     let mut result = vectorizer.search(&algorithm, query);
//!     result.sort_by_score_desc();
//!
//!     // Print the results.
//!     println!("Search Results: \n{}", result);
//!     // Debug output.
//!     println!("result count: {}", result.list.len());
//!     println!("{:?}", vectorizer);
//! }
//! ```
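//!
//! The example above ranks results with `SimilarityAlgorithm::CosineSimilarity`. As a
//! rough illustration of what cosine scoring over sparse vectors involves, the
//! following standalone sketch uses plain `HashMap`s as stand-ins for sparse TF-IDF
//! vectors. The function `cosine_similarity` is hypothetical; it is not this crate's
//! implementation, whose vector representation and numeric type may differ.
//!
//! ```rust
//! use std::collections::HashMap;
//!
//! /// Cosine similarity between two sparse vectors keyed by term.
//! /// Returns 0.0 when either vector has zero magnitude.
//! fn cosine_similarity(a: &HashMap<&str, f64>, b: &HashMap<&str, f64>) -> f64 {
//!     // Only terms present in both vectors contribute to the dot product.
//!     let dot: f64 = a
//!         .iter()
//!         .filter_map(|(term, wa)| b.get(term).map(|wb| wa * wb))
//!         .sum();
//!     let norm_a = a.values().map(|w| w * w).sum::<f64>().sqrt();
//!     let norm_b = b.values().map(|w| w * w).sum::<f64>().sqrt();
//!     if norm_a == 0.0 || norm_b == 0.0 {
//!         0.0
//!     } else {
//!         dot / (norm_a * norm_b)
//!     }
//! }
//! ```
//!
//! Because the score is normalized by both magnitudes, a long document does not
//! outrank a short one merely by repeating terms, which is the usual reason to
//! prefer cosine over a raw dot product.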
//!
//! ## Thread Safety
//!
//! - `Corpus` is thread-safe and can be shared across vectorizers (the example above wraps it in an `Arc`)
//! - Designed for parallel indexing and search workloads
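//!
//! The sharing pattern can be sketched with the standard library alone. The type
//! `SharedStats` and the function `index_in_parallel` below are hypothetical
//! stand-ins, not this crate's API, and the `RwLock` is an assumption made for the
//! sketch; `Corpus` may achieve thread safety by other means.
//!
//! ```rust
//! use std::collections::HashMap;
//! use std::sync::{Arc, RwLock};
//! use std::thread;
//!
//! /// Hypothetical stand-in for shared document-frequency statistics.
//! #[derive(Default)]
//! struct SharedStats {
//!     doc_freqs: RwLock<HashMap<String, usize>>,
//! }
//!
//! impl SharedStats {
//!     /// Records one document's distinct terms under a write lock.
//!     fn add_document(&self, terms: &[&str]) {
//!         let mut guard = self.doc_freqs.write().unwrap();
//!         for term in terms {
//!             *guard.entry((*term).to_string()).or_insert(0) += 1;
//!         }
//!     }
//!
//!     /// Number of recorded documents containing `term` (0 if unseen).
//!     fn doc_freq(&self, term: &str) -> usize {
//!         self.doc_freqs.read().unwrap().get(term).copied().unwrap_or(0)
//!     }
//! }
//!
//! /// Indexes each document on its own thread against one shared instance.
//! fn index_in_parallel(stats: &Arc<SharedStats>, docs: Vec<Vec<&'static str>>) {
//!     let handles: Vec<_> = docs
//!         .into_iter()
//!         .map(|doc| {
//!             // Each thread gets its own handle to the same statistics.
//!             let stats = Arc::clone(stats);
//!             thread::spawn(move || stats.add_document(&doc))
//!         })
//!         .collect();
//!     for handle in handles {
//!         handle.join().unwrap();
//!     }
//! }
//! ```
//!
//! Cloning the `Arc` copies only the handle, so every indexing thread updates the
//! same statistics rather than a private copy.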
//!
//! ## Serialization
//!
//! - `TFIDFVectorizer` and `TFIDFData` support serialization
//! - `TFIDFData` does **not** hold a `Corpus` reference and is suitable for storage
pub use TFIDFVectorizer;
pub use TFIDFData;
pub use Corpus;
pub use TermFrequency;
pub use ;
pub use SimilarityAlgorithm;
pub use Query;
pub use ;