Module count_vectorizer

Count vectorizer: convert text documents to a term-count matrix.

Tokenizes each document into runs of two or more word characters (the Rust analog of scikit-learn’s default `token_pattern=r"(?u)\b\w\w+\b"`, `sklearn/feature_extraction/text.py:1161`), builds an alphabetically sorted vocabulary from those tokens, and produces a term-count matrix of shape `(n_docs, n_vocab)`.
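The pipeline described above can be sketched with the standard library alone. This is an illustrative approximation, not this module's implementation: the names `sketch_tokenize`, `fit_vocabulary`, and `count_matrix` are hypothetical, and `char::is_alphanumeric` plus `_` is assumed to be a close (not exact) stand-in for Python's Unicode `\w`.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical helper: split `doc` into runs of 2+ word characters
/// (alphanumeric or `_`), optionally lowercasing first.
fn sketch_tokenize(doc: &str, lowercase: bool) -> Vec<String> {
    let text = if lowercase { doc.to_lowercase() } else { doc.to_string() };
    text.split(|c: char| !(c.is_alphanumeric() || c == '_'))
        // Runs shorter than two characters (including empty runs) are dropped.
        .filter(|run| run.chars().count() >= 2)
        .map(str::to_string)
        .collect()
}

/// Hypothetical helper: map each distinct token to its column index.
/// `BTreeSet` iterates in sorted order, so indices follow the sorted terms.
fn fit_vocabulary(docs: &[&str]) -> BTreeMap<String, usize> {
    let terms: BTreeSet<String> = docs
        .iter()
        .flat_map(|doc| sketch_tokenize(doc, true))
        .collect();
    terms.into_iter().enumerate().map(|(i, t)| (t, i)).collect()
}

/// Hypothetical helper: dense (n_docs, n_vocab) matrix of term counts.
/// Tokens that are not in `vocab` are ignored.
fn count_matrix(docs: &[&str], vocab: &BTreeMap<String, usize>) -> Vec<Vec<u64>> {
    docs.iter()
        .map(|doc| {
            let mut row = vec![0u64; vocab.len()];
            for token in sketch_tokenize(doc, true) {
                if let Some(&col) = vocab.get(&token) {
                    row[col] += 1;
                }
            }
            row
        })
        .collect()
}
```

For the two documents `"The cat sat"` and `"a cat is a cat cat_1"`, this sketch drops the length-1 token `a`, keeps `cat_1` as a single token because `_` counts as a word character, and yields the sorted vocabulary `cat, cat_1, is, sat, the`.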

Translation target: scikit-learn 1.5.2 class CountVectorizer (text.py:929). Design: .design/preprocess/count_vectorizer.md. Tracking: #1216.

## REQ status

| REQ | Status | Anchor |
| --- | --- | --- |
| REQ-1 default fit/transform, sorted vocab, count matrix | SHIPPED (scoped: dense) | `CountVectorizer::fit` / `FittedCountVectorizer::transform`; sklearn `_count_vocab` `text.py:1242-1305` |
| REQ-2 default token_pattern (drop length-1, `_` word char) | SHIPPED (#1217) | `fn tokenize`; sklearn `text.py:1161`, `build_tokenizer:350` |
| REQ-3 binary count clipping | SHIPPED | `FittedCountVectorizer::transform`; sklearn `text.py:1374` |
| REQ-4 lowercase toggle | SHIPPED | `fn tokenize`; sklearn `text.py:1157`,`:323` |
| REQ-5 max_df/min_df int-vs-float duality + threshold errors | NOT-STARTED (#1219; ceil sub-fix shipped #1218; max_df<min_df + post-prune empty-vocab errors shipped #2337) | `fit` df-filter; sklearn `text.py:1379-1382`,`:1236-1239` |
| REQ-6 ngram_range word n-grams | NOT-STARTED (#1220) | sklearn `_word_ngrams` `text.py:242` |
| REQ-7 max_features top-N + tie/sort | SHIPPED (scoped) | `fit`; sklearn `_limit_features` `text.py:1222-1227` |
| REQ-8 tokenizer/token_pattern/preprocessor/analyzer/strip_accents | NOT-STARTED (#1221) | sklearn `build_analyzer` `text.py:419` |
| REQ-9 stop_words | NOT-STARTED (#1222) | sklearn `get_stop_words` `text.py:370` |
| REQ-10 fixed vocabulary param + dtype | NOT-STARTED (#1223) | sklearn `_count_vocab` `text.py:1242-1244`,`:1147` |
| REQ-11 sparse CSR output | NOT-STARTED (#1224) | sklearn `_count_vocab` `text.py:1299-1304` |
| REQ-12 get_feature_names_out contract | NOT-STARTED (#1225) | sklearn `text.py:1455` |
| REQ-13 HashingVectorizer | NOT-STARTED (#1226) | sklearn `class HashingVectorizer` `text.py:562` |
| REQ-14 full 16-param ctor + _parameter_constraints | NOT-STARTED (#1227) | sklearn `text.py:1124-1148` |
| REQ-14a empty-vocabulary ValueError parity (post-tokenize + max_df<min_df + post-prune) | SHIPPED (#2336 #2337) | `CountVectorizer::fit` empty-vocab/`max_df`/post-prune `Err(InvalidParameter)`; sklearn `text.py:1277-1279`,`:1381-1382`,`:1236-1239`. Consumer: crate re-export `pub use count_vectorizer::CountVectorizer` (`lib.rs`). |
| REQ-15 PyO3 binding | NOT-STARTED (#1228) | `ferrolearn-python/src/transformers.rs` (absent) |
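
The int-vs-float duality in REQ-5 and the two threshold errors in REQ-14a can be illustrated as follows. This is a sketch of the scikit-learn semantics as I understand them (an integer threshold is an absolute document count, a float is a fraction of `n_docs`), not this crate's code: `DfThreshold`, `resolve`, and `keep_mask` are hypothetical names, and the error type is simplified to `String` rather than the crate's `InvalidParameter`.

```rust
/// Hypothetical type: a document-frequency bound given either as an
/// absolute document count or as a fraction of the corpus size.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DfThreshold {
    Count(usize),
    Fraction(f64),
}

impl DfThreshold {
    /// Convert the bound to a document count for a corpus of `n_docs`.
    fn resolve(self, n_docs: usize) -> f64 {
        match self {
            DfThreshold::Count(c) => c as f64,
            DfThreshold::Fraction(f) => f * n_docs as f64,
        }
    }
}

/// Hypothetical helper: decide which terms survive the df filter.
/// `doc_freqs[i]` is the number of documents containing term `i`.
fn keep_mask(
    doc_freqs: &[usize],
    min_df: DfThreshold,
    max_df: DfThreshold,
    n_docs: usize,
) -> Result<Vec<bool>, String> {
    let low = min_df.resolve(n_docs);
    let high = max_df.resolve(n_docs);
    // Error 1: the resolved bounds are contradictory.
    if high < low {
        return Err("max_df corresponds to fewer documents than min_df".to_string());
    }
    let mask: Vec<bool> = doc_freqs
        .iter()
        .map(|&df| {
            let df = df as f64;
            df >= low && df <= high
        })
        .collect();
    // Error 2: pruning removed every term.
    if !mask.iter().any(|&keep| keep) {
        return Err("no terms remain after pruning".to_string());
    }
    Ok(mask)
}
```

With document frequencies `[1, 2, 3, 4]` over four documents, `min_df = Count(2)` and `max_df = Fraction(0.75)` resolve to the bounds 2 and 3, so only the middle two terms are kept.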

Structs

CountVectorizer: An unfitted count vectorizer.
FittedCountVectorizer: A fitted count vectorizer holding the learned vocabulary.