Module count_vectorizer


Count vectorizer: convert text documents to a term-count matrix.

Tokenizes documents into runs of 2+ word characters (the Rust analog of scikit-learn’s default token_pattern=r"(?u)\b\w\w+\b", sklearn/feature_extraction/text.py:1161), builds an alphabetically sorted vocabulary, and produces a term-count matrix of shape (n_docs, n_vocab).
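The pipeline described above can be sketched with the standard library alone. This is a minimal illustration, not the module's actual implementation: the free functions `is_word_char`, `tokenize`, and `fit_transform` and their signatures are assumptions made for this example, and `char::is_alphanumeric` plus `_` is only an approximation of Python's Unicode `\w`.

```rust
use std::collections::BTreeMap;

/// Approximation of Python's Unicode `\w` (assumption for this sketch).
fn is_word_char(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Split a document into maximal runs of word characters and keep
/// only the runs that are at least 2 characters long.
fn tokenize(doc: &str, lowercase: bool) -> Vec<String> {
    let text = if lowercase { doc.to_lowercase() } else { doc.to_string() };
    text.split(|c: char| !is_word_char(c))
        .filter(|run| run.chars().count() >= 2)
        .map(str::to_string)
        .collect()
}

/// Hypothetical one-shot helper: returns the sorted vocabulary and a
/// dense (n_docs, n_vocab) count matrix as nested vectors.
fn fit_transform(docs: &[&str]) -> (Vec<String>, Vec<Vec<usize>>) {
    let tokenized: Vec<Vec<String>> = docs.iter().map(|d| tokenize(d, true)).collect();

    // A BTreeMap iterates its keys in sorted order, so column indices
    // assigned in iteration order give a sorted vocabulary.
    let mut vocab: BTreeMap<String, usize> = BTreeMap::new();
    for tok in tokenized.iter().flatten() {
        vocab.entry(tok.clone()).or_insert(0);
    }
    for (idx, slot) in vocab.values_mut().enumerate() {
        *slot = idx;
    }

    // Count each token occurrence into its document row and term column.
    let mut counts = vec![vec![0usize; vocab.len()]; docs.len()];
    for (row, toks) in tokenized.iter().enumerate() {
        for tok in toks {
            counts[row][vocab[tok]] += 1;
        }
    }
    (vocab.into_keys().collect(), counts)
}
```

For the two documents `"The cat sat"` and `"a cat, the CAT!"`, this sketch drops the length-1 token `a`, lowercases `CAT`, and yields the vocabulary `cat`, `sat`, `the` with one row of counts per document.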

Translation target: scikit-learn 1.5.2 class CountVectorizer (text.py:929). Design: .design/preprocess/count_vectorizer.md. Tracking: #1216.

## REQ status

| REQ | Status | Anchor |
|---|---|---|
| REQ-1 default fit/transform, sorted vocab, count matrix | SHIPPED (scoped: dense) | CountVectorizer::fit / FittedCountVectorizer::transform; sklearn _count_vocab text.py:1242-1305 |
| REQ-2 default token_pattern (drop length-1, _ word char) | SHIPPED (#1217) | fn tokenize; sklearn text.py:1161, build_tokenizer:350 |
| REQ-3 binary count clipping | SHIPPED | FittedCountVectorizer::transform; sklearn text.py:1374 |
| REQ-4 lowercase toggle | SHIPPED | fn tokenize; sklearn text.py:1157,:323 |
| REQ-5 max_df/min_df int-vs-float duality + threshold errors | NOT-STARTED (#1219; ceil sub-fix shipped #1218; max_df<min_df + post-prune empty-vocab errors shipped #2337) | fit df-filter; sklearn text.py:1379-1382,:1236-1239 |
| REQ-6 ngram_range word n-grams | NOT-STARTED (#1220) | sklearn _word_ngrams text.py:242 |
| REQ-7 max_features top-N + tie/sort | SHIPPED (scoped) | fit; sklearn _limit_features text.py:1222-1227 |
| REQ-8 tokenizer/token_pattern/preprocessor/analyzer/strip_accents | NOT-STARTED (#1221) | sklearn build_analyzer text.py:419 |
| REQ-9 stop_words | NOT-STARTED (#1222) | sklearn get_stop_words text.py:370 |
| REQ-10 fixed vocabulary param + dtype | NOT-STARTED (#1223) | sklearn _count_vocab text.py:1242-1244,:1147 |
| REQ-11 sparse CSR output | NOT-STARTED (#1224) | sklearn _count_vocab text.py:1299-1304 |
| REQ-12 get_feature_names_out contract | NOT-STARTED (#1225) | sklearn text.py:1455 |
| REQ-13 HashingVectorizer | NOT-STARTED (#1226) | sklearn class HashingVectorizer text.py:562 |
| REQ-14 full 16-param ctor + _parameter_constraints | NOT-STARTED (#1227) | sklearn text.py:1124-1148 |
| REQ-14a empty-vocabulary ValueError parity (post-tokenize + max_df<min_df + post-prune) | SHIPPED (#2336 #2337) | CountVectorizer::fit empty-vocab/max_df/post-prune Err(InvalidParameter); sklearn text.py:1277-1279,:1381-1382,:1236-1239. Consumer: crate re-export pub use count_vectorizer::CountVectorizer (lib.rs). |
| REQ-15 PyO3 binding | NOT-STARTED (#1228) | ferrolearn-python/src/transformers.rs (absent) |
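To make the REQ-5 "int-vs-float duality" concrete: in scikit-learn, an integer min_df or max_df is an absolute document count, while a float is a fraction of the corpus, and fit fails when the resolved maximum falls below the resolved minimum. The sketch below shows one way that rule could be modeled in Rust. The `DfThreshold` enum, the `df_mask` function, and the `String` error are all hypothetical names chosen for this example; they are not this crate's API, which reports such failures as Err(InvalidParameter).

```rust
/// Hypothetical threshold type: an absolute document count
/// or a fraction of the corpus size.
#[derive(Clone, Copy, Debug)]
enum DfThreshold {
    Count(usize),
    Fraction(f64),
}

impl DfThreshold {
    /// Resolve to a real-valued document count for `n_docs` documents.
    fn resolve(self, n_docs: usize) -> f64 {
        match self {
            DfThreshold::Count(c) => c as f64,
            DfThreshold::Fraction(f) => f * n_docs as f64,
        }
    }
}

/// Build a keep-mask over terms from their document frequencies.
/// Returns an error when the resolved max threshold is below the min.
fn df_mask(
    dfs: &[usize],
    min_df: DfThreshold,
    max_df: DfThreshold,
    n_docs: usize,
) -> Result<Vec<bool>, String> {
    let lo = min_df.resolve(n_docs);
    let hi = max_df.resolve(n_docs);
    if hi < lo {
        return Err(format!("max_df resolves to {hi} documents, below min_df at {lo}"));
    }
    // A term survives when its document frequency lies in [lo, hi].
    Ok(dfs
        .iter()
        .map(|&df| {
            let df = df as f64;
            df >= lo && df <= hi
        })
        .collect())
}
```

Because document frequencies are integers, comparing them against the real-valued product has the same effect as rounding a fractional minimum up and a fractional maximum down.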

## Structs

CountVectorizer
An unfitted count vectorizer.
FittedCountVectorizer
A fitted count vectorizer holding the learned vocabulary.