Count vectorizer: convert text documents to a term-count matrix.
Tokenizes documents into runs of 2+ word characters (the Rust analog of
scikit-learn’s default token_pattern=r"(?u)\b\w\w+\b",
sklearn/feature_extraction/text.py:1161), builds an alphabetically-sorted
vocabulary, and produces a term-count matrix of shape (n_docs, n_vocab).
Translation target: scikit-learn 1.5.2 class CountVectorizer (text.py:929).
Design: .design/preprocess/count_vectorizer.md. Tracking: #1216.
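
The default pipeline described above (lowercase, keep runs of 2+ word characters, sort the vocabulary, count per document) can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: the function names `sketch_tokenize`, `sketch_build_vocabulary`, and `sketch_count_matrix` are hypothetical, and the sketch assumes that a "word character" is `char::is_alphanumeric` or `_`, which may differ from Python's `\w` on some Unicode edge cases.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical helper: split `doc` into maximal runs of word characters
/// (alphanumeric or `_`) and keep only runs of at least two characters.
fn sketch_tokenize(doc: &str, lowercase: bool) -> Vec<String> {
    let text = if lowercase { doc.to_lowercase() } else { doc.to_string() };
    text.split(|c: char| !(c.is_alphanumeric() || c == '_'))
        .filter(|run| run.chars().count() >= 2)
        .map(str::to_string)
        .collect()
}

/// Hypothetical helper: collect every token into a sorted set, then assign
/// column indices in sorted order.
fn sketch_build_vocabulary(docs: &[&str]) -> BTreeMap<String, usize> {
    let terms: BTreeSet<String> = docs
        .iter()
        .flat_map(|d| sketch_tokenize(d, true))
        .collect();
    terms.into_iter().enumerate().map(|(i, t)| (t, i)).collect()
}

/// Hypothetical helper: build a dense (n_docs, n_vocab) matrix of term counts.
/// Tokens missing from `vocab` are ignored.
fn sketch_count_matrix(docs: &[&str], vocab: &BTreeMap<String, usize>) -> Vec<Vec<u32>> {
    docs.iter()
        .map(|d| {
            let mut row = vec![0u32; vocab.len()];
            for tok in sketch_tokenize(d, true) {
                if let Some(&col) = vocab.get(&tok) {
                    row[col] += 1;
                }
            }
            row
        })
        .collect()
}
```

A `BTreeSet` is used here so that the sorted order falls out of the data structure; sorting a `Vec` and deduplicating would work equally well.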
## REQ status
| REQ | Status | Anchor |
|---|---|---|
| REQ-1 default fit/transform, sorted vocab, count matrix | SHIPPED (scoped: dense) | CountVectorizer::fit / FittedCountVectorizer::transform; sklearn _count_vocab text.py:1242-1305 |
| REQ-2 default token_pattern (drop length-1, _ word char) | SHIPPED (#1217) | fn tokenize; sklearn text.py:1161, build_tokenizer:350 |
| REQ-3 binary count clipping | SHIPPED | FittedCountVectorizer::transform; sklearn text.py:1374 |
| REQ-4 lowercase toggle | SHIPPED | fn tokenize; sklearn text.py:1157,:323 |
| REQ-5 max_df/min_df int-vs-float duality + threshold errors | NOT-STARTED (#1219; ceil sub-fix shipped #1218; max_df<min_df + post-prune empty-vocab errors shipped #2337) | fit df-filter; sklearn text.py:1379-1382,:1236-1239 |
| REQ-6 ngram_range word n-grams | NOT-STARTED (#1220) | sklearn _word_ngrams text.py:242 |
| REQ-7 max_features top-N + tie/sort | SHIPPED (scoped) | fit; sklearn _limit_features text.py:1222-1227 |
| REQ-8 tokenizer/token_pattern/preprocessor/analyzer/strip_accents | NOT-STARTED (#1221) | sklearn build_analyzer text.py:419 |
| REQ-9 stop_words | NOT-STARTED (#1222) | sklearn get_stop_words text.py:370 |
| REQ-10 fixed vocabulary param + dtype | NOT-STARTED (#1223) | sklearn _count_vocab text.py:1242-1244,:1147 |
| REQ-11 sparse CSR output | NOT-STARTED (#1224) | sklearn _count_vocab text.py:1299-1304 |
| REQ-12 get_feature_names_out contract | NOT-STARTED (#1225) | sklearn text.py:1455 |
| REQ-13 HashingVectorizer | NOT-STARTED (#1226) | sklearn class HashingVectorizer text.py:562 |
| REQ-14 full 16-param ctor + _parameter_constraints | NOT-STARTED (#1227) | sklearn text.py:1124-1148 |
| REQ-14a empty-vocabulary ValueError parity (post-tokenize + max_df<min_df + post-prune) | SHIPPED (#2336 #2337) | CountVectorizer::fit empty-vocab/max_df/post-prune Err(InvalidParameter); sklearn text.py:1277-1279,:1381-1382,:1236-1239. Consumer: crate re-export pub use count_vectorizer::CountVectorizer (lib.rs). |
| REQ-15 PyO3 binding | NOT-STARTED (#1228) | ferrolearn-python/src/transformers.rs (absent) |
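
As an illustration of REQ-3, binary count clipping replaces every non-zero count with 1, so the matrix records presence rather than frequency. The sketch below assumes a dense `Vec<Vec<u32>>` matrix and uses a hypothetical function name, `sketch_clip_to_binary`; it does not reflect the actual signature of `FittedCountVectorizer::transform`.

```rust
/// Hypothetical helper: map every non-zero count to 1 and leave zeros as 0.
fn sketch_clip_to_binary(counts: &[Vec<u32>]) -> Vec<Vec<u32>> {
    counts
        .iter()
        .map(|row| row.iter().map(|&c| u32::from(c > 0)).collect())
        .collect()
}
```

The input is left untouched and a new matrix is returned; an in-place variant would avoid the allocation at the cost of losing the original counts.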
## Structs
- CountVectorizer - An unfitted count vectorizer.
- FittedCountVectorizer - A fitted count vectorizer holding the learned vocabulary.