pub struct CountVectorizer { /* private fields */ }
Converts text documents into a sparse term-count matrix.
Each document becomes a row, each unique token a column. Cell values are the number of times that token appears in that document.
§Example
use scry_learn::text::CountVectorizer;
let mut cv = CountVectorizer::new();
let docs = ["the cat sat", "the dog sat", "the cat played"];
let matrix = cv.fit_transform(&docs);
assert_eq!(matrix.n_rows(), 3);
assert_eq!(matrix.n_cols(), cv.vocabulary().len());
Implementations§
impl CountVectorizer
pub fn min_df(self, n: usize) -> Self
Sets the minimum document frequency as an absolute count. Tokens that appear in fewer than n documents are excluded. Default: 1.
pub fn max_df(self, frac: f64) -> Self
Set maximum document frequency as a fraction in (0.0, 1.0].
Tokens appearing in more than this fraction of documents are
excluded. Default: 1.0 (no filtering).
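To make the two document-frequency filters concrete, here is a minimal standard-library sketch of the idea. It does not use scry_learn: the helper names `document_frequencies` and `filter_by_df` are hypothetical, and whitespace tokenization is an assumption, since the crate's tokenizer is not described here.

```rust
use std::collections::{HashMap, HashSet};

/// Count, for each token, how many documents contain it at least once.
/// Tokenization is plain whitespace splitting (an assumption).
fn document_frequencies(docs: &[&str]) -> HashMap<String, usize> {
    let mut df: HashMap<String, usize> = HashMap::new();
    for doc in docs {
        // Deduplicate within a document so repeated tokens count once.
        let unique: HashSet<&str> = doc.split_whitespace().collect();
        for tok in unique {
            *df.entry(tok.to_string()).or_insert(0) += 1;
        }
    }
    df
}

/// Keep tokens found in at least `min_df` documents and in at most
/// `max_df` (a fraction) of all documents. The result is sorted.
fn filter_by_df(docs: &[&str], min_df: usize, max_df: f64) -> Vec<String> {
    let n_docs = docs.len() as f64;
    let mut kept: Vec<String> = document_frequencies(docs)
        .into_iter()
        .filter(|(_, count)| *count >= min_df && (*count as f64) <= max_df * n_docs)
        .map(|(tok, _)| tok)
        .collect();
    kept.sort();
    kept
}
```

With the three documents from the example above, `filter_by_df(&docs, 2, 1.0)` keeps "cat", "sat", and "the", while `filter_by_df(&docs, 1, 0.7)` drops "the" because it appears in every document.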
pub fn ngram_range(self, min_n: usize, max_n: usize) -> Self
Sets the range of n-gram lengths to extract, from min_n to max_n. Default: (1, 1) (unigrams only).
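As an illustration of what an n-gram range produces, the following sketch builds word n-grams with the standard library only. The function `word_ngrams` is hypothetical, and both the whitespace tokenization and the single-space joiner are assumptions rather than the crate's documented behavior.

```rust
/// Build all word n-grams with lengths in `min_n..=max_n` (both inclusive),
/// joining tokens with a single space. Whitespace tokenization is assumed.
fn word_ngrams(text: &str, min_n: usize, max_n: usize) -> Vec<String> {
    let tokens: Vec<&str> = text.split_whitespace().collect();
    let mut out = Vec::new();
    for n in min_n..=max_n {
        // `windows` panics on a size of 0, so skip that case.
        if n == 0 {
            continue;
        }
        // Yields nothing when `n` exceeds the number of tokens.
        for window in tokens.windows(n) {
            out.push(window.join(" "));
        }
    }
    out
}
```

For "the cat sat", a range of (1, 2) yields the three unigrams followed by "the cat" and "cat sat", whereas (1, 1) yields the unigrams alone.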
pub fn max_features(self, n: usize) -> Self
Limit vocabulary to the top n features by total frequency.
Default: no limit.
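Selecting the top n features by total frequency could look like the sketch below. It is standard-library only, the name `top_n_tokens` is hypothetical, and the alphabetical tie-break is an assumption added for determinism; how the crate orders tokens with equal counts is not stated here.

```rust
use std::collections::HashMap;

/// Return the `n` tokens with the highest total count across all documents.
/// Whitespace tokenization is assumed.
fn top_n_tokens(docs: &[&str], n: usize) -> Vec<String> {
    let mut totals: HashMap<&str, usize> = HashMap::new();
    for doc in docs {
        for tok in doc.split_whitespace() {
            *totals.entry(tok).or_insert(0) += 1;
        }
    }
    let mut ranked: Vec<(&str, usize)> = totals.into_iter().collect();
    // Highest count first; ties broken alphabetically (an assumed rule).
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(b.0)));
    ranked
        .into_iter()
        .take(n)
        .map(|(tok, _)| tok.to_string())
        .collect()
}
```

Asking for more features than the corpus contains simply returns every token, since `take` stops at the end of the list.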
pub fn binary(self, b: bool) -> Self
If true, all non-zero counts become 1 (presence/absence). Default: false.
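The difference between raw counts and binary output can be sketched on a single dense row. This uses only the standard library; `count_row` and `binarize` are hypothetical helpers, and the dense `Vec<usize>` stands in for one row of the sparse matrix purely for illustration.

```rust
/// Count how often each vocabulary term occurs in `doc`.
/// Whitespace tokenization is assumed.
fn count_row(doc: &str, vocab: &[&str]) -> Vec<usize> {
    vocab
        .iter()
        .map(|term| doc.split_whitespace().filter(|tok| tok == term).count())
        .collect()
}

/// Clamp every non-zero count to 1, leaving zeros untouched.
fn binarize(counts: &[usize]) -> Vec<usize> {
    counts.iter().map(|&c| c.min(1)).collect()
}
```

For the document "the cat saw the other cat" and the vocabulary ["cat", "dog", "the"], the count row is [2, 0, 2] and its binary form is [1, 0, 1].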
pub fn transform<S: AsRef<str>>(&self, documents: &[S]) -> CsrMatrix
Transform documents into a sparse CSR matrix of counts.
Panics if fit() has not been called.
pub fn fit_transform<S: AsRef<str>>(&mut self, documents: &[S]) -> CsrMatrix
Fit the vocabulary and transform in one step.
pub fn vocabulary(&self) -> &HashMap<String, usize>
Return the learned vocabulary (token → column index).
pub fn get_feature_names(&self) -> Vec<String>
Return feature names sorted by column index.
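Because a `HashMap` has no stable iteration order, recovering names in column order requires sorting by the stored index. A minimal sketch of that step, using a hypothetical free function `feature_names` over the same `HashMap<String, usize>` shape that `vocabulary()` returns:

```rust
use std::collections::HashMap;

/// Return the tokens of a token -> column-index map, ordered by column index.
fn feature_names(vocabulary: &HashMap<String, usize>) -> Vec<String> {
    let mut pairs: Vec<(&String, &usize)> = vocabulary.iter().collect();
    // Sort by the column index, not by the token text.
    pairs.sort_by_key(|(_, idx)| **idx);
    pairs.into_iter().map(|(tok, _)| tok.clone()).collect()
}
```

The position of each name in the returned vector then matches the column it labels in the count matrix.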
pub fn n_features(&self) -> usize
Number of features in the vocabulary.
Trait Implementations§
impl Clone for CountVectorizer
fn clone(&self) -> CountVectorizer
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
impl Debug for CountVectorizer
Auto Trait Implementations§
impl Freeze for CountVectorizer
impl RefUnwindSafe for CountVectorizer
impl Send for CountVectorizer
impl Sync for CountVectorizer
impl Unpin for CountVectorizer
impl UnsafeUnpin for CountVectorizer
impl UnwindSafe for CountVectorizer
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T
where
    T: Clone,
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.