Module text

Text processing and feature extraction for NLP tasks.

Provides tokenization, count-based vectorization, and TF-IDF weighting. All vectorizers return sparse matrices in compressed sparse row (CSR) format, represented as crate::sparse::CsrMatrix.
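To make the pipeline concrete, here is a minimal, standalone sketch of tokenization followed by count vectorization into CSR arrays. It does not use this crate's types: the `Csr` struct, `tokenize`, and `count_vectorize` are hypothetical names for illustration, and the tokenization rule (lowercase, split on non-alphanumeric characters) and the sorted vocabulary order are assumptions, not documented behavior of `CountVectorizer`.

```rust
use std::collections::BTreeMap;

/// Hypothetical minimal CSR container for this sketch
/// (not the crate's `CsrMatrix`).
#[derive(Debug, PartialEq)]
struct Csr {
    n_rows: usize,
    n_cols: usize,
    indptr: Vec<usize>,  // row i occupies indices[indptr[i]..indptr[i + 1]]
    indices: Vec<usize>, // column index of each stored value
    data: Vec<f64>,      // stored (nonzero) values
}

/// Assumed tokenization rule: lowercase, split on non-alphanumeric characters.
fn tokenize(doc: &str) -> Vec<String> {
    doc.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Build a sorted vocabulary and a CSR matrix of raw term counts.
fn count_vectorize(docs: &[&str]) -> (Vec<String>, Csr) {
    let tokenized: Vec<Vec<String>> = docs.iter().map(|d| tokenize(d)).collect();

    // Pass 1: collect every distinct term; BTreeMap keeps them sorted.
    let mut vocab: BTreeMap<String, usize> = BTreeMap::new();
    for toks in &tokenized {
        for t in toks {
            vocab.entry(t.clone()).or_insert(0);
        }
    }
    // Assign column indices in sorted term order.
    for (i, idx) in vocab.values_mut().enumerate() {
        *idx = i;
    }

    // Pass 2: count terms per document and append one CSR row each.
    let mut indptr = vec![0];
    let mut indices = Vec::new();
    let mut data = Vec::new();
    for toks in &tokenized {
        let mut row: BTreeMap<usize, f64> = BTreeMap::new();
        for t in toks {
            *row.entry(vocab[t]).or_insert(0.0) += 1.0;
        }
        for (col, count) in row {
            indices.push(col);
            data.push(count);
        }
        indptr.push(indices.len());
    }

    let terms: Vec<String> = vocab.into_keys().collect();
    let csr = Csr {
        n_rows: docs.len(),
        n_cols: terms.len(),
        indptr,
        indices,
        data,
    };
    (terms, csr)
}
```

Under these assumptions, the three documents "the cat sat", "the dog sat", and "the cat played" yield the vocabulary cat, dog, played, sat, the, and each row stores three counts of 1.0. Only nonzero counts are stored, which is why CSR suits text data where most vocabulary terms are absent from any given document.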

§Example

use scry_learn::text::{CountVectorizer, TfidfVectorizer};

let docs = ["the cat sat", "the dog sat", "the cat played"];

// Count vectorizer
let mut cv = CountVectorizer::new();
let counts = cv.fit_transform(&docs);

// TF-IDF vectorizer
let mut tfidf = TfidfVectorizer::new();
let matrix = tfidf.fit_transform(&docs);
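As a rough illustration of what TF-IDF weighting does to a count matrix, the sketch below reweights CSR count data in place. It is standalone and does not reflect `TfidfVectorizer`'s actual formula, which this page does not state: the smoothed IDF `ln((1 + n) / (1 + df)) + 1` and the L2 row normalization are assumptions borrowed from a common convention, and `tfidf_in_place` is a hypothetical name.

```rust
/// Apply smoothed IDF weighting and L2 row normalization to CSR count
/// data in place. Assumes each column appears at most once per row and
/// that every stored value is nonzero.
fn tfidf_in_place(n_cols: usize, indptr: &[usize], indices: &[usize], data: &mut [f64]) {
    let n_docs = indptr.len().saturating_sub(1);

    // Document frequency: number of rows in which each column is stored.
    let mut df = vec![0usize; n_cols];
    for &col in indices {
        df[col] += 1;
    }

    // Assumed smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    let idf: Vec<f64> = df
        .iter()
        .map(|&d| ((1.0 + n_docs as f64) / (1.0 + d as f64)).ln() + 1.0)
        .collect();

    for row in 0..n_docs {
        let (start, end) = (indptr[row], indptr[row + 1]);

        // Scale each term count by the IDF of its column.
        for k in start..end {
            data[k] *= idf[indices[k]];
        }

        // Normalize the row to unit L2 length (skip empty rows).
        let norm = data[start..end].iter().map(|v| v * v).sum::<f64>().sqrt();
        if norm > 0.0 {
            for v in &mut data[start..end] {
                *v /= norm;
            }
        }
    }
}
```

With the three example documents, "the" occurs in every document and so receives the lowest weight in each row, while terms that occur in a single document, such as "dog", receive the highest.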

Re-exports§

pub use count::CountVectorizer;
pub use tfidf::TfidfNorm;
pub use tfidf::TfidfVectorizer;

Modules§

count: Count-based text vectorizer.
tfidf: TF-IDF text vectorizer.
tokenizer: Text tokenization utilities.

Functions§

sparse_to_dataset: Convert a sparse CSR matrix (from a text vectorizer) into a Dataset.
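The signature of `sparse_to_dataset` and the layout of `Dataset` are not shown on this page. As a loose sketch of the kind of conversion involved, the standalone function below expands CSR arrays into dense rows; `csr_to_dense` is a hypothetical helper, and whether the real function densifies at all is an assumption.

```rust
/// Expand CSR arrays into dense rows, filling absent entries with 0.0.
/// Hypothetical helper; not the crate's `sparse_to_dataset`.
fn csr_to_dense(
    n_cols: usize,
    indptr: &[usize],
    indices: &[usize],
    data: &[f64],
) -> Vec<Vec<f64>> {
    indptr
        .windows(2) // each window is the [start, end) range of one row
        .map(|w| {
            let mut row = vec![0.0; n_cols];
            for k in w[0]..w[1] {
                row[indices[k]] = data[k];
            }
            row
        })
        .collect()
}
```

Dense expansion costs memory proportional to rows times vocabulary size, so it is practical mainly for small vocabularies or after feature selection.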