Text vectorization for machine learning.
This module provides vectorization tools to convert text documents into numerical feature vectors suitable for machine learning models:
- CountVectorizer: Bag of Words representation (word counts)
- TfidfVectorizer: TF-IDF weighted features (term frequency-inverse document frequency)
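To make the two representations concrete, here is a minimal standard-library sketch of word counts and TF-IDF weighting. It is not aprender's implementation: the helper names `build_vocabulary`, `count_matrix`, and `tfidf_matrix` are hypothetical, tokens are split on whitespace only, and the IDF uses the textbook form ln(N / df), which may differ from the smoothing or normalization the crate applies.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: map every whitespace-separated token to a column index.
/// Columns are assigned in sorted (alphabetical) order so the layout is deterministic.
fn build_vocabulary(documents: &[&str]) -> BTreeMap<String, usize> {
    let mut vocab: BTreeMap<String, usize> = BTreeMap::new();
    for doc in documents {
        for token in doc.split_whitespace() {
            vocab.entry(token.to_string()).or_insert(0);
        }
    }
    // BTreeMap iterates in key order, so enumerate() yields sorted column indices.
    for (index, slot) in vocab.values_mut().enumerate() {
        *slot = index;
    }
    vocab
}

/// Bag of Words: one row per document, one column per vocabulary term.
fn count_matrix(documents: &[&str], vocab: &BTreeMap<String, usize>) -> Vec<Vec<f64>> {
    documents
        .iter()
        .map(|doc| {
            let mut row = vec![0.0; vocab.len()];
            for token in doc.split_whitespace() {
                // Tokens outside the vocabulary are ignored.
                if let Some(&col) = vocab.get(token) {
                    row[col] += 1.0;
                }
            }
            row
        })
        .collect()
}

/// TF-IDF: scale each count by idf(t) = ln(N / df(t)) (assumed textbook form).
fn tfidf_matrix(counts: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let n_docs = counts.len() as f64;
    let n_terms = counts.first().map_or(0, Vec::len);
    let idf: Vec<f64> = (0..n_terms)
        .map(|col| {
            // Document frequency: number of documents containing the term.
            let df = counts.iter().filter(|row| row[col] > 0.0).count() as f64;
            if df > 0.0 { (n_docs / df).ln() } else { 0.0 }
        })
        .collect();
    counts
        .iter()
        .map(|row| row.iter().zip(&idf).map(|(tf, w)| tf * w).collect())
        .collect()
}
```

With this unsmoothed IDF, a term that appears in every document gets weight zero, while a term unique to one document keeps a positive weight. That is the intuition behind preferring TF-IDF over raw counts when common words dominate.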
§Design Principles
- Zero unwrap() calls (Cloudflare-class safety)
- Result-based error handling
- Comprehensive test coverage (≥95%)
- Integration with tokenizers and stop words
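The first two principles can be illustrated with a small sketch in which every fallible step returns a `Result` and nothing calls `unwrap()`. The `VectorizeError` enum and `TinyVectorizer` struct below are hypothetical names invented for this example; they are not aprender's actual error type or vectorizer.

```rust
use std::fmt;

/// Hypothetical error type for illustration only.
#[derive(Debug, PartialEq)]
enum VectorizeError {
    EmptyCorpus,
    NotFitted,
}

impl fmt::Display for VectorizeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::EmptyCorpus => write!(f, "cannot fit on an empty corpus"),
            Self::NotFitted => write!(f, "transform called before fit"),
        }
    }
}

impl std::error::Error for VectorizeError {}

/// Hypothetical minimal vectorizer; the vocabulary is `None` until `fit` succeeds.
struct TinyVectorizer {
    vocabulary: Option<Vec<String>>,
}

impl TinyVectorizer {
    fn new() -> Self {
        Self { vocabulary: None }
    }

    /// Learn a sorted, de-duplicated vocabulary; an empty corpus is an error, not a panic.
    fn fit(&mut self, documents: &[&str]) -> Result<(), VectorizeError> {
        if documents.is_empty() {
            return Err(VectorizeError::EmptyCorpus);
        }
        let mut terms: Vec<String> = documents
            .iter()
            .flat_map(|doc| doc.split_whitespace())
            .map(str::to_string)
            .collect();
        terms.sort();
        terms.dedup();
        self.vocabulary = Some(terms);
        Ok(())
    }

    /// Count known terms per document; `?` propagates the error instead of unwrapping.
    fn transform(&self, documents: &[&str]) -> Result<Vec<Vec<usize>>, VectorizeError> {
        let vocab = self.vocabulary.as_ref().ok_or(VectorizeError::NotFitted)?;
        let rows = documents
            .iter()
            .map(|doc| {
                let mut row = vec![0usize; vocab.len()];
                for token in doc.split_whitespace() {
                    // The vocabulary is sorted, so binary search finds the column.
                    if let Ok(col) = vocab.binary_search_by(|term| term.as_str().cmp(token)) {
                        row[col] += 1;
                    }
                }
                row
            })
            .collect();
        Ok(rows)
    }
}
```

Callers can then match on the error or forward it with `?`, so misuse such as calling `transform` before `fit` surfaces as a value rather than a panic.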
§Quick Start
use aprender::text::vectorize::CountVectorizer;
use aprender::text::tokenize::WhitespaceTokenizer;

let documents = vec![
    "the cat sat on the mat",
    "the dog sat on the log",
];

let mut vectorizer = CountVectorizer::new()
    .with_tokenizer(Box::new(WhitespaceTokenizer::new()));

let matrix = vectorizer.fit_transform(&documents).expect("vectorization should succeed");
// matrix shape: (2 documents, vocabulary_size features)
§Structs
- CountVectorizer - Bag of Words vectorizer that converts text to word count matrix.
- HashingVectorizer - Stateless hashing vectorizer for streaming/large-scale text.
- TfidfVectorizer - TF-IDF vectorizer that converts text to TF-IDF weighted matrix.
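The "stateless" property of a hashing vectorizer comes from the hashing trick: each token is hashed straight to a column index, so no vocabulary has to be learned or stored. The sketch below shows the idea using only the standard library. The function names `fnv1a` and `hash_vectorize` are hypothetical, and the FNV-1a hash is an assumption chosen because it is simple and stable across builds; the hash function and any sign or normalization scheme that HashingVectorizer actually uses are not stated here.

```rust
/// FNV-1a 64-bit hash (assumed for illustration; stable across platforms and builds).
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &byte in bytes {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Hypothetical helper: map one document to a fixed-width count vector.
/// Returns `None` when `n_features` is zero, since no column could be chosen.
fn hash_vectorize(document: &str, n_features: usize) -> Option<Vec<f64>> {
    if n_features == 0 {
        return None;
    }
    let mut row = vec![0.0; n_features];
    for token in document.split_whitespace() {
        // The column depends only on the token, never on previously seen documents.
        let col = (fnv1a(token.as_bytes()) % n_features as u64) as usize;
        row[col] += 1.0;
    }
    Some(row)
}
```

Because no fitting step exists, documents can be vectorized one at a time as they stream in. The trade-off is that distinct tokens may collide in the same column, and columns cannot be mapped back to terms.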