Module vectorize

Text vectorization for machine learning.

This module provides vectorization tools to convert text documents into numerical feature vectors suitable for machine learning models:

CountVectorizer: Bag of Words representation (word counts)
TfidfVectorizer: TF-IDF weighted features (term frequency-inverse document frequency)
HashingVectorizer: Stateless hashing vectorizer for streaming/large-scale text
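
To make the two representations concrete, here is a minimal standard-library sketch of what a bag-of-words count matrix and a TF-IDF weighting compute. It is not aprender's implementation: the helper names `build_vocabulary`, `count_matrix`, and `tfidf_matrix` are hypothetical, tokenization is plain whitespace splitting, and the weight is the unsmoothed textbook form idf = ln(N / df). The library's own weighting may differ, for example through smoothing or normalization.

```rust
use std::collections::BTreeMap;

/// Build a vocabulary (term -> column index) from whitespace-split tokens.
/// Columns are assigned in alphabetical order (hypothetical helper).
fn build_vocabulary(docs: &[&str]) -> BTreeMap<String, usize> {
    let mut vocab: BTreeMap<String, usize> = BTreeMap::new();
    for doc in docs {
        for token in doc.split_whitespace() {
            vocab.entry(token.to_string()).or_insert(0);
        }
    }
    // BTreeMap iterates in key order, so indices follow the sorted terms.
    for (index, slot) in vocab.values_mut().enumerate() {
        *slot = index;
    }
    vocab
}

/// Bag of Words: one row per document, one column per vocabulary term.
fn count_matrix(docs: &[&str], vocab: &BTreeMap<String, usize>) -> Vec<Vec<f64>> {
    docs.iter()
        .map(|doc| {
            let mut row = vec![0.0; vocab.len()];
            for token in doc.split_whitespace() {
                if let Some(&col) = vocab.get(token) {
                    row[col] += 1.0;
                }
            }
            row
        })
        .collect()
}

/// TF-IDF: scale each count by idf = ln(N / df), where df is the number
/// of documents containing the term (assumed unsmoothed variant).
fn tfidf_matrix(counts: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let n_docs = counts.len() as f64;
    let n_terms = counts.first().map_or(0, |row| row.len());
    let idf: Vec<f64> = (0..n_terms)
        .map(|col| {
            let df = counts.iter().filter(|row| row[col] > 0.0).count() as f64;
            if df > 0.0 { (n_docs / df).ln() } else { 0.0 }
        })
        .collect();
    counts
        .iter()
        .map(|row| row.iter().zip(&idf).map(|(tf, w)| tf * w).collect())
        .collect()
}
```

For the two Quick Start documents, this sketch yields a seven-term vocabulary. A term that occurs in every document, such as "the", receives a weight of zero under this unsmoothed formula, while a term unique to one document, such as "cat", is weighted by ln 2.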

§Design Principles

Zero unwrap() calls (Cloudflare-class safety)
Result-based error handling
Comprehensive test coverage (≥95%)
Integration with tokenizers and stop words

§Quick Start

use aprender::text::vectorize::CountVectorizer;
use aprender::text::tokenize::WhitespaceTokenizer;

let documents = vec![
    "the cat sat on the mat",
    "the dog sat on the log",
];

let mut vectorizer = CountVectorizer::new()
    .with_tokenizer(Box::new(WhitespaceTokenizer::new()));

let matrix = vectorizer.fit_transform(&documents).expect("vectorization should succeed");
// matrix shape: (2 documents, vocabulary_size features)

Structs§

CountVectorizer: Bag of Words vectorizer that converts text to a word count matrix.
HashingVectorizer: Stateless hashing vectorizer for streaming/large-scale text.
TfidfVectorizer: TF-IDF vectorizer that converts text to a TF-IDF weighted matrix.
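
The "stateless" property of a hashing vectorizer can be illustrated with a short standard-library sketch of the hashing trick. This is an assumption about the general technique, not a description of how `HashingVectorizer` is implemented: the function `hash_document`, the use of `DefaultHasher`, and whitespace tokenization are all hypothetical choices made for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash every whitespace-separated token of one document into a
/// fixed-width count vector (hypothetical helper).
/// Returns None when `n_features` is zero, since no bucket exists.
fn hash_document(doc: &str, n_features: usize) -> Option<Vec<f64>> {
    if n_features == 0 {
        return None;
    }
    let mut row = vec![0.0; n_features];
    for token in doc.split_whitespace() {
        // The column depends only on the token itself, so no vocabulary
        // has to be fitted or stored between documents.
        let mut hasher = DefaultHasher::new();
        token.hash(&mut hasher);
        let col = (hasher.finish() % n_features as u64) as usize;
        row[col] += 1.0;
    }
    Some(row)
}
```

Because no vocabulary is learned, each document can be vectorized independently as it arrives, which is what makes the approach attractive for streaming input. The trade-off is that distinct tokens may collide in the same column, and columns cannot be mapped back to terms.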
