Module vectorize

Text vectorization for machine learning.

This module provides vectorizers that convert text documents into numerical feature vectors suitable for machine learning models:

  • CountVectorizer: Bag of Words representation (word counts)
  • TfidfVectorizer: TF-IDF weighted features (term frequency-inverse document frequency)
  • HashingVectorizer: stateless feature hashing for streaming or large-scale text
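To make the TF-IDF weighting concrete, here is a minimal standalone sketch that uses only the standard library. It assumes whitespace tokenization and the unsmoothed textbook weight `tf * ln(N / df)`. The function names `term_counts` and `tfidf` are hypothetical, and the formula TfidfVectorizer actually applies (for example, any smoothing or normalization) may differ.

```rust
use std::collections::BTreeMap;

/// Count how often each whitespace-separated token occurs in one document.
fn term_counts(doc: &str) -> BTreeMap<&str, usize> {
    let mut counts = BTreeMap::new();
    for token in doc.split_whitespace() {
        *counts.entry(token).or_insert(0) += 1;
    }
    counts
}

/// Weight raw counts by ln(N / df), the unsmoothed textbook IDF.
/// Assumption: this is an illustration, not the crate's actual formula.
fn tfidf(docs: &[&str]) -> Vec<BTreeMap<String, f64>> {
    let n_docs = docs.len() as f64;
    let counts: Vec<BTreeMap<&str, usize>> =
        docs.iter().map(|&d| term_counts(d)).collect();

    // Document frequency: the number of documents that contain each term.
    let mut df: BTreeMap<&str, usize> = BTreeMap::new();
    for doc_counts in &counts {
        for &term in doc_counts.keys() {
            *df.entry(term).or_insert(0) += 1;
        }
    }

    counts
        .iter()
        .map(|doc_counts| {
            doc_counts
                .iter()
                .map(|(&term, &tf)| {
                    let idf = (n_docs / df[term] as f64).ln();
                    (term.to_string(), tf as f64 * idf)
                })
                .collect()
        })
        .collect()
}
```

With this unsmoothed form, a term that occurs in every document gets a weight of zero, which is the intended effect of the weighting: words shared by the whole corpus carry no discriminating signal.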

§Design Principles

  • Zero unwrap() calls (Cloudflare-class safety)
  • Result-based error handling
  • Comprehensive test coverage (≥95%)
  • Integration with tokenizers and stop words
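The first two principles can be illustrated with a small sketch in which invalid input is reported through `Result` rather than a panic. The `VectorizeError` enum and both functions are hypothetical names introduced for illustration only. They are not the crate's real error type or API.

```rust
use std::collections::BTreeSet;
use std::fmt;

/// Hypothetical error type for this sketch; not the crate's real error enum.
#[derive(Debug, PartialEq)]
enum VectorizeError {
    EmptyCorpus,
    EmptyVocabulary,
}

impl fmt::Display for VectorizeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            VectorizeError::EmptyCorpus => write!(f, "no documents were provided"),
            VectorizeError::EmptyVocabulary => write!(f, "documents contained no tokens"),
        }
    }
}

impl std::error::Error for VectorizeError {}

/// Build a sorted vocabulary, reporting bad input as an `Err` instead of panicking.
fn build_vocabulary(docs: &[&str]) -> Result<Vec<String>, VectorizeError> {
    if docs.is_empty() {
        return Err(VectorizeError::EmptyCorpus);
    }
    let vocab: BTreeSet<&str> = docs.iter().flat_map(|d| d.split_whitespace()).collect();
    if vocab.is_empty() {
        return Err(VectorizeError::EmptyVocabulary);
    }
    Ok(vocab.into_iter().map(String::from).collect())
}

/// Callers choose how to recover; here any failure falls back to a size of 0.
fn vocabulary_size_or_zero(docs: &[&str]) -> usize {
    match build_vocabulary(docs) {
        Ok(vocab) => vocab.len(),
        Err(_) => 0,
    }
}
```

Because every failure path is a value, the caller can match on it, propagate it with `?`, or substitute a default, and no code path needs `unwrap()`.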

§Quick Start

use aprender::text::vectorize::CountVectorizer;
use aprender::text::tokenize::WhitespaceTokenizer;

let documents = vec![
    "the cat sat on the mat",
    "the dog sat on the log",
];

let mut vectorizer = CountVectorizer::new()
    .with_tokenizer(Box::new(WhitespaceTokenizer::new()));

let matrix = vectorizer.fit_transform(&documents).expect("vectorization should succeed");
// matrix shape: (2 documents, vocabulary_size features)
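To show where the `(2 documents, vocabulary_size features)` shape comes from, the following standalone sketch builds a dense count matrix for the same two documents using only the standard library. It assumes whitespace tokenization and an alphabetically sorted vocabulary. The `count_matrix` function is a hypothetical helper, and the column order and matrix type that CountVectorizer actually produces may differ.

```rust
use std::collections::BTreeSet;

/// Build a dense (documents x vocabulary) count matrix with whitespace tokenization.
/// Assumption: columns follow the alphabetically sorted vocabulary.
fn count_matrix(docs: &[&str]) -> (Vec<String>, Vec<Vec<u32>>) {
    // Sorted, de-duplicated vocabulary; column j counts occurrences of vocab[j].
    let vocab: Vec<&str> = docs
        .iter()
        .flat_map(|d| d.split_whitespace())
        .collect::<BTreeSet<_>>()
        .into_iter()
        .collect();

    // One row per document, one column per vocabulary entry.
    let rows: Vec<Vec<u32>> = docs
        .iter()
        .map(|d| {
            let mut row = vec![0u32; vocab.len()];
            for token in d.split_whitespace() {
                if let Ok(col) = vocab.binary_search(&token) {
                    row[col] += 1;
                }
            }
            row
        })
        .collect();

    (vocab.into_iter().map(String::from).collect(), rows)
}
```

For the two Quick Start documents this yields the seven-term vocabulary `cat, dog, log, mat, on, sat, the`, so the matrix has 2 rows and 7 columns, and the `the` column holds 2 in both rows.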

§Structs

CountVectorizer
Bag of Words vectorizer that converts text to a word-count matrix.
HashingVectorizer
Stateless hashing vectorizer for streaming/large-scale text.
TfidfVectorizer
TF-IDF vectorizer that converts text to a TF-IDF-weighted matrix.
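As a rough illustration of why a hashing vectorizer can be stateless, the sketch below maps each token straight to a column index with a hash function, so no vocabulary has to be fitted or stored. It uses the standard library's `DefaultHasher`. The `hash_counts` function is a hypothetical helper, and the hash function, signature, and collision handling used by HashingVectorizer are not shown in this page and may differ.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Map each whitespace token to one of `n_features` buckets and count hits per bucket.
/// Returns `None` when `n_features` is zero, since no bucket could hold a count.
fn hash_counts(doc: &str, n_features: usize) -> Option<Vec<u32>> {
    if n_features == 0 {
        return None;
    }
    let mut row = vec![0u32; n_features];
    for token in doc.split_whitespace() {
        // A fresh hasher per token, so the bucket depends only on the token itself.
        let mut hasher = DefaultHasher::new();
        token.hash(&mut hasher);
        let bucket = (hasher.finish() % n_features as u64) as usize;
        row[bucket] += 1;
    }
    Some(row)
}
```

The output width is fixed by `n_features` regardless of how many distinct words appear, which suits streaming input. The trade-off is that distinct tokens may collide in the same bucket, and the original token cannot be recovered from a column index.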