Module text

Text processing and NLP utilities.

This module provides text preprocessing tools for Natural Language Processing:

Tokenization (word-level, character-level)
BPE tokenization (Byte Pair Encoding for LLMs/speech models)
Stop words filtering
Stemming (Porter stemmer)
Vectorization (Bag of Words, TF-IDF)
Sentiment analysis (lexicon-based)
Topic modeling (LDA)
Document similarity (cosine, Jaccard, edit distance)
Entity extraction (emails, URLs, mentions, hashtags)
Text summarization (TextRank, TF-IDF extractive)
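Two of the similarity measures listed above, Jaccard and cosine, are small enough to sketch with the standard library alone. The function names below (`jaccard`, `cosine`, `term_counts`, `norm`) are hypothetical and do not reflect the API of aprender's `similarity` module; the sketch only illustrates the underlying arithmetic on pre-tokenized input.

```rust
use std::collections::{HashMap, HashSet};

/// Jaccard similarity of two token lists treated as sets:
/// size of the intersection divided by size of the union.
/// Returns 0.0 when both inputs are empty.
pub fn jaccard(a: &[&str], b: &[&str]) -> f64 {
    let set_a: HashSet<&str> = a.iter().copied().collect();
    let set_b: HashSet<&str> = b.iter().copied().collect();
    let union = set_a.union(&set_b).count();
    if union == 0 {
        return 0.0;
    }
    let intersection = set_a.intersection(&set_b).count();
    intersection as f64 / union as f64
}

/// Term counts: a bag-of-words vector keyed by token.
fn term_counts<'a>(tokens: &[&'a str]) -> HashMap<&'a str, f64> {
    let mut counts = HashMap::new();
    for &t in tokens {
        *counts.entry(t).or_insert(0.0) += 1.0;
    }
    counts
}

/// Euclidean length of a bag-of-words vector.
fn norm(counts: &HashMap<&str, f64>) -> f64 {
    counts.values().map(|x| x * x).sum::<f64>().sqrt()
}

/// Cosine similarity between the bag-of-words vectors of two token lists.
/// Returns 0.0 if either vector has zero length.
pub fn cosine(a: &[&str], b: &[&str]) -> f64 {
    let ca = term_counts(a);
    let cb = term_counts(b);
    // Only tokens present in both documents contribute to the dot product.
    let dot: f64 = ca
        .iter()
        .map(|(t, x)| x * cb.get(t).copied().unwrap_or(0.0))
        .sum();
    let denom = norm(&ca) * norm(&cb);
    if denom == 0.0 {
        0.0
    } else {
        dot / denom
    }
}
```

Jaccard ignores how often a token occurs, while cosine on raw counts weights repeated tokens more heavily; which is preferable depends on the documents being compared.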

§Design Principles

Following the Toyota Way and aprender’s quality standards:

Zero unwrap() calls (Cloudflare-class safety)
Result-based error handling with AprenderError
Comprehensive test coverage (≥95%)
Property-based testing with proptest
Pure Rust implementation (no external NLP dependencies)
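As a rough illustration of the first two principles, the sketch below reports failure through `Result` and propagates it with `?` instead of calling `unwrap()`. `SketchError`, `average_token_length`, and `describe` are hypothetical names invented for this example; the real `AprenderError` type and its variants are not shown on this page and may look quite different.

```rust
use std::fmt;

/// Hypothetical error type for this sketch; NOT aprender's `AprenderError`.
#[derive(Debug, PartialEq)]
pub enum SketchError {
    /// The input contained no tokens.
    EmptyInput,
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::EmptyInput => write!(f, "input contains no tokens"),
        }
    }
}

impl std::error::Error for SketchError {}

/// Mean length, in characters, of the whitespace-separated tokens in `text`.
/// Returns an error rather than panicking or dividing by zero on empty input.
pub fn average_token_length(text: &str) -> Result<f64, SketchError> {
    let mut total_chars = 0usize;
    let mut token_count = 0usize;
    for token in text.split_whitespace() {
        total_chars += token.chars().count();
        token_count += 1;
    }
    if token_count == 0 {
        return Err(SketchError::EmptyInput);
    }
    Ok(total_chars as f64 / token_count as f64)
}

/// Callers propagate the error with `?` and let their own caller decide.
pub fn describe(text: &str) -> Result<String, SketchError> {
    let avg = average_token_length(text)?;
    Ok(format!("average token length: {:.1}", avg))
}
```

A caller can then `match` on the returned `Result` and handle `EmptyInput` explicitly, which keeps the failure path visible in the function signature.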

§Quick Start

use aprender::text::tokenize::WhitespaceTokenizer;
use aprender::text::Tokenizer;

// Tokens are split on whitespace only, so punctuation stays attached to each word.
let tokenizer = WhitespaceTokenizer::new();
let tokens = tokenizer
    .tokenize("Hello, world! This is aprender.")
    .expect("tokenize should succeed");
assert_eq!(tokens, vec!["Hello,", "world!", "This", "is", "aprender."]);
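The expected output above keeps punctuation attached to each word ("Hello,", "world!"), which is what plain whitespace splitting produces. The standard-library sketch below reproduces that output and adds a variant that trims surrounding ASCII punctuation. Both functions are hypothetical helpers written for illustration; they are not aprender's implementation of `WhitespaceTokenizer`.

```rust
/// Splits on Unicode whitespace only; punctuation stays attached to words.
pub fn split_on_whitespace(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

/// Variant that also strips leading and trailing ASCII punctuation from
/// each token and drops tokens that become empty as a result.
pub fn split_and_trim_punctuation(text: &str) -> Vec<&str> {
    text.split_whitespace()
        .map(|t| t.trim_matches(|c: char| c.is_ascii_punctuation()))
        .filter(|t| !t.is_empty())
        .collect()
}
```

Whether to trim punctuation depends on the downstream task: a bag-of-words model usually benefits from it, while sentence-boundary detection may need the punctuation kept.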

§References

Based on the comprehensive NLP specification: docs/specifications/nlp-models-techniques-spec.md

Re-exports§

pub use chat_template::auto_detect_template;
pub use chat_template::contains_injection_patterns;
pub use chat_template::create_template;
pub use chat_template::detect_format_from_name;
pub use chat_template::sanitize_user_content;
pub use chat_template::AlpacaTemplate;
pub use chat_template::ChatMLTemplate;
pub use chat_template::ChatMessage;
pub use chat_template::ChatTemplateEngine;
pub use chat_template::HuggingFaceTemplate;
pub use chat_template::Llama2Template;
pub use chat_template::MistralTemplate;
pub use chat_template::PhiTemplate;
pub use chat_template::RawTemplate;
pub use chat_template::SpecialTokens;
pub use chat_template::TemplateFormat;

Modules§

bpe: Byte Pair Encoding (BPE) tokenizer (GH-128).
chat_template: Chat Template Engine
entities: Pattern-based entity extraction.
incremental_idf: Incremental IDF (Inverse Document Frequency) tracking.
llama_tokenizer: LLaMA/TinyLlama SentencePiece-style BPE tokenizer (GH-145).
sentiment: Sentiment analysis with lexicon-based scoring.
similarity: Document similarity metrics.
stem: Stemming algorithms for text normalization.
stopwords: Stop words filtering for text preprocessing.
summarize: Extractive text summarization.
tokenize: Tokenization algorithms for text preprocessing.
topic: Topic modeling with Latent Dirichlet Allocation (LDA).
vectorize: Text vectorization for machine learning.
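For readers unfamiliar with the technique behind the `bpe` module, a single Byte Pair Encoding training step can be sketched as follows: count adjacent symbol pairs, pick the most frequent one, and merge it everywhere. This is a simplified illustration over one flat symbol sequence (real BPE trainers typically work per word across a corpus and repeat the step many times). The names `most_frequent_pair` and `merge_pair` are hypothetical and unrelated to the module's actual API.

```rust
use std::collections::BTreeMap;

/// Returns the most frequent adjacent symbol pair, or `None` if the
/// sequence has fewer than two symbols. Ties are broken in favour of the
/// lexicographically smallest pair, so the result is deterministic.
pub fn most_frequent_pair(symbols: &[String]) -> Option<(String, String)> {
    let mut counts: BTreeMap<(&str, &str), usize> = BTreeMap::new();
    for w in symbols.windows(2) {
        *counts.entry((w[0].as_str(), w[1].as_str())).or_insert(0) += 1;
    }
    let mut best: Option<((&str, &str), usize)> = None;
    // BTreeMap iterates in key order; the strict `>` keeps the first
    // (smallest) pair when counts are equal.
    for (pair, n) in counts {
        if best.map_or(true, |(_, m)| n > m) {
            best = Some((pair, n));
        }
    }
    best.map(|((a, b), _)| (a.to_string(), b.to_string()))
}

/// Replaces each non-overlapping occurrence of `pair`, scanning left to
/// right, with a single merged symbol.
pub fn merge_pair(symbols: &[String], pair: &(String, String)) -> Vec<String> {
    let mut out = Vec::with_capacity(symbols.len());
    let mut i = 0;
    while i < symbols.len() {
        if i + 1 < symbols.len() && symbols[i] == pair.0 && symbols[i + 1] == pair.1 {
            out.push(format!("{}{}", pair.0, pair.1));
            i += 2;
        } else {
            out.push(symbols[i].clone());
            i += 1;
        }
    }
    out
}
```

Repeating these two calls until a target vocabulary size is reached yields an ordered list of merges, which is the learned part of a BPE tokenizer.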

Traits§

Tokenizer: Trait for text tokenization.
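This page does not show the signature of `Tokenizer`, so the sketch below uses a stand-in trait to illustrate the general pattern: one fallible `tokenize` method with interchangeable word-level and character-level implementations. `SketchTokenizer`, `WhitespaceSketch`, `CharSketch`, the `Vec<String>` return type, and the `String` error type are all assumptions made for this example, not the real trait definition.

```rust
/// Hypothetical stand-in for a tokenizer trait; the real `Tokenizer`
/// trait may use different types and method signatures.
pub trait SketchTokenizer {
    fn tokenize(&self, text: &str) -> Result<Vec<String>, String>;
}

/// Word-level: splits on Unicode whitespace.
pub struct WhitespaceSketch;

/// Character-level: one token per non-whitespace character
/// (skipping whitespace is a choice made for this sketch).
pub struct CharSketch;

impl SketchTokenizer for WhitespaceSketch {
    fn tokenize(&self, text: &str) -> Result<Vec<String>, String> {
        Ok(text.split_whitespace().map(String::from).collect())
    }
}

impl SketchTokenizer for CharSketch {
    fn tokenize(&self, text: &str) -> Result<Vec<String>, String> {
        Ok(text
            .chars()
            .filter(|c| !c.is_whitespace())
            .map(|c| c.to_string())
            .collect())
    }
}

/// Code written against the trait works with any implementation.
pub fn count_tokens(tokenizer: &dyn SketchTokenizer, text: &str) -> Result<usize, String> {
    Ok(tokenizer.tokenize(text)?.len())
}
```

Programming against a trait like this lets downstream steps such as vectorization accept any tokenizer without knowing how it splits text.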
