Text processing and NLP utilities.
This module provides text preprocessing tools for Natural Language Processing:
- Tokenization (word-level, character-level)
- BPE tokenization (Byte Pair Encoding for LLMs/speech models)
- Stop words filtering
- Stemming (Porter stemmer)
- Vectorization (Bag of Words, TF-IDF)
- Sentiment analysis (lexicon-based)
- Topic modeling (LDA)
- Document similarity (cosine, Jaccard, edit distance)
- Entity extraction (emails, URLs, mentions, hashtags)
- Text summarization (TextRank, TF-IDF extractive)
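To make the vectorization and similarity items above concrete, here is a minimal standard-library sketch of Bag of Words counts with cosine and Jaccard similarity. The function names `bag_of_words`, `cosine_similarity`, and `jaccard_similarity` are hypothetical and chosen for illustration; they are not taken from aprender's API, which may look quite different.

```rust
use std::collections::{HashMap, HashSet};

/// Count how often each whitespace-separated, lowercased token occurs.
/// (Hypothetical helper for illustration, not an aprender function.)
fn bag_of_words(text: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for token in text.split_whitespace() {
        *counts.entry(token.to_lowercase()).or_insert(0) += 1;
    }
    counts
}

/// Cosine similarity between two term-count vectors.
/// Returns 0.0 when either vector is empty, to avoid dividing by zero.
fn cosine_similarity(a: &HashMap<String, usize>, b: &HashMap<String, usize>) -> f64 {
    // Dot product over the terms that appear in both documents.
    let dot: f64 = a
        .iter()
        .filter_map(|(term, &x)| b.get(term).map(|&y| (x * y) as f64))
        .sum();
    // Euclidean norm of a term-count vector.
    let norm = |v: &HashMap<String, usize>| {
        v.values().map(|&x| (x * x) as f64).sum::<f64>().sqrt()
    };
    let denom = norm(a) * norm(b);
    if denom == 0.0 {
        0.0
    } else {
        dot / denom
    }
}

/// Jaccard similarity |A ∩ B| / |A ∪ B| over the sets of distinct terms.
/// Returns 0.0 when both documents are empty.
fn jaccard_similarity(a: &HashMap<String, usize>, b: &HashMap<String, usize>) -> f64 {
    let set_a: HashSet<&String> = a.keys().collect();
    let set_b: HashSet<&String> = b.keys().collect();
    let union = set_a.union(&set_b).count();
    if union == 0 {
        return 0.0;
    }
    set_a.intersection(&set_b).count() as f64 / union as f64
}
```

For "the cat sat" and "the cat ran", the two documents share two of four distinct terms, so the Jaccard score is 0.5. Cosine similarity also takes term counts into account, which matters once tokens repeat.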
§Design Principles
Following the Toyota Way and aprender’s quality standards:
- Zero unwrap() calls (Cloudflare-class safety)
- Result-based error handling with AprenderError
- Comprehensive test coverage (≥95%)
- Property-based testing with proptest
- Pure Rust implementation (no external NLP dependencies)
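As an illustration of the first two principles, the sketch below propagates failures with `Result` and the `?` operator instead of calling `unwrap()`. The `SketchError` enum and the `mean_weight` function are hypothetical stand-ins; the variants and shape of the real `AprenderError` are not shown on this page and may differ.

```rust
use std::fmt;

/// Hypothetical error type for this sketch; the real `AprenderError` may differ.
#[derive(Debug, PartialEq)]
enum SketchError {
    EmptyInput,
    InvalidNumber(String),
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::EmptyInput => write!(f, "input contains no values"),
            SketchError::InvalidNumber(s) => write!(f, "not a number: {s}"),
        }
    }
}

impl std::error::Error for SketchError {}

/// Parse whitespace-separated term weights and return their mean.
/// Every failure path returns an `Err`; nothing here can panic.
fn mean_weight(input: &str) -> Result<f64, SketchError> {
    let mut sum = 0.0;
    let mut count = 0usize;
    for field in input.split_whitespace() {
        // `?` forwards the mapped error to the caller instead of unwrapping.
        let value: f64 = field
            .parse()
            .map_err(|_| SketchError::InvalidNumber(field.to_string()))?;
        sum += value;
        count += 1;
    }
    if count == 0 {
        return Err(SketchError::EmptyInput);
    }
    Ok(sum / count as f64)
}
```

Callers can then decide how to react to each variant, for example by matching on `SketchError::EmptyInput` to substitute a default.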
§Quick Start
use aprender::text::tokenize::WhitespaceTokenizer;
use aprender::text::Tokenizer;
let tokenizer = WhitespaceTokenizer::new();
let tokens = tokenizer
    .tokenize("Hello, world! This is aprender.")
    .expect("tokenize should succeed");
assert_eq!(tokens, vec!["Hello,", "world!", "This", "is", "aprender."]);
§References
Based on the comprehensive NLP specification:
docs/specifications/nlp-models-techniques-spec.md
Re-exports§
pub use chat_template::auto_detect_template;
pub use chat_template::contains_injection_patterns;
pub use chat_template::create_template;
pub use chat_template::detect_format_from_name;
pub use chat_template::sanitize_user_content;
pub use chat_template::AlpacaTemplate;
pub use chat_template::ChatMLTemplate;
pub use chat_template::ChatMessage;
pub use chat_template::ChatTemplateEngine;
pub use chat_template::HuggingFaceTemplate;
pub use chat_template::Llama2Template;
pub use chat_template::MistralTemplate;
pub use chat_template::PhiTemplate;
pub use chat_template::RawTemplate;
pub use chat_template::SpecialTokens;
pub use chat_template::TemplateFormat;
Modules§
- bpe
- Byte Pair Encoding (BPE) tokenizer (GH-128).
- chat_template
- Chat Template Engine
- entities
- Pattern-based entity extraction.
- incremental_idf
- Incremental IDF (Inverse Document Frequency) tracking.
- llama_tokenizer
- LLaMA/TinyLlama SentencePiece-style BPE tokenizer (GH-145).
- sentiment
- Sentiment analysis with lexicon-based scoring.
- similarity
- Document similarity metrics.
- stem
- Stemming algorithms for text normalization.
- stopwords
- Stop words filtering for text preprocessing.
- summarize
- Extractive text summarization.
- tokenize
- Tokenization algorithms for text preprocessing.
- topic
- Topic modeling with Latent Dirichlet Allocation (LDA).
- vectorize
- Text vectorization for machine learning.
Traits§
- Tokenizer
- Trait for text tokenization.
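The sketch below shows why a shared tokenization trait is useful: code written against the trait works with any tokenizer. `SketchTokenizer`, `WhitespaceSketch`, `CharSketch`, and `count_tokens` are hypothetical names, and the method signature is an assumption inferred from the Quick Start call; the real `Tokenizer` trait may define different methods, return types, or error types.

```rust
/// Hypothetical stand-in for a tokenization trait. The signature is an
/// assumption for illustration; the crate's actual trait may differ.
trait SketchTokenizer {
    fn tokenize(&self, text: &str) -> Result<Vec<String>, String>;
}

/// Splits on Unicode whitespace; punctuation stays attached to words.
struct WhitespaceSketch;

/// Emits one token per non-whitespace character.
struct CharSketch;

impl SketchTokenizer for WhitespaceSketch {
    fn tokenize(&self, text: &str) -> Result<Vec<String>, String> {
        Ok(text.split_whitespace().map(String::from).collect())
    }
}

impl SketchTokenizer for CharSketch {
    fn tokenize(&self, text: &str) -> Result<Vec<String>, String> {
        Ok(text
            .chars()
            .filter(|c| !c.is_whitespace())
            .map(|c| c.to_string())
            .collect())
    }
}

/// Works with any tokenizer through the trait object and forwards
/// tokenization errors to the caller with `?`.
fn count_tokens(tokenizer: &dyn SketchTokenizer, text: &str) -> Result<usize, String> {
    Ok(tokenizer.tokenize(text)?.len())
}
```

With these definitions, `count_tokens(&WhitespaceSketch, "Hello, world!")` counts two word tokens, while `count_tokens(&CharSketch, "ab c")` counts three character tokens.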