Module text

Text processing and NLP utilities.

This module provides text preprocessing tools for Natural Language Processing:

  • Tokenization (word-level, character-level)
  • BPE tokenization (Byte Pair Encoding for LLMs/speech models)
  • Stop words filtering
  • Stemming (Porter stemmer)
  • Vectorization (Bag of Words, TF-IDF)
  • Sentiment analysis (lexicon-based)
  • Topic modeling (LDA)
  • Document similarity (cosine, Jaccard, edit distance)
  • Entity extraction (emails, URLs, mentions, hashtags)
  • Text summarization (TextRank, TF-IDF extractive)
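As an illustration of the document similarity metrics listed above, here is a minimal, std-only sketch of Jaccard similarity, cosine similarity, and edit distance. The function names `jaccard`, `cosine`, and `edit_distance` are hypothetical and chosen for this example; they are not taken from the `similarity` module, whose actual API is not shown on this page.

```rust
use std::collections::HashSet;

/// Jaccard similarity between two token lists, treated as sets:
/// |A ∩ B| / |A ∪ B|. Returns 1.0 when both inputs are empty
/// (a convention chosen for this sketch).
pub fn jaccard(a: &[&str], b: &[&str]) -> f64 {
    let set_a: HashSet<&str> = a.iter().copied().collect();
    let set_b: HashSet<&str> = b.iter().copied().collect();
    let union = set_a.union(&set_b).count();
    if union == 0 {
        return 1.0;
    }
    let intersection = set_a.intersection(&set_b).count();
    intersection as f64 / union as f64
}

/// Cosine similarity between two term-count vectors of equal length.
/// Returns `None` on a length mismatch or when either vector has zero norm.
pub fn cosine(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Levenshtein edit distance over Unicode scalar values,
/// computed with a two-row dynamic programming table.
pub fn edit_distance(a: &str, b: &str) -> usize {
    let b_chars: Vec<char> = b.chars().collect();
    // prev[j] = distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b_chars.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut curr = Vec::with_capacity(b_chars.len() + 1);
        curr.push(i + 1);
        for (j, cb) in b_chars.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != *cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr.push(substitution.min(deletion).min(insertion));
        }
        prev = curr;
    }
    prev[b_chars.len()]
}
```

For example, `jaccard(&["a", "b", "c"], &["b", "c", "d"])` gives 0.5 (two shared tokens out of four distinct ones), and `edit_distance("kitten", "sitting")` gives 3. Returning `Option` from `cosine` rather than panicking keeps the sketch in line with the no-`unwrap()` principle described below.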

§Design Principles

Following the Toyota Way and aprender’s quality standards:

  • Zero unwrap() calls (Cloudflare-class safety)
  • Result-based error handling with AprenderError
  • Comprehensive test coverage (≥95%)
  • Property-based testing with proptest
  • Pure Rust implementation (no external NLP dependencies)
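The first two principles can be sketched as follows. This is a minimal example of `Result`-based error handling with no `unwrap()` calls. `TextError` is a hypothetical stand-in for `AprenderError` (whose variants are not shown on this page), and `parse_term_count` and `total_count` are illustrative helpers, not functions from this module.

```rust
use std::fmt;

/// Hypothetical error type for this sketch; it stands in for `AprenderError`.
#[derive(Debug, PartialEq)]
pub enum TextError {
    /// The line contained no term at all.
    EmptyInput,
    /// The line had a term but no count after it.
    MissingCount,
    /// The count field was not a valid unsigned integer.
    InvalidCount(String),
}

impl fmt::Display for TextError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            TextError::EmptyInput => write!(f, "input line is empty"),
            TextError::MissingCount => write!(f, "missing count field"),
            TextError::InvalidCount(raw) => write!(f, "invalid count: {raw:?}"),
        }
    }
}

impl std::error::Error for TextError {}

/// Parses a "term count" line such as "rust 3".
/// Every failure path is reported through `Result`; nothing panics.
pub fn parse_term_count(line: &str) -> Result<(String, u32), TextError> {
    let mut parts = line.split_whitespace();
    let term = parts.next().ok_or(TextError::EmptyInput)?;
    let raw = parts.next().ok_or(TextError::MissingCount)?;
    let count = raw
        .parse::<u32>()
        .map_err(|_| TextError::InvalidCount(raw.to_string()))?;
    Ok((term.to_string(), count))
}

/// Sums the counts of several lines, propagating the first error with `?`.
pub fn total_count(lines: &[&str]) -> Result<u64, TextError> {
    let mut total = 0u64;
    for line in lines {
        let (_, count) = parse_term_count(line)?;
        total += u64::from(count);
    }
    Ok(total)
}
```

Callers decide how to react to each variant, for instance by matching on `Err(TextError::InvalidCount(raw))` to report the offending field, instead of the library aborting on their behalf.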

§Quick Start

use aprender::text::tokenize::WhitespaceTokenizer;
use aprender::text::Tokenizer;

let tokenizer = WhitespaceTokenizer::new();
let tokens = tokenizer
    .tokenize("Hello, world! This is aprender.")
    .expect("tokenize should succeed");
assert_eq!(tokens, vec!["Hello,", "world!", "This", "is", "aprender."]);
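Note that the expected tokens above keep their punctuation ("Hello," and "world!"), which suggests the tokenizer splits on whitespace only. The std-only sketch below reproduces that behavior and adds an optional normalization step. Both functions are hypothetical illustrations and make no claim about how `WhitespaceTokenizer` is implemented.

```rust
/// Splits on Unicode whitespace only; punctuation stays attached to tokens.
pub fn whitespace_tokens(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

/// Optional follow-up step for this sketch: trims leading and trailing
/// ASCII punctuation from each token, lowercases it, and drops tokens
/// that end up empty (for example a standalone "--").
pub fn normalized_tokens(text: &str) -> Vec<String> {
    text.split_whitespace()
        .map(|token| {
            token
                .trim_matches(|c: char| c.is_ascii_punctuation())
                .to_lowercase()
        })
        .filter(|token| !token.is_empty())
        .collect()
}
```

With the same input as the Quick Start, `whitespace_tokens` yields `["Hello,", "world!", "This", "is", "aprender."]`, while `normalized_tokens` yields `["hello", "world", "this", "is", "aprender"]`. Normalizing before stop word filtering or vectorization avoids treating "world" and "world!" as different terms.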

§References

Based on the comprehensive NLP specification: docs/specifications/nlp-models-techniques-spec.md

§Re-exports

pub use chat_template::auto_detect_template;
pub use chat_template::contains_injection_patterns;
pub use chat_template::create_template;
pub use chat_template::detect_format_from_name;
pub use chat_template::sanitize_user_content;
pub use chat_template::AlpacaTemplate;
pub use chat_template::ChatMLTemplate;
pub use chat_template::ChatMessage;
pub use chat_template::ChatTemplateEngine;
pub use chat_template::HuggingFaceTemplate;
pub use chat_template::Llama2Template;
pub use chat_template::MistralTemplate;
pub use chat_template::PhiTemplate;
pub use chat_template::RawTemplate;
pub use chat_template::SpecialTokens;
pub use chat_template::TemplateFormat;

§Modules

bpe
Byte Pair Encoding (BPE) tokenizer (GH-128).
chat_template
Chat Template Engine
entities
Pattern-based entity extraction.
incremental_idf
Incremental IDF (Inverse Document Frequency) tracking.
llama_tokenizer
LLaMA/TinyLlama SentencePiece-style BPE tokenizer (GH-145).
sentiment
Sentiment analysis with lexicon-based scoring.
similarity
Document similarity metrics.
stem
Stemming algorithms for text normalization.
stopwords
Stop words filtering for text preprocessing.
summarize
Extractive text summarization.
tokenize
Tokenization algorithms for text preprocessing.
topic
Topic modeling with Latent Dirichlet Allocation (LDA).
vectorize
Text vectorization for machine learning.

§Traits

Tokenizer
Trait for text tokenization.