NLP utilities for AxonML.
Provides tokenizers (Whitespace, Char, WordPunct, NGram, BPE, Unigram),
Vocab management with special tokens (PAD/UNK/BOS/EOS/MASK), and datasets
(TextDataset, LanguageModelDataset, SyntheticSentimentDataset,
SyntheticSeq2SeqDataset) for classification, language modeling, and
sequence-to-sequence tasks.
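To illustrate how a tokenizer and a vocabulary with special tokens typically fit together, here is a minimal standalone sketch. It does not call the axonml-text API, whose signatures are not shown on this page. The names `SimpleVocab`, `whitespace_tokenize`, and the special-token strings such as `<pad>` are assumptions made for illustration only.

```rust
use std::collections::HashMap;

// Special-token strings chosen for illustration; the crate's actual
// PAD_TOKEN / UNK_TOKEN / BOS_TOKEN / EOS_TOKEN values may differ.
const PAD: &str = "<pad>";
const UNK: &str = "<unk>";
const BOS: &str = "<bos>";
const EOS: &str = "<eos>";

/// Hypothetical whitespace tokenizer: splits on any Unicode whitespace.
fn whitespace_tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(str::to_string).collect()
}

/// Hypothetical vocabulary mapping tokens to integer ids.
struct SimpleVocab {
    token_to_id: HashMap<String, usize>,
    id_to_token: Vec<String>,
}

impl SimpleVocab {
    /// Creates a vocabulary with ids 0..=3 reserved for the special tokens.
    fn new() -> Self {
        let mut vocab = SimpleVocab {
            token_to_id: HashMap::new(),
            id_to_token: Vec::new(),
        };
        for token in [PAD, UNK, BOS, EOS] {
            vocab.add(token);
        }
        vocab
    }

    /// Adds a token if it is absent and returns its id.
    fn add(&mut self, token: &str) -> usize {
        if let Some(&id) = self.token_to_id.get(token) {
            return id;
        }
        let id = self.id_to_token.len();
        self.token_to_id.insert(token.to_string(), id);
        self.id_to_token.push(token.to_string());
        id
    }

    /// Encodes text as BOS + token ids + EOS; unknown tokens map to UNK.
    fn encode(&self, text: &str) -> Vec<usize> {
        let unk = self.token_to_id[UNK];
        let mut ids = vec![self.token_to_id[BOS]];
        ids.extend(
            whitespace_tokenize(text)
                .iter()
                .map(|t| self.token_to_id.get(t).copied().unwrap_or(unk)),
        );
        ids.push(self.token_to_id[EOS]);
        ids
    }
}
```

In this sketch, after adding the tokens of "the cat sat" to a fresh `SimpleVocab`, encoding "the dog sat" yields `[2, 4, 1, 6, 3]`: the out-of-vocabulary word "dog" falls back to the unknown-token id 1, and the sequence is wrapped in the begin and end markers.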
§File
crates/axonml-text/src/lib.rs
§Author
Andrew Jewell Sr., AutomataNexus LLC
ORCID: 0009-0005-2158-7060
§Updated
April 14, 2026 11:15 PM EST
§Disclaimer
Use at your own risk. This software is provided “as is”, without warranty of any kind, express or implied. The author and AutomataNexus shall not be held liable for any damages arising from the use of this software.
Re-exports§
pub use vocab::BOS_TOKEN;
pub use vocab::EOS_TOKEN;
pub use vocab::MASK_TOKEN;
pub use vocab::PAD_TOKEN;
pub use vocab::UNK_TOKEN;
pub use vocab::Vocab;
pub use tokenizer::BasicBPETokenizer;
pub use tokenizer::CharTokenizer;
pub use tokenizer::NGramTokenizer;
pub use tokenizer::Tokenizer;
pub use tokenizer::UnigramTokenizer;
pub use tokenizer::WhitespaceTokenizer;
pub use tokenizer::WordPunctTokenizer;
pub use datasets::LanguageModelDataset;
pub use datasets::SyntheticSentimentDataset;
pub use datasets::SyntheticSeq2SeqDataset;
pub use datasets::TextDataset;