Crate axonml_text


NLP utilities for AxonML.

Provides tokenizers (Whitespace, Char, WordPunct, NGram, BPE, Unigram), vocabulary management with special tokens (PAD/UNK/BOS/EOS/MASK), and datasets (TextDataset, LanguageModelDataset, SyntheticSentimentDataset, SyntheticSeq2SeqDataset) for classification, language modeling, and sequence-to-sequence tasks.
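As a sketch of how these pieces typically fit together, the example below pairs a whitespace tokenizer with a vocabulary that reserves ids for PAD and UNK. The trait and method names (`Tokenizer::tokenize`, `Vocab::build`, `Vocab::encode`) are illustrative assumptions, not the crate's actual API:

```rust
use std::collections::HashMap;

// Special tokens; ids 0 and 1 are reserved for them below (an assumption).
pub const PAD_TOKEN: &str = "<pad>";
pub const UNK_TOKEN: &str = "<unk>";

pub trait Tokenizer {
    fn tokenize(&self, text: &str) -> Vec<String>;
}

pub struct WhitespaceTokenizer;

impl Tokenizer for WhitespaceTokenizer {
    fn tokenize(&self, text: &str) -> Vec<String> {
        // Split on any Unicode whitespace run.
        text.split_whitespace().map(str::to_string).collect()
    }
}

pub struct Vocab {
    token_to_id: HashMap<String, usize>,
}

impl Vocab {
    /// Builds a vocab from a token stream, reserving id 0 for PAD and 1 for UNK.
    pub fn build(tokens: &[String]) -> Self {
        let mut token_to_id = HashMap::new();
        token_to_id.insert(PAD_TOKEN.to_string(), 0);
        token_to_id.insert(UNK_TOKEN.to_string(), 1);
        for t in tokens {
            let next = token_to_id.len();
            token_to_id.entry(t.clone()).or_insert(next);
        }
        Self { token_to_id }
    }

    /// Maps a token to its id, falling back to the UNK id for unknown tokens.
    pub fn encode(&self, token: &str) -> usize {
        *self.token_to_id.get(token).unwrap_or(&1)
    }
}
```

Reserving fixed ids for special tokens up front keeps padding and unknown-token handling stable no matter what corpus the rest of the vocabulary is built from.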

§File

crates/axonml-text/src/lib.rs

§Author

Andrew Jewell Sr., AutomataNexus LLC (ORCID: 0009-0005-2158-7060)

§Updated

April 14, 2026 11:15 PM EST

§Disclaimer

Use at your own risk. This software is provided “as is”, without warranty of any kind, express or implied. The author and AutomataNexus shall not be held liable for any damages arising from the use of this software.

§Re-exports

pub use vocab::BOS_TOKEN;
pub use vocab::EOS_TOKEN;
pub use vocab::MASK_TOKEN;
pub use vocab::PAD_TOKEN;
pub use vocab::UNK_TOKEN;
pub use vocab::Vocab;
pub use tokenizer::BasicBPETokenizer;
pub use tokenizer::CharTokenizer;
pub use tokenizer::NGramTokenizer;
pub use tokenizer::Tokenizer;
pub use tokenizer::UnigramTokenizer;
pub use tokenizer::WhitespaceTokenizer;
pub use tokenizer::WordPunctTokenizer;
pub use datasets::LanguageModelDataset;
pub use datasets::SyntheticSentimentDataset;
pub use datasets::SyntheticSeq2SeqDataset;
pub use datasets::TextDataset;
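Among the re-exported dataset types, a language-model dataset conventionally slices a token stream into fixed-length (input, target) windows where the target is the input shifted one position right. The standalone sketch below illustrates that windowing; the struct and method names are assumptions, not the crate's API:

```rust
/// Illustrative sketch of a next-token-prediction dataset: each item is an
/// (input, target) pair where target[j] == input[j + 1] in the token stream.
pub struct LanguageModelDataset {
    tokens: Vec<usize>,
    seq_len: usize,
}

impl LanguageModelDataset {
    pub fn new(tokens: Vec<usize>, seq_len: usize) -> Self {
        Self { tokens, seq_len }
    }

    /// Number of non-overlapping windows that fit in the stream
    /// (one extra token is needed for the final shifted target).
    pub fn len(&self) -> usize {
        self.tokens.len().saturating_sub(1) / self.seq_len
    }

    /// The i-th (input, target) pair.
    pub fn get(&self, i: usize) -> (Vec<usize>, Vec<usize>) {
        let start = i * self.seq_len;
        let input = self.tokens[start..start + self.seq_len].to_vec();
        let target = self.tokens[start + 1..start + self.seq_len + 1].to_vec();
        (input, target)
    }
}
```

For example, an 11-token stream with `seq_len = 5` yields two windows, the first being `([t0..t4], [t1..t5])`.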

§Modules

datasets: Text datasets.
prelude: Common imports for text processing.
tokenizer: Text tokenization.
vocab: Vocabulary (token-to-index mapping).
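To picture the n-gram variant listed in the tokenizer module, a character n-gram tokenizer can be sketched in a few lines. The function below is a standalone illustration, not the crate's `NGramTokenizer` API:

```rust
/// Illustrative character n-gram sketch: slides a window of `n` characters
/// over the text, e.g. "chat" with n = 2 yields ["ch", "ha", "at"].
/// Texts shorter than `n` are returned as a single token.
pub fn char_ngrams(text: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    if chars.len() < n {
        return vec![text.to_string()];
    }
    chars
        .windows(n)
        .map(|w| w.iter().collect())
        .collect()
}
```

Collecting to `Vec<char>` first keeps the windowing correct for multi-byte UTF-8 characters, which byte-based slicing would split.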