# axonml-text

## Overview
`axonml-text` provides natural language processing utilities for the AxonML machine learning framework: a serializable `Vocab`, six tokenizers, and three dataset families (classification, language modeling, and synthetic sentiment/seq2seq). Labels are emitted as class-index tensors of shape `[1]` — directly compatible with AxonML's `CrossEntropyLoss`.
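To make that label convention concrete, here is a minimal, framework-independent sketch (plain Rust, not AxonML's actual `CrossEntropyLoss` API) of how a class-index target selects the term that cross-entropy penalizes; the `[1]`-shaped label tensor simply stores this integer index.

```rust
// Generic illustration of cross-entropy with a class-index target.
// The index plays the same role as the single value stored in a [1]-shaped label tensor.
fn cross_entropy(logits: &[f32], class_index: usize) -> f32 {
    // Numerically stable log-sum-exp for the softmax denominator.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let log_sum_exp = max + logits.iter().map(|&x| (x - max).exp()).sum::<f32>().ln();
    log_sum_exp - logits[class_index] // == -log softmax(logits)[class_index]
}

fn main() {
    let logits = [2.0_f32, 0.5]; // model scores for classes 0 and 1
    let label = 1;               // class index, as emitted by the datasets in this crate
    println!("loss = {:.4}", cross_entropy(&logits, label));
}
```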
## Features
- Vocabulary management — `Vocab` with token/index maps, special-token indices (PAD, UNK, BOS, EOS, MASK), frequency-threshold construction via `Vocab::from_text`, auto-inserted UNK/PAD in `from_tokens`, and JSON save/load (serde).
- Six tokenizers implementing a common `Tokenizer` trait:
  - `WhitespaceTokenizer` (with optional lowercasing)
  - `CharTokenizer` (optional whitespace filtering)
  - `WordPunctTokenizer` (separates words and punctuation)
  - `NGramTokenizer` (word- or character-level n-grams)
  - `BasicBPETokenizer` (trainable byte-pair encoding with priority-ordered merges and `</w>` end markers; illustrated after this list)
  - `UnigramTokenizer` (Viterbi-optimal segmentation from a scored vocabulary)
- Text classification dataset — `TextDataset` stores a tokenizer and pads/truncates to `max_length`; `from_samples` builds the vocab from tokenized text with a `min_freq` threshold.
- Language modeling dataset — `LanguageModelDataset` produces next-token (input, target) pairs of shape `[seq_length]`.
- Synthetic datasets — `SyntheticSentimentDataset` (small/train/test presets, deterministic per-index generation) and `SyntheticSeq2SeqDataset` (reverse / copy task).
- Prelude module for concise imports.
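For intuition about the priority-ordered merge scheme mentioned above, the following is a minimal, framework-independent sketch of applying ranked BPE merges to one word (not `BasicBPETokenizer`'s actual implementation):

```rust
// Apply learned BPE merges to one word, always taking the highest-priority
// (lowest-rank) applicable merge first. "</w>" marks the end of the word.
fn apply_bpe(word: &str, merges: &[(String, String)]) -> Vec<String> {
    let mut symbols: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    symbols.push("</w>".to_string());
    loop {
        // Find the adjacent symbol pair whose merge has the best (smallest) rank.
        let mut best: Option<(usize, usize)> = None; // (rank, position)
        for i in 0..symbols.len().saturating_sub(1) {
            if let Some(rank) = merges
                .iter()
                .position(|(a, b)| *a == symbols[i] && *b == symbols[i + 1])
            {
                if best.map_or(true, |(r, _)| rank < r) {
                    best = Some((rank, i));
                }
            }
        }
        match best {
            Some((_, i)) => {
                let merged = format!("{}{}", symbols[i], symbols[i + 1]);
                symbols.splice(i..=i + 1, [merged]);
            }
            None => break,
        }
    }
    symbols
}

fn main() {
    // Merges listed in priority order, as they would be learned from a corpus.
    let merges = vec![
        ("l".to_string(), "o".to_string()),
        ("lo".to_string(), "w".to_string()),
    ];
    println!("{:?}", apply_bpe("low", &merges)); // ["low", "</w>"]
}
```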
## Modules

| Module | Description |
|---|---|
| `vocab` | `Vocab` struct, special-token constants (`PAD_TOKEN`, `UNK_TOKEN`, `BOS_TOKEN`, `EOS_TOKEN`, `MASK_TOKEN`), JSON save/load |
| `tokenizer` | `Tokenizer` trait plus Whitespace, Char, WordPunct, NGram, BasicBPE, Unigram implementations |
| `datasets` | `TextDataset`, `LanguageModelDataset`, `SyntheticSentimentDataset`, `SyntheticSeq2SeqDataset` |
## Usage

Add the dependency to your `Cargo.toml`:

```toml
[dependencies]
axonml-text = "0.6.1"
```
### Building a Vocabulary
```rust
use axonml_text::prelude::*;

// NOTE: argument values and exact signatures in this snippet are illustrative.

// Frequency-threshold construction (adds special tokens automatically)
let text = "the quick brown fox jumps over the lazy dog";
let vocab = Vocab::from_text(text, 1);

// Or build manually
let mut vocab = Vocab::with_special_tokens();
vocab.add_token("hello");
vocab.add_token("world");

// Encode and decode
let indices = vocab.encode("hello world");
let tokens = vocab.decode(&indices);

// Unknown tokens resolve to the UNK index, so they decode back to UNK_TOKEN
assert_eq!(vocab.decode(&vocab.encode("xyzzy")), vec![UNK_TOKEN.to_string()]);

// Persistence (JSON via serde)
vocab.save("vocab.json").unwrap();
let loaded = Vocab::load("vocab.json").unwrap();
```
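Assuming the second argument of `from_text` is the minimum token count (as written above), raising it to `2` would keep only `"the"` from this sentence, since it is the only word that appears twice; every other word would then encode to the UNK index.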
### Tokenization
```rust
use axonml_text::prelude::*;

// NOTE: constructor and training arguments below are illustrative.

let ws = WhitespaceTokenizer::new();
let tokens = ws.tokenize("Hello World"); // ["Hello", "World"]

let chars = CharTokenizer::new();
let t = chars.tokenize("Hi!"); // ["H", "i", "!"]

let wp = WordPunctTokenizer::lowercase();
let t = wp.tokenize("Hello, World!"); // ["hello", ",", "world", "!"]

let bigrams = NGramTokenizer::word_ngrams(2);
let t = bigrams.tokenize("one two three"); // ["one two", "two three"]

let trigrams = NGramTokenizer::char_ngrams(3);
let t = trigrams.tokenize("hello"); // ["hel", "ell", "llo"]

// Trainable BPE
let mut bpe = BasicBPETokenizer::new();
bpe.train("low lower lowest", 50); // training text and merge count are placeholders
let tokens = bpe.tokenize("lower"); // uses priority-ordered merges

// Unigram (Viterbi-optimal segmentation from a scored vocabulary)
let unigram = UnigramTokenizer::from_tokens(&["he", "llo", "hello"]); // seed tokens are placeholders
let tokens = unigram.tokenize("hello");
```
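For intuition about the Viterbi-optimal segmentation behind the unigram approach, here is a small standalone dynamic-programming sketch (generic Rust over an assumed score map; not `UnigramTokenizer`'s actual code):

```rust
use std::collections::HashMap;

/// Pick the segmentation of `text` whose pieces have the highest total score.
/// Generic illustration of Viterbi segmentation, not the crate's implementation.
fn viterbi_segment(text: &str, scores: &HashMap<&str, f64>) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let n = chars.len();
    // best[i] = (best score for a segmentation of chars[..i], start index of its last piece)
    let mut best: Vec<(f64, usize)> = vec![(f64::NEG_INFINITY, 0); n + 1];
    best[0] = (0.0, 0);
    for end in 1..=n {
        for start in 0..end {
            let piece: String = chars[start..end].iter().collect();
            if let Some(&score) = scores.get(piece.as_str()) {
                let candidate = best[start].0 + score;
                if candidate > best[end].0 {
                    best[end] = (candidate, start);
                }
            }
        }
    }
    // Walk the chosen split points back from the end of the string.
    let mut pieces = Vec::new();
    let mut end = n;
    while end > 0 {
        let start = best[end].1;
        pieces.push(chars[start..end].iter().collect::<String>());
        end = start;
    }
    pieces.reverse();
    pieces
}

fn main() {
    // Higher score = more preferred piece.
    let scores: HashMap<&str, f64> =
        [("hello", -1.0), ("he", -2.0), ("llo", -2.5)].into_iter().collect();
    println!("{:?}", viterbi_segment("hello", &scores)); // ["hello"]
}
```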
### Text Classification Dataset
```rust
use axonml_text::prelude::*;

// (text, label) pairs; sample values here are illustrative.
let samples = vec![
    ("this movie was great".to_string(), 1),
    ("what a waste of time".to_string(), 0),
];
let tokenizer = WhitespaceTokenizer::new();

// Builds vocab from tokenized samples with min_freq=1, pads/truncates to 10
// (argument order is illustrative).
let dataset = TextDataset::from_samples(samples, tokenizer, 10, 1);
assert_eq!(dataset.len(), 2);

// Loader type and constructor arguments are assumed for illustration.
let loader = DataLoader::new(dataset, 2, false);
for batch in loader.iter() {
    // padded token indices plus class-index labels of shape [1]
}
```
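With `max_length = 10` as above, a four-token sample is padded out with six PAD indices and a twelve-token sample is cut back to ten, so every encoded sample has the same fixed length.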
### Language Modeling Dataset
```rust
use axonml_text::prelude::*;

let text = "one two three four five six seven eight nine ten";
let tokenizer = WhitespaceTokenizer::new();
// seq_length = 3 here; the exact argument order of from_text is illustrative.
let dataset = LanguageModelDataset::from_text(text, tokenizer, 3);

let (input, target) = dataset.get(0).unwrap();
// input  : [seq_length] — tokens at positions i..i+seq_length
// target : [seq_length] — tokens at positions i+1..i+seq_length+1
```
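For example, with `seq_length = 3` the first pair drawn from the text above corresponds to `["one", "two", "three"]` as input and `["two", "three", "four"]` as target (returned as token-index tensors).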
### Synthetic Datasets
```rust
use axonml_text::prelude::*;

// Deterministic sentiment dataset (binary, reproducible per-index)
let sentiment = SyntheticSentimentDataset::small(); // 100 samples, max_len=32, vocab=1000
let sentiment = SyntheticSentimentDataset::train(); // 10000 samples, max_len=64, vocab=10000
let sentiment = SyntheticSentimentDataset::test();  // 2000 samples

// Seq2seq reverse task (copy_task makes src_len == tgt_len);
// constructor arguments are illustrative.
let seq2seq = SyntheticSeq2SeqDataset::copy_task(1000, 10, 50);
// For each sample: tgt is src reversed.
```
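For instance, a source of token ids `[5, 9, 2]` pairs with the target `[2, 9, 5]` (ignoring any special tokens the dataset may add).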
## Tests
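Run the crate's test suite with `cargo test`.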
## License
Licensed under either of:
- MIT License (LICENSE-MIT or http://opensource.org/licenses/MIT)
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
at your option.
Last updated: 2026-04-16 (v0.6.1)