axonml-text 0.6.2

Text processing utilities for the AxonML machine learning framework.

Overview

axonml-text provides natural language processing utilities for the AxonML machine learning framework: a serializable Vocab, six tokenizers, and three dataset families (classification, language modeling, synthetic seq2seq). Labels are emitted as class-index tensors of shape [1] — directly compatible with AxonML's CrossEntropyLoss.

Features

  • Vocabulary management: Vocab with token/index maps, special-token indices (PAD, UNK, BOS, EOS, MASK), frequency-threshold construction via Vocab::from_text, auto-inserted UNK/PAD in from_tokens, and JSON save/load (serde).
  • Six tokenizers implementing a common Tokenizer trait (sketched just after this list):
    • WhitespaceTokenizer (with optional lowercasing)
    • CharTokenizer (optional whitespace filtering)
    • WordPunctTokenizer (separates words and punctuation)
    • NGramTokenizer (word- or character-level n-grams)
    • BasicBPETokenizer (trainable byte-pair encoding with priority-ordered merges and </w> end markers)
    • UnigramTokenizer (Viterbi-optimal segmentation from scored vocabulary)
  • Text classification dataset: TextDataset stores a tokenizer and pads/truncates to max_length; from_samples builds the vocab from tokenized text with a min_freq threshold.
  • Language modeling dataset: LanguageModelDataset produces next-token (input, target) pairs of shape [seq_length].
  • Synthetic datasets: SyntheticSentimentDataset (small/train/test presets, deterministic per-index generation) and SyntheticSeq2SeqDataset (reverse / copy task).
  • Prelude module for concise imports.
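
All six tokenizers share one trait. The crate's exact trait definition isn't reproduced in this README, so the following is a minimal sketch of the shape implied by the usage examples further down (a single tokenize method returning Vec<String>); CommaTokenizer is a purely illustrative custom implementation:

// Assumed minimal shape of the Tokenizer trait; the real trait may expose more.
trait Tokenizer {
    fn tokenize(&self, text: &str) -> Vec<String>;
}

// Toy custom tokenizer: split on commas, trimming surrounding whitespace.
struct CommaTokenizer;

impl Tokenizer for CommaTokenizer {
    fn tokenize(&self, text: &str) -> Vec<String> {
        text.split(',').map(|s| s.trim().to_string()).collect()
    }
}

A type implementing the trait should then plug in wherever the built-in tokenizers are accepted (for example, the TextDataset::from_samples call shown later), assuming those APIs are generic over the trait.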

Modules

  • vocab: Vocab struct, special-token constants (PAD_TOKEN, UNK_TOKEN, BOS_TOKEN, EOS_TOKEN, MASK_TOKEN), JSON save/load
  • tokenizer: Tokenizer trait plus Whitespace, Char, WordPunct, NGram, BasicBPE, Unigram implementations
  • datasets: TextDataset, LanguageModelDataset, SyntheticSentimentDataset, SyntheticSeq2SeqDataset

Usage

Add the dependency to your Cargo.toml:

[dependencies]
axonml-text = "0.6.2"

Building a Vocabulary

use axonml_text::prelude::*;

// Frequency-threshold construction (adds special tokens automatically)
let text = "the quick brown fox jumps over the lazy dog";
let vocab = Vocab::from_text(text, /*min_freq=*/ 1);

// Or build manually
let mut vocab = Vocab::with_special_tokens();
vocab.add_token("hello");
vocab.add_token("world");

// Encode and decode
let indices = vocab.encode(&["hello", "world"]);
let tokens = vocab.decode(&indices);

// Unknown tokens resolve to the UNK index
assert_eq!(vocab.token_to_index("foo"), vocab.unk_index().unwrap());

// Persistence
vocab.save(std::path::Path::new("vocab.json")).unwrap();
let loaded = Vocab::load(std::path::Path::new("vocab.json")).unwrap();
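
Raising min_freq drops rare tokens at construction time, so they later encode to UNK. A quick illustration, assuming from_text splits on whitespace as in the example above:

let vocab = Vocab::from_text("a a b", /*min_freq=*/ 2);
// "b" appears only once, below the threshold, so it is out of vocabulary:
assert_eq!(vocab.token_to_index("b"), vocab.unk_index().unwrap());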

Tokenization

use axonml_text::prelude::*;

let ws = WhitespaceTokenizer::new();
let tokens = ws.tokenize("Hello World");            // ["Hello", "World"]

let chars = CharTokenizer::new();
let t = chars.tokenize("Hi!");                      // ["H", "i", "!"]

let wp = WordPunctTokenizer::lowercase();
let t = wp.tokenize("Hello, World!");               // ["hello", ",", "world", "!"]

let bigrams = NGramTokenizer::word_ngrams(2);
let t = bigrams.tokenize("one two three");          // ["one two", "two three"]

let trigrams = NGramTokenizer::char_ngrams(3);
let t = trigrams.tokenize("hello");                 // ["hel", "ell", "llo"]

// Trainable BPE
let mut bpe = BasicBPETokenizer::new();
bpe.train("low lower lowest newer newest", /*num_merges=*/ 10);
let tokens = bpe.tokenize("lowest");                // uses priority-ordered merges

// Unigram (Viterbi-optimal segmentation)
let unigram = UnigramTokenizer::from_tokens(&["hel", "lo", "wor", "ld"]);
let tokens = unigram.tokenize("hello world");
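
For intuition on the Unigram tokenizer's Viterbi step: given a vocabulary of scored pieces (log-probability-style, higher is better), dynamic programming picks the segmentation of the input that maximizes the total score. A standalone sketch of the algorithm, not the crate's internal code:

use std::collections::HashMap;

// Standalone illustration of Viterbi segmentation over a scored vocabulary.
fn viterbi_segment(text: &str, scores: &HashMap<&str, f64>) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let n = chars.len();
    // best[i] = Some((best total score for chars[..i], start of the last piece))
    let mut best: Vec<Option<(f64, usize)>> = vec![None; n + 1];
    best[0] = Some((0.0, 0));
    for end in 1..=n {
        for start in 0..end {
            let piece: String = chars[start..end].iter().collect();
            if let (Some((prev, _)), Some(&s)) = (best[start], scores.get(piece.as_str())) {
                let cand = prev + s;
                if best[end].map_or(true, |(b, _)| cand > b) {
                    best[end] = Some((cand, start));
                }
            }
        }
    }
    // Backtrack from the end of the string to recover the chosen pieces.
    let mut pieces: Vec<String> = Vec::new();
    let mut end = n;
    while end > 0 {
        let (_, start) = best[end].expect("input not segmentable with this vocabulary");
        pieces.push(chars[start..end].iter().collect());
        end = start;
    }
    pieces.reverse();
    pieces
}

let scores = HashMap::from([("hel", -1.0), ("lo", -0.5), ("hello", -2.5)]);
// "hel" + "lo" scores -1.5, beating the single piece "hello" at -2.5.
assert_eq!(viterbi_segment("hello", &scores), vec!["hel", "lo"]);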

Text Classification Dataset

use axonml_text::prelude::*;

let samples = vec![
    ("good movie".to_string(), 1),
    ("bad movie".to_string(), 0),
    ("great film".to_string(), 1),
    ("terrible movie".to_string(), 0),
];

let tokenizer = WhitespaceTokenizer::new();
// Builds vocab from tokenized samples with min_freq=1, pads/truncates to 10.
let dataset = TextDataset::from_samples(&samples, &tokenizer, 1, 10);

assert_eq!(dataset.num_classes(), 2);

let loader = DataLoader::new(dataset, 16);
for batch in loader.iter() {
    // batch.data    : [batch_size, max_length] (float token indices)
    // batch.targets : [batch_size, 1]          (float class index)
}
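
The pad/truncate step is easy to picture in isolation. A standalone sketch (not the crate's internals), with PAD assumed to map to index 0 purely for the illustration:

// Vec::resize truncates longer sequences and right-pads shorter ones with `pad`.
fn pad_or_truncate(mut ids: Vec<usize>, max_len: usize, pad: usize) -> Vec<usize> {
    ids.resize(max_len, pad);
    ids
}

assert_eq!(pad_or_truncate(vec![5, 6], 4, 0), vec![5, 6, 0, 0]);
assert_eq!(pad_or_truncate(vec![1, 2, 3, 4, 5], 4, 0), vec![1, 2, 3, 4]);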

Language Modeling Dataset

use axonml_text::prelude::*;

let text = "one two three four five six seven eight nine ten";
let dataset = LanguageModelDataset::from_text(text, /*seq_len=*/ 3, /*min_freq=*/ 1);

let (input, target) = dataset.get(0).unwrap();
// input  : [seq_length] — tokens at positions i..i+seq_length
// target : [seq_length] — tokens at positions i+1..i+seq_length+1
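
Concretely, sample i is a sliding window over the encoded token stream, with the target shifted one position to the right:

let tokens = [10, 11, 12, 13, 14];
let (i, seq_len) = (0, 3);
let input  = &tokens[i..i + seq_len];         // [10, 11, 12]
let target = &tokens[i + 1..i + 1 + seq_len]; // [11, 12, 13]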

Synthetic Datasets

use axonml_text::prelude::*;

// Deterministic sentiment dataset (binary, reproducible per-index)
let sentiment = SyntheticSentimentDataset::small(); // 100 samples, max_len=32, vocab=1000
let sentiment = SyntheticSentimentDataset::train(); // 10000 samples, max_len=64, vocab=10000
let sentiment = SyntheticSentimentDataset::test();  // 2000 samples

// Seq2seq tasks: reverse (each tgt is its src reversed) or copy (tgt == src).
// copy_task yields src and tgt of equal length:
let seq2seq = SyntheticSeq2SeqDataset::copy_task(/*size=*/ 100, /*length=*/ 5, /*vocab_size=*/ 50);
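
For the reverse task, the src/tgt relationship is just element-order reversal; in plain Rust terms:

let src = vec![3, 7, 1, 9, 4];
let tgt: Vec<i32> = src.iter().rev().copied().collect();
assert_eq!(tgt, vec![4, 9, 1, 7, 3]);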

Tests

cargo test -p axonml-text

License

Licensed under either of:

  • Apache License, Version 2.0
  • MIT License

at your option.


Last updated: 2026-04-16 (v0.6.2)