axonml-text
Overview
axonml-text provides natural language processing utilities for the AxonML machine learning framework. It includes vocabulary management, multiple tokenization strategies, and dataset implementations for common NLP tasks like text classification, language modeling, and sequence-to-sequence learning.
Features
- Vocabulary Management - Token-to-index mapping with special tokens (PAD, UNK, BOS, EOS, MASK) and frequency-based filtering
- Multiple Tokenizers - Whitespace, character-level, word-punctuation, n-gram, BPE, and unigram tokenization strategies
- Text Classification Datasets - Build datasets from labeled text samples with automatic vocabulary construction
- Language Modeling Datasets - Create next-token prediction datasets with configurable sequence lengths
- Synthetic Datasets - Pre-built sentiment and seq2seq datasets for testing and prototyping
- Prelude Module - Convenient re-exports for common imports
Modules
| Module | Description |
|---|---|
| vocab | Vocabulary management with token-to-index mapping and special token support |
| tokenizer | Tokenizer trait and implementations (Whitespace, Char, WordPunct, NGram, BPE, Unigram) |
| datasets | Dataset implementations for text classification, language modeling, and seq2seq tasks |
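All of the examples below import through the prelude. As a sketch of how these modules map to import paths — with the caveat that the exact type names and re-export layout are assumptions rather than confirmed API:

// Import from the individual modules...
use axonml_text::vocab::Vocab;
use axonml_text::tokenizer::Whitespace;
use axonml_text::datasets::LanguageModelDataset;

// ...or bring the commonly used types in through the prelude
use axonml_text::prelude::*;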
Usage
Add the dependency to your Cargo.toml:
[dependencies]
axonml-text = "0.1.0"
Building a Vocabulary
use axonml_text::prelude::*;

// Build a vocabulary from text with a minimum frequency threshold
let text = "the quick brown fox jumps over the lazy dog";
let vocab = Vocab::from_text(text, 1); // keep tokens that appear at least once

// Or create one pre-populated with the special tokens (PAD, UNK, BOS, EOS, MASK)
let mut vocab = Vocab::with_special_tokens();
vocab.add_token("hello");
vocab.add_token("world");

// Encode tokens to indices, and decode indices back to tokens
let indices = vocab.encode(&["hello", "world"]);
let tokens = vocab.decode(&indices);
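Tokens that never made it into the vocabulary — for example, words filtered out by the frequency threshold — are covered by the UNK special token. A minimal sketch, assuming encode falls back to UNK for unseen tokens:

// "brown" and "fox" appear only once, so a threshold of 2 filters them out
let filtered = Vocab::from_text("the quick brown fox the quick", 2);
let ids = filtered.encode(&["the", "fox"]); // "fox" maps to the UNK index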
Tokenization
use axonml_text::prelude::*;

// Whitespace tokenizer: splits on whitespace
let tokenizer = Whitespace::new();
let tokens = tokenizer.tokenize("Hello World"); // ["Hello", "World"]

// Character-level tokenizer
let char_tokenizer = Char::new();
let chars = char_tokenizer.tokenize("Hi!"); // ["H", "i", "!"]

// Word-punctuation tokenizer (lowercasing variant)
let wp_tokenizer = WordPunct::lowercase();
let tokens = wp_tokenizer.tokenize("Hello, World!"); // ["hello", ",", "world", "!"]

// N-gram tokenizer producing word bigrams
let bigrams = NGram::word_ngrams(2);
let tokens = bigrams.tokenize("one two three"); // ["one two", "two three"]

// BPE tokenizer: learn merge rules from a corpus, then tokenize
let corpus = "low lower lowest newer newest";
let mut bpe = BPE::new();
bpe.train(corpus);
let tokens = bpe.tokenize("lowest newest");
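The unigram tokenizer listed in the features is not shown above; it presumably follows the same train-then-tokenize pattern as BPE. A sketch only, since the constructor and training signature are assumptions:

// Unigram tokenizer: fit a subword model on a corpus, then tokenize
let mut unigram = Unigram::new();
unigram.train("low lower lowest newer newest");
let tokens = unigram.tokenize("lowest newer");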
Text Classification Dataset
use axonml_text::prelude::*;

// Labeled (text, class) samples; the vocabulary is built automatically
let samples = vec![
    ("this movie was great".to_string(), 1),
    ("what a waste of time".to_string(), 0),
];
let dataset = TextClassificationDataset::from_samples(samples, Whitespace::new());

// Use with DataLoader
let loader = DataLoader::new(dataset, 32); // batches of 32
for batch in loader.iter() {
    // training step goes here
}
Language Modeling Dataset
use axonml_text::prelude::*;

let text = "one two three four five six seven eight nine ten";
let dataset = LanguageModelDataset::from_text(text, 4); // sequence length of 4
let (input, target) = dataset.get(0).unwrap();
// input:  [seq_length] - tokens at positions 0..seq_length
// target: [seq_length] - tokens at positions 1..seq_length+1
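With a sequence length of 4, for example, the first item's input covers "one two three four" and its target is the same window shifted by one token, "two three four five", so every position predicts the next token in the text.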
Synthetic Datasets
use axonml_text::prelude::*;

// Synthetic sentiment dataset for testing and prototyping
let sentiment = SentimentDataset::small(); // 100 samples
let sentiment = SentimentDataset::train(); // 10000 samples

// Seq2seq copy/reverse task
let seq2seq = Seq2SeqDataset::copy_task();
Tests
Run the test suite:
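cargo test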
License
Licensed under either of:
- MIT License (LICENSE-MIT or http://opensource.org/licenses/MIT)
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
at your option.