trustformers-tokenizers
High-performance tokenization library for transformer models, with support for 50+ tokenization algorithms.
Version: 0.1.0 | Status: Stable | Tests: 500 | SLoC: 51,211 | Last Updated: 2026-03-21
Current State
This crate provides production-ready tokenizer implementations covering BPE (Byte-Pair Encoding), WordPiece, SentencePiece, TikToken, Fairseq, language-specific tokenizers (Arabic, Chinese, Japanese, Korean), domain-specific tokenizers (Chemical, Music, Math, Code, BIO, Multimodal), and many more. It is designed to be fast, memory-efficient, and compatible with popular tokenizer formats.
Features
Implemented Tokenizers (50+)
General-Purpose
- BPE (Byte-Pair Encoding): Used by GPT models
- Byte-level BPE for better unicode handling
- Efficient merge operations
- Pre-tokenization with regex patterns
- WordPiece: Used by BERT models
- Greedy longest-match-first algorithm
- Unknown token handling
- Case and accent normalization options
- SentencePiece: Unsupervised text tokenizer
- Unigram and BPE modes
- Direct training from raw text
- Language-agnostic design
- TikToken: OpenAI tokenizer (cl100k_base, p50k_base, r50k_base)
- Compatible with GPT-4, ChatGPT, Codex
- Fast BPE implementation
- Fairseq: Dictionary format support
- Moses-style tokenization
- Subword NMT integration
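The greedy longest-match-first matching described above for WordPiece can be sketched in a few lines of plain Rust. This is a simplified illustration, not this crate's implementation; the `##` continuation prefix follows the BERT convention:

```rust
use std::collections::HashSet;

/// Greedy longest-match-first WordPiece tokenization (illustrative sketch).
/// Continuation pieces carry the conventional "##" prefix; a word with no
/// matching segmentation maps to the unknown token.
fn wordpiece(word: &str, vocab: &HashSet<&str>, unk: &str) -> Vec<String> {
    let mut pieces = Vec::new();
    let chars: Vec<char> = word.chars().collect();
    let mut start = 0;
    while start < chars.len() {
        // Find the longest vocabulary entry matching at `start`.
        let mut end = chars.len();
        let mut found = None;
        while end > start {
            let mut piece: String = chars[start..end].iter().collect();
            if start > 0 {
                piece = format!("##{piece}");
            }
            if vocab.contains(piece.as_str()) {
                found = Some(piece);
                break;
            }
            end -= 1;
        }
        match found {
            Some(piece) => {
                pieces.push(piece);
                start = end;
            }
            None => return vec![unk.to_string()],
        }
    }
    pieces
}

fn main() {
    let vocab: HashSet<&str> = ["un", "##aff", "##able"].into_iter().collect();
    println!("{:?}", wordpiece("unaffable", &vocab, "[UNK]"));
    // ["un", "##aff", "##able"]
}
```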
Language-Specific
- Arabic: Morphological segmentation, right-to-left handling, Farasa integration
- Chinese: Character-based, jieba-based word segmentation, radical decomposition
- Japanese: MeCab/SudachiPy integration, kanji/kana normalization, reading variants
- Korean: Morpheme-based with Mecab/Komoran, Hangul decomposition
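The Hangul decomposition mentioned for the Korean tokenizer is pure codepoint arithmetic over the precomposed syllable block (U+AC00..U+D7A3), defined by the Unicode standard; a standalone sketch:

```rust
/// Decompose a precomposed Hangul syllable (U+AC00..U+D7A3) into its
/// leading consonant, vowel, and optional trailing consonant jamo.
fn decompose_hangul(c: char) -> Option<Vec<char>> {
    let cp = c as u32;
    if !(0xAC00..=0xD7A3).contains(&cp) {
        return None;
    }
    let idx = cp - 0xAC00;
    // 21 vowels * 28 tail slots = 588 syllables per leading consonant
    let lead = 0x1100 + idx / 588;
    let vowel = 0x1161 + (idx % 588) / 28;
    let tail = idx % 28; // 0 means no trailing consonant
    let mut jamo = vec![
        char::from_u32(lead).unwrap(),
        char::from_u32(vowel).unwrap(),
    ];
    if tail > 0 {
        jamo.push(char::from_u32(0x11A7 + tail).unwrap());
    }
    Some(jamo)
}

fn main() {
    // '한' decomposes into HIEUH + A + final NIEUN
    println!("{:?}", decompose_hangul('한'));
}
```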
Domain-Specific
- Chemical: SMILES notation, molecular formula, IUPAC names
- Music: ABC notation, MusicXML, chord/tempo symbols
- Math: LaTeX, MathML, expression tree tokenization
- Code: Language-aware (Python, Rust, JavaScript, C/C++, SQL)
- BIO: FASTA/FASTQ, amino acids, gene ontology terms
- Multimodal: Image patches, audio frames, video token interleaving
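As a flavor of what SMILES tokenization involves, here is a minimal standalone scanner (illustrative only, and far simpler than a full SMILES grammar): two-letter atoms such as Cl and Br, and bracket atoms such as [NH4+], must each be kept as a single token:

```rust
/// Minimal SMILES scanner (sketch): splits a SMILES string into tokens,
/// keeping two-letter atoms (Cl, Br) and bracket atoms ([NH4+]) whole.
fn tokenize_smiles(smiles: &str) -> Vec<String> {
    let chars: Vec<char> = smiles.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        if chars[i] == '[' {
            // Bracket atom: consume through the closing ']'
            let start = i;
            while i < chars.len() && chars[i] != ']' {
                i += 1;
            }
            i += 1; // include ']'
            tokens.push(chars[start..i.min(chars.len())].iter().collect());
        } else if i + 1 < chars.len()
            && matches!((chars[i], chars[i + 1]), ('C', 'l') | ('B', 'r'))
        {
            tokens.push(chars[i..i + 2].iter().collect());
            i += 2;
        } else {
            tokens.push(chars[i].to_string());
            i += 1;
        }
    }
    tokens
}

fn main() {
    // Aspirin: every character is its own token here
    println!("{:?}", tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"));
}
```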
Core Features
- Zero-copy vocabulary access: Memory-mapped vocabularies for large-scale use
- SIMD acceleration: Vectorized encoding operations for high throughput
- Async batch processing: Non-blocking tokenization via scirs2-core
- Vocabulary intelligence: Semantic analysis, compression efficiency, cross-lingual coverage
- Training infrastructure: BPE, WordPiece, SentencePiece trainers from corpus
- Batch processing: Efficient handling of multiple texts
- Offset mapping: Track original text positions
- Special tokens: Configurable special token handling
- Padding/Truncation: Automatic sequence length management
- Thread-safe: Safe concurrent tokenization
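The thread-safety claim boils down to encoding being a read-only operation, so one tokenizer can be shared across threads behind an `Arc`. A minimal standalone sketch with a toy vocabulary (not this crate's API):

```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::thread;

/// A minimal immutable tokenizer: safe to share across threads because
/// encoding only reads the vocabulary.
struct TinyTokenizer {
    vocab: HashMap<String, u32>,
}

impl TinyTokenizer {
    fn encode(&self, text: &str) -> Vec<u32> {
        text.split_whitespace()
            .map(|w| *self.vocab.get(w).unwrap_or(&0)) // 0 = unknown
            .collect()
    }
}

fn main() {
    let vocab = HashMap::from([("hello".to_string(), 1), ("world".to_string(), 2)]);
    let tok = Arc::new(TinyTokenizer { vocab });

    // Each thread clones the Arc (a pointer), not the vocabulary.
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let tok = Arc::clone(&tok);
            thread::spawn(move || tok.encode("hello world"))
        })
        .collect();

    for h in handles {
        assert_eq!(h.join().unwrap(), vec![1, 2]);
    }
}
```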
Pre/Post Processing
- Normalization: Unicode normalization (NFC, NFD, NFKC, NFKD)
- Pre-tokenization: Whitespace, punctuation, regex-based splitting
- Post-processing: Template-based token type IDs and attention masks
- Decoding: Convert tokens back to text with proper formatting
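Pre-tokenization and offset mapping interact: each produced piece must remember its byte span in the original text so positions can be traced back after encoding. A standalone whitespace splitter illustrating the idea:

```rust
/// Whitespace pre-tokenizer that records each piece's byte offsets
/// in the original text (a sketch of the offset-mapping idea).
fn pre_tokenize(text: &str) -> Vec<(&str, (usize, usize))> {
    let mut out = Vec::new();
    let mut start = None;
    for (i, c) in text.char_indices() {
        if c.is_whitespace() {
            // Close the current piece, if any.
            if let Some(s) = start.take() {
                out.push((&text[s..i], (s, i)));
            }
        } else if start.is_none() {
            start = Some(i);
        }
    }
    if let Some(s) = start {
        out.push((&text[s..], (s, text.len())));
    }
    out
}

fn main() {
    println!("{:?}", pre_tokenize("Hello  world"));
    // [("Hello", (0, 5)), ("world", (7, 12))]
}
```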
Feature Flags
- python: PyO3 Python bindings (pip-installable package)
- mecab: Japanese/CJK tokenization via MeCab
- gpu: GPU-accelerated tokenization for large batches
- jax: JAX integration for JAX/Flax workflows
- onnx: ONNX export for tokenizer graphs
- pytorch: PyTorch DataLoader integration
- tensorflow: TensorFlow tf.data pipeline integration
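Optional features are enabled in the usual Cargo way; a hypothetical Cargo.toml fragment (crate name and version taken from this README, feature names from the list above):

```toml
# Cargo.toml
[dependencies]
trustformers-tokenizers = { version = "0.1.0", features = ["mecab", "gpu"] }
```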
Usage Example
Basic Tokenization
use trustformers_tokenizers::{BpeTokenizer, WhitespacePreTokenizer, BertPostProcessor};

// Create a tokenizer (type names reconstructed; see the crate docs for the exact API)
let mut tokenizer = BpeTokenizer::new();

// Add pre-tokenizer
tokenizer.with_pre_tokenizer(WhitespacePreTokenizer::default());

// Add post-processor for BERT-style tokens
tokenizer.with_post_processor(BertPostProcessor::default());

// Tokenize text
let encoding = tokenizer.encode("Hello, world!")?;
println!("Tokens: {:?}", encoding.get_tokens());
println!("IDs: {:?}", encoding.get_ids());
Loading Pre-trained Tokenizers
use trustformers_tokenizers::Tokenizer;

// Load from file
let tokenizer = Tokenizer::from_file("tokenizer.json")?;

// Load from Hugging Face format
let tokenizer = Tokenizer::from_pretrained("bert-base-uncased")?;

// Tokenize with offsets
let encoding = tokenizer.encode_with_offsets("Hello, world!")?;
for (token, offsets) in encoding.get_tokens().iter()
    .zip(encoding.get_offsets().iter())
{
    println!("{token}: {offsets:?}");
}
Batch Tokenization
let texts = vec!["First text", "Second text", "Third text"];
let encodings = tokenizer.encode_batch(&texts)?;

// Pad to same length
let padded = tokenizer.pad_batch(encodings)?;
Language-Specific Tokenization
use trustformers_tokenizers::languages::MeCabTokenizer;

// Japanese tokenizer with MeCab (requires `mecab` feature)
let tokenizer = MeCabTokenizer::new()?;
let tokens = tokenizer.encode("日本語のテキストを解析する")?;

use trustformers_tokenizers::languages::JiebaTokenizer;

// Chinese word segmentation
let tokenizer = JiebaTokenizer::new()?;
let tokens = tokenizer.encode("我们对中文文本进行分词")?;
Domain-Specific Tokenization
use trustformers_tokenizers::domains::SmilesTokenizer;

// Chemical SMILES tokenizer
let tokenizer = SmilesTokenizer::new()?;
let tokens = tokenizer.encode("CC(=O)Oc1ccccc1C(=O)O")?; // Aspirin

use trustformers_tokenizers::domains::CodeTokenizer;

// Code-aware tokenizer
let tokenizer = CodeTokenizer::for_language("rust")?;
let tokens = tokenizer.encode("fn main() { println!(\"hello\"); }")?;
Architecture
trustformers-tokenizers/
├── src/
│ ├── tokenizer/ # Main tokenizer interface
│ ├── models/ # Tokenization algorithms
│ │ ├── bpe/ # BPE implementation
│ │ ├── wordpiece/ # WordPiece implementation
│ │ ├── unigram/ # SentencePiece unigram
│ │ ├── tiktoken/ # TikToken implementation
│ │ └── fairseq/ # Fairseq dictionary
│ ├── languages/ # Language-specific tokenizers
│ │ ├── arabic/ # Arabic morphological
│ │ ├── chinese/ # Chinese segmentation
│ │ ├── japanese/ # Japanese MeCab/SudachiPy
│ │ └── korean/ # Korean morpheme
│ ├── domains/ # Domain-specific tokenizers
│ │ ├── chemical/ # SMILES/IUPAC
│ │ ├── music/ # ABC/MusicXML
│ │ ├── math/ # LaTeX/MathML
│ │ ├── code/ # Programming languages
│ │ ├── bio/ # FASTA/amino acids
│ │ └── multimodal/ # Vision/audio tokens
│ ├── pre_tokenizers/ # Pre-processing steps
│ ├── normalizers/ # Text normalization
│ ├── processors/ # Post-processing
│ ├── decoders/ # Token-to-text decoding
│ ├── training/ # Tokenizer trainers
│ └── intelligence/ # Vocabulary analysis tools
Performance
Benchmarks
| Tokenizer | Text Size | Time (ms) | Throughput (MB/s) |
|---|---|---|---|
| BPE | 1KB | 0.12 | 8.3 |
| BPE | 1MB | 45 | 22.2 |
| WordPiece | 1KB | 0.15 | 6.7 |
| WordPiece | 1MB | 52 | 19.2 |
| SentencePiece | 1KB | 0.18 | 5.6 |
| SentencePiece | 1MB | 61 | 16.4 |
| BPE (SIMD) | 1MB | 28 | 35.7 |
| BPE (Batch/Async) | 16x1KB | 0.85 | 18.8 |
Benchmarks on Apple M1, single-threaded unless noted
Memory Usage
- BPE with 50k vocabulary: ~12MB
- WordPiece with 30k vocabulary: ~8MB
- SentencePiece with 32k vocabulary: ~10MB
- Zero-copy memory-mapped vocab (100k): ~2MB resident
Training Tokenizers
use trustformers_tokenizers::training::BpeTrainer;

// Configure trainer (builder method names reconstructed from context)
let mut trainer = BpeTrainer::builder()
    .vocab_size(30_000)
    .min_frequency(2)
    .special_tokens(vec!["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"])
    .build();

// Train from files
let files = vec!["corpus-a.txt", "corpus-b.txt"];
tokenizer.train(&files, &mut trainer)?;

// Save trained tokenizer
tokenizer.save("tokenizer.json")?;
Vocabulary Intelligence
use trustformers_tokenizers::intelligence::VocabAnalyzer;

let analyzer = VocabAnalyzer::new(&tokenizer);

// Semantic clustering
let clusters = analyzer.cluster_semantic_tokens()?;

// Compression efficiency
let stats = analyzer.compression_stats(&corpus)?;
println!("{stats:?}");

// Cross-lingual coverage
let cov = analyzer.cross_lingual_coverage(&["en", "de", "ja"])?;
Compatibility
Supported Formats
- Hugging Face: Full compatibility with the tokenizers library
- SentencePiece: Load .model files directly
- TikToken: Load .tiktoken encoding files
- Fairseq: Dictionary format support
- Custom: JSON-based configuration
Integration
- Direct use in TrustformeRS models
- Python bindings via trustformers-py (PyO3, python feature)
- WASM support via trustformers-wasm
- C API for other language bindings
Advanced Features
Custom Pre-tokenizers
use trustformers_tokenizers::pre_tokenizers::PreTokenizer;

// Custom splitting logic plugs in by implementing the PreTokenizer
// trait (sketch; the exact trait signature may differ):
struct CamelCaseSplitter;
impl PreTokenizer for CamelCaseSplitter { /* splitting logic here */ }
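Whatever the exact trait signature, the splitting logic itself is ordinary Rust. For instance, a camelCase splitter such as a code-aware tokenizer might use (standalone sketch, independent of the crate):

```rust
/// Split an identifier at lowercase-to-uppercase boundaries,
/// e.g. "encodeBatchAsync" -> ["encode", "Batch", "Async"].
fn split_camel_case(ident: &str) -> Vec<String> {
    let mut parts = Vec::new();
    let mut current = String::new();
    let mut prev_lower = false;
    for c in ident.chars() {
        // Start a new part when a lowercase run hits an uppercase letter.
        if c.is_uppercase() && prev_lower && !current.is_empty() {
            parts.push(std::mem::take(&mut current));
        }
        prev_lower = c.is_lowercase();
        current.push(c);
    }
    if !current.is_empty() {
        parts.push(current);
    }
    parts
}

fn main() {
    println!("{:?}", split_camel_case("encodeBatchAsync"));
    // ["encode", "Batch", "Async"]
}
```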
Performance Tips
- Reuse tokenizers: Create once, use many times
- Batch processing: Tokenize multiple texts together
- Pre-compile regex: For custom pre-tokenizers
- Zero-copy vocab: Memory-map vocabularies for 100k+ tokens
- Use appropriate tokenizer: BPE for generation, WordPiece for understanding
- Enable SIMD: Compile with RUSTFLAGS="-C target-cpu=native" for native builds (the +simd128 target feature applies only to wasm32 targets)
Testing
- 500 unit tests with 100% pass rate
- Cross-validation with Python tokenizers (HuggingFace, tiktoken, SentencePiece)
- Fuzzing tests for edge cases
- Performance benchmarks (throughput regression detection)
- Memory leak detection
License
Apache-2.0