# bbpe

Binary byte pair encoding (BPE) tokenizer for Rust. Generates production-ready `tokenizer.json` files compatible with Hugging Face `tokenizers`.
## Installation

```sh
# Install CLI and library
cargo install bbpe
```

For library-only usage without the CLI, disable default features in `Cargo.toml`:

```toml
[dependencies]
bbpe = { version = "0.2", default-features = false }
```
## Quick Start

Train a tokenizer on binary files and use it (flags here are illustrative; run `bbpe <command> --help` for the exact options):

```sh
# Train tokenizer (a 270 MB ISO takes ~3 minutes, achieving ~1.46 MiB/s)
bbpe train disk.iso --output tokenizer.json

# Encode binary to tokens
bbpe encode tokenizer.json disk.iso

# Decode tokens back to bytes
bbpe decode tokenizer.json 6299 144 144

# Inspect tokenizer
bbpe info tokenizer.json
```

The `info` command also lists the tokenizer's `<|…|>`-style special tokens.
## Library Usage

```rust
use bbpe::{Trainer, TrainerConfig};

// Configure and train (argument values are illustrative)
let cfg = TrainerConfig::builder()
    .target_vocab_size(4096)
    .min_frequency(2)
    .build()?;

let trainer = Trainer::new(cfg);
let artifacts = trainer.train_from_paths(&["data/"])?;

// Save tokenizer
artifacts.model.save_huggingface("tokenizer.json")?;

// Use for encoding/decoding
let tokenizer = artifacts.model.binary_tokenizer()?;
let data = b"\x7fELF\x02\x01\x01";
let tokens = tokenizer.encode_bytes(data)?;
let decoded = tokenizer.decode_to_bytes(&tokens)?;
assert_eq!(decoded, data);
```
## CLI Commands
### train

Build a tokenizer from binary inputs.
### chunk-train (experimental)
Train independent tokenizers on fixed-size chunks, capture their intermediate merges, and combine the per-chunk vocabularies using a selectable strategy. This workflow keeps peak memory predictable while letting you experiment with ensemble-style reducers.
The generated report records every chunk's merge sequence and metadata so that additional combination techniques can be prototyped without retraining.
#### Combination modes

- `first` – reuse the vocabulary from the first chunk verbatim.
- `frequency` – aggregate merges by their summed per-chunk frequency.
- `support` – favour merges that appear consistently across chunks (recommended baseline).
- `entropy` – frequency weighting with entropy-based chunk weights (useful for heterogeneous corpora).
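The reduction step can be pictured with a minimal sketch. Everything here (the `Merge` alias, `combine_support`, the tie-breaking rule) is an illustrative assumption, not the crate's internals: the `support` mode ranks merges by how many chunks contain them, falling back to summed frequency.

```rust
use std::collections::HashMap;

// A merge is a pair of token ids; each chunk reports (merge, frequency).
type Merge = (u32, u32);

// `support`-style reducer: rank merges by the number of chunks containing
// them, breaking ties by summed per-chunk frequency.
fn combine_support(chunks: &[Vec<(Merge, u64)>]) -> Vec<Merge> {
    let mut support: HashMap<Merge, (usize, u64)> = HashMap::new();
    for chunk in chunks {
        for &(m, freq) in chunk {
            let e = support.entry(m).or_insert((0, 0));
            e.0 += 1;    // chunks containing this merge
            e.1 += freq; // summed frequency across chunks
        }
    }
    let mut ranked: Vec<(Merge, (usize, u64))> = support.into_iter().collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1)); // by (support, total freq), descending
    ranked.into_iter().map(|(m, _)| m).collect()
}

fn main() {
    let chunks = vec![
        vec![((101u32, 32u32), 10u64), ((116, 104), 3)],
        vec![((101, 32), 4), ((105, 110), 9)],
        vec![((101, 32), 2), ((116, 104), 8)],
    ];
    let merged = combine_support(&chunks);
    // (101, 32) appears in all three chunks, so it ranks first.
    assert_eq!(merged, vec![(101, 32), (116, 104), (105, 110)]);
    println!("{merged:?}");
}
```

A `frequency`-style reducer would sort on `e.1` alone, which lets one unusually large chunk dominate the combined vocabulary.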
#### Performance snapshot
The chunked pipeline is designed to be competitive with the full trainer while slashing peak RAM:
| corpus & vocab | mode | chunk size | peak RSS | wall time | bytes/token |
|---|---|---|---|---|---|
| sample-001.txt (1.38 GiB), vocab 4 096 | full train | — | ~6.9 GiB | 16 m 24 s | 4.39 |
| sample-001.txt (1.38 GiB), vocab 4 096 | support | 4 MiB | ~1.4 GiB | 16 m 04 s | 3.71 |
| sample-002.txt (21.7 MiB), vocab 16 384 | full train | — | 0.20 GiB | 45 s | 6.22 |
| sample-002.txt (21.7 MiB), vocab 16 384 | support | 4 MiB | 0.09 GiB | 43 s | 4.73 |
Support-mode chunking consistently realises the full merge budget, keeps throughput on par with the monolithic trainer, and produces more composable subword tokens.
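The bytes/token column is simply input size divided by emitted token count, so higher means better compression. A one-function sketch of the metric, with made-up numbers (not measured by bbpe):

```rust
// Sketch of the bytes/token metric from the table above.
// The numbers below are invented for illustration only.
fn bytes_per_token(input_bytes: u64, token_count: u64) -> f64 {
    input_bytes as f64 / token_count as f64
}

fn main() {
    // e.g. a 1 MiB input encoded into ~239,000 tokens
    let ratio = bytes_per_token(1 << 20, 239_000);
    println!("{ratio:.2} bytes/token");
    assert!((ratio - 4.39).abs() < 0.01);
}
```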
#### Inspecting token differences

The helper script `scripts/compare_tokenizations.py` makes it easy to compare vocabularies.
The script prints the raw excerpt and the token sequence for each model so you can eyeball how the chunk combiners differ from the full trainer.
### encode

Convert binary files to token sequences. Output can be printed as JSON or written to a file.
### decode

Reconstruct bytes from token IDs, supplied either as command-line arguments or read from a file.
### info
Display tokenizer metadata.
## Configuration
### TrainerConfig

```rust
// Argument values are illustrative defaults, not required settings.
TrainerConfig::builder()
    .target_vocab_size(4096)        // Target vocabulary size
    .min_frequency(2)               // Minimum pair frequency for merging
    .allowed_token_lengths(1..=16)  // Token length range
    .special_tokens(special)        // Special tokens to add
    .show_progress(true)            // Display progress bar
    .build()?;
```
### IngestConfig

```rust
// Argument values are illustrative.
IngestConfig::builder()
    .chunk_size(8 * 1024 * 1024)  // File reading chunk size
    .recursive(true)              // Traverse directories recursively
    .follow_symlinks(false)       // Follow symbolic links
    .build()?;
```
## Python Interoperability

Generated tokenizers work directly with Hugging Face `tokenizers`:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
ids = tokenizer.encode(text).ids  # `text`: your input rendered as a string
# [6299, 144, 144, ...]
```
## Performance

- Training: >1 MiB/s on release builds
- Parallel processing with Rayon
- Incremental pair counting for efficient merging
- Tested on 270 MB files with sub-4-minute training time
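Incremental pair counting means that applying a merge only adjusts the counts adjacent to each replaced occurrence, instead of rescanning the whole corpus after every merge. A minimal sketch of the idea (illustrative only; `apply_merge` and its signature are not the crate's actual data structures):

```rust
use std::collections::HashMap;

/// Replace every occurrence of `pair` in `seq` with `new_id`, adjusting only
/// the pair counts that the replacement creates or destroys.
fn apply_merge(
    seq: &mut Vec<u32>,
    counts: &mut HashMap<(u32, u32), i64>,
    pair: (u32, u32),
    new_id: u32,
) {
    let mut out: Vec<u32> = Vec::with_capacity(seq.len());
    let mut i = 0;
    while i < seq.len() {
        if i + 1 < seq.len() && (seq[i], seq[i + 1]) == pair {
            if let Some(&prev) = out.last() {
                // (prev, left half) disappears; (prev, merged) appears.
                *counts.entry((prev, pair.0)).or_insert(0) -= 1;
                *counts.entry((prev, new_id)).or_insert(0) += 1;
            }
            if i + 2 < seq.len() {
                // (right half, next) disappears; (merged, next) appears.
                *counts.entry((pair.1, seq[i + 2])).or_insert(0) -= 1;
                *counts.entry((new_id, seq[i + 2])).or_insert(0) += 1;
            }
            *counts.entry(pair).or_insert(0) -= 1; // the merged pair itself
            out.push(new_id);
            i += 2;
        } else {
            out.push(seq[i]);
            i += 1;
        }
    }
    *seq = out;
}

fn main() {
    let mut seq = vec![1u32, 2, 3, 1, 2, 4];
    let mut counts: HashMap<(u32, u32), i64> = HashMap::new();
    for w in seq.windows(2) {
        *counts.entry((w[0], w[1])).or_insert(0) += 1;
    }
    apply_merge(&mut seq, &mut counts, (1, 2), 9);
    assert_eq!(seq, vec![9, 3, 9, 4]);
    assert_eq!(counts[&(9, 3)], 1);
    assert_eq!(counts[&(3, 9)], 1);
    assert_eq!(counts[&(1, 2)], 0);
    println!("{seq:?}");
}
```

Because each replaced occurrence touches at most four counts, a full training pass costs work proportional to the number of replacements rather than to corpus size per merge.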
## Development

## License

Apache-2.0