# bbpe
Binary byte pair encoding (BPE) tokenizer for Rust. Generates production-ready tokenizer.json files compatible with Hugging Face tokenizers.
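Byte-level BPE needs no text preprocessing: training starts from the 256 raw byte values and repeatedly merges the most frequent adjacent pair into a new token. A minimal illustration of that selection step (a standalone sketch, not bbpe's internal API):

```rust
use std::collections::HashMap;

/// Find the most frequent adjacent symbol pair in a byte sequence.
/// BPE training repeats this, replacing the winning pair with a fresh
/// token id until the target vocabulary size is reached.
fn most_frequent_pair(data: &[u8]) -> Option<((u8, u8), usize)> {
    let mut counts: HashMap<(u8, u8), usize> = HashMap::new();
    for w in data.windows(2) {
        *counts.entry((w[0], w[1])).or_insert(0) += 1;
    }
    counts.into_iter().max_by_key(|&(_, n)| n)
}

fn main() {
    // (0x00, 0x01) occurs three times; every other pair occurs once.
    let data = [0x00, 0x01, 0x02, 0x00, 0x01, 0x03, 0x00, 0x01];
    let ((a, b), n) = most_frequent_pair(&data).unwrap();
    assert_eq!((a, b), (0x00, 0x01));
    assert_eq!(n, 3);
}
```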
## Installation

```sh
# Install CLI and library
cargo install bbpe
```

For library-only usage without the CLI:

```toml
[dependencies]
bbpe = { version = "0.1", default-features = false }
```
## Quick Start

Train a tokenizer on binary files and use it:

```sh
# Train tokenizer (a 270MB ISO takes ~3 minutes at ~1.46 MiB/s)
bbpe train disk.iso

# Encode binary to tokens
bbpe encode input.bin

# Decode tokens back to bytes
bbpe decode 6299 144 144

# Inspect tokenizer: vocab size and the seven <|…|> special tokens
bbpe info tokenizer.json
```
## Library Usage

```rust
use bbpe::{Trainer, TrainerConfig};

// Configure and train
let cfg = TrainerConfig::builder()
    .target_vocab_size(4096)
    .min_frequency(2)
    .build()?;

let trainer = Trainer::new(cfg);
let artifacts = trainer.train_from_paths(&["data/firmware.bin"])?;

// Save tokenizer
artifacts.model.save_huggingface("tokenizer.json")?;

// Use for encoding/decoding
let data = std::fs::read("data/firmware.bin")?;
let tokenizer = artifacts.model.binary_tokenizer()?;
let tokens = tokenizer.encode_bytes(&data)?;
let decoded = tokenizer.decode_to_bytes(&tokens)?;
assert_eq!(decoded, data);
```
## CLI Commands
### train

Build a tokenizer from binary inputs.
Example with a 1GB corpus:

```sh
bbpe train corpus/
```
### encode

Convert binary files to token sequences.
Examples:

```sh
# JSON output
bbpe encode input.bin

# Save to file
bbpe encode input.bin > tokens.json
```
### decode

Reconstruct bytes from token IDs.
Examples:

```sh
# Decode from arguments
bbpe decode 6299 144 144

# Decode from file
bbpe decode < tokens.json > restored.bin
```
### info

Display tokenizer metadata.

```sh
bbpe info tokenizer.json
```
## Configuration
### TrainerConfig

```rust
TrainerConfig::builder()
    .target_vocab_size(…)      // Target vocabulary size
    .min_frequency(…)          // Minimum pair frequency for merging
    .allowed_token_lengths(…)  // Token length range
    .special_tokens(…)         // Special tokens to add
    .show_progress(…)          // Display progress bar
    .build()
```
### IngestConfig

```rust
IngestConfig::builder()
    .chunk_size(…)       // File reading chunk size
    .recursive(…)        // Traverse directories recursively
    .follow_symlinks(…)  // Follow symbolic links
    .build()
```
## Python Interoperability

Generated tokenizers work directly with the Hugging Face `tokenizers` library:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
ids = tokenizer.encode("…").ids
# [6299, 144, 144, ...]
```
## Performance
- Training: >1 MiB/s on release builds
- Parallel processing with Rayon
- Incremental pair counting for efficient merging
- Tested on 270MB files with sub-4-minute training time
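The incremental pair counting above can be sketched as follows: when a pair (a, b) is merged into a new token, only the counts adjacent to each merge site change, so a merge costs time proportional to its occurrences rather than to the corpus size. A simplified, single-threaded sketch (names and structure are illustrative; the real trainer additionally parallelizes with Rayon):

```rust
use std::collections::HashMap;

type Pair = (u32, u32);

/// Full scan: count every adjacent token pair (done once, at startup).
fn pair_counts(seq: &[u32]) -> HashMap<Pair, usize> {
    let mut counts = HashMap::new();
    for w in seq.windows(2) {
        *counts.entry((w[0], w[1])).or_insert(0) += 1;
    }
    counts
}

/// Decrement a pair count, removing the entry when it reaches zero.
fn decr(counts: &mut HashMap<Pair, usize>, key: Pair) {
    let left = counts.get_mut(&key).map(|n| {
        *n -= 1;
        *n
    });
    if left == Some(0) {
        counts.remove(&key);
    }
}

/// Merge every occurrence of `pair` into `new_id`, updating `counts`
/// incrementally: only pairs touching a merge site are adjusted.
fn merge_pair(seq: &[u32], pair: Pair, new_id: u32, counts: &mut HashMap<Pair, usize>) -> Vec<u32> {
    let mut out = Vec::with_capacity(seq.len());
    let mut i = 0;
    while i < seq.len() {
        if i + 1 < seq.len() && (seq[i], seq[i + 1]) == pair {
            if let Some(&prev) = out.last() {
                decr(counts, (prev, pair.0)); // left neighbour pair vanishes
                *counts.entry((prev, new_id)).or_insert(0) += 1;
            }
            if i + 2 < seq.len() {
                decr(counts, (pair.1, seq[i + 2])); // right neighbour pair vanishes
                *counts.entry((new_id, seq[i + 2])).or_insert(0) += 1;
            }
            decr(counts, pair); // the merged pair itself
            out.push(new_id);
            i += 2;
        } else {
            out.push(seq[i]);
            i += 1;
        }
    }
    out
}

fn main() {
    let seq = vec![1, 2, 1, 2, 3];
    let mut counts = pair_counts(&seq);
    let merged = merge_pair(&seq, (1, 2), 256, &mut counts);
    assert_eq!(merged, vec![256, 256, 3]);
    // Incrementally maintained counts match a full recount.
    assert_eq!(counts, pair_counts(&merged));
}
```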
## Development

```sh
cargo build --release
cargo test
```
## License
Apache-2.0