bbpe

Binary byte pair encoding (BPE) tokenizer for Rust. Generates production-ready tokenizer.json files compatible with Hugging Face tokenizers.

Installation

# Install the CLI
cargo install bbpe

# Add the library to a Rust project
cargo add bbpe

For library-only usage without CLI:

bbpe = { version = "0.2", default-features = false }

Quick Start

Train a tokenizer on binary files and use it:

# Train tokenizer (270MB ISO takes ~3 minutes, achieves 1.46 MiB/s)
bbpe train /tmp/alpine-standard-3.22.2-x86_64.iso --vocab-size 8192 -o tokenizer.json

# Encode binary to tokens
bbpe encode -m tokenizer.json binary.file --json
{"path":"binary.file","tokens":[6299,144,144,144,6299,6335,213,4238]}

# Decode tokens back to bytes
bbpe decode -m tokenizer.json --output decoded.bin 6299 144 144 144

# Inspect tokenizer
bbpe info -m tokenizer.json
Model type   : BPE
Vocab size   : 8185
Merges       : 7929
Byte fallback: true
Special tokens: <|start|>, <|end|>, <|pad|>, <|unk|>, <|cls|>, <|sep|>, <|mask|>

Library Usage

use bbpe::{Trainer, TrainerConfig, IngestConfig};

// Configure and train
let cfg = TrainerConfig::builder()
    .target_vocab_size(4096)
    .min_frequency(2)
    .build()?;

let trainer = Trainer::new(cfg);
let artifacts = trainer.train_from_paths(
    &["/path/to/binaries"],
    &IngestConfig::default()
)?;

// Save tokenizer
artifacts.model.save_huggingface("tokenizer.json")?;

// Use for encoding/decoding
let tokenizer = artifacts.model.binary_tokenizer()?;
let tokens = tokenizer.encode_bytes(b"\x7fELF\x02\x01", false)?;
let decoded = tokenizer.decode_to_bytes(&tokens, true)?;
assert_eq!(decoded, b"\x7fELF\x02\x01");
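
Putting the pieces together, here is a sketch of an end-to-end run (using only the calls shown above; the paths are placeholders) that trains a tokenizer and reports the bytes/token ratio used in the benchmarks later in this README:

use bbpe::{IngestConfig, Trainer, TrainerConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Train on a directory of binaries.
    let cfg = TrainerConfig::builder()
        .target_vocab_size(4096)
        .min_frequency(2)
        .build()?;
    let artifacts = Trainer::new(cfg)
        .train_from_paths(&["/path/to/binaries"], &IngestConfig::default())?;
    artifacts.model.save_huggingface("tokenizer.json")?;

    // Measure compression on a held-out file (placeholder path).
    let tokenizer = artifacts.model.binary_tokenizer()?;
    let data = std::fs::read("/path/to/sample.bin")?;
    let tokens = tokenizer.encode_bytes(&data, false)?;
    println!("bytes/token: {:.2}", data.len() as f64 / tokens.len() as f64);
    Ok(())
}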

CLI Commands

train

Build a tokenizer from binary inputs.

bbpe train <INPUT_PATH> [OPTIONS]

Options:
  --vocab-size <SIZE>       Target vocabulary size [default: 32768]
  --min-frequency <FREQ>    Minimum pair frequency [default: 4]
  --chunk-size <BYTES>      File chunk size [default: 8192]
  -o, --output <PATH>       Output tokenizer path [default: tokenizer.json]
  --no-progress             Disable progress bar
  --threads <NUM>           Thread pool size

Example with 1GB corpus:

bbpe train ./corpus --vocab-size 16384 --min-frequency 128 -o large.json

chunk-train (experimental)

Train independent tokenizers on fixed-size chunks, capture their intermediate merges, and combine the per-chunk vocabularies using a selectable strategy. This workflow keeps peak memory predictable while letting you experiment with ensemble-style reducers.

bbpe chunk-train <INPUT_PATH>... [OPTIONS]

Key options:
  --chunk-size <BYTES>     Target chunk size (default: 32 MiB)
  --vocab-size <SIZE>      Target vocabulary size per chunk
  --min-frequency <FREQ>   Minimum pair frequency per chunk
  --combine-mode <MODE>    Vocabulary combiner: first | frequency | support | entropy (default: first)
  --output <PATH>          Combined tokenizer path (default: chunked-tokenizer.json)
  --report <PATH>          JSON report capturing per-chunk merges (default: chunk_train_report.json)

Example:

bbpe chunk-train ./corpus \
  --chunk-size $((8 * 1024 * 1024)) \
  --vocab-size 4096 \
  --combine-mode first \
  --output chunked.json \
  --report chunked_report.json

The generated report records every chunk's merge sequence and metadata so that additional combination techniques can be prototyped without retraining.

Combination modes

  • first – reuse the vocabulary from the first chunk verbatim.
  • frequency – aggregate merges by their summed per-chunk frequency.
  • support – favour merges that appear consistently across chunks (recommended baseline; see the sketch after this list).
  • entropy – frequency weighting with entropy-based chunk weights (useful for heterogeneous corpora).
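
To make the support strategy concrete, here is a minimal sketch of the idea (an illustration, not the crate's internal implementation; combine_by_support is an invented name, and it ignores the merge-order constraints a real combiner must respect): rank every candidate merge by how many chunks proposed it, break ties by summed frequency, and keep the top of the ranking.

use std::collections::HashMap;

/// Rank merges by (number of chunks proposing them, summed frequency)
/// and keep the `budget` best ones.
fn combine_by_support(
    chunks: &[Vec<((String, String), u64)>], // per-chunk (merge pair, frequency)
    budget: usize,
) -> Vec<(String, String)> {
    let mut stats: HashMap<(String, String), (u32, u64)> = HashMap::new();
    for chunk in chunks {
        for (pair, freq) in chunk {
            let entry = stats.entry(pair.clone()).or_insert((0, 0));
            entry.0 += 1;    // support: chunks that proposed this merge
            entry.1 += freq; // total frequency across chunks
        }
    }
    let mut ranked: Vec<_> = stats.into_iter().collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1)); // highest (support, frequency) first
    ranked.into_iter().take(budget).map(|(pair, _)| pair).collect()
}

fn main() {
    let chunks = vec![
        vec![(("a".into(), "b".into()), 10u64), (("b".into(), "c".into()), 3)],
        vec![(("a".into(), "b".into()), 7), (("c".into(), "d".into()), 9)],
    ];
    // ("a", "b") is proposed by both chunks, so it outranks merges that are
    // more frequent but seen in only one chunk.
    println!("{:?}", combine_by_support(&chunks, 2));
}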

Performance snapshot

The chunked pipeline is designed to be competitive with the full trainer while slashing peak RAM:

corpus & vocab mode chunk size peak RSS wall time bytes/token
sample-001.txt (1.38 GiB), vocab 4 096 full train ~6.9 GiB 16 m 24 s 4.39
sample-001.txt (1.38 GiB), vocab 4 096 support 4 MiB ~1.4 GiB 16 m 04 s 3.71
sample-002.txt (21.7 MiB), vocab 16 384 full train 0.20 GiB 45 s 6.22
sample-002.txt (21.7 MiB), vocab 16 384 support 4 MiB 0.09 GiB 43 s 4.73

Support-mode chunking consistently realises the full merge budget, keeps throughput on par with the monolithic trainer, and produces more composable subword tokens.

Inspecting token differences

The helper script scripts/compare_tokenizations.py makes it easy to compare the tokenizations produced by different models:

uv run --with tokenizers python scripts/compare_tokenizations.py \
  --text ~/sample-001.txt \
  --bytes 256 \
  --model full:/tmp/sample001-full.json \
  --model support-4m:/tmp/sample001-support-4m.json

The script prints the raw excerpt and the token sequence for each model so you can eyeball how the chunk combiners differ from the full trainer.

encode

Convert binary files to token sequences.

bbpe encode -m <TOKENIZER> <FILES>... [OPTIONS]

Options:
  --json                    Output JSON format
  --output <PATH>           Save tokens to file
  --skip-special-tokens     Skip special tokens

Examples:

# JSON output
bbpe encode -m tokenizer.json binary.exe --json

# Save to file
bbpe encode -m tokenizer.json data.bin --output data.tokens

decode

Reconstruct bytes from token IDs.

bbpe decode -m <TOKENIZER> [IDS]... [OPTIONS]

Options:
  --input <PATH>            Read tokens from file
  --output <PATH>           Save decoded bytes (default: stdout)
  --skip-special-tokens     Skip special tokens

Examples:

# Decode from arguments
bbpe decode -m tokenizer.json 6299 144 144 --output restored.bin

# Decode from file
bbpe decode -m tokenizer.json --input tokens.txt --output restored.bin

info

Display tokenizer metadata.

bbpe info -m <TOKENIZER> [--json]

Configuration

TrainerConfig

TrainerConfig::builder()
    .target_vocab_size(8192)         // Target vocabulary size
    .min_frequency(2)                // Minimum pair frequency for merging
    .allowed_token_lengths(1..=16)   // Token length range
    .special_tokens(vec![...])       // Special tokens to add
    .show_progress(true)             // Display progress bar
    .build()

IngestConfig

IngestConfig::builder()
    .chunk_size(8192)       // File reading chunk size
    .recursive(true)        // Traverse directories recursively
    .follow_symlinks(false) // Follow symbolic links
    .build()

Python Interoperability

Generated tokenizer.json files load directly into the Hugging Face tokenizers library:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
# `binary_data` is a placeholder: Tokenizer.encode expects a str, not raw
# bytes, so pass the text form of your data.
encoded = tokenizer.encode(binary_data)
print(encoded.ids)  # [6299, 144, 144, ...]

Performance

  • Training: >1 MiB/s on release builds
  • Parallel processing with Rayon
  • Incremental pair counting for efficient merging (see the sketch below)
  • Tested on 270MB files with sub-4-minute training time
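
A minimal sketch of that incremental idea (not the crate's actual code; apply_merge is an invented name): applying one merge only changes the pair counts adjacent to each rewrite site, so those few entries can be adjusted in place instead of recounting the whole corpus.

use std::collections::HashMap;

type Tok = u32;

/// Apply one merge (a, b) -> c to `seq`, updating global pair counts
/// incrementally instead of recounting every pair.
fn apply_merge(
    seq: &mut Vec<Tok>,
    counts: &mut HashMap<(Tok, Tok), i64>,
    (a, b): (Tok, Tok),
    c: Tok,
) {
    let mut out = Vec::with_capacity(seq.len());
    let mut i = 0;
    while i < seq.len() {
        if i + 1 < seq.len() && seq[i] == a && seq[i + 1] == b {
            *counts.entry((a, b)).or_insert(0) -= 1; // the merged pair vanishes
            if let Some(&left) = out.last() {
                *counts.entry((left, a)).or_insert(0) -= 1;
                *counts.entry((left, c)).or_insert(0) += 1;
            }
            if i + 2 < seq.len() {
                let right = seq[i + 2];
                *counts.entry((b, right)).or_insert(0) -= 1;
                *counts.entry((c, right)).or_insert(0) += 1;
            }
            out.push(c);
            i += 2;
        } else {
            out.push(seq[i]);
            i += 1;
        }
    }
    *seq = out;
}

fn main() {
    let mut seq = vec![1u32, 2, 3, 1, 2];
    let mut counts = HashMap::new();
    for w in seq.windows(2) {
        *counts.entry((w[0], w[1])).or_insert(0i64) += 1; // initial full count
    }
    apply_merge(&mut seq, &mut counts, (1, 2), 9);
    assert_eq!(seq, vec![9, 3, 9]);
    assert_eq!(counts[&(9, 3)], 1);
}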

Development

cargo fmt
cargo clippy --all-targets --all-features -- -D warnings
cargo test

License

Apache-2.0