# bbpe
Binary-aware byte pair encoding (BPE) toolkit for Rust. `bbpe` trains Hugging Face–compatible `tokenizer.json` assets directly from raw binaries or newline-delimited JSON (JSONL) text, making it easy to build tokenizers for firmware, malware, structured data dumps, or hybrid corpora.
- **Binary-first**: Streams arbitrary bytes, honors null-delimited boundaries, and always exports true byte fallback Hugging Face models.
- **Configurable preprocessing**: Deterministic ASCII/Unicode/null splitters or probabilistic boundary sampling to encourage multi-word merges.
- **JSONL ingestion**: Point at `file.jsonl[:field.path]` (gzip optional) to train on textual corpora without pre-extracting files.
- **Hierarchical vocabularies**: Train once at a large vocab, derive smaller siblings whose token IDs remain aligned.
- **Chunked training**: Experimental `chunk-train` command allows ensemble-style training on arbitrarily large corpora.
## Installation
```bash
# CLI + library
cargo install bbpe
# Add as a dependency (library only)
cargo add bbpe@0.5.0
```
For library-only usage without the CLI feature:
```toml
bbpe = { version = "0.5", default-features = false }
```
## Quick Start
### Train on binaries
```bash
bbpe train firmware.bin --vocab-size 4096 --min-frequency 4 \
--preprocessor null-delimited -o tokenizer.json
```
### Train on JSONL text (gzip supported)
```bash
bbpe train \
--jsonl corpus.jsonl:text \
--jsonl docs.jsonl.gz:data.body \
--vocab-size 8192 --min-frequency 2 \
--preprocessor ascii-whitespace \
--preprocessor-probability 0.75 \
--preprocessor-seed 42 \
--family-size 4096 --family-size 2048 \
--family-template out/tokenizer-{size}.json \
-o tokenizer-8192.json
```
### Encode / Decode
```bash
bbpe encode -m tokenizer.json binary.bin --json > tokens.json
bbpe decode -m tokenizer.json --input tokens.txt --output restored.bin
```
## CLI Reference
Every command accepts `-q/--quiet` and `-v/--verbose` for logging control. Paths can be repeated.
### `train`
Build a tokenizer from binary files, directories, and/or JSONL specs.
```
bbpe train [FILES]... [--jsonl PATH:FIELD]... [OPTIONS]
```
Key options:
| `--jsonl PATH:FIELD` | Train from newline-delimited JSON. `FIELD` uses `dot.separated.keys`. Files ending in `.gz` are transparently decompressed. Repeatable. |
| `-o, --output PATH` | Output `tokenizer.json` (default `tokenizer.json`). |
| `--vocab-size SIZE` | Target vocabulary (default 32768). Must be ≥ 256 + special tokens. |
| `--min-frequency COUNT` | Minimum pair frequency for merges (default 4). |
| `--chunk-size BYTES` | Read binary inputs in chunks (default 8192, `0` = whole file). |
| `--special-token TOKEN` | Append custom special tokens (repeatable). |
| `--preprocessor MODE` | `none`, `ascii-whitespace`, `unicode-whitespace`, or `null-delimited`. |
| `--preprocessor-probability P` | Probability `[0,1]` that each detected boundary is kept (default `1.0`). Set `< 1` to occasionally merge across whitespace/null runs. |
| `--preprocessor-seed SEED` | Seed RNG for probabilistic preprocessing. |
| `--family-size SIZE` | Emit derived vocabularies after training (repeat flag). |
| `--family-template PATTERN` | Output template for derived models (supports `{size}`). |
| `--max-merge-iterations COUNT` | Hard merge ceiling (defaults to vocab budget). |
| `--allowed-length LEN` | Restrict merge lengths (repeat). |
| `--threads N` | Override Rayon worker count. |
| `--no-progress` | Disable spinner output (logging still shows start/end). |
| `--min-entropy BITS`, `--max-entropy BITS` | Filter sequences outside entropy window before training. |
### `chunk-train` (experimental)
Train chunk-wise vocabularies and combine them. Useful when full-corpus training would exceed memory.
```
bbpe chunk-train [FILES]... [OPTIONS]
```
Notable flags:
- `--chunk-size BYTES` (required, default 32 MiB)
- `--combine-mode first|frequency|support|entropy`
- `--duplicates count|unique`
- `--output PATH`, `--report PATH`, `--no-report`
- Same preprocessing / probability knobs as `train`. (Currently, JSONL ingestion is only available on the main `train` command.)
### `encode`
```
bbpe encode -m tokenizer.json <FILES>... [--json] [--output PATH] [--skip-special-tokens]
```
### `decode`
```
bbpe decode -m tokenizer.json [IDS]... [--input PATH] [--output PATH] [--skip-special-tokens]
```
### `info`
```
bbpe info -m tokenizer.json [--json]
```
## Preprocessing Modes
`bbpe` optionally splits inputs before merge counting:
- **ascii-whitespace** – Splits on ASCII whitespace (`space`, `\t`, `\r`, `\n`, `VT`, `FF`).
- **unicode-whitespace** – Uses `char::is_whitespace`, preserving Unicode separators (implemented with `bstr`).
- **null-delimited** – Splits on contiguous `0x00` runs (great for binaries with C strings or padded records).
- **none** – Raw byte stream.
Set `--preprocessor-probability <P>` (or `TrainerConfig::builder().preprocessor_split_probability(P)`) to randomly *keep* only a subset of boundaries. Example: `P=0.8` keeps 80 % of whitespace splits while letting 20 % of tokens cross word boundaries, encouraging the model to learn multi-word merges. Provide `--preprocessor-seed` for reproducibility. Hugging Face exports automatically drop the pre-tokenizer section when `P < 1.0`, so downstream inference sees the exact byte stream used during training.
## JSONL Ingestion
Use `--jsonl path.jsonl:field.path` to read newline-delimited JSON alongside binary files. Examples:
- `--jsonl data.jsonl:text`
- `--jsonl downloads/articles.jsonl.gz:payload.body`
Field paths use dot notation; numeric array indices are allowed (`choices.0.text`). Empty lines are skipped, non-string fields error out. The reader accepts gzip-compressed inputs (`.jsonl.gz`).
Library equivalent:
```rust
use bbpe::{Trainer, TrainerConfig, TrainerArtifacts};
use bbpe::corpus::JsonlSpec;
let cfg = TrainerConfig::builder()
.target_vocab_size(8192)
.min_frequency(2)
.preprocessor_split_probability(0.7)
.preprocessor_seed(Some(1337))
.build()?;
let trainer = Trainer::new(cfg);
let specs = ["corpus.jsonl:text".parse::<JsonlSpec>()?];
let TrainerArtifacts { model, .. } = trainer.train_from_jsonl(&specs)?;
model.save_huggingface("tokenizer.json")?;
```
## Hierarchical Tokenizer Families
After training the largest vocab, derive smaller siblings whose token IDs stay aligned:
```bash
bbpe train corpus.bin --vocab-size 32768 \
--family-size 8192 --family-size 4096 \
--family-template artifacts/model-{size}.json \
-o artifacts/model-32768.json
```
`BpeModel::derive_with_vocab(size)` mirrors the CLI for library users. Special tokens remain clustered at the end so IDs stay consistent across the family.
## Chunk Training (experimental)
`bbpe chunk-train` slices the corpus into fixed-size windows, trains independent vocabularies, and merges them via user-selectable strategies (`support`, `frequency`, etc.). Duplicate chunk caching, entropy filtering, and reporting are built in so you can prototype large corpora without huge memory spikes.
## Library Primer
```rust
use bbpe::{Trainer, TrainerConfig, IngestConfig, PreprocessorConfig, PreprocessorKind};
let trainer_cfg = TrainerConfig::builder()
.target_vocab_size(4096)
.min_frequency(2)
.preprocessor(PreprocessorConfig {
kind: PreprocessorKind::UnicodeWhitespace,
split_probability: 0.9,
seed: Some(42),
})
.build()?;
let trainer = Trainer::new(trainer_cfg);
let ingest_cfg = IngestConfig::builder().chunk_size(0).build();
let artifacts = trainer.train_from_paths(&["firmware.bin"], &ingest_cfg)?;
artifacts.model.save_huggingface("tokenizer.json")?;
// Hierarchy
let small = artifacts.model.derive_with_vocab(2048)?;
small.save_huggingface("tokenizer-2048.json")?;
```
`Trainer::train_from_jsonl(&[JsonlSpec])` mirrors the CLI `--jsonl` behavior.
## Python Interoperability
All tokenizers are Hugging Face JSON artifacts:
```python
from tokenizers import Tokenizer
model = Tokenizer.from_file("tokenizer.json")
encoding = model.encode("\x7fELF", add_special_tokens=False)
print(encoding.ids)
```
## Development & Testing
```bash
cargo fmt
cargo clippy --all-targets --all-features -- -D warnings
cargo test
# Optional sanity check before publishing
cargo package
```
## License
Apache-2.0