# wordchipper - HPC Rust BPE Tokenizer

## Status
This crate is ready for alpha users, and is roughly 2x the speed of tiktoken-rs for many current models.

Productionization toward a long-term stable release can be tracked in the Alpha Release Tracking Issue.
## Overview

wordchipper is a high-performance Rust BPE tokenizer trainer/encoder/decoder.
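For readers unfamiliar with BPE, here is a minimal sketch of what the greedy byte-pair encoding step does. This is purely illustrative: the `bpe_encode` helper and the toy merge table are inventions for this example, not wordchipper's API.

```rust
use std::collections::HashMap;

/// Greedy BPE over a single word, given a merge-rank table.
/// Repeatedly merges the adjacent pair with the lowest rank
/// until no adjacent pair appears in the table.
fn bpe_encode(word: &str, ranks: &HashMap<(String, String), usize>) -> Vec<String> {
    let mut parts: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the adjacent pair with the best (lowest) merge rank.
        let mut best: Option<(usize, usize)> = None; // (rank, index)
        for i in 0..parts.len().saturating_sub(1) {
            if let Some(&r) = ranks.get(&(parts[i].clone(), parts[i + 1].clone())) {
                if best.map_or(true, |(br, _)| r < br) {
                    best = Some((r, i));
                }
            }
        }
        match best {
            Some((_, i)) => {
                // Merge parts[i] and parts[i + 1] into one token.
                let merged = format!("{}{}", parts[i], parts[i + 1]);
                parts.splice(i..i + 2, [merged]);
            }
            None => break,
        }
    }
    parts
}

fn main() {
    // Toy merge table: "l"+"o" merges first, then "lo"+"w".
    let ranks = HashMap::from([
        (("l".to_string(), "o".to_string()), 0),
        (("lo".to_string(), "w".to_string()), 1),
    ]);
    println!("{:?}", bpe_encode("lower", &ranks)); // ["low", "e", "r"]
}
```

A production encoder (like this crate) avoids the quadratic rescanning above with more careful data structures; the merge order, however, is the same.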
## Suite Crates
This is the main crate for the wordchipper project.
The core additional user-facing crates are:
- `wordchipper-cli` - a multi-tool tokenizer binary; notably:
  - `wordchipper-cli cat` - an in-line encoder/decoder tool.
  - `wordchipper-cli train` - a tokenizer training tool.
- `wordchipper-training` - an extension crate for training tokenizers.
## Encode/Decode Side-by-Side Benchmarks
| Model | wordchipper | tiktoken-rs | tokenizers |
|---|---|---|---|
| r50k_base | 239.19 MiB/s | 169.30 MiB/s | 22.03 MiB/s |
| p50k_base | 250.55 MiB/s | 163.07 MiB/s | 22.23 MiB/s |
| p50k_edit | 241.69 MiB/s | 169.76 MiB/s | 21.27 MiB/s |
| cl100k_base | 214.26 MiB/s | 125.43 MiB/s | 21.62 MiB/s |
| o200k_base | 119.49 MiB/s | 123.75 MiB/s | 22.03 MiB/s |
| o200k_harmony | 121.80 MiB/s | 121.54 MiB/s | 22.08 MiB/s |
- Help? - I'm assuming some bug on my part for tokenizers+rayon.
- Methodology: 90 MB shards of 1024 samples each, 48 threads.

```shell
$ for m in openai/{r50k_base,p50k_base,p50k_edit,cl100k_base,o200k_base,o200k_harmony}; \
  do RAYON_NUM_THREADS=48 cargo run --release -p sample-timer -- \
    --dataset-dir $DATASET_DIR --shards 0 --model $m; done
```
## Client Usage

### Pretrained Vocabularies

### Encoders and Decoders

### Loading Pretrained Models
Loading a pretrained model requires reading the vocabulary, as well as configuring the spanning behavior (the split regex and the special words). For a number of pretrained models, simplified constructors are available that download, cache, and load the vocabulary.
```rust
use std::sync::Arc;
use …; // the wordchipper import list was elided in the original
```
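Conceptually, "reading the vocabulary" means building a token-bytes-to-rank table, the core lookup structure an encoder needs. Below is a minimal, self-contained sketch assuming a simplified `token<TAB>rank` text format; the actual pretrained vocabulary files use a different (base64-encoded) layout, and `parse_vocab` is a hypothetical helper, not wordchipper's API.

```rust
use std::collections::HashMap;

/// Parse a simplified vocabulary listing of `token<TAB>rank` lines into a
/// token-bytes -> rank map. Blank and malformed lines are skipped.
fn parse_vocab(text: &str) -> HashMap<Vec<u8>, u32> {
    text.lines()
        .filter(|l| !l.is_empty())
        .filter_map(|line| {
            let (tok, rank) = line.split_once('\t')?;
            Some((tok.as_bytes().to_vec(), rank.parse().ok()?))
        })
        .collect()
}

fn main() {
    let vocab = parse_vocab("hello\t0\nworld\t1\n");
    println!("loaded {} entries", vocab.len());
}
```

The simplified constructors mentioned above presumably wrap this kind of parsing together with downloading and caching the vocabulary file.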