# wordchipper - HPC Rust BPE Tokenizer
## Overview
wordchipper is a high-performance Rust byte-pair encoding (BPE) tokenizer for the OpenAI GPT-2 tokenizer
family. In Rust on a 64-core machine, it achieves throughput speedups over tiktoken-rs
of ~4.3-5.7x (4 to 64 cores) for general regex-based BPE vocabularies,
and ~6.9-9.2x when using custom DFA lexers for specific OpenAI vocabularies.
Through the Python wrappers, we see speedups of ~2-4x (4 to 64 cores) over
tiktoken.
## Suite Crates
This is the main crate for the wordchipper project.
The core additional user-facing crates are:
- wordchipper-cli - a multi-tool tokenizer binary; notably:
  - `wordchipper-cli cat` - in-line encoder/decoder tool.
  - `wordchipper-cli train` - tokenizer training tool.
- wordchipper-training - an extension crate for training tokenizers.
## Encode/Decode Side-by-Side Benchmarks
| Tokenizer (64 cores) | r50k (Rust) | gpt2 (Python) | o200k (Rust) | o200k (Python) |
|---|---|---|---|---|
| wordchipper:logos | 2.7 GiB/s | 114.1 MiB/s | 2.4 GiB/s | 123.7 MiB/s |
| wordchipper | 1.7 GiB/s | 110.5 MiB/s | 1.5 GiB/s | 106.5 MiB/s |
| tiktoken* | 386.0 MiB/s | 25.5 MiB/s | 265.2 MiB/s | 32.7 MiB/s |
| bpe-openai | 60.9 MiB/s | 11.1 MiB/s | | |
| tokenizers | 49.7 MiB/s | 20.8 MiB/s | 50.2 MiB/s | 23.2 MiB/s |
Read the full performance paper:
## Client Usage
### Pretrained Vocabularies
### Encoders and Decoders
### Loading Pretrained Models
Loading a pretrained model requires reading the vocabulary and configuring the spanning behavior (the regex pattern and special tokens).
For a number of pretrained models, simplified constructors are available that download, cache, and load the vocabulary.
```rust
use std::sync::Arc;
use ;
```
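To make the mechanics concrete, below is a minimal, self-contained sketch of the greedy merge loop at the heart of any BPE encoder. This is illustrative only: the tiny vocabulary, rank values, and function name are made up for the example and are not wordchipper's API, which additionally applies regex/DFA pre-tokenization before merging.

```rust
use std::collections::HashMap;

/// Toy BPE encoder: start from single bytes, then repeatedly merge the
/// adjacent pair with the lowest (earliest-learned) rank until no mergeable
/// pair remains. Real implementations fall back to byte-level tokens for
/// anything outside the vocabulary; this sketch assumes full coverage.
fn bpe_encode(text: &str, ranks: &HashMap<Vec<u8>, u32>) -> Vec<u32> {
    let mut parts: Vec<Vec<u8>> = text.bytes().map(|b| vec![b]).collect();
    loop {
        // Scan adjacent pairs for the best-ranked merge candidate.
        let mut best: Option<(usize, u32)> = None;
        for i in 0..parts.len().saturating_sub(1) {
            let pair = [parts[i].as_slice(), parts[i + 1].as_slice()].concat();
            if let Some(&rank) = ranks.get(&pair) {
                if best.map_or(true, |(_, r)| rank < r) {
                    best = Some((i, rank));
                }
            }
        }
        match best {
            Some((i, _)) => {
                // Merge parts[i] and parts[i + 1] into a single token.
                let merged = [parts[i].as_slice(), parts[i + 1].as_slice()].concat();
                parts[i] = merged;
                parts.remove(i + 1);
            }
            None => break, // no pair left in the vocabulary
        }
    }
    parts.iter().map(|p| ranks[p]).collect()
}

fn main() {
    // Hypothetical rank table: lower rank = merged earlier.
    let mut ranks: HashMap<Vec<u8>, u32> = HashMap::new();
    for (i, tok) in ["l", "o", "w", "lo", "low"].iter().enumerate() {
        ranks.insert(tok.as_bytes().to_vec(), i as u32);
    }
    // "low" -> ["l","o","w"] -> ["lo","w"] -> ["low"], id 4.
    assert_eq!(bpe_encode("low", &ranks), vec![4]);
}
```

The pairwise rescan above is quadratic in the worst case; high-throughput encoders like wordchipper get their speed from faster pre-tokenization and merge strategies, not from this naive loop.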