# wordchipper - HPC Rust BPE Tokenizer

## Status
This crate is ready for alpha users, and is roughly 2x the speed of tiktoken-rs for many current models.

Productionization toward a long-term stable release can be tracked in the Alpha Release Tracking Issue.
## Overview

wordchipper is a high-performance Rust BPE tokenizer trainer/encoder/decoder.
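For readers unfamiliar with BPE, here is a minimal sketch of what the greedy byte-pair encoding step does. This is purely illustrative: the `bpe_encode` helper and the toy merge table are inventions for this example, not wordchipper's API.

```rust
use std::collections::HashMap;

/// Greedy BPE over a single word, given a merge-rank table.
/// Repeatedly merges the adjacent pair with the lowest rank
/// until no adjacent pair appears in the table.
fn bpe_encode(word: &str, ranks: &HashMap<(String, String), usize>) -> Vec<String> {
    let mut parts: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the adjacent pair with the best (lowest) merge rank.
        let mut best: Option<(usize, usize)> = None; // (rank, index)
        for i in 0..parts.len().saturating_sub(1) {
            if let Some(&r) = ranks.get(&(parts[i].clone(), parts[i + 1].clone())) {
                if best.map_or(true, |(br, _)| r < br) {
                    best = Some((r, i));
                }
            }
        }
        match best {
            Some((_, i)) => {
                // Merge parts[i] and parts[i + 1] into one token.
                let merged = format!("{}{}", parts[i], parts[i + 1]);
                parts.splice(i..i + 2, [merged]);
            }
            None => break,
        }
    }
    parts
}

fn main() {
    // Toy merge table: "l"+"o" merges first, then "lo"+"w".
    let ranks = HashMap::from([
        (("l".to_string(), "o".to_string()), 0),
        (("lo".to_string(), "w".to_string()), 1),
    ]);
    println!("{:?}", bpe_encode("lower", &ranks)); // ["low", "e", "r"]
}
```

A production encoder (like this crate) avoids the quadratic rescanning above with more careful data structures; the merge order, however, is the same.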
## Suite Crates
This is the main crate for the wordchipper project.
The core additional user-facing crates are:
- `wordchipper-cli` - a multi-tool tokenizer binary; notably:
  - `wordchipper-cli cat` - an in-line encoder/decoder tool.
  - `wordchipper-cli train` - a tokenizer training tool.
- `wordchipper-training` - an extension crate for training tokenizers.
## Encode/Decode Side-by-Side Benchmarks
| Model | wordchipper | tiktoken-rs | tokenizers |
|---|---|---|---|
| r50k_base | 239.19 MiB/s | 169.30 MiB/s | 22.03 MiB/s |
| p50k_base | 250.55 MiB/s | 163.07 MiB/s | 22.23 MiB/s |
| p50k_edit | 241.69 MiB/s | 169.76 MiB/s | 21.27 MiB/s |
| cl100k_base | 214.26 MiB/s | 125.43 MiB/s | 21.62 MiB/s |
| o200k_base | 119.49 MiB/s | 123.75 MiB/s | 22.03 MiB/s |
| o200k_harmony | 121.80 MiB/s | 121.54 MiB/s | 22.08 MiB/s |
- Help? - I'm assuming some bug on my part for tokenizers+rayon.
- Methodology: 90 MB shards of 1024 samples each, 48 threads.

```shell
$ for m in openai/{r50k_base,p50k_base,p50k_edit,cl100k_base,o200k_base,o200k_harmony}; \
  do RAYON_NUM_THREADS=48 cargo run --release -p sample-timer -- \
    --dataset-dir $DATASET_DIR --shards 0 --model $m; done
```
## Client Usage

### Pretrained Vocabularies

### Encoders and Decoders

### Loading Pretrained Models
Loading a pretrained model requires reading the vocabulary, as well as configuring the spanning behavior (the split regex and the special words). For a number of pretrained models, simplified constructors are available that download, cache, and load the vocabulary.
```rust
use std::sync::Arc;
use …; // the wordchipper import list was elided in the original
```
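Conceptually, "reading the vocabulary" means building a token-bytes-to-rank table, the core lookup structure an encoder needs. Below is a minimal, self-contained sketch assuming a simplified `token<TAB>rank` text format; the actual pretrained vocabulary files use a different (base64-encoded) layout, and `parse_vocab` is a hypothetical helper, not wordchipper's API.

```rust
use std::collections::HashMap;

/// Parse a simplified vocabulary listing of `token<TAB>rank` lines into a
/// token-bytes -> rank map. Blank and malformed lines are skipped.
fn parse_vocab(text: &str) -> HashMap<Vec<u8>, u32> {
    text.lines()
        .filter(|l| !l.is_empty())
        .filter_map(|line| {
            let (tok, rank) = line.split_once('\t')?;
            Some((tok.as_bytes().to_vec(), rank.parse().ok()?))
        })
        .collect()
}

fn main() {
    let vocab = parse_vocab("hello\t0\nworld\t1\n");
    println!("loaded {} entries", vocab.len());
}
```

The simplified constructors mentioned above presumably wrap this kind of parsing together with downloading and caching the vocabulary file.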