wordchipper 0.9.1


wordchipper - HPC Rust BPE Tokenizer


Overview

wordchipper is a high-performance Rust byte-pair encoding (BPE) tokenizer for the OpenAI GPT-2 tokenizer family. In Rust on a 64-core machine, it achieves throughput speedups over tiktoken-rs of ~4.3-5.7x (4 to 64 cores) for general regex BPE vocabularies, and ~6.9-9.2x when using custom DFA lexers for specific OpenAI vocabularies. Through the Python wrappers, we see ~2-4x (4 to 64 cores) speedups over tiktoken.
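To make the algorithm concrete, here is a minimal, self-contained sketch of byte-level BPE merging: start from raw bytes and repeatedly replace the most frequent adjacent token pair with a fresh token id. This toy loop is for illustration only; it is not wordchipper's implementation, which operates on regex- or DFA-pre-segmented spans with a trained merge table.

```rust
use std::collections::HashMap;

/// Find the most frequent adjacent token pair, breaking count ties by the
/// smaller pair so the result is deterministic.
fn most_frequent_pair(tokens: &[u32]) -> Option<(u32, u32)> {
    let mut counts: HashMap<(u32, u32), usize> = HashMap::new();
    for pair in tokens.windows(2) {
        *counts.entry((pair[0], pair[1])).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .max_by_key(|&(pair, n)| (n, std::cmp::Reverse(pair)))
        .map(|(pair, _)| pair)
}

/// Replace every left-to-right occurrence of `pair` with `new_id`.
fn merge(tokens: &[u32], pair: (u32, u32), new_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(tokens.len());
    let mut i = 0;
    while i < tokens.len() {
        if i + 1 < tokens.len() && (tokens[i], tokens[i + 1]) == pair {
            out.push(new_id);
            i += 2;
        } else {
            out.push(tokens[i]);
            i += 1;
        }
    }
    out
}

fn main() {
    // Byte-level BPE starts from raw bytes, so ids 0..=255 are the base alphabet.
    let mut tokens: Vec<u32> = "aaabdaaabac".bytes().map(u32::from).collect();
    let mut next_id = 256;
    for _ in 0..3 {
        if let Some(pair) = most_frequent_pair(&tokens) {
            tokens = merge(&tokens, pair, next_id);
            next_id += 1;
        }
    }
    println!("{} tokens after 3 merges", tokens.len()); // 11 bytes shrink to 5 tokens
}
```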

Suite Crates

This is the main crate for the wordchipper project.

The core additional user-facing crates are:

  • wordchipper-cli - a multi-tool tokenizer binary; notably:
    • wordchipper-cli cat - in-line encoder/decoder tool.
    • wordchipper-cli train - tokenizer training tool.
  • wordchipper-training - an extension crate for training tokenizers.

Encode/Decode Side-by-Side Benchmarks

64 Core              r50k rust     gpt2 python   o200k rust    o200k python
wordchipper:logos    2.7 GiB/s     114.1 MiB/s   2.4 GiB/s     123.7 MiB/s
wordchipper          1.7 GiB/s     110.5 MiB/s   1.5 GiB/s     106.5 MiB/s
tiktoken*            386.0 MiB/s   25.5 MiB/s    265.2 MiB/s   32.7 MiB/s
bpe-openai           60.9 MiB/s    -             11.1 MiB/s    -
tokenizers           49.7 MiB/s    20.8 MiB/s    50.2 MiB/s    23.2 MiB/s

Read the full performance paper:

Client Usage

Pretrained Vocabularies

Encoders and Decoders

Loading Pretrained Models

Loading a pretrained model requires reading the vocabulary, as well as configuring the spanning configuration (the regex patterns and special words used to segment input before BPE).
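To illustrate what the spanning step does, here is a hedged sketch (not wordchipper's API) of the special-word half: the input is first split on special words such as `<|endoftext|>` so they map directly to reserved token ids, while the remaining ordinary spans go through the regex pre-segmentation and BPE merging.

```rust
/// Illustrative only: split `text` on one special word, tagging each span
/// with whether it is the special word (one reserved token) or ordinary
/// text (to be regex-segmented and BPE-merged).
fn split_on_special<'a>(text: &'a str, special: &str) -> Vec<(&'a str, bool)> {
    let mut out = Vec::new();
    let mut rest = text;
    while let Some(i) = rest.find(special) {
        if i > 0 {
            out.push((&rest[..i], false)); // ordinary span: regex + BPE
        }
        out.push((&rest[i..i + special.len()], true)); // special word: one token
        rest = &rest[i + special.len()..];
    }
    if !rest.is_empty() {
        out.push((rest, false));
    }
    out
}

fn main() {
    let spans = split_on_special("hello<|endoftext|>world", "<|endoftext|>");
    for (span, is_special) in &spans {
        println!("special={is_special}: {span:?}");
    }
}
```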

For a number of pretrained models, simplified constructors are available to download, cache, and load the vocabulary.

See: wordchipper::get_model

use std::sync::Arc;

use wordchipper::{
    get_model,
    TokenDecoder,
    TokenEncoder,
    UnifiedTokenVocab,
    disk_cache::WordchipperDiskCache,
};

fn example() -> wordchipper::errors::Result<(
    Arc<dyn TokenEncoder<u32>>,
    Arc<dyn TokenDecoder<u32>>,
)> {
    let mut disk_cache = WordchipperDiskCache::default();
    let vocab: UnifiedTokenVocab<u32> = get_model("openai/o200k_harmony", &mut disk_cache)?;

    let encoder = vocab.to_default_encoder();
    let decoder = vocab.to_default_decoder();

    Ok((encoder, decoder))
}

Acknowledgements

  • Thank you to @karpathy and nanochat for the work on rustbpe.
  • Thank you to tiktoken for their initial work in the Rust tokenizer space.