wordchipper - HPC Rust BPE Tokenizer

Overview

This is a high-performance rust BPE tokenizer trainer/encoder/decoder.

The current status is productionization towards an alpha release.

Encode/Decode Side-by-Side Benchmarks

% RAYON_NUM_THREADS=48 cargo run --release -p sample-timer  -- \
    --dataset-dir $DATASET_CACHE_DIR --decode
Args {
    dataset_dir: "/media/Data/nanochat/dataset",
    shards: [
        0,
        1,
    ],
    batch_size: 1024,
    model: OpenaiO200kHarmony,
    ignore_missing: true,
    tiktoken: true,
    tokenizers: true,
    decode: false,
    validate: true,
    respan_input_for_decode_check: true,
}
Model: "openai/o200k_harmony"

Samples Summary:
- num batches: 104
- avg bytes/sample: 4777
- avg bytes/token: 4.8

Encoder Batch Timing:
- "wordchipper"
  - batch:      36.2ms
  - sample:     35.3µs
  - bps:    128.96 MiB/s
- "tiktoken-rs"
  - batch:      36.5ms
  - sample:     35.6µs
  - bps:    127.86 MiB/s
- "tokenizers"
  - batch:     214.7ms
  - sample:    209.6µs
  - bps:    21.73 MiB/s

Client Usage

Pretrained Vocabularies

OpenAI OATokenizer

Encoders and Decoders

Loading Pretrained Models

Loading a pre-trained model requires reading the vocabulary, as well as configuring the spanning (regex and special words) configuration.

For a number of pretrained models, simplified constructors are available to download, cache, and load the vocabulary.

See: wordchipper::pretrained::openai::OATokenizer

use std::sync::Arc;

use wordchipper::{
    decoders::{DefaultTokenDecoder, TokenDecoder},
    disk_cache::WordchipperDiskCache,
    encoders::{DefaultTokenEncoder, TokenEncoder},
    pretrained::openai::OATokenizer,
    vocab::UnifiedTokenVocab,
};

fn example() -> anyhow::Result<(Arc<dyn TokenEncoder<u32>>, Arc<dyn TokenDecoder<u32>>)> {
    let model = OATokenizer::O200kHarmony;
    let mut disk_cache = WordchipperDiskCache::default();
    let vocab: UnifiedTokenVocab<u32> = model.load(&mut disk_cache)?;

    let encoder: Arc<DefaultTokenEncoder<u32>> =
        DefaultTokenEncoder::new(vocab.clone(), None).into();
    let decoder: Arc<DefaultTokenDecoder<u32>> =
        DefaultTokenDecoder::from_unified_vocab(vocab).into();
        
    #[cfg(feature = "rayon")]
    use wordchipper::concurrency::rayon::*;
    
    #[cfg(feature = "rayon")]
    let encoder = Arc::new(ParallelRayonEncoder::new(encoder));

    #[cfg(feature = "rayon")]
    let decoder = Arc::new(ParallelRayonDecoder::new(decoder));

    Ok((encoder, decoder))
}

Training Overview

Training Example

This is a code snippet overview of training.

Expect training to take ~1s/10MB of input; and to be slowed primarily by how well the stream logic of loading the training samples is parallelized.

Note: currently, training has limited logging, and no progress reporting.

A common training binary is probably a good idea; and much of the messiness of supporting many different training data sources could be hidden in the isolated deps of such a tool.

Each shard is ~90MB parquet file.

$ time cargo run --release -p tokenizer_trainer -- --dataset-dir ~/Data/nanochat/dataset --shards 
..8 --voc
ab-size=65536 --time-encode-decode                                        
   Compiling anyhow v1.0.100
   Compiling wordchipper-disk-cache v0.2.2 (/Users/crutcher/git/wordchipper/crates/wordchipper-disk-cache)
   Compiling wordchipper-data v0.0.0 (/Users/crutcher/git/wordchipper/crates/wordchipper-data)
   Compiling wordchipper v0.2.2 (/Users/crutcher/git/wordchipper/crates/wordchipper)
   Compiling tokenizer_trainer v0.0.0 (/Users/crutcher/git/wordchipper/examples/tokenizer_trainer)
    Finished `release` profile [optimized] target(s) in 2.68s
     Running `target/release/tokenizer_trainer --dataset-dir /Users/crutcher/Data/nanochat/dataset --shards ..8 --vocab-size=65536 --time-encode-decode`
Loading Shards: [0, 1, 2, 3, 4, 5, 6, 7]
...

Training Tokenizer on shards: [0, 1, 2, 3, 4, 5, 6, 7]
- shard: 0
- shard: 1
- shard: 2
- shard: 3
- shard: 4
- shard: 5
- shard: 6
- shard: 7
- train
- training_duration: 106.70s
- vocab_size: 65535

Samples Summary:
- count: 20480
- avg size: 4741

Timing Config:
- batch size: 512

Timing Encode:
- batch avg: 18.276894ms
- sample avg: 35.697µs
- avg bps: 132.81 MB/s

Observed Bytes/Token Stats:
- total bytes: 97103222
- total tokens: 24645141
- sample byte/token: 3.94

Timing Decode:
- batch avg: 1.829894ms
- sample avg: 3.574µs

Acknowledgements

Thank you to @karpathy and nanochat for the work on rustbpe.
Thank you to tiktoken for their initial work in the rust tokenizer space.

wordchipper 0.6.0