wordchipper - HPC Rust BPE Tokenizer
Overview
This is a high-performance Rust BPE tokenizer trainer/encoder/decoder.
The project is currently being productionized toward an alpha release.
Crate Features
feature: default
- client
- download
The default feature does not enable training.
feature: client
- ahash
- rayon
- std
The default client is focused on loading vocabularies and running high performance encoders / decoders.
feature: download
- wordchipper-disk-cache
- std
The download feature enables downloading vocabularies from the internet.
feature: training
- compact_str
- dary_heap
- std
The training feature enables the training code.
feature: std / no_std
The std feature enables the use of the std library;
and the no_std feature enables deps needed when std is not enabled.
(Negative feature deps are not stable yet.)
Note: I am unsure if this is complete. It is tested in CI, but I'm unsure
if I've fully covered it, and I haven't worked out a no_std deploy test yet.
feature: ahash
This swaps all HashMap/HashSet implementations for ahash, which is a performance win on many (most?) modern CPUs.
This is done by the types::hash_types::CommonHash{*} type alias machinery.
See also the hashbrown dep used by no_std.
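A feature-gated alias layer of this kind can be sketched as follows. This is an illustration only: the alias names CommonHashMap and CommonHashSet are guesses based on the CommonHash{*} pattern named above, and the crate's actual definitions may differ. Without the ahash feature, the aliases fall back to the std types.

```rust
// Hypothetical sketch of feature-gated hash aliases (names are assumptions,
// not the crate's confirmed API). With the `ahash` feature enabled and the
// `ahash` crate available, the aliases resolve to ahash's map types;
// otherwise they resolve to the std collections.
#[cfg(feature = "ahash")]
pub type CommonHashMap<K, V> = ahash::AHashMap<K, V>;
#[cfg(not(feature = "ahash"))]
pub type CommonHashMap<K, V> = std::collections::HashMap<K, V>;

#[cfg(feature = "ahash")]
pub type CommonHashSet<K> = ahash::AHashSet<K>;
#[cfg(not(feature = "ahash"))]
pub type CommonHashSet<K> = std::collections::HashSet<K>;

/// Counts occurrences of each whitespace-separated word, using whichever
/// map implementation the build selected.
pub fn count_words(text: &str) -> CommonHashMap<&str, usize> {
    let mut counts = CommonHashMap::default();
    for word in text.split_whitespace() {
        *counts.entry(word).or_insert(0) += 1;
    }
    counts
}

/// Collects the distinct words, using the selected set implementation.
pub fn distinct_words(text: &str) -> CommonHashSet<&str> {
    text.split_whitespace().collect()
}
```

Code written against the aliases does not need to change when the feature is toggled, which is the point of routing every map and set through one module.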
feature: rayon
This enables some parallelism wrappers using the rayon crate.
TODO: I intend to provide a tokio-based async parallelism mechanism
as well, to structure more direct regex -> encode pipeline parallelism.
feature: tracing
This enables a number of tracing instrumentation points.
This is only useful for timing tracing of the library itself.
Client Usage
UnifiedTokenVocab
The UnifiedTokenVocab is a unified representation of the vocabularies
used by the TokenEncoder and TokenDecoder clients. It contains:
- SegmentationConfig - describing the span/word regex and the special token map.
- ByteMapVocab - describing the { u8 -> T } mapping.
- SpanMapVocab - describing the { Vec<u8> -> T } mapping.
- PairMapVocab - describing known { (T, T) -> T } merge pairs.
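To make the three mappings concrete, here is a minimal sketch of how they could cooperate when encoding a single span. ToyVocab, its fields, and toy_vocab are hypothetical names invented for this illustration, and the rule "the lowest merged token id is the earliest merge" is a simplifying assumption; none of this is the crate's actual layout or algorithm.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the byte, span, and pair mappings listed above.
pub struct ToyVocab {
    /// { u8 -> T }: every byte has a token.
    pub byte_map: [u32; 256],
    /// { Vec<u8> -> T }: whole spans with a known token.
    pub span_map: HashMap<Vec<u8>, u32>,
    /// { (T, T) -> T }: known merge pairs.
    pub pair_map: HashMap<(u32, u32), u32>,
}

impl ToyVocab {
    /// Encodes one span: exact span lookup first, otherwise byte tokens
    /// merged repeatedly by the lowest-id known pair.
    pub fn encode_span(&self, span: &[u8]) -> Vec<u32> {
        if let Some(&tok) = self.span_map.get(span) {
            return vec![tok];
        }
        let mut toks: Vec<u32> = span.iter().map(|&b| self.byte_map[b as usize]).collect();
        loop {
            // Find the adjacent pair whose merged token id is lowest
            // (assumed here to mean the earliest-learned merge).
            let best = toks
                .windows(2)
                .enumerate()
                .filter_map(|(i, w)| self.pair_map.get(&(w[0], w[1])).map(|&t| (t, i)))
                .min();
            match best {
                Some((tok, i)) => {
                    toks[i] = tok;
                    toks.remove(i + 1);
                }
                None => return toks,
            }
        }
    }
}

/// Builds a tiny vocab: identity byte map plus merges "ab" -> 256, "abc" -> 257.
pub fn toy_vocab() -> ToyVocab {
    let mut byte_map = [0u32; 256];
    for (i, slot) in byte_map.iter_mut().enumerate() {
        *slot = i as u32;
    }
    let mut span_map = HashMap::new();
    span_map.insert(b"ab".to_vec(), 256);
    span_map.insert(b"abc".to_vec(), 257);
    let mut pair_map = HashMap::new();
    pair_map.insert((97, 98), 256);
    pair_map.insert((256, 99), 257);
    ToyVocab { byte_map, span_map, pair_map }
}
```

The span map acts as a fast path for spans that are themselves tokens; the pair map is only consulted when that lookup misses.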
Loading Pretrained Vocabularies
This is only partially implemented; it still requires a fair amount of manual work.
A collection of metadata about known pretrained vocabularies is available:
wordchipper::vocab::public
What is incomplete is a local URL cache plus workflow for assembling a vocab from the known metadata.
A loading example exists in the examples/token-cli crate.
use ;
use ;
use ;
use ;
use ;
use load_o200k_harmony_vocab;
use UnifiedTokenVocab;
use WordchipperDiskCache;
type T = u32;
let mut disk_cache = default;
let vocab: = load_o200k_harmony_vocab?.into;
let encoder: =
init;
let encoder = new;
let decoder = from_unified_vocab;
let decoder = new;
TokenEncoder Clients
Encoder clients should use:
- DefaultTokenEncoder - the current default (only?) TokenEncoder.
- ParallelRayonEncoder - a batch parallelism wrapper around any TokenEncoder.
use UnifiedTokenVocab;
use DefaultTokenEncoder;
use TokenEncoder;
use TokenType;
use Arc;
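The idea of a batch parallelism wrapper around an encoder can be sketched without any dependencies. The example below is not the crate's ParallelRayonEncoder: ToyEncoder, ByteEncoder, and ParallelBatchEncoder are hypothetical names, and std scoped threads stand in for rayon so the sketch stays self-contained.

```rust
use std::thread;

/// Hypothetical minimal encoder interface; the crate's TokenEncoder trait may differ.
pub trait ToyEncoder: Sync {
    fn encode(&self, text: &str) -> Vec<u32>;
}

/// Trivial encoder used for illustration: one token per byte.
pub struct ByteEncoder;

impl ToyEncoder for ByteEncoder {
    fn encode(&self, text: &str) -> Vec<u32> {
        text.bytes().map(u32::from).collect()
    }
}

/// Batch wrapper: splits a batch into chunks and encodes each chunk on its own thread.
pub struct ParallelBatchEncoder<E: ToyEncoder> {
    inner: E,
    num_threads: usize,
}

impl<E: ToyEncoder> ParallelBatchEncoder<E> {
    pub fn new(inner: E, num_threads: usize) -> Self {
        Self { inner, num_threads: num_threads.max(1) }
    }

    /// Encodes every sample in the batch, preserving input order.
    pub fn encode_batch(&self, batch: &[&str]) -> Vec<Vec<u32>> {
        if batch.is_empty() {
            return Vec::new();
        }
        // Ceiling division so that at most `num_threads` chunks are created.
        let chunk_len = (batch.len() + self.num_threads - 1) / self.num_threads;
        thread::scope(|scope| {
            // Spawn all workers first, then join them in order.
            let handles: Vec<_> = batch
                .chunks(chunk_len)
                .map(|chunk| {
                    scope.spawn(move || {
                        chunk.iter().map(|s| self.inner.encode(s)).collect::<Vec<_>>()
                    })
                })
                .collect();
            handles
                .into_iter()
                .flat_map(|h| h.join().expect("encoder thread panicked"))
                .collect()
        })
    }
}
```

Because the wrapper only needs shared access to the inner encoder, any encoder that is safe to share across threads can be wrapped without modification.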
TokenDecoder Clients
Decoder clients should use:
- DictionaryDecoder - the fastest TokenDecoder.
- ParallelRayonDecoder - a batch parallelism wrapper around any TokenDecoder.
use UnifiedTokenVocab;
use DictionaryDecoder;
use TokenDecoder;
use TokenType;
use Arc;
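A dictionary-style decoder can be illustrated as a plain token-to-bytes table, where decoding is one lookup per token with no merge replay. The sketch below is an assumption about the general technique, not the crate's DictionaryDecoder; ToyDictionaryDecoder and toy_decoder are hypothetical names.

```rust
use std::collections::HashMap;

/// Hypothetical dictionary decoder: maps each token directly to its bytes.
pub struct ToyDictionaryDecoder {
    token_bytes: HashMap<u32, Vec<u8>>,
}

impl ToyDictionaryDecoder {
    pub fn new(token_bytes: HashMap<u32, Vec<u8>>) -> Self {
        Self { token_bytes }
    }

    /// Concatenates the bytes of each token; returns None if any token is unknown.
    pub fn decode_to_bytes(&self, tokens: &[u32]) -> Option<Vec<u8>> {
        let mut out = Vec::new();
        for tok in tokens {
            out.extend_from_slice(self.token_bytes.get(tok)?);
        }
        Some(out)
    }

    /// Decodes to a String; returns None on unknown tokens or invalid UTF-8.
    pub fn decode_to_string(&self, tokens: &[u32]) -> Option<String> {
        String::from_utf8(self.decode_to_bytes(tokens)?).ok()
    }
}

/// Builds a toy table: all 256 byte tokens plus two multi-byte tokens.
pub fn toy_decoder() -> ToyDictionaryDecoder {
    let mut map = HashMap::new();
    for b in 0u32..256 {
        map.insert(b, vec![b as u8]);
    }
    map.insert(256, b"he".to_vec());
    map.insert(257, b"llo".to_vec());
    ToyDictionaryDecoder::new(map)
}
```

Decoding to bytes first and validating UTF-8 once at the end matters because a single token may hold only part of a multi-byte character.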
Side-by-side Comparison to tiktoken-rs
Each shard is a ~90MB parquet file.
- 128/64 Core Threadripper
- NOTE: there are still some tokenization differences to resolve here.
$ RAYON_NUM_THREADS=16 cargo run --release -p token-cli -- --dataset-dir /media/Data/nanochat/dataset
Compiling wordchipper v0.1.2 (/home/crutcher/git/wordchipper/crates/wordchipper)
Compiling token-cli v0.0.0 (/home/crutcher/git/wordchipper/examples/token-cli)
Finished `release` profile [optimized] target(s) in 1.87s
Running `target/release/token-cli --dataset-dir /media/Data/nanochat/dataset`
Samples Summary:
- count: 53248
- total size: 254737840
- avg size: 4783
- avg batch size bytes: 2449402
Timing Config:
- batch size: 512
- num batches: 104
Timing Encode:
- wordchipper: 14.6ms, 160.32 MiB/s
- tiktoken-rs: 32.7ms, 71.33 MiB/s
Observed Bytes/Token Stats:
- wordchipper token count: 54749669
- wordchipper byte/token: 4.65
- tiktoken-rs token count: 53251930
- tiktoken-rs byte/token: 4.78
Timing Decode:
- wordchipper: 2.8ms, 840.54 MiB/s
- tiktoken-rs: 2.1ms, 1.08 GiB/s
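The figures in the log above can be cross-checked with a little arithmetic. The sketch below assumes (the log does not say so) that throughput is the average batch size in bytes divided by the average batch time, and that bytes/token is total bytes over total tokens. The function names are hypothetical helpers, not part of token-cli.

```rust
/// Average bytes represented by each token.
pub fn bytes_per_token(total_bytes: u64, token_count: u64) -> f64 {
    total_bytes as f64 / token_count as f64
}

/// Throughput in MiB/s for a batch of `batch_bytes` bytes processed in `millis` milliseconds.
pub fn mib_per_sec(batch_bytes: f64, millis: f64) -> f64 {
    let seconds = millis / 1000.0;
    batch_bytes / seconds / (1024.0 * 1024.0)
}

/// Recomputes the encode figures from the numbers printed in the log.
/// Returns (wordchipper MiB/s, tiktoken-rs MiB/s, wordchipper B/tok, tiktoken-rs B/tok).
pub fn recompute_encode_stats() -> (f64, f64, f64, f64) {
    let avg_batch_bytes = 2_449_402.0; // "avg batch size bytes"
    let total_bytes = 254_737_840u64; // "total size"
    (
        mib_per_sec(avg_batch_bytes, 14.6),
        mib_per_sec(avg_batch_bytes, 32.7),
        bytes_per_token(total_bytes, 54_749_669),
        bytes_per_token(total_bytes, 53_251_930),
    )
}
```

Under these assumptions the recomputed throughputs land within about 1 MiB/s of the reported values; the small gap is consistent with the printed millisecond timings being rounded.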
Training Overview
See examples/tokenizer_trainer.
This is a code snippet overview of training.
Expect training to take roughly 1s per 10MB of input; it is slowed primarily by how well the stream logic that loads the training samples is parallelized.
Note: currently, training has limited logging, and no progress reporting.
A common training binary is probably a good idea; and much of the messiness of supporting many different training data sources could be hidden in the isolated deps of such a tool.
Here:
- The iterator stream for samples may be quite large.
- Training a nanochat equivalent tokenizer takes ~80 CPU minutes.
use ;
use save_span_map_to_tiktoken_path;
use OA_GPT3_CL100K_WORD_PATTERN;
use ;
use DefaultTokenEncoder;
use DictionaryDecoder;
use ;
use Arc;
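For readers unfamiliar with BPE training, the core loop can be sketched in a few lines. This is a textbook-style illustration and not the crate's trainer: it takes pre-split words instead of applying a regex such as OA_GPT3_CL100K_WORD_PATTERN, recounts all pairs on every round (far slower than a heap-based approach), and train_bpe and merge_pair are hypothetical names.

```rust
use std::collections::HashMap;

/// Replaces every non-overlapping occurrence of `pair` in `toks` with `new_token`.
pub fn merge_pair(toks: &[u32], pair: (u32, u32), new_token: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(toks.len());
    let mut i = 0;
    while i < toks.len() {
        if i + 1 < toks.len() && (toks[i], toks[i + 1]) == pair {
            out.push(new_token);
            i += 2;
        } else {
            out.push(toks[i]);
            i += 1;
        }
    }
    out
}

/// Learns merges until `vocab_size` tokens exist or no pairs remain.
/// Tokens 0..256 are the raw bytes; learned tokens start at 256.
pub fn train_bpe(words: &[&str], vocab_size: usize) -> Vec<((u32, u32), u32)> {
    // Count each distinct word once so merge work scales with unique words.
    let mut word_counts: HashMap<Vec<u32>, usize> = HashMap::new();
    for w in words {
        *word_counts.entry(w.bytes().map(u32::from).collect()).or_insert(0) += 1;
    }
    let mut merges = Vec::new();
    let mut next_token = 256u32;
    while (next_token as usize) < vocab_size {
        // Count adjacent pairs, weighted by how often each word occurs.
        let mut pair_counts: HashMap<(u32, u32), usize> = HashMap::new();
        for (toks, &n) in &word_counts {
            for w in toks.windows(2) {
                *pair_counts.entry((w[0], w[1])).or_insert(0) += n;
            }
        }
        // Most frequent pair wins; ties go to the smallest pair for determinism.
        let Some((&pair, _)) = pair_counts
            .iter()
            .max_by(|a, b| a.1.cmp(b.1).then_with(|| b.0.cmp(a.0)))
        else {
            break;
        };
        // Rewrite every word with the new token, re-aggregating identical results.
        let mut rewritten: HashMap<Vec<u32>, usize> = HashMap::new();
        for (toks, n) in word_counts {
            *rewritten.entry(merge_pair(&toks, pair, next_token)).or_insert(0) += n;
        }
        word_counts = rewritten;
        merges.push((pair, next_token));
        next_token += 1;
    }
    merges
}
```

The returned merge list corresponds to the { (T, T) -> T } pair mapping described earlier in this document.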
Running examples/tokenizer_trainer
Each shard is a ~90MB parquet file.
$ time cargo run --release -p tokenizer_trainer -- --dataset-dir ~/Data/nanochat/dataset --shards ..8 --vocab-size=65536 --time-encode-decode
Compiling anyhow v1.0.100
Compiling wordchipper-disk-cache v0.2.2 (/Users/crutcher/git/wordchipper/crates/wordchipper-disk-cache)
Compiling wordchipper-data v0.0.0 (/Users/crutcher/git/wordchipper/crates/wordchipper-data)
Compiling wordchipper v0.2.2 (/Users/crutcher/git/wordchipper/crates/wordchipper)
Compiling tokenizer_trainer v0.0.0 (/Users/crutcher/git/wordchipper/examples/tokenizer_trainer)
Finished `release` profile [optimized] target(s) in 2.68s
Running `target/release/tokenizer_trainer --dataset-dir /Users/crutcher/Data/nanochat/dataset --shards ..8 --vocab-size=65536 --time-encode-decode`
Loading Shards: [0, 1, 2, 3, 4, 5, 6, 7]
...
Training Tokenizer on shards: [0, 1, 2, 3, 4, 5, 6, 7]
- shard: 0
- shard: 1
- shard: 2
- shard: 3
- shard: 4
- shard: 5
- shard: 6
- shard: 7
- train
- training_duration: 106.70s
- vocab_size: 65535
Samples Summary:
- count: 20480
- avg size: 4741
Timing Config:
- batch size: 512
Timing Encode:
- batch avg: 18.276894ms
- sample avg: 35.697µs
- avg bps: 132.81 MB/s
Observed Bytes/Token Stats:
- total bytes: 97103222
- total tokens: 24645141
- sample byte/token: 3.94
Timing Decode:
- batch avg: 1.829894ms
- sample avg: 3.574µs