A Rust-centric clone of nanochat/rustbpe.

This is a high-performance Rust BPE tokenizer trainer/encoder/decoder, inspired by nanochat's rustbpe.

The current status is productionization towards an alpha release.
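For orientation, the core of BPE training is a loop that repeatedly counts adjacent token pairs and merges the most frequent pair into a fresh token id. The sketch below shows one such step in plain Rust. It is illustrative only, not wordchuck's implementation (the crate's `MergeHeapVocabEncoder` name suggests a heap-driven merge rather than this naive full recount):

```rust
use std::collections::HashMap;

/// One classic BPE training step: count adjacent token pairs across all
/// words, then merge the most frequent pair into the new token `next_id`.
/// Returns the merged pair, or `None` if no pairs remain.
fn bpe_merge_step(words: &mut [Vec<u32>], next_id: u32) -> Option<(u32, u32)> {
    // Count adjacent pair frequencies across the corpus.
    let mut counts: HashMap<(u32, u32), usize> = HashMap::new();
    for word in words.iter() {
        for pair in word.windows(2) {
            *counts.entry((pair[0], pair[1])).or_insert(0) += 1;
        }
    }
    // Pick the most frequent pair (ties broken arbitrarily here).
    let (&best, _) = counts.iter().max_by_key(|&(_, &count)| count)?;
    // Rewrite every word, replacing occurrences of `best` with `next_id`.
    for word in words.iter_mut() {
        let mut merged = Vec::with_capacity(word.len());
        let mut i = 0;
        while i < word.len() {
            if i + 1 < word.len() && (word[i], word[i + 1]) == best {
                merged.push(next_id);
                i += 2;
            } else {
                merged.push(word[i]);
                i += 1;
            }
        }
        *word = merged;
    }
    Some(best)
}
```

A production trainer avoids the full recount by tracking pair counts incrementally (e.g. in a heap), since only pairs adjacent to a merge site change between steps.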
TODO:

- New Name / New Repo. (`wordchuck` conflicts, alas)
- Save/Load vocabularies.
- Save/Load well-known / named remote vocabularies.
- Save/Load to `tiktoken` vocab format.
- Benchmarks.
- Error handling (as `Result`s, not panics).
- Tuning
  - Instrument `tiktoken` (via `tracing`).
  - Compare / fix perf differences.
  - Instrument
- Python/C*/Java Bindings?
See: the `tokenizer_trainer` training example.

- The iterator stream for samples may be quite large (see the streaming sketch below).
- Training a `nanochat`-equivalent tokenizer takes ~80 CPU-minutes.
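Because the sample stream can outgrow memory, it is worth feeding the trainer from a lazy iterator rather than a collected `Vec`. A minimal sketch of that pattern, assuming hypothetical plain-text shards with one sample per line (the real example reads ~90MB parquet shards):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::path::PathBuf;

/// Lazily stream samples from shard files, one sample per line, so the
/// whole corpus never has to be resident in memory at once.
/// (Illustrative helper; not part of the wordchuck API.)
fn sample_stream(shards: Vec<PathBuf>) -> impl Iterator<Item = String> {
    shards.into_iter().flat_map(|path| {
        let file = File::open(&path).expect("shard should open");
        BufReader::new(file).lines().map_while(Result::ok)
    })
}
```

Chaining shards through `flat_map` keeps at most one reader open at a time, so peak memory stays bounded by the largest single sample rather than the corpus size.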
```rust
use std::sync::Arc;

// NOTE: the module paths below were lost in extraction and are assumed;
// all of these names come from the `wordchuck` crate. Three additional
// imports were elided entirely.
use wordchuck::save_span_map_to_tiktoken_path;
use wordchuck::OA_GPT3_CL100K_WORD_PATTERN;
use wordchuck::MergeHeapVocabEncoder;
use wordchuck::DictionaryDecoder;
use wordchuck::default_regex_supplier;
```
Example Tokenizer Trainer

Each shard is a ~90MB parquet file.

- Machine: 64-core Threadripper
```console
$ time cargo run --release -p tokenizer_trainer -- --dataset-dir /media/Data/nanochat/dataset --time-encode-decode
   Compiling wordchuck v0.0.6 (/home/crutcher/git/brn-nanochat/crates/wordchuck)
   Compiling tokenizer_trainer v0.0.0 (/home/crutcher/git/brn-nanochat/crates/wordchuck/examples/tokenizer_trainer)
    Finished `release` profile [optimized] target(s) in 1.85s
     Running `target/release/tokenizer_trainer --dataset-dir /media/Data/nanochat/dataset --time-encode-decode`
Loading Shards: [0, 1, 2, 3, 4, 5, 6, 7]
...
Training Tokenizer on shards: [0, 1, 2, 3, 4, 5, 6, 7]
 - shard: 0
 - shard: 1
 - shard: 2
 - shard: 3
 - shard: 4
 - shard: 5
 - shard: 6
 - shard: 7
 - train
 - training_duration: 203.40s
 - vocab_size: 65535
Samples Summary:
 - count: 20480
 - avg size: 4741
Timing Config:
 - batch size: 512
Timing Encode:
 - batch avg: 76.543966ms
 - sample avg: 149.499µs
 - avg bps: 31.71 MB/s
Observed Bytes/Token Stats:
 - total bytes: 97103222
 - total tokens: 24645141
 - sample byte/token: 3.94
Timing Decode:
 - batch avg: 2.466106ms
 - sample avg: 4.816µs

real    3m28.924s
user    6m37.652s
sys     0m35.035s
```
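As a sanity check, the reported figures are internally consistent: the per-sample encode time is the batch average over the batch size (76.544 ms / 512 ≈ 149.5 µs), the throughput follows from the average sample size (4741 bytes / 149.5 µs ≈ 31.7 MB/s), and the compression ratio is total bytes over total tokens (97103222 / 24645141 ≈ 3.94 bytes/token).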