# BPE Development Benchmarks
## Shakespeare (~1 MB)
| Version | Corpus size (bytes) | Training time (ns) | Peak memory (MB) | Vocab size | Avg token length | Max token length | Min token length | Avg encode time (ns) | Avg decode time (ns) | Encode throughput (tokens/sec) | Decode throughput (tokens/sec) | Corpus token count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| — | — | — | — | — | — | — | — | — | — | — | — | — |
## Gutenberg sampler (~10 MB)
| Version | Corpus size (bytes) | Training time (ns) | Peak memory (MB) | Vocab size | Avg token length | Max token length | Min token length | Avg encode time (ns) | Avg decode time (ns) | Encode throughput (tokens/sec) | Decode throughput (tokens/sec) | Corpus token count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| — | — | — | — | — | — | — | — | — | — | — | — | — |
## Wikipedia 1% (~100 MB)
| Version | Corpus size (bytes) | Training time (ns) | Peak memory (MB) | Vocab size | Avg token length | Max token length | Min token length | Avg encode time (ns) | Avg decode time (ns) | Encode throughput (tokens/sec) | Decode throughput (tokens/sec) | Corpus token count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| — | — | — | — | — | — | — | — | — | — | — | — | — |
## OpenWebText sample (~1 GB)
| Version | Corpus size (bytes) | Training time (ns) | Peak memory (MB) | Vocab size | Avg token length | Max token length | Min token length | Avg encode time (ns) | Avg decode time (ns) | Encode throughput (tokens/sec) | Decode throughput (tokens/sec) | Corpus token count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| — | — | — | — | — | — | — | — | — | — | — | — | — |
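
Several columns in the tables above are derived from raw measurements rather than measured directly. The following is a minimal sketch of how those values could be computed, under these assumptions: the vocabulary is available as a list of `bytes` tokens, token lengths are counted in bytes, and throughput is total tokens divided by total elapsed time. The helper names `token_length_stats`, `throughput`, `time_ns`, and `format_row` are hypothetical and not part of any existing implementation.

```python
import time


def token_length_stats(vocab):
    """Return (avg, max, min) token length in bytes for a list of bytes tokens."""
    lengths = [len(tok) for tok in vocab]
    return sum(lengths) / len(lengths), max(lengths), min(lengths)


def throughput(token_count, elapsed_ns):
    """Tokens per second, given a token count and elapsed time in nanoseconds."""
    return token_count / (elapsed_ns / 1e9)


def time_ns(fn, *args):
    """Call fn(*args) once and return (result, elapsed nanoseconds)."""
    start = time.perf_counter_ns()
    result = fn(*args)
    return result, time.perf_counter_ns() - start


def format_row(values):
    """Render a list of cell values as one Markdown table row."""
    return "| " + " | ".join(str(v) for v in values) + " |"
```

For example, `throughput(1000, 500_000_000)` evaluates to `2000.0` tokens/sec, and `format_row` can be used to append a filled-in row in the same column order as the table header.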