# tiktokenx
Fast Rust implementation of OpenAI's tiktoken tokenizer.
## Features
- Drop-in replacement for Python tiktoken
- All OpenAI models supported (GPT-4, GPT-5, o1, etc.)
- Real vocabulary file loading with hash verification
- Zero-copy operations for optimal performance
- Comprehensive test suite (24 tests)
## Installation
```toml
[dependencies]
tiktokenx = "0.1"
```
## Usage
```rust
use tiktokenx::{encoding_for_model, get_encoding};

// Get an encoding by name
let enc = get_encoding("cl100k_base").unwrap();
let tokens = enc.encode("hello world").unwrap();
let text = enc.decode(&tokens).unwrap();

// Get the encoding for a model
let enc = encoding_for_model("gpt-4").unwrap();
let token_count = enc.encode("hello world").unwrap().len();
```
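Since decoding is the inverse of encoding, a quick round-trip check is a handy sanity test. Here is a minimal sketch, assuming the `encode`/`decode` signatures shown above; the model name and input string are illustrative:

```rust
use tiktokenx::encoding_for_model;

fn main() {
    // Illustrative model; any supported model name works here.
    let enc = encoding_for_model("gpt-4o").unwrap();

    let input = "Tokenizers split text into integer token IDs.";
    let tokens = enc.encode(input).unwrap();

    // Decoding the IDs should reproduce the original UTF-8 string.
    assert_eq!(enc.decode(&tokens).unwrap(), input);
    println!("{} tokens", tokens.len());
}
```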
## Supported Models
| Model Family | Models | Encoding |
|---|---|---|
| GPT-5 | `gpt-5` | `o200k_base` |
| GPT-4 | `gpt-4`, `gpt-4-turbo` | `cl100k_base` |
| GPT-4o | `gpt-4o` | `o200k_base` |
| GPT-3.5 | `gpt-3.5-turbo` | `cl100k_base` |
| o1 | `o1`, `o1-mini`, `o1-preview` | `o200k_base` |
| Legacy | `text-davinci-003`, `code-davinci-002` | `p50k_base` |
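When a model is not yet in the lookup table, the encodings above can also be loaded directly by name. A small sketch, assuming `get_encoding` accepts the encoding names listed in the table:

```rust
use tiktokenx::get_encoding;

// Load each encoding family from the table directly by name.
for name in ["cl100k_base", "o200k_base", "p50k_base"] {
    let enc = get_encoding(name).unwrap();
    let n = enc.encode("hello world").unwrap().len();
    println!("{name}: {n} tokens for \"hello world\"");
}
```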
## Performance
Benchmarks on an Apple M1 Pro, comparing tiktokenx against Python tiktoken:
| Implementation | Operation | Time | Throughput | Memory | vs Python |
|---|---|---|---|---|---|
| Python tiktoken | Encode short text | 5.7 μs | 4.8 MiB/s | 0.1 MB | 1.0x |
| tiktokenx | Encode short text | 4.1 μs | 6.7 MiB/s | 0.5 MB | 1.4x |
| Python tiktoken | Encode long text | 482.1 μs | 8.9 MiB/s | 0.1 MB | 1.0x |
| tiktokenx | Encode long text | 175.4 μs | 24.5 MiB/s | 2.0 MB | 2.7x |
On average, tiktokenx is about 2.1x faster than Python tiktoken, trading a few megabytes of extra working memory for that throughput.
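A measurement along these lines can be reproduced with [criterion](https://crates.io/crates/criterion). This is a minimal sketch, not the repo's actual harness; the bench name and input corpus are illustrative:

```rust
use criterion::{criterion_group, criterion_main, Criterion, Throughput};
use tiktokenx::get_encoding;

fn bench_encode(c: &mut Criterion) {
    let enc = get_encoding("cl100k_base").unwrap();
    let text = "The quick brown fox jumps over the lazy dog. ".repeat(500);

    // Report throughput in bytes/s so results are comparable across inputs.
    let mut group = c.benchmark_group("encode");
    group.throughput(Throughput::Bytes(text.len() as u64));
    group.bench_function("long text", |b| b.iter(|| enc.encode(&text).unwrap()));
    group.finish();
}

criterion_group!(benches, bench_encode);
criterion_main!(benches);
```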
## Development

```bash
# Run tests
cargo test

# Run benchmarks
cargo bench

# Check formatting
cargo fmt --check

# Run clippy
cargo clippy
```
## Contributing
Contributions are welcome! Please open an issue or submit a pull request.
## License
MIT