shimmytok
Pure Rust tokenizer for GGUF models
100% llama.cpp compatible • zero C++ • just works
shimmytok is free forever. MIT licensed, no strings attached.
If shimmytok helps you, consider sponsoring.
Features
- Pure Rust - No C++ dependencies
- Load from GGUF - Read tokenizers directly from model files
- Validated - 10/10 llama.cpp vocab models passing
- Complete - All llama.cpp tokenizer types: SPM, BPE, WPM, UGM, RWKV
Installation
[dependencies]
shimmytok = "0.7"
Usage
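use shimmytok::Tokenizer;

// Load tokenizer from GGUF file ("model.gguf" is a placeholder path)
let tokenizer = Tokenizer::from_gguf_file("model.gguf")?;
// Encode text to token IDs (the `true` flag is assumed to add special tokens; check the crate docs)
let tokens = tokenizer.encode("Hello world", true)?;
// Decode token IDs back to text (the `true` flag is assumed to skip special tokens)
let text = tokenizer.decode(&tokens, true)?;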
Validated Models
All models validated against llama-tokenize with exact token match:
| Model | Type | Status |
|---|---|---|
| bert-bge | WPM | ✅ |
| command-r | BPE | ✅ |
| deepseek-coder | BPE | ✅ |
| deepseek-llm | BPE | ✅ |
| falcon | BPE | ✅ |
| gpt-2 | BPE | ✅ |
| llama-spm | SPM | ✅ |
| qwen2 | BPE | ✅ |
| refact | BPE | ✅ |
| starcoder | BPE | ✅ |
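
To make "exact token match" concrete, here is a minimal sketch of the comparison such a check comes down to. It is not shimmytok's test harness: the function name `first_mismatch` and the use of `u32` for token IDs are assumptions made for illustration, and how the reference IDs are obtained from llama-tokenize is left out.

```rust
/// Returns the index of the first position where two token-ID sequences
/// differ, or `None` when they are identical (an exact match).
/// `u32` token IDs are an assumption for this sketch.
fn first_mismatch(ours: &[u32], reference: &[u32]) -> Option<usize> {
    let common = ours.len().min(reference.len());
    (0..common)
        // Compare element by element over the shared prefix.
        .find(|&i| ours[i] != reference[i])
        // If the shared prefix matches, a length difference is still a mismatch.
        .or_else(|| (ours.len() != reference.len()).then_some(common))
}
```

Reporting the first differing index, rather than a plain boolean, makes it easier to see which part of the input text a tokenizer disagrees on.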
Tokenizer Coverage
| Type | Algorithm | Status |
|---|---|---|
| SPM | SentencePiece resegment | ✅ |
| BPE | Priority queue merge + 41 pre-tokenizer patterns | ✅ |
| WPM | Word-Piece greedy longest match | ✅ |
| UGM | Unigram Viterbi DP | ✅ |
| RWKV | Trie-based greedy | ✅ |
| PLaMo-2 | Table-driven reverse DP | ✅ |
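
As an illustration of the "greedy longest match" idea in the WPM row, the sketch below splits a single word the textbook WordPiece way. It is not shimmytok's or llama.cpp's code: the `##` continuation prefix follows the BERT convention and is an assumption here, and the function name `wordpiece` is hypothetical.

```rust
use std::collections::HashSet;

/// Greedy longest-match WordPiece split of a single word.
/// Pieces after the first carry a "##" prefix (BERT convention, assumed here).
/// Returns `None` when some position has no matching vocabulary entry,
/// in which case a caller would typically emit an unknown token.
fn wordpiece(word: &str, vocab: &HashSet<&str>) -> Option<Vec<String>> {
    let chars: Vec<char> = word.chars().collect();
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let mut end = chars.len();
        let mut found = None;
        // Try the longest candidate first, then shrink from the right.
        while end > start {
            let body: String = chars[start..end].iter().collect();
            let candidate = if start == 0 { body } else { format!("##{body}") };
            if vocab.contains(candidate.as_str()) {
                found = Some(candidate);
                break;
            }
            end -= 1;
        }
        // Bail out with `None` if nothing in the vocabulary matched here.
        pieces.push(found?);
        start = end;
    }
    Some(pieces)
}
```

With a vocabulary of `un`, `##aff`, and `##able`, calling `wordpiece("unaffable", &vocab)` yields the pieces `un`, `##aff`, `##able`. The other rows in the table replace the greedy step with a different search (merge priority, Viterbi, trie walk).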
API
// Core
Tokenizer::from_gguf_file
tokenizer.encode
tokenizer.decode
tokenizer.decode_single

// Metadata
tokenizer.vocab_size
tokenizer.bos_token
tokenizer.eos_token
tokenizer.model_type
tokenizer.pre_type

// Batch
tokenizer.encode_batch
Why shimmytok?
- No C++: Works anywhere Rust works (WASM, embedded, etc.)
- No separate files: Loads tokenizer directly from GGUF
- Correctness first: Every tokenizer validated against llama.cpp
Links
- CHANGELOG - Version history
- ROADMAP - Future plans
- CONTRIBUTING - How to contribute
- SECURITY - Vulnerability reporting
License
MIT License - forever.
Maintainer: Michael A. Kuykendall
See Also
- libshimmy - Pure Rust LLM inference engine that uses shimmytok
- llama.cpp - Reference C++ implementation
- GGUF format spec