§shimmytok
Pure Rust tokenizer for GGUF models with 100% llama.cpp compatibility.
§Features
- 🦀 Pure Rust — No C++ dependencies, compiles anywhere
- 📦 Load from GGUF — Tokenizer embedded in model file
- ✅ Validated — 10/10 vocab models match llama.cpp exactly
- ⚡ Fast — Batch encoding with Rayon parallelism
- 🌊 Streaming — Token-by-token decoding for LLM output
§Supported Tokenizers
| Type | Algorithm | Models |
|---|---|---|
| SPM | SentencePiece | LLaMA, Mistral, Gemma |
| BPE | Byte-Pair Encoding | GPT-2, Qwen, StarCoder, DeepSeek |
| WPM | WordPiece | BERT, BGE embeddings |
| UGM | Unigram | T5, mT5 |
| RWKV | Trie-based | RWKV World |
| PLaMo-2 | Table-driven DP | PLaMo-2 |
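To make one row of the table concrete, here is a minimal sketch of classic BERT-style WordPiece: greedy longest-match-first over a single word, with `##` marking continuation pieces. It is an illustration of the general algorithm only, not shimmytok's or llama.cpp's implementation, which may represent word boundaries differently. The function name `wordpiece`, the toy vocabulary, and the `[UNK]` fallback are assumptions made for the example.

```rust
use std::collections::HashSet;

/// Greedy longest-match-first WordPiece over a single word (hypothetical helper).
/// Pieces after the first carry a "##" prefix, as in BERT vocabularies.
/// If any position cannot be matched, the whole word becomes "[UNK]".
fn wordpiece(word: &str, vocab: &HashSet<&str>) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut pieces = Vec::new();
    let mut start = 0;

    while start < chars.len() {
        let mut end = chars.len();
        let mut found: Option<String> = None;

        // Try the longest remaining substring first, shrinking from the right.
        while end > start {
            let mut candidate: String = chars[start..end].iter().collect();
            if start > 0 {
                candidate.insert_str(0, "##");
            }
            if vocab.contains(candidate.as_str()) {
                found = Some(candidate);
                break;
            }
            end -= 1;
        }

        match found {
            Some(piece) => {
                pieces.push(piece);
                start = end;
            }
            None => return vec!["[UNK]".to_string()],
        }
    }
    pieces
}
```

With a toy vocabulary of `un`, `##aff`, and `##able`, calling `wordpiece("unaffable", &vocab)` yields the three pieces `un`, `##aff`, `##able`.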
§Quick Start
use shimmytok::Tokenizer;
// Load tokenizer from any GGUF model
let tokenizer = Tokenizer::from_gguf_file("model.gguf")?;
// Encode text to tokens
let tokens = tokenizer.encode("Hello, world!", true)?;
// Decode back to text
let text = tokenizer.decode(&tokens, true)?;
// Stream tokens one at a time (for LLM generation)
for token_id in &tokens {
    print!("{}", tokenizer.decode_single(*token_id, false)?);
}
§Batch Encoding
// Parallel encoding with Rayon
let texts = vec!["Hello", "World", "Rust"];
let batched = tokenizer.encode_batch(&texts, true)?;
Re-exports§
pub use plamo2::Plamo2Tokenizer;
pub use rwkv::RwkvTokenizer;
pub use ugm::UgmTokenizer;
pub use vocab::TokenType;
pub use vocab::Vocabulary;
pub use wpm::WpmTokenizer;
Modules§
- bpe
- BPE (Byte Pair Encoding) tokenizer implementation.
- byte_encoder
- GPT-2 byte-level encoding utilities.
- gguf
- GGUF file format reader.
- invariants
- Runtime invariant assertions for tokenizer correctness.
- plamo2
- PLaMo-2 tokenizer implementation.
- rwkv
- RWKV tokenizer implementation.
- sentencepiece
- SentencePiece tokenizer implementation with resegmentation support.
- ugm
- UGM (Unigram) tokenizer implementation.
- vocab
- Vocabulary loading and management from GGUF model files.
- wpm
- WPM (Word-Piece Model) tokenizer implementation.
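For readers unfamiliar with what "GPT-2 byte-level encoding" in the `byte_encoder` entry refers to, the sketch below reproduces the well-known GPT-2 byte-to-Unicode table: printable bytes map to themselves, and every other byte is remapped to a code point starting at U+0100. This is a standalone illustration of the technique, not the crate's code; the function names `bytes_to_unicode`, `byte_encode`, and `byte_decode` are assumptions and may not match anything in the `byte_encoder` module.

```rust
use std::collections::HashMap;

/// Build the GPT-2 byte-to-char table (hypothetical helper).
/// Bytes in '!'..='~', '¡'..='¬' and '®'..='ÿ' map to themselves;
/// the remaining bytes map, in ascending order, to U+0100, U+0101, ...
fn bytes_to_unicode() -> [char; 256] {
    let mut table = ['\0'; 256];
    let mut next = 0u32; // how many non-printable bytes have been remapped so far
    for b in 0u32..256 {
        let printable = (0x21..=0x7E).contains(&b)
            || (0xA1..=0xAC).contains(&b)
            || (0xAE..=0xFF).contains(&b);
        let code_point = if printable {
            b
        } else {
            let remapped = 256 + next;
            next += 1;
            remapped
        };
        table[b as usize] = char::from_u32(code_point).expect("valid code point");
    }
    table
}

/// Map each UTF-8 byte of `text` to its visible stand-in character.
fn byte_encode(text: &str) -> String {
    let table = bytes_to_unicode();
    text.bytes().map(|b| table[b as usize]).collect()
}

/// Invert `byte_encode`. Returns `None` if a character is not in the table
/// or the recovered bytes are not valid UTF-8.
fn byte_decode(encoded: &str) -> Option<String> {
    let table = bytes_to_unicode();
    let reverse: HashMap<char, u8> = table
        .iter()
        .enumerate()
        .map(|(b, &c)| (c, b as u8))
        .collect();
    let bytes: Option<Vec<u8>> = encoded.chars().map(|c| reverse.get(&c).copied()).collect();
    String::from_utf8(bytes?).ok()
}
```

Under this mapping a space becomes `Ġ` (U+0120) and a newline becomes `Ċ` (U+010A), which is why those characters show up in GPT-2-style vocabularies.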
Structs§
- DecodeOptions
- Options for decoding tokens (llama.cpp parity)
- EncodeOptions
- Options for encoding text (llama.cpp parity)
- Tokenizer
- Main tokenizer interface for encoding and decoding text
Enums§
Constants§
- MAX_INPUT_SIZE
- Maximum input text size in bytes (10MB) - Issue R4#2
- MAX_OUTPUT_TOKENS
- Maximum number of output tokens (1M) - prevents memory exhaustion
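As a rough picture of how limits like these are usually enforced, the sketch below rejects oversized input before tokenizing and caps the number of tokens produced. Everything here is an assumption made for illustration: the constant values are written out from the descriptions above (10MB is taken as 10 * 1024 * 1024 bytes, which may not be the crate's exact figure), and `LimitError`, `check_input`, and `check_output` are hypothetical names, not part of shimmytok's API.

```rust
// Assumed values mirroring the documented limits; the crate's own
// definitions may differ (for example 10_000_000 instead of 10 MiB).
const MAX_INPUT_SIZE: usize = 10 * 1024 * 1024;
const MAX_OUTPUT_TOKENS: usize = 1_000_000;

/// Hypothetical error type for the two limits.
#[derive(Debug, PartialEq)]
enum LimitError {
    InputTooLarge { len: usize },
    TooManyTokens { count: usize },
}

/// Reject input whose UTF-8 byte length exceeds the input limit.
fn check_input(text: &str) -> Result<(), LimitError> {
    if text.len() > MAX_INPUT_SIZE {
        Err(LimitError::InputTooLarge { len: text.len() })
    } else {
        Ok(())
    }
}

/// Reject a token count that exceeds the output limit.
fn check_output(count: usize) -> Result<(), LimitError> {
    if count > MAX_OUTPUT_TOKENS {
        Err(LimitError::TooManyTokens { count })
    } else {
        Ok(())
    }
}
```

Checking the byte length up front keeps the cost of rejecting a bad request constant, instead of discovering the problem after memory has already been allocated for the token buffer.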
Type Aliases§
- TokenId
- Type alias for token IDs