§shimmytok
Pure Rust tokenizer for GGUF models with llama.cpp compatibility.
§Features
- 🦀 Pure Rust - no C++ dependencies
- 📦 Load tokenizers directly from GGUF files
- ✅ 100% compatible with llama.cpp
- 🧪 Fully tested against llama.cpp output
- 🎯 Simple API - 3 methods
§Example
```rust
use shimmytok::Tokenizer;

// Load tokenizer from GGUF file
let tokenizer = Tokenizer::from_gguf_file("model.gguf")?;

// Encode text to token IDs
let tokens = tokenizer.encode("Hello world", true)?;

// Decode token IDs back to text
let text = tokenizer.decode(&tokens, true)?;
```
§Supported Models
- ✅ LLaMA / Llama-2 / Llama-3 (SentencePiece)
- ✅ Mistral (SentencePiece)
- ✅ Phi-3 (SentencePiece)
- ✅ Qwen / Qwen2 (BPE)
- ✅ Gemma (SentencePiece)
- ✅ GPT-2 / GPT-3 style BPE
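The models above fall into two tokenizer families: SentencePiece and BPE. For readers unfamiliar with the latter, here is a minimal sketch of a greedy BPE merge loop — an illustration of the general technique, not shimmytok's actual `bpe` module code. The `bpe` function name and the rank-table layout are assumptions for the example.

```rust
use std::collections::HashMap;

/// Greedy BPE: split a word into single-character symbols, then repeatedly
/// merge the adjacent pair with the best (lowest) rank in the merge table.
fn bpe(word: &str, ranks: &HashMap<(String, String), usize>) -> Vec<String> {
    let mut syms: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the adjacent pair with the lowest merge rank.
        let mut best: Option<(usize, usize)> = None; // (rank, index)
        for i in 0..syms.len().saturating_sub(1) {
            if let Some(&r) = ranks.get(&(syms[i].clone(), syms[i + 1].clone())) {
                if best.map_or(true, |(br, _)| r < br) {
                    best = Some((r, i));
                }
            }
        }
        match best {
            // Merge the winning pair into a single symbol.
            Some((_, i)) => {
                let merged = format!("{}{}", syms[i], syms[i + 1]);
                syms.splice(i..=i + 1, [merged]);
            }
            // No mergeable pair left: done.
            None => break,
        }
    }
    syms
}

fn main() {
    let mut ranks = HashMap::new();
    ranks.insert(("l".to_string(), "o".to_string()), 0);
    ranks.insert(("lo".to_string(), "w".to_string()), 1);
    // "low" -> ["l","o","w"] -> ["lo","w"] -> ["low"]
    assert_eq!(bpe("low", &ranks), vec!["low".to_string()]);
    println!("ok");
}
```

In a real tokenizer the merge ranks come from the model's vocabulary (here, loaded from the GGUF file), and the resulting symbols are looked up to produce token IDs.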
§Re-exports
- pub use plamo2::Plamo2Tokenizer;
- pub use rwkv::RwkvTokenizer;
- pub use ugm::UgmTokenizer;
- pub use vocab::TokenType;
- pub use vocab::Vocabulary;
- pub use wpm::WpmTokenizer;
§Modules
- bpe - BPE (Byte Pair Encoding) tokenizer implementation.
- byte_encoder - GPT-2 byte-level encoding. Direct port of OpenAI's bytes_to_unicode() function.
- gguf - GGUF file format reader.
- plamo2 - PLaMo-2 tokenizer implementation.
- rwkv - RWKV tokenizer implementation.
- sentencepiece - SentencePiece tokenizer implementation with resegmentation support.
- ugm - UGM (Unigram) tokenizer implementation.
- vocab - Vocabulary loading and management from GGUF model files.
- wpm - WPM (Word-Piece Model) tokenizer implementation.
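The byte_encoder module is described as a direct port of OpenAI's bytes_to_unicode(). As background, here is a self-contained sketch of what that function computes — a bijection from each byte 0..=255 to a printable Unicode character, so that arbitrary bytes can be represented in a text-based BPE vocabulary. This is the well-known GPT-2 construction, not a copy of shimmytok's internals.

```rust
use std::collections::HashMap;

/// Map each byte 0..=255 to a printable Unicode char, as in GPT-2's
/// bytes_to_unicode(). Printable ASCII and two Latin-1 ranges map to
/// themselves; the remaining 68 bytes are shifted up past U+0100 in order.
fn bytes_to_unicode() -> HashMap<u8, char> {
    let mut map = HashMap::new();
    let mut n: u32 = 0;
    for b in 0u32..=255 {
        let printable = (b'!' as u32..=b'~' as u32).contains(&b)
            || (0xA1..=0xAC).contains(&b)
            || (0xAE..=0xFF).contains(&b);
        let c = if printable {
            char::from_u32(b).unwrap() // byte maps to itself
        } else {
            let c = char::from_u32(256 + n).unwrap(); // shifted stand-in
            n += 1;
            c
        };
        map.insert(b as u8, c);
    }
    map
}

fn main() {
    let map = bytes_to_unicode();
    assert_eq!(map.len(), 256);
    assert_eq!(map[&b'A'], 'A');       // printable ASCII maps to itself
    assert_eq!(map[&b' '], '\u{120}'); // space becomes 'Ġ', as in GPT-2 vocabs
    println!("ok");
}
```

This is why GPT-2-style vocabularies contain entries like "Ġhello": the 'Ġ' (U+0120) is the stand-in for a leading space byte.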
§Structs
- DecodeOptions - Options for decoding tokens (llama.cpp parity)
- EncodeOptions - Options for encoding text (llama.cpp parity)
- Tokenizer - Main tokenizer interface for encoding and decoding text
§Enums
§Constants
- MAX_INPUT_SIZE - Maximum input text size in bytes (10MB) - Issue R4#2
- MAX_OUTPUT_TOKENS - Maximum output tokens (1M tokens max) - prevents memory exhaustion
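To make the two limits concrete, here is a minimal sketch of how a caller might enforce them before and after tokenization. The constant values match the list above, but the `check_input`/`check_output` helpers are hypothetical, not part of shimmytok's API.

```rust
// Values from the crate's documented constants (assumed definitions).
const MAX_INPUT_SIZE: usize = 10 * 1024 * 1024; // 10MB input cap
const MAX_OUTPUT_TOKENS: usize = 1_000_000;     // 1M-token output cap

/// Hypothetical pre-flight check: reject oversized input text.
fn check_input(text: &str) -> Result<(), String> {
    if text.len() > MAX_INPUT_SIZE {
        return Err(format!(
            "input is {} bytes, exceeds limit of {}",
            text.len(),
            MAX_INPUT_SIZE
        ));
    }
    Ok(())
}

/// Hypothetical post-flight check: guard against runaway token output.
fn check_output(n_tokens: usize) -> Result<(), String> {
    if n_tokens > MAX_OUTPUT_TOKENS {
        return Err(format!(
            "{} tokens exceeds limit of {}",
            n_tokens, MAX_OUTPUT_TOKENS
        ));
    }
    Ok(())
}

fn main() {
    assert!(check_input("Hello world").is_ok());
    assert!(check_input(&"x".repeat(MAX_INPUT_SIZE + 1)).is_err());
    assert!(check_output(42).is_ok());
    assert!(check_output(MAX_OUTPUT_TOKENS + 1).is_err());
    println!("ok");
}
```

Caps like these bound memory use when tokenizing untrusted input, which is the stated motivation for MAX_OUTPUT_TOKENS.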
§Type Aliases
- TokenId - Type alias for token IDs, used throughout the library