Crate shimmytok

§shimmytok

Pure Rust tokenizer for GGUF models with 100% llama.cpp compatibility.

§Features

🦀 Pure Rust — No C++ dependencies, compiles anywhere
📦 Load from GGUF — Tokenizer embedded in model file
✅ Validated — 10/10 vocab models match llama.cpp exactly
⚡ Fast — Batch encoding with Rayon parallelism
🌊 Streaming — Token-by-token decoding for LLM output

§Supported Tokenizers

Type	Algorithm	Models
SPM	SentencePiece	LLaMA, Mistral, Gemma
BPE	Byte-Pair Encoding	GPT-2, Qwen, StarCoder, DeepSeek
WPM	WordPiece	BERT, BGE embeddings
UGM	Unigram	T5, mT5
RWKV	Trie-based	RWKV World
PLaMo-2	Table-driven DP	PLaMo-2
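
To make one row of the table concrete, here is a minimal sketch of textbook WordPiece segmentation (greedy longest-match-first). It is not shimmytok's implementation. The `wordpiece` function, the toy vocabulary, and the BERT-style `##` continuation prefix are all assumptions made for illustration; a real tokenizer would read its vocabulary from the GGUF file.

```rust
use std::collections::HashSet;

/// Greedy longest-match-first segmentation of a single word (textbook WordPiece).
/// Hypothetical helper, not part of the shimmytok API.
/// Continuation pieces carry a "##" prefix, as in BERT-style vocabularies.
/// Returns None when some part of the word has no matching piece.
fn wordpiece(word: &str, vocab: &HashSet<&str>) -> Option<Vec<String>> {
    let chars: Vec<char> = word.chars().collect();
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let mut end = chars.len();
        let mut found = None;
        // Shrink the candidate from the right until it is in the vocabulary.
        while end > start {
            let mut candidate: String = chars[start..end].iter().collect();
            if start > 0 {
                candidate.insert_str(0, "##");
            }
            if vocab.contains(candidate.as_str()) {
                found = Some(candidate);
                break;
            }
            end -= 1;
        }
        // No piece matched at this position: the whole word is unknown.
        pieces.push(found?);
        start = end;
    }
    Some(pieces)
}

fn main() {
    // Toy vocabulary, invented for this example.
    let vocab: HashSet<&str> = ["token", "##izer", "##s"].iter().copied().collect();
    println!("{:?}", wordpiece("tokenizers", &vocab));
}
```

The other algorithms in the table differ mainly in how they choose among candidate pieces: BPE applies ranked merges, while Unigram scores whole segmentations.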

§Quick Start

use shimmytok::Tokenizer;

// Load tokenizer from any GGUF model
let tokenizer = Tokenizer::from_gguf_file("model.gguf")?;

// Encode text to tokens
let tokens = tokenizer.encode("Hello, world!", true)?;

// Decode back to text
let text = tokenizer.decode(&tokens, true)?;

// Stream tokens one at a time (for LLM generation)
for token_id in &tokens {
    print!("{}", tokenizer.decode_single(*token_id, false)?);
}
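
One thing to watch when streaming: with byte-level vocabularies, a single multi-byte UTF-8 character can be split across two tokens. This page does not say whether `decode_single` handles that internally, so the sketch below only shows the general buffering technique. The `Utf8Stream` type is hypothetical and not part of shimmytok; it assumes the caller has access to the raw bytes of each token.

```rust
/// Buffers raw token bytes and releases only complete UTF-8 text.
/// Hypothetical helper for illustration; not part of the shimmytok API.
struct Utf8Stream {
    pending: Vec<u8>,
}

impl Utf8Stream {
    fn new() -> Self {
        Self { pending: Vec::new() }
    }

    /// Appends the bytes of one token and returns whatever text is now complete.
    fn push(&mut self, bytes: &[u8]) -> String {
        self.pending.extend_from_slice(bytes);
        let cut = match std::str::from_utf8(&self.pending) {
            // Everything buffered is valid UTF-8: flush it all.
            Ok(_) => self.pending.len(),
            // The buffer ends in an incomplete sequence: hold the tail back.
            Err(e) if e.error_len().is_none() => e.valid_up_to(),
            // Genuinely invalid bytes: flush everything, replacing bad bytes.
            Err(_) => self.pending.len(),
        };
        let out = String::from_utf8_lossy(&self.pending[..cut]).into_owned();
        self.pending.drain(..cut);
        out
    }
}

fn main() {
    let mut stream = Utf8Stream::new();
    // "é" is the two bytes 0xC3 0xA9, delivered here as two separate chunks.
    for chunk in [&b"caf"[..], &[0xC3][..], &[0xA9][..]].iter() {
        print!("{}", stream.push(chunk));
    }
    println!();
}
```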

§Batch Encoding

// Parallel encoding with Rayon
let texts = vec!["Hello", "World", "Rust"];
let batched = tokenizer.encode_batch(&texts, true)?;

Re-exports§

pub use plamo2::Plamo2Tokenizer;
pub use rwkv::RwkvTokenizer;
pub use ugm::UgmTokenizer;
pub use vocab::TokenType;
pub use vocab::Vocabulary;
pub use wpm::WpmTokenizer;

Modules§

bpe: BPE (Byte Pair Encoding) tokenizer implementation.
byte_encoder: GPT-2 byte-level encoding utilities.
gguf: GGUF file format reader.
invariants: Runtime invariant assertions for tokenizer correctness.
plamo2: PLaMo-2 tokenizer implementation.
rwkv: RWKV tokenizer implementation.
sentencepiece: SentencePiece tokenizer implementation with resegmentation support.
ugm: UGM (Unigram) tokenizer implementation.
vocab: Vocabulary loading and management from GGUF model files.
wpm: WPM (Word-Piece Model) tokenizer implementation.

Structs§

DecodeOptions: Options for decoding tokens (llama.cpp parity)
EncodeOptions: Options for encoding text (llama.cpp parity)
Tokenizer: Main tokenizer interface for encoding and decoding text

Enums§

Error

Constants§

MAX_INPUT_SIZE: Maximum input text size in bytes (10 MB) - Issue R4#2
MAX_OUTPUT_TOKENS: Maximum output tokens (1M tokens max) - prevents memory exhaustion
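
The sketch below shows how limits like these are typically enforced before and after tokenization. It does not reproduce shimmytok's checks. The constant values are assumptions: this page says only "10MB" and "1M tokens", so the exact numbers (binary vs. decimal megabytes) may differ. `LimitError`, `check_input`, `check_output`, and the `u32` token type are hypothetical names chosen for the example.

```rust
// Assumed values: the docs state "10MB" and "1M tokens" without exact numbers.
const MAX_INPUT_SIZE: usize = 10 * 1024 * 1024;
const MAX_OUTPUT_TOKENS: usize = 1_000_000;

/// Hypothetical error type for this sketch; not shimmytok's `Error` enum.
#[derive(Debug, PartialEq)]
enum LimitError {
    InputTooLarge { len: usize, max: usize },
    TooManyTokens { len: usize, max: usize },
}

/// Rejects input text whose byte length exceeds the input limit.
fn check_input(text: &str) -> Result<(), LimitError> {
    // `str::len` counts bytes, not characters, which matches a byte limit.
    if text.len() > MAX_INPUT_SIZE {
        return Err(LimitError::InputTooLarge { len: text.len(), max: MAX_INPUT_SIZE });
    }
    Ok(())
}

/// Rejects token sequences longer than the output limit.
/// The `u32` token type is an assumption made for this example.
fn check_output(tokens: &[u32]) -> Result<(), LimitError> {
    if tokens.len() > MAX_OUTPUT_TOKENS {
        return Err(LimitError::TooManyTokens { len: tokens.len(), max: MAX_OUTPUT_TOKENS });
    }
    Ok(())
}

fn main() {
    println!("{:?}", check_input("Hello, world!"));
    let oversized = "a".repeat(MAX_INPUT_SIZE + 1);
    println!("{:?}", check_input(&oversized));
    println!("{:?}", check_output(&[1, 2, 3]));
}
```

Checking the byte length up front keeps an oversized input from ever reaching the tokenizer.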

Type Aliases§

TokenId: Type alias for token IDs, used throughout the library
