Crate shimmytok

§shimmytok

Pure Rust tokenizer for GGUF models with 100% llama.cpp compatibility.

§Features

  • 🦀 Pure Rust — No C++ dependencies, compiles anywhere
  • 📦 Load from GGUF — Tokenizer embedded in model file
  • Validated — 10/10 vocab models match llama.cpp exactly
  • Fast — Batch encoding with Rayon parallelism
  • 🌊 Streaming — Token-by-token decoding for LLM output

§Supported Tokenizers

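| Type    | Algorithm          | Models                          |
|---------|--------------------|---------------------------------|
| SPM     | SentencePiece      | LLaMA, Mistral, Gemma           |
| BPE     | Byte-Pair Encoding | GPT-2, Qwen, StarCoder, DeepSeek |
| WPM     | WordPiece          | BERT, BGE embeddings            |
| UGM     | Unigram            | T5, mT5                         |
| RWKV    | Trie-based         | RWKV World                      |
| PLaMo-2 | Table-driven DP    | PLaMo-2                         |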
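
To give a feel for what the BPE row refers to, here is a minimal, character-level sketch of the merge loop: start from single characters and repeatedly merge the adjacent pair with the lowest merge rank. It is a toy under stated assumptions, not shimmytok's implementation. Real byte-level BPE also pre-tokenizes the input and works on encoded bytes, and the name `bpe_merge` and the rank table are hypothetical.

```rust
use std::collections::HashMap;

/// Toy BPE for a single word (hypothetical helper, not part of shimmytok).
/// `ranks` maps an adjacent pair of pieces to its merge priority; lower wins.
pub fn bpe_merge(word: &str, ranks: &HashMap<(String, String), usize>) -> Vec<String> {
    // Start from one piece per character.
    let mut parts: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the best-ranked adjacent pair; the leftmost wins on ties.
        let mut best: Option<(usize, usize)> = None; // (rank, index)
        for i in 0..parts.len().saturating_sub(1) {
            let pair = (parts[i].clone(), parts[i + 1].clone());
            if let Some(&rank) = ranks.get(&pair) {
                if best.map_or(true, |(r, _)| rank < r) {
                    best = Some((rank, i));
                }
            }
        }
        // No mergeable pair left: the current pieces are the result.
        let Some((_, i)) = best else { return parts };
        // Merge pieces i and i + 1 into one.
        let right = parts.remove(i + 1);
        parts[i].push_str(&right);
    }
}
```

With the ranks `("l","o") = 0`, `("lo","w") = 1`, and `("e","r") = 2`, the word `lower` merges to the pieces `low` and `er`.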

§Quick Start

use shimmytok::Tokenizer;

// Load tokenizer from any GGUF model
let tokenizer = Tokenizer::from_gguf_file("model.gguf")?;

// Encode text to tokens
let tokens = tokenizer.encode("Hello, world!", true)?;

// Decode back to text
let text = tokenizer.decode(&tokens, true)?;

// Stream tokens one at a time (for LLM generation)
for token_id in &tokens {
    print!("{}", tokenizer.decode_single(*token_id, false)?);
}
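
The loop above prints each token's text as it arrives. With byte-level vocabularies, a single multi-byte character can be split across two tokens, and this page does not say how `decode_single` handles that case. If you ever work with raw token bytes yourself, one common approach is to buffer bytes until they form complete UTF-8. The sketch below uses only the standard library; `Utf8Stream` is a hypothetical helper, not part of shimmytok.

```rust
/// Buffers raw token bytes and releases only complete UTF-8 text.
/// Hypothetical helper for illustration; not part of shimmytok's API.
#[derive(Default)]
pub struct Utf8Stream {
    pending: Vec<u8>,
}

impl Utf8Stream {
    /// Feed the bytes of one token; returns whatever text is now complete.
    pub fn push(&mut self, bytes: &[u8]) -> String {
        self.pending.extend_from_slice(bytes);
        let mut out = String::new();
        loop {
            match std::str::from_utf8(&self.pending) {
                Ok(s) => {
                    // Everything buffered is valid: emit it all.
                    out.push_str(s);
                    self.pending.clear();
                    return out;
                }
                Err(e) => {
                    let valid = e.valid_up_to();
                    // The prefix up to `valid` is guaranteed to be valid UTF-8.
                    out.push_str(std::str::from_utf8(&self.pending[..valid]).unwrap());
                    match e.error_len() {
                        // Truncated sequence at the end: keep it for the next push.
                        None => {
                            self.pending.drain(..valid);
                            return out;
                        }
                        // Invalid bytes: substitute U+FFFD and keep scanning.
                        Some(bad) => {
                            out.push('\u{FFFD}');
                            self.pending.drain(..valid + bad);
                        }
                    }
                }
            }
        }
    }
}
```

Feeding the two bytes of `é` (`0xC3`, then `0xA9`) in separate calls yields an empty string first and `é` second, so a terminal never sees a half-written character.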

§Batch Encoding

// Parallel encoding with Rayon
let texts = vec!["Hello", "World", "Rust"];
let batched = tokenizer.encode_batch(&texts, true)?;
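
The property that matters for callers of a parallel batch API is that results come back in input order even though the work runs concurrently. The sketch below illustrates that idea with `std::thread::scope` instead of Rayon, so it runs without extra dependencies. It spawns one thread per input, which a work-stealing pool such as Rayon's would not do, and `toy_encode` is a stand-in encoder, not shimmytok's.

```rust
use std::thread;

/// Stand-in encoder for illustration only: one token ID per byte.
fn toy_encode(text: &str) -> Vec<u32> {
    text.bytes().map(u32::from).collect()
}

/// Encode every text on its own scoped thread and return the results
/// in the same order as the inputs. Hypothetical helper, not shimmytok's API.
pub fn encode_batch_sketch(texts: &[&str]) -> Vec<Vec<u32>> {
    thread::scope(|s| {
        // Spawn all workers first so they run concurrently.
        let handles: Vec<_> = texts
            .iter()
            .map(|&t| s.spawn(move || toy_encode(t)))
            .collect();
        // Join in spawn order, which preserves input order.
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Joining the handles in spawn order is what keeps `batched[i]` aligned with `texts[i]`, regardless of which worker finishes first.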

Re-exports§

pub use plamo2::Plamo2Tokenizer;
pub use rwkv::RwkvTokenizer;
pub use ugm::UgmTokenizer;
pub use vocab::TokenType;
pub use vocab::Vocabulary;
pub use wpm::WpmTokenizer;

Modules§

bpe
BPE (Byte Pair Encoding) tokenizer implementation.
byte_encoder
GPT-2 byte-level encoding utilities.
gguf
GGUF file format reader.
invariants
Runtime invariant assertions for tokenizer correctness.
plamo2
PLaMo-2 tokenizer implementation.
rwkv
RWKV tokenizer implementation.
sentencepiece
SentencePiece tokenizer implementation with resegmentation support.
ugm
UGM (Unigram) tokenizer implementation.
vocab
Vocabulary loading and management from GGUF model files.
wpm
WPM (Word-Piece Model) tokenizer implementation.

Structs§

DecodeOptions
Options for decoding tokens (llama.cpp parity)
EncodeOptions
Options for encoding text (llama.cpp parity)
Tokenizer
Main tokenizer interface for encoding and decoding text

Enums§

Error

Constants§

MAX_INPUT_SIZE
Maximum input text size in bytes (10 MB). Issue R4#2.
MAX_OUTPUT_TOKENS
Maximum number of output tokens (1M); prevents memory exhaustion.
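
A minimal sketch of how limits like these are typically enforced, checking sizes up front and returning an error rather than allocating. The exact values and comparison are assumptions: the docs say only "10MB" and "1M", so the constants below are redeclared locally with guessed figures, and `LimitError`, `check_input`, and `check_output` are hypothetical names, not shimmytok's API.

```rust
/// Assumed value: "10MB" read as 10 MiB. The crate's exact figure may differ.
pub const MAX_INPUT_SIZE: usize = 10 * 1024 * 1024;
/// Assumed value: "1M tokens" read as one million.
pub const MAX_OUTPUT_TOKENS: usize = 1_000_000;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum LimitError {
    InputTooLarge { len: usize, max: usize },
    TooManyTokens { count: usize, max: usize },
}

/// Reject input text whose byte length exceeds the limit.
pub fn check_input(text: &str) -> Result<(), LimitError> {
    if text.len() > MAX_INPUT_SIZE {
        Err(LimitError::InputTooLarge { len: text.len(), max: MAX_INPUT_SIZE })
    } else {
        Ok(())
    }
}

/// Reject a token count that exceeds the limit.
pub fn check_output(count: usize) -> Result<(), LimitError> {
    if count > MAX_OUTPUT_TOKENS {
        Err(LimitError::TooManyTokens { count, max: MAX_OUTPUT_TOKENS })
    } else {
        Ok(())
    }
}
```

Note that `str::len` counts bytes, not characters, which matches the "size in bytes" wording of `MAX_INPUT_SIZE`.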

Type Aliases§

TokenId
Type alias for token IDs, used throughout the library.