Crate shimmytok

§shimmytok

Pure Rust tokenizer for GGUF models with llama.cpp compatibility.

§Features

  • 🦀 Pure Rust - no C++ dependencies
  • 📦 Load tokenizers directly from GGUF files
  • ✅ Token-for-token compatible with llama.cpp
  • 🧪 Verified against llama.cpp tokenizer output
  • 🎯 Simple API - 3 methods

§Example

use shimmytok::Tokenizer;

fn main() -> Result<(), shimmytok::Error> {
    // Load tokenizer from GGUF file
    let tokenizer = Tokenizer::from_gguf_file("model.gguf")?;

    // Encode text to token IDs
    let tokens = tokenizer.encode("Hello world", true)?;

    // Decode token IDs back to text
    let text = tokenizer.decode(&tokens, true)?;
    println!("{text}");

    Ok(())
}

§Supported Models

  • ✅ LLaMA / Llama-2 / Llama-3 (SentencePiece)
  • ✅ Mistral (SentencePiece)
  • ✅ Phi-3 (SentencePiece)
  • ✅ Qwen / Qwen2 (BPE)
  • ✅ Gemma (SentencePiece)
  • ✅ GPT-2 / GPT-3 style BPE

Re-exports§

pub use plamo2::Plamo2Tokenizer;
pub use rwkv::RwkvTokenizer;
pub use ugm::UgmTokenizer;
pub use vocab::TokenType;
pub use vocab::Vocabulary;
pub use wpm::WpmTokenizer;

Modules§

bpe
BPE (Byte Pair Encoding) tokenizer implementation.
byte_encoder
GPT-2 byte-level encoding; a direct port of OpenAI’s bytes_to_unicode() function.
gguf
GGUF file format reader
plamo2
PLaMo-2 tokenizer implementation.
rwkv
RWKV tokenizer implementation.
sentencepiece
SentencePiece tokenizer implementation with resegmentation support.
ugm
UGM (Unigram) tokenizer implementation.
vocab
Vocabulary loading and management from GGUF model files.
wpm
WPM (Word-Piece Model) tokenizer implementation.
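
The byte_encoder module ports GPT-2's bytes_to_unicode() mapping, which assigns every byte 0..=255 a printable Unicode character so BPE can treat arbitrary bytes as text losslessly. The sketch below is an illustrative re-implementation of that well-known mapping, not shimmytok's actual code:

```rust
/// Sketch of GPT-2's bytes_to_unicode(): printable bytes map to
/// themselves; the rest are shifted into the U+0100+ range so every
/// byte gets a distinct printable character.
fn bytes_to_unicode() -> [char; 256] {
    let mut table = ['\0'; 256];
    let mut n = 0u32; // counter for unprintable bytes seen so far
    for b in 0u32..=255 {
        // The three "printable" ranges used by GPT-2.
        let printable = (0x21..=0x7E).contains(&b)
            || (0xA1..=0xAC).contains(&b)
            || (0xAE..=0xFF).contains(&b);
        table[b as usize] = if printable {
            char::from_u32(b).unwrap()
        } else {
            let c = char::from_u32(256 + n).unwrap();
            n += 1;
            c
        };
    }
    table
}
```

This reproduces the familiar artifacts of GPT-2 vocabularies: byte 0x20 (space) becomes 'Ġ' (U+0120) and newline becomes 'Ċ' (U+010A).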

Structs§

DecodeOptions
Options for decoding tokens (llama.cpp parity)
EncodeOptions
Options for encoding text (llama.cpp parity)
Tokenizer
Main tokenizer interface for encoding and decoding text

Enums§

Error

Constants§

MAX_INPUT_SIZE
Maximum input text size in bytes (10MB) - Issue R4#2
MAX_OUTPUT_TOKENS
Maximum number of output tokens (1M) - prevents memory exhaustion

Type Aliases§

TokenId
Type alias for token IDs