# tokie

Fast BPE tokenizer using Aho-Corasick automata.
This crate implements Byte Pair Encoding (BPE) tokenization using the algorithm from GitHub's rust-gems, which combines an Aho-Corasick automaton for efficient suffix matching with a compatibility check between adjacent tokens.
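To make the encoding step concrete, here is a naive reference BPE encoder: it repeatedly applies the highest-priority merge of an adjacent token pair until none applies. This is only an illustration of what BPE computes, not the Aho-Corasick algorithm this crate uses, and the function and parameter names are hypothetical.

```rust
use std::collections::HashMap;

/// Naive O(n^2) reference BPE: `merges` maps a token pair to
/// `(rank, new_id)`, where a lower rank means a higher-priority merge.
/// Illustrative only; not this crate's implementation.
fn bpe_encode(bytes: &[u8], merges: &HashMap<(u32, u32), (u32, u32)>) -> Vec<u32> {
    // Start with one token per byte.
    let mut tokens: Vec<u32> = bytes.iter().map(|&b| b as u32).collect();
    loop {
        // Find the adjacent pair with the lowest (best) merge rank;
        // ties break toward the leftmost position.
        let best = tokens
            .windows(2)
            .enumerate()
            .filter_map(|(i, w)| merges.get(&(w[0], w[1])).map(|&(rank, id)| (rank, i, id)))
            .min();
        match best {
            Some((_rank, i, id)) => {
                // Replace the pair at position i with the merged token.
                tokens[i] = id;
                tokens.remove(i + 1);
            }
            None => return tokens,
        }
    }
}
```

A real implementation avoids the repeated linear scans; the Aho-Corasick approach instead matches candidate tokens directly against the input and checks that consecutive matches are compatible under the merge order.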
## Quick Start

```rust
use tokie::Tokenizer;

// Load from a HuggingFace tokenizer.json
let tokenizer = Tokenizer::from_json("tokenizer.json")?;

// Encode text (without special tokens)
let tokens = tokenizer.encode("Hello, world!", false);

// Encode with special tokens (for model input)
let tokens_with_special = tokenizer.encode("Hello, world!", true);

// Decode back to text
let text = tokenizer.decode(&tokens).unwrap();

// Save/load the binary format (fast)
tokenizer.to_file("model.tkz")?;
let tokenizer = Tokenizer::from_file("model.tkz")?;
```
## Loading from HuggingFace Hub

Enable the `hf` feature to load tokenizers directly from the HuggingFace Hub:

```toml
[dependencies]
tokie = { version = "0.1", features = ["hf"] }
```

```rust
use tokie::Tokenizer;

let tokenizer = Tokenizer::from_pretrained("gpt2")?;
let tokenizer = Tokenizer::from_pretrained("meta-llama/Llama-3.2-8B")?;
```
## Architecture

- [Tokenizer] - High-level API combining pre-tokenization, BPE encoding, and decoding
- [encoder] - BPE encoders (backtracking for tiktoken, heap-based for LLaMA)
- [Decoder] - Token-ID-to-bytes decoder (can be shared across encoder types)
- [pretok] - Fast pretokenizers (GPT-2 at 566 MiB/s; cl100k, o200k)
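The decoding half of the pipeline is the simplest piece: each token ID maps to a byte sequence, and decoding concatenates them. A minimal sketch of that idea, with a hypothetical `decode` function and plain standard-library types rather than this crate's actual `Decoder`:

```rust
use std::collections::HashMap;

/// Illustrative sketch: look up each token's byte sequence and concatenate.
/// Returns `None` if any token ID is missing from the vocabulary.
/// Hypothetical signature; not this crate's `Decoder` API.
fn decode(tokens: &[u32], vocab: &HashMap<u32, Vec<u8>>) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    for t in tokens {
        out.extend_from_slice(vocab.get(t)?);
    }
    Some(out)
}
```

Because decoding needs only the ID-to-bytes table and not the merge rules, a single decoder can serve both the backtracking and heap-based encoder variants.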