High-performance pure-Rust BPE tokenizer compatible with OpenAI’s tiktoken and with the tokenizers of several other mainstream LLM providers.
Supports 9 encodings across 5 providers: OpenAI (cl100k_base, o200k_base,
p50k_base, p50k_edit, r50k_base), Meta (llama3), DeepSeek (deepseek_v3),
Alibaba (qwen2), and Mistral (mistral_v3).
Includes token encoding, decoding, counting, and multi-provider pricing.
§Quick Start
// by encoding name
let enc = tiktoken::get_encoding("cl100k_base").unwrap();
let tokens = enc.encode("hello world");
let text = enc.decode_to_string(&tokens).unwrap();
assert_eq!(text, "hello world");
// by model name
let enc = tiktoken::encoding_for_model("gpt-4o").unwrap();
let count = enc.count("hello world");
assert_eq!(count, 2);
Modules§
- encoding
- Encoding definitions and data parsing for tiktoken-compatible BPE vocabularies.
- pricing
- Per-model pricing data and cost estimation for OpenAI, Anthropic, Google, Meta, DeepSeek, Alibaba, and Mistral.
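The pricing module’s exact API isn’t detailed on this page; as a self-contained sketch of the arithmetic such cost estimation performs — per-million-token rates applied separately to input and output tokens — with placeholder rates that are not real model prices:

```rust
// Cost estimation as typically computed for LLM usage: separate
// per-million-token rates for input (prompt) and output (completion).
fn estimate_cost_usd(
    input_tokens: u64,
    output_tokens: u64,
    input_per_million: f64,
    output_per_million: f64,
) -> f64 {
    input_tokens as f64 / 1_000_000.0 * input_per_million
        + output_tokens as f64 / 1_000_000.0 * output_per_million
}

fn main() {
    // Placeholder rates: $2.50 / 1M input tokens, $10.00 / 1M output tokens.
    let cost = estimate_cost_usd(1_000, 500, 2.50, 10.00);
    // 1000/1e6 * 2.50 + 500/1e6 * 10.00 = 0.0025 + 0.005 = 0.0075
    assert!((cost - 0.0075).abs() < 1e-12);
    println!("${cost:.4}");
}
```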
Structs§
- CoreBpe
- A Byte Pair Encoding tokenizer engine.
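For orientation, here is a minimal, self-contained sketch of the merge loop at the heart of any BPE engine like CoreBpe — repeatedly merge the adjacent pair with the lowest rank until none remains. The toy merge table below is illustrative only, not the crate’s actual vocabulary:

```rust
use std::collections::HashMap;

// Greedy BPE: repeatedly merge the adjacent pair with the lowest
// (highest-priority) rank until no mergeable pair remains.
fn bpe_merge(mut parts: Vec<String>, ranks: &HashMap<(String, String), u32>) -> Vec<String> {
    loop {
        // Find the adjacent pair with the best (lowest) merge rank.
        let best = parts
            .windows(2)
            .enumerate()
            .filter_map(|(i, w)| ranks.get(&(w[0].clone(), w[1].clone())).map(|&r| (r, i)))
            .min();
        match best {
            Some((_, i)) => {
                // Replace parts[i] and parts[i + 1] with their concatenation.
                let merged = format!("{}{}", parts[i], parts[i + 1]);
                parts.splice(i..=i + 1, [merged]);
            }
            None => return parts,
        }
    }
}

fn main() {
    // Toy merge table: ("h","e") merges first, then ("l","l"), and so on.
    let ranks: HashMap<(String, String), u32> = [
        (("h".into(), "e".into()), 0),
        (("l".into(), "l".into()), 1),
        (("he".into(), "ll".into()), 2),
        (("hell".into(), "o".into()), 3),
    ]
    .into_iter()
    .collect();

    let parts: Vec<String> = "hello".chars().map(|c| c.to_string()).collect();
    let tokens = bpe_merge(parts, &ranks);
    assert_eq!(tokens, vec!["hello"]); // all four merges fire in rank order
}
```

A real engine works over bytes rather than chars and stores ranks against byte sequences, but the greedy lowest-rank merge order is the same idea.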
Functions§
- encoding_for_model
- Get a cached tokenizer by model name.
- get_encoding
- Get a cached tokenizer by encoding name.
- list_encodings
- All available encoding names.
- model_to_encoding
- Map a model name to its encoding name.