§riptoken
Fast BPE tokenizer for LLMs — a drop-in compatible, faster reimplementation
of OpenAI’s tiktoken.
§Design
riptoken is structured as three layers:
- A pure-Rust core (CoreBPE) that can be used directly from Rust.
- An optional PyO3 binding (enabled with the python feature).
- A Python wrapper package shipped on PyPI.
The core BPE algorithm is a Rust port of tiktoken’s with several
optimizations applied — see README.md for benchmarks and details.
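For readers unfamiliar with byte-pair merging, the following is a minimal sketch of the textbook greedy scheme: start from single bytes and repeatedly merge the adjacent pair whose concatenation has the lowest rank. It is an illustration only, not riptoken's optimized implementation; the function name byte_pair_encode and the local Rank alias are hypothetical and use only the standard library. It also assumes the input has already been split into pieces (for example by a regex) and that every single byte of the piece is present in the vocabulary.

```rust
use std::collections::HashMap;

/// Local alias for this sketch only (hypothetical; chosen to resemble a token rank).
type Rank = u32;

/// Naive greedy byte-pair merge over one pre-split piece of text.
/// Assumes every single byte of `piece` has an entry in `ranks`;
/// the final lookup panics otherwise.
fn byte_pair_encode(piece: &[u8], ranks: &HashMap<Vec<u8>, Rank>) -> Vec<Rank> {
    // Each part is a half-open byte range (start, end) into `piece`.
    // Initially every byte is its own part.
    let mut parts: Vec<(usize, usize)> = (0..piece.len()).map(|i| (i, i + 1)).collect();

    loop {
        // Scan all adjacent pairs and remember the one with the lowest rank.
        let mut best: Option<(usize, Rank)> = None;
        for i in 0..parts.len().saturating_sub(1) {
            let merged = &piece[parts[i].0..parts[i + 1].1];
            if let Some(&rank) = ranks.get(merged) {
                if best.map_or(true, |(_, r)| rank < r) {
                    best = Some((i, rank));
                }
            }
        }

        match best {
            // Merge the winning pair into a single part and rescan.
            Some((i, _)) => {
                parts[i].1 = parts[i + 1].1;
                parts.remove(i + 1);
            }
            // No adjacent pair is in the vocabulary: merging is finished.
            None => break,
        }
    }

    // Map each remaining part to its rank.
    parts.iter().map(|&(s, e)| ranks[&piece[s..e]]).collect()
}
```

With a vocabulary of h = 0, i = 1, hi = 2, this sketch merges the two bytes of "hi" into the single token 2. The rescan after every merge makes it quadratic in the piece length, which is the kind of cost a production tokenizer would work to avoid.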
§Example
use riptoken::{CoreBPE, Rank};
use rustc_hash::FxHashMap;
// In practice you would load `encoder` from an o200k_base / cl100k_base
// vocabulary file via `riptoken::load_tiktoken_bpe`.
let mut encoder: FxHashMap<Vec<u8>, Rank> = FxHashMap::default();
encoder.insert(b"h".to_vec(), 0);
encoder.insert(b"i".to_vec(), 1);
encoder.insert(b"hi".to_vec(), 2);
let specials = FxHashMap::default();
let bpe = CoreBPE::new(encoder, specials, r"\w+").unwrap();
let tokens = bpe.encode_ordinary("hi");
assert_eq!(tokens, vec![2]);
Structs§
- CoreBPE
- The core BPE encoder/decoder.
Enums§
- BuildError
- Errors produced when constructing a CoreBPE.
- DecodeError
- Errors produced during decoding.
Type Aliases§
- Rank
- Integer rank of a token in the BPE vocabulary.