Crate riptoken

§riptoken

Fast BPE tokenizer for LLMs: a faster, drop-in-compatible reimplementation of OpenAI’s tiktoken.

§Design

riptoken is structured as three layers:

  1. A pure-Rust core (CoreBPE) that can be used directly from Rust.
  2. An optional PyO3 binding (enabled with the python feature).
  3. A Python wrapper package shipped on PyPI.

The core BPE algorithm is a Rust port of tiktoken’s, with several optimizations applied. See README.md for benchmarks and details.

§Example

use riptoken::{CoreBPE, Rank};
use rustc_hash::FxHashMap;

// In practice you would load `encoder` from an o200k_base / cl100k_base
// vocabulary file via `riptoken::load_tiktoken_bpe`.
let mut encoder: FxHashMap<Vec<u8>, Rank> = FxHashMap::default();
encoder.insert(b"h".to_vec(), 0);
encoder.insert(b"i".to_vec(), 1);
encoder.insert(b"hi".to_vec(), 2);

let specials = FxHashMap::default();
let bpe = CoreBPE::new(encoder, specials, r"\w+").unwrap();

let tokens = bpe.encode_ordinary("hi");
assert_eq!(tokens, vec![2]);
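To see why the example yields `[2]`, here is a minimal sketch of rank-based byte-pair merging as it is commonly done in tiktoken-style tokenizers. It is illustrative only: it is not riptoken's optimized implementation, the function name `bpe_encode_piece` is hypothetical, the local `Rank` alias is a stand-in whose width is assumed, and the sketch skips the regex pre-splitting step and special-token handling.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for riptoken's `Rank` alias (width assumed).
type Rank = u32;

/// Encode one piece by repeatedly merging the adjacent pair whose
/// concatenation has the lowest rank in `encoder`.
/// Returns `None` if a final part is missing from the vocabulary.
fn bpe_encode_piece(encoder: &HashMap<Vec<u8>, Rank>, piece: &[u8]) -> Option<Vec<Rank>> {
    // Start with one part per input byte.
    let mut parts: Vec<Vec<u8>> = piece.iter().map(|&b| vec![b]).collect();

    loop {
        // Scan adjacent pairs for the merge with the lowest rank.
        let mut best: Option<(usize, Rank)> = None;
        for i in 0..parts.len().saturating_sub(1) {
            let mut merged = parts[i].clone();
            merged.extend_from_slice(&parts[i + 1]);
            if let Some(&rank) = encoder.get(&merged) {
                if best.map_or(true, |(_, r)| rank < r) {
                    best = Some((i, rank));
                }
            }
        }
        match best {
            // Merge the winning pair in place and rescan.
            Some((i, _)) => {
                let right = parts.remove(i + 1);
                parts[i].extend_from_slice(&right);
            }
            // No adjacent pair is in the vocabulary: merging is done.
            None => break,
        }
    }

    // Look up the rank of each remaining part.
    parts.iter().map(|p| encoder.get(p).copied()).collect()
}
```

With the three-entry vocabulary from the example, `bpe_encode_piece(&encoder, b"hi")` merges `h` and `i` into `hi` and returns `Some(vec![2])`. This naive version rescans every pair after each merge, which is quadratic in the piece length; production implementations avoid that cost.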

Structs§

CoreBPE
The core BPE encoder/decoder.

Enums§

BuildError
Errors produced when constructing a CoreBPE.
DecodeError
Errors produced during decoding.

Type Aliases§

Rank
Integer rank of a token in the BPE vocabulary.
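
Because a rank doubles as the token id, decoding in a tiktoken-style tokenizer amounts to inverting the encoder map and concatenating bytes. The sketch below illustrates that idea under stated assumptions: `build_decoder`, `decode`, and `SketchDecodeError` are hypothetical names, the local `Rank` alias is a stand-in, and nothing here is claimed to match riptoken's actual `DecodeError` variants or decoding API.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for riptoken's `Rank` alias (width assumed).
type Rank = u32;

/// Hypothetical error type for this sketch; not riptoken's `DecodeError`.
#[derive(Debug, PartialEq)]
enum SketchDecodeError {
    /// The token id has no entry in the vocabulary.
    UnknownRank(Rank),
    /// The concatenated bytes are not valid UTF-8.
    InvalidUtf8,
}

/// Invert an encoder (bytes -> rank) into a decoder (rank -> bytes).
fn build_decoder(encoder: &HashMap<Vec<u8>, Rank>) -> HashMap<Rank, Vec<u8>> {
    encoder
        .iter()
        .map(|(bytes, &rank)| (rank, bytes.clone()))
        .collect()
}

/// Concatenate the bytes behind each rank, then validate the result as UTF-8.
fn decode(
    decoder: &HashMap<Rank, Vec<u8>>,
    tokens: &[Rank],
) -> Result<String, SketchDecodeError> {
    let mut out = Vec::new();
    for &t in tokens {
        let bytes = decoder
            .get(&t)
            .ok_or(SketchDecodeError::UnknownRank(t))?;
        out.extend_from_slice(bytes);
    }
    String::from_utf8(out).map_err(|_| SketchDecodeError::InvalidUtf8)
}
```

The UTF-8 check is applied to the whole concatenation, not to each token, because a single token may hold only part of a multi-byte character.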