Module tokenize

Pure-Rust BPE encoder that converts text to token IDs.

Required for the bidirectional Codec endpoint where the client wants to send token-ID prompts (zero text on the wire in either direction).

Algorithm (for both byte_level and metaspace BPE)

1. Pre-tokenize: split the input into pieces (by regex for byte_level; on whitespace for metaspace).
2. Encode each piece into the vocab's character space (GPT-2 byte chars for byte_level, ▁-prefixed for metaspace).
3. Apply BPE merges greedily in priority order, matching the HuggingFace reference implementation.
4. Look up the final tokens in the vocab. On the metaspace path, tokens missing from the vocab fall back to byte tokens.
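
The merge and lookup steps can be illustrated with a minimal sketch. This is not the code behind BPETokenizer: the names `Ranks`, `bpe_merge`, and `lookup_ids` are hypothetical, the `<0xNN>` byte-token spelling is an assumption borrowed from SentencePiece-style vocabularies, and the quadratic merge loop favors readability over matching the reference implementation's tie-breaking in every case.

```rust
use std::collections::HashMap;

/// Hypothetical merge table: (left, right) -> rank. Lower rank = higher priority.
type Ranks = HashMap<(String, String), usize>;

/// Merge step: repeatedly merge the adjacent pair with the lowest rank
/// (leftmost occurrence on ties) until no adjacent pair is in the table.
fn bpe_merge(piece: &[&str], ranks: &Ranks) -> Vec<String> {
    let mut symbols: Vec<String> = piece.iter().map(|s| s.to_string()).collect();
    loop {
        let mut best: Option<(usize, usize)> = None; // (rank, index of left symbol)
        for i in 0..symbols.len().saturating_sub(1) {
            let key = (symbols[i].clone(), symbols[i + 1].clone());
            if let Some(&rank) = ranks.get(&key) {
                if best.map_or(true, |(best_rank, _)| rank < best_rank) {
                    best = Some((rank, i));
                }
            }
        }
        match best {
            Some((_, i)) => {
                // Fuse the pair into one symbol and drop the right half.
                let right = symbols.remove(i + 1);
                symbols[i].push_str(&right);
            }
            None => break,
        }
    }
    symbols
}

/// Lookup step: map tokens to IDs. A token missing from the vocab is
/// replaced by one byte token per UTF-8 byte (assumed spelling: "<0x41>").
/// Bytes with no byte token are silently skipped in this sketch.
fn lookup_ids(tokens: &[String], vocab: &HashMap<String, u32>) -> Vec<u32> {
    let mut ids = Vec::new();
    for token in tokens {
        if let Some(&id) = vocab.get(token) {
            ids.push(id);
        } else {
            for byte in token.bytes() {
                if let Some(&id) = vocab.get(&format!("<0x{:02X}>", byte)) {
                    ids.push(id);
                }
            }
        }
    }
    ids
}
```

With a merge table that ranks `("l", "o")` above `("lo", "w")`, the symbols `l o w e r` would collapse to `low e r`; if `e` is absent from the vocab but `<0x65>` is present, the lookup emits the byte token's ID in its place.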

Structs

BPETokenizer: Pure-Rust BPE encoder.

Traits

ITokenizer: Common interface every tokenizer implementation satisfies.