Skip to main content

Module tokenize

Module tokenize 

Source
Expand description

Pure-Rust BPE encoder. Text → token IDs.

Required for the bidirectional Codec endpoint where the client wants to send token-ID prompts (zero text on the wire in either direction).

§Algorithm (for both byte_level and metaspace BPE)

  1. Pre-tokenize: split input into pieces (regex for byte_level; whitespace for metaspace).
  2. Encode each piece into the vocab’s character space (GPT-2 byte chars or -prefixed).
  3. Apply BPE merges greedily by priority — match HuggingFace reference.
  4. Look up final tokens in vocab. Tokens not in vocab fall back to byte tokens (metaspace path).

Structs§

BPETokenizer
Pure-Rust BPE encoder.

Traits§

ITokenizer
Common interface every tokenizer implementation satisfies.