Expand description
Pure-Rust BPE encoder. Text → token IDs.
Required for the bidirectional Codec endpoint where the client wants to send token-ID prompts (zero text on the wire in either direction).
§Algorithm (for both byte_level and metaspace BPE)
- Pre-tokenize: split input into pieces (regex for byte_level; whitespace for metaspace).
- Encode each piece into the vocab’s character space (GPT-2 byte chars or
▁-prefixed). - Apply BPE merges greedily by priority — match HuggingFace reference.
- Look up final tokens in
vocab. Tokens not in vocab fall back to byte tokens (metaspace path).
Structs§
- BPETokenizer
- Pure-Rust BPE encoder.
Traits§
- ITokenizer
- Common interface every tokenizer implementation satisfies.