GPT-2 byte ↔ unicode mapping table and helpers shared by the Detokenizer and BPE encoder.
Constants
- METASPACE - The metaspace marker (▁, U+2581) used by SentencePiece tokenizers.
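As a rough illustration of how the metaspace marker is typically used, the sketch below swaps spaces for ▁ and back. The names `METASPACE_SKETCH`, `to_metaspace`, and `from_metaspace` are hypothetical and not part of this module; real SentencePiece preprocessing also handles normalization and prefix-space options that are omitted here.

```rust
/// Hypothetical local copy of the marker; the module's own constant is METASPACE.
const METASPACE_SKETCH: char = '\u{2581}';

/// Replace every ASCII space with the metaspace marker (assumed convention).
fn to_metaspace(text: &str) -> String {
    text.replace(' ', &METASPACE_SKETCH.to_string())
}

/// Turn metaspace markers in a token piece back into ASCII spaces.
fn from_metaspace(piece: &str) -> String {
    piece.replace(METASPACE_SKETCH, " ")
}
```

Under these assumptions, `to_metaspace("hello world")` yields `"hello▁world"`, and `from_metaspace` reverses it.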
Functions
- byte_to_char - Maps a byte (0–255) to its GPT-2-encoded character.
- char_to_byte - Maps a GPT-2-encoded character codepoint back to a byte; returns None if not in the table.
- decode_byte_level_token - Decodes a byte-level BPE token (e.g. "Ġhello") to its raw bytes by reversing the GPT-2 byte→unicode table. Characters outside the table fall back to their UTF-8 bytes; this is a defensive measure and should not happen for valid vocab entries.
- encode_byte_level_chars - Encodes raw bytes into a string of GPT-2 byte-encoded characters. The result matches the keys of a byte_level vocab.
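To make the mapping concrete, here is a minimal, self-contained sketch of the standard GPT-2 byte ↔ unicode scheme. It is not this crate's implementation: every function name carries a `_sketch` suffix and is hypothetical, the real signatures may differ, and a real implementation would most likely use a precomputed lookup table rather than recomputing the mapping on each call. In the GPT-2 scheme, printable bytes (`!`..`~`, `¡`..`¬`, `®`..`ÿ`) map to themselves, and the remaining 68 bytes are assigned codepoints starting at U+0100 in ascending byte order.

```rust
/// True for bytes that the GPT-2 table maps to the same codepoint.
fn is_direct(b: u8) -> bool {
    matches!(b, 0x21..=0x7E | 0xA1..=0xAC | 0xAE..=0xFF)
}

/// Hypothetical stand-in for `byte_to_char`.
fn byte_to_char_sketch(b: u8) -> char {
    if is_direct(b) {
        char::from(b)
    } else {
        // Remapped bytes are numbered in ascending order, starting at U+0100.
        let rank = (0..b).filter(|&x| !is_direct(x)).count() as u32;
        char::from_u32(256 + rank).expect("256..=323 are valid codepoints")
    }
}

/// Hypothetical stand-in for `char_to_byte`; a linear scan keeps the sketch short.
fn char_to_byte_sketch(c: char) -> Option<u8> {
    (0..=255u8).find(|&b| byte_to_char_sketch(b) == c)
}

/// Hypothetical stand-in for `encode_byte_level_chars`.
fn encode_sketch(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| byte_to_char_sketch(b)).collect()
}

/// Hypothetical stand-in for `decode_byte_level_token`.
fn decode_sketch(token: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(token.len());
    for c in token.chars() {
        match char_to_byte_sketch(c) {
            Some(b) => out.push(b),
            // Defensive fallback: keep the character's own UTF-8 bytes.
            None => {
                let mut buf = [0u8; 4];
                out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
            }
        }
    }
    out
}
```

With this sketch, the space byte `0x20` maps to `Ġ` (U+0120), so `encode_sketch(b" hello")` produces `"Ġhello"` and `decode_sketch("Ġhello")` returns the original bytes.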