Module byte_encoder

GPT-2 byte ↔ unicode mapping table and helpers shared by the Detokenizer and BPE encoder.

Constants§

METASPACE
The metaspace marker (▁, U+2581) used by SentencePiece tokenizers.
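
As a rough illustration of how such a marker is typically consumed, the sketch below turns SentencePiece-style pieces back into plain text. It assumes the constant is a `char`; the module may define it as a `&str` instead. `metaspace_to_spaces` is a hypothetical helper and is not part of this module.

```rust
// Assumption: METASPACE is redeclared here as a `char` so the snippet is
// self-contained; the real constant's type is not shown on this page.
const METASPACE: char = '\u{2581}';

/// Hypothetical helper (not part of `byte_encoder`): replaces every
/// metaspace marker in a piece with an ordinary ASCII space.
fn metaspace_to_spaces(piece: &str) -> String {
    piece.replace(METASPACE, " ")
}
```

A piece such as `"▁Hello▁world"` would come back as `" Hello world"`, with the leading space preserved.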

Functions§

byte_to_char
Maps a byte (0–255) to its GPT-2-encoded character.
char_to_byte
Maps a GPT-2-encoded character codepoint back to a byte; returns None if not in the table.
decode_byte_level_token
Decode a byte-level BPE token (e.g. "Ġhello") to its raw bytes by reversing the GPT-2 byte→unicode table. Characters outside the table fall back to UTF-8 bytes (defensive — shouldn’t happen for valid vocab entries).
encode_byte_level_chars
Encode raw bytes into a string of GPT-2 byte-encoded characters. The result matches the keys of a byte_level vocab.
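
The four functions above can be sketched from the publicly known GPT-2 scheme: printable bytes (`!`..`~`, `¡`..`¬`, `®`..`ÿ`) map to their own codepoint, and the remaining 68 bytes are assigned U+0100 onward in ascending byte order. The signatures below are assumptions, since this page does not show them, and the bodies are a minimal illustration rather than the module's actual implementation (which presumably uses a precomputed table instead of recomputing the mapping on each call).

```rust
/// True for bytes that GPT-2 maps to their own codepoint.
fn is_printable(b: u8) -> bool {
    matches!(b, 0x21..=0x7E | 0xA1..=0xAC | 0xAE..=0xFF)
}

/// Assumed signature: maps a byte to its GPT-2-encoded character.
fn byte_to_char(b: u8) -> char {
    if is_printable(b) {
        return b as char; // u8 -> char keeps the codepoint value
    }
    // Rank of `b` among the non-printable bytes, offset by 256.
    let n = (0..b).filter(|&x| !is_printable(x)).count() as u32;
    char::from_u32(256 + n).expect("256..=323 are valid scalar values")
}

/// Assumed signature: inverse lookup; `None` if `c` is not in the table.
/// A linear scan keeps the sketch short.
fn char_to_byte(c: char) -> Option<u8> {
    (0..=255u8).find(|&b| byte_to_char(b) == c)
}

/// Assumed signature: raw bytes -> string of GPT-2 byte-encoded characters.
fn encode_byte_level_chars(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| byte_to_char(b)).collect()
}

/// Assumed signature: byte-level token -> raw bytes. Characters outside
/// the table fall back to their UTF-8 bytes.
fn decode_byte_level_token(token: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(token.len());
    for c in token.chars() {
        match char_to_byte(c) {
            Some(b) => out.push(b),
            None => {
                let mut buf = [0u8; 4];
                out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
            }
        }
    }
    out
}
```

Under this sketch the space byte (0x20) becomes `Ġ` (U+0120), so `encode_byte_level_chars(b" hello")` yields `"Ġhello"` and `decode_byte_level_token("Ġhello")` returns the bytes of `" hello"`.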