Module tokenizer


Pure Rust implementation of the GPT-2 byte-pair encoder (aka “text tokenizer”).

Structs

Tokenizer
Tokenizer which converts strings into token sequences consumable by the GPT-2 model, and vice versa.
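
The encoder works by repeatedly merging the adjacent symbol pair with the best (lowest) rank in a learned merge table. A minimal sketch of that core loop, using a tiny hypothetical merge table (the real GPT-2 merges file has roughly 50,000 entries, and this omits the byte-level pre-tokenization the full Tokenizer performs):

```rust
use std::collections::HashMap;

/// Repeatedly merge the adjacent pair of symbols with the lowest merge
/// rank until no mergeable pair remains. This is the heart of byte-pair
/// encoding; the ranks here are illustrative, not the real GPT-2 table.
fn bpe(mut parts: Vec<String>, ranks: &HashMap<(String, String), usize>) -> Vec<String> {
    loop {
        // Find the adjacent pair with the lowest (i.e. earliest-learned) rank.
        let best = parts
            .windows(2)
            .enumerate()
            .filter_map(|(i, w)| ranks.get(&(w[0].clone(), w[1].clone())).map(|&r| (r, i)))
            .min();
        match best {
            Some((_, i)) => {
                // Merge parts[i] and parts[i + 1] into a single symbol.
                let merged = format!("{}{}", parts[i], parts[i + 1]);
                parts[i] = merged;
                parts.remove(i + 1);
            }
            None => break, // no pair left in the merge table
        }
    }
    parts
}

fn main() {
    // Hypothetical merge ranks: learn "l"+"o" first, then "lo"+"w".
    let mut ranks = HashMap::new();
    ranks.insert(("l".to_string(), "o".to_string()), 0);
    ranks.insert(("lo".to_string(), "w".to_string()), 1);

    let parts = vec!["l".to_string(), "o".to_string(), "w".to_string()];
    println!("{:?}", bpe(parts, &ranks)); // ["low"]
}
```

Each resulting symbol is then looked up in the vocabulary to produce the token IDs the model consumes; decoding reverses the mapping.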

Constants

END_OF_TEXT_STRING
The string form of the end-of-text marker, <|endoftext|>.
END_OF_TEXT_TOKEN
The token ID of the <|endoftext|> marker, 50256.
PAD_TOKEN
Token 50256 is used as the padding token, which corresponds to the <|endoftext|> token in the OpenAI GPT-2 encoder.