pub struct Tokenizer { /* private fields */ }
Tokenizer which converts strings into token sequences consumable by the GPT-2 model, and vice-versa.
This tokenizer loads its configuration from the original OpenAI GPT-2 encoder and “byte-pair encoding” (BPE) vocabulary files.
Implementations§
impl Tokenizer
pub fn new(bpe_path: &str, encoder_path: &str) -> Self
Creates a new in-memory tokenizer from the BPE file at bpe_path and the character encoding file at encoder_path.
pub fn encode_to_length(&self, text: &str, token_sequence_length: usize) -> (Vec<i32>, usize)
Encodes text into a token sequence, truncating and/or “right-padding” the encoded token sequence to fit token_sequence_length, using PAD_TOKEN as the padding token.
The returned tuple contains (token_sequence, padding_length), where padding_length is the number of padding tokens in token_sequence. If the length of token_sequence before truncation exceeds token_sequence_length, padding_length will always be zero.
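The truncate-or-pad step described above can be sketched as follows. This is a minimal illustration and not the crate's actual implementation: the helper name fit_to_length and the PAD_TOKEN value of 0 are assumptions made for the example, and the real tokenizer also performs the BPE encoding that is omitted here.

```rust
/// Hypothetical padding token ID; the crate's real PAD_TOKEN value may differ.
const PAD_TOKEN: i32 = 0;

/// Hypothetical helper: truncates or right-pads `tokens` to exactly
/// `target_len` entries. Returns the fitted sequence and the number of
/// padding tokens that were appended.
fn fit_to_length(mut tokens: Vec<i32>, target_len: usize) -> (Vec<i32>, usize) {
    // Padding needed; zero when the input is already long enough.
    let padding_len = target_len.saturating_sub(tokens.len());
    // `resize` truncates a longer vector and appends PAD_TOKEN to a shorter one.
    tokens.resize(target_len, PAD_TOKEN);
    (tokens, padding_len)
}

fn main() {
    // Shorter than the target: two padding tokens are appended.
    let (padded, pad_count) = fit_to_length(vec![11, 22, 33], 5);
    println!("{:?} padding={}", padded, pad_count);

    // Longer than the target: the tail is dropped and no padding is added.
    let (truncated, pad_count) = fit_to_length(vec![11, 22, 33, 44], 2);
    println!("{:?} padding={}", truncated, pad_count);
}
```

Given the returned tuple, the number of real (non-padding) tokens is token_sequence_length minus padding_length.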
§Left vs. Right Padding
Common wisdom in the ML community is to “pad-left” on natural language models like GPT-2; that is, to add padding tokens to the front of the input tokens until they fit a required input length.
However, this method “pads-right” by adding padding tokens to the end of the input tokens.
Right-padding works because GPT-2 never looks “ahead” (to the right) of its inputs: the output for each token depends only on that token and the tokens to its left, so the right-padding does not influence the inference results of any tokens to its left.
Conversely, left-padding on GPT-2 only works if an attention mask is used, which tells the GPT-2 model to ignore certain tokens (like the padding tokens). However, attention masking is slightly more complicated to implement (albeit more efficient); therefore, this implementation does not use it.
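The argument above can be illustrated with a toy causal function. The sketch below is not GPT-2: causal_outputs is a hypothetical stand-in in which the output at each position is the running sum of the tokens up to and including that position, so it shares only the "never looks ahead" property. The PAD_TOKEN value of 99 and the helpers right_pad and left_pad are likewise assumptions made for the example.

```rust
/// Hypothetical padding token ID, chosen nonzero so its effect is visible.
const PAD_TOKEN: i32 = 99;

/// Toy stand-in for a causal model (NOT GPT-2): the output at position `i`
/// is the sum of inputs 0..=i, so it never depends on later positions.
fn causal_outputs(tokens: &[i32]) -> Vec<i64> {
    let mut running = 0i64;
    tokens
        .iter()
        .map(|&t| {
            running += i64::from(t);
            running
        })
        .collect()
}

/// Appends PAD_TOKEN until the sequence has `len` entries.
fn right_pad(tokens: &[i32], len: usize) -> Vec<i32> {
    let mut out = tokens.to_vec();
    if out.len() < len {
        out.resize(len, PAD_TOKEN);
    }
    out
}

/// Prepends PAD_TOKEN until the sequence has `len` entries.
fn left_pad(tokens: &[i32], len: usize) -> Vec<i32> {
    let mut out = vec![PAD_TOKEN; len.saturating_sub(tokens.len())];
    out.extend_from_slice(tokens);
    out
}

fn main() {
    let real = [5, 7, 9];
    let unpadded = causal_outputs(&real);
    let right = causal_outputs(&right_pad(&real, 5));
    let left = causal_outputs(&left_pad(&real, 5));

    // Right-padding: outputs at the real-token positions are unchanged.
    assert_eq!(&right[..real.len()], &unpadded[..]);
    // Left-padding without a mask: the padding leaks into every real-token output.
    assert_ne!(&left[2..], &unpadded[..]);

    println!("unpadded:     {:?}", unpadded);
    println!("right-padded: {:?}", right);
    println!("left-padded:  {:?}", left);
}
```

In this toy setting, the outputs at the real-token positions survive right-padding untouched, while left-padding changes all of them unless the padding positions are masked out.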
Auto Trait Implementations§
impl Freeze for Tokenizer
impl RefUnwindSafe for Tokenizer
impl Send for Tokenizer
impl Sync for Tokenizer
impl Unpin for Tokenizer
impl UnwindSafe for Tokenizer
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> Downcast for T
where
    T: Any,
fn into_any(self: Box<T>) -> Box<dyn Any>
Convert Box<dyn Trait> (where Trait: Downcast) to Box<dyn Any>. Box<dyn Any> can then be further downcast into Box<ConcreteType> where ConcreteType implements Trait.
fn into_any_rc(self: Rc<T>) -> Rc<dyn Any>
Convert Rc<Trait> (where Trait: Downcast) to Rc<Any>. Rc<Any> can then be further downcast into Rc<ConcreteType> where ConcreteType implements Trait.
fn as_any(&self) -> &(dyn Any + 'static)
Convert &Trait (where Trait: Downcast) to &Any. This is needed since Rust cannot generate &Any’s vtable from &Trait’s.
fn as_any_mut(&mut self) -> &mut (dyn Any + 'static)
Convert &mut Trait (where Trait: Downcast) to &mut Any. This is needed since Rust cannot generate &mut Any’s vtable from &mut Trait’s.
impl<T> DowncastSync for T
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.