Struct Tokenizer

pub struct Tokenizer { /* private fields */ }

A tokenizer that converts strings into token sequences consumable by the GPT-2 model, and vice versa.

This tokenizer loads its configuration from the original OpenAI GPT-2 encoder file and “byte-pair encoding” (BPE) vocabulary file.
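
As a rough illustration of what byte-pair encoding does, the sketch below applies a ranked list of merge rules to a single word. It is a simplified, hypothetical helper (`bpe_merge` and its merge table are assumptions, not part of this crate), and it omits the byte-to-unicode mapping and pre-tokenization that the GPT-2 scheme also performs.

```rust
/// Repeatedly merges the adjacent symbol pair with the best (lowest) rank.
/// `merges` is an ordered rule list: earlier entries have higher priority.
/// Hypothetical helper for illustration only.
fn bpe_merge(word: &str, merges: &[(&str, &str)]) -> Vec<String> {
    // Start from individual characters.
    let mut symbols: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the adjacent pair with the lowest rank, breaking ties by position.
        let best = symbols
            .windows(2)
            .enumerate()
            .filter_map(|(pos, pair)| {
                merges
                    .iter()
                    .position(|(a, b)| *a == pair[0] && *b == pair[1])
                    .map(|rank| (rank, pos))
            })
            .min();
        match best {
            Some((_, pos)) => {
                // Replace the pair with its concatenation.
                let merged = format!("{}{}", symbols[pos], symbols[pos + 1]);
                symbols[pos] = merged;
                symbols.remove(pos + 1);
            }
            // No rule applies to any adjacent pair: the word is fully merged.
            None => return symbols,
        }
    }
}
```

With the rules `("l", "l")`, `("h", "e")`, `("he", "ll")` in that order, the word `hello` is merged step by step into the two symbols `hell` and `o`. A real tokenizer would then look each symbol up in the encoder table to obtain its integer token.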

Implementations§

impl Tokenizer

pub fn new(bpe_path: &str, encoder_path: &str) -> Self

Creates a new in-memory tokenizer from the BPE file at bpe_path and the character encoding file at encoder_path.

pub fn encode_to_length(&self, text: &str, token_sequence_length: usize) -> (Vec<i32>, usize)

Encodes text into a token sequence, then truncates or “right-pads” the encoded token sequence to fit token_sequence_length, using PAD_TOKEN as the padding token.

The returned tuple contains (token_sequence, padding_length), where padding_length is the number of padding tokens in token_sequence. If the length of token_sequence before truncation exceeds token_sequence_length, padding_length will always be zero.
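
The truncate-or-pad behaviour described above can be sketched on an already encoded sequence. This is a minimal illustration, not the crate's implementation: `fit_to_length` is a hypothetical helper, the `PAD_TOKEN` value of 0 is an assumption (the real constant is defined by the crate), and it assumes truncation keeps the leading tokens.

```rust
/// Assumed padding token id for this sketch; the crate defines the real value.
const PAD_TOKEN: i32 = 0;

/// Truncates or right-pads `tokens` to exactly `target_len` entries.
/// Returns the fitted sequence and the number of padding tokens appended.
fn fit_to_length(mut tokens: Vec<i32>, target_len: usize) -> (Vec<i32>, usize) {
    // Padding is needed only when the input is shorter than the target.
    let padding_length = target_len.saturating_sub(tokens.len());
    // `resize` drops trailing tokens when the vector is too long and
    // appends PAD_TOKEN when it is too short.
    tokens.resize(target_len, PAD_TOKEN);
    (tokens, padding_length)
}
```

For example, fitting `[5, 6, 7]` to length 5 gives `([5, 6, 7, 0, 0], 2)`, while fitting it to length 2 gives `([5, 6], 0)`.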

§Left vs. Right Padding

Common wisdom in the ML community is to “pad-left” on natural language models like GPT-2; that is, to add padding tokens to the front of the input tokens until they fit the required input length.

However, this method “pads-right” by adding padding tokens to the end of the input tokens.

Right-padding works because GPT-2 never looks “ahead” (to the right) of its inputs, and so the right-padding will not influence the inference results of any tokens to the left.
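
The argument above can be checked on a toy model. The sketch below is not GPT-2: it is a hypothetical single-head causal attention over scalar values (the name `causal_attention` and the scoring rule are assumptions for illustration). It shows only that, when position `i` attends to positions `0..=i`, values appended on the right cannot change earlier outputs.

```rust
/// Toy single-head causal attention over scalar token values.
/// Position `i` attends only to positions `0..=i`, never to the right.
fn causal_attention(values: &[f64]) -> Vec<f64> {
    (0..values.len())
        .map(|i| {
            // Score of each visible position: query * key, both taken to be
            // the token's scalar value in this toy setup.
            let scores: Vec<f64> = values[..=i].iter().map(|k| values[i] * k).collect();
            // Numerically stable softmax over the visible positions.
            let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
            let weights: Vec<f64> = scores.iter().map(|s| (s - max).exp()).collect();
            let total: f64 = weights.iter().sum();
            // Weighted average of the visible values.
            weights.iter().zip(&values[..=i]).map(|(w, v)| w * v).sum::<f64>() / total
        })
        .collect()
}
```

Running this on `[0.5, -1.0, 2.0]` and on the right-padded `[0.5, -1.0, 2.0, 0.0, 0.0]` yields identical outputs at the first three positions, because the padded positions are never visible to them.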

Conversely, left-padding on GPT-2 only works if an attention mask is used, which tells the GPT-2 model to ignore certain tokens (like the padding tokens). However, attention masking is slightly more complicated to implement (albeit more efficient); therefore, this implementation does not use it.
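
For comparison, the left-padding alternative needs a mask alongside the tokens. The sketch below shows roughly what that would involve. It is hypothetical and not something this crate provides: `left_pad_with_mask`, the `PAD_TOKEN` value of 0, and the choice to keep the leading tokens on truncation are all assumptions.

```rust
/// Assumed padding token id for this sketch; the crate defines the real value.
const PAD_TOKEN: i32 = 0;

/// Left-pads (or truncates) `tokens` to `target_len` and builds a matching
/// attention mask: `false` marks padding to ignore, `true` marks real tokens.
fn left_pad_with_mask(tokens: &[i32], target_len: usize) -> (Vec<i32>, Vec<bool>) {
    // Keep at most `target_len` leading tokens.
    let kept = &tokens[..tokens.len().min(target_len)];
    let padding = target_len - kept.len();

    // Padding goes first, followed by the real tokens.
    let mut sequence = vec![PAD_TOKEN; padding];
    sequence.extend_from_slice(kept);

    // The mask mirrors the layout: padding positions first, then real ones.
    let mut mask = vec![false; padding];
    mask.resize(target_len, true);

    (sequence, mask)
}
```

The extra mask vector has to be passed to the model with every input, which is the additional bookkeeping that right-padding avoids.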

pub fn encode(&self, text: &str) -> Vec<i32>

Encodes text into a token sequence for consumption by the GPT-2 model.

pub fn decode(&self, token_sequence: Vec<i32>) -> String

Decodes token_sequence into text.

Auto Trait Implementations§

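impl Freeze for Tokenizer

impl RefUnwindSafe for Tokenizer

impl Send for Tokenizer

impl Sync for Tokenizer

impl Unpin for Tokenizer

impl UnwindSafe for Tokenizer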

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> Downcast for T
where T: Any,

fn into_any(self: Box<T>) -> Box<dyn Any>

Convert Box<dyn Trait> (where Trait: Downcast) to Box<dyn Any>. Box<dyn Any> can then be further downcast into Box<ConcreteType> where ConcreteType implements Trait.

fn into_any_rc(self: Rc<T>) -> Rc<dyn Any>

Convert Rc<Trait> (where Trait: Downcast) to Rc<Any>. Rc<Any> can then be further downcast into Rc<ConcreteType> where ConcreteType implements Trait.

fn as_any(&self) -> &(dyn Any + 'static)

Convert &Trait (where Trait: Downcast) to &Any. This is needed since Rust cannot generate &Any’s vtable from &Trait’s.

fn as_any_mut(&mut self) -> &mut (dyn Any + 'static)

Convert &mut Trait (where Trait: Downcast) to &mut Any. This is needed since Rust cannot generate &mut Any’s vtable from &mut Trait’s.

impl<T> DowncastSync for T
where T: Any + Send + Sync,

fn into_any_arc(self: Arc<T>) -> Arc<dyn Any + Send + Sync>

Convert Arc<Trait> (where Trait: Downcast) to Arc<Any>. Arc<Any> can then be further downcast into Arc<ConcreteType> where ConcreteType implements Trait.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V
