Tokenizer

Struct Tokenizer 

pub struct Tokenizer { /* private fields */ }

Tokenizes text inputs into sequences of token IDs that can be fed to a machine learning model.

Tokenizer wraps a Model, which implements a specific method of encoding individual sequences (e.g. WordPiece, Byte Pair Encoding, Unigram), and adds common functionality such as injecting special tokens, splitting sequences into overlapping chunks and truncating long sequences.
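The layering described above can be illustrated with a minimal, standard-library-only sketch. It does not use this crate's types: `PieceModel`, `ByteModel`, `Wrapper` and the special-token IDs are hypothetical names chosen for illustration, and the truncation policy (drop trailing tokens, always keep both special tokens) is an assumption.

```rust
/// Hypothetical stand-in for a model that turns text into token IDs.
trait PieceModel {
    fn encode_pieces(&self, text: &str) -> Vec<u32>;
}

/// Toy model: every byte of the input becomes one token ID (0..=255).
struct ByteModel;

impl PieceModel for ByteModel {
    fn encode_pieces(&self, text: &str) -> Vec<u32> {
        text.bytes().map(u32::from).collect()
    }
}

/// Hypothetical wrapper that adds special tokens and truncation
/// on top of whatever model it is given.
struct Wrapper<M: PieceModel> {
    model: M,
    start_id: u32,
    end_id: u32,
    /// Maximum output length, including the two special tokens.
    max_len: usize,
}

impl<M: PieceModel> Wrapper<M> {
    fn encode(&self, text: &str) -> Vec<u32> {
        let mut ids = self.model.encode_pieces(text);
        // Reserve two slots for the start and end tokens. If `max_len` is
        // below 2, the output still contains both special tokens.
        ids.truncate(self.max_len.saturating_sub(2));

        let mut out = Vec::with_capacity(ids.len() + 2);
        out.push(self.start_id);
        out.extend(ids);
        out.push(self.end_id);
        out
    }
}
```

With `Wrapper { model: ByteModel, start_id: 256, end_id: 257, max_len: 5 }`, encoding `"abcdef"` keeps the first three byte tokens and yields `[256, 97, 98, 99, 257]`.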

Implementations

impl Tokenizer

pub fn new<M: Model + 'static>(model: M, options: TokenizerOptions<'_>) -> Tokenizer

Create a new tokenizer which wraps the given model.

pub fn with_normalizer(self, normalizer: Box<dyn Normalizer>) -> Self

Configure the normalizer used by this tokenizer.

pub fn with_pre_tokenizer(self, pre_tokenizer: Box<dyn PreTokenizer>) -> Self

Configure the pre-tokenizer used by this tokenizer.

pub fn from_file<P: AsRef<Path>>(path: P) -> Result<Tokenizer, FromJsonError>

Load a tokenizer from a Hugging Face tokenizer.json file at the given path.

pub fn from_json(json: &str) -> Result<Tokenizer, FromJsonError>

Load a tokenizer from the JSON contents of a Hugging Face tokenizer.json file, passed as a string.

pub fn encoder(&self) -> &dyn Model

Deprecated: encoder was renamed to model.

pub fn model(&self) -> &dyn Model

Return the model used to convert string pieces to token IDs.

pub fn get_token_id(&self, text: &str) -> Result<TokenId, TokenizerError>

Return the ID of a token given its canonical string representation.

This is usually used for looking up the IDs of special/added tokens.

This wraps Model::get_token_id but returns a Result rather than an Option, on the assumption that the caller expects the token to exist.
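The Option-to-Result wrapping described above can be sketched with the standard library alone. This is an illustration of the pattern, not this crate's implementation: `Vocab`, `LookupError`, `require_token_id` and the `u32` alias for `TokenId` are all hypothetical.

```rust
use std::collections::HashMap;

/// Assumed representation of a token ID for this sketch.
type TokenId = u32;

/// Hypothetical error type; the crate's TokenizerError is not modelled here.
#[derive(Debug, PartialEq)]
enum LookupError {
    UnknownToken(String),
}

/// Hypothetical vocabulary mapping token strings to IDs.
struct Vocab {
    ids: HashMap<String, TokenId>,
}

impl Vocab {
    /// Assign IDs in the order the tokens are listed.
    fn new(tokens: &[&str]) -> Vocab {
        let ids = tokens
            .iter()
            .enumerate()
            .map(|(i, t)| (t.to_string(), i as TokenId))
            .collect();
        Vocab { ids }
    }

    /// Model-style lookup: a missing token is an ordinary outcome.
    fn get_token_id(&self, text: &str) -> Option<TokenId> {
        self.ids.get(text).copied()
    }

    /// Tokenizer-style lookup: a missing token is reported as an error,
    /// so callers can propagate it with `?`.
    fn require_token_id(&self, text: &str) -> Result<TokenId, LookupError> {
        self.get_token_id(text)
            .ok_or_else(|| LookupError::UnknownToken(text.to_string()))
    }
}
```

Returning a Result suits lookups of special tokens such as `[SEP]`, where absence usually means the wrong vocabulary was loaded rather than a case the caller wants to branch on.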

pub fn encode<'a, I: Into<EncoderInput<'a>>>(&self, input: I, options: Option<EncodeOptions>) -> Result<Encoded<'a>, TokenizerError>

Encode one or two sequences into a sequence of tokens.

The input can be a &str or a (&str, &str) tuple.

In addition to token IDs, the result also includes information about the corresponding offsets in the source text.
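To illustrate what keeping source offsets alongside tokens means, here is a small sketch that splits text on whitespace and records where each piece starts. It is not how this crate tokenizes: `words_with_offsets` is a hypothetical helper, and the use of byte offsets is an assumption, since the text above does not say whether offsets are in bytes or characters.

```rust
/// Split `text` on whitespace, returning each piece together with the
/// byte offset at which it starts in the original string.
fn words_with_offsets(text: &str) -> Vec<(usize, &str)> {
    let mut out = Vec::new();
    // Byte offset where the current word began, if we are inside a word.
    let mut start: Option<usize> = None;

    for (i, ch) in text.char_indices() {
        if ch.is_whitespace() {
            // A word just ended: emit it with its start offset.
            if let Some(s) = start.take() {
                out.push((s, &text[s..i]));
            }
        } else if start.is_none() {
            start = Some(i);
        }
    }
    // Emit a trailing word that runs to the end of the input.
    if let Some(s) = start {
        out.push((s, &text[s..]));
    }
    out
}
```

Because each piece carries its start offset, a caller can map a token back to the exact span of the input it came from, for example to highlight an answer span in the original text.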

pub fn encode_chunks<'a>(&self, input: EncoderInput<'a>, options: EncodeOptions) -> Result<Vec<Encoded<'a>>, TokenizerError>

Encode one or two sequences into one or more chunks of tokens.

The output is split into chunks such that the number of tokens in each chunk is less than the limit specified in EncodeOptions.
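The chunking idea can be sketched on a plain slice of token IDs. This is a simplified illustration, not this crate's algorithm: `chunk_with_overlap` and its `max_len` and `overlap` parameters are hypothetical, special tokens are ignored, and the limit is treated as inclusive (each chunk holds at most `max_len` tokens), which is an assumption.

```rust
/// Split `ids` into chunks of at most `max_len` tokens, where each chunk
/// after the first repeats the last `overlap` tokens of the previous one.
fn chunk_with_overlap(ids: &[u32], max_len: usize, overlap: usize) -> Vec<Vec<u32>> {
    assert!(max_len > 0, "max_len must be positive");
    assert!(overlap < max_len, "overlap must be smaller than max_len");

    // Each new chunk starts this many tokens after the previous one.
    let stride = max_len - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;

    while start < ids.len() {
        let end = (start + max_len).min(ids.len());
        chunks.push(ids[start..end].to_vec());
        if end == ids.len() {
            // The final chunk reached the end of the input.
            break;
        }
        start += stride;
    }
    chunks
}
```

For seven tokens with `max_len = 4` and `overlap = 1`, this produces two chunks that share one token: `[1, 2, 3, 4]` and `[4, 5, 6, 7]`. Overlap keeps some shared context between neighbouring chunks, which helps when an item of interest straddles a chunk boundary.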

pub fn decode(&self, ids: &[TokenId]) -> Result<String, TokenizerError>

Decode a sequence of token IDs to a text string.

For tokenizers which operate on byte sequences (e.g. Bpe), this can fail if the token IDs don’t correspond to a complete UTF-8 sequence. In that case, the solution is to accumulate more token IDs and then retry decoding.

Special tokens are decoded into their canonical string representations as returned by Model::get_token_str.
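The accumulate-and-retry strategy for incomplete UTF-8 can be sketched with the standard library. To stay self-contained, this example buffers the raw bytes of each token rather than calling decode with a growing list of token IDs; `StreamDecoder` and `push` are hypothetical names, not part of this crate.

```rust
use std::str::Utf8Error;

/// Buffers bytes from successive tokens until they form valid UTF-8.
struct StreamDecoder {
    pending: Vec<u8>,
}

impl StreamDecoder {
    fn new() -> Self {
        StreamDecoder { pending: Vec::new() }
    }

    /// Append the bytes of one token.
    ///
    /// Returns `Ok(Some(text))` once the buffered bytes form complete UTF-8,
    /// `Ok(None)` if the buffer ends in the middle of a character (more
    /// tokens are needed), and `Err` if the bytes can never become valid.
    fn push(&mut self, token_bytes: &[u8]) -> Result<Option<String>, Utf8Error> {
        self.pending.extend_from_slice(token_bytes);

        // Copy the decoded text out so the buffer can be cleared afterwards.
        let result = std::str::from_utf8(&self.pending).map(|s| s.to_owned());
        match result {
            Ok(text) => {
                self.pending.clear();
                Ok(Some(text))
            }
            // `error_len()` is `None` when the input merely ended too early.
            Err(e) if e.error_len().is_none() => Ok(None),
            Err(e) => Err(e),
        }
    }
}
```

In a text-generation loop, the same pattern applies to token IDs: keep the IDs that failed to decode, append the next generated ID, and call decode again on the accumulated slice.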

Blanket Implementations

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.