pub struct Tokenizer { /* private fields */ }
Tokenizes text inputs into sequences of token IDs that can be fed to a machine learning model.
Tokenizer wraps a Model, which implements a specific encoding scheme for
individual sequences (e.g. WordPiece, Byte Pair Encoding, Unigram), and adds
common functionality such as injecting special tokens, splitting sequences
into overlapping chunks, and truncating long sequences.
Implementations
impl Tokenizer
pub fn new<M: Model + 'static>(
    model: M,
    options: TokenizerOptions<'_>,
) -> Tokenizer
Create a new tokenizer which wraps the given model.
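A minimal construction sketch. Here wordpiece stands for a hypothetical value implementing Model (for example a WordPiece model built from a vocabulary), and TokenizerOptions is assumed to provide a default configuration; see the Model implementations and the TokenizerOptions docs for the actual constructors.

// `wordpiece` is a hypothetical value implementing the `Model` trait.
// Assumption: `TokenizerOptions` offers a default configuration.
let tokenizer = Tokenizer::new(wordpiece, TokenizerOptions::default());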
pub fn with_normalizer(self, normalizer: Box<dyn Normalizer>) -> Self
Configure the normalizer used by this tokenizer.
pub fn with_pre_tokenizer(self, pre_tokenizer: Box<dyn PreTokenizer>) -> Self
Configure the pre-tokenizer used by this tokenizer.
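Both builder methods consume self and return Self, so they can be chained after new. A sketch, where my_normalizer and my_pre_tokenizer stand for hypothetical values implementing Normalizer and PreTokenizer:

// `my_normalizer` and `my_pre_tokenizer` are hypothetical values
// implementing the `Normalizer` and `PreTokenizer` traits.
let tokenizer = Tokenizer::new(model, options)
    .with_normalizer(Box::new(my_normalizer))
    .with_pre_tokenizer(Box::new(my_pre_tokenizer));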
pub fn from_file<P: AsRef<Path>>(path: P) -> Result<Tokenizer, FromJsonError>
Load a tokenizer from a Hugging Face tokenizer.json file at the given path.
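For example, a sketch that loads a tokenizer from a local file (the path is illustrative, and the ? operator assumes a surrounding function that can propagate FromJsonError):

// Read and parse a tokenizer.json file at the given path.
let tokenizer = Tokenizer::from_file("models/tokenizer.json")?;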
pub fn from_json(json: &str) -> Result<Tokenizer, FromJsonError>
Load a tokenizer from the contents of a Hugging Face tokenizer.json
file.
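This is the string-based counterpart to from_file. A sketch that reads the file manually and then parses its contents (error propagation via ? assumes a surrounding function whose error type covers both the I/O and parse errors):

// Read the JSON ourselves, then parse it into a Tokenizer.
let json = std::fs::read_to_string("models/tokenizer.json")?;
let tokenizer = Tokenizer::from_json(&json)?;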
pub fn encoder(&self) -> &dyn Model
Deprecated: encoder was renamed to model.
pub fn get_token_id(&self, text: &str) -> Result<TokenId, TokenizerError>
Return the ID of a token given its canonical string representation.
This is usually used for looking up the IDs of special/added tokens.
This wraps Model::get_token_id but returns a Result rather than
an Option, for use when the token is expected to exist.
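For example, looking up the ID of a special token (the token string is illustrative and depends on the loaded vocabulary):

// Returns a TokenizerError if "[CLS]" is not in the vocabulary.
let cls_id = tokenizer.get_token_id("[CLS]")?;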
pub fn encode<'a, I: Into<EncoderInput<'a>>>(
    &self,
    input: I,
    options: Option<EncodeOptions>,
) -> Result<Encoded<'a>, TokenizerError>
Encode one or two sequences into a sequence of tokens.
The input can be a single &str or a (&str, &str) tuple of two sequences.
In addition to token IDs, the result includes the offsets of each token in the source text.
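A sketch encoding a single sequence and a pair of sequences with default options. How the token IDs are read back out of the result is an assumption here (a token_ids accessor); check the Encoded docs for the actual API.

// Single sequence, default options.
let encoded = tokenizer.encode("What is the capital of France?", None)?;
// Two sequences, e.g. a (question, context) pair for extractive QA.
let pair = tokenizer.encode(
    ("What is the capital of France?", "Paris is the capital of France."),
    None,
)?;
// Assumption: `Encoded` exposes the generated IDs via an accessor.
let ids = encoded.token_ids();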
pub fn encode_chunks<'a>(
    &self,
    input: EncoderInput<'a>,
    options: EncodeOptions,
) -> Result<Vec<Encoded<'a>>, TokenizerError>
Encode one or two sequences into a sequence of tokens.
The output is split into chunks such that the number of tokens in
each chunk is less than the limit specified in EncodeOptions.
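A sketch for chunked encoding of a long document. The fields of EncodeOptions are not shown here; it is assumed to provide a default configuration that carries the per-chunk token limit, so check the EncodeOptions docs for the real fields.

// Assumption: EncodeOptions implements Default and holds the
// per-chunk token limit used to split the output.
let options = EncodeOptions::default();
// `long_document` is an &str; &str converts into EncoderInput via Into.
let chunks = tokenizer.encode_chunks(long_document.into(), options)?;
// Each element of `chunks` is an independent Encoded sequence that
// fits within the configured limit.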
pub fn decode(&self, ids: &[TokenId]) -> Result<String, TokenizerError>
Decode a sequence of token IDs to a text string.
For tokenizers which operate on byte sequences (e.g. Bpe) this can
fail if the token IDs don’t correspond to a complete UTF-8 sequence.
In that case, accumulate more token IDs and retry decoding.
Special tokens are decoded into their canonical string representations
as returned by Model::get_token_str.
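A round-trip sketch: encode a string, then decode the IDs back to text. As above, the token_ids accessor on Encoded is an assumption.

let encoded = tokenizer.encode("Hello world", None)?;
// Assumption: `Encoded::token_ids` returns the IDs as a &[TokenId].
// Decoding can fail for byte-level tokenizers if the IDs end part-way
// through a UTF-8 sequence; accumulate more IDs and retry in that case.
let text = tokenizer.decode(encoded.token_ids())?;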