Module tokenizer


Represents a tokenization pipeline.

A Tokenizer is composed of some of the following parts.

  • Normalizer: Takes care of the text normalization (like Unicode normalization).
  • PreTokenizer: Takes care of the pre-tokenization (i.e. how to split the text into tokens and pre-process them).
  • Model: A model encapsulates the tokenization algorithm (like BPE, word-based, character-based, …).
  • PostProcessor: Takes care of the processing after tokenization (like truncating, padding, …).
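
The four stages listed above can be pictured as a chain of plain functions. The following is a minimal, standard-library-only sketch, not the crate's API: the function names, the lowercasing normalizer, the whitespace pre-tokenizer, the word-level lookup, and the ids chosen for unknown and padding tokens are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Normalizer stage (toy): lowercase the text as a stand-in for Unicode normalization.
fn normalize(text: &str) -> String {
    text.to_lowercase()
}

/// PreTokenizer stage (toy): split the normalized text on whitespace.
fn pre_tokenize(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

/// Model stage (toy): word-level lookup; words missing from `vocab` map to `unk_id`.
fn model_tokenize(words: &[&str], vocab: &HashMap<&str, u32>, unk_id: u32) -> Vec<u32> {
    words
        .iter()
        .map(|w| vocab.get(*w).copied().unwrap_or(unk_id))
        .collect()
}

/// PostProcessor stage (toy): truncate to `max_len`, then pad with `pad_id`.
fn post_process(mut ids: Vec<u32>, max_len: usize, pad_id: u32) -> Vec<u32> {
    ids.truncate(max_len);
    ids.resize(max_len, pad_id);
    ids
}

/// Run the four stages in order. The ids 1 (unknown) and 0 (padding) are arbitrary choices.
fn encode(text: &str, vocab: &HashMap<&str, u32>, max_len: usize) -> Vec<u32> {
    let normalized = normalize(text);
    let words = pre_tokenize(&normalized);
    let ids = model_tokenize(&words, vocab, 1);
    post_process(ids, max_len, 0)
}
```

With a vocabulary that maps "hello" to 2 and "world" to 3, calling `encode("Hello  WORLD again", &vocab, 4)` runs each stage once and returns a fixed-length vector of ids. Keeping the stages separate is what lets each one be swapped out independently.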

Re-exports§

pub use crate::decoders::DecoderWrapper;
pub use crate::models::ModelWrapper;
pub use crate::normalizers::NormalizerWrapper;
pub use crate::pre_tokenizers::PreTokenizerWrapper;
pub use crate::processors::PostProcessorWrapper;
pub use crate::utils::iter::LinesWithEnding;
pub use crate::utils::padding::pad_encodings;
pub use crate::utils::padding::PaddingDirection;
pub use crate::utils::padding::PaddingParams;
pub use crate::utils::padding::PaddingStrategy;
pub use crate::utils::truncation::truncate_encodings;
pub use crate::utils::truncation::TruncationDirection;
pub use crate::utils::truncation::TruncationParams;
pub use crate::utils::truncation::TruncationStrategy;
pub use normalizer::NormalizedString;
pub use normalizer::OffsetReferential;
pub use normalizer::SplitDelimiterBehavior;
pub use pre_tokenizer::*;

Modules§

normalizer
pattern
pre_tokenizer

Structs§

AddedToken
Represents a token added by the user on top of the existing Model vocabulary. An AddedToken can be configured to specify the behavior it should have in various situations.
AddedVocabulary
A vocabulary built on top of the Model
BuilderError
DecodeStream
DecodeStream will keep the state necessary to produce individual chunks of strings given an input stream of token_ids.
Encoding
Represents the output of a Tokenizer.
Token
Tokenizer
TokenizerBuilder
Builder for Tokenizer structs.
TokenizerImpl
A Tokenizer is capable of encoding/decoding any text.
TruncationParamError
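
To illustrate why a streaming decoder such as DecodeStream needs to keep state between calls, here is a hedged, standard-library-only sketch. It is not the crate's implementation: `ToyDecodeStream`, its byte-level vocabulary, and the rule "emit a chunk once the buffered bytes are valid UTF-8" are assumptions chosen to show the idea that a single token id may not yet correspond to a printable string.

```rust
/// Toy streaming decoder (hypothetical): each token id indexes a raw byte sequence.
struct ToyDecodeStream<'a> {
    vocab: &'a [&'a [u8]],
    /// Bytes received so far that do not yet form valid UTF-8 on their own.
    pending: Vec<u8>,
}

impl<'a> ToyDecodeStream<'a> {
    fn new(vocab: &'a [&'a [u8]]) -> Self {
        Self { vocab, pending: Vec::new() }
    }

    /// Feed one token id. Returns a chunk once the buffered bytes decode cleanly,
    /// or `None` if more ids are needed (or the id is not in the vocabulary).
    /// Note: this toy never recovers from genuinely invalid byte sequences.
    fn step(&mut self, id: usize) -> Option<String> {
        let bytes = self.vocab.get(id)?;
        self.pending.extend_from_slice(bytes);
        match std::str::from_utf8(&self.pending) {
            Ok(s) => {
                let chunk = s.to_owned();
                self.pending.clear();
                Some(chunk)
            }
            Err(_) => None,
        }
    }
}
```

In this sketch, a character whose UTF-8 encoding is split across two token ids produces `None` on the first call to `step` and the full character on the second, which is the kind of situation the retained state is there to handle.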

Enums§

DecodeStreamError
EncodeInput
InputSequence
ProcessorError

Traits§

Decoder
A Decoder changes the raw tokens into their more readable form.
Model
Represents a model used during tokenization (like BPE, Word, or Unigram).
Normalizer
Takes care of pre-processing strings.
PostProcessor
A PostProcessor has the responsibility to post process an encoded output of the Tokenizer. It adds any special tokens that a language model would require.
PreTokenizer
The PreTokenizer is in charge of the pre-segmentation step. It splits the given string into multiple substrings, keeping track of the offsets of those substrings from the NormalizedString. On some occasions, the PreTokenizer might need to modify the given NormalizedString to ensure we can entirely keep track of the offsets and the mapping with the original string.
Trainer
A Trainer has the responsibility to train a model. We feed it with lines/sentences and then it can train the given Model.
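
The offset tracking described for the PreTokenizer can be sketched with the standard library alone. This is an illustrative assumption, not the trait's actual interface: the hypothetical `split_with_offsets` splits on whitespace and reports each piece together with its `(start, end)` byte offsets into the input, so every piece can be mapped back to the original string.

```rust
/// Split `text` on whitespace, returning each piece with its (start, end)
/// byte offsets into `text`. The end offset is exclusive.
fn split_with_offsets(text: &str) -> Vec<(&str, (usize, usize))> {
    let mut pieces = Vec::new();
    // Byte index where the current piece began, if we are inside one.
    let mut start: Option<usize> = None;

    for (i, ch) in text.char_indices() {
        if ch.is_whitespace() {
            // Whitespace closes the current piece, if any.
            if let Some(s) = start.take() {
                pieces.push((&text[s..i], (s, i)));
            }
        } else if start.is_none() {
            start = Some(i);
        }
    }

    // Flush a piece that runs to the end of the input.
    if let Some(s) = start {
        pieces.push((&text[s..], (s, text.len())));
    }
    pieces
}
```

Because the offsets are byte positions, slicing the input with `&text[start..end]` gives back exactly the piece that was reported, including for multi-byte characters.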

Functions§

step_decode_stream
Internal function exposed only to bypass Python limitations.

Type Aliases§

Error
Offsets
Result