Module tokenizers::tokenizer
Represents a tokenization pipeline.

A Tokenizer is composed of some of the following parts:

- Normalizer: Takes care of text normalization (like unicode normalization).
- PreTokenizer: Takes care of pre-tokenization (i.e., how to split tokens and pre-process them).
- Model: Encapsulates the tokenization algorithm (like BPE, word-based, character-based, …).
- PostProcessor: Takes care of processing after tokenization (like truncating, padding, …).
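The way these stages compose can be sketched with a minimal, stdlib-only toy (illustrative only: the function names and the stand-in logic here are assumptions, not the crate's actual API — each stage simply transforms the text one step further):

```rust
// Toy pipeline sketch. Each stage mirrors one component of a Tokenizer.

fn normalize(text: &str) -> String {
    // Normalizer: lowercasing as a stand-in for unicode normalization.
    text.to_lowercase()
}

fn pre_tokenize(text: &str) -> Vec<String> {
    // PreTokenizer: split on whitespace.
    text.split_whitespace().map(str::to_string).collect()
}

fn model(pieces: &[String]) -> Vec<u32> {
    // Model: map each piece to an id (here, its length, as a toy
    // stand-in for a real algorithm like BPE).
    pieces.iter().map(|p| p.len() as u32).collect()
}

fn post_process(mut ids: Vec<u32>) -> Vec<u32> {
    // PostProcessor: wrap the sequence in special-token ids.
    ids.insert(0, 101); // [CLS]-like
    ids.push(102);      // [SEP]-like
    ids
}

fn main() {
    let ids = post_process(model(&pre_tokenize(&normalize("Hello World"))));
    println!("{:?}", ids); // [101, 5, 5, 102]
}
```

In the real crate each stage is a trait (Normalizer, PreTokenizer, Model, PostProcessor) implemented by the wrappers re-exported below, but the data flow is the same: text in, ids out.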
Re-exports
pub use crate::decoders::DecoderWrapper;
pub use crate::models::ModelWrapper;
pub use crate::normalizers::NormalizerWrapper;
pub use crate::pre_tokenizers::PreTokenizerWrapper;
pub use crate::processors::PostProcessorWrapper;
pub use crate::utils::iter::LinesWithEnding;
pub use crate::utils::padding::pad_encodings;
pub use crate::utils::padding::PaddingDirection;
pub use crate::utils::padding::PaddingParams;
pub use crate::utils::padding::PaddingStrategy;
pub use crate::utils::truncation::truncate_encodings;
pub use crate::utils::truncation::TruncationParams;
pub use crate::utils::truncation::TruncationStrategy;
pub use normalizer::NormalizedString;
pub use normalizer::OffsetReferential;
pub use normalizer::SplitDelimiterBehavior;
pub use pre_tokenizer::*;
Modules
normalizer
pre_tokenizer
Structs
Represents a token added by the user on top of the existing Model vocabulary. An AddedToken can be configured to specify the behavior it should have in various situations.
Represents the output of a Tokenizer.
Builder for Tokenizer structs.
A Tokenizer is capable of encoding/decoding any text.
Enums
Traits
A Decoder has the responsibility to merge the given Vec<String> into a single String.
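A minimal stdlib-only sketch of that responsibility (illustrative, not the crate's Decoder trait; the "##" continuation prefix is a WordPiece-style convention chosen here as an example):

```rust
// Toy decoder: glue decoded pieces back into one String, joining
// "##"-prefixed pieces to their predecessor instead of inserting a space.

fn decode(pieces: Vec<String>) -> String {
    let mut out = String::new();
    for piece in pieces {
        if let Some(rest) = piece.strip_prefix("##") {
            out.push_str(rest); // continuation of the previous piece
        } else {
            if !out.is_empty() {
                out.push(' ');
            }
            out.push_str(&piece);
        }
    }
    out
}

fn main() {
    let pieces = vec!["token".to_string(), "##izer".to_string(), "rocks".to_string()];
    println!("{}", decode(pieces)); // tokenizer rocks
}
```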
Represents a model used during tokenization (like BPE, Word, or Unigram).
Takes care of pre-processing strings.
A PostProcessor has the responsibility to post-process an encoded output of the Tokenizer. It adds any special tokens that a language model would require.
The PreTokenizer is in charge of the pre-segmentation step. It splits the given string into multiple substrings, keeping track of the offsets of said substrings from the NormalizedString. On some occasions, the PreTokenizer might need to modify the given NormalizedString to ensure we can entirely keep track of the offsets and the mapping with the original string.
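The offset-tracking requirement can be sketched with a stdlib-only toy (illustrative only, not the crate's PreTokenizer trait): every substring is returned together with its byte range in the original string.

```rust
// Toy pre-tokenizer: split on whitespace while recording each
// substring's (start, end) byte offsets into the source text.

fn pre_tokenize(text: &str) -> Vec<(&str, (usize, usize))> {
    let mut out = Vec::new();
    let mut start = None;
    for (i, c) in text.char_indices() {
        if c.is_whitespace() {
            if let Some(s) = start.take() {
                out.push((&text[s..i], (s, i)));
            }
        } else if start.is_none() {
            start = Some(i); // first byte of a new substring
        }
    }
    if let Some(s) = start {
        out.push((&text[s..], (s, text.len()))); // trailing substring
    }
    out
}

fn main() {
    for (piece, (s, e)) in pre_tokenize("the quick fox") {
        println!("{piece:?} at {s}..{e}");
    }
}
```

Keeping offsets is what later allows an Encoding to map every token back to its exact span in the original input.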
A Trainer has the responsibility to train a model. We feed it with lines/sentences and then it can train the given Model.
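The feeding step can be sketched with a stdlib-only toy (illustrative, not the crate's Trainer trait; a real trainer would run an algorithm like BPE merges over these counts rather than return them directly):

```rust
use std::collections::HashMap;

// Toy trainer: consume lines/sentences and accumulate the word counts
// from which a model's vocabulary would be built.

fn train(lines: &[&str]) -> HashMap<String, u32> {
    let mut counts: HashMap<String, u32> = HashMap::new();
    for line in lines {
        for word in line.split_whitespace() {
            *counts.entry(word.to_string()).or_insert(0) += 1;
        }
    }
    counts
}

fn main() {
    let vocab = train(&["a b a", "b c"]);
    println!("{:?}", vocab.get("a")); // Some(2)
}
```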