Module tokenizers::tokenizer
Represents a tokenization pipeline.

A Tokenizer is composed of some of the following parts:

Normalizer: Takes care of text normalization (like Unicode normalization).
PreTokenizer: Takes care of pre-tokenization (i.e. how to split the input into tokens and pre-process them).
Model: Encapsulates the tokenization algorithm (like BPE, word-based, character-based, ...).
PostProcessor: Takes care of processing after tokenization (like truncating, padding, ...).
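The data flow through these four stages can be sketched with plain functions. This is a minimal illustration of the pipeline concept only, not the crate's actual API; every function name, the toy vocabulary, and the special-token ids below are hypothetical:

```rust
use std::collections::HashMap;

// Normalizer stage: clean up the raw text (here: trim and lowercase).
fn normalize(text: &str) -> String {
    text.trim().to_lowercase()
}

// PreTokenizer stage: split the normalized text into word pieces.
fn pre_tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(str::to_string).collect()
}

// Model stage: map each piece to an id from a vocabulary (unknown -> 0).
fn model(pieces: &[String], vocab: &HashMap<&str, u32>) -> Vec<u32> {
    pieces
        .iter()
        .map(|p| *vocab.get(p.as_str()).unwrap_or(&0))
        .collect()
}

// PostProcessor stage: wrap the ids with special tokens
// (here: 1 stands in for [CLS], 2 for [SEP]).
fn post_process(ids: Vec<u32>) -> Vec<u32> {
    let mut out = vec![1];
    out.extend(ids);
    out.push(2);
    out
}

// The full pipeline: normalize -> pre-tokenize -> model -> post-process.
fn encode(text: &str, vocab: &HashMap<&str, u32>) -> Vec<u32> {
    let normalized = normalize(text);
    let pieces = pre_tokenize(&normalized);
    post_process(model(&pieces, vocab))
}

fn main() {
    let vocab: HashMap<&str, u32> = [("hello", 3), ("world", 4)].into_iter().collect();
    println!("{:?}", encode("  Hello WORLD  ", &vocab)); // [1, 3, 4, 2]
}
```

In the real crate each stage is a trait object held by the Tokenizer, so individual stages can be swapped independently; the sketch fixes one concrete choice per stage to keep the flow visible.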
Re-exports

pub use crate::utils::pad_encodings;
pub use crate::utils::truncate_encodings;
pub use crate::utils::PaddingParams;
pub use crate::utils::PaddingStrategy;
pub use crate::utils::TruncationParams;
pub use crate::utils::TruncationStrategy;
Structs

AddedToken
Encoding | Represents the output of a Tokenizer.
NormalizedString | A normalized string takes care of keeping both versions of a string (original and normalized), with alignments between them.
Token
Tokenizer | A Tokenizer is capable of encoding and decoding any text.
Enums

EncodeInput
PaddingDirection | The various possible padding directions.
Traits

Decoder | A Decoder has the responsibility to merge the raw tokens back into a readable string.
Model | Represents a model used during tokenization (like BPE, Word, or Unigram).
Normalizer | Takes care of pre-processing strings.
PostProcessor | A PostProcessor has the responsibility to post-process the encoded output of the Tokenizer.
PreTokenizer | Takes care of pre-tokenizing strings before they go to the model.
Trainer | A Trainer has the responsibility to train a Model.
Type Definitions

Offsets
Result