Module tokenizers::tokenizer
Represents a tokenization pipeline.

A Tokenizer is composed of some of the following parts:

Normalizer: Takes care of text normalization (like Unicode normalization).
PreTokenizer: Takes care of pre-tokenization (i.e. how to split the input into tokens and pre-process them).
Model: Encapsulates the tokenization algorithm itself (like BPE, word-based, character-based, ...).
PostProcessor: Takes care of the processing after tokenization (like truncating, padding, ...).
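Assembled on a Tokenizer, these parts run in the order above at encode time. The sketch below follows the crate's README quickstart and assumes a recent release; constructor and setter names (BPE::from_file, with_normalizer, the two-argument encode, ...) have varied across versions, so treat it as illustrative rather than exact.

use tokenizers::models::bpe::BPE;
use tokenizers::normalizers::unicode::NFC;
use tokenizers::pre_tokenizers::whitespace::Whitespace;
use tokenizers::tokenizer::{Result, Tokenizer};

fn main() -> Result<()> {
    // Model: the tokenization algorithm itself (here BPE, loaded from vocab files).
    let bpe = BPE::from_file("./vocab.json", "./merges.txt")
        .unk_token("[UNK]".into())
        .build()?;

    let mut tokenizer = Tokenizer::new(bpe);
    // Normalizer: Unicode NFC normalization, applied before anything else.
    tokenizer.with_normalizer(NFC);
    // PreTokenizer: split the normalized input on whitespace.
    tokenizer.with_pre_tokenizer(Whitespace {});

    // No PostProcessor is attached here, so we pass `false` for add_special_tokens.
    let encoding = tokenizer.encode("Hey there!", false)?;
    println!("{:?}", encoding.get_tokens());
    Ok(())
}

A PostProcessor (e.g. one wrapping the sequence in special tokens) would be attached the same way, via with_post_processor.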
Re-exports
pub use crate::utils::padding::pad_encodings;
pub use crate::utils::padding::PaddingDirection;
pub use crate::utils::padding::PaddingParams;
pub use crate::utils::padding::PaddingStrategy;
pub use crate::utils::truncation::truncate_encodings;
pub use crate::utils::truncation::TruncationParams;
pub use crate::utils::truncation::TruncationStrategy;
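These re-exports cover batch padding and truncation of Encodings. A sketch of enabling fixed-length padding on a tokenizer, assuming PaddingParams implements Default and with_padding takes an Option, as in recent releases:

use tokenizers::models::bpe::BPE;
use tokenizers::tokenizer::{PaddingParams, PaddingStrategy, Tokenizer};

fn main() {
    let mut tokenizer = Tokenizer::new(BPE::default());
    // Pad every encoding produced by encode_batch to exactly 16 tokens,
    // keeping the defaults for pad_id, pad_token and direction.
    tokenizer.with_padding(Some(PaddingParams {
        strategy: PaddingStrategy::Fixed(16),
        ..Default::default()
    }));
}

Truncation is configured analogously through with_truncation and TruncationParams.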
Structs
AddedToken
Encoding | Represents the output of a Tokenizer.
NormalizedString | A NormalizedString takes care of processing an "original" string to modify it and obtain a "normalized" string.
Token
Tokenizer | A Tokenizer is capable of encoding/decoding any text.
Enums
EncodeInput
Range | Represents a Range usable by the NormalizedString to index its content. A Range can use indices relative to either the original or the normalized string.
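For instance, a sketch assuming the From<&str> impl, the lowercase transform, and the get_range/get_range_original accessors on NormalizedString (names taken from this crate, though signatures may differ between releases):

use tokenizers::tokenizer::{NormalizedString, Range};

fn main() {
    // A NormalizedString keeps the original text, the normalized text,
    // and the alignments between the two.
    let mut ns = NormalizedString::from("Hello There");
    ns.lowercase();

    // Index the normalized content with normalized-relative indices...
    assert_eq!(ns.get_range(Range::Normalized(0..5)), Some("hello"));
    // ...or map the same normalized range back onto the original string.
    assert_eq!(ns.get_range_original(Range::Normalized(0..5)), Some("Hello"));
}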
Traits
Decoder | A Decoder changes the raw tokens into its more readable form.
Model | Represents a model used during Tokenization (like BPE or Word or Unigram).
Normalizer | Takes care of pre-processing strings.
PostProcessor | A PostProcessor has the responsibility to post process an encoded output of the Tokenizer.
PreTokenizer | The PreTokenizer is in charge of doing the pre-segmentation step.
Trainer | A Trainer has the responsibility to train a model.
Functions
get_range_of | Returns a range of the given string slice, by indexing chars instead of bytes.
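For instance (a sketch; the signature is assumed here to be get_range_of(s: &str, range: impl RangeBounds<usize>) -> Option<&str>):

use tokenizers::tokenizer::get_range_of;

fn main() {
    // "é" is one char but two bytes in UTF-8, so the byte-based slice
    // &s[1..2] would panic; char-based indexing stays on char boundaries.
    let s = "héllo";
    assert_eq!(get_range_of(s, 1..3), Some("él"));
}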
Type Definitions
Offsets
Result