Module tokenizers::tokenizer

Represents a tokenization pipeline.

A Tokenizer is composed of some of the following parts.

  • Normalizer: Takes care of the text normalization (like unicode normalization).
  • PreTokenizer: Takes care of the pre-tokenization (i.e. how to split tokens and pre-process them).
  • Model: A model encapsulates the tokenization algorithm (like BPE, word-based, character-based, ...).
  • PostProcessor: Takes care of the processing after tokenization (like truncating, padding, ...).
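Conceptually, the four parts compose into one function from raw text to token ids. The sketch below is stdlib-only and illustrative: the stand-in bodies (lowercasing, whitespace splitting, length-as-id, truncation) are assumptions standing in for the real Normalizer, PreTokenizer, Model, and PostProcessor traits, which have richer signatures.

```rust
// Illustrative pipeline only; not the crate's actual API.
fn normalize(text: &str) -> String {
    // stand-in for real normalization (e.g. Unicode normalization)
    text.to_lowercase()
}

fn pre_tokenize(text: &str) -> Vec<String> {
    // stand-in for real pre-tokenization: naive whitespace split
    text.split_whitespace().map(str::to_string).collect()
}

fn model(words: &[String]) -> Vec<u32> {
    // stand-in for a real model (BPE, word-based, ...): id = word length
    words.iter().map(|w| w.chars().count() as u32).collect()
}

fn post_process(mut ids: Vec<u32>, max_len: usize) -> Vec<u32> {
    // stand-in post-processing: truncate to a maximum length
    ids.truncate(max_len);
    ids
}

fn main() {
    let ids = post_process(model(&pre_tokenize(&normalize("Hello Tokenizers"))), 8);
    println!("{:?}", ids); // [5, 10]
}
```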

Re-exports

pub use crate::utils::padding::pad_encodings;
pub use crate::utils::padding::PaddingDirection;
pub use crate::utils::padding::PaddingParams;
pub use crate::utils::padding::PaddingStrategy;
pub use crate::utils::truncation::truncate_encodings;
pub use crate::utils::truncation::TruncationParams;
pub use crate::utils::truncation::TruncationStrategy;

Structs

AddedToken
Encoding

Represents the output of a Tokenizer.

NormalizedString

A NormalizedString takes care of processing an "original" string to modify it and obtain a "normalized" string. It keeps both versions of the string, alignment information between the two, and provides an interface to retrieve ranges of each string using offsets from either of them.
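The alignment idea can be shown with a minimal stdlib-only sketch (this is not the crate's implementation): each normalized character remembers the byte range it came from in the original string, so ranges can be translated back even when normalization changes byte lengths.

```rust
// Illustrative alignment tracking; the 'é' -> 'e' mapping is a stand-in
// for real normalization.
fn normalize_with_alignments(original: &str) -> (String, Vec<(usize, usize)>) {
    let mut normalized = String::new();
    let mut alignments = Vec::new();
    for (byte_idx, ch) in original.char_indices() {
        let repl = if ch == 'é' { 'e' } else { ch };
        normalized.push(repl);
        // each normalized char maps back to the original byte range it replaced
        alignments.push((byte_idx, byte_idx + ch.len_utf8()));
    }
    (normalized, alignments)
}

fn main() {
    let (norm, align) = normalize_with_alignments("café");
    assert_eq!(norm, "cafe");
    // the normalized 'e' at index 3 came from original bytes 3..5 ('é' is 2 bytes)
    assert_eq!(align[3], (3, 5));
    println!("{norm} {:?}", align);
}
```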

Token
Tokenizer

A Tokenizer is capable of encoding/decoding any text.

Enums

EncodeInput
Range

Represents a Range usable by the NormalizedString to index its content. A Range can use indices relative to either the Original or the Normalized string.

Traits

Decoder

A Decoder has the responsibility to merge the given Vec<String> into a String.
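As a hedged sketch of the role (the real trait in the crate has a different, fallible signature), a WordPiece-style decoder could merge tokens by stripping "##" continuation markers; the trait and type names below are illustrative only.

```rust
// Illustrative Decoder; not the crate's actual trait definition.
trait Decoder {
    fn decode(&self, tokens: Vec<String>) -> String;
}

// Merges tokens, gluing "##"-prefixed continuations to the previous word.
struct WordPieceLike;

impl Decoder for WordPieceLike {
    fn decode(&self, tokens: Vec<String>) -> String {
        let mut out = String::new();
        for tok in tokens {
            if let Some(cont) = tok.strip_prefix("##") {
                out.push_str(cont); // continuation piece: no space before it
            } else {
                if !out.is_empty() {
                    out.push(' ');
                }
                out.push_str(&tok);
            }
        }
        out
    }
}

fn main() {
    let d = WordPieceLike;
    let text = d.decode(vec!["token".into(), "##izer".into(), "works".into()]);
    assert_eq!(text, "tokenizer works");
    println!("{text}");
}
```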

Model

Represents a model used during tokenization (like BPE, Word, or Unigram).

Normalizer

Takes care of pre-processing strings.

PostProcessor

A PostProcessor has the responsibility to post process an encoded output of the Tokenizer. It adds any special tokens that a language model would require.

PreTokenizer

The PreTokenizer is in charge of doing the pre-segmentation step. It splits the given string into multiple substrings, keeping track of the offsets of said substrings relative to the NormalizedString. On some occasions, the PreTokenizer might need to modify the given NormalizedString to ensure we can entirely keep track of the offsets and the mapping with the original string.

Trainer

A Trainer has the responsibility to train a model. We feed it lines/sentences, and it returns a Model when done.
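The feed-lines-get-model contract can be sketched with a toy word-level trainer; this is a stdlib-only illustration under assumed names (WordLevelTrainer, WordLevelModel), not the crate's actual Trainer trait.

```rust
use std::collections::HashMap;

// Illustrative trainer: builds a word -> id vocabulary from input lines.
struct WordLevelTrainer;

struct WordLevelModel {
    vocab: HashMap<String, u32>,
}

impl WordLevelTrainer {
    fn train(&self, lines: &[&str]) -> WordLevelModel {
        let mut vocab = HashMap::new();
        for line in lines {
            for word in line.split_whitespace() {
                // assign the next free id the first time a word is seen
                let next_id = vocab.len() as u32;
                vocab.entry(word.to_string()).or_insert(next_id);
            }
        }
        WordLevelModel { vocab }
    }
}

fn main() {
    let model = WordLevelTrainer.train(&["hello world", "hello tokenizers"]);
    assert_eq!(model.vocab["hello"], 0);
    assert_eq!(model.vocab.len(), 3);
    println!("{:?}", model.vocab);
}
```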

Functions

get_range_of

Returns a range of the given string slice by indexing chars instead of bytes.
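Why char indexing matters: Rust string slicing works on bytes, so a byte range can split a multi-byte character and panic. A minimal stdlib-only sketch of the idea (not the crate's implementation) translates a char range into the corresponding byte range before slicing:

```rust
// Illustrative char-based slicing; returns None if the range is out of bounds.
fn get_range_of(s: &str, range: std::ops::Range<usize>) -> Option<&str> {
    // byte offset of the char at range.start
    let start = s.char_indices().nth(range.start).map(|(i, _)| i)?;
    // byte offset just past the char at range.end - 1
    let end = if range.end == s.chars().count() {
        s.len()
    } else {
        s.char_indices().nth(range.end).map(|(i, _)| i)?
    };
    s.get(start..end)
}

fn main() {
    // byte slicing "héllo"[1..3] would cut 'é' in half; char slicing is safe
    assert_eq!(get_range_of("héllo", 1..3), Some("él"));
    println!("{:?}", get_range_of("héllo", 1..3));
}
```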

Type Definitions

Offsets
Result