Module tokenizers::tokenizer

Represents a tokenization pipeline.

A Tokenizer is composed of some of the following parts; a minimal assembly sketch follows the list.

  • Normalizer: Takes care of the text normalization (like Unicode normalization).
  • PreTokenizer: Takes care of the pre-tokenization (i.e. how to split words and pre-process them).
  • Model: A model encapsulates the tokenization algorithm (like BPE, word-level, character-level, ...).
  • PostProcessor: Takes care of the processing after tokenization (like truncating, padding, ...).
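
As a rough illustration, these parts compose as follows. This is a minimal sketch: the module paths for BPE, NFC, and Whitespace, and the exact constructor signatures (some versions require the model to be boxed), are assumptions to verify against the crate version in use.

    use tokenizers::tokenizer::Tokenizer;
    use tokenizers::models::bpe::BPE;                       // assumed path
    use tokenizers::normalizers::unicode::NFC;              // assumed path
    use tokenizers::pre_tokenizers::whitespace::Whitespace; // assumed path

    fn build_pipeline() -> Tokenizer {
        // The Model is the only mandatory part; every other part is optional.
        let mut tokenizer = Tokenizer::new(BPE::default());
        tokenizer.with_normalizer(NFC);              // Normalizer: Unicode NFC normalization
        tokenizer.with_pre_tokenizer(Whitespace {}); // PreTokenizer: split on whitespace
        tokenizer
    }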

Re-exports

pub use crate::utils::pad_encodings;
pub use crate::utils::truncate_encodings;
pub use crate::utils::PaddingParams;
pub use crate::utils::PaddingStrategy;
pub use crate::utils::TruncationParams;
pub use crate::utils::TruncationStrategy;

Structs

AddedToken
Encoding

Represents the output of a Tokenizer.
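
For instance, the ids, token strings, and offsets of an Encoding can be read back through accessors (a sketch; the get_* method names are assumed from the crate's public API and should be verified against your version):

    use tokenizers::tokenizer::Encoding;

    // Read back the main fields of an Encoding.
    fn inspect(encoding: &Encoding) {
        let ids: &[u32] = encoding.get_ids();          // token ids
        let tokens: &[String] = encoding.get_tokens(); // token strings
        let offsets = encoding.get_offsets();          // (start, end) byte offsets
        println!("{} tokens: {:?} {:?}", ids.len(), tokens, offsets);
    }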

NormalizedString

A NormalizedString keeps both the original and the normalized version of a string, and provides the alignments needed to retrieve ranges from either one.
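
A small sketch of the idea, assuming lowercase, get, and get_original are available on NormalizedString (method names to verify against the crate source):

    use tokenizers::tokenizer::NormalizedString;

    fn main() {
        let mut s = NormalizedString::from("Hello There");
        s.lowercase(); // a normalization step; mutates the normalized view only
        assert_eq!(s.get(), "hello there");          // normalized version
        assert_eq!(s.get_original(), "Hello There"); // original version, kept intact
    }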

Token
Tokenizer

A Tokenizer is capable of encoding/decoding any text.
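
A typical round trip looks roughly like this. The bool flags (add vs. skip special tokens) and whether decode takes a Vec or a slice of ids vary across versions, so treat the exact signatures as assumptions:

    use tokenizers::tokenizer::{Result, Tokenizer};

    fn round_trip(tokenizer: &Tokenizer, text: &str) -> Result<String> {
        // Encode the text into an Encoding, without adding special tokens.
        let encoding = tokenizer.encode(text, false)?;
        // Decode the ids back into a string, skipping any special tokens.
        tokenizer.decode(encoding.get_ids().to_vec(), true)
    }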

Enums

EncodeInput
PaddingDirection

The various possible padding directions.

Traits

Decoder

A Decoder has the responsibility to merge the given Vec<String> into a single String.
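
Since the trait's job is just that merge, a minimal custom Decoder might look like the following sketch (the trait method signature is inferred from the description above and should be checked):

    use tokenizers::tokenizer::{Decoder, Result};

    // A hypothetical decoder that joins tokens with single spaces.
    struct SpaceJoiner;

    impl Decoder for SpaceJoiner {
        fn decode(&self, tokens: Vec<String>) -> Result<String> {
            Ok(tokens.join(" "))
        }
    }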

Model

Represents a model used during tokenization (like BPE, word-level, or Unigram).

Normalizer

Takes care of pre-processing (normalizing) strings.

PostProcessor

A PostProcessor has the responsibility to post-process an encoded output of the Tokenizer, adding any special tokens that a language model would require.

PreTokenizer

Takes care of pre-tokenizing strings before they go to the Model.

Trainer

A Trainer has the responsibility to train a model. We feed it lines/sentences and it returns a Model when done.

Type Definitions

Offsets
Result
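
These aliases most likely have the following shapes; this is an assumption based on how the crate reports offsets and boxes errors, so check the source for the exact definitions:

    // (start, end) byte positions in the input text.
    pub type Offsets = (usize, usize);
    // A Result whose error type can hold any boxed error.
    pub type Result<T> = std::result::Result<T, Box<dyn std::error::Error + Send + Sync>>;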