Module tokenizers::tokenizer

Represents a tokenization pipeline.

A Tokenizer is composed of some of the following parts; a minimal assembly sketch follows the list.

  • Normalizer: Takes care of the text normalization (like Unicode normalization).
  • PreTokenizer: Takes care of the pre-tokenization (i.e. how to split words and pre-process them).
  • Model: A model encapsulates the tokenization algorithm (like BPE, word-level, character-level, ...).
  • PostProcessor: Takes care of the processing after tokenization (like truncating, padding, ...).
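
As a rough illustration, these parts compose as follows. This is a minimal sketch: the module paths for BPE, NFC, and Whitespace, and the exact constructor signatures (some versions require the model to be boxed), are assumptions to verify against the crate version in use.

    use tokenizers::tokenizer::Tokenizer;
    use tokenizers::models::bpe::BPE;                       // assumed path
    use tokenizers::normalizers::unicode::NFC;              // assumed path
    use tokenizers::pre_tokenizers::whitespace::Whitespace; // assumed path

    fn build_pipeline() -> Tokenizer {
        // The Model is the only mandatory part; every other part is optional.
        let mut tokenizer = Tokenizer::new(BPE::default());
        tokenizer.with_normalizer(NFC);              // Normalizer: Unicode NFC normalization
        tokenizer.with_pre_tokenizer(Whitespace {}); // PreTokenizer: split on whitespace
        tokenizer
    }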

Re-exports

pub use crate::utils::pad_encodings;
pub use crate::utils::truncate_encodings;
pub use crate::utils::PaddingParams;
pub use crate::utils::PaddingStrategy;
pub use crate::utils::TruncationParams;
pub use crate::utils::TruncationStrategy;

Structs

AddedToken
Encoding

Represents the output of a Tokenizer.
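
For instance, the ids, token strings, and offsets of an Encoding can be read back through accessors (a sketch; the get_* method names are assumed from the crate's public API and should be verified against your version):

    use tokenizers::tokenizer::Encoding;

    // Read back the main fields of an Encoding.
    fn inspect(encoding: &Encoding) {
        let ids: &[u32] = encoding.get_ids();          // token ids
        let tokens: &[String] = encoding.get_tokens(); // token strings
        let offsets = encoding.get_offsets();          // (start, end) byte offsets
        println!("{} tokens: {:?} {:?}", ids.len(), tokens, offsets);
    }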

NormalizedString

A NormalizedString keeps both the original and the normalized version of a string, and provides the alignments needed to retrieve ranges from either one.
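
A small sketch of the idea, assuming lowercase, get, and get_original are available on NormalizedString (method names to verify against the crate source):

    use tokenizers::tokenizer::NormalizedString;

    fn main() {
        let mut s = NormalizedString::from("Hello There");
        s.lowercase(); // a normalization step; mutates the normalized view only
        assert_eq!(s.get(), "hello there");          // normalized version
        assert_eq!(s.get_original(), "Hello There"); // original version, kept intact
    }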

Token
Tokenizer

A Tokenizer is capable of encoding/decoding any text.
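
A typical round trip looks roughly like this. The bool flags (add vs. skip special tokens) and whether decode takes a Vec or a slice of ids vary across versions, so treat the exact signatures as assumptions:

    use tokenizers::tokenizer::{Result, Tokenizer};

    fn round_trip(tokenizer: &Tokenizer, text: &str) -> Result<String> {
        // Encode the text into an Encoding, without adding special tokens.
        let encoding = tokenizer.encode(text, false)?;
        // Decode the ids back into a string, skipping any special tokens.
        tokenizer.decode(encoding.get_ids().to_vec(), true)
    }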

Enums

EncodeInput
PaddingDirection

The various possible padding directions.

Traits

Decoder

A Decoder has the responsibility to merge the given Vec<String> into a single String.
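
Since the trait's job is just that merge, a minimal custom Decoder might look like the following sketch (the trait method signature is inferred from the description above and should be checked):

    use tokenizers::tokenizer::{Decoder, Result};

    // A hypothetical decoder that joins tokens with single spaces.
    struct SpaceJoiner;

    impl Decoder for SpaceJoiner {
        fn decode(&self, tokens: Vec<String>) -> Result<String> {
            Ok(tokens.join(" "))
        }
    }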

Model

Represents a model used during tokenization (like BPE, word-level, or Unigram).

Normalizer

Takes care of pre-processing (normalizing) strings.

PostProcessor

A PostProcessor has the responsibility to post-process an encoded output of the Tokenizer, adding any special tokens that a language model would require.

PreTokenizer

Takes care of pre-tokenizing strings before they go to the Model.

Trainer

A Trainer has the responsibility to train a model. We feed it lines/sentences and it returns a Model when done.

Type Definitions

Offsets
Result
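
These aliases most likely have the following shapes; this is an assumption based on how the crate reports offsets and boxes errors, so check the source for the exact definitions:

    // (start, end) byte positions in the input text.
    pub type Offsets = (usize, usize);
    // A Result whose error type can hold any boxed error.
    pub type Result<T> = std::result::Result<T, Box<dyn std::error::Error + Send + Sync>>;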