Module tokenizers::tokenizer

Represents a tokenization pipeline.

A Tokenizer is composed of some of the following parts.

  • Normalizer: Takes care of the text normalization (like unicode normalization).
  • PreTokenizer: Takes care of the pre-tokenization (i.e. how to split tokens and pre-process them).
  • Model: A model encapsulates the tokenization algorithm (like BPE, word-based, character-based, ...).
  • PostProcessor: Takes care of the processing after tokenization (like truncating, padding, ...).
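Conceptually, the four parts compose into one function from raw text to token ids. The sketch below is stdlib-only and illustrative: the stand-in bodies (lowercasing, whitespace splitting, length-as-id, truncation) are assumptions standing in for the real Normalizer, PreTokenizer, Model, and PostProcessor traits, which have richer signatures.

```rust
// Illustrative pipeline only; not the crate's actual API.
fn normalize(text: &str) -> String {
    // stand-in for real normalization (e.g. Unicode normalization)
    text.to_lowercase()
}

fn pre_tokenize(text: &str) -> Vec<String> {
    // stand-in for real pre-tokenization: naive whitespace split
    text.split_whitespace().map(str::to_string).collect()
}

fn model(words: &[String]) -> Vec<u32> {
    // stand-in for a real model (BPE, word-based, ...): id = word length
    words.iter().map(|w| w.chars().count() as u32).collect()
}

fn post_process(mut ids: Vec<u32>, max_len: usize) -> Vec<u32> {
    // stand-in post-processing: truncate to a maximum length
    ids.truncate(max_len);
    ids
}

fn main() {
    let ids = post_process(model(&pre_tokenize(&normalize("Hello Tokenizers"))), 8);
    println!("{:?}", ids); // [5, 10]
}
```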

Re-exports

pub use crate::utils::padding::pad_encodings;
pub use crate::utils::padding::PaddingDirection;
pub use crate::utils::padding::PaddingParams;
pub use crate::utils::padding::PaddingStrategy;
pub use crate::utils::truncation::truncate_encodings;
pub use crate::utils::truncation::TruncationParams;
pub use crate::utils::truncation::TruncationStrategy;

Structs

AddedToken
Encoding

Represents the output of a Tokenizer.

NormalizedString

A NormalizedString takes care of processing an "original" string to modify it and obtain a "normalized" string. It keeps both versions of the string, alignment information between the two, and provides an interface to retrieve ranges of each string using offsets from either of them.
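The alignment idea can be shown with a minimal stdlib-only sketch (this is not the crate's implementation): each normalized character remembers the byte range it came from in the original string, so ranges can be translated back even when normalization changes byte lengths.

```rust
// Illustrative alignment tracking; the 'é' -> 'e' mapping is a stand-in
// for real normalization.
fn normalize_with_alignments(original: &str) -> (String, Vec<(usize, usize)>) {
    let mut normalized = String::new();
    let mut alignments = Vec::new();
    for (byte_idx, ch) in original.char_indices() {
        let repl = if ch == 'é' { 'e' } else { ch };
        normalized.push(repl);
        // each normalized char maps back to the original byte range it replaced
        alignments.push((byte_idx, byte_idx + ch.len_utf8()));
    }
    (normalized, alignments)
}

fn main() {
    let (norm, align) = normalize_with_alignments("café");
    assert_eq!(norm, "cafe");
    // the normalized 'e' at index 3 came from original bytes 3..5 ('é' is 2 bytes)
    assert_eq!(align[3], (3, 5));
    println!("{norm} {:?}", align);
}
```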

Token
Tokenizer

A Tokenizer is capable of encoding/decoding any text.

Enums

EncodeInput
Range

Represents a Range usable by the NormalizedString to index its content. A Range can use indices relative to either the Original or the Normalized string.

Traits

Decoder

A Decoder has the responsibility to merge the given Vec<String> into a String.
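As a hedged sketch of the role (the real trait in the crate has a different, fallible signature), a WordPiece-style decoder could merge tokens by stripping "##" continuation markers; the trait and type names below are illustrative only.

```rust
// Illustrative Decoder; not the crate's actual trait definition.
trait Decoder {
    fn decode(&self, tokens: Vec<String>) -> String;
}

// Merges tokens, gluing "##"-prefixed continuations to the previous word.
struct WordPieceLike;

impl Decoder for WordPieceLike {
    fn decode(&self, tokens: Vec<String>) -> String {
        let mut out = String::new();
        for tok in tokens {
            if let Some(cont) = tok.strip_prefix("##") {
                out.push_str(cont); // continuation piece: no space before it
            } else {
                if !out.is_empty() {
                    out.push(' ');
                }
                out.push_str(&tok);
            }
        }
        out
    }
}

fn main() {
    let d = WordPieceLike;
    let text = d.decode(vec!["token".into(), "##izer".into(), "works".into()]);
    assert_eq!(text, "tokenizer works");
    println!("{text}");
}
```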

Model

Represents a model used during tokenization (like BPE, Word, or Unigram).

Normalizer

Takes care of pre-processing strings.

PostProcessor

A PostProcessor has the responsibility to post process an encoded output of the Tokenizer. It adds any special tokens that a language model would require.

PreTokenizer

The PreTokenizer is in charge of doing the pre-segmentation step. It splits the given string into multiple substrings, keeping track of the offsets of said substrings relative to the NormalizedString. On some occasions, the PreTokenizer might need to modify the given NormalizedString to ensure we can entirely keep track of the offsets and the mapping with the original string.

Trainer

A Trainer has the responsibility to train a model. We feed it lines/sentences, and it returns a Model when done.
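The feed-lines-get-model contract can be sketched with a toy word-level trainer; this is a stdlib-only illustration under assumed names (WordLevelTrainer, WordLevelModel), not the crate's actual Trainer trait.

```rust
use std::collections::HashMap;

// Illustrative trainer: builds a word -> id vocabulary from input lines.
struct WordLevelTrainer;

struct WordLevelModel {
    vocab: HashMap<String, u32>,
}

impl WordLevelTrainer {
    fn train(&self, lines: &[&str]) -> WordLevelModel {
        let mut vocab = HashMap::new();
        for line in lines {
            for word in line.split_whitespace() {
                // assign the next free id the first time a word is seen
                let next_id = vocab.len() as u32;
                vocab.entry(word.to_string()).or_insert(next_id);
            }
        }
        WordLevelModel { vocab }
    }
}

fn main() {
    let model = WordLevelTrainer.train(&["hello world", "hello tokenizers"]);
    assert_eq!(model.vocab["hello"], 0);
    assert_eq!(model.vocab.len(), 3);
    println!("{:?}", model.vocab);
}
```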

Functions

get_range_of

Returns a range of the given string slice by indexing chars instead of bytes.
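Why char indexing matters: Rust string slicing works on bytes, so a byte range can split a multi-byte character and panic. A minimal stdlib-only sketch of the idea (not the crate's implementation) translates a char range into the corresponding byte range before slicing:

```rust
// Illustrative char-based slicing; returns None if the range is out of bounds.
fn get_range_of(s: &str, range: std::ops::Range<usize>) -> Option<&str> {
    // byte offset of the char at range.start
    let start = s.char_indices().nth(range.start).map(|(i, _)| i)?;
    // byte offset just past the char at range.end - 1
    let end = if range.end == s.chars().count() {
        s.len()
    } else {
        s.char_indices().nth(range.end).map(|(i, _)| i)?
    };
    s.get(start..end)
}

fn main() {
    // byte slicing "héllo"[1..3] would cut 'é' in half; char slicing is safe
    assert_eq!(get_range_of("héllo", 1..3), Some("él"));
    println!("{:?}", get_range_of("héllo", 1..3));
}
```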

Type Definitions

Offsets
Result