Tokenizers

This module contains the tokenizers used to split an input text into a sequence of tokens. These rely on vocabularies to define the sub-tokens a given word should be decomposed into. Three main classes of tokenizers are implemented in this crate (a usage sketch follows the list below):

  • WordPiece tokenizers
    • BERT
    • DistilBERT
  • Byte-Pair Encoding tokenizers:
    • GPT
    • GPT2
    • RoBERTa
    • CTRL
    • DeBERTa
  • SentencePiece (Unigram) tokenizers:
    • SentencePiece
    • ALBERT
    • XLMRoBERTa
    • XLNet
    • T5
    • Marian
    • Reformer
    • DeBERTa (v2)
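
As an illustration, loading a WordPiece tokenizer and encoding a sentence might look like the following minimal sketch. The vocabulary path and input text are placeholders, error handling is reduced to `?`, and the exact constructor flags (lower-casing, accent stripping) may differ between crate versions:

```rust
use rust_tokenizers::tokenizer::{BertTokenizer, Tokenizer, TruncationStrategy};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load a WordPiece vocabulary from disk ("path/to/vocab.txt" is a placeholder).
    // The boolean flags select lower-casing and accent stripping respectively.
    let tokenizer = BertTokenizer::from_file("path/to/vocab.txt", true, false)?;

    // Split the input text into the sub-tokens defined by the vocabulary.
    let tokens = tokenizer.tokenize("Hello, world!");
    println!("{:?}", tokens);

    // Encode to token ids, truncating to at most 128 tokens.
    let encoding = tokenizer.encode(
        "Hello, world!",
        None,
        128,
        &TruncationStrategy::LongestFirst,
        0,
    );
    println!("{:?}", encoding.token_ids);

    Ok(())
}
```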

All tokenizers are Send and Sync, and they support multi-threaded tokenization and encoding, as sketched below.
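Because tokenizers are Send and Sync, a single instance can be shared across threads, for example behind an `Arc`. The following is a minimal sketch using standard-library threads; the vocabulary path and input texts are placeholders:

```rust
use std::sync::Arc;
use std::thread;

use rust_tokenizers::tokenizer::{BertTokenizer, Tokenizer};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Share one tokenizer across threads; Send + Sync make this safe.
    let tokenizer = Arc::new(BertTokenizer::from_file("path/to/vocab.txt", true, false)?);

    let handles: Vec<_> = ["first input", "second input"]
        .into_iter()
        .map(|text| {
            let tokenizer = Arc::clone(&tokenizer);
            thread::spawn(move || tokenizer.tokenize(text))
        })
        .collect();

    for handle in handles {
        println!("{:?}", handle.join().expect("tokenization thread panicked"));
    }
    Ok(())
}
```

For batch workloads, the crate also exposes a MultiThreadedTokenizer trait whose methods parallelize tokenization and encoding over lists of inputs.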
