Module rust_tokenizers::tokenizer

Tokenizers

This module contains the tokenizers used to split an input text into a sequence of tokens. They rely on vocabularies to define the sub-tokens a given word should be decomposed into. Three main classes of tokenizers are implemented in this crate (a short usage sketch follows the list below):

  • WordPiece tokenizers:
    • BERT
    • DistilBERT
  • Byte-Pair Encoding tokenizers:
    • GPT
    • GPT2
    • RoBERTa
    • CTRL
  • SentencePiece (Unigram) tokenizers:
    • SentencePiece
    • ALBERT
    • XLMRoBERTa
    • XLNet
    • T5
    • Marian
    • Reformer
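
As an illustration, the sketch below loads a BERT WordPiece tokenizer from a vocabulary file and tokenizes and encodes a single sentence. The vocabulary path is a placeholder, and the exact constructor arguments (such as the lower-casing and accent-stripping flags) may differ slightly between crate versions.

```rust
use rust_tokenizers::tokenizer::{BertTokenizer, Tokenizer, TruncationStrategy};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder path to a BERT WordPiece vocabulary file;
    // flags: lower_case = true, strip_accents = true
    let tokenizer = BertTokenizer::from_file("path/to/vocab.txt", true, true)?;

    // Split a sentence into sub-tokens using the loaded vocabulary
    let tokens = tokenizer.tokenize("Hello, how are you?");
    println!("{:?}", tokens);

    // Encode to vocabulary indices, truncating to at most 128 tokens
    let encoding = tokenizer.encode(
        "Hello, how are you?",
        None,
        128,
        &TruncationStrategy::LongestFirst,
        0,
    );
    println!("{:?}", encoding.token_ids);
    Ok(())
}
```

The remaining tokenizers follow the same pattern, differing mainly in their constructors (BPE tokenizers, for instance, also load a merges file alongside the vocabulary).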

All tokenizers are Send and Sync, and support multi-threaded tokenization and encoding.
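
As a hedged sketch of the multi-threaded path, the example below encodes a batch of sentences through the MultiThreadedTokenizer trait. The batch-encoding method name and signature used here are assumptions that may vary across crate versions.

```rust
use rust_tokenizers::tokenizer::{BertTokenizer, MultiThreadedTokenizer, TruncationStrategy};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder vocabulary path, as in the single-threaded example above
    let tokenizer = BertTokenizer::from_file("path/to/vocab.txt", true, true)?;

    let texts = ["First sentence.", "Second sentence.", "Third sentence."];

    // Tokenization and encoding of the batch is parallelized across threads
    let encodings = MultiThreadedTokenizer::encode_list(
        &tokenizer,
        &texts[..],
        128,
        &TruncationStrategy::LongestFirst,
        0,
    );
    println!("{} inputs encoded", encodings.len());
    Ok(())
}
```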

Structs

  • ALBERT tokenizer
  • Base tokenizer
  • BERT tokenizer
  • CTRL tokenizer
  • GPT2 tokenizer
  • M2M100 tokenizer
  • MBart50 tokenizer
  • Marian tokenizer
  • GPT tokenizer
  • Pegasus tokenizer
  • ProphetNet tokenizer
  • Reformer tokenizer
  • RoBERTa tokenizer
  • SentencePiece tokenizer (BPE model)
  • SentencePiece tokenizer (unigram model)
  • T5 tokenizer
  • XLM RoBERTa tokenizer
  • XLNet tokenizer

Enums

  • Truncation strategy variants

Traits

  • Extension for multithreaded tokenizers
  • Base trait for tokenizers

Functions

  • Truncates a sequence pair in place to the maximum length.