Module rust_tokenizers::tokenizer

Tokenizers

This module contains the tokenizers used to split an input text into a sequence of tokens. They rely on vocabularies to define the sub-tokens into which a given word should be decomposed. There are three main classes of tokenizers implemented in this crate (a minimal usage sketch follows the list):

  • WordPiece tokenizers:
    • BERT
    • DistilBERT
  • Byte-Pair Encoding tokenizers:
    • GPT
    • GPT2
    • RoBERTa
    • CTRL
  • SentencePiece (Unigram) tokenizers:
    • SentencePiece
    • ALBERT
    • XLMRoBERTa
    • XLNet
    • T5
    • Marian
    • Reformer
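
As an illustration, here is a minimal sketch of loading a WordPiece tokenizer and splitting a text into sub-tokens. The vocabulary path is a placeholder, and the same pattern applies to the other tokenizers listed above:

```rust
use rust_tokenizers::tokenizer::{BertTokenizer, Tokenizer};

fn main() {
    // Placeholder path: point this at a real BERT WordPiece vocabulary file.
    // The two boolean flags lowercase the input and strip accents.
    let tokenizer = BertTokenizer::from_file("path/to/vocab.txt", true, true)
        .expect("vocabulary file not found");

    // Words missing from the vocabulary are decomposed into known
    // sub-tokens (WordPiece marks continuation pieces with "##").
    let tokens = tokenizer.tokenize("Hello, how are you?");
    println!("{:?}", tokens);
}
```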

All tokenizers are Send and Sync, and support multi-threaded tokenization and encoding.
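
For instance, a single tokenizer instance can be shared across threads behind an Arc. This is only a sketch (the vocabulary path is again a placeholder); for batch work, the MultiThreadedTokenizer trait below also provides helpers such as tokenize_list over slices of inputs:

```rust
use std::sync::Arc;
use std::thread;

use rust_tokenizers::tokenizer::{BertTokenizer, Tokenizer};

fn main() {
    // Placeholder path: any tokenizer from this module could stand in here.
    let tokenizer = Arc::new(
        BertTokenizer::from_file("path/to/vocab.txt", true, true)
            .expect("vocabulary file not found"),
    );

    // Because the tokenizer is Sync, threads can share one instance and
    // tokenize their inputs concurrently.
    let handles: Vec<_> = ["First input.", "Second input."]
        .iter()
        .map(|text| {
            let tokenizer = Arc::clone(&tokenizer);
            let text = text.to_string();
            thread::spawn(move || tokenizer.tokenize(&text))
        })
        .collect();

    for handle in handles {
        println!("{:?}", handle.join().unwrap());
    }
}
```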

Structs

AlbertTokenizer

ALBERT tokenizer

BaseTokenizer

Base tokenizer

BertTokenizer

BERT tokenizer

CtrlTokenizer

CTRL tokenizer

Gpt2Tokenizer

GPT2 tokenizer

MBart50Tokenizer

MBart50 tokenizer

MarianTokenizer

Marian tokenizer

OpenAiGptTokenizer

GPT tokenizer

PegasusTokenizer

Pegasus tokenizer

ProphetNetTokenizer

ProphetNet tokenizer

ReformerTokenizer

Reformer tokenizer

RobertaTokenizer

RoBERTa tokenizer

SentencePieceTokenizer

SentencePiece tokenizer

T5Tokenizer

T5 tokenizer

XLMRobertaTokenizer

XLM RoBERTa tokenizer

XLNetTokenizer

XLNet tokenizer

Enums

TruncationStrategy

Truncation strategy variants
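
A sketch of how a strategy is passed to Tokenizer::encode; the vocabulary path is a placeholder. LongestFirst trims tokens from the longer sequence first, while OnlyFirst, OnlySecond and DoNotTruncate restrict or disable truncation:

```rust
use rust_tokenizers::tokenizer::{BertTokenizer, Tokenizer, TruncationStrategy};

fn main() {
    // Placeholder path to a BERT vocabulary file.
    let tokenizer = BertTokenizer::from_file("path/to/vocab.txt", true, true)
        .expect("vocabulary file not found");

    // Cap the encoded output at 16 token ids (special tokens included);
    // the final `0` is the stride applied to overflowing tokens.
    let encoded = tokenizer.encode(
        "A sentence that may well be longer than the token budget allows.",
        None,
        16,
        &TruncationStrategy::LongestFirst,
        0,
    );
    println!("ids: {:?}", encoded.token_ids);
    println!("tokens dropped: {}", encoded.num_truncated_tokens);
}
```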

Traits

MultiThreadedTokenizer

Extension for multithreaded tokenizers

Tokenizer

Base trait for tokenizers

Functions

truncate_sequences

Truncates a sequence pair in place to the maximum length.
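
In typical use, truncate_sequences is invoked indirectly: Tokenizer::encode calls it when the combined token count of a pair exceeds the maximum length. A sketch of that indirect route on a sentence pair (the vocabulary path is a placeholder):

```rust
use rust_tokenizers::tokenizer::{BertTokenizer, Tokenizer, TruncationStrategy};

fn main() {
    // Placeholder path to a BERT vocabulary file.
    let tokenizer = BertTokenizer::from_file("path/to/vocab.txt", true, true)
        .expect("vocabulary file not found");

    // Encoding a pair: truncation is applied internally so that both
    // sequences plus the special tokens fit the 24-token budget,
    // trimming the longer sequence first.
    let encoded = tokenizer.encode(
        "A fairly long first sentence standing in for a premise.",
        Some("A second sentence standing in for a hypothesis."),
        24,
        &TruncationStrategy::LongestFirst,
        0,
    );
    assert!(encoded.token_ids.len() <= 24);
    println!("dropped {} token(s)", encoded.num_truncated_tokens);
}
```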