Module rust_tokenizers::tokenizer
Tokenizers
This module contains the tokenizers used to split an input text into a sequence of tokens. These rely on vocabularies to define the sub-tokens a given word should be decomposed into. There are three main classes of tokenizers implemented in this crate (a minimal usage sketch follows the list):
- WordPiece tokenizers:
  - BERT
  - DistilBERT
- Byte-Pair Encoding tokenizers:
  - GPT
  - GPT2
  - RoBERTa
  - CTRL
- SentencePiece (Unigram) tokenizers:
  - SentencePiece
  - ALBERT
  - XLM-RoBERTa
  - XLNet
  - T5
  - Marian
  - Reformer
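As a minimal sketch of the shared tokenization API (the vocabulary path is a placeholder, and the exact constructor arguments may vary between crate versions), a BERT WordPiece tokenizer could be loaded and used roughly as follows:

```rust
use rust_tokenizers::tokenizer::{BertTokenizer, Tokenizer, TruncationStrategy};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load a WordPiece vocabulary from disk ("path/to/vocab.txt" is a placeholder).
    // The boolean flags control lower-casing and accent stripping.
    let tokenizer = BertTokenizer::from_file("path/to/vocab.txt", true, false)?;

    // Split raw text into sub-tokens from the vocabulary.
    let tokens = tokenizer.tokenize("Hello, how are you?");
    println!("{:?}", tokens);

    // Encode a single sequence into token ids, truncating to at most 128 tokens.
    let encoding = tokenizer.encode(
        "Hello, how are you?",
        None,
        128,
        &TruncationStrategy::LongestFirst,
        0,
    );
    println!("{:?}", encoding.token_ids);
    Ok(())
}
```

The same Tokenizer trait methods apply to the other tokenizer structs listed below, each paired with its own vocabulary format.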
All tokenizers are Send and Sync, and support multi-threaded tokenization and encoding.
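As an illustration of this (a sketch; the vocabulary path is again a placeholder), a tokenizer can be shared across threads behind an Arc without any locking:

```rust
use std::sync::Arc;
use std::thread;

use rust_tokenizers::tokenizer::{BertTokenizer, Tokenizer};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder vocabulary path; see the loading example above.
    let tokenizer = Arc::new(BertTokenizer::from_file("path/to/vocab.txt", true, false)?);

    // Spawning these threads compiles only because the tokenizer is Send + Sync.
    let handles: Vec<_> = ["first input", "second input"]
        .iter()
        .map(|text| {
            let tokenizer = Arc::clone(&tokenizer);
            let text = text.to_string();
            thread::spawn(move || tokenizer.tokenize(&text))
        })
        .collect();

    for handle in handles {
        println!("{:?}", handle.join().expect("tokenization thread panicked"));
    }
    Ok(())
}
```

For batch workloads, the MultiThreadedTokenizer trait listed below exposes list variants (such as tokenize_list) that parallelize over the inputs internally.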
Structs
AlbertTokenizer: ALBERT tokenizer
BaseTokenizer: Base tokenizer
BertTokenizer: BERT tokenizer
CtrlTokenizer: CTRL tokenizer
Gpt2Tokenizer: GPT2 tokenizer
M2M100Tokenizer: M2M100 tokenizer
MBart50Tokenizer: MBart50 tokenizer
MarianTokenizer: Marian tokenizer
OpenAiGptTokenizer: GPT tokenizer
PegasusTokenizer: Pegasus tokenizer
ProphetNetTokenizer: ProphetNet tokenizer
ReformerTokenizer: Reformer tokenizer
RobertaTokenizer: RoBERTa tokenizer
SentencePieceBpeTokenizer: SentencePiece (BPE) tokenizer
SentencePieceTokenizer: SentencePiece (Unigram) tokenizer
T5Tokenizer: T5 tokenizer
XLMRobertaTokenizer: XLM-RoBERTa tokenizer
XLNetTokenizer: XLNet tokenizer
Enums
TruncationStrategy: Truncation strategy variants
Traits
MultiThreadedTokenizer: Extension for multi-threaded tokenizers
Tokenizer: Base trait for tokenizers
Functions
truncate_sequences: Truncates a sequence pair in place to the maximum length.
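As a rough illustration of how the truncation strategy variants come into play during encoding (a sketch reusing the placeholder setup above; the 10-token budget is arbitrary and includes special tokens):

```rust
use rust_tokenizers::tokenizer::{BertTokenizer, Tokenizer, TruncationStrategy};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let tokenizer = BertTokenizer::from_file("path/to/vocab.txt", true, false)?;

    // Encode a sequence pair, capping the combined length at 10 tokens.
    // LongestFirst trims tokens from whichever sequence is currently longer;
    // OnlyFirst / OnlySecond truncate only the named sequence, and
    // DoNotTruncate performs no truncation at all.
    let encoding = tokenizer.encode(
        "a fairly long first sequence that will need trimming",
        Some("a second sequence"),
        10,
        &TruncationStrategy::LongestFirst,
        0,
    );
    println!("{} token ids", encoding.token_ids.len());
    Ok(())
}
```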