Module rust_tokenizers::tokenizer
Tokenizers
This module contains the tokenizers used to split an input text into a sequence of tokens. These rely on vocabularies to define the sub-tokens a given word should be decomposed into. There are three main classes of tokenizers implemented in this crate:
- WordPiece tokenizers:
  - BERT
  - DistilBERT
- Byte-Pair Encoding tokenizers:
  - GPT
  - GPT2
  - RoBERTa
  - CTRL
- SentencePiece (Unigram) tokenizers:
  - SentencePiece
  - ALBERT
  - XLMRoBERTa
  - XLNet
  - T5
  - Marian
  - Reformer
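To make the WordPiece family concrete, here is a minimal sketch of the greedy longest-match sub-token split that WordPiece-style tokenizers perform. The function name, toy vocabulary, and `##` continuation marker are illustrative, not the crate's API; a real WordPiece tokenizer additionally handles lower-casing, accent stripping, and an unknown-token fallback.

```rust
use std::collections::HashSet;

// Greedy longest-match split of a single word into vocabulary sub-tokens.
// Returns None when no valid decomposition exists (a real tokenizer would
// emit [UNK] instead).
fn wordpiece_split(word: &str, vocab: &HashSet<&str>) -> Option<Vec<String>> {
    let mut pieces = Vec::new();
    let chars: Vec<char> = word.chars().collect();
    let mut start = 0;
    while start < chars.len() {
        let mut end = chars.len();
        let mut found = None;
        // Try the longest remaining substring first, shrinking until a vocab hit.
        while end > start {
            let mut piece: String = chars[start..end].iter().collect();
            if start > 0 {
                piece = format!("##{piece}"); // mark word-internal continuation
            }
            if vocab.contains(piece.as_str()) {
                found = Some((piece, end));
                break;
            }
            end -= 1;
        }
        match found {
            Some((piece, next_start)) => {
                pieces.push(piece);
                start = next_start;
            }
            None => return None,
        }
    }
    Some(pieces)
}

fn main() {
    let vocab: HashSet<&str> = ["token", "##izer", "##s"].into_iter().collect();
    // Greedy longest-match yields ["token", "##izer", "##s"].
    println!("{:?}", wordpiece_split("tokenizers", &vocab));
}
```

The greedy strategy is what makes splits deterministic given a vocabulary: the longest matching prefix always wins, so no search over alternative segmentations is needed.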
All tokenizers are Send and Sync, and support multi-threaded tokenization and encoding.
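Because the tokenizers are Send and Sync, they can be shared across threads behind an Arc without locking. The sketch below illustrates the pattern with a stand-in type; `ToyTokenizer` and its whitespace-splitting `tokenize` are hypothetical placeholders, not the crate's API.

```rust
use std::sync::Arc;
use std::thread;

// A stand-in tokenizer with no interior mutability: such a type is
// automatically Send + Sync, which is what allows sharing it across
// threads behind an Arc.
struct ToyTokenizer;

impl ToyTokenizer {
    fn tokenize(&self, text: &str) -> Vec<String> {
        text.split_whitespace().map(str::to_owned).collect()
    }
}

fn main() {
    let tokenizer = Arc::new(ToyTokenizer);
    let inputs = vec!["hello world", "rust tokenizers"];

    // Each thread gets its own Arc clone; the tokenizer itself is shared.
    let handles: Vec<_> = inputs
        .into_iter()
        .map(|text| {
            let tok = Arc::clone(&tokenizer);
            thread::spawn(move || tok.tokenize(text))
        })
        .collect();

    for handle in handles {
        println!("{:?}", handle.join().unwrap());
    }
}
```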
Structs
- AlbertTokenizer: ALBERT tokenizer
- BaseTokenizer: Base tokenizer
- BertTokenizer: BERT tokenizer
- CtrlTokenizer: CTRL tokenizer
- Gpt2Tokenizer: GPT2 tokenizer
- MarianTokenizer: Marian tokenizer
- OpenAiGptTokenizer: GPT tokenizer
- ReformerTokenizer: Reformer tokenizer
- RobertaTokenizer: RoBERTa tokenizer
- SentencePieceTokenizer: SentencePiece tokenizer
- T5Tokenizer: T5 tokenizer
- XLMRobertaTokenizer: XLM-RoBERTa tokenizer
- XLNetTokenizer: XLNet tokenizer
Enums
- TruncationStrategy: Truncation strategy variants
Traits
- MultiThreadedTokenizer: Extension for multithreaded tokenizers
- Tokenizer: Base trait for tokenizers
Functions
- truncate_sequences: Truncates a sequence pair in place to the maximum length.
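The core idea behind truncating a sequence pair can be sketched as follows: under a longest-first strategy, remove tokens one at a time from whichever sequence is currently longer until the pair fits. The function name, signature, and in-place mutation here are illustrative only; the crate's truncate_sequences additionally returns the overflowing tokens and supports the other truncation strategies.

```rust
// Longest-first truncation sketch: shrink the currently longer sequence
// one token at a time until the combined length fits within max_len.
fn truncate_longest_first(seq_1: &mut Vec<String>, seq_2: &mut Vec<String>, max_len: usize) {
    while seq_1.len() + seq_2.len() > max_len {
        if seq_1.len() >= seq_2.len() {
            seq_1.pop();
        } else {
            seq_2.pop();
        }
    }
}

fn main() {
    let mut a: Vec<String> = (0..5).map(|i| format!("a{i}")).collect();
    let mut b: Vec<String> = (0..2).map(|i| format!("b{i}")).collect();
    truncate_longest_first(&mut a, &mut b, 5);
    // The longer sequence absorbs all removals: a shrinks to 3, b keeps 2.
    println!("{:?} {:?}", a, b);
}
```

Truncating the longer sequence first balances the pair, which matters for sentence-pair tasks where both sides should retain comparable context.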