Module rust_tokenizers::tokenizer
Tokenizers
This module contains the tokenizers used to split an input text into a sequence of tokens. These rely on the vocabularies for defining the subtokens into which a given word should be decomposed. There are three main classes of tokenizers implemented in this crate (a usage sketch follows the list):
- WordPiece tokenizers:
  - BERT
  - DistilBERT
- Byte-Pair Encoding tokenizers:
  - GPT
  - GPT2
  - RoBERTa
  - CTRL
  - DeBERTa
- SentencePiece (Unigram) tokenizers:
  - SentencePiece
  - ALBERT
  - XLMRoBERTa
  - XLNet
  - T5
  - Marian
  - Reformer
  - DeBERTa (v2)
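For example, a WordPiece tokenizer can be loaded from a vocabulary file and used to encode a sentence. The snippet below is a minimal sketch following the crate's documented `from_file`/`encode` API; the vocabulary path is a placeholder:

```rust
use rust_tokenizers::tokenizer::{BertTokenizer, Tokenizer, TruncationStrategy};

fn main() {
    // Placeholder path to a BERT WordPiece vocabulary file.
    let vocab_path = "path/to/vocab.txt";

    // Lowercase input and strip accents, matching an uncased BERT model.
    let tokenizer = BertTokenizer::from_file(vocab_path, true, true)
        .expect("failed to load vocabulary");

    // Encode a single sentence, truncating to at most 128 tokens.
    let input = tokenizer.encode(
        "This is a sample sentence to be tokenized",
        None,                              // no second sequence
        128,                               // maximum length
        &TruncationStrategy::LongestFirst, // truncation strategy
        0,                                 // stride for overflowing tokens
    );

    println!("{:?}", input.token_ids);
}
```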
All tokenizers are `Send` and `Sync`, and support multi-threaded tokenization and encoding.
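Because the tokenizers are `Send` and `Sync`, a single instance can be shared across threads without locking. The sketch below wraps a tokenizer in an `Arc` and encodes from two worker threads; the vocabulary path is again a placeholder. For parallel batch encoding, the crate also provides the `MultiThreadedTokenizer` trait listed under Traits below.

```rust
use std::sync::Arc;
use std::thread;

use rust_tokenizers::tokenizer::{BertTokenizer, Tokenizer, TruncationStrategy};

fn main() {
    let tokenizer = Arc::new(
        BertTokenizer::from_file("path/to/vocab.txt", true, true)
            .expect("failed to load vocabulary"),
    );

    let handles: Vec<_> = ["first batch text", "second batch text"]
        .into_iter()
        .map(|text| {
            let tokenizer = Arc::clone(&tokenizer);
            // Each thread encodes independently; no synchronization needed.
            thread::spawn(move || {
                tokenizer.encode(text, None, 128, &TruncationStrategy::LongestFirst, 0)
            })
        })
        .collect();

    for handle in handles {
        let input = handle.join().expect("worker thread panicked");
        println!("{:?}", input.token_ids);
    }
}
```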
Structs
- AlbertTokenizer: ALBERT tokenizer
- BaseTokenizer: Base tokenizer
- BertTokenizer: BERT tokenizer
- CtrlTokenizer: CTRL tokenizer
- DeBERTaTokenizer: DeBERTa tokenizer
- DeBERTaV2Tokenizer: DeBERTa V2 tokenizer
- FNetTokenizer: FNet tokenizer
- Gpt2Tokenizer: GPT2 tokenizer
- M2M100Tokenizer: M2M100 tokenizer
- MBart50Tokenizer: MBart50 tokenizer
- MarianTokenizer: Marian tokenizer
- OpenAiGptTokenizer: GPT tokenizer
- PegasusTokenizer: Pegasus tokenizer
- ProphetNetTokenizer: ProphetNet tokenizer
- ReformerTokenizer: Reformer tokenizer
- RobertaTokenizer: RoBERTa tokenizer
- SentencePieceBpeTokenizer: SentencePiece BPE tokenizer
- SentencePieceTokenizer: SentencePiece (unigram) tokenizer
- T5Tokenizer: T5 tokenizer
- XLMRobertaTokenizer: XLM RoBERTa tokenizer
- XLNetTokenizer: XLNet tokenizer
Enums
- TruncationStrategy: Truncation strategy variants
Traits
- MultiThreadedTokenizer: Extension for multi-threaded tokenizers
- Tokenizer: Base trait for tokenizers
Functions
- truncate_sequences: Truncates a sequence pair in place to the maximum length.
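In typical use, truncation is driven by passing a `TruncationStrategy` to `encode`, which invokes the truncation logic internally. A minimal pair-encoding sketch, with a placeholder vocabulary path:

```rust
use rust_tokenizers::tokenizer::{BertTokenizer, Tokenizer, TruncationStrategy};

fn main() {
    let tokenizer = BertTokenizer::from_file("path/to/vocab.txt", true, true)
        .expect("failed to load vocabulary");

    // Encode a sentence pair; `LongestFirst` removes tokens from the longer
    // sequence first until the combined length fits within 16 tokens.
    let input = tokenizer.encode(
        "The quick brown fox jumps over the lazy dog",
        Some("A second, longer sentence that will need to be truncated"),
        16,
        &TruncationStrategy::LongestFirst,
        0,
    );

    // Token ids for both sequences, truncated and combined with special tokens.
    println!("{} tokens: {:?}", input.token_ids.len(), input.token_ids);
}
```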