Module rust_tokenizers::vocab
Vocabularies
This module contains the vocabularies leveraged by the tokenizers. It provides methods for deserializing vocabulary files and for lookup by the tokenizers, covering:
- dictionaries (mapping from token to token ids)
- merge files (used by Byte-Pair Encoding tokenizers)
- sentence-piece models (trie structure and methods to find common prefix subtokens)
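As a rough illustration of the first two structures, a dictionary can be sketched as a map from token strings to ids, and a merges file as byte pairs ranked by merge priority. This is a minimal sketch with hypothetical file contents; the crate's own parsers handle the real file formats:

```rust
use std::collections::HashMap;

// Dictionary: one token per line; the line number becomes the token id.
fn load_dictionary(contents: &str) -> HashMap<String, i64> {
    contents
        .lines()
        .enumerate()
        .map(|(id, token)| (token.to_string(), id as i64))
        .collect()
}

// Merges: one byte pair per line; earlier lines have higher merge priority.
fn load_merges(contents: &str) -> HashMap<(String, String), usize> {
    contents
        .lines()
        .enumerate()
        .filter_map(|(rank, line)| {
            let mut parts = line.split_whitespace();
            match (parts.next(), parts.next()) {
                (Some(a), Some(b)) => Some(((a.to_string(), b.to_string()), rank)),
                _ => None,
            }
        })
        .collect()
}

fn main() {
    let dictionary = load_dictionary("[UNK]\nhello\nworld");
    assert_eq!(dictionary["hello"], 1);

    let merges = load_merges("h e\nhe l");
    assert_eq!(merges[&("h".to_string(), "e".to_string())], 0);
    println!("ok");
}
```

A BPE tokenizer repeatedly applies the lowest-ranked pair present in a word, which is why the merges are stored with their rank.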
The following vocabularies have been implemented:
- BERT
- ALBERT
- GPT2
- GPT
- Marian
- Pegasus
- ProphetNet
- Reformer
- RoBERTa
- T5
- XLMRoBERTa
- XLNet
- SentencePiece
All vocabularies implement the Vocab trait, which exposes a standard interface for integration with the tokenizers.
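A minimal sketch of what such a standard interface looks like. The trait and struct below are illustrative stand-ins, not the crate's actual Vocab trait, whose exact methods and signatures vary by version:

```rust
use std::collections::HashMap;

// Illustrative stand-in for a vocabulary interface: forward and reverse
// lookup, with a fallback to the unknown-token id for out-of-vocabulary input.
trait VocabLike {
    fn token_to_id(&self, token: &str) -> i64;
    fn id_to_token(&self, id: i64) -> Option<String>;
}

struct SimpleVocab {
    values: HashMap<String, i64>,  // token -> id
    indices: HashMap<i64, String>, // id -> token
    unknown_id: i64,
}

impl SimpleVocab {
    fn new(tokens: &[&str], unknown_token: &str) -> Self {
        let values: HashMap<String, i64> = tokens
            .iter()
            .enumerate()
            .map(|(id, t)| (t.to_string(), id as i64))
            .collect();
        let indices = values.iter().map(|(t, &id)| (id, t.clone())).collect();
        let unknown_id = values[unknown_token];
        SimpleVocab { values, indices, unknown_id }
    }
}

impl VocabLike for SimpleVocab {
    fn token_to_id(&self, token: &str) -> i64 {
        *self.values.get(token).unwrap_or(&self.unknown_id)
    }
    fn id_to_token(&self, id: i64) -> Option<String> {
        self.indices.get(&id).cloned()
    }
}

fn main() {
    let vocab = SimpleVocab::new(&["[UNK]", "hello", "world"], "[UNK]");
    assert_eq!(vocab.token_to_id("hello"), 1);
    assert_eq!(vocab.token_to_id("missing"), 0); // falls back to [UNK]
    assert_eq!(vocab.id_to_token(2), Some("world".to_string()));
    println!("ok");
}
```

Because every model-specific vocabulary satisfies the same interface, a tokenizer can be written once against the trait and reused across models.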
Structs
AlbertVocab | ALBERT Vocab
BaseVocab | Base Vocab
BertVocab | BERT Vocab
BpePairRef | Byte-pair query
BpePairVocab | Byte-Pair Encoding Vocab
Gpt2Vocab | GPT2 Vocab
MarianVocab | Marian Vocab
OpenAiGptVocab | GPT Vocab
PegasusVocab | Pegasus Vocab
ProphetNetVocab | ProphetNet Vocab
ReformerVocab | Reformer Vocab
RobertaVocab | RoBERTa Vocab
SentencePieceModel | SentencePiece Model
SentencePieceVocab | SentencePiece Vocab
T5Vocab | T5 Vocab
XLMRobertaVocab | XLMRoBERTa Vocab
XLNetVocab | XLNet Vocab
Traits
Vocab | Base Vocab trait