Module rust_tokenizers::vocab
Vocabularies
This module contains the vocabularies leveraged by the tokenizers. They provide methods for deserializing vocabulary files and for lookups by the tokenizers, including (a usage sketch follows this list):
- dictionaries (mappings from tokens to token ids)
- merge files (used by Byte-Pair Encoding tokenizers)
- sentence-piece models (trie structure and methods to find common prefix subtokens)
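For illustration, loading a dictionary-style vocabulary and querying it might look like the sketch below. This is a minimal example, not a definitive usage guide: the file path is a placeholder, and the exact constructor and method names (`from_file`, `token_to_id`, `id_to_token`) are assumptions drawn from the Vocab interface described further down.

```rust
use rust_tokenizers::vocab::{BertVocab, Vocab};

fn main() {
    // Deserialize a dictionary-style vocabulary (token -> token id) from a file.
    // "path/to/vocab.txt" is a placeholder for a real vocabulary file.
    let vocab = BertVocab::from_file("path/to/vocab.txt")
        .expect("failed to load vocabulary file");

    // Forward lookup: token string to token id
    // (unknown tokens map to the unknown token id).
    let id = vocab.token_to_id("hello");

    // Reverse lookup: token id back to its token string.
    let token = vocab.id_to_token(&id);
    println!("{} -> {}", token, id);
}
```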
The following vocabularies have been implemented:
- BERT
- ALBERT
- GPT2
- GPT
- Marian
- RoBERTa
- T5
- XLMRoBERTa
- XLNet
- SentencePiece
All vocabularies implement the Vocab trait, exposing a standard interface for integration with the tokenizers.
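Because of this shared trait, downstream code can stay generic over the concrete vocabulary type. A minimal sketch, assuming the trait exposes a `token_to_id` method returning an `i64` token id as used above:

```rust
use rust_tokenizers::vocab::Vocab;

// Works with any vocabulary (BERT, GPT2, SentencePiece, ...) since they all
// implement the same Vocab trait.
fn encode_tokens<V: Vocab>(vocab: &V, tokens: &[&str]) -> Vec<i64> {
    tokens.iter().map(|token| vocab.token_to_id(token)).collect()
}
```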
Structs
AlbertVocab
BaseVocab
BertVocab: BERT Vocab
BpePairRef: Byte pair query
BpePairVocab: Byte pair Encoding Vocab
Gpt2Vocab: GPT2 Vocab
M2M100Vocab: M2M100 Vocab
MBart50Vocab: MBart50 Vocab
MarianVocab: Marian Vocab
OpenAiGptVocab: GPT Vocab
PegasusVocab: Pegasus Vocab
ProphetNetVocab: ProphetNet Vocab
ReformerVocab
RobertaVocab: RoBERTa Vocab
SentencePieceBpeModel: SentencePiece BPE Model
SentencePieceModel: SentencePiece Model
SentencePieceVocab
T5Vocab: T5 Vocab
XLMRobertaVocab: XLMRoBERTa Vocab
XLNetVocab: XLNet Vocab
Traits
Vocab: Base Vocab trait