Module rust_tokenizers::vocab
Vocabularies
This module contains the vocabularies leveraged by the tokenizers. These provide methods to deserialize vocabulary files and to query them during tokenization (a short usage sketch follows the list), including:
- dictionaries (mapping tokens to token ids)
- merge files (used by byte-pair encoding tokenizers)
- SentencePiece models (a trie structure and methods to find common-prefix sub-tokens)
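As a concrete illustration, the sketch below loads a BERT wordpiece dictionary from a local file and performs token/id lookups. It is a minimal sketch, not canonical usage: "vocab.txt" is a placeholder path, and the method signatures shown (from_file, token_to_id, id_to_token) reflect recent crate versions and may differ in yours.

```rust
use rust_tokenizers::vocab::{BertVocab, Vocab};

fn main() {
    // "vocab.txt" is a placeholder: a plain-text BERT wordpiece
    // vocabulary file with one token per line.
    let vocab = BertVocab::from_file("vocab.txt")
        .expect("failed to deserialize the vocabulary file");

    // Dictionary lookups exposed by the vocabulary:
    // token -> id, then id -> token.
    let id = vocab.token_to_id("hello");
    let token = vocab.id_to_token(&id);
    println!("'hello' -> {} -> {}", id, token);
}
```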
The following vocabularies have been implemented:
- BERT
- ALBERT
- GPT2
- GPT
- Marian
- RoBERTa
- T5
- XLMRoBERTa
- XLNet
- SentencePiece
All vocabularies implement the Vocab trait, which exposes a standard interface for integration with the tokenizers.
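Because the interface is shared, code can be written generically over any vocabulary type. A minimal sketch, assuming the Vocab trait exposes token_to_id and a values() accessor over the underlying dictionary (method names as in recent crate versions; treat them as assumptions):

```rust
use rust_tokenizers::vocab::Vocab;

// Generic over any vocabulary in this module: BertVocab,
// Gpt2Vocab, SentencePieceVocab, etc. all implement Vocab.
fn vocab_summary<V: Vocab>(vocab: &V, probe: &str) {
    // values() exposes the underlying token -> id dictionary.
    println!("size: {} tokens", vocab.values().len());
    // Out-of-vocabulary tokens resolve to the unknown-token id.
    println!("{:?} -> {}", probe, vocab.token_to_id(probe));
}
```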
Structs
- AlbertVocab
- BaseVocab
- BertVocab
- BpePairRef (byte-pair query)
- BpePairVocab (byte-pair encoding vocabulary)
- DeBERTaV2Vocab
- DeBERTaVocab
- FNetVocab
- Gpt2Vocab
- M2M100Vocab
- MBart50Vocab
- MarianVocab
- OpenAiGptVocab
- PegasusVocab
- ProphetNetVocab
- ReformerVocab
- RobertaVocab
- SentencePieceBpeModel
- SentencePieceModel
- SentencePieceVocab
- T5Vocab
- XLMRobertaVocab
- XLNetVocab
Traits
- Vocab (base vocabulary trait)