Module rust_tokenizers::vocab


Vocabularies

This module contains the vocabularies used by the tokenizers. They provide methods for deserializing vocabulary files and for access by the tokenizers, including the following (a loading sketch is shown after the list):

  • dictionaries (mapping from tokens to token ids)
  • merge files (used by Byte-Pair Encoding tokenizers)
  • SentencePiece models (a trie structure with methods to find common-prefix subtokens)
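For illustration, here is a minimal sketch of deserializing a dictionary-style vocabulary and looking up a token id, assuming the `from_file` constructor and `token_to_id` accessor exposed by the `Vocab` trait; the file path is a placeholder, and exact signatures may differ between crate versions:

```rust
use rust_tokenizers::vocab::{BertVocab, Vocab};

fn main() {
    // Placeholder path: point this at a BERT-style vocab.txt file.
    let vocab = BertVocab::from_file("path/to/vocab.txt")
        .expect("failed to deserialize the vocabulary file");

    // Look up the id of a token; tokens missing from the dictionary
    // resolve to the vocabulary's unknown-token id.
    let id = vocab.token_to_id("hello");
    println!("hello -> {}", id);
}
```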

The following vocabularies have been implemented:

  • BERT
  • ALBERT
  • GPT2
  • GPT
  • Marian
  • RoBERTa
  • T5
  • XLMRoBERTa
  • XLNet
  • SentencePiece

All vocabularies implement the Vocab trait, which exposes a standard interface for integration with the tokenizers.
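Because every vocabulary shares this trait, helper code can stay generic over the concrete vocabulary type. A minimal sketch (the `tokens_to_ids` helper below is hypothetical, not part of the crate):

```rust
use rust_tokenizers::vocab::Vocab;

// Hypothetical helper: convert a slice of tokens to ids using any
// vocabulary that implements the `Vocab` trait.
fn tokens_to_ids<V: Vocab>(vocab: &V, tokens: &[&str]) -> Vec<i64> {
    tokens.iter().map(|token| vocab.token_to_id(token)).collect()
}
```

This is the same mechanism that lets the tokenizers integrate with any of the vocabularies listed above without special-casing the concrete type.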
