Module rust_tokenizers::vocab
Vocabularies
This module contains the vocabularies leveraged by the tokenizers. It provides methods for deserializing vocabulary files and for lookup by the tokenizers, covering:
- dictionaries (mapping from token to token ids)
- merge files (used by Byte-Pair Encoding tokenizers)
- sentence-piece models (trie structure and methods to find common prefix subtokens)
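As a rough illustration of the first two structures, a dictionary can be sketched as a map from token strings to ids, and a merges file as byte pairs ranked by merge priority. This is a minimal sketch with hypothetical file contents; the crate's own parsers handle the real file formats:

```rust
use std::collections::HashMap;

// Dictionary: one token per line; the line number becomes the token id.
fn load_dictionary(contents: &str) -> HashMap<String, i64> {
    contents
        .lines()
        .enumerate()
        .map(|(id, token)| (token.to_string(), id as i64))
        .collect()
}

// Merges: one byte pair per line; earlier lines have higher merge priority.
fn load_merges(contents: &str) -> HashMap<(String, String), usize> {
    contents
        .lines()
        .enumerate()
        .filter_map(|(rank, line)| {
            let mut parts = line.split_whitespace();
            match (parts.next(), parts.next()) {
                (Some(a), Some(b)) => Some(((a.to_string(), b.to_string()), rank)),
                _ => None,
            }
        })
        .collect()
}

fn main() {
    let dictionary = load_dictionary("[UNK]\nhello\nworld");
    assert_eq!(dictionary["hello"], 1);

    let merges = load_merges("h e\nhe l");
    assert_eq!(merges[&("h".to_string(), "e".to_string())], 0);
    println!("ok");
}
```

A BPE tokenizer repeatedly applies the lowest-ranked pair present in a word, which is why the merges are stored with their rank.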
The following vocabularies have been implemented:
- BERT
- ALBERT
- GPT2
- GPT
- Marian
- Pegasus
- ProphetNet
- Reformer
- RoBERTa
- T5
- XLMRoBERTa
- XLNet
- SentencePiece
All vocabularies implement the Vocab trait, which exposes a standard interface for integration with the tokenizers.
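A minimal sketch of what such a standard interface looks like. The trait and struct below are illustrative stand-ins, not the crate's actual Vocab trait, whose exact methods and signatures vary by version:

```rust
use std::collections::HashMap;

// Illustrative stand-in for a vocabulary interface: forward and reverse
// lookup, with a fallback to the unknown-token id for out-of-vocabulary input.
trait VocabLike {
    fn token_to_id(&self, token: &str) -> i64;
    fn id_to_token(&self, id: i64) -> Option<String>;
}

struct SimpleVocab {
    values: HashMap<String, i64>,  // token -> id
    indices: HashMap<i64, String>, // id -> token
    unknown_id: i64,
}

impl SimpleVocab {
    fn new(tokens: &[&str], unknown_token: &str) -> Self {
        let values: HashMap<String, i64> = tokens
            .iter()
            .enumerate()
            .map(|(id, t)| (t.to_string(), id as i64))
            .collect();
        let indices = values.iter().map(|(t, &id)| (id, t.clone())).collect();
        let unknown_id = values[unknown_token];
        SimpleVocab { values, indices, unknown_id }
    }
}

impl VocabLike for SimpleVocab {
    fn token_to_id(&self, token: &str) -> i64 {
        *self.values.get(token).unwrap_or(&self.unknown_id)
    }
    fn id_to_token(&self, id: i64) -> Option<String> {
        self.indices.get(&id).cloned()
    }
}

fn main() {
    let vocab = SimpleVocab::new(&["[UNK]", "hello", "world"], "[UNK]");
    assert_eq!(vocab.token_to_id("hello"), 1);
    assert_eq!(vocab.token_to_id("missing"), 0); // falls back to [UNK]
    assert_eq!(vocab.id_to_token(2), Some("world".to_string()));
    println!("ok");
}
```

Because every model-specific vocabulary satisfies the same interface, a tokenizer can be written once against the trait and reused across models.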
Structs
AlbertVocab | ALBERT Vocab
BaseVocab | Base Vocab
BertVocab | BERT Vocab
BpePairRef | Byte-pair query
BpePairVocab | Byte-Pair Encoding Vocab
Gpt2Vocab | GPT2 Vocab
MarianVocab | Marian Vocab
OpenAiGptVocab | GPT Vocab
PegasusVocab | Pegasus Vocab
ProphetNetVocab | ProphetNet Vocab
ReformerVocab | Reformer Vocab
RobertaVocab | RoBERTa Vocab
SentencePieceModel | SentencePiece Model
SentencePieceVocab | SentencePiece Vocab
T5Vocab | T5 Vocab
XLMRobertaVocab | XLMRoBERTa Vocab
XLNetVocab | XLNet Vocab
Traits
Vocab | Base Vocab trait