Module rust_tokenizers::vocab

Vocabularies

This module contains the vocabularies used by the tokenizers. They provide methods for deserializing vocabulary files and for access by the tokenizers (a loading sketch follows the list below), including:

  • dictionaries (mapping tokens to token ids)
  • merge files (used by Byte-Pair Encoding tokenizers)
  • sentence-piece models (a trie structure and methods to find common-prefix subtokens)
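
The sketch below shows how such a dictionary might be loaded and queried. It assumes a local `vocab.txt` file (hypothetical path) and the `from_file`/`token_to_id`/`id_to_token` methods of the `Vocab` trait; exact signatures can differ between crate versions.

```rust
use rust_tokenizers::vocab::{BertVocab, Vocab};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Deserialize a WordPiece-style dictionary: one token per line,
    // with the line index becoming the token id.
    let vocab = BertVocab::from_file("vocab.txt")?;

    // Dictionary lookups in both directions.
    let id = vocab.token_to_id("hello");
    let token = vocab.id_to_token(&id);
    println!("{} <-> {}", token, id);
    Ok(())
}
```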

The following vocabularies have been implemented:

  • BERT
  • ALBERT
  • GPT2
  • GPT
  • Marian
  • RoBERTa
  • T5
  • XLMRoBERTa
  • XLNet
  • SentencePiece

All vocabularies implement the Vocab trait, exposing a standard interface for integration with the tokenizers.
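
Because of that shared trait, helper code can be written once for every vocabulary type. A minimal sketch (the helper `tokens_to_ids` is hypothetical; only the `Vocab` trait and its `token_to_id` method are assumed, and signatures may vary by version):

```rust
use rust_tokenizers::vocab::Vocab;

/// Look up a batch of tokens with any vocabulary implementing `Vocab`.
/// Out-of-vocabulary tokens resolve to the unknown token's id.
fn tokens_to_ids<V: Vocab>(vocab: &V, tokens: &[&str]) -> Vec<i64> {
    tokens.iter().map(|t| vocab.token_to_id(t)).collect()
}
```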

Structs

  • AlbertVocab (ALBERT vocabulary)
  • BaseVocab (base vocabulary)
  • BertVocab (BERT vocabulary)
  • BpePairRef (byte pair query)
  • BpePairVocab (Byte-Pair Encoding vocabulary)
  • Gpt2Vocab (GPT2 vocabulary)
  • M2M100Vocab (M2M100 vocabulary)
  • MBart50Vocab (MBart50 vocabulary)
  • MarianVocab (Marian vocabulary)
  • OpenAiGptVocab (GPT vocabulary)
  • PegasusVocab (Pegasus vocabulary)
  • ProphetNetVocab (ProphetNet vocabulary)
  • ReformerVocab (Reformer vocabulary)
  • RobertaVocab (RoBERTa vocabulary)
  • SentencePieceBpeModel (SentencePiece BPE model)
  • SentencePieceModel (SentencePiece model)
  • SentencePieceVocab (SentencePiece vocabulary)
  • T5Vocab (T5 vocabulary)
  • XLMRobertaVocab (XLM-RoBERTa vocabulary)
  • XLNetVocab (XLNet vocabulary)

Traits

  • Vocab (base vocabulary trait)