Module rust_tokenizers::vocab

Vocabularies

This module contains the vocabularies leveraged by the tokenizers. These expose methods to deserialize vocabulary files and to access their contents during tokenization, including:

  • dictionaries (mapping from token to token ids)
  • merge files (used by Byte-Pair Encoding tokenizers)
  • sentence-piece models (trie structure and methods to find common prefix subtokens)
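The three kinds of structures above can be sketched minimally in plain Rust. Note that these types are illustrative stand-ins, not the crate's actual structs: a token dictionary is a map from token to id, a merge file yields a map from byte pairs to merge ranks, and a SentencePiece-style model uses a trie to enumerate vocabulary subtokens that prefix the remaining input.

```rust
use std::collections::HashMap;

// 1. Dictionary: token -> token id (illustrative contents).
pub fn build_vocab() -> HashMap<String, i64> {
    let mut v = HashMap::new();
    v.insert("hello".to_string(), 0);
    v.insert("##lo".to_string(), 1);
    v
}

// 2. Merge ranks parsed from a BPE merge file: a pair of subtokens
//    maps to its rank; lower rank means the pair is merged earlier.
pub fn build_merges() -> HashMap<(String, String), usize> {
    let mut m = HashMap::new();
    m.insert(("h".to_string(), "e".to_string()), 0);
    m.insert(("he".to_string(), "llo".to_string()), 1);
    m
}

// 3. Trie supporting common-prefix search over the vocabulary,
//    as used by SentencePiece-style models.
#[derive(Default)]
pub struct TrieNode {
    children: HashMap<char, TrieNode>,
    is_token: bool,
}

impl TrieNode {
    pub fn insert(&mut self, token: &str) {
        let mut node = self;
        for c in token.chars() {
            node = node.children.entry(c).or_default();
        }
        node.is_token = true;
    }

    /// Returns every vocabulary token that is a prefix of `text`.
    pub fn common_prefix_search(&self, text: &str) -> Vec<String> {
        let mut results = Vec::new();
        let mut node = self;
        let mut prefix = String::new();
        for c in text.chars() {
            match node.children.get(&c) {
                Some(next) => {
                    prefix.push(c);
                    if next.is_token {
                        results.push(prefix.clone());
                    }
                    node = next;
                }
                None => break,
            }
        }
        results
    }
}

fn main() {
    let mut trie = TrieNode::default();
    trie.insert("he");
    trie.insert("hello");
    // Candidate subtokens at the start of the input:
    println!("{:?}", trie.common_prefix_search("hello world"));
}
```

The common-prefix search is what lets a tokenizer consider every vocabulary entry that could consume the next characters of the input in a single walk down the trie.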

The following vocabularies have been implemented:

  • BERT
  • ALBERT
  • GPT2
  • GPT
  • Marian
  • MBart-50
  • Pegasus
  • ProphetNet
  • Reformer
  • RoBERTa
  • T5
  • XLMRoBERTa
  • XLNet
  • SentencePiece

All vocabularies implement the Vocab trait, exposing a standard interface for integration with the tokenizers.
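To illustrate the idea of a shared vocabulary interface, here is a hedged sketch of a trait with forward and reverse lookup. The trait and method names below (`VocabLike`, `token_to_id`, `id_to_token`, the unknown-token fallback) are hypothetical simplifications; the crate's actual `Vocab` trait defines its own signatures and also covers file deserialization and special-token handling.

```rust
use std::collections::HashMap;

// Hypothetical minimal interface; NOT the crate's actual `Vocab` trait.
pub trait VocabLike {
    /// Looks up a token's id, falling back to the unknown-token id.
    fn token_to_id(&self, token: &str) -> i64;
    /// Reverse lookup: maps an id back to its token, if any.
    fn id_to_token(&self, id: i64) -> Option<String>;
}

pub struct SimpleVocab {
    values: HashMap<String, i64>,  // token -> id
    indices: HashMap<i64, String>, // id -> token
    unknown_id: i64,
}

impl SimpleVocab {
    pub fn new(tokens: &[&str], unknown_id: i64) -> Self {
        let mut values = HashMap::new();
        let mut indices = HashMap::new();
        for (i, t) in tokens.iter().enumerate() {
            values.insert(t.to_string(), i as i64);
            indices.insert(i as i64, t.to_string());
        }
        SimpleVocab { values, indices, unknown_id }
    }
}

impl VocabLike for SimpleVocab {
    fn token_to_id(&self, token: &str) -> i64 {
        *self.values.get(token).unwrap_or(&self.unknown_id)
    }
    fn id_to_token(&self, id: i64) -> Option<String> {
        self.indices.get(&id).cloned()
    }
}

fn main() {
    let vocab = SimpleVocab::new(&["[UNK]", "hello", "world"], 0);
    println!("{}", vocab.token_to_id("hello"));   // known token
    println!("{}", vocab.token_to_id("missing")); // falls back to unknown id
}
```

Coding tokenizers against a trait like this is what allows each model-specific vocabulary (BERT, GPT2, Marian, ...) to be swapped in without changing the tokenization pipeline itself.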

Structs

AlbertVocab

ALBERT Vocab

BaseVocab

Base Vocab

BertVocab

BERT Vocab

BpePairRef

Byte pair query

BpePairVocab

Byte-Pair Encoding Vocab

Gpt2Vocab

GPT2 Vocab

MBart50Vocab

MBart50 Vocab

MarianVocab

Marian Vocab

OpenAiGptVocab

GPT Vocab

PegasusVocab

Pegasus Vocab

ProphetNetVocab

ProphetNet Vocab

ReformerVocab

Reformer Vocab

RobertaVocab

RoBERTa Vocab

SentencePieceModel

SentencePiece Model

SentencePieceVocab

SentencePiece Vocab

T5Vocab

T5 Vocab

XLMRobertaVocab

XLMRoBERTa Vocab

XLNetVocab

XLNet Vocab

Traits

Vocab

Base Vocab trait