Module rust_tokenizers::vocab

Vocabularies

This module contains the vocabularies used by the tokenizers. They provide methods for deserializing vocabulary files and for access by the tokenizers (a loading sketch follows the list below), including:

  • dictionaries (mapping tokens to token ids)
  • merge files (used by Byte-Pair Encoding tokenizers)
  • sentence-piece models (a trie structure and methods to find common-prefix subtokens)
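
The sketch below shows how such a dictionary might be loaded and queried. It assumes a local `vocab.txt` file (hypothetical path) and the `from_file`/`token_to_id`/`id_to_token` methods of the `Vocab` trait; exact signatures can differ between crate versions.

```rust
use rust_tokenizers::vocab::{BertVocab, Vocab};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Deserialize a WordPiece-style dictionary: one token per line,
    // with the line index becoming the token id.
    let vocab = BertVocab::from_file("vocab.txt")?;

    // Dictionary lookups in both directions.
    let id = vocab.token_to_id("hello");
    let token = vocab.id_to_token(&id);
    println!("{} <-> {}", token, id);
    Ok(())
}
```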

The following vocabularies have been implemented:

  • BERT
  • ALBERT
  • GPT2
  • GPT
  • Marian
  • RoBERTa
  • T5
  • XLMRoBERTa
  • XLNet
  • SentencePiece

All vocabularies implement the Vocab trait, exposing a standard interface for integration with the tokenizers.
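
Because of that shared trait, helper code can be written once for every vocabulary type. A minimal sketch (the helper `tokens_to_ids` is hypothetical; only the `Vocab` trait and its `token_to_id` method are assumed, and signatures may vary by version):

```rust
use rust_tokenizers::vocab::Vocab;

/// Look up a batch of tokens with any vocabulary implementing `Vocab`.
/// Out-of-vocabulary tokens resolve to the unknown token's id.
fn tokens_to_ids<V: Vocab>(vocab: &V, tokens: &[&str]) -> Vec<i64> {
    tokens.iter().map(|t| vocab.token_to_id(t)).collect()
}
```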

Structs

  • AlbertVocab (ALBERT vocabulary)
  • BaseVocab (base vocabulary)
  • BertVocab (BERT vocabulary)
  • BpePairRef (byte pair query)
  • BpePairVocab (Byte-Pair Encoding vocabulary)
  • Gpt2Vocab (GPT2 vocabulary)
  • M2M100Vocab (M2M100 vocabulary)
  • MBart50Vocab (MBart50 vocabulary)
  • MarianVocab (Marian vocabulary)
  • OpenAiGptVocab (GPT vocabulary)
  • PegasusVocab (Pegasus vocabulary)
  • ProphetNetVocab (ProphetNet vocabulary)
  • ReformerVocab (Reformer vocabulary)
  • RobertaVocab (RoBERTa vocabulary)
  • SentencePieceBpeModel (SentencePiece BPE model)
  • SentencePieceModel (SentencePiece model)
  • SentencePieceVocab (SentencePiece vocabulary)
  • T5Vocab (T5 vocabulary)
  • XLMRobertaVocab (XLM-RoBERTa vocabulary)
  • XLNetVocab (XLNet vocabulary)

Traits

  • Vocab (base vocabulary trait)