rust-tokenizers
Rust-tokenizer is a drop-in replacement for the tokenization methods from the Transformers library It includes a broad range of tokenizers for state-of-the-art transformers architectures, including:
- Sentence Piece (unigram model)
- Sentence Piece (BPE model)
- BERT
- ALBERT
- DistilBERT
- RoBERTa
- GPT
- GPT2
- ProphetNet
- CTRL
- Pegasus
- MBart50
- M2M100
The wordpiece based tokenizers include both single-threaded and multi-threaded processing. The Byte-Pair-Encoding tokenizers favor the use of a shared cache and are only available as single-threaded tokenizers Using the tokenizers requires downloading manually the tokenizers required files (vocabulary or merge files). These can be found in the Transformers library.
Usage example
let vocab = new;
let test_sentence = new_from_string;
let bert_tokenizer: BertTokenizer = from_existing_vocab;
println!;