Crate rust_tokenizers

High-performance tokenizers for Rust

This crate contains implementations of common tokenizers used in state-of-the-art language models. It is used as the reference tokenization crate of rust-bert, exposing modern transformer-based models such as BERT, RoBERTa, GPT2, BART, XLNet…

The following tokenizers have been implemented and validated against a Python reference implementation:

  • Sentence Piece (unigram model)
  • BERT
  • DistilBERT
  • RoBERTa
  • FNet
  • GPT
  • GPT2
  • CTRL
  • ProphetNet
  • XLNet
  • Pegasus
  • MBart50
  • M2M100
  • NLLB
  • DeBERTa
  • DeBERTa (v2)

The library is structured into vocabularies (handling the encoding and decoding of tokens, as well as the registration of special tokens) and tokenizers (splitting the input text into tokens). A tokenizer generally contains a reference vocabulary that may be used as part of the tokenization process (for example, a list of subwords or merges).

Usage example

use rust_tokenizers::adapters::Example;
use rust_tokenizers::tokenizer::{BertTokenizer, Tokenizer, TruncationStrategy};
use rust_tokenizers::vocab::{BertVocab, Vocab};

// Load the vocabulary mapping tokens to ids from a file on disk.
let vocab_path = "path/to/vocab";
let vocab = BertVocab::from_file(&vocab_path)?;
let lowercase: bool = true;
let strip_accents: bool = true;

let test_sentence = Example::new_from_string("This is a sample sentence to be tokenized");
// Build a BERT tokenizer that lowercases input and strips accents before splitting.
let bert_tokenizer: BertTokenizer =
    BertTokenizer::from_existing_vocab(vocab, lowercase, strip_accents);

// Encode a single sentence into token ids, truncating to at most 128 tokens.
println!(
    "{:?}",
    bert_tokenizer.encode(
        &test_sentence.sentence_1,
        None,
        128,
        &TruncationStrategy::LongestFirst,
        0
    )
);
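
The ids produced by encode can be mapped back to text through the same tokenizer. The short sketch below continues the example above; it assumes the Tokenizer::decode method and the token_ids field of the returned TokenizedInput, whose exact signatures may vary slightly between crate versions:

let encoded = bert_tokenizer.encode(
    &test_sentence.sentence_1,
    None,
    128,
    &TruncationStrategy::LongestFirst,
    0
);
// Map the ids back to a string, skipping special tokens (e.g. [CLS], [SEP])
// and cleaning up tokenization artifacts such as spaces around punctuation.
let decoded = bert_tokenizer.decode(&encoded.token_ids, true, true);
println!("{}", decoded);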

Modules

Structs

Enums

  • Mask: Type indication for tokens (e.g. special token, white space, unknown…)

Traits

Type Aliases

  • OffsetSize: Crate-wide primitive used to store offset positions
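
For illustration, the hedged sketch below shows how the two items above typically appear together when tokenizing with offset tracking. It reuses the bert_tokenizer from the usage example and assumes the tokenize_with_offsets method together with the tokens, offsets and masks fields of its return value; these names may differ slightly between crate versions:

let tokens_with_offsets = bert_tokenizer.tokenize_with_offsets("This is a sample sentence to be tokenized");
for ((token, offset), mask) in tokens_with_offsets
    .tokens
    .iter()
    .zip(tokens_with_offsets.offsets.iter())
    .zip(tokens_with_offsets.masks.iter())
{
    // Each token carries an optional character span (stored with OffsetSize
    // positions) and a Mask describing its type (special token, white space, unknown…).
    println!("{:?} {:?} {:?}", token, offset, mask);
}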