Crate rust_tokenizers

High-performance tokenizers for Rust

This crate contains implementations of the common tokenizers used in state-of-the-art language models. It is used as the reference tokenization crate for rust-bert, which exposes modern transformer-based models such as BERT, RoBERTa, GPT2, BART, XLNet…

The following tokenizers have been implemented and validated against a Python reference implementation:

  • SentencePiece (unigram model)
  • BERT
  • DistilBERT
  • RoBERTa
  • GPT
  • GPT2
  • CTRL
  • ProphetNet
  • XLNet
  • Pegasus
  • MBart50

The library is structured into vocabularies (for the encoding and decoding of the tokens and registration of special tokens) and tokenizers (splitting the input text into tokens). Generally, a tokenizer will contain a reference vocabulary that may be used as part of the tokenization process (for example, containing a list of subwords or merges).

Usage example

use rust_tokenizers::adapters::Example;
use rust_tokenizers::tokenizer::{BertTokenizer, Tokenizer, TruncationStrategy};
use rust_tokenizers::vocab::{BertVocab, Vocab};

// Load a BERT vocabulary from file (the `?` operator requires running in a function that returns a Result)
let vocab_path = "path/to/vocab";
let vocab = BertVocab::from_file(&vocab_path)?;

// Build a tokenizer from the vocabulary, with lower-casing and accent stripping enabled
let test_sentence = Example::new_from_string("This is a sample sentence to be tokenized");
let bert_tokenizer: BertTokenizer = BertTokenizer::from_existing_vocab(vocab, true, true);

// Encode a single sentence with a maximum length of 128 tokens and the LongestFirst truncation strategy
println!(
    "{:?}",
    bert_tokenizer.encode(
        &test_sentence.sentence_1,
        None,
        128,
        &TruncationStrategy::LongestFirst,
        0
    )
);
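
The encode call returns a TokenizedInput (see the Structs section below), which bundles the token ids with offsets back into the original string. The following is a minimal sketch of how the result could be inspected; the token_ids and token_offsets field names are assumptions based on the struct listing below, not guaranteed by this page:

let tokenized_input = bert_tokenizer.encode(
    &test_sentence.sentence_1,
    None,
    128,
    &TruncationStrategy::LongestFirst,
    0,
);

// Pair each token id with its (optional) offset into the input text;
// special tokens such as [CLS] and [SEP] carry no offset.
// Field names are assumptions, see the lead-in above.
for (id, offset) in tokenized_input
    .token_ids
    .iter()
    .zip(tokenized_input.token_offsets.iter())
{
    println!("token id: {}, offset: {:?}", id, offset);
}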

Modules

adapters

Adapter helpers to load datasets

error

Tokenizer error variants

tokenizer

Tokenizers

vocab

Vocabularies

Structs

ConsolidatedTokenIterator

Iterator over consolidated groups of tokens

Offset

Offset information (in unicode points) to relate a token back to its original input string

Token

Owned token that references the original text but stores its own string representation.

TokenIdsWithOffsets

Encoded sequence

TokenIdsWithSpecialTokens

Encoded input with special tokens

TokenRef

Reference token that references the original text, with a string slice representation

TokenizedInput

Tokenized Input, ready for processing in language models

TokensWithOffsets

Tokenized sequence
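
For tokenization without conversion to ids, a tokenizer can also produce a TokensWithOffsets directly. A rough sketch, reusing the bert_tokenizer from the usage example above and assuming a tokenize_with_offsets method with tokens and offsets fields:

// Sketch only: the method and field names here are assumptions, not guaranteed by this page.
let tokens_with_offsets =
    bert_tokenizer.tokenize_with_offsets("This is a sample sentence to be tokenized");
for (token, offset) in tokens_with_offsets
    .tokens
    .iter()
    .zip(tokens_with_offsets.offsets.iter())
{
    // Each Offset locates the token in the original input (in unicode points)
    println!("{} -> {:?}", token, offset);
}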

Enums

Mask

Type indication for tokens (e.g. special token, whitespace, unknown…)

Traits

ConsolidatableTokens

Trait for collections of tokens that can be consolidated into groups

TokenTrait

Token abstraction trait to access token fields, irrespective of their form (reference or owned)

Type Definitions

OffsetSize

Crate-wide primitive used to store offset positions