Crate rust_tokenizers


§High-performance tokenizers for Rust

This crate contains implementations of the tokenizers used by state-of-the-art language models. It is used as the reference tokenization crate for rust-bert, which exposes modern transformer-based models such as BERT, RoBERTa, GPT2, BART, XLNet…

The following tokenizers have been implemented and validated against a Python reference implementation:

  • SentencePiece (unigram model)
  • BERT
  • DistilBERT
  • RoBERTa
  • FNet
  • GPT
  • GPT2
  • CTRL
  • ProphetNet
  • XLNet
  • Pegasus
  • MBart50
  • M2M100
  • NLLB
  • DeBERTa
  • DeBERTa (v2)

The library is structured into vocabularies (handling the encoding and decoding of tokens and the registration of special tokens) and tokenizers (splitting the input text into tokens). A tokenizer generally contains a reference to a vocabulary that may be used as part of the tokenization process (for example, holding a list of subwords or merges).
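
For example, a vocabulary can be queried directly for token/id mappings. A minimal sketch, assuming a BERT vocabulary file is available at the placeholder path below (method names follow the Vocab trait; exact signatures may differ between crate versions):

use rust_tokenizers::vocab::{BertVocab, Vocab};

fn main() -> Result<(), rust_tokenizers::error::TokenizerError> {
    // Load the vocabulary and query the token/id mapping
    let vocab = BertVocab::from_file("path/to/vocab")?;
    let id = vocab.token_to_id("hello");
    // Unknown tokens resolve to the id of the unknown-token marker
    println!("hello -> {} -> {}", id, vocab.id_to_token(&id));
    Ok(())
}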

§Usage example

use rust_tokenizers::adapters::Example;
use rust_tokenizers::error::TokenizerError;
use rust_tokenizers::tokenizer::{BertTokenizer, Tokenizer, TruncationStrategy};
use rust_tokenizers::vocab::{BertVocab, Vocab};

fn main() -> Result<(), TokenizerError> {
    // Load a BERT WordPiece vocabulary from file
    let vocab_path = "path/to/vocab";
    let vocab = BertVocab::from_file(vocab_path)?;
    let lowercase: bool = true;
    let strip_accents: bool = true;

    let test_sentence = Example::new_from_string("This is a sample sentence to be tokenized");
    let bert_tokenizer: BertTokenizer =
        BertTokenizer::from_existing_vocab(vocab, lowercase, strip_accents);

    // Encode a single sequence: no paired sentence, maximum length of 128,
    // truncate the longest sequence first, no stride for overflowing tokens
    println!(
        "{:?}",
        bert_tokenizer.encode(
            &test_sentence.sentence_1,
            None,
            128,
            &TruncationStrategy::LongestFirst,
            0
        )
    );
    Ok(())
}
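
The TokenizedInput returned by encode contains the token ids together with segment ids, special-token and overflow information. To go back from ids to text, the Tokenizer trait also exposes a decode method. A minimal sketch building on the example above (the exact decode signature may vary between crate versions):

let encoded = bert_tokenizer.encode(
    &test_sentence.sentence_1,
    None,
    128,
    &TruncationStrategy::LongestFirst,
    0,
);
// Skip special tokens ([CLS], [SEP]) and clean up tokenization spaces
let decoded = bert_tokenizer.decode(&encoded.token_ids, true, true);
println!("{}", decoded);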

Modules§

adapters
Adapter helpers to load datasets
error
Tokenizer error variants
tokenizer
Tokenizers
vocab
Vocabularies

Structs§

ConsolidatedTokenIterator
Iterator over groups of tokens that belong together (e.g. sub-tokens forming a single word)
Offset
Offset information (in Unicode code points) to relate a token back to its original input string
Token
Owned token that references the original text but stores its own string representation.
TokenIdsWithOffsets
Encoded sequence
TokenIdsWithSpecialTokens
Encoded input with special tokens
TokenRef
Reference token that references the original text, with a string slice representation
TokenizedInput
Tokenized input, ready for processing in language models
TokensWithOffsets
Tokenized sequence (a usage sketch follows this list)
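
Several of these types appear together when tokenizing with offset tracking. A minimal sketch, assuming the bert_tokenizer from the usage example above (field names follow TokensWithOffsets; treat this as illustrative rather than exhaustive):

// Tokenize while recording where each token maps back to in the input
let output = bert_tokenizer.tokenize_with_offsets("Hello, world!");
for i in 0..output.tokens.len() {
    // `offsets[i]` is None for tokens with no position in the original text;
    // `masks[i]` carries the token's Mask type
    println!(
        "{:?} {:?} {:?}",
        output.tokens[i], output.offsets[i], output.masks[i]
    );
}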

Enums§

Mask
Type indication for tokens (e.g. special token, whitespace, unknown…)

Traits§

ConsolidatableTokens
Trait for token collections that can be iterated over as consolidated groups of tokens
TokenTrait
Token abstraction trait to access token fields, irrespective of their form (referenced or owned)

Type Aliases§

OffsetSize
Crate-wide primitive used to store offset positions