§High performance tokenizers for Rust
This crate contains implementations of the common tokenizers used in state-of-the-art language models. It is used as the reference tokenization crate of rust-bert, which exposes modern transformer-based models such as BERT, RoBERTa, GPT2, BART, XLNet…
The following tokenizers have been implemented and validated against a Python reference implementation:
- Sentence Piece (unigram model)
- BERT
- DistilBERT
- RoBERTa
- FNet
- GPT
- GPT2
- CTRL
- ProphetNet
- XLNet
- Pegasus
- MBart50
- M2M100
- NLLB
- DeBERTa
- DeBERTa (v2)
The library is structured into vocabularies (for the encoding and decoding of the tokens and registration of special tokens) and tokenizers (splitting the input text into tokens). Generally, a tokenizer will contain a reference vocabulary that may be used as part of the tokenization process (for example, containing a list of subwords or merges).
§Usage example
use rust_tokenizers::adapters::Example;
use rust_tokenizers::tokenizer::{BertTokenizer, Tokenizer, TruncationStrategy};
use rust_tokenizers::vocab::{BertVocab, Vocab};

// Load the vocabulary from file and build a lowercasing, accent-stripping BERT tokenizer
let vocab_path = "path/to/vocab";
let vocab = BertVocab::from_file(&vocab_path)?;
let lowercase: bool = true;
let strip_accents: bool = true;
let bert_tokenizer: BertTokenizer =
    BertTokenizer::from_existing_vocab(vocab, lowercase, strip_accents);

// Encode a single sentence, truncating it to at most 128 tokens
let test_sentence = Example::new_from_string("This is a sample sentence to be tokenized");
println!(
    "{:?}",
    bert_tokenizer.encode(
        &test_sentence.sentence_1,
        None,
        128,
        &TruncationStrategy::LongestFirst,
        0
    )
);
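The encode call above returns a TokenizedInput (see the Structs section below). As a minimal, hedged sketch of how that value might be inspected, the snippet below continues from the example above; the field names token_ids, segment_ids and token_offsets are taken from the struct listing and should be checked against the exact crate version in use.

// Sketch only: continues from the example above; field names are assumed from TokenizedInput
let encoded = bert_tokenizer.encode(
    &test_sentence.sentence_1,
    None,
    128,
    &TruncationStrategy::LongestFirst,
    0,
);
// Numerical ids for the model, including special tokens such as [CLS] and [SEP]
println!("token ids: {:?}", encoded.token_ids);
// Segment ids distinguishing the first sequence from an optional second sequence
println!("segment ids: {:?}", encoded.segment_ids);
// Offsets (in unicode points) relating each token back to the original input string
println!("offsets: {:?}", encoded.token_offsets);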
Modules§
- adapters - Adapter helpers to load datasets
- error - Tokenizer error variants
- tokenizer - Tokenizers
- vocab - Vocabularies
Structs§
- ConsolidatedTokenIterator - ConsolidatedTokenIterator
- Offset - Offset information (in unicode points) to relate a token back to its original input string
- Token - Owned token that references the original text but stores its own string representation.
- TokenIdsWithOffsets - Encoded sequence
- TokenIdsWithSpecialTokens - Encoded input with special tokens
- TokenRef - Reference token that references the original text, with a string slice representation
- TokenizedInput - Tokenized input, ready for processing in language models
- TokensWithOffsets - Tokenized sequence
Enums§
- Mask - Type indication for tokens (e.g. special token, white space, unknown…)
Traits§
- ConsolidatableTokens - ConsolidatableTokens
- TokenTrait - Token abstraction trait to access token fields, irrespective of their form (reference or owned)
Type Aliases§
- OffsetSize - Crate-wide primitive used to store offset positions
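To illustrate how the Offset, TokensWithOffsets, Mask and OffsetSize items listed above relate tokens back to the original text, here is a hedged sketch; the tokenize_with_offsets method and the tokens/offsets field names are assumptions based on this listing and may differ across crate versions.

use rust_tokenizers::tokenizer::{BertTokenizer, Tokenizer};
use rust_tokenizers::vocab::{BertVocab, Vocab};
use rust_tokenizers::Offset;

// Sketch only: assumes tokenize_with_offsets returns a TokensWithOffsets with
// parallel `tokens` and `offsets` vectors, as suggested by the listing above.
let vocab = BertVocab::from_file("path/to/vocab")?;
let tokenizer = BertTokenizer::from_existing_vocab(vocab, true, true);

let text = "This is a sample sentence to be tokenized";
let tokenized = tokenizer.tokenize_with_offsets(text);
for (token, offset) in tokenized.tokens.iter().zip(tokenized.offsets.iter()) {
    if let Some(Offset { begin, end }) = offset {
        // Offsets are expressed in unicode points, so index by chars rather than bytes
        let span: String = text
            .chars()
            .skip(*begin as usize)
            .take((*end - *begin) as usize)
            .collect();
        println!("{token} -> {span:?} [{begin}, {end})");
    }
}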