Crate rust_tokenizers
High-performance tokenizers for Rust
This crate contains implementations of common tokenizers used in state-of-the-art language models. It is used as the reference tokenization crate of rust-bert, which exposes modern transformer-based models such as BERT, RoBERTa, GPT2, BART, XLNet…
The following tokenizers have been implemented and validated against a Python reference implementation:
- Sentence Piece (unigram model)
- BERT
- DistilBERT
- RoBERTa
- GPT
- GPT2
- CTRL
- ProphetNet
- XLNet
- Pegasus
- MBart50
The library is structured into vocabularies (for the encoding and decoding of the tokens and registration of special tokens) and tokenizers (splitting the input text into tokens). Generally, a tokenizer will contain a reference vocabulary that may be used as part of the tokenization process (for example, containing a list of subwords or merges).
Usage example
```rust
use rust_tokenizers::adapters::Example;
use rust_tokenizers::error::TokenizerError;
use rust_tokenizers::tokenizer::{BertTokenizer, Tokenizer, TruncationStrategy};
use rust_tokenizers::vocab::{BertVocab, Vocab};

fn main() -> Result<(), TokenizerError> {
    let vocab_path = "path/to/vocab";
    let vocab = BertVocab::from_file(vocab_path)?;

    let test_sentence = Example::new_from_string("This is a sample sentence to be tokenized");
    let bert_tokenizer: BertTokenizer = BertTokenizer::from_existing_vocab(vocab, true, true);

    println!(
        "{:?}",
        bert_tokenizer.encode(
            &test_sentence.sentence_1,
            None,
            128,
            &TruncationStrategy::LongestFirst,
            0
        )
    );
    Ok(())
}
```
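The example above returns a `TokenizedInput` ready for a model. The vocabulary/tokenizer split described earlier also surfaces in the lower-level API. A minimal sketch of the intermediate steps, assuming `tokenize` and `convert_tokens_to_ids` have the shapes shown (check the `tokenizer` module documentation for the exact signatures):

```rust
use rust_tokenizers::error::TokenizerError;
use rust_tokenizers::tokenizer::{BertTokenizer, Tokenizer};
use rust_tokenizers::vocab::{BertVocab, Vocab};

fn main() -> Result<(), TokenizerError> {
    let vocab = BertVocab::from_file("path/to/vocab")?;
    let tokenizer: BertTokenizer = BertTokenizer::from_existing_vocab(vocab, true, true);

    // Split the input into sub-tokens using the tokenizer's reference vocabulary...
    let tokens = tokenizer.tokenize("This is a sample sentence to be tokenized");

    // ...then look the sub-tokens up in the vocabulary to obtain ids.
    let ids = tokenizer.convert_tokens_to_ids(&tokens);
    println!("{:?} -> {:?}", tokens, ids);
    Ok(())
}
```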
Modules
adapters | Adapter helpers to load datasets |
error | Tokenizer error variants |
tokenizer | Tokenizers |
vocab | Vocabularies |
Structs
ConsolidatedTokenIterator | Iterator over token collections that groups sub-tokens belonging to the same consolidated token (e.g. a word) |
Offset | Offset information (in unicode points) to relate a token back to its original input string |
Token | Owned token that references the original text but stores its own string representation. |
TokenIdsWithOffsets | Encoded sequence |
TokenIdsWithSpecialTokens | Encoded input with special tokens |
TokenRef | Reference token that references the original text, with a string slice representation |
TokenizedInput | Tokenized Input, ready for processing in language models |
TokensWithOffsets | Tokenized sequence |
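Most of these types are produced by `Tokenizer::encode` and relate tokens back to the input text. A minimal sketch of reading offsets out of a `TokenizedInput`, assuming the field names shown (`token_ids`, `token_offsets`) and that `Offset` exposes `begin`/`end` positions in unicode points:

```rust
use rust_tokenizers::error::TokenizerError;
use rust_tokenizers::tokenizer::{BertTokenizer, Tokenizer, TruncationStrategy};
use rust_tokenizers::vocab::{BertVocab, Vocab};

fn main() -> Result<(), TokenizerError> {
    let vocab = BertVocab::from_file("path/to/vocab")?;
    let tokenizer: BertTokenizer = BertTokenizer::from_existing_vocab(vocab, true, true);

    let tokenized =
        tokenizer.encode("This is a sample sentence", None, 128, &TruncationStrategy::LongestFirst, 0);

    // Each token id is paired with an optional Offset (special tokens such as
    // [CLS]/[SEP] have none) locating the token in the original input.
    for (id, offset) in tokenized.token_ids.iter().zip(tokenized.token_offsets.iter()) {
        match offset {
            Some(o) => println!("id {} covers unicode points {}..{}", id, o.begin, o.end),
            None => println!("id {} is a special token", id),
        }
    }
    Ok(())
}
```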
Enums
Mask | Type indication for tokens (e.g. special token, white space, unknown…) |
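The `mask` field of a tokenized sequence carries one `Mask` entry per token, which makes it easy to filter tokens by type. A minimal sketch, assuming a `Mask::Special` variant and the parallel id/mask vectors produced by `encode`:

```rust
use rust_tokenizers::Mask;

// Keep only content token ids, dropping those flagged as special tokens.
fn content_ids(token_ids: &[i64], masks: &[Mask]) -> Vec<i64> {
    token_ids
        .iter()
        .zip(masks.iter())
        .filter(|(_, m)| !matches!(m, Mask::Special))
        .map(|(id, _)| *id)
        .collect()
}
```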
Traits
ConsolidatableTokens | Trait for collections of tokens that can be grouped into consolidated tokens |
TokenTrait | Token abstraction trait to access token fields, irrespective of their form (reference or owned) |
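Because both `Token` and `TokenRef` implement `TokenTrait`, code can be written once for either form. A hypothetical sketch, assuming `as_str` and `mask` accessors on the trait and the constructors shown (`Token::new`, `TokenRef::new`):

```rust
use rust_tokenizers::{Token, TokenRef, TokenTrait};

// Generic over the token form: works for owned Tokens and borrowed TokenRefs alike.
fn describe<T: TokenTrait>(token: &T) -> String {
    format!("'{}' (mask: {:?})", token.as_str(), token.mask())
}

fn main() {
    let owned = Token::new("hello".to_string());
    let borrowed = TokenRef::new("world", &[0, 1, 2, 3, 4]);
    println!("{}", describe(&owned));
    println!("{}", describe(&borrowed));
}
```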
Type Definitions
OffsetSize | Crate-wide primitive used to store offset positions |