Struct rust_tokenizers::tokenizer::MBart50Tokenizer
pub struct MBart50Tokenizer { /* fields omitted */ }
MBart50 tokenizer
MBart50 tokenizer performing:
- Splitting on language and special tokens
- Text cleaning
- NFKC decomposition
- (optional) lower casing
- SentencePiece decomposition
Implementations
Create a new instance of an MBart50Tokenizer
Expects a SentencePiece protobuf file as an input.
Parameters
- path (&str): path to the SentencePiece model file
- lower_case (bool): flag indicating if the text should be lower-cased as part of the tokenization
Example
use rust_tokenizers::tokenizer::{MBart50Tokenizer, Tokenizer};
let lower_case = false;
let tokenizer = MBart50Tokenizer::from_file("path/to/vocab/file", lower_case).unwrap();
pub fn from_existing_vocab_and_model(
vocab: MBart50Vocab,
model: SentencePieceModel,
lower_case: bool
) -> MBart50Tokenizer
Create a new instance of an MBart50Tokenizer from an existing vocabulary and model
Parameters
- vocab (MBart50Vocab): vocabulary
- model (SentencePieceModel): SentencePiece model
- lower_case (bool): flag indicating if the text should be lower-cased as part of the tokenization
Example
use rust_tokenizers::tokenizer::{MBart50Tokenizer, Tokenizer};
use rust_tokenizers::vocab::{MBart50Vocab, SentencePieceModel, Vocab};
let lower_case = false;
let vocab = MBart50Vocab::from_file("path/to/vocab/file").unwrap();
let model = SentencePieceModel::from_file("path/to/model/file").unwrap();
let tokenizer = MBart50Tokenizer::from_existing_vocab_and_model(vocab, model, lower_case);
Trait Implementations
Tokenize a list of strings (with multithreading), where each string corresponds, for example, to a sentence, returning a vector of TokensWithOffsets containing the tokens and their offset information. This calls tokenize_with_offsets on the list provided. Read more
Multithreaded tokenization of a list of strings, returning tokens with offset information Read more
Multithreaded encoding of a sequence of string-like texts (tokenization followed by encoding). Note that in contrast with encode's optional second text, each text provided is encoded independently. Read more
Multithreaded encoding of a sequence of string-like text pairs (tokenization followed by encoding). This combines encode with the list processing of encode_list. Read more
Multithreaded conversion of a list of sequences of ids (integers) into strings, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces. This calls decode for each provided sequence of ids. Read more
Returns a reference to the tokenizer vocabulary
Tokenize a TokenRef, returning a sequence of tokens Read more
Converts a sequence of strings into a single string. This will clean up artifacts from tokenization (for example sub ##word) and generate a single output string. Read more
fn build_input_with_special_tokens(
&self,
tokens_ids_with_offsets_1: TokenIdsWithOffsets,
tokens_ids_with_offsets_2: Option<TokenIdsWithOffsets>
) -> TokenIdsWithSpecialTokens
Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. Read more
Tokenize a string, returning a vector of tokens as strings. Use tokenize_with_offsets or tokenize_to_tokens to return offset information. Read more
Tokenize a string, returning tokens with offset information Read more
Tokenize a list of strings, returning tokens with offset information Read more
Tokenize a list of strings, where each string corresponds, for example, to a sentence, returning a vector of TokensWithOffsets containing the tokens and their offset information. This calls tokenize_with_offsets on the list provided. Read more
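As a hypothetical sketch of the tokenize family (the file path is illustrative, and a valid SentencePiece model is assumed to exist at that location):

```rust
use rust_tokenizers::tokenizer::{MBart50Tokenizer, Tokenizer};

let tokenizer = MBart50Tokenizer::from_file("path/to/vocab/file", false).unwrap();

// Plain tokenization: a vector of token strings, no offset information
let tokens: Vec<String> = tokenizer.tokenize("Hello, world!");

// Tokenization keeping offsets into the original text
let tokens_with_offsets = tokenizer.tokenize_with_offsets("Hello, world!");

// Batch variant over a list of sentences
let batch = tokenizer.tokenize_list(&["First sentence.", "Second sentence."]);
```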
Convert a slice of string-like to a vector of token indices Read more
fn encode(
&self,
text_1: &str,
text_2: Option<&str>,
max_len: usize,
truncation_strategy: &TruncationStrategy,
stride: usize
) -> TokenizedInput
Encode a string-like (tokenization followed by encoding) Read more
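A hypothetical sketch of calling encode with the signature shown above (the file path is illustrative; TokenizedInput's token_ids field is assumed to hold the resulting ids):

```rust
use rust_tokenizers::tokenizer::{MBart50Tokenizer, Tokenizer, TruncationStrategy};

let tokenizer = MBart50Tokenizer::from_file("path/to/vocab/file", false).unwrap();

// Encode a single text, truncating to at most 128 ids
let encoded = tokenizer.encode(
    "Hello, world!",
    None,                              // no second text
    128,                               // max_len
    &TruncationStrategy::LongestFirst, // truncation strategy
    0,                                 // stride
);
// encoded.token_ids now contains the ids, including special tokens
```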
Encode a sequence of string-like texts (tokenization followed by encoding). Note that in contrast with encode's optional second text, each text provided is encoded independently. Read more
Encode a sequence of string-like text pairs (tokenization followed by encoding). This combines encode with the list processing of encode_list. Read more
Decode a sequence of token indices to a sequence of Strings, optionally skipping special indices Read more
Converts a sequence of ids (integer) into a string, using the tokenizer and vocabulary with options to remove special tokens and clean up tokenization spaces. Read more
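A hypothetical round-trip sketch of decode (the file path and the ids are illustrative; the decode argument order shown is assumed):

```rust
use rust_tokenizers::tokenizer::{MBart50Tokenizer, Tokenizer, TruncationStrategy};

let tokenizer = MBart50Tokenizer::from_file("path/to/vocab/file", false).unwrap();

let encoded = tokenizer.encode(
    "Hello, world!",
    None,
    128,
    &TruncationStrategy::LongestFirst,
    0,
);

// Decode back to text, dropping special tokens and cleaning up spaces
let skip_special_tokens = true;
let clean_up_tokenization_spaces = true;
let text = tokenizer.decode(
    &encoded.token_ids,
    skip_special_tokens,
    clean_up_tokenization_spaces,
);
```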
Cleans up tokenization artifacts (for example whitespace before punctuation) Read more
Converts a list of sequences of ids (integers) into strings, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces. This calls decode for each provided sequence of ids. Read more