pub struct RobertaTokenizer { /* private fields */ }

RoBERTa tokenizer

The RoBERTa tokenizer performs the following (a short usage sketch follows the list):

  • splitting on special characters
  • whitespace splitting
  • (optional) lower casing
  • BPE tokenization
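
A minimal end-to-end sketch is shown below; the vocabulary and merges paths are placeholders, and the constructors and the TruncationStrategy type are documented further down this page.

use rust_tokenizers::tokenizer::{RobertaTokenizer, Tokenizer, TruncationStrategy};

let tokenizer = RobertaTokenizer::from_file(
    "path/to/vocab/file",
    "path/to/merges/file",
    false, // lower_case
    true,  // add_prefix_space
)
.unwrap();

// Tokenization into sub-word strings
let tokens = tokenizer.tokenize("Hello, world!");

// Tokenization followed by encoding into ids, truncating to at most 128 tokens
let encoded = tokenizer.encode("Hello, world!", None, 128, &TruncationStrategy::LongestFirst, 0);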

Implementations

impl RobertaTokenizer

pub fn from_file<P: AsRef<Path>, M: AsRef<Path>>(vocab_path: P, merges_path: M, lower_case: bool, add_prefix_space: bool) -> Result<RobertaTokenizer, TokenizerError>

Create a new instance of a RobertaTokenizer. Expects a vocabulary JSON file and a merges file as inputs.

Parameters
  • vocab_path (AsRef<Path>): path to the vocabulary (JSON) file
  • merges_path (AsRef<Path>): path to the merges file (used as part of the BPE encoding process)
  • lower_case (bool): flag indicating whether the text should be lower-cased as part of the tokenization
  • add_prefix_space (bool): flag indicating whether a leading space should be prepended to the input, so that the first word is tokenized like any other space-prefixed word
Example
use rust_tokenizers::tokenizer::{RobertaTokenizer, Tokenizer};
let lower_case = false;
let add_prefix_space = true;
let tokenizer = RobertaTokenizer::from_file(
    "path/to/vocab/file",
    "path/to/merges/file",
    lower_case,
    add_prefix_space,
)
.unwrap();

pub fn from_file_with_special_token_mapping<V: AsRef<Path>, M: AsRef<Path>, S: AsRef<Path>>(vocab_path: V, merges_path: M, lower_case: bool, add_prefix_space: bool, special_token_mapping_path: S) -> Result<RobertaTokenizer, TokenizerError>

Create a new instance of a RobertaTokenizer. Expects a vocabulary JSON file, a merges file and a special token mapping file as inputs.

Parameters
  • vocab_path (AsRef<Path>): path to the vocabulary (JSON) file
  • merges_path (AsRef<Path>): path to the merges file (used as part of the BPE encoding process)
  • lower_case (bool): flag indicating whether the text should be lower-cased as part of the tokenization
  • add_prefix_space (bool): flag indicating whether a leading space should be prepended to the input, so that the first word is tokenized like any other space-prefixed word
  • special_token_mapping_path (AsRef<Path>): path to a special token mapping file to overwrite default special tokens
Example
use rust_tokenizers::tokenizer::{RobertaTokenizer, Tokenizer};

let lower_case = false;
let add_prefix_space = true;
let tokenizer = RobertaTokenizer::from_file_with_special_token_mapping(
    "path/to/vocab/file",
    "path/to/merges/file",
    lower_case,
    add_prefix_space,
    "path/to/special/token/mapping/file",
)
.unwrap();
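
For reference, the special token mapping file is a plain JSON object mapping special-token names to their surface forms. The sketch below is an assumption based on RoBERTa's customary special tokens; the exact set of supported keys should be checked against the crate version in use.

{
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "<unk>",
    "sep_token": "</s>",
    "pad_token": "<pad>",
    "cls_token": "<s>",
    "mask_token": "<mask>"
}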

pub fn from_existing_vocab_and_merges(vocab: RobertaVocab, merges: BpePairVocab, lower_case: bool, add_prefix_space: bool) -> RobertaTokenizer

Create a new instance of a RobertaTokenizer from an existing vocabulary and merges.

Parameters
  • vocab (RobertaVocab): GPT-like vocabulary
  • merges (BpePairVocab): BPE pairs vocabulary
  • lower_case (bool): flag indicating whether the text should be lower-cased as part of the tokenization
  • add_prefix_space (bool): flag indicating whether a leading space should be prepended to the input, so that the first word is tokenized like any other space-prefixed word
Example
use rust_tokenizers::tokenizer::{RobertaTokenizer, Tokenizer};
use rust_tokenizers::vocab::{BpePairVocab, RobertaVocab, Vocab};
let lower_case = false;
let add_prefix_space = true;
let vocab = RobertaVocab::from_file("path/to/vocab/file").unwrap();
let merges = BpePairVocab::from_file("path/to/merges/file").unwrap();

let tokenizer = RobertaTokenizer::from_existing_vocab_and_merges(
    vocab,
    merges,
    lower_case,
    add_prefix_space,
);

Trait Implementations

impl MultiThreadedTokenizer<RobertaVocab> for RobertaTokenizer

fn vocab(&self) -> &RobertaVocab

Returns a reference to the tokenizer vocabulary.

fn tokenize_list_with_offsets<S>(&self, text_list: &[S]) -> Vec<TokensWithOffsets> where S: AsRef<str> + Sync

Tokenize a list of strings (with multithreading), where each string corresponds to, for example, a sentence. Returns a vector of TokensWithOffsets containing the tokens and their offset information. This calls tokenize_with_offsets on each element of the list provided.

fn tokenize_list<S>(&self, text_list: &[S]) -> Vec<Vec<String>> where S: AsRef<str> + Sync

Multithreaded tokenization of a list of strings, returning a vector of tokens for each input string.

fn encode_list<S>(&self, text_list: &[S], max_len: usize, truncation_strategy: &TruncationStrategy, stride: usize) -> Vec<TokenizedInput> where S: AsRef<str> + Sync

Multithreaded encoding of a sequence of string-like texts (tokenization followed by encoding). Note that, in contrast with the optional second text of encode, each text provided is encoded independently.

fn encode_pair_list<S>(&self, text_list: &[(S, S)], max_len: usize, truncation_strategy: &TruncationStrategy, stride: usize) -> Vec<TokenizedInput> where S: AsRef<str> + Sync

Multithreaded encoding of a sequence of string-like text pairs (tokenization followed by encoding). This combines encode with the list processing of encode_list.

fn decode_list(&self, token_ids_list: &[Vec<i64>], skip_special_tokens: bool, clean_up_tokenization_spaces: bool) -> Vec<String>

Multithreaded conversion of a list of sequences of ids (integers) into strings, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces. This calls decode for each provided sequence of ids.
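
As an illustration of the multithreaded batch interface, a sketch using encode_list is shown below, assuming a tokenizer constructed as in the examples above.

use rust_tokenizers::tokenizer::{MultiThreadedTokenizer, TruncationStrategy};

let texts = ["The first sentence.", "A second, slightly longer sentence."];
// Both traits define encode_list; calling through MultiThreadedTokenizer
// selects the parallel version. Each text is encoded independently.
let batch = MultiThreadedTokenizer::encode_list(&tokenizer, &texts, 128, &TruncationStrategy::LongestFirst, 0);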

impl Tokenizer<RobertaVocab> for RobertaTokenizer

fn vocab(&self) -> &RobertaVocab

Returns a reference to the tokenizer vocabulary.

fn vocab_mut(&mut self) -> &mut RobertaVocab

Returns a mutable reference to the tokenizer vocabulary.

fn tokenize_to_tokens(&self, initial_token: TokenRef<'_>) -> Vec<Token>

Tokenize a TokenRef, returning a sequence of tokens.

fn convert_tokens_to_string(&self, tokens: Vec<String>) -> String

Converts a sequence of tokens into a single string. This cleans up artifacts from tokenization (for example sub ##word) and generates a single output string.
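
For example (a sketch; the exact output depends on the loaded vocabulary):

let tokens: Vec<String> = tokenizer.tokenize("Hello, world!");
// Re-assemble the sub-word tokens into a single readable string
let text = tokenizer.convert_tokens_to_string(tokens);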

fn build_input_with_special_tokens(&self, tokens_ids_with_offsets_1: TokenIdsWithOffsets, tokens_ids_with_offsets_2: Option<TokenIdsWithOffsets>) -> TokenIdsWithSpecialTokens

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens.
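
For RoBERTa, the resulting layout follows the model's usual convention (shown schematically; the actual marker strings come from the loaded vocabulary):

<s> sequence_1 </s>                        single sequence
<s> sequence_1 </s> </s> sequence_2 </s>   pair of sequences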

fn tokenize(&self, text: &str) -> Vec<String>

Tokenize a string, returning a vector of tokens as strings. Use tokenize_with_offsets or tokenize_to_tokens to also return offset information.

fn tokenize_with_offsets(&self, text: &str) -> TokensWithOffsets

Tokenize a string, returning tokens with offset information.

fn tokenize_list<S>(&self, text_list: &[S]) -> Vec<Vec<String>> where S: AsRef<str>

Tokenize a list of strings, returning a vector of tokens for each input string.

fn tokenize_list_with_offsets<S>(&self, text_list: &[S]) -> Vec<TokensWithOffsets> where S: AsRef<str>

Tokenize a list of strings, where each string corresponds to, for example, a sentence. Returns a vector of TokensWithOffsets containing the tokens and their offset information. This calls tokenize_with_offsets on each element of the list provided.

fn convert_tokens_to_ids<S>(&self, tokens: &[S]) -> Vec<i64> where S: AsRef<str>

Convert a slice of string-like tokens to a vector of token indices.
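
A short sketch (the resulting ids depend entirely on the loaded vocabulary):

let tokens = tokenizer.tokenize("Hello, world!");
// Look up the vocabulary index of each token string
let ids: Vec<i64> = tokenizer.convert_tokens_to_ids(&tokens);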

fn encode(&self, text_1: &str, text_2: Option<&str>, max_len: usize, truncation_strategy: &TruncationStrategy, stride: usize) -> TokenizedInput

Encode a string-like input, together with an optional second text (tokenization followed by encoding).
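
For instance (a sketch; the token_ids field of TokenizedInput carries the resulting indices):

use rust_tokenizers::tokenizer::TruncationStrategy;

// Single text
let single = tokenizer.encode("Hello, world!", None, 128, &TruncationStrategy::LongestFirst, 0);
// Text pair: both texts are combined into a single input with special tokens
let pair = tokenizer.encode("First text.", Some("Second text."), 128, &TruncationStrategy::LongestFirst, 0);
let ids = &pair.token_ids;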

fn encode_list<S>(&self, text_list: &[S], max_len: usize, truncation_strategy: &TruncationStrategy, stride: usize) -> Vec<TokenizedInput> where S: AsRef<str>

Encode a sequence of string-like texts (tokenization followed by encoding). Note that, in contrast with the optional second text of encode, each text provided is encoded independently.

fn encode_pair_list<S>(&self, text_list: &[(S, S)], max_len: usize, truncation_strategy: &TruncationStrategy, stride: usize) -> Vec<TokenizedInput> where S: AsRef<str>

Encode a sequence of string-like text pairs (tokenization followed by encoding). This combines encode with the list processing of encode_list.

fn decode_to_vec(&self, token_ids: &[i64], skip_special_tokens: bool) -> Vec<String>

Decode a sequence of token indices to a sequence of Strings, optionally skipping special indices.

fn decode(&self, token_ids: &[i64], skip_special_tokens: bool, clean_up_tokenization_spaces: bool) -> String

Converts a sequence of ids (integers) into a string, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces.
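
A round-trip sketch building on the encode example above:

let encoded = tokenizer.encode("Hello, world!", None, 128, &TruncationStrategy::LongestFirst, 0);
// Recover a readable string, dropping special tokens and cleaning up spacing
let text = tokenizer.decode(&encoded.token_ids, true, true);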

fn clean_up_tokenization(&self, input_string: String) -> String

Cleans up tokenization artifacts (for example whitespace before punctuation).

fn decode_list(&self, token_ids_list: &[Vec<i64>], skip_special_tokens: bool, clean_up_tokenization_spaces: bool) -> Vec<String>

Converts a list of sequences of ids (integers) into strings, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces. This calls decode for each provided sequence of ids.

fn add_tokens(&mut self, tokens: &[&str])

Add arbitrary tokens to the vocabulary.

fn add_extra_ids(&mut self, num_extra_ids: i64)

Add a given number of extra ids to the vocabulary, as automatically generated tokens (of the form <extra_id_N>).
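
A sketch of both vocabulary-extension methods; the tokenizer must be mutable, and <my_token> is a made-up token name used for illustration:

use rust_tokenizers::tokenizer::{RobertaTokenizer, Tokenizer};

let mut tokenizer = RobertaTokenizer::from_file(
    "path/to/vocab/file",
    "path/to/merges/file",
    false,
    true,
)
.unwrap();
// Add custom tokens to the vocabulary
tokenizer.add_tokens(&["<my_token>"]);
// Add four automatically generated extra-id tokens
tokenizer.add_extra_ids(4);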
