Struct rust_tokenizers::tokenizer::RobertaTokenizer
pub struct RobertaTokenizer { /* private fields */ }
RoBERTa tokenizer
RoBERTa tokenizer performing:
- splitting on special characters
- whitespace splitting
- (optional) lower casing
- BPE tokenization
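The BPE step merges adjacent symbol pairs according to ranks learned from the merges file: the pair with the best (lowest) rank is merged first, repeatedly, until no mergeable pair remains. A pure-std sketch of that merge loop (illustrative only, not the crate's implementation; the symbols and ranks below are made up):

```rust
use std::collections::HashMap;

/// Repeatedly merge the adjacent symbol pair with the lowest (best) rank,
/// mirroring how a BPE merges file drives tokenization.
fn bpe_merge(symbols: &mut Vec<String>, ranks: &HashMap<(String, String), usize>) {
    loop {
        // Find the adjacent pair with the best (lowest) rank, if any.
        let best = symbols
            .windows(2)
            .enumerate()
            .filter_map(|(i, w)| ranks.get(&(w[0].clone(), w[1].clone())).map(|&r| (r, i)))
            .min();
        match best {
            Some((_, i)) => {
                // Merge the pair at position i into a single symbol.
                let merged = format!("{}{}", symbols[i], symbols[i + 1]);
                symbols[i] = merged;
                symbols.remove(i + 1);
            }
            None => break, // no mergeable pair left
        }
    }
}

fn main() {
    // Hypothetical merge ranks: "l o" merges first, then "lo w".
    let mut ranks = HashMap::new();
    ranks.insert(("l".to_string(), "o".to_string()), 0);
    ranks.insert(("lo".to_string(), "w".to_string()), 1);

    let mut symbols: Vec<String> = ["l", "o", "w", "e", "r"].iter().map(|s| s.to_string()).collect();
    bpe_merge(&mut symbols, &ranks);
    println!("{:?}", symbols); // ["low", "e", "r"]
}
```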
Implementations§
impl RobertaTokenizer
pub fn from_file<P: AsRef<Path>, M: AsRef<Path>>(
    vocab_path: P,
    merges_path: M,
    lower_case: bool,
    add_prefix_space: bool
) -> Result<RobertaTokenizer, TokenizerError>
Create a new instance of a RobertaTokenizer.
Expects a vocabulary JSON file and a merges file as inputs.
Parameters
- vocab_path (&str): path to the vocabulary file
- merges_path (&str): path to the merges file (used as part of the BPE encoding process)
- lower_case (bool): flag indicating if the text should be lower-cased as part of the tokenization
- add_prefix_space (bool): flag indicating if a leading space should be prepended to the input before tokenization
Example
use rust_tokenizers::tokenizer::{RobertaTokenizer, Tokenizer};
let lower_case = false;
let add_prefix_space = true;
let tokenizer = RobertaTokenizer::from_file(
"path/to/vocab/file",
"path/to/merges/file",
lower_case,
add_prefix_space,
)
.unwrap();
pub fn from_file_with_special_token_mapping<V: AsRef<Path>, M: AsRef<Path>, S: AsRef<Path>>(
    vocab_path: V,
    merges_path: M,
    lower_case: bool,
    add_prefix_space: bool,
    special_token_mapping_path: S
) -> Result<RobertaTokenizer, TokenizerError>
Create a new instance of a RobertaTokenizer.
Expects a vocabulary JSON file, a merges file and a special token mapping file as inputs.
Parameters
- vocab_path (&str): path to the vocabulary file
- merges_path (&str): path to the merges file (used as part of the BPE encoding process)
- lower_case (bool): flag indicating if the text should be lower-cased as part of the tokenization
- add_prefix_space (bool): flag indicating if a leading space should be prepended to the input before tokenization
- special_token_mapping_path (&str): path to a special token mapping file to overwrite default special tokens
Example
use rust_tokenizers::tokenizer::{RobertaTokenizer, Tokenizer};
let lower_case = false;
let add_prefix_space = true;
let tokenizer = RobertaTokenizer::from_file_with_special_token_mapping(
"path/to/vocab/file",
"path/to/merges/file",
lower_case,
add_prefix_space,
"path/to/special/token/mapping/file",
)
.unwrap();
pub fn from_existing_vocab_and_merges(
    vocab: RobertaVocab,
    merges: BpePairVocab,
    lower_case: bool,
    add_prefix_space: bool
) -> RobertaTokenizer
Create a new instance of a RobertaTokenizer from an existing vocabulary and merges.
Parameters
- vocab (RobertaVocab): GPT-like vocabulary
- merges (BpePairVocab): BPE pairs vocabulary
- lower_case (bool): flag indicating if the text should be lower-cased as part of the tokenization
- add_prefix_space (bool): flag indicating if a leading space should be prepended to the input before tokenization
Example
use rust_tokenizers::tokenizer::{RobertaTokenizer, Tokenizer};
use rust_tokenizers::vocab::{BpePairVocab, RobertaVocab, Vocab};
let lower_case = false;
let add_prefix_space = true;
let vocab = RobertaVocab::from_file("path/to/vocab/file").unwrap();
let merges = BpePairVocab::from_file("path/to/merges/file").unwrap();
let tokenizer = RobertaTokenizer::from_existing_vocab_and_merges(
vocab,
merges,
lower_case,
add_prefix_space,
);
Trait Implementations§
impl MultiThreadedTokenizer<RobertaVocab> for RobertaTokenizer
fn tokenize_list_with_offsets<S>(
    &self,
    text_list: &[S]
) -> Vec<TokensWithOffsets>
where
    S: AsRef<str> + Sync,
Tokenize a list of strings (with multithreading), where each corresponds to, for example, a sentence. Returns a vector of TokensWithOffsets containing the tokens and their offset information. This calls tokenize_with_offsets on the list provided.
fn tokenize_list<S>(&self, text_list: &[S]) -> Vec<Vec<String>>
where
    S: AsRef<str> + Sync,
Multithreaded tokenization of a list of strings, returning a vector of tokens for each string.
fn encode_list<S>(
    &self,
    text_list: &[S],
    max_len: usize,
    truncation_strategy: &TruncationStrategy,
    stride: usize
) -> Vec<TokenizedInput>
where
    S: AsRef<str> + Sync,
Multithreaded encoding of a sequence of string-like texts (tokenization followed by encoding). Note that in contrast with the optional second text of encode, each text provided is encoded independently.
fn encode_pair_list<S>(
    &self,
    text_list: &[(S, S)],
    max_len: usize,
    truncation_strategy: &TruncationStrategy,
    stride: usize
) -> Vec<TokenizedInput>
where
    S: AsRef<str> + Sync,
Multithreaded encoding of a sequence of string-like text pairs (tokenization followed by encoding). This combines encode with the list processing of encode_list.
fn decode_list(
    &self,
    token_ids_list: &[Vec<i64>],
    skip_special_tokens: bool,
    clean_up_tokenization_spaces: bool
) -> Vec<String>
Multithreaded conversion of a list of sequences of ids (integers) into strings, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces. This calls decode for each provided sequence of ids.
impl Tokenizer<RobertaVocab> for RobertaTokenizer
fn vocab(&self) -> &RobertaVocab
Returns a reference to the tokenizer vocabulary.
fn vocab_mut(&mut self) -> &mut RobertaVocab
Returns a mutable reference to the tokenizer vocabulary.
fn tokenize_to_tokens(&self, initial_token: TokenRef<'_>) -> Vec<Token>
Tokenize a TokenRef, returning a sequence of tokens.
fn convert_tokens_to_string(&self, tokens: Vec<String>) -> String
Converts a sequence of strings into a single string. This cleans up artifacts from tokenization (for example sub ##word) and generates a single output string.
fn build_input_with_special_tokens(
    &self,
    tokens_ids_with_offsets_1: TokenIdsWithOffsets,
    tokens_ids_with_offsets_2: Option<TokenIdsWithOffsets>
) -> TokenIdsWithSpecialTokens
Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens.
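For RoBERTa, a single sequence becomes <s> A </s> and a pair becomes <s> A </s></s> B </s>, with the pair separated by a doubled </s>. A minimal sketch of that concatenation over raw token ids (the ids 0 and 2 for <s> and </s> are the usual RoBERTa defaults, assumed here; the other ids are made up):

```rust
/// Concatenate token ids with RoBERTa-style special tokens:
/// <s> A </s> for one sequence, <s> A </s></s> B </s> for a pair.
fn build_input(ids_1: &[i64], ids_2: Option<&[i64]>, bos: i64, eos: i64) -> Vec<i64> {
    let mut out = Vec::new();
    out.push(bos);
    out.extend_from_slice(ids_1);
    out.push(eos);
    if let Some(ids) = ids_2 {
        out.push(eos); // RoBERTa separates a pair with a doubled </s>
        out.extend_from_slice(ids);
        out.push(eos);
    }
    out
}

fn main() {
    let (bos, eos) = (0, 2); // usual RoBERTa ids for <s> and </s>
    println!("{:?}", build_input(&[10, 11], None, bos, eos)); // [0, 10, 11, 2]
    println!("{:?}", build_input(&[10, 11], Some(&[12]), bos, eos)); // [0, 10, 11, 2, 2, 12, 2]
}
```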
fn tokenize(&self, text: &str) -> Vec<String>
Tokenize a string, returning a vector of tokens as strings. Use tokenize_with_offsets or tokenize_to_tokens to return offset information.
fn tokenize_with_offsets(&self, text: &str) -> TokensWithOffsets
Tokenize a string, returning tokens with offset information.
fn tokenize_list<S>(&self, text_list: &[S]) -> Vec<Vec<String>>
where
    S: AsRef<str>,
Tokenize a list of strings, returning a vector of tokens for each string.
fn tokenize_list_with_offsets<S>(
    &self,
    text_list: &[S]
) -> Vec<TokensWithOffsets>
where
    S: AsRef<str>,
Tokenize a list of strings, where each corresponds to, for example, a sentence. Returns a vector of TokensWithOffsets containing the tokens and their offset information. This calls tokenize_with_offsets on the list provided.
fn convert_tokens_to_ids<S>(&self, tokens: &[S]) -> Vec<i64>
where
    S: AsRef<str>,
Convert a slice of string-like values to a vector of token indices.
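Conceptually, this conversion is a vocabulary lookup with a fallback to the unknown-token id. A pure-std sketch (the token strings and ids below are made up):

```rust
use std::collections::HashMap;

/// Map each token to its id, falling back to the unknown-token id.
fn convert_tokens_to_ids(tokens: &[&str], vocab: &HashMap<&str, i64>, unk_id: i64) -> Vec<i64> {
    tokens.iter().map(|t| *vocab.get(t).unwrap_or(&unk_id)).collect()
}

fn main() {
    let vocab: HashMap<&str, i64> = [("hello", 4), ("world", 5)].into_iter().collect();
    // "mars" is out of vocabulary and maps to the unk id (3 here).
    println!("{:?}", convert_tokens_to_ids(&["hello", "mars"], &vocab, 3)); // [4, 3]
}
```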
fn encode(
    &self,
    text_1: &str,
    text_2: Option<&str>,
    max_len: usize,
    truncation_strategy: &TruncationStrategy,
    stride: usize
) -> TokenizedInput
Encode a string-like (tokenization followed by encoding).
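Truncation to max_len follows the selected TruncationStrategy; with a longest-first strategy, tokens are removed one at a time from the longer of the two sequences until the pair fits. A pure-std sketch of that idea (not the crate's implementation; the overhead of special tokens and the stride/overflow handling are ignored here):

```rust
/// Longest-first truncation sketch: trim one token at a time from the longer
/// sequence until the combined length fits within max_len.
fn truncate_longest_first(mut a: Vec<i64>, mut b: Vec<i64>, max_len: usize) -> (Vec<i64>, Vec<i64>) {
    while a.len() + b.len() > max_len {
        if a.len() >= b.len() {
            a.pop(); // first sequence is longer (or tied): trim it
        } else {
            b.pop(); // second sequence is longer: trim it
        }
    }
    (a, b)
}

fn main() {
    let (a, b) = truncate_longest_first(vec![1, 2, 3, 4, 5], vec![6, 7], 5);
    println!("{:?} {:?}", a, b); // [1, 2, 3] [6, 7]
}
```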
fn encode_list<S>(
    &self,
    text_list: &[S],
    max_len: usize,
    truncation_strategy: &TruncationStrategy,
    stride: usize
) -> Vec<TokenizedInput>
where
    S: AsRef<str>,
Encode a sequence of string-like texts (tokenization followed by encoding). Note that in contrast with the optional second text of encode, each text provided is encoded independently.
fn encode_pair_list<S>(
    &self,
    text_list: &[(S, S)],
    max_len: usize,
    truncation_strategy: &TruncationStrategy,
    stride: usize
) -> Vec<TokenizedInput>
where
    S: AsRef<str>,
Encode a sequence of string-like text pairs (tokenization followed by encoding). This combines encode with the list processing of encode_list.
fn decode_to_vec(
    &self,
    token_ids: &[i64],
    skip_special_tokens: bool
) -> Vec<String>
Decode a sequence of token indices to a sequence of Strings, optionally skipping special indices.
fn decode(
    &self,
    token_ids: &[i64],
    skip_special_tokens: bool,
    clean_up_tokenization_spaces: bool
) -> String
Converts a sequence of ids (integers) into a string, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces.
fn clean_up_tokenization(&self, input_string: String) -> String
Cleans up tokenization artifacts (for example, whitespace before punctuation).
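A minimal sketch of this kind of cleanup, removing the space a tokenizer leaves before common punctuation (illustrative only; the crate's actual rules may cover more cases, such as contractions):

```rust
/// Remove the space a tokenizer leaves before common punctuation,
/// e.g. "hello , world ." -> "hello, world."
fn clean_up_tokenization(input: String) -> String {
    let mut out = input;
    for pat in [" .", " ,", " !", " ?", " ;", " :"] {
        // Replace " ." with ".", " ," with ",", and so on.
        out = out.replace(pat, &pat[1..]);
    }
    out
}

fn main() {
    println!("{}", clean_up_tokenization("hello , world .".to_string())); // hello, world.
}
```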
fn decode_list(
    &self,
    token_ids_list: &[Vec<i64>],
    skip_special_tokens: bool,
    clean_up_tokenization_spaces: bool
) -> Vec<String>
Converts a list of sequences of ids (integers) into strings, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces. This calls decode for each provided sequence of ids.
fn add_tokens(&mut self, tokens: &[&str])
Add arbitrary tokens to the vocabulary.
fn add_extra_ids(&mut self, num_extra_ids: i64)
Add a given number of extra ids to the vocabulary.
Auto Trait Implementations§
impl RefUnwindSafe for RobertaTokenizer
impl Send for RobertaTokenizer
impl Sync for RobertaTokenizer
impl Unpin for RobertaTokenizer
impl UnwindSafe for RobertaTokenizer
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value.