Struct rust_tokenizers::tokenizer::ProphetNetTokenizer
pub struct ProphetNetTokenizer { /* fields omitted */ }
ProphetNet tokenizer
ProphetNet tokenizer performing:
- BaseTokenizer tokenization (see BaseTokenizer for more details)
- WordPiece tokenization
Implementations
pub fn from_file(
path: &str,
lower_case: bool,
strip_accents: bool
) -> Result<ProphetNetTokenizer, TokenizerError>
Create a new instance of a ProphetNetTokenizer.
Expects a vocabulary flat-file as an input.
Parameters
- path (&str): path to the vocabulary file
- lower_case (bool): flag indicating if the text should be lower-cased as part of the tokenization
- strip_accents (bool): flag indicating if accents should be stripped from the text
Example
use rust_tokenizers::tokenizer::{ProphetNetTokenizer, Tokenizer};
let strip_accents = false;
let lower_case = false;
let tokenizer =
ProphetNetTokenizer::from_file("path/to/vocab/file", lower_case, strip_accents).unwrap();
pub fn from_existing_vocab(
vocab: ProphetNetVocab,
lower_case: bool,
strip_accents: bool
) -> ProphetNetTokenizer
Create a new instance of a ProphetNetTokenizer from an existing vocabulary
Parameters
- vocab (ProphetNetVocab): ProphetNet vocabulary
- lower_case (bool): flag indicating if the text should be lower-cased as part of the tokenization
- strip_accents (bool): flag indicating if accents should be stripped from the text
Example
use rust_tokenizers::tokenizer::{ProphetNetTokenizer, Tokenizer};
use rust_tokenizers::vocab::{ProphetNetVocab, Vocab};
let strip_accents = false;
let lower_case = false;
let vocab = ProphetNetVocab::from_file("path/to/vocab/file").unwrap();
let tokenizer = ProphetNetTokenizer::from_existing_vocab(vocab, lower_case, strip_accents);
Trait Implementations
Tokenize a list of strings (with multithreading), where each corresponds to, for example, a sentence; returns a
vector of TokensWithOffsets containing the tokens and their offset information. This calls
tokenize_with_offsets on the list provided. Read more
Multithreaded tokenization of a list of strings, returning tokens with offset information Read more
Multithreaded encoding of a sequence of string-like texts (tokenization followed by encoding). Note that in contrast
with encode's optional second text, each text provided is encoded independently. Read more
Multithreaded encoding of a sequence of string-like text pairs (tokenization followed by encoding). This combines
encode with the list processing of encode_list. Read more
Multithreaded conversion of a list of sequences of ids (integer) into strings, using the tokenizer and vocabulary
with options to remove special tokens and clean up tokenization spaces. This calls decode
for each provided sequence of ids. Read more
Returns a reference to the tokenizer vocabulary
Tokenize a TokenRef, returning a sequence of tokens Read more
Converts a sequence of strings into a single string. This will clean up artifacts from tokenization
(for example the sub-word continuation marker in sub ##word) and generate a single output string. Read more
fn build_input_with_special_tokens(
&self,
tokens_ids_with_offsets_1: TokenIdsWithOffsets,
tokens_ids_with_offsets_2: Option<TokenIdsWithOffsets>
) -> TokenIdsWithSpecialTokens
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and adding special tokens. Read more
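A minimal sketch of calling this method, in the style of the constructor examples above. The vocabulary path is a placeholder, the ids are hypothetical, and the TokenIdsWithOffsets field names are assumed from the crate's public API (check against your version):

```rust
use rust_tokenizers::tokenizer::{ProphetNetTokenizer, Tokenizer};
use rust_tokenizers::{Mask, TokenIdsWithOffsets};

let tokenizer =
    ProphetNetTokenizer::from_file("path/to/vocab/file", false, false).unwrap();
// Hypothetical ids; in practice these come from an earlier encoding step.
let ids: Vec<i64> = vec![1037, 2159];
let token_ids_with_offsets = TokenIdsWithOffsets {
    offsets: vec![None; ids.len()],
    reference_offsets: vec![vec![]; ids.len()],
    masks: vec![Mask::None; ids.len()],
    ids,
};
// No second sequence: pass None for the pair.
let input = tokenizer.build_input_with_special_tokens(token_ids_with_offsets, None);
// `input.token_ids` now holds the ids with the model's special tokens added.
```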
Tokenize a string, returns a vector of tokens as strings.
Use tokenize_with_offsets or tokenize_to_tokens to return offset information. Read more
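A minimal usage sketch, following the constructor examples above (the vocabulary path is a placeholder):

```rust
use rust_tokenizers::tokenizer::{ProphetNetTokenizer, Tokenizer};

let tokenizer =
    ProphetNetTokenizer::from_file("path/to/vocab/file", true, true).unwrap();
// Returns the WordPiece tokens as plain strings, without offset information.
let tokens: Vec<String> = tokenizer.tokenize("Hello, world!");
```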
Tokenize a string, returning tokens with offset information Read more
Tokenize a list of strings, returning tokens with offset information Read more
Tokenize a list of strings, where each corresponds to, for example, a sentence; returns a
vector of TokensWithOffsets containing the tokens and their offset information. This calls
tokenize_with_offsets on the list provided. Read more
Convert a slice of string-like to a vector of token indices Read more
fn encode(
&self,
text_1: &str,
text_2: Option<&str>,
max_len: usize,
truncation_strategy: &TruncationStrategy,
stride: usize
) -> TokenizedInput
Encode a string-like (tokenization followed by encoding) Read more
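A minimal sketch of a single-text call, in the style of the examples above (the vocabulary path is a placeholder; TruncationStrategy::LongestFirst is assumed to be available from the crate's tokenizer module):

```rust
use rust_tokenizers::tokenizer::{ProphetNetTokenizer, Tokenizer, TruncationStrategy};

let tokenizer =
    ProphetNetTokenizer::from_file("path/to/vocab/file", true, true).unwrap();
// Encode one text (no second sequence), truncating to at most 128 tokens
// with no stride between overflowing windows.
let encoded =
    tokenizer.encode("Hello, world!", None, 128, &TruncationStrategy::LongestFirst, 0);
// `encoded.token_ids` holds the resulting ids, including special tokens.
```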
Encode a sequence of string-like texts (tokenization followed by encoding). Note that in contrast
with encode's optional second text, each text provided is encoded independently. Read more
Encode a sequence of string-like text pairs (tokenization followed by encoding). This combines
encode with the list processing of encode_list. Read more
Decode a sequence of token indices to a sequence of Strings, optionally skipping special indices Read more
Converts a sequence of ids (integer) into a string, using the tokenizer and vocabulary with options to remove special tokens and clean up tokenization spaces. Read more
Cleans up tokenization artifacts (for example whitespace before punctuation) Read more
Converts a list of sequences of ids (integer) into strings, using the tokenizer and vocabulary
with options to remove special tokens and clean up tokenization spaces. This calls decode
for each provided sequence of ids. Read more