Struct rust_tokenizers::vocab::FNetVocab
pub struct FNetVocab {
pub values: HashMap<String, i64>,
pub indices: HashMap<i64, String>,
pub unknown_value: &'static str,
pub special_values: HashMap<String, i64>,
pub special_indices: HashMap<i64, String>,
}
FNetVocab
Vocabulary for FNet tokenizer. Contains the following special values:
- CLS token
- SEP token
- PAD token
- MASK token
Expects a SentencePiece BPE protobuf file when created from file.
Fields
values: HashMap<String, i64>
A mapping of tokens (as strings) to indices (i.e. the encoder base)

indices: HashMap<i64, String>
A mapping of token ids to strings (i.e. the decoder base)

unknown_value: &'static str
The string to use for unknown (out of vocabulary) tokens

special_values: HashMap<String, i64>
A mapping of special tokens (as strings) to IDs (i.e. the encoder base for special values). Special values typically include BOS/EOS markers, class markers, mask markers and padding markers.

special_indices: HashMap<i64, String>
A mapping of special tokens (as IDs) to strings (i.e. the decoder base for special values)
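The four maps above come in mirrored pairs: `indices` inverts `values`, and special tokens are registered both in the general maps and in the dedicated special-value maps. A minimal sketch of this invariant, using plain `HashMap`s rather than the crate's actual constructors (token ids and the helper function are illustrative):

```rust
use std::collections::HashMap;

// Illustrative sketch (not the crate's constructor): build the mirrored
// encoder/decoder maps, registering special tokens in both the general
// maps and the dedicated special-value maps.
fn build_maps(
    tokens: &[&str],
    specials: &[&str],
) -> (
    HashMap<String, i64>,
    HashMap<i64, String>,
    HashMap<String, i64>,
    HashMap<i64, String>,
) {
    let mut values = HashMap::new();
    let mut indices = HashMap::new();
    let mut special_values = HashMap::new();
    let mut special_indices = HashMap::new();
    for (id, token) in tokens.iter().enumerate() {
        values.insert(token.to_string(), id as i64);
        indices.insert(id as i64, token.to_string());
        if specials.contains(token) {
            special_values.insert(token.to_string(), id as i64);
            special_indices.insert(id as i64, token.to_string());
        }
    }
    (values, indices, special_values, special_indices)
}

fn main() {
    let (values, indices, special_values, special_indices) =
        build_maps(&["[CLS]", "[SEP]", "hello"], &["[CLS]", "[SEP]"]);
    // `indices` inverts `values`.
    assert_eq!(values["hello"], 2);
    assert_eq!(indices[&2], "hello");
    // Special tokens appear in both pairs of maps; regular tokens do not.
    assert!(special_values.contains_key("[CLS]"));
    assert_eq!(special_indices[&1], "[SEP]");
    assert!(!special_values.contains_key("hello"));
}
```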
Implementations
Returns the MASK token for FNet ([MASK])
Trait Implementations
Tokenize a list of strings (with multithreading), where each string corresponds to, for example, a sentence. Returns a
vector of TokensWithOffsets containing the tokens and their offset information. This calls
tokenize_with_offsets on the provided list. Read more
Multithreaded tokenization of a list of strings, returning tokens with offset information Read more
Multithreaded encoding of a sequence of string-like texts (tokenization followed by encoding). Note that, in contrast
with encode's optional second text, each text provided is encoded independently. Read more
Multithreaded encoding of a sequence of string-like text pairs (tokenization followed by encoding). This combines
encode with the list processing of encode_list. Read more
Multithreaded conversion of a list of sequences of ids (integers) into strings, using the tokenizer and vocabulary,
with options to remove special tokens and clean up tokenization spaces. This calls decode
for each provided sequence of ids. Read more
Tokenize a TokenRef, returning a sequence of tokens Read more
Converts a sequence of strings into a single string. This cleans up artifacts from tokenization
(for example sub ##word) and generates a single output string. Read more
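A hedged sketch of this kind of sub-token merging, using the WordPiece-style `##` continuation prefix the example above mentions (the crate's own implementation may differ, since FNet vocabularies are SentencePiece-based):

```rust
// Illustrative sketch of merging sub-tokens ("sub", "##word") back into a
// single string, in the spirit of convert_tokens_to_string; not the crate's
// actual implementation.
fn join_tokens(tokens: &[&str]) -> String {
    let mut out = String::new();
    for token in tokens {
        if let Some(rest) = token.strip_prefix("##") {
            // Continuation piece: glue it to the previous token.
            out.push_str(rest);
        } else {
            // New word: separate it with a space (except at the start).
            if !out.is_empty() {
                out.push(' ');
            }
            out.push_str(token);
        }
    }
    out
}

fn main() {
    assert_eq!(join_tokens(&["a", "sub", "##word"]), "a subword");
}
```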
fn build_input_with_special_tokens(
&self,
tokens_ids_with_offsets_1: TokenIdsWithOffsets,
tokens_ids_with_offsets_2: Option<TokenIdsWithOffsets>
) -> TokenIdsWithSpecialTokens
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and adding special tokens. Read more
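The single-sequence and pair layouts built by this method can be sketched as follows. The helper below is illustrative (the ids for CLS/SEP are placeholders, and the real method also tracks offsets and token type ids):

```rust
// Hedged sketch of the input layout produced by build_input_with_special_tokens:
// [CLS] A [SEP] for a single sequence, [CLS] A [SEP] B [SEP] for a pair.
fn build_input(ids_1: &[i64], ids_2: Option<&[i64]>, cls: i64, sep: i64) -> Vec<i64> {
    let mut out = vec![cls];
    out.extend_from_slice(ids_1);
    out.push(sep);
    if let Some(ids) = ids_2 {
        out.extend_from_slice(ids);
        out.push(sep);
    }
    out
}

fn main() {
    // Single sequence: [CLS] 10 11 [SEP]
    assert_eq!(build_input(&[10, 11], None, 0, 1), vec![0, 10, 11, 1]);
    // Pair: [CLS] 10 [SEP] 20 21 [SEP]
    assert_eq!(build_input(&[10], Some(&[20, 21]), 0, 1), vec![0, 10, 1, 20, 21, 1]);
}
```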
Tokenize a string, returns a vector of tokens as strings.
Use tokenize_with_offsets or tokenize_to_tokens to return offset information. Read more
Tokenize a string, returning tokens with offset information Read more
Tokenize a list of strings, returning tokens with offset information Read more
Tokenize a list of strings, where each string corresponds to, for example, a sentence. Returns a
vector of TokensWithOffsets containing the tokens and their offset information. This calls
tokenize_with_offsets on the provided list. Read more
Convert a slice of string-like to a vector of token indices Read more
fn encode(
&self,
text_1: &str,
text_2: Option<&str>,
max_len: usize,
truncation_strategy: &TruncationStrategy,
stride: usize
) -> TokenizedInput
Encode a string-like (tokenization followed by encoding) Read more
Encode a sequence of string-like texts (tokenization followed by encoding). Note that, in contrast
with encode's optional second text, each text provided is encoded independently. Read more
Encode a sequence of string-like text pairs (tokenization followed by encoding). This combines
encode with the list processing of encode_list. Read more
Decode a sequence of token indices to a sequence of Strings, optionally skipping special indices Read more
Converts a sequence of ids (integers) into a string, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces. Read more
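The special-token skipping can be sketched with plain maps standing in for the vocabulary's decoder bases (illustrative only; the real decode also applies the clean-up pass and handles whitespace differently):

```rust
use std::collections::HashMap;

// Sketch of decoding ids back to a string while optionally skipping ids
// registered as special values (maps mirror the vocab's decoder bases).
fn decode(
    ids: &[i64],
    indices: &HashMap<i64, String>,
    special_indices: &HashMap<i64, String>,
    skip_special_tokens: bool,
) -> String {
    ids.iter()
        .filter(|id| !(skip_special_tokens && special_indices.contains_key(id)))
        .map(|id| indices[id].as_str())
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    let indices: HashMap<i64, String> = [(0, "[CLS]"), (1, "[SEP]"), (2, "hello")]
        .iter()
        .map(|(k, v)| (*k, v.to_string()))
        .collect();
    let special_indices: HashMap<i64, String> = [(0, "[CLS]"), (1, "[SEP]")]
        .iter()
        .map(|(k, v)| (*k, v.to_string()))
        .collect();
    assert_eq!(decode(&[0, 2, 1], &indices, &special_indices, true), "hello");
    assert_eq!(decode(&[0, 2, 1], &indices, &special_indices, false), "[CLS] hello [SEP]");
}
```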
Cleans up tokenization artifacts (for example whitespace before punctuation) Read more
Converts a list of sequences of ids (integers) into strings, using the tokenizer and vocabulary,
with options to remove special tokens and clean up tokenization spaces. This calls decode
for each provided sequence of ids. Read more
Associated function returning the unknown value for the vocabulary
Returns the unknown value on an instance
Return the map of token strings to IDs
Return the map of token IDs to strings for special values
Read a vocabulary from file Read more
Converts a token to an id. Read more
Converts an id to a token. Read more
Read a Bert-style vocab.txt file (single column, one token per line)
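A Bert-style vocab.txt maps each line's token to its line number as the id. A hedged sketch of that parsing step, independent of the crate's file-handling and error types:

```rust
use std::collections::HashMap;

// Illustrative parse of a Bert-style vocab file: one token per line, with
// the zero-based line number used as the token id.
fn parse_vocab(contents: &str) -> HashMap<String, i64> {
    contents
        .lines()
        .enumerate()
        .map(|(line_no, line)| (line.trim_end().to_string(), line_no as i64))
        .collect()
}

fn main() {
    let vocab = parse_vocab("[PAD]\n[CLS]\nhello\n");
    assert_eq!(vocab["[PAD]"], 0);
    assert_eq!(vocab["[CLS]"], 1);
    assert_eq!(vocab["hello"], 2);
}
```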
The from_file method should be preferred, and needs to be implemented by the specific vocabularies Read more
Converts a token to an id, given a HashMap of values, a HashMap of special values and
the string representation of the unknown token. This is not meant to be used directly; the method
token_to_id offers a more convenient interface for most vocabularies, but needs to be implemented
by the specific vocabulary. Read more
Converts an id to a token, given a HashMap of values, a HashMap of special values and
the string representation of the unknown token. This is not meant to be used directly; the method
id_to_token offers a more convenient interface for most vocabularies, but needs to be implemented
by the specific vocabulary. Read more
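The lookup-with-fallback that such implementations typically perform can be sketched as follows (illustrative maps; the sketch assumes the unknown token itself is present in the vocabulary, which is the normal case):

```rust
use std::collections::HashMap;

// Sketch of token -> id lookup: check special values first, then the general
// vocabulary, and fall back to the unknown token's id for OOV inputs.
fn token_to_id(
    token: &str,
    values: &HashMap<String, i64>,
    special_values: &HashMap<String, i64>,
    unknown_value: &str,
) -> i64 {
    special_values
        .get(token)
        .or_else(|| values.get(token))
        .copied()
        .unwrap_or_else(|| values[unknown_value])
}

fn main() {
    let values: HashMap<String, i64> = [("<unk>", 0), ("hello", 5)]
        .iter()
        .map(|(k, v)| (k.to_string(), *v))
        .collect();
    let special_values: HashMap<String, i64> =
        [("<unk>".to_string(), 0)].into_iter().collect();
    assert_eq!(token_to_id("hello", &values, &special_values, "<unk>"), 5);
    // Out-of-vocabulary tokens map to the unknown value's id.
    assert_eq!(token_to_id("zzz", &values, &special_values, "<unk>"), 0);
}
```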
Register a token as a special value Read more
Auto Trait Implementations
impl RefUnwindSafe for FNetVocab
impl UnwindSafe for FNetVocab
Blanket Implementations
Mutably borrows from an owned value. Read more