Struct rust_tokenizers::vocab::MarianVocab

pub struct MarianVocab {
    pub values: HashMap<String, i64>,
    pub indices: HashMap<i64, String>,
    pub unknown_value: &'static str,
    pub special_values: HashMap<String, i64>,
    pub special_indices: HashMap<i64, String>,
}
Marian Vocab

Vocabulary for Marian tokenizer. Contains the following special values:

  • PAD token
  • EOS token

Expects a JSON-format vocabulary when created from file.

Fields

values: HashMap<String, i64>

A mapping of tokens as string to indices (i.e. the encoder base)

indices: HashMap<i64, String>

A mapping of token ids to strings (i.e. the decoder base)
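The relationship between the two maps can be sketched as follows. This is a minimal illustration, not the crate's loading code: the helper name `build_maps` is hypothetical, and assigning each token its list position as ID is an assumption made only for this example.

```rust
use std::collections::HashMap;

/// Builds an encoder map (token -> ID) and a decoder map (ID -> token)
/// from an ordered token list. Using the position as ID is an assumption
/// of this sketch, not something the crate guarantees.
fn build_maps(tokens: &[&str]) -> (HashMap<String, i64>, HashMap<i64, String>) {
    let mut values = HashMap::new();
    let mut indices = HashMap::new();
    for (position, token) in tokens.iter().enumerate() {
        let id = position as i64;
        values.insert(token.to_string(), id);
        indices.insert(id, token.to_string());
    }
    (values, indices)
}
```

Every entry of `values` has a mirror entry in `indices`, so encoding a token and decoding the resulting ID returns the original token.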

unknown_value: &'static str

The string to use for unknown (out of vocabulary) tokens

special_values: HashMap<String, i64>

A mapping of special value tokens as strings to IDs (i.e. the encoder base for special values), special values typically include things like BOS/EOS markers, class markers, mask markers and padding markers

special_indices: HashMap<i64, String>

A mapping of special value tokens as IDs to strings (i.e. the decoder base for special values)

Implementations

Returns the PAD token for Marian (<pad>)

Returns the EOS token for Marian (</s>)
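As a rough sketch of how these two accessors might be used, the example below resolves the PAD and EOS strings to IDs in a token-to-ID map. `SketchVocab` and `pad_and_eos_ids` are hypothetical stand-ins written for this example; only the token strings `<pad>` and `</s>` come from the descriptions above.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in type for this sketch, not the crate's MarianVocab.
struct SketchVocab;

impl SketchVocab {
    /// PAD token string, as documented above.
    fn pad_value() -> &'static str {
        "<pad>"
    }

    /// EOS token string, as documented above.
    fn eos_value() -> &'static str {
        "</s>"
    }
}

/// Looks up the PAD and EOS IDs in a token -> ID map.
/// Returns None if either token is missing from the map.
fn pad_and_eos_ids(values: &HashMap<String, i64>) -> Option<(i64, i64)> {
    let pad = *values.get(SketchVocab::pad_value())?;
    let eos = *values.get(SketchVocab::eos_value())?;
    Some((pad, eos))
}
```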

Trait Implementations

Returns a copy of the value.

Performs copy-assignment from source.

Formats the value using the given formatter.

Returns a reference to the tokenizer vocabulary

Tokenize a list of strings (with multithreading), each corresponding to, for example, a sentence. Returns a vector of TokensWithOffsets containing the tokens and their offset information. This calls tokenize_with_offsets on the list provided.

Multithreaded tokenization of a list of strings, returning tokens with offset information

Multithreaded encoding of a sequence of string-like texts (tokenization followed by encoding). Note that, in contrast with the optional second text of encode, each text provided is encoded independently.

Multithreaded encoding of a sequence of string-like text pairs (tokenization followed by encoding). This combines encode with the list processing of encode_list.

Multithreaded conversion of a list of sequences of ids (integers) into strings, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces. This calls decode for each provided sequence of ids.

Returns a reference to the tokenizer vocabulary

Tokenize a TokenRef, returning a sequence of tokens

Converts a sequence of strings into a single string. This will clean up artifacts from tokenization (for example sub ##word) and generate a single output string.

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens.

Tokenize a string, returning a vector of tokens as strings. Use tokenize_with_offsets or tokenize_to_tokens to return offset information.

Tokenize a string, returning tokens with offset information

Tokenize a list of strings, returning tokens with offset information

Tokenize a list of strings, each corresponding to, for example, a sentence. Returns a vector of TokensWithOffsets containing the tokens and their offset information. This calls tokenize_with_offsets on the list provided.

Convert a slice of string-like values to a vector of token indices

Encode a string-like value (tokenization followed by encoding)

Encode a sequence of string-like texts (tokenization followed by encoding). Note that, in contrast with the optional second text of encode, each text provided is encoded independently.

Encode a sequence of string-like text pairs (tokenization followed by encoding). This combines encode with the list processing of encode_list.

Decode a sequence of token indices to a sequence of Strings, optionally skipping special indices
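The optional skipping of special indices can be illustrated with plain hash maps. This is a sketch under stated assumptions, not the crate's decoding routine: the function name `decode_ids` is hypothetical, and mapping IDs that are missing from the decoder map to the unknown token is a choice made for this example.

```rust
use std::collections::HashMap;

/// Maps each ID to its token string. When `skip_special` is true, IDs found
/// in `special_indices` are dropped before the lookup. IDs missing from
/// `indices` fall back to `unknown` (an assumption of this sketch).
fn decode_ids(
    ids: &[i64],
    indices: &HashMap<i64, String>,
    special_indices: &HashMap<i64, String>,
    unknown: &str,
    skip_special: bool,
) -> Vec<String> {
    ids.iter()
        .filter(|id| !(skip_special && special_indices.contains_key(*id)))
        .map(|id| {
            indices
                .get(id)
                .cloned()
                .unwrap_or_else(|| unknown.to_string())
        })
        .collect()
}
```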

Converts a sequence of ids (integers) into a string, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces.

Cleans up tokenization artifacts (for example whitespace before punctuation)
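To make "whitespace before punctuation" concrete, here is a small sketch that removes a single space in front of a few punctuation marks. The function name `clean_up_spaces` and the set of marks are assumptions for this example; the crate's own clean-up may cover a different set of rules.

```rust
/// Punctuation marks handled by this sketch (an illustrative subset).
const MARKS: [&str; 4] = [".", ",", "!", "?"];

/// Removes one space directly in front of each mark in `MARKS`.
fn clean_up_spaces(text: &str) -> String {
    let mut cleaned = text.to_string();
    for mark in MARKS.iter() {
        // " ," becomes ",", " !" becomes "!", and so on.
        cleaned = cleaned.replace(&format!(" {}", mark), mark);
    }
    cleaned
}
```

For example, `clean_up_spaces("Hello , world !")` yields `"Hello, world!"`.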

Converts a list of sequences of ids (integers) into strings, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces. This calls decode for each provided sequence of ids.

Associated function returning the unknown value for the vocabulary

Returns the unknown value on an instance

Return the map of token strings to IDs

Return the map of token IDs to strings

Return the map of token strings to IDs for special values

Return the map of token IDs to strings for special values

Read a vocabulary from file

Converts a token to an id.

Converts an id to a token.

Read a Bert-style vocab.txt file (single column, one token per line). The from_file method should be preferred, and needs to be implemented by the specific vocabularies.
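The single-column format can be read with the standard library alone. The sketch below is an assumption-laden illustration rather than the crate's reader: the name `read_single_column_vocab` is hypothetical, and using the zero-based line number as the token ID is the convention assumed here.

```rust
use std::collections::HashMap;
use std::io::{self, BufRead};

/// Reads one token per line and assigns the zero-based line number as ID
/// (assumed convention for this sketch). Surrounding whitespace is trimmed.
fn read_single_column_vocab<R: BufRead>(reader: R) -> io::Result<HashMap<String, i64>> {
    let mut values = HashMap::new();
    for (line_number, line) in reader.lines().enumerate() {
        let token = line?.trim().to_string();
        values.insert(token, line_number as i64);
    }
    Ok(values)
}
```

Because the function accepts any `BufRead`, it can be fed a `BufReader` over a file or an in-memory byte slice such as `"[PAD]\n[UNK]\nhello\n".as_bytes()`.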

Converts a token to an id, given a HashMap of values, a HashMap of special values and the string representation of the unknown value token. This is not meant to be used directly: the method token_to_id offers a more convenient interface for most vocabularies, but needs to be implemented by the specific vocabulary.

Converts an id to a token, given a HashMap of values, a HashMap of special values and the string representation of the unknown value token. This is not meant to be used directly: the method id_to_token offers a more convenient interface for most vocabularies, but needs to be implemented by the specific vocabulary.
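One plausible lookup order for these two helpers is sketched below: special values first, then regular values, then the unknown token. The function names `lookup_token_id` and `lookup_id_token` are hypothetical, and the exact order and error handling in the crate may differ.

```rust
use std::collections::HashMap;

/// Token -> ID: tries the special map, then the regular map, then the ID of
/// the unknown token. Returns None only if the unknown token is also absent.
fn lookup_token_id(
    token: &str,
    values: &HashMap<String, i64>,
    special_values: &HashMap<String, i64>,
    unknown_value: &str,
) -> Option<i64> {
    special_values
        .get(token)
        .or_else(|| values.get(token))
        .or_else(|| values.get(unknown_value))
        .copied()
}

/// ID -> token: tries the special map, then the regular map, and falls back
/// to the unknown token string.
fn lookup_id_token(
    id: i64,
    indices: &HashMap<i64, String>,
    special_indices: &HashMap<i64, String>,
    unknown_value: &str,
) -> String {
    special_indices
        .get(&id)
        .or_else(|| indices.get(&id))
        .cloned()
        .unwrap_or_else(|| unknown_value.to_string())
}
```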

Register a token as a special value

Converts a list of tokens to a list of indices.
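Registration can be pictured as copying an existing entry from the regular map into the special map. The sketch below assumes the token must already be present in the vocabulary and reports an error otherwise; the name `register_special` and the `String` error type are choices made for this example only.

```rust
use std::collections::HashMap;

/// Copies the ID of `token` from `values` into `special_values`.
/// Returns the ID on success, or an error message if the token is not
/// part of the vocabulary (assumed behaviour for this sketch).
fn register_special(
    token: &str,
    values: &HashMap<String, i64>,
    special_values: &mut HashMap<String, i64>,
) -> Result<i64, String> {
    match values.get(token) {
        Some(&id) => {
            special_values.insert(token.to_string(), id);
            Ok(id)
        }
        None => Err(format!("token {} is not in the vocabulary", token)),
    }
}
```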

Auto Trait Implementations

Blanket Implementations

Gets the TypeId of self.

Immutably borrows from an owned value.

Mutably borrows from an owned value.

Performs the conversion.

Performs the conversion.

The alignment of pointer.

The type for initializers.

Initializes a with the given initializer.

Dereferences the given pointer.

Mutably dereferences the given pointer.

Drops the object pointed to by the given pointer.

The resulting type after obtaining ownership.

Creates owned data from borrowed data, usually by cloning.

🔬 This is a nightly-only experimental API. (toowned_clone_into)

recently added

Uses borrowed data to replace owned data, usually by cloning.

The type returned in the event of a conversion error.

Performs the conversion.

The type returned in the event of a conversion error.

Performs the conversion.