Struct rust_tokenizers::vocab::SentencePieceModel
pub struct SentencePieceModel { pub root: TrieNode, }
SentencePiece Model
Model for the SentencePiece tokenizer. This model performs the SentencePiece unigram decomposition and contains a Trie data structure for efficient common prefix search.
Expects a SentencePiece protobuf file when created from file.
Fields
root: TrieNode
Trie data structure containing the vocabulary elements and their unigram log-probabilities
Implementations
from_file
Creates a SentencePieceModel from a protobuf file.
Example
use rust_tokenizers::vocab::SentencePieceModel;
let path = "path/to/spiece.model";
let sentence_piece_model = SentencePieceModel::from_file(path);
common_prefix_search
Performs a common prefix search for a given query on the model Trie structure.
Arguments
- text (&str): query to find common prefixes from
Returns
Vec<&TrieNode> containing references to the Trie nodes with a common (character-based) prefix with the query
Example
use rust_tokenizers::vocab::SentencePieceModel;
let path = "path/to/spiece.model";
let sentence_piece_model = SentencePieceModel::from_file(path).unwrap();
let query = "hello";
let common_prefixes = sentence_piece_model.common_prefix_search(query);
decode_forward_token_ref
Decodes a TokenRef to a lattice of potential subtokens. This step is usually followed by a backward step to find the most likely sequence.
Arguments
- token (TokenRef<'a>): token to decompose into sub-tokens
Returns
Vec<Option<Node<'a>>> vector of lattice nodes. The strings in the nodes reference back to the original token.
Example
use rust_tokenizers::vocab::SentencePieceModel;
use rust_tokenizers::TokenRef;
let path = "path/to/spiece.model";
let sentence_piece_model = SentencePieceModel::from_file(path).unwrap();
let token = TokenRef::new("hello", &[0, 1, 2, 3]);
let lattice_nodes = sentence_piece_model.decode_forward_token_ref(token);
decode_backward
Backward pass through an array of nodes (generated as a result of the forward pass), returning the most likely sequence of nodes. These are usually converted back to tokens in a final step.
Arguments
- nodes (&'a [Option<Node<'a>>]): possible nodes generated from the forward step
Returns
Vec<&'a Node> sequence of most likely nodes
Example
use rust_tokenizers::vocab::SentencePieceModel;
use rust_tokenizers::TokenRef;
let path = "path/to/spiece.model";
let sentence_piece_model = SentencePieceModel::from_file(path).unwrap();
let token = TokenRef::new("hello", &[0, 1, 2, 3]);
let lattice_nodes = sentence_piece_model.decode_forward_token_ref(token);
let best_nodes_sequence = sentence_piece_model.decode_backward(&lattice_nodes);
parse_nodes_to_tokens
Converts the most likely node sequence to a vector of tokens that can be further processed by the tokenizer.
Arguments
- nodes (Vec<&Node>): sequence of most likely nodes
Returns
Vec<Token> sequence of most likely sub-tokens
Example
use rust_tokenizers::vocab::SentencePieceModel;
use rust_tokenizers::TokenRef;
let path = "path/to/spiece.model";
let sentence_piece_model = SentencePieceModel::from_file(path).unwrap();
let token = TokenRef::new("hello", &[0, 1, 2, 3]);
let lattice_nodes = sentence_piece_model.decode_forward_token_ref(token);
let best_nodes_sequence = sentence_piece_model.decode_backward(&lattice_nodes);
let sub_tokens = sentence_piece_model.parse_nodes_to_tokens(best_nodes_sequence);
populate_masks
Populates the mask field for a sequence of sub-tokens generated by a SentencePiece model. These masks are not generated as part of the standard unigram decomposition and must be added afterwards. Mutates the tokens in place.
Arguments
- tokens (&mut [Token]): tokens for which to set the masks
- whitespace_char (char): whitespace character used to identify whether a token is a continuation token
Example
use rust_tokenizers::vocab::SentencePieceModel;
use rust_tokenizers::TokenRef;
let path = "path/to/spiece.model";
let sentence_piece_model = SentencePieceModel::from_file(path).unwrap();
let token = TokenRef::new("hello", &[0, 1, 2, 3]);
let lattice_nodes = sentence_piece_model.decode_forward_token_ref(token);
let best_nodes_sequence = sentence_piece_model.decode_backward(&lattice_nodes);
let mut sub_tokens = sentence_piece_model.parse_nodes_to_tokens(best_nodes_sequence);
sentence_piece_model.populate_masks(&mut sub_tokens, ' ');
Trait Implementations
Auto Trait Implementations
impl RefUnwindSafe for SentencePieceModel
impl Send for SentencePieceModel
impl Sync for SentencePieceModel
impl Unpin for SentencePieceModel
impl UnwindSafe for SentencePieceModel
Blanket Implementations
Mutably borrows from an owned value.