Struct rust_tokenizers::vocab::SentencePieceModel

pub struct SentencePieceModel {
    pub root: TrieNode,
}

SentencePiece Model

Model for the SentencePiece tokenizer. This model performs the SentencePiece unigram decomposition and, as such, contains a Trie data structure for efficient common prefix search.

Expects a SentencePiece protobuf file when created from file.

Fields

root: TrieNode

Trie data structure containing the vocabulary elements and their unigram log-probabilities

Implementations

from_file

Creates a SentencePiece model from a protobuf file.

Example

use rust_tokenizers::vocab::SentencePieceModel;
let path = "path/to/spiece.model";

let sentence_piece_model = SentencePieceModel::from_file(path);
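from_file returns a Result, so the error can also be handled explicitly rather than unwrapped. A minimal sketch (the model path is a placeholder):

use rust_tokenizers::vocab::SentencePieceModel;
let path = "path/to/spiece.model";

match SentencePieceModel::from_file(path) {
    Ok(model) => {
        // The model is ready for common prefix search and unigram decomposition.
        let _prefixes = model.common_prefix_search("hello");
    }
    Err(error) => eprintln!("failed to load the SentencePiece model: {:?}", error),
}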

common_prefix_search

Performs a common prefix search for a given query on the model's Trie structure.

Arguments

  • text (&str): query to find common prefixes from

Returns

  • Vec<&TrieNode> containing references to the Trie nodes that share a (character-based) prefix with the query

Example

use rust_tokenizers::vocab::SentencePieceModel;
let path = "path/to/spiece.model";
let sentence_piece_model = SentencePieceModel::from_file(path).unwrap();

let query = "hello";
let common_prefixes = sentence_piece_model.common_prefix_search(query);
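The returned references can be used directly as a follow-up; for instance, a small sketch (continuing the example above) that only reports how many vocabulary prefixes matched the query:

// Each returned reference points to a Trie node whose path is a prefix of the query.
println!(
    "{} vocabulary prefixes found for {:?}",
    common_prefixes.len(),
    query
);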

decode_forward_token_ref

Decodes a TokenRef into a lattice of potential sub-tokens. This step is usually followed by a backward pass to find the most likely sequence of sub-tokens.

Arguments

  • token (TokenRef<'a>): token to decompose in sub-tokens

Returns

  • Vec<Option<Node<'a>>> vector of lattice nodes. The node strings reference back into the original token.

Example

use rust_tokenizers::vocab::SentencePieceModel;
use rust_tokenizers::TokenRef;
let path = "path/to/spiece.model";
let sentence_piece_model = SentencePieceModel::from_file(path).unwrap();

let token = TokenRef::new("hello", &[0, 1, 2, 3, 4]);
let lattice_nodes = sentence_piece_model.decode_forward_token_ref(token);
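Because the forward pass returns a Vec<Option<Node>>, lattice positions with no candidate sub-token hold None. A small sketch inspecting the lattice (continuing the example above):

// Count how many lattice positions actually hold a candidate node.
let populated = lattice_nodes.iter().filter(|node| node.is_some()).count();
println!(
    "{} of {} lattice entries hold a candidate sub-token",
    populated,
    lattice_nodes.len()
);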

decode_backward

Performs a backward pass through an array of nodes (generated by the forward pass), returning the most likely sequence of nodes. These are usually converted back to tokens in a final step.

Arguments

  • nodes (&'a [Option<Node<'a>>]): possible nodes generated by the forward step

Returns

  • Vec<&'a Node> sequence of most likely nodes

Example

use rust_tokenizers::vocab::SentencePieceModel;
use rust_tokenizers::TokenRef;
let path = "path/to/spiece.model";
let sentence_piece_model = SentencePieceModel::from_file(path).unwrap();

let token = TokenRef::new("hello", &[0, 1, 2, 3, 4]);
let lattice_nodes = sentence_piece_model.decode_forward_token_ref(token);
let best_nodes_sequence = sentence_piece_model.decode_backward(&lattice_nodes);

parse_nodes_to_tokens

Converts the most likely node sequence into a vector of tokens that can be further processed by the tokenizer.

Arguments

  • nodes (Vec<&Node>): sequence of most likely nodes

Returns

  • Vec<Token> sequence of most likely sub-tokens

Example

use rust_tokenizers::vocab::SentencePieceModel;
use rust_tokenizers::TokenRef;
let path = "path/to/spiece.model";
let sentence_piece_model = SentencePieceModel::from_file(path).unwrap();

let token = TokenRef::new("hello", &[0, 1, 2, 3, 4]);
let lattice_nodes = sentence_piece_model.decode_forward_token_ref(token);
let best_nodes_sequence = sentence_piece_model.decode_backward(&lattice_nodes);
let sub_tokens = sentence_piece_model.parse_nodes_to_tokens(best_nodes_sequence);

populate_masks

Populates the mask field for a sequence of sub-tokens generated by a SentencePiece model. These masks are not generated as part of the standard unigram decomposition and must be added afterwards. The tokens are mutated in place.

Arguments

  • tokens (&mut [Token]): tokens for which the masks should be populated
  • whitespace_char (char): whitespace character used to identify whether a token is a continuation token

Example

use rust_tokenizers::vocab::SentencePieceModel;
use rust_tokenizers::TokenRef;
let path = "path/to/spiece.model";
let sentence_piece_model = SentencePieceModel::from_file(path).unwrap();

let token = TokenRef::new("hello", &[0, 1, 2, 3, 4]);
let lattice_nodes = sentence_piece_model.decode_forward_token_ref(token);
let best_nodes_sequence = sentence_piece_model.decode_backward(&lattice_nodes);
let mut sub_tokens = sentence_piece_model.parse_nodes_to_tokens(best_nodes_sequence);
sentence_piece_model.populate_masks(&mut sub_tokens, ' ');
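The resulting sub-tokens can then be inspected. A short sketch (continuing the example above), assuming the Token struct exposes public text and mask fields:

// Print each sub-token together with the mask set by populate_masks.
for sub_token in &sub_tokens {
    println!("{:?} -> {:?}", sub_token.text, sub_token.mask);
}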

Trait Implementations

Clone

Returns a copy of the value. clone_from performs copy-assignment from source.

Debug

Formats the value using the given formatter.
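As a brief, self-contained sketch of the derived implementations (using the same placeholder model path as the examples above):

use rust_tokenizers::vocab::SentencePieceModel;
let path = "path/to/spiece.model";
let sentence_piece_model = SentencePieceModel::from_file(path).unwrap();

// Clone produces an independent copy of the model and its Trie.
let model_copy = sentence_piece_model.clone();

// Debug formatting dumps the full vocabulary Trie, so it is mainly useful in small tests.
let _debug_repr = format!("{:?}", model_copy);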
