Struct rust_tokenizers::tokenizer::AlbertTokenizer
pub struct AlbertTokenizer { /* fields omitted */ }
ALBERT tokenizer
ALBERT tokenizer performing:
- splitting on special characters
- text cleaning
- NFKC decomposition
- (optional) lower casing
- (optional) accent stripping
- SentencePiece decomposition
Implementations
pub fn from_file(
    path: &str,
    lower_case: bool,
    strip_accents: bool
) -> Result<AlbertTokenizer, TokenizerError>
Create a new instance of an AlbertTokenizer.
Expects a SentencePiece protobuf file as an input.
Parameters
- path (&str): path to the SentencePiece model file
- lower_case (bool): flag indicating if the text should be lower-cased as part of the tokenization
- strip_accents (bool): flag indicating if accents should be stripped from the text
Example
use rust_tokenizers::tokenizer::{AlbertTokenizer, Tokenizer};
let strip_accents = false;
let lower_case = false;
let tokenizer =
    AlbertTokenizer::from_file("path/to/vocab/file", lower_case, strip_accents).unwrap();

pub fn from_existing_vocab_and_model(
    vocab: AlbertVocab,
    model: SentencePieceModel,
    lower_case: bool,
    strip_accents: bool
) -> AlbertTokenizer
Create a new instance of an AlbertTokenizer from an existing vocabulary and model.
Parameters
- vocab (AlbertVocab): vocabulary
- model (SentencePieceModel): SentencePiece model
- lower_case (bool): flag indicating if the text should be lower-cased as part of the tokenization
- strip_accents (bool): flag indicating if accents should be stripped from the text
Example
use rust_tokenizers::tokenizer::{AlbertTokenizer, Tokenizer};
use rust_tokenizers::vocab::{AlbertVocab, SentencePieceModel, Vocab};
let strip_accents = false;
let lower_case = false;
let vocab = AlbertVocab::from_file("path/to/vocab/file").unwrap();
let model = SentencePieceModel::from_file("path/to/model/file").unwrap();
let tokenizer =
    AlbertTokenizer::from_existing_vocab_and_model(vocab, model, lower_case, strip_accents);

Trait Implementations
Tokenize a list of strings (with multithreading), where each corresponds to for example a sentence, returns a
vector of TokensWithOffsets containing the tokens and their offset information. This calls
tokenize_with_offsets on the list provided. Read more
Multithreaded tokenization of a list of strings, returning tokens with offset information Read more
Multithreaded encoding of a sequence of string-like texts (tokenization followed by encoding). Note that in contrast
with encode's optional second text, each text provided is encoded independently. Read more
Multithreaded encoding of a sequence of string-like text pairs (tokenization followed by encoding). This combines
encode with the list processing of encode_list. Read more
Multithreaded conversion of a list of sequences of ids (integers) into strings, using the tokenizer and vocabulary,
with options to remove special tokens and clean up tokenization spaces. This calls decode
for each provided sequence of ids. Read more
Returns a reference to the tokenizer vocabulary
Tokenize a TokenRef, returning a sequence of tokens Read more
Converts a sequence of strings into a single string. This will clean up artifacts from tokenization
(for example sub ##word) and generate a single output string Read more
fn build_input_with_special_tokens(
    &self,
    tokens_ids_with_offsets_1: TokenIdsWithOffsets,
    tokens_ids_with_offsets_2: Option<TokenIdsWithOffsets>
) -> TokenIdsWithSpecialTokens
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and adding special tokens. Read more
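For ALBERT, this concatenation follows the BERT-style [CLS] … [SEP] (… [SEP]) layout. The sketch below illustrates that layout in plain Rust with hypothetical special-token ids (2 standing in for [CLS], 3 for [SEP]); the real trait method operates on TokenIdsWithOffsets and also tracks offset information:

```rust
// Illustrative sketch of the [CLS]/[SEP] layout used to build
// sequence-classification inputs. Ids 2 and 3 are hypothetical
// stand-ins for the [CLS] and [SEP] special tokens.
fn build_input(ids_1: &[i64], ids_2: Option<&[i64]>) -> (Vec<i64>, Vec<i8>) {
    const CLS: i64 = 2;
    const SEP: i64 = 3;
    let mut token_ids = vec![CLS];
    token_ids.extend_from_slice(ids_1);
    token_ids.push(SEP);
    // Segment ids: 0 for the first sequence (and its special tokens),
    // 1 for the second sequence and its trailing [SEP].
    let mut segment_ids = vec![0i8; token_ids.len()];
    if let Some(ids_2) = ids_2 {
        token_ids.extend_from_slice(ids_2);
        token_ids.push(SEP);
        segment_ids.extend(std::iter::repeat(1i8).take(ids_2.len() + 1));
    }
    (token_ids, segment_ids)
}

fn main() {
    let (ids, segments) = build_input(&[10, 11], Some(&[20]));
    println!("{:?}", ids);      // [2, 10, 11, 3, 20, 3]
    println!("{:?}", segments); // [0, 0, 0, 0, 1, 1]
}
```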
Tokenize a string, returning a vector of tokens as strings.
Use tokenize_with_offsets or tokenize_to_tokens to return offset information. Read more
Tokenize a string, returning tokens with offset information Read more
Tokenize a list of strings, returning tokens with offset information Read more
Tokenize a list of strings, where each corresponds to for example a sentence, returns a
vector of TokensWithOffsets containing the tokens and their offset information. This calls
tokenize_with_offsets on the list provided. Read more
Convert a slice of string-likes to a vector of token indices Read more
fn encode(
    &self,
    text_1: &str,
    text_2: Option<&str>,
    max_len: usize,
    truncation_strategy: &TruncationStrategy,
    stride: usize
) -> TokenizedInput
Encode a string-like (tokenization followed by encoding) Read more
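The max_len and stride parameters govern truncation: sequences longer than max_len are cut, and stride controls how much overlapping context an overflowing window keeps. A plain-Rust sketch of that idea (an illustration of the concept, not the crate's actual truncation logic):

```rust
// Illustrative sketch of length truncation with an overflow window.
// `stride` controls how many tokens of context the overflow re-includes
// from the end of the kept window.
fn truncate(ids: &[i64], max_len: usize, stride: usize) -> (Vec<i64>, Vec<i64>) {
    if ids.len() <= max_len {
        return (ids.to_vec(), Vec::new());
    }
    let kept = ids[..max_len].to_vec();
    // The overflow keeps `stride` tokens of overlap with the kept window.
    let overflow_start = max_len.saturating_sub(stride);
    let overflow = ids[overflow_start..].to_vec();
    (kept, overflow)
}

fn main() {
    let (kept, overflow) = truncate(&[1, 2, 3, 4, 5], 3, 1);
    println!("{:?} {:?}", kept, overflow); // [1, 2, 3] [3, 4, 5]
}
```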
Encode a sequence of string-like texts (tokenization followed by encoding). Note that in contrast
with encode's optional second text, each text provided is encoded independently. Read more
Encode a sequence of string-like text pairs (tokenization followed by encoding). This combines
encode with the list processing of encode_list. Read more
Decode a sequence of token indices to a sequence of Strings, optionally skipping special indices Read more
Converts a sequence of ids (integer) into a string, using the tokenizer and vocabulary with options to remove special tokens and clean up tokenization spaces. Read more
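Since ALBERT tokens are SentencePiece pieces, decoding essentially looks each id up in the vocabulary and maps the word-boundary marker '▁' (U+2581) back to a space. A minimal sketch with a hypothetical toy vocabulary (the trait method uses the real AlbertVocab and handles special tokens):

```rust
use std::collections::HashMap;

// Toy id-to-piece table standing in for the ALBERT vocabulary.
fn decode(ids: &[i64], vocab: &HashMap<i64, &str>) -> String {
    // Concatenate pieces, then restore the spaces that SentencePiece
    // encodes as the '▁' (U+2581) word-boundary marker.
    let joined: String = ids.iter().filter_map(|id| vocab.get(id)).copied().collect();
    joined.replace('\u{2581}', " ").trim().to_string()
}

fn main() {
    let vocab = HashMap::from([(0, "\u{2581}hello"), (1, "\u{2581}wor"), (2, "ld")]);
    println!("{}", decode(&[0, 1, 2], &vocab)); // hello world
}
```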
Cleans up tokenization artifacts (for example whitespace before punctuation) Read more
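The kind of cleanup meant here can be sketched in plain Rust: drop the stray space that detokenization leaves before common punctuation. This is illustrative only; the trait method covers more cases:

```rust
// Sketch of tokenization-space cleanup: remove the space left
// before common punctuation marks after joining tokens.
fn clean_up(text: &str) -> String {
    text.replace(" .", ".")
        .replace(" ,", ",")
        .replace(" !", "!")
        .replace(" ?", "?")
}

fn main() {
    println!("{}", clean_up("Hello , world !")); // Hello, world!
}
```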
Converts a list of sequences of ids (integers) into strings, using the tokenizer and vocabulary,
with options to remove special tokens and clean up tokenization spaces. This calls decode
for each provided sequence of ids. Read more