Struct rust_tokenizers::tokenizer::AlbertTokenizer
pub struct AlbertTokenizer { /* fields omitted */ }
ALBERT tokenizer performing:
- splitting on special characters
- text cleaning
- NFKC decomposition
- (optional) lower casing
- (optional) accent stripping
- SentencePiece decomposition
Implementations
pub fn from_file(
    path: &str,
    lower_case: bool,
    strip_accents: bool
) -> Result<AlbertTokenizer, TokenizerError>

Create a new instance of an AlbertTokenizer.
Expects a SentencePiece protobuf file as an input.
Parameters
- path (&str): path to the SentencePiece model file
- lower_case (bool): flag indicating if the text should be lower-cased as part of the tokenization
- strip_accents (bool): flag indicating if accents should be stripped from the text
Example
use rust_tokenizers::tokenizer::{AlbertTokenizer, Tokenizer};
let strip_accents = false;
let lower_case = false;
let tokenizer = AlbertTokenizer::from_file("path/to/vocab/file", lower_case, strip_accents).unwrap();
pub fn from_existing_vocab_and_model(
    vocab: AlbertVocab,
    model: SentencePieceModel,
    lower_case: bool,
    strip_accents: bool
) -> AlbertTokenizer

Create a new instance of an AlbertTokenizer from an existing vocabulary and model.
Parameters
- vocab (AlbertVocab): vocabulary
- model (SentencePieceModel): SentencePiece model
- lower_case (bool): flag indicating if the text should be lower-cased as part of the tokenization
- strip_accents (bool): flag indicating if accents should be stripped from the text
Example
use rust_tokenizers::tokenizer::{AlbertTokenizer, Tokenizer};
use rust_tokenizers::vocab::{AlbertVocab, SentencePieceModel, Vocab};
let strip_accents = false;
let lower_case = false;
let vocab = AlbertVocab::from_file("path/to/vocab/file").unwrap();
let model = SentencePieceModel::from_file("path/to/model/file").unwrap();
let tokenizer = AlbertTokenizer::from_existing_vocab_and_model(vocab, model, lower_case, strip_accents);
Trait Implementations
fn tokenize_list_with_offsets<S, ST>(
    &self,
    text_list: S
) -> Vec<TokensWithOffsets> where
    S: AsRef<[ST]>,
    ST: AsRef<str> + Sync,

Multithreaded tokenization of a list of strings, where each string corresponds to, for example, a sentence. Returns a vector of TokensWithOffsets containing the tokens and their offset information. This calls tokenize_with_offsets on each item of the list provided. Read more
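As a hedged sketch (the model path is a placeholder, and the fully qualified call assumes AlbertTokenizer implements the crate's MultiThreadedTokenizer trait, as the Sync bound suggests):

```rust
use rust_tokenizers::tokenizer::{AlbertTokenizer, MultiThreadedTokenizer};

// Placeholder path: substitute a real SentencePiece model file.
let tokenizer = AlbertTokenizer::from_file("path/to/spiece.model", false, false).unwrap();
let texts = ["The first sentence.", "The second sentence."];
// Fully qualified call to disambiguate from the single-threaded Tokenizer method.
let tokens_with_offsets = MultiThreadedTokenizer::tokenize_list_with_offsets(&tokenizer, &texts);
```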
fn encode_list<S, ST>(
    &self,
    text_list: S,
    max_len: usize,
    truncation_strategy: &TruncationStrategy,
    stride: usize
) -> Vec<TokenizedInput> where
    S: AsRef<[ST]>,
    ST: AsRef<str> + Sync,

Multithreaded encoding of a sequence of string-like texts (tokenization followed by encoding). Note that, in contrast with the optional second text of encode, each text provided is encoded independently. Read more
fn encode_pair_list<S, ST>(
    &self,
    text_list: S,
    max_len: usize,
    truncation_strategy: &TruncationStrategy,
    stride: usize
) -> Vec<TokenizedInput> where
    S: AsRef<[(ST, ST)]>,
    ST: AsRef<str> + Sync,

Multithreaded encoding of a sequence of string-like text pairs (tokenization followed by encoding). This combines the pair handling of encode with the list processing of encode_list. Read more
Multithreaded conversion of a list of sequences of ids (integers) into strings, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces. This calls decode for each provided sequence of ids. Read more
Returns a reference to the tokenizer vocabulary
Tokenize a TokenRef, returning a sequence of tokens Read more
Converts a sequence of strings into a single string. This will clean up artifacts from tokenization (for example sub ##word) and generate a single output string. Read more
fn build_input_with_special_tokens(
    &self,
    tokens_ids_with_offsets_1: TokenIdsWithOffsets,
    tokens_ids_with_offsets_2: Option<TokenIdsWithOffsets>
) -> TokenIdsWithSpecialTokens

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. Read more
Tokenize a string, returning a vector of tokens as strings. Use tokenize_with_offsets or tokenize_to_tokens to return offset information. Read more
Tokenize a string, returning tokens with offset information Read more
Tokenize a list of strings, returning tokens with offset information Read more
fn tokenize_list_with_offsets<S, ST>(
    &self,
    text_list: S
) -> Vec<TokensWithOffsets> where
    S: AsRef<[ST]>,
    ST: AsRef<str>,

Tokenize a list of strings, where each string corresponds to, for example, a sentence. Returns a vector of TokensWithOffsets containing the tokens and their offset information. This calls tokenize_with_offsets on each item of the list provided. Read more
Convert a slice of string-like to a vector of token indices Read more
fn encode<S: AsRef<str>>(
    &self,
    text_1: S,
    text_2: Option<S>,
    max_len: usize,
    truncation_strategy: &TruncationStrategy,
    stride: usize
) -> TokenizedInput

Encode a string-like (tokenization followed by encoding) Read more
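For illustration, a minimal use of encode with an optional second text (the model path and max_len are placeholder values, not taken from this documentation):

```rust
use rust_tokenizers::tokenizer::{AlbertTokenizer, Tokenizer, TruncationStrategy};

let tokenizer = AlbertTokenizer::from_file("path/to/spiece.model", false, false).unwrap();
// Passing Some(second text) encodes the pair as a single input;
// passing None encodes text_1 alone.
let input = tokenizer.encode(
    "First sequence",
    Some("Second sequence"),
    128,
    &TruncationStrategy::LongestFirst,
    0,
);
```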
fn encode_list<S, ST>(
    &self,
    text_list: S,
    max_len: usize,
    truncation_strategy: &TruncationStrategy,
    stride: usize
) -> Vec<TokenizedInput> where
    S: AsRef<[ST]>,
    ST: AsRef<str>,

Encode a sequence of string-like texts (tokenization followed by encoding). Note that, in contrast with the optional second text of encode, each text provided is encoded independently. Read more
fn encode_pair_list<S, ST>(
    &self,
    text_list: S,
    max_len: usize,
    truncation_strategy: &TruncationStrategy,
    stride: usize
) -> Vec<TokenizedInput> where
    S: AsRef<[(ST, ST)]>,
    ST: AsRef<str>,

Encode a sequence of string-like text pairs (tokenization followed by encoding). This combines the pair handling of encode with the list processing of encode_list. Read more
Decode a sequence of token indices to a sequence of Strings, optionally skipping special indices Read more
Converts a sequence of ids (integer) into a string, using the tokenizer and vocabulary with options to remove special tokens and clean up tokenization spaces. Read more
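A hedged round-trip sketch, assuming decode takes a slice of ids plus the two boolean flags described above (the model path is a placeholder, and real token ids depend on the loaded vocabulary):

```rust
use rust_tokenizers::tokenizer::{AlbertTokenizer, Tokenizer, TruncationStrategy};

let tokenizer = AlbertTokenizer::from_file("path/to/spiece.model", false, false).unwrap();
let input = tokenizer.encode("Hello world", None, 128, &TruncationStrategy::LongestFirst, 0);
// Skip special tokens and clean up tokenization spaces.
let text = tokenizer.decode(&input.token_ids, true, true);
```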
Cleans-up tokenization artifacts (for example whitespace before punctuation) Read more
Converts a list of sequences of ids (integers) into strings, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces. This calls decode for each provided sequence of ids. Read more
Auto Trait Implementations
impl RefUnwindSafe for AlbertTokenizer
impl Send for AlbertTokenizer
impl Sync for AlbertTokenizer
impl Unpin for AlbertTokenizer
impl UnwindSafe for AlbertTokenizer