pub struct BaseTokenizer<T: Vocab> { /* private fields */ }
§Base tokenizer
Base tokenizer performing:
- whitespace tokenization
- splitting on special characters
- splitting on punctuation
- splitting on CJK characters
- (optional) lower casing
- (optional) accent stripping
This tokenizer is used as a pre-tokenizer step in the BERT and GPT tokenizers.
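For illustration, a minimal sketch of the resulting behaviour (the vocabulary path is a placeholder, and the output shown in the comment is indicative only):
use rust_tokenizers::tokenizer::{BaseTokenizer, Tokenizer};
use rust_tokenizers::vocab::BaseVocab;

let lower_case = true;
let strip_accents = true;
let tokenizer: BaseTokenizer<BaseVocab> =
    BaseTokenizer::from_file("path/to/vocab/file", lower_case, strip_accents).unwrap();
// Whitespace and punctuation splitting, lower casing and accent stripping:
// "Héllo, World!" would tokenize to something like ["hello", ",", "world", "!"]
let tokens = tokenizer.tokenize("Héllo, World!");
println!("{tokens:?}");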
Implementations§
impl<T: Vocab + Sync> BaseTokenizer<T>
pub fn from_file_with_special_token_mapping<P: AsRef<Path>, S: AsRef<Path>>(
    path: P,
    lower_case: bool,
    strip_accents: bool,
    special_token_mapping_path: S,
) -> Result<BaseTokenizer<T>, TokenizerError>
Create a new instance of a BaseTokenizer. Expects a vocabulary flat-file and a special token mapping file as inputs.
§Parameters
- path (&str): path to the vocabulary file (only used for special character splitting)
- lower_case (bool): flag indicating if the text should be lower-cased as part of the tokenization
- strip_accents (bool): flag indicating if accents should be stripped from the text
- special_token_mapping_path (&str): path to a special token mapping file to overwrite default special tokens
§Example
use rust_tokenizers::tokenizer::{BaseTokenizer, Tokenizer};
use rust_tokenizers::vocab::BaseVocab;
let strip_accents = false;
let lower_case = false;
let tokenizer: BaseTokenizer<BaseVocab> = BaseTokenizer::from_file_with_special_token_mapping(
"path/to/vocab/file",
lower_case,
strip_accents,
"path/to/special/token/mapping/file",
)
.unwrap();
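The mapping file is parsed as JSON; a minimal sketch of creating one (the unk_token field name is an assumption about the expected schema, not confirmed here):
use std::io::Write;

let mut mapping_file = std::fs::File::create("special_token_mapping.json").unwrap();
// Hypothetical content: override the default unknown token.
mapping_file
    .write_all(br#"{"unk_token": "[UNK]"}"#)
    .unwrap();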
pub fn from_file<P: AsRef<Path>>(
    path: P,
    lower_case: bool,
    strip_accents: bool,
) -> Result<BaseTokenizer<T>, TokenizerError>
Create a new instance of a BaseTokenizer. Expects a vocabulary flat-file as input.
§Parameters
- path (&str): path to the vocabulary file (only used for special character splitting)
- lower_case (bool): flag indicating if the text should be lower-cased as part of the tokenization
- strip_accents (bool): flag indicating if accents should be stripped from the text
§Example
use rust_tokenizers::tokenizer::{BaseTokenizer, Tokenizer};
use rust_tokenizers::vocab::BaseVocab;
let strip_accents = false;
let lower_case = false;
let tokenizer: BaseTokenizer<BaseVocab> =
BaseTokenizer::from_file("path/to/vocab/file", lower_case, strip_accents).unwrap();
pub fn from_existing_vocab(
    vocab: T,
    lower_case: bool,
    strip_accents: bool,
) -> BaseTokenizer<T>
Create a new instance of a BaseTokenizer from an existing vocabulary.
§Parameters
- vocab (Vocab): thread-safe reference to a vocabulary
- lower_case (bool): flag indicating if the text should be lower-cased as part of the tokenization
- strip_accents (bool): flag indicating if accents should be stripped from the text
§Example
use rust_tokenizers::tokenizer::{BaseTokenizer, Tokenizer};
use rust_tokenizers::vocab::{BaseVocab, Vocab};
let strip_accents = false;
let lower_case = false;
let base_vocab = BaseVocab::from_file("path/to/vocab/file").unwrap();
let tokenizer = BaseTokenizer::from_existing_vocab(base_vocab, lower_case, strip_accents);
Trait Implementations§
impl<T: Vocab + Sync + Send> MultiThreadedTokenizer<T> for BaseTokenizer<T>
fn tokenize_list_with_offsets<S>(
    &self,
    text_list: &[S],
) -> Vec<TokensWithOffsets>
Tokenize a list of strings (with multithreading), where each corresponds to, for example, a sentence. Returns a vector of TokensWithOffsets containing the tokens and their offset information. This calls tokenize_with_offsets on each element of the list provided.

fn tokenize_list<S>(&self, text_list: &[S]) -> Vec<Vec<String>>

Multithreaded tokenization of a list of strings, returning a vector of token vectors.
fn encode_list<S>(
    &self,
    text_list: &[S],
    max_len: usize,
    truncation_strategy: &TruncationStrategy,
    stride: usize,
) -> Vec<TokenizedInput>
Multithreaded encoding of a sequence of string-like texts (tokenization followed by encoding). Note that, in contrast with encode's optional second text, each text provided here is encoded independently.
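A minimal usage sketch (the vocabulary path is a placeholder; because both Tokenizer and MultiThreadedTokenizer provide encode_list, the call is disambiguated through the trait):
use rust_tokenizers::tokenizer::{BaseTokenizer, MultiThreadedTokenizer, TruncationStrategy};
use rust_tokenizers::vocab::BaseVocab;

let tokenizer: BaseTokenizer<BaseVocab> =
    BaseTokenizer::from_file("path/to/vocab/file", true, true).unwrap();
let texts = ["First sentence.", "A second, longer sentence."];
// Each text is encoded independently, truncated to at most 128 tokens.
let encoded = MultiThreadedTokenizer::encode_list(
    &tokenizer,
    &texts,
    128,
    &TruncationStrategy::LongestFirst,
    0,
);
assert_eq!(encoded.len(), texts.len());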
fn encode_pair_list<S>(
    &self,
    text_list: &[(S, S)],
    max_len: usize,
    truncation_strategy: &TruncationStrategy,
    stride: usize,
) -> Vec<TokenizedInput>
Multithreaded encoding of a sequence of string-like text pairs (tokenization followed by encoding). This combines encode with the list processing of encode_list.
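The pair variant follows the same pattern; a short sketch (text content is illustrative):
use rust_tokenizers::tokenizer::{BaseTokenizer, MultiThreadedTokenizer, TruncationStrategy};
use rust_tokenizers::vocab::BaseVocab;

let tokenizer: BaseTokenizer<BaseVocab> =
    BaseTokenizer::from_file("path/to/vocab/file", true, true).unwrap();
// Each (text_1, text_2) pair is combined into a single encoded input.
let pairs = [("What is a tokenizer?", "It splits text into tokens.")];
let encoded = MultiThreadedTokenizer::encode_pair_list(
    &tokenizer,
    &pairs,
    128,
    &TruncationStrategy::LongestFirst,
    0,
);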
fn decode_list(
    &self,
    token_ids_list: &[Vec<i64>],
    skip_special_tokens: bool,
    clean_up_tokenization_spaces: bool,
) -> Vec<String>
Multithreaded conversion of a list of sequences of ids (integers) into strings, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces. This calls decode for each provided sequence of ids.

impl<T: Vocab + Sync + Send> Tokenizer<T> for BaseTokenizer<T>
fn tokenize_to_tokens(&self, initial_token: TokenRef<'_>) -> Vec<Token>

Tokenize a TokenRef, returning a sequence of tokens.
fn tokenize(&self, text: &str) -> Vec<String>
Tokenize a string, returning a vector of tokens as strings. Use tokenize_with_offsets or tokenize_to_tokens to also return offset information.

fn tokenize_with_offsets(&self, text: &str) -> TokensWithOffsets
Tokenize a string, returning tokens with offset information.
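A sketch of inspecting offsets (the tokens and offsets field names are assumed from TokensWithOffsets):
use rust_tokenizers::tokenizer::{BaseTokenizer, Tokenizer};
use rust_tokenizers::vocab::BaseVocab;

let tokenizer: BaseTokenizer<BaseVocab> =
    BaseTokenizer::from_file("path/to/vocab/file", true, true).unwrap();
let output = tokenizer.tokenize_with_offsets("Hello, world!");
// Each token is paired with its (optional) character offsets in the input.
for (token, offset) in output.tokens.iter().zip(output.offsets.iter()) {
    println!("{token}: {offset:?}");
}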
fn tokenize_list<S>(&self, text_list: &[S]) -> Vec<Vec<String>>

Tokenize a list of strings, returning a vector of token vectors.
fn tokenize_list_with_offsets<S>(
    &self,
    text_list: &[S],
) -> Vec<TokensWithOffsets>
Tokenize a list of strings, where each corresponds to, for example, a sentence. Returns a vector of TokensWithOffsets containing the tokens and their offset information. This calls tokenize_with_offsets on each element of the list provided.

fn convert_tokens_to_ids<S>(&self, tokens: &[S]) -> Vec<i64>

Convert a slice of string-like values to a vector of token indices.
fn encode(
    &self,
    text_1: &str,
    text_2: Option<&str>,
    max_len: usize,
    truncation_strategy: &TruncationStrategy,
    stride: usize,
) -> TokenizedInput
Encode a string-like input (tokenization followed by encoding).
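A sketch of single-text encoding (pass Some(text) as the second argument to encode a pair; the token_ids field is assumed from TokenizedInput):
use rust_tokenizers::tokenizer::{BaseTokenizer, Tokenizer, TruncationStrategy};
use rust_tokenizers::vocab::BaseVocab;

let tokenizer: BaseTokenizer<BaseVocab> =
    BaseTokenizer::from_file("path/to/vocab/file", true, true).unwrap();
// Encode a single text, truncating to at most 128 tokens.
let input = tokenizer.encode("Hello world", None, 128, &TruncationStrategy::LongestFirst, 0);
println!("{:?}", input.token_ids);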
fn encode_list<S>(
    &self,
    text_list: &[S],
    max_len: usize,
    truncation_strategy: &TruncationStrategy,
    stride: usize,
) -> Vec<TokenizedInput>
Encode a sequence of string-like texts (tokenization followed by encoding). Note that, in contrast with encode's optional second text, each text provided here is encoded independently.
fn encode_pair_list<S>(
    &self,
    text_list: &[(S, S)],
    max_len: usize,
    truncation_strategy: &TruncationStrategy,
    stride: usize,
) -> Vec<TokenizedInput>
Encode a sequence of string-like text pairs (tokenization followed by encoding). This combines encode with the list processing of encode_list.
fn decode_to_vec(
    &self,
    token_ids: &[i64],
    skip_special_tokens: bool,
) -> Vec<String>
Decode a sequence of token indices to a sequence of Strings, optionally skipping special indices.
fn decode(
    &self,
    token_ids: &[i64],
    skip_special_tokens: bool,
    clean_up_tokenization_spaces: bool,
) -> String
Converts a sequence of ids (integers) into a string, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces.
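Continuing the encode sketch above, decoding maps the ids back to text:
// Skip special tokens and clean up tokenization spaces while decoding.
let text = tokenizer.decode(&input.token_ids, true, true);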
fn convert_tokens_to_string(&self, tokens: Vec<String>) -> String
Converts a sequence of strings into a single string. This will clean up artifacts from tokenization (for example sub ##word) and generate a single output string.

fn clean_up_tokenization(&self, input_string: String) -> String
Cleans up tokenization artifacts (for example whitespace before punctuation).
fn decode_list(
    &self,
    token_ids_list: &[Vec<i64>],
    skip_special_tokens: bool,
    clean_up_tokenization_spaces: bool,
) -> Vec<String>
Converts a list of sequences of ids (integers) into strings, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces. This calls decode for each provided sequence of ids.
fn build_input_with_special_tokens(
    &self,
    tokens_ids_with_offsets_1: TokenIdsWithOffsets,
    tokens_ids_with_offsets_2: Option<TokenIdsWithOffsets>,
) -> TokenIdsWithSpecialTokens
Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens.
fn add_tokens(&mut self, tokens: &[&str])

Add arbitrary tokens to the vocabulary.
fn add_extra_ids(&mut self, num_extra_ids: i64)

Add a given number of extra ids to the vocabulary.
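A short sketch of vocabulary extension (the token string is illustrative):
use rust_tokenizers::tokenizer::{BaseTokenizer, Tokenizer};
use rust_tokenizers::vocab::BaseVocab;

let mut tokenizer: BaseTokenizer<BaseVocab> =
    BaseTokenizer::from_file("path/to/vocab/file", true, true).unwrap();
// Register a custom token so it is treated as a known vocabulary entry.
tokenizer.add_tokens(&["[NEW_TOKEN]"]);
// Append a batch of extra ids to the vocabulary.
tokenizer.add_extra_ids(4);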
Auto Trait Implementations§
impl<T> Freeze for BaseTokenizer<T> where T: Freeze
impl<T> RefUnwindSafe for BaseTokenizer<T> where T: RefUnwindSafe
impl<T> Send for BaseTokenizer<T> where T: Send
impl<T> Sync for BaseTokenizer<T> where T: Sync
impl<T> Unpin for BaseTokenizer<T> where T: Unpin
impl<T> UnwindSafe for BaseTokenizer<T> where T: UnwindSafe
Blanket Implementations§
impl<T> BorrowMut<T> for T where T: ?Sized

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.
impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.