Module rust_tokenizers::preprocessing::tokenizer::tokenization_utils

Functions

bpe

Default BPE function, as used by the RoBERTa and GPT-2 tokenizers
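
A minimal sketch of the merge loop behind this kind of BPE, assuming the merge table is a plain `HashMap` from symbol pairs to merge ranks; the crate wraps this in its own vocabulary and caching types, so the actual signature and details differ:

```rust
use std::collections::HashMap;

// Repeatedly merge the adjacent symbol pair with the lowest merge rank,
// GPT-2 style, until no mergeable pair remains.
fn bpe_sketch(word: &str, merges: &HashMap<(String, String), usize>) -> Vec<String> {
    // Start from single characters (byte-level BPE would start from bytes).
    let mut symbols: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Lowest-ranked adjacent pair, ties broken by position.
        let best = symbols
            .windows(2)
            .enumerate()
            .filter_map(|(i, pair)| {
                merges
                    .get(&(pair[0].clone(), pair[1].clone()))
                    .map(|rank| (*rank, i))
            })
            .min();
        match best {
            Some((_, i)) => {
                let merged = format!("{}{}", symbols[i], symbols[i + 1]);
                symbols.splice(i..=i + 1, std::iter::once(merged));
            }
            None => return symbols,
        }
    }
}
```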

clean_text

Cleans text by removing control characters and normalizing whitespace
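
A minimal sketch of the idea, assuming a plain `&str` in / `String` out signature (the crate's actual signature may differ):

```rust
// Drop NUL, the replacement character and control characters, and map any
// remaining whitespace (tab, newline, ...) to a plain space.
fn clean_text_sketch(text: &str) -> String {
    text.chars()
        .filter_map(|c| {
            if c == '\u{0}' || c == '\u{fffd}' || (c.is_control() && !c.is_whitespace()) {
                None
            } else if c.is_whitespace() {
                Some(' ')
            } else {
                Some(c)
            }
        })
        .collect()
}
```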

ctrl_bpe
decompose_nfkc

NFKC decomposition
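
A minimal sketch of the same normalization using the unicode-normalization crate; the crate function's own signature may differ:

```rust
use unicode_normalization::UnicodeNormalization;

// NFKC: compatibility decomposition followed by canonical composition,
// e.g. the ligature "ﬁ" (U+FB01) becomes "fi".
fn nfkc_sketch(text: &str) -> String {
    text.nfkc().collect()
}
```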

fix_mask
get_pairs
group_common_pairs
is_control

Custom check for whether a character is a control character. The BERT tokenizer treats any character whose Unicode category starts with C as a control character: the traditional control category Cc, but also format Cf, private use Co and surrogate Cs. The unassigned category Cn is skipped to avoid unnecessary checks. A faster check, against the core control characters only, can be selected by setting strict to false; to match the original BERT tokenization, strict should remain true.
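
A minimal sketch of the kind of check described above for the fast (strict = false) case, using only the standard library; the strict case additionally needs a full Unicode general-category lookup, which is left out here, and this is not necessarily the crate's exact logic:

```rust
// Tab, newline and carriage return are treated as whitespace rather than
// as control characters; everything else in the Cc category counts as control.
fn is_control_fast(c: char) -> bool {
    if c == '\t' || c == '\n' || c == '\r' {
        return false;
    }
    c.is_control()
}
```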

is_punctuation
is_whitespace
lowercase

Converts a token to lowercase

openai_gpt_bpe
replace_string

Replaces a pattern &str with a replacement &str while keeping track of offsets (every character of the replacement gets the reference offset of the first character of the matched pattern, since the pattern and replacement may differ in length)
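
A minimal sketch of offset-preserving replacement, assuming offsets are kept as a plain `Vec<u32>` with one entry per character; the crate uses its own token and offset types:

```rust
// Replace every occurrence of `pattern` by `replacement`, giving each
// replacement character the offset of the first matched pattern character.
fn replace_string_sketch(
    text: &str,
    offsets: &[u32],
    pattern: &str,
    replacement: &str,
) -> (String, Vec<u32>) {
    let chars: Vec<char> = text.chars().collect();
    let pat: Vec<char> = pattern.chars().collect();
    let mut out = String::new();
    let mut new_offsets = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        if !pat.is_empty() && i + pat.len() <= chars.len() && chars[i..i + pat.len()] == pat[..] {
            for c in replacement.chars() {
                out.push(c);
                new_offsets.push(offsets[i]); // all map to the first pattern char
            }
            i += pat.len();
        } else {
            out.push(chars[i]);
            new_offsets.push(offsets[i]);
            i += 1;
        }
    }
    (out, new_offsets)
}
```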

split_at_regex
split_on_bpe_pairs
split_on_char

Split a token on one or more characters (given a character test function)
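
A minimal sketch, assuming a plain `&str` in / `Vec<&str>` out signature and dropping the matched characters; the crate's version differs, and split_on_punct below applies the same idea with a punctuation test:

```rust
// Split on every character for which `test` returns true, dropping empty pieces.
fn split_on_char_sketch<F>(text: &str, test: F) -> Vec<&str>
where
    F: Fn(char) -> bool,
{
    text.split(test).filter(|s| !s.is_empty()).collect()
}

// Example: split on ASCII punctuation, roughly the split_on_punct idea.
fn example(text: &str) -> Vec<&str> {
    split_on_char_sketch(text, |c| c.is_ascii_punctuation())
}
```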

split_on_punct

Split a token on punctuation

split_on_regex
split_on_regex_with_lookahead
split_on_special_tokens

Split a text on special tokens (like BOS/EOS/UNK markers), depending on the vocabulary
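
A minimal sketch, assuming the special tokens are given as a plain slice of strings rather than taken from a vocabulary object:

```rust
// Cut `text` into segments that are either one of the special tokens or
// the plain text between them.
fn split_on_special_tokens_sketch<'a>(text: &'a str, special: &[&str]) -> Vec<&'a str> {
    let mut out = Vec::new();
    let mut rest = text;
    loop {
        // Earliest occurrence of any special token in the remainder.
        let next = special
            .iter()
            .filter_map(|tok| rest.find(tok).map(|pos| (pos, *tok)))
            .min_by_key(|(pos, _)| *pos);
        match next {
            Some((pos, tok)) => {
                if pos > 0 {
                    out.push(&rest[..pos]);
                }
                out.push(&rest[pos..pos + tok.len()]);
                rest = &rest[pos + tok.len()..];
            }
            None => {
                if !rest.is_empty() {
                    out.push(rest);
                }
                return out;
            }
        }
    }
}
```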

split_on_substr

Split a token on one or more substrings (given a substring test function)

strip_accents

Remove diacritics
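
A minimal sketch of the usual approach (NFD decomposition, then dropping combining marks), using the unicode-normalization crate; not necessarily the crate's exact implementation:

```rust
use unicode_normalization::char::is_combining_mark;
use unicode_normalization::UnicodeNormalization;

// Decompose to NFD, then drop combining marks (Mn), so "é" becomes "e".
fn strip_accents_sketch(text: &str) -> String {
    text.nfd().filter(|c| !is_combining_mark(*c)).collect()
}
```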

tokenize_cjk_chars

Tokenizes CJK characters so that each character becomes its own token
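
A minimal sketch: pad each CJK character with spaces so that a later whitespace split turns it into its own token. Only a few CJK blocks are checked here, and the crate's actual function may work differently:

```rust
// Simplified CJK check covering the main Unified Ideographs blocks only.
fn is_cjk_char_sketch(c: char) -> bool {
    matches!(c as u32,
        0x4E00..=0x9FFF      // CJK Unified Ideographs
        | 0x3400..=0x4DBF    // Extension A
        | 0xF900..=0xFAFF    // CJK Compatibility Ideographs
        | 0x20000..=0x2A6DF  // Extension B
    )
}

// Surround each CJK character with spaces.
fn tokenize_cjk_sketch(text: &str) -> String {
    let mut out = String::new();
    for c in text.chars() {
        if is_cjk_char_sketch(c) {
            out.push(' ');
            out.push(c);
            out.push(' ');
        } else {
            out.push(c);
        }
    }
    out
}
```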

tokenize_wordpiece

Tokenizes a token into word pieces according to the supplied vocabulary. Continuation word pieces all carry the ## prefix
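
A minimal sketch of greedy longest-match-first WordPiece, assuming the vocabulary is a plain `HashSet<String>` and the unknown token is passed in; the crate uses its own vocabulary types:

```rust
use std::collections::HashSet;

// Greedily match the longest known piece starting at `start`; continuation
// pieces are looked up with the "##" prefix. Unknown tokens collapse to `unk`.
fn wordpiece_sketch(token: &str, vocab: &HashSet<String>, unk: &str) -> Vec<String> {
    let chars: Vec<char> = token.chars().collect();
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let mut end = chars.len();
        let mut found: Option<String> = None;
        while start < end {
            let mut piece: String = chars[start..end].iter().collect();
            if start > 0 {
                piece = format!("##{}", piece);
            }
            if vocab.contains(&piece) {
                found = Some(piece);
                break;
            }
            end -= 1;
        }
        match found {
            Some(piece) => {
                pieces.push(piece);
                start = end;
            }
            None => return vec![unk.to_string()],
        }
    }
    pieces
}
```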

truncate_sequences

Truncates a sequence pair in place to the maximum length.
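
A minimal sketch of a "longest first" strategy on plain `Vec<String>` sequences; the crate's function works on its own types and parameters, so this is only illustrative:

```rust
// Remove tokens from the end of the longer sequence until the pair fits.
fn truncate_pair_sketch(seq_a: &mut Vec<String>, seq_b: &mut Vec<String>, max_len: usize) {
    while seq_a.len() + seq_b.len() > max_len {
        if seq_a.len() >= seq_b.len() {
            seq_a.pop();
        } else {
            seq_b.pop();
        }
    }
}
```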

whitespace_tokenize

Simple tokenization based on whitespace only
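
A minimal sketch, assuming a plain `&str` in / `Vec<&str>` out signature:

```rust
// Split on Unicode whitespace; runs of whitespace yield no empty tokens.
fn whitespace_tokenize_sketch(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}
```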