Module rust_tokenizers::preprocessing::tokenizer::tokenization_utils

Functions

bpe

Default BPE function, as used by the RoBERTa and GPT-2 tokenizers
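
A minimal sketch of the merge loop behind this kind of BPE, assuming the merge table is a plain `HashMap` from symbol pairs to merge ranks; the crate wraps this in its own vocabulary and caching types, so the actual signature and details differ:

```rust
use std::collections::HashMap;

// Repeatedly merge the adjacent symbol pair with the lowest merge rank,
// GPT-2 style, until no mergeable pair remains.
fn bpe_sketch(word: &str, merges: &HashMap<(String, String), usize>) -> Vec<String> {
    // Start from single characters (byte-level BPE would start from bytes).
    let mut symbols: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Lowest-ranked adjacent pair, ties broken by position.
        let best = symbols
            .windows(2)
            .enumerate()
            .filter_map(|(i, pair)| {
                merges
                    .get(&(pair[0].clone(), pair[1].clone()))
                    .map(|rank| (*rank, i))
            })
            .min();
        match best {
            Some((_, i)) => {
                let merged = format!("{}{}", symbols[i], symbols[i + 1]);
                symbols.splice(i..=i + 1, std::iter::once(merged));
            }
            None => return symbols,
        }
    }
}
```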

clean_text

Cleans text by removing control characters and normalizing whitespace
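
A minimal sketch of the idea, assuming a plain `&str` in / `String` out signature (the crate's actual signature may differ):

```rust
// Drop NUL, the replacement character and control characters, and map any
// remaining whitespace (tab, newline, ...) to a plain space.
fn clean_text_sketch(text: &str) -> String {
    text.chars()
        .filter_map(|c| {
            if c == '\u{0}' || c == '\u{fffd}' || (c.is_control() && !c.is_whitespace()) {
                None
            } else if c.is_whitespace() {
                Some(' ')
            } else {
                Some(c)
            }
        })
        .collect()
}
```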

ctrl_bpe
decompose_nfkc

NFKC decomposition
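
A minimal sketch of the same normalization using the unicode-normalization crate; the crate function's own signature may differ:

```rust
use unicode_normalization::UnicodeNormalization;

// NFKC: compatibility decomposition followed by canonical composition,
// e.g. the ligature "ﬁ" (U+FB01) becomes "fi".
fn nfkc_sketch(text: &str) -> String {
    text.nfkc().collect()
}
```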

fix_mask
get_pairs
group_common_pairs
is_control

Custom check for whether a character is a control character. The BERT tokenizer treats any character whose Unicode category starts with C as a control character: the traditional control category Cc, but also format Cf, private use Co and surrogate Cs. The unassigned category Cn is skipped to avoid unnecessary checks. A faster check, against the core control characters only, can be selected by setting strict to false; to match the original BERT tokenization, strict should remain true.
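
A minimal sketch of the kind of check described above for the fast (strict = false) case, using only the standard library; the strict case additionally needs a full Unicode general-category lookup, which is left out here, and this is not necessarily the crate's exact logic:

```rust
// Tab, newline and carriage return are treated as whitespace rather than
// as control characters; everything else in the Cc category counts as control.
fn is_control_fast(c: char) -> bool {
    if c == '\t' || c == '\n' || c == '\r' {
        return false;
    }
    c.is_control()
}
```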

is_punctuation
is_whitespace
lowercase

Converts a token to lowercase

openai_gpt_bpe
replace_string

Replaces a pattern &str with a replacement &str while keeping track of offsets (every character of the replacement gets the reference offset of the first character of the matched pattern, since the pattern and replacement may differ in length)
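
A minimal sketch of offset-preserving replacement, assuming offsets are kept as a plain `Vec<u32>` with one entry per character; the crate uses its own token and offset types:

```rust
// Replace every occurrence of `pattern` by `replacement`, giving each
// replacement character the offset of the first matched pattern character.
fn replace_string_sketch(
    text: &str,
    offsets: &[u32],
    pattern: &str,
    replacement: &str,
) -> (String, Vec<u32>) {
    let chars: Vec<char> = text.chars().collect();
    let pat: Vec<char> = pattern.chars().collect();
    let mut out = String::new();
    let mut new_offsets = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        if !pat.is_empty() && i + pat.len() <= chars.len() && chars[i..i + pat.len()] == pat[..] {
            for c in replacement.chars() {
                out.push(c);
                new_offsets.push(offsets[i]); // all map to the first pattern char
            }
            i += pat.len();
        } else {
            out.push(chars[i]);
            new_offsets.push(offsets[i]);
            i += 1;
        }
    }
    (out, new_offsets)
}
```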

split_at_regex
split_on_bpe_pairs
split_on_char

Split a token on one or more characters (given a character test function)
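
A minimal sketch, assuming a plain `&str` in / `Vec<&str>` out signature and dropping the matched characters; the crate's version differs, and split_on_punct below applies the same idea with a punctuation test:

```rust
// Split on every character for which `test` returns true, dropping empty pieces.
fn split_on_char_sketch<F>(text: &str, test: F) -> Vec<&str>
where
    F: Fn(char) -> bool,
{
    text.split(test).filter(|s| !s.is_empty()).collect()
}

// Example: split on ASCII punctuation, roughly the split_on_punct idea.
fn example(text: &str) -> Vec<&str> {
    split_on_char_sketch(text, |c| c.is_ascii_punctuation())
}
```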

split_on_punct

Split a token on punctuation

split_on_regex
split_on_regex_with_lookahead
split_on_special_tokens

Split a text on special tokens (like BOS/EOS/UNK markers), depending on the vocabulary
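
A minimal sketch, assuming the special tokens are given as a plain slice of strings rather than taken from a vocabulary object:

```rust
// Cut `text` into segments that are either one of the special tokens or
// the plain text between them.
fn split_on_special_tokens_sketch<'a>(text: &'a str, special: &[&str]) -> Vec<&'a str> {
    let mut out = Vec::new();
    let mut rest = text;
    loop {
        // Earliest occurrence of any special token in the remainder.
        let next = special
            .iter()
            .filter_map(|tok| rest.find(tok).map(|pos| (pos, *tok)))
            .min_by_key(|(pos, _)| *pos);
        match next {
            Some((pos, tok)) => {
                if pos > 0 {
                    out.push(&rest[..pos]);
                }
                out.push(&rest[pos..pos + tok.len()]);
                rest = &rest[pos + tok.len()..];
            }
            None => {
                if !rest.is_empty() {
                    out.push(rest);
                }
                return out;
            }
        }
    }
}
```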

split_on_substr

Split a token on one or more substrings (given a substring test function)

strip_accents

Remove diacritics
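
A minimal sketch of the usual approach (NFD decomposition, then dropping combining marks), using the unicode-normalization crate; not necessarily the crate's exact implementation:

```rust
use unicode_normalization::char::is_combining_mark;
use unicode_normalization::UnicodeNormalization;

// Decompose to NFD, then drop combining marks (Mn), so "é" becomes "e".
fn strip_accents_sketch(text: &str) -> String {
    text.nfd().filter(|c| !is_combining_mark(*c)).collect()
}
```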

tokenize_cjk_chars

Tokenizes CJK characters so that each character becomes its own token
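
A minimal sketch: pad each CJK character with spaces so that a later whitespace split turns it into its own token. Only a few CJK blocks are checked here, and the crate's actual function may work differently:

```rust
// Simplified CJK check covering the main Unified Ideographs blocks only.
fn is_cjk_char_sketch(c: char) -> bool {
    matches!(c as u32,
        0x4E00..=0x9FFF      // CJK Unified Ideographs
        | 0x3400..=0x4DBF    // Extension A
        | 0xF900..=0xFAFF    // CJK Compatibility Ideographs
        | 0x20000..=0x2A6DF  // Extension B
    )
}

// Surround each CJK character with spaces.
fn tokenize_cjk_sketch(text: &str) -> String {
    let mut out = String::new();
    for c in text.chars() {
        if is_cjk_char_sketch(c) {
            out.push(' ');
            out.push(c);
            out.push(' ');
        } else {
            out.push(c);
        }
    }
    out
}
```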

tokenize_wordpiece

Tokenizes a token into word pieces according to the supplied vocabulary. Continuation word pieces all carry the ## prefix
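
A minimal sketch of greedy longest-match-first WordPiece, assuming the vocabulary is a plain `HashSet<String>` and the unknown token is passed in; the crate uses its own vocabulary types:

```rust
use std::collections::HashSet;

// Greedily match the longest known piece starting at `start`; continuation
// pieces are looked up with the "##" prefix. Unknown tokens collapse to `unk`.
fn wordpiece_sketch(token: &str, vocab: &HashSet<String>, unk: &str) -> Vec<String> {
    let chars: Vec<char> = token.chars().collect();
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let mut end = chars.len();
        let mut found: Option<String> = None;
        while start < end {
            let mut piece: String = chars[start..end].iter().collect();
            if start > 0 {
                piece = format!("##{}", piece);
            }
            if vocab.contains(&piece) {
                found = Some(piece);
                break;
            }
            end -= 1;
        }
        match found {
            Some(piece) => {
                pieces.push(piece);
                start = end;
            }
            None => return vec![unk.to_string()],
        }
    }
    pieces
}
```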

truncate_sequences

Truncates a sequence pair in place to the maximum length.
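
A minimal sketch of a "longest first" strategy on plain `Vec<String>` sequences; the crate's function works on its own types and parameters, so this is only illustrative:

```rust
// Remove tokens from the end of the longer sequence until the pair fits.
fn truncate_pair_sketch(seq_a: &mut Vec<String>, seq_b: &mut Vec<String>, max_len: usize) {
    while seq_a.len() + seq_b.len() > max_len {
        if seq_a.len() >= seq_b.len() {
            seq_a.pop();
        } else {
            seq_b.pop();
        }
    }
}
```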

whitespace_tokenize

Simple tokenization based on whitespace only
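
A minimal sketch, assuming a plain `&str` in / `Vec<&str>` out signature:

```rust
// Split on Unicode whitespace; runs of whitespace yield no empty tokens.
fn whitespace_tokenize_sketch(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}
```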