Module `rust_tokenizers::preprocessing::tokenizer::tokenization_utils`
Functions
| Function | Description |
| --- | --- |
| `bpe` | Default BPE function, as called by the RoBERTa and GPT-2 tokenizers |
| `clean_text` | Cleans text by removing control characters and normalizing whitespace |
| `ctrl_bpe` | BPE function as called by the CTRL tokenizer |
| `decompose_nfkc` | Performs NFKC decomposition |
| `fix_mask` | |
| `get_pairs` | Returns the set of adjacent symbol pairs in a word (BPE helper) |
| `group_common_pairs` | |
| `is_control` | Custom check for whether a character is a control character; the BERT tokenizer bases this on the character's Unicode category |
| `is_punctuation` | Checks whether a character is a punctuation character |
| `is_whitespace` | Checks whether a character is a whitespace character |
| `lowercase` | Converts text to lowercase |
| `openai_gpt_bpe` | BPE function as called by the OpenAI GPT tokenizer |
| `replace_string` | Replaces a pattern `&str` with a replacement `&str` while keeping track of offsets (every character of the replacement takes the reference offset of the first pattern character, since the two strings may differ in length) |
| `split_at_regex` | Splits a text at matches of a regular expression |
| `split_on_bpe_pairs` | |
| `split_on_char` | Splits a token on one or more characters (given a character test function) |
| `split_on_punct` | Splits a token on punctuation |
| `split_on_regex` | Splits a text on matches of a regular expression |
| `split_on_regex_with_lookahead` | |
| `split_on_special_tokens` | Splits a text on special tokens (such as BOS/EOS/UNK markers), depending on the vocabulary |
| `split_on_substr` | Splits a token on one or more substrings (given a substring test function) |
| `strip_accents` | Removes diacritics |
| `tokenize_cjk_chars` | Tokenizes CJK characters so that each character becomes its own token |
| `tokenize_wordpiece` | Tokenizes a token into word pieces according to the supplied vocabulary; continuation wordpieces all share a common suffix |
| `truncate_sequences` | Truncates a sequence pair in place to the maximum length |
| `whitespace_tokenize` | Simple tokenization based on whitespace only |