Module rust_tokenizers::preprocessing::tokenizer::tokenization_utils
Functions
| Function | Description |
|---|---|
| bpe | Default BPE function, as called by Roberta and GPT2 (see the BPE sketch after this table) |
| clean_text | Cleans text by removing control characters and normalizing whitespace |
| ctrl_bpe | BPE variant used by the CTRL tokenizer |
| decompose_nfkc | NFKC decomposition |
| fix_mask | |
| get_pairs | Returns the set of adjacent symbol pairs in a word (BPE helper) |
| group_common_pairs | |
| is_control | Custom check for control characters: the BERT tokenizer treats any character whose Unicode category starts with C as a control character |
| is_punctuation | Checks whether a character is a punctuation character |
| is_whitespace | Checks whether a character is a whitespace character |
| lowercase | Lowercase a token |
| openai_gpt_bpe | BPE variant used by the OpenAI GPT tokenizer |
| replace_string | Replaces a pattern &str with a replacement &str while keeping track of offsets (every character in the replacement gets the reference offset of the first pattern character, since pattern and replacement may differ in length) |
| split_at_regex | |
| split_on_bpe_pairs | |
| split_on_char | Split a token on one or more characters (given a character test function) |
| split_on_punct | Split a token on punctuation |
| split_on_regex | |
| split_on_regex_with_lookahead | |
| split_on_special_tokens | Split a text on special tokens (like BOS/EOS/UNK markers), depending on the vocabulary |
| split_on_substr | Split a token on one or more substrings (given a substring test function) |
| strip_accents | Remove diacritics |
| tokenize_cjk_chars | Tokenizes CJK characters; each character becomes its own token |
| tokenize_wordpiece | Tokenize a token into word pieces according to the supplied vocabulary. Continuation wordpieces carry the ## marker (see the WordPiece sketch after this table) |
| truncate_sequences | Truncates a sequence pair in place to the maximum length (see the truncation sketch after this table) |
| whitespace_tokenize | Simple tokenization based on whitespace only |
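
Several of the functions above (`bpe`, `get_pairs`, `group_common_pairs`, and the model-specific `ctrl_bpe` and `openai_gpt_bpe`) revolve around the same byte-pair-encoding loop: collect the adjacent symbol pairs in a word, then repeatedly merge the best-ranked pair. The sketch below illustrates that loop only; the helper names, signatures, and merge-rank table are assumptions made for the example and do not mirror the crate's actual API.

```rust
use std::collections::{HashMap, HashSet};

/// Collect the set of adjacent symbol pairs in a word (what a `get_pairs`-style
/// helper computes for BPE). Illustrative only.
fn get_pairs(symbols: &[String]) -> HashSet<(String, String)> {
    symbols
        .windows(2)
        .map(|w| (w[0].clone(), w[1].clone()))
        .collect()
}

/// Merge every occurrence of `pair` into a single symbol, the core step applied
/// inside a BPE loop. Illustrative only.
fn merge_pair(symbols: &[String], pair: &(String, String)) -> Vec<String> {
    let mut merged = Vec::with_capacity(symbols.len());
    let mut i = 0;
    while i < symbols.len() {
        if i + 1 < symbols.len() && symbols[i] == pair.0 && symbols[i + 1] == pair.1 {
            merged.push(format!("{}{}", pair.0, pair.1));
            i += 2;
        } else {
            merged.push(symbols[i].clone());
            i += 1;
        }
    }
    merged
}

fn main() {
    // Hypothetical merge ranks: a lower rank means the pair is merged earlier.
    let merges: HashMap<(String, String), usize> = vec![
        (("l".to_string(), "o".to_string()), 0),
        (("lo".to_string(), "w".to_string()), 1),
    ]
    .into_iter()
    .collect();

    // Start from single characters, then repeatedly merge the best-ranked pair.
    let mut symbols: Vec<String> = "low".chars().map(|c| c.to_string()).collect();
    loop {
        let pairs = get_pairs(&symbols);
        let best = pairs
            .into_iter()
            .min_by_key(|p| *merges.get(p).unwrap_or(&usize::MAX));
        match best {
            Some(pair) if merges.contains_key(&pair) => symbols = merge_pair(&symbols, &pair),
            _ => break,
        }
    }
    println!("{:?}", symbols); // ["low"]
}
```

Model-specific BPE variants typically differ in how the initial symbols are produced and how word boundaries are marked, not in this core merge loop.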
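`tokenize_wordpiece` follows the standard greedy longest-match-first strategy: repeatedly take the longest substring found in the vocabulary, marking continuation pieces with `##`. Below is a minimal sketch of that strategy, using a plain `HashSet<String>` as a stand-in for the vocabulary type; the function name and signature are illustrative, not the crate's.

```rust
use std::collections::HashSet;

/// Greedy longest-match-first WordPiece splitting (illustrative sketch).
fn wordpiece(token: &str, vocab: &HashSet<String>, unk: &str) -> Vec<String> {
    let chars: Vec<char> = token.chars().collect();
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        // Try the longest remaining substring first, shrinking until a vocabulary entry matches.
        let mut end = chars.len();
        let mut found: Option<String> = None;
        while start < end {
            let mut piece: String = chars[start..end].iter().collect();
            if start > 0 {
                piece = format!("##{}", piece); // continuation pieces carry the ## marker
            }
            if vocab.contains(&piece) {
                found = Some(piece);
                break;
            }
            end -= 1;
        }
        match found {
            Some(piece) => {
                pieces.push(piece);
                start = end;
            }
            // No sub-piece matched: map the whole token to the unknown marker.
            None => return vec![unk.to_string()],
        }
    }
    pieces
}

fn main() {
    let vocab: HashSet<String> = ["un", "##affable", "##aff", "##able"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    println!("{:?}", wordpiece("unaffable", &vocab, "[UNK]")); // ["un", "##affable"]
}
```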
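`truncate_sequences` reduces a sequence pair to fit a maximum length. A common strategy is longest-first truncation, which removes one token at a time from whichever sequence is currently longer; the sketch below shows only that strategy, with an assumed signature rather than the crate's.

```rust
/// Longest-first truncation of a token-id pair (illustrative sketch; the crate's
/// function has a richer signature and supports other truncation strategies).
fn truncate_pair(seq_a: &mut Vec<i64>, seq_b: &mut Vec<i64>, max_len: usize) {
    while seq_a.len() + seq_b.len() > max_len {
        // Drop one token from the end of whichever sequence is currently longer.
        if seq_a.len() > seq_b.len() {
            seq_a.pop();
        } else {
            seq_b.pop();
        }
    }
}

fn main() {
    let mut a: Vec<i64> = (0..8).collect();
    let mut b: Vec<i64> = (100..104).collect();
    truncate_pair(&mut a, &mut b, 10);
    println!("{:?} {:?}", a, b); // a is trimmed to 6 ids, b keeps its 4
}
```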