Function rust_tokenizers::preprocessing::tokenizer::tokenization_utils::is_control

pub fn is_control(character: &char, strict: bool) -> bool

This is a custom method to check whether a character is a control character. The BERT tokenizer treats any character whose Unicode category starts with C as a control character, which includes the traditional control category Cc, but also the format (Cf), private use (Co) and surrogate (Cs) categories. The unassigned category Cn is skipped to avoid unnecessary checks. Setting strict to false selects a faster method that only checks against the core control characters. To match the original BERT tokenization, strict should remain true.
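
The difference between the two modes can be illustrated with a minimal sketch. It is not the crate's implementation: `is_control_sketch` is a hypothetical stand-in, the fast path is assumed to be equivalent to the standard library's `char::is_control` (which covers only Cc), and the strict path checks only a small, deliberately incomplete sample of Cf and Co code points. Surrogates (Cs) do not appear because a Rust `char` cannot hold one.

```rust
/// Hypothetical stand-in for illustration only; not the crate's implementation.
fn is_control_sketch(character: &char, strict: bool) -> bool {
    // Core control characters: std's `char::is_control` covers the Cc category.
    if character.is_control() {
        return true;
    }
    // Assumed fast path: nothing beyond Cc is checked when `strict` is false.
    if !strict {
        return false;
    }
    // Strict path: an incomplete sample of other C* categories (assumption:
    // a real implementation would need the full set of ranges).
    let code = *character as u32;
    code == 0x00AD // soft hyphen (Cf)
        || (0x200B..=0x200F).contains(&code) // zero-width and directional marks (Cf)
        || code == 0xFEFF // zero-width no-break space (Cf)
        || (0xE000..=0xF8FF).contains(&code) // BMP private use area (Co)
}
```

Under these assumptions, a zero-width space (U+200B) is reported as a control character only when strict is true, while a core control character such as U+0007 is reported in both modes.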