Function rust_tokenizers::preprocessing::tokenizer::tokenization_utils::is_control
pub fn is_control(character: &char, strict: bool) -> bool
This is a custom method to check whether a character is a control character. The BERT tokenizer
treats any character whose Unicode category starts with C as a control character. This includes
the traditional control category Cc, but also the format (Cf), private use (Co) and surrogate (Cs)
categories. The unassigned Unicode category Cn is skipped in order to avoid unnecessary checks.
A faster check is available by setting strict to false, which only tests against the core control
characters. To match the original BERT tokenization, strict should remain true.
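
The difference between the two modes can be sketched with the standard library alone. This is an illustration, not the crate's implementation: the name is_control_sketch, the helper in_ranges and both range tables are hypothetical, and the Cf table is deliberately partial. It takes char by value for brevity, whereas the documented function takes &char. It assumes that "core control characters" means the Cc category, which is what char::is_control tests.

```rust
/// Hypothetical, partial table of format (Cf) code point ranges.
/// A real implementation would need the full Unicode category data.
const FORMAT_RANGES: &[(u32, u32)] = &[
    (0x00AD, 0x00AD), // soft hyphen
    (0x200B, 0x200F), // zero-width space and joiners, directional marks
    (0x202A, 0x202E), // bidirectional embedding controls
    (0x2060, 0x2064), // word joiner, invisible operators
    (0xFEFF, 0xFEFF), // zero-width no-break space (byte order mark)
];

/// Private use (Co) code point ranges.
const PRIVATE_USE_RANGES: &[(u32, u32)] = &[
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
];

/// Returns true if `code` falls inside any inclusive range of `ranges`.
fn in_ranges(code: u32, ranges: &[(u32, u32)]) -> bool {
    ranges.iter().any(|&(lo, hi)| (lo..=hi).contains(&code))
}

/// Sketch of a two-mode control check (assumed behaviour, not the crate's code).
/// Surrogates (Cs) need no branch here: a Rust `char` can never hold one.
pub fn is_control_sketch(character: char, strict: bool) -> bool {
    // Fast path: the Cc category only.
    if character.is_control() {
        return true;
    }
    if !strict {
        return false;
    }
    // Strict path: also accept format (Cf) and private use (Co) characters.
    let code = character as u32;
    in_ranges(code, FORMAT_RANGES) || in_ranges(code, PRIVATE_USE_RANGES)
}
```

With this sketch, a zero-width space (U+200B, category Cf) is reported as a control character only when strict is true, while U+0000 (category Cc) is reported in both modes.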