Function rust_tokenizers::preprocessing::tokenizer::tokenization_utils::split_on_substr

pub fn split_on_substr<'a, F>(
    token: TokenRef<'a>,
    test_substr: F,
    add_separators: bool
) -> Vec<TokenRef<'a>> where
    F: Fn(&'a str) -> (usize, usize, Mask)

Split a token on one or more substrings (given a substring test function)

  • token: The token to split
  • test_substr: A function that takes the string buffer from the current position forward and returns a 3-tuple containing the byte length of the match, the character length of the match, and the mask to set (a length of zero indicates no match)
  • add_separators: Whether to add the separating characters to the output tokens as well (bool). Separator tokens are marked in the returned tokens with the mask value returned by the test_substr function
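The splitting logic described above can be sketched as follows. This is a simplified, self-contained illustration, not the crate's actual implementation: it operates on a plain `&str` instead of `TokenRef<'a>`, and the `Mask` enum here is a stand-in with only the variants needed for the example.

```rust
// Stand-in for the crate's Mask enum (hypothetical, reduced variant set).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Mask {
    None,
    Punctuation,
}

// Split `text` wherever `test_substr` reports a match (non-zero byte length).
// If `add_separators` is true, matched substrings become tokens themselves,
// tagged with the mask returned by `test_substr`.
fn split_on_substr<F>(text: &str, test_substr: F, add_separators: bool) -> Vec<(&str, Mask)>
where
    F: Fn(&str) -> (usize, usize, Mask),
{
    let mut tokens = Vec::new();
    let mut start = 0; // byte offset where the current unmatched run began
    let mut pos = 0; // current byte offset in `text`
    while pos < text.len() {
        let (bytes, _chars, mask) = test_substr(&text[pos..]);
        if bytes > 0 {
            // Flush any pending unmatched text before the separator.
            if start < pos {
                tokens.push((&text[start..pos], Mask::None));
            }
            // Optionally emit the separator itself, tagged with its mask.
            if add_separators {
                tokens.push((&text[pos..pos + bytes], mask));
            }
            pos += bytes;
            start = pos;
        } else {
            // No match here: advance one char (not one byte) to stay on a
            // UTF-8 character boundary.
            pos += text[pos..].chars().next().map_or(1, |c| c.len_utf8());
        }
    }
    // Flush the trailing unmatched run, if any.
    if start < text.len() {
        tokens.push((&text[start..], Mask::None));
    }
    tokens
}

fn main() {
    // A test function matching a single comma as separator.
    let test = |s: &str| {
        if s.starts_with(',') {
            (1, 1, Mask::Punctuation)
        } else {
            (0, 0, Mask::None)
        }
    };
    // → [("a", None), (",", Punctuation), ("b", None)]
    println!("{:?}", split_on_substr("a,b", test, true));
}
```

With `add_separators` set to `false`, the same input would yield only `("a", None)` and `("b", None)`; the comma is consumed but not emitted.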