Function rust_tokenizers::preprocessing::tokenizer::tokenization_utils::split_on_substr
pub fn split_on_substr<'a, F>(
token: TokenRef<'a>,
test_substr: F,
add_separators: bool
) -> Vec<TokenRef<'a>> where
F: Fn(&'a str) -> (usize, usize, Mask),
Split a token on one or more substrings, as determined by a substring test function.
- token: The token to split
- test_substr: A function that takes the string buffer from the current position forward and returns a 3-tuple holding the length of the match in bytes, the length in chars, and the mask to set (a zero length means no match).
- add_separators: Whether to add the separating characters to the output tokens as well (bool). Separator tokens are marked in the output with the mask value returned by the test_substr function.
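To illustrate the contract described above, here is a simplified, self-contained sketch of such a splitting routine. It is not the crate's implementation: it uses plain `&str` slices in place of `TokenRef` and string labels in place of the `Mask` enum, and the splitting logic is an assumption based on the parameter descriptions.

```rust
/// Hypothetical, simplified sketch of split-on-substring behavior.
/// Returns (text, mask_label) pairs instead of the crate's TokenRef.
fn split_on_substr_sketch<'a, F>(
    text: &'a str,
    test_substr: F,
    add_separators: bool,
) -> Vec<(&'a str, &'static str)>
where
    F: Fn(&'a str) -> (usize, usize, &'static str),
{
    let mut tokens = Vec::new();
    let mut start = 0; // byte offset where the current unmatched run began
    let mut pos = 0;   // current scan position in bytes

    while pos < text.len() {
        let (bytes, _chars, mask) = test_substr(&text[pos..]);
        if bytes > 0 {
            // Flush any pending unmatched text before the separator.
            if start < pos {
                tokens.push((&text[start..pos], "None"));
            }
            // Optionally keep the separator itself, tagged with the
            // mask returned by the test function.
            if add_separators {
                tokens.push((&text[pos..pos + bytes], mask));
            }
            pos += bytes;
            start = pos;
        } else {
            // No match here: advance by one full character so we stay
            // on a UTF-8 boundary.
            pos += text[pos..].chars().next().map_or(1, |c| c.len_utf8());
        }
    }
    if start < text.len() {
        tokens.push((&text[start..], "None"));
    }
    tokens
}

fn main() {
    // Example: split on commas, keeping them as "Punctuation" tokens.
    let test = |s: &str| {
        if s.starts_with(',') {
            (1, 1, "Punctuation") // 1 byte, 1 char, separator mask
        } else {
            (0, 0, "None") // zero length: no match at this position
        }
    };
    let out = split_on_substr_sketch("a,b", test, true);
    println!("{:?}", out);
    // → [("a", "None"), (",", "Punctuation"), ("b", "None")]
}
```

With `add_separators` set to false, the same input yields only `("a", "None")` and `("b", "None")`; the commas are dropped rather than emitted as masked tokens.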