pub fn tokenize(text: &str) -> Vec<String>
Tokenizes text into words by splitting on whitespace and non-alphanumeric characters, removes stop words, and applies stemming. Also splits camelCase/PascalCase identifiers and compound words.
The tokenization flow follows these steps:
- Split input text on whitespace
- For each token, further split on non-alphanumeric characters (except for a leading “-”, which marks the term as negated)
- For each resulting token, check if it has mixed case
- If it has mixed case, split using camel case rules
- For each part, attempt to split compound words
- Process each part: remove stop words and apply stemming
- Collect unique tokens
- Exclude terms that were negated with a “-” prefix
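
The steps above can be sketched roughly as follows. This is a simplified illustration, not the crate's actual implementation: `tokenize_sketch`, `split_camel_case`, `naive_stem`, and the `STOP_WORDS` list are hypothetical names introduced here, the stemmer is a placeholder that only strips a trailing "s", and compound-word splitting is omitted because it depends on a dictionary that is not described here. Camel-case splitting is applied to every piece, since it leaves single-case words unchanged.

```rust
use std::collections::HashSet;

/// Hypothetical stop-word list (an assumption; the real list is not shown here).
const STOP_WORDS: &[&str] = &["a", "an", "and", "the", "of", "to", "in", "is"];

/// Splits at lowercase-to-uppercase boundaries, e.g. "parseQuery" -> ["parse", "Query"].
fn split_camel_case(word: &str) -> Vec<String> {
    let mut parts = Vec::new();
    let mut current = String::new();
    let mut prev_lower = false;
    for ch in word.chars() {
        if ch.is_uppercase() && prev_lower && !current.is_empty() {
            parts.push(std::mem::take(&mut current));
        }
        prev_lower = ch.is_lowercase();
        current.push(ch);
    }
    if !current.is_empty() {
        parts.push(current);
    }
    parts
}

/// Placeholder stemmer (an assumption): strips one trailing "s" from longer words.
fn naive_stem(word: &str) -> String {
    if word.len() > 3 && word.ends_with('s') && !word.ends_with("ss") {
        word[..word.len() - 1].to_string()
    } else {
        word.to_string()
    }
}

fn tokenize_sketch(text: &str) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut negated = HashSet::new();
    let mut tokens = Vec::new();

    // Step 1: split on whitespace.
    for raw in text.split_whitespace() {
        // A leading "-" marks the whole term as negated.
        let (is_negated, body) = match raw.strip_prefix('-') {
            Some(rest) => (true, rest),
            None => (false, raw),
        };
        // Step 2: split on non-alphanumeric characters.
        for piece in body.split(|c: char| !c.is_alphanumeric()).filter(|p| !p.is_empty()) {
            // Steps 3-4: split mixed-case identifiers.
            for part in split_camel_case(piece) {
                // Step 6: drop stop words, then stem.
                let lower = part.to_lowercase();
                if STOP_WORDS.contains(&lower.as_str()) {
                    continue;
                }
                let stemmed = naive_stem(&lower);
                if is_negated {
                    negated.insert(stemmed);
                } else if seen.insert(stemmed.clone()) {
                    // Step 7: keep only the first occurrence of each token.
                    tokens.push(stemmed);
                }
            }
        }
    }
    // Step 8: exclude any term that also appeared with a "-" prefix.
    tokens.retain(|t| !negated.contains(t));
    tokens
}
```

Under these assumptions, `tokenize_sketch("the parseQuery handles tokens -tokens")` drops the stop word "the", splits "parseQuery" into "parse" and "query", stems "handles" to "handle", and removes "token" because it was negated.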