
Function tokenize 

pub fn tokenize(text: &str) -> Vec<String>

Tokenizes text into words: splits on whitespace and non-alphanumeric characters, removes stop words, and applies stemming. Also splits camelCase/PascalCase identifiers and compound words.

The tokenization flow follows these steps:

  1. Split input text on whitespace
  2. For each token, further split on non-alphanumeric characters (except for leading “-”)
  3. For each resulting token, check if it has mixed case
  4. If it has mixed case, split using camel case rules
  5. For each part, attempt to split compound words
  6. Process each part: remove stop words and apply stemming
  7. Collect unique tokens
  8. Exclude terms that were negated with a “-” prefix
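
The steps above can be approximated with a small standard-library sketch. This is not the crate's implementation: `sketch_tokenize`, `split_camel_case`, `toy_stem`, and `STOP_WORDS` are hypothetical names introduced here, the stop-word list and suffix-stripping stemmer are toy stand-ins, and compound-word splitting (step 5) is left out because it would need a dictionary that the description does not show.

```rust
use std::collections::HashSet;

// Hypothetical stand-in: the real stop-word list is not shown in the docs.
const STOP_WORDS: &[&str] = &["the", "a", "an", "and", "of", "to", "in", "is"];

/// Toy stemmer (assumption): strips a few common suffixes.
/// A placeholder for a real stemming algorithm.
fn toy_stem(word: &str) -> String {
    for suffix in ["ing", "ed", "s"] {
        if let Some(base) = word.strip_suffix(suffix) {
            if base.len() >= 3 {
                return base.to_string();
            }
        }
    }
    word.to_string()
}

/// Splits a mixed-case identifier at each lowercase-to-uppercase boundary.
fn split_camel_case(token: &str) -> Vec<String> {
    let mut parts = Vec::new();
    let mut current = String::new();
    let mut prev_lower = false;
    for ch in token.chars() {
        if ch.is_uppercase() && prev_lower && !current.is_empty() {
            parts.push(std::mem::take(&mut current));
        }
        prev_lower = ch.is_lowercase();
        current.push(ch);
    }
    if !current.is_empty() {
        parts.push(current);
    }
    parts
}

/// Sketch of the documented flow; step 5 (compound words) is omitted.
fn sketch_tokenize(text: &str) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut negated = HashSet::new();
    let mut tokens = Vec::new();

    // Step 1: split on whitespace.
    for raw in text.split_whitespace() {
        // A leading '-' marks the term as negated (steps 2 and 8).
        let (is_negated, body) = match raw.strip_prefix('-') {
            Some(rest) if !rest.is_empty() => (true, rest),
            _ => (false, raw),
        };
        // Step 2: split on non-alphanumeric characters.
        for piece in body.split(|c: char| !c.is_alphanumeric()).filter(|p| !p.is_empty()) {
            // Steps 3-4: split only tokens that mix upper and lower case.
            let mixed = piece.chars().any(|c| c.is_lowercase())
                && piece.chars().any(|c| c.is_uppercase());
            let parts = if mixed {
                split_camel_case(piece)
            } else {
                vec![piece.to_string()]
            };
            for part in parts {
                // Step 6: drop stop words, then stem.
                let lower = part.to_lowercase();
                if STOP_WORDS.contains(&lower.as_str()) {
                    continue;
                }
                let stemmed = toy_stem(&lower);
                if is_negated {
                    negated.insert(stemmed);
                } else if seen.insert(stemmed.clone()) {
                    // Step 7: keep only the first occurrence of each token.
                    tokens.push(stemmed);
                }
            }
        }
    }
    // Step 8: exclude terms that were negated anywhere in the input.
    tokens.retain(|t| !negated.contains(t));
    tokens
}
```

In this sketch, tokens are returned in order of first appearance; the description above only says that unique tokens are collected, so the real function's ordering may differ. With the toy stemmer, an input such as `"parseHttpRequest and the requests -http"` yields `parse` and `request`: the identifier is split into three parts, `and` and `the` are dropped as stop words, `requests` stems to the already-seen `request`, and `http` is removed by the negated term.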