Function tokenize

pub fn tokenize(text: &str) -> Vec<String>

Tokenizes text into words: splits on whitespace and non-alphanumeric characters, removes stop words, and applies stemming. Also splits camelCase/PascalCase identifiers and compound words into their parts.

The tokenization flow follows these steps:

1. Split input text on whitespace
2. For each token, further split on non-alphanumeric characters (except for a leading “-”, which marks a negated term)
3. For each resulting token, check if it has mixed case
4. If it has mixed case, split using camel case rules
5. For each part, attempt to split compound words
6. Process each part: remove stop words and apply stemming
7. Collect unique tokens
8. Exclude terms that were negated with a “-” prefix
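
The flow above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the crate's actual implementation: `tokenize_sketch`, `split_camel_case`, `naive_stem`, and the `STOP_WORDS` list are hypothetical names invented for this example, the stemmer is a trivial placeholder that only strips a trailing "s", and compound-word splitting is omitted because it would need a dictionary that the description does not specify.

```rust
use std::collections::HashSet;

/// Hypothetical stop-word list; the real list is not given in the docs.
const STOP_WORDS: &[&str] = &["a", "an", "the", "and", "or", "of", "to", "in", "is"];

/// Splits an identifier at lower-to-upper case boundaries,
/// e.g. "parseHttpRequest" -> ["parse", "Http", "Request"].
fn split_camel_case(word: &str) -> Vec<String> {
    let mut parts = Vec::new();
    let mut current = String::new();
    let mut prev_lower = false;
    for ch in word.chars() {
        if ch.is_uppercase() && prev_lower && !current.is_empty() {
            parts.push(std::mem::take(&mut current));
        }
        prev_lower = ch.is_lowercase();
        current.push(ch);
    }
    if !current.is_empty() {
        parts.push(current);
    }
    parts
}

/// Placeholder stemmer (an assumption): strips one trailing "s" from longer words.
/// A real stemmer would apply a proper algorithm instead.
fn naive_stem(word: &str) -> String {
    if word.len() > 3 && word.ends_with('s') && !word.ends_with("ss") {
        word[..word.len() - 1].to_string()
    } else {
        word.to_string()
    }
}

/// Sketch of the documented flow, minus compound-word splitting.
fn tokenize_sketch(text: &str) -> Vec<String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut negated: HashSet<String> = HashSet::new();
    let mut tokens: Vec<String> = Vec::new();

    // Step 1: split on whitespace.
    for raw in text.split_whitespace() {
        // A leading "-" marks the whole term as negated.
        let (is_negated, body) = match raw.strip_prefix('-') {
            Some(rest) => (true, rest),
            None => (false, raw),
        };
        // Step 2: split on non-alphanumeric characters.
        for piece in body
            .split(|c: char| !c.is_alphanumeric())
            .filter(|p| !p.is_empty())
        {
            // Steps 3 and 4: split mixed-case tokens with camel case rules.
            let mixed = piece.chars().any(|c| c.is_lowercase())
                && piece.chars().any(|c| c.is_uppercase());
            let parts = if mixed {
                split_camel_case(piece)
            } else {
                vec![piece.to_string()]
            };
            // Step 6: drop stop words, then stem.
            for part in parts {
                let lower = part.to_lowercase();
                if STOP_WORDS.contains(&lower.as_str()) {
                    continue;
                }
                let stemmed = naive_stem(&lower);
                if is_negated {
                    negated.insert(stemmed);
                } else if seen.insert(stemmed.clone()) {
                    // Step 7: keep only the first occurrence of each token.
                    tokens.push(stemmed);
                }
            }
        }
    }
    // Step 8: exclude terms that were negated with a "-" prefix.
    tokens.retain(|t| !negated.contains(t));
    tokens
}
```

With these placeholder rules, `tokenize_sketch("parseHttpRequest the tokens -http")` yields `["parse", "request", "token"]`: the identifier is split and lowercased, "the" is dropped as a stop word, "tokens" is stemmed, and "http" is excluded because of the negated term. The real `tokenize` may order or stem tokens differently.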
