Module tokenization

Functions

add_special_term
Add a term to the dynamic special terms list
is_english_stop_word
Checks if a word is a common English stop word or a simple number (0-10)
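As a rough illustration of that combined check (common stop word or a number 0-10), a hypothetical version might look like the sketch below; the stop list shown here is an assumption, not the module's actual list:

```rust
// Hypothetical sketch: the real stop list is assumed to be larger.
fn is_english_stop_word(word: &str) -> bool {
    const STOP: &[&str] = &["a", "an", "the", "and", "or", "of", "to", "in", "is"];
    let w = word.to_lowercase();
    STOP.contains(&w.as_str())
        // Also treat simple numbers 0-10 as stop words.
        || matches!(w.parse::<u32>(), Ok(n) if n <= 10)
}
```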
is_programming_stop_word
Checks if a word is a programming language stop word
is_special_case
Checks if a word is a special case that should be treated as a single token
is_stop_word
Checks if a word is either an English or programming stop word
load_vocabulary
Loads a vocabulary for compound word splitting This is a simplified version that could be expanded with a real dictionary
split_camel_case
Splits a string on camel case boundaries
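A minimal sketch of the kind of boundary detection such a function performs, covering both lower-to-upper transitions (`fooBar`) and acronym runs (`XMLParser`); this is an illustration under those assumptions, not the module's implementation:

```rust
// Illustrative camel-case splitter, not the module's actual code.
fn split_camel_case(s: &str) -> Vec<String> {
    let mut parts = Vec::new();
    let mut current = String::new();
    let chars: Vec<char> = s.chars().collect();
    for (i, &c) in chars.iter().enumerate() {
        if c.is_uppercase() && !current.is_empty() {
            // Boundary: lower -> Upper (fooBar), or the last capital of an
            // acronym run followed by a lowercase letter (XMLParser -> XML | Parser).
            let prev_lower = chars[i - 1].is_lowercase();
            let acronym_end = chars[i - 1].is_uppercase()
                && chars.get(i + 1).map_or(false, |n| n.is_lowercase());
            if prev_lower || acronym_end {
                parts.push(current.clone());
                current.clear();
            }
        }
        current.push(c);
    }
    if !current.is_empty() {
        parts.push(current);
    }
    parts
}
```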
split_compound_word
Attempts to split a compound word into its constituent parts using a vocabulary. Returns the original word if it cannot be split
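One common way to do this is a greedy longest-prefix match against the vocabulary, falling back to the original word when no full split exists; the module may well use a different strategy, so treat this as a sketch:

```rust
use std::collections::HashSet;

// Hypothetical greedy longest-prefix splitter; the real strategy is assumed.
fn split_compound_word(word: &str, vocab: &HashSet<&str>) -> Vec<String> {
    let mut parts = Vec::new();
    let mut rest = word;
    while !rest.is_empty() {
        // Take the longest vocabulary word that prefixes the remainder.
        let mut matched = 0;
        for len in (1..=rest.len()).rev() {
            if rest.is_char_boundary(len) && vocab.contains(&rest[..len]) {
                matched = len;
                break;
            }
        }
        if matched == 0 {
            // Cannot split fully: return the original word unchanged.
            return vec![word.to_string()];
        }
        parts.push(rest[..matched].to_string());
        rest = &rest[matched..];
    }
    parts
}
```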
tokenize
Tokenizes text into words by splitting on whitespace and non-alphanumeric characters, removes stop words, and applies stemming. Also splits camelCase/PascalCase identifiers and compound words.
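The core of that pipeline (minus stemming and identifier splitting) can be sketched as follows; the stop-word handling here is simplified and the real function does more:

```rust
// Illustrative pipeline only: split on non-alphanumeric characters,
// lowercase, and drop stop words. The real `tokenize` additionally applies
// stemming and camelCase/compound-word splitting.
fn tokenize(text: &str, stop_words: &[&str]) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(str::to_lowercase)
        .filter(|t| !stop_words.contains(&t.as_str()))
        .collect()
}
```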
tokenize_and_stem
Tokenize and stem a keyword, handling camel case and compound word splitting. This function is used by the elastic query parser to process terms in the AST
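To show what the stemming step does in principle, here is a deliberately crude suffix-stripping stemmer; the module's actual stemmer is assumed to be a proper algorithm (e.g. Porter-style), so this is illustration only:

```rust
// Crude illustrative stemmer: strips a few common English suffixes.
// NOT the module's real stemmer, which is assumed to be more sophisticated.
fn crude_stem(word: &str) -> String {
    let w = word.to_lowercase();
    for suffix in ["ing", "ed", "es", "s"] {
        // Only strip when a reasonable stem remains.
        if w.len() > suffix.len() + 2 && w.ends_with(suffix) {
            return w[..w.len() - suffix.len()].to_string();
        }
    }
    w
}
```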