Module tokenization

Functions

add_special_term
Add a term to the dynamic special terms list
is_english_stop_word
Checks if a word is a common English stop word or a simple number (0-10)
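As a rough illustration of that combined check (common stop word or a number 0-10), a hypothetical version might look like the sketch below; the stop list shown here is an assumption, not the module's actual list:

```rust
// Hypothetical sketch: the real stop list is assumed to be larger.
fn is_english_stop_word(word: &str) -> bool {
    const STOP: &[&str] = &["a", "an", "the", "and", "or", "of", "to", "in", "is"];
    let w = word.to_lowercase();
    STOP.contains(&w.as_str())
        // Also treat simple numbers 0-10 as stop words.
        || matches!(w.parse::<u32>(), Ok(n) if n <= 10)
}
```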
is_programming_stop_word
Checks if a word is a programming language stop word
is_special_case
Checks if a word is a special case that should be treated as a single token
is_stop_word
Checks if a word is either an English or programming stop word
load_vocabulary
Loads a vocabulary for compound word splitting This is a simplified version that could be expanded with a real dictionary
split_camel_case
Splits a string on camel case boundaries
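A minimal sketch of the kind of boundary detection such a function performs, covering both lower-to-upper transitions (`fooBar`) and acronym runs (`XMLParser`); this is an illustration under those assumptions, not the module's implementation:

```rust
// Illustrative camel-case splitter, not the module's actual code.
fn split_camel_case(s: &str) -> Vec<String> {
    let mut parts = Vec::new();
    let mut current = String::new();
    let chars: Vec<char> = s.chars().collect();
    for (i, &c) in chars.iter().enumerate() {
        if c.is_uppercase() && !current.is_empty() {
            // Boundary: lower -> Upper (fooBar), or the last capital of an
            // acronym run followed by a lowercase letter (XMLParser -> XML | Parser).
            let prev_lower = chars[i - 1].is_lowercase();
            let acronym_end = chars[i - 1].is_uppercase()
                && chars.get(i + 1).map_or(false, |n| n.is_lowercase());
            if prev_lower || acronym_end {
                parts.push(current.clone());
                current.clear();
            }
        }
        current.push(c);
    }
    if !current.is_empty() {
        parts.push(current);
    }
    parts
}
```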
split_compound_word
Attempts to split a compound word into its constituent parts using a vocabulary. Returns the original word if it cannot be split
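One common way to do this is a greedy longest-prefix match against the vocabulary, falling back to the original word when no full split exists; the module may well use a different strategy, so treat this as a sketch:

```rust
use std::collections::HashSet;

// Hypothetical greedy longest-prefix splitter; the real strategy is assumed.
fn split_compound_word(word: &str, vocab: &HashSet<&str>) -> Vec<String> {
    let mut parts = Vec::new();
    let mut rest = word;
    while !rest.is_empty() {
        // Take the longest vocabulary word that prefixes the remainder.
        let mut matched = 0;
        for len in (1..=rest.len()).rev() {
            if rest.is_char_boundary(len) && vocab.contains(&rest[..len]) {
                matched = len;
                break;
            }
        }
        if matched == 0 {
            // Cannot split fully: return the original word unchanged.
            return vec![word.to_string()];
        }
        parts.push(rest[..matched].to_string());
        rest = &rest[matched..];
    }
    parts
}
```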
tokenize
Tokenizes text into words by splitting on whitespace and non-alphanumeric characters, removes stop words, and applies stemming. Also splits camelCase/PascalCase identifiers and compound words.
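The core of that pipeline (minus stemming and identifier splitting) can be sketched as follows; the stop-word handling here is simplified and the real function does more:

```rust
// Illustrative pipeline only: split on non-alphanumeric characters,
// lowercase, and drop stop words. The real `tokenize` additionally applies
// stemming and camelCase/compound-word splitting.
fn tokenize(text: &str, stop_words: &[&str]) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(str::to_lowercase)
        .filter(|t| !stop_words.contains(&t.as_str()))
        .collect()
}
```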
tokenize_and_stem
Tokenize and stem a keyword, handling camel case and compound word splitting. This function is used by the elastic query parser to process terms in the AST
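To show what the stemming step does in principle, here is a deliberately crude suffix-stripping stemmer; the module's actual stemmer is assumed to be a proper algorithm (e.g. Porter-style), so this is illustration only:

```rust
// Crude illustrative stemmer: strips a few common English suffixes.
// NOT the module's real stemmer, which is assumed to be more sophisticated.
fn crude_stem(word: &str) -> String {
    let w = word.to_lowercase();
    for suffix in ["ing", "ed", "es", "s"] {
        // Only strip when a reasonable stem remains.
        if w.len() > suffix.len() + 2 && w.ends_with(suffix) {
            return w[..w.len() - suffix.len()].to_string();
        }
    }
    w
}
```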