Preprocesses text together with its filename for search by tokenizing it and removing duplicate tokens.
Used for filename matching: the filename and each component of its directory path are added to the tokens.
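
A minimal sketch of the filename-aware preprocessing described above, written in Python. The names `preprocess_with_filename` and `simple_tokenize` are hypothetical, the tokenizer is a simplified stand-in (lowercased alphanumeric runs only), and placing path tokens before text tokens is an assumption, since the original does not state an order.

```python
import re
from pathlib import PurePosixPath


def simple_tokenize(text):
    """Simplified stand-in tokenizer: lowercased alphanumeric runs only."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]


def preprocess_with_filename(text, filepath):
    """Tokenize the path components and the text, then drop duplicates."""
    path_tokens = []
    # PurePosixPath.parts yields each directory name followed by the filename.
    for part in PurePosixPath(filepath).parts:
        path_tokens.extend(simple_tokenize(part))

    # Assumption: path tokens come first, followed by tokens from the text.
    tokens = path_tokens + simple_tokenize(text)

    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(tokens))
```

With this sketch, a query such as "config loader" can match a file by its path even when those words never appear in the file's contents.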
Tokenizes text into lowercase words by splitting on whitespace and non-alphanumeric characters,
removes stop words, and applies stemming. Also splits camelCase/PascalCase identifiers into their component words.
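
The tokenization steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the original implementation: `STOP_WORDS` is a hypothetical list, `stem` is a toy suffix stripper standing in for a real stemmer, and identifiers are split on case boundaries before lowercasing, because the boundaries are lost once the text is lowercased.

```python
import re

# Hypothetical stop-word list; the original does not specify one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for"}

# Runs of whitespace and any other non-alphanumeric characters.
_NON_ALNUM = re.compile(r"[^A-Za-z0-9]+")

# Zero-width boundaries inside identifiers: lower/digit followed by upper
# (getUser), or an acronym followed by a capitalized word (HTTPResponse).
_CASE_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def stem(word):
    """Toy suffix stripper; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    tokens = []
    for chunk in _NON_ALNUM.split(text):
        # Split camelCase/PascalCase first, while case is still available.
        for part in _CASE_BOUNDARY.sub(" ", chunk).split():
            word = part.lower()
            if word not in STOP_WORDS:
                tokens.append(stem(word))
    return tokens
```

For example, `tokenize("getUserName")` yields the three tokens `get`, `user`, and `name`. A production version would likely swap the toy `stem` for an established stemmer.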