Expand description
ASCII tokenizer for FTS — splits on [^A-Za-z0-9]+ and lowercases.
Resolves Phase 8 plan Q3 (ASCII MVP). Unicode-aware tokenization is
deferred to Phase 8.1 behind a unicode cargo feature; the limitation
here is intentional. Non-ASCII bytes are treated as separators, which
means accented Latin (café), CJK, and other non-ASCII scripts won’t
be searchable until that follow-up lands.
No stemming and no stop-word removal (Q4 + Q5). BM25’s IDF naturally downweights common terms, and modern RAG pipelines rely on exact lexical matches for technical retrieval.
Functions§
- tokenize
- Split
texton runs of non-ASCII-alphanumeric bytes and lowercase each resulting term. Empty input or input made entirely of separators returns an emptyVec.