sqlrite::sql::fts

Module tokenizer

Expand description

ASCII tokenizer for FTS — splits on [^A-Za-z0-9]+ and lowercases.

Resolves Phase 8 plan Q3 (ASCII MVP). Unicode-aware tokenization is deferred to Phase 8.1 behind a unicode cargo feature; the limitation here is intentional. Non-ASCII bytes are treated as separators, which means accented Latin (café), CJK, and other non-ASCII scripts won’t be searchable until that follow-up lands.

No stemming and no stop-word removal (Q4 + Q5). BM25’s IDF naturally downweights common terms, and modern RAG pipelines rely on exact lexical matches for technical retrieval.

Functions§

tokenize: Split text on runs of non-ASCII-alphanumeric bytes and lowercase each resulting term. Empty input or input made entirely of separators returns an empty Vec.