Expand description
A small, dependency-free tokenizer for the BM25 / full-text path (ADR-0046).
Turns text into term ids so a tokenized string is a SparseVector whose
values are term frequencies, reusing the sparse machinery of ADR-0043/0045. The
pipeline is deterministic — Unicode-aware splitting on non-alphanumeric
boundaries, lowercasing, a small English stop-word filter, and Snowball
(Porter2) stemming — so the same text always produces the same terms, and
ingest and query tokenize identically.
The stemmer is the Snowball English (Porter2) algorithm via rust-stemmers
(ADR-0048), which conflates morphological variants — connection /
connected / connecting → connect, ponies → poni — so a query term
matches inflected document terms. Because ingest and query share the same
stemmer the conflation is consistent on both sides. (It superseded the original
dependency-free plural-only S-stemmer of ADR-0046, behind the same tokens
seam.)
One remaining, documented ceiling: term ids are a 32-bit FNV-1a hash of the token, so distinct tokens can in principle collide. For realistic vocabularies this is negligible (and learned-sparse vocabularies already collide by construction).
Constants§
- TEXT_
KEY - The reserved payload key carrying a point’s full-text field (ADR-0046). When a
point has no explicit
__quiver_sparse__vector but carries a string under this key, the engine tokenizes it into a term-frequency sparse vector at ingest, so the point is searchable by BM25 over text alone.
Functions§
- query_
term_ ids - Tokenize
textinto the de-duplicated query term ids BM25 scores against (a repeated query term counts once). The query side of the BM25 path (ADR-0046). - term_id
- The stable 32-bit dimension id for a token (FNV-1a). Deterministic across runs and platforms.
- text_
to_ sparse - Tokenize
textinto a term-frequencySparseVector: dimension ids are token ids (term_id) and values are within-text term counts. The ingest side of the BM25 path (ADR-0046). - tokens
- Tokenize
textinto normalized terms: lowercased, split on non-alphanumeric boundaries, stop-words removed, and plural-stemmed. The order is preserved and duplicates are kept (so callers can compute term frequencies).