Skip to main content

Module tokenize

Module tokenize 

Source
Expand description

A small, dependency-free tokenizer for the BM25 / full-text path (ADR-0046).

Turns text into term ids so a tokenized string is a SparseVector whose values are term frequencies, reusing the sparse machinery of ADR-0043/0045. The pipeline is deterministic — Unicode-aware splitting on non-alphanumeric boundaries, lowercasing, a small English stop-word filter, and Snowball (Porter2) stemming — so the same text always produces the same terms, and ingest and query tokenize identically.

The stemmer is the Snowball English (Porter2) algorithm via rust-stemmers (ADR-0048), which conflates morphological variants — connection / connected / connectingconnect, poniesponi — so a query term matches inflected document terms. Because ingest and query share the same stemmer the conflation is consistent on both sides. (It superseded the original dependency-free plural-only S-stemmer of ADR-0046, behind the same tokens seam.)

One remaining, documented ceiling: term ids are a 32-bit FNV-1a hash of the token, so distinct tokens can in principle collide. For realistic vocabularies this is negligible (and learned-sparse vocabularies already collide by construction).

Constants§

TEXT_KEY
The reserved payload key carrying a point’s full-text field (ADR-0046). When a point has no explicit __quiver_sparse__ vector but carries a string under this key, the engine tokenizes it into a term-frequency sparse vector at ingest, so the point is searchable by BM25 over text alone.

Functions§

query_term_ids
Tokenize text into the de-duplicated query term ids BM25 scores against (a repeated query term counts once). The query side of the BM25 path (ADR-0046).
term_id
The stable 32-bit dimension id for a token (FNV-1a). Deterministic across runs and platforms.
text_to_sparse
Tokenize text into a term-frequency SparseVector: dimension ids are token ids (term_id) and values are within-text term counts. The ingest side of the BM25 path (ADR-0046).
tokens
Tokenize text into normalized terms: lowercased, split on non-alphanumeric boundaries, stop-words removed, and plural-stemmed. The order is preserved and duplicates are kept (so callers can compute term frequencies).