Module text

Tiny text utilities shared by the embedder, ranker, and skill parser. Deterministic by design: embeddings persisted in the index must reproduce byte-for-byte across runs and builds, so we use a fixed FNV hash rather than the std hasher, whose seeding and algorithm carry no stability guarantee.
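
To make the stability argument concrete, here is a minimal sketch of FNV-1a in both widths, using the published FNV offset bases and primes. The signatures (byte slices in, integers out) are assumptions, since this page lists only the function names, and `bucket` is a hypothetical helper showing how a token could map to an embedding dimension.

```rust
/// FNV-1a, 32-bit. Signature assumed; the page lists only the name.
pub fn fnv1a_32(bytes: &[u8]) -> u32 {
    let mut hash: u32 = 0x811c_9dc5; // 32-bit FNV offset basis
    for &b in bytes {
        hash ^= u32::from(b); // FNV-1a: XOR the byte first...
        hash = hash.wrapping_mul(0x0100_0193); // ...then multiply by the 32-bit FNV prime
    }
    hash
}

/// FNV-1a, 64-bit. Signature assumed, as above.
pub fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // 64-bit FNV offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // 64-bit FNV prime
    }
    hash
}

/// Hypothetical helper (not part of the module): maps a token to one of
/// `dims` buckets. Panics if `dims` is zero.
pub fn bucket(token: &str, dims: usize) -> usize {
    fnv1a_32(token.as_bytes()) as usize % dims
}
```

Because the constants are fixed and the arithmetic is plain wrapping multiplication, the same bytes give the same value on every run and build. For example, empty input returns the offset basis unchanged.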

Functions

content_tokens
tokenize, minus stopwords — the discriminative tokens of a phrase. Used both to gate a candidate phrase by length and to match it against a prompt.
fnv1a_32
FNV-1a 32-bit — stable token→bucket hash for the bag-of-words embedder.
fnv1a_64
FNV-1a 64-bit — content hash for index cache invalidation (not security).
match_tokens
content_tokens, each normalized through norm_token — the form the surface-matching channels (phrase, BM25) compare prompt and skill text in.
norm_token
Light, deterministic singular form of a (lowercase) token, so the surface-form channels — keyword, phrase, BM25 — match across trivial inflection (“spreadsheets” ↔ “spreadsheet”, “dependencies” ↔ “dependency”). Not a real stemmer: it only needs to be consistent, because both the prompt side and the skill side are normalized through it at match time. Applied at match time only — never inside the embedders — so persisted index vectors are untouched.
tokenize
Lowercase, split on non-alphanumerics, drop tokens shorter than 2 chars.
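
The token pipeline described above can be sketched as follows. This illustrates the documented behavior and is not the module's source: the signatures, the `STOPWORDS` list, the use of Unicode (rather than ASCII) alphanumerics, and the exact suffix rules in `norm_token` are all assumptions.

```rust
/// Hypothetical stopword list; the real list is not shown on this page.
const STOPWORDS: &[&str] = &["the", "and", "of", "to", "in", "for", "with"];

/// Lowercase, split on non-alphanumerics, drop tokens shorter than 2 chars.
/// Assumes Unicode alphanumerics and counts length in chars, not bytes.
pub fn tokenize(text: &str) -> Vec<String> {
    text.to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| t.chars().count() >= 2)
        .map(str::to_string)
        .collect()
}

/// `tokenize`, minus stopwords.
pub fn content_tokens(text: &str) -> Vec<String> {
    tokenize(text)
        .into_iter()
        .filter(|t| !STOPWORDS.contains(&t.as_str()))
        .collect()
}

/// Assumed singularization rules: "ies" becomes "y", and a trailing "s"
/// (but not "ss") is dropped. Only consistency matters, not linguistic
/// accuracy, because both sides of a match go through the same function.
pub fn norm_token(token: &str) -> String {
    if token.len() > 4 && token.ends_with("ies") {
        // "ies" is three ASCII bytes, so this slice ends on a char boundary.
        format!("{}y", &token[..token.len() - 3])
    } else if token.len() > 3 && token.ends_with('s') && !token.ends_with("ss") {
        token[..token.len() - 1].to_string()
    } else {
        token.to_string()
    }
}

/// `content_tokens`, each normalized through `norm_token`.
pub fn match_tokens(text: &str) -> Vec<String> {
    content_tokens(text).iter().map(|t| norm_token(t)).collect()
}
```

Under these assumed rules, a prompt containing "spreadsheets" and a skill description containing "spreadsheet" both normalize to the same token, which is all the surface-matching channels need. A rule this crude will mangle some words (for instance, "status" loses its final "s"), but it does so identically on both sides of the comparison.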