Module tokenizer

Expand description

HELIX-IDEA-005 Phase 4 — pluggable tokenizer trait.

Contract: contracts/apr-hybrid-retrieval-v1.yaml (FALSIFY-HYBRID-003).

Pre-Phase-4, BM25Index had a hardcoded internal tokenize() method that split on non-alphanumeric boundaries with built-in lowercasing + stopword filtering. The §2.5 sketch called this out as drift risk: a future apr serve inference path that uses a different tokenizer (BPE / SentencePiece) would silently disagree with BM25 on what counts as a “term”, with no compile-time anchor pinning the two together.

Phase 4 extracts the contract: a public Tokenizer trait that BM25Index can OPTIONALLY accept via [BM25Index::with_tokenizer]. The default behaviour is unchanged (the existing internal tokenization stays the fallback when no override is set), so existing callers don’t break. Future callers — including a shared inference tokenizer — implement Tokenizer and plug in.

Note on scope: this PR ships the trait surface plus the BM25 plumbing to use it. Wiring the actual inference-path tokenizer (Qwen / Llama BPE) into the same trait is a separate effort; the gate in §2.5 only asserts the trait is pluggable today, not that any specific inference tokenizer is wired.

Structs§

WhitespaceTokenizer: Whitespace-and-non-alphanumeric tokenizer with optional lowercasing, stopword filtering, and minimum-length filtering. Matches the rule that lived inside BM25Index pre-Phase-4 and is the default when no other tokenizer is supplied.

Traits§

Tokenizer: Decompose a string into BM25 search terms.