Expand description
HELIX-IDEA-005 Phase 4 — pluggable tokenizer trait.
Contract: contracts/apr-hybrid-retrieval-v1.yaml (FALSIFY-HYBRID-003).
Pre-Phase-4, BM25Index had a hardcoded internal tokenize()
method that split on non-alphanumeric boundaries with built-in
lowercasing + stopword filtering. The §2.5 sketch called this
out as drift risk: a future apr serve inference path that uses
a different tokenizer (BPE / SentencePiece) would silently
disagree with BM25 on what counts as a “term”, with no
compile-time anchor pinning the two together.
Phase 4 extracts the contract: a public Tokenizer trait that
BM25Index can OPTIONALLY accept via [BM25Index::with_tokenizer].
The default behaviour is unchanged (the existing internal
tokenization stays the fallback when no override is set), so
existing callers don’t break. Future callers — including a
shared inference tokenizer — implement Tokenizer and plug
in.
Note on scope: this PR ships the trait surface plus the BM25 plumbing to use it. Wiring the actual inference-path tokenizer (Qwen / Llama BPE) into the same trait is a separate effort; the gate in §2.5 only asserts the trait is pluggable today, not that any specific inference tokenizer is wired.
Structs§
- Whitespace
Tokenizer - Whitespace-and-non-alphanumeric tokenizer with optional
lowercasing, stopword filtering, and minimum-length filtering.
Matches the rule that lived inside
BM25Indexpre-Phase-4 and is the default when no other tokenizer is supplied.
Traits§
- Tokenizer
- Decompose a string into BM25 search terms.