Expand description
Sound trigram filter for approximate-regex matching.
Given a regex AST and a global edit budget k, produce a
ApproxFilter that lets the index narrow the candidate
set before invoking the (expensive) TRE matcher on each
survivor. The filter is sound: every doc that
approximately matches the pattern within k edits passes
the filter, so applying it never produces a false negative.
§Construction
The filter is parameterised by the pattern’s literal runs
– maximal contiguous sequences of literal bytes from the
AST (see crate::prefix_extract::extract_literal_runs).
For each run of length L_i, the run contributes
T_i = max(0, L_i - 2) candidate trigrams; the total
across runs is T = sum T_i.
With k edits permitted, each error can destroy at most
three pattern trigrams (the up-to-three windows that
contain the edited byte). So at least
max(0, T - 3k) of the pattern’s trigrams must appear in
any matching doc, giving the soundness invariant
#(pattern_trigrams in doc) >= T - 3kThe filter therefore has two fields:
trigrams– the pattern trigrams (deduplicated).min_required–max(0, T - 3k)clamped totrigrams.len().
When min_required == 0 the filter degenerates and the
caller has to fall back to a full scan.
§Application
ApproxFilter::candidates is the entry point used by
crate::index::TextIndex::search_regex_approx. It first
computes the postings UNION of all pattern trigrams (an
upper bound on the candidate set: any doc that contains
min_required >= 1 of them must be in the union of all of
them), then for each survivor counts how many pattern
trigrams the per-doc bloom filter accepts and rejects docs
whose count falls below min_required.
§Why not Navarro tiling?
Navarro’s (k+1)-tiling argument partitions the pattern
into k+1 disjoint contiguous tiles, each of which must
match exactly somewhere because at least one of them is
error-free by pigeonhole. For sufficiently long patterns
this gives a tighter selectivity than the per-trigram
bound. In practice the approximate-regex queries dyntext
sees today have short literal runs (typically 3-5 bytes),
and the (k+1)-tiling either degenerates or matches the
per-trigram bound. We use the simpler bound here and leave
the tiling refinement for a follow-up.
Structs§
- Approx
Filter - Sound trigram filter for an approximate-regex query.