Skip to main content

Module tiling

Module tiling 

Source
Expand description

Sound trigram filter for approximate-regex matching.

Given a regex AST and a global edit budget k, produce a ApproxFilter that lets the index narrow the candidate set before invoking the (expensive) TRE matcher on each survivor. The filter is sound: every doc that approximately matches the pattern within k edits passes the filter, so applying it never produces a false negative.

§Construction

The filter is parameterised by the pattern’s literal runs – maximal contiguous sequences of literal bytes from the AST (see crate::prefix_extract::extract_literal_runs). For each run of length L_i, the run contributes T_i = max(0, L_i - 2) candidate trigrams; the total across runs is T = sum T_i.

With k edits permitted, each error can destroy at most three pattern trigrams (the up-to-three windows that contain the edited byte). So at least max(0, T - 3k) of the pattern’s trigrams must appear in any matching doc, giving the soundness invariant

    #(pattern_trigrams in doc) >= T - 3k

The filter therefore has two fields:

  • trigrams – the pattern trigrams (deduplicated).
  • min_requiredmax(0, T - 3k) clamped to trigrams.len().

When min_required == 0 the filter degenerates and the caller has to fall back to a full scan.

§Application

ApproxFilter::candidates is the entry point used by crate::index::TextIndex::search_regex_approx. It first computes the postings UNION of all pattern trigrams (an upper bound on the candidate set: any doc that contains min_required >= 1 of them must be in the union of all of them), then for each survivor counts how many pattern trigrams the per-doc bloom filter accepts and rejects docs whose count falls below min_required.

§Why not Navarro tiling?

Navarro’s (k+1)-tiling argument partitions the pattern into k+1 disjoint contiguous tiles, each of which must match exactly somewhere because at least one of them is error-free by pigeonhole. For sufficiently long patterns this gives a tighter selectivity than the per-trigram bound. In practice the approximate-regex queries dyntext sees today have short literal runs (typically 3-5 bytes), and the (k+1)-tiling either degenerates or matches the per-trigram bound. We use the simpler bound here and leave the tiling refinement for a follow-up.

Structs§

ApproxFilter
Sound trigram filter for an approximate-regex query.