Skip to main content

Module grep_executor

Module grep_executor 

Source
Expand description

Grep Lane Executor (Task 5)

Exact regex / substring search as a first-class retrieval lane, built on the trigram candidate index. The pipeline is:

regex ──► required-literal extraction ──► trigram conjunction
      ──► trigram posting intersection (candidate DocIds)
      ──► ∩ AllowedSet            (filter pushdown BEFORE verification)
      ──► regex verification      (linear-time, finite-automaton engine)
      ──► ranked hits  OR  candidate gate

§Correctness over speed

Trigram pre-filtering is only ever used when the executor can prove the extracted literals are mandatory (present in every possible match). For any pattern it cannot prove this for — alternation, groups, character classes, or no literal run of length ≥ 3 — it falls back to an explicit, bounded full scan rather than risk a false negative. The full-scan path is capped by max_scan; exceeding the cap is reported as GrepError::DegeneratePattern instead of silently returning partial results.

§Verification engine

Verification uses the regex crate, a finite-automaton engine with guaranteed linear-time matching, so adversarial patterns cannot turn the verify stage into a catastrophic-backtracking DoS.

§Two fusion modes

Grep produces a set, but RRF consumes ranked lists. Both shapes are supported:

  • GrepMode::Rank scores each hit by specificity-weighted, TF-saturated, length-pivoted relevance (BM25-flavored over the pattern’s literal terms) so it can plug into RRF as a third ranked lane without the short-document / common-term bias of raw match density.
  • GrepMode::Gate returns the matching documents as an AllowedSet to intersect into the other lanes (the “find the function that contains X” cascade), via GrepResults::into_allowed_set.

Structs§

GrepExecutor
The grep executor: plans and runs regex search over a TrigramIndex.
GrepHit
A single grep match.
GrepResults
The outcome of a grep search.

Enums§

GrepError
Errors the grep lane can return.
GrepMode
How grep results should be consumed by the fusion layer.

Constants§

DEFAULT_MAX_SCAN
Default cap on documents verified by a degenerate (no-trigram) full scan.

Functions§

required_trigrams
Extract the mandatory trigram conjunction for pattern, or None if the pattern is too complex to prove a mandatory literal (caller must full-scan).