Struct TextIndex

pub struct TextIndex { /* private fields */ }

Trigram-based text index supporting incremental insert, remove, and exact-substring search.

Implementations

impl TextIndex

pub fn new() -> Self

Construct an empty index.

pub fn doc_count(&self) -> usize

Number of documents currently in the index.

pub fn postings(&self) -> &Postings

Borrow the inverted postings index.

pub fn docs(&self) -> &BTreeMap<u32, IndexedDoc>

Borrow the document store.

pub fn insert(&mut self, text: Vec<u8>) -> u32

Insert text and return the assigned doc id.

Doc ids are assigned in monotonically increasing order and are never recycled: once a document is removed, its id is not handed out again.
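
The id policy can be illustrated with a minimal sketch. DocStore, its fields, and the choice to start counting at 0 are assumptions made for this example; they are not taken from the crate.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the document store; not the crate's real type.
#[derive(Default)]
struct DocStore {
    /// Next id to hand out. Assumption: ids start at 0.
    next_id: u32,
    docs: BTreeMap<u32, Vec<u8>>,
}

impl DocStore {
    /// Store `text` under a fresh id that has never been used before.
    fn insert(&mut self, text: Vec<u8>) -> u32 {
        let id = self.next_id;
        // Assumption: u32 overflow handling is out of scope for this sketch.
        self.next_id += 1;
        self.docs.insert(id, text);
        id
    }

    /// Remove a document. The counter is left untouched, so the id is retired.
    fn remove(&mut self, id: u32) -> Option<Vec<u8>> {
        self.docs.remove(&id)
    }
}
```

Because remove never touches the counter, an insert that follows a remove always receives an id larger than any id issued so far.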

pub fn remove(&mut self, doc_id: u32) -> Option<Vec<u8>>

Remove the document at doc_id and return its raw bytes, if any.

All trigram entries for the doc are pulled from the postings index. A trigram whose postings list becomes empty after removal is garbage collected.
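
A rough sketch of that cleanup is shown here, assuming a postings layout of trigram to set of doc ids. The PostingsMap alias and both helper functions are hypothetical and may not match the real Postings type.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical postings layout: trigram -> ids of the docs that contain it.
type PostingsMap = BTreeMap<[u8; 3], BTreeSet<u32>>;

/// Index every 3-byte window of `text` under `doc_id`.
fn add_doc(postings: &mut PostingsMap, doc_id: u32, text: &[u8]) {
    for w in text.windows(3) {
        postings.entry([w[0], w[1], w[2]]).or_default().insert(doc_id);
    }
}

/// Drop `doc_id` from every trigram of `text`; delete lists that end up empty.
fn remove_doc(postings: &mut PostingsMap, doc_id: u32, text: &[u8]) {
    for w in text.windows(3) {
        let key = [w[0], w[1], w[2]];
        let emptied = match postings.get_mut(&key) {
            Some(ids) => {
                ids.remove(&doc_id);
                ids.is_empty()
            }
            None => false,
        };
        if emptied {
            // Garbage-collect the trigram so the map holds no empty lists.
            postings.remove(&key);
        }
    }
}
```

Deleting empty lists keeps the map size proportional to the trigrams of live documents rather than to every trigram ever inserted.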

pub fn search_substring(&self, query: &[u8]) -> Vec<u32>

Search for documents whose text contains query as a contiguous byte substring.

Results are returned in insertion order. Every candidate is re-verified against the doc store, so a removed document is never returned, even if a buggy remove left some of its postings entries behind.

Queries shorter than MIN_TRIGRAM_QUERY_LEN cannot be resolved through the trigram index and fall back to a full scan.
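
The search path described above might look roughly like the following sketch. ToyIndex, its add helper, and the value 3 for the minimum query length are assumptions for illustration; the actual value of MIN_TRIGRAM_QUERY_LEN is not shown on this page. The sketch also assumes that an empty query matches every document.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Assumed value; the crate's MIN_TRIGRAM_QUERY_LEN is not shown here.
const MIN_QUERY_LEN: usize = 3;

/// Hypothetical miniature index: a doc store plus trigram postings.
#[derive(Default)]
struct ToyIndex {
    docs: BTreeMap<u32, Vec<u8>>,
    postings: BTreeMap<[u8; 3], BTreeSet<u32>>,
}

/// Naive byte-substring test. Assumption: an empty needle matches everything.
fn contains(hay: &[u8], needle: &[u8]) -> bool {
    needle.is_empty() || hay.windows(needle.len()).any(|w| w == needle)
}

impl ToyIndex {
    fn add(&mut self, id: u32, text: &[u8]) {
        for w in text.windows(3) {
            self.postings.entry([w[0], w[1], w[2]]).or_default().insert(id);
        }
        self.docs.insert(id, text.to_vec());
    }

    fn search_substring(&self, query: &[u8]) -> Vec<u32> {
        if query.len() < MIN_QUERY_LEN {
            // Too short for the trigram index: scan every stored document.
            return self.docs.iter()
                .filter(|&(_, text)| contains(text, query))
                .map(|(&id, _)| id)
                .collect();
        }
        // Intersect the postings lists of every trigram in the query.
        let mut candidates: Option<BTreeSet<u32>> = None;
        for w in query.windows(3) {
            let ids = match self.postings.get(&[w[0], w[1], w[2]]) {
                Some(ids) => ids,
                None => return Vec::new(), // a required trigram is absent
            };
            candidates = Some(match candidates {
                None => ids.clone(),
                Some(acc) => acc.intersection(ids).copied().collect(),
            });
        }
        // Re-verify against the doc store; stale postings are filtered here.
        candidates.unwrap_or_default().into_iter()
            .filter(|id| self.docs.get(id).map_or(false, |t| contains(t, query)))
            .collect()
    }
}
```

Iterating a BTreeSet or BTreeMap yields ascending ids, which coincides with insertion order when ids are monotonic.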

pub fn search_regex(&self, pattern: &str) -> Result<Vec<u32>, RegexError>

Search for documents whose text matches pattern as a regular expression.

The query path is the same four-tier filter funnel as Self::search_substring, plus a Phase-2 prefix extraction step:

1. Parse pattern into the internal AST and extract the trigrams that any matching string MUST contain (see crate::prefix_extract).
2. Intersect those trigrams’ postings lists into a candidate doc-id set. If the AST cannot be lowered (named capture group, etc.) or yields no required trigrams, fall back to scanning every doc.
3. Per-doc bloom filter recheck (skipped on full scan).
4. Compile the pattern with regex::bytes::Regex and re-run it against each candidate’s stored bytes.

Results are returned in insertion order.
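
The candidate-selection step of this funnel, including its full-scan fallback, can be sketched as follows. The Candidates enum and select_candidates are hypothetical names. Trigram extraction from the AST, the bloom filter recheck, and the final regex recheck are deliberately left out, and `required` stands in for whatever the extractor returns.

```rust
use std::collections::{BTreeMap, BTreeSet};

type Trigram = [u8; 3];

/// Outcome of candidate selection (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum Candidates {
    /// No usable trigrams: every document must go through the recheck.
    FullScan,
    /// Only these documents can possibly match.
    Ids(BTreeSet<u32>),
}

/// `required` is `None` when the pattern could not be lowered, and may be
/// empty when the pattern imposes no mandatory trigrams.
fn select_candidates(
    postings: &BTreeMap<Trigram, BTreeSet<u32>>,
    required: Option<&[Trigram]>,
) -> Candidates {
    let required = match required {
        Some(t) if !t.is_empty() => t,
        _ => return Candidates::FullScan,
    };
    let mut acc: Option<BTreeSet<u32>> = None;
    for tri in required {
        // A trigram with no postings list contributes an empty set.
        let ids = postings.get(tri).cloned().unwrap_or_default();
        acc = Some(match acc {
            None => ids,
            Some(prev) => prev.intersection(&ids).copied().collect(),
        });
    }
    Candidates::Ids(acc.unwrap_or_default())
}
```

Keeping FullScan distinct from an empty Ids set matters: the first means "nothing is known, check everything", the second means "no document can match".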

Errors

Returns RegexError::Parse if the pattern is syntactically invalid or uses a regex feature that the underlying regex crate does not support (lookarounds, backreferences, …). A pattern that parses cleanly but trips the prefix extractor’s unsupported-feature path (named capture groups) does NOT surface as an error: the search still runs, just via the slower full-scan + recheck path.

pub fn search_regex_approx(&self, pattern: &str, max_errors: u16) -> Result<Vec<u32>, TreError>

Search for documents that match pattern as an approximate POSIX extended regular expression with up to max_errors edit operations.

This is the Phase 3 entry point for the TRE-backed recheck. The current implementation does a full scan over the document store: every doc is fed to a single compiled TreCompiledPattern. The regex prefix extractor used by Self::search_regex is not yet applied on this path; once it is, the scan can be restricted to a trigram-postings-derived candidate set, and the signature here is forward-compatible with that change.

Results are returned in ascending document-id order, which equals insertion order because doc ids are monotonic.
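
To make the "up to max_errors edit operations" idea concrete, here is a simplified sketch of a full scan with an approximate matcher. It handles a literal byte pattern only, using a Sellers-style edit-distance recurrence as a stand-in for the TRE matcher; it does not implement approximate POSIX regex matching. The names approx_contains and search_approx are hypothetical.

```rust
use std::collections::BTreeMap;

/// True if some substring of `text` is within `max_errors` edits
/// (insert, delete, substitute) of the literal `pat`.
fn approx_contains(text: &[u8], pat: &[u8], max_errors: usize) -> bool {
    // col[i] = fewest edits to align pat[..i] with a substring of the text
    // that ends at the current text position. col[0] stays 0 throughout,
    // which lets a match start anywhere in the text.
    let mut col: Vec<usize> = (0..=pat.len()).collect();
    if col[pat.len()] <= max_errors {
        return true; // the pattern is short enough to match the empty string
    }
    for &c in text {
        let mut diag = col[0]; // value from the previous column, one row up
        for i in 1..=pat.len() {
            let prev = col[i]; // value from the previous column, same row
            let cost = if pat[i - 1] == c { 0 } else { 1 };
            col[i] = (diag + cost).min(prev + 1).min(col[i - 1] + 1);
            diag = prev;
        }
        if col[pat.len()] <= max_errors {
            return true;
        }
    }
    false
}

/// Full scan over the store; BTreeMap iteration yields ascending doc ids.
fn search_approx(docs: &BTreeMap<u32, Vec<u8>>, pat: &[u8], max_errors: u16) -> Vec<u32> {
    docs.iter()
        .filter(|&(_, text)| approx_contains(text, pat, usize::from(max_errors)))
        .map(|(&id, _)| id)
        .collect()
}
```

The scan touches every document, so its cost grows with the total size of the store; that is the cost a candidate set derived from trigram postings would reduce.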