pub struct TextIndex { /* private fields */ }
Trigram-based text index supporting incremental insert, remove, and exact-substring search.
Implementations
impl TextIndex
pub fn docs(&self) -> &BTreeMap<u32, IndexedDoc>
Borrow the document store.
pub fn insert(&mut self, text: Vec<u8>) -> u32
Insert text and return the assigned doc id.
The doc id is assigned monotonically; doc ids are not recycled, so a removed id is not handed out again.
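The no-recycling guarantee can be illustrated with a minimal sketch. This is not the crate's implementation: `IdAllocSketch`, its `next_id` counter, the starting id of 0, and the panic on exhaustion are all assumptions made for illustration.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the real index; only id allocation is modelled.
pub struct IdAllocSketch {
    docs: BTreeMap<u32, Vec<u8>>,
    /// Assumed counter: it only ever grows, so ids are never handed out twice.
    next_id: u32,
}

impl IdAllocSketch {
    pub fn new() -> Self {
        Self { docs: BTreeMap::new(), next_id: 0 }
    }

    /// Store `text` under the next unused id and return that id.
    pub fn insert(&mut self, text: Vec<u8>) -> u32 {
        let id = self.next_id;
        // Sketch-level choice: panic on exhaustion instead of wrapping,
        // because wrapping would silently recycle ids.
        self.next_id = self.next_id.checked_add(1).expect("doc id space exhausted");
        self.docs.insert(id, text);
        id
    }

    /// Remove a document. The counter is deliberately left untouched.
    pub fn remove(&mut self, doc_id: u32) -> Option<Vec<u8>> {
        self.docs.remove(&doc_id)
    }
}
```

Because `remove` never touches the counter, an insert that follows a remove always receives a fresh id that is larger than every id issued before it.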
pub fn remove(&mut self, doc_id: u32) -> Option<Vec<u8>>
Remove the document at doc_id and return its raw
bytes, if any.
All trigram entries for the doc are pulled from the postings index. A trigram whose postings list becomes empty after removal is garbage collected.
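A rough sketch of the removal and garbage-collection behaviour described above. The type `PostingsSketch`, its field layout, and the helper methods `insert_with_id` and `trigram_count` are hypothetical; the real index keeps its fields private and may store postings differently.

```rust
use std::collections::{BTreeMap, BTreeSet, HashMap};

type Trigram = [u8; 3];

/// Hypothetical model: a doc store plus a trigram -> doc-id postings map.
#[derive(Default)]
pub struct PostingsSketch {
    docs: BTreeMap<u32, Vec<u8>>,
    postings: HashMap<Trigram, BTreeSet<u32>>,
}

impl PostingsSketch {
    /// Index `text` under a caller-supplied id (id allocation is out of scope here).
    pub fn insert_with_id(&mut self, doc_id: u32, text: Vec<u8>) {
        for w in text.windows(3) {
            self.postings.entry([w[0], w[1], w[2]]).or_default().insert(doc_id);
        }
        self.docs.insert(doc_id, text);
    }

    /// Remove the doc, pull its id from every postings list it appears in,
    /// and drop any list that becomes empty.
    pub fn remove(&mut self, doc_id: u32) -> Option<Vec<u8>> {
        let text = self.docs.remove(&doc_id)?;
        for w in text.windows(3) {
            let key = [w[0], w[1], w[2]];
            let now_empty = match self.postings.get_mut(&key) {
                Some(list) => {
                    list.remove(&doc_id);
                    list.is_empty()
                }
                None => false, // already collected via an earlier repeat of this trigram
            };
            if now_empty {
                self.postings.remove(&key);
            }
        }
        Some(text)
    }

    /// Number of distinct trigrams that currently have a postings list.
    pub fn trigram_count(&self) -> usize {
        self.postings.len()
    }
}
```

Dropping empty lists keeps the postings map proportional to the trigrams that are actually present in live documents, rather than to every trigram ever seen.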
pub fn search_substring(&self, query: &[u8]) -> Vec<u32>
Search for documents whose text contains query as a
contiguous byte substring.
Results are returned in insertion order. A removed document is never returned, even if a buggy remove left stale postings entries behind, because every candidate is re-verified against the doc store.
Queries shorter than MIN_TRIGRAM_QUERY_LEN cannot be
resolved through the trigram index and fall back to a
full scan.
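The lookup strategy can be sketched as follows. This is a simplified model, not the crate's code: it omits the bloom-filter tier, takes the doc store and postings as plain maps, and assumes a threshold of 3 bytes (`ASSUMED_MIN_QUERY_LEN`), since the actual value of MIN_TRIGRAM_QUERY_LEN is not stated here.

```rust
use std::collections::{BTreeMap, BTreeSet, HashMap};

/// Assumption: the real MIN_TRIGRAM_QUERY_LEN is not documented in this section.
const ASSUMED_MIN_QUERY_LEN: usize = 3;

/// Naive contiguous byte-substring test used for the final verification.
fn contains(haystack: &[u8], needle: &[u8]) -> bool {
    needle.is_empty() || haystack.windows(needle.len()).any(|w| w == needle)
}

pub fn search_substring_sketch(
    docs: &BTreeMap<u32, Vec<u8>>,
    postings: &HashMap<[u8; 3], BTreeSet<u32>>,
    query: &[u8],
) -> Vec<u32> {
    // Short queries contain no trigram, so every document must be scanned.
    if query.len() < ASSUMED_MIN_QUERY_LEN {
        return docs
            .iter()
            .filter(|(_, text)| contains(text, query))
            .map(|(id, _)| *id)
            .collect();
    }
    // Intersect the postings lists of every trigram in the query.
    let mut candidates: Option<BTreeSet<u32>> = None;
    for w in query.windows(3) {
        let list = match postings.get(&[w[0], w[1], w[2]]) {
            Some(list) => list,
            None => return Vec::new(), // a query trigram that no doc contains
        };
        candidates = Some(match candidates {
            None => list.clone(),
            Some(c) => c.intersection(list).copied().collect(),
        });
    }
    // Re-verify against the doc store: this drops false positives and any
    // stale ids that point at documents no longer stored.
    candidates
        .unwrap_or_default()
        .into_iter()
        .filter(|id| docs.get(id).map_or(false, |text| contains(text, query)))
        .collect()
}
```

Iterating a `BTreeSet` or `BTreeMap` yields ids in ascending order, which matches insertion order as long as ids are assigned monotonically.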
pub fn search_regex(&self, pattern: &str) -> Result<Vec<u32>, RegexError>
Search for documents whose text matches pattern as a
regular expression.
The query path is the same four-tier filter funnel as
Self::search_substring, plus a Phase-2 prefix
extraction step:
- Parse pattern into the internal AST and extract the trigrams that any matching string MUST contain (see crate::prefix_extract).
- Intersect those trigrams’ postings lists into a candidate doc-id set. If the AST cannot be lowered (named capture group, etc.) or yields no required trigrams, fall back to scanning every doc.
- Per-doc bloom filter recheck (skipped on full scan).
- Compile the pattern with regex::bytes::Regex and re-run it against each candidate’s stored bytes.
Results are returned in insertion order.
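The first tier of the funnel above, extracting trigrams that every match must contain, can be sketched on a toy AST. Everything here is hypothetical: `AstSketch` and `required_trigrams` are illustrative names, the real crate::prefix_extract works on its own internal AST, and this version is deliberately conservative (it ignores trigrams that span the boundary between concatenated parts).

```rust
use std::collections::BTreeSet;

/// Toy regex AST, invented for this sketch.
pub enum AstSketch {
    Literal(Vec<u8>),
    Concat(Vec<AstSketch>),
    Alt(Vec<AstSketch>),
    ZeroOrMore(Box<AstSketch>),
    /// Stands in for a feature the extractor cannot lower.
    NamedGroup(Box<AstSketch>),
}

/// Returns `None` when the AST cannot be lowered, otherwise the set of
/// trigrams that every matching string is guaranteed to contain.
/// `None` or an empty set both mean the caller has to scan every doc.
pub fn required_trigrams(ast: &AstSketch) -> Option<BTreeSet<[u8; 3]>> {
    match ast {
        // A literal matches only itself, so all of its trigrams are required.
        AstSketch::Literal(bytes) => {
            Some(bytes.windows(3).map(|w| [w[0], w[1], w[2]]).collect())
        }
        // Every part must match, so the requirements add up.
        AstSketch::Concat(parts) => {
            let mut all = BTreeSet::new();
            for part in parts {
                all.extend(required_trigrams(part)?);
            }
            Some(all)
        }
        // Only one branch matches, so keep what all branches agree on.
        AstSketch::Alt(branches) => {
            let mut common: Option<BTreeSet<[u8; 3]>> = None;
            for branch in branches {
                let set = required_trigrams(branch)?;
                common = Some(match common {
                    None => set,
                    Some(c) => c.intersection(&set).copied().collect(),
                });
            }
            Some(common.unwrap_or_default())
        }
        // Zero repetitions are allowed, so nothing inside is required.
        AstSketch::ZeroOrMore(inner) => {
            required_trigrams(inner)?;
            Some(BTreeSet::new())
        }
        AstSketch::NamedGroup(_) => None,
    }
}
```

The extractor may under-report but must never over-report: a trigram that is missing from the set only costs extra candidates, whereas a trigram that is wrongly required would drop true matches before the final recheck.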
Errors
Returns RegexError::Parse if the pattern is
syntactically invalid or uses a regex feature that the
underlying regex crate does not support (lookarounds,
backreferences, …). A pattern that parses cleanly but
trips the prefix extractor’s unsupported-feature path
(named capture groups) does NOT surface as an error:
the search still runs, just via the slower full-scan +
recheck path.
pub fn search_regex_approx(
&self,
pattern: &str,
max_errors: u16,
) -> Result<Vec<u32>, TreError>
Search for documents that match pattern as an
approximate POSIX extended regular expression with up
to max_errors edit operations.
This is the Phase 3 entry point for the TRE-backed
recheck. The current implementation does a full scan
over the document store: every doc is fed to a single
compiled TreCompiledPattern. Phase 2 will add a
regex prefix extractor that lets us restrict the scan
to a trigram-postings-derived candidate set; the
signature here is forward-compatible with that change.
Results are returned in ascending document-id order, which equals insertion order because doc ids are monotonic.
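To make the full-scan shape concrete, here is a simplified stand-in. It does not use TRE and does not handle regular expressions: it swaps in Sellers' dynamic-programming algorithm for approximate matching of a literal byte pattern, counting insertions, deletions, and substitutions. The names `approx_contains` and `search_approx_sketch` are hypothetical.

```rust
use std::collections::BTreeMap;

/// True if some substring of `text` is within `max_errors` edits of `pattern`
/// (Sellers' algorithm; literal patterns only, unlike the TRE-backed search).
fn approx_contains(text: &[u8], pattern: &[u8], max_errors: u16) -> bool {
    let k = max_errors as usize;
    let m = pattern.len();
    // col[i] = fewest edits aligning pattern[..i] with a substring of `text`
    // that ends at the current position. col[0] stays 0 because a match may
    // start anywhere in the text.
    let mut col: Vec<usize> = (0..=m).collect();
    if col[m] <= k {
        return true; // the whole pattern can be deleted within budget
    }
    for &c in text {
        let mut diag = col[0];
        for i in 1..=m {
            let up_left = diag; // previous column, row i - 1
            diag = col[i]; // previous column, row i
            let cost = if pattern[i - 1] == c { 0 } else { 1 };
            col[i] = (up_left + cost).min(diag + 1).min(col[i - 1] + 1);
        }
        if col[m] <= k {
            return true;
        }
    }
    false
}

/// Full scan: every document is rechecked, with no trigram prefilter.
pub fn search_approx_sketch(
    docs: &BTreeMap<u32, Vec<u8>>,
    pattern: &[u8],
    max_errors: u16,
) -> Vec<u32> {
    docs.iter()
        .filter(|(_, text)| approx_contains(text, pattern, max_errors))
        .map(|(id, _)| *id)
        .collect()
}
```

The scan costs roughly pattern length times document length per document, which is why restricting it to a smaller candidate set is attractive.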