pub struct TextIndex { /* private fields */ }
Trigram-based text index supporting incremental insert, remove, and exact-substring search.
Implementations
impl TextIndex
pub fn docs(&self) -> &BTreeMap<u32, IndexedDoc>
Borrow the document store.
pub fn insert(&mut self, text: Vec<u8>) -> u32
Insert text and return the assigned doc id.
The doc id is assigned monotonically; doc ids are not recycled, so a removed id is not handed out again.
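The id policy can be pictured with a minimal sketch, assuming a counter stored next to the document map. `ToyStore`, its fields, and the starting id of 0 are hypothetical and are not taken from the crate's private fields.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the document store; not the crate's real type.
#[derive(Default)]
pub struct ToyStore {
    docs: BTreeMap<u32, Vec<u8>>,
    /// Only ever grows, so an id is never handed out twice.
    next_id: u32,
}

impl ToyStore {
    pub fn insert(&mut self, text: Vec<u8>) -> u32 {
        let id = self.next_id;
        // Assumption: overflow handling is out of scope for this sketch.
        self.next_id += 1;
        self.docs.insert(id, text);
        id
    }

    pub fn remove(&mut self, id: u32) -> Option<Vec<u8>> {
        // Removal leaves `next_id` untouched, so the id stays retired.
        self.docs.remove(&id)
    }
}
```

Under this layout, inserting after a removal yields a fresh id that is larger than every id issued before it.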
pub fn remove(&mut self, doc_id: u32) -> Option<Vec<u8>>
Remove the document at doc_id and return its raw
bytes, if any.
All trigram entries for the doc are pulled from the postings index. A trigram whose postings list becomes empty after removal is garbage collected.
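The removal and garbage-collection behaviour described above might look roughly like the following. This is a sketch under assumptions: `ToyPostings`, its `HashMap` of `BTreeSet` layout, and `trigram_count` are illustrative names, not the crate's internals.

```rust
use std::collections::{BTreeMap, BTreeSet, HashMap};

type Trigram = [u8; 3];

/// Hypothetical postings layout: trigram -> set of doc ids containing it.
#[derive(Default)]
pub struct ToyPostings {
    docs: BTreeMap<u32, Vec<u8>>,
    postings: HashMap<Trigram, BTreeSet<u32>>,
}

impl ToyPostings {
    pub fn insert(&mut self, id: u32, text: Vec<u8>) {
        for w in text.windows(3) {
            self.postings.entry([w[0], w[1], w[2]]).or_default().insert(id);
        }
        self.docs.insert(id, text);
    }

    pub fn remove(&mut self, id: u32) -> Option<Vec<u8>> {
        let text = self.docs.remove(&id)?;
        for w in text.windows(3) {
            let key = [w[0], w[1], w[2]];
            if let Some(ids) = self.postings.get_mut(&key) {
                ids.remove(&id);
                if ids.is_empty() {
                    // Garbage-collect the postings list that just became empty.
                    self.postings.remove(&key);
                }
            }
        }
        Some(text)
    }

    /// Number of trigrams that still have a non-empty postings list.
    pub fn trigram_count(&self) -> usize {
        self.postings.len()
    }
}
```

Dropping empty lists eagerly keeps the postings map proportional to the trigrams that live documents actually contain.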
pub fn search_substring(&self, query: &[u8]) -> Vec<u32>
Search for documents whose text contains query as a
contiguous byte substring.
Results are returned in insertion order, which equals ascending doc-id order because ids are assigned monotonically. A removed document is never returned, even if a faulty remove left stale postings entries behind: every candidate is re-verified against the doc store.
Queries shorter than MIN_TRIGRAM_QUERY_LEN cannot be
resolved through the trigram index and fall back to a
full scan.
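A minimal sketch of this lookup, assuming the postings map is keyed by 3-byte windows and that MIN_TRIGRAM_QUERY_LEN is 3 (the real value is not shown here). The free functions `build_postings` and `search_substring` are illustrative stand-ins, not the crate's methods.

```rust
use std::collections::{BTreeMap, BTreeSet, HashMap};

/// Assumed value; the real constant is not shown in these docs.
const MIN_TRIGRAM_QUERY_LEN: usize = 3;

fn contains(haystack: &[u8], needle: &[u8]) -> bool {
    needle.is_empty() || haystack.windows(needle.len()).any(|w| w == needle)
}

/// Hypothetical helper that indexes every 3-byte window of every doc.
pub fn build_postings(docs: &BTreeMap<u32, Vec<u8>>) -> HashMap<[u8; 3], BTreeSet<u32>> {
    let mut postings: HashMap<[u8; 3], BTreeSet<u32>> = HashMap::new();
    for (&id, text) in docs {
        for w in text.windows(3) {
            postings.entry([w[0], w[1], w[2]]).or_default().insert(id);
        }
    }
    postings
}

pub fn search_substring(
    docs: &BTreeMap<u32, Vec<u8>>,
    postings: &HashMap<[u8; 3], BTreeSet<u32>>,
    query: &[u8],
) -> Vec<u32> {
    let candidates: BTreeSet<u32> = if query.len() < MIN_TRIGRAM_QUERY_LEN {
        // Too short to contain a trigram: every stored doc is a candidate.
        docs.keys().copied().collect()
    } else {
        // Intersect the postings lists of all trigrams in the query.
        let mut acc: Option<BTreeSet<u32>> = None;
        for w in query.windows(3) {
            let ids = postings.get(&[w[0], w[1], w[2]]).cloned().unwrap_or_default();
            acc = Some(match acc {
                None => ids,
                Some(prev) => prev.intersection(&ids).copied().collect(),
            });
        }
        acc.unwrap_or_default()
    };
    // Re-verify against the doc store; ids with stale postings are dropped here.
    candidates
        .into_iter()
        .filter(|id| docs.get(id).map_or(false, |text| contains(text, query)))
        .collect()
}
```

Because the candidate set is a `BTreeSet`, the output comes back in ascending id order without a separate sort.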
pub fn search_regex(&self, pattern: &str) -> Result<Vec<u32>, RegexError>
Search for documents whose text matches pattern as a
regular expression.
The query path is the same four-tier filter funnel as
Self::search_substring, plus a Phase-2 prefix
extraction step:
- Parse pattern into the internal AST and extract the trigrams that any matching string MUST contain (see crate::prefix_extract).
- Intersect those trigrams’ postings lists into a candidate doc-id set. If the AST cannot be lowered (named capture group, etc.) or yields no required trigrams, fall back to scanning every doc.
- Per-doc bloom filter recheck (skipped on full scan).
- If the AST starts with ^literal, prune candidates whose first bytes do not equal the literal prefix.
- Compile the pattern with regex::bytes::Regex and re-run it against each candidate’s stored bytes.
Results are returned in insertion order.
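The candidate-selection and anchor-pruning steps of the funnel can be sketched as follows. This is an illustration under assumptions: `candidates` and `prune_by_prefix` are hypothetical names, `None` models an AST that could not be lowered, and the bloom-filter and final regex recheck steps are left out.

```rust
use std::collections::{BTreeMap, BTreeSet, HashMap};

type Trigram = [u8; 3];

/// Intersect postings for the required trigrams, or fall back to every doc
/// when extraction produced nothing usable.
pub fn candidates(
    docs: &BTreeMap<u32, Vec<u8>>,
    postings: &HashMap<Trigram, BTreeSet<u32>>,
    required: Option<&[Trigram]>,
) -> BTreeSet<u32> {
    let all = || docs.keys().copied().collect::<BTreeSet<u32>>();
    match required {
        None => all(),
        Some(tris) if tris.is_empty() => all(),
        Some(tris) => {
            let mut acc = all();
            for t in tris {
                match postings.get(t) {
                    Some(ids) => acc.retain(|id| ids.contains(id)),
                    // A required trigram that no doc contains rules out everything.
                    None => acc.clear(),
                }
            }
            acc
        }
    }
}

/// Keep only candidates whose stored bytes start with the `^literal` prefix.
pub fn prune_by_prefix(
    docs: &BTreeMap<u32, Vec<u8>>,
    cands: BTreeSet<u32>,
    literal: &[u8],
) -> BTreeSet<u32> {
    cands
        .into_iter()
        .filter(|id| docs.get(id).map_or(false, |text| text.starts_with(literal)))
        .collect()
}
```

Both functions only narrow the set, so the final regex recheck still decides which candidates are real matches.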
Errors
Returns RegexError::Parse if the pattern is
syntactically invalid or uses a regex feature that the
underlying regex crate does not support (lookarounds,
backreferences, …). A pattern that parses cleanly but
trips the prefix extractor’s unsupported-feature path
(named capture groups) does NOT surface as an error:
the search still runs, just via the slower full-scan +
recheck path.
pub fn search_regex_approx(&self, pattern: &str, max_errors: u16) -> Result<Vec<u32>, TreError>
Search for documents that match pattern as an
approximate POSIX extended regular expression with up
to max_errors edit operations.
The path mirrors Self::search_regex but the tier-4
recheck delegates to TreCompiledPattern instead of
the regex crate's matcher because TRE is the only engine
in the workspace that implements approximate match
semantics. The pre-recheck filter funnel uses the
pigeonhole bound surviving_trigrams >= T - 3k
(see ApproxFilter) to stay sound under the edit
budget.
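The bound follows from a counting argument: a single edit can touch at most three overlapping trigrams, so k edits destroy at most 3k of them. A small sketch of the arithmetic, assuming T is the number of trigrams extracted from the pattern and k is max_errors; the function names are hypothetical and are not ApproxFilter's API.

```rust
/// Minimum number of the pattern's trigrams that must still be present in a
/// match with at most `max_errors` edits: T - 3k, clamped at zero.
pub fn min_surviving_trigrams(total_trigrams: usize, max_errors: u16) -> usize {
    total_trigrams.saturating_sub(3 * max_errors as usize)
}

/// A candidate survives the filter when it has at least that many trigram hits.
pub fn passes_filter(hits: usize, total_trigrams: usize, max_errors: u16) -> bool {
    hits >= min_surviving_trigrams(total_trigrams, max_errors)
}
```

For example, a 10-byte literal has 8 trigrams; with one allowed edit at least 5 must survive, and with three allowed edits the bound drops to 0, so the filter can no longer reject anything.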
For a ^literal... pattern the anchor fast-path
rejects every candidate whose first bytes are too far
(in Hamming distance) from the literal prefix. For
max_errors == 0 this is a byte-equality check; for
max_errors >= 1 it is a Hamming-distance check
against the prefix length, which is sound for the
substitution-only case TRE optimises and conservative
(always returns true for prefixes whose Hamming
distance is within max_errors) for the
insert/delete cases.
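The anchor check described above could be sketched like this. The names `prefix_hamming` and `anchor_accepts` are hypothetical, and keeping documents shorter than the literal is an assumption of this sketch, not documented behaviour.

```rust
/// Hamming distance between `literal` and the first `literal.len()` bytes of
/// `text`, or `None` when `text` is shorter than the literal.
pub fn prefix_hamming(text: &[u8], literal: &[u8]) -> Option<usize> {
    let head = text.get(..literal.len())?;
    Some(head.iter().zip(literal).filter(|(a, b)| a != b).count())
}

/// With `max_errors == 0` this reduces to a byte-equality check on the prefix.
pub fn anchor_accepts(text: &[u8], literal: &[u8], max_errors: u16) -> bool {
    match prefix_hamming(text, literal) {
        Some(d) => d <= max_errors as usize,
        // Assumption: a too-short doc is kept and left to the final recheck.
        None => true,
    }
}
```

A prefix of `hallo` against the literal `hello` has distance 1, so it is rejected at `max_errors == 0` and accepted at `max_errors == 1`.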
When the surviving candidate set after filtering is
large (>= PARALLEL_RECHECK_THRESHOLD), the per-doc
TRE recheck is dispatched to a Rayon parallel iterator
to fan the cost across CPU cores. TRE’s compiled
regex_t is !Send, so each parallel worker compiles
its own copy from the original pattern bytes.
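The compile-per-worker pattern can be illustrated without Rayon or TRE. The sketch below swaps in `std::thread::scope` for the Rayon iterator and a hypothetical `ToyMatcher` (plain substring matching, made `!Send` with a raw-pointer marker) for the TRE handle; the threshold value is also assumed.

```rust
use std::marker::PhantomData;
use std::thread;

/// Assumed value; the real PARALLEL_RECHECK_THRESHOLD is not shown.
const PARALLEL_RECHECK_THRESHOLD: usize = 4;

/// Stand-in for a compiled pattern that must not cross threads.
/// The raw-pointer marker makes the type `!Send`.
pub struct ToyMatcher {
    needle: Vec<u8>,
    _not_send: PhantomData<*const ()>,
}

impl ToyMatcher {
    pub fn compile(pattern: &[u8]) -> Self {
        ToyMatcher { needle: pattern.to_vec(), _not_send: PhantomData }
    }

    pub fn is_match(&self, text: &[u8]) -> bool {
        self.needle.is_empty()
            || text.windows(self.needle.len()).any(|w| w == self.needle.as_slice())
    }
}

pub fn recheck(pattern: &[u8], candidates: &[(u32, Vec<u8>)], workers: usize) -> Vec<u32> {
    if candidates.len() < PARALLEL_RECHECK_THRESHOLD || workers <= 1 {
        // Small candidate sets are rechecked sequentially with one matcher.
        let m = ToyMatcher::compile(pattern);
        return candidates.iter().filter(|(_, t)| m.is_match(t)).map(|(id, _)| *id).collect();
    }
    let chunk = (candidates.len() + workers - 1) / workers;
    let mut hits: Vec<u32> = thread::scope(|s| {
        let handles: Vec<_> = candidates
            .chunks(chunk)
            .map(|part| {
                s.spawn(move || {
                    // Only the pattern bytes cross the thread boundary; each
                    // worker compiles its own matcher locally.
                    let m = ToyMatcher::compile(pattern);
                    part.iter()
                        .filter(|(_, t)| m.is_match(t))
                        .map(|(id, _)| *id)
                        .collect::<Vec<u32>>()
                })
            })
            .collect();
        handles.into_iter().flat_map(|h| h.join().expect("worker panicked")).collect()
    });
    // Restore ascending document-id order regardless of how work was split.
    hits.sort_unstable();
    hits
}
```

Compiling once per worker trades a little duplicated setup for not having to share a non-`Send` handle, which only pays off once the candidate set is large enough.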
Results are returned in ascending document-id order.
Errors
Returns TreError::Compile if the pattern fails to
compile under the given options.
Trait Implementations
impl<'de> Deserialize<'de> for TextIndex
fn deserialize<__D>(__deserializer: __D) -> Result<Self, __D::Error>
where
    __D: Deserializer<'de>,
Auto Trait Implementations
impl Freeze for TextIndex
impl RefUnwindSafe for TextIndex
impl Send for TextIndex
impl Sync for TextIndex
impl Unpin for TextIndex
impl UnsafeUnpin for TextIndex
impl UnwindSafe for TextIndex
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T
where
    T: Clone,
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.