// dyntext/lib.rs
//! Trigram + bloom-filter text index.
//!
//! `dyntext` is the algorithmic core of the dynomite text-search
//! surface. It ports the inverted-index pipeline from
//! [pg_tre](https://codeberg.org/gregburd/pg_tre) -- a
//! PostgreSQL access method for approximate-regex matching --
//! into pure Rust, so the same trigram + bloom funnel can sit
//! behind dynomite's Redis FT.* command surface.
//!
//! # Phase 1 scope
//!
//! This crate currently implements:
//!
//! * Three-byte n-gram extraction with padding so the input's
//!   boundary bytes get full coverage (see [`trigram`]).
//! * A roaring-bitmap-backed inverted index keyed by trigram
//!   hash (see [`postings::Postings`]).
//! * A standard bloom filter with configurable bit count and
//!   hash count (see [`bloom::BloomFilter`]).
//! * A combined [`index::TextIndex`] that ties the three
//!   together and serves exact-substring queries through the
//!   four-tier filter funnel from the design doc.
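//!
//! As a rough illustration of the first bullet, the sketch below
//! pads the input on both sides and slides a three-byte window
//! over it, so the first and last bytes each appear in three
//! trigrams. It is not the code in [`trigram`]: the pad byte
//! (a space), the pad width of two, and the name
//! `padded_trigrams` are assumptions made for this example only.
//!
//! ```rust
//! /// Hypothetical helper, not part of this crate's API.
//! /// Pads `input` with two assumed pad bytes on each side and
//! /// returns every three-byte window of the padded buffer.
//! fn padded_trigrams(input: &[u8]) -> Vec<[u8; 3]> {
//!     const PAD: u8 = b' '; // assumed pad byte
//!     let mut buf = Vec::with_capacity(input.len() + 4);
//!     buf.extend_from_slice(&[PAD, PAD]);
//!     buf.extend_from_slice(input);
//!     buf.extend_from_slice(&[PAD, PAD]);
//!     buf.windows(3).map(|w| [w[0], w[1], w[2]]).collect()
//! }
//!
//! fn main() {
//!     // "ab" is padded to "  ab  ", which has four windows.
//!     let grams = padded_trigrams(b"ab");
//!     assert_eq!(grams.len(), 4);
//!     assert_eq!(grams[0], *b"  a");
//!     assert_eq!(grams[3], *b"b  ");
//! }
//! ```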
//!
//! # Phase 2 + 3 scope
//!
//! Phase 2 adds regex-driven search on top of the existing
//! exact-substring path:
//!
//! * A small internal regex AST built from
//!   [`regex_syntax`]'s HIR (see [`regex_ast`]).
//! * A required-trigram extractor that walks the AST and
//!   computes the trigrams every matching string must contain
//!   (see [`prefix_extract`]).
//! * [`index::TextIndex::search_regex`], which uses the
//!   extractor to prune the postings lists before running the
//!   actual matcher (currently [`regex::bytes::Regex`]).
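//!
//! The pruning idea can be sketched with a toy postings map:
//! every trigram of a literal the pattern requires must occur in
//! a matching document, so intersecting those posting lists
//! yields the only documents worth rechecking. This is a
//! simplified stand-in, not this crate's implementation; the
//! names `ToyPostings`, `index_doc`, and `candidates` are
//! hypothetical, and plain `std` sets keyed by raw trigrams
//! replace the hashed, roaring-bitmap-backed postings.
//!
//! ```rust
//! use std::collections::{BTreeSet, HashMap};
//!
//! /// Hypothetical toy postings: trigram -> ids of docs containing it.
//! type ToyPostings = HashMap<[u8; 3], BTreeSet<u32>>;
//!
//! /// Record every (unpadded) trigram of `text` under doc `id`.
//! fn index_doc(postings: &mut ToyPostings, id: u32, text: &[u8]) {
//!     for w in text.windows(3) {
//!         postings.entry([w[0], w[1], w[2]]).or_default().insert(id);
//!     }
//! }
//!
//! /// Intersect the posting lists of every trigram in `literal`.
//! /// Returns `None` when the literal is shorter than three bytes:
//! /// no trigram is required, so nothing can be pruned.
//! fn candidates(postings: &ToyPostings, literal: &[u8]) -> Option<BTreeSet<u32>> {
//!     let mut result: Option<BTreeSet<u32>> = None;
//!     for w in literal.windows(3) {
//!         let docs = postings
//!             .get(&[w[0], w[1], w[2]])
//!             .cloned()
//!             .unwrap_or_default();
//!         result = Some(match result {
//!             None => docs,
//!             Some(acc) => acc.intersection(&docs).copied().collect(),
//!         });
//!     }
//!     result
//! }
//!
//! fn main() {
//!     let mut postings = ToyPostings::new();
//!     index_doc(&mut postings, 1, b"brown fox");
//!     index_doc(&mut postings, 2, b"lazy dog");
//!
//!     // A pattern that requires the literal "brown" can only match doc 1.
//!     assert_eq!(candidates(&postings, b"brown"), Some(BTreeSet::from([1])));
//!     // Too short to require any trigram: every doc needs a recheck.
//!     assert_eq!(candidates(&postings, b"ab"), None);
//! }
//! ```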
//!
//! Phase 3 adds the approximate-regex recheck:
//!
//! * A safe FFI wrapper around the TRE C library for
//!   approximate-regex matching with up to `k` typos
//!   (see [`tre`]). The wrapper is the optional recheck step;
//!   the trigram + bloom funnel is reused unchanged.
//!
//! Phase 4: Redis FT.SEARCH / FT.REGEX command parser
//! integration on top of the dynvec fold.
//!
//! # Optional features
//!
//! * `noxu` -- enables the [`persist`] module that serialises
//!   a [`TextIndex`] to an embedded Noxu DB environment so
//!   the trigram postings, per-doc bloom filters, and raw
//!   text survive a process restart. The feature pulls in
//!   `noxu-db` and `bincode` as workspace path dependencies.
//!
//! # Quick start
//!
//! ```
//! use dyntext::index::TextIndex;
//!
//! let mut idx = TextIndex::new();
//! let id_a = idx.insert(b"the quick brown fox".to_vec());
//! let id_b = idx.insert(b"jumped over a lazy dog".to_vec());
//! let id_c = idx.insert(b"another brown fox here".to_vec());
//!
//! let hits = idx.search_substring(b"brown fox");
//! assert!(hits.contains(&id_a));
//! assert!(hits.contains(&id_c));
//! assert!(!hits.contains(&id_b));
//! ```

pub mod bloom;
pub mod index;
#[cfg(feature = "noxu")]
pub mod persist;
pub mod postings;
pub mod prefix_extract;
pub mod regex_ast;
pub mod tiling;
pub mod tre;
pub mod trigram;

pub use bloom::BloomFilter;
pub use index::{IndexedDoc, TextIndex, MIN_TRIGRAM_QUERY_LEN};
pub use postings::Postings;
pub use prefix_extract::{
    anchored_prefix, extract_literal_runs, has_top_level_start_anchor, required_trigram_hashes,
    required_trigrams,
};
pub use regex_ast::{parse as parse_regex, Ast as RegexAst, RegexError};
pub use tiling::ApproxFilter;
pub use tre::{TreCompiledPattern, TreError, TreMatch, TreMatchOpts};
pub use trigram::{
    extract_query_trigram_set, extract_query_trigrams, extract_trigram_set, extract_trigrams,
    hash_trigram,
};
95};