//! Trigram + bloom-filter text index.
//!
//! `dyntext` is the algorithmic core of the dynomite text-search
//! surface. It ports the inverted-index pipeline from
//! [pg_tre](https://codeberg.org/gregburd/pg_tre) -- a
//! PostgreSQL access method for approximate-regex matching --
//! into pure Rust, so the same trigram + bloom funnel can sit
//! behind dynomite's Redis FT.* command surface.
//!
//! # Phase 1 scope
//!
//! This crate currently implements:
//!
//! * Three-byte n-gram extraction with padding so the input's
//!   boundary bytes get full coverage (see [`trigram`]).
//! * A roaring-bitmap-backed inverted index keyed by trigram
//!   hash (see [`postings::Postings`]).
//! * A standard bloom filter with configurable bit count and
//!   hash count (see [`bloom::BloomFilter`]).
//! * A combined [`index::TextIndex`] that ties the three
//!   together and serves exact-substring queries through the
//!   four-tier filter funnel from the design doc.
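//!
//! The trigram, postings, and recheck steps listed above can be
//! sketched with the standard library alone. This is an
//! illustration under stated assumptions, not this crate's
//! implementation: `SketchIndex`, `padded_trigrams`,
//! `query_trigrams`, and `contains` are hypothetical names, the
//! two-space padding on each side is an assumed scheme, a
//! `HashMap` of `BTreeSet`s stands in for the roaring bitmaps
//! keyed by trigram hash, the bloom tier is omitted, and queries
//! too short to yield a trigram simply fall back to a full scan.
//!
//! ```rust
//! use std::collections::{BTreeSet, HashMap};
//!
//! /// Assumed pad byte; the real padding scheme may differ.
//! const PAD: u8 = b' ';
//!
//! /// Every 3-byte window of the input padded with two `PAD`
//! /// bytes on each side, so each input byte lands in three windows.
//! fn padded_trigrams(text: &[u8]) -> Vec<[u8; 3]> {
//!     let mut padded = vec![PAD; 2];
//!     padded.extend_from_slice(text);
//!     padded.extend_from_slice(&[PAD; 2]);
//!     padded.windows(3).map(|w| [w[0], w[1], w[2]]).collect()
//! }
//!
//! /// Unpadded windows for a query: a substring can start or end
//! /// mid-document, so padded trigrams must not be required.
//! fn query_trigrams(query: &[u8]) -> Vec<[u8; 3]> {
//!     query.windows(3).map(|w| [w[0], w[1], w[2]]).collect()
//! }
//!
//! /// Plain byte-substring check used as the final recheck.
//! fn contains(haystack: &[u8], needle: &[u8]) -> bool {
//!     needle.is_empty() || haystack.windows(needle.len()).any(|w| w == needle)
//! }
//!
//! /// Hypothetical stand-in for an inverted trigram index.
//! #[derive(Default)]
//! struct SketchIndex {
//!     docs: Vec<Vec<u8>>,
//!     postings: HashMap<[u8; 3], BTreeSet<u32>>,
//! }
//!
//! impl SketchIndex {
//!     /// Stores the document and records its id under each trigram.
//!     fn insert(&mut self, text: Vec<u8>) -> u32 {
//!         let id = self.docs.len() as u32;
//!         for t in padded_trigrams(&text) {
//!             self.postings.entry(t).or_default().insert(id);
//!         }
//!         self.docs.push(text);
//!         id
//!     }
//!
//!     /// Intersects the postings of every query trigram, then
//!     /// rechecks each surviving candidate against the raw text.
//!     fn search_substring(&self, query: &[u8]) -> Vec<u32> {
//!         let mut candidates: Option<BTreeSet<u32>> = None;
//!         for t in query_trigrams(query) {
//!             let ids = self.postings.get(&t).cloned().unwrap_or_default();
//!             candidates = Some(match candidates {
//!                 None => ids,
//!                 Some(c) => c.intersection(&ids).copied().collect(),
//!             });
//!         }
//!         // No trigrams (query shorter than three bytes): scan all docs.
//!         let candidates: Vec<u32> = match candidates {
//!             Some(c) => c.into_iter().collect(),
//!             None => (0..self.docs.len() as u32).collect(),
//!         };
//!         candidates
//!             .into_iter()
//!             .filter(|&id| contains(&self.docs[id as usize], query))
//!             .collect()
//!     }
//! }
//! ```
//!
//! The intersection can only produce false positives, never
//! false negatives, which is why the recheck against the stored
//! text is the last step rather than an optional one.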
//!
//! # Phase 2 + 3 scope
//!
//! Phase 2 adds regex-driven search on top of the existing
//! exact-substring path:
//!
//! * A small internal regex AST built from
//!   [`regex_syntax`]'s HIR (see [`regex_ast`]).
//! * A required-trigram extractor that walks the AST and
//!   computes the trigrams every matching string must contain
//!   (see [`prefix_extract`]).
//! * [`index::TextIndex::search_regex`], which uses the
//!   extractor to prune the postings lists before running the
//!   actual matcher (currently [`regex::bytes::Regex`]).
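//!
//! The required-trigram walk can be sketched over a toy AST.
//! This is a conservative illustration, not the extractor in
//! `prefix_extract`: the `Node` enum and the `required` function
//! are hypothetical, and trigrams that span the boundary between
//! two adjacent nodes are deliberately ignored. Dropping a
//! required trigram only weakens pruning; it never causes a
//! missed match.
//!
//! ```rust
//! use std::collections::BTreeSet;
//!
//! /// Hypothetical, simplified regex AST for illustration only.
//! enum Node {
//!     /// A fixed byte string.
//!     Literal(Vec<u8>),
//!     /// All parts must match, in order.
//!     Concat(Vec<Node>),
//!     /// Exactly one branch must match.
//!     Alternate(Vec<Node>),
//!     /// The inner node may match zero times (`?` or `*`).
//!     Optional(Box<Node>),
//! }
//!
//! /// Trigrams that every string matching `node` must contain.
//! fn required(node: &Node) -> BTreeSet<[u8; 3]> {
//!     match node {
//!         // Every window of a literal appears in any match.
//!         Node::Literal(bytes) => {
//!             bytes.windows(3).map(|w| [w[0], w[1], w[2]]).collect()
//!         }
//!         // A match contains a match of each part: take the union.
//!         Node::Concat(parts) => parts.iter().flat_map(required).collect(),
//!         // Only trigrams common to all branches are guaranteed.
//!         Node::Alternate(branches) => {
//!             let mut sets = branches.iter().map(required);
//!             let first = sets.next().unwrap_or_default();
//!             sets.fold(first, |acc, s| acc.intersection(&s).copied().collect())
//!         }
//!         // The inner node may be absent, so nothing is guaranteed.
//!         Node::Optional(_) => BTreeSet::new(),
//!     }
//! }
//! ```
//!
//! An empty result means the pattern gives the index nothing to
//! prune with, so every document stays a candidate for the
//! matcher.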
37//!
38//! Phase 3 adds the approximate-regex recheck:
39//!
40//! * Safe FFI wrapper around the TRE C library for
41//!   approximate-regex matching with up to k typos
42//!   (see [`tre`]). The wrapper is the optional recheck step;
43//!   the trigram + bloom funnel is reused unchanged.
//!
//! Phase 4 is the Redis FT.SEARCH / FT.REGEX command parser
//! integration on top of the dynvec fold.
//!
//! # Optional features
//!
//! * `noxu` -- enables the [`persist`] module that serialises
//!   a [`TextIndex`] to an embedded Noxu DB environment so
//!   the trigram postings, per-doc bloom filters, and raw
//!   text survive a process restart. The feature pulls in
//!   `noxu-db` and `bincode` as workspace path dependencies.
//!
//! # Quick start
//!
//! ```
//! use dyntext::index::TextIndex;
//!
//! let mut idx = TextIndex::new();
//! let id_a = idx.insert(b"the quick brown fox".to_vec());
//! let id_b = idx.insert(b"jumped over a lazy dog".to_vec());
//! let id_c = idx.insert(b"another brown fox here".to_vec());
//!
//! let hits = idx.search_substring(b"brown fox");
//! assert!(hits.contains(&id_a));
//! assert!(hits.contains(&id_c));
//! assert!(!hits.contains(&id_b));
//! ```

pub mod bloom;
pub mod index;
#[cfg(feature = "noxu")]
pub mod persist;
pub mod postings;
pub mod prefix_extract;
pub mod regex_ast;
pub mod tre;
pub mod trigram;

pub use bloom::BloomFilter;
pub use index::{IndexedDoc, TextIndex, MIN_TRIGRAM_QUERY_LEN};
pub use postings::Postings;
pub use prefix_extract::{required_trigram_hashes, required_trigrams};
pub use regex_ast::{parse as parse_regex, Ast as RegexAst, RegexError};
pub use tre::{TreCompiledPattern, TreError, TreMatch, TreMatchOpts};
pub use trigram::{
    extract_query_trigram_set, extract_query_trigrams, extract_trigram_set, extract_trigrams,
    hash_trigram,
};