Skip to main content

Crate dyntext

Crate dyntext 

Source
Expand description

Trigram + bloom-filter text index.

dyntext is the algorithmic core of the dynomite text-search surface. It ports the inverted-index pipeline from pg_tre – a PostgreSQL access method for approximate-regex matching – into pure Rust, so the same trigram + bloom funnel can sit behind dynomite’s Redis FT.* command surface.

§Phase 1 scope

This crate currently implements:

  • Three-byte n-gram extraction with padding so the input’s boundary bytes get full coverage (see trigram).
  • A roaring-bitmap-backed inverted index keyed by trigram hash (see postings::Postings).
  • A standard bloom filter with configurable bit count and hash count (see bloom::BloomFilter).
  • A combined index::TextIndex that ties the three together and serves exact-substring queries through the four-tier filter funnel from the design doc.

§Phase 2 + 3 scope

Phase 2 adds regex-driven search on top of the existing exact-substring path:

Phase 3 adds the approximate-regex recheck:

  • Safe FFI wrapper around the TRE C library for approximate-regex matching with up to k typos (see tre). The wrapper is the optional recheck step; the trigram + bloom funnel is reused unchanged.
  • Phase 4: Redis FT.SEARCH / FT.REGEX command parser integration on top of the dynvec fold.

§Optional features

  • noxu – enables the [persist] module that serialises a TextIndex to an embedded Noxu DB environment so the trigram postings, per-doc bloom filters, and raw text survive a process restart. The feature pulls in noxu-db and bincode as workspace path dependencies.

§Quick start

use dyntext::index::TextIndex;

let mut idx = TextIndex::new();
let id_a = idx.insert(b"the quick brown fox".to_vec());
let id_b = idx.insert(b"jumped over a lazy dog".to_vec());
let id_c = idx.insert(b"another brown fox here".to_vec());

let hits = idx.search_substring(b"brown fox");
assert!(hits.contains(&id_a));
assert!(hits.contains(&id_c));
assert!(!hits.contains(&id_b));

Re-exports§

pub use bloom::BloomFilter;
pub use index::IndexedDoc;
pub use index::TextIndex;
pub use index::MIN_TRIGRAM_QUERY_LEN;
pub use postings::Postings;
pub use prefix_extract::required_trigram_hashes;
pub use prefix_extract::required_trigrams;
pub use regex_ast::parse as parse_regex;
pub use regex_ast::Ast as RegexAst;
pub use regex_ast::RegexError;
pub use tre::TreCompiledPattern;
pub use tre::TreError;
pub use tre::TreMatch;
pub use tre::TreMatchOpts;
pub use trigram::extract_query_trigram_set;
pub use trigram::extract_query_trigrams;
pub use trigram::extract_trigram_set;
pub use trigram::extract_trigrams;
pub use trigram::hash_trigram;

Modules§

bloom
Standard bloom filter with configurable bit count and hash count.
index
Top-level TextIndex type combining trigram extraction, the inverted postings index, and the per-document bloom filter into a single insert-and-search facility.
postings
Inverted-index data structure mapping trigram hash to the set of document ids whose text contains the trigram.
prefix_extract
Required-trigram extraction from a regex AST.
regex_ast
Internal regex AST used by the Phase 2 prefix extractor.
tre
Safe Rust wrapper around tre-sys.
trigram
Three-byte n-gram extraction.