Expand description
Storage-level pre-filtering for pattern matching.
flashsieve builds per-block byte histograms and 2-byte n-gram bloom
filters, then uses them to answer which blocks might contain matches.
§Overview
This crate provides:
BlockIndexBuilder— construct indexesBlockIndex— query and serialize indexesFileBloomIndex— file-level bloom union for fast n-gram rejection before per-block scansIncrementalBuilder— append new blocks to serialized indexes without re-indexing prior dataMmapBlockIndex— query serialized indexes in placeByteFilter— byte-level pre-filteringNgramFilter— n-gram bloom pre-filteringNgramBloomandBlockedNgramBloom— standalone bloom filters
§Quick Start
use flashsieve::{BlockIndexBuilder, ByteFilter, NgramFilter};
let mut block_a = vec![b'x'; 256];
let mut block_b = vec![b'y'; 256];
block_a[..6].copy_from_slice(b"secret");
block_b[..5].copy_from_slice(b"token");
let patterns: [&[u8]; 2] = [b"secret", b"token"];
let index = BlockIndexBuilder::new()
.block_size(256)
.bloom_bits(512)
.build_streaming([block_a, block_b].into_iter())?;
let byte_filter = ByteFilter::from_patterns(&patterns);
let ngram_filter = NgramFilter::from_patterns(&patterns);
let candidates = index.candidate_blocks(&byte_filter, &ngram_filter);
assert_eq!(candidates.len(), 1);
assert_eq!(candidates[0].length, 512);§False Positive Rate
The bloom filter’s false positive rate depends on:
m= number of bitsn= number of distinct n-grams insertedk= number of hash functions (k=3in this crate)
The theoretical FPR is:
FPR ≈ (1 - e^(-kn/m))^kFor optimal k = (m/n) × ln(2):
FPR ≈ 0.6185^(m/n)§Serialization Format
Indexes can be serialized to a portable binary format (version 2):
[magic: 4 bytes = "FSBX"]
[version: u32 LE = 2]
[block_size: u64 LE]
[total_len: u64 LE]
[block_count: u64 LE]
for each block:
[histogram: 256 × u32 LE = 1024 bytes]
[bloom_num_bits: u64 LE]
[bloom_words: u64 LE]
[bloom_data: bloom_words × u64 LE]
[crc32: u32 LE]See the index module for detailed specification.
§Hash Functions
The bloom filter uses a primary 64-bit mix for each 2-byte n-gram and a
derived second hash for double hashing (k = 3 probes):
-
wyhash-style mix:
x = (u64(a) << 8) | u64(b), thenx = x.wrapping_mul(0x9E37_79B9_7F4A_7C15), thenx ^ (x >> 32). -
Second hash:
h2 = (h1 ^ (h1 >> 32)).max(1)so double hashing never uses a zero increment.
§Safety
This crate keeps unsafe usage narrowly scoped to hot-path bloom-filter
word loads where index validity is proven by construction.
Re-exports§
pub use bloom::BlockedNgramBloom;pub use bloom::NgramBloom;pub use builder::BlockIndexBuilder;pub use error::Error;pub use error::Result;pub use filter::ByteFilter;pub use filter::NgramFilter;pub use filter::CompositeFilter;pub use filter::FilterOp;pub use histogram::ByteHistogram;pub use incremental::IncrementalBuilder;pub use index::BlockIndex;pub use index::CandidateRange;pub use mmap_index::ByteHistogramRef;pub use mmap_index::MmapBlockIndex;pub use mmap_index::NgramBloomRef;
Modules§
- bloom
- Bloom filter primitives for block-level 2-byte n-gram summaries.
- builder
- Builders for constructing
BlockIndexvalues from raw block data. Construction of block indexes from raw data. - error
- Error types returned by crate operations.
Error types for
flashsieve. - filter
- Query filters used to pre-filter candidate blocks. Query filter types for block pre-filtering.
- histogram
- Byte histogram summaries stored for each indexed block. Per-block byte frequency histograms.
- incremental
- Incremental update operations for existing block indexes.
Incremental update operations for
BlockIndex. - incremental_
watch - Incremental index updates via filesystem notifications. Incremental index updates via filesystem notifications.
- index
- Indexed block metadata and candidate range queries. Block indexes and candidate range queries.
- mmap_
index - Zero-parse views over serialized block-index data. Zero-parse block index access over serialized bytes.
- transport
- Compressed transport format for peer-to-peer bloom index sharing. Compressed bloom index transport for peer-to-peer index sharing.
Structs§
- File
Bloom Index - Hierarchical bloom wrapper: file-level union bloom for fast n-gram rejection.
Wraps a
BlockIndexwith a file-level n-gram bloom filter.