flashsieve 0.1.1

Storage-level pre-filtering for pattern matching — skip blocks that can't contain matches
Documentation

flashsieve

crates.io docs.rs license

Storage-level pre-filtering for pattern matching.

flashsieve builds per-block byte histograms and 2-byte n-gram bloom filters, then uses them to answer which blocks might contain matches. If a block cannot contain a pattern, it is skipped entirely—saving CPU, I/O, and memory at scale.


Installation

[dependencies]
flashsieve = "0.1"

Minimum supported Rust version: 1.85.


Quick Start

use flashsieve::{BlockIndexBuilder, ByteFilter, NgramFilter};

let mut block_a = vec![b'x'; 256];
let mut block_b = vec![b'y'; 256];
block_a[..6].copy_from_slice(b"secret");
block_b[..5].copy_from_slice(b"token");
let patterns: [&[u8]; 2] = [b"secret", b"token"];

let index = BlockIndexBuilder::new()
    .block_size(256)
    .bloom_bits(512)
    .build_streaming([block_a, block_b].into_iter())?;

let byte_filter = ByteFilter::from_patterns(&patterns);
let ngram_filter = NgramFilter::from_patterns(&patterns);

let candidates = index.candidate_blocks(&byte_filter, &ngram_filter);
assert_eq!(candidates.len(), 1);
assert_eq!(candidates[0].length, 512);
# Ok::<(), flashsieve::Error>(())

What It Does

Component Purpose
BlockIndexBuilder Constructs indexes from raw bytes or streaming blocks
BlockIndex In-memory index with per-block histograms & bloom filters
FileBloomIndex File-level bloom union for fast n-gram rejection before per-block scans
MmapBlockIndex Zero-parse, zero-copy queries over serialized indexes
IncrementalBuilder Append new blocks to existing indexes without rebuilding
ByteFilter Reject blocks that don't contain all required bytes
NgramFilter Reject blocks that don't contain all required 2-byte n-grams

Bloom Filter Math

The theoretical false-positive rate for a Bloom filter with:

  • m = number of bits
  • n = number of distinct n-grams inserted
  • k = number of hash functions (k = 3 in this crate)

is approximately:

FPR ≈ (1 - e^(-kn/m))^k

For the optimal number of hash functions k = (m/n) × ln(2):

FPR ≈ 0.6185^(m/n)

For filters with ≥4096 bits, flashsieve allocates an exact 65,536-bit pair table, giving zero false positives for all possible 2-byte n-gram queries.


Safety

This crate keeps unsafe usage narrowly scoped to hot-path bloom-filter word loads where index validity is proven by construction. All serialized input is validated (magic, version, checksum, bounds) before use—corrupt files return typed errors rather than panicking.


License

Licensed under the MIT license (LICENSE or https://opensource.org/licenses/MIT).