Skip to main content

Crate flashsieve

Crate flashsieve 

Source
Expand description

Storage-level pre-filtering for pattern matching.

flashsieve builds per-block byte histograms and 2-byte n-gram bloom filters, then uses them to answer which blocks might contain matches.

§Overview

This crate provides:

§Quick Start

use flashsieve::{BlockIndexBuilder, ByteFilter, NgramFilter};

let mut block_a = vec![b'x'; 256];
let mut block_b = vec![b'y'; 256];
block_a[..6].copy_from_slice(b"secret");
block_b[..5].copy_from_slice(b"token");
let patterns: [&[u8]; 2] = [b"secret", b"token"];

let index = BlockIndexBuilder::new()
    .block_size(256)
    .bloom_bits(512)
    .build_streaming([block_a, block_b].into_iter())?;

let byte_filter = ByteFilter::from_patterns(&patterns);
let ngram_filter = NgramFilter::from_patterns(&patterns);

let candidates = index.candidate_blocks(&byte_filter, &ngram_filter);
assert_eq!(candidates.len(), 1);
assert_eq!(candidates[0].length, 512);

§False Positive Rate

The bloom filter’s false positive rate depends on:

  • m = number of bits
  • n = number of distinct n-grams inserted
  • k = number of hash functions (k=3 in this crate)

The theoretical FPR is:

FPR ≈ (1 - e^(-kn/m))^k

For optimal k = (m/n) × ln(2):

FPR ≈ 0.6185^(m/n)

§Serialization Format

Indexes can be serialized to a portable binary format (version 2):

[magic: 4 bytes = "FSBX"]
[version: u32 LE = 2]
[block_size: u64 LE]
[total_len: u64 LE]
[block_count: u64 LE]
for each block:
  [histogram: 256 × u32 LE = 1024 bytes]
  [bloom_num_bits: u64 LE]
  [bloom_words: u64 LE]
  [bloom_data: bloom_words × u64 LE]
[crc32: u32 LE]

See the index module for detailed specification.

§Hash Functions

The bloom filter uses a primary 64-bit mix for each 2-byte n-gram and a derived second hash for double hashing (k = 3 probes):

  1. wyhash-style mix: x = (u64(a) << 8) | u64(b), then x = x.wrapping_mul(0x9E37_79B9_7F4A_7C15), then x ^ (x >> 32).

  2. Second hash: h2 = (h1 ^ (h1 >> 32)).max(1) so double hashing never uses a zero increment.

§Safety

This crate keeps unsafe usage narrowly scoped to hot-path bloom-filter word loads where index validity is proven by construction.

Re-exports§

pub use bloom::BlockedNgramBloom;
pub use bloom::NgramBloom;
pub use builder::BlockIndexBuilder;
pub use error::Error;
pub use error::Result;
pub use filter::ByteFilter;
pub use filter::NgramFilter;
pub use filter::CompositeFilter;
pub use filter::FilterOp;
pub use histogram::ByteHistogram;
pub use incremental::IncrementalBuilder;
pub use index::BlockIndex;
pub use index::CandidateRange;
pub use mmap_index::ByteHistogramRef;
pub use mmap_index::MmapBlockIndex;
pub use mmap_index::NgramBloomRef;

Modules§

bloom
Bloom filter primitives for block-level 2-byte n-gram summaries.
builder
Builders for constructing BlockIndex values from raw block data. Construction of block indexes from raw data.
error
Error types returned by crate operations. Error types for flashsieve.
filter
Query filters used to pre-filter candidate blocks. Query filter types for block pre-filtering.
histogram
Byte histogram summaries stored for each indexed block. Per-block byte frequency histograms.
incremental
Incremental update operations for existing block indexes. Incremental update operations for BlockIndex.
incremental_watch
Incremental index updates via filesystem notifications. Incremental index updates via filesystem notifications.
index
Indexed block metadata and candidate range queries. Block indexes and candidate range queries.
mmap_index
Zero-parse views over serialized block-index data. Zero-parse block index access over serialized bytes.
transport
Compressed transport format for peer-to-peer bloom index sharing. Compressed bloom index transport for peer-to-peer index sharing.

Structs§

FileBloomIndex
Hierarchical bloom wrapper: file-level union bloom for fast n-gram rejection. Wraps a BlockIndex with a file-level n-gram bloom filter.