pub trait PrefixExtractor:
Send
+ Sync
+ UnwindSafe
+ RefUnwindSafe {
// Required method
fn prefixes<'a>(
&self,
key: &'a [u8],
) -> Box<dyn Iterator<Item = &'a [u8]> + 'a>;
// Provided method
fn is_valid_scan_boundary(&self, prefix: &[u8]) -> bool { ... }
}Expand description
Extracts prefixes from keys for prefix bloom filter indexing.
When a PrefixExtractor is configured on a tree, the bloom filter indexes
not only full keys but also the prefixes returned by PrefixExtractor::prefixes.
This allows prefix scans to skip entire segments that contain no keys with a
matching prefix, dramatically reducing I/O for prefix-heavy workloads (e.g.,
graph adjacency lists, time-series buckets).
§Thread Safety
Implementations must be Send + Sync + UnwindSafe + RefUnwindSafe.
The extractor is shared across flush, compaction, and read threads via Arc,
and may be accessed across panic boundaries (e.g., catch_unwind in tests).
§Example
use lsm_tree::PrefixExtractor;
/// Extracts prefixes at each ':' separator boundary.
///
/// For key `adj:out:42:KNOWS`, yields:
/// `adj:`, `adj:out:`, `adj:out:42:`
struct ColonSeparatedPrefix;
impl PrefixExtractor for ColonSeparatedPrefix {
fn prefixes<'a>(&self, key: &'a [u8]) -> Box<dyn Iterator<Item = &'a [u8]> + 'a> {
Box::new(
key.iter()
.enumerate()
.filter(|(_, b)| **b == b':')
.map(move |(i, _)| &key[..=i]),
)
}
}Required Methods§
Sourcefn prefixes<'a>(&self, key: &'a [u8]) -> Box<dyn Iterator<Item = &'a [u8]> + 'a>
fn prefixes<'a>(&self, key: &'a [u8]) -> Box<dyn Iterator<Item = &'a [u8]> + 'a>
Returns an iterator of prefixes to index for the given key.
Each yielded prefix will be hashed and inserted into the segment’s bloom filter. During a prefix scan, the scan prefix is hashed and checked against the bloom — segments without a match are skipped.
Implementations should return prefixes from shortest to longest. The full key itself is always indexed separately by the standard bloom path; including it in the returned prefixes is allowed but redundant and generally unnecessary.
The returned iterator must be finite and yield a small number of prefixes per key (typically 1–5). It is called on the write path for every key during flush and compaction.
§Performance note
Returns Box<dyn Iterator> for object safety (Arc<dyn PrefixExtractor>).
Most extractors yield 1–5 prefixes per key, so the allocation is negligible
compared to the bloom hash + I/O cost. A callback-based for_each_prefix
alternative could avoid this allocation but would expand the trait API
surface; consider adding it if profiling shows measurable overhead.
Provided Methods§
Sourcefn is_valid_scan_boundary(&self, prefix: &[u8]) -> bool
fn is_valid_scan_boundary(&self, prefix: &[u8]) -> bool
Returns true if prefix is a valid scan boundary for this extractor.
A scan boundary is valid when every key that the tree would consider
a match for this prefix in a prefix scan had prefix indexed via
prefixes at write time. This is the contract that
makes bloom-based table skipping safe: if the bloom filter says “no
match”, we can skip the table because every matching key would have
produced the prefix hash during flush/compaction.
§Default implementation
Checks whether prefixes(prefix) emits prefix itself — i.e.,
whether the extractor considers this byte sequence a boundary.
This is correct for well-behaved extractors whose prefixes() returns
sub-slices of the input key.
§When to override
Override this method when the default self-referential check is either:
- Too expensive — e.g., the extractor can check a sentinel byte in O(1) instead of iterating all prefixes.
- Incorrect — e.g., the extractor produces prefixes that are not
sub-slices of the input, so the default
any(|p| p == prefix)check would never match even for valid boundaries.