pub struct KeyIndexer { /* private fields */ }Expand description
KeyIndexer maps 64-bit key hashes to 48-bit file offsets, augmented with
a 16-bit tag for lightweight collision detection.
§Purpose
While XXH3 is a high-quality 64-bit hash, collisions are still possible in large-scale stores. This index:
- Detects such collisions via a 16-bit fingerprint (
tag) - Avoids storing full keys in memory
- Maintains constant memory footprint (
u64per key)
§Packed Value Format
Stored in a single u64:
[63 48][47 0]
[ tag (16) ][ file offset (48) ]§Collision Handling
During lookups, the stored tag is compared to a rederived one from the key or key hash. If the tags do not match, the entry is rejected as a potential collision.
§Limitations
- Max file size: 2^48 bytes = 256 TiB
- Any file larger than this will overflow the offset field and corrupt the tag.
- Tag uniqueness: 2^16 = 65,536 distinct tags
- Probability of tag collision is 1 in 65,536 per conflicting key hash
- This is sufficient to distinguish over 4 billion keys with ~50% collision probability (birthday bound)
§Tradeoffs
This scheme offers a middle ground:
- Significantly improves safety over pure hash indexing
- No dynamic memory cost
- Small and predictable performance cost (~2 bit ops + 1 compare)
Implementations§
Source§impl KeyIndexer
impl KeyIndexer
Sourcepub fn tag_from_hash(key_hash: u64) -> u16
pub fn tag_from_hash(key_hash: u64) -> u16
Returns a 16-bit tag from the upper bits of a key hash.
Sourcepub fn tag_from_key(key: &[u8]) -> u16
pub fn tag_from_key(key: &[u8]) -> u16
Computes a tag directly from the raw key.
Sourcepub fn pack(tag: u16, offset: u64) -> u64
pub fn pack(tag: u16, offset: u64) -> u64
Combines a tag and offset into a packed 64-bit value.
§Panics
If offset exceeds 48 bits (i.e. > 256 TiB), this will panic.
Sourcepub fn build(mmap: &Mmap, tail_offset: u64) -> Self
pub fn build(mmap: &Mmap, tail_offset: u64) -> Self
Scans the file backwards and builds a tag-aware hash index.
The most recent version of each key hash is kept.
Sourcepub fn insert(
&mut self,
key_hash: u64,
new_offset: u64,
) -> Result<Option<u64>, &'static str>
pub fn insert( &mut self, key_hash: u64, new_offset: u64, ) -> Result<Option<u64>, &'static str>
Inserts a new key hash and offset into the index, with collision detection.
This method checks the tag of an existing entry before overwriting. If the
tags do not match, it returns an Err, indicating a hash collision.
§Returns
Ok(Some(u64)): On a successful update, returning the previous offset.Ok(None): On a successful insert of a new key.Err(&'static str): On a tag mismatch, indicating a hash collision.
Sourcepub fn get_packed(&self, key_hash: &u64) -> Option<&u64>
pub fn get_packed(&self, key_hash: &u64) -> Option<&u64>
Gets the raw packed value.
Sourcepub fn get_offset(&self, key_hash: &u64) -> Option<u64>
pub fn get_offset(&self, key_hash: &u64) -> Option<u64>
Returns only the unpacked offset (ignores tag).
pub fn len(&self) -> usize
pub fn is_empty(&self) -> bool
Sourcepub fn values(&self) -> Values<'_, u64, u64>
pub fn values(&self) -> Values<'_, u64, u64>
Returns a memory-efficient iterator over the packed (tag|offset) values.
This method is preferable to collecting all values into a Vec when the
index is large, as it avoids a large upfront memory allocation. The iterator
borrows the underlying index, so it must be used within the lifetime of the
KeyIndexer’s read lock.