Expand description
§PtrHash: Minimal Perfect Hashing at RAM Throughput
PtrHash builds a minimal perfect hash function, that is,
a hash function that maps a fixed set of keys to {0, ..., n-1}
.
PtrHash was developed for large key sets of at least 1 million keys, and has been tested up to 10^11 keys. It only uses 2.4 bits per key.
It can also be used for arbitrary small sets. In this case, the space efficiency will be less due to a relatively large constant overhead.
See the GitHub readme or paper (arXiv, blog version) for details on the algorithm and performance.
Usage example:
use ptr_hash::{PtrHash, PtrHashParams};
// Generate some random keys.
let n = 1_000_000;
let keys = ptr_hash::util::generate_keys(n);
// Build the datastructure.
// NOTE: For small sets, say <1M keys, use `PtrHashParams::default_fast()` instead.
// The default parameters are optimized for large sets, and need large (>100k or so) inputs
// to ensure internal bucket sizes don't deviate too much from their expectation.
let mphf = <PtrHash>::new(&keys, PtrHashParams::default());
// Get the index of a key.
let key = 0;
let idx = mphf.index(&key);
assert!(idx < n);
// Get the non-minimal index of a key.
// Can be slightly faster returns keys up to `n/alpha ~ 1.01*n`.
let _idx = mphf.index_no_remap(&key);
// An iterator over the indices of the keys.
// 32: number of iterations ahead to prefetch.
// true: remap to a minimal key in [0, n).
// _: placeholder to infer the type of keys being iterated.
let indices = mphf.index_stream::<32, true, _>(&keys);
assert_eq!(indices.sum::<usize>(), (n * (n - 1)) / 2);
// Query a batch of keys.
let keys = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15];
let mut indices = mphf.index_batch::<16, true, _>(keys);
indices.sort();
for i in 0..indices.len()-1 {
assert!(indices[i] != indices[i+1]);
}
// Test that all items map to different indices
let mut taken = vec![false; n];
for key in keys {
let idx = mphf.index(&key);
assert!(!taken[idx]);
taken[idx] = true;
}
§Partitioning
By default, PtrHash partitions the keys into multiple parts. This speeds up construction in two ways:
- smaller parts have better cache locality, and
- parts can be constructed in parallel.
However, at query time there is a small overhead to compute the part of each key.
To achieve slightly faster queries, set PtrHashParams::single_part
to true
,
and then use PtrHash::index_single_part()
instead of PtrHash::index()
.
Modules§
- bucket_
fn - Various “bucket functions” are implemented here.
- hash
- Customizable Hasher trait.
Implementations of the
Hash
trait that abstracts over 64 and 128-bit hashes. - pack
- Extendable backing storage trait and types.
The
Packed
andMutPacked
traits are used for the underlying storage of the remap vector. - util
- Some internal logging and testing utilities. Internal utilities that are only exposed for testing/benchmarking purposes. Do not use externally.
Structs§
- PtrHash
- PtrHash datastructure. It is recommended to use PtrHash with default types.
- PtrHash
Params - Parameters for PtrHash construction.
Enums§
- Sharding
- The sharding method to use.
Traits§
- KeyT
- Trait that keys must satisfy.
Type Aliases§
- Default
PtrHash - Type alias to simplify construction.