pub struct SetU64(_);
Expand description
A set of u64
Implementation
The implementation and size of a SetU64
is an internal detail that
is not stable, but may guide your use. It is optimized for size,
while maintaining the scaling of a hashmap.
The implementation
is designed for the use case of storing indexes into a Vec
. This use
case tends to involve small integers (they must be less than its len
),
which could include any number of such integers. They also have a greater
than average likelihood of including sequential or close integers, particularly if
values are pushed to the Vec
while their indexes are added to sets.
Small sets
Very sets of up to seven small numbers are stored on the stack in a single tagged pointer. This is 8 bytes on a 64-bit system, and 4 bytes on a 32-bit system. On a 32-bit system (which I won’t discuss further here, look in the code!) the elemets must be smaller in order to be stored without allocation.
Sets stored in a single word have 3 bits dedicated to the number of elements in the set, with the remaining bits used to represent the value of the smallest element in the set, followed by the differences between subsequent elements in the ordered set. Twice as many bits (rounded up) are dedicated to the fist element as to the differences.
On a 64-bit system, this works out to…
- We can store a set with a single integer less than about 1018. Thus we should be able to store just about any zero-element or one-element values on the stack.
- We can store two values on the stack, provided the lesser is less than 1012, and the difference between them is less than a million (106). This starkly highlights the optimization for closely spaced numbers. It is also tweakable in the code, and could be changed.
- We can store three values on the stack, if the first is less than about 3×107 (30 million), and the following two values have gaps of less than 4096.
- …
- To hold 7 values in 64 bits, we limit the first value to about 500 thousand, and the following differences must be less than 128. So your seven numbers will seriously need to be either all quite small or quite closely packed.
Larger sets
Larger sets are stored on the heap. We currently have three different heap formats, which will be chosen based on the distribution of your values. The format is only changed when reallocation is required, so it may be challenging to predict the format of a given set, particularly as the reallocation size is randomized in order to mitigate the risk of hash collision attacks. The three formats are:
Internal::Dense
The set is stored as a bitmap up to a maximum value. This format is chosen when the number of elements exceeds 1/127 of the maximum value (see the implementation ofSetU64::with_capacity_and_max
), which means that less than a byte will be used per element.Internal::Heap
The set is stored as a kind of Robin Hood hash map (without hashing!) from most significant bits to bitmaps holding sets of the stored least significant bits.Internal::Big
An ordinary Robin Hood hash set (without hashing!), with a dynamic sentinal value indicating that a bucket is empty. This is used only when the maximum value in the set is very large.
Internal::Heap
format
I will describe here some details of the Internal::Heap
format, which is likely the
most common one, and is definitely the most complicated. The data is stored in an array
of u64
buckets. Based on the largest element
present, we divide the bits of each of those elements into a key and a bitmap. The array
is a Robin Hood hashmap between keys and bitmaps. The key
represents the most significant value of the elements stored, and the bitmap stores the
set of values which have that same most significant value.
As an example, consider a set with a maximum of 5000. This maximum requires 13
bits to represent, but we don’t need 13 bits to store the key, because that would leave 51 bits for
the bitmap, so each bucket would hold 51 possible values, enabling us to store values of
up to 51*(1 << 13)
. So instead we could use just 7 bits to hold the keys, leaving 57
bits per bucket, which would thus enable us to store values up to about 54 thousand. Which
bucket size we use will depend on the order in which elements were added, since we only
reallocate when needed either because we have an element that we cannot fit, or because our
hashmap is too full. When allocating, we tend to leave room for the maximum to increase.
Assuming we use 13 bits per bucket, then let’s talk through the process of inserting the
value 137. Since we have 51 elements per bucket, the key is found by dividing by 51,
which gives us a key of 137/51 = 2. We then look up the bucket with key 2 using a pretty
standard Robin Hood algorithm (except that the keys are a portion of a word). Once we have
that bucket, we will identify that the bit corresponding to our value is bit number 137 % 51
,
which we will set (and also check the value of, to track the number of elements in the set)
and determine the return value.
This format allows us to efficiently store sets in which there are contiguous chunks of
elements, and when the elements are widely spaced it at least takes no more than 64 bits
per 64-bit value, plus hash-set overhead. Its complexity also makes it significantly
slower for insertion (or collect()
) than a standard HashSet
, but it also can take
considerably less space.
Implementations
sourceimpl SetU64
impl SetU64
sourcepub fn with_capacity_of(other: &Self) -> Self
pub fn with_capacity_of(other: &Self) -> Self
Create an empty set with capacity to hold the provided set.
let a: tinyset::SetU64 = (1..300).collect();
let mut b = tinyset::SetU64::with_capacity_of(&a);
assert_eq!(a.capacity(), b.capacity());
assert_eq!(b.len(), 0);
for i in a.iter() {
b.insert(i);
}
assert_eq!(a.capacity(), b.capacity());
assert_eq!(b.len(), a.len());
Trait Implementations
sourceimpl<'b> BitOr<&'b SetU64> for SetU64
impl<'b> BitOr<&'b SetU64> for SetU64
sourceimpl Extend<u64> for SetU64
impl Extend<u64> for SetU64
sourcefn extend<T: IntoIterator<Item = u64>>(&mut self, iter: T)
fn extend<T: IntoIterator<Item = u64>>(&mut self, iter: T)
sourcefn extend_one(&mut self, item: A)
fn extend_one(&mut self, item: A)
extend_one
)sourcefn extend_reserve(&mut self, additional: usize)
fn extend_reserve(&mut self, additional: usize)
extend_one
)