numerical-multiset: An ordered multiset of machine numbers
What is this?
Let's break down the title of this README into more digestible chunks:
-
Multiset: The
NumericalMultisetcontainer provided by this crate implements a generalization of the mathematical notion of set, called a multiset. Unlike a set, a multiset can simultaneously hold multiple elements that compare equal to each other. -
Ordered: Most multiset implementations are based on associative containers. In a crates.io survey performed at the time of writing, the most popular choice was hash maps, which do not expose any meaningful element ordering:
- Any hash map insertion can change the order of elements that is exposed by iterators.
- There is no efficient way to find what is e.g. the smallest element without iterating over all elements.
In contrast, this multiset implementation is based on an ordered associative container (a
BTreeMapat the time of writing). This allows it to efficiently answer order-related queries, like in-order iteration over elements or extraction of the minimum/maximum element. The price to pay for these fast ordering queries is that the algorithmic complexity of order-unrelated multiset operations like insertions are deletions will scale less well as the number of elements grows. -
Numbers: In Rust terms, a general-purpose multiset is typically based on a data structure of the form
Map<Element, Collection<Element>>whereMapis an associative container andCollectiongroups together multiset elements that share a common comparison key. However, the multiset provided by this crate is not general-purpose, but specialized for machine number types (u32,f32...) and newtypes thereof. These number types share a few properties which this crate leverages to improve ergonomics, memory usage and execution performance:- The equality operator of machine numbers is defined such that if two
numbers compare equal, they are the same number, i.e. there is no hidden
internal information that their
PartialEqimplementation does not look at. Storing a collection of identical values like a general-purpose multiset does is thus pointless in our case, instead we can simply count the number of occurences of each value, leading to a more efficient sparse histogram data structure of the formMap<Element, NonZeroUsize>. - Machine numbers all implement the
Copytrait, which means that we do not need the complexity of the standard library's associative container APIs (with reference accessors and nontrivial use of theBorrowtrait) and can instead provide simpler and slightly more efficient value-based APIs.
- The equality operator of machine numbers is defined such that if two
numbers compare equal, they are the same number, i.e. there is no hidden
internal information that their
How is this useful?
This crate was initially written while implementing windowed signal processing algorithms where one is receiving an stream of numbers, and for each new input beyond the first few ones, one needs to answer a question about the last N numbers that were received, such that the naive algorithm to answer the question involves maintaining a sorted list of previous data points.
For example, a median filter can be efficiently implemented by maintaining two
NumericalMultisets, one representing numbers below the current median and one
representing numbers above the current median. This is effectively just a sparse
variation of the classic dense histogramming algorithm that is commonly used
when processing 8-bit grayscale images, but sadly does not scale to
higher-resolution input data (>= 32-bit) due to excessive memory usage and poor
cache locality.
License
This crate is distributed under the terms of the MPLv2 license. See the LICENSE file for details.
More relaxed licensing (Apache, MIT, BSD...) may also be negociated, in exchange of a financial contribution. Contact me for details at knights_of_ni AT gmx DOTCOM.