Crate csf

source ·
Expand description

csf is the Rust library (by Piotr Beling) of (compressed) static functions that use perfect hashing (and value compression).

csf contains the following types that implement static functions: ls::Map, fp::Map. They can represent functions (immutable maps) from a set of (hashable) keys to unsigned integers of given bit-size. They take somewhat more than nb bits to represent a function from an n-element set into a set of b-bit values. The expected time complexities of their construction and evaluation are O(n) and O(1), respectively.

csf contains the following types that implement compressed static functions: ls::CMap, fp::CMap, fp::GOCMap. They can represent functions (immutable maps) from a set of (hashable) keys to a set of values of any type. To represent a function f:X→Y, they use the space slightly larger than |X|H, where H is the entropy of the distribution of the f values over X. Their expected time complexity is O(c) for evaluation and O(|X|c) for construction (not counting building the encoding dictionary), where c is the average codeword length (given in code fragments) of the values.

None of the static functions (including the compressed ones) included in csf explicitly store keys. Therefore, these functions are usually unable to detect whether an item belongs to the set of keys for which they were constructed. Thus, queried for a value assigned to an item outside that set, they can return any value.

§Example

use csf::ls;

let subset: ls::Map = [("alpha", 1u8), ("beta", 0), ("gamma", 1)].as_ref().into();
assert_eq!(subset.get(&"alpha"), 1);
assert_eq!(subset.get(&"beta"), 0);
assert_eq!(subset.get(&"gamma"), 1);
// Any 1-bit value (either 0 or 1) can be returned for other arguments:
assert!(subset.get(&"other") <= 1);

§Space overhead (benchmarks)

§Introduction and description of the distributions used

The plots show the space overhead (over entropy of the distribution of function values) of the static functions included in the csf, for two families of functions that differ in the distribution of values:

  • Functions with equal distribution map 220 = ed+i keys to d+1 different values, where d=1,…,255 and i=1,…,e. One value is assigned to i keys, and each of the other d values is assigned to e keys. The entropy of the distribution of function values is in the range (log2(d), log2(d+1)] and increases with both d and i.
  • Functions with dominated (by a single value) distribution map 220 keys to 256 different values. One value is assigned to 220-255k keys and each of the remaining 255 values is assigned to k keys, where k=1,…,220/256. The entropy of the distribution of function values increases with k.

The dashed black line shown in each plot is the size of the uncompressed vector of values, assuming that each value is stored using the smallest possible (for the number of different values), constant whole number of bits.

The data for the plots were generated using the csf_benchmark program.

§(Uncompressed) static functions

2024-02-24T19:57:41.836909 image/svg+xml Matplotlib v3.6.3, https://matplotlib.org/ 2024-02-24T19:57:43.961693 image/svg+xml Matplotlib v3.6.3, https://matplotlib.org/ 2024-02-24T19:57:44.642221 image/svg+xml Matplotlib v3.6.3, https://matplotlib.org/ 2024-02-24T19:57:45.251792 image/svg+xml Matplotlib v3.6.3, https://matplotlib.org/

§Compressed static functions

Notes: The functions use Huffman coding to compress the values. The codewords consist of fragments of a length that minimizes the size of the function (which is always 1 bit in the case of ls::CMap). Level sizes of fp::CMap and fp::GOCMap are determined by fp::OptimalLevelSize.

2024-02-24T19:57:45.989679 image/svg+xml Matplotlib v3.6.3, https://matplotlib.org/ 2024-02-24T19:57:46.744639 image/svg+xml Matplotlib v3.6.3, https://matplotlib.org/ 2024-02-24T19:57:47.482279 image/svg+xml Matplotlib v3.6.3, https://matplotlib.org/ 2024-02-24T19:57:48.328443 image/svg+xml Matplotlib v3.6.3, https://matplotlib.org/

Modules§

  • Encoding of values of any type to/from a sequence of code words of fixed bit length.
  • Compressed static maps based on fingerprinting.
  • Compressed static maps based on solving linear systems.

Traits§

  • Provides methods to get dynamic and total size of the variable.

Functions§

  • Calculates the minimal number of bits needed to store values from 0 to given max_value.
  • Calculates the minimal number of bits needed to store any of the given values.
  • Calculates the minimal number of bits needed to store any of the given values.