Compressed Integer Vector Library
A Rust library that provides a compressed representation for vectors of unsigned 64-bit integers, built on several variable-length codes provided by the dsi-bitstream library. It is engineered to offer both efficient compression and fast random access to individual elements; to support the latter, the library uses a sampling technique that balances decoding speed and memory footprint.
The compressed-intvec behaves like a standard Rust vector, allowing efficient storage and retrieval of integers while minimizing memory usage through compression. The library supports both big-endian (`BEIntVec`) and little-endian (`LEIntVec`) representations and offers a variety of codecs to choose from.
A Quick Example
Let's consider the universal Gamma code introduced by Elias in the 1960s. This code represents a positive integer as a unary prefix followed by the binary representation of the integer (hence the name universal: for every integer $x$ the code length is $O(\log x)$, so only a constant factor longer than its plain binary form). For example, 9 is encoded as 0001001.
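To make the shape of the code concrete, here is a tiny stand-alone sketch (not part of the library, which encodes via dsi-bitstream rather than strings) that builds the gamma code of a positive integer as a bit string:

```rust
/// Elias gamma code as a bit string: ⌊log2(x)⌋ zeros, then x in binary.
/// (Illustration only, not the library's encoder.)
fn elias_gamma(x: u64) -> String {
    assert!(x > 0, "gamma coding is defined for positive integers");
    let bits = 64 - x.leading_zeros(); // number of binary digits of x
    let mut code = "0".repeat(bits as usize - 1); // unary prefix
    code.push_str(&format!("{x:b}")); // binary representation of x
    code
}

assert_eq!(elias_gamma(9), "0001001");
```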
```rust
use compressed_intvec::BEIntVec;
use compressed_intvec::codecs::GammaCodec;

let vec: Vec<u64> = vec![1, 2, 3, 4, 5];

// The compressed int-vec needs a sampling parameter.
let sampling_param = 2; // small, since the vector is small

let compressed_be = BEIntVec::<GammaCodec>::from(&vec, sampling_param);
assert_eq!(compressed_be.len(), vec.len());

for (i, value) in compressed_be.iter().enumerate() {
    assert_eq!(value, vec[i]);
}
```
Alternatively, you can use `LEIntVec` for a little-endian representation:

```rust
use compressed_intvec::LEIntVec;
use compressed_intvec::codecs::GammaCodec;

let vec: Vec<u64> = vec![1, 2, 3, 4, 5];
let compressed_le = LEIntVec::<GammaCodec>::from(&vec, 2);

for (i, value) in compressed_le.iter().enumerate() {
    assert_eq!(value, vec[i]);
}
```
Available Codecs
All the codecs are provided by the dsi-bitstream library. The following are implemented in the compressed-intvec library:
| Codec Name | Description |
|---|---|
| `GammaCodec` | Uses gamma coding without requiring extra runtime parameters. |
| `DeltaCodec` | Uses delta coding, likewise without extra parameters. |
| `ExpGolombCodec` | Requires an extra runtime parameter k for encoding/decoding. |
| `ZetaCodec` | Uses an additional runtime parameter ζ. |
| `RiceCodec` | Uses a Rice parameter for encoding/decoding. Ideal for skewed distributions; a good starting point is the floor of the log2 of the mean of your values (see the sketch after the table). |
| `MinimalBinaryCodec` | A minimal binary code with upper bound u > 0 (truncated binary encoding). Optimal for uniformly distributed data in the range [0, u). |
| `ParamZetaCodec` | Parameterized variant of the Zeta codec using compile-time flags. |
| `ParamDeltaCodec` | Parameterized variant of the Delta codec using compile-time flags. |
| `ParamGammaCodec` | Parameterized variant of the Gamma codec using compile-time flags. |
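The suggested Rice parameter can be computed directly from the data. The helper below is a hypothetical sketch of that rule of thumb, not part of the library:

```rust
/// Hypothetical helper: pick a Rice parameter as ⌊log2(mean of the values)⌋.
fn suggested_rice_param(values: &[u64]) -> u64 {
    let mean = values.iter().sum::<u64>() as f64 / values.len() as f64;
    mean.log2().floor().max(0.0) as u64
}

// mean = 10, ⌊log2(10)⌋ = 3
assert_eq!(suggested_rice_param(&[4, 8, 12, 16]), 3);
```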
For codecs that require extra parameters, we can create a compressed int-vec with the `from_with_params` method:
```rust
use compressed_intvec::BEIntVec;
use compressed_intvec::codecs::RiceCodec;

let vec: Vec<u64> = vec![10, 20, 30, 40, 50];

let rice_param = 3; // for example
let sampling_param = 2;

let compressed = BEIntVec::<RiceCodec>::from_with_params(&vec, sampling_param, rice_param);
```
Choosing the right codec is crucial for achieving optimal compression. The efficiency of a codec is highly dependent on the underlying data distribution. For example, Rice coding is usually effective for skewed distributions, while Minimal Binary coding is optimal for uniformly distributed data.
Memory Analysis (and why choosing the right codec is important)
Both the little-endian and big-endian versions of the compressed int-vec support the [MemDbg/MemSize](https://docs.rs/mem-dbg/) traits from the mem_dbg crate, so the memory layout of a `BEIntVec` instance can be inspected with `mem_dbg(DbgFlags::empty())`, as in the example below.
Consider a vector of u64 values uniformly distributed in the range [0, u64::MAX). The following function generates this vector:
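A minimal sketch of such a generator, assuming the rand crate (the exact helper used for the measurements below may differ):

```rust
use rand::Rng;

/// Returns `len` values drawn uniformly at random from `[0, upper)`.
fn generate_uniform_vec(len: usize, upper: u64) -> Vec<u64> {
    let mut rng = rand::thread_rng();
    (0..len).map(|_| rng.gen_range(0..upper)).collect()
}
```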
The size of the vector before compression is measured as follows:
```rust
use mem_dbg::{DbgFlags, MemDbg};

let input_vec = generate_uniform_vec(1000, u64::MAX);
println!("Size of the standard Vec<u64>");
let _ = input_vec.mem_dbg(DbgFlags::empty());
```
This outputs:

```text
Size of the standard Vec<u64>
8024 B ⏺
```
Next, we compress the vector using both `MinimalBinaryCodec` and `DeltaCodec`:
```rust
use compressed_intvec::codecs::{DeltaCodec, MinimalBinaryCodec};

let sampling_param = 32;
// MinimalBinaryCodec takes the upper bound of the values as its extra parameter.
let minimal_intvec = BEIntVec::<MinimalBinaryCodec>::from_with_params(&input_vec, sampling_param, u64::MAX);
let delta_intvec = BEIntVec::<DeltaCodec>::from(&input_vec, sampling_param);

println!("Size of the compressed IntVec with MinimalBinaryCodec");
let _ = minimal_intvec.mem_dbg(DbgFlags::empty());

println!("Size of the compressed IntVec with DeltaCodec");
let _ = delta_intvec.mem_dbg(DbgFlags::empty());
```
The compression results are:
```text
Size of the compressed IntVec with MinimalBinaryCodec
832 B ⏺
528 B ├╴data
280 B ├╴samples
  0 B ├╴codec
  8 B ├╴k
  8 B ├╴len
  8 B ├╴codec_param
  0 B ╰╴endian
```

```text
Size of the compressed IntVec with DeltaCodec
9584 B ⏺
9288 B ├╴data
 280 B ├╴samples
   0 B ├╴codec
   8 B ├╴k
   8 B ├╴len
   0 B ├╴codec_param
   0 B ╰╴endian
```
As shown, `MinimalBinaryCodec` is the most efficient for uniformly distributed data, reducing the size to 832 bytes, roughly an order of magnitude smaller than the original vector. In contrast, `DeltaCodec` actually increases the size to 9584 bytes, as it is poorly suited to uniform distributions.
Benchmarks
Benchmarks are provided to evaluate both the random access speed and space occupancy of compressed vectors.
- Random Access Benchmarks: Located in random_access.rs, these benchmarks measure the time required to access individual elements.
- Space Occupancy Benchmarks: Located in bench_size.rs, these benchmarks report the memory footprint of various codec configurations and compressed representations.
To run the benchmarks, execute `cargo bench`.
The results are output to the terminal and also written to CSV files (e.g. benchmark_space.csv).
Space Occupancy
In the first graph, the input is a vector of 10k random elements uniformly distributed in the range [0, 10k). Here, all codecs outperform the standard vector in terms of space occupancy, but MinimalBinaryCodec clearly wins, as it is specifically designed for this type of distribution. However, the other codecs also perform well because the range is small.
In the second graph, the input is the same vector, but the range is [0, u32::MAX) (viewed as u64). Here, all codecs start to perform poorly, except for MinimalBinaryCodec, which remains the best. In particular, codecs like Gamma perform worse than the standard vector.
If we were to increase the range even further, all codecs except MinimalBinaryCodec would perform worse than the standard vector.
Random Access
Even though access to this compressed integer vector is $O(1)$ in theory, we cannot expect it to be as fast as a standard vector: performance depends on the codec used, the distribution of the data, and the sampling parameter. The benchmarks show that performance is still quite good, even for large vectors; choosing a sampling rate such as k = 32 seems to be a good trade-off between memory and speed.
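As a rough sketch of that trade-off, using the constructor shape from the examples above: a smaller sampling parameter stores more absolute positions (more auxiliary memory, fewer codes to decode per access), while a larger one stores fewer (less memory, more sequential decoding per access).

```rust
use compressed_intvec::BEIntVec;
use compressed_intvec::codecs::GammaCodec;

let data: Vec<u64> = (0..10_000).collect();

// Denser samples: faster random access, larger `samples` vector.
let faster = BEIntVec::<GammaCodec>::from(&data, 8);
// Sparser samples: slower random access, smaller `samples` vector.
let smaller = BEIntVec::<GammaCodec>::from(&data, 128);
```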
License
This library is licensed under the Apache License, Version 2.0. See the LICENSE file for more details.