# Compressed Integer Vector Library

A Rust library for compressing vectors of `u64` integers using variable-length coding from `dsi-bitstream`. It offers fast random access via sampling to balance speed and memory. Choose between big-endian (`BEIntVec`) or little-endian (`LEIntVec`) encoding.
## Features

- **Efficient Compression**: leverages various codecs such as Gamma, Delta, and Rice coding.
- **Fast Random Access**: achieves $O(1)$ access with configurable sampling.
- **Memory Analysis**: integrates with `mem-dbg` for memory profiling.
- **Flexible Codecs**: select a codec based on the data distribution for optimal compression.
The sampling parameter determines how often full bit positions are stored to speed up access. Higher values reduce memory overhead but may increase access time. For example, `sampling_param = 32` is usually a good trade-off for large datasets.
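To illustrate the idea behind sampling, here is a toy model (not the crate's actual API or encoding) that stores the bit offset of every `k`-th value over a stream of unary codes, so that `get(i)` jumps to the nearest preceding sample and decodes at most `k - 1` codes forward:

```rust
/// Toy model of sampled random access over variable-length codes.
/// Values are unary-coded (v zeros then a 1) purely for simplicity;
/// this is NOT the crate's implementation.
struct ToySampled {
    bits: Vec<bool>,     // concatenated unary codes
    samples: Vec<usize>, // bit offset of every k-th value
    k: usize,
}

impl ToySampled {
    fn from(values: &[u64], k: usize) -> Self {
        let (mut bits, mut samples) = (Vec::new(), Vec::new());
        for (i, &v) in values.iter().enumerate() {
            if i % k == 0 {
                samples.push(bits.len()); // record a full position
            }
            bits.extend(std::iter::repeat(false).take(v as usize));
            bits.push(true);
        }
        ToySampled { bits, samples, k }
    }

    fn get(&self, i: usize) -> u64 {
        // Jump to the nearest preceding sample...
        let mut pos = self.samples[i / self.k];
        // ...then skip at most k - 1 codes forward.
        for _ in 0..(i % self.k) {
            while !self.bits[pos] {
                pos += 1;
            }
            pos += 1; // skip the terminating 1
        }
        // Decode the i-th value: count zeros up to the next 1.
        let start = pos;
        while !self.bits[pos] {
            pos += 1;
        }
        (pos - start) as u64
    }
}

fn main() {
    let values = [3, 0, 2, 5, 1];
    let toy = ToySampled::from(&values, 2);
    for (i, &v) in values.iter().enumerate() {
        assert_eq!(toy.get(i), v);
    }
    println!("all accesses match"); // prints "all accesses match"
}
```

A larger `k` means fewer stored offsets (less memory) but more codes to skip per access, which is exactly the trade-off the sampling parameter controls.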
## A Quick Example

Let's consider the universal Gamma code, introduced by Elias in the 1960s. This code represents an integer as a unary prefix followed by the binary representation of the integer (thus the name *universal*: for every integer $x$ the code length is $O(\log x)$, so only a constant factor longer than its plain binary form). For example, `9` is encoded as `0001001`.
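To make the encoding concrete, here is a small standalone sketch (independent of the library) that builds the gamma code of a positive integer as a bit string:

```rust
/// Elias gamma code of a positive integer, as a string of '0'/'1':
/// floor(log2(x)) zeros, followed by the binary digits of x.
fn gamma_code(x: u64) -> String {
    assert!(x > 0, "gamma coding is defined for positive integers");
    let bits = 64 - x.leading_zeros(); // number of binary digits of x
    let mut out = "0".repeat(bits as usize - 1); // unary prefix
    out.push_str(&format!("{x:b}")); // binary part, which starts with a 1
    out
}

fn main() {
    assert_eq!(gamma_code(9), "0001001"); // the example above
    assert_eq!(gamma_code(1), "1");       // smallest code: empty prefix
    println!("{}", gamma_code(9)); // prints "0001001"
}
```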
```rust
// NOTE: exact module paths and signatures reconstructed from context;
// check the crate documentation for the precise API.
use compressed_intvec::BEIntVec;
use compressed_intvec::codecs::GammaCodec;

let vec = vec![1, 2, 3, 4, 5]; // example data

// The compressed intvec needs a sampling parameter.
let sampling_param = 2; // small, since the vector is small

let compressed_be = BEIntVec::<GammaCodec>::from(&vec, sampling_param);
assert_eq!(compressed_be.len(), vec.len());

for (i, value) in compressed_be.iter().enumerate() {
    assert_eq!(value, vec[i]);
}
```
Or, alternatively, you can use `LEIntVec` for the little-endian representation:

```rust
// NOTE: exact module paths and signatures reconstructed from context.
use compressed_intvec::LEIntVec;
use compressed_intvec::codecs::GammaCodec;

let vec = vec![1, 2, 3, 4, 5]; // example data
let sampling_param = 2;

let compressed_le = LEIntVec::<GammaCodec>::from(&vec, sampling_param);

for (i, value) in compressed_le.iter().enumerate() {
    assert_eq!(value, vec[i]);
}
```
## Available Codecs

| Codec Name | Description |
|---|---|
| `GammaCodec` | Uses gamma coding without requiring extra runtime parameters. |
| `DeltaCodec` | Uses delta coding, likewise without extra parameters. |
| `ExpGolombCodec` | Requires an extra parameter (e.g., a parameter `k`) for encoding/decoding. |
| `ZetaCodec` | Uses an additional runtime parameter ζ. |
| `RiceCodec` | Uses a Rice parameter for encoding/decoding. Ideal for skewed distributions; a good choice for this parameter is the floor of the log₂ of the mean of your values. |
| `MinimalBinaryCodec` | A minimal binary code with upper bound `u > 0` (truncated binary encoding). Optimal for uniformly distributed data in the range `[0, u)`. |
| `ParamZetaCodec` | Parameterized variant of the Zeta codec using compile-time flags. |
| `ParamDeltaCodec` | Parameterized variant of the Delta codec using compile-time flags. |
| `ParamGammaCodec` | Parameterized variant of the Gamma codec using compile-time flags. |
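As a side note, the Rice-parameter rule of thumb mentioned above (the floor of log₂ of the mean) is easy to compute; a minimal sketch, assuming a non-empty slice of values:

```rust
/// Rice parameter chosen as floor(log2(mean of the values)),
/// the rule of thumb for skewed distributions.
fn rice_param(values: &[u64]) -> u32 {
    assert!(!values.is_empty(), "need at least one value");
    let mean = values.iter().sum::<u64>() / values.len() as u64;
    if mean <= 1 {
        0
    } else {
        63 - mean.leading_zeros() // floor(log2(mean))
    }
}

fn main() {
    // Mean of [8, 10, 14, 16] is 12, and floor(log2(12)) = 3.
    assert_eq!(rice_param(&[8, 10, 14, 16]), 3);
    println!("{}", rice_param(&[8, 10, 14, 16])); // prints "3"
}
```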
For codecs that require extra parameters, we can create a compressed intvec with the `from_with_param` method:

```rust
// NOTE: exact module paths, argument order, and signatures reconstructed
// from context; check the crate documentation.
use compressed_intvec::BEIntVec;
use compressed_intvec::codecs::RiceCodec;

let vec = vec![10, 20, 30, 40, 50]; // example data
let rice_param = 3; // for example
let sampling_param = 2;

let compressed =
    BEIntVec::<RiceCodec>::from_with_param(&vec, sampling_param, rice_param).unwrap();
```
Choosing the right codec is crucial for achieving optimal compression. The efficiency of a codec is highly dependent on the underlying data distribution. For example, Rice coding is usually effective for skewed distributions, while Minimal Binary coding is optimal for uniformly distributed data.
## Endianness

Choose `BEIntVec` or `LEIntVec` based on data-interoperability needs. Performance is equivalent; endianness only affects the byte order of the compressed storage.
## Memory Analysis (and why choosing the right codec is important)

Both the little-endian and big-endian versions of the compressed intvec implement the `MemDbg`/`MemSize` traits from the `mem_dbg` crate, so their memory layout can be inspected with `mem_dbg(DbgFlags::empty())`.

Consider a vector of 1000 `u64` values uniformly distributed in the range `[0, u64::MAX)`. The size of the vector before compression is measured as follows:

```rust
use mem_dbg::{DbgFlags, MemDbg};

// `generate_uniform_vec` is a helper from the benchmarks
// (its exact signature is assumed here).
let input_vec: Vec<u64> = generate_uniform_vec(1000, u64::MAX);

println!("Size of the standard Vec<u64>:");
input_vec.mem_dbg(DbgFlags::empty()).unwrap();
```
This outputs:

```text
Size of the standard Vec<u64>:
8024 B ⏺
```
Next, we compress the vector with both `MinimalBinaryCodec` and `DeltaCodec`, using a sampling parameter of `32` for both:

```rust
// NOTE: exact signatures reconstructed from context; MinimalBinaryCodec
// additionally takes its upper bound as a runtime parameter.
let minimal_intvec =
    BEIntVec::<MinimalBinaryCodec>::from_with_param(&input_vec, 32, u64::MAX).unwrap();
let delta_intvec = BEIntVec::<DeltaCodec>::from(&input_vec, 32);

println!("Size of the compressed IntVec with MinimalBinaryCodec:");
minimal_intvec.mem_dbg(DbgFlags::empty()).unwrap();

println!("Size of the compressed IntVec with DeltaCodec:");
delta_intvec.mem_dbg(DbgFlags::empty()).unwrap();
```
The compression results are:

```text
Size of the compressed IntVec with MinimalBinaryCodec:
 832 B ⏺
 528 B ├╴data
 280 B ├╴samples
   0 B ├╴codec
   8 B ├╴k
   8 B ├╴len
   8 B ├╴codec_param
   0 B ╰╴endian

Size of the compressed IntVec with DeltaCodec:
9584 B ⏺
9288 B ├╴data
 280 B ├╴samples
   0 B ├╴codec
   8 B ├╴k
   8 B ├╴len
   0 B ├╴codec_param
   0 B ╰╴endian
```
As shown, `MinimalBinaryCodec` is the most efficient for uniformly distributed data, reducing the size to 832 bytes, an order of magnitude smaller than the original vector. In contrast, `DeltaCodec` actually increases the size to 9584 bytes, as it is poorly suited for uniform distributions.
## Benchmarks

Benchmarks are provided to evaluate both the random-access speed and the space occupancy of compressed vectors.

- **Random Access Benchmarks**: measure the time required to access individual elements.
- **Space Occupancy Benchmarks**: report the memory footprint of various codec configurations and compressed representations.
To run the benchmarks, execute `cargo bench`. The results are printed to the terminal and also written to CSV files (e.g. `benchmark_space.csv`).
### Space Occupancy

In the first graph, the input is a vector of 10k random elements uniformly distributed in the range `[0, 10k)`. Here, all codecs outperform the standard vector in terms of space occupancy, but `MinimalBinaryCodec` clearly wins, as it is specifically designed for this type of distribution. The other codecs also perform well because the range is small.
In the second graph, the input is the same vector, but the range is `[0, u32::MAX)` (with the values stored as `u64`). Here, all codecs start to perform poorly except `MinimalBinaryCodec`, which remains the best. In particular, codecs like Gamma perform worse than the standard vector.

If we were to increase the range even further, all codecs except `MinimalBinaryCodec` would perform worse than the standard vector.
### Random Access

Even though access into this compressed integer vector is $O(1)$ in theory, we cannot expect it to be as fast as a standard vector. Performance is affected by the codec used, the distribution of the data, and the sampling parameter. The benchmarks show, however, that performance remains quite good, even for large vectors. A sample rate such as `k = 32` seems to be a good trade-off between memory and speed.
## License

This library is licensed under the Apache License, Version 2.0. See the LICENSE file for more details.