Expand description
§Compressed IntVec Library
A Rust implementation of compressed vectors of unsigned 64-bit
integers using instantaneous codes. The key structure, IntVec
,
compresses the data into a bitstream while storing sample offsets for fast random access. The bitstream
and the codes are provided by the library dsi-bitstream.
§Features
- Multiple Codecs: Use various encoding schemes such as Gamma, Delta, Exp‑Golomb,
Zeta, Rice, and their parameterized variants. All codecs implement the
Codec
trait. - Endianness Support: Dedicated types are available for both big‑endian and little‑endian
representations. See
BEIntVec
andLEIntVec
. - Memory Debugging: Integration with the
mem_dbg
crate helps in estimating the memory footprint of compressed data structures.
§Modules
codecs
(src/codecs.rs): Contains the genericCodec
trait and its implementations for various coding schemes. Each codec defines both encoding and decoding functionalities.intvec
(src/intvec.rs): Implements the mainIntVec
structure, along with big‑endian and little‑endian variants, constructors, element retrieval, full decoding, and iterator support.
§Usage Example
To compress a vector of integers using gamma coding in big‑endian format:
use compressed_intvec::intvec::BEIntVec;
use compressed_intvec::codecs::GammaCodec;
let input = vec![1, 2, 3, 4, 5];
// Create a compressed vector with a sampling period of 2 (every 2nd value is stored as a sample)
let intvec = BEIntVec::<GammaCodec>::from(&input, 2);
// Random access: decode the 4th element
assert_eq!(intvec.get(3), Some(4));
// Decode the full original vector (note: this may be expensive)
assert_eq!(intvec.into_vec(), input);
For little‑endian encoding, use LEIntVec
similarly:
use compressed_intvec::intvec::LEIntVec;
use compressed_intvec::codecs::GammaCodec;
let input = vec![10, 20, 30, 40];
let intvec = LEIntVec::<GammaCodec>::from(&input, 2);
assert_eq!(intvec.get(2), Some(30));
§Codecs Overview
The codecs
module provides implementations of the Codec
trait using different encoding algorithms:
- GammaCodec: Uses gamma coding without requiring extra runtime parameters.
- DeltaCodec: Uses delta coding, likewise without extra parameters.
- ExpGolombCodec: Requires an extra parameter (e.g. a parameter
k
) for encoding/decoding. - ZetaCodec: Uses additional runtime parameters ζ.
- RiceCodec: Uses a Rice parameter for encoding/decoding. Ideal for skewed distributions, you likely want to the set the rice prameter as the floor of the base‑2 logarithm of the mean value of the data.
- MinimalBinaryCodec: A minimal binary code with upper bound
u > 0
(truncated binary encoding). This is optimal for uniformly distributed data in the range [0, u). - Parameterized Codecs: Variants like
ParamZetaCodec
,ParamDeltaCodec
, andParamGammaCodec
use compile‑time flags (e.g. lookup table usage) to control internal behavior. If you choose to use this, consider that the tables are hardcoded in the library dsi-dsi-bitstream (in the file_tables.rs
). If you want to change the tables, you need to clone the repository and change the tables with the python scriptgen_code_tables.py
and then compile the library. See
Theoretical Consideration: The efficiency of each codec is closely tied to the data distribution.
- Skewed Distributions: When the data is skewed, Rice coding generally outperforms Gamma coding. In these cases, choose the Rice parameter as the floor of the base‑2 logarithm of the mean value.
- Power Law Distributions: If the data roughly follows a power law (i.e. P(x) ∝ x⁻²), Gamma coding is often the optimal choice.
- Uniform Distributions: For data uniformly distributed across the range [0, u64::MAX), minimal binary coding is the most efficient.
For more in‐depth information, please consult the literature on entropy coding.
§The IntVec
Structure
The IntVec
type encapsulates:
- Compressed Data: A
Vec<u64>
containing the compressed bitstream. - Sampling Offsets: A
Vec<usize>
tracking bit offsets for every k‑th value to allow fast random access. - Codec Parameters: Any additional parameters required by the selected codec.
The main functionalities include:
- Construction: Use
from_with_param
(or the convenience methodfrom
for codecs without runtime parameters) to build a compressed vector. - Random Access: Retrieve a value using
get(index)
by decoding only the required part of the stream. - Full Decoding: Convert back to a standard vector with
into_vec()
. This operation decodes every value. - Iteration: The
iter()
method returns an iterator that decodes values on the fly.