Codec selection for variable-length integer compression.
This module defines Codec, an enum for controlling the
compression strategy of a VarVec. The choice of codec is a critical
performance parameter, as its effectiveness depends on the statistical
properties of the data being compressed.
§Codec Selection Strategy
Codec selection is performed by a statistical analysis of the entire input dataset at construction time.
The Codec enum provides several ways to specify the compression method:
- Explicit Specification: A specific codec and all its parameters are provided. This is suitable when the data characteristics are known in advance.
use compressed_intvec::prelude::*;
let data: &[u32] = &(0..1000).collect::<Vec<_>>();
// Explicitly specify a non-parametric codec
let delta_vec: UVarVec<u32> = VarVec::builder()
    .codec(Codec::Delta)
    .k(16)
    .build(&data)?;
// Explicitly specify a parametric codec with a fixed parameter
let zeta_vec: UVarVec<u32> = VarVec::builder()
    .codec(Codec::Zeta { k: Some(3) })
    .build(&data)?;
- Automatic Parameter Estimation: A specific codec family is chosen, but
the optimal parameter is determined by the builder based on a full data
analysis. This is achieved by providing None as the parameter value.
  - Example: Rice { log2_b: None } will find the best log2_b for the given data.
  - Example:
use compressed_intvec::prelude::*;
let data: &[u32] = &(0..1000).collect::<Vec<_>>();
// Automatically select the best Rice parameter
let rice_vec: UVarVec<u32> = VarVec::builder()
    .codec(Codec::Rice { log2_b: None })
    .build(&data)?;
- Fully Automatic Selection: The builder analyzes the data against all
available codecs and their standard parameter ranges to find the single
best configuration. This is activated by using Codec::Auto.
use compressed_intvec::prelude::*;
let data: &[u32] = &(0..1000).collect::<Vec<_>>();
// Automatically select the best codec and parameters for the data
let auto_vec: UVarVec<u32> = VarVec::builder()
    .codec(Codec::Auto)
    .build(&data)?;
§Analysis Mechanism
The selection logic uses the CodesStats utility from the dsi-bitstream
crate. For a given sequence of integers, CodesStats calculates the exact
total bit cost for encoding the sequence with a wide range of instantaneous
codes and their common parameterizations.
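As a rough illustration of what "exact total bit cost" means, the sketch below sums per-value code lengths for three codes, using their standard length formulas, and returns the cheapest. It uses only the standard library and is not the CodesStats implementation. The function names (gamma_len, delta_len, rice_len, cheapest_code) and the fixed Rice parameter are assumptions made for this example. Values are coded as x + 1 so that 0 is representable, which is assumed here to match the usual convention.

```rust
/// Bits used by the Elias gamma code of `x` (coded as x + 1; assumes x < u64::MAX).
fn gamma_len(x: u64) -> u64 {
    2 * u64::from((x + 1).ilog2()) + 1
}

/// Bits used by the Elias delta code of `x`: the bit length of x + 1 is
/// gamma-coded, followed by the remaining low bits.
fn delta_len(x: u64) -> u64 {
    let l = u64::from((x + 1).ilog2());
    l + gamma_len(l)
}

/// Bits used by the Rice code of `x` with divisor 2^log2_b (assumes log2_b < 64):
/// the quotient in unary (q + 1 bits) plus log2_b remainder bits.
fn rice_len(x: u64, log2_b: u32) -> u64 {
    (x >> log2_b) + 1 + u64::from(log2_b)
}

/// Returns the name and total bit cost of the cheapest candidate code.
/// On ties, the first candidate in the list wins.
fn cheapest_code(data: &[u64]) -> (&'static str, u64) {
    let gamma: u64 = data.iter().map(|&x| gamma_len(x)).sum();
    let delta: u64 = data.iter().map(|&x| delta_len(x)).sum();
    let rice4: u64 = data.iter().map(|&x| rice_len(x, 4)).sum();
    [("gamma", gamma), ("delta", delta), ("rice(log2_b=4)", rice4)]
        .iter()
        .copied()
        .min_by_key(|&(_, bits)| bits)
        .expect("candidate list is non-empty")
}
```

For the input [0, 1, 2, 3] the totals are 12 bits for gamma, 14 for delta, and 20 for Rice with log2_b = 4, so gamma is selected. For three copies of 1000, delta is cheapest at 48 bits.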
§Construction Overhead
The full-dataset analysis has a one-time computational cost at construction.
The complexity is O(N * C), where N is the number of elements in the
input and C is the number of codec configurations tested by CodesStats
(approximately 70).
This trade-off is suitable for read-heavy workloads where a higher initial cost is acceptable for better compression and subsequent read performance.
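The O(N * C) cost can be pictured as a single pass over the data that updates one running total per configuration. The sketch below shows that shape for Rice codes only. NUM_PARAMS is an arbitrary stand-in for C, and rice_costs is a hypothetical helper, not part of dsi-bitstream or this crate.

```rust
/// Number of Rice parameters tried; a hypothetical stand-in for C.
const NUM_PARAMS: usize = 8;

/// Walks the data once and accumulates the total bit cost under Rice codes
/// with log2_b = 0, 1, ..., NUM_PARAMS - 1.
/// Also returns the number of length evaluations performed, which is
/// data.len() * NUM_PARAMS.
fn rice_costs(data: &[u64]) -> ([u64; NUM_PARAMS], usize) {
    let mut totals = [0u64; NUM_PARAMS];
    let mut evaluations = 0usize;
    for &x in data {
        for (log2_b, total) in totals.iter_mut().enumerate() {
            // Quotient in unary (q + 1 bits) plus log2_b remainder bits.
            *total += (x >> log2_b) + 1 + log2_b as u64;
            evaluations += 1;
        }
    }
    (totals, evaluations)
}
```

Each element contributes a constant amount of work per configuration, so doubling either the input length or the number of configurations doubles the analysis time.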
§Implementation Notes
- The parameter ranges for codecs like Zeta and Rice are defined by the
const generics of the CodesStats struct in dsi-bitstream. The default values cover common and effective parameter ranges.
- If a data distribution benefits from a parameter outside of the tested
range (e.g., Zeta with k=20), it must be specified explicitly in the builder via .codec(Codec::Zeta { k: Some(20) }).
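The interaction between a bounded search and an explicit parameter can be sketched with a toy model. Rice is used instead of Zeta only because its length formula is a one-liner. TESTED_PARAMS, total_rice_bits, and resolve_log2_b are hypothetical names for this illustration, and the range 0..8 is not the real CodesStats range.

```rust
/// Hypothetical tested range: log2_b in 0..TESTED_PARAMS.
const TESTED_PARAMS: u32 = 8;

/// Total bits for `data` under a Rice code with divisor 2^log2_b
/// (assumes log2_b < 64).
fn total_rice_bits(data: &[u64], log2_b: u32) -> u64 {
    data.iter()
        .map(|&x| (x >> log2_b) + 1 + u64::from(log2_b))
        .sum()
}

/// `Some(p)` is taken as given, even if it lies outside the tested range.
/// `None` searches the tested range only and returns its cheapest parameter.
fn resolve_log2_b(requested: Option<u32>, data: &[u64]) -> u32 {
    match requested {
        Some(p) => p,
        None => (0..TESTED_PARAMS)
            .min_by_key(|&p| total_rice_bits(data, p))
            .expect("tested range is non-empty"),
    }
}
```

With two values of 1000, the search in this model settles on log2_b = 7 (30 bits in total), while an explicitly requested log2_b = 9 costs 22 bits. The search cannot find that value because it never evaluates it.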
Enums§
Type Aliases§
- VariableCodecSpec (Deprecated): Deprecated alias for Codec. Use Codec instead.