Module codec

Codec selection for variable-length integer compression.

This module defines VariableCodecSpec, an enum for controlling the compression strategy of an IntVec. The choice of codec is a critical performance parameter, as its effectiveness depends on the statistical properties of the data being compressed.

§Codec Selection Strategy

Codec selection is performed by a statistical analysis of the entire input dataset at construction time.

The VariableCodecSpec enum provides several ways to specify the compression method:

  1. Explicit Specification: A specific codec and all its parameters are provided. This is suitable when the data characteristics are known in advance.
    • Non-parametric examples: Gamma, Delta.
    • Parametric example: Zeta { k: Some(3) }.
use compressed_intvec::prelude::*;
 
let data: &[u32] = &(0..1000).collect::<Vec<_>>(); 
 
// Explicitly specify a non-parametric codec
let delta_vec: UIntVec<u32> = IntVec::builder()
    .codec(VariableCodecSpec::Delta)
    .k(16) // sampling rate of the IntVec, unrelated to Zeta's k parameter
    .build(&data)
    .unwrap();
  
// Explicitly specify a parametric codec with a fixed parameter
let zeta_vec: UIntVec<u32> = IntVec::builder()
    .codec(VariableCodecSpec::Zeta { k: Some(3) })
    .build(&data)
    .unwrap();
  2. Automatic Parameter Estimation: A specific codec family is chosen, but the optimal parameter is determined by the builder based on a full data analysis. This is achieved by providing None as the parameter value.
    • Example: Rice { log2_b: None } will find the best log2_b for the given data.
use compressed_intvec::prelude::*;

let data: &[u32] = &(0..1000).collect::<Vec<_>>();

// Automatically select the best Rice parameter
let rice_vec: UIntVec<u32> = IntVec::builder()
    .codec(VariableCodecSpec::Rice { log2_b: None })
    .build(&data)
    .unwrap();
  3. Fully Automatic Selection: The builder analyzes the data against all available codecs and their standard parameter ranges to find the single best configuration. This is activated by using VariableCodecSpec::Auto.
use compressed_intvec::prelude::*;
 
let data: &[u32] = &(0..1000).collect::<Vec<_>>();
// Automatically select the best codec and parameters for the data
let auto_vec: UIntVec<u32> = IntVec::builder()
    .codec(VariableCodecSpec::Auto)
    .build(&data)
    .unwrap();
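
The three modes above can be pictured as a single resolution step that turns a specification into a concrete codec. The following is a minimal sketch under stated assumptions, not the crate's implementation: ToySpec, ToyCodec, resolve, and the MAX_LOG2_B bound are hypothetical names, only Gamma and Rice are modeled, and the code lengths are the textbook formulas for values n >= 0.

```rust
/// Toy stand-in for a codec specification (hypothetical, not the crate's enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToySpec {
    Gamma,
    Rice { log2_b: Option<u32> },
    Auto,
}

/// A fully resolved codec: every parameter is fixed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToyCodec {
    Gamma,
    Rice { log2_b: u32 },
}

/// Hypothetical upper bound of the Rice parameter search.
pub const MAX_LOG2_B: u32 = 10;

/// Bit length of the Elias gamma code for n >= 0 (encodes n + 1).
pub fn gamma_len(n: u64) -> u64 {
    let lambda = 63 - (n + 1).leading_zeros() as u64; // floor(log2(n + 1))
    2 * lambda + 1
}

/// Bit length of the Rice code for n >= 0: unary quotient plus log2_b remainder bits.
pub fn rice_len(n: u64, log2_b: u32) -> u64 {
    (n >> log2_b) + 1 + log2_b as u64
}

/// Exact number of bits needed to encode `data` with `codec`.
pub fn total_bits(data: &[u32], codec: ToyCodec) -> u64 {
    data.iter()
        .map(|&x| match codec {
            ToyCodec::Gamma => gamma_len(x as u64),
            ToyCodec::Rice { log2_b } => rice_len(x as u64, log2_b),
        })
        .sum()
}

/// Brute-force search for the cheapest Rice parameter in 0..=MAX_LOG2_B.
pub fn best_rice(data: &[u32]) -> ToyCodec {
    (0..=MAX_LOG2_B)
        .map(|b| ToyCodec::Rice { log2_b: b })
        .min_by_key(|&c| total_bits(data, c))
        .unwrap()
}

/// Resolve a specification into a concrete codec.
pub fn resolve(spec: ToySpec, data: &[u32]) -> ToyCodec {
    match spec {
        // Explicit specification: nothing to analyze.
        ToySpec::Gamma => ToyCodec::Gamma,
        ToySpec::Rice { log2_b: Some(b) } => ToyCodec::Rice { log2_b: b },
        // Automatic parameter estimation within one family.
        ToySpec::Rice { log2_b: None } => best_rice(data),
        // Fully automatic: compare every modeled family.
        ToySpec::Auto => {
            let rice = best_rice(data);
            if total_bits(data, ToyCodec::Gamma) <= total_bits(data, rice) {
                ToyCodec::Gamma
            } else {
                rice
            }
        }
    }
}
```

In this sketch, resolve(ToySpec::Rice { log2_b: None }, &[100; 8]) settles on log2_b = 6, while an explicit Some(3) is passed through untouched even though it costs twice as many bits per element.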

§Analysis Mechanism

The selection logic uses the CodesStats utility from the dsi-bitstream crate. For a given sequence of integers, CodesStats calculates the exact total bit cost for encoding the sequence with a wide range of instantaneous codes and their common parameterizations.
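
To make the idea of exact bit-cost accounting concrete, here is a toy accumulator that keeps one running total per candidate code and updates all of them for each value. It is only an illustration: ToyStats and its methods are hypothetical and do not reproduce the CodesStats API, and only Gamma, Delta, and Rice with log2_b in 0..=3 are tracked.

```rust
/// floor(log2(x)) for x >= 1.
fn floor_log2(x: u64) -> u64 {
    63 - x.leading_zeros() as u64
}

/// Toy per-code bit counters (hypothetical; not the dsi-bitstream API).
pub struct ToyStats {
    pub gamma: u64,
    pub delta: u64,
    /// rice[b] is the total cost of Rice coding with log2_b = b.
    pub rice: [u64; 4],
}

impl ToyStats {
    pub fn new() -> Self {
        ToyStats { gamma: 0, delta: 0, rice: [0; 4] }
    }

    /// Add the exact code length of `n` (n >= 0) to every counter.
    pub fn update(&mut self, n: u64) {
        let lambda = floor_log2(n + 1);
        // Gamma: lambda zeros, then the lambda + 1 bits of n + 1.
        self.gamma += 2 * lambda + 1;
        // Delta: gamma code of lambda, then the lambda low bits of n + 1.
        self.delta += lambda + 2 * floor_log2(lambda + 1) + 1;
        // Rice: unary quotient (q + 1 bits) followed by b remainder bits.
        for (b, total) in self.rice.iter_mut().enumerate() {
            *total += (n >> b) + 1 + b as u64;
        }
    }

    /// Name and total cost of the cheapest candidate (first one wins on ties).
    pub fn best(&self) -> (String, u64) {
        let mut candidates = vec![
            ("Gamma".to_string(), self.gamma),
            ("Delta".to_string(), self.delta),
        ];
        for (b, &total) in self.rice.iter().enumerate() {
            candidates.push((format!("Rice(log2_b={})", b), total));
        }
        candidates.into_iter().min_by_key(|(_, bits)| *bits).unwrap()
    }
}
```

Feeding the values 0 through 7 into update gives totals of 34 bits for Gamma, 37 for Delta, and 36, 28, 28, 32 for the four Rice parameters, so best() reports Rice with log2_b = 1 at 28 bits.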

§Construction Overhead

The full-dataset analysis has a one-time computational cost at construction. The complexity is O(N * C), where N is the number of elements in the input and C is the number of codec configurations tested by CodesStats (approximately 70).

This trade-off is suitable for read-heavy workloads where a higher initial cost is acceptable for better compression and subsequent read performance.

§Implementation Notes

  • The parameter ranges for codecs like Zeta and Rice are defined by the const generics of the CodesStats struct in dsi-bitstream. The default values cover common and effective parameter ranges.
  • If a data distribution benefits from a parameter outside of the tested range (e.g., Zeta with k=20), it must be specified explicitly in the builder via .codec(VariableCodecSpec::Zeta { k: Some(20) }).
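
The effect of a bounded search range can be sketched with a simple brute-force search. Rice coding is used here only because its length formula is short; the same reasoning applies to Zeta. The function names and the bound of 10 are hypothetical and are not taken from dsi-bitstream.

```rust
/// Exact total bits for Rice coding `data` with the given log2_b.
pub fn rice_total(data: &[u32], log2_b: u32) -> u64 {
    data.iter()
        .map(|&x| ((x as u64) >> log2_b) + 1 + log2_b as u64)
        .sum()
}

/// Cheapest log2_b in 0..=max_log2_b (the smallest one wins on ties).
pub fn best_log2_b(data: &[u32], max_log2_b: u32) -> u32 {
    (0..=max_log2_b)
        .min_by_key(|&b| rice_total(data, b))
        .unwrap()
}
```

For four values of 1_000_000, a search limited to 0..=10 returns the boundary value 10 at 3948 bits, whereas widening the range to 0..=31 finds log2_b = 19 at 84 bits. A result that sits on the edge of the tested range is a hint that an explicit parameter may be worth trying.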

Enums§

VariableCodecSpec
Specifies the compression codec and its parameters for an IntVec.