Module variable

Source
Expand description

A compressed integer vector using variable-length encoding with fast random access.

This module provides IntVec, a data structure for storing sequences of integers in a compressed format while retaining efficient random access. It is well-suited for datasets where integer values are non-uniformly distributed, as it uses instantaneous variable-length codes to represent smaller numbers with fewer bits.

§Core Concepts

§Variable-Length Encoding

Unlike FixedVec, which uses a fixed number of bits for every integer, IntVec employs instantaneous codes (such as Gamma, Delta, or Rice codes) provided by the dsi-bitstream crate. This approach allows each integer to be encoded with a variable number of bits, typically using shorter codes for smaller or more frequent values. This can lead to significant space savings, especially for data with many small numbers.

Signed integers (e.g., i8, i32) are supported transparently through zig-zag encoding, which maps small negative and positive integers to small unsigned integers.

§Random Access and Sampling

A key challenge with variable-length codes is that the location of the i-th element cannot be calculated directly. To solve this, IntVec implements a sampling mechanism. It stores the bit position of every k-th element in a separate, auxiliary FixedVec. This parameter, k, is the sampling rate.

To access an element at index, IntVec:

  1. Finds the nearest sample by calculating index / k.
  2. Retrieves the bit offset of the start of that sampled block.
  3. Jumps to that offset in the compressed data stream.
  4. Sequentially decodes the remaining index % k elements to reach the target.

This strategy provides amortized O(1) random access, as the number of sequential decoding steps is bounded by k.

§The k Trade-off

The choice of the sampling rate k involves a trade-off:

  • Smaller k: Faster random access (fewer elements to decode per access) but higher memory usage due to a larger samples table.
  • Larger k: Slower random access but better compression, as the samples table is smaller.

The optimal k depends on the specific access patterns of an application.

§Design and Immutability

IntVec is immutable after creation. Unlike FixedVec, it does not provide methods for in-place modification (e.g., set, push).

If a value in the middle of the compressed bitstream were changed, its new encoded length might be different. For example, changing 5 (which might be 4 bits) to 5000 (which might be 16 bits) would require shifting all subsequent data, invalidating every sample point that follows. The cost of such an operation would be prohibitive, scaling with the length of the vector.

For this reason, IntVec is designed as a write-once, read-many data structure.

§Access Strategies and Readers

IntVec provides multiple interfaces for accessing data, each optimized for a different pattern of use.

  • get(): For single, infrequent lookups. Each call creates and discards an internal reader, which incurs overhead if used in a loop.

  • get_many(): The most efficient method for retrieving a batch of elements when all indices are known beforehand. It sorts the indices to perform a single, monotonic scan over the data, minimizing redundant decoding and seek operations.

  • IntVecReader (Reusable Stateless Reader): This reader is created via IntVec::reader(). It maintains an internal, reusable bitstream reader, amortizing its setup cost over multiple calls. Each get() call is stateless with respect to position: it performs a full seek to the nearest sample and decodes forward from there, independently of previous calls. It is best suited for true random access patterns where indices are sparse and unpredictable.

  • IntVecSeqReader (Stateful Sequential Reader): This reader, created via IntVec::seq_reader(), is a stateful object optimized for access patterns with high locality. It maintains an internal cursor of the current decoding position.

    • Fast Path: If a requested index is at or after the cursor’s current position and within the same sample block, the reader simply decodes forward from its last position, avoiding a costly seek operation.
    • Fallback Path: If the requested index is before the cursor or in a different sample block, the reader falls back to the standard behavior of seeking to the nearest sample and decoding from there. This makes it very efficient for iterating through sorted or clustered indices.

§Main Components

  • IntVec: The core compressed vector.
  • IntVecBuilder: The primary tool for constructing an IntVec with custom compression codecs and sampling rates.
  • VariableCodecSpec: An enum to specify the compression codec.
  • IntVecReader: A reusable, stateless reader for efficient random access.
  • IntVecSeqReader: A stateful reader optimized for sequential or localized access patterns.
  • IntVecSlice: An immutable, zero-copy view over a portion of the vector.

§Examples

§Basic Usage with Unsigned Integers

Create a UIntVec (an alias for IntVec<u32, LE>) from a slice of u32. The builder will automatically select a suitable codec and use a default sampling rate.

use compressed_intvec::variable::{IntVec, UIntVec};

let data: Vec<u32> = vec![100, 200, 300, 1024];
let vec: UIntVec<u32> = IntVec::from_slice(&data).unwrap();

assert_eq!(vec.len(), 4);
// Accessing an element
assert_eq!(vec.get(1), Some(200));

§Storing Signed Integers

IntVec handles signed integers, such as i16, by mapping them to unsigned values using zig-zag encoding.

use compressed_intvec::variable::{IntVec, SIntVec};

let data: &[i16] = &[-5, 20, -100, 0, 8];
let vec: SIntVec<i16> = IntVec::from_slice(data).unwrap();

assert_eq!(vec.len(), 5);
assert_eq!(vec.get(0), Some(-5));
assert_eq!(vec.get(2), Some(-100));

§Manual Codec and Sampling Rate

For fine-grained control, use the IntVecBuilder. Here, we specify a sampling rate of k=8 and use the Zeta code with k=3.

use compressed_intvec::variable::{IntVec, UIntVec, VariableCodecSpec};

let data: Vec<u64> = (0..100).map(|i| i * i).collect();

let vec: UIntVec<u64> = IntVec::builder()
    .k(8) // Set sampling rate
    .codec(VariableCodecSpec::Zeta { k: Some(3) }) // Set compression codec
    .build(&data)
    .unwrap();

assert_eq!(vec.get_sampling_rate(), 8);
assert_eq!(vec.get(10), Some(100));

Best performance is achieved when the sampling rate k is a power of two. Usually a value of 32 or 16 is a good trade-off between speed and compression ratio.

§Codec Selection and Performance

The choice of compression codec is critical for performance and space efficiency. IntVecBuilder offers automatic codec selection via VariableCodecSpec::Auto. When enabled, the builder analyzes the entire input dataset to find the codec that offers the best compression ratio.

This analysis involves calculating the compressed size for the data with approximately 70 different codec configurations. This process introduces a significant, one-time construction overhead.

Use Auto for read-heavy workloads where the IntVec is built once and then accessed many times. The initial cost is easily amortized by the long-term space savings.

If your application creates many small IntVecs or accesses them frequently, the repeated cost of analysis can become a performance bottleneck. In such scenarios, it is better to explicitly specify a codec (e.g., VariableCodecSpec::Gamma or VariableCodecSpec::Delta) that is known to be a good general-purpose choice for your data.

use compressed_intvec::prelude::*;

let data: Vec<u32> = (0..100).collect();

// Create an IntVec with automatic codec selection
let vec: UIntVec<u32> = IntVec::builder()
    .build(&data)
    .unwrap();

Re-exports§

pub use self::codec::VariableCodecSpec;
pub use self::traits::Storable;
pub use builder::IntVecBuilder;
pub use builder::IntVecFromIterBuilder;
pub use reader::IntVecReader;
pub use seq_reader::IntVecSeqReader;
pub use slice::IntVecSlice;

Modules§

builder
Builders for constructing an IntVec.
codec
Codec selection for variable-length integer compression.
iter
Iterators for IntVec.
reader
A reader for efficient, repeated random access into an IntVec.
seq_reader
A stateful reader for efficient, sequential access into an IntVec.
slice
An immutable, zero-copy slice of an IntVec.
traits
Core traits for the variable module.

Structs§

IntVec
A compressed, randomly accessible vector of integers using variable-length encoding.

Enums§

IntVecError
Defines the set of errors that can occur in IntVec operations.

Type Aliases§

BEIntVec
An IntVec for u64 elements with Big-Endian bit layout.
BESIntVec
An IntVec for i64 elements with Big-Endian bit layout.
LEIntVec
An IntVec for u64 elements with Little-Endian bit layout.
LESIntVec
An IntVec for i64 elements with Little-Endian bit layout.
SIntVec
An IntVec for signed integers with Little-Endian bit layout.
UIntVec
An IntVec for unsigned integers with Little-Endian bit layout.