Skip to main content

Module variable

Module variable 

Source
Expand description

A compressed integer vector using variable-length encoding with fast random access.

This module provides VarVec, a data structure for storing sequences of integers in a compressed format while retaining efficient random access. It is well-suited for datasets where integer values are non-uniformly distributed, as it uses instantaneous variable-length codes to represent smaller numbers with fewer bits.

§Core Concepts

§Variable-Length Encoding

Unlike FixedVec, which uses a fixed number of bits for every integer, VarVec employs instantaneous codes (such as Gamma, Delta, or Rice codes) provided by the dsi-bitstream crate. This approach allows each integer to be encoded with a variable number of bits, typically using shorter codes for smaller or more frequent values. This can lead to significant space savings, especially for data with many small numbers.

Signed integers (e.g., i8, i32) are supported transparently through zig-zag encoding, which maps small negative and positive integers to small unsigned integers.

§Random Access and Sampling

A key challenge with variable-length codes is that the location of the i-th element cannot be calculated directly. To solve this, VarVec implements a sampling mechanism. It stores the bit position of every k-th element in a separate, auxiliary FixedVec. This parameter, k, is the sampling rate.

To access an element at index, VarVec:

  1. Finds the nearest sample by calculating index / k.
  2. Retrieves the bit offset of the start of that sampled block.
  3. Jumps to that offset in the compressed data stream.
  4. Sequentially decodes the remaining index % k elements to reach the target.

This strategy provides amortized O(1) random access, as the number of sequential decoding steps is bounded by k.

§The k Trade-off

The choice of the sampling rate k involves a trade-off:

  • Smaller k: Faster random access (fewer elements to decode per access) but higher memory usage due to a larger samples table.
  • Larger k: Slower random access but better compression, as the samples table is smaller.

The optimal k depends on the specific access patterns of an application.

§Design and Immutability

VarVec is immutable after creation. Unlike FixedVec, it does not provide methods for in-place modification (e.g., set, push).

If a value in the middle of the compressed bitstream were changed, its new encoded length might be different. For example, changing 5 (which might be 4 bits) to 5000 (which might be 16 bits) would require shifting all subsequent data, invalidating every sample point that follows. The cost of such an operation would be prohibitive, scaling with the length of the vector.

For this reason, VarVec is designed as a write-once, read-many data structure.

§Access Strategies and Readers

VarVec provides multiple interfaces for accessing data, each optimized for a different pattern of use.

  • get(): For single, infrequent lookups. Each call creates and discards an internal reader, which incurs overhead if used in a loop.

  • get_many(): The most efficient method for retrieving a batch of elements when all indices are known beforehand. It sorts the indices to perform a single, monotonic scan over the data, minimizing redundant decoding and seek operations.

  • VarVecReader (Reusable Stateless Reader): This reader is created via VarVec::reader(). It maintains an internal, reusable bitstream reader, amortizing its setup cost over multiple calls. Each get() call is stateless with respect to position: it performs a full seek to the nearest sample and decodes forward from there, independently of previous calls. It is best suited for true random access patterns where indices are sparse and unpredictable.

  • VarVecSeqReader (Stateful Sequential Reader): This reader, created via VarVec::seq_reader(), is a stateful object optimized for access patterns with high locality. It maintains an internal cursor of the current decoding position.

    • Fast Path: If a requested index is at or after the cursor’s current position and within the same sample block, the reader simply decodes forward from its last position, avoiding a costly seek operation.
    • Fallback Path: If the requested index is before the cursor or in a different sample block, the reader falls back to the standard behavior of seeking to the nearest sample and decoding from there. This makes it very efficient for iterating through sorted or clustered indices.

§Migration Notice

As of version 0.6.0, all IntVec* types have been renamed to VarVec* for consistency with the module naming convention (fixedFixedVec, seqSeqVec, variableVarVec). The old names remain available as deprecated type aliases and will continue to work, but using the new names is recommended.

§Main Components

  • VarVec: The core compressed vector.
  • VarVecBuilder: The primary tool for constructing an VarVec with custom compression codecs and sampling rates.
  • Codec: An enum to specify the compression codec.
  • VarVecReader: A reusable, stateless reader for efficient random access.
  • VarVecSeqReader: A stateful reader optimized for sequential or localized access patterns.
  • VarVecSlice: An immutable, zero-copy view over a portion of the vector.

§Examples

§Basic Usage with Unsigned Integers

Create a UVarVec (an alias for VarVec<u32, LE>) from a slice of u32. The builder will automatically select a suitable codec and use a default sampling rate.

use compressed_intvec::variable::{VarVec, UVarVec};

let data: Vec<u32> = vec![100, 200, 300, 1024];
let vec: UVarVec<u32> = VarVec::from_slice(&data)?;

assert_eq!(vec.len(), 4);
// Accessing an element
assert_eq!(vec.get(1), Some(200));

§Storing Signed Integers

VarVec handles signed integers, such as i16, by mapping them to unsigned values using zig-zag encoding.

use compressed_intvec::variable::{VarVec, SVarVec};

let data: &[i16] = &[-5, 20, -100, 0, 8];
let vec: SVarVec<i16> = VarVec::from_slice(data)?;

assert_eq!(vec.len(), 5);
assert_eq!(vec.get(0), Some(-5));
assert_eq!(vec.get(2), Some(-100));

§Manual Codec and Sampling Rate

For fine-grained control, use the VarVecBuilder. Here, we specify a sampling rate of k=8 and use the Zeta code with k=3.

use compressed_intvec::variable::{VarVec, UVarVec, Codec};

let data: Vec<u64> = (0..100).map(|i| i * i).collect();

let vec: UVarVec<u64> = VarVec::builder()
    .k(8) // Set sampling rate
    .codec(Codec::Zeta { k: Some(3) }) // Set compression codec
    .build(&data)?;

assert_eq!(vec.sampling_rate(), 8);
assert_eq!(vec.get(10), Some(100));

Best performance is achieved when the sampling rate k is a power of two. Usually a value of 32 or 16 is a good trade-off between speed and compression ratio.

§Codec Selection and Performance

The choice of compression codec is critical for performance and space efficiency. VarVecBuilder offers automatic codec selection via Codec::Auto. When enabled, the builder analyzes the entire input dataset to find the codec that offers the best compression ratio.

This analysis involves calculating the compressed size for the data with approximately 70 different codec configurations. This process introduces a significant, one-time construction overhead.

Use Auto for read-heavy workloads where the VarVec is built once and then accessed many times. The initial cost is easily amortized by the long-term space savings.

If your application creates many small VarVecs or accesses them frequently, the repeated cost of analysis can become a performance bottleneck. In such scenarios, it is better to explicitly specify a codec (e.g., Gamma or Delta) that is known to be a good general-purpose choice for your data.

use compressed_intvec::prelude::*;

let data: Vec<u32> = (0..100).collect();

// Create an VarVec with automatic codec selection
let vec: UVarVec<u32> = VarVec::builder()
    .build(&data)?;

Re-exports§

pub use self::codec::Codec;
pub use self::traits::Storable;
pub use builder::VarVecBuilder;
pub use builder::VarVecFromIterBuilder;
pub use reader::VarVecReader;
pub use seq_reader::VarVecSeqReader;
pub use slice::VarVecSlice;
pub use slice::VarVecSliceIter;
pub use self::codec::VariableCodecSpec;Deprecated

Modules§

builder
Builders for constructing an VarVec.
codec
Codec selection for variable-length integer compression.
iter
Iterators for VarVec.
reader
A reader for efficient, repeated random access into an VarVec.
seq_reader
A stateful reader for efficient, sequential access into an VarVec.
slice
An immutable, zero-copy slice of an VarVec.
traits
Core traits for the variable module.

Structs§

VarVec
A compressed, randomly accessible vector of integers using variable-length encoding.

Enums§

VarVecError
Defines the set of errors that can occur in VarVec operations.

Type Aliases§

BEIntVecDeprecated
Deprecated alias for BEVarVec. Use BEVarVec instead.
BESIntVecDeprecated
Deprecated alias for BESVarVec. Use BESVarVec instead.
BESVarVec
An VarVec for i64 elements with Big-Endian bit layout.
BEVarVec
An VarVec for u64 elements with Big-Endian bit layout.
IntVecDeprecated
Deprecated alias for VarVec. Use VarVec instead.
IntVecBuilderDeprecated
Deprecated alias for VarVecBuilder. Use VarVecBuilder instead.
IntVecErrorDeprecated
Deprecated alias for VarVecError. Use VarVecError instead.
IntVecFromIterBuilderDeprecated
Deprecated alias for VarVecFromIterBuilder. Use VarVecFromIterBuilder instead.
IntVecIntoIterDeprecated
Deprecated alias for VarVecIntoIter. Use VarVecIntoIter instead.
IntVecIterDeprecated
Deprecated alias for VarVecIter. Use VarVecIter instead.
IntVecReaderDeprecated
Deprecated alias for VarVecReader. Use VarVecReader instead.
IntVecSeqReaderDeprecated
Deprecated alias for VarVecSeqReader. Use VarVecSeqReader instead.
IntVecSliceDeprecated
Deprecated alias for VarVecSlice. Use VarVecSlice instead.
IntVecSliceIterDeprecated
Deprecated alias for VarVecSliceIter. Use VarVecSliceIter instead.
LEIntVecDeprecated
Deprecated alias for LEVarVec. Use LEVarVec instead.
LESIntVecDeprecated
Deprecated alias for LESVarVec. Use LESVarVec instead.
LESVarVec
An VarVec for i64 elements with Little-Endian bit layout.
LEVarVec
An VarVec for u64 elements with Little-Endian bit layout.
SIntVecDeprecated
Deprecated alias for SVarVec. Use SVarVec instead.
SVarVec
An VarVec for signed integers with Little-Endian bit layout.
UIntVecDeprecated
Deprecated alias for UVarVec. Use UVarVec instead.
UVarVec
An VarVec for unsigned integers with Little-Endian bit layout.