Expand description
A compressed integer vector using variable-length encoding with fast random access.
This module provides IntVec
, a data structure for storing sequences of
integers in a compressed format while retaining efficient random access. It is
well-suited for datasets where integer values are non-uniformly distributed,
as it uses instantaneous variable-length codes to represent smaller numbers with fewer bits.
§Core Concepts
§Variable-Length Encoding
Unlike FixedVec
, which uses a fixed number of
bits for every integer, IntVec
employs instantaneous codes (such as
Gamma, Delta, or Rice codes) provided by the dsi-bitstream
crate. This
approach allows each integer to be encoded with a variable number of bits,
typically using shorter codes for smaller or more frequent values. This can
lead to significant space savings, especially for data with many small numbers.
Signed integers (e.g., i8
, i32
) are supported transparently through
zig-zag encoding, which maps small negative and positive integers to small
unsigned integers.
§Random Access and Sampling
A key challenge with variable-length codes is that the location of the i-th
element cannot be calculated directly. To solve this, IntVec
implements a
sampling mechanism. It stores the bit position of every k-th element in
a separate, auxiliary FixedVec
. This parameter, k
, is the sampling rate.
To access an element at index
, IntVec
:
- Finds the nearest sample by calculating
index / k
. - Retrieves the bit offset of the start of that sampled block.
- Jumps to that offset in the compressed data stream.
- Sequentially decodes the remaining
index % k
elements to reach the target.
This strategy provides amortized O(1) random access, as the number of
sequential decoding steps is bounded by k
.
§The k
Trade-off
The choice of the sampling rate k
involves a trade-off:
- Smaller
k
: Faster random access (fewer elements to decode per access) but higher memory usage due to a larger samples table. - Larger
k
: Slower random access but better compression, as the samples table is smaller.
The optimal k
depends on the specific access patterns of an application.
§Design and Immutability
IntVec
is immutable after creation.
Unlike FixedVec
, it does not provide methods for
in-place modification (e.g., set
, push
).
If a value in the middle of the compressed bitstream were changed, its new
encoded length might be different. For example, changing 5
(which might be
4 bits) to 5000
(which might be 16 bits) would require shifting all
subsequent data, invalidating every sample point that follows. The cost of
such an operation would be prohibitive, scaling with the length of the vector.
For this reason, IntVec
is designed as a write-once, read-many data structure.
§Access Strategies and Readers
IntVec
provides multiple interfaces for accessing data, each optimized for a
different pattern of use.
-
get()
: For single, infrequent lookups. Each call creates and discards an internal reader, which incurs overhead if used in a loop. -
get_many()
: The most efficient method for retrieving a batch of elements when all indices are known beforehand. It sorts the indices to perform a single, monotonic scan over the data, minimizing redundant decoding and seek operations. -
IntVecReader
(Reusable Stateless Reader): This reader is created viaIntVec::reader()
. It maintains an internal, reusable bitstream reader, amortizing its setup cost over multiple calls. Eachget()
call is stateless with respect to position: it performs a full seek to the nearest sample and decodes forward from there, independently of previous calls. It is best suited for true random access patterns where indices are sparse and unpredictable. -
IntVecSeqReader
(Stateful Sequential Reader): This reader, created viaIntVec::seq_reader()
, is a stateful object optimized for access patterns with high locality. It maintains an internal cursor of the current decoding position.- Fast Path: If a requested
index
is at or after the cursor’s current position and within the same sample block, the reader simply decodes forward from its last position, avoiding a costly seek operation. - Fallback Path: If the requested
index
is before the cursor or in a different sample block, the reader falls back to the standard behavior of seeking to the nearest sample and decoding from there. This makes it very efficient for iterating through sorted or clustered indices.
- Fast Path: If a requested
§Main Components
IntVec
: The core compressed vector.IntVecBuilder
: The primary tool for constructing anIntVec
with custom compression codecs and sampling rates.VariableCodecSpec
: An enum to specify the compression codec.IntVecReader
: A reusable, stateless reader for efficient random access.IntVecSeqReader
: A stateful reader optimized for sequential or localized access patterns.IntVecSlice
: An immutable, zero-copy view over a portion of the vector.
§Examples
§Basic Usage with Unsigned Integers
Create a UIntVec
(an alias for IntVec<u32, LE>
) from a slice of u32
. The builder will automatically
select a suitable codec and use a default sampling rate.
use compressed_intvec::variable::{IntVec, UIntVec};
let data: Vec<u32> = vec![100, 200, 300, 1024];
let vec: UIntVec<u32> = IntVec::from_slice(&data).unwrap();
assert_eq!(vec.len(), 4);
// Accessing an element
assert_eq!(vec.get(1), Some(200));
§Storing Signed Integers
IntVec
handles signed integers, such as i16
, by mapping them to unsigned
values using zig-zag encoding.
use compressed_intvec::variable::{IntVec, SIntVec};
let data: &[i16] = &[-5, 20, -100, 0, 8];
let vec: SIntVec<i16> = IntVec::from_slice(data).unwrap();
assert_eq!(vec.len(), 5);
assert_eq!(vec.get(0), Some(-5));
assert_eq!(vec.get(2), Some(-100));
§Manual Codec and Sampling Rate
For fine-grained control, use the IntVecBuilder
. Here, we specify a
sampling rate of k=8
and use the Zeta
code with k=3
.
use compressed_intvec::variable::{IntVec, UIntVec, VariableCodecSpec};
let data: Vec<u64> = (0..100).map(|i| i * i).collect();
let vec: UIntVec<u64> = IntVec::builder()
.k(8) // Set sampling rate
.codec(VariableCodecSpec::Zeta { k: Some(3) }) // Set compression codec
.build(&data)
.unwrap();
assert_eq!(vec.get_sampling_rate(), 8);
assert_eq!(vec.get(10), Some(100));
Best performance is achieved when the sampling rate k
is a power of two. Usually a value of 32
or 16
is a good trade-off between speed and compression ratio.
§Codec Selection and Performance
The choice of compression codec is critical for performance and space efficiency.
IntVecBuilder
offers automatic codec selection via
VariableCodecSpec::Auto
. When enabled, the builder analyzes the entire input
dataset to find the codec that offers the best compression ratio.
This analysis involves calculating the compressed size for the data with approximately 70 different codec configurations. This process introduces a significant, one-time construction overhead.
Use Auto
for read-heavy workloads where the IntVec
is built once and then accessed many times. The initial cost is easily amortized by
the long-term space savings.
If your application creates many small IntVec
s or accesses them frequently,
the repeated cost of analysis can become a performance
bottleneck. In such scenarios, it is better to explicitly specify a codec
(e.g., VariableCodecSpec::Gamma
or VariableCodecSpec::Delta
) that is known
to be a good general-purpose choice for your data.
use compressed_intvec::prelude::*;
let data: Vec<u32> = (0..100).collect();
// Create an IntVec with automatic codec selection
let vec: UIntVec<u32> = IntVec::builder()
.build(&data)
.unwrap();
Re-exports§
pub use self::codec::VariableCodecSpec;
pub use self::traits::Storable;
pub use builder::IntVecBuilder;
pub use builder::IntVecFromIterBuilder;
pub use reader::IntVecReader;
pub use seq_reader::IntVecSeqReader;
pub use slice::IntVecSlice;
Modules§
- builder
- Builders for constructing an
IntVec
. - codec
- Codec selection for variable-length integer compression.
- iter
- Iterators for
IntVec
. - reader
- A reader for efficient, repeated random access into an
IntVec
. - seq_
reader - A stateful reader for efficient, sequential access into an
IntVec
. - slice
- An immutable, zero-copy slice of an
IntVec
. - traits
- Core traits for the
variable
module.
Structs§
- IntVec
- A compressed, randomly accessible vector of integers using variable-length encoding.
Enums§
- IntVec
Error - Defines the set of errors that can occur in
IntVec
operations.
Type Aliases§
- BEInt
Vec - An
IntVec
foru64
elements with Big-Endian bit layout. - BESInt
Vec - An
IntVec
fori64
elements with Big-Endian bit layout. - LEInt
Vec - An
IntVec
foru64
elements with Little-Endian bit layout. - LESInt
Vec - An
IntVec
fori64
elements with Little-Endian bit layout. - SIntVec
- An
IntVec
for signed integers with Little-Endian bit layout. - UIntVec
- An
IntVec
for unsigned integers with Little-Endian bit layout.