Crate compressed_intvec

Expand description

§Compressed Integer Vectors

license

A Rust library that provides space-efficient, in-memory representations for integer vectors. It offers three complementary data structures:

FixedVec for fixed-width encoding with blazing fast mutable and atomic access.
VarVec for variable-length encoding with high compression and amortized O(1) random access.
SeqVec for storing sequences of integers with indexed access.

The library is designed to reduce the memory footprint of standard std::vec::Vec collections of integers while retaining performant access patterns.

§Core Structures: `FixedVec`, `VarVec`, and `SeqVec`

The library provides three distinct vector types, each based on a different encoding principle and access pattern. Choosing the right one depends on the specific use case, performance requirements, and data characteristics.

§`FixedVec`: Fixed-Width Encoding

Implements a vector where every integer occupies the same, predetermined number of bits.

O(1) Random Access: The memory location of any element is determined by a direct bit-offset calculation, resulting in minimal-overhead access. With low bit widths (e.g., 8, 16, 32), FixedVec can be faster than std::vec::Vec for random access due to better cache utilization.
Mutability: Supports in-place modifications after creation through an API similar to std::vec::Vec (e.g., push, set, pop).
Atomic Operations: Provides AtomicFixedVec, a thread-safe variant that supports atomic read-modify-write operations for concurrent environments.

§`VarVec`: Variable-Width Element Encoding

VarVec uses variable-length instantaneous codes (e.g., Gamma, Delta, Rice, Zeta) to represent each integer. Smaller or more frequent values get shorter codes, which can dramatically reduce memory usage for non-uniform distributions.

The fundamental trade-off of variable-length encoding is that elements no longer have predictable positions in the bitstream: finding element i requires decoding every element before it. A naïve scan from the start would make random access O(n), which is impractical.

VarVec solves this with periodic sampling: it stores the bit offset of every k-th element in an auxiliary FixedVec. To access element i, the reader jumps to the nearest sample at position ⌊i/k⌋ and decodes at most k − 1 elements forward. This bounds the cost per access to O(k), yielding amortized O(1) random access.

The sampling rate k controls a space–speed trade-off:

Smaller k: faster access (fewer elements to decode per lookup), but more memory for the samples table.
Larger k: better compression (smaller samples table), but slower access.

Additional features:

Automatic Codec Selection: Codec::Auto analyzes the data to select the most space-efficient codec.
Immutability: VarVec is immutable after construction — changing one element could alter its encoded length, shifting all subsequent data and invalidating every sample point.

§`SeqVec`: Variable-Length Sequence Encoding

SeqVec applies the same variable-length encoding as VarVec, but at the level of sequences rather than individual elements. All sequences are concatenated into a single compressed bitstream, and an index stores the bit offset of each sequence boundary — essentially non-uniform sampling at the start of every sequence, rather than at periodic intervals.

This design is natural when the data is organized as variable-length collections (e.g., adjacency lists in a graph, document-term associations): you retrieve an entire sequence at once, iterating over its elements via a zero-allocation SeqIter.

Indexed Access: Jump to any sequence in O(1) via the boundary index, then decode its elements sequentially.
Optional Length Storage: By default, sequence lengths are computed on-the-fly (the iterator stops when it reaches the next boundary). Enable store_lengths for O(1) length queries at a small additional memory cost.
Same Codec Flexibility: Supports all the compression codecs available in VarVec.

§Quick Start

The following examples show some use cases for FixedVec, VarVec, and SeqVec. All common types and traits are available through the prelude.

§Example: `FixedVec` for Uniform Data and Mutable Access

use compressed_intvec::prelude::*;

// Data where values are within a relatively small, uniform range.
let data: &[u32] = &[10, 20, 30, 40, 50];

// The builder automatically infers the minimal bit width required.
let mut vec: UFixedVec<u32> = FixedVec::builder()
    .build(data)
    .unwrap();

assert_eq!(vec.get(2), Some(30));

// `FixedVec` supports in-place mutation.
vec.set(2, 35);
assert_eq!(vec.get(2), Some(35));

§Example: `VarVec` for High-Ratio Compression

use compressed_intvec::prelude::*;

// Skewed data with a mix of small and large values.
let skewed_data: &[u64] = &[5, 8, 13, 1000, 7, 6, 10_000, 10, 2, 3];

// The builder can automatically select the best compression codec.
let varvec: LEVarVec = VarVec::builder()
    .codec(Codec::Auto)
    .build(skewed_data)
    .unwrap();

assert_eq!(varvec.len(), skewed_data.len());
assert_eq!(varvec.get(3), Some(1000));

§Example: `SeqVec` for Sequence Collections

use compressed_intvec::prelude::*;

let sequences: &[&[u32]] = &[
    &[1, 2, 3],
    &[10, 20],
    &[100, 200, 300, 400],
    &[], // Empty sequences are supported
];

// Use the builder to choose a codec. Here we explicitly set
// `store_lengths(false)` to show the default behaviour where lengths are
// not stored, so length queries require decoding.
// Use `store_lengths(true)` when you frequently need O(1) length queries; it stores per-sequence lengths at a small additional memory cost.
let vec: LESeqVec<u32> = SeqVec::builder()
    .codec(Codec::Auto)
    .store_lengths(false)
    .build(sequences)
    .unwrap();

assert_eq!(vec.num_sequences(), 4);
assert!(!vec.has_stored_lengths());
assert_eq!(vec.sequence_len(0), None);

// Access a sequence by index
let seq1: Vec<u32> = vec.get(1).unwrap().collect();
assert_eq!(seq1, vec![10, 20]);

// Iterate over all sequences
for (idx, seq_iter) in vec.iter().enumerate() {
    let seq: Vec<u32> = seq_iter.collect();
    println!("Sequence {}: {:?}", idx, seq);
}

§Atomic Operations with `AtomicFixedVec`

For concurrent applications, the library provides AtomicFixedVec, a thread-safe variant of FixedVec. It supports atomic read-modify-write (RMW) operations, enabling safe and efficient manipulation of shared integer data across multiple threads. The atomicity guarantees depend on the element’s bit width and its alignment within the underlying u64 storage words.

Lock-Free Path: When an element is fully contained within a single u64 word (guaranteed for power-of-two bit widths), operations are performed using lock-free atomic instructions. Performance here is optimal as no locks are involved.
Locked Path: When an element’s bits span across the boundary of two u64 words (common for non-power-of-two bit widths), operations are protected by a fine-grained mutex from a striped lock pool. This ensures atomicity for the two-word update without resorting to a global lock.

Ideal for shared counters, parallel data processing, and any scenario requiring multiple threads to read from and write to the same integer vector concurrently. For write-heavy workloads, configuring the bit width to a power of two (e.g., 8, 16, 32) is recommended to ensure all operations remain on the lock-free path.

§Example: Concurrent Atomic Operations

use compressed_intvec::prelude::*;
use std::sync::Arc;
use std::thread;
use std::sync::atomic::Ordering;

// A vector with a single counter, initialized to 0.
// A bit width of 17 is sufficient to hold the final count (80000).
let vec = Arc::new(
    UAtomicFixedVec::<u32>::builder()
        .bit_width(BitWidth::Explicit(17))
        .build(&[0])
        .unwrap(),
);

const NUM_THREADS: u32 = 8;
const INCREMENTS_PER_THREAD: u32 = 10_000;

let mut handles = vec![];
for _ in 0..NUM_THREADS {
    let vec_clone = Arc::clone(&vec);
    handles.push(thread::spawn(move || {
        for _ in 0..INCREMENTS_PER_THREAD {
            // Atomically increment the counter.
            vec_clone.fetch_add(0, 1, Ordering::SeqCst);
        }
    }));
}

for handle in handles {
    handle.join().unwrap();
}

let final_value = vec.get(0).unwrap();
assert_eq!(final_value, NUM_THREADS * INCREMENTS_PER_THREAD);

§Compression with `VarVec`

VarVec is the optimal choice when data is not uniformly distributed and minimizing memory usage is a priority. It uses variable-length codes to represent integers.

The compression strategy is controlled by the Codec enum, passed to the builder.

§Choosing the Right Codec

For not too large use cases, the recommended strategy is Codec::Auto, which analyzes the data to select the most space-efficient codec. Note that this has a one-time cost during construction that is not negligible for very large datasets or frequent builds. You can also specify a codec explicitly based on your data characteristics.

`Codec` Variant	Description & Encoding Strategy	Optimal Data Distribution
`Auto`	Analyzes the data to choose the best variable-length code, balancing build time and compression ratio.	Agnostic; adapts to the input data.
`Gamma` (γ)	A universal, parameter-free code. Encodes `n` using the unary code of log₂(n+1), followed by the remaining bits of `n`+1.	Implied distribution is ≈ 1/(2x²). Optimal for data skewed towards small non-negative integers.
`Delta` (δ)	A universal, parameter-free code. Encodes `n` using the γ code of log₂(n+1) , making it more efficient than γ for larger values.	Implied distribution is ≈ 1/(2x(log x)²).
`Rice`	A fast, tunable version of Golomb codes where the parameter b must be a power of two. Encodes `n` by splitting it into a quotient (stored in unary) and a remainder (stored in binary).	Geometric distributions.
`Golomb`	A tunable code, more general than Rice. Encodes `n` by splitting it into a quotient (stored in unary) and a remainder (stored using a minimal binary code).	Geometric distributions. Implied distribution is ≈ 1/rˣ.
`Zeta` (ζ)	A tunable code for power-law data. Encodes `n` based on log₂(n+1)/k in unary, followed by a minimal binary code for the remainder.	Power-law distributions (e.g., word frequencies, node degrees). Implied distribution is ≈ 1/x^1+1/k.
`VByteLe`/`Be`	A byte-aligned code that uses a continuation bit to store integers in a variable number of bytes. Fast to decode. The big-endian variant is lexicographical.	Implied distribution is ≈ 1/x^8/7. Good for general-purpose integer data.
`Omega` (ω)	A universal, parameter-free code that recursively encodes the length of the binary representation of `n`.	Implied distribution is approximately 1/x. Compact for very large numbers.
`Unary`	The simplest code. Encodes `n` as `n` zero-bits followed by a one-bit.	Geometric distributions with a very high probability of small values (e.g., boolean flags).
`Explicit`	An escape hatch to use any code from the `dsi-bitstream::codes::Codes` enum.	Advanced use cases requiring specific, unlisted codes.

§Automatic Selection with `Codec::Auto`

The Auto strategy removes the guesswork from codec selection. During the build phase, it analyzes the input data and selects the codec that offers the best compression ratio. This introduces a one-time cost for the analysis at construction time. Use the Auto codec when you want to create a VarVec once and read it many times, as the amortized cost of the analysis is negligible compared to the space savings and performance of subsequent reads.

If you need to create multiple VarVec instances at run-time, consider using a specific codec that matches your data distribution to avoid the overhead of analysis.

§`VarVec` Access Patterns

Because variable-length elements can only be decoded sequentially from a known bit offset, the way you access a VarVec matters. The library provides several methods, each optimized for a specific access pattern.

§`get_many`: Batch Access from a Slice

For retrieving a batch of elements from a slice of indices.

Mechanism: This method sorts the provided indices to transform a random access pattern into a single, monotonic scan over the compressed data. This approach minimizes expensive bitstream seek operations and leverages data locality.

This should be your preferred method for any batch lookup when all indices are known and can be stored in a slice.

use compressed_intvec::prelude::*;
use rand::RngExt;

let data: Vec<u64> = (0..10_000).collect();
let varvec: LEVarVec = VarVec::builder()
    .codec(Codec::Delta)
    .k(32)
    .build(&data)
    .unwrap();

// Indices can be in any order.
let indices_to_get: Vec<usize> = (0..100).map(|_| rand::rng().random_range(0..10_000)).collect();

// `get_many` retrieves all values in one optimized pass.
let values = varvec.get_many(&indices_to_get).unwrap();

assert_eq!(values, indices_to_get.iter().map(|&i| data[i]).collect::<Vec<_>>());

§`get_many_from_iter`: Access from an Iterator

For retrieving elements from a streaming iterator of indices.

Mechanism: Processes indices on-the-fly using a stateful variable::IntVecSeqReader internally, which is optimized for streams with sequential locality.

Use when indices cannot be collected into a slice, for example due to memory constraints.

use compressed_intvec::prelude::*;

let data: Vec<u64> = (0..10_000).collect();
let varvec: LEVarVec = VarVec::from_slice(&data).unwrap();

// Process indices from a streaming source, such as a range iterator.
let values: Vec<u64> = varvec.get_many_from_iter(500..505).unwrap();

assert_eq!(values, vec![500, 501, 502, 503, 504]);

§Dynamic Lookups: `VarVecReader` and `VarVecSeqReader`

There are interactive scenarios where lookup indices are not known in advance. The library provides two reader types to handle such cases:

§`VarVecReader`: Stateless Random Access

A stateless reader for efficient, repeated random lookups.

Mechanism: Amortizes the setup cost of the bitstream reader across multiple calls. Each get operation performs an independent seek from the nearest sample point.

Optimal for sparse, unpredictable access patterns where there is no locality between consecutive lookups.

use compressed_intvec::prelude::*;

let data: Vec<u64> = (0..10_000).collect();
let varvec: LEVarVec = VarVec::from_slice(&data).unwrap();

// Create a stateless reader for random access.
let mut reader = varvec.reader();
assert_eq!(reader.get(500).unwrap(), Some(500));
assert_eq!(reader.get(10).unwrap(), Some(10));

§`VarVecSeqReader`: Stateful Sequential Access

A stateful reader optimized for access patterns with sequential locality.

Mechanism: Maintains an internal cursor. If a requested index is forward and within the same sample block, it decodes from the last position, avoiding a full seek.

Optimal for iterating through sorted or clustered indices where consecutive lookups are near each other.

use compressed_intvec::prelude::*;

let data: Vec<u64> = (0..10_000).collect();
let varvec: LEVarVec = VarVec::from_slice(&data).unwrap();

// Create a stateful reader for sequential access.
let mut seq_reader = varvec.seq_reader();
assert_eq!(seq_reader.get(500).unwrap(), Some(500));
// This second call is faster as it decodes forward from index 500.
assert_eq!(seq_reader.get(505).unwrap(), Some(505));

§Storing Sequences with `SeqVec`

As described above, SeqVec concatenates all sequences into one compressed bitstream and stores a boundary index for O(1) access to any sequence. Common applications include compressed adjacency lists, document-term associations, and any data organized as variable-length collections.

§Basic Usage

use compressed_intvec::seq::{SeqVec, LESeqVec};

let sequences: &[&[u32]] = &[
    &[1, 2, 3],
    &[10, 20],
    &[100, 200, 300, 400],
];

let vec: LESeqVec<u32> = SeqVec::from_slices(sequences).unwrap();

assert_eq!(vec.num_sequences(), 3);

// Retrieve an entire sequence by index.
let second_seq: Vec<u32> = vec.get(1).unwrap().collect();
assert_eq!(second_seq, vec![10, 20]);

// Iterate over all sequences.
for (idx, seq_iter) in vec.iter().enumerate() {
    let seq: Vec<u32> = seq_iter.collect();
    println!("Sequence {}: {:?}", idx, seq);
}

§Codec Customization

Like VarVec, SeqVec supports the same compression codecs and can automatically select the best one:

use compressed_intvec::seq::{SeqVec, LESeqVec, Codec};

let sequences: Vec<Vec<u64>> = vec![
    vec![1, 1, 1, 2, 3],
    vec![100, 200, 300],
];

let vec: LESeqVec<u64> = SeqVec::builder()
    .codec(Codec::Zeta { k: Some(3) })
    .build(&sequences)
    .unwrap();

assert_eq!(vec.num_sequences(), 2);

§Storing Sequence Lengths

By default, sequence lengths are computed on-the-fly by decoding elements until the next sequence boundary is reached. For scenarios where O(1) length queries are beneficial, lengths can be stored explicitly:

use compressed_intvec::seq::{SeqVec, LESeqVec};

let sequences: &[&[u32]] = &[&[1, 2, 3], &[10, 20], &[]];

let vec: LESeqVec<u32> = SeqVec::builder()
    .store_lengths(true)
    .build(sequences)
    .unwrap();

// O(1) length query instead of O(length).
let len = vec.sequence_len(0).unwrap();
assert_eq!(len, 3);

§Memory Analysis

The library integrates mem-dbg to provide memory usage statistics, allowing you to compare the size of different encoding strategies easily. This is particularly useful for understanding the trade-offs between FixedVec and IntVec in terms of memory efficiency.

use compressed_intvec::prelude::*;
use mem_dbg::{DbgFlags, MemDbg};
use rand::{rngs::SmallRng, RngExt, SeedableRng};

// Generates a vector with uniformly random values.
fn generate_random_vec(size: usize, max: u64) -> Vec<u64> {
    let mut rng = SmallRng::seed_from_u64(42);
    (0..size).map(|_| rng.random_range(0..max)).collect()
}

fn main() {
    let data = generate_random_vec(1_000_000, 1 << 20);

    println!("Size of the uncompressed Vec<u64>:");
    data.mem_dbg(DbgFlags::HUMANIZE | DbgFlags::PERCENTAGE | DbgFlags::RUST_LAYOUT);

    // Create a VarVec with Gamma encoding.
    let gamma_varvec = LEVarVec::builder()
        .codec(Codec::Gamma)
        .build(&data)
        .unwrap();

    println!("\nSize of the VarVec with gamma encoding:");
    gamma_varvec.mem_dbg(DbgFlags::HUMANIZE | DbgFlags::PERCENTAGE | DbgFlags::RUST_LAYOUT);

    // Let the library analyze the data and choose the best codec.
    let auto_varvec = LEVarVec::builder()
        .codec(Codec::Auto)
        .build(&data)
        .unwrap();

    println!("\nSize of the VarVec with Auto encoding:");
    auto_varvec.mem_dbg(DbgFlags::HUMANIZE | DbgFlags::PERCENTAGE | DbgFlags::RUST_LAYOUT);
    println!("\nCodec selected by Auto: {:?}", auto_varvec.encoding());

    // Create a FixedVec with minimal bit width (20 bits)
    let fixed_vec = LEFixedVec::builder()
        .bit_width(BitWidth::Minimal)
        .build(&data)
        .unwrap();

    println!("\nSize of the FixedVec with minimal bit width:");
    fixed_vec.mem_dbg(DbgFlags::HUMANIZE | DbgFlags::PERCENTAGE | DbgFlags::RUST_LAYOUT);
}

The output displays memory breakdown for a 1,000,000-element vector of u64 integers, uniformly distributed between 0 and 2²⁰

Vec<u64>: 8.00 MB (8 bytes per element)
VarVec with Gamma encoding: 4.73 MB (41% reduction)
VarVec with Auto codec selection: 2.85 MB (64% reduction, selected Zeta k=10)
FixedVec with minimal bit width: 2.50 MB (69% reduction, 20 bits per element)

The memory analysis also shows the internal structure of each data type, including storage overhead for metadata, sampling structures, and encoding parameters.

Size of the uncompressed Vec<u64>:
8.000 MB 100.00% ⏺

Size of the VarVec with gamma encoding:
4.727 MB 100.00% ⏺
4.625 MB  97.85% ├╴data
101.6 kB   2.15% ├╴samples
101.6 kB   2.15% │ ├╴bits
    8  B   0.00% │ ├╴bit_width
    8  B   0.00% │ ├╴mask
    8  B   0.00% │ ├╴len
    0  B   0.00% │ ╰╴_phantom
    8  B   0.00% ├╴k
    8  B   0.00% ├╴len
   16  B   0.00% ├╴encoding
    0  B   0.00% ╰╴_markers

Size of the VarVec with Auto encoding:
2.846 MB 100.00% ⏺
2.749 MB  96.57% ├╴data
97.72 kB   3.43% ├╴samples
97.70 kB   3.43% │ ├╴bits
    8  B   0.00% │ ├╴bit_width
    8  B   0.00% │ ├╴mask
    8  B   0.00% │ ├╴len
    0  B   0.00% │ ╰╴_phantom
    8  B   0.00% ├╴k
    8  B   0.00% ├╴len
   16  B   0.00% ├╴encoding
    0  B   0.00% ╰╴_markers

Codec selected by Auto: Zeta { k: 10 }

Size of the FixedVec with minimal bit width:
2.500 MB 100.00% ⏺
2.500 MB 100.00% ├╴bits
    8  B   0.00% ├╴bit_width
    8  B   0.00% ├╴mask
    8  B   0.00% ├╴len
    0  B   0.00% ╰╴_phantom

§Benchmarks

The library includes benchmarks for FixedVec, VarVec, and SeqVec. It also tests performance against other library implementations of compressed integer storage: sux::BitFieldVec and succinct::IntVector. These benchmarks are in the benches directory and can be run via

cargo bench

The benchmarks measure the performance of random access, batch access, sequential access, and memory usage for various data distributions and vector sizes.

§Optional Features: Storing `usize` and `isize`

By default, variable::VarVec only supports integer types with a fixed size (e.g., u32, i64). This guarantees that compressed data is portable across different machine architectures (e.g., from a 64-bit server to a 32-bit embedded device).

The arch-dependent-storable feature flag enables Storable implementations for usize and isize. When activated, you can create a VarVec<usize> directly.

Warning: This feature breaks data portability. A VarVec<usize> created on a 64-bit system containing values larger than u32::MAX will cause a panic if deserialized or read on a 32-bit system. Only enable this feature if you can guarantee that your application and its data will only ever run on a single target architecture (e.g., x86_64).

Enable it in your Cargo.toml:

compressed-intvec = { version = "0.6.0", features = ["arch-dependent-storable"] }

§TODO

Add support for epsilon-serde

Modules§

fixed: A generic, compressed, and randomly accessible vector with fixed-width encoding.
prelude: A prelude for compressed-intvec.
seq: A compressed vector of variable-length sequences with indexed access.
variable: A compressed integer vector using variable-length encoding with fast random access.

Macros§

atomic_fixed_vec: Creates an AtomicFixedVec with default parameters.
fixed_vec: Creates a FixedVec with default parameters.
int_vec: Creates a LEVarVec (a VarVec of u64s) containing the given elements.
seq_vec: Creates a LESeqVec containing the given sequences.
sfixed_vec: Creates a FixedVec of signed integers (forces i64).
sint_vec: Creates a LESVarVec (a VarVec of i64s) containing the given elements.
sseq_vec: Creates a LESeqVec of signed integers (forces i64).