compressed_intvec/
lib.rs

1//!  # Compressed Integer Vector Library
2//!
3//! [![Crates.io](https://img.shields.io/crates/v/compressed-intvec.svg)](https://crates.io/crates/compressed-intvec)
4//! [![CI](https://github.com/lukefleed/compressed-intvec/actions/workflows/rust.yml/badge.svg)](https://github.com/lukefleed/compressed-intvec/actions/workflows/rust.yml)
5//! [![Docs](https://docs.rs/compressed-intvec/badge.svg)](https://docs.rs/compressed-intvec)
6//!
7//! This Rust library provides efficient compression for vectors of `u64` integers by leveraging
8//! variable-length coding from the [dsi-bitstream](https://docs.rs/dsi-bitstream) crate. It offers
9//! rapid random access via configurable sampling, thus striking a balance between speed and memory
10//! usage. Users can select between big-endian (`BEIntVec`) and little-endian (`LEIntVec`) encoding
11//! based on their interoperability requirements.
12//!
13//! ## Features
14//!
15//! - **Efficient Compression:** Utilizes codecs such as Gamma, Delta, and Rice to compress data effectively.
16//! - **Fast Random Access:** Provides O(1) access by storing periodic full positions (sampling).
17//! - **Memory Profiling:** Integrates with [`mem-dbg`](https://crates.io/crates/mem-dbg) for memory analysis.
18//! - **Flexible Codecs:** Offers a variety of codecs to best suit the distribution of your data.
19//!
20//! The sampling parameter dictates how often a full positional index is stored to accelerate random access.
21//! A larger sampling parameter reduces memory overhead at the cost of slightly increased access time.
22//! For many datasets, a value such as `32` serves as an effective trade-off.
23//!
24//! ### Example: Gamma Coding
25//!
26//! Gamma coding, introduced by Elias in the 1960s, is a universal code that represents an integer
27//! using a unary prefix followed by its binary representation. For any integer `x`, the code length
28//! is $O(\log x)$, just marginally longer than its binary form. For instance, the integer `9` is encoded as `0001001`.
29//!
30//! ```rust
31//! use compressed_intvec::intvec::BEIntVec;
32//! use compressed_intvec::codecs::GammaCodec;
33//!
34//! let vec = vec![1, 3, 6, 8, 13, 3];
35//! let sampling_param = 2; // A modest sampling parameter for a small vector
36//! let compressed_be = BEIntVec::<GammaCodec>::from(&vec, sampling_param).unwrap();
37//!
38//! assert_eq!(compressed_be.get(3), 8);
39//!
40//! for (i, val) in compressed_be.iter().enumerate() {
41//!     assert_eq!(val, vec[i]);
42//! }
43//! ```
44//!
45//! Alternatively, for little-endian encoding:
46//!
47//! ```rust
48//! use compressed_intvec::intvec::LEIntVec;
49//! use compressed_intvec::codecs::GammaCodec;
50//!
51//! let vec = vec![1, 3, 6, 8, 13, 3];
52//! let compressed_le = LEIntVec::<GammaCodec>::from(&vec, 2).unwrap();
53//!
54//! for (i, val) in compressed_le.iter().enumerate() {
55//!     assert_eq!(val, vec[i]);
56//! }
57//! ```
58//!
59//! ## Supported Codecs
60//!
61//! The library offers several codecs for compressing integer vectors:
62//!
63//! | Codec Name           | Description                                                                                                                                                                  |
64//! | -------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
65//! | `GammaCodec`         | Gamma coding without requiring extra runtime parameters.                                                                                                                   |
66//! | `DeltaCodec`         | Delta coding without additional parameters.                                                                                                                                |
67//! | `ExpGolombCodec`     | Exp-Golomb coding, which requires an extra parameter (e.g., a parameter `k`).                                                                                               |
68//! | `ZetaCodec`          | Zeta coding with additional runtime parameters ζ.                                                                                                                          |
69//! | `RiceCodec`          | Rice coding, ideal for skewed distributions. The Rice parameter is often chosen as the floor of $\log_2$ of the mean of the values.                                         |
70//! | `MinimalBinaryCodec` | Minimal binary (truncated binary) coding for values in `[0, u)`. Optimal for uniformly distributed data. See [truncated binary encoding](https://en.wikipedia.org/wiki/Truncated_binary_encoding). |
71//! | `ParamZetaCodec`     | A parameterized variant of Zeta coding using compile-time flags.                                                                                                             |
72//! | `ParamDeltaCodec`    | A parameterized variant of Delta coding using compile-time flags.                                                                                                            |
73//! | `ParamGammaCodec`    | A parameterized variant of Gamma coding using compile-time flags.                                                                                                            |
74//!
75//! For codecs requiring extra parameters, use the `from_with_params` method. For example, with Rice coding:
76//!
77//! ```rust
78//! use compressed_intvec::intvec::BEIntVec;
79//! use compressed_intvec::codecs::RiceCodec;
80//!
81//! let vec = vec![1, 3, 6, 8, 13, 3];
82//! let rice_param = 3; // Example Rice parameter
83//! let sampling_param = 2;
84//! let compressed = BEIntVec::<RiceCodec>::from_with_param(&vec, sampling_param, rice_param).unwrap();
85//! ```
86//!
87//! Choosing the correct codec is essential for optimal compression. Rice coding is effective for skewed data,
88//! whereas Minimal Binary coding excels with uniformly distributed values.
89//!
90//! ## Endianness
91//!
92//! The library supports both big-endian (`BEIntVec`) and little-endian (`LEIntVec`) formats. Although the
93//! choice of endianness affects the byte order of the compressed data, it does not impact performance.
94//!
95//! ## Memory Analysis
96//!
97//! Both the big-endian and little-endian implementations support the [MemDbg/MemSize](https://docs.rs/mem-dbg/) traits
98//! from the [`mem-dbg`](https://crates.io/crates/mem-dbg) crate. For example, running
99//! `mem_dbg(DbgFlags::empty())` on a large `BEIntVec` might produce:
100//!
101//! ```bash
102//! 11536 B ⏺
103//! 10864 B ├╴data
104//!   656 B ├╴samples
105//!     0 B ├╴codec
106//!     8 B ├╴k
107//!     8 B ├╴len
108//!     0 B ├╴codec_param
109//!     0 B ╰╴endian
110//! ```
111//!
112//! Consider compressing a vector of uniformly distributed `u64` values over `[0, u64::MAX)` and measure the memory usage.
113//!
114//! ```no_run
115//! use rand::distr::{Uniform, Distribution};
116//! use rand::SeedableRng;
117//! use mem_dbg::{MemDbg, DbgFlags};
118//!
119//! fn generate_uniform_vec(size: usize, max: u64) -> Vec<u64> {
120//!     let mut rng = rand::rngs::StdRng::seed_from_u64(42);
121//!     let uniform = Uniform::new(0, max).unwrap();
122//!     (0..size).map(|_| uniform.sample(&mut rng)).collect()
123//! }
124//! let input_vec = generate_uniform_vec(1000, u64::MAX);
125//!
126//! println!("Size of the standard Vec<u64>");
127//! input_vec.mem_dbg(DbgFlags::empty());
128//! ```
129//!
130//! This outputs:
131//!
132//! ```bash
133//! Size of the standard Vec<u64>
134//! 8024 B ⏺
135//! ```
136//!
137//! Next, let's compress the vector using `MinimalBinaryCodec` and `DeltaCodec` with a sampling parameter of `32`:
138//!
139//! ```no_run
140//! use compressed_intvec::intvec::BEIntVec;
141//! use compressed_intvec::codecs::{MinimalBinaryCodec, DeltaCodec};
142//! use mem_dbg::{MemDbg, DbgFlags};
143//!
144//! fn generate_uniform_vec(size: usize, max: u64) -> Vec<u64> {
145//!     let mut rng = rand::rngs::StdRng::seed_from_u64(42);
146//!     let uniform = Uniform::new(0, max).unwrap();
147//!     (0..size).map(|_| uniform.sample(&mut rng)).collect()
148//! }
149//! use rand::distr::{Uniform, Distribution};
150//! use rand::SeedableRng;
151//!
152//! let input_vec = generate_uniform_vec(1000, u64::MAX);
153//!
154//! let minimal_intvec = BEIntVec::<MinimalBinaryCodec>::from_with_param(&input_vec, 32, 10).unwrap();
155//! let delta_intvec = BEIntVec::<DeltaCodec>::from(&input_vec, 32).unwrap();
156//!
157//! println!("Size of the standard Vec<u64>");
158//! input_vec.mem_dbg(DbgFlags::empty());
159//!
160//! println!("\nSize of the compressed IntVec with MinimalBinaryCodec");
161//! minimal_intvec.mem_dbg(DbgFlags::empty());
162//!
163//! println!("\nSize of the compressed IntVec with DeltaCodec");
164//! delta_intvec.mem_dbg(DbgFlags::empty());
165//! ```
166//!
167//! This code snippet should output:
168//!
169//! ```bash
170//! Size of the standard Vec<u64>
171//! 8024 B ⏺
172//!
173//! Size of the compressed IntVec with MinimalBinaryCodec
174//! 832 B ⏺
175//! 528 B ├╴data
176//! 280 B ├╴samples
177//!   0 B ├╴codec
178//!   8 B ├╴k
179//!   8 B ├╴len
180//!   8 B ├╴codec_param
181//!   0 B ╰╴endian
182//!
183//! Size of the compressed IntVec with DeltaCodec
184//! 9584 B ⏺
185//! 9288 B ├╴data
186//!  280 B ├╴samples
187//!    0 B ├╴codec
188//!    8 B ├╴k
189//!    8 B ├╴len
190//!    0 B ├╴codec_param
191//!    0 B ╰╴endian
192//! ```
193//!
194//! As demonstrated, `MinimalBinaryCodec` can reduce the memory footprint dramatically, whereas
195//! `DeltaCodec` may lead to an increased size when applied to uniformly distributed data.
196pub mod codecs;
197pub mod intvec;