compressed_intvec/lib.rs
1//! # Compressed Integer Vector Library
2//!
3//! [](https://crates.io/crates/compressed-intvec)
4//! [](https://github.com/lukefleed/compressed-intvec/actions/workflows/rust.yml)
5//! [](https://docs.rs/compressed-intvec)
6//!
7//! This Rust library provides efficient compression for vectors of `u64` integers by leveraging
8//! variable-length coding from the [dsi-bitstream](https://docs.rs/dsi-bitstream) crate. It offers
9//! rapid random access via configurable sampling, thus striking a balance between speed and memory
10//! usage. Users can select between big-endian (`BEIntVec`) and little-endian (`LEIntVec`) encoding
11//! based on their interoperability requirements.
12//!
13//! ## Features
14//!
15//! - **Efficient Compression:** Utilizes codecs such as Gamma, Delta, and Rice to compress data effectively.
16//! - **Fast Random Access:** Provides O(1) access by storing periodic full positions (sampling).
17//! - **Memory Profiling:** Integrates with [`mem-dbg`](https://crates.io/crates/mem-dbg) for memory analysis.
18//! - **Flexible Codecs:** Offers a variety of codecs to best suit the distribution of your data.
19//!
20//! The sampling parameter dictates how often a full positional index is stored to accelerate random access.
21//! A larger sampling parameter reduces memory overhead at the cost of slightly increased access time.
22//! For many datasets, a value such as `32` serves as an effective trade-off.
23//!
24//! ### Example: Gamma Coding
25//!
26//! Gamma coding, introduced by Elias in the 1960s, is a universal code that represents an integer
27//! using a unary prefix followed by its binary representation. For any integer `x`, the code length
28//! is $O(\log x)$, just marginally longer than its binary form. For instance, the integer `9` is encoded as `0001001`.
29//!
30//! ```rust
31//! use compressed_intvec::intvec::BEIntVec;
32//! use compressed_intvec::codecs::GammaCodec;
33//!
34//! let vec = vec![1, 3, 6, 8, 13, 3];
35//! let sampling_param = 2; // A modest sampling parameter for a small vector
36//! let compressed_be = BEIntVec::<GammaCodec>::from(&vec, sampling_param).unwrap();
37//!
38//! assert_eq!(compressed_be.get(3), 8);
39//!
40//! for (i, val) in compressed_be.iter().enumerate() {
41//! assert_eq!(val, vec[i]);
42//! }
43//! ```
44//!
45//! Alternatively, for little-endian encoding:
46//!
47//! ```rust
48//! use compressed_intvec::intvec::LEIntVec;
49//! use compressed_intvec::codecs::GammaCodec;
50//!
51//! let vec = vec![1, 3, 6, 8, 13, 3];
52//! let compressed_le = LEIntVec::<GammaCodec>::from(&vec, 2).unwrap();
53//!
54//! for (i, val) in compressed_le.iter().enumerate() {
55//! assert_eq!(val, vec[i]);
56//! }
57//! ```
58//!
59//! ## Supported Codecs
60//!
61//! The library offers several codecs for compressing integer vectors:
62//!
63//! | Codec Name | Description |
64//! | -------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
65//! | `GammaCodec` | Gamma coding without requiring extra runtime parameters. |
66//! | `DeltaCodec` | Delta coding without additional parameters. |
67//! | `ExpGolombCodec` | Exp-Golomb coding, which requires an extra parameter (e.g., a parameter `k`). |
68//! | `ZetaCodec` | Zeta coding with additional runtime parameters ζ. |
69//! | `RiceCodec` | Rice coding, ideal for skewed distributions. The Rice parameter is often chosen as the floor of $\log_2$ of the mean of the values. |
70//! | `MinimalBinaryCodec` | Minimal binary (truncated binary) coding for values in `[0, u)`. Optimal for uniformly distributed data. See [truncated binary encoding](https://en.wikipedia.org/wiki/Truncated_binary_encoding). |
71//! | `ParamZetaCodec` | A parameterized variant of Zeta coding using compile-time flags. |
72//! | `ParamDeltaCodec` | A parameterized variant of Delta coding using compile-time flags. |
73//! | `ParamGammaCodec` | A parameterized variant of Gamma coding using compile-time flags. |
74//!
75//! For codecs requiring extra parameters, use the `from_with_params` method. For example, with Rice coding:
76//!
77//! ```rust
78//! use compressed_intvec::intvec::BEIntVec;
79//! use compressed_intvec::codecs::RiceCodec;
80//!
81//! let vec = vec![1, 3, 6, 8, 13, 3];
82//! let rice_param = 3; // Example Rice parameter
83//! let sampling_param = 2;
84//! let compressed = BEIntVec::<RiceCodec>::from_with_param(&vec, sampling_param, rice_param).unwrap();
85//! ```
86//!
87//! Choosing the correct codec is essential for optimal compression. Rice coding is effective for skewed data,
88//! whereas Minimal Binary coding excels with uniformly distributed values.
89//!
90//! ## Endianness
91//!
92//! The library supports both big-endian (`BEIntVec`) and little-endian (`LEIntVec`) formats. Although the
93//! choice of endianness affects the byte order of the compressed data, it does not impact performance.
94//!
95//! ## Memory Analysis
96//!
97//! Both the big-endian and little-endian implementations support the [MemDbg/MemSize](https://docs.rs/mem-dbg/) traits
98//! from the [`mem-dbg`](https://crates.io/crates/mem-dbg) crate. For example, running
99//! `mem_dbg(DbgFlags::empty())` on a large `BEIntVec` might produce:
100//!
101//! ```bash
102//! 11536 B ⏺
103//! 10864 B ├╴data
104//! 656 B ├╴samples
105//! 0 B ├╴codec
106//! 8 B ├╴k
107//! 8 B ├╴len
108//! 0 B ├╴codec_param
109//! 0 B ╰╴endian
110//! ```
111//!
112//! Consider compressing a vector of uniformly distributed `u64` values over `[0, u64::MAX)` and measure the memory usage.
113//!
114//! ```no_run
115//! use rand::distr::{Uniform, Distribution};
116//! use rand::SeedableRng;
117//! use mem_dbg::{MemDbg, DbgFlags};
118//!
119//! fn generate_uniform_vec(size: usize, max: u64) -> Vec<u64> {
120//! let mut rng = rand::rngs::StdRng::seed_from_u64(42);
121//! let uniform = Uniform::new(0, max).unwrap();
122//! (0..size).map(|_| uniform.sample(&mut rng)).collect()
123//! }
124//! let input_vec = generate_uniform_vec(1000, u64::MAX);
125//!
126//! println!("Size of the standard Vec<u64>");
127//! input_vec.mem_dbg(DbgFlags::empty());
128//! ```
129//!
130//! This outputs:
131//!
132//! ```bash
133//! Size of the standard Vec<u64>
134//! 8024 B ⏺
135//! ```
136//!
137//! Next, let's compress the vector using `MinimalBinaryCodec` and `DeltaCodec` with a sampling parameter of `32`:
138//!
139//! ```no_run
140//! use compressed_intvec::intvec::BEIntVec;
141//! use compressed_intvec::codecs::{MinimalBinaryCodec, DeltaCodec};
142//! use mem_dbg::{MemDbg, DbgFlags};
143//!
144//! fn generate_uniform_vec(size: usize, max: u64) -> Vec<u64> {
145//! let mut rng = rand::rngs::StdRng::seed_from_u64(42);
146//! let uniform = Uniform::new(0, max).unwrap();
147//! (0..size).map(|_| uniform.sample(&mut rng)).collect()
148//! }
149//! use rand::distr::{Uniform, Distribution};
150//! use rand::SeedableRng;
151//!
152//! let input_vec = generate_uniform_vec(1000, u64::MAX);
153//!
154//! let minimal_intvec = BEIntVec::<MinimalBinaryCodec>::from_with_param(&input_vec, 32, 10).unwrap();
155//! let delta_intvec = BEIntVec::<DeltaCodec>::from(&input_vec, 32).unwrap();
156//!
157//! println!("Size of the standard Vec<u64>");
158//! input_vec.mem_dbg(DbgFlags::empty());
159//!
160//! println!("\nSize of the compressed IntVec with MinimalBinaryCodec");
161//! minimal_intvec.mem_dbg(DbgFlags::empty());
162//!
163//! println!("\nSize of the compressed IntVec with DeltaCodec");
164//! delta_intvec.mem_dbg(DbgFlags::empty());
165//! ```
166//!
167//! This code snippet should output:
168//!
169//! ```bash
170//! Size of the standard Vec<u64>
171//! 8024 B ⏺
172//!
173//! Size of the compressed IntVec with MinimalBinaryCodec
174//! 832 B ⏺
175//! 528 B ├╴data
176//! 280 B ├╴samples
177//! 0 B ├╴codec
178//! 8 B ├╴k
179//! 8 B ├╴len
180//! 8 B ├╴codec_param
181//! 0 B ╰╴endian
182//!
183//! Size of the compressed IntVec with DeltaCodec
184//! 9584 B ⏺
185//! 9288 B ├╴data
186//! 280 B ├╴samples
187//! 0 B ├╴codec
188//! 8 B ├╴k
189//! 8 B ├╴len
190//! 0 B ├╴codec_param
191//! 0 B ╰╴endian
192//! ```
193//!
194//! As demonstrated, `MinimalBinaryCodec` can reduce the memory footprint dramatically, whereas
195//! `DeltaCodec` may lead to an increased size when applied to uniformly distributed data.
196pub mod codecs;
197pub mod intvec;