compressed_intvec/lib.rs
1//! # Compressed Integer Vector Library
2//!
3//! A Rust library for compressing vectors of `u64` integers using instantaneous codes the from [dsi-bitstream](https://docs.rs/dsi-bitstream) library. Offers fast random access via sampling to balance speed and memory.
4//!
5//! ## Features
6//!
7//! - **Efficient Compression**: Leverage various codecs like Gamma, Delta, and Rice coding.
8//! - **Fast Random Access**: Achieve $O(1)$ access with configurable sampling.
9//! - **Memory Analysis**: Integrate with [`mem-dbg`](https://crates.io/crates/mem-dbg) for memory profiling.
10//! - **Flexible Codecs**: Select codecs based on data distribution for optimal compression.
11//!
12//! The sampling parameter determines how often full positions are stored to speed up access. Higher values reduce memory overhead but may increase access time. For example, `sampling_param = 32` is usually a good trade-off for large datasets.
13//!
14//! ### A Quick Example
15//!
16//! Let's consider the _universal_ code **Gamma** introduced by Elias in the 1960s. This code represents an integer as a unary prefix followed by the binary representation of the integer (thus the name universal, as for every integer `x` the length of the code is always $O(\log x)$ long, so just a constant factor longer than its binary form). So for example `9` will be encoded as `0001001`.
17//!
18//! ```rust
19//! use compressed_intvec::intvec::BEIntVec;
20//! use compressed_intvec::codecs::GammaCodec;
21//!
22//! let vec = vec![1, 3, 6, 8, 13, 3];
23//!
24//! // The compressed-intvec needs a sampling parameter
25//! let sampling_param = 2; // small since the vector is small
26//! // We are using a convenient type alias for the big-endian representation
27//! let compressed_be = BEIntVec::<GammaCodec>::from(&vec, sampling_param).unwrap();
28//!
29//! assert_eq!(compressed_be.get(3), 8);
30//!
31//! for (i, val) in compressed_be.iter().enumerate() {
32//! assert_eq!(val, vec[i]);
33//! }
34//!
35//! ```
36//!
37//! Or alternatively, you can use directly the `IntVec` specifying the endianness (little-endian in this case):
38//!
39//! ```rust
40//! use compressed_intvec::intvec::IntVec;
41//! use compressed_intvec::codecs::GammaCodec;
42//! use dsi_bitstream::traits::LE;
43//!
44//! let vec = vec![1, 3, 6, 8, 13, 3];
45//! let sampling_param = 2; // small since the vector is small
46//! let compressed = IntVec::<LE, _, GammaCodec>::from(&vec, sampling_param).unwrap();
47//!
48//! assert_eq!(compressed.get(3), 8);
49//!
50//! for (i, val) in compressed.iter().enumerate() {
51//! assert_eq!(val, vec[i]);
52//! }
53//! ```
54//!
55//! ## Available Codecs
56//!
57//! | Codec Name | Description |
58//! | -------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
59//! | `GammaCodec` | Uses gamma coding without requiring extra runtime parameters. |
60//! | `DeltaCodec` | Uses delta coding, likewise without extra parameters. |
61//! | `ExpGolombCodec` | Requires an extra parameter (e.g., a parameter `k`) for encoding/decoding. |
62//! | `ZetaCodec` | Uses additional runtime parameters ζ. |
63//! | `RiceCodec` | Uses a Rice parameter for encoding/decoding. Ideal for skewed distributions. In this case you may want to set this paramater as the floor of the log_2 of the mean of your values |
64//! | `MinimalBinaryCodec` | A minimal binary code with upper bound `u > 0` ([truncated binary encoding](https://en.wikipedia.org/wiki/Truncated_binary_encoding)). This is optimal for uniformly distributed data in the range [0, u). |
65//! | `ParamZetaCodec` | Parameterized variant of Zeta codec using compile-time flags. |
66//! | `ParamDeltaCodec` | Parameterized variant of Delta codec using compile-time flags. |
67//! | `ParamGammaCodec` | Parameterized variant of Gamma codec using compile-time flags. |
68//!
69//! For codecs that require extra parameters, we can create a compressed int-vec with the method `from_with_param`:
70//!
71//! ```rust
72//! use compressed_intvec::intvec::BEIntVec;
73//! use compressed_intvec::codecs::RiceCodec;
74//!
75//! let vec = vec![1, 3, 6, 8, 13, 3];
76//! let rice_param = 3; // for example
77//! let sampling_param = 2;
78//! let compressed = BEIntVec::<RiceCodec>::from_with_param(&vec, sampling_param, rice_param).unwrap();
79//! ```
80//!
81//! Choosing the right codec is crucial for achieving optimal compression. The efficiency of a codec is highly dependent on the underlying data distribution. For example, Rice coding is usually effective for skewed distributions, while Minimal Binary coding is optimal for uniformly distributed data.
82//!
83//! ## Endianness
84//!
85//! The library supports both little-endian and big-endian representations. The `IntVec` type allows you to specify the endianness at runtime, while the `BEIntVec` and `LEIntVec` types provide convenient aliases for big-endian and little-endian representations, respectively.
86//!
87//! ## Memory Analysis (and why choosing the right codec is important)
88//!
89//! Both the little-endian and big-endian version of the compressed int-vec include support for the [MemDbg/MemSize](https://docs.rs/mem-dbg/) traits from the [mem_dbg](https://crates.io/crates/mem-dbg) crate. For example, this is the output of `mem_dbg(DbgFlags::empty()` for a very large `BEIntVec` instance:
90//!
91//! ```bash
92//! 11536 B ⏺
93//! 10864 B ├╴data
94//! 656 B ├╴samples
95//! 0 B ├╴codec
96//! 8 B ├╴k
97//! 8 B ├╴len
98//! 0 B ├╴codec_param
99//! 0 B ├╴endian
100//! ```
101//!
102//! Consider now a vector of `u64` values uniformly distributed in the range `[0, u64::MAX)` and see it's memory usage
103//!
104//! ```rust
105//! use rand::distr::{Distribution, Uniform};
106//! use rand::SeedableRng;
107//! use mem_dbg::{MemDbg, DbgFlags};
108//!
109//! fn generate_uniform_vec(size: usize, max: u64) -> Vec<u64> {
110//! let mut rng = rand::rngs::StdRng::seed_from_u64(42);
111//! let uniform = Uniform::new(0, max).unwrap();
112//! (0..size).map(|_| uniform.sample(&mut rng)).collect()
113//! }
114//!
115//! let input_vec = generate_uniform_vec(1000, u64::MAX);
116//!
117//! println!("Size of the standard Vec<u64>");
118//! input_vec.mem_dbg(DbgFlags::empty());
119//! ```
120//!
121//! This outputs:
122//!
123//! ```bash
124//! Size of the standard Vec<u64>
125//! 8024 B ⏺
126//! ```
127//!
128//! Next, we compress the vector using both `MinimalBinaryCodec` and `DeltaCodec` using for both a sampling parameter of `32`:
129//!
130//! ```rust
131//! use compressed_intvec::intvec::BEIntVec;
132//! use compressed_intvec::codecs::{MinimalBinaryCodec, DeltaCodec};
133//! use mem_dbg::{MemDbg, DbgFlags};
134//! use rand::distr::{Distribution, Uniform};
135//! use rand::SeedableRng;
136//!
137//! fn generate_uniform_vec(size: usize, max: u64) -> Vec<u64> {
138//! let mut rng = rand::rngs::StdRng::seed_from_u64(42);
139//! let uniform = Uniform::new(0, max).unwrap();
140//! (0..size).map(|_| uniform.sample(&mut rng)).collect()
141//! }
142//!
143//! let input_vec = generate_uniform_vec(1000, u64::MAX);
144//! println!("Size of the standard Vec<u64>");
145//! input_vec.mem_dbg(DbgFlags::empty());
146//!
147//! let minimal_intvec = BEIntVec::<MinimalBinaryCodec>::from_with_param(&input_vec, 32, 10).unwrap();
148//! let delta_intvec = BEIntVec::<DeltaCodec>::from(&input_vec, 32).unwrap();
149//!
150//! println!("Size of the compressed IntVec with MinimalBinaryCodec");
151//! minimal_intvec.mem_dbg(DbgFlags::empty());
152//!
153//! println!("\nSize of the compressed IntVec with DeltaCodec");
154//! delta_intvec.mem_dbg(DbgFlags::empty());
155//! ```
156//!
157//! The compression results are:
158//!
159//! ```bash
160//! Size of the standard Vec<u64>
161//! 8024 B ⏺
162//!
163//! Size of the compressed IntVec with MinimalBinaryCodec
164//! 832 B ⏺
165//! 528 B ├╴data
166//! 280 B ├╴samples
167//! 0 B ├╴codec
168//! 8 B ├╴k
169//! 8 B ├╴len
170//! 8 B ├╴codec_param
171//! 0 B ╰╴endian
172//!
173//! Size of the compressed IntVec with DeltaCodec
174//! 9584 B ⏺
175//! 9288 B ├╴data
176//! 280 B ├╴samples
177//! 0 B ├╴codec
178//! 8 B ├╴k
179//! 8 B ├╴len
180//! 0 B ├╴codec_param
181//! 0 B ╰╴endian
182//! ```
183//!
184//! As shown, `MinimalBinaryCodec` is the most efficient for uniformly distributed data, reducing the size to 832 bytes, an order of magnitude smaller than the original vector. In contrast, `DeltaCodec` actually increases the size to 9584 bytes, as it is poorly suited for uniform distributions.
185//!
186//! ## Benchmarks
187//!
188//! Benchmarks are provided to evaluate both the random access speed and space occupancy of compressed vectors.
189//!
190//! - **Random Access Benchmarks:** These benchmarks measure the time required to access individual elements.
191//! - **Space Occupancy Benchmarks:** These benchmarks report the memory footprint of various codec configurations and compressed representations.
192//!
193//! To run the benchmarks, execute:
194//!
195//! ```bash
196//! cargo bench
197//! ```
198//!
199//! The results are output to the terminal and also written to CSV files (e.g. `benchmark_space.csv`).
200pub mod codecs;
201pub mod intvec;