compressed_intvec/lib.rs
1//! # Compressed Integer Vector Library
2//!
3//! [](https://crates.io/crates/compressed-intvec)
4//! [](https://github.com/lukefleed/compressed-intvec/actions/workflows/rust.yml)
5//! [](https://docs.rs/compressed-intvec)
6//! [](https://crates.io/crates/compressed-intvec)
7//! 
8//!
9//! A Rust library for compressing vectors of `u64` integers using instantaneous codes the from [dsi-bitstream](https://docs.rs/dsi-bitstream) library. Offers fast random access via sampling to balance speed and memory. Choose between big-endian (`BEIntVec`) or little-endian (`LEIntVec`) encoding.
10//!
11//! ## Features
12//!
13//! - **Efficient Compression**: Leverage various codecs like Gamma, Delta, and Rice coding.
14//! - **Fast Random Access**: Achieve $O(1)$ access with configurable sampling.
15//! - **Memory Analysis**: Integrate with [`mem-dbg`](https://crates.io/crates/mem-dbg) for memory profiling.
16//! - **Flexible Codecs**: Select codecs based on data distribution for optimal compression.
17//!
18//! The sampling parameter determines how often full positions are stored to speed up access. Higher values reduce memory overhead but may increase access time. For example, `sampling_param = 32` is usually a good trade-off for large datasets.
19//!
20//! ### A Quick Example
21//!
22//! Let's consider the _universal_ code **Gamma** introduced by Elias in the 1960s. This code represents an integer as a unary prefix followed by the binary representation of the integer (thus the name universal, as for every integer `x` the length of the code is always $O(\log x)$ long, so just a constant factor longer than its binary form). So for example `9` will be encoded as `0001001`.
23//!
24//! ```rust
25//! use compressed_intvec::intvec::BEIntVec;
26//! use compressed_intvec::codecs::GammaCodec;
27//!
28//! let vec = vec![1, 3, 6, 8, 13, 3];
29//!
30//! // The compressed-intvec needs a sampling parameter
31//! let sampling_param = 2; // small since the vector is small
32//! let compressed_be = BEIntVec::<GammaCodec>::from(&vec, sampling_param).unwrap();
33//!
34//! assert_eq!(compressed_be.get(3), 8);
35//!
36//! for (i, val) in compressed_be.iter().enumerate() {
37//! assert_eq!(val, vec[i]);
38//! }
39//!
40//! ```
41//!
42//! Or alternatively, you can use the `LEIntVec` for little-endian representation:
43//!
44//! ```rust
45//! use compressed_intvec::intvec::LEIntVec;
46//! use compressed_intvec::codecs::GammaCodec;
47//!
48//! let vec = vec![1, 3, 6, 8, 13, 3];
49//! let compressed_le = LEIntVec::<GammaCodec>::from(&vec, 2).unwrap();
50//!
51//! for (i, val) in compressed_le.iter().enumerate() {
52//! assert_eq!(val, vec[i]);
53//! }
54//! ```
55//!
56//! ## Available Codecs
57//!
58//! | Codec Name | Description |
59//! | -------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
60//! | `GammaCodec` | Uses gamma coding without requiring extra runtime parameters. |
61//! | `DeltaCodec` | Uses delta coding, likewise without extra parameters. |
62//! | `ExpGolombCodec` | Requires an extra parameter (e.g., a parameter `k`) for encoding/decoding. |
63//! | `ZetaCodec` | Uses additional runtime parameters ζ. |
64//! | `RiceCodec` | Uses a Rice parameter for encoding/decoding. Ideal for skewed distributions. In this case you may want to set this paramater as the floor of the log_2 of the mean of your values |
65//! | `MinimalBinaryCodec` | A minimal binary code with upper bound `u > 0` ([truncated binary encoding](https://en.wikipedia.org/wiki/Truncated_binary_encoding)). This is optimal for uniformly distributed data in the range [0, u). |
66//! | `ParamZetaCodec` | Parameterized variant of Zeta codec using compile-time flags. |
67//! | `ParamDeltaCodec` | Parameterized variant of Delta codec using compile-time flags. |
68//! | `ParamGammaCodec` | Parameterized variant of Gamma codec using compile-time flags. |
69//!
70//! For codecs that require extra parameters, we can create a compressed int-vec with the method `from_with_param`:
71//!
72//! ```rust
73//! use compressed_intvec::intvec::BEIntVec;
74//! use compressed_intvec::codecs::RiceCodec;
75//!
76//! let vec = vec![1, 3, 6, 8, 13, 3];
77//! let rice_param = 3; // for example
78//! let sampling_param = 2;
79//! let compressed = BEIntVec::<RiceCodec>::from_with_param(&vec, sampling_param, rice_param).unwrap();
80//! ```
81//!
82//! Choosing the right codec is crucial for achieving optimal compression. The efficiency of a codec is highly dependent on the underlying data distribution. For example, Rice coding is usually effective for skewed distributions, while Minimal Binary coding is optimal for uniformly distributed data.
83//!
84//! ## Endianness
85//!
86//! Choose `BEIntVec` or `LEIntVec` based on data interoperability needs. Performance is equivalent; endianness affects byte order in compressed storage.
87//!
88//! ## Memory Analysis (and why choosing the right codec is important)
89//!
90//! Both the little-endian and big-endian version of the compressed int-vec include support for the [MemDbg/MemSize](https://docs.rs/mem-dbg/) traits from the [mem_dbg](https://crates.io/crates/mem-dbg) crate. For example, this is the output of `mem_dbg(DbgFlags::empty()` for a very large `BEIntVec` instance:
91//!
92//! ```bash
93//! 11536 B ⏺
94//! 10864 B ├╴data
95//! 656 B ├╴samples
96//! 0 B ├╴codec
97//! 8 B ├╴k
98//! 8 B ├╴len
99//! 0 B ├╴codec_param
100//! 0 B ╰╴endian
101//! ```
102//!
103//! Consider now a vector of `u64` values uniformly distributed in the range `[0, u64::MAX)` and see it's memory usage
104//!
105//! ```rust
106//! use rand::distr::{Distribution, Uniform};
107//! use rand::SeedableRng;
108//! use mem_dbg::{MemDbg, DbgFlags};
109//!
110//! fn generate_uniform_vec(size: usize, max: u64) -> Vec<u64> {
111//! let mut rng = rand::rngs::StdRng::seed_from_u64(42);
112//! let uniform = Uniform::new(0, max).unwrap();
113//! (0..size).map(|_| uniform.sample(&mut rng)).collect()
114//! }
115//!
116//! let input_vec = generate_uniform_vec(1000, u64::MAX);
117//!
118//! println!("Size of the standard Vec<u64>");
119//! input_vec.mem_dbg(DbgFlags::empty());
120//! ```
121//!
122//! This outputs:
123//!
124//! ```bash
125//! Size of the standard Vec<u64>
126//! 8024 B ⏺
127//! ```
128//!
129//! Next, we compress the vector using both `MinimalBinaryCodec` and `DeltaCodec` using for both a sampling parameter of `32`:
130//!
131//! ```rust
132//! use compressed_intvec::intvec::BEIntVec;
133//! use compressed_intvec::codecs::{MinimalBinaryCodec, DeltaCodec};
134//! use mem_dbg::{MemDbg, DbgFlags};
135//! use rand::distr::{Distribution, Uniform};
136//! use rand::SeedableRng;
137//!
138//! fn generate_uniform_vec(size: usize, max: u64) -> Vec<u64> {
139//! let mut rng = rand::rngs::StdRng::seed_from_u64(42);
140//! let uniform = Uniform::new(0, max).unwrap();
141//! (0..size).map(|_| uniform.sample(&mut rng)).collect()
142//! }
143//!
144//! let input_vec = generate_uniform_vec(1000, u64::MAX);
145//! println!("Size of the standard Vec<u64>");
146//! input_vec.mem_dbg(DbgFlags::empty());
147//!
148//! let minimal_intvec = BEIntVec::<MinimalBinaryCodec>::from_with_param(&input_vec, 32, 10).unwrap();
149//! let delta_intvec = BEIntVec::<DeltaCodec>::from(&input_vec, 32).unwrap();
150//!
151//! println!("Size of the compressed IntVec with MinimalBinaryCodec");
152//! minimal_intvec.mem_dbg(DbgFlags::empty());
153//!
154//! println!("\nSize of the compressed IntVec with DeltaCodec");
155//! delta_intvec.mem_dbg(DbgFlags::empty());
156//! ```
157//!
158//! The compression results are:
159//!
160//! ```bash
161//! Size of the standard Vec<u64>
162//! 8024 B ⏺
163//!
164//! Size of the compressed IntVec with MinimalBinaryCodec
165//! 832 B ⏺
166//! 528 B ├╴data
167//! 280 B ├╴samples
168//! 0 B ├╴codec
169//! 8 B ├╴k
170//! 8 B ├╴len
171//! 8 B ├╴codec_param
172//! 0 B ╰╴endian
173//!
174//! Size of the compressed IntVec with DeltaCodec
175//! 9584 B ⏺
176//! 9288 B ├╴data
177//! 280 B ├╴samples
178//! 0 B ├╴codec
179//! 8 B ├╴k
180//! 8 B ├╴len
181//! 0 B ├╴codec_param
182//! 0 B ╰╴endian
183//! ```
184//!
185//! As shown, `MinimalBinaryCodec` is the most efficient for uniformly distributed data, reducing the size to 832 bytes, an order of magnitude smaller than the original vector. In contrast, `DeltaCodec` actually increases the size to 9584 bytes, as it is poorly suited for uniform distributions.
186//!
187//! ## Benchmarks
188//!
189//! Benchmarks are provided to evaluate both the random access speed and space occupancy of compressed vectors.
190//!
191//! - **Random Access Benchmarks:** These benchmarks measure the time required to access individual elements.
192//! - **Space Occupancy Benchmarks:** These benchmarks report the memory footprint of various codec configurations and compressed representations.
193//!
194//! To run the benchmarks, execute:
195//!
196//! ```bash
197//! cargo bench
198//! ```
199pub mod codecs;
200pub mod intvec;