compressed_intvec/lib.rs
1//! # Compressed IntVec Library
2//!
3//! A Rust implementation of compressed vectors of unsigned 64-bit
4//! integers using instantaneous codes. The key structure, [`IntVec`](src/intvec.rs),
5//! compresses the data into a bitstream while storing sample offsets for fast random access. The bitstream
6//! and the codes are provided by the library [dsi-bitstream](https://crates.io/crates/dsi-bitstream).
7//!
8//! ## Features
9//!
10//! - **Multiple Codecs:** Use various encoding schemes such as Gamma, Delta, Exp‑Golomb,
11//! Zeta, Rice, and their parameterized variants. All codecs implement the
12//! [`Codec`](src/codecs.rs) trait.
13//! - **Endianness Support:** Dedicated types are available for both big‑endian and little‑endian
14//! representations. See [`BEIntVec`](src/intvec.rs) and [`LEIntVec`](src/intvec.rs).
15//! - **Memory Debugging:** Integration with the [`mem_dbg`](https://crates.io/crates/mem_dbg)
16//! crate helps in estimating the memory footprint of compressed data structures.
17//!
18//! ## Modules
19//!
20//! - **`codecs`** ([src/codecs.rs](src/codecs.rs)): Contains the generic [`Codec`](src/codecs.rs)
21//! trait and its implementations for various coding schemes. Each codec defines both
22//! encoding and decoding functionalities.
23//! - **`intvec`** ([src/intvec.rs](src/intvec.rs)): Implements the main [`IntVec`](src/intvec.rs)
24//! structure, along with big‑endian and little‑endian variants, constructors, element retrieval,
25//! full decoding, and iterator support.
26//!
27//! ## Usage Example
28//!
29//! To compress a vector of integers using gamma coding in big‑endian format:
30//!
31//! ```rust
32//! use compressed_intvec::intvec::BEIntVec;
33//! use compressed_intvec::codecs::GammaCodec;
34//!
35//! let input = vec![1, 2, 3, 4, 5];
36//! // Create a compressed vector with a sampling period of 2 (every 2nd value is stored as a sample)
37//! let intvec = BEIntVec::<GammaCodec>::from(&input, 2);
38//!
39//! // Random access: decode the 4th element
40//! assert_eq!(intvec.get(3), Some(4));
41//!
42//! // Decode the full original vector (note: this may be expensive)
43//! assert_eq!(intvec.into_vec(), input);
44//! ```
45//!
46//! For little‑endian encoding, use [`LEIntVec`](src/intvec.rs) similarly:
47//!
48//! ```rust
49//! use compressed_intvec::intvec::LEIntVec;
50//! use compressed_intvec::codecs::GammaCodec;
51//!
52//! let input = vec![10, 20, 30, 40];
53//! let intvec = LEIntVec::<GammaCodec>::from(&input, 2);
54//! assert_eq!(intvec.get(2), Some(30));
55//! ```
56//!
57//! ## Codecs Overview
58//!
59//! The [`codecs`](src/codecs.rs) module provides implementations of the [`Codec`](src/codecs.rs)
60//! trait using different encoding algorithms:
61//!
62//! - **GammaCodec:** Uses gamma coding without requiring extra runtime parameters.
63//! - **DeltaCodec:** Uses delta coding, likewise without extra parameters.
64//! - **ExpGolombCodec:** Requires an extra parameter (e.g. a parameter `k`) for encoding/decoding.
65//! - **ZetaCodec:** Uses additional runtime parameters ζ.
66//! - **RiceCodec:** Uses a Rice parameter for encoding/decoding. Ideal for skewed distributions, you likely want to the set the rice prameter as the floor of the base‑2 logarithm of the mean value of the data.
67//! - **MinimalBinaryCodec:** A minimal binary code with upper bound `u > 0` ([truncated binary encoding](https://en.wikipedia.org/wiki/Truncated_binary_encoding)). This is optimal for uniformly distributed data in the range [0, u).
68//! - **Parameterized Codecs:** Variants like `ParamZetaCodec`, `ParamDeltaCodec`, and `ParamGammaCodec`
69//! use compile‑time flags (e.g. lookup table usage) to control internal behavior. If you choose to use this, consider that the tables are hardcoded in the library [dsi-dsi-bitstream](https://crates.io/crates/dsi-dsi-bitstream) (in the file `_tables.rs`). If you want to change the tables, you need to clone the repository and change the tables with the python script `gen_code_tables.py` and then compile the library. See
70//!
71//! **Theoretical Consideration:** The efficiency of each codec is closely tied to the data distribution.
72//!
73//! - **Skewed Distributions:** When the data is skewed, Rice coding generally outperforms Gamma coding.
74//! In these cases, choose the Rice parameter as the floor of the base‑2 logarithm of the mean value.
75//! - **Power Law Distributions:** If the data roughly follows a power law (i.e. P(x) ∝ x⁻²),
76//! Gamma coding is often the optimal choice.
77//! - **Uniform Distributions:** For data uniformly distributed across the range [0, u64::MAX), minimal binary coding is the most efficient.
78//!
79//! For more in‐depth information, please consult the literature on entropy coding.
80//!
81//! ## The `IntVec` Structure
82//!
83//! The [`IntVec`](src/intvec.rs) type encapsulates:
84//!
85//! - **Compressed Data:** A [`Vec<u64>`](std::vec::Vec) containing the compressed bitstream.
86//! - **Sampling Offsets:** A [`Vec<usize>`] tracking bit offsets for every k‑th value to allow
87//! fast random access.
88//! - **Codec Parameters:** Any additional parameters required by the selected codec.
89//!
90//! The main functionalities include:
91//!
92//! - **Construction:** Use `from_with_param` (or the convenience method `from` for codecs without
93//! runtime parameters) to build a compressed vector.
94//! - **Random Access:** Retrieve a value using `get(index)` by decoding only the required part of the stream.
95//! - **Full Decoding:** Convert back to a standard vector with `into_vec()`. This operation decodes every value.
96//! - **Iteration:** The `iter()` method returns an iterator that decodes values on the fly.
97//!
98//! ## Crate Features
99//!
100//! - **`mem-debug` (default):** Enables memory debugging features using the [`mem_dbg`](https://crates.io/crates/mem_dbg) crate.
101//! - **`serde` (disabled by default):** Enables serialization/deserialization support using the [`serde`](https://crates.io/crates/serde) crate.
102
103pub mod codecs;
104pub mod intvec;