compressed-intvec 0.6.0

Space-efficient integer vectors with fixed-width, variable-length, and sequence-oriented encodings.
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
//! Codec selection for variable-length integer compression.
//!
//! This module defines [`Codec`], an enum for controlling the
//! compression strategy of an [`VarVec`]. The choice of codec is a critical
//! performance parameter, as its effectiveness depends on the statistical
//! properties of the data being compressed.
//!
//! # Codec Selection Strategy
//!
//! Codec selection is performed by a statistical analysis of the entire
//! input dataset at construction time.
//!
//! The [`Codec`] enum provides several ways to specify the compression method:
//!
//! 1.  **Explicit Specification**: A specific codec and all its parameters are
//!     provided. This is suitable when the data characteristics are known in
//!     advance.
//!     - Non-parametric examples: [`Gamma`](Codec::Gamma), [`Delta`](Codec::Delta).
//!     - Parametric example: `Zeta { k: Some(3) }`.
//!
//! ```
//! # fn main() -> Result<(), Box<dyn std::error::Error>> {
//! use compressed_intvec::prelude::*;
//!
//! let data: &[u32] = &(0..1000).collect::<Vec<_>>();
//!
//! // Explicitly specify a non-parametric codec
//! let delta_vec: UVarVec<u32> = VarVec::builder()
//!     .codec(Codec::Delta)
//!     .k(16)
//!     .build(&data)?;
//!
//! // Explicitly specify a parametric codec with a fixed parameter
//! let zeta_vec: UVarVec<u32> = VarVec::builder()
//!     .codec(Codec::Zeta { k: Some(3) })
//!     .build(&data)?;
//! # Ok(())
//! # }
//! ```
//!
//! 2.  **Automatic Parameter Estimation**: A specific codec family is chosen, but
//!     the optimal parameter is determined by the builder based on a full data
//!     analysis. This is achieved by providing `None` as the parameter value.
//!     - Example: `Rice { log2_b: None }` will find the best `log2_b` for the
//!       given data.
//!
//! ```
//! # fn main() -> Result<(), Box<dyn std::error::Error>> {
//! use compressed_intvec::prelude::*;
//!
//! let data: &[u32] = &(0..1000).collect::<Vec<_>>();
//!
//! // Automatically select the best Rice parameter
//! let rice_vec: UVarVec<u32> = VarVec::builder()
//!     .codec(Codec::Rice { log2_b: None })
//!     .build(&data)?;
//! # Ok(())
//! # }
//! ```
//!
//! 3.  **Fully Automatic Selection**: The builder analyzes the data against all
//!     available codecs and their standard parameter ranges to find the single
//!     best configuration. This is activated by using [`Codec::Auto`].
//!
//! ```
//! # fn main() -> Result<(), Box<dyn std::error::Error>> {
//! use compressed_intvec::prelude::*;
//!
//! let data: &[u32] = &(0..1000).collect::<Vec<_>>();
//! // Automatically select the best codec and parameters for the data
//! let auto_vec: UVarVec<u32> = VarVec::builder()
//!    .codec(Codec::Auto)
//!    .build(&data)?;
//! # Ok(())
//! # }
//! ```
//!
//!
//! ## Analysis Mechanism
//!
//! The selection logic uses the [`CodesStats`] utility from the [`dsi-bitstream`]
//! crate. For a given sequence of integers, [`CodesStats`] calculates the exact
//! total bit cost for encoding the sequence with a wide range of instantaneous
//! codes and their common parameterizations.
//!
//! ## Construction Overhead
//!
//! The full-dataset analysis has a one-time computational cost at construction.
//! The complexity is `O(N * C)`, where `N` is the number of elements in the
//! input and `C` is the number of codec configurations tested by [`CodesStats`]
//! (approximately 70).
//!
//! This trade-off is suitable for read-heavy workloads where a higher initial
//! cost is acceptable for better compression and subsequent read performance.
//!
//! # Implementation Notes
//!
//! - The parameter ranges for codecs like Zeta and Rice are defined by the `const
//!   generics` of the [`CodesStats`] struct in [`dsi-bitstream`]. The default
//!   values cover common and effective parameter ranges.
//! - If a data distribution benefits from a parameter outside of the tested
//!   range (e.g., Zeta with `k=20`), it must be specified explicitly in the
//!   builder via `.codec(Codec::Zeta { k: Some(20) })`.
//!
//! [`VarVec`]: crate::variable::VarVec
//! [`VarVecBuilder`]: crate::variable::builder::VarVecBuilder
//! [`dsi-bitstream`]: https://crates.io/crates/dsi-bitstream

use super::VarVecError;
use dsi_bitstream::prelude::{Codes, CodesStats};

/// Specifies the compression codec and its parameters for an [`VarVec`](super::VarVec).
///
/// This enum allows for either explicitly setting the parameters for codes
/// like Rice and Zeta, or requesting that the [`VarVecBuilder`](super::builder::VarVecBuilder)
/// automatically select suitable parameters by performing a full analysis of
/// the data distribution.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum Codec {
    /// Elias γ-coding. A simple, universal code that is effective for integers
    /// with a distribution skewed towards small values. It is the default codec
    /// for the iterator-based builder, which cannot perform data analysis.
    #[default]
    Gamma,

    /// Elias δ-coding. A universal code that is generally more efficient than
    /// Gamma for larger integer values.
    Delta,

    /// Unary coding. Encodes an integer `n` as `n` zeros followed by a one. It is
    /// only efficient for extremely small values (e.g., 0, 1, 2).
    Unary,

    /// Rice-coding with a parameter `log2_b`. This code is optimal for data with
    /// a geometric distribution.
    ///
    /// - If `log2_b` is `Some(val)`, the specified parameter is used.
    /// - If `log2_b` is `None`, an optimal parameter is estimated by analyzing the
    ///   entire dataset.
    Rice { log2_b: Option<u8> },

    /// Boldi-Vigna ζ-coding with a parameter `k`. This code is effective for
    /// data with a power-law distribution, common in web graphs and social networks.
    ///
    /// - If `k` is `Some(val)`, the specified parameter is used (`k > 0`).
    /// - If `k` is `None`, an optimal parameter is estimated by analyzing the
    ///   entire dataset.
    Zeta { k: Option<u64> },

    /// Golomb-coding with a parameter `b`. This is a generalization of Rice coding
    /// and is also suitable for geometric distributions.
    ///
    /// - If `b` is `Some(val)`, the specified parameter is used (`b > 0`).
    /// - If `b` is `None`, an optimal parameter is estimated by analyzing the
    ///   entire dataset.
    Golomb { b: Option<u64> },

    /// Elias-Fano ω-coding. A universal code.
    Omega,

    /// Streamlined Apostolico–Drovandi π code with a parameter `k`.
    ///
    /// - If `k` is `Some(val)`, the specified parameter is used (`k > 0`).
    /// - If `k` is `None`, an optimal parameter is estimated by analyzing the
    ///   entire dataset.
    Pi { k: Option<u64> },

    /// Elias-Fano Exponential-Golomb coding with a parameter `k`.
    ///
    /// - If `k` is `Some(val)`, the specified parameter is used.
    /// - If `k` is `None`, an optimal parameter is estimated by analyzing the
    ///   entire dataset.
    ExpGolomb { k: Option<u64> },

    /// VByte encoding with Little-Endian byte order. This is often one of the
    /// fastest codecs for decoding, though it may not offer the best compression.
    VByteLe,

    /// VByte encoding with Big-Endian byte order.
    VByteBe,

    /// Automatically select the best variable-length code based on the data.
    ///
    /// When this option is used, the builder performs a statistical
    /// analysis on the entire input dataset to determine which codec and
    /// parameterization provides the best compression ratio.
    ///
    /// # Note
    ///
    /// This option is **not** supported for the iterator-based builder,
    /// as it requires pre-analyzing the data.
    Auto,

    /// Use an explicitly provided [`Codes`] variant from [`dsi-bitstream`](https://crates.io/crates/dsi-bitstream).
    ///
    /// This is for advanced use cases where the user has already constructed
    /// a [`Codes`] enum instance.
    Explicit(Codes),
}

/// Deprecated alias for [`Codec`]. Use [`Codec`] instead.
///
/// # Deprecation Notice
///
/// This type has been renamed to [`Codec`] for brevity and consistency.
/// This alias will be removed in version 1.0.0.
#[deprecated(since = "0.6.0", note = "renamed to `Codec`; use `Codec` instead")]
pub type VariableCodecSpec = Codec;

impl Codec {
    /// Returns `true` if this codec specification requires data analysis to resolve.
    ///
    /// Analysis is needed when parameters are not explicitly specified and must
    /// be determined by analyzing the data distribution.
    #[inline]
    pub(crate) fn requires_analysis(&self) -> bool {
        matches!(
            self,
            Codec::Auto
                | Codec::Rice { log2_b: None }
                | Codec::Zeta { k: None }
                | Codec::Golomb { b: None }
                | Codec::Pi { k: None }
                | Codec::ExpGolomb { k: None }
        )
    }
}

/// Resolves a user-provided [`Codec`] into a concrete [`Codes`] variant.
///
/// This function translates the user's high-level request into a fully-parameterized,
/// concrete [`Codes`] variant that can be used for compression.
///
/// If the `spec` includes requests for automatic parameter selection (e.g., `Auto`
/// or `Zeta { k: None }`), this function analyzes the **entire** provided `input`
/// data slice to determine the optimal settings.
pub(crate) fn resolve_codec<U>(input: &[U], spec: Codec) -> Result<Codes, VarVecError>
where
    U: Into<u64> + Copy,
{
    match spec {
        // Parameter-free codecs: direct mapping, no data needed.
        Codec::Gamma => Ok(Codes::Gamma),
        Codec::Delta => Ok(Codes::Delta),
        Codec::Unary => Ok(Codes::Unary),
        Codec::Omega => Ok(Codes::Omega),
        Codec::VByteLe => Ok(Codes::VByteLe),
        Codec::VByteBe => Ok(Codes::VByteBe),
        Codec::Explicit(codes) => Ok(codes),

        // Codecs with explicit parameters: direct mapping.
        Codec::Rice { log2_b: Some(p) } => Ok(Codes::Rice(p as usize)),
        Codec::Zeta { k: Some(p) } => Ok(Codes::Zeta(p as usize)),
        Codec::Golomb { b: Some(p) } => Ok(Codes::Golomb(p)),
        Codec::Pi { k: Some(p) } => Ok(Codes::Pi(p as usize)),
        Codec::ExpGolomb { k: Some(p) } => Ok(Codes::ExpGolomb(p as usize)),

        // Codecs requiring analysis: return error if no data provided.
        Codec::Auto
        | Codec::Rice { log2_b: None }
        | Codec::Zeta { k: None }
        | Codec::Golomb { b: None }
        | Codec::Pi { k: None }
        | Codec::ExpGolomb { k: None } => {
            if input.is_empty() {
                return Ok(Codes::Gamma); // Safe default only for analysis codecs
            }

            // Define a type alias for the default [`CodesStats`] configuration for clarity.
            // These const generics define the range of parameters to test.
            type DefaultCodesStats = CodesStats<10, 20, 10, 10, 10>;

            // Create a stats object and populate it by iterating through the entire dataset.
            let mut stats = DefaultCodesStats::default();
            for &value in input {
                stats.update(value.into());
            }

            // Use the populated stats to resolve the codec specification.
            match spec {
                Codec::Auto => {
                    let (best_code, _) = stats.best_code();
                    Ok(best_code.canonicalize())
                }
                Codec::Rice { log2_b: None } => {
                    let (best_param, _) = stats
                        .rice
                        .iter()
                        .enumerate()
                        .min_by_key(|&(_, cost)| cost)
                        .unwrap_or((0, &0)); // Fallback to 0 if array is empty.
                    Ok(Codes::Rice(best_param))
                }
                Codec::Zeta { k: None } => {
                    let (best_param, _) = stats
                        .zeta
                        .iter()
                        .enumerate()
                        .min_by_key(|&(_, cost)| cost)
                        .unwrap_or((0, &0));
                    Ok(Codes::Zeta(best_param + 1)) // Zeta params are 1-based.
                }
                Codec::Golomb { b: None } => {
                    let (best_param, _) = stats
                        .golomb
                        .iter()
                        .enumerate()
                        .min_by_key(|&(_, cost)| cost)
                        .unwrap_or((0, &0));
                    Ok(Codes::Golomb((best_param + 1) as u64)) // Golomb params are 1-based.
                }
                Codec::Pi { k: None } => {
                    let (best_param, _) = stats
                        .pi
                        .iter()
                        .enumerate()
                        .min_by_key(|&(_, cost)| cost)
                        .unwrap_or((0, &0));
                    Ok(Codes::Pi(best_param + 2)) // Pi params are offset by 2.
                }
                Codec::ExpGolomb { k: None } => {
                    let (best_param, _) = stats
                        .exp_golomb
                        .iter()
                        .enumerate()
                        .min_by_key(|&(_, cost)| cost)
                        .unwrap_or((0, &0));
                    Ok(Codes::ExpGolomb(best_param))
                }
                // This arm is guaranteed to be unreachable because the outer match
                // ensures `spec` is one of the variants handled above.
                _ => unreachable!(),
            }
        }
    }
}

/// Resolves a codec specification by analyzing data from an iterator.
///
/// This function avoids intermediate allocations by consuming the iterator
/// directly. It should only be called when `spec.requires_analysis()` returns
/// `true`.
///
/// # Arguments
///
/// * `iter` - An iterator of u64 values to analyze for codec parameter selection.
///
/// * `spec` - The codec specification requesting analysis (Auto or a codec with
///   None parameters).
pub(crate) fn resolve_codec_from_iter<I>(iter: I, spec: Codec) -> Result<Codes, VarVecError>
where
    I: Iterator<Item = u64>,
{
    // Define the default CodesStats configuration for parameter analysis.
    type DefaultCodesStats = CodesStats<10, 20, 10, 10, 10>;

    let mut stats = DefaultCodesStats::default();
    let mut count = 0usize;

    // Analyze the data stream without materializing it to a vector.
    for value in iter {
        stats.update(value);
        count += 1;
    }

    // If the stream is empty, return a safe default.
    if count == 0 {
        return Ok(Codes::Gamma);
    }

    // Use the accumulated statistics to determine the best codec.
    match spec {
        Codec::Auto => {
            let (best_code, _) = stats.best_code();
            Ok(best_code.canonicalize())
        }
        Codec::Rice { log2_b: None } => {
            let (best_param, _) = stats
                .rice
                .iter()
                .enumerate()
                .min_by_key(|&(_, cost)| cost)
                .unwrap_or((0, &0));
            Ok(Codes::Rice(best_param))
        }
        Codec::Zeta { k: None } => {
            let (best_param, _) = stats
                .zeta
                .iter()
                .enumerate()
                .min_by_key(|&(_, cost)| cost)
                .unwrap_or((0, &0));
            Ok(Codes::Zeta(best_param + 1))
        }
        Codec::Golomb { b: None } => {
            let (best_param, _) = stats
                .golomb
                .iter()
                .enumerate()
                .min_by_key(|&(_, cost)| cost)
                .unwrap_or((0, &0));
            Ok(Codes::Golomb((best_param + 1) as u64))
        }
        Codec::Pi { k: None } => {
            let (best_param, _) = stats
                .pi
                .iter()
                .enumerate()
                .min_by_key(|&(_, cost)| cost)
                .unwrap_or((0, &0));
            Ok(Codes::Pi(best_param + 2))
        }
        Codec::ExpGolomb { k: None } => {
            let (best_param, _) = stats
                .exp_golomb
                .iter()
                .enumerate()
                .min_by_key(|&(_, cost)| cost)
                .unwrap_or((0, &0));
            Ok(Codes::ExpGolomb(best_param))
        }
        _ => unreachable!("resolve_codec_from_iter called with non-analysis codec"),
    }
}