compressed_intvec/variable/
mod.rs

1//! A compressed integer vector using variable-length encoding with fast random access.
2//!
3//! This module provides [`IntVec`], a data structure for storing sequences of
4//! integers in a compressed format while retaining efficient random access. It is
5//! well-suited for datasets where integer values are non-uniformly distributed,
6//! as it uses instantaneous variable-length codes to represent smaller numbers with fewer bits.
7//!
8//! # Core Concepts
9//!
10//! ## Variable-Length Encoding
11//!
12//! Unlike [`FixedVec`], which uses a fixed number of
13//! bits for every integer, [`IntVec`] employs **instantaneous codes** (such as
14//! Gamma, Delta, or Rice codes) provided by the [`dsi-bitstream`] crate. This
15//! approach allows each integer to be encoded with a variable number of bits,
16//! typically using shorter codes for smaller or more frequent values. This can
17//! lead to significant space savings, especially for data with many small numbers.
18//!
19//! Signed integers (e.g., `i8`, `i32`) are supported transparently through
20//! zig-zag encoding, which maps small negative and positive integers to small
21//! unsigned integers.
22//!
23//! ## Random Access and Sampling
24//!
25//! A key challenge with variable-length codes is that the location of the *i*-th
26//! element cannot be calculated directly. To solve this, [`IntVec`] implements a
27//! **sampling mechanism**. It stores the bit position of every *k*-th element in
28//! a separate, auxiliary [`FixedVec`]. This parameter, `k`, is the **sampling rate**.
29//!
30//! To access an element at `index`, [`IntVec`]:
31//! 1.  Finds the nearest sample by calculating `index / k`.
32//! 2.  Retrieves the bit offset of the start of that sampled block.
33//! 3.  Jumps to that offset in the compressed data stream.
34//! 4.  Sequentially decodes the remaining `index % k` elements to reach the target.
35//!
36//! This strategy provides amortized O(1) random access, as the number of
37//! sequential decoding steps is bounded by `k`.
38//!
39//! ### The `k` Trade-off
40//!
41//! The choice of the sampling rate `k` involves a trade-off:
42//! -   **Smaller `k`**: Faster random access (fewer elements to decode per access)
43//!     but higher memory usage due to a larger samples table.
44//! -   **Larger `k`**: Slower random access but better compression, as the
45//!     samples table is smaller.
46//!
47//! The optimal `k` depends on the specific access patterns of an application.
48//!
49//! # Design and Immutability
50//!
51//! [`IntVec`] is **immutable** after creation.
52//! Unlike [`FixedVec`], it does not provide methods for
53//! in-place modification (e.g., `set`, `push`).
54//!
55//! If a value in the middle of the compressed bitstream were changed, its new
56//! encoded length might be different. For example, changing `5` (which might be
57//! 4 bits) to `5000` (which might be 16 bits) would require shifting all
58//! subsequent data, invalidating every sample point that follows. The cost of
59//! such an operation would be prohibitive, scaling with the length of the vector.
60//!
61//! For this reason, [`IntVec`] is designed as a write-once, read-many data structure.
62//!
63//! # Access Strategies and Readers
64//!
65//! [`IntVec`] provides multiple interfaces for accessing data, each optimized for a
66//! different pattern of use.
67//!
68//! - **[`get()`](IntVec::get)**: For single, infrequent lookups. Each call creates and discards
69//!   an internal reader, which incurs overhead if used in a loop.
70//!
71//! - **[`get_many()`](IntVec::get_many)**: The most efficient method for retrieving a batch of elements
72//!   when all indices are known beforehand. It sorts the indices to perform a
73//!   single, monotonic scan over the data, minimizing redundant decoding and seek
74//!   operations.
75//!
76//! - **[`IntVecReader`] (Reusable Stateless Reader)**: This reader is created via
77//!   [`IntVec::reader()`]. It maintains an internal, reusable bitstream reader,
78//!   amortizing its setup cost over multiple calls. Each [`get()`](IntVec::get) call is
79//!   **stateless** with respect to position: it performs a full seek to the nearest
80//!   sample and decodes forward from there, independently of previous calls. It is
81//!   best suited for true random access patterns where indices are sparse and
82//!   unpredictable.
83//!
84//! - **[`IntVecSeqReader`] (Stateful Sequential Reader)**: This reader, created via
85//!   [`IntVec::seq_reader()`], is a stateful object optimized for access patterns with
86//!   high locality. It maintains an internal cursor of the current decoding position.
87//!   - **Fast Path**: If a requested `index` is at or after the cursor's current
88//!     position and within the same sample block, the reader simply decodes
89//!     forward from its last position, avoiding a costly seek operation.
90//!   - **Fallback Path**: If the requested `index` is before the cursor or in a
91//!     different sample block, the reader falls back to the standard behavior of
92//!     seeking to the nearest sample and decoding from there. This makes it very
93//!     efficient for iterating through sorted or clustered indices.
94//!
95//! # Main Components
96//!
97//! - [`IntVec`]: The core compressed vector.
98//! - [`IntVecBuilder`]: The primary tool for constructing an [`IntVec`] with
99//!   custom compression codecs and sampling rates.
100//! - [`VariableCodecSpec`]: An enum to specify the compression codec.
101//! - [`IntVecReader`]: A reusable, stateless reader for efficient random access.
102//! - [`IntVecSeqReader`]: A stateful reader optimized for sequential or localized access patterns.
103//! - [`IntVecSlice`]: An immutable, zero-copy view over a portion of the vector.
104//!
105//! # Examples
106//!
107//! ## Basic Usage with Unsigned Integers
108//!
109//! Create a [`UIntVec`] (an alias for `IntVec<u32, LE>`) from a slice of `u32`. The builder will automatically
110//! select a suitable codec and use a default sampling rate.
111//!
112//! ```
113//! use compressed_intvec::variable::{IntVec, UIntVec};
114//!
115//! let data: Vec<u32> = vec![100, 200, 300, 1024];
116//! let vec: UIntVec<u32> = IntVec::from_slice(&data).unwrap();
117//!
118//! assert_eq!(vec.len(), 4);
119//! // Accessing an element
120//! assert_eq!(vec.get(1), Some(200));
121//! ```
122//!
123//! ## Storing Signed Integers
124//!
125//! [`IntVec`] handles signed integers, such as [`i16`], by mapping them to unsigned
126//! values using zig-zag encoding.
127//!
128//! ```
129//! use compressed_intvec::variable::{IntVec, SIntVec};
130//!
131//! let data: &[i16] = &[-5, 20, -100, 0, 8];
132//! let vec: SIntVec<i16> = IntVec::from_slice(data).unwrap();
133//!
134//! assert_eq!(vec.len(), 5);
135//! assert_eq!(vec.get(0), Some(-5));
136//! assert_eq!(vec.get(2), Some(-100));
137//! ```
138//!
139//! ## Manual Codec and Sampling Rate
140//!
141//! For fine-grained control, use the [`IntVecBuilder`]. Here, we specify a
142//! sampling rate of `k=8` and use the `Zeta` code with `k=3`.
143//!
144//! ```
145//! use compressed_intvec::variable::{IntVec, UIntVec, VariableCodecSpec};
146//!
147//! let data: Vec<u64> = (0..100).map(|i| i * i).collect();
148//!
149//! let vec: UIntVec<u64> = IntVec::builder()
150//!     .k(8) // Set sampling rate
151//!     .codec(VariableCodecSpec::Zeta { k: Some(3) }) // Set compression codec
152//!     .build(&data)
153//!     .unwrap();
154//!
155//! assert_eq!(vec.get_sampling_rate(), 8);
156//! assert_eq!(vec.get(10), Some(100));
157//! ```
158//!
159//! Best performance is achieved when the sampling rate `k` is a power of two. Usually a value of `32` or `16` is a good trade-off between speed and compression ratio.
160//! 
161//! ## Codec Selection and Performance
162//!
163//! The choice of compression codec is critical for performance and space efficiency.
164//! [`IntVecBuilder`] offers automatic codec selection via 
165//! [`VariableCodecSpec::Auto`]. When enabled, the builder analyzes the entire input
166//! dataset to find the codec that offers the best compression ratio.
167//!
168//! This analysis involves calculating the compressed size for the data with
169//! approximately 70 different codec configurations. This process introduces a
170//! significant, one-time **construction overhead**.
171//! 
172//! Use [`Auto`](VariableCodecSpec::Auto) for read-heavy workloads where the [`IntVec`] 
173//! is built once and then accessed many times. The initial cost is easily amortized by
174//! the long-term space savings.
175//! 
176//! If your application creates many small [`IntVec`]s or accesses them frequently,
177//! the repeated cost of analysis can become a performance
178//! bottleneck. In such scenarios, it is better to explicitly specify a codec
179//! (e.g., [`VariableCodecSpec::Gamma`] or [`VariableCodecSpec::Delta`]) that is known
180//! to be a good general-purpose choice for your data.
181//! 
182//! ```
183//! use compressed_intvec::prelude::*;
184//! 
185//! let data: Vec<u32> = (0..100).collect();
186//! 
187//! // Create an IntVec with automatic codec selection
188//! let vec: UIntVec<u32> = IntVec::builder()
189//!     .build(&data)
190//!     .unwrap();
191//!```
192//!  
193//! [`dsi-bitstream`]: https://docs.rs/dsi-bitstream/latest/dsi_bitstream/
194
195#[macro_use]
196mod macros;
197
198pub mod builder;
199pub mod codec;
200pub mod iter;
201#[cfg(feature = "parallel")]
202mod parallel;
203pub mod reader;
204pub mod seq_reader;
205#[cfg(feature = "serde")]
206pub mod serde;
207pub mod slice;
208pub mod traits;
209
210pub use self::{codec::VariableCodecSpec, traits::Storable};
211use crate::fixed::{Error as FixedVecError, FixedVec};
212use dsi_bitstream::{
213    codes::params::DefaultReadParams,
214    dispatch::StaticCodeRead,
215    prelude::{
216        BitRead, BitSeek, BufBitReader, BufBitWriter, Codes, CodesRead, CodesWrite, Endianness,
217        MemWordReader, MemWordWriterVec,
218    },
219    traits::{BitWrite, BE, LE},
220};
221use mem_dbg::{DbgFlags, MemDbgImpl, MemSize, SizeFlags};
222use std::{
223    error::Error,
224    fmt::{self, Write},
225    marker::PhantomData,
226};
227
228pub use builder::{IntVecBuilder, IntVecFromIterBuilder};
229use iter::{IntVecIntoIter, IntVecIter};
230pub use reader::IntVecReader;
231pub use seq_reader::IntVecSeqReader;
232pub use slice::IntVecSlice;
233
234/// Defines the set of errors that can occur in `IntVec` operations.
235#[derive(Debug)]
236pub enum IntVecError {
237    /// An error occurred during an I/O operation, typically from the underlying
238    /// bitstream reader or writer.
239    Io(std::io::Error),
240    /// A generic error from the [`dsi-bitstream`](https://crates.io/crates/dsi-bitstream) library, often related to decoding malformed data.
241    Bitstream(Box<dyn Error + Send + Sync>),
242    /// An error indicating that one or more parameters are invalid for the
243    /// requested operation.
244    InvalidParameters(String),
245    /// An error that occurs during the dynamic dispatch of codec functions.
246    CodecDispatch(String),
247    /// An error indicating that a provided index is outside the valid bounds
248    /// of the vector.
249    IndexOutOfBounds(usize),
250}
251
252impl fmt::Display for IntVecError {
253    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
254        match self {
255            IntVecError::Io(e) => write!(f, "I/O error: {}", e),
256            IntVecError::Bitstream(e) => write!(f, "Bitstream error: {}", e),
257            IntVecError::InvalidParameters(s) => write!(f, "Invalid parameters: {}", s),
258            IntVecError::CodecDispatch(s) => write!(f, "Codec dispatch error: {}", s),
259            IntVecError::IndexOutOfBounds(index) => write!(f, "Index out of bounds: {}", index),
260        }
261    }
262}
263
264impl Error for IntVecError {
265    fn source(&self) -> Option<&(dyn Error + 'static)> {
266        match self {
267            IntVecError::Io(e) => Some(e),
268            IntVecError::Bitstream(e) => Some(e.as_ref()),
269            _ => None,
270        }
271    }
272}
273
274impl From<std::io::Error> for IntVecError {
275    fn from(e: std::io::Error) -> Self {
276        IntVecError::Io(e)
277    }
278}
279
280impl From<core::convert::Infallible> for IntVecError {
281    fn from(_: core::convert::Infallible) -> Self {
282        unreachable!()
283    }
284}
285
286impl From<FixedVecError> for IntVecError {
287    fn from(e: FixedVecError) -> Self {
288        IntVecError::InvalidParameters(e.to_string())
289    }
290}
291
292/// A compressed, randomly accessible vector of integers using variable-length encoding.
293///
294/// `IntVec` achieves compression by using instantaneous codes and enables fast,
295/// amortized O(1) random access via a sampling mechanism. See the
296/// [module-level documentation](crate::variable) for a detailed explanation.
297///
298/// # Type Parameters
299///
300/// - `T`: The integer type for the elements (e.g., `u32`, `i16`). It must
301///   implement the [`Storable`] trait.
302/// - `E`: The [`Endianness`] of the underlying bitstream (e.g., [`LE`] or [`BE`]).
303/// - `B`: The backend storage buffer, such as `Vec<u64>` for an owned vector or
304///   `&[u64]` for a borrowed, zero-copy view.
305#[derive(Debug, Clone)]
306pub struct IntVec<T: Storable, E: Endianness, B: AsRef<[u64]> = Vec<u64>> {
307    /// The raw, bit-packed compressed data.
308    pub(super) data: B,
309    /// A `FixedVec` containing the bit offsets of sampled elements.
310    pub(super) samples: FixedVec<u64, u64, LE, B>,
311    /// The sampling rate `k`. Every `k`-th element's position is stored.
312    pub(super) k: usize,
313    /// The number of elements in the vector.
314    pub(super) len: usize,
315    /// The `dsi-bitstream` code used for compression.
316    pub(super) encoding: Codes,
317    /// Zero-sized markers for the generic type parameters.
318    pub(super) _markers: PhantomData<(T, E)>,
319}
320
321impl<T: Storable, E: Endianness, B: AsRef<[u64]> + MemSize> MemSize for IntVec<T, E, B> {
322    fn mem_size(&self, flags: SizeFlags) -> usize {
323        // Start with the stack size of the struct itself.
324        let mut total_size = core::mem::size_of::<Self>();
325        // Add the heap-allocated memory for the `data` field.
326        total_size += self.data.mem_size(flags) - core::mem::size_of::<B>();
327        // Add the heap-allocated memory for the `samples` field's internal buffer.
328        total_size +=
329            self.samples.mem_size(flags) - core::mem::size_of::<FixedVec<u64, u64, LE, B>>();
330        total_size
331    }
332}
333
334// A local wrapper for `dsi_bitstream::codes::Codes` to override its `MemDbgImpl`.
335// This is necessary because the derived implementation for `Codes` is incorrect
336// and cannot be fixed due to the orphan rule.
337struct CodeWrapper<'a>(&'a Codes);
338
339impl mem_dbg::MemSize for CodeWrapper<'_> {
340    fn mem_size(&self, _flags: mem_dbg::SizeFlags) -> usize {
341        core::mem::size_of_val(self.0)
342    }
343}
344
345impl mem_dbg::MemDbgImpl for CodeWrapper<'_> {
346    // Override the top-level display function for this type.
347    fn _mem_dbg_depth_on(
348        &self,
349        writer: &mut impl core::fmt::Write,
350        total_size: usize,
351        max_depth: usize,
352        prefix: &mut String,
353        field_name: Option<&str>,
354        is_last: bool,
355        padded_size: usize,
356        flags: DbgFlags,
357    ) -> core::fmt::Result {
358        if prefix.len() > max_depth {
359            return Ok(());
360        }
361
362        let real_size = self.mem_size(flags.to_size_flags());
363        let mut buffer = String::new(); // Use a temp buffer to format the size part.
364
365        // Replicate the size and percentage formatting from the default `MemDbgImpl`.
366        if flags.contains(DbgFlags::HUMANIZE) {
367            let (value, uom) = mem_dbg::humanize_float(real_size as f64);
368            if uom == " B" {
369                let _ = write!(&mut buffer, "{:>5}  B ", real_size);
370            } else {
371                let precision = if value.abs() >= 100.0 {
372                    1
373                } else if value.abs() >= 10.0 {
374                    2
375                } else {
376                    3
377                };
378                let _ = write!(&mut buffer, "{0:>4.1$} {2} ", value, precision, uom);
379            }
380        } else {
381            let align = mem_dbg::n_of_digits(total_size);
382            let _ = write!(&mut buffer, "{:>align$} B ", real_size, align = align);
383        }
384
385        if flags.contains(DbgFlags::PERCENTAGE) {
386            let percentage = if total_size == 0 {
387                100.0
388            } else {
389                100.0 * real_size as f64 / total_size as f64
390            };
391            let _ = write!(&mut buffer, "{:>6.2}% ", percentage);
392        }
393
394        // Write the formatted size string with colors if enabled.
395        if flags.contains(DbgFlags::COLOR) {
396            writer.write_fmt(format_args!("{}", mem_dbg::color(real_size)))?;
397        }
398        writer.write_str(&buffer)?;
399        if flags.contains(DbgFlags::COLOR) {
400            writer.write_fmt(format_args!("{}", mem_dbg::reset_color()))?;
401        }
402
403        // Write the tree structure part.
404        if !prefix.is_empty() {
405            writer.write_str(&prefix[2..])?;
406            writer.write_char(if is_last { '╰' } else { '├' })?;
407            writer.write_char('╴')?;
408        }
409
410        if let Some(field_name) = field_name {
411            writer.write_fmt(format_args!("{}", field_name))?;
412        }
413
414        // This is the custom part: print the `Debug` format of the enum.
415        if flags.contains(DbgFlags::TYPE_NAME) {
416            if flags.contains(DbgFlags::COLOR) {
417                writer.write_fmt(format_args!("{}", mem_dbg::type_color()))?;
418            }
419            writer.write_fmt(format_args!(": {:?}", self.0))?;
420            if flags.contains(DbgFlags::COLOR) {
421                writer.write_fmt(format_args!("{}", mem_dbg::reset_color()))?;
422            }
423        }
424
425        // Correctly calculate and print padding.
426        let padding = padded_size - core::mem::size_of_val(self.0);
427        if padding != 0 {
428            writer.write_fmt(format_args!(" [{}B]", padding))?;
429        }
430
431        writer.write_char('\n')?;
432        Ok(())
433    }
434
435    // It's a leaf node in the display tree, so no recursion is needed.
436    fn _mem_dbg_rec_on(
437        &self,
438        _writer: &mut impl core::fmt::Write,
439        _total_size: usize,
440        _max_depth: usize,
441        _prefix: &mut String,
442        _is_last: bool,
443        _flags: DbgFlags,
444    ) -> core::fmt::Result {
445        Ok(())
446    }
447}
448
449impl<T: Storable, E: Endianness, B: AsRef<[u64]> + MemDbgImpl> MemDbgImpl for IntVec<T, E, B> {
450    fn _mem_dbg_rec_on(
451        &self,
452        writer: &mut impl core::fmt::Write,
453        total_size: usize,
454        max_depth: usize,
455        prefix: &mut String,
456        _is_last: bool,
457        flags: DbgFlags,
458    ) -> core::fmt::Result {
459        // Manually display each field, ensuring correct tree structure.
460        self.data._mem_dbg_depth_on(
461            writer,
462            total_size,
463            max_depth,
464            prefix,
465            Some("data"),
466            false,
467            core::mem::size_of_val(&self.data),
468            flags,
469        )?;
470        self.samples._mem_dbg_depth_on(
471            writer,
472            total_size,
473            max_depth,
474            prefix,
475            Some("samples"),
476            false,
477            core::mem::size_of_val(&self.samples),
478            flags,
479        )?;
480        self.k._mem_dbg_depth_on(
481            writer,
482            total_size,
483            max_depth,
484            prefix,
485            Some("k"),
486            false,
487            core::mem::size_of_val(&self.k),
488            flags,
489        )?;
490        self.len._mem_dbg_depth_on(
491            writer,
492            total_size,
493            max_depth,
494            prefix,
495            Some("len"),
496            false,
497            core::mem::size_of_val(&self.len),
498            flags,
499        )?;
500
501        // Use the custom wrapper to correctly display the `encoding` field.
502        let code_wrapper = CodeWrapper(&self.encoding);
503        code_wrapper._mem_dbg_depth_on(
504            writer,
505            total_size,
506            max_depth,
507            prefix,
508            Some("encoding"),
509            false, // Not the last field.
510            core::mem::size_of_val(&self.encoding),
511            flags,
512        )?;
513
514        self._markers._mem_dbg_depth_on(
515            writer,
516            total_size,
517            max_depth,
518            prefix,
519            Some("_markers"),
520            true, // This is the last field.
521            core::mem::size_of_val(&self._markers),
522            flags,
523        )?;
524        Ok(())
525    }
526}
527
528/// Type alias for the bit writer used internally by `IntVec` builders.
529pub(crate) type IntVecBitWriter<E> = BufBitWriter<E, MemWordWriterVec<u64, Vec<u64>>>;
530/// Type alias for the bit reader used internally by `IntVec` accessors.
531pub(crate) type IntVecBitReader<'a, E> =
532    BufBitReader<E, MemWordReader<u64, &'a [u64]>, DefaultReadParams>;
533
534impl<T: Storable + 'static, E: Endianness> IntVec<T, E, Vec<u64>> {
535    /// Creates a builder for constructing an owned [`IntVec`] from a slice of data.
536    ///
537    /// This is the most flexible way to create an [`IntVec`], allowing customization
538    /// of the compression codec and sampling rate.
539    ///
540    /// # Examples
541    ///
542    /// ```
543    /// use compressed_intvec::variable::{IntVec, UIntVec, VariableCodecSpec};
544    ///
545    /// let data: &[u32] = &[5, 8, 13, 21, 34];
546    /// let vec: UIntVec<u32> = IntVec::builder()
547    ///     .k(2) // Sample every 2nd element
548    ///     .codec(VariableCodecSpec::Delta)
549    ///     .build(data)
550    ///     .unwrap();
551    ///
552    /// assert_eq!(vec.get(3), Some(21));
553    /// ```
554    pub fn builder() -> IntVecBuilder<T, E> {
555        IntVecBuilder::new()
556    }
557
558    /// Creates a builder for constructing an owned [`IntVec`] from an iterator.
559    ///
560    /// This is useful for large datasets that are generated on the fly.
561    pub fn from_iter_builder<I>(iter: I) -> IntVecFromIterBuilder<T, E, I>
562    where
563        I: IntoIterator<Item = T> + Clone,
564    {
565        IntVecFromIterBuilder::new(iter)
566    }
567
568    /// Consumes the [`IntVec`] and returns its decoded values as a standard `Vec<T>`.
569    ///
570    /// # Examples
571    ///
572    /// ```
573    /// use compressed_intvec::variable::{IntVec, SIntVec};
574    ///
575    /// let data: &[i32] = &[-10, 0, 10];
576    /// let vec: SIntVec<i32> = IntVec::from_slice(data).unwrap();
577    /// let decoded_data = vec.into_vec();
578    ///
579    /// assert_eq!(decoded_data, &[-10, 0, 10]);
580    /// ```
581    pub fn into_vec(self) -> Vec<T>
582    where
583        for<'a> IntVecBitReader<'a, E>: BitRead<E, Error = core::convert::Infallible>
584            + CodesRead<E>
585            + BitSeek<Error = core::convert::Infallible>,
586    {
587        self.into_iter().collect()
588    }
589
590    /// Creates an owned [`IntVec`] from a slice of data using default settings.
591    ///
592    /// This method uses [`VariableCodecSpec::Auto`] to select a codec and a
593    /// default sampling rate of `k=16`.
594    pub fn from_slice(slice: &[T]) -> Result<Self, IntVecError>
595    where
596        for<'a> crate::variable::IntVecBitWriter<E>:
597            BitWrite<E, Error = core::convert::Infallible> + CodesWrite<E>,
598    {
599        Self::builder()
600            .k(16)
601            .codec(VariableCodecSpec::Auto)
602            .build(slice)
603    }
604}
605
606impl<T: Storable, E: Endianness, B: AsRef<[u64]>> IntVec<T, E, B> {
607    /// Creates a new [`IntVec`] from its raw components, enabling zero-copy views.
608    ///
609    /// This constructor is intended for advanced use cases, such as memory-mapping
610    /// a pre-built [`IntVec`] from disk without copying the data.
611    ///
612    /// # Errors
613    ///
614    /// Returns an [`IntVecError::InvalidParameters`] if `k` is zero or if the
615    /// number of samples is inconsistent with `len` and `k`.
616    pub fn from_parts(
617        data: B,
618        samples_data: B,
619        samples_len: usize,
620        samples_num_bits: usize,
621        k: usize,
622        len: usize,
623        encoding: Codes,
624    ) -> Result<Self, IntVecError> {
625        let samples =
626            FixedVec::<u64, u64, LE, B>::from_parts(samples_data, samples_len, samples_num_bits)?;
627
628        if k == 0 {
629            return Err(IntVecError::InvalidParameters(
630                "Sampling rate k cannot be zero".to_string(),
631            ));
632        }
633        let expected_samples = if len == 0 { 0 } else { len.div_ceil(k) };
634        if samples.len() != expected_samples {
635            return Err(IntVecError::InvalidParameters(format!(
636                "Inconsistent number of samples. Expected {}, found {}",
637                expected_samples,
638                samples.len()
639            )));
640        }
641
642        Ok(unsafe { Self::new_unchecked(data, samples, k, len, encoding) })
643    }
644
645    /// Creates a new [`IntVec`] from its raw parts without performing safety checks.
646    ///
647    /// # Safety
648    ///
649    /// The caller must ensure that all parameters are consistent and valid. The
650    /// `samples` must contain the correct bit offsets for the `data` stream,
651    /// and `len`, `k`, and `encoding` must accurately describe the layout.
652    /// Mismatched parameters will lead to panics or incorrect data retrieval.
653    pub(crate) unsafe fn new_unchecked(
654        data: B,
655        samples: FixedVec<u64, u64, LE, B>,
656        k: usize,
657        len: usize,
658        encoding: Codes,
659    ) -> Self {
660        Self {
661            data,
662            samples,
663            k,
664            len,
665            encoding,
666            _markers: PhantomData,
667        }
668    }
669
670    /// Creates a zero-copy, immutable view (a _slice_) of this vector.
671    ///
672    /// Returns `None` if the specified range is out of bounds.
673    ///
674    /// # Examples
675    ///
676    /// ```
677    /// use compressed_intvec::variable::{IntVec, UIntVec};
678    ///
679    /// let data: Vec<u32> = (0..20).collect();
680    /// let vec: UIntVec<u32> = IntVec::from_slice(&data).unwrap();
681    /// let slice = vec.slice(5, 10).unwrap();
682    ///
683    /// assert_eq!(slice.len(), 10);
684    /// assert_eq!(slice.get(0), Some(5)); // Corresponds to index 5 of the original vec
685    /// ```
686    pub fn slice(&'_ self, start: usize, len: usize) -> Option<IntVecSlice<'_, T, E, B>> {
687        if start.saturating_add(len) > self.len {
688            return None;
689        }
690        Some(IntVecSlice::new(self, start..start + len))
691    }
692
693    /// Splits the vector into two immutable slices at a given index.
694    ///
695    /// Returns `None` if `mid` is out of bounds.
696    #[allow(clippy::type_complexity)]
697    pub fn split_at(
698        &'_ self,
699        mid: usize,
700    ) -> Option<(IntVecSlice<'_, T, E, B>, IntVecSlice<'_, T, E, B>)> {
701        if mid > self.len {
702            return None;
703        }
704        let left = IntVecSlice::new(self, 0..mid);
705        let right = IntVecSlice::new(self, mid..self.len);
706        Some((left, right))
707    }
708
709    /// Returns the number of integers in the vector.
710    #[inline]
711    pub fn len(&self) -> usize {
712        self.len
713    }
714
715    /// Returns `true` if the vector contains no elements.
716    #[inline]
717    pub fn is_empty(&self) -> bool {
718        self.len == 0
719    }
720
721    /// Returns the sampling rate `k` used during encoding.
722    #[inline]
723    pub fn get_sampling_rate(&self) -> usize {
724        self.k
725    }
726
727    /// Returns the number of sample points stored in the vector.
728    #[inline]
729    pub fn get_num_samples(&self) -> usize {
730        self.samples.len()
731    }
732
733    /// Returns a reference to the inner `FixedVec` of samples.
734    #[inline]
735    pub fn samples_ref(&self) -> &FixedVec<u64, u64, LE, B> {
736        &self.samples
737    }
738
739    /// Returns a read-only slice of the underlying compressed data words (`&[u64]`).
740    #[inline]
741    pub fn as_limbs(&self) -> &[u64] {
742        self.data.as_ref()
743    }
744
745    /// Returns the concrete [`Codes`] variant that was used for compression.
746    #[inline]
747    pub fn encoding(&self) -> Codes {
748        self.encoding
749    }
750
751    /// Returns a clone of the underlying storage as a `Vec<u64>`.
752    pub fn limbs(&self) -> Vec<u64> {
753        self.data.as_ref().to_vec()
754    }
755
756    /// Returns an iterator over the decompressed values.
757    pub fn iter(&'_ self) -> impl Iterator<Item = T> + '_
758    where
759        for<'a> IntVecBitReader<'a, E>: BitRead<E, Error = core::convert::Infallible>
760            + CodesRead<E>
761            + BitSeek<Error = core::convert::Infallible>,
762    {
763        IntVecIter::new(self)
764    }
765}
766
767impl<T: Storable, E: Endianness, B: AsRef<[u64]>> IntVec<T, E, B>
768where
769    for<'a> IntVecBitReader<'a, E>: BitRead<E, Error = core::convert::Infallible>
770        + CodesRead<E>
771        + BitSeek<Error = core::convert::Infallible>,
772{
773    /// Creates a reusable, stateless reader for efficient random access.
774    ///
775    /// This method returns an [`IntVecReader`], a struct that maintains a persistent,
776    /// reusable bitstream reader. This amortizes the setup cost across multiple `get`
777    /// operations, making it more efficient than calling [`get`](IntVec::get) repeatedly in a loop.
778    ///
779    /// This reader is **stateless**: it performs a full seek from the nearest sample point for each call,
780    /// independently of any previous access.
781    ///
782    /// # When to use it
783    /// Use [`IntVecReader`] for true random access patterns where lookup indices are sparse,
784    /// unordered, or not known in advance (e.g., graph traversals, pointer chasing).
785    /// For accessing a known set of indices, [`get_many`](IntVec::get_many) is generally superior.
786    ///
787    /// # Examples
788    ///
789    /// ```
790    /// use compressed_intvec::prelude::*;
791    ///
792    /// let data: Vec<u32> = (0..100).rev().collect(); // Data is not sequential
793    /// let vec: UIntVec<u32> = IntVec::from_slice(&data).unwrap();
794    ///
795    /// // Create a reusable reader for multiple random lookups
796    /// let mut reader = vec.reader();
797    ///
798    /// assert_eq!(reader.get(99).unwrap(), Some(0));
799    /// assert_eq!(reader.get(0).unwrap(), Some(99));
800    /// assert_eq!(reader.get(50).unwrap(), Some(49));
801    /// ```
802    pub fn reader(&'_ self) -> IntVecReader<'_, T, E, B> {
803        IntVecReader::new(self)
804    }
805
806    /// Creates a stateful, reusable reader optimized for sequential access.
807    ///
808    /// This method returns an [`IntVecSeqReader`], which is specifically designed
809    /// to take advantage of the vector's internal state, tracking the current decoding position (cursor).
810    ///
811    /// This statefulness enables a key optimization:
812    /// - **Fast Path**: If a requested index is at or after the cursor and within
813    ///   the same sample block, the reader decodes forward from its last known
814    ///   position. This avoids a costly seek operation.
815    /// - **Fallback Path**: If the requested index is before the cursor (requiring a
816    ///   backward move) or in a different sample block, the reader falls back to
817    ///   the standard behavior of seeking to the nearest sample point.
818    ///
819    /// # When to use it
820    /// Use [`IntVecSeqReader`] when your access pattern has high locality, meaning
821    /// indices are primarily increasing and often clustered together. It is ideal
822    /// for iterating through a sorted list of indices or for stream-like processing.
823    ///
824    /// # Examples
825    ///
826    /// ```
827    /// use compressed_intvec::prelude::*;
828    ///
829    /// let data: Vec<u32> = (0..100).collect();
830    /// let vec: UIntVec<u32> = IntVec::from_slice(&data).unwrap();
831    ///
832    /// // Create a reader optimized for sequential access
833    /// let mut seq_reader = vec.seq_reader();
834    ///
835    /// // Accessing indices in increasing order is efficient
836    /// assert_eq!(seq_reader.get(10).unwrap(), Some(10));
837    /// // This next call is fast, as it decodes forward from index 10
838    /// assert_eq!(seq_reader.get(15).unwrap(), Some(15));
839    ///
840    /// // A large jump will trigger a seek to a new sample block
841    /// assert_eq!(seq_reader.get(90).unwrap(), Some(90));
842    ///
843    /// // A backward jump will also trigger a seek
844    /// assert_eq!(seq_reader.get(5).unwrap(), Some(5));
845    /// ```
846    pub fn seq_reader(&'_ self) -> IntVecSeqReader<'_, T, E, B> {
847        IntVecSeqReader::new(self)
848    }
849
850    /// Returns the element at the specified index, or `None` if the index is
851    /// out of bounds.
852    ///
853    /// This operation is amortized O(1).
854    ///
855    /// # Examples
856    ///
857    /// ```
858    /// use compressed_intvec::variable::{IntVec, UIntVec};
859    ///
860    /// let data: Vec<u32> = (0..100).collect();
861    /// let vec: UIntVec<u32> = IntVec::from_slice(&data).unwrap();
862    ///
863    /// assert_eq!(vec.get(50), Some(50));
864    /// assert_eq!(vec.get(100), None);
865    /// ```
866    #[inline]
867    pub fn get(&self, index: usize) -> Option<T> {
868        if index >= self.len {
869            return None;
870        }
871        Some(unsafe { self.get_unchecked(index) })
872    }
873
874    /// Returns the element at the specified index without bounds checking.
875    ///
876    /// # Safety
877    ///
878    /// Calling this method with an out-of-bounds `index` is undefined behavior.
879    /// The `index` must be less than the vector's `len`.
880    #[inline]
881    pub unsafe fn get_unchecked(&self, index: usize) -> T {
882        let mut reader = self.reader();
883        reader.get_unchecked(index)
884    }
885
886    /// Retrieves multiple elements from the vector at the specified indices.
887    ///
888    /// This method is generally more efficient than calling [`get`](Self::get) in a loop, as
889    /// it sorts the indices and scans through the compressed data stream once.
890    ///
891    /// # Errors
892    ///
893    /// Returns [`IntVecError::IndexOutOfBounds`] if any index is out of bounds.
894    ///
895    /// # Examples
896    ///
897    /// ```
898    /// use compressed_intvec::variable::{IntVec, UIntVec};
899    ///
900    /// let data: Vec<u32> = (0..100).collect();
901    /// let vec: UIntVec<u32> = IntVec::from_slice(&data).unwrap();
902    ///
903    /// let indices = [99, 0, 50];
904    /// let values = vec.get_many(&indices).unwrap();
905    /// assert_eq!(values, vec![99, 0, 50]);
906    /// ```
907    pub fn get_many(&self, indices: &[usize]) -> Result<Vec<T>, IntVecError> {
908        if indices.is_empty() {
909            return Ok(Vec::new());
910        }
911
912        for &index in indices {
913            if index >= self.len {
914                return Err(IntVecError::IndexOutOfBounds(index));
915            }
916        }
917        // SAFETY: We have just performed the bounds checks.
918        Ok(unsafe { self.get_many_unchecked(indices) })
919    }
920
921    /// Retrieves multiple elements without bounds checking.
922    ///
923    /// # Safety
924    ///
925    /// Calling this method with any out-of-bounds index is undefined behavior.
926    #[allow(clippy::uninit_vec)]
927    pub unsafe fn get_many_unchecked(&self, indices: &[usize]) -> Vec<T> {
928        if indices.is_empty() {
929            return Vec::new();
930        }
931        let mut results = Vec::with_capacity(indices.len());
932        // SAFETY: The vector is immediately populated by the sorted access logic below.
933        results.set_len(indices.len());
934
935        let mut indexed_indices: Vec<(usize, usize)> = indices
936            .iter()
937            .enumerate()
938            .map(|(i, &idx)| (idx, i))
939            .collect();
940        // Sort by the target index to enable efficient sequential scanning.
941        indexed_indices.sort_unstable_by_key(|&(idx, _)| idx);
942
943        if self.k.is_power_of_two() {
944            // Optimization: use bit-shift for division if k is a power of two.
945            let k_exp = self.k.trailing_zeros();
946            self.get_many_dsi_inner(
947                &indexed_indices,
948                &mut results,
949                |idx| idx >> k_exp,
950                |block| block << k_exp,
951            )
952            .unwrap();
953        } else {
954            self.get_many_dsi_inner(
955                &indexed_indices,
956                &mut results,
957                |idx| idx / self.k,
958                |block| block * self.k,
959            )
960            .unwrap();
961        }
962
963        results
964    }
965
966    /// Internal implementation for `get_many_unchecked`.
967    ///
968    /// This function takes closures to abstract away the division/multiplication
969    /// by `k`, allowing for a bit-shift optimization when `k` is a power of two.
970    fn get_many_dsi_inner<F1, F2>(
971        &self,
972        indexed_indices: &[(usize, usize)],
973        results: &mut [T],
974        block_of: F1,
975        start_of_block: F2,
976    ) -> Result<(), IntVecError>
977    where
978        F1: Fn(usize) -> usize,
979        F2: Fn(usize) -> usize,
980    {
981        let mut reader = self.reader();
982        let mut current_decoded_index: usize = 0;
983
984        for &(target_index, original_position) in indexed_indices {
985            // Check if we need to jump to a new sample block. This is true if the
986            // target index is before our current position, or if it's in a different
987            // sample block than the one we're currently in.
988            if target_index < current_decoded_index
989                || block_of(target_index) != block_of(current_decoded_index.saturating_sub(1))
990            {
991                let target_sample_block = block_of(target_index);
992                // SAFETY: The public-facing `get_many` performs bounds checks.
993                let start_bit = unsafe { self.samples.get_unchecked(target_sample_block) };
994                reader.reader.set_bit_pos(start_bit)?;
995                current_decoded_index = start_of_block(target_sample_block);
996            }
997
998            // Sequentially decode elements until we reach our target.
999            for _ in current_decoded_index..target_index {
1000                reader.code_reader.read(&mut reader.reader)?;
1001            }
1002            let value = reader.code_reader.read(&mut reader.reader)?;
1003            // Place the decoded value in its original requested position.
1004            results[original_position] = Storable::from_word(value);
1005            current_decoded_index = target_index + 1;
1006        }
1007        Ok(())
1008    }
1009
1010    /// Retrieves multiple elements from an iterator of indices.
1011    ///
1012    /// This is a convenient alternative to [`get_many`](Self::get_many) when the indices are not
1013    /// already in a slice. It may be less performant as it cannot pre-sort the
1014    /// indices for optimal access.
1015    pub fn get_many_from_iter<I>(&self, indices: I) -> Result<Vec<T>, IntVecError>
1016    where
1017        I: IntoIterator<Item = usize>,
1018    {
1019        let indices_iter = indices.into_iter();
1020        let (lower_bound, _) = indices_iter.size_hint();
1021        let mut results = Vec::with_capacity(lower_bound);
1022        let mut seq_reader = self.seq_reader();
1023
1024        for index in indices_iter {
1025            let value = seq_reader
1026                .get(index)?
1027                .ok_or(IntVecError::IndexOutOfBounds(index))?;
1028            results.push(value);
1029        }
1030
1031        Ok(results)
1032    }
1033}
1034
1035impl<T: Storable + Ord, E: Endianness, B: AsRef<[u64]>> IntVec<T, E, B>
1036where
1037    for<'a> IntVecBitReader<'a, E>: BitRead<E, Error = core::convert::Infallible>
1038        + CodesRead<E>
1039        + BitSeek<Error = core::convert::Infallible>,
1040{
1041    /// Binary searches this vector for a given element.
1042    ///
1043    /// If the value is found, returns `Ok(usize)` with the index of the
1044    /// matching element. If the value is not found, returns `Err(usize)` with
1045    /// the index where the value could be inserted to maintain order.
1046    ///
1047    /// # Complexity
1048    ///
1049    /// The time complexity of this operation is O(k * log n), where `n` is the
1050    /// number of elements in the vector and `k` is the sampling rate. This is
1051    /// because each of the O(log n) probes during the search requires an
1052    /// element access, which has a cost proportional to `k` in the worst case.
1053    ///
1054    /// # Examples
1055    ///
1056    /// ```
1057    /// use compressed_intvec::variable::{IntVec, SIntVec};
1058    ///
1059    /// let data: &[i32] = &[-10, 0, 10, 20, 30];
1060    /// let vec: SIntVec<i32> = IntVec::from_slice(data).unwrap();
1061    ///
1062    /// assert_eq!(vec.binary_search(&10), Ok(2));
1063    /// assert_eq!(vec.binary_search(&15), Err(3));
1064    /// ```
1065    pub fn binary_search(&self, value: &T) -> Result<usize, usize> {
1066        self.binary_search_by(|probe| probe.cmp(value))
1067    }
1068
1069    /// Binary searches this vector with a custom comparison function.
1070    ///
1071    /// # Complexity
1072    ///
1073    /// The time complexity of this operation is O(k * log n), where `n` is the
1074    /// number of elements in the vector and `k` is the sampling rate.
1075    #[inline]
1076    pub fn binary_search_by<F>(&self, mut f: F) -> Result<usize, usize>
1077    where
1078        F: FnMut(T) -> std::cmp::Ordering,
1079    {
1080        let mut low = 0;
1081        let mut high = self.len();
1082        let mut reader = self.reader();
1083
1084        while low < high {
1085            let mid = low + (high - low) / 2;
1086            // SAFETY: The loop invariants ensure `mid` is always in bounds.
1087            let cmp = f(unsafe { reader.get_unchecked(mid) });
1088
1089            match cmp {
1090                std::cmp::Ordering::Less => low = mid + 1,
1091                std::cmp::Ordering::Equal => return Ok(mid),
1092                std::cmp::Ordering::Greater => high = mid,
1093            }
1094        }
1095        Err(low)
1096    }
1097
1098    /// Binary searches this vector with a key extraction function.
1099    ///
1100    /// # Complexity
1101    ///
1102    /// The time complexity of this operation is O(k * log n), where `n` is the
1103    /// number of elements in the vector and `k` is the sampling rate.
1104    #[inline]
1105    pub fn binary_search_by_key<K: Ord, F>(&self, b: &K, mut f: F) -> Result<usize, usize>
1106    where
1107        F: FnMut(T) -> K,
1108    {
1109        self.binary_search_by(|k| f(k).cmp(b))
1110    }
1111}
1112
1113impl<T: Storable + 'static, E: Endianness + 'static> IntoIterator for IntVec<T, E, Vec<u64>>
1114where
1115    for<'a> IntVecBitReader<'a, E>: BitRead<E, Error = core::convert::Infallible>
1116        + CodesRead<E>
1117        + BitSeek<Error = core::convert::Infallible>,
1118{
1119    type Item = T;
1120    type IntoIter = IntVecIntoIter<T, E>;
1121
1122    fn into_iter(self) -> Self::IntoIter {
1123        IntVecIntoIter::new(self)
1124    }
1125}
1126
1127/// An [`IntVec`] for unsigned integers with Little-Endian bit layout.
1128pub type UIntVec<T> = IntVec<T, LE>;
1129/// An [`IntVec`] for signed integers with Little-Endian bit layout.
1130pub type SIntVec<T> = IntVec<T, LE>;
1131/// An [`IntVec`] for [`u64`] elements with Big-Endian bit layout.
1132pub type BEIntVec = IntVec<u64, BE>;
1133/// An [`IntVec`] for [`u64`] elements with Little-Endian bit layout.
1134pub type LEIntVec = IntVec<u64, LE>;
1135/// An [`IntVec`] for [`i64`] elements with Big-Endian bit layout.
1136pub type BESIntVec = IntVec<i64, BE>;
1137/// An [`IntVec`] for [`i64`] elements with Little-Endian bit layout.
1138pub type LESIntVec = IntVec<i64, LE>;
1139
1140impl<T, E, B, O> PartialEq<O> for IntVec<T, E, B>
1141where
1142    T: Storable + PartialEq,
1143    E: Endianness,
1144    B: AsRef<[u64]>,
1145    O: AsRef<[T]>,
1146    for<'a> IntVecBitReader<'a, E>: BitRead<E, Error = core::convert::Infallible>
1147        + CodesRead<E>
1148        + BitSeek<Error = core::convert::Infallible>,
1149{
1150    /// Checks for equality between an [`IntVec`] and a standard slice.
1151    ///
1152    /// The comparison is done by iterating over both and comparing elements
1153    /// one by one. The overall comparison is not a single atomic operation.
1154    fn eq(&self, other: &O) -> bool {
1155        let other_slice = other.as_ref();
1156        if self.len() != other_slice.len() {
1157            return false;
1158        }
1159        self.iter().zip(other_slice.iter()).all(|(a, b)| a == *b)
1160    }
1161}
compressed_intvec/variable/mod.rs

compressed_intvec/variable/
mod.rs