compressed_intvec/variable/mod.rs
1//! A compressed integer vector using variable-length encoding with fast random access.
2//!
3//! This module provides [`IntVec`], a data structure for storing sequences of
4//! integers in a compressed format while retaining efficient random access. It is
5//! well-suited for datasets where integer values are non-uniformly distributed,
6//! as it uses instantaneous variable-length codes to represent smaller numbers with fewer bits.
7//!
8//! # Core Concepts
9//!
10//! ## Variable-Length Encoding
11//!
12//! Unlike [`FixedVec`], which uses a fixed number of
13//! bits for every integer, [`IntVec`] employs **instantaneous codes** (such as
14//! Gamma, Delta, or Rice codes) provided by the [`dsi-bitstream`] crate. This
15//! approach allows each integer to be encoded with a variable number of bits,
16//! typically using shorter codes for smaller or more frequent values. This can
17//! lead to significant space savings, especially for data with many small numbers.
18//!
19//! Signed integers (e.g., `i8`, `i32`) are supported transparently through
20//! zig-zag encoding, which maps small negative and positive integers to small
21//! unsigned integers.
22//!
23//! ## Random Access and Sampling
24//!
25//! A key challenge with variable-length codes is that the location of the *i*-th
26//! element cannot be calculated directly. To solve this, [`IntVec`] implements a
27//! **sampling mechanism**. It stores the bit position of every *k*-th element in
28//! a separate, auxiliary [`FixedVec`]. This parameter, `k`, is the **sampling rate**.
29//!
30//! To access an element at `index`, [`IntVec`]:
31//! 1. Finds the nearest sample by calculating `index / k`.
32//! 2. Retrieves the bit offset of the start of that sampled block.
33//! 3. Jumps to that offset in the compressed data stream.
34//! 4. Sequentially decodes the remaining `index % k` elements to reach the target.
35//!
36//! This strategy provides amortized O(1) random access, as the number of
37//! sequential decoding steps is bounded by `k`.
38//!
39//! ### The `k` Trade-off
40//!
41//! The choice of the sampling rate `k` involves a trade-off:
42//! - **Smaller `k`**: Faster random access (fewer elements to decode per access)
43//! but higher memory usage due to a larger samples table.
44//! - **Larger `k`**: Slower random access but better compression, as the
45//! samples table is smaller.
46//!
47//! The optimal `k` depends on the specific access patterns of an application.
48//!
49//! # Design and Immutability
50//!
51//! [`IntVec`] is **immutable** after creation.
52//! Unlike [`FixedVec`], it does not provide methods for
53//! in-place modification (e.g., `set`, `push`).
54//!
55//! If a value in the middle of the compressed bitstream were changed, its new
56//! encoded length might be different. For example, changing `5` (which might be
57//! 4 bits) to `5000` (which might be 16 bits) would require shifting all
58//! subsequent data, invalidating every sample point that follows. The cost of
59//! such an operation would be prohibitive, scaling with the length of the vector.
60//!
61//! For this reason, [`IntVec`] is designed as a write-once, read-many data structure.
62//!
63//! # Access Strategies and Readers
64//!
65//! [`IntVec`] provides multiple interfaces for accessing data, each optimized for a
66//! different pattern of use.
67//!
68//! - **[`get()`](IntVec::get)**: For single, infrequent lookups. Each call creates and discards
69//! an internal reader, which incurs overhead if used in a loop.
70//!
71//! - **[`get_many()`](IntVec::get_many)**: The most efficient method for retrieving a batch of elements
72//! when all indices are known beforehand. It sorts the indices to perform a
73//! single, monotonic scan over the data, minimizing redundant decoding and seek
74//! operations.
75//!
76//! - **[`IntVecReader`] (Reusable Stateless Reader)**: This reader is created via
77//! [`IntVec::reader()`]. It maintains an internal, reusable bitstream reader,
78//! amortizing its setup cost over multiple calls. Each [`get()`](IntVec::get) call is
79//! **stateless** with respect to position: it performs a full seek to the nearest
80//! sample and decodes forward from there, independently of previous calls. It is
81//! best suited for true random access patterns where indices are sparse and
82//! unpredictable.
83//!
84//! - **[`IntVecSeqReader`] (Stateful Sequential Reader)**: This reader, created via
85//! [`IntVec::seq_reader()`], is a stateful object optimized for access patterns with
86//! high locality. It maintains an internal cursor of the current decoding position.
87//! - **Fast Path**: If a requested `index` is at or after the cursor's current
88//! position and within the same sample block, the reader simply decodes
89//! forward from its last position, avoiding a costly seek operation.
90//! - **Fallback Path**: If the requested `index` is before the cursor or in a
91//! different sample block, the reader falls back to the standard behavior of
92//! seeking to the nearest sample and decoding from there. This makes it very
93//! efficient for iterating through sorted or clustered indices.
94//!
95//! # Main Components
96//!
97//! - [`IntVec`]: The core compressed vector.
98//! - [`IntVecBuilder`]: The primary tool for constructing an [`IntVec`] with
99//! custom compression codecs and sampling rates.
100//! - [`VariableCodecSpec`]: An enum to specify the compression codec.
101//! - [`IntVecReader`]: A reusable, stateless reader for efficient random access.
102//! - [`IntVecSeqReader`]: A stateful reader optimized for sequential or localized access patterns.
103//! - [`IntVecSlice`]: An immutable, zero-copy view over a portion of the vector.
104//!
105//! # Examples
106//!
107//! ## Basic Usage with Unsigned Integers
108//!
109//! Create a [`UIntVec`] (an alias for `IntVec<u32, LE>`) from a slice of `u32`. The builder will automatically
110//! select a suitable codec and use a default sampling rate.
111//!
112//! ```
113//! use compressed_intvec::variable::{IntVec, UIntVec};
114//!
115//! let data: Vec<u32> = vec![100, 200, 300, 1024];
116//! let vec: UIntVec<u32> = IntVec::from_slice(&data).unwrap();
117//!
118//! assert_eq!(vec.len(), 4);
119//! // Accessing an element
120//! assert_eq!(vec.get(1), Some(200));
121//! ```
122//!
123//! ## Storing Signed Integers
124//!
125//! [`IntVec`] handles signed integers, such as [`i16`], by mapping them to unsigned
126//! values using zig-zag encoding.
127//!
128//! ```
129//! use compressed_intvec::variable::{IntVec, SIntVec};
130//!
131//! let data: &[i16] = &[-5, 20, -100, 0, 8];
132//! let vec: SIntVec<i16> = IntVec::from_slice(data).unwrap();
133//!
134//! assert_eq!(vec.len(), 5);
135//! assert_eq!(vec.get(0), Some(-5));
136//! assert_eq!(vec.get(2), Some(-100));
137//! ```
138//!
139//! ## Manual Codec and Sampling Rate
140//!
141//! For fine-grained control, use the [`IntVecBuilder`]. Here, we specify a
142//! sampling rate of `k=8` and use the `Zeta` code with `k=3`.
143//!
144//! ```
145//! use compressed_intvec::variable::{IntVec, UIntVec, VariableCodecSpec};
146//!
147//! let data: Vec<u64> = (0..100).map(|i| i * i).collect();
148//!
149//! let vec: UIntVec<u64> = IntVec::builder()
150//! .k(8) // Set sampling rate
151//! .codec(VariableCodecSpec::Zeta { k: Some(3) }) // Set compression codec
152//! .build(&data)
153//! .unwrap();
154//!
155//! assert_eq!(vec.get_sampling_rate(), 8);
156//! assert_eq!(vec.get(10), Some(100));
157//! ```
158//!
159//! Best performance is achieved when the sampling rate `k` is a power of two. Usually a value of `32` or `16` is a good trade-off between speed and compression ratio.
160//!
161//! ## Codec Selection and Performance
162//!
163//! The choice of compression codec is critical for performance and space efficiency.
164//! [`IntVecBuilder`] offers automatic codec selection via
165//! [`VariableCodecSpec::Auto`]. When enabled, the builder analyzes the entire input
166//! dataset to find the codec that offers the best compression ratio.
167//!
168//! This analysis involves calculating the compressed size for the data with
169//! approximately 70 different codec configurations. This process introduces a
170//! significant, one-time **construction overhead**.
171//!
172//! Use [`Auto`](VariableCodecSpec::Auto) for read-heavy workloads where the [`IntVec`]
173//! is built once and then accessed many times. The initial cost is easily amortized by
174//! the long-term space savings.
175//!
176//! If your application creates many small [`IntVec`]s or accesses them frequently,
177//! the repeated cost of analysis can become a performance
178//! bottleneck. In such scenarios, it is better to explicitly specify a codec
179//! (e.g., [`VariableCodecSpec::Gamma`] or [`VariableCodecSpec::Delta`]) that is known
180//! to be a good general-purpose choice for your data.
181//!
182//! ```
183//! use compressed_intvec::prelude::*;
184//!
185//! let data: Vec<u32> = (0..100).collect();
186//!
187//! // Create an IntVec with automatic codec selection
188//! let vec: UIntVec<u32> = IntVec::builder()
189//! .build(&data)
190//! .unwrap();
191//!```
192//!
193//! [`dsi-bitstream`]: https://docs.rs/dsi-bitstream/latest/dsi_bitstream/
194
195#[macro_use]
196mod macros;
197
198pub mod builder;
199pub mod codec;
200pub mod iter;
201#[cfg(feature = "parallel")]
202mod parallel;
203pub mod reader;
204pub mod seq_reader;
205#[cfg(feature = "serde")]
206pub mod serde;
207pub mod slice;
208pub mod traits;
209
210pub use self::{codec::VariableCodecSpec, traits::Storable};
211use crate::fixed::{Error as FixedVecError, FixedVec};
212use dsi_bitstream::{
213 codes::params::DefaultReadParams,
214 dispatch::StaticCodeRead,
215 prelude::{
216 BitRead, BitSeek, BufBitReader, BufBitWriter, Codes, CodesRead, CodesWrite, Endianness,
217 MemWordReader, MemWordWriterVec,
218 },
219 traits::{BitWrite, BE, LE},
220};
221use mem_dbg::{DbgFlags, MemDbgImpl, MemSize, SizeFlags};
222use std::{
223 error::Error,
224 fmt::{self, Write},
225 marker::PhantomData,
226};
227
228pub use builder::{IntVecBuilder, IntVecFromIterBuilder};
229use iter::{IntVecIntoIter, IntVecIter};
230pub use reader::IntVecReader;
231pub use seq_reader::IntVecSeqReader;
232pub use slice::IntVecSlice;
233
234/// Defines the set of errors that can occur in `IntVec` operations.
235#[derive(Debug)]
236pub enum IntVecError {
237 /// An error occurred during an I/O operation, typically from the underlying
238 /// bitstream reader or writer.
239 Io(std::io::Error),
240 /// A generic error from the [`dsi-bitstream`](https://crates.io/crates/dsi-bitstream) library, often related to decoding malformed data.
241 Bitstream(Box<dyn Error + Send + Sync>),
242 /// An error indicating that one or more parameters are invalid for the
243 /// requested operation.
244 InvalidParameters(String),
245 /// An error that occurs during the dynamic dispatch of codec functions.
246 CodecDispatch(String),
247 /// An error indicating that a provided index is outside the valid bounds
248 /// of the vector.
249 IndexOutOfBounds(usize),
250}
251
252impl fmt::Display for IntVecError {
253 fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
254 match self {
255 IntVecError::Io(e) => write!(f, "I/O error: {}", e),
256 IntVecError::Bitstream(e) => write!(f, "Bitstream error: {}", e),
257 IntVecError::InvalidParameters(s) => write!(f, "Invalid parameters: {}", s),
258 IntVecError::CodecDispatch(s) => write!(f, "Codec dispatch error: {}", s),
259 IntVecError::IndexOutOfBounds(index) => write!(f, "Index out of bounds: {}", index),
260 }
261 }
262}
263
264impl Error for IntVecError {
265 fn source(&self) -> Option<&(dyn Error + 'static)> {
266 match self {
267 IntVecError::Io(e) => Some(e),
268 IntVecError::Bitstream(e) => Some(e.as_ref()),
269 _ => None,
270 }
271 }
272}
273
274impl From<std::io::Error> for IntVecError {
275 fn from(e: std::io::Error) -> Self {
276 IntVecError::Io(e)
277 }
278}
279
280impl From<core::convert::Infallible> for IntVecError {
281 fn from(_: core::convert::Infallible) -> Self {
282 unreachable!()
283 }
284}
285
286impl From<FixedVecError> for IntVecError {
287 fn from(e: FixedVecError) -> Self {
288 IntVecError::InvalidParameters(e.to_string())
289 }
290}
291
292/// A compressed, randomly accessible vector of integers using variable-length encoding.
293///
294/// `IntVec` achieves compression by using instantaneous codes and enables fast,
295/// amortized O(1) random access via a sampling mechanism. See the
296/// [module-level documentation](crate::variable) for a detailed explanation.
297///
298/// # Type Parameters
299///
300/// - `T`: The integer type for the elements (e.g., `u32`, `i16`). It must
301/// implement the [`Storable`] trait.
302/// - `E`: The [`Endianness`] of the underlying bitstream (e.g., [`LE`] or [`BE`]).
303/// - `B`: The backend storage buffer, such as `Vec<u64>` for an owned vector or
304/// `&[u64]` for a borrowed, zero-copy view.
305#[derive(Debug, Clone)]
306pub struct IntVec<T: Storable, E: Endianness, B: AsRef<[u64]> = Vec<u64>> {
307 /// The raw, bit-packed compressed data.
308 pub(super) data: B,
309 /// A `FixedVec` containing the bit offsets of sampled elements.
310 pub(super) samples: FixedVec<u64, u64, LE, B>,
311 /// The sampling rate `k`. Every `k`-th element's position is stored.
312 pub(super) k: usize,
313 /// The number of elements in the vector.
314 pub(super) len: usize,
315 /// The `dsi-bitstream` code used for compression.
316 pub(super) encoding: Codes,
317 /// Zero-sized markers for the generic type parameters.
318 pub(super) _markers: PhantomData<(T, E)>,
319}
320
321impl<T: Storable, E: Endianness, B: AsRef<[u64]> + MemSize> MemSize for IntVec<T, E, B> {
322 fn mem_size(&self, flags: SizeFlags) -> usize {
323 // Start with the stack size of the struct itself.
324 let mut total_size = core::mem::size_of::<Self>();
325 // Add the heap-allocated memory for the `data` field.
326 total_size += self.data.mem_size(flags) - core::mem::size_of::<B>();
327 // Add the heap-allocated memory for the `samples` field's internal buffer.
328 total_size +=
329 self.samples.mem_size(flags) - core::mem::size_of::<FixedVec<u64, u64, LE, B>>();
330 total_size
331 }
332}
333
334// A local wrapper for `dsi_bitstream::codes::Codes` to override its `MemDbgImpl`.
335// This is necessary because the derived implementation for `Codes` is incorrect
336// and cannot be fixed due to the orphan rule.
337struct CodeWrapper<'a>(&'a Codes);
338
339impl mem_dbg::MemSize for CodeWrapper<'_> {
340 fn mem_size(&self, _flags: mem_dbg::SizeFlags) -> usize {
341 core::mem::size_of_val(self.0)
342 }
343}
344
345impl mem_dbg::MemDbgImpl for CodeWrapper<'_> {
346 // Override the top-level display function for this type.
347 fn _mem_dbg_depth_on(
348 &self,
349 writer: &mut impl core::fmt::Write,
350 total_size: usize,
351 max_depth: usize,
352 prefix: &mut String,
353 field_name: Option<&str>,
354 is_last: bool,
355 padded_size: usize,
356 flags: DbgFlags,
357 ) -> core::fmt::Result {
358 if prefix.len() > max_depth {
359 return Ok(());
360 }
361
362 let real_size = self.mem_size(flags.to_size_flags());
363 let mut buffer = String::new(); // Use a temp buffer to format the size part.
364
365 // Replicate the size and percentage formatting from the default `MemDbgImpl`.
366 if flags.contains(DbgFlags::HUMANIZE) {
367 let (value, uom) = mem_dbg::humanize_float(real_size as f64);
368 if uom == " B" {
369 let _ = write!(&mut buffer, "{:>5} B ", real_size);
370 } else {
371 let precision = if value.abs() >= 100.0 {
372 1
373 } else if value.abs() >= 10.0 {
374 2
375 } else {
376 3
377 };
378 let _ = write!(&mut buffer, "{0:>4.1$} {2} ", value, precision, uom);
379 }
380 } else {
381 let align = mem_dbg::n_of_digits(total_size);
382 let _ = write!(&mut buffer, "{:>align$} B ", real_size, align = align);
383 }
384
385 if flags.contains(DbgFlags::PERCENTAGE) {
386 let percentage = if total_size == 0 {
387 100.0
388 } else {
389 100.0 * real_size as f64 / total_size as f64
390 };
391 let _ = write!(&mut buffer, "{:>6.2}% ", percentage);
392 }
393
394 // Write the formatted size string with colors if enabled.
395 if flags.contains(DbgFlags::COLOR) {
396 writer.write_fmt(format_args!("{}", mem_dbg::color(real_size)))?;
397 }
398 writer.write_str(&buffer)?;
399 if flags.contains(DbgFlags::COLOR) {
400 writer.write_fmt(format_args!("{}", mem_dbg::reset_color()))?;
401 }
402
403 // Write the tree structure part.
404 if !prefix.is_empty() {
405 writer.write_str(&prefix[2..])?;
406 writer.write_char(if is_last { '╰' } else { '├' })?;
407 writer.write_char('╴')?;
408 }
409
410 if let Some(field_name) = field_name {
411 writer.write_fmt(format_args!("{}", field_name))?;
412 }
413
414 // This is the custom part: print the `Debug` format of the enum.
415 if flags.contains(DbgFlags::TYPE_NAME) {
416 if flags.contains(DbgFlags::COLOR) {
417 writer.write_fmt(format_args!("{}", mem_dbg::type_color()))?;
418 }
419 writer.write_fmt(format_args!(": {:?}", self.0))?;
420 if flags.contains(DbgFlags::COLOR) {
421 writer.write_fmt(format_args!("{}", mem_dbg::reset_color()))?;
422 }
423 }
424
425 // Correctly calculate and print padding.
426 let padding = padded_size - core::mem::size_of_val(self.0);
427 if padding != 0 {
428 writer.write_fmt(format_args!(" [{}B]", padding))?;
429 }
430
431 writer.write_char('\n')?;
432 Ok(())
433 }
434
435 // It's a leaf node in the display tree, so no recursion is needed.
436 fn _mem_dbg_rec_on(
437 &self,
438 _writer: &mut impl core::fmt::Write,
439 _total_size: usize,
440 _max_depth: usize,
441 _prefix: &mut String,
442 _is_last: bool,
443 _flags: DbgFlags,
444 ) -> core::fmt::Result {
445 Ok(())
446 }
447}
448
449impl<T: Storable, E: Endianness, B: AsRef<[u64]> + MemDbgImpl> MemDbgImpl for IntVec<T, E, B> {
450 fn _mem_dbg_rec_on(
451 &self,
452 writer: &mut impl core::fmt::Write,
453 total_size: usize,
454 max_depth: usize,
455 prefix: &mut String,
456 _is_last: bool,
457 flags: DbgFlags,
458 ) -> core::fmt::Result {
459 // Manually display each field, ensuring correct tree structure.
460 self.data._mem_dbg_depth_on(
461 writer,
462 total_size,
463 max_depth,
464 prefix,
465 Some("data"),
466 false,
467 core::mem::size_of_val(&self.data),
468 flags,
469 )?;
470 self.samples._mem_dbg_depth_on(
471 writer,
472 total_size,
473 max_depth,
474 prefix,
475 Some("samples"),
476 false,
477 core::mem::size_of_val(&self.samples),
478 flags,
479 )?;
480 self.k._mem_dbg_depth_on(
481 writer,
482 total_size,
483 max_depth,
484 prefix,
485 Some("k"),
486 false,
487 core::mem::size_of_val(&self.k),
488 flags,
489 )?;
490 self.len._mem_dbg_depth_on(
491 writer,
492 total_size,
493 max_depth,
494 prefix,
495 Some("len"),
496 false,
497 core::mem::size_of_val(&self.len),
498 flags,
499 )?;
500
501 // Use the custom wrapper to correctly display the `encoding` field.
502 let code_wrapper = CodeWrapper(&self.encoding);
503 code_wrapper._mem_dbg_depth_on(
504 writer,
505 total_size,
506 max_depth,
507 prefix,
508 Some("encoding"),
509 false, // Not the last field.
510 core::mem::size_of_val(&self.encoding),
511 flags,
512 )?;
513
514 self._markers._mem_dbg_depth_on(
515 writer,
516 total_size,
517 max_depth,
518 prefix,
519 Some("_markers"),
520 true, // This is the last field.
521 core::mem::size_of_val(&self._markers),
522 flags,
523 )?;
524 Ok(())
525 }
526}
527
528/// Type alias for the bit writer used internally by `IntVec` builders.
529pub(crate) type IntVecBitWriter<E> = BufBitWriter<E, MemWordWriterVec<u64, Vec<u64>>>;
530/// Type alias for the bit reader used internally by `IntVec` accessors.
531pub(crate) type IntVecBitReader<'a, E> =
532 BufBitReader<E, MemWordReader<u64, &'a [u64]>, DefaultReadParams>;
533
534impl<T: Storable + 'static, E: Endianness> IntVec<T, E, Vec<u64>> {
535 /// Creates a builder for constructing an owned [`IntVec`] from a slice of data.
536 ///
537 /// This is the most flexible way to create an [`IntVec`], allowing customization
538 /// of the compression codec and sampling rate.
539 ///
540 /// # Examples
541 ///
542 /// ```
543 /// use compressed_intvec::variable::{IntVec, UIntVec, VariableCodecSpec};
544 ///
545 /// let data: &[u32] = &[5, 8, 13, 21, 34];
546 /// let vec: UIntVec<u32> = IntVec::builder()
547 /// .k(2) // Sample every 2nd element
548 /// .codec(VariableCodecSpec::Delta)
549 /// .build(data)
550 /// .unwrap();
551 ///
552 /// assert_eq!(vec.get(3), Some(21));
553 /// ```
554 pub fn builder() -> IntVecBuilder<T, E> {
555 IntVecBuilder::new()
556 }
557
558 /// Creates a builder for constructing an owned [`IntVec`] from an iterator.
559 ///
560 /// This is useful for large datasets that are generated on the fly.
561 pub fn from_iter_builder<I>(iter: I) -> IntVecFromIterBuilder<T, E, I>
562 where
563 I: IntoIterator<Item = T> + Clone,
564 {
565 IntVecFromIterBuilder::new(iter)
566 }
567
568 /// Consumes the [`IntVec`] and returns its decoded values as a standard `Vec<T>`.
569 ///
570 /// # Examples
571 ///
572 /// ```
573 /// use compressed_intvec::variable::{IntVec, SIntVec};
574 ///
575 /// let data: &[i32] = &[-10, 0, 10];
576 /// let vec: SIntVec<i32> = IntVec::from_slice(data).unwrap();
577 /// let decoded_data = vec.into_vec();
578 ///
579 /// assert_eq!(decoded_data, &[-10, 0, 10]);
580 /// ```
581 pub fn into_vec(self) -> Vec<T>
582 where
583 for<'a> IntVecBitReader<'a, E>: BitRead<E, Error = core::convert::Infallible>
584 + CodesRead<E>
585 + BitSeek<Error = core::convert::Infallible>,
586 {
587 self.into_iter().collect()
588 }
589
590 /// Creates an owned [`IntVec`] from a slice of data using default settings.
591 ///
592 /// This method uses [`VariableCodecSpec::Auto`] to select a codec and a
593 /// default sampling rate of `k=16`.
594 pub fn from_slice(slice: &[T]) -> Result<Self, IntVecError>
595 where
596 for<'a> crate::variable::IntVecBitWriter<E>:
597 BitWrite<E, Error = core::convert::Infallible> + CodesWrite<E>,
598 {
599 Self::builder()
600 .k(16)
601 .codec(VariableCodecSpec::Auto)
602 .build(slice)
603 }
604}
605
606impl<T: Storable, E: Endianness, B: AsRef<[u64]>> IntVec<T, E, B> {
607 /// Creates a new [`IntVec`] from its raw components, enabling zero-copy views.
608 ///
609 /// This constructor is intended for advanced use cases, such as memory-mapping
610 /// a pre-built [`IntVec`] from disk without copying the data.
611 ///
612 /// # Errors
613 ///
614 /// Returns an [`IntVecError::InvalidParameters`] if `k` is zero or if the
615 /// number of samples is inconsistent with `len` and `k`.
616 pub fn from_parts(
617 data: B,
618 samples_data: B,
619 samples_len: usize,
620 samples_num_bits: usize,
621 k: usize,
622 len: usize,
623 encoding: Codes,
624 ) -> Result<Self, IntVecError> {
625 let samples =
626 FixedVec::<u64, u64, LE, B>::from_parts(samples_data, samples_len, samples_num_bits)?;
627
628 if k == 0 {
629 return Err(IntVecError::InvalidParameters(
630 "Sampling rate k cannot be zero".to_string(),
631 ));
632 }
633 let expected_samples = if len == 0 { 0 } else { len.div_ceil(k) };
634 if samples.len() != expected_samples {
635 return Err(IntVecError::InvalidParameters(format!(
636 "Inconsistent number of samples. Expected {}, found {}",
637 expected_samples,
638 samples.len()
639 )));
640 }
641
642 Ok(unsafe { Self::new_unchecked(data, samples, k, len, encoding) })
643 }
644
645 /// Creates a new [`IntVec`] from its raw parts without performing safety checks.
646 ///
647 /// # Safety
648 ///
649 /// The caller must ensure that all parameters are consistent and valid. The
650 /// `samples` must contain the correct bit offsets for the `data` stream,
651 /// and `len`, `k`, and `encoding` must accurately describe the layout.
652 /// Mismatched parameters will lead to panics or incorrect data retrieval.
653 pub(crate) unsafe fn new_unchecked(
654 data: B,
655 samples: FixedVec<u64, u64, LE, B>,
656 k: usize,
657 len: usize,
658 encoding: Codes,
659 ) -> Self {
660 Self {
661 data,
662 samples,
663 k,
664 len,
665 encoding,
666 _markers: PhantomData,
667 }
668 }
669
670 /// Creates a zero-copy, immutable view (a _slice_) of this vector.
671 ///
672 /// Returns `None` if the specified range is out of bounds.
673 ///
674 /// # Examples
675 ///
676 /// ```
677 /// use compressed_intvec::variable::{IntVec, UIntVec};
678 ///
679 /// let data: Vec<u32> = (0..20).collect();
680 /// let vec: UIntVec<u32> = IntVec::from_slice(&data).unwrap();
681 /// let slice = vec.slice(5, 10).unwrap();
682 ///
683 /// assert_eq!(slice.len(), 10);
684 /// assert_eq!(slice.get(0), Some(5)); // Corresponds to index 5 of the original vec
685 /// ```
686 pub fn slice(&'_ self, start: usize, len: usize) -> Option<IntVecSlice<'_, T, E, B>> {
687 if start.saturating_add(len) > self.len {
688 return None;
689 }
690 Some(IntVecSlice::new(self, start..start + len))
691 }
692
693 /// Splits the vector into two immutable slices at a given index.
694 ///
695 /// Returns `None` if `mid` is out of bounds.
696 #[allow(clippy::type_complexity)]
697 pub fn split_at(
698 &'_ self,
699 mid: usize,
700 ) -> Option<(IntVecSlice<'_, T, E, B>, IntVecSlice<'_, T, E, B>)> {
701 if mid > self.len {
702 return None;
703 }
704 let left = IntVecSlice::new(self, 0..mid);
705 let right = IntVecSlice::new(self, mid..self.len);
706 Some((left, right))
707 }
708
709 /// Returns the number of integers in the vector.
710 #[inline]
711 pub fn len(&self) -> usize {
712 self.len
713 }
714
715 /// Returns `true` if the vector contains no elements.
716 #[inline]
717 pub fn is_empty(&self) -> bool {
718 self.len == 0
719 }
720
721 /// Returns the sampling rate `k` used during encoding.
722 #[inline]
723 pub fn get_sampling_rate(&self) -> usize {
724 self.k
725 }
726
727 /// Returns the number of sample points stored in the vector.
728 #[inline]
729 pub fn get_num_samples(&self) -> usize {
730 self.samples.len()
731 }
732
733 /// Returns a reference to the inner `FixedVec` of samples.
734 #[inline]
735 pub fn samples_ref(&self) -> &FixedVec<u64, u64, LE, B> {
736 &self.samples
737 }
738
739 /// Returns a read-only slice of the underlying compressed data words (`&[u64]`).
740 #[inline]
741 pub fn as_limbs(&self) -> &[u64] {
742 self.data.as_ref()
743 }
744
745 /// Returns the concrete [`Codes`] variant that was used for compression.
746 #[inline]
747 pub fn encoding(&self) -> Codes {
748 self.encoding
749 }
750
751 /// Returns a clone of the underlying storage as a `Vec<u64>`.
752 pub fn limbs(&self) -> Vec<u64> {
753 self.data.as_ref().to_vec()
754 }
755
756 /// Returns an iterator over the decompressed values.
757 pub fn iter(&'_ self) -> impl Iterator<Item = T> + '_
758 where
759 for<'a> IntVecBitReader<'a, E>: BitRead<E, Error = core::convert::Infallible>
760 + CodesRead<E>
761 + BitSeek<Error = core::convert::Infallible>,
762 {
763 IntVecIter::new(self)
764 }
765}
766
767impl<T: Storable, E: Endianness, B: AsRef<[u64]>> IntVec<T, E, B>
768where
769 for<'a> IntVecBitReader<'a, E>: BitRead<E, Error = core::convert::Infallible>
770 + CodesRead<E>
771 + BitSeek<Error = core::convert::Infallible>,
772{
773 /// Creates a reusable, stateless reader for efficient random access.
774 ///
775 /// This method returns an [`IntVecReader`], a struct that maintains a persistent,
776 /// reusable bitstream reader. This amortizes the setup cost across multiple `get`
777 /// operations, making it more efficient than calling [`get`](IntVec::get) repeatedly in a loop.
778 ///
779 /// This reader is **stateless**: it performs a full seek from the nearest sample point for each call,
780 /// independently of any previous access.
781 ///
782 /// # When to use it
783 /// Use [`IntVecReader`] for true random access patterns where lookup indices are sparse,
784 /// unordered, or not known in advance (e.g., graph traversals, pointer chasing).
785 /// For accessing a known set of indices, [`get_many`](IntVec::get_many) is generally superior.
786 ///
787 /// # Examples
788 ///
789 /// ```
790 /// use compressed_intvec::prelude::*;
791 ///
792 /// let data: Vec<u32> = (0..100).rev().collect(); // Data is not sequential
793 /// let vec: UIntVec<u32> = IntVec::from_slice(&data).unwrap();
794 ///
795 /// // Create a reusable reader for multiple random lookups
796 /// let mut reader = vec.reader();
797 ///
798 /// assert_eq!(reader.get(99).unwrap(), Some(0));
799 /// assert_eq!(reader.get(0).unwrap(), Some(99));
800 /// assert_eq!(reader.get(50).unwrap(), Some(49));
801 /// ```
802 pub fn reader(&'_ self) -> IntVecReader<'_, T, E, B> {
803 IntVecReader::new(self)
804 }
805
806 /// Creates a stateful, reusable reader optimized for sequential access.
807 ///
808 /// This method returns an [`IntVecSeqReader`], which is specifically designed
809 /// to take advantage of the vector's internal state, tracking the current decoding position (cursor).
810 ///
811 /// This statefulness enables a key optimization:
812 /// - **Fast Path**: If a requested index is at or after the cursor and within
813 /// the same sample block, the reader decodes forward from its last known
814 /// position. This avoids a costly seek operation.
815 /// - **Fallback Path**: If the requested index is before the cursor (requiring a
816 /// backward move) or in a different sample block, the reader falls back to
817 /// the standard behavior of seeking to the nearest sample point.
818 ///
819 /// # When to use it
820 /// Use [`IntVecSeqReader`] when your access pattern has high locality, meaning
821 /// indices are primarily increasing and often clustered together. It is ideal
822 /// for iterating through a sorted list of indices or for stream-like processing.
823 ///
824 /// # Examples
825 ///
826 /// ```
827 /// use compressed_intvec::prelude::*;
828 ///
829 /// let data: Vec<u32> = (0..100).collect();
830 /// let vec: UIntVec<u32> = IntVec::from_slice(&data).unwrap();
831 ///
832 /// // Create a reader optimized for sequential access
833 /// let mut seq_reader = vec.seq_reader();
834 ///
835 /// // Accessing indices in increasing order is efficient
836 /// assert_eq!(seq_reader.get(10).unwrap(), Some(10));
837 /// // This next call is fast, as it decodes forward from index 10
838 /// assert_eq!(seq_reader.get(15).unwrap(), Some(15));
839 ///
840 /// // A large jump will trigger a seek to a new sample block
841 /// assert_eq!(seq_reader.get(90).unwrap(), Some(90));
842 ///
843 /// // A backward jump will also trigger a seek
844 /// assert_eq!(seq_reader.get(5).unwrap(), Some(5));
845 /// ```
846 pub fn seq_reader(&'_ self) -> IntVecSeqReader<'_, T, E, B> {
847 IntVecSeqReader::new(self)
848 }
849
850 /// Returns the element at the specified index, or `None` if the index is
851 /// out of bounds.
852 ///
853 /// This operation is amortized O(1).
854 ///
855 /// # Examples
856 ///
857 /// ```
858 /// use compressed_intvec::variable::{IntVec, UIntVec};
859 ///
860 /// let data: Vec<u32> = (0..100).collect();
861 /// let vec: UIntVec<u32> = IntVec::from_slice(&data).unwrap();
862 ///
863 /// assert_eq!(vec.get(50), Some(50));
864 /// assert_eq!(vec.get(100), None);
865 /// ```
866 #[inline]
867 pub fn get(&self, index: usize) -> Option<T> {
868 if index >= self.len {
869 return None;
870 }
871 Some(unsafe { self.get_unchecked(index) })
872 }
873
874 /// Returns the element at the specified index without bounds checking.
875 ///
876 /// # Safety
877 ///
878 /// Calling this method with an out-of-bounds `index` is undefined behavior.
879 /// The `index` must be less than the vector's `len`.
880 #[inline]
881 pub unsafe fn get_unchecked(&self, index: usize) -> T {
882 let mut reader = self.reader();
883 reader.get_unchecked(index)
884 }
885
886 /// Retrieves multiple elements from the vector at the specified indices.
887 ///
888 /// This method is generally more efficient than calling [`get`](Self::get) in a loop, as
889 /// it sorts the indices and scans through the compressed data stream once.
890 ///
891 /// # Errors
892 ///
893 /// Returns [`IntVecError::IndexOutOfBounds`] if any index is out of bounds.
894 ///
895 /// # Examples
896 ///
897 /// ```
898 /// use compressed_intvec::variable::{IntVec, UIntVec};
899 ///
900 /// let data: Vec<u32> = (0..100).collect();
901 /// let vec: UIntVec<u32> = IntVec::from_slice(&data).unwrap();
902 ///
903 /// let indices = [99, 0, 50];
904 /// let values = vec.get_many(&indices).unwrap();
905 /// assert_eq!(values, vec![99, 0, 50]);
906 /// ```
907 pub fn get_many(&self, indices: &[usize]) -> Result<Vec<T>, IntVecError> {
908 if indices.is_empty() {
909 return Ok(Vec::new());
910 }
911
912 for &index in indices {
913 if index >= self.len {
914 return Err(IntVecError::IndexOutOfBounds(index));
915 }
916 }
917 // SAFETY: We have just performed the bounds checks.
918 Ok(unsafe { self.get_many_unchecked(indices) })
919 }
920
921 /// Retrieves multiple elements without bounds checking.
922 ///
923 /// # Safety
924 ///
925 /// Calling this method with any out-of-bounds index is undefined behavior.
926 #[allow(clippy::uninit_vec)]
927 pub unsafe fn get_many_unchecked(&self, indices: &[usize]) -> Vec<T> {
928 if indices.is_empty() {
929 return Vec::new();
930 }
931 let mut results = Vec::with_capacity(indices.len());
932 // SAFETY: The vector is immediately populated by the sorted access logic below.
933 results.set_len(indices.len());
934
935 let mut indexed_indices: Vec<(usize, usize)> = indices
936 .iter()
937 .enumerate()
938 .map(|(i, &idx)| (idx, i))
939 .collect();
940 // Sort by the target index to enable efficient sequential scanning.
941 indexed_indices.sort_unstable_by_key(|&(idx, _)| idx);
942
943 if self.k.is_power_of_two() {
944 // Optimization: use bit-shift for division if k is a power of two.
945 let k_exp = self.k.trailing_zeros();
946 self.get_many_dsi_inner(
947 &indexed_indices,
948 &mut results,
949 |idx| idx >> k_exp,
950 |block| block << k_exp,
951 )
952 .unwrap();
953 } else {
954 self.get_many_dsi_inner(
955 &indexed_indices,
956 &mut results,
957 |idx| idx / self.k,
958 |block| block * self.k,
959 )
960 .unwrap();
961 }
962
963 results
964 }
965
966 /// Internal implementation for `get_many_unchecked`.
967 ///
968 /// This function takes closures to abstract away the division/multiplication
969 /// by `k`, allowing for a bit-shift optimization when `k` is a power of two.
970 fn get_many_dsi_inner<F1, F2>(
971 &self,
972 indexed_indices: &[(usize, usize)],
973 results: &mut [T],
974 block_of: F1,
975 start_of_block: F2,
976 ) -> Result<(), IntVecError>
977 where
978 F1: Fn(usize) -> usize,
979 F2: Fn(usize) -> usize,
980 {
981 let mut reader = self.reader();
982 let mut current_decoded_index: usize = 0;
983
984 for &(target_index, original_position) in indexed_indices {
985 // Check if we need to jump to a new sample block. This is true if the
986 // target index is before our current position, or if it's in a different
987 // sample block than the one we're currently in.
988 if target_index < current_decoded_index
989 || block_of(target_index) != block_of(current_decoded_index.saturating_sub(1))
990 {
991 let target_sample_block = block_of(target_index);
992 // SAFETY: The public-facing `get_many` performs bounds checks.
993 let start_bit = unsafe { self.samples.get_unchecked(target_sample_block) };
994 reader.reader.set_bit_pos(start_bit)?;
995 current_decoded_index = start_of_block(target_sample_block);
996 }
997
998 // Sequentially decode elements until we reach our target.
999 for _ in current_decoded_index..target_index {
1000 reader.code_reader.read(&mut reader.reader)?;
1001 }
1002 let value = reader.code_reader.read(&mut reader.reader)?;
1003 // Place the decoded value in its original requested position.
1004 results[original_position] = Storable::from_word(value);
1005 current_decoded_index = target_index + 1;
1006 }
1007 Ok(())
1008 }
1009
1010 /// Retrieves multiple elements from an iterator of indices.
1011 ///
1012 /// This is a convenient alternative to [`get_many`](Self::get_many) when the indices are not
1013 /// already in a slice. It may be less performant as it cannot pre-sort the
1014 /// indices for optimal access.
1015 pub fn get_many_from_iter<I>(&self, indices: I) -> Result<Vec<T>, IntVecError>
1016 where
1017 I: IntoIterator<Item = usize>,
1018 {
1019 let indices_iter = indices.into_iter();
1020 let (lower_bound, _) = indices_iter.size_hint();
1021 let mut results = Vec::with_capacity(lower_bound);
1022 let mut seq_reader = self.seq_reader();
1023
1024 for index in indices_iter {
1025 let value = seq_reader
1026 .get(index)?
1027 .ok_or(IntVecError::IndexOutOfBounds(index))?;
1028 results.push(value);
1029 }
1030
1031 Ok(results)
1032 }
1033}
1034
1035impl<T: Storable + Ord, E: Endianness, B: AsRef<[u64]>> IntVec<T, E, B>
1036where
1037 for<'a> IntVecBitReader<'a, E>: BitRead<E, Error = core::convert::Infallible>
1038 + CodesRead<E>
1039 + BitSeek<Error = core::convert::Infallible>,
1040{
1041 /// Binary searches this vector for a given element.
1042 ///
1043 /// If the value is found, returns `Ok(usize)` with the index of the
1044 /// matching element. If the value is not found, returns `Err(usize)` with
1045 /// the index where the value could be inserted to maintain order.
1046 ///
1047 /// # Complexity
1048 ///
1049 /// The time complexity of this operation is O(k * log n), where `n` is the
1050 /// number of elements in the vector and `k` is the sampling rate. This is
1051 /// because each of the O(log n) probes during the search requires an
1052 /// element access, which has a cost proportional to `k` in the worst case.
1053 ///
1054 /// # Examples
1055 ///
1056 /// ```
1057 /// use compressed_intvec::variable::{IntVec, SIntVec};
1058 ///
1059 /// let data: &[i32] = &[-10, 0, 10, 20, 30];
1060 /// let vec: SIntVec<i32> = IntVec::from_slice(data).unwrap();
1061 ///
1062 /// assert_eq!(vec.binary_search(&10), Ok(2));
1063 /// assert_eq!(vec.binary_search(&15), Err(3));
1064 /// ```
1065 pub fn binary_search(&self, value: &T) -> Result<usize, usize> {
1066 self.binary_search_by(|probe| probe.cmp(value))
1067 }
1068
1069 /// Binary searches this vector with a custom comparison function.
1070 ///
1071 /// # Complexity
1072 ///
1073 /// The time complexity of this operation is O(k * log n), where `n` is the
1074 /// number of elements in the vector and `k` is the sampling rate.
1075 #[inline]
1076 pub fn binary_search_by<F>(&self, mut f: F) -> Result<usize, usize>
1077 where
1078 F: FnMut(T) -> std::cmp::Ordering,
1079 {
1080 let mut low = 0;
1081 let mut high = self.len();
1082 let mut reader = self.reader();
1083
1084 while low < high {
1085 let mid = low + (high - low) / 2;
1086 // SAFETY: The loop invariants ensure `mid` is always in bounds.
1087 let cmp = f(unsafe { reader.get_unchecked(mid) });
1088
1089 match cmp {
1090 std::cmp::Ordering::Less => low = mid + 1,
1091 std::cmp::Ordering::Equal => return Ok(mid),
1092 std::cmp::Ordering::Greater => high = mid,
1093 }
1094 }
1095 Err(low)
1096 }
1097
1098 /// Binary searches this vector with a key extraction function.
1099 ///
1100 /// # Complexity
1101 ///
1102 /// The time complexity of this operation is O(k * log n), where `n` is the
1103 /// number of elements in the vector and `k` is the sampling rate.
1104 #[inline]
1105 pub fn binary_search_by_key<K: Ord, F>(&self, b: &K, mut f: F) -> Result<usize, usize>
1106 where
1107 F: FnMut(T) -> K,
1108 {
1109 self.binary_search_by(|k| f(k).cmp(b))
1110 }
1111}
1112
1113impl<T: Storable + 'static, E: Endianness + 'static> IntoIterator for IntVec<T, E, Vec<u64>>
1114where
1115 for<'a> IntVecBitReader<'a, E>: BitRead<E, Error = core::convert::Infallible>
1116 + CodesRead<E>
1117 + BitSeek<Error = core::convert::Infallible>,
1118{
1119 type Item = T;
1120 type IntoIter = IntVecIntoIter<T, E>;
1121
1122 fn into_iter(self) -> Self::IntoIter {
1123 IntVecIntoIter::new(self)
1124 }
1125}
1126
1127/// An [`IntVec`] for unsigned integers with Little-Endian bit layout.
1128pub type UIntVec<T> = IntVec<T, LE>;
1129/// An [`IntVec`] for signed integers with Little-Endian bit layout.
1130pub type SIntVec<T> = IntVec<T, LE>;
1131/// An [`IntVec`] for [`u64`] elements with Big-Endian bit layout.
1132pub type BEIntVec = IntVec<u64, BE>;
1133/// An [`IntVec`] for [`u64`] elements with Little-Endian bit layout.
1134pub type LEIntVec = IntVec<u64, LE>;
1135/// An [`IntVec`] for [`i64`] elements with Big-Endian bit layout.
1136pub type BESIntVec = IntVec<i64, BE>;
1137/// An [`IntVec`] for [`i64`] elements with Little-Endian bit layout.
1138pub type LESIntVec = IntVec<i64, LE>;
1139
1140impl<T, E, B, O> PartialEq<O> for IntVec<T, E, B>
1141where
1142 T: Storable + PartialEq,
1143 E: Endianness,
1144 B: AsRef<[u64]>,
1145 O: AsRef<[T]>,
1146 for<'a> IntVecBitReader<'a, E>: BitRead<E, Error = core::convert::Infallible>
1147 + CodesRead<E>
1148 + BitSeek<Error = core::convert::Infallible>,
1149{
1150 /// Checks for equality between an [`IntVec`] and a standard slice.
1151 ///
1152 /// The comparison is done by iterating over both and comparing elements
1153 /// one by one. The overall comparison is not a single atomic operation.
1154 fn eq(&self, other: &O) -> bool {
1155 let other_slice = other.as_ref();
1156 if self.len() != other_slice.len() {
1157 return false;
1158 }
1159 self.iter().zip(other_slice.iter()).all(|(a, b)| a == *b)
1160 }
1161}