Expand description
ID set compression primitives.
cnk provides compression algorithms for sorted, unique ID sets where
order doesn’t matter. This is common in information retrieval:
- IVF posting lists (which vectors belong to which cluster)
- HNSW neighbor lists (which nodes are connected)
- Inverted indexes (which documents contain which terms)
§Compression Methods
- Delta+varint (
RocCompressor): practical baseline, varint-encodes gaps between sorted IDs - Elias-Fano (feature
sbits): succinct monotone-sequence codec with random access - Partitioned Elias-Fano (feature
sbits): cluster-aware variant - ROC (Random Order Coding): near-optimal for sets using bits-back with ANS (planned)
§Historical Context
Set compression has a rich history in information retrieval. Classic methods like Elias-Fano (1971) exploit monotonicity of sorted sequences. Modern methods like ROC (Severo et al., 2022) exploit the additional structure that order doesn’t matter, achieving log(C(N,n)) bits instead of log(N^n) bits.
§Example
use cnk::{RocCompressor, IdSetCompressor};
let compressor = RocCompressor::new();
let ids = vec![1u32, 5, 10, 20, 50];
let universe_size = 1000;
// Compress
let compressed = compressor.compress_set(&ids, universe_size).unwrap();
// Decompress
let decompressed = compressor.decompress_set(&compressed, universe_size).unwrap();
assert_eq!(ids, decompressed);§References
- Elias, P. (1974). “Efficient storage and retrieval by content and address”
- Fano, R. (1971). “On the number of bits required to implement an associative memory”
- Severo et al. (2022). “Compressing multisets with large alphabets”
- Severo et al. (2025). “Lossless Compression of Vector IDs for ANN Search”
Structs§
- Auto
Config - Configuration for
compress_set_auto. - Choose
Config - Configuration for the heuristic chooser.
- Codec
Choice - A codec choice, including any method parameters that should be recorded alongside the bytes.
- IdList
Stats - Summary statistics for a sorted, unique ID list within
[0, universe_size). - RocCompressor
- Delta+varint compressor for sorted, unique ID sets.
Enums§
- Compression
Error - Errors that can occur during compression operations.
- IdCompression
Method - Compression method selection.
Traits§
- IdSet
Compressor - Trait for compressing sets of IDs where order doesn’t matter.
Functions§
- choose_
method - Choose a compression method from list statistics.
- compress_
set_ auto - Choose a method (feature-aware) and compress.
- compress_
set_ enveloped - Encode as an envelope: choose method (auto) + store
(method, params, universe_size, payload). - decompress_
set_ auto - Decompress bytes previously produced by
compress_set_auto(or any compatible encoder), using the recordedchoice.method. - decompress_
set_ enveloped - Decode an envelope.
- validate_
ids - Validate that
idsis sorted in strictly ascending order (sorted + unique).