cnk
Set compression via C(n,k): ROC, ANS, delta encoding.
Dual-licensed under MIT or Apache-2.0.
use ;
let compressor = new;
let ids = vec!;
let universe_size = 1000;
let compressed = compressor.compress_set.unwrap;
let decompressed = compressor.decompress_set.unwrap;
assert_eq!;
A set of size (n) from universe ([N]) has $\binom{N}{n}$ possibilities, so the information-theoretic minimum is $\log_2 \binom{N}{n}$ bits. ROC approaches this bound by treating permutation as a latent variable.
Feature matrix
| Compressor | Feature required | Random access |
|---|---|---|
RocCompressor |
(always available) | no |
EliasFanoCompressor |
sbits |
yes |
PartitionedEliasFanoCompressor |
sbits |
yes |
Succinct baselines (feature sbits)
If you enable feature sbits, cnk exposes succinct monotone-sequence codecs powered by sbits/:
EliasFanoCompressorPartitionedEliasFanoCompressor(cluster-aware baseline; see Ottaviano & Venturini, SIGIR 2014: PDF)
These are useful when you care about random access / skipping inside lists (typical in posting lists and some graph/ANN structures).
Method selection (“auto”)
Downstream repos often want a stable default without hardcoding a codec per call site. cnk provides:
IdListStats: cheap per-list summary stats (gaps, clustering proxy)choose_method(...): deterministic heuristic method choicecompress_set_auto(...)/decompress_set_auto(...): choose + encode/decode, with feature-aware fallbacks
The intent is: callers record the returned CodecChoice next to the bytes (so decode does not depend on build-time defaults).