samkhya-core
The foundational crate of the samkhya project — portable, feedback-driven cardinality correction for embedded analytical engines.
samkhya-core is engine-agnostic. It contains every primitive that the
per-engine adapters (samkhya-datafusion, samkhya-duckdb-ext,
samkhya-polars, samkhya-postgres, samkhya-py) build on top of: the four
foundational sketches, the 2D correlated histogram, the Puffin sidecar
reader/writer, the LpBound envelope family, the feedback store, and the
residual corrector trait. Nothing here links to a specific query engine.
What this crate provides
    samkhya-core
    ├── sketch
    │   ├── HllSketch                    HyperLogLog (distinct count)
    │   ├── BloomFilter                  membership
    │   ├── CountMinSketch               point frequency
    │   ├── EquiDepthHistogram           1D range
    │   └── CorrelatedHistogram2D        2D joint distribution
    ├── puffin
    │   ├── PuffinReader / PuffinWriter  Iceberg Puffin v1 sidecars
    │   └── KIND tags                    samkhya.hll-v1, .bloom-v1, .cms-v1, ...
    ├── lpbound
    │   ├── ProductBound                 coarse n_1 * n_2 * ... ceiling
    │   ├── AgmBound                     AGM fractional edge cover
    │   ├── ChainBound                   chain-join specialisation
    │   └── LpJoinBound                  LP-derived (feature `lp_solver`)
    ├── feedback
    │   ├── FeedbackStore                SQLite-backed observation log
    │   └── TemplateHash                 query-template fingerprint
    └── corrector
        ├── IdentityCorrector            no-op (passes the clamped ceiling)
        ├── GbtCorrector                 gradient-boosted residual
        ├── AdditiveGbtCorrector         per-template additive residual
        └── (trait Corrector)            implement your own
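To give a feel for what "implement your own" corrector might involve, here is a minimal, self-contained sketch. The `Corrector` trait below is a hypothetical stand-in defined locally for illustration; the crate's real trait signature may well differ, and `ScaleCorrector` is an invented example type.

```rust
/// Hypothetical stand-in for the crate's `Corrector` trait.
/// The real trait in samkhya-core may have a different signature.
trait Corrector {
    /// Return a corrected cardinality, never above `ceiling`.
    fn correct(&self, raw_estimate: f64, ceiling: f64) -> f64;
}

/// Invented example: multiplies the raw estimate by a fixed factor.
struct ScaleCorrector {
    factor: f64,
}

impl Corrector for ScaleCorrector {
    fn correct(&self, raw_estimate: f64, ceiling: f64) -> f64 {
        // Apply the correction, then clamp into [0, ceiling] so the
        // output can never exceed the provable upper bound.
        (raw_estimate * self.factor).max(0.0).min(ceiling)
    }
}
```

With `factor: 1.5` and a ceiling of 10,000, a raw estimate of 1,000 becomes 1,500, while a raw estimate of 9,000 is clamped to the ceiling.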
Quick start
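    use samkhya_core::sketch::HllSketch;

    let mut hll = HllSketch::new(14); // precision p = 14
    for i in 0..10_000u64 {
        hll.add(&i); // insertion method name is assumed; check the crate docs
    }
    let estimate = hll.estimate();
    // The 5% tolerance is illustrative; observed error is much lower.
    assert!((estimate as f64 - 10_000.0).abs() / 10_000.0 < 0.05); // ~0.5% rel err at p=14
    let bytes = hll.to_bytes();
    let restored = HllSketch::from_bytes(&bytes).unwrap();
    assert_eq!(restored.estimate(), estimate);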
A larger end-to-end example — building four sketches, writing them to a
Puffin sidecar, and reading them back — is at
examples/sketch_to_puffin.rs.
Cardinality envelope
Every corrector output in samkhya is clamped above by a provable ceiling derived from sketch-level statistics. The four envelopes are ordered by tightness, each no looser than the next:
    LpJoinBound <= AgmBound <= ChainBound <= ProductBound
    (tightest)                               (loosest)
LpJoinBound requires the lp_solver feature (pulls in good_lp +
microlp). The default build ships ProductBound/AgmBound/ChainBound
without any LP dependency.
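As a rough illustration of how such ceilings behave, the sketch below computes a product ceiling and the textbook AGM bound for a triangle join, then clamps an estimate. These free functions are hypothetical and written only for this example; they are not the crate's `ProductBound` or `AgmBound` API.

```rust
/// Coarse product ceiling n_1 * n_2 * ..., saturating at u64::MAX
/// instead of overflowing.
fn product_bound(cardinalities: &[u64]) -> u64 {
    cardinalities
        .iter()
        .fold(1u64, |acc, &n| acc.saturating_mul(n))
}

/// AGM bound for the triangle query R(a,b), S(b,c), T(a,c) using the
/// fractional edge cover (1/2, 1/2, 1/2): sqrt(|R| * |S| * |T|).
fn agm_triangle_bound(r: u64, s: u64, t: u64) -> f64 {
    ((r as f64) * (s as f64) * (t as f64)).sqrt()
}

/// Clamp a corrector output into [0, ceiling].
fn clamp_to_ceiling(estimate: f64, ceiling: f64) -> f64 {
    estimate.max(0.0).min(ceiling)
}
```

For three relations of 100 rows each, the product ceiling is 1,000,000 while the AGM triangle bound is 1,000, which shows why a tighter envelope leaves far less room for a corrector to overshoot.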
Feature flags
| flag | default | what it adds |
|---|---|---|
| lp_solver | off | LpJoinBound via good_lp + microlp |
| gbt | on | GbtCorrector via gbdt |
| tabpfn_http | off | TabPfnHttpCorrector (foundation-model HTTP backend) |
| iceberg_compat | on | Puffin sidecar reader strictness for Iceberg payloads |
Disabling gbt removes the crate's only ML dependency, which suits pure-sketch deployments.
Safety / format stability
All from_bytes constructors take untrusted input and are in-scope for the
project's SECURITY.md. They are fuzzed (cargo fuzz) on every release and
must never panic on adversarial bytes — they return Err instead.
Sketch payload codecs and the Puffin KIND tags are pinned at v1 for the
v1.x line. Format bumps will introduce new kinds (samkhya.hll-v2, …) and
rely on the reader's coexistence contract: unknown kinds are skipped and
never treated as errors.
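The two guarantees above (no panics on untrusted bytes, unknown kinds skipped) can be sketched with a toy decoder. Everything here is an assumption made for illustration: the `ToyHll` payload layout, the `DecodeError` variants, and the `read_known` helper are invented and do not reflect the crate's actual codecs or Puffin reader.

```rust
/// Kind tag this toy reader understands.
const KIND_HLL_V1: &str = "samkhya.hll-v1";

#[derive(Debug, PartialEq)]
enum DecodeError {
    Truncated,
    BadPrecision,
    BadLength,
}

/// Toy payload layout (invented): one precision byte `p`, followed by
/// exactly 2^p register bytes.
#[derive(Debug, PartialEq)]
struct ToyHll {
    precision: u8,
    registers: Vec<u8>,
}

impl ToyHll {
    /// Validate every field before use and return `Err` rather than
    /// indexing or allocating based on unchecked input.
    fn from_bytes(bytes: &[u8]) -> Result<Self, DecodeError> {
        let (&precision, rest) = bytes.split_first().ok_or(DecodeError::Truncated)?;
        if !(4..=18).contains(&precision) {
            return Err(DecodeError::BadPrecision);
        }
        if rest.len() != 1usize << precision {
            return Err(DecodeError::BadLength);
        }
        Ok(ToyHll { precision, registers: rest.to_vec() })
    }
}

/// Decode blobs of a known kind; skip unknown kinds without erroring.
/// Malformed payloads of a known kind still surface as `Err`.
fn read_known(blobs: &[(&str, Vec<u8>)]) -> Result<Vec<ToyHll>, DecodeError> {
    let mut out = Vec::new();
    for (kind, payload) in blobs {
        match *kind {
            KIND_HLL_V1 => out.push(ToyHll::from_bytes(payload)?),
            _ => continue, // e.g. a future "samkhya.hll-v2" blob
        }
    }
    Ok(out)
}
```

The design point is that a v1 reader handed a sidecar containing both v1 and v2 blobs still succeeds and simply returns the blobs it understands.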
Integration
samkhya-core is the only crate the engine adapters depend on. If you're
embedding samkhya into a new engine, start by depending on this crate and
mirroring the integration pattern in samkhya-datafusion.
License
Apache-2.0. Sole author: Prateek Singh.