samkhya-arrow
Engine-agnostic Apache Arrow integration for samkhya sketches. Feed an
arrow::array::Array or a RecordBatch in, get back ready-to-serialize HLL,
Bloom, Count-Min, and equi-depth-histogram sketches.
Part of the samkhya project — portable, feedback-driven cardinality correction for embedded analytical engines.
What this crate provides
- `ingest`: array-level helpers that dispatch once on `DataType`, downcast to the concrete primitive / byte array, and walk values into a sketch:
  - `ingest_array_into_hll(array, &mut HllSketch)`
  - `ingest_array_into_bloom(array, &mut BloomFilter)`
  - `ingest_array_into_cms(array, &mut CountMinSketch, count_per_value)`
  - `ingest_array_into_histogram_values(array) -> Result<Vec<f64>>`
- `batch`: `RecordBatch`-level convenience wrappers that fan out the array helpers across every column:
  - `build_column_sketches(batch, precision) -> Result<Vec<HllSketch>>`
  - `build_blooms(batch, fp_rate) -> Result<Vec<BloomFilter>>`
  - `build_histograms(batch, buckets) -> Result<Vec<Option<EquiDepthHistogram>>>`
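The "dispatch once, then walk values" shape described above can be sketched with the standard library alone. Everything in this block is hypothetical: `Column` stands in for an Arrow array, `DistinctSet` is an exact set used in place of a real sketch, and `ingest_column` / `build_all` only mirror the roles of the array-level and batch-level helpers, not their actual signatures.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for an Arrow array: one variant per column type.
pub enum Column {
    Int64(Vec<i64>),
    Utf8(Vec<String>),
}

/// Exact distinct set used here in place of a probabilistic sketch.
#[derive(Default)]
pub struct DistinctSet {
    seen: HashSet<Vec<u8>>,
}

impl DistinctSet {
    pub fn add(&mut self, bytes: &[u8]) {
        self.seen.insert(bytes.to_vec());
    }

    pub fn estimate(&self) -> usize {
        self.seen.len()
    }
}

/// Array-level helper: match on the column type once, then loop over values.
pub fn ingest_column(column: &Column, sketch: &mut DistinctSet) {
    match column {
        Column::Int64(values) => {
            for v in values {
                sketch.add(&v.to_le_bytes());
            }
        }
        Column::Utf8(values) => {
            for s in values {
                sketch.add(s.as_bytes());
            }
        }
    }
}

/// Batch-level wrapper: one sketch per column, in column order.
pub fn build_all(columns: &[Column]) -> Vec<DistinctSet> {
    columns
        .iter()
        .map(|column| {
            let mut sketch = DistinctSet::default();
            ingest_column(column, &mut sketch);
            sketch
        })
        .collect()
}
```

Matching on the type outside the value loop keeps the per-value work down to a byte conversion and an insert, which is presumably why the real helpers dispatch on `DataType` only once per array.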
The crate intentionally does not depend on DataFusion, DuckDB, Polars, or
any other engine — only on arrow itself — so any Arrow-aware caller can use
it. Sketches built from a DataFusion RecordBatch hash to the same keys as
sketches built from a Polars DataFrame's Arrow chunks.
Hash-key conventions
All ingestion paths hash a column value by its canonical byte form:
| Arrow type | Bytes fed to the sketch |
|---|---|
| `Int8` … `Int64`, `UInt8` … `UInt64` | little-endian primitive bytes |
| `Float32`, `Float64` | little-endian (`to_le_bytes`) |
| `Utf8`, `LargeUtf8` | raw UTF-8 bytes |
| `Binary`, `LargeBinary` | bytes as-is |
| `Date32`, `Date64`, `TimestampNanosecond` | little-endian of the underlying int |
| `Boolean` | `[0]` for false, `[1]` for true |
These match the byte form samkhya-core sketches consume directly, so values
added through this crate and values added directly via the core API hash to
the same key.
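As a rough illustration of the conventions in the table above, the byte forms can be produced with standard-library calls only. The `Scalar` enum and `canonical_bytes` function are hypothetical names for this sketch and are not part of samkhya-arrow or samkhya-core.

```rust
/// Hypothetical stand-in for a single Arrow value (not a type from the crate).
#[derive(Debug, Clone, PartialEq)]
pub enum Scalar<'a> {
    Int32(i32),
    UInt64(u64),
    Float64(f64),
    Utf8(&'a str),
    Binary(&'a [u8]),
    /// Date32 is stored as an i32 count of days since the Unix epoch.
    Date32(i32),
    Boolean(bool),
}

/// Returns the bytes that would be hashed for `value`, following the table.
pub fn canonical_bytes(value: &Scalar<'_>) -> Vec<u8> {
    match value {
        // Integers and date types: little-endian bytes of the underlying int.
        Scalar::Int32(v) | Scalar::Date32(v) => v.to_le_bytes().to_vec(),
        Scalar::UInt64(v) => v.to_le_bytes().to_vec(),
        // Floats: little-endian bytes of the IEEE 754 representation.
        Scalar::Float64(v) => v.to_le_bytes().to_vec(),
        // Strings and binary: the raw bytes, unchanged.
        Scalar::Utf8(s) => s.as_bytes().to_vec(),
        Scalar::Binary(b) => b.to_vec(),
        // Booleans: a single byte, 0 or 1.
        Scalar::Boolean(b) => vec![u8::from(*b)],
    }
}
```

One consequence visible in this sketch is that a `Date32` and an `Int32` holding the same underlying integer produce identical bytes, so they would hash to the same key.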
Quick start
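```rust
// Illustrative sketch: the import paths, column contents, precision value,
// and error type below are assumptions, not taken from the crate's docs.
use arrow::array::{ArrayRef, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use std::sync::Arc;
use samkhya_arrow::batch::build_column_sketches;

// One Int64 column with three distinct values.
let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int64, false)]));
let column: ArrayRef = Arc::new(Int64Array::from(vec![1, 2, 3]));
let batch = RecordBatch::try_new(schema, vec![column])?;

// One HLL sketch per column; the precision of 12 is an assumed value.
let sketches = build_column_sketches(&batch, 12)?;
let approx_distinct = sketches[0].estimate();
println!("{approx_distinct}"); // ~3
# Ok::<(), Box<dyn std::error::Error>>(())
```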
Feature flags
This crate has no cargo features. The arrow dependency is pinned to the
major version DataFusion 46 vendors (arrow = "54"), so consumers that
already pull DataFusion never end up with two parallel Arrow stacks linked
into the same binary.
Behavior on unsupported types
- HLL / Bloom / CMS helpers silently skip arrays whose `DataType` is not recognized (e.g. nested `Struct`, `List`, `Dictionary`). A generalized "build sketches for every column" caller can fan out without first auditing the schema.
- The histogram helper is stricter: it requires a numeric column and returns `Error::InvalidSketch` for non-numeric input rather than producing a meaningless empty histogram.
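The difference between the two policies can be sketched as follows. This is a toy model under stated assumptions: `Column`, `SketchError`, `ingest_lenient`, and `histogram_values` are hypothetical names, and only the skip-versus-error behavior is taken from the description above.

```rust
/// Hypothetical stand-in for an Arrow column.
pub enum Column {
    Int64(Vec<i64>),
    Float64(Vec<f64>),
    Utf8(Vec<String>),
    /// Placeholder for types the hashing helpers do not recognise.
    Nested,
}

/// Hypothetical error type modelled on the `InvalidSketch` variant named above.
#[derive(Debug, PartialEq)]
pub enum SketchError {
    InvalidSketch(String),
}

/// Lenient policy: hash-style helpers ingest what they understand and skip
/// the rest. Returns how many values were appended to `out`.
pub fn ingest_lenient(column: &Column, out: &mut Vec<Vec<u8>>) -> usize {
    let before = out.len();
    match column {
        Column::Int64(vs) => out.extend(vs.iter().map(|v| v.to_le_bytes().to_vec())),
        Column::Float64(vs) => out.extend(vs.iter().map(|v| v.to_le_bytes().to_vec())),
        Column::Utf8(vs) => out.extend(vs.iter().map(|s| s.as_bytes().to_vec())),
        // Unrecognised type: do nothing and report no error.
        Column::Nested => {}
    }
    out.len() - before
}

/// Strict policy: histogram input must be numeric, otherwise return an error.
pub fn histogram_values(column: &Column) -> Result<Vec<f64>, SketchError> {
    match column {
        Column::Int64(vs) => Ok(vs.iter().map(|v| *v as f64).collect()),
        Column::Float64(vs) => Ok(vs.clone()),
        Column::Utf8(_) | Column::Nested => Err(SketchError::InvalidSketch(
            "histogram requires a numeric column".to_string(),
        )),
    }
}
```

With this split, a caller looping over every column can ignore the return value of the lenient helper, while a histogram request on a string column surfaces as an explicit error.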
Integration
This crate suits any caller that already speaks Arrow and wants engine-neutral
cardinality stats. Inside the samkhya workspace, this is the path adapters take
when they receive data as Arrow RecordBatch rather than as engine-native rows,
which keeps sketch construction in one tested place rather than re-implemented
per engine.
License
Apache-2.0. Sole author: Prateek Singh.