samkhya-arrow 1.0.0

samkhya integration helpers for the Arrow ecosystem (Series → Sketch builders)

samkhya-arrow


Engine-agnostic Apache Arrow integration for samkhya sketches. Feed an arrow::array::Array or a RecordBatch in, get back ready-to-serialize HLL, Bloom, Count-Min, and equi-depth-histogram sketches.

Part of the samkhya project — portable, feedback-driven cardinality correction for embedded analytical engines.

What this crate provides

  • ingest: array-level helpers that dispatch once on DataType, downcast to the concrete primitive / byte array, and walk values into a sketch:
    • ingest_array_into_hll(array, &mut HllSketch)
    • ingest_array_into_bloom(array, &mut BloomFilter)
    • ingest_array_into_cms(array, &mut CountMinSketch, count_per_value)
    • ingest_array_into_histogram_values(array) -> Result<Vec<f64>>
  • batch: RecordBatch-level convenience wrappers that fan out the array helpers across every column:
    • build_column_sketches(batch, precision) -> Result<Vec<HllSketch>>
    • build_blooms(batch, fp_rate) -> Result<Vec<BloomFilter>>
    • build_histograms(batch, buckets) -> Result<Vec<Option<EquiDepthHistogram>>>
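The "dispatch once, then walk values" pattern described above can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: `Column`, `ByteSink`, `ExactDistinct`, `ingest_column`, and `build_per_column` are hypothetical stand-ins for Arrow arrays, samkhya sketches, and the helpers listed above, and null handling is left out entirely.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a type-erased Arrow array (not the arrow crate's API).
enum Column {
    Int64(Vec<i64>),
    Utf8(Vec<String>),
    /// Stands in for nested types such as Struct or List.
    Nested,
}

/// Hypothetical stand-in for a sketch: anything that accepts byte keys.
trait ByteSink {
    fn add(&mut self, key: &[u8]);
}

/// Toy sink that stores every distinct key exactly.
/// A real HLL sketch would keep only a fixed-size approximation.
#[derive(Default)]
struct ExactDistinct(HashSet<Vec<u8>>);

impl ByteSink for ExactDistinct {
    fn add(&mut self, key: &[u8]) {
        self.0.insert(key.to_vec());
    }
}

/// Match on the column type once, then run a typed loop over the values.
fn ingest_column(col: &Column, sink: &mut impl ByteSink) {
    match col {
        Column::Int64(values) => {
            for v in values {
                sink.add(&v.to_le_bytes());
            }
        }
        Column::Utf8(values) => {
            for s in values {
                sink.add(s.as_bytes());
            }
        }
        // Unrecognized types contribute nothing to the sink.
        Column::Nested => {}
    }
}

/// Batch-level fan-out: one sink per column, in column order.
fn build_per_column(cols: &[Column]) -> Vec<ExactDistinct> {
    cols.iter()
        .map(|col| {
            let mut sink = ExactDistinct::default();
            ingest_column(col, &mut sink);
            sink
        })
        .collect()
}
```

Matching on the type outside the loop keeps the per-value work free of type checks, which is presumably why the crate dispatches once per array rather than once per value.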

The crate intentionally does not depend on DataFusion, DuckDB, Polars, or any other engine — only on arrow itself — so any Arrow-aware caller can use it. Sketches built from a DataFusion RecordBatch hash to the same keys as sketches built from a Polars DataFrame's Arrow chunks.

Hash-key conventions

All ingestion paths hash a column value by its canonical byte form:

Arrow type                                Bytes fed to the sketch
Int8 through Int64, UInt8 through UInt64  little-endian primitive bytes
Float32, Float64                          little-endian (to_le_bytes)
Utf8, LargeUtf8                           raw UTF-8 bytes
Binary, LargeBinary                       bytes as-is
Date32, Date64, TimestampNanosecond       little-endian bytes of the underlying integer
Boolean                                   [0] for false, [1] for true

These match the byte form samkhya-core sketches consume directly, so values added through this crate and values added directly via the core API hash to the same key.
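To make the byte conventions concrete, here is a small standard-library sketch of the mapping. The `Value` enum and `canonical_bytes` function are hypothetical names introduced for illustration; the crate itself operates on Arrow arrays, and only a subset of the listed types is shown.

```rust
/// Hypothetical scalar type used only for this illustration.
enum Value<'a> {
    Int32(i32),
    Int64(i64),
    Float64(f64),
    Utf8(&'a str),
    Binary(&'a [u8]),
    /// Date32 is stored as an i32 day count in Arrow.
    Date32(i32),
    Boolean(bool),
}

/// Produce the byte form described by the conventions above.
fn canonical_bytes(v: &Value<'_>) -> Vec<u8> {
    match v {
        // Integers and floats: little-endian bytes at their native width.
        Value::Int32(x) => x.to_le_bytes().to_vec(),
        Value::Int64(x) => x.to_le_bytes().to_vec(),
        Value::Float64(x) => x.to_le_bytes().to_vec(),
        // Strings and binary: the bytes themselves, with no length prefix.
        Value::Utf8(s) => s.as_bytes().to_vec(),
        Value::Binary(b) => b.to_vec(),
        // Temporal types: little-endian bytes of the underlying integer.
        Value::Date32(d) => d.to_le_bytes().to_vec(),
        // Booleans: a single byte, 0 or 1.
        Value::Boolean(b) => vec![u8::from(*b)],
    }
}
```

Under these rules, `canonical_bytes(&Value::Int64(1))` yields eight bytes and `canonical_bytes(&Value::Int32(1))` yields four. The same logical number stored at different widths would therefore produce different keys, which matters when comparing sketches across columns of different integer types.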

Quick start

use arrow::array::{Int64Array, RecordBatch};
use arrow::datatypes::{DataType, Field, Schema};
use std::sync::Arc;

use samkhya_arrow::batch::build_column_sketches;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A single non-nullable Int64 column named "id".
    let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(
        schema,
        vec![Arc::new(Int64Array::from(vec![1, 2, 3, 1, 2]))],
    )?;

    // Build one HLL sketch per column, using precision 12.
    let sketches = build_column_sketches(&batch, 12)?;
    let approx_distinct = sketches[0].estimate();
    println!("approx_distinct = {approx_distinct}"); // ~3
    Ok(())
}
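For a rough sense of what the `12` in the quick start might buy, the sketch below computes the textbook HyperLogLog register count and relative standard error. It assumes that `precision` is the usual HLL register-index bit count; this document does not confirm that. The function names are hypothetical and are not part of samkhya.

```rust
/// Register count for a HyperLogLog with `precision` index bits: m = 2^p.
/// Assumes the conventional meaning of "precision", which is not confirmed here.
fn hll_registers(precision: u32) -> u64 {
    1u64 << precision
}

/// Textbook relative standard error of HyperLogLog, roughly 1.04 / sqrt(m).
fn hll_relative_std_error(precision: u32) -> f64 {
    1.04 / (hll_registers(precision) as f64).sqrt()
}
```

Under that assumption, precision 12 corresponds to 4096 registers and a relative standard error of about 1.6%.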

Feature flags

This crate has no Cargo features. The arrow dependency is pinned to the same major version that DataFusion 46 depends on (arrow = "54"), so consumers that already pull in DataFusion 46 do not end up with two parallel Arrow stacks linked into the same binary.

Behavior on unsupported types

  • HLL / Bloom / CMS helpers silently skip arrays whose DataType is not recognized (e.g. nested Struct, List, Dictionary). A generalized "build sketches for every column" caller can fan out without first auditing the schema.
  • The histogram helper is stricter: it requires a numeric column and returns Error::InvalidSketch for non-numeric input rather than producing a meaningless empty histogram.
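The contrast between the lenient and strict helpers can be sketched as follows. This is a standard-library illustration only: `Col`, `HelperError`, `lenient_keys`, and `histogram_values` are hypothetical names that mirror the described behavior, not the crate's actual types or signatures.

```rust
/// Hypothetical column type for illustration (not the arrow crate's API).
enum Col {
    Int64(Vec<i64>),
    Float64(Vec<f64>),
    Utf8(Vec<String>),
    /// Stands in for an unsupported nested type.
    Struct,
}

/// Hypothetical error type mirroring the Error::InvalidSketch variant named above.
#[derive(Debug, PartialEq)]
enum HelperError {
    InvalidSketch(String),
}

/// Lenient path (HLL / Bloom / CMS style): unsupported types yield no keys.
fn lenient_keys(col: &Col) -> Vec<Vec<u8>> {
    match col {
        Col::Int64(vs) => vs.iter().map(|v| v.to_le_bytes().to_vec()).collect(),
        Col::Float64(vs) => vs.iter().map(|v| v.to_le_bytes().to_vec()).collect(),
        Col::Utf8(vs) => vs.iter().map(|s| s.as_bytes().to_vec()).collect(),
        // Silently skipped: the caller gets an empty result, not an error.
        Col::Struct => Vec::new(),
    }
}

/// Strict path (histogram style): non-numeric input is an error.
fn histogram_values(col: &Col) -> Result<Vec<f64>, HelperError> {
    match col {
        Col::Int64(vs) => Ok(vs.iter().map(|v| *v as f64).collect()),
        Col::Float64(vs) => Ok(vs.clone()),
        Col::Utf8(_) | Col::Struct => Err(HelperError::InvalidSketch(
            "histogram requires a numeric column".to_string(),
        )),
    }
}
```

A string column illustrates the difference: `lenient_keys` returns one key per value, while `histogram_values` returns `Err(HelperError::InvalidSketch(..))` because no numeric ordering exists to bucket on.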

Integration

This crate suits any caller that already speaks Arrow and wants engine-neutral cardinality statistics. Inside the samkhya workspace, it is the path adapters take when they receive data as Arrow RecordBatch values rather than as engine-native rows, which keeps sketch construction in one tested place instead of re-implementing it per engine.

License

Apache-2.0. Sole author: Prateek Singh.