samkhya_arrow/lib.rs

//! Engine-agnostic Arrow integration for samkhya sketches.
//!
//! This crate is the bridge between Apache Arrow data and samkhya's
//! cardinality / membership / range sketches. Consumers (DataFusion,
//! DuckDB extensions reading Arrow, Polars, custom Arrow pipelines)
//! feed an [`arrow::array::Array`] or an [`arrow::record_batch::RecordBatch`]
//! in, and get back ready-to-serialize sketches.
//!
//! The crate intentionally does **not** depend on DataFusion or any
//! other compute engine, only on `arrow` itself, so it stays usable
//! from any Arrow-aware caller.
//!
//! # Hash-key conventions
//!
//! All ingestion paths hash a column value by its canonical byte form:
//!
//! - Numeric types: little-endian bytes of the underlying primitive.
//! - `Utf8` / `LargeUtf8`: the raw UTF-8 bytes of the string.
//! - `Binary` / `LargeBinary`: the bytes as-is.
//! - `Date32` / `Date64` / `TimestampNanosecond`: little-endian bytes
//!   of the underlying integer.
//! - `Boolean`: a single byte, `0` for false, `1` for true.
//!
//! These conventions match the byte form that `samkhya-core` sketches
//! already consume (see `HllSketch::add`, `BloomFilter::insert`,
//! `CountMinSketch::add`), so values added through this crate and values
//! added directly via the core API hash to the same key.
#![deny(rustdoc::broken_intra_doc_links)]

pub mod batch;
pub mod ingest;
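
The hash-key conventions in the module documentation can be illustrated with a minimal, standard-library-only sketch. `Value` and `canonical_bytes` are hypothetical names introduced here for illustration; they are not part of `samkhya_arrow`, `arrow`, or `samkhya-core`, and the crate's real ingestion code may be organized differently. The sketch only shows which bytes each kind of value would contribute under the stated rules.

```rust
/// Hypothetical stand-in for a single column value (assumption: not a
/// type from `arrow`, `samkhya_arrow`, or `samkhya-core`).
#[derive(Debug, Clone, PartialEq)]
pub enum Value<'a> {
    Int32(i32),
    Int64(i64),
    Float64(f64),
    Utf8(&'a str),
    Binary(&'a [u8]),
    /// Days since the Unix epoch, stored as a 32-bit integer.
    Date32(i32),
    /// Nanoseconds since the Unix epoch, stored as a 64-bit integer.
    TimestampNanosecond(i64),
    Boolean(bool),
}

/// Returns the canonical byte form for `value`, following the documented
/// conventions. The function itself is illustrative, not the crate's API.
pub fn canonical_bytes(value: &Value<'_>) -> Vec<u8> {
    match value {
        // Numeric and temporal types: little-endian bytes of the primitive.
        Value::Int32(v) | Value::Date32(v) => v.to_le_bytes().to_vec(),
        Value::Int64(v) | Value::TimestampNanosecond(v) => v.to_le_bytes().to_vec(),
        Value::Float64(v) => v.to_le_bytes().to_vec(),
        // Strings: the raw UTF-8 bytes, with no length prefix or terminator.
        Value::Utf8(s) => s.as_bytes().to_vec(),
        // Binary: the bytes as-is.
        Value::Binary(b) => b.to_vec(),
        // Boolean: a single byte, 0 for false and 1 for true.
        Value::Boolean(b) => vec![u8::from(*b)],
    }
}
```

Under these rules the integer width is part of the byte form: `canonical_bytes(&Value::Int32(1))` yields four bytes and `canonical_bytes(&Value::Int64(1))` yields eight. Assuming a sketch hashes the bytes exactly as given, the same logical number stored at different widths would therefore map to different keys.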