Skip to main content

Crate asap_sketchlib

Crate asap_sketchlib 

Source
Expand description

§asap_sketchlib

Crates.io docs.rs License MSRV

A Rust library for streaming data sketches — compact data structures that give approximate answers (counts, distinct counts, percentiles) over data streams too large to store exactly.

§Why asap_sketchlib

  • Fast. Up to 8–14× higher insertion throughput than comparable libraries on frequency sketches, 2–3× on cardinality sketches, and 2–4× on quantile sketches. Rust-native with no language-boundary overhead. See benchmarks.
  • High coverage. Supports frequency, cardinality, quantile, and distribution sketches (CountMin, Count, HyperLogLog, KLL, DDSketch). Also includes algorithms not found in other libraries: UnivMon for estimating a broad class of streaming statistics (L1/L2 norms, entropy) in a single pass, Hydra for answering sketch queries over arbitrary subpopulations without per-group sketches, and NitroBatch for accelerating sketch updates through batching. Unique sketch frameworks for sliding windows (ExponentialHistogram) and subpopulation queries (Hydra).
  • Easy to use. Most sketches provide a unified API style, while some (such as KLL) use update/quantile; the crate also offers typed inputs via DataInput, pluggable hashing via SketchHasher, and multi-sketch composition with shared hashing (HashSketchEnsemble).

§Supported Sketches

GoalSketchWhen to pick itWhat it doesPolars equivalent
Frequency estimationCountMin, CountFast approximate counts for high-volume keysEstimates how often each key appears in a streamdf.group_by("key").agg(pl.len())
Cardinality estimationHyperLogLog (Classic, ErtlMLE, HIP)Approximate distinct counts with bounded memoryEstimates the number of unique elementsdf["col"].n_unique()
Quantiles / distributionKLL, DDSketchPercentile / latency summaries over streamsApproximates arbitrary quantiles (e.g. p50, p99) of a value distributiondf["col"].quantile(0.99)
Subpopulation queriesHydraHierarchical / filtered sketch queriesAnswers sketch queries over arbitrary subpopulations without maintaining per-group sketchesNo direct equivalent — requires per-group aggregation
Universal monitoringUnivMonG-sum queries (L1/L2 norms, cardinality, entropy)Estimates a broad class of streaming statistics in a single passNo direct equivalent — requires custom multi-pass pipelines
Update accelerationNitroBatchBatch-accelerated sketch updatesSpeeds up sketch insertions by batching updatesNo direct equivalent

Full sketch status and API details: APIs Index.

§Quick Start

Minimum Supported Rust Version (MSRV): 1.85 (Rust 2024 edition)

Install from crates.io:

cargo add asap_sketchlib
[dependencies]
asap_sketchlib = "0.2"

API docs are hosted on docs.rs.

Alternatively, pin to a tagged revision from GitHub:

[dependencies]
asap_sketchlib = { git = "https://github.com/ProjectASAP/asap_sketchlib", tag = "v0.2.2" }

§Count distinct users with HyperLogLog

use asap_sketchlib::{ErtlMLE, HyperLogLog, DataInput};

// HyperLogLog estimates the number of distinct items in a stream using fixed memory.
// ErtlMLE is one of the HLL variants we offer — it tends to be more accurate than
// the `Classic` variant, especially at very low or very high cardinalities.
let mut hll = HyperLogLog::<ErtlMLE>::default();

// Insert some user IDs — HLL handles distinct counting and deduplicates items.
for user_id in [101, 202, 303, 101, 404, 202, 505, 101] {
    hll.insert(&DataInput::U64(user_id));
}

let unique_users = hll.estimate();
println!("estimated unique users: {unique_users}"); // ≈ 5

§Estimate frequency of items with Count-Min Sketch

use asap_sketchlib::{CountMin, FastPath, Vector2D, DataInput};

// Count-Min Sketch estimates how often each item appears in a stream.
// It may over-count but never under-counts.
//
// Vector2D<i32> is the backing storage (a 2D array of 32-bit counters).
// FastPath uses a single hash with bit-masking to pick row indices — faster
// than the default RegularPath which hashes once per row.
let mut cms = CountMin::<Vector2D<i32>, FastPath>::with_dimensions(3, 2048);

// Simulate an event stream with known frequencies.
let events = [
    ("page_view", 1000),
    ("click",      500),
    ("signup",     100),
    ("purchase",    50),
];
for &(event, count) in &events {
    for _ in 0..count {
        cms.insert(&DataInput::Str(event));
    }
}

// Estimates are close to the true counts (CMS may over-count, but never under-counts).
for &(event, true_count) in &events {
    let est = cms.estimate(&DataInput::Str(event));
    println!("{event:>10}: estimate = {est}, true = {true_count}");
}

§Track latency percentiles with KLL

use asap_sketchlib::KLL;

// KLL is a quantile sketch — it tracks the distribution of values so you can
// ask questions like "what is the median?" without storing every data point.
let mut sketch = KLL::<f64>::default();

// Simulate 1000 latency samples in milliseconds
for i in 0..1000 {
    let ms = (i as f64) * 0.5 + 1.0;
    sketch.update(&ms);
}

let p50 = sketch.quantile(0.50);
let p99 = sketch.quantile(0.99);
println!("median ≈ {p50:.1} ms, p99 ≈ {p99:.1} ms");

§Merge multiple sketch instances

use asap_sketchlib::{ErtlMLE, HyperLogLog, DataInput};

// Sketches are mergeable — you can build one per node and combine them later
// to get a global answer without shipping raw data.
let mut node_a = HyperLogLog::<ErtlMLE>::default();
let mut node_b = HyperLogLog::<ErtlMLE>::default();

// Each node sees different (and some overlapping) users
for id in [1, 2, 3, 4, 5]  { node_a.insert(&DataInput::U64(id)); }
for id in [4, 5, 6, 7, 8]  { node_b.insert(&DataInput::U64(id)); }

node_a.merge(&node_b);
println!("total unique users ≈ {}", node_a.estimate()); // ≈ 8

§Choosing Between Sketches

Several sketches address the same goal with different trade-offs — for example, CountMin vs Count for frequency, or KLL vs DDSketch for quantiles.

We are building SketchPlan, a profiler that analyzes a representative sample of your data and recommends the best sketch configuration (algorithm, memory budget, error tolerance) for your workload. Until SketchPlan is ready, the APIs Index lists guarantees, error bounds, and caveats for each sketch to help you decide.

§Performance

Insertion throughput on 10M Zipf-distributed values, averaged over 10 runs:

  • Frequency sketches: up to 8-14x higher insertion throughput than comparable libraries
  • Cardinality sketches: roughly 2-3x higher insertion throughput
  • Quantile sketches: roughly 2-4x higher insertion throughput

Benchmark methodology, tuning notes, and performance details (including cache-friendly layouts and FastPath single-hash mode) are in Performance Notes.

§Documentation

DocContents
APIs IndexPer-sketch API reference with status and error guarantees
Advanced Use CasesHierarchical queries, windowed sketching, multi-sketch coordination
Docs IndexFull documentation index

If you are evaluating the crate for production use, start with the API index first. It calls out which APIs are stable today and which are still feature-gated or experimental.

§Dev Commands

cargo build --all-targets
cargo test --all-features
  • --all-targets builds everything: the library, binaries, and tests.
  • --all-features enables every Cargo feature, so all feature-gated code is compiled and tested. The features include:
    • experimental — enables sketches and APIs that are still under development and may change without notice.
    • octo-runtime — enables the Octo multi-threaded runtime (pulls in core_affinity and crossbeam-channel).

To build or test with a specific feature:

cargo build --features experimental
cargo test --features "experimental octo-runtime"

§Protobuf code generation

asap_sketchlib is a pure-Rust crate: building it does not require protoc or any build script. The Rust types generated from proto/**/*.proto are vendored into src/proto/generated/ and refreshed manually by maintainers using the in-repo tool at tools/gen-proto/.

Downstream users can therefore simply add the crate to their Cargo.toml and build it like any other pure-Rust dependency.

§For maintainers

After editing any .proto file, regenerate the vendored output and commit the result:

cargo run --manifest-path tools/gen-proto/Cargo.toml
git add src/proto/generated

The tool uses prost-build together with the protoc-bin-vendored binary, so no system protoc is required to regenerate either. CI rejects any pull request whose committed src/proto/generated/ does not match the result of running this command on its current .proto sources.

§FAQ

§When is Apache DataSketches a better fit?

  • You need its broader algorithm catalog (CPC, Theta/Tuple with set operators, REQ, VarOpt/Reservoir, FM85).
  • You need cross-language binary compatibility with existing DataSketches deployments in Java, C++, or Python.
  • You need long-running production maturity and an Apache-governed release cycle.

§Contributors

§Major Contributors

§Other Contributors

§License

Copyright 2025 - present ProjectASAP

Licensed under the MIT License. See LICENSE. asap_sketchlib is organized into the following layers:

  • sketches: core sketch data structures such as Count-Min, HyperLogLog, KLL, and DDSketch.
  • common: shared input types, hashing abstractions, storage backends, and reusable utilities used across sketches.
  • proto: portable protobuf message types for sketch interchange.
  • sketch_framework: higher-level composition layers such as Hydra, UnivMon, tumbling windows, and batch/parallel execution helpers.
  • message_pack_format: MessagePack/proto wire format shared with sketchlib-go. Owns the wire-DTO sketch types consumed by the ASAP query engine (CountMinSketch, CountSketch, KllSketch, HllSketch, DdSketch, HydraKllSketch, CountMinSketchWithHeap, SetAggregator, DeltaResult).

Most users can start with the crate-root re-exports such as DataInput, CountMin, HyperLogLog, KLL, and DDSketch. Reach for the submodules directly when you need lower-level storage, hashing, or framework-specific APIs.

Re-exports§

pub use message_pack_format::portable::countminsketch::CountMinSketch;
pub use message_pack_format::portable::countminsketch::CountMinSketchDelta;
pub use message_pack_format::portable::countminsketch_topk::CmsHeapItem;
pub use message_pack_format::portable::countminsketch_topk::CountMinSketchWithHeap;
pub use message_pack_format::portable::countsketch::COUNT_SKETCH_TOPK_CAPACITY;
pub use message_pack_format::portable::countsketch::CountSketch;
pub use message_pack_format::portable::countsketch::CountSketchDelta;
pub use message_pack_format::portable::ddsketch::DDSKETCH_GROW_CHUNK;
pub use message_pack_format::portable::ddsketch::DdSketch;
pub use message_pack_format::portable::ddsketch::DdSketchDelta;
pub use message_pack_format::portable::delta_set_aggregator::DeltaResult;
pub use message_pack_format::portable::hll::HllSketch;
pub use message_pack_format::portable::hll::HllSketchDelta;
pub use message_pack_format::portable::hll::HllVariant;
pub use message_pack_format::portable::hydra_kll::HydraKllSketch;
pub use message_pack_format::portable::kll::KllSketch;
pub use message_pack_format::portable::kll::KllSketchData;
pub use message_pack_format::portable::set_aggregator::SetAggregator;
pub use common::*;
pub use sketch_framework::*;
pub use sketches::*;

Modules§

common
Shared primitives used across sketches, including input wrappers, hashers, storage backends, and reusable utilities. Shared building blocks used across the library.
message_pack_format
MessagePack format description shared with sketchlib-go.
proto
Portable protobuf message types for sketch interchange.
sketch_framework
Higher-level composition layers such as Hydra, UnivMon, tumbling windows, and batch/parallel execution helpers. Higher-level frameworks built on top of the core sketches.
sketches
Core sketch implementations such as Count-Min, HyperLogLog, KLL, and DDSketch. Core sketch implementations.

Macros§

impl_fixed_matrix
Generates a fixed-size matrix storage type.