dsrs contains bindings for a subset of Apache DataSketches.
Stateful reducers that maintain distinct-count and heavy-hitters sketches, intended to serve the dsrs command-line tool for deduplicating byte lines of input.
A small abstraction for reducing over byte lines from a stream, used by the command-line tool.
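To make "reducing over byte lines" concrete, here is a minimal sketch using only the standard library. The function name `reduce_byte_lines` and its signature are hypothetical and are not taken from the dsrs API. It assumes lines are separated by `\n` and treats each line as raw bytes, so invalid UTF-8 is passed through untouched.

```rust
use std::io::{self, BufRead};

/// Hypothetical helper (not part of dsrs): folds `step` over every
/// `\n`-delimited byte line of `reader`, starting from `init`.
pub fn reduce_byte_lines<R, T, F>(reader: R, init: T, mut step: F) -> io::Result<T>
where
    R: BufRead,
    F: FnMut(T, &[u8]) -> T,
{
    let mut acc = init;
    // `BufRead::split` yields each line as a Vec<u8> with the delimiter removed.
    for line in reader.split(b'\n') {
        let line = line?; // propagate I/O errors to the caller
        acc = step(acc, &line);
    }
    Ok(acc)
}
```

A reducer in this style could be called with `std::io::stdin().lock()` as the reader and a closure that updates a sketch or counter for each line.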
The Heavy Hitter (HH) sketch computes an approximate set of the heavy hitters, the items in a data stream which appear most often. Along with each proposed approximate heavy hitter, the sketch can provide an estimate of the number of its appearances.
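The idea of an approximate heavy-hitters summary can be illustrated with the classic Misra-Gries algorithm. The sketch below is a self-contained illustration only: the `MisraGries` type and its methods are hypothetical names, and this is not the dsrs binding or necessarily the exact algorithm DataSketches uses. It keeps at most `capacity` counters, and each reported count is a lower bound on the true count.

```rust
use std::collections::HashMap;

/// Illustrative Misra-Gries summary (hypothetical type, not the dsrs API).
pub struct MisraGries {
    capacity: usize,
    counters: HashMap<Vec<u8>, u64>,
}

impl MisraGries {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be positive");
        Self { capacity, counters: HashMap::new() }
    }

    /// Processes one occurrence of `item`.
    pub fn update(&mut self, item: &[u8]) {
        if let Some(count) = self.counters.get_mut(item) {
            *count += 1;
        } else if self.counters.len() < self.capacity {
            self.counters.insert(item.to_vec(), 1);
        } else {
            // No free counter: decrement every counter and drop those that
            // reach zero. The new item itself is not stored.
            self.counters.retain(|_, count| {
                *count -= 1;
                *count > 0
            });
        }
    }

    /// Lower-bound estimate of how often `item` appeared (0 if untracked).
    pub fn estimate(&self, item: &[u8]) -> u64 {
        self.counters.get(item).copied().unwrap_or(0)
    }

    /// Tracked candidates, most frequent first (ties broken by item bytes).
    pub fn candidates(&self) -> Vec<(Vec<u8>, u64)> {
        let mut out: Vec<_> = self
            .counters
            .iter()
            .map(|(item, count)| (item.clone(), *count))
            .collect();
        out.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
        out
    }
}
```

With `k` counters and a stream of `n` items, each estimate undercounts by at most `n / (k + 1)`, so any item appearing more often than that is guaranteed to remain among the candidates.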
The Theta sketch is, essentially, an adaptive random sample of a stream. As a result, it can be used to estimate distinct counts, and sketches can be combined to estimate the distinct counts of unions, intersections, and differences of streams.
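The sampling idea can be sketched with a simplified K-Minimum-Values (KMV) estimator, a close relative of the Theta sketch. This is an illustration under stated assumptions, not the dsrs API: `KmvSketch` and its methods are hypothetical names, hashing uses the standard library's `DefaultHasher` rather than the hash DataSketches uses, and only the union operation is shown.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeSet;
use std::hash::Hasher;

/// Illustrative KMV sketch (hypothetical type, not the dsrs API):
/// retains the `k` smallest distinct 64-bit hashes seen so far.
pub struct KmvSketch {
    k: usize,
    hashes: BTreeSet<u64>,
}

impl KmvSketch {
    pub fn new(k: usize) -> Self {
        assert!(k >= 2, "k must be at least 2");
        Self { k, hashes: BTreeSet::new() }
    }

    fn hash(item: &[u8]) -> u64 {
        let mut hasher = DefaultHasher::new();
        hasher.write(item);
        hasher.finish()
    }

    fn insert_hash(&mut self, hash: u64) {
        self.hashes.insert(hash);
        if self.hashes.len() > self.k {
            // Evict the largest hash so only the k smallest remain.
            let largest = *self.hashes.iter().next_back().unwrap();
            self.hashes.remove(&largest);
        }
    }

    pub fn update(&mut self, item: &[u8]) {
        self.insert_hash(Self::hash(item));
    }

    /// Estimated number of distinct items.
    pub fn estimate(&self) -> f64 {
        if self.hashes.len() < self.k {
            // Fewer than k distinct hashes: the retained count is exact.
            return self.hashes.len() as f64;
        }
        // theta: the k-th smallest hash mapped into (0, 1].
        let kth = *self.hashes.iter().next_back().unwrap();
        let theta = (kth as f64 + 1.0) / 2f64.powi(64);
        (self.k as f64 - 1.0) / theta
    }

    /// Sketch of the union of both streams, sized to the smaller k.
    pub fn union(&self, other: &Self) -> Self {
        let mut out = Self::new(self.k.min(other.k));
        for &hash in self.hashes.iter().chain(other.hashes.iter()) {
            out.insert_hash(hash);
        }
        out
    }
}
```

Because each sketch is a sample of the smallest hash values, merging two sketches and keeping the smallest values again gives the same result as sketching the combined stream. The relative error of the estimate shrinks roughly as `1 / sqrt(k)`.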