# quantiles

This crate is intended to be a collection of approximate quantile algorithms that provide guarantees around space and computation. Recent literature has advanced approximation techniques, but none is generally applicable and each has fundamental tradeoffs.

Initial work was done to support internal Postmates projects but the hope is that the crate can be generally useful.

## The Algorithms

### CKMS - Effective Computation of Biased Quantiles over Data Streams

This is an implementation of the algorithm presented in Cormode, Korn,
Muthukrishnan, and Srivastava's paper "Effective Computation of Biased
Quantiles over Data Streams". The ambition here is to approximate quantiles on
a stream of data without keeping a boatload of information in memory. This
implementation follows the IEEE version of the paper. The authors'
self-published copy of the paper is incorrect, and this implementation will
*not* make sense if you follow along using that version. Only the 'full biased'
invariant is used. The 'targeted quantiles' variant of this algorithm is
fundamentally flawed, an issue which the authors correct in their "Space- and
Time-Efficient Deterministic Algorithms for Biased Quantiles over Data Streams".

```
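use quantiles::ckms::CKMS;

// Error bound ε = 0.001, samples stored as u16.
let mut ckms = CKMS::<u16>::new(0.001);
for i in 1..1001 {
    ckms.insert(i as u16);
}

assert_eq!(ckms.query(0.0), Some((1, 1)));
assert_eq!(ckms.query(0.998), Some((998, 998)));
assert_eq!(ckms.query(0.999), Some((999, 999)));
assert_eq!(ckms.query(1.0), Some((1000, 1000)));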
```

Queries provide an approximation to the true quantile, within +/- εΦn. In the above, ε is set to 0.001 and n is 1000. The minimum and maximum quantiles, 0.0 and 1.0, are already precise. The error for the query at Φ = 0.998 is then +/- 0.001 × 0.998 × 1000 = 0.998. (This query happens to return the exact quantile, but that doesn't always hold.)
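As a rough illustration of the arithmetic above, the bound εΦn can be evaluated directly. The helper `rank_error` is a hypothetical name used only for this sketch and is not part of the crate. It evaluates the formula as written and does not model the special handling of the 0.0 and 1.0 queries.

```rust
/// Evaluate the error bound epsilon * phi * n for a query at quantile `phi`.
/// Hypothetical helper for illustration; not part of the crate's API.
fn rank_error(epsilon: f64, phi: f64, n: usize) -> f64 {
    epsilon * phi * n as f64
}

fn main() {
    let epsilon = 0.001;
    let n = 1000;
    // The bound grows with phi: low quantiles are answered more tightly.
    for phi in [0.1, 0.5, 0.998] {
        println!("phi = {}: +/- {:.3}", phi, rank_error(epsilon, phi, n));
    }
}
```

Because the bound scales with Φ, a 'biased' summary of this kind is tightest at the low end of the distribution and loosest near the top.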

For an error ε this structure will require `T*(floor(1/(2*ε)) + O(1/ε log εn)) + f64 + usize + usize` words of storage, where T is the specialized type.

In local testing, insertion per point takes approximately 4 microseconds with a variance of 7%. This comes to 250k points per second.

### Misra Gries - ε-approximate frequency counts

Misra-Gries calculates an ε-approximate frequency count for a stream of N elements. The output is the k most frequent elements, with the following guarantees:

- the approximate count f'[e] is never larger than the true frequency f[e] of e, and is smaller by at most εN, i.e., (f[e] - εN) ≤ f'[e] ≤ f[e]
- any element e with a frequency f[e] ≥ εN appears in the result set

The error bound is ε = 1/(k+1), where k is the number of counters used in the algorithm. When k = 1, i.e. a single counter, the algorithm is equivalent to the Boyer-Moore Majority algorithm.
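The counter scheme described above can be sketched with the standard library alone. This is a minimal standalone illustration, not the crate's implementation: the function name `misra_gries_sketch`, the slice-based signature, and the `HashMap` return type are all assumptions made for the example.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Illustrative Misra-Gries pass using at most `k` counters.
/// Standalone sketch; the crate's own function may differ in signature and details.
fn misra_gries_sketch<T: Eq + Hash + Clone>(stream: &[T], k: usize) -> HashMap<T, usize> {
    let mut counters: HashMap<T, usize> = HashMap::new();
    for item in stream {
        if let Some(count) = counters.get_mut(item) {
            // Already tracked: bump its counter.
            *count += 1;
        } else if counters.len() < k {
            // A counter is free: start tracking this element.
            counters.insert(item.clone(), 1);
        } else {
            // All k counters are taken: decrement each one, drop those that
            // reach zero, and discard the incoming element.
            for count in counters.values_mut() {
                *count -= 1;
            }
            counters.retain(|_, count| *count > 0);
        }
    }
    counters
}

fn main() {
    let numbers = [1u32, 3, 2, 1, 3, 4, 3, 1, 2, 1];
    // k = 3 counters over N = 10 elements.
    let counts = misra_gries_sketch(&numbers, 3);
    println!("{:?}", counts);

    // With a single counter the same loop behaves like Boyer-Moore majority voting.
    let votes = ['a', 'b', 'a', 'c', 'a'];
    let majority = misra_gries_sketch(&votes, 1);
    println!("{:?}", majority);
}
```

In this sketch every decrement step removes one occurrence from up to k tracked elements plus the discarded newcomer, which is where the εN = N/(k+1) undercount limit comes from.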

If you want to check for elements that appear at least εN times, you will want to perform a second pass to calculate the exact frequencies of the values in the result set, which can be done in constant space.
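One possible shape for that second pass is sketched below. The function `confirm_frequent` and its `threshold` parameter are hypothetical names for this example and are not provided by the crate. The sketch keeps one exact counter per candidate, so the extra space depends on the number of candidates (at most k) rather than on the stream length.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Second pass: count exact frequencies for the candidates only, then keep
/// those that occur at least `threshold` times.
/// Hypothetical helper for illustration; not part of the crate's API.
fn confirm_frequent<T: Eq + Hash + Clone>(
    stream: &[T],
    candidates: &[T],
    threshold: usize,
) -> HashMap<T, usize> {
    // One zeroed counter per candidate.
    let mut exact: HashMap<T, usize> = candidates.iter().cloned().map(|c| (c, 0)).collect();
    for item in stream {
        // Elements outside the candidate set are ignored.
        if let Some(count) = exact.get_mut(item) {
            *count += 1;
        }
    }
    exact.retain(|_, count| *count >= threshold);
    exact
}

fn main() {
    let numbers = [1u32, 3, 2, 1, 3, 4, 3, 1, 2, 1];
    // Candidates as a first pass with k = 3 might report them.
    let candidates = [1u32, 2, 3];
    // Keep only elements seen at least 3 times out of N = 10.
    let frequent = confirm_frequent(&numbers, &candidates, 3);
    println!("{:?}", frequent);
}
```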

```
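use quantiles::misra_gries::*;

let k: usize = 3;
let numbers: Vec<u32> = vec![1, 3, 2, 1, 3, 4, 3, 1, 2, 1];
let counts = misra_gries(numbers.iter(), k);
// εN with ε = 1/(k+1)
let bound = numbers.len() / (k + 1);
let in_range = |f_expected: usize, f_approx: usize| {
    f_approx <= f_expected &&
    (bound >= f_expected || f_approx >= (f_expected - bound))
};
assert!(in_range(4usize, *counts.get(&1).unwrap()));
assert!(in_range(2usize, *counts.get(&2).unwrap()));
assert!(in_range(3usize, *counts.get(&3).unwrap()));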
```

### Greenwald Khanna - ε-approximate quantiles

Greenwald Khanna calculates ε-approximate quantiles. If the desired quantile
is `φ`, the ε-approximate quantile is any element in the range of elements that
rank between `⌊(φ-ε)N⌋` and `⌊(φ+ε)N⌋`.

The stream summary data structure can cope with up to `usize::MAX` observations.

The beginning and end quantiles are clamped at the minimum and maximum observed elements respectively.
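To make the rank window concrete, the sketch below computes `⌊(φ-ε)N⌋` and `⌊(φ+ε)N⌋` and checks whether a value from a fully sorted data set falls inside it. Both `rank_bounds` and `is_eps_approximate` are hypothetical helpers written for this example, not crate functions. Treating ranks as 1-based and clamping them to `1..=N` are assumptions of the sketch.

```rust
/// Inclusive window of acceptable 1-based ranks for quantile `phi`:
/// floor((phi - eps) * n) ..= floor((phi + eps) * n), clamped to 1..=n.
/// Hypothetical helper for illustration; not part of the crate's API.
fn rank_bounds(phi: f64, epsilon: f64, n: usize) -> (usize, usize) {
    let n_f = n as f64;
    let lower = ((phi - epsilon) * n_f).floor().max(1.0);
    let upper = ((phi + epsilon) * n_f).floor().max(1.0).min(n_f);
    (lower as usize, upper as usize)
}

/// True if `candidate` sits at an acceptable rank in `sorted`
/// (ascending, distinct values).
fn is_eps_approximate(sorted: &[u32], candidate: u32, phi: f64, epsilon: f64) -> bool {
    let (lower, upper) = rank_bounds(phi, epsilon, sorted.len());
    match sorted.binary_search(&candidate) {
        Ok(index) => {
            let rank = index + 1;
            lower <= rank && rank <= upper
        }
        // A value that was never observed cannot be an answer.
        Err(_) => false,
    }
}

fn main() {
    let data: Vec<u32> = (1..=16).collect();
    let epsilon = 0.125;
    // For the median of 16 elements, ranks 6 through 10 are all acceptable.
    println!("{:?}", rank_bounds(0.5, epsilon, data.len()));
    println!("{}", is_eps_approximate(&data, 7, 0.5, epsilon));
    println!("{}", is_eps_approximate(&data, 12, 0.5, epsilon));
}
```

A real stream summary never holds the sorted data. The exhaustive check here only shows which answers the guarantee permits.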

This page explains the theory: http://www.mathcs.emory.edu/~cheung/Courses/584-StreamDB/Syllabus/08-Quantile/Greenwald.html

```
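use quantiles::greenwald_khanna::*;

let epsilon = 0.01;
let mut stream = Stream::new(epsilon);

let n = 1001;
for i in 1..n {
    stream.insert(i);
}
// Accept any value whose rank lies within [(φ-ε)N, (φ+ε)N].
let in_range = |phi: f64, value: u32| {
    let lower = ((phi - epsilon) * (n as f64)) as u32;
    let upper = ((phi + epsilon) * (n as f64)) as u32;
    (epsilon > phi || lower <= value) && value <= upper
};
assert!(in_range(0f64, *stream.quantile(0f64)));
assert!(in_range(0.1f64, *stream.quantile(0.1f64)));
assert!(in_range(0.2f64, *stream.quantile(0.2f64)));
assert!(in_range(0.3f64, *stream.quantile(0.3f64)));
assert!(in_range(0.4f64, *stream.quantile(0.4f64)));
assert!(in_range(1f64, *stream.quantile(1f64)));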
```

## Upgrading

### 0.2 -> 0.3

This release introduces two new algorithms, "Greenwald Khanna" and "Misra Gries". The existing CKMS has been moved from root to its own submodule. You'll need to update your imports from

```
use quantiles::CKMS;
```

to

```
use quantiles::ckms::CKMS;
```