pub struct Quantogram { /* private fields */ }
Expand description

Provides a weighted Histogram of f64 values for computing approximate quantiles. This guarantees a configurable maximum absolute relative error and uses sparse storage to reduce memory usage.

Worst case accuracy defaults to one percent (0.01) absolute relative error. The error is unbiased, uniform for the entire range of numbers. The error for quantiles 0 and 1 (the minimum and maximum, respectively) is guaranteed to be zero, except if either of those values is removed.

If all inserted values are given a weight of one, this behaves as an unweighted (normal) histogram.

Samples may be added or removed. However, removing a sample that equals the minimum or maximum will cause those values to be replaced by the center value of the appropriate extreme histogram bucket.

The full valid range of floats is divided into two levels of buckets. For the default case with 1% error, here is what that means:

Top. The top level divides the number range into buckets for each power of two and between positive and negative numbers. It has a maximum of 508 buckets in two sparse lists. f64 values can range from 2^-126 to 2^127, thus there are a maximum of 254 buckets for positive numbers and 254 buckets for negatives. Bottom. The second level of buckets are in a single sparse list spanning the full range from the smallest negative number to the largest positive number. Each power of two range is broken into 35 buckets whose size varies exponentially. Each bucket is larger than the previous by a factor of 1.02 (or smaller by 1.02, over the range of negative values). The value of 1.02 was chosen because 1.02^35 = 1.999889553, which differs from 2 by 0.00011. That means that the 35th bucket for each power of two is slightly larger than the rest, but not by enough to wreck the error guarantee. There are a maximum of 1 + 508*35 buckets in this list, or 17,781. (The “1” is for the zero bucket.)

A bucket is not created until at least one value is added to it. Removing the last item in a bucket will not cause its memory to be freed; its weight will be set to zero.

(If you are familiar with the Rule of 70 used to estimate the doubling period for a given interest rate, it was probably chosen because of this nice property of 1.02.)

The error rate of 0.01 and the bin scale factor of 1.02 are related in this way:

 
            (1 + error)     (1 + 1/101)      102
   scale = ------------- = ------------- = ------- = 1.02
            (1 - error)     (1 - 1/101)      100

So technically, the worst case error is 1/101, or 0.99%, not 1%, but the math is easier when using 1.02 instead of 1.020202020202 and the error on the last bucket is ideal. (A smaller error in the last bucket is available for error = 1/176, but the memory requirements are much higher.)

Usage:

Typical usage (unweighted samples with the default accuracy of 1%):

 use quantogram::Quantogram;
 let mut q = Quantogram::new();
 q.add(10.0);
 q.add(40.0);
 q.add(20.0);
 q.add(30.0);
 q.add(50.0);

 assert_eq!(q.min().unwrap(), 10.0);
 assert_eq!(q.max().unwrap(), 50.0);
 assert_eq!(q.mean().unwrap(), 30.0);
 assert_eq!(q.median().unwrap(), 30.0);
 assert_eq!(q.quantile(0.75).unwrap(), 40.0);

 q.remove(10.0);
 q.remove(20.0);
 assert_eq!(q.mean().unwrap(), 40.0);

Notes:

  1. Coarse bins are for powers of two not some other value because getting the largest power of two less than or equal to a number calls a fast intrinsic function. This makes assigning a number to a bin very fast.

  2. When inquiring about a quantile, a value will be returned so long as one number in range is added. NANs and Infinities will be disregarded. The NANs and Infinities are available as separate counts.

  3. Unbounded errors are possible in the edge case of a large gap in the middle of the data. Take the case of the median. If there are an even number of items in the data with a large gap in the exact middle, then the proper formula for the median is the mean of the last value below the gap and the first value above the gap. To correct for this, use the fussy_quantile method. It will probe for quantiles at Φ + ε and Φ - ε for a small value of ε, and average the pair that best span the gap. If the histogram is unweighted (all weights are one) then the value of ε should be 1/2N, where N is the number of items already added to the Histogram. If the samples are weighted, not sure what to do.

  4. If one sample is in a bin, the value for the bin will be set to the accurate sample, not the midpoint of the range for that bin. If a second sample is added, the bin value will be set to the midpoint. Consequently, bins with one sample added have no error.

Implementations

Create a Quantogram with the default error rate of 1%.

Create a Quantogram where growth and bins are set, and underflow and overflow bounds are set to the given powers of two.

  • Inserted values having absolute magnitude below 2^smallest_power will be treated as zeroes.
  • Inserted values having absolute magnitude above 2^largest_power will be treated as infinities.

Note: To ensure that consist values for growth and bins are set, it is advised to use QuantogramBuilder instead of directly calling this constructor.

Add a sample to the histogram with a weight of one.

Remove a sample from the histogram, deducting one from the weight. If the weight goes negative, the call panics.

Add a sample to the histogram with the given weight. If the weight is positive, the sample is added, otherwise removed. If the cumulative weight for the bin holding that sample goes negative, the call panics.

Add many samples to the Quantogram, all having a weight of 1.0.

Return actual (not estimated) weighted mean of inserted samples (with machine precision), or None if no finite values have been inserted. Excludes all NANs and Infinities that have been inserted.

Return actual (not estimated) minimum of inserted samples (with machine precision), or None if no finite values have been inserted. Excludes all NANs and Infinities that have been inserted.

Return actual (not estimated) maximum of inserted samples (with machine precision), or None if no finite values have been inserted. Excludes all NANs and Infinities that have been inserted.

Return count of inserted samples, including NANs and Infinities.

Return the weighted fraction of values that are finite (not NAN or +/- Infinity) as a value in the range [0,1]. If no values have yet been added, return NAN.

Return the weighted fraction of values that are zero, including NANs and infinities in the basis. If no values have yet been added, return NAN.

Return the weighted fraction of values that are NAN, including infinities in the basis. If no values have yet been added, return NAN.

Return an estimate of the median.

Exact weighted variance of all added finite values.

Exact weighted population standard deviation of all added finite values.

Return an estimate of the mode. If no finite samples have been added, returns an empty list. If there are multiple modes tieing with the same weight, return all of them in a list. If all samples in the collection are integers, round the mode.

Estimate the half-sample mode. More resistant than the mode to outliers and contamination (noise).

Based on the mode estimator by Robertson and Cryer (1974), and an algorithm described in “On a Fast, Robust Estimator of the Mode” by David Bickel and Rudolf Fruhwirth. That algorithm is applied to raw samples, whereas this is applied to already histogrammed values.

Edge case: If fewer than five bins of data are present in the histogram, the true mode will be returned, unless the data is multi-moded, in which case None will be returned.

NOTE:

Estimate the quantile for the inserted data. For the minimum, use phi = 0.0. For the median, use phi = 0.5. For the 90th percentile, phi = 0.9. For the maximum, use phi = 1.0.

Added samples that are NANs or Infinities are excluded from the computation.

Quantile measurement that corrects for issues that occur when there is a large gap in samples right where the quantile falls.

For example, the median of { 1,2,3,4,5,95,96,97,98,99 } is 50! This is because there are an even number of items in the set, so you have to average the middle two values.

Three quantile calculations will be taken, at Φ - ε, Φ and Φ + ε. If the difference between the results at the two lower quantiles differs substantially from the difference between the results at the two higher quantiles, assume there is a gap and average the middle value and the most extreme value.

threshold_ratio decides whether a gap exists. If none exists, no averaging occurs. It is used to compare the deltas between quantiles computed at the desired phi and values of phi slightly lower or higher. If the deltas are similar, no gap exists. A value over 2.0 seems sensible, but testing should be done to determine the best value.

Get the range of quantiles between which the given value falls, or None if no finite samples have been added yet or the queried value is not finite.

Range between the highest and lowest values sampled.

Sample at the first quartile.

Sample at the third quartile.

Inter quartile range, which is (Q3 - Q1)

Quartile deviation or semi-interquartile range, which is (Q3 - Q1) / 2

Coefficient of range.

                       max - min
   coeff of range =  ------------
                       max + min

Coefficient of quartile deviation, which measures the relative spread between the first and third quartiles.

                                   q3 - q1
   coeff of quartile deviation =  ---------
                                   q3 + q1

Coefficient of standard deviation.

This give the standard deviation divided by the mean.

Coefficient of variation.

This give the standard deviation divided by the mean expressed as a percentage.

Computes a relative measure of the storage being used by the histogram.

As bins are dynamically added to the sparse Skipmaps, this increases.

Exponent on the largest power of two less than or equal to the magnitude of the given number, or None if a NAN, Infinity or Zero.

  use quantogram::Quantogram;
  assert_eq!(Quantogram::power_of_two(5.0).unwrap(), 2);
  assert_eq!(Quantogram::power_of_two(4.0).unwrap(), 2);
  assert_eq!(Quantogram::power_of_two(1.5).unwrap(), 0);
  assert_eq!(Quantogram::power_of_two(-8.5).unwrap(), 3);
  assert_eq!(Quantogram::power_of_two(0.0), None);
  assert_eq!(Quantogram::power_of_two(0.25).unwrap(), -2);

Trait Implementations

Formats the value using the given formatter. Read more

Auto Trait Implementations

Blanket Implementations

Gets the TypeId of self. Read more

Immutably borrows from an owned value. Read more

Mutably borrows from an owned value. Read more

Returns the argument unchanged.

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

The type returned in the event of a conversion error.

Performs the conversion.

The type returned in the event of a conversion error.

Performs the conversion.