Struct quantogram::Quantogram
source · [−]pub struct Quantogram { /* private fields */ }
Expand description
Provides a weighted Histogram of f64 values for computing approximate quantiles. This guarantees a configurable maximum absolute relative error and uses sparse storage to reduce memory usage.
Worst case accuracy defaults to one percent (0.01) absolute relative error. The error is unbiased, uniform for the entire range of numbers. The error for quantiles 0 and 1 (the minimum and maximum, respectively) is guaranteed to be zero, except if either of those values is removed.
If all inserted values are given a weight of one, this behaves as an unweighted (normal) histogram.
Samples may be added or removed. However, removing a sample that equals the minimum or maximum will cause those values to be replaced by the center value of the appropriate extreme histogram bucket.
The full valid range of floats is divided into two levels of buckets. For the default case with 1% error, here is what that means:
Top. The top level divides the number range into buckets for each power of two and between positive and negative numbers. It has a maximum of 508 buckets in two sparse lists. f64 values can range from 2^-126 to 2^127, thus there are a maximum of 254 buckets for positive numbers and 254 buckets for negatives. Bottom. The second level of buckets are in a single sparse list spanning the full range from the smallest negative number to the largest positive number. Each power of two range is broken into 35 buckets whose size varies exponentially. Each bucket is larger than the previous by a factor of 1.02 (or smaller by 1.02, over the range of negative values). The value of 1.02 was chosen because 1.02^35 = 1.999889553, which differs from 2 by 0.00011. That means that the 35th bucket for each power of two is slightly larger than the rest, but not by enough to wreck the error guarantee. There are a maximum of 1 + 508*35 buckets in this list, or 17,781. (The “1” is for the zero bucket.)
A bucket is not created until at least one value is added to it. Removing the last item in a bucket will not cause its memory to be freed; its weight will be set to zero.
(If you are familiar with the Rule of 70 used to estimate the doubling period for a given interest rate, it was probably chosen because of this nice property of 1.02.)
The error rate of 0.01 and the bin scale factor of 1.02 are related in this way:
(1 + error) (1 + 1/101) 102
scale = ------------- = ------------- = ------- = 1.02
(1 - error) (1 - 1/101) 100
So technically, the worst case error is 1/101, or 0.99%, not 1%, but the math is easier when using 1.02 instead of 1.020202020202 and the error on the last bucket is ideal. (A smaller error in the last bucket is available for error = 1/176, but the memory requirements are much higher.)
Usage:
Typical usage (unweighted samples with the default accuracy of 1%):
use quantogram::Quantogram;
let mut q = Quantogram::new();
q.add(10.0);
q.add(40.0);
q.add(20.0);
q.add(30.0);
q.add(50.0);
assert_eq!(q.min().unwrap(), 10.0);
assert_eq!(q.max().unwrap(), 50.0);
assert_eq!(q.mean().unwrap(), 30.0);
assert_eq!(q.median().unwrap(), 30.0);
assert_eq!(q.quantile(0.75).unwrap(), 40.0);
q.remove(10.0);
q.remove(20.0);
assert_eq!(q.mean().unwrap(), 40.0);
Notes:
-
Coarse bins are for powers of two not some other value because getting the largest power of two less than or equal to a number calls a fast intrinsic function. This makes assigning a number to a bin very fast.
-
When inquiring about a quantile, a value will be returned so long as one number in range is added. NANs and Infinities will be disregarded. The NANs and Infinities are available as separate counts.
-
Unbounded errors are possible in the edge case of a large gap in the middle of the data. Take the case of the median. If there are an even number of items in the data with a large gap in the exact middle, then the proper formula for the median is the mean of the last value below the gap and the first value above the gap. To correct for this, use the fussy_quantile method. It will probe for quantiles at Φ + ε and Φ - ε for a small value of ε, and average the pair that best span the gap. If the histogram is unweighted (all weights are one) then the value of ε should be 1/2N, where N is the number of items already added to the Histogram. If the samples are weighted, not sure what to do.
-
If one sample is in a bin, the value for the bin will be set to the accurate sample, not the midpoint of the range for that bin. If a second sample is added, the bin value will be set to the midpoint. Consequently, bins with one sample added have no error.
Implementations
sourceimpl Quantogram
impl Quantogram
sourcepub fn with_configuration(
growth: f64,
bins: usize,
smallest_power: isize,
largest_power: isize
) -> Self
pub fn with_configuration(
growth: f64,
bins: usize,
smallest_power: isize,
largest_power: isize
) -> Self
Create a Quantogram where growth and bins are set, and underflow and overflow bounds are set to the given powers of two.
- Inserted values having absolute magnitude below 2^smallest_power will be treated as zeroes.
- Inserted values having absolute magnitude above 2^largest_power will be treated as infinities.
Note: To ensure that consist values for growth and bins are set, it is advised to use QuantogramBuilder instead of directly calling this constructor.
pub fn replace_hsm_cache(&mut self, new_cache: HalfSampleModeCache)
sourcepub fn remove(&mut self, sample: f64)
pub fn remove(&mut self, sample: f64)
Remove a sample from the histogram, deducting one from the weight. If the weight goes negative, the call panics.
sourcepub fn add_weighted(&mut self, sample: f64, weight: f64) -> f64
pub fn add_weighted(&mut self, sample: f64, weight: f64) -> f64
Add a sample to the histogram with the given weight. If the weight is positive, the sample is added, otherwise removed. If the cumulative weight for the bin holding that sample goes negative, the call panics.
sourcepub fn add_unweighted_samples<'a, S>(
&mut self,
samples: impl Iterator<Item = &'a S>
) where
S: 'a + Into<f64> + Copy,
pub fn add_unweighted_samples<'a, S>(
&mut self,
samples: impl Iterator<Item = &'a S>
) where
S: 'a + Into<f64> + Copy,
Add many samples to the Quantogram, all having a weight of 1.0.
sourcepub fn mean(&self) -> Option<f64>
pub fn mean(&self) -> Option<f64>
Return actual (not estimated) weighted mean of inserted samples (with machine precision), or None if no finite values have been inserted. Excludes all NANs and Infinities that have been inserted.
sourcepub fn min(&self) -> Option<f64>
pub fn min(&self) -> Option<f64>
Return actual (not estimated) minimum of inserted samples (with machine precision), or None if no finite values have been inserted. Excludes all NANs and Infinities that have been inserted.
sourcepub fn max(&self) -> Option<f64>
pub fn max(&self) -> Option<f64>
Return actual (not estimated) maximum of inserted samples (with machine precision), or None if no finite values have been inserted. Excludes all NANs and Infinities that have been inserted.
sourcepub fn finite(&self) -> f64
pub fn finite(&self) -> f64
Return the weighted fraction of values that are finite (not NAN or +/- Infinity) as a value in the range [0,1]. If no values have yet been added, return NAN.
sourcepub fn zero(&self) -> f64
pub fn zero(&self) -> f64
Return the weighted fraction of values that are zero, including NANs and infinities in the basis. If no values have yet been added, return NAN.
sourcepub fn nan(&self) -> f64
pub fn nan(&self) -> f64
Return the weighted fraction of values that are NAN, including infinities in the basis. If no values have yet been added, return NAN.
sourcepub fn stddev(&self) -> Option<f64>
pub fn stddev(&self) -> Option<f64>
Exact weighted population standard deviation of all added finite values.
sourcepub fn mode(&self) -> Vec<f64>
pub fn mode(&self) -> Vec<f64>
Return an estimate of the mode. If no finite samples have been added, returns an empty list. If there are multiple modes tieing with the same weight, return all of them in a list. If all samples in the collection are integers, round the mode.
sourcepub fn hsm(&self) -> Option<f64>
pub fn hsm(&self) -> Option<f64>
Estimate the half-sample mode. More resistant than the mode to outliers and contamination (noise).
Based on the mode estimator by Robertson and Cryer (1974), and an algorithm described in “On a Fast, Robust Estimator of the Mode” by David Bickel and Rudolf Fruhwirth. That algorithm is applied to raw samples, whereas this is applied to already histogrammed values.
Edge case: If fewer than five bins of data are present in the histogram, the true mode will be returned, unless the data is multi-moded, in which case None will be returned.
NOTE:
sourcepub fn quantile(&self, phi: f64) -> Option<f64>
pub fn quantile(&self, phi: f64) -> Option<f64>
Estimate the quantile for the inserted data. For the minimum, use phi = 0.0. For the median, use phi = 0.5. For the 90th percentile, phi = 0.9. For the maximum, use phi = 1.0.
Added samples that are NANs or Infinities are excluded from the computation.
sourcepub fn fussy_quantile(&self, phi: f64, threshold_ratio: f64) -> Option<f64>
pub fn fussy_quantile(&self, phi: f64, threshold_ratio: f64) -> Option<f64>
Quantile measurement that corrects for issues that occur when there is a large gap in samples right where the quantile falls.
For example, the median of { 1,2,3,4,5,95,96,97,98,99 } is 50! This is because there are an even number of items in the set, so you have to average the middle two values.
Three quantile calculations will be taken, at Φ - ε, Φ and Φ + ε. If the difference between the results at the two lower quantiles differs substantially from the difference between the results at the two higher quantiles, assume there is a gap and average the middle value and the most extreme value.
threshold_ratio decides whether a gap exists. If none exists, no averaging occurs. It is used to compare the deltas between quantiles computed at the desired phi and values of phi slightly lower or higher. If the deltas are similar, no gap exists. A value over 2.0 seems sensible, but testing should be done to determine the best value.
sourcepub fn quantile_at(&self, value: f64) -> Option<(f64, f64)>
pub fn quantile_at(&self, value: f64) -> Option<(f64, f64)>
Get the range of quantiles between which the given value falls, or None if no finite samples have been added yet or the queried value is not finite.
sourcepub fn quartile_deviation(&self) -> Option<f64>
pub fn quartile_deviation(&self) -> Option<f64>
Quartile deviation or semi-interquartile range, which is (Q3 - Q1) / 2
sourcepub fn coeff_of_range(&self) -> Option<f64>
pub fn coeff_of_range(&self) -> Option<f64>
Coefficient of range.
max - min
coeff of range = ------------
max + min
sourcepub fn coeff_of_quartile_dev(&self) -> Option<f64>
pub fn coeff_of_quartile_dev(&self) -> Option<f64>
Coefficient of quartile deviation, which measures the relative spread between the first and third quartiles.
q3 - q1
coeff of quartile deviation = ---------
q3 + q1
sourcepub fn coeff_of_stddev(&self) -> Option<f64>
pub fn coeff_of_stddev(&self) -> Option<f64>
Coefficient of standard deviation.
This give the standard deviation divided by the mean.
sourcepub fn coeff_of_variation(&self) -> Option<f64>
pub fn coeff_of_variation(&self) -> Option<f64>
Coefficient of variation.
This give the standard deviation divided by the mean expressed as a percentage.
sourcepub fn size(&self) -> usize
pub fn size(&self) -> usize
Computes a relative measure of the storage being used by the histogram.
As bins are dynamically added to the sparse Skipmaps, this increases.
sourcepub fn power_of_two(sample: f64) -> Option<isize>
pub fn power_of_two(sample: f64) -> Option<isize>
Exponent on the largest power of two less than or equal to the magnitude of the given number, or None if a NAN, Infinity or Zero.
use quantogram::Quantogram;
assert_eq!(Quantogram::power_of_two(5.0).unwrap(), 2);
assert_eq!(Quantogram::power_of_two(4.0).unwrap(), 2);
assert_eq!(Quantogram::power_of_two(1.5).unwrap(), 0);
assert_eq!(Quantogram::power_of_two(-8.5).unwrap(), 3);
assert_eq!(Quantogram::power_of_two(0.0), None);
assert_eq!(Quantogram::power_of_two(0.25).unwrap(), -2);
Trait Implementations
Auto Trait Implementations
impl !RefUnwindSafe for Quantogram
impl Send for Quantogram
impl !Sync for Quantogram
impl Unpin for Quantogram
impl UnwindSafe for Quantogram
Blanket Implementations
sourceimpl<T> BorrowMut<T> for T where
T: ?Sized,
impl<T> BorrowMut<T> for T where
T: ?Sized,
const: unstable · sourcefn borrow_mut(&mut self) -> &mut T
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value. Read more