The `aggfunc` module is the central module for computing statistics from a stream of records. Its central component is a trait called `Accumulate`, which defines a `new` function for initialization, an `update` function to add a new record, and a `compute` function to compute the final value of the aggregation.
This trait requires two types: an input type (which is used by the `new` and `update` functions) and an output type.
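As a rough illustration of the shape described above, the following sketch defines a trait with `new`, `update`, and `compute`, parameterized by an input and an output type. The generic parameter names `I` and `O`, the exact signatures (including `compute` returning an `Option`), and the `RunningSum` struct are assumptions made for this example, not the crate's actual definitions.

```rust
/// Sketch of the trait shape described above. The generic parameters
/// `I` (input) and `O` (output) and the signatures are assumptions.
pub trait Accumulate<I, O> {
    /// Start an accumulator from the first record.
    fn new(item: I) -> Self;
    /// Fold one more record into the running state.
    fn update(&mut self, item: I);
    /// Produce the final value, or `None` if it is undefined.
    fn compute(&self) -> Option<O>;
}

/// Hypothetical accumulator: keeps a running sum of `f64` values.
pub struct RunningSum {
    total: f64,
}

impl Accumulate<f64, f64> for RunningSum {
    fn new(item: f64) -> Self {
        RunningSum { total: item }
    }

    fn update(&mut self, item: f64) {
        self.total += item;
    }

    fn compute(&self) -> Option<f64> {
        Some(self.total)
    }
}
```

Because only the running total is kept, memory use stays constant no matter how many records are fed through `update`.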
Internally, all of the structs implementing this trait are used in the main aggregation module, with the input type bounded by `FromStr` so the tool can convert from string records to the internal data types that these aggregation types manipulate. The output type is bounded by `Display` so the tool can write the outputs to standard output.
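The role of the two bounds can be sketched with a small generic driver: each string record is parsed through `FromStr`, fed to the accumulator, and the result is rendered through `Display`. The `aggregate` function and the `Largest` struct are hypothetical names invented for this example, and the trait signatures are assumed; the real aggregation module may be organized quite differently.

```rust
use std::fmt::Display;
use std::str::FromStr;

/// Assumed trait shape (signatures are illustrative, not the crate's own).
pub trait Accumulate<I, O> {
    fn new(item: I) -> Self;
    fn update(&mut self, item: I);
    fn compute(&self) -> Option<O>;
}

/// Hypothetical accumulator that tracks the largest `i64` seen so far.
pub struct Largest {
    max: i64,
}

impl Accumulate<i64, i64> for Largest {
    fn new(item: i64) -> Self {
        Largest { max: item }
    }

    fn update(&mut self, item: i64) {
        if item > self.max {
            self.max = item;
        }
    }

    fn compute(&self) -> Option<i64> {
        Some(self.max)
    }
}

/// Hypothetical driver: parse each string record into `I`, feed it to the
/// accumulator `A`, and render the result as text via `Display`.
pub fn aggregate<A, I, O>(records: &[&str]) -> Result<Option<String>, I::Err>
where
    A: Accumulate<I, O>,
    I: FromStr,
    O: Display,
{
    let mut iter = records.iter();
    // The first record seeds the accumulator; an empty stream yields `None`.
    let first = match iter.next() {
        Some(record) => record.parse::<I>()?,
        None => return Ok(None),
    };
    let mut acc = A::new(first);
    for record in iter {
        acc.update(record.parse::<I>()?);
    }
    Ok(acc.compute().map(|value| value.to_string()))
}
```

With these definitions, `aggregate::<Largest, i64, i64>(&["3", "10", "7"])` parses the three strings and returns the rendered maximum, while an unparseable record surfaces as the `FromStr` error.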
Structs
- The total number of records added to the accumulator.
- The total number of unique records.
- The largest value (or the value that would appear last in a sorted array).
- The mean. This is only implemented for `DecimalWrapper`, though it could probably be extended for floating point types.
- The median value. I’ve stored values in a `BTreeMap` in order to minimize memory usage. As a result, this is the least performant of all the functions, running at N log(m) rather than the N of all the other algorithms (where m is the number of unique values in the accumulator).
- A combination of the minimum and maximum values, producing a string concatenating the minimum value and the maximum value together, separated by a hyphen.
- The minimum value.
- The most commonly appearing item.
- The range, or the difference between the minimum and maximum values (where the minimum value is subtracted from the maximum value).
- Computes the sample variance in a single pass, using Welford’s algorithm. The attributes in this method refer to the same ones described in Accuracy and Stability of Numerical Algorithms by Higham (2nd Edition, page 11).
- The running sum of a stream of values.
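The single-pass sample variance mentioned in the list above can be sketched with Welford's update rule. This is a minimal illustration over `f64`; the struct name `SampleVariance` and the field names `count`, `mean`, and `m2` are chosen for this example and are not taken from the crate or from Higham's notation.

```rust
/// Hypothetical single-pass sample variance using Welford's algorithm.
pub struct SampleVariance {
    /// Number of values seen so far.
    count: u64,
    /// Running mean of the values seen so far.
    mean: f64,
    /// Running sum of squared deviations from the current mean.
    m2: f64,
}

impl SampleVariance {
    /// Start the accumulator from the first value.
    pub fn new(first: f64) -> Self {
        SampleVariance { count: 1, mean: first, m2: 0.0 }
    }

    /// Fold in one more value without storing earlier values.
    pub fn update(&mut self, x: f64) {
        self.count += 1;
        let delta = x - self.mean;
        self.mean += delta / self.count as f64;
        // Uses the deviation from the old mean times the deviation from the new mean.
        self.m2 += delta * (x - self.mean);
    }

    /// Sample variance (divides by n - 1); undefined for fewer than two values.
    pub fn compute(&self) -> Option<f64> {
        if self.count < 2 {
            None
        } else {
            Some(self.m2 / (self.count - 1) as f64)
        }
    }
}
```

Only three numbers are kept regardless of stream length, and the update avoids subtracting two large sums of squares, which is the usual motivation for preferring this form over the textbook two-sum formula.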
Traits
- Accumulates records from a stream, in order to allow functions to be optimized for minimal memory usage.
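To make the memory trade-off concrete, here is one way the `BTreeMap`-based median described in the Structs list might look: the map stores one counter per unique value, so memory grows with the number of unique values rather than the number of records. The struct name `MedianSketch`, the use of `i64` keys, and averaging the two middle values for even-length streams are all assumptions for this sketch.

```rust
use std::collections::BTreeMap;

/// Hypothetical median accumulator storing one counter per unique value.
pub struct MedianSketch {
    counts: BTreeMap<i64, usize>,
    total: usize,
}

impl MedianSketch {
    /// Start the accumulator from the first value.
    pub fn new(first: i64) -> Self {
        let mut counts = BTreeMap::new();
        counts.insert(first, 1);
        MedianSketch { counts, total: 1 }
    }

    /// Each insert costs O(log m), where m is the number of unique values.
    pub fn update(&mut self, value: i64) {
        *self.counts.entry(value).or_insert(0) += 1;
        self.total += 1;
    }

    /// Walk the keys in sorted order until the middle rank(s) are reached.
    pub fn compute(&self) -> Option<f64> {
        // Zero-based ranks of the lower and upper middle elements;
        // they coincide when the total is odd.
        let lo_rank = (self.total - 1) / 2;
        let hi_rank = self.total / 2;
        let mut seen = 0usize;
        let mut lo: Option<i64> = None;
        for (&value, &count) in &self.counts {
            seen += count;
            if lo.is_none() && seen > lo_rank {
                lo = Some(value);
            }
            if seen > hi_rank {
                return lo.map(|l| (l as f64 + value as f64) / 2.0);
            }
        }
        None
    }
}
```

For a stream with many repeated values the map stays small, at the cost of a logarithmic factor on every `update`.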