Module clipivot::aggfunc


The aggfunc module is the central module for computing statistics from a stream of records.

The core of the module is a trait called Accumulate, which declares three functions: new, called on initialization; update, which adds a new record; and compute, which returns the final value of the aggregation. The trait takes two type parameters: an input type (used by the new and update functions) and an output type.

Internally, every struct implementing this trait is used in the main aggregation module with the input type bounded by FromStr, so the tool can convert string records into the internal data types that these aggregations manipulate. The output type is bounded by Display, so the tool can write the results to standard output.
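The trait shape and bounds described above can be sketched roughly as follows. This is a minimal illustration, not the crate's actual source: the exact method signatures (in particular compute returning an Option), the RunningSum struct, and the aggregate helper are all assumptions made for the example, as is the choice to skip records that fail to parse.

```rust
use std::fmt::Display;
use std::ops::AddAssign;
use std::str::FromStr;

/// Assumed shape of the trait: `I` is the input type, `O` the output type.
pub trait Accumulate<I, O> {
    /// Creates the accumulator from a first record.
    fn new(item: I) -> Self;
    /// Adds one more record.
    fn update(&mut self, item: I);
    /// Returns the final value of the aggregation.
    fn compute(&self) -> Option<O>;
}

/// Hypothetical accumulator that keeps a running sum.
pub struct RunningSum<T> {
    total: T,
}

impl<T: AddAssign + Copy> Accumulate<T, T> for RunningSum<T> {
    fn new(item: T) -> Self {
        RunningSum { total: item }
    }

    fn update(&mut self, item: T) {
        self.total += item;
    }

    fn compute(&self) -> Option<T> {
        Some(self.total)
    }
}

/// Hypothetical driver: parses string records (hence `FromStr`), feeds them
/// to an accumulator, and formats the result (hence `Display`).
/// Records that fail to parse are skipped, which is an assumption of this sketch.
pub fn aggregate<A, I, O>(records: &[&str]) -> Option<String>
where
    A: Accumulate<I, O>,
    I: FromStr,
    O: Display,
{
    let mut parsed = records.iter().filter_map(|r| r.parse::<I>().ok());
    // The first record initializes the accumulator; an empty stream yields None.
    let mut acc = A::new(parsed.next()?);
    for item in parsed {
        acc.update(item);
    }
    acc.compute().map(|value| value.to_string())
}
```

With these definitions, aggregate::<RunningSum<i64>, i64, i64>(&["1", "2", "3"]) parses each record, sums them, and returns the formatted total.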

Structs

  • The total number of records added to the accumulator.
  • The total number of unique records.
  • The largest value (or the value that would appear last in a sorted array).
  • The mean. This is only implemented for DecimalWrapper, though it could probably be extended for floating point types.
  • The median value. I’ve stored values in a BTreeMap in order to minimize memory usage. As a result, this is the least performant of all the functions, running at N log(m) rather than the N of all the other algorithms, where m is the number of unique values in the accumulator.
  • A combination of the minimum and maximum values: a string made up of the minimum value and the maximum value, separated by a hyphen.
  • The minimum value.
  • The most commonly appearing item.
  • The range, or the difference between the minimum and maximum values (where the minimum value is subtracted from the maximum value).
  • Computes the sample variance in a single pass, using Welford’s algorithm. The attributes in this method refer to the same ones described in Accuracy and Stability of Numerical Algorithms by Higham (2nd Edition, page 11).
  • The running sum of a stream of values.
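The single-pass variance mentioned in the list above can be sketched as follows. This is a minimal illustration of Welford's algorithm using f64; the struct name, field names, and method names here are assumptions for the example and do not reproduce the crate's own type or its decimal arithmetic.

```rust
/// Hypothetical single-pass variance accumulator (Welford's algorithm).
pub struct Welford {
    /// Number of values seen so far.
    n: u64,
    /// Running mean of the values.
    mean: f64,
    /// Running sum of squared deviations from the mean.
    m2: f64,
}

impl Welford {
    /// Starts the accumulator from a first value.
    pub fn new(x: f64) -> Self {
        Welford { n: 1, mean: x, m2: 0.0 }
    }

    /// Folds one more value into the running mean and squared deviations.
    pub fn update(&mut self, x: f64) {
        self.n += 1;
        let delta = x - self.mean;
        self.mean += delta / self.n as f64;
        // Uses the deviation from the old mean times the deviation from the new mean.
        self.m2 += delta * (x - self.mean);
    }

    /// Sample variance (divides by n - 1); undefined for fewer than two values.
    pub fn sample_variance(&self) -> Option<f64> {
        if self.n < 2 {
            None
        } else {
            Some(self.m2 / (self.n - 1) as f64)
        }
    }
}
```

Because only three numbers are kept regardless of how many records arrive, memory use stays constant, and the update avoids the cancellation problems of the naive sum-of-squares formula.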

Traits

  • Accumulates records from a stream, in order to allow functions to be optimized for minimal memory usage.
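As an illustration of the memory trade-off, the BTreeMap-based median described in the struct list can be sketched by storing one count per unique value instead of every record. This is only a sketch under assumptions: the MedianCounts name, the i64 input type, and the f64 result are chosen for the example and are not taken from the crate.

```rust
use std::collections::BTreeMap;

/// Hypothetical median accumulator keyed by unique value.
pub struct MedianCounts {
    /// Sorted map from value to the number of times it has been seen.
    counts: BTreeMap<i64, usize>,
    /// Total number of records added.
    total: usize,
}

impl MedianCounts {
    /// Starts the accumulator from a first value.
    pub fn new(x: i64) -> Self {
        let mut counts = BTreeMap::new();
        counts.insert(x, 1);
        MedianCounts { counts, total: 1 }
    }

    /// Each update is a logarithmic-time map operation in the number of unique values.
    pub fn update(&mut self, x: i64) {
        *self.counts.entry(x).or_insert(0) += 1;
        self.total += 1;
    }

    /// Walks the keys in sorted order until the middle rank(s) are reached.
    pub fn median(&self) -> Option<f64> {
        // 0-based ranks of the middle element(s); they coincide when total is odd.
        let hi = self.total / 2;
        let lo = if self.total % 2 == 0 { hi - 1 } else { hi };
        let mut seen = 0usize;
        let mut lo_val = None;
        for (&value, &count) in &self.counts {
            seen += count;
            if lo_val.is_none() && seen > lo {
                lo_val = Some(value);
            }
            if seen > hi {
                return Some((lo_val? + value) as f64 / 2.0);
            }
        }
        None
    }
}
```

Memory grows with the number of unique values rather than the number of records, which is the trade-off against the extra logarithmic cost per update.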