Strategies used by GridBuilder to infer optimal parameters from data for building Bins and Grid instances.

The docs for each strategy have been taken almost verbatim from NumPy.

Each strategy specifies how to compute the optimal number of Bins or the optimal bin width. For those strategies that prescribe the optimal number of Bins, the optimal bin width is computed by bin_width = (max - min)/n.

Since all bins are left-closed and right-open, an extra bin is added when necessary to include the maximum value of the given data, so that no data is discarded.
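The two rules above can be illustrated with a minimal sketch in plain Rust. The function names `bin_width` and `n_bins` are hypothetical and are not part of this module's API; the code only mirrors the arithmetic described here, not the actual implementation.

```rust
/// Bin width for a strategy that prescribes the number of bins `n`:
/// bin_width = (max - min) / n.
/// Hypothetical helper, for illustration only.
fn bin_width(min: f64, max: f64, n: usize) -> f64 {
    (max - min) / n as f64
}

/// Number of left-closed, right-open bins of width `width`, starting at
/// `min`, needed so that `max` falls inside some bin.
/// Hypothetical helper, for illustration only.
fn n_bins(min: f64, max: f64, width: f64) -> usize {
    let mut n = ((max - min) / width).ceil() as usize;
    // The last bin is [.., min + n * width). If `max` sits exactly on that
    // right edge it is excluded, so one more bin is needed.
    if min + n as f64 * width <= max {
        n += 1;
    }
    n
}

fn main() {
    let (min, max) = (0.0, 10.0);
    let width = bin_width(min, max, 5);
    println!("width = {width}, bins needed = {}", n_bins(min, max, width));
}
```

With `min = 0`, `max = 10` and five prescribed bins, the width is 2 and the fifth bin is `[8, 10)`, which excludes 10; a sixth bin `[10, 12)` is what keeps the maximum.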

Strategies

Currently, the following strategies are implemented:

  • Auto: Maximum of the Sturges and FreedmanDiaconis strategies. Provides good all-around performance.
  • FreedmanDiaconis: Robust (resilient to outliers) strategy that takes into account data variability and data size.
  • Rice: A strategy that does not take variability into account, only data size. Commonly overestimates number of bins required.
  • Sqrt: Square root (of data size) strategy, used by Excel and other programs for its speed and simplicity.
  • Sturges: R’s default strategy, only accounts for data size. Only optimal for Gaussian data and underestimates number of bins for large non-Gaussian datasets.
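For orientation, the estimators behind these names can be sketched with the formulas given in NumPy's histogram documentation. This is an assumption about the underlying rules rather than a copy of this crate's code, and the free functions below are hypothetical names, not the module's API. Here `n` is the number of observations and `iqr` is the interquartile range.

```rust
/// Square-root rule: number of bins = ceil(sqrt(n)).
fn sqrt_bins(n: usize) -> usize {
    (n as f64).sqrt().ceil() as usize
}

/// Rice rule: number of bins = ceil(2 * n^(1/3)).
fn rice_bins(n: usize) -> usize {
    (2.0 * (n as f64).cbrt()).ceil() as usize
}

/// Sturges rule: number of bins = ceil(log2(n) + 1).
fn sturges_bins(n: usize) -> usize {
    ((n as f64).log2() + 1.0).ceil() as usize
}

/// Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3).
/// Unlike the rules above, this one prescribes a width, not a count.
fn freedman_diaconis_width(iqr: f64, n: usize) -> f64 {
    2.0 * iqr / (n as f64).cbrt()
}

fn main() {
    let n = 100;
    println!("sqrt:    {} bins", sqrt_bins(n));
    println!("rice:    {} bins", rice_bins(n));
    println!("sturges: {} bins", sturges_bins(n));
    println!("fd:      width {:.4}", freedman_diaconis_width(2.0, n));
}
```

Only the Freedman-Diaconis rule looks at the spread of the data (through the IQR); the other three depend on the sample size alone, which is why they can misjudge heavily skewed or heavy-tailed data.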

Notes

In general, successful inference of the optimal bin width and number of bins relies on variability in the data. In other words, the provided observations should be neither empty nor constant.

In addition, Auto and FreedmanDiaconis require the interquartile range (IQR), i.e. the difference between the upper and lower quartiles, to be positive.
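The IQR requirement is stricter than "not constant", which a small sketch makes concrete. The helpers `quantile` and `iqr` below are hypothetical, and the linear interpolation between order statistics is an assumption chosen for illustration; the crate may compute quartiles differently.

```rust
/// Quantile `q` in [0, 1] of an already sorted, non-empty slice, using
/// linear interpolation between neighbouring order statistics.
/// Hypothetical helper, for illustration only.
fn quantile(sorted: &[f64], q: f64) -> f64 {
    let pos = q * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    sorted[lo] + (pos - lo as f64) * (sorted[hi] - sorted[lo])
}

/// Interquartile range (upper quartile minus lower quartile),
/// or `None` when there are no observations.
fn iqr(data: &[f64]) -> Option<f64> {
    if data.is_empty() {
        return None;
    }
    let mut sorted = data.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    Some(quantile(&sorted, 0.75) - quantile(&sorted, 0.25))
}

fn main() {
    // Evenly spread data: the IQR is positive.
    println!("{:?}", iqr(&[1.0, 2.0, 3.0, 4.0, 5.0]));
    // Not constant, but both quartiles coincide: the IQR is zero.
    println!("{:?}", iqr(&[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 9.0]));
}
```

The second data set is not constant, yet both quartiles equal 1, so a rule that divides the data range by a width proportional to the IQR has nothing to work with.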

Structs

Auto: Maximum of the Sturges and FreedmanDiaconis strategies. Provides good all-around performance.

FreedmanDiaconis: Robust (resilient to outliers) strategy that takes into account data variability and data size.

Rice: A strategy that does not take variability into account, only data size. Commonly overestimates number of bins required.

Sqrt: Square root (of data size) strategy, used by Excel and other programs for its speed and simplicity.

Sturges: R’s default strategy, only accounts for data size. Only optimal for Gaussian data and underestimates number of bins for large non-Gaussian datasets.

Traits

A trait implemented by all strategies to build Bins with parameters inferred from observations.