Module stats


Descriptive and inferential statistics functions. The descriptive statistics cover population and sample variance, sd, se, median, quantile, IQR, skewness, kurtosis, z-score, and standardize.

§Determinism Contract

  • Unordered reductions use BinnedAccumulatorF64 for order-invariant, bit-identical results regardless of input order.
  • Cumulative (ordered) operations use KahanAccumulatorF64 where the addition order IS the semantics.
  • Sorting uses f64::total_cmp for deterministic NaN handling.
  • No HashMap, no par_iter, no OS randomness.
  • Same input => bit-identical output.
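
To illustrate the sorting rule in the contract above, here is a minimal sketch of a NaN-safe, order-independent sort built on `f64::total_cmp`. The helper name `sorted_total` is hypothetical and is not part of this module; it only shows why a total order yields the same bit pattern for any permutation of the input.

```rust
/// Returns a sorted copy of `data` using the IEEE 754 total order.
/// Hypothetical helper for illustration only; not this module's API.
fn sorted_total(data: &[f64]) -> Vec<f64> {
    let mut out = data.to_vec();
    // `total_cmp` defines an ordering for every pair of f64 values:
    // -0.0 sorts before +0.0, and a positive NaN sorts after +infinity.
    // `partial_cmp` would return None for NaN and force an arbitrary choice.
    out.sort_by(|a, b| a.total_cmp(b));
    out
}
```

Because every pair of values has a defined ordering, two permutations of the same input sort to identical bit patterns, which is the property the contract relies on.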

Functions§

cor
Pearson correlation coefficient between two arrays. cor(x, y) = cov(x,y) / (sd(x) * sd(y))
cor_ci
Confidence interval for Pearson correlation using Fisher z-transform. Returns (lower_bound, upper_bound).
cor_matrix
Correlation matrix for a set of variables (columns). Returns flat Vec of n x n correlation matrix.
cov
Population covariance: sum((xi-mx)(yi-my)) / n.
cov_matrix
Covariance matrix. Returns flat Vec of n x n covariance matrix.
cume_dist
Cumulative distribution: count(x_i <= x_j) / n for each x_j.
cummax
Cumulative max.
cummin
Cumulative min.
cumprod
Cumulative product.
cumsum
Cumulative sum with Kahan summation.
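
As a rough illustration of the `cumsum` entry above, the sketch below computes a running total with classic Kahan compensation. It does not use the module's `KahanAccumulatorF64`, whose interface is not shown here; the function name `cumsum_kahan` is an assumption made for this example.

```rust
/// Running sum with Kahan compensation.
/// Illustrative sketch; the module's own accumulator type may differ.
fn cumsum_kahan(data: &[f64]) -> Vec<f64> {
    let mut out = Vec::with_capacity(data.len());
    let mut sum = 0.0_f64;
    let mut comp = 0.0_f64; // low-order bits lost by previous additions
    for &x in data {
        let y = x - comp; // re-inject the previously lost bits
        let t = sum + y; // this addition may round away part of y
        comp = (t - sum) - y; // recover what was just rounded away
        sum = t;
        out.push(sum);
    }
    out
}
```

The additions happen strictly in input order, which matches the contract's point that for cumulative operations the order is part of the semantics.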
dense_rank
Dense rank (no gaps for ties). Returns 1-based ranks.
filter_mask
Boolean mask selection: return elements of data where mask is true.
histogram
Histogram: bin data into n equal-width bins. Returns (bin_edges: Vec, counts: Vec).
iqr
Interquartile range: Q3 - Q1.
kendall_cor
Kendall tau-b correlation coefficient with tie adjustment. O(n^2) pairwise comparison for determinism.
kurtosis
Kurtosis (excess kurtosis, Fisher’s): E[(X-mu)^4] / sigma^4 - 3.
lag
Lag: shift values forward by n positions, fill with NaN.
lead
Lead: shift values backward by n positions, fill with NaN.
mad
Median absolute deviation: median(|x[i] - median(x)|). Does NOT multiply by 1.4826 scaling factor.
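
The `mad` formula above can be sketched as two median passes. Both helper names below are hypothetical, and returning NaN for empty input is an assumption of this sketch, not documented behavior.

```rust
/// Median of a sorted copy; returns NaN for empty input (an assumption).
fn median_of_copy(data: &[f64]) -> f64 {
    let mut s = data.to_vec();
    s.sort_by(|a, b| a.total_cmp(b));
    let n = s.len();
    if n == 0 {
        return f64::NAN;
    }
    if n % 2 == 1 {
        s[n / 2]
    } else {
        (s[n / 2 - 1] + s[n / 2]) / 2.0
    }
}

/// Raw median absolute deviation: median(|x[i] - median(x)|).
/// No 1.4826 consistency factor is applied, matching the description above.
fn mad_raw(data: &[f64]) -> f64 {
    let m = median_of_copy(data);
    let deviations: Vec<f64> = data.iter().map(|x| (x - m).abs()).collect();
    median_of_copy(&deviations)
}
```

For `[1, 2, 3, 4, 100]` the median is 3 and the absolute deviations are `[2, 1, 0, 1, 97]`, so the raw MAD is 1. A caller who wants a normal-consistent scale estimate would multiply the result by 1.4826.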
median
Median: middle value of sorted data. For even n, average of two middle values. Clones and sorts internally — never mutates input.
median_fast
O(n) median using introselect instead of O(n log n) sort. For even n, selects both middle elements via two partial sorts.
mode
Mode: most frequent value. Ties broken by smallest value. Uses bit-exact comparison via to_bits() on a sorted copy.
n_distinct
Number of distinct values in the data. Uses sorted unique comparison for determinism (no HashMap).
nth_element
Introselect: partition-based O(n) expected selection of the k-th smallest element. Operates on a mutable slice and partially reorders it so that data[k] holds the k-th smallest value (0-indexed), all elements data[..k] are <= data[k], and all data[k+1..] are >= data[k].
nth_element_copy
Non-mutating nth_element: clones data, selects k-th element, returns it. O(n) expected time, O(n) space for the clone.
ntile
Divide data into n roughly equal groups (ntile/quantile binning). Returns 1-based group assignments matching original data order.
partial_cor
Partial correlation: correlation of x and y controlling for z.
percent_rank_fn
Percent rank: (rank - 1) / (n - 1), range [0, 1]. Uses average-tie ranking from existing rank() function.
percentile_rank
Percentile rank: fraction of data values strictly less than the given value, plus half the fraction equal to the value. Returns a value in [0, 1].
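
The `percentile_rank` definition above translates almost directly into code. This is a sketch under assumptions: the name `percentile_rank_of` is hypothetical, and the NaN result for empty input is a choice made for this example.

```rust
/// Fraction of values strictly below `value`, plus half the fraction equal to it.
/// Illustrative sketch; empty input yields NaN here by assumption.
fn percentile_rank_of(data: &[f64], value: f64) -> f64 {
    if data.is_empty() {
        return f64::NAN;
    }
    let less = data.iter().filter(|&&x| x < value).count() as f64;
    let equal = data.iter().filter(|&&x| x == value).count() as f64;
    (less + 0.5 * equal) / data.len() as f64
}
```

For `[1, 2, 2, 3]` and a value of 2, one element is strictly smaller and two are equal, giving (1 + 0.5 * 2) / 4 = 0.5.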
pop_sd
Population standard deviation: sqrt(pop_variance).
pop_variance
Population variance: sum((xi - mean)^2) / N.
quantile
Quantile at probability p (0.0 to 1.0). Linear interpolation between adjacent ranks (R type 7 / NumPy default).
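
The type 7 rule named in the `quantile` entry above places the quantile at fractional position h = (n - 1) * p in the sorted data and interpolates linearly between the two neighbouring values. The sketch below shows that rule in isolation; the name `quantile_type7`, the clamping of `p`, and the NaN result for empty input are assumptions of this example.

```rust
/// R type 7 / NumPy default quantile via a full sort.
/// Illustrative sketch; clamping `p` and NaN-on-empty are assumptions.
fn quantile_type7(data: &[f64], p: f64) -> f64 {
    if data.is_empty() {
        return f64::NAN;
    }
    let mut sorted = data.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Fractional 0-based position of the quantile in the sorted data.
    let h = (sorted.len() - 1) as f64 * p.clamp(0.0, 1.0);
    let lo = h.floor() as usize;
    let hi = h.ceil() as usize;
    // Linear interpolation between the two adjacent order statistics.
    sorted[lo] + (h - lo as f64) * (sorted[hi] - sorted[lo])
}
```

For `[1, 2, 3, 4]` and p = 0.25, h = 0.75, so the result is 1 + 0.75 * (2 - 1) = 1.75.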
quantile_fast
O(n) quantile using introselect (R type 7 interpolation).
rank
Rank (average ties). Returns 1-based ranks. DETERMINISM: uses stable sort with index tracking.
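
One way to realise the `rank` entry above (average ties, stable sort with index tracking) is to sort an index vector and assign each run of equal values the mean of the rank positions it spans. The function name `rank_average` is hypothetical, and treating ties via `==` (so `-0.0` ties with `0.0`, and NaN never ties) is an assumption of this sketch.

```rust
/// 1-based ranks with ties receiving the average of their positions.
/// Illustrative sketch; tie detection via `==` is an assumption.
fn rank_average(data: &[f64]) -> Vec<f64> {
    let n = data.len();
    // Sort indices rather than values so ranks can be written back
    // in the original order. `sort_by` is a stable sort.
    let mut idx: Vec<usize> = (0..n).collect();
    idx.sort_by(|&a, &b| data[a].total_cmp(&data[b]));

    let mut ranks = vec![0.0; n];
    let mut i = 0;
    while i < n {
        // Extend j to the end of the run of values equal to data[idx[i]].
        let mut j = i;
        while j + 1 < n && data[idx[j + 1]] == data[idx[i]] {
            j += 1;
        }
        // Sorted positions i..=j hold ranks i+1..=j+1; take their mean.
        let avg = (i + j) as f64 / 2.0 + 1.0;
        for &original in &idx[i..=j] {
            ranks[original] = avg;
        }
        i = j + 1;
    }
    ranks
}
```

For `[10, 20, 20, 30]` the two tied values occupy ranks 2 and 3, so each receives 2.5.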
row_number
Row number (sequential, tie-broken by original position — stable).
sample_cov
Sample covariance: sum((xi-mx)(yi-my)) / (n-1).
sample_indices
Generate k random indices in [0, n) with or without replacement.
sample_sd
Sample standard deviation: alias for sd() (both use N-1 denominator).
sample_variance
Sample variance: alias for variance() (both use N-1 denominator).
sd
Standard deviation (sample, N-1 denominator — R/pandas default).
se
Standard error of the mean: sample_sd / sqrt(n).
skewness
Skewness (Fisher’s definition): E[(X-mu)^3] / sigma^3.
spearman_cor
Spearman rank correlation: Pearson correlation of the ranks of x and y.
standardize
Min-max normalization: (xi - min) / (max - min).
trimmed_mean
Trimmed mean: mean of data with proportion fraction removed from each tail. proportion=0.1 removes bottom 10% and top 10%, computing mean of middle 80%.
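
The `trimmed_mean` description above leaves open how the number of trimmed elements is rounded. The sketch below assumes `floor(n * proportion)` elements are dropped from each tail and uses a plain sum rather than the module's binned accumulator; both choices, and the name `trimmed_mean_sketch`, are assumptions.

```rust
/// Mean of the data after dropping floor(n * proportion) values per tail.
/// Illustrative sketch; the rounding rule and NaN-on-empty are assumptions.
fn trimmed_mean_sketch(data: &[f64], proportion: f64) -> f64 {
    let mut sorted = data.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    let k = (n as f64 * proportion).floor() as usize;
    if 2 * k >= n {
        return f64::NAN; // nothing would remain after trimming
    }
    let kept = &sorted[k..n - k];
    kept.iter().sum::<f64>() / kept.len() as f64
}
```

With ten values `1..=9` plus an outlier of 1000 and `proportion = 0.1`, one value is removed from each tail and the mean of `2..=9` is 5.5.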
variance
Variance (sample, N-1 denominator — R/pandas default). Two-pass: first binned mean, then binned sum of squared deviations. For single element, returns 0.
weighted_mean
Weighted mean: sum(data[i] * weights[i]) / sum(weights). Uses binned accumulation for both numerator and denominator.
weighted_var
Weighted variance: sum(w[i] * (x[i] - weighted_mean)^2) / sum(w). Two-pass: first weighted mean (binned), then binned sum of squared deviations.
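
The two-pass structure described for `weighted_mean` and `weighted_var` above can be sketched as follows. Plain `f64` sums stand in for the binned accumulation, so this version is not order-invariant; the function name `weighted_mean_var` and the length check are assumptions.

```rust
/// Returns (weighted mean, weighted variance) using the formulas above.
/// Illustrative sketch; uses naive summation instead of binned accumulation.
fn weighted_mean_var(data: &[f64], weights: &[f64]) -> (f64, f64) {
    assert_eq!(data.len(), weights.len(), "data and weights must have equal length");
    let w_sum: f64 = weights.iter().sum();
    // Pass 1: weighted mean.
    let mean = data.iter().zip(weights).map(|(x, w)| x * w).sum::<f64>() / w_sum;
    // Pass 2: weighted sum of squared deviations from that mean.
    let var = data
        .iter()
        .zip(weights)
        .map(|(x, w)| w * (x - mean).powi(2))
        .sum::<f64>()
        / w_sum;
    (mean, var)
}
```

For data `[1, 3]` with weights `[3, 1]`, the weighted mean is 6 / 4 = 1.5 and the weighted variance is (3 * 0.25 + 1 * 2.25) / 4 = 0.75.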
winsorize
Winsorize: replace values below the proportion quantile with the lower boundary, and values above the (1-proportion) quantile with the upper boundary.
z_score
Z-scores: (xi - mean) / sd for each element.
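
As a closing illustration of the `z_score` entry above, the sketch below standardizes each element using the sample standard deviation (N-1 denominator), consistent with the `sd` entry. It uses naive summation instead of the binned accumulator, and the name `z_scores` is hypothetical.

```rust
/// (x[i] - mean) / sd for each element, with sd using the N-1 denominator.
/// Illustrative sketch; fewer than two elements gives NaN or infinite values.
fn z_scores(data: &[f64]) -> Vec<f64> {
    let n = data.len() as f64;
    let mean = data.iter().sum::<f64>() / n;
    let var = data.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    let sd = var.sqrt();
    data.iter().map(|x| (x - mean) / sd).collect()
}
```

For `[1, 2, 3]` the mean is 2 and the sample standard deviation is 1, so the z-scores are `[-1, 0, 1]`.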