Module stats

Descriptive and inferential statistics functions. The descriptive set covers population and sample variance, standard deviation (sd), standard error (se), median, quantile, IQR, skewness, kurtosis, z-score, and standardize.

§Determinism Contract

Unordered reductions use BinnedAccumulatorF64 for order-invariant, bit-identical results regardless of input order.
Cumulative (ordered) operations use KahanAccumulatorF64 where the addition order IS the semantics.
Sorting uses f64::total_cmp for deterministic NaN handling.
No HashMap, no par_iter, no OS randomness.
Same input => bit-identical output.
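
The ordered-summation and sorting rules above can be sketched with the standard library alone. This is a minimal illustration, not the crate's code: `kahan_cumsum` and `sorted_total` are hypothetical names, and the crate's KahanAccumulatorF64 and BinnedAccumulatorF64 types are not reproduced here.

```rust
/// Cumulative sum with Kahan (compensated) summation.
/// Hypothetical helper for illustration; not the crate's `cumsum`.
pub fn kahan_cumsum(data: &[f64]) -> Vec<f64> {
    let mut sum = 0.0_f64;
    let mut comp = 0.0_f64; // running compensation for lost low-order bits
    let mut out = Vec::with_capacity(data.len());
    for &x in data {
        let y = x - comp; // apply the correction carried from the last step
        let t = sum + y; // low-order bits of y may be lost here
        comp = (t - sum) - y; // recover what was lost
        sum = t;
        out.push(sum);
    }
    out
}

/// Sort a copy of the input with `f64::total_cmp`.
/// `total_cmp` is a total order over all f64 bit patterns, so NaN and
/// signed zeros always land in the same place. The input is not mutated.
pub fn sorted_total(data: &[f64]) -> Vec<f64> {
    let mut v = data.to_vec();
    v.sort_by(|a, b| a.total_cmp(b));
    v
}
```

Because `total_cmp` never returns an "unordered" result, sorting any permutation of the same values yields the same bit pattern sequence. NaNs sort to a fixed position determined by their sign bit and payload, and -0.0 sorts before 0.0.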

§Functions

cor: Pearson correlation coefficient between two arrays. cor(x, y) = cov(x,y) / (sd(x) * sd(y))
cor_ci: Confidence interval for Pearson correlation using Fisher z-transform. Returns (lower_bound, upper_bound).
cor_matrix: Correlation matrix for a set of variables (columns). Returns flat Vec of n x n correlation matrix.
cov: Population covariance: sum((xi-mx)(yi-my)) / n.
cov_matrix: Covariance matrix. Returns flat Vec of n x n covariance matrix.
cume_dist: Cumulative distribution: count(x_i <= x_j) / n for each x_j.
cummax: Cumulative max.
cummin: Cumulative min.
cumprod: Cumulative product.
cumsum: Cumulative sum with Kahan summation.
dense_rank: Dense rank (no gaps for ties). Returns 1-based ranks.
filter_mask: Boolean mask selection: return elements of data where mask is true.
histogram: Histogram: bin data into n equal-width bins. Returns (bin_edges: Vec, counts: Vec).
iqr: Interquartile range: Q3 - Q1.
kendall_cor: Kendall tau-b correlation coefficient with tie adjustment. O(n^2) pairwise comparison for determinism.
kurtosis: Kurtosis (excess kurtosis, Fisher’s): E[(X-mu)^4] / sigma^4 - 3.
lag: Lag: shift values forward by n positions, fill with NaN.
lead: Lead: shift values backward by n positions, fill with NaN.
mad: Median absolute deviation: median(|x[i] - median(x)|). Does NOT multiply by 1.4826 scaling factor.
median: Median: middle value of sorted data. For even n, average of two middle values. Clones and sorts internally — never mutates input.
median_fast: O(n) median using introselect instead of O(n log n) sort. For even n, selects both middle elements via two partial sorts.
mode: Mode: most frequent value. Ties broken by smallest value. Uses bit-exact comparison via to_bits() on a sorted copy.
n_distinct: Number of distinct values in the data. Uses sorted unique comparison for determinism (no HashMap).
nth_element: Introselect: partition-based O(n) expected selection of the k-th smallest element. Operates on a mutable slice and partially reorders it so that data[k] holds the k-th smallest value (0-indexed), all elements data[..k] are <= data[k], and all data[k+1..] are >= data[k].
nth_element_copy: Non-mutating nth_element: clones data, selects k-th element, returns it. O(n) expected time, O(n) space for the clone.
ntile: Divide data into n roughly equal groups (ntile/quantile binning). Returns 1-based group assignments matching original data order.
partial_cor: Partial correlation: correlation of x and y controlling for z.
percent_rank_fn: Percent rank: (rank - 1) / (n - 1), range [0, 1]. Uses average-tie ranking from existing rank() function.
percentile_rank: Percentile rank: fraction of data values strictly less than the given value, plus half the fraction equal to the value. Returns a value in [0, 1].
pop_sd: Population standard deviation: sqrt(pop_variance).
pop_variance: Population variance: sum((xi - mean)^2) / N.
quantile: Quantile at probability p (0.0 to 1.0). Linear interpolation between adjacent ranks (R type 7 / NumPy default).
quantile_fast: O(n) quantile using introselect (R type 7 interpolation).
rank: Rank (average ties). Returns 1-based ranks. DETERMINISM: uses stable sort with index tracking.
row_number: Row number (sequential, tie-broken by original position — stable).
sample_cov: Sample covariance: sum((xi-mx)(yi-my)) / (n-1).
sample_indices: Generate k random indices in [0, n) with or without replacement.
sample_sd: Sample standard deviation: alias for sd() (both use N-1 denominator).
sample_variance: Sample variance: alias for variance() (both use N-1 denominator).
sd: Standard deviation (sample, N-1 denominator — R/pandas default).
se: Standard error of the mean: sample_sd / sqrt(n).
skewness: Skewness (Fisher’s definition): E[(X-mu)^3] / sigma^3.
spearman_cor: Spearman rank correlation: Pearson correlation of the ranks of x and y.
standardize: Min-max normalization: (xi - min) / (max - min).
trimmed_mean: Trimmed mean: mean of the data after removing the given proportion from each tail. proportion=0.1 removes the bottom 10% and top 10%, so the mean is computed over the middle 80%.
variance: Variance (sample, N-1 denominator — R/pandas default). Two-pass: first binned mean, then binned sum of squared deviations. For single element, returns 0.
weighted_mean: Weighted mean: sum(data[i] * weights[i]) / sum(weights). Uses binned accumulation for both numerator and denominator.
weighted_var: Weighted variance: sum(w[i] * (x[i] - weighted_mean)^2) / sum(w). Two-pass: first weighted mean (binned), then binned sum of squared deviations.
winsorize: Winsorize: replace values below the proportion quantile with the lower boundary, and values above the (1-proportion) quantile with the upper boundary.
z_score: Z-scores: (xi - mean) / sd for each element.
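
The quantile entry above names R type 7 interpolation, which the following sketch illustrates. It rests on assumptions: `quantile_type7` is a hypothetical name, and the `Option` return for empty input or out-of-range p is a choice made for this example, since the listing does not show the crate's signatures or edge-case behavior.

```rust
/// Type-7 quantile: linear interpolation between adjacent order statistics.
/// Hypothetical helper for illustration; not the crate's `quantile`.
/// Returns None for empty input or p outside [0, 1] (an assumption here).
pub fn quantile_type7(data: &[f64], p: f64) -> Option<f64> {
    if data.is_empty() || !(0.0..=1.0).contains(&p) {
        return None;
    }
    // Clone and sort so the caller's slice is never mutated.
    let mut sorted = data.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));

    // Fractional 0-based position h = p * (n - 1).
    let h = p * (sorted.len() - 1) as f64;
    let lo = h.floor() as usize;
    let hi = h.ceil() as usize;
    let frac = h - lo as f64;

    // Interpolate between the two neighbouring sorted values.
    Some(sorted[lo] + frac * (sorted[hi] - sorted[lo]))
}
```

For the values 1, 2, 3, 4 this gives 2.5 at p = 0.5 (the median for even n), and 1.75 and 3.25 at p = 0.25 and p = 0.75, so Q3 - Q1 is 1.5.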
