Descriptive and inferential statistics functions. Descriptive statistics — population & sample variance, sd, se, median, quantile, IQR, skewness, kurtosis, z-score, standardize.
§Determinism Contract
- Unordered reductions use BinnedAccumulatorF64 for order-invariant, bit-identical results regardless of input order.
- Cumulative (ordered) operations use KahanAccumulatorF64 where the addition order IS the semantics.
- Sorting uses f64::total_cmp for deterministic NaN handling.
- No HashMap, no par_iter, no OS randomness.
- Same input => bit-identical output.
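The ordering and accumulation rules in this contract can be illustrated with a small standalone sketch. It is not the crate's implementation: kahan_cumsum and sort_total are hypothetical helper names, the compensated sum is a textbook Kahan loop rather than KahanAccumulatorF64 itself, and BinnedAccumulatorF64 is not reproduced here.

```rust
/// Hypothetical helper: cumulative sum with Kahan (compensated) summation.
/// The result depends on input order by design, since each prefix is an output.
fn kahan_cumsum(data: &[f64]) -> Vec<f64> {
    let mut sum = 0.0_f64;
    let mut comp = 0.0_f64; // running estimate of the lost low-order bits
    let mut out = Vec::with_capacity(data.len());
    for &x in data {
        let y = x - comp; // apply the correction carried from the last step
        let t = sum + y; // low-order bits of y may be lost here
        comp = (t - sum) - y; // recover what was lost
        sum = t;
        out.push(sum);
    }
    out
}

/// Hypothetical helper: sort a copy using the IEEE 754 total order.
/// f64::total_cmp never panics on NaN and places -0.0 before +0.0.
fn sort_total(data: &[f64]) -> Vec<f64> {
    let mut v = data.to_vec();
    v.sort_by(|a, b| a.total_cmp(b));
    v
}
```

Under total_cmp a NaN with a positive sign bit sorts after +infinity and one with a negative sign bit sorts before -infinity, so the sorted position of a NaN is fixed by its bit pattern rather than by comparison failures.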
Functions§
- cor
- Pearson correlation coefficient between two arrays. cor(x, y) = cov(x,y) / (sd(x) * sd(y))
- cor_ci
- Confidence interval for Pearson correlation using Fisher z-transform. Returns (lower_bound, upper_bound).
- cor_matrix
- Correlation matrix for a set of variables (columns). Returns a flat Vec holding the n x n correlation matrix.
- cov
- Population covariance: sum((xi-mx)(yi-my)) / n.
- cov_matrix
- Covariance matrix. Returns a flat Vec holding the n x n covariance matrix.
- cume_dist
- Cumulative distribution: count(x_i <= x_j) / n for each x_j.
- cummax
- Cumulative max.
- cummin
- Cumulative min.
- cumprod
- Cumulative product.
- cumsum
- Cumulative sum with Kahan summation.
- dense_rank
- Dense rank (no gaps for ties). Returns 1-based ranks.
- filter_mask
- Boolean mask selection: return elements of data where mask is true.
- histogram
- Histogram: bin data into n equal-width bins. Returns (bin_edges: Vec, counts: Vec).
- iqr
- Interquartile range: Q3 - Q1.
- kendall_cor
- Kendall tau-b correlation coefficient with tie adjustment. O(n^2) pairwise comparison for determinism.
- kurtosis
- Kurtosis (excess kurtosis, Fisher’s): E[(X-mu)^4] / sigma^4 - 3.
- lag
- Lag: shift values forward by n positions, fill with NaN.
- lead
- Lead: shift values backward by n positions, fill with NaN.
- mad
- Median absolute deviation: median(|x[i] - median(x)|). Does NOT multiply by 1.4826 scaling factor.
- median
- Median: middle value of sorted data. For even n, average of two middle values. Clones and sorts internally — never mutates input.
- median_fast
- O(n) median using introselect instead of O(n log n) sort. For even n, selects both middle elements via two partial sorts.
- mode
- Mode: most frequent value. Ties broken by smallest value. Uses bit-exact comparison via to_bits() on a sorted copy.
- n_distinct
- Number of distinct values in the data. Uses sorted unique comparison for determinism (no HashMap).
- nth_element
- Introselect: partition-based O(n) expected selection of the k-th smallest element. Operates on a mutable slice and partially reorders it so that data[k] holds the k-th smallest value (0-indexed), all elements data[..k] are <= data[k], and all data[k+1..] are >= data[k].
- nth_element_copy
- Non-mutating nth_element: clones data, selects k-th element, returns it. O(n) expected time, O(n) space for the clone.
- ntile
- Divide data into n roughly equal groups (ntile/quantile binning). Returns 1-based group assignments matching original data order.
- partial_cor
- Partial correlation: correlation of x and y controlling for z.
- percent_rank_fn
- Percent rank: (rank - 1) / (n - 1), range [0, 1]. Uses average-tie ranking from existing rank() function.
- percentile_rank
- Percentile rank: fraction of data values strictly less than the given value, plus half the fraction equal to the value. Returns a value in [0, 1].
- pop_sd
- Population standard deviation: sqrt(pop_variance).
- pop_variance
- Population variance: sum((xi - mean)^2) / N.
- quantile
- Quantile at probability p (0.0 to 1.0). Linear interpolation between adjacent ranks (R type 7 / NumPy default).
- quantile_fast
- O(n) quantile using introselect (R type 7 interpolation).
- rank
- Rank (average ties). Returns 1-based ranks. DETERMINISM: uses stable sort with index tracking.
- row_number
- Row number (sequential; ties broken by original position, so the ordering is stable).
- sample_cov
- Sample covariance: sum((xi-mx)(yi-my)) / (n-1).
- sample_indices
- Generate k random indices in [0, n) with or without replacement.
- sample_sd
- Sample standard deviation: alias for sd() (both use N-1 denominator).
- sample_variance
- Sample variance: alias for variance() (both use N-1 denominator).
- sd
- Standard deviation (sample, N-1 denominator — R/pandas default).
- se
- Standard error of the mean: sample_sd / sqrt(n).
- skewness
- Skewness (Fisher’s definition): E[(X-mu)^3] / sigma^3.
- spearman_cor
- Spearman rank correlation: Pearson correlation of the ranks of x and y.
- standardize
- Min-max normalization: (xi - min) / (max - min).
- trimmed_mean
- Trimmed mean: mean of data with the proportion fraction removed from each tail. For example, proportion=0.1 removes the bottom 10% and top 10% and computes the mean of the middle 80%.
- variance
- Variance (sample, N-1 denominator — R/pandas default). Two-pass: first binned mean, then binned sum of squared deviations. For single element, returns 0.
- weighted_mean
- Weighted mean: sum(data[i] * weights[i]) / sum(weights). Uses binned accumulation for both numerator and denominator.
- weighted_var
- Weighted variance: sum(w[i] * (x[i] - weighted_mean)^2) / sum(w). Two-pass: first weighted mean (binned), then binned sum of squared deviations.
- winsorize
- Winsorize: replace values below the proportion quantile with the lower boundary, and values above the (1-proportion) quantile with the upper boundary.
- z_score
- Z-scores: (xi - mean) / sd for each element.
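
Two of the definitions listed above, the R type 7 quantile and average-tie ranking, can be sketched in standalone Rust as follows. This is a minimal illustration under stated assumptions, not the crate's code: quantile_type7 and rank_average are hypothetical names, the signatures are guesses, and ties are detected with total_cmp so that -0.0 and +0.0 count as different values.

```rust
/// Hypothetical sketch of a type 7 quantile: sort a copy, then interpolate
/// linearly between the two ranks adjacent to h = (n - 1) * p.
fn quantile_type7(data: &[f64], p: f64) -> f64 {
    assert!(!data.is_empty(), "data must be non-empty");
    assert!((0.0..=1.0).contains(&p), "p must lie in [0, 1]");
    let mut v = data.to_vec(); // never mutate the caller's slice
    v.sort_by(|a, b| a.total_cmp(b));
    let h = (v.len() - 1) as f64 * p;
    let lo = h.floor() as usize;
    let hi = h.ceil() as usize;
    v[lo] + (h - lo as f64) * (v[hi] - v[lo])
}

/// Hypothetical sketch of 1-based ranking where tied values share the
/// average of the ranks they occupy. Output follows the input order.
fn rank_average(data: &[f64]) -> Vec<f64> {
    let n = data.len();
    let mut idx: Vec<usize> = (0..n).collect();
    // sort_by is stable, so equal values keep their original relative order
    idx.sort_by(|&a, &b| data[a].total_cmp(&data[b]));
    let mut ranks = vec![0.0; n];
    let mut i = 0;
    while i < n {
        // extend j over the run of values equal to the one at sorted position i
        let mut j = i + 1;
        while j < n && data[idx[j]].total_cmp(&data[idx[i]]).is_eq() {
            j += 1;
        }
        // sorted positions i..j hold 1-based ranks i+1 ..= j; take their mean
        let avg = (i + 1 + j) as f64 / 2.0;
        for &k in &idx[i..j] {
            ranks[k] = avg;
        }
        i = j;
    }
    ranks
}
```

With these helpers, an interquartile range in the sense of the iqr entry would be quantile_type7(x, 0.75) - quantile_type7(x, 0.25), and a Spearman-style correlation would apply a Pearson formula to the two rank vectors.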