ferrolearn_preprocess/
quantile_transformer.rs

1//! Quantile transformer: map features to a uniform or normal distribution.
2//!
3//! [`QuantileTransformer`] transforms features by mapping each value through
4//! its empirical cumulative distribution function (CDF), producing values
5//! uniformly distributed in `[0, 1]`. Optionally, the result can be mapped
6//! to a standard normal distribution using the inverse normal CDF (probit).
7//!
8//! This is useful for making features more Gaussian-like, which can improve
9//! the performance of many machine learning algorithms.
10//!
11//! Translation target: scikit-learn 1.5.2 `class QuantileTransformer`
12//! (`sklearn/preprocessing/_data.py:2540`) + `quantile_transform` (`:2978`).
13//! Design: `.design/preprocess/quantile_transformer.md`. Tracking: #1319.
14//!
15//! `## REQ status`
16//!
17//! | REQ | Status | Anchor |
18//! |---|---|---|
19//! | REQ-1 forward-CDF value surface (uniform + normal, distinct + tied) | SHIPPED | `fit` references/landmarks (#1322) + `transform`; sklearn `_data.py:2694`,`:2702`,`:2795` |
20//! | REQ-2 forward+reversed AVERAGED interpolation | SHIPPED (#1321) | `interpolate_cdf`/`np_interp`; sklearn `_data.py:2843-2846` |
21//! | REQ-3 Normal output accuracy (Acklam ppf + clip) | SHIPPED (#1320) | `probit`; sklearn `_data.py:2855-2862` |
22//! | REQ-5 error/parameter contracts (scoped) | SHIPPED | `fit`/`transform` guards; sklearn `_data.py:2654` |
23//! | REQ-4 `np.maximum.accumulate` monotonic repair (unobservable) | NOT-STARTED (#1323) | sklearn `_data.py:2707` |
24//! | REQ-6 random subsample + random_state + n_quantiles>subsample error | NOT-STARTED (#1324) | sklearn `_data.py:2696-2700`,`:2774-2779` |
25//! | REQ-7 `inverse_transform` (reverse interp + `norm.cdf` + bounds + NaN) | SHIPPED (#1325) | `FittedQuantileTransformer::inverse_transform` + `norm_cdf` (ndtr via `erf`/`erfc`); sklearn `_data.py:2947`,`:2813-2851`,`:2821`. Uniform ~exact, Normal ~1e-7 (follows the forward REQ-3 ~1e-9 ndtr/probit contract). Consumer: crate re-export `pub use quantile_transformer::FittedQuantileTransformer` (`lib.rs`, boundary public API). Live-oracle: `tests/divergence_quantile_transformer.rs` (uniform/normal round-trip + held-out + bounds + NaN + `norm_cdf` sanity) |
26//! | REQ-8 `ignore_implicit_zeros` + sparse CSC | NOT-STARTED (#1326) | sklearn `_data.py:2709-2752` |
27//! | REQ-9 `quantile_transform` free function | SHIPPED (#1327) | `quantile_transform` free fn delegates to the SHIPPED `QuantileTransformer::new(n_quantiles, output_distribution, subsample)` + `Fit`→`FittedQuantileTransformer::transform` (NO quantile math duplicated); mirrors sklearn `quantile_transform(X, *, axis=0, n_quantiles=1000, output_distribution="uniform", ignore_implicit_zeros=False, subsample=1e5, random_state=None, copy=True)` (`_data.py:2978`,`:3107-3119`): `axis==0` → `fit_transform(X)` (per-feature, `:3115-3116`); `axis==1` → `fit_transform(X.T).T` (per-sample, `:3117-3118`, contiguous owned transpose); `axis ∉ {0,1}` → `InvalidParameter` "axis should be either equal to 0 or 1. Got axis=<n>" (matches the `else` ValueError `:3117`). SCOPE: `ignore_implicit_zeros`/`random_state`/`copy` are not surfaced (dense, deterministic, always-copy); `subsample` is threaded but the actual random subsampling is REQ-6 NOT-STARTED (#1324) so the path is deterministic and matches sklearn only for n_samples ≤ subsample (`subsample=None`/large in the oracle). Consumer: crate re-export `pub use quantile_transformer::quantile_transform` (`lib.rs`, boundary public API, R-DEFER-1). Live-oracle: `tests/divergence_quantile_transformer.rs` (axis=0 uniform/normal distinct+tied, axis=1 non-square, axis-error, == estimator fit_transform, multi-feature column-by-column, f32). |
28//! | REQ-10 `copy` + OneToOneFeatureMixin fitted-attr surface | NOT-STARTED (#1328) | sklearn `_data.py:2540`,`:2790-2795` |
29//! | REQ-11 PyO3 binding | SHIPPED (#1329) | `_RsQuantileTransformer` (`ferrolearn-python/src/extras.rs`) over `QuantileTransformer`/`FittedQuantileTransformer` exposes `fit`/`transform`/`inverse_transform` + the `#[getter]`s `quantiles_` (`&[Vec<F>]` → `(n_quantiles_, n_features)` numpy, transposed), `references_` (this module's `FittedQuantileTransformer::references`), `n_quantiles_` (effective, `FittedQuantileTransformer::n_quantiles` == `references().len()`, the `min(n_quantiles, n_samples)` clamp `_data.py:2790`), `n_features_in_`; `output_distribution` "uniform"/"normal" string→`OutputDistribution` (bad→`PyValueError`, `_data.py:2655` StrOptions). Python `class QuantileTransformer(_TransformerWrapper)` (`_extras.py`) mirrors sklearn's keyword-only ctor `(*, n_quantiles=1000, output_distribution="uniform", ignore_implicit_zeros=False, subsample=10_000, random_state=None, copy=True)` (`_data.py:2662`); pre-fit transform/attr→`NotFittedError`; n_quantiles>subsample→`ValueError` (`_data.py:2774`); registered `lib.rs`/`__init__.py` (non-test consumer, R-DEFER-1). SCOPE: `subsample`/`random_state` accepted for get_params compat but the deterministic path uses ALL samples (REQ-6 actual subsample NOT-STARTED #1324, RNG-gated #2118) — MATCHES sklearn only for n_samples ≤ subsample; `ignore_implicit_zeros` accepted-and-documented (REQ-8 dense-only #1326); `copy` no-op. float32 input → ~1e-5 tolerant (f64 ABI cast-back, #2215-class); Normal output → ~1e-7 (REQ-3 Acklam ppf/ndtr). Live-oracle: `tests/divergence_quantile_transformer_py.py`. |
30//! | REQ-12 ferray substrate | NOT-STARTED (#1330) | R-SUBSTRATE |
31
32use ferrolearn_core::error::FerroError;
33use ferrolearn_core::traits::{Fit, FitTransform, Transform};
34use ferrolearn_numerical::special::{erf, erfc};
35use ndarray::Array2;
36use num_traits::Float;
37
38// ---------------------------------------------------------------------------
39// OutputDistribution
40// ---------------------------------------------------------------------------
41
42/// Target output distribution for the quantile transformer.
43#[derive(Debug, Clone, Copy, PartialEq, Eq)]
44pub enum OutputDistribution {
45    /// Map to the uniform distribution on `[0, 1]`.
46    Uniform,
47    /// Map to the standard normal distribution via the probit function.
48    Normal,
49}
50
51// ---------------------------------------------------------------------------
52// QuantileTransformer (unfitted)
53// ---------------------------------------------------------------------------
54
55/// An unfitted quantile transformer.
56///
57/// Calling [`Fit::fit`] computes the quantiles for each feature and returns a
58/// [`FittedQuantileTransformer`].
59///
60/// # Parameters
61///
62/// - `n_quantiles` — number of quantile reference points (default 1000).
63/// - `output_distribution` — target distribution (default `Uniform`).
64/// - `subsample` — maximum number of samples used to compute quantiles
65///   (default 100_000; set to 0 to use all samples).
66///
67/// # Examples
68///
69/// ```
70/// use ferrolearn_preprocess::quantile_transformer::{QuantileTransformer, OutputDistribution};
71/// use ferrolearn_core::traits::{Fit, Transform};
72/// use ndarray::array;
73///
74/// let qt = QuantileTransformer::<f64>::new(100, OutputDistribution::Uniform, 0);
75/// let x = array![[1.0], [2.0], [3.0], [4.0], [5.0]];
76/// let fitted = qt.fit(&x, &()).unwrap();
77/// let out = fitted.transform(&x).unwrap();
78/// // Values should be in [0, 1]
79/// for v in out.iter() {
80///     assert!(*v >= 0.0 && *v <= 1.0);
81/// }
82/// ```
83#[must_use]
84#[derive(Debug, Clone)]
85pub struct QuantileTransformer<F> {
86    /// Number of quantile reference points.
87    n_quantiles: usize,
88    /// Target output distribution.
89    output_distribution: OutputDistribution,
90    /// Maximum number of samples for quantile computation (0 = all).
91    subsample: usize,
92    _marker: std::marker::PhantomData<F>,
93}
94
95impl<F: Float + Send + Sync + 'static> QuantileTransformer<F> {
96    /// Create a new `QuantileTransformer`.
97    pub fn new(
98        n_quantiles: usize,
99        output_distribution: OutputDistribution,
100        subsample: usize,
101    ) -> Self {
102        Self {
103            n_quantiles,
104            output_distribution,
105            subsample,
106            _marker: std::marker::PhantomData,
107        }
108    }
109
110    /// Return the number of quantiles.
111    #[must_use]
112    pub fn n_quantiles(&self) -> usize {
113        self.n_quantiles
114    }
115
116    /// Return the target output distribution.
117    #[must_use]
118    pub fn output_distribution(&self) -> OutputDistribution {
119        self.output_distribution
120    }
121
122    /// Return the subsample size.
123    #[must_use]
124    pub fn subsample(&self) -> usize {
125        self.subsample
126    }
127}
128
129impl<F: Float + Send + Sync + 'static> Default for QuantileTransformer<F> {
130    fn default() -> Self {
131        Self::new(1000, OutputDistribution::Uniform, 100_000)
132    }
133}
134
135// ---------------------------------------------------------------------------
136// FittedQuantileTransformer
137// ---------------------------------------------------------------------------
138
139/// A fitted quantile transformer holding per-feature quantile references.
140///
141/// Created by calling [`Fit::fit`] on a [`QuantileTransformer`].
142#[derive(Debug, Clone)]
143pub struct FittedQuantileTransformer<F> {
144    /// Quantile reference values per feature: `quantiles[j]` is a sorted
145    /// vector of reference values for feature `j`.
146    quantiles: Vec<Vec<F>>,
147    /// The reference quantile levels (evenly spaced in [0, 1]).
148    references: Vec<F>,
149    /// Target output distribution.
150    output_distribution: OutputDistribution,
151}
152
153impl<F: Float + Send + Sync + 'static> FittedQuantileTransformer<F> {
154    /// Return the computed quantile reference values per feature.
155    #[must_use]
156    pub fn quantiles(&self) -> &[Vec<F>] {
157        &self.quantiles
158    }
159
160    /// Return the reference quantile levels (the evenly-spaced `[0, 1]` grid).
161    ///
162    /// Mirrors scikit-learn `references_` (`_data.py:2795` `references_ =
163    /// np.linspace(0, 1, n_quantiles_, endpoint=True)`): a length-`n_quantiles_`
164    /// vector, where `n_quantiles_` is the effective (clamped) quantile count.
165    #[must_use]
166    pub fn references(&self) -> &[F] {
167        &self.references
168    }
169
170    /// Return the effective number of quantile reference points.
171    ///
172    /// Mirrors scikit-learn `n_quantiles_` (`_data.py:2790` `n_quantiles_ =
173    /// max(1, min(n_quantiles, n_samples))`): the requested `n_quantiles`
174    /// clamped down to the number of samples seen during `fit`. Equals
175    /// `references().len()`.
176    #[must_use]
177    pub fn n_quantiles(&self) -> usize {
178        self.references.len()
179    }
180
181    /// Return the number of features.
182    #[must_use]
183    pub fn n_features(&self) -> usize {
184        self.quantiles.len()
185    }
186
187    /// Back-project transformed data to the original feature space.
188    ///
189    /// The inverse of [`Transform::transform`]: each transformed column value is
190    /// mapped back to an original feature value. Mirrors scikit-learn
191    /// `QuantileTransformer.inverse_transform` (`_data.py:2947`) which runs
192    /// `_transform_col(..., inverse=True)` (`_data.py:2813-2866`):
193    ///
194    /// - **Normal** output: the value is first mapped to its uniform rank in
195    ///   `[0, 1]` via the standard normal CDF `stats.norm.cdf` (`:2821`,
196    ///   [`norm_cdf`]); the bound masks use `BOUNDS_THRESHOLD = 1e-7` (`:47`)
197    ///   against `0` / `1` (`:2827-2828`).
198    /// - **Uniform** output: the value IS the rank; the bound masks use exact
199    ///   equality `== 0` / `== 1` (`:2830-2831`).
200    ///
201    /// The finite ranks are interpolated with a **plain** `np.interp(rank,
202    /// references_, quantiles)` — the REVERSE of the forward interp, and NOT the
203    /// forward/reversed-averaged variant (the averaging is forward-only,
204    /// `:2848`). Ranks at/below the lower bound map to the column minimum
205    /// `quantiles[0]` and at/above the upper bound to the column maximum
206    /// `quantiles[-1]` (`:2850-2851`). `NaN` passes through unchanged
207    /// (`isfinite_mask`, `:2833`).
208    ///
209    /// # Errors
210    ///
211    /// Returns [`FerroError::ShapeMismatch`] if the number of columns differs
212    /// from the number of features seen during fitting.
213    pub fn inverse_transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
214        let n_features = self.quantiles.len();
215        if x.ncols() != n_features {
216            return Err(FerroError::ShapeMismatch {
217                expected: vec![x.nrows(), n_features],
218                actual: vec![x.nrows(), x.ncols()],
219                context: "FittedQuantileTransformer::inverse_transform".into(),
220            });
221        }
222        // sklearn `inverse_transform` -> `_check_inputs(force_all_finite=
223        // "allow-nan")` (`_data.py:2876`, called `:2965`): NaN passes through,
224        // but +/-inf raises ValueError (#2212, MinMaxScaler #2200 precedent).
225        if x.iter().any(|v| v.is_infinite()) {
226            return Err(FerroError::InvalidParameter {
227                name: "X".into(),
228                reason: "Input X contains infinity or a value too large for dtype.".into(),
229            });
230        }
231
232        // sklearn _transform_col inverse: lower_bound_x = 0, upper_bound_x = 1,
233        // lower_bound_y = quantiles[0], upper_bound_y = quantiles[-1]
234        // (`_data.py:2814-2817`). BOUNDS_THRESHOLD = 1e-7 (`:47`).
235        let bounds_threshold = F::from(1e-7).unwrap_or_else(F::zero);
236        let zero = F::zero();
237        let one = F::one();
238
239        let mut out = x.to_owned();
240
241        for j in 0..n_features {
242            let quantiles_col = &self.quantiles[j];
243            let (lower_y, upper_y) = match (quantiles_col.first(), quantiles_col.last()) {
244                (Some(&first), Some(&last)) => (first, last),
245                // Empty landmark column: nothing to map to; leave values as-is.
246                _ => continue,
247            };
248
249            for i in 0..out.nrows() {
250                let val = out[[i, j]];
251                if val.is_nan() {
252                    // isfinite_mask excludes NaN; it stays NaN (`_data.py:2833`).
253                    continue;
254                }
255
256                // For Normal output, map the normal-space value back to the
257                // uniform rank in [0, 1] (`_data.py:2821`). For Uniform the
258                // value already IS the rank.
259                let rank = match self.output_distribution {
260                    OutputDistribution::Uniform => val,
261                    OutputDistribution::Normal => norm_cdf(val),
262                };
263
264                // Bound masks (`_data.py:2826-2831`). Under errstate(invalid=
265                // "ignore") NaN comparisons are false — but `rank` is finite
266                // here (norm_cdf of a finite value is finite, val is finite).
267                let (lower_idx, upper_idx) = match self.output_distribution {
268                    OutputDistribution::Normal => (
269                        rank - bounds_threshold < zero,
270                        rank + bounds_threshold > one,
271                    ),
272                    OutputDistribution::Uniform => (rank == zero, rank == one),
273                };
274
275                // Plain reverse interp references_ -> quantiles (`_data.py:2848`).
276                let mut mapped = np_interp(rank, &self.references, quantiles_col);
277
278                // Bound overrides (`_data.py:2850-2851`): upper applied first,
279                // then lower (matching sklearn's order).
280                if upper_idx {
281                    mapped = upper_y;
282                }
283                if lower_idx {
284                    mapped = lower_y;
285                }
286
287                out[[i, j]] = mapped;
288            }
289        }
290
291        Ok(out)
292    }
293}
294
295// ---------------------------------------------------------------------------
296// Helpers
297// ---------------------------------------------------------------------------
298
299/// Inverse of the standard normal CDF (probit / norm.ppf), accurate to ~1e-9
300/// via Acklam's rational approximation. Mirrors scipy `stats.norm.ppf` as used
301/// by sklearn QuantileTransformer (`_data.py:2856`), with the output clipped to
302/// sklearn's ASYMMETRIC bounds (`_data.py:2860-2862`):
303/// `clip_min = norm.ppf(BOUNDS_THRESHOLD - spacing(1)) = -5.199337582605575`,
304/// `clip_max = norm.ppf(1 - (BOUNDS_THRESHOLD - spacing(1))) = 5.19933758270342`
305/// (Δ ≈ 9.8e-11 — `norm.ppf` is not exactly symmetric about 0 at these args).
306fn probit<F: Float>(p: F) -> F {
307    let clip_min = F::from(-5.199337582605575).unwrap_or_else(F::min_value);
308    let clip_max = F::from(5.19933758270342).unwrap_or_else(F::max_value);
309    let cf = |k: f64| F::from(k).unwrap_or_else(F::zero);
310    if p <= F::zero() {
311        return clip_min;
312    }
313    if p >= F::one() {
314        return clip_max;
315    }
316    let a = [
317        -3.969683028665376e+01,
318        2.209460984245205e+02,
319        -2.759285104469687e+02,
320        1.38357751867269e+02,
321        -3.066479806614716e+01,
322        2.506628277459239e+00,
323    ];
324    let b = [
325        -5.447609879822406e+01,
326        1.615858368580409e+02,
327        -1.556989798598866e+02,
328        6.680131188771972e+01,
329        -1.328068155288572e+01,
330    ];
331    let c = [
332        -7.784894002430293e-03,
333        -3.223964580411365e-01,
334        -2.400758277161838e+00,
335        -2.549732539343734e+00,
336        4.374664141464968e+00,
337        2.938163982698783e+00,
338    ];
339    let d = [
340        7.784695709041462e-03,
341        3.224671290700398e-01,
342        2.445134137142996e+00,
343        3.754408661907416e+00,
344    ];
345    let p_low = F::from(0.02425).unwrap_or_else(F::zero);
346    let p_high = F::one() - p_low;
347    let two = F::from(2.0).unwrap_or_else(F::one);
348    let half = F::from(0.5).unwrap_or_else(F::zero);
349    let x = if p < p_low {
350        let qv = (-two * p.ln()).sqrt();
351        (((((cf(c[0]) * qv + cf(c[1])) * qv + cf(c[2])) * qv + cf(c[3])) * qv + cf(c[4])) * qv
352            + cf(c[5]))
353            / ((((cf(d[0]) * qv + cf(d[1])) * qv + cf(d[2])) * qv + cf(d[3])) * qv + F::one())
354    } else if p <= p_high {
355        let qv = p - half;
356        let r = qv * qv;
357        (((((cf(a[0]) * r + cf(a[1])) * r + cf(a[2])) * r + cf(a[3])) * r + cf(a[4])) * r
358            + cf(a[5]))
359            * qv
360            / (((((cf(b[0]) * r + cf(b[1])) * r + cf(b[2])) * r + cf(b[3])) * r + cf(b[4])) * r
361                + F::one())
362    } else {
363        let qv = (-two * (F::one() - p).ln()).sqrt();
364        -(((((cf(c[0]) * qv + cf(c[1])) * qv + cf(c[2])) * qv + cf(c[3])) * qv + cf(c[4])) * qv
365            + cf(c[5]))
366            / ((((cf(d[0]) * qv + cf(d[1])) * qv + cf(d[2])) * qv + cf(d[3])) * qv + F::one())
367    };
368    if x < clip_min {
369        clip_min
370    } else if x > clip_max {
371        clip_max
372    } else {
373        x
374    }
375}
376
377/// numpy-`np.interp`-faithful 1-D linear interpolation (`xp` ascending, may have
378/// duplicate landmarks). Used twice (ascending + reversed) and averaged in the
379/// CDF lookup to mirror sklearn `_transform_col` (`_data.py:2843-2846`).
380fn np_interp<F: Float>(x: F, xp: &[F], fp: &[F]) -> F {
381    let n = xp.len();
382    if n == 0 {
383        return F::zero();
384    }
385    if x <= xp[0] {
386        return fp[0];
387    }
388    if x >= xp[n - 1] {
389        return fp[n - 1];
390    }
391    // first index with xp[idx] > x (numpy searchsorted side='right'); interval (idx-1, idx)
392    let mut lo = 0usize;
393    let mut hi = n;
394    while lo < hi {
395        let mid = lo + (hi - lo) / 2;
396        if xp[mid] <= x {
397            lo = mid + 1;
398        } else {
399            hi = mid;
400        }
401    }
402    let i = lo - 1;
403    let denom = xp[i + 1] - xp[i];
404    if denom == F::zero() {
405        fp[i]
406    } else {
407        fp[i] + (x - xp[i]) / denom * (fp[i + 1] - fp[i])
408    }
409}
410
411/// Standard normal CDF Φ(x) (scipy `ndtr` / `stats.norm.cdf`), computed in f64
412/// for accuracy then cast back to `F`. Used by [`FittedQuantileTransformer::inverse_transform`]
413/// to map a Normal-space value back to its uniform rank in `[0, 1]`, mirroring
414/// sklearn `_transform_col` inverse branch (`_data.py:2821`: `stats.norm.cdf(X_col)`).
415///
416/// Built from [`ferrolearn_numerical::special::erf`]/[`erfc`] (libm Cephes,
417/// machine precision) via the standard three-branch `ndtr` split on
418/// `t = x / sqrt(2)`, which preserves tail accuracy by using `erfc` (not
419/// `1 - erf`) in the tails:
420/// - `t < -1`  → `0.5 * erfc(-t)`   (left tail; no catastrophic cancellation)
421/// - `-1 ≤ t ≤ 1` → `0.5 * (1 + erf(t))` (central region)
422/// - `t > 1`   → `1 - 0.5 * erfc(t)` (right tail)
423///
424/// This is the analytic inverse of the forward [`probit`] (Acklam ppf, ~1e-9),
425/// so a Normal round-trip agrees with `scipy.stats.norm.cdf` to ~1e-9. A `NaN`
426/// input is passed through (returns `NaN`); ±∞ map to 1/0 since `erf(±∞)=±1`.
427fn norm_cdf<F: Float>(x: F) -> F {
428    let xf = x.to_f64().unwrap_or(f64::NAN);
429    if xf.is_nan() {
430        return F::nan();
431    }
432    let t = xf / std::f64::consts::SQRT_2;
433    let cdf = if t < -1.0 {
434        0.5 * erfc(-t)
435    } else if t <= 1.0 {
436        0.5 * (1.0 + erf(t))
437    } else {
438        1.0 - 0.5 * erfc(t)
439    };
440    F::from(cdf).unwrap_or_else(F::zero)
441}
442
443/// Map `value` to its quantile level by averaging the ascending and reversed
444/// linear interpolations, mirroring sklearn `_transform_col`
445/// (`_data.py:2843-2846`) so plateaus map to the midpoint of their span.
446fn interpolate_cdf<F: Float>(value: F, quantiles: &[F], references: &[F]) -> F {
447    if quantiles.is_empty() {
448        return F::from(0.5).unwrap_or_else(F::zero);
449    }
450    let forward = np_interp(value, quantiles, references);
451    let neg_q: Vec<F> = quantiles.iter().rev().map(|&qv| -qv).collect();
452    let neg_r: Vec<F> = references.iter().rev().map(|&rv| -rv).collect();
453    let backward = np_interp(-value, &neg_q, &neg_r);
454    F::from(0.5).unwrap_or_else(F::zero) * (forward - backward)
455}
456
457// ---------------------------------------------------------------------------
458// Trait implementations
459// ---------------------------------------------------------------------------
460
461impl<F: Float + Send + Sync + 'static> Fit<Array2<F>, ()> for QuantileTransformer<F> {
462    type Fitted = FittedQuantileTransformer<F>;
463    type Error = FerroError;
464
465    /// Fit by computing per-feature quantile reference values.
466    ///
467    /// # Errors
468    ///
469    /// - [`FerroError::InsufficientSamples`] if the input has 0 rows (empty).
470    /// - [`FerroError::InvalidParameter`] if `n_quantiles` is less than 2.
471    fn fit(&self, x: &Array2<F>, _y: &()) -> Result<FittedQuantileTransformer<F>, FerroError> {
472        let n_samples = x.nrows();
473        // sklearn accepts n_samples >= 1: `n_quantiles_ = max(1, min(n_quantiles,
474        // n_samples))` (`_data.py:2790`). Reject only the empty (0-row) case.
475        if n_samples < 1 {
476            return Err(FerroError::InsufficientSamples {
477                required: 1,
478                actual: n_samples,
479                context: "QuantileTransformer::fit".into(),
480            });
481        }
482        if self.n_quantiles < 2 {
483            return Err(FerroError::InvalidParameter {
484                name: "n_quantiles".into(),
485                reason: "n_quantiles must be at least 2".into(),
486            });
487        }
488
489        let n_features = x.ncols();
490        let effective_quantiles = self.n_quantiles.min(n_samples);
491
492        // numpy np.linspace(0,1,K): step = 1/(K-1) computed once, y[i] = i*step,
493        // endpoint y[K-1] pinned to exactly 1.0 (_data.py:2795). For K == 1
494        // (n_samples == 1, sklearn `n_quantiles_ = max(1, ...)`) np.linspace
495        // returns the single point [0.0] — guard the 1/(K-1) divide that would
496        // otherwise yield 0 * (1/0) = NaN.
497        let k = effective_quantiles;
498        let mut references: Vec<F> = if k <= 1 {
499            vec![F::zero()]
500        } else {
501            let denom = F::from(k - 1).unwrap_or_else(F::one);
502            let step = F::one() / denom;
503            (0..k)
504                .map(|i| F::from(i).unwrap_or_else(F::zero) * step)
505                .collect()
506        };
507        if k >= 2 {
508            let last = references.len() - 1;
509            references[last] = F::one();
510        }
511
512        let mut quantiles = Vec::with_capacity(n_features);
513
514        for j in 0..n_features {
515            let mut col_vals: Vec<F> = x.column(j).iter().copied().collect();
516            // Remove NaN values
517            col_vals.retain(|v| !v.is_nan());
518            col_vals.sort_by(|a, b| a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal));
519
520            // Subsample if needed
521            if self.subsample > 0 && col_vals.len() > self.subsample {
522                let step = col_vals.len() as f64 / self.subsample as f64;
523                let mut sampled = Vec::with_capacity(self.subsample);
524                for i in 0..self.subsample {
525                    let idx = (i as f64 * step) as usize;
526                    sampled.push(col_vals[idx.min(col_vals.len() - 1)]);
527                }
528                col_vals = sampled;
529            }
530
531            // Compute quantile reference values
532            let n = col_vals.len();
533            let mut feature_quantiles = Vec::with_capacity(effective_quantiles);
534            for &ref_level in &references {
535                // sklearn: np.nanpercentile(X, references_*100); numpy 'linear' virtual
536                // index is q/100*(n-1). Replicate the *100 then /100 round-trip
537                // (_data.py:2694,:2702).
538                let hundred = F::from(100.0).unwrap_or_else(F::one);
539                let q = ref_level * hundred;
540                let pos = (q / hundred) * F::from(n.saturating_sub(1)).unwrap_or_else(F::one);
541                let lo = pos.floor().to_usize().unwrap_or(0).min(n.saturating_sub(1));
542                let hi = pos.ceil().to_usize().unwrap_or(0).min(n.saturating_sub(1));
543                let frac = pos - F::from(lo).unwrap_or_else(F::zero);
544                let val = if lo == hi {
545                    col_vals[lo]
546                } else {
547                    col_vals[lo] * (F::one() - frac) + col_vals[hi] * frac
548                };
549                feature_quantiles.push(val);
550            }
551
552            quantiles.push(feature_quantiles);
553        }
554
555        Ok(FittedQuantileTransformer {
556            quantiles,
557            references,
558            output_distribution: self.output_distribution,
559        })
560    }
561}
562
563impl<F: Float + Send + Sync + 'static> Transform<Array2<F>> for FittedQuantileTransformer<F> {
564    type Output = Array2<F>;
565    type Error = FerroError;
566
567    /// Transform data by mapping each value through the empirical CDF.
568    ///
569    /// # Errors
570    ///
571    /// Returns [`FerroError::ShapeMismatch`] if the number of columns differs
572    /// from the number of features seen during fitting.
573    fn transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
574        let n_features = self.quantiles.len();
575        if x.ncols() != n_features {
576            return Err(FerroError::ShapeMismatch {
577                expected: vec![x.nrows(), n_features],
578                actual: vec![x.nrows(), x.ncols()],
579                context: "FittedQuantileTransformer::transform".into(),
580            });
581        }
582        // sklearn `transform` -> `_check_inputs(force_all_finite="allow-nan")`
583        // (`_data.py:2879`, called `:2943`): NaN passes through, but +/-inf
584        // raises ValueError (#2217, the forward mirror of the inverse #2212 guard
585        // above; MinMaxScaler #2200 precedent).
586        if x.iter().any(|v| v.is_infinite()) {
587            return Err(FerroError::InvalidParameter {
588                name: "X".into(),
589                reason: "Input X contains infinity or a value too large for dtype.".into(),
590            });
591        }
592
593        let mut out = x.to_owned();
594
595        // sklearn forward `_transform_col` bound masks (`_data.py:2829-2831`,
596        // `:2850-2851`): for `uniform` output the masks use EXACT equality
597        // against the landmark bounds — `lower_bounds_idx = X_col == quantiles[0]`,
598        // `upper_bounds_idx = X_col == quantiles[-1]` — and after the averaged
599        // interp the upper override is applied FIRST then the lower (`:2850-2851`),
600        // so a value matching BOTH bounds (e.g. a constant feature where every
601        // landmark == c) resolves to `lower_bound_y = references_[0] = 0`. The
602        // averaged-interp midpoint (#2318/#2319) is thus overridden at the
603        // exact-bound values. `lower_bound_y`/`upper_bound_y` are 0/1 (the
604        // [0,1] rank space) — the Normal `norm.ppf` + clip is applied afterwards.
605        let zero = F::zero();
606        let one = F::one();
607
608        for j in 0..n_features {
609            let feature_quantiles = &self.quantiles[j];
610            let (lower_bound_x, upper_bound_x) =
611                match (feature_quantiles.first(), feature_quantiles.last()) {
612                    (Some(&first), Some(&last)) => (first, last),
613                    _ => (zero, zero),
614                };
615            for i in 0..out.nrows() {
616                let val = out[[i, j]];
617                if val.is_nan() {
618                    continue;
619                }
620                let cdf_val = if feature_quantiles.len() == 1 {
621                    // Single landmark (n_quantiles_ == 1, e.g. a single-sample
622                    // fit). sklearn's forward bound masks are OUTPUT-DISTRIBUTION
623                    // DEPENDENT (`_data.py:2823-2828`, #2219):
624                    //   * Uniform uses EXACT-equality masks (X == bound), so a
625                    //     held-out value keeps the `np_interp` single-point clamp
626                    //     (references_[0] = 0) and the fitted value (== bound)
627                    //     also resolves to lower_bound_y = 0 → always 0.
628                    //   * Normal uses ±BOUNDS_THRESHOLD masks, so a value ABOVE
629                    //     the lone landmark → upper_bound_y = 1 (then
630                    //     probit(1) = +clip), at/below → lower_bound_y = 0
631                    //     (probit(0) = -clip).
632                    if self.output_distribution == OutputDistribution::Normal
633                        && val > feature_quantiles[0]
634                    {
635                        F::one()
636                    } else {
637                        F::zero()
638                    }
639                } else {
640                    interpolate_cdf(val, feature_quantiles, &self.references)
641                };
642
643                // Forward bound masks (`_data.py:2829-2831`,`:2850-2851`). UPPER
644                // applied first, then LOWER — so a value equal to BOTH bounds
645                // (constant feature) gets `lower_bound_y = 0`. Exact equality, as
646                // sklearn does on the f64 landmark values. Only override
647                // multi-landmark columns; the single-landmark path (above) already
648                // models its own output-distribution-dependent masks (#2219).
649                let rank = if feature_quantiles.len() == 1 {
650                    cdf_val
651                } else {
652                    let mut r = cdf_val;
653                    if val == upper_bound_x {
654                        r = one; // upper_bound_y = references_[-1] = 1
655                    }
656                    if val == lower_bound_x {
657                        r = zero; // lower_bound_y = references_[0] = 0
658                    }
659                    r
660                };
661
662                out[[i, j]] = match self.output_distribution {
663                    OutputDistribution::Uniform => rank,
664                    OutputDistribution::Normal => probit(rank),
665                };
666            }
667        }
668
669        Ok(out)
670    }
671}
672
673/// Implement `Transform` on the unfitted transformer to satisfy the
674/// `FitTransform: Transform` supertrait bound.
675impl<F: Float + Send + Sync + 'static> Transform<Array2<F>> for QuantileTransformer<F> {
676    type Output = Array2<F>;
677    type Error = FerroError;
678
679    /// Always returns an error — the transformer must be fitted first.
680    fn transform(&self, _x: &Array2<F>) -> Result<Array2<F>, FerroError> {
681        Err(FerroError::InvalidParameter {
682            name: "QuantileTransformer".into(),
683            reason: "transformer must be fitted before calling transform; use fit() first".into(),
684        })
685    }
686}
687
688impl<F: Float + Send + Sync + 'static> FitTransform<Array2<F>> for QuantileTransformer<F> {
689    type FitError = FerroError;
690
691    /// Fit and transform in one step.
692    ///
693    /// # Errors
694    ///
695    /// Returns an error if fitting fails.
696    fn fit_transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
697        let fitted = self.fit(x, &())?;
698        fitted.transform(x)
699    }
700}
701
702// ---------------------------------------------------------------------------
703// Standalone `quantile_transform` free function (sklearn `quantile_transform`,
704// `_data.py:2978`)
705// ---------------------------------------------------------------------------
706
707/// Transform features using quantiles information — the standalone,
708/// estimator-less API mirroring scikit-learn's `quantile_transform` free
709/// function (`sklearn/preprocessing/_data.py:2978`,`:3107-3119`).
710///
711/// This is a thin wrapper that **reuses** the fitted estimator: it constructs a
712/// [`QuantileTransformer`] with the given parameters, then runs the SHIPPED
713/// [`Fit::fit`] → [`Transform::transform`] (i.e. `fit_transform`) — it does NOT
714/// reimplement any quantile math. Each call fits on the supplied data and
715/// immediately transforms it (sklearn `n = QuantileTransformer(...); X =
716/// n.fit_transform(X)`, `:3107-3116`).
717///
718/// # Parameters
719///
720/// - `x` — the data to transform, shape `(n_samples, n_features)`.
721/// - `axis` — axis along which to transform (sklearn `axis=0` default,
722///   `:3013-3015`). `axis == 0` transforms each **feature** (column)
723///   independently — `fit_transform(X)` (`:3115-3116`). `axis == 1` transforms
724///   each **sample** (row) independently — `fit_transform(X.T).T`
725///   (`:3117-3118`): the matrix is transposed (so rows become columns / the
726///   per-feature path operates on each original row), fit-transformed, then
727///   transposed back to the original orientation.
728/// - `n_quantiles` — number of quantile reference landmarks (sklearn default
729///   1000, clamped to `n_samples`, `:3017-3023`).
730/// - `output_distribution` — [`OutputDistribution::Uniform`] (sklearn default,
731///   `:3025-3027`) or [`OutputDistribution::Normal`].
732/// - `subsample` — maximum samples used to estimate the quantiles (`0` = use
733///   all). See the scope note below.
734///
735/// # Scope
736///
737/// Unlike sklearn's `quantile_transform`, the `ignore_implicit_zeros`,
738/// `random_state`, and `copy` keyword arguments are **not** surfaced: this path
739/// is dense, deterministic, and always returns a freshly allocated array
740/// (`copy` is implicitly `True`). The `subsample` parameter is threaded to the
741/// estimator, but the actual **random** subsampling is REQ-6 (#1324,
742/// NOT-STARTED): for `n_samples > subsample` the deterministic strided pick is
743/// used instead of sklearn's RNG draw, so this function matches sklearn's output
744/// only when `n_samples <= subsample` (use `subsample=None` or a large value on
745/// the sklearn side to compare). `NaN` is disregarded when fitting and preserved
746/// in the output (sklearn Notes, `:3078-3079`).
747///
748/// # Errors
749///
750/// Returns [`FerroError::InvalidParameter`] if `axis` is neither `0` nor `1`
751/// (mirroring sklearn's `ValueError("axis should be either equal to 0 or 1.
752/// Got axis={axis}")`). Also propagates any error from the underlying
753/// [`Fit::fit`] / [`Transform::transform`] (e.g.
754/// [`FerroError::InsufficientSamples`] for fewer than 2 rows along the chosen
755/// axis, [`FerroError::InvalidParameter`] for `n_quantiles < 2` or non-finite
756/// `±inf` input).
757///
758/// # Examples
759///
760/// ```
761/// use ferrolearn_preprocess::quantile_transformer::{quantile_transform, OutputDistribution};
762/// use ndarray::array;
763///
764/// let x = array![[1.0_f64], [2.0], [3.0], [4.0], [5.0]];
765/// let out = quantile_transform(&x, 0, 5, OutputDistribution::Uniform, 0).unwrap();
766/// // First maps to 0.0, last to 1.0.
767/// assert!((out[[0, 0]] - 0.0_f64).abs() < 1e-9);
768/// assert!((out[[4, 0]] - 1.0_f64).abs() < 1e-9);
769/// ```
770#[must_use = "quantile_transform returns a new array; the input is not modified"]
771pub fn quantile_transform<F: Float + Send + Sync + 'static>(
772    x: &Array2<F>,
773    axis: usize,
774    n_quantiles: usize,
775    output_distribution: OutputDistribution,
776    subsample: usize,
777) -> Result<Array2<F>, FerroError> {
778    let qt = QuantileTransformer::<F>::new(n_quantiles, output_distribution, subsample);
779    match axis {
780        // sklearn `if axis == 0: X = n.fit_transform(X)` (`_data.py:3115-3116`):
781        // each feature (column) transformed independently.
782        0 => {
783            let fitted = qt.fit(x, &())?;
784            fitted.transform(x)
785        }
786        // sklearn `else: X = n.fit_transform(X.T).T` (`_data.py:3117-3118`): each
787        // sample (row) transformed independently. Transpose to a CONTIGUOUS owned
788        // array so the per-feature `fit`/`transform` operates on each original
789        // row, then transpose the result back to the original orientation.
790        1 => {
791            let xt = x.t().to_owned();
792            let fitted = qt.fit(&xt, &())?;
793            let out_t = fitted.transform(&xt)?;
794            Ok(out_t.t().to_owned())
795        }
796        // sklearn `else: raise ValueError("axis should be either equal to 0 or 1.
797        // Got axis={axis}")` (`_data.py:3117` guard via `@validate_params`).
798        other => Err(FerroError::InvalidParameter {
799            name: "axis".into(),
800            reason: format!("axis should be either equal to 0 or 1. Got axis={other}"),
801        }),
802    }
803}
804
805// ---------------------------------------------------------------------------
806// Tests
807// ---------------------------------------------------------------------------
808
809#[cfg(test)]
810mod tests {
811    use super::*;
812    use approx::assert_abs_diff_eq;
813    use ndarray::array;
814
815    #[test]
816    fn test_quantile_transformer_uniform() {
817        let qt = QuantileTransformer::<f64>::new(100, OutputDistribution::Uniform, 0);
818        let x = array![[1.0], [2.0], [3.0], [4.0], [5.0]];
819        let fitted = qt.fit(&x, &()).unwrap();
820        let out = fitted.transform(&x).unwrap();
821        // All values should be in [0, 1]
822        for v in &out {
823            assert!(*v >= 0.0 && *v <= 1.0, "Value {v} not in [0,1]");
824        }
825        // First should be 0, last should be 1
826        assert_abs_diff_eq!(out[[0, 0]], 0.0, epsilon = 1e-6);
827        assert_abs_diff_eq!(out[[4, 0]], 1.0, epsilon = 1e-6);
828    }
829
830    #[test]
831    fn test_quantile_transformer_normal() {
832        let qt = QuantileTransformer::<f64>::new(100, OutputDistribution::Normal, 0);
833        let x = array![[1.0], [2.0], [3.0], [4.0], [5.0]];
834        let fitted = qt.fit(&x, &()).unwrap();
835        let out = fitted.transform(&x).unwrap();
836        // Middle value should be close to 0 (median → 0 in normal)
837        assert!(out[[2, 0]].abs() < 0.5, "Median should map near 0");
838        // First should be negative, last positive
839        assert!(out[[0, 0]] < out[[4, 0]]);
840    }
841
842    #[test]
843    fn test_quantile_transformer_monotonic() {
844        let qt = QuantileTransformer::<f64>::new(100, OutputDistribution::Uniform, 0);
845        let x = array![[5.0], [3.0], [1.0], [4.0], [2.0]];
846        let fitted = qt.fit(&x, &()).unwrap();
847        let out = fitted.transform(&x).unwrap();
848        // Transform should preserve ordering: rank(5) > rank(3) > rank(1)
849        assert!(out[[0, 0]] > out[[1, 0]]); // 5 > 3
850        assert!(out[[1, 0]] > out[[2, 0]]); // 3 > 1
851    }
852
853    #[test]
854    fn test_quantile_transformer_multiple_features() {
855        let qt = QuantileTransformer::<f64>::new(50, OutputDistribution::Uniform, 0);
856        let x = array![[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]];
857        let fitted = qt.fit(&x, &()).unwrap();
858        let out = fitted.transform(&x).unwrap();
859        assert_eq!(out.ncols(), 2);
860        // Each feature independently transformed
861        for j in 0..2 {
862            assert!(out[[0, j]] <= out[[2, j]]);
863        }
864    }
865
866    #[test]
867    fn test_quantile_transformer_fit_transform() {
868        let qt = QuantileTransformer::<f64>::new(100, OutputDistribution::Uniform, 0);
869        let x = array![[1.0], [2.0], [3.0], [4.0], [5.0]];
870        let out = qt.fit_transform(&x).unwrap();
871        assert_abs_diff_eq!(out[[0, 0]], 0.0, epsilon = 1e-6);
872        assert_abs_diff_eq!(out[[4, 0]], 1.0, epsilon = 1e-6);
873    }
874
875    #[test]
876    fn test_quantile_transformer_insufficient_samples_error() {
877        // sklearn accepts n_samples == 1 (clamps n_quantiles_ to 1, `_data.py:2790`);
878        // only the empty (0-row) case is rejected (#2218).
879        let qt = QuantileTransformer::<f64>::new(100, OutputDistribution::Uniform, 0);
880        let empty = ndarray::Array2::<f64>::zeros((0, 1));
881        assert!(qt.fit(&empty, &()).is_err());
882    }
883
884    #[test]
885    fn test_quantile_transformer_single_sample_all_zeros_no_nan() {
886        // sklearn: QuantileTransformer(5,'uniform').fit_transform([[7.]]) -> [[0.0]]
887        // (n_quantiles_ clamped to 1, references_=[0.0]). Must not produce NaN.
888        let qt = QuantileTransformer::<f64>::new(5, OutputDistribution::Uniform, 0);
889        let x = array![[7.0]];
890        let res = qt.fit_transform(&x);
891        assert!(res.is_ok(), "single-sample fit_transform must succeed");
892        if let Ok(out) = res {
893            assert_eq!(out.dim(), (1, 1));
894            let v = out[[0, 0]];
895            assert!(v.abs() <= 1e-12 && !v.is_nan());
896        }
897    }
898
899    #[test]
900    fn test_quantile_transformer_too_few_quantiles_error() {
901        let qt = QuantileTransformer::<f64>::new(1, OutputDistribution::Uniform, 0);
902        let x = array![[1.0], [2.0], [3.0]];
903        assert!(qt.fit(&x, &()).is_err());
904    }
905
906    #[test]
907    fn test_quantile_transformer_shape_mismatch() {
908        let qt = QuantileTransformer::<f64>::new(100, OutputDistribution::Uniform, 0);
909        let x_train = array![[1.0, 2.0], [3.0, 4.0]];
910        let fitted = qt.fit(&x_train, &()).unwrap();
911        let x_bad = array![[1.0, 2.0, 3.0]];
912        assert!(fitted.transform(&x_bad).is_err());
913    }
914
915    #[test]
916    fn test_quantile_transformer_unfitted_error() {
917        let qt = QuantileTransformer::<f64>::new(100, OutputDistribution::Uniform, 0);
918        let x = array![[1.0]];
919        assert!(qt.transform(&x).is_err());
920    }
921
922    #[test]
923    fn test_quantile_transformer_default() {
924        let qt = QuantileTransformer::<f64>::default();
925        assert_eq!(qt.n_quantiles(), 1000);
926        assert_eq!(qt.output_distribution(), OutputDistribution::Uniform);
927        assert_eq!(qt.subsample(), 100_000);
928    }
929
930    #[test]
931    fn test_quantile_transformer_f32() {
932        let qt = QuantileTransformer::<f32>::new(50, OutputDistribution::Uniform, 0);
933        let x: Array2<f32> = array![[1.0f32], [2.0], [3.0], [4.0], [5.0]];
934        let fitted = qt.fit(&x, &()).unwrap();
935        let out = fitted.transform(&x).unwrap();
936        assert!(out[[0, 0]] >= 0.0f32);
937        assert!(out[[4, 0]] <= 1.0f32);
938    }
939}
ferrolearn_preprocess/quantile_transformer.rs

ferrolearn_preprocess/
quantile_transformer.rs