ferrolearn_preprocess/quantile_transformer.rs
1//! Quantile transformer: map features to a uniform or normal distribution.
2//!
3//! [`QuantileTransformer`] transforms features by mapping each value through
4//! its empirical cumulative distribution function (CDF), producing values
5//! uniformly distributed in `[0, 1]`. Optionally, the result can be mapped
6//! to a standard normal distribution using the inverse normal CDF (probit).
7//!
8//! This is useful for making features more Gaussian-like, which can improve
9//! the performance of many machine learning algorithms.
10//!
11//! Translation target: scikit-learn 1.5.2 `class QuantileTransformer`
12//! (`sklearn/preprocessing/_data.py:2540`) + `quantile_transform` (`:2978`).
13//! Design: `.design/preprocess/quantile_transformer.md`. Tracking: #1319.
14//!
15//! `## REQ status`
16//!
17//! | REQ | Status | Anchor |
18//! |---|---|---|
19//! | REQ-1 forward-CDF value surface (uniform + normal, distinct + tied) | SHIPPED | `fit` references/landmarks (#1322) + `transform`; sklearn `_data.py:2694`,`:2702`,`:2795` |
20//! | REQ-2 forward+reversed AVERAGED interpolation | SHIPPED (#1321) | `interpolate_cdf`/`np_interp`; sklearn `_data.py:2843-2846` |
21//! | REQ-3 Normal output accuracy (Acklam ppf + clip) | SHIPPED (#1320) | `probit`; sklearn `_data.py:2855-2862` |
22//! | REQ-5 error/parameter contracts (scoped) | SHIPPED | `fit`/`transform` guards; sklearn `_data.py:2654` |
23//! | REQ-4 `np.maximum.accumulate` monotonic repair (unobservable) | NOT-STARTED (#1323) | sklearn `_data.py:2707` |
24//! | REQ-6 random subsample + random_state + n_quantiles>subsample error | NOT-STARTED (#1324) | sklearn `_data.py:2696-2700`,`:2774-2779` |
25//! | REQ-7 `inverse_transform` (reverse interp + `norm.cdf` + bounds + NaN) | SHIPPED (#1325) | `FittedQuantileTransformer::inverse_transform` + `norm_cdf` (ndtr via `erf`/`erfc`); sklearn `_data.py:2947`,`:2813-2851`,`:2821`. Uniform ~exact, Normal ~1e-7 (follows the forward REQ-3 ~1e-9 ndtr/probit contract). Consumer: crate re-export `pub use quantile_transformer::FittedQuantileTransformer` (`lib.rs`, boundary public API). Live-oracle: `tests/divergence_quantile_transformer.rs` (uniform/normal round-trip + held-out + bounds + NaN + `norm_cdf` sanity) |
26//! | REQ-8 `ignore_implicit_zeros` + sparse CSC | NOT-STARTED (#1326) | sklearn `_data.py:2709-2752` |
27//! | REQ-9 `quantile_transform` free function | SHIPPED (#1327) | `quantile_transform` free fn delegates to the SHIPPED `QuantileTransformer::new(n_quantiles, output_distribution, subsample)` + `Fit`→`FittedQuantileTransformer::transform` (NO quantile math duplicated); mirrors sklearn `quantile_transform(X, *, axis=0, n_quantiles=1000, output_distribution="uniform", ignore_implicit_zeros=False, subsample=1e5, random_state=None, copy=True)` (`_data.py:2978`,`:3107-3119`): `axis==0` → `fit_transform(X)` (per-feature, `:3115-3116`); `axis==1` → `fit_transform(X.T).T` (per-sample, `:3117-3118`, contiguous owned transpose); `axis ∉ {0,1}` → `InvalidParameter` "axis should be either equal to 0 or 1. Got axis=<n>" (matches the `else` ValueError `:3117`). SCOPE: `ignore_implicit_zeros`/`random_state`/`copy` are not surfaced (dense, deterministic, always-copy); `subsample` is threaded but the actual random subsampling is REQ-6 NOT-STARTED (#1324) so the path is deterministic and matches sklearn only for n_samples ≤ subsample (`subsample=None`/large in the oracle). Consumer: crate re-export `pub use quantile_transformer::quantile_transform` (`lib.rs`, boundary public API, R-DEFER-1). Live-oracle: `tests/divergence_quantile_transformer.rs` (axis=0 uniform/normal distinct+tied, axis=1 non-square, axis-error, == estimator fit_transform, multi-feature column-by-column, f32). |
28//! | REQ-10 `copy` + OneToOneFeatureMixin fitted-attr surface | NOT-STARTED (#1328) | sklearn `_data.py:2540`,`:2790-2795` |
29//! | REQ-11 PyO3 binding | SHIPPED (#1329) | `_RsQuantileTransformer` (`ferrolearn-python/src/extras.rs`) over `QuantileTransformer`/`FittedQuantileTransformer` exposes `fit`/`transform`/`inverse_transform` + the `#[getter]`s `quantiles_` (`&[Vec<F>]` → `(n_quantiles_, n_features)` numpy, transposed), `references_` (this module's `FittedQuantileTransformer::references`), `n_quantiles_` (effective, `FittedQuantileTransformer::n_quantiles` == `references().len()`, the `min(n_quantiles, n_samples)` clamp `_data.py:2790`), `n_features_in_`; `output_distribution` "uniform"/"normal" string→`OutputDistribution` (bad→`PyValueError`, `_data.py:2655` StrOptions). Python `class QuantileTransformer(_TransformerWrapper)` (`_extras.py`) mirrors sklearn's keyword-only ctor `(*, n_quantiles=1000, output_distribution="uniform", ignore_implicit_zeros=False, subsample=10_000, random_state=None, copy=True)` (`_data.py:2662`); pre-fit transform/attr→`NotFittedError`; n_quantiles>subsample→`ValueError` (`_data.py:2774`); registered `lib.rs`/`__init__.py` (non-test consumer, R-DEFER-1). SCOPE: `subsample`/`random_state` accepted for get_params compat but the deterministic path uses ALL samples (REQ-6 actual subsample NOT-STARTED #1324, RNG-gated #2118) — MATCHES sklearn only for n_samples ≤ subsample; `ignore_implicit_zeros` accepted-and-documented (REQ-8 dense-only #1326); `copy` no-op. float32 input → ~1e-5 tolerant (f64 ABI cast-back, #2215-class); Normal output → ~1e-7 (REQ-3 Acklam ppf/ndtr). Live-oracle: `tests/divergence_quantile_transformer_py.py`. |
30//! | REQ-12 ferray substrate | NOT-STARTED (#1330) | R-SUBSTRATE |
31
32use ferrolearn_core::error::FerroError;
33use ferrolearn_core::traits::{Fit, FitTransform, Transform};
34use ferrolearn_numerical::special::{erf, erfc};
35use ndarray::Array2;
36use num_traits::Float;
37
38// ---------------------------------------------------------------------------
39// OutputDistribution
40// ---------------------------------------------------------------------------
41
42/// Target output distribution for the quantile transformer.
43#[derive(Debug, Clone, Copy, PartialEq, Eq)]
44pub enum OutputDistribution {
45 /// Map to the uniform distribution on `[0, 1]`.
46 Uniform,
47 /// Map to the standard normal distribution via the probit function.
48 Normal,
49}
50
51// ---------------------------------------------------------------------------
52// QuantileTransformer (unfitted)
53// ---------------------------------------------------------------------------
54
55/// An unfitted quantile transformer.
56///
57/// Calling [`Fit::fit`] computes the quantiles for each feature and returns a
58/// [`FittedQuantileTransformer`].
59///
60/// # Parameters
61///
62/// - `n_quantiles` — number of quantile reference points (default 1000).
63/// - `output_distribution` — target distribution (default `Uniform`).
64/// - `subsample` — maximum number of samples used to compute quantiles
65/// (default 100_000; set to 0 to use all samples).
66///
67/// # Examples
68///
69/// ```
70/// use ferrolearn_preprocess::quantile_transformer::{QuantileTransformer, OutputDistribution};
71/// use ferrolearn_core::traits::{Fit, Transform};
72/// use ndarray::array;
73///
74/// let qt = QuantileTransformer::<f64>::new(100, OutputDistribution::Uniform, 0);
75/// let x = array![[1.0], [2.0], [3.0], [4.0], [5.0]];
76/// let fitted = qt.fit(&x, &()).unwrap();
77/// let out = fitted.transform(&x).unwrap();
78/// // Values should be in [0, 1]
79/// for v in out.iter() {
80/// assert!(*v >= 0.0 && *v <= 1.0);
81/// }
82/// ```
83#[must_use]
84#[derive(Debug, Clone)]
85pub struct QuantileTransformer<F> {
86 /// Number of quantile reference points.
87 n_quantiles: usize,
88 /// Target output distribution.
89 output_distribution: OutputDistribution,
90 /// Maximum number of samples for quantile computation (0 = all).
91 subsample: usize,
92 _marker: std::marker::PhantomData<F>,
93}
94
95impl<F: Float + Send + Sync + 'static> QuantileTransformer<F> {
96 /// Create a new `QuantileTransformer`.
97 pub fn new(
98 n_quantiles: usize,
99 output_distribution: OutputDistribution,
100 subsample: usize,
101 ) -> Self {
102 Self {
103 n_quantiles,
104 output_distribution,
105 subsample,
106 _marker: std::marker::PhantomData,
107 }
108 }
109
110 /// Return the number of quantiles.
111 #[must_use]
112 pub fn n_quantiles(&self) -> usize {
113 self.n_quantiles
114 }
115
116 /// Return the target output distribution.
117 #[must_use]
118 pub fn output_distribution(&self) -> OutputDistribution {
119 self.output_distribution
120 }
121
122 /// Return the subsample size.
123 #[must_use]
124 pub fn subsample(&self) -> usize {
125 self.subsample
126 }
127}
128
129impl<F: Float + Send + Sync + 'static> Default for QuantileTransformer<F> {
130 fn default() -> Self {
131 Self::new(1000, OutputDistribution::Uniform, 100_000)
132 }
133}
134
135// ---------------------------------------------------------------------------
136// FittedQuantileTransformer
137// ---------------------------------------------------------------------------
138
139/// A fitted quantile transformer holding per-feature quantile references.
140///
141/// Created by calling [`Fit::fit`] on a [`QuantileTransformer`].
142#[derive(Debug, Clone)]
143pub struct FittedQuantileTransformer<F> {
144 /// Quantile reference values per feature: `quantiles[j]` is a sorted
145 /// vector of reference values for feature `j`.
146 quantiles: Vec<Vec<F>>,
147 /// The reference quantile levels (evenly spaced in [0, 1]).
148 references: Vec<F>,
149 /// Target output distribution.
150 output_distribution: OutputDistribution,
151}
152
153impl<F: Float + Send + Sync + 'static> FittedQuantileTransformer<F> {
154 /// Return the computed quantile reference values per feature.
155 #[must_use]
156 pub fn quantiles(&self) -> &[Vec<F>] {
157 &self.quantiles
158 }
159
160 /// Return the reference quantile levels (the evenly-spaced `[0, 1]` grid).
161 ///
162 /// Mirrors scikit-learn `references_` (`_data.py:2795` `references_ =
163 /// np.linspace(0, 1, n_quantiles_, endpoint=True)`): a length-`n_quantiles_`
164 /// vector, where `n_quantiles_` is the effective (clamped) quantile count.
165 #[must_use]
166 pub fn references(&self) -> &[F] {
167 &self.references
168 }
169
170 /// Return the effective number of quantile reference points.
171 ///
172 /// Mirrors scikit-learn `n_quantiles_` (`_data.py:2790` `n_quantiles_ =
173 /// max(1, min(n_quantiles, n_samples))`): the requested `n_quantiles`
174 /// clamped down to the number of samples seen during `fit`. Equals
175 /// `references().len()`.
176 #[must_use]
177 pub fn n_quantiles(&self) -> usize {
178 self.references.len()
179 }
180
181 /// Return the number of features.
182 #[must_use]
183 pub fn n_features(&self) -> usize {
184 self.quantiles.len()
185 }
186
187 /// Back-project transformed data to the original feature space.
188 ///
189 /// The inverse of [`Transform::transform`]: each transformed column value is
190 /// mapped back to an original feature value. Mirrors scikit-learn
191 /// `QuantileTransformer.inverse_transform` (`_data.py:2947`) which runs
192 /// `_transform_col(..., inverse=True)` (`_data.py:2813-2866`):
193 ///
194 /// - **Normal** output: the value is first mapped to its uniform rank in
195 /// `[0, 1]` via the standard normal CDF `stats.norm.cdf` (`:2821`,
196 /// [`norm_cdf`]); the bound masks use `BOUNDS_THRESHOLD = 1e-7` (`:47`)
197 /// against `0` / `1` (`:2827-2828`).
198 /// - **Uniform** output: the value IS the rank; the bound masks use exact
199 /// equality `== 0` / `== 1` (`:2830-2831`).
200 ///
201 /// The finite ranks are interpolated with a **plain** `np.interp(rank,
202 /// references_, quantiles)` — the REVERSE of the forward interp, and NOT the
203 /// forward/reversed-averaged variant (the averaging is forward-only,
204 /// `:2848`). Ranks at/below the lower bound map to the column minimum
205 /// `quantiles[0]` and at/above the upper bound to the column maximum
206 /// `quantiles[-1]` (`:2850-2851`). `NaN` passes through unchanged
207 /// (`isfinite_mask`, `:2833`).
208 ///
209 /// # Errors
210 ///
211 /// Returns [`FerroError::ShapeMismatch`] if the number of columns differs
212 /// from the number of features seen during fitting.
213 pub fn inverse_transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
214 let n_features = self.quantiles.len();
215 if x.ncols() != n_features {
216 return Err(FerroError::ShapeMismatch {
217 expected: vec![x.nrows(), n_features],
218 actual: vec![x.nrows(), x.ncols()],
219 context: "FittedQuantileTransformer::inverse_transform".into(),
220 });
221 }
222 // sklearn `inverse_transform` -> `_check_inputs(force_all_finite=
223 // "allow-nan")` (`_data.py:2876`, called `:2965`): NaN passes through,
224 // but +/-inf raises ValueError (#2212, MinMaxScaler #2200 precedent).
225 if x.iter().any(|v| v.is_infinite()) {
226 return Err(FerroError::InvalidParameter {
227 name: "X".into(),
228 reason: "Input X contains infinity or a value too large for dtype.".into(),
229 });
230 }
231
232 // sklearn _transform_col inverse: lower_bound_x = 0, upper_bound_x = 1,
233 // lower_bound_y = quantiles[0], upper_bound_y = quantiles[-1]
234 // (`_data.py:2814-2817`). BOUNDS_THRESHOLD = 1e-7 (`:47`).
235 let bounds_threshold = F::from(1e-7).unwrap_or_else(F::zero);
236 let zero = F::zero();
237 let one = F::one();
238
239 let mut out = x.to_owned();
240
241 for j in 0..n_features {
242 let quantiles_col = &self.quantiles[j];
243 let (lower_y, upper_y) = match (quantiles_col.first(), quantiles_col.last()) {
244 (Some(&first), Some(&last)) => (first, last),
245 // Empty landmark column: nothing to map to; leave values as-is.
246 _ => continue,
247 };
248
249 for i in 0..out.nrows() {
250 let val = out[[i, j]];
251 if val.is_nan() {
252 // isfinite_mask excludes NaN; it stays NaN (`_data.py:2833`).
253 continue;
254 }
255
256 // For Normal output, map the normal-space value back to the
257 // uniform rank in [0, 1] (`_data.py:2821`). For Uniform the
258 // value already IS the rank.
259 let rank = match self.output_distribution {
260 OutputDistribution::Uniform => val,
261 OutputDistribution::Normal => norm_cdf(val),
262 };
263
264 // Bound masks (`_data.py:2826-2831`). Under errstate(invalid=
265 // "ignore") NaN comparisons are false — but `rank` is finite
266 // here (norm_cdf of a finite value is finite, val is finite).
267 let (lower_idx, upper_idx) = match self.output_distribution {
268 OutputDistribution::Normal => (
269 rank - bounds_threshold < zero,
270 rank + bounds_threshold > one,
271 ),
272 OutputDistribution::Uniform => (rank == zero, rank == one),
273 };
274
275 // Plain reverse interp references_ -> quantiles (`_data.py:2848`).
276 let mut mapped = np_interp(rank, &self.references, quantiles_col);
277
278 // Bound overrides (`_data.py:2850-2851`): upper applied first,
279 // then lower (matching sklearn's order).
280 if upper_idx {
281 mapped = upper_y;
282 }
283 if lower_idx {
284 mapped = lower_y;
285 }
286
287 out[[i, j]] = mapped;
288 }
289 }
290
291 Ok(out)
292 }
293}
294
295// ---------------------------------------------------------------------------
296// Helpers
297// ---------------------------------------------------------------------------
298
299/// Inverse of the standard normal CDF (probit / norm.ppf), accurate to ~1e-9
300/// via Acklam's rational approximation. Mirrors scipy `stats.norm.ppf` as used
301/// by sklearn QuantileTransformer (`_data.py:2856`), with the output clipped to
302/// sklearn's ASYMMETRIC bounds (`_data.py:2860-2862`):
303/// `clip_min = norm.ppf(BOUNDS_THRESHOLD - spacing(1)) = -5.199337582605575`,
304/// `clip_max = norm.ppf(1 - (BOUNDS_THRESHOLD - spacing(1))) = 5.19933758270342`
305/// (Δ ≈ 9.8e-11 — `norm.ppf` is not exactly symmetric about 0 at these args).
306fn probit<F: Float>(p: F) -> F {
307 let clip_min = F::from(-5.199337582605575).unwrap_or_else(F::min_value);
308 let clip_max = F::from(5.19933758270342).unwrap_or_else(F::max_value);
309 let cf = |k: f64| F::from(k).unwrap_or_else(F::zero);
310 if p <= F::zero() {
311 return clip_min;
312 }
313 if p >= F::one() {
314 return clip_max;
315 }
316 let a = [
317 -3.969683028665376e+01,
318 2.209460984245205e+02,
319 -2.759285104469687e+02,
320 1.38357751867269e+02,
321 -3.066479806614716e+01,
322 2.506628277459239e+00,
323 ];
324 let b = [
325 -5.447609879822406e+01,
326 1.615858368580409e+02,
327 -1.556989798598866e+02,
328 6.680131188771972e+01,
329 -1.328068155288572e+01,
330 ];
331 let c = [
332 -7.784894002430293e-03,
333 -3.223964580411365e-01,
334 -2.400758277161838e+00,
335 -2.549732539343734e+00,
336 4.374664141464968e+00,
337 2.938163982698783e+00,
338 ];
339 let d = [
340 7.784695709041462e-03,
341 3.224671290700398e-01,
342 2.445134137142996e+00,
343 3.754408661907416e+00,
344 ];
345 let p_low = F::from(0.02425).unwrap_or_else(F::zero);
346 let p_high = F::one() - p_low;
347 let two = F::from(2.0).unwrap_or_else(F::one);
348 let half = F::from(0.5).unwrap_or_else(F::zero);
349 let x = if p < p_low {
350 let qv = (-two * p.ln()).sqrt();
351 (((((cf(c[0]) * qv + cf(c[1])) * qv + cf(c[2])) * qv + cf(c[3])) * qv + cf(c[4])) * qv
352 + cf(c[5]))
353 / ((((cf(d[0]) * qv + cf(d[1])) * qv + cf(d[2])) * qv + cf(d[3])) * qv + F::one())
354 } else if p <= p_high {
355 let qv = p - half;
356 let r = qv * qv;
357 (((((cf(a[0]) * r + cf(a[1])) * r + cf(a[2])) * r + cf(a[3])) * r + cf(a[4])) * r
358 + cf(a[5]))
359 * qv
360 / (((((cf(b[0]) * r + cf(b[1])) * r + cf(b[2])) * r + cf(b[3])) * r + cf(b[4])) * r
361 + F::one())
362 } else {
363 let qv = (-two * (F::one() - p).ln()).sqrt();
364 -(((((cf(c[0]) * qv + cf(c[1])) * qv + cf(c[2])) * qv + cf(c[3])) * qv + cf(c[4])) * qv
365 + cf(c[5]))
366 / ((((cf(d[0]) * qv + cf(d[1])) * qv + cf(d[2])) * qv + cf(d[3])) * qv + F::one())
367 };
368 if x < clip_min {
369 clip_min
370 } else if x > clip_max {
371 clip_max
372 } else {
373 x
374 }
375}
376
377/// numpy-`np.interp`-faithful 1-D linear interpolation (`xp` ascending, may have
378/// duplicate landmarks). Used twice (ascending + reversed) and averaged in the
379/// CDF lookup to mirror sklearn `_transform_col` (`_data.py:2843-2846`).
380fn np_interp<F: Float>(x: F, xp: &[F], fp: &[F]) -> F {
381 let n = xp.len();
382 if n == 0 {
383 return F::zero();
384 }
385 if x <= xp[0] {
386 return fp[0];
387 }
388 if x >= xp[n - 1] {
389 return fp[n - 1];
390 }
391 // first index with xp[idx] > x (numpy searchsorted side='right'); interval (idx-1, idx)
392 let mut lo = 0usize;
393 let mut hi = n;
394 while lo < hi {
395 let mid = lo + (hi - lo) / 2;
396 if xp[mid] <= x {
397 lo = mid + 1;
398 } else {
399 hi = mid;
400 }
401 }
402 let i = lo - 1;
403 let denom = xp[i + 1] - xp[i];
404 if denom == F::zero() {
405 fp[i]
406 } else {
407 fp[i] + (x - xp[i]) / denom * (fp[i + 1] - fp[i])
408 }
409}
410
411/// Standard normal CDF Φ(x) (scipy `ndtr` / `stats.norm.cdf`), computed in f64
412/// for accuracy then cast back to `F`. Used by [`FittedQuantileTransformer::inverse_transform`]
413/// to map a Normal-space value back to its uniform rank in `[0, 1]`, mirroring
414/// sklearn `_transform_col` inverse branch (`_data.py:2821`: `stats.norm.cdf(X_col)`).
415///
416/// Built from [`ferrolearn_numerical::special::erf`]/[`erfc`] (libm Cephes,
417/// machine precision) via the standard three-branch `ndtr` split on
418/// `t = x / sqrt(2)`, which preserves tail accuracy by using `erfc` (not
419/// `1 - erf`) in the tails:
420/// - `t < -1` → `0.5 * erfc(-t)` (left tail; no catastrophic cancellation)
421/// - `-1 ≤ t ≤ 1` → `0.5 * (1 + erf(t))` (central region)
422/// - `t > 1` → `1 - 0.5 * erfc(t)` (right tail)
423///
424/// This is the analytic inverse of the forward [`probit`] (Acklam ppf, ~1e-9),
425/// so a Normal round-trip agrees with `scipy.stats.norm.cdf` to ~1e-9. A `NaN`
426/// input is passed through (returns `NaN`); ±∞ map to 1/0 since `erf(±∞)=±1`.
427fn norm_cdf<F: Float>(x: F) -> F {
428 let xf = x.to_f64().unwrap_or(f64::NAN);
429 if xf.is_nan() {
430 return F::nan();
431 }
432 let t = xf / std::f64::consts::SQRT_2;
433 let cdf = if t < -1.0 {
434 0.5 * erfc(-t)
435 } else if t <= 1.0 {
436 0.5 * (1.0 + erf(t))
437 } else {
438 1.0 - 0.5 * erfc(t)
439 };
440 F::from(cdf).unwrap_or_else(F::zero)
441}
442
443/// Map `value` to its quantile level by averaging the ascending and reversed
444/// linear interpolations, mirroring sklearn `_transform_col`
445/// (`_data.py:2843-2846`) so plateaus map to the midpoint of their span.
446fn interpolate_cdf<F: Float>(value: F, quantiles: &[F], references: &[F]) -> F {
447 if quantiles.is_empty() {
448 return F::from(0.5).unwrap_or_else(F::zero);
449 }
450 let forward = np_interp(value, quantiles, references);
451 let neg_q: Vec<F> = quantiles.iter().rev().map(|&qv| -qv).collect();
452 let neg_r: Vec<F> = references.iter().rev().map(|&rv| -rv).collect();
453 let backward = np_interp(-value, &neg_q, &neg_r);
454 F::from(0.5).unwrap_or_else(F::zero) * (forward - backward)
455}
456
457// ---------------------------------------------------------------------------
458// Trait implementations
459// ---------------------------------------------------------------------------
460
461impl<F: Float + Send + Sync + 'static> Fit<Array2<F>, ()> for QuantileTransformer<F> {
462 type Fitted = FittedQuantileTransformer<F>;
463 type Error = FerroError;
464
465 /// Fit by computing per-feature quantile reference values.
466 ///
467 /// # Errors
468 ///
469 /// - [`FerroError::InsufficientSamples`] if the input has 0 rows (empty).
470 /// - [`FerroError::InvalidParameter`] if `n_quantiles` is less than 2.
471 fn fit(&self, x: &Array2<F>, _y: &()) -> Result<FittedQuantileTransformer<F>, FerroError> {
472 let n_samples = x.nrows();
473 // sklearn accepts n_samples >= 1: `n_quantiles_ = max(1, min(n_quantiles,
474 // n_samples))` (`_data.py:2790`). Reject only the empty (0-row) case.
475 if n_samples < 1 {
476 return Err(FerroError::InsufficientSamples {
477 required: 1,
478 actual: n_samples,
479 context: "QuantileTransformer::fit".into(),
480 });
481 }
482 if self.n_quantiles < 2 {
483 return Err(FerroError::InvalidParameter {
484 name: "n_quantiles".into(),
485 reason: "n_quantiles must be at least 2".into(),
486 });
487 }
488
489 let n_features = x.ncols();
490 let effective_quantiles = self.n_quantiles.min(n_samples);
491
492 // numpy np.linspace(0,1,K): step = 1/(K-1) computed once, y[i] = i*step,
493 // endpoint y[K-1] pinned to exactly 1.0 (_data.py:2795). For K == 1
494 // (n_samples == 1, sklearn `n_quantiles_ = max(1, ...)`) np.linspace
495 // returns the single point [0.0] — guard the 1/(K-1) divide that would
496 // otherwise yield 0 * (1/0) = NaN.
497 let k = effective_quantiles;
498 let mut references: Vec<F> = if k <= 1 {
499 vec![F::zero()]
500 } else {
501 let denom = F::from(k - 1).unwrap_or_else(F::one);
502 let step = F::one() / denom;
503 (0..k)
504 .map(|i| F::from(i).unwrap_or_else(F::zero) * step)
505 .collect()
506 };
507 if k >= 2 {
508 let last = references.len() - 1;
509 references[last] = F::one();
510 }
511
512 let mut quantiles = Vec::with_capacity(n_features);
513
514 for j in 0..n_features {
515 let mut col_vals: Vec<F> = x.column(j).iter().copied().collect();
516 // Remove NaN values
517 col_vals.retain(|v| !v.is_nan());
518 col_vals.sort_by(|a, b| a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal));
519
520 // Subsample if needed
521 if self.subsample > 0 && col_vals.len() > self.subsample {
522 let step = col_vals.len() as f64 / self.subsample as f64;
523 let mut sampled = Vec::with_capacity(self.subsample);
524 for i in 0..self.subsample {
525 let idx = (i as f64 * step) as usize;
526 sampled.push(col_vals[idx.min(col_vals.len() - 1)]);
527 }
528 col_vals = sampled;
529 }
530
531 // Compute quantile reference values
532 let n = col_vals.len();
533 let mut feature_quantiles = Vec::with_capacity(effective_quantiles);
534 for &ref_level in &references {
535 // sklearn: np.nanpercentile(X, references_*100); numpy 'linear' virtual
536 // index is q/100*(n-1). Replicate the *100 then /100 round-trip
537 // (_data.py:2694,:2702).
538 let hundred = F::from(100.0).unwrap_or_else(F::one);
539 let q = ref_level * hundred;
540 let pos = (q / hundred) * F::from(n.saturating_sub(1)).unwrap_or_else(F::one);
541 let lo = pos.floor().to_usize().unwrap_or(0).min(n.saturating_sub(1));
542 let hi = pos.ceil().to_usize().unwrap_or(0).min(n.saturating_sub(1));
543 let frac = pos - F::from(lo).unwrap_or_else(F::zero);
544 let val = if lo == hi {
545 col_vals[lo]
546 } else {
547 col_vals[lo] * (F::one() - frac) + col_vals[hi] * frac
548 };
549 feature_quantiles.push(val);
550 }
551
552 quantiles.push(feature_quantiles);
553 }
554
555 Ok(FittedQuantileTransformer {
556 quantiles,
557 references,
558 output_distribution: self.output_distribution,
559 })
560 }
561}
562
563impl<F: Float + Send + Sync + 'static> Transform<Array2<F>> for FittedQuantileTransformer<F> {
564 type Output = Array2<F>;
565 type Error = FerroError;
566
567 /// Transform data by mapping each value through the empirical CDF.
568 ///
569 /// # Errors
570 ///
571 /// Returns [`FerroError::ShapeMismatch`] if the number of columns differs
572 /// from the number of features seen during fitting.
573 fn transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
574 let n_features = self.quantiles.len();
575 if x.ncols() != n_features {
576 return Err(FerroError::ShapeMismatch {
577 expected: vec![x.nrows(), n_features],
578 actual: vec![x.nrows(), x.ncols()],
579 context: "FittedQuantileTransformer::transform".into(),
580 });
581 }
582 // sklearn `transform` -> `_check_inputs(force_all_finite="allow-nan")`
583 // (`_data.py:2879`, called `:2943`): NaN passes through, but +/-inf
584 // raises ValueError (#2217, the forward mirror of the inverse #2212 guard
585 // above; MinMaxScaler #2200 precedent).
586 if x.iter().any(|v| v.is_infinite()) {
587 return Err(FerroError::InvalidParameter {
588 name: "X".into(),
589 reason: "Input X contains infinity or a value too large for dtype.".into(),
590 });
591 }
592
593 let mut out = x.to_owned();
594
595 // sklearn forward `_transform_col` bound masks (`_data.py:2829-2831`,
596 // `:2850-2851`): for `uniform` output the masks use EXACT equality
597 // against the landmark bounds — `lower_bounds_idx = X_col == quantiles[0]`,
598 // `upper_bounds_idx = X_col == quantiles[-1]` — and after the averaged
599 // interp the upper override is applied FIRST then the lower (`:2850-2851`),
600 // so a value matching BOTH bounds (e.g. a constant feature where every
601 // landmark == c) resolves to `lower_bound_y = references_[0] = 0`. The
602 // averaged-interp midpoint (#2318/#2319) is thus overridden at the
603 // exact-bound values. `lower_bound_y`/`upper_bound_y` are 0/1 (the
604 // [0,1] rank space) — the Normal `norm.ppf` + clip is applied afterwards.
605 let zero = F::zero();
606 let one = F::one();
607
608 for j in 0..n_features {
609 let feature_quantiles = &self.quantiles[j];
610 let (lower_bound_x, upper_bound_x) =
611 match (feature_quantiles.first(), feature_quantiles.last()) {
612 (Some(&first), Some(&last)) => (first, last),
613 _ => (zero, zero),
614 };
615 for i in 0..out.nrows() {
616 let val = out[[i, j]];
617 if val.is_nan() {
618 continue;
619 }
620 let cdf_val = if feature_quantiles.len() == 1 {
621 // Single landmark (n_quantiles_ == 1, e.g. a single-sample
622 // fit). sklearn's forward bound masks are OUTPUT-DISTRIBUTION
623 // DEPENDENT (`_data.py:2823-2828`, #2219):
624 // * Uniform uses EXACT-equality masks (X == bound), so a
625 // held-out value keeps the `np_interp` single-point clamp
626 // (references_[0] = 0) and the fitted value (== bound)
627 // also resolves to lower_bound_y = 0 → always 0.
628 // * Normal uses ±BOUNDS_THRESHOLD masks, so a value ABOVE
629 // the lone landmark → upper_bound_y = 1 (then
630 // probit(1) = +clip), at/below → lower_bound_y = 0
631 // (probit(0) = -clip).
632 if self.output_distribution == OutputDistribution::Normal
633 && val > feature_quantiles[0]
634 {
635 F::one()
636 } else {
637 F::zero()
638 }
639 } else {
640 interpolate_cdf(val, feature_quantiles, &self.references)
641 };
642
643 // Forward bound masks (`_data.py:2829-2831`,`:2850-2851`). UPPER
644 // applied first, then LOWER — so a value equal to BOTH bounds
645 // (constant feature) gets `lower_bound_y = 0`. Exact equality, as
646 // sklearn does on the f64 landmark values. Only override
647 // multi-landmark columns; the single-landmark path (above) already
648 // models its own output-distribution-dependent masks (#2219).
649 let rank = if feature_quantiles.len() == 1 {
650 cdf_val
651 } else {
652 let mut r = cdf_val;
653 if val == upper_bound_x {
654 r = one; // upper_bound_y = references_[-1] = 1
655 }
656 if val == lower_bound_x {
657 r = zero; // lower_bound_y = references_[0] = 0
658 }
659 r
660 };
661
662 out[[i, j]] = match self.output_distribution {
663 OutputDistribution::Uniform => rank,
664 OutputDistribution::Normal => probit(rank),
665 };
666 }
667 }
668
669 Ok(out)
670 }
671}
672
673/// Implement `Transform` on the unfitted transformer to satisfy the
674/// `FitTransform: Transform` supertrait bound.
675impl<F: Float + Send + Sync + 'static> Transform<Array2<F>> for QuantileTransformer<F> {
676 type Output = Array2<F>;
677 type Error = FerroError;
678
679 /// Always returns an error — the transformer must be fitted first.
680 fn transform(&self, _x: &Array2<F>) -> Result<Array2<F>, FerroError> {
681 Err(FerroError::InvalidParameter {
682 name: "QuantileTransformer".into(),
683 reason: "transformer must be fitted before calling transform; use fit() first".into(),
684 })
685 }
686}
687
688impl<F: Float + Send + Sync + 'static> FitTransform<Array2<F>> for QuantileTransformer<F> {
689 type FitError = FerroError;
690
691 /// Fit and transform in one step.
692 ///
693 /// # Errors
694 ///
695 /// Returns an error if fitting fails.
696 fn fit_transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
697 let fitted = self.fit(x, &())?;
698 fitted.transform(x)
699 }
700}
701
702// ---------------------------------------------------------------------------
703// Standalone `quantile_transform` free function (sklearn `quantile_transform`,
704// `_data.py:2978`)
705// ---------------------------------------------------------------------------
706
707/// Transform features using quantiles information — the standalone,
708/// estimator-less API mirroring scikit-learn's `quantile_transform` free
709/// function (`sklearn/preprocessing/_data.py:2978`,`:3107-3119`).
710///
711/// This is a thin wrapper that **reuses** the fitted estimator: it constructs a
712/// [`QuantileTransformer`] with the given parameters, then runs the SHIPPED
713/// [`Fit::fit`] → [`Transform::transform`] (i.e. `fit_transform`) — it does NOT
714/// reimplement any quantile math. Each call fits on the supplied data and
715/// immediately transforms it (sklearn `n = QuantileTransformer(...); X =
716/// n.fit_transform(X)`, `:3107-3116`).
717///
718/// # Parameters
719///
720/// - `x` — the data to transform, shape `(n_samples, n_features)`.
721/// - `axis` — axis along which to transform (sklearn `axis=0` default,
722/// `:3013-3015`). `axis == 0` transforms each **feature** (column)
723/// independently — `fit_transform(X)` (`:3115-3116`). `axis == 1` transforms
724/// each **sample** (row) independently — `fit_transform(X.T).T`
725/// (`:3117-3118`): the matrix is transposed (so rows become columns / the
726/// per-feature path operates on each original row), fit-transformed, then
727/// transposed back to the original orientation.
728/// - `n_quantiles` — number of quantile reference landmarks (sklearn default
729/// 1000, clamped to `n_samples`, `:3017-3023`).
730/// - `output_distribution` — [`OutputDistribution::Uniform`] (sklearn default,
731/// `:3025-3027`) or [`OutputDistribution::Normal`].
732/// - `subsample` — maximum samples used to estimate the quantiles (`0` = use
733/// all). See the scope note below.
734///
735/// # Scope
736///
737/// Unlike sklearn's `quantile_transform`, the `ignore_implicit_zeros`,
738/// `random_state`, and `copy` keyword arguments are **not** surfaced: this path
739/// is dense, deterministic, and always returns a freshly allocated array
740/// (`copy` is implicitly `True`). The `subsample` parameter is threaded to the
741/// estimator, but the actual **random** subsampling is REQ-6 (#1324,
742/// NOT-STARTED): for `n_samples > subsample` the deterministic strided pick is
743/// used instead of sklearn's RNG draw, so this function matches sklearn's output
744/// only when `n_samples <= subsample` (use `subsample=None` or a large value on
745/// the sklearn side to compare). `NaN` is disregarded when fitting and preserved
746/// in the output (sklearn Notes, `:3078-3079`).
747///
748/// # Errors
749///
750/// Returns [`FerroError::InvalidParameter`] if `axis` is neither `0` nor `1`
751/// (mirroring sklearn's `ValueError("axis should be either equal to 0 or 1.
752/// Got axis={axis}")`). Also propagates any error from the underlying
753/// [`Fit::fit`] / [`Transform::transform`] (e.g.
754/// [`FerroError::InsufficientSamples`] for fewer than 2 rows along the chosen
755/// axis, [`FerroError::InvalidParameter`] for `n_quantiles < 2` or non-finite
756/// `±inf` input).
757///
758/// # Examples
759///
760/// ```
761/// use ferrolearn_preprocess::quantile_transformer::{quantile_transform, OutputDistribution};
762/// use ndarray::array;
763///
764/// let x = array![[1.0_f64], [2.0], [3.0], [4.0], [5.0]];
765/// let out = quantile_transform(&x, 0, 5, OutputDistribution::Uniform, 0).unwrap();
766/// // First maps to 0.0, last to 1.0.
767/// assert!((out[[0, 0]] - 0.0_f64).abs() < 1e-9);
768/// assert!((out[[4, 0]] - 1.0_f64).abs() < 1e-9);
769/// ```
770#[must_use = "quantile_transform returns a new array; the input is not modified"]
771pub fn quantile_transform<F: Float + Send + Sync + 'static>(
772 x: &Array2<F>,
773 axis: usize,
774 n_quantiles: usize,
775 output_distribution: OutputDistribution,
776 subsample: usize,
777) -> Result<Array2<F>, FerroError> {
778 let qt = QuantileTransformer::<F>::new(n_quantiles, output_distribution, subsample);
779 match axis {
780 // sklearn `if axis == 0: X = n.fit_transform(X)` (`_data.py:3115-3116`):
781 // each feature (column) transformed independently.
782 0 => {
783 let fitted = qt.fit(x, &())?;
784 fitted.transform(x)
785 }
786 // sklearn `else: X = n.fit_transform(X.T).T` (`_data.py:3117-3118`): each
787 // sample (row) transformed independently. Transpose to a CONTIGUOUS owned
788 // array so the per-feature `fit`/`transform` operates on each original
789 // row, then transpose the result back to the original orientation.
790 1 => {
791 let xt = x.t().to_owned();
792 let fitted = qt.fit(&xt, &())?;
793 let out_t = fitted.transform(&xt)?;
794 Ok(out_t.t().to_owned())
795 }
796 // sklearn `else: raise ValueError("axis should be either equal to 0 or 1.
797 // Got axis={axis}")` (`_data.py:3117` guard via `@validate_params`).
798 other => Err(FerroError::InvalidParameter {
799 name: "axis".into(),
800 reason: format!("axis should be either equal to 0 or 1. Got axis={other}"),
801 }),
802 }
803}
804
805// ---------------------------------------------------------------------------
806// Tests
807// ---------------------------------------------------------------------------
808
809#[cfg(test)]
810mod tests {
811 use super::*;
812 use approx::assert_abs_diff_eq;
813 use ndarray::array;
814
815 #[test]
816 fn test_quantile_transformer_uniform() {
817 let qt = QuantileTransformer::<f64>::new(100, OutputDistribution::Uniform, 0);
818 let x = array![[1.0], [2.0], [3.0], [4.0], [5.0]];
819 let fitted = qt.fit(&x, &()).unwrap();
820 let out = fitted.transform(&x).unwrap();
821 // All values should be in [0, 1]
822 for v in &out {
823 assert!(*v >= 0.0 && *v <= 1.0, "Value {v} not in [0,1]");
824 }
825 // First should be 0, last should be 1
826 assert_abs_diff_eq!(out[[0, 0]], 0.0, epsilon = 1e-6);
827 assert_abs_diff_eq!(out[[4, 0]], 1.0, epsilon = 1e-6);
828 }
829
830 #[test]
831 fn test_quantile_transformer_normal() {
832 let qt = QuantileTransformer::<f64>::new(100, OutputDistribution::Normal, 0);
833 let x = array![[1.0], [2.0], [3.0], [4.0], [5.0]];
834 let fitted = qt.fit(&x, &()).unwrap();
835 let out = fitted.transform(&x).unwrap();
836 // Middle value should be close to 0 (median → 0 in normal)
837 assert!(out[[2, 0]].abs() < 0.5, "Median should map near 0");
838 // First should be negative, last positive
839 assert!(out[[0, 0]] < out[[4, 0]]);
840 }
841
842 #[test]
843 fn test_quantile_transformer_monotonic() {
844 let qt = QuantileTransformer::<f64>::new(100, OutputDistribution::Uniform, 0);
845 let x = array![[5.0], [3.0], [1.0], [4.0], [2.0]];
846 let fitted = qt.fit(&x, &()).unwrap();
847 let out = fitted.transform(&x).unwrap();
848 // Transform should preserve ordering: rank(5) > rank(3) > rank(1)
849 assert!(out[[0, 0]] > out[[1, 0]]); // 5 > 3
850 assert!(out[[1, 0]] > out[[2, 0]]); // 3 > 1
851 }
852
853 #[test]
854 fn test_quantile_transformer_multiple_features() {
855 let qt = QuantileTransformer::<f64>::new(50, OutputDistribution::Uniform, 0);
856 let x = array![[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]];
857 let fitted = qt.fit(&x, &()).unwrap();
858 let out = fitted.transform(&x).unwrap();
859 assert_eq!(out.ncols(), 2);
860 // Each feature independently transformed
861 for j in 0..2 {
862 assert!(out[[0, j]] <= out[[2, j]]);
863 }
864 }
865
866 #[test]
867 fn test_quantile_transformer_fit_transform() {
868 let qt = QuantileTransformer::<f64>::new(100, OutputDistribution::Uniform, 0);
869 let x = array![[1.0], [2.0], [3.0], [4.0], [5.0]];
870 let out = qt.fit_transform(&x).unwrap();
871 assert_abs_diff_eq!(out[[0, 0]], 0.0, epsilon = 1e-6);
872 assert_abs_diff_eq!(out[[4, 0]], 1.0, epsilon = 1e-6);
873 }
874
875 #[test]
876 fn test_quantile_transformer_insufficient_samples_error() {
877 // sklearn accepts n_samples == 1 (clamps n_quantiles_ to 1, `_data.py:2790`);
878 // only the empty (0-row) case is rejected (#2218).
879 let qt = QuantileTransformer::<f64>::new(100, OutputDistribution::Uniform, 0);
880 let empty = ndarray::Array2::<f64>::zeros((0, 1));
881 assert!(qt.fit(&empty, &()).is_err());
882 }
883
884 #[test]
885 fn test_quantile_transformer_single_sample_all_zeros_no_nan() {
886 // sklearn: QuantileTransformer(5,'uniform').fit_transform([[7.]]) -> [[0.0]]
887 // (n_quantiles_ clamped to 1, references_=[0.0]). Must not produce NaN.
888 let qt = QuantileTransformer::<f64>::new(5, OutputDistribution::Uniform, 0);
889 let x = array![[7.0]];
890 let res = qt.fit_transform(&x);
891 assert!(res.is_ok(), "single-sample fit_transform must succeed");
892 if let Ok(out) = res {
893 assert_eq!(out.dim(), (1, 1));
894 let v = out[[0, 0]];
895 assert!(v.abs() <= 1e-12 && !v.is_nan());
896 }
897 }
898
899 #[test]
900 fn test_quantile_transformer_too_few_quantiles_error() {
901 let qt = QuantileTransformer::<f64>::new(1, OutputDistribution::Uniform, 0);
902 let x = array![[1.0], [2.0], [3.0]];
903 assert!(qt.fit(&x, &()).is_err());
904 }
905
906 #[test]
907 fn test_quantile_transformer_shape_mismatch() {
908 let qt = QuantileTransformer::<f64>::new(100, OutputDistribution::Uniform, 0);
909 let x_train = array![[1.0, 2.0], [3.0, 4.0]];
910 let fitted = qt.fit(&x_train, &()).unwrap();
911 let x_bad = array![[1.0, 2.0, 3.0]];
912 assert!(fitted.transform(&x_bad).is_err());
913 }
914
915 #[test]
916 fn test_quantile_transformer_unfitted_error() {
917 let qt = QuantileTransformer::<f64>::new(100, OutputDistribution::Uniform, 0);
918 let x = array![[1.0]];
919 assert!(qt.transform(&x).is_err());
920 }
921
922 #[test]
923 fn test_quantile_transformer_default() {
924 let qt = QuantileTransformer::<f64>::default();
925 assert_eq!(qt.n_quantiles(), 1000);
926 assert_eq!(qt.output_distribution(), OutputDistribution::Uniform);
927 assert_eq!(qt.subsample(), 100_000);
928 }
929
930 #[test]
931 fn test_quantile_transformer_f32() {
932 let qt = QuantileTransformer::<f32>::new(50, OutputDistribution::Uniform, 0);
933 let x: Array2<f32> = array![[1.0f32], [2.0], [3.0], [4.0], [5.0]];
934 let fitted = qt.fit(&x, &()).unwrap();
935 let out = fitted.transform(&x).unwrap();
936 assert!(out[[0, 0]] >= 0.0f32);
937 assert!(out[[4, 0]] <= 1.0f32);
938 }
939}