//! Quantile transformer: map features to a uniform or normal distribution.
//!
//! [`QuantileTransformer`] transforms features by mapping each value through
//! its empirical cumulative distribution function (CDF), producing values
//! uniformly distributed in `[0, 1]`. Optionally, the result can be mapped
//! to a standard normal distribution using the inverse normal CDF (probit).
//!
//! This is useful for making features more Gaussian-like, which can improve
//! the performance of many machine learning algorithms.
//!
//! Translation target: scikit-learn 1.5.2 `class QuantileTransformer`
//! (`sklearn/preprocessing/_data.py:2540`) + `quantile_transform` (`:2978`).
//! Design: `.design/preprocess/quantile_transformer.md`. Tracking: #1319.
//!
//! `## REQ status`
//!
//! | REQ | Status | Anchor |
//! |---|---|---|
//! | REQ-1 forward-CDF value surface (uniform + normal, distinct + tied) | SHIPPED | `fit` references/landmarks (#1322) + `transform`; sklearn `_data.py:2694`,`:2702`,`:2795` |
//! | REQ-2 forward+reversed AVERAGED interpolation | SHIPPED (#1321) | `interpolate_cdf`/`np_interp`; sklearn `_data.py:2843-2846` |
//! | REQ-3 Normal output accuracy (Acklam ppf + clip) | SHIPPED (#1320) | `probit`; sklearn `_data.py:2855-2862` |
//! | REQ-5 error/parameter contracts (scoped) | SHIPPED | `fit`/`transform` guards; sklearn `_data.py:2654` |
//! | REQ-4 `np.maximum.accumulate` monotonic repair (unobservable) | NOT-STARTED (#1323) | sklearn `_data.py:2707` |
//! | REQ-6 random subsample + random_state + n_quantiles>subsample error | NOT-STARTED (#1324) | sklearn `_data.py:2696-2700`,`:2774-2779` |
//! | REQ-7 `inverse_transform` (reverse interp + `norm.cdf` + bounds + NaN) | SHIPPED (#1325) | `FittedQuantileTransformer::inverse_transform` + `norm_cdf` (ndtr via `erf`/`erfc`); sklearn `_data.py:2947`,`:2813-2851`,`:2821`. Uniform ~exact, Normal ~1e-7 (follows the forward REQ-3 ~1e-9 ndtr/probit contract). Consumer: crate re-export `pub use quantile_transformer::FittedQuantileTransformer` (`lib.rs`, boundary public API). Live-oracle: `tests/divergence_quantile_transformer.rs` (uniform/normal round-trip + held-out + bounds + NaN + `norm_cdf` sanity) |
//! | REQ-8 `ignore_implicit_zeros` + sparse CSC | NOT-STARTED (#1326) | sklearn `_data.py:2709-2752` |
//! | REQ-9 `quantile_transform` free function | SHIPPED (#1327) | `quantile_transform` free fn delegates to the SHIPPED `QuantileTransformer::new(n_quantiles, output_distribution, subsample)` + `Fit`→`FittedQuantileTransformer::transform` (NO quantile math duplicated); mirrors sklearn `quantile_transform(X, *, axis=0, n_quantiles=1000, output_distribution="uniform", ignore_implicit_zeros=False, subsample=1e5, random_state=None, copy=True)` (`_data.py:2978`,`:3107-3119`): `axis==0` → `fit_transform(X)` (per-feature, `:3115-3116`); `axis==1` → `fit_transform(X.T).T` (per-sample, `:3117-3118`, contiguous owned transpose); `axis ∉ {0,1}` → `InvalidParameter` "axis should be either equal to 0 or 1. Got axis=<n>" (matches the `else` ValueError `:3117`). SCOPE: `ignore_implicit_zeros`/`random_state`/`copy` are not surfaced (dense, deterministic, always-copy); `subsample` is threaded but the actual random subsampling is REQ-6 NOT-STARTED (#1324) so the path is deterministic and matches sklearn only for n_samples ≤ subsample (`subsample=None`/large in the oracle). Consumer: crate re-export `pub use quantile_transformer::quantile_transform` (`lib.rs`, boundary public API, R-DEFER-1). Live-oracle: `tests/divergence_quantile_transformer.rs` (axis=0 uniform/normal distinct+tied, axis=1 non-square, axis-error, == estimator fit_transform, multi-feature column-by-column, f32). |
//! | REQ-10 `copy` + OneToOneFeatureMixin fitted-attr surface | NOT-STARTED (#1328) | sklearn `_data.py:2540`,`:2790-2795` |
//! | REQ-11 PyO3 binding | SHIPPED (#1329) | `_RsQuantileTransformer` (`ferrolearn-python/src/extras.rs`) over `QuantileTransformer`/`FittedQuantileTransformer` exposes `fit`/`transform`/`inverse_transform` + the `#[getter]`s `quantiles_` (`&[Vec<F>]` → `(n_quantiles_, n_features)` numpy, transposed), `references_` (this module's `FittedQuantileTransformer::references`), `n_quantiles_` (effective, `FittedQuantileTransformer::n_quantiles` == `references().len()`, the `min(n_quantiles, n_samples)` clamp `_data.py:2790`), `n_features_in_`; `output_distribution` "uniform"/"normal" string→`OutputDistribution` (bad→`PyValueError`, `_data.py:2655` StrOptions). Python `class QuantileTransformer(_TransformerWrapper)` (`_extras.py`) mirrors sklearn's keyword-only ctor `(*, n_quantiles=1000, output_distribution="uniform", ignore_implicit_zeros=False, subsample=10_000, random_state=None, copy=True)` (`_data.py:2662`); pre-fit transform/attr→`NotFittedError`; n_quantiles>subsample→`ValueError` (`_data.py:2774`); registered `lib.rs`/`__init__.py` (non-test consumer, R-DEFER-1). SCOPE: `subsample`/`random_state` accepted for get_params compat but the deterministic path uses ALL samples (REQ-6 actual subsample NOT-STARTED #1324, RNG-gated #2118) — MATCHES sklearn only for n_samples ≤ subsample; `ignore_implicit_zeros` accepted-and-documented (REQ-8 dense-only #1326); `copy` no-op. float32 input → ~1e-5 tolerant (f64 ABI cast-back, #2215-class); Normal output → ~1e-7 (REQ-3 Acklam ppf/ndtr). Live-oracle: `tests/divergence_quantile_transformer_py.py`. |
//! | REQ-12 ferray substrate | NOT-STARTED (#1330) | R-SUBSTRATE |
use ferrolearn_core::error::FerroError;
use ferrolearn_core::traits::{Fit, FitTransform, Transform};
use ferrolearn_numerical::special::{erf, erfc};
use ndarray::Array2;
use num_traits::Float;
// ---------------------------------------------------------------------------
// OutputDistribution
// ---------------------------------------------------------------------------
/// Target output distribution for the quantile transformer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OutputDistribution {
/// Map to the uniform distribution on `[0, 1]`.
Uniform,
/// Map to the standard normal distribution via the probit function.
Normal,
}
// ---------------------------------------------------------------------------
// QuantileTransformer (unfitted)
// ---------------------------------------------------------------------------
/// An unfitted quantile transformer.
///
/// Calling [`Fit::fit`] computes the quantiles for each feature and returns a
/// [`FittedQuantileTransformer`].
///
/// # Parameters
///
/// - `n_quantiles` — number of quantile reference points (default 1000).
/// - `output_distribution` — target distribution (default `Uniform`).
/// - `subsample` — maximum number of samples used to compute quantiles
/// (default 100_000; set to 0 to use all samples).
///
/// # Examples
///
/// ```
/// use ferrolearn_preprocess::quantile_transformer::{QuantileTransformer, OutputDistribution};
/// use ferrolearn_core::traits::{Fit, Transform};
/// use ndarray::array;
///
/// let qt = QuantileTransformer::<f64>::new(100, OutputDistribution::Uniform, 0);
/// let x = array![[1.0], [2.0], [3.0], [4.0], [5.0]];
/// let fitted = qt.fit(&x, &()).unwrap();
/// let out = fitted.transform(&x).unwrap();
/// // Values should be in [0, 1]
/// for v in out.iter() {
/// assert!(*v >= 0.0 && *v <= 1.0);
/// }
/// ```
#[must_use]
#[derive(Debug, Clone)]
pub struct QuantileTransformer<F> {
/// Number of quantile reference points.
n_quantiles: usize,
/// Target output distribution.
output_distribution: OutputDistribution,
/// Maximum number of samples for quantile computation (0 = all).
subsample: usize,
_marker: std::marker::PhantomData<F>,
}
impl<F: Float + Send + Sync + 'static> QuantileTransformer<F> {
/// Create a new `QuantileTransformer`.
pub fn new(
n_quantiles: usize,
output_distribution: OutputDistribution,
subsample: usize,
) -> Self {
Self {
n_quantiles,
output_distribution,
subsample,
_marker: std::marker::PhantomData,
}
}
/// Return the number of quantiles.
#[must_use]
pub fn n_quantiles(&self) -> usize {
self.n_quantiles
}
/// Return the target output distribution.
#[must_use]
pub fn output_distribution(&self) -> OutputDistribution {
self.output_distribution
}
/// Return the subsample size.
#[must_use]
pub fn subsample(&self) -> usize {
self.subsample
}
}
impl<F: Float + Send + Sync + 'static> Default for QuantileTransformer<F> {
fn default() -> Self {
Self::new(1000, OutputDistribution::Uniform, 100_000)
}
}
// ---------------------------------------------------------------------------
// FittedQuantileTransformer
// ---------------------------------------------------------------------------
/// A fitted quantile transformer holding per-feature quantile references.
///
/// Created by calling [`Fit::fit`] on a [`QuantileTransformer`].
#[derive(Debug, Clone)]
pub struct FittedQuantileTransformer<F> {
/// Quantile reference values per feature: `quantiles[j]` is a sorted
/// vector of reference values for feature `j`.
quantiles: Vec<Vec<F>>,
/// The reference quantile levels (evenly spaced in [0, 1]).
references: Vec<F>,
/// Target output distribution.
output_distribution: OutputDistribution,
}
impl<F: Float + Send + Sync + 'static> FittedQuantileTransformer<F> {
/// Return the computed quantile reference values per feature.
#[must_use]
pub fn quantiles(&self) -> &[Vec<F>] {
&self.quantiles
}
/// Return the reference quantile levels (the evenly-spaced `[0, 1]` grid).
///
/// Mirrors scikit-learn `references_` (`_data.py:2795` `references_ =
/// np.linspace(0, 1, n_quantiles_, endpoint=True)`): a length-`n_quantiles_`
/// vector, where `n_quantiles_` is the effective (clamped) quantile count.
#[must_use]
pub fn references(&self) -> &[F] {
&self.references
}
/// Return the effective number of quantile reference points.
///
/// Mirrors scikit-learn `n_quantiles_` (`_data.py:2790` `n_quantiles_ =
/// max(1, min(n_quantiles, n_samples))`): the requested `n_quantiles`
/// clamped down to the number of samples seen during `fit`. Equals
/// `references().len()`.
#[must_use]
pub fn n_quantiles(&self) -> usize {
self.references.len()
}
/// Return the number of features.
#[must_use]
pub fn n_features(&self) -> usize {
self.quantiles.len()
}
/// Back-project transformed data to the original feature space.
///
/// The inverse of [`Transform::transform`]: each transformed column value is
/// mapped back to an original feature value. Mirrors scikit-learn
/// `QuantileTransformer.inverse_transform` (`_data.py:2947`) which runs
/// `_transform_col(..., inverse=True)` (`_data.py:2813-2866`):
///
/// - **Normal** output: the value is first mapped to its uniform rank in
/// `[0, 1]` via the standard normal CDF `stats.norm.cdf` (`:2821`,
/// [`norm_cdf`]); the bound masks use `BOUNDS_THRESHOLD = 1e-7` (`:47`)
/// against `0` / `1` (`:2827-2828`).
/// - **Uniform** output: the value IS the rank; the bound masks use exact
/// equality `== 0` / `== 1` (`:2830-2831`).
///
/// The finite ranks are interpolated with a **plain** `np.interp(rank,
/// references_, quantiles)` — the REVERSE of the forward interp, and NOT the
/// forward/reversed-averaged variant (the averaging is forward-only,
/// `:2848`). Ranks at/below the lower bound map to the column minimum
/// `quantiles[0]` and at/above the upper bound to the column maximum
/// `quantiles[-1]` (`:2850-2851`). `NaN` passes through unchanged
/// (`isfinite_mask`, `:2833`).
///
/// # Errors
///
/// Returns [`FerroError::ShapeMismatch`] if the number of columns differs
/// from the number of features seen during fitting.
pub fn inverse_transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
let n_features = self.quantiles.len();
if x.ncols() != n_features {
return Err(FerroError::ShapeMismatch {
expected: vec![x.nrows(), n_features],
actual: vec![x.nrows(), x.ncols()],
context: "FittedQuantileTransformer::inverse_transform".into(),
});
}
// sklearn `inverse_transform` -> `_check_inputs(force_all_finite=
// "allow-nan")` (`_data.py:2876`, called `:2965`): NaN passes through,
// but +/-inf raises ValueError (#2212, MinMaxScaler #2200 precedent).
if x.iter().any(|v| v.is_infinite()) {
return Err(FerroError::InvalidParameter {
name: "X".into(),
reason: "Input X contains infinity or a value too large for dtype.".into(),
});
}
// sklearn _transform_col inverse: lower_bound_x = 0, upper_bound_x = 1,
// lower_bound_y = quantiles[0], upper_bound_y = quantiles[-1]
// (`_data.py:2814-2817`). BOUNDS_THRESHOLD = 1e-7 (`:47`).
let bounds_threshold = F::from(1e-7).unwrap_or_else(F::zero);
let zero = F::zero();
let one = F::one();
let mut out = x.to_owned();
for j in 0..n_features {
let quantiles_col = &self.quantiles[j];
let (lower_y, upper_y) = match (quantiles_col.first(), quantiles_col.last()) {
(Some(&first), Some(&last)) => (first, last),
// Empty landmark column: nothing to map to; leave values as-is.
_ => continue,
};
for i in 0..out.nrows() {
let val = out[[i, j]];
if val.is_nan() {
// isfinite_mask excludes NaN; it stays NaN (`_data.py:2833`).
continue;
}
// For Normal output, map the normal-space value back to the
// uniform rank in [0, 1] (`_data.py:2821`). For Uniform the
// value already IS the rank.
let rank = match self.output_distribution {
OutputDistribution::Uniform => val,
OutputDistribution::Normal => norm_cdf(val),
};
// Bound masks (`_data.py:2826-2831`). Under errstate(invalid=
// "ignore") NaN comparisons are false — but `rank` is finite
// here (norm_cdf of a finite value is finite, val is finite).
let (lower_idx, upper_idx) = match self.output_distribution {
OutputDistribution::Normal => (
rank - bounds_threshold < zero,
rank + bounds_threshold > one,
),
OutputDistribution::Uniform => (rank == zero, rank == one),
};
// Plain reverse interp references_ -> quantiles (`_data.py:2848`).
let mut mapped = np_interp(rank, &self.references, quantiles_col);
// Bound overrides (`_data.py:2850-2851`): upper applied first,
// then lower (matching sklearn's order).
if upper_idx {
mapped = upper_y;
}
if lower_idx {
mapped = lower_y;
}
out[[i, j]] = mapped;
}
}
Ok(out)
}
}
// ---------------------------------------------------------------------------
// Helpers
// ---------------------------------------------------------------------------
/// Inverse of the standard normal CDF (probit / norm.ppf), accurate to ~1e-9
/// via Acklam's rational approximation. Mirrors scipy `stats.norm.ppf` as used
/// by sklearn QuantileTransformer (`_data.py:2856`), with the output clipped to
/// sklearn's ASYMMETRIC bounds (`_data.py:2860-2862`):
/// `clip_min = norm.ppf(BOUNDS_THRESHOLD - spacing(1)) = -5.199337582605575`,
/// `clip_max = norm.ppf(1 - (BOUNDS_THRESHOLD - spacing(1))) = 5.19933758270342`
/// (Δ ≈ 9.8e-11 — `norm.ppf` is not exactly symmetric about 0 at these args).
fn probit<F: Float>(p: F) -> F {
let clip_min = F::from(-5.199337582605575).unwrap_or_else(F::min_value);
let clip_max = F::from(5.19933758270342).unwrap_or_else(F::max_value);
let cf = |k: f64| F::from(k).unwrap_or_else(F::zero);
if p <= F::zero() {
return clip_min;
}
if p >= F::one() {
return clip_max;
}
let a = [
-3.969683028665376e+01,
2.209460984245205e+02,
-2.759285104469687e+02,
1.38357751867269e+02,
-3.066479806614716e+01,
2.506628277459239e+00,
];
let b = [
-5.447609879822406e+01,
1.615858368580409e+02,
-1.556989798598866e+02,
6.680131188771972e+01,
-1.328068155288572e+01,
];
let c = [
-7.784894002430293e-03,
-3.223964580411365e-01,
-2.400758277161838e+00,
-2.549732539343734e+00,
4.374664141464968e+00,
2.938163982698783e+00,
];
let d = [
7.784695709041462e-03,
3.224671290700398e-01,
2.445134137142996e+00,
3.754408661907416e+00,
];
let p_low = F::from(0.02425).unwrap_or_else(F::zero);
let p_high = F::one() - p_low;
let two = F::from(2.0).unwrap_or_else(F::one);
let half = F::from(0.5).unwrap_or_else(F::zero);
let x = if p < p_low {
let qv = (-two * p.ln()).sqrt();
(((((cf(c[0]) * qv + cf(c[1])) * qv + cf(c[2])) * qv + cf(c[3])) * qv + cf(c[4])) * qv
+ cf(c[5]))
/ ((((cf(d[0]) * qv + cf(d[1])) * qv + cf(d[2])) * qv + cf(d[3])) * qv + F::one())
} else if p <= p_high {
let qv = p - half;
let r = qv * qv;
(((((cf(a[0]) * r + cf(a[1])) * r + cf(a[2])) * r + cf(a[3])) * r + cf(a[4])) * r
+ cf(a[5]))
* qv
/ (((((cf(b[0]) * r + cf(b[1])) * r + cf(b[2])) * r + cf(b[3])) * r + cf(b[4])) * r
+ F::one())
} else {
let qv = (-two * (F::one() - p).ln()).sqrt();
-(((((cf(c[0]) * qv + cf(c[1])) * qv + cf(c[2])) * qv + cf(c[3])) * qv + cf(c[4])) * qv
+ cf(c[5]))
/ ((((cf(d[0]) * qv + cf(d[1])) * qv + cf(d[2])) * qv + cf(d[3])) * qv + F::one())
};
if x < clip_min {
clip_min
} else if x > clip_max {
clip_max
} else {
x
}
}
/// numpy-`np.interp`-faithful 1-D linear interpolation (`xp` ascending, may have
/// duplicate landmarks). Used twice (ascending + reversed) and averaged in the
/// CDF lookup to mirror sklearn `_transform_col` (`_data.py:2843-2846`).
fn np_interp<F: Float>(x: F, xp: &[F], fp: &[F]) -> F {
let n = xp.len();
if n == 0 {
return F::zero();
}
if x <= xp[0] {
return fp[0];
}
if x >= xp[n - 1] {
return fp[n - 1];
}
// first index with xp[idx] > x (numpy searchsorted side='right'); interval (idx-1, idx)
let mut lo = 0usize;
let mut hi = n;
while lo < hi {
let mid = lo + (hi - lo) / 2;
if xp[mid] <= x {
lo = mid + 1;
} else {
hi = mid;
}
}
let i = lo - 1;
let denom = xp[i + 1] - xp[i];
if denom == F::zero() {
fp[i]
} else {
fp[i] + (x - xp[i]) / denom * (fp[i + 1] - fp[i])
}
}
/// Standard normal CDF Φ(x) (scipy `ndtr` / `stats.norm.cdf`), computed in f64
/// for accuracy then cast back to `F`. Used by [`FittedQuantileTransformer::inverse_transform`]
/// to map a Normal-space value back to its uniform rank in `[0, 1]`, mirroring
/// sklearn `_transform_col` inverse branch (`_data.py:2821`: `stats.norm.cdf(X_col)`).
///
/// Built from [`ferrolearn_numerical::special::erf`]/[`erfc`] (libm Cephes,
/// machine precision) via the standard three-branch `ndtr` split on
/// `t = x / sqrt(2)`, which preserves tail accuracy by using `erfc` (not
/// `1 - erf`) in the tails:
/// - `t < -1` → `0.5 * erfc(-t)` (left tail; no catastrophic cancellation)
/// - `-1 ≤ t ≤ 1` → `0.5 * (1 + erf(t))` (central region)
/// - `t > 1` → `1 - 0.5 * erfc(t)` (right tail)
///
/// This is the analytic inverse of the forward [`probit`] (Acklam ppf, ~1e-9),
/// so a Normal round-trip agrees with `scipy.stats.norm.cdf` to ~1e-9. A `NaN`
/// input is passed through (returns `NaN`); ±∞ map to 1/0 since `erf(±∞)=±1`.
fn norm_cdf<F: Float>(x: F) -> F {
let xf = x.to_f64().unwrap_or(f64::NAN);
if xf.is_nan() {
return F::nan();
}
let t = xf / std::f64::consts::SQRT_2;
let cdf = if t < -1.0 {
0.5 * erfc(-t)
} else if t <= 1.0 {
0.5 * (1.0 + erf(t))
} else {
1.0 - 0.5 * erfc(t)
};
F::from(cdf).unwrap_or_else(F::zero)
}
/// Map `value` to its quantile level by averaging the ascending and reversed
/// linear interpolations, mirroring sklearn `_transform_col`
/// (`_data.py:2843-2846`) so plateaus map to the midpoint of their span.
fn interpolate_cdf<F: Float>(value: F, quantiles: &[F], references: &[F]) -> F {
if quantiles.is_empty() {
return F::from(0.5).unwrap_or_else(F::zero);
}
let forward = np_interp(value, quantiles, references);
let neg_q: Vec<F> = quantiles.iter().rev().map(|&qv| -qv).collect();
let neg_r: Vec<F> = references.iter().rev().map(|&rv| -rv).collect();
let backward = np_interp(-value, &neg_q, &neg_r);
F::from(0.5).unwrap_or_else(F::zero) * (forward - backward)
}
// ---------------------------------------------------------------------------
// Trait implementations
// ---------------------------------------------------------------------------
impl<F: Float + Send + Sync + 'static> Fit<Array2<F>, ()> for QuantileTransformer<F> {
type Fitted = FittedQuantileTransformer<F>;
type Error = FerroError;
/// Fit by computing per-feature quantile reference values.
///
/// # Errors
///
/// - [`FerroError::InsufficientSamples`] if the input has 0 rows (empty).
/// - [`FerroError::InvalidParameter`] if `n_quantiles` is less than 2.
fn fit(&self, x: &Array2<F>, _y: &()) -> Result<FittedQuantileTransformer<F>, FerroError> {
let n_samples = x.nrows();
// sklearn accepts n_samples >= 1: `n_quantiles_ = max(1, min(n_quantiles,
// n_samples))` (`_data.py:2790`). Reject only the empty (0-row) case.
if n_samples < 1 {
return Err(FerroError::InsufficientSamples {
required: 1,
actual: n_samples,
context: "QuantileTransformer::fit".into(),
});
}
if self.n_quantiles < 2 {
return Err(FerroError::InvalidParameter {
name: "n_quantiles".into(),
reason: "n_quantiles must be at least 2".into(),
});
}
let n_features = x.ncols();
let effective_quantiles = self.n_quantiles.min(n_samples);
// numpy np.linspace(0,1,K): step = 1/(K-1) computed once, y[i] = i*step,
// endpoint y[K-1] pinned to exactly 1.0 (_data.py:2795). For K == 1
// (n_samples == 1, sklearn `n_quantiles_ = max(1, ...)`) np.linspace
// returns the single point [0.0] — guard the 1/(K-1) divide that would
// otherwise yield 0 * (1/0) = NaN.
let k = effective_quantiles;
let mut references: Vec<F> = if k <= 1 {
vec![F::zero()]
} else {
let denom = F::from(k - 1).unwrap_or_else(F::one);
let step = F::one() / denom;
(0..k)
.map(|i| F::from(i).unwrap_or_else(F::zero) * step)
.collect()
};
if k >= 2 {
let last = references.len() - 1;
references[last] = F::one();
}
let mut quantiles = Vec::with_capacity(n_features);
for j in 0..n_features {
let mut col_vals: Vec<F> = x.column(j).iter().copied().collect();
// Remove NaN values
col_vals.retain(|v| !v.is_nan());
col_vals.sort_by(|a, b| a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal));
// Subsample if needed
if self.subsample > 0 && col_vals.len() > self.subsample {
let step = col_vals.len() as f64 / self.subsample as f64;
let mut sampled = Vec::with_capacity(self.subsample);
for i in 0..self.subsample {
let idx = (i as f64 * step) as usize;
sampled.push(col_vals[idx.min(col_vals.len() - 1)]);
}
col_vals = sampled;
}
// Compute quantile reference values
let n = col_vals.len();
let mut feature_quantiles = Vec::with_capacity(effective_quantiles);
for &ref_level in &references {
// sklearn: np.nanpercentile(X, references_*100); numpy 'linear' virtual
// index is q/100*(n-1). Replicate the *100 then /100 round-trip
// (_data.py:2694,:2702).
let hundred = F::from(100.0).unwrap_or_else(F::one);
let q = ref_level * hundred;
let pos = (q / hundred) * F::from(n.saturating_sub(1)).unwrap_or_else(F::one);
let lo = pos.floor().to_usize().unwrap_or(0).min(n.saturating_sub(1));
let hi = pos.ceil().to_usize().unwrap_or(0).min(n.saturating_sub(1));
let frac = pos - F::from(lo).unwrap_or_else(F::zero);
let val = if lo == hi {
col_vals[lo]
} else {
col_vals[lo] * (F::one() - frac) + col_vals[hi] * frac
};
feature_quantiles.push(val);
}
quantiles.push(feature_quantiles);
}
Ok(FittedQuantileTransformer {
quantiles,
references,
output_distribution: self.output_distribution,
})
}
}
impl<F: Float + Send + Sync + 'static> Transform<Array2<F>> for FittedQuantileTransformer<F> {
type Output = Array2<F>;
type Error = FerroError;
/// Transform data by mapping each value through the empirical CDF.
///
/// # Errors
///
/// Returns [`FerroError::ShapeMismatch`] if the number of columns differs
/// from the number of features seen during fitting.
fn transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
let n_features = self.quantiles.len();
if x.ncols() != n_features {
return Err(FerroError::ShapeMismatch {
expected: vec![x.nrows(), n_features],
actual: vec![x.nrows(), x.ncols()],
context: "FittedQuantileTransformer::transform".into(),
});
}
// sklearn `transform` -> `_check_inputs(force_all_finite="allow-nan")`
// (`_data.py:2879`, called `:2943`): NaN passes through, but +/-inf
// raises ValueError (#2217, the forward mirror of the inverse #2212 guard
// above; MinMaxScaler #2200 precedent).
if x.iter().any(|v| v.is_infinite()) {
return Err(FerroError::InvalidParameter {
name: "X".into(),
reason: "Input X contains infinity or a value too large for dtype.".into(),
});
}
let mut out = x.to_owned();
// sklearn forward `_transform_col` bound masks (`_data.py:2829-2831`,
// `:2850-2851`): for `uniform` output the masks use EXACT equality
// against the landmark bounds — `lower_bounds_idx = X_col == quantiles[0]`,
// `upper_bounds_idx = X_col == quantiles[-1]` — and after the averaged
// interp the upper override is applied FIRST then the lower (`:2850-2851`),
// so a value matching BOTH bounds (e.g. a constant feature where every
// landmark == c) resolves to `lower_bound_y = references_[0] = 0`. The
// averaged-interp midpoint (#2318/#2319) is thus overridden at the
// exact-bound values. `lower_bound_y`/`upper_bound_y` are 0/1 (the
// [0,1] rank space) — the Normal `norm.ppf` + clip is applied afterwards.
let zero = F::zero();
let one = F::one();
for j in 0..n_features {
let feature_quantiles = &self.quantiles[j];
let (lower_bound_x, upper_bound_x) =
match (feature_quantiles.first(), feature_quantiles.last()) {
(Some(&first), Some(&last)) => (first, last),
_ => (zero, zero),
};
for i in 0..out.nrows() {
let val = out[[i, j]];
if val.is_nan() {
continue;
}
let cdf_val = if feature_quantiles.len() == 1 {
// Single landmark (n_quantiles_ == 1, e.g. a single-sample
// fit). sklearn's forward bound masks are OUTPUT-DISTRIBUTION
// DEPENDENT (`_data.py:2823-2828`, #2219):
// * Uniform uses EXACT-equality masks (X == bound), so a
// held-out value keeps the `np_interp` single-point clamp
// (references_[0] = 0) and the fitted value (== bound)
// also resolves to lower_bound_y = 0 → always 0.
// * Normal uses ±BOUNDS_THRESHOLD masks, so a value ABOVE
// the lone landmark → upper_bound_y = 1 (then
// probit(1) = +clip), at/below → lower_bound_y = 0
// (probit(0) = -clip).
if self.output_distribution == OutputDistribution::Normal
&& val > feature_quantiles[0]
{
F::one()
} else {
F::zero()
}
} else {
interpolate_cdf(val, feature_quantiles, &self.references)
};
// Forward bound masks (`_data.py:2829-2831`,`:2850-2851`). UPPER
// applied first, then LOWER — so a value equal to BOTH bounds
// (constant feature) gets `lower_bound_y = 0`. Exact equality, as
// sklearn does on the f64 landmark values. Only override
// multi-landmark columns; the single-landmark path (above) already
// models its own output-distribution-dependent masks (#2219).
let rank = if feature_quantiles.len() == 1 {
cdf_val
} else {
let mut r = cdf_val;
if val == upper_bound_x {
r = one; // upper_bound_y = references_[-1] = 1
}
if val == lower_bound_x {
r = zero; // lower_bound_y = references_[0] = 0
}
r
};
out[[i, j]] = match self.output_distribution {
OutputDistribution::Uniform => rank,
OutputDistribution::Normal => probit(rank),
};
}
}
Ok(out)
}
}
/// Implement `Transform` on the unfitted transformer to satisfy the
/// `FitTransform: Transform` supertrait bound.
impl<F: Float + Send + Sync + 'static> Transform<Array2<F>> for QuantileTransformer<F> {
type Output = Array2<F>;
type Error = FerroError;
/// Always returns an error — the transformer must be fitted first.
fn transform(&self, _x: &Array2<F>) -> Result<Array2<F>, FerroError> {
Err(FerroError::InvalidParameter {
name: "QuantileTransformer".into(),
reason: "transformer must be fitted before calling transform; use fit() first".into(),
})
}
}
impl<F: Float + Send + Sync + 'static> FitTransform<Array2<F>> for QuantileTransformer<F> {
type FitError = FerroError;
/// Fit and transform in one step.
///
/// # Errors
///
/// Returns an error if fitting fails.
fn fit_transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
let fitted = self.fit(x, &())?;
fitted.transform(x)
}
}
// ---------------------------------------------------------------------------
// Standalone `quantile_transform` free function (sklearn `quantile_transform`,
// `_data.py:2978`)
// ---------------------------------------------------------------------------
/// Transform features using quantiles information — the standalone,
/// estimator-less API mirroring scikit-learn's `quantile_transform` free
/// function (`sklearn/preprocessing/_data.py:2978`,`:3107-3119`).
///
/// This is a thin wrapper that **reuses** the fitted estimator: it constructs a
/// [`QuantileTransformer`] with the given parameters, then runs the SHIPPED
/// [`Fit::fit`] → [`Transform::transform`] (i.e. `fit_transform`) — it does NOT
/// reimplement any quantile math. Each call fits on the supplied data and
/// immediately transforms it (sklearn `n = QuantileTransformer(...); X =
/// n.fit_transform(X)`, `:3107-3116`).
///
/// # Parameters
///
/// - `x` — the data to transform, shape `(n_samples, n_features)`.
/// - `axis` — axis along which to transform (sklearn `axis=0` default,
/// `:3013-3015`). `axis == 0` transforms each **feature** (column)
/// independently — `fit_transform(X)` (`:3115-3116`). `axis == 1` transforms
/// each **sample** (row) independently — `fit_transform(X.T).T`
/// (`:3117-3118`): the matrix is transposed (so rows become columns / the
/// per-feature path operates on each original row), fit-transformed, then
/// transposed back to the original orientation.
/// - `n_quantiles` — number of quantile reference landmarks (sklearn default
/// 1000, clamped to `n_samples`, `:3017-3023`).
/// - `output_distribution` — [`OutputDistribution::Uniform`] (sklearn default,
/// `:3025-3027`) or [`OutputDistribution::Normal`].
/// - `subsample` — maximum samples used to estimate the quantiles (`0` = use
/// all). See the scope note below.
///
/// # Scope
///
/// Unlike sklearn's `quantile_transform`, the `ignore_implicit_zeros`,
/// `random_state`, and `copy` keyword arguments are **not** surfaced: this path
/// is dense, deterministic, and always returns a freshly allocated array
/// (`copy` is implicitly `True`). The `subsample` parameter is threaded to the
/// estimator, but the actual **random** subsampling is REQ-6 (#1324,
/// NOT-STARTED): for `n_samples > subsample` the deterministic strided pick is
/// used instead of sklearn's RNG draw, so this function matches sklearn's output
/// only when `n_samples <= subsample` (use `subsample=None` or a large value on
/// the sklearn side to compare). `NaN` is disregarded when fitting and preserved
/// in the output (sklearn Notes, `:3078-3079`).
///
/// # Errors
///
/// Returns [`FerroError::InvalidParameter`] if `axis` is neither `0` nor `1`
/// (mirroring sklearn's `ValueError("axis should be either equal to 0 or 1.
/// Got axis={axis}")`). Also propagates any error from the underlying
/// [`Fit::fit`] / [`Transform::transform`] (e.g.
/// [`FerroError::InsufficientSamples`] for fewer than 2 rows along the chosen
/// axis, [`FerroError::InvalidParameter`] for `n_quantiles < 2` or non-finite
/// `±inf` input).
///
/// # Examples
///
/// ```
/// use ferrolearn_preprocess::quantile_transformer::{quantile_transform, OutputDistribution};
/// use ndarray::array;
///
/// let x = array![[1.0_f64], [2.0], [3.0], [4.0], [5.0]];
/// let out = quantile_transform(&x, 0, 5, OutputDistribution::Uniform, 0).unwrap();
/// // First maps to 0.0, last to 1.0.
/// assert!((out[[0, 0]] - 0.0_f64).abs() < 1e-9);
/// assert!((out[[4, 0]] - 1.0_f64).abs() < 1e-9);
/// ```
#[must_use = "quantile_transform returns a new array; the input is not modified"]
pub fn quantile_transform<F: Float + Send + Sync + 'static>(
x: &Array2<F>,
axis: usize,
n_quantiles: usize,
output_distribution: OutputDistribution,
subsample: usize,
) -> Result<Array2<F>, FerroError> {
let qt = QuantileTransformer::<F>::new(n_quantiles, output_distribution, subsample);
match axis {
// sklearn `if axis == 0: X = n.fit_transform(X)` (`_data.py:3115-3116`):
// each feature (column) transformed independently.
0 => {
let fitted = qt.fit(x, &())?;
fitted.transform(x)
}
// sklearn `else: X = n.fit_transform(X.T).T` (`_data.py:3117-3118`): each
// sample (row) transformed independently. Transpose to a CONTIGUOUS owned
// array so the per-feature `fit`/`transform` operates on each original
// row, then transpose the result back to the original orientation.
1 => {
let xt = x.t().to_owned();
let fitted = qt.fit(&xt, &())?;
let out_t = fitted.transform(&xt)?;
Ok(out_t.t().to_owned())
}
// sklearn `else: raise ValueError("axis should be either equal to 0 or 1.
// Got axis={axis}")` (`_data.py:3117` guard via `@validate_params`).
other => Err(FerroError::InvalidParameter {
name: "axis".into(),
reason: format!("axis should be either equal to 0 or 1. Got axis={other}"),
}),
}
}
// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------
#[cfg(test)]
mod tests {
use super::*;
use approx::assert_abs_diff_eq;
use ndarray::array;
#[test]
fn test_quantile_transformer_uniform() {
let qt = QuantileTransformer::<f64>::new(100, OutputDistribution::Uniform, 0);
let x = array![[1.0], [2.0], [3.0], [4.0], [5.0]];
let fitted = qt.fit(&x, &()).unwrap();
let out = fitted.transform(&x).unwrap();
// All values should be in [0, 1]
for v in &out {
assert!(*v >= 0.0 && *v <= 1.0, "Value {v} not in [0,1]");
}
// First should be 0, last should be 1
assert_abs_diff_eq!(out[[0, 0]], 0.0, epsilon = 1e-6);
assert_abs_diff_eq!(out[[4, 0]], 1.0, epsilon = 1e-6);
}
#[test]
fn test_quantile_transformer_normal() {
let qt = QuantileTransformer::<f64>::new(100, OutputDistribution::Normal, 0);
let x = array![[1.0], [2.0], [3.0], [4.0], [5.0]];
let fitted = qt.fit(&x, &()).unwrap();
let out = fitted.transform(&x).unwrap();
// Middle value should be close to 0 (median → 0 in normal)
assert!(out[[2, 0]].abs() < 0.5, "Median should map near 0");
// First should be negative, last positive
assert!(out[[0, 0]] < out[[4, 0]]);
}
#[test]
fn test_quantile_transformer_monotonic() {
let qt = QuantileTransformer::<f64>::new(100, OutputDistribution::Uniform, 0);
let x = array![[5.0], [3.0], [1.0], [4.0], [2.0]];
let fitted = qt.fit(&x, &()).unwrap();
let out = fitted.transform(&x).unwrap();
// Transform should preserve ordering: rank(5) > rank(3) > rank(1)
assert!(out[[0, 0]] > out[[1, 0]]); // 5 > 3
assert!(out[[1, 0]] > out[[2, 0]]); // 3 > 1
}
#[test]
fn test_quantile_transformer_multiple_features() {
let qt = QuantileTransformer::<f64>::new(50, OutputDistribution::Uniform, 0);
let x = array![[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]];
let fitted = qt.fit(&x, &()).unwrap();
let out = fitted.transform(&x).unwrap();
assert_eq!(out.ncols(), 2);
// Each feature independently transformed
for j in 0..2 {
assert!(out[[0, j]] <= out[[2, j]]);
}
}
#[test]
fn test_quantile_transformer_fit_transform() {
let qt = QuantileTransformer::<f64>::new(100, OutputDistribution::Uniform, 0);
let x = array![[1.0], [2.0], [3.0], [4.0], [5.0]];
let out = qt.fit_transform(&x).unwrap();
assert_abs_diff_eq!(out[[0, 0]], 0.0, epsilon = 1e-6);
assert_abs_diff_eq!(out[[4, 0]], 1.0, epsilon = 1e-6);
}
#[test]
fn test_quantile_transformer_insufficient_samples_error() {
// sklearn accepts n_samples == 1 (clamps n_quantiles_ to 1, `_data.py:2790`);
// only the empty (0-row) case is rejected (#2218).
let qt = QuantileTransformer::<f64>::new(100, OutputDistribution::Uniform, 0);
let empty = ndarray::Array2::<f64>::zeros((0, 1));
assert!(qt.fit(&empty, &()).is_err());
}
#[test]
fn test_quantile_transformer_single_sample_all_zeros_no_nan() {
// sklearn: QuantileTransformer(5,'uniform').fit_transform([[7.]]) -> [[0.0]]
// (n_quantiles_ clamped to 1, references_=[0.0]). Must not produce NaN.
let qt = QuantileTransformer::<f64>::new(5, OutputDistribution::Uniform, 0);
let x = array![[7.0]];
let res = qt.fit_transform(&x);
assert!(res.is_ok(), "single-sample fit_transform must succeed");
if let Ok(out) = res {
assert_eq!(out.dim(), (1, 1));
let v = out[[0, 0]];
assert!(v.abs() <= 1e-12 && !v.is_nan());
}
}
#[test]
fn test_quantile_transformer_too_few_quantiles_error() {
let qt = QuantileTransformer::<f64>::new(1, OutputDistribution::Uniform, 0);
let x = array![[1.0], [2.0], [3.0]];
assert!(qt.fit(&x, &()).is_err());
}
#[test]
fn test_quantile_transformer_shape_mismatch() {
let qt = QuantileTransformer::<f64>::new(100, OutputDistribution::Uniform, 0);
let x_train = array![[1.0, 2.0], [3.0, 4.0]];
let fitted = qt.fit(&x_train, &()).unwrap();
let x_bad = array![[1.0, 2.0, 3.0]];
assert!(fitted.transform(&x_bad).is_err());
}
#[test]
fn test_quantile_transformer_unfitted_error() {
let qt = QuantileTransformer::<f64>::new(100, OutputDistribution::Uniform, 0);
let x = array![[1.0]];
assert!(qt.transform(&x).is_err());
}
#[test]
fn test_quantile_transformer_default() {
let qt = QuantileTransformer::<f64>::default();
assert_eq!(qt.n_quantiles(), 1000);
assert_eq!(qt.output_distribution(), OutputDistribution::Uniform);
assert_eq!(qt.subsample(), 100_000);
}
#[test]
fn test_quantile_transformer_f32() {
let qt = QuantileTransformer::<f32>::new(50, OutputDistribution::Uniform, 0);
let x: Array2<f32> = array![[1.0f32], [2.0], [3.0], [4.0], [5.0]];
let fitted = qt.fit(&x, &()).unwrap();
let out = fitted.transform(&x).unwrap();
assert!(out[[0, 0]] >= 0.0f32);
assert!(out[[4, 0]] <= 1.0f32);
}
}