ferrolearn_preprocess/
normalizer.rs

//! Normalizer: scale each sample (row) to unit norm.
//!
//! Unlike column-wise scalers, the `Normalizer` operates row-wise: each
//! sample is scaled independently so that its chosen norm equals 1.
//!
//! Supported norms:
//! - **L1**: divide by the sum of absolute values
//! - **L2**: divide by the Euclidean norm (default)
//! - **Max**: divide by the maximum absolute value
//!
//! Samples with a zero norm are left unchanged.
//!
//! This transformer is **stateless**: no fitting is required. Call
//! [`Transform::transform`] directly. For scikit-learn API parity it also
//! supports the stateful [`Fit`](ferrolearn_core::traits::Fit) →
//! [`FittedNormalizer`] path, which records `n_features_in_` and (like
//! sklearn) validates the input in `fit`; the fitted type's `transform`
//! applies the same row-norm formulas as the stateless path, so both paths
//! are bit-identical.
//!
//! # `## REQ status`
//!
//! Binary (R-DEFER-2), translating `sklearn/preprocessing/_data.py` (`class Normalizer`
//! `:1980`, `normalize` `:1866`). Design doc: `.design/preprocess/normalizer.md`. Expected
//! values from the live sklearn 1.5.2 oracle (R-CHAR-3). Consumers: the in-file
//! `PipelineTransformer`/`FittedPipelineTransformer` impls (pipeline integration) + crate
//! re-export (`lib.rs:119`, grandfathered S5). PyO3 binding: see REQ-8.
//!
//! | REQ | Status | Evidence |
//! |---|---|---|
//! | REQ-1 (row-wise L1/L2/Max transform) | SHIPPED | `Transform::transform` divides each row by its norm (L1=Σ\|v\|, L2=√Σv², Max=max\|v\|; zero-norm row unchanged), default L2; mirrors sklearn dense `normalize` (`_data.py:1962-1969`, `_handle_zeros_in_scale` `:1968`). Critic-verified bit-identical to live oracle: `guard_l1/l2/max/zero_row/f32_matches_oracle` in `tests/divergence_normalizer.rs`. Consumers: `FittedPipelineTransformer::transform_pipeline` + crate re-export `lib.rs:119`. |
//! | REQ-2 (transform input validation per check_array) | SHIPPED | FIXED #1140. `transform` guards (sklearn order) zero-samples → `InsufficientSamples` (`validation.py:1084`), zero-features → `InvalidParameter` (`:1093`), non-finite NaN/±inf → `InvalidParameter` (`:1063`), matching `Normalizer.transform` → `normalize` → `check_array` (`_data.py:1933-1940`). Mirrors converged `binarizer.rs`. Critic two-round CLEAN: 6 rejection pins + finite-not-over-rejected guards (zero-NORM-row/1e308/subnormal/-0.0); pipeline consumer inherits validation. |
//! | REQ-3 (validating fit + parameter constraints) | SHIPPED | FIXED #1141. `impl Fit<Array2<F>, ()> for Normalizer` (`fit`): runs the SAME `validate_normalize_input` guard as `Transform::transform`/`normalize` (REQ-2: zero-samples → `InsufficientSamples`, zero-features/non-finite NaN±inf → `InvalidParameter`, sklearn `_validate_data` default `force_all_finite=True` REJECTS NaN/inf; confirmed `Normalizer().fit([[nan]])`/`[[inf]]` raise ValueError, `:2082`,`utils/validation.py:1063/1084/1093`), records `n_features_in_ = x.ncols()`, returns `FittedNormalizer { norm, copy, n_features_in_ }` (no fitted statistics: Normalizer is stateless, sklearn fit "Only validates", `:2062-2083`). sklearn's `_parameter_constraints {norm:[StrOptions{l1,l2,max}]}` (`:2053-2055`) has NO ferrolearn analog: `NormType` is a closed Rust enum, so an out-of-domain norm is UNREPRESENTABLE rather than runtime-rejected; the type system satisfies the param-domain check. Live-oracle tests: `fit_l1/l2/max_matches_oracle_and_stateless`, `fit_rejects_nan/pos_inf/neg_inf`, `fit_zero_row_unchanged`, `fitted_transform_shape_mismatch`, `fit_path_equals_stateless_path` in `tests/divergence_normalizer.rs`. Consumers: `FittedNormalizer::transform` (the fitted path) + crate re-export `lib.rs:140`. |
//! | REQ-4 (normalize free fn: axis / return_norm) | SHIPPED | FIXED #1142. `pub fn normalize` + `pub fn normalize_with_norms` (free fns) mirror sklearn `normalize(X, norm, *, axis=1, copy=True, return_norm=False)` (`_data.py:1866`). Shared `row_norm` helper computes L1=Σ\|v\|, L2=√Σv², Max=max\|v\| (`:1962-1967`); `_handle_zeros_in_scale` zero→1 (`:1968`); `X /= norms` (`:1969`). `axis=1` row-normalizes; `axis=0` column-normalizes (sklearn transpose `:1926-1942`,`:1971-1972`); `axis ∉ {0,1}` → `InvalidParameter`. `normalize_with_norms` returns `(normalized, raw_norms)` (return_norm `:1974-1975`; raw, NOT zero-handled). Same validation as `Transform::transform` (REQ-2). Oracle-grounded tests in `#[cfg(test)]`: `normalize_l2/l1/max_axis1_matches_sklearn`, `normalize_l2_axis0_matches_sklearn`, `normalize_return_norm_l2_and_l1`, `normalize_invalid_axis_errors`. |
//! | REQ-5 (copy parameter) | SHIPPED | FIXED #1143. `Normalizer<F>` gains a `copy: bool` field (default `true`) + `#[must_use] with_copy` builder + `copy()` getter, threaded onto `FittedNormalizer`, mirroring sklearn `__init__(norm='l2', *, copy=True)` (`_data.py:2058-2060`, `_parameter_constraints {copy:["boolean"]}` `:2055`). ACCEPT-AND-DOCUMENT no-op: ferrolearn's `Transform` always returns a freshly allocated array (`to_owned()`), so `copy` has no observable effect; `copy=True`/`copy=False` produce identical output (sklearn's `copy=False` does in-place row normalization, an optimization Rust's ownership makes moot here). Live-oracle test `fit_copy_true_false_identical`. Consumers: `FittedNormalizer` carries the flag + crate re-export `lib.rs:140`. |
//! | REQ-6 (n_features_in_ / feature names) | PARTIAL | `n_features_in_` SHIPPED, `get_feature_names_out` NOT-STARTED. `FittedNormalizer<F>` records `n_features_in_ = x.ncols()` in `fit` and exposes `pub fn n_features_in(&self) -> usize`, mirroring sklearn's `_validate_data` setting `n_features_in_` (`:2082`); `FittedNormalizer::transform` validates the input column count against it (`ShapeMismatch`, sklearn `_validate_data(reset=False)` `:2104`). The `OneToOneFeatureMixin.get_feature_names_out` / `feature_names_in_` string-name plumbing is OUT OF SCOPE for this build (no string feature-name infrastructure in ferrolearn yet); open prereq blocker #1144 for the feature-name half. Live-oracle test `fit_n_features_in_matches_ncols`. |
//! | REQ-7 (sparse support) | NOT-STARTED | open prereq blocker #1145. Dense-only; no CSR `inplace_csr_row_normalize_l1/l2` / `min_max_axis` Max (`:1944-1960`). |
//! | REQ-8 (PyO3 binding) | SHIPPED | FIXED #1146. `ferrolearn-python` surfaces `Normalizer` as `ferrolearn.Normalizer`: the hand-written `_RsNormalizer` `#[pyclass]` (`ferrolearn-python/src/extras.rs`, registered `lib.rs`) maps sklearn's `norm` STRING ('l1'/'l2'/'max') to the closed Rust `NormType` enum via `RsNormalizer::resolve_norm`: a bad string → `PyValueError` (sklearn `_parameter_constraints {norm: StrOptions({"l1","l2","max"})}`, `_data.py:2055`, `InvalidParameterError` ⊂ ValueError), builds `Normalizer::<f64>::new(normtype).with_copy(copy)`, runs the validating `Fit` (NaN/±inf → `PyValueError`, REQ-3) and delegates `transform` to `FittedNormalizer`. The non-test production consumer is `_extras.py::Normalizer(_TransformerWrapper)` with sklearn's `__init__(self, norm="l2", *, copy=True)` ABI (norm positional-or-keyword, copy keyword-only, `_data.py:2058`) + an overridden STATELESS `transform` (build-on-demand without fit, `_more_tags stateless=True` `_data.py:2110`, #2213) doing a FLOAT-ONLY dtype cast-back (float32→float32, float64→float64, int64→float64 UPCAST per `check_array(dtype=FLOAT_DTYPES)` `_data.py:2104`, #2214-analog; DIFFERS from Binarizer's number-preserving cast); re-exported in `__init__.py`. Verified vs the live sklearn 1.5.2 oracle: `tests/divergence_normalizer.py` (l1/l2/max values, default-l2, positional-norm, stateless, dtype, NaN/±inf, zero-norm, bad-norm, clone/get_params/set_params, copy no-op, pipeline). **Reduced-precision caveat (#2215, tracked):** sklearn `normalize` casts X to the INPUT float precision via `check_array(dtype=FLOAT_DTYPES)` (`_data.py:1933`) and computes the norm + division IN that precision (float16/float32), but the f64-only binding ABI (shared by EVERY `_Rs*` transformer) computes the norm in float64 then casts the result back, so float32 (~6e-8) and float16 (~5e-4) VALUES diverge slightly (dtype LABELS match; the float64 path is bit-exact, <1e-12). Same class as the generic-F precision caveats #2205/#2206; float16 is fundamentally unmatchable (the Rust core has no f16). Pinned `#[skip]` in `tests/divergence_normalizer_reduced_precision.py`. |
//! | REQ-9 (ferray substrate) | NOT-STARTED | open prereq blocker #1147. `ndarray::Array2` + `num_traits::Float`, not `ferray-core`/`ferray-ufunc` (R-SUBSTRATE-1/2). |

use ferrolearn_core::error::FerroError;
use ferrolearn_core::pipeline::{FittedPipelineTransformer, PipelineTransformer};
use ferrolearn_core::traits::{Fit, Transform};
use ndarray::{Array1, Array2, ArrayView1};
use num_traits::Float;

// ---------------------------------------------------------------------------
// NormType
// ---------------------------------------------------------------------------

/// The norm used by [`Normalizer`] when scaling each sample.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum NormType {
    /// L1 norm: sum of absolute values.
    L1,
    /// L2 norm: Euclidean norm (square root of sum of squares). This is the default.
    #[default]
    L2,
    /// Max norm: maximum absolute value in the sample.
    Max,
}

// ---------------------------------------------------------------------------
// Normalizer
// ---------------------------------------------------------------------------

/// A stateless row-wise normalizer.
///
/// Each sample (row) is independently scaled so that its chosen norm equals 1.
/// Samples with a zero norm are left unchanged.
///
/// This transformer is stateless: no [`Fit`](ferrolearn_core::traits::Fit)
/// step is needed. Call [`Transform::transform`] directly.
///
/// # Examples
///
/// ```
/// use ferrolearn_preprocess::normalizer::{Normalizer, NormType};
/// use ferrolearn_core::traits::Transform;
/// use ndarray::array;
///
/// let normalizer = Normalizer::<f64>::new(NormType::L2);
/// let x = array![[3.0, 4.0], [1.0, 0.0]];
/// let out = normalizer.transform(&x).unwrap();
/// // Row 0 has L2 norm 5, so it becomes [3/5, 4/5]; row 1 already has unit norm.
/// assert!((out[[0, 0]] - 0.6).abs() < 1e-12);
/// assert!((out[[0, 1]] - 0.8).abs() < 1e-12);
/// assert_eq!(out[[1, 0]], 1.0);
/// ```
#[derive(Debug, Clone)]
pub struct Normalizer<F> {
    /// The norm to use for normalisation.
    pub(crate) norm: NormType,
    /// sklearn's `copy` constructor parameter (`__init__(norm='l2', *, copy=True)`,
    /// `_data.py:2058-2060`; `_parameter_constraints {copy:["boolean"]}` `:2055`).
    /// ACCEPT-AND-DOCUMENT no-op: ferrolearn's [`Transform`] always returns a
    /// freshly allocated array, so `copy` has no observable effect. Retained for
    /// API parity. Defaults to `true`.
    pub(crate) copy: bool,
    _marker: std::marker::PhantomData<F>,
}

impl<F: Float + Send + Sync + 'static> Normalizer<F> {
    /// Create a new `Normalizer` with the specified norm type.
    #[must_use]
    pub fn new(norm: NormType) -> Self {
        Self {
            norm,
            copy: true,
            _marker: std::marker::PhantomData,
        }
    }

    /// Create a new `Normalizer` using the default L2 norm.
    #[must_use]
    pub fn l2() -> Self {
        Self::new(NormType::L2)
    }

    /// Create a new `Normalizer` using the L1 norm.
    #[must_use]
    pub fn l1() -> Self {
        Self::new(NormType::L1)
    }

    /// Create a new `Normalizer` using the Max norm.
    #[must_use]
    pub fn max() -> Self {
        Self::new(NormType::Max)
    }

    /// Return the configured norm type.
    #[must_use]
    pub fn norm(&self) -> NormType {
        self.norm
    }

    /// Set the `copy` parameter (sklearn `Normalizer(copy=...)`,
    /// `_data.py:2058`, `_parameter_constraints {copy:["boolean"]}` `:2055`).
    ///
    /// This is an ACCEPT-AND-DOCUMENT no-op: ferrolearn's [`Transform`] always
    /// returns a freshly allocated array, so `copy` has no observable effect on
    /// the output. It is retained for API parity with scikit-learn.
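    ///
    /// # Examples
    ///
    /// A minimal sketch of the no-op described above, assuming the same module
    /// paths as the [`Normalizer`] example: the flag is stored, but both
    /// settings produce the same output.
    ///
    /// ```
    /// use ferrolearn_core::traits::Transform;
    /// use ferrolearn_preprocess::normalizer::{Normalizer, NormType};
    /// use ndarray::array;
    ///
    /// let x = array![[2.0, 0.0], [0.0, -4.0]];
    /// let a = Normalizer::<f64>::new(NormType::Max);
    /// let b = Normalizer::<f64>::new(NormType::Max).with_copy(false);
    ///
    /// // The getter reflects the configured flag (default `true`).
    /// assert!(a.copy());
    /// assert!(!b.copy());
    ///
    /// // The transformed values do not depend on the flag.
    /// assert_eq!(a.transform(&x).unwrap(), b.transform(&x).unwrap());
    /// ```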
    #[must_use]
    pub fn with_copy(mut self, copy: bool) -> Self {
        self.copy = copy;
        self
    }

    /// Return the configured `copy` flag (sklearn `Normalizer.copy`).
    #[must_use]
    pub fn copy(&self) -> bool {
        self.copy
    }
}

impl<F: Float + Send + Sync + 'static> Default for Normalizer<F> {
    fn default() -> Self {
        Self::new(NormType::L2)
    }
}

// ---------------------------------------------------------------------------
// FittedNormalizer (sklearn stateful `fit` -> fitted estimator path)
// ---------------------------------------------------------------------------

/// A fitted [`Normalizer`].
///
/// `Normalizer` is stateless: its `fit` (sklearn `Normalizer.fit`,
/// `_data.py:2062-2083`, "Only validates estimator's parameters") learns NO
/// statistics; it merely validates the input and records `n_features_in_`. The
/// fitted type therefore carries only the configured `norm`, the `copy` flag,
/// and the recorded feature count. Its [`Transform::transform`] reuses the very
/// same row-norm logic as the stateless [`Normalizer`]/[`normalize`] path, so
/// the two paths are bit-identical.
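///
/// # Examples
///
/// A minimal sketch of the fit-then-transform path, assuming the same module
/// paths as the [`Normalizer`] example (`fit` takes `&()` as the unused target):
///
/// ```
/// use ferrolearn_core::traits::{Fit, Transform};
/// use ferrolearn_preprocess::normalizer::Normalizer;
/// use ndarray::array;
///
/// let x = array![[3.0, 4.0], [0.0, 0.0]];
/// let fitted = Normalizer::<f64>::l2().fit(&x, &()).unwrap();
/// assert_eq!(fitted.n_features_in(), 2);
///
/// // Same values as the stateless path; the zero-norm row is left unchanged.
/// let out = fitted.transform(&x).unwrap();
/// assert!((out[[0, 0]] - 0.6).abs() < 1e-12);
/// assert!((out[[0, 1]] - 0.8).abs() < 1e-12);
/// assert_eq!(out[[1, 0]], 0.0);
///
/// // An input with a different column count is rejected.
/// let wide = array![[1.0, 2.0, 3.0]];
/// assert!(fitted.transform(&wide).is_err());
/// ```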
#[derive(Debug, Clone)]
pub struct FittedNormalizer<F> {
    /// The norm to use for normalisation.
    pub(crate) norm: NormType,
    /// The `copy` flag carried from the unfitted [`Normalizer`] (no-op; see
    /// [`Normalizer::with_copy`]).
    pub(crate) copy: bool,
    /// Number of features (columns) seen during [`Fit::fit`]: sklearn's
    /// `n_features_in_` (`_data.py:2082`, set by `_validate_data`).
    pub(crate) n_features_in_: usize,
    _marker: std::marker::PhantomData<F>,
}

impl<F: Float + Send + Sync + 'static> FittedNormalizer<F> {
    /// Return the number of features (columns) seen during [`Fit::fit`].
    ///
    /// Mirrors scikit-learn's `Normalizer.n_features_in_` (`_data.py:2082`).
    #[must_use]
    pub fn n_features_in(&self) -> usize {
        self.n_features_in_
    }

    /// Return the configured norm type.
    #[must_use]
    pub fn norm(&self) -> NormType {
        self.norm
    }

    /// Return the configured `copy` flag (no-op; see [`Normalizer::with_copy`]).
    #[must_use]
    pub fn copy(&self) -> bool {
        self.copy
    }
}

impl<F: Float + Send + Sync + 'static> Fit<Array2<F>, ()> for Normalizer<F> {
    type Fitted = FittedNormalizer<F>;
    type Error = FerroError;

    /// Validate the input and record `n_features_in_`, returning a
    /// [`FittedNormalizer`].
    ///
    /// `Normalizer` is stateless: like scikit-learn's `Normalizer.fit`
    /// (`sklearn/preprocessing/_data.py:2062-2083`, "Only validates estimator's
    /// parameters"), this learns NO statistics. It runs the SAME `check_array`
    /// validation as [`Transform::transform`] / [`normalize`] (REQ-2, via the
    /// shared `validate_normalize_input` helper) and records
    /// `n_features_in_ = x.ncols()`. sklearn's `_validate_data` uses the default
    /// `force_all_finite=True`, so NaN/±inf are REJECTED in `fit`
    /// (`Normalizer().fit([[nan]])` / `[[inf]]` raise `ValueError`).
    ///
    /// # Errors
    ///
    /// Returns [`FerroError::InsufficientSamples`] for zero rows and
    /// [`FerroError::InvalidParameter`] for zero features or any non-finite
    /// value (NaN, +inf, -inf), matching `check_array`
    /// (`sklearn/utils/validation.py:1084`, `:1093`, `:1063`) as routed through
    /// `Normalizer.fit` -> `_validate_data` (`_data.py:2082`).
    fn fit(&self, x: &Array2<F>, _y: &()) -> Result<FittedNormalizer<F>, FerroError> {
        validate_normalize_input(x)?;
        Ok(FittedNormalizer {
            norm: self.norm,
            copy: self.copy,
            n_features_in_: x.ncols(),
            _marker: std::marker::PhantomData,
        })
    }
}

impl<F: Float + Send + Sync + 'static> Transform<Array2<F>> for FittedNormalizer<F> {
    type Output = Array2<F>;
    type Error = FerroError;

    /// Normalize each row of `x` to unit norm, delegating to the SAME row-norm
    /// logic as the stateless [`Normalizer`] / [`normalize`] path.
    ///
    /// Applies the REQ-2 `check_array` guards and row-normalizes via the shared
    /// [`normalize`] free function with `axis=1` (sklearn `Normalizer.transform` ->
    /// `normalize(X, norm=self.norm, axis=1)`, `:2106`), then checks that `x` has
    /// the column count recorded during [`Fit::fit`] (sklearn
    /// `_validate_data(reset=False)`, `sklearn/preprocessing/_data.py:2104`). The
    /// `check_array` errors therefore take precedence over a shape mismatch, and
    /// on success the output is bit-identical to `Normalizer::transform`.
    ///
    /// # Errors
    ///
    /// Returns [`FerroError::InsufficientSamples`] for zero rows and
    /// [`FerroError::InvalidParameter`] for zero features or any non-finite
    /// value (REQ-2, via [`normalize`]). If those checks pass, returns
    /// [`FerroError::ShapeMismatch`] when the column count differs from
    /// `n_features_in_`.
    fn transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
        // sklearn `_validate_data(reset=False)` runs `check_array` (finite /
        // min-samples / min-features) BEFORE `_check_n_features` (`base.py:633`
        // then `:654`, #2207). So validate + normalize FIRST (this is
        // `check_array`'s job via the shared REQ-2 guard in `normalize`); a NaN /
        // +-inf / zero-sample / zero-feature input must raise its check_array
        // error EVEN when the column count is also wrong. Only after that does
        // the n_features comparison fire.
        let normalized = normalize(x, self.norm, 1)?;
        if x.ncols() != self.n_features_in_ {
            return Err(FerroError::ShapeMismatch {
                expected: vec![x.nrows(), self.n_features_in_],
                actual: vec![x.nrows(), x.ncols()],
                context: "FittedNormalizer::transform".into(),
            });
        }
        Ok(normalized)
    }
}

// ---------------------------------------------------------------------------
// Trait implementations
// ---------------------------------------------------------------------------

impl<F: Float + Send + Sync + 'static> Transform<Array2<F>> for Normalizer<F> {
    type Output = Array2<F>;
    type Error = FerroError;

    /// Normalize each row of `x` to unit norm.
    ///
    /// Rows with a zero norm are left unchanged.
    ///
    /// # Errors
    ///
    /// Returns [`FerroError::InsufficientSamples`] if `x` has zero rows. This
    /// mirrors scikit-learn's `Normalizer.transform` ->
    /// `normalize` -> `check_array` (`sklearn/preprocessing/_data.py:1933`),
    /// whose min-samples check (`utils/validation.py:1084`,
    /// `ensure_min_samples=1`) raises `ValueError: Found array with 0 sample(s)
    /// ... while a minimum of 1 is required by Normalizer.`
    ///
    /// Returns [`FerroError::InvalidParameter`] if `x` has zero features
    /// (columns). This mirrors the same `check_array` min-features check
    /// (`utils/validation.py:1093`, `ensure_min_features=1`) which raises
    /// `ValueError: Found array with 0 feature(s) ... while a minimum of 1 is
    /// required by Normalizer.`
    ///
    /// Returns [`FerroError::InvalidParameter`] if `x` contains any non-finite
    /// value (NaN, +inf, or -inf). This mirrors `check_array(force_all_finite=
    /// True)` (`utils/validation.py:1063`), which raises `ValueError: Input X
    /// contains NaN.` / `Input X contains infinity ...` before normalizing.
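    ///
    /// # Examples
    ///
    /// A minimal sketch of the rejected inputs listed above, assuming the same
    /// module paths as the [`Normalizer`] example:
    ///
    /// ```
    /// use ferrolearn_core::traits::Transform;
    /// use ferrolearn_preprocess::normalizer::Normalizer;
    /// use ndarray::{array, Array2};
    ///
    /// let normalizer = Normalizer::<f64>::l2();
    ///
    /// // Zero rows and zero columns are both rejected.
    /// assert!(normalizer.transform(&Array2::<f64>::zeros((0, 3))).is_err());
    /// assert!(normalizer.transform(&Array2::<f64>::zeros((2, 0))).is_err());
    ///
    /// // Non-finite values are rejected.
    /// assert!(normalizer.transform(&array![[1.0, f64::NAN]]).is_err());
    /// assert!(normalizer.transform(&array![[1.0, f64::INFINITY]]).is_err());
    ///
    /// // A finite all-zero row is valid input and is returned unchanged.
    /// assert!(normalizer.transform(&array![[0.0, 0.0]]).is_ok());
    /// ```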
    fn transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
        if x.nrows() == 0 {
            return Err(FerroError::InsufficientSamples {
                required: 1,
                actual: 0,
                context: "Normalizer::transform".into(),
            });
        }
        if x.ncols() == 0 {
            return Err(FerroError::InvalidParameter {
                name: "X".to_string(),
                reason: "Found array with 0 feature(s); a minimum of 1 is required \
                         by Normalizer"
                    .to_string(),
            });
        }
        if x.iter().any(|v| !v.is_finite()) {
            return Err(FerroError::InvalidParameter {
                name: "X".to_string(),
                reason: "Input X contains non-finite values (NaN or infinity); \
                         Normalizer requires all-finite input"
                    .to_string(),
            });
        }
        let mut out = x.to_owned();
        for mut row in out.rows_mut() {
            let norm_val =
                match self.norm {
                    NormType::L1 => row.iter().copied().fold(F::zero(), |acc, v| acc + v.abs()),
                    NormType::L2 => row
                        .iter()
                        .copied()
                        .fold(F::zero(), |acc, v| acc + v * v)
                        .sqrt(),
                    NormType::Max => row.iter().copied().fold(F::zero(), |acc, v| {
                        if v.abs() > acc { v.abs() } else { acc }
                    }),
                };
            if norm_val == F::zero() {
                // Zero-norm row: leave unchanged.
                continue;
            }
            for v in &mut row {
                *v = *v / norm_val;
            }
        }
        Ok(out)
    }
}

// ---------------------------------------------------------------------------
// Standalone `normalize` free function (sklearn `normalize`, `_data.py:1866`)
// ---------------------------------------------------------------------------

/// Compute the `norm` of a single 1-D slice (one row or one column).
///
/// Mirrors sklearn's dense `normalize` per-vector norms (`_data.py:1962-1967`):
/// L1 = Σ|v|, L2 = √Σv², Max = max|v|.
fn row_norm<F: Float>(row: ArrayView1<F>, norm: NormType) -> F {
    match norm {
        NormType::L1 => row.iter().copied().fold(F::zero(), |acc, v| acc + v.abs()),
        NormType::L2 => row
            .iter()
            .copied()
            .fold(F::zero(), |acc, v| acc + v * v)
            .sqrt(),
        NormType::Max => {
            row.iter().copied().fold(
                F::zero(),
                |acc, v| {
                    if v.abs() > acc { v.abs() } else { acc }
                },
            )
        }
    }
}

/// Run the shared `check_array` input validation (REQ-2) used by [`Fit::fit`]
/// and the free [`normalize`]/[`normalize_with_norms`] functions, in sklearn's
/// `check_array` order (zero-samples → zero-features → non-finite;
/// `sklearn/utils/validation.py:1084`, `:1093`, `:1063`). [`Normalizer`]'s
/// `transform` performs the same checks inline, in the same order.
fn validate_normalize_input<F: Float>(x: &Array2<F>) -> Result<(), FerroError> {
    if x.nrows() == 0 {
        return Err(FerroError::InsufficientSamples {
            required: 1,
            actual: 0,
            context: "normalize".into(),
        });
    }
    if x.ncols() == 0 {
        return Err(FerroError::InvalidParameter {
            name: "X".to_string(),
            reason: "Found array with 0 feature(s); a minimum of 1 is required \
                     by the normalize function"
                .to_string(),
        });
    }
    if x.iter().any(|v| !v.is_finite()) {
        return Err(FerroError::InvalidParameter {
            name: "X".to_string(),
            reason: "Input X contains non-finite values (NaN or infinity); \
                     the normalize function requires all-finite input"
                .to_string(),
        });
    }
    Ok(())
}

/// Shared core of [`normalize`] / [`normalize_with_norms`]: validate `axis` and
/// input, then return the normalized array plus the per-axis **raw** norm vector.
///
/// The returned `norms` are the actual computed norms (NOT zero-handled): a
/// zero-norm row/column appears as `0.0` even though the division used `1`
/// (`_handle_zeros_in_scale`, `_data.py:1968`) to leave it unchanged. This
/// matches sklearn's `normalize(..., return_norm=True)` (`:1974-1975`).
fn normalize_inner<F: Float>(
    x: &Array2<F>,
    norm: NormType,
    axis: usize,
) -> Result<(Array2<F>, Array1<F>), FerroError> {
    if axis != 0 && axis != 1 {
        return Err(FerroError::InvalidParameter {
            name: "axis".into(),
            reason: "must be 0 or 1".into(),
        });
    }
    validate_normalize_input(x)?;

    let mut out = x.to_owned();
    if axis == 1 {
        // Row-normalize (sklearn default axis=1).
        let mut norms = Array1::<F>::zeros(out.nrows());
        for (i, mut row) in out.rows_mut().into_iter().enumerate() {
            let n = row_norm(row.view(), norm);
            norms[i] = n;
            // _handle_zeros_in_scale: a zero norm divides by 1 (row unchanged).
            let eff = if n == F::zero() { F::one() } else { n };
            for v in &mut row {
                *v = *v / eff;
            }
        }
        Ok((out, norms))
    } else {
        // axis == 0: column-normalize. sklearn transposes, runs the axis=1
        // path, then transposes back (`_data.py:1926-1942`, `:1971-1972`).
        let mut norms = Array1::<F>::zeros(out.ncols());
        for (j, mut col) in out.columns_mut().into_iter().enumerate() {
            let n = row_norm(col.view(), norm);
            norms[j] = n;
            let eff = if n == F::zero() { F::one() } else { n };
            for v in &mut col {
                *v = *v / eff;
            }
        }
        Ok((out, norms))
    }
}

/// Scale input vectors individually to unit norm: the standalone, estimator-less
/// API mirroring scikit-learn's `normalize` free function
/// (`sklearn/preprocessing/_data.py:1866`).
///
/// With `axis == 1` (sklearn's default) each **row** (sample) is divided by its
/// `norm` (L1 = Σ|v|, L2 = √Σv², Max = max|v|); with `axis == 0` each **column**
/// (feature) is normalized instead (sklearn transposes, row-normalizes, and
/// transposes back; `:1926-1942`, `:1971-1972`). A row/column whose norm is zero
/// is left unchanged, matching `_handle_zeros_in_scale` (`:1968`).
///
/// # Errors
///
/// Returns [`FerroError::InvalidParameter`] if `axis` is not `0` or `1`. Also
/// applies the same `check_array` input validation as [`Normalizer`]'s
/// `transform` (REQ-2): [`FerroError::InsufficientSamples`] for zero rows, and
/// [`FerroError::InvalidParameter`] for zero features or any non-finite value
/// (`_data.py:1933-1940`).
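///
/// # Examples
///
/// A minimal sketch of both axes, assuming `normalize` is reachable at
/// `ferrolearn_preprocess::normalizer::normalize` (the module path used in the
/// [`Normalizer`] example):
///
/// ```
/// use ferrolearn_preprocess::normalizer::{normalize, NormType};
/// use ndarray::array;
///
/// let x = array![[3.0_f64, 0.0], [4.0, 0.0]];
///
/// // axis = 1: each row is divided by its own L2 norm (3 and 4).
/// let rows = normalize(&x, NormType::L2, 1).unwrap();
/// assert_eq!(rows, array![[1.0, 0.0], [1.0, 0.0]]);
///
/// // axis = 0: column 0 has L2 norm 5; the all-zero column 1 is unchanged.
/// let cols = normalize(&x, NormType::L2, 0).unwrap();
/// assert!((cols[[0, 0]] - 0.6).abs() < 1e-12);
/// assert!((cols[[1, 0]] - 0.8).abs() < 1e-12);
/// assert_eq!(cols[[0, 1]], 0.0);
///
/// // Any other axis is an error.
/// assert!(normalize(&x, NormType::L2, 2).is_err());
/// ```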
#[must_use = "normalize returns a new array; the input is not modified"]
pub fn normalize<F: Float>(
    x: &Array2<F>,
    norm: NormType,
    axis: usize,
) -> Result<Array2<F>, FerroError> {
    let (out, _norms) = normalize_inner(x, norm, axis)?;
    Ok(out)
}

/// Like [`normalize`] but also returns the per-axis norm vector: the
/// `return_norm=True` form of scikit-learn's `normalize`
/// (`sklearn/preprocessing/_data.py:1971-1975`).
///
/// Returns `(normalized, norms)` where `norms` is the per-row vector for
/// `axis == 1` (length = n_rows) or the per-column vector for `axis == 0`
/// (length = n_cols). The norms are the **raw** computed norms, NOT
/// zero-handled: a zero norm appears as `0.0` in the returned vector even though
/// the division used `1` to leave that row/column unchanged (sklearn returns the
/// raw `norms` array; `:1974-1975`).
///
/// # Errors
///
/// Same as [`normalize`].
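///
/// # Examples
///
/// A minimal sketch of the raw-norm behaviour described above, assuming the
/// same module path as the [`Normalizer`] example:
///
/// ```
/// use ferrolearn_preprocess::normalizer::{normalize_with_norms, NormType};
/// use ndarray::array;
///
/// let x = array![[1.0_f64, -3.0], [0.0, 0.0]];
/// let (out, norms) = normalize_with_norms(&x, NormType::L1, 1).unwrap();
///
/// // Row 0 has L1 norm 4. The zero row reports a raw norm of 0 and is unchanged.
/// assert_eq!(norms, array![4.0, 0.0]);
/// assert_eq!(out, array![[0.25, -0.75], [0.0, 0.0]]);
/// ```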
#[must_use = "normalize_with_norms returns a new array and the norm vector"]
pub fn normalize_with_norms<F: Float>(
    x: &Array2<F>,
    norm: NormType,
    axis: usize,
) -> Result<(Array2<F>, Array1<F>), FerroError> {
    normalize_inner(x, norm, axis)
}

// ---------------------------------------------------------------------------
// Pipeline integration (generic)
// ---------------------------------------------------------------------------

impl<F: Float + Send + Sync + 'static> PipelineTransformer<F> for Normalizer<F> {
    /// Fit the normalizer using the pipeline interface.
    ///
    /// Because `Normalizer` is stateless, this simply boxes a clone of `self`
    /// as a [`FittedPipelineTransformer`]. The inputs are ignored here; input
    /// validation happens later, in `transform_pipeline`.
    ///
    /// # Errors
    ///
    /// This implementation never returns an error.
    fn fit_pipeline(
        &self,
        _x: &Array2<F>,
        _y: &Array1<F>,
    ) -> Result<Box<dyn FittedPipelineTransformer<F>>, FerroError> {
        Ok(Box::new(self.clone()))
    }
}

impl<F: Float + Send + Sync + 'static> FittedPipelineTransformer<F> for Normalizer<F> {
    /// Transform data using the pipeline interface.
    ///
    /// # Errors
    ///
    /// Propagates errors from [`Transform::transform`].
    fn transform_pipeline(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
        self.transform(x)
    }
}

// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------

#[cfg(test)]
mod tests {
    use super::*;
    use approx::assert_abs_diff_eq;
    use ndarray::array;

    #[test]
    fn test_l2_norm_basic() {
        let norm = Normalizer::<f64>::l2();
        // Row [3, 4] has L2 norm 5.
        let x = array![[3.0, 4.0]];
        let out = norm.transform(&x).unwrap();
        assert_abs_diff_eq!(out[[0, 0]], 0.6, epsilon = 1e-10);
        assert_abs_diff_eq!(out[[0, 1]], 0.8, epsilon = 1e-10);
    }

    #[test]
    fn test_l2_unit_norm_after_transform() {
        let norm = Normalizer::<f64>::l2();
        let x = array![[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]];
        let out = norm.transform(&x).unwrap();
        for row in out.rows() {
            let row_norm: f64 = row.iter().map(|v| v * v).sum::<f64>().sqrt();
            assert_abs_diff_eq!(row_norm, 1.0, epsilon = 1e-10);
        }
    }

    #[test]
    fn test_l1_norm_basic() {
        let norm = Normalizer::<f64>::l1();
        // Row [1, 2, 3] has L1 norm 6.
        let x = array![[1.0, 2.0, 3.0]];
        let out = norm.transform(&x).unwrap();
        assert_abs_diff_eq!(out[[0, 0]], 1.0 / 6.0, epsilon = 1e-10);
        assert_abs_diff_eq!(out[[0, 1]], 2.0 / 6.0, epsilon = 1e-10);
        assert_abs_diff_eq!(out[[0, 2]], 3.0 / 6.0, epsilon = 1e-10);
    }

    #[test]
    fn test_l1_unit_norm_after_transform() {
        let norm = Normalizer::<f64>::l1();
        let x = array![[1.0, 2.0, 3.0], [-4.0, 5.0, 6.0]];
        let out = norm.transform(&x).unwrap();
        for row in out.rows() {
            let row_norm: f64 = row.iter().map(|v| v.abs()).sum();
            assert_abs_diff_eq!(row_norm, 1.0, epsilon = 1e-10);
        }
    }

    #[test]
    fn test_max_norm_basic() {
        let norm = Normalizer::<f64>::max();
        // Row [-5, 3, 1] has max norm 5.
        let x = array![[-5.0, 3.0, 1.0]];
        let out = norm.transform(&x).unwrap();
        assert_abs_diff_eq!(out[[0, 0]], -1.0, epsilon = 1e-10);
        assert_abs_diff_eq!(out[[0, 1]], 0.6, epsilon = 1e-10);
        assert_abs_diff_eq!(out[[0, 2]], 0.2, epsilon = 1e-10);
    }

    #[test]
    fn test_zero_row_unchanged() {
        let norm = Normalizer::<f64>::l2();
        let x = array![[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]];
        let out = norm.transform(&x).unwrap();
        // Zero row stays zero
        assert_abs_diff_eq!(out[[0, 0]], 0.0, epsilon = 1e-15);
        assert_abs_diff_eq!(out[[0, 1]], 0.0, epsilon = 1e-15);
        assert_abs_diff_eq!(out[[0, 2]], 0.0, epsilon = 1e-15);
    }

    #[test]
    fn test_negative_values_l2() {
        let norm = Normalizer::<f64>::l2();
        let x = array![[-3.0, -4.0]];
        let out = norm.transform(&x).unwrap();
        assert_abs_diff_eq!(out[[0, 0]], -0.6, epsilon = 1e-10);
        assert_abs_diff_eq!(out[[0, 1]], -0.8, epsilon = 1e-10);
    }

    #[test]
    fn test_default_is_l2() {
        let norm = Normalizer::<f64>::default();
        assert_eq!(norm.norm(), NormType::L2);
    }

    #[test]
    fn test_multiple_rows_independent() {
        let norm = Normalizer::<f64>::l2();
        let x = array![[3.0, 4.0], [0.0, 5.0]];
        let out = norm.transform(&x).unwrap();
        // Row 0: L2 norm = 5
        assert_abs_diff_eq!(out[[0, 0]], 0.6, epsilon = 1e-10);
        assert_abs_diff_eq!(out[[0, 1]], 0.8, epsilon = 1e-10);
        // Row 1: L2 norm = 5
        assert_abs_diff_eq!(out[[1, 0]], 0.0, epsilon = 1e-10);
        assert_abs_diff_eq!(out[[1, 1]], 1.0, epsilon = 1e-10);
652        assert_abs_diff_eq!(out[[0, 1]], 0.8, epsilon = 1e-10);
653        // Row 1: L2 norm = 5
654        assert_abs_diff_eq!(out[[1, 0]], 0.0, epsilon = 1e-10);
655        assert_abs_diff_eq!(out[[1, 1]], 1.0, epsilon = 1e-10);
656    }
657
658    #[test]
659    fn test_pipeline_integration() {
660        use ferrolearn_core::pipeline::PipelineTransformer;
661        let norm = Normalizer::<f64>::l2();
662        let x = array![[3.0, 4.0], [0.0, 2.0]];
663        let y = Array1::zeros(2);
664        let fitted = norm.fit_pipeline(&x, &y).unwrap();
665        let result = fitted.transform_pipeline(&x).unwrap();
666        assert_abs_diff_eq!(result[[0, 0]], 0.6, epsilon = 1e-10);
667        assert_abs_diff_eq!(result[[0, 1]], 0.8, epsilon = 1e-10);
668    }
669
670    #[test]
671    fn test_f32_normalizer() {
672        let norm = Normalizer::<f32>::l2();
673        let x: Array2<f32> = array![[3.0f32, 4.0]];
674        let out = norm.transform(&x).unwrap();
675        assert!((out[[0, 0]] - 0.6f32).abs() < 1e-6);
676        assert!((out[[0, 1]] - 0.8f32).abs() < 1e-6);
677    }
678
679    // -----------------------------------------------------------------------
680    // REQ-4 — standalone `normalize` / `normalize_with_norms` free functions.
681    // Oracle: live sklearn 1.5.2 (R-CHAR-3), X = [[1,2,2],[0,3,4]].
682    //   normalize(X, l2, axis=1) -> [[.33333333,.66666667,.66666667],[0,.6,.8]]
683    //   normalize(X, l1, axis=1) -> [[.2,.4,.4],[0,.42857143,.57142857]]
684    //   normalize(X, max,axis=1) -> [[.5,1,1],[0,.75,1]]
685    //   normalize(X, l2, axis=0) -> [[1,.5547002,.4472136],[0,.83205029,.89442719]]
686    //   return_norm l2 axis=1 norms -> [3,5]; l1 axis=1 norms -> [5,7]
687    // -----------------------------------------------------------------------
688
689    #[test]
690    fn normalize_l2_axis1_matches_sklearn() -> Result<(), FerroError> {
691        let x = array![[1.0, 2.0, 2.0], [0.0, 3.0, 4.0]];
692        let out = normalize(&x, NormType::L2, 1)?;
693        let expected = array![[0.33333333, 0.66666667, 0.66666667], [0.0, 0.6, 0.8]];
694        for (a, b) in out.iter().zip(expected.iter()) {
695            assert_abs_diff_eq!(a, b, epsilon = 1e-7);
696        }
697        Ok(())
698    }
699
700    #[test]
701    fn normalize_l1_axis1_matches_sklearn() -> Result<(), FerroError> {
702        let x = array![[1.0, 2.0, 2.0], [0.0, 3.0, 4.0]];
703        let out = normalize(&x, NormType::L1, 1)?;
704        let expected = array![[0.2, 0.4, 0.4], [0.0, 0.42857143, 0.57142857]];
705        for (a, b) in out.iter().zip(expected.iter()) {
706            assert_abs_diff_eq!(a, b, epsilon = 1e-7);
707        }
708        Ok(())
709    }
710
711    #[test]
712    fn normalize_max_axis1_matches_sklearn() -> Result<(), FerroError> {
713        let x = array![[1.0, 2.0, 2.0], [0.0, 3.0, 4.0]];
714        let out = normalize(&x, NormType::Max, 1)?;
715        let expected = array![[0.5, 1.0, 1.0], [0.0, 0.75, 1.0]];
716        for (a, b) in out.iter().zip(expected.iter()) {
717            assert_abs_diff_eq!(a, b, epsilon = 1e-7);
718        }
719        Ok(())
720    }
721
722    #[test]
723    fn normalize_l2_axis0_matches_sklearn() -> Result<(), FerroError> {
724        let x = array![[1.0, 2.0, 2.0], [0.0, 3.0, 4.0]];
725        let out = normalize(&x, NormType::L2, 0)?;
726        let expected = array![[1.0, 0.5547002, 0.4472136], [0.0, 0.83205029, 0.89442719]];
727        for (a, b) in out.iter().zip(expected.iter()) {
728            assert_abs_diff_eq!(a, b, epsilon = 1e-7);
729        }
730        Ok(())
731    }
732
733    #[test]
734    fn normalize_return_norm_l2_and_l1() -> Result<(), FerroError> {
735        let x = array![[1.0, 2.0, 2.0], [0.0, 3.0, 4.0]];
736
737        let (_out_l2, norms_l2) = normalize_with_norms(&x, NormType::L2, 1)?;
738        assert_abs_diff_eq!(norms_l2[0], 3.0, epsilon = 1e-9);
739        assert_abs_diff_eq!(norms_l2[1], 5.0, epsilon = 1e-9);
740
741        let (_out_l1, norms_l1) = normalize_with_norms(&x, NormType::L1, 1)?;
742        assert_abs_diff_eq!(norms_l1[0], 5.0, epsilon = 1e-9);
743        assert_abs_diff_eq!(norms_l1[1], 7.0, epsilon = 1e-9);
744        Ok(())
745    }
746
747    #[test]
748    fn normalize_invalid_axis_errors() {
749        let x = array![[1.0, 2.0, 2.0], [0.0, 3.0, 4.0]];
750        let err = normalize(&x, NormType::L2, 2);
751        assert!(matches!(err, Err(FerroError::InvalidParameter { .. })));
752    }
753}