ferrolearn_preprocess/normalizer.rs
//! Normalizer: scale each sample (row) to unit norm.
//!
//! Unlike column-wise scalers, the `Normalizer` operates row-wise: each
//! sample is scaled independently so that its chosen norm equals 1.
//!
//! Supported norms:
//! - **L1**: divide by the sum of absolute values
//! - **L2**: divide by the Euclidean norm (default)
//! - **Max**: divide by the maximum absolute value
//!
//! Samples with a zero norm are left unchanged.
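/*!
The norm list and the zero-norm rule above can be illustrated with a small,
dependency-free sketch. It is not this crate's implementation: `Norm` and
`normalize_row` are hypothetical names introduced here, and a row is a plain
`f64` slice rather than an `ndarray` view.

```rust
/// Hypothetical stand-in for the crate's norm selector.
#[derive(Clone, Copy)]
enum Norm {
    L1,
    L2,
    Max,
}

/// Scale one row to unit norm; a zero-norm row is returned unchanged.
fn normalize_row(row: &[f64], norm: Norm) -> Vec<f64> {
    let n = match norm {
        // Sum of absolute values.
        Norm::L1 => row.iter().map(|v| v.abs()).sum::<f64>(),
        // Square root of the sum of squares.
        Norm::L2 => row.iter().map(|v| v * v).sum::<f64>().sqrt(),
        // Largest absolute value.
        Norm::Max => row.iter().fold(0.0_f64, |acc, v| acc.max(v.abs())),
    };
    if n == 0.0 {
        return row.to_vec();
    }
    row.iter().map(|v| v / n).collect()
}
```

Under these assumptions, `normalize_row(&[3.0, 4.0], Norm::L2)` divides by 5,
and an all-zero row comes back unchanged for every norm.
*/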
//!
//! This transformer is **stateless**: no fitting is required. Call
//! [`Transform::transform`] directly. For scikit-learn API parity it ALSO
//! supports the stateful [`Fit`](ferrolearn_core::traits::Fit) →
//! [`FittedNormalizer`] path, which records `n_features_in_` and (like
//! sklearn) validates the input in `fit`; the fitted type's `transform`
//! reuses the very same row-norm logic as the stateless path, so both paths
//! are bit-identical.
//!
//! ## REQ status
//!
//! Binary (R-DEFER-2), translating `sklearn/preprocessing/_data.py` (`class Normalizer`
//! `:1980`, `normalize` `:1866`). Design doc: `.design/preprocess/normalizer.md`. Expected
//! values from the live sklearn 1.5.2 oracle (R-CHAR-3). Consumers: the in-file
//! `PipelineTransformer`/`FittedPipelineTransformer` impls (pipeline integration) + crate
//! re-export (`lib.rs:119`, grandfathered S5). PyO3 binding: see REQ-8.
//!
//! | REQ | Status | Evidence |
//! |---|---|---|
//! | REQ-1 (row-wise L1/L2/Max transform) | SHIPPED | `Transform::transform` divides each row by its norm (L1=Σ\|v\|, L2=√Σv², Max=max\|v\|; zero-norm row unchanged), default L2; mirrors sklearn dense `normalize` (`_data.py:1962-1969`, `_handle_zeros_in_scale` `:1968`). Critic-verified bit-identical to live oracle: `guard_l1/l2/max/zero_row/f32_matches_oracle` in `tests/divergence_normalizer.rs`. Consumers: `FittedPipelineTransformer::transform_pipeline` + crate re-export `lib.rs:119`. |
//! | REQ-2 (transform input validation per check_array) | SHIPPED | FIXED #1140. `transform` guards (sklearn order) zero-samples → `InsufficientSamples` (`validation.py:1084`), zero-features → `InvalidParameter` (`:1093`), non-finite NaN/±inf → `InvalidParameter` (`:1063`), matching `Normalizer.transform` → `normalize` → `check_array` (`_data.py:1933-1940`). Mirrors converged `binarizer.rs`. Critic two-round CLEAN: 6 rejection pins + finite-not-over-rejected guards (zero-NORM-row/1e308/subnormal/-0.0); pipeline consumer inherits validation. |
//! | REQ-3 (validating fit + parameter constraints) | SHIPPED | FIXED #1141. `impl Fit<Array2<F>, ()> for Normalizer` (`fit`): runs the SAME `validate_normalize_input` guard as `Transform::transform`/`normalize` (REQ-2: zero-samples → `InsufficientSamples`, zero-features/non-finite NaN±inf → `InvalidParameter`, sklearn `_validate_data` default `force_all_finite=True` REJECTS NaN/inf; confirmed `Normalizer().fit([[nan]])`/`[[inf]]` raise ValueError, `:2082`,`utils/validation.py:1063/1084/1093`), records `n_features_in_ = x.ncols()`, returns `FittedNormalizer { norm, copy, n_features_in_ }` (no fitted statistics: Normalizer is stateless, sklearn fit "Only validates", `:2062-2083`). sklearn's `_parameter_constraints {norm:[StrOptions{l1,l2,max}]}` (`:2053-2055`) has NO ferrolearn analog: `NormType` is a closed Rust enum, so an out-of-domain norm is UNREPRESENTABLE rather than runtime-rejected; the type system satisfies the param-domain check. Live-oracle tests: `fit_l1/l2/max_matches_oracle_and_stateless`, `fit_rejects_nan/pos_inf/neg_inf`, `fit_zero_row_unchanged`, `fitted_transform_shape_mismatch`, `fit_path_equals_stateless_path` in `tests/divergence_normalizer.rs`. Consumers: `FittedNormalizer::transform` (the fitted path) + crate re-export `lib.rs:140`. |
//! | REQ-4 (normalize free fn: axis / return_norm) | SHIPPED | FIXED #1142. `pub fn normalize` + `pub fn normalize_with_norms` (free fns) mirror sklearn `normalize(X, norm, *, axis=1, copy=True, return_norm=False)` (`_data.py:1866`). Shared `row_norm` helper computes L1=Σ\|v\|, L2=√Σv², Max=max\|v\| (`:1962-1967`); `_handle_zeros_in_scale` zero→1 (`:1968`); `X /= norms` (`:1969`). `axis=1` row-normalizes; `axis=0` column-normalizes (sklearn transpose `:1926-1942`,`:1971-1972`); `axis ∉ {0,1}` → `InvalidParameter`. `normalize_with_norms` returns `(normalized, raw_norms)` (return_norm `:1974-1975`; raw, NOT zero-handled). Same validation as `Transform::transform` (REQ-2). Oracle-grounded tests in `#[cfg(test)]`: `normalize_l2/l1/max_axis1_matches_sklearn`, `normalize_l2_axis0_matches_sklearn`, `normalize_return_norm_l2_and_l1`, `normalize_invalid_axis_errors`. |
//! | REQ-5 (copy parameter) | SHIPPED | FIXED #1143. `Normalizer<F>` gains a `copy: bool` field (default `true`) + `#[must_use] with_copy` builder + `copy()` getter, threaded onto `FittedNormalizer`, mirroring sklearn `__init__(norm='l2', *, copy=True)` (`_data.py:2058-2060`, `_parameter_constraints {copy:["boolean"]}` `:2055`). ACCEPT-AND-DOCUMENT no-op: ferrolearn's `Transform` always returns a freshly allocated array (`to_owned()`), so `copy` has no observable effect: `copy=True`/`copy=False` produce identical output (sklearn's `copy=False` does in-place row normalization, an optimization Rust's ownership makes moot here). Live-oracle test `fit_copy_true_false_identical`. Consumers: `FittedNormalizer` carries the flag + crate re-export `lib.rs:140`. |
//! | REQ-6 (n_features_in_ / feature names) | PARTIAL | `n_features_in_` SHIPPED, `get_feature_names_out` NOT-STARTED. `FittedNormalizer<F>` records `n_features_in_ = x.ncols()` in `fit` and exposes `pub fn n_features_in(&self) -> usize`, mirroring sklearn's `_validate_data` setting `n_features_in_` (`:2082`); `FittedNormalizer::transform` validates the input column count against it (`ShapeMismatch`, sklearn `_validate_data(reset=False)` `:2104`). The `OneToOneFeatureMixin.get_feature_names_out` / `feature_names_in_` string-name plumbing is OUT OF SCOPE for this build (no string feature-name infrastructure in ferrolearn yet); open prereq blocker #1144 for the feature-name half. Live-oracle test `fit_n_features_in_matches_ncols`. |
//! | REQ-7 (sparse support) | NOT-STARTED | open prereq blocker #1145. Dense-only; no CSR `inplace_csr_row_normalize_l1/l2` / `min_max_axis` Max (`:1944-1960`). |
//! | REQ-8 (PyO3 binding) | SHIPPED | FIXED #1146. `ferrolearn-python` surfaces `Normalizer` as `ferrolearn.Normalizer`: the hand-written `_RsNormalizer` `#[pyclass]` (`ferrolearn-python/src/extras.rs`, registered `lib.rs`) maps sklearn's `norm` STRING ('l1'/'l2'/'max') to the closed Rust `NormType` enum via `RsNormalizer::resolve_norm`; a bad string → `PyValueError` (sklearn `_parameter_constraints {norm: StrOptions({"l1","l2","max"})}`, `_data.py:2055`, `InvalidParameterError` ⊂ ValueError), builds `Normalizer::<f64>::new(normtype).with_copy(copy)`, runs the validating `Fit` (NaN/±inf → `PyValueError`, REQ-3) and delegates `transform` to `FittedNormalizer`. The non-test production consumer is `_extras.py::Normalizer(_TransformerWrapper)` with sklearn's `__init__(self, norm="l2", *, copy=True)` ABI (norm positional-or-keyword, copy keyword-only, `_data.py:2058`) + an overridden STATELESS `transform` (build-on-demand without fit, `_more_tags stateless=True` `_data.py:2110`, #2213) doing a FLOAT-ONLY dtype cast-back (float32→float32, float64→float64, int64→float64 UPCAST per `check_array(dtype=FLOAT_DTYPES)` `_data.py:2104`, #2214-analog; DIFFERS from Binarizer's number-preserving cast); re-exported in `__init__.py`. Verified vs the live sklearn 1.5.2 oracle: `tests/divergence_normalizer.py` (l1/l2/max values, default-l2, positional-norm, stateless, dtype, NaN/±inf, zero-norm, bad-norm, clone/get_params/set_params, copy no-op, pipeline). **Reduced-precision caveat (#2215, tracked):** sklearn `normalize` casts X to the INPUT float precision via `check_array(dtype=FLOAT_DTYPES)` (`_data.py:1933`) and computes the norm + division IN that precision (float16/float32), but the f64-only binding ABI (shared by EVERY `_Rs*` transformer) computes the norm in float64 then casts the result back, so float32 (~6e-8) and float16 (~5e-4) VALUES diverge slightly (dtype LABELS match; the float64 path is bit-exact, <1e-12). Same class as the generic-F precision caveats #2205/#2206; float16 is fundamentally unmatchable (the Rust core has no f16). Pinned `#[skip]` in `tests/divergence_normalizer_reduced_precision.py`. |
//! | REQ-9 (ferray substrate) | NOT-STARTED | open prereq blocker #1147. `ndarray::Array2` + `num_traits::Float`, not `ferray-core`/`ferray-ufunc` (R-SUBSTRATE-1/2). |

use ferrolearn_core::error::FerroError;
use ferrolearn_core::pipeline::{FittedPipelineTransformer, PipelineTransformer};
use ferrolearn_core::traits::{Fit, Transform};
use ndarray::{Array1, Array2, ArrayView1};
use num_traits::Float;

// ---------------------------------------------------------------------------
// NormType
// ---------------------------------------------------------------------------

/// The norm used by [`Normalizer`] when scaling each sample.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum NormType {
    /// L1 norm: sum of absolute values.
    L1,
    /// L2 norm: Euclidean norm (square root of sum of squares). This is the default.
    #[default]
    L2,
    /// Max norm: maximum absolute value in the sample.
    Max,
}

// ---------------------------------------------------------------------------
// Normalizer
// ---------------------------------------------------------------------------

/// A stateless row-wise normalizer.
///
/// Each sample (row) is independently scaled so that its chosen norm equals 1.
/// Samples with a zero norm are left unchanged.
///
/// This transformer is stateless: no [`Fit`](ferrolearn_core::traits::Fit)
/// step is needed. Call [`Transform::transform`] directly.
///
/// # Examples
///
/// ```
/// use ferrolearn_preprocess::normalizer::{Normalizer, NormType};
/// use ferrolearn_core::traits::Transform;
/// use ndarray::array;
///
/// let normalizer = Normalizer::<f64>::new(NormType::L2);
/// let x = array![[3.0, 4.0], [1.0, 0.0]];
/// let out = normalizer.transform(&x).unwrap();
/// // Row 0 has L2 norm 5, so it becomes [3/5, 4/5]; row 1 already has norm 1.
/// assert!((out[[0, 0]] - 0.6).abs() < 1e-12);
/// assert!((out[[0, 1]] - 0.8).abs() < 1e-12);
/// assert!((out[[1, 0]] - 1.0).abs() < 1e-12);
/// ```
#[derive(Debug, Clone)]
pub struct Normalizer<F> {
    /// The norm to use for normalisation.
    pub(crate) norm: NormType,
    /// sklearn's `copy` constructor parameter (`__init__(norm='l2', *, copy=True)`,
    /// `_data.py:2058-2060`; `_parameter_constraints {copy:["boolean"]}` `:2055`).
    /// ACCEPT-AND-DOCUMENT no-op: ferrolearn's [`Transform`] always returns a
    /// freshly allocated array, so `copy` has no observable effect. Retained for
    /// API parity. Defaults to `true`.
    pub(crate) copy: bool,
    _marker: std::marker::PhantomData<F>,
}

impl<F: Float + Send + Sync + 'static> Normalizer<F> {
    /// Create a new `Normalizer` with the specified norm type.
    #[must_use]
    pub fn new(norm: NormType) -> Self {
        Self {
            norm,
            copy: true,
            _marker: std::marker::PhantomData,
        }
    }

    /// Create a new `Normalizer` using the default L2 norm.
    #[must_use]
    pub fn l2() -> Self {
        Self::new(NormType::L2)
    }

    /// Create a new `Normalizer` using the L1 norm.
    #[must_use]
    pub fn l1() -> Self {
        Self::new(NormType::L1)
    }

    /// Create a new `Normalizer` using the Max norm.
    #[must_use]
    pub fn max() -> Self {
        Self::new(NormType::Max)
    }

    /// Return the configured norm type.
    #[must_use]
    pub fn norm(&self) -> NormType {
        self.norm
    }

    /// Set the `copy` parameter (sklearn `Normalizer(copy=...)`,
    /// `_data.py:2058`, `_parameter_constraints {copy:["boolean"]}` `:2055`).
    ///
    /// This is an ACCEPT-AND-DOCUMENT no-op: ferrolearn's [`Transform`] always
    /// returns a freshly allocated array, so `copy` has no observable effect on
    /// the output. It is retained for API parity with scikit-learn.
    #[must_use]
    pub fn with_copy(mut self, copy: bool) -> Self {
        self.copy = copy;
        self
    }

    /// Return the configured `copy` flag (sklearn `Normalizer.copy`).
    #[must_use]
    pub fn copy(&self) -> bool {
        self.copy
    }
}

impl<F: Float + Send + Sync + 'static> Default for Normalizer<F> {
    fn default() -> Self {
        Self::new(NormType::L2)
    }
}

// ---------------------------------------------------------------------------
// FittedNormalizer (sklearn stateful `fit` -> fitted estimator path)
// ---------------------------------------------------------------------------

/// A fitted [`Normalizer`].
///
/// `Normalizer` is stateless: its `fit` (sklearn `Normalizer.fit`,
/// `_data.py:2062-2083`, "Only validates estimator's parameters") learns NO
/// statistics; it merely validates the input and records `n_features_in_`. The
/// fitted type therefore carries only the configured `norm`, the `copy` flag,
/// and the recorded feature count. Its [`Transform::transform`] reuses the very
/// same row-norm logic as the stateless [`Normalizer`]/[`normalize`] path, so
/// the two paths are bit-identical.
#[derive(Debug, Clone)]
pub struct FittedNormalizer<F> {
    /// The norm to use for normalisation.
    pub(crate) norm: NormType,
    /// The `copy` flag carried from the unfitted [`Normalizer`] (no-op; see
    /// [`Normalizer::with_copy`]).
    pub(crate) copy: bool,
    /// Number of features (columns) seen during [`Fit::fit`]: sklearn's
    /// `n_features_in_` (`_data.py:2082`, set by `_validate_data`).
    pub(crate) n_features_in_: usize,
    _marker: std::marker::PhantomData<F>,
}

impl<F: Float + Send + Sync + 'static> FittedNormalizer<F> {
    /// Return the number of features (columns) seen during [`Fit::fit`].
    ///
    /// Mirrors scikit-learn's `Normalizer.n_features_in_` (`_data.py:2082`).
    #[must_use]
    pub fn n_features_in(&self) -> usize {
        self.n_features_in_
    }

    /// Return the configured norm type.
    #[must_use]
    pub fn norm(&self) -> NormType {
        self.norm
    }

    /// Return the configured `copy` flag (no-op; see [`Normalizer::with_copy`]).
    #[must_use]
    pub fn copy(&self) -> bool {
        self.copy
    }
}

impl<F: Float + Send + Sync + 'static> Fit<Array2<F>, ()> for Normalizer<F> {
    type Fitted = FittedNormalizer<F>;
    type Error = FerroError;

    /// Validate the input and record `n_features_in_`, returning a
    /// [`FittedNormalizer`].
    ///
    /// `Normalizer` is stateless: like scikit-learn's `Normalizer.fit`
    /// (`sklearn/preprocessing/_data.py:2062-2083`, "Only validates estimator's
    /// parameters"), this learns NO statistics. It runs the SAME `check_array`
    /// validation as [`Transform::transform`] / [`normalize`] (REQ-2, via the
    /// shared `validate_normalize_input` helper) and records
    /// `n_features_in_ = x.ncols()`. sklearn's `_validate_data` uses the default
    /// `force_all_finite=True`, so NaN/±inf are REJECTED in `fit`
    /// (`Normalizer().fit([[nan]])` / `[[inf]]` raise `ValueError`).
    ///
    /// # Errors
    ///
    /// Returns [`FerroError::InsufficientSamples`] for zero rows and
    /// [`FerroError::InvalidParameter`] for zero features or any non-finite
    /// value (NaN, +inf, -inf), matching `check_array`
    /// (`sklearn/utils/validation.py:1084`, `:1093`, `:1063`) as routed through
    /// `Normalizer.fit` -> `_validate_data` (`_data.py:2082`).
    fn fit(&self, x: &Array2<F>, _y: &()) -> Result<FittedNormalizer<F>, FerroError> {
        validate_normalize_input(x)?;
        Ok(FittedNormalizer {
            norm: self.norm,
            copy: self.copy,
            n_features_in_: x.ncols(),
            _marker: std::marker::PhantomData,
        })
    }
}

impl<F: Float + Send + Sync + 'static> Transform<Array2<F>> for FittedNormalizer<F> {
    type Output = Array2<F>;
    type Error = FerroError;

    /// Normalize each row of `x` to unit norm, delegating to the SAME row-norm
    /// logic as the stateless [`Normalizer`] / [`normalize`] path.
    ///
    /// Calls the shared [`normalize`] free function with `axis=1` (sklearn
    /// `Normalizer.transform` -> `normalize(X, norm=self.norm, axis=1)`,
    /// `:2106`), which applies the REQ-2 `check_array` guards first. Only then
    /// does it check that `x` has the column count recorded during
    /// [`Fit::fit`] (sklearn `_validate_data(reset=False)`,
    /// `sklearn/preprocessing/_data.py:2104`), so a non-finite or empty input
    /// reports its `check_array` error even when the column count is also
    /// wrong. On success the output is bit-identical to `Normalizer::transform`.
    ///
    /// # Errors
    ///
    /// Returns [`FerroError::InsufficientSamples`] for zero rows and
    /// [`FerroError::InvalidParameter`] for zero features or any non-finite
    /// value (REQ-2, via [`normalize`]). Returns [`FerroError::ShapeMismatch`]
    /// if the input passes those checks but its column count differs from
    /// `n_features_in_`.
    fn transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
        // sklearn `_validate_data(reset=False)` runs `check_array` (finite /
        // min-samples / min-features) BEFORE `_check_n_features` (`base.py:633`
        // then `:654`, #2207). So validate + normalize FIRST (this is
        // `check_array`'s job via the shared REQ-2 guard in `normalize`); a NaN /
        // +-inf / zero-sample / zero-feature input must raise its check_array
        // error EVEN when the column count is also wrong. Only after that does
        // the n_features comparison fire.
        let normalized = normalize(x, self.norm, 1)?;
        if x.ncols() != self.n_features_in_ {
            return Err(FerroError::ShapeMismatch {
                expected: vec![x.nrows(), self.n_features_in_],
                actual: vec![x.nrows(), x.ncols()],
                context: "FittedNormalizer::transform".into(),
            });
        }
        Ok(normalized)
    }
}

// ---------------------------------------------------------------------------
// Trait implementations
// ---------------------------------------------------------------------------

impl<F: Float + Send + Sync + 'static> Transform<Array2<F>> for Normalizer<F> {
    type Output = Array2<F>;
    type Error = FerroError;

    /// Normalize each row of `x` to unit norm.
    ///
    /// Rows with a zero norm value are left unchanged.
    ///
    /// # Errors
    ///
    /// Returns [`FerroError::InsufficientSamples`] if `x` has zero rows. This
    /// mirrors scikit-learn's `Normalizer.transform` ->
    /// `normalize` -> `check_array` (`sklearn/preprocessing/_data.py:1933`),
    /// whose min-samples check (`utils/validation.py:1084`,
    /// `ensure_min_samples=1`) raises `ValueError: Found array with 0 sample(s)
    /// ... while a minimum of 1 is required by Normalizer.`
    ///
    /// Returns [`FerroError::InvalidParameter`] if `x` has zero features
    /// (columns). This mirrors the same `check_array` min-features check
    /// (`utils/validation.py:1093`, `ensure_min_features=1`) which raises
    /// `ValueError: Found array with 0 feature(s) ... while a minimum of 1 is
    /// required by Normalizer.`
    ///
    /// Returns [`FerroError::InvalidParameter`] if `x` contains any non-finite
    /// value (NaN, +inf, or -inf). This mirrors `check_array(force_all_finite=
    /// True)` (`utils/validation.py:1063`), which raises `ValueError: Input X
    /// contains NaN.` / `Input X contains infinity ...` before normalizing.
    fn transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
        if x.nrows() == 0 {
            return Err(FerroError::InsufficientSamples {
                required: 1,
                actual: 0,
                context: "Normalizer::transform".into(),
            });
        }
        if x.ncols() == 0 {
            return Err(FerroError::InvalidParameter {
                name: "X".to_string(),
                reason: "Found array with 0 feature(s); a minimum of 1 is required \
                         by Normalizer"
                    .to_string(),
            });
        }
        if x.iter().any(|v| !v.is_finite()) {
            return Err(FerroError::InvalidParameter {
                name: "X".to_string(),
                reason: "Input X contains non-finite values (NaN or infinity); \
                         Normalizer requires all-finite input"
                    .to_string(),
            });
        }
        let mut out = x.to_owned();
        for mut row in out.rows_mut() {
            // Same per-row norm helper as the free `normalize` path.
            let norm_val = row_norm(row.view(), self.norm);
            if norm_val == F::zero() {
                // Zero-norm row: leave unchanged.
                continue;
            }
            for v in &mut row {
                *v = *v / norm_val;
            }
        }
        Ok(out)
    }
}

// ---------------------------------------------------------------------------
// Standalone `normalize` free function (sklearn `normalize`, `_data.py:1866`)
// ---------------------------------------------------------------------------

/// Compute the `norm` of a single 1-D slice (one row or one column).
///
/// Mirrors sklearn's dense `normalize` per-vector norms (`_data.py:1962-1967`):
/// L1 = Σ|v|, L2 = √Σv², Max = max|v|.
fn row_norm<F: Float>(row: ArrayView1<F>, norm: NormType) -> F {
    match norm {
        NormType::L1 => row.iter().copied().fold(F::zero(), |acc, v| acc + v.abs()),
        NormType::L2 => row
            .iter()
            .copied()
            .fold(F::zero(), |acc, v| acc + v * v)
            .sqrt(),
        NormType::Max => {
            row.iter().copied().fold(
                F::zero(),
                |acc, v| {
                    if v.abs() > acc { v.abs() } else { acc }
                },
            )
        }
    }
}

/// Run the shared `check_array` input validation (REQ-2) used by [`Fit::fit`]
/// and the free [`normalize`]/[`normalize_with_norms`] functions, in sklearn's
/// `check_array` order (zero-samples → zero-features → non-finite;
/// `sklearn/utils/validation.py:1084`, `:1093`, `:1063`). [`Normalizer`]'s
/// `transform` applies the same three checks inline, with its own error context.
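/**
# Examples

The check order described above can be sketched without `ndarray` or
`FerroError`. This is only an illustration under stated assumptions:
`CheckError` and `check_input` are hypothetical names, and the matrix is a
row-major `Vec<Vec<f64>>` that is assumed to be rectangular.

```rust
/// Hypothetical error type standing in for the crate's error enum.
#[derive(Debug, PartialEq)]
enum CheckError {
    NoSamples,
    NoFeatures,
    NonFinite,
}

/// Check a row-major matrix in the order zero samples, zero features,
/// then non-finite values (NaN, +inf, -inf).
fn check_input(rows: &[Vec<f64>]) -> Result<(), CheckError> {
    if rows.is_empty() {
        return Err(CheckError::NoSamples);
    }
    if rows[0].is_empty() {
        return Err(CheckError::NoFeatures);
    }
    if rows.iter().flatten().any(|v| !v.is_finite()) {
        return Err(CheckError::NonFinite);
    }
    Ok(())
}
```

Finite but extreme inputs such as `1e308`, `-0.0`, or an all-zero row pass
this sketch, since only NaN and infinities count as non-finite.
*/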
fn validate_normalize_input<F: Float>(x: &Array2<F>) -> Result<(), FerroError> {
    if x.nrows() == 0 {
        return Err(FerroError::InsufficientSamples {
            required: 1,
            actual: 0,
            context: "normalize".into(),
        });
    }
    if x.ncols() == 0 {
        return Err(FerroError::InvalidParameter {
            name: "X".to_string(),
            reason: "Found array with 0 feature(s); a minimum of 1 is required \
                     by the normalize function"
                .to_string(),
        });
    }
    if x.iter().any(|v| !v.is_finite()) {
        return Err(FerroError::InvalidParameter {
            name: "X".to_string(),
            reason: "Input X contains non-finite values (NaN or infinity); \
                     the normalize function requires all-finite input"
                .to_string(),
        });
    }
    Ok(())
}

/// Shared core of [`normalize`] / [`normalize_with_norms`]: validate `axis` and
/// input, then return the normalized array plus the per-axis **raw** norm vector.
///
/// The returned `norms` are the actual computed norms (NOT zero-handled): a
/// zero-norm row/column appears as `0.0` even though the division used `1`
/// (`_handle_zeros_in_scale`, `_data.py:1968`) to leave it unchanged. This
/// matches sklearn's `normalize(..., return_norm=True)` (`:1974-1975`).
fn normalize_inner<F: Float>(
    x: &Array2<F>,
    norm: NormType,
    axis: usize,
) -> Result<(Array2<F>, Array1<F>), FerroError> {
    if axis != 0 && axis != 1 {
        return Err(FerroError::InvalidParameter {
            name: "axis".into(),
            reason: "must be 0 or 1".into(),
        });
    }
    validate_normalize_input(x)?;

    let mut out = x.to_owned();
    if axis == 1 {
        // Row-normalize (sklearn default axis=1).
        let mut norms = Array1::<F>::zeros(out.nrows());
        for (i, mut row) in out.rows_mut().into_iter().enumerate() {
            let n = row_norm(row.view(), norm);
            norms[i] = n;
            // _handle_zeros_in_scale: a zero norm divides by 1 (row unchanged).
            let eff = if n == F::zero() { F::one() } else { n };
            for v in &mut row {
                *v = *v / eff;
            }
        }
        Ok((out, norms))
    } else {
        // axis == 0: column-normalize. sklearn transposes, runs the axis=1
        // path, then transposes back (`_data.py:1926-1942`, `:1971-1972`).
        let mut norms = Array1::<F>::zeros(out.ncols());
        for (j, mut col) in out.columns_mut().into_iter().enumerate() {
            let n = row_norm(col.view(), norm);
            norms[j] = n;
            let eff = if n == F::zero() { F::one() } else { n };
            for v in &mut col {
                *v = *v / eff;
            }
        }
        Ok((out, norms))
    }
}

/// Scale input vectors individually to unit norm: the standalone, estimator-less
/// API mirroring scikit-learn's `normalize` free function
/// (`sklearn/preprocessing/_data.py:1866`).
///
/// With `axis == 1` (sklearn's default) each **row** (sample) is divided by its
/// `norm` (L1 = Σ|v|, L2 = √Σv², Max = max|v|); with `axis == 0` each **column**
/// (feature) is normalized instead (sklearn transposes, row-normalizes, and
/// transposes back; `:1926-1942`, `:1971-1972`). A row/column whose norm is zero
/// is left unchanged, matching `_handle_zeros_in_scale` (`:1968`).
///
/// # Errors
///
/// Returns [`FerroError::InvalidParameter`] if `axis` is not `0` or `1`. Also
/// applies the same `check_array` input validation as [`Normalizer`]'s
/// `transform` (REQ-2): [`FerroError::InsufficientSamples`] for zero rows, and
/// [`FerroError::InvalidParameter`] for zero features or any non-finite value
/// (`_data.py:1933-1940`).
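/**
# Examples

The `axis` semantics can be sketched with plain vectors. This is a
dependency-free illustration, not this function's implementation:
`l2_normalize` is a hypothetical helper restricted to the L2 norm, and it
assumes a rectangular, non-empty, all-finite row-major matrix.

```rust
/// L2-normalize a row-major matrix along `axis`: 1 scales each row,
/// 0 scales each column. Returns `None` for any other axis.
fn l2_normalize(x: &[Vec<f64>], axis: usize) -> Option<Vec<Vec<f64>>> {
    let (n_rows, n_cols) = (x.len(), x[0].len());
    // One norm per row (axis 1) or per column (axis 0).
    let norms: Vec<f64> = match axis {
        1 => x
            .iter()
            .map(|r| r.iter().map(|v| v * v).sum::<f64>().sqrt())
            .collect(),
        0 => (0..n_cols)
            .map(|j| (0..n_rows).map(|i| x[i][j] * x[i][j]).sum::<f64>().sqrt())
            .collect(),
        _ => return None,
    };
    // A zero norm divides by 1, which leaves that row or column unchanged.
    let eff = |n: f64| if n == 0.0 { 1.0 } else { n };
    let out: Vec<Vec<f64>> = x
        .iter()
        .enumerate()
        .map(|(i, r)| {
            r.iter()
                .enumerate()
                .map(|(j, v)| v / eff(norms[if axis == 1 { i } else { j }]))
                .collect::<Vec<f64>>()
        })
        .collect();
    Some(out)
}
```

With `X = [[1, 2, 2], [0, 3, 4]]`, `axis = 1` divides the rows by 3 and 5,
while `axis = 0` divides the columns by 1, √13 and √20.
*/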
#[must_use = "normalize returns a new array; the input is not modified"]
pub fn normalize<F: Float>(
    x: &Array2<F>,
    norm: NormType,
    axis: usize,
) -> Result<Array2<F>, FerroError> {
    let (out, _norms) = normalize_inner(x, norm, axis)?;
    Ok(out)
}

/// Like [`normalize`] but also returns the per-axis norm vector: the
/// `return_norm=True` form of scikit-learn's `normalize`
/// (`sklearn/preprocessing/_data.py:1971-1975`).
///
/// Returns `(normalized, norms)` where `norms` is the per-row vector for
/// `axis == 1` (length = n_rows) or the per-column vector for `axis == 0`
/// (length = n_cols). The norms are the **raw** computed norms, NOT
/// zero-handled: a zero norm appears as `0.0` in the returned vector even though
/// the division used `1` to leave that row/column unchanged (sklearn returns the
/// raw `norms` array; `:1974-1975`).
///
/// # Errors
///
/// Same as [`normalize`].
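/**
# Examples

The difference between the raw norms that are returned and the divisors that
are actually applied can be sketched as follows. This is an illustration only:
`l1_rows_with_norms` is a hypothetical helper fixed to the L1 norm and row-wise
scaling, working on a row-major `Vec<Vec<f64>>` instead of `ndarray` types.

```rust
/// L1-normalize each row and also return the raw per-row norms.
/// A zero row keeps norm `0.0` in the returned vector, while the
/// division uses `1.0` so the row itself is left unchanged.
fn l1_rows_with_norms(x: &[Vec<f64>]) -> (Vec<Vec<f64>>, Vec<f64>) {
    let norms: Vec<f64> = x
        .iter()
        .map(|r| r.iter().map(|v| v.abs()).sum::<f64>())
        .collect();
    let out: Vec<Vec<f64>> = x
        .iter()
        .zip(&norms)
        .map(|(r, &n)| {
            // Effective divisor: 1 for a zero norm, the norm otherwise.
            let eff = if n == 0.0 { 1.0 } else { n };
            r.iter().map(|v| v / eff).collect::<Vec<f64>>()
        })
        .collect();
    (out, norms)
}
```

For the rows `[0, 0]` and `[1, -3]` the returned norms are `0` and `4`: the
first row is passed through untouched and the second is divided by 4.
*/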
#[must_use = "normalize_with_norms returns a new array and the norm vector"]
pub fn normalize_with_norms<F: Float>(
    x: &Array2<F>,
    norm: NormType,
    axis: usize,
) -> Result<(Array2<F>, Array1<F>), FerroError> {
    normalize_inner(x, norm, axis)
}

// ---------------------------------------------------------------------------
// Pipeline integration (generic)
// ---------------------------------------------------------------------------

impl<F: Float + Send + Sync + 'static> PipelineTransformer<F> for Normalizer<F> {
    /// Fit the normalizer using the pipeline interface.
    ///
    /// Because `Normalizer` is stateless, this simply boxes `self` as a
    /// [`FittedPipelineTransformer`].
    ///
    /// # Errors
    ///
    /// This implementation never returns an error.
    fn fit_pipeline(
        &self,
        _x: &Array2<F>,
        _y: &Array1<F>,
    ) -> Result<Box<dyn FittedPipelineTransformer<F>>, FerroError> {
        Ok(Box::new(self.clone()))
    }
}

impl<F: Float + Send + Sync + 'static> FittedPipelineTransformer<F> for Normalizer<F> {
    /// Transform data using the pipeline interface.
    ///
    /// # Errors
    ///
    /// Propagates errors from [`Transform::transform`].
    fn transform_pipeline(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
        self.transform(x)
    }
}

// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------

#[cfg(test)]
mod tests {
    use super::*;
    use approx::assert_abs_diff_eq;
    use ndarray::array;

    #[test]
    fn test_l2_norm_basic() {
        let norm = Normalizer::<f64>::l2();
        // Row [3, 4] has L2 norm 5.
        let x = array![[3.0, 4.0]];
        let out = norm.transform(&x).unwrap();
        assert_abs_diff_eq!(out[[0, 0]], 0.6, epsilon = 1e-10);
        assert_abs_diff_eq!(out[[0, 1]], 0.8, epsilon = 1e-10);
    }

    #[test]
    fn test_l2_unit_norm_after_transform() {
        let norm = Normalizer::<f64>::l2();
        let x = array![[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]];
        let out = norm.transform(&x).unwrap();
        for row in out.rows() {
            let row_norm: f64 = row.iter().map(|v| v * v).sum::<f64>().sqrt();
            assert_abs_diff_eq!(row_norm, 1.0, epsilon = 1e-10);
        }
    }

    #[test]
    fn test_l1_norm_basic() {
        let norm = Normalizer::<f64>::l1();
        // Row [1, 2, 3] has L1 norm 6.
        let x = array![[1.0, 2.0, 3.0]];
        let out = norm.transform(&x).unwrap();
        assert_abs_diff_eq!(out[[0, 0]], 1.0 / 6.0, epsilon = 1e-10);
        assert_abs_diff_eq!(out[[0, 1]], 2.0 / 6.0, epsilon = 1e-10);
        assert_abs_diff_eq!(out[[0, 2]], 3.0 / 6.0, epsilon = 1e-10);
    }

    #[test]
    fn test_l1_unit_norm_after_transform() {
        let norm = Normalizer::<f64>::l1();
        let x = array![[1.0, 2.0, 3.0], [-4.0, 5.0, 6.0]];
        let out = norm.transform(&x).unwrap();
        for row in out.rows() {
            let row_norm: f64 = row.iter().map(|v| v.abs()).sum();
            assert_abs_diff_eq!(row_norm, 1.0, epsilon = 1e-10);
        }
    }

    #[test]
    fn test_max_norm_basic() {
        let norm = Normalizer::<f64>::max();
        // Row [-5, 3, 1] has max norm 5.
        let x = array![[-5.0, 3.0, 1.0]];
        let out = norm.transform(&x).unwrap();
        assert_abs_diff_eq!(out[[0, 0]], -1.0, epsilon = 1e-10);
        assert_abs_diff_eq!(out[[0, 1]], 0.6, epsilon = 1e-10);
        assert_abs_diff_eq!(out[[0, 2]], 0.2, epsilon = 1e-10);
    }

    #[test]
    fn test_zero_row_unchanged() {
        let norm = Normalizer::<f64>::l2();
        let x = array![[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]];
        let out = norm.transform(&x).unwrap();
        // Zero row stays zero
        assert_abs_diff_eq!(out[[0, 0]], 0.0, epsilon = 1e-15);
        assert_abs_diff_eq!(out[[0, 1]], 0.0, epsilon = 1e-15);
        assert_abs_diff_eq!(out[[0, 2]], 0.0, epsilon = 1e-15);
    }

    #[test]
    fn test_negative_values_l2() {
        let norm = Normalizer::<f64>::l2();
        let x = array![[-3.0, -4.0]];
        let out = norm.transform(&x).unwrap();
        assert_abs_diff_eq!(out[[0, 0]], -0.6, epsilon = 1e-10);
        assert_abs_diff_eq!(out[[0, 1]], -0.8, epsilon = 1e-10);
    }

    #[test]
    fn test_default_is_l2() {
        let norm = Normalizer::<f64>::default();
        assert_eq!(norm.norm(), NormType::L2);
    }

    #[test]
    fn test_multiple_rows_independent() {
        let norm = Normalizer::<f64>::l2();
        let x = array![[3.0, 4.0], [0.0, 5.0]];
        let out = norm.transform(&x).unwrap();
        // Row 0: L2 norm = 5
        assert_abs_diff_eq!(out[[0, 0]], 0.6, epsilon = 1e-10);
        assert_abs_diff_eq!(out[[0, 1]], 0.8, epsilon = 1e-10);
        // Row 1: L2 norm = 5
        assert_abs_diff_eq!(out[[1, 0]], 0.0, epsilon = 1e-10);
        assert_abs_diff_eq!(out[[1, 1]], 1.0, epsilon = 1e-10);
    }

    #[test]
    fn test_pipeline_integration() {
        use ferrolearn_core::pipeline::PipelineTransformer;
        let norm = Normalizer::<f64>::l2();
        let x = array![[3.0, 4.0], [0.0, 2.0]];
        let y = Array1::zeros(2);
        let fitted = norm.fit_pipeline(&x, &y).unwrap();
        let result = fitted.transform_pipeline(&x).unwrap();
        assert_abs_diff_eq!(result[[0, 0]], 0.6, epsilon = 1e-10);
        assert_abs_diff_eq!(result[[0, 1]], 0.8, epsilon = 1e-10);
    }

    #[test]
    fn test_f32_normalizer() {
        let norm = Normalizer::<f32>::l2();
        let x: Array2<f32> = array![[3.0f32, 4.0]];
        let out = norm.transform(&x).unwrap();
        assert!((out[[0, 0]] - 0.6f32).abs() < 1e-6);
        assert!((out[[0, 1]] - 0.8f32).abs() < 1e-6);
    }

    // -----------------------------------------------------------------------
    // REQ-4: standalone `normalize` / `normalize_with_norms` free functions.
    // Oracle: live sklearn 1.5.2 (R-CHAR-3), X = [[1,2,2],[0,3,4]].
    //   normalize(X, l2, axis=1)  -> [[.33333333,.66666667,.66666667],[0,.6,.8]]
    //   normalize(X, l1, axis=1)  -> [[.2,.4,.4],[0,.42857143,.57142857]]
    //   normalize(X, max, axis=1) -> [[.5,1,1],[0,.75,1]]
    //   normalize(X, l2, axis=0)  -> [[1,.5547002,.4472136],[0,.83205029,.89442719]]
    //   return_norm l2 axis=1 norms -> [3,5]; l1 axis=1 norms -> [5,7]
    // -----------------------------------------------------------------------

    #[test]
    fn normalize_l2_axis1_matches_sklearn() -> Result<(), FerroError> {
        let x = array![[1.0, 2.0, 2.0], [0.0, 3.0, 4.0]];
        let out = normalize(&x, NormType::L2, 1)?;
        let expected = array![[0.33333333, 0.66666667, 0.66666667], [0.0, 0.6, 0.8]];
        for (a, b) in out.iter().zip(expected.iter()) {
            assert_abs_diff_eq!(a, b, epsilon = 1e-7);
        }
        Ok(())
    }

    #[test]
    fn normalize_l1_axis1_matches_sklearn() -> Result<(), FerroError> {
        let x = array![[1.0, 2.0, 2.0], [0.0, 3.0, 4.0]];
        let out = normalize(&x, NormType::L1, 1)?;
        let expected = array![[0.2, 0.4, 0.4], [0.0, 0.42857143, 0.57142857]];
        for (a, b) in out.iter().zip(expected.iter()) {
            assert_abs_diff_eq!(a, b, epsilon = 1e-7);
        }
        Ok(())
    }

    #[test]
    fn normalize_max_axis1_matches_sklearn() -> Result<(), FerroError> {
        let x = array![[1.0, 2.0, 2.0], [0.0, 3.0, 4.0]];
        let out = normalize(&x, NormType::Max, 1)?;
        let expected = array![[0.5, 1.0, 1.0], [0.0, 0.75, 1.0]];
        for (a, b) in out.iter().zip(expected.iter()) {
            assert_abs_diff_eq!(a, b, epsilon = 1e-7);
        }
        Ok(())
    }

    #[test]
    fn normalize_l2_axis0_matches_sklearn() -> Result<(), FerroError> {
        let x = array![[1.0, 2.0, 2.0], [0.0, 3.0, 4.0]];
        let out = normalize(&x, NormType::L2, 0)?;
        let expected = array![[1.0, 0.5547002, 0.4472136], [0.0, 0.83205029, 0.89442719]];
        for (a, b) in out.iter().zip(expected.iter()) {
            assert_abs_diff_eq!(a, b, epsilon = 1e-7);
        }
        Ok(())
    }

    #[test]
    fn normalize_return_norm_l2_and_l1() -> Result<(), FerroError> {
        let x = array![[1.0, 2.0, 2.0], [0.0, 3.0, 4.0]];

        let (_out_l2, norms_l2) = normalize_with_norms(&x, NormType::L2, 1)?;
        assert_abs_diff_eq!(norms_l2[0], 3.0, epsilon = 1e-9);
        assert_abs_diff_eq!(norms_l2[1], 5.0, epsilon = 1e-9);

        let (_out_l1, norms_l1) = normalize_with_norms(&x, NormType::L1, 1)?;
        assert_abs_diff_eq!(norms_l1[0], 5.0, epsilon = 1e-9);
        assert_abs_diff_eq!(norms_l1[1], 7.0, epsilon = 1e-9);
        Ok(())
    }

    #[test]
    fn normalize_invalid_axis_errors() {
        let x = array![[1.0, 2.0, 2.0], [0.0, 3.0, 4.0]];
        let err = normalize(&x, NormType::L2, 2);
        assert!(matches!(err, Err(FerroError::InvalidParameter { .. })));
    }
}