Module normalizer


Normalizer: scale each sample (row) to unit norm.

Unlike column-wise scalers, the Normalizer operates row-wise: each sample is scaled independently so that its chosen norm equals 1.

Supported norms:

L1: divide by the sum of absolute values
L2: divide by the Euclidean norm (default)
Max: divide by the maximum absolute value

Samples whose norm is zero are left unchanged (the divisor is treated as 1).
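The three norms and the zero-norm rule can be illustrated with a minimal, std-only sketch. It does not use the crate's `ndarray`-based types: `Norm`, `row_norm`, and `normalize_rows` are hypothetical names chosen for illustration, and rows are plain `Vec<f64>`.

```rust
/// Hypothetical stand-in for the module's `NormType` enum.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Norm {
    L1,
    L2,
    Max,
}

/// Norm of a single row: L1 = sum of |v|, L2 = sqrt of sum of v^2, Max = max |v|.
fn row_norm(row: &[f64], norm: Norm) -> f64 {
    match norm {
        Norm::L1 => row.iter().map(|v| v.abs()).sum(),
        Norm::L2 => row.iter().map(|v| v * v).sum::<f64>().sqrt(),
        Norm::Max => row.iter().fold(0.0_f64, |m, v| m.max(v.abs())),
    }
}

/// Scale each row independently; a zero-norm row is divided by 1 and so stays unchanged.
fn normalize_rows(rows: &[Vec<f64>], norm: Norm) -> Vec<Vec<f64>> {
    rows.iter()
        .map(|row| {
            let n = row_norm(row, norm);
            let scale = if n == 0.0 { 1.0 } else { n };
            row.iter().map(|v| v / scale).collect()
        })
        .collect()
}
```

Under these assumptions, `normalize_rows(&[vec![3.0, 4.0]], Norm::L2)` yields the row `[0.6, 0.8]`, and an all-zero row comes back untouched.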

This transformer is stateless: no fitting is required, and Transform::transform can be called directly. For scikit-learn API parity it also supports the stateful Fit → FittedNormalizer path, which records n_features_in_ and (like sklearn) validates the input in fit. The fitted type’s transform reuses the same row-norm logic as the stateless path, so both paths produce bit-identical output.
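The relationship between the two paths can be sketched as below. This is a std-only illustration, not the crate's API: `StatelessSketch`, `FittedSketch`, `SketchError`, and `l2_normalize_rows` are hypothetical names, only the L2 norm is shown, and the input validation that the real `fit` performs is omitted for brevity.

```rust
/// Hypothetical error type for this sketch (not the crate's real error enum).
#[derive(Debug, PartialEq)]
enum SketchError {
    ShapeMismatch { expected: usize, actual: usize },
}

/// Shared L2 row logic; both paths call this, which is why their outputs agree.
fn l2_normalize_rows(rows: &[Vec<f64>]) -> Vec<Vec<f64>> {
    rows.iter()
        .map(|row| {
            let norm = row.iter().map(|v| v * v).sum::<f64>().sqrt();
            let scale = if norm == 0.0 { 1.0 } else { norm };
            row.iter().map(|v| v / scale).collect()
        })
        .collect()
}

struct StatelessSketch;

struct FittedSketch {
    n_features_in: usize,
}

impl StatelessSketch {
    /// Stateless path: no fit required.
    fn transform(&self, rows: &[Vec<f64>]) -> Vec<Vec<f64>> {
        l2_normalize_rows(rows)
    }

    /// Stateful path: only records the column count, no statistics.
    fn fit(&self, rows: &[Vec<f64>]) -> FittedSketch {
        FittedSketch {
            n_features_in: rows.first().map_or(0, |r| r.len()),
        }
    }
}

impl FittedSketch {
    /// Checks the column count recorded at fit time, then reuses the shared logic.
    fn transform(&self, rows: &[Vec<f64>]) -> Result<Vec<Vec<f64>>, SketchError> {
        for row in rows {
            if row.len() != self.n_features_in {
                return Err(SketchError::ShapeMismatch {
                    expected: self.n_features_in,
                    actual: row.len(),
                });
            }
        }
        Ok(l2_normalize_rows(rows))
    }
}
```

Because both `transform` methods delegate to the same helper, the only behavioural difference in this sketch is the shape check on the fitted path.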

§REQ status

Binary (R-DEFER-2), translating sklearn/preprocessing/_data.py (class Normalizer :1980, normalize :1866). Design doc: .design/preprocess/normalizer.md. Expected values from the live sklearn 1.5.2 oracle (R-CHAR-3). Consumers: the in-file PipelineTransformer/FittedPipelineTransformer impls (pipeline integration) + crate re-export (lib.rs:119, grandfathered S5). PyO3 binding: ferrolearn-python (REQ-8).

REQ	Status	Evidence
REQ-1 (row-wise L1/L2/Max transform)	SHIPPED	`Transform::transform` divides each row by its norm (L1=Σ|v|, L2=√Σv², Max=max|v|; zero-norm row unchanged), default L2; mirrors sklearn dense `normalize` (`_data.py:1962-1969`, `_handle_zeros_in_scale` `:1968`). Critic-verified bit-identical to live oracle: `guard_l1/l2/max/zero_row/f32_matches_oracle` in `tests/divergence_normalizer.rs`. Consumers: `FittedPipelineTransformer::transform_pipeline` + crate re-export `lib.rs:119`.
REQ-2 (transform input validation per check_array)	SHIPPED	FIXED #1140. `transform` guards (sklearn order) zero-samples → `InsufficientSamples` (`validation.py:1084`), zero-features → `InvalidParameter` (`:1093`), non-finite NaN/±inf → `InvalidParameter` (`:1063`) — matching `Normalizer.transform` → `normalize` → `check_array` (`_data.py:1933-1940`). Mirrors converged `binarizer.rs`. Critic two-round CLEAN: 6 rejection pins + finite-not-over-rejected guards (zero-NORM-row/1e308/subnormal/-0.0); pipeline consumer inherits validation.
REQ-3 (validating fit + parameter constraints)	SHIPPED	FIXED #1141. `impl Fit<Array2<F>, ()> for Normalizer` (`fit`): runs the SAME `validate_normalize_input` guard as `Transform::transform`/`normalize` (REQ-2: zero-samples → `InsufficientSamples`, zero-features/non-finite NaN±inf → `InvalidParameter`, sklearn `_validate_data` default `force_all_finite=True` REJECTS NaN/inf — confirmed `Normalizer().fit([[nan]])`/`[[inf]]` raise ValueError, `:2082`,`utils/validation.py:1063/1084/1093`), records `n_features_in_ = x.ncols()`, returns `FittedNormalizer { norm, copy, n_features_in_ }` (no fitted statistics — Normalizer is stateless, sklearn fit “Only validates”, `:2062-2083`). sklearn’s `_parameter_constraints {norm:[StrOptions{l1,l2,max}]}` (`:2053-2055`) has NO ferrolearn analog: `NormType` is a closed Rust enum, so an out-of-domain norm is UNREPRESENTABLE rather than runtime-rejected — the type system satisfies the param-domain check. Live-oracle tests: `fit_l1/l2/max_matches_oracle_and_stateless`, `fit_rejects_nan/pos_inf/neg_inf`, `fit_zero_row_unchanged`, `fitted_transform_shape_mismatch`, `fit_path_equals_stateless_path` in `tests/divergence_normalizer.rs`. Consumers: `FittedNormalizer::transform` (the fitted path) + crate re-export `lib.rs:140`.
REQ-4 (normalize free fn: axis / return_norm)	SHIPPED	FIXED #1142. `pub fn normalize` + `pub fn normalize_with_norms` (free fns) mirror sklearn `normalize(X, norm, *, axis=1, copy=True, return_norm=False)` (`_data.py:1866`). Shared `row_norm` helper computes L1=Σ|v|, L2=√Σv², Max=max|v| (`:1962-1967`); `_handle_zeros_in_scale` zero→1 (`:1968`); `X /= norms` (`:1969`). `axis=1` row-normalizes; `axis=0` column-normalizes (sklearn transpose `:1926-1942`,`:1971-1972`); `axis ∉ {0,1}` → `InvalidParameter`. `normalize_with_norms` returns `(normalized, raw_norms)` (return_norm `:1974-1975`; raw, NOT zero-handled). Same validation as `Transform::transform` (REQ-2). Oracle-grounded tests in `#[cfg(test)]`: `normalize_l2/l1/max_axis1_matches_sklearn`, `normalize_l2_axis0_matches_sklearn`, `normalize_return_norm_l2_and_l1`, `normalize_invalid_axis_errors`.
REQ-5 (copy parameter)	SHIPPED	FIXED #1143. `Normalizer<F>` gains a `copy: bool` field (default `true`) + `#[must_use] with_copy` builder + `copy()` getter, threaded onto `FittedNormalizer`, mirroring sklearn `__init__(norm='l2', *, copy=True)` (`_data.py:2058-2060`, `_parameter_constraints {copy:["boolean"]}` `:2055`). ACCEPT-AND-DOCUMENT no-op: ferrolearn’s `Transform` always returns a freshly allocated array (`to_owned()`), so `copy` has no observable effect — `copy=True`/`copy=False` produce identical output (sklearn’s `copy=False` does in-place row normalization, an optimization Rust’s ownership makes moot here). Live-oracle test `fit_copy_true_false_identical`. Consumers: `FittedNormalizer` carries the flag + crate re-export `lib.rs:140`.
REQ-6 (n_features_in_ / feature names)	PARTIAL	`n_features_in_` SHIPPED, `get_feature_names_out` NOT-STARTED. `FittedNormalizer<F>` records `n_features_in_ = x.ncols()` in `fit` and exposes `pub fn n_features_in(&self) -> usize`, mirroring sklearn’s `_validate_data` setting `n_features_in_` (`:2082`); `FittedNormalizer::transform` validates the input column count against it (`ShapeMismatch`, sklearn `_validate_data(reset=False)` `:2104`). The `OneToOneFeatureMixin.get_feature_names_out` / `feature_names_in_` string-name plumbing is OUT OF SCOPE for this build (no string feature-name infrastructure in ferrolearn yet) — open prereq blocker #1144 for the feature-name half. Live-oracle test `fit_n_features_in_matches_ncols`.
REQ-7 (sparse support)	NOT-STARTED	open prereq blocker #1145. Dense-only; no CSR `inplace_csr_row_normalize_l1/l2` / `min_max_axis` Max (`:1944-1960`).
REQ-8 (PyO3 binding)	SHIPPED	FIXED #1146. `ferrolearn-python` surfaces `Normalizer` as `ferrolearn.Normalizer`: the hand-written `_RsNormalizer` `#[pyclass]` (`ferrolearn-python/src/extras.rs`, registered `lib.rs`) maps sklearn’s `norm` STRING (‘l1’/‘l2’/‘max’) to the closed Rust `NormType` enum via `RsNormalizer::resolve_norm`, where a bad string → `PyValueError` (sklearn `_parameter_constraints {norm: StrOptions({"l1","l2","max"})}`, `_data.py:2055`, `InvalidParameterError` ⊂ ValueError), builds `Normalizer::<f64>::new(normtype).with_copy(copy)`, runs the validating `Fit` (NaN/±inf → `PyValueError`, REQ-3) and delegates `transform` to `FittedNormalizer`. The non-test production consumer is `_extras.py::Normalizer(_TransformerWrapper)` with sklearn’s `__init__(self, norm="l2", *, copy=True)` ABI (norm positional-or-keyword, copy keyword-only, `_data.py:2058`) + an overridden STATELESS `transform` (build-on-demand without fit, `_more_tags stateless=True` `_data.py:2110`, #2213) doing a FLOAT-ONLY dtype cast-back (float32→float32, float64→float64, int64→float64 UPCAST per `check_array(dtype=FLOAT_DTYPES)` `_data.py:2104`, #2214-analog; DIFFERS from Binarizer’s number-preserving cast); re-exported in `__init__.py`. Verified vs the live sklearn 1.5.2 oracle: `tests/divergence_normalizer.py` (l1/l2/max values, default-l2, positional-norm, stateless, dtype, NaN/±inf, zero-norm, bad-norm, clone/get_params/set_params, copy no-op, pipeline). Reduced-precision caveat (#2215, tracked): sklearn `normalize` casts X to the INPUT float precision via `check_array(dtype=FLOAT_DTYPES)` (`_data.py:1933`) and computes the norm + division IN that precision (float16/float32), but the f64-only binding ABI (shared by EVERY `_Rs*` transformer) computes the norm in float64 then casts the result back, so float32 (~6e-8) and float16 (~5e-4) VALUES diverge slightly (dtype LABELS match; the float64 path is bit-exact, <1e-12). Same class as the generic-F precision caveats #2205/#2206; float16 is fundamentally unmatchable (the Rust core has no f16). Pinned `#[skip]` in `tests/divergence_normalizer_reduced_precision.py`.
REQ-9 (ferray substrate)	NOT-STARTED	open prereq blocker #1147. `ndarray::Array2` + `num_traits::Float`, not `ferray-core`/`ferray-ufunc` (R-SUBSTRATE-1/2).
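
The guard order listed under REQ-2 (zero samples, then zero features, then non-finite values) can be sketched as follows. This is an illustration only: `ValidationError` and `validate_rows` are hypothetical names, the error payloads are invented for readability, and rows are plain `Vec<f64>` rather than `Array2`.

```rust
/// Hypothetical error enum mirroring the two variant names in the table.
#[derive(Debug, PartialEq)]
enum ValidationError {
    InsufficientSamples,
    InvalidParameter(&'static str),
}

/// Checks run in the order given for REQ-2; the first failure wins.
fn validate_rows(rows: &[Vec<f64>]) -> Result<(), ValidationError> {
    // 1. No rows at all.
    if rows.is_empty() {
        return Err(ValidationError::InsufficientSamples);
    }
    // 2. Rows exist but have no columns.
    if rows[0].is_empty() {
        return Err(ValidationError::InvalidParameter("zero features"));
    }
    // 3. Any NaN, +inf or -inf entry.
    if rows.iter().flatten().any(|v| !v.is_finite()) {
        return Err(ValidationError::InvalidParameter("non-finite value"));
    }
    Ok(())
}
```

Finite edge cases such as an all-zero row, `1e308`, subnormals, and `-0.0` pass this check, which matches the "finite-not-over-rejected" guards named in the table.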

Structs§

FittedNormalizer: A fitted Normalizer.
Normalizer: A stateless row-wise normalizer.

Enums§

NormType: The norm used by Normalizer when scaling each sample.
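
To illustrate why a closed enum removes the need for a runtime domain check, and where a string such as sklearn's 'l1'/'l2'/'max' would still have to be parsed, here is a small std-only sketch. `NormKind` and `describe` are hypothetical stand-ins; the real `NormType` may differ in variant names and derives.

```rust
use std::str::FromStr;

/// Hypothetical stand-in for `NormType`: only three values are representable.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum NormKind {
    L1,
    L2,
    Max,
}

impl FromStr for NormKind {
    type Err = String;

    /// String input is the only place an out-of-domain norm can appear,
    /// so this is the only place that needs to reject one.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "l1" => Ok(NormKind::L1),
            "l2" => Ok(NormKind::L2),
            "max" => Ok(NormKind::Max),
            other => Err(format!("norm must be 'l1', 'l2' or 'max', got {other:?}")),
        }
    }
}

/// The match is exhaustive, so no "unknown norm" branch is needed.
fn describe(norm: NormKind) -> &'static str {
    match norm {
        NormKind::L1 => "sum of absolute values",
        NormKind::L2 => "Euclidean norm",
        NormKind::Max => "maximum absolute value",
    }
}
```

Code that already holds a `NormKind` never needs to validate it; only a boundary that accepts strings does.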

Functions§

normalize: Scale input vectors individually to unit norm — the standalone, estimator-less API mirroring scikit-learn’s normalize free function (sklearn/preprocessing/_data.py:1866).
normalize_with_norms: Like normalize but also returns the per-axis norm vector — the return_norm=True form of scikit-learn’s normalize (sklearn/preprocessing/_data.py:1971-1975).
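
The axis handling and the raw-norm return value can be sketched as below. This is a simplified, std-only illustration restricted to the L2 norm: `normalize_l2_with_norms` is a hypothetical name, the error is a plain `String` rather than the crate's error type, input validation is omitted, and the input is assumed to be rectangular.

```rust
/// L2 norm of a sequence of values.
fn l2(values: impl Iterator<Item = f64>) -> f64 {
    values.map(|v| v * v).sum::<f64>().sqrt()
}

/// axis = 1 normalizes rows, axis = 0 normalizes columns; any other axis is an error.
/// The returned norms are raw: a zero norm is reported as 0.0 even though the
/// division itself uses 1.0 in its place.
fn normalize_l2_with_norms(
    rows: &[Vec<f64>],
    axis: usize,
) -> Result<(Vec<Vec<f64>>, Vec<f64>), String> {
    let n_cols = rows.first().map_or(0, |r| r.len());
    let norms: Vec<f64> = match axis {
        1 => rows.iter().map(|r| l2(r.iter().copied())).collect(),
        0 => (0..n_cols).map(|j| l2(rows.iter().map(|r| r[j]))).collect(),
        other => return Err(format!("axis must be 0 or 1, got {other}")),
    };
    // Zero norms are replaced by 1.0 for the division only.
    let safe = |n: f64| if n == 0.0 { 1.0 } else { n };
    let out: Vec<Vec<f64>> = rows
        .iter()
        .enumerate()
        .map(|(i, r)| {
            r.iter()
                .enumerate()
                .map(|(j, v)| {
                    let n = if axis == 1 { norms[i] } else { norms[j] };
                    v / safe(n)
                })
                .collect::<Vec<f64>>()
        })
        .collect();
    Ok((out, norms))
}
```

With `axis = 1` the norm vector has one entry per row; with `axis = 0` it has one entry per column.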