ferrolearn_preprocess/
binarizer.rs

//! Binarizer: threshold features to binary values.
//!
//! Values strictly greater than the threshold are set to `1.0`; all other
//! values are set to `0.0`.
//!
//! This transformer is **stateless** — no fitting is required. Call
//! [`Transform::transform`] directly. For scikit-learn API parity it ALSO
//! supports the stateful [`Fit`](ferrolearn_core::traits::Fit) →
//! [`FittedBinarizer`] path, which records `n_features_in_` and (like sklearn)
//! validates the input in `fit`; the fitted type's `transform` reuses the very
//! same strict-greater logic as the stateless path, so both paths are
//! bit-identical.
//!
//! # `## REQ status`
//!
//! Binary (R-DEFER-2), translating `sklearn/preprocessing/_data.py` (`class Binarizer`
//! `:2177`, `binarize` `:2120`). Design doc: `.design/preprocess/binarizer.md`. Expected
//! values from the live sklearn 1.5.2 oracle (R-CHAR-3). Consumers: the in-file
//! `FittedBinarizer::transform` (the stateful fit→transform path) + crate re-export
//! (`lib.rs:106`, grandfathered S5). The SHIPPED REQs are critic-verified vs the oracle;
//! the remaining surface is NOT-STARTED with concrete blockers.
//!
//! | REQ | Status | Evidence |
//! |---|---|---|
//! | REQ-1 (dense strict-greater transform) | SHIPPED | `Transform::transform` = `x.mapv(\|v\| if v > threshold { 1 } else { 0 })`, strict `>`, shape-preserving; `Default` threshold 0.0. Mirrors sklearn `binarize` dense path (`_data.py:2170-2173`). Critic-verified bit-identical to live sklearn (`guard_binarizer_*` in `tests/divergence_binarizer.rs`: thr 0.5 → `[[0,1,0],[1,0,0]]`, default, negative, f32). Consumer: `pub use binarizer::Binarizer` (`lib.rs:106`). |
//! | REQ-9 (transform input validation per check_array) | SHIPPED | FIXED #1123/#1124/#1125. `transform` rejects (in sklearn order) zero-samples → `InsufficientSamples` (`validation.py:1084`), zero-features → `InvalidParameter` (`:1093`), non-finite NaN/±inf → `InvalidParameter` (`:1063`, force_all_finite=True) — matching sklearn `Binarizer.transform` `_validate_data` (`_data.py:2301`). 13 live-oracle tests green; finite extremes (1e308/-0.0/subnormal) not over-rejected. Two-round critic-verified CLEAN. |
//! | REQ-2 (copy param) | SHIPPED | FIXED #1126. `Binarizer<F>` gains a `copy: bool` field (default `true`) + `#[must_use] with_copy` builder + `copy()` getter, threaded onto `FittedBinarizer`, mirroring sklearn `__init__(*, threshold=0.0, copy=True)` (`_data.py:2253-2255`, `_parameter_constraints {copy:["boolean"]}` `:2250`). ACCEPT-AND-DOCUMENT no-op: ferrolearn's [`Transform`] always returns a freshly allocated array (`binarize` → `mapv`), so `copy` has no observable effect — `copy=True`/`copy=False` produce identical output (sklearn's `copy=False` does in-place binarization, an optimization Rust's ownership makes moot here). Live-oracle test `fit_copy_true_false_identical_matches_sklearn`. Consumers: `FittedBinarizer` carries the flag + crate re-export `lib.rs:106`. |
//! | REQ-3 (fit + parameter-constraints validation) | SHIPPED | FIXED #1127. `impl Fit<Array2<F>, ()> for Binarizer<F>` (`fit`): runs the SAME `validate_binarize_input` guard as `Transform::transform`/`binarize` (REQ-9: zero-samples → `InsufficientSamples`, zero-features/non-finite NaN±inf → `InvalidParameter`; sklearn `_validate_data` default `force_all_finite=True` REJECTS NaN/inf — confirmed `Binarizer().fit([[nan]])`/`[[inf]]` raise `ValueError`, `:2277`, `utils/validation.py:1063/1084/1093`), records `n_features_in_ = x.ncols()`, returns `FittedBinarizer { threshold, copy, n_features_in_ }` (no fitted statistics — Binarizer is stateless, sklearn fit "Only validates", `:2257-2278`). THRESHOLD domain (#2209, R-HONEST-4): `Binarizer.fit` does NOT validate the threshold against an interval — its `_parameter_constraints {threshold: [Real]}` (`_data.py:2249`) is a BARE `Real` type check that ACCEPTS `NaN`/`±inf`, and `fit` (`:2257-2278`) only runs `_validate_data` on the DATA. So `Binarizer(threshold=nan/inf).fit(X)` is ACCEPTED here; the non-finite threshold is only rejected later by `transform`/`binarize` (whose `@validate_params` uses the OPEN `Interval(Real, None, None, closed="neither")`, `:2114-2115`). The `fit_rejects_nan/pos_inf/neg_inf` tests reject NaN/inf in the DATA `X` (REQ-9), not the threshold. Live-oracle tests: `fit_transform_matches_sklearn_and_stateless`, `fit_n_features_in_matches_ncols`, `fit_rejects_nan/pos_inf/neg_inf` (data), `fit_strict_greater_boundary_preserved`, `fitted_transform_check_array_before_n_features`, `fitted_transform_shape_mismatch`, `divergence_binarizer_fit_accepts_nonfinite_threshold_like_sklearn` (#2209) in `tests/divergence_binarizer.rs`. Consumers: `FittedBinarizer::transform` (the fitted path) + crate re-export `lib.rs:106`. |
//! | REQ-4 (binarize free function) | SHIPPED | FIXED #1128, #2208. Standalone [`binarize`] returns `Result<Array2<F>, FerroError>`: it FIRST rejects a non-finite `threshold` (`NaN`/`±inf` → `InvalidParameter`), mirroring sklearn's `@validate_params({"threshold": [Interval(Real, None, None, closed="neither")]})` (`_data.py:2112-2118`) — the OPEN interval `(-inf, inf)` EXCLUDES `NaN`/`±inf`, so `binarize(X, threshold=nan/inf)` raises `InvalidParameterError`; then `x.mapv(\|v\| if v > threshold { 1 } else { 0 })`, strict `>`, shape-preserving, mirroring the dense path (`_data.py:2120-2174`). Keyword default `threshold=0.0` documented. `Transform::transform` delegates to `binarize` (propagating the `Result`), so the two are byte-identical. FIXED #2208 (R-HONEST-4): the prior infallible signature + "[Real] accepts any float — ferrolearn does NOT over-reject" claim was WRONG; the non-finite threshold is now REJECTED. Critic-verified vs the live sklearn 1.5.2 oracle (`binarize_*_matches_sklearn`, `divergence_binarize_nan/inf_threshold_should_error_like_sklearn`). |
//! | REQ-5 (n_features_in_ / feature names) | PARTIAL | `n_features_in_` SHIPPED (FIXED #1129), `get_feature_names_out` NOT-STARTED. `FittedBinarizer<F>` records `n_features_in_ = x.ncols()` in `fit` and exposes `pub fn n_features_in(&self) -> usize`, mirroring sklearn's `_validate_data` setting `n_features_in_` (`:2277`); `FittedBinarizer::transform` validates the input column count against it (`ShapeMismatch`, sklearn `_validate_data(reset=False)` `:2301`), AFTER the `check_array` finite/min checks (#2207 order). The `OneToOneFeatureMixin.get_feature_names_out` / `feature_names_in_` string-name plumbing is OUT OF SCOPE for this build (no string feature-name infrastructure in ferrolearn yet) — keep prereq blocker #1129 open for the feature-name half. Live-oracle tests `fit_n_features_in_matches_ncols`, `fitted_transform_shape_mismatch`. |
//! | REQ-6 (sparse support) | NOT-STARTED | open prereq blocker #1130. Dense-only; no CSR/CSC path, no `threshold<0` guard, no `eliminate_zeros` (sklearn `:2161-2168`). |
//! | REQ-7 (PyO3 binding) | SHIPPED | FIXED #1131. `ferrolearn-python` registers `_RsBinarizer` (`py_transformer!` macro, `ferrolearn-python/src/extras.rs`, ctor `threshold: f64 = 0.0` + `copy: bool = true` mirroring sklearn `Binarizer.__init__(*, threshold=0.0, copy=True)` `_data.py:2253`; builds `Binarizer::<f64>::new(threshold).with_copy(copy)`, `fit(x)`→`FittedBinarizer`, `transform(x)`→binarized `PyArray2<f64>`; `FerroError`→`PyValueError`), wired in `ferrolearn-python/src/lib.rs` (`m.add_class::<extras::RsBinarizer>()`). Non-test production consumer (R-DEFER-1): `ferrolearn-python/python/ferrolearn/_extras.py::class Binarizer(_TransformerWrapper)` (keyword-only `__init__(*, threshold=0.0, copy=True)`, `_make_rs → _RsBinarizer(threshold, copy)`, inherits `fit`/`transform`/`fit_transform`) re-exported as `ferrolearn.Binarizer` (`ferrolearn-python/python/ferrolearn/__init__.py`). The non-finite-threshold accept-at-fit (#2209) / reject-at-transform (#2208) and NaN/±inf-input rejection (REQ-9) surface naturally as Python `ValueError`. Verification (model B, R-CHAR-3): `ferrolearn-python/tests/divergence_binarizer.py` — `fit_transform`/`fit`-then-`transform` value parity vs the live sklearn 1.5.2 oracle for thresholds 0.0/0.5/-1.0 on a mixed-sign fixture, strict-greater boundary (value == threshold → 0), default threshold 0.0, NaN/±inf input → `ValueError`, non-finite threshold rejected at `transform` / accepted at `fit`, `get_params`/`set_params`/`clone` round-trip of `threshold`/`copy`, `copy=True`/`copy=False` identical output. |
//! | REQ-8 (ferray substrate) | NOT-STARTED | open prereq blocker #1132. `ndarray`/`num_traits`, not `ferray-core`/`ferray-ufunc` (R-SUBSTRATE-1/2). |

use ferrolearn_core::error::FerroError;
use ferrolearn_core::traits::{Fit, Transform};
use ndarray::Array2;
use num_traits::Float;

// ---------------------------------------------------------------------------
// binarize (free function)
// ---------------------------------------------------------------------------

/// Boolean thresholding of a dense array, element by element.
///
/// Values **strictly greater** than `threshold` become `1.0`; all other values
/// (less than *or equal to* the threshold) become `0.0`. The result is a new,
/// shape-preserving array.
///
/// This is the estimator-less functional form of [`Binarizer`], mirroring
/// scikit-learn's `binarize(X, *, threshold=0.0, copy=True)`
/// (`sklearn/preprocessing/_data.py:2120-2174`), whose dense path is
/// `cond = X > threshold; X[cond] = 1; X[not_cond] = 0` (`:2170-2173`) — the
/// load-bearing strict greater-than. scikit-learn's keyword default is
/// `threshold=0.0` (only positive values map to `1.0`); here the caller passes
/// the threshold explicitly.
///
/// `binarize` is decorated `@validate_params({"threshold": [Interval(Real,
/// None, None, closed="neither")]})` (`_data.py:2112-2118`), an OPEN interval
/// `(-inf, inf)` that EXCLUDES `NaN` and `±inf`. A non-finite `threshold`
/// therefore raises `InvalidParameterError` (a `ValueError`) BEFORE any element
/// comparison; this function mirrors that by returning
/// [`FerroError::InvalidParameter`] for a non-finite threshold.
///
/// [`Binarizer`]'s [`Transform::transform`] delegates its element mapping to
/// this function, so the two share one implementation.
///
/// # Errors
///
/// Returns [`FerroError::InvalidParameter`] if `threshold` is `NaN` or `±inf`
/// (sklearn `Interval(Real, None, None, closed="neither")`, `_data.py:2114`).
///
/// # Examples
///
/// ```
/// use ferrolearn_preprocess::binarizer::binarize;
/// use ndarray::array;
///
/// let x = array![[0.4, 0.6, 0.5], [0.6, 0.1, 0.2]];
/// let out = binarize(&x, 0.5).unwrap();
/// // out = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
/// ```
pub fn binarize<F>(x: &Array2<F>, threshold: F) -> Result<Array2<F>, FerroError>
where
    F: Float,
{
    // sklearn `@validate_params` rejects a non-finite threshold at the
    // `binarize` boundary (`Interval(Real, None, None, closed="neither")`,
    // `_data.py:2114-2115`): the open `(-inf, inf)` interval excludes NaN/±inf.
    if threshold.is_nan() || threshold.is_infinite() {
        return Err(FerroError::InvalidParameter {
            name: "threshold".into(),
            reason: "must be a finite real number (got NaN or infinity)".into(),
        });
    }
    Ok(x.mapv(|v| if v > threshold { F::one() } else { F::zero() }))
}

/// Run the shared `check_array` input validation (REQ-9) used by
/// [`Binarizer`]'s [`Transform::transform`], [`Binarizer`]'s [`Fit::fit`], and
/// [`FittedBinarizer`]'s [`Transform::transform`], in sklearn's `check_array`
/// order: zero-samples → zero-features → non-finite
/// (`sklearn/utils/validation.py:1084`, `:1093`, `:1063`). Mirrors sklearn
/// `Binarizer.fit`/`.transform` → `_validate_data` (`_data.py:2277`, `:2301`),
/// whose default `force_all_finite=True` REJECTS NaN/±inf.
///
/// `context` names the calling site for diagnostics (e.g. `"Binarizer::transform"`
/// vs `"Binarizer::fit"`). It is only attached to the
/// [`FerroError::InsufficientSamples`] variant.
fn validate_binarize_input<F: Float>(x: &Array2<F>, context: &str) -> Result<(), FerroError> {
    // 1. Zero rows (sklearn `ensure_min_samples=1`).
    if x.nrows() == 0 {
        return Err(FerroError::InsufficientSamples {
            required: 1,
            actual: 0,
            context: context.into(),
        });
    }
    // 2. Zero columns (sklearn `ensure_min_features=1`).
    if x.ncols() == 0 {
        return Err(FerroError::InvalidParameter {
            name: "X".to_string(),
            reason: "Found array with 0 feature(s); a minimum of 1 is required \
                     by Binarizer"
                .to_string(),
        });
    }
    // 3. Any NaN or ±inf element (sklearn `force_all_finite=True`).
    if x.iter().any(|v| !v.is_finite()) {
        return Err(FerroError::InvalidParameter {
            name: "X".to_string(),
            reason: "Input X contains non-finite values (NaN or infinity); \
                     Binarizer requires all-finite input"
                .to_string(),
        });
    }
    Ok(())
}

// ---------------------------------------------------------------------------
// Binarizer
// ---------------------------------------------------------------------------

/// A stateless feature binarizer.
///
/// Values strictly greater than `threshold` become `1.0`; all other values
/// become `0.0`. The default threshold is `0.0`.
///
/// This transformer is stateless — no fitting is needed. Call
/// [`Transform::transform`] directly.
///
/// # Examples
///
/// ```
/// use ferrolearn_preprocess::binarizer::Binarizer;
/// use ferrolearn_core::traits::Transform;
/// use ndarray::array;
///
/// let binarizer = Binarizer::<f64>::new(0.5);
/// let x = array![[0.0, 0.5, 1.0]];
/// let out = binarizer.transform(&x).unwrap();
/// // out = [[0.0, 0.0, 1.0]]
/// ```
#[derive(Debug, Clone)]
pub struct Binarizer<F> {
    /// The threshold value. Values strictly greater than this become 1.0.
    pub(crate) threshold: F,
    /// sklearn's `copy` constructor parameter (`__init__(*, threshold=0.0,
    /// copy=True)`, `_data.py:2253-2255`; `_parameter_constraints
    /// {copy:["boolean"]}` `:2248-2251`). ACCEPT-AND-DOCUMENT no-op: ferrolearn's
    /// [`Transform`] always returns a freshly allocated array, so `copy` has no
    /// observable effect on the output. Retained for API parity. Defaults to
    /// `true`.
    pub(crate) copy: bool,
}

impl<F: Float + Send + Sync + 'static> Binarizer<F> {
    /// Create a new `Binarizer` with the given threshold (and the default
    /// `copy = true`).
    ///
    /// No validation happens at construction, matching sklearn's `__init__`,
    /// which stores params unchecked. A non-finite threshold (`NaN`/`±inf`) is
    /// also ACCEPTED by [`Fit::fit`] (#2209): sklearn's `Binarizer`
    /// `_parameter_constraints {threshold: [Real]}` (`_data.py:2249`) is a bare
    /// type check. It is rejected later, with
    /// [`FerroError::InvalidParameter`], by [`Transform::transform`] /
    /// [`binarize`], mirroring the OPEN interval `Interval(Real, None, None,
    /// closed="neither")` that `@validate_params` enforces on `binarize`
    /// (`_data.py:2114-2115`) and that EXCLUDES `NaN`/`±inf`.
    #[must_use]
    pub fn new(threshold: F) -> Self {
        Self {
            threshold,
            copy: true,
        }
    }

    /// Return the configured threshold.
    #[must_use]
    pub fn threshold(&self) -> F {
        self.threshold
    }

    /// Set the `copy` parameter (sklearn `Binarizer(copy=...)`,
    /// `_data.py:2253`, `_parameter_constraints {copy:["boolean"]}` `:2250`).
    ///
    /// This is an ACCEPT-AND-DOCUMENT no-op: ferrolearn's [`Transform`] always
    /// returns a freshly allocated array, so `copy` has no observable effect on
    /// the output. It is retained for API parity with scikit-learn.
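    ///
    /// # Examples
    ///
    /// A minimal sketch of the no-op described above, assuming the crate paths
    /// used by the other examples in this file; the input values are
    /// illustrative only.
    ///
    /// ```
    /// use ferrolearn_core::traits::Transform;
    /// use ferrolearn_preprocess::binarizer::Binarizer;
    /// use ndarray::array;
    ///
    /// let x = array![[-1.0, 0.0, 2.0]];
    /// let copying = Binarizer::<f64>::new(0.0).with_copy(true);
    /// let in_place = Binarizer::<f64>::new(0.0).with_copy(false);
    ///
    /// // The flag is stored and readable through the getter...
    /// assert!(copying.copy());
    /// assert!(!in_place.copy());
    ///
    /// // ...but both settings yield the same freshly allocated output.
    /// let a = copying.transform(&x).unwrap();
    /// let b = in_place.transform(&x).unwrap();
    /// assert_eq!(a, b);
    /// assert_eq!(a, array![[0.0, 0.0, 1.0]]);
    /// ```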
    #[must_use]
    pub fn with_copy(mut self, copy: bool) -> Self {
        self.copy = copy;
        self
    }

    /// Return the configured `copy` flag (sklearn `Binarizer.copy`).
    #[must_use]
    pub fn copy(&self) -> bool {
        self.copy
    }
}

impl<F: Float + Send + Sync + 'static> Default for Binarizer<F> {
    fn default() -> Self {
        Self::new(F::zero())
    }
}

// ---------------------------------------------------------------------------
// FittedBinarizer (sklearn stateful `fit` -> fitted estimator path)
// ---------------------------------------------------------------------------

/// A fitted [`Binarizer`].
///
/// `Binarizer` is stateless — its `fit` (sklearn `Binarizer.fit`,
/// `_data.py:2257-2278`, "Only validates estimator's parameters") learns NO
/// statistics; it merely validates the input and records `n_features_in_`. The
/// fitted type therefore carries only the configured `threshold`, the `copy`
/// flag, and the recorded feature count. Its [`Transform::transform`] reuses the
/// very same strict-greater logic as the stateless [`Binarizer`]/[`binarize`]
/// path, so the two paths are bit-identical.
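///
/// # Examples
///
/// A minimal sketch of the stateful `fit` then `transform` path, assuming the
/// crate paths used by the other examples in this file. The fixture is the
/// threshold-0.5 case quoted in the REQ table.
///
/// ```
/// use ferrolearn_core::traits::{Fit, Transform};
/// use ferrolearn_preprocess::binarizer::Binarizer;
/// use ndarray::array;
///
/// let x = array![[0.4, 0.6, 0.5], [0.6, 0.1, 0.2]];
/// let binarizer = Binarizer::<f64>::new(0.5);
///
/// // `fit` learns nothing; it validates `x` and records the column count.
/// let fitted = binarizer.fit(&x, &()).unwrap();
/// assert_eq!(fitted.n_features_in(), 3);
///
/// // The fitted path and the stateless path agree element for element.
/// let out = fitted.transform(&x).unwrap();
/// assert_eq!(out, array![[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]);
/// assert_eq!(out, binarizer.transform(&x).unwrap());
/// ```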
#[derive(Debug, Clone)]
pub struct FittedBinarizer<F> {
    /// The threshold value. Values strictly greater than this become 1.0.
    pub(crate) threshold: F,
    /// The `copy` flag carried from the unfitted [`Binarizer`] (no-op; see
    /// [`Binarizer::with_copy`]).
    pub(crate) copy: bool,
    /// Number of features (columns) seen during [`Fit::fit`] — sklearn's
    /// `n_features_in_` (`_data.py:2277`, set by `_validate_data`).
    pub(crate) n_features_in_: usize,
}

impl<F: Float + Send + Sync + 'static> FittedBinarizer<F> {
    /// Return the number of features (columns) seen during [`Fit::fit`].
    ///
    /// Mirrors scikit-learn's `Binarizer.n_features_in_` (`_data.py:2277`).
    #[must_use]
    pub fn n_features_in(&self) -> usize {
        self.n_features_in_
    }

    /// Return the configured threshold.
    #[must_use]
    pub fn threshold(&self) -> F {
        self.threshold
    }

    /// Return the configured `copy` flag (no-op; see [`Binarizer::with_copy`]).
    #[must_use]
    pub fn copy(&self) -> bool {
        self.copy
    }
}

impl<F: Float + Send + Sync + 'static> Fit<Array2<F>, ()> for Binarizer<F> {
    type Fitted = FittedBinarizer<F>;
    type Error = FerroError;

    /// Validate the input and record `n_features_in_`, returning a
    /// [`FittedBinarizer`].
    ///
    /// `Binarizer` is stateless: like scikit-learn's `Binarizer.fit`
    /// (`sklearn/preprocessing/_data.py:2257-2278`, "Only validates estimator's
    /// parameters"), this learns NO statistics. It runs the SAME `check_array`
    /// validation as [`Transform::transform`] (REQ-9, via the shared
    /// [`validate_binarize_input`] helper) and records
    /// `n_features_in_ = x.ncols()`. sklearn's `_validate_data` uses the default
    /// `force_all_finite=True`, so NaN/±inf in the DATA are REJECTED in `fit`
    /// (`Binarizer().fit([[nan]])` / `[[inf]]` raise `ValueError`).
    ///
    /// The `threshold` is NOT validated here (#2209). sklearn's
    /// `Binarizer._parameter_constraints {threshold: [Real]}` (`_data.py:2249`)
    /// is a bare `Real` type check that accepts `NaN`/`±inf`, unlike the OPEN
    /// `Interval(Real, None, None, closed="neither")` on the free `binarize`
    /// (`_data.py:2114-2115`). A non-finite `threshold` is therefore ACCEPTED by
    /// `fit` and only rejected later by `transform`, which calls [`binarize`].
    ///
    /// # Errors
    ///
    /// Returns [`FerroError::InsufficientSamples`] for zero rows and
    /// [`FerroError::InvalidParameter`] for zero features or any non-finite
    /// value (NaN, +inf, -inf) in `x`, matching `check_array`
    /// (`sklearn/utils/validation.py:1084`, `:1093`, `:1063`) as routed through
    /// `Binarizer.fit` -> `_validate_data` (`_data.py:2277`).
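    ///
    /// # Examples
    ///
    /// A minimal sketch of the threshold handling noted in the function body
    /// (#2209): a non-finite threshold passes `fit` and is rejected only at
    /// `transform`, while non-finite data is rejected by `fit` itself. Crate
    /// paths are assumed to match the other examples in this file.
    ///
    /// ```
    /// use ferrolearn_core::error::FerroError;
    /// use ferrolearn_core::traits::{Fit, Transform};
    /// use ferrolearn_preprocess::binarizer::Binarizer;
    /// use ndarray::array;
    ///
    /// let x = array![[0.0, 1.0]];
    ///
    /// // NaN threshold: accepted at fit, rejected when the transform runs.
    /// let fitted = Binarizer::<f64>::new(f64::NAN).fit(&x, &()).unwrap();
    /// let err = fitted.transform(&x).unwrap_err();
    /// assert!(matches!(err, FerroError::InvalidParameter { .. }));
    ///
    /// // NaN in the data: rejected by fit.
    /// let bad = array![[f64::NAN, 1.0]];
    /// let err = Binarizer::<f64>::new(0.0).fit(&bad, &()).unwrap_err();
    /// assert!(matches!(err, FerroError::InvalidParameter { .. }));
    /// ```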
    fn fit(&self, x: &Array2<F>, _y: &()) -> Result<FittedBinarizer<F>, FerroError> {
        // sklearn `Binarizer._parameter_constraints = {"threshold": [Real], ...}`
        // (`_data.py:2249`) is a bare `Real` TYPE check that ACCEPTS NaN/+-inf —
        // UNLIKE the free `binarize`'s `Interval(Real, None, None,
        // closed="neither")` (`:2115`). `Binarizer.fit` (`:2257-2278`) validates
        // ONLY the data (`_validate_data`), never the threshold against an
        // interval, so a non-finite threshold is accepted here and only rejected
        // later by `transform` (which calls `binarize`). #2209.
        validate_binarize_input(x, "Binarizer::fit")?;
        Ok(FittedBinarizer {
            threshold: self.threshold,
            copy: self.copy,
            n_features_in_: x.ncols(),
        })
    }
}

impl<F: Float + Send + Sync + 'static> Transform<Array2<F>> for FittedBinarizer<F> {
    type Output = Array2<F>;
    type Error = FerroError;

    /// Threshold each element of `x`, delegating to the SAME strict-greater
    /// logic as the stateless [`Binarizer`] / [`binarize`] path.
    ///
    /// First applies the REQ-9 `check_array` guards (finite / min-samples /
    /// min-features) and binarizes, THEN validates that `x` has the same number
    /// of columns recorded during [`Fit::fit`]. This ORDER matches sklearn's
    /// `_validate_data(reset=False)`, which runs `check_array` BEFORE
    /// `_check_n_features` (`base.py:633` then `:654`, #2207): a NaN / ±inf /
    /// zero-sample / zero-feature input raises its `check_array` error EVEN when
    /// the column count is also wrong. Only after that does the `n_features`
    /// comparison fire. The output is therefore byte-identical to
    /// `Binarizer::transform` / `binarize(x, threshold)`.
    ///
    /// # Errors
    ///
    /// In the order the checks run: returns
    /// [`FerroError::InsufficientSamples`] for zero rows and
    /// [`FerroError::InvalidParameter`] for zero features or any non-finite
    /// value in `x` (REQ-9, via [`validate_binarize_input`]); returns
    /// [`FerroError::InvalidParameter`] if the stored `threshold` is `NaN` or
    /// `±inf` (via [`binarize`]); returns [`FerroError::ShapeMismatch`] if the
    /// column count differs from `n_features_in_`.
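    ///
    /// # Examples
    ///
    /// A minimal sketch of the check ordering described above (#2207), assuming
    /// the crate paths used by the other examples in this file. The inputs are
    /// illustrative only.
    ///
    /// ```
    /// use ferrolearn_core::error::FerroError;
    /// use ferrolearn_core::traits::{Fit, Transform};
    /// use ferrolearn_preprocess::binarizer::Binarizer;
    /// use ndarray::array;
    ///
    /// // Fitted on two columns.
    /// let fitted = Binarizer::<f64>::new(0.0)
    ///     .fit(&array![[1.0, 2.0]], &())
    ///     .unwrap();
    ///
    /// // Finite input with the wrong column count: the shape check fires.
    /// let err = fitted.transform(&array![[1.0, 2.0, 3.0]]).unwrap_err();
    /// assert!(matches!(err, FerroError::ShapeMismatch { .. }));
    ///
    /// // Wrong column count AND a NaN: the finite check wins.
    /// let err = fitted.transform(&array![[f64::NAN, 2.0, 3.0]]).unwrap_err();
    /// assert!(matches!(err, FerroError::InvalidParameter { .. }));
    /// ```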
    fn transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
        // sklearn `_validate_data(reset=False)` runs `check_array` (finite /
        // min-samples / min-features) BEFORE `_check_n_features` (#2207). So
        // validate + binarize FIRST; a NaN / ±inf / zero-sample / zero-feature
        // input must raise its check_array error EVEN when the column count is
        // also wrong. Only then does the n_features comparison fire.
        validate_binarize_input(x, "FittedBinarizer::transform")?;
        let out = binarize(x, self.threshold)?;
        if x.ncols() != self.n_features_in_ {
            return Err(FerroError::ShapeMismatch {
                expected: vec![x.nrows(), self.n_features_in_],
                actual: vec![x.nrows(), x.ncols()],
                context: "FittedBinarizer::transform".into(),
            });
        }
        Ok(out)
    }
}

// ---------------------------------------------------------------------------
// Trait implementations
// ---------------------------------------------------------------------------

impl<F: Float + Send + Sync + 'static> Transform<Array2<F>> for Binarizer<F> {
    type Output = Array2<F>;
    type Error = FerroError;

    /// Apply the threshold: values > threshold become `1.0`, others become `0.0`.
    ///
    /// # Errors
    ///
    /// Returns [`FerroError::InsufficientSamples`] if `x` has zero rows. This
    /// mirrors scikit-learn's `Binarizer.transform`
    /// (`sklearn/preprocessing/_data.py:2301`), whose `_validate_data` ->
    /// `check_array` min-samples check raises `ValueError: Found array with 0
    /// sample(s) ... while a minimum of 1 is required by Binarizer.`
    ///
    /// Returns [`FerroError::InvalidParameter`] if `x` has zero features
    /// (columns). This mirrors scikit-learn's `Binarizer.transform`
    /// (`sklearn/preprocessing/_data.py:2301`), whose `_validate_data` ->
    /// `check_array` min-features check (`utils/validation.py:1093`,
    /// `ensure_min_features=1`) raises `ValueError: Found array with 0
    /// feature(s) (shape=(3, 0)) while a minimum of 1 is required by Binarizer.`
    ///
    /// Returns [`FerroError::InvalidParameter`] if `x` contains any non-finite
    /// value (NaN, +inf, or -inf). This mirrors scikit-learn's
    /// `Binarizer.transform` (`sklearn/preprocessing/_data.py:2301`), which
    /// validates input via `check_array(force_all_finite=True)` and raises
    /// `ValueError: Input X contains NaN.` / `Input X contains infinity ...`
    /// before applying the threshold comparison.
    ///
    /// Returns [`FerroError::InvalidParameter`] if the configured `threshold`
    /// is `NaN` or `±inf`. This check lives in [`binarize`] and runs after the
    /// input checks above.
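    ///
    /// # Examples
    ///
    /// A minimal sketch of the rejection cases listed above, assuming the crate
    /// paths used by the other examples in this file. The inputs are
    /// illustrative only.
    ///
    /// ```
    /// use ferrolearn_core::error::FerroError;
    /// use ferrolearn_core::traits::Transform;
    /// use ferrolearn_preprocess::binarizer::Binarizer;
    /// use ndarray::{array, Array2};
    ///
    /// let b = Binarizer::<f64>::default();
    ///
    /// // Zero rows.
    /// let no_rows = Array2::<f64>::zeros((0, 3));
    /// assert!(matches!(
    ///     b.transform(&no_rows),
    ///     Err(FerroError::InsufficientSamples { .. })
    /// ));
    ///
    /// // Zero columns.
    /// let no_cols = Array2::<f64>::zeros((3, 0));
    /// assert!(matches!(
    ///     b.transform(&no_cols),
    ///     Err(FerroError::InvalidParameter { .. })
    /// ));
    ///
    /// // A non-finite element.
    /// let bad = array![[1.0, f64::INFINITY]];
    /// assert!(matches!(
    ///     b.transform(&bad),
    ///     Err(FerroError::InvalidParameter { .. })
    /// ));
    ///
    /// // Finite extremes pass: 1e308 > 0.0, and -0.0 is not > 0.0.
    /// let ok = array![[1e308, -0.0]];
    /// assert_eq!(b.transform(&ok).unwrap(), array![[1.0, 0.0]]);
    /// ```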
    fn transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
        validate_binarize_input(x, "Binarizer::transform")?;
        binarize(x, self.threshold)
    }
}

// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------

#[cfg(test)]
mod tests {
    use super::*;
    use approx::assert_abs_diff_eq;
    use ndarray::array;

    #[test]
    fn test_binarizer_default_threshold() {
        let b = Binarizer::<f64>::default();
        assert_eq!(b.threshold(), 0.0);
        let x = array![[-1.0, 0.0, 0.5, 1.0]];
        let out = b.transform(&x).unwrap();
        assert_abs_diff_eq!(out[[0, 0]], 0.0, epsilon = 1e-10); // -1 <= 0
        assert_abs_diff_eq!(out[[0, 1]], 0.0, epsilon = 1e-10); // 0 not > 0
        assert_abs_diff_eq!(out[[0, 2]], 1.0, epsilon = 1e-10); // 0.5 > 0
        assert_abs_diff_eq!(out[[0, 3]], 1.0, epsilon = 1e-10); // 1.0 > 0
    }

    #[test]
    fn test_binarizer_custom_threshold() {
        let b = Binarizer::<f64>::new(0.5);
        let x = array![[0.0, 0.5, 1.0]];
        let out = b.transform(&x).unwrap();
        assert_abs_diff_eq!(out[[0, 0]], 0.0, epsilon = 1e-10); // 0.0 not > 0.5
        assert_abs_diff_eq!(out[[0, 1]], 0.0, epsilon = 1e-10); // 0.5 not > 0.5 (strict)
        assert_abs_diff_eq!(out[[0, 2]], 1.0, epsilon = 1e-10); // 1.0 > 0.5
    }

    #[test]
    fn test_binarizer_all_zeros() {
        let b = Binarizer::<f64>::new(0.0);
        let x = array![[0.0, 0.0, 0.0]];
        let out = b.transform(&x).unwrap();
        for v in &out {
            assert_abs_diff_eq!(*v, 0.0, epsilon = 1e-10);
        }
    }

    #[test]
    fn test_binarizer_all_ones() {
        let b = Binarizer::<f64>::new(0.0);
        let x = array![[1.0, 2.0, 3.0]];
        let out = b.transform(&x).unwrap();
        for v in &out {
            assert_abs_diff_eq!(*v, 1.0, epsilon = 1e-10);
        }
    }

    #[test]
    fn test_binarizer_negative_threshold() {
        let b = Binarizer::<f64>::new(-1.0);
        let x = array![[-2.0, -1.0, -0.5, 0.0]];
        let out = b.transform(&x).unwrap();
        assert_abs_diff_eq!(out[[0, 0]], 0.0, epsilon = 1e-10); // -2 <= -1
        assert_abs_diff_eq!(out[[0, 1]], 0.0, epsilon = 1e-10); // -1 not > -1
        assert_abs_diff_eq!(out[[0, 2]], 1.0, epsilon = 1e-10); // -0.5 > -1
        assert_abs_diff_eq!(out[[0, 3]], 1.0, epsilon = 1e-10); // 0.0 > -1
    }

    #[test]
    fn test_binarizer_multiple_rows() {
        let b = Binarizer::<f64>::new(2.0);
        let x = array![[1.0, 3.0], [2.0, 4.0], [5.0, 0.0]];
        let out = b.transform(&x).unwrap();
        assert_eq!(out.shape(), &[3, 2]);
        assert_abs_diff_eq!(out[[0, 0]], 0.0, epsilon = 1e-10); // 1 <= 2
        assert_abs_diff_eq!(out[[0, 1]], 1.0, epsilon = 1e-10); // 3 > 2
        assert_abs_diff_eq!(out[[1, 0]], 0.0, epsilon = 1e-10); // 2 not > 2
        assert_abs_diff_eq!(out[[1, 1]], 1.0, epsilon = 1e-10); // 4 > 2
        assert_abs_diff_eq!(out[[2, 0]], 1.0, epsilon = 1e-10); // 5 > 2
        assert_abs_diff_eq!(out[[2, 1]], 0.0, epsilon = 1e-10); // 0 <= 2
    }

    #[test]
    fn test_binarizer_preserves_shape() {
        let b = Binarizer::<f64>::default();
        let x = array![[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]];
        let out = b.transform(&x).unwrap();
        assert_eq!(out.shape(), x.shape());
    }

    #[test]
    fn test_binarizer_f32() {
        let b = Binarizer::<f32>::new(0.0f32);
        let x: Array2<f32> = array![[1.0f32, -1.0, 0.0]];
        let out = b.transform(&x).unwrap();
        assert!((out[[0, 0]] - 1.0f32).abs() < 1e-6);
        assert!((out[[0, 1]] - 0.0f32).abs() < 1e-6);
        assert!((out[[0, 2]] - 0.0f32).abs() < 1e-6);
    }

    // -- binarize free function (REQ-4) -- oracle-grounded vs live sklearn 1.5.2 --
    // X = [[1,-1,2],[2,0,0],[0,1,-1]]
    // python3 -c "from sklearn.preprocessing import binarize; import numpy as np; \
    //   print(binarize(np.array([[1.,-1,2],[2,0,0],[0,1,-1]])).tolist())"
    //   -> [[1,0,1],[1,0,0],[0,1,0]]   (threshold 0.0, strict >)
    // python3 -c "from sklearn.preprocessing import binarize; import numpy as np; \
    //   print(binarize(np.array([[1.,-1,2],[2,0,0],[0,1,-1]]), threshold=-0.5).tolist())"
    //   -> [[1,0,1],[1,1,1],[1,1,0]]

    #[test]
    fn binarize_default_threshold_matches_sklearn() {
        let x = array![[1.0, -1.0, 2.0], [2.0, 0.0, 0.0], [0.0, 1.0, -1.0]];
        let out = binarize(&x, 0.0).ok();
        let expected = array![[1.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]];
        assert_eq!(out, Some(expected));
    }

    #[test]
    fn binarize_negative_threshold_matches_sklearn() {
        let x = array![[1.0, -1.0, 2.0], [2.0, 0.0, 0.0], [0.0, 1.0, -1.0]];
        let out = binarize(&x, -0.5).ok();
        let expected = array![[1.0, 0.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 0.0]];
        assert_eq!(out, Some(expected));
    }

    #[test]
    fn binarize_matches_estimator_transform() {
        let x = array![[1.0, -1.0, 2.0], [2.0, 0.0, 0.0], [0.0, 1.0, -1.0]];
        let free = binarize(&x, 0.5).ok();
        let est = Binarizer::<f64>::new(0.5).transform(&x).ok();
        assert_eq!(est, free);
    }

    #[test]
    fn test_output_values_are_zero_or_one() {
        let b = Binarizer::<f64>::new(0.0);
        let x = array![[-5.0, -1.0, 0.0, 0.001, 1.0, 100.0]];
        let out = b.transform(&x).unwrap();
        for v in &out {
            assert!(*v == 0.0 || *v == 1.0, "expected 0 or 1, got {v}");
        }
    }
}