ferrolearn_preprocess/binarizer.rs
//! Binarizer: threshold features to binary values.
//!
//! Values strictly greater than the threshold are set to `1.0`; all other
//! values are set to `0.0`.
//!
//! This transformer is **stateless**: no fitting is required. Call
//! [`Transform::transform`] directly. For scikit-learn API parity it ALSO
//! supports the stateful [`Fit`](ferrolearn_core::traits::Fit) →
//! [`FittedBinarizer`] path, which records `n_features_in_` and (like sklearn)
//! validates the input in `fit`; the fitted type's `transform` reuses the very
//! same strict-greater logic as the stateless path, so both paths are
//! bit-identical.
//!
//! ## REQ status
//!
//! Binary (R-DEFER-2), translating `sklearn/preprocessing/_data.py` (`class Binarizer`
//! `:2177`, `binarize` `:2120`). Design doc: `.design/preprocess/binarizer.md`. Expected
//! values from the live sklearn 1.5.2 oracle (R-CHAR-3). Consumers: the in-file
//! `FittedBinarizer::transform` (the stateful fit→transform path) + crate re-export
//! (`lib.rs:106`, grandfathered S5). The SHIPPED REQs are critic-verified vs the oracle;
//! the remaining surface is NOT-STARTED with concrete blockers.
//!
//! | REQ | Status | Evidence |
//! |---|---|---|
//! | REQ-1 (dense strict-greater transform) | SHIPPED | `Transform::transform` = `x.mapv(\|v\| if v > threshold { 1 } else { 0 })`, strict `>`, shape-preserving; `Default` threshold 0.0. Mirrors sklearn `binarize` dense path (`_data.py:2170-2173`). Critic-verified bit-identical to live sklearn (`guard_binarizer_*` in `tests/divergence_binarizer.rs`: thr 0.5 → `[[0,1,0],[1,0,0]]`, default, negative, f32). Consumer: `pub use binarizer::Binarizer` (`lib.rs:106`). |
//! | REQ-9 (transform input validation per check_array) | SHIPPED | FIXED #1123/#1124/#1125. `transform` rejects (in sklearn order) zero-samples → `InsufficientSamples` (`validation.py:1084`), zero-features → `InvalidParameter` (`:1093`), non-finite NaN/±inf → `InvalidParameter` (`:1063`, force_all_finite=True), matching sklearn `Binarizer.transform` `_validate_data` (`_data.py:2301`). 13 live-oracle tests green; finite extremes (1e308/-0.0/subnormal) not over-rejected. Two-round critic-verified CLEAN. |
//! | REQ-2 (copy param) | SHIPPED | FIXED #1126. `Binarizer<F>` gains a `copy: bool` field (default `true`) + `#[must_use] with_copy` builder + `copy()` getter, threaded onto `FittedBinarizer`, mirroring sklearn `__init__(*, threshold=0.0, copy=True)` (`_data.py:2253-2255`, `_parameter_constraints {copy:["boolean"]}` `:2250`). ACCEPT-AND-DOCUMENT no-op: ferrolearn's [`Transform`] always returns a freshly allocated array (`binarize` → `mapv`), so `copy` has no observable effect: `copy=True`/`copy=False` produce identical output (sklearn's `copy=False` does in-place binarization, an optimization Rust's ownership makes moot here). Live-oracle test `fit_copy_true_false_identical_matches_sklearn`. Consumers: `FittedBinarizer` carries the flag + crate re-export `lib.rs:106`. |
//! | REQ-3 (fit + parameter-constraints validation) | SHIPPED | FIXED #1127. `impl Fit<Array2<F>, ()> for Binarizer<F>` (`fit`): runs the SAME `validate_binarize_input` guard as `Transform::transform` (REQ-9: zero-samples → `InsufficientSamples`, zero-features/non-finite NaN±inf → `InvalidParameter`; sklearn `_validate_data` default `force_all_finite=True` REJECTS NaN/inf, confirmed `Binarizer().fit([[nan]])`/`[[inf]]` raise `ValueError`, `:2277`, `utils/validation.py:1063/1084/1093`), records `n_features_in_ = x.ncols()`, returns `FittedBinarizer { threshold, copy, n_features_in_ }` (no fitted statistics: Binarizer is stateless, sklearn fit "Only validates", `:2257-2278`). THRESHOLD domain (#2209, R-HONEST-4): `Binarizer.fit` does NOT validate the threshold against an interval: its `_parameter_constraints {threshold: [Real]}` (`_data.py:2249`) is a BARE `Real` type check that ACCEPTS `NaN`/`±inf`, and `fit` (`:2257-2278`) only runs `_validate_data` on the DATA. So `Binarizer(threshold=nan/inf).fit(X)` is ACCEPTED here; the non-finite threshold is only rejected later by `transform`/`binarize` (whose `@validate_params` uses the OPEN `Interval(Real, None, None, closed="neither")`, `:2114-2115`). The `fit_rejects_nan/pos_inf/neg_inf` tests reject NaN/inf in the DATA `X` (REQ-9), not the threshold. Live-oracle tests: `fit_transform_matches_sklearn_and_stateless`, `fit_n_features_in_matches_ncols`, `fit_rejects_nan/pos_inf/neg_inf` (data), `fit_strict_greater_boundary_preserved`, `fitted_transform_check_array_before_n_features`, `fitted_transform_shape_mismatch`, `divergence_binarizer_fit_accepts_nonfinite_threshold_like_sklearn` (#2209) in `tests/divergence_binarizer.rs`. Consumers: `FittedBinarizer::transform` (the fitted path) + crate re-export `lib.rs:106`. |
//! | REQ-4 (binarize free function) | SHIPPED | FIXED #1128, #2208. Standalone [`binarize`] returns `Result<Array2<F>, FerroError>`: it FIRST rejects a non-finite `threshold` (`NaN`/`±inf` → `InvalidParameter`), mirroring sklearn's `@validate_params({"threshold": [Interval(Real, None, None, closed="neither")]})` (`_data.py:2112-2118`); the OPEN interval `(-inf, inf)` EXCLUDES `NaN`/`±inf`, so `binarize(X, threshold=nan/inf)` raises `InvalidParameterError`; then `x.mapv(\|v\| if v > threshold { 1 } else { 0 })`, strict `>`, shape-preserving, mirroring the dense path (`_data.py:2120-2174`). Keyword default `threshold=0.0` documented. `Transform::transform` delegates to `binarize` (propagating the `Result`), so the two are byte-identical. FIXED #2208 (R-HONEST-4): the prior infallible signature + "[Real] accepts any float, ferrolearn does NOT over-reject" claim was WRONG; the non-finite threshold is now REJECTED. Critic-verified vs the live sklearn 1.5.2 oracle (`binarize_*_matches_sklearn`, `divergence_binarize_nan/inf_threshold_should_error_like_sklearn`). |
//! | REQ-5 (n_features_in_ / feature names) | PARTIAL | `n_features_in_` SHIPPED (FIXED #1129), `get_feature_names_out` NOT-STARTED. `FittedBinarizer<F>` records `n_features_in_ = x.ncols()` in `fit` and exposes `pub fn n_features_in(&self) -> usize`, mirroring sklearn's `_validate_data` setting `n_features_in_` (`:2277`); `FittedBinarizer::transform` validates the input column count against it (`ShapeMismatch`, sklearn `_validate_data(reset=False)` `:2301`), AFTER the `check_array` finite/min checks (#2207 order). The `OneToOneFeatureMixin.get_feature_names_out` / `feature_names_in_` string-name plumbing is OUT OF SCOPE for this build (no string feature-name infrastructure in ferrolearn yet); keep prereq blocker #1129 open for the feature-name half. Live-oracle tests `fit_n_features_in_matches_ncols`, `fitted_transform_shape_mismatch`. |
//! | REQ-6 (sparse support) | NOT-STARTED | open prereq blocker #1130. Dense-only; no CSR/CSC path, no `threshold<0` guard, no `eliminate_zeros` (sklearn `:2161-2168`). |
//! | REQ-7 (PyO3 binding) | SHIPPED | FIXED #1131. `ferrolearn-python` registers `_RsBinarizer` (`py_transformer!` macro, `ferrolearn-python/src/extras.rs`, ctor `threshold: f64 = 0.0` + `copy: bool = true` mirroring sklearn `Binarizer.__init__(*, threshold=0.0, copy=True)` `_data.py:2253`; builds `Binarizer::<f64>::new(threshold).with_copy(copy)`, `fit(x)`→`FittedBinarizer`, `transform(x)`→binarized `PyArray2<f64>`; `FerroError`→`PyValueError`), wired in `ferrolearn-python/src/lib.rs` (`m.add_class::<extras::RsBinarizer>()`). Non-test production consumer (R-DEFER-1): `ferrolearn-python/python/ferrolearn/_extras.py::class Binarizer(_TransformerWrapper)` (keyword-only `__init__(*, threshold=0.0, copy=True)`, `_make_rs → _RsBinarizer(threshold, copy)`, inherits `fit`/`transform`/`fit_transform`) re-exported as `ferrolearn.Binarizer` (`ferrolearn-python/python/ferrolearn/__init__.py`). The non-finite-threshold accept-at-fit (#2209) / reject-at-transform (#2208) and NaN/±inf-input rejection (REQ-9) surface naturally as Python `ValueError`. Verification (model B, R-CHAR-3): `ferrolearn-python/tests/divergence_binarizer.py`: `fit_transform`/`fit`-then-`transform` value parity vs the live sklearn 1.5.2 oracle for thresholds 0.0/0.5/-1.0 on a mixed-sign fixture, strict-greater boundary (value == threshold → 0), default threshold 0.0, NaN/±inf input → `ValueError`, non-finite threshold rejected at `transform` / accepted at `fit`, `get_params`/`set_params`/`clone` round-trip of `threshold`/`copy`, `copy=True`/`copy=False` identical output. |
//! | REQ-8 (ferray substrate) | NOT-STARTED | open prereq blocker #1132. `ndarray`/`num_traits`, not `ferray-core`/`ferray-ufunc` (R-SUBSTRATE-1/2). |

use ferrolearn_core::error::FerroError;
use ferrolearn_core::traits::{Fit, Transform};
use ndarray::Array2;
use num_traits::Float;

// ---------------------------------------------------------------------------
// binarize (free function)
// ---------------------------------------------------------------------------

/// Boolean thresholding of a dense array, element by element.
///
/// Values **strictly greater** than `threshold` become `1.0`; all other values
/// (less than *or equal to* the threshold) become `0.0`. The result is a new,
/// shape-preserving array.
///
/// This is the estimator-less functional form of [`Binarizer`], mirroring
/// scikit-learn's `binarize(X, *, threshold=0.0, copy=True)`
/// (`sklearn/preprocessing/_data.py:2120-2174`), whose dense path is
/// `cond = X > threshold; X[cond] = 1; X[not_cond] = 0` (`:2170-2173`), the
/// load-bearing strict greater-than. scikit-learn's keyword default is
/// `threshold=0.0` (only positive values map to `1.0`); here the caller passes
/// the threshold explicitly.
///
/// `binarize` is decorated `@validate_params({"threshold": [Interval(Real,
/// None, None, closed="neither")]})` (`_data.py:2112-2118`), an OPEN interval
/// `(-inf, inf)` that EXCLUDES `NaN` and `±inf`. A non-finite `threshold`
/// therefore raises `InvalidParameterError` (a `ValueError`) BEFORE any element
/// comparison; this function mirrors that by returning
/// [`FerroError::InvalidParameter`] for a non-finite threshold.
///
/// [`Binarizer`]'s [`Transform::transform`] delegates its element mapping to
/// this function, so the two share one implementation.
///
/// # Errors
///
/// Returns [`FerroError::InvalidParameter`] if `threshold` is `NaN` or `±inf`
/// (sklearn `Interval(Real, None, None, closed="neither")`, `_data.py:2114`).
///
/// # Examples
///
/// ```
/// use ferrolearn_preprocess::binarizer::binarize;
/// use ndarray::array;
///
/// let x = array![[0.4, 0.6, 0.5], [0.6, 0.1, 0.2]];
/// let out = binarize(&x, 0.5).unwrap();
/// // 0.5 is not strictly greater than the threshold, so it maps to 0.0.
/// assert_eq!(out, array![[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]);
/// ```
pub fn binarize<F>(x: &Array2<F>, threshold: F) -> Result<Array2<F>, FerroError>
where
    F: Float,
{
    // sklearn `@validate_params` rejects a non-finite threshold at the
    // `binarize` boundary (`Interval(Real, None, None, closed="neither")`,
    // `_data.py:2114-2115`): the open `(-inf, inf)` interval excludes NaN/±inf.
    if !threshold.is_finite() {
        return Err(FerroError::InvalidParameter {
            name: "threshold".into(),
            reason: "must be a finite real number (got NaN or infinity)".into(),
        });
    }
    Ok(x.mapv(|v| if v > threshold { F::one() } else { F::zero() }))
}
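
/*
A minimal, standalone sketch of the rule documented on `binarize` above, using
only the standard library. `binarize_slice` is a hypothetical helper for
illustration (it is not part of ferrolearn), and it works on a flat `f64` slice
instead of an `Array2<F>`:

```rust
// Hypothetical stand-in for `binarize`; the name and the `String` error type
// are assumptions made for this sketch only.
fn binarize_slice(values: &[f64], threshold: f64) -> Result<Vec<f64>, String> {
    // Reject NaN / ±inf thresholds before comparing any element.
    if !threshold.is_finite() {
        return Err("threshold must be a finite real number".to_string());
    }
    // Strict `>`: a value equal to the threshold maps to 0.0.
    Ok(values
        .iter()
        .map(|&v| if v > threshold { 1.0 } else { 0.0 })
        .collect())
}
```

Under these assumptions, `binarize_slice(&[0.4, 0.6, 0.5], 0.5)` gives
`Ok(vec![0.0, 1.0, 0.0])`: the element equal to the threshold maps to `0.0`,
and a `NaN` or infinite threshold yields `Err` without touching the data.
*/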

/// Run the shared `check_array` input validation (REQ-9) used by
/// [`Binarizer`]'s [`Transform::transform`] and [`Fit::fit`] and by
/// [`FittedBinarizer`]'s [`Transform::transform`], in sklearn's `check_array`
/// order: zero-samples → zero-features → non-finite
/// (`sklearn/utils/validation.py:1084`, `:1093`, `:1063`). Mirrors sklearn
/// `Binarizer.fit`/`.transform` → `_validate_data` (`_data.py:2277`, `:2301`),
/// whose default `force_all_finite=True` REJECTS NaN/±inf.
///
/// `context` names the calling site for diagnostics (e.g. `"Binarizer::transform"`
/// vs `"Binarizer::fit"`).
fn validate_binarize_input<F: Float>(x: &Array2<F>, context: &str) -> Result<(), FerroError> {
    if x.nrows() == 0 {
        return Err(FerroError::InsufficientSamples {
            required: 1,
            actual: 0,
            context: context.into(),
        });
    }
    if x.ncols() == 0 {
        return Err(FerroError::InvalidParameter {
            name: "X".to_string(),
            reason: "Found array with 0 feature(s); a minimum of 1 is required \
                     by Binarizer"
                .to_string(),
        });
    }
    if x.iter().any(|v| !v.is_finite()) {
        return Err(FerroError::InvalidParameter {
            name: "X".to_string(),
            reason: "Input X contains non-finite values (NaN or infinity); \
                     Binarizer requires all-finite input"
                .to_string(),
        });
    }
    Ok(())
}
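
/*
The check order used by `validate_binarize_input` above can be modeled in a few
lines of standard-library Rust. This is a sketch under stated assumptions:
`InputError` and `check_input` are hypothetical names (not ferrolearn items),
and the matrix is passed as explicit dimensions plus a flat data slice:

```rust
// Hypothetical error kinds, one per check; not ferrolearn's `FerroError`.
#[derive(Debug, PartialEq)]
enum InputError {
    ZeroSamples,
    ZeroFeatures,
    NonFinite,
}

// Checks run in a fixed order: rows first, then columns, then finiteness.
fn check_input(nrows: usize, ncols: usize, data: &[f64]) -> Result<(), InputError> {
    if nrows == 0 {
        return Err(InputError::ZeroSamples);
    }
    if ncols == 0 {
        return Err(InputError::ZeroFeatures);
    }
    // Finite extremes (large magnitudes, -0.0, subnormals) pass this check.
    if data.iter().any(|v| !v.is_finite()) {
        return Err(InputError::NonFinite);
    }
    Ok(())
}
```

Because the row check runs first, a 0x0 input reports `ZeroSamples` rather
than `ZeroFeatures`, and only a non-empty matrix can reach the finiteness scan.
*/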

// ---------------------------------------------------------------------------
// Binarizer
// ---------------------------------------------------------------------------

/// A stateless feature binarizer.
///
/// Values strictly greater than `threshold` become `1.0`; all other values
/// become `0.0`. The default threshold is `0.0`.
///
/// This transformer is stateless: no fitting is needed. Call
/// [`Transform::transform`] directly.
///
/// # Examples
///
/// ```
/// use ferrolearn_preprocess::binarizer::Binarizer;
/// use ferrolearn_core::traits::Transform;
/// use ndarray::array;
///
/// let binarizer = Binarizer::<f64>::new(0.5);
/// let x = array![[0.0, 0.5, 1.0]];
/// let out = binarizer.transform(&x).unwrap();
/// // 0.5 equals the threshold, so only 1.0 maps to 1.0.
/// assert_eq!(out, array![[0.0, 0.0, 1.0]]);
/// ```
#[derive(Debug, Clone)]
pub struct Binarizer<F> {
    /// The threshold value. Values strictly greater than this become 1.0.
    pub(crate) threshold: F,
    /// sklearn's `copy` constructor parameter (`__init__(*, threshold=0.0,
    /// copy=True)`, `_data.py:2253-2255`; `_parameter_constraints
    /// {copy:["boolean"]}` `:2248-2251`). ACCEPT-AND-DOCUMENT no-op: ferrolearn's
    /// [`Transform`] always returns a freshly allocated array, so `copy` has no
    /// observable effect on the output. Retained for API parity. Defaults to
    /// `true`.
    pub(crate) copy: bool,
}

impl<F: Float + Send + Sync + 'static> Binarizer<F> {
    /// Create a new `Binarizer` with the given threshold (and the default
    /// `copy = true`).
    ///
    /// No validation happens at construction, matching sklearn's `__init__`,
    /// which stores params unchecked. A non-finite threshold (`NaN`/`±inf`) is
    /// also accepted by [`Fit::fit`]: sklearn's `Binarizer`
    /// `_parameter_constraints {threshold: [Real]}` (`_data.py:2249`) is a bare
    /// type check (#2209). It is rejected later by [`Transform::transform`] /
    /// [`binarize`] (`InvalidParameter`), mirroring the OPEN
    /// `Interval(Real, None, None, closed="neither")` that `@validate_params`
    /// enforces on `binarize` (`_data.py:2114-2115`), which EXCLUDES
    /// `NaN`/`±inf`.
    #[must_use]
    pub fn new(threshold: F) -> Self {
        Self {
            threshold,
            copy: true,
        }
    }

    /// Return the configured threshold.
    #[must_use]
    pub fn threshold(&self) -> F {
        self.threshold
    }

    /// Set the `copy` parameter (sklearn `Binarizer(copy=...)`,
    /// `_data.py:2253`, `_parameter_constraints {copy:["boolean"]}` `:2250`).
    ///
    /// This is an ACCEPT-AND-DOCUMENT no-op: ferrolearn's [`Transform`] always
    /// returns a freshly allocated array, so `copy` has no observable effect on
    /// the output. It is retained for API parity with scikit-learn.
    #[must_use]
    pub fn with_copy(mut self, copy: bool) -> Self {
        self.copy = copy;
        self
    }

    /// Return the configured `copy` flag (sklearn `Binarizer.copy`).
    #[must_use]
    pub fn copy(&self) -> bool {
        self.copy
    }
}

impl<F: Float + Send + Sync + 'static> Default for Binarizer<F> {
    fn default() -> Self {
        Self::new(F::zero())
    }
}

// ---------------------------------------------------------------------------
// FittedBinarizer (sklearn stateful `fit` -> fitted estimator path)
// ---------------------------------------------------------------------------

/// A fitted [`Binarizer`].
///
/// `Binarizer` is stateless: its `fit` (sklearn `Binarizer.fit`,
/// `_data.py:2257-2278`, "Only validates estimator's parameters") learns NO
/// statistics; it merely validates the input and records `n_features_in_`. The
/// fitted type therefore carries only the configured `threshold`, the `copy`
/// flag, and the recorded feature count. Its [`Transform::transform`] reuses the
/// very same strict-greater logic as the stateless [`Binarizer`]/[`binarize`]
/// path, so the two paths are bit-identical.
#[derive(Debug, Clone)]
pub struct FittedBinarizer<F> {
    /// The threshold value. Values strictly greater than this become 1.0.
    pub(crate) threshold: F,
    /// The `copy` flag carried from the unfitted [`Binarizer`] (no-op; see
    /// [`Binarizer::with_copy`]).
    pub(crate) copy: bool,
    /// Number of features (columns) seen during [`Fit::fit`]: sklearn's
    /// `n_features_in_` (`_data.py:2277`, set by `_validate_data`).
    pub(crate) n_features_in_: usize,
}

impl<F: Float + Send + Sync + 'static> FittedBinarizer<F> {
    /// Return the number of features (columns) seen during [`Fit::fit`].
    ///
    /// Mirrors scikit-learn's `Binarizer.n_features_in_` (`_data.py:2277`).
    #[must_use]
    pub fn n_features_in(&self) -> usize {
        self.n_features_in_
    }

    /// Return the configured threshold.
    #[must_use]
    pub fn threshold(&self) -> F {
        self.threshold
    }

    /// Return the configured `copy` flag (no-op; see [`Binarizer::with_copy`]).
    #[must_use]
    pub fn copy(&self) -> bool {
        self.copy
    }
}

impl<F: Float + Send + Sync + 'static> Fit<Array2<F>, ()> for Binarizer<F> {
    type Fitted = FittedBinarizer<F>;
    type Error = FerroError;

    /// Validate the input and record `n_features_in_`, returning a
    /// [`FittedBinarizer`].
    ///
    /// `Binarizer` is stateless: like scikit-learn's `Binarizer.fit`
    /// (`sklearn/preprocessing/_data.py:2257-2278`, "Only validates estimator's
    /// parameters"), this learns NO statistics. It runs the SAME `check_array`
    /// validation as [`Transform::transform`] (REQ-9, via the shared
    /// [`validate_binarize_input`] helper) and records
    /// `n_features_in_ = x.ncols()`. sklearn's `_validate_data` uses the default
    /// `force_all_finite=True`, so NaN/±inf are REJECTED in `fit`
    /// (`Binarizer().fit([[nan]])` / `[[inf]]` raise `ValueError`).
    ///
    /// The `threshold` is NOT validated here. sklearn's `_fit_context` checks
    /// `_parameter_constraints` (`:2249`) before the data, but `threshold` is
    /// constrained there only by a bare `Real` type check, which ACCEPTS
    /// `NaN`/`±inf`. The OPEN `Interval(Real, None, None, closed="neither")`
    /// belongs to the free `binarize` (`_data.py:2114`), so a non-finite
    /// threshold passes `fit` and is rejected only later by
    /// [`Transform::transform`] / [`binarize`] (#2209).
    ///
    /// # Errors
    ///
    /// Returns [`FerroError::InsufficientSamples`] for zero rows and
    /// [`FerroError::InvalidParameter`] for zero features or any non-finite
    /// value (NaN, +inf, -inf) in `x`, matching `check_array`
    /// (`sklearn/utils/validation.py:1084`, `:1093`, `:1063`) as routed through
    /// `Binarizer.fit` -> `_validate_data` (`_data.py:2277`).
    fn fit(&self, x: &Array2<F>, _y: &()) -> Result<FittedBinarizer<F>, FerroError> {
        // sklearn `Binarizer._parameter_constraints = {"threshold": [Real], ...}`
        // (`_data.py:2249`) is a bare `Real` TYPE check that ACCEPTS NaN/+-inf,
        // UNLIKE the free `binarize`'s `Interval(Real, None, None,
        // closed="neither")` (`:2115`). `Binarizer.fit` (`:2257-2278`) validates
        // ONLY the data (`_validate_data`), never the threshold against an
        // interval, so a non-finite threshold is accepted here and only rejected
        // later by `transform` (which calls `binarize`). #2209.
        validate_binarize_input(x, "Binarizer::fit")?;
        Ok(FittedBinarizer {
            threshold: self.threshold,
            copy: self.copy,
            n_features_in_: x.ncols(),
        })
    }
}
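
/*
The accept-at-fit / reject-at-transform split described in `fit` above (#2209)
can be illustrated with a small standard-library model. This is only a sketch:
`SketchBinarizer` and `SketchFitted` are hypothetical types (not ferrolearn's
`Binarizer` / `FittedBinarizer`), rows are plain `Vec<f64>`, and errors are
plain strings:

```rust
// Hypothetical model types; the names are illustrative assumptions.
struct SketchBinarizer {
    threshold: f64,
}

struct SketchFitted {
    threshold: f64,
    n_features_in: usize,
}

impl SketchBinarizer {
    // `fit` checks only the data, so a NaN threshold passes through unchecked.
    fn fit(&self, rows: &[Vec<f64>]) -> Result<SketchFitted, String> {
        let n_features_in = rows.first().map_or(0, |r| r.len());
        if rows.is_empty() || n_features_in == 0 {
            return Err("empty input".to_string());
        }
        if rows.iter().flatten().any(|v| !v.is_finite()) {
            return Err("non-finite data".to_string());
        }
        Ok(SketchFitted {
            threshold: self.threshold,
            n_features_in,
        })
    }
}

impl SketchFitted {
    // The threshold is checked only here (data checks omitted for brevity).
    fn transform(&self, rows: &[Vec<f64>]) -> Result<Vec<Vec<f64>>, String> {
        if !self.threshold.is_finite() {
            return Err("non-finite threshold".to_string());
        }
        Ok(rows
            .iter()
            .map(|r| r.iter().map(|&v| if v > self.threshold { 1.0 } else { 0.0 }).collect())
            .collect())
    }
}
```

In this model, `SketchBinarizer { threshold: f64::NAN }.fit(..)` succeeds on
finite data and records the column count, while the later `transform` call on
the fitted value returns `Err`.
*/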

impl<F: Float + Send + Sync + 'static> Transform<Array2<F>> for FittedBinarizer<F> {
    type Output = Array2<F>;
    type Error = FerroError;

    /// Threshold each element of `x`, delegating to the SAME strict-greater
    /// logic as the stateless [`Binarizer`] / [`binarize`] path.
    ///
    /// First applies the REQ-9 `check_array` guards (finite / min-samples /
    /// min-features) and binarizes, THEN validates that `x` has the same number
    /// of columns recorded during [`Fit::fit`]. This ORDER matches sklearn's
    /// `_validate_data(reset=False)`, which runs `check_array` BEFORE
    /// `_check_n_features` (`base.py:633` then `:654`, #2207): a NaN / ±inf /
    /// zero-sample / zero-feature input raises its `check_array` error EVEN when
    /// the column count is also wrong. Only after that does the `n_features`
    /// comparison fire. On success the output is byte-identical to
    /// `Binarizer::transform` / `binarize(x, threshold)`.
    ///
    /// # Errors
    ///
    /// In the order they are checked:
    ///
    /// - [`FerroError::InsufficientSamples`] for zero rows, and
    ///   [`FerroError::InvalidParameter`] for zero features or any non-finite
    ///   value in `x` (REQ-9, via [`validate_binarize_input`]).
    /// - [`FerroError::InvalidParameter`] if the fitted `threshold` is
    ///   non-finite (`NaN`/`±inf`), raised by [`binarize`].
    /// - [`FerroError::ShapeMismatch`] if the column count differs from
    ///   `n_features_in_`.
    fn transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
        // sklearn `_validate_data(reset=False)` runs `check_array` (finite /
        // min-samples / min-features) BEFORE `_check_n_features` (#2207). So
        // validate + binarize FIRST; a NaN / ±inf / zero-sample / zero-feature
        // input must raise its check_array error EVEN when the column count is
        // also wrong. Only then does the n_features comparison fire.
        validate_binarize_input(x, "FittedBinarizer::transform")?;
        let out = binarize(x, self.threshold)?;
        if x.ncols() != self.n_features_in_ {
            return Err(FerroError::ShapeMismatch {
                expected: vec![x.nrows(), self.n_features_in_],
                actual: vec![x.nrows(), x.ncols()],
                context: "FittedBinarizer::transform".into(),
            });
        }
        Ok(out)
    }
}
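
/*
The ordering documented on the fitted `transform` above (finiteness check
before the column-count comparison, #2207) can be sketched with the standard
library alone. `SketchError` and `check_fitted_input` are hypothetical names
introduced for this illustration, and a single flat row stands in for the
input matrix:

```rust
// Hypothetical error kinds for this sketch; not ferrolearn's `FerroError`.
#[derive(Debug, PartialEq)]
enum SketchError {
    NonFinite,
    ShapeMismatch { expected: usize, actual: usize },
}

fn check_fitted_input(n_features_in: usize, row: &[f64]) -> Result<(), SketchError> {
    // Step 1: the finiteness check runs first.
    if row.iter().any(|v| !v.is_finite()) {
        return Err(SketchError::NonFinite);
    }
    // Step 2: only then is the width compared with the fitted feature count.
    if row.len() != n_features_in {
        return Err(SketchError::ShapeMismatch {
            expected: n_features_in,
            actual: row.len(),
        });
    }
    Ok(())
}
```

With a fitted width of 3, a two-element row containing `NaN` reports
`NonFinite` even though its width is also wrong; a finite two-element row
reports `ShapeMismatch { expected: 3, actual: 2 }`.
*/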

// ---------------------------------------------------------------------------
// Trait implementations
// ---------------------------------------------------------------------------

impl<F: Float + Send + Sync + 'static> Transform<Array2<F>> for Binarizer<F> {
    type Output = Array2<F>;
    type Error = FerroError;

    /// Apply the threshold: values > threshold become `1.0`, others become `0.0`.
    ///
    /// # Errors
    ///
    /// Returns [`FerroError::InsufficientSamples`] if `x` has zero rows. This
    /// mirrors scikit-learn's `Binarizer.transform`
    /// (`sklearn/preprocessing/_data.py:2301`), whose `_validate_data` ->
    /// `check_array` min-samples check raises `ValueError: Found array with 0
    /// sample(s) ... while a minimum of 1 is required by Binarizer.`
    ///
    /// Returns [`FerroError::InvalidParameter`] if `x` has zero features
    /// (columns). This mirrors scikit-learn's `Binarizer.transform`
    /// (`sklearn/preprocessing/_data.py:2301`), whose `_validate_data` ->
    /// `check_array` min-features check (`utils/validation.py:1093`,
    /// `ensure_min_features=1`) raises `ValueError: Found array with 0
    /// feature(s) (shape=(3, 0)) while a minimum of 1 is required by Binarizer.`
    ///
    /// Returns [`FerroError::InvalidParameter`] if `x` contains any non-finite
    /// value (NaN, +inf, or -inf). This mirrors scikit-learn's
    /// `Binarizer.transform` (`sklearn/preprocessing/_data.py:2301`), which
    /// validates input via `check_array(force_all_finite=True)` and raises
    /// `ValueError: Input X contains NaN.` / `Input X contains infinity ...`
    /// before applying the threshold comparison.
    ///
    /// Returns [`FerroError::InvalidParameter`] if the configured `threshold`
    /// is non-finite (`NaN`/`±inf`). This check lives in [`binarize`] and runs
    /// after the input checks above.
    fn transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
        validate_binarize_input(x, "Binarizer::transform")?;
        binarize(x, self.threshold)
    }
}

// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------

#[cfg(test)]
mod tests {
    use super::*;
    use approx::assert_abs_diff_eq;
    use ndarray::array;

    #[test]
    fn test_binarizer_default_threshold() {
        let b = Binarizer::<f64>::default();
        assert_eq!(b.threshold(), 0.0);
        let x = array![[-1.0, 0.0, 0.5, 1.0]];
        let out = b.transform(&x).unwrap();
        assert_abs_diff_eq!(out[[0, 0]], 0.0, epsilon = 1e-10); // -1 <= 0
        assert_abs_diff_eq!(out[[0, 1]], 0.0, epsilon = 1e-10); // 0 not > 0
        assert_abs_diff_eq!(out[[0, 2]], 1.0, epsilon = 1e-10); // 0.5 > 0
        assert_abs_diff_eq!(out[[0, 3]], 1.0, epsilon = 1e-10); // 1.0 > 0
    }

    #[test]
    fn test_binarizer_custom_threshold() {
        let b = Binarizer::<f64>::new(0.5);
        let x = array![[0.0, 0.5, 1.0]];
        let out = b.transform(&x).unwrap();
        assert_abs_diff_eq!(out[[0, 0]], 0.0, epsilon = 1e-10); // 0.0 not > 0.5
        assert_abs_diff_eq!(out[[0, 1]], 0.0, epsilon = 1e-10); // 0.5 not > 0.5 (strict)
        assert_abs_diff_eq!(out[[0, 2]], 1.0, epsilon = 1e-10); // 1.0 > 0.5
    }

    #[test]
    fn test_binarizer_all_zeros() {
        let b = Binarizer::<f64>::new(0.0);
        let x = array![[0.0, 0.0, 0.0]];
        let out = b.transform(&x).unwrap();
        for v in &out {
            assert_abs_diff_eq!(*v, 0.0, epsilon = 1e-10);
        }
    }

    #[test]
    fn test_binarizer_all_ones() {
        let b = Binarizer::<f64>::new(0.0);
        let x = array![[1.0, 2.0, 3.0]];
        let out = b.transform(&x).unwrap();
        for v in &out {
            assert_abs_diff_eq!(*v, 1.0, epsilon = 1e-10);
        }
    }

    #[test]
    fn test_binarizer_negative_threshold() {
        let b = Binarizer::<f64>::new(-1.0);
        let x = array![[-2.0, -1.0, -0.5, 0.0]];
        let out = b.transform(&x).unwrap();
        assert_abs_diff_eq!(out[[0, 0]], 0.0, epsilon = 1e-10); // -2 <= -1
        assert_abs_diff_eq!(out[[0, 1]], 0.0, epsilon = 1e-10); // -1 not > -1
        assert_abs_diff_eq!(out[[0, 2]], 1.0, epsilon = 1e-10); // -0.5 > -1
        assert_abs_diff_eq!(out[[0, 3]], 1.0, epsilon = 1e-10); // 0.0 > -1
    }

    #[test]
    fn test_binarizer_multiple_rows() {
        let b = Binarizer::<f64>::new(2.0);
        let x = array![[1.0, 3.0], [2.0, 4.0], [5.0, 0.0]];
        let out = b.transform(&x).unwrap();
        assert_eq!(out.shape(), &[3, 2]);
        assert_abs_diff_eq!(out[[0, 0]], 0.0, epsilon = 1e-10); // 1 <= 2
        assert_abs_diff_eq!(out[[0, 1]], 1.0, epsilon = 1e-10); // 3 > 2
        assert_abs_diff_eq!(out[[1, 0]], 0.0, epsilon = 1e-10); // 2 not > 2
        assert_abs_diff_eq!(out[[1, 1]], 1.0, epsilon = 1e-10); // 4 > 2
        assert_abs_diff_eq!(out[[2, 0]], 1.0, epsilon = 1e-10); // 5 > 2
        assert_abs_diff_eq!(out[[2, 1]], 0.0, epsilon = 1e-10); // 0 <= 2
    }

    #[test]
    fn test_binarizer_preserves_shape() {
        let b = Binarizer::<f64>::default();
        let x = array![[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]];
        let out = b.transform(&x).unwrap();
        assert_eq!(out.shape(), x.shape());
    }

    #[test]
    fn test_binarizer_f32() {
        let b = Binarizer::<f32>::new(0.0f32);
        let x: Array2<f32> = array![[1.0f32, -1.0, 0.0]];
        let out = b.transform(&x).unwrap();
        assert!((out[[0, 0]] - 1.0f32).abs() < 1e-6);
        assert!((out[[0, 1]] - 0.0f32).abs() < 1e-6);
        assert!((out[[0, 2]] - 0.0f32).abs() < 1e-6);
    }

    // -- binarize free function (REQ-4) -- oracle-grounded vs live sklearn 1.5.2 --
    // X = [[1,-1,2],[2,0,0],[0,1,-1]]
    // python3 -c "from sklearn.preprocessing import binarize; import numpy as np; \
    //   print(binarize(np.array([[1.,-1,2],[2,0,0],[0,1,-1]])).tolist())"
    // -> [[1,0,1],[1,0,0],[0,1,0]] (threshold 0.0, strict >)
    // python3 -c "from sklearn.preprocessing import binarize; import numpy as np; \
    //   print(binarize(np.array([[1.,-1,2],[2,0,0],[0,1,-1]]), threshold=-0.5).tolist())"
    // -> [[1,0,1],[1,1,1],[1,1,0]]

    #[test]
    fn binarize_default_threshold_matches_sklearn() {
        let x = array![[1.0, -1.0, 2.0], [2.0, 0.0, 0.0], [0.0, 1.0, -1.0]];
        let out = binarize(&x, 0.0).ok();
        let expected = array![[1.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]];
        assert_eq!(out, Some(expected));
    }

    #[test]
    fn binarize_negative_threshold_matches_sklearn() {
        let x = array![[1.0, -1.0, 2.0], [2.0, 0.0, 0.0], [0.0, 1.0, -1.0]];
        let out = binarize(&x, -0.5).ok();
        let expected = array![[1.0, 0.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 0.0]];
        assert_eq!(out, Some(expected));
    }

    #[test]
    fn binarize_matches_estimator_transform() {
        let x = array![[1.0, -1.0, 2.0], [2.0, 0.0, 0.0], [0.0, 1.0, -1.0]];
        let free = binarize(&x, 0.5).ok();
        let est = Binarizer::<f64>::new(0.5).transform(&x).ok();
        assert_eq!(est, free);
    }

    #[test]
    fn test_output_values_are_zero_or_one() {
        let b = Binarizer::<f64>::new(0.0);
        let x = array![[-5.0, -1.0, 0.0, 0.001, 1.0, 100.0]];
        let out = b.transform(&x).unwrap();
        for v in &out {
            assert!(*v == 0.0 || *v == 1.0, "expected 0 or 1, got {v}");
        }
    }
}
533}