ferrolearn_preprocess/
one_hot_encoder.rs

//! One-hot encoder for categorical numeric features.
//!
//! `fit` learns, for each input column, `categories_[j]` = the **sorted unique
//! set** of values in that column (matching scikit-learn's
//! `OneHotEncoder.categories_`, `_BaseEncoder._fit:99`, `categories_ =
//! _unique(Xi)`). `transform` emits a dense binary matrix where each learned
//! category gets its own output column; the per-feature blocks are concatenated
//! left-to-right (column 0's categories first, then column 1's, …), and a value
//! is one-hot by **category membership** (the value's index within
//! `categories_[j]`), NOT by an assumed contiguous `0..max` integer layout.
//!
//! # Example
//!
//! ```text
//! Input column with the (non-contiguous) categories {2, 5, 9}:
//!   [2, 5, 9]  →  [[1,0,0],[0,1,0],[0,0,1]]   (3 columns, one per unique value)
//! ```
//!
//! # `## REQ status`
//!
//! Binary (R-DEFER-2), translating `sklearn/preprocessing/_encoders.py` (`class OneHotEncoder`
//! `:458`). Design doc: `.design/preprocess/one_hot_encoder.md`. Expected values from the live
//! sklearn 1.5.2 oracle (R-CHAR-3). Consumer: crate re-export (`lib.rs`, grandfathered S5).
//! HONEST (R-HONEST-3): ferrolearn ships a numeric (`F`-input) DENSE encoder whose `categories_`
//! and column layout now match sklearn's `sparse_output=False` output for ANY finite numeric
//! columns (contiguous or not); `drop` ({None,'first','if_binary'}, REQ-5a) IS shipped, as are
//! `handle_unknown='ignore'`, `inverse_transform`/`get_feature_names_out`, and the integer-count
//! infrequent grouping (`min_frequency`/`max_categories`, REQ-5b). Sparse-by-default output,
//! string/object categories, the float-fraction `min_frequency`, `'infrequent_if_exist'`, the
//! `drop`+infrequent combination and the ferray substrate stay NOT-STARTED. The PyO3 binding ships the
//! DENSE numeric path (`ferrolearn.OneHotEncoder`, REQ-8) with the unsupported surface surfaced
//! as `NotImplementedError`/`ValueError` rather than silently mismatched (R-HONEST-3).
//!
//! | REQ | Status | Evidence |
//! |---|---|---|
//! | REQ-1 (dense one-hot via per-feature category blocks) | SHIPPED | `Transform::transform for FittedOneHotEncoder` zero-fills an `Array2<F>` of width `n_output()` then, for each value, sets `out[[i, offsets[j]+idx]]=1` where `idx` is the value's index in `categories_[j]` (membership), mirroring `_BaseEncoder._transform` (`_encoders.py:206-240`) + the one-hot block expansion. Consumer: crate re-export `lib.rs`. |
//! | REQ-2 (sparse-by-default output) | NOT-STARTED | open prereq blocker #1149. Dense `Array2<F>` only; sklearn defaults `sparse_output=True` → scipy CSR (`:531`,`:748`). |
//! | REQ-3 (categories_ = sorted unique set) | SHIPPED | `Fit::fit` computes `categories_[j]` = per-column values sorted via `partial_cmp` then exact-equality deduped to the sorted-unique set (`_BaseEncoder._fit:99` `categories_=_unique(Xi)`); precomputes `offsets` (prefix sums of `categories_[j].len()`) + `n_output`; rejects 0 rows (`InsufficientSamples`). `categories()` accessor exposes the learned sets. Transform is membership-based (value's index in `categories_[j]`), so non-contiguous integers (`[2,5,9]` → 3 columns, NOT 10) and arbitrary finite floats encode correctly — bit-exact to live sklearn 1.5.2 `sparse_output=False`: `categories_`/`transform`/non-contiguous-headline/offsets guards in `tests/divergence_one_hot_encoder.rs`. Consumer: crate re-export `lib.rs`. SCOPE: numeric `F` input; exact float equality for membership (np.unique semantics — documented); NaN-as-a-category is HANDLED (#2223): NaN sorts LAST + collapses to one category (sklearn `_encode.py:70-74`), a NaN row one-hots its column; +/-inf is REJECTED at `fit`/`transform` (#2225, `force_all_finite="allow-nan"` allows NaN but not inf); string/object input is REQ-3-string (NOT-STARTED, no String path). |
//! | REQ-4 (handle_unknown {'error','ignore'}) | SHIPPED | `OneHotHandleUnknown` enum `{ Error (#[default]), Ignore }` (mirrors sklearn's `handle_unknown` `_parameter_constraints` `StrOptions({"error","ignore","infrequent_if_exist"})` default `"error"`, `_encoders.py:732,750`) + `OneHotEncoder::with_handle_unknown`/`handle_unknown()` builder+getter, threaded into `FittedOneHotEncoder` (`handle_unknown` field + getter) by `Fit::fit` (handle_unknown affects ONLY transform; `categories_` learned identically). `Transform::transform` unknown branch (`cats.iter().position(...) == None`): `Error` → `InvalidParameter` "Found unknown categories … during transform" (the SHIPPED REQ-4 default `ValueError`, `_encoders.py:209-214`, UNCHANGED); `Ignore` → `continue` leaving that feature's one-hot block ALL-ZERO (`_encoders.py:215-240`: unknown row masked out, no encoded column set), every KNOWN feature still one-hots. The +/-inf rejection (#2225), ncols + 0-row guards UNCHANGED (inf is invalid input, not an "unknown category" — still errors in `Ignore`; NaN with NO nan-category is "unknown" → all-zero block in `Ignore`, with a nan-category one-hots it). Never panics (R-CODE-2). Live-oracle parity (sklearn 1.5.2 `sparse_output=False`): `ignore_multifeature_all_zero_block` (`[[100,0],[5,99]]→[[0,0,0,1,0],[0,1,0,0,0]]`), `ignore_fully_unknown_row_all_zero`, `ignore_known_row_normal_one_hot`, `error_default_unknown_rejected`, `with_handle_unknown_ignore_known_value_normal`, `ignore_inf_still_rejected`, `ignore_nan_no_category_all_zero`, `ignore_nan_with_category_one_hots`, `handle_unknown_default_and_builder_abi` (`tests/divergence_one_hot_encoder.rs`). Consumer: crate re-export `lib.rs` (`OneHotHandleUnknown`). R-DEV-2. STILL NOT-STARTED: `'infrequent_if_exist'` (REQ-5). |
//! | REQ-5a (`drop` {None,'first','if_binary'}) | SHIPPED | #1152: `OneHotDrop` enum `{ None_ (#[default]), First, IfBinary }` (mirrors sklearn `drop` `_parameter_constraints` `StrOptions({"first","if_binary"})` / `None`, `_encoders.py:730`,`:498-516`) + `OneHotEncoder::with_drop`/`drop()` builder+getter, threaded into `Fit::fit` which computes `drop_idx_: Vec<Option<usize>>` (sklearn `_compute_drop_idx`, `_encoders.py:812-831`: `None_`→all `None`; `First`→all `Some(0)` (empty feature `None`); `IfBinary`→`Some(0)` iff `len==2` else `None`) and recomputes `offsets`/`n_output` from the per-feature BLOCK WIDTH `len - (drop_idx is Some)`. `FittedOneHotEncoder::drop_idx_()` accessor exposes it. `Transform::transform` (`_encoders.py:1033-1046`): the dropped category emits an ALL-ZERO block; a kept category at membership index `idx` maps to output col `offset + (idx if idx<d else idx-1)` (the `X_int > to_drop` decrement). `inverse_transform` (`_encoders.py:1124-1172`): an all-zero block with `drop_idx_[j]==Some(d)` inverts to the DROPPED category `categories_[j][d]` in BOTH handle_unknown modes (sklearn checks `_drop_idx_after_grouping[i] is not None` FIRST, bypassing the all-zeros error / None paths); a 0-width fully-dropped feature fills the dropped category (`:1132-1135`); a kept block position `pos>=d` maps to category `pos+1`. `get_feature_names_out` OMITS the dropped category (`_compute_transformed_categories` `remove_dropped=True`, `:909`,`:1209-1212`). DROP+IGNORE interaction (verified LIVE, sklearn 1.5.2): `drop` + `handle_unknown='ignore'` is ALLOWED (does NOT raise at fit; warns on unknown at transform, encoding the unknown as an all-zero block == the dropped category) — ferrolearn matches (fit imposes no constraint). NEVER panics: every drop-shift index uses `get`/bounds-checked arithmetic (R-CODE-2). Live-oracle parity (sklearn 1.5.2 `sparse_output=False`, `drop=...`): `drop_first_*`, `drop_if_binary_*`, `drop_inverse_roundtrip_*`, `drop_single_category_fully_dropped_*`, `drop_shift_3cat_*`, `drop_plus_ignore_allowed_*`, `drop_idx_abi_*`, `drop_none_unchanged_*` (`tests/divergence_one_hot_encoder.rs`). Consumer: crate re-export `lib.rs` (`OneHotDrop`). R-DEV-2. |
//! | REQ-5b (infrequent grouping `min_frequency`/`max_categories`) | SHIPPED | #1152: `OneHotEncoder::with_min_frequency`/`with_max_categories` (+`min_frequency()`/`max_categories()` getters) add the integer-count infrequent thresholds (`_encoders.py:566-587`,`:733-738`). `Fit::fit` computes per-category training counts (the run-length over the sorted column) and, when `infrequent_enabled`, calls `identify_infrequent` (mirrors `_BaseEncoder._identify_infrequent`, `_encoders.py:275-318`: min_frequency `count < min_freq` FIRST, then max_categories on the survivors via a STABLE argsort over the full count array keeping the top `max_categories-1` — ties favor the LARGER index; `max_categories==1` → all infrequent) + `build_infrequent_map` (mirrors `_default_to_infrequent_mappings`, `:373-400`: frequent → its remapped slot `0..n_frequent`, infrequent → the trailing slot). `FittedOneHotEncoder` carries `infrequent_indices_` + the per-feature `infrequent_map`; `block_width` becomes `n_frequent + 1` (sklearn `_compute_n_features_outs`, `:948-953`); `offsets`/`n_output` recomputed from it. `infrequent_categories()` exposes the infrequent VALUES per feature (`infrequent_categories_`, `:254-262`,`:625-633`). `Transform::transform` routes a found category through `infrequent_map[j][idx]` (frequent → own col, infrequent → trailing col; `_map_infrequent_categories`, `:442-452`). `inverse_transform` maps the trailing infrequent column to `F::nan()` (DOCUMENTED SCOPE, R-HONEST-3: `Array2<F>` cannot hold sklearn's `'infrequent_sklearn'` string, `:1675-1677`, like the ignore-None NaN proxy #2227), frequent cols → their category. `get_feature_names_out` emits the frequent names + a trailing `"x{j}_infrequent_sklearn"` (`_compute_transformed_categories`, `:913-921`). Infrequent grouping REQUIRES `drop==None_` — combining it errors `InvalidParameter` (REQ-5a×5b interaction DEFERRED; sklearn allows it). Never panics (every remap bounds-checked, R-CODE-2). Live-oracle parity (sklearn 1.5.2 `sparse_output=False`): `infrequent_min_frequency_*`, `infrequent_max_categories_*`, `infrequent_max_categories_tiebreak`, `infrequent_both_*`, `infrequent_inverse_*`, `infrequent_feature_names_*`, `infrequent_multifeature_offsets`, `infrequent_no_infrequent_*`, `infrequent_drop_rejected`, `infrequent_disabled_unchanged` (`tests/divergence_one_hot_encoder.rs`). Consumer: crate re-export `lib.rs`. STILL NOT-STARTED: the FLOAT-fraction `min_frequency` (`:573-575`,`:297-299`), `drop`+infrequent (`:518-520`,`:818-902`), and `'infrequent_if_exist'` (`:550-560`) stay unimplemented. |
//! | REQ-6 (inverse_transform + get_feature_names_out) | SHIPPED | `FittedOneHotEncoder::inverse_transform` reduces each per-feature block `x[:, offsets[j]..offsets[j]+len(categories_[j])]` via **argmax** (numpy first-max-on-ties) to `categories_[j][argmax]`, then handles an ALL-ZERO block (`block_sum == 0`) per `handle_unknown` (sklearn `_encoders.py:1141`,`:1159-1168`): `Error` -> `InvalidParameter` ("Samples can not be inverted ... all zeros"); `Ignore` -> the unknown-category sentinel inverts to `None` in sklearn (`:1183`), represented here as `NaN` (Array2<F> cannot hold None, #2227) with the KNOWN feature blocks still recovered; 0-row → `InsufficientSamples`, `ncols != n_output` → `ShapeMismatch` (`:1100-1104`). Never panics (block slices bounds-checked, R-CODE-2). `FittedOneHotEncoder::get_feature_names_out` emits `format!("x{j}_{cat}")` over `categories_` with default `input_features=["x0",..]` + the `"concat"` combiner (`feature+"_"+str(category)`, `:1217,1224`) → `["x0_2.0","x0_5.0","x0_9.0","x1_0.0","x1_1.0"]`; the float label via `category_label` appends `.0` to whole-valued floats (Python `str(np.float64)`: `2.0`/`-3.0`/`2.5`), `NaN→"nan"`. Live-oracle parity (roundtrip incl. non-contiguous `{2,5,9}`, held-out `[[0,1,0,1,0]]→[[5,0]]`, all-zero/ncols/0-row errors, feature names whole+fractional+negative) in `tests/divergence_one_hot_encoder.rs`. Consumer: crate re-export (`lib.rs:141`). DOCUMENTED DIVERGENCE (R-HONEST-3): the float label uses Rust `Display` for non-whole values, so it diverges from Python's scientific notation at `|v|>=1e16` / `0<|v|<1e-4` (`1e+20`/`1e-07` vs full decimal) — not a plausible category. STILL NOT-STARTED within REQ-6: the `input_features=`/`feature_name_combiner=` params (`:1192,1222`). The `drop`-aware inverse IS handled (REQ-5a), as is the `handle_unknown='ignore'` inverse (#2227, all-zero -> NaN sentinel). |
//! | REQ-7 (ctor + dtype + _parameter_constraints) | SHIPPED | The supported ctor params are type-safe Rust enums — `OneHotHandleUnknown {Error,Ignore}` (REQ-4) and `OneHotDrop {None_,First,IfBinary}` (REQ-5a) — so sklearn's `handle_unknown`/`drop` `StrOptions` `_parameter_constraints` (`_encoders.py:730,732`) are provided BY THE TYPE SYSTEM (an out-of-domain value is unrepresentable). The numeric thresholds carry runtime constraints matching sklearn's `Interval(Integral, 1, None)` (`_encoders.py:733-738`): `Fit::fit` rejects `min_frequency==Some(0)` and `max_categories==Some(0)` with `InvalidParameter` ("must be an int in the range [1, inf)", verified live: `OneHotEncoder(min_frequency=0).fit` -> InvalidParameterError, #1154). `dtype` is f64 (the category container; REQ-3-analog). The FULL 8-key keyword-only sklearn ctor surface (categories/drop/sparse_output/dtype/handle_unknown/min_frequency/max_categories/feature_name_combiner) is exposed + validated at the PyO3 binding (REQ-8, `_extras.py::OneHotEncoder` get_params parity + `_check_unsupported`). Live-oracle test `req7_min_frequency_max_categories_must_be_at_least_one`. Consumer: crate re-export `lib.rs`. |
//! | REQ-8 (PyO3 binding) | SHIPPED | #1155: `ferrolearn-python` exposes `ferrolearn.OneHotEncoder` over `{OneHotEncoder, FittedOneHotEncoder, OneHotHandleUnknown}`. The Rust shim `_RsOneHotEncoder` (hand `#[pyclass]`, `ferrolearn-python/src/extras.rs`) ctor takes `handle_unknown: String = "error"` mapped via `resolve_handle_unknown` ("error"→`Error`, "ignore"→`Ignore`, "infrequent_if_exist"→`PyNotImplementedError` REQ-5, bad→`PyValueError` per `_encoders.py:732` `StrOptions({"error","ignore","infrequent_if_exist"})`); `fit` builds `OneHotEncoder::<f64>::new().with_handle_unknown(..)` + runs `Fit`; `transform`/`inverse_transform`→`PyArray2<f64>` (FerroError→`PyValueError`; the `Ignore`-mode all-zero inverse flows through as NaN, #2227); `#[getter]`s `categories_` (a Python LIST of 1-D f64 numpy arrays via `PyList`), `feature_names_out` (`get_feature_names_out()`→`Vec<String>`), `n_features_in_` (`n_features()`). Registered in `lib.rs` (`m.add_class::<extras::RsOneHotEncoder>()`). The Python wrapper `_extras.py::OneHotEncoder(_TransformerWrapper)` mirrors sklearn's KEYWORD-ONLY 8-key ctor `(*, categories="auto", drop=None, sparse_output=True, dtype=np.float64, handle_unknown="error", min_frequency=None, max_categories=None, feature_name_combiner="concat")` (`_encoders.py:743-762`) for `get_params`/`clone` parity; `_make_rs` threads `handle_unknown`; `fit` calls `_check_unsupported` which HONESTLY (R-HONEST-3) rejects the core's gaps — `sparse_output=True` (the sklearn DEFAULT; dense-only REQ-2 #1149)/`categories!='auto'`/`drop`/`min_frequency`/`max_categories`/`feature_name_combiner!='concat'` (REQ-5/REQ-7 #1152/#1154) → `NotImplementedError`; `transform`/`inverse_transform`/`categories_`/`n_features_in_`/`get_feature_names_out(input_features=None)` guarded by `check_is_fitted`→`NotFittedError` pre-fit (`input_features!=None`→`NotImplementedError` REQ-6). Boundary consumer (R-DEFER-1): the `_extras.py::OneHotEncoder` wrapper + `lib.rs` `add_class` + `__init__.py` re-export. Live-oracle parity (model B, sklearn 1.5.2 `sparse_output=False`): `tests/divergence_one_hot_encoder_py.py` (17 pass) — multi-feature non-contiguous `transform`/`fit_transform`/`categories_`, `handle_unknown='ignore'` all-zero block, `inverse_transform` roundtrip + ignore-NaN-vs-None known-feature recovery, `get_feature_names_out` (`['x0_2.0',...]`), pre-fit `NotFittedError`, bad-handle_unknown `ValueError`, `infrequent_if_exist`/unsupported-param `NotImplementedError`, dense-only `sparse_output=True` error, `get_params` 8-key parity, `clone`. R-DEFER-1 satisfied. |
//! | REQ-9 (ferray substrate) | NOT-STARTED | open prereq blocker #1156. `ndarray::Array2`, not `ferray-core` (R-SUBSTRATE-1/2). |

use ferrolearn_core::error::FerroError;
use ferrolearn_core::traits::{Fit, FitTransform, Transform};
use ndarray::Array2;
use num_traits::Float;
use std::cmp::Ordering;

// ---------------------------------------------------------------------------
// OneHotHandleUnknown
// ---------------------------------------------------------------------------

/// How [`FittedOneHotEncoder`] treats a category at `transform` time that was not
/// seen during `fit` (an **unknown category**).
///
/// Mirrors scikit-learn's `OneHotEncoder(handle_unknown=...)` parameter
/// (`sklearn/preprocessing/_encoders.py:732,750`), whose
/// `_parameter_constraints` accepts `{'error', 'ignore', 'infrequent_if_exist'}`
/// and whose default is `'error'`. ferrolearn ships `Error` and `Ignore` (both
/// REQ-4); `'infrequent_if_exist'` is NOT-STARTED (REQ-5).
///
/// This is a distinct type from
/// [`ordinal_encoder::HandleUnknown`](crate::ordinal_encoder::HandleUnknown):
/// the one-hot encoder's modes are `{error, ignore}` while the ordinal encoder's
/// are `{error, use_encoded_value}` (sklearn's two `handle_unknown` enums differ
/// the same way).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum OneHotHandleUnknown {
    /// Raise an error on any unknown category at `transform` time (scikit-learn's
    /// default `handle_unknown='error'`, the default here too). The fitted
    /// encoder's [`Transform::transform`] returns
    /// [`FerroError::InvalidParameter`] ("Found unknown categories … during
    /// transform", `_encoders.py:209-214`).
    #[default]
    Error,
    /// Encode an unknown category as an **all-zero** one-hot block for that
    /// feature, leaving every known feature untouched (scikit-learn's
    /// `handle_unknown='ignore'`, `_encoders.py:215-240`: the unknown row is
    /// masked out and no column in that feature's block is set).
    Ignore,
}
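
// Illustrative sketch only; it is not part of this crate's API. It shows, for
// a single input row, how the two `handle_unknown` modes documented above
// differ. Assumptions: `f64` values, categories already learned, no `drop`
// and no infrequent grouping. The name `sketch_one_hot_row` and the
// `ignore_unknown` flag (standing in for `OneHotHandleUnknown::Ignore`) are
// hypothetical.
#[allow(dead_code)]
fn sketch_one_hot_row(
    categories: &[Vec<f64>],
    row: &[f64],
    ignore_unknown: bool,
) -> Result<Vec<f64>, String> {
    if row.len() != categories.len() {
        return Err(format!(
            "expected {} columns, got {}",
            categories.len(),
            row.len()
        ));
    }
    // One output column per learned category, blocks laid out left to right.
    let n_output: usize = categories.iter().map(Vec::len).sum();
    let mut out = vec![0.0_f64; n_output];
    let mut offset = 0_usize;
    for (j, (cats, value)) in categories.iter().zip(row.iter()).enumerate() {
        // Membership lookup: the value's index inside this feature's categories.
        match cats.iter().position(|c| c == value) {
            Some(idx) => out[offset + idx] = 1.0,
            // "ignore": leave this feature's block all-zero.
            None if ignore_unknown => {}
            // "error": reject the row.
            None => {
                return Err(format!(
                    "Found unknown category {} in column {} during transform",
                    value, j
                ));
            }
        }
        offset += cats.len();
    }
    Ok(out)
}
// With categories `[[2, 5, 9], [0, 1]]`, the row `[100, 0]` encodes to
// `[0, 0, 0, 1, 0]` when `ignore_unknown` is true and to an `Err` otherwise.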

// ---------------------------------------------------------------------------
// OneHotDrop
// ---------------------------------------------------------------------------

/// Which category (if any) to drop from each feature's one-hot block at
/// `transform` time (`OneHotEncoder(drop=...)`).
///
/// Mirrors scikit-learn's `OneHotEncoder(drop=...)` parameter, whose
/// `_parameter_constraints` accepts `{'first', 'if_binary'}`, an array-like, or
/// `None` (`sklearn/preprocessing/_encoders.py:730`, default `None`). Dropping a
/// category removes one output column per feature, which is useful to break the
/// collinearity an unregularized linear model would otherwise see
/// (`_encoders.py:498-516`).
///
/// ferrolearn ships the `None`/`'first'`/`'if_binary'` modes (REQ-5a). The
/// array-of-explicit-categories form (`drop[i]` = the category to drop in
/// feature `i`, `_encoders.py:515-516`) is NOT-STARTED.
///
/// The variant is named `None_` (not `None`) to avoid colliding with
/// [`Option::None`].
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum OneHotDrop {
    /// Retain all categories — no column is dropped (scikit-learn's default
    /// `drop=None`, `_encoders.py:509`,`:812-813`: `drop_idx_ = None`). The
    /// default here too.
    #[default]
    None_,
    /// Drop the **first** category of every feature (scikit-learn's
    /// `drop='first'`, `_encoders.py:510-511`,`:815-816`: `drop_idx_[j] = 0` for
    /// every feature). A feature with only one category is dropped entirely (its
    /// block width becomes 0).
    First,
    /// Drop the first category of every feature that has **exactly two**
    /// categories, leaving 1-category and 3+-category features intact
    /// (scikit-learn's `drop='if_binary'`, `_encoders.py:512-514`,`:817-831`:
    /// `drop_idx_[j] = 0` iff `len(categories_[j]) == 2`, else `None`).
    IfBinary,
}
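
// Illustrative sketch only; it is not part of this crate's API. It restates
// the drop rules documented above as two small standalone helpers: one that
// derives a per-feature drop index from the category counts, and one that
// shifts a category index to its position inside the shrunken block. The
// names `SketchDrop`, `sketch_drop_idx` and `sketch_kept_position` are
// hypothetical, and `SketchDrop::Keep` stands in for `OneHotDrop::None_`.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
enum SketchDrop {
    Keep,
    First,
    IfBinary,
}

#[allow(dead_code)]
fn sketch_drop_idx(n_categories: &[usize], drop: SketchDrop) -> Vec<Option<usize>> {
    n_categories
        .iter()
        .map(|&len| match drop {
            SketchDrop::Keep => None,
            // An empty feature has nothing to drop.
            SketchDrop::First => {
                if len == 0 {
                    None
                } else {
                    Some(0)
                }
            }
            // Only features with exactly two categories lose their first one.
            SketchDrop::IfBinary => {
                if len == 2 {
                    Some(0)
                } else {
                    None
                }
            }
        })
        .collect()
}

// Position of category `idx` inside its block, or `None` when that category
// is the dropped one (its block stays all-zero).
#[allow(dead_code)]
fn sketch_kept_position(idx: usize, drop_idx: Option<usize>) -> Option<usize> {
    match drop_idx {
        Some(d) if idx == d => None,
        // Categories after the dropped one move one column to the left.
        Some(d) if idx > d => Some(idx - 1),
        _ => Some(idx),
    }
}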

// ---------------------------------------------------------------------------
// OneHotEncoder (unfitted)
// ---------------------------------------------------------------------------

/// An unfitted one-hot encoder for multi-column numeric categorical data.
///
/// Input: `Array2<F>` where each column contains the (finite) numeric category
/// values. Calling [`Fit::fit`] learns, per column, the **sorted unique set** of
/// values (`categories_`) and returns a [`FittedOneHotEncoder`]. The output of
/// [`Transform::transform`] is a dense binary matrix with one column per learned
/// category, the per-feature blocks concatenated left-to-right.
///
/// # Examples
///
/// ```
/// use ferrolearn_preprocess::OneHotEncoder;
/// use ferrolearn_core::traits::{Fit, Transform};
/// use ndarray::array;
///
/// let enc = OneHotEncoder::<f64>::new();
/// // Non-contiguous categories {2, 5, 9} in column 0, {0, 1} in column 1.
/// let x = array![[2.0_f64, 0.0], [5.0, 1.0], [9.0, 0.0], [5.0, 1.0]];
/// let fitted = enc.fit(&x, &()).unwrap();
/// assert_eq!(fitted.categories(), &[vec![2.0, 5.0, 9.0], vec![0.0, 1.0]]);
/// let encoded = fitted.transform(&x).unwrap();
/// assert_eq!(encoded.ncols(), 5); // 3 + 2 category columns
/// ```
///
/// Unknown categories at `transform` time are, by default, rejected
/// ([`OneHotHandleUnknown::Error`], scikit-learn's `handle_unknown='error'`).
/// Configuring [`with_handle_unknown`](OneHotEncoder::with_handle_unknown) with
/// [`OneHotHandleUnknown::Ignore`] instead encodes an unknown category as an
/// all-zero one-hot block, matching `OneHotEncoder(handle_unknown='ignore')`.
#[derive(Debug, Clone)]
pub struct OneHotEncoder<F> {
    /// Strategy for unknown categories at `transform` time
    /// (`handle_unknown`). Defaults to [`OneHotHandleUnknown::Error`].
    handle_unknown: OneHotHandleUnknown,
    /// Which category (if any) to drop per feature (`drop`). Defaults to
    /// [`OneHotDrop::None_`] (retain all categories).
    drop: OneHotDrop,
    /// Minimum frequency (count) below which a category is grouped into the
    /// trailing "infrequent" output column (`min_frequency`). `None` (the
    /// default) disables the min-frequency threshold. Mirrors scikit-learn's
    /// `OneHotEncoder(min_frequency=...)` (`_encoders.py:566-577`,`:734-738`).
    /// SCOPE: only the integer-count form is supported; sklearn also accepts a
    /// FLOAT fraction `min_frequency * n_samples`
    /// (`_encoders.py:573-575`,`:297-299`), which is NOT-STARTED here.
    min_frequency: Option<usize>,
    /// Upper limit on the number of output columns per feature when grouping
    /// infrequent categories (`max_categories`); the infrequent column itself
    /// counts toward this limit. `None` (the default) imposes no limit. Mirrors
    /// scikit-learn's `OneHotEncoder(max_categories=...)`
    /// (`_encoders.py:579-587`,`:733`).
    max_categories: Option<usize>,
    _marker: std::marker::PhantomData<F>,
}

impl<F: Float + Send + Sync + 'static> OneHotEncoder<F> {
    /// Create a new `OneHotEncoder` with scikit-learn's defaults for the
    /// supported parameters: `handle_unknown='error'`
    /// ([`OneHotHandleUnknown::Error`]), `drop=None` ([`OneHotDrop::None_`]),
    /// and no infrequent grouping (`min_frequency`/`max_categories` unset).
    #[must_use]
    pub fn new() -> Self {
        Self {
            handle_unknown: OneHotHandleUnknown::Error,
            drop: OneHotDrop::None_,
            min_frequency: None,
            max_categories: None,
            _marker: std::marker::PhantomData,
        }
    }

    /// Set the unknown-category strategy (`handle_unknown`).
    ///
    /// With [`OneHotHandleUnknown::Ignore`] an unknown category at `transform`
    /// time becomes an all-zero one-hot block for that feature instead of an
    /// error, matching scikit-learn's `OneHotEncoder(handle_unknown='ignore')`
    /// (`_encoders.py:215-240`).
    #[must_use]
    pub fn with_handle_unknown(mut self, handle_unknown: OneHotHandleUnknown) -> Self {
        self.handle_unknown = handle_unknown;
        self
    }

    /// Return the configured unknown-category strategy (`handle_unknown`).
    #[must_use]
    pub fn handle_unknown(&self) -> OneHotHandleUnknown {
        self.handle_unknown
    }

    /// Set the drop strategy (`drop`).
    ///
    /// With [`OneHotDrop::First`] the first category of every feature is dropped
    /// from the output; with [`OneHotDrop::IfBinary`] only binary (2-category)
    /// features lose their first category. The dropped category produces an
    /// all-zero one-hot block, matching scikit-learn's `OneHotEncoder(drop=...)`
    /// (`_encoders.py:498-516`).
    #[must_use]
    pub fn with_drop(mut self, drop: OneHotDrop) -> Self {
        self.drop = drop;
        self
    }

    /// Return the configured drop strategy (`drop`).
    #[must_use]
    pub fn drop(&self) -> OneHotDrop {
        self.drop
    }

    /// Set the minimum-frequency threshold for infrequent grouping
    /// (`min_frequency`, integer count).
    ///
    /// At `fit` time a category whose count in the training data is **strictly
    /// less than** `min_frequency` is grouped into a single trailing
    /// "infrequent" output column for that feature, matching scikit-learn's
    /// `OneHotEncoder(min_frequency=...)` integer form
    /// (`_encoders.py:566-577`, `_identify_infrequent` `:295-296`
    /// `category_count < self.min_frequency`).
    ///
    /// Enabling infrequent grouping (`min_frequency` and/or `max_categories`)
    /// requires `drop == OneHotDrop::None_`; combining it with `drop` is a
    /// deferred interaction (REQ-5a×5b) and [`Fit::fit`] returns an error.
    ///
    /// SCOPE (R-HONEST-3): only the integer-count form is supported. sklearn
    /// also accepts a FLOAT `min_frequency` interpreted as the fraction
    /// `min_frequency * n_samples` (`_encoders.py:573-575`,`:297-299`); the
    /// float-fraction form is NOT-STARTED here.
    #[must_use]
    pub fn with_min_frequency(mut self, min_frequency: usize) -> Self {
        self.min_frequency = Some(min_frequency);
        self
    }

    /// Set the maximum number of output columns per feature for infrequent
    /// grouping (`max_categories`).
    ///
    /// At `fit` time, if a feature would otherwise produce more than
    /// `max_categories` output columns, the least-frequent categories are
    /// grouped into a single trailing "infrequent" column so the block width is
    /// at most `max_categories` (the infrequent column itself counts toward the
    /// limit). Mirrors scikit-learn's `OneHotEncoder(max_categories=...)`
    /// (`_encoders.py:579-587`, `_identify_infrequent` `:303-315`).
    ///
    /// Enabling infrequent grouping requires `drop == OneHotDrop::None_` (see
    /// [`Self::with_min_frequency`]).
    #[must_use]
    pub fn with_max_categories(mut self, max_categories: usize) -> Self {
        self.max_categories = Some(max_categories);
        self
    }

    /// Return the configured minimum-frequency threshold (`min_frequency`), or
    /// `None` if infrequent grouping by frequency is disabled.
    #[must_use]
    pub fn min_frequency(&self) -> Option<usize> {
        self.min_frequency
    }

    /// Return the configured maximum output-column limit (`max_categories`), or
    /// `None` if no limit is imposed.
    #[must_use]
    pub fn max_categories(&self) -> Option<usize> {
        self.max_categories
    }

    /// Whether infrequent grouping is enabled (`min_frequency` is set, or
    /// `max_categories` is set to at least 1). Mirrors scikit-learn's
    /// `_infrequent_enabled` (`_encoders.py:264-273`: `(max_categories is not
    /// None and max_categories >= 1) or min_frequency is not None`).
    fn infrequent_enabled(&self) -> bool {
        self.min_frequency.is_some() || self.max_categories.is_some_and(|m| m >= 1)
    }
}
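
// Illustrative sketch only; it is not part of this crate's API. It restates,
// over plain per-category counts, the two-step infrequent selection described
// in the REQ-5b notes: first `count < min_frequency`, then a stable ascending
// sort over the full count array that keeps the top `max_categories - 1`
// categories. Assumptions: `max_categories >= 1` when set (0 is documented as
// rejected at `fit`). The name `sketch_identify_infrequent` is hypothetical.
#[allow(dead_code)]
fn sketch_identify_infrequent(
    counts: &[usize],
    min_frequency: Option<usize>,
    max_categories: Option<usize>,
) -> Vec<usize> {
    let mut infrequent = vec![false; counts.len()];
    // Step 1: categories seen fewer than `min_frequency` times.
    if let Some(min_freq) = min_frequency {
        for (flag, &count) in infrequent.iter_mut().zip(counts) {
            *flag = count < min_freq;
        }
    }
    // Step 2: cap the block width; the infrequent column counts toward the cap.
    if let Some(max_cat) = max_categories {
        let n_infrequent = infrequent.iter().filter(|&&f| f).count();
        let n_current = counts.len() - n_infrequent + 1;
        if max_cat < n_current {
            let n_keep = max_cat.saturating_sub(1);
            let mut order: Vec<usize> = (0..counts.len()).collect();
            // `sort_by_key` is stable, so equal counts keep ascending index
            // order and the larger index survives a tie.
            order.sort_by_key(|&i| counts[i]);
            let cut = counts.len().saturating_sub(n_keep);
            for &i in &order[..cut] {
                infrequent[i] = true;
            }
        }
    }
    // Ascending indices of the categories grouped as infrequent.
    infrequent
        .iter()
        .enumerate()
        .filter_map(|(i, &flag)| if flag { Some(i) } else { None })
        .collect()
}
// For counts `[5, 1, 3, 3]` and `max_categories = 3`, categories 1 and 2 are
// grouped: the tie between the two count-3 categories keeps index 3.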

impl<F: Float + Send + Sync + 'static> Default for OneHotEncoder<F> {
    fn default() -> Self {
        Self::new()
    }
}
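
// Illustrative sketch only; it is not part of this crate's API. It shows the
// "sorted unique set" rule that the module docs describe for `fit`, for one
// `f64` column: values ascending, duplicates removed by exact equality, NaN
// placed last and collapsed into a single category. Assumption: the column
// holds no +/-inf (documented as rejected at `fit`). The name
// `sketch_sorted_unique` is hypothetical.
#[allow(dead_code)]
fn sketch_sorted_unique(column: &[f64]) -> Vec<f64> {
    // Sort the non-NaN values; `partial_cmp` is total once NaN is filtered out.
    let mut cats: Vec<f64> = column.iter().copied().filter(|v| !v.is_nan()).collect();
    cats.sort_by(|a, b| a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal));
    // Exact-equality dedup over the sorted values.
    cats.dedup();
    // All NaNs collapse into one trailing category.
    if column.iter().any(|v| v.is_nan()) {
        cats.push(f64::NAN);
    }
    cats
}
// The column `[9, 2, 5, 2]` yields the categories `[2, 5, 9]`, so it occupies
// three output columns rather than ten.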
304
305// ---------------------------------------------------------------------------
306// FittedOneHotEncoder
307// ---------------------------------------------------------------------------
308
309/// A fitted one-hot encoder holding the sorted-unique category set per input
310/// column, plus the precomputed output-column layout.
311///
312/// Created by calling [`Fit::fit`] on a [`OneHotEncoder`]. Mirrors
313/// scikit-learn's `OneHotEncoder.categories_` (a list of arrays of the actual
314/// sorted-unique values, `_BaseEncoder._fit:99`).
315#[derive(Debug, Clone)]
316pub struct FittedOneHotEncoder<F> {
317    /// Per-column sorted-unique category values (`categories_`). `categories_[j]`
318    /// is the sorted set of distinct values seen in input column `j`; its length
319    /// is the number of output columns devoted to that feature's block.
320    pub(crate) categories_: Vec<Vec<F>>,
321    /// Per-column output-block start offsets (prefix sums of the per-feature
322    /// **block width**). The block width of feature `j` is
323    /// `categories_[j].len() - (1 if drop_idx_[j].is_some() else 0)`. Output
324    /// column `offsets[j] + pos` is the one-hot bit for the `pos`-th *kept*
325    /// category of feature `j`. Has length `categories_.len()`.
326    pub(crate) offsets: Vec<usize>,
327    /// Total number of output columns (`Σ block_width(j)`), accounting for any
328    /// dropped categories (`drop`).
    pub(crate) n_output: usize,
    /// Strategy for unknown categories at `transform` time, threaded from the
    /// unfitted [`OneHotEncoder`]. [`OneHotHandleUnknown::Error`] rejects an
    /// unknown category; [`OneHotHandleUnknown::Ignore`] emits an all-zero block.
    pub(crate) handle_unknown: OneHotHandleUnknown,
    /// Per-feature index into `categories_[j]` of the category to drop, or `None`
    /// for "no drop" on that feature (`drop_idx_`). Has length
    /// `categories_.len()`. Mirrors scikit-learn's public `drop_idx_`
    /// (`_encoders.py:608-615`,`:885-902`): `drop='first'` → every entry
    /// `Some(0)`; `drop='if_binary'` → `Some(0)` iff the feature has exactly two
    /// categories else `None`; `drop=None` → every entry `None`.
    pub(crate) drop_idx_: Vec<Option<usize>>,
    /// Per-feature indices into `categories_[j]` of the categories grouped as
    /// **infrequent** (`min_frequency`/`max_categories`), sorted ascending.
    /// Mirrors scikit-learn's private `_infrequent_indices[j]`
    /// (`_encoders.py:336-340`,`:367-370`): the indices `idx` such that
    /// `categories_[j][idx]` is an infrequent category. Empty when feature `j`
    /// has no infrequent categories (sklearn's `None`). With infrequent grouping
    /// disabled every entry is empty. Length `categories_.len()`.
    pub(crate) infrequent_indices_: Vec<Vec<usize>>,
    /// Per-feature mapping from a `categories_[j]` index to its OUTPUT column
    /// offset WITHIN feature `j`'s block (before adding `offsets[j]`). Mirrors
    /// scikit-learn's `_default_to_infrequent_mappings[j]`
    /// (`_encoders.py:373-400`): a frequent category maps to its remapped slot
    /// `0..n_frequent`, every infrequent category maps to the single trailing
    /// slot `n_frequent`. When feature `j` has no infrequent categories the
    /// mapping is the identity `0..len` (sklearn stores `None`; the identity is
    /// the representable equivalent). Length `categories_.len()`, with
    /// `infrequent_map[j].len() == categories_[j].len()`. Used by `transform`,
    /// `inverse_transform`, and `get_feature_names_out` to place each category in
    /// the right output column without recomputing the grouping.
    pub(crate) infrequent_map: Vec<Vec<usize>>,
}
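/*
Illustrative sketch only, kept in a comment so it is not compiled into the
crate. It shows one way the index-to-slot mapping described for
`infrequent_map` could be built. The helper name `slot_map` is hypothetical
(it is not this crate's `build_infrequent_map`), and it assumes frequent
categories keep their relative order and `infrequent` is sorted ascending.

```rust
/// Hypothetical helper: map each category index to its slot inside the
/// feature's output block. `infrequent` must be sorted ascending.
fn slot_map(n_categories: usize, infrequent: &[usize]) -> Vec<usize> {
    if infrequent.is_empty() {
        // No grouping on this feature: identity mapping.
        return (0..n_categories).collect();
    }
    let n_frequent = n_categories.saturating_sub(infrequent.len());
    let mut next_frequent = 0;
    (0..n_categories)
        .map(|idx| {
            if infrequent.binary_search(&idx).is_ok() {
                // Every infrequent category shares the trailing slot.
                n_frequent
            } else {
                // Frequent categories fill slots 0..n_frequent in order.
                let slot = next_frequent;
                next_frequent += 1;
                slot
            }
        })
        .collect()
}
```

For four categories with indices 1 and 3 infrequent, indices 0 and 2 take
slots 0 and 1, and both infrequent indices share slot 2.
*/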

impl<F: Float + Send + Sync + 'static> FittedOneHotEncoder<F> {
    /// Return the learned sorted-unique category set for each input column
    /// (`categories_`).
    ///
    /// With no dropped or infrequent categories, `categories()[j][idx]` is the
    /// value encoded by output column `offsets[j] + idx`. Mirrors scikit-learn's
    /// `OneHotEncoder.categories_`.
    #[must_use]
    pub fn categories(&self) -> &[Vec<F>] {
        &self.categories_
    }

    /// Return the number of distinct categories for each input feature column.
    /// This equals the width of each per-feature one-hot block only when no
    /// category is dropped or grouped as infrequent.
    #[must_use]
    pub fn n_categories(&self) -> Vec<usize> {
        self.categories_.iter().map(Vec::len).collect()
    }

    /// Return the number of input feature columns.
    #[must_use]
    pub fn n_features(&self) -> usize {
        self.categories_.len()
    }

    /// Return the total number of output columns: the sum of the per-feature
    /// block widths, which is `Σ categories_[j].len()` when no category is
    /// dropped or grouped as infrequent.
    #[must_use]
    pub fn n_output_features(&self) -> usize {
        self.n_output
    }

    /// Return the configured unknown-category strategy (`handle_unknown`),
    /// threaded from the unfitted [`OneHotEncoder`].
    #[must_use]
    pub fn handle_unknown(&self) -> OneHotHandleUnknown {
        self.handle_unknown
    }

    /// Return the per-feature drop index (`drop_idx_`).
    ///
    /// `drop_idx_()[j]` is `Some(d)` if category `categories_[j][d]` is dropped
    /// from feature `j`'s one-hot block (its block width is one less than
    /// `categories_[j].len()`, and that category encodes to an all-zero block),
    /// or `None` if no category is dropped from that feature. Mirrors
    /// scikit-learn's public `drop_idx_` attribute (`_encoders.py:608-615`). With
    /// `drop=None` (the default) every entry is `None`.
    #[must_use]
    pub fn drop_idx_(&self) -> &[Option<usize>] {
        &self.drop_idx_
    }
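/*
Illustrative sketch only, kept in a comment so it is not compiled into the
crate. It restates the `drop_idx_` rule as a standalone function. `DropMode`
and `drop_idx` are hypothetical names standing in for the crate's
`OneHotDrop` and its fit-time computation; only the per-feature category
counts are needed as input.

```rust
/// Hypothetical stand-in for the encoder's `drop` setting.
#[derive(Clone, Copy)]
enum DropMode {
    Keep,
    First,
    IfBinary,
}

/// Per-feature index of the dropped category, given each feature's
/// category count.
fn drop_idx(mode: DropMode, n_categories: &[usize]) -> Vec<Option<usize>> {
    n_categories
        .iter()
        .map(|&n| match mode {
            DropMode::Keep => None,
            // Drop the first category whenever the feature has one.
            DropMode::First => (n > 0).then_some(0),
            // Drop the first category only for exactly-binary features.
            DropMode::IfBinary => (n == 2).then_some(0),
        })
        .collect()
}
```

With category counts `[3, 2, 1]`, `IfBinary` drops only from the middle
feature, while `First` drops index 0 from all three.
*/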

    /// Return the infrequent category **values** for each feature
    /// (`infrequent_categories_`).
    ///
    /// `infrequent_categories()[j]` is the sorted list of category values from
    /// `categories_[j]` that were grouped into the single trailing "infrequent"
    /// output column (because their training count fell below `min_frequency`
    /// and/or beyond the `max_categories` limit). An EMPTY inner `Vec` means
    /// feature `j` had no infrequent categories (scikit-learn returns `None`
    /// there; an empty list is the representable equivalent). With infrequent
    /// grouping disabled every entry is empty. Mirrors scikit-learn's
    /// `OneHotEncoder.infrequent_categories_`
    /// (`_encoders.py:254-262`,`:625-633`): `category[indices]` over
    /// `_infrequent_indices`.
    #[must_use]
    pub fn infrequent_categories(&self) -> Vec<Vec<F>> {
        self.infrequent_indices_
            .iter()
            .enumerate()
            .map(|(j, idxs)| {
                idxs.iter()
                    .filter_map(|&idx| self.categories_.get(j).and_then(|c| c.get(idx)).copied())
                    .collect()
            })
            .collect()
    }

    /// Whether feature `j` has any infrequent categories (a trailing infrequent
    /// output column). Bounds-safe: a `j` past the end yields `false`.
    fn has_infrequent(&self, j: usize) -> bool {
        self.infrequent_indices_
            .get(j)
            .is_some_and(|v| !v.is_empty())
    }

    /// Return the width of feature `j`'s one-hot block: `n_frequent + 1` when
    /// the feature has infrequent categories, otherwise `categories_[j].len()`
    /// minus one if that feature has a dropped category. Bounds-safe: a `j` past
    /// the end yields 0 (R-CODE-2).
    fn block_width(&self, j: usize) -> usize {
        let len = self.categories_.get(j).map_or(0, Vec::len);
        // Infrequent grouping (REQ-5b) and `drop` (REQ-5a) are mutually
        // exclusive (`fit` rejects their combination), so at most one branch
        // applies. With infrequent categories the block has `n_frequent`
        // columns plus 1 trailing infrequent column (sklearn
        // `_compute_n_features_outs`, `_encoders.py:948-953`:
        // `output[i] -= infreq.size - 1`, i.e. `len - n_infreq + 1`).
        let n_infreq = self.infrequent_indices_.get(j).map_or(0, Vec::len);
        if n_infreq > 0 {
            return len - n_infreq + 1;
        }
        let dropped = matches!(self.drop_idx_.get(j), Some(Some(_)));
        len - usize::from(dropped && len > 0)
    }
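/*
Illustrative sketch only, kept in a comment so it is not compiled into the
crate. It shows the block-width rule and the prefix-sum column layout built
from it as free functions. The names `width` and `layout` are hypothetical,
and the inputs are plain counts rather than the fitted encoder's fields.

```rust
/// Width of one feature's block: frequent columns plus one shared
/// infrequent column, or the category count minus a dropped category.
fn width(n_categories: usize, n_infrequent: usize, dropped: bool) -> usize {
    if n_infrequent > 0 {
        return n_categories.saturating_sub(n_infrequent) + 1;
    }
    n_categories - usize::from(dropped && n_categories > 0)
}

/// Starting column of each block (prefix sums) and the total column count.
fn layout(widths: &[usize]) -> (Vec<usize>, usize) {
    let mut offsets = Vec::with_capacity(widths.len());
    let mut total = 0;
    for &w in widths {
        offsets.push(total);
        total += w;
    }
    (offsets, total)
}
```

A single-category feature with its category dropped has width 0, so it
occupies no columns and the next block starts at the same offset.
*/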

    /// Invert a one-hot encoded matrix back to the original category values.
    ///
    /// For each input feature `j` the per-feature block
    /// `x[:, offsets[j] .. offsets[j] + block_width(j)]` is reduced to a single
    /// block position via **argmax** (the index of the maximum value in the
    /// block, first-max on ties, numpy `argmax` semantics). That position is
    /// mapped back to an index `idx` into `categories_[j]` (skipping a dropped
    /// category, if any), and the original value `categories_[j][idx]` is
    /// written to `out[[i, j]]`. This mirrors scikit-learn's
    /// `OneHotEncoder.inverse_transform`
    /// (`sklearn/preprocessing/_encoders.py:1136-1139`):
    /// `labels = sub.argmax(axis=1); X_tr[:, i] = cats[labels]`.
    ///
    /// After the argmax, an **all-zero block** (a row whose per-feature block
    /// sums to zero) is resolved as follows:
    ///
    /// - If feature `j` has a dropped category, the block inverts to that
    ///   category, in both `handle_unknown` modes.
    /// - Otherwise, with the default `handle_unknown='error'`, it is an error,
    ///   matching sklearn's
    ///   `ValueError("Samples [...] can not be inverted when drop=None and
    ///   handle_unknown='error' because they contain all zeros")`
    ///   (`_encoders.py:1160-1168`).
    /// - Otherwise, with `handle_unknown='ignore'`, the cell is set to `NaN`,
    ///   the representable stand-in for sklearn's `None`.
    ///
    /// With infrequent grouping, the trailing infrequent column also inverts to
    /// `NaN`, because sklearn's `'infrequent_sklearn'` string cannot be stored
    /// in an `Array2<F>`.
    ///
    /// With no `drop` and no unknown categories, a one-hot row from
    /// [`Transform::transform`] has exactly one `1` per block, so argmax always
    /// finds it and the block sum is never zero.
    ///
    /// # Errors
    ///
    /// - [`FerroError::InsufficientSamples`] if `x` has zero rows (sklearn
    ///   `check_array` requires a minimum of 1 sample).
    /// - [`FerroError::InvalidParameter`] if `x` contains `NaN` or an infinity
    ///   (checked before the shape).
    /// - [`FerroError::ShapeMismatch`] if `x.ncols() != n_output` (sklearn's
    ///   "Shape of the passed X data is not correct" `ValueError`,
    ///   `_encoders.py:1100-1104`).
    /// - [`FerroError::InvalidParameter`] if a per-feature block is all-zero,
    ///   the feature has no dropped category, and `handle_unknown` is
    ///   [`OneHotHandleUnknown::Error`] (the sklearn all-zeros `ValueError`,
    ///   `_encoders.py:1164-1168`).
    ///
    /// Never panics: every block slice is bounds-checked (R-CODE-2).
    pub fn inverse_transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
        let n_samples = x.nrows();
        if n_samples == 0 {
            return Err(FerroError::InsufficientSamples {
                required: 1,
                actual: 0,
                context: "FittedOneHotEncoder::inverse_transform".into(),
            });
        }
        // sklearn `inverse_transform` -> `check_array(X, accept_sparse="csr")`
        // (`_encoders.py:1092`) with the DEFAULT `force_all_finite=True`, so a
        // NaN or +/-inf cell in the one-hot matrix raises BEFORE the argmax
        // (#2224). A valid one-hot row is all 0/1 (finite); a non-finite cell is
        // invalid input.
        if x.iter().any(|v| !v.is_finite()) {
            return Err(FerroError::InvalidParameter {
                name: "X".into(),
                reason: "Input X contains NaN or infinity.".into(),
            });
        }
        if x.ncols() != self.n_output {
            return Err(FerroError::ShapeMismatch {
                expected: vec![n_samples, self.n_output],
                actual: vec![n_samples, x.ncols()],
                context: "FittedOneHotEncoder::inverse_transform".into(),
            });
        }

        let n_features = self.categories_.len();
        let mut out = Array2::zeros((n_samples, n_features));

        for j in 0..n_features {
            let cats = &self.categories_[j];
            let drop_d = self.drop_idx_.get(j).copied().flatten();
            // The per-feature block WIDTH after drop (the number of output columns
            // for this feature). With a dropped category the block is one narrower
            // than `categories_[j]` (`_encoders.py:1124-1127` `cats_wo_dropped`).
            let block_width = self.block_width(j);
            let offset = self.offsets[j];

            // A feature whose entire (single) category was dropped has a
            // zero-width block (`drop='first'` on a 1-category feature). Every row
            // inverts to that dropped category, with no columns consumed (sklearn
            // `n_categories == 0` branch, `_encoders.py:1132-1135`).
            if block_width == 0 {
                if let Some(&cat) = drop_d.and_then(|d| cats.get(d)) {
                    for i in 0..n_samples {
                        out[[i, j]] = cat;
                    }
                }
                continue;
            }

            for i in 0..n_samples {
                // Argmax over the per-feature block (numpy `argmax`: index of the
                // maximum, FIRST on ties). Track the block sum to detect the
                // all-zero case separately, mirroring sklearn's two-step
                // argmax-then-all-zero-check (`_encoders.py:1136-1172`). `argmax`
                // is a BLOCK position in `0..block_width`.
                let mut argmax: usize = 0;
                let mut max_val = x[[i, offset]];
                let mut block_sum = max_val;
                for k in 1..block_width {
                    let v = x[[i, offset + k]];
                    block_sum = block_sum + v;
                    if v > max_val {
                        max_val = v;
                        argmax = k;
                    }
                }
                if block_sum == F::zero() {
                    // All-zero block. With a dropped category this is the
                    // LEGITIMATE encoding of the dropped value, so it inverts to
                    // that category in BOTH handle_unknown modes: sklearn checks
                    // `_drop_idx_after_grouping[i] is not None` FIRST and maps the
                    // all-zero row to the dropped category (`_encoders.py:1150-1158`
                    // for ignore, `:1169-1172` for error), bypassing the
                    // "can not be inverted" / None paths.
                    if drop_d.is_some() {
                        if let Some(&cat) = drop_d.and_then(|d| cats.get(d)) {
                            out[[i, j]] = cat;
                        }
                    } else {
                        // No drop on this feature: the existing handle_unknown
                        // semantics (`_encoders.py:1141`,`:1159-1168`).
                        match self.handle_unknown {
                            OneHotHandleUnknown::Error => {
                                return Err(FerroError::InvalidParameter {
                                    name: "X".into(),
                                    reason: "Samples can not be inverted when drop=None and \
                                         handle_unknown='error' because they contain all zeros"
                                        .into(),
                                });
                            }
                            // `handle_unknown='ignore'` all-zero block -> None in
                            // sklearn (`:1183`); `Array2<F>` cannot hold None so we
                            // use NaN as the representable sentinel (#2227).
                            OneHotHandleUnknown::Ignore => {
                                out[[i, j]] = F::nan();
                            }
                        }
                    }
                } else if self.has_infrequent(j) {
                    // Infrequent grouping (REQ-5b). The block POSITION `argmax`
                    // is a slot in `infrequent_map[j]`. The TRAILING slot
                    // (`n_frequent`) is the infrequent column: sklearn inverts it
                    // to the string `'infrequent_sklearn'` (`_encoders.py:1675-1677`,
                    // `_compute_transformed_categories:917`), which an `Array2<F>`
                    // cannot hold, so NaN is the representable proxy (DOCUMENTED
                    // SCOPE, R-HONEST-3, like the ignore-None case #2227). A
                    // frequent slot inverts to the unique `categories_[j]` index
                    // that maps to it (`labels = cats_wo_dropped[argmax]`,
                    // `:1138-1139`). Bounds-safe via `get` (R-CODE-2).
                    let map = self.infrequent_map.get(j);
                    let n_frequent = block_width - 1; // the trailing slot index
                    if argmax >= n_frequent {
                        out[[i, j]] = F::nan();
                    } else if let Some(&cat) = map
                        .and_then(|m| m.iter().position(|&s| s == argmax))
                        .and_then(|orig| cats.get(orig))
                    {
                        out[[i, j]] = cat;
                    }
                } else {
                    // Map the block POSITION back to a `categories_[j]` index: with
                    // a dropped category `d`, positions `>= d` correspond to the
                    // category one higher (the dropped category was removed),
                    // matching sklearn's `cats_wo_dropped` indexing
                    // (`_encoders.py:1124-1139`). Bounds-safe via `get` (R-CODE-2).
                    let cat_idx = match drop_d {
                        Some(d) if argmax >= d => argmax + 1,
                        _ => argmax,
                    };
                    if let Some(&cat) = cats.get(cat_idx) {
                        out[[i, j]] = cat;
                    }
                }
            }
        }

        Ok(out)
    }
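/*
Illustrative sketch only, kept in a comment so it is not compiled into the
crate. It isolates the per-block inversion for one row: first-max argmax,
the all-zero case, and the position shift past a dropped category. The name
`invert_block` is hypothetical, it works on `f64` slices rather than
`Array2<F>`, and it returns `None` where the real method would consult
`handle_unknown`. Infrequent grouping is left out.

```rust
/// Hypothetical single-block inverse. `block` is one row's slice of the
/// one-hot matrix for a feature, `cats` the feature's sorted categories.
fn invert_block(block: &[f64], cats: &[f64], drop_idx: Option<usize>) -> Option<f64> {
    let sum: f64 = block.iter().sum();
    if block.is_empty() || sum == 0.0 {
        // All-zero (or zero-width) block: the dropped category, if any.
        return drop_idx.and_then(|d| cats.get(d)).copied();
    }
    // First-max argmax: a later element wins only if strictly greater.
    let mut argmax = 0;
    for (k, &v) in block.iter().enumerate() {
        if v > block[argmax] {
            argmax = k;
        }
    }
    // Positions at or past the dropped index shift up by one.
    let cat_idx = match drop_idx {
        Some(d) if argmax >= d => argmax + 1,
        _ => argmax,
    };
    cats.get(cat_idx).copied()
}
```

With categories `{2, 5, 9}` and index 0 dropped, the two-column block
`[0, 1]` inverts to 9 and the all-zero block `[0, 0]` inverts to 2.
*/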

    /// Return the output feature names, one per output column.
    ///
    /// For each input feature `j`, for each category `c` in `categories_[j]`,
    /// emits `format!("x{j}_{c}")` where `c` is rendered to match Python's
    /// `str(np.float64(c))`. A dropped category is omitted, and with infrequent
    /// grouping only the frequent categories are named, followed by a single
    /// `x{j}_infrequent_sklearn` column. This mirrors scikit-learn's
    /// `OneHotEncoder.get_feature_names_out` with the default `input_features`
    /// (`["x0", "x1", ...]`) and the `"concat"` name combiner
    /// (`feature + "_" + str(category)`, `_encoders.py:1217,1224`). For the
    /// whole-number fixture `[[2,0],[5,1],[9,0],[5,1]]` this yields
    /// `["x0_2.0", "x0_5.0", "x0_9.0", "x1_0.0", "x1_1.0"]`.
    ///
    /// # Float-rendering divergence (HONEST, R-HONEST-3)
    ///
    /// The category is rendered via [`Self::category_label`], which appends `.0`
    /// to integer-valued floats (`2.0 → "2.0"`, `-3.0 → "-3.0"`, matching
    /// Python) and uses Rust's shortest round-trip `Display` otherwise
    /// (`2.5 → "2.5"`). For category values in the usual categorical range
    /// (small whole or fractional numbers) this is byte-identical to Python.
    /// It DIVERGES for extreme magnitudes: Python's `repr`/`str` switches to
    /// scientific notation at `|v| >= 1e16` and `0 < |v| < 1e-4`
    /// (`1e+20`, `1e-07`), while this function prints the full decimal
    /// (`100000000000000000000.0`, `0.0000001`). Such values are not plausible
    /// one-hot categories; the divergence is documented rather than papered over.
    /// `NaN` renders as `"nan"` (matching Python's `str(nan)`).
    #[must_use]
    pub fn get_feature_names_out(&self) -> Vec<String> {
        let mut names = Vec::with_capacity(self.n_output);
        for (j, cats) in self.categories_.iter().enumerate() {
            // The dropped category's name is OMITTED (sklearn
            // `_compute_transformed_categories` with `remove_dropped=True`,
            // `_encoders.py:1209-1212`,`:909`).
            let drop_d = self.drop_idx_.get(j).copied().flatten();
            // Infrequent grouping (REQ-5b): emit only the FREQUENT category names
            // then a single trailing `"x{j}_infrequent_sklearn"` column: the
            // infrequent categories collapse into that one column (sklearn
            // `_compute_transformed_categories`, `_encoders.py:913-921`:
            // `cats[frequent_mask] + ['infrequent_sklearn']`). Infrequent and
            // `drop` are mutually exclusive, so `drop_d` is `None` here.
            if self.has_infrequent(j) {
                let map = self.infrequent_map.get(j);
                let n_frequent = self.block_width(j).saturating_sub(1);
                for slot in 0..n_frequent {
                    // The unique frequent category whose remapped slot is `slot`.
                    if let Some(&c) = map
                        .and_then(|m| m.iter().position(|&s| s == slot))
                        .and_then(|orig| cats.get(orig))
                    {
                        names.push(format!("x{j}_{}", Self::category_label(c)));
                    }
                }
                names.push(format!("x{j}_infrequent_sklearn"));
                continue;
            }
            for (idx, &c) in cats.iter().enumerate() {
                if drop_d == Some(idx) {
                    continue;
                }
                names.push(format!("x{j}_{}", Self::category_label(c)));
            }
        }
        names
    }
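/*
Illustrative sketch only, kept in a comment so it is not compiled into the
crate. It reproduces the naming rule for the no-infrequent case on plain
`f64` data: `x{j}_{label}` per kept category, with whole-valued floats
rendered with a trailing `.0`. The free functions `label` and
`feature_names` are hypothetical stand-ins for the methods above.

```rust
/// Render a category the way the doc comment describes: "nan" for NaN,
/// one decimal for whole finite values, shortest `Display` otherwise.
fn label(v: f64) -> String {
    if v.is_nan() {
        return "nan".to_string();
    }
    if v.is_finite() && v == v.trunc() {
        format!("{v:.1}")
    } else {
        format!("{v}")
    }
}

/// One name per output column, skipping each feature's dropped category.
fn feature_names(categories: &[Vec<f64>], drop_idx: &[Option<usize>]) -> Vec<String> {
    let mut names = Vec::new();
    for (j, cats) in categories.iter().enumerate() {
        let dropped = drop_idx.get(j).copied().flatten();
        for (idx, &c) in cats.iter().enumerate() {
            if dropped == Some(idx) {
                continue;
            }
            names.push(format!("x{j}_{}", label(c)));
        }
    }
    names
}
```

On the fixture from the doc comment (categories `{2, 5, 9}` and `{0, 1}`)
this gives the five names listed there; dropping index 0 of the second
feature removes `x1_0.0`.
*/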

    /// Render a category value to a string matching Python's `str(np.float64(v))`
    /// for the categorical-value range (see [`Self::get_feature_names_out`] for
    /// the documented extreme-magnitude divergence).
    ///
    /// Python's `str(float)` always shows a decimal point for whole floats
    /// (`2.0`, not `2`), so an integer-valued finite float gets a `.0` suffix;
    /// otherwise Rust's shortest round-trip `Display` is used. `NaN → "nan"`.
    fn category_label(v: F) -> String {
        let Some(f) = v.to_f64() else {
            return "nan".to_string();
        };
        if f.is_nan() {
            return "nan".to_string();
        }
        if f.is_finite() && f == f.trunc() {
            // Whole-valued finite float: Python prints e.g. "2.0", "-3.0".
            format!("{f:.1}")
        } else {
            // Fractional or non-finite: shortest round-trip Display ("2.5").
            format!("{f}")
        }
    }
}

// ---------------------------------------------------------------------------
// Trait implementations
// ---------------------------------------------------------------------------

impl<F: Float + Send + Sync + 'static> Fit<Array2<F>, ()> for OneHotEncoder<F> {
    type Fitted = FittedOneHotEncoder<F>;
    type Error = FerroError;

    /// Fit the encoder by learning the **sorted-unique category set** per column.
    ///
    /// For each input column `j`, `categories_[j]` is the distinct values of that
    /// column, sorted ascending via `partial_cmp` and deduped by **exact
    /// equality**, mirroring scikit-learn's `categories_ = _unique(Xi)`
    /// (`sklearn/preprocessing/_encoders.py:99`, `np.unique` per column).
    /// The output-column layout (`offsets`, `n_output`) is precomputed as the
    /// prefix sums / total of the per-feature block widths (the category count,
    /// less a dropped category or the categories merged into the infrequent
    /// column).
    ///
    /// Exact float equality is what `np.unique` does, so two values that differ
    /// by an ULP are distinct categories here, exactly as in sklearn.
    ///
    /// # NaN handling (#2223)
    ///
    /// `NaN` is treated as a valid category, matching sklearn's `_unique_np`
    /// (`_encode.py:70-74`): it sorts LAST and a run of duplicate NaNs collapses
    /// to a SINGLE sorted-last category. The sort comparator orders `NaN` after
    /// every other value, and the dedup loop explicitly treats two consecutive
    /// NaNs as equal, because plain `==` would keep every one (`NaN != NaN`). A
    /// NaN cell at `transform` then one-hots that trailing category. `fit` never
    /// panics (R-CODE-2).
    ///
    /// # Errors
    ///
    /// - [`FerroError::InvalidParameter`] if `min_frequency` or `max_categories`
    ///   is `Some(0)`, if `x` contains an infinity, or if infrequent grouping is
    ///   combined with a `drop` other than `None_`.
    /// - [`FerroError::InsufficientSamples`] if the input has zero rows
    ///   (matching sklearn's `check_array` minimum-of-1-sample requirement).
    fn fit(&self, x: &Array2<F>, _y: &()) -> Result<FittedOneHotEncoder<F>, FerroError> {
        // sklearn `_parameter_constraints` (`@_fit_context`, `_encoders.py:733-738`)
        // validates the params BEFORE the data: `min_frequency` is
        // `Interval(Integral, 1, None)` and `max_categories` is
        // `Interval(Integral, 1, None)`, so a value of 0 raises
        // `InvalidParameterError` ("must be ... in the range [1, inf)"). #1154/REQ-7.
        // (handle_unknown/drop are type-safe Rust enums, so their StrOptions
        // constraints are provided by the type system; no runtime check needed.)
        if self.min_frequency == Some(0) {
            return Err(FerroError::InvalidParameter {
                name: "min_frequency".into(),
                reason: "must be an int in the range [1, inf)".into(),
            });
        }
        if self.max_categories == Some(0) {
            return Err(FerroError::InvalidParameter {
                name: "max_categories".into(),
                reason: "must be an int in the range [1, inf)".into(),
            });
        }
        let n_samples = x.nrows();
        if n_samples == 0 {
            return Err(FerroError::InsufficientSamples {
                required: 1,
                actual: 0,
                context: "OneHotEncoder::fit".into(),
            });
        }
        // sklearn `OneHotEncoder.fit` -> `check_array(force_all_finite="allow-nan")`:
        // NaN is a valid CATEGORY (#2223), but +/-inf is REJECTED (verified live:
        // fit([[inf]]) -> ValueError "Input contains infinity"). #2225.
        if x.iter().any(|v| v.is_infinite()) {
            return Err(FerroError::InvalidParameter {
                name: "X".into(),
                reason: "Input X contains infinity or a value too large for dtype.".into(),
            });
        }

        let infrequent_enabled = self.infrequent_enabled();

        let n_features = x.ncols();
        let mut categories_: Vec<Vec<F>> = Vec::with_capacity(n_features);
        // Per-feature, per-category training counts ALIGNED with `categories_[j]`
        // (`category_counts[j][idx]` is the count of `categories_[j][idx]`).
        // Only needed when infrequent grouping is enabled: sklearn computes
        // counts via `_unique(Xi, return_counts=True)` (`_encoders.py:99-102`).
        let mut category_counts: Vec<Vec<usize>> = Vec::with_capacity(n_features);

        for j in 0..n_features {
            // Collect this column's values, sort ascending (sklearn `np.unique`
            // sorts), then dedup by EXACT equality to the sorted-unique set.
            let mut col: Vec<F> = x.column(j).iter().copied().collect();
            // Sort ascending with NaN LAST (sklearn `_unique_np` keeps any NaN at
            // the end, `_encode.py:70-74`); `partial_cmp` alone returns None for
            // NaN and would leave it unmoved (#2223).
            col.sort_by(|a, b| match (a.is_nan(), b.is_nan()) {
                (true, true) => Ordering::Equal,
                (true, false) => Ordering::Greater,
                (false, true) => Ordering::Less,
                (false, false) => a.partial_cmp(b).unwrap_or(Ordering::Equal),
            });
            // Build the sorted-unique set AND, when infrequent grouping is
            // enabled, the per-category run-length count (the sorted column has
            // each category's occurrences contiguous, so a run length is the
            // count). Consecutive EXACT-equal values collapse (an ULP-apart pair
            // stays distinct, like `np.unique`), AND consecutive NaNs collapse to
            // ONE (`dedup` alone keeps every NaN since `NaN != NaN`; sklearn
            // collapses the trailing NaN run to a single sorted-last category,
            // #2223).
            let mut cats: Vec<F> = Vec::with_capacity(col.len());
            let mut counts: Vec<usize> = Vec::with_capacity(col.len());
            for v in col {
                match cats.last() {
                    Some(&last) if last == v || (last.is_nan() && v.is_nan()) => {
                        if let Some(c) = counts.last_mut() {
                            *c += 1;
                        }
                    }
                    _ => {
                        cats.push(v);
                        counts.push(1);
                    }
                }
            }
            categories_.push(cats);
            category_counts.push(counts);
        }

        // Infrequent grouping (REQ-5b). When enabled, identify each feature's
        // infrequent category indices and build the per-feature index→output
        // column mapping; otherwise every feature has no infrequent categories
        // and the mapping is the identity.
        let mut infrequent_indices_: Vec<Vec<usize>> = Vec::with_capacity(n_features);
        let mut infrequent_map: Vec<Vec<usize>> = Vec::with_capacity(n_features);
        if infrequent_enabled {
            // REQ-5a × REQ-5b interaction is DEFERRED: combining infrequent
            // grouping with `drop` is rejected at fit (sklearn ALLOWS it, but the
            // remapping is intricate; documented scope, R-HONEST-3). Require
            // `drop == None_`.
            if self.drop != OneHotDrop::None_ {
                return Err(FerroError::InvalidParameter {
                    name: "drop".into(),
                    reason: "infrequent grouping (min_frequency/max_categories) with drop is not \
                             yet supported"
                        .into(),
                });
            }
            for counts in &category_counts {
                let infreq = identify_infrequent(counts, self.min_frequency, self.max_categories);
                let map = build_infrequent_map(counts.len(), &infreq);
                infrequent_indices_.push(infreq);
                infrequent_map.push(map);
            }
        } else {
            for cats in &categories_ {
                infrequent_indices_.push(Vec::new());
                infrequent_map.push((0..cats.len()).collect());
            }
        }

        // Compute `drop_idx_` from `drop` + the learned `categories_`
        // (sklearn `_compute_drop_idx`, `_encoders.py:812-831`). `drop=None` →
        // every feature `None`; `drop='first'` → every feature `Some(0)`;
        // `drop='if_binary'` → `Some(0)` iff the feature has exactly two
        // categories, else `None`. (With infrequent grouping active `drop` is
        // forced to `None_` above, so every entry is `None`.)
        let drop_idx_: Vec<Option<usize>> = match self.drop {
            OneHotDrop::None_ => vec![None; n_features],
            OneHotDrop::First => categories_
                .iter()
                .map(|cats| if cats.is_empty() { None } else { Some(0) })
                .collect(),
            OneHotDrop::IfBinary => categories_
                .iter()
                .map(|cats| if cats.len() == 2 { Some(0) } else { None })
                .collect(),
        };

        let mut fitted = FittedOneHotEncoder {
            categories_,
            // Placeholder; recomputed below from per-feature block widths.
            offsets: Vec::new(),
            n_output: 0,
            // `handle_unknown` only affects `transform` (sklearn learns the same
            // `categories_` regardless); thread the configured mode through. Note
            // (verified live, sklearn 1.5.2): `drop` + `handle_unknown='ignore'`
            // is ALLOWED: sklearn does NOT raise at fit; it warns on unknown at
            // transform and encodes the unknown as an all-zero block (the same as
            // the dropped category). So fit imposes no drop+ignore constraint.
            handle_unknown: self.handle_unknown,
            drop_idx_,
            infrequent_indices_,
            infrequent_map,
        };

        // Recompute the output-column layout from each feature's block width:
        // `block_width(j)` is `n_frequent + 1` with infrequent grouping (the
        // trailing infrequent column), else `len - (1 if dropped)`. `offsets` is
        // the prefix sum of those widths; `n_output` the total (sklearn
        // `_compute_n_features_outs`, `_encoders.py:936-955`; `feature_indices`,
        // `:1049`).
        let mut offsets: Vec<usize> = Vec::with_capacity(n_features);
        let mut n_output: usize = 0;
        for j in 0..n_features {
            offsets.push(n_output);
            n_output += fitted.block_width(j);
        }
        fitted.offsets = offsets;
        fitted.n_output = n_output;

        Ok(fitted)
    }
}
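/*
Illustrative sketch only, kept in a comment so it is not compiled into the
crate. It isolates the per-column step of `fit`: sort with NaN last, then
collapse runs into sorted-unique categories with run-length counts. The
name `unique_with_counts` is hypothetical and the sketch is fixed to `f64`
instead of the generic `F`.

```rust
use std::cmp::Ordering;

/// Sorted-unique values of one column (NaN last, collapsed to one entry)
/// together with how often each value occurred.
fn unique_with_counts(column: &[f64]) -> (Vec<f64>, Vec<usize>) {
    let mut col = column.to_vec();
    // NaN compares greater than everything else so it ends up last.
    col.sort_by(|a, b| match (a.is_nan(), b.is_nan()) {
        (true, true) => Ordering::Equal,
        (true, false) => Ordering::Greater,
        (false, true) => Ordering::Less,
        (false, false) => a.partial_cmp(b).unwrap_or(Ordering::Equal),
    });
    let mut cats: Vec<f64> = Vec::new();
    let mut counts: Vec<usize> = Vec::new();
    for v in col {
        match cats.last() {
            // Same run: exact equality, or both NaN (since NaN != NaN).
            Some(&last) if last == v || (last.is_nan() && v.is_nan()) => {
                if let Some(c) = counts.last_mut() {
                    *c += 1;
                }
            }
            _ => {
                cats.push(v);
                counts.push(1);
            }
        }
    }
    (cats, counts)
}
```

For the column `[5, NaN, 2, 5, NaN, 9]` the categories are `2, 5, 9, NaN`
with counts `1, 2, 1, 2`.
*/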

/// Identify the indices of infrequent categories for one feature, given the
/// per-category training `counts` (aligned with `categories_[j]`) and the
/// `min_frequency`/`max_categories` thresholds.
///
/// Mirrors scikit-learn's `_BaseEncoder._identify_infrequent`
/// (`_encoders.py:275-318`):
/// 1. min_frequency: a category with `count < min_frequency` is infrequent
///    (`:295-296`, integer form only; the float-fraction form is out of scope).
/// 2. max_categories: if (after step 1) the feature would still produce more
///    than `max_categories` output columns (counted as `n_remaining_frequent +
///    1` for the infrequent group, `:303`), the least-frequent categories are
///    additionally marked infrequent until only `max_categories - 1` frequent
///    categories remain (`:304-315`). Ties broken by a STABLE sort over the
///    FULL count array, so among equal counts the SMALLER category index is
///    marked infrequent first (sklearn `np.argsort(kind="mergesort")[:-k]`).
///    `max_categories == 1` (frequent_category_count 0) makes every category
///    infrequent (`:307-309`).
///
/// Returns the sorted-ascending infrequent indices (empty if none, which is
/// sklearn's `None`). Never panics (R-CODE-2).
fn identify_infrequent(
    counts: &[usize],
    min_frequency: Option<usize>,
    max_categories: Option<usize>,
) -> Vec<usize> {
    let n = counts.len();
    let mut infrequent_mask = vec![false; n];

    // Step 1: min_frequency (integer count). `count < min_frequency`.
    if let Some(min_freq) = min_frequency {
        for (idx, &c) in counts.iter().enumerate() {
            if c < min_freq {
                infrequent_mask[idx] = true;
            }
        }
    }

    // Step 2: max_categories on the survivors. `n_current_features` counts the
    // remaining frequent categories PLUS 1 for the infrequent group
    // (`_encoders.py:303`).
    if let Some(max_cat) = max_categories {
        let n_infreq = infrequent_mask.iter().filter(|&&m| m).count();
        let n_current_features = n - n_infreq + 1;
        if max_cat < n_current_features {
            // `max_categories` includes the one infrequent category. `fit`
            // rejects `max_categories == 0` up front; `saturating_sub` keeps this
            // helper panic-free even if it is called with 0.
            let frequent_category_count = max_cat.saturating_sub(1);
            if frequent_category_count == 0 {
                // All categories are infrequent (`:307-309`).
                infrequent_mask.iter_mut().for_each(|m| *m = true);
            } else {
                // Stable argsort over the FULL count array (ascending by count,
                // ties by ascending index), then mark the smallest
                // `n - frequent_category_count` levels infrequent, i.e. keep the
                // top `frequent_category_count` by count, with ties resolved in
                // favor of the LARGER index (`np.argsort(kind="mergesort")[:-k]`,
                // `:312-315`).
                let mut order: Vec<usize> = (0..n).collect();
                order.sort_by(|&a, &b| counts[a].cmp(&counts[b]).then(a.cmp(&b)));
                let keep = frequent_category_count.min(n);
                let cut = n - keep;
                for &idx in &order[..cut] {
                    infrequent_mask[idx] = true;
                }
            }
        }
    }

    infrequent_mask
        .iter()
        .enumerate()
        .filter_map(|(idx, &m)| if m { Some(idx) } else { None })
        .collect()
}
1008
1009/// Build the per-feature mapping from a `categories_[j]` index to its output
1010/// column slot WITHIN the feature's block (before adding `offsets[j]`).
1011///
1012/// Mirrors scikit-learn's `_default_to_infrequent_mappings[j]`
1013/// (`_encoders.py:373-400`): frequent categories take slots `0..n_frequent` in
1014/// their original (ascending-index) order; every infrequent category maps to the
1015/// single trailing slot `n_frequent`. With no infrequent categories the mapping
1016/// is the identity `0..n`. `infrequent` must be sorted ascending. Never panics
1017/// (R-CODE-2): every index is bounds-checked.
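///
/// # Example
///
/// A small illustration with made-up inputs (the helper is private, so the
/// block is marked `ignore` rather than compiled as a doctest): with four
/// categories of which indices 1 and 2 are infrequent, the two frequent
/// categories keep slots 0 and 1 and both infrequent ones share slot 2.
///
/// ```ignore
/// assert_eq!(build_infrequent_map(4, &[1, 2]), vec![0, 2, 2, 1]);
/// // No infrequent categories: the identity mapping.
/// assert_eq!(build_infrequent_map(3, &[]), vec![0, 1, 2]);
/// ```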
fn build_infrequent_map(n: usize, infrequent: &[usize]) -> Vec<usize> {
    if infrequent.is_empty() {
        return (0..n).collect();
    }
    let n_frequent = n - infrequent.len();
    let mut map = vec![n_frequent; n];
    let mut next_frequent = 0usize;
    for (idx, slot) in map.iter_mut().enumerate() {
        // Infrequent indices keep the trailing slot (already set to
        // `n_frequent`); only frequent ones are renumbered in ascending order.
        if infrequent.binary_search(&idx).is_err() {
            *slot = next_frequent;
            next_frequent += 1;
        }
    }
    map
}

impl<F: Float + Send + Sync + 'static> Transform<Array2<F>> for FittedOneHotEncoder<F> {
    type Output = Array2<F>;
    type Error = FerroError;

    /// Transform numeric categorical data into a dense one-hot encoded matrix.
    ///
    /// Each value is one-hot by **category membership**: for input column `j` the
    /// value `x[[i, j]]` is matched (by exact equality) against `categories_[j]`,
    /// and the bit at output column `offsets[j] + idx` is set, where `idx` is the
    /// value's position in the sorted-unique set. The per-feature one-hot blocks
    /// are concatenated left-to-right, matching scikit-learn's
    /// `OneHotEncoder(sparse_output=False)` output column layout
    /// (`_BaseEncoder._transform`, `_encoders.py:206-240`).
    ///
    /// A value not present in `categories_[j]` is an **unknown category**. Its
    /// handling depends on the configured `handle_unknown`
    /// ([`OneHotEncoder::with_handle_unknown`]):
    /// - [`OneHotHandleUnknown::Error`] (the default): returns an error, matching
    ///   sklearn's `handle_unknown='error'`
    ///   (`ValueError("Found unknown categories … during transform")`,
    ///   `_encoders.py:209-214`).
    /// - [`OneHotHandleUnknown::Ignore`]: leaves that feature's one-hot block
    ///   **all-zero** for this row (no column is set), matching sklearn's
    ///   `handle_unknown='ignore'` (`_encoders.py:215-240`: the unknown row is
    ///   masked out so no encoded column is set). Every KNOWN feature still emits
    ///   its normal one-hot bit.
    ///
    /// The +/-inf rejection (#2225), the ncols guard, and the 0-row handling are
    /// unaffected by `handle_unknown`: a non-finite +/-inf value is invalid input
    /// (not an unknown category) and still errors even in `Ignore` mode.
    ///
    /// # Errors
    ///
    /// Returns [`FerroError::ShapeMismatch`] if the number of columns does not
    /// match the number of features seen during fitting.
    ///
    /// Returns [`FerroError::InvalidParameter`] if any value is an unknown
    /// category (not in the learned `categories_[j]` set) AND `handle_unknown`
    /// is [`OneHotHandleUnknown::Error`] (the default); under
    /// [`OneHotHandleUnknown::Ignore`] an unknown category never errors. Also
    /// returned if any value is +/-infinite (invalid input, #2225).
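    ///
    /// # Example
    ///
    /// A sketch of the `Ignore` behaviour described above. It assumes
    /// [`OneHotEncoder::with_handle_unknown`] is a by-value builder that takes a
    /// [`OneHotHandleUnknown`] and returns the encoder (that signature is an
    /// assumption), so the block is marked `ignore`; the data is made up.
    ///
    /// ```ignore
    /// use ndarray::array;
    ///
    /// // Assumed builder call; adjust if the real signature differs.
    /// let enc = OneHotEncoder::<f64>::new()
    ///     .with_handle_unknown(OneHotHandleUnknown::Ignore);
    /// let fitted = enc.fit(&array![[2.0_f64], [5.0], [9.0]], &()).unwrap();
    /// // 5.0 is the second learned category; 7.0 was never seen during `fit`.
    /// let out = fitted.transform(&array![[5.0_f64], [7.0]]).unwrap();
    /// assert_eq!(out.row(0).to_vec(), vec![0.0, 1.0, 0.0]);
    /// // The unknown value leaves its block all-zero instead of erroring.
    /// assert_eq!(out.row(1).to_vec(), vec![0.0, 0.0, 0.0]);
    /// ```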
    fn transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
        let n_features = self.categories_.len();
        // sklearn `transform` -> `check_array(force_all_finite="allow-nan")`
        // (`_encoders.py`): +/-inf is rejected with "Input contains infinity"
        // BEFORE the per-feature membership lookup (so an inf value reports the
        // finite-check error, NOT "unknown category"); NaN passes (it can be a
        // known category). #2225.
        if x.iter().any(|v| v.is_infinite()) {
            return Err(FerroError::InvalidParameter {
                name: "X".into(),
                reason: "Input X contains infinity or a value too large for dtype.".into(),
            });
        }
        if x.ncols() != n_features {
            return Err(FerroError::ShapeMismatch {
                expected: vec![x.nrows(), n_features],
                actual: vec![x.nrows(), x.ncols()],
                context: "FittedOneHotEncoder::transform".into(),
            });
        }

        let n_samples = x.nrows();
        let mut out = Array2::zeros((n_samples, self.n_output));

        for j in 0..n_features {
            let cats = &self.categories_[j];
            let offset = self.offsets[j];
            // The per-feature dropped category index, if any (`drop_idx_[j]`).
            // Used to shift kept categories down by one and to emit an all-zero
            // block for the dropped category (sklearn `transform`,
            // `_encoders.py:1033-1046`: `X_int > to_drop` decrements, the dropped
            // cell is masked out).
            let drop_d = self.drop_idx_.get(j).copied().flatten();
            // The per-feature infrequent remapping (REQ-5b). When feature `j` has
            // infrequent categories, a found category index maps to its block
            // slot via `infrequent_map[j][idx]` (a frequent category → its
            // remapped slot, an infrequent category → the trailing slot). When
            // there are none the map is the identity and `infreq` is `false`, so
            // the existing `drop` path is unchanged (the two are mutually
            // exclusive: `fit` rejects their combination).
            let infreq = self.has_infrequent(j);
            let infreq_map = self.infrequent_map.get(j);
            for i in 0..n_samples {
                let value = x[[i, j]];
                // Membership lookup: find the value's index in the sorted-unique
                // `categories_[j]` by EXACT equality (np.unique / `_encode`
                // semantics). A small linear scan over the per-feature category
                // set; bounds-safe (no unchecked indexing; R-CODE-2).
                match cats
                    .iter()
                    .position(|&c| c == value || (c.is_nan() && value.is_nan()))
                {
                    // Infrequent grouping active: place the value in its remapped
                    // block slot (`_BaseEncoder._map_infrequent_categories`,
                    // `_encoders.py:442-452`: `X_int = np.take(mapping, X_int)`).
                    Some(idx) if infreq => {
                        if let Some(&slot) = infreq_map.and_then(|m| m.get(idx)) {
                            out[[i, offset + slot]] = F::one();
                        }
                    }
                    Some(idx) => match drop_d {
                        // The dropped category encodes to an ALL-ZERO block: set
                        // nothing (sklearn masks the dropped cell out of `X_mask`,
                        // `_encoders.py:1037,1046`). `out` is already zero-filled.
                        Some(d) if idx == d => {}
                        // A KEPT category after a drop shifts down by one when its
                        // index is past the dropped one (sklearn `X_int > to_drop`
                        // decrements, `_encoders.py:1045`): the output column is
                        // `idx` if `idx < d`, else `idx - 1`.
                        Some(d) if idx > d => out[[i, offset + idx - 1]] = F::one(),
                        // No drop on this feature, or a kept category before the
                        // dropped one (`idx < d`): the column is `offset + idx`.
                        _ => out[[i, offset + idx]] = F::one(),
                    },
                    None => match self.handle_unknown {
                        // handle_unknown='ignore' (`_encoders.py:215-240`): the
                        // unknown row is masked out and NO column in this
                        // feature's block is set, so the per-feature one-hot block
                        // stays ALL-ZERO. `out` is already zero-filled, so we just
                        // skip; every KNOWN feature still sets its own bit.
                        OneHotHandleUnknown::Ignore => continue,
                        // handle_unknown='error' (the sklearn default, SHIPPED
                        // REQ-2, UNCHANGED): ValueError "Found unknown categories
                        // … during transform" (`_encoders.py:209-214`). `F: Float`
                        // is not `Display`, so report the value via `to_f64`.
                        OneHotHandleUnknown::Error => {
                            let v = value.to_f64();
                            let shown = match v {
                                Some(f) => format!("[{f}]"),
                                None => "[<non-finite>]".to_string(),
                            };
                            return Err(FerroError::InvalidParameter {
                                name: format!("x[{i},{j}]"),
                                reason: format!(
                                    "Found unknown categories {shown} in column {j} during transform"
                                ),
                            });
                        }
                    },
                }
            }
        }

        Ok(out)
    }
}

/// Implement `Transform` on the unfitted encoder to satisfy the `FitTransform: Transform`
/// supertrait bound. Calling `transform` on an unfitted encoder always returns an error.
impl<F: Float + Send + Sync + 'static> Transform<Array2<F>> for OneHotEncoder<F> {
    type Output = Array2<F>;
    type Error = FerroError;

    /// Always returns an error: the encoder must be fitted first.
    ///
    /// Use [`Fit::fit`] to produce a [`FittedOneHotEncoder`], then call
    /// [`Transform::transform`] on that.
    fn transform(&self, _x: &Array2<F>) -> Result<Array2<F>, FerroError> {
        Err(FerroError::InvalidParameter {
            name: "OneHotEncoder".into(),
            reason: "encoder must be fitted before calling transform; use fit() first".into(),
        })
    }
}

impl<F: Float + Send + Sync + 'static> FitTransform<Array2<F>> for OneHotEncoder<F> {
    type FitError = FerroError;

    /// Fit the encoder on `x` and return the one-hot encoded output in one step.
    ///
    /// # Errors
    ///
    /// Returns an error if fitting or transformation fails.
    fn fit_transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
        let fitted = self.fit(x, &())?;
        fitted.transform(x)
    }
}

/// Convenience: encode a 1-D array of numeric categories.
///
/// This wraps the input in a single-column `Array2<F>` and returns the encoded
/// result with one-hot columns for that single feature, matching the membership
/// encoding of [`Transform::transform`].
impl<F: Float + Send + Sync + 'static> FittedOneHotEncoder<F> {
    /// Transform a 1-D slice of numeric category values.
    ///
    /// # Errors
    ///
    /// Returns an error if the encoder was not fitted on exactly one column, or
    /// if any value is an unknown category (not in the learned `categories_[0]`).
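    ///
    /// # Example
    ///
    /// A minimal sketch with made-up data, using only calls that appear in this
    /// file's tests. It is marked `ignore` because the import paths a doctest
    /// would need are not assumed here, and it presumes the default encoder
    /// (no `drop`, no infrequent grouping).
    ///
    /// ```ignore
    /// use ndarray::array;
    ///
    /// let enc = OneHotEncoder::<f64>::new();
    /// let fitted = enc.fit(&array![[2.0_f64], [5.0], [9.0]], &()).unwrap();
    /// // Each slice element becomes one row of the single feature's block.
    /// let out = fitted.transform_1d(&[9.0, 2.0]).unwrap();
    /// assert_eq!(out.shape(), &[2, 3]);
    /// assert_eq!(out.row(0).to_vec(), vec![0.0, 0.0, 1.0]);
    /// assert_eq!(out.row(1).to_vec(), vec![1.0, 0.0, 0.0]);
    /// ```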
    pub fn transform_1d(&self, x: &[F]) -> Result<Array2<F>, FerroError> {
        if self.categories_.len() != 1 {
            return Err(FerroError::InvalidParameter {
                name: "transform_1d".into(),
                reason: "encoder was fitted on more than one column; use transform instead".into(),
            });
        }
        let col = Array2::from_shape_vec((x.len(), 1), x.to_vec()).map_err(|e| {
            FerroError::InvalidParameter {
                name: "x".into(),
                reason: e.to_string(),
            }
        })?;
        self.transform(&col)
    }
}

// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------

#[cfg(test)]
mod tests {
    use super::*;
    use ndarray::array;

    #[test]
    fn test_one_hot_single_column() {
        let enc = OneHotEncoder::<f64>::new();
        let x = array![[0.0_f64], [1.0], [2.0]];
        let fitted = enc.fit(&x, &()).unwrap();
        assert_eq!(fitted.categories(), &[vec![0.0, 1.0, 2.0]]);
        assert_eq!(fitted.n_categories(), vec![3]);
        assert_eq!(fitted.n_output_features(), 3);

        let out = fitted.transform(&x).unwrap();
        assert_eq!(out.shape(), &[3, 3]);
        // Row 0: category 0 → [1, 0, 0]
        assert_eq!(out[[0, 0]], 1.0);
        assert_eq!(out[[0, 1]], 0.0);
        assert_eq!(out[[0, 2]], 0.0);
        // Row 1: category 1 → [0, 1, 0]
        assert_eq!(out[[1, 0]], 0.0);
        assert_eq!(out[[1, 1]], 1.0);
        assert_eq!(out[[1, 2]], 0.0);
        // Row 2: category 2 → [0, 0, 1]
        assert_eq!(out[[2, 0]], 0.0);
        assert_eq!(out[[2, 1]], 0.0);
        assert_eq!(out[[2, 2]], 1.0);
    }

    #[test]
    fn test_one_hot_multi_column() {
        let enc = OneHotEncoder::<f64>::new();
        // Two columns: col0 has 3 categories, col1 has 2 categories
        let x = array![[0.0_f64, 0.0], [1.0, 1.0], [2.0, 0.0]];
        let fitted = enc.fit(&x, &()).unwrap();
        assert_eq!(fitted.categories(), &[vec![0.0, 1.0, 2.0], vec![0.0, 1.0]]);
        assert_eq!(fitted.n_categories(), vec![3, 2]);
        assert_eq!(fitted.n_output_features(), 5);

        let out = fitted.transform(&x).unwrap();
        assert_eq!(out.shape(), &[3, 5]);
        // Row 0: (0, 0) → [1,0,0, 1,0]
        assert_eq!(out.row(0).to_vec(), vec![1.0, 0.0, 0.0, 1.0, 0.0]);
        // Row 1: (1, 1) → [0,1,0, 0,1]
        assert_eq!(out.row(1).to_vec(), vec![0.0, 1.0, 0.0, 0.0, 1.0]);
        // Row 2: (2, 0) → [0,0,1, 1,0]
        assert_eq!(out.row(2).to_vec(), vec![0.0, 0.0, 1.0, 1.0, 0.0]);
    }

    #[test]
    fn test_non_contiguous_single_column() {
        // The REQ-3 headline: non-contiguous integers {2,5,9} must yield 3
        // category columns (one per unique value), NOT max+1 == 10.
        let enc = OneHotEncoder::<f64>::new();
        let x = array![[2.0_f64], [5.0], [9.0]];
        let fitted = enc.fit(&x, &()).unwrap();
        assert_eq!(fitted.categories(), &[vec![2.0, 5.0, 9.0]]);
        assert_eq!(fitted.n_output_features(), 3);
        let out = fitted.transform(&x).unwrap();
        assert_eq!(out.shape(), &[3, 3]);
        assert_eq!(out.row(0).to_vec(), vec![1.0, 0.0, 0.0]);
        assert_eq!(out.row(1).to_vec(), vec![0.0, 1.0, 0.0]);
        assert_eq!(out.row(2).to_vec(), vec![0.0, 0.0, 1.0]);
    }

    #[test]
    fn test_unknown_category_error() {
        let enc = OneHotEncoder::<f64>::new();
        let x_train = array![[0.0_f64], [1.0]];
        let fitted = enc.fit(&x_train, &()).unwrap();
        // Value 2.0 was not seen during fitting → unknown category.
        let x_bad = array![[2.0_f64]];
        assert!(fitted.transform(&x_bad).is_err());
    }

    #[test]
    fn test_fit_transform_equivalence() {
        let enc = OneHotEncoder::<f64>::new();
        let x = array![[0.0_f64, 1.0], [1.0, 0.0], [2.0, 1.0]];
        let via_fit_transform: Array2<f64> = enc.fit_transform(&x).unwrap();
        let fitted = enc.fit(&x, &()).unwrap();
        let via_separate = fitted.transform(&x).unwrap();
        for (a, b) in via_fit_transform.iter().zip(via_separate.iter()) {
            assert!((a - b).abs() < 1e-15);
        }
    }

    #[test]
    fn test_shape_mismatch_error() {
        let enc = OneHotEncoder::<f64>::new();
        let x_train = array![[0.0_f64, 1.0], [1.0, 0.0]];
        let fitted = enc.fit(&x_train, &()).unwrap();
        let x_bad = array![[0.0_f64]];
        assert!(fitted.transform(&x_bad).is_err());
    }
}