ferrolearn_preprocess/one_hot_encoder.rs
1//! One-hot encoder for categorical numeric features.
2//!
3//! `fit` learns, for each input column, `categories_[j]` = the **sorted unique
4//! set** of values in that column (matching scikit-learn's
5//! `OneHotEncoder.categories_`, `_BaseEncoder._fit:99`, `categories_ =
6//! _unique(Xi)`). `transform` emits a dense binary matrix where each learned
7//! category gets its own output column; the per-feature blocks are concatenated
8//! left-to-right (column 0's categories first, then column 1's, …), and a value
9//! is one-hot by **category membership** (the value's index within
10//! `categories_[j]`), NOT by an assumed contiguous `0..max` integer layout.
11//!
12//! # Example
13//!
14//! ```text
15//! Input column with the (non-contiguous) categories {2, 5, 9}:
16//! [2, 5, 9] → [[1,0,0],[0,1,0],[0,0,1]] (3 columns, one per unique value)
17//! ```
18//!
19//! # `## REQ status`
20//!
21//! Binary (R-DEFER-2), translating `sklearn/preprocessing/_encoders.py` (`class OneHotEncoder`
22//! `:458`). Design doc: `.design/preprocess/one_hot_encoder.md`. Expected values from the live
23//! sklearn 1.5.2 oracle (R-CHAR-3). Consumer: crate re-export (`lib.rs`, grandfathered S5).
24//! HONEST (R-HONEST-3): ferrolearn ships a numeric (`F`-input) DENSE encoder whose `categories_`
25//! and column layout now match sklearn's `sparse_output=False` output for ANY finite numeric
26//! columns (contiguous or not); `drop` ({None,'first','if_binary'}, REQ-5a) IS shipped, as are
27//! `handle_unknown='ignore'` and `inverse_transform`/`get_feature_names_out`. Sparse-by-default
28//! output, string/object categories, infrequent grouping (`min_frequency`/`max_categories`,
29//! REQ-5b), the full ctor surface and the ferray substrate stay NOT-STARTED. The PyO3 binding ships the
30//! DENSE numeric path (`ferrolearn.OneHotEncoder`, REQ-8) with the unsupported surface surfaced
31//! as `NotImplementedError`/`ValueError` rather than silently mismatched (R-HONEST-3).
32//!
33//! | REQ | Status | Evidence |
34//! |---|---|---|
35//! | REQ-1 (dense one-hot via per-feature category blocks) | SHIPPED | `Transform::transform for FittedOneHotEncoder` zero-fills an `Array2<F>` of width `n_output()` then, for each value, sets `out[[i, offsets[j]+idx]]=1` where `idx` is the value's index in `categories_[j]` (membership), mirroring `_BaseEncoder._transform` (`_encoders.py:206-240`) + the one-hot block expansion. Consumer: crate re-export `lib.rs`. |
36//! | REQ-2 (sparse-by-default output) | NOT-STARTED | open prereq blocker #1149. Dense `Array2<F>` only; sklearn defaults `sparse_output=True` → scipy CSR (`:531`,`:748`). |
37//! | REQ-3 (categories_ = sorted unique set) | SHIPPED | `Fit::fit` computes `categories_[j]` = per-column values sorted via `partial_cmp` then exact-equality deduped to the sorted-unique set (`_BaseEncoder._fit:99` `categories_=_unique(Xi)`); precomputes `offsets` (prefix sums of `categories_[j].len()`) + `n_output`; rejects 0 rows (`InsufficientSamples`). `categories()` accessor exposes the learned sets. Transform is membership-based (value's index in `categories_[j]`), so non-contiguous integers (`[2,5,9]` → 3 columns, NOT 10) and arbitrary finite floats encode correctly — bit-exact to live sklearn 1.5.2 `sparse_output=False`: `categories_`/`transform`/non-contiguous-headline/offsets guards in `tests/divergence_one_hot_encoder.rs`. Consumer: crate re-export `lib.rs`. SCOPE: numeric `F` input; exact float equality for membership (np.unique semantics — documented); NaN-as-a-category is HANDLED (#2223): NaN sorts LAST + collapses to one category (sklearn `_encode.py:70-74`), a NaN row one-hots its column; +/-inf is REJECTED at `fit`/`transform` (#2225, `force_all_finite="allow-nan"` allows NaN but not inf); string/object input is REQ-3-string (NOT-STARTED, no String path). |
38//! | REQ-4 (handle_unknown {'error','ignore'}) | SHIPPED | `OneHotHandleUnknown` enum `{ Error (#[default]), Ignore }` (mirrors sklearn's `handle_unknown` `_parameter_constraints` `StrOptions({"error","ignore","infrequent_if_exist"})` default `"error"`, `_encoders.py:732,750`) + `OneHotEncoder::with_handle_unknown`/`handle_unknown()` builder+getter, threaded into `FittedOneHotEncoder` (`handle_unknown` field + getter) by `Fit::fit` (handle_unknown affects ONLY transform; `categories_` learned identically). `Transform::transform` unknown branch (`cats.iter().position(...) == None`): `Error` → `InvalidParameter` "Found unknown categories … during transform" (the SHIPPED REQ-2 default `ValueError`, `_encoders.py:209-214`, UNCHANGED); `Ignore` → `continue` leaving that feature's one-hot block ALL-ZERO (`_encoders.py:215-240`: unknown row masked out, no encoded column set), every KNOWN feature still one-hots. The +/-inf rejection (#2225), ncols + 0-row guards UNCHANGED (inf is invalid input, not an "unknown category" — still errors in `Ignore`; NaN with NO nan-category is "unknown" → all-zero block in `Ignore`, with a nan-category one-hots it). Never panics (R-CODE-2). Live-oracle parity (sklearn 1.5.2 `sparse_output=False`): `ignore_multifeature_all_zero_block` (`[[100,0],[5,99]]→[[0,0,0,1,0],[0,1,0,0,0]]`), `ignore_fully_unknown_row_all_zero`, `ignore_known_row_normal_one_hot`, `error_default_unknown_rejected`, `with_handle_unknown_ignore_known_value_normal`, `ignore_inf_still_rejected`, `ignore_nan_no_category_all_zero`, `ignore_nan_with_category_one_hots`, `handle_unknown_default_and_builder_abi` (`tests/divergence_one_hot_encoder.rs`). Consumer: crate re-export `lib.rs` (`OneHotHandleUnknown`). R-DEV-2. STILL NOT-STARTED: `'infrequent_if_exist'` (REQ-5). |
39//! | REQ-5a (`drop` {None,'first','if_binary'}) | SHIPPED | #1152: `OneHotDrop` enum `{ None_ (#[default]), First, IfBinary }` (mirrors sklearn `drop` `_parameter_constraints` `StrOptions({"first","if_binary"})` / `None`, `_encoders.py:730`,`:498-516`) + `OneHotEncoder::with_drop`/`drop()` builder+getter, threaded into `Fit::fit` which computes `drop_idx_: Vec<Option<usize>>` (sklearn `_compute_drop_idx`, `_encoders.py:812-831`: `None_`→all `None`; `First`→all `Some(0)` (empty feature `None`); `IfBinary`→`Some(0)` iff `len==2` else `None`) and recomputes `offsets`/`n_output` from the per-feature BLOCK WIDTH `len - (drop_idx is Some)`. `FittedOneHotEncoder::drop_idx_()` accessor exposes it. `Transform::transform` (`_encoders.py:1033-1046`): the dropped category emits an ALL-ZERO block; a kept category at membership index `idx` maps to output col `offset + (idx if idx<d else idx-1)` (the `X_int > to_drop` decrement). `inverse_transform` (`_encoders.py:1124-1172`): an all-zero block with `drop_idx_[j]==Some(d)` inverts to the DROPPED category `categories_[j][d]` in BOTH handle_unknown modes (sklearn checks `_drop_idx_after_grouping[i] is not None` FIRST, bypassing the all-zeros error / None paths); a 0-width fully-dropped feature fills the dropped category (`:1132-1135`); a kept block position `pos>=d` maps to category `pos+1`. `get_feature_names_out` OMITS the dropped category (`_compute_transformed_categories` `remove_dropped=True`, `:909`,`:1209-1212`). DROP+IGNORE interaction (verified LIVE, sklearn 1.5.2): `drop` + `handle_unknown='ignore'` is ALLOWED (does NOT raise at fit; warns on unknown at transform, encoding the unknown as an all-zero block == the dropped category) — ferrolearn matches (fit imposes no constraint). NEVER panics: every drop-shift index uses `get`/bounds-checked arithmetic (R-CODE-2). Live-oracle parity (sklearn 1.5.2 `sparse_output=False`, `drop=...`): `drop_first_*`, `drop_if_binary_*`, `drop_inverse_roundtrip_*`, `drop_single_category_fully_dropped_*`, `drop_shift_3cat_*`, `drop_plus_ignore_allowed_*`, `drop_idx_abi_*`, `drop_none_unchanged_*` (`tests/divergence_one_hot_encoder.rs`). Consumer: crate re-export `lib.rs` (`OneHotDrop`). R-DEV-2. |
40//! | REQ-5b (infrequent grouping `min_frequency`/`max_categories`) | SHIPPED | #1152: `OneHotEncoder::with_min_frequency`/`with_max_categories` (+`min_frequency()`/`max_categories()` getters) add the integer-count infrequent thresholds (`_encoders.py:566-587`,`:733-738`). `Fit::fit` computes per-category training counts (the run-length over the sorted column) and, when `infrequent_enabled`, calls `identify_infrequent` (mirrors `_BaseEncoder._identify_infrequent`, `_encoders.py:275-318`: min_frequency `count < min_freq` FIRST, then max_categories on the survivors via a STABLE argsort over the full count array keeping the top `max_categories-1` — ties favor the LARGER index; `max_categories==1` → all infrequent) + `build_infrequent_map` (mirrors `_default_to_infrequent_mappings`, `:373-400`: frequent → its remapped slot `0..n_frequent`, infrequent → the trailing slot). `FittedOneHotEncoder` carries `infrequent_indices_` + the per-feature `infrequent_map`; `block_width` becomes `n_frequent + 1` (sklearn `_compute_n_features_outs`, `:948-953`); `offsets`/`n_output` recomputed from it. `infrequent_categories()` exposes the infrequent VALUES per feature (`infrequent_categories_`, `:254-262`,`:625-633`). `Transform::transform` routes a found category through `infrequent_map[j][idx]` (frequent → own col, infrequent → trailing col; `_map_infrequent_categories`, `:442-452`). `inverse_transform` maps the trailing infrequent column to `F::nan()` (DOCUMENTED SCOPE, R-HONEST-3: `Array2<F>` cannot hold sklearn's `'infrequent_sklearn'` string, `:1675-1677`, like the ignore-None NaN proxy #2227), frequent cols → their category. `get_feature_names_out` emits the frequent names + a trailing `"x{j}_infrequent_sklearn"` (`_compute_transformed_categories`, `:913-921`). Infrequent grouping REQUIRES `drop==None_` — combining it errors `InvalidParameter` (REQ-5a×5b interaction DEFERRED; sklearn allows it). Never panics (every remap bounds-checked, R-CODE-2). Live-oracle parity (sklearn 1.5.2 `sparse_output=False`): `infrequent_min_frequency_*`, `infrequent_max_categories_*`, `infrequent_max_categories_tiebreak`, `infrequent_both_*`, `infrequent_inverse_*`, `infrequent_feature_names_*`, `infrequent_multifeature_offsets`, `infrequent_no_infrequent_*`, `infrequent_drop_rejected`, `infrequent_disabled_unchanged` (`tests/divergence_one_hot_encoder.rs`). Consumer: crate re-export `lib.rs`. STILL NOT-STARTED: the FLOAT-fraction `min_frequency` (`:573-575`,`:297-299`), `drop`+infrequent (`:518-520`,`:818-902`), and `'infrequent_if_exist'` (`:550-560`) stay unimplemented. |
41//! | REQ-6 (inverse_transform + get_feature_names_out) | SHIPPED | `FittedOneHotEncoder::inverse_transform` reduces each per-feature block `x[:, offsets[j]..offsets[j]+len(categories_[j])]` via **argmax** (numpy first-max-on-ties) to `categories_[j][argmax]`, then handles an ALL-ZERO block (`block_sum == 0`) per `handle_unknown` (sklearn `_encoders.py:1141`,`:1159-1168`): `Error` -> `InvalidParameter` ("Samples can not be inverted ... all zeros"); `Ignore` -> the unknown-category sentinel inverts to `None` in sklearn (`:1183`), represented here as `NaN` (Array2<F> cannot hold None, #2227) with the KNOWN feature blocks still recovered; 0-row → `InsufficientSamples`, `ncols != n_output` → `ShapeMismatch` (`:1100-1104`). Never panics (block slices bounds-checked, R-CODE-2). `FittedOneHotEncoder::get_feature_names_out` emits `format!("x{j}_{cat}")` over `categories_` with default `input_features=["x0",..]` + the `"concat"` combiner (`feature+"_"+str(category)`, `:1217,1224`) → `["x0_2.0","x0_5.0","x0_9.0","x1_0.0","x1_1.0"]`; the float label via `category_label` appends `.0` to whole-valued floats (Python `str(np.float64)`: `2.0`/`-3.0`/`2.5`), `NaN→"nan"`. Live-oracle parity (roundtrip incl. non-contiguous `{2,5,9}`, held-out `[[0,1,0,1,0]]→[[5,0]]`, all-zero/ncols/0-row errors, feature names whole+fractional+negative) in `tests/divergence_one_hot_encoder.rs`. Consumer: crate re-export (`lib.rs:141`). DOCUMENTED DIVERGENCE (R-HONEST-3): the float label uses Rust `Display` for non-whole values, so it diverges from Python's scientific notation at `|v|>=1e16` / `0<|v|<1e-4` (`1e+20`/`1e-07` vs full decimal) — not a plausible category. STILL NOT-STARTED within REQ-6: the `input_features=`/`feature_name_combiner=` params (`:1192,1222`) and the `drop`-aware inverse (REQ-5). The `handle_unknown='ignore'` inverse IS handled (#2227, all-zero -> NaN sentinel). |
42//! | REQ-7 (ctor + dtype + _parameter_constraints) | SHIPPED | The supported ctor params are type-safe Rust enums — `OneHotHandleUnknown {Error,Ignore}` (REQ-4) and `OneHotDrop {None_,First,IfBinary}` (REQ-5a) — so sklearn's `handle_unknown`/`drop` `StrOptions` `_parameter_constraints` (`_encoders.py:733-738`) are provided BY THE TYPE SYSTEM (an out-of-domain value is unrepresentable). The numeric thresholds carry runtime constraints matching sklearn's `Interval(Integral, 1, None)`: `Fit::fit` rejects `min_frequency==Some(0)` and `max_categories==Some(0)` with `InvalidParameter` ("must be an int in the range [1, inf)", verified live: `OneHotEncoder(min_frequency=0).fit` -> InvalidParameterError, #1154). `dtype` is f64 (the category container; REQ-3-analog). The FULL 8-key keyword-only sklearn ctor surface (categories/drop/sparse_output/dtype/handle_unknown/min_frequency/max_categories/feature_name_combiner) is exposed + validated at the PyO3 binding (REQ-8, `_extras.py::OneHotEncoder` get_params parity + `_check_unsupported`). Live-oracle test `req7_min_frequency_max_categories_must_be_at_least_one`. Consumer: crate re-export `lib.rs`. |
43//! | REQ-8 (PyO3 binding) | SHIPPED | #1155: `ferrolearn-python` exposes `ferrolearn.OneHotEncoder` over `{OneHotEncoder, FittedOneHotEncoder, OneHotHandleUnknown}`. The Rust shim `_RsOneHotEncoder` (hand `#[pyclass]`, `ferrolearn-python/src/extras.rs`) ctor takes `handle_unknown: String = "error"` mapped via `resolve_handle_unknown` ("error"→`Error`, "ignore"→`Ignore`, "infrequent_if_exist"→`PyNotImplementedError` REQ-5, bad→`PyValueError` per `_encoders.py:732` `StrOptions({"error","ignore","infrequent_if_exist"})`); `fit` builds `OneHotEncoder::<f64>::new().with_handle_unknown(..)` + runs `Fit`; `transform`/`inverse_transform`→`PyArray2<f64>` (FerroError→`PyValueError`; the `Ignore`-mode all-zero inverse flows through as NaN, #2227); `#[getter]`s `categories_` (a Python LIST of 1-D f64 numpy arrays via `PyList`), `feature_names_out` (`get_feature_names_out()`→`Vec<String>`), `n_features_in_` (`n_features()`). Registered in `lib.rs` (`m.add_class::<extras::RsOneHotEncoder>()`). The Python wrapper `_extras.py::OneHotEncoder(_TransformerWrapper)` mirrors sklearn's KEYWORD-ONLY 8-key ctor `(*, categories="auto", drop=None, sparse_output=True, dtype=np.float64, handle_unknown="error", min_frequency=None, max_categories=None, feature_name_combiner="concat")` (`_encoders.py:743-762`) for `get_params`/`clone` parity; `_make_rs` threads `handle_unknown`; `fit` calls `_check_unsupported` which HONESTLY (R-HONEST-3) rejects the core's gaps — `sparse_output=True` (the sklearn DEFAULT; dense-only REQ-2 #1149)/`categories!='auto'`/`drop`/`min_frequency`/`max_categories`/`feature_name_combiner!='concat'` (REQ-5/REQ-7 #1152/#1154) → `NotImplementedError`; `transform`/`inverse_transform`/`categories_`/`n_features_in_`/`get_feature_names_out(input_features=None)` guarded by `check_is_fitted`→`NotFittedError` pre-fit (`input_features!=None`→`NotImplementedError` REQ-6). Boundary consumer (R-DEFER-1): the `_extras.py::OneHotEncoder` wrapper + `lib.rs` `add_class` + `__init__.py` re-export. Live-oracle parity (model B, sklearn 1.5.2 `sparse_output=False`): `tests/divergence_one_hot_encoder_py.py` (17 pass) — multi-feature non-contiguous `transform`/`fit_transform`/`categories_`, `handle_unknown='ignore'` all-zero block, `inverse_transform` roundtrip + ignore-NaN-vs-None known-feature recovery, `get_feature_names_out` (`['x0_2.0',...]`), pre-fit `NotFittedError`, bad-handle_unknown `ValueError`, `infrequent_if_exist`/unsupported-param `NotImplementedError`, dense-only `sparse_output=True` error, `get_params` 8-key parity, `clone`. R-DEFER-1 satisfied. |
44//! | REQ-9 (ferray substrate) | NOT-STARTED | open prereq blocker #1156. `ndarray::Array2`, not `ferray-core` (R-SUBSTRATE-1/2). |
45
46use ferrolearn_core::error::FerroError;
47use ferrolearn_core::traits::{Fit, FitTransform, Transform};
48use ndarray::Array2;
49use num_traits::Float;
50use std::cmp::Ordering;
51
52// ---------------------------------------------------------------------------
53// OneHotHandleUnknown
54// ---------------------------------------------------------------------------
55
56/// How [`FittedOneHotEncoder`] treats a category at `transform` time that was not
57/// seen during `fit` (an **unknown category**).
58///
59/// Mirrors scikit-learn's `OneHotEncoder(handle_unknown=...)` parameter
60/// (`sklearn/preprocessing/_encoders.py:732,750`), whose
61/// `_parameter_constraints` accepts `{'error', 'ignore', 'infrequent_if_exist'}`
62/// and whose default is `'error'`. ferrolearn ships `Error` (REQ-2) and `Ignore`
63/// (REQ-4); `'infrequent_if_exist'` is NOT-STARTED (REQ-5).
64///
65/// This is a distinct type from
66/// [`ordinal_encoder::HandleUnknown`](crate::ordinal_encoder::HandleUnknown):
67/// the one-hot encoder's modes are `{error, ignore}` while the ordinal encoder's
68/// are `{error, use_encoded_value}` (sklearn's two `handle_unknown` enums differ
69/// the same way).
70#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
71pub enum OneHotHandleUnknown {
72 /// Raise an error on any unknown category at `transform` time (scikit-learn's
73 /// default `handle_unknown='error'`, the default here too). The unfitted
74 /// encoder's [`Transform::transform`] returns
75 /// [`FerroError::InvalidParameter`] ("Found unknown categories … during
76 /// transform", `_encoders.py:209-214`).
77 #[default]
78 Error,
79 /// Encode an unknown category as an **all-zero** one-hot block for that
80 /// feature, leaving every known feature untouched (scikit-learn's
81 /// `handle_unknown='ignore'`, `_encoders.py:215-240`: the unknown row is
82 /// masked out and no column in that feature's block is set).
83 Ignore,
84}
85
86// ---------------------------------------------------------------------------
87// OneHotDrop
88// ---------------------------------------------------------------------------
89
90/// Which category (if any) to drop from each feature's one-hot block at
91/// `transform` time (`OneHotEncoder(drop=...)`).
92///
93/// Mirrors scikit-learn's `OneHotEncoder(drop=...)` parameter, whose
94/// `_parameter_constraints` accepts `{'first', 'if_binary'}`, an array-like, or
95/// `None` (`sklearn/preprocessing/_encoders.py:730`, default `None`). Dropping a
96/// category removes one output column per feature, which is useful to break the
97/// collinearity an unregularized linear model would otherwise see
98/// (`_encoders.py:498-516`).
99///
100/// ferrolearn ships the `None`/`'first'`/`'if_binary'` modes (REQ-5). The
101/// array-of-explicit-categories form (`drop[i]` = the category to drop in
102/// feature `i`, `_encoders.py:515-516`) is NOT-STARTED.
103///
104/// The variant is named `None_` (not `None`) to avoid colliding with
105/// [`Option::None`].
106#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
107pub enum OneHotDrop {
108 /// Retain all categories — no column is dropped (scikit-learn's default
109 /// `drop=None`, `_encoders.py:509`,`:812-813`: `drop_idx_ = None`). The
110 /// default here too.
111 #[default]
112 None_,
113 /// Drop the **first** category of every feature (scikit-learn's
114 /// `drop='first'`, `_encoders.py:510-511`,`:815-816`: `drop_idx_[j] = 0` for
115 /// every feature). A feature with only one category is dropped entirely (its
116 /// block width becomes 0).
117 First,
118 /// Drop the first category of every feature that has **exactly two**
119 /// categories, leaving 1-category and 3+-category features intact
120 /// (scikit-learn's `drop='if_binary'`, `_encoders.py:512-514`,`:817-831`:
121 /// `drop_idx_[j] = 0` iff `len(categories_[j]) == 2`, else `None`).
122 IfBinary,
123}
124
125// ---------------------------------------------------------------------------
126// OneHotEncoder (unfitted)
127// ---------------------------------------------------------------------------
128
129/// An unfitted one-hot encoder for multi-column numeric categorical data.
130///
131/// Input: `Array2<F>` where each column contains the (finite) numeric category
132/// values. Calling [`Fit::fit`] learns, per column, the **sorted unique set** of
133/// values (`categories_`) and returns a [`FittedOneHotEncoder`]. The output of
134/// [`Transform::transform`] is a dense binary matrix with one column per learned
135/// category, the per-feature blocks concatenated left-to-right.
136///
137/// # Examples
138///
139/// ```
140/// use ferrolearn_preprocess::OneHotEncoder;
141/// use ferrolearn_core::traits::{Fit, Transform};
142/// use ndarray::array;
143///
144/// let enc = OneHotEncoder::<f64>::new();
145/// // Non-contiguous categories {2, 5, 9} in column 0, {0, 1} in column 1.
146/// let x = array![[2.0_f64, 0.0], [5.0, 1.0], [9.0, 0.0], [5.0, 1.0]];
147/// let fitted = enc.fit(&x, &()).unwrap();
148/// assert_eq!(fitted.categories(), &[vec![2.0, 5.0, 9.0], vec![0.0, 1.0]]);
149/// let encoded = fitted.transform(&x).unwrap();
150/// assert_eq!(encoded.ncols(), 5); // 3 + 2 category columns
151/// ```
152///
153/// Unknown categories at `transform` time are, by default, rejected
154/// ([`OneHotHandleUnknown::Error`], scikit-learn's `handle_unknown='error'`).
155/// Configuring [`with_handle_unknown`](OneHotEncoder::with_handle_unknown) with
156/// [`OneHotHandleUnknown::Ignore`] instead encodes an unknown category as an
157/// all-zero one-hot block, matching `OneHotEncoder(handle_unknown='ignore')`.
158#[derive(Debug, Clone)]
159pub struct OneHotEncoder<F> {
160 /// Strategy for unknown categories at `transform` time
161 /// (`handle_unknown`). Defaults to [`OneHotHandleUnknown::Error`].
162 handle_unknown: OneHotHandleUnknown,
163 /// Which category (if any) to drop per feature (`drop`). Defaults to
164 /// [`OneHotDrop::None_`] (retain all categories).
165 drop: OneHotDrop,
166 /// Minimum frequency (count) below which a category is grouped into the
167 /// trailing "infrequent" output column (`min_frequency`). `None` (the
168 /// default) disables the min-frequency threshold. Mirrors scikit-learn's
169 /// `OneHotEncoder(min_frequency=...)` (`_encoders.py:566-577`,`:734-738`).
170 /// SCOPE: only the integer-count form is supported — sklearn also accepts a
171 /// FLOAT fraction `min_frequency * n_samples` (`:573-575`,`_encoders.py:297`),
172 /// which is NOT-STARTED here.
173 min_frequency: Option<usize>,
174 /// Upper limit on the number of output columns per feature when grouping
175 /// infrequent categories (`max_categories`); the infrequent column itself
176 /// counts toward this limit. `None` (the default) imposes no limit. Mirrors
177 /// scikit-learn's `OneHotEncoder(max_categories=...)`
178 /// (`_encoders.py:579-587`,`:733`).
179 max_categories: Option<usize>,
180 _marker: std::marker::PhantomData<F>,
181}
182
183impl<F: Float + Send + Sync + 'static> OneHotEncoder<F> {
184 /// Create a new `OneHotEncoder` with scikit-learn's default
185 /// `handle_unknown='error'` ([`OneHotHandleUnknown::Error`]).
186 #[must_use]
187 pub fn new() -> Self {
188 Self {
189 handle_unknown: OneHotHandleUnknown::Error,
190 drop: OneHotDrop::None_,
191 min_frequency: None,
192 max_categories: None,
193 _marker: std::marker::PhantomData,
194 }
195 }
196
197 /// Set the unknown-category strategy (`handle_unknown`).
198 ///
199 /// With [`OneHotHandleUnknown::Ignore`] an unknown category at `transform`
200 /// time becomes an all-zero one-hot block for that feature instead of an
201 /// error, matching scikit-learn's `OneHotEncoder(handle_unknown='ignore')`
202 /// (`_encoders.py:215-240`).
203 #[must_use]
204 pub fn with_handle_unknown(mut self, handle_unknown: OneHotHandleUnknown) -> Self {
205 self.handle_unknown = handle_unknown;
206 self
207 }
208
209 /// Return the configured unknown-category strategy (`handle_unknown`).
210 #[must_use]
211 pub fn handle_unknown(&self) -> OneHotHandleUnknown {
212 self.handle_unknown
213 }
214
215 /// Set the drop strategy (`drop`).
216 ///
217 /// With [`OneHotDrop::First`] the first category of every feature is dropped
218 /// from the output; with [`OneHotDrop::IfBinary`] only binary (2-category)
219 /// features lose their first category. The dropped category produces an
220 /// all-zero one-hot block, matching scikit-learn's `OneHotEncoder(drop=...)`
221 /// (`_encoders.py:498-516`).
222 #[must_use]
223 pub fn with_drop(mut self, drop: OneHotDrop) -> Self {
224 self.drop = drop;
225 self
226 }
227
228 /// Return the configured drop strategy (`drop`).
229 #[must_use]
230 pub fn drop(&self) -> OneHotDrop {
231 self.drop
232 }
233
234 /// Set the minimum-frequency threshold for infrequent grouping
235 /// (`min_frequency`, integer count).
236 ///
237 /// At `fit` time a category whose count in the training data is **strictly
238 /// less than** `min_frequency` is grouped into a single trailing
239 /// "infrequent" output column for that feature, matching scikit-learn's
240 /// `OneHotEncoder(min_frequency=...)` integer form
241 /// (`_encoders.py:566-577`, `_identify_infrequent` `:295-296`
242 /// `category_count < self.min_frequency`).
243 ///
244 /// Enabling infrequent grouping (`min_frequency` and/or `max_categories`)
245 /// requires `drop == OneHotDrop::None_`; combining it with `drop` is a
246 /// deferred interaction (REQ-5a×5b) and [`Fit::fit`] returns an error.
247 ///
248 /// SCOPE (R-HONEST-3): only the integer-count form is supported. sklearn
249 /// also accepts a FLOAT `min_frequency` interpreted as the fraction
250 /// `min_frequency * n_samples` (`_encoders.py:573-575`,`:297-299`); the
251 /// float-fraction form is NOT-STARTED here.
252 #[must_use]
253 pub fn with_min_frequency(mut self, min_frequency: usize) -> Self {
254 self.min_frequency = Some(min_frequency);
255 self
256 }
257
258 /// Set the maximum number of output columns per feature for infrequent
259 /// grouping (`max_categories`).
260 ///
261 /// At `fit` time, if a feature would otherwise produce more than
262 /// `max_categories` output columns, the least-frequent categories are
263 /// grouped into a single trailing "infrequent" column so the block width is
264 /// at most `max_categories` (the infrequent column itself counts toward the
265 /// limit). Mirrors scikit-learn's `OneHotEncoder(max_categories=...)`
266 /// (`_encoders.py:579-587`, `_identify_infrequent` `:303-315`).
267 ///
268 /// Enabling infrequent grouping requires `drop == OneHotDrop::None_` (see
269 /// [`Self::with_min_frequency`]).
270 #[must_use]
271 pub fn with_max_categories(mut self, max_categories: usize) -> Self {
272 self.max_categories = Some(max_categories);
273 self
274 }
275
276 /// Return the configured minimum-frequency threshold (`min_frequency`), or
277 /// `None` if infrequent grouping by frequency is disabled.
278 #[must_use]
279 pub fn min_frequency(&self) -> Option<usize> {
280 self.min_frequency
281 }
282
283 /// Return the configured maximum output-column limit (`max_categories`), or
284 /// `None` if no limit is imposed.
285 #[must_use]
286 pub fn max_categories(&self) -> Option<usize> {
287 self.max_categories
288 }
289
290 /// Whether infrequent grouping is enabled (either `min_frequency` or
291 /// `max_categories` is set). Mirrors scikit-learn's `_infrequent_enabled`
292 /// (`_encoders.py:264-273`: `(max_categories is not None and
293 /// max_categories >= 1) or min_frequency is not None`).
294 fn infrequent_enabled(&self) -> bool {
295 self.min_frequency.is_some() || self.max_categories.is_some_and(|m| m >= 1)
296 }
297}
298
299impl<F: Float + Send + Sync + 'static> Default for OneHotEncoder<F> {
300 fn default() -> Self {
301 Self::new()
302 }
303}
304
305// ---------------------------------------------------------------------------
306// FittedOneHotEncoder
307// ---------------------------------------------------------------------------
308
309/// A fitted one-hot encoder holding the sorted-unique category set per input
310/// column, plus the precomputed output-column layout.
311///
312/// Created by calling [`Fit::fit`] on a [`OneHotEncoder`]. Mirrors
313/// scikit-learn's `OneHotEncoder.categories_` (a list of arrays of the actual
314/// sorted-unique values, `_BaseEncoder._fit:99`).
315#[derive(Debug, Clone)]
316pub struct FittedOneHotEncoder<F> {
317 /// Per-column sorted-unique category values (`categories_`). `categories_[j]`
318 /// is the sorted set of distinct values seen in input column `j`; its length
319 /// is the number of output columns devoted to that feature's block.
320 pub(crate) categories_: Vec<Vec<F>>,
321 /// Per-column output-block start offsets (prefix sums of the per-feature
322 /// **block width**). The block width of feature `j` is
323 /// `categories_[j].len() - (1 if drop_idx_[j].is_some() else 0)`. Output
324 /// column `offsets[j] + pos` is the one-hot bit for the `pos`-th *kept*
325 /// category of feature `j`. Has length `categories_.len()`.
326 pub(crate) offsets: Vec<usize>,
327 /// Total number of output columns (`Σ block_width(j)`), accounting for any
328 /// dropped categories (`drop`).
329 pub(crate) n_output: usize,
330 /// Strategy for unknown categories at `transform` time, threaded from the
331 /// unfitted [`OneHotEncoder`]. [`OneHotHandleUnknown::Error`] rejects an
332 /// unknown category; [`OneHotHandleUnknown::Ignore`] emits an all-zero block.
333 pub(crate) handle_unknown: OneHotHandleUnknown,
334 /// Per-feature index into `categories_[j]` of the category to drop, or `None`
335 /// for "no drop" on that feature (`drop_idx_`). Has length
336 /// `categories_.len()`. Mirrors scikit-learn's public `drop_idx_`
337 /// (`_encoders.py:608-615`,`:885-902`): `drop='first'` → every entry
338 /// `Some(0)`; `drop='if_binary'` → `Some(0)` iff the feature has exactly two
339 /// categories else `None`; `drop=None` → every entry `None`.
340 pub(crate) drop_idx_: Vec<Option<usize>>,
341 /// Per-feature indices into `categories_[j]` of the categories grouped as
342 /// **infrequent** (`min_frequency`/`max_categories`), sorted ascending.
343 /// Mirrors scikit-learn's private `_infrequent_indices[j]`
344 /// (`_encoders.py:336-340`,`:367-370`): the indices `idx` such that
345 /// `categories_[j][idx]` is an infrequent category. Empty when feature `j`
346 /// has no infrequent categories (sklearn's `None`). With infrequent grouping
347 /// disabled every entry is empty. Length `categories_.len()`.
348 pub(crate) infrequent_indices_: Vec<Vec<usize>>,
349 /// Per-feature mapping from a `categories_[j]` index to its OUTPUT column
350 /// offset WITHIN feature `j`'s block (before adding `offsets[j]`). Mirrors
351 /// scikit-learn's `_default_to_infrequent_mappings[j]`
352 /// (`_encoders.py:373-400`): a frequent category maps to its remapped slot
353 /// `0..n_frequent`, every infrequent category maps to the single trailing
354 /// slot `n_frequent`. When feature `j` has no infrequent categories the
355 /// mapping is the identity `0..len` (sklearn stores `None`; the identity is
356 /// the representable equivalent). Length `categories_.len()`, with
357 /// `infrequent_map[j].len() == categories_[j].len()`. Used by `transform`,
358 /// `inverse_transform`, and `get_feature_names_out` to place each category in
359 /// the right output column without recomputing the grouping.
360 pub(crate) infrequent_map: Vec<Vec<usize>>,
361}
362
363impl<F: Float + Send + Sync + 'static> FittedOneHotEncoder<F> {
364 /// Return the learned sorted-unique category set for each input column
365 /// (`categories_`).
366 ///
367 /// `categories()[j][idx]` is the value encoded by output column
368 /// `offsets[j] + idx`. Mirrors scikit-learn's `OneHotEncoder.categories_`.
369 #[must_use]
370 pub fn categories(&self) -> &[Vec<F>] {
371 &self.categories_
372 }
373
374 /// Return the number of distinct categories for each input feature column,
375 /// i.e. the width of each per-feature one-hot block.
376 #[must_use]
377 pub fn n_categories(&self) -> Vec<usize> {
378 self.categories_.iter().map(Vec::len).collect()
379 }
380
381 /// Return the number of input feature columns.
382 #[must_use]
383 pub fn n_features(&self) -> usize {
384 self.categories_.len()
385 }
386
387 /// Return the total number of output columns (`Σ categories_[j].len()`).
388 #[must_use]
389 pub fn n_output_features(&self) -> usize {
390 self.n_output
391 }
392
393 /// Return the configured unknown-category strategy (`handle_unknown`),
394 /// threaded from the unfitted [`OneHotEncoder`].
395 #[must_use]
396 pub fn handle_unknown(&self) -> OneHotHandleUnknown {
397 self.handle_unknown
398 }
399
400 /// Return the per-feature drop index (`drop_idx_`).
401 ///
402 /// `drop_idx_()[j]` is `Some(d)` if category `categories_[j][d]` is dropped
403 /// from feature `j`'s one-hot block (its block width is one less than
404 /// `categories_[j].len()`, and that category encodes to an all-zero block),
405 /// or `None` if no category is dropped from that feature. Mirrors
406 /// scikit-learn's public `drop_idx_` attribute (`_encoders.py:608-615`). With
407 /// `drop=None` (the default) every entry is `None`.
408 #[must_use]
409 pub fn drop_idx_(&self) -> &[Option<usize>] {
410 &self.drop_idx_
411 }
412
413 /// Return the infrequent category **values** for each feature
414 /// (`infrequent_categories_`).
415 ///
416 /// `infrequent_categories()[j]` is the sorted list of category values from
417 /// `categories_[j]` that were grouped into the single trailing "infrequent"
418 /// output column (because their training count fell below `min_frequency`
419 /// and/or beyond the `max_categories` limit). An EMPTY inner `Vec` means
420 /// feature `j` had no infrequent categories (scikit-learn returns `None`
421 /// there; an empty list is the representable equivalent). With infrequent
422 /// grouping disabled every entry is empty. Mirrors scikit-learn's
423 /// `OneHotEncoder.infrequent_categories_`
424 /// (`_encoders.py:254-262`,`:625-633`): `category[indices]` over
425 /// `_infrequent_indices`.
426 #[must_use]
427 pub fn infrequent_categories(&self) -> Vec<Vec<F>> {
428 self.infrequent_indices_
429 .iter()
430 .enumerate()
431 .map(|(j, idxs)| {
432 idxs.iter()
433 .filter_map(|&idx| self.categories_.get(j).and_then(|c| c.get(idx)).copied())
434 .collect()
435 })
436 .collect()
437 }
438
439 /// Whether feature `j` has any infrequent categories (a trailing infrequent
440 /// output column). Bounds-safe: a `j` past the end yields `false`.
441 fn has_infrequent(&self, j: usize) -> bool {
442 self.infrequent_indices_
443 .get(j)
444 .is_some_and(|v| !v.is_empty())
445 }
446
447 /// Return the width of feature `j`'s one-hot block: `categories_[j].len()`
448 /// minus one if that feature has a dropped category. Bounds-safe: a `j` past
449 /// the end yields 0 (R-CODE-2).
450 fn block_width(&self, j: usize) -> usize {
451 let len = self.categories_.get(j).map_or(0, Vec::len);
452 // Infrequent grouping (REQ-5b) and `drop` (REQ-5a) are mutually
453 // exclusive — `fit` rejects their combination — so at most one branch
454 // applies. With infrequent categories the block is `n_frequent + 1`
455 // trailing infrequent column (sklearn `_compute_n_features_outs`
456 // `_encoders.py:948-953`: `output[i] -= infreq.size - 1`, i.e.
457 // `len - n_infreq + 1`).
458 let n_infreq = self.infrequent_indices_.get(j).map_or(0, Vec::len);
459 if n_infreq > 0 {
460 return len - n_infreq + 1;
461 }
462 let dropped = matches!(self.drop_idx_.get(j), Some(Some(_)));
463 len - usize::from(dropped && len > 0)
464 }
465
466 /// Invert a one-hot encoded matrix back to the original category values.
467 ///
468 /// For each input feature `j` the per-feature block
469 /// `x[:, offsets[j] .. offsets[j] + categories_[j].len()]` is reduced to a
470 /// single category via **argmax** (the index of the maximum value in the
471 /// block, first-max on ties — numpy `argmax` semantics), and the original
472 /// value `categories_[j][argmax]` is written to `out[[i, j]]`. This mirrors
473 /// scikit-learn's `OneHotEncoder.inverse_transform`
474 /// (`sklearn/preprocessing/_encoders.py:1136-1139`):
475 /// `labels = sub.argmax(axis=1); X_tr[:, i] = cats[labels]`.
476 ///
477 /// After the argmax, an **all-zero block** (a row whose per-feature block
478 /// sums to zero) cannot be inverted. With no `drop` and the default
479 /// `handle_unknown='error'` (the only mode ferrolearn ships — REQ-4/5), this
480 /// is an error, matching sklearn's
481 /// `ValueError("Samples [...] can not be inverted when drop=None and
482 /// handle_unknown='error' because they contain all zeros")`
483 /// (`_encoders.py:1160-1168`). A proper one-hot row from
484 /// [`Transform::transform`] has exactly one `1` per block, so argmax always
485 /// finds it and the block sum is never zero.
486 ///
487 /// # Errors
488 ///
489 /// - [`FerroError::InsufficientSamples`] if `x` has zero rows (sklearn
490 /// `check_array` requires a minimum of 1 sample).
491 /// - [`FerroError::ShapeMismatch`] if `x.ncols() != n_output` (sklearn's
492 /// "Shape of the passed X data is not correct" `ValueError`,
493 /// `_encoders.py:1100-1104`).
494 /// - [`FerroError::InvalidParameter`] if any per-feature block is all-zero
495 /// (the sklearn all-zeros `ValueError`, `_encoders.py:1164-1168`).
496 ///
497 /// Never panics: every block slice is bounds-checked (R-CODE-2).
498 pub fn inverse_transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
499 let n_samples = x.nrows();
500 if n_samples == 0 {
501 return Err(FerroError::InsufficientSamples {
502 required: 1,
503 actual: 0,
504 context: "FittedOneHotEncoder::inverse_transform".into(),
505 });
506 }
507 // sklearn `inverse_transform` -> `check_array(X, accept_sparse="csr")`
508 // (`_encoders.py:1092`) with the DEFAULT `force_all_finite=True`, so a
509 // NaN or +/-inf cell in the one-hot matrix raises BEFORE the argmax
510 // (#2224). A valid one-hot row is all 0/1 (finite); a non-finite cell is
511 // invalid input.
512 if x.iter().any(|v| !v.is_finite()) {
513 return Err(FerroError::InvalidParameter {
514 name: "X".into(),
515 reason: "Input X contains NaN or infinity.".into(),
516 });
517 }
518 if x.ncols() != self.n_output {
519 return Err(FerroError::ShapeMismatch {
520 expected: vec![n_samples, self.n_output],
521 actual: vec![n_samples, x.ncols()],
522 context: "FittedOneHotEncoder::inverse_transform".into(),
523 });
524 }
525
526 let n_features = self.categories_.len();
527 let mut out = Array2::zeros((n_samples, n_features));
528
529 for j in 0..n_features {
530 let cats = &self.categories_[j];
531 let drop_d = self.drop_idx_.get(j).copied().flatten();
532 // The per-feature block WIDTH after drop (the number of output columns
533 // for this feature). With a dropped category the block is one narrower
534 // than `categories_[j]` (`_encoders.py:1124-1127` `cats_wo_dropped`).
535 let block_width = self.block_width(j);
536 let offset = self.offsets[j];
537
538 // A feature whose entire (single) category was dropped has a
539 // zero-width block (`drop='first'` on a 1-category feature). Every row
540 // inverts to that dropped category, with no columns consumed (sklearn
541 // `n_categories == 0` branch, `_encoders.py:1132-1135`).
542 if block_width == 0 {
543 if let Some(&cat) = drop_d.and_then(|d| cats.get(d)) {
544 for i in 0..n_samples {
545 out[[i, j]] = cat;
546 }
547 }
548 continue;
549 }
550
551 for i in 0..n_samples {
552 // Argmax over the per-feature block (numpy `argmax`: index of the
553 // maximum, FIRST on ties). Track the block sum to detect the
554 // all-zero case separately, mirroring sklearn's two-step
555 // argmax-then-all-zero-check (`_encoders.py:1136-1172`). `argmax`
556 // is a BLOCK position in `0..block_width`.
557 let mut argmax: usize = 0;
558 let mut max_val = x[[i, offset]];
559 let mut block_sum = max_val;
560 for k in 1..block_width {
561 let v = x[[i, offset + k]];
562 block_sum = block_sum + v;
563 if v > max_val {
564 max_val = v;
565 argmax = k;
566 }
567 }
568 if block_sum == F::zero() {
569 // All-zero block. With a dropped category this is the
570 // LEGITIMATE encoding of the dropped value, so it inverts to
571 // that category in BOTH handle_unknown modes — sklearn checks
572 // `_drop_idx_after_grouping[i] is not None` FIRST and maps the
573 // all-zero row to the dropped category (`_encoders.py:1150-1158`
574 // for ignore, `:1169-1172` for error), bypassing the
575 // "can not be inverted" / None paths.
576 if drop_d.is_some() {
577 if let Some(&cat) = drop_d.and_then(|d| cats.get(d)) {
578 out[[i, j]] = cat;
579 }
580 } else {
581 // No drop on this feature: the existing handle_unknown
582 // semantics (`_encoders.py:1141`,`:1159-1168`).
583 match self.handle_unknown {
584 OneHotHandleUnknown::Error => {
585 return Err(FerroError::InvalidParameter {
586 name: "X".into(),
587 reason: "Samples can not be inverted when drop=None and \
588 handle_unknown='error' because they contain all zeros"
589 .into(),
590 });
591 }
592 // `handle_unknown='ignore'` all-zero block -> None in
593 // sklearn (`:1183`); `Array2<F>` cannot hold None so we
594 // use NaN as the representable sentinel (#2227).
595 OneHotHandleUnknown::Ignore => {
596 out[[i, j]] = F::nan();
597 }
598 }
599 }
600 } else if self.has_infrequent(j) {
601 // Infrequent grouping (REQ-5b). The block POSITION `argmax`
602 // is a slot in `infrequent_map[j]`. The TRAILING slot
603 // (`n_frequent`) is the infrequent column: sklearn inverts it
604 // to the string `'infrequent_sklearn'` (`_encoders.py:1675-1677`,
605 // `_compute_transformed_categories:917`), which an `Array2<F>`
606 // cannot hold — NaN is the representable proxy (DOCUMENTED
607 // SCOPE, R-HONEST-3, like the ignore-None case #2227). A
608 // frequent slot inverts to the unique `categories_[j]` index
609 // that maps to it (`labels = cats_wo_dropped[argmax]`,
610 // `:1138-1139`). Bounds-safe via `get` (R-CODE-2).
611 let map = self.infrequent_map.get(j);
612 let n_frequent = block_width - 1; // the trailing slot index
613 if argmax >= n_frequent {
614 out[[i, j]] = F::nan();
615 } else if let Some(&cat) = map
616 .and_then(|m| m.iter().position(|&s| s == argmax))
617 .and_then(|orig| cats.get(orig))
618 {
619 out[[i, j]] = cat;
620 }
621 } else {
622 // Map the block POSITION back to a `categories_[j]` index: with
623 // a dropped category `d`, positions `>= d` correspond to the
624 // category one higher (the dropped category was removed),
625 // matching sklearn's `cats_wo_dropped` indexing
626 // (`_encoders.py:1124-1139`). Bounds-safe via `get` (R-CODE-2).
627 let cat_idx = match drop_d {
628 Some(d) if argmax >= d => argmax + 1,
629 _ => argmax,
630 };
631 if let Some(&cat) = cats.get(cat_idx) {
632 out[[i, j]] = cat;
633 }
634 }
635 }
636 }
637
638 Ok(out)
639 }
640
641 /// Return the output feature names, one per output column.
642 ///
643 /// For each input feature `j`, for each category `c` in `categories_[j]`,
644 /// emits `format!("x{j}_{c}")` where `c` is rendered to match Python's
645 /// `str(np.float64(c))`. This mirrors scikit-learn's
646 /// `OneHotEncoder.get_feature_names_out` with the default `input_features`
647 /// (`["x0", "x1", ...]`) and the `"concat"` name combiner
648 /// (`feature + "_" + str(category)`, `_encoders.py:1217,1224`). For the
649 /// whole-number fixture `[[2,0],[5,1],[9,0],[5,1]]` this yields
650 /// `["x0_2.0", "x0_5.0", "x0_9.0", "x1_0.0", "x1_1.0"]`.
651 ///
652 /// # Float-rendering divergence (HONEST, R-HONEST-3)
653 ///
654 /// The category is rendered via [`Self::category_label`], which appends `.0`
655 /// to integer-valued floats (`2.0 → "2.0"`, `-3.0 → "-3.0"`, matching
656 /// Python) and uses Rust's shortest round-trip `Display` otherwise
657 /// (`2.5 → "2.5"`). For category values in the usual categorical range
658 /// (small whole or fractional numbers) this is byte-identical to Python.
659 /// It DIVERGES for extreme magnitudes: Python's `repr`/`str` switches to
660 /// scientific notation at `|v| >= 1e16` and `0 < |v| < 1e-4`
661 /// (`1e+20`, `1e-07`), while Rust's `Display` prints the full decimal
662 /// (`100000000000000000000`, `0.0000001`). Such values are not plausible
663 /// one-hot categories; the divergence is documented rather than papered over.
664 /// `NaN` renders as `"nan"` (matching Python's `str(nan)`).
665 #[must_use]
666 pub fn get_feature_names_out(&self) -> Vec<String> {
667 let mut names = Vec::with_capacity(self.n_output);
668 for (j, cats) in self.categories_.iter().enumerate() {
669 // The dropped category's name is OMITTED (sklearn
670 // `_compute_transformed_categories` with `remove_dropped=True`,
671 // `_encoders.py:1209-1212`,`:909`).
672 let drop_d = self.drop_idx_.get(j).copied().flatten();
673 // Infrequent grouping (REQ-5b): emit only the FREQUENT category names
674 // then a single trailing `"x{j}_infrequent_sklearn"` column — the
675 // infrequent categories collapse into that one column (sklearn
676 // `_compute_transformed_categories`, `_encoders.py:913-921`:
677 // `cats[frequent_mask] + ['infrequent_sklearn']`). Infrequent and
678 // `drop` are mutually exclusive, so `drop_d` is `None` here.
679 if self.has_infrequent(j) {
680 let map = self.infrequent_map.get(j);
681 let n_frequent = self.block_width(j).saturating_sub(1);
682 for slot in 0..n_frequent {
683 // The unique frequent category whose remapped slot is `slot`.
684 if let Some(&c) = map
685 .and_then(|m| m.iter().position(|&s| s == slot))
686 .and_then(|orig| cats.get(orig))
687 {
688 names.push(format!("x{j}_{}", Self::category_label(c)));
689 }
690 }
691 names.push(format!("x{j}_infrequent_sklearn"));
692 continue;
693 }
694 for (idx, &c) in cats.iter().enumerate() {
695 if drop_d == Some(idx) {
696 continue;
697 }
698 names.push(format!("x{j}_{}", Self::category_label(c)));
699 }
700 }
701 names
702 }
703
704 /// Render a category value to a string matching Python's `str(np.float64(v))`
705 /// for the categorical-value range (see [`Self::get_feature_names_out`] for
706 /// the documented extreme-magnitude divergence).
707 ///
708 /// Python's `str(float)` always shows a decimal point for whole floats
709 /// (`2.0`, not `2`), so an integer-valued finite float gets a `.0` suffix;
710 /// otherwise Rust's shortest round-trip `Display` is used. `NaN → "nan"`.
711 fn category_label(v: F) -> String {
712 let Some(f) = v.to_f64() else {
713 return "nan".to_string();
714 };
715 if f.is_nan() {
716 return "nan".to_string();
717 }
718 if f.is_finite() && f == f.trunc() {
719 // Whole-valued finite float: Python prints e.g. "2.0", "-3.0".
720 format!("{f:.1}")
721 } else {
722 // Fractional or non-finite: shortest round-trip Display ("2.5").
723 format!("{f}")
724 }
725 }
726}
727
728// ---------------------------------------------------------------------------
729// Trait implementations
730// ---------------------------------------------------------------------------
731
732impl<F: Float + Send + Sync + 'static> Fit<Array2<F>, ()> for OneHotEncoder<F> {
733 type Fitted = FittedOneHotEncoder<F>;
734 type Error = FerroError;
735
736 /// Fit the encoder by learning the **sorted-unique category set** per column.
737 ///
738 /// For each input column `j`, `categories_[j]` is the distinct values of that
739 /// column, sorted ascending via `partial_cmp` and deduped by **exact
740 /// equality** — mirroring scikit-learn's `categories_ = _unique(Xi)`
741 /// (`sklearn/preprocessing/_encoders.py:99`, `np.unique` per column).
742 /// The output-column layout (`offsets`, `n_output`) is precomputed as the
743 /// prefix sums / total of the per-column category counts.
744 ///
745 /// Exact float equality is what `np.unique` does, so two values that differ
746 /// by an ULP are distinct categories here, exactly as in sklearn.
747 ///
748 /// # NaN handling (#2223)
749 ///
750 /// `NaN` is treated as a valid category, matching sklearn's `_unique_np`
751 /// (`_encode.py:70-74`): it sorts LAST and a run of duplicate NaNs collapses
752 /// to a SINGLE sorted-last category (the sort orders `NaN` after every finite
753 /// value and `dedup_by` collapses consecutive NaNs, since `NaN != NaN`). A
754 /// NaN cell at `transform` then one-hots that trailing category. `fit` never
755 /// panics (R-CODE-2).
756 ///
757 /// # Errors
758 ///
759 /// Returns [`FerroError::InsufficientSamples`] if the input has zero rows
760 /// (matching sklearn's `check_array` minimum-of-1-sample requirement).
761 fn fit(&self, x: &Array2<F>, _y: &()) -> Result<FittedOneHotEncoder<F>, FerroError> {
762 // sklearn `_parameter_constraints` (`@_fit_context`, `_encoders.py:733-738`)
763 // validates the params BEFORE the data: `min_frequency` is
764 // `Interval(Integral, 1, None)` and `max_categories` is
765 // `Interval(Integral, 1, None)` — a value of 0 raises
766 // `InvalidParameterError` ("must be ... in the range [1, inf)"). #1154/REQ-7.
767 // (handle_unknown/drop are type-safe Rust enums, so their StrOptions
768 // constraints are provided by the type system — no runtime check needed.)
769 if self.min_frequency == Some(0) {
770 return Err(FerroError::InvalidParameter {
771 name: "min_frequency".into(),
772 reason: "must be an int in the range [1, inf)".into(),
773 });
774 }
775 if self.max_categories == Some(0) {
776 return Err(FerroError::InvalidParameter {
777 name: "max_categories".into(),
778 reason: "must be an int in the range [1, inf)".into(),
779 });
780 }
781 let n_samples = x.nrows();
782 if n_samples == 0 {
783 return Err(FerroError::InsufficientSamples {
784 required: 1,
785 actual: 0,
786 context: "OneHotEncoder::fit".into(),
787 });
788 }
789 // sklearn `OneHotEncoder.fit` -> `check_array(force_all_finite="allow-nan")`:
790 // NaN is a valid CATEGORY (#2223), but +/-inf is REJECTED (verified live:
791 // fit([[inf]]) -> ValueError "Input contains infinity"). #2225.
792 if x.iter().any(|v| v.is_infinite()) {
793 return Err(FerroError::InvalidParameter {
794 name: "X".into(),
795 reason: "Input X contains infinity or a value too large for dtype.".into(),
796 });
797 }
798
799 let infrequent_enabled = self.infrequent_enabled();
800
801 let n_features = x.ncols();
802 let mut categories_: Vec<Vec<F>> = Vec::with_capacity(n_features);
803 // Per-feature, per-category training counts ALIGNED with `categories_[j]`
804 // (`category_counts[j][idx]` is the count of `categories_[j][idx]`).
805 // Only needed when infrequent grouping is enabled — sklearn computes
806 // counts via `_unique(Xi, return_counts=True)` (`_encoders.py:99-102`).
807 let mut category_counts: Vec<Vec<usize>> = Vec::with_capacity(n_features);
808
809 for j in 0..n_features {
810 // Collect this column's values, sort ascending (sklearn `np.unique`
811 // sorts), then dedup by EXACT equality to the sorted-unique set.
812 let mut col: Vec<F> = x.column(j).iter().copied().collect();
813 // Sort ascending with NaN LAST (sklearn `_unique_np` keeps any NaN at
814 // the end, `_encode.py:70-74`); `partial_cmp` alone returns None for
815 // NaN and would leave it unmoved (#2223).
816 col.sort_by(|a, b| match (a.is_nan(), b.is_nan()) {
817 (true, true) => Ordering::Equal,
818 (true, false) => Ordering::Greater,
819 (false, true) => Ordering::Less,
820 (false, false) => a.partial_cmp(b).unwrap_or(Ordering::Equal),
821 });
822 // Build the sorted-unique set AND, when infrequent grouping is
823 // enabled, the per-category run-length count (the sorted column has
824 // each category's occurrences contiguous, so a run length is the
825 // count). Consecutive EXACT-equal values collapse (an ULP-apart pair
826 // stays distinct, like `np.unique`), AND consecutive NaNs collapse to
827 // ONE (`dedup` alone keeps every NaN since `NaN != NaN`; sklearn
828 // collapses the trailing NaN run to a single sorted-last category,
829 // #2223).
830 let mut cats: Vec<F> = Vec::with_capacity(col.len());
831 let mut counts: Vec<usize> = Vec::with_capacity(col.len());
832 for v in col {
833 match cats.last() {
834 Some(&last) if last == v || (last.is_nan() && v.is_nan()) => {
835 if let Some(c) = counts.last_mut() {
836 *c += 1;
837 }
838 }
839 _ => {
840 cats.push(v);
841 counts.push(1);
842 }
843 }
844 }
845 categories_.push(cats);
846 category_counts.push(counts);
847 }
848
849 // Infrequent grouping (REQ-5b). When enabled, identify each feature's
850 // infrequent category indices and build the per-feature index→output
851 // column mapping; otherwise every feature has no infrequent categories
852 // and the mapping is the identity.
853 let mut infrequent_indices_: Vec<Vec<usize>> = Vec::with_capacity(n_features);
854 let mut infrequent_map: Vec<Vec<usize>> = Vec::with_capacity(n_features);
855 if infrequent_enabled {
856 // REQ-5a × REQ-5b interaction is DEFERRED: combining infrequent
857 // grouping with `drop` is rejected at fit (sklearn ALLOWS it, but the
858 // remapping is intricate — documented scope, R-HONEST-3). Require
859 // `drop == None_`.
860 if self.drop != OneHotDrop::None_ {
861 return Err(FerroError::InvalidParameter {
862 name: "drop".into(),
863 reason: "infrequent grouping (min_frequency/max_categories) with drop is not \
864 yet supported"
865 .into(),
866 });
867 }
868 for counts in &category_counts {
869 let infreq = identify_infrequent(counts, self.min_frequency, self.max_categories);
870 let map = build_infrequent_map(counts.len(), &infreq);
871 infrequent_indices_.push(infreq);
872 infrequent_map.push(map);
873 }
874 } else {
875 for cats in &categories_ {
876 infrequent_indices_.push(Vec::new());
877 infrequent_map.push((0..cats.len()).collect());
878 }
879 }
880
881 // Compute `drop_idx_` from `drop` + the learned `categories_`
882 // (sklearn `_compute_drop_idx`, `_encoders.py:812-831`). `drop=None` →
883 // every feature `None`; `drop='first'` → every feature `Some(0)`;
884 // `drop='if_binary'` → `Some(0)` iff the feature has exactly two
885 // categories, else `None`. (With infrequent grouping active `drop` is
886 // forced to `None_` above, so every entry is `None`.)
887 let drop_idx_: Vec<Option<usize>> = match self.drop {
888 OneHotDrop::None_ => vec![None; n_features],
889 OneHotDrop::First => categories_
890 .iter()
891 .map(|cats| if cats.is_empty() { None } else { Some(0) })
892 .collect(),
893 OneHotDrop::IfBinary => categories_
894 .iter()
895 .map(|cats| if cats.len() == 2 { Some(0) } else { None })
896 .collect(),
897 };
898
899 let mut fitted = FittedOneHotEncoder {
900 categories_,
901 // Placeholder; recomputed below from per-feature block widths.
902 offsets: Vec::new(),
903 n_output: 0,
904 // `handle_unknown` only affects `transform` (sklearn learns the same
905 // `categories_` regardless); thread the configured mode through. Note
906 // (verified live, sklearn 1.5.2): `drop` + `handle_unknown='ignore'`
907 // is ALLOWED — sklearn does NOT raise at fit; it warns on unknown at
908 // transform and encodes the unknown as an all-zero block (the same as
909 // the dropped category). So fit imposes no drop+ignore constraint.
910 handle_unknown: self.handle_unknown,
911 drop_idx_,
912 infrequent_indices_,
913 infrequent_map,
914 };
915
916 // Recompute the output-column layout from each feature's block width:
917 // `block_width(j)` is `n_frequent + 1` with infrequent grouping (the
918 // trailing infrequent column), else `len - (1 if dropped)`. `offsets` is
919 // the prefix sum of those widths; `n_output` the total (sklearn
920 // `_compute_n_features_outs`, `_encoders.py:936-955`; `feature_indices`,
921 // `:1049`).
922 let mut offsets: Vec<usize> = Vec::with_capacity(n_features);
923 let mut n_output: usize = 0;
924 for j in 0..n_features {
925 offsets.push(n_output);
926 n_output += fitted.block_width(j);
927 }
928 fitted.offsets = offsets;
929 fitted.n_output = n_output;
930
931 Ok(fitted)
932 }
933}
934
935/// Identify the indices of infrequent categories for one feature, given the
936/// per-category training `counts` (aligned with `categories_[j]`) and the
937/// `min_frequency`/`max_categories` thresholds.
938///
939/// Mirrors scikit-learn's `_BaseEncoder._identify_infrequent`
940/// (`_encoders.py:275-318`):
941/// 1. min_frequency: a category with `count < min_frequency` is infrequent
942/// (`:295-296`, integer form only — the float-fraction form is out of scope).
943/// 2. max_categories: if (after step 1) the feature would still produce more
944/// than `max_categories` output columns — counted as `n_remaining_frequent +
945/// 1` for the infrequent group (`:303`) — the least-frequent categories are
946/// additionally marked infrequent until only `max_categories - 1` frequent
947/// categories remain (`:304-315`). Ties broken by a STABLE sort over the
948/// FULL count array, so among equal counts the SMALLER category index is
949/// marked infrequent first (sklearn `np.argsort(kind="mergesort")[:-k]`).
950/// `max_categories == 1` (frequent_category_count 0) makes every category
951/// infrequent (`:307-309`).
952///
953/// Returns the sorted-ascending infrequent indices (empty if none — sklearn's
954/// `None`). Never panics (R-CODE-2).
955fn identify_infrequent(
956 counts: &[usize],
957 min_frequency: Option<usize>,
958 max_categories: Option<usize>,
959) -> Vec<usize> {
960 let n = counts.len();
961 let mut infrequent_mask = vec![false; n];
962
963 // Step 1: min_frequency (integer count). `count < min_frequency`.
964 if let Some(min_freq) = min_frequency {
965 for (idx, &c) in counts.iter().enumerate() {
966 if c < min_freq {
967 infrequent_mask[idx] = true;
968 }
969 }
970 }
971
972 // Step 2: max_categories on the survivors. `n_current_features` counts the
973 // remaining frequent categories PLUS 1 for the infrequent group
974 // (`_encoders.py:303`).
975 if let Some(max_cat) = max_categories {
976 let n_infreq = infrequent_mask.iter().filter(|&&m| m).count();
977 let n_current_features = n - n_infreq + 1;
978 if max_cat < n_current_features {
979 // `max_categories` includes the one infrequent category.
980 let frequent_category_count = max_cat - 1;
981 if frequent_category_count == 0 {
982 // All categories are infrequent (`:307-309`).
983 infrequent_mask.iter_mut().for_each(|m| *m = true);
984 } else {
985 // Stable argsort over the FULL count array (ascending by count,
986 // ties by ascending index), then mark the smallest
987 // `n - frequent_category_count` levels infrequent — i.e. keep the
988 // top `frequent_category_count` by count, with ties resolved in
989 // favor of the LARGER index (`np.argsort(kind="mergesort")[:-k]`,
990 // `:312-315`).
991 let mut order: Vec<usize> = (0..n).collect();
992 order.sort_by(|&a, &b| counts[a].cmp(&counts[b]).then(a.cmp(&b)));
993 let keep = frequent_category_count.min(n);
994 let cut = n - keep;
995 for &idx in &order[..cut] {
996 infrequent_mask[idx] = true;
997 }
998 }
999 }
1000 }
1001
1002 infrequent_mask
1003 .iter()
1004 .enumerate()
1005 .filter_map(|(idx, &m)| if m { Some(idx) } else { None })
1006 .collect()
1007}
1008
1009/// Build the per-feature mapping from a `categories_[j]` index to its output
1010/// column slot WITHIN the feature's block (before adding `offsets[j]`).
1011///
1012/// Mirrors scikit-learn's `_default_to_infrequent_mappings[j]`
1013/// (`_encoders.py:373-400`): frequent categories take slots `0..n_frequent` in
1014/// their original (ascending-index) order; every infrequent category maps to the
1015/// single trailing slot `n_frequent`. With no infrequent categories the mapping
1016/// is the identity `0..n`. `infrequent` must be sorted ascending. Never panics
1017/// (R-CODE-2): every index is bounds-checked.
1018fn build_infrequent_map(n: usize, infrequent: &[usize]) -> Vec<usize> {
1019 if infrequent.is_empty() {
1020 return (0..n).collect();
1021 }
1022 let n_frequent = n - infrequent.len();
1023 let mut map = vec![n_frequent; n];
1024 let mut next_frequent = 0usize;
1025 for (idx, slot) in map.iter_mut().enumerate() {
1026 if infrequent.binary_search(&idx).is_ok() {
1027 // Infrequent → the trailing slot (already set to `n_frequent`).
1028 } else {
1029 *slot = next_frequent;
1030 next_frequent += 1;
1031 }
1032 }
1033 map
1034}
1035
1036impl<F: Float + Send + Sync + 'static> Transform<Array2<F>> for FittedOneHotEncoder<F> {
1037 type Output = Array2<F>;
1038 type Error = FerroError;
1039
1040 /// Transform numeric categorical data into a dense one-hot encoded matrix.
1041 ///
1042 /// Each value is one-hot by **category membership**: for input column `j` the
1043 /// value `x[[i, j]]` is matched (by exact equality) against `categories_[j]`,
1044 /// and the bit at output column `offsets[j] + idx` is set, where `idx` is the
1045 /// value's position in the sorted-unique set. The per-feature one-hot blocks
1046 /// are concatenated left-to-right, matching scikit-learn's
1047 /// `OneHotEncoder(sparse_output=False)` output column layout
1048 /// (`_BaseEncoder._transform`, `_encoders.py:206-240`).
1049 ///
1050 /// A value not present in `categories_[j]` is an **unknown category**. Its
1051 /// handling depends on the configured `handle_unknown`
1052 /// ([`OneHotEncoder::with_handle_unknown`]):
1053 /// - [`OneHotHandleUnknown::Error`] (the default): returns an error, matching
1054 /// sklearn's `handle_unknown='error'`
1055 /// (`ValueError("Found unknown categories … during transform")`,
1056 /// `_encoders.py:209-214`).
1057 /// - [`OneHotHandleUnknown::Ignore`]: leaves that feature's one-hot block
1058 /// **all-zero** for this row (no column is set), matching sklearn's
1059 /// `handle_unknown='ignore'` (`_encoders.py:215-240`: the unknown row is
1060 /// masked out so no encoded column is set). Every KNOWN feature still emits
1061 /// its normal one-hot bit.
1062 ///
1063 /// The +/-inf rejection (#2225), the ncols guard, and the 0-row handling are
1064 /// unaffected by `handle_unknown`: a non-finite +/-inf value is invalid input
1065 /// (not an unknown category) and still errors even in `Ignore` mode.
1066 ///
1067 /// # Errors
1068 ///
1069 /// Returns [`FerroError::ShapeMismatch`] if the number of columns does not
1070 /// match the number of features seen during fitting.
1071 ///
1072 /// Returns [`FerroError::InvalidParameter`] if any value is an unknown
1073 /// category (not in the learned `categories_[j]` set) AND `handle_unknown`
1074 /// is [`OneHotHandleUnknown::Error`] (the default); under
1075 /// [`OneHotHandleUnknown::Ignore`] an unknown category never errors. Also
1076 /// returned if any value is +/-infinite (invalid input, #2225).
1077 fn transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
1078 let n_features = self.categories_.len();
1079 // sklearn `transform` -> `check_array(force_all_finite="allow-nan")`
1080 // (`_encoders.py`): +/-inf is rejected with "Input contains infinity"
1081 // BEFORE the per-feature membership lookup (so an inf value reports the
1082 // finite-check error, NOT "unknown category"); NaN passes (it can be a
1083 // known category). #2225.
1084 if x.iter().any(|v| v.is_infinite()) {
1085 return Err(FerroError::InvalidParameter {
1086 name: "X".into(),
1087 reason: "Input X contains infinity or a value too large for dtype.".into(),
1088 });
1089 }
1090 if x.ncols() != n_features {
1091 return Err(FerroError::ShapeMismatch {
1092 expected: vec![x.nrows(), n_features],
1093 actual: vec![x.nrows(), x.ncols()],
1094 context: "FittedOneHotEncoder::transform".into(),
1095 });
1096 }
1097
1098 let n_samples = x.nrows();
1099 let mut out = Array2::zeros((n_samples, self.n_output));
1100
1101 for j in 0..n_features {
1102 let cats = &self.categories_[j];
1103 let offset = self.offsets[j];
1104 // The per-feature dropped category index, if any (`drop_idx_[j]`).
1105 // Used to shift kept categories down by one and to emit an all-zero
1106 // block for the dropped category (sklearn `transform`,
1107 // `_encoders.py:1033-1046`: `X_int > to_drop` decrements, the dropped
1108 // cell is masked out).
1109 let drop_d = self.drop_idx_.get(j).copied().flatten();
1110 // The per-feature infrequent remapping (REQ-5b). When feature `j` has
1111 // infrequent categories, a found category index maps to its block
1112 // slot via `infrequent_map[j][idx]` (a frequent category → its
1113 // remapped slot, an infrequent category → the trailing slot). When
1114 // there are none the map is the identity and `infreq` is `false`, so
1115 // the existing `drop` path is unchanged (the two are mutually
1116 // exclusive — `fit` rejects their combination).
1117 let infreq = self.has_infrequent(j);
1118 let infreq_map = self.infrequent_map.get(j);
1119 for i in 0..n_samples {
1120 let value = x[[i, j]];
1121 // Membership lookup: find the value's index in the sorted-unique
1122 // `categories_[j]` by EXACT equality (np.unique / `_encode`
1123 // semantics). A small linear scan over the per-feature category
1124 // set — bounds-safe (no unchecked indexing; R-CODE-2).
1125 match cats
1126 .iter()
1127 .position(|&c| c == value || (c.is_nan() && value.is_nan()))
1128 {
1129 // Infrequent grouping active: place the value in its remapped
1130 // block slot (`_BaseEncoder._map_infrequent_categories`,
1131 // `_encoders.py:442-452`: `X_int = np.take(mapping, X_int)`).
1132 Some(idx) if infreq => {
1133 if let Some(&slot) = infreq_map.and_then(|m| m.get(idx)) {
1134 out[[i, offset + slot]] = F::one();
1135 }
1136 }
1137 Some(idx) => match drop_d {
1138 // The dropped category encodes to an ALL-ZERO block: set
1139 // nothing (sklearn masks the dropped cell out of `X_mask`,
1140 // `_encoders.py:1037,1046`). `out` is already zero-filled.
1141 Some(d) if idx == d => {}
1142 // A KEPT category after a drop shifts down by one when its
1143 // index is past the dropped one (sklearn `X_int > to_drop`
1144 // decrements, `_encoders.py:1045`): the output column is
1145 // `idx` if `idx < d`, else `idx - 1`.
1146 Some(d) if idx > d => out[[i, offset + idx - 1]] = F::one(),
1147 // No drop on this feature, or a kept category before the
1148 // dropped one (`idx < d`): the column is `offset + idx`.
1149 _ => out[[i, offset + idx]] = F::one(),
1150 },
1151 None => match self.handle_unknown {
1152 // handle_unknown='ignore' (`_encoders.py:215-240`): the
1153 // unknown row is masked out and NO column in this
1154 // feature's block is set, so the per-feature one-hot block
1155 // stays ALL-ZERO. `out` is already zero-filled, so we just
1156 // skip — every KNOWN feature still sets its own bit.
1157 OneHotHandleUnknown::Ignore => continue,
1158 // handle_unknown='error' (the sklearn default, SHIPPED
1159 // REQ-2, UNCHANGED): ValueError "Found unknown categories
1160 // … during transform" (`_encoders.py:209-214`). `F: Float`
1161 // is not `Display`, so report the value via `to_f64`.
1162 OneHotHandleUnknown::Error => {
1163 let v = value.to_f64();
1164 let shown = match v {
1165 Some(f) => format!("[{f}]"),
1166 None => "[<non-finite>]".to_string(),
1167 };
1168 return Err(FerroError::InvalidParameter {
1169 name: format!("x[{i},{j}]"),
1170 reason: format!(
1171 "Found unknown categories {shown} in column {j} during transform"
1172 ),
1173 });
1174 }
1175 },
1176 }
1177 }
1178 }
1179
1180 Ok(out)
1181 }
1182}
1183
1184/// Implement `Transform` on the unfitted encoder to satisfy the `FitTransform: Transform`
1185/// supertrait bound. Calling `transform` on an unfitted encoder always returns an error.
1186impl<F: Float + Send + Sync + 'static> Transform<Array2<F>> for OneHotEncoder<F> {
1187 type Output = Array2<F>;
1188 type Error = FerroError;
1189
1190 /// Always returns an error — the encoder must be fitted first.
1191 ///
1192 /// Use [`Fit::fit`] to produce a [`FittedOneHotEncoder`], then call
1193 /// [`Transform::transform`] on that.
1194 fn transform(&self, _x: &Array2<F>) -> Result<Array2<F>, FerroError> {
1195 Err(FerroError::InvalidParameter {
1196 name: "OneHotEncoder".into(),
1197 reason: "encoder must be fitted before calling transform; use fit() first".into(),
1198 })
1199 }
1200}
1201
1202impl<F: Float + Send + Sync + 'static> FitTransform<Array2<F>> for OneHotEncoder<F> {
1203 type FitError = FerroError;
1204
1205 /// Fit the encoder on `x` and return the one-hot encoded output in one step.
1206 ///
1207 /// # Errors
1208 ///
1209 /// Returns an error if fitting or transformation fails.
1210 fn fit_transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
1211 let fitted = self.fit(x, &())?;
1212 fitted.transform(x)
1213 }
1214}
1215
1216/// Convenience: encode a 1-D array of numeric categories.
1217///
1218/// This wraps the input in a single-column `Array2<F>` and returns the encoded
1219/// result with one-hot columns for that single feature, matching the membership
1220/// encoding of [`Transform::transform`].
1221impl<F: Float + Send + Sync + 'static> FittedOneHotEncoder<F> {
1222 /// Transform a 1-D slice of numeric category values.
1223 ///
1224 /// # Errors
1225 ///
1226 /// Returns an error if the encoder was fitted on more than one column, or if
1227 /// any value is an unknown category (not in the learned `categories_[0]`).
1228 pub fn transform_1d(&self, x: &[F]) -> Result<Array2<F>, FerroError> {
1229 if self.categories_.len() != 1 {
1230 return Err(FerroError::InvalidParameter {
1231 name: "transform_1d".into(),
1232 reason: "encoder was fitted on more than one column; use transform instead".into(),
1233 });
1234 }
1235 let col = Array2::from_shape_vec((x.len(), 1), x.to_vec()).map_err(|e| {
1236 FerroError::InvalidParameter {
1237 name: "x".into(),
1238 reason: e.to_string(),
1239 }
1240 })?;
1241 self.transform(&col)
1242 }
1243}
1244
1245// ---------------------------------------------------------------------------
1246// Tests
1247// ---------------------------------------------------------------------------
1248
1249#[cfg(test)]
1250mod tests {
1251 use super::*;
1252 use ndarray::array;
1253
1254 #[test]
1255 fn test_one_hot_single_column() {
1256 let enc = OneHotEncoder::<f64>::new();
1257 let x = array![[0.0_f64], [1.0], [2.0]];
1258 let fitted = enc.fit(&x, &()).unwrap();
1259 assert_eq!(fitted.categories(), &[vec![0.0, 1.0, 2.0]]);
1260 assert_eq!(fitted.n_categories(), vec![3]);
1261 assert_eq!(fitted.n_output_features(), 3);
1262
1263 let out = fitted.transform(&x).unwrap();
1264 assert_eq!(out.shape(), &[3, 3]);
1265 // Row 0: category 0 → [1, 0, 0]
1266 assert_eq!(out[[0, 0]], 1.0);
1267 assert_eq!(out[[0, 1]], 0.0);
1268 assert_eq!(out[[0, 2]], 0.0);
1269 // Row 1: category 1 → [0, 1, 0]
1270 assert_eq!(out[[1, 0]], 0.0);
1271 assert_eq!(out[[1, 1]], 1.0);
1272 assert_eq!(out[[1, 2]], 0.0);
1273 // Row 2: category 2 → [0, 0, 1]
1274 assert_eq!(out[[2, 0]], 0.0);
1275 assert_eq!(out[[2, 1]], 0.0);
1276 assert_eq!(out[[2, 2]], 1.0);
1277 }
1278
1279 #[test]
1280 fn test_one_hot_multi_column() {
1281 let enc = OneHotEncoder::<f64>::new();
1282 // Two columns: col0 has 3 categories, col1 has 2 categories
1283 let x = array![[0.0_f64, 0.0], [1.0, 1.0], [2.0, 0.0]];
1284 let fitted = enc.fit(&x, &()).unwrap();
1285 assert_eq!(fitted.categories(), &[vec![0.0, 1.0, 2.0], vec![0.0, 1.0]]);
1286 assert_eq!(fitted.n_categories(), vec![3, 2]);
1287 assert_eq!(fitted.n_output_features(), 5);
1288
1289 let out = fitted.transform(&x).unwrap();
1290 assert_eq!(out.shape(), &[3, 5]);
1291 // Row 0: (0, 0) → [1,0,0, 1,0]
1292 assert_eq!(out.row(0).to_vec(), vec![1.0, 0.0, 0.0, 1.0, 0.0]);
1293 // Row 1: (1, 1) → [0,1,0, 0,1]
1294 assert_eq!(out.row(1).to_vec(), vec![0.0, 1.0, 0.0, 0.0, 1.0]);
1295 // Row 2: (2, 0) → [0,0,1, 1,0]
1296 assert_eq!(out.row(2).to_vec(), vec![0.0, 0.0, 1.0, 1.0, 0.0]);
1297 }
1298
1299 #[test]
1300 fn test_non_contiguous_single_column() {
1301 // The REQ-3 headline: non-contiguous integers {2,5,9} must yield 3
1302 // category columns (one per unique value), NOT max+1 == 10.
1303 let enc = OneHotEncoder::<f64>::new();
1304 let x = array![[2.0_f64], [5.0], [9.0]];
1305 let fitted = enc.fit(&x, &()).unwrap();
1306 assert_eq!(fitted.categories(), &[vec![2.0, 5.0, 9.0]]);
1307 assert_eq!(fitted.n_output_features(), 3);
1308 let out = fitted.transform(&x).unwrap();
1309 assert_eq!(out.shape(), &[3, 3]);
1310 assert_eq!(out.row(0).to_vec(), vec![1.0, 0.0, 0.0]);
1311 assert_eq!(out.row(1).to_vec(), vec![0.0, 1.0, 0.0]);
1312 assert_eq!(out.row(2).to_vec(), vec![0.0, 0.0, 1.0]);
1313 }
1314
1315 #[test]
1316 fn test_unknown_category_error() {
1317 let enc = OneHotEncoder::<f64>::new();
1318 let x_train = array![[0.0_f64], [1.0]];
1319 let fitted = enc.fit(&x_train, &()).unwrap();
1320 // Value 2.0 was not seen during fitting → unknown category.
1321 let x_bad = array![[2.0_f64]];
1322 assert!(fitted.transform(&x_bad).is_err());
1323 }
1324
1325 #[test]
1326 fn test_fit_transform_equivalence() {
1327 let enc = OneHotEncoder::<f64>::new();
1328 let x = array![[0.0_f64, 1.0], [1.0, 0.0], [2.0, 1.0]];
1329 let via_fit_transform: Array2<f64> = enc.fit_transform(&x).unwrap();
1330 let fitted = enc.fit(&x, &()).unwrap();
1331 let via_separate = fitted.transform(&x).unwrap();
1332 for (a, b) in via_fit_transform.iter().zip(via_separate.iter()) {
1333 assert!((a - b).abs() < 1e-15);
1334 }
1335 }
1336
1337 #[test]
1338 fn test_shape_mismatch_error() {
1339 let enc = OneHotEncoder::<f64>::new();
1340 let x_train = array![[0.0_f64, 1.0], [1.0, 0.0]];
1341 let fitted = enc.fit(&x_train, &()).unwrap();
1342 let x_bad = array![[0.0_f64]];
1343 assert!(fitted.transform(&x_bad).is_err());
1344 }
1345}