//! One-hot encoder for categorical numeric features.
//!
//! `fit` learns, for each input column, `categories_[j]` = the **sorted unique
//! set** of values in that column (matching scikit-learn's
//! `OneHotEncoder.categories_`, `_BaseEncoder._fit:99`, `categories_ =
//! _unique(Xi)`). `transform` emits a dense binary matrix where each learned
//! category gets its own output column; the per-feature blocks are concatenated
//! left-to-right (column 0's categories first, then column 1's, …), and a value
//! is one-hot by **category membership** (the value's index within
//! `categories_[j]`), NOT by an assumed contiguous `0..max` integer layout.
//!
//! # Example
//!
//! ```text
//! Input column with the (non-contiguous) categories {2, 5, 9}:
//! [2, 5, 9] → [[1,0,0],[0,1,0],[0,0,1]] (3 columns, one per unique value)
//! ```
//!
//! # `## REQ status`
//!
//! Binary (R-DEFER-2), translating `sklearn/preprocessing/_encoders.py` (`class OneHotEncoder`
//! `:458`). Design doc: `.design/preprocess/one_hot_encoder.md`. Expected values from the live
//! sklearn 1.5.2 oracle (R-CHAR-3). Consumer: crate re-export (`lib.rs`, grandfathered S5).
//! HONEST (R-HONEST-3): ferrolearn ships a numeric (`F`-input) DENSE encoder whose `categories_`
//! and column layout now match sklearn's `sparse_output=False` output for ANY finite numeric
//! columns (contiguous or not); `drop` ({None,'first','if_binary'}, REQ-5a) IS shipped, as are
//! `handle_unknown='ignore'` and `inverse_transform`/`get_feature_names_out`. Sparse-by-default
//! output, string/object categories, infrequent grouping (`min_frequency`/`max_categories`,
//! REQ-5b), the full ctor surface and the ferray substrate stay NOT-STARTED. The PyO3 binding ships the
//! DENSE numeric path (`ferrolearn.OneHotEncoder`, REQ-8) with the unsupported surface surfaced
//! as `NotImplementedError`/`ValueError` rather than silently mismatched (R-HONEST-3).
//!
//! | REQ | Status | Evidence |
//! |---|---|---|
//! | REQ-1 (dense one-hot via per-feature category blocks) | SHIPPED | `Transform::transform for FittedOneHotEncoder` zero-fills an `Array2<F>` of width `n_output()` then, for each value, sets `out[[i, offsets[j]+idx]]=1` where `idx` is the value's index in `categories_[j]` (membership), mirroring `_BaseEncoder._transform` (`_encoders.py:206-240`) + the one-hot block expansion. Consumer: crate re-export `lib.rs`. |
//! | REQ-2 (sparse-by-default output) | NOT-STARTED | open prereq blocker #1149. Dense `Array2<F>` only; sklearn defaults `sparse_output=True` → scipy CSR (`:531`,`:748`). |
//! | REQ-3 (categories_ = sorted unique set) | SHIPPED | `Fit::fit` computes `categories_[j]` = per-column values sorted via `partial_cmp` then exact-equality deduped to the sorted-unique set (`_BaseEncoder._fit:99` `categories_=_unique(Xi)`); precomputes `offsets` (prefix sums of `categories_[j].len()`) + `n_output`; rejects 0 rows (`InsufficientSamples`). `categories()` accessor exposes the learned sets. Transform is membership-based (value's index in `categories_[j]`), so non-contiguous integers (`[2,5,9]` → 3 columns, NOT 10) and arbitrary finite floats encode correctly — bit-exact to live sklearn 1.5.2 `sparse_output=False`: `categories_`/`transform`/non-contiguous-headline/offsets guards in `tests/divergence_one_hot_encoder.rs`. Consumer: crate re-export `lib.rs`. SCOPE: numeric `F` input; exact float equality for membership (np.unique semantics — documented); NaN-as-a-category is HANDLED (#2223): NaN sorts LAST + collapses to one category (sklearn `_encode.py:70-74`), a NaN row one-hots its column; +/-inf is REJECTED at `fit`/`transform` (#2225, `force_all_finite="allow-nan"` allows NaN but not inf); string/object input is REQ-3-string (NOT-STARTED, no String path). |
//! | REQ-4 (handle_unknown {'error','ignore'}) | SHIPPED | `OneHotHandleUnknown` enum `{ Error (#[default]), Ignore }` (mirrors sklearn's `handle_unknown` `_parameter_constraints` `StrOptions({"error","ignore","infrequent_if_exist"})` default `"error"`, `_encoders.py:732,750`) + `OneHotEncoder::with_handle_unknown`/`handle_unknown()` builder+getter, threaded into `FittedOneHotEncoder` (`handle_unknown` field + getter) by `Fit::fit` (handle_unknown affects ONLY transform; `categories_` learned identically). `Transform::transform` unknown branch (`cats.iter().position(...) == None`): `Error` → `InvalidParameter` "Found unknown categories … during transform" (the SHIPPED REQ-2 default `ValueError`, `_encoders.py:209-214`, UNCHANGED); `Ignore` → `continue` leaving that feature's one-hot block ALL-ZERO (`_encoders.py:215-240`: unknown row masked out, no encoded column set), every KNOWN feature still one-hots. The +/-inf rejection (#2225), ncols + 0-row guards UNCHANGED (inf is invalid input, not an "unknown category" — still errors in `Ignore`; NaN with NO nan-category is "unknown" → all-zero block in `Ignore`, with a nan-category one-hots it). Never panics (R-CODE-2). Live-oracle parity (sklearn 1.5.2 `sparse_output=False`): `ignore_multifeature_all_zero_block` (`[[100,0],[5,99]]→[[0,0,0,1,0],[0,1,0,0,0]]`), `ignore_fully_unknown_row_all_zero`, `ignore_known_row_normal_one_hot`, `error_default_unknown_rejected`, `with_handle_unknown_ignore_known_value_normal`, `ignore_inf_still_rejected`, `ignore_nan_no_category_all_zero`, `ignore_nan_with_category_one_hots`, `handle_unknown_default_and_builder_abi` (`tests/divergence_one_hot_encoder.rs`). Consumer: crate re-export `lib.rs` (`OneHotHandleUnknown`). R-DEV-2. STILL NOT-STARTED: `'infrequent_if_exist'` (REQ-5). |
//! | REQ-5a (`drop` {None,'first','if_binary'}) | SHIPPED | #1152: `OneHotDrop` enum `{ None_ (#[default]), First, IfBinary }` (mirrors sklearn `drop` `_parameter_constraints` `StrOptions({"first","if_binary"})` / `None`, `_encoders.py:730`,`:498-516`) + `OneHotEncoder::with_drop`/`drop()` builder+getter, threaded into `Fit::fit` which computes `drop_idx_: Vec<Option<usize>>` (sklearn `_compute_drop_idx`, `_encoders.py:812-831`: `None_`→all `None`; `First`→all `Some(0)` (empty feature `None`); `IfBinary`→`Some(0)` iff `len==2` else `None`) and recomputes `offsets`/`n_output` from the per-feature BLOCK WIDTH `len - (drop_idx is Some)`. `FittedOneHotEncoder::drop_idx_()` accessor exposes it. `Transform::transform` (`_encoders.py:1033-1046`): the dropped category emits an ALL-ZERO block; a kept category at membership index `idx` maps to output col `offset + (idx if idx<d else idx-1)` (the `X_int > to_drop` decrement). `inverse_transform` (`_encoders.py:1124-1172`): an all-zero block with `drop_idx_[j]==Some(d)` inverts to the DROPPED category `categories_[j][d]` in BOTH handle_unknown modes (sklearn checks `_drop_idx_after_grouping[i] is not None` FIRST, bypassing the all-zeros error / None paths); a 0-width fully-dropped feature fills the dropped category (`:1132-1135`); a kept block position `pos>=d` maps to category `pos+1`. `get_feature_names_out` OMITS the dropped category (`_compute_transformed_categories` `remove_dropped=True`, `:909`,`:1209-1212`). DROP+IGNORE interaction (verified LIVE, sklearn 1.5.2): `drop` + `handle_unknown='ignore'` is ALLOWED (does NOT raise at fit; warns on unknown at transform, encoding the unknown as an all-zero block == the dropped category) — ferrolearn matches (fit imposes no constraint). NEVER panics: every drop-shift index uses `get`/bounds-checked arithmetic (R-CODE-2). Live-oracle parity (sklearn 1.5.2 `sparse_output=False`, `drop=...`): `drop_first_*`, `drop_if_binary_*`, `drop_inverse_roundtrip_*`, `drop_single_category_fully_dropped_*`, `drop_shift_3cat_*`, `drop_plus_ignore_allowed_*`, `drop_idx_abi_*`, `drop_none_unchanged_*` (`tests/divergence_one_hot_encoder.rs`). Consumer: crate re-export `lib.rs` (`OneHotDrop`). R-DEV-2. |
//! | REQ-5b (infrequent grouping `min_frequency`/`max_categories`) | SHIPPED | #1152: `OneHotEncoder::with_min_frequency`/`with_max_categories` (+`min_frequency()`/`max_categories()` getters) add the integer-count infrequent thresholds (`_encoders.py:566-587`,`:733-738`). `Fit::fit` computes per-category training counts (the run-length over the sorted column) and, when `infrequent_enabled`, calls `identify_infrequent` (mirrors `_BaseEncoder._identify_infrequent`, `_encoders.py:275-318`: min_frequency `count < min_freq` FIRST, then max_categories on the survivors via a STABLE argsort over the full count array keeping the top `max_categories-1` — ties favor the LARGER index; `max_categories==1` → all infrequent) + `build_infrequent_map` (mirrors `_default_to_infrequent_mappings`, `:373-400`: frequent → its remapped slot `0..n_frequent`, infrequent → the trailing slot). `FittedOneHotEncoder` carries `infrequent_indices_` + the per-feature `infrequent_map`; `block_width` becomes `n_frequent + 1` (sklearn `_compute_n_features_outs`, `:948-953`); `offsets`/`n_output` recomputed from it. `infrequent_categories()` exposes the infrequent VALUES per feature (`infrequent_categories_`, `:254-262`,`:625-633`). `Transform::transform` routes a found category through `infrequent_map[j][idx]` (frequent → own col, infrequent → trailing col; `_map_infrequent_categories`, `:442-452`). `inverse_transform` maps the trailing infrequent column to `F::nan()` (DOCUMENTED SCOPE, R-HONEST-3: `Array2<F>` cannot hold sklearn's `'infrequent_sklearn'` string, `:1675-1677`, like the ignore-None NaN proxy #2227), frequent cols → their category. `get_feature_names_out` emits the frequent names + a trailing `"x{j}_infrequent_sklearn"` (`_compute_transformed_categories`, `:913-921`). Infrequent grouping REQUIRES `drop==None_` — combining it errors `InvalidParameter` (REQ-5a×5b interaction DEFERRED; sklearn allows it). Never panics (every remap bounds-checked, R-CODE-2). Live-oracle parity (sklearn 1.5.2 `sparse_output=False`): `infrequent_min_frequency_*`, `infrequent_max_categories_*`, `infrequent_max_categories_tiebreak`, `infrequent_both_*`, `infrequent_inverse_*`, `infrequent_feature_names_*`, `infrequent_multifeature_offsets`, `infrequent_no_infrequent_*`, `infrequent_drop_rejected`, `infrequent_disabled_unchanged` (`tests/divergence_one_hot_encoder.rs`). Consumer: crate re-export `lib.rs`. STILL NOT-STARTED: the FLOAT-fraction `min_frequency` (`:573-575`,`:297-299`), `drop`+infrequent (`:518-520`,`:818-902`), and `'infrequent_if_exist'` (`:550-560`) stay unimplemented. |
//! | REQ-6 (inverse_transform + get_feature_names_out) | SHIPPED | `FittedOneHotEncoder::inverse_transform` reduces each per-feature block `x[:, offsets[j]..offsets[j]+len(categories_[j])]` via **argmax** (numpy first-max-on-ties) to `categories_[j][argmax]`, then handles an ALL-ZERO block (`block_sum == 0`) per `handle_unknown` (sklearn `_encoders.py:1141`,`:1159-1168`): `Error` -> `InvalidParameter` ("Samples can not be inverted ... all zeros"); `Ignore` -> the unknown-category sentinel inverts to `None` in sklearn (`:1183`), represented here as `NaN` (Array2<F> cannot hold None, #2227) with the KNOWN feature blocks still recovered; 0-row → `InsufficientSamples`, `ncols != n_output` → `ShapeMismatch` (`:1100-1104`). Never panics (block slices bounds-checked, R-CODE-2). `FittedOneHotEncoder::get_feature_names_out` emits `format!("x{j}_{cat}")` over `categories_` with default `input_features=["x0",..]` + the `"concat"` combiner (`feature+"_"+str(category)`, `:1217,1224`) → `["x0_2.0","x0_5.0","x0_9.0","x1_0.0","x1_1.0"]`; the float label via `category_label` appends `.0` to whole-valued floats (Python `str(np.float64)`: `2.0`/`-3.0`/`2.5`), `NaN→"nan"`. Live-oracle parity (roundtrip incl. non-contiguous `{2,5,9}`, held-out `[[0,1,0,1,0]]→[[5,0]]`, all-zero/ncols/0-row errors, feature names whole+fractional+negative) in `tests/divergence_one_hot_encoder.rs`. Consumer: crate re-export (`lib.rs:141`). DOCUMENTED DIVERGENCE (R-HONEST-3): the float label uses Rust `Display` for non-whole values, so it diverges from Python's scientific notation at `|v|>=1e16` / `0<|v|<1e-4` (`1e+20`/`1e-07` vs full decimal) — not a plausible category. STILL NOT-STARTED within REQ-6: the `input_features=`/`feature_name_combiner=` params (`:1192,1222`) and the `drop`-aware inverse (REQ-5). The `handle_unknown='ignore'` inverse IS handled (#2227, all-zero -> NaN sentinel). |
//! | REQ-7 (ctor + dtype + _parameter_constraints) | SHIPPED | The supported ctor params are type-safe Rust enums — `OneHotHandleUnknown {Error,Ignore}` (REQ-4) and `OneHotDrop {None_,First,IfBinary}` (REQ-5a) — so sklearn's `handle_unknown`/`drop` `StrOptions` `_parameter_constraints` (`_encoders.py:733-738`) are provided BY THE TYPE SYSTEM (an out-of-domain value is unrepresentable). The numeric thresholds carry runtime constraints matching sklearn's `Interval(Integral, 1, None)`: `Fit::fit` rejects `min_frequency==Some(0)` and `max_categories==Some(0)` with `InvalidParameter` ("must be an int in the range [1, inf)", verified live: `OneHotEncoder(min_frequency=0).fit` -> InvalidParameterError, #1154). `dtype` is f64 (the category container; REQ-3-analog). The FULL 8-key keyword-only sklearn ctor surface (categories/drop/sparse_output/dtype/handle_unknown/min_frequency/max_categories/feature_name_combiner) is exposed + validated at the PyO3 binding (REQ-8, `_extras.py::OneHotEncoder` get_params parity + `_check_unsupported`). Live-oracle test `req7_min_frequency_max_categories_must_be_at_least_one`. Consumer: crate re-export `lib.rs`. |
//! | REQ-8 (PyO3 binding) | SHIPPED | #1155: `ferrolearn-python` exposes `ferrolearn.OneHotEncoder` over `{OneHotEncoder, FittedOneHotEncoder, OneHotHandleUnknown}`. The Rust shim `_RsOneHotEncoder` (hand `#[pyclass]`, `ferrolearn-python/src/extras.rs`) ctor takes `handle_unknown: String = "error"` mapped via `resolve_handle_unknown` ("error"→`Error`, "ignore"→`Ignore`, "infrequent_if_exist"→`PyNotImplementedError` REQ-5, bad→`PyValueError` per `_encoders.py:732` `StrOptions({"error","ignore","infrequent_if_exist"})`); `fit` builds `OneHotEncoder::<f64>::new().with_handle_unknown(..)` + runs `Fit`; `transform`/`inverse_transform`→`PyArray2<f64>` (FerroError→`PyValueError`; the `Ignore`-mode all-zero inverse flows through as NaN, #2227); `#[getter]`s `categories_` (a Python LIST of 1-D f64 numpy arrays via `PyList`), `feature_names_out` (`get_feature_names_out()`→`Vec<String>`), `n_features_in_` (`n_features()`). Registered in `lib.rs` (`m.add_class::<extras::RsOneHotEncoder>()`). The Python wrapper `_extras.py::OneHotEncoder(_TransformerWrapper)` mirrors sklearn's KEYWORD-ONLY 8-key ctor `(*, categories="auto", drop=None, sparse_output=True, dtype=np.float64, handle_unknown="error", min_frequency=None, max_categories=None, feature_name_combiner="concat")` (`_encoders.py:743-762`) for `get_params`/`clone` parity; `_make_rs` threads `handle_unknown`; `fit` calls `_check_unsupported` which HONESTLY (R-HONEST-3) rejects the core's gaps — `sparse_output=True` (the sklearn DEFAULT; dense-only REQ-2 #1149)/`categories!='auto'`/`drop`/`min_frequency`/`max_categories`/`feature_name_combiner!='concat'` (REQ-5/REQ-7 #1152/#1154) → `NotImplementedError`; `transform`/`inverse_transform`/`categories_`/`n_features_in_`/`get_feature_names_out(input_features=None)` guarded by `check_is_fitted`→`NotFittedError` pre-fit (`input_features!=None`→`NotImplementedError` REQ-6). Boundary consumer (R-DEFER-1): the `_extras.py::OneHotEncoder` wrapper + `lib.rs` `add_class` + `__init__.py` re-export. Live-oracle parity (model B, sklearn 1.5.2 `sparse_output=False`): `tests/divergence_one_hot_encoder_py.py` (17 pass) — multi-feature non-contiguous `transform`/`fit_transform`/`categories_`, `handle_unknown='ignore'` all-zero block, `inverse_transform` roundtrip + ignore-NaN-vs-None known-feature recovery, `get_feature_names_out` (`['x0_2.0',...]`), pre-fit `NotFittedError`, bad-handle_unknown `ValueError`, `infrequent_if_exist`/unsupported-param `NotImplementedError`, dense-only `sparse_output=True` error, `get_params` 8-key parity, `clone`. R-DEFER-1 satisfied. |
//! | REQ-9 (ferray substrate) | NOT-STARTED | open prereq blocker #1156. `ndarray::Array2`, not `ferray-core` (R-SUBSTRATE-1/2). |
use ferrolearn_core::error::FerroError;
use ferrolearn_core::traits::{Fit, FitTransform, Transform};
use ndarray::Array2;
use num_traits::Float;
use std::cmp::Ordering;
// ---------------------------------------------------------------------------
// OneHotHandleUnknown
// ---------------------------------------------------------------------------
/// How [`FittedOneHotEncoder`] treats a category at `transform` time that was not
/// seen during `fit` (an **unknown category**).
///
/// Mirrors scikit-learn's `OneHotEncoder(handle_unknown=...)` parameter
/// (`sklearn/preprocessing/_encoders.py:732,750`), whose
/// `_parameter_constraints` accepts `{'error', 'ignore', 'infrequent_if_exist'}`
/// and whose default is `'error'`. ferrolearn ships `Error` (REQ-2) and `Ignore`
/// (REQ-4); `'infrequent_if_exist'` is NOT-STARTED (REQ-5).
///
/// This is a distinct type from
/// [`ordinal_encoder::HandleUnknown`](crate::ordinal_encoder::HandleUnknown):
/// the one-hot encoder's modes are `{error, ignore}` while the ordinal encoder's
/// are `{error, use_encoded_value}` (sklearn's two `handle_unknown` enums differ
/// the same way).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum OneHotHandleUnknown {
/// Raise an error on any unknown category at `transform` time (scikit-learn's
/// default `handle_unknown='error'`, the default here too). The unfitted
/// encoder's [`Transform::transform`] returns
/// [`FerroError::InvalidParameter`] ("Found unknown categories … during
/// transform", `_encoders.py:209-214`).
#[default]
Error,
/// Encode an unknown category as an **all-zero** one-hot block for that
/// feature, leaving every known feature untouched (scikit-learn's
/// `handle_unknown='ignore'`, `_encoders.py:215-240`: the unknown row is
/// masked out and no column in that feature's block is set).
Ignore,
}
// ---------------------------------------------------------------------------
// OneHotDrop
// ---------------------------------------------------------------------------
/// Which category (if any) to drop from each feature's one-hot block at
/// `transform` time (`OneHotEncoder(drop=...)`).
///
/// Mirrors scikit-learn's `OneHotEncoder(drop=...)` parameter, whose
/// `_parameter_constraints` accepts `{'first', 'if_binary'}`, an array-like, or
/// `None` (`sklearn/preprocessing/_encoders.py:730`, default `None`). Dropping a
/// category removes one output column per feature, which is useful to break the
/// collinearity an unregularized linear model would otherwise see
/// (`_encoders.py:498-516`).
///
/// ferrolearn ships the `None`/`'first'`/`'if_binary'` modes (REQ-5). The
/// array-of-explicit-categories form (`drop[i]` = the category to drop in
/// feature `i`, `_encoders.py:515-516`) is NOT-STARTED.
///
/// The variant is named `None_` (not `None`) to avoid colliding with
/// [`Option::None`].
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum OneHotDrop {
/// Retain all categories — no column is dropped (scikit-learn's default
/// `drop=None`, `_encoders.py:509`,`:812-813`: `drop_idx_ = None`). The
/// default here too.
#[default]
None_,
/// Drop the **first** category of every feature (scikit-learn's
/// `drop='first'`, `_encoders.py:510-511`,`:815-816`: `drop_idx_[j] = 0` for
/// every feature). A feature with only one category is dropped entirely (its
/// block width becomes 0).
First,
/// Drop the first category of every feature that has **exactly two**
/// categories, leaving 1-category and 3+-category features intact
/// (scikit-learn's `drop='if_binary'`, `_encoders.py:512-514`,`:817-831`:
/// `drop_idx_[j] = 0` iff `len(categories_[j]) == 2`, else `None`).
IfBinary,
}
// ---------------------------------------------------------------------------
// OneHotEncoder (unfitted)
// ---------------------------------------------------------------------------
/// An unfitted one-hot encoder for multi-column numeric categorical data.
///
/// Input: `Array2<F>` where each column contains the (finite) numeric category
/// values. Calling [`Fit::fit`] learns, per column, the **sorted unique set** of
/// values (`categories_`) and returns a [`FittedOneHotEncoder`]. The output of
/// [`Transform::transform`] is a dense binary matrix with one column per learned
/// category, the per-feature blocks concatenated left-to-right.
///
/// # Examples
///
/// ```
/// use ferrolearn_preprocess::OneHotEncoder;
/// use ferrolearn_core::traits::{Fit, Transform};
/// use ndarray::array;
///
/// let enc = OneHotEncoder::<f64>::new();
/// // Non-contiguous categories {2, 5, 9} in column 0, {0, 1} in column 1.
/// let x = array![[2.0_f64, 0.0], [5.0, 1.0], [9.0, 0.0], [5.0, 1.0]];
/// let fitted = enc.fit(&x, &()).unwrap();
/// assert_eq!(fitted.categories(), &[vec![2.0, 5.0, 9.0], vec![0.0, 1.0]]);
/// let encoded = fitted.transform(&x).unwrap();
/// assert_eq!(encoded.ncols(), 5); // 3 + 2 category columns
/// ```
///
/// Unknown categories at `transform` time are, by default, rejected
/// ([`OneHotHandleUnknown::Error`], scikit-learn's `handle_unknown='error'`).
/// Configuring [`with_handle_unknown`](OneHotEncoder::with_handle_unknown) with
/// [`OneHotHandleUnknown::Ignore`] instead encodes an unknown category as an
/// all-zero one-hot block, matching `OneHotEncoder(handle_unknown='ignore')`.
#[derive(Debug, Clone)]
pub struct OneHotEncoder<F> {
/// Strategy for unknown categories at `transform` time
/// (`handle_unknown`). Defaults to [`OneHotHandleUnknown::Error`].
handle_unknown: OneHotHandleUnknown,
/// Which category (if any) to drop per feature (`drop`). Defaults to
/// [`OneHotDrop::None_`] (retain all categories).
drop: OneHotDrop,
/// Minimum frequency (count) below which a category is grouped into the
/// trailing "infrequent" output column (`min_frequency`). `None` (the
/// default) disables the min-frequency threshold. Mirrors scikit-learn's
/// `OneHotEncoder(min_frequency=...)` (`_encoders.py:566-577`,`:734-738`).
/// SCOPE: only the integer-count form is supported — sklearn also accepts a
/// FLOAT fraction `min_frequency * n_samples` (`:573-575`,`_encoders.py:297`),
/// which is NOT-STARTED here.
min_frequency: Option<usize>,
/// Upper limit on the number of output columns per feature when grouping
/// infrequent categories (`max_categories`); the infrequent column itself
/// counts toward this limit. `None` (the default) imposes no limit. Mirrors
/// scikit-learn's `OneHotEncoder(max_categories=...)`
/// (`_encoders.py:579-587`,`:733`).
max_categories: Option<usize>,
_marker: std::marker::PhantomData<F>,
}
impl<F: Float + Send + Sync + 'static> OneHotEncoder<F> {
/// Create a new `OneHotEncoder` with scikit-learn's default
/// `handle_unknown='error'` ([`OneHotHandleUnknown::Error`]).
#[must_use]
pub fn new() -> Self {
Self {
handle_unknown: OneHotHandleUnknown::Error,
drop: OneHotDrop::None_,
min_frequency: None,
max_categories: None,
_marker: std::marker::PhantomData,
}
}
/// Set the unknown-category strategy (`handle_unknown`).
///
/// With [`OneHotHandleUnknown::Ignore`] an unknown category at `transform`
/// time becomes an all-zero one-hot block for that feature instead of an
/// error, matching scikit-learn's `OneHotEncoder(handle_unknown='ignore')`
/// (`_encoders.py:215-240`).
#[must_use]
pub fn with_handle_unknown(mut self, handle_unknown: OneHotHandleUnknown) -> Self {
self.handle_unknown = handle_unknown;
self
}
/// Return the configured unknown-category strategy (`handle_unknown`).
#[must_use]
pub fn handle_unknown(&self) -> OneHotHandleUnknown {
self.handle_unknown
}
/// Set the drop strategy (`drop`).
///
/// With [`OneHotDrop::First`] the first category of every feature is dropped
/// from the output; with [`OneHotDrop::IfBinary`] only binary (2-category)
/// features lose their first category. The dropped category produces an
/// all-zero one-hot block, matching scikit-learn's `OneHotEncoder(drop=...)`
/// (`_encoders.py:498-516`).
#[must_use]
pub fn with_drop(mut self, drop: OneHotDrop) -> Self {
self.drop = drop;
self
}
/// Return the configured drop strategy (`drop`).
#[must_use]
pub fn drop(&self) -> OneHotDrop {
self.drop
}
/// Set the minimum-frequency threshold for infrequent grouping
/// (`min_frequency`, integer count).
///
/// At `fit` time a category whose count in the training data is **strictly
/// less than** `min_frequency` is grouped into a single trailing
/// "infrequent" output column for that feature, matching scikit-learn's
/// `OneHotEncoder(min_frequency=...)` integer form
/// (`_encoders.py:566-577`, `_identify_infrequent` `:295-296`
/// `category_count < self.min_frequency`).
///
/// Enabling infrequent grouping (`min_frequency` and/or `max_categories`)
/// requires `drop == OneHotDrop::None_`; combining it with `drop` is a
/// deferred interaction (REQ-5a×5b) and [`Fit::fit`] returns an error.
///
/// SCOPE (R-HONEST-3): only the integer-count form is supported. sklearn
/// also accepts a FLOAT `min_frequency` interpreted as the fraction
/// `min_frequency * n_samples` (`_encoders.py:573-575`,`:297-299`); the
/// float-fraction form is NOT-STARTED here.
#[must_use]
pub fn with_min_frequency(mut self, min_frequency: usize) -> Self {
self.min_frequency = Some(min_frequency);
self
}
/// Set the maximum number of output columns per feature for infrequent
/// grouping (`max_categories`).
///
/// At `fit` time, if a feature would otherwise produce more than
/// `max_categories` output columns, the least-frequent categories are
/// grouped into a single trailing "infrequent" column so the block width is
/// at most `max_categories` (the infrequent column itself counts toward the
/// limit). Mirrors scikit-learn's `OneHotEncoder(max_categories=...)`
/// (`_encoders.py:579-587`, `_identify_infrequent` `:303-315`).
///
/// Enabling infrequent grouping requires `drop == OneHotDrop::None_` (see
/// [`Self::with_min_frequency`]).
#[must_use]
pub fn with_max_categories(mut self, max_categories: usize) -> Self {
self.max_categories = Some(max_categories);
self
}
/// Return the configured minimum-frequency threshold (`min_frequency`), or
/// `None` if infrequent grouping by frequency is disabled.
#[must_use]
pub fn min_frequency(&self) -> Option<usize> {
self.min_frequency
}
/// Return the configured maximum output-column limit (`max_categories`), or
/// `None` if no limit is imposed.
#[must_use]
pub fn max_categories(&self) -> Option<usize> {
self.max_categories
}
/// Whether infrequent grouping is enabled (either `min_frequency` or
/// `max_categories` is set). Mirrors scikit-learn's `_infrequent_enabled`
/// (`_encoders.py:264-273`: `(max_categories is not None and
/// max_categories >= 1) or min_frequency is not None`).
fn infrequent_enabled(&self) -> bool {
self.min_frequency.is_some() || self.max_categories.is_some_and(|m| m >= 1)
}
}
impl<F: Float + Send + Sync + 'static> Default for OneHotEncoder<F> {
fn default() -> Self {
Self::new()
}
}
// ---------------------------------------------------------------------------
// FittedOneHotEncoder
// ---------------------------------------------------------------------------
/// A fitted one-hot encoder holding the sorted-unique category set per input
/// column, plus the precomputed output-column layout.
///
/// Created by calling [`Fit::fit`] on a [`OneHotEncoder`]. Mirrors
/// scikit-learn's `OneHotEncoder.categories_` (a list of arrays of the actual
/// sorted-unique values, `_BaseEncoder._fit:99`).
#[derive(Debug, Clone)]
pub struct FittedOneHotEncoder<F> {
/// Per-column sorted-unique category values (`categories_`). `categories_[j]`
/// is the sorted set of distinct values seen in input column `j`; its length
/// is the number of output columns devoted to that feature's block.
pub(crate) categories_: Vec<Vec<F>>,
/// Per-column output-block start offsets (prefix sums of the per-feature
/// **block width**). The block width of feature `j` is
/// `categories_[j].len() - (1 if drop_idx_[j].is_some() else 0)`. Output
/// column `offsets[j] + pos` is the one-hot bit for the `pos`-th *kept*
/// category of feature `j`. Has length `categories_.len()`.
pub(crate) offsets: Vec<usize>,
/// Total number of output columns (`Σ block_width(j)`), accounting for any
/// dropped categories (`drop`).
pub(crate) n_output: usize,
/// Strategy for unknown categories at `transform` time, threaded from the
/// unfitted [`OneHotEncoder`]. [`OneHotHandleUnknown::Error`] rejects an
/// unknown category; [`OneHotHandleUnknown::Ignore`] emits an all-zero block.
pub(crate) handle_unknown: OneHotHandleUnknown,
/// Per-feature index into `categories_[j]` of the category to drop, or `None`
/// for "no drop" on that feature (`drop_idx_`). Has length
/// `categories_.len()`. Mirrors scikit-learn's public `drop_idx_`
/// (`_encoders.py:608-615`,`:885-902`): `drop='first'` → every entry
/// `Some(0)`; `drop='if_binary'` → `Some(0)` iff the feature has exactly two
/// categories else `None`; `drop=None` → every entry `None`.
pub(crate) drop_idx_: Vec<Option<usize>>,
/// Per-feature indices into `categories_[j]` of the categories grouped as
/// **infrequent** (`min_frequency`/`max_categories`), sorted ascending.
/// Mirrors scikit-learn's private `_infrequent_indices[j]`
/// (`_encoders.py:336-340`,`:367-370`): the indices `idx` such that
/// `categories_[j][idx]` is an infrequent category. Empty when feature `j`
/// has no infrequent categories (sklearn's `None`). With infrequent grouping
/// disabled every entry is empty. Length `categories_.len()`.
pub(crate) infrequent_indices_: Vec<Vec<usize>>,
/// Per-feature mapping from a `categories_[j]` index to its OUTPUT column
/// offset WITHIN feature `j`'s block (before adding `offsets[j]`). Mirrors
/// scikit-learn's `_default_to_infrequent_mappings[j]`
/// (`_encoders.py:373-400`): a frequent category maps to its remapped slot
/// `0..n_frequent`, every infrequent category maps to the single trailing
/// slot `n_frequent`. When feature `j` has no infrequent categories the
/// mapping is the identity `0..len` (sklearn stores `None`; the identity is
/// the representable equivalent). Length `categories_.len()`, with
/// `infrequent_map[j].len() == categories_[j].len()`. Used by `transform`,
/// `inverse_transform`, and `get_feature_names_out` to place each category in
/// the right output column without recomputing the grouping.
pub(crate) infrequent_map: Vec<Vec<usize>>,
}
impl<F: Float + Send + Sync + 'static> FittedOneHotEncoder<F> {
/// Return the learned sorted-unique category set for each input column
/// (`categories_`).
///
/// `categories()[j][idx]` is the value encoded by output column
/// `offsets[j] + idx`. Mirrors scikit-learn's `OneHotEncoder.categories_`.
#[must_use]
pub fn categories(&self) -> &[Vec<F>] {
&self.categories_
}
/// Return the number of distinct categories for each input feature column,
/// i.e. the width of each per-feature one-hot block.
#[must_use]
pub fn n_categories(&self) -> Vec<usize> {
self.categories_.iter().map(Vec::len).collect()
}
/// Return the number of input feature columns.
#[must_use]
pub fn n_features(&self) -> usize {
self.categories_.len()
}
/// Return the total number of output columns (`Σ categories_[j].len()`).
#[must_use]
pub fn n_output_features(&self) -> usize {
self.n_output
}
/// Return the configured unknown-category strategy (`handle_unknown`),
/// threaded from the unfitted [`OneHotEncoder`].
#[must_use]
pub fn handle_unknown(&self) -> OneHotHandleUnknown {
self.handle_unknown
}
/// Return the per-feature drop index (`drop_idx_`).
///
/// `drop_idx_()[j]` is `Some(d)` if category `categories_[j][d]` is dropped
/// from feature `j`'s one-hot block (its block width is one less than
/// `categories_[j].len()`, and that category encodes to an all-zero block),
/// or `None` if no category is dropped from that feature. Mirrors
/// scikit-learn's public `drop_idx_` attribute (`_encoders.py:608-615`). With
/// `drop=None` (the default) every entry is `None`.
#[must_use]
pub fn drop_idx_(&self) -> &[Option<usize>] {
&self.drop_idx_
}
/// Return the infrequent category **values** for each feature
/// (`infrequent_categories_`).
///
/// `infrequent_categories()[j]` is the sorted list of category values from
/// `categories_[j]` that were grouped into the single trailing "infrequent"
/// output column (because their training count fell below `min_frequency`
/// and/or beyond the `max_categories` limit). An EMPTY inner `Vec` means
/// feature `j` had no infrequent categories (scikit-learn returns `None`
/// there; an empty list is the representable equivalent). With infrequent
/// grouping disabled every entry is empty. Mirrors scikit-learn's
/// `OneHotEncoder.infrequent_categories_`
/// (`_encoders.py:254-262`,`:625-633`): `category[indices]` over
/// `_infrequent_indices`.
#[must_use]
pub fn infrequent_categories(&self) -> Vec<Vec<F>> {
self.infrequent_indices_
.iter()
.enumerate()
.map(|(j, idxs)| {
idxs.iter()
.filter_map(|&idx| self.categories_.get(j).and_then(|c| c.get(idx)).copied())
.collect()
})
.collect()
}
/// Whether feature `j` has any infrequent categories (a trailing infrequent
/// output column). Bounds-safe: a `j` past the end yields `false`.
fn has_infrequent(&self, j: usize) -> bool {
self.infrequent_indices_
.get(j)
.is_some_and(|v| !v.is_empty())
}
/// Return the width of feature `j`'s one-hot block: `categories_[j].len()`
/// minus one if that feature has a dropped category. Bounds-safe: a `j` past
/// the end yields 0 (R-CODE-2).
fn block_width(&self, j: usize) -> usize {
let len = self.categories_.get(j).map_or(0, Vec::len);
// Infrequent grouping (REQ-5b) and `drop` (REQ-5a) are mutually
// exclusive — `fit` rejects their combination — so at most one branch
// applies. With infrequent categories the block is `n_frequent + 1`
// trailing infrequent column (sklearn `_compute_n_features_outs`
// `_encoders.py:948-953`: `output[i] -= infreq.size - 1`, i.e.
// `len - n_infreq + 1`).
let n_infreq = self.infrequent_indices_.get(j).map_or(0, Vec::len);
if n_infreq > 0 {
return len - n_infreq + 1;
}
let dropped = matches!(self.drop_idx_.get(j), Some(Some(_)));
len - usize::from(dropped && len > 0)
}
/// Invert a one-hot encoded matrix back to the original category values.
///
/// For each input feature `j` the per-feature block
/// `x[:, offsets[j] .. offsets[j] + categories_[j].len()]` is reduced to a
/// single category via **argmax** (the index of the maximum value in the
/// block, first-max on ties — numpy `argmax` semantics), and the original
/// value `categories_[j][argmax]` is written to `out[[i, j]]`. This mirrors
/// scikit-learn's `OneHotEncoder.inverse_transform`
/// (`sklearn/preprocessing/_encoders.py:1136-1139`):
/// `labels = sub.argmax(axis=1); X_tr[:, i] = cats[labels]`.
///
/// After the argmax, an **all-zero block** (a row whose per-feature block
/// sums to zero) cannot be inverted. With no `drop` and the default
/// `handle_unknown='error'` (the only mode ferrolearn ships — REQ-4/5), this
/// is an error, matching sklearn's
/// `ValueError("Samples [...] can not be inverted when drop=None and
/// handle_unknown='error' because they contain all zeros")`
/// (`_encoders.py:1160-1168`). A proper one-hot row from
/// [`Transform::transform`] has exactly one `1` per block, so argmax always
/// finds it and the block sum is never zero.
///
/// # Errors
///
/// - [`FerroError::InsufficientSamples`] if `x` has zero rows (sklearn
/// `check_array` requires a minimum of 1 sample).
/// - [`FerroError::ShapeMismatch`] if `x.ncols() != n_output` (sklearn's
/// "Shape of the passed X data is not correct" `ValueError`,
/// `_encoders.py:1100-1104`).
/// - [`FerroError::InvalidParameter`] if any per-feature block is all-zero
/// (the sklearn all-zeros `ValueError`, `_encoders.py:1164-1168`).
///
/// Never panics: every block slice is bounds-checked (R-CODE-2).
pub fn inverse_transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
let n_samples = x.nrows();
if n_samples == 0 {
return Err(FerroError::InsufficientSamples {
required: 1,
actual: 0,
context: "FittedOneHotEncoder::inverse_transform".into(),
});
}
// sklearn `inverse_transform` -> `check_array(X, accept_sparse="csr")`
// (`_encoders.py:1092`) with the DEFAULT `force_all_finite=True`, so a
// NaN or +/-inf cell in the one-hot matrix raises BEFORE the argmax
// (#2224). A valid one-hot row is all 0/1 (finite); a non-finite cell is
// invalid input.
if x.iter().any(|v| !v.is_finite()) {
return Err(FerroError::InvalidParameter {
name: "X".into(),
reason: "Input X contains NaN or infinity.".into(),
});
}
if x.ncols() != self.n_output {
return Err(FerroError::ShapeMismatch {
expected: vec![n_samples, self.n_output],
actual: vec![n_samples, x.ncols()],
context: "FittedOneHotEncoder::inverse_transform".into(),
});
}
let n_features = self.categories_.len();
let mut out = Array2::zeros((n_samples, n_features));
for j in 0..n_features {
let cats = &self.categories_[j];
let drop_d = self.drop_idx_.get(j).copied().flatten();
// The per-feature block WIDTH after drop (the number of output columns
// for this feature). With a dropped category the block is one narrower
// than `categories_[j]` (`_encoders.py:1124-1127` `cats_wo_dropped`).
let block_width = self.block_width(j);
let offset = self.offsets[j];
// A feature whose entire (single) category was dropped has a
// zero-width block (`drop='first'` on a 1-category feature). Every row
// inverts to that dropped category, with no columns consumed (sklearn
// `n_categories == 0` branch, `_encoders.py:1132-1135`).
if block_width == 0 {
if let Some(&cat) = drop_d.and_then(|d| cats.get(d)) {
for i in 0..n_samples {
out[[i, j]] = cat;
}
}
continue;
}
for i in 0..n_samples {
// Argmax over the per-feature block (numpy `argmax`: index of the
// maximum, FIRST on ties). Track the block sum to detect the
// all-zero case separately, mirroring sklearn's two-step
// argmax-then-all-zero-check (`_encoders.py:1136-1172`). `argmax`
// is a BLOCK position in `0..block_width`.
let mut argmax: usize = 0;
let mut max_val = x[[i, offset]];
let mut block_sum = max_val;
for k in 1..block_width {
let v = x[[i, offset + k]];
block_sum = block_sum + v;
if v > max_val {
max_val = v;
argmax = k;
}
}
if block_sum == F::zero() {
// All-zero block. With a dropped category this is the
// LEGITIMATE encoding of the dropped value, so it inverts to
// that category in BOTH handle_unknown modes — sklearn checks
// `_drop_idx_after_grouping[i] is not None` FIRST and maps the
// all-zero row to the dropped category (`_encoders.py:1150-1158`
// for ignore, `:1169-1172` for error), bypassing the
// "can not be inverted" / None paths.
if drop_d.is_some() {
if let Some(&cat) = drop_d.and_then(|d| cats.get(d)) {
out[[i, j]] = cat;
}
} else {
// No drop on this feature: the existing handle_unknown
// semantics (`_encoders.py:1141`,`:1159-1168`).
match self.handle_unknown {
OneHotHandleUnknown::Error => {
return Err(FerroError::InvalidParameter {
name: "X".into(),
reason: "Samples can not be inverted when drop=None and \
handle_unknown='error' because they contain all zeros"
.into(),
});
}
// `handle_unknown='ignore'` all-zero block -> None in
// sklearn (`:1183`); `Array2<F>` cannot hold None so we
// use NaN as the representable sentinel (#2227).
OneHotHandleUnknown::Ignore => {
out[[i, j]] = F::nan();
}
}
}
} else if self.has_infrequent(j) {
// Infrequent grouping (REQ-5b). The block POSITION `argmax`
// is a slot in `infrequent_map[j]`. The TRAILING slot
// (`n_frequent`) is the infrequent column: sklearn inverts it
// to the string `'infrequent_sklearn'` (`_encoders.py:1675-1677`,
// `_compute_transformed_categories:917`), which an `Array2<F>`
// cannot hold — NaN is the representable proxy (DOCUMENTED
// SCOPE, R-HONEST-3, like the ignore-None case #2227). A
// frequent slot inverts to the unique `categories_[j]` index
// that maps to it (`labels = cats_wo_dropped[argmax]`,
// `:1138-1139`). Bounds-safe via `get` (R-CODE-2).
let map = self.infrequent_map.get(j);
let n_frequent = block_width - 1; // the trailing slot index
if argmax >= n_frequent {
out[[i, j]] = F::nan();
} else if let Some(&cat) = map
.and_then(|m| m.iter().position(|&s| s == argmax))
.and_then(|orig| cats.get(orig))
{
out[[i, j]] = cat;
}
} else {
// Map the block POSITION back to a `categories_[j]` index: with
// a dropped category `d`, positions `>= d` correspond to the
// category one higher (the dropped category was removed),
// matching sklearn's `cats_wo_dropped` indexing
// (`_encoders.py:1124-1139`). Bounds-safe via `get` (R-CODE-2).
let cat_idx = match drop_d {
Some(d) if argmax >= d => argmax + 1,
_ => argmax,
};
if let Some(&cat) = cats.get(cat_idx) {
out[[i, j]] = cat;
}
}
}
}
Ok(out)
}
/// Return the output feature names, one per output column.
///
/// For each input feature `j`, for each category `c` in `categories_[j]`,
/// emits `format!("x{j}_{c}")` where `c` is rendered to match Python's
/// `str(np.float64(c))`. This mirrors scikit-learn's
/// `OneHotEncoder.get_feature_names_out` with the default `input_features`
/// (`["x0", "x1", ...]`) and the `"concat"` name combiner
/// (`feature + "_" + str(category)`, `_encoders.py:1217,1224`). For the
/// whole-number fixture `[[2,0],[5,1],[9,0],[5,1]]` this yields
/// `["x0_2.0", "x0_5.0", "x0_9.0", "x1_0.0", "x1_1.0"]`.
///
/// # Float-rendering divergence (HONEST, R-HONEST-3)
///
/// The category is rendered via [`Self::category_label`], which appends `.0`
/// to integer-valued floats (`2.0 → "2.0"`, `-3.0 → "-3.0"`, matching
/// Python) and uses Rust's shortest round-trip `Display` otherwise
/// (`2.5 → "2.5"`). For category values in the usual categorical range
/// (small whole or fractional numbers) this is byte-identical to Python.
/// It DIVERGES for extreme magnitudes: Python's `repr`/`str` switches to
/// scientific notation at `|v| >= 1e16` and `0 < |v| < 1e-4`
/// (`1e+20`, `1e-07`), while Rust's `Display` prints the full decimal
/// (`100000000000000000000`, `0.0000001`). Such values are not plausible
/// one-hot categories; the divergence is documented rather than papered over.
/// `NaN` renders as `"nan"` (matching Python's `str(nan)`).
#[must_use]
pub fn get_feature_names_out(&self) -> Vec<String> {
let mut names = Vec::with_capacity(self.n_output);
for (j, cats) in self.categories_.iter().enumerate() {
// The dropped category's name is OMITTED (sklearn
// `_compute_transformed_categories` with `remove_dropped=True`,
// `_encoders.py:1209-1212`,`:909`).
let drop_d = self.drop_idx_.get(j).copied().flatten();
// Infrequent grouping (REQ-5b): emit only the FREQUENT category names
// then a single trailing `"x{j}_infrequent_sklearn"` column — the
// infrequent categories collapse into that one column (sklearn
// `_compute_transformed_categories`, `_encoders.py:913-921`:
// `cats[frequent_mask] + ['infrequent_sklearn']`). Infrequent and
// `drop` are mutually exclusive, so `drop_d` is `None` here.
if self.has_infrequent(j) {
let map = self.infrequent_map.get(j);
let n_frequent = self.block_width(j).saturating_sub(1);
for slot in 0..n_frequent {
// The unique frequent category whose remapped slot is `slot`.
if let Some(&c) = map
.and_then(|m| m.iter().position(|&s| s == slot))
.and_then(|orig| cats.get(orig))
{
names.push(format!("x{j}_{}", Self::category_label(c)));
}
}
names.push(format!("x{j}_infrequent_sklearn"));
continue;
}
for (idx, &c) in cats.iter().enumerate() {
if drop_d == Some(idx) {
continue;
}
names.push(format!("x{j}_{}", Self::category_label(c)));
}
}
names
}
/// Render a category value to a string matching Python's `str(np.float64(v))`
/// for the categorical-value range (see [`Self::get_feature_names_out`] for
/// the documented extreme-magnitude divergence).
///
/// Python's `str(float)` always shows a decimal point for whole floats
/// (`2.0`, not `2`), so an integer-valued finite float gets a `.0` suffix;
/// otherwise Rust's shortest round-trip `Display` is used. `NaN → "nan"`.
fn category_label(v: F) -> String {
let Some(f) = v.to_f64() else {
return "nan".to_string();
};
if f.is_nan() {
return "nan".to_string();
}
if f.is_finite() && f == f.trunc() {
// Whole-valued finite float: Python prints e.g. "2.0", "-3.0".
format!("{f:.1}")
} else {
// Fractional or non-finite: shortest round-trip Display ("2.5").
format!("{f}")
}
}
}
// ---------------------------------------------------------------------------
// Trait implementations
// ---------------------------------------------------------------------------
impl<F: Float + Send + Sync + 'static> Fit<Array2<F>, ()> for OneHotEncoder<F> {
type Fitted = FittedOneHotEncoder<F>;
type Error = FerroError;
/// Fit the encoder by learning the **sorted-unique category set** per column.
///
/// For each input column `j`, `categories_[j]` is the distinct values of that
/// column, sorted ascending via `partial_cmp` and deduped by **exact
/// equality** — mirroring scikit-learn's `categories_ = _unique(Xi)`
/// (`sklearn/preprocessing/_encoders.py:99`, `np.unique` per column).
/// The output-column layout (`offsets`, `n_output`) is precomputed as the
/// prefix sums / total of the per-column category counts.
///
/// Exact float equality is what `np.unique` does, so two values that differ
/// by an ULP are distinct categories here, exactly as in sklearn.
///
/// # NaN handling (#2223)
///
/// `NaN` is treated as a valid category, matching sklearn's `_unique_np`
/// (`_encode.py:70-74`): it sorts LAST and a run of duplicate NaNs collapses
/// to a SINGLE sorted-last category (the sort orders `NaN` after every finite
/// value and `dedup_by` collapses consecutive NaNs, since `NaN != NaN`). A
/// NaN cell at `transform` then one-hots that trailing category. `fit` never
/// panics (R-CODE-2).
///
/// # Errors
///
/// Returns [`FerroError::InsufficientSamples`] if the input has zero rows
/// (matching sklearn's `check_array` minimum-of-1-sample requirement).
fn fit(&self, x: &Array2<F>, _y: &()) -> Result<FittedOneHotEncoder<F>, FerroError> {
// sklearn `_parameter_constraints` (`@_fit_context`, `_encoders.py:733-738`)
// validates the params BEFORE the data: `min_frequency` is
// `Interval(Integral, 1, None)` and `max_categories` is
// `Interval(Integral, 1, None)` — a value of 0 raises
// `InvalidParameterError` ("must be ... in the range [1, inf)"). #1154/REQ-7.
// (handle_unknown/drop are type-safe Rust enums, so their StrOptions
// constraints are provided by the type system — no runtime check needed.)
if self.min_frequency == Some(0) {
return Err(FerroError::InvalidParameter {
name: "min_frequency".into(),
reason: "must be an int in the range [1, inf)".into(),
});
}
if self.max_categories == Some(0) {
return Err(FerroError::InvalidParameter {
name: "max_categories".into(),
reason: "must be an int in the range [1, inf)".into(),
});
}
let n_samples = x.nrows();
if n_samples == 0 {
return Err(FerroError::InsufficientSamples {
required: 1,
actual: 0,
context: "OneHotEncoder::fit".into(),
});
}
// sklearn `OneHotEncoder.fit` -> `check_array(force_all_finite="allow-nan")`:
// NaN is a valid CATEGORY (#2223), but +/-inf is REJECTED (verified live:
// fit([[inf]]) -> ValueError "Input contains infinity"). #2225.
if x.iter().any(|v| v.is_infinite()) {
return Err(FerroError::InvalidParameter {
name: "X".into(),
reason: "Input X contains infinity or a value too large for dtype.".into(),
});
}
let infrequent_enabled = self.infrequent_enabled();
let n_features = x.ncols();
let mut categories_: Vec<Vec<F>> = Vec::with_capacity(n_features);
// Per-feature, per-category training counts ALIGNED with `categories_[j]`
// (`category_counts[j][idx]` is the count of `categories_[j][idx]`).
// Only needed when infrequent grouping is enabled — sklearn computes
// counts via `_unique(Xi, return_counts=True)` (`_encoders.py:99-102`).
let mut category_counts: Vec<Vec<usize>> = Vec::with_capacity(n_features);
for j in 0..n_features {
// Collect this column's values, sort ascending (sklearn `np.unique`
// sorts), then dedup by EXACT equality to the sorted-unique set.
let mut col: Vec<F> = x.column(j).iter().copied().collect();
// Sort ascending with NaN LAST (sklearn `_unique_np` keeps any NaN at
// the end, `_encode.py:70-74`); `partial_cmp` alone returns None for
// NaN and would leave it unmoved (#2223).
col.sort_by(|a, b| match (a.is_nan(), b.is_nan()) {
(true, true) => Ordering::Equal,
(true, false) => Ordering::Greater,
(false, true) => Ordering::Less,
(false, false) => a.partial_cmp(b).unwrap_or(Ordering::Equal),
});
// Build the sorted-unique set AND, when infrequent grouping is
// enabled, the per-category run-length count (the sorted column has
// each category's occurrences contiguous, so a run length is the
// count). Consecutive EXACT-equal values collapse (an ULP-apart pair
// stays distinct, like `np.unique`), AND consecutive NaNs collapse to
// ONE (`dedup` alone keeps every NaN since `NaN != NaN`; sklearn
// collapses the trailing NaN run to a single sorted-last category,
// #2223).
let mut cats: Vec<F> = Vec::with_capacity(col.len());
let mut counts: Vec<usize> = Vec::with_capacity(col.len());
for v in col {
match cats.last() {
Some(&last) if last == v || (last.is_nan() && v.is_nan()) => {
if let Some(c) = counts.last_mut() {
*c += 1;
}
}
_ => {
cats.push(v);
counts.push(1);
}
}
}
categories_.push(cats);
category_counts.push(counts);
}
// Infrequent grouping (REQ-5b). When enabled, identify each feature's
// infrequent category indices and build the per-feature index→output
// column mapping; otherwise every feature has no infrequent categories
// and the mapping is the identity.
let mut infrequent_indices_: Vec<Vec<usize>> = Vec::with_capacity(n_features);
let mut infrequent_map: Vec<Vec<usize>> = Vec::with_capacity(n_features);
if infrequent_enabled {
// REQ-5a × REQ-5b interaction is DEFERRED: combining infrequent
// grouping with `drop` is rejected at fit (sklearn ALLOWS it, but the
// remapping is intricate — documented scope, R-HONEST-3). Require
// `drop == None_`.
if self.drop != OneHotDrop::None_ {
return Err(FerroError::InvalidParameter {
name: "drop".into(),
reason: "infrequent grouping (min_frequency/max_categories) with drop is not \
yet supported"
.into(),
});
}
for counts in &category_counts {
let infreq = identify_infrequent(counts, self.min_frequency, self.max_categories);
let map = build_infrequent_map(counts.len(), &infreq);
infrequent_indices_.push(infreq);
infrequent_map.push(map);
}
} else {
for cats in &categories_ {
infrequent_indices_.push(Vec::new());
infrequent_map.push((0..cats.len()).collect());
}
}
// Compute `drop_idx_` from `drop` + the learned `categories_`
// (sklearn `_compute_drop_idx`, `_encoders.py:812-831`). `drop=None` →
// every feature `None`; `drop='first'` → every feature `Some(0)`;
// `drop='if_binary'` → `Some(0)` iff the feature has exactly two
// categories, else `None`. (With infrequent grouping active `drop` is
// forced to `None_` above, so every entry is `None`.)
let drop_idx_: Vec<Option<usize>> = match self.drop {
OneHotDrop::None_ => vec![None; n_features],
OneHotDrop::First => categories_
.iter()
.map(|cats| if cats.is_empty() { None } else { Some(0) })
.collect(),
OneHotDrop::IfBinary => categories_
.iter()
.map(|cats| if cats.len() == 2 { Some(0) } else { None })
.collect(),
};
let mut fitted = FittedOneHotEncoder {
categories_,
// Placeholder; recomputed below from per-feature block widths.
offsets: Vec::new(),
n_output: 0,
// `handle_unknown` only affects `transform` (sklearn learns the same
// `categories_` regardless); thread the configured mode through. Note
// (verified live, sklearn 1.5.2): `drop` + `handle_unknown='ignore'`
// is ALLOWED — sklearn does NOT raise at fit; it warns on unknown at
// transform and encodes the unknown as an all-zero block (the same as
// the dropped category). So fit imposes no drop+ignore constraint.
handle_unknown: self.handle_unknown,
drop_idx_,
infrequent_indices_,
infrequent_map,
};
// Recompute the output-column layout from each feature's block width:
// `block_width(j)` is `n_frequent + 1` with infrequent grouping (the
// trailing infrequent column), else `len - (1 if dropped)`. `offsets` is
// the prefix sum of those widths; `n_output` the total (sklearn
// `_compute_n_features_outs`, `_encoders.py:936-955`; `feature_indices`,
// `:1049`).
let mut offsets: Vec<usize> = Vec::with_capacity(n_features);
let mut n_output: usize = 0;
for j in 0..n_features {
offsets.push(n_output);
n_output += fitted.block_width(j);
}
fitted.offsets = offsets;
fitted.n_output = n_output;
Ok(fitted)
}
}
/// Identify the indices of infrequent categories for one feature, given the
/// per-category training `counts` (aligned with `categories_[j]`) and the
/// `min_frequency`/`max_categories` thresholds.
///
/// Mirrors scikit-learn's `_BaseEncoder._identify_infrequent`
/// (`_encoders.py:275-318`):
/// 1. min_frequency: a category with `count < min_frequency` is infrequent
/// (`:295-296`, integer form only — the float-fraction form is out of scope).
/// 2. max_categories: if (after step 1) the feature would still produce more
/// than `max_categories` output columns — counted as `n_remaining_frequent +
/// 1` for the infrequent group (`:303`) — the least-frequent categories are
/// additionally marked infrequent until only `max_categories - 1` frequent
/// categories remain (`:304-315`). Ties broken by a STABLE sort over the
/// FULL count array, so among equal counts the SMALLER category index is
/// marked infrequent first (sklearn `np.argsort(kind="mergesort")[:-k]`).
/// `max_categories == 1` (frequent_category_count 0) makes every category
/// infrequent (`:307-309`).
///
/// Returns the sorted-ascending infrequent indices (empty if none — sklearn's
/// `None`). Never panics (R-CODE-2).
fn identify_infrequent(
counts: &[usize],
min_frequency: Option<usize>,
max_categories: Option<usize>,
) -> Vec<usize> {
let n = counts.len();
let mut infrequent_mask = vec![false; n];
// Step 1: min_frequency (integer count). `count < min_frequency`.
if let Some(min_freq) = min_frequency {
for (idx, &c) in counts.iter().enumerate() {
if c < min_freq {
infrequent_mask[idx] = true;
}
}
}
// Step 2: max_categories on the survivors. `n_current_features` counts the
// remaining frequent categories PLUS 1 for the infrequent group
// (`_encoders.py:303`).
if let Some(max_cat) = max_categories {
let n_infreq = infrequent_mask.iter().filter(|&&m| m).count();
let n_current_features = n - n_infreq + 1;
if max_cat < n_current_features {
// `max_categories` includes the one infrequent category.
let frequent_category_count = max_cat - 1;
if frequent_category_count == 0 {
// All categories are infrequent (`:307-309`).
infrequent_mask.iter_mut().for_each(|m| *m = true);
} else {
// Stable argsort over the FULL count array (ascending by count,
// ties by ascending index), then mark the smallest
// `n - frequent_category_count` levels infrequent — i.e. keep the
// top `frequent_category_count` by count, with ties resolved in
// favor of the LARGER index (`np.argsort(kind="mergesort")[:-k]`,
// `:312-315`).
let mut order: Vec<usize> = (0..n).collect();
order.sort_by(|&a, &b| counts[a].cmp(&counts[b]).then(a.cmp(&b)));
let keep = frequent_category_count.min(n);
let cut = n - keep;
for &idx in &order[..cut] {
infrequent_mask[idx] = true;
}
}
}
}
infrequent_mask
.iter()
.enumerate()
.filter_map(|(idx, &m)| if m { Some(idx) } else { None })
.collect()
}
/// Build the per-feature mapping from a `categories_[j]` index to its output
/// column slot WITHIN the feature's block (before adding `offsets[j]`).
///
/// Mirrors scikit-learn's `_default_to_infrequent_mappings[j]`
/// (`_encoders.py:373-400`): frequent categories take slots `0..n_frequent` in
/// their original (ascending-index) order; every infrequent category maps to the
/// single trailing slot `n_frequent`. With no infrequent categories the mapping
/// is the identity `0..n`. `infrequent` must be sorted ascending. Never panics
/// (R-CODE-2): every index is bounds-checked.
fn build_infrequent_map(n: usize, infrequent: &[usize]) -> Vec<usize> {
if infrequent.is_empty() {
return (0..n).collect();
}
let n_frequent = n - infrequent.len();
let mut map = vec![n_frequent; n];
let mut next_frequent = 0usize;
for (idx, slot) in map.iter_mut().enumerate() {
if infrequent.binary_search(&idx).is_ok() {
// Infrequent → the trailing slot (already set to `n_frequent`).
} else {
*slot = next_frequent;
next_frequent += 1;
}
}
map
}
impl<F: Float + Send + Sync + 'static> Transform<Array2<F>> for FittedOneHotEncoder<F> {
type Output = Array2<F>;
type Error = FerroError;
/// Transform numeric categorical data into a dense one-hot encoded matrix.
///
/// Each value is one-hot by **category membership**: for input column `j` the
/// value `x[[i, j]]` is matched (by exact equality) against `categories_[j]`,
/// and the bit at output column `offsets[j] + idx` is set, where `idx` is the
/// value's position in the sorted-unique set. The per-feature one-hot blocks
/// are concatenated left-to-right, matching scikit-learn's
/// `OneHotEncoder(sparse_output=False)` output column layout
/// (`_BaseEncoder._transform`, `_encoders.py:206-240`).
///
/// A value not present in `categories_[j]` is an **unknown category**. Its
/// handling depends on the configured `handle_unknown`
/// ([`OneHotEncoder::with_handle_unknown`]):
/// - [`OneHotHandleUnknown::Error`] (the default): returns an error, matching
/// sklearn's `handle_unknown='error'`
/// (`ValueError("Found unknown categories … during transform")`,
/// `_encoders.py:209-214`).
/// - [`OneHotHandleUnknown::Ignore`]: leaves that feature's one-hot block
/// **all-zero** for this row (no column is set), matching sklearn's
/// `handle_unknown='ignore'` (`_encoders.py:215-240`: the unknown row is
/// masked out so no encoded column is set). Every KNOWN feature still emits
/// its normal one-hot bit.
///
/// The +/-inf rejection (#2225), the ncols guard, and the 0-row handling are
/// unaffected by `handle_unknown`: a non-finite +/-inf value is invalid input
/// (not an unknown category) and still errors even in `Ignore` mode.
///
/// # Errors
///
/// Returns [`FerroError::ShapeMismatch`] if the number of columns does not
/// match the number of features seen during fitting.
///
/// Returns [`FerroError::InvalidParameter`] if any value is an unknown
/// category (not in the learned `categories_[j]` set) AND `handle_unknown`
/// is [`OneHotHandleUnknown::Error`] (the default); under
/// [`OneHotHandleUnknown::Ignore`] an unknown category never errors. Also
/// returned if any value is +/-infinite (invalid input, #2225).
fn transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
let n_features = self.categories_.len();
// sklearn `transform` -> `check_array(force_all_finite="allow-nan")`
// (`_encoders.py`): +/-inf is rejected with "Input contains infinity"
// BEFORE the per-feature membership lookup (so an inf value reports the
// finite-check error, NOT "unknown category"); NaN passes (it can be a
// known category). #2225.
if x.iter().any(|v| v.is_infinite()) {
return Err(FerroError::InvalidParameter {
name: "X".into(),
reason: "Input X contains infinity or a value too large for dtype.".into(),
});
}
if x.ncols() != n_features {
return Err(FerroError::ShapeMismatch {
expected: vec![x.nrows(), n_features],
actual: vec![x.nrows(), x.ncols()],
context: "FittedOneHotEncoder::transform".into(),
});
}
let n_samples = x.nrows();
let mut out = Array2::zeros((n_samples, self.n_output));
for j in 0..n_features {
let cats = &self.categories_[j];
let offset = self.offsets[j];
// The per-feature dropped category index, if any (`drop_idx_[j]`).
// Used to shift kept categories down by one and to emit an all-zero
// block for the dropped category (sklearn `transform`,
// `_encoders.py:1033-1046`: `X_int > to_drop` decrements, the dropped
// cell is masked out).
let drop_d = self.drop_idx_.get(j).copied().flatten();
// The per-feature infrequent remapping (REQ-5b). When feature `j` has
// infrequent categories, a found category index maps to its block
// slot via `infrequent_map[j][idx]` (a frequent category → its
// remapped slot, an infrequent category → the trailing slot). When
// there are none the map is the identity and `infreq` is `false`, so
// the existing `drop` path is unchanged (the two are mutually
// exclusive — `fit` rejects their combination).
let infreq = self.has_infrequent(j);
let infreq_map = self.infrequent_map.get(j);
for i in 0..n_samples {
let value = x[[i, j]];
// Membership lookup: find the value's index in the sorted-unique
// `categories_[j]` by EXACT equality (np.unique / `_encode`
// semantics). A small linear scan over the per-feature category
// set — bounds-safe (no unchecked indexing; R-CODE-2).
match cats
.iter()
.position(|&c| c == value || (c.is_nan() && value.is_nan()))
{
// Infrequent grouping active: place the value in its remapped
// block slot (`_BaseEncoder._map_infrequent_categories`,
// `_encoders.py:442-452`: `X_int = np.take(mapping, X_int)`).
Some(idx) if infreq => {
if let Some(&slot) = infreq_map.and_then(|m| m.get(idx)) {
out[[i, offset + slot]] = F::one();
}
}
Some(idx) => match drop_d {
// The dropped category encodes to an ALL-ZERO block: set
// nothing (sklearn masks the dropped cell out of `X_mask`,
// `_encoders.py:1037,1046`). `out` is already zero-filled.
Some(d) if idx == d => {}
// A KEPT category after a drop shifts down by one when its
// index is past the dropped one (sklearn `X_int > to_drop`
// decrements, `_encoders.py:1045`): the output column is
// `idx` if `idx < d`, else `idx - 1`.
Some(d) if idx > d => out[[i, offset + idx - 1]] = F::one(),
// No drop on this feature, or a kept category before the
// dropped one (`idx < d`): the column is `offset + idx`.
_ => out[[i, offset + idx]] = F::one(),
},
None => match self.handle_unknown {
// handle_unknown='ignore' (`_encoders.py:215-240`): the
// unknown row is masked out and NO column in this
// feature's block is set, so the per-feature one-hot block
// stays ALL-ZERO. `out` is already zero-filled, so we just
// skip — every KNOWN feature still sets its own bit.
OneHotHandleUnknown::Ignore => continue,
// handle_unknown='error' (the sklearn default, SHIPPED
// REQ-2, UNCHANGED): ValueError "Found unknown categories
// … during transform" (`_encoders.py:209-214`). `F: Float`
// is not `Display`, so report the value via `to_f64`.
OneHotHandleUnknown::Error => {
let v = value.to_f64();
let shown = match v {
Some(f) => format!("[{f}]"),
None => "[<non-finite>]".to_string(),
};
return Err(FerroError::InvalidParameter {
name: format!("x[{i},{j}]"),
reason: format!(
"Found unknown categories {shown} in column {j} during transform"
),
});
}
},
}
}
}
Ok(out)
}
}
/// Implement `Transform` on the unfitted encoder to satisfy the `FitTransform: Transform`
/// supertrait bound. Calling `transform` on an unfitted encoder always returns an error.
impl<F: Float + Send + Sync + 'static> Transform<Array2<F>> for OneHotEncoder<F> {
type Output = Array2<F>;
type Error = FerroError;
/// Always returns an error — the encoder must be fitted first.
///
/// Use [`Fit::fit`] to produce a [`FittedOneHotEncoder`], then call
/// [`Transform::transform`] on that.
fn transform(&self, _x: &Array2<F>) -> Result<Array2<F>, FerroError> {
Err(FerroError::InvalidParameter {
name: "OneHotEncoder".into(),
reason: "encoder must be fitted before calling transform; use fit() first".into(),
})
}
}
impl<F: Float + Send + Sync + 'static> FitTransform<Array2<F>> for OneHotEncoder<F> {
type FitError = FerroError;
/// Fit the encoder on `x` and return the one-hot encoded output in one step.
///
/// # Errors
///
/// Returns an error if fitting or transformation fails.
fn fit_transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
let fitted = self.fit(x, &())?;
fitted.transform(x)
}
}
/// Convenience: encode a 1-D array of numeric categories.
///
/// This wraps the input in a single-column `Array2<F>` and returns the encoded
/// result with one-hot columns for that single feature, matching the membership
/// encoding of [`Transform::transform`].
impl<F: Float + Send + Sync + 'static> FittedOneHotEncoder<F> {
/// Transform a 1-D slice of numeric category values.
///
/// # Errors
///
/// Returns an error if the encoder was fitted on more than one column, or if
/// any value is an unknown category (not in the learned `categories_[0]`).
pub fn transform_1d(&self, x: &[F]) -> Result<Array2<F>, FerroError> {
if self.categories_.len() != 1 {
return Err(FerroError::InvalidParameter {
name: "transform_1d".into(),
reason: "encoder was fitted on more than one column; use transform instead".into(),
});
}
let col = Array2::from_shape_vec((x.len(), 1), x.to_vec()).map_err(|e| {
FerroError::InvalidParameter {
name: "x".into(),
reason: e.to_string(),
}
})?;
self.transform(&col)
}
}
// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------
#[cfg(test)]
mod tests {
use super::*;
use ndarray::array;
#[test]
fn test_one_hot_single_column() {
let enc = OneHotEncoder::<f64>::new();
let x = array![[0.0_f64], [1.0], [2.0]];
let fitted = enc.fit(&x, &()).unwrap();
assert_eq!(fitted.categories(), &[vec![0.0, 1.0, 2.0]]);
assert_eq!(fitted.n_categories(), vec![3]);
assert_eq!(fitted.n_output_features(), 3);
let out = fitted.transform(&x).unwrap();
assert_eq!(out.shape(), &[3, 3]);
// Row 0: category 0 → [1, 0, 0]
assert_eq!(out[[0, 0]], 1.0);
assert_eq!(out[[0, 1]], 0.0);
assert_eq!(out[[0, 2]], 0.0);
// Row 1: category 1 → [0, 1, 0]
assert_eq!(out[[1, 0]], 0.0);
assert_eq!(out[[1, 1]], 1.0);
assert_eq!(out[[1, 2]], 0.0);
// Row 2: category 2 → [0, 0, 1]
assert_eq!(out[[2, 0]], 0.0);
assert_eq!(out[[2, 1]], 0.0);
assert_eq!(out[[2, 2]], 1.0);
}
#[test]
fn test_one_hot_multi_column() {
let enc = OneHotEncoder::<f64>::new();
// Two columns: col0 has 3 categories, col1 has 2 categories
let x = array![[0.0_f64, 0.0], [1.0, 1.0], [2.0, 0.0]];
let fitted = enc.fit(&x, &()).unwrap();
assert_eq!(fitted.categories(), &[vec![0.0, 1.0, 2.0], vec![0.0, 1.0]]);
assert_eq!(fitted.n_categories(), vec![3, 2]);
assert_eq!(fitted.n_output_features(), 5);
let out = fitted.transform(&x).unwrap();
assert_eq!(out.shape(), &[3, 5]);
// Row 0: (0, 0) → [1,0,0, 1,0]
assert_eq!(out.row(0).to_vec(), vec![1.0, 0.0, 0.0, 1.0, 0.0]);
// Row 1: (1, 1) → [0,1,0, 0,1]
assert_eq!(out.row(1).to_vec(), vec![0.0, 1.0, 0.0, 0.0, 1.0]);
// Row 2: (2, 0) → [0,0,1, 1,0]
assert_eq!(out.row(2).to_vec(), vec![0.0, 0.0, 1.0, 1.0, 0.0]);
}
#[test]
fn test_non_contiguous_single_column() {
// The REQ-3 headline: non-contiguous integers {2,5,9} must yield 3
// category columns (one per unique value), NOT max+1 == 10.
let enc = OneHotEncoder::<f64>::new();
let x = array![[2.0_f64], [5.0], [9.0]];
let fitted = enc.fit(&x, &()).unwrap();
assert_eq!(fitted.categories(), &[vec![2.0, 5.0, 9.0]]);
assert_eq!(fitted.n_output_features(), 3);
let out = fitted.transform(&x).unwrap();
assert_eq!(out.shape(), &[3, 3]);
assert_eq!(out.row(0).to_vec(), vec![1.0, 0.0, 0.0]);
assert_eq!(out.row(1).to_vec(), vec![0.0, 1.0, 0.0]);
assert_eq!(out.row(2).to_vec(), vec![0.0, 0.0, 1.0]);
}
#[test]
fn test_unknown_category_error() {
let enc = OneHotEncoder::<f64>::new();
let x_train = array![[0.0_f64], [1.0]];
let fitted = enc.fit(&x_train, &()).unwrap();
// Value 2.0 was not seen during fitting → unknown category.
let x_bad = array![[2.0_f64]];
assert!(fitted.transform(&x_bad).is_err());
}
#[test]
fn test_fit_transform_equivalence() {
let enc = OneHotEncoder::<f64>::new();
let x = array![[0.0_f64, 1.0], [1.0, 0.0], [2.0, 1.0]];
let via_fit_transform: Array2<f64> = enc.fit_transform(&x).unwrap();
let fitted = enc.fit(&x, &()).unwrap();
let via_separate = fitted.transform(&x).unwrap();
for (a, b) in via_fit_transform.iter().zip(via_separate.iter()) {
assert!((a - b).abs() < 1e-15);
}
}
#[test]
fn test_shape_mismatch_error() {
let enc = OneHotEncoder::<f64>::new();
let x_train = array![[0.0_f64, 1.0], [1.0, 0.0]];
let fitted = enc.fit(&x_train, &()).unwrap();
let x_bad = array![[0.0_f64]];
assert!(fitted.transform(&x_bad).is_err());
}
}