ferrolearn_preprocess/
ordinal_encoder.rs

//! Ordinal encoder: map string categories to integer indices.
//!
//! Each column's categories are mapped to integers `0, 1, 2, ...` in
//! **lexicographic order** (matching scikit-learn's `OrdinalEncoder`).
//! Unknown categories seen during `transform` produce an error by default
//! (`handle_unknown='error'`); with `handle_unknown='use_encoded_value'` they
//! are instead encoded as a configurable `unknown_value` sentinel
//! (matching scikit-learn's `OrdinalEncoder`).
//!
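//! A minimal, std-only sketch of the sorted-unique mapping described above. It
//! illustrates the semantics and is not this module's implementation;
//! `ordinal_codes` is a hypothetical helper name used only here.
//!
//! ```rust
//! use std::collections::HashMap;
//!
//! /// Learn the sorted-unique categories of one column and encode every cell
//! /// as its index in that list, widened to `f64`.
//! fn ordinal_codes(column: &[&str]) -> (Vec<String>, Vec<f64>) {
//!     // `String` ordering compares bytes, which for valid UTF-8 equals
//!     // code-point order.
//!     let mut categories: Vec<String> = column.iter().map(|s| s.to_string()).collect();
//!     categories.sort();
//!     categories.dedup();
//!
//!     let codes: Vec<f64> = {
//!         // Category -> index lookup, scoped so the borrow of `categories`
//!         // ends before the list is returned.
//!         let index: HashMap<&str, usize> = categories
//!             .iter()
//!             .enumerate()
//!             .map(|(i, c)| (c.as_str(), i))
//!             .collect();
//!         // Every cell was seen while building `categories`, so the lookup
//!         // cannot miss in this sketch.
//!         column.iter().map(|s| index[s] as f64).collect()
//!     };
//!     (categories, codes)
//! }
//! ```
//!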
//! # `## REQ status`
//!
//! Binary (R-DEFER-2), translating `sklearn/preprocessing/_encoders.py` (`class OrdinalEncoder`
//! `:1235`). Design doc: `.design/preprocess/ordinal_encoder.md`. Expected values from the live
//! sklearn 1.5.2 oracle (R-CHAR-3). Consumer: crate re-export (`lib.rs:121`, grandfathered S5).
//! HONEST (R-HONEST-3): a FAITHFUL String-only ordinal encoder — `categories_`=sorted-unique and
//! the ordinal VALUES match sklearn bit-for-bit on the string path; the output container is now
//! `Array2<f64>` (sklearn's `dtype=np.float64` default, `:1262`); remaining divergences are
//! String-only input, the absent configurable `dtype` param, and the rest of the param/feature
//! surface.
//!
//! | REQ | Status | Evidence |
//! |---|---|---|
//! | REQ-1 (string fit → sorted-unique categories_) | SHIPPED | `Fit::fit` per column → `categories_`=sorted-unique (`Vec<String>::sort`, lexicographic) + index map; rejects 0 rows (`InsufficientSamples`, matches sklearn `check_array`). Mirrors `_BaseEncoder._fit` `categories_=_unique(Xi)` (`_encoders.py:99`). Critic-verified vs live oracle: `green_value_match_and_categories` (`[['bird','cat','dog'],['large','medium','small']]`), `green_lexicographic_sort_matches_np_unique` + `green_non_ascii_codepoint_order` (== `np.unique`), `green_empty_fit_rejected_matches_sklearn`. Consumer: re-export `lib.rs:121`. |
//! | REQ-2 (transform + fit_transform, ordinal values + unknown rejection) | SHIPPED | `Transform::transform` maps category→ordinal index (now cast to `f64` via `ordinal_index_to_f64`), unknown → `InvalidParameter` (matches `handle_unknown='error'` default `ValueError`), ncols-mismatch → `ShapeMismatch`. The unknown/ncols-mismatch LOGIC is byte-for-byte UNCHANGED by the dtype fix. Critic-verified: ordinal VALUES `[[1.,2.],[2.,0.],[1.,1.],[0.,2.]]` == live oracle, `green_unknown_category_rejected`, `green_fit_transform_equals_oracle`. Consumer: re-export `lib.rs:142`. |
//! | REQ-3 (output dtype float64) | SHIPPED | `Transform::Output = Array2<f64>` on BOTH `Transform` impls (`FittedOrdinalEncoder` + the unfitted `OrdinalEncoder` shim) and `FitTransform::fit_transform`; each cell is the ordinal index cast via `ordinal_index_to_f64` (`idx as f64`, lossless < 2^53), matching sklearn's default `dtype=np.float64` output container (`_encoders.py:1262`, `transform` casts `X_int.astype(self.dtype)`). The REQ-1/REQ-2 fit + unknown-rejection LOGIC is unchanged. Critic-verified vs live oracle: `green_fit_transform_f64_oracle` (multi-feature f64 matrix), `green_exact_integer_index_to_f64` (index 10 → `10.0`), plus the value guards over `Array2<f64>`. A CONFIGURABLE non-float64 output dtype (`int32` etc.) is a FOLLOW-ON (blocker #1158 remains open for the `dtype` ctor param); ferrolearn's output is fixed to sklearn's `float64` DEFAULT. This unblocks REQ-5's float `unknown_value` sentinel. Consumer: crate re-export `lib.rs:142`. |
//! | REQ-4 (numeric/mixed-dtype input) | NOT-STARTED | open prereq blocker #1159. `Array2<String>`-only; sklearn accepts int/str/object (`np.unique` numeric sort). |
//! | REQ-5 (handle_unknown='use_encoded_value' + unknown_value) | SHIPPED | `HandleUnknown` enum `{ Error, UseEncodedValue }` (default `Error`) + `unknown_value: Option<f64>` on `OrdinalEncoder`, threaded into `FittedOrdinalEncoder` via `with_handle_unknown`/`with_unknown_value` builders. `Fit::fit` runs the 3 sklearn validations (`_encoders.py:1473-1526`) AFTER the unchanged `categories_` compute, mapping sklearn's `TypeError`/`ValueError` → `FerroError::InvalidParameter`: (a) `UseEncodedValue` && `unknown_value is None` (sklearn `:1481` `not isinstance(.,Integral)` TypeError); (b) `Error` && `unknown_value is Some` (sklearn `:1488` TypeError); (c) `UseEncodedValue` && non-nan integer `v` with `0 <= v < max_cardinality` (sklearn `:1518-1526` ValueError collision). `Transform::transform` branches unknown categories: `UseEncodedValue` → write `unknown_value` (incl. nan) (sklearn `:1591` `X_trans[~X_mask] = self.unknown_value`); `Error` → `InvalidParameter` (the SHIPPED REQ-2 default, UNCHANGED). Seen categories still map to `idx as f64` (UNCHANGED). NEVER panics (R-CODE-2). Critic-verified vs live sklearn 1.5.2 oracle: `green_use_encoded_value_minus_one`, `green_use_encoded_value_nan`, `green_use_encoded_value_multifeature`, `red_uev_requires_unknown_value`, `red_error_mode_forbids_unknown_value`, `red_unknown_value_collision_in_range`, `green_unknown_value_negative_or_oob_or_nan_ok`, `green_error_mode_unknown_still_rejected` (`tests/divergence_ordinal_encoder.rs`). Configurable `dtype`/`encoded_missing_value` interplay stays OUT OF SCOPE (REQ-3/REQ-6). Consumer: crate re-export `lib.rs:142`. |
//! | REQ-6 (encoded_missing_value / NaN) | NOT-STARTED | open prereq blocker #1161. No missing-value concept (`:1283`). |
//! | REQ-7 (explicit categories param) | SHIPPED | `Categories` enum `{ Auto, Explicit(Vec<Vec<String>>) }` (default `Auto`) + `#[must_use] OrdinalEncoder::with_categories(Vec<Vec<String>>)` builder + `categories_param()` getter (named to avoid colliding with `FittedOrdinalEncoder::categories`). `Fit::fit` branches on the param AFTER the 0-row guard: `Auto` → the SHIPPED REQ-1 sorted-unique compute (UNCHANGED); `Explicit(lists)` → use each `lists[j]` AS-GIVEN for `categories_[j]` (GIVEN order, NOT re-sorted) + the index map in that order, mirroring sklearn `_encoders.py:114` `cats = np.array(self.categories[i])`. Validations match `_BaseEncoder._fit`: list-count ≠ n_features → `ShapeMismatch` ("Shape mismatch: if categories is an array, it has to be of shape (n_features,)." `:85-89`); an EMPTY list → `InvalidParameter` (sklearn indexes `cats[0]` -> IndexError in both modes, `:114-117`, #2229); a list with duplicate elements → `InvalidParameter` ("In column {j}, the predefined categories contain duplicate elements." `:136-141`); under [`HandleUnknown::Error`] (default) a data value not in its column's list → `InvalidParameter` ("Found unknown categories [{v}] in column {j} during fit" `:153-160`), while under [`HandleUnknown::UseEncodedValue`] this fit-time subset check is SKIPPED (out-of-set data is encoded to `unknown_value` at transform). The REQ-5 unknown_value validations still apply (the `max_cardinality` collision check now keys off the explicit list lengths). `Transform`/`inverse_transform`/`categories()`/`get_feature_names_out` are UNCHANGED — they already read `categories_`/`category_to_index`, which now reflect the explicit given-order set. NEVER panics (R-CODE-2). Critic-verified vs live sklearn 1.5.2 oracle (`tests/divergence_ordinal_encoder.rs`): `green_explicit_given_order_not_sorted`, `green_explicit_unsorted_accepted`, `red_explicit_error_mode_data_not_in_cats_fits_err`, `green_explicit_use_encoded_value_out_of_set_ok`, `red_explicit_n_features_mismatch`, `green_explicit_multifeature_each_own_order`, `red_explicit_duplicate_categories`, `green_explicit_inverse_roundtrip_given_order`, `green_explicit_auto_still_default`. Consumer: crate re-export (`lib.rs:142`, `Categories` re-exported). Configurable numeric/`bytes` categories + the nan-last rule stay OUT OF SCOPE (String-only path, REQ-4/REQ-6). |
//! | REQ-8 (min_frequency/max_categories infrequent) | SHIPPED | #1163: `OrdinalEncoder::with_min_frequency`/`with_max_categories` (+`min_frequency()`/`max_categories()` getters) add the integer-count infrequent thresholds (`_encoders.py:1289-1315`). The OrdinalEncoder ANALOG of the SHIPPED OneHotEncoder REQ-5b (`one_hot_encoder.rs`): the SAME `_identify_infrequent` algorithm (reused as `identify_infrequent` + `build_infrequent_map`, mirroring `_BaseEncoder._identify_infrequent` `:275-318` + `_default_to_infrequent_mappings` `:373-400`: min_frequency `count < min_freq` FIRST, then max_categories keeps top `max_categories-1` by count via a STABLE argsort over the full count array — ties favor the LARGER index; `max_categories==1` → all infrequent), but the infrequent categories collapse to a single shared ORDINAL CODE `n_frequent` (NOT a one-hot column): frequent categories keep codes `0..n_frequent` in their original sorted order, every infrequent category emits `n_frequent`. `Fit::fit` runs the `_parameter_constraints` check FIRST (`min_frequency`/`max_categories` `Some(0)` → `InvalidParameter` "must be an int in the range [1, inf)", BEFORE the data, `Interval(Integral,1,None)`), then (after the SHIPPED `categories_` compute, UNCHANGED — `categories_` keeps ALL categories) builds per-feature `infrequent_indices_`/`infrequent_map`/`n_frequent` from the fit-data category counts. `FittedOrdinalEncoder::infrequent_categories()` exposes the infrequent VALUES per feature (`infrequent_categories_`, `:255-262`). `Transform` routes a found category index through `infrequent_map[j]` then casts to f64 (frequent → own code, infrequent → shared trailing code; `_map_infrequent_categories`, `:402-452`); with grouping DISABLED the map is the identity so REQ-2 is UNCHANGED. `inverse_transform`: a code `< n_frequent` → the frequent category at that remapped slot (via the frequent-only category list); a code `== n_frequent` (exact float equality on the raw label) → the REAL String `"infrequent_sklearn"` (`:1644`,`:1675-1677`) — representable, UNLIKE OneHotEncoder's NaN proxy; the truncate+wrap numpy index logic applies over the frequent-only list (SHIPPED REQ-9 path UNCHANGED when disabled). The `unknown_value` collision check now keys off the EFFECTIVE code count `n_frequent + 1` (verified live: `min_frequency=2` over 4 cats → 3 codes → `unknown_value=3` accepted, `=2` collides). `get_feature_names_out` is UNCHANGED (OrdinalEncoder is one-to-one — infrequent does NOT add columns). NEVER panics (R-CODE-2). Critic-verified vs live sklearn 1.5.2 oracle (`tests/divergence_ordinal_encoder.rs`): `req8_min_frequency_two_categories_transform_inverse`, `req8_max_categories_keeps_top_k_minus_one`, `req8_max_categories_tiebreak_favors_larger_index`, `req8_both_set_multifeature_some_without_infrequent`, `req8_zero_thresholds_rejected`, `req8_infrequent_plus_use_encoded_value_distinct_codes`, `req8_unknown_value_collision_uses_effective_code_count`, `req8_disabled_default_unchanged`, `req8_inverse_infrequent_non_roundtrip`. Consumer: crate re-export `lib.rs:142`. STILL NOT-STARTED (R-HONEST-3): the FLOAT-fraction `min_frequency` (`:1296-1297`,`:297-299`) and the explicit-`categories`+infrequent interaction stay unimplemented. |
//! | REQ-9 (inverse_transform) | SHIPPED | `FittedOrdinalEncoder::inverse_transform(&Array2<f64>) -> Array2<String>` reuses the SHIPPED `categories_` (REQ-1): each cell is an ordinal index into `categories[j]`, mirroring sklearn `X_tr[:, i] = self.categories_[i][labels]` (`_encoders.py:1595-1679`). Validates the index BEFORE lookup (no panic, R-CODE-2): an exact non-negative integer in `[0, len)` → `categories[j][index].clone()`; 0-row → `InsufficientSamples` (symmetry with the #2220 transform guard); ncols-mismatch → `ShapeMismatch` (sklearn `:1619`). FAITHFUL to numpy: mirrors `labels.astype("int64")` (truncate toward zero, Rust `as i64`) + numpy fancy indexing (negative WRAP, `-1.0` → last category, `-2.0` → `len-2`), raising only once the wrapped index leaves `[0, len)` (`_encoders.py:1664`,`:1679`). Non-finite (NaN/±inf) → `InvalidParameter` (sklearn IndexError/ValueError; guarded because Rust `f64 as i64` saturates NaN→0). Critic-verified vs live sklearn 1.5.2 oracle: `green_inverse_roundtrip_multifeature`, `green_inverse_held_out_valid_ordinals`, `green_inverse_negative_wraps_like_numpy` (`-1.0`→'dog', `-2.0`→'cat', `-3.0`→Err), `green_inverse_non_integer_truncates_like_numpy` (`1.5`→'dog', `0.7`→'cat'), `red_inverse_out_of_range_positive` (`9.0`→Err), `red_inverse_ncols_mismatch`, `red_inverse_zero_row`, `red_inverse_use_encoded_value_unknown_cell` (`tests/divergence_ordinal_encoder.rs`). SCOPE LIMITATION (R-HONEST-3): the `unknown_value`-cell → `None` inverse (sklearn `:1673`) is unrepresentable in `Array2<String>` (would need `Array2<Option<String>>`), so a `use_encoded_value` cell equal to `unknown_value` ERRORS (checked BEFORE the index logic so the sentinel is not silently wrapped) instead of yielding `None`; the default `Error`-mode encoder has only valid ordinals so its inverse is COMPLETE and bit-exact. Consumer: crate re-export `lib.rs:142`. |
//! | REQ-10 (get_feature_names_out + n_features_in_) | SHIPPED | `FittedOrdinalEncoder::n_features_in()` (= `n_features()`, sklearn `n_features_in_`) + `get_feature_names_out(input_features)` — `OneToOneFeatureMixin` (one output col per input col) returns the INPUT names unchanged: `None` -> `["x0","x1",..]` (`_check_feature_names_in`), `Some(names)` -> verbatim, a wrong-length `input_features` -> `ShapeMismatch` (sklearn ValueError). Live-oracle test `req10_feature_names_out_and_n_features_in` (`['x0','x1']`, `['a','b']`, wrong-length Err). feature_names_in_ (string input-name capture) stays NOT-STARTED (ferrolearn fit takes positional columns, no input names). Consumer: crate re-export `lib.rs:142`. |
//! | REQ-11 (full ctor + _parameter_constraints) | NOT-STARTED | open prereq blocker #1166. `new()` takes no params (`:1320-1386`). |
//! | REQ-12 (PyO3 binding) | SHIPPED | `_RsOrdinalEncoder` (hand `#[pyclass]`, `ferrolearn-python/src/extras.rs`) over `OrdinalEncoder`/`FittedOrdinalEncoder`/`HandleUnknown`/`Categories` — the FIRST STRING-INPUT binding: `fit(rows)`/`transform(rows)` take a Python `list[list[str]]` (PyO3 `Vec<Vec<String>>` extraction, NOT a numpy f64 array), validate rectangular rows (ragged → `PyValueError`), build `Array2<String>` via `Array2::from_shape_vec`, and `transform` returns `PyArray2<f64>`; `inverse_transform(PyReadonlyArray2<f64>)` returns the `Array2<String>` rows as `Vec<Vec<String>>` (the `use_encoded_value`→None inverse ERRORS, REQ-9 scope → `PyValueError`). Ctor knobs `handle_unknown="error"` (`resolve_handle_unknown`: "error"→`Error`, "use_encoded_value"→`UseEncodedValue`, bad→`PyValueError` per `_encoders.py:1425`), `unknown_value: Option<f64>=None`, `categories: Option<Vec<Vec<String>>>=None` (None→`Auto`, Some→`Explicit`); the REQ-5/REQ-7 fit validations (`OrdinalEncoder::fit`) surface as `FerroError`→`PyValueError`. `#[getter]`s `categories_` (PyList of str lists), `n_features_in_`, `feature_names_out` (`get_feature_names_out(None)`). Registered `lib.rs` `m.add_class::<extras::RsOrdinalEncoder>()`. Non-test production consumer (R-DEFER-1): `_extras.py::OrdinalEncoder(BaseEstimator)` — a CUSTOM class (NOT `_TransformerWrapper`, input is str), full 7-key keyword-only ctor (`categories`/`dtype`/`handle_unknown`/`unknown_value`/`encoded_missing_value`/`min_frequency`/`max_categories`, `_encoders.py:1435-1452`) for `get_params`/`clone`, `_to_rows` (numpy str/object array OR list-of-lists → `list[list[str]]` via `np.asarray(X).astype(str).tolist()`), `_check_unsupported` (non-NaN `encoded_missing_value` REQ-6 / `min_frequency`/`max_categories` REQ-8 / non-f64 `dtype` REQ-3 → `NotImplementedError`), `fit`/`transform`/`fit_transform`/`inverse_transform` (→ numpy object array)/`get_feature_names_out`, `@property` `categories_`/`n_features_in_`, pre-fit access → `NotFittedError` (`check_is_fitted(self, "_rs")`); re-exported in `ferrolearn/__init__.py` as `ferrolearn.OrdinalEncoder`. Live-oracle parity (R-CHAR-3, sklearn 1.5.2, `tests/divergence_ordinal_encoder_py.py`, 19 pass): `fit_transform([['cat'],['dog'],['cat']])==[[0.],[1.],[0.]]`==sklearn, `categories_`==sklearn sorted-unique, multi-feature, inverse_transform roundtrip==original, `use_encoded_value`/`unknown_value=-1`→-1.0, explicit `categories=[['dog','cat','bird']]`→given-order index, `n_features_in_`, `get_feature_names_out`→`['x0','x1']` (+ input_features pass-through), pre-fit `NotFittedError`, bad `handle_unknown`→`ValueError`, unsupported (`encoded_missing_value`/`min_frequency`/`max_categories`/`dtype`)→`NotImplementedError`, 7-key get_params==sklearn, `clone`, numpy object/str-array input. STRING-only input (REQ-4 #1159), the `use_encoded_value`→None inverse (REQ-9), and the rest of the param surface stay OUT OF SCOPE (R-HONEST-3). |
//! | REQ-13 (ferray substrate) | NOT-STARTED | open prereq blocker #1168. `ndarray`+`HashMap`, not `ferray-core` (R-SUBSTRATE-1/2). |
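//!
//! The numpy-style index handling summarised in the REQ-9 row can be sketched
//! with std only. This is an illustration under stated assumptions, not the
//! crate's `inverse_transform`; `resolve_index` is a hypothetical helper name,
//! and `None` stands in for the error the real method returns.
//!
//! ```rust
//! /// Resolve a raw `f64` label to an index into a category list of length
//! /// `len`: truncate toward zero, wrap a negative index once, then range-check.
//! fn resolve_index(label: f64, len: usize) -> Option<usize> {
//!     // NaN and ±inf are rejected up front because `as i64` would map them
//!     // to 0 or to the i64 bounds.
//!     if !label.is_finite() {
//!         return None;
//!     }
//!     let truncated = label as i64; // truncates toward zero
//!     let len = len as i64; // assumes the list length fits in i64
//!     let wrapped = if truncated < 0 { truncated + len } else { truncated };
//!     if (0..len).contains(&wrapped) {
//!         Some(wrapped as usize)
//!     } else {
//!         None
//!     }
//! }
//! ```
//!
//! With two categories `["cat", "dog"]`, `resolve_index(-1.0, 2)` selects
//! `"dog"`, `resolve_index(1.5, 2)` also selects `"dog"`, and
//! `resolve_index(-3.0, 2)` is rejected, in line with the oracle cases listed in
//! the REQ-9 row.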

use ferrolearn_core::error::FerroError;
use ferrolearn_core::traits::{Fit, FitTransform, Transform};
use ndarray::Array2;
use std::collections::HashMap;

// ---------------------------------------------------------------------------
// HandleUnknown
// ---------------------------------------------------------------------------

/// How [`OrdinalEncoder`] treats categories at `transform` time that were not
/// seen during `fit`.
///
/// Mirrors scikit-learn's `OrdinalEncoder(handle_unknown=...)` parameter
/// (`sklearn/preprocessing/_encoders.py:1262`), which accepts `'error'` and
/// `'use_encoded_value'`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum HandleUnknown {
    /// Raise an error on any unknown category (scikit-learn's default
    /// `handle_unknown='error'`). This is also the default here.
    #[default]
    Error,
    /// Encode unknown categories with the configured `unknown_value` sentinel
    /// (scikit-learn's `handle_unknown='use_encoded_value'`). Requires
    /// `unknown_value` to be set.
    UseEncodedValue,
}

// ---------------------------------------------------------------------------
// Categories
// ---------------------------------------------------------------------------

/// How [`OrdinalEncoder`] determines, per column, the ordered category set used
/// to assign ordinal indices.
///
/// Mirrors scikit-learn's `OrdinalEncoder(categories=...)` parameter
/// (`sklearn/preprocessing/_encoders.py:1252`), which accepts `'auto'` or a
/// list of per-feature category lists.
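///
/// A std-only sketch of how the two modes could pick the category order for one
/// column. It is illustrative only: `category_list` is a hypothetical helper,
/// errors are plain strings rather than `FerroError`, and the fit-time check
/// that the data is a subset of an explicit list is left out.
///
/// ```rust
/// use std::collections::HashSet;
///
/// /// Build the ordered category list for one column.
/// fn category_list(column: &[&str], explicit: Option<&[&str]>) -> Result<Vec<String>, String> {
///     match explicit {
///         // Auto: sorted-unique values taken from the data.
///         None => {
///             let mut cats: Vec<String> = column.iter().map(|s| s.to_string()).collect();
///             cats.sort();
///             cats.dedup();
///             Ok(cats)
///         }
///         // Explicit: keep the caller's order; reject empty lists and duplicates.
///         Some(list) => {
///             if list.is_empty() {
///                 return Err("empty category list".to_string());
///             }
///             let mut seen = HashSet::new();
///             for cat in list {
///                 if !seen.insert(*cat) {
///                     return Err(format!("duplicate category {:?}", cat));
///                 }
///             }
///             Ok(list.iter().map(|s| s.to_string()).collect())
///         }
///     }
/// }
/// ```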
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub enum Categories {
    /// Determine the categories automatically from the training data as the
    /// sorted-unique values per column (scikit-learn's default `categories='auto'`).
    #[default]
    Auto,
    /// Use the explicit, user-provided category lists. `Explicit(lists)[j]` is
    /// the ordered category set for column `j`, used **as given** (the order is
    /// preserved, NOT re-sorted), mirroring scikit-learn's
    /// `categories=[list, ...]` (`_encoders.py:114`, the categories are used
    /// `np.array(self.categories[i])` as-is).
    Explicit(Vec<Vec<String>>),
}

// ---------------------------------------------------------------------------
// OrdinalEncoder (unfitted)
// ---------------------------------------------------------------------------

/// An unfitted ordinal encoder.
///
/// Calling [`Fit::fit`] on an `Array2<String>` learns, for each column, a
/// mapping from the unique string categories (sorted lexicographically)
/// to consecutive integers `0, 1, 2, ...`, and returns a
/// [`FittedOrdinalEncoder`].
///
/// Unknown categories at `transform` time are, by default, rejected
/// ([`HandleUnknown::Error`]). Configuring
/// [`with_handle_unknown`](OrdinalEncoder::with_handle_unknown) with
/// [`HandleUnknown::UseEncodedValue`] plus
/// [`with_unknown_value`](OrdinalEncoder::with_unknown_value) instead encodes
/// unknown categories as the supplied sentinel (which may be `f64::NAN`),
/// matching scikit-learn's `OrdinalEncoder(handle_unknown='use_encoded_value')`.
///
/// # Examples
///
/// ```
/// use ferrolearn_preprocess::ordinal_encoder::OrdinalEncoder;
/// use ferrolearn_core::traits::{Fit, Transform};
/// use ndarray::Array2;
///
/// let enc = OrdinalEncoder::new();
/// let data = Array2::from_shape_vec(
///     (3, 2),
///     vec![
///         "cat".to_string(), "small".to_string(),
///         "dog".to_string(), "large".to_string(),
///         "cat".to_string(), "small".to_string(),
///     ],
/// ).unwrap();
/// let fitted = enc.fit(&data, &()).unwrap();
/// let encoded = fitted.transform(&data).unwrap();
/// // Output is `Array2<f64>`, matching sklearn's `dtype=np.float64` default.
/// assert_eq!(encoded[[0, 0]], 0.0); // "cat" is index 0 in col 0
/// assert_eq!(encoded[[1, 0]], 1.0); // "dog" is index 1 in col 0
/// ```
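///
/// The two unknown-category strategies described above can also be sketched
/// without the crate. This is a simplified model, not the encoder's own code:
/// `encode_cell` is a hypothetical helper, `sentinel = None` stands for
/// [`HandleUnknown::Error`], and `Some(v)` stands for
/// [`HandleUnknown::UseEncodedValue`] with `unknown_value = v`.
///
/// ```rust
/// use std::collections::HashMap;
///
/// /// Encode one cell given the fitted category-to-index map of its column.
/// fn encode_cell(
///     index: &HashMap<String, usize>,
///     cell: &str,
///     sentinel: Option<f64>,
/// ) -> Result<f64, String> {
///     match (index.get(cell), sentinel) {
///         // A category seen during fit maps to its ordinal index as f64.
///         (Some(&i), _) => Ok(i as f64),
///         // Unknown category with a sentinel configured (may be NaN).
///         (None, Some(v)) => Ok(v),
///         // Unknown category in error mode.
///         (None, None) => Err(format!("unknown category {:?}", cell)),
///     }
/// }
/// ```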
#[derive(Debug, Clone, Default)]
pub struct OrdinalEncoder {
    /// How the per-column category sets are determined ([`Categories::Auto`] =
    /// sorted-unique from the data, the default; [`Categories::Explicit`] =
    /// user-provided lists used in the given order).
    categories: Categories,
    /// Strategy for unknown categories at `transform` time.
    handle_unknown: HandleUnknown,
    /// Sentinel written for unknown categories when `handle_unknown` is
    /// [`HandleUnknown::UseEncodedValue`]. May be `f64::NAN`.
    unknown_value: Option<f64>,
    /// Minimum frequency (count) below which a category is grouped into the
    /// single trailing "infrequent" ordinal index for that feature
    /// (`min_frequency`). `None` (the default) disables the min-frequency
    /// threshold. Mirrors scikit-learn's `OrdinalEncoder(min_frequency=...)`
    /// (`sklearn/preprocessing/_encoders.py:1289-1297`). SCOPE (R-HONEST-3):
    /// only the integer-count form is supported — sklearn also accepts a FLOAT
    /// fraction `min_frequency * n_samples` (`:1296-1297`,`:297-299`), which is
    /// NOT-STARTED here.
    min_frequency: Option<usize>,
    /// Upper limit on the number of output ordinal codes per feature when
    /// grouping infrequent categories (`max_categories`); the infrequent group
    /// itself counts toward this limit. `None` (the default) imposes no limit.
    /// Mirrors scikit-learn's `OrdinalEncoder(max_categories=...)`
    /// (`sklearn/preprocessing/_encoders.py:1301-1315`).
    max_categories: Option<usize>,
}

impl OrdinalEncoder {
    /// Create a new `OrdinalEncoder` with scikit-learn's defaults
    /// (`categories='auto'`, `handle_unknown='error'`, no `unknown_value`, and
    /// infrequent grouping disabled: no `min_frequency` or `max_categories`).
    #[must_use]
    pub fn new() -> Self {
        Self {
            categories: Categories::Auto,
            handle_unknown: HandleUnknown::Error,
            unknown_value: None,
            min_frequency: None,
            max_categories: None,
        }
    }

    /// Set the explicit per-column category lists (`categories=[list, ...]`).
    ///
    /// Each `lists[j]` is the ordered category set for column `j`, used **as
    /// given** at `fit` time — the order is preserved (NOT re-sorted), so the
    /// assigned ordinal indices follow the supplied order, matching
    /// scikit-learn's `OrdinalEncoder(categories=...)`
    /// (`sklearn/preprocessing/_encoders.py:114`).
    ///
    /// At `fit` time the number of lists must equal the number of input columns,
    /// no list may contain duplicates, and (under the default
    /// `handle_unknown='error'`) every value seen in the data must appear in its
    /// column's list; otherwise [`Fit::fit`] returns an error. See [`Fit::fit`]
    /// for the exact validation contract.
    #[must_use]
    pub fn with_categories(mut self, categories: Vec<Vec<String>>) -> Self {
        self.categories = Categories::Explicit(categories);
        self
    }

    /// Return the configured `categories` strategy ([`Categories::Auto`] or
    /// [`Categories::Explicit`]).
    ///
    /// Named `categories_param` to avoid colliding with
    /// [`FittedOrdinalEncoder::categories`], which returns the *learned*
    /// per-column category lists after fitting.
    #[must_use]
    pub fn categories_param(&self) -> &Categories {
        &self.categories
    }

    /// Set the unknown-category strategy (`handle_unknown`).
    ///
    /// With [`HandleUnknown::UseEncodedValue`] an `unknown_value` must also be
    /// supplied via [`with_unknown_value`](OrdinalEncoder::with_unknown_value);
    /// otherwise [`Fit::fit`] returns an error (matching scikit-learn's
    /// validation).
    #[must_use]
    pub fn with_handle_unknown(mut self, handle_unknown: HandleUnknown) -> Self {
        self.handle_unknown = handle_unknown;
        self
    }

    /// Set the sentinel written for unknown categories under
    /// [`HandleUnknown::UseEncodedValue`]. May be `f64::NAN`.
    ///
    /// Setting this while `handle_unknown` is [`HandleUnknown::Error`] causes
    /// [`Fit::fit`] to return an error (matching scikit-learn's validation).
    #[must_use]
    pub fn with_unknown_value(mut self, unknown_value: f64) -> Self {
        self.unknown_value = Some(unknown_value);
        self
    }

    /// Return the configured unknown-category strategy.
    #[must_use]
    pub fn handle_unknown(&self) -> HandleUnknown {
        self.handle_unknown
    }

    /// Return the configured unknown-category sentinel, if any.
    #[must_use]
    pub fn unknown_value(&self) -> Option<f64> {
        self.unknown_value
    }

    /// Set the minimum-frequency threshold for infrequent grouping
    /// (`min_frequency`, integer count).
    ///
    /// At `fit` time a category whose count in the training data is **strictly
    /// less than** `min_frequency` is grouped with the other infrequent
    /// categories into a single trailing ordinal index `n_frequent` for that
    /// feature (the frequent categories keep ordinal indices `0..n_frequent` in
    /// their original sorted order), matching scikit-learn's
    /// `OrdinalEncoder(min_frequency=...)` integer form
    /// (`sklearn/preprocessing/_encoders.py:1289-1297`, `_identify_infrequent`
    /// `:295-296` `category_count < self.min_frequency`).
    ///
    /// Unlike [`crate::OneHotEncoder`], the infrequent group collapses to ONE
    /// **ordinal index** (not a one-hot column), so `categories_` is unchanged
    /// (all categories retained) — only the emitted ordinal code is shared.
    ///
    /// SCOPE (R-HONEST-3): only the integer-count form is supported. sklearn
    /// also accepts a FLOAT `min_frequency` interpreted as the fraction
    /// `min_frequency * n_samples` (`_encoders.py:1296-1297`,`:297-299`); the
    /// float-fraction form is NOT-STARTED here.
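    ///
    /// A std-only sketch of the code assignment described above, covering
    /// only the integer `min_frequency` rule (no `max_categories`). It is an
    /// illustration, not the crate's implementation; `infrequent_codes` is a
    /// hypothetical helper, and `counts` is assumed to hold the training count
    /// of each category in `categories_` order.
    ///
    /// ```rust
    /// /// Map each category index to its emitted ordinal code.
    /// fn infrequent_codes(counts: &[usize], min_frequency: usize) -> Vec<usize> {
    ///     // Frequent categories are those with count >= min_frequency.
    ///     let n_frequent = counts.iter().filter(|&&c| c >= min_frequency).count();
    ///     let mut next = 0;
    ///     counts
    ///         .iter()
    ///         .map(|&c| {
    ///             if c < min_frequency {
    ///                 // Every infrequent category shares the trailing code.
    ///                 n_frequent
    ///             } else {
    ///                 // Frequent categories keep their relative order.
    ///                 let code = next;
    ///                 next += 1;
    ///                 code
    ///             }
    ///         })
    ///         .collect()
    /// }
    /// ```
    ///
    /// For counts `[3, 1, 1, 2]` and `min_frequency = 2` the sketch yields
    /// `[0, 2, 2, 1]`: two frequent codes plus one shared infrequent code.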
    #[must_use]
    pub fn with_min_frequency(mut self, min_frequency: usize) -> Self {
        self.min_frequency = Some(min_frequency);
        self
    }

    /// Set the maximum number of output ordinal codes per feature for infrequent
    /// grouping (`max_categories`).
    ///
    /// At `fit` time, if a feature would otherwise produce more than
    /// `max_categories` distinct ordinal codes, the least-frequent categories
    /// are grouped into the single trailing infrequent index so the number of
    /// codes is at most `max_categories` (the infrequent group itself counts
    /// toward the limit). Mirrors scikit-learn's
    /// `OrdinalEncoder(max_categories=...)`
    /// (`sklearn/preprocessing/_encoders.py:1301-1315`, `_identify_infrequent`
    /// `:303-315`).
    #[must_use]
    pub fn with_max_categories(mut self, max_categories: usize) -> Self {
        self.max_categories = Some(max_categories);
        self
    }

    /// Return the configured minimum-frequency threshold (`min_frequency`), or
    /// `None` if infrequent grouping by frequency is disabled.
    #[must_use]
    pub fn min_frequency(&self) -> Option<usize> {
        self.min_frequency
    }

    /// Return the configured maximum ordinal-code limit (`max_categories`), or
    /// `None` if no limit is imposed.
    #[must_use]
    pub fn max_categories(&self) -> Option<usize> {
        self.max_categories
    }

    /// Whether infrequent grouping is enabled (either `min_frequency` or
    /// `max_categories` is set). Mirrors scikit-learn's `_infrequent_enabled`
    /// (`_encoders.py:271-273`: `(max_categories is not None and
    /// max_categories >= 1) or min_frequency is not None`).
    fn infrequent_enabled(&self) -> bool {
        self.min_frequency.is_some() || self.max_categories.is_some_and(|m| m >= 1)
    }
}

// ---------------------------------------------------------------------------
// FittedOrdinalEncoder
// ---------------------------------------------------------------------------

/// A fitted ordinal encoder holding per-column category-to-index mappings.
///
/// Created by calling [`Fit::fit`] on an [`OrdinalEncoder`].
#[derive(Debug, Clone)]
pub struct FittedOrdinalEncoder {
    /// Per-column ordered category lists (index = integer value).
    pub(crate) categories: Vec<Vec<String>>,
    /// Per-column category-to-index maps.
    pub(crate) category_to_index: Vec<HashMap<String, usize>>,
    /// Strategy for unknown categories at `transform` time (threaded from the
    /// unfitted [`OrdinalEncoder`]).
    pub(crate) handle_unknown: HandleUnknown,
    /// Sentinel for unknown categories under
    /// [`HandleUnknown::UseEncodedValue`] (threaded from the unfitted encoder;
    /// validated to be present in that mode during `fit`).
    pub(crate) unknown_value: Option<f64>,
    /// Per-feature indices into `categories[j]` of the categories grouped as
    /// **infrequent** (`min_frequency`/`max_categories`), sorted ascending.
    /// Mirrors scikit-learn's private `_infrequent_indices[j]`
    /// (`_encoders.py:336-340`,`:367-370`). Empty when feature `j` has no
    /// infrequent categories (sklearn's `None`); with infrequent grouping
    /// disabled every entry is empty. Length `categories.len()`. The categories
    /// themselves are NOT removed from `categories[j]` (unlike one-hot column
    /// dropping) — only their emitted ordinal code is folded.
    pub(crate) infrequent_indices_: Vec<Vec<usize>>,
    /// Per-feature mapping from a `categories[j]` index to its emitted ORDINAL
    /// code. Mirrors scikit-learn's `_default_to_infrequent_mappings[j]`
    /// (`_encoders.py:373-400`): a frequent category maps to its remapped slot
    /// `0..n_frequent` (frequent categories keep their original sorted order),
    /// every infrequent category maps to the single trailing index
    /// `n_frequent`. When feature `j` has no infrequent categories the mapping
    /// is the identity `0..len` (sklearn stores `None`; the identity is the
    /// representable equivalent). Length `categories.len()`, with
    /// `infrequent_map[j].len() == categories[j].len()`. Used by `transform`
    /// and `inverse_transform`.
    pub(crate) infrequent_map: Vec<Vec<usize>>,
    /// Per-feature number of frequent categories (`n_frequent`): the trailing
    /// infrequent ordinal index when feature `j` has infrequent categories.
    /// Equals `categories[j].len() - infrequent_indices_[j].len()`. When feature
    /// `j` has no infrequent categories this equals `categories[j].len()` (the
    /// identity map's range). Length `categories.len()`. Used by
    /// `inverse_transform` to recognise the shared infrequent code.
    pub(crate) n_frequent: Vec<usize>,
}
350
351impl FittedOrdinalEncoder {
352    /// Return the ordered category list for each column.
353    ///
354    /// `categories()[j][i]` maps to code `i` in column `j` unless infrequent grouping remaps it.
355    #[must_use]
356    pub fn categories(&self) -> &[Vec<String>] {
357        &self.categories
358    }
359
360    /// Return the infrequent category **values** for each feature
361    /// (`infrequent_categories_`).
362    ///
363    /// `infrequent_categories()[j]` is the sorted list of category values from
364    /// `categories[j]` that were grouped into the single trailing "infrequent"
365    /// ordinal code (because their training count fell below `min_frequency`
366    /// and/or beyond the `max_categories` limit). An EMPTY inner `Vec` means
367    /// feature `j` had no infrequent categories (scikit-learn returns `None`
368    /// there; an empty list is the representable equivalent). With infrequent
369    /// grouping disabled every entry is empty. Mirrors scikit-learn's
370    /// `OrdinalEncoder.infrequent_categories_` (`_encoders.py:255-262`):
371    /// `category[indices]` over `_infrequent_indices`.
372    #[must_use]
373    pub fn infrequent_categories(&self) -> Vec<Vec<String>> {
374        self.infrequent_indices_
375            .iter()
376            .enumerate()
377            .map(|(j, idxs)| {
378                idxs.iter()
379                    .filter_map(|&idx| self.categories.get(j).and_then(|c| c.get(idx)).cloned())
380                    .collect()
381            })
382            .collect()
383    }
384
385    /// Return the number of input columns (features).
386    #[must_use]
387    pub fn n_features(&self) -> usize {
388        self.categories.len()
389    }
390
391    /// Return the number of features seen during `fit`.
392    ///
393    /// Mirrors scikit-learn's `n_features_in_` attribute (set by `_validate_data`
394    /// at fit, `sklearn/base.py`). Equal to [`n_features`](Self::n_features); the
395    /// distinct name matches sklearn's fitted-attribute surface (REQ-10).
396    #[must_use]
397    pub fn n_features_in(&self) -> usize {
398        self.categories.len()
399    }
400
401    /// Return the output feature names, one per input feature.
402    ///
403    /// `OrdinalEncoder` is a `OneToOneFeatureMixin` (one output column per input
404    /// column), so `get_feature_names_out` returns the INPUT feature names
405    /// unchanged (`sklearn/utils/_set_output` /
406    /// `OneToOneFeatureMixin.get_feature_names_out`): with `input_features = None`
407    /// the default names `["x0", "x1", ...]` (`_check_feature_names_in`), otherwise
408    /// the supplied names verbatim.
409    ///
410    /// # Errors
411    ///
412    /// Returns [`FerroError::ShapeMismatch`] if `input_features` is `Some` but its
413    /// length differs from [`n_features_in`](Self::n_features_in) (sklearn raises
414    /// `ValueError("input_features should have length equal to number of features
415    /// ...")`).
416    pub fn get_feature_names_out(
417        &self,
418        input_features: Option<&[String]>,
419    ) -> Result<Vec<String>, FerroError> {
420        let n = self.categories.len();
421        match input_features {
422            None => Ok((0..n).map(|j| format!("x{j}")).collect()),
423            Some(names) => {
424                if names.len() != n {
425                    return Err(FerroError::ShapeMismatch {
426                        expected: vec![n],
427                        actual: vec![names.len()],
428                        context: "FittedOrdinalEncoder::get_feature_names_out (input_features \
429                                  length must equal n_features_in_)"
430                            .into(),
431                    });
432                }
433                Ok(names.to_vec())
434            }
435        }
436    }
437
438    /// Return the configured unknown-category strategy.
439    #[must_use]
440    pub fn handle_unknown(&self) -> HandleUnknown {
441        self.handle_unknown
442    }
443
444    /// Return the configured unknown-category sentinel, if any.
445    #[must_use]
446    pub fn unknown_value(&self) -> Option<f64> {
447        self.unknown_value
448    }
449
450    /// Convert ordinal indices back to the original category strings.
451    ///
452    /// This is the inverse of [`Transform::transform`]: each `f64` cell is read
453    /// as an ordinal index into the per-column `categories_` learned at `fit`
454    /// time, and the corresponding category string is returned. Reusing the
455    /// SHIPPED `categories_` (REQ-1), `inverse_transform(transform(X)) == X` for
456    /// any `X` whose every category was seen during `fit` and none was grouped
457    /// as infrequent (those invert to `"infrequent_sklearn"`). Mirrors scikit-learn's
458    /// `OrdinalEncoder.inverse_transform` (`sklearn/preprocessing/_encoders.py:1595`),
459    /// `X_tr[:, i] = self.categories_[i][labels]`.
460    ///
461    /// # Index contract (faithful to sklearn / numpy)
462    ///
463    /// Mirrors sklearn's `labels.astype("int64")` (`_encoders.py:1664`) followed
464    /// by numpy fancy indexing `categories_[j][labels]` (`:1679`):
465    /// - **truncates non-integers toward zero** (`1.5` → index `1` → that
466    ///   category; `0.7` → `0`) — Rust `f64 as i64` matches the C-style cast.
467    /// - **wraps small negatives** via numpy negative indexing (`-1.0` →
468    ///   `categories_[j][len-1]`, the LAST category; `-2.0` → `len-2`), raising
469    ///   only once the wrapped index still leaves `[0, len)` (`-3.0` with 2
470    ///   categories → `IndexError`).
471    /// - **errors** on an out-of-range positive ordinal (`9.0` with 2 categories
472    ///   → sklearn `IndexError`) and on a non-finite cell (NaN/±inf overflow the
473    ///   `astype("int64")` cast → sklearn `IndexError`/`ValueError`; guarded
474    ///   explicitly because Rust's `f64 as i64` maps NaN to `0`, which would
475    ///   silently return the first category).
476    ///
477    /// The roundtrip, held-out valid-ordinal, truncation, and negative-wrap paths
478    /// all match sklearn; out-of-range / non-finite both error.
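    ///
    /// A minimal standalone sketch of that index rule, not the crate's API: the
    /// helper name `numpy_style_index` is hypothetical and exists only for
    /// illustration. It assumes the steps described above (reject non-finite,
    /// truncate toward zero, wrap a negative index once, bounds-check).
    ///
    /// ```rust
    /// // Hypothetical helper: resolve a float cell to a category index.
    /// fn numpy_style_index(v: f64, len: usize) -> Option<usize> {
    ///     // NaN and ±inf are rejected up front; `f64 as i64` would otherwise
    ///     // turn NaN into 0.
    ///     if !v.is_finite() {
    ///         return None;
    ///     }
    ///     let len = len as i64;
    ///     // Truncate toward zero, like `astype("int64")` on a finite value.
    ///     let mut idx = v as i64;
    ///     // A negative index wraps once by `+ len`, like numpy indexing.
    ///     if idx < 0 {
    ///         idx += len;
    ///     }
    ///     // Anything still outside `[0, len)` is an error.
    ///     (0..len).contains(&idx).then_some(idx as usize)
    /// }
    ///
    /// assert_eq!(numpy_style_index(1.5, 2), Some(1)); // truncation
    /// assert_eq!(numpy_style_index(-1.0, 2), Some(1)); // wraps to the last category
    /// assert_eq!(numpy_style_index(-3.0, 2), None); // still out of range after wrapping
    /// assert_eq!(numpy_style_index(9.0, 2), None); // out-of-range positive ordinal
    /// assert_eq!(numpy_style_index(f64::NAN, 2), None); // non-finite
    /// ```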
479    ///
480    /// # `use_encoded_value` → `None` (SCOPE LIMITATION, R-HONEST-3)
481    ///
482    /// With [`HandleUnknown::UseEncodedValue`], sklearn maps a cell equal to
483    /// `unknown_value` back to `None` (`_encoders.py:1673`,
484    /// `X_tr[mask, idx] = None`). ferrolearn's `Array2<String>` output container
485    /// **cannot represent `None`** (it would require `Array2<Option<String>>`).
486    /// A sentinel such as `-1` would otherwise WRAP to the last category under
487    /// the negative-index rule above, so a cell equal to `unknown_value` is
488    /// checked first and rejected with [`FerroError::InvalidParameter`]: this
489    /// inverse ERRORS where sklearn returns `[[None, ...]]`. This is a documented
490    /// divergence, not a silent wrong string; the honest behavior is to error
491    /// rather than fabricate a category. The default `Error`-mode encoder
492    /// produces only valid ordinals, so its inverse is COMPLETE and bit-exact.
493    ///
494    /// # Errors
495    ///
496    /// Returns [`FerroError::InsufficientSamples`] if the input has zero rows
497    /// (symmetry with `transform`'s #2220 guard and sklearn's `check_array`).
498    ///
499    /// Returns [`FerroError::ShapeMismatch`] if the number of columns does not
500    /// match the number of features seen during fitting (sklearn's
501    /// `_encoders.py:1619` "Shape of the passed X data is not correct").
502    ///
503    /// Returns [`FerroError::InvalidParameter`] if a cell is non-finite, equals
504    /// `unknown_value` under [`HandleUnknown::UseEncodedValue`], or still falls
505    /// outside `[0, len)` after truncation and negative wrap (sklearn `IndexError`).
506    pub fn inverse_transform(&self, x: &Array2<f64>) -> Result<Array2<String>, FerroError> {
507        let n_features = self.categories.len();
508        // Symmetric with `transform`'s 0-row guard (#2220) and sklearn's
509        // `check_array` minimum-of-1-sample (`_encoders.py:1610`): a 0-row input
510        // raises "Found array with 0 sample(s) ... a minimum of 1 is required".
511        if x.nrows() == 0 {
512            return Err(FerroError::InsufficientSamples {
513                required: 1,
514                actual: 0,
515                context: "FittedOrdinalEncoder::inverse_transform".into(),
516            });
517        }
518        // sklearn validates the column count (`_encoders.py:1619`) -> ValueError.
519        if x.ncols() != n_features {
520            return Err(FerroError::ShapeMismatch {
521                expected: vec![x.nrows(), n_features],
522                actual: vec![x.nrows(), x.ncols()],
523                context: "FittedOrdinalEncoder::inverse_transform".into(),
524            });
525        }
526
527        let n_samples = x.nrows();
528        // `Array2::default` fills with the empty String; every cell is overwritten
529        // on the Ok path, so the default is never observed by the caller.
530        let mut out = Array2::<String>::default((n_samples, n_features));
531
532        for j in 0..n_features {
533            let cats = &self.categories[j];
534            // Infrequent grouping (REQ-8). When feature `j` has infrequent
535            // categories, the valid ordinal codes are `0..=n_frequent[j]`: codes
536            // `0..n_frequent` index the FREQUENT-only category list (the original
537            // `categories[j]` with the infrequent entries removed, order
538            // preserved — sklearn `frequent_categories_mask`,
539            // `_encoders.py:1648-1652`), and the shared trailing code
540            // `n_frequent` inverts to the literal String `"infrequent_sklearn"`
541            // (`_encoders.py:1675-1677` `X_tr[mask, idx] = "infrequent_sklearn"`).
542            // UNLIKE `OneHotEncoder`'s NaN proxy, this is a REAL representable
543            // String. With grouping disabled `infrequent_indices_[j]` is empty,
544            // so this branch is skipped and the SHIPPED REQ-9 path runs unchanged.
545            let frequent_only: Option<Vec<String>> = if self
546                .infrequent_indices_
547                .get(j)
548                .is_some_and(|v| !v.is_empty())
549            {
550                let map = &self.infrequent_map[j];
551                let nf = self.n_frequent[j];
552                // Slot `s` (in `0..nf`) → the `categories[j]` element whose
553                // remapped code is `s` (frequent categories keep their order).
554                let mut fo: Vec<String> = Vec::with_capacity(nf);
555                for s in 0..nf {
556                    if let Some(orig) = map.iter().position(|&c| c == s)
557                        && let Some(cat) = cats.get(orig)
558                    {
559                        fo.push(cat.clone());
560                    }
561                }
562                Some(fo)
563            } else {
564                None
565            };
566            // The category list the numpy index logic indexes into: the
567            // frequent-only list when grouping is active for this feature, else
568            // the full `categories[j]` (SHIPPED REQ-9, UNCHANGED).
569            let index_cats: &[String] = frequent_only.as_deref().unwrap_or(cats);
570            let len = index_cats.len() as i64;
571            for i in 0..n_samples {
572                let v = x[[i, j]];
573                // `use_encoded_value`: sklearn maps a cell equal to
574                // `unknown_value` back to `None` (`_encoders.py:1673`) BEFORE the
575                // int cast / indexing. `Array2<String>` cannot hold `None`, so
576                // this cell errors (documented scope limitation, R-HONEST-3) —
577                // checked first so the configured sentinel (e.g. `-1`) is NOT
578                // silently wrapped to a real category by the numpy index logic.
579                if self.handle_unknown == HandleUnknown::UseEncodedValue
580                    && let Some(uv) = self.unknown_value
581                    && (v == uv || (v.is_nan() && uv.is_nan()))
582                {
583                    return Err(FerroError::InvalidParameter {
584                        name: "X".into(),
585                        reason: format!(
586                            "value {v} at row {i}, feature {j} equals unknown_value; \
587                             sklearn inverts it to None, which Array2<String> cannot \
588                             represent (would need Array2<Option<String>>)"
589                        ),
590                    });
591                }
592                // Infrequent: a cell EXACTLY equal to the shared trailing code
593                // `n_frequent` (a float equality, computed on the RAW label
594                // BEFORE the int cast — sklearn `labels == infrequent_encoding_value`,
595                // `_encoders.py:1644`) inverts to `"infrequent_sklearn"`. A cell
596                // that merely truncates to `n_frequent` (e.g. `2.5`) does NOT —
597                // it falls through to the frequent-only index logic and errors out
598                // of range, matching the live oracle.
599                if frequent_only.is_some() && v == self.n_frequent[j] as f64 {
600                    out[[i, j]] = "infrequent_sklearn".to_string();
601                    continue;
602                }
603                // sklearn does `labels.astype('int64')` then `categories_[j][idx]`
604                // (`_encoders.py:1664`,`:1679`). A non-finite cell overflows the
605                // cast (NaN/+-inf -> IndexError/ValueError); reject it (R-CODE-2:
606                // Rust's `f64 as i64` would map NaN to 0, diverging from numpy,
607                // so guard explicitly).
608                if !v.is_finite() {
609                    return Err(FerroError::InvalidParameter {
610                        name: "X".into(),
611                        reason: format!(
612                            "value {v} at row {i}, feature {j} is not a finite ordinal \
613                             index (sklearn raises on NaN/inf astype('int64'))"
614                        ),
615                    });
616                }
617                // `astype('int64')` truncates toward zero (Rust `as i64` matches
618                // for finite values); numpy indexing then WRAPS a negative index
619                // by `+= len` (`-1` -> last category), raising only once the
620                // wrapped index still leaves `[0, len)`.
621                let mut idx = v as i64;
622                if idx < 0 {
623                    idx += len;
624                }
625                if idx < 0 || idx >= len {
626                    return Err(FerroError::InvalidParameter {
627                        name: "X".into(),
628                        reason: format!(
629                            "ordinal index {} at row {i} is out of bounds for the {len} \
630                             categories of feature {j} (sklearn IndexError)",
631                            v as i64
632                        ),
633                    });
634                }
635                // `idx` is now provably in `[0, len)` (checked above) — no panic.
636                // `index_cats` is the frequent-only list under infrequent
637                // grouping (so a frequent code maps to its frequent category),
638                // else the full `categories[j]` (SHIPPED REQ-9, UNCHANGED).
639                out[[i, j]] = index_cats[idx as usize].clone();
640            }
641        }
642
643        Ok(out)
644    }
645}
646
647// ---------------------------------------------------------------------------
648// Helpers
649// ---------------------------------------------------------------------------
650
651/// Cast an ordinal category index to `f64`, matching scikit-learn's default
652/// `OrdinalEncoder(dtype=np.float64)` output container
653/// (`sklearn/preprocessing/_encoders.py:1262`).
654///
655/// `f64` exactly represents every integer up to `2^53`, so this is lossless for
656/// any realistic category count. Indices above `2^53` (astronomically more
657/// categories than memory could hold) round to the nearest `f64`, never panic
658/// (R-CODE-2) — the same silent float rounding numpy performs.
659#[inline]
660fn ordinal_index_to_f64(idx: usize) -> f64 {
661    idx as f64
662}
663
664/// Identify the indices of infrequent categories for one feature, given the
665/// per-category training `counts` (aligned with `categories[j]`) and the
666/// `min_frequency`/`max_categories` thresholds.
667///
668/// Mirrors scikit-learn's `_BaseEncoder._identify_infrequent`
669/// (`_encoders.py:275-318`). This is the SAME algorithm the SHIPPED
670/// `OneHotEncoder` REQ-5b uses (`one_hot_encoder.rs::identify_infrequent`):
671/// 1. min_frequency: a category with `count < min_frequency` is infrequent
672///    (`:295-296`, integer form only).
673/// 2. max_categories: if (after step 1) the feature would still produce more
674///    than `max_categories` ordinal codes — counted as `n_remaining_frequent +
675///    1` for the infrequent group (`:303`) — the least-frequent categories are
676///    additionally marked infrequent until only `max_categories - 1` frequent
677///    categories remain (`:304-315`). Ties broken by a STABLE sort over the
678///    FULL count array, so among equal counts the SMALLER category index is
679///    marked infrequent first (sklearn `np.argsort(kind="mergesort")[:-k]`),
680///    i.e. the LARGER index is favoured to stay frequent. `max_categories == 1`
681///    (frequent_category_count 0) makes every category infrequent (`:307-309`).
682///
683/// Returns the sorted-ascending infrequent indices (empty if none — sklearn's
684/// `None`). Never panics (R-CODE-2).
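///
/// A standalone sketch of the step-2 tie-break, assuming only the stable
/// ascending sort described above (`least_frequent` is a hypothetical helper,
/// not part of this crate):
///
/// ```rust
/// // Hypothetical helper: sorted indices of the `cut` least-frequent categories.
/// fn least_frequent(counts: &[usize], cut: usize) -> Vec<usize> {
///     let mut order: Vec<usize> = (0..counts.len()).collect();
///     // `sort_by_key` is stable, so equal counts keep ascending index order.
///     order.sort_by_key(|&i| counts[i]);
///     let mut picked = order[..cut.min(order.len())].to_vec();
///     picked.sort_unstable();
///     picked
/// }
///
/// // Three categories tied at count 2 with one frequent slot left: two are
/// // cut, and the largest index is the one that stays frequent.
/// assert_eq!(least_frequent(&[2, 2, 2], 2), vec![0, 1]);
/// // Without ties the smallest counts are cut regardless of position.
/// assert_eq!(least_frequent(&[5, 1, 3, 1], 2), vec![1, 3]);
/// ```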
685fn identify_infrequent(
686    counts: &[usize],
687    min_frequency: Option<usize>,
688    max_categories: Option<usize>,
689) -> Vec<usize> {
690    let n = counts.len();
691    let mut infrequent_mask = vec![false; n];
692
693    // Step 1: min_frequency (integer count). `count < min_frequency`.
694    if let Some(min_freq) = min_frequency {
695        for (idx, &c) in counts.iter().enumerate() {
696            if c < min_freq {
697                infrequent_mask[idx] = true;
698            }
699        }
700    }
701
702    // Step 2: max_categories on the survivors. `n_current_features` counts the
703    // remaining frequent categories PLUS 1 for the infrequent group
704    // (`_encoders.py:303`).
705    if let Some(max_cat) = max_categories {
706        let n_infreq = infrequent_mask.iter().filter(|&&m| m).count();
707        let n_current_features = n - n_infreq + 1;
708        if max_cat < n_current_features {
709            // `max_categories` includes the one infrequent category.
710            let frequent_category_count = max_cat.saturating_sub(1);
711            if frequent_category_count == 0 {
712                // All categories are infrequent (`:307-309`).
713                infrequent_mask.iter_mut().for_each(|m| *m = true);
714            } else {
715                // Stable argsort over the FULL count array (ascending by count,
716                // ties by ascending index), then mark the smallest
717                // `n - frequent_category_count` levels infrequent — i.e. keep the
718                // top `frequent_category_count` by count, with ties resolved in
719                // favor of the LARGER index (`np.argsort(kind="mergesort")[:-k]`,
720                // `:312-315`).
721                let mut order: Vec<usize> = (0..n).collect();
722                order.sort_by(|&a, &b| counts[a].cmp(&counts[b]).then(a.cmp(&b)));
723                let keep = frequent_category_count.min(n);
724                let cut = n - keep;
725                for &idx in &order[..cut] {
726                    infrequent_mask[idx] = true;
727                }
728            }
729        }
730    }
731
732    infrequent_mask
733        .iter()
734        .enumerate()
735        .filter_map(|(idx, &m)| m.then_some(idx))
736        .collect()
737}
738
739/// Build the per-feature mapping from a `categories[j]` index to its emitted
740/// ORDINAL code.
741///
742/// Mirrors scikit-learn's `_default_to_infrequent_mappings[j]`
743/// (`_encoders.py:373-400`): frequent categories take codes `0..n_frequent` in
744/// their original (ascending-index) order; every infrequent category maps to
745/// the single trailing code `n_frequent`. With no infrequent categories the
746/// mapping is the identity `0..n`. `infrequent` must be sorted ascending. Never
747/// panics (R-CODE-2): every index is bounds-checked.
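///
/// A standalone sketch of the resulting mapping (the helper `remap` is
/// hypothetical and only mirrors the rule stated above):
///
/// ```rust
/// // Hypothetical helper: frequent indices get 0, 1, 2, ... in order and
/// // every infrequent index shares the trailing code `n_frequent`.
/// fn remap(n: usize, infrequent: &[usize]) -> Vec<usize> {
///     let n_frequent = n.saturating_sub(infrequent.len());
///     let mut next = 0;
///     (0..n)
///         .map(|idx| {
///             if infrequent.contains(&idx) {
///                 n_frequent
///             } else {
///                 next += 1;
///                 next - 1
///             }
///         })
///         .collect()
/// }
///
/// // Four categories, indices 1 and 3 infrequent: the two frequent ones get
/// // codes 0 and 1, and both infrequent ones share code 2.
/// assert_eq!(remap(4, &[1, 3]), vec![0, 2, 1, 2]);
/// // No infrequent categories: the identity mapping.
/// assert_eq!(remap(3, &[]), vec![0, 1, 2]);
/// ```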
748fn build_infrequent_map(n: usize, infrequent: &[usize]) -> Vec<usize> {
749    if infrequent.is_empty() {
750        return (0..n).collect();
751    }
752    let n_frequent = n.saturating_sub(infrequent.len());
753    let mut map = vec![n_frequent; n];
754    let mut next_frequent = 0usize;
755    for (idx, slot) in map.iter_mut().enumerate() {
756        if infrequent.binary_search(&idx).is_ok() {
757            // Infrequent → the trailing code (already set to `n_frequent`).
758        } else {
759            *slot = next_frequent;
760            next_frequent += 1;
761        }
762    }
763    map
764}
765
766// ---------------------------------------------------------------------------
767// Trait implementations
768// ---------------------------------------------------------------------------
769
770impl Fit<Array2<String>, ()> for OrdinalEncoder {
771    type Fitted = FittedOrdinalEncoder;
772    type Error = FerroError;
773
774    /// Fit the encoder by building per-column category-to-index mappings.
775    ///
776    /// With the default `categories='auto'` ([`Categories::Auto`]), categories
777    /// are recorded in **lexicographic order** in each column, matching
778    /// scikit-learn's `OrdinalEncoder.categories_`.
779    ///
780    /// With explicit categories ([`Categories::Explicit`], set via
781    /// [`OrdinalEncoder::with_categories`]), the user-provided lists are used in
782    /// the **given order** (NOT re-sorted), and the ordinal indices follow that
783    /// order, mirroring scikit-learn (`sklearn/preprocessing/_encoders.py:114`).
784    ///
785    /// # Errors
786    ///
787    /// Returns [`FerroError::InsufficientSamples`] if the input has zero rows.
788    ///
789    /// Returns [`FerroError::ShapeMismatch`] if explicit categories are set but
790    /// the number of category lists differs from the number of input columns
791    /// (sklearn `_encoders.py:85-89` "Shape mismatch: if categories is an array,
792    /// it has to be of shape (n_features,).").
793    ///
794    /// Returns [`FerroError::InvalidParameter`] if an explicit category list
795    /// contains duplicate elements (sklearn `_encoders.py:136-141`), or — under
796    /// the default [`HandleUnknown::Error`] — if a value seen in the data is not
797    /// in its column's explicit list (sklearn `_encoders.py:153-160` "Found
798    /// unknown categories ... during fit"; SKIPPED under
799    /// [`HandleUnknown::UseEncodedValue`]).
800    ///
801    /// Returns [`FerroError::InvalidParameter`] for the `handle_unknown` /
802    /// `unknown_value` validation failures (mirroring scikit-learn's
803    /// `TypeError`/`ValueError` at `_encoders.py:1473-1526`): selecting
804    /// [`HandleUnknown::UseEncodedValue`] without an `unknown_value`; setting an
805    /// `unknown_value` while in [`HandleUnknown::Error`] mode; or an
806    /// `unknown_value` that collides with an already-used encoding index.
807    fn fit(&self, x: &Array2<String>, _y: &()) -> Result<FittedOrdinalEncoder, FerroError> {
808        // sklearn `_parameter_constraints` (`@_fit_context`, validated BEFORE the
809        // data): `min_frequency` and `max_categories` are each
810        // `Interval(Integral, 1, None)` — a value of 0 raises
811        // `InvalidParameterError` ("must be an int in the range [1, inf)").
812        // REQ-8, verified live: `OrdinalEncoder(min_frequency=0).fit` →
813        // InvalidParameterError. (handle_unknown is a type-safe Rust enum, so its
814        // StrOptions constraint is provided by the type system.)
815        if self.min_frequency == Some(0) {
816            return Err(FerroError::InvalidParameter {
817                name: "min_frequency".into(),
818                reason: "must be an int in the range [1, inf)".into(),
819            });
820        }
821        if self.max_categories == Some(0) {
822            return Err(FerroError::InvalidParameter {
823                name: "max_categories".into(),
824                reason: "must be an int in the range [1, inf)".into(),
825            });
826        }
827
828        let n_samples = x.nrows();
829        if n_samples == 0 {
830            return Err(FerroError::InsufficientSamples {
831                required: 1,
832                actual: 0,
833                context: "OrdinalEncoder::fit".into(),
834            });
835        }
836
837        // Validation (a)/(b) of the `handle_unknown`/`unknown_value` pairing.
838        // These checks do not depend on the data but, matching sklearn, run in
839        // `fit` AFTER the 0-row `check_array` guard above; the collision check
840        // runs later, AFTER `categories_` is computed below. (a)/(b) map
841        // sklearn's `TypeError` (`_encoders.py:1481-1493`).
842        match (self.handle_unknown, self.unknown_value) {
843            // (a) use_encoded_value REQUIRES an unknown_value (an int or nan).
844            // sklearn: `not isinstance(unknown_value, Integral)` -> TypeError
845            // (`:1481`); `unknown_value is None` falls into that branch.
846            (HandleUnknown::UseEncodedValue, None) => {
847                return Err(FerroError::InvalidParameter {
848                    name: "unknown_value".into(),
849                    reason: "unknown_value should be set (an integer or NaN) when \
850                             handle_unknown is 'use_encoded_value'"
851                        .into(),
852                });
853            }
854            // (b) error-mode forbids a set unknown_value. sklearn: `:1488`
855            // `elif self.unknown_value is not None` -> TypeError.
856            (HandleUnknown::Error, Some(v)) => {
857                return Err(FerroError::InvalidParameter {
858                    name: "unknown_value".into(),
859                    reason: format!(
860                        "unknown_value should only be set when handle_unknown is \
861                         'use_encoded_value', got {v}"
862                    ),
863                });
864            }
865            _ => {}
866        }
867
868        let n_features = x.ncols();
869        let mut categories = Vec::with_capacity(n_features);
870        let mut category_to_index = Vec::with_capacity(n_features);
871
872        match &self.categories {
873            // `categories='auto'` (default): per column, sorted-unique from the
874            // data (SHIPPED REQ-1, UNCHANGED). sklearn `_encoders.py:98-99`
875            // `result = _unique(Xi)`.
876            Categories::Auto => {
877                for j in 0..n_features {
878                    // Collect unique categories then sort lexicographically so the
879                    // assigned indices match sklearn's `OrdinalEncoder`, which
880                    // documents `categories_ = sorted(unique(X[:, j]))`. (Older
881                    // ferrolearn versions used first-seen order — #344.)
882                    let mut unique: Vec<String> = Vec::new();
883                    let mut seen_set: std::collections::HashSet<String> =
884                        std::collections::HashSet::new();
885                    for i in 0..n_samples {
886                        let cat = &x[[i, j]];
887                        if seen_set.insert(cat.clone()) {
888                            unique.push(cat.clone());
889                        }
890                    }
891                    unique.sort();
892
893                    let map: HashMap<String, usize> = unique
894                        .iter()
895                        .enumerate()
896                        .map(|(idx, s)| (s.clone(), idx))
897                        .collect();
898
899                    categories.push(unique);
900                    category_to_index.push(map);
901                }
902            }
903            // `categories=[list, ...]` (explicit): use the user-provided lists in
904            // the GIVEN order (NOT re-sorted), mirroring sklearn `_encoders.py:84-160`.
905            Categories::Explicit(lists) => {
906                // sklearn (`_encoders.py:85-89`): the list count must match
907                // n_features, else ValueError -> map to `ShapeMismatch`.
908                if lists.len() != n_features {
909                    return Err(FerroError::ShapeMismatch {
910                        expected: vec![n_features],
911                        actual: vec![lists.len()],
912                        context: "Shape mismatch: if categories is an array, it has to be of \
913                                  shape (n_features,)."
914                            .into(),
915                    });
916                }
917
918                for (j, list) in lists.iter().enumerate() {
919                    // sklearn (`_encoders.py:114-117`) indexes `cats[0]` on the
920                    // provided list BEFORE the duplicate/subset checks, so an
921                    // EMPTY explicit list raises `IndexError` at fit in BOTH
922                    // handle_unknown modes (#2229). Reject it here (the
923                    // use_encoded_value path would otherwise skip the subset
924                    // check and silently fit an empty category set).
                    if list.is_empty() {
                        return Err(FerroError::InvalidParameter {
                            name: "categories".into(),
                            reason: format!(
                                "column {j} has an empty predefined category list; \
                                 each feature needs at least one category"
                            ),
                        });
                    }
                    // sklearn (`_encoders.py:136-141`): a list with duplicate
                    // elements raises ValueError. Build the index map detecting
                    // duplicates in one pass (R-CODE-2: never panic).
                    let mut map: HashMap<String, usize> = HashMap::with_capacity(list.len());
                    for (idx, cat) in list.iter().enumerate() {
                        if map.insert(cat.clone(), idx).is_some() {
                            return Err(FerroError::InvalidParameter {
                                name: "categories".into(),
                                reason: format!(
                                    "In column {j}, the predefined categories contain \
                                     duplicate elements."
                                ),
                            });
                        }
                    }

                    // sklearn (`_encoders.py:153-160`): under handle_unknown='error'
                    // every value seen in the data must be present in the
                    // predefined list, else ValueError. Under 'use_encoded_value'
                    // this fit-time subset check is SKIPPED (out-of-set data is
                    // fine: it is encoded to `unknown_value` at transform time).
                    if self.handle_unknown == HandleUnknown::Error {
                        for i in 0..n_samples {
                            let cat = &x[[i, j]];
                            if !map.contains_key(cat) {
                                return Err(FerroError::InvalidParameter {
                                    name: "X".into(),
                                    reason: format!(
                                        "Found unknown categories [{cat}] in column {j} \
                                         during fit"
                                    ),
                                });
                            }
                        }
                    }

                    // Use the list AS-GIVEN (preserve order; do NOT sort).
                    categories.push(list.clone());
                    category_to_index.push(map);
                }
            }
        }

        // Infrequent grouping (REQ-8). When `min_frequency`/`max_categories` are
        // set, fold the least-frequent categories of each feature into a single
        // shared trailing ORDINAL code (the frequent categories keep codes
        // `0..n_frequent` in their original sorted order). `categories` is NOT
        // changed (all categories retained, sklearn keeps `categories_` whole and
        // only remaps the emitted index, `_encoders.py:1289-1370`); only the
        // per-feature `infrequent_map` / `infrequent_indices_` / `n_frequent` are
        // built. With grouping disabled the map is the identity and every feature
        // has no infrequent categories.
        let mut infrequent_indices_: Vec<Vec<usize>> = Vec::with_capacity(n_features);
        let mut infrequent_map: Vec<Vec<usize>> = Vec::with_capacity(n_features);
        let mut n_frequent: Vec<usize> = Vec::with_capacity(n_features);
        if self.infrequent_enabled() {
            for (j, cats) in categories.iter().enumerate() {
                // Per-category training counts ALIGNED with `categories[j]`
                // (sklearn `_unique(Xi, return_counts=True)`,
                // `_encoders.py:99-102`). Built from the fit data through the
                // category→index map, so it works for BOTH the Auto and Explicit
                // category sets. (A datum not in an explicit list contributes no
                // count: under `handle_unknown='error'` the subset check above
                // already rejected it; under `use_encoded_value` it is an unknown
                // that does not affect category frequencies.)
                let map = &category_to_index[j];
                let mut counts = vec![0usize; cats.len()];
                for i in 0..n_samples {
                    if let Some(&idx) = map.get(&x[[i, j]]) {
                        counts[idx] += 1;
                    }
                }
                let infreq = identify_infrequent(&counts, self.min_frequency, self.max_categories);
                let imap = build_infrequent_map(cats.len(), &infreq);
                n_frequent.push(cats.len() - infreq.len());
                infrequent_indices_.push(infreq);
                infrequent_map.push(imap);
            }
        } else {
            for cats in &categories {
                infrequent_indices_.push(Vec::new());
                infrequent_map.push((0..cats.len()).collect());
                n_frequent.push(cats.len());
            }
        }

        // Validation (a'): sklearn (`_encoders.py:1481-1487`) requires
        // `unknown_value` to be an INTEGER or `np.nan` when
        // `handle_unknown='use_encoded_value'`; a non-integer float raises
        // `TypeError` BEFORE the range/collision check (#2221). `f64` cannot
        // express "integral", so a non-nan value with a fractional part is
        // rejected here.
        if self.handle_unknown == HandleUnknown::UseEncodedValue
            && let Some(v) = self.unknown_value
            && !v.is_nan()
            && v.fract() != 0.0
        {
            return Err(FerroError::InvalidParameter {
                name: "unknown_value".into(),
                reason: format!(
                    "unknown_value should be an integer or np.nan when \
                     handle_unknown is 'use_encoded_value', got {v}"
                ),
            });
        }

        // Validation (c): collision of a non-nan integer unknown_value with an
        // already-used encoding index. sklearn (`_encoders.py:1518-1526`) loops
        // over each column's cardinality and raises `ValueError` if
        // `0 <= unknown_value < cardinality`; that is equivalent to comparing
        // against the maximum cardinality. Validation (a') above (sklearn
        // `:1481`) already rejected non-integer values, so the values that pass
        // this check are nan (skipped here), negative integers, and integers
        // `>= max_cardinality`.
        if self.handle_unknown == HandleUnknown::UseEncodedValue
            && let Some(v) = self.unknown_value
            && !v.is_nan()
            && v.fract() == 0.0
        {
            // sklearn's collision check keys off the EFFECTIVE number of distinct
            // output codes per feature: with infrequent grouping a feature emits
            // `n_frequent + 1` codes (the shared infrequent index), so its
            // cardinality for the unknown_value collision is `n_frequent + 1`, NOT
            // `len(categories_)` (verified live: `min_frequency=2` over 4 cats →
            // 3 codes → `unknown_value=3` is accepted, `=2` collides). With
            // grouping disabled `n_frequent[j] == categories[j].len()` and there
            // is no infrequent code, so this reduces to the SHIPPED REQ-5 check.
            let max_cardinality = (0..n_features)
                .map(|j| n_frequent[j] + usize::from(!infrequent_indices_[j].is_empty()))
                .max()
                .unwrap_or(0);
            // `0 <= v < max_cardinality` with v an integer-valued f64.
            if v >= 0.0 && v < max_cardinality as f64 {
                return Err(FerroError::InvalidParameter {
                    name: "unknown_value".into(),
                    reason: format!(
                        "The used value for unknown_value {v} is one of the \
                         values already used for encoding the seen categories"
                    ),
                });
            }
        }

        Ok(FittedOrdinalEncoder {
            categories,
            category_to_index,
            handle_unknown: self.handle_unknown,
            unknown_value: self.unknown_value,
            infrequent_indices_,
            infrequent_map,
            n_frequent,
        })
    }
}

impl Transform<Array2<String>> for FittedOrdinalEncoder {
    type Output = Array2<f64>;
    type Error = FerroError;

    /// Transform string categories to ordinal codes, returned as `f64`.
    ///
    /// Each cell is the category's ordinal code cast to `f64`: its index in
    /// `categories()[j]` (lexicographic unless a predefined category list was
    /// supplied), routed through the infrequent-category map when grouping is
    /// enabled. The ordinal VALUES are unchanged from the integer mapping; only
    /// the output container dtype is `f64`, matching scikit-learn's
    /// `OrdinalEncoder(dtype=np.float64)` default
    /// (`sklearn/preprocessing/_encoders.py:1262`). A configurable non-float64
    /// output dtype (e.g. `int32`) is OUT OF SCOPE here: ferrolearn's output is
    /// the fixed sklearn DEFAULT `f64`; a `dtype` param is a follow-on design
    /// (blocker #1158). `f64` exactly represents every integer up to `2^53`, so
    /// the cast is lossless for any realistic category count.
    ///
    /// # Errors
    ///
    /// Returns [`FerroError::InsufficientSamples`] if `x` has zero rows.
    ///
    /// Returns [`FerroError::ShapeMismatch`] if the number of columns does not
    /// match the number of features seen during fitting.
    ///
    /// Returns [`FerroError::InvalidParameter`] if any category was not seen
    /// during fitting AND `handle_unknown` is [`HandleUnknown::Error`] (the
    /// default). Under [`HandleUnknown::UseEncodedValue`], unknown categories
    /// are instead encoded as the configured `unknown_value` sentinel (which may
    /// be `f64::NAN`), matching sklearn `_encoders.py:1591`.
    fn transform(&self, x: &Array2<String>) -> Result<Array2<f64>, FerroError> {
        let n_features = self.categories.len();
        // sklearn `OrdinalEncoder.transform` -> `_transform` -> `_check_X` ->
        // `check_array` (`_encoders.py:45`) enforces a minimum of 1 sample BEFORE
        // the n_features comparison (#2220, symmetric with the 0-row fit guard).
        // A 0-row input raises "Found array with 0 sample(s) ... minimum of 1".
        if x.nrows() == 0 {
            return Err(FerroError::InsufficientSamples {
                required: 1,
                actual: 0,
                context: "FittedOrdinalEncoder::transform".into(),
            });
        }
        if x.ncols() != n_features {
            return Err(FerroError::ShapeMismatch {
                expected: vec![x.nrows(), n_features],
                actual: vec![x.nrows(), x.ncols()],
                context: "FittedOrdinalEncoder::transform".into(),
            });
        }

        let n_samples = x.nrows();
        let mut out = Array2::zeros((n_samples, n_features));

        for j in 0..n_features {
            let map = &self.category_to_index[j];
            // Per-feature infrequent remapping (REQ-8): a found category's
            // `categories[j]` index is routed through `infrequent_map[j]` to its
            // emitted ordinal code (a frequent category → its remapped slot
            // `0..n_frequent`, an infrequent category → the shared trailing code
            // `n_frequent`), mirroring sklearn `_map_infrequent_categories`
            // (`_encoders.py:402-452`: `X_int = np.take(mapping, X_int)`). With
            // grouping DISABLED `infrequent_map[j]` is the identity, so the code
            // equals `idx`; the SHIPPED REQ-2 behaviour is UNCHANGED.
            let imap = self.infrequent_map.get(j);
            for i in 0..n_samples {
                let cat = &x[[i, j]];
                match map.get(cat) {
                    // Route the category index through the infrequent map, then
                    // cast the resulting ordinal code to f64 (sklearn's float64
                    // default, `_encoders.py:1262`). Lossless: codes are < 2^53.
                    // Bounds-safe: `imap.get(idx)` falls back to the raw `idx`
                    // (R-CODE-2); `imap` always has `categories[j].len()` entries.
                    Some(&idx) => {
                        let code = imap.and_then(|m| m.get(idx)).copied().unwrap_or(idx);
                        out[[i, j]] = ordinal_index_to_f64(code);
                    }
                    None => match self.handle_unknown {
                        // handle_unknown='use_encoded_value': write the sentinel
                        // (which may be NaN). sklearn `_encoders.py:1591`
                        // `X_trans[~X_mask] = self.unknown_value`. `fit`
                        // guaranteed `unknown_value` is `Some` in this mode, but
                        // we never panic (R-CODE-2): fall back to the Error path
                        // if it were somehow `None`.
                        HandleUnknown::UseEncodedValue => match self.unknown_value {
                            Some(v) => out[[i, j]] = v,
                            None => {
                                return Err(FerroError::InvalidParameter {
                                    name: format!("x[{i},{j}]"),
                                    reason: format!(
                                        "unknown category \"{cat}\" in column {j} and \
                                         no unknown_value configured"
                                    ),
                                });
                            }
                        },
                        // handle_unknown='error' (default): reject (SHIPPED
                        // REQ-2, UNCHANGED). sklearn raises ValueError
                        // "Found unknown categories ... during transform".
                        HandleUnknown::Error => {
                            return Err(FerroError::InvalidParameter {
                                name: format!("x[{i},{j}]"),
                                reason: format!("unknown category \"{cat}\" in column {j}"),
                            });
                        }
                    },
                }
            }
        }

        Ok(out)
    }
}

/// Implement `Transform` on the unfitted encoder to satisfy the
/// `FitTransform: Transform` supertrait bound.
impl Transform<Array2<String>> for OrdinalEncoder {
    type Output = Array2<f64>;
    type Error = FerroError;

    /// Always returns an error: the encoder must be fitted first.
    fn transform(&self, _x: &Array2<String>) -> Result<Array2<f64>, FerroError> {
        Err(FerroError::InvalidParameter {
            name: "OrdinalEncoder".into(),
            reason: "encoder must be fitted before calling transform; use fit() first".into(),
        })
    }
}

impl FitTransform<Array2<String>> for OrdinalEncoder {
    type FitError = FerroError;

    /// Fit the encoder on `x` and return the encoded output in one step.
    ///
    /// # Errors
    ///
    /// Returns an error if fitting or transformation fails.
    fn fit_transform(&self, x: &Array2<String>) -> Result<Array2<f64>, FerroError> {
        let fitted = self.fit(x, &())?;
        fitted.transform(x)
    }
}

// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------

#[cfg(test)]
mod tests {
    use super::*;
    use ndarray::Array2;

    fn make_2col(rows: &[(&str, &str)]) -> Array2<String> {
        let flat: Vec<String> = rows
            .iter()
            .flat_map(|(a, b)| [a.to_string(), b.to_string()])
            .collect();
        Array2::from_shape_vec((rows.len(), 2), flat).unwrap()
    }

    #[test]
    fn test_ordinal_encoder_basic() {
        let enc = OrdinalEncoder::new();
        let x = make_2col(&[
            ("cat", "small"),
            ("dog", "large"),
            ("cat", "medium"),
            ("bird", "small"),
        ]);
        let fitted = enc.fit(&x, &()).unwrap();

        // Categories are sorted lexicographically (sklearn convention).
        assert_eq!(fitted.categories()[0], vec!["bird", "cat", "dog"]);
        assert_eq!(fitted.categories()[1], vec!["large", "medium", "small"]);

        let encoded = fitted.transform(&x).unwrap();
        // Output container is `Array2<f64>` (sklearn's `dtype=np.float64`).
        assert_eq!(encoded[[0, 0]], 1.0); // "cat"  -> 1 (lex pos)
        assert_eq!(encoded[[1, 0]], 2.0); // "dog"  -> 2
        assert_eq!(encoded[[2, 0]], 1.0); // "cat"  -> 1
        assert_eq!(encoded[[3, 0]], 0.0); // "bird" -> 0
        assert_eq!(encoded[[0, 1]], 2.0); // "small"  -> 2
        assert_eq!(encoded[[1, 1]], 0.0); // "large"  -> 0
        assert_eq!(encoded[[2, 1]], 1.0); // "medium" -> 1
        assert_eq!(encoded[[3, 1]], 2.0); // "small"  -> 2
    }

    #[test]
    fn test_fit_transform_equivalence() {
        let enc = OrdinalEncoder::new();
        let x = make_2col(&[("a", "x"), ("b", "y"), ("a", "z")]);
        let via_ft = enc.fit_transform(&x).unwrap();
        let fitted = enc.fit(&x, &()).unwrap();
        let via_sep = fitted.transform(&x).unwrap();
        assert_eq!(via_ft, via_sep);
    }

    #[test]
    fn test_unknown_category_error() {
        let enc = OrdinalEncoder::new();
        let x_train = make_2col(&[("cat", "small"), ("dog", "large")]);
        let fitted = enc.fit(&x_train, &()).unwrap();
        let x_test = make_2col(&[("fish", "small")]);
        assert!(fitted.transform(&x_test).is_err());
    }
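
    // Illustrative sketch only: a hypothetical, std-only helper (NOT part of
    // the encoder API) restating the per-cell decision made in
    // `FittedOrdinalEncoder::transform`. A known category is looked up, routed
    // through the infrequent map, and cast to f64; an unknown category yields
    // the configured sentinel, or `None` where the real code returns an error.
    // Assumption: a plain `as f64` cast stands in for `ordinal_index_to_f64`.
    fn sketch_encode_cell(
        map: &std::collections::HashMap<String, usize>,
        imap: &[usize],
        cat: &str,
        unknown_value: Option<f64>,
    ) -> Option<f64> {
        match map.get(cat) {
            // Fall back to the raw index if the infrequent map is too short.
            Some(&idx) => Some(imap.get(idx).copied().unwrap_or(idx) as f64),
            None => unknown_value,
        }
    }

    #[test]
    fn sketch_encode_cell_known_and_unknown() {
        let map: std::collections::HashMap<String, usize> = ["bird", "cat", "dog"]
            .iter()
            .enumerate()
            .map(|(idx, cat)| (cat.to_string(), idx))
            .collect();
        let identity: [usize; 3] = [0, 1, 2];
        assert_eq!(sketch_encode_cell(&map, &identity, "cat", None), Some(1.0));
        assert_eq!(
            sketch_encode_cell(&map, &identity, "fish", Some(-1.0)),
            Some(-1.0)
        );
        assert_eq!(sketch_encode_cell(&map, &identity, "fish", None), None);
    }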

    #[test]
    fn test_shape_mismatch_error() {
        let enc = OrdinalEncoder::new();
        let x_train = make_2col(&[("a", "x")]);
        let fitted = enc.fit(&x_train, &()).unwrap();
        // Single-column input when 2 cols expected
        let x_bad = Array2::from_shape_vec((1, 1), vec!["a".to_string()]).unwrap();
        assert!(fitted.transform(&x_bad).is_err());
    }

    #[test]
    fn test_insufficient_samples_error() {
        let enc = OrdinalEncoder::new();
        let x: Array2<String> = Array2::from_shape_vec((0, 2), vec![]).unwrap();
        assert!(enc.fit(&x, &()).is_err());
    }

    #[test]
    fn test_unfitted_transform_error() {
        let enc = OrdinalEncoder::new();
        let x = make_2col(&[("a", "x")]);
        assert!(enc.transform(&x).is_err());
    }

    #[test]
    fn test_single_column() {
        let enc = OrdinalEncoder::new();
        let flat = vec![
            "red".to_string(),
            "green".to_string(),
            "blue".to_string(),
            "red".to_string(),
        ];
        let x = Array2::from_shape_vec((4, 1), flat).unwrap();
        let fitted = enc.fit(&x, &()).unwrap();
        // Lex order: blue (0), green (1), red (2)
        assert_eq!(fitted.categories()[0], vec!["blue", "green", "red"]);
        let encoded = fitted.transform(&x).unwrap();
        assert_eq!(encoded[[0, 0]], 2.0); // red
        assert_eq!(encoded[[1, 0]], 1.0); // green
        assert_eq!(encoded[[2, 0]], 0.0); // blue
        assert_eq!(encoded[[3, 0]], 2.0); // red
    }

    #[test]
    fn test_n_features() {
        let enc = OrdinalEncoder::new();
        let x = make_2col(&[("a", "x")]);
        let fitted = enc.fit(&x, &()).unwrap();
        assert_eq!(fitted.n_features(), 2);
    }
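
    // Illustrative sketch only: one way the REQ-8 remapping described in `fit`
    // and `transform` could be built (frequent categories keep codes
    // `0..n_frequent` in order, infrequent ones share the trailing code
    // `n_frequent`). This is a hypothetical std-only helper and is NOT
    // claimed to match the crate's `build_infrequent_map`.
    fn sketch_infrequent_map(n_categories: usize, infrequent: &[usize]) -> Vec<usize> {
        // Assumes `infrequent` holds distinct, in-range category indices.
        let n_frequent = n_categories.saturating_sub(infrequent.len());
        let mut next_frequent = 0;
        (0..n_categories)
            .map(|idx| {
                if infrequent.contains(&idx) {
                    // Every infrequent category shares the trailing code.
                    n_frequent
                } else {
                    // Frequent categories take the next free code in order.
                    let code = next_frequent;
                    next_frequent += 1;
                    code
                }
            })
            .collect()
    }

    #[test]
    fn sketch_infrequent_map_shares_trailing_code() {
        // Four categories, indices 1 and 3 infrequent: two frequent codes
        // (0, 1) plus the shared infrequent code 2.
        assert_eq!(sketch_infrequent_map(4, &[1, 3]), vec![0, 2, 1, 2]);
        // No infrequent categories: the map is the identity.
        assert_eq!(sketch_infrequent_map(3, &[]), vec![0, 1, 2]);
    }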

    #[test]
    fn test_lexicographic_order() {
        // Categories are sorted lexicographically to match sklearn (#344).
        let enc = OrdinalEncoder::new();
        let flat = vec!["zebra".to_string(), "ant".to_string(), "moose".to_string()];
        let x = Array2::from_shape_vec((3, 1), flat).unwrap();
        let fitted = enc.fit(&x, &()).unwrap();
        // ant < moose < zebra
        assert_eq!(fitted.categories()[0][0], "ant");
        assert_eq!(fitted.categories()[0][1], "moose");
        assert_eq!(fitted.categories()[0][2], "zebra");
    }
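
    // Illustrative sketch only: a hypothetical, std-only restatement of the
    // `unknown_value` collision rule from `fit` (NOT the encoder's own API).
    // Each feature emits `n_frequent + 1` codes when it has an infrequent
    // group and `n_frequent` codes otherwise; an integer `unknown_value` in
    // `0..max_cardinality` collides with a code that is already in use.
    fn sketch_unknown_value_collides(
        unknown_value: f64,
        n_frequent: &[usize],
        has_infrequent: &[bool],
    ) -> bool {
        let max_cardinality = n_frequent
            .iter()
            .zip(has_infrequent)
            .map(|(&n, &grouped)| n + usize::from(grouped))
            .max()
            .unwrap_or(0);
        !unknown_value.is_nan()
            && unknown_value.fract() == 0.0
            && unknown_value >= 0.0
            && unknown_value < max_cardinality as f64
    }

    #[test]
    fn sketch_unknown_value_collision_rule() {
        // Two frequent categories plus the shared infrequent code: the
        // feature emits codes 0, 1, 2, so 2 collides and 3 is accepted.
        assert!(sketch_unknown_value_collides(2.0, &[2], &[true]));
        assert!(!sketch_unknown_value_collides(3.0, &[2], &[true]));
        // Negative and nan sentinels never collide.
        assert!(!sketch_unknown_value_collides(-1.0, &[2], &[true]));
        assert!(!sketch_unknown_value_collides(f64::NAN, &[2], &[true]));
    }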
}