ferrolearn_preprocess/ordinal_encoder.rs
//! Ordinal encoder: map string categories to integer indices.
//!
//! Each column's categories are mapped to integers `0, 1, 2, ...` in
//! **lexicographic order**, matching scikit-learn's `OrdinalEncoder`.
//! Unknown categories seen during `transform` produce an error by default
//! (`handle_unknown='error'`); with `handle_unknown='use_encoded_value'` they
//! are instead encoded as a configurable `unknown_value` sentinel, again
//! following scikit-learn.
//!
//! # `## REQ status`
//!
//! Binary (R-DEFER-2), translating `sklearn/preprocessing/_encoders.py` (`class OrdinalEncoder`
//! `:1235`). Design doc: `.design/preprocess/ordinal_encoder.md`. Expected values from the live
//! sklearn 1.5.2 oracle (R-CHAR-3). Consumer: crate re-export (`lib.rs:121`, grandfathered S5).
//! HONEST (R-HONEST-3): a FAITHFUL String-only ordinal encoder: `categories_`=sorted-unique and
//! the ordinal VALUES match sklearn bit-for-bit on the string path; the output container is now
//! `Array2<f64>` (sklearn's `dtype=np.float64` default, `:1262`); remaining divergences are
//! String-only input, the absent configurable `dtype` param, and the rest of the param/feature
//! surface.
//!
//! | REQ | Status | Evidence |
//! |---|---|---|
//! | REQ-1 (string fit → sorted-unique categories_) | SHIPPED | `Fit::fit` per column → `categories_`=sorted-unique (`Vec<String>::sort`, lexicographic) + index map; rejects 0 rows (`InsufficientSamples`, matches sklearn `check_array`). Mirrors `_BaseEncoder._fit` `categories_=_unique(Xi)` (`_encoders.py:99`). Critic-verified vs live oracle: `green_value_match_and_categories` (`[['bird','cat','dog'],['large','medium','small']]`), `green_lexicographic_sort_matches_np_unique` + `green_non_ascii_codepoint_order` (== `np.unique`), `green_empty_fit_rejected_matches_sklearn`. Consumer: re-export `lib.rs:121`. |
//! | REQ-2 (transform + fit_transform, ordinal values + unknown rejection) | SHIPPED | `Transform::transform` maps category→ordinal index (now cast to `f64` via `ordinal_index_to_f64`), unknown → `InvalidParameter` (matches `handle_unknown='error'` default `ValueError`), ncols-mismatch → `ShapeMismatch`. The unknown/ncols-mismatch LOGIC is byte-for-byte UNCHANGED by the dtype fix. Critic-verified: ordinal VALUES `[[1.,2.],[2.,0.],[1.,1.],[0.,2.]]` == live oracle, `green_unknown_category_rejected`, `green_fit_transform_equals_oracle`. Consumer: re-export `lib.rs:142`. |
//! | REQ-3 (output dtype float64) | SHIPPED | `Transform::Output = Array2<f64>` on BOTH `Transform` impls (`FittedOrdinalEncoder` + the unfitted `OrdinalEncoder` shim) and `FitTransform::fit_transform`; each cell is the ordinal index cast via `ordinal_index_to_f64` (`idx as f64`, lossless < 2^53), matching sklearn's default `dtype=np.float64` output container (`_encoders.py:1262`, `transform` casts `X_int.astype(self.dtype)`). The REQ-1/REQ-2 fit + unknown-rejection LOGIC is unchanged. Critic-verified vs live oracle: `green_fit_transform_f64_oracle` (multi-feature f64 matrix), `green_exact_integer_index_to_f64` (index 10 → `10.0`), plus the value guards over `Array2<f64>`. A CONFIGURABLE non-float64 output dtype (`int32` etc.) is a FOLLOW-ON (blocker #1158 remains open for the `dtype` ctor param); ferrolearn's output is fixed to sklearn's `float64` DEFAULT. This unblocks REQ-5's float `unknown_value` sentinel. Consumer: crate re-export `lib.rs:142`. |
//! | REQ-4 (numeric/mixed-dtype input) | NOT-STARTED | open prereq blocker #1159. `Array2<String>`-only; sklearn accepts int/str/object (`np.unique` numeric sort). |
//! | REQ-5 (handle_unknown='use_encoded_value' + unknown_value) | SHIPPED | `HandleUnknown` enum `{ Error, UseEncodedValue }` (default `Error`) + `unknown_value: Option<f64>` on `OrdinalEncoder`, threaded into `FittedOrdinalEncoder` via `with_handle_unknown`/`with_unknown_value` builders. `Fit::fit` runs the 3 sklearn validations (`_encoders.py:1473-1526`) AFTER the unchanged `categories_` compute, mapping sklearn's `TypeError`/`ValueError` → `FerroError::InvalidParameter`: (a) `UseEncodedValue` && `unknown_value is None` (sklearn `:1481` `not isinstance(.,Integral)` TypeError); (b) `Error` && `unknown_value is Some` (sklearn `:1488` TypeError); (c) `UseEncodedValue` && non-nan integer `v` with `0 <= v < max_cardinality` (sklearn `:1518-1526` ValueError collision). `Transform::transform` branches unknown categories: `UseEncodedValue` → write `unknown_value` (incl. nan) (sklearn `:1591` `X_trans[~X_mask] = self.unknown_value`); `Error` → `InvalidParameter` (the SHIPPED REQ-2 default, UNCHANGED). Seen categories still map to `idx as f64` (UNCHANGED). NEVER panics (R-CODE-2). Critic-verified vs live sklearn 1.5.2 oracle: `green_use_encoded_value_minus_one`, `green_use_encoded_value_nan`, `green_use_encoded_value_multifeature`, `red_uev_requires_unknown_value`, `red_error_mode_forbids_unknown_value`, `red_unknown_value_collision_in_range`, `green_unknown_value_negative_or_oob_or_nan_ok`, `green_error_mode_unknown_still_rejected` (`tests/divergence_ordinal_encoder.rs`). Configurable `dtype`/`encoded_missing_value` interplay stays OUT OF SCOPE (REQ-3/REQ-6). Consumer: crate re-export `lib.rs:142`. |
//! | REQ-6 (encoded_missing_value / NaN) | NOT-STARTED | open prereq blocker #1161. No missing-value concept (`:1283`). |
//! | REQ-7 (explicit categories param) | SHIPPED | `Categories` enum `{ Auto, Explicit(Vec<Vec<String>>) }` (default `Auto`) + `#[must_use] OrdinalEncoder::with_categories(Vec<Vec<String>>)` builder + `categories_param()` getter (named to avoid colliding with `FittedOrdinalEncoder::categories`). `Fit::fit` branches on the param AFTER the 0-row guard: `Auto` → the SHIPPED REQ-1 sorted-unique compute (UNCHANGED); `Explicit(lists)` → use each `lists[j]` AS-GIVEN for `categories_[j]` (GIVEN order, NOT re-sorted) + the index map in that order, mirroring sklearn `_encoders.py:114` `cats = np.array(self.categories[i])`. Validations match `_BaseEncoder._fit`: list-count ≠ n_features → `ShapeMismatch` ("Shape mismatch: if categories is an array, it has to be of shape (n_features,)." `:85-89`); an EMPTY list → `InvalidParameter` (sklearn indexes `cats[0]` -> IndexError in both modes, `:114-117`, #2229); a list with duplicate elements → `InvalidParameter` ("In column {j}, the predefined categories contain duplicate elements." `:136-141`); under [`HandleUnknown::Error`] (default) a data value not in its column's list → `InvalidParameter` ("Found unknown categories [{v}] in column {j} during fit" `:153-160`), while under [`HandleUnknown::UseEncodedValue`] this fit-time subset check is SKIPPED (out-of-set data is encoded to `unknown_value` at transform). The REQ-5 unknown_value validations still apply (the `max_cardinality` collision check now keys off the explicit list lengths). `Transform`/`inverse_transform`/`categories()`/`get_feature_names_out` are UNCHANGED: they already read `categories_`/`category_to_index`, which now reflect the explicit given-order set. NEVER panics (R-CODE-2). Critic-verified vs live sklearn 1.5.2 oracle (`tests/divergence_ordinal_encoder.rs`): `green_explicit_given_order_not_sorted`, `green_explicit_unsorted_accepted`, `red_explicit_error_mode_data_not_in_cats_fits_err`, `green_explicit_use_encoded_value_out_of_set_ok`, `red_explicit_n_features_mismatch`, `green_explicit_multifeature_each_own_order`, `red_explicit_duplicate_categories`, `green_explicit_inverse_roundtrip_given_order`, `green_explicit_auto_still_default`. Consumer: crate re-export (`lib.rs:142`, `Categories` re-exported). Configurable numeric/`bytes` categories + the nan-last rule stay OUT OF SCOPE (String-only path, REQ-4/REQ-6). |
//! | REQ-8 (min_frequency/max_categories infrequent) | SHIPPED | #1163: `OrdinalEncoder::with_min_frequency`/`with_max_categories` (+`min_frequency()`/`max_categories()` getters) add the integer-count infrequent thresholds (`_encoders.py:1289-1315`). The OrdinalEncoder ANALOG of the SHIPPED OneHotEncoder REQ-5b (`one_hot_encoder.rs`): the SAME `_identify_infrequent` algorithm (reused as `identify_infrequent` + `build_infrequent_map`, mirroring `_BaseEncoder._identify_infrequent` `:275-318` + `_default_to_infrequent_mappings` `:373-400`: min_frequency `count < min_freq` FIRST, then max_categories keeps top `max_categories-1` by count via a STABLE argsort over the full count array; ties favor the LARGER index; `max_categories==1` → all infrequent), but the infrequent categories collapse to a single shared ORDINAL CODE `n_frequent` (NOT a one-hot column): frequent categories keep codes `0..n_frequent` in their original sorted order, every infrequent category emits `n_frequent`. `Fit::fit` runs the `_parameter_constraints` check FIRST (`min_frequency`/`max_categories` `Some(0)` → `InvalidParameter` "must be an int in the range [1, inf)", BEFORE the data, `Interval(Integral,1,None)`), then (after the SHIPPED `categories_` compute, UNCHANGED: `categories_` keeps ALL categories) builds per-feature `infrequent_indices_`/`infrequent_map`/`n_frequent` from the fit-data category counts. `FittedOrdinalEncoder::infrequent_categories()` exposes the infrequent VALUES per feature (`infrequent_categories_`, `:255-262`). `Transform` routes a found category index through `infrequent_map[j]` then casts to f64 (frequent → own code, infrequent → shared trailing code; `_map_infrequent_categories`, `:402-452`); with grouping DISABLED the map is the identity so REQ-2 is UNCHANGED. `inverse_transform`: a code `< n_frequent` → the frequent category at that remapped slot (via the frequent-only category list); a code `== n_frequent` (exact float equality on the raw label) → the REAL String `"infrequent_sklearn"` (`:1644`,`:1675-1677`), representable, UNLIKE OneHotEncoder's NaN proxy; the truncate+wrap numpy index logic applies over the frequent-only list (SHIPPED REQ-9 path UNCHANGED when disabled). The `unknown_value` collision check now keys off the EFFECTIVE code count `n_frequent + 1` (verified live: `min_frequency=2` over 4 cats → 3 codes → `unknown_value=3` accepted, `=2` collides). `get_feature_names_out` is UNCHANGED (OrdinalEncoder is one-to-one, so infrequent does NOT add columns). NEVER panics (R-CODE-2). Critic-verified vs live sklearn 1.5.2 oracle (`tests/divergence_ordinal_encoder.rs`): `req8_min_frequency_two_categories_transform_inverse`, `req8_max_categories_keeps_top_k_minus_one`, `req8_max_categories_tiebreak_favors_larger_index`, `req8_both_set_multifeature_some_without_infrequent`, `req8_zero_thresholds_rejected`, `req8_infrequent_plus_use_encoded_value_distinct_codes`, `req8_unknown_value_collision_uses_effective_code_count`, `req8_disabled_default_unchanged`, `req8_inverse_infrequent_non_roundtrip`. Consumer: crate re-export `lib.rs:142`. STILL NOT-STARTED (R-HONEST-3): the FLOAT-fraction `min_frequency` (`:1296-1297`,`:297-299`) and the explicit-`categories`+infrequent interaction stay unimplemented. |
//! | REQ-9 (inverse_transform) | SHIPPED | `FittedOrdinalEncoder::inverse_transform(&Array2<f64>) -> Array2<String>` reuses the SHIPPED `categories_` (REQ-1): each cell is an ordinal index into `categories[j]`, mirroring sklearn `X_tr[:, i] = self.categories_[i][labels]` (`_encoders.py:1595-1679`). Validates the index BEFORE lookup (no panic, R-CODE-2): an exact non-negative integer in `[0, len)` → `categories[j][index].clone()`; 0-row → `InsufficientSamples` (symmetry with the #2220 transform guard); ncols-mismatch → `ShapeMismatch` (sklearn `:1619`). FAITHFUL to numpy: mirrors `labels.astype("int64")` (truncate toward zero, Rust `as i64`) + numpy fancy indexing (negative WRAP, `-1.0` → last category, `-2.0` → `len-2`), raising only once the wrapped index leaves `[0, len)` (`_encoders.py:1664`,`:1679`). Non-finite (NaN/±inf) → `InvalidParameter` (sklearn IndexError/ValueError; guarded because Rust `f64 as i64` saturates NaN→0). Critic-verified vs live sklearn 1.5.2 oracle: `green_inverse_roundtrip_multifeature`, `green_inverse_held_out_valid_ordinals`, `green_inverse_negative_wraps_like_numpy` (`-1.0`→'dog', `-2.0`→'cat', `-3.0`→Err), `green_inverse_non_integer_truncates_like_numpy` (`1.5`→'dog', `0.7`→'cat'), `red_inverse_out_of_range_positive` (`9.0`→Err), `red_inverse_ncols_mismatch`, `red_inverse_zero_row`, `red_inverse_use_encoded_value_unknown_cell` (`tests/divergence_ordinal_encoder.rs`). SCOPE LIMITATION (R-HONEST-3): the `unknown_value`-cell → `None` inverse (sklearn `:1673`) is unrepresentable in `Array2<String>` (would need `Array2<Option<String>>`), so a `use_encoded_value` cell equal to `unknown_value` ERRORS (checked BEFORE the index logic so the sentinel is not silently wrapped) instead of yielding `None`; the default `Error`-mode encoder has only valid ordinals so its inverse is COMPLETE and bit-exact. Consumer: crate re-export `lib.rs:142`. |
//! | REQ-10 (get_feature_names_out + n_features_in_) | SHIPPED | `FittedOrdinalEncoder::n_features_in()` (= `n_features()`, sklearn `n_features_in_`) + `get_feature_names_out(input_features)`: `OneToOneFeatureMixin` (one output col per input col) returns the INPUT names unchanged: `None` -> `["x0","x1",..]` (`_check_feature_names_in`), `Some(names)` -> verbatim, a wrong-length `input_features` -> `ShapeMismatch` (sklearn ValueError). Live-oracle test `req10_feature_names_out_and_n_features_in` (`['x0','x1']`, `['a','b']`, wrong-length Err). feature_names_in_ (string input-name capture) stays NOT-STARTED (ferrolearn fit takes positional columns, no input names). Consumer: crate re-export `lib.rs:142`. |
//! | REQ-11 (full ctor + _parameter_constraints) | NOT-STARTED | open prereq blocker #1166. `new()` takes no params (`:1320-1386`). |
//! | REQ-12 (PyO3 binding) | SHIPPED | `_RsOrdinalEncoder` (hand `#[pyclass]`, `ferrolearn-python/src/extras.rs`) over `OrdinalEncoder`/`FittedOrdinalEncoder`/`HandleUnknown`/`Categories`, the FIRST STRING-INPUT binding: `fit(rows)`/`transform(rows)` take a Python `list[list[str]]` (PyO3 `Vec<Vec<String>>` extraction, NOT a numpy f64 array), validate rectangular rows (ragged → `PyValueError`), build `Array2<String>` via `Array2::from_shape_vec`, and `transform` returns `PyArray2<f64>`; `inverse_transform(PyReadonlyArray2<f64>)` returns the `Array2<String>` rows as `Vec<Vec<String>>` (the `use_encoded_value`→None inverse ERRORS, REQ-9 scope → `PyValueError`). Ctor knobs `handle_unknown="error"` (`resolve_handle_unknown`: "error"→`Error`, "use_encoded_value"→`UseEncodedValue`, bad→`PyValueError` per `_encoders.py:1425`), `unknown_value: Option<f64>=None`, `categories: Option<Vec<Vec<String>>>=None` (None→`Auto`, Some→`Explicit`); the REQ-5/REQ-7 fit validations (`OrdinalEncoder::fit`) surface as `FerroError`→`PyValueError`. `#[getter]`s `categories_` (PyList of str lists), `n_features_in_`, `feature_names_out` (`get_feature_names_out(None)`). Registered `lib.rs` `m.add_class::<extras::RsOrdinalEncoder>()`. Non-test production consumer (R-DEFER-1): `_extras.py::OrdinalEncoder(BaseEstimator)`, a CUSTOM class (NOT `_TransformerWrapper`, input is str), full 7-key keyword-only ctor (`categories`/`dtype`/`handle_unknown`/`unknown_value`/`encoded_missing_value`/`min_frequency`/`max_categories`, `_encoders.py:1435-1452`) for `get_params`/`clone`, `_to_rows` (numpy str/object array OR list-of-lists → `list[list[str]]` via `np.asarray(X).astype(str).tolist()`), `_check_unsupported` (non-NaN `encoded_missing_value` REQ-6 / `min_frequency`/`max_categories` REQ-8 / non-f64 `dtype` REQ-3 → `NotImplementedError`), `fit`/`transform`/`fit_transform`/`inverse_transform` (→ numpy object array)/`get_feature_names_out`, `@property` `categories_`/`n_features_in_`, pre-fit access → `NotFittedError` (`check_is_fitted(self, "_rs")`); re-exported in `ferrolearn/__init__.py` as `ferrolearn.OrdinalEncoder`. Live-oracle parity (R-CHAR-3, sklearn 1.5.2, `tests/divergence_ordinal_encoder_py.py`, 19 pass): `fit_transform([['cat'],['dog'],['cat']])==[[0.],[1.],[0.]]`==sklearn, `categories_`==sklearn sorted-unique, multi-feature, inverse_transform roundtrip==original, `use_encoded_value`/`unknown_value=-1`→-1.0, explicit `categories=[['dog','cat','bird']]`→given-order index, `n_features_in_`, `get_feature_names_out`→`['x0','x1']` (+ input_features pass-through), pre-fit `NotFittedError`, bad `handle_unknown`→`ValueError`, unsupported (`encoded_missing_value`/`min_frequency`/`max_categories`/`dtype`)→`NotImplementedError`, 7-key get_params==sklearn, `clone`, numpy object/str-array input. STRING-only input (REQ-4 #1159), the `use_encoded_value`→None inverse (REQ-9), and the rest of the param surface stay OUT OF SCOPE (R-HONEST-3). |
//! | REQ-13 (ferray substrate) | NOT-STARTED | open prereq blocker #1168. `ndarray`+`HashMap`, not `ferray-core` (R-SUBSTRATE-1/2). |
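/*!

A minimal standalone sketch of the `inverse_transform` index rules listed under REQ-9:
reject non-finite labels, truncate toward zero, wrap a negative index once, then
bounds-check. It does not call this crate. `resolve_label` is a hypothetical helper
written only for illustration and is not the function this module ships; the
`unknown_value` and infrequent-code branches are left out.

```rust
/// Resolve one encoded cell to a category (illustrative only).
fn resolve_label<'a>(label: f64, categories: &[&'a str]) -> Result<&'a str, String> {
    // NaN and +/-inf are rejected up front, because `f64 as i64` would
    // otherwise saturate them into an ordinary-looking index.
    if !label.is_finite() {
        return Err(format!("non-finite label {label}"));
    }
    let len = categories.len() as i64;
    // `as i64` truncates toward zero, like numpy's `astype("int64")`.
    let truncated = label as i64;
    // A negative index wraps once from the end, like numpy fancy indexing.
    let wrapped = if truncated < 0 { truncated + len } else { truncated };
    if (0..len).contains(&wrapped) {
        Ok(categories[wrapped as usize])
    } else {
        Err(format!("label {label} is out of range for {len} categories"))
    }
}
```

With the categories `["cat", "dog"]`, this sketch resolves `-1.0` and `1.5` to `"dog"`,
`-2.0` and `0.7` to `"cat"`, and rejects `-3.0`, `9.0`, and NaN.
*/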

use ferrolearn_core::error::FerroError;
use ferrolearn_core::traits::{Fit, FitTransform, Transform};
use ndarray::Array2;
use std::collections::HashMap;

// ---------------------------------------------------------------------------
// HandleUnknown
// ---------------------------------------------------------------------------

/// How [`OrdinalEncoder`] treats categories at `transform` time that were not
/// seen during `fit`.
///
/// Mirrors scikit-learn's `OrdinalEncoder(handle_unknown=...)` parameter
/// (`sklearn/preprocessing/_encoders.py:1262`), which accepts `'error'` and
/// `'use_encoded_value'`.
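/**

A minimal standalone sketch of the two strategies, assuming a per-column
category-to-index map like the one built at `fit` time. It does not use this
crate: `Mode` and `encode_cell` are hypothetical names introduced only for
illustration.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `HandleUnknown`, redeclared so the sketch
/// compiles on its own.
#[derive(Clone, Copy)]
enum Mode {
    Error,
    UseEncodedValue,
}

/// Encode one cell: a seen category yields its index as `f64`; an unseen one
/// is rejected or replaced by the sentinel, depending on `mode`.
fn encode_cell(
    index: &HashMap<&str, usize>,
    cell: &str,
    mode: Mode,
    unknown_value: Option<f64>,
) -> Result<f64, String> {
    match (index.get(cell), mode) {
        (Some(&i), _) => Ok(i as f64),
        (None, Mode::Error) => Err(format!("unknown category {cell:?}")),
        (None, Mode::UseEncodedValue) => {
            unknown_value.ok_or_else(|| "unknown_value is not set".to_string())
        }
    }
}
```

For an index holding `"cat" -> 0` and `"dog" -> 1`, the unseen cell `"fish"` is an
error under `Mode::Error` and becomes `-1.0` under `Mode::UseEncodedValue` with
`Some(-1.0)`.
*/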
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum HandleUnknown {
    /// Raise an error on any unknown category (scikit-learn's default
    /// `handle_unknown='error'`). This is also the default here.
    #[default]
    Error,
    /// Encode unknown categories with the configured `unknown_value` sentinel
    /// (scikit-learn's `handle_unknown='use_encoded_value'`). Requires
    /// `unknown_value` to be set.
    UseEncodedValue,
}

// ---------------------------------------------------------------------------
// Categories
// ---------------------------------------------------------------------------

/// How [`OrdinalEncoder`] determines, per column, the ordered category set used
/// to assign ordinal indices.
///
/// Mirrors scikit-learn's `OrdinalEncoder(categories=...)` parameter
/// (`sklearn/preprocessing/_encoders.py:1252`), which accepts `'auto'` or a
/// list of per-feature category lists.
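/**

A minimal standalone sketch of how the two variants differ for a single column:
`Auto` sorts and deduplicates the observed values, while `Explicit` keeps the
supplied order. It does not use this crate: `column_categories` and `index_map`
are hypothetical helpers, and the fit-time validations (duplicates, unknown
values, list count) are left out.

```rust
use std::collections::HashMap;

/// Build the ordered category list for one column. `None` models `Auto`
/// (sorted-unique values from the data); `Some(list)` models `Explicit`
/// (the list in its given order).
fn column_categories(data: &[&str], explicit: Option<&[&str]>) -> Vec<String> {
    match explicit {
        Some(list) => list.iter().map(|s| s.to_string()).collect(),
        None => {
            let mut cats: Vec<String> = data.iter().map(|s| s.to_string()).collect();
            cats.sort(); // `String` ordering is byte-wise (lexicographic)
            cats.dedup(); // duplicates are adjacent after sorting
            cats
        }
    }
}

/// Map each category to its position in the ordered list.
fn index_map(cats: &[String]) -> HashMap<String, usize> {
    cats.iter().cloned().enumerate().map(|(i, c)| (c, i)).collect()
}
```

For the data `["dog", "cat", "bird", "cat"]`, the `Auto` path assigns `"dog"` the
index 2, while an explicit list `["dog", "cat", "bird"]` assigns it index 0.
*/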
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub enum Categories {
    /// Determine the categories automatically from the training data as the
    /// sorted-unique values per column (scikit-learn's default `categories='auto'`).
    #[default]
    Auto,
    /// Use the explicit, user-provided category lists. `Explicit(lists)[j]` is
    /// the ordered category set for column `j`, used **as given** (the order is
    /// preserved, NOT re-sorted), mirroring scikit-learn's
    /// `categories=[list, ...]` (`_encoders.py:114`, where
    /// `np.array(self.categories[i])` is used as-is).
    Explicit(Vec<Vec<String>>),
}

// ---------------------------------------------------------------------------
// OrdinalEncoder (unfitted)
// ---------------------------------------------------------------------------

/// An unfitted ordinal encoder.
///
/// Calling [`Fit::fit`] on an `Array2<String>` learns, for each column, a
/// mapping from the unique string categories (sorted lexicographically)
/// to consecutive integers `0, 1, 2, ...`, and returns a
/// [`FittedOrdinalEncoder`].
///
/// Unknown categories at `transform` time are, by default, rejected
/// ([`HandleUnknown::Error`]). Configuring
/// [`with_handle_unknown`](OrdinalEncoder::with_handle_unknown) with
/// [`HandleUnknown::UseEncodedValue`] plus
/// [`with_unknown_value`](OrdinalEncoder::with_unknown_value) instead encodes
/// unknown categories as the supplied sentinel (which may be `f64::NAN`),
/// matching scikit-learn's `OrdinalEncoder(handle_unknown='use_encoded_value')`.
///
/// # Examples
///
/// ```
/// use ferrolearn_preprocess::ordinal_encoder::OrdinalEncoder;
/// use ferrolearn_core::traits::{Fit, Transform};
/// use ndarray::Array2;
///
/// let enc = OrdinalEncoder::new();
/// let data = Array2::from_shape_vec(
///     (3, 2),
///     vec![
///         "cat".to_string(), "small".to_string(),
///         "dog".to_string(), "large".to_string(),
///         "cat".to_string(), "small".to_string(),
///     ],
/// ).unwrap();
/// let fitted = enc.fit(&data, &()).unwrap();
/// let encoded = fitted.transform(&data).unwrap();
/// // Output is `Array2<f64>`, matching sklearn's `dtype=np.float64` default.
/// assert_eq!(encoded[[0, 0]], 0.0); // "cat" is index 0 in col 0
/// assert_eq!(encoded[[1, 0]], 1.0); // "dog" is index 1 in col 0
/// ```
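/**

A minimal standalone sketch of the three `handle_unknown` / `unknown_value`
checks that the REQ-5 notes describe for `fit`. It does not use this crate:
`validate_unknown_value` is a hypothetical helper, its error strings are made
up, and how the real `fit` treats a finite non-integer sentinel is not modelled.

```rust
/// Check the `handle_unknown` / `unknown_value` pairing (illustrative only).
/// `use_encoded_value` is true for the sentinel mode; `max_cardinality` is the
/// largest per-column number of ordinal codes.
fn validate_unknown_value(
    use_encoded_value: bool,
    unknown_value: Option<f64>,
    max_cardinality: usize,
) -> Result<(), String> {
    match (use_encoded_value, unknown_value) {
        // (a) the sentinel mode needs a sentinel
        (true, None) => Err("unknown_value must be set".to_string()),
        // (b) the error mode must not carry a sentinel
        (false, Some(_)) => Err("unknown_value must be unset".to_string()),
        // (c) an integer sentinel must not collide with an existing code
        (true, Some(v)) => {
            let is_integer = v.is_finite() && v.fract() == 0.0;
            if is_integer && v >= 0.0 && v < max_cardinality as f64 {
                Err(format!("unknown_value {v} collides with an ordinal code"))
            } else {
                Ok(())
            }
        }
        (false, None) => Ok(()),
    }
}
```

With three codes per column (`max_cardinality = 3`), the sentinels `-1.0`, `3.0`,
and NaN pass this check, while `1.0` is rejected as a collision.
*/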
#[derive(Debug, Clone, Default)]
pub struct OrdinalEncoder {
    /// How the per-column category sets are determined ([`Categories::Auto`] =
    /// sorted-unique from the data, the default; [`Categories::Explicit`] =
    /// user-provided lists used in the given order).
    categories: Categories,
    /// Strategy for unknown categories at `transform` time.
    handle_unknown: HandleUnknown,
    /// Sentinel written for unknown categories when `handle_unknown` is
    /// [`HandleUnknown::UseEncodedValue`]. May be `f64::NAN`.
    unknown_value: Option<f64>,
    /// Minimum frequency (count) below which a category is grouped into the
    /// single trailing "infrequent" ordinal index for that feature
    /// (`min_frequency`). `None` (the default) disables the min-frequency
    /// threshold. Mirrors scikit-learn's `OrdinalEncoder(min_frequency=...)`
    /// (`sklearn/preprocessing/_encoders.py:1289-1297`). SCOPE (R-HONEST-3):
    /// only the integer-count form is supported; sklearn also accepts a FLOAT
    /// fraction `min_frequency * n_samples` (`:1296-1297`,`:297-299`), which is
    /// NOT-STARTED here.
    min_frequency: Option<usize>,
    /// Upper limit on the number of output ordinal codes per feature when
    /// grouping infrequent categories (`max_categories`); the infrequent group
    /// itself counts toward this limit. `None` (the default) imposes no limit.
    /// Mirrors scikit-learn's `OrdinalEncoder(max_categories=...)`
    /// (`sklearn/preprocessing/_encoders.py:1301-1315`).
    max_categories: Option<usize>,
}

impl OrdinalEncoder {
    /// Create a new `OrdinalEncoder` with scikit-learn's defaults
    /// (`categories='auto'`, `handle_unknown='error'`, no `unknown_value`, and
    /// no `min_frequency`/`max_categories` threshold).
    #[must_use]
    pub fn new() -> Self {
        Self {
            categories: Categories::Auto,
            handle_unknown: HandleUnknown::Error,
            unknown_value: None,
            min_frequency: None,
            max_categories: None,
        }
    }

    /// Set the explicit per-column category lists (`categories=[list, ...]`).
    ///
    /// Each `lists[j]` is the ordered category set for column `j`, used **as
    /// given** at `fit` time: the order is preserved (NOT re-sorted), so the
    /// assigned ordinal indices follow the supplied order, matching
    /// scikit-learn's `OrdinalEncoder(categories=...)`
    /// (`sklearn/preprocessing/_encoders.py:114`).
    ///
    /// At `fit` time the number of lists must equal the number of input columns,
    /// no list may contain duplicates, and (under the default
    /// `handle_unknown='error'`) every value seen in the data must appear in its
    /// column's list; otherwise [`Fit::fit`] returns an error. See [`Fit::fit`]
    /// for the exact validation contract.
    #[must_use]
    pub fn with_categories(mut self, categories: Vec<Vec<String>>) -> Self {
        self.categories = Categories::Explicit(categories);
        self
    }

    /// Return the configured `categories` strategy ([`Categories::Auto`] or
    /// [`Categories::Explicit`]).
    ///
    /// Named `categories_param` to avoid colliding with
    /// [`FittedOrdinalEncoder::categories`], which returns the *learned*
    /// per-column category lists after fitting.
    #[must_use]
    pub fn categories_param(&self) -> &Categories {
        &self.categories
    }

    /// Set the unknown-category strategy (`handle_unknown`).
    ///
    /// With [`HandleUnknown::UseEncodedValue`] an `unknown_value` must also be
    /// supplied via [`with_unknown_value`](OrdinalEncoder::with_unknown_value);
    /// otherwise [`Fit::fit`] returns an error (matching scikit-learn's
    /// validation).
    #[must_use]
    pub fn with_handle_unknown(mut self, handle_unknown: HandleUnknown) -> Self {
        self.handle_unknown = handle_unknown;
        self
    }

    /// Set the sentinel written for unknown categories under
    /// [`HandleUnknown::UseEncodedValue`]. May be `f64::NAN`.
    ///
    /// Setting this while `handle_unknown` is [`HandleUnknown::Error`] causes
    /// [`Fit::fit`] to return an error (matching scikit-learn's validation).
    #[must_use]
    pub fn with_unknown_value(mut self, unknown_value: f64) -> Self {
        self.unknown_value = Some(unknown_value);
        self
    }

    /// Return the configured unknown-category strategy.
    #[must_use]
    pub fn handle_unknown(&self) -> HandleUnknown {
        self.handle_unknown
    }

    /// Return the configured unknown-category sentinel, if any.
    #[must_use]
    pub fn unknown_value(&self) -> Option<f64> {
        self.unknown_value
    }

    /// Set the minimum-frequency threshold for infrequent grouping
    /// (`min_frequency`, integer count).
    ///
    /// At `fit` time a category whose count in the training data is **strictly
    /// less than** `min_frequency` is grouped with the other infrequent
    /// categories into a single trailing ordinal index `n_frequent` for that
    /// feature (the frequent categories keep ordinal indices `0..n_frequent` in
    /// their original sorted order), matching scikit-learn's
    /// `OrdinalEncoder(min_frequency=...)` integer form
    /// (`sklearn/preprocessing/_encoders.py:1289-1297`, `_identify_infrequent`
    /// `:295-296` `category_count < self.min_frequency`).
    ///
    /// Unlike [`crate::OneHotEncoder`], the infrequent group collapses to ONE
    /// **ordinal index** (not a one-hot column), so `categories_` is unchanged
    /// (all categories retained); only the emitted ordinal code is shared.
    ///
    /// SCOPE (R-HONEST-3): only the integer-count form is supported. sklearn
    /// also accepts a FLOAT `min_frequency` interpreted as the fraction
    /// `min_frequency * n_samples` (`_encoders.py:1296-1297`,`:297-299`); the
    /// float-fraction form is NOT-STARTED here.
    #[must_use]
    pub fn with_min_frequency(mut self, min_frequency: usize) -> Self {
        self.min_frequency = Some(min_frequency);
        self
    }

    /// Set the maximum number of output ordinal codes per feature for infrequent
    /// grouping (`max_categories`).
    ///
    /// At `fit` time, if a feature would otherwise produce more than
    /// `max_categories` distinct ordinal codes, the least-frequent categories
    /// are grouped into the single trailing infrequent index so the number of
    /// codes is at most `max_categories` (the infrequent group itself counts
    /// toward the limit). Mirrors scikit-learn's
    /// `OrdinalEncoder(max_categories=...)`
    /// (`sklearn/preprocessing/_encoders.py:1301-1315`, `_identify_infrequent`
    /// `:303-315`).
    #[must_use]
    pub fn with_max_categories(mut self, max_categories: usize) -> Self {
        self.max_categories = Some(max_categories);
        self
    }

    /// Return the configured minimum-frequency threshold (`min_frequency`), or
    /// `None` if infrequent grouping by frequency is disabled.
    #[must_use]
    pub fn min_frequency(&self) -> Option<usize> {
        self.min_frequency
    }

    /// Return the configured maximum ordinal-code limit (`max_categories`), or
    /// `None` if no limit is imposed.
    #[must_use]
    pub fn max_categories(&self) -> Option<usize> {
        self.max_categories
    }

    /// Whether infrequent grouping is enabled: `min_frequency` is set, or
    /// `max_categories` is set to at least 1. Mirrors scikit-learn's
    /// `_infrequent_enabled` (`_encoders.py:271-273`: `(max_categories is not
    /// None and max_categories >= 1) or min_frequency is not None`).
    fn infrequent_enabled(&self) -> bool {
        self.min_frequency.is_some() || self.max_categories.is_some_and(|m| m >= 1)
    }
}

// ---------------------------------------------------------------------------
// FittedOrdinalEncoder
// ---------------------------------------------------------------------------

/// A fitted ordinal encoder holding per-column category-to-index mappings.
///
/// Created by calling [`Fit::fit`] on an [`OrdinalEncoder`].
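/**

A minimal standalone sketch of how a per-feature `infrequent_map` and
`n_frequent` pair can be derived from category counts under an integer
`min_frequency`. It does not use this crate: this `build_infrequent_map` is a
hypothetical stand-in written for illustration, and the `max_categories` rule
is left out.

```rust
/// From per-category training counts (in category order), derive the
/// index-to-code map and `n_frequent` for one feature (illustrative only).
fn build_infrequent_map(counts: &[usize], min_frequency: usize) -> (Vec<usize>, usize) {
    // Categories with `count >= min_frequency` stay frequent.
    let n_frequent = counts.iter().filter(|&&c| c >= min_frequency).count();
    let mut next_code = 0;
    let map = counts
        .iter()
        .map(|&c| {
            if c < min_frequency {
                n_frequent // every infrequent category shares the trailing code
            } else {
                let code = next_code; // frequent categories keep their order
                next_code += 1;
                code
            }
        })
        .collect();
    (map, n_frequent)
}
```

For the counts `[3, 1, 2, 1]` with `min_frequency = 2`, the sketch returns the map
`[0, 2, 1, 2]` and `n_frequent = 2`, so three distinct codes are emitted. When no
count falls below the threshold, the map is the identity and `n_frequent` equals
the number of categories.
*/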
#[derive(Debug, Clone)]
pub struct FittedOrdinalEncoder {
    /// Per-column ordered category lists (index = integer value).
    pub(crate) categories: Vec<Vec<String>>,
    /// Per-column category-to-index maps.
    pub(crate) category_to_index: Vec<HashMap<String, usize>>,
    /// Strategy for unknown categories at `transform` time (threaded from the
    /// unfitted [`OrdinalEncoder`]).
    pub(crate) handle_unknown: HandleUnknown,
    /// Sentinel for unknown categories under
    /// [`HandleUnknown::UseEncodedValue`] (threaded from the unfitted encoder;
    /// validated to be present in that mode during `fit`).
    pub(crate) unknown_value: Option<f64>,
    /// Per-feature indices into `categories[j]` of the categories grouped as
    /// **infrequent** (`min_frequency`/`max_categories`), sorted ascending.
    /// Mirrors scikit-learn's private `_infrequent_indices[j]`
    /// (`_encoders.py:336-340`,`:367-370`). Empty when feature `j` has no
    /// infrequent categories (sklearn's `None`); with infrequent grouping
    /// disabled every entry is empty. Length `categories.len()`. The categories
    /// themselves are NOT removed from `categories[j]` (unlike one-hot column
    /// dropping); only their emitted ordinal code is folded.
    pub(crate) infrequent_indices_: Vec<Vec<usize>>,
    /// Per-feature mapping from a `categories[j]` index to its emitted ORDINAL
    /// code. Mirrors scikit-learn's `_default_to_infrequent_mappings[j]`
    /// (`_encoders.py:373-400`): a frequent category maps to its remapped slot
    /// `0..n_frequent` (frequent categories keep their original sorted order),
    /// every infrequent category maps to the single trailing index
    /// `n_frequent`. When feature `j` has no infrequent categories the mapping
    /// is the identity `0..len` (sklearn stores `None`; the identity is the
    /// representable equivalent). Length `categories.len()`, with
    /// `infrequent_map[j].len() == categories[j].len()`. Used by `transform`
    /// and `inverse_transform`.
    pub(crate) infrequent_map: Vec<Vec<usize>>,
    /// Per-feature number of frequent categories (`n_frequent`): the trailing
    /// infrequent ordinal index when feature `j` has infrequent categories.
    /// Equals `categories[j].len() - infrequent_indices_[j].len()`. When feature
    /// `j` has no infrequent categories this equals `categories[j].len()` (the
    /// identity map's range). Length `categories.len()`. Used by
    /// `inverse_transform` to recognise the shared infrequent code.
    pub(crate) n_frequent: Vec<usize>,
}

impl FittedOrdinalEncoder {
    /// Return the ordered category list for each column.
    ///
    /// `categories()[j][i]` is the category that maps to integer `i` in column `j`.
    #[must_use]
    pub fn categories(&self) -> &[Vec<String>] {
        &self.categories
    }

    /// Return the infrequent category **values** for each feature
    /// (`infrequent_categories_`).
    ///
    /// `infrequent_categories()[j]` lists, in `categories[j]` order, the category
    /// values that were grouped into the single trailing "infrequent" ordinal
    /// code (because their training count fell below `min_frequency` and/or
    /// they did not fit within the `max_categories` limit). An EMPTY inner `Vec`
    /// means feature `j` had no infrequent categories (scikit-learn returns
    /// `None` there; an empty list is the representable equivalent). With
    /// infrequent grouping disabled every entry is empty. Mirrors scikit-learn's
    /// `OrdinalEncoder.infrequent_categories_` (`_encoders.py:255-262`):
    /// `category[indices]` over `_infrequent_indices`.
372 #[must_use]
373 pub fn infrequent_categories(&self) -> Vec<Vec<String>> {
374 self.infrequent_indices_
375 .iter()
376 .enumerate()
377 .map(|(j, idxs)| {
378 idxs.iter()
379 .filter_map(|&idx| self.categories.get(j).and_then(|c| c.get(idx)).cloned())
380 .collect()
381 })
382 .collect()
383 }
384
385 /// Return the number of input columns (features).
386 #[must_use]
387 pub fn n_features(&self) -> usize {
388 self.categories.len()
389 }
390
391 /// Return the number of features seen during `fit`.
392 ///
393 /// Mirrors scikit-learn's `n_features_in_` attribute (set by `_validate_data`
394 /// at fit, `sklearn/base.py`). Equal to [`n_features`](Self::n_features); the
395 /// distinct name matches sklearn's fitted-attribute surface (REQ-10).
396 #[must_use]
397 pub fn n_features_in(&self) -> usize {
398 self.categories.len()
399 }
400
401 /// Return the output feature names, one per input feature.
402 ///
    /// `OrdinalEncoder` is a `OneToOneFeatureMixin` (one output column per input
    /// column), so `get_feature_names_out` returns the INPUT feature names
    /// unchanged (`OneToOneFeatureMixin.get_feature_names_out`, `sklearn/base.py`).
    /// With `input_features = None` it returns the default names
    /// `["x0", "x1", ...]` (`_check_feature_names_in`); otherwise it returns the
    /// supplied names verbatim.
409 ///
410 /// # Errors
411 ///
412 /// Returns [`FerroError::ShapeMismatch`] if `input_features` is `Some` but its
413 /// length differs from [`n_features_in`](Self::n_features_in) (sklearn raises
414 /// `ValueError("input_features should have length equal to number of features
415 /// ...")`).
416 pub fn get_feature_names_out(
417 &self,
418 input_features: Option<&[String]>,
419 ) -> Result<Vec<String>, FerroError> {
420 let n = self.categories.len();
421 match input_features {
422 None => Ok((0..n).map(|j| format!("x{j}")).collect()),
423 Some(names) => {
424 if names.len() != n {
425 return Err(FerroError::ShapeMismatch {
426 expected: vec![n],
427 actual: vec![names.len()],
428 context: "FittedOrdinalEncoder::get_feature_names_out (input_features \
429 length must equal n_features_in_)"
430 .into(),
431 });
432 }
433 Ok(names.to_vec())
434 }
435 }
436 }
437
438 /// Return the configured unknown-category strategy.
439 #[must_use]
440 pub fn handle_unknown(&self) -> HandleUnknown {
441 self.handle_unknown
442 }
443
444 /// Return the configured unknown-category sentinel, if any.
445 #[must_use]
446 pub fn unknown_value(&self) -> Option<f64> {
447 self.unknown_value
448 }
449
450 /// Convert ordinal indices back to the original category strings.
451 ///
452 /// This is the inverse of [`Transform::transform`]: each `f64` cell is read
453 /// as an ordinal index into the per-column `categories_` learned at `fit`
454 /// time, and the corresponding category string is returned. Reusing the
455 /// SHIPPED `categories_` (REQ-1), `inverse_transform(transform(X)) == X` for
456 /// any `X` whose every category was seen during `fit` (a bit-exact roundtrip
457 /// on the default `Error`-mode encoder). Mirrors scikit-learn's
458 /// `OrdinalEncoder.inverse_transform` (`sklearn/preprocessing/_encoders.py:1595`),
459 /// `X_tr[:, i] = self.categories_[i][labels]`.
460 ///
461 /// # Index contract (faithful to sklearn / numpy)
462 ///
463 /// Mirrors sklearn's `labels.astype("int64")` (`_encoders.py:1664`) followed
464 /// by numpy fancy indexing `categories_[j][labels]` (`:1679`):
465 /// - **truncates non-integers toward zero** (`1.5` → index `1` → that
466 /// category; `0.7` → `0`) — Rust `f64 as i64` matches the C-style cast.
467 /// - **wraps small negatives** via numpy negative indexing (`-1.0` →
468 /// `categories_[j][len-1]`, the LAST category; `-2.0` → `len-2`), raising
469 /// only once the wrapped index still leaves `[0, len)` (`-3.0` with 2
470 /// categories → `IndexError`).
    /// - **errors** on an out-of-range positive ordinal (`9.0` with 2 categories
    ///   → sklearn `IndexError`) and on a non-finite cell (NaN/±inf overflow the
    ///   `astype("int64")` cast → sklearn `IndexError`/`ValueError`; guarded
    ///   explicitly because Rust's `f64 as i64` maps NaN to `0` and saturates
    ///   ±inf to `i64::MIN`/`i64::MAX`, which would diverge).
476 ///
477 /// The roundtrip, held-out valid-ordinal, truncation, and negative-wrap paths
478 /// all match sklearn; out-of-range / non-finite both error.
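    ///
    /// The contract above can be sketched standalone as follows. This is a
    /// minimal illustration under stated assumptions, not the crate's
    /// implementation: `resolve_ordinal` is a hypothetical helper, and it
    /// returns `None` where `inverse_transform` would return an error.
    ///
    /// ```rust
    /// // Hypothetical helper (an assumption for illustration, not ferrolearn API).
    /// fn resolve_ordinal(v: f64, len: usize) -> Option<usize> {
    ///     // Non-finite cells are rejected before any cast.
    ///     if !v.is_finite() {
    ///         return None;
    ///     }
    ///     let len = len as i64;
    ///     // `as i64` truncates toward zero for finite values.
    ///     let mut idx = v as i64;
    ///     // numpy-style negative indexing: wrap once by `len`.
    ///     if idx < 0 {
    ///         idx += len;
    ///     }
    ///     // Anything still outside `[0, len)` is out of bounds.
    ///     if (0..len).contains(&idx) { Some(idx as usize) } else { None }
    /// }
    /// ```
    ///
    /// With 2 categories, this sketch gives `Some(1)` for both `1.5` and `-1.0`,
    /// and `None` for `-3.0`, `9.0`, and NaN.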
479 ///
480 /// # `use_encoded_value` → `None` (SCOPE LIMITATION, R-HONEST-3)
481 ///
482 /// With [`HandleUnknown::UseEncodedValue`], sklearn maps a cell equal to
483 /// `unknown_value` back to `None` (`_encoders.py:1673`,
484 /// `X_tr[mask, idx] = None`). ferrolearn's `Array2<String>` output container
485 /// **cannot represent `None`** (it would require `Array2<Option<String>>`).
    /// A cell equal to the configured `unknown_value` (including a NaN cell when
    /// `unknown_value` is NaN) is therefore rejected by a dedicated check that
    /// runs BEFORE the index logic, so a sentinel such as `-1` is never wrapped
    /// to the last category: this inverse ERRORS where sklearn returns
    /// `[[None, ...]]`. This is a documented divergence, not a silent
    /// wrong-string; the honest behavior is to error rather than fabricate a
    /// category. The default `Error`-mode encoder produces only valid ordinals,
    /// so its inverse is COMPLETE and bit-exact.
493 ///
494 /// # Errors
495 ///
496 /// Returns [`FerroError::InsufficientSamples`] if the input has zero rows
497 /// (symmetry with `transform`'s #2220 guard and sklearn's `check_array`).
498 ///
499 /// Returns [`FerroError::ShapeMismatch`] if the number of columns does not
500 /// match the number of features seen during fitting (sklearn's
501 /// `_encoders.py:1619` "Shape of the passed X data is not correct").
502 ///
    /// Returns [`FerroError::InvalidParameter`] if any cell is non-finite, equals
    /// `unknown_value` under [`HandleUnknown::UseEncodedValue`], or, after
    /// truncation toward zero and negative wrapping, still indexes outside
    /// `[0, len)` of the feature's category list (sklearn's `IndexError`).
506 pub fn inverse_transform(&self, x: &Array2<f64>) -> Result<Array2<String>, FerroError> {
507 let n_features = self.categories.len();
508 // Symmetric with `transform`'s 0-row guard (#2220) and sklearn's
509 // `check_array` minimum-of-1-sample (`_encoders.py:1610`): a 0-row input
510 // raises "Found array with 0 sample(s) ... a minimum of 1 is required".
511 if x.nrows() == 0 {
512 return Err(FerroError::InsufficientSamples {
513 required: 1,
514 actual: 0,
515 context: "FittedOrdinalEncoder::inverse_transform".into(),
516 });
517 }
518 // sklearn validates the column count (`_encoders.py:1619`) -> ValueError.
519 if x.ncols() != n_features {
520 return Err(FerroError::ShapeMismatch {
521 expected: vec![x.nrows(), n_features],
522 actual: vec![x.nrows(), x.ncols()],
523 context: "FittedOrdinalEncoder::inverse_transform".into(),
524 });
525 }
526
527 let n_samples = x.nrows();
528 // `Array2::default` fills with the empty String; every cell is overwritten
529 // on the Ok path, so the default is never observed by the caller.
530 let mut out = Array2::<String>::default((n_samples, n_features));
531
532 for j in 0..n_features {
533 let cats = &self.categories[j];
534 // Infrequent grouping (REQ-8). When feature `j` has infrequent
535 // categories, the valid ordinal codes are `0..=n_frequent[j]`: codes
536 // `0..n_frequent` index the FREQUENT-only category list (the original
537 // `categories[j]` with the infrequent entries removed, order
538 // preserved — sklearn `frequent_categories_mask`,
539 // `_encoders.py:1648-1652`), and the shared trailing code
540 // `n_frequent` inverts to the literal String `"infrequent_sklearn"`
541 // (`_encoders.py:1675-1677` `X_tr[mask, idx] = "infrequent_sklearn"`).
542 // UNLIKE `OneHotEncoder`'s NaN proxy, this is a REAL representable
543 // String. With grouping disabled `infrequent_indices_[j]` is empty,
544 // so this branch is skipped and the SHIPPED REQ-9 path runs unchanged.
545 let frequent_only: Option<Vec<String>> = if self
546 .infrequent_indices_
547 .get(j)
548 .is_some_and(|v| !v.is_empty())
549 {
550 let map = &self.infrequent_map[j];
551 let nf = self.n_frequent[j];
552 // Slot `s` (in `0..nf`) → the `categories[j]` element whose
553 // remapped code is `s` (frequent categories keep their order).
554 let mut fo: Vec<String> = Vec::with_capacity(nf);
555 for s in 0..nf {
556 if let Some(orig) = map.iter().position(|&c| c == s)
557 && let Some(cat) = cats.get(orig)
558 {
559 fo.push(cat.clone());
560 }
561 }
562 Some(fo)
563 } else {
564 None
565 };
566 // The category list the numpy index logic indexes into: the
567 // frequent-only list when grouping is active for this feature, else
568 // the full `categories[j]` (SHIPPED REQ-9, UNCHANGED).
569 let index_cats: &[String] = frequent_only.as_deref().unwrap_or(cats);
570 let len = index_cats.len() as i64;
571 for i in 0..n_samples {
572 let v = x[[i, j]];
573 // `use_encoded_value`: sklearn maps a cell equal to
574 // `unknown_value` back to `None` (`_encoders.py:1673`) BEFORE the
575 // int cast / indexing. `Array2<String>` cannot hold `None`, so
576 // this cell errors (documented scope limitation, R-HONEST-3) —
577 // checked first so the configured sentinel (e.g. `-1`) is NOT
578 // silently wrapped to a real category by the numpy index logic.
579 if self.handle_unknown == HandleUnknown::UseEncodedValue
580 && let Some(uv) = self.unknown_value
581 && (v == uv || (v.is_nan() && uv.is_nan()))
582 {
583 return Err(FerroError::InvalidParameter {
584 name: "X".into(),
585 reason: format!(
586 "value {v} at row {i}, feature {j} equals unknown_value; \
587 sklearn inverts it to None, which Array2<String> cannot \
588 represent (would need Array2<Option<String>>)"
589 ),
590 });
591 }
592 // Infrequent: a cell EXACTLY equal to the shared trailing code
593 // `n_frequent` (a float equality, computed on the RAW label
594 // BEFORE the int cast — sklearn `labels == infrequent_encoding_value`,
595 // `_encoders.py:1644`) inverts to `"infrequent_sklearn"`. A cell
596 // that merely truncates to `n_frequent` (e.g. `2.5`) does NOT —
597 // it falls through to the frequent-only index logic and errors out
598 // of range, matching the live oracle.
599 if frequent_only.is_some() && v == self.n_frequent[j] as f64 {
600 out[[i, j]] = "infrequent_sklearn".to_string();
601 continue;
602 }
603 // sklearn does `labels.astype('int64')` then `categories_[j][idx]`
604 // (`_encoders.py:1664`,`:1679`). A non-finite cell overflows the
605 // cast (NaN/+-inf -> IndexError/ValueError); reject it (R-CODE-2:
606 // Rust's `f64 as i64` would saturate NaN->0, diverging from numpy,
607 // so guard explicitly).
608 if !v.is_finite() {
609 return Err(FerroError::InvalidParameter {
610 name: "X".into(),
611 reason: format!(
612 "value {v} at row {i}, feature {j} is not a finite ordinal \
613 index (sklearn raises on NaN/inf astype('int64'))"
614 ),
615 });
616 }
617 // `astype('int64')` truncates toward zero (Rust `as i64` matches
618 // for finite values); numpy indexing then WRAPS a negative index
619 // by `+= len` (`-1` -> last category), raising only once the
620 // wrapped index still leaves `[0, len)`.
621 let mut idx = v as i64;
622 if idx < 0 {
623 idx += len;
624 }
625 if idx < 0 || idx >= len {
626 return Err(FerroError::InvalidParameter {
627 name: "X".into(),
628 reason: format!(
629 "ordinal index {} at row {i} is out of bounds for the {len} \
630 categories of feature {j} (sklearn IndexError)",
631 v as i64
632 ),
633 });
634 }
635 // `idx` is now provably in `[0, len)` (checked above) — no panic.
636 // `index_cats` is the frequent-only list under infrequent
637 // grouping (so a frequent code maps to its frequent category),
638 // else the full `categories[j]` (SHIPPED REQ-9, UNCHANGED).
639 out[[i, j]] = index_cats[idx as usize].clone();
640 }
641 }
642
643 Ok(out)
644 }
645}
646
647// ---------------------------------------------------------------------------
648// Helpers
649// ---------------------------------------------------------------------------
650
651/// Cast an ordinal category index to `f64`, matching scikit-learn's default
652/// `OrdinalEncoder(dtype=np.float64)` output container
653/// (`sklearn/preprocessing/_encoders.py:1262`).
654///
655/// `f64` exactly represents every integer up to `2^53`, so this is lossless for
656/// any realistic category count. Indices above `2^53` (astronomically more
657/// categories than memory could hold) round to the nearest `f64`, never panic
658/// (R-CODE-2) — the same silent float rounding numpy performs.
659#[inline]
660fn ordinal_index_to_f64(idx: usize) -> f64 {
661 idx as f64
662}
663
664/// Identify the indices of infrequent categories for one feature, given the
665/// per-category training `counts` (aligned with `categories[j]`) and the
666/// `min_frequency`/`max_categories` thresholds.
667///
668/// Mirrors scikit-learn's `_BaseEncoder._identify_infrequent`
669/// (`_encoders.py:275-318`). This is the SAME algorithm the SHIPPED
670/// `OneHotEncoder` REQ-5b uses (`one_hot_encoder.rs::identify_infrequent`):
671/// 1. min_frequency: a category with `count < min_frequency` is infrequent
672/// (`:295-296`, integer form only).
673/// 2. max_categories: if (after step 1) the feature would still produce more
674/// than `max_categories` ordinal codes — counted as `n_remaining_frequent +
675/// 1` for the infrequent group (`:303`) — the least-frequent categories are
676/// additionally marked infrequent until only `max_categories - 1` frequent
677/// categories remain (`:304-315`). Ties broken by a STABLE sort over the
678/// FULL count array, so among equal counts the SMALLER category index is
679/// marked infrequent first (sklearn `np.argsort(kind="mergesort")[:-k]`),
680/// i.e. the LARGER index is favoured to stay frequent. `max_categories == 1`
681/// (frequent_category_count 0) makes every category infrequent (`:307-309`).
682///
/// Returns the sorted-ascending infrequent indices (empty if none, which is
/// sklearn's `None`). Never panics (R-CODE-2) provided `max_categories` is not
/// `Some(0)`, which `fit` rejects before calling this function.
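///
/// # Examples
///
/// A minimal standalone sketch of the step-2 tie-break only (step 1 is
/// omitted). `least_frequent` and its `keep` parameter are hypothetical names
/// introduced for illustration; they are not part of this module.
///
/// ```rust
/// // Hypothetical helper: indices marked infrequent when at most `keep`
/// // categories may stay frequent.
/// fn least_frequent(counts: &[usize], keep: usize) -> Vec<usize> {
///     let n = counts.len();
///     let mut order: Vec<usize> = (0..n).collect();
///     // `sort_by_key` is stable, so equal counts keep ascending index order.
///     order.sort_by_key(|&i| counts[i]);
///     // Everything before the last `keep` entries is marked infrequent.
///     let cut = n - keep.min(n);
///     let mut marked = order[..cut].to_vec();
///     marked.sort_unstable();
///     marked
/// }
/// ```
///
/// For three categories with equal counts and `keep = 1`, the sketch marks
/// indices `0` and `1`, so the LARGEST index stays frequent.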
685fn identify_infrequent(
686 counts: &[usize],
687 min_frequency: Option<usize>,
688 max_categories: Option<usize>,
689) -> Vec<usize> {
690 let n = counts.len();
691 let mut infrequent_mask = vec![false; n];
692
693 // Step 1: min_frequency (integer count). `count < min_frequency`.
694 if let Some(min_freq) = min_frequency {
695 for (idx, &c) in counts.iter().enumerate() {
696 if c < min_freq {
697 infrequent_mask[idx] = true;
698 }
699 }
700 }
701
702 // Step 2: max_categories on the survivors. `n_current_features` counts the
703 // remaining frequent categories PLUS 1 for the infrequent group
704 // (`_encoders.py:303`).
705 if let Some(max_cat) = max_categories {
706 let n_infreq = infrequent_mask.iter().filter(|&&m| m).count();
707 let n_current_features = n - n_infreq + 1;
708 if max_cat < n_current_features {
709 // `max_categories` includes the one infrequent category.
710 let frequent_category_count = max_cat - 1;
711 if frequent_category_count == 0 {
712 // All categories are infrequent (`:307-309`).
713 infrequent_mask.iter_mut().for_each(|m| *m = true);
714 } else {
715 // Stable argsort over the FULL count array (ascending by count,
716 // ties by ascending index), then mark the smallest
717 // `n - frequent_category_count` levels infrequent — i.e. keep the
718 // top `frequent_category_count` by count, with ties resolved in
719 // favor of the LARGER index (`np.argsort(kind="mergesort")[:-k]`,
720 // `:312-315`).
721 let mut order: Vec<usize> = (0..n).collect();
722 order.sort_by(|&a, &b| counts[a].cmp(&counts[b]).then(a.cmp(&b)));
723 let keep = frequent_category_count.min(n);
724 let cut = n - keep;
725 for &idx in &order[..cut] {
726 infrequent_mask[idx] = true;
727 }
728 }
729 }
730 }
731
732 infrequent_mask
733 .iter()
734 .enumerate()
735 .filter_map(|(idx, &m)| if m { Some(idx) } else { None })
736 .collect()
737}
738
739/// Build the per-feature mapping from a `categories[j]` index to its emitted
740/// ORDINAL code.
741///
742/// Mirrors scikit-learn's `_default_to_infrequent_mappings[j]`
743/// (`_encoders.py:373-400`): frequent categories take codes `0..n_frequent` in
744/// their original (ascending-index) order; every infrequent category maps to
745/// the single trailing code `n_frequent`. With no infrequent categories the
/// mapping is the identity `0..n`. `infrequent` must be sorted ascending and
/// hold distinct indices below `n`; under that precondition the function never
/// panics (R-CODE-2).
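///
/// # Examples
///
/// A minimal standalone sketch of the mapping described above, written with a
/// linear `contains` scan for brevity. `remap_codes` is a hypothetical name
/// used only for illustration.
///
/// ```rust
/// // Hypothetical helper: category index -> emitted ordinal code.
/// fn remap_codes(n: usize, infrequent: &[usize]) -> Vec<usize> {
///     let n_frequent = n.saturating_sub(infrequent.len());
///     let mut next = 0usize;
///     (0..n)
///         .map(|idx| {
///             if infrequent.contains(&idx) {
///                 // Every infrequent category shares the trailing code.
///                 n_frequent
///             } else {
///                 // Frequent categories take 0, 1, 2, ... in index order.
///                 let code = next;
///                 next += 1;
///                 code
///             }
///         })
///         .collect()
/// }
/// ```
///
/// For 4 categories with infrequent indices `[1, 3]` the sketch yields
/// `[0, 2, 1, 2]`; with no infrequent indices it yields the identity.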
748fn build_infrequent_map(n: usize, infrequent: &[usize]) -> Vec<usize> {
749 if infrequent.is_empty() {
750 return (0..n).collect();
751 }
752 let n_frequent = n - infrequent.len();
753 let mut map = vec![n_frequent; n];
754 let mut next_frequent = 0usize;
755 for (idx, slot) in map.iter_mut().enumerate() {
756 if infrequent.binary_search(&idx).is_ok() {
757 // Infrequent → the trailing code (already set to `n_frequent`).
758 } else {
759 *slot = next_frequent;
760 next_frequent += 1;
761 }
762 }
763 map
764}
765
766// ---------------------------------------------------------------------------
767// Trait implementations
768// ---------------------------------------------------------------------------
769
770impl Fit<Array2<String>, ()> for OrdinalEncoder {
771 type Fitted = FittedOrdinalEncoder;
772 type Error = FerroError;
773
774 /// Fit the encoder by building per-column category-to-index mappings.
775 ///
776 /// With the default `categories='auto'` ([`Categories::Auto`]), categories
777 /// are recorded in **lexicographic order** in each column, matching
778 /// scikit-learn's `OrdinalEncoder.categories_`.
779 ///
780 /// With explicit categories ([`Categories::Explicit`], set via
781 /// [`OrdinalEncoder::with_categories`]), the user-provided lists are used in
782 /// the **given order** (NOT re-sorted), and the ordinal indices follow that
783 /// order, mirroring scikit-learn (`sklearn/preprocessing/_encoders.py:114`).
784 ///
785 /// # Errors
786 ///
787 /// Returns [`FerroError::InsufficientSamples`] if the input has zero rows.
788 ///
789 /// Returns [`FerroError::ShapeMismatch`] if explicit categories are set but
790 /// the number of category lists differs from the number of input columns
791 /// (sklearn `_encoders.py:85-89` "Shape mismatch: if categories is an array,
792 /// it has to be of shape (n_features,).").
793 ///
794 /// Returns [`FerroError::InvalidParameter`] if an explicit category list
795 /// contains duplicate elements (sklearn `_encoders.py:136-141`), or — under
796 /// the default [`HandleUnknown::Error`] — if a value seen in the data is not
797 /// in its column's explicit list (sklearn `_encoders.py:153-160` "Found
798 /// unknown categories ... during fit"; SKIPPED under
799 /// [`HandleUnknown::UseEncodedValue`]).
800 ///
    /// Returns [`FerroError::InvalidParameter`] for the `handle_unknown` /
    /// `unknown_value` validation failures (mirroring scikit-learn's
    /// `TypeError`/`ValueError` at `_encoders.py:1473-1526`): selecting
    /// [`HandleUnknown::UseEncodedValue`] without an `unknown_value`; setting an
    /// `unknown_value` while in [`HandleUnknown::Error`] mode; a non-NaN
    /// `unknown_value` with a fractional part; or an `unknown_value` that
    /// collides with an already-used encoding index.
    ///
    /// Returns [`FerroError::InvalidParameter`] if `min_frequency` or
    /// `max_categories` is `Some(0)`, or if an explicit category list is empty.
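    ///
    /// # Examples
    ///
    /// A minimal standalone sketch of the `categories='auto'` ordering for one
    /// column. It uses a `BTreeSet` for brevity, which is an assumption of this
    /// sketch and not how `fit` is written; `sorted_unique` is a hypothetical
    /// helper name.
    ///
    /// ```rust
    /// use std::collections::BTreeSet;
    ///
    /// // Hypothetical helper: the sorted-unique categories of one column.
    /// fn sorted_unique(column: &[&str]) -> Vec<String> {
    ///     // A `BTreeSet` deduplicates and iterates in ascending `str` order.
    ///     column
    ///         .iter()
    ///         .copied()
    ///         .collect::<BTreeSet<&str>>()
    ///         .into_iter()
    ///         .map(String::from)
    ///         .collect()
    /// }
    /// ```
    ///
    /// For the column `["dog", "cat", "dog", "bird"]` the sketch returns
    /// `["bird", "cat", "dog"]`, so `bird` would take ordinal `0`.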
807 fn fit(&self, x: &Array2<String>, _y: &()) -> Result<FittedOrdinalEncoder, FerroError> {
808 // sklearn `_parameter_constraints` (`@_fit_context`, validated BEFORE the
809 // data): `min_frequency` and `max_categories` are each
810 // `Interval(Integral, 1, None)` — a value of 0 raises
811 // `InvalidParameterError` ("must be an int in the range [1, inf)").
812 // REQ-8, verified live: `OrdinalEncoder(min_frequency=0).fit` →
813 // InvalidParameterError. (handle_unknown is a type-safe Rust enum, so its
814 // StrOptions constraint is provided by the type system.)
815 if self.min_frequency == Some(0) {
816 return Err(FerroError::InvalidParameter {
817 name: "min_frequency".into(),
818 reason: "must be an int in the range [1, inf)".into(),
819 });
820 }
821 if self.max_categories == Some(0) {
822 return Err(FerroError::InvalidParameter {
823 name: "max_categories".into(),
824 reason: "must be an int in the range [1, inf)".into(),
825 });
826 }
827
828 let n_samples = x.nrows();
829 if n_samples == 0 {
830 return Err(FerroError::InsufficientSamples {
831 required: 1,
832 actual: 0,
833 context: "OrdinalEncoder::fit".into(),
834 });
835 }
836
837 // Validation (a)/(b) on the param SHAPE — independent of the data, but
838 // matching sklearn these are evaluated in `fit`, AFTER the 0-row
839 // `check_array` guard above and (for the collision check) AFTER the
840 // categories_ compute below. (a)/(b) map sklearn's `TypeError`
841 // (`_encoders.py:1481-1493`).
842 match (self.handle_unknown, self.unknown_value) {
843 // (a) use_encoded_value REQUIRES an unknown_value (an int or nan).
844 // sklearn: `not isinstance(unknown_value, Integral)` -> TypeError
845 // (`:1481`); `unknown_value is None` falls into that branch.
846 (HandleUnknown::UseEncodedValue, None) => {
847 return Err(FerroError::InvalidParameter {
848 name: "unknown_value".into(),
849 reason: "unknown_value should be set (an integer or NaN) when \
850 handle_unknown is 'use_encoded_value'"
851 .into(),
852 });
853 }
854 // (b) error-mode forbids a set unknown_value. sklearn: `:1488`
855 // `elif self.unknown_value is not None` -> TypeError.
856 (HandleUnknown::Error, Some(v)) => {
857 return Err(FerroError::InvalidParameter {
858 name: "unknown_value".into(),
859 reason: format!(
860 "unknown_value should only be set when handle_unknown is \
861 'use_encoded_value', got {v}"
862 ),
863 });
864 }
865 _ => {}
866 }
867
868 let n_features = x.ncols();
869 let mut categories = Vec::with_capacity(n_features);
870 let mut category_to_index = Vec::with_capacity(n_features);
871
872 match &self.categories {
873 // `categories='auto'` (default): per column, sorted-unique from the
874 // data (SHIPPED REQ-1, UNCHANGED). sklearn `_encoders.py:98-99`
875 // `result = _unique(Xi)`.
876 Categories::Auto => {
877 for j in 0..n_features {
878 // Collect unique categories then sort lexicographically so the
879 // assigned indices match sklearn's `OrdinalEncoder`, which
880 // documents `categories_ = sorted(unique(X[:, j]))`. (Older
881 // ferrolearn versions used first-seen order — #344.)
882 let mut unique: Vec<String> = Vec::new();
883 let mut seen_set: std::collections::HashSet<String> =
884 std::collections::HashSet::new();
885 for i in 0..n_samples {
886 let cat = &x[[i, j]];
887 if seen_set.insert(cat.clone()) {
888 unique.push(cat.clone());
889 }
890 }
891 unique.sort();
892
893 let map: HashMap<String, usize> = unique
894 .iter()
895 .enumerate()
896 .map(|(idx, s)| (s.clone(), idx))
897 .collect();
898
899 categories.push(unique);
900 category_to_index.push(map);
901 }
902 }
903 // `categories=[list, ...]` (explicit): use the user-provided lists in
904 // the GIVEN order (NOT re-sorted), mirroring sklearn `_encoders.py:84-160`.
905 Categories::Explicit(lists) => {
906 // sklearn (`_encoders.py:85-89`): the list count must match
907 // n_features, else ValueError -> map to `ShapeMismatch`.
908 if lists.len() != n_features {
909 return Err(FerroError::ShapeMismatch {
910 expected: vec![n_features],
911 actual: vec![lists.len()],
912 context: "Shape mismatch: if categories is an array, it has to be of \
913 shape (n_features,)."
914 .into(),
915 });
916 }
917
918 for (j, list) in lists.iter().enumerate() {
919 // sklearn (`_encoders.py:114-117`) indexes `cats[0]` on the
920 // provided list BEFORE the duplicate/subset checks, so an
921 // EMPTY explicit list raises `IndexError` at fit in BOTH
922 // handle_unknown modes (#2229). Reject it here (the
923 // use_encoded_value path would otherwise skip the subset
924 // check and silently fit an empty category set).
925 if list.is_empty() {
926 return Err(FerroError::InvalidParameter {
927 name: "categories".into(),
928 reason: format!(
929 "column {j} has an empty predefined category list; \
930 each feature needs at least one category"
931 ),
932 });
933 }
934 // sklearn (`_encoders.py:136-141`): a list with duplicate
935 // elements raises ValueError. Build the index map detecting
936 // duplicates in one pass (R-CODE-2: never panic).
937 let mut map: HashMap<String, usize> = HashMap::with_capacity(list.len());
938 for (idx, cat) in list.iter().enumerate() {
939 if map.insert(cat.clone(), idx).is_some() {
940 return Err(FerroError::InvalidParameter {
941 name: "categories".into(),
942 reason: format!(
943 "In column {j}, the predefined categories contain \
944 duplicate elements."
945 ),
946 });
947 }
948 }
949
950 // sklearn (`_encoders.py:153-160`): under handle_unknown='error'
951 // every value seen in the data must be present in the
952 // predefined list, else ValueError. Under 'use_encoded_value'
953 // this fit-time subset check is SKIPPED (out-of-set data is
954 // fine — encoded to `unknown_value` later at transform time).
955 if self.handle_unknown == HandleUnknown::Error {
956 for i in 0..n_samples {
957 let cat = &x[[i, j]];
958 if !map.contains_key(cat) {
959 return Err(FerroError::InvalidParameter {
960 name: "X".into(),
961 reason: format!(
962 "Found unknown categories [{cat}] in column {j} \
963 during fit"
964 ),
965 });
966 }
967 }
968 }
969
970 // Use the list AS-GIVEN (preserve order — do NOT sort).
971 categories.push(list.clone());
972 category_to_index.push(map);
973 }
974 }
975 }
976
977 // Infrequent grouping (REQ-8). When `min_frequency`/`max_categories` are
978 // set, fold the least-frequent categories of each feature into a single
979 // shared trailing ORDINAL code (the frequent categories keep codes
980 // `0..n_frequent` in their original sorted order). `categories` is NOT
981 // changed (all categories retained, sklearn keeps `categories_` whole and
982 // only remaps the emitted index, `_encoders.py:1289-1370`) — only the
983 // per-feature `infrequent_map` / `infrequent_indices_` / `n_frequent` are
984 // built. With grouping disabled the map is the identity and every feature
985 // has no infrequent categories.
986 let mut infrequent_indices_: Vec<Vec<usize>> = Vec::with_capacity(n_features);
987 let mut infrequent_map: Vec<Vec<usize>> = Vec::with_capacity(n_features);
988 let mut n_frequent: Vec<usize> = Vec::with_capacity(n_features);
989 if self.infrequent_enabled() {
990 for (j, cats) in categories.iter().enumerate() {
991 // Per-category training counts ALIGNED with `categories[j]`
992 // (sklearn `_unique(Xi, return_counts=True)`,
993 // `_encoders.py:99-102`). Built from the fit data through the
994 // category→index map, so it works for BOTH the Auto and Explicit
995 // category sets. (A datum not in an explicit list contributes no
996 // count — under `handle_unknown='error'` the subset check above
997 // already rejected it; under `use_encoded_value` it is an unknown
998 // that does not affect category frequencies.)
999 let map = &category_to_index[j];
1000 let mut counts = vec![0usize; cats.len()];
1001 for i in 0..n_samples {
1002 if let Some(&idx) = map.get(&x[[i, j]]) {
1003 counts[idx] += 1;
1004 }
1005 }
1006 let infreq = identify_infrequent(&counts, self.min_frequency, self.max_categories);
1007 let imap = build_infrequent_map(cats.len(), &infreq);
1008 n_frequent.push(cats.len() - infreq.len());
1009 infrequent_indices_.push(infreq);
1010 infrequent_map.push(imap);
1011 }
1012 } else {
1013 for cats in &categories {
1014 infrequent_indices_.push(Vec::new());
1015 infrequent_map.push((0..cats.len()).collect());
1016 n_frequent.push(cats.len());
1017 }
1018 }
1019
1020 // Validation (a'): sklearn (`_encoders.py:1481-1487`) requires
1021 // `unknown_value` to be an INTEGER or `np.nan` when
1022 // `handle_unknown='use_encoded_value'` — a non-integer float raises
1023 // `TypeError` BEFORE the range/collision check (#2221). `f64` cannot
1024 // express "integral", so a non-nan value with a fractional part is
1025 // rejected here.
1026 if self.handle_unknown == HandleUnknown::UseEncodedValue
1027 && let Some(v) = self.unknown_value
1028 && !v.is_nan()
1029 && v.fract() != 0.0
1030 {
1031 return Err(FerroError::InvalidParameter {
1032 name: "unknown_value".into(),
1033 reason: format!(
1034 "unknown_value should be an integer or np.nan when \
1035 handle_unknown is 'use_encoded_value', got {v}"
1036 ),
1037 });
1038 }
1039
1040 // Validation (c): collision of a non-nan integer unknown_value with an
1041 // already-used encoding index. sklearn (`_encoders.py:1518-1526`) loops
1042 // each column's cardinality and raises `ValueError` if
1043 // `0 <= unknown_value < cardinality`; that is equivalent to comparing
        // against the maximum cardinality. The earlier check (a'), mirroring
        // sklearn `:1481`, already guaranteed `unknown_value` is an integer or
        // nan, so only those reach this point: nan never collides, nor does a
        // negative value or one `>= max_cardinality`.
1048 if self.handle_unknown == HandleUnknown::UseEncodedValue
1049 && let Some(v) = self.unknown_value
1050 && !v.is_nan()
1051 && v.fract() == 0.0
1052 {
1053 // sklearn's collision check keys off the EFFECTIVE number of distinct
1054 // output codes per feature: with infrequent grouping a feature emits
1055 // `n_frequent + 1` codes (the shared infrequent index), so its
1056 // cardinality for the unknown_value collision is `n_frequent + 1`, NOT
1057 // `len(categories_)` (verified live: `min_frequency=2` over 4 cats →
1058 // 3 codes → `unknown_value=3` is accepted, `=2` collides). With
1059 // grouping disabled `n_frequent[j] == categories[j].len()` and there
1060 // is no infrequent code, so this reduces to the SHIPPED REQ-5 check.
1061 let max_cardinality = (0..n_features)
1062 .map(|j| n_frequent[j] + usize::from(!infrequent_indices_[j].is_empty()))
1063 .max()
1064 .unwrap_or(0);
1065 // `0 <= v < max_cardinality` with v an integer-valued f64.
1066 if v >= 0.0 && v < max_cardinality as f64 {
1067 return Err(FerroError::InvalidParameter {
1068 name: "unknown_value".into(),
1069 reason: format!(
1070 "The used value for unknown_value {v} is one of the \
1071 values already used for encoding the seen categories"
1072 ),
1073 });
1074 }
1075 }
1076
1077 Ok(FittedOrdinalEncoder {
1078 categories,
1079 category_to_index,
1080 handle_unknown: self.handle_unknown,
1081 unknown_value: self.unknown_value,
1082 infrequent_indices_,
1083 infrequent_map,
1084 n_frequent,
1085 })
1086 }
1087}
1088
1089impl Transform<Array2<String>> for FittedOrdinalEncoder {
1090 type Output = Array2<f64>;
1091 type Error = FerroError;
1092
1093 /// Transform string categories to ordinal indices, returned as `f64`.
1094 ///
    /// Each cell is the category's ordinal code cast to `f64` (its index in
    /// `categories_[j]`, remapped when infrequent grouping is active). The
    /// ordinal VALUES are unchanged from the integer mapping; only the output
1097 /// container dtype is `f64`, matching scikit-learn's
1098 /// `OrdinalEncoder(dtype=np.float64)` default
1099 /// (`sklearn/preprocessing/_encoders.py:1262`). A configurable non-float64
1100 /// output dtype (e.g. `int32`) is OUT OF SCOPE here — ferrolearn's output is
1101 /// the fixed sklearn DEFAULT `f64`; a `dtype` param is a follow-on design
1102 /// (blocker #1158). `f64` exactly represents every integer up to `2^53`, so
1103 /// the cast is lossless for any realistic category count.
1104 ///
1105 /// # Errors
1106 ///
    /// Returns [`FerroError::InsufficientSamples`] if the input has zero rows.
    ///
    /// Returns [`FerroError::ShapeMismatch`] if the number of columns does not
    /// match the number of features seen during fitting.
1109 ///
1110 /// Returns [`FerroError::InvalidParameter`] if any category was not seen
1111 /// during fitting AND `handle_unknown` is [`HandleUnknown::Error`] (the
1112 /// default). Under [`HandleUnknown::UseEncodedValue`], unknown categories
1113 /// are instead encoded as the configured `unknown_value` sentinel (which may
1114 /// be `f64::NAN`), matching sklearn `_encoders.py:1591`.
    fn transform(&self, x: &Array2<String>) -> Result<Array2<f64>, FerroError> {
        let n_features = self.categories.len();
        // sklearn `OrdinalEncoder.transform` -> `_transform` -> `_check_X` ->
        // `check_array` (`_encoders.py:45`) enforces a minimum of 1 sample BEFORE
        // the n_features comparison (#2220, symmetric with the 0-row fit guard).
        // A 0-row input raises "Found array with 0 sample(s) ... minimum of 1".
        if x.nrows() == 0 {
            return Err(FerroError::InsufficientSamples {
                required: 1,
                actual: 0,
                context: "FittedOrdinalEncoder::transform".into(),
            });
        }
        if x.ncols() != n_features {
            return Err(FerroError::ShapeMismatch {
                expected: vec![x.nrows(), n_features],
                actual: vec![x.nrows(), x.ncols()],
                context: "FittedOrdinalEncoder::transform".into(),
            });
        }

        let n_samples = x.nrows();
        let mut out = Array2::zeros((n_samples, n_features));

        for j in 0..n_features {
            let map = &self.category_to_index[j];
            // Per-feature infrequent remapping (REQ-8): a found category's
            // `categories[j]` index is routed through `infrequent_map[j]` to its
            // emitted ordinal code (a frequent category → its remapped slot
            // `0..n_frequent`, an infrequent category → the shared trailing code
            // `n_frequent`), mirroring sklearn `_map_infrequent_categories`
            // (`_encoders.py:402-452`: `X_int = np.take(mapping, X_int)`). With
            // grouping DISABLED `infrequent_map[j]` is the identity, so the code
            // equals `idx` and the SHIPPED REQ-2 behaviour is UNCHANGED.
            let imap = self.infrequent_map.get(j);
            for i in 0..n_samples {
                let cat = &x[[i, j]];
                match map.get(cat) {
                    // Route the category index through the infrequent map, then
                    // cast the resulting ordinal code to f64 (sklearn's float64
                    // default, `_encoders.py:1262`). Lossless: codes are < 2^53.
                    // Bounds-safe (R-CODE-2): if `imap` or its `idx` entry is
                    // missing, fall back to the raw `idx` instead of panicking;
                    // `imap` is expected to hold `categories[j].len()` entries.
                    Some(&idx) => {
                        let code = imap.and_then(|m| m.get(idx)).copied().unwrap_or(idx);
                        out[[i, j]] = ordinal_index_to_f64(code);
                    }
                    None => match self.handle_unknown {
                        // handle_unknown='use_encoded_value': write the sentinel
                        // (which may be NaN). sklearn `_encoders.py:1591`
                        // `X_trans[~X_mask] = self.unknown_value`. `fit`
                        // guaranteed `unknown_value` is `Some` in this mode, but
                        // we never panic (R-CODE-2): fall back to the Error path
                        // if it were somehow `None`.
                        HandleUnknown::UseEncodedValue => match self.unknown_value {
                            Some(v) => out[[i, j]] = v,
                            None => {
                                return Err(FerroError::InvalidParameter {
                                    name: format!("x[{i},{j}]"),
                                    reason: format!(
                                        "unknown category \"{cat}\" in column {j} and \
                                         no unknown_value configured"
                                    ),
                                });
                            }
                        },
                        // handle_unknown='error' (default): reject (SHIPPED
                        // REQ-2, UNCHANGED). sklearn raises ValueError
                        // "Found unknown categories ... during transform".
                        HandleUnknown::Error => {
                            return Err(FerroError::InvalidParameter {
                                name: format!("x[{i},{j}]"),
                                reason: format!("unknown category \"{cat}\" in column {j}"),
                            });
                        }
                    },
                }
            }
        }

        Ok(out)
    }
}

/// Implement `Transform` on the unfitted encoder to satisfy the
/// `FitTransform: Transform` supertrait bound.
impl Transform<Array2<String>> for OrdinalEncoder {
    type Output = Array2<f64>;
    type Error = FerroError;

    /// Always returns [`FerroError::InvalidParameter`]: the encoder must be
    /// fitted first.
    fn transform(&self, _x: &Array2<String>) -> Result<Array2<f64>, FerroError> {
        Err(FerroError::InvalidParameter {
            name: "OrdinalEncoder".into(),
            reason: "encoder must be fitted before calling transform; use fit() first".into(),
        })
    }
}

impl FitTransform<Array2<String>> for OrdinalEncoder {
    type FitError = FerroError;

    /// Fit the encoder on `x` and return the encoded output in one step.
    ///
    /// # Errors
    ///
    /// Propagates any error from fitting `x` (for example, zero rows) or from
    /// the subsequent [`FittedOrdinalEncoder::transform`] call.
    fn fit_transform(&self, x: &Array2<String>) -> Result<Array2<f64>, FerroError> {
        let fitted = self.fit(x, &())?;
        fitted.transform(x)
    }
}

// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------

#[cfg(test)]
mod tests {
    use super::*;
    use ndarray::Array2;

    /// Build an `(n, 2)` string array from `(col0, col1)` row tuples.
    fn make_2col(rows: &[(&str, &str)]) -> Array2<String> {
        let flat: Vec<String> = rows
            .iter()
            .flat_map(|(a, b)| [a.to_string(), b.to_string()])
            .collect();
        Array2::from_shape_vec((rows.len(), 2), flat).unwrap()
    }

    #[test]
    fn test_ordinal_encoder_basic() {
        let enc = OrdinalEncoder::new();
        let x = make_2col(&[
            ("cat", "small"),
            ("dog", "large"),
            ("cat", "medium"),
            ("bird", "small"),
        ]);
        let fitted = enc.fit(&x, &()).unwrap();

        // Categories are sorted lexicographically (sklearn convention).
        assert_eq!(fitted.categories()[0], vec!["bird", "cat", "dog"]);
        assert_eq!(fitted.categories()[1], vec!["large", "medium", "small"]);

        let encoded = fitted.transform(&x).unwrap();
        // Output container is `Array2<f64>` (sklearn's `dtype=np.float64`).
        assert_eq!(encoded[[0, 0]], 1.0); // "cat" -> 1 (lex pos)
        assert_eq!(encoded[[1, 0]], 2.0); // "dog" -> 2
        assert_eq!(encoded[[2, 0]], 1.0); // "cat" -> 1
        assert_eq!(encoded[[3, 0]], 0.0); // "bird" -> 0
        assert_eq!(encoded[[0, 1]], 2.0); // "small" -> 2
        assert_eq!(encoded[[1, 1]], 0.0); // "large" -> 0
        assert_eq!(encoded[[2, 1]], 1.0); // "medium" -> 1
        assert_eq!(encoded[[3, 1]], 2.0); // "small" -> 2
    }

    #[test]
    fn test_fit_transform_equivalence() {
        let enc = OrdinalEncoder::new();
        let x = make_2col(&[("a", "x"), ("b", "y"), ("a", "z")]);
        let via_ft = enc.fit_transform(&x).unwrap();
        let fitted = enc.fit(&x, &()).unwrap();
        let via_sep = fitted.transform(&x).unwrap();
        assert_eq!(via_ft, via_sep);
    }

    #[test]
    fn test_unknown_category_error() {
        let enc = OrdinalEncoder::new();
        let x_train = make_2col(&[("cat", "small"), ("dog", "large")]);
        let fitted = enc.fit(&x_train, &()).unwrap();
        let x_test = make_2col(&[("fish", "small")]);
        assert!(fitted.transform(&x_test).is_err());
    }

    #[test]
    fn test_shape_mismatch_error() {
        let enc = OrdinalEncoder::new();
        let x_train = make_2col(&[("a", "x")]);
        let fitted = enc.fit(&x_train, &()).unwrap();
        // Single-column input when 2 cols expected
        let x_bad = Array2::from_shape_vec((1, 1), vec!["a".to_string()]).unwrap();
        assert!(fitted.transform(&x_bad).is_err());
    }

    #[test]
    fn test_insufficient_samples_error() {
        let enc = OrdinalEncoder::new();
        let x: Array2<String> = Array2::from_shape_vec((0, 2), vec![]).unwrap();
        assert!(enc.fit(&x, &()).is_err());
    }

    #[test]
    fn test_unfitted_transform_error() {
        let enc = OrdinalEncoder::new();
        let x = make_2col(&[("a", "x")]);
        assert!(enc.transform(&x).is_err());
    }

    #[test]
    fn test_single_column() {
        let enc = OrdinalEncoder::new();
        let flat = vec![
            "red".to_string(),
            "green".to_string(),
            "blue".to_string(),
            "red".to_string(),
        ];
        let x = Array2::from_shape_vec((4, 1), flat).unwrap();
        let fitted = enc.fit(&x, &()).unwrap();
        // Lex order: blue (0), green (1), red (2)
        assert_eq!(fitted.categories()[0], vec!["blue", "green", "red"]);
        let encoded = fitted.transform(&x).unwrap();
        assert_eq!(encoded[[0, 0]], 2.0); // red
        assert_eq!(encoded[[1, 0]], 1.0); // green
        assert_eq!(encoded[[2, 0]], 0.0); // blue
        assert_eq!(encoded[[3, 0]], 2.0); // red
    }

    #[test]
    fn test_n_features() {
        let enc = OrdinalEncoder::new();
        let x = make_2col(&[("a", "x")]);
        let fitted = enc.fit(&x, &()).unwrap();
        assert_eq!(fitted.n_features(), 2);
    }

    #[test]
    fn test_lexicographic_order() {
        // Categories are sorted lexicographically to match sklearn (#344).
        let enc = OrdinalEncoder::new();
        let flat = vec!["zebra".to_string(), "ant".to_string(), "moose".to_string()];
        let x = Array2::from_shape_vec((3, 1), flat).unwrap();
        let fitted = enc.fit(&x, &()).unwrap();
        // ant < moose < zebra
        assert_eq!(fitted.categories()[0][0], "ant");
        assert_eq!(fitted.categories()[0][1], "moose");
        assert_eq!(fitted.categories()[0][2], "zebra");
    }
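
    // Sketch of two companion checks, not part of the original suite. They
    // assume only what `FittedOrdinalEncoder::transform` shows: the 0-row
    // guard runs before the column-count comparison, and each failure maps to
    // its documented `FerroError` variant. The test names are hypothetical.
    #[test]
    fn test_transform_zero_rows_checked_before_ncols() {
        let enc = OrdinalEncoder::new();
        let fitted = enc.fit(&make_2col(&[("a", "x")]), &()).unwrap();
        // Zero rows AND a wrong column count: the row guard is reached first.
        let empty: Array2<String> = Array2::from_shape_vec((0, 1), vec![]).unwrap();
        assert!(matches!(
            fitted.transform(&empty),
            Err(FerroError::InsufficientSamples { .. })
        ));
    }

    #[test]
    fn test_transform_error_variants() {
        let enc = OrdinalEncoder::new();
        let fitted = enc.fit(&make_2col(&[("cat", "small")]), &()).unwrap();
        // Unseen category under the default `handle_unknown` (error mode).
        assert!(matches!(
            fitted.transform(&make_2col(&[("fish", "small")])),
            Err(FerroError::InvalidParameter { .. })
        ));
        // One column supplied where two were fitted.
        let x_bad = Array2::from_shape_vec((1, 1), vec!["cat".to_string()]).unwrap();
        assert!(matches!(
            fitted.transform(&x_bad),
            Err(FerroError::ShapeMismatch { .. })
        ));
    }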
}