ferrolearn-preprocess 0.5.0

//! Ordinal encoder: map string categories to integer indices.
//!
//! By default, each column's categories are mapped to integers `0, 1, 2, ...` in
//! **lexicographic order** (matching scikit-learn's `OrdinalEncoder`); explicit
//! per-column category lists are instead used in the order given.
//! Unknown categories seen during `transform` produce an error by default
//! (`handle_unknown='error'`); with `handle_unknown='use_encoded_value'` they
//! are instead encoded as a configurable `unknown_value` sentinel.
//!
//! # `## REQ status`
//!
//! Binary (R-DEFER-2), translating `sklearn/preprocessing/_encoders.py` (`class OrdinalEncoder`
//! `:1235`). Design doc: `.design/preprocess/ordinal_encoder.md`. Expected values from the live
//! sklearn 1.5.2 oracle (R-CHAR-3). Consumer: crate re-export (`lib.rs:121`, grandfathered S5).
//! HONEST (R-HONEST-3): a FAITHFUL String-only ordinal encoder — `categories_`=sorted-unique and
//! the ordinal VALUES match sklearn bit-for-bit on the string path; the output container is now
//! `Array2<f64>` (sklearn's `dtype=np.float64` default, `:1262`); remaining divergences are
//! String-only input, the absent configurable `dtype` param, and the rest of the param/feature
//! surface.
//!
//! | REQ | Status | Evidence |
//! |---|---|---|
//! | REQ-1 (string fit → sorted-unique categories_) | SHIPPED | `Fit::fit` per column → `categories_`=sorted-unique (`Vec<String>::sort`, lexicographic) + index map; rejects 0 rows (`InsufficientSamples`, matches sklearn `check_array`). Mirrors `_BaseEncoder._fit` `categories_=_unique(Xi)` (`_encoders.py:99`). Critic-verified vs live oracle: `green_value_match_and_categories` (`[['bird','cat','dog'],['large','medium','small']]`), `green_lexicographic_sort_matches_np_unique` + `green_non_ascii_codepoint_order` (== `np.unique`), `green_empty_fit_rejected_matches_sklearn`. Consumer: re-export `lib.rs:121`. |
//! | REQ-2 (transform + fit_transform, ordinal values + unknown rejection) | SHIPPED | `Transform::transform` maps category→ordinal index (now cast to `f64` via `ordinal_index_to_f64`), unknown → `InvalidParameter` (matches `handle_unknown='error'` default `ValueError`), ncols-mismatch → `ShapeMismatch`. The unknown/ncols-mismatch LOGIC is byte-for-byte UNCHANGED by the dtype fix. Critic-verified: ordinal VALUES `[[1.,2.],[2.,0.],[1.,1.],[0.,2.]]` == live oracle, `green_unknown_category_rejected`, `green_fit_transform_equals_oracle`. Consumer: re-export `lib.rs:142`. |
//! | REQ-3 (output dtype float64) | SHIPPED | `Transform::Output = Array2<f64>` on BOTH `Transform` impls (`FittedOrdinalEncoder` + the unfitted `OrdinalEncoder` shim) and `FitTransform::fit_transform`; each cell is the ordinal index cast via `ordinal_index_to_f64` (`idx as f64`, lossless < 2^53), matching sklearn's default `dtype=np.float64` output container (`_encoders.py:1262`, `transform` casts `X_int.astype(self.dtype)`). The REQ-1/REQ-2 fit + unknown-rejection LOGIC is unchanged. Critic-verified vs live oracle: `green_fit_transform_f64_oracle` (multi-feature f64 matrix), `green_exact_integer_index_to_f64` (index 10 → `10.0`), plus the value guards over `Array2<f64>`. A CONFIGURABLE non-float64 output dtype (`int32` etc.) is a FOLLOW-ON (blocker #1158 remains open for the `dtype` ctor param); ferrolearn's output is fixed to sklearn's `float64` DEFAULT. This unblocks REQ-5's float `unknown_value` sentinel. Consumer: crate re-export `lib.rs:142`. |
//! | REQ-4 (numeric/mixed-dtype input) | NOT-STARTED | open prereq blocker #1159. `Array2<String>`-only; sklearn accepts int/str/object (`np.unique` numeric sort). |
//! | REQ-5 (handle_unknown='use_encoded_value' + unknown_value) | SHIPPED | `HandleUnknown` enum `{ Error, UseEncodedValue }` (default `Error`) + `unknown_value: Option<f64>` on `OrdinalEncoder`, threaded into `FittedOrdinalEncoder` via `with_handle_unknown`/`with_unknown_value` builders. `Fit::fit` runs the 3 sklearn validations (`_encoders.py:1473-1526`) AFTER the unchanged `categories_` compute, mapping sklearn's `TypeError`/`ValueError` → `FerroError::InvalidParameter`: (a) `UseEncodedValue` && `unknown_value is None` (sklearn `:1481` `not isinstance(.,Integral)` TypeError); (b) `Error` && `unknown_value is Some` (sklearn `:1488` TypeError); (c) `UseEncodedValue` && non-nan integer `v` with `0 <= v < max_cardinality` (sklearn `:1518-1526` ValueError collision). `Transform::transform` branches unknown categories: `UseEncodedValue` → write `unknown_value` (incl. nan) (sklearn `:1591` `X_trans[~X_mask] = self.unknown_value`); `Error` → `InvalidParameter` (the SHIPPED REQ-2 default, UNCHANGED). Seen categories still map to `idx as f64` (UNCHANGED). NEVER panics (R-CODE-2). Critic-verified vs live sklearn 1.5.2 oracle: `green_use_encoded_value_minus_one`, `green_use_encoded_value_nan`, `green_use_encoded_value_multifeature`, `red_uev_requires_unknown_value`, `red_error_mode_forbids_unknown_value`, `red_unknown_value_collision_in_range`, `green_unknown_value_negative_or_oob_or_nan_ok`, `green_error_mode_unknown_still_rejected` (`tests/divergence_ordinal_encoder.rs`). Configurable `dtype`/`encoded_missing_value` interplay stays OUT OF SCOPE (REQ-3/REQ-6). Consumer: crate re-export `lib.rs:142`. |
//! | REQ-6 (encoded_missing_value / NaN) | NOT-STARTED | open prereq blocker #1161. No missing-value concept (`:1283`). |
//! | REQ-7 (explicit categories param) | SHIPPED | `Categories` enum `{ Auto, Explicit(Vec<Vec<String>>) }` (default `Auto`) + `#[must_use] OrdinalEncoder::with_categories(Vec<Vec<String>>)` builder + `categories_param()` getter (named to avoid colliding with `FittedOrdinalEncoder::categories`). `Fit::fit` branches on the param AFTER the 0-row guard: `Auto` → the SHIPPED REQ-1 sorted-unique compute (UNCHANGED); `Explicit(lists)` → use each `lists[j]` AS-GIVEN for `categories_[j]` (GIVEN order, NOT re-sorted) + the index map in that order, mirroring sklearn `_encoders.py:114` `cats = np.array(self.categories[i])`. Validations match `_BaseEncoder._fit`: list-count ≠ n_features → `ShapeMismatch` ("Shape mismatch: if categories is an array, it has to be of shape (n_features,)." `:85-89`); an EMPTY list → `InvalidParameter` (sklearn indexes `cats[0]` -> IndexError in both modes, `:114-117`, #2229); a list with duplicate elements → `InvalidParameter` ("In column {j}, the predefined categories contain duplicate elements." `:136-141`); under [`HandleUnknown::Error`] (default) a data value not in its column's list → `InvalidParameter` ("Found unknown categories [{v}] in column {j} during fit" `:153-160`), while under [`HandleUnknown::UseEncodedValue`] this fit-time subset check is SKIPPED (out-of-set data is encoded to `unknown_value` at transform). The REQ-5 unknown_value validations still apply (the `max_cardinality` collision check now keys off the explicit list lengths). `Transform`/`inverse_transform`/`categories()`/`get_feature_names_out` are UNCHANGED: they already read `categories_`/`category_to_index`, which now reflect the explicit given-order set. NEVER panics (R-CODE-2). Critic-verified vs live sklearn 1.5.2 oracle (`tests/divergence_ordinal_encoder.rs`): `green_explicit_given_order_not_sorted`, `green_explicit_unsorted_accepted`, `red_explicit_error_mode_data_not_in_cats_fits_err`, `green_explicit_use_encoded_value_out_of_set_ok`, `red_explicit_n_features_mismatch`, `green_explicit_multifeature_each_own_order`, `red_explicit_duplicate_categories`, `green_explicit_inverse_roundtrip_given_order`, `green_explicit_auto_still_default`. Consumer: crate re-export (`lib.rs:142`, `Categories` re-exported). Configurable numeric/`bytes` categories + the nan-last rule stay OUT OF SCOPE (String-only path, REQ-4/REQ-6). |
//! | REQ-8 (min_frequency/max_categories infrequent) | SHIPPED | #1163: `OrdinalEncoder::with_min_frequency`/`with_max_categories` (+`min_frequency()`/`max_categories()` getters) add the integer-count infrequent thresholds (`_encoders.py:1289-1315`). The OrdinalEncoder ANALOG of the SHIPPED OneHotEncoder REQ-5b (`one_hot_encoder.rs`): the SAME `_identify_infrequent` algorithm (reused as `identify_infrequent` + `build_infrequent_map`, mirroring `_BaseEncoder._identify_infrequent` `:275-318` + `_default_to_infrequent_mappings` `:373-400`: min_frequency `count < min_freq` FIRST, then max_categories keeps top `max_categories-1` by count via a STABLE argsort over the full count array, with ties favoring the LARGER index; `max_categories==1` → all infrequent), but the infrequent categories collapse to a single shared ORDINAL CODE `n_frequent` (NOT a one-hot column): frequent categories keep codes `0..n_frequent` in their original sorted order, every infrequent category emits `n_frequent`. `Fit::fit` runs the `_parameter_constraints` check FIRST (`min_frequency`/`max_categories` `Some(0)` → `InvalidParameter` "must be an int in the range [1, inf)", BEFORE the data, `Interval(Integral,1,None)`), then (after the SHIPPED `categories_` compute, UNCHANGED: `categories_` keeps ALL categories) builds per-feature `infrequent_indices_`/`infrequent_map`/`n_frequent` from the fit-data category counts. `FittedOrdinalEncoder::infrequent_categories()` exposes the infrequent VALUES per feature (`infrequent_categories_`, `:255-262`). `Transform` routes a found category index through `infrequent_map[j]` then casts to f64 (frequent → own code, infrequent → shared trailing code; `_map_infrequent_categories`, `:402-452`); with grouping DISABLED the map is the identity so REQ-2 is UNCHANGED. `inverse_transform`: a code `< n_frequent` → the frequent category at that remapped slot (via the frequent-only category list); a code `== n_frequent` (exact float equality on the raw label) → the REAL String `"infrequent_sklearn"` (`:1644`,`:1675-1677`), representable, UNLIKE OneHotEncoder's NaN proxy; the truncate+wrap numpy index logic applies over the frequent-only list (SHIPPED REQ-9 path UNCHANGED when disabled). The `unknown_value` collision check now keys off the EFFECTIVE code count `n_frequent + 1` (verified live: `min_frequency=2` over 4 cats → 3 codes → `unknown_value=3` accepted, `=2` collides). `get_feature_names_out` is UNCHANGED (OrdinalEncoder is one-to-one: infrequent does NOT add columns). NEVER panics (R-CODE-2). Critic-verified vs live sklearn 1.5.2 oracle (`tests/divergence_ordinal_encoder.rs`): `req8_min_frequency_two_categories_transform_inverse`, `req8_max_categories_keeps_top_k_minus_one`, `req8_max_categories_tiebreak_favors_larger_index`, `req8_both_set_multifeature_some_without_infrequent`, `req8_zero_thresholds_rejected`, `req8_infrequent_plus_use_encoded_value_distinct_codes`, `req8_unknown_value_collision_uses_effective_code_count`, `req8_disabled_default_unchanged`, `req8_inverse_infrequent_non_roundtrip`. Consumer: crate re-export `lib.rs:142`. STILL NOT-STARTED (R-HONEST-3): the FLOAT-fraction `min_frequency` (`:1296-1297`,`:297-299`) and the explicit-`categories`+infrequent interaction stay unimplemented. |
//! | REQ-9 (inverse_transform) | SHIPPED | `FittedOrdinalEncoder::inverse_transform(&Array2<f64>) -> Array2<String>` reuses the SHIPPED `categories_` (REQ-1): each cell is an ordinal index into `categories[j]`, mirroring sklearn `X_tr[:, i] = self.categories_[i][labels]` (`_encoders.py:1595-1679`). Validates the index BEFORE lookup (no panic, R-CODE-2): an exact non-negative integer in `[0, len)` → `categories[j][index].clone()`; 0-row → `InsufficientSamples` (symmetry with the #2220 transform guard); ncols-mismatch → `ShapeMismatch` (sklearn `:1619`). FAITHFUL to numpy: mirrors `labels.astype("int64")` (truncate toward zero, Rust `as i64`) + numpy fancy indexing (negative WRAP, `-1.0` → last category, `-2.0` → `len-2`), raising only once the wrapped index leaves `[0, len)` (`_encoders.py:1664`,`:1679`). Non-finite (NaN/±inf) → `InvalidParameter` (sklearn IndexError/ValueError; guarded because Rust `f64 as i64` saturates NaN→0). Critic-verified vs live sklearn 1.5.2 oracle: `green_inverse_roundtrip_multifeature`, `green_inverse_held_out_valid_ordinals`, `green_inverse_negative_wraps_like_numpy` (`-1.0`→'dog', `-2.0`→'cat', `-3.0`→Err), `green_inverse_non_integer_truncates_like_numpy` (`1.5`→'dog', `0.7`→'cat'), `red_inverse_out_of_range_positive` (`9.0`→Err), `red_inverse_ncols_mismatch`, `red_inverse_zero_row`, `red_inverse_use_encoded_value_unknown_cell` (`tests/divergence_ordinal_encoder.rs`). SCOPE LIMITATION (R-HONEST-3): the `unknown_value`-cell → `None` inverse (sklearn `:1673`) is unrepresentable in `Array2<String>` (would need `Array2<Option<String>>`), so a `use_encoded_value` cell equal to `unknown_value` ERRORS (checked BEFORE the index logic so the sentinel is not silently wrapped) instead of yielding `None`; the default `Error`-mode encoder has only valid ordinals so its inverse is COMPLETE and bit-exact. Consumer: crate re-export `lib.rs:142`. |
//! | REQ-10 (get_feature_names_out + n_features_in_) | SHIPPED | `FittedOrdinalEncoder::n_features_in()` (= `n_features()`, sklearn `n_features_in_`) + `get_feature_names_out(input_features)` — `OneToOneFeatureMixin` (one output col per input col) returns the INPUT names unchanged: `None` -> `["x0","x1",..]` (`_check_feature_names_in`), `Some(names)` -> verbatim, a wrong-length `input_features` -> `ShapeMismatch` (sklearn ValueError). Live-oracle test `req10_feature_names_out_and_n_features_in` (`['x0','x1']`, `['a','b']`, wrong-length Err). feature_names_in_ (string input-name capture) stays NOT-STARTED (ferrolearn fit takes positional columns, no input names). Consumer: crate re-export `lib.rs:142`. |
//! | REQ-11 (full ctor + _parameter_constraints) | NOT-STARTED | open prereq blocker #1166. `new()` takes no params (`:1320-1386`). |
//! | REQ-12 (PyO3 binding) | SHIPPED | `_RsOrdinalEncoder` (hand `#[pyclass]`, `ferrolearn-python/src/extras.rs`) over `OrdinalEncoder`/`FittedOrdinalEncoder`/`HandleUnknown`/`Categories`, the FIRST STRING-INPUT binding: `fit(rows)`/`transform(rows)` take a Python `list[list[str]]` (PyO3 `Vec<Vec<String>>` extraction, NOT a numpy f64 array), validate rectangular rows (ragged → `PyValueError`), build `Array2<String>` via `Array2::from_shape_vec`, and `transform` returns `PyArray2<f64>`; `inverse_transform(PyReadonlyArray2<f64>)` returns the `Array2<String>` rows as `Vec<Vec<String>>` (the `use_encoded_value`→None inverse ERRORS, REQ-9 scope → `PyValueError`). Ctor knobs `handle_unknown="error"` (`resolve_handle_unknown`: "error"→`Error`, "use_encoded_value"→`UseEncodedValue`, bad→`PyValueError` per `_encoders.py:1425`), `unknown_value: Option<f64>=None`, `categories: Option<Vec<Vec<String>>>=None` (None→`Auto`, Some→`Explicit`); the REQ-5/REQ-7 fit validations (`OrdinalEncoder::fit`) surface as `FerroError`→`PyValueError`. `#[getter]`s `categories_` (PyList of str lists), `n_features_in_`, `feature_names_out` (`get_feature_names_out(None)`). Registered `lib.rs` `m.add_class::<extras::RsOrdinalEncoder>()`. Non-test production consumer (R-DEFER-1): `_extras.py::OrdinalEncoder(BaseEstimator)`, a CUSTOM class (NOT `_TransformerWrapper`, input is str), full 7-key keyword-only ctor (`categories`/`dtype`/`handle_unknown`/`unknown_value`/`encoded_missing_value`/`min_frequency`/`max_categories`, `_encoders.py:1435-1452`) for `get_params`/`clone`, `_to_rows` (numpy str/object array OR list-of-lists → `list[list[str]]` via `np.asarray(X).astype(str).tolist()`), `_check_unsupported` (non-NaN `encoded_missing_value` REQ-6 / `min_frequency`/`max_categories` REQ-8 / non-f64 `dtype` REQ-3 → `NotImplementedError`), `fit`/`transform`/`fit_transform`/`inverse_transform` (→ numpy object array)/`get_feature_names_out`, `@property` `categories_`/`n_features_in_`, pre-fit access → `NotFittedError` (`check_is_fitted(self, "_rs")`); re-exported in `ferrolearn/__init__.py` as `ferrolearn.OrdinalEncoder`. Live-oracle parity (R-CHAR-3, sklearn 1.5.2, `tests/divergence_ordinal_encoder_py.py`, 19 pass): `fit_transform([['cat'],['dog'],['cat']])==[[0.],[1.],[0.]]`==sklearn, `categories_`==sklearn sorted-unique, multi-feature, inverse_transform roundtrip==original, `use_encoded_value`/`unknown_value=-1`→-1.0, explicit `categories=[['dog','cat','bird']]`→given-order index, `n_features_in_`, `get_feature_names_out`→`['x0','x1']` (+ input_features pass-through), pre-fit `NotFittedError`, bad `handle_unknown`→`ValueError`, unsupported (`encoded_missing_value`/`min_frequency`/`max_categories`/`dtype`)→`NotImplementedError`, 7-key get_params==sklearn, `clone`, numpy object/str-array input. STRING-only input (REQ-4 #1159), the `use_encoded_value`→None inverse (REQ-9), and the rest of the param surface stay OUT OF SCOPE (R-HONEST-3). |
//! | REQ-13 (ferray substrate) | NOT-STARTED | open prereq blocker #1168. `ndarray`+`HashMap`, not `ferray-core` (R-SUBSTRATE-1/2). |
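//!
//! The REQ-9 truncate-then-wrap lookup can be illustrated with a standalone
//! sketch. `resolve_label` is a hypothetical helper written for this example,
//! not an item of this crate; it only assumes the behaviour described in the
//! table (truncate toward zero, wrap a negative index once, reject non-finite
//! and out-of-range labels), applied to the two-category column `["cat", "dog"]`.
//!
//! ```rust
//! // Hypothetical helper for illustration only; not this crate's code.
//! fn resolve_label(label: f64, len: usize) -> Option<usize> {
//!     if !label.is_finite() {
//!         return None; // NaN and +/-inf are rejected before any cast
//!     }
//!     let truncated = label as i64; // `as` truncates toward zero
//!     let len = len as i64;
//!     // numpy-style indexing: a negative index wraps once from the end.
//!     let wrapped = if truncated < 0 { truncated + len } else { truncated };
//!     if (0..len).contains(&wrapped) {
//!         Some(wrapped as usize)
//!     } else {
//!         None
//!     }
//! }
//!
//! let cats = ["cat", "dog"];
//! let lookup = |label: f64| resolve_label(label, cats.len()).map(|i| cats[i]);
//! assert_eq!(lookup(-1.0), Some("dog")); // wraps to the last category
//! assert_eq!(lookup(-2.0), Some("cat")); // wraps to index 0
//! assert_eq!(lookup(-3.0), None); // still negative after one wrap
//! assert_eq!(lookup(1.5), Some("dog")); // truncates to 1
//! assert_eq!(lookup(0.7), Some("cat")); // truncates to 0
//! assert_eq!(lookup(9.0), None); // past the end
//! assert_eq!(lookup(f64::NAN), None); // non-finite
//! ```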

use ferrolearn_core::error::FerroError;
use ferrolearn_core::traits::{Fit, FitTransform, Transform};
use ndarray::Array2;
use std::collections::HashMap;

// ---------------------------------------------------------------------------
// HandleUnknown
// ---------------------------------------------------------------------------

/// How [`OrdinalEncoder`] treats categories at `transform` time that were not
/// seen during `fit`.
///
/// Mirrors scikit-learn's `OrdinalEncoder(handle_unknown=...)` parameter
/// (`sklearn/preprocessing/_encoders.py:1262`), which accepts `'error'` and
/// `'use_encoded_value'`.
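///
/// # Examples
///
/// A minimal sketch contrasting the two modes. It assumes the module path
/// `ferrolearn_preprocess::ordinal_encoder` and the `fit(&data, &())` call
/// shape used by the [`OrdinalEncoder`] example in this file; the sentinel
/// `-1.0` is an arbitrary illustrative choice.
///
/// ```rust
/// use ferrolearn_preprocess::ordinal_encoder::{HandleUnknown, OrdinalEncoder};
/// use ferrolearn_core::traits::{Fit, Transform};
/// use ndarray::Array2;
///
/// let train =
///     Array2::from_shape_vec((2, 1), vec!["cat".to_string(), "dog".to_string()]).unwrap();
/// // "bird" was never seen during `fit`.
/// let test =
///     Array2::from_shape_vec((2, 1), vec!["dog".to_string(), "bird".to_string()]).unwrap();
///
/// // Default mode: an unknown category is an error.
/// let strict = OrdinalEncoder::new().fit(&train, &()).unwrap();
/// assert!(strict.transform(&test).is_err());
///
/// // Sentinel mode: the unknown category is written as `unknown_value`.
/// let lenient = OrdinalEncoder::new()
///     .with_handle_unknown(HandleUnknown::UseEncodedValue)
///     .with_unknown_value(-1.0)
///     .fit(&train, &())
///     .unwrap();
/// let encoded = lenient.transform(&test).unwrap();
/// assert_eq!(encoded[[0, 0]], 1.0); // "dog" keeps its ordinal index
/// assert_eq!(encoded[[1, 0]], -1.0); // "bird" becomes the sentinel
/// ```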
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum HandleUnknown {
    /// Raise an error on any unknown category (scikit-learn's default
    /// `handle_unknown='error'`). This is also the default here.
    #[default]
    Error,
    /// Encode unknown categories with the configured `unknown_value` sentinel
    /// (scikit-learn's `handle_unknown='use_encoded_value'`). Requires
    /// `unknown_value` to be set.
    UseEncodedValue,
}

// ---------------------------------------------------------------------------
// Categories
// ---------------------------------------------------------------------------

/// How [`OrdinalEncoder`] determines, per column, the ordered category set used
/// to assign ordinal indices.
///
/// Mirrors scikit-learn's `OrdinalEncoder(categories=...)` parameter
/// (`sklearn/preprocessing/_encoders.py:1252`), which accepts `'auto'` or a
/// list of per-feature category lists.
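///
/// # Examples
///
/// A minimal sketch of the explicit form, assuming the module path
/// `ferrolearn_preprocess::ordinal_encoder` and the `fit(&data, &())` call
/// shape used by the [`OrdinalEncoder`] example in this file. The category
/// list is illustrative; the ordinal indices follow its given order rather
/// than the lexicographic one.
///
/// ```rust
/// use ferrolearn_preprocess::ordinal_encoder::OrdinalEncoder;
/// use ferrolearn_core::traits::{Fit, Transform};
/// use ndarray::Array2;
///
/// let data =
///     Array2::from_shape_vec((2, 1), vec!["cat".to_string(), "dog".to_string()]).unwrap();
/// let given = vec![vec!["dog".to_string(), "cat".to_string(), "bird".to_string()]];
/// let fitted = OrdinalEncoder::new()
///     .with_categories(given)
///     .fit(&data, &())
///     .unwrap();
///
/// // The learned categories keep the supplied order.
/// assert_eq!(fitted.categories()[0], vec!["dog", "cat", "bird"]);
///
/// let encoded = fitted.transform(&data).unwrap();
/// assert_eq!(encoded[[0, 0]], 1.0); // "cat" is second in the given list
/// assert_eq!(encoded[[1, 0]], 0.0); // "dog" is first in the given list
/// ```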
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub enum Categories {
    /// Determine the categories automatically from the training data as the
    /// sorted-unique values per column (scikit-learn's default `categories='auto'`).
    #[default]
    Auto,
    /// Use the explicit, user-provided category lists. `lists[j]` in
    /// `Explicit(lists)` is the ordered category set for column `j`, used **as
    /// given** (the order is preserved, NOT re-sorted), mirroring scikit-learn's
    /// `categories=[list, ...]` (`_encoders.py:114`:
    /// `cats = np.array(self.categories[i])`, taken as-is).
    Explicit(Vec<Vec<String>>),
}

// ---------------------------------------------------------------------------
// OrdinalEncoder (unfitted)
// ---------------------------------------------------------------------------

/// An unfitted ordinal encoder.
///
/// Calling [`Fit::fit`] on an `Array2<String>` learns, for each column, a
/// mapping from the unique string categories (sorted lexicographically)
/// to consecutive integers `0, 1, 2, ...`, and returns a
/// [`FittedOrdinalEncoder`].
///
/// Unknown categories at `transform` time are, by default, rejected
/// ([`HandleUnknown::Error`]). Configuring
/// [`with_handle_unknown`](OrdinalEncoder::with_handle_unknown) with
/// [`HandleUnknown::UseEncodedValue`] plus
/// [`with_unknown_value`](OrdinalEncoder::with_unknown_value) instead encodes
/// unknown categories as the supplied sentinel (which may be `f64::NAN`),
/// matching scikit-learn's `OrdinalEncoder(handle_unknown='use_encoded_value')`.
///
/// # Examples
///
/// ```
/// use ferrolearn_preprocess::ordinal_encoder::OrdinalEncoder;
/// use ferrolearn_core::traits::{Fit, Transform};
/// use ndarray::Array2;
///
/// let enc = OrdinalEncoder::new();
/// let data = Array2::from_shape_vec(
///     (3, 2),
///     vec![
///         "cat".to_string(), "small".to_string(),
///         "dog".to_string(), "large".to_string(),
///         "cat".to_string(), "small".to_string(),
///     ],
/// ).unwrap();
/// let fitted = enc.fit(&data, &()).unwrap();
/// let encoded = fitted.transform(&data).unwrap();
/// // Output is `Array2<f64>`, matching sklearn's `dtype=np.float64` default.
/// assert_eq!(encoded[[0, 0]], 0.0); // "cat" is index 0 in col 0
/// assert_eq!(encoded[[1, 0]], 1.0); // "dog" is index 1 in col 0
/// ```
#[derive(Debug, Clone, Default)]
pub struct OrdinalEncoder {
    /// How the per-column category sets are determined ([`Categories::Auto`] =
    /// sorted-unique from the data, the default; [`Categories::Explicit`] =
    /// user-provided lists used in the given order).
    categories: Categories,
    /// Strategy for unknown categories at `transform` time.
    handle_unknown: HandleUnknown,
    /// Sentinel written for unknown categories when `handle_unknown` is
    /// [`HandleUnknown::UseEncodedValue`]. May be `f64::NAN`.
    unknown_value: Option<f64>,
    /// Minimum frequency (count) below which a category is grouped into the
    /// single trailing "infrequent" ordinal index for that feature
    /// (`min_frequency`). `None` (the default) disables the min-frequency
    /// threshold. Mirrors scikit-learn's `OrdinalEncoder(min_frequency=...)`
    /// (`sklearn/preprocessing/_encoders.py:1289-1297`). SCOPE (R-HONEST-3):
    /// only the integer-count form is supported — sklearn also accepts a FLOAT
    /// fraction `min_frequency * n_samples` (`:1296-1297`,`:297-299`), which is
    /// NOT-STARTED here.
    min_frequency: Option<usize>,
    /// Upper limit on the number of output ordinal codes per feature when
    /// grouping infrequent categories (`max_categories`); the infrequent group
    /// itself counts toward this limit. `None` (the default) imposes no limit.
    /// Mirrors scikit-learn's `OrdinalEncoder(max_categories=...)`
    /// (`sklearn/preprocessing/_encoders.py:1301-1315`).
    max_categories: Option<usize>,
}

impl OrdinalEncoder {
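    /// Create a new `OrdinalEncoder` with scikit-learn's defaults
    /// (`categories='auto'`, `handle_unknown='error'`, no `unknown_value`, and
    /// infrequent grouping disabled: `min_frequency` and `max_categories` unset).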
    #[must_use]
    pub fn new() -> Self {
        Self {
            categories: Categories::Auto,
            handle_unknown: HandleUnknown::Error,
            unknown_value: None,
            min_frequency: None,
            max_categories: None,
        }
    }

    /// Set the explicit per-column category lists (`categories=[list, ...]`).
    ///
    /// Each `lists[j]` is the ordered category set for column `j`, used **as
    /// given** at `fit` time — the order is preserved (NOT re-sorted), so the
    /// assigned ordinal indices follow the supplied order, matching
    /// scikit-learn's `OrdinalEncoder(categories=...)`
    /// (`sklearn/preprocessing/_encoders.py:114`).
    ///
    /// At `fit` time the number of lists must equal the number of input columns,
    /// no list may contain duplicates, and (under the default
    /// `handle_unknown='error'`) every value seen in the data must appear in its
    /// column's list; otherwise [`Fit::fit`] returns an error. See [`Fit::fit`]
    /// for the exact validation contract.
    #[must_use]
    pub fn with_categories(mut self, categories: Vec<Vec<String>>) -> Self {
        self.categories = Categories::Explicit(categories);
        self
    }

    /// Return the configured `categories` strategy ([`Categories::Auto`] or
    /// [`Categories::Explicit`]).
    ///
    /// Named `categories_param` to avoid colliding with
    /// [`FittedOrdinalEncoder::categories`], which returns the *learned*
    /// per-column category lists after fitting.
    #[must_use]
    pub fn categories_param(&self) -> &Categories {
        &self.categories
    }

    /// Set the unknown-category strategy (`handle_unknown`).
    ///
    /// With [`HandleUnknown::UseEncodedValue`] an `unknown_value` must also be
    /// supplied via [`with_unknown_value`](OrdinalEncoder::with_unknown_value);
    /// otherwise [`Fit::fit`] returns an error (matching scikit-learn's
    /// validation).
    #[must_use]
    pub fn with_handle_unknown(mut self, handle_unknown: HandleUnknown) -> Self {
        self.handle_unknown = handle_unknown;
        self
    }

    /// Set the sentinel written for unknown categories under
    /// [`HandleUnknown::UseEncodedValue`]. May be `f64::NAN`.
    ///
    /// Setting this while `handle_unknown` is [`HandleUnknown::Error`] causes
    /// [`Fit::fit`] to return an error (matching scikit-learn's validation).
    #[must_use]
    pub fn with_unknown_value(mut self, unknown_value: f64) -> Self {
        self.unknown_value = Some(unknown_value);
        self
    }

    /// Return the configured unknown-category strategy.
    #[must_use]
    pub fn handle_unknown(&self) -> HandleUnknown {
        self.handle_unknown
    }

    /// Return the configured unknown-category sentinel, if any.
    #[must_use]
    pub fn unknown_value(&self) -> Option<f64> {
        self.unknown_value
    }

    /// Set the minimum-frequency threshold for infrequent grouping
    /// (`min_frequency`, integer count).
    ///
    /// At `fit` time a category whose count in the training data is **strictly
    /// less than** `min_frequency` is grouped with the other infrequent
    /// categories into a single trailing ordinal index `n_frequent` for that
    /// feature (the frequent categories keep ordinal indices `0..n_frequent` in
    /// their original sorted order), matching scikit-learn's
    /// `OrdinalEncoder(min_frequency=...)` integer form
    /// (`sklearn/preprocessing/_encoders.py:1289-1297`, `_identify_infrequent`
    /// `:295-296` `category_count < self.min_frequency`).
    ///
    /// Unlike [`crate::OneHotEncoder`], the infrequent group collapses to ONE
    /// **ordinal index** (not a one-hot column), so `categories_` is unchanged
    /// (all categories retained) — only the emitted ordinal code is shared.
    ///
    /// SCOPE (R-HONEST-3): only the integer-count form is supported. sklearn
    /// also accepts a FLOAT `min_frequency` interpreted as the fraction
    /// `min_frequency * n_samples` (`_encoders.py:1296-1297`,`:297-299`); the
    /// float-fraction form is NOT-STARTED here.
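    ///
    /// # Examples
    ///
    /// A minimal sketch of the integer-count threshold, assuming the module
    /// path `ferrolearn_preprocess::ordinal_encoder` and the `fit(&data, &())`
    /// call shape used by the [`OrdinalEncoder`] example in this file. The
    /// expected codes follow from the rule stated above, applied to an
    /// illustrative column with counts `a=2, b=1, c=1`.
    ///
    /// ```rust
    /// use ferrolearn_preprocess::ordinal_encoder::OrdinalEncoder;
    /// use ferrolearn_core::traits::{Fit, Transform};
    /// use ndarray::Array2;
    ///
    /// let col: Vec<String> = ["a", "a", "b", "c"].iter().map(|s| s.to_string()).collect();
    /// let data = Array2::from_shape_vec((4, 1), col).unwrap();
    /// let fitted = OrdinalEncoder::new()
    ///     .with_min_frequency(2)
    ///     .fit(&data, &())
    ///     .unwrap();
    ///
    /// let encoded = fitted.transform(&data).unwrap();
    /// assert_eq!(encoded[[0, 0]], 0.0); // "a" is the only frequent category
    /// assert_eq!(encoded[[2, 0]], 1.0); // "b" takes the shared infrequent code
    /// assert_eq!(encoded[[3, 0]], 1.0); // "c" takes the same code
    ///
    /// // All three categories are retained; only the emitted code is shared.
    /// assert_eq!(fitted.categories()[0].len(), 3);
    /// assert_eq!(fitted.infrequent_categories()[0], vec!["b", "c"]);
    /// ```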
    #[must_use]
    pub fn with_min_frequency(mut self, min_frequency: usize) -> Self {
        self.min_frequency = Some(min_frequency);
        self
    }

    /// Set the maximum number of output ordinal codes per feature for infrequent
    /// grouping (`max_categories`).
    ///
    /// At `fit` time, if a feature would otherwise produce more than
    /// `max_categories` distinct ordinal codes, the least-frequent categories
    /// are grouped into the single trailing infrequent index so the number of
    /// codes is at most `max_categories` (the infrequent group itself counts
    /// toward the limit). Mirrors scikit-learn's
    /// `OrdinalEncoder(max_categories=...)`
    /// (`sklearn/preprocessing/_encoders.py:1301-1315`, `_identify_infrequent`
    /// `:303-315`).
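    ///
    /// # Examples
    ///
    /// A minimal sketch, assuming the module path
    /// `ferrolearn_preprocess::ordinal_encoder` and the `fit(&data, &())` call
    /// shape used by the [`OrdinalEncoder`] example in this file. With the
    /// illustrative counts `a=3, b=2, c=1` and `max_categories = 2`, one code
    /// is left for a frequent category and one for the infrequent group.
    ///
    /// ```rust
    /// use ferrolearn_preprocess::ordinal_encoder::OrdinalEncoder;
    /// use ferrolearn_core::traits::{Fit, Transform};
    /// use ndarray::Array2;
    ///
    /// let col: Vec<String> = ["a", "a", "a", "b", "b", "c"]
    ///     .iter()
    ///     .map(|s| s.to_string())
    ///     .collect();
    /// let data = Array2::from_shape_vec((6, 1), col).unwrap();
    /// let fitted = OrdinalEncoder::new()
    ///     .with_max_categories(2)
    ///     .fit(&data, &())
    ///     .unwrap();
    ///
    /// let encoded = fitted.transform(&data).unwrap();
    /// assert_eq!(encoded[[0, 0]], 0.0); // "a", the most frequent, keeps its own code
    /// assert_eq!(encoded[[3, 0]], 1.0); // "b" falls into the infrequent group
    /// assert_eq!(encoded[[5, 0]], 1.0); // "c" falls into the same group
    /// ```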
    #[must_use]
    pub fn with_max_categories(mut self, max_categories: usize) -> Self {
        self.max_categories = Some(max_categories);
        self
    }

    /// Return the configured minimum-frequency threshold (`min_frequency`), or
    /// `None` if infrequent grouping by frequency is disabled.
    #[must_use]
    pub fn min_frequency(&self) -> Option<usize> {
        self.min_frequency
    }

    /// Return the configured maximum ordinal-code limit (`max_categories`), or
    /// `None` if no limit is imposed.
    #[must_use]
    pub fn max_categories(&self) -> Option<usize> {
        self.max_categories
    }

    /// Whether infrequent grouping is enabled: `min_frequency` is set, or
    /// `max_categories` is set to at least 1. Mirrors scikit-learn's
    /// `_infrequent_enabled` (`_encoders.py:271-273`):
    /// `(max_categories is not None and max_categories >= 1) or min_frequency is not None`.
    fn infrequent_enabled(&self) -> bool {
        self.min_frequency.is_some() || self.max_categories.is_some_and(|m| m >= 1)
    }
}

// ---------------------------------------------------------------------------
// FittedOrdinalEncoder
// ---------------------------------------------------------------------------

/// A fitted ordinal encoder holding per-column category-to-index mappings.
///
/// Created by calling [`Fit::fit`] on an [`OrdinalEncoder`].
#[derive(Debug, Clone)]
pub struct FittedOrdinalEncoder {
    /// Per-column ordered category lists (index = integer value).
    pub(crate) categories: Vec<Vec<String>>,
    /// Per-column category-to-index maps.
    pub(crate) category_to_index: Vec<HashMap<String, usize>>,
    /// Strategy for unknown categories at `transform` time (threaded from the
    /// unfitted [`OrdinalEncoder`]).
    pub(crate) handle_unknown: HandleUnknown,
    /// Sentinel for unknown categories under
    /// [`HandleUnknown::UseEncodedValue`] (threaded from the unfitted encoder;
    /// validated to be present in that mode during `fit`).
    pub(crate) unknown_value: Option<f64>,
    /// Per-feature indices into `categories[j]` of the categories grouped as
    /// **infrequent** (`min_frequency`/`max_categories`), sorted ascending.
    /// Mirrors scikit-learn's private `_infrequent_indices[j]`
    /// (`_encoders.py:336-340`,`:367-370`). Empty when feature `j` has no
    /// infrequent categories (sklearn's `None`); with infrequent grouping
    /// disabled every entry is empty. Length `categories.len()`. The categories
    /// themselves are NOT removed from `categories[j]` (unlike one-hot column
    /// dropping) — only their emitted ordinal code is folded.
    pub(crate) infrequent_indices_: Vec<Vec<usize>>,
    /// Per-feature mapping from a `categories[j]` index to its emitted ORDINAL
    /// code. Mirrors scikit-learn's `_default_to_infrequent_mappings[j]`
    /// (`_encoders.py:373-400`): a frequent category maps to its remapped slot
    /// `0..n_frequent` (frequent categories keep their original sorted order),
    /// every infrequent category maps to the single trailing index
    /// `n_frequent`. When feature `j` has no infrequent categories the mapping
    /// is the identity `0..len` (sklearn stores `None`; the identity is the
    /// representable equivalent). Length `categories.len()`, with
    /// `infrequent_map[j].len() == categories[j].len()`. Used by `transform`
    /// and `inverse_transform`.
    pub(crate) infrequent_map: Vec<Vec<usize>>,
    /// Per-feature number of frequent categories (`n_frequent`): the trailing
    /// infrequent ordinal index when feature `j` has infrequent categories.
    /// Equals `categories[j].len() - infrequent_indices_[j].len()`. When feature
    /// `j` has no infrequent categories this equals `categories[j].len()` (the
    /// identity map's range). Length `categories.len()`. Used by
    /// `inverse_transform` to recognise the shared infrequent code.
    pub(crate) n_frequent: Vec<usize>,
}

impl FittedOrdinalEncoder {
    /// Return the ordered category list for each column.
    ///
    /// `categories()[j][i]` is the category that maps to integer `i` in column `j`.
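    ///
    /// # Examples
    ///
    /// A minimal sketch with the default `Auto` categories, assuming the module
    /// path `ferrolearn_preprocess::ordinal_encoder` and the `fit(&data, &())`
    /// call shape used by the [`OrdinalEncoder`] example in this file. The
    /// two-column data is illustrative; each column's list comes back
    /// sorted-unique.
    ///
    /// ```rust
    /// use ferrolearn_preprocess::ordinal_encoder::OrdinalEncoder;
    /// use ferrolearn_core::traits::{Fit, Transform};
    /// use ndarray::Array2;
    ///
    /// let cells: Vec<String> = [
    ///     "cat", "small",
    ///     "dog", "large",
    ///     "cat", "medium",
    ///     "bird", "small",
    /// ]
    /// .iter()
    /// .map(|s| s.to_string())
    /// .collect();
    /// let data = Array2::from_shape_vec((4, 2), cells).unwrap();
    /// let fitted = OrdinalEncoder::new().fit(&data, &()).unwrap();
    ///
    /// assert_eq!(fitted.n_features(), 2);
    /// assert_eq!(fitted.categories()[0], vec!["bird", "cat", "dog"]);
    /// assert_eq!(fitted.categories()[1], vec!["large", "medium", "small"]);
    ///
    /// // The position in each list is the emitted ordinal value.
    /// let encoded = fitted.transform(&data).unwrap();
    /// assert_eq!(encoded[[0, 0]], 1.0); // "cat"
    /// assert_eq!(encoded[[0, 1]], 2.0); // "small"
    /// ```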
    #[must_use]
    pub fn categories(&self) -> &[Vec<String>] {
        &self.categories
    }

    /// Return the infrequent category **values** for each feature
    /// (`infrequent_categories_`).
    ///
    /// `infrequent_categories()[j]` is the sorted list of category values from
    /// `categories[j]` that were grouped into the single trailing "infrequent"
    /// ordinal code (because their training count fell below `min_frequency`
    /// and/or beyond the `max_categories` limit). An EMPTY inner `Vec` means
    /// feature `j` had no infrequent categories (scikit-learn returns `None`
    /// there; an empty list is the representable equivalent). With infrequent
    /// grouping disabled every entry is empty. Mirrors scikit-learn's
    /// `OrdinalEncoder.infrequent_categories_` (`_encoders.py:255-262`):
    /// `category[indices]` over `_infrequent_indices`.
    #[must_use]
    pub fn infrequent_categories(&self) -> Vec<Vec<String>> {
        self.infrequent_indices_
            .iter()
            .enumerate()
            .map(|(j, idxs)| {
                idxs.iter()
                    .filter_map(|&idx| self.categories.get(j).and_then(|c| c.get(idx)).cloned())
                    .collect()
            })
            .collect()
    }

    /// Return the number of input columns (features).
    #[must_use]
    pub fn n_features(&self) -> usize {
        self.categories.len()
    }

    /// Return the number of features seen during `fit`.
    ///
    /// Mirrors scikit-learn's `n_features_in_` attribute (set by `_validate_data`
    /// at fit, `sklearn/base.py`). Equal to [`n_features`](Self::n_features); the
    /// distinct name matches sklearn's fitted-attribute surface (REQ-10).
    #[must_use]
    pub fn n_features_in(&self) -> usize {
        self.categories.len()
    }

    /// Return the output feature names, one per input feature.
    ///
    /// `OrdinalEncoder` is a `OneToOneFeatureMixin` (one output column per input
    /// column), so `get_feature_names_out` returns the INPUT feature names
    /// unchanged (`OneToOneFeatureMixin.get_feature_names_out`, `sklearn/base.py`):
    /// with `input_features = None` the default names `["x0", "x1", ...]`
    /// (`_check_feature_names_in`), otherwise the supplied names verbatim.
    ///
    /// # Errors
    ///
    /// Returns [`FerroError::ShapeMismatch`] if `input_features` is `Some` but its
    /// length differs from [`n_features_in`](Self::n_features_in) (sklearn raises
    /// `ValueError("input_features should have length equal to number of features
    /// ...")`).
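    ///
    /// # Examples
    ///
    /// A minimal, self-contained sketch of the naming rule above. The helper
    /// `feature_names_out` is hypothetical (it is not part of this crate) and
    /// reports a length mismatch as `None` rather than
    /// [`FerroError::ShapeMismatch`]:
    ///
    /// ```rust
    /// // Hypothetical stand-in: default `x{j}` names, or the supplied names verbatim.
    /// fn feature_names_out(n_features: usize, input: Option<&[String]>) -> Option<Vec<String>> {
    ///     match input {
    ///         None => Some((0..n_features).map(|j| format!("x{j}")).collect()),
    ///         Some(names) if names.len() == n_features => Some(names.to_vec()),
    ///         Some(_) => None, // length mismatch: the real method returns an error here
    ///     }
    /// }
    ///
    /// fn main() {
    ///     assert_eq!(
    ///         feature_names_out(2, None),
    ///         Some(vec!["x0".to_string(), "x1".to_string()])
    ///     );
    ///     let given = vec!["animal".to_string(), "size".to_string()];
    ///     assert_eq!(feature_names_out(2, Some(given.as_slice())), Some(given.clone()));
    ///     assert_eq!(feature_names_out(3, Some(given.as_slice())), None);
    /// }
    /// ```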
    pub fn get_feature_names_out(
        &self,
        input_features: Option<&[String]>,
    ) -> Result<Vec<String>, FerroError> {
        let n = self.categories.len();
        match input_features {
            None => Ok((0..n).map(|j| format!("x{j}")).collect()),
            Some(names) => {
                if names.len() != n {
                    return Err(FerroError::ShapeMismatch {
                        expected: vec![n],
                        actual: vec![names.len()],
                        context: "FittedOrdinalEncoder::get_feature_names_out (input_features \
                                  length must equal n_features_in_)"
                            .into(),
                    });
                }
                Ok(names.to_vec())
            }
        }
    }

    /// Return the configured unknown-category strategy.
    #[must_use]
    pub fn handle_unknown(&self) -> HandleUnknown {
        self.handle_unknown
    }

    /// Return the configured unknown-category sentinel, if any.
    #[must_use]
    pub fn unknown_value(&self) -> Option<f64> {
        self.unknown_value
    }

    /// Convert ordinal codes back to the original category strings.
    ///
    /// This is the inverse of [`Transform::transform`]: each `f64` cell is read
    /// as an ordinal index into the per-column `categories_` learned at `fit`
    /// time, and the corresponding category string is returned. Because it
    /// reuses the SHIPPED `categories_` (REQ-1),
    /// `inverse_transform(transform(X)) == X` for any `X` whose every category
    /// was seen during `fit` (a bit-exact roundtrip on the default `Error`-mode
    /// encoder with infrequent grouping disabled). With infrequent grouping
    /// (REQ-8) active for a feature, the shared trailing code inverts to the
    /// literal `"infrequent_sklearn"`, so the roundtrip is not exact for
    /// infrequent categories. Mirrors scikit-learn's
    /// `OrdinalEncoder.inverse_transform` (`sklearn/preprocessing/_encoders.py:1595`),
    /// `X_tr[:, i] = self.categories_[i][labels]`.
    ///
    /// # Index contract (faithful to sklearn / numpy)
    ///
    /// Mirrors sklearn's `labels.astype("int64")` (`_encoders.py:1664`) followed
    /// by numpy fancy indexing `categories_[j][labels]` (`:1679`):
    /// - **truncates non-integers toward zero** (`1.5` → index `1` → that
    ///   category; `0.7` → `0`) — Rust `f64 as i64` matches the C-style cast.
    /// - **wraps small negatives** via numpy negative indexing (`-1.0` →
    ///   `categories_[j][len-1]`, the LAST category; `-2.0` → `len-2`), raising
    ///   only once the wrapped index still leaves `[0, len)` (`-3.0` with 2
    ///   categories → `IndexError`).
    /// - **errors** on an out-of-range positive ordinal (`9.0` with 2 categories
    ///   → sklearn `IndexError`) and on a non-finite cell (NaN/±inf overflow the
    ///   `astype("int64")` cast → sklearn `IndexError`/`ValueError`; guarded
    ///   explicitly because Rust's `f64 as i64` saturates NaN→0, which would
    ///   diverge).
    ///
    /// The roundtrip, held-out valid-ordinal, truncation, and negative-wrap paths
    /// all match sklearn; out-of-range / non-finite both error.
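    ///
    /// A minimal, self-contained sketch of this contract. The helper
    /// `resolve_index` is hypothetical (it is not part of this crate) and
    /// returns `None` where `inverse_transform` returns an error:
    ///
    /// ```rust
    /// // Hypothetical stand-in for the per-cell index logic.
    /// fn resolve_index(v: f64, len: usize) -> Option<usize> {
    ///     if !v.is_finite() {
    ///         return None; // NaN and ±inf are rejected before the cast
    ///     }
    ///     let len = len as i64;
    ///     let mut idx = v as i64; // truncates toward zero
    ///     if idx < 0 {
    ///         idx += len; // numpy-style negative wrap
    ///     }
    ///     (0..len).contains(&idx).then(|| idx as usize)
    /// }
    ///
    /// fn main() {
    ///     assert_eq!(resolve_index(1.5, 3), Some(1)); // truncation
    ///     assert_eq!(resolve_index(-1.0, 2), Some(1)); // wraps to the last category
    ///     assert_eq!(resolve_index(-3.0, 2), None); // still out of range after wrapping
    ///     assert_eq!(resolve_index(9.0, 2), None); // out-of-range positive
    ///     assert_eq!(resolve_index(f64::NAN, 2), None); // non-finite
    /// }
    /// ```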
    ///
    /// # `use_encoded_value` → `None` (SCOPE LIMITATION, R-HONEST-3)
    ///
    /// With [`HandleUnknown::UseEncodedValue`], sklearn maps a cell equal to
    /// `unknown_value` back to `None` (`_encoders.py:1673`,
    /// `X_tr[mask, idx] = None`). ferrolearn's `Array2<String>` output container
    /// **cannot represent `None`** (it would require `Array2<Option<String>>`).
    /// A sentinel such as `-1` would otherwise be WRAPPED to a real category by
    /// the negative-index logic above, so `inverse_transform` checks for
    /// `unknown_value` first and rejects such a cell explicitly: this inverse
    /// therefore ERRORS where sklearn returns `[[None, ...]]`. This is a
    /// documented divergence, not a silent wrong string; the honest behavior is
    /// to error rather than fabricate a category. The default `Error`-mode
    /// encoder produces only valid ordinals, so its inverse is COMPLETE and
    /// bit-exact.
    ///
    /// # Errors
    ///
    /// Returns [`FerroError::InsufficientSamples`] if the input has zero rows
    /// (symmetry with `transform`'s #2220 guard and sklearn's `check_array`).
    ///
    /// Returns [`FerroError::ShapeMismatch`] if the number of columns does not
    /// match the number of features seen during fitting (sklearn's
    /// `_encoders.py:1619` "Shape of the passed X data is not correct").
    ///
    /// Returns [`FerroError::InvalidParameter`] if any cell is non-finite, if
    /// its index still falls outside `[0, len)` after truncation and negative
    /// wrapping (sklearn's `IndexError`; see the index contract above), or if
    /// it equals the configured `unknown_value` under
    /// [`HandleUnknown::UseEncodedValue`].
    pub fn inverse_transform(&self, x: &Array2<f64>) -> Result<Array2<String>, FerroError> {
        let n_features = self.categories.len();
        // Symmetric with `transform`'s 0-row guard (#2220) and sklearn's
        // `check_array` minimum-of-1-sample (`_encoders.py:1610`): a 0-row input
        // raises "Found array with 0 sample(s) ... a minimum of 1 is required".
        if x.nrows() == 0 {
            return Err(FerroError::InsufficientSamples {
                required: 1,
                actual: 0,
                context: "FittedOrdinalEncoder::inverse_transform".into(),
            });
        }
        // sklearn validates the column count (`_encoders.py:1619`) -> ValueError.
        if x.ncols() != n_features {
            return Err(FerroError::ShapeMismatch {
                expected: vec![x.nrows(), n_features],
                actual: vec![x.nrows(), x.ncols()],
                context: "FittedOrdinalEncoder::inverse_transform".into(),
            });
        }

        let n_samples = x.nrows();
        // `Array2::default` fills with the empty String; every cell is overwritten
        // on the Ok path, so the default is never observed by the caller.
        let mut out = Array2::<String>::default((n_samples, n_features));

        for j in 0..n_features {
            let cats = &self.categories[j];
            // Infrequent grouping (REQ-8). When feature `j` has infrequent
            // categories, the valid ordinal codes are `0..=n_frequent[j]`: codes
            // `0..n_frequent` index the FREQUENT-only category list (the original
            // `categories[j]` with the infrequent entries removed, order
            // preserved — sklearn `frequent_categories_mask`,
            // `_encoders.py:1648-1652`), and the shared trailing code
            // `n_frequent` inverts to the literal String `"infrequent_sklearn"`
            // (`_encoders.py:1675-1677` `X_tr[mask, idx] = "infrequent_sklearn"`).
            // UNLIKE `OneHotEncoder`'s NaN proxy, this is a REAL representable
            // String. With grouping disabled `infrequent_indices_[j]` is empty,
            // so this branch is skipped and the SHIPPED REQ-9 path runs unchanged.
            let frequent_only: Option<Vec<String>> = if self
                .infrequent_indices_
                .get(j)
                .is_some_and(|v| !v.is_empty())
            {
                let map = &self.infrequent_map[j];
                let nf = self.n_frequent[j];
                // Slot `s` (in `0..nf`) → the `categories[j]` element whose
                // remapped code is `s` (frequent categories keep their order).
                let mut fo: Vec<String> = Vec::with_capacity(nf);
                for s in 0..nf {
                    if let Some(orig) = map.iter().position(|&c| c == s)
                        && let Some(cat) = cats.get(orig)
                    {
                        fo.push(cat.clone());
                    }
                }
                Some(fo)
            } else {
                None
            };
            // The category list the numpy index logic indexes into: the
            // frequent-only list when grouping is active for this feature, else
            // the full `categories[j]` (SHIPPED REQ-9, UNCHANGED).
            let index_cats: &[String] = frequent_only.as_deref().unwrap_or(cats);
            let len = index_cats.len() as i64;
            for i in 0..n_samples {
                let v = x[[i, j]];
                // `use_encoded_value`: sklearn maps a cell equal to
                // `unknown_value` back to `None` (`_encoders.py:1673`) BEFORE the
                // int cast / indexing. `Array2<String>` cannot hold `None`, so
                // this cell errors (documented scope limitation, R-HONEST-3) —
                // checked first so the configured sentinel (e.g. `-1`) is NOT
                // silently wrapped to a real category by the numpy index logic.
                if self.handle_unknown == HandleUnknown::UseEncodedValue
                    && let Some(uv) = self.unknown_value
                    && (v == uv || (v.is_nan() && uv.is_nan()))
                {
                    return Err(FerroError::InvalidParameter {
                        name: "X".into(),
                        reason: format!(
                            "value {v} at row {i}, feature {j} equals unknown_value; \
                             sklearn inverts it to None, which Array2<String> cannot \
                             represent (would need Array2<Option<String>>)"
                        ),
                    });
                }
                // Infrequent: a cell EXACTLY equal to the shared trailing code
                // `n_frequent` (a float equality, computed on the RAW label
                // BEFORE the int cast — sklearn `labels == infrequent_encoding_value`,
                // `_encoders.py:1644`) inverts to `"infrequent_sklearn"`. A cell
                // that merely truncates to `n_frequent` (e.g. `2.5`) does NOT —
                // it falls through to the frequent-only index logic and errors out
                // of range, matching the live oracle.
                if frequent_only.is_some() && v == self.n_frequent[j] as f64 {
                    out[[i, j]] = "infrequent_sklearn".to_string();
                    continue;
                }
                // sklearn does `labels.astype('int64')` then `categories_[j][idx]`
                // (`_encoders.py:1664`,`:1679`). A non-finite cell overflows the
                // cast (NaN/+-inf -> IndexError/ValueError); reject it (R-CODE-2:
                // Rust's `f64 as i64` would saturate NaN->0, diverging from numpy,
                // so guard explicitly).
                if !v.is_finite() {
                    return Err(FerroError::InvalidParameter {
                        name: "X".into(),
                        reason: format!(
                            "value {v} at row {i}, feature {j} is not a finite ordinal \
                             index (sklearn raises on NaN/inf astype('int64'))"
                        ),
                    });
                }
                // `astype('int64')` truncates toward zero (Rust `as i64` matches
                // for finite values); numpy indexing then WRAPS a negative index
                // by `+= len` (`-1` -> last category), raising only once the
                // wrapped index still leaves `[0, len)`.
                let mut idx = v as i64;
                if idx < 0 {
                    idx += len;
                }
                if idx < 0 || idx >= len {
                    return Err(FerroError::InvalidParameter {
                        name: "X".into(),
                        reason: format!(
                            "ordinal index {} at row {i} is out of bounds for the {len} \
                             categories of feature {j} (sklearn IndexError)",
                            v as i64
                        ),
                    });
                }
                // `idx` is now provably in `[0, len)` (checked above) — no panic.
                // `index_cats` is the frequent-only list under infrequent
                // grouping (so a frequent code maps to its frequent category),
                // else the full `categories[j]` (SHIPPED REQ-9, UNCHANGED).
                out[[i, j]] = index_cats[idx as usize].clone();
            }
        }

        Ok(out)
    }
}

// ---------------------------------------------------------------------------
// Helpers
// ---------------------------------------------------------------------------

/// Cast an ordinal category index to `f64`, matching scikit-learn's default
/// `OrdinalEncoder(dtype=np.float64)` output container
/// (`sklearn/preprocessing/_encoders.py:1262`).
///
/// `f64` exactly represents every integer up to `2^53`, so this is lossless for
/// any realistic category count. Indices above `2^53` (astronomically more
/// categories than memory could hold) round to the nearest `f64`, never panic
/// (R-CODE-2) — the same silent float rounding numpy performs.
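///
/// A small, self-contained check of the `2^53` boundary. The helper
/// `index_to_f64` is a hypothetical `u64` mirror of the function below, used
/// so the sketch does not depend on the platform's `usize` width:
///
/// ```rust
/// // Hypothetical mirror of the cast, fixed to `u64`.
/// fn index_to_f64(idx: u64) -> f64 {
///     idx as f64
/// }
///
/// fn main() {
///     let limit: u64 = 1 << 53;
///     // Every integer up to 2^53 survives the round trip unchanged.
///     assert_eq!(index_to_f64(limit) as u64, limit);
///     assert_eq!(index_to_f64(limit - 1) as u64, limit - 1);
///     // 2^53 + 1 is not representable and rounds to 2^53 (ties-to-even).
///     assert_eq!(index_to_f64(limit + 1), index_to_f64(limit));
/// }
/// ```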
#[inline]
fn ordinal_index_to_f64(idx: usize) -> f64 {
    idx as f64
}

/// Identify the indices of infrequent categories for one feature, given the
/// per-category training `counts` (aligned with `categories[j]`) and the
/// `min_frequency`/`max_categories` thresholds.
///
/// Mirrors scikit-learn's `_BaseEncoder._identify_infrequent`
/// (`_encoders.py:275-318`). This is the SAME algorithm the SHIPPED
/// `OneHotEncoder` REQ-5b uses (`one_hot_encoder.rs::identify_infrequent`):
/// 1. min_frequency: a category with `count < min_frequency` is infrequent
///    (`:295-296`, integer form only).
/// 2. max_categories: if (after step 1) the feature would still produce more
///    than `max_categories` ordinal codes — counted as `n_remaining_frequent +
///    1` for the infrequent group (`:303`) — the least-frequent categories are
///    additionally marked infrequent until only `max_categories - 1` frequent
///    categories remain (`:304-315`). Ties broken by a STABLE sort over the
///    FULL count array, so among equal counts the SMALLER category index is
///    marked infrequent first (sklearn `np.argsort(kind="mergesort")[:-k]`),
///    i.e. the LARGER index is favoured to stay frequent. `max_categories == 1`
///    (frequent_category_count 0) makes every category infrequent (`:307-309`).
///
/// Returns the sorted-ascending infrequent indices (empty if none — sklearn's
/// `None`). Never panics (R-CODE-2).
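///
/// A minimal, self-contained sketch of the step 2 tie-break. The helper
/// `least_frequent` is hypothetical (it is not the function below); it
/// reproduces only the stable argsort and the cut, assuming
/// `keep == max_categories - 1`:
///
/// ```rust
/// // Hypothetical helper: indices of the `n - keep` least-frequent categories,
/// // chosen by ascending count with ties broken by ascending index.
/// fn least_frequent(counts: &[usize], keep: usize) -> Vec<usize> {
///     let mut order: Vec<usize> = (0..counts.len()).collect();
///     order.sort_by(|&a, &b| counts[a].cmp(&counts[b]).then(a.cmp(&b)));
///     let cut = counts.len().saturating_sub(keep);
///     let mut infrequent = order[..cut].to_vec();
///     infrequent.sort_unstable();
///     infrequent
/// }
///
/// fn main() {
///     // `max_categories = 3` keeps 2 frequent categories: the counts 5 and 3.
///     assert_eq!(least_frequent(&[3, 1, 1, 5], 2), vec![1, 2]);
///     // All counts tie: the smaller indices are cut first, index 2 stays frequent.
///     assert_eq!(least_frequent(&[2, 2, 2], 1), vec![0, 1]);
/// }
/// ```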
fn identify_infrequent(
    counts: &[usize],
    min_frequency: Option<usize>,
    max_categories: Option<usize>,
) -> Vec<usize> {
    let n = counts.len();
    let mut infrequent_mask = vec![false; n];

    // Step 1: min_frequency (integer count). `count < min_frequency`.
    if let Some(min_freq) = min_frequency {
        for (idx, &c) in counts.iter().enumerate() {
            if c < min_freq {
                infrequent_mask[idx] = true;
            }
        }
    }

    // Step 2: max_categories on the survivors. `n_current_features` counts the
    // remaining frequent categories PLUS 1 for the infrequent group
    // (`_encoders.py:303`).
    if let Some(max_cat) = max_categories {
        let n_infreq = infrequent_mask.iter().filter(|&&m| m).count();
        let n_current_features = n - n_infreq + 1;
        if max_cat < n_current_features {
            // `max_categories` includes the one infrequent category.
            let frequent_category_count = max_cat - 1;
            if frequent_category_count == 0 {
                // All categories are infrequent (`:307-309`).
                infrequent_mask.iter_mut().for_each(|m| *m = true);
            } else {
                // Stable argsort over the FULL count array (ascending by count,
                // ties by ascending index), then mark the smallest
                // `n - frequent_category_count` levels infrequent — i.e. keep the
                // top `frequent_category_count` by count, with ties resolved in
                // favor of the LARGER index (`np.argsort(kind="mergesort")[:-k]`,
                // `:312-315`).
                let mut order: Vec<usize> = (0..n).collect();
                order.sort_by(|&a, &b| counts[a].cmp(&counts[b]).then(a.cmp(&b)));
                let keep = frequent_category_count.min(n);
                let cut = n - keep;
                for &idx in &order[..cut] {
                    infrequent_mask[idx] = true;
                }
            }
        }
    }

    infrequent_mask
        .iter()
        .enumerate()
        .filter_map(|(idx, &m)| if m { Some(idx) } else { None })
        .collect()
}

/// Build the per-feature mapping from a `categories[j]` index to its emitted
/// ORDINAL code.
///
/// Mirrors scikit-learn's `_default_to_infrequent_mappings[j]`
/// (`_encoders.py:373-400`): frequent categories take codes `0..n_frequent` in
/// their original (ascending-index) order; every infrequent category maps to
/// the single trailing code `n_frequent`. With no infrequent categories the
/// mapping is the identity `0..n`. `infrequent` must be sorted ascending. Never
/// panics (R-CODE-2): every index is bounds-checked.
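///
/// A minimal, self-contained sketch of the remapping. The helper `remap` is
/// hypothetical (a simplified stand-in for the function below that uses a
/// linear `contains` instead of a binary search):
///
/// ```rust
/// // Hypothetical stand-in: frequent categories take consecutive codes,
/// // infrequent ones share the trailing code `n_frequent`.
/// fn remap(n: usize, infrequent: &[usize]) -> Vec<usize> {
///     let n_frequent = n.saturating_sub(infrequent.len());
///     let mut next = 0;
///     (0..n)
///         .map(|idx| {
///             if infrequent.contains(&idx) {
///                 n_frequent // shared trailing code
///             } else {
///                 next += 1;
///                 next - 1 // next free frequent code
///             }
///         })
///         .collect()
/// }
///
/// fn main() {
///     // Categories 1 and 3 are infrequent: both collapse onto code 2.
///     assert_eq!(remap(4, &[1, 3]), vec![0, 2, 1, 2]);
///     // No infrequent categories: the identity mapping.
///     assert_eq!(remap(3, &[]), vec![0, 1, 2]);
/// }
/// ```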
fn build_infrequent_map(n: usize, infrequent: &[usize]) -> Vec<usize> {
    if infrequent.is_empty() {
        return (0..n).collect();
    }
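    // `saturating_sub` keeps the no-panic guarantee even if a caller passes
    // more infrequent indices than categories.
    let n_frequent = n.saturating_sub(infrequent.len());
    // Start every slot at the shared trailing code; frequent slots are
    // overwritten below with consecutive codes `0..n_frequent`.
    let mut map = vec![n_frequent; n];
    let mut next_frequent = 0usize;
    for (idx, slot) in map.iter_mut().enumerate() {
        if infrequent.binary_search(&idx).is_err() {
            *slot = next_frequent;
            next_frequent += 1;
        }
    }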
    map
}

// ---------------------------------------------------------------------------
// Trait implementations
// ---------------------------------------------------------------------------

impl Fit<Array2<String>, ()> for OrdinalEncoder {
    type Fitted = FittedOrdinalEncoder;
    type Error = FerroError;

    /// Fit the encoder by building per-column category-to-index mappings.
    ///
    /// With the default `categories='auto'` ([`Categories::Auto`]), categories
    /// are recorded in **lexicographic order** in each column, matching
    /// scikit-learn's `OrdinalEncoder.categories_`.
    ///
    /// With explicit categories ([`Categories::Explicit`], set via
    /// [`OrdinalEncoder::with_categories`]), the user-provided lists are used in
    /// the **given order** (NOT re-sorted), and the ordinal indices follow that
    /// order, mirroring scikit-learn (`sklearn/preprocessing/_encoders.py:114`).
    ///
    /// # Errors
    ///
    /// Returns [`FerroError::InsufficientSamples`] if the input has zero rows.
    ///
    /// Returns [`FerroError::ShapeMismatch`] if explicit categories are set but
    /// the number of category lists differs from the number of input columns
    /// (sklearn `_encoders.py:85-89` "Shape mismatch: if categories is an array,
    /// it has to be of shape (n_features,).").
    ///
    /// Returns [`FerroError::InvalidParameter`] if an explicit category list
    /// contains duplicate elements (sklearn `_encoders.py:136-141`), or — under
    /// the default [`HandleUnknown::Error`] — if a value seen in the data is not
    /// in its column's explicit list (sklearn `_encoders.py:153-160` "Found
    /// unknown categories ... during fit"; SKIPPED under
    /// [`HandleUnknown::UseEncodedValue`]).
    ///
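    /// Returns [`FerroError::InvalidParameter`] for the `handle_unknown` /
    /// `unknown_value` validation failures (mirroring scikit-learn's
    /// `TypeError`/`ValueError` at `_encoders.py:1473-1526`): selecting
    /// [`HandleUnknown::UseEncodedValue`] without an `unknown_value`; setting an
    /// `unknown_value` while in [`HandleUnknown::Error`] mode; a non-NaN
    /// `unknown_value` with a fractional part; or an `unknown_value` that
    /// collides with an already-used encoding index.
    ///
    /// Returns [`FerroError::InvalidParameter`] if `min_frequency` or
    /// `max_categories` is `Some(0)` (sklearn's `InvalidParameterError`), or if
    /// an explicit category list is empty.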
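    ///
    /// # Examples
    ///
    /// A minimal, self-contained sketch of the `categories='auto'` rule for one
    /// column. The helper `sorted_unique` is hypothetical (it is not part of
    /// this crate); it uses a `BTreeSet`, which iterates in the same byte-wise
    /// string order that sorting a `Vec<String>` produces:
    ///
    /// ```rust
    /// use std::collections::BTreeSet;
    ///
    /// // Hypothetical helper: deduplicate one column and return it sorted.
    /// fn sorted_unique(column: &[&str]) -> Vec<String> {
    ///     let set: BTreeSet<&str> = column.iter().copied().collect();
    ///     set.into_iter().map(str::to_owned).collect()
    /// }
    ///
    /// fn main() {
    ///     let cats = sorted_unique(&["dog", "cat", "bird", "cat"]);
    ///     assert_eq!(cats, vec!["bird", "cat", "dog"]);
    ///     // The ordinal code of a category is its position in this list.
    ///     assert_eq!(cats.iter().position(|c| c.as_str() == "cat"), Some(1));
    /// }
    /// ```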
    fn fit(&self, x: &Array2<String>, _y: &()) -> Result<FittedOrdinalEncoder, FerroError> {
        // sklearn `_parameter_constraints` (`@_fit_context`, validated BEFORE the
        // data): `min_frequency` and `max_categories` are each
        // `Interval(Integral, 1, None)` — a value of 0 raises
        // `InvalidParameterError` ("must be an int in the range [1, inf)").
        // REQ-8, verified live: `OrdinalEncoder(min_frequency=0).fit` →
        // InvalidParameterError. (handle_unknown is a type-safe Rust enum, so its
        // StrOptions constraint is provided by the type system.)
        if self.min_frequency == Some(0) {
            return Err(FerroError::InvalidParameter {
                name: "min_frequency".into(),
                reason: "must be an int in the range [1, inf)".into(),
            });
        }
        if self.max_categories == Some(0) {
            return Err(FerroError::InvalidParameter {
                name: "max_categories".into(),
                reason: "must be an int in the range [1, inf)".into(),
            });
        }

        let n_samples = x.nrows();
        if n_samples == 0 {
            return Err(FerroError::InsufficientSamples {
                required: 1,
                actual: 0,
                context: "OrdinalEncoder::fit".into(),
            });
        }

        // Validations (a)/(b) inspect only the parameter combination, not the
        // data. Matching sklearn, they still run inside `fit`, after the 0-row
        // `check_array` guard above; the collision check (c) runs later, once
        // `categories_` has been computed. (a)/(b) map sklearn's `TypeError`
        // (`_encoders.py:1481-1493`).
        match (self.handle_unknown, self.unknown_value) {
            // (a) use_encoded_value REQUIRES an unknown_value (an int or nan).
            // sklearn: `not isinstance(unknown_value, Integral)` -> TypeError
            // (`:1481`); `unknown_value is None` falls into that branch.
            (HandleUnknown::UseEncodedValue, None) => {
                return Err(FerroError::InvalidParameter {
                    name: "unknown_value".into(),
                    reason: "unknown_value should be set (an integer or NaN) when \
                             handle_unknown is 'use_encoded_value'"
                        .into(),
                });
            }
            // (b) error-mode forbids a set unknown_value. sklearn: `:1488`
            // `elif self.unknown_value is not None` -> TypeError.
            (HandleUnknown::Error, Some(v)) => {
                return Err(FerroError::InvalidParameter {
                    name: "unknown_value".into(),
                    reason: format!(
                        "unknown_value should only be set when handle_unknown is \
                         'use_encoded_value', got {v}"
                    ),
                });
            }
            _ => {}
        }

        let n_features = x.ncols();
        let mut categories = Vec::with_capacity(n_features);
        let mut category_to_index = Vec::with_capacity(n_features);

        match &self.categories {
            // `categories='auto'` (default): per column, sorted-unique from the
            // data (SHIPPED REQ-1, UNCHANGED). sklearn `_encoders.py:98-99`
            // `result = _unique(Xi)`.
            Categories::Auto => {
                for j in 0..n_features {
                    // Collect unique categories then sort lexicographically so the
                    // assigned indices match sklearn's `OrdinalEncoder`, which
                    // documents `categories_ = sorted(unique(X[:, j]))`. (Older
                    // ferrolearn versions used first-seen order — #344.)
                    let mut unique: Vec<String> = Vec::new();
                    let mut seen_set: std::collections::HashSet<String> =
                        std::collections::HashSet::new();
                    for i in 0..n_samples {
                        let cat = &x[[i, j]];
                        if seen_set.insert(cat.clone()) {
                            unique.push(cat.clone());
                        }
                    }
                    unique.sort();

                    let map: HashMap<String, usize> = unique
                        .iter()
                        .enumerate()
                        .map(|(idx, s)| (s.clone(), idx))
                        .collect();

                    categories.push(unique);
                    category_to_index.push(map);
                }
            }
            // `categories=[list, ...]` (explicit): use the user-provided lists in
            // the GIVEN order (NOT re-sorted), mirroring sklearn `_encoders.py:84-160`.
            Categories::Explicit(lists) => {
                // sklearn (`_encoders.py:85-89`): the list count must match
                // n_features, else ValueError -> map to `ShapeMismatch`.
                if lists.len() != n_features {
                    return Err(FerroError::ShapeMismatch {
                        expected: vec![n_features],
                        actual: vec![lists.len()],
                        context: "Shape mismatch: if categories is an array, it has to be of \
                                  shape (n_features,)."
                            .into(),
                    });
                }

                for (j, list) in lists.iter().enumerate() {
                    // sklearn (`_encoders.py:114-117`) indexes `cats[0]` on the
                    // provided list BEFORE the duplicate/subset checks, so an
                    // EMPTY explicit list raises `IndexError` at fit in BOTH
                    // handle_unknown modes (#2229). Reject it here (the
                    // use_encoded_value path would otherwise skip the subset
                    // check and silently fit an empty category set).
                    if list.is_empty() {
                        return Err(FerroError::InvalidParameter {
                            name: "categories".into(),
                            reason: format!(
                                "column {j} has an empty predefined category list; \
                                 each feature needs at least one category"
                            ),
                        });
                    }
                    // sklearn (`_encoders.py:136-141`): a list with duplicate
                    // elements raises ValueError. Build the index map detecting
                    // duplicates in one pass (R-CODE-2: never panic).
                    let mut map: HashMap<String, usize> = HashMap::with_capacity(list.len());
                    for (idx, cat) in list.iter().enumerate() {
                        if map.insert(cat.clone(), idx).is_some() {
                            return Err(FerroError::InvalidParameter {
                                name: "categories".into(),
                                reason: format!(
                                    "In column {j}, the predefined categories contain \
                                     duplicate elements."
                                ),
                            });
                        }
                    }

                    // sklearn (`_encoders.py:153-160`): under handle_unknown='error'
                    // every value seen in the data must be present in the
                    // predefined list, else ValueError. Under 'use_encoded_value'
                    // this fit-time subset check is SKIPPED (out-of-set data is
                    // fine — encoded to `unknown_value` later at transform time).
                    if self.handle_unknown == HandleUnknown::Error {
                        for i in 0..n_samples {
                            let cat = &x[[i, j]];
                            if !map.contains_key(cat) {
                                return Err(FerroError::InvalidParameter {
                                    name: "X".into(),
                                    reason: format!(
                                        "Found unknown categories [{cat}] in column {j} \
                                         during fit"
                                    ),
                                });
                            }
                        }
                    }

                    // Use the list AS-GIVEN (preserve order — do NOT sort).
                    categories.push(list.clone());
                    category_to_index.push(map);
                }
            }
        }

        // Infrequent grouping (REQ-8). When `min_frequency`/`max_categories` are
        // set, fold the least-frequent categories of each feature into a single
        // shared trailing ORDINAL code (the frequent categories keep codes
        // `0..n_frequent` in their original sorted order). `categories` is NOT
        // changed (all categories retained, sklearn keeps `categories_` whole and
        // only remaps the emitted index, `_encoders.py:1289-1370`) — only the
        // per-feature `infrequent_map` / `infrequent_indices_` / `n_frequent` are
        // built. With grouping disabled the map is the identity and every feature
        // has no infrequent categories.
        let mut infrequent_indices_: Vec<Vec<usize>> = Vec::with_capacity(n_features);
        let mut infrequent_map: Vec<Vec<usize>> = Vec::with_capacity(n_features);
        let mut n_frequent: Vec<usize> = Vec::with_capacity(n_features);
        if self.infrequent_enabled() {
            for (j, cats) in categories.iter().enumerate() {
                // Per-category training counts ALIGNED with `categories[j]`
                // (sklearn `_unique(Xi, return_counts=True)`,
                // `_encoders.py:99-102`). Built from the fit data through the
                // category→index map, so it works for BOTH the Auto and Explicit
                // category sets. (A datum not in an explicit list contributes no
                // count — under `handle_unknown='error'` the subset check above
                // already rejected it; under `use_encoded_value` it is an unknown
                // that does not affect category frequencies.)
                let map = &category_to_index[j];
                let mut counts = vec![0usize; cats.len()];
                for i in 0..n_samples {
                    if let Some(&idx) = map.get(&x[[i, j]]) {
                        counts[idx] += 1;
                    }
                }
                let infreq = identify_infrequent(&counts, self.min_frequency, self.max_categories);
                let imap = build_infrequent_map(cats.len(), &infreq);
                n_frequent.push(cats.len() - infreq.len());
                infrequent_indices_.push(infreq);
                infrequent_map.push(imap);
            }
        } else {
            for cats in &categories {
                infrequent_indices_.push(Vec::new());
                infrequent_map.push((0..cats.len()).collect());
                n_frequent.push(cats.len());
            }
        }

        // Validation (a'): sklearn (`_encoders.py:1481-1487`) requires
        // `unknown_value` to be an INTEGER or `np.nan` when
        // `handle_unknown='use_encoded_value'`; a non-integer float raises
        // `TypeError` BEFORE the range/collision check (#2221). The `f64`
        // parameter type cannot enforce integrality on its own, so a non-nan
        // value with a fractional part is rejected here.
        if self.handle_unknown == HandleUnknown::UseEncodedValue
            && let Some(v) = self.unknown_value
            && !v.is_nan()
            && v.fract() != 0.0
        {
            return Err(FerroError::InvalidParameter {
                name: "unknown_value".into(),
                reason: format!(
                    "unknown_value should be an integer or np.nan when \
                     handle_unknown is 'use_encoded_value', got {v}"
                ),
            });
        }

        // Validation (c): collision of a non-nan integer unknown_value with an
        // already-used encoding index. sklearn (`_encoders.py:1518-1526`) loops
        // over each column's cardinality and raises `ValueError` if
        // `0 <= unknown_value < cardinality`; since some column satisfies that
        // test exactly when the largest one does, comparing against the
        // maximum cardinality is equivalent. Validation (a') above already
        // rejected non-integer values, so the `fract()` guard below is purely
        // defensive. A nan value skips this check, and a negative value or one
        // `>= max_cardinality` passes it.
        if self.handle_unknown == HandleUnknown::UseEncodedValue
            && let Some(v) = self.unknown_value
            && !v.is_nan()
            && v.fract() == 0.0
        {
            // sklearn's collision check keys off the EFFECTIVE number of distinct
            // output codes per feature: with infrequent grouping a feature emits
            // `n_frequent + 1` codes (the shared infrequent index), so its
            // cardinality for the unknown_value collision is `n_frequent + 1`, NOT
            // `len(categories_)` (verified live: `min_frequency=2` over 4 cats →
            // 3 codes → `unknown_value=3` is accepted, `=2` collides). With
            // grouping disabled `n_frequent[j] == categories[j].len()` and there
            // is no infrequent code, so this reduces to the SHIPPED REQ-5 check.
            let max_cardinality = (0..n_features)
                .map(|j| n_frequent[j] + usize::from(!infrequent_indices_[j].is_empty()))
                .max()
                .unwrap_or(0);
            // `0 <= v < max_cardinality` with v an integer-valued f64.
            if v >= 0.0 && v < max_cardinality as f64 {
                return Err(FerroError::InvalidParameter {
                    name: "unknown_value".into(),
                    reason: format!(
                        "The used value for unknown_value {v} is one of the \
                         values already used for encoding the seen categories"
                    ),
                });
            }
        }

        Ok(FittedOrdinalEncoder {
            categories,
            category_to_index,
            handle_unknown: self.handle_unknown,
            unknown_value: self.unknown_value,
            infrequent_indices_,
            infrequent_map,
            n_frequent,
        })
    }
}

impl Transform<Array2<String>> for FittedOrdinalEncoder {
    type Output = Array2<f64>;
    type Error = FerroError;

    /// Transform string categories to ordinal indices, returned as `f64`.
    ///
    /// Each cell is the category's ordinal code cast to `f64`: its index in
    /// the fitted category list (lexicographic order when the categories are
    /// derived from the data) or, when infrequent grouping is enabled, that
    /// index routed through the per-feature infrequent map. The ordinal
    /// VALUES are the same as the integer mapping; only the output container
    /// dtype is `f64`, matching scikit-learn's
    /// `OrdinalEncoder(dtype=np.float64)` default
    /// (`sklearn/preprocessing/_encoders.py:1262`). A configurable non-float64
    /// output dtype (e.g. `int32`) is OUT OF SCOPE here: ferrolearn's output is
    /// the fixed sklearn DEFAULT `f64`, and a `dtype` param is a follow-on
    /// design (blocker #1158). `f64` exactly represents every integer up to
    /// `2^53`, so the cast is lossless for any realistic category count.
    ///
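    /// # Errors
    ///
    /// Returns [`FerroError::InsufficientSamples`] if `x` has zero rows. This
    /// is checked before the column count, matching sklearn's `check_array`.
    ///
    /// Returns [`FerroError::ShapeMismatch`] if the number of columns does not
    /// match the number of features seen during fitting.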
    ///
    /// Returns [`FerroError::InvalidParameter`] if any category was not seen
    /// during fitting AND `handle_unknown` is [`HandleUnknown::Error`] (the
    /// default). Under [`HandleUnknown::UseEncodedValue`], unknown categories
    /// are instead encoded as the configured `unknown_value` sentinel (which may
    /// be `f64::NAN`), matching sklearn `_encoders.py:1591`.
    fn transform(&self, x: &Array2<String>) -> Result<Array2<f64>, FerroError> {
        let n_features = self.categories.len();
        // sklearn `OrdinalEncoder.transform` -> `_transform` -> `_check_X` ->
        // `check_array` (`_encoders.py:45`) enforces a minimum of 1 sample BEFORE
        // the n_features comparison (#2220, symmetric with the 0-row fit guard).
        // A 0-row input raises "Found array with 0 sample(s) ... minimum of 1".
        if x.nrows() == 0 {
            return Err(FerroError::InsufficientSamples {
                required: 1,
                actual: 0,
                context: "FittedOrdinalEncoder::transform".into(),
            });
        }
        if x.ncols() != n_features {
            return Err(FerroError::ShapeMismatch {
                expected: vec![x.nrows(), n_features],
                actual: vec![x.nrows(), x.ncols()],
                context: "FittedOrdinalEncoder::transform".into(),
            });
        }

        let n_samples = x.nrows();
        let mut out = Array2::zeros((n_samples, n_features));

        for j in 0..n_features {
            let map = &self.category_to_index[j];
            // Per-feature infrequent remapping (REQ-8): a found category's
            // `categories[j]` index is routed through `infrequent_map[j]` to its
            // emitted ordinal code (a frequent category → its remapped slot
            // `0..n_frequent`, an infrequent category → the shared trailing code
            // `n_frequent`), mirroring sklearn `_map_infrequent_categories`
            // (`_encoders.py:402-452`: `X_int = np.take(mapping, X_int)`). With
            // grouping DISABLED `infrequent_map[j]` is the identity, so the code
            // equals `idx` — the SHIPPED REQ-2 behaviour is UNCHANGED.
            let imap = self.infrequent_map.get(j);
            for i in 0..n_samples {
                let cat = &x[[i, j]];
                match map.get(cat) {
                    // Route the category index through the infrequent map, then
                    // cast the resulting ordinal code to f64 (sklearn's float64
                    // default, `_encoders.py:1262`). Lossless: codes are < 2^53.
                    // Bounds-safe (R-CODE-2): a missing map or entry falls back
                    // to the raw `idx`, although `imap` always has
                    // `categories[j].len()` entries, so the fallback is not
                    // expected to trigger.
                    Some(&idx) => {
                        let code = imap.and_then(|m| m.get(idx)).copied().unwrap_or(idx);
                        out[[i, j]] = ordinal_index_to_f64(code);
                    }
                    None => match self.handle_unknown {
                        // handle_unknown='use_encoded_value': write the sentinel
                        // (which may be NaN). sklearn `_encoders.py:1591`
                        // `X_trans[~X_mask] = self.unknown_value`. `fit`
                        // guaranteed `unknown_value` is `Some` in this mode, but
                        // we never panic (R-CODE-2): fall back to the Error path
                        // if it were somehow `None`.
                        HandleUnknown::UseEncodedValue => match self.unknown_value {
                            Some(v) => out[[i, j]] = v,
                            None => {
                                return Err(FerroError::InvalidParameter {
                                    name: format!("x[{i},{j}]"),
                                    reason: format!(
                                        "unknown category \"{cat}\" in column {j} and \
                                         no unknown_value configured"
                                    ),
                                });
                            }
                        },
                        // handle_unknown='error' (default): reject (SHIPPED
                        // REQ-2, UNCHANGED). sklearn raises ValueError
                        // "Found unknown categories ... during transform".
                        HandleUnknown::Error => {
                            return Err(FerroError::InvalidParameter {
                                name: format!("x[{i},{j}]"),
                                reason: format!("unknown category \"{cat}\" in column {j}"),
                            });
                        }
                    },
                }
            }
        }

        Ok(out)
    }
}

/// Implement `Transform` on the unfitted encoder to satisfy the
/// `FitTransform: Transform` supertrait bound.
impl Transform<Array2<String>> for OrdinalEncoder {
    type Output = Array2<f64>;
    type Error = FerroError;

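    /// Always returns an error: the encoder must be fitted first.
    ///
    /// # Errors
    ///
    /// Always returns [`FerroError::InvalidParameter`]; call `fit` and use the
    /// returned [`FittedOrdinalEncoder`] instead.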
    fn transform(&self, _x: &Array2<String>) -> Result<Array2<f64>, FerroError> {
        Err(FerroError::InvalidParameter {
            name: "OrdinalEncoder".into(),
            reason: "encoder must be fitted before calling transform; use fit() first".into(),
        })
    }
}

impl FitTransform<Array2<String>> for OrdinalEncoder {
    type FitError = FerroError;

    /// Fit the encoder on `x` and return the encoded output in one step.
    ///
    /// # Errors
    ///
    /// Propagates any error from `fit` (for example zero rows or an invalid
    /// `unknown_value`) and any error from the subsequent `transform`.
    fn fit_transform(&self, x: &Array2<String>) -> Result<Array2<f64>, FerroError> {
        let fitted = self.fit(x, &())?;
        fitted.transform(x)
    }
}

// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------

#[cfg(test)]
mod tests {
    use super::*;
    use ndarray::Array2;

    fn make_2col(rows: &[(&str, &str)]) -> Array2<String> {
        let flat: Vec<String> = rows
            .iter()
            .flat_map(|(a, b)| [a.to_string(), b.to_string()])
            .collect();
        Array2::from_shape_vec((rows.len(), 2), flat).unwrap()
    }

    #[test]
    fn test_ordinal_encoder_basic() {
        let enc = OrdinalEncoder::new();
        let x = make_2col(&[
            ("cat", "small"),
            ("dog", "large"),
            ("cat", "medium"),
            ("bird", "small"),
        ]);
        let fitted = enc.fit(&x, &()).unwrap();

        // Categories are sorted lexicographically (sklearn convention).
        assert_eq!(fitted.categories()[0], vec!["bird", "cat", "dog"]);
        assert_eq!(fitted.categories()[1], vec!["large", "medium", "small"]);

        let encoded = fitted.transform(&x).unwrap();
        // Output container is `Array2<f64>` (sklearn's `dtype=np.float64`).
        assert_eq!(encoded[[0, 0]], 1.0); // "cat"  -> 1 (lex pos)
        assert_eq!(encoded[[1, 0]], 2.0); // "dog"  -> 2
        assert_eq!(encoded[[2, 0]], 1.0); // "cat"  -> 1
        assert_eq!(encoded[[3, 0]], 0.0); // "bird" -> 0
        assert_eq!(encoded[[0, 1]], 2.0); // "small"  -> 2
        assert_eq!(encoded[[1, 1]], 0.0); // "large"  -> 0
        assert_eq!(encoded[[2, 1]], 1.0); // "medium" -> 1
        assert_eq!(encoded[[3, 1]], 2.0); // "small"  -> 2
    }
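
    // A small standalone sketch of the losslessness claim behind the `f64`
    // output: integers up to 2^53 survive a round trip through `f64`, while
    // 2^53 + 1 does not. `sketch_roundtrips_through_f64` is a hypothetical
    // helper for illustration only; it is not the crate's
    // `ordinal_index_to_f64`.
    fn sketch_roundtrips_through_f64(n: u64) -> bool {
        // Cast to f64 and back; equality means no precision was lost.
        (n as f64) as u64 == n
    }

    #[test]
    fn sketch_f64_exact_integer_range() {
        let limit = 1u64 << 53;
        assert!(sketch_roundtrips_through_f64(limit - 1));
        assert!(sketch_roundtrips_through_f64(limit));
        assert!(!sketch_roundtrips_through_f64(limit + 1));
    }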

    #[test]
    fn test_fit_transform_equivalence() {
        let enc = OrdinalEncoder::new();
        let x = make_2col(&[("a", "x"), ("b", "y"), ("a", "z")]);
        let via_ft = enc.fit_transform(&x).unwrap();
        let fitted = enc.fit(&x, &()).unwrap();
        let via_sep = fitted.transform(&x).unwrap();
        assert_eq!(via_ft, via_sep);
    }

    #[test]
    fn test_unknown_category_error() {
        let enc = OrdinalEncoder::new();
        let x_train = make_2col(&[("cat", "small"), ("dog", "large")]);
        let fitted = enc.fit(&x_train, &()).unwrap();
        let x_test = make_2col(&[("fish", "small")]);
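        // The default `handle_unknown='error'` rejects unseen categories.
        assert!(matches!(
            fitted.transform(&x_test),
            Err(FerroError::InvalidParameter { .. })
        ));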
    }
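
    // A minimal standalone sketch of the two `unknown_value` checks that
    // `fit` runs under `handle_unknown='use_encoded_value'`: first the
    // integer-or-NaN check, then the collision check against the largest
    // effective cardinality. `sketch_check_unknown_value` and its string
    // errors are hypothetical, for illustration only; the encoder itself
    // returns `FerroError::InvalidParameter`.
    fn sketch_check_unknown_value(v: f64, max_cardinality: usize) -> Result<(), &'static str> {
        if v.is_nan() {
            // NaN is always an acceptable sentinel.
            return Ok(());
        }
        if v.fract() != 0.0 {
            return Err("not an integer or NaN");
        }
        // `0 <= v < max_cardinality` means the sentinel is already a used code.
        if v >= 0.0 && v < max_cardinality as f64 {
            return Err("collides with a used code");
        }
        Ok(())
    }

    #[test]
    fn sketch_unknown_value_checks() {
        // Three emitted codes (0, 1, 2): 3 and -1 are free, 2 collides.
        assert!(sketch_check_unknown_value(3.0, 3).is_ok());
        assert!(sketch_check_unknown_value(-1.0, 3).is_ok());
        assert!(sketch_check_unknown_value(f64::NAN, 3).is_ok());
        assert_eq!(sketch_check_unknown_value(2.0, 3), Err("collides with a used code"));
        assert_eq!(sketch_check_unknown_value(1.5, 3), Err("not an integer or NaN"));
    }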

    #[test]
    fn test_shape_mismatch_error() {
        let enc = OrdinalEncoder::new();
        let x_train = make_2col(&[("a", "x")]);
        let fitted = enc.fit(&x_train, &()).unwrap();
        // Single-column input when 2 columns are expected.
        let x_bad = Array2::from_shape_vec((1, 1), vec!["a".to_string()]).unwrap();
        assert!(matches!(
            fitted.transform(&x_bad),
            Err(FerroError::ShapeMismatch { .. })
        ));
    }

    #[test]
    fn test_insufficient_samples_error() {
        let enc = OrdinalEncoder::new();
        let x: Array2<String> = Array2::from_shape_vec((0, 2), vec![]).unwrap();
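        // A 0-row fit is rejected, matching sklearn's `check_array`.
        assert!(matches!(
            enc.fit(&x, &()),
            Err(FerroError::InsufficientSamples { .. })
        ));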
    }

    #[test]
    fn test_unfitted_transform_error() {
        let enc = OrdinalEncoder::new();
        let x = make_2col(&[("a", "x")]);
        assert!(matches!(
            enc.transform(&x),
            Err(FerroError::InvalidParameter { .. })
        ));
    }

    #[test]
    fn test_single_column() {
        let enc = OrdinalEncoder::new();
        let flat = vec![
            "red".to_string(),
            "green".to_string(),
            "blue".to_string(),
            "red".to_string(),
        ];
        let x = Array2::from_shape_vec((4, 1), flat).unwrap();
        let fitted = enc.fit(&x, &()).unwrap();
        // Lex order: blue (0), green (1), red (2)
        assert_eq!(fitted.categories()[0], vec!["blue", "green", "red"]);
        let encoded = fitted.transform(&x).unwrap();
        assert_eq!(encoded[[0, 0]], 2.0); // red
        assert_eq!(encoded[[1, 0]], 1.0); // green
        assert_eq!(encoded[[2, 0]], 0.0); // blue
        assert_eq!(encoded[[3, 0]], 2.0); // red
    }

    #[test]
    fn test_n_features() {
        let enc = OrdinalEncoder::new();
        let x = make_2col(&[("a", "x")]);
        let fitted = enc.fit(&x, &()).unwrap();
        assert_eq!(fitted.n_features(), 2);
    }

    #[test]
    fn test_lexicographic_order() {
        // Categories are sorted lexicographically to match sklearn (#344).
        let enc = OrdinalEncoder::new();
        let flat = vec!["zebra".to_string(), "ant".to_string(), "moose".to_string()];
        let x = Array2::from_shape_vec((3, 1), flat).unwrap();
        let fitted = enc.fit(&x, &()).unwrap();
        // ant < moose < zebra
        assert_eq!(fitted.categories()[0][0], "ant");
        assert_eq!(fitted.categories()[0][1], "moose");
        assert_eq!(fitted.categories()[0][2], "zebra");
    }
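
    // A standalone sketch of the infrequent-category remapping described in
    // `fit`: frequent categories keep codes `0..n_frequent` in their original
    // order and every infrequent category shares the trailing code
    // `n_frequent`. `sketch_infrequent_map` is a hypothetical stand-in for
    // illustration; it is not the crate's `build_infrequent_map`, and it
    // assumes `infrequent` holds distinct, in-range category indices.
    fn sketch_infrequent_map(n_categories: usize, infrequent: &[usize]) -> Vec<usize> {
        let n_frequent = n_categories - infrequent.len();
        let mut next_code = 0;
        (0..n_categories)
            .map(|idx| {
                if infrequent.contains(&idx) {
                    // Every infrequent category shares the trailing code.
                    n_frequent
                } else {
                    // Frequent categories take the next free leading code.
                    let code = next_code;
                    next_code += 1;
                    code
                }
            })
            .collect()
    }

    #[test]
    fn sketch_infrequent_map_shares_trailing_code() {
        // Four categories, indices 1 and 3 infrequent: two frequent codes
        // (0, 1) plus the shared code 2, so three distinct codes in total.
        assert_eq!(sketch_infrequent_map(4, &[1, 3]), vec![0, 2, 1, 2]);
        // No infrequent categories: the map is the identity.
        assert_eq!(sketch_infrequent_map(3, &[]), vec![0, 1, 2]);
    }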
}