Label encoder: maps string labels to integer indices.
Learns an ordered mapping from unique string labels to consecutive integers
0, 1, ..., n_classes - 1. Supports forward (label → int) and reverse
(int → label) transformation.
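
The forward and reverse mappings described above can be sketched with the standard library alone. This is a minimal illustration, not ferrolearn's actual implementation: `SketchEncoder`, `fit`, and the `String` error type are hypothetical stand-ins chosen for this example, and the real crate works on `Array1<String>` and reports errors through its own error type.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a fitted encoder (not the crate's real type).
struct SketchEncoder {
    /// Sorted, de-duplicated labels; the position of a label is its integer code.
    classes: Vec<String>,
    /// Reverse lookup from label to its position in `classes`.
    label_to_index: HashMap<String, usize>,
}

/// Learn the mapping: collect labels, sort lexicographically, drop duplicates.
fn fit(labels: &[&str]) -> SketchEncoder {
    let mut classes: Vec<String> = labels.iter().map(|s| s.to_string()).collect();
    classes.sort();
    classes.dedup();
    let label_to_index = classes
        .iter()
        .enumerate()
        .map(|(i, c)| (c.clone(), i))
        .collect();
    SketchEncoder { classes, label_to_index }
}

impl SketchEncoder {
    /// Forward direction (label -> int); an unseen label is an error.
    fn transform(&self, labels: &[&str]) -> Result<Vec<usize>, String> {
        labels
            .iter()
            .map(|l| {
                self.label_to_index
                    .get(*l)
                    .copied()
                    .ok_or_else(|| format!("unknown label: {l}"))
            })
            .collect()
    }

    /// Reverse direction (int -> label); an out-of-range code is an error.
    fn inverse_transform(&self, codes: &[usize]) -> Result<Vec<String>, String> {
        codes
            .iter()
            .map(|&i| {
                self.classes
                    .get(i)
                    .cloned()
                    .ok_or_else(|| format!("index {i} out of range"))
            })
            .collect()
    }
}

fn main() {
    let enc = fit(&["cat", "dog", "cat", "bird"]);
    println!("{:?}", enc.classes); // ["bird", "cat", "dog"]
    println!("{:?}", enc.transform(&["cat", "dog", "cat", "bird"])); // Ok([1, 2, 1, 0])
    println!("{:?}", enc.inverse_transform(&[0, 2])); // Ok(["bird", "dog"])
}
```

Keeping both a sorted `Vec` and a `HashMap` makes each direction a single lookup, at the cost of storing every label twice.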
## REQ status
Binary (R-DEFER-2), translating sklearn/preprocessing/_label.py (class LabelEncoder
:34). Design doc: .design/preprocess/label_encoder.md. Expected values from the live
sklearn 1.5.2 oracle (R-CHAR-3). Consumer: crate re-export (lib.rs:116, grandfathered S5).
HONEST (R-HONEST-3): ferrolearn is Array1<String>-only; sklearn LabelEncoder accepts any
hashable+comparable dtype. The non-empty string path value-matches the oracle exactly.
| REQ | Status | Evidence |
|---|---|---|
| REQ-1 (string fit → sorted-unique classes_) | SHIPPED | Fit::fit collects unique labels, Vec<String>::sort (lexicographic), builds label_to_index; mirrors sklearn classes_ = _unique(y) (_label.py:98). Critic-verified vs live oracle: green_fit_classes_sorted (["bird","cat","dog"]), green_sort_order_mixed_ascii_matches_numpy (["10","2","A","B","a","b"] == np.unique). Consumer: crate re-export lib.rs:116. |
| REQ-2 (inverse_transform) | SHIPPED | FittedLabelEncoder::inverse_transform = classes[idx] with out-of-range → InvalidParameter; mirrors sklearn classes_[y] + setdiff1d guard (:158-162). Critic-verified: green_inverse_transform_roundtrip, out-of-range rejected. |
| REQ-3 (transform + fit_transform) | SHIPPED | transform = label_to_index.get (unknown → InvalidParameter), mirrors _encode (:137); fit_transform mirrors _unique(return_inverse=True) (:115). Critic-verified: green_transform ([1,2,1,0]), green_fit_transform_equals_fit_then_transform, empty transform/inverse → empty (:134-135,:155-156). |
| REQ-5 (empty-fit parity) | SHIPPED | FIXED #1134. Removed the if x.is_empty() → InsufficientSamples guard; fit([]) now yields an empty FittedLabelEncoder matching sklearn _unique([]) (:98). Critic-verified: divergence_empty_fit_succeeds + 4 post-empty-fit guards; in-module test_empty_fit_yields_empty_classes (R-HONEST-4). |
| REQ-4 (numeric/generic dtype) | NOT-STARTED | open prereq blocker #1135. Array1<String>-only; sklearn accepts any dtype, numeric sort [10,2,1]→[1,2,10] (np.unique) unrepresentable (R-DEV-3). |
| REQ-6 (error-contract parity, R-DEV-2) | NOT-STARTED | open prereq blocker #1136. Unseen-label message (“unknown label” vs “y contains previously unseen labels”, :137,160) + unfitted-transform InvalidParameter vs NotFittedError (:131). Both REJECT (type maps to FerroError); message/NotFitted-analog gap. |
| REQ-7 (PyO3 binding) | SHIPPED | FIXED #1137. _RsLabelEncoder (hand #[pyclass] in ferrolearn-python/src/extras.rs, the 1-D string-input analog of _RsOrdinalEncoder) over FittedLabelEncoder exposes fit(Vec<String>) (→ Fit::fit on Array1<String>), transform(Vec<String>) → numpy int64 codes (unknown label → PyValueError “y contains previously unseen labels”, _label.py:137), inverse_transform(Vec<i64>) → Vec<String> (negative code rejected pre-cast, out-of-range → PyValueError, _label.py:158-160), and #[getter] classes_ (sorted-unique str list, _label.py:98); registered in lib.rs (add_class::<extras::RsLabelEncoder>). The _extras.py::LabelEncoder(BaseEstimator) wrapper mirrors sklearn’s no-param ctor (get_params() == {}, _label.py:34), a _to_labels(y) helper that requires 1-D (column_or_1d, _label.py:97; 2-D → ValueError) and REJECTS numeric-dtype input (NotImplementedError — string-only core sorts lexicographically; #2230 lesson), check_is_fitted→NotFittedError pre-fit, and an explicit fit_transform (LabelEncoder is not a TransformerMixin, _label.py:101). Non-test consumer: _extras.py::LabelEncoder + lib.rs registration + __init__.py re-export (R-DEFER-1). Verification (model B): tests/divergence_label_encoder_py.py (20 pass, live sklearn 1.5.2 oracle) — fit_transform(['b','a','c','a'])==[1,0,2,0], classes_==['a','b','c'], inverse roundtrip, unknown-label/out-of-range/negative ValueError, numeric-input NotImplementedError (vs sklearn numeric sort), numeric-looking-string string-sorted parity, pre-fit NotFittedError, 2-D ValueError, get_params()=={}, clone, numpy str/object array input. LabelEncoder has no n_features_in_/get_feature_names_out (target encoder). |
| REQ-8 (ferray substrate) | NOT-STARTED | open prereq blocker #1138. ndarray::Array1<String> + std::HashMap, not ferray-core (R-SUBSTRATE-1/2). |
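
The sort-order points in REQ-1 and REQ-4 can be illustrated with a small standalone sketch. It assumes only the standard library; `string_classes` and `numeric_classes` are hypothetical helpers written for this example, and the numeric variant is shown purely for contrast, since the table lists numeric input as not yet supported.

```rust
/// Sorted-unique classes for string labels.
/// `String` ordering is byte-wise lexicographic, so digits sort before
/// uppercase letters, which sort before lowercase letters.
fn string_classes(labels: &[&str]) -> Vec<String> {
    let mut classes: Vec<String> = labels.iter().map(|s| s.to_string()).collect();
    classes.sort();
    classes.dedup();
    classes
}

/// Hypothetical numeric counterpart: the same steps, but ordered by value.
fn numeric_classes(labels: &[i64]) -> Vec<i64> {
    let mut classes = labels.to_vec();
    classes.sort();
    classes.dedup();
    classes
}

fn main() {
    // Mixed ASCII labels, as in the REQ-1 evidence.
    println!("{:?}", string_classes(&["b", "A", "2", "a", "10", "B"]));
    // ["10", "2", "A", "B", "a", "b"]

    // The same three values as strings and as integers end up in different orders.
    println!("{:?}", string_classes(&["10", "2", "1"])); // ["1", "10", "2"]
    println!("{:?}", numeric_classes(&[10, 2, 1])); // [1, 2, 10]
}
```

Because the two orderings disagree for numeric-looking labels, a string-only encoder that silently stringified numbers would assign different codes than a value-ordered sort.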
## Structs
- FittedLabelEncoder: A fitted label encoder holding the bidirectional label-to-index mapping.
- LabelEncoder: An unfitted label encoder.