Module label_encoder

Label encoder: maps string labels to integer indices.

Learns an ordered mapping from unique string labels to consecutive integers 0, 1, ..., n_classes - 1. Supports forward (label → int) and reverse (int → label) transformation.
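The mapping described above can be sketched with the standard library alone. This is a minimal illustration, not ferrolearn's implementation: the function names `fit_classes`, `build_index`, `encode`, and `decode` are hypothetical, and `Option` stands in for the crate's real error type.

```rust
use std::collections::HashMap;

/// Sorted, de-duplicated labels; a label's position is its integer code.
/// (Hypothetical helper, not part of ferrolearn.)
fn fit_classes(labels: &[&str]) -> Vec<String> {
    let mut classes: Vec<String> = labels.iter().map(|s| s.to_string()).collect();
    classes.sort(); // byte-wise lexicographic order
    classes.dedup(); // removes adjacent duplicates, so it must follow the sort
    classes
}

/// Forward lookup table: label -> index.
fn build_index(classes: &[String]) -> HashMap<String, usize> {
    classes
        .iter()
        .cloned()
        .enumerate()
        .map(|(i, c)| (c, i))
        .collect()
}

/// Forward transform; `None` if any label was not seen during fitting.
fn encode(index: &HashMap<String, usize>, labels: &[&str]) -> Option<Vec<usize>> {
    labels.iter().map(|l| index.get(*l).copied()).collect()
}

/// Reverse transform; `None` if any code is out of range.
fn decode(classes: &[String], codes: &[usize]) -> Option<Vec<String>> {
    codes.iter().map(|&i| classes.get(i).cloned()).collect()
}
```

With the labels `["b", "a", "c", "a"]`, this sketch yields the classes `["a", "b", "c"]` and the codes `[1, 0, 2, 0]`, and decoding those codes returns the original labels.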

## REQ status

Binary (R-DEFER-2), translating sklearn/preprocessing/_label.py (class LabelEncoder :34). Design doc: .design/preprocess/label_encoder.md. Expected values from the live sklearn 1.5.2 oracle (R-CHAR-3). Consumer: crate re-export (lib.rs:116, grandfathered S5). HONEST (R-HONEST-3): ferrolearn is Array1<String>-only; sklearn LabelEncoder accepts any hashable+comparable dtype. The non-empty string path value-matches the oracle exactly.
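The string-only restriction matters because string and numeric orderings disagree. The sketch below uses a hypothetical generic helper, `unique_sorted`, to show the same sort-and-dedup step applied to strings and to integers; it is an illustration of the ordering difference, not code from the crate.

```rust
/// Sort and de-duplicate any totally ordered, clonable values.
/// (Hypothetical helper used only to compare orderings.)
fn unique_sorted<T: Ord + Clone>(values: &[T]) -> Vec<T> {
    let mut out = values.to_vec();
    out.sort(); // `&str` compares byte-wise; integers compare numerically
    out.dedup();
    out
}
```

Applied to the strings `"10"`, `"2"`, `"1"` the result is `["1", "10", "2"]`, whereas the integers `10, 2, 1` sort to `[1, 2, 10]`. A string-only encoder therefore cannot reproduce the class order a numeric input would get.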

| REQ | Status | Evidence |
|-----|--------|----------|
| REQ-1 (string fit → sorted-unique classes_) | SHIPPED | Fit::fit collects unique labels, Vec<String>::sort (lexicographic), builds label_to_index; mirrors sklearn classes_ = _unique(y) (_label.py:98). Critic-verified vs live oracle: green_fit_classes_sorted (["bird","cat","dog"]), green_sort_order_mixed_ascii_matches_numpy (["10","2","A","B","a","b"] == np.unique). Consumer: crate re-export lib.rs:116. |
| REQ-2 (inverse_transform) | SHIPPED | FittedLabelEncoder::inverse_transform = classes[idx] with out-of-range → InvalidParameter; mirrors sklearn classes_[y] + setdiff1d guard (:158-162). Critic-verified: green_inverse_transform_roundtrip, out-of-range rejected. |
| REQ-3 (transform + fit_transform) | SHIPPED | transform = label_to_index.get (unknown → InvalidParameter), mirrors _encode (:137); fit_transform mirrors _unique(return_inverse=True) (:115). Critic-verified: green_transform ([1,2,1,0]), green_fit_transform_equals_fit_then_transform, empty transform/inverse → empty (:134-135,:155-156). |
| REQ-5 (empty-fit parity) | SHIPPED | FIXED #1134. Removed the if x.is_empty() → InsufficientSamples guard; fit([]) now yields an empty FittedLabelEncoder matching sklearn _unique([]) (:98). Critic-verified: divergence_empty_fit_succeeds + 4 post-empty-fit guards; in-module test_empty_fit_yields_empty_classes (R-HONEST-4). |
| REQ-4 (numeric/generic dtype) | NOT-STARTED | Open prereq blocker #1135. Array1<String>-only; sklearn accepts any dtype, numeric sort [10,2,1]→[1,2,10] (np.unique) unrepresentable (R-DEV-3). |
| REQ-6 (error-contract parity, R-DEV-2) | NOT-STARTED | Open prereq blocker #1136. Unseen-label message (“unknown label” vs “y contains previously unseen labels”, :137,160) + unfitted-transform InvalidParameter vs NotFittedError (:131). Both REJECT (type maps to FerroError); message/NotFitted-analog gap. |
| REQ-7 (PyO3 binding) | SHIPPED | FIXED #1137. _RsLabelEncoder (hand #[pyclass] in ferrolearn-python/src/extras.rs, the 1-D string-input analog of _RsOrdinalEncoder) over FittedLabelEncoder exposes fit(Vec<String>) (→ Fit::fit on Array1<String>), transform(Vec<String>) → numpy int64 codes (unknown label → PyValueError “y contains previously unseen labels”, _label.py:137), inverse_transform(Vec<i64>) → Vec<String> (negative code rejected pre-cast, out-of-range → PyValueError, _label.py:158-160), and #[getter] classes_ (sorted-unique str list, _label.py:98); registered in lib.rs (add_class::<extras::RsLabelEncoder>). The _extras.py::LabelEncoder(BaseEstimator) wrapper mirrors sklearn’s no-param ctor (get_params() == {}, _label.py:34), a _to_labels(y) helper that requires 1-D (column_or_1d, _label.py:97; 2-D → ValueError) and REJECTS numeric-dtype input (NotImplementedError: string-only core sorts lexicographically; #2230 lesson), check_is_fitted → NotFittedError pre-fit, and an explicit fit_transform (LabelEncoder is not a TransformerMixin, _label.py:101). Non-test consumer: _extras.py::LabelEncoder + lib.rs registration + __init__.py re-export (R-DEFER-1). Verification (model B): tests/divergence_label_encoder_py.py (20 pass, live sklearn 1.5.2 oracle): fit_transform(['b','a','c','a'])==[1,0,2,0], classes_==['a','b','c'], inverse roundtrip, unknown-label/out-of-range/negative ValueError, numeric-input NotImplementedError (vs sklearn numeric sort), numeric-looking-string string-sorted parity, pre-fit NotFittedError, 2-D ValueError, get_params()=={}, clone, numpy str/object array input. LabelEncoder has no n_features_in_/get_feature_names_out (target encoder). |
| REQ-8 (ferray substrate) | NOT-STARTED | Open prereq blocker #1138. ndarray::Array1<String> + std::HashMap, not ferray-core (R-SUBSTRATE-1/2). |
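The reverse-transform guards listed in the table (out-of-range codes rejected, negative codes rejected before any cast, empty input giving empty output) can be sketched as follows. This is an assumption-laden illustration: `SketchError` and the free function `inverse_transform` are hypothetical stand-ins, and the error messages are invented for the example rather than taken from ferrolearn or sklearn.

```rust
/// Hypothetical error type standing in for the crate's real one.
#[derive(Debug, PartialEq)]
enum SketchError {
    InvalidParameter(String),
}

/// Map signed integer codes back to labels.
/// `classes` is assumed to be the sorted-unique label list from fitting.
fn inverse_transform(classes: &[String], codes: &[i64]) -> Result<Vec<String>, SketchError> {
    codes
        .iter()
        .map(|&code| {
            // Reject negatives before converting: `-1 as usize` would wrap
            // to a huge index instead of signalling bad input.
            if code < 0 {
                return Err(SketchError::InvalidParameter(format!(
                    "negative code {}",
                    code
                )));
            }
            let idx = code as usize;
            // `get` returns None past the end, which becomes an error here.
            classes.get(idx).cloned().ok_or_else(|| {
                SketchError::InvalidParameter(format!(
                    "code {} out of range for {} classes",
                    code,
                    classes.len()
                ))
            })
        })
        .collect() // stops at the first error; an empty input yields Ok(vec![])
}
```

Checking the sign before the cast keeps a negative code from silently aliasing a valid index on any platform, which is the reason the sketch does not rely on the bounds check alone.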

Structs

FittedLabelEncoder
A fitted label encoder holding the bidirectional label-to-index mapping.
LabelEncoder
An unfitted label encoder.
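One way to picture the split between an unfitted and a fitted type is sketched below. The names `SketchLabelEncoder` and `SketchFittedLabelEncoder`, their fields, and their methods are hypothetical; the real structs and the `Fit` trait may be shaped differently.

```rust
use std::collections::HashMap;

/// Hypothetical unfitted encoder: no parameters, no learned state.
#[derive(Debug, Default, Clone)]
struct SketchLabelEncoder;

/// Hypothetical fitted encoder holding both directions of the mapping.
#[derive(Debug, Clone)]
struct SketchFittedLabelEncoder {
    classes: Vec<String>,                   // index -> label
    label_to_index: HashMap<String, usize>, // label -> index
}

impl SketchLabelEncoder {
    /// Learn the mapping; an empty input yields an empty fitted encoder.
    fn fit(&self, labels: &[String]) -> SketchFittedLabelEncoder {
        let mut classes = labels.to_vec();
        classes.sort();
        classes.dedup();
        let label_to_index = classes
            .iter()
            .cloned()
            .enumerate()
            .map(|(i, c)| (c, i))
            .collect();
        SketchFittedLabelEncoder { classes, label_to_index }
    }
}

impl SketchFittedLabelEncoder {
    /// Learned classes in sorted order.
    fn classes(&self) -> &[String] {
        &self.classes
    }

    /// Integer code for a label, if it was seen during fitting.
    fn index_of(&self, label: &str) -> Option<usize> {
        self.label_to_index.get(label).copied()
    }
}
```

In this sketch the learned state lives only on the fitted type, so lookups are reachable only after `fit` has run.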