Module target_encoder


Target encoder: encode categorical features using target statistics.

TargetEncoder replaces each category with the mean of the target variable for that category, regularised toward the global mean using smoothing.

This is especially useful for high-cardinality categorical features where one-hot encoding would produce too many columns.

§Smoothing

The encoded value for category c follows scikit-learn's _target_encoder_fast.pyx:60-75: the accumulator is seeded with smooth * global_mean, the category's targets are added to it, and the total is divided by smooth + count(c):

encoded(c) = (smooth * global_mean + sum_of_targets(c)) / (smooth + count(c))

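where smooth controls the degree of regularisation: smooth = 0 yields the plain per-category mean, and larger values pull the encoding toward global_mean.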
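
The formula above can be sketched in a few lines of Rust. This is a minimal illustration under stated assumptions, not the crate's implementation: `fit_fixed_smooth` is a hypothetical helper, it handles a single feature column, and it works directly in f64.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not the crate's API): fixed-`smooth` target encoding
/// for one feature column. Returns the per-category encodings and the global mean.
fn fit_fixed_smooth(cats: &[usize], y: &[f64], smooth: f64) -> (HashMap<usize, f64>, f64) {
    assert_eq!(cats.len(), y.len());
    assert!(!y.is_empty());
    let global_mean = y.iter().sum::<f64>() / y.len() as f64;

    // Each accumulator is seeded with smooth * global_mean, then the
    // category's targets are added and its count is incremented.
    let mut stats: HashMap<usize, (f64, usize)> = HashMap::new();
    for (&c, &t) in cats.iter().zip(y) {
        let entry = stats.entry(c).or_insert((smooth * global_mean, 0));
        entry.0 += t;
        entry.1 += 1;
    }

    // encoded(c) = (smooth * global_mean + sum_of_targets(c)) / (smooth + count(c))
    let encodings: HashMap<usize, f64> = stats
        .into_iter()
        .map(|(c, (sum, count))| (c, sum / (smooth + count as f64)))
        .collect();
    (encodings, global_mean)
}

fn main() {
    let cats = [0, 0, 1];
    let y = [1.0, 3.0, 8.0];
    let (enc, global_mean) = fit_fixed_smooth(&cats, &y, 2.0);
    // global_mean = 12 / 3 = 4
    // category 0: (2 * 4 + 4) / (2 + 2) = 3
    // category 1: (2 * 4 + 8) / (2 + 1) = 16 / 3
    println!("cat 0 -> {}, cat 1 -> {}", enc[&0], enc[&1]);
    // A category never seen during fitting falls back to the global mean.
    let unseen = enc.get(&7).copied().unwrap_or(global_mean);
    println!("unseen -> {unseen}");
}
```

With smooth = 2.0, category 1 has a raw mean of 8 but is encoded as 16/3, closer to the global mean of 4, because it has only one sample.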

Translation target: scikit-learn 1.5.2 class TargetEncoder (sklearn/preprocessing/_target_encoder.py). Design: .design/preprocess/target_encoder.md. Tracking: #1260.

§REQ status

REQ	Status	Anchor
REQ-1 manual-`smooth` m-estimate value match (f64, bit-exact)	SHIPPED	`TargetEncoder::fit` / `transform`; sklearn `_target_encoder_fast.pyx:60-75`, `_target_encoder.py:289`,`:383` (#1261 pairwise sum, #1262 formula)
REQ-2 unseen category → `target_mean_` (global mean)	SHIPPED	`transform` `unwrap_or(global_mean)`; sklearn `_target_encoder.py:324-345`
REQ-3 InsufficientSamples / ShapeMismatch / InvalidParameter errors	SHIPPED	`fit` / `transform` guards; sklearn `_target_encoder.py:189`
REQ-4 `smooth="auto"` empirical-Bayes encoding + DEFAULT	SHIPPED	`Smooth` enum `{ Auto, Fixed(F) }` (`Default`/`TargetEncoder::default` → `Auto`); `fit_feature_encoding` Auto branch (two-pass means/ssd, `lambda_ = y_variance*count/(y_variance*count + ssd/count)`, NaN→y_mean), `population_variance_f64` (ddof=0) computed once in `fit`; sklearn `_target_encoder_fast.pyx:140-165`, `_target_encoder.py:199`,`:416`. Consumer: `TargetEncoder::fit`/`fit_transform`/`default` (the `Smooth` field drives the encoding branch) + the public module path `ferrolearn_preprocess::target_encoder::Smooth` (`pub mod target_encoder` in `lib.rs`). Verify: pins `divergence_default_smooth_is_auto`/`divergence_smooth_auto_empirical_bayes` green (#2342 #2343)
REQ-5 cross-fitting `fit_transform` (deterministic KFold)	SHIPPED	`TargetEncoder::fit_transform` cross-fits over `kfold_test_ranges` (contiguous no-shuffle folds, `cv` default 5), per-fold `fit_feature_encoding` on TRAIN rows → encode TEST rows (unseen-in-train → `y_train_mean`); sklearn `_target_encoder.py:232`,`:254-303`, `_split.py:521-534`. Consumer: crate re-export (`lib.rs`). Verify: pin `divergence_crossfit_fit_transform` green (#2344). NOTE: `shuffle`/`random_state` (REQ-8 NOT-STARTED) absent → deterministic `shuffle=False` KFold only
REQ-6 `target_type` binary/multiclass	NOT-STARTED (#1266)	sklearn `_target_encoder.py:269-273`,`:376-379`
REQ-7 `categories` param + `categories_`/`target_type_`/`classes_`	NOT-STARTED (#1267)	sklearn `_target_encoder.py:197`,`:358-381`
REQ-8 `cv`/`shuffle`/`random_state` params	NOT-STARTED (#1268)	sklearn `_target_encoder.py:200-209`
REQ-9 string/object categories	NOT-STARTED (#1269)	usize-only, R-DEV-3
REQ-10 `get_feature_names_out`/`n_features_in_`	NOT-STARTED (#1270)	sklearn `OneToOneFeatureMixin`
REQ-11 PyO3 binding	NOT-STARTED (#1271)	`ferrolearn-python/src/` (absent)
REQ-12 ferray substrate	NOT-STARTED (#1272)	R-SUBSTRATE
REQ-13 per-category sums accumulate in f64 (always), matching sklearn’s C `double`	SHIPPED	`fit` accumulates `cat_stats: HashMap<usize,(f64,usize)>` seeded with `smooth_f64*global_mean_f64`, `+= y[i].to_f64()`, then `F::from(sum/(smooth_f64+count))`; sklearn `_target_encoder_fast.pyx:42,44,68` (`double sums[]`/`counts[]`, `sums[cat]+=y[i]` regardless of `Y_DTYPE`), `encodings_` always float64 (`_target_encoder.py:335`). f64 path identity (bit-exact unchanged); `TargetEncoder<f32>` now captures `2^24+1` (#1263)
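
The `smooth="auto"` branch described under REQ-4 can be sketched as follows. This is an illustration under assumptions rather than the crate's code: `fit_auto_smooth` is a hypothetical helper for a single feature column, and the final blend `lambda * mean + (1 - lambda) * y_mean` is the usual empirical-Bayes shrinkage step, which the table does not spell out.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not the crate's API): "auto" (empirical-Bayes)
/// target encoding for one feature column.
fn fit_auto_smooth(cats: &[usize], y: &[f64]) -> HashMap<usize, f64> {
    assert_eq!(cats.len(), y.len());
    assert!(!y.is_empty());
    let n = y.len() as f64;
    let y_mean = y.iter().sum::<f64>() / n;
    // Population variance of the target (ddof = 0), computed once.
    let y_variance = y.iter().map(|t| (t - y_mean).powi(2)).sum::<f64>() / n;

    // Pass 1: per-category (sum, count), which gives the category means.
    let mut stats: HashMap<usize, (f64, f64)> = HashMap::new();
    for (&c, &t) in cats.iter().zip(y) {
        let entry = stats.entry(c).or_insert((0.0, 0.0));
        entry.0 += t;
        entry.1 += 1.0;
    }

    // Pass 2: per-category sum of squared differences from the category mean.
    let mut ssd: HashMap<usize, f64> = HashMap::new();
    for (&c, &t) in cats.iter().zip(y) {
        let (sum, count) = stats[&c];
        let diff = t - sum / count;
        *ssd.entry(c).or_insert(0.0) += diff * diff;
    }

    stats
        .iter()
        .map(|(&c, &(sum, count))| {
            let mean = sum / count;
            let lambda = y_variance * count / (y_variance * count + ssd[&c] / count);
            // 0/0 (zero target variance and zero within-category spread)
            // produces NaN; fall back to the global mean in that case.
            let enc = if lambda.is_nan() {
                y_mean
            } else {
                lambda * mean + (1.0 - lambda) * y_mean
            };
            (c, enc)
        })
        .collect()
}

fn main() {
    let enc = fit_auto_smooth(&[0, 0, 1, 1], &[1.0, 3.0, 6.0, 10.0]);
    // y_mean = 5, y_variance = 11.5
    // category 0: mean 2, ssd 2, lambda = 23 / 24
    // category 1: mean 8, ssd 8, lambda = 23 / 27
    println!("cat 0 -> {}, cat 1 -> {}", enc[&0], enc[&1]);
}
```

Categories whose targets vary little relative to the overall target variance get a lambda near 1 and keep an encoding close to their own mean.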

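The contiguous, no-shuffle folds that REQ-5 refers to can be sketched as below. This assumes the scikit-learn `KFold` sizing rule (the first `n_samples % n_splits` folds receive one extra sample); `kfold_test_ranges_sketch` is a hypothetical stand-in, and the signature of the crate's own `kfold_test_ranges` is not shown in this page.

```rust
use std::ops::Range;

/// Hypothetical helper (not the crate's API): contiguous test ranges for a
/// deterministic, unshuffled K-fold split.
fn kfold_test_ranges_sketch(n_samples: usize, n_splits: usize) -> Vec<Range<usize>> {
    assert!(n_splits >= 2 && n_splits <= n_samples);
    let base = n_samples / n_splits;
    let remainder = n_samples % n_splits;
    let mut start = 0;
    (0..n_splits)
        .map(|fold| {
            // The first `remainder` folds take one extra sample each.
            let len = base + usize::from(fold < remainder);
            let range = start..start + len;
            start += len;
            range
        })
        .collect()
}

fn main() {
    // 7 samples over 3 folds: sizes 3, 2, 2.
    for (fold, test) in kfold_test_ranges_sketch(7, 3).iter().enumerate() {
        // Rows inside `test` are encoded with statistics fitted on all other rows.
        println!("fold {fold}: test rows {test:?}");
    }
}
```

During cross-fitting, each row is encoded by a model that never saw that row's target, which is what keeps `fit_transform` from leaking the target into the training features.
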
Structs§

FittedTargetEncoder: A fitted target encoder holding per-feature, per-category encoding values.
TargetEncoder: An unfitted target encoder.

Enums§

Smooth: The smoothing strategy for TargetEncoder.