Module target_encoder

Target encoder: encode categorical features using target statistics.

TargetEncoder replaces each category with the mean of the target variable for that category, regularised toward the global mean using smoothing.

This is especially useful for high-cardinality categorical features where one-hot encoding would produce too many columns.

§Smoothing

The encoded value for category c follows scikit-learn _target_encoder_fast.pyx:60-75: the accumulator is seeded with smooth * global_mean, the category’s targets are added to it, and the total is divided by smooth + count(c):

encoded(c) = (smooth * global_mean + sum_of_targets(c)) / (smooth + count(c))

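where smooth controls the strength of regularisation: larger values pull each category's encoding closer to global_mean, and smooth = 0 yields the plain per-category mean.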
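The formula above can be sketched in a few lines of standalone Rust. This is an illustrative sketch only: `smoothed_encodings` is a hypothetical helper, not part of this crate's API, and it assumes a single feature column of `usize` categories with `f64` targets.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not the crate API): smoothed target mean per category
/// for one feature column, using a fixed `smooth` value.
fn smoothed_encodings(categories: &[usize], y: &[f64], smooth: f64) -> HashMap<usize, f64> {
    assert_eq!(categories.len(), y.len(), "one target per row");
    assert!(!y.is_empty(), "need at least one sample");

    let global_mean = y.iter().sum::<f64>() / y.len() as f64;

    // Each category's accumulator starts at smooth * global_mean,
    // then the category's targets are added and counted.
    let mut stats: HashMap<usize, (f64, usize)> = HashMap::new();
    for (&c, &t) in categories.iter().zip(y) {
        let entry = stats.entry(c).or_insert((smooth * global_mean, 0));
        entry.0 += t;
        entry.1 += 1;
    }

    // encoded(c) = (smooth * global_mean + sum_of_targets(c)) / (smooth + count(c))
    stats
        .into_iter()
        .map(|(c, (sum, count))| (c, sum / (smooth + count as f64)))
        .collect()
}

fn main() {
    let categories = [0, 0, 1, 1, 1];
    let y = [1.0, 3.0, 4.0, 6.0, 6.0];
    let enc = smoothed_encodings(&categories, &y, 2.0);
    println!("category 0 -> {:?}", enc.get(&0));
    println!("category 1 -> {:?}", enc.get(&1));
}
```

With these inputs the global mean is 4.0, so category 0 (targets 1 and 3) is encoded as (2 * 4 + 4) / (2 + 2) = 3.0, which sits between its raw mean of 2.0 and the global mean.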

Translation target: scikit-learn 1.5.2 class TargetEncoder (sklearn/preprocessing/_target_encoder.py). Design: .design/preprocess/target_encoder.md. Tracking: #1260.

## REQ status

| REQ | Status | Anchor |
| --- | --- | --- |
| REQ-1 manual-smooth m-estimate value match (f64, bit-exact) | SHIPPED | TargetEncoder::fit / transform; sklearn _target_encoder_fast.pyx:60-75, _target_encoder.py:289,:383 (#1261 pairwise sum, #1262 formula) |
| REQ-2 unseen category → target_mean_ (global mean) | SHIPPED | transform unwrap_or(global_mean); sklearn _target_encoder.py:324-345 |
| REQ-3 InsufficientSamples / ShapeMismatch / InvalidParameter errors | SHIPPED | fit / transform guards; sklearn _target_encoder.py:189 |
| REQ-4 smooth="auto" empirical-Bayes encoding + DEFAULT | SHIPPED | Smooth enum { Auto, Fixed(F) } (Default/TargetEncoder::default → Auto); fit_feature_encoding Auto branch (two-pass means/ssd, `lambda_ = y_variance*count/(y_variance*count+ssd/count)`, NaN→y_mean), population_variance_f64 (ddof=0) computed once in fit; sklearn _target_encoder_fast.pyx:140-165, _target_encoder.py:199,:416. Consumer: TargetEncoder::fit/fit_transform/default (the Smooth field drives the encoding branch) + the public module path ferrolearn_preprocess::target_encoder::Smooth (pub mod target_encoder in lib.rs). Verify: pins divergence_default_smooth_is_auto/divergence_smooth_auto_empirical_bayes green (#2342 #2343) |
| REQ-5 cross-fitting fit_transform (deterministic KFold) | SHIPPED | TargetEncoder::fit_transform cross-fits over kfold_test_ranges (contiguous no-shuffle folds, cv default 5), per-fold fit_feature_encoding on TRAIN rows → encode TEST rows (unseen-in-train → y_train_mean); sklearn _target_encoder.py:232,:254-303, _split.py:521-534. Consumer: crate re-export (lib.rs). Verify: pin divergence_crossfit_fit_transform green (#2344). NOTE: shuffle/random_state (REQ-8 NOT-STARTED) absent → deterministic shuffle=False KFold only |
| REQ-6 target_type binary/multiclass | NOT-STARTED (#1266) | sklearn _target_encoder.py:269-273,:376-379 |
| REQ-7 categories param + categories_/target_type_/classes_ | NOT-STARTED (#1267) | sklearn _target_encoder.py:197,:358-381 |
| REQ-8 cv/shuffle/random_state params | NOT-STARTED (#1268) | sklearn _target_encoder.py:200-209 |
| REQ-9 string/object categories | NOT-STARTED (#1269) | usize-only, R-DEV-3 |
| REQ-10 get_feature_names_out/n_features_in_ | NOT-STARTED (#1270) | sklearn OneToOneFeatureMixin |
| REQ-11 PyO3 binding | NOT-STARTED (#1271) | ferrolearn-python/src/ (absent) |
| REQ-12 ferray substrate | NOT-STARTED (#1272) | R-SUBSTRATE |
| REQ-13 per-category sums accumulate in f64 (always), matching sklearn’s C double | SHIPPED | fit accumulates `cat_stats: HashMap<usize,(f64,usize)>` seeded with `smooth_f64*global_mean_f64`, `+= y[i].to_f64()`, then `F::from(sum/(smooth_f64+count))`; sklearn _target_encoder_fast.pyx:42,44,68 (double sums[]/counts[], sums[cat]+=y[i] regardless of Y_DTYPE), encodings_ always float64 (_target_encoder.py:335). f64 path identity (bit-exact unchanged); `TargetEncoder<f32>` now captures 2^24+1 (#1263) |
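
The Auto branch described under REQ-4 can be illustrated with a small standalone sketch. Both functions below are hypothetical helpers, not the crate's `fit_feature_encoding` or `population_variance_f64`. The `lambda_` expression and the NaN fallback come from the REQ-4 row; the final blend `lambda_ * mean + (1 - lambda_) * y_mean` is an assumption based on how scikit-learn's auto smoothing is understood to work, and is not stated on this page.

```rust
/// Hypothetical helper: mean and population variance (ddof = 0) of the targets.
fn mean_and_population_variance(y: &[f64]) -> (f64, f64) {
    let n = y.len() as f64;
    let mean = y.iter().sum::<f64>() / n;
    let var = y.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
    (mean, var)
}

/// Hypothetical helper: empirical-Bayes encoding of one category,
/// given the targets that fall in that category.
fn auto_smooth_encoding(cat_targets: &[f64], y_mean: f64, y_variance: f64) -> f64 {
    let count = cat_targets.len() as f64;

    // Pass 1: the category mean.
    let mean = cat_targets.iter().sum::<f64>() / count;
    // Pass 2: sum of squared differences from the category mean.
    let ssd: f64 = cat_targets.iter().map(|t| (t - mean).powi(2)).sum();

    // lambda_ = y_variance*count / (y_variance*count + ssd/count)
    let lambda = y_variance * count / (y_variance * count + ssd / count);

    if lambda.is_nan() {
        // 0/0 case (for example, constant targets): fall back to the global mean.
        y_mean
    } else {
        // Assumed blend between the category mean and the global mean.
        lambda * mean + (1.0 - lambda) * y_mean
    }
}

fn main() {
    let y = [1.0, 3.0, 4.0, 6.0, 6.0];
    let (y_mean, y_variance) = mean_and_population_variance(&y);

    // Category 0 holds the first two targets in this toy data set.
    let encoded = auto_smooth_encoding(&y[..2], y_mean, y_variance);
    println!("y_mean = {y_mean}, y_variance = {y_variance}, encoded = {encoded}");
}
```

In this sketch a category whose targets vary little relative to the overall variance gets `lambda_` close to 1 and keeps its own mean, while a noisy category is pulled toward `y_mean`.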

Structs§

FittedTargetEncoder
A fitted target encoder holding per-feature, per-category encoding values.
TargetEncoder
An unfitted target encoder.

Enums§

Smooth
The smoothing strategy for TargetEncoder.