Target encoder: encode categorical features using target statistics.
TargetEncoder replaces each category with the mean of the target variable
for that category, regularised toward the global mean using smoothing.
This is especially useful for high-cardinality categorical features where one-hot encoding would produce too many columns.
§Smoothing
The encoded value for category c matches scikit-learn
(_target_encoder_fast.pyx:60-75): the accumulator is seeded with
smooth * global_mean, the category’s targets are added, and the total is
divided by smooth + count(c):
encoded(c) = (smooth * global_mean + sum_of_targets(c)) / (smooth + count(c))
where smooth controls the degree of regularisation: smooth = 0 yields the plain category mean, and larger values pull the encoding toward the global mean.
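As a rough illustration of the formula above, the following sketch computes fixed-smooth encodings for a single feature. The function name `m_estimate_encodings` and its signature are hypothetical and are not part of this crate's API; the sketch only assumes the seeding order described above (seed with smooth * global_mean, add targets, divide by smooth + count) and accumulates in f64.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not the crate's API): returns the global target mean
/// and the smoothed encoding for every category seen in `cats`.
fn m_estimate_encodings(
    cats: &[usize],
    y: &[f64],
    smooth: f64,
) -> (f64, HashMap<usize, f64>) {
    assert_eq!(cats.len(), y.len(), "one target per row");
    assert!(!y.is_empty(), "need at least one sample");

    let global_mean = y.iter().sum::<f64>() / y.len() as f64;

    // Per-category (running sum, count). The running sum is seeded with
    // smooth * global_mean before any target is added.
    let mut stats: HashMap<usize, (f64, usize)> = HashMap::new();
    for (&cat, &target) in cats.iter().zip(y) {
        let entry = stats.entry(cat).or_insert((smooth * global_mean, 0));
        entry.0 += target;
        entry.1 += 1;
    }

    // encoded(c) = (smooth * global_mean + sum_of_targets(c)) / (smooth + count(c))
    let encodings = stats
        .into_iter()
        .map(|(cat, (sum, count))| (cat, sum / (smooth + count as f64)))
        .collect();

    (global_mean, encodings)
}
```

With cats = [0, 0, 1], y = [1.0, 3.0, 5.0] and smooth = 2.0, the global mean is 3.0, category 0 encodes to (6 + 4) / (2 + 2) = 2.5, and category 1 encodes to (6 + 5) / (2 + 1) = 11/3. A category absent from the map can be encoded with the returned global mean, in the spirit of the unseen-category fallback listed under REQ-2.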
Translation target: scikit-learn 1.5.2 class TargetEncoder
(sklearn/preprocessing/_target_encoder.py). Design:
.design/preprocess/target_encoder.md. Tracking: #1260.
## REQ status
| REQ | Status | Anchor |
|---|---|---|
| REQ-1 manual-smooth m-estimate value match (f64, bit-exact) | SHIPPED | TargetEncoder::fit / transform; sklearn _target_encoder_fast.pyx:60-75, _target_encoder.py:289,:383 (#1261 pairwise sum, #1262 formula) |
| REQ-2 unseen category → target_mean_ (global mean) | SHIPPED | transform unwrap_or(global_mean); sklearn _target_encoder.py:324-345 |
| REQ-3 InsufficientSamples / ShapeMismatch / InvalidParameter errors | SHIPPED | fit / transform guards; sklearn _target_encoder.py:189 |
| REQ-4 smooth="auto" empirical-Bayes encoding + DEFAULT | SHIPPED | Smooth enum { Auto, Fixed(F) } (Default/TargetEncoder::default → Auto); fit_feature_encoding Auto branch (two-pass means/ssd, lambda_ = y_variance*count/(y_variance*count+ssd/count), NaN→y_mean), population_variance_f64 (ddof=0) computed once in fit; sklearn _target_encoder_fast.pyx:140-165, _target_encoder.py:199,:416. Consumer: TargetEncoder::fit/fit_transform/default (the Smooth field drives the encoding branch) + the public module path ferrolearn_preprocess::target_encoder::Smooth (pub mod target_encoder in lib.rs). Verify: pins divergence_default_smooth_is_auto/divergence_smooth_auto_empirical_bayes green (#2342 #2343) |
| REQ-5 cross-fitting fit_transform (deterministic KFold) | SHIPPED | TargetEncoder::fit_transform cross-fits over kfold_test_ranges (contiguous no-shuffle folds, cv default 5), per-fold fit_feature_encoding on TRAIN rows → encode TEST rows (unseen-in-train → y_train_mean); sklearn _target_encoder.py:232,:254-303, _split.py:521-534. Consumer: crate re-export (lib.rs). Verify: pin divergence_crossfit_fit_transform green (#2344). NOTE: shuffle/random_state (REQ-8 NOT-STARTED) absent → deterministic shuffle=False KFold only |
| REQ-6 target_type binary/multiclass | NOT-STARTED (#1266) | sklearn _target_encoder.py:269-273,:376-379 |
| REQ-7 categories param + categories_/target_type_/classes_ | NOT-STARTED (#1267) | sklearn _target_encoder.py:197,:358-381 |
| REQ-8 cv/shuffle/random_state params | NOT-STARTED (#1268) | sklearn _target_encoder.py:200-209 |
| REQ-9 string/object categories | NOT-STARTED (#1269) | usize-only, R-DEV-3 |
| REQ-10 get_feature_names_out/n_features_in_ | NOT-STARTED (#1270) | sklearn OneToOneFeatureMixin |
| REQ-11 PyO3 binding | NOT-STARTED (#1271) | ferrolearn-python/src/ (absent) |
| REQ-12 ferray substrate | NOT-STARTED (#1272) | R-SUBSTRATE |
| REQ-13 per-category sums accumulate in f64 (always), matching sklearn’s C double | SHIPPED | fit accumulates cat_stats: HashMap<usize,(f64,usize)> seeded with smooth_f64*global_mean_f64, += y[i].to_f64(), then F::from(sum/(smooth_f64+count)); sklearn _target_encoder_fast.pyx:42,44,68 (double sums[]/counts[], sums[cat]+=y[i] regardless of Y_DTYPE), encodings_ always float64 (_target_encoder.py:335). f64 path identity (bit-exact unchanged); TargetEncoder<f32> now captures 2^24+1 (#1263) |
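
The REQ-4 and REQ-5 rows compress two computations into shorthand, so a small sketch may help. It shows the auto-smoothing shrinkage factor quoted in REQ-4 and the contiguous, unshuffled fold layout described in REQ-5. All three function names are hypothetical stand-ins, not the crate's `population_variance_f64`, `fit_feature_encoding`, or `kfold_test_ranges`. The final blend `lambda * mean + (1 - lambda) * y_mean` and the rule that the first `n % k` folds receive one extra row are assumptions taken from scikit-learn's behaviour, since the table does not spell them out.

```rust
use std::ops::Range;

/// Hypothetical helper: population variance of the targets (ddof = 0).
fn population_variance(y: &[f64]) -> f64 {
    let n = y.len() as f64;
    let mean = y.iter().sum::<f64>() / n;
    y.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n
}

/// Hypothetical helper: auto-smoothed encoding for one category, given the
/// targets of the rows that fall in that category.
fn auto_encoding(cat_targets: &[f64], y_mean: f64, y_variance: f64) -> f64 {
    let count = cat_targets.len() as f64;
    // Pass 1: category mean. Pass 2: sum of squared deviations from it.
    let mean = cat_targets.iter().sum::<f64>() / count;
    let ssd: f64 = cat_targets.iter().map(|v| (v - mean).powi(2)).sum();

    // lambda_ = y_variance*count / (y_variance*count + ssd/count)
    let lambda = y_variance * count / (y_variance * count + ssd / count);
    if lambda.is_nan() {
        // 0/0 case (e.g. constant targets): fall back to the global mean.
        y_mean
    } else {
        // Assumed blend: shrink the category mean toward the global mean.
        lambda * mean + (1.0 - lambda) * y_mean
    }
}

/// Hypothetical helper: contiguous test ranges for an unshuffled k-fold split.
/// The first `n_samples % n_splits` folds get one extra row (assumption).
fn contiguous_folds(n_samples: usize, n_splits: usize) -> Vec<Range<usize>> {
    assert!(n_splits > 0, "need at least one fold");
    let base = n_samples / n_splits;
    let extra = n_samples % n_splits;
    let mut start = 0usize;
    (0..n_splits)
        .map(|i| {
            let len = base + usize::from(i < extra);
            let range = start..start + len;
            start += len;
            range
        })
        .collect()
}
```

For y = [1, 3, 5, 7] the global mean is 4 and the population variance is 5. A category holding the targets [1, 3] has mean 2 and ssd 2, so lambda = 10 / (10 + 1) = 10/11 and the encoding is 24/11, slightly above the raw category mean. Splitting 7 rows into 5 folds gives the test ranges 0..2, 2..4, 4..5, 5..6, 6..7; in cross-fitting, each range would be encoded from statistics fitted on the remaining rows.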
Structs§
- FittedTargetEncoder - A fitted target encoder holding per-feature, per-category encoding values.
- TargetEncoder - An unfitted target encoder.
Enums§
- Smooth - The smoothing strategy for TargetEncoder.