Feature selection transformers.
This module provides three feature selection strategies:
- VarianceThreshold: remove features whose variance does not exceed a configurable threshold (the default 0.0 removes zero-variance features).
- SelectKBest: keep the K features with the highest ANOVA F-scores computed against a class label vector.
- SelectFromModel: keep features whose importance weight (provided by a previously fitted model) exceeds a configurable threshold.
All three implement the standard ferrolearn Fit / Transform pattern
and integrate with the dynamic ferrolearn_core::pipeline::Pipeline.
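
As a rough illustration of the SelectKBest strategy, the sketch below computes a one-way ANOVA F-score for a single feature column and then picks the top K indices from a score vector. It is a standalone approximation that uses only the standard library, not the crate's implementation: the names `anova_f` and `select_k_best` and the plain-slice signatures are hypothetical. The tie-break (higher index wins) and the NaN handling (NaN is ranked last) follow the behaviour recorded under REQ-3 in the table below.

```rust
/// One-way ANOVA F-statistic for one feature column against class labels.
/// Hypothetical helper; labels are assumed to be small non-negative class ids.
fn anova_f(feature: &[f64], labels: &[usize]) -> f64 {
    assert_eq!(feature.len(), labels.len(), "one label per sample");
    let n = feature.len();
    let slots = labels.iter().copied().max().map_or(0, |m| m + 1);
    let mut sums = vec![0.0_f64; slots];
    let mut counts = vec![0usize; slots];
    for (&x, &c) in feature.iter().zip(labels) {
        sums[c] += x;
        counts[c] += 1;
    }
    // Only classes that actually occur contribute degrees of freedom.
    let n_classes = counts.iter().filter(|&&c| c > 0).count();
    if n_classes < 2 || n <= n_classes {
        return f64::NAN; // F is undefined without both degrees of freedom
    }
    let grand_mean = feature.iter().sum::<f64>() / n as f64;
    // Between-class sum of squares: class sizes times squared mean offsets.
    let mut ss_between = 0.0;
    for (sum, &count) in sums.iter().zip(&counts) {
        if count > 0 {
            let class_mean = sum / count as f64;
            ss_between += count as f64 * (class_mean - grand_mean).powi(2);
        }
    }
    // Within-class sum of squares: deviations from each sample's class mean.
    let mut ss_within = 0.0;
    for (&x, &c) in feature.iter().zip(labels) {
        ss_within += (x - sums[c] / counts[c] as f64).powi(2);
    }
    let ms_between = ss_between / (n_classes - 1) as f64;
    let ms_within = ss_within / (n - n_classes) as f64;
    ms_between / ms_within // a constant feature gives 0/0, i.e. NaN
}

/// Indices of the `k` highest scores, returned in ascending index order.
/// NaN scores are replaced by the smallest finite f64 so they rank last;
/// a stable ascending sort followed by "take the last k" resolves ties
/// in favour of the higher index; `k` larger than the input keeps everything.
fn select_k_best(scores: &[f64], k: usize) -> Vec<usize> {
    let cleaned: Vec<f64> = scores
        .iter()
        .map(|&s| if s.is_nan() { f64::MIN } else { s })
        .collect();
    let mut order: Vec<usize> = (0..cleaned.len()).collect();
    order.sort_by(|&a, &b| cleaned[a].total_cmp(&cleaned[b])); // stable sort
    let k = k.min(order.len());
    let mut kept = order[order.len() - k..].to_vec();
    kept.sort_unstable();
    kept
}
```

With scores `[1.0, 3.0, 3.0, NaN]` and `k = 1`, this sketch keeps index 2: the two 3.0 scores tie and the higher index wins, while the NaN score ranks below everything else.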
§REQ status
Translation target: scikit-learn 1.5.2 VarianceThreshold
(sklearn/feature_selection/_variance_threshold.py) + SelectKBest +
GenericUnivariateSelect (sklearn/feature_selection/_univariate_selection.py).
Tracking: #1424. Each REQ is BINARY — SHIPPED (impl + non-test consumer +
tests + green verification) or NOT-STARTED (with a concrete open blocker).
The basic SelectFromModel here duplicates the richer
select_from_model.rs::SelectFromModelExt — its parity is covered by
.design/preprocess/select_from_model.md (REQ-9).
| REQ | Scope | Status | Evidence / Blocker |
|---|---|---|---|
| REQ-1 | VarianceThreshold mask (var > threshold, strict) + population variances | SHIPPED | fit Welford population variance matches np.nanvar + _variance_threshold.py:133 on the common case; oracle tests in tests/divergence_feature_selection.rs. Consumer: re-export lib.rs:110 + PipelineTransformer |
| REQ-2 | SelectKBest ANOVA F-score VALUES (finite / non-constant) | SHIPPED | anova_f_scores matches f_oneway _univariate_selection.py:43-117; oracle score tests (tol 1e-9) |
| REQ-3 | SelectKBest top-k SELECTION (tie-break + constant-feature + k-boundary) | SHIPPED | matches sklearn mask[argsort(scores, mergesort)[-k:]] :794 (ties → higher index) + _clean_nans :24-33 (constant feature → NaN → finfo.min → ranks last) + k>n_features clamp-keep-all :774-779; constant anova_f_scores now NaN (was +inf), verified across multi-tie/multi-constant/k∈{0,all,>n}/mixed/f32 (21 oracle tests — was DIV-A/B #1425 + DIV-C #1426, fixed) |
| REQ-4 | Error/parameter contracts (VarianceThreshold threshold<0/zero-rows; SelectKBest zero-rows/y-mismatch) | SHIPPED (scoped) | per-fn guards; divergence error tests |
| REQ-5a | VarianceThreshold np.nanvar NaN-handling + “no feature meets threshold” ValueError | SHIPPED | fit skips NaN in the Welford pass (population var over FINITE values, ddof=0, all-NaN col → NaN), matching np.nanvar (_variance_threshold.py:112, force_all_finite="allow-nan" :103); raises InvalidParameter("No feature in X meets the variance threshold {:.5}" [+ " (X contains only one sample)"]) when no var > threshold, matching :121-126. Oracle: X=[[1,7],[2,7],[NaN,7]] → variances_=[0.25,0.0], support [0]; all-constant / single-sample → ValueError. Consumer: re-export lib.rs + PipelineTransformer. Tests: tests/divergence_variance_threshold_2349.rs (#2350 #2351). |
| REQ-5b | VarianceThreshold threshold==0 peak-to-peak guard (min(var, ptp)) | NOT-STARTED | sklearn _variance_threshold.py:113-120 — blocker #1427 (ptp-guard only; nanvar+ValueError shipped as REQ-5a) |
| REQ-6 | SelectKBest k='all' string + pluggable score_func (chi2/f_regression/mutual_info) + general _clean_nans | NOT-STARTED | usize k + FClassif only; sklearn _univariate_selection.py:770-795 — blocker #1428 |
| REQ-7 | GenericUnivariateSelect (mode meta-selector) | NOT-STARTED | absent (route parity_op); sklearn _univariate_selection.py:1054 — blocker #1429 |
| REQ-8 | SelectorMixin surface (get_support/inverse_transform/get_feature_names_out) + scores_/pvalues_/n_features_in_ attrs | NOT-STARTED | sklearn _univariate_selection.py:526 — blocker #1430 |
| REQ-9 | Basic SelectFromModel here duplicates SelectFromModelExt | NOT-STARTED | tech-debt; parity in .design/preprocess/select_from_model.md — blocker #1431 |
| REQ-10 | PyO3 binding | NOT-STARTED | no ferrolearn-python registration — blocker #1432 |
| REQ-11 | ferray substrate | NOT-STARTED | dense Array1/Array2 + num_traits::Float only — blocker #1433 |
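
The VarianceThreshold behaviour recorded under REQ-1, REQ-4, and REQ-5a can be sketched with the standard library alone. This is an illustrative approximation, not the crate's code: the function names, the `Vec<Vec<f64>>` row layout, and the `String` error type are assumptions made for this example (the table names `InvalidParameter` as the real error). The sketch skips NaN entries in a Welford pass, divides by the count of retained values (ddof = 0), and keeps columns whose variance is strictly greater than the threshold.

```rust
/// Population variance (ddof = 0) of the non-NaN entries of one column,
/// accumulated with Welford's update. An all-NaN column yields NaN.
fn nan_population_variance(column: &[f64]) -> f64 {
    let mut count = 0.0_f64;
    let mut mean = 0.0_f64;
    let mut m2 = 0.0_f64; // running sum of squared deviations from the mean
    for &x in column {
        if x.is_nan() {
            continue; // skip missing values, as np.nanvar does
        }
        count += 1.0;
        let delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }
    if count == 0.0 { f64::NAN } else { m2 / count }
}

/// Returns (per-column variances, indices of kept columns).
/// `String` is a stand-in error type for this sketch; rows are assumed
/// to be rectangular (all the same length).
fn variance_threshold(
    rows: &[Vec<f64>],
    threshold: f64,
) -> Result<(Vec<f64>, Vec<usize>), String> {
    if threshold < 0.0 {
        return Err("threshold must be non-negative".to_string());
    }
    if rows.is_empty() {
        return Err("X has zero rows".to_string());
    }
    let n_cols = rows[0].len();
    let variances: Vec<f64> = (0..n_cols)
        .map(|j| {
            let column: Vec<f64> = rows.iter().map(|r| r[j]).collect();
            nan_population_variance(&column)
        })
        .collect();
    // Strict comparison: a NaN variance never passes `v > threshold`.
    let support: Vec<usize> = variances
        .iter()
        .enumerate()
        .filter(|&(_, &v)| v > threshold)
        .map(|(j, _)| j)
        .collect();
    if support.is_empty() {
        let mut msg = format!(
            "No feature in X meets the variance threshold {:.5}",
            threshold
        );
        if rows.len() == 1 {
            msg.push_str(" (X contains only one sample)");
        }
        return Err(msg);
    }
    Ok((variances, support))
}
```

Applied to the REQ-5a oracle input `[[1, 7], [2, 7], [NaN, 7]]` with threshold 0.0, the sketch yields variances `[0.25, 0.0]` and support `[0]`. An all-constant or single-sample input returns the error instead. The peak-to-peak guard from REQ-5b is deliberately not modelled here.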
Structs§
- FittedSelectKBest - A fitted K-best selector holding per-feature scores and selected indices.
- FittedVarianceThreshold - A fitted variance-threshold selector holding the selected column indices and the per-column variances observed during fitting.
- SelectFromModel - A feature selector driven by external feature-importance weights.
- SelectKBest - An unfitted K-best feature selector.
- VarianceThreshold - An unfitted variance-threshold feature selector.
Enums§
- ScoreFunc - Scoring function variants for SelectKBest.