Module feature_selection

Feature selection transformers.

This module provides three feature selection strategies:

  • VarianceThreshold — remove features whose variance falls below a configurable threshold (default 0.0 removes zero-variance features).
  • SelectKBest — keep the K features with the highest ANOVA F-scores computed against a class label vector.
  • SelectFromModel — keep features whose importance weight (provided by a previously fitted model) exceeds a configurable threshold.

All three implement the standard ferrolearn Fit / Transform pattern and integrate with the dynamic ferrolearn_core::pipeline::Pipeline.
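As an illustration of the score that SelectKBest ranks by, the following is a minimal standalone sketch of a one-way ANOVA F-statistic for a single feature column. The function name `anova_f_score` and its slice-based signature are assumptions made for this example and are not the ferrolearn API; the sketch also assumes class labels are dense integers in `0..n_classes`, with every class present and more samples than classes.

```rust
/// One-way ANOVA F-statistic for one feature column.
///
/// `values[i]` is the feature value of sample `i` and `labels[i]` its class
/// id in `0..n_classes`. Hypothetical helper for illustration only.
fn anova_f_score(values: &[f64], labels: &[usize], n_classes: usize) -> f64 {
    assert_eq!(values.len(), labels.len());
    let n = values.len() as f64;

    // Per-class sums and counts, used to derive the class means.
    let mut sums = vec![0.0_f64; n_classes];
    let mut counts = vec![0usize; n_classes];
    for (&v, &c) in values.iter().zip(labels) {
        sums[c] += v;
        counts[c] += 1;
    }
    let grand_mean = values.iter().sum::<f64>() / n;

    // Between-group sum of squares: spread of class means around the grand mean.
    let mut ss_between = 0.0;
    for c in 0..n_classes {
        if counts[c] > 0 {
            let mean_c = sums[c] / counts[c] as f64;
            ss_between += counts[c] as f64 * (mean_c - grand_mean).powi(2);
        }
    }

    // Within-group sum of squares: spread of samples around their class mean.
    let mut ss_within = 0.0;
    for (&v, &c) in values.iter().zip(labels) {
        let mean_c = sums[c] / counts[c] as f64;
        ss_within += (v - mean_c).powi(2);
    }

    let df_between = (n_classes - 1) as f64;
    let df_within = n - n_classes as f64;
    // A constant feature gives 0/0 here, i.e. NaN.
    (ss_between / df_between) / (ss_within / df_within)
}
```

Calling this once per column yields one score per feature. For a constant column both sums of squares are zero, so the result is NaN rather than a finite score.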

§REQ status

Translation target: scikit-learn 1.5.2 VarianceThreshold (sklearn/feature_selection/_variance_threshold.py) plus SelectKBest and GenericUnivariateSelect (sklearn/feature_selection/_univariate_selection.py). Tracking: #1424. Each REQ is BINARY: either SHIPPED (impl + non-test consumer + tests + green verification) or NOT-STARTED (with a concrete open blocker). The basic SelectFromModel here duplicates the richer select_from_model.rs::SelectFromModelExt; its parity is covered by .design/preprocess/select_from_model.md (REQ-9).

| REQ | Scope | Status | Evidence / Blocker |
|---|---|---|---|
| REQ-1 | VarianceThreshold mask (var > threshold, strict) + population variances | SHIPPED | fit Welford population variance matches np.nanvar + _variance_threshold.py:133 on the common case; oracle tests in tests/divergence_feature_selection.rs. Consumer: re-export lib.rs:110 + PipelineTransformer |
| REQ-2 | SelectKBest ANOVA F-score VALUES (finite / non-constant) | SHIPPED | anova_f_scores matches f_oneway _univariate_selection.py:43-117; oracle score tests (tol 1e-9) |
| REQ-3 | SelectKBest top-k SELECTION (tie-break + constant-feature + k-boundary) | SHIPPED | matches sklearn mask[argsort(scores, mergesort)[-k:]] :794 (ties → higher index) + _clean_nans :24-33 (constant feature → NaN → finfo.min → ranks last) + k>n_features clamp-keep-all :774-779; constant anova_f_scores now NaN (was +inf), verified across multi-tie/multi-constant/k∈{0,all,>n}/mixed/f32 (21 oracle tests; previously DIV-A/B #1425 + DIV-C #1426, fixed) |
| REQ-4 | Error/parameter contracts (VarianceThreshold threshold<0/zero-rows; SelectKBest zero-rows/y-mismatch) | SHIPPED (scoped) | per-fn guards; divergence error tests |
| REQ-5a | VarianceThreshold np.nanvar NaN-handling + “no feature meets threshold” ValueError | SHIPPED | fit skips NaN in the Welford pass (population var over FINITE values, ddof=0, all-NaN col → NaN), matching np.nanvar (_variance_threshold.py:112, force_all_finite="allow-nan" :103); raises InvalidParameter("No feature in X meets the variance threshold {:.5}" [+ " (X contains only one sample)"]) when no var > threshold, matching :121-126. Oracle: X=[[1,7],[2,7],[NaN,7]] → variances_=[0.25,0.0], support [0]; all-constant / single-sample → ValueError. Consumer: re-export lib.rs + PipelineTransformer. Tests: tests/divergence_variance_threshold_2349.rs (#2350 #2351). |
| REQ-5b | VarianceThreshold threshold==0 peak-to-peak guard (min(var, ptp)) | NOT-STARTED | sklearn _variance_threshold.py:113-120; blocker #1427 (ptp-guard only; nanvar+ValueError shipped as REQ-5a) |
| REQ-6 | SelectKBest k='all' string + pluggable score_func (chi2/f_regression/mutual_info) + general _clean_nans | NOT-STARTED | usize k + FClassif only; sklearn _univariate_selection.py:770-795; blocker #1428 |
| REQ-7 | GenericUnivariateSelect (mode meta-selector) | NOT-STARTED | absent (route parity_op); sklearn _univariate_selection.py:1054; blocker #1429 |
| REQ-8 | SelectorMixin surface (get_support/inverse_transform/get_feature_names_out) + scores_/pvalues_/n_features_in_ attrs | NOT-STARTED | sklearn _univariate_selection.py:526; blocker #1430 |
| REQ-9 | Basic SelectFromModel here duplicates SelectFromModelExt | NOT-STARTED | tech-debt; parity in .design/preprocess/select_from_model.md; blocker #1431 |
| REQ-10 | PyO3 binding | NOT-STARTED | no ferrolearn-python registration; blocker #1432 |
| REQ-11 | ferray substrate | NOT-STARTED | dense Array1/Array2 + num_traits::Float only; blocker #1433 |
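The top-k rule cited for REQ-3 (NaN scores replaced by the smallest finite value, a stable ascending argsort, then the last k indices) can be sketched roughly as follows. The function `select_k_best_indices` is a hypothetical standalone helper written for this example, not the crate's implementation; it relies on the fact that Rust's `sort_by` is stable, which plays the role of the mergesort in the sklearn expression.

```rust
/// Indices of the `k` highest-scoring features, returned in ascending order.
///
/// Hypothetical helper for illustration only. NaN scores are mapped to
/// `f64::MIN` so they rank last, and `k` is clamped to the number of features.
fn select_k_best_indices(scores: &[f64], k: usize) -> Vec<usize> {
    // k larger than the feature count keeps every feature.
    let k = k.min(scores.len());

    // Replace NaN (e.g. from a constant feature) with the smallest finite value.
    let cleaned: Vec<f64> = scores
        .iter()
        .map(|&s| if s.is_nan() { f64::MIN } else { s })
        .collect();

    // Stable ascending argsort: equal scores keep their original index order,
    // so among tied scores the higher index ends up closer to the tail.
    let mut order: Vec<usize> = (0..cleaned.len()).collect();
    order.sort_by(|&a, &b| cleaned[a].partial_cmp(&cleaned[b]).unwrap());

    // Take the last k positions of the ascending order, i.e. the top k scores.
    let mut selected = order[order.len() - k..].to_vec();
    selected.sort_unstable();
    selected
}
```

With scores `[3.0, 1.0, 3.0, NaN]` and `k = 1`, the two features scoring 3.0 are tied and the stable sort leaves index 2 last, so index 2 is the one selected.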

Structs§

FittedSelectKBest
A fitted K-best selector holding per-feature scores and selected indices.
FittedVarianceThreshold
A fitted variance-threshold selector holding the selected column indices and the per-column variances observed during fitting.
SelectFromModel
A feature selector driven by external feature-importance weights.
SelectKBest
An unfitted K-best feature selector.
VarianceThreshold
An unfitted variance-threshold feature selector.
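To make the variance-threshold behaviour in REQ-1 and REQ-5a concrete, here is a minimal standalone sketch of a NaN-skipping Welford pass followed by the strict `var > threshold` mask. The names `nan_population_variance` and `variance_support`, the column-major `Vec<f64>` input, and the `String` error are all assumptions made for this example; the real selector works on ferrolearn's array types and error enum. The sketch skips only NaN entries and does not treat infinities specially.

```rust
/// Population variance (ddof = 0) of one column, skipping NaN entries.
/// Returns NaN when the column holds no non-NaN value.
/// Hypothetical helper for illustration only.
fn nan_population_variance(column: &[f64]) -> f64 {
    let mut count = 0usize;
    let mut mean = 0.0_f64;
    let mut m2 = 0.0_f64; // running sum of squared deviations from the mean
    for &x in column {
        if x.is_nan() {
            continue;
        }
        // Welford update: single pass, numerically stable.
        count += 1;
        let delta = x - mean;
        mean += delta / count as f64;
        m2 += delta * (x - mean);
    }
    if count == 0 {
        f64::NAN
    } else {
        m2 / count as f64
    }
}

/// Per-column variances and the indices whose variance is strictly above
/// `threshold`. `columns[j]` holds feature j (column-major, an assumption
/// of this sketch). Errors when no feature passes the threshold.
fn variance_support(
    columns: &[Vec<f64>],
    threshold: f64,
) -> Result<(Vec<f64>, Vec<usize>), String> {
    let variances: Vec<f64> = columns
        .iter()
        .map(|c| nan_population_variance(c))
        .collect();

    // Strict comparison; a NaN variance never passes.
    let support: Vec<usize> = variances
        .iter()
        .enumerate()
        .filter_map(|(j, &v)| if v > threshold { Some(j) } else { None })
        .collect();

    if support.is_empty() {
        return Err(format!(
            "No feature in X meets the variance threshold {:.5}",
            threshold
        ));
    }
    Ok((variances, support))
}
```

Feeding in the oracle input from REQ-5a, with columns `[1, 2, NaN]` and `[7, 7, 7]` and threshold 0.0, gives variances `[0.25, 0.0]` and support `[0]`; an all-constant input returns the error instead.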

Enums§

ScoreFunc
Scoring function variants for SelectKBest.