//! # ferrolearn-preprocess
//!
//! Data preprocessing transformers for the ferrolearn machine learning framework.
//!
//! This crate provides standard scalers, encoders, imputers, and feature
//! selection utilities that follow the ferrolearn `Fit`/`Transform` trait
//! pattern.
//!
//! ## Scalers
//!
//! All scalers are generic over `F: Float + Send + Sync + 'static` and implement
//! [`Fit<Array2<F>, ()>`](ferrolearn_core::Fit) (returning a `Fitted*` type) and
//! [`FitTransform<Array2<F>>`](ferrolearn_core::FitTransform). The fitted types
//! implement [`Transform<Array2<F>>`](ferrolearn_core::Transform).
//!
//! - [`StandardScaler`] — zero-mean, unit-variance scaling
//! - [`MinMaxScaler`] — scale features to a given range (default `[0, 1]`)
//! - [`RobustScaler`] — median / IQR-based scaling, robust to outliers
//! - [`MaxAbsScaler`] — scale by maximum absolute value so values are in `[-1, 1]`
//! - [`normalizer::Normalizer`] — normalize each sample (row) to unit norm
//! - [`power_transformer::PowerTransformer`] — Yeo-Johnson power transform
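//!
//! As a rough illustration of what zero-mean, unit-variance scaling computes,
//! the sketch below standardizes each column of a row-major matrix using only
//! the standard library. It is not this crate's implementation: the helper
//! name `standardize_columns` is hypothetical, and the use of the population
//! variance (dividing by `n`) and the handling of constant columns are
//! assumptions made for the example.
//!
//! ```rust
//! // Hypothetical helper: standardize each column to mean 0 and
//! // (population) standard deviation 1.
//! fn standardize_columns(rows: &[Vec<f64>]) -> Vec<Vec<f64>> {
//!     let n = rows.len() as f64;
//!     let n_cols = rows.first().map_or(0, |r| r.len());
//!     let mut out = rows.to_vec();
//!     for j in 0..n_cols {
//!         let mean = rows.iter().map(|r| r[j]).sum::<f64>() / n;
//!         let var = rows.iter().map(|r| (r[j] - mean).powi(2)).sum::<f64>() / n;
//!         // Assumption: constant columns are centered but not rescaled,
//!         // which avoids dividing by zero.
//!         let std = if var > 0.0 { var.sqrt() } else { 1.0 };
//!         for row in out.iter_mut() {
//!             row[j] = (row[j] - mean) / std;
//!         }
//!     }
//!     out
//! }
//! ```
//!
//! Calling `standardize_columns` on the rows `[1, 10]`, `[2, 20]`, `[3, 30]`
//! centers both columns on zero, so the middle row maps to `[0, 0]`.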
//!
//! ## Encoders
//!
//! - [`OneHotEncoder`] — encode `Array2<F>` numeric categorical columns as binary columns (per-column sorted-unique `categories_`)
//! - [`LabelEncoder`] — map `Array1<String>` labels to integer indices
//! - [`ordinal_encoder::OrdinalEncoder`] — map string categories to integers in
//!   order of first appearance
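//!
//! To make "order of first appearance" concrete, here is a minimal
//! standard-library sketch of that numbering rule. The function name
//! `ordinal_codes` is hypothetical and this is not the encoder's actual code
//! path; it only shows how codes could be assigned.
//!
//! ```rust
//! use std::collections::HashMap;
//!
//! // Hypothetical helper: give each distinct label the index at which it
//! // was first seen (0 for the first new label, 1 for the second, ...).
//! fn ordinal_codes(labels: &[&str]) -> Vec<usize> {
//!     let mut seen: HashMap<&str, usize> = HashMap::new();
//!     labels
//!         .iter()
//!         .map(|&label| {
//!             // The next unused code equals the number of labels seen so far.
//!             let next = seen.len();
//!             *seen.entry(label).or_insert(next)
//!         })
//!         .collect()
//! }
//! ```
//!
//! Under this rule the input `["b", "a", "b", "c"]` yields codes
//! `[0, 1, 0, 2]`, which differs from a sorted-category numbering.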
//!
//! ## Imputers
//!
//! - [`imputer::SimpleImputer`] — fill missing (NaN) values per feature column
//!   using Mean, Median, MostFrequent, or Constant strategy.
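//!
//! The Mean strategy can be pictured with the following standard-library
//! sketch. The helper `impute_column_means` is hypothetical and is not the
//! `SimpleImputer` implementation; leaving all-NaN columns untouched is an
//! assumption made for the example.
//!
//! ```rust
//! // Hypothetical helper: replace NaN entries in each column with the mean
//! // of that column's non-missing values.
//! fn impute_column_means(rows: &mut [Vec<f64>]) {
//!     let n_cols = rows.first().map_or(0, |r| r.len());
//!     for j in 0..n_cols {
//!         // Sum and count only the observed (non-NaN) values in column j.
//!         let (sum, count) = rows
//!             .iter()
//!             .map(|r| r[j])
//!             .filter(|v| !v.is_nan())
//!             .fold((0.0_f64, 0_usize), |(s, c), v| (s + v, c + 1));
//!         if count == 0 {
//!             // Assumption: a column with no observed values is left as-is.
//!             continue;
//!         }
//!         let mean = sum / count as f64;
//!         for row in rows.iter_mut() {
//!             if row[j].is_nan() {
//!                 row[j] = mean;
//!             }
//!         }
//!     }
//! }
//! ```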
//!
//! ## Feature Selection
//!
//! - [`feature_selection::VarianceThreshold`] — remove features with variance
//!   below a configurable threshold.
//! - [`feature_selection::SelectKBest`] — keep the K features with the highest
//!   ANOVA F-scores against class labels.
//! - [`feature_selection::SelectFromModel`] — keep features whose importance
//!   weight (from a pre-fitted model) meets a configurable threshold.
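//!
//! Variance-based selection can be sketched with the standard library alone.
//! The helper `columns_above_variance` is hypothetical and is not the
//! `VarianceThreshold` implementation. Using the population variance and a
//! strict `>` comparison are assumptions borrowed from scikit-learn's
//! convention, not confirmed for this crate.
//!
//! ```rust
//! // Hypothetical helper: return the indices of columns whose population
//! // variance is strictly greater than `threshold`.
//! fn columns_above_variance(rows: &[Vec<f64>], threshold: f64) -> Vec<usize> {
//!     let n = rows.len() as f64;
//!     let n_cols = rows.first().map_or(0, |r| r.len());
//!     (0..n_cols)
//!         .filter(|&j| {
//!             let mean = rows.iter().map(|r| r[j]).sum::<f64>() / n;
//!             let var = rows.iter().map(|r| (r[j] - mean).powi(2)).sum::<f64>() / n;
//!             var > threshold
//!         })
//!         .collect()
//! }
//! ```
//!
//! With a threshold of `0.0`, this keeps every column that is not constant.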
//!
//! ## Feature Engineering
//!
//! - [`polynomial_features::PolynomialFeatures`] — generate polynomial and interaction features
//! - [`binarizer::Binarizer`] — threshold features to binary values
//! - [`function_transformer::FunctionTransformer`] — apply a user-provided function element-wise
//!
//! ## Pipeline Integration
//!
//! `StandardScaler<f64>`, `MinMaxScaler<f64>`, `RobustScaler<f64>`,
//! `MaxAbsScaler<f64>`, `Normalizer<f64>`, `PowerTransformer<f64>`,
//! `PolynomialFeatures<f64>`, `SimpleImputer<f64>`, `VarianceThreshold<f64>`,
//! `SelectKBest<f64>`, and `SelectFromModel<f64>` each implement
//! [`PipelineTransformer`](ferrolearn_core::pipeline::PipelineTransformer)
//! so they can be used as steps inside a
//! [`Pipeline`](ferrolearn_core::pipeline::Pipeline).
//!
//! # Examples
//!
//! ```
//! use ferrolearn_preprocess::StandardScaler;
//! use ferrolearn_core::traits::FitTransform;
//! use ndarray::array;
//!
//! let x = array![[1.0_f64, 10.0], [2.0, 20.0], [3.0, 30.0]];
//! let scaled = StandardScaler::<f64>::new().fit_transform(&x).unwrap();
//! // scaled columns now have mean ≈ 0 and std ≈ 1
//! ```
//!
//! ## REQ status
//!
//! Binary (R-DEFER-2) for the crate-root RE-EXPORT BOUNDARY — this file is the
//! public-API surface, NOT an estimator. Mirrors the `__all__` of six sklearn
//! modules: `preprocessing/__init__.py:30-60`, `feature_selection/__init__.py:27-47`,
//! `feature_extraction/text.py:34-43`, `impute/__init__.py:13`,
//! `random_projection.py:50-54`, `compose/__init__.py:15-20`. Design doc:
//! `.design/preprocess/lib.md`. Per-estimator value parity lives in the sibling
//! routed module docs; this table covers only the boundary surface + the ferray
//! substrate gap. Tracking: #1361.
//!
//! | REQ | Status | Evidence |
//! |---|---|---|
//! | REQ-1 (re-export boundary) | SHIPPED | the `pub use` block (`:106-163`) surfaces every implemented estimator's unfitted + `Fitted*` pair (plus supporting enums and the `chi2`/`f_classif`/`f_regression` scoring fns), mirroring the six modules' `__all__`. The surfaced set is the documented subset that is implemented; not-yet-translated names (`KernelCenterer`, `GenericUnivariateSelect`, `MissingIndicator`, `HashingVectorizer`/`TfidfVectorizer`, `mutual_info_*`, the preprocessing free fns, `johnson_lindenstrauss_min_dim`, `make_column_selector`) are enumerated in the design doc (honest underclaim). Consumers: meta-crate `pub use ferrolearn_preprocess as preprocess;` (`ferrolearn/src/lib.rs:36`) + the `_RsStandardScaler`/`_RsMinMaxScaler`/`_RsMaxAbsScaler`/`_RsRobustScaler`/`_RsPowerTransformer` PyO3 pyclasses (`ferrolearn-python/src/{transformers,extras}.rs`, registered `lib.rs:22,81-84`). Verification: `cargo build -p ferrolearn-preprocess` resolves every re-export; boundary-integrity green-guard `tests/divergence_lib.rs` (fails to compile if any re-export is removed); `cargo test -p ferrolearn-preprocess` green. |
//! | REQ-2 (ferray substrate) | NOT-STARTED | the crate is `ndarray` + `num_traits` across all 33 submodules behind the boundary, not `ferray-core`/`ferray-ufunc` (R-SUBSTRATE-1) — blocker #1362 |
//!
//! `BinaryEncoder`/`FittedBinaryEncoder` (`:129`) is a `category_encoders`-style
//! extension with no sklearn `__all__` analog — an extension of the boundary, not
//! a sklearn-parity item and not a blocker.

pub mod binarizer;
pub mod binary_encoder;
pub mod column_transformer;
pub mod count_vectorizer;
pub mod feature_scoring;
pub mod feature_selection;
pub mod function_transformer;
pub mod imputer;
pub mod iterative_imputer;
pub mod kbins_discretizer;
pub mod knn_imputer;
pub mod label_binarizer;
pub mod label_encoder;
pub mod max_abs_scaler;
pub mod min_max_scaler;
pub mod multi_label_binarizer;
pub mod normalizer;
pub mod one_hot_encoder;
pub mod ordinal_encoder;
pub mod polynomial_features;
pub mod power_transformer;
pub mod quantile_transformer;
pub mod random_projection;
pub mod rfe;
pub mod robust_scaler;
pub mod select_from_model;
pub mod select_percentile;
pub mod sequential_feature_selector;
pub mod spline_transformer;
pub mod standard_scaler;
pub mod stat_selectors;
pub mod target_encoder;
pub mod tfidf;

// Re-exports
pub use binarizer::Binarizer;
pub use column_transformer::{
    ColumnSelector, ColumnTransformer, FittedColumnTransformer, Remainder, make_column_transformer,
};
pub use feature_selection::{
    FittedSelectKBest, FittedVarianceThreshold, ScoreFunc, SelectFromModel, SelectKBest,
    VarianceThreshold,
};
pub use function_transformer::FunctionTransformer;
pub use imputer::{FittedSimpleImputer, ImputeStrategy, SimpleImputer};
pub use label_encoder::{FittedLabelEncoder, LabelEncoder};
pub use max_abs_scaler::{FittedMaxAbsScaler, MaxAbsScaler};
pub use min_max_scaler::{FittedMinMaxScaler, MinMaxScaler};
pub use normalizer::Normalizer;
pub use one_hot_encoder::{FittedOneHotEncoder, OneHotDrop, OneHotEncoder, OneHotHandleUnknown};
pub use ordinal_encoder::{Categories, FittedOrdinalEncoder, HandleUnknown, OrdinalEncoder};
pub use polynomial_features::{FittedPolynomialFeatures, PolynomialFeatures};
pub use power_transformer::{FittedPowerTransformer, PowerTransformer};
pub use robust_scaler::{FittedRobustScaler, RobustScaler};
pub use standard_scaler::{FittedStandardScaler, StandardScaler};

// Phase 3 re-exports
pub use binary_encoder::{BinaryEncoder, FittedBinaryEncoder};
pub use iterative_imputer::{FittedIterativeImputer, InitialStrategy, IterativeImputer};
pub use kbins_discretizer::{BinEncoding, BinStrategy, FittedKBinsDiscretizer, KBinsDiscretizer};
pub use knn_imputer::{FittedKNNImputer, KNNImputer, KNNWeights};
pub use quantile_transformer::{
    FittedQuantileTransformer, OutputDistribution, QuantileTransformer, quantile_transform,
};
pub use rfe::{RFE, RFECV};
pub use select_from_model::{FittedSelectFromModelExt, SelectFromModelExt, ThresholdStrategy};
pub use select_percentile::{FittedSelectPercentile, SelectPercentile};
pub use spline_transformer::{FittedSplineTransformer, KnotStrategy, SplineTransformer};
pub use target_encoder::{FittedTargetEncoder, TargetEncoder};

// Text processing re-exports
pub use count_vectorizer::{CountVectorizer, FittedCountVectorizer};
pub use tfidf::{FittedTfidfTransformer, TfidfNorm, TfidfTransformer};

// Random projection re-exports
pub use random_projection::{
    FittedGaussianRandomProjection, FittedSparseRandomProjection, GaussianRandomProjection,
    SparseRandomProjection,
};

// Newly wired (previously orphaned) re-exports
pub use feature_scoring::{
    chi2, compute_scores_classif, compute_scores_regression, f_classif, f_regression,
};
pub use label_binarizer::{FittedLabelBinarizer, LabelBinarizer, label_binarize};
pub use multi_label_binarizer::{FittedMultiLabelBinarizer, MultiLabelBinarizer};
pub use sequential_feature_selector::{
    Direction, FittedSequentialFeatureSelector, SequentialFeatureSelector,
};
pub use stat_selectors::{
    FittedSelectFdr, FittedSelectFpr, FittedSelectFwe, SelectFdr, SelectFpr, SelectFwe,
};