//! Preprocessing transformations for data preparation
//!
//! This module provides standard preprocessing operations:
//!
//! ## Scaling
//! - **StandardScaler**: Zero mean, unit variance (most common)
//! - **MinMaxScaler**: Scale to fixed range [min, max]
//! - **RobustScaler**: Median/IQR scaling (robust to outliers)
//!
//! ## Categorical Encoding
//! - **FrequencyEncoder**: Category → count (well suited to tree models)
//! - **LabelEncoder**: String → integer label (essential when loading string columns from CSV)
//! - **OneHotEncoder**: Category → binary columns (for linear models)
//! - **OrderedTargetEncoder**: Target-based encoding with M-estimate smoothing (high-cardinality)
//!
//! ## Missing Value Imputation
//! - **SimpleImputer**: Mean/Median/Mode/Constant strategies
//! - **IndicatorImputer**: Binary flags for missing values
//!
//! ## Power Transforms
//! - **YeoJohnsonTransform**: Normalize skewed distributions (handles negatives)
//!
//! ## Time-Series Features
//! - **LagGenerator**: Create lagged features (x_{t-1}, x_{t-2}, etc.)
//! - **RollingGenerator**: Rolling statistics (mean, std, min, max, sum, median)
//! - **EwmaGenerator**: Exponentially weighted moving average
//! - **SeasonalGenerator**: Extract datetime components with cyclical encoding
//!
//! ## Outlier Detection
//! - **OutlierDetector**: Detect and handle outliers with IQR or Z-score methods
//! - Actions: Cap (winsorize), Flag (indicator columns), Remove (filter rows)
//!
//! ## Design Philosophy
//!
//! All preprocessors follow the fit-transform pattern:
//! 1. `fit()` on training data to learn parameters
//! 2. `transform()` on train/test data using learned parameters
//! 3. Serialize fitted state with model for inference consistency
//!
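//! The pattern above, sketched with `StandardScaler` (the constructor and exact
//! method signatures here are illustrative, not necessarily the crate's verbatim API):
//!
//! ```ignore
//! let mut scaler = StandardScaler::default();
//! // 1. Learn mean/variance from the training data only
//! scaler.fit(&train_data, num_features)?;
//! // 2. Apply the learned parameters to both splits
//! scaler.transform(&mut train_data, num_features)?;
//! scaler.transform(&mut test_data, num_features)?;
//! // 3. Serialize `scaler` alongside the model so inference
//! //    uses identical parameters
//! ```
//!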
//! ## GBDT vs Linear Model Considerations
//!
//! **For Trees (GBDT)**:
//! - Prefer FrequencyEncoder or LabelEncoder
//! - Scalers help with regularization fairness and binning
//! - Missing values handled via bin 0 (implicit indicator)
//! - Power transforms have little effect (trees are invariant to monotone transforms)
//!
//! **For Linear Models**:
//! - Scalers are ESSENTIAL (linear models are sensitive to feature scales)
//! - OneHotEncoder for categorical (need binary indicators)
//! - SimpleImputer required (linear models can't handle NaN)
//! - YeoJohnsonTransform critical (Gaussian residuals assumption)
//!
//! **For Mixed Ensembles (Linear + Tree)**:
//! - Use all preprocessing types
//! - Different encodings for different model components
//!
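//! For example, a mixed ensemble might encode the same raw categorical column twice,
//! once per model family (type names are from the list above; the constructors and
//! `fit_transform` signatures are shown schematically and may differ from the real API):
//!
//! ```ignore
//! // Tree branch: compact count-based encoding
//! let tree_col = FrequencyEncoder::default().fit_transform(&raw_column)?;
//! // Linear branch: binary indicator columns
//! let linear_cols = OneHotEncoder::default().fit_transform(&raw_column)?;
//! ```
//!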
//! ## Polars Integration
//!
//! The `polars_ext` module provides ergonomic helpers for working with
//! Polars DataFrames:
//!
//! ```ignore
//! use treeboost::preprocessing::polars_ext::*;
//!
//! let (mut data, num_features) = df_to_features(&df, &["col1", "col2"])?;
//! scaler.fit_transform(&mut data, num_features)?;
//! let scaled_df = features_to_df(&data, num_features, &["col1", "col2"])?;
//! ```
pub use ;
pub use ;
// Re-export production-grade encoders from encoding module for convenience
pub use crate;
pub use ;
pub use ;
pub use ;
pub use ;
pub use ;
pub use ;
pub use YeoJohnsonTransform;