//! Dataset Analysis and Intelligent Mode Selection
//!
//! This module provides TreeBoost's "MRI scan" capability: it analyzes dataset
//! characteristics and automatically recommends the best-suited boosting mode.
//!
//! # Philosophy
//!
//! Unlike other AutoML tools that waste compute trying every model, TreeBoost
//! **analyzes first, then prescribes**. A 5-second analysis beats a 4-hour search.
//!
//! # How It Works
//!
//! 1. **Subsample**: Work on 10k-50k rows (enough to detect patterns)
//! 2. **Linear Probe**: Quick Ridge regression to measure linear signal
//! 3. **Tree Probe**: Shallow tree on residuals to measure non-linear structure
//! 4. **Feature Analysis**: Categorical ratio, correlations, interactions
//! 5. **Noise Estimation**: Local variance to detect irreducible error
//! 6. **Recommend**: Pick mode with confidence score and full explanation
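//!
//! As a rough illustration of steps 2 and 3, the sketch below runs both probes
//! on a single numeric feature. It is not TreeBoost's implementation: the
//! helper names (`ridge_fit`, `stump_sse`, `probe`), the penalty `lambda`, and
//! the use of a one-split stump as the "shallow tree" are all assumptions made
//! for this example.
//!
//! ```rust
//! /// Fit y ≈ w*x + b with an L2 penalty `lambda` on the slope (1-D ridge).
//! fn ridge_fit(x: &[f64], y: &[f64], lambda: f64) -> (f64, f64) {
//!     let n = x.len() as f64;
//!     let mx = x.iter().sum::<f64>() / n;
//!     let my = y.iter().sum::<f64>() / n;
//!     let sxy: f64 = x.iter().zip(y).map(|(a, b)| (a - mx) * (b - my)).sum();
//!     let sxx: f64 = x.iter().map(|a| (a - mx).powi(2)).sum();
//!     let w = sxy / (sxx + lambda);
//!     (w, my - w * mx)
//! }
//!
//! /// Sum of squared deviations from the mean (0.0 for an empty slice).
//! fn sse(v: &[f64]) -> f64 {
//!     if v.is_empty() {
//!         return 0.0;
//!     }
//!     let m = v.iter().sum::<f64>() / v.len() as f64;
//!     v.iter().map(|a| (a - m).powi(2)).sum()
//! }
//!
//! /// Lowest SSE reachable with at most one split of `r` along sorted `x`.
//! /// Simplification: tied `x` values are not kept on the same side.
//! fn stump_sse(x: &[f64], r: &[f64]) -> f64 {
//!     let mut idx: Vec<usize> = (0..x.len()).collect();
//!     idx.sort_by(|&a, &b| x[a].partial_cmp(&x[b]).unwrap());
//!     let sorted: Vec<f64> = idx.iter().map(|&i| r[i]).collect();
//!     let mut best = sse(&sorted);
//!     for k in 1..sorted.len() {
//!         best = best.min(sse(&sorted[..k]) + sse(&sorted[k..]));
//!     }
//!     best
//! }
//!
//! /// Returns (linear_r2, tree_gain) as fractions of the total variance.
//! /// Assumes `y` is not constant, otherwise the division is undefined.
//! fn probe(x: &[f64], y: &[f64]) -> (f64, f64) {
//!     let total = sse(y);
//!     let (w, b) = ridge_fit(x, y, 1e-6);
//!     let resid: Vec<f64> = x.iter().zip(y).map(|(a, t)| t - (w * a + b)).collect();
//!     let lin_sse: f64 = resid.iter().map(|e| e * e).sum();
//!     let linear_r2 = 1.0 - lin_sse / total;
//!     let tree_gain = (lin_sse - stump_sse(x, &resid)) / total;
//!     (linear_r2, tree_gain)
//! }
//!
//! fn main() {
//!     let x: Vec<f64> = (0..10).map(|i| i as f64).collect();
//!     // A straight line: the linear probe should capture almost everything.
//!     let line: Vec<f64> = x.iter().map(|v| 2.0 * v + 1.0).collect();
//!     // A V shape: no linear trend, but a split on x still reduces error.
//!     let vee: Vec<f64> = x.iter().map(|v| (v - 4.5).abs()).collect();
//!     println!("line: {:?}", probe(&x, &line));
//!     println!("vee:  {:?}", probe(&x, &vee));
//! }
//! ```
//!
//! Fitting the tree to the residuals, rather than to the raw target, means
//! `tree_gain` only counts structure the linear model could not explain.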
//!
//! # Example
//!
//! ```ignore
//! use treeboost::analysis::DatasetAnalysis;
//!
//! let analysis = DatasetAnalysis::analyze(&dataset);
//! println!("{}", analysis.report()); // See the full diagnostic
//!
//! let mode = analysis.recommend_mode();
//! let confidence = analysis.confidence();
//! ```
//!
//! # The Statistics We Compute
//!
//! | Metric | Range | What It Measures |
//! |--------|-------|------------------|
//! | `linear_r2` | 0-1 | How much variance a linear model explains |
//! | `tree_gain` | 0-1 | How much trees improve over linear |
//! | `interaction_strength` | 0-1 | Non-additive feature interactions |
//! | `categorical_ratio` | 0-1 | Proportion of categorical features |
//! | `noise_floor` | 0-1 | Estimated irreducible error |
//! | `monotonicity_score` | 0-1 | How monotonic feature-target relationships are |
//!
//! # Decision Logic
//!
//! The recommendation isn't based on single thresholds but on **combinations**:
//!
//! - **LinearThenTree**: High linear signal (R² > 0.3) AND trees add value (gain > 0.1)
//! - **PureTree**: Weak linear signal OR categorical-heavy OR high interactions
//! - **RandomForest**: High noise floor AND a need for variance reduction
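//!
//! The rules above can be sketched as a small decision function. This is only
//! an illustration: `SketchMode`, `SketchStats`, and `recommend` are
//! hypothetical names, the three `0.5` cutoffs and the order in which the
//! rules are checked are assumptions, and only the `0.3` and `0.1` thresholds
//! come from the list above.
//!
//! ```rust
//! #[derive(Debug, PartialEq)]
//! enum SketchMode {
//!     LinearThenTree,
//!     PureTree,
//!     RandomForest,
//! }
//!
//! /// Hypothetical container for the statistics in the table above.
//! struct SketchStats {
//!     linear_r2: f64,
//!     tree_gain: f64,
//!     interaction_strength: f64,
//!     categorical_ratio: f64,
//!     noise_floor: f64,
//! }
//!
//! fn recommend(s: &SketchStats) -> SketchMode {
//!     // Assumed cutoffs; the documentation gives no numbers for these.
//!     const HIGH_NOISE: f64 = 0.5;
//!     const CATEGORICAL_HEAVY: f64 = 0.5;
//!     const HIGH_INTERACTION: f64 = 0.5;
//!
//!     if s.noise_floor > HIGH_NOISE {
//!         // High noise floor: favor variance reduction.
//!         SketchMode::RandomForest
//!     } else if s.categorical_ratio > CATEGORICAL_HEAVY
//!         || s.interaction_strength > HIGH_INTERACTION
//!     {
//!         SketchMode::PureTree
//!     } else if s.linear_r2 > 0.3 && s.tree_gain > 0.1 {
//!         // Documented thresholds: linear signal present and trees add value.
//!         SketchMode::LinearThenTree
//!     } else {
//!         // Assumed fallback for weak linear signal or negligible tree gain.
//!         SketchMode::PureTree
//!     }
//! }
//!
//! fn main() {
//!     let stats = SketchStats {
//!         linear_r2: 0.6,
//!         tree_gain: 0.2,
//!         interaction_strength: 0.1,
//!         categorical_ratio: 0.1,
//!         noise_floor: 0.1,
//!     };
//!     println!("{:?}", recommend(&stats));
//! }
//! ```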
pub use ;
pub use ;
pub use ;
pub use ;