# Ensemble Methods Theory
**Chapter Status**: ✅ 100% Working (All examples verified)
| ✅ Working | 34+ | Random Forest classification + regression + OOB estimation verified |
| ⏳ In Progress | 0 | - |
| ⬜ Not Implemented | 0 | - |
*Last tested: 2025-11-21*
*Aprender version: 0.4.1*
*Test file: src/tree/mod.rs tests (726 tests, 11 OOB tests)*
---
## Overview
Ensemble methods combine multiple models to achieve better performance than any single model. The key insight: many weak learners together make a strong learner.
**Key Techniques**:
- **Bagging**: Bootstrap aggregating (Random Forests)
- **Boosting**: Sequential learning from mistakes (future work)
- **Voting**: Combine predictions via majority vote
**Why This Matters**:
Single decision trees overfit. Random Forests solve this by averaging many trees trained on different data subsets. Result: lower variance, better generalization.
---
## Mathematical Foundation
### The Ensemble Principle
**Problem**: Single model has high variance
**Solution**: Average predictions from multiple models
```text
Ensemble_prediction = Aggregate(model₁, model₂, ..., modelₙ)
For classification: Majority vote
For regression: Mean prediction
```
**Key Insight**: If models make uncorrelated errors, averaging reduces overall error.
### Variance Reduction Through Averaging
**Mathematical property**:
```text
Var(Average of N models) = Var(single model) / N
(assuming independent, identically distributed models)
```
**In practice**: Models aren't fully independent, but ensemble still reduces variance significantly.
### Bagging (Bootstrap Aggregating)
**Algorithm**:
```text
1. For i = 1 to N:
- Create bootstrap sample Dᵢ (sample with replacement from D)
- Train model Mᵢ on Dᵢ
2. Prediction = Majority_vote(M₁, M₂, ..., Mₙ)
```
**Bootstrap Sample**:
- **Size**: Same as original dataset (n samples)
- **Sampling**: With replacement (some samples repeated, some excluded)
- **Out-of-Bag (OOB)**: ~37% of samples not in each bootstrap sample
**Why it works**: Each model sees slightly different data → diverse models → uncorrelated errors
---
## Random Forests: Bagging + Feature Randomness
### The Random Forest Algorithm
Random Forests extend bagging with **feature randomness**:
```text
function RandomForest(X, y, n_trees, max_features):
forest = []
for i = 1 to n_trees:
# Bootstrap sampling
D_i = bootstrap_sample(X, y)
# Train tree with feature randomness
tree = DecisionTree(max_features=sqrt(n_features))
tree.fit(D_i)
forest.append(tree)
return forest
function Predict(forest, x):
votes = [tree.predict(x) for tree in forest]
return majority_vote(votes)
```
**Two Sources of Randomness**:
1. **Bootstrap sampling**: Each tree sees different data subset
2. **Feature randomness**: At each split, only consider random subset of features (typically √m features)
**Why feature randomness?** Prevents correlation between trees. Without it, all trees would use the same strong features at the top.
### Out-of-Bag (OOB) Error Estimation
**Key Insight**: Each tree trained on ~63% of data, leaving ~37% out-of-bag
**The Mathematics**:
```text
Bootstrap sampling with replacement:
- Probability sample is NOT selected: (1 - 1/n)ⁿ
- As n → ∞: lim (1 - 1/n)ⁿ = 1/e ≈ 0.368
- Therefore: ~36.8% samples are OOB per tree
```
**OOB Score Algorithm**:
```text
For each sample xᵢ in training set:
1. Find all trees where xᵢ was NOT in bootstrap sample
2. Predict using only those trees
3. Aggregate predictions (majority vote or averaging)
Classification: OOB_accuracy = accuracy(oob_predictions, y_true)
Regression: OOB_R² = r_squared(oob_predictions, y_true)
```
**Why OOB is Powerful**:
- ✅ **Free validation**: No separate validation set needed
- ✅ **Unbiased estimate**: Similar to cross-validation accuracy
- ✅ **Use all data**: 100% for training, still get validation score
- ✅ **Model selection**: Compare different n_estimators values
- ✅ **Early stopping**: Monitor OOB score during training
**When to Use OOB**:
- Small datasets (can't afford to hold out validation set)
- Hyperparameter tuning (test different forest sizes)
- Production monitoring (track OOB score over time)
**Practical Usage in Aprender**:
```rust,ignore
use aprender::tree::RandomForestClassifier;
use aprender::primitives::Matrix;
let mut rf = RandomForestClassifier::new(50)
.with_max_depth(10)
.with_random_state(42);
rf.fit(&x_train, &y_train).unwrap();
// Get OOB score (unbiased estimate of generalization error)
let oob_accuracy = rf.oob_score().unwrap();
let training_accuracy = rf.score(&x_train, &y_train);
println!("Training accuracy: {:.3}", training_accuracy); // Often high
println!("OOB accuracy: {:.3}", oob_accuracy); // More realistic
// OOB accuracy typically close to test set accuracy!
```
**Test Reference**: `src/tree/mod.rs::tests::test_random_forest_classifier_oob_score_after_fit`
---
## Implementation in Aprender
### Example 1: Basic Random Forest
```rust,ignore
use aprender::tree::RandomForestClassifier;
use aprender::primitives::Matrix;
// XOR problem (not linearly separable)
let x = Matrix::from_vec(4, 2, vec![
0.0, 0.0, // Class 0
0.0, 1.0, // Class 1
1.0, 0.0, // Class 1
1.0, 1.0, // Class 0
]).unwrap();
let y = vec![0, 1, 1, 0];
// Random Forest with 10 trees
let mut forest = RandomForestClassifier::new(10)
.with_max_depth(5)
.with_random_state(42); // Reproducible
forest.fit(&x, &y).unwrap();
// Predict
let predictions = forest.predict(&x);
println!("Predictions: {:?}", predictions); // [0, 1, 1, 0]
let accuracy = forest.score(&x, &y);
println!("Accuracy: {:.3}", accuracy); // 1.000
```
**Test Reference**: `src/tree/mod.rs::tests::test_random_forest_fit_basic`
### Example 2: Multi-Class Classification (Iris)
```rust,ignore
// Iris dataset (3 classes, 4 features)
// Simplified - see case study for full implementation
let mut forest = RandomForestClassifier::new(100) // 100 trees
.with_max_depth(10)
.with_random_state(42);
forest.fit(&x_train, &y_train).unwrap();
// Test set evaluation
let y_pred = forest.predict(&x_test);
let accuracy = forest.score(&x_test, &y_test);
println!("Test Accuracy: {:.3}", accuracy); // e.g., 0.973
// Random Forest typically outperforms single tree!
```
**Case Study**: See [Random Forest - Iris Classification](../examples/random-forest-iris.md)
### Example 3: Reproducibility
```rust,ignore
// Same random_state → same results
let mut forest1 = RandomForestClassifier::new(50)
.with_random_state(42);
forest1.fit(&x, &y).unwrap();
let mut forest2 = RandomForestClassifier::new(50)
.with_random_state(42);
forest2.fit(&x, &y).unwrap();
// Predictions identical
assert_eq!(forest1.predict(&x), forest2.predict(&x));
```
**Test Reference**: `src/tree/mod.rs::tests::test_random_forest_reproducible`
---
## Random Forest Regression
Random Forests also work for **regression** tasks (predicting continuous values) using the same bagging principle with a key difference: instead of majority voting, predictions are **averaged** across all trees.
### Algorithm for Regression
```rust
use aprender::tree::RandomForestRegressor;
use aprender::primitives::{Matrix, Vector};
// Housing data: [sqft, bedrooms, age] → price
let x = Matrix::from_vec(8, 3, vec![
1500.0, 3.0, 10.0, // $280k
2000.0, 4.0, 5.0, // $350k
1200.0, 2.0, 30.0, // $180k
// ... more samples
]).unwrap();
let y = Vector::from_slice(&[280.0, 350.0, 180.0, /* ... */]);
// Train Random Forest Regressor
let mut rf = RandomForestRegressor::new(50)
.with_max_depth(8)
.with_random_state(42);
rf.fit(&x, &y).unwrap();
// Predict: Average predictions from all 50 trees
let predictions = rf.predict(&x);
let r2 = rf.score(&x, &y); // R² coefficient
```
**Test Reference**: `src/tree/mod.rs::tests::test_random_forest_regressor_*`
### Prediction Aggregation for Regression
**Classification**:
```text
Prediction = mode([tree₁(x), tree₂(x), ..., treeₙ(x)]) # Majority vote
```
**Regression**:
```text
Prediction = mean([tree₁(x), tree₂(x), ..., treeₙ(x)]) # Average
```
**Why averaging works**:
- Each tree makes different errors due to bootstrap sampling
- Errors cancel out when averaged
- Result: smoother, more stable predictions
### Variance Reduction in Regression
**Single Decision Tree**:
- High variance (sensitive to data changes)
- Can overfit training data
- Predictions can be "jumpy" (discontinuous)
**Random Forest Ensemble**:
- Lower variance: Var(RF) ≈ Var(Tree) / √n_trees
- Averaging smooths out individual tree predictions
- More robust to outliers and noise
**Example**:
```text
Sample: [2000 sqft, 3 bed, 10 years]
Tree 1 predicts: $305k
Tree 2 predicts: $295k
Tree 3 predicts: $310k
...
Tree 50 predicts: $302k
Random Forest prediction: mean = $303k (stable!)
Single tree might predict: $310k or $295k (unstable)
```
### Comparison: Regression vs Classification
| **Task** | Predict continuous values | Predict discrete classes |
| **Base learner** | DecisionTreeRegressor | DecisionTreeClassifier |
| **Split criterion** | MSE (variance reduction) | Gini impurity |
| **Leaf prediction** | Mean of samples | Majority class |
| **Aggregation** | Average predictions | Majority vote |
| **Evaluation** | R² score, MSE, MAE | Accuracy, F1 score |
| **Output** | Real number (e.g., $305k) | Class label (e.g., 0, 1, 2) |
### When to Use Random Forest Regression
✅ **Good for**:
- Non-linear relationships (e.g., housing prices)
- Feature interactions (e.g., size × location)
- Outlier robustness
- When single tree overfits
- Want stable predictions (low variance)
❌ **Not ideal for**:
- Linear relationships (use LinearRegression)
- Need smooth predictions (trees predict step functions)
- Extrapolation beyond training range
- Very small datasets (< 50 samples)
### Example: Housing Price Prediction
```rust
// Non-linear housing data
let x = Matrix::from_vec(20, 4, vec![
1000.0, 2.0, 1.0, 50.0, // $140k (small, old)
2500.0, 5.0, 3.0, 3.0, // $480k (large, new)
// ... quadratic relationship between size and price
]).unwrap();
let y = Vector::from_slice(&[140.0, 480.0, /* ... */]);
// Train Random Forest
let mut rf = RandomForestRegressor::new(30).with_max_depth(6);
rf.fit(&x, &y).unwrap();
// Compare with single tree
let mut single_tree = DecisionTreeRegressor::new().with_max_depth(6);
single_tree.fit(&x, &y).unwrap();
let rf_r2 = rf.score(&x, &y); // e.g., 0.95
let tree_r2 = single_tree.score(&x, &y); // e.g., 1.00 (overfit!)
// On test data:
// RF generalizes better due to averaging
```
**Case Study**: See [Random Forest Regression](../examples/random-forest-regression.md)
### Hyperparameter Recommendations for Regression
**Default configuration**:
- `n_estimators = 50-100` (more trees = more stable)
- `max_depth = 8-12` (can be deeper than classification trees)
- No min_samples_split needed (averaging handles overfitting)
**Tuning strategy**:
1. Start with 50 trees, max_depth=8
2. Check train vs test R²
3. If overfitting: decrease max_depth or increase min_samples_split
4. If underfitting: increase max_depth or n_estimators
5. Use cross-validation for final tuning
---
## Hyperparameter Tuning
### Number of Trees (n_estimators)
**Trade-off**:
- **Too few (n < 10)**: High variance, unstable
- **Enough (n = 100)**: Good performance, stable
- **Many (n = 500+)**: Diminishing returns, slower training
**Rule of Thumb**:
- Start with 100 trees
- More trees never hurt accuracy (just slower)
- Increasing trees reduces overfitting
**Finding optimal n**:
```text
// Pseudocode
for n in [10, 50, 100, 200, 500] {
forest = RandomForestClassifier::new(n);
cv_score = cross_validate(forest, x, y, k=5);
// Select n with best cv_score (or when improvement plateaus)
}
```
### Max Depth (max_depth)
**Trade-off**:
- **Shallow trees (max_depth = 3)**: Underfitting
- **Deep trees (max_depth = 20+)**: OK for Random Forests! (bagging reduces overfitting)
- **Unlimited depth**: Common in Random Forests (unlike single trees)
**Random Forest advantage**: Can use deeper trees than single decision tree without overfitting.
### Feature Randomness (max_features)
**Typical values**:
- **Classification**: max_features = √m (where m = total features)
- **Regression**: max_features = m/3
**Trade-off**:
- **Low (e.g., 1)**: Very diverse trees, may miss important features
- **High (e.g., m)**: Correlated trees, loses ensemble benefit
- **Sqrt(m)**: Good balance (recommended default)
---
## Random Forest vs Single Decision Tree
### Comparison Table
| **Overfitting** | High | Low (averaging reduces variance) |
| **Stability** | Low (small data changes → different tree) | High (ensemble is stable) |
| **Interpretability** | High (can visualize) | Medium (100 trees hard to interpret) |
| **Training Speed** | Fast | Slower (train N trees) |
| **Prediction Speed** | Very fast | Slower (N predictions + voting) |
| **Accuracy** | Good | Better (typically +5-15% improvement) |
### Empirical Example
**Scenario**: Iris classification (150 samples, 4 features, 3 classes)
| Single Decision Tree (max_depth=5) | 93.3% |
| Random Forest (100 trees, max_depth=10) | 97.3% |
**Improvement**: +4% absolute, ~60% reduction in error rate!
---
## Advantages and Limitations
### Advantages ✅
1. **Reduced overfitting**: Averaging reduces variance
2. **Robust**: Handles noise, outliers well
3. **Feature importance**: Can rank feature importance across forest
4. **No feature scaling**: Inherits from decision trees
5. **Handles missing values**: Can impute or split on missingness
6. **Parallel training**: Trees are independent (can train in parallel)
7. **OOB score**: Free validation estimate
### Limitations ❌
1. **Less interpretable**: 100 trees vs 1 tree
2. **Memory**: Stores N trees (larger model size)
3. **Slower prediction**: Must query N trees
4. **Black box**: Hard to explain individual predictions (vs single tree)
5. **Extrapolation**: Can't predict outside training data range
---
## Understanding Bootstrap Sampling
### Bootstrap Sample Properties
**Original dataset**: 100 samples [S₁, S₂, ..., S₁₀₀]
**Bootstrap sample** (with replacement):
- Some samples appear 0 times (out-of-bag)
- Some samples appear 1 time
- Some samples appear 2+ times
**Probability analysis**:
```text
P(sample not chosen in one draw) = (n-1)/n
P(sample not in bootstrap, after n draws) = ((n-1)/n)ⁿ
As n → ∞: ((n-1)/n)ⁿ → 1/e ≈ 0.37
Result: ~37% of samples are out-of-bag
```
**Test Reference**: `src/tree/mod.rs::tests::test_bootstrap_sample_*`
### Diversity Through Sampling
**Example**: Dataset with 6 samples [A, B, C, D, E, F]
**Bootstrap Sample 1**: [A, A, C, D, F, F] (B and E missing)
**Bootstrap Sample 2**: [B, C, C, D, E, E] (A and F missing)
**Bootstrap Sample 3**: [A, B, D, D, E, F] (C missing)
**Result**: Each tree sees different data → different structure → diverse predictions
---
## Feature Importance
Random Forests naturally compute feature importance:
**Method**: For each feature, measure total reduction in Gini impurity across all trees
```text
Importance(feature_i) = Σ (over all nodes using feature_i) InfoGain
Normalize: Importance / Σ(all importances)
```
**Interpretation**:
- **High importance**: Feature frequently used for splits, high information gain
- **Low importance**: Feature rarely used or low information gain
- **Zero importance**: Feature never used
**Use cases**:
- Feature selection (drop low-importance features)
- Model interpretation (which features matter most?)
- Domain validation (do important features make sense?)
---
## Real-World Application
### Medical Diagnosis: Cancer Detection
**Problem**: Classify tumor as benign/malignant from 30 measurements
**Why Random Forest?**:
- Handles high-dimensional data (30 features)
- Robust to measurement noise
- Provides feature importance (which biomarkers matter?)
- Good accuracy (ensemble outperforms single tree)
**Result**: Random Forest achieves 97% accuracy vs 93% for single tree
### Credit Risk Assessment
**Problem**: Predict loan default from income, debt, employment, credit history
**Why Random Forest?**:
- Captures non-linear relationships (income × debt interaction)
- Robust to outliers (unusual income values)
- Handles mixed features (numeric + categorical)
**Result**: Random Forest reduces false negatives by 40% vs logistic regression
---
## Verification Through Tests
Random Forest tests verify ensemble properties:
**Bootstrap Tests**:
- Bootstrap sample has correct size (n samples)
- Reproducibility (same seed → same sample)
- Coverage (~63% of data in each sample)
**Forest Tests**:
- Correct number of trees trained
- All trees make predictions
- Majority voting works correctly
- Reproducible with random_state
**Test Reference**: `src/tree/mod.rs` (7+ ensemble tests)
---
## Further Reading
### Peer-Reviewed Papers
**Breiman (2001)** - *Random Forests*
- **Relevance**: Original Random Forest paper
- **Link**: [SpringerLink](https://link.springer.com/article/10.1023/A:1010933404324) (publicly accessible)
- **Key Contributions**:
- Bagging + feature randomness
- OOB error estimation
- Feature importance computation
- **Applied in**: `src/tree/mod.rs` RandomForestClassifier
**Dietterich (2000)** - *Ensemble Methods in Machine Learning*
- **Relevance**: Survey of ensemble techniques (bagging, boosting, voting)
- **Link**: [SpringerLink](https://link.springer.com/chapter/10.1007/3-540-45014-9_1)
- **Key Insight**: Why and when ensembles work
### Related Chapters
- [Decision Trees Theory](./decision-trees.md) - Foundation for Random Forests
- [Cross-Validation Theory](./cross-validation.md) - Tuning hyperparameters
- [Classification Metrics Theory](./classification-metrics.md) - Evaluating ensembles
---
## Summary
**What You Learned**:
- ✅ Ensemble methods: combine many models → better than any single model
- ✅ Bagging: train on bootstrap samples, average predictions
- ✅ Random Forests: bagging + feature randomness
- ✅ Variance reduction: Var(ensemble) ≈ Var(single) / N
- ✅ OOB score: free validation estimate (~37% out-of-bag)
- ✅ Hyperparameters: n_trees (100+), max_depth (deeper OK), max_features (√m)
- ✅ Advantages: less overfitting, robust, accurate
- ✅ Trade-off: less interpretable, slower than single tree
**Verification Guarantee**: Random Forest implementation extensively tested (7+ tests) in `src/tree/mod.rs`. Tests verify bootstrap sampling, tree training, voting, and reproducibility.
**Quick Reference**:
- **Default config**: 100 trees, max_depth=10-20, max_features=√m
- **Tuning**: More trees → better (just slower)
- **OOB score**: Estimate test accuracy without test set
- **Feature importance**: Which features matter most?
**Key Equations**:
```text
Bootstrap: Sample n times with replacement
Prediction: Majority_vote(tree₁, tree₂, ..., treeₙ)
Variance reduction: σ²_ensemble ≈ σ²_tree / N (if independent)
OOB samples: ~37% per tree
```
---
**Next Chapter**: [K-Means Clustering Theory](./kmeans-clustering.md)
**Previous Chapter**: [Decision Trees Theory](./decision-trees.md)