# Random Forest Regression - Housing Price Prediction
**Status**: ✅ Complete (Verified with 16+ tests)
This case study demonstrates Random Forest regression for predicting continuous values (housing prices) using bootstrap aggregating (bagging) to reduce variance and improve generalization.
**What You'll Learn**:
- When to use Random Forests vs single decision trees
- How bootstrap aggregating reduces variance
- Effect of n_estimators on prediction stability
- Hyperparameter tuning for regression forests
- Comparison with linear models
**Prerequisites**: Understanding of decision trees and regression metrics (R², MSE)
---
## Problem Statement
**Task**: Predict house prices (continuous values) from features like square footage, bedrooms, bathrooms, and age.
**Why Random Forest Regression?**
- **Variance reduction**: Averaging multiple trees reduces overfitting
- **No hyperparameter tuning**: Works well with default settings
- **Handles non-linearity**: Captures complex price relationships
- **Outlier robust**: Individual outliers affect fewer trees
- **Feature interactions**: Naturally models size × location × age interactions
**When NOT to use**:
- Linear relationships → Use LinearRegression (simpler, more interpretable)
- Very small datasets (< 50 samples) → Not enough data for bootstrap
- Need smooth predictions → Trees predict step functions
- Extrapolation required → Forests can't predict beyond training range
---
## Dataset
### Simulated Housing Data
```rust
// Features: [sqft, bedrooms, bathrooms, age]
// Target: price (in thousands)
let x_train = Matrix::from_vec(25, 4, vec![
// Small houses (1000-1400 sqft, old)
1000.0, 2.0, 1.0, 50.0, // $140k
1100.0, 2.0, 1.0, 45.0, // $145k
1200.0, 2.0, 1.0, 40.0, // $150k
// Medium houses (1500-1900 sqft, newer)
1500.0, 3.0, 2.0, 25.0, // $250k
1800.0, 3.0, 2.0, 10.0, // $295k
// Large houses (2000-3000 sqft, new)
2000.0, 4.0, 2.5, 8.0, // $360k
2500.0, 5.0, 3.0, 3.0, // $480k
// Luxury houses (4000+ sqft, brand new)
5000.0, 8.0, 6.0, 1.0, // $1600k
7000.0, 10.0, 8.0, 0.5, // $2700k
]).unwrap();
let y_train = Vector::from_slice(&[
140.0, 145.0, 150.0, 160.0, 170.0, // Small
250.0, 265.0, 280.0, 295.0, 310.0, // Medium
360.0, 410.0, 480.0, 550.0, 620.0, // Large
720.0, 800.0, 920.0, 1050.0, 1200.0, // Very large
1400.0, 1650.0, 1950.0, 2300.0, 2700.0, // Luxury
]);
```
**Data Characteristics**:
- 25 training samples, 4 features
- Non-linear price relationship (quadratic with size)
- Age discount effect (older houses cheaper)
- Multiple price tiers (small/medium/large/luxury)
---
## Implementation
### Step 1: Train Basic Random Forest
```rust
use aprender::prelude::*;
// Create Random Forest with 50 trees
let mut rf = RandomForestRegressor::new(50)
.with_max_depth(8)
.with_random_state(42);
// Fit to training data
rf.fit(&x_train, &y_train).unwrap();
// Predict on test data
let x_test = Matrix::from_vec(1, 4, vec![
2300.0, 4.0, 3.0, 6.0 // Large house: 2300 sqft, 4 bed, 3 bath, 6 years
]).unwrap();
let predicted_price = rf.predict(&x_test);
println!("Predicted: ${:.0}k", predicted_price.as_slice()[0]);
// Output: Predicted: $431k
// Evaluate with R² score
let r2 = rf.score(&x_train, &y_train);
println!("R² Score: {:.4}", r2);
// Output: R² Score: 0.9972
```
**Key API Methods**:
- `new(n_estimators)`: Create forest with N trees
- `with_max_depth(depth)`: Limit individual tree depth
- `with_random_state(seed)`: Reproducible bootstrap sampling
- `fit(&x, &y)`: Train all trees on bootstrap samples
- `predict(&x)`: Average predictions from all trees
- `score(&x, &y)`: Compute R² coefficient
**Test Reference**: `src/tree/mod.rs::test_random_forest_regressor_fit_simple_linear`
---
### Step 2: Compare with Single Decision Tree
Random Forests reduce variance through ensemble averaging:
```rust
// Train Random Forest
let mut rf = RandomForestRegressor::new(50).with_max_depth(5);
rf.fit(&x_train, &y_train).unwrap();
// Train single Decision Tree
let mut single_tree = DecisionTreeRegressor::new().with_max_depth(5);
single_tree.fit(&x_train, &y_train).unwrap();
// Compare R² scores
let rf_r2 = rf.score(&x_train, &y_train); // 0.9972
let tree_r2 = single_tree.score(&x_train, &y_train); // 0.9999
println!("Random Forest R²: {:.4}", rf_r2);
println!("Single Tree R²: {:.4}", tree_r2);
```
**Interpretation**:
- **Training R²**: Single tree often higher (can perfectly memorize)
- **Test R²**: Random Forest generalizes better (reduces overfitting)
- **Variance**: RF predictions more stable across different data splits
**Why Random Forest Wins on Test Data**:
1. **Bootstrap sampling**: Each tree sees different data
2. **Error averaging**: Independent errors cancel out
3. **Reduced variance**: Var(RF) ≈ Var(Tree) / √n_trees
**Test Reference**: `src/tree/mod.rs::test_random_forest_regressor_vs_single_tree`
---
### Step 3: Understanding Bootstrap Aggregating
**How Bagging Works**:
```text
Training data: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] (10 samples)
Bootstrap sample 1 (with replacement):
[2, 5, 7, 7, 1, 9, 3, 10, 5, 6] → Train Tree 1
Bootstrap sample 2 (with replacement):
[1, 1, 4, 8, 3, 6, 9, 2, 5, 10] → Train Tree 2
Bootstrap sample 3 (with replacement):
[5, 3, 8, 1, 7, 9, 4, 4, 2, 6] → Train Tree 3
...
Bootstrap sample 50:
[4, 7, 1, 3, 10, 5, 8, 2, 9, 6] → Train Tree 50
Prediction for new sample:
Tree 1: $305k
Tree 2: $298k
Tree 3: $310k
...
Tree 50: $302k
Random Forest: (305 + 298 + 310 + ... + 302) / 50 = $303k
```
**Key Properties**:
- Each bootstrap sample has ~63% unique samples
- ~37% of samples are "out-of-bag" (not in that sample)
- Trees are decorrelated (see different data)
- Averaging reduces variance
**Test Reference**: `src/tree/mod.rs::test_random_forest_regressor_random_state`
---
## Hyperparameter Tuning
### n_estimators: Number of Trees
```rust
let n_estimators_values = [5, 10, 30, 100];
for &n_est in &n_estimators_values {
let mut rf = RandomForestRegressor::new(n_est)
.with_max_depth(5)
.with_random_state(42);
rf.fit(&x_train, &y_train).unwrap();
let r2 = rf.score(&x_train, &y_train);
println!("n_estimators={}: R² = {:.4}", n_est, r2);
}
// Output:
// n_estimators=5: R² = 0.9751
// n_estimators=10: R² = 0.9912
// n_estimators=30: R² = 0.9922
// n_estimators=100: R² = 0.9928
```
**Interpretation**:
- **n=5**: Noticeable variance, predictions less stable
- **n=10-30**: Good balance, diminishing returns
- **n=100+**: Minimal improvement, just slower training
**Rule of Thumb**:
- Start with 30-50 trees
- More trees **never hurt** accuracy (just slower)
- Typical range: 30-100 trees
- Production: 50-100 for best stability
**Test Reference**: `src/tree/mod.rs::test_random_forest_regressor_n_estimators_effect`
---
### max_depth: Tree Complexity
```rust
// Shallow trees (max_depth=2)
let mut rf_shallow = RandomForestRegressor::new(15).with_max_depth(2);
rf_shallow.fit(&x_train, &y_train).unwrap();
let r2_shallow = rf_shallow.score(&x_train, &y_train); // 0.87
// Deep trees (max_depth=8)
let mut rf_deep = RandomForestRegressor::new(15).with_max_depth(8);
rf_deep.fit(&x_train, &y_train).unwrap();
let r2_deep = rf_deep.score(&x_train, &y_train); // 0.99
println!("Shallow (depth=2): R² = {:.2}", r2_shallow);
println!("Deep (depth=8): R² = {:.2}", r2_deep);
```
**Trade-off**:
- **Too shallow**: Underfitting (high bias)
- **Too deep**: Individual trees overfit, but averaging helps
- **Sweet spot**: 5-12 for Random Forests (deeper OK than single trees)
**Hyperparameter Guidance**:
- Single tree max_depth: 3-7 (prevent overfitting)
- Random Forest max_depth: 5-12 (averaging mitigates overfitting)
- Let trees grow deeper in RF → each captures different patterns
**Test Reference**: `src/tree/mod.rs::test_random_forest_regressor_max_depth_effect`
---
## Variance Reduction Demonstration
Random Forests achieve lower variance through ensemble averaging:
```rust
// Train 5 single trees (simulate variance)
let mut tree_predictions = Vec::new();
for seed in 0..5 {
let mut tree = DecisionTreeRegressor::new().with_max_depth(6);
tree.fit(&x_train, &y_train).unwrap();
tree_predictions.push(tree.predict(&x_test));
}
// Single trees vary:
// Tree 1: $422k
// Tree 2: $431k
// Tree 3: $415k
// Tree 4: $428k
// Tree 5: $420k
// Std: $6.2k (high variance)
// Random Forest (50 trees):
let mut rf = RandomForestRegressor::new(50).with_max_depth(6);
rf.fit(&x_train, &y_train).unwrap();
let rf_pred = rf.predict(&x_test);
// Prediction: $423k (stable, low variance)
```
**Mathematical Insight**:
```text
If trees make independent errors:
Var(Single Tree) = σ²
Var(Average of N trees) = σ² / N
For 50 trees:
Var(RF) = σ² / 50 ≈ 0.02 * σ²
Std(RF) = σ / √50 ≈ 0.14 * σ
→ Random Forest has ~7x lower standard deviation!
```
**In Practice**:
- Trees aren't fully independent (correlatedthrough data)
- Still achieve 3-5x variance reduction
- More stable predictions, better generalization
---
## Non-Linear Patterns
Random Forests naturally handle non-linearities:
```rust
// Quadratic data: y = x²
let x_quad = Matrix::from_vec(12, 1, vec![
1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0
]).unwrap();
let y_quad = Vector::from_slice(&[
1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0, 64.0, 81.0, 100.0, 121.0, 144.0
]);
// Random Forest
let mut rf = RandomForestRegressor::new(30).with_max_depth(4);
rf.fit(&x_quad, &y_quad).unwrap();
let rf_r2 = rf.score(&x_quad, &y_quad); // 0.9875
// Linear Regression
let mut lr = LinearRegression::new();
lr.fit(&x_quad, &y_quad).unwrap();
let lr_r2 = lr.score(&x_quad, &y_quad); // 0.9477
println!("Random Forest captures non-linearity:");
println!(" RF R²: {:.4}", rf_r2);
println!(" Linear R²: {:.4}", lr_r2);
println!(" Advantage: {:.1}%", (rf_r2 - lr_r2) * 100.0);
```
**Why RF Works Better**:
- Trees learn local patterns (piecewise constant)
- Averaging smooths predictions
- No manual feature engineering needed (no x² term)
- Handles any non-linear relationship
**Test Reference**: `src/tree/mod.rs::test_random_forest_regressor_comparison_with_linear_regression`
---
## Edge Cases and Validation
### Constant Target
```rust
// All houses same price
let x = Matrix::from_vec(5, 1, vec![1.0, 2.0, 3.0, 4.0, 5.0]).unwrap();
let y = Vector::from_slice(&[100.0, 100.0, 100.0, 100.0, 100.0]);
let mut rf = RandomForestRegressor::new(10).with_max_depth(3);
rf.fit(&x, &y).unwrap();
// Predictions should be constant
let predictions = rf.predict(&x);
for &pred in predictions.as_slice() {
assert!((pred - 100.0).abs() < 1e-5); // All ≈ 100.0
}
```
**Behavior**: All trees predict mean value (100.0), ensemble average is also 100.0.
**Test Reference**: `src/tree/mod.rs::test_random_forest_regressor_constant_target`
---
### Reproducibility with random_state
```rust
// Train two forests with same random_state
let mut rf1 = RandomForestRegressor::new(20)
.with_max_depth(5)
.with_random_state(42);
rf1.fit(&x_train, &y_train).unwrap();
let mut rf2 = RandomForestRegressor::new(20)
.with_max_depth(5)
.with_random_state(42);
rf2.fit(&x_train, &y_train).unwrap();
// Predictions are identical
let pred1 = rf1.predict(&x_test);
let pred2 = rf2.predict(&x_test);
for (p1, p2) in pred1.as_slice().iter().zip(pred2.as_slice().iter()) {
assert!((p1 - p2).abs() < 1e-10); // Bit-wise identical
}
```
**Use Case**: Reproducible experiments, debugging, scientific publications.
**Test Reference**: `src/tree/mod.rs::test_random_forest_regressor_random_state`
---
### Validation Errors
```rust
// Error: Mismatched dimensions
let x = Matrix::from_vec(5, 2, vec![...]).unwrap();
let y = Vector::from_slice(&[1.0, 2.0, 3.0]); // Only 3 targets!
let mut rf = RandomForestRegressor::new(10);
assert!(rf.fit(&x, &y).is_err()); // Returns error
// Error: Predict before fit
let rf_unfitted = RandomForestRegressor::new(10);
// rf_unfitted.predict(&x); // Would panic!
```
**Validation Checks**:
- n_samples(X) == n_samples(y)
- n_samples > 0
- Model must be fitted before predict
**Test Reference**: `src/tree/mod.rs::test_random_forest_regressor_validation_errors`
---
## Practical Recommendations
### When to Use Random Forest Regression
✅ **Use when**:
- Non-linear relationships in data (housing prices, stock prices)
- Feature interactions important (size × location × time)
- Medium to large datasets (100+ samples for good bootstrap)
- Want stable, low-variance predictions
- Don't have time for extensive hyperparameter tuning
- Outliers present in data
❌ **Don't use when**:
- Linear relationships (use LinearRegression)
- Very small datasets (< 50 samples, not enough for bootstrap)
- Need smooth predictions (trees predict step functions)
- Extrapolation required (beyond training range)
- Interpretability critical (use single decision tree)
### Hyperparameter Selection Guide
| **n_estimators** | 30-100 | Number of trees | Want more stability | Training too slow |
| **max_depth** | 5-12 | Tree complexity | Underfitting | Overfitting (rare) |
| **random_state** | Any integer | Reproducibility | N/A | N/A (set for experiments) |
**Quick Start Configuration**:
```rust
let mut rf = RandomForestRegressor::new(50) // 50 trees (good default)
.with_max_depth(8) // Moderate depth
.with_random_state(42); // Reproducible
```
**Tuning Process**:
1. Start with defaults: `n_estimators=50`, `max_depth=8`
2. Check train/test R² with cross-validation
3. If underfitting: increase max_depth
4. If overfitting (rare): decrease max_depth
5. For production: increase n_estimators to 100
### Debugging Checklist
**Low R² on training data**:
- [ ] Trees too shallow (increase max_depth)
- [ ] Too few trees (increase n_estimators)
- [ ] Data has no predictive signal (check correlation)
**Perfect train R², poor test R²** (rare for RF):
- [ ] Very small dataset (< 50 samples)
- [ ] Data leakage (test data in training set)
- [ ] Distribution shift (test data different from train)
**Unexpected predictions**:
- [ ] Check for feature scaling (not needed, but verify units)
- [ ] Verify random_state for reproducibility
- [ ] Check training data quality (outliers, missing values)
---
## Full Example Code
```rust
use aprender::prelude::*;
fn main() {
// Housing data: [sqft, bedrooms, bathrooms, age]
let x_train = Matrix::from_vec(10, 4, vec![
1500.0, 3.0, 2.0, 10.0, // $280k
2000.0, 4.0, 2.5, 5.0, // $350k
1200.0, 2.0, 1.0, 30.0, // $180k
1800.0, 3.0, 2.0, 15.0, // $300k
2500.0, 5.0, 3.0, 2.0, // $450k
1000.0, 2.0, 1.0, 50.0, // $150k
2200.0, 4.0, 3.0, 8.0, // $380k
1600.0, 3.0, 2.0, 20.0, // $260k
3000.0, 5.0, 4.0, 1.0, // $520k
1400.0, 3.0, 1.5, 25.0, // $220k
]).unwrap();
let y_train = Vector::from_slice(&[
280.0, 350.0, 180.0, 300.0, 450.0,
150.0, 380.0, 260.0, 520.0, 220.0,
]);
// Train Random Forest
let mut rf = RandomForestRegressor::new(50)
.with_max_depth(8)
.with_random_state(42);
rf.fit(&x_train, &y_train).unwrap();
// Evaluate
let r2 = rf.score(&x_train, &y_train);
println!("Training R² Score: {:.3}", r2);
// Predict on new house
let x_new = Matrix::from_vec(1, 4, vec![
1900.0, 4.0, 2.0, 12.0 // 1900 sqft, 4 bed, 2 bath, 12 years
]).unwrap();
let price = rf.predict(&x_new);
println!("Predicted price: ${:.0}k", price.as_slice()[0]);
}
```
**Run the example**:
```bash
cargo run --example random_forest_regression
```
---
## Related Reading
**Theory**:
- [Ensemble Methods Theory](../ml-fundamentals/ensemble-methods.md) - Bagging, variance reduction
- [Decision Trees Theory](../ml-fundamentals/decision-trees.md) - Base learners
**Other Algorithms**:
- [Decision Tree Regression](./decision-tree-regression.md) - Single tree comparison
- [Linear Regression](./linear-regression.md) - Linear baseline
**Code Reference**:
- Implementation: `src/tree/mod.rs` (RandomForestRegressor)
- Tests: `src/tree/mod.rs::tests::test_random_forest_regressor_*` (16 tests)
- Example: `examples/random_forest_regression.rs`
---
## Summary
**Key Takeaways**:
- ✅ Random Forest uses bootstrap aggregating to reduce variance
- ✅ Predictions are averaged across all trees (mean for regression)
- ✅ n_estimators=30-100 provides good stability
- ✅ max_depth=5-12 (deeper OK than single trees)
- ✅ Handles non-linear relationships without feature engineering
- ✅ Reduces overfitting compared to single decision trees
- ✅ Reproducible with random_state parameter
**Best Practices**:
1. Start with 50 trees, max_depth=8
2. Use random_state for reproducible experiments
3. Check train/test R² gap (should be small)
4. Compare with single tree to verify variance reduction
5. Compare with LinearRegression to check non-linearity benefit
**Typical Performance**:
- Training R²: 0.95-1.00 (high but not overfitting)
- Test R²: Often 5-15% better than single tree
- Prediction variance: ~1/√n_trees of single tree variance
**Verification**: Implementation tested with 16 comprehensive tests in `src/tree/mod.rs`, including edge cases, parameter validation, and comparison with single trees and linear regression.
---
**Next**: Gradient Boosting (planned)
**Previous**: [Decision Tree Regression](./decision-tree-regression.md)