aprender 0.31.2

<!-- PCU: examples-knn-iris | contract: contracts/apr-page-examples-knn-iris-v1.yaml -->
<!-- Example: cargo run -p aprender-core --example knn_iris -->
<!-- Status: enforced -->

# Case Study: KNN Iris

This case study demonstrates K-Nearest Neighbors (kNN) classification on the Iris dataset, exploring the effects of k values, distance metrics, and voting strategies to achieve 90% test accuracy.

## Overview

We'll apply kNN to Iris flower data to:
- Classify three species (Setosa, Versicolor, Virginica)
- Explore the effect of k parameter (1, 3, 5, 7, 9)
- Compare distance metrics (Euclidean, Manhattan, Minkowski)
- Analyze weighted vs uniform voting
- Generate probabilistic predictions with confidence scores

## Running the Example

```bash
cargo run -p aprender-core --example knn_iris
```

Expected output: Comprehensive kNN analysis including accuracy for different k values, distance metric comparison, voting strategy comparison, and probabilistic predictions with confidence scores.

## Dataset

### Iris Flower Measurements

```rust,ignore
// Features: [sepal_length, sepal_width, petal_length, petal_width]
// Classes: 0=Setosa, 1=Versicolor, 2=Virginica

// Training set: 20 samples (7 Setosa, 7 Versicolor, 6 Virginica)
let x_train = Matrix::from_vec(20, 4, vec![
    // Setosa (small petals, large sepals)
    5.1, 3.5, 1.4, 0.2,
    4.9, 3.0, 1.4, 0.2,
    ...
    // Versicolor (medium petals and sepals)
    7.0, 3.2, 4.7, 1.4,
    6.4, 3.2, 4.5, 1.5,
    ...
    // Virginica (large petals and sepals)
    6.3, 3.3, 6.0, 2.5,
    5.8, 2.7, 5.1, 1.9,
    ...
])?;

// Test set: 10 samples (3 Setosa, 3 Versicolor, 4 Virginica)
```

**Dataset characteristics**:
- 20 training samples (67% of 30-sample dataset)
- 10 test samples (33% of dataset)
- 4 continuous features (all in centimeters)
- 3 well-separated species classes
- Roughly balanced class distribution in the training set (7/7/6)

## Part 1: Basic kNN (k=3)

### Implementation

```rust,ignore
use aprender::classification::KNearestNeighbors;
use aprender::primitives::Matrix;

let mut knn = KNearestNeighbors::new(3);
knn.fit(&x_train, &y_train)?;

let predictions = knn.predict(&x_test)?;
let accuracy = compute_accuracy(&predictions, &y_test);
```
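The snippet above calls `compute_accuracy` without showing it. A minimal sketch of such a helper follows, assuming predictions and labels are slices of `usize` class indices; the actual helper in the example may use a different signature.

```rust
/// Fraction of predictions that match the true labels, in [0.0, 1.0].
/// Hypothetical helper: the `usize` label type is an assumption.
fn compute_accuracy(predictions: &[usize], targets: &[usize]) -> f32 {
    assert_eq!(predictions.len(), targets.len(), "length mismatch");
    if targets.is_empty() {
        return 0.0;
    }
    // Count positions where the predicted class equals the true class.
    let correct = predictions
        .iter()
        .zip(targets)
        .filter(|(p, t)| p == t)
        .count();
    correct as f32 / targets.len() as f32
}
```

With 10 test samples, each misclassification moves the result by 0.1, so 9 correct predictions yield 0.9.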

### Results

```text
Test Accuracy: 90.0%
```

**Analysis**:
- 9 out of 10 test samples correctly classified
- k=3 provides good balance between bias and variance
- Works well even without hyperparameter tuning
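To make the mechanics concrete, here is a library-independent sketch of brute-force kNN with uniform voting. It is not aprender's implementation; the function name `knn_predict`, the fixed four-feature rows, and the tie-breaking rule are assumptions chosen for illustration.

```rust
/// Classify `query` by majority vote among its k nearest training rows.
/// Ties go to the lowest class index (an arbitrary choice in this sketch).
fn knn_predict(
    train: &[[f32; 4]],
    labels: &[usize],
    query: &[f32; 4],
    k: usize,
    n_classes: usize,
) -> usize {
    // Squared Euclidean distance to every training row, paired with its label.
    // Squaring preserves the neighbor ranking, so the square root is skipped.
    let mut dists: Vec<(f32, usize)> = train
        .iter()
        .zip(labels)
        .map(|(row, &label)| {
            let d2: f32 = row.iter().zip(query).map(|(a, b)| (a - b) * (a - b)).sum();
            (d2, label)
        })
        .collect();
    // Sort ascending by distance; total_cmp gives a total order on floats.
    dists.sort_by(|a, b| a.0.total_cmp(&b.0));

    // One vote per neighbor among the k closest.
    let mut votes = vec![0usize; n_classes];
    for &(_, label) in dists.iter().take(k) {
        votes[label] += 1;
    }
    // Index of the class with the most votes.
    let mut best = 0;
    for (class, &count) in votes.iter().enumerate() {
        if count > votes[best] {
            best = class;
        }
    }
    best
}
```

Sorting all distances costs O(n log n) per query, which is fine for 20 training rows; larger datasets usually call for a bounded heap or a spatial index.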

## Part 2: Effect of k Parameter

### Experiment

```rust,ignore
for k in [1, 3, 5, 7, 9] {
    let mut knn = KNearestNeighbors::new(k);
    knn.fit(&x_train, &y_train)?;
    let predictions = knn.predict(&x_test)?;
    let accuracy = compute_accuracy(&predictions, &y_test);
    println!("k={}: Accuracy = {:.1}%", k, accuracy * 100.0);
}
```

### Results

```text
k=1: Accuracy = 90.0%
k=3: Accuracy = 90.0%
k=5: Accuracy = 80.0%
k=7: Accuracy = 80.0%
k=9: Accuracy = 80.0%
```

### Interpretation

**Small k (1-3)**:
- **90% accuracy**: Best performance on this dataset
- **k=1** reproduces the training labels exactly (each training point is its own nearest neighbor)
- **k=3** balances local patterns with noise reduction
- **Risk**: Overfitting, sensitive to outliers

**Large k (5-9)**:
- **80% accuracy**: Performance degrades
- Decision boundaries become smoother
- More robust to noise but loses fine distinctions
- **k=9** uses 45% of training data for each prediction (9/20)
- **Risk**: Underfitting, class boundaries blur

**Optimal k**:
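- For this dataset: **k=1** and **k=3** tie for the best test accuracy (90%); k=3 is the safer choice because it is less sensitive to individual noisy points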
- General rule: k ≈ √n = √20 ≈ 4.5 (close to optimal)
- Use cross-validation for systematic selection
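The √n rule of thumb can be turned into a small helper that proposes candidate values to try first. This is only a sketch; `sqrt_k_candidates` is a hypothetical name, and the result is a starting point for a search, not a replacement for cross-validation.

```rust
/// Rule-of-thumb starting range for k: the integers on either side of √n.
/// Both values are clamped to at least 1.
fn sqrt_k_candidates(n_train: usize) -> (usize, usize) {
    let root = (n_train as f64).sqrt();
    let low = (root.floor() as usize).max(1);
    let high = (root.ceil() as usize).max(1);
    (low, high)
}
```

For the 20-sample training set this gives the pair (4, 5), which brackets √20 ≈ 4.5.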

## Part 3: Distance Metrics (k=5)

### Comparison

```rust,ignore
use aprender::classification::{DistanceMetric, KNearestNeighbors};

let mut knn_euclidean = KNearestNeighbors::new(5)
    .with_metric(DistanceMetric::Euclidean);

let mut knn_manhattan = KNearestNeighbors::new(5)
    .with_metric(DistanceMetric::Manhattan);

let mut knn_minkowski = KNearestNeighbors::new(5)
    .with_metric(DistanceMetric::Minkowski(3.0));
```

### Results

```text
Euclidean distance:   80.0%
Manhattan distance:   80.0%
Minkowski (p=3):      80.0%
```

### Interpretation

**Identical performance** (80%) across all metrics for k=5.

**Why?**:
- Iris features (sepal/petal dimensions) are all continuous and similarly scaled
- All three metrics capture species differences effectively
- Ranking of neighbors is similar across metrics

**When metrics differ**:
- **Euclidean**: Best for continuous, normally distributed features
- **Manhattan**: Better for count data or when outliers present
- **Minkowski (p>2)**: Emphasizes dimensions with largest differences

**Recommendation**: Use Euclidean (default) for continuous features, Manhattan for robustness to outliers.
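For reference, the three metrics can be written in a few lines each. The functions below are a standalone sketch over plain slices, not the library's `DistanceMetric` internals.

```rust
/// Manhattan (L1) distance: sum of absolute per-feature differences.
fn manhattan(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "dimension mismatch");
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
}

/// Euclidean (L2) distance: square root of the sum of squared differences.
fn euclidean(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "dimension mismatch");
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

/// Minkowski distance of order p (p >= 1).
/// p = 1 reduces to Manhattan and p = 2 to Euclidean.
fn minkowski(a: &[f32], b: &[f32], p: f32) -> f32 {
    assert_eq!(a.len(), b.len(), "dimension mismatch");
    assert!(p >= 1.0, "p must be at least 1");
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).abs().powf(p))
        .sum::<f32>()
        .powf(1.0 / p)
}
```

For the same pair of points the distance shrinks as p grows (Manhattan ≥ Euclidean ≥ Minkowski with p=3), but the ranking of neighbors often stays the same, which is consistent with the identical accuracies above.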

## Part 4: Weighted vs Uniform Voting

### Comparison

```rust,ignore
// Uniform voting: all neighbors count equally
let mut knn_uniform = KNearestNeighbors::new(5);
knn_uniform.fit(&x_train, &y_train)?;

// Weighted voting: closer neighbors count more
let mut knn_weighted = KNearestNeighbors::new(5).with_weights(true);
knn_weighted.fit(&x_train, &y_train)?;
```

### Results

```text
Uniform voting:   80.0%
Weighted voting:  90.0%
```

### Interpretation

**Weighted voting improves accuracy by 10 percentage points** (from 80% to 90%), which is one additional correct prediction out of the 10 test samples.

**Why weighted voting helps**:
- Gives more influence to closer (more similar) neighbors
- Reduces impact of distant outliers in k=5 neighborhood
- More intuitive: "very close neighbors matter more"
- Weight formula: w_i = 1 / distance_i

**Example scenario**:
```text
Neighbor distances for test sample:
  Neighbor 1: d=0.2, class=Versicolor, weight=5.0
  Neighbor 2: d=0.3, class=Versicolor, weight=3.3
  Neighbor 3: d=0.5, class=Versicolor, weight=2.0
  Neighbor 4: d=1.8, class=Setosa,     weight=0.56
  Neighbor 5: d=2.0, class=Setosa,     weight=0.50

Uniform: 3 votes Versicolor, 2 votes Setosa → Versicolor (60%)
Weighted: 10.3 weighted votes Versicolor, 1.06 Setosa → Versicolor (91%)
```

**Recommendation**: Use weighted voting for k ≥ 5, uniform for k ≤ 3.
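The example scenario can be reproduced with a short sketch of inverse-distance voting. The function names and the small `EPS` guard against zero distances are assumptions; the library may treat exact matches differently.

```rust
/// Sum inverse-distance weights per class: w_i = 1 / d_i.
/// Each neighbor is a (distance, class index) pair.
fn weighted_votes(neighbors: &[(f32, usize)], n_classes: usize) -> Vec<f32> {
    // Assumed guard so an exact match (d = 0) does not divide by zero.
    const EPS: f32 = 1e-8;
    let mut votes = vec![0.0f32; n_classes];
    for &(dist, class) in neighbors {
        votes[class] += 1.0 / (dist + EPS);
    }
    votes
}

/// Convert raw vote totals into shares that sum to 1.
fn normalize(votes: &[f32]) -> Vec<f32> {
    let total: f32 = votes.iter().sum();
    if total == 0.0 {
        return vec![0.0; votes.len()];
    }
    votes.iter().map(|v| v / total).collect()
}
```

Feeding in the five neighbors from the scenario (class 0 = Setosa, class 1 = Versicolor) gives roughly 10.3 weighted votes for Versicolor and 1.06 for Setosa, so the normalized share for Versicolor is about 91%.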

## Part 5: Probabilistic Predictions

### Implementation

```rust,ignore
let mut knn_proba = KNearestNeighbors::new(5).with_weights(true);
knn_proba.fit(&x_train, &y_train)?;

let probabilities = knn_proba.predict_proba(&x_test)?;
let predictions = knn_proba.predict(&x_test)?;
```

### Results

```text
Sample  Predicted  Setosa  Versicolor  Virginica
─────────────────────────────────────────────────────
   0     Setosa       100.0%    0.0%       0.0%
   1     Setosa       100.0%    0.0%       0.0%
   2     Setosa       100.0%    0.0%       0.0%
   3     Versicolor   30.4%    69.6%       0.0%
   4     Versicolor   0.0%    100.0%       0.0%
```

### Interpretation

**Samples 0-2 (Setosa)**:
- **100% confidence**: All 5 nearest neighbors are Setosa
- Perfect separation from other species
- Small petals (1.4-1.5 cm) characteristic of Setosa

**Sample 3 (Versicolor)**:
- **69.6% confidence**: Some Setosa neighbors nearby
- **30.4% Setosa**: Near species boundary
- Medium features create some overlap

**Sample 4 (Versicolor)**:
- **100% confidence**: Clear Versicolor region
- All 5 neighbors are Versicolor

**Confidence interpretation**:
- 90-100%: High confidence, far from decision boundary
- 70-90%: Medium confidence, near boundary
- 50-70%: Low confidence, ambiguous region
- <50%: Prediction uncertain, manual review recommended
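These bands can be encoded as a small helper that takes the winning class probability. The `Confidence` enum and `confidence_band` function are hypothetical names, and because the listed ranges share their endpoints, this sketch assumes a boundary value belongs to the higher band.

```rust
/// Confidence bands from the interpretation list above.
#[derive(Debug, PartialEq)]
enum Confidence {
    High,      // 90-100%
    Medium,    // 70-90%
    Low,       // 50-70%
    Uncertain, // below 50%
}

/// Map the winning class probability (0.0..=1.0) to a band.
/// Assumption: a value exactly on a boundary goes to the higher band.
fn confidence_band(max_prob: f32) -> Confidence {
    if max_prob >= 0.9 {
        Confidence::High
    } else if max_prob >= 0.7 {
        Confidence::Medium
    } else if max_prob >= 0.5 {
        Confidence::Low
    } else {
        Confidence::Uncertain
    }
}
```

Under this mapping, Sample 3 (69.6% Versicolor) falls just inside the low-confidence band, while Samples 0-2 and 4 are high confidence.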

## Best Configuration

### Summary

```text
Best configuration found:
- k = 5 neighbors
- Distance metric: Euclidean
- Voting: Weighted by inverse distance
- Test accuracy: 90.0%
```

### Why This Works

1. **k=5**: Large enough to be robust, small enough to capture local patterns
2. **Euclidean**: Natural for continuous features
3. **Weighted voting**: Leverages proximity information effectively
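4. **90% accuracy**: 9 of 10 test samples correct (1 misclassification), matching the uniform-voting results for k=1 and k=3; with a test set this small, each sample shifts accuracy by 10 points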

### Comparison to Other Classifiers

| Classifier | Iris Accuracy | Training Time | Prediction Time |
|------------|--------------|---------------|-----------------|
| **kNN (k=5, weighted)** | **90%** | Instant | O(n·p) per sample |
| Logistic Regression | 90-95% | Fast | Very fast |
| Decision Tree | 85-95% | Medium | Fast |
| Random Forest | 95-100% | Slow | Medium |

kNN provides competitive accuracy with zero training time but slower predictions.

## Key Insights

### 1. Small k (1-3)
- Risk of **overfitting**
- Sensitive to noise and outliers
- Captures fine-grained decision boundaries
- Best when data is clean and well-separated

### 2. Large k (5-9)
- Risk of **underfitting**
- Class boundaries blur together
- More robust to noise
- Best when data is noisy or classes overlap

### 3. Weighted Voting
- Gives more influence to closer neighbors
- **Critical improvement**: 80% → 90% accuracy for k=5
- Especially beneficial for larger k values
- More intuitive than uniform voting

### 4. Distance Metric Selection
- **Euclidean**: Best for continuous features (default choice)
- **Manhattan**: More robust to outliers
- **Minkowski**: Generalizes both (p=1 is Manhattan, p=2 is Euclidean); p>2 emphasizes the largest per-feature differences
- For Iris: All metrics perform similarly (well-behaved data)

## Performance Metrics

### Time Complexity

| Operation | Iris Dataset (n=20, p=4, k=5) | General Complexity |
|-----------|-------------------------------|--------------------|
| Training (fit) | 0.001 ms | O(1) - just stores data |
| Distance computation | 0.02 ms | O(n·p) per sample |
| Finding k-nearest | 0.01 ms | O(n log k) per sample |
| Voting | <0.001 ms | O(k·c) per sample |
| **Total prediction** | **~0.03 ms** | **O(n·p) per sample** |

**Bottleneck**: Distance computation dominates (67% of time).
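The O(n log k) bound for finding the k nearest neighbors typically comes from keeping a max-heap of at most k candidates. The sketch below shows that technique with the standard library's `BinaryHeap`; `Candidate` and `k_nearest` are illustrative names, and the library may select neighbors differently.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// (distance, training index) pair, ordered by distance so it can sit in a heap.
struct Candidate {
    dist: f32,
    index: usize,
}

impl Ord for Candidate {
    fn cmp(&self, other: &Self) -> Ordering {
        // total_cmp gives floats a total order; the index breaks ties.
        self.dist
            .total_cmp(&other.dist)
            .then(self.index.cmp(&other.index))
    }
}
impl PartialOrd for Candidate {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Candidate {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Candidate {}

/// Indices of the k smallest distances, nearest first.
/// Each push/pop costs O(log k), so the whole scan is O(n log k).
fn k_nearest(distances: &[f32], k: usize) -> Vec<usize> {
    let mut heap: BinaryHeap<Candidate> = BinaryHeap::with_capacity(k + 1);
    for (index, &dist) in distances.iter().enumerate() {
        heap.push(Candidate { dist, index });
        if heap.len() > k {
            heap.pop(); // BinaryHeap is a max-heap: this drops the farthest candidate
        }
    }
    // into_sorted_vec returns ascending order, i.e. nearest first.
    heap.into_sorted_vec().into_iter().map(|c| c.index).collect()
}
```

For n=20 and k=5 the difference from a full sort is negligible, but the heap keeps memory at O(k) and scales better when n is large.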

### Memory Usage

**Training storage**:
- x_train: 20×4×4 = 320 bytes
- y_train: 20×8 = 160 bytes
- **Total**: ~480 bytes

**Per-sample prediction**:
- Distance array: 20×4 = 80 bytes
- Neighbor buffer: 5×12 = 60 bytes
- **Total**: ~140 bytes per sample

**Scalability**: kNN requires storing entire training set, making it memory-intensive for large datasets (n > 100,000).
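The byte counts above can be reproduced for other dataset sizes with a short calculation. The element sizes are inferred from the factors in the figures (4-byte `f32` features, 8-byte labels, and a 12-byte neighbor entry read as a 4-byte distance plus an 8-byte index with no padding); they are assumptions about the layout, not confirmed library internals.

```rust
use std::mem::size_of;

/// Bytes needed to store n training rows with p features each, plus labels.
/// Assumes f32 features and 8-byte integer labels.
fn training_bytes(n: usize, p: usize) -> usize {
    n * p * size_of::<f32>() + n * size_of::<u64>()
}

/// Scratch bytes for one prediction: n distances plus k neighbor entries.
/// Assumes each neighbor entry is an f32 distance and an 8-byte index,
/// counted without alignment padding.
fn prediction_bytes(n: usize, k: usize) -> usize {
    n * size_of::<f32>() + k * (size_of::<f32>() + size_of::<u64>())
}
```

With n=20, p=4, k=5 this yields 480 and 140 bytes; at n=100,000 with the same four features the training storage grows to about 2.4 MB.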

## Full Code

```rust,ignore
use aprender::classification::{KNearestNeighbors, DistanceMetric};
use aprender::primitives::Matrix;

// 1. Load data
let (x_train, y_train, x_test, y_test) = load_iris_data()?;

// 2. Basic kNN
let mut knn = KNearestNeighbors::new(3);
knn.fit(&x_train, &y_train)?;
let predictions = knn.predict(&x_test)?;
println!("Accuracy: {:.1}%", compute_accuracy(&predictions, &y_test) * 100.0);

// 3. Hyperparameter tuning
for k in [1, 3, 5, 7, 9] {
    let mut knn = KNearestNeighbors::new(k);
    knn.fit(&x_train, &y_train)?;
    let acc = compute_accuracy(&knn.predict(&x_test)?, &y_test);
    println!("k={}: {:.1}%", k, acc * 100.0);
}

// 4. Best model with weighted voting
let mut knn_best = KNearestNeighbors::new(5)
    .with_weights(true);
knn_best.fit(&x_train, &y_train)?;

// 5. Probabilistic predictions
let probabilities = knn_best.predict_proba(&x_test)?;
for (i, &pred) in knn_best.predict(&x_test)?.iter().enumerate() {
    println!("Sample {}: class={}, confidence={:.1}%",
             i, pred, probabilities[i][pred] * 100.0);
}
```

## Further Exploration

**Try different k values**:
```rust,ignore
// Very small k (high variance)
let knn1 = KNearestNeighbors::new(1);  // Perfect training fit

// Very large k (high bias)
let knn15 = KNearestNeighbors::new(15); // 75% of training data
```

**Feature importance analysis**:
- Remove one feature at a time
- Measure impact on accuracy
- Identify most discriminative features (likely petal dimensions)
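One way to run this ablation is to build a reduced copy of the feature buffer with a single column removed, then refit on it. The sketch below assumes the data is a flat row-major `Vec<f32>`, as the `Matrix::from_vec` call in the dataset section suggests; `drop_column` is a hypothetical helper, not a library function.

```rust
/// Return a copy of a row-major buffer (`cols` values per row)
/// with the column at `col_to_drop` removed.
fn drop_column(data: &[f32], cols: usize, col_to_drop: usize) -> Vec<f32> {
    assert!(cols > 0 && col_to_drop < cols, "column out of range");
    assert!(data.len() % cols == 0, "buffer is not a whole number of rows");
    data.chunks(cols)
        .flat_map(|row| {
            // Keep every value whose column index differs from the dropped one.
            row.iter()
                .enumerate()
                .filter(move |(j, _)| *j != col_to_drop)
                .map(|(_, &v)| v)
        })
        .collect()
}
```

The result has `cols - 1` values per row, so a matrix rebuilt from it would use three columns instead of four.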

**Cross-validation**:
- Split data into 5 folds
- Average accuracy across folds
- More robust performance estimate than single train/test split
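The fold bookkeeping can be sketched without any library support. `k_fold_indices` is a hypothetical helper that only produces index lists; fitting and scoring a model on each split is left to the caller.

```rust
/// Split sample indices 0..n into `folds` (train, test) index pairs.
/// Folds are contiguous, so shuffle the samples beforehand if they are
/// ordered by class, as the Iris rows are.
fn k_fold_indices(n: usize, folds: usize) -> Vec<(Vec<usize>, Vec<usize>)> {
    assert!(folds >= 2 && folds <= n, "need 2 <= folds <= n");
    (0..folds)
        .map(|f| {
            // Fold f covers [start, end); integer division spreads any remainder.
            let start = f * n / folds;
            let end = (f + 1) * n / folds;
            let test: Vec<usize> = (start..end).collect();
            let train: Vec<usize> = (0..start).chain(end..n).collect();
            (train, test)
        })
        .collect()
}
```

For the 30-sample dataset and 5 folds, each split holds out 6 samples and trains on the remaining 24.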

**Standardization effect**:
- Compare with/without StandardScaler
- Iris features are already on a similar scale (all in cm)
- Expect minimal difference, but good practice
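If no scaler is at hand, z-score standardization is short enough to write directly. The sketch below operates in place on a flat row-major buffer and uses the population standard deviation; `standardize` is a hypothetical helper, and it is not a drop-in for the library's StandardScaler.

```rust
/// Z-score each column of a row-major buffer in place: (x - mean) / std.
/// Uses the population standard deviation (divides by the row count).
fn standardize(data: &mut [f32], cols: usize) {
    assert!(cols > 0 && data.len() % cols == 0, "bad shape");
    let rows = data.len() / cols;
    if rows == 0 {
        return;
    }
    for j in 0..cols {
        let mean = (0..rows).map(|i| data[i * cols + j]).sum::<f32>() / rows as f32;
        let var = (0..rows)
            .map(|i| (data[i * cols + j] - mean).powi(2))
            .sum::<f32>()
            / rows as f32;
        // Constant columns are centred at zero instead of dividing by zero.
        let scale = if var > 0.0 { var.sqrt() } else { 1.0 };
        for i in 0..rows {
            data[i * cols + j] = (data[i * cols + j] - mean) / scale;
        }
    }
}
```

In a real comparison the mean and standard deviation should be computed on the training set only and then applied to the test set, so that no test information leaks into the model.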

## Related Examples

- [`examples/iris_clustering.rs`](./iris-clustering.md) - K-Means on same dataset
- [`book/src/ml-fundamentals/knn.md`](../ml-fundamentals/knn.md) - Full kNN theory
- [`examples/logistic-regression.md`](./logistic-regression.md) - Parametric alternative