# Case Study: KNN Iris
This case study demonstrates K-Nearest Neighbors (kNN) classification on the Iris dataset, exploring the effects of k values, distance metrics, and voting strategies to achieve 90% test accuracy.
## Overview
We'll apply kNN to Iris flower data to:
- Classify three species (Setosa, Versicolor, Virginica)
- Explore the effect of k parameter (1, 3, 5, 7, 9)
- Compare distance metrics (Euclidean, Manhattan, Minkowski)
- Analyze weighted vs uniform voting
- Generate probabilistic predictions with confidence scores
## Running the Example
```bash
cargo run --example knn_iris
```
Expected output: Comprehensive kNN analysis including accuracy for different k values, distance metric comparison, voting strategy comparison, and probabilistic predictions with confidence scores.
## Dataset
### Iris Flower Measurements
```rust,ignore
// Features: [sepal_length, sepal_width, petal_length, petal_width]
// Classes: 0=Setosa, 1=Versicolor, 2=Virginica
// Training set: 20 samples (7 Setosa, 7 Versicolor, 6 Virginica)
let x_train = Matrix::from_vec(20, 4, vec![
// Setosa (small petals, large sepals)
5.1, 3.5, 1.4, 0.2,
4.9, 3.0, 1.4, 0.2,
...
// Versicolor (medium petals and sepals)
7.0, 3.2, 4.7, 1.4,
6.4, 3.2, 4.5, 1.5,
...
// Virginica (large petals and sepals)
6.3, 3.3, 6.0, 2.5,
5.8, 2.7, 5.1, 1.9,
...
])?;
// Test set: 10 samples (3 Setosa, 3 Versicolor, 4 Virginica)
```
**Dataset characteristics**:
- 20 training samples (67% of 30-sample dataset)
- 10 test samples (33% of dataset)
- 4 continuous features (all in centimeters)
- 3 well-separated species classes
- Balanced class distribution in training set
## Part 1: Basic kNN (k=3)
### Implementation
```rust,ignore
use aprender::classification::KNearestNeighbors;
use aprender::primitives::Matrix;
let mut knn = KNearestNeighbors::new(3);
knn.fit(&x_train, &y_train)?;
let predictions = knn.predict(&x_test)?;
let accuracy = compute_accuracy(&predictions, &y_test);
```
### Results
```text
Test Accuracy: 90.0%
```
**Analysis**:
- 9 out of 10 test samples correctly classified
- k=3 provides good balance between bias and variance
- Works well even without hyperparameter tuning
## Part 2: Effect of k Parameter
### Experiment
```rust,ignore
for k in [1, 3, 5, 7, 9] {
let mut knn = KNearestNeighbors::new(k);
knn.fit(&x_train, &y_train)?;
let predictions = knn.predict(&x_test)?;
let accuracy = compute_accuracy(&predictions, &y_test);
println!("k={}: Accuracy = {:.1}%", k, accuracy * 100.0);
}
```
### Results
```text
k=1: Accuracy = 90.0%
k=3: Accuracy = 90.0%
k=5: Accuracy = 80.0%
k=7: Accuracy = 80.0%
k=9: Accuracy = 80.0%
```
### Interpretation
**Small k (1-3)**:
- **90% accuracy**: Best performance on this dataset
- **k=1** memorizes training data perfectly (lazy learning)
- **k=3** balances local patterns with noise reduction
- **Risk**: Overfitting, sensitive to outliers
**Large k (5-9)**:
- **80% accuracy**: Performance degrades
- Decision boundaries become smoother
- More robust to noise but loses fine distinctions
- **k=9** uses 45% of training data for each prediction (9/20)
- **Risk**: Underfitting, class boundaries blur
**Optimal k**:
- For this dataset: **k=3** provides best test accuracy
- General rule: k ≈ √n = √20 ≈ 4.5 (close to optimal)
- Use cross-validation for systematic selection
## Part 3: Distance Metrics (k=5)
### Comparison
```rust,ignore
let mut knn_euclidean = KNearestNeighbors::new(5)
.with_metric(DistanceMetric::Euclidean);
let mut knn_manhattan = KNearestNeighbors::new(5)
.with_metric(DistanceMetric::Manhattan);
let mut knn_minkowski = KNearestNeighbors::new(5)
.with_metric(DistanceMetric::Minkowski(3.0));
```
### Results
```text
Euclidean distance: 80.0%
Manhattan distance: 80.0%
Minkowski (p=3): 80.0%
```
### Interpretation
**Identical performance** (80%) across all metrics for k=5.
**Why?**:
- Iris features (sepal/petal dimensions) are all continuous and similarly scaled
- All three metrics capture species differences effectively
- Ranking of neighbors is similar across metrics
**When metrics differ**:
- **Euclidean**: Best for continuous, normally distributed features
- **Manhattan**: Better for count data or when outliers present
- **Minkowski (p>2)**: Emphasizes dimensions with largest differences
**Recommendation**: Use Euclidean (default) for continuous features, Manhattan for robustness to outliers.
## Part 4: Weighted vs Uniform Voting
### Comparison
```rust,ignore
// Uniform voting: all neighbors count equally
let mut knn_uniform = KNearestNeighbors::new(5);
knn_uniform.fit(&x_train, &y_train)?;
// Weighted voting: closer neighbors count more
let mut knn_weighted = KNearestNeighbors::new(5).with_weights(true);
knn_weighted.fit(&x_train, &y_train)?;
```
### Results
```text
Uniform voting: 80.0%
Weighted voting: 90.0%
```
### Interpretation
**Weighted voting improves accuracy by 10%** (from 80% to 90%).
**Why weighted voting helps**:
- Gives more influence to closer (more similar) neighbors
- Reduces impact of distant outliers in k=5 neighborhood
- More intuitive: "very close neighbors matter more"
- Weight formula: w_i = 1 / distance_i
**Example scenario**:
```text
Neighbor distances for test sample:
Neighbor 1: d=0.2, class=Versicolor, weight=5.0
Neighbor 2: d=0.3, class=Versicolor, weight=3.3
Neighbor 3: d=0.5, class=Versicolor, weight=2.0
Neighbor 4: d=1.8, class=Setosa, weight=0.56
Neighbor 5: d=2.0, class=Setosa, weight=0.50
Uniform: 3 votes Versicolor, 2 votes Setosa → Versicolor (60%)
Weighted: 10.3 weighted votes Versicolor, 1.06 Setosa → Versicolor (91%)
```
**Recommendation**: Use weighted voting for k ≥ 5, uniform for k ≤ 3.
## Part 5: Probabilistic Predictions
### Implementation
```rust,ignore
let mut knn_proba = KNearestNeighbors::new(5).with_weights(true);
knn_proba.fit(&x_train, &y_train)?;
let probabilities = knn_proba.predict_proba(&x_test)?;
let predictions = knn_proba.predict(&x_test)?;
```
### Results
```text
Sample Predicted Setosa Versicolor Virginica
─────────────────────────────────────────────────────
0 Setosa 100.0% 0.0% 0.0%
1 Setosa 100.0% 0.0% 0.0%
2 Setosa 100.0% 0.0% 0.0%
3 Versicolor 30.4% 69.6% 0.0%
4 Versicolor 0.0% 100.0% 0.0%
```
### Interpretation
**Sample 0-2 (Setosa)**:
- **100% confidence**: All 5 nearest neighbors are Setosa
- Perfect separation from other species
- Small petals (1.4-1.5 cm) characteristic of Setosa
**Sample 3 (Versicolor)**:
- **69.6% confidence**: Some Setosa neighbors nearby
- **30.4% Setosa**: Near species boundary
- Medium features create some overlap
**Sample 4 (Versicolor)**:
- **100% confidence**: Clear Versicolor region
- All 5 neighbors are Versicolor
**Confidence interpretation**:
- 90-100%: High confidence, far from decision boundary
- 70-90%: Medium confidence, near boundary
- 50-70%: Low confidence, ambiguous region
- <50%: Prediction uncertain, manual review recommended
## Best Configuration
### Summary
```text
Best configuration found:
- k = 5 neighbors
- Distance metric: Euclidean
- Voting: Weighted by inverse distance
- Test accuracy: 90.0%
```
### Why This Works
1. **k=5**: Large enough to be robust, small enough to capture local patterns
2. **Euclidean**: Natural for continuous features
3. **Weighted voting**: Leverages proximity information effectively
4. **90% accuracy**: Excellent for 10-sample test set (1 misclassification)
### Comparison to Other Classifiers
| Classifier | Iris Accuracy | Training Time | Prediction Time |
|------------|--------------|---------------|-----------------|
| **kNN (k=5, weighted)** | **90%** | Instant | O(n) per sample |
| Logistic Regression | 90-95% | Fast | Very fast |
| Decision Tree | 85-95% | Medium | Fast |
| Random Forest | 95-100% | Slow | Medium |
kNN provides competitive accuracy with zero training time but slower predictions.
## Key Insights
### 1. Small k (1-3)
- Risk of **overfitting**
- Sensitive to noise and outliers
- Captures fine-grained decision boundaries
- Best when data is clean and well-separated
### 2. Large k (7-9)
- Risk of **underfitting**
- Class boundaries blur together
- More robust to noise
- Best when data is noisy or classes overlap
### 3. Weighted Voting
- Gives more influence to closer neighbors
- **Critical improvement**: 80% → 90% accuracy for k=5
- Especially beneficial for larger k values
- More intuitive than uniform voting
### 4. Distance Metric Selection
- **Euclidean**: Best for continuous features (default choice)
- **Manhattan**: More robust to outliers
- **Minkowski**: Tunable between Euclidean and Manhattan
- For Iris: All metrics perform similarly (well-behaved data)
## Performance Metrics
### Time Complexity
| Operation | Iris Dataset | General (n=20, p=4, k=5) |
|-----------|-------------|--------------------------|
| Training (fit) | 0.001 ms | O(1) - just stores data |
| Distance computation | 0.02 ms | O(n·p) per sample |
| Finding k-nearest | 0.01 ms | O(n log k) per sample |
| Voting | <0.001 ms | O(k·c) per sample |
| **Total prediction** | **~0.03 ms** | **O(n·p) per sample** |
**Bottleneck**: Distance computation dominates (67% of time).
### Memory Usage
**Training storage**:
- x_train: 20×4×4 = 320 bytes
- y_train: 20×8 = 160 bytes
- **Total**: ~480 bytes
**Per-sample prediction**:
- Distance array: 20×4 = 80 bytes
- Neighbor buffer: 5×12 = 60 bytes
- **Total**: ~140 bytes per sample
**Scalability**: kNN requires storing entire training set, making it memory-intensive for large datasets (n > 100,000).
## Full Code
```rust,ignore
use aprender::classification::{KNearestNeighbors, DistanceMetric};
use aprender::primitives::Matrix;
// 1. Load data
let (x_train, y_train, x_test, y_test) = load_iris_data()?;
// 2. Basic kNN
let mut knn = KNearestNeighbors::new(3);
knn.fit(&x_train, &y_train)?;
let predictions = knn.predict(&x_test)?;
println!("Accuracy: {:.1}%", compute_accuracy(&predictions, &y_test) * 100.0);
// 3. Hyperparameter tuning
for k in [1, 3, 5, 7, 9] {
let mut knn = KNearestNeighbors::new(k);
knn.fit(&x_train, &y_train)?;
let acc = compute_accuracy(&knn.predict(&x_test)?, &y_test);
println!("k={}: {:.1}%", k, acc * 100.0);
}
// 4. Best model with weighted voting
let mut knn_best = KNearestNeighbors::new(5)
.with_weights(true);
knn_best.fit(&x_train, &y_train)?;
// 5. Probabilistic predictions
let probabilities = knn_best.predict_proba(&x_test)?;
for (i, &pred) in knn_best.predict(&x_test)?.iter().enumerate() {
println!("Sample {}: class={}, confidence={:.1}%",
i, pred, probabilities[i][pred] * 100.0);
}
```
## Further Exploration
**Try different k values**:
```rust,ignore
// Very small k (high variance)
let knn1 = KNearestNeighbors::new(1); // Perfect training fit
// Very large k (high bias)
let knn15 = KNearestNeighbors::new(15); // 75% of training data
```
**Feature importance analysis**:
- Remove one feature at a time
- Measure impact on accuracy
- Identify most discriminative features (likely petal dimensions)
**Cross-validation**:
- Split data into 5 folds
- Average accuracy across folds
- More robust performance estimate than single train/test split
**Standardization effect**:
- Compare with/without StandardScaler
- Iris features are already similar scale (all in cm)
- Expect minimal difference, but good practice
## Related Examples
- [`examples/iris_clustering.rs`](./iris-clustering.md) - K-Means on same dataset
- [`book/src/ml-fundamentals/knn.md`](../ml-fundamentals/knn.md) - Full kNN theory
- [`examples/logistic-regression.md`](./logistic-regression.md) - Parametric alternative