aprender 0.29.3

Next-generation ML framework in pure Rust — `cargo install aprender` for the `apr` CLI
Documentation
# Case Study: DBSCAN Clustering Implementation

This chapter documents the complete EXTREME TDD implementation of aprender's DBSCAN clustering algorithm. This is a real-world example showing every phase of the RED-GREEN-REFACTOR cycle from Issue #14.

## Background

**GitHub Issue #14**: Implement DBSCAN clustering algorithm

**Requirements:**
- Density-based clustering without requiring k specification
- Automatic outlier detection (noise points labeled as -1)
- `eps` parameter for neighborhood radius
- `min_samples` parameter for core point density threshold
- Integration with `UnsupervisedEstimator` trait
- Deterministic clustering results
- Comprehensive example demonstrating parameter effects

**Initial State:**
- Tests: 548 passing
- Existing clustering: K-Means only
- No density-based clustering support

## CYCLE 1: Core DBSCAN Algorithm

### RED Phase

Created 12 comprehensive tests in `src/cluster/mod.rs`:

```rust,ignore
#[test]
fn test_dbscan_new() {
    let dbscan = DBSCAN::new(0.5, 3);
    assert_eq!(dbscan.eps(), 0.5);
    assert_eq!(dbscan.min_samples(), 3);
    assert!(!dbscan.is_fitted());
}

#[test]
fn test_dbscan_fit_basic() {
    let data = Matrix::from_vec(
        6,
        2,
        vec![
            1.0, 1.0, 1.1, 1.0, 1.0, 1.1,
            5.0, 5.0, 5.1, 5.0, 5.0, 5.1,
        ],
    )
    .unwrap();

    let mut dbscan = DBSCAN::new(0.5, 2);
    dbscan.fit(&data).unwrap();
    assert!(dbscan.is_fitted());
}
```

Additional tests covered:
- Cluster prediction consistency
- Noise detection for outliers
- Single cluster scenarios
- All-noise scenarios
- Parameter sensitivity (eps and min_samples)
- Reproducibility
- Error handling (predict before fit)

**Verification:**
```bash
$ cargo test dbscan
error[E0422]: cannot find struct `DBSCAN` in this scope
```

**Result:** 12 tests failing ✅ (expected - DBSCAN doesn't exist)

### GREEN Phase

Implemented minimal DBSCAN algorithm in `src/cluster/mod.rs`:

```rust,ignore
/// DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct DBSCAN {
    eps: f32,
    min_samples: usize,
    labels: Option<Vec<i32>>,
}

impl DBSCAN {
    pub fn new(eps: f32, min_samples: usize) -> Self {
        Self {
            eps,
            min_samples,
            labels: None,
        }
    }

    fn region_query(&self, x: &Matrix<f32>, i: usize) -> Vec<usize> {
        let mut neighbors = Vec::new();
        let n_samples = x.shape().0;
        for j in 0..n_samples {
            let dist = self.euclidean_distance(x, i, j);
            if dist <= self.eps {
                neighbors.push(j);
            }
        }
        neighbors
    }

    fn euclidean_distance(&self, x: &Matrix<f32>, i: usize, j: usize) -> f32 {
        let n_features = x.shape().1;
        let mut sum = 0.0;
        for k in 0..n_features {
            let diff = x.get(i, k) - x.get(j, k);
            sum += diff * diff;
        }
        sum.sqrt()
    }

    fn expand_cluster(
        &self,
        x: &Matrix<f32>,
        labels: &mut [i32],
        point: usize,
        neighbors: &mut Vec<usize>,
        cluster_id: i32,
    ) {
        labels[point] = cluster_id;
        let mut i = 0;
        while i < neighbors.len() {
            let neighbor = neighbors[i];
            if labels[neighbor] == -2 {
                labels[neighbor] = cluster_id;
                let neighbor_neighbors = self.region_query(x, neighbor);
                if neighbor_neighbors.len() >= self.min_samples {
                    for &nn in &neighbor_neighbors {
                        if !neighbors.contains(&nn) {
                            neighbors.push(nn);
                        }
                    }
                }
            } else if labels[neighbor] == -1 {
                labels[neighbor] = cluster_id;
            }
            i += 1;
        }
    }
}

impl UnsupervisedEstimator for DBSCAN {
    type Labels = Vec<i32>;

    fn fit(&mut self, x: &Matrix<f32>) -> Result<()> {
        let n_samples = x.shape().0;
        let mut labels = vec![-2; n_samples]; // -2 = unlabeled
        let mut cluster_id = 0;

        for i in 0..n_samples {
            if labels[i] != -2 {
                continue;
            }
            let mut neighbors = self.region_query(x, i);
            if neighbors.len() < self.min_samples {
                labels[i] = -1;
                continue;
            }
            self.expand_cluster(x, &mut labels, i, &mut neighbors, cluster_id);
            cluster_id += 1;
        }
        self.labels = Some(labels);
        Ok(())
    }

    fn predict(&self, _x: &Matrix<f32>) -> Self::Labels {
        self.labels().clone()
    }
}
```

**Verification:**
```bash
$ cargo test
test result: ok. 560 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
```

**Result:** All 560 tests passing ✅ (12 new DBSCAN tests)

### REFACTOR Phase

**Code Quality:**
- Fixed clippy warnings (unused variables, map_clone, manual_contains)
- Added comprehensive documentation
- Exported DBSCAN in prelude for easy access
- Added public getter methods (eps, min_samples, is_fitted, labels)

**Verification:**
```bash
$ cargo clippy --all-targets -- -D warnings
Finished `dev` profile [unoptimized + debuginfo] target(s) in 1.53s
```

**Result:** Zero clippy warnings ✅

## Example Implementation

Created comprehensive example `examples/dbscan_clustering.rs` demonstrating:

1. **Standard DBSCAN clustering** - Basic usage with 2 clusters and noise
2. **Effect of eps parameter** - Shows how neighborhood radius affects clustering
3. **Effect of min_samples parameter** - Demonstrates density threshold impact
4. **Comparison with K-Means** - Highlights DBSCAN's outlier detection advantage
5. **Anomaly detection use case** - Practical application for identifying outliers

**Key differences from K-Means:**
- K-Means: requires specifying k, assigns all points to clusters
- DBSCAN: discovers k automatically, identifies outliers as noise

**Run the example:**
```bash
cargo run --example dbscan_clustering
```

## Algorithm Details

**Time Complexity:** O(n²) for naive distance computations
**Space Complexity:** O(n) for storing labels

**Core Concepts:**
- **Core points:** Points with ≥ min_samples neighbors within eps
- **Border points:** Non-core points within eps of a core point
- **Noise points:** Points neither core nor border (labeled -1)
- **Cluster expansion:** Recursive growth from core points to reachable neighbors

## Final State

**Tests:** 560 passing (548 → 560, +12 DBSCAN tests)
**Coverage:** All DBSCAN functionality comprehensively tested
**Quality:** Zero clippy warnings, full documentation
**Exports:** Available via `use aprender::prelude::*;`

## Key Takeaways

1. **EXTREME TDD works:** Tests written first caught edge cases early
2. **Algorithm correctness:** Comprehensive tests verify all scenarios
3. **Quality gates:** Clippy and formatting ensure consistent code style
4. **Documentation:** Example demonstrates practical usage and parameter tuning

## Related Topics

- [K-Means Clustering]./kmeans-clustering.md
- `UnsupervisedEstimator` trait in `aprender::traits`
- [What is EXTREME TDD?]../methodology/what-is-extreme-tdd.md