aprender 0.31.2

Next-generation ML framework in pure Rust — `cargo install aprender` for the `apr` CLI
Documentation
<!-- PCU: examples-tsne-visualization | contract: contracts/apr-page-examples-tsne-visualization-v1.yaml -->
<!-- Example: cargo run -p aprender-core --example none -->
<!-- Status: enforced -->

# Case Study: t-SNE Implementation

This chapter documents the complete EXTREME TDD implementation of aprender's t-SNE algorithm for dimensionality reduction and visualization from Issue #18.

## Background

**GitHub Issue #18**: Implement t-SNE for Dimensionality Reduction and Visualization

**Requirements:**
- Non-linear dimensionality reduction (2D/3D)
- Perplexity-based similarity computation
- KL divergence minimization via gradient descent
- Parameters: n_components, perplexity, learning_rate, n_iter
- Reproducibility with random_state

**Initial State:**
- Tests: 640 passing
- Existing dimensionality reduction: PCA (linear)
- No non-linear dimensionality reduction

## Implementation Summary

### RED Phase

Created 12 comprehensive tests covering:
- Constructor and basic fitting
- Transform and fit_transform methods
- Perplexity parameter effects
- Learning rate and iteration count
- 2D and 3D embeddings
- Reproducibility with random_state
- Error handling (transform before fit)
- Local structure preservation
- Embedding finite values

### GREEN Phase

Implemented complete t-SNE algorithm (~400 lines):

**Core Components:**

1. **TSNE**: Public API with builder pattern
   - with_perplexity, with_learning_rate, with_n_iter, with_random_state
   - fit/transform/fit_transform methods

2. **Pairwise Distances** (`compute_pairwise_distances`):
   - Squared Euclidean distances in high-D
   - O(n²) computation

3. **Conditional Probabilities** (`compute_p_conditional`):
   - Binary search for sigma to match perplexity
   - Gaussian kernel: P(j|i) ∝ exp(-||x_i - x_j||² / (2σ_i²))
   - Target entropy: H = log₂(perplexity)

4. **Joint Probabilities** (`compute_p_joint`):
   - Symmetrize: P_{ij} = (P(j|i) + P(i|j)) / (2N)
   - Numerical stability with max(1e-12)

5. **Q Matrix** (`compute_q`):
   - Student's t-distribution in low-D
   - Q_{ij} ∝ (1 + ||y_i - y_j||²)^{-1}
   - Heavy-tailed distribution avoids crowding

6. **Gradient Computation** (`compute_gradient`):
   - ∇KL(P||Q) = 4Σ_j (p_ij - q_ij) · (y_i - y_j) / (1 + ||y_i - y_j||²)

7. **Optimization**:
   - Gradient descent with momentum (0.5 → 0.8)
   - Small random initialization (±0.00005)
   - Reproducible LCG random number generator

**Key Algorithm Steps:**

```text
1. Compute pairwise distances in high-D
2. Binary search for sigma to match perplexity
3. Compute conditional probabilities P(j|i)
4. Symmetrize to joint probabilities P_{ij}
5. Initialize embedding randomly (small values)
6. For each iteration:
   a. Compute Q matrix (Student's t in low-D)
   b. Compute gradient of KL divergence
   c. Update embedding with momentum
7. Return final embedding
```

### REFACTOR Phase

- Fixed legacy numeric constants (f32::INFINITY)
- Zero clippy warnings
- Exported TSNE in prelude
- Comprehensive documentation
- Example demonstrating all key features

**Final State:**
- Tests: 640 → 652 (+12)
- Zero warnings
- All quality gates passing

## Algorithm Details

**Time Complexity:** O(n² · iterations)
- Dominated by pairwise distance computation each iteration
- Typical: 1000 iterations × O(n²) = impractical for n > 10,000

**Space Complexity:** O(n²)
- Distance matrix, P matrix, Q matrix all n×n

**Binary Search for Perplexity:**
- Target: H(P_i) = log₂(perplexity)
- Search for beta = 1/(2σ²) in range [0, ∞)
- 50 iterations max for convergence
- Tolerance: |H - target| < 1e-5

**Momentum Optimization:**
- Initial momentum: 0.5 (first 250 iterations)
- Final momentum: 0.8 (after iteration 250)
- Helps escape local minima and speed convergence

## Parameters

- **n_components** (default: 2): Output dimensions (usually 2 or 3)

- **perplexity** (default: 30.0): Balance local/global structure
  - Low (5-10): Very local, reveals fine clusters
  - Medium (20-30): Balanced (recommended)
  - High (50+): More global structure
  - Rule of thumb: perplexity < n_samples / 3

- **learning_rate** (default: 200.0): Gradient descent step size
  - Too low: Slow convergence
  - Too high: Unstable/divergence
  - Typical range: 10-1000

- **n_iter** (default: 1000): Number of gradient descent iterations
  - Minimum: 250 for reasonable results
  - Recommended: 1000 for convergence
  - More iterations: Better but slower

- **random_state** (default: None): Random seed for reproducibility

## Example Highlights

The example demonstrates:
1. Basic 4D → 2D reduction
2. Perplexity effects (2.0 vs 5.0)
3. 3D embedding
4. Learning rate effects (50.0 vs 500.0)
5. Reproducibility with random_state
6. t-SNE vs PCA comparison

## Key Takeaways

1. **Non-Linear**: Captures manifolds that PCA cannot
2. **Local Preservation**: Excellent at preserving neighborhoods
3. **Visualization**: Best for 2D/3D plots
4. **Perplexity Critical**: Try multiple values (5, 10, 30, 50)
5. **Stochastic**: Different runs give different embeddings
6. **Slow**: O(n²) limits scalability
7. **No Transform**: Cannot embed new data points

## Comparison: t-SNE vs PCA

| Feature | t-SNE | PCA |
|---------|-------|-----|
| Type | Non-linear | Linear |
| Preserves | Local structure | Global variance |
| Speed | O(n²·iter) | O(n·d·k) |
| Transform New Data | No | Yes |
| Deterministic | No (stochastic) | Yes |
| Best For | Visualization | Preprocessing |

**When to use t-SNE:**
- Visualizing high-dimensional data
- Exploratory data analysis
- Finding hidden clusters
- Presentations (2D plots)

**When to use PCA:**
- Feature reduction before modeling
- Large datasets (n > 10,000)
- Need to transform new data
- Need deterministic results

## Use Cases

1. **MNIST Visualization**: Visualize 784D digit images in 2D
2. **Word Embeddings**: Explore word2vec/GloVe embeddings
3. **Single-Cell RNA-seq**: Cluster cell types
4. **Image Features**: Visualize CNN features
5. **Customer Segmentation**: Explore behavioral clusters

## Testing Strategy

**Unit Tests** (12 implemented):
- Correctness: Embeddings have correct shape
- Reproducibility: Same random_state → same result
- Parameters: Perplexity, learning rate, n_iter effects
- Edge cases: Transform before fit, finite values

**Property-Based Tests** (future work):
- Local structure: Nearby points in high-D → nearby in low-D
- Perplexity monotonicity: Higher perplexity → smoother embedding
- Convergence: More iterations → lower KL divergence

## Technical Challenges Solved

### Challenge 1: Perplexity Matching
**Problem**: Finding sigma to match target perplexity.
**Solution**: Binary search on beta = 1/(2σ²) with entropy target.

### Challenge 2: Numerical Stability
**Problem**: Very small probabilities cause log(0) errors.
**Solution**: Clamp probabilities to max(p, 1e-12).

### Challenge 3: Reproducibility
**Problem**: std::random is non-deterministic.
**Solution**: Custom LCG random generator with seed.

### Challenge 4: Large Embedding Values
**Problem**: Embeddings can have very large absolute values.
**Solution**: This is expected - t-SNE preserves relative distances, not absolute positions.

## Related Topics

- [PCA Implementation](./pca-iris.md)
- [K-Means Clustering](./kmeans-clustering.md)
- [Spectral Clustering](./spectral-clustering.md)
- [What is EXTREME TDD?](../methodology/what-is-extreme-tdd.md)

## References

1. van der Maaten, L., & Hinton, G. (2008). Visualizing Data using t-SNE. JMLR.
2. Wattenberg, et al. (2016). How to Use t-SNE Effectively. Distill.
3. Kobak, D., & Berens, P. (2019). The art of using t-SNE for single-cell transcriptomics. Nature Communications.