aprender 0.25.1

Next-generation machine learning library in pure Rust
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
# Aprender Roadmap

Next Generation Machine Learning in Pure Rust

## Version Status

| Version | Status | Description |
|---------|--------|-------------|
| v0.1.0  |**Released** | Foundation - Linear Regression + K-Means |
| v0.2.0  |**Released** | Decision Trees, Random Forests, Cross-Validation, Serialization |
| v0.3.0  |**Released** | Regularization & Optimization |
| v0.4.0  |**Released** | TOP 10 ML Algorithms Complete |
| v0.4.1  |**Released** | Graph Algorithms, Advanced Clustering, Anomaly Detection, Stats |
| v0.5.0  | Planned | Neural Networks |
| v1.0.0  | Planned | Production Hardening |

---

## v0.2.0: Tree Models & Cross-Validation (Current)

**Target**: Decision Trees, Random Forests, Model Selection, Model Persistence

### Completed

- [x] **Tree Module**: Complete decision tree implementation
  - DecisionTreeClassifier with GINI impurity
  - Configurable max_depth
  - Recursive tree building
  - Integration tests with Iris dataset
- [x] **Random Forest**: Bootstrap aggregating ensemble
  - RandomForestClassifier with configurable n_estimators
  - Bootstrap sampling with replacement
  - Majority voting for predictions
  - Reproducible with random_state
- [x] **Cross-Validation**: Model evaluation utilities
  - train_test_split() with reproducible random seeds
  - KFold cross-validator with optional shuffling
  - cross_validate() with statistics (mean, std, min, max)
  - CrossValidationResult struct
- [x] **Model Serialization**: Save/load models to disk
  - Serde + bincode integration
  - Works with all models (LinearRegression, KMeans, DecisionTree, RandomForest)
  - Example demonstrating persistence
- [x] **Examples**: New comprehensive examples
  - decision_tree_iris.rs - Decision tree classification
  - random_forest_iris.rs - Ensemble classification
  - cross_validation.rs - Model evaluation workflow
  - model_persistence.rs - Save/load demonstration
- [x] **Documentation**: EXTREME TDD Book
  - 90+ chapter structure on GitHub Pages
  - Complete case study: Cross-Validation
  - RED-GREEN-REFACTOR methodology
  - Live at https://paiml.github.io/aprender/
- [x] **Bug Fixes**
  - Clear error messages for underdetermined systems

### Quality Metrics Achieved

- TDG Score: 93.3/100 (A grade)
- Total Tests: 184 passing (+64 from v0.1.0)
- Property Tests: 22
- Doc Tests: 16
- Test Coverage: ~97%
- Max Cyclomatic Complexity: ≤10
- Zero clippy warnings
- Zero SATD comments

### Released ✅

- [x] Published to crates.io (2024-11-18)
- [x] GitHub Release created with release notes
- [x] EXTREME TDD Book deployed
- [x] CI/CD pipeline passing

---

## v0.1.0: Foundation

**Target**: Linear Regression + K-Means (2 algorithms, viable from day one)

### Completed

- [x] Project scaffolding (Cargo workspace, quality gates)
- [x] Core primitives: Vector, Matrix (with Cholesky solver)
- [x] DataFrame (named column container with `to_matrix()`)
- [x] Linear Regression (OLS via normal equations)
- [x] K-Means clustering (k-means++ initialization + Lloyd's algorithm)
- [x] Metrics: R², MSE, RMSE, MAE, inertia, silhouette_score
- [x] Estimator/UnsupervisedEstimator/Transformer traits
- [x] Property-based tests (19 proptest properties)
- [x] Unit tests (120 tests)
- [x] Doc tests (13 tests)
- [x] Examples: boston_housing, iris_clustering, dataframe_basics
- [x] Benchmarks: linear_regression, kmeans

### Quality Metrics Achieved

- TDG Score: 95.6/100 (A+ grade)
- Repository Score: 95.0/100 (A+)
- Test Coverage: 97.72%
- Mutation Score: 85.3%
- Max Cyclomatic Complexity: 5
- Zero clippy warnings
- Zero SATD comments
- Total Tests: 149 (127 unit + 22 property)

### Released ✅

- [x] Published to crates.io (2024-11-18)
- [x] GitHub Release created with artifacts
- [x] Complete rustdoc coverage
- [x] CI/CD pipeline operational

---

## v0.3.0: Regularization & Optimization

**Target**: Regularized Linear Models, Optimizers, Advanced Model Selection

### Completed

- [x] **Regularized Linear Models**
  - Ridge regression (L2 regularization)
  - Lasso regression (L1 via coordinate descent)
  - Elastic Net (L1 + L2 combination)
  - Full builder pattern API with max_iter, tolerance
- [x] **Optimizers**
  - SGD with optional momentum
  - Adam with adaptive learning rates
  - Trait-based design for extensibility
- [x] **Loss Functions**
  - MSE (Mean Squared Error)
  - MAE (Mean Absolute Error)
  - Huber loss (smooth combination)
  - Both functional and OOP APIs
- [x] **Advanced Model Selection**
  - Stratified K-Fold cross-validation
  - Grid search for hyperparameter tuning
  - Works with all regularized models
- [x] **Preprocessing**
  - StandardScaler (z-score normalization)
  - MinMaxScaler (range scaling)
  - Transformer trait integration
- [x] **Examples**: New comprehensive demonstrations
  - regularized_regression.rs - Ridge, Lasso, ElasticNet with grid search
  - optimizer_demo.rs - SGD and Adam optimization
- [x] **Chaos Engineering**
  - ChaosConfig from renacer integration
  - Property-based testing with proptest
  - Fuzz testing infrastructure with cargo-fuzz
  - 41 new tests (6 property + 14 integration + 1 fuzz target)
- [x] **Refactoring**
  - Reduced tree module complexity: 10 → 7 cyclomatic, 22 → 13 cognitive
  - Reduced grid search complexity: 9 → 4 cyclomatic, 23 → 6 cognitive
  - Extracted 6 helper functions for better maintainability

### Quality Metrics Achieved

- TDG Score: 95.6/100 (A+ grade)
- Total Tests: 498 passing (+314 from v0.2.0)
- Property Tests: 32 (+10)
- Doc Tests: 49 (+33)
- Integration Tests: 6
- Unit Tests: 387 (+203)
- Max Cyclomatic Complexity: 7 (down from 14)
- Max Cognitive Complexity: 13 (down from 23)
- Zero clippy warnings
- Zero SATD comments
- All quality gates passing

### Released ✅

- [x] All 10 features implemented and tested
- [x] Chaos engineering infrastructure integrated
- [x] Code complexity significantly reduced
- [x] 9 comprehensive examples running
- [x] All quality gates passing

---

## v0.4.0: TOP 10 ML Algorithms Complete

**Target**: Industry's most popular machine learning algorithms with comprehensive testing

### Completed

- [x] **Logistic Regression** (Issue #12)
  - Binary and multi-class classification
  - Gradient descent optimization with learning rate decay
  - L2 regularization support
  - predict_proba() for probability estimates
- [x] **K-Nearest Neighbors** (Issue #23)
  - Distance-based classification
  - Configurable k parameter
  - Brute-force and optimized distance computation
  - Works with any distance metric
- [x] **Support Vector Machine** (Issue #24)
  - Linear SVM with hinge loss
  - Subgradient descent optimizer
  - C regularization parameter
  - decision_function() for margin-based predictions
- [x] **Naive Bayes** (Issue #25)
  - GaussianNB with probabilistic classification
  - Variance smoothing parameter
  - predict_proba() returns class probabilities
  - Fast training O(n·d)
- [x] **Gradient Boosting Machine** (Issue #26)
  - Adaptive boosting with residual learning
  - Configurable n_estimators and learning_rate
  - Tree-based weak learners
  - Feature importance scores
- [x] **Decision Trees & Random Forests** (v0.2.0)
  - GINI impurity-based splitting
  - Bootstrap aggregating for Random Forest
  - Configurable max_depth and n_estimators
- [x] **Linear Regression** (v0.1.0)
  - OLS via normal equations
  - Ridge/Lasso/ElasticNet (v0.3.0)
- [x] **K-Means Clustering** (v0.1.0)
  - k-means++ initialization
  - Lloyd's algorithm
  - Configurable max_iter
- [x] **Principal Component Analysis** (Issue #13)
  - Eigendecomposition-based dimensionality reduction
  - Configurable n_components
  - explained_variance_ratio for feature analysis
  - transform() for new data projection
- [x] **Classification Metrics**
  - accuracy_score, precision_score, recall_score
  - f1_score, confusion_matrix
  - Multi-class support (macro/micro averaging)

### Quality Metrics Achieved

- Total Tests: 528 passing
- Zero clippy warnings
- Zero SATD violations
- All quality gates passing
- Comprehensive documentation with examples
- 10/10 TOP algorithms implemented ✅

### Released ✅

- [x] Published to crates.io as v0.4.0 (2024-11-19)
- [x] All TOP 10 algorithms tested and documented
- [x] Examples for each algorithm
- [x] Book chapters for theory + case studies

---

## v0.4.1: Graph Algorithms, Advanced Clustering & Statistics

**Target**: Expand beyond TOP 10 with graph theory, advanced clustering, anomaly detection, and statistical analysis

### Completed

- [x] **Graph Algorithms** (Issue #9)
  - Betweenness Centrality (shortest path counting)
  - PageRank (iterative power method)
  - Graph data structure with adjacency list
  - Weighted and unweighted edge support
- [x] **Community Detection** (Issue #22)
  - Louvain algorithm for modularity optimization
  - Modularity computation Q = (1/2m) Σ[A_ij - k_i*k_j/2m] δ(c_i, c_j)
  - Detects densely connected groups in networks
  - O(m·log n) complexity
- [x] **Advanced Clustering**
  - DBSCAN (Issue #14) - Density-based clustering
  - Hierarchical Clustering (Issue #15) - Agglomerative with linkage methods
  - Gaussian Mixture Models (Issue #16) - EM algorithm for soft clustering
  - Spectral Clustering (Issue #19) - Graph Laplacian eigendecomposition
- [x] **Anomaly Detection**
  - Isolation Forest (Issue #17) - Ensemble of isolation trees
  - Local Outlier Factor (Issue #20) - Density-based outlier detection
  - score_samples() and predict() methods
  - Contamination parameter for threshold setting
- [x] **Dimensionality Reduction**
  - t-SNE (Issue #18) - Non-linear visualization
  - Perplexity-based similarity computation
  - KL divergence minimization via gradient descent
  - 2D/3D embedding support
- [x] **Association Rule Mining**
  - Apriori Algorithm (Issue #21) - Frequent itemset mining
  - Support, confidence, and lift metrics
  - Market basket analysis support
  - Efficient pruning with apriori property
- [x] **Descriptive Statistics** (Issue #9)
  - Mean, median, mode, variance, std deviation
  - Quartiles (Q1, Q2, Q3), IQR
  - Histograms with multiple binning strategies
  - Five-number summary (min, Q1, median, Q3, max)

### Quality Metrics Achieved

- Total Tests: 683 passing (+155 from v0.4.0)
- Zero clippy warnings
- Zero critical SATD violations (1 low-priority Bayesian Blocks TODO)
- All quality gates passing
- Comprehensive EXTREME TDD book with case studies
- mdbook tests: 0 failures across 119 chapters

### Released ✅

- [x] Published to crates.io as v0.4.1 (2024-11-21)
- [x] All 6 advanced clustering algorithms implemented
- [x] Graph algorithms with social network examples
- [x] Complete anomaly detection suite
- [x] Association rule mining for market basket analysis
- [x] Comprehensive book chapters for all algorithms

---

## v0.5.0: Regression Trees & Advanced Ensemble Methods

- [ ] Decision tree regression (CART algorithm)
- [ ] Random Forest regression
- [ ] Out-of-bag error estimation for Random Forests
- [ ] Feature importance visualization
- [ ] XGBoost-style optimizations (histogram binning, approximate split finding)

---

## v0.6.0: Neural Networks

- [ ] Autodiff integration (feature-gated)
- [ ] Dense layers (fully connected)
- [ ] Activation functions (ReLU, Leaky ReLU, ELU, SELU)
- [ ] Optimizers: SGD, Adam, AdaGrad, RMSprop
- [ ] Batch normalization
- [ ] Dropout regularization
- [ ] Sequential model API

---

## v0.7.0: Advanced Statistics & Inference

- [ ] Generalized Linear Models (GLM)
- [ ] Statistical tests: t-test, chi-square, ANOVA, F-test
- [ ] Covariance/correlation matrices
- [ ] Independent Component Analysis (ICA)
- [ ] Factor Analysis
- [ ] Hypothesis testing framework

---

## v0.8.0: Showcase & QA Protocol (Completed)

**Target**: Unified Inference Architecture (GGUF/SafeTensors/APR) & Severe Testing Protocol

### Completed

- [x] **Unified Inference Architecture**
  - Multi-format support: GGUF, SafeTensors, APR
  - Hybrid backend: CPU + GPU (CUDA)
  - Rosetta ML Diagnostics for format conversion
  - `apr run`, `apr chat`, `apr serve` commands
- [x] **Severe Testing Protocol (PMAT-QA-PROTOCOL-001)**
  - Hang Detection: 60s timeout wrapper for all tests
  - Garbage Detection: Strict Level 5 verification (regex + patterns)
  - Zombie Mitigation: SIGINT resiliency for `apr serve`
  - RAII Model Fixtures: Automated setup/teardown
- [x] **QA Matrix**
  - 21-cell test matrix (Modality × Format × Trace)
  - Full traceability with `--trace` flag
  - Performance regression baselines
- [x] **Documentation**
  - Updated showcase spec (v1.7.0) with Dr. Popper's audit
  - Falsification suite (qa_falsify.rs)
  - Comprehensive QA Report

### Quality Metrics Achieved

- QA Pass Rate: 100% (21/21 matrix cells)
- Falsification Coverage: 5 attack vectors verified
- Zombie Processes: 0 (verified by SIGINT tests)
- Documentation: Epistemologically audited (Corroborated)

### Released ✅

- [x] QA Protocol fully operational
- [x] Showcase demo verified
- [x] Falsification suite integrated

---

## v0.8.1: Time Series (Planned)

- [ ] ARIMA models
- [ ] Exponential smoothing (Holt-Winters)
- [ ] Seasonal decomposition
- [ ] Forecasting metrics: MAPE, SMAPE

---

## v1.0.0: Production Hardening

- [ ] GPU benchmarks and optimization
- [ ] WASM examples (in-browser training)
- [ ] Model serialization versioning
- [ ] Complete EXTREME TDD Book content
- [ ] Performance whitepaper
- [ ] Production deployment examples

---

## Quality Targets

All releases must meet:

- **TDG Score**: A+ (95.0+/100)
- **Test Coverage**: 95%+ line coverage
- **Mutation Score**: 85%+ (cargo-mutants)
- **Complexity**: ≤10 cyclomatic per function
- **Documentation**: 100% rustdoc coverage

---

## Contributing

Contributions welcome! Please ensure:
- All tests pass: `cargo test --all`
- No clippy warnings: `cargo clippy --all-targets`
- Code is formatted: `cargo fmt`

Priorities:
1. Bug fixes and test coverage improvements
2. Documentation and examples
3. Performance optimizations
4. New algorithms (must include tests and benchmarks)