aprender 0.29.3

Next-generation ML framework in pure Rust. `cargo install aprender` installs the `apr` CLI.
# AutoML: Automated Machine Learning

Aprender's AutoML module provides type-safe hyperparameter optimization with multiple search strategies, including the state-of-the-art Tree-structured Parzen Estimator (TPE).

## Overview

AutoML automates the tedious process of hyperparameter tuning:

1. **Define search space** with type-safe parameter enums
2. **Choose strategy** (Random, Grid, or TPE)
3. **Run optimization** with callbacks for early stopping and time limits
4. **Get best configuration** automatically

## Key Features

- **Type Safety (Poka-Yoke)**: Parameter keys are enums, not strings, so typos are caught at compile time
- **Multiple Strategies**: RandomSearch, GridSearch, TPE
- **Callbacks**: TimeBudget, EarlyStopping, ProgressCallback
- **Extensible**: Custom parameter enums for any model family

## Quick Start

```rust
use aprender::automl::{AutoTuner, TPE, SearchSpace};
use aprender::automl::params::RandomForestParam as RF;

// Define type-safe search space
let space = SearchSpace::new()
    .add(RF::NEstimators, 10..500)
    .add(RF::MaxDepth, 2..20);

// Use TPE optimizer with early stopping
let result = AutoTuner::new(TPE::new(100))
    .time_limit_secs(60)
    .early_stopping(20)
    .maximize(&space, |trial| {
        let n = trial.get_usize(&RF::NEstimators).unwrap_or(100);
        let d = trial.get_usize(&RF::MaxDepth).unwrap_or(5);
        evaluate_model(n, d)  // Your objective function
    });

println!("Best: {:?}", result.best_trial);
```

## Type-Safe Parameter Enums

### The Problem with String Keys

Traditional AutoML libraries use string keys for parameters:

```python
# Optuna/scikit-optimize style (error-prone)
space = {
    "n_estimators": (10, 500),
    "max_detph": (2, 20),  # TYPO! Silent bug
}
```

### Aprender's Solution: Poka-Yoke

Aprender uses typed enums that catch typos at compile time:

```rust
use aprender::automl::params::RandomForestParam as RF;

let space = SearchSpace::new()
    .add(RF::NEstimators, 10..500)
    .add(RF::MaxDetph, 2..20);  // Compile error! Typo caught
//       ^^^^^^^^^^^^ Unknown variant
```

### Built-in Parameter Enums

```rust
// Random Forest
use aprender::automl::params::RandomForestParam;
// NEstimators, MaxDepth, MinSamplesLeaf, MaxFeatures, Bootstrap

// Gradient Boosting
use aprender::automl::params::GradientBoostingParam;
// NEstimators, LearningRate, MaxDepth, Subsample

// K-Nearest Neighbors
use aprender::automl::params::KNNParam;
// NNeighbors, Weights, P

// Linear Models
use aprender::automl::params::LinearParam;
// Alpha, L1Ratio, MaxIter, Tol

// Decision Trees
use aprender::automl::params::DecisionTreeParam;
// MaxDepth, MinSamplesLeaf, MinSamplesSplit

// K-Means
use aprender::automl::params::KMeansParam;
// NClusters, MaxIter, NInit
```

### Custom Parameter Enums

```rust
use aprender::automl::params::ParamKey;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum MyModelParam {
    LearningRate,
    HiddenLayers,
    Dropout,
}

impl ParamKey for MyModelParam {
    fn name(&self) -> &'static str {
        match self {
            Self::LearningRate => "learning_rate",
            Self::HiddenLayers => "hidden_layers",
            Self::Dropout => "dropout",
        }
    }
}

impl std::fmt::Display for MyModelParam {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "{}", self.name())
    }
}
```

## Search Space Definition

### Integer Parameters

```rust
let space = SearchSpace::new()
    .add(RF::NEstimators, 10..500)   // [10, 499]
    .add(RF::MaxDepth, 2..20);       // [2, 19]
```

### Continuous Parameters

```rust
let space = SearchSpace::new()
    .add_continuous(Param::LearningRate, 0.001, 0.1)
    .add_log_scale(Param::Alpha, LogScale { low: 1e-4, high: 1.0 });
```

### Categorical Parameters

```rust
let space = SearchSpace::new()
    .add_categorical(RF::MaxFeatures, ["sqrt", "log2", "0.5"])
    .add_bool(RF::Bootstrap, [true, false]);
```

## Search Strategies

### RandomSearch

Best for: Initial exploration, large search spaces

```rust
use aprender::automl::{RandomSearch, SearchStrategy};

let mut search = RandomSearch::new(100)  // 100 trials
    .with_seed(42);                       // Reproducible

let trials = search.suggest(&space, 10);  // Get 10 suggestions
```

**Why Random Search?**

Bergstra & Bengio (2012) showed that for many problems, random search finds configurations as good as or better than grid search while using only a small fraction of the trials, because only a few hyperparameters typically matter and random search explores more distinct values per dimension.
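
The mechanics can be made concrete with a tiny, self-contained sketch. The `Lcg` generator and `random_search` helper below are illustrative stand-ins, not aprender's `RandomSearch` API: a fixed seed makes the run reproducible, and each trial draws a fresh point from the full range.

```rust
/// Minimal seeded pseudo-random generator (Knuth's MMIX LCG constants),
/// so the sketch needs no external crates.
struct Lcg(u64);

impl Lcg {
    /// Next integer in [low, high), taken from the high bits of the state.
    fn next_in(&mut self, low: u64, high: u64) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        low + (self.0 >> 33) % (high - low)
    }
}

/// Try `n_trials` random points and keep the best (maximization).
fn random_search<F: Fn(u64) -> f64>(
    seed: u64,
    n_trials: usize,
    low: u64,
    high: u64,
    objective: F,
) -> (u64, f64) {
    let mut rng = Lcg(seed);
    let mut best = (low, f64::NEG_INFINITY);
    for _ in 0..n_trials {
        let x = rng.next_in(low, high);
        let score = objective(x);
        if score > best.1 {
            best = (x, score);
        }
    }
    best
}

fn main() {
    // Toy objective peaked at n_estimators = 200.
    let obj = |n: u64| -((n as f64 - 200.0).powi(2));
    let (best_n, best_score) = random_search(42, 100, 10, 500, obj);
    println!("best n_estimators: {best_n} (score {best_score})");
}
```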

### GridSearch

Best for: Small, discrete search spaces

```rust
use aprender::automl::GridSearch;

let mut search = GridSearch::new(5);  // 5 points per continuous param
let trials = search.suggest(&space, 100);
```

### TPE (Tree-structured Parzen Estimator)

Best for: >10 trials, expensive objective functions

```rust
use aprender::automl::TPE;

let mut tpe = TPE::new(100)
    .with_seed(42)
    .with_startup_trials(10)  // Random before model
    .with_gamma(0.25);        // Top 25% as "good"
```

**How TPE Works:**

1. **Split observations**: Separate into "good" (top γ fraction) and "bad" based on objective values
2. **Fit KDEs**: Build Kernel Density Estimators for good (l) and bad (g) distributions
3. **Sample candidates**: Generate multiple candidates
4. **Select by EI**: Choose the candidate maximizing l(x)/g(x), which is equivalent to maximizing Expected Improvement
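
The four steps above can be sketched in miniature. These are illustrative helpers only, not aprender's internals; a fixed-bandwidth Gaussian KDE stands in for the real estimator:

```rust
/// Step 1: sort by score (descending) and take the top gamma fraction as "good".
fn split_good_bad(observations: &[(f64, f64)], gamma: f64) -> (Vec<f64>, Vec<f64>) {
    // observations are (param_value, objective_score) pairs, maximization.
    let mut sorted = observations.to_vec();
    sorted.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    let n_good = ((observations.len() as f64) * gamma).ceil() as usize;
    let good = sorted[..n_good].iter().map(|o| o.0).collect();
    let bad = sorted[n_good..].iter().map(|o| o.0).collect();
    (good, bad)
}

/// Step 2: a fixed-bandwidth Gaussian kernel density estimate at x.
fn kde(points: &[f64], x: f64, bandwidth: f64) -> f64 {
    let norm = points.len() as f64 * bandwidth * (2.0 * std::f64::consts::PI).sqrt();
    points
        .iter()
        .map(|p| {
            let z = (x - p) / bandwidth;
            (-0.5 * z * z).exp()
        })
        .sum::<f64>()
        / norm
}

/// Steps 3-4: among sampled candidates, pick the one maximizing l(x)/g(x).
fn select_candidate(good: &[f64], bad: &[f64], candidates: &[f64]) -> f64 {
    *candidates
        .iter()
        .max_by(|a, b| {
            let ratio = |x: f64| kde(good, x, 1.0) / (kde(bad, x, 1.0) + 1e-12);
            ratio(**a).partial_cmp(&ratio(**b)).unwrap()
        })
        .unwrap()
}

fn main() {
    // Good scores cluster near param = 10, bad near param = 2-3.
    let obs = vec![
        (10.0, 0.9), (10.5, 0.85), (9.5, 0.8),
        (2.0, 0.1), (2.5, 0.2), (3.0, 0.15),
    ];
    let (good, bad) = split_good_bad(&obs, 0.25);
    let pick = select_candidate(&good, &bad, &[2.0, 6.0, 10.0]);
    println!("next suggestion: {pick}");
}
```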

**TPE Configuration:**

| Parameter | Default | Description |
|-----------|---------|-------------|
| `gamma` | 0.25 | Quantile for good/bad split |
| `n_candidates` | 24 | Candidates per iteration |
| `n_startup_trials` | 10 | Random trials before model |

## AutoTuner with Callbacks

### Basic Usage

```rust
use aprender::automl::{AutoTuner, TPE, SearchSpace};

let result = AutoTuner::new(TPE::new(100))
    .maximize(&space, |trial| {
        // Your objective function
        evaluate(trial)
    });

println!("Best score: {}", result.best_score);
println!("Best params: {:?}", result.best_trial);
```

### Time Budget

```rust
let result = AutoTuner::new(TPE::new(1000))
    .time_limit_secs(60)   // Stop after 60 seconds
    .maximize(&space, objective);
```

### Early Stopping

```rust
let result = AutoTuner::new(TPE::new(1000))
    .early_stopping(20)    // Stop if no improvement for 20 trials
    .maximize(&space, objective);
```
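
The rule can be sketched as a small counter. Assumed semantics: `early_stopping(n)` stops after n consecutive trials without a new best; `EarlyStop` below is a hypothetical illustration, not aprender's implementation:

```rust
/// Stop after `patience` consecutive trials without improving the best score.
struct EarlyStop {
    patience: usize,
    best: f64,
    stale: usize,
}

impl EarlyStop {
    fn new(patience: usize) -> Self {
        Self { patience, best: f64::NEG_INFINITY, stale: 0 }
    }

    /// Feed one trial score; returns true when the search should stop.
    fn update(&mut self, score: f64) -> bool {
        if score > self.best {
            self.best = score;
            self.stale = 0;
        } else {
            self.stale += 1;
        }
        self.stale >= self.patience
    }
}

fn main() {
    let mut es = EarlyStop::new(2);
    for (i, s) in [0.5, 0.6, 0.55, 0.58].iter().enumerate() {
        if es.update(*s) {
            println!("stopping after trial {}", i + 1);
        }
    }
}
```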

### Verbose Progress

```rust
let result = AutoTuner::new(TPE::new(100))
    .verbose()             // Print trial results
    .maximize(&space, objective);

// Output:
// Trial   1: score=0.8234 params={n_estimators=142, max_depth=7}
// Trial   2: score=0.8456 params={n_estimators=287, max_depth=12}
// ...
```

### Combined Callbacks

```rust
let result = AutoTuner::new(TPE::new(500))
    .time_limit_secs(300)    // 5 minute budget
    .early_stopping(30)      // Stop if stuck
    .verbose()               // Show progress
    .maximize(&space, objective);
```

### Custom Callbacks

```rust
use aprender::automl::{Callback, TrialResult};
use aprender::automl::params::ParamKey;

struct MyCallback {
    best_so_far: f64,
}

impl<P: ParamKey> Callback<P> for MyCallback {
    fn on_trial_end(&mut self, trial_num: usize, result: &TrialResult<P>) {
        if result.score > self.best_so_far {
            self.best_so_far = result.score;
            println!("New best at trial {}: {}", trial_num, result.score);
        }
    }

    fn should_stop(&self) -> bool {
        self.best_so_far > 0.99  // Stop if reached target
    }
}

let result = AutoTuner::new(TPE::new(100))
    .callback(MyCallback { best_so_far: 0.0 })
    .maximize(&space, objective);
```

## TuneResult Structure

```rust
pub struct TuneResult<P: ParamKey> {
    pub best_trial: Trial<P>,       // Best configuration
    pub best_score: f64,            // Best objective value
    pub history: Vec<TrialResult<P>>, // All trial results
    pub elapsed: Duration,          // Total time
    pub n_trials: usize,            // Trials completed
}
```

## Trial Accessors

```rust
let trial: Trial<RF> = /* ... */;

// Type-safe accessors
let n: Option<usize> = trial.get_usize(&RF::NEstimators);
let d: Option<i64> = trial.get_i64(&RF::MaxDepth);
let lr: Option<f64> = trial.get_f64(&Param::LearningRate);
let bootstrap: Option<bool> = trial.get_bool(&RF::Bootstrap);
```

## Real-World Example: aprender-shell

The `aprender-shell tune` command uses TPE to optimize n-gram size:

```rust
fn cmd_tune(history_path: Option<PathBuf>, trials: usize, ratio: f32) {
    use aprender::automl::{AutoTuner, SearchSpace, TPE};
    use aprender::automl::params::ParamKey;

    // Define custom parameter
    #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
    enum ShellParam { NGram }

    impl ParamKey for ShellParam {
        fn name(&self) -> &'static str { "ngram" }
    }

    let space: SearchSpace<ShellParam> = SearchSpace::new()
        .add(ShellParam::NGram, 2..6);  // n-gram sizes 2-5

    let tpe = TPE::new(trials)
        .with_seed(42)
        .with_startup_trials(2)
        .with_gamma(0.25);

    let result = AutoTuner::new(tpe)
        .early_stopping(4)
        .maximize(&space, |trial| {
            let ngram = trial.get_usize(&ShellParam::NGram).unwrap_or(3);

            // 3-fold cross-validation
            let mut scores = Vec::new();
            for fold in 0..3 {
                let score = validate_model(&commands, ngram, ratio, fold);
                scores.push(score);
            }
            scores.iter().sum::<f64>() / 3.0
        });

    println!("Best n-gram: {}", result.best_trial.get_usize(&ShellParam::NGram).unwrap());
    println!("Best score: {:.3}", result.best_score);
}
```

**Output:**

```
🎯 aprender-shell: AutoML Hyperparameter Tuning (TPE)

📂 History file: /home/user/.zsh_history
📊 Total commands: 21780
🔬 TPE trials: 8

══════════════════════════════════════════════════
 Trial │ N-gram │   Hit@5   │    MRR    │  Score
═══════╪════════╪═══════════╪═══════════╪═════════
    1  │    4   │   26.2%   │   0.182   │  0.282
    2  │    5   │   26.8%   │   0.186   │  0.257
    3  │    2   │   26.2%   │   0.181   │  0.280
══════════════════════════════════════════════════

🏆 Best Configuration (TPE):
   N-gram size: 4
   Score:       0.282
   Trials run:  5
   Time:        51.3s
```

## Synthetic Data Augmentation

Aprender's `synthetic` module enables automatic data augmentation with quality control and diversity monitoring, which is particularly powerful for low-resource domains like shell autocomplete.

### The Problem: Limited Training Data

Many ML tasks suffer from insufficient training data:
- Shell autocomplete: Limited user history
- Code translation: Sparse parallel corpora
- Domain-specific NLP: Rare terminology

### The Solution: Quality-Controlled Synthetic Data

```rust
use aprender::synthetic::{SyntheticConfig, DiversityMonitor, DiversityScore};

// Configure augmentation with quality controls
let config = SyntheticConfig::default()
    .with_augmentation_ratio(1.0)    // 100% more data
    .with_quality_threshold(0.7)     // 70% minimum quality
    .with_diversity_weight(0.3);     // Balance quality vs diversity

// Monitor for mode collapse
let mut monitor = DiversityMonitor::new(10)
    .with_collapse_threshold(0.1);
```

### SyntheticConfig Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `augmentation_ratio` | 0.5 | Synthetic/original ratio (1.0 = double data) |
| `quality_threshold` | 0.7 | Minimum score for acceptance [0.0, 1.0] |
| `diversity_weight` | 0.3 | Balance: 0=quality only, 1=diversity only |
| `max_attempts` | 10 | Retries per sample before giving up |
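
One plausible way these knobs combine, shown as a sketch. The weighted blend and the gating rule are assumed semantics, not aprender's exact scoring formula:

```rust
/// Blend quality and diversity: weight 0.0 = quality only, 1.0 = diversity only.
fn acceptance_score(quality: f64, diversity: f64, diversity_weight: f64) -> f64 {
    (1.0 - diversity_weight) * quality + diversity_weight * diversity
}

/// Accept a synthetic sample only if its blended score clears the threshold.
fn accept(quality: f64, diversity: f64, diversity_weight: f64, threshold: f64) -> bool {
    acceptance_score(quality, diversity, diversity_weight) >= threshold
}

fn main() {
    // With weight 0.3, a high-quality but low-diversity sample
    // (0.7 * 0.8 + 0.3 * 0.4 = 0.68) still fails a 0.7 gate.
    println!("{}", accept(0.8, 0.4, 0.3, 0.7));
}
```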

### Generation Strategies

```rust
use aprender::synthetic::GenerationStrategy;

// Available strategies
GenerationStrategy::Template       // Slot-filling templates
GenerationStrategy::EDA            // Easy Data Augmentation
GenerationStrategy::BackTranslation // Via intermediate representation
GenerationStrategy::MixUp          // Embedding interpolation
GenerationStrategy::GrammarBased   // Formal grammar rules
GenerationStrategy::SelfTraining   // Pseudo-labels
GenerationStrategy::WeakSupervision // Labeling functions (Snorkel)
```

### Real-World Example: aprender-shell augment

The `aprender-shell augment` command demonstrates synthetic data power:

```bash
aprender-shell augment -a 1.0 -q 0.6 --monitor-diversity
```

**Output:**

```
🧬 aprender-shell: Data Augmentation (with aprender synthetic)

📂 History file: /home/user/.zsh_history
📊 Real commands: 21789
⚙️  Augmentation ratio: 1.0x
⚙️  Quality threshold:  60.0%
🎯 Target synthetic:   21789 commands
🔒 Known n-grams: 39180

🧪 Generating synthetic commands... done!

📈 Coverage Report:
   Generated:          21789
   Quality filtered:   21430 (rejected 359)
   Known n-grams:      39180
   Total n-grams:      26616
   New n-grams added:  23329
   Coverage gain:      87.7%

📊 Diversity Metrics:
   Mean diversity:     1.000
   ✓  Diversity is healthy

📊 Model Statistics:
   Original commands:   21789
   Synthetic commands:  21430
   Total training:      43219
   Unique n-grams:      65764
   Vocabulary size:     37531
```

### Before vs After Comparison

```
═══════════════════════════════════════════════════════════════
                    📈 IMPROVEMENT SUMMARY
═══════════════════════════════════════════════════════════════

                      BASELINE    AUGMENTED    GAIN
───────────────────────────────────────────────────────────────
  Commands:           21,789      43,219       +98%
  Unique n-grams:     40,852      65,764       +61%
  Vocabulary size:    16,102      37,531       +133%
  Model size:         2,016 KB    3,017 KB     +50%
  Coverage gain:        --        87.7%         ✓
  Diversity:            --        1.000        Healthy
═══════════════════════════════════════════════════════════════
```

### New Capabilities from Synthetic Data

Commands the model never saw in history but now suggests:

```
kubectl suggestions (DevOps):
kubectl exec        0.050
kubectl config      0.050
kubectl delete      0.050

aws suggestions (Cloud):
aws ec2             0.096
aws lambda          0.076
aws iam             0.065

rustup suggestions (Rust):
rustup toolchain    0.107
rustup override     0.107
rustup doc          0.107
```

### DiversityMonitor: Detecting Mode Collapse

```rust
use aprender::synthetic::{DiversityMonitor, DiversityScore};

let mut monitor = DiversityMonitor::new(10)
    .with_collapse_threshold(0.1);

// Record diversity scores during generation
for sample in generated_samples {
    let score = DiversityScore::new(
        mean_distance,   // Pairwise distance
        min_distance,    // Closest pair
        coverage,        // Space coverage
    );
    monitor.record(score);
}

// Check for problems
if monitor.is_collapsing() {
    println!("⚠️  Mode collapse detected!");
}
if monitor.is_trending_down() {
    println!("⚠️  Diversity trending downward");
}

println!("Mean diversity: {:.3}", monitor.mean_diversity());
```
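
The collapse check can be sketched with a rolling window. Assumed semantics; `Monitor` below is a hypothetical stand-in for `DiversityMonitor`, not its actual implementation:

```rust
/// Keep the last `window` diversity scores and flag collapse when
/// their average falls below the threshold.
struct Monitor {
    window: usize,
    threshold: f64,
    scores: Vec<f64>,
}

impl Monitor {
    fn record(&mut self, score: f64) {
        self.scores.push(score);
        if self.scores.len() > self.window {
            self.scores.remove(0); // drop the oldest score
        }
    }

    fn mean(&self) -> f64 {
        if self.scores.is_empty() {
            return 0.0;
        }
        self.scores.iter().sum::<f64>() / self.scores.len() as f64
    }

    fn is_collapsing(&self) -> bool {
        !self.scores.is_empty() && self.mean() < self.threshold
    }
}

fn main() {
    let mut monitor = Monitor { window: 3, threshold: 0.1, scores: Vec::new() };
    for s in [0.9, 0.05, 0.02, 0.01] {
        monitor.record(s);
    }
    println!("collapsing: {}", monitor.is_collapsing());
}
```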

### QualityDegradationDetector

Monitors whether synthetic data is helping or hurting:

```rust
use aprender::synthetic::QualityDegradationDetector;

// Baseline: score without synthetic data
let mut detector = QualityDegradationDetector::new(0.85, 10)
    .with_min_improvement(0.02);

// Record scores from training with synthetic data
detector.record(0.87);  // Better!
detector.record(0.86);
detector.record(0.82);  // Getting worse...

if detector.should_disable_synthetic() {
    println!("Synthetic data is hurting performance");
}

let summary = detector.summary();
println!("Improvement: {:.1}%", summary.improvement * 100.0);
```

### Type-Safe Synthetic Parameters

```rust
use aprender::synthetic::SyntheticParam;
use aprender::automl::SearchSpace;

// Add synthetic params to AutoML search space
let space = SearchSpace::new()
    // Model hyperparameters
    .add(ModelParam::HiddenSize, 64..512)
    // Synthetic data hyperparameters (jointly optimized!)
    .add(SyntheticParam::AugmentationRatio, 0.0..2.0)
    .add(SyntheticParam::QualityThreshold, 0.5..0.95);
```

### Key Benefits

1. **Quality Filtering**: Rejected 359 low-quality commands (1.6%)
2. **Diversity Monitoring**: Confirmed no mode collapse
3. **Coverage Gain**: 87.7% of synthetic data introduced new n-grams
4. **Vocabulary Expansion**: +133% vocabulary size
5. **Joint Optimization**: Augmentation params tuned alongside model

## Best Practices

### 1. Start with Random Search

```rust
// Quick exploration
let result = AutoTuner::new(RandomSearch::new(20))
    .maximize(&space, objective);

// Then refine with TPE
let result = AutoTuner::new(TPE::new(100))
    .maximize(&refined_space, objective);
```

### 2. Use Log Scale for Learning Rates

```rust
let space = SearchSpace::new()
    .add_log_scale(Param::LearningRate, LogScale { low: 1e-5, high: 1e-1 });
```
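
The reason log scale helps: a uniform draw over [1e-5, 1e-1] would land almost every trial above 1e-2, while interpolating in log space gives each decade equal probability mass. A minimal sketch of the mapping:

```rust
/// Map a uniform draw u in [0, 1] to a log-uniform sample in [low, high].
fn log_uniform(low: f64, high: f64, u: f64) -> f64 {
    (low.ln() + u * (high.ln() - low.ln())).exp()
}

fn main() {
    // The midpoint u = 0.5 lands on the geometric mean (~1e-3 here),
    // not the arithmetic one (~0.05).
    println!("{}", log_uniform(1e-5, 1e-1, 0.5));
}
```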

### 3. Set Reasonable Time Budgets

```rust
// For expensive evaluations
let result = AutoTuner::new(TPE::new(1000))
    .time_limit_mins(30)
    .maximize(&space, expensive_objective);
```

### 4. Combine Early Stopping with Time Budget

```rust
let result = AutoTuner::new(TPE::new(500))
    .time_limit_secs(600)   // Max 10 minutes
    .early_stopping(50)     // Stop if stuck for 50 trials
    .maximize(&space, objective);
```

## Algorithm Comparison

| Strategy | Best For | Sample Efficiency | Overhead |
|----------|----------|-------------------|----------|
| RandomSearch | Large spaces, quick exploration | Low | Minimal |
| GridSearch | Small, discrete spaces | Medium | Minimal |
| TPE | Expensive objectives, >10 trials | High | Low |

## References

1. Bergstra, J., Bardenet, R., Bengio, Y., & Kégl, B. (2011). **Algorithms for Hyper-Parameter Optimization.** NeurIPS.

2. Bergstra, J., & Bengio, Y. (2012). **Random Search for Hyper-Parameter Optimization.** JMLR, 13, 281-305.

## Running the Example

```bash
cargo run --example automl_clustering
```

**Sample Output:**

```
AutoML Clustering - TPE Optimization
=====================================

Generated 100 samples with 4 true clusters

Search Space: K ∈ [2, 10]
Objective: Maximize silhouette score

═══════════════════════════════════════════
 Trial │   K   │ Silhouette │   Status
═══════╪═══════╪════════════╪════════════
    1  │    9  │    0.460   │ moderate
    2  │    6  │    0.599   │ good
    3  │    5  │    0.707   │ good
    ...
═══════════════════════════════════════════

🏆 TPE Optimization Results:
   Best K:          5
   Best silhouette: 0.7072
   True K:          4
   Trials run:      8

📈 Interpretation:
   ✓ TPE found a close approximation (within ±1)
   ✅ Excellent cluster separation (silhouette > 0.5)
```

## Related Topics

- [Case Study: AutoML Clustering](../examples/automl-clustering.md) - Full example
- [Grid Search Hyperparameter Tuning](../examples/grid-search-tuning.md) - Manual grid search
- [Cross-Validation](./cross-validation.md) - CV fundamentals
- [Random Forest](../examples/random-forest.md) - Model to tune