aprender 0.30.0

Next-generation ML framework in pure Rust — `cargo install aprender` for the `apr` CLI
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
<!-- PCU: examples-batuta-integration | contract: contracts/apr-page-examples-batuta-integration-v1.yaml -->
<!-- Example: cargo run -p aprender-core --example none -->
<!-- Status: enforced -->

# Case Study: Batuta - Automated Migration to Aprender

Using Batuta to automatically convert Python ML projects to Aprender/Rust.

## Overview

**Batuta** (Spanish for "conductor's baton") is an orchestration framework that converts Python ML projects to high-performance Rust using Aprender. It automates the migration of scikit-learn codebases to Aprender equivalents with:

- Automatic API mapping (sklearn → Aprender)
- NumPy → Trueno tensor conversion
- Mixture-of-Experts (MoE) backend routing
- Semantic-preserving transformation

```
┌─────────────────────────────────────────────────────────────────┐
│                     BATUTA MIGRATION FLOW                       │
│                                                                 │
│   Python Project                    Rust Project                │
│   ──────────────                    ────────────                │
│   sklearn.linear_model    ═══►     aprender::linear_model      │
│   sklearn.cluster         ═══►     aprender::cluster           │
│   sklearn.ensemble        ═══►     aprender::ensemble          │
│   sklearn.preprocessing   ═══►     aprender::preprocessing     │
│   numpy operations        ═══►     trueno primitives           │
│                                                                 │
│   Result: 2-10× performance improvement with memory safety      │
└─────────────────────────────────────────────────────────────────┘
```

## The 5-Phase Workflow

Batuta follows a Toyota Way-inspired Kanban workflow:

```
┌──────────┐   ┌──────────────┐   ┌──────────────┐   ┌────────────┐   ┌────────────┐
│ Analysis │──►│ Transpilation│──►│ Optimization │──►│ Validation │──►│ Deployment │
└──────────┘   └──────────────┘   └──────────────┘   └────────────┘   └────────────┘
     │                │                   │                │               │
     ▼                ▼                   ▼                ▼               ▼
   PMAT          Depyler           MoE Backend        Renacer          Reports
  TDG Score     Type Inference      Routing          Tracing         Migration
```

### Phase 1: Analysis

```bash
$ batuta analyze ./my-sklearn-project

Primary language: Python
Total files: 127
Total lines: 8,432

Dependencies:
  • pip (42 packages) in requirements.txt
  • ML frameworks detected:
    - scikit-learn 1.3.0 → Aprender mapping available
    - numpy 1.24.0 → Trueno mapping available
    - pandas 2.0.0 → DataFrame support

Quality:
  • TDG Score: 73.2/100 (B)
  • Test coverage: 68%

Recommended transpiler: Depyler (Python → Rust)
Estimated migration complexity: Medium
```

### Phase 2: Transpilation

```bash
$ batuta transpile --output ./rust-project
```

### Phase 3: Optimization

```bash
$ batuta optimize --enable-simd --enable-gpu
```

### Phase 4: Validation

```bash
$ batuta validate --trace-syscalls --benchmark
```

### Phase 5: Deployment

```bash
$ batuta build --release
$ batuta report --format markdown --output MIGRATION.md
```

## scikit-learn to Aprender Mapping

Batuta provides complete mappings for sklearn algorithms:

### Linear Models

| scikit-learn | Aprender | Complexity |
|--------------|----------|------------|
| `LinearRegression` | `aprender::linear_model::LinearRegression` | Medium |
| `LogisticRegression` | `aprender::linear_model::LogisticRegression` | Medium |
| `Ridge` | `aprender::linear_model::Ridge` | Medium |
| `Lasso` | `aprender::linear_model::Lasso` | Medium |

### Tree-Based Models

| scikit-learn | Aprender | Complexity |
|--------------|----------|------------|
| `DecisionTreeClassifier` | `aprender::tree::DecisionTreeClassifier` | High |
| `RandomForestClassifier` | `aprender::ensemble::RandomForestClassifier` | High |
| `GradientBoostingClassifier` | `aprender::ensemble::GradientBoosting` | High |

### Clustering

| scikit-learn | Aprender | Complexity |
|--------------|----------|------------|
| `KMeans` | `aprender::cluster::KMeans` | Medium |
| `DBSCAN` | `aprender::cluster::DBSCAN` | High |

### Preprocessing

| scikit-learn | Aprender | Complexity |
|--------------|----------|------------|
| `StandardScaler` | `aprender::preprocessing::StandardScaler` | Low |
| `MinMaxScaler` | `aprender::preprocessing::MinMaxScaler` | Low |
| `LabelEncoder` | `aprender::preprocessing::LabelEncoder` | Low |

### Model Selection

| scikit-learn | Aprender | Notes |
|--------------|----------|-------|
| `train_test_split` | `aprender::model_selection::train_test_split` | Same API |
| `cross_val_score` | `aprender::model_selection::cross_validate` | Same API |
| `GridSearchCV` | `aprender::model_selection::GridSearchCV` | Parallel by default |

## Conversion Examples

### Example 1: Basic ML Pipeline

**Python (Original):**

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load data
data = load_iris()
X, y = data.data, data.target

# Preprocess
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# Train
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.4f}")
```

**Rust (Batuta Output):**

```rust
use aprender::datasets::load_iris;
use aprender::preprocessing::StandardScaler;
use aprender::model_selection::train_test_split;
use aprender::ensemble::RandomForestClassifier;
use aprender::metrics::accuracy_score;
use aprender::{Estimator, Transformer};

fn main() -> anyhow::Result<()> {
    // Load data
    let data = load_iris()?;
    let (X, y) = (&data.features, &data.targets);

    // Preprocess
    let mut scaler = StandardScaler::new();
    let X_scaled = scaler.fit_transform(X)?;

    // Split (80/20, seed=42)
    let (X_train, X_test, y_train, y_test) = train_test_split(
        &X_scaled, y, 0.2, Some(42)
    )?;

    // Train
    let mut model = RandomForestClassifier::new()
        .with_n_estimators(100)
        .with_seed(42);
    model.fit(&X_train, &y_train)?;

    // Evaluate
    let predictions = model.predict(&X_test)?;
    let accuracy = accuracy_score(&y_test, &predictions)?;
    println!("Accuracy: {:.4}", accuracy);

    Ok(())
}
```

### Example 2: Linear Regression with Cross-Validation

**Python (Original):**

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
y = np.array([1.5, 3.5, 5.5, 7.5, 9.5])

model = LinearRegression()
scores = cross_val_score(model, X, y, cv=3, scoring='r2')
print(f"R² scores: {scores}")
print(f"Mean R²: {scores.mean():.4f}")
```

**Rust (Batuta Output):**

```rust
use aprender::linear_model::LinearRegression;
use aprender::model_selection::cross_validate;
use aprender::Estimator;
use aprender::compute::Matrix;

fn main() -> anyhow::Result<()> {
    let X = Matrix::from_slice(&[
        [1.0, 2.0],
        [3.0, 4.0],
        [5.0, 6.0],
        [7.0, 8.0],
        [9.0, 10.0],
    ]);
    let y = vec![1.5, 3.5, 5.5, 7.5, 9.5];

    let model = LinearRegression::new();
    let scores = cross_validate(&model, &X, &y, 3)?;

    println!("R² scores: {:?}", scores.test_scores);
    println!("Mean R²: {:.4}", scores.mean_test_score());

    Ok(())
}
```

### Example 3: Clustering with KMeans

**Python (Original):**

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

X = np.random.randn(1000, 5)

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)

score = silhouette_score(X, labels)
print(f"Silhouette score: {score:.4f}")
print(f"Inertia: {kmeans.inertia_:.2f}")
```

**Rust (Batuta Output):**

```rust
use aprender::cluster::KMeans;
use aprender::metrics::silhouette_score;
use aprender::UnsupervisedEstimator;
use aprender::compute::Matrix;

fn main() -> anyhow::Result<()> {
    // Generate random data (using trueno's random)
    let X = Matrix::random(1000, 5);

    let mut kmeans = KMeans::new(3)
        .with_seed(42)
        .with_n_init(10);
    let labels = kmeans.fit_predict(&X)?;

    let score = silhouette_score(&X, &labels)?;
    println!("Silhouette score: {:.4}", score);
    println!("Inertia: {:.2}", kmeans.inertia());

    Ok(())
}
```

## NumPy to Trueno Mapping

Batuta converts NumPy operations to Trueno equivalents:

| NumPy | Trueno | Notes |
|-------|--------|-------|
| `np.array([...])` | `Vector::from_slice(&[...])` | Direct mapping |
| `np.zeros((m, n))` | `Matrix::zeros(m, n)` | Same semantics |
| `np.ones((m, n))` | `Matrix::ones(m, n)` | Same semantics |
| `np.dot(a, b)` | `a.dot(&b)` | SIMD-accelerated |
| `a @ b` | `a.matmul(&b)` | MoE backend selection |
| `np.sum(a)` | `a.sum()` | Reduction operation |
| `np.mean(a)` | `a.mean()` | Statistical operation |
| `np.max(a)` | `a.max()` | Reduction operation |
| `np.min(a)` | `a.min()` | Reduction operation |
| `a.T` | `a.transpose()` | View-based (zero-copy) |
| `a.reshape(m, n)` | `a.reshape(m, n)` | Same API |

### Example: Matrix Operations

**Python:**

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Matrix multiply
C = A @ B

# Element-wise operations
D = A + B
E = A * B

# Reductions
total = np.sum(A)
mean = np.mean(A)
```

**Rust (via Batuta):**

```rust
use aprender::compute::{Matrix, Vector};

fn main() {
    let A = Matrix::from_slice(&[
        [1.0, 2.0],
        [3.0, 4.0],
    ]);
    let B = Matrix::from_slice(&[
        [5.0, 6.0],
        [7.0, 8.0],
    ]);

    // Matrix multiply (MoE selects SIMD for small matrices)
    let C = A.matmul(&B);

    // Element-wise operations (SIMD-accelerated)
    let D = &A + &B;
    let E = &A * &B;

    // Reductions
    let total = A.sum();
    let mean = A.mean();
}
```

## Mixture-of-Experts Backend Routing

Batuta automatically selects optimal backends based on operation complexity and data size:

```
┌─────────────────────────────────────────────────────────────────┐
│                    MoE BACKEND SELECTION                        │
│                                                                 │
│   Operation Type          Data Size      Backend Selected       │
│   ──────────────          ─────────      ────────────────       │
│   Element-wise (Low)      < 1M           Scalar/SIMD            │
│   Element-wise (Low)      ≥ 1M           SIMD                   │
│                                                                 │
│   Reductions (Medium)     < 10K          Scalar                 │
│   Reductions (Medium)     10K - 100K     SIMD                   │
│   Reductions (Medium)     ≥ 100K         GPU                    │
│                                                                 │
│   MatMul (High)           < 1K           Scalar                 │
│   MatMul (High)           1K - 10K       SIMD                   │
│   MatMul (High)           ≥ 10K          GPU                    │
└─────────────────────────────────────────────────────────────────┘
```

Based on the **5× PCIe dispatch rule** (Gregg & Hazelwood 2011): GPU dispatch is only beneficial when compute time exceeds 5× the PCIe transfer time.

### Using the Backend Selector

```rust
use aprender::orchestrate::backend::{BackendSelector, OpComplexity};

fn main() {
    let selector = BackendSelector::new();

    // Element-wise on 1M elements → SIMD
    let backend = selector.select_with_moe(OpComplexity::Low, 1_000_000);
    println!("1M element-wise: {}", backend);  // "SIMD"

    // Matrix multiply on 50K elements → GPU
    let backend = selector.select_with_moe(OpComplexity::High, 50_000);
    println!("50K matmul: {}", backend);  // "GPU"

    // Reduction on 5K elements → Scalar
    let backend = selector.select_with_moe(OpComplexity::Medium, 5_000);
    println!("5K reduction: {}", backend);  // "Scalar"
}
```

## Performance Comparison

Real-world benchmarks from migrated projects:

```
┌─────────────────────────────────────────────────────────────────┐
│              BATUTA MIGRATION PERFORMANCE GAINS                 │
│                                                                 │
│   Operation               Python      Rust        Improvement   │
│   ────────────────────    ──────      ────        ───────────   │
│   Linear regression fit   45ms        4ms         11.2× faster  │
│   Random forest predict   890ms       89ms        10.0× faster  │
│   KMeans clustering       2.3s        0.21s       10.9× faster  │
│   StandardScaler          12ms        0.8ms       15.0× faster  │
│   Matrix multiply (1K)    5.2ms       0.3ms       17.3× faster  │
│                                                                 │
│   Memory Usage:                                                 │
│   Peak RSS               127MB        24MB        5.3× smaller  │
│   Heap allocations       45K          3K          15.0× fewer   │
│                                                                 │
│   Binary Size:           N/A          2.1MB       Static linked │
│   Startup Time:          ~500ms       23ms        21.7× faster  │
└─────────────────────────────────────────────────────────────────┘
```

## Oracle Mode

Batuta includes an intelligent query interface for component selection:

```bash
# Find the right approach
$ batuta oracle "How do I train random forest on 1M samples?"

Recommendation: Use aprender::ensemble::RandomForestClassifier
  • Data size: 1M samples → High complexity
  • Recommended backend: GPU (via Trueno)
  • Memory estimate: ~800MB for training
  • Parallel trees: Enable with --n-jobs=-1

Code template:
```rust
use aprender::ensemble::RandomForestClassifier;

let mut model = RandomForestClassifier::new()
    .with_n_estimators(100)
    .with_max_depth(Some(10))
    .with_seed(42);
model.fit(&X_train, &y_train)?;
```

# List all stack components
$ batuta oracle --list

# Show component details
$ batuta oracle --show aprender
```

## Plugin Architecture

Extend Batuta with custom transpilers:

```rust
use aprender::orchestrate::plugin::{TranspilerPlugin, PluginMetadata, PluginRegistry};
use aprender::orchestrate::types::Language;

struct MyCustomConverter;

impl TranspilerPlugin for MyCustomConverter {
    fn metadata(&self) -> PluginMetadata {
        PluginMetadata {
            name: "custom-ml-converter".to_string(),
            version: "0.1.0".to_string(),
            description: "Custom ML framework converter".to_string(),
            author: "Your Name".to_string(),
            supported_languages: vec![Language::Python],
        }
    }

    fn transpile(&self, source: &str, _lang: Language) -> anyhow::Result<String> {
        // Custom conversion logic
        Ok(convert_custom_framework(source))
    }
}

fn main() -> anyhow::Result<()> {
    let mut registry = PluginRegistry::new();
    registry.register(Box::new(MyCustomConverter))?;

    // Use plugin for conversion
    let plugins = registry.get_for_language(Language::Python);
    if let Some(plugin) = plugins.first() {
        let output = plugin.transpile(source_code, Language::Python)?;
    }
    Ok(())
}
```

## Integration with CITL

Batuta integrates with the Compiler-in-the-Loop (CITL) system for iterative refinement:

```
┌─────────────────────────────────────────────────────────────────┐
│                  BATUTA + CITL INTEGRATION                      │
│                                                                 │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐                 │
│   │  Batuta  │───►│  Depyler │───►│   rustc  │                 │
│   │ Analyzer │    │Transpiler│    │ Compiler │                 │
│   └──────────┘    └──────────┘    └────┬─────┘                 │
│                                        │                        │
│                         ┌──────────────┘                        │
│                         ▼                                       │
│   ┌──────────────────────────────────────────────────────┐     │
│   │                    CITL Oracle                        │     │
│   │                                                       │     │
│   │   Error E0308 → TypeMapping fix                       │     │
│   │   Error E0382 → BorrowStrategy fix                    │     │
│   │   Error E0597 → LifetimeInfer fix                     │     │
│   └──────────────────────────────────────────────────────┘     │
│                         │                                       │
│                         ▼                                       │
│   ┌──────────────┐    ┌───────────┐    ┌────────────┐          │
│   │ Apply Fix    │───►│  Recompile │───►│  Success!  │          │
│   └──────────────┘    └───────────┘    └────────────┘          │
└─────────────────────────────────────────────────────────────────┘
```

When transpiled code fails to compile, Batuta queries the CITL oracle for fixes:

```rust
use aprender::orchestrate::citl::CITLIntegration;

let citl = CITLIntegration::new()
    .with_max_iterations(5)
    .with_confidence_threshold(0.8);

// Transpile with automatic fix attempts
let result = citl.transpile_with_repair(python_source)?;

match result {
    TranspileResult::Success { rust_code, fixes_applied } => {
        println!("Successfully transpiled with {} fixes", fixes_applied.len());
    }
    TranspileResult::Partial { rust_code, remaining_errors } => {
        println!("Partial success, {} errors remain", remaining_errors.len());
    }
}
```

## Best Practices

### 1. Start with Analysis

Always analyze your project before migration:

```bash
batuta analyze ./my-project --tdg --languages --dependencies
```

### 2. Migrate Incrementally

Use Ruchy for gradual migration:

```bash
batuta transpile --incremental --modules core,utils
```

### 3. Validate Thoroughly

Run semantic validation with syscall tracing:

```bash
batuta validate --trace-syscalls --diff-output --benchmark
```

### 4. Optimize Last

Enable optimizations only after validation:

```bash
batuta optimize --enable-simd --enable-gpu --profile aggressive
```

### 5. Document the Migration

Generate a migration report:

```bash
batuta report --format markdown --output MIGRATION.md
```

## Troubleshooting

### Common Issues

| Issue | Cause | Solution |
|-------|-------|----------|
| Type mismatch errors | Python dynamic typing | Add type hints in Python first |
| Missing algorithm | Unsupported sklearn feature | Check Aprender docs for equivalent |
| Performance regression | Wrong backend selected | Use `--force-backend` flag |
| Memory explosion | Large intermediate tensors | Enable streaming mode |

### Debugging Tips

```bash
# Verbose transpilation
batuta transpile --verbose --debug

# Show backend selection reasoning
batuta optimize --explain-backend

# Profile memory usage
batuta validate --profile-memory
```

## See Also

- [Compiler-in-the-Loop Learning]../ml-fundamentals/compiler-in-the-loop.md
- [CITL Automated Program Repair]./citl-automated-repair.md
- [Case Study: aprender-tsp Sub-Crate]./tsp-solver-crate.md