# Case Study: Batuta - Automated Migration to Aprender
Using Batuta to automatically convert Python ML projects to Aprender/Rust.
## Overview
**Batuta** (Spanish for "conductor's baton") is an orchestration framework that converts Python ML projects to high-performance Rust using Aprender. It automates the migration of scikit-learn codebases to Aprender equivalents with:
- Automatic API mapping (sklearn → Aprender)
- NumPy → Trueno tensor conversion
- Mixture-of-Experts (MoE) backend routing
- Semantics-preserving transformation
```
┌─────────────────────────────────────────────────────────────────┐
│                      BATUTA MIGRATION FLOW                      │
│                                                                 │
│   Python Project                Rust Project                    │
│   ──────────────                ────────────                    │
│   sklearn.linear_model    ═══►  aprender::linear_model          │
│   sklearn.cluster         ═══►  aprender::cluster               │
│   sklearn.ensemble        ═══►  aprender::ensemble              │
│   sklearn.preprocessing   ═══►  aprender::preprocessing         │
│   numpy operations        ═══►  trueno primitives               │
│                                                                 │
│   Result: 2-10× performance improvement with memory safety      │
└─────────────────────────────────────────────────────────────────┘
```
## The 5-Phase Workflow
Batuta follows a Toyota Way-inspired Kanban workflow:
```
┌──────────┐   ┌──────────────┐   ┌──────────────┐   ┌────────────┐   ┌────────────┐
│ Analysis │──►│ Transpilation│──►│ Optimization │──►│ Validation │──►│ Deployment │
└──────────┘   └──────────────┘   └──────────────┘   └────────────┘   └────────────┘
     │                │                  │                 │                │
     ▼                ▼                  ▼                 ▼                ▼
    PMAT           Depyler          MoE Backend         Renacer          Reports
 TDG Score     Type Inference         Routing           Tracing         Migration
```
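The strict ordering of the phases can be pictured as a small state machine. The following is only an illustrative sketch in plain Rust: the `Phase` enum, `next`, `tooling`, and `pipeline` are hypothetical names and are not part of Batuta's API. The tooling strings are taken from the diagram above.

```rust
/// The five workflow phases from the diagram, in order.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Phase {
    Analysis,
    Transpilation,
    Optimization,
    Validation,
    Deployment,
}

impl Phase {
    /// Returns the phase that follows `self`, or `None` after Deployment.
    fn next(self) -> Option<Phase> {
        match self {
            Phase::Analysis => Some(Phase::Transpilation),
            Phase::Transpilation => Some(Phase::Optimization),
            Phase::Optimization => Some(Phase::Validation),
            Phase::Validation => Some(Phase::Deployment),
            Phase::Deployment => None,
        }
    }

    /// Tool or artifact the diagram associates with each phase.
    fn tooling(self) -> &'static str {
        match self {
            Phase::Analysis => "PMAT, TDG score",
            Phase::Transpilation => "Depyler, type inference",
            Phase::Optimization => "MoE backend routing",
            Phase::Validation => "Renacer tracing",
            Phase::Deployment => "Migration reports",
        }
    }
}

/// Walks the workflow from Analysis through Deployment.
fn pipeline() -> Vec<Phase> {
    let mut phases = Vec::new();
    let mut current = Some(Phase::Analysis);
    while let Some(phase) = current {
        phases.push(phase);
        current = phase.next();
    }
    phases
}

fn main() {
    for phase in pipeline() {
        println!("{:?}: {}", phase, phase.tooling());
    }
}
```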
### Phase 1: Analysis
```bash
$ batuta analyze ./my-sklearn-project
Primary language: Python
Total files: 127
Total lines: 8,432
Dependencies:
• pip (42 packages) in requirements.txt
• ML frameworks detected:
- scikit-learn 1.3.0 → Aprender mapping available
- numpy 1.24.0 → Trueno mapping available
- pandas 2.0.0 → DataFrame support
Quality:
• TDG Score: 73.2/100 (B)
• Test coverage: 68%
Recommended transpiler: Depyler (Python → Rust)
Estimated migration complexity: Medium
```
### Phase 2: Transpilation
```bash
$ batuta transpile --output ./rust-project
```
### Phase 3: Optimization
```bash
$ batuta optimize --enable-simd --enable-gpu
```
### Phase 4: Validation
```bash
$ batuta validate --trace-syscalls --benchmark
```
### Phase 5: Deployment
```bash
$ batuta build --release
$ batuta report --format markdown --output MIGRATION.md
```
## scikit-learn to Aprender Mapping
Batuta maps the following scikit-learn estimators and utilities to their Aprender equivalents:
### Linear Models
| scikit-learn | Aprender | Complexity |
|--------------|----------|------------|
| `LinearRegression` | `aprender::linear_model::LinearRegression` | Medium |
| `LogisticRegression` | `aprender::linear_model::LogisticRegression` | Medium |
| `Ridge` | `aprender::linear_model::Ridge` | Medium |
| `Lasso` | `aprender::linear_model::Lasso` | Medium |
### Tree-Based Models
| scikit-learn | Aprender | Complexity |
|--------------|----------|------------|
| `DecisionTreeClassifier` | `aprender::tree::DecisionTreeClassifier` | High |
| `RandomForestClassifier` | `aprender::ensemble::RandomForestClassifier` | High |
| `GradientBoostingClassifier` | `aprender::ensemble::GradientBoosting` | High |
### Clustering
| scikit-learn | Aprender | Complexity |
|--------------|----------|------------|
| `KMeans` | `aprender::cluster::KMeans` | Medium |
| `DBSCAN` | `aprender::cluster::DBSCAN` | High |
### Preprocessing
| scikit-learn | Aprender | Complexity |
|--------------|----------|------------|
| `StandardScaler` | `aprender::preprocessing::StandardScaler` | Low |
| `MinMaxScaler` | `aprender::preprocessing::MinMaxScaler` | Low |
| `LabelEncoder` | `aprender::preprocessing::LabelEncoder` | Low |
### Model Selection
| scikit-learn | Aprender | Notes |
|--------------|----------|-------|
| `train_test_split` | `aprender::model_selection::train_test_split` | Same API |
| `cross_val_score` | `aprender::model_selection::cross_validate` | Same API |
| `GridSearchCV` | `aprender::model_selection::GridSearchCV` | Parallel by default |
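
Conceptually, the tables above amount to a lookup from a scikit-learn name to an Aprender path. The sketch below shows that idea in plain Rust. The function `aprender_path` is a hypothetical helper written for illustration and says nothing about how Batuta stores its mappings internally. The paths are copied from the tables.

```rust
/// Maps a scikit-learn class or function name to the Aprender path listed
/// in the mapping tables. Returns `None` for names the tables do not cover.
fn aprender_path(sklearn_name: &str) -> Option<&'static str> {
    let path = match sklearn_name {
        // Linear models
        "LinearRegression" => "aprender::linear_model::LinearRegression",
        "LogisticRegression" => "aprender::linear_model::LogisticRegression",
        "Ridge" => "aprender::linear_model::Ridge",
        "Lasso" => "aprender::linear_model::Lasso",
        // Tree-based models
        "DecisionTreeClassifier" => "aprender::tree::DecisionTreeClassifier",
        "RandomForestClassifier" => "aprender::ensemble::RandomForestClassifier",
        "GradientBoostingClassifier" => "aprender::ensemble::GradientBoosting",
        // Clustering
        "KMeans" => "aprender::cluster::KMeans",
        "DBSCAN" => "aprender::cluster::DBSCAN",
        // Preprocessing
        "StandardScaler" => "aprender::preprocessing::StandardScaler",
        "MinMaxScaler" => "aprender::preprocessing::MinMaxScaler",
        "LabelEncoder" => "aprender::preprocessing::LabelEncoder",
        // Model selection (note the renamed cross-validation function)
        "train_test_split" => "aprender::model_selection::train_test_split",
        "cross_val_score" => "aprender::model_selection::cross_validate",
        "GridSearchCV" => "aprender::model_selection::GridSearchCV",
        _ => return None,
    };
    Some(path)
}

fn main() {
    // "SVC" is not in the tables, so the lookup reports no mapping for it.
    for name in ["KMeans", "cross_val_score", "SVC"] {
        match aprender_path(name) {
            Some(path) => println!("{} -> {}", name, path),
            None => println!("{} -> no mapping listed", name),
        }
    }
}
```

Two entries change names rather than just namespaces: `cross_val_score` becomes `cross_validate`, and `GradientBoostingClassifier` becomes `GradientBoosting`.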
## Conversion Examples
### Example 1: Basic ML Pipeline
**Python (Original):**
```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Load data
data = load_iris()
X, y = data.data, data.target
# Preprocess
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)
# Train
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Evaluate
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.4f}")
```
**Rust (Batuta Output):**
```rust
use aprender::datasets::load_iris;
use aprender::preprocessing::StandardScaler;
use aprender::model_selection::train_test_split;
use aprender::ensemble::RandomForestClassifier;
use aprender::metrics::accuracy_score;
use aprender::{Estimator, Transformer};

fn main() -> anyhow::Result<()> {
    // Load data
    let data = load_iris()?;
    let (X, y) = (&data.features, &data.targets);

    // Preprocess
    let mut scaler = StandardScaler::new();
    let X_scaled = scaler.fit_transform(X)?;

    // Split (80/20, seed=42)
    let (X_train, X_test, y_train, y_test) = train_test_split(
        &X_scaled, y, 0.2, Some(42)
    )?;

    // Train
    let mut model = RandomForestClassifier::new()
        .with_n_estimators(100)
        .with_seed(42);
    model.fit(&X_train, &y_train)?;

    // Evaluate
    let predictions = model.predict(&X_test)?;
    let accuracy = accuracy_score(&y_test, &predictions)?;
    println!("Accuracy: {:.4}", accuracy);

    Ok(())
}
```
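When comparing the two pipelines, it helps to know what the scaling step computes. scikit-learn's `StandardScaler` standardizes each feature as z = (x - mean) / std using the population standard deviation (dividing by n), and leaves constant features unscaled. The sketch below reproduces that arithmetic for a single column in plain Rust. `standardize` is a hypothetical helper, not Aprender's implementation, and it assumes a non-empty column.

```rust
/// Standardizes one feature column: z = (x - mean) / std, using the
/// population standard deviation (divide by n), as scikit-learn does.
/// Assumes `column` is non-empty.
fn standardize(column: &[f64]) -> Vec<f64> {
    let n = column.len() as f64;
    let mean = column.iter().sum::<f64>() / n;
    let variance = column.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    // Constant columns have zero variance; use a scale of 1 to avoid dividing by zero.
    let std = if variance > 0.0 { variance.sqrt() } else { 1.0 };
    column.iter().map(|x| (x - mean) / std).collect()
}

fn main() {
    // mean = 3, population variance = 2, std = sqrt(2)
    let scaled = standardize(&[1.0, 2.0, 3.0, 4.0, 5.0]);
    println!("{:?}", scaled);
}
```

A helper like this is handy for spot-checking a few columns of migrated output against the Python original.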
### Example 2: Linear Regression with Cross-Validation
**Python (Original):**
```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import numpy as np
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
y = np.array([1.5, 3.5, 5.5, 7.5, 9.5])
model = LinearRegression()
scores = cross_val_score(model, X, y, cv=3, scoring='r2')
print(f"R² scores: {scores}")
print(f"Mean R²: {scores.mean():.4f}")
```
**Rust (Batuta Output):**
```rust
use aprender::linear_model::LinearRegression;
use aprender::model_selection::cross_validate;
use aprender::Estimator;
use trueno::Matrix;

fn main() -> anyhow::Result<()> {
    let X = Matrix::from_slice(&[
        [1.0, 2.0],
        [3.0, 4.0],
        [5.0, 6.0],
        [7.0, 8.0],
        [9.0, 10.0],
    ]);
    let y = vec![1.5, 3.5, 5.5, 7.5, 9.5];

    let model = LinearRegression::new();
    let scores = cross_validate(&model, &X, &y, 3)?;

    println!("R² scores: {:?}", scores.test_scores);
    println!("Mean R²: {:.4}", scores.mean_test_score());

    Ok(())
}
```
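For a regressor with an integer `cv`, scikit-learn uses unshuffled K-fold splitting: folds are contiguous, and the first `n_samples % k` folds receive one extra sample. The sketch below computes those test ranges in plain Rust. `kfold_test_ranges` is a hypothetical helper; whether Aprender's `cross_validate` splits the data the same way is an assumption to verify.

```rust
use std::ops::Range;

/// Splits `n_samples` indices into `k` contiguous test folds without
/// shuffling. The first `n_samples % k` folds get one extra sample.
fn kfold_test_ranges(n_samples: usize, k: usize) -> Vec<Range<usize>> {
    assert!(k >= 2 && k <= n_samples, "need 2 <= k <= n_samples");
    let base = n_samples / k;
    let extra = n_samples % k;
    let mut ranges = Vec::with_capacity(k);
    let mut start = 0;
    for fold in 0..k {
        // Early folds absorb the remainder, one sample each.
        let size = base + usize::from(fold < extra);
        ranges.push(start..start + size);
        start += size;
    }
    ranges
}

fn main() {
    // Five samples and cv=3, as in the example above.
    for (fold, test) in kfold_test_ranges(5, 3).iter().enumerate() {
        println!("fold {}: test indices {:?}", fold, test);
    }
}
```

With five samples and three folds, the last test fold holds a single sample, for which R² is not well defined. A larger dataset gives more meaningful scores when comparing the Python and Rust results.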
### Example 3: Clustering with KMeans
**Python (Original):**
```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np
X = np.random.randn(1000, 5)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)
score = silhouette_score(X, labels)
print(f"Silhouette score: {score:.4f}")
print(f"Inertia: {kmeans.inertia_:.2f}")
```
**Rust (Batuta Output):**
```rust
use aprender::cluster::KMeans;
use aprender::metrics::silhouette_score;
use aprender::UnsupervisedEstimator;
use trueno::Matrix;

fn main() -> anyhow::Result<()> {
    // Generate random data (using trueno's random)
    let X = Matrix::random(1000, 5);

    let mut kmeans = KMeans::new(3)
        .with_seed(42)
        .with_n_init(10);
    let labels = kmeans.fit_predict(&X)?;

    let score = silhouette_score(&X, &labels)?;
    println!("Silhouette score: {:.4}", score);
    println!("Inertia: {:.2}", kmeans.inertia());

    Ok(())
}
```
## NumPy to Trueno Mapping
Batuta converts NumPy operations to Trueno equivalents:
| NumPy | Trueno | Notes |
|-------|--------|-------|
| `np.array([...])` | `Vector::from_slice(&[...])` | Direct mapping |
| `np.zeros((m, n))` | `Matrix::zeros(m, n)` | Same semantics |
| `np.ones((m, n))` | `Matrix::ones(m, n)` | Same semantics |
| `np.dot(a, b)` | `a.dot(&b)` | SIMD-accelerated |
| `a @ b` | `a.matmul(&b)` | MoE backend selection |
| `np.sum(a)` | `a.sum()` | Reduction operation |
| `np.mean(a)` | `a.mean()` | Statistical operation |
| `np.max(a)` | `a.max()` | Reduction operation |
| `np.min(a)` | `a.min()` | Reduction operation |
| `a.T` | `a.transpose()` | View-based (zero-copy) |
| `a.reshape(m, n)` | `a.reshape(m, n)` | Same API |
### Example: Matrix Operations
**Python:**
```python
import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
# Matrix multiply
C = A @ B
# Element-wise operations
D = A + B
E = A * B
# Reductions
total = np.sum(A)
mean = np.mean(A)
```
**Rust (via Batuta):**
```rust
use trueno::Matrix;

fn main() {
    let A = Matrix::from_slice(&[
        [1.0, 2.0],
        [3.0, 4.0],
    ]);
    let B = Matrix::from_slice(&[
        [5.0, 6.0],
        [7.0, 8.0],
    ]);

    // Matrix multiply (the MoE router picks the backend from the operand size)
    let C = A.matmul(&B);

    // Element-wise operations (SIMD-accelerated)
    let D = &A + &B;
    let E = &A * &B;

    // Reductions
    let total = A.sum();
    let mean = A.mean();
}
```
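The values these operations should produce are easy to work out by hand, which gives a quick reference when validating migrated code. The sketch below uses plain arrays rather than Trueno, so `Mat2`, `matmul`, `hadamard`, and `sum` are local stand-ins and not Trueno's API. Note that `A * B` in NumPy is element-wise, not a matrix product.

```rust
/// A 2x2 matrix as a plain nested array (stand-in for a real matrix type).
type Mat2 = [[f64; 2]; 2];

/// Matrix product, the counterpart of `A @ B`.
fn matmul(a: &Mat2, b: &Mat2) -> Mat2 {
    let mut c = [[0.0; 2]; 2];
    for i in 0..2 {
        for j in 0..2 {
            for k in 0..2 {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    c
}

/// Element-wise product, the counterpart of `A * B`.
fn hadamard(a: &Mat2, b: &Mat2) -> Mat2 {
    let mut e = [[0.0; 2]; 2];
    for i in 0..2 {
        for j in 0..2 {
            e[i][j] = a[i][j] * b[i][j];
        }
    }
    e
}

/// Sum of all entries, the counterpart of `np.sum(A)`.
fn sum(a: &Mat2) -> f64 {
    a.iter().flatten().sum()
}

fn main() {
    let a: Mat2 = [[1.0, 2.0], [3.0, 4.0]];
    let b: Mat2 = [[5.0, 6.0], [7.0, 8.0]];

    println!("A @ B  = {:?}", matmul(&a, &b)); // [[19, 22], [43, 50]]
    println!("A * B  = {:?}", hadamard(&a, &b)); // [[5, 12], [21, 32]]
    println!("sum(A) = {}", sum(&a)); // 10
    println!("mean(A) = {}", sum(&a) / 4.0); // 2.5
}
```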
## Mixture-of-Experts Backend Routing
Batuta automatically selects optimal backends based on operation complexity and data size:
```
┌─────────────────────────────────────────────────────────────────┐
│                      MoE BACKEND SELECTION                      │
│                                                                 │
│   Operation Type          Data Size       Backend Selected      │
│   ──────────────          ─────────       ────────────────      │
│   Element-wise (Low)      < 1M            Scalar/SIMD           │
│   Element-wise (Low)      ≥ 1M            SIMD                  │
│                                                                 │
│   Reductions (Medium)     < 10K           Scalar                │
│   Reductions (Medium)     10K - 100K      SIMD                  │
│   Reductions (Medium)     ≥ 100K          GPU                   │
│                                                                 │
│   MatMul (High)           < 1K            Scalar                │
│   MatMul (High)           1K - 10K        SIMD                  │
│   MatMul (High)           ≥ 10K           GPU                   │
└─────────────────────────────────────────────────────────────────┘
```
These thresholds follow the **5× PCIe dispatch rule** (Gregg & Hazelwood 2011): GPU dispatch is only beneficial when compute time exceeds 5× the PCIe transfer time.
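
The rule itself is a one-line comparison. The sketch below expresses it in plain Rust, together with a rough transfer-time estimate. The function names and the payload, bandwidth, and timing numbers are hypothetical values chosen only to illustrate the arithmetic, not measurements or Batuta internals.

```rust
/// Estimated time in seconds to move `bytes` across the bus at the given bandwidth.
fn transfer_secs(bytes: u64, bandwidth_bytes_per_sec: f64) -> f64 {
    bytes as f64 / bandwidth_bytes_per_sec
}

/// The 5x rule: dispatch to the GPU only when the estimated compute time
/// exceeds five times the PCIe transfer time.
fn gpu_dispatch_pays_off(compute_secs: f64, transfer_secs: f64) -> bool {
    compute_secs > 5.0 * transfer_secs
}

fn main() {
    // Hypothetical numbers: a 64 MB payload over a 16 GB/s link takes 4 ms,
    // so the break-even compute time is 20 ms.
    let transfer = transfer_secs(64_000_000, 16.0e9);

    // 10 ms of compute is below the 20 ms break-even: stay on the CPU.
    println!("10 ms kernel -> GPU? {}", gpu_dispatch_pays_off(0.010, transfer));
    // 50 ms of compute clears it: the transfer cost is amortized.
    println!("50 ms kernel -> GPU? {}", gpu_dispatch_pays_off(0.050, transfer));
}
```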
### Using the Backend Selector
```rust
use batuta::backend::{BackendSelector, OpComplexity};

fn main() {
    let selector = BackendSelector::new();

    // Element-wise on 1M elements → SIMD
    let backend = selector.select_with_moe(OpComplexity::Low, 1_000_000);
    println!("1M element-wise: {}", backend); // "SIMD"

    // Matrix multiply on 50K elements → GPU
    let backend = selector.select_with_moe(OpComplexity::High, 50_000);
    println!("50K matmul: {}", backend); // "GPU"

    // Reduction on 5K elements → Scalar
    let backend = selector.select_with_moe(OpComplexity::Medium, 5_000);
    println!("5K reduction: {}", backend); // "Scalar"
}
```
```
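The routing table can be approximated by a plain threshold function. The sketch below is a std-only illustration of that table and not the actual `BackendSelector` implementation. `Complexity`, `Backend`, and `select_backend` are hypothetical names. The table lists "Scalar/SIMD" for small element-wise work, and this sketch assumes Scalar in that case.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Complexity {
    Low,    // element-wise operations
    Medium, // reductions
    High,   // matrix multiplication
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Scalar,
    Simd,
    Gpu,
}

/// Picks a backend from operation complexity and element count,
/// using the thresholds from the MoE backend selection table.
fn select_backend(complexity: Complexity, elements: usize) -> Backend {
    match complexity {
        // Assumption: "Scalar/SIMD" below 1M is resolved to Scalar here.
        Complexity::Low if elements < 1_000_000 => Backend::Scalar,
        Complexity::Low => Backend::Simd,
        Complexity::Medium if elements < 10_000 => Backend::Scalar,
        Complexity::Medium if elements < 100_000 => Backend::Simd,
        Complexity::Medium => Backend::Gpu,
        Complexity::High if elements < 1_000 => Backend::Scalar,
        Complexity::High if elements < 10_000 => Backend::Simd,
        Complexity::High => Backend::Gpu,
    }
}

fn main() {
    // The same three queries as in the selector example above.
    println!("{:?}", select_backend(Complexity::Low, 1_000_000)); // Simd
    println!("{:?}", select_backend(Complexity::High, 50_000)); // Gpu
    println!("{:?}", select_backend(Complexity::Medium, 5_000)); // Scalar
}
```

Higher-complexity operations reach the GPU tier at smaller sizes because they perform more arithmetic per byte transferred, which is what the 5× rule rewards.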
## Performance Comparison
Real-world benchmarks from migrated projects:
```
┌─────────────────────────────────────────────────────────────────┐
│               BATUTA MIGRATION PERFORMANCE GAINS                │
│                                                                 │
│   Operation                 Python    Rust      Improvement     │
│   ────────────────────      ──────    ────      ───────────     │
│   Linear regression fit     45ms      4ms       11.2× faster    │
│   Random forest predict     890ms     89ms      10.0× faster    │
│   KMeans clustering         2.3s      0.21s     10.9× faster    │
│   StandardScaler            12ms      0.8ms     15.0× faster    │
│   Matrix multiply (1K)      5.2ms     0.3ms     17.3× faster    │
│                                                                 │
│   Memory Usage:                                                 │
│   Peak RSS                  127MB     24MB      5.3× smaller    │
│   Heap allocations          45K       3K        15.0× fewer     │
│                                                                 │
│   Binary Size:              N/A       2.1MB     Static linked   │
│   Startup Time:             ~500ms    23ms      21.7× faster    │
└─────────────────────────────────────────────────────────────────┘
```
## Oracle Mode
Batuta includes an intelligent query interface for component selection:
```bash
# Find the right approach
$ batuta oracle "How do I train random forest on 1M samples?"
Recommendation: Use aprender::ensemble::RandomForestClassifier
  • Data size: 1M samples → High complexity
  • Recommended backend: GPU (via Trueno)
  • Memory estimate: ~800MB for training
  • Parallel trees: Enable with --n-jobs=-1
```
Code template:
```rust
use aprender::ensemble::RandomForestClassifier;

let mut model = RandomForestClassifier::new()
    .with_n_estimators(100)
    .with_max_depth(Some(10))
    .with_seed(42);
model.fit(&X_train, &y_train)?;
```
```bash
# List all stack components
$ batuta oracle --list
# Show component details
$ batuta oracle --show aprender
```
## Plugin Architecture
Extend Batuta with custom transpilers:
```rust
use batuta::plugin::{TranspilerPlugin, PluginMetadata, PluginRegistry};
use batuta::types::Language;

struct MyCustomConverter;

impl TranspilerPlugin for MyCustomConverter {
    fn metadata(&self) -> PluginMetadata {
        PluginMetadata {
            name: "custom-ml-converter".to_string(),
            version: "0.1.0".to_string(),
            description: "Custom ML framework converter".to_string(),
            author: "Your Name".to_string(),
            supported_languages: vec![Language::Python],
        }
    }

    fn transpile(&self, source: &str, _lang: Language) -> anyhow::Result<String> {
        // Custom conversion logic
        Ok(convert_custom_framework(source))
    }
}

// Placeholder for your own conversion routine (not part of Batuta).
fn convert_custom_framework(source: &str) -> String {
    source.to_string()
}

fn main() -> anyhow::Result<()> {
    let mut registry = PluginRegistry::new();
    registry.register(Box::new(MyCustomConverter))?;

    // Example input; in practice, read this from the file being migrated.
    let source_code = "import my_framework";

    // Use plugin for conversion
    let plugins = registry.get_for_language(Language::Python);
    if let Some(plugin) = plugins.first() {
        let output = plugin.transpile(source_code, Language::Python)?;
        println!("{}", output);
    }

    Ok(())
}
```
## Integration with CITL
Batuta integrates with the Compiler-in-the-Loop (CITL) system for iterative refinement:
```
┌─────────────────────────────────────────────────────────────────┐
│                    BATUTA + CITL INTEGRATION                    │
│                                                                 │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐                  │
│   │  Batuta  │───►│ Depyler  │───►│  rustc   │                  │
│   │ Analyzer │    │Transpiler│    │ Compiler │                  │
│   └──────────┘    └──────────┘    └────┬─────┘                  │
│                                        │                        │
│                         ┌──────────────┘                        │
│                         ▼                                       │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                       CITL Oracle                       │   │
│   │                                                         │   │
│   │     Error E0308 → TypeMapping fix                       │   │
│   │     Error E0382 → BorrowStrategy fix                    │   │
│   │     Error E0597 → LifetimeInfer fix                     │   │
│   └─────────────────────────────────────────────────────────┘   │
│          │                                                      │
│          ▼                                                      │
│   ┌──────────────┐    ┌───────────┐    ┌────────────┐           │
│   │  Apply Fix   │───►│ Recompile │───►│  Success!  │           │
│   └──────────────┘    └───────────┘    └────────────┘           │
└─────────────────────────────────────────────────────────────────┘
```
When transpiled code fails to compile, Batuta queries the CITL oracle for fixes:
```rust
// `TranspileResult` is assumed to be exported from the same module.
use batuta::citl::{CITLIntegration, TranspileResult};

fn transpile(python_source: &str) -> anyhow::Result<()> {
    let citl = CITLIntegration::new()
        .with_max_iterations(5)
        .with_confidence_threshold(0.8);

    // Transpile with automatic fix attempts
    let result = citl.transpile_with_repair(python_source)?;

    match result {
        TranspileResult::Success { rust_code, fixes_applied } => {
            println!("Successfully transpiled with {} fixes", fixes_applied.len());
        }
        TranspileResult::Partial { rust_code, remaining_errors } => {
            println!("Partial success, {} errors remain", remaining_errors.len());
        }
    }

    Ok(())
}
```
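The control flow behind such a repair loop can be sketched without any Batuta types. The version below is a simulation under stated assumptions: the compiler is replaced by a list of outstanding error codes, applying a fix simply removes its error, and the confidence values are invented for illustration. `Fix`, `Outcome`, `suggest_fix`, and `repair_loop` are hypothetical names. The error-to-fix pairs come from the diagram above.

```rust
/// Fix strategies named in the CITL diagram.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Fix {
    TypeMapping,
    BorrowStrategy,
    LifetimeInfer,
}

/// Result of the repair loop, mirroring the Success/Partial split above.
#[derive(Debug, PartialEq)]
enum Outcome {
    Success { fixes_applied: Vec<Fix> },
    Partial { fixes_applied: Vec<Fix>, remaining_errors: Vec<String> },
}

/// Looks up a fix for a rustc error code. Confidence values are made up.
fn suggest_fix(error_code: &str) -> Option<(Fix, f64)> {
    match error_code {
        "E0308" => Some((Fix::TypeMapping, 0.95)),
        "E0382" => Some((Fix::BorrowStrategy, 0.85)),
        "E0597" => Some((Fix::LifetimeInfer, 0.60)),
        _ => None,
    }
}

/// Applies at most one fix per iteration until no errors remain, the
/// iteration budget is spent, or no remaining fix meets the threshold.
fn repair_loop(mut errors: Vec<String>, max_iterations: usize, threshold: f64) -> Outcome {
    let mut fixes_applied = Vec::new();
    for _ in 0..max_iterations {
        if errors.is_empty() {
            break;
        }
        // First outstanding error whose suggested fix is trusted enough.
        let candidate = errors.iter().enumerate().find_map(|(index, code)| {
            suggest_fix(code)
                .filter(|(_, confidence)| *confidence >= threshold)
                .map(|(fix, _)| (index, fix))
        });
        match candidate {
            Some((index, fix)) => {
                errors.remove(index); // simulated: the fix resolves this error
                fixes_applied.push(fix);
            }
            None => break, // nothing left that clears the confidence threshold
        }
    }
    if errors.is_empty() {
        Outcome::Success { fixes_applied }
    } else {
        Outcome::Partial { fixes_applied, remaining_errors: errors }
    }
}

fn main() {
    let errors = vec!["E0308".to_string(), "E0382".to_string(), "E0597".to_string()];
    // With a 0.8 threshold the lifetime fix (0.60) is rejected, so E0597 remains.
    println!("{:?}", repair_loop(errors, 5, 0.8));
}
```

The confidence threshold is what turns a fully automatic loop into a partial result: low-confidence fixes are left for a human rather than applied blindly.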
## Best Practices
### 1. Start with Analysis
Always analyze your project before migration:
```bash
batuta analyze ./my-project --tdg --languages --dependencies
```
### 2. Migrate Incrementally
Use Ruchy for gradual migration:
```bash
batuta transpile --incremental --modules core,utils
```
### 3. Validate Thoroughly
Run semantic validation with syscall tracing:
```bash
batuta validate --trace-syscalls --diff-output --benchmark
```
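Beyond tracing, a simple numerical check is to compare the outputs of the Python and Rust programs within a tolerance, since floating-point results rarely match bit for bit. The sketch below uses the same criterion as `numpy.allclose`, `|candidate - reference| <= atol + rtol * |reference|`. The helper `outputs_match` is hypothetical and is not a description of what `batuta validate` does internally.

```rust
/// Returns true when both slices have the same length and every candidate
/// value is within `atol + rtol * |reference|` of its reference value.
fn outputs_match(reference: &[f64], candidate: &[f64], rtol: f64, atol: f64) -> bool {
    reference.len() == candidate.len()
        && reference
            .iter()
            .zip(candidate)
            .all(|(r, c)| (c - r).abs() <= atol + rtol * r.abs())
}

fn main() {
    // Hypothetical predictions from the original and the migrated program.
    let python_output = [0.9667, 1.5000, 2.2500];
    let rust_output = [0.9667000001, 1.5000, 2.2500];

    println!("match: {}", outputs_match(&python_output, &rust_output, 1e-6, 1e-9));
}
```

Choosing `rtol` and `atol` depends on the algorithm: deterministic preprocessing should agree very closely, while seeded randomized models may differ between implementations even with the same seed.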
### 4. Optimize Last
Enable optimizations only after validation:
```bash
batuta optimize --enable-simd --enable-gpu --profile aggressive
```
### 5. Document the Migration
Generate a migration report:
```bash
batuta report --format markdown --output MIGRATION.md
```
## Troubleshooting
### Common Issues
| Issue | Cause | Solution |
|-------|-------|----------|
| Type mismatch errors | Python dynamic typing | Add type hints in Python first |
| Missing algorithm | Unsupported sklearn feature | Check Aprender docs for equivalent |
| Performance regression | Wrong backend selected | Use `--force-backend` flag |
| Memory explosion | Large intermediate tensors | Enable streaming mode |
### Debugging Tips
```bash
# Verbose transpilation
batuta transpile --verbose --debug
# Show backend selection reasoning
batuta optimize --explain-backend
# Profile memory usage
batuta validate --profile-memory
```
## See Also
- [Compiler-in-the-Loop Learning](../ml-fundamentals/compiler-in-the-loop.md)
- [CITL Automated Program Repair](./citl-automated-repair.md)
- [Case Study: aprender-tsp Sub-Crate](./tsp-solver-crate.md)