# rustify-ml

> **Auto-accelerate Python ML hotspots with Rust.** Profile → Identify → Generate → Build — drop-in PyO3 extensions with no manual rewrite.

> **20x faster `running_mean`. 15x faster `convolve1d`. 12x faster BPE tokenizer. Zero manual Rust.**

Install with `cargo install rustify-ml` (from crates.io); building the generated extensions also requires `pip install maturin`.

[![CI](https://github.com/homezloco/rustify-ml/actions/workflows/ci.yml/badge.svg)](https://github.com/homezloco/rustify-ml/actions)
[![crates.io](https://img.shields.io/crates/v/rustify-ml.svg)](https://crates.io/crates/rustify-ml)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)

---

## What It Does

`rustify-ml` is a CLI tool that:

1. **Profiles** your Python file using `cProfile` (no elevated privileges required)
2. **Identifies** CPU hotspots above a configurable threshold
3. **Generates** safe Rust + PyO3 stubs with length-check guards and type inference
4. **Builds** an installable Python extension via `maturin develop --release`

**Bridge:** Python (cProfile) → hotspot selection → Rust codegen (PyO3) → maturin wheel → editable install → parity tests + benchmarks. No manual glue required.

Measured speedups on the bundled benchmarks range from **1.4x to 20x** on pure-Python loops (tokenizers, matrix ops, image preprocessing, data pipelines).

---

## Quick Start

```bash
# Install dependencies
pip install maturin
cargo install rustify-ml          # from crates.io
cargo install --path rustify-ml   # or: cargo build --release

# Accelerate a Python file (dry-run: generate code, skip build)
rustify-ml accelerate --file examples/euclidean.py --output dist --threshold 0 --dry-run

# Full run: profile → generate → build extension
rustify-ml accelerate --file examples/euclidean.py --output dist --threshold 10

# Install and use the generated extension
cd dist/rustify_ml_ext && maturin develop --release
python -c "from rustify_ml_ext import euclidean; print(euclidean([0.0,3.0,4.0],[0.0,0.0,0.0]))"
# → 5.0

# Validate parity + speedups
python -X utf8 tests/test_all_fixtures.py --with-rust
python benches/compare.py --with-rust
```

---

## CLI Reference

```
rustify-ml accelerate [OPTIONS]

Input (one required):
  --file <PATH>          Python file to profile and accelerate
  --snippet              Read Python code from stdin
  --git <URL>            Git repo URL to clone and analyze
  --git-path <PATH>      Path within the git repo (required with --git)

Profiler:
  --threshold <FLOAT>    Minimum hotspot % to target [default: 10.0]
                         Tip: set to 0.0 to include all defined functions (parsed from the source)
  --iterations <N>       Profiler loop count for better sampling [default: 100]
  --list-targets         Profile only: print hotspot table and exit (no codegen)
  --function <NAME>      Skip profiler, target a specific function by name

Generation:
  --output <DIR>         Output directory for generated extension [default: dist]
  --ml-mode              Enable ML-focused heuristics (numpy → PyReadonlyArray1)
  --dry-run              Generate code without building (inspect before install)
  --benchmark            After building, run Python timing harness + speedup table
  --no-regen             Skip code regeneration; only rebuild the existing extension

Logging:
  -v / -vv               Increase verbosity (debug / trace)
```

### New in the latest build

| Flag | What it does |
|------|-------------|
| `--list-targets` | Profile only, print ranked hotspot table, exit — no code generated |
| `--function <name>` | Skip profiler entirely, target one function by name (100% weight) |
| `--iterations <n>` | Control how many times the profiler loops the script (default: 100) |
| `--ml-mode` | Detect numpy imports → use `PyReadonlyArray1<f64>` + add numpy dep to Cargo.toml (see the sketch after this table) |
| `--threshold 0` | Force inclusion of all defined functions (parser-based), even if profiler reports 0% |
| `--no-regen` | Skip code regeneration; only rebuild the existing `dist/rustify_ml_ext` (prevents overwriting manual edits) |
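
To make the `--ml-mode` row concrete, here is a hand-written sketch of the signature shape it targets. `normalize_pixels` is one of the bundled example functions, but the body here is a stand-in and the real generated stub may differ:

```rust
use numpy::PyReadonlyArray1;
use pyo3::prelude::*;

// Illustrative --ml-mode signature (not the literal codegen output): the
// numpy array arrives as a zero-copy readonly view instead of a Vec<f64>.
#[pyfunction]
pub fn normalize_pixels<'py>(xs: PyReadonlyArray1<'py, f64>) -> PyResult<Vec<f64>> {
    let xs = xs.as_slice()?; // errors only if the array is not contiguous
    let max = xs.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    if !max.is_finite() || max == 0.0 {
        return Ok(xs.to_vec()); // empty or all-zero input: return unchanged
    }
    Ok(xs.iter().map(|&x| x / max).collect())
}
```

As the table notes, `--ml-mode` also adds the `numpy` crate dependency to the generated Cargo.toml.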

### BPE Tokenizer Demo

One of the best targets for rustify-ml is the BPE (Byte-Pair Encoding) encode loop — the same algorithm used by tiktoken (OpenAI) and HuggingFace tokenizers. The inner merge pass is O(n²) in Python and translates cleanly to Rust `Vec<usize>` + `while` loops:
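
As a reference point, here is a hand-written sketch of that merge pass on the Rust side (illustrative only; the stub generated from `examples/bpe_tokenizer.py` may differ in naming and structure, and the shape of the `merges` map is an assumption):

```rust
use std::collections::HashMap;

/// Sketch of a BPE merge pass: repeatedly merge the lowest-rank adjacent
/// pair until no remaining pair has a merge rule. `merges` maps an
/// adjacent token pair to (rank, merged_id).
pub fn bpe_encode(
    mut ids: Vec<usize>,
    merges: &HashMap<(usize, usize), (usize, usize)>,
) -> Vec<usize> {
    loop {
        // Linear scan for the lowest-rank mergeable pair: (rank, index, merged_id).
        let mut best: Option<(usize, usize, usize)> = None;
        let mut i = 0;
        while i + 1 < ids.len() {
            if let Some(&(rank, merged)) = merges.get(&(ids[i], ids[i + 1])) {
                if best.map_or(true, |(r, _, _)| rank < r) {
                    best = Some((rank, i, merged));
                }
            }
            i += 1;
        }
        match best {
            Some((_, i, merged)) => {
                ids[i] = merged;   // replace the pair in place...
                ids.remove(i + 1); // ...and drop its right half
            }
            None => return ids, // nothing left to merge
        }
    }
}
```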

```bash
# Profile and generate Rust stubs for the BPE tokenizer
cargo run -- accelerate \
  --file examples/bpe_tokenizer.py \
  --function count_pairs \
  --output dist \
  --dry-run

# Or let the profiler find hotspots automatically
cargo run -- accelerate \
  --file examples/bpe_tokenizer.py \
  --threshold 5 \
  --output dist \
  --benchmark
```

**Latest benchmark snapshot** (WSL, CPython 3.12, `python benches/compare.py --with-rust`):
```
  Function                            |  Python us |    Rust us |  Speedup
  ------------------------------------+------------+------------+---------
  euclidean (n=1000)                  |       55.8 |       20.8 |     2.7x
  dot_product (n=1000)                |       45.8 |       19.4 |     2.4x
  normalize_pixels (n=1000)           |       53.2 |       25.1 |     2.1x
  running_mean (n=500, w=10)          |      376.9 |       18.7 |    20.2x
  count_pairs (n=500)                 |       88.3 |       61.5 |     1.4x
  bpe_encode (len=100)                |       11.1 |        0.9 |    12.3x
  standard_scale (n=1000)             |       55.1 |       27.2 |     2.0x
  min_max_scale (n=1000)              |       60.6 |       28.6 |     2.1x
  l2_normalize (n=1000)               |       95.5 |       29.4 |     3.2x
  convolve1d (n=1000, k=5)            |      329.8 |       21.0 |    15.7x
  moving_average (n=1000, w=10)       |      471.5 |       30.8 |    15.3x
  diff (n=1000)                       |       58.5 |       16.1 |     3.6x
  cumsum (n=1000)                     |       40.0 |       27.3 |     1.5x
```

After `maturin develop --release`, re-run `python benches/compare.py --with-rust` to refresh numbers for your machine.

## Examples

```bash
# Snippet from stdin
echo "def dot(a, b):\n    return sum(x*y for x,y in zip(a,b))" | \
  rustify-ml accelerate --snippet --output dist --dry-run

# Git repo (shallow clone, analyze one file)
rustify-ml accelerate \
  --git https://github.com/huggingface/transformers \
  --git-path examples/slow_preproc.py \
  --output dist --threshold 5

# ML mode (numpy-aware parameter types in generated stubs)
rustify-ml accelerate --file examples/image_preprocess.py --ml-mode --output dist --dry-run
```

### Timing Demo (euclidean)

Baseline vs Rust extension on WSL, CPython 3.12, Ryzen 7:

| Function | Input | Python (us) | Rust (us) | Speedup |
|----------|-------|-------------|-----------|---------|
| euclidean | n=1_000 | 55.8 | 20.8 | 2.7x |

Reproduce:

```bash
python -X utf8 benches/compare.py --function euclidean --with-rust
```

### ML-mode benchmarks (numpy arrays)

`--ml-mode` is optimized for numeric array inputs (numpy). Use it when your hotspots already operate on `np.ndarray` or can be cheaply converted to arrays. Example (image preprocessing):

```bash
python -X utf8 benches/compare.py --function normalize_pixels --with-rust --ml-mode
```

Sample (WSL, CPython 3.12, numpy arrays):

| Function | Input | Python (us) | Rust (us) | Speedup |
|----------|-------|-------------|-----------|---------|
| normalize_pixels | n=1_000 | 53.2 | 25.1 | 2.1x |
| convolve1d | n=1_000, k=5 | 329.8 | 21.0 | 15.7x |
| running_mean | n=500, w=10 | 376.9 | 18.7 | 20.2x |

Best practices: keep data as `np.ndarray` before calling Rust, avoid per-call Python↔Rust conversions, and rerun `benches/compare.py --with-rust --ml-mode` on your hardware to refresh numbers.

### CLI Output (demo)

![CLI demo](cli.gif)

### Using `rustify-stdlib` directly

```bash
pip install maturin
pip install rustify-stdlib  # once published
python - <<'PY'
import rustify_stdlib as rs
print(rs.euclidean([0.0,3.0,4.0],[0.0,0.0,0.0]))
print(rs.dot_product([1.0,2.0],[3.0,4.0]))
PY
```

---

## Example Output

After running `accelerate`, rustify-ml prints a summary table to stdout:

```
Accelerated 3/4 targets (1 fallback)

Func               | Line | % Time | Translation | Status
-------------------+------+--------+-------------+---------
euclidean          |  1   | 42.1%  | Full        | Success
dot_product        |  18  | 31.8%  | Full        | Success
matmul             |  7   | 20.4%  | Partial     | Fallback (nested loop)
normalize_pixels   |  24  |  5.7%  | Full        | Success

Generated: dist/rustify_ml_ext/
Install:   cd dist/rustify_ml_ext && maturin develop --release
```

---

## Translation Patterns

| Python Pattern | Rust Translation | Status |
|----------------|-----------------|--------|
| `for i in range(len(x)):` | `for i in 0..x.len() {` | ✅ Done |
| `total += a * b` | `total += a * b;` | ✅ Done |
| `return x ** 0.5` | `return (x).powf(0.5);` | ✅ Done |
| `a[i] - b[i]` | `a[i] - b[i]` | ✅ Done |
| `total = 0.0` | `let mut total: f64 = 0.0;` | ✅ Done |
| `result[i] = val` | `result[i] = val;` | ✅ Done |
| `result = [0.0] * n` | `let mut result = vec![0.0f64; n];` | ✅ Done |
| `range(a, b)` | `a..b` | ✅ Done |
| `for i in range(n): for j...` | nested for loops | ✅ Done |
| `[f(x) for x in xs]` | `xs.iter().map(f).collect()` | ✅ Done (sketch below) |
| `np.array` params | `PyReadonlyArray1<f64>` (via `--ml-mode`) | ✅ Done |

**Untranslatable** (warns + skips): `eval()`, `exec()`, `getattr()`, `async def`, class self mutation
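
For instance, the list-comprehension row maps a comprehension onto iterator adapters rather than an indexed loop. A hand-written sketch of that target shape (`double_all` is a made-up name; this is not literal codegen output):

```rust
use pyo3::prelude::*;

// Illustrative target for `[x * 2.0 for x in xs]`: the comprehension
// becomes an iterator chain instead of an explicit indexed loop.
#[pyfunction]
pub fn double_all(xs: Vec<f64>) -> PyResult<Vec<f64>> {
    Ok(xs.iter().map(|&x| x * 2.0).collect())
}
```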

---

## Generated Code Example

For `examples/euclidean.py`:

```python
def euclidean(p1, p2):
    total = 0.0
    for i in range(len(p1)):
        diff = p1[i] - p2[i]
        total += diff * diff
    return total ** 0.5
```

rustify-ml generates:

```rust
use pyo3::prelude::*;

#[pyfunction]
/// Auto-generated from Python hotspot `euclidean` at line 1 (100.00%): 100% hotspot
pub fn euclidean(py: Python, p1: Vec<f64>, p2: Vec<f64>) -> PyResult<f64> {
    let _ = py;
    if p1.len() != p2.len() {
        return Err(pyo3::exceptions::PyValueError::new_err("length mismatch"));
    }
    let mut total = 0.0f64;
    for i in 0..p1.len() {
        let diff = p1[i] - p2[i];
        total += diff * diff;
    }
    Ok((total).powf(0.5))
}
```
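
The generated crate also registers each function in a `#[pymodule]` so that `maturin develop` yields an importable `rustify_ml_ext` module. A minimal sketch of that registration with current PyO3, appended to the same `lib.rs` as above (the generated file may differ):

```rust
#[pymodule]
fn rustify_ml_ext(m: &Bound<'_, PyModule>) -> PyResult<()> {
    // Expose the translated hotspot to Python as rustify_ml_ext.euclidean.
    m.add_function(wrap_pyfunction!(euclidean, m)?)?;
    Ok(())
}
```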

---

## Timing Demo

Run the built-in benchmark after building the extension:

```bash
# Build the extension, then benchmark euclidean distance
rustify-ml accelerate --file examples/euclidean.py --output dist --threshold 0 --benchmark

# Or manually: build with maturin, then run the benchmark harness directly
cd dist/rustify_ml_ext && maturin develop --release && cd ../..
python benches/compare.py --with-rust
```

Expected output (from `benches/compare.py --with-rust`):

```
================================================================================
  rustify-ml benchmark results
================================================================================
  Function                            |  Python us |    Rust us |  Speedup
  ------------------------------------+------------+------------+---------
  running_mean (n=500, w=10)          |      376.9 |       18.7 |    20.2x
  convolve1d (n=1000, k=5)            |      329.8 |       21.0 |    15.7x
  moving_average (n=1000, w=10)       |      471.5 |       30.8 |    15.3x
  bpe_encode (len=100)                |       11.1 |        0.9 |    12.3x
  euclidean (n=1000)                  |       55.8 |       20.8 |     2.7x
================================================================================
```

> Numbers measured on WSL, CPython 3.12. Actual speedup depends on Python version, CPU, and input size.
> Loop-heavy functions (sliding window, convolution, tokenizers) see the largest gains.

---

## Example Files

| File | Description | Key Patterns |
|------|-------------|-------------|
| `examples/euclidean.py` | Euclidean distance | `range(len(x))`, `**`, accumulator |
| `examples/matrix_ops.py` | Matrix multiply + dot product | nested loops, subscript assign |
| `examples/image_preprocess.py` | Pixel normalize + gamma | `[0.0] * n`, subscript assign |
| `examples/bpe_tokenizer.py` | BPE encode (tiktoken-style) | while loop, HashMap merge rank |
| `examples/slow_tokenizer.py` | BPE-style tokenizer fixture | while loop, dict lookup |
| `examples/data_pipeline.py` | CSV parse + running mean | string ops, sliding window |
| `examples/signal_processing.py` | convolve1d, moving_average, diff, cumsum | nested loops, 1D signal ops |
| `examples/sklearn_scaler.py` | standard_scale, min_max_scale, l2_normalize | element-wise Vec ops |

---

## Architecture

```
CLI args (Clap)
    → input::load_input()     # File | stdin snippet | git2 clone
    → profiler::profile_input()  # cProfile subprocess; python3→python fallback
    → analyzer::select_targets() # Threshold filter; ml_mode tagging
    → generator::generate()   # AST walk; Rust codegen; len-check guards
    → builder::build_extension() # cargo check (fast-fail) → maturin develop
    → print_summary()         # ASCII table to stdout
```

**Modules:**

| Module | Responsibility |
|--------|---------------|
| `input.rs` | Load Python from file, stdin, or git repo |
| `profiler.rs` | Run cProfile via Python subprocess; parse hotspots (see the sketch below) |
| `analyzer.rs` | Filter hotspots by threshold; apply ML heuristics |
| `generator.rs` | Walk Python AST; emit Rust + PyO3 stubs |
| `builder.rs` | `cargo check` generated crate; spawn `maturin develop` |
| `utils.rs` | Shared types; ASCII summary table |
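
For a rough picture of the profiler step, here is a minimal sketch of running a script under cProfile with the python3→python fallback. This is an assumed shape, not the real `profiler.rs`, which generates its own harness and parses the stats output:

```rust
use std::process::Command;

// Sketch only: run the target script under cProfile sorted by total time,
// falling back to `python` when no `python3` binary can be spawned.
fn profile(script: &str) -> std::io::Result<String> {
    let args = ["-m", "cProfile", "-s", "tottime", script];
    let output = Command::new("python3")
        .args(args)
        .output()
        .or_else(|_| Command::new("python").args(args).output())?;
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}
```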

---

## Development

### Prerequisites

- Rust 1.75+ stable (`rustup update stable`)
- Python 3.10+ on PATH (`python3` or `python`)
- `pip install maturin`

### Build & Test

```bash
# From rustify-ml/ directory (or use WSL on Windows)
cargo fmt && cargo check
cargo test
cargo clippy -- -D warnings
```

### Run CLI in dev mode

```bash
# Dry-run: generate code, inspect, no build
cargo run -- accelerate --file examples/euclidean.py --output dist --threshold 0 --dry-run

# Full run (requires maturin)
cargo run -- accelerate --file examples/euclidean.py --output dist --threshold 0

# Verbose output
cargo run -- accelerate --file examples/euclidean.py --output dist -vv --dry-run
```

### Windows Note

The project builds and tests in **WSL** (Windows Subsystem for Linux). Running `cargo test` directly in Windows CMD requires Visual Studio Build Tools (`link.exe`). Use WSL for development:

```bash
cd /mnt/d/WindsurfProjects/rustify/rustify-ml
cargo fmt && cargo check
cargo test
```

---

## Roadmap

See [plan.md](plan.md) for the full prioritized task list. High-level:

1. **Core pipeline** — profile → analyze → generate → build
2. **Translation coverage** — assign init, subscript assign, list init, range forms, nested for loops
3. **While loop translation** — `while changed:`, `while i < len(x):` → Rust while
4. **Safety** — length-check guards, cargo check on generated crate
5. **Profiler robustness** — python3/python fallback, version pre-flight, stdlib filter
6. **CLI polish** — `--list-targets`, `--function`, `--iterations`, `--benchmark`
7. **ndarray feature** — `--ml-mode` + numpy import → `PyReadonlyArray1<f64>` params
8. **BPE tokenizer fixture** — `examples/bpe_tokenizer.py` + integration tests
9. **Benchmark script** — `benches/compare.py` (Python baseline + `--with-rust` mode)
10. **List comprehension** — `[f(x) for x in xs]` → `xs.iter().map(f).collect()`
11. **Criterion benchmarks** — `benches/speedup.rs` with Criterion (HTML reports; euclidean/dot_product/moving_average); a sketch follows this list
12. 📋 **v0.1.0 release** — crates.io publish, CHANGELOG, GitHub release (see CHANGELOG.md)
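
For item 11, a minimal sketch of what a Criterion harness like `benches/speedup.rs` looks like (illustrative; the bench names, inputs, and functions in the repo may differ). Run it with `cargo bench`:

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

// Stand-in for the Rust translation of `euclidean`, benchmarked directly
// with no Python boundary in the way.
fn euclidean(p1: &[f64], p2: &[f64]) -> f64 {
    p1.iter()
        .zip(p2)
        .map(|(a, b)| (a - b) * (a - b))
        .sum::<f64>()
        .sqrt()
}

fn bench_euclidean(c: &mut Criterion) {
    let xs: Vec<f64> = (0..1_000).map(|i| i as f64).collect();
    let ys = vec![0.0f64; 1_000];
    c.bench_function("euclidean n=1000", |b| {
        b.iter(|| euclidean(black_box(&xs), black_box(&ys)))
    });
}

criterion_group!(benches, bench_euclidean);
criterion_main!(benches);
```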

---

## License

MIT — see [LICENSE](LICENSE)

> ⚠️ **Generated code requires review.** rustify-ml emits Rust stubs as a starting point. Always review generated `lib.rs` before deploying, especially for fallback-translated functions (marked with `// fallback: echo input`).