cuda-rust-wasm 0.1.7

CUDA to Rust transpiler with WebGPU/WASM support
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
# Getting Started with CUDA-Rust-WASM


## Welcome to GPU Computing in the Browser! 🚀


This tutorial will take you from zero to running your first CUDA kernel in a web browser using CUDA-Rust-WASM. By the end, you'll understand how to transpile CUDA code to WebGPU and achieve near-native performance in any browser.

## Table of Contents


1. [Prerequisites]#prerequisites
2. [Installation]#installation
3. [Your First Kernel]#your-first-kernel
4. [Understanding the Transpilation]#understanding-the-transpilation
5. [Browser Integration]#browser-integration
6. [Performance Analysis]#performance-analysis
7. [Next Steps]#next-steps

## Prerequisites


### Knowledge Requirements

- Basic understanding of parallel computing concepts
- Familiarity with JavaScript/TypeScript
- Optional: CUDA programming experience (helpful but not required)

### System Requirements

- **Node.js**: 18+ (LTS recommended)
- **Browser**: Chrome 113+, Firefox 115+, Safari 16.4+, or Edge 113+
- **GPU**: Any GPU with WebGPU support (most modern GPUs)
- **RAM**: 4GB minimum, 8GB recommended
- **Storage**: 1GB free space

### Check WebGPU Support


First, let's verify your browser supports WebGPU:

```javascript
// Run this in your browser console
if (navigator.gpu) {
  console.log('✅ WebGPU is supported!');
  navigator.gpu.requestAdapter().then(adapter => {
    if (adapter) {
      console.log('🎯 GPU adapter found:', adapter.info);
    } else {
      console.log('⚠️ No GPU adapter available');
    }
  });
} else {
  console.log('❌ WebGPU not supported in this browser');
}
```

## Installation


### Method 1: NPX (Recommended for Beginners)


No installation required! Use NPX to run commands directly:

```bash
# Check if everything works

npx cuda-rust-wasm --version

# Create a new project

npx cuda-rust-wasm init my-first-gpu-project
cd my-first-gpu-project
```

### Method 2: Global Installation


For frequent use, install globally:

```bash
# Install globally

npm install -g cuda-rust-wasm

# Verify installation

cuda-rust-wasm --version

# Create new project

cuda-rust-wasm init my-first-gpu-project
cd my-first-gpu-project
```

### Method 3: Project Dependency


For existing projects:

```bash
# Add to existing project

npm install cuda-rust-wasm

# Add to package.json scripts

echo '{
  "scripts": {
    "transpile": "cuda-rust-wasm transpile",
    "analyze": "cuda-rust-wasm analyze",
    "benchmark": "cuda-rust-wasm benchmark"
  }
}' >> package.json
```

## Your First Kernel


Let's start with the classic "Hello World" of parallel computing: vector addition.

### Step 1: Create the CUDA Kernel


Create a file called `vector_add.cu`:

```cuda
// vector_add.cu - Your first CUDA kernel
__global__ void vectorAdd(float* a, float* b, float* c, int n) {
    // Calculate this thread's index
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    
    // Boundary check to prevent out-of-bounds access
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}
```

### Step 2: Analyze the Kernel


Before transpiling, let's analyze what this kernel does:

```bash
npx cuda-rust-wasm analyze vector_add.cu --verbose
```

You'll see output like:

```json
{
  "analysis": {
    "kernelName": "vectorAdd",
    "complexity": "simple",
    "memoryPattern": "coalesced",
    "parallelizability": "perfect",
    "recommendations": [
      "Excellent memory access pattern",
      "No optimization needed for this simple case",
      "Consider using larger block sizes for better occupancy"
    ]
  },
  "performance": {
    "estimatedSpeedup": "high",
    "bottlenecks": [],
    "occupancy": "good"
  }
}
```

### Step 3: Transpile to WebGPU


Now let's transpile the CUDA code to WebGPU:

```bash
npx cuda-rust-wasm transpile vector_add.cu \
  --target webgpu \
  --optimize \
  --output-dir ./dist \
  --generate-bindings
```

This creates several files:

```
dist/
├── vector_add.wgsl        # WebGPU shader
├── vector_add.js          # JavaScript bindings
├── vector_add.d.ts        # TypeScript definitions
├── vector_add_test.js     # Auto-generated tests
└── performance_report.html # Optimization report
```

### Step 4: Examine the Generated Code


Let's look at the generated WebGPU shader (`dist/vector_add.wgsl`):

```wgsl
@group(0) @binding(0) var<storage, read> a: array<f32>;
@group(0) @binding(1) var<storage, read> b: array<f32>;
@group(0) @binding(2) var<storage, read_write> c: array<f32>;

struct Params {
    n: u32,
}

@group(0) @binding(3) var<uniform> params: Params;

@compute @workgroup_size(256)
fn vectorAdd(@builtin(global_invocation_id) global_id: vec3<u32>) {
    let i = global_id.x;
    
    if (i < params.n) {
        c[i] = a[i] + b[i];
    }
}
```

Notice how the transpiler:
- Converted CUDA's global memory pointers to WebGPU storage buffers
- Mapped `blockIdx.x * blockDim.x + threadIdx.x` to `global_invocation_id.x`
- Added proper buffer bindings and group layouts
- Optimized the workgroup size to 256 (a good default)

## Understanding the Transpilation


### Memory Model Translation


| CUDA Concept | WebGPU Equivalent | Purpose |
|--------------|-------------------|---------|
| Global memory (`float*`) | Storage buffer (`array<f32>`) | Large data arrays |
| Shared memory (`__shared__`) | Workgroup memory (`var<workgroup>`) | Fast inter-thread communication |
| Constant memory (`__constant__`) | Uniform buffer (`var<uniform>`) | Read-only parameters |
| Thread index (`threadIdx.x`) | Local invocation ID (`local_invocation_id.x`) | Thread within workgroup |
| Block index (`blockIdx.x`) | Workgroup ID (`workgroup_id.x`) | Workgroup identifier |

### Execution Model Translation


```cuda
// CUDA launch configuration
vectorAdd<<<gridSize, blockSize>>>(a, b, c, n);
```

becomes:

```javascript
// WebGPU dispatch
await kernel.dispatch(gridSize, 1, 1);
```

where the blockSize is embedded in the shader's `@workgroup_size()`.

## Browser Integration


Now let's create a complete web application using our transpiled kernel.

### Step 5: Create the HTML Structure


Create `index.html`:

```html
<!DOCTYPE html>

<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>My First GPU Kernel</title>
    <style>
        body {
            font-family: Arial, sans-serif;
            max-width: 800px;
            margin: 0 auto;
            padding: 20px;
        }
        .container {
            background: #f5f5f5;
            padding: 20px;
            border-radius: 8px;
            margin: 20px 0;
        }
        .status {
            padding: 10px;
            border-radius: 4px;
            margin: 10px 0;
        }
        .success { background: #d4edda; color: #155724; }
        .error { background: #f8d7da; color: #721c24; }
        .info { background: #d1ecf1; color: #0c5460; }
        button {
            background: #007bff;
            color: white;
            border: none;
            padding: 10px 20px;
            border-radius: 4px;
            cursor: pointer;
            font-size: 16px;
        }
        button:hover { background: #0056b3; }
        button:disabled { background: #6c757d; cursor: not-allowed; }
        #results {
            font-family: monospace;
            background: #f8f9fa;
            padding: 15px;
            border-radius: 4px;
            white-space: pre-wrap;
        }
    </style>
</head>
<body>
    <h1>🚀 My First GPU Kernel</h1>
    
    <div class="container">
        <h2>Vector Addition on GPU</h2>
        <p>This demo adds two vectors using your GPU via WebGPU:</p>
        <p><strong>A + B = C</strong> where each vector has 1 million elements</p>
        
        <button id="runKernel">Run GPU Kernel</button>
        <button id="runCPU">Compare with CPU</button>
        
        <div id="status"></div>
        <div id="results"></div>
    </div>

    <script type="module" src="app.js"></script>
</body>
</html>
```

### Step 6: Create the JavaScript Application


Create `app.js`:

```javascript
// app.js - Complete WebGPU application
import { createWebGPUKernel, detectHardware } from 'cuda-rust-wasm';

class VectorAdditionDemo {
    constructor() {
        this.device = null;
        this.kernel = null;
        this.n = 1024 * 1024; // 1 million elements
        this.a = new Float32Array(this.n);
        this.b = new Float32Array(this.n);
        this.c = new Float32Array(this.n);
        
        this.initializeUI();
    }
    
    initializeUI() {
        document.getElementById('runKernel').addEventListener('click', () => this.runGPUKernel());
        document.getElementById('runCPU').addEventListener('click', () => this.runCPUComparison());
        
        this.showStatus('Ready to run! Click "Run GPU Kernel" to start.', 'info');
    }
    
    showStatus(message, type = 'info') {
        const statusDiv = document.getElementById('status');
        statusDiv.className = `status ${type}`;
        statusDiv.textContent = message;
    }
    
    updateResults(results) {
        document.getElementById('results').textContent = results;
    }
    
    async initialize() {
        this.showStatus('Initializing WebGPU...', 'info');
        
        try {
            // Check WebGPU support
            if (!navigator.gpu) {
                throw new Error('WebGPU not supported in this browser');
            }
            
            // Get GPU adapter and device
            const adapter = await navigator.gpu.requestAdapter();
            if (!adapter) {
                throw new Error('No WebGPU adapter found');
            }
            
            this.device = await adapter.requestDevice();
            
            // Load and compile our kernel
            const kernelCode = await fetch('./dist/vector_add.wgsl').then(r => r.text());
            
            // Get hardware profile for optimization
            const hardware = await detectHardware();
            console.log('Hardware profile:', hardware);
            
            // Create optimized kernel
            this.kernel = await createWebGPUKernel(this.device, kernelCode, {
                optimizationLevel: 3,
                enableProfiling: true,
                hardwareProfile: hardware
            });
            
            // Generate test data
            this.generateTestData();
            
            this.showStatus('✅ WebGPU initialized successfully!', 'success');
            this.updateResults(`Hardware: ${hardware.deviceName}\nCompute capability: ${hardware.computeCapability}\nMemory bandwidth: ${hardware.memoryBandwidth} GB/s`);
            
        } catch (error) {
            this.showStatus(`❌ Initialization failed: ${error.message}`, 'error');
            console.error('Initialization error:', error);
        }
    }
    
    generateTestData() {
        // Generate random test vectors
        for (let i = 0; i < this.n; i++) {
            this.a[i] = Math.random() * 100;
            this.b[i] = Math.random() * 100;
        }
        console.log('Generated test data:', { n: this.n, sampleA: this.a.slice(0, 5), sampleB: this.b.slice(0, 5) });
    }
    
    async runGPUKernel() {
        if (!this.kernel) {
            this.showStatus('❌ Kernel not initialized', 'error');
            return;
        }
        
        try {
            this.showStatus('🔄 Running GPU kernel...', 'info');
            document.getElementById('runKernel').disabled = true;
            
            const startTime = performance.now();
            
            // Create GPU buffers
            const bufferA = this.kernel.createBuffer(this.n * 4, GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST);
            const bufferB = this.kernel.createBuffer(this.n * 4, GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST);
            const bufferC = this.kernel.createBuffer(this.n * 4, GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC);
            
            // Copy data to GPU
            await this.kernel.writeBuffer(0, this.a.buffer);
            await this.kernel.writeBuffer(1, this.b.buffer);
            
            // Set kernel parameters
            this.kernel.setArgs({ n: this.n });
            
            // Launch kernel with profiling
            const profile = await this.kernel.dispatchWithProfiling(
                Math.ceil(this.n / 256), 1, 1
            );
            
            // Read results back
            const resultBuffer = await this.kernel.readBuffer(2);
            const results = new Float32Array(resultBuffer);
            
            const totalTime = performance.now() - startTime;
            
            // Verify results (check first few elements)
            const verification = this.verifyResults(results);
            
            // Display results
            const throughput = (this.n * 3 * 4) / (profile.executionTime / 1000) / (1024 * 1024 * 1024); // GB/s
            const gflops = this.n / (profile.executionTime / 1000) / (1000 * 1000 * 1000); // GFLOPS
            
            this.showStatus('✅ GPU kernel completed successfully!', 'success');
            this.updateResults(`GPU Results:
Total time: ${totalTime.toFixed(2)}ms
Kernel execution: ${profile.executionTime.toFixed(3)}ms
Memory transfer: ${(totalTime - profile.executionTime).toFixed(3)}ms
Throughput: ${throughput.toFixed(2)} GB/s
Compute: ${gflops.toFixed(2)} GFLOPS
Verification: ${verification ? '✅ PASSED' : '❌ FAILED'}

Sample results:
A[0] + B[0] = ${this.a[0].toFixed(3)} + ${this.b[0].toFixed(3)} = ${results[0].toFixed(3)}
A[1] + B[1] = ${this.a[1].toFixed(3)} + ${this.b[1].toFixed(3)} = ${results[1].toFixed(3)}
A[2] + B[2] = ${this.a[2].toFixed(3)} + ${this.b[2].toFixed(3)} = ${results[2].toFixed(3)}`);
            
            // Store GPU results for comparison
            this.gpuResults = {
                time: profile.executionTime,
                throughput: throughput,
                gflops: gflops
            };
            
        } catch (error) {
            this.showStatus(`❌ GPU kernel failed: ${error.message}`, 'error');
            console.error('GPU kernel error:', error);
        } finally {
            document.getElementById('runKernel').disabled = false;
        }
    }
    
    async runCPUComparison() {
        this.showStatus('🔄 Running CPU comparison...', 'info');
        document.getElementById('runCPU').disabled = true;
        
        try {
            const startTime = performance.now();
            
            // CPU vector addition
            const cpuResults = new Float32Array(this.n);
            for (let i = 0; i < this.n; i++) {
                cpuResults[i] = this.a[i] + this.b[i];
            }
            
            const cpuTime = performance.now() - startTime;
            const cpuThroughput = (this.n * 3 * 4) / (cpuTime / 1000) / (1024 * 1024 * 1024); // GB/s
            const cpuGflops = this.n / (cpuTime / 1000) / (1000 * 1000 * 1000); // GFLOPS
            
            let comparison = `CPU Results:
Execution time: ${cpuTime.toFixed(2)}ms
Throughput: ${cpuThroughput.toFixed(2)} GB/s
Compute: ${cpuGflops.toFixed(2)} GFLOPS`;
            
            if (this.gpuResults) {
                const speedup = cpuTime / this.gpuResults.time;
                const throughputRatio = this.gpuResults.throughput / cpuThroughput;
                
                comparison += `

GPU vs CPU Comparison:
Speedup: ${speedup.toFixed(2)}x faster
Throughput improvement: ${throughputRatio.toFixed(2)}x
Efficiency: ${(this.gpuResults.gflops / cpuGflops * 100).toFixed(1)}%`;
            }
            
            this.showStatus('✅ CPU comparison completed!', 'success');
            this.updateResults(comparison);
            
        } catch (error) {
            this.showStatus(`❌ CPU comparison failed: ${error.message}`, 'error');
            console.error('CPU comparison error:', error);
        } finally {
            document.getElementById('runCPU').disabled = false;
        }
    }
    
    verifyResults(gpuResults) {
        // Verify first 100 elements (sufficient for validation)
        const tolerance = 1e-5;
        for (let i = 0; i < Math.min(100, this.n); i++) {
            const expected = this.a[i] + this.b[i];
            const actual = gpuResults[i];
            if (Math.abs(expected - actual) > tolerance) {
                console.error(`Verification failed at index ${i}: expected ${expected}, got ${actual}`);
                return false;
            }
        }
        return true;
    }
}

// Initialize the demo when page loads
window.addEventListener('DOMContentLoaded', async () => {
    const demo = new VectorAdditionDemo();
    await demo.initialize();
});
```

### Step 7: Run the Application


1. Start a local server (required for module imports):

```bash
# Using Python (if available)

python -m http.server 8000

# Using Node.js http-server

npx http-server

# Using any other static server

```

2. Open your browser to `http://localhost:8000`

3. Click "Run GPU Kernel" to see your CUDA kernel running in the browser!

## Performance Analysis


### Understanding the Results


When you run the demo, you'll see metrics like:

```
GPU Results:
Total time: 15.42ms
Kernel execution: 2.341ms
Memory transfer: 13.079ms
Throughput: 2.84 GB/s
Compute: 427.42 GFLOPS
```

Let's break this down:

- **Total time**: Complete operation including memory transfers
- **Kernel execution**: Pure GPU computation time
- **Memory transfer**: Time to copy data to/from GPU
- **Throughput**: Data bandwidth achieved
- **Compute**: Computational throughput

### Optimization Opportunities


The analysis might show that memory transfer dominates execution time. This is normal for simple kernels and suggests:

1. **Batch operations**: Process more data per kernel launch
2. **Keep data on GPU**: Minimize host-device transfers
3. **Use persistent kernels**: For iterative algorithms
4. **Pipeline operations**: Overlap compute with memory transfers

### Compare with CPU


The CPU comparison shows the effectiveness of GPU acceleration:

```
GPU vs CPU Comparison:
Speedup: 8.45x faster
Throughput improvement: 3.2x
Efficiency: 89.3%
```

## Next Steps


Congratulations! 🎉 You've successfully run your first CUDA kernel in a web browser. Here's what to explore next:

### 1. Try More Complex Kernels


```bash
# Matrix multiplication

npx cuda-rust-wasm create-example matrix-multiply

# Image processing

npx cuda-rust-wasm create-example convolution

# Neural network layer

npx cuda-rust-wasm create-example neural-network
```

### 2. Enable Neural Optimization


```javascript
// Add AI-powered optimization
const optimizer = new NeuralOptimizer();
const optimizedKernel = await optimizer.optimizeKernel(kernelCode, {
    target: 'webgpu',
    performanceTarget: 'throughput'
});
```

### 3. Integrate with ruv-FANN


```javascript
// Add neural network acceleration
import { RuvFANN } from 'ruv-fann';

const neuralNet = new RuvFANN([784, 128, 10]);
const accelerated = await neuralNet.accelerateWithGPU(kernelCode);
```

### 4. Advanced Features


- **Real-time video processing**: Process webcam streams
- **Scientific computing**: Implement complex algorithms
- **Machine learning**: Train neural networks in the browser
- **Data visualization**: GPU-accelerated graphics

### 5. Learn More


- **[Advanced Tutorials]./advanced/**: Complex patterns and optimizations
- **[Performance Guide]../PERFORMANCE.md**: Detailed optimization techniques
- **[API Reference]../API.md**: Complete API documentation
- **[Migration Guide]../MIGRATION_GUIDE.md**: Port existing CUDA code
- **[Examples]../../examples/**: Real-world applications

### 6. Join the Community


- **Discord**: [Get real-time help]https://discord.gg/vibecast
- **GitHub**: [Contribute and report issues]https://github.com/vibecast/cuda-rust-wasm
- **Blog**: [Latest updates and tutorials]https://vibecast.io/blog
- **Newsletter**: [Stay updated]https://vibecast.io/newsletter

## Troubleshooting


### Common Issues


**WebGPU not supported:**
```javascript
// Add fallback detection
if (!navigator.gpu) {
    console.log('WebGPU not available, falling back to WebAssembly');
    // Implement WASM fallback
}
```

**Kernel compilation errors:**
```bash
# Add verbose output for debugging

npx cuda-rust-wasm transpile kernel.cu --verbose --debug
```

**Performance issues:**
```bash
# Analyze for bottlenecks

npx cuda-rust-wasm analyze kernel.cu --performance-focus
```

**Memory allocation failures:**
```javascript
// Handle gracefully
try {
    const buffer = kernel.createBuffer(size);
} catch (error) {
    if (error.name === 'OutOfMemoryError') {
        console.log('Reducing buffer size...');
        // Implement chunked processing
    }
}
```

### Getting Help


1. **Check the browser console** for detailed error messages
2. **Enable verbose logging** during development
3. **Use the analysis tools** to understand performance
4. **Join our Discord** for community support
5. **Check GitHub issues** for known problems

---

You've now mastered the basics of CUDA-Rust-WASM! You can transpile CUDA kernels to run efficiently in web browsers, achieving excellent performance with full cross-platform compatibility. The GPU is now accessible from JavaScript, opening up infinite possibilities for web applications.

Happy coding! 🚀