# SIMD Support in cuda-wasm
This directory contains examples demonstrating SIMD (Single Instruction, Multiple Data) acceleration in cuda-wasm across x86_64 and ARM platforms.
## Overview
cuda-wasm uses platform-specific SIMD instructions to accelerate compute operations on CPUs. This provides significant speedups for workloads that cannot use GPU acceleration (e.g., CPU fallback paths, preprocessing, or environments without GPU access).
## Supported SIMD Instruction Sets
### x86_64
| **SSE2** | 128-bit | 4 | `is_x86_feature_detected!("sse2")` |
| **AVX2** | 256-bit | 8 | `is_x86_feature_detected!("avx2")` |
| **AVX-512** | 512-bit | 16 | `is_x86_feature_detected!("avx512f")` |
SSE2 is available on virtually all x86_64 processors. AVX2 is available on Intel Haswell (2013) and AMD Excavator (2015) and later. AVX-512 is available on select Intel Xeon, Ice Lake, and AMD Zen 4+ processors.
### ARM / AArch64
| **NEON** | 128-bit | 4 | Always available on AArch64 |
| **SVE** | 128-2048 bit | Variable | `is_aarch64_feature_detected!("sve")` |
| **SVE2** | 128-2048 bit | Variable | `is_aarch64_feature_detected!("sve2")` |
NEON is mandatory on all AArch64 (ARM 64-bit) processors. SVE (Scalable Vector Extension) is available on ARM Neoverse V1+ (AWS Graviton3, Fujitsu A64FX).
## Runtime Detection
cuda-wasm detects SIMD capabilities at runtime and selects the best available implementation:
```rust
#[cfg(target_arch = "x86_64")]
{
if is_x86_feature_detected!("avx512f") {
// Use AVX-512 path (16 x f32)
} else if is_x86_feature_detected!("avx2") {
// Use AVX2 path (8 x f32)
} else if is_x86_feature_detected!("sse2") {
// Use SSE2 path (4 x f32)
}
}
#[cfg(target_arch = "aarch64")]
{
// NEON is always available (4 x f32)
}
```
This ensures compiled binaries run correctly on any hardware while exploiting the best available SIMD capabilities.
## Performance Characteristics
### Vector Addition (1M f32 elements, --release)
#### x86_64 (Intel Xeon, typical results)
| Scalar | 1.2 ms | 1.0x | 10 GB/s |
| SSE2 | 0.35 ms | 3.4x | 34 GB/s |
| AVX2 | 0.18 ms | 6.7x | 67 GB/s |
| AVX-512 | 0.10 ms | 12x | 120 GB/s |
#### AArch64 (Apple M2, typical results)
| Scalar | 0.8 ms | 1.0x | 15 GB/s |
| NEON | 0.22 ms | 3.6x | 55 GB/s |
*Note: At larger data sizes, performance becomes memory-bandwidth limited rather than compute-limited.*
### When SIMD Helps Most
- **Compute-bound operations**: Math-heavy kernels with high arithmetic intensity
- **Small to medium data**: Data fits in L1/L2 cache, avoiding memory bottlenecks
- **Embarrassingly parallel**: Independent element-wise operations (add, multiply, etc.)
### When SIMD Helps Less
- **Memory-bound operations**: Large data exceeding cache, limited by DRAM bandwidth
- **Branch-heavy code**: Conditional logic that prevents vectorization
- **Sequential dependencies**: Loop-carried dependencies that cannot be parallelized
## Examples
### benchmark_simd.rs
Run the SIMD benchmark:
```bash
# Basic run
cargo run --example benchmark_simd --release
# With native CPU optimizations (recommended)
RUSTFLAGS="-C target-cpu=native" cargo run --example benchmark_simd --release
```
The benchmark:
1. Detects available SIMD features
2. Runs vector_add at 4 sizes: 1K, 10K, 100K, 1M elements
3. Tests all available SIMD implementations
4. Prints a comparison table with speedup vs. scalar
5. Verifies correctness of SIMD results
## Building with SIMD
### Compile-Time Feature Selection
```bash
# Let the compiler auto-vectorize for the build machine
RUSTFLAGS="-C target-cpu=native" cargo build --release
# Target specific x86_64 features
RUSTFLAGS="-C target-feature=+avx2,+fma" cargo build --release
# Target specific ARM features
RUSTFLAGS="-C target-feature=+neon,+fp-armv8" cargo build --release
```
### Runtime Feature Detection
For distributed binaries that need to run on various hardware:
```rust
// Use #[target_feature] attribute for unsafe SIMD functions
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn compute_avx2(data: &mut [f32]) {
// AVX2 intrinsics here
}
// Runtime dispatch
fn compute(data: &mut [f32]) {
#[cfg(target_arch = "x86_64")]
{
if is_x86_feature_detected!("avx2") {
return unsafe { compute_avx2(data) };
}
}
compute_scalar(data);
}
```
## Integration with cuda-wasm
SIMD acceleration is used as a CPU fallback in cuda-wasm when:
1. **No GPU available**: Server or container without GPU access
2. **Small workloads**: Data too small to justify GPU transfer overhead
3. **Preprocessing**: Data preparation before GPU kernel launch
4. **WebAssembly**: WASM SIMD (128-bit) for browser-based compute