# ToRSh WebGPU Backend
This document provides comprehensive documentation for ToRSh's WebGPU backend implementation, enabling cross-platform GPU acceleration for deep learning workloads on web browsers, desktop systems, and mobile devices.
## Overview
The WebGPU backend leverages the modern WebGPU specification to provide high-performance GPU computing capabilities across multiple platforms with a unified API. Unlike platform-specific backends (CUDA, Metal), WebGPU offers true cross-platform compatibility while maintaining excellent performance.
## Key Features
### 🌐 **Cross-Platform Compatibility**
- **Web Browsers**: Chrome, Firefox, Safari, Edge with WebGPU support
- **Desktop**: Windows (DirectX 12), macOS (Metal), Linux (Vulkan)
- **Mobile**: iOS (Metal), Android (Vulkan)
- **Cloud**: GPU-enabled cloud instances and containers
### 🚀 **High Performance**
- **Modern GPU APIs**: Built on Vulkan, DirectX 12, and Metal
- **Compute Shaders**: WGSL-based compute pipeline system
- **Async Operations**: Native async/await support for all operations
- **Pipeline Caching**: Intelligent caching of compiled compute pipelines
- **Memory Management**: Efficient buffer pooling and memory tracking
### 🔧 **Advanced Features**
- **Automatic Device Selection**: Intelligent adapter enumeration and selection
- **Memory Pool Management**: Configurable memory allocation strategies
- **Kernel Executor**: High-level tensor operation execution
- **Error Handling**: Comprehensive error types and validation
- **Performance Monitoring**: Built-in metrics and profiling support
## Architecture
```
┌───────────────────────────────────────────────────────────────┐
│                     ToRSh WebGPU Backend                      │
├───────────────────────────────────────────────────────────────┤
│ WebGpuBackend                                                 │
│ ├── Configuration and initialization                          │
│ ├── Device management                                         │
│ ├── Backend trait implementation                              │
│ └── Integration with ToRSh ecosystem                          │
├───────────────────────────────────────────────────────────────┤
│ WebGpuDevice                                                  │
│ ├── Adapter selection and device creation                     │
│ ├── Command encoding and submission                           │
│ ├── Memory usage tracking                                     │
│ └── Limits and features querying                              │
├───────────────────────────────────────────────────────────────┤
│ WebGpuMemoryManager                                           │
│ ├── Buffer allocation and deallocation                        │
│ ├── Memory pool management                                    │
│ ├── Host-device transfers                                     │
│ └── Memory statistics tracking                                │
├───────────────────────────────────────────────────────────────┤
│ WebGpuKernelExecutor                                          │
│ ├── Tensor operation execution                                │
│ ├── Pipeline factory integration                              │
│ ├── Custom kernel support                                     │
│ └── Synchronization management                                │
├───────────────────────────────────────────────────────────────┤
│ Pipeline System                                               │
│ ├── ComputePipeline: Compiled compute pipeline wrapper        │
│ ├── PipelineCache: Caching for compiled pipelines             │
│ ├── PipelineFactory: Common operation pipeline creation       │
│ └── PipelineDescriptor: Pipeline configuration                │
├───────────────────────────────────────────────────────────────┤
│ Shader System                                                 │
│ ├── ShaderModule: WGSL shader compilation                     │
│ ├── ComputeShader: High-level compute shader wrapper          │
│ ├── ShaderCache: Compiled shader caching                      │
│ └── WGSL Kernels: Pre-built tensor operation shaders          │
├───────────────────────────────────────────────────────────────┤
│ Buffer Management                                             │
│ ├── WebGpuBuffer: GPU buffer wrapper                          │
│ ├── WebGpuBufferPool: Buffer reuse and pooling                │
│ ├── Mapping operations: Host-device data transfer             │
│ └── Buffer usage validation                                   │
└───────────────────────────────────────────────────────────────┘
```
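The layers compose top-down: the backend owns configuration, the device wraps the adapter, and the kernel executor drives the pipeline, shader, and buffer subsystems. As a rough sketch of the wiring (the constructors follow the examples below; the error type is an assumption for illustration):

```rust
use torsh_backend::webgpu::{
    WebGpuBackend, WebGpuBackendConfig, WebGpuDevice, WebGpuKernelExecutor,
};

// Sketch only: initialize the stack from the top layer down.
async fn init_stack() -> Result<WebGpuKernelExecutor, Box<dyn std::error::Error>> {
    // WebGpuBackend: configuration and initialization
    let mut backend = WebGpuBackend::new(WebGpuBackendConfig::default());
    backend.initialize().await?;

    // WebGpuDevice: adapter selection and device creation
    let device = WebGpuDevice::from_best_adapter(0).await?;

    // WebGpuKernelExecutor: tensor operation execution on top of the
    // pipeline, shader, and buffer subsystems
    Ok(WebGpuKernelExecutor::new(device))
}
```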
## Core Components
### WebGpuBackend
The main backend implementation that integrates with ToRSh's backend system:
```rust
use torsh_backend::webgpu::{WebGpuBackend, WebGpuBackendConfig};

// Create backend with custom configuration
let config = WebGpuBackendConfig {
    adapter_index: None, // Auto-select the best adapter
    power_preference: wgpu::PowerPreference::HighPerformance,
    debug_mode: false,
    max_buffer_size: 1024 * 1024 * 1024, // 1GB
    enable_pipeline_cache: true,
    preferred_workgroup_size: (64, 1, 1),
};

let mut backend = WebGpuBackend::new(config);
backend.initialize().await?;
```
### WebGpuDevice
Device management and GPU interaction:
```rust
use torsh_backend::webgpu::WebGpuDevice;
// Create device from best available adapter
let device = WebGpuDevice::from_best_adapter(0).await?;
// Or from specific adapter
let device = WebGpuDevice::from_adapter_index(0, 0).await?;
// Get device information
println!("Device: {}", device.name());
println!("Type: {:?}", device.device_type());
println!("Memory: {} MB", device.info().memory_total / (1024 * 1024));
```
### Buffer Operations
GPU memory management and data transfer:
```rust
use torsh_backend::{BufferDescriptor, BufferUsage, MemoryLocation};
// Create storage buffer
let descriptor = BufferDescriptor {
    name: "tensor_data".to_string(),
    size: 4096,
    usage: BufferUsage::STORAGE | BufferUsage::COPY_SRC | BufferUsage::COPY_DST,
    memory_location: MemoryLocation::Device,
};
let buffer = WebGpuBuffer::new(device, descriptor, handle)?;
// Write data to buffer
let data: Vec<f32> = vec![1.0, 2.0, 3.0, 4.0];
buffer.write_data(0, &data).await?;
// Read data from buffer
let result: Vec<f32> = buffer.read_data(0, data.len()).await?;
```
### Compute Operations
High-level tensor operations:
```rust
use torsh_backend::webgpu::WebGpuKernelExecutor;
let executor = WebGpuKernelExecutor::new(device);
// Elementwise operations
executor.elementwise_add(&input_a, &input_b, &output).await?;
executor.elementwise_mul(&input_a, &input_b, &output).await?;
// Activation functions
executor.relu(&input, &output).await?;
executor.softmax(&input, &output).await?;
// Matrix operations
executor.matmul(&a, &b, &output, m, n, k).await?;
// Convolution
executor.conv2d(&input, &kernel, &output, params).await?;
```
### Custom Kernels
Execute custom WGSL compute shaders:
```rust
let shader_source = r#"
    @group(0) @binding(0) var<storage, read> input: array<f32>;
    @group(0) @binding(1) var<storage, read_write> output: array<f32>;

    @compute @workgroup_size(64)
    fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
        let index = global_id.x;
        if (index >= arrayLength(&output)) {
            return;
        }
        output[index] = input[index] * 2.0; // Double each element
    }
"#;

executor.execute_custom_kernel(
    shader_source,
    "main",
    &[&input_buffer, &output_buffer],
    (64, 1, 1), // workgroup_size
    (16, 1, 1), // workgroup_count
    None,       // uniform_data
).await?;
```
## Supported Operations
### Elementwise Operations
- **Addition**: Element-wise addition of tensors
- **Multiplication**: Element-wise multiplication of tensors
- **Subtraction**: Element-wise subtraction of tensors
- **Division**: Element-wise division of tensors
### Matrix Operations
- **Matrix Multiplication**: General matrix multiplication (GEMM)
- **Transpose**: Matrix transposition
- **Batch Matrix Multiplication**: Batched GEMM operations
### Neural Network Operations
- **Convolution 2D**: 2D convolution with configurable stride and padding
- **Activation Functions**: ReLU, Softmax, Sigmoid, Tanh
- **Normalization**: Batch normalization, Layer normalization
- **Pooling**: Max pooling, Average pooling
### Reduction Operations
- **Sum**: Reduce sum across dimensions (see the kernel sketch after this list)
- **Mean**: Reduce mean across dimensions
- **Max/Min**: Reduce max/min across dimensions
- **Variance**: Compute variance across dimensions
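Reductions map naturally onto workgroup-shared memory. Below is a minimal sum-reduction sketch expressed through the custom-kernel API shown earlier; the buffer names, the 256-wide workgroup, and the follow-up pass that combines the per-workgroup partials are illustrative assumptions, not the backend's built-in kernel.

```rust
// Sketch: each 256-invocation workgroup reduces its slice of `input`
// to a single partial sum via a tree reduction in shared memory.
let reduce_src = r#"
    @group(0) @binding(0) var<storage, read> input: array<f32>;
    @group(0) @binding(1) var<storage, read_write> partial_sums: array<f32>;

    var<workgroup> scratch: array<f32, 256>;

    @compute @workgroup_size(256)
    fn main(
        @builtin(global_invocation_id) gid: vec3<u32>,
        @builtin(local_invocation_id) lid: vec3<u32>,
        @builtin(workgroup_id) wid: vec3<u32>,
    ) {
        // Each invocation loads one element (zero past the end of the input).
        var value = 0.0;
        if (gid.x < arrayLength(&input)) {
            value = input[gid.x];
        }
        scratch[lid.x] = value;
        workgroupBarrier();

        // Halve the active range each step, summing pairs in shared memory.
        for (var stride = 128u; stride > 0u; stride = stride / 2u) {
            if (lid.x < stride) {
                scratch[lid.x] = scratch[lid.x] + scratch[lid.x + stride];
            }
            workgroupBarrier();
        }

        // One partial sum per workgroup; a second dispatch (or a host-side
        // sum over the partials) produces the final scalar.
        if (lid.x == 0u) {
            partial_sums[wid.x] = scratch[0u];
        }
    }
"#;

executor.execute_custom_kernel(
    reduce_src,
    "main",
    &[&input_buffer, &partials_buffer],
    (256, 1, 1),            // workgroup_size
    (num_workgroups, 1, 1), // workgroup_count: ceil(element_count / 256)
    None,                   // uniform_data
).await?;
```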
### Utility Operations
- **Copy**: Buffer-to-buffer copying
- **Fill**: Fill buffer with constant value
- **Cast**: Data type conversion
- **Reshape**: Tensor shape manipulation
## Performance Optimization
### Workgroup Size Optimization
WebGPU performance heavily depends on optimal workgroup sizes:
```rust
// Get optimal workgroup size for problem
let optimal_size = device.optimal_workgroup_size(element_count);
// Manual workgroup size tuning
let workgroup_sizes = [(64, 1, 1), (128, 1, 1), (256, 1, 1)];
let best_size = benchmark_workgroup_sizes(&workgroup_sizes);
```
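The `benchmark_workgroup_sizes` helper above is not part of the backend; a minimal sketch of one way to implement it, using only the custom-kernel and synchronization APIs shown in this document (wall-clock timing after a `synchronize()` is a coarse but serviceable proxy for GPU time, and all parameter names are illustrative):

```rust
use std::time::Instant;

// Sketch: dispatch the same kernel at each candidate workgroup size and
// keep the fastest.
async fn benchmark_workgroup_sizes(
    executor: &WebGpuKernelExecutor,
    shader_source: &str,
    buffers: &[&WebGpuBuffer],
    candidates: &[(u32, u32, u32)],
    total_invocations: u32,
) -> Result<(u32, u32, u32), WebGpuError> {
    let mut best = (candidates[0], f64::INFINITY);
    for &size in candidates {
        // Round up so every element is covered by some invocation.
        let count = (total_invocations + size.0 - 1) / size.0;
        let start = Instant::now();
        executor
            .execute_custom_kernel(shader_source, "main", buffers, size, (count, 1, 1), None)
            .await?;
        executor.synchronize().await?;
        let elapsed = start.elapsed().as_secs_f64();
        if elapsed < best.1 {
            best = (size, elapsed);
        }
    }
    Ok(best.0)
}
```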
### Memory Access Patterns
Optimize memory access for better performance:
```rust
// Prefer coalesced memory access
// Good: Sequential access pattern
output[index] = input[index] + bias[index];
// Bad: Random access pattern
output[indices[index]] = input[index] + bias[random_idx];
```
### Pipeline Caching
Leverage pipeline caching for repeated operations:
```rust
// Enable pipeline caching (enabled by default)
let config = WebGpuBackendConfig {
    enable_pipeline_cache: true,
    ..Default::default()
};
// Check cache statistics
let stats = executor.pipeline_stats();
println!("Cached pipelines: {}", stats.pipeline_count);
```
### Async Operations
Use async operations for better CPU utilization:
```rust
// Execute multiple operations concurrently
let tasks = vec![
    executor.elementwise_add(&a1, &b1, &c1),
    executor.elementwise_add(&a2, &b2, &c2),
    executor.elementwise_add(&a3, &b3, &c3),
];
futures::future::try_join_all(tasks).await?;
```
## Platform-Specific Considerations
### Web Browsers
**Requirements:**
- WebGPU-enabled browser (Chrome/Edge 113+; recent Firefox and Safari releases, or behind a feature flag in older versions)
- HTTPS connection (WebGPU requires secure context)
- Sufficient GPU memory
**Limitations:**
- Buffer size limits (the spec's default `maxBufferSize` is 256MB; larger limits must be requested and are adapter-dependent)
- No shared memory between workers
- Browser-specific performance variations
**Optimization:**
```rust
// Web-optimized configuration
let config = WebGpuBackendConfig {
    max_buffer_size: 512 * 1024 * 1024, // 512MB for web
    preferred_workgroup_size: (64, 1, 1),
    ..Default::default()
};
```
### Desktop (Native)
**Windows (DirectX 12):**
- Requires Windows 10 version 1903+
- DirectX 12 compatible GPU
- Latest GPU drivers
**macOS (Metal):**
- macOS 10.15+ recommended
- Metal-compatible GPU
- Integrated and discrete GPUs supported
**Linux (Vulkan):**
- Vulkan 1.1+ support
- Mesa drivers 21.0+ or proprietary drivers
- X11 or Wayland display server
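On native targets the backend rides on wgpu, which normally picks the platform API automatically; it can also be pinned explicitly. A hedged sketch (the cfg-based choice is illustrative, and `Instance::new` takes the descriptor by value or by reference depending on the wgpu release):

```rust
// Pin the native API per platform through wgpu.
let backends = if cfg!(target_os = "windows") {
    wgpu::Backends::DX12   // DirectX 12 on Windows 10 1903+
} else if cfg!(target_os = "macos") {
    wgpu::Backends::METAL  // Metal on macOS
} else {
    wgpu::Backends::VULKAN // Vulkan on Linux
};

let instance = wgpu::Instance::new(wgpu::InstanceDescriptor {
    backends,
    ..Default::default()
});
```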
### Mobile Platforms
**iOS (Metal):**
- iOS 13+ required
- A12 Bionic or newer recommended
- Memory constraints on older devices
**Android (Vulkan):**
- Android 7.0+ (API level 24)
- Vulkan-capable GPU
- Device-specific driver variations
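The backend configuration can be tightened for these constraints. A hedged counterpart to the web-optimized config above (the 256MB cap and low-power preference are illustrative choices, not benchmarked recommendations):

```rust
// Mobile-oriented configuration sketch for memory- and battery-constrained
// devices.
let config = WebGpuBackendConfig {
    max_buffer_size: 256 * 1024 * 1024, // 256MB cap for constrained devices
    power_preference: wgpu::PowerPreference::LowPower, // favor battery life
    preferred_workgroup_size: (64, 1, 1),
    ..Default::default()
};
```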
## Error Handling
Comprehensive error handling for robust applications:
```rust
use torsh_backend::webgpu::WebGpuError;

match executor.elementwise_add(&a, &b, &output).await {
    Ok(()) => println!("Operation succeeded"),
    Err(WebGpuError::DeviceLost(msg)) => {
        eprintln!("Device lost: {}", msg);
        // Reinitialize the device
    }
    Err(WebGpuError::OutOfMemory(requested, available)) => {
        eprintln!(
            "Out of memory: requested {} bytes, {} available",
            requested, available
        );
        // Reduce batch size or free memory
    }
    Err(WebGpuError::InvalidWorkgroupSize(size)) => {
        eprintln!("Invalid workgroup size: {:?}", size);
        // Adjust workgroup configuration
    }
    Err(e) => eprintln!("Other error: {}", e),
}
```
## Best Practices
### 1. Adapter Selection
```rust
// Enumerate adapters to choose the best one
let adapters = torsh_backend::webgpu::enumerate_adapters().await?;
for (i, adapter) in adapters.iter().enumerate() {
    let info = torsh_backend::webgpu::get_adapter_info(adapter);
    println!("Adapter {}: {} ({:?})", i, info.name, info.device_type);
}

// Select a high-performance discrete GPU if available; the fallback must
// be awaited outside a closure, so a match is used instead of unwrap_or_else
let adapter = match adapters
    .into_iter()
    .find(|a| a.get_info().device_type == wgpu::DeviceType::DiscreteGpu)
{
    Some(discrete) => discrete,
    None => torsh_backend::webgpu::get_best_adapter().await?,
};
```
### 2. Memory Management
```rust
// Use buffer pools for frequent allocations
let pool = WebGpuBufferPool::new(device);
// Reuse buffers when possible
let buffer1 = pool.get_buffer(descriptor.clone())?;
// ... use buffer1 ...
pool.return_buffer(buffer1);
// Monitor memory usage
let usage = device.memory_usage();
if usage.allocated_bytes > threshold {
    // Implement memory pressure handling
}
```
### 3. Batch Operations
```rust
// Batch multiple small operations
let operations = vec![
    (input_a1, input_b1, output1),
    (input_a2, input_b2, output2),
    (input_a3, input_b3, output3),
];

// Execute in batches to reduce per-submission overhead
for batch in operations.chunks(batch_size) {
    for (a, b, out) in batch {
        executor.elementwise_add(a, b, out).await?;
    }
    executor.synchronize().await?;
}
```
### 4. Error Recovery
```rust
use std::time::Duration;

// Implement retry logic for transient errors
async fn robust_operation(
    executor: &WebGpuKernelExecutor,
    a: &WebGpuBuffer,
    b: &WebGpuBuffer,
    output: &WebGpuBuffer,
) -> Result<(), WebGpuError> {
    const MAX_ATTEMPTS: u32 = 3;
    let mut attempts = 0;
    loop {
        match executor.elementwise_add(a, b, output).await {
            Ok(()) => return Ok(()),
            Err(WebGpuError::DeviceLost(_)) if attempts < MAX_ATTEMPTS => {
                attempts += 1;
                // Reinitialize the device here, then retry after a short backoff
                tokio::time::sleep(Duration::from_millis(100)).await;
            }
            Err(e) => return Err(e),
        }
    }
}
```
## Debugging and Profiling
### Enable Debug Mode
```rust
let config = WebGpuBackendConfig {
    debug_mode: true,
    ..Default::default()
};
```
### Performance Monitoring
```rust
// Monitor pipeline cache growth: fewer compiled pipelines relative to the
// number of operations executed means more cache reuse
let stats = executor.pipeline_stats();
println!("Compiled pipelines: {} across {} operations",
         stats.pipeline_count, total_operations);

// Track memory usage over time
let usage = device.memory_usage();
println!("Memory efficiency: {:.2}%",
         usage.allocated_bytes as f64 / usage.peak_allocated_bytes as f64 * 100.0);
```
### Validation and Testing
```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[tokio::test]
    async fn test_elementwise_operations() {
        if !torsh_backend::webgpu::is_available() {
            return; // Skip test if WebGPU is not available
        }
        let device = WebGpuDevice::from_best_adapter(0).await.unwrap();
        // ... test implementation ...
    }
}
```
## Integration with ToRSh
The WebGPU backend integrates seamlessly with ToRSh's tensor operations:
```rust
use torsh_backend::{BackendBuilder, BackendType};
use torsh_tensor::{DType, Tensor};

// Create a tensor on the WebGPU backend
let backend = BackendBuilder::new()
    .backend_type(BackendType::WebGpu)
    .build()?;
let device = backend.default_device()?;
let tensor = Tensor::zeros([1024, 1024], DType::F32, &device)?;

// Operations automatically use the WebGPU backend; the operands are
// borrowed so the tensor can be reused afterwards
let result = &tensor + &tensor; // Uses WebGPU elementwise addition
let output = result.relu();     // Uses WebGPU ReLU activation
```
## Future Enhancements
### Planned Features
1. **WebGPU 2.0 Support**: Enhanced features and performance
2. **Multi-GPU Support**: Distributed computing across multiple adapters
3. **Advanced Memory Management**: Unified memory and memory compression
4. **Optimized Kernels**: Hand-tuned WGSL kernels for common operations
5. **Automatic Tuning**: ML-based performance optimization
6. **Interoperability**: Integration with WebAssembly and Web Workers
### Research Areas
1. **Adaptive Workgroup Sizing**: Dynamic workgroup size optimization
2. **Memory Access Optimization**: Automatic memory layout optimization
3. **Cross-Platform Profiling**: Unified performance analysis tools
4. **Energy Efficiency**: Power-aware operation scheduling
## Conclusion
The ToRSh WebGPU backend provides a modern, cross-platform solution for GPU-accelerated deep learning. Its comprehensive feature set, excellent performance, and broad compatibility make it an ideal choice for applications targeting multiple platforms, from web browsers to high-performance computing environments.
By leveraging WebGPU's modern design and ToRSh's robust architecture, developers can build high-performance deep learning applications that run seamlessly across the entire spectrum of computing devices, from mobile phones to data center GPUs.