torsh-backend 0.1.2

# ToRSh WebGPU Backend

This document provides comprehensive documentation for ToRSh's WebGPU backend implementation, enabling cross-platform GPU acceleration for deep learning workloads on web browsers, desktop systems, and mobile devices.

## Overview

The WebGPU backend leverages the modern WebGPU specification to provide high-performance GPU computing capabilities across multiple platforms with a unified API. Unlike platform-specific backends (CUDA, Metal), WebGPU offers true cross-platform compatibility while maintaining excellent performance.

## Key Features

### 🌐 **Cross-Platform Compatibility**
- **Web Browsers**: Chrome, Firefox, Safari, Edge with WebGPU support
- **Desktop**: Windows (DirectX 12), macOS (Metal), Linux (Vulkan)
- **Mobile**: iOS (Metal), Android (Vulkan)
- **Cloud**: GPU-enabled cloud instances and containers

### 🚀 **High Performance**
- **Modern GPU APIs**: Built on Vulkan, DirectX 12, and Metal
- **Compute Shaders**: WGSL-based compute pipeline system
- **Async Operations**: Native async/await support for all operations
- **Pipeline Caching**: Intelligent caching of compiled compute pipelines
- **Memory Management**: Efficient buffer pooling and memory tracking

### 🔧 **Advanced Features**
- **Automatic Device Selection**: Intelligent adapter enumeration and selection
- **Memory Pool Management**: Configurable memory allocation strategies
- **Kernel Executor**: High-level tensor operation execution
- **Error Handling**: Comprehensive error types and validation
- **Performance Monitoring**: Built-in metrics and profiling support

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                    ToRSh WebGPU Backend                        │
├─────────────────────────────────────────────────────────────────┤
│  WebGpuBackend                                                 │
│  ├── Configuration and initialization                           │
│  ├── Device management                                          │
│  ├── Backend trait implementation                               │
│  └── Integration with ToRSh ecosystem                           │
├─────────────────────────────────────────────────────────────────┤
│  WebGpuDevice                                                  │
│  ├── Adapter selection and device creation                      │
│  ├── Command encoding and submission                            │
│  ├── Memory usage tracking                                      │
│  └── Limits and features querying                               │
├─────────────────────────────────────────────────────────────────┤
│  WebGpuMemoryManager                                           │
│  ├── Buffer allocation and deallocation                         │
│  ├── Memory pool management                                     │
│  ├── Host-device transfers                                      │
│  └── Memory statistics tracking                                 │
├─────────────────────────────────────────────────────────────────┤
│  WebGpuKernelExecutor                                          │
│  ├── Tensor operation execution                                 │
│  ├── Pipeline factory integration                               │
│  ├── Custom kernel support                                      │
│  └── Synchronization management                                 │
├─────────────────────────────────────────────────────────────────┤
│  Pipeline System                                               │
│  ├── ComputePipeline: Compiled compute pipeline wrapper         │
│  ├── PipelineCache: Caching for compiled pipelines             │
│  ├── PipelineFactory: Common operation pipeline creation        │
│  └── PipelineDescriptor: Pipeline configuration                 │
├─────────────────────────────────────────────────────────────────┤
│  Shader System                                                 │
│  ├── ShaderModule: WGSL shader compilation                      │
│  ├── ComputeShader: High-level compute shader wrapper          │
│  ├── ShaderCache: Compiled shader caching                       │
│  └── WGSL Kernels: Pre-built tensor operation shaders          │
├─────────────────────────────────────────────────────────────────┤
│  Buffer Management                                             │
│  ├── WebGpuBuffer: GPU buffer wrapper                          │
│  ├── WebGpuBufferPool: Buffer reuse and pooling               │
│  ├── Mapping operations: Host-device data transfer             │
│  └── Buffer usage validation                                    │
└─────────────────────────────────────────────────────────────────┘
```

## Core Components

### WebGpuBackend

The main backend implementation that integrates with ToRSh's backend system:

```rust
use torsh_backend::webgpu::{WebGpuBackend, WebGpuBackendConfig};

// Create backend with custom configuration
let config = WebGpuBackendConfig {
    adapter_index: None, // Auto-select best adapter
    power_preference: wgpu::PowerPreference::HighPerformance,
    debug_mode: false,
    max_buffer_size: 1024 * 1024 * 1024, // 1GB
    enable_pipeline_cache: true,
    preferred_workgroup_size: (64, 1, 1),
};

let mut backend = WebGpuBackend::new(config);
backend.initialize().await?;
```

### WebGpuDevice

Device management and GPU interaction:

```rust
use torsh_backend::webgpu::WebGpuDevice;

// Create device from best available adapter
let device = WebGpuDevice::from_best_adapter(0).await?;

// Or from specific adapter
let device = WebGpuDevice::from_adapter_index(0, 0).await?;

// Get device information
println!("Device: {}", device.name());
println!("Type: {:?}", device.device_type());
println!("Memory: {} MB", device.info().memory_total / (1024 * 1024));
```

### Buffer Operations

GPU memory management and data transfer:

```rust
use torsh_backend::{BufferDescriptor, BufferUsage, MemoryLocation};

// Create storage buffer
let descriptor = BufferDescriptor {
    name: "tensor_data".to_string(),
    size: 4096,
    usage: BufferUsage::STORAGE | BufferUsage::COPY_SRC | BufferUsage::COPY_DST,
    memory_location: MemoryLocation::Device,
};

let buffer = WebGpuBuffer::new(device, descriptor, handle)?;

// Write data to buffer
let data: Vec<f32> = vec![1.0, 2.0, 3.0, 4.0];
buffer.write_data(0, &data).await?;

// Read data from buffer
let result: Vec<f32> = buffer.read_data(0, data.len()).await?;
```

### Compute Operations

High-level tensor operations:

```rust
use torsh_backend::webgpu::WebGpuKernelExecutor;

let executor = WebGpuKernelExecutor::new(device);

// Elementwise operations
executor.elementwise_add(&input_a, &input_b, &output).await?;
executor.elementwise_mul(&input_a, &input_b, &output).await?;

// Activation functions
executor.relu(&input, &output).await?;
executor.softmax(&input, &output).await?;

// Matrix operations
executor.matmul(&a, &b, &output, m, n, k).await?;

// Convolution
executor.conv2d(&input, &kernel, &output, params).await?;
```

### Custom Kernels

Execute custom WGSL compute shaders:

```rust
let shader_source = r#"
@group(0) @binding(0) var<storage, read> input: array<f32>;
@group(0) @binding(1) var<storage, read_write> output: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
    let index = global_id.x;
    if (index >= arrayLength(&output)) {
        return;
    }
    output[index] = input[index] * 2.0; // Double each element
}
"#;

executor.execute_custom_kernel(
    shader_source,
    "main",
    &[&input_buffer, &output_buffer],
    (64, 1, 1),   // workgroup_size
    (16, 1, 1),   // workgroup_count
    None,         // uniform_data
).await?;
```

## Supported Operations

### Elementwise Operations
- **Addition**: Element-wise addition of tensors
- **Multiplication**: Element-wise multiplication of tensors
- **Subtraction**: Element-wise subtraction of tensors
- **Division**: Element-wise division of tensors

### Matrix Operations
- **Matrix Multiplication**: General matrix multiplication (GEMM)
- **Transpose**: Matrix transposition
- **Batch Matrix Multiplication**: Batched GEMM operations

### Neural Network Operations
- **Convolution 2D**: 2D convolution with configurable stride and padding
- **Activation Functions**: ReLU, Softmax, Sigmoid, Tanh
- **Normalization**: Batch normalization, Layer normalization
- **Pooling**: Max pooling, Average pooling

### Reduction Operations
- **Sum**: Reduce sum across dimensions
- **Mean**: Reduce mean across dimensions
- **Max/Min**: Reduce max/min across dimensions
- **Variance**: Compute variance across dimensions

### Utility Operations
- **Copy**: Buffer-to-buffer copying
- **Fill**: Fill buffer with constant value
- **Cast**: Data type conversion
- **Reshape**: Tensor shape manipulation

## Performance Optimization

### Workgroup Size Optimization

WebGPU performance heavily depends on optimal workgroup sizes:

```rust
// Get optimal workgroup size for problem
let optimal_size = device.optimal_workgroup_size(element_count);

// Manual workgroup size tuning
let workgroup_sizes = [(64, 1, 1), (128, 1, 1), (256, 1, 1)];
let best_size = benchmark_workgroup_sizes(&workgroup_sizes);
```

### Memory Access Patterns

Optimize memory access for better performance:

```rust
// Prefer coalesced memory access
// Good: Sequential access pattern
output[index] = input[index] + bias[index];

// Bad: Random access pattern
output[indices[index]] = input[index] + bias[random_idx];
```

### Pipeline Caching

Leverage pipeline caching for repeated operations:

```rust
// Enable pipeline caching (default)
let config = WebGpuBackendConfig {
    enable_pipeline_cache: true,
    ..Default::default()
};

// Check cache statistics
let stats = executor.pipeline_stats();
println!("Cached pipelines: {}", stats.pipeline_count);
```

### Async Operations

Use async operations for better CPU utilization:

```rust
// Execute multiple operations concurrently
let tasks = vec![
    executor.elementwise_add(&a1, &b1, &c1),
    executor.elementwise_add(&a2, &b2, &c2),
    executor.elementwise_add(&a3, &b3, &c3),
];

futures::future::try_join_all(tasks).await?;
```

## Platform-Specific Considerations

### Web Browsers

**Requirements:**
- WebGPU-enabled browser (Chrome 113+, Firefox 113+, Safari 16.4+)
- HTTPS connection (WebGPU requires secure context)
- Sufficient GPU memory

**Limitations:**
- Buffer size limits (typically 1GB)
- No shared memory between workers
- Browser-specific performance variations

**Optimization:**
```rust
// Web-optimized configuration
let config = WebGpuBackendConfig {
    max_buffer_size: 512 * 1024 * 1024, // 512MB for web
    preferred_workgroup_size: (64, 1, 1),
    ..Default::default()
};
```

### Desktop (Native)

**Windows (DirectX 12):**
- Requires Windows 10 version 1903+
- DirectX 12 compatible GPU
- Latest GPU drivers

**macOS (Metal):**
- macOS 10.15+ recommended
- Metal-compatible GPU
- Integrated and discrete GPUs supported

**Linux (Vulkan):**
- Vulkan 1.1+ support
- Mesa drivers 21.0+ or proprietary drivers
- X11 or Wayland display server

### Mobile Platforms

**iOS (Metal):**
- iOS 13+ required
- A12 Bionic or newer recommended
- Memory constraints on older devices

**Android (Vulkan):**
- Android 7.0+ (API level 24)
- Vulkan-capable GPU
- Device-specific driver variations

## Error Handling

Comprehensive error handling for robust applications:

```rust
use torsh_backend::webgpu::WebGpuError;

match executor.elementwise_add(&a, &b, &output).await {
    Ok(()) => println!("Operation succeeded"),
    Err(WebGpuError::DeviceLost(msg)) => {
        eprintln!("Device lost: {}", msg);
        // Reinitialize device
    },
    Err(WebGpuError::OutOfMemory(requested, available)) => {
        eprintln!("Out of memory: requested {} bytes, {} available", 
                 requested, available);
        // Reduce batch size or free memory
    },
    Err(WebGpuError::InvalidWorkgroupSize(size)) => {
        eprintln!("Invalid workgroup size: {:?}", size);
        // Adjust workgroup configuration
    },
    Err(e) => eprintln!("Other error: {}", e),
}
```

## Best Practices

### 1. Adapter Selection

```rust
// Enumerate adapters to choose the best one
let adapters = torsh_backend::webgpu::enumerate_adapters().await?;
for (i, adapter) in adapters.iter().enumerate() {
    let info = torsh_backend::webgpu::get_adapter_info(adapter);
    println!("Adapter {}: {} ({:?})", i, info.name, info.device_type);
}

// Select high-performance discrete GPU if available
let adapter = adapters.into_iter()
    .find(|a| a.get_info().device_type == wgpu::DeviceType::DiscreteGpu)
    .unwrap_or_else(|| torsh_backend::webgpu::get_best_adapter().await.unwrap());
```

### 2. Memory Management

```rust
// Use buffer pools for frequent allocations
let pool = WebGpuBufferPool::new(device);

// Reuse buffers when possible
let buffer1 = pool.get_buffer(descriptor.clone())?;
// ... use buffer1 ...
pool.return_buffer(buffer1);

// Monitor memory usage
let usage = device.memory_usage();
if usage.allocated_bytes > threshold {
    // Implement memory pressure handling
}
```

### 3. Batch Operations

```rust
// Batch multiple small operations
let operations = vec![
    (input_a1, input_b1, output1),
    (input_a2, input_b2, output2),
    (input_a3, input_b3, output3),
];

// Execute in batches to reduce overhead
for batch in operations.chunks(batch_size) {
    for (a, b, out) in batch {
        executor.elementwise_add(a, b, out).await?;
    }
    executor.synchronize().await?;
}
```

### 4. Error Recovery

```rust
// Implement retry logic for transient errors
async fn robust_operation(executor: &WebGpuKernelExecutor, 
                         a: &WebGpuBuffer, b: &WebGpuBuffer, 
                         output: &WebGpuBuffer) -> Result<(), WebGpuError> {
    let mut attempts = 0;
    const MAX_ATTEMPTS: u32 = 3;
    
    loop {
        match executor.elementwise_add(a, b, output).await {
            Ok(()) => return Ok(()),
            Err(WebGpuError::DeviceLost(_)) if attempts < MAX_ATTEMPTS => {
                attempts += 1;
                // Reinitialize device and retry
                tokio::time::sleep(Duration::from_millis(100)).await;
            },
            Err(e) => return Err(e),
        }
    }
}
```

## Debugging and Profiling

### Enable Debug Mode

```rust
let config = WebGpuBackendConfig {
    debug_mode: true,
    ..Default::default()
};
```

### Performance Monitoring

```rust
// Monitor pipeline cache performance
let stats = executor.pipeline_stats();
println!("Cache hit ratio: {:.2}%", 
         stats.pipeline_count as f64 / total_operations as f64 * 100.0);

// Track memory usage over time
let usage = device.memory_usage();
println!("Memory efficiency: {:.2}%", 
         usage.allocated_bytes as f64 / usage.peak_allocated_bytes as f64 * 100.0);
```

### Validation and Testing

```rust
#[cfg(test)]
mod tests {
    use super::*;
    
    #[tokio::test]
    async fn test_elementwise_operations() {
        if !torsh_backend::webgpu::is_available() {
            return; // Skip test if WebGPU not available
        }
        
        let device = WebGpuDevice::from_best_adapter(0).await.unwrap();
        // ... test implementation ...
    }
}
```

## Integration with ToRSh

The WebGPU backend integrates seamlessly with ToRSh's tensor operations:

```rust
use torsh_tensor::Tensor;
use torsh_backend::BackendBuilder;

// Create tensor with WebGPU backend
let backend = BackendBuilder::new()
    .backend_type(BackendType::WebGpu)
    .build()?;

let device = backend.default_device()?;
let tensor = Tensor::zeros([1024, 1024], DType::F32, &device)?;

// Operations automatically use WebGPU backend
let result = tensor + tensor; // Uses WebGPU elementwise addition
let output = result.relu();   // Uses WebGPU ReLU activation
```

## Future Enhancements

### Planned Features

1. **WebGPU 2.0 Support**: Enhanced features and performance
2. **Multi-GPU Support**: Distributed computing across multiple adapters
3. **Advanced Memory Management**: Unified memory and memory compression
4. **Optimized Kernels**: Hand-tuned WGSL kernels for common operations
5. **Automatic Tuning**: ML-based performance optimization
6. **Interoperability**: Integration with WebAssembly and Web Workers

### Research Areas

1. **Adaptive Workgroup Sizing**: Dynamic workgroup size optimization
2. **Memory Access Optimization**: Automatic memory layout optimization
3. **Cross-Platform Profiling**: Unified performance analysis tools
4. **Energy Efficiency**: Power-aware operation scheduling

## Conclusion

The ToRSh WebGPU backend provides a modern, cross-platform solution for GPU-accelerated deep learning. Its comprehensive feature set, excellent performance, and broad compatibility make it an ideal choice for applications targeting multiple platforms, from web browsers to high-performance computing environments.

By leveraging WebGPU's modern design and ToRSh's robust architecture, developers can build high-performance deep learning applications that run seamlessly across the entire spectrum of computing devices, from mobile phones to data center GPUs.