GPU backend using wgpu (Vulkan/Metal/DX12/WebGPU)
This backend provides GPU-accelerated compute for large-scale operations. It uses wgpu for cross-platform GPU access and WGSL compute shaders.
§Performance
The GPU backend is optimal for very large workloads (>100K elements for reductions,
1000×1000 for matrix operations), where transfer overhead is amortized.
Expected speedups vs SIMD:
- Matrix multiplication (large): 10-50x
- Reductions (large): 5-20x
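The size thresholds above suggest a simple dispatch heuristic. The sketch below is illustrative only: `Backend` and the `pick_backend_*` functions are hypothetical names, not part of this crate's API, and the cutoffs just mirror the guidance in this section.

```rust
/// Backend choice for a workload (hypothetical type, not in the crate).
#[derive(Debug, PartialEq)]
enum Backend {
    Simd,
    Gpu,
}

/// GPU pays off past roughly 100K elements for reductions, where the
/// host-to-device transfer overhead is amortized.
fn pick_backend_for_reduction(elements: usize) -> Backend {
    if elements > 100_000 { Backend::Gpu } else { Backend::Simd }
}

/// GPU pays off around 1000x1000 and up for matrix operations.
fn pick_backend_for_matmul(rows: usize, cols: usize) -> Backend {
    if rows >= 1000 && cols >= 1000 { Backend::Gpu } else { Backend::Simd }
}

fn main() {
    assert_eq!(pick_backend_for_reduction(1_000), Backend::Simd);
    assert_eq!(pick_backend_for_reduction(1_000_000), Backend::Gpu);
    assert_eq!(pick_backend_for_matmul(2048, 2048), Backend::Gpu);
    println!("ok");
}
```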
§Architecture
- Device initialization is lazy (first GPU operation)
- Compute shaders written in WGSL
- Asynchronous execution with pollster for blocking
- Automatic fallback to CPU if GPU unavailable
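The lazy-initialization and CPU-fallback points can be sketched with std's `OnceLock`: the device is created only on the first GPU operation, and a failed init degrades to the CPU path instead of panicking. `FakeDevice` and `try_init_device` are stand-ins for the real wgpu adapter/device request, not part of this crate's API.

```rust
use std::sync::OnceLock;

/// Stand-in for a real wgpu device handle (hypothetical).
struct FakeDevice {
    name: String,
}

/// In the real backend this would request a wgpu adapter/device
/// (blocking on the async call via pollster) and return None when
/// no GPU is available.
fn try_init_device() -> Option<FakeDevice> {
    Some(FakeDevice { name: "stub".to_string() })
}

// Initialized at most once, on the first GPU operation.
static DEVICE: OnceLock<Option<FakeDevice>> = OnceLock::new();

/// Returns the shared device, initializing it lazily on first use.
fn device() -> Option<&'static FakeDevice> {
    DEVICE.get_or_init(try_init_device).as_ref()
}

fn main() {
    match device() {
        Some(d) => println!("gpu path: {}", d.name),
        None => println!("cpu fallback"),
    }
}
```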
§Memory Hierarchy Abstractions
- TensorView - Structured view into GPU memory with shape/stride metadata
- PartitionView - Tiling strategy for efficient GPU work distribution
Based on cuda-tile-behavior.md Section 3.2.
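A minimal CPU-side sketch of the TensorView/PartitionView idea: a shape plus row-major strides mapping a (row, col) index to an offset in a flat buffer, and a ceiling-division tile count so edge tiles are included. `View2D` is a hypothetical illustration; the real types also carry GPU buffer handles.

```rust
/// Hypothetical, CPU-only analogue of a 2D TensorView.
struct View2D {
    shape: [usize; 2],
    strides: [usize; 2], // in elements, not bytes
}

impl View2D {
    /// Row-major view over `rows x cols` elements.
    fn row_major(rows: usize, cols: usize) -> Self {
        Self { shape: [rows, cols], strides: [cols, 1] }
    }

    /// Linear offset of element (r, c) in the flat buffer.
    fn offset(&self, r: usize, c: usize) -> usize {
        r * self.strides[0] + c * self.strides[1]
    }

    /// Number of tiles a PartitionView-style tiling would produce,
    /// rounding up so partial edge tiles are counted.
    fn tile_count(&self, tile: usize) -> usize {
        let tiles_r = (self.shape[0] + tile - 1) / tile;
        let tiles_c = (self.shape[1] + tile - 1) / tile;
        tiles_r * tiles_c
    }
}

fn main() {
    let v = View2D::row_major(4, 6);
    assert_eq!(v.offset(2, 3), 15); // 2*6 + 3
    assert_eq!(v.tile_count(4), 2); // ceil(4/4) * ceil(6/4) = 1 * 2
    println!("ok");
}
```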
Re-exports§
pub use wgpu;
Modules§
- runtime
- Cross-platform async runtime helpers for GPU operations.
- shaders
- WGSL compute shaders for GPU operations
Structs§
- BufferId
- Unique identifier for a buffer in a batch
- GpuBackend
- GPU backend for compute operations (native only, uses sync wrappers)
- GpuCommandBatch
- Command batch for async GPU execution
- GpuDevice
- GPU device manager
- GpuDevicePool
- Pool of GPU devices for multi-GPU workloads
- GpuMatmulCache
- PMAT-322: Cached matmul with persistent weight buffers for LLM inference (pipeline, pre-uploaded weight buffers, and persistent I/O buffers)
- MaxOp
- Max reduction operation
- MinOp
- Min reduction operation
- PartitionView
- A tiling strategy over a TensorView.
- QkvLoRA
- PMAT-324: Optional LoRA buffers for the Q/K/V projections in a layer's forward pass.
- SumOp
- Sum reduction operation
- TensorView
- A view into a contiguous memory region with shape and stride information.
- TileInfo
- Information about a single tile within a partition.
- WgslForwardPass
- PMAT-324: GPU-resident transformer layer state for the WGSL forward-pass shaders. All buffers persist across tokens; only the input/output buffers change per step.
Enums§
- MemoryLayout
- Memory layout for tensor storage
Constants§
- TILE_SIZE
- Default tile size for 2D reductions (matches the GPU workgroup size)
Traits§
- ReduceOp
- Reduction operation trait for generic tile reduction
Functions§
- tiled_max_2d
- Convenience function for tiled max reduction
- tiled_min_2d
- Convenience function for tiled min reduction
- tiled_reduce_2d
- Perform tiled reduction on 2D data (CPU fallback)
- tiled_reduce_partial
- Compute partial tile results for verification
- tiled_sum_2d
- Convenience function for tiled sum reduction
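The shape of the tiled reduction, in the style of the CPU fallback, can be sketched in plain Rust: reduce each tile to a partial result, then combine the partials (the partial stage is what a verification pass can inspect). The function names and the tile size of 8 here are illustrative; they are not the crate's actual signatures, and the real TILE_SIZE matches the GPU workgroup size.

```rust
// Illustrative tile size; the crate's constant matches its GPU workgroup size.
const TILE: usize = 8;

/// Per-tile partial sums for a `rows x cols` row-major matrix
/// (hypothetical analogue of the partial-result stage).
fn tiled_sum_partials(data: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    let mut partials = Vec::new();
    for tr in (0..rows).step_by(TILE) {
        for tc in (0..cols).step_by(TILE) {
            let mut acc = 0.0f32;
            // Clamp tile bounds so edge tiles are handled correctly.
            for r in tr..(tr + TILE).min(rows) {
                for c in tc..(tc + TILE).min(cols) {
                    acc += data[r * cols + c];
                }
            }
            partials.push(acc);
        }
    }
    partials
}

/// Full reduction: combine the per-tile partials.
fn tiled_sum(data: &[f32], rows: usize, cols: usize) -> f32 {
    tiled_sum_partials(data, rows, cols).iter().sum()
}

fn main() {
    let data = vec![1.0f32; 10 * 10]; // 100 ones
    // 10x10 with 8x8 tiles -> ceil(10/8)^2 = 4 partial tiles.
    assert_eq!(tiled_sum_partials(&data, 10, 10).len(), 4);
    assert_eq!(tiled_sum(&data, 10, 10), 100.0);
    println!("ok");
}
```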
Type Aliases§
- PipelineCache
- Pipeline cache keyed by shader source pointer address.