GPU backend using wgpu (Vulkan/Metal/DX12/WebGPU)
This backend provides GPU-accelerated compute for large-scale operations. It uses wgpu for cross-platform GPU access and WGSL compute shaders.
§Performance
The GPU backend is optimal for very large workloads (>100K elements for reductions,
1000×1000 for matrix operations), where transfer overhead is amortized.
Expected speedups vs SIMD:
- Matrix multiplication (large): 10-50x
- Reductions (large): 5-20x
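The size thresholds above suggest a simple dispatch heuristic. The sketch below is illustrative only: `Backend` and the `pick_backend_*` functions are hypothetical names, not part of this crate's API, and the cutoffs just mirror the guidance in this section.

```rust
/// Backend choice for a workload (hypothetical type, not in the crate).
#[derive(Debug, PartialEq)]
enum Backend {
    Simd,
    Gpu,
}

/// GPU pays off past roughly 100K elements for reductions, where the
/// host-to-device transfer overhead is amortized.
fn pick_backend_for_reduction(elements: usize) -> Backend {
    if elements > 100_000 { Backend::Gpu } else { Backend::Simd }
}

/// GPU pays off around 1000x1000 and up for matrix operations.
fn pick_backend_for_matmul(rows: usize, cols: usize) -> Backend {
    if rows >= 1000 && cols >= 1000 { Backend::Gpu } else { Backend::Simd }
}

fn main() {
    assert_eq!(pick_backend_for_reduction(1_000), Backend::Simd);
    assert_eq!(pick_backend_for_reduction(1_000_000), Backend::Gpu);
    assert_eq!(pick_backend_for_matmul(2048, 2048), Backend::Gpu);
    println!("ok");
}
```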
§Architecture
- Device initialization is lazy (first GPU operation)
- Compute shaders written in WGSL
- Asynchronous execution with pollster for blocking
- Automatic fallback to CPU if GPU unavailable
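The lazy-initialization and CPU-fallback points can be sketched with std's `OnceLock`: the device is created only on the first GPU operation, and a failed init degrades to the CPU path instead of panicking. `FakeDevice` and `try_init_device` are stand-ins for the real wgpu adapter/device request, not part of this crate's API.

```rust
use std::sync::OnceLock;

/// Stand-in for a real wgpu device handle (hypothetical).
struct FakeDevice {
    name: String,
}

/// In the real backend this would request a wgpu adapter/device
/// (blocking on the async call via pollster) and return None when
/// no GPU is available.
fn try_init_device() -> Option<FakeDevice> {
    Some(FakeDevice { name: "stub".to_string() })
}

// Initialized at most once, on the first GPU operation.
static DEVICE: OnceLock<Option<FakeDevice>> = OnceLock::new();

/// Returns the shared device, initializing it lazily on first use.
fn device() -> Option<&'static FakeDevice> {
    DEVICE.get_or_init(try_init_device).as_ref()
}

fn main() {
    match device() {
        Some(d) => println!("gpu path: {}", d.name),
        None => println!("cpu fallback"),
    }
}
```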
§Memory Hierarchy Abstractions
- TensorView - Structured view into GPU memory with shape/stride metadata
- PartitionView - Tiling strategy for efficient GPU work distribution
Based on cuda-tile-behavior.md Section 3.2.
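A minimal CPU-side sketch of the TensorView/PartitionView idea: a shape plus row-major strides mapping a (row, col) index to an offset in a flat buffer, and a ceiling-division tile count so edge tiles are included. `View2D` is a hypothetical illustration; the real types also carry GPU buffer handles.

```rust
/// Hypothetical, CPU-only analogue of a 2D TensorView.
struct View2D {
    shape: [usize; 2],
    strides: [usize; 2], // in elements, not bytes
}

impl View2D {
    /// Row-major view over `rows x cols` elements.
    fn row_major(rows: usize, cols: usize) -> Self {
        Self { shape: [rows, cols], strides: [cols, 1] }
    }

    /// Linear offset of element (r, c) in the flat buffer.
    fn offset(&self, r: usize, c: usize) -> usize {
        r * self.strides[0] + c * self.strides[1]
    }

    /// Number of tiles a PartitionView-style tiling would produce,
    /// rounding up so partial edge tiles are counted.
    fn tile_count(&self, tile: usize) -> usize {
        let tiles_r = (self.shape[0] + tile - 1) / tile;
        let tiles_c = (self.shape[1] + tile - 1) / tile;
        tiles_r * tiles_c
    }
}

fn main() {
    let v = View2D::row_major(4, 6);
    assert_eq!(v.offset(2, 3), 15); // 2*6 + 3
    assert_eq!(v.tile_count(4), 2); // ceil(4/4) * ceil(6/4) = 1 * 2
    println!("ok");
}
```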
Re-exports§
pub use wgpu;
Modules§
- runtime
- Cross-platform async runtime helpers for GPU operations.
- shaders
- WGSL compute shaders for GPU operations
Structs§
- BufferId
- Unique identifier for a buffer in a batch
- GpuBackend
- GPU backend for compute operations (native only, uses sync wrappers)
- GpuCommandBatch
- Command batch for async GPU execution
- GpuDevice
- GPU device manager
- GpuDevicePool
- Pool of GPU devices for multi-GPU workloads
- GpuMatmulCache
- PMAT-322: Cached matmul with persistent weight buffers for LLM inference (pipeline, pre-uploaded weight buffers, and persistent I/O buffers)
- MaxOp
- Max reduction operation
- MinOp
- Min reduction operation
- PartitionView
- A tiling strategy over a TensorView.
- QkvLoRA
- PMAT-324: Optional LoRA buffers for the Q/K/V projections in a layer's forward pass.
- SumOp
- Sum reduction operation
- TensorView
- A view into a contiguous memory region with shape and stride information.
- TileInfo
- Information about a single tile within a partition.
- WgslForwardPass
- PMAT-324: GPU-resident transformer layer state for the WGSL forward-pass shaders. All buffers persist across tokens; only the input/output buffers change per step.
Enums§
- MemoryLayout
- Memory layout for tensor storage
Constants§
- TILE_SIZE
- Default tile size for 2D reductions (matches the GPU workgroup size)
Traits§
- ReduceOp
- Reduction operation trait for generic tile reduction
Functions§
- tiled_max_2d
- Convenience function for tiled max reduction
- tiled_min_2d
- Convenience function for tiled min reduction
- tiled_reduce_2d
- Perform tiled reduction on 2D data (CPU fallback)
- tiled_reduce_partial
- Compute partial tile results for verification
- tiled_sum_2d
- Convenience function for tiled sum reduction
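The shape of the tiled reduction, in the style of the CPU fallback, can be sketched in plain Rust: reduce each tile to a partial result, then combine the partials (the partial stage is what a verification pass can inspect). The function names and the tile size of 8 here are illustrative; they are not the crate's actual signatures, and the real TILE_SIZE matches the GPU workgroup size.

```rust
// Illustrative tile size; the crate's constant matches its GPU workgroup size.
const TILE: usize = 8;

/// Per-tile partial sums for a `rows x cols` row-major matrix
/// (hypothetical analogue of the partial-result stage).
fn tiled_sum_partials(data: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    let mut partials = Vec::new();
    for tr in (0..rows).step_by(TILE) {
        for tc in (0..cols).step_by(TILE) {
            let mut acc = 0.0f32;
            // Clamp tile bounds so edge tiles are handled correctly.
            for r in tr..(tr + TILE).min(rows) {
                for c in tc..(tc + TILE).min(cols) {
                    acc += data[r * cols + c];
                }
            }
            partials.push(acc);
        }
    }
    partials
}

/// Full reduction: combine the per-tile partials.
fn tiled_sum(data: &[f32], rows: usize, cols: usize) -> f32 {
    tiled_sum_partials(data, rows, cols).iter().sum()
}

fn main() {
    let data = vec![1.0f32; 10 * 10]; // 100 ones
    // 10x10 with 8x8 tiles -> ceil(10/8)^2 = 4 partial tiles.
    assert_eq!(tiled_sum_partials(&data, 10, 10).len(), 4);
    assert_eq!(tiled_sum(&data, 10, 10), 100.0);
    println!("ok");
}
```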
Type Aliases§
- PipelineCache
- Pipeline cache keyed by shader source pointer address.