runmat-accelerate 0.2.5

Pluggable GPU acceleration layer for RunMat (CUDA, ROCm, Metal, Vulkan/SPIR-V)

RunMat Accelerate

Purpose

runmat-accelerate provides the high-level acceleration layer that integrates GPU backends with the language runtime. It implements providers for runmat-accelerate-api so that gpuArray, gather, and accelerated math and linear algebra builtins can execute on devices transparently where appropriate.

Architecture

  • Depends on runmat-accelerate-api to register an AccelProvider implementation at startup.
  • Backends (e.g., wgpu, cuda, rocm, metal, vulkan, opencl) are feature-gated. Only one provider is registered globally, though a future multi-device planner could fan work out across devices.
  • The planner decides when to run ops on CPU vs GPU (size thresholds, op types, fusion opportunities). The Accelerator exposes ergonomic entry points used by the runtime or higher layers.
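
To make the registration flow concrete, here is a minimal self-contained Rust sketch of the pattern: one globally registered provider behind a trait. The trait body, register_provider, and WgpuProvider below are toy stand-ins; the real AccelProvider trait in runmat-accelerate-api has a much larger surface.

use std::sync::OnceLock;

// Toy stand-in for the AccelProvider trait from runmat-accelerate-api.
trait AccelProvider: Send + Sync {
    fn name(&self) -> &str;
    fn add(&self, a: &[f64], b: &[f64]) -> Vec<f64>;
}

// One global provider, registered once at process startup.
static PROVIDER: OnceLock<Box<dyn AccelProvider>> = OnceLock::new();

fn register_provider(p: Box<dyn AccelProvider>) {
    let _ = PROVIDER.set(p); // first registration wins
}

fn provider() -> Option<&'static dyn AccelProvider> {
    PROVIDER.get().map(|b| b.as_ref())
}

struct WgpuProvider;

impl AccelProvider for WgpuProvider {
    fn name(&self) -> &str { "wgpu" }
    fn add(&self, a: &[f64], b: &[f64]) -> Vec<f64> {
        // A real backend would dispatch a GPU kernel here.
        a.iter().zip(b).map(|(x, y)| x + y).collect()
    }
}

fn main() {
    register_provider(Box::new(WgpuProvider)); // once, at startup
    let p = provider().expect("provider registered");
    assert_eq!(p.add(&[1.0, 2.0], &[3.0, 4.0]), vec![4.0, 6.0]);
}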

Fusion and optimization

  • The planner and fusion engine automatically fuse common elementwise chains and reductions to reduce temporaries and host↔device transfers. Fusion detects compatible operations and generates optimized GPU kernels.
  • For providers that expose fused kernels, the planner routes operations to those paths, improving performance by minimizing kernel launches and memory traffic.
  • Autograd support (planned): Tensor/Matrix operations will participate in reverse-mode autograd by default. The runtime will record a compact tape of primitive ops; gradients will be computed by chaining primitive derivatives.
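
To illustrate what fusion buys, consider a toy CPU sketch (not the crate's fusion engine): the unfused form materializes a temporary per op, while the fused form applies the whole chain per element in one pass, which on the GPU corresponds to one generated kernel instead of several launches.

// Unfused: one temporary (and, on GPU, one kernel launch) per op,
// e.g. t1 = a .* b; out = t1 + c.
fn unfused(a: &[f64], b: &[f64], c: &[f64]) -> Vec<f64> {
    let t1: Vec<f64> = a.iter().zip(b).map(|(x, y)| x * y).collect();
    t1.iter().zip(c).map(|(x, y)| x + y).collect()
}

// Fused: one pass, no temporaries; the wgpu backend would emit a single
// kernel covering the whole chain.
fn fused(a: &[f64], b: &[f64], c: &[f64]) -> Vec<f64> {
    (0..a.len()).map(|i| a[i] * b[i] + c[i]).collect()
}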

What it provides today

  • A fully functional Accelerator with automatic CPU/GPU routing: the planner chooses CPU path (delegating to runmat-runtime) or GPU path (via provider methods) based on tensor sizes, operation types, and fusion opportunities.
  • Integration points for gpuArray/gather: when a provider is registered, runtime builtins route through the provider API defined in runmat-accelerate-api.
  • Fusion engine that detects and executes fused elementwise chains and reductions on GPU.
  • Native auto-offload that transparently promotes tensors to GPU when beneficial.
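
The routing decision can be pictured as a simple cost heuristic. The sketch below is illustrative only: the threshold constant, the fusable discount, and the Placement type are assumptions, not the crate's actual cost model.

enum Placement { Cpu, Gpu }

fn place(op_elems: usize, provider_registered: bool, fusable: bool) -> Placement {
    // Small tensors stay on CPU: kernel-launch and transfer overhead would
    // dominate. A fusable chain lowers the effective threshold because one
    // host<->device round-trip amortizes over several ops.
    const GPU_THRESHOLD: usize = 1 << 16; // illustrative value
    let threshold = if fusable { GPU_THRESHOLD / 4 } else { GPU_THRESHOLD };
    if provider_registered && op_elems >= threshold {
        Placement::Gpu
    } else {
        Placement::Cpu
    }
}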

How it fits with the runtime

  • The MATLAB-facing builtins (gpuArray, gather) live in runmat-runtime for consistency with all other builtins. They call into runmat-accelerate-api::provider(), which is implemented and registered by this crate.
  • This separation avoids dependency cycles and keeps the language surface centralized while enabling pluggable backends.
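
A toy sketch of that call path (all type and function names below are illustrative): the builtins see only the provider trait from the API crate, so the dependency graph stays acyclic.

// Toy stand-ins for the real types in runmat-accelerate-api.
struct HostTensor(Vec<f64>);
struct DeviceHandle(u64);

trait Provider: Send + Sync {
    fn upload(&self, t: &HostTensor) -> DeviceHandle;
    fn download(&self, h: &DeviceHandle) -> HostTensor;
}

// These builtins live in runmat-runtime alongside all other builtins and
// talk only to the trait: runmat-runtime -> runmat-accelerate-api <-
// runmat-accelerate (which implements and registers the provider).
fn gpu_array(t: &HostTensor, p: &dyn Provider) -> DeviceHandle {
    p.upload(t)
}

fn gather(h: &DeviceHandle, p: &dyn Provider) -> HostTensor {
    p.download(h)
}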

Backends

  • wgpu (feature: wgpu) is the primary cross-vendor backend, providing support for Metal (macOS), DirectX 12 (Windows), and Vulkan (Linux) through a single portable implementation.
  • Additional backends (CUDA/ROCm/OpenCL) are planned (features already stubbed).
  • Backend responsibilities:
    • Allocate/free buffers, handle host↔device transfers
    • Provide kernels for core ops (elementwise, transpose, matmul/GEMM, reductions, signal processing)
    • Report device information (for planner decisions)
    • Compile and cache compute pipelines for optimal performance
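
Taken together, those responsibilities suggest a backend contract along these lines (a hypothetical Rust sketch; the crate's actual trait and types differ):

type BufferId = u64;

enum EwOp { Add, Sub, Mul, Div }
enum ReduceOp { Sum, Min, Max }

struct DeviceInfo {
    name: String,
    vendor: String,
    memory_bytes: Option<u64>,
}

trait Backend {
    // Memory management and host<->device transfers.
    fn alloc(&self, bytes: usize) -> BufferId;
    fn free(&self, buf: BufferId);
    fn upload(&self, buf: BufferId, host: &[u8]);
    fn download(&self, buf: BufferId, host: &mut [u8]);

    // Core kernels.
    fn elementwise(&self, op: EwOp, a: BufferId, b: BufferId, out: BufferId, n: usize);
    fn transpose(&self, a: BufferId, out: BufferId, rows: usize, cols: usize);
    fn matmul(&self, a: BufferId, b: BufferId, out: BufferId, m: usize, k: usize, n: usize);
    fn reduce(&self, op: ReduceOp, a: BufferId, out: BufferId, reduce_len: usize);

    // Device info for planner decisions; pipeline compilation and caching
    // happen behind these calls (see "Pipeline cache and warmup" below).
    fn device_info(&self) -> DeviceInfo;
}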

Current state

  • Fully functional wgpu backend powering the pre-release runtime with cross-platform GPU support (Metal/DX12/Vulkan).
  • The CPU fallback path is exercised in production tests today; the GPU path already delivers benchmarked speedups but is still evolving, so expect rough edges as we iterate.
  • Fusion engine operational, automatically detecting and fusing elementwise chains and reductions.
  • Native auto-offload heuristics working, transparently promoting operations to GPU when beneficial.

Roadmap

  • ✅ In-process provider with buffer registry (implemented).
  • ✅ wgpu backend (implemented): upload/download, elementwise ops, transpose, matmul, reductions, signal processing.
  • ✅ Operator fusion (implemented): elementwise chains and reductions are automatically fused.
  • ✅ Planner cost model and auto-offload (implemented): automatic CPU/GPU routing based on tensor sizes and operation types.
  • Planned: Streams/queues, memory pools, pinned/unified buffers, and multi-device support.
  • Planned: Additional backends (CUDA/ROCm/OpenCL) for specialized use cases.

Example usage

The provider is registered at process startup (REPL/CLI/app). Once registered, MATLAB-like code can use:

G = gpuArray(A);      % move tensor to device
H = G + 2;            % elementwise add (planner may choose GPU path)
R = gather(H);        % bring results back to host

Native acceleration

RunMat Accelerate powers native acceleration for RunMat: instead of calling the gpuArray/gather builtins explicitly, users can write ordinary code and let the planner move data to and from the device as needed:

% Example: Large matrix multiplication and elementwise operations
A = randn(10000, 10000);   % Large matrix
B = randn(10000, 10000);

% Normally, in MATLAB you'd need to explicitly use gpuArray:
%   G = gpuArray(A);
%   H = G .* B;           % Elementwise multiply on GPU
%   S = sum(H, 2);
%   R = gather(S);

% With RunMat Accelerate, the planner can transparently move data to the GPU
% and back as needed, so you can just write:
H = A .* B;           % Planner may choose GPU for large ops
S = sum(H, 2);        % Fused and executed on device if beneficial

% Results are automatically brought back to host as needed
disp(S(1:10));        % Print first 10 results

% The planner/JIT will optimize transfers and fuse operations for best performance.

Device info (gpuDevice)

  • gpuDevice() returns a structured value with details about the active provider/device when available. Fields include device_id, name, vendor, optional memory_bytes, and optional backend.
info = gpuDevice();
% Example output (in-process provider):
%   struct with fields:
%       device_id: 0
%            name: 'InProcess'
%          vendor: 'RunMat'
%         backend: 'inprocess'

Reduction tuning and defaults

  • Defaults used by the WGPU provider (subject to device variation):
    • two-pass threshold: 1024 elements per slice
    • reduction workgroup size: 256
  • Environment overrides:
    • RUNMAT_TWO_PASS_THRESHOLD=<usize>
    • RUNMAT_REDUCTION_WG=<u32>
  • Selection logic (a code sketch follows this list):
    • Single-pass when reduce_len <= threshold, otherwise two-pass (partials + second-stage reduce).
    • Fusion and builtin paths pick up provider defaults automatically; call sites may pass wg=0 to request them explicitly.
  • Sweep to tune for your device with the wgpu_profile binary:
    • Quick sweep: cargo run -p runmat-accelerate --bin wgpu_profile --features wgpu -- --only-reduce-sweep --reduce-sweep --quick
    • Force single-pass: RUNMAT_TWO_PASS_THRESHOLD=1000000 cargo run -p runmat-accelerate --bin wgpu_profile --features wgpu -- --only-reduce-sweep --reduce-sweep
    • Force two-pass: RUNMAT_TWO_PASS_THRESHOLD=1 cargo run -p runmat-accelerate --bin wgpu_profile --features wgpu -- --only-reduce-sweep --reduce-sweep
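
In code, the selection and override logic reads roughly like this (a sketch: the defaults match those documented above, but the function names are illustrative):

use std::env;

fn two_pass_threshold() -> usize {
    env::var("RUNMAT_TWO_PASS_THRESHOLD")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(1024) // default: 1024 elements per slice
}

fn reduction_workgroup(requested: u32) -> u32 {
    if requested != 0 {
        return requested; // explicit value from the call site
    }
    // wg=0 means "use the provider default", possibly overridden by env.
    env::var("RUNMAT_REDUCTION_WG")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(256)
}

fn use_two_pass(reduce_len: usize) -> bool {
    // Single-pass when reduce_len <= threshold; otherwise two-pass
    // (per-workgroup partials, then a second-stage reduce).
    reduce_len > two_pass_threshold()
}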

Pipeline cache and warmup

  • The provider caches compute pipelines by shader hash and layout to reduce first-dispatch latency.
  • A brief warmup compiles common kernels during provider initialization.
  • Cache metrics (hits/misses) are accessible via runmat_accelerate::provider_cache_stats() (when built with wgpu) and are printed by wgpu_profile after a sweep.
  • On-disk persistence: cache metadata (.json) and shader WGSL (.wgsl) are persisted under the OS cache directory or RUNMAT_PIPELINE_CACHE_DIR (fallback target/tmp).
  • Warmup precompiles pipelines from disk when available; runmat accel-info prints the last warmup duration in milliseconds.
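
As a rough picture of the caching scheme (type and field names are illustrative, not the crate's): pipelines are looked up by a key combining the shader hash and bind-group layout, with hit/miss counters feeding provider_cache_stats().

use std::collections::HashMap;

#[derive(Hash, PartialEq, Eq, Clone)]
struct PipelineKey {
    shader_hash: u64, // hash of the WGSL source
    layout_id: u32,   // bind-group layout identity
}

struct CompiledPipeline; // stand-in for a wgpu::ComputePipeline

struct PipelineCache {
    pipelines: HashMap<PipelineKey, CompiledPipeline>,
    hits: u64,
    misses: u64,
}

impl PipelineCache {
    fn get_or_compile(
        &mut self,
        key: PipelineKey,
        compile: impl FnOnce() -> CompiledPipeline,
    ) -> &CompiledPipeline {
        if self.pipelines.contains_key(&key) {
            self.hits += 1; // reported via provider_cache_stats()
        } else {
            self.misses += 1;
            // First dispatch pays the compile; startup warmup (optionally
            // from WGSL persisted on disk) hides this cost.
            self.pipelines.insert(key.clone(), compile());
        }
        &self.pipelines[&key]
    }
}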