amari-gpu
GPU acceleration for Amari mathematical computations using WebGPU.
Overview
amari-gpu is an integration crate that provides GPU-accelerated implementations of mathematical operations from the Amari domain crates. It follows a progressive enhancement pattern: operations run on the CPU when no GPU is available or when the workload is small, and move to the GPU for large batch operations.
Architecture
As an integration crate, amari-gpu consumes APIs from domain crates and exposes them to GPU platforms:
Domain Crates (provide APIs):
amari-core → amari-measure → amari-calculus
amari-info-geom, amari-relativistic, amari-network
Integration Crates (consume APIs):
amari-gpu → depends on domain crates
amari-wasm → depends on domain crates
Dependency Rule: Integration crates depend on domain crates, never the reverse.
Current Integrations (v0.13.0)
Implemented GPU Acceleration
| Domain Crate | Module | Operations | Status |
|---|---|---|---|
| amari-core | core | Geometric algebra operations (G2, G3, G4), multivector products | ✅ Implemented |
| amari-info-geom | info_geom | Fisher metric, divergence computations, statistical manifolds | ✅ Implemented |
| amari-relativistic | relativistic | Minkowski space operations, Lorentz transformations | ✅ Implemented |
| amari-network | network | Graph operations, spectral methods | ✅ Implemented |
| amari-measure | measure | Measure theory computations, sigma-algebras | ✅ Implemented (feature: measure) |
| amari-calculus | calculus | Field evaluation, gradients, divergence, curl | ✅ Implemented (feature: calculus) |
| amari-dual | dual | Automatic differentiation GPU operations | ✅ Implemented (feature: dual) |
| amari-enumerative | enumerative | Intersection theory GPU operations | ✅ Implemented (feature: enumerative) |
| amari-automata | automata | Cellular automata GPU evolution | ✅ Implemented (feature: automata) |
| amari-fusion | fusion | Tropical-dual-Clifford fusion operations | ✅ Implemented (feature: fusion) |
| amari-holographic | holographic | Holographic memory, batch binding, similarity matrices | ✅ Implemented (feature: holographic) |
| amari-probabilistic | probabilistic | Gaussian sampling, batch statistics, Monte Carlo | ✅ Implemented (feature: probabilistic) |
Temporarily Disabled Modules
| Domain Crate | Module | Status | Reason |
|---|---|---|---|
| amari-tropical | tropical | ❌ Disabled | Orphan impl rules - requires extension traits |
Note: If you were using amari_gpu::tropical in previous versions, this module is not available in v0.12.2. Use CPU implementations from amari_tropical directly until this module is restored in a future release.
Features
[features]
default = []
std = ["amari-core/std", "amari-relativistic/std", "amari-info-geom/std"]
webgpu = ["wgpu/webgpu"]
high-precision = ["amari-core/high-precision", "amari-relativistic/high-precision"]
measure = ["dep:amari-measure"]
calculus = ["dep:amari-calculus"]
dual = ["dep:amari-dual"]
enumerative = ["dep:amari-enumerative"]
automata = ["dep:amari-automata"]
fusion = ["dep:amari-fusion"]
holographic = ["dep:amari-holographic"] # Holographic memory GPU acceleration
probabilistic = ["dep:rand", "dep:rand_distr"] # Probabilistic GPU acceleration
# tropical = ["dep:amari-tropical"] # Disabled - orphan impl rules
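Optional modules are compiled only when their feature is enabled. As a minimal sketch, a downstream Cargo.toml that opts into the calculus and holographic modules might look like the fragment below. The version requirement is an assumption based on the v0.13.0 release named in this README; adjust it to the release you actually depend on.

```toml
[dependencies]
# Feature names are taken from the integration table above.
amari-gpu = { version = "0.13", features = ["calculus", "holographic"] }
```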
Usage
Basic Setup
use GpuContext;
async
Calculus GPU Acceleration
use GpuCalculus;
use ScalarField;
use Multivector;
async
Holographic Memory GPU Acceleration
use ;
async
Holographic GPU Operations
| Operation | Description | GPU Threshold |
|---|---|---|
| batch_bind() | Parallel geometric product binding | ≥ 100 pairs |
| batch_similarity() | Pairwise or matrix similarity computation | ≥ 100 vectors |
| resonator_cleanup() | Parallel codebook search for best match | ≥ 100 codebook entries |
WGSL Shaders
The holographic module includes optimized WGSL compute shaders:
- holographic_batch_bind: Cayley table-based geometric product for binding
- holographic_batch_similarity: Inner product with reverse <A B̃>₀ for similarity
- holographic_bundle_all: Parallel reduction for vector superposition
- holographic_resonator_step: Parallel max-finding for cleanup
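To make the Cayley-table product and the <A B̃>₀ similarity concrete, here is a CPU-side sketch in plain Rust. It is not the crate's shader code. It assumes a 3D Euclidean algebra (G3) with 8 coefficients indexed by a basis-blade bitmask (bit 0 = e1, bit 1 = e2, bit 2 = e3), and every name in it (cayley_table, geometric_product, reverse, similarity) is hypothetical.

```rust
/// Number of basis blades in G3 (assumed layout: index = blade bitmask).
const DIM: usize = 8;

/// Sign picked up when reordering the product of blades `a` and `b`
/// into canonical order (Euclidean metric, every basis vector squares to +1).
fn reorder_sign(a: usize, b: usize) -> f32 {
    let mut a = a >> 1;
    let mut swaps = 0;
    while a != 0 {
        swaps += (a & b).count_ones();
        a >>= 1;
    }
    if swaps % 2 == 0 { 1.0 } else { -1.0 }
}

/// Cayley table: entry [a][b] holds (result blade index, sign) for blade a * blade b.
fn cayley_table() -> [[(usize, f32); DIM]; DIM] {
    let mut table = [[(0usize, 1.0f32); DIM]; DIM];
    for a in 0..DIM {
        for b in 0..DIM {
            table[a][b] = (a ^ b, reorder_sign(a, b));
        }
    }
    table
}

/// Geometric product of two multivectors via table lookup.
pub fn geometric_product(a: &[f32; DIM], b: &[f32; DIM]) -> [f32; DIM] {
    let table = cayley_table();
    let mut out = [0.0f32; DIM];
    for i in 0..DIM {
        for j in 0..DIM {
            let (k, sign) = table[i][j];
            out[k] += sign * a[i] * b[j];
        }
    }
    out
}

/// Reverse: a grade-k blade picks up the sign (-1)^(k(k-1)/2).
pub fn reverse(a: &[f32; DIM]) -> [f32; DIM] {
    let mut out = *a;
    for (i, c) in out.iter_mut().enumerate() {
        let grade = i.count_ones();
        if (grade * grade.saturating_sub(1) / 2) % 2 == 1 {
            *c = -*c;
        }
    }
    out
}

/// Similarity as the scalar part of A * reverse(B).
pub fn similarity(a: &[f32; DIM], b: &[f32; DIM]) -> f32 {
    geometric_product(a, &reverse(b))[0]
}
```

Under these assumptions the similarity of a multivector with itself reduces to the sum of its squared coefficients, which gives a quick sanity check for any GPU result.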
Probabilistic GPU Acceleration
use GpuProbabilistic;
async
Probabilistic GPU Operations
| Operation | Description | GPU Threshold |
|---|---|---|
| batch_sample_gaussian() | Parallel Box-Muller Gaussian sampling | ≥ 1000 samples |
| batch_mean() | Parallel reduction for mean | ≥ 1000 elements |
| batch_variance() | Two-pass parallel variance | ≥ 1000 elements |
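The techniques named in the table can be illustrated with a sequential CPU reference in plain Rust. This is a sketch, not the crate's implementation: the function names (box_muller, sample_gaussian_cpu, mean, two_pass_variance) are hypothetical, the SplitMix64 generator is swapped in only to avoid external dependencies, and the variance is assumed to be the population variance (divide by n).

```rust
use std::f64::consts::PI;

/// SplitMix64 step: a small deterministic generator used only for this sketch.
fn next_u64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E3779B97F4A7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
    z ^ (z >> 31)
}

/// Uniform sample in (0, 1], so that ln(u) stays finite.
fn next_unit(state: &mut u64) -> f64 {
    ((next_u64(state) >> 11) as f64 + 1.0) / (1u64 << 53) as f64
}

/// Box-Muller transform: two uniforms in, two independent N(0, 1) samples out.
pub fn box_muller(u1: f64, u2: f64) -> (f64, f64) {
    let r = (-2.0 * u1.ln()).sqrt();
    let theta = 2.0 * PI * u2;
    (r * theta.cos(), r * theta.sin())
}

/// Draw `n` samples from N(mean, std_dev^2), two per Box-Muller call.
pub fn sample_gaussian_cpu(n: usize, mean: f64, std_dev: f64, seed: u64) -> Vec<f64> {
    let mut state = seed;
    let mut out = Vec::with_capacity(n + 1);
    while out.len() < n {
        let (z0, z1) = box_muller(next_unit(&mut state), next_unit(&mut state));
        out.push(mean + std_dev * z0);
        out.push(mean + std_dev * z1);
    }
    out.truncate(n);
    out
}

/// Arithmetic mean; `None` for an empty slice.
pub fn mean(xs: &[f64]) -> Option<f64> {
    if xs.is_empty() {
        return None;
    }
    Some(xs.iter().sum::<f64>() / xs.len() as f64)
}

/// Two-pass variance: pass 1 computes the mean, pass 2 sums squared deviations.
pub fn two_pass_variance(xs: &[f64]) -> Option<f64> {
    let m = mean(xs)?;
    let sum_sq: f64 = xs.iter().map(|x| (x - m) * (x - m)).sum();
    Some(sum_sq / xs.len() as f64)
}
```

The two-pass form costs an extra sweep over the data but avoids the cancellation error of the single-pass E[x²] − E[x]² formula, which is one plausible reason to prefer it in a parallel reduction.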
Adaptive CPU/GPU Dispatch
The library automatically selects the optimal execution path:
// Small batch: Automatically uses CPU (< 1000 points for scalar fields)
let small_points = vec!;
let values = gpu_calculus.batch_eval_scalar_field.await?;
// ↑ Executed on CPU (overhead of GPU transfer exceeds benefit)
// Large batch: Automatically uses GPU (≥ 1000 points)
let large_points = generate_point_grid; // 10,000 points
let values = gpu_calculus.batch_eval_scalar_field.await?;
// ↑ Executed on GPU (parallel processing advantage)
Batch Size Thresholds
| Operation | CPU Threshold | GPU Threshold |
|---|---|---|
| Scalar field evaluation | < 1000 points | ≥ 1000 points |
| Vector field evaluation | < 500 points | ≥ 500 points |
| Gradient computation | < 500 points | ≥ 500 points |
| Divergence/Curl | < 500 points | ≥ 500 points |
| Holographic binding | < 100 pairs | ≥ 100 pairs |
| Holographic similarity | < 100 vectors | ≥ 100 vectors |
| Resonator cleanup | < 100 codebook | ≥ 100 codebook |
| Gaussian sampling | < 1000 samples | ≥ 1000 samples |
| Batch mean/variance | < 1000 elements | ≥ 1000 elements |
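The thresholds in this table amount to a simple dispatch rule: use the GPU only if one is available and the batch is at least as large as the operation's threshold. A minimal sketch of that rule follows. The Operation and Backend enums and the choose_backend function are illustrative names, not the crate's API; only the numeric thresholds come from the table.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Cpu,
    Gpu,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Operation {
    ScalarFieldEval,
    VectorFieldEval,
    Gradient,
    DivergenceCurl,
    HolographicBinding,
    HolographicSimilarity,
    ResonatorCleanup,
    GaussianSampling,
    BatchMeanVariance,
}

/// Smallest batch size at which the GPU path is used (values from the table).
pub fn gpu_threshold(op: Operation) -> usize {
    match op {
        Operation::ScalarFieldEval => 1000,
        Operation::VectorFieldEval => 500,
        Operation::Gradient => 500,
        Operation::DivergenceCurl => 500,
        Operation::HolographicBinding => 100,
        Operation::HolographicSimilarity => 100,
        Operation::ResonatorCleanup => 100,
        Operation::GaussianSampling => 1000,
        Operation::BatchMeanVariance => 1000,
    }
}

/// Pick a backend: fall back to the CPU when no GPU is present or the
/// batch is too small to amortize the transfer overhead.
pub fn choose_backend(op: Operation, batch_size: usize, gpu_available: bool) -> Backend {
    if gpu_available && batch_size >= gpu_threshold(op) {
        Backend::Gpu
    } else {
        Backend::Cpu
    }
}
```

For example, a scalar field evaluation over 999 points stays on the CPU, while 1000 points crosses the threshold and goes to the GPU if one is available.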
Implementation Status
Holographic Module (v0.13.0)
GPU Implementations (✅ Complete):
- Batch binding with Cayley table geometric product
- Batch similarity using proper inner product <A B̃>₀
- Parallel reduction for vector bundling
- Resonator cleanup with parallel codebook search
Probabilistic Module (v0.13.0)
GPU Implementations (✅ Complete):
- Batch Gaussian sampling on multivector spaces
- Parallel mean and variance computation
- Monte Carlo integration acceleration
- GPU-based random number generation with Box-Muller transform
Holographic types:
- GpuHolographicTDC: GPU-compatible TropicalDualClifford representation
- GpuResonatorOutput: Cleanup result with best match info
- HolographicGpuOps: Main GPU operations struct
Holographic shaders:
- HOLOGRAPHIC_BATCH_BIND: 64-thread workgroups for binding
- HOLOGRAPHIC_BATCH_SIMILARITY: 256-thread workgroups for similarity
- HOLOGRAPHIC_BUNDLE_ALL: Workgroup-shared memory reduction
- HOLOGRAPHIC_RESONATOR_STEP: 256-thread parallel max-finding
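The holographic bundling and resonator steps are a sum reduction and an argmax search. The sketch below shows both sequentially in plain Rust. It mirrors the pairwise halving a workgroup reduction performs, but it is not the shader code: bundle_all_cpu and resonator_cleanup_cpu are hypothetical names, vectors are assumed to be flat coefficient arrays of equal length, and a plain dot product stands in for the similarity measure.

```rust
/// Sum a set of equally sized vectors with a pairwise (tree) reduction.
/// Each pass halves the number of partial sums, as a workgroup reduction would.
pub fn bundle_all_cpu(vectors: &[Vec<f32>]) -> Option<Vec<f32>> {
    if vectors.is_empty() {
        return None;
    }
    let mut level: Vec<Vec<f32>> = vectors.to_vec();
    while level.len() > 1 {
        let mut next = Vec::with_capacity((level.len() + 1) / 2);
        for pair in level.chunks(2) {
            let mut sum = pair[0].clone();
            // An odd element at the end of a level is carried over unchanged.
            if let Some(other) = pair.get(1) {
                for (s, o) in sum.iter_mut().zip(other) {
                    *s += *o;
                }
            }
            next.push(sum);
        }
        level = next;
    }
    level.pop()
}

/// Dot product of coefficient arrays (stand-in similarity for this sketch).
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Find the codebook entry most similar to `query`.
/// Returns (index, similarity); ties keep the earliest entry.
pub fn resonator_cleanup_cpu(query: &[f32], codebook: &[Vec<f32>]) -> Option<(usize, f32)> {
    codebook
        .iter()
        .enumerate()
        .map(|(i, entry)| (i, dot(query, entry)))
        .fold(None, |best, cand| match best {
            Some((_, s)) if s >= cand.1 => best,
            _ => Some(cand),
        })
}
```

A reference like this is mainly useful for checking GPU output on small inputs, where the results should agree up to floating-point summation order.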
Calculus Module (v0.13.0)
CPU Implementations (✅ Complete):
- Central finite differences for numerical derivatives
- Field evaluation at multiple points
- Gradient, divergence, and curl computation
- Step size: h = 1e-6 for numerical stability
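As an illustration of the central-difference approach listed above, the following sketch computes gradient, divergence, and curl in 3D with h = 1e-6. It is a standalone example under stated assumptions: fields are plain closures over [f64; 3] rather than the crate's field types, and the function names are hypothetical.

```rust
/// Step size for central differences, matching the value quoted above.
const H: f64 = 1e-6;

/// Copy of `p` with coordinate `axis` shifted by `delta`.
fn shifted(p: [f64; 3], axis: usize, delta: f64) -> [f64; 3] {
    let mut q = p;
    q[axis] += delta;
    q
}

/// Gradient of a scalar field: df/dx_k ≈ (f(p + h e_k) - f(p - h e_k)) / 2h.
pub fn gradient(f: impl Fn([f64; 3]) -> f64, p: [f64; 3]) -> [f64; 3] {
    let mut g = [0.0; 3];
    for axis in 0..3 {
        g[axis] = (f(shifted(p, axis, H)) - f(shifted(p, axis, -H))) / (2.0 * H);
    }
    g
}

/// Jacobian of a vector field: j[i][k] = dF_i/dx_k by central differences.
fn jacobian(f: impl Fn([f64; 3]) -> [f64; 3], p: [f64; 3]) -> [[f64; 3]; 3] {
    let mut j = [[0.0; 3]; 3];
    for axis in 0..3 {
        let plus = f(shifted(p, axis, H));
        let minus = f(shifted(p, axis, -H));
        for i in 0..3 {
            j[i][axis] = (plus[i] - minus[i]) / (2.0 * H);
        }
    }
    j
}

/// Divergence: trace of the Jacobian.
pub fn divergence(f: impl Fn([f64; 3]) -> [f64; 3], p: [f64; 3]) -> f64 {
    let j = jacobian(f, p);
    j[0][0] + j[1][1] + j[2][2]
}

/// Curl: antisymmetric part of the Jacobian.
pub fn curl(f: impl Fn([f64; 3]) -> [f64; 3], p: [f64; 3]) -> [f64; 3] {
    let j = jacobian(f, p);
    [j[2][1] - j[1][2], j[0][2] - j[2][0], j[1][0] - j[0][1]]
}
```

Central differences have truncation error of order h², so with h = 1e-6 in double precision the result is typically limited by rounding error rather than truncation error.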
GPU Implementations (⏸️ Future Work):
- WGSL compute shaders for parallel field evaluation
- Parallel finite difference computation
- Optimized memory layout for GPU transfer
Current Behavior:
- Infrastructure and pipelines are in place
- All operations currently use CPU implementations
- Shaders can be added incrementally without API changes
Examples
See the examples/ directory for complete examples:
# Run geometric algebra example
# Run information geometry example
# Run calculus example (requires 'calculus' feature)
Development
Running Tests
# Run all tests
# Run with specific features
# Run GPU tests (requires GPU access)
Building Documentation
Future Work
Short-term (v0.13.x)
- Implement WGSL shaders for calculus operations
- Add GPU benchmarks comparing CPU vs GPU performance
- Optimize memory transfer patterns
- Add more comprehensive examples
- Restore tropical GPU module using extension traits (orphan impl fix)
Medium-term (v0.14.x - v0.15.x)
- Implement tropical algebra GPU operations
- Multi-GPU support for large holographic memories
- Performance optimization across all GPU modules
- Unified GPU context sharing across all modules
Long-term (v1.0.0+)
- WebGPU backend for browser deployment
- Multi-GPU support for distributed computation
- Kernel fusion optimization
- Custom WGSL shader compilation pipeline
Performance Considerations
- GPU Initialization: ~100-200ms startup cost for context creation
- Data Transfer: Significant overhead for small batches (< 500 elements)
- Optimal Use Cases: Large batch operations (> 1000 elements)
- Memory: GPU buffers are sized for batch operations (dynamically allocated)
Platform Support
| Platform | Backend | Status |
|---|---|---|
| Linux | Vulkan | ✅ Tested |
| macOS | Metal | ✅ Supported (not regularly tested) |
| Windows | DirectX 12 / Vulkan | ✅ Supported (not regularly tested) |
| WebAssembly | WebGPU | ⏸️ Requires webgpu feature |
Dependencies
- wgpu (v0.19): WebGPU implementation
- bytemuck: Zero-cost GPU buffer conversions
- nalgebra: Linear algebra operations
- tokio: Async runtime for GPU operations
- futures, pollster: Async utilities
License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT License (LICENSE-MIT)
at your option.
Contributing
Contributions are welcome! Areas of particular interest:
- WGSL shader implementations for calculus operations
- Performance benchmarks and optimization
- Platform-specific testing and bug reports
- Documentation improvements and examples
References
- WebGPU Specification
- wgpu Documentation
- Geometric Algebra GPU Acceleration (example reference)