# gpu-mumu
_A MuMu/Lava plugin that adds matrix & tensor operations with an optional Vulkan backend — and a zero-drama CPU fallback when no GPU is available._
**Crate:** `gpu-mumu`
**Library name (cdylib):** `mumugpu` → built as `libmumugpu.{so|dylib}` (Windows: `mumugpu.dll`)
**Version:** `0.2.0-rc.1`
**Engine compatibility:** `core-mumu = 0.9.0-rc.3`
**License:** MIT OR Apache-2.0
**Repository:** <https://gitlab.com/tofo/gpu-mumu>
**Homepage:** <https://lava.nu11.uk>
---
## What this plugin provides
- **A consistent tensor API** that works everywhere:
  - If a Vulkan device is present, a Vulkan context is created at load time.
  - If not, the plugin **falls back to optimized CPU paths** with identical results.
- **Batteries-included operations** for 2-D float matrices:
  - matrix multiply, elementwise add/subtract/**Hadamard**, transpose, 2×2 inverse,
    sum reduction, scalar scaling, and array↔“tensor” conversion helpers.
- **Strict shape/type checks** and clear error messages (ragged rows, shape mismatches, etc.).
- **Debug visibility** (debug builds): query whether the **last call** used the GPU.
Under the hood the crate ships **GLSL compute shaders** (built to SPIR-V if `glslc` is available at build time) alongside robust CPU implementations to guarantee portability.
---
## Quick start (MuMu)
Load the plugin and multiply two matrices:
```mu
extend("gpu")
A = [
[1, 0, 0, 0],
[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 0, 0, 1]
]
B = [
[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12],
  [13, 14, 15, 16]
]
AT = gpu:to_tensor(A) // validate & convert to Float2DArray
BT = gpu:to_tensor(B)
CT = gpu:multiply(AT, BT) // (4×4) · (4×4) -> (4×4)
slog(gpu:to_array(CT)) // -> [[1,2,3,4], [5,6,7,8], ...]
```
> The loader resolves `extend("gpu")` to a shared library named
> `libmumugpu.{so|dylib}` (Windows: `mumugpu.dll`) using the search
> paths documented by the core engine.
---
## API overview
All functions are registered as dynamic MuMu functions when the plugin is loaded.
Types below are MuMu runtime types from `core-mumu`:
| Function | Arguments | Returns | Notes |
|---|---|---|---|
| `gpu:to_tensor` | `(Int2DArray \| Float2DArray)` | `Float2DArray` | Validates rectangular shape; converts ints → floats. Errors on empty/ragged input. |
| `gpu:to_array` | `(Float2DArray)` | `Float2DArray` | Identity helper (useful to signal intent when composing). |
| `gpu:multiply` | `(Float2DArray A, Float2DArray B)` | `Float2DArray` | Matrix product `(m×k) · (k×n) -> (m×n)`. Errors on ragged rows or incompatible dimensions. |
| `gpu:add` | `(Float2DArray A, Float2DArray B)` | `Float2DArray` | Elementwise sum. Shapes must match exactly. |
| `gpu:subtract` | `(Float2DArray A, Float2DArray B)` | `Float2DArray` | Elementwise difference. Shapes must match. |
| `gpu:hadamard` | `(Float2DArray A, Float2DArray B)` | `Float2DArray` | Elementwise product (Hadamard). Shapes must match. |
| `gpu:transpose` | `(Float2DArray T)` | `Float2DArray` | Transpose `m×n -> n×m`. Validates rectangular rows. |
| `gpu:inverse` | `(Float2DArray T)` | `Float2DArray (2×2)` | **Only** 2×2 currently. Errors if singular or wrong size. |
| `gpu:reduce_sum` | `(Float2DArray T)` | `Float` | Sum of all elements. |
| `gpu:scale` | `(Int \| Float scalar, Float2DArray T)` | `Float2DArray` | Multiply every element by scalar. |
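The 2×2 inverse can be exercised like any other op in the table; a small sketch in the quick-start style (the matrix and its determinant are chosen here for illustration):

```mu
extend("gpu")
M = gpu:to_tensor([[1, 2], [3, 4]])  // det = 1·4 − 2·3 = −2, so M is invertible
MI = gpu:inverse(M)                  // -> [[-2, 1], [1.5, -0.5]]
slog(gpu:to_array(MI))
slog(gpu:multiply(M, MI))            // product recovers the 2×2 identity
```

A singular input (zero determinant) or any matrix that is not exactly 2×2 raises a runtime error, per the table above.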
### Debug helper (debug builds only)
| Function | Arguments | Returns | Notes |
|---|---|---|---|
| `gpu:last_call` | `()` | `KeyedArray { op: string, used_gpu: bool }` | Inspects the last GPU function call. `used_gpu` indicates whether a Vulkan context was active for that call (some ops currently run on CPU even if a context exists). |
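In a debug build, the helper can be paired with any op to confirm where it ran; a minimal sketch (the `op` string shown in the comment is illustrative, not a documented value):

```mu
extend("gpu")
T = gpu:to_tensor([[1, 2], [3, 4]])
S = gpu:reduce_sum(T)
info = gpu:last_call()  // e.g. { op: "reduce_sum", used_gpu: false } on a CPU-only host
slog(info)
```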
---
## Behavior & design details
### CPU fallback and Vulkan context
- On `extend("gpu")`, the plugin tries to create a Vulkan device using **ash**.
- If no device is found (or Vulkan initialization fails), execution **continues**;
all operations run on the CPU reference path with identical semantics.
- Where Vulkan is available, some operations may still call into the CPU path
(the SPIR-V kernels are shipped and compiled, but not all are wired up yet).
The debug helper `gpu:last_call()` makes this explicit.
### Types & shape safety
- The plugin treats the “tensor” as a plain `Float2DArray` in the core runtime.
- `gpu:to_tensor` acts as an ingest gate: it validates rectangular shapes and
normalizes ints to floats, so the rest of the API can assume dense float
matrices. Most ops will error on ragged rows or mismatched shapes.
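The ingest gate in practice (a sketch; the exact error text is engine-defined):

```mu
extend("gpu")
OK = gpu:to_tensor([[1, 2], [3, 4]])  // ints normalized to floats; shape validated
slog(OK)
BAD = gpu:to_tensor([[1, 2], [3]])    // ragged rows -> runtime error
```

Routing every input through `gpu:to_tensor` first keeps the later ops free of per-call conversion logic.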
### Threading & global state
- A single `AshVulkanContext` is stored in a global `Arc<Mutex<Option<_>>>`.
- Nothing is exported that mutates global state outside that lock.
- The library is designed to be loaded dynamically and dropped with the process.
---
## Building & installing (host-only plugin)
This crate builds a **cdylib** for dynamic loading. Typical flows:
```bash
# Build with Cargo (release)
cargo build --release
# Or use the provided Makefile (build + copy .so to /usr/local/lib)
make
sudo make install
```
> **Vulkan & shader notes**
>
> - A working Vulkan loader/runtime enables the GPU context.
> - If `glslc` is in `PATH`, `build.rs` compiles shaders in `shader/*.glsl` to
> SPIR-V and embeds them; otherwise the build continues with a warning.
> - The plugin remains fully functional on CPU without glslc or GPU drivers.
---
## Dependencies (high level)
- **Engine:** `core-mumu = 0.9.0-rc.3` (dynamic function registry, MuMu `Value` types).
- **Vulkan:** `ash = 0.38` (optional at runtime; CPU works without GPU).
- **Runtime:** `anyhow`, `log`, `env_logger`, `lazy_static`, `indexmap`, `libloading`.
> Web/WASM is **not** a target for this crate (host-only by design).
---
## Troubleshooting
- `extend("gpu")` prints *“plugin could not be located”*
→ Ensure `libmumugpu.{so|dylib|dll}` is on a loader search path
(core engine looks in common system locations and `$MUMU_PLUGIN_PATH`).
- *“No Vulkan physical devices found”* on load
→ That’s OK. The plugin will use the CPU reference path.
- Want to see what happened?
- Set `RUST_LOG=info` to see setup logs from the Vulkan context.
- Set `LAVA_TIMING_VERBOSE=1` to make the core REPL/driver print timing ticks.
- In **debug builds**, call `gpu:last_call()` to inspect `op` and `used_gpu`.
---
## Minimal examples
Elementwise operations and reductions:
```mu
extend("gpu")
T1 = gpu:to_tensor([[1,2,3],[4,5,6]])
T2 = gpu:to_tensor([[6,5,4],[3,2,1]])
slog(gpu:add(T1, T2)) // -> [[7,7,7],[7,7,7]]
slog(gpu:hadamard(T1, T2)) // -> [[6,10,12],[12,10,6]]
slog(gpu:reduce_sum(T1)) // -> 21
slog(gpu:scale(0.5, T1)) // -> [[0.5,1,1.5],[2,2.5,3]]
```
Matrix multiply and transpose:
```mu
extend("gpu")
A = gpu:to_tensor([[1,2],[3,4]]) // 2×2
B = gpu:to_tensor([[4,3],[2,1]]) // 2×2
C = gpu:multiply(A, B) // -> 2×2
slog(gpu:to_array(gpu:transpose(C)))
```
> Examples intentionally stay small; consult the function table for signatures.
---
## Project layout (key files)
- `src/lib.rs` — dynamic entrypoint and plugin setup.
- `src/registration.rs` — registers all `gpu:*` functions into the engine.
- `src/operators/*` — operation bridges & helpers (`ensure_float2d`, elementwise, conversions).
- `src/cpu_ops.rs` — CPU reference implementations (multiply, transpose, reduce, scale, 2×2 inverse).
- `src/vulkan.rs` — ash-based Vulkan context initialisation.
- `shader/*.glsl` — compute kernels (compiled by `build.rs` if `glslc` is present).
- `examples/4x4.mu` — tiny end-to-end sample script.
---
## Versioning & license
This crate follows **pre-release semver** while the MuMu/Lava engine evolves.
The API is expected to stabilise with the `0.2.x` series.
Licensed under either of:
- MIT license
- Apache License, Version 2.0
at your option.
---
## Acknowledgements
Built for the MuMu/Lava ecosystem. Thanks to the `ash` project and the Vulkan community.
If you have ideas, issues, or want to wire more ops to the GPU kernels, please open an issue or MR at **GitLab**: <https://gitlab.com/tofo/gpu-mumu>.