# gpu-mumu
_A MuMu/Lava plugin that adds matrix & tensor operations with an optional Vulkan backend — and a zero-drama CPU fallback when no GPU is available._
**Crate:** `gpu-mumu`
**Library name (cdylib):** `mumugpu` → built as `libmumugpu.{so|dylib}` (Windows: `mumugpu.dll`)
**Version:** `0.2.0-rc.1`
**Engine compatibility:** `core-mumu = 0.9.0-rc.3`
**License:** MIT OR Apache-2.0
**Repository:** <https://gitlab.com/tofo/gpu-mumu>
**Homepage:** <https://lava.nu11.uk>
---
## What this plugin provides
- **A consistent tensor API** that works everywhere:
  - If a Vulkan device is present, a Vulkan context is created at load time.
  - If not, the plugin **falls back to optimized CPU paths** with identical results.
- **Batteries-included operations** for 2-D float matrices:
  - matrix multiply, elementwise add/subtract/**Hadamard**, transpose, 2×2 inverse,
    sum reduction, scalar scaling, and array↔“tensor” conversion helpers.
- **Strict shape/type checks** and clear error messages (ragged rows, shape mismatches, etc.).
- **Debug visibility** (debug builds): query whether the **last call** used the GPU.
Under the hood the crate ships **GLSL compute shaders** (built to SPIR-V if `glslc` is available at build time) alongside robust CPU implementations to guarantee portability.
---
## Quick start (MuMu)
Load the plugin and multiply two matrices:
```mu
extend("gpu")
A = [
[1, 0, 0, 0],
[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 0, 0, 1]
]
B = [
[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12],
  [13, 14, 15, 16]
]
AT = gpu:to_tensor(A) // validate & convert to Float2DArray
BT = gpu:to_tensor(B)
CT = gpu:multiply(AT, BT) // (4×4) · (4×4) -> (4×4)
slog(gpu:to_array(CT)) // -> [[1,2,3,4], [5,6,7,8], ...]
```
> The loader resolves `extend("gpu")` to a shared library named
> `libmumugpu.{so|dylib}` (Windows: `mumugpu.dll`) using the search
> paths documented by the core engine.
---
## API overview
All functions are registered as dynamic MuMu functions when the plugin is loaded.
Types below are MuMu runtime types from `core-mumu`:
| Function | Arguments | Returns | Notes |
|---|---|---|---|
| `gpu:to_tensor` | `(Int2DArray \| Float2DArray)` | `Float2DArray` | Validates rectangular shape; converts ints → floats. Errors on empty/ragged input. |
| `gpu:to_array` | `(Float2DArray)` | `Float2DArray` | Identity helper (useful to signal intent when composing). |
| `gpu:multiply` | `(Float2DArray A, Float2DArray B)` | `Float2DArray` | Matrix product `(m×k) · (k×n) -> (m×n)`. Errors on ragged rows or incompatible dimensions. |
| `gpu:add` | `(Float2DArray A, Float2DArray B)` | `Float2DArray` | Elementwise sum. Shapes must match exactly. |
| `gpu:subtract` | `(Float2DArray A, Float2DArray B)` | `Float2DArray` | Elementwise difference. Shapes must match. |
| `gpu:hadamard` | `(Float2DArray A, Float2DArray B)` | `Float2DArray` | Elementwise product (Hadamard). Shapes must match. |
| `gpu:transpose` | `(Float2DArray T)` | `Float2DArray` | Transpose `m×n -> n×m`. Validates rectangular rows. |
| `gpu:inverse` | `(Float2DArray T)` | `Float2DArray (2×2)` | **Only** 2×2 currently. Errors if singular or wrong size. |
| `gpu:reduce_sum` | `(Float2DArray T)` | `Float` | Sum of all elements. |
| `gpu:scale` | `(Int \| Float scalar, Float2DArray T)` | `Float2DArray` | Multiply every element by scalar. |
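The 2×2 inverse can be exercised like any other op in the table; a small sketch in the quick-start style (the matrix and its determinant are chosen here for illustration):

```mu
extend("gpu")
M = gpu:to_tensor([[1, 2], [3, 4]])  // det = 1·4 − 2·3 = −2, so M is invertible
MI = gpu:inverse(M)                  // -> [[-2, 1], [1.5, -0.5]]
slog(gpu:to_array(MI))
slog(gpu:multiply(M, MI))            // product recovers the 2×2 identity
```

A singular input (zero determinant) or any matrix that is not exactly 2×2 raises a runtime error, per the table above.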
### Debug helper (debug builds only)
| Function | Arguments | Returns | Notes |
|---|---|---|---|
| `gpu:last_call` | `()` | `KeyedArray { op: string, used_gpu: bool }` | Inspects the last GPU function call. `used_gpu` indicates whether a Vulkan context was active for that call (some ops currently run on CPU even if a context exists). |
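In a debug build, the helper can be paired with any op to confirm where it ran; a minimal sketch (the `op` string shown in the comment is illustrative, not a documented value):

```mu
extend("gpu")
T = gpu:to_tensor([[1, 2], [3, 4]])
S = gpu:reduce_sum(T)
info = gpu:last_call()  // e.g. { op: "reduce_sum", used_gpu: false } on a CPU-only host
slog(info)
```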
---
## Behavior & design details
### CPU fallback and Vulkan context
- On `extend("gpu")`, the plugin tries to create a Vulkan device using **ash**.
- If no device is found (or Vulkan initialization fails), execution **continues**;
all operations run on the CPU reference path with identical semantics.
- Where Vulkan is available, some operations may still call into the CPU path
(the SPIR-V kernels are shipped and compiled, but not all are wired up yet).
The debug helper `gpu:last_call()` makes this explicit.
### Types & shape safety
- The plugin treats the “tensor” as a plain `Float2DArray` in the core runtime.
- `gpu:to_tensor` acts as an ingest gate: it validates rectangular shapes and
normalizes ints to floats, so the rest of the API can assume dense float
matrices. Most ops will error on ragged rows or mismatched shapes.
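The ingest gate in practice (a sketch; the exact error text is engine-defined):

```mu
extend("gpu")
OK = gpu:to_tensor([[1, 2], [3, 4]])  // ints normalized to floats; shape validated
slog(OK)
BAD = gpu:to_tensor([[1, 2], [3]])    // ragged rows -> runtime error
```

Routing every input through `gpu:to_tensor` first keeps the later ops free of per-call conversion logic.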
### Threading & global state
- A single `AshVulkanContext` is stored in a global `Arc<Mutex<Option<_>>>`.
- Nothing is exported that mutates global state outside that lock.
- The library is designed to be loaded dynamically and dropped with the process.
---
## Building & installing (host-only plugin)
This crate builds a **cdylib** for dynamic loading. Typical flows:
```bash
# Build with Cargo (release)
cargo build --release
# Or use the provided Makefile (build + copy .so to /usr/local/lib)
make
sudo make install
```
> **Vulkan & shader notes**
>
> - A working Vulkan loader/runtime enables the GPU context.
> - If `glslc` is in `PATH`, `build.rs` compiles shaders in `shader/*.glsl` to
> SPIR-V and embeds them; otherwise the build continues with a warning.
> - The plugin remains fully functional on CPU without glslc or GPU drivers.
---
## Dependencies (high level)
- **Engine:** `core-mumu = 0.9.0-rc.3` (dynamic function registry, MuMu `Value` types).
- **Vulkan:** `ash = 0.38` (optional at runtime; CPU works without GPU).
- **Runtime:** `anyhow`, `log`, `env_logger`, `lazy_static`, `indexmap`, `libloading`.
> Web/WASM is **not** a target for this crate (host-only by design).
---
## Troubleshooting
- `extend("gpu")` prints *“plugin could not be located”*
→ Ensure `libmumugpu.{so|dylib|dll}` is on a loader search path
(core engine looks in common system locations and `$MUMU_PLUGIN_PATH`).
- *“No Vulkan physical devices found”* on load
→ That’s OK. The plugin will use the CPU reference path.
- Want to see what happened?
- Set `RUST_LOG=info` to see setup logs from the Vulkan context.
- Set `LAVA_TIMING_VERBOSE=1` to make the core REPL/driver print timing ticks.
- In **debug builds**, call `gpu:last_call()` to inspect `op` and `used_gpu`.
---
## Minimal examples
Elementwise operations and reductions:
```mu
extend("gpu")
T1 = gpu:to_tensor([[1,2,3],[4,5,6]])
T2 = gpu:to_tensor([[6,5,4],[3,2,1]])
slog(gpu:add(T1, T2)) // -> [[7,7,7],[7,7,7]]
slog(gpu:hadamard(T1, T2)) // -> [[6,10,12],[12,10,6]]
slog(gpu:reduce_sum(T1)) // -> 21
slog(gpu:scale(0.5, T1)) // -> [[0.5,1,1.5],[2,2.5,3]]
```
Matrix multiply and transpose:
```mu
extend("gpu")
A = gpu:to_tensor([[1,2],[3,4]]) // 2×2
B = gpu:to_tensor([[4,3],[2,1]]) // 2×2
C = gpu:multiply(A, B) // -> 2×2
slog(gpu:to_array(gpu:transpose(C)))
```
> Examples intentionally stay small; consult the function table for signatures.
---
## Project layout (key files)
- `src/lib.rs` — dynamic entrypoint and plugin setup.
- `src/registration.rs` — registers all `gpu:*` functions into the engine.
- `src/operators/*` — operation bridges & helpers (`ensure_float2d`, elementwise, conversions).
- `src/cpu_ops.rs` — CPU reference implementations (multiply, transpose, reduce, scale, 2×2 inverse).
- `src/vulkan.rs` — ash-based Vulkan context initialisation.
- `shader/*.glsl` — compute kernels (compiled by `build.rs` if `glslc` is present).
- `examples/4x4.mu` — tiny end-to-end sample script.
---
## Versioning & license
This crate follows **pre-release semver** while the MuMu/Lava engine evolves.
The API is expected to stabilise with the `0.2.x` series.
Licensed under either of:
- MIT license
- Apache License, Version 2.0
at your option.
---
## Acknowledgements
Built for the MuMu/Lava ecosystem. Thanks to the `ash` project and the Vulkan community.
If you have ideas, issues, or want to wire more ops to the GPU kernels, please open an issue or MR at **GitLab**: <https://gitlab.com/tofo/gpu-mumu>.