meganeura
Meganeura - a cross-platform Neural Network training and inference library in Rust.

Why Meganeura?
- Portable. It's powered by blade-graphics for accessing GPUs across the board: Linux, Windows, MacOS, even edge devices on iOS or Android. Not toasters though.
- Fast. Within 2x of ROCm, and 5x of optimized CUDA or MLX for training workloads.
- Lean. It packs a bunch of kernels, but the real power comes from their auto-discovery. During the optimization pre-process, it explores the search space using e-graph, similar to Luminal.
Benchmarks
See Inferena for a comprehensive comparison between different frameworks.
SmolVLA action expert training (chunk_size=50, vlm_seq_len=16, float32, random weights).
| GPU | Framework | Compile | Forward | Backward |
|---|---|---|---|---|
| Radeon 890M (RADV) | Meganeura 3d34aad29c5c9151dfb59b2a3be073ac203c30af | 0 s | 14.2 ms | 36.4 ms |
| Radeon 890M (RADV) | PyTorch 2.10.0 ROCm | 7.30 s | 20.9 ms | 48.0 ms |
| GeForce RTX 5080 (590/Linux) | Meganeura 550bb6caf09c819f199084d2263794e14f683463 | 0 s | 6.1 ms | 35.1 ms |
| GeForce RTX 5080 (590/Linux) | PyTorch 2.11.0+cu128 | 3.41 s | 1.57 ms | 4.68 ms |
| Apple M3 | Meganeura 5ddf5e5c9b7b99ebb8d9d21e5c47110297ffeaa5 | 0s | 48.8 ms | 89.7 ms |
| Apple M3 | PyTorch 2.11.0 | 6.0s | 10.2 ms | 31.6 ms |
Full automatic differentiation through all ops including fused CausalAttention (with LSE-based backward), RoPE, Softmax, RmsNorm, and SwiGLU. Gradient norms verified against PyTorch via Inferena correctness checking.
Run bash bench/compare.sh to reproduce.
Profiling
Examples accept MEGANEURA_TRACE=<filename> environment for saming binary Perfetto traces.
You can open them with Perfetto Trace Viewer:

System Requirements
It works on on anything with Vulkan, including LavaPipe, or MacOS devices. Runs best when cooperative matrix operations are hardware-accelerated:
- Vulkan: GPU and driver supporting VK_KHR_cooperative_matrix (NVIDIA Volta+, AMD RDNA3+, Intel Arc)
- Metal: Apple GPU with simdgroup matrix support (Apple M1+)
Happy to rely on Mesa's Lavapipe for CI or local compatibility tests.
Standard Model Loading
Meganeura can load models from standard interchange formats and run them through the normal pipeline (e-graph optimization → compile → GPU execution) without any Rust codegen:
- ONNX — via
load_onnx("model.onnx"). Uses oxionnx-proto for lightweight protobuf parsing. Supports Gemm, MatMul, activations, normalization, attention, convolution, and shape ops. Decomposed subgraphs (fromtorch.onnx.export) should be re-exported with compound ops preserved viaoptimum-cli. - NNEF — via
load_nnef("model_dir/"). Hand-rolled parser for the Khronos NNEF text format and binary tensor files. Supports matmul, linear, convolution, activations, normalization, and reshape ops.
Both loaders translate the foreign graph into Meganeura's IR, where the e-graph optimizer discovers kernel fusions (e.g. x * sigmoid(x) → Silu, Silu(gate) * up → SwiGLU) before compilation.