VKML — Vulkan Machine Learning
A greenfield Vulkan 1.4 ONNX inference runtime implemented in Rust. Vendor-agnostic, lock-free, and designed for heterogeneous multi-GPU execution.
An accompanying paper is pending publication on arXiv (written against the frozen paper branch).
Originally inspired by VkFFT's demonstration that Vulkan compute for Fast Fourier Transform can match or exceed CUDA performance.
Highlights
- 1.62 µs/op dispatch latency on a linear chain-add benchmark, vs ONNX Runtime's CUDA at 2.06 µs/op
- ~10× lower runtime host memory usage compared to ONNX Runtime
- Multi-vendor GPU execution — automatic graph partitioning across NVIDIA, Intel, and AMD within a single process
- Lock-free graph scheduler — event-driven, self-propagating execution with no central coordinator
Usage
Loading and executing an ONNX model. This example creates the input tensor on the CPU, runs the model on the GPU, then reads the output back to the CPU:
use ;
Architecture
ONNX Model File                       Hardware
┌──────────────┐    ┌───────────────────────────────────────────┐
│              │    │             Execution Engine              │
│    onnx-     │    │                                           │
│  extractor   │───▶│  Graph ──▶ Greedy Device ──▶ Execution    │
│ (zero-copy)  │    │  Build     Placement         Chunks       │
│              │    │                                 │         │
└──────────────┘    │                ┌────────────────┘         │
                    │                ▼                          │
                    │    Event-Driven Scheduler (lock-free)     │
                    │   ┌─────────┐  ┌─────────┐  ┌─────────┐   │
                    │   │ Chunk 0 │─▶│ Chunk 1 │─▶│ Chunk N │   │
                    │   │ (GPU 0) │  │ (GPU 0) │  │ (GPU 1) │   │
                    │   └─────────┘  └─────────┘  └─────────┘   │
                    │           powered by zero-pool            │
                    └───────────────────────────────────────────┘
Key design decisions:
- Graph compilation groups contiguous operations on the same device into single Vulkan command buffers, minimising driver dispatch overhead
- Self-propagating execution — chunk completion signals atomic dependency counters; zero-count triggers immediate successor submission with no central coordinator
- Automatic heterogeneity — the greedy allocator transparently injects TransferToDevice instructions when operations span different physical devices
- Zero-copy throughout — model loading, CPU allocations, and GPU reference-based transfers avoid unnecessary copies
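The self-propagating execution described above can be illustrated with a minimal sketch. It is not VKML's scheduler: the `Chunk` struct, `build_graph`, `run_chunk`, and `execute` are hypothetical names, plain threads stand in for the thread pool, and an `mpsc` channel is used only so the example can observe completion order. Each chunk carries an atomic counter of unfinished dependencies; whichever worker drops a successor's counter to zero submits that successor itself.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;

/// Hypothetical chunk node: unfinished-dependency counter plus successor ids.
struct Chunk {
    pending: AtomicUsize,
    successors: Vec<usize>,
}

/// Build the graph from per-chunk successor lists; `pending` starts at the in-degree.
fn build_graph(successors: Vec<Vec<usize>>) -> Arc<Vec<Chunk>> {
    let mut in_degree = vec![0usize; successors.len()];
    for targets in &successors {
        for &t in targets {
            in_degree[t] += 1;
        }
    }
    let chunks = successors
        .into_iter()
        .zip(in_degree)
        .map(|(successors, pending)| Chunk { pending: AtomicUsize::new(pending), successors })
        .collect();
    Arc::new(chunks)
}

/// "Execute" chunk `id`, then release its successors.
fn run_chunk(graph: Arc<Vec<Chunk>>, id: usize, done: mpsc::Sender<usize>) {
    // A real runtime would submit a command buffer here; the sketch records the id.
    done.send(id).unwrap();
    for &next in &graph[id].successors {
        // fetch_sub returns the previous value: 1 means this was the last dependency,
        // so this worker (and only this worker) submits the successor.
        if graph[next].pending.fetch_sub(1, Ordering::AcqRel) == 1 {
            let (g, d) = (Arc::clone(&graph), done.clone());
            thread::spawn(move || run_chunk(g, next, d));
        }
    }
}

/// Run the whole graph and return chunk ids in completion order.
fn execute(successors: Vec<Vec<usize>>) -> Vec<usize> {
    let graph = build_graph(successors);
    // Collect the roots before starting any work, so a chunk released by a
    // running root cannot also be mistaken for a root.
    let roots: Vec<usize> = graph
        .iter()
        .enumerate()
        .filter(|(_, c)| c.pending.load(Ordering::Acquire) == 0)
        .map(|(id, _)| id)
        .collect();
    let (tx, rx) = mpsc::channel();
    for id in roots {
        let (g, d) = (Arc::clone(&graph), tx.clone());
        thread::spawn(move || run_chunk(g, id, d));
    }
    drop(tx); // the receiver finishes once every worker has dropped its sender
    rx.iter().collect()
}
```

For a diamond-shaped graph, `execute(vec![vec![1, 2], vec![3], vec![3], vec![]])` always reports chunk 0 first and chunk 3 last, while chunks 1 and 2 may complete in either order.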
Benchmarks
Performance comparison against ONNX Runtime (CUDA Execution Provider) and ncnn (Vulkan) on an RTX 3080, all FP32. Second-pass inference latency (steady state, batch size 1).
| Model | Size | VKML | ONNX-RT (CUDA) | ncnn (Vulkan) |
|---|---|---|---|---|
| mnist-12 (12 ops) | 0.15 MiB | 97 µs | 161 µs | 435 µs |
| Conv-heavy CNN (15 ops) | 266 MiB | 2.99 ms | 4.37 ms | 5.47 ms |
| MatMul-heavy (19 ops) | 973 MiB | 72.39 ms | 10.78 ms¹ | 72.33 ms |
| Chained Add (scaling) | <1 MiB | 1.62 µs/op | 2.06 µs/op | 9.11 µs/op |
¹ ONNX-RT benefits from opaque Tensor Core / TF32 acceleration on large MatMul workloads. VKML and ncnn use strictly FP32 compute paths.
Full methodology, percentile breakdowns, and memory comparisons are detailed in the paper.
Ecosystem
VKML is built on two standalone crates, each usable independently:
| Crate | Description |
|---|---|
| zero-pool | Lock-free, zero-dependency thread pool with cooperative memory reclamation. Miri verified. |
| onnx-extractor | Lightweight ONNX parser with mmap and zero-copy tensor access. |
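As a rough illustration of what zero-copy tensor access can mean, the sketch below reinterprets a borrowed byte buffer (for example, a region of a memory-mapped file) as `f32` values without copying. It does not use onnx-extractor's API: `view_f32` and `as_bytes` are hypothetical helpers written against the standard library only. ONNX stores raw tensor data in little-endian order, so the direct reinterpretation is valid only on little-endian hosts.

```rust
use std::mem::{align_of, size_of};

/// Reinterpret raw tensor bytes as `f32` values without copying.
/// Returns `None` if the buffer is misaligned, is not a whole number of
/// floats, or the host is big-endian.
fn view_f32(bytes: &[u8]) -> Option<&[f32]> {
    let aligned = bytes.as_ptr() as usize % align_of::<f32>() == 0;
    let whole = bytes.len() % size_of::<f32>() == 0;
    if cfg!(target_endian = "big") || !aligned || !whole {
        return None;
    }
    // SAFETY: the pointer is aligned for f32 (checked above), the length covers
    // exactly the borrowed bytes, every bit pattern is a valid f32, and the
    // returned slice borrows from `bytes`, so it cannot outlive the buffer.
    Some(unsafe {
        std::slice::from_raw_parts(bytes.as_ptr().cast::<f32>(), bytes.len() / size_of::<f32>())
    })
}

/// Helper for the example: view `f32` values as their underlying bytes.
fn as_bytes(values: &[f32]) -> &[u8] {
    // SAFETY: u8 has alignment 1 and the length covers exactly the same memory.
    unsafe {
        std::slice::from_raw_parts(values.as_ptr().cast::<u8>(), values.len() * size_of::<f32>())
    }
}
```

The returned slice points into the original buffer, so no tensor data is duplicated. A caller has to fall back to a copying path when `None` is returned, since tensor offsets inside a file are not guaranteed to be 4-byte aligned.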
Building
Requires Slang in PATH to compile shaders at runtime.
Roadmap
- Kernel optimisation & mixed precision — expand VK_NV_cooperative_matrix2 usage, integrate FP8 support via VK_EXT_shader_float8
- Advanced memory management — streaming model loader via mmap, PCIe Peer-to-Peer transfers via VK_KHR_external_memory
- Platform expansion — validate on Apple Silicon via MoltenVK, add VK_KHR_device_group for NVLink
- Hybrid execution — multi-threaded CPU operations, dynamic graph support (conditionals, loops), backwards pass for training
Project Priorities
- Universal compute utilisation (leverages any available hardware combination)
- High heterogeneous compute efficiency
- Predictable and consistent performance
- Ease of use
References
Vulkan Resources
- Cooperative Matrix Performance
- VkFFT — IEEE Paper
- Vulkan Tutorial
- Ash-rs
- Vulkanalia
- Nvidia Recommended Do's and Don'ts
Related Projects
License
MIT
Internal implementation notes, assumptions, and planned work.
Overall TODOs
- More of the ONNX operators spec
- VK_NV_cooperative_matrix2
- Interface for manual instruction and tensor modification into model and/or tensor graphs
- Backwards pass to allow training a model
- Dynamic graph support (Conditionals, Loops, etc)
- Multi-threading for CPU operators (matmul, gemm, etc.); currently they serve only as single-threaded references
Thread Pool Implementation
- A single global thread pool, zero-pool, is used throughout the entire process
- As a result, the entire process is multi-threaded where possible and lock-free in most cases; the exception is some older discrete GPU (dGPU) configurations that require staged allocation logic
GPU Management
- Memory tracking implemented using VK_EXT_memory_budget when available
- Tracks both self usage and initial usage from other processes
- Configurable threshold (default 95% of available memory)
- Multi-threaded allocation
- Automatic model placement across available devices (GPUs and CPU)
- Automatically creates transfer operations when model is split across devices
- Handles host-visible vs device-local memory requirements
- GPU features are taken into account, and performance features are toggled as supported on a per device level
- GPU-to-GPU movement currently routes through CPU
- Need to investigate Vulkan device pools
- Research needed on VK shared memory pool extensions
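The placement behaviour listed above (a memory threshold per device, automatic placement, and transfer operations at device boundaries) can be sketched as follows. This is a simplified model under stated assumptions, not VKML's allocator: the model is a linear chain of operations, `Device`, `Instruction`, `place`, and `chunks` are hypothetical names, and `available_bytes` stands in for the heap budget minus heap usage that VK_EXT_memory_budget reports. The last function groups contiguous same-device operations, mirroring the way each execution chunk maps to a single command buffer.

```rust
/// Hypothetical device description; `available_bytes` stands in for budget minus usage.
#[derive(Debug, Clone)]
struct Device {
    available_bytes: u64,
}

/// One step of the placed plan.
#[derive(Debug, PartialEq)]
enum Instruction {
    /// Run operation `op` on device `device`.
    Run { op: usize, device: usize },
    /// Move the intermediate result between devices (stand-in for TransferToDevice).
    TransferToDevice { from: usize, to: usize },
}

/// Greedily place a linear chain of operations. `op_bytes[i]` is the memory that
/// operation `i` needs; each device is filled up to `threshold_percent` of its
/// available memory. Returns `None` if the chain does not fit.
fn place(devices: &[Device], op_bytes: &[u64], threshold_percent: u64) -> Option<Vec<Instruction>> {
    let mut plan = Vec::new();
    let mut current = 0usize;
    let mut used = 0u64;
    let mut previous: Option<usize> = None;
    for (op, &bytes) in op_bytes.iter().enumerate() {
        // Advance to the next device until the operation fits under the threshold.
        loop {
            let device = devices.get(current)?;
            let limit = device.available_bytes / 100 * threshold_percent;
            if used + bytes <= limit {
                break;
            }
            current += 1;
            used = 0;
        }
        // Inject a transfer whenever consecutive operations land on different devices.
        if let Some(prev) = previous {
            if prev != current {
                plan.push(Instruction::TransferToDevice { from: prev, to: current });
            }
        }
        plan.push(Instruction::Run { op, device: current });
        used += bytes;
        previous = Some(current);
    }
    Some(plan)
}

/// Group contiguous operations on the same device into `(device, ops)` chunks.
fn chunks(plan: &[Instruction]) -> Vec<(usize, Vec<usize>)> {
    let mut out: Vec<(usize, Vec<usize>)> = Vec::new();
    for step in plan {
        if let Instruction::Run { op, device } = step {
            let same_device = matches!(out.last(), Some((d, _)) if d == device);
            if same_device {
                out.last_mut().unwrap().1.push(*op);
            } else {
                out.push((*device, vec![*op]));
            }
        }
    }
    out
}
```

With two devices of 1000 available bytes each, a 95% threshold, and four operations of 400 bytes, the first two operations stay on device 0, a single transfer is inserted, and the remaining two run on device 1.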
Architecture Decisions
- Model, Layer, Tensor etc. act as descriptors/blueprints only
- Allows the compute manager to handle all data and memory
- Large separation between blueprint layers and final tensor DAG
- Zero-copy optimisations:
  - Model loading is full zero-copy
  - CPU allocations use zero-copy transfer when possible
  - GPU allocations use reference-based zero-copy transfer when possible
- Model storage is sequential in memory
- Avoids unnecessary CPU transfers
- Current compute implementation:
  - All work that can logically be parallelised is parallelised
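The split between blueprints and data ownership can be pictured with a small sketch. All names here (`DType`, `TensorDesc`, `ComputeManager`) are hypothetical and chosen for illustration; the real types are likely richer. The descriptor carries only shape and element type, while a separate manager owns every buffer and hands out ids.

```rust
/// Hypothetical element types; only the byte width matters for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DType {
    F32,
    F16,
    U8,
}

impl DType {
    fn size_in_bytes(self) -> usize {
        match self {
            DType::F32 => 4,
            DType::F16 => 2,
            DType::U8 => 1,
        }
    }
}

/// A tensor blueprint: shape and element type only, no data.
#[derive(Debug, Clone, PartialEq)]
struct TensorDesc {
    shape: Vec<usize>,
    dtype: DType,
}

impl TensorDesc {
    /// Storage needed for this tensor; an empty shape is treated as a scalar.
    fn size_in_bytes(&self) -> usize {
        self.shape.iter().product::<usize>() * self.dtype.size_in_bytes()
    }
}

/// Hypothetical manager that owns every buffer; callers only hold ids.
#[derive(Default)]
struct ComputeManager {
    buffers: Vec<Vec<u8>>,
}

impl ComputeManager {
    /// Allocate zeroed storage for a descriptor and return its id.
    fn allocate(&mut self, desc: &TensorDesc) -> usize {
        self.buffers.push(vec![0u8; desc.size_in_bytes()]);
        self.buffers.len() - 1
    }

    /// Borrow the bytes behind an id.
    fn bytes(&self, id: usize) -> &[u8] {
        &self.buffers[id]
    }
}
```

Because descriptors hold no data, they can be cloned, compared, and rearranged freely during graph construction, and the manager remains the single place where memory decisions are made.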
Vulkan Usage
- Vendor-specific extensions can become standard extensions depending on adoption. As of 2025, ARM appears to be focusing on adding ML-specific extensions to Vulkan
- As of 1.3.300 VK_NV_cooperative_matrix
- As of 1.4.317 VK_EXT_shader_float8
- As of 1.4.319 VK_ARM_data_graph