vkml-0.0.2 has been yanked.

VKML — Vulkan Machine Learning

A greenfield Vulkan 1.4 ONNX inference runtime implemented in Rust. Vendor-agnostic, lock-free, and designed for heterogeneous multi-GPU execution.

A paper is pending publication on arXiv (written against the frozen paper branch).

Originally inspired by VkFFT's demonstration that Vulkan compute for Fast Fourier Transform can match or exceed CUDA performance.

Highlights

1.62 µs/op dispatch latency on a linear chain-add benchmark, versus 2.06 µs/op for ONNX Runtime's CUDA Execution Provider
~10× lower runtime host memory usage compared to ONNX Runtime
Multi-vendor GPU execution — automatic graph partitioning across NVIDIA, Intel, and AMD within a single process
Lock-free graph scheduler — event-driven, self-propagating execution with no central coordinator

Usage

Loading and executing an ONNX model. This example creates the input tensor on the CPU, runs the model on the GPU, then reads the output back to the CPU:

use vkml::{ComputeManager, DataType, Tensor, TensorDesc, VKMLError};

fn main() -> Result<(), VKMLError> {
    // Load the ONNX model from disk.
    let mut manager = ComputeManager::new_from_onnx_path("mnist-12.onnx")?;

    // Describe a 1x1x28x28 FP32 input and back it with a zero-filled CPU buffer.
    let desc = TensorDesc::new(vec![1, 1, 28, 28], DataType::Float);
    let input = Tensor::new_cpu(desc.clone(), vec![0u8; desc.size_in_bytes()].into());

    // Run the model, then read the output tensors back to the CPU.
    let out_ids = manager.forward(vec![input])?;
    let outputs = manager.tensor_read_vec(&out_ids);
    Ok(())
}
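The example above fills the input with zero bytes. For real data, the f32 values have to be packed into a byte buffer first. The following is a minimal sketch using only the standard library; it assumes the runtime reads `DataType::Float` buffers as tightly packed native-endian f32 values, which this README does not state, and the helper names `f32_to_bytes` and `bytes_to_f32` are hypothetical.

```rust
/// Packs f32 values into raw bytes in native byte order.
/// Assumption: Float tensors are tightly packed native-endian f32 values.
fn f32_to_bytes(values: &[f32]) -> Vec<u8> {
    let mut bytes = Vec::with_capacity(values.len() * std::mem::size_of::<f32>());
    for v in values {
        bytes.extend_from_slice(&v.to_ne_bytes());
    }
    bytes
}

/// Inverse helper: reinterprets a raw buffer as f32 values.
/// Trailing bytes that do not form a full f32 are ignored.
fn bytes_to_f32(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(4)
        .map(|c| f32::from_ne_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

fn main() {
    // Hypothetical 28x28 greyscale image, normalised to 0.0..=1.0.
    let pixels: Vec<f32> = (0..28 * 28).map(|i| (i % 256) as f32 / 255.0).collect();

    let bytes = f32_to_bytes(&pixels);
    assert_eq!(bytes.len(), 28 * 28 * 4);

    // Round-tripping through bytes returns the original values.
    assert_eq!(bytes_to_f32(&bytes), pixels);
    println!("packed {} floats into {} bytes", pixels.len(), bytes.len());
}
```

A buffer built this way could take the place of the zero-filled `Vec<u8>` in the usage example, provided its length matches `desc.size_in_bytes()`.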

Architecture

ONNX Model File                        Hardware
 ┌──────────────┐     ┌───────────────────────────────────────────┐
 │              │     │  Execution Engine                         │
 │  onnx-       │     │                                           │
 │  extractor   │────▶│  Graph ───▶ Greedy Device ───▶ Execution  │
 │ (zero-copy)  │     │  Build      Placement          Chunks     │
 │              │     │                               │           │
 └──────────────┘     │              ┌────────────────┘           │
                      │              ▼                            │
                      │  Event-Driven Scheduler (lock-free)       │
                      │  ┌─────────┐   ┌─────────┐   ┌─────────┐  │
                      │  │ Chunk 0 │──▶│ Chunk 1 │──▶│ Chunk N │  │
                      │  │ (GPU 0) │   │ (GPU 0) │   │ (GPU 1) │  │
                      │  └─────────┘   └─────────┘   └─────────┘  │
                      │           powered by zero-pool            │
                      └───────────────────────────────────────────┘

Key design decisions:

Graph compilation groups contiguous operations on the same device into single Vulkan command buffers, minimising driver dispatch overhead
Self-propagating execution — chunk completion signals atomic dependency counters; zero-count triggers immediate successor submission with no central coordinator
Automatic heterogeneity — the greedy allocator transparently injects TransferToDevice instructions when operations span different physical devices
Zero-copy throughout — model loading, CPU allocations, and GPU reference-based transfers avoid unnecessary copies
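The self-propagating execution model can be illustrated with a small standard-library sketch. It is not VKML's scheduler: the `Chunk` struct, `run_chunk`, and `execute` are hypothetical names, plain threads stand in for zero-pool tasks and Vulkan submissions, and the "work" is only a timestamp. What it does show is the core idea: each chunk holds an atomic count of unfinished predecessors, and whichever predecessor drops that count to zero submits the successor itself.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Hypothetical stand-in for an execution chunk; not VKML's actual type.
struct Chunk {
    /// Number of predecessor chunks that have not completed yet.
    pending: AtomicUsize,
    /// Indices of the chunks that depend on this one.
    successors: Vec<usize>,
    /// Position in the global completion order, written when the chunk runs.
    finished_at: AtomicUsize,
}

fn chunk(pending: usize, successors: &[usize]) -> Chunk {
    Chunk {
        pending: AtomicUsize::new(pending),
        successors: successors.to_vec(),
        finished_at: AtomicUsize::new(usize::MAX),
    }
}

/// Runs one chunk, then submits every successor whose counter reaches zero.
fn run_chunk<'scope, 'env>(
    scope: &'scope thread::Scope<'scope, 'env>,
    chunks: &'env [Chunk],
    clock: &'env AtomicUsize,
    id: usize,
) {
    let me = &chunks[id];
    // Stand-in for real work: record when this chunk completed.
    let tick = clock.fetch_add(1, Ordering::AcqRel);
    me.finished_at.store(tick, Ordering::Release);

    for &next in &me.successors {
        // fetch_sub returns the previous value, so 1 means "we were the last
        // outstanding dependency" and exactly one thread submits the successor.
        if chunks[next].pending.fetch_sub(1, Ordering::AcqRel) == 1 {
            scope.spawn(move || run_chunk(scope, chunks, clock, next));
        }
    }
}

/// Executes the graph once and returns chunk ids in completion order.
fn execute(chunks: &[Chunk]) -> Vec<usize> {
    let clock = AtomicUsize::new(0);
    let clock = &clock;
    // Collect the roots before starting, so no chunk is submitted twice.
    let roots: Vec<usize> = (0..chunks.len())
        .filter(|&i| chunks[i].pending.load(Ordering::Acquire) == 0)
        .collect();

    thread::scope(|s| {
        for id in roots {
            s.spawn(move || run_chunk(s, chunks, clock, id));
        }
    });

    let mut order: Vec<usize> = (0..chunks.len()).collect();
    order.sort_by_key(|&i| chunks[i].finished_at.load(Ordering::Acquire));
    order
}

fn main() {
    // Diamond graph: 0 -> {1, 2} -> 3.
    let graph = vec![chunk(0, &[1, 2]), chunk(1, &[3]), chunk(1, &[3]), chunk(2, &[])];
    let order = execute(&graph);
    println!("completion order: {:?}", order);
}
```

In the diamond graph, chunk 0 always completes first and chunk 3 last, while chunks 1 and 2 may finish in either order. The counters are consumed by a run, so a reusable scheduler would need to reset them between passes.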

Benchmarks

Performance comparison against ONNX Runtime (CUDA Execution Provider) and ncnn (Vulkan) on an RTX 3080. All runs use FP32, and all figures are second-pass inference latency (steady state, batch size 1).

Model	Size	VKML	ONNX-RT (CUDA)	ncnn (Vulkan)
mnist-12 (12 ops)	0.15 MiB	97 µs	161 µs	435 µs
Conv-heavy CNN (15 ops)	266 MiB	2.99 ms	4.37 ms	5.47 ms
MatMul-heavy (19 ops)	973 MiB	72.39 ms	10.78 ms¹	72.33 ms
Chained Add (scaling)	<1 MiB	1.62 µs/op	2.06 µs/op	9.11 µs/op

¹ ONNX-RT benefits from opaque Tensor Core / TF32 acceleration on large MatMul workloads. VKML and ncnn use strictly FP32 compute paths.
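As a reading aid, the relative speedups implied by the table can be computed directly from its figures. The snippet below is only arithmetic over the numbers listed above; the `speedup` helper is a hypothetical name, and values above 1.0 mean VKML is faster.

```rust
/// Ratio of a baseline latency to VKML's latency.
/// Values above 1.0 mean VKML is faster; below 1.0 mean it is slower.
fn speedup(baseline: f64, vkml: f64) -> f64 {
    baseline / vkml
}

fn main() {
    // (row, VKML, ONNX-RT CUDA, ncnn Vulkan), copied from the table above.
    // Units differ per row but cancel out in the ratio.
    let rows = [
        ("mnist-12", 97.0, 161.0, 435.0),
        ("Conv-heavy CNN", 2.99, 4.37, 5.47),
        ("MatMul-heavy", 72.39, 10.78, 72.33),
        ("Chained Add (per op)", 1.62, 2.06, 9.11),
    ];

    for (name, vkml, ort, ncnn) in rows {
        println!(
            "{}: {:.2}x vs ONNX-RT (CUDA), {:.2}x vs ncnn (Vulkan)",
            name,
            speedup(ort, vkml),
            speedup(ncnn, vkml)
        );
    }
}
```

For example, the mnist-12 row works out to roughly 1.66× over ONNX Runtime, while the MatMul-heavy row is about 0.15×, reflecting the Tensor Core / TF32 footnote.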

Full methodology, percentile breakdowns, and memory comparisons are detailed in the paper.

Ecosystem

VKML is built on two standalone crates, each usable independently:

Crate	Description
zero-pool	Lock-free, zero-dependency thread pool with cooperative memory reclamation. Miri verified.
onnx-extractor	Lightweight ONNX parser with mmap and zero-copy tensor access.

Building

Requires Slang in PATH to compile shaders at runtime.

cargo build --release

Roadmap

Kernel optimisation & mixed precision — expand VK_NV_cooperative_matrix2 usage, integrate FP8 support via VK_EXT_shader_float8
Advanced memory management — streaming model loader via mmap, PCIe Peer-to-Peer transfers via VK_KHR_external_memory
Platform expansion — validate on Apple Silicon via MoltenVK, add VK_KHR_device_group for NVLink
Hybrid execution — multi-threaded CPU operations, dynamic graph support (conditionals, loops), backwards pass for training

Project Priorities

Universal compute utilisation (leverages any available hardware combination)
High heterogeneous compute efficiency
Predictable and consistent performance
Ease of use

License

MIT

Internal implementation notes, assumptions, and planned work.

Overall Todos

Implement more of the ONNX operator specification
VK_NV_cooperative_matrix2
Interface for manually modifying instructions and tensors in model and/or tensor graphs
Backwards pass to allow training a model
Dynamic graph support (conditionals, loops, etc.)
Multi-threading for CPU operators (matmul, gemm, etc.). Currently they serve only as single-threaded reference implementations

Thread Pool Implementation

A single global thread pool, zero-pool, is used throughout the entire process.
As a result, the entire process is multi-threaded where possible and lock-free in most cases. The exception is some older discrete GPU (dGPU) specifications that require staged allocation logic.

GPU Management

Memory tracking implemented using VK_EXT_memory_budget when available
- Tracks both self usage and initial usage from other processes
- Configurable threshold (default 95% of available memory)
- Multi-threaded allocation
Automatic model placement across available devices (GPUs and CPU)
- Automatically creates transfer operations when model is split across devices
- Handles host-visible vs device-local memory requirements
GPU features are taken into account, and performance features are toggled on a per-device basis where supported
GPU-to-GPU movement currently routes through the CPU
- Need to investigate Vulkan device pools
- Research needed on VK shared memory pool extensions
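The placement behaviour described in this section can be sketched as a simple greedy pass. This is an illustration under stated assumptions, not VKML's allocator: it treats the model as a linear list of operations with a known memory cost each, tries devices in a fixed order, and uses hypothetical `Device`, `Instr`, and `place` names. The 95% threshold and the idea of injecting a transfer when consecutive operations land on different devices come from the notes above.

```rust
/// Hypothetical instruction stream; `TransferToDevice` mirrors the name used
/// in this README, but the fields are assumptions made for illustration.
#[derive(Debug, Clone, PartialEq)]
enum Instr {
    Compute { op: usize, device: usize },
    TransferToDevice { from: usize, to: usize },
}

/// Hypothetical view of one device's memory budget, e.g. as a
/// memory-budget query might report it.
struct Device {
    budget_bytes: u64,
    used_bytes: u64,
}

impl Device {
    /// True if `bytes` more would keep usage within `threshold` of the budget.
    fn fits(&self, bytes: u64, threshold: f64) -> bool {
        (self.used_bytes + bytes) as f64 <= self.budget_bytes as f64 * threshold
    }
}

/// Places operations greedily: stay on the current device while it has room,
/// otherwise advance to the next one. Returns None if the model does not fit.
fn place(op_bytes: &[u64], devices: &mut [Device], threshold: f64) -> Option<Vec<Instr>> {
    let mut plan = Vec::new();
    let mut current = 0usize;
    let mut previous: Option<usize> = None;

    for (op, &bytes) in op_bytes.iter().enumerate() {
        while current < devices.len() && !devices[current].fits(bytes, threshold) {
            current += 1;
        }
        if current == devices.len() {
            return None; // ran out of devices
        }
        devices[current].used_bytes += bytes;

        // Inject a transfer when the producer ran on a different device.
        if let Some(prev) = previous {
            if prev != current {
                plan.push(Instr::TransferToDevice { from: prev, to: current });
            }
        }
        plan.push(Instr::Compute { op, device: current });
        previous = Some(current);
    }
    Some(plan)
}

fn main() {
    let mut devices = vec![
        Device { budget_bytes: 100, used_bytes: 0 },
        Device { budget_bytes: 100, used_bytes: 0 },
    ];
    // Three operations needing 50, 40, and 30 bytes; 95% usable per device.
    let plan = place(&[50, 40, 30], &mut devices, 0.95).expect("model fits");
    for instr in &plan {
        println!("{:?}", instr);
    }
}
```

With these numbers the first two operations fit on device 0, the third spills to device 1, and a single transfer is inserted between them. A real allocator would also account for usage by other processes and for host-visible versus device-local memory, as listed above.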

Architecture Decisions

Model, Layer, Tensor etc. act as descriptors/blueprints only
- Allows the compute manager to handle all data and memory
- Large separation between blueprint layers and final tensor DAG
Zero-copy optimisations:
- Model loading is full zero-copy
- CPU allocations use zero-copy transfer when possible
- GPU allocations use reference-based zero-copy transfer when possible
Model storage is sequential in memory
- Avoids unnecessary CPU transfers
Current compute implementation:
- All work that can logically be parallelised is parallelised

Vulkan Usage

Vendor-specific extensions become standard extensions depending on adoption. As of 2025, ARM appears to be focusing on adding ML-specific extensions to Vulkan
- As of 1.3.300 VK_NV_cooperative_matrix
- As of 1.4.317 VK_EXT_shader_float8
- As of 1.4.319 VK_ARM_data_graph
