Please check the build logs for more information.
See Builds for ideas on how to fix a failed build, or Metadata for how to configure docs.rs builds.
If you believe this is docs.rs' fault, open an issue.
hanzo-kernel
Write a GPU kernel once in Rust. Run it on CUDA, ROCm, Vulkan, Metal, WebGPU, and CPU — at or above hand-tuned speed.
use *;
That is the whole kernel. It lowers to native SPIR-V, PTX, HIP, and MSL — no per-backend rewrite, no shading-language dialect, no unsafe FFI.
Why
A production inference stack hand-writes the same ~20 operations four times — once per backend, in four languages (CUDA C++, Metal Shading Language, HIP C++, GLSL). That Cartesian product isn't just a maintenance tax; it is where the bugs breed, because one operation with four implementations has four numeric behaviors. A model that generates coherently on Metal can collapse into repetition on CUDA — same weights, same math, different rounding in a hand-tuned kernel.
hanzo-kernel collapses the product into a sum: backends × types × ops → backends + types + ops. One Rust source per operation, lowered everywhere. Uniform numeric behavior stops being something you chase with four test suites and becomes a structural property of a single source.
The claim, measured
Portability usually means a performance downgrade. It doesn't here. Every number below is measured, bit-exact-gated against a CPU oracle, and reproducible with the bundled matvec-check binary.
| Kernel | Backend | Result |
|---|---|---|
| int8 dp4a matvec | Vulkan (AMD gfx1151) | 196 GB/s — 118% of the hand-tuned kernel it replaces |
| int8 dp4a matvec | Metal (Apple M4 Max) | 206 GB/s |
rms_norm |
Vulkan, in the live engine | 10.6× the hand-written kernel it replaced (now deleted) |
softmax |
Vulkan, in the live engine | 351 GB/s (replaced the hand-written kernel) |
| all ops | CPU / Vulkan / Metal | bit-exact from one source (max_rel ≤ 1e-3) |
The dot on a Line<i8> lowers to a single OpSDot on Vulkan (verified in the emitted SPIR-V binary) — the same hardware integer-dot instruction the hand-written kernels use, 2.35× a scalar loop. The DSL doesn't emulate the fast path; it reaches it.
Install
[]
= "0.2"
Pick your backend with a feature flag (drop cpu for GPU-only builds — it pulls an LLVM/MLIR toolchain):
# GPU-only, no CPU toolchain
= { = "0.2", = false, = ["vulkan"] }
| Feature | Backend |
|---|---|
cpu (default) |
CPU reference runtime — the bit-exact oracle |
vulkan |
Vulkan (SPIR-V) — AMD, Intel, NVIDIA |
metal |
Metal (MSL) — Apple Silicon |
cuda |
CUDA (PTX/NVRTC) — NVIDIA |
rocm |
ROCm (HIP) — AMD |
The surface
Three attributes and two nouns — named for what they are, not for a framework.
#[kernel(targets(...))]— a launch entry point.targets(...)is validated at compile time (an unknown backend is a compile error) and self-documents where the kernel runs.#[device]— an on-GPU helper function, inlined into a kernel. A kernel is launched; a device function is a piece of one.Grid/Block— the launch shape: the grid of thread blocks, and the block of threads.Array,Float,Line— the data. Already the simple, accurate words.
Every performance primitive is a first-class part of the surface and lowers to the native instruction:
| Primitive | Lowers to |
|---|---|
Line<i8>::dot |
dp4a / OpSDotAccSat (int8 4-way dot) |
cmma |
tensor cores (WMMA / cooperative-matrix / simdgroup) |
plane_sum, plane_broadcast |
subgroup / warp reduce & shuffle |
SharedMemory, Atomic, barriers |
native shared memory, atomics, sync |
Built-in op library
Reusable, bit-exact kernels — not toys. Each ships with a CPU oracle and a bit-exact gate.
quant—matvec_q8,matvec_q4k(in-kernel K-quant decode), dp4a matvec (_blkblock-per-row +_sgsubgroup variants), packed-Q8_0 decode with in-kernel fp16 unpack.norm—rms_norm,layer_norm,add_rmsnorm(fused, multi-output).rope—rope_half(GPT-NeoX) andrope_interleaved(GPT-J) conventions.attn—sdpaandsdpa_runtime: GQA + online (flash-style) softmax with a runtime-length KV cache. One stable attention implementation across backends — the structural cure for repetition-collapse.
Run the correctness + throughput gate yourself:
Design
Four ideas, borrowed from Rich Hickey and applied to kernels:
- Decomplect. Ops, quantization types, and backends are three orthogonal axes braided together in hand-written code. Pulled apart, they compose — you write each axis once.
backends × types × opsbecomesbackends + types + ops. - Values, not places. CubeCL is the lowering engine — the implementation value.
hanzo_kernelis the stable namespace you build against; the stringcubeclappears in exactly one line of oneCargo.toml. The engine can be upgraded or vendored without touching a kernel. - One numeric behavior. A single source means a single rounding path. The "coherent on Metal, collapsed on CUDA" class of bug cannot occur — not because four copies were tested into agreement, but because there is one copy.
- Perf-gated migration, never a downgrade. A hand-tuned kernel is retired for its DSL twin only when the twin is bit-exact and at least as fast. Where a hand-tuned kernel still wins, it stays as the peak path and the DSL is the portable fallback for backends that lack it. Coverage always grows; speed never regresses.
Status
Published and in production. rms_norm and softmax are live DSL kernels in the Hanzo ML inference engine — their hand-written predecessors are deleted. The full op library is bit-exact on CPU, Vulkan, and Metal, and shelf-ready for the backends still on hand-tuned kernels. A generated DSL kernel dispatches through the engine's real Vulkan and Metal pipelines with maxerr = 0 — the DSL plugs in as a code generator, not a second runtime.
License
BSD-3-Clause. See LICENSE.