# KAIO
Rust-native GPU kernel authoring framework.
Write GPU compute kernels in Rust, compile to PTX, run on NVIDIA GPUs. No CUDA C++, no Python, no CUDA toolkit required. Windows + Linux.
## Quick Start
Add the `kaio` crate, import its prelude with `use kaio::prelude::*;`, and mark a kernel function with `#[gpu_kernel]`.
Requires an NVIDIA GPU with driver installed. No CUDA toolkit needed.
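A minimal sketch of what a kernel looks like. The prelude path follows from the umbrella crate's re-exports; the 1D index intrinsics (`thread_idx_x()`, `block_idx_x()`, `block_dim_x()`) are assumed by analogy with the `thread_idx_y()` listed under Features, so check the repository examples for the exact names and the host-side launch API.

```rust
use kaio::prelude::*;

// Sketch of an element-wise SAXPY kernel in the KAIO DSL.
// thread_idx_x()/block_idx_x()/block_dim_x() are assumed intrinsic names;
// the feature table below only confirms thread_idx_y() for the 2D case.
#[gpu_kernel]
fn saxpy(a: f32, x: &[f32], y: &mut [f32], n: u32) {
    let i = block_idx_x() * block_dim_x() + thread_idx_x();
    if i < n {
        // fma(a, b, c) is the fused multiply-add builtin from the DSL.
        y[i] = fma(a, x[i], y[i]);
    }
}
```

Launching goes through `kaio-runtime`, the `cudarc`-based driver wrapper; see the repository's runnable examples for the host-side call.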
## When to Use KAIO
KAIO is not a replacement for ML frameworks like Candle or Burn. It is the layer you use when you need more control than they provide.
- Write custom GPU kernels your framework doesn't support
- Control GPU memory explicitly — deterministic VRAM, buffer reuse
- Ship GPU binaries without Python or Triton in your dependency chain
- Run on Windows (Triton is Linux-only)
- Prototype GPU code in Rust without learning CUDA C++
## Features

| Feature | Syntax | Status |
|---|---|---|
| Arithmetic | `+`, `-`, `*`, `/`, `%`, `+=`, `-=`, `*=`, `/=` | Supported |
| Comparisons | `<`, `<=`, `>`, `>=`, `==`, `!=` | Supported |
| Control flow | `if`/`else`, `for`, `while` | Supported |
| Array access | `a[idx]` (global memory) | Supported |
| Shared memory | `shared_mem![f32; 256]` | Supported |
| Synchronization | `bar_sync()` | Supported |
| Warp shuffle | `shfl_sync_down/up/bfly()` | Supported |
| Reductions | `block_reduce_sum()`, `block_reduce_max()` | Supported |
| Type casts | `x as f32` | Supported |
| Math builtins | `sqrt`, `exp`, `log`, `tanh`, `abs`, `min`, `max` | Supported |
| FMA | `fma(a, b, c)` | Supported |
| 2D blocks | `block_size = (16, 16)`, `thread_idx_y()` | Supported |
| Tiled matmul | `kaio_ops::matmul()` (31% of cuBLAS) | Supported |
| Attention | `kaio_ops::attention()`, `attention_causal()` | Supported |
| FlashAttention | `kaio_ops::attention_flash()` (O(d_k) memory) | Supported |
| Auto-tuner | `kaio_ops::tune_matmul()`, `matmul_auto()` | Supported |
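As a rough illustration of how these pieces compose, here is a hedged sketch of a per-block sum reduction. Only the builtin names come from the table above; the index intrinsics and the `block_reduce_sum(value)` signature (taking a per-thread value and returning the block-wide sum) are assumptions.

```rust
use kaio::prelude::*;

// Sketch: each block sums its slice of `input` into one element of `partial`.
// Intrinsic names and the block_reduce_sum() signature are assumed.
#[gpu_kernel]
fn block_sum(input: &[f32], partial: &mut [f32], n: u32) {
    let i = block_idx_x() * block_dim_x() + thread_idx_x();
    // Each thread contributes one element, or 0.0 past the end of the array.
    let mut v = 0.0f32;
    if i < n {
        v = input[i];
    }
    // Cooperative sum across all threads in the block.
    let total = block_reduce_sum(v);
    // Thread 0 writes the block's partial result to global memory.
    if thread_idx_x() == 0 {
        partial[block_idx_x()] = total;
    }
}
```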
## Architecture

| Crate | Description |
|---|---|
| `kaio` | Umbrella crate; re-exports everything via the prelude |
| `kaio-macros` | `#[gpu_kernel]` proc macro |
| `kaio-core` | PTX IR and instruction emitters; zero external dependencies |
| `kaio-runtime` | CUDA driver wrapper via `cudarc` |
| `kaio-ops` | Pre-built GPU operations (matmul, attention, auto-tuner) |
## Limitations

- NVIDIA only (SM 7.0+); no AMD, no Intel
- Not cuBLAS-level performance (matmul reaches 31% of cuBLAS)
- The DSL is a subset of Rust: no closures, traits, generics, or `&&`/`||` (compound conditions are written as nested `if`s; see the sketch after this list)
- FlashAttention requires d_k <= 256
- No autograd, no multi-GPU
- API may change
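A minimal sketch of the nested-`if` pattern that stands in for `&&`, with the same assumed index intrinsics as in the Quick Start example:

```rust
use kaio::prelude::*;

// `i < n && mask[i] != 0` is not expressible in the DSL, so the checks nest.
// Index intrinsic names are assumptions, as in the Quick Start sketch.
#[gpu_kernel]
fn masked_copy(x: &[f32], mask: &[u32], out: &mut [f32], n: u32) {
    let i = block_idx_x() * block_dim_x() + thread_idx_x();
    if i < n {
        if mask[i] != 0 {
            out[i] = x[i];
        }
    }
}
```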
## Status
Phase 5 complete — attention (standard + FlashAttention), causal masking, auto-tuner, Windows CI. v0.1.0.
See the repository for full documentation, runnable examples, copy-paste patterns, and development logs.
## License
Licensed under either of Apache-2.0 or MIT at your option.