KAIO
Rust-native GPU kernel authoring framework.
KAIO (πῦρ — fire) lets developers write GPU compute kernels in Rust and compile them to PTX for execution on NVIDIA GPUs. It is a Rust alternative to OpenAI's Triton, targeting Windows and Linux from day one, with compile-time PTX emission and Rust's type-safety guarantees.
Why KAIO?
- Cross-platform from day one. Windows and Linux. `cargo build` just works.
- Compile-time PTX emission. Kernels compile during `cargo build` via proc macros. Zero cold-start.
- Rust type safety. Catch out-of-bounds indexing, dtype mismatches, and synchronization errors at compile time.
- Embeddable anywhere. Use from Rust natively, from C/C++ via FFI, from Python via PyO3.
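The type-safety point can be illustrated with a small sketch of a typed buffer handle. `DeviceBuffer<T>` and `vector_add_f32` are hypothetical names chosen for illustration, not KAIO's API, and the buffer is backed by host memory so the sketch runs without a GPU. Only the element type is enforced by the compiler here; lengths are still checked at run time.

```rust
/// Hypothetical typed handle to device memory.
/// Host-backed in this sketch so it runs without a GPU.
pub struct DeviceBuffer<T> {
    data: Vec<T>,
}

impl<T: Copy + Default> DeviceBuffer<T> {
    /// Allocates a buffer of `len` default-initialised elements.
    pub fn zeros(len: usize) -> Self {
        Self { data: vec![T::default(); len] }
    }

    /// Copies a host slice into a new buffer.
    pub fn from_slice(src: &[T]) -> Self {
        Self { data: src.to_vec() }
    }

    /// Copies the buffer contents back to a host vector.
    pub fn to_vec(&self) -> Vec<T> {
        self.data.clone()
    }

    pub fn len(&self) -> usize {
        self.data.len()
    }
}

/// Stand-in for launching an f32 vector_add kernel.
/// The signature accepts only `DeviceBuffer<f32>`, so passing a
/// `DeviceBuffer<i32>` is rejected by the compiler, not at launch time.
pub fn vector_add_f32(
    a: &DeviceBuffer<f32>,
    b: &DeviceBuffer<f32>,
    out: &mut DeviceBuffer<f32>,
) -> Result<(), String> {
    if a.len() != b.len() || a.len() != out.len() {
        return Err(format!(
            "length mismatch: a={}, b={}, out={}",
            a.len(),
            b.len(),
            out.len()
        ));
    }
    for i in 0..a.len() {
        out.data[i] = a.data[i] + b.data[i];
    }
    Ok(())
}
```

Calling `vector_add_f32` with a `DeviceBuffer<i32>` argument fails to compile with a mismatched-types error, which is the kind of dtype check the bullet above refers to.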
Architecture
KAIO is structured in four layers:
Layer 4: Block-Level Operations (tiled matmul, fused attention)
Layer 3: Proc Macro DSL (#[gpu_kernel], user-facing API)
Layer 2: Runtime (kernel launch, memory mgmt via cudarc)
Layer 1: PTX Codegen (instruction emission, IR)
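To make Layer 1 more concrete, here is a minimal sketch of an instruction IR and a function that renders it to PTX text. The `Instr` enum and the `emit` and `emit_body` functions are hypothetical and much smaller than anything in kaio-core; only the PTX mnemonics themselves are standard.

```rust
/// Hypothetical miniature IR; not kaio-core's actual types.
/// Operands are plain register names such as "%f1" or "%rd4".
#[derive(Debug, Clone, PartialEq)]
pub enum Instr {
    /// add.f32 dst, a, b
    AddF32 { dst: String, a: String, b: String },
    /// ld.global.f32 dst, [addr]
    LdGlobalF32 { dst: String, addr: String },
    /// st.global.f32 [addr], src
    StGlobalF32 { addr: String, src: String },
    /// ret
    Ret,
}

/// Renders one instruction as a single PTX statement.
pub fn emit(instr: &Instr) -> String {
    match instr {
        Instr::AddF32 { dst, a, b } => format!("add.f32 {dst}, {a}, {b};"),
        Instr::LdGlobalF32 { dst, addr } => format!("ld.global.f32 {dst}, [{addr}];"),
        Instr::StGlobalF32 { addr, src } => format!("st.global.f32 [{addr}], {src};"),
        Instr::Ret => "ret;".to_string(),
    }
}

/// Renders a sequence of instructions, one indented statement per line.
pub fn emit_body(instrs: &[Instr]) -> String {
    instrs.iter().map(|i| format!("    {}\n", emit(i))).collect()
}
```

Keeping the IR as data and the text rendering as a separate pass is one common way to structure a code generator; whether KAIO splits it exactly this way is an assumption.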
Crate Structure
| Crate | Description |
|---|---|
| kaio | Umbrella crate that re-exports kaio-core and kaio-runtime |
| kaio-core | PTX IR types, instruction emitters, PtxWriter |
| kaio-runtime | CUDA driver API wrapper, kernel launch, device memory |
Current Status
Phase 1 — PTX Foundation — complete. The IR and runtime layers can
construct, emit, load, and execute GPU kernels. The vector_add kernel
runs on real hardware (RTX 4090, verified on both single-block and
multi-block launches).
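The difference between a single-block and a multi-block launch comes down to the grid size and the bounds guard inside the kernel. The sketch below simulates that indexing on the CPU; `grid_size` and `simulate_vector_add` are hypothetical helper names, not part of KAIO.

```rust
/// Number of blocks needed to cover `n` elements with `block_size`
/// threads per block (ceiling division).
pub fn grid_size(n: u32, block_size: u32) -> u32 {
    assert!(block_size > 0, "block_size must be non-zero");
    n.div_ceil(block_size)
}

/// CPU simulation of a vector_add launch: every (block, thread) pair
/// computes its global index and skips work past the end of the data,
/// as a GPU kernel's bounds guard would.
pub fn simulate_vector_add(a: &[f32], b: &[f32], block_size: u32) -> Vec<f32> {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");
    let n = a.len() as u32;
    let mut out = vec![0.0f32; a.len()];
    for block_idx in 0..grid_size(n, block_size) {
        for thread_idx in 0..block_size {
            // Global index, mirroring ctaid.x * ntid.x + tid.x in PTX.
            let i = block_idx * block_size + thread_idx;
            if i < n {
                out[i as usize] = a[i as usize] + b[i as usize];
            }
        }
    }
    out
}
```

With 1000 elements and 256 threads per block the grid has 4 blocks, and the last block leaves 24 threads idle behind the guard.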
Phase 2 — Proc Macro DSL is next: #[gpu_kernel] attribute macro
that transforms Rust function syntax into PTX. See
docs/phases.md for the full roadmap.
Phase 1 Example (IR API)
use ;
use ;
use ;
use MemoryOp;
use *;
use PtxType;
// Build a vector_add kernel via the IR API
let mut alloc = new;
let mut kernel = new;
kernel.add_param;
kernel.add_param;
kernel.add_param;
kernel.add_param;
// ... (build instructions using alloc + kernel.push()) ...
// Emit to PTX text
let mut module = new;
module.add_kernel;
let mut w = new;
module.emit.unwrap;
let ptx_text = w.finish;
// Load and run on GPU
use ;
let device = new?;
let module = device.load_ptx?;
let func = module.function?;
// ... allocate buffers, launch kernel, read results ...
See kaio-runtime/tests/vector_add_e2e.rs for the complete working example.
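For readers unfamiliar with PTX, the following sketch shows roughly what the text produced by the emit step looks like at the module level. It is not KAIO's PtxWriter: `Param` and `emit_module_skeleton` are hypothetical, the parameter names are made up, and the kernel body is reduced to a bare `ret`. The `.version`, `.target`, `.address_size`, and `.visible .entry` directives are standard PTX.

```rust
/// Hypothetical kernel parameter: PTX type (e.g. ".u64") and name.
pub struct Param {
    pub ty: &'static str,
    pub name: &'static str,
}

/// Builds a PTX module containing one empty kernel with the given
/// signature. Real emitters would also declare registers and a body.
pub fn emit_module_skeleton(
    version: &str,
    target: &str,
    kernel: &str,
    params: &[Param],
) -> String {
    let mut out = String::new();
    // Module header directives.
    out.push_str(&format!(
        ".version {version}\n.target {target}\n.address_size 64\n\n"
    ));
    // Kernel entry point and its parameter list.
    out.push_str(&format!(".visible .entry {kernel}(\n"));
    let decls: Vec<String> = params
        .iter()
        .map(|p| format!("    .param {} {}", p.ty, p.name))
        .collect();
    out.push_str(&decls.join(",\n"));
    // Empty body: return immediately.
    out.push_str("\n)\n{\n    ret;\n}\n");
    out
}
```

A vector_add signature with three 64-bit pointers and a 32-bit length would pass four `Param` values, for example `.u64 a_ptr`, `.u64 b_ptr`, `.u64 out_ptr`, and `.u32 n`, with a target such as `sm_89`.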
Target Hardware
- Primary: NVIDIA GPUs, SM 7.0+ (Volta and newer)
- Development GPU: RTX 4090 (SM 8.9, Ada Lovelace)
- Platforms: Windows 10/11, Linux (Ubuntu 22.04+)
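The SM 7.0 floor listed above can be expressed as a small check that maps a compute capability to a PTX target name. `ptx_target` is a hypothetical helper; whether KAIO performs this check, and where, is an assumption. The sketch also assumes a single-digit minor version.

```rust
/// Maps a CUDA compute capability (major, minor) to a PTX `.target`
/// name such as "sm_89". Returns `None` below SM 7.0 (pre-Volta).
pub fn ptx_target(major: u32, minor: u32) -> Option<String> {
    if major < 7 {
        return None;
    }
    Some(format!("sm_{major}{minor}"))
}
```

For the development GPU listed above (SM 8.9) this yields `sm_89`; a Pascal-class device (SM 6.1) is rejected.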
Building
# Requires Rust 1.94+ (pinned via rust-toolchain.toml)
Development
Sprint-by-sprint progress with full architectural decision records:
License
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.