kaio 0.2.1


KAIO

Rust-native GPU kernel authoring framework.

Write GPU compute kernels in Rust, compile to PTX, run on NVIDIA GPUs. No CUDA C++, no Python, no CUDA toolkit required. Windows + Linux.

Quick Start

use kaio::prelude::*;

#[gpu_kernel(block_size = 256)]
fn saxpy(x: &[f32], y: &mut [f32], alpha: f32, n: u32) {
    let idx = thread_idx_x() + block_idx_x() * block_dim_x();
    if idx < n {
        y[idx] = alpha * x[idx] + y[idx];
    }
}

fn main() -> Result<()> {
    let device = KaioDevice::new(0)?;
    let n = 1024u32;

    let x = device.alloc_from(&vec![1.0f32; n as usize])?;
    let mut y = device.alloc_from(&vec![2.0f32; n as usize])?;

    saxpy::launch(&device, &x, &mut y, 2.5f32, n)?;

    let result = y.to_host(&device)?;
    println!("result: {:?}", &result[..8]);
    // prints: result: [4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5]
    Ok(())
}

Requires an NVIDIA GPU with driver installed. No CUDA toolkit needed.
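The printed 4.5 is just alpha * x + y = 2.5 * 1.0 + 2.0 for every element. A minimal CPU sketch of the same indexing scheme, useful for checking kernel logic without a GPU (the `ceil_div` helper and the assumption that `launch` covers ceil(n / 256) blocks are illustrative, not the crate's actual API):

```rust
/// Round-up division: how many blocks of `block` threads cover `n` items.
fn ceil_div(n: u32, block: u32) -> u32 {
    (n + block - 1) / block
}

/// CPU reference of the saxpy kernel above, walking the same
/// (block_idx, thread_idx) space a GPU launch would cover.
fn saxpy_ref(x: &[f32], y: &mut [f32], alpha: f32, n: u32) {
    let block_dim = 256u32;
    let grid_dim = ceil_div(n, block_dim); // 1024 / 256 = 4 blocks
    for block_idx in 0..grid_dim {
        for thread_idx in 0..block_dim {
            // same formula as in the kernel body
            let idx = thread_idx + block_idx * block_dim;
            if idx < n {
                y[idx as usize] = alpha * x[idx as usize] + y[idx as usize];
            }
        }
    }
}

fn main() {
    let n = 1024u32;
    let x = vec![1.0f32; n as usize];
    let mut y = vec![2.0f32; n as usize];
    saxpy_ref(&x, &mut y, 2.5, n);
    assert!(y.iter().all(|&v| v == 4.5)); // matches the GPU result above
    println!("ok");
}
```

The bounds check `idx < n` matters whenever n is not a multiple of the block size: the last block has threads whose global index runs past the buffer.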

When to Use KAIO

KAIO is not a replacement for ML frameworks like Candle or Burn. It is the layer you use when you need more control than they provide.

  • Write custom GPU kernels your framework doesn't support
  • Control GPU memory explicitly — deterministic VRAM, buffer reuse
  • Ship GPU binaries without Python or Triton in your dependency chain
  • Run on Windows (Triton is Linux-only)
  • Prototype GPU code in Rust without learning CUDA C++

Features

| Feature | Syntax | Status |
|---|---|---|
| Arithmetic | +, -, *, /, %, +=, -=, *=, /= | Supported |
| Comparisons | <, <=, >, >=, ==, != | Supported |
| Control flow | if/else, for, while | Supported |
| Array access | a[idx] (global memory) | Supported |
| Shared memory | shared_mem![f32; 256] | Supported |
| Synchronization | bar_sync() | Supported |
| Warp shuffle | shfl_sync_down/up/bfly() | Supported |
| Reductions | block_reduce_sum(), block_reduce_max() | Supported |
| Type casts | x as f32 | Supported |
| Math builtins | sqrt, exp, log, tanh, abs, min, max | Supported |
| FMA | fma(a, b, c) | Supported |
| 2D blocks | block_size = (16, 16), thread_idx_y() | Supported |
| Tiled matmul | kaio_ops::matmul() (31% of cuBLAS) | Supported |
| Attention | kaio_ops::attention(), attention_causal() | Supported |
| FlashAttention | kaio_ops::attention_flash() — O(d_k) memory | Supported |
| Auto-tuner | kaio_ops::tune_matmul(), matmul_auto() | Supported |
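A shuffle-based block reduction like block_reduce_sum() is typically a tree: each round, lane i adds the value held by lane i + offset, and offset halves until lane 0 holds the total. A CPU model of that pattern (illustrative only, not the crate's implementation):

```rust
/// CPU model of a shuffle-down tree reduction across one 256-lane block.
/// On the GPU, each `offset` round maps to a shfl_sync_down() exchange.
fn tree_reduce_sum(mut lanes: Vec<f32>) -> f32 {
    assert!(lanes.len().is_power_of_two());
    let mut offset = lanes.len() / 2;
    while offset > 0 {
        for i in 0..offset {
            // lane i reads lane (i + offset)'s value and accumulates it
            lanes[i] += lanes[i + offset];
        }
        offset /= 2;
    }
    lanes[0] // lane 0 holds the block-wide sum
}

fn main() {
    let vals: Vec<f32> = (0..256).map(|i| i as f32).collect();
    assert_eq!(tree_reduce_sum(vals), 32640.0); // 255 * 256 / 2
    println!("ok");
}
```

The tree needs log2(256) = 8 rounds instead of 255 serial additions, which is why the shuffle primitives in the table above exist.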

Architecture

| Crate | Description |
|---|---|
| kaio | Umbrella crate — re-exports everything via prelude |
| kaio-macros | #[gpu_kernel] proc macro |
| kaio-core | PTX IR, instruction emitters, zero external dependencies |
| kaio-runtime | CUDA driver wrapper via cudarc |
| kaio-ops | Pre-built GPU operations (matmul, attention, auto-tuner) |

Limitations

  • NVIDIA only (SM 7.0+) — no AMD, no Intel
  • Not cuBLAS-level performance (tiled matmul reaches about 31% of cuBLAS)
  • DSL subset of Rust — no closures, traits, generics, or &&/||
  • FlashAttention requires d_k <= 256
  • No autograd, no multi-GPU
  • API may change
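Since the DSL has no && or ||, compound conditions are written as nested ifs: a short-circuit `a && b` becomes one if inside another, and `a || b` becomes two branches. The rewrite is plain Rust, so its equivalence is easy to check on the CPU:

```rust
// `idx < n && (idx & mask) == 0`, expressed without `&&`:
// nest the second test inside the first.
fn guarded_and(idx: u32, n: u32, mask: u32) -> bool {
    if idx < n {
        if (idx & mask) == 0 {
            return true; // both conditions held
        }
    }
    false
}

fn main() {
    assert_eq!(guarded_and(4, 10, 1), 4 < 10 && (4 & 1) == 0);
    assert_eq!(guarded_and(12, 10, 1), false); // fails the bounds check
    println!("ok");
}
```

The nesting preserves short-circuit order, so the bounds check still guards the second test, just as it would with &&.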

Status

Phase 5 complete — attention (standard + FlashAttention), causal masking, auto-tuner, Windows CI. v0.1.0.
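The memory saving claimed for attention_flash() in the table above rests on the online-softmax recurrence: instead of materializing a full score row, the kernel streams scores while keeping only a running max m and a rescaled running sum l. A CPU sketch of that recurrence (illustrative, not the crate's kernel):

```rust
/// One-pass (online) softmax normalizer. Mathematically identical to the
/// two-pass max-then-sum version, but never stores the whole score row:
/// each new score rescales the old partial sum into the new max's frame.
fn online_softmax_denominator(scores: &[f32]) -> (f32, f32) {
    let mut m = f32::NEG_INFINITY; // running max
    let mut l = 0.0f32;            // running sum of exp(x - m)
    for &x in scores {
        let m_new = m.max(x);
        l = l * (m - m_new).exp() + (x - m_new).exp();
        m = m_new;
    }
    (m, l)
}

fn main() {
    let scores = [1.0f32, 3.0, -2.0, 0.5];
    let (m, l) = online_softmax_denominator(&scores);

    // two-pass reference: find the max, then sum the shifted exponentials
    let m_ref = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let l_ref: f32 = scores.iter().map(|&x| (x - m_ref).exp()).sum();

    assert_eq!(m, m_ref);
    assert!((l - l_ref).abs() < 1e-5);
    println!("ok");
}
```

In the full kernel the same rescaling is applied to the partial weighted-value accumulator, which is why the working set stays proportional to the head dimension rather than the sequence length.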

See the repository for full documentation, runnable examples, copy-paste patterns, and development logs.

License

Licensed under either of Apache-2.0 or MIT at your option.