oxicuda 0.1.3

OxiCUDA - Pure Rust CUDA replacement for the COOLJAPAN ecosystem (95% performance target)

oxicuda

Pure Rust CUDA replacement for the COOLJAPAN ecosystem.

Part of the OxiCUDA project.

Overview

Version: 0.1.3 — 2026-04-17 — 496 tests passing

oxicuda is the umbrella crate that re-exports all OxiCUDA sub-crates behind feature flags. It provides a single dependency entry point for applications that need GPU compute capabilities without installing the CUDA Toolkit -- libcuda.so (or nvcuda.dll) is loaded dynamically at runtime.
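As a rough illustration of what runtime discovery of the driver library involves, the sketch below probes a list of directories for a driver file before any loading would take place. It is not taken from OxiCUDA's source: the function names, the search directories, and the `libcuda.so.1` candidate (the usual Linux soname) are assumptions made for this example, and a real loader would more likely hand the bare library name to the operating system's dynamic loader than scan directories itself.

```rust
use std::path::{Path, PathBuf};

/// File names a loader might try for the NVIDIA driver library.
/// `libcuda.so` and `nvcuda.dll` come from the text above; `libcuda.so.1`
/// is an added assumption.
fn driver_library_names(os: &str) -> &'static [&'static str] {
    match os {
        "windows" => &["nvcuda.dll"],
        "linux" => &["libcuda.so.1", "libcuda.so"],
        _ => &[],
    }
}

/// Returns the first candidate file that exists in any of `search_dirs`.
/// (Hypothetical helper; it only checks the filesystem and loads nothing.)
fn find_driver_library(os: &str, search_dirs: &[&Path]) -> Option<PathBuf> {
    search_dirs
        .iter()
        .flat_map(|dir| driver_library_names(os).iter().map(move |name| dir.join(name)))
        .find(|path| path.is_file())
}

fn main() {
    // Search directories chosen only for illustration.
    let dirs = [Path::new("/usr/lib/x86_64-linux-gnu"), Path::new("/usr/lib64")];
    match find_driver_library(std::env::consts::OS, &dirs) {
        Some(path) => println!("driver library found at {}", path.display()),
        None => println!("no driver library found in the probed directories"),
    }
}
```

Because the lookup happens at runtime, a binary built this way can start on a machine without an NVIDIA driver and report the absence, rather than failing at link time.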

The core crates (driver, memory, launch) are enabled by default. Higher-level libraries -- BLAS, DNN, FFT, sparse, solver, and random number generation -- are opt-in via feature flags. Enable the full feature to turn on every optional feature at once.

A prelude module and init() function provide convenient imports and one-call CUDA driver initialization. Additional built-in modules cover profiling, multi-GPU device pools, collective communication (NCCL equivalent), pipeline parallelism, and multi-node distributed training.

Architecture

                    oxicuda (umbrella)
     +---------+---------+---------+---------+
     |         |         |         |         |
  driver   memory    launch      ptx    autotune
     |         |         |         |         |
     +----+----+---------+---------+---------+
          |
   +------+------+------+------+------+
   |      |      |      |      |      |
  blas   dnn    fft   sparse solver  rand

Quick Start

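use oxicuda::prelude::*;

fn main() -> CudaResult<()> {
    // One-call CUDA driver initialization.
    oxicuda::init()?;

    // Pick the first GPU and create a shareable context on it.
    let device = Device::get(0)?;
    let ctx = std::sync::Arc::new(Context::new(&device)?);
    // Create a stream on that context (not used further in this minimal example).
    let stream = Stream::new(&ctx)?;

    // Allocate 1024 f32 values on the device and fill them from host memory.
    let mut buf = DeviceBuffer::<f32>::alloc(1024)?;
    let host = vec![1.0f32; 1024];
    buf.copy_from_host(&host)?;

    Ok(())
}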

Feature Flags

Feature | Description | Default
driver | CUDA driver API wrapper | Yes
memory | GPU memory management | Yes
launch | Kernel launch infrastructure | Yes
ptx | PTX code generation DSL | No
autotune | Autotuner engine (implies ptx) | No
blas | cuBLAS equivalent | No
dnn | cuDNN equivalent (implies blas) | No
fft | cuFFT equivalent | No
sparse | cuSPARSE equivalent | No
solver | cuSOLVER equivalent | No
rand | cuRAND equivalent | No
pool | Stream-ordered memory pool | No
backend | Abstract compute backend trait | No
primitives | CUB-equivalent parallel GPU primitives | No
vulkan | Vulkan compute backend (cross-vendor) | No
metal | Apple Metal compute backend (macOS/iOS) | No
webgpu | WebGPU compute backend (via wgpu) | No
rocm | AMD ROCm/HIP backend (Linux + AMD GPU) | No
level-zero | Intel Level Zero backend | No
onnx-backend | ONNX operator runtime and graph executor | No
tensor-backend | ToRSh GPU tensor backend with autograd | No
transformer-backend | TrustformeRS transformer inference backend | No
wasm-backend | WASM + WebGPU backend for browser environments | No
full | Enable all optional features | No
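
The "implies" notes in the table mean that requesting one feature also switches on another. The sketch below models that expansion in plain Rust so the effect of a given feature list can be checked by hand. It is an illustrative model only: Cargo performs the real resolution from the crate's manifest, the names `DEFAULT`, `IMPLIES`, and `resolve` are invented for this example, and the `full` feature is left out because its exact membership is not spelled out here.

```rust
use std::collections::BTreeSet;

/// Features the table lists as enabled by default.
const DEFAULT: &[&str] = &["driver", "memory", "launch"];

/// "implies" edges stated in the table: (feature, feature it implies).
const IMPLIES: &[(&str, &str)] = &[("autotune", "ptx"), ("dnn", "blas")];

/// Expands a requested feature list with the defaults and implied features.
/// Hypothetical helper that mimics, but does not call, Cargo's resolution.
fn resolve(requested: &[&'static str]) -> BTreeSet<&'static str> {
    let mut enabled: BTreeSet<&'static str> = DEFAULT.iter().copied().collect();
    let mut pending: Vec<&'static str> = requested.to_vec();
    while let Some(feature) = pending.pop() {
        // `insert` returns false for features that are already enabled,
        // so each feature is expanded at most once.
        if enabled.insert(feature) {
            for &(from, to) in IMPLIES {
                if from == feature {
                    pending.push(to);
                }
            }
        }
    }
    enabled
}

fn main() {
    // Models a manifest entry such as (assumed form):
    //   oxicuda = { version = "0.1.3", features = ["dnn"] }
    println!("{:?}", resolve(&["dnn"]));
}
```

Under this model, asking for dnn alone yields the three default features plus dnn and blas, so blas does not need to be listed separately.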

Sub-crates

Crate | Volume | Description
oxicuda-driver | Vol.1 | CUDA driver API bindings
oxicuda-memory | Vol.1 | Device, pinned, unified memory
oxicuda-launch | Vol.1 | Kernel launch and grid configuration
oxicuda-ptx | Vol.2 | PTX code generation DSL
oxicuda-autotune | Vol.2 | Autotuner for kernel parameters
oxicuda-blas | Vol.3 | Dense linear algebra (GEMM, etc.)
oxicuda-dnn | Vol.4 | Deep learning primitives
oxicuda-fft | Vol.5 | Fast Fourier Transform
oxicuda-sparse | Vol.5 | Sparse matrix operations
oxicuda-solver | Vol.5 | Matrix decompositions and solvers
oxicuda-rand | Vol.5 | Random number generation
oxicuda-primitives | Vol.5 | CUB-equivalent warp/block/device primitives
oxicuda-backend | - | Abstract compute backend trait
oxicuda-vulkan | - | Vulkan compute backend
oxicuda-metal | - | Apple Metal compute backend
oxicuda-webgpu | - | WebGPU compute backend
oxicuda-rocm | - | AMD ROCm/HIP backend
oxicuda-levelzero | - | Intel Level Zero backend

License

Apache-2.0 -- (C) 2026 COOLJAPAN OU (Team KitaSan)