oxicuda-lm
Part of the OxiCUDA ecosystem — Pure Rust CUDA replacement for the COOLJAPAN ecosystem.
Overview
oxicuda-lm provides the model-layer abstractions for LLM inference: a BPE tokenizer, transformer layer building blocks with incremental KV-cache support, and complete GPT-2 and LLaMA-2/3 model implementations. All forward passes are pure-Rust CPU reference implementations suitable for testing; GPU acceleration is provided by the included PTX kernel strings once a CUDA driver is available at runtime.
Status
| Version | Tests | Date |
|---|---|---|
| 0.1.5 | 182 passing | 2026-05-01 |
| 0.1.4 | 182 passing | 2026-04-18 |
Features
- BPE tokenizer: `BpeBuilder` / `BpeTokenizer` with full encode/decode round-trip and special-token support
- Transformer layers: `TokenEmbedding`, `LearnedPositionalEmbedding`, `RotaryEmbedding` (RoPE), `MultiHeadAttention` (with GQA), `MlpFfn`, `SwiGluFfn`, `RmsNorm`, `LayerNorm`
- KV cache: `LayerKvCache` and `PastKvCache` for incremental (token-by-token) decoding with correct cache accumulation
- GPT-2 and LLaMA architectures: `Gpt2Model` and `LlamaModel` with `forward()` and `next_token()` helpers
- PTX kernels: five GPU kernel source strings (embedding forward, RoPE apply, SiLU gate, RMSNorm, causal attention softmax) for SM75–SM120
- Pure Rust — no CUDA SDK, no C/Fortran at compile time
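
For readers unfamiliar with two of the layer primitives listed above, the following is a minimal CPU sketch of the math behind RMSNorm and RoPE. It is not the crate's code: the function names `rms_norm` and `apply_rope`, their signatures, and the adjacent-pair (interleaved) rotation layout are assumptions made for illustration, and `RmsNorm` / `RotaryEmbedding` in oxicuda-lm may differ in layout and API.

```rust
/// Hypothetical reference RMSNorm (not the oxicuda-lm API):
/// y_i = x_i / sqrt(mean(x^2) + eps) * weight_i
pub fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), weight.len());
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}

/// Hypothetical reference RoPE (not the oxicuda-lm API): rotates each
/// adjacent pair (2i, 2i+1) of one head vector by the angle
/// pos * theta_base^(-2i/d). Some implementations pair (i, i + d/2)
/// instead; this sketch assumes the interleaved layout.
pub fn apply_rope(head: &mut [f32], pos: usize, theta_base: f32) {
    let d = head.len();
    assert!(d % 2 == 0, "head dimension must be even");
    for i in 0..d / 2 {
        let freq = theta_base.powf(-2.0 * i as f32 / d as f32);
        let (sin, cos) = (pos as f32 * freq).sin_cos();
        let (a, b) = (head[2 * i], head[2 * i + 1]);
        head[2 * i] = a * cos - b * sin;
        head[2 * i + 1] = a * sin + b * cos;
    }
}
```

Because RoPE is a pure rotation, it preserves vector norms, and the dot product between a rotated query and a rotated key depends only on the difference of their positions. A `theta_base` of `10000.0` is the commonly used value and is assumed here rather than taken from the crate.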
Usage
Add to your Cargo.toml:
[dependencies]
oxicuda-lm = "0.1.5"
use ;
let model = new?;
let prompt = vec!;
// Prefill
let = model.forward?;
// Incremental decode
let = model.next_token?;
println!;
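The prefill-then-decode flow above relies on a KV cache: each step appends the new token's key and value, then attends over everything cached so far, which should give the same result as recomputing causal attention from scratch. The sketch below illustrates that idea for a single attention head. `ToyKvCache`, `attend`, and `causal_attention` are hypothetical names invented for this example; they are not the crate's `LayerKvCache` / `PastKvCache` types and make no claim about their API.

```rust
/// Hypothetical single-head cache (not the oxicuda-lm API):
/// one key vector and one value vector per token seen so far.
#[derive(Default)]
pub struct ToyKvCache {
    keys: Vec<Vec<f32>>,
    values: Vec<Vec<f32>>,
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Scaled dot-product attention of one query over the given keys/values:
/// softmax(q . k / sqrt(d)) used as weights for a sum of the values.
fn attend(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let scale = 1.0 / (q.len() as f32).sqrt();
    let scores: Vec<f32> = keys.iter().map(|k| dot(q, k) * scale).collect();
    // Subtract the maximum score before exponentiating for numerical stability.
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let denom: f32 = exps.iter().sum();
    let mut out = vec![0.0; values[0].len()];
    for (w, v) in exps.iter().zip(values) {
        for (o, x) in out.iter_mut().zip(v) {
            *o += w / denom * x;
        }
    }
    out
}

impl ToyKvCache {
    /// Incremental step: append this token's key/value, then attend
    /// over the whole cache (which now includes the current token).
    pub fn step(&mut self, q: &[f32], k: &[f32], v: &[f32]) -> Vec<f32> {
        self.keys.push(k.to_vec());
        self.values.push(v.to_vec());
        attend(q, &self.keys, &self.values)
    }

    /// Number of tokens currently cached.
    pub fn len(&self) -> usize {
        self.keys.len()
    }
}

/// Full recomputation for comparison: token t attends to tokens 0..=t.
pub fn causal_attention(
    qs: &[Vec<f32>],
    ks: &[Vec<f32>],
    vs: &[Vec<f32>],
) -> Vec<Vec<f32>> {
    (0..qs.len())
        .map(|t| attend(&qs[t], &ks[..=t], &vs[..=t]))
        .collect()
}
```

Calling `step` once per token yields, at step `t`, the same vector as row `t` of `causal_attention` over the full sequence. This equivalence is what makes token-by-token decoding cheaper than re-running the whole prompt each time: past keys and values are stored rather than recomputed.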
License
Apache-2.0 — © 2026 COOLJAPAN OU (Team KitaSan)