oxibonsai-kernels

Q1_0_g128 (1-bit) and TQ2_0_g128 (ternary) compute kernels for OxiBonsai — dequantization, GEMV, GEMM, fused full-forward.

Implements the full compute stack for 1-bit and ternary inference: scalar reference kernels, SIMD-accelerated tiers (AVX2+FMA, AVX-512, NEON), tiled cache-blocked GEMM, parallel Rayon dispatch, and production GPU backends (Metal fused full-forward, native CUDA via NVRTC, plus scirs2-core backend).

Part of the OxiBonsai project.

Status: Stable (mature, complete) — 290 tests passing.

Features

dequant_1bit_g128 / dequant_tq2_0_g128 — dequantize Q1_0_g128 / TQ2_0_g128 blocks to f32
gemv_1bit_g128 / gemv_tq2_0_g128 — fused 1-bit / ternary GEMV (matrix-vector multiply)
gemm_1bit_g128 / gemm_tq2_0_g128 — fused 1-bit / ternary GEMM (batched matrix multiply)
KernelDispatcher::auto_detect() — selects the best SIMD tier at runtime
Tiled GEMM with cache-line alignment and software prefetch hints
Parallel dispatch via Rayon (gemv_*_par, gemm_*_par, tiled parallel paths)
Platform tuning: PlatformProfile, TunedThresholds
OneBitKernel and TernaryKernel traits unified through KernelDispatcher
GPU backend trait (GpuBackendTrait) with three concrete paths:
- Metal: fused full-forward TQ2 path (single command buffer) — ~50 tok/s on 1.7B ternary (~13× speedup)
- Native CUDA: NVRTC-compiled kernels with CUDA Graph execution (multi-encoding pass); prefill path with dedicated attention kernels for KV-cache population
- scirs2-core backend: portable CUDA/Metal via scirs2-core::gpu

SIMD Tiers

Tier	Feature Flag	Width	Platform
Reference (scalar)	(default)	N/A	All
AVX2+FMA	`simd-avx2`	256-bit	x86-64
AVX-512	`simd-avx512`	512-bit	x86-64
NEON	`simd-neon`	128-bit	AArch64

Cargo Features

Feature	Purpose
`simd-avx2`	Enable AVX2+FMA SIMD kernels (x86-64)
`avx2`	Alias for `simd-avx2` (Cargo shorthand)
`simd-avx512`	Enable AVX-512 SIMD kernels (x86-64)
`simd-neon`	Enable NEON SIMD kernels (AArch64)
`neon`	Alias for `simd-neon` (Cargo shorthand)
`metal`	Metal GPU backend + fused full-forward (macOS only)
`native-cuda`	Native CUDA NVRTC backend via `cudarc` (Linux/Windows)
`cuda`	scirs2-core CUDA backend (implies `gpu`)
`gpu`	Enable `scirs2-core/gpu` baseline GPU trait support
`wasm`	WebAssembly target adjustments

Usage

[dependencies]
# Auto-detect at runtime:
oxibonsai-kernels = { version = "0.1.2", features = ["simd-avx2"] }

use oxibonsai_kernels::KernelDispatcher;

let dispatcher = KernelDispatcher::auto_detect();
// dispatcher selects AVX2, AVX-512, NEON, or scalar automatically

License