# oxibonsai-kernels
1-bit Q1_0_g128 compute kernels for OxiBonsai.
Provides dequantization and fused matrix-multiply operations optimized for the PrismML 1-bit weight format. The kernels are organised in a tiered dispatch architecture that auto-selects the fastest implementation available on the current CPU:
| Tier | Feature gate | Instruction set |
|---|---|---|
| Reference | always | Pure scalar Rust (correctness baseline) |
| AVX2+FMA | `simd-avx2` | 256-bit SIMD (x86-64) |
| AVX-512 | `simd-avx512` | 512-bit SIMD (x86-64) |
| NEON | `simd-neon` | 128-bit SIMD (AArch64) |
Runtime dispatch is handled by `KernelDispatcher`, which queries SciRS2-Core's SIMD capability cache on construction.
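Tier selection of this kind can be sketched with the standard library's CPU feature detection. The `KernelTier` enum and the probing order below are illustrative assumptions for this sketch, not the crate's actual implementation:

```rust
// Hypothetical sketch of tiered kernel dispatch. Probes the widest
// instruction set first and falls back to the scalar reference tier.
#[derive(Debug, Clone, Copy, PartialEq)]
enum KernelTier {
    Reference,
    Avx2Fma,
    Avx512,
    Neon,
}

fn select_tier() -> KernelTier {
    #[cfg(target_arch = "x86_64")]
    {
        // Runtime checks; these succeed only if the CPU actually
        // supports the feature, regardless of compile-time flags.
        if is_x86_feature_detected!("avx512f") {
            return KernelTier::Avx512;
        }
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            return KernelTier::Avx2Fma;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        // NEON is mandatory on AArch64, so no runtime probe is needed.
        return KernelTier::Neon;
    }
    #[allow(unreachable_code)]
    KernelTier::Reference
}
```

In the crate itself this decision is cached once at `KernelDispatcher` construction rather than re-probed per call.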
## Key Kernels
| Kernel | Description |
|---|---|
| `dequant::dequant_1bit_g128` | Unpack 128 sign bits + FP16 scale → FP32 |
| `gemv::gemv_1bit_g128` | 1-bit weight matrix × FP32 vector (single-token decode) |
| `gemm::gemm_1bit_g128` | 1-bit weight matrix × FP32 matrix (multi-token prefill) |
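The dequantization step is simple enough to sketch in scalar Rust. Each Q1_0_g128 group packs 128 sign bits (16 bytes) plus one scale; the real format stores the scale as FP16, but this sketch takes `f32` for simplicity, and the bit ordering shown is an assumption, not the crate's documented layout:

```rust
// Scalar sketch of 1-bit g128 dequantization: bit i of the group
// selects +scale or -scale for output element i.
fn dequant_1bit_g128_ref(signs: &[u8; 16], scale: f32, out: &mut [f32; 128]) {
    for i in 0..128 {
        // LSB-first bit order within each byte (an assumption).
        let bit = (signs[i / 8] >> (i % 8)) & 1;
        out[i] = if bit == 1 { scale } else { -scale };
    }
}
```

The fused GEMV/GEMM kernels avoid materializing this FP32 buffer and instead fold the sign selection directly into the dot-product accumulation.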
## Trait

All tiers implement `OneBitKernel`, so callers are agnostic to the underlying SIMD level.
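The pattern looks roughly like the following sketch; the trait's method names and signatures here are assumptions chosen for illustration, not the crate's actual `OneBitKernel` definition:

```rust
// Illustrative one-trait-per-tier pattern: callers hold a
// `&dyn OneBitKernel` (or a generic) and never see the SIMD level.
trait OneBitKernel {
    /// Dot product of one 1-bit weight row with an FP32 vector:
    /// bit i selects +x[i] or -x[i], then the group scale is applied.
    fn gemv_row(&self, signs: &[u8], scale: f32, x: &[f32]) -> f32;
}

/// Scalar correctness-baseline tier.
struct ReferenceKernel;

impl OneBitKernel for ReferenceKernel {
    fn gemv_row(&self, signs: &[u8], scale: f32, x: &[f32]) -> f32 {
        let mut acc = 0.0f32;
        for (i, &xi) in x.iter().enumerate() {
            let bit = (signs[i / 8] >> (i % 8)) & 1;
            acc += if bit == 1 { xi } else { -xi };
        }
        acc * scale
    }
}
```

An AVX2 or NEON tier would implement the same trait with vectorized inner loops, leaving call sites unchanged.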
## Re-exports
- `gpu_backend`: `gpu_gemv_1bit`, `gpu_matmul`, `select_backend`, `CpuBackend`, `DeviceBuffer`, `GpuBackend`, `GpuBackendTrait`, `GpuError`, `LaunchConfig`
- `aligned`: `AlignedBlocks`, `AlignedBuffer`
- `dispatch`: `KernelDispatcher`, `KernelTier`
- `error`: `KernelError`, `KernelResult`
- `gemv_q2k` … `gemv_q8k`: `gemv_q2k`, `gemv_q3k`, `gemv_q4_0`, `gemv_q4k`, `gemv_q5k`, `gemv_q6k`, `gemv_q8_0`, `gemv_q8k`
- `parallel`: `gemm_fp8_e4m3_par`, `gemm_fp8_e5m2_par`, `gemm_ternary_g128_par`, `gemv_fp8_e4m3_par`, `gemv_fp8_e5m2_par`, `gemv_ternary_g128_par`
- `parallel_tiled`: `gemm_adaptive_ternary`, `gemv_adaptive`, `gemv_adaptive_ternary`
- `prefetch`: `PrefetchConfig`, `PrefetchLocality`, `PrefetchStrategy`
- `simd_float_ops`: `rms_norm_simd`, `rope_apply_simd`, `silu_simd`, `softmax_simd`, `swiglu_simd`
- `traits`: `Fp8Kernel`, `OneBitKernel`, `TernaryKernel`
- `tuning`: `PlatformProfile`, `TunedThresholds`, `TuningSummary`
- `weight_cache`: `GpuWeightHandle`
## Modules
- `aligned`: Cache-line aligned memory allocations for SIMD kernel operations.
- `dequant`: Reference (naive) dequantization kernel for Q1_0_g128.
- `dequant_fp8`: FP8 dequantization reference kernels (E4M3FN and E5M2).
- `dequant_ternary`: Reference (naive) dequantization kernels for ternary TQ2_0_g128 and TQ2_0 formats.
- `dispatch`: Runtime kernel dispatch with CPU feature detection.
- `error`: Error types for kernel operations.
- `fp8_lut`: Pre-computed FP8 decode lookup tables.
- `gemm`: Reference (naive) GEMM kernel for Q1_0_g128.
- `gemm_fp8`: FP8 GEMM reference kernels (E4M3FN and E5M2).
- `gemm_ternary`: Reference (naive) GEMM kernels for ternary TQ2_0_g128 and TQ2_0 formats.
- `gemv`: Reference (naive) GEMV kernel for Q1_0_g128.
- `gemv_fp8`: FP8 GEMV reference kernels (E4M3FN and E5M2).
- `gemv_q2k`: Scalar GEMV kernel for Q2_K quantized weight matrices.
- `gemv_q3k`: Scalar GEMV kernel for Q3_K quantized weight matrices.
- `gemv_q4_0`: Scalar Q4_0 GEMV reference kernel.
- `gemv_q4k`: Scalar GEMV kernel for Q4_K quantized weight matrices.
- `gemv_q5k`: Scalar GEMV kernel for Q5_K quantized weight matrices.
- `gemv_q6k`: Scalar GEMV kernel for Q6_K quantized weight matrices.
- `gemv_q8_0`: Scalar Q8_0 GEMV reference kernel.
- `gemv_q8k`: Scalar GEMV kernel for Q8_K quantized weight matrices.
- `gemv_ternary`: Reference (naive) GEMV kernels for ternary TQ2_0_g128 and TQ2_0 formats.
- `gpu_backend`: GPU backend abstraction layer for CUDA and Metal acceleration.
- `packing`: Aligned memory allocation, block packing, and prefetch utilities.
- `parallel`: Multi-threaded kernel wrappers using Rayon.
- `parallel_tiled`: Parallel tiled kernel execution.
- `prefetch`: Software prefetch hints for GEMV/GEMM kernel operations.
- `simd_avx2`: AVX2-optimized 1-bit compute kernels for Q1_0_g128.
- `simd_avx512`: AVX-512-optimized 1-bit compute kernels for Q1_0_g128.
- `simd_float_ops`: SIMD-accelerated float operations for LLM inference.
- `simd_fp8_avx2`: AVX2+FMA FP8 dequantization and GEMV/GEMM kernels.
- `simd_fp8_avx512`: AVX-512 FP8 dequantization and GEMV/GEMM kernels.
- `tiled`: Cache-aware tiled GEMV and GEMM computations.
- `traits`: Trait definitions for 1-bit, ternary, and FP8 compute kernels.
- `tuning`: Platform-aware performance tuning for kernel dispatch.
- `weight_cache`: GPU weight cache that uploads model weights once and reuses them across all GEMV/GEMM calls.