
Crate oxibonsai_kernels

§oxibonsai-kernels

1-bit Q1_0_g128 compute kernels for OxiBonsai.

Provides dequantization and fused matrix-multiply operations optimized for the PrismML 1-bit weight format. The kernels are organised in a tiered dispatch architecture that auto-selects the fastest implementation available on the current CPU:

| Tier | Feature gate | Instruction set |
|------|--------------|-----------------|
| Reference | always | Pure scalar Rust (correctness baseline) |
| AVX2+FMA | simd-avx2 | 256-bit SIMD (x86-64) |
| AVX-512 | simd-avx512 | 512-bit SIMD (x86-64) |
| NEON | simd-neon | 128-bit SIMD (AArch64) |

Runtime dispatch is handled by KernelDispatcher, which queries SciRS2-Core’s SIMD capability cache once, on construction.
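The tier selection described above can be sketched roughly as follows. This is a minimal illustration, not the crate's implementation: `Tier`, `detect_tier`, and `Dispatcher` are hypothetical names, and the standard library's feature-detection macros stand in for the SciRS2-Core capability cache. The sketch also ignores the Cargo feature gates, which in the real crate presumably decide which tiers are compiled in at all.

```rust
/// Hypothetical tier enum for illustration; the crate's real `KernelTier`
/// may have different variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Reference,
    Avx2Fma,
    Avx512,
    Neon,
}

/// Pick the widest tier the running CPU supports, falling back to the
/// scalar reference tier. Uses std's detection macros as a stand-in for
/// the SciRS2-Core capability cache (assumption).
pub fn detect_tier() -> Tier {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx512f") {
            return Tier::Avx512;
        }
        if std::arch::is_x86_feature_detected!("avx2")
            && std::arch::is_x86_feature_detected!("fma")
        {
            return Tier::Avx2Fma;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return Tier::Neon;
        }
    }
    Tier::Reference
}

/// Hypothetical dispatcher that probes the CPU once, at construction,
/// and reuses the result for every later call.
pub struct Dispatcher {
    tier: Tier,
}

impl Dispatcher {
    pub fn new() -> Self {
        Self { tier: detect_tier() }
    }

    pub fn tier(&self) -> Tier {
        self.tier
    }
}
```

Probing once and caching the result keeps feature detection out of the per-token hot path; callers would build one dispatcher at start-up and share it.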

§Key Kernels

| Kernel | Description |
|--------|-------------|
| dequant::dequant_1bit_g128 | Unpack 128 sign bits + FP16 scale → FP32 |
| gemv::gemv_1bit_g128 | 1-bit weight matrix × FP32 vector (single-token decode) |
| gemm::gemm_1bit_g128 | 1-bit weight matrix × FP32 matrix (multi-token prefill) |
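To make the block format concrete, here is a scalar sketch of dequantizing one 128-weight group and of a fused dot product over one matrix row. The struct layout, the bit order (least-significant bit first), and the sign convention (bit 1 → +scale, bit 0 → −scale) are assumptions made for illustration; the crate's actual `dequant_1bit_g128` and `gemv_1bit_g128` signatures and layout may differ.

```rust
/// Hypothetical block layout: one FP16 scale (raw bits) plus 128 packed sign bits.
pub struct Block1Bit {
    pub scale_f16: u16,
    pub signs: [u8; 16],
}

/// Convert IEEE 754 half-precision bits to f32 without external crates.
pub fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = ((h >> 15) & 1) as u32;
    let exp = ((h >> 10) & 0x1f) as u32;
    let frac = (h & 0x3ff) as u32;
    let bits = match (exp, frac) {
        (0, 0) => sign << 31, // signed zero
        (0, _) => {
            // Subnormal half: value = frac * 2^-24.
            let v = frac as f32 * (1.0 / 16_777_216.0);
            return if sign == 1 { -v } else { v };
        }
        (0x1f, _) => (sign << 31) | 0x7f80_0000 | (frac << 13), // inf / NaN
        _ => (sign << 31) | ((exp + 112) << 23) | (frac << 13), // rebias 15 -> 127
    };
    f32::from_bits(bits)
}

/// Expand one block to 128 FP32 weights, each +scale or -scale.
/// Assumes bit i lives in byte i / 8 at position i % 8.
pub fn dequant_block(block: &Block1Bit, out: &mut [f32; 128]) {
    let d = f16_bits_to_f32(block.scale_f16);
    for (i, o) in out.iter_mut().enumerate() {
        let bit = (block.signs[i / 8] >> (i % 8)) & 1;
        *o = if bit == 1 { d } else { -d };
    }
}

/// Fused dot product of one weight row with `x`: the signs are applied
/// to the inputs directly and the scale is multiplied in once per group,
/// so no dequantized weights are materialized.
pub fn gemv_row(blocks: &[Block1Bit], x: &[f32]) -> f32 {
    assert_eq!(x.len(), blocks.len() * 128);
    let mut acc = 0.0f32;
    for (b, chunk) in blocks.iter().zip(x.chunks_exact(128)) {
        let d = f16_bits_to_f32(b.scale_f16);
        let mut s = 0.0f32;
        for (i, &xi) in chunk.iter().enumerate() {
            let bit = (b.signs[i / 8] >> (i % 8)) & 1;
            s += if bit == 1 { xi } else { -xi };
        }
        acc += d * s;
    }
    acc
}
```

A full GEMV would call `gemv_row` once per output row, and a GEMM would repeat that for each input column. The fused form does one multiply per group instead of one per weight, which is the main reason to avoid dequantizing first.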

§Trait

All tiers implement OneBitKernel so callers are agnostic to the underlying SIMD level.
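The idea of tier-agnostic callers can be illustrated with a small trait-object sketch. Everything here is hypothetical: `OneBitKernelSketch`, its methods, and both implementations are invented for illustration and do not reflect the real `OneBitKernel` trait's methods. The second implementation only mimics a wider tier by working a byte (eight weights) at a time; it uses no SIMD intrinsics.

```rust
/// Hypothetical stand-in for the crate's `OneBitKernel` trait.
pub trait OneBitKernelSketch {
    fn name(&self) -> &'static str;
    /// Dot product of one 128-weight group (16 packed sign bytes, one
    /// scale) with 128 inputs. Bit 1 means +scale, bit 0 means -scale
    /// (assumed convention).
    fn dot_group(&self, signs: &[u8; 16], scale: f32, x: &[f32; 128]) -> f32;
}

/// Baseline: one weight at a time.
pub struct ScalarKernel;

impl OneBitKernelSketch for ScalarKernel {
    fn name(&self) -> &'static str {
        "scalar"
    }

    fn dot_group(&self, signs: &[u8; 16], scale: f32, x: &[f32; 128]) -> f32 {
        let mut s = 0.0f32;
        for i in 0..128 {
            let bit = (signs[i / 8] >> (i % 8)) & 1;
            s += if bit == 1 { x[i] } else { -x[i] };
        }
        scale * s
    }
}

/// Stand-in for a wider tier: handles one byte (8 lanes) per outer step
/// and uses branch-free sign arithmetic.
pub struct BytewiseKernel;

impl OneBitKernelSketch for BytewiseKernel {
    fn name(&self) -> &'static str {
        "bytewise"
    }

    fn dot_group(&self, signs: &[u8; 16], scale: f32, x: &[f32; 128]) -> f32 {
        let mut s = 0.0f32;
        for (byte_idx, &byte) in signs.iter().enumerate() {
            let mut partial = 0.0f32;
            for lane in 0..8 {
                // Map bit {0, 1} to sign {-1.0, +1.0}.
                let sign = ((byte >> lane) & 1) as f32 * 2.0 - 1.0;
                partial += sign * x[byte_idx * 8 + lane];
            }
            s += partial;
        }
        scale * s
    }
}

/// Caller code depends only on the trait, not on the concrete tier.
pub fn run_group(
    kernel: &dyn OneBitKernelSketch,
    signs: &[u8; 16],
    scale: f32,
    x: &[f32; 128],
) -> f32 {
    kernel.dot_group(signs, scale, x)
}
```

Because tiers may accumulate in different orders, their floating-point results can differ in the last bits on general inputs; comparing each tier against the scalar baseline within a tolerance is the usual way to check correctness.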

Re-exports§

pub use gpu_backend::gpu_gemv_1bit;
pub use gpu_backend::gpu_matmul;
pub use gpu_backend::select_backend;
pub use gpu_backend::CpuBackend;
pub use gpu_backend::DeviceBuffer;
pub use gpu_backend::GpuBackend;
pub use gpu_backend::GpuBackendTrait;
pub use gpu_backend::GpuError;
pub use gpu_backend::LaunchConfig;
pub use aligned::AlignedBlocks;
pub use aligned::AlignedBuffer;
pub use dispatch::KernelDispatcher;
pub use dispatch::KernelTier;
pub use error::KernelError;
pub use error::KernelResult;
pub use gemv_q2k::gemv_q2k;
pub use gemv_q3k::gemv_q3k;
pub use gemv_q4_0::gemv_q4_0;
pub use gemv_q4k::gemv_q4k;
pub use gemv_q5k::gemv_q5k;
pub use gemv_q6k::gemv_q6k;
pub use gemv_q8_0::gemv_q8_0;
pub use gemv_q8k::gemv_q8k;
pub use parallel::gemm_fp8_e4m3_par;
pub use parallel::gemm_fp8_e5m2_par;
pub use parallel::gemm_ternary_g128_par;
pub use parallel::gemv_fp8_e4m3_par;
pub use parallel::gemv_fp8_e5m2_par;
pub use parallel::gemv_ternary_g128_par;
pub use parallel_tiled::gemm_adaptive_ternary;
pub use parallel_tiled::gemv_adaptive;
pub use parallel_tiled::gemv_adaptive_ternary;
pub use prefetch::PrefetchConfig;
pub use prefetch::PrefetchLocality;
pub use prefetch::PrefetchStrategy;
pub use simd_float_ops::rms_norm_simd;
pub use simd_float_ops::rope_apply_simd;
pub use simd_float_ops::silu_simd;
pub use simd_float_ops::softmax_simd;
pub use simd_float_ops::swiglu_simd;
pub use traits::Fp8Kernel;
pub use traits::OneBitKernel;
pub use traits::TernaryKernel;
pub use tuning::PlatformProfile;
pub use tuning::TunedThresholds;
pub use tuning::TuningSummary;
pub use weight_cache::GpuWeightHandle;

Modules§

aligned
Cache-line aligned memory allocations for SIMD kernel operations.
dequant
Reference (naive) dequantization kernel for Q1_0_g128.
dequant_fp8
FP8 dequantization reference kernels (E4M3FN and E5M2).
dequant_ternary
Reference (naive) dequantization kernels for ternary TQ2_0_g128 and TQ2_0 formats.
dispatch
Runtime kernel dispatch with CPU feature detection.
error
Error types for kernel operations.
fp8_lut
Pre-computed FP8 decode lookup tables.
gemm
Reference (naive) GEMM kernel for Q1_0_g128.
gemm_fp8
FP8 GEMM reference kernels (E4M3FN and E5M2).
gemm_ternary
Reference (naive) GEMM kernels for ternary TQ2_0_g128 and TQ2_0 formats.
gemv
Reference (naive) GEMV kernel for Q1_0_g128.
gemv_fp8
FP8 GEMV reference kernels (E4M3FN and E5M2).
gemv_q2k
Scalar GEMV kernel for Q2_K quantized weight matrices.
gemv_q3k
Scalar GEMV kernel for Q3_K quantized weight matrices.
gemv_q4_0
Scalar Q4_0 GEMV reference kernel.
gemv_q4k
Scalar GEMV kernel for Q4_K quantized weight matrices.
gemv_q5k
Scalar GEMV kernel for Q5_K quantized weight matrices.
gemv_q6k
Scalar GEMV kernel for Q6_K quantized weight matrices.
gemv_q8_0
Scalar Q8_0 GEMV reference kernel.
gemv_q8k
Scalar GEMV kernel for Q8_K quantized weight matrices.
gemv_ternary
Reference (naive) GEMV kernels for ternary TQ2_0_g128 and TQ2_0 formats.
gpu_backend
GPU backend abstraction layer for CUDA and Metal acceleration.
packing
Aligned memory allocation, block packing, and prefetch utilities.
parallel
Multi-threaded kernel wrappers using Rayon.
parallel_tiled
Parallel tiled kernel execution.
prefetch
Software prefetch hints for GEMV/GEMM kernel operations.
simd_avx2
AVX2-optimized 1-bit compute kernels for Q1_0_g128.
simd_avx512
AVX-512 optimized 1-bit compute kernels for Q1_0_g128.
simd_float_ops
SIMD-accelerated float operations for LLM inference.
simd_fp8_avx2
AVX2+FMA FP8 dequantization and GEMV/GEMM kernels.
simd_fp8_avx512
AVX-512 FP8 dequantization and GEMV/GEMM kernels.
tiled
Cache-aware tiled GEMV and GEMM computations.
traits
Trait definitions for 1-bit, ternary, and FP8 compute kernels.
tuning
Platform-aware performance tuning for kernel dispatch.
weight_cache
GPU weight cache — uploads model weights once, reuses across all GEMV/GEMM calls.
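
As an illustration of what a pre-computed FP8 decode table (as in the fp8_lut module listed above) might contain, the sketch below builds a 256-entry table for E4M3FN from the standard encoding: 1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities, and NaN at exponent 15 with mantissa 7. The function names are hypothetical, and the crate's actual tables and their layout may differ.

```rust
/// Decode one E4M3FN byte to f32 using the standard encoding
/// (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits).
pub fn decode_e4m3fn(b: u8) -> f32 {
    let sign = if b & 0x80 != 0 { -1.0f32 } else { 1.0 };
    let exp = ((b >> 3) & 0x0f) as i32;
    let man_bits = b & 0x07;
    // E4M3FN has no infinities; only S.1111.111 encodes NaN.
    if exp == 0x0f && man_bits == 0x07 {
        return f32::NAN;
    }
    let man = man_bits as f32;
    let magnitude = if exp == 0 {
        // Subnormal: (man / 8) * 2^-6.
        (man / 8.0) * 2f32.powi(-6)
    } else {
        // Normal: (1 + man / 8) * 2^(exp - 7).
        (1.0 + man / 8.0) * 2f32.powi(exp - 7)
    };
    sign * magnitude
}

/// Build the full 256-entry decode table once, so the hot loop can
/// replace bit manipulation with a single indexed load.
pub fn build_e4m3fn_lut() -> [f32; 256] {
    let mut lut = [0.0f32; 256];
    for (i, slot) in lut.iter_mut().enumerate() {
        *slot = decode_e4m3fn(i as u8);
    }
    lut
}
```

With the table in hand, dequantizing a byte `q` is just `lut[q as usize] * scale`, where `scale` is whatever per-tensor or per-block scale the format carries.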