# oxibonsai-kernels
1-bit Q1_0_g128 compute kernels for OxiBonsai.
Provides dequantization and fused matrix-multiply operations optimized for the PrismML 1-bit weight format. The kernels are organised in a tiered dispatch architecture that auto-selects the fastest implementation available on the current CPU:
| Tier | Feature gate | Instruction set |
|---|---|---|
| Reference | always | Pure scalar Rust (correctness baseline) |
| AVX2+FMA | `simd-avx2` | 256-bit SIMD (x86-64) |
| AVX-512 | `simd-avx512` | 512-bit SIMD (x86-64) |
| NEON | `simd-neon` | 128-bit SIMD (AArch64) |
Runtime dispatch is handled by `KernelDispatcher`, which queries SciRS2-Core's SIMD capability cache on construction.
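Tier selection of this kind can be sketched with the standard library's CPU feature detection. The `KernelTier` enum and the probing order below are illustrative assumptions for this sketch, not the crate's actual implementation:

```rust
// Hypothetical sketch of tiered kernel dispatch. Probes the widest
// instruction set first and falls back to the scalar reference tier.
#[derive(Debug, Clone, Copy, PartialEq)]
enum KernelTier {
    Reference,
    Avx2Fma,
    Avx512,
    Neon,
}

fn select_tier() -> KernelTier {
    #[cfg(target_arch = "x86_64")]
    {
        // Runtime checks; these succeed only if the CPU actually
        // supports the feature, regardless of compile-time flags.
        if is_x86_feature_detected!("avx512f") {
            return KernelTier::Avx512;
        }
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            return KernelTier::Avx2Fma;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        // NEON is mandatory on AArch64, so no runtime probe is needed.
        return KernelTier::Neon;
    }
    #[allow(unreachable_code)]
    KernelTier::Reference
}
```

In the crate itself this decision is cached once at `KernelDispatcher` construction rather than re-probed per call.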
## Key Kernels
| Kernel | Description |
|---|---|
| `dequant::dequant_1bit_g128` | Unpack 128 sign bits + FP16 scale → FP32 |
| `gemv::gemv_1bit_g128` | 1-bit weight matrix × FP32 vector (single-token decode) |
| `gemm::gemm_1bit_g128` | 1-bit weight matrix × FP32 matrix (multi-token prefill) |
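The dequantization step is simple enough to sketch in scalar Rust. Each Q1_0_g128 group packs 128 sign bits (16 bytes) plus one scale; the real format stores the scale as FP16, but this sketch takes `f32` for simplicity, and the bit ordering shown is an assumption, not the crate's documented layout:

```rust
// Scalar sketch of 1-bit g128 dequantization: bit i of the group
// selects +scale or -scale for output element i.
fn dequant_1bit_g128_ref(signs: &[u8; 16], scale: f32, out: &mut [f32; 128]) {
    for i in 0..128 {
        // LSB-first bit order within each byte (an assumption).
        let bit = (signs[i / 8] >> (i % 8)) & 1;
        out[i] = if bit == 1 { scale } else { -scale };
    }
}
```

The fused GEMV/GEMM kernels avoid materializing this FP32 buffer and instead fold the sign selection directly into the dot-product accumulation.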
## Trait

All tiers implement `OneBitKernel`, so callers are agnostic to the underlying SIMD level.
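The pattern looks roughly like the following sketch; the trait's method names and signatures here are assumptions chosen for illustration, not the crate's actual `OneBitKernel` definition:

```rust
// Illustrative one-trait-per-tier pattern: callers hold a
// `&dyn OneBitKernel` (or a generic) and never see the SIMD level.
trait OneBitKernel {
    /// Dot product of one 1-bit weight row with an FP32 vector:
    /// bit i selects +x[i] or -x[i], then the group scale is applied.
    fn gemv_row(&self, signs: &[u8], scale: f32, x: &[f32]) -> f32;
}

/// Scalar correctness-baseline tier.
struct ReferenceKernel;

impl OneBitKernel for ReferenceKernel {
    fn gemv_row(&self, signs: &[u8], scale: f32, x: &[f32]) -> f32 {
        let mut acc = 0.0f32;
        for (i, &xi) in x.iter().enumerate() {
            let bit = (signs[i / 8] >> (i % 8)) & 1;
            acc += if bit == 1 { xi } else { -xi };
        }
        acc * scale
    }
}
```

An AVX2 or NEON tier would implement the same trait with vectorized inner loops, leaving call sites unchanged.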
## Re-exports
- `gpu_backend`: `gpu_gemv_1bit`, `gpu_matmul`, `select_backend`, `CpuBackend`, `DeviceBuffer`, `GpuBackend`, `GpuBackendTrait`, `GpuError`, `LaunchConfig`
- `aligned`: `AlignedBlocks`, `AlignedBuffer`
- `dispatch`: `KernelDispatcher`, `KernelTier`
- `error`: `KernelError`, `KernelResult`
- `gemv_q2k` … `gemv_q8k`: `gemv_q2k`, `gemv_q3k`, `gemv_q4_0`, `gemv_q4k`, `gemv_q5k`, `gemv_q6k`, `gemv_q8_0`, `gemv_q8k`
- `parallel`: `gemm_fp8_e4m3_par`, `gemm_fp8_e5m2_par`, `gemm_ternary_g128_par`, `gemv_fp8_e4m3_par`, `gemv_fp8_e5m2_par`, `gemv_ternary_g128_par`
- `parallel_tiled`: `gemm_adaptive_ternary`, `gemv_adaptive`, `gemv_adaptive_ternary`
- `prefetch`: `PrefetchConfig`, `PrefetchLocality`, `PrefetchStrategy`
- `simd_float_ops`: `rms_norm_simd`, `rope_apply_simd`, `silu_simd`, `softmax_simd`, `swiglu_simd`
- `traits`: `Fp8Kernel`, `OneBitKernel`, `TernaryKernel`
- `tuning`: `PlatformProfile`, `TunedThresholds`, `TuningSummary`
- `weight_cache`: `GpuWeightHandle`
## Modules
- `aligned`: Cache-line aligned memory allocations for SIMD kernel operations.
- `dequant`: Reference (naive) dequantization kernel for Q1_0_g128.
- `dequant_fp8`: FP8 dequantization reference kernels (E4M3FN and E5M2).
- `dequant_ternary`: Reference (naive) dequantization kernels for ternary TQ2_0_g128 and TQ2_0 formats.
- `dispatch`: Runtime kernel dispatch with CPU feature detection.
- `error`: Error types for kernel operations.
- `fp8_lut`: Pre-computed FP8 decode lookup tables.
- `gemm`: Reference (naive) GEMM kernel for Q1_0_g128.
- `gemm_fp8`: FP8 GEMM reference kernels (E4M3FN and E5M2).
- `gemm_ternary`: Reference (naive) GEMM kernels for ternary TQ2_0_g128 and TQ2_0 formats.
- `gemv`: Reference (naive) GEMV kernel for Q1_0_g128.
- `gemv_fp8`: FP8 GEMV reference kernels (E4M3FN and E5M2).
- `gemv_q2k`: Scalar GEMV kernel for Q2_K quantized weight matrices.
- `gemv_q3k`: Scalar GEMV kernel for Q3_K quantized weight matrices.
- `gemv_q4_0`: Scalar Q4_0 GEMV reference kernel.
- `gemv_q4k`: Scalar GEMV kernel for Q4_K quantized weight matrices.
- `gemv_q5k`: Scalar GEMV kernel for Q5_K quantized weight matrices.
- `gemv_q6k`: Scalar GEMV kernel for Q6_K quantized weight matrices.
- `gemv_q8_0`: Scalar Q8_0 GEMV reference kernel.
- `gemv_q8k`: Scalar GEMV kernel for Q8_K quantized weight matrices.
- `gemv_ternary`: Reference (naive) GEMV kernels for ternary TQ2_0_g128 and TQ2_0 formats.
- `gpu_backend`: GPU backend abstraction layer for CUDA and Metal acceleration.
- `packing`: Aligned memory allocation, block packing, and prefetch utilities.
- `parallel`: Multi-threaded kernel wrappers using Rayon.
- `parallel_tiled`: Parallel tiled kernel execution.
- `prefetch`: Software prefetch hints for GEMV/GEMM kernel operations.
- `simd_avx2`: AVX2-optimized 1-bit compute kernels for Q1_0_g128.
- `simd_avx512`: AVX-512-optimized 1-bit compute kernels for Q1_0_g128.
- `simd_float_ops`: SIMD-accelerated float operations for LLM inference.
- `simd_fp8_avx2`: AVX2+FMA FP8 dequantization and GEMV/GEMM kernels.
- `simd_fp8_avx512`: AVX-512 FP8 dequantization and GEMV/GEMM kernels.
- `tiled`: Cache-aware tiled GEMV and GEMM computations.
- `traits`: Trait definitions for 1-bit, ternary, and FP8 compute kernels.
- `tuning`: Platform-aware performance tuning for kernel dispatch.
- `weight_cache`: GPU weight cache that uploads model weights once and reuses them across all GEMV/GEMM calls.