
Crate oxillama_quant



oxillama-quant

Quantization kernel library for OxiLLaMa.

Provides dequantization and fused matrix multiplication (matmul) operations targeting all GGUF quantization formats. Each format has three implementation tiers:

  1. Reference (naive) — Pure scalar Rust for correctness.
  2. Portable SIMD — Cross-platform vectorization.
  3. Platform SIMD — AVX2, AVX-512, NEON intrinsics.
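As a minimal sketch of the reference tier, and of what "fused" means here, the Rust below dequantizes Q8_0 blocks and also computes a dot product directly on the quantized data, applying each block's scale once instead of materializing the f32 row. `BlockQ8_0`, `dequantize_q8_0`, and `dot_q8_0_f32` are hypothetical names, not this crate's API. The scale is stored as `f32` for simplicity, whereas GGUF stores it as f16.

```rust
/// Hypothetical Q8_0 block for illustration. GGUF stores the scale as f16;
/// an f32 is used here to avoid depending on a half-precision crate.
#[derive(Clone, Copy)]
pub struct BlockQ8_0 {
    pub d: f32,       // per-block scale
    pub qs: [i8; 32], // 32 signed 8-bit quantized values
}

/// Reference tier: plain scalar dequantization, x[i] = d * q[i].
pub fn dequantize_q8_0(blocks: &[BlockQ8_0], out: &mut [f32]) {
    assert_eq!(out.len(), blocks.len() * 32, "output must hold 32 values per block");
    for (block, chunk) in blocks.iter().zip(out.chunks_exact_mut(32)) {
        for (o, &q) in chunk.iter_mut().zip(block.qs.iter()) {
            *o = block.d * q as f32;
        }
    }
}

/// Fused dot product: sums q[i] * x[i] within each block and multiplies by
/// the block scale once, so no dequantized row is ever written out.
pub fn dot_q8_0_f32(blocks: &[BlockQ8_0], x: &[f32]) -> f32 {
    assert_eq!(x.len(), blocks.len() * 32, "x must hold 32 values per block");
    blocks
        .iter()
        .zip(x.chunks_exact(32))
        .map(|(block, xs)| {
            let partial: f32 = block.qs.iter().zip(xs).map(|(&q, &v)| q as f32 * v).sum();
            block.d * partial
        })
        .sum()
}
```

A fused matmul is this dot product repeated for every row of the quantized weight matrix.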
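The portable tier can be approximated on stable Rust without intrinsics. The sketch below (a hypothetical `dot_f32_lanes`, not this crate's kernel) keeps eight independent accumulators so that LLVM can usually auto-vectorize the inner loop on most targets. It uses plain slices rather than `std::simd`, which is still nightly-only, and vectorization is likely but not guaranteed.

```rust
/// Sketch of a "portable SIMD" style f32 dot product on stable Rust.
/// Eight independent accumulators break the serial dependency on a single
/// running sum, which typically lets the compiler emit vector instructions.
pub fn dot_f32_lanes(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");
    const LANES: usize = 8;
    let mut acc = [0.0f32; LANES];
    let a_chunks = a.chunks_exact(LANES);
    let b_chunks = b.chunks_exact(LANES);
    // The last len % 8 elements are handled with a scalar loop.
    let tail: f32 = a_chunks
        .remainder()
        .iter()
        .zip(b_chunks.remainder())
        .map(|(x, y)| x * y)
        .sum();
    for (ca, cb) in a_chunks.zip(b_chunks) {
        for i in 0..LANES {
            acc[i] += ca[i] * cb[i];
        }
    }
    acc.iter().sum::<f32>() + tail
}
```

The platform tier would replace the inner loop with explicit AVX2, AVX-512, or NEON intrinsics behind runtime feature checks.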

Supported Formats (planned)

| Category | Types |
|----------|-------|
| Legacy   | Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1 |
| K-Quants | Q2_K, Q3_K, Q4_K, Q5_K, Q6_K |
| I-Quants | IQ1_S, IQ1_M, IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_S, IQ4_XS, IQ4_NL |
| 1-Bit    | Q1_0_G128 (from OxiBonsai) |
| Float    | F16, BF16, F32 |
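Formats differ in block geometry. For example, a GGUF Q4_0 block packs 32 weights into 18 bytes (an f16 scale plus 16 bytes of 4-bit values), about 4.5 bits per weight. The sketch below unpacks one block following the nibble order of ggml's reference code: byte `j` holds element `j` in its low nibble and element `j + 16` in its high nibble, each offset by 8. `BlockQ4_0` and `dequantize_block_q4_0` are hypothetical names, and the `f32` scale stands in for the f16 field.

```rust
/// Hypothetical Q4_0 block: 32 weights packed two per byte into 16 bytes.
/// (f32 scale used for simplicity; GGUF stores it as f16.)
pub struct BlockQ4_0 {
    pub d: f32,
    pub qs: [u8; 16],
}

/// Unpacks one block. Each stored nibble in 0..=15 is shifted to -8..=7
/// and then multiplied by the block scale.
pub fn dequantize_block_q4_0(block: &BlockQ4_0) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    for (j, &byte) in block.qs.iter().enumerate() {
        let lo = (byte & 0x0F) as i32 - 8; // element j
        let hi = (byte >> 4) as i32 - 8; // element j + 16
        out[j] = lo as f32 * block.d;
        out[j + 16] = hi as f32 * block.d;
    }
    out
}
```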

Re-exports

pub use dispatch::global_dispatcher;
pub use dispatch::CachedDispatcher;
pub use dispatch::KernelDispatcher;
pub use error::QuantError;
pub use error::QuantResult;
pub use lora::LoraAdapter;
pub use quantize::dequantize_to_f32;
pub use quantize::quantize_f16_to_q4_0;
pub use quantize::quantize_f16_to_q8_0;
pub use quantize::quantize_f32_to_q4_0;
pub use quantize::quantize_f32_to_q8_0;
pub use traits::QuantKernel;
pub use types::BlockInfo;
pub use types::QuantTensor;
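The `quantize_f32_to_q8_0` re-export converts f32 data to Q8_0. Its real signature is not shown on this page, so the sketch below is a stand-alone, hypothetical `quantize_q8_0_sketch` that follows the scheme used by ggml's reference quantizer: per 32-element block, take the largest absolute value, set the scale `d = amax / 127`, and round each `x / d` to the nearest integer.

```rust
/// Hypothetical stand-alone Q8_0 quantizer (not this crate's signature).
/// Returns one (scale, quantized values) pair per 32-element block.
pub fn quantize_q8_0_sketch(x: &[f32]) -> Vec<(f32, [i8; 32])> {
    assert!(x.len() % 32 == 0, "length must be a multiple of the block size (32)");
    x.chunks_exact(32)
        .map(|block| {
            // Choose the scale so the largest magnitude maps to +/-127.
            let amax = block.iter().fold(0.0f32, |m, v| m.max(v.abs()));
            let d = amax / 127.0;
            // An all-zero block gets scale 0 and all-zero quants.
            let id = if d != 0.0 { 1.0 / d } else { 0.0 };
            let mut qs = [0i8; 32];
            for (q, &v) in qs.iter_mut().zip(block) {
                *q = (v * id).round() as i8;
            }
            (d, qs)
        })
        .collect()
}
```

Round-to-nearest bounds the per-element reconstruction error at roughly half the block scale.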

Modules

dispatch
Runtime kernel selection and dispatch.
error
Error types for quantization operations.
lora
LoRA (Low-Rank Adaptation) correction for quantized linear layers.
parallel
Parallel (multi-threaded) wrappers for quantized matrix operations.
quantize
Quantize-on-the-fly conversion utilities.
reference
Reference (naive) implementations of quantization kernels.
simd
Platform-specific SIMD quantization kernels.
traits
Core traits for quantization kernels.
types
Quantization data types and tensor wrapper.
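The `dispatch` module performs runtime kernel selection, and the crate re-exports `global_dispatcher` and `CachedDispatcher`. How those work is not described here, so the following is only an assumed pattern: probe CPU features once with the standard library's feature-detection macros and cache the result in a `std::sync::OnceLock` (Rust 1.70+). `Tier`, `detect_tier`, and `selected_tier` are hypothetical names.

```rust
use std::sync::OnceLock;

/// Hypothetical kernel tiers that a dispatcher might choose between.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Tier {
    Portable,
    Avx2,
    Neon,
}

/// Probes CPU features at runtime; falls back to the portable tier.
fn detect_tier() -> Tier {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        // AVX2 kernels commonly also rely on FMA, so require both.
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            return Tier::Avx2;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return Tier::Neon;
        }
    }
    Tier::Portable
}

/// Runs detection once; later calls just read the cached value.
pub fn selected_tier() -> Tier {
    static TIER: OnceLock<Tier> = OnceLock::new();
    *TIER.get_or_init(detect_tier)
}
```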
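The `lora` module applies a LoRA correction to quantized linear layers. In standard LoRA, the layer output becomes `y = W x + (alpha / r) * B (A x)`, where `A` is `r x in` and `B` is `out x r`. The dense f32 sketch below shows only that arithmetic; the hypothetical `lora_forward` uses an f32 base weight, whereas in this crate `W` would be quantized, and the `LoraAdapter` API is not reproduced.

```rust
/// Hypothetical dense LoRA forward pass, all matrices row-major:
/// w: out_dim x in_dim, a: rank x in_dim, b: out_dim x rank.
/// Computes y = W x + (alpha / rank) * B (A x).
#[allow(clippy::too_many_arguments)]
pub fn lora_forward(
    w: &[f32], a: &[f32], b: &[f32], x: &[f32],
    out_dim: usize, in_dim: usize, rank: usize, alpha: f32,
) -> Vec<f32> {
    assert_eq!(w.len(), out_dim * in_dim);
    assert_eq!(a.len(), rank * in_dim);
    assert_eq!(b.len(), out_dim * rank);
    assert_eq!(x.len(), in_dim);
    // Row-major matrix-vector product helper.
    let matvec = |m: &[f32], v: &[f32], rows: usize, cols: usize| -> Vec<f32> {
        (0..rows)
            .map(|r| m[r * cols..(r + 1) * cols].iter().zip(v).map(|(p, q)| p * q).sum::<f32>())
            .collect()
    };
    // Low-rank path: project down to `rank` dims, then back up to `out_dim`.
    let ax = matvec(a, x, rank, in_dim);
    let bax = matvec(b, &ax, out_dim, rank);
    let scale = alpha / rank as f32;
    matvec(w, x, out_dim, in_dim)
        .into_iter()
        .zip(bax)
        .map(|(base, delta)| base + scale * delta)
        .collect()
}
```

Because `rank` is small, the correction costs about `rank * (in_dim + out_dim)` multiply-adds on top of the base product.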
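The `parallel` module wraps quantized matrix operations for multiple threads. One common strategy, assumed here rather than taken from the crate, is to split output rows into contiguous chunks and give each chunk to a thread. The hypothetical `par_matvec` below does this with `std::thread::scope` (Rust 1.63+) on f32 data; a quantized version would call a per-row fused dot product instead.

```rust
use std::thread;

/// Hypothetical row-parallel matrix-vector product (row-major `m`).
/// Output rows are split into contiguous chunks, one scoped thread per chunk.
pub fn par_matvec(m: &[f32], x: &[f32], rows: usize, cols: usize, threads: usize) -> Vec<f32> {
    assert_eq!(m.len(), rows * cols);
    assert_eq!(x.len(), cols);
    let mut y = vec![0.0f32; rows];
    let t = threads.max(1);
    // Ceiling division; at least 1 because chunks_mut(0) would panic.
    let rows_per_thread = ((rows + t - 1) / t).max(1);
    thread::scope(|s| {
        for (chunk_idx, y_chunk) in y.chunks_mut(rows_per_thread).enumerate() {
            s.spawn(move || {
                let first_row = chunk_idx * rows_per_thread;
                for (i, out) in y_chunk.iter_mut().enumerate() {
                    let r = first_row + i;
                    let row = &m[r * cols..(r + 1) * cols];
                    *out = row.iter().zip(x).map(|(a, b)| a * b).sum();
                }
            });
        }
    });
    y
}
```

Each thread writes to a disjoint slice of `y`, so no locking is needed.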
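The `types` module exposes `BlockInfo`, whose fields are not listed here. As an illustration of the kind of per-format metadata such a type carries, the sketch below uses a hypothetical `BlockGeom` (elements per block and bytes per block, with values from the GGUF block layouts) to compute how many bytes a row of `n` elements occupies.

```rust
/// Hypothetical per-format block geometry (not this crate's `BlockInfo`).
#[derive(Clone, Copy, Debug)]
pub struct BlockGeom {
    pub block_size: usize, // elements per block
    pub type_size: usize,  // bytes per block
}

/// A few GGUF formats; returns None for names not covered by this sketch.
pub fn geom(name: &str) -> Option<BlockGeom> {
    let (block_size, type_size) = match name {
        "F32" => (1, 4),
        "F16" | "BF16" => (1, 2),
        "Q4_0" => (32, 18),   // f16 scale + 16 bytes of 4-bit values
        "Q8_0" => (32, 34),   // f16 scale + 32 int8 values
        "Q4_K" => (256, 144), // d, dmin, 12 scale bytes, 128 bytes of 4-bit values
        "Q6_K" => (256, 210), // 128 + 64 bytes of bits, 16 int8 scales, f16 scale
        _ => return None,
    };
    Some(BlockGeom { block_size, type_size })
}

/// Bytes for a row of `n` elements; None if `n` is not a whole number of blocks.
pub fn row_bytes(name: &str, n: usize) -> Option<usize> {
    let g = geom(name)?;
    (n % g.block_size == 0).then(|| n / g.block_size * g.type_size)
}
```

For a 4096-wide row this gives 2304 bytes for both Q4_0 and Q4_K, 4352 bytes for Q8_0, and 8192 bytes for F16.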