§oxillama-quant
Quantization kernel library for OxiLLaMa.
Provides dequantization and fused matrix-multiplication (matmul) operations for the GGUF quantization formats listed under Supported Formats. Each format has three implementation tiers:
- Reference (naive) — Pure scalar Rust for correctness.
- Portable SIMD — Cross-platform vectorization.
- Platform SIMD — AVX2, AVX-512, NEON intrinsics.
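As an illustration of what a reference-tier kernel might look like, the sketch below dequantizes one Q4_0 block in scalar Rust and also computes a fused dot product that never materializes the dequantized weights. It assumes the usual GGUF Q4_0 layout (32 weights per block, one f16 scale, low nibbles holding the first 16 weights and high nibbles the last 16). The names `BlockQ4_0`, `f16_to_f32`, `dequantize_q4_0`, and `dot_q4_0` are hypothetical and are not taken from this crate's API.

```rust
/// Number of weights in one Q4_0 block (GGUF convention).
const QK4_0: usize = 32;

/// Hypothetical view of one Q4_0 block: an f16 scale kept as raw bits,
/// followed by 32 four-bit quants packed two per byte.
pub struct BlockQ4_0 {
    pub d: u16,
    pub qs: [u8; QK4_0 / 2],
}

/// Convert IEEE 754 half-precision bits to f32 without external crates.
pub fn f16_to_f32(bits: u16) -> f32 {
    let sign = if bits & 0x8000 != 0 { -1.0f32 } else { 1.0 };
    let exp = ((bits >> 10) & 0x1f) as u32;
    let man = (bits & 0x03ff) as u32;
    match exp {
        // Zero and subnormals: value = mantissa * 2^-24.
        0 => sign * (man as f32) * 2f32.powi(-24),
        // Infinity and NaN.
        31 => {
            if man == 0 {
                sign * f32::INFINITY
            } else {
                f32::NAN
            }
        }
        // Normal numbers: rebias the exponent (127 - 15 = 112) and widen the mantissa.
        _ => f32::from_bits((((bits as u32) & 0x8000) << 16) | ((exp + 112) << 23) | (man << 13)),
    }
}

/// Reference dequantization: weight = (nibble - 8) * scale.
pub fn dequantize_q4_0(block: &BlockQ4_0, out: &mut [f32; QK4_0]) {
    let d = f16_to_f32(block.d);
    for (j, &byte) in block.qs.iter().enumerate() {
        let lo = (byte & 0x0f) as i32 - 8; // weight j
        let hi = (byte >> 4) as i32 - 8; // weight j + 16
        out[j] = lo as f32 * d;
        out[j + QK4_0 / 2] = hi as f32 * d;
    }
}

/// Fused dot product: accumulate on the integer quants and apply the
/// block scale once at the end instead of dequantizing first.
pub fn dot_q4_0(block: &BlockQ4_0, x: &[f32; QK4_0]) -> f32 {
    let mut acc = 0.0f32;
    for (j, &byte) in block.qs.iter().enumerate() {
        let lo = (byte & 0x0f) as i32 - 8;
        let hi = (byte >> 4) as i32 - 8;
        acc += lo as f32 * x[j] + hi as f32 * x[j + QK4_0 / 2];
    }
    acc * f16_to_f32(block.d)
}
```

A scalar version like this is mainly useful as a correctness oracle: the SIMD tiers can be checked against it block by block.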
§Supported Formats (planned)
| Category | Types |
|---|---|
| Legacy | Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1 |
| K-Quants | Q2_K, Q3_K, Q4_K, Q5_K, Q6_K |
| I-Quants | IQ1_S, IQ1_M, IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_S, IQ4_XS, IQ4_NL |
| 1-Bit | Q1_0_G128 (from OxiBonsai) |
| Float | F16, BF16, F32 |
Re-exports§
- pub use dispatch::global_dispatcher;
- pub use dispatch::CachedDispatcher;
- pub use dispatch::KernelDispatcher;
- pub use error::QuantError;
- pub use error::QuantResult;
- pub use lora::LoraAdapter;
- pub use quantize::dequantize_to_f32;
- pub use quantize::quantize_f16_to_q4_0;
- pub use quantize::quantize_f16_to_q8_0;
- pub use quantize::quantize_f32_to_q4_0;
- pub use quantize::quantize_f32_to_q8_0;
- pub use traits::QuantKernel;
- pub use types::BlockInfo;
- pub use types::QuantTensor;
Modules§
- dispatch: Runtime kernel selection and dispatch.
- error: Error types for quantization operations.
- lora: LoRA (Low-Rank Adaptation) correction for quantized linear layers.
- parallel: Parallel (multi-threaded) wrappers for quantized matrix operations.
- quantize: Quantize-on-the-fly conversion utilities.
- reference: Reference (naive) implementations of quantization kernels.
- simd: Platform-specific SIMD quantization kernels.
- traits: Core traits for quantization kernels.
- types: Quantization data types and tensor wrapper.
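
To make the idea of runtime kernel selection concrete, here is a minimal sketch of how a dispatcher could pick an implementation tier once and cache the choice. It uses only the standard library's CPU feature detection macros and `OnceLock`. The names `KernelTier`, `detect_tier`, `select_dot`, and `cached_dot` are hypothetical and do not describe the actual `dispatch` module; in this sketch every tier falls back to the same scalar function, where a real crate would return tier-specific SIMD kernels.

```rust
use std::sync::OnceLock;

/// Hypothetical implementation tiers, mirroring the tiers named in the crate description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KernelTier {
    Reference,
    Avx2,
    Avx512,
    Neon,
}

/// Probe the CPU at runtime and return the best tier available.
pub fn detect_tier() -> KernelTier {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx512f") {
            return KernelTier::Avx512;
        }
        if std::arch::is_x86_feature_detected!("avx2") {
            return KernelTier::Avx2;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return KernelTier::Neon;
        }
    }
    KernelTier::Reference
}

/// Signature shared by every tier's kernel in this sketch.
type DotFn = fn(&[f32], &[f32]) -> f32;

/// Scalar fallback: plain dot product over the shorter of the two slices.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Map a tier to a kernel. Assumption: all tiers route to the scalar
/// kernel here; SIMD variants would be plugged in per match arm.
pub fn select_dot(tier: KernelTier) -> DotFn {
    match tier {
        KernelTier::Avx512 | KernelTier::Avx2 | KernelTier::Neon | KernelTier::Reference => {
            dot_scalar
        }
    }
}

/// Detect once, then reuse the selected function pointer on every call.
pub fn cached_dot() -> DotFn {
    static DOT: OnceLock<DotFn> = OnceLock::new();
    *DOT.get_or_init(|| select_dot(detect_tier()))
}
```

Caching the function pointer keeps feature detection off the hot path, so callers pay only an indirect call per kernel invocation.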