
Crate rage_quant


§rage-quant

High-performance quantized GEMV kernels for CPU-only LLM inference.

This crate provides direct dot product operations on GGML quantized tensor blocks (Q8_0, Q6_K, Q4_K) without requiring dequantization to dense f32 tensors first. This approach:

  • Reduces DRAM bandwidth by 3.76x (1.06 bytes/elem vs 4 bytes/elem)
  • Eliminates dense f32 cache entirely (78.8% RAM savings)
  • Achieves 3.0x decode speedup on CPU inference
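For intuition, here is the arithmetic behind the bandwidth figures along with a scalar reference sketch of the Q8_0 path: a GGML Q8_0 block stores a little-endian f16 scale followed by 32 signed 8-bit quants, i.e. 34 bytes per 32 elements (1.0625 bytes/elem, hence the 3.76x reduction versus 4-byte f32). The sketch assumes only that layout; the crate's `dot_q8_0_f32` is the AVX2+FMA equivalent, and `dot_q8_0_ref`/`decode_f16` below are local illustrative helpers, not the crate's implementation.

```rust
/// Decode an IEEE 754 half-precision value from its raw bits.
fn decode_f16(bits: u16) -> f32 {
    let sign = if bits & 0x8000 != 0 { -1.0f32 } else { 1.0 };
    let exp = ((bits >> 10) & 0x1F) as i32;
    let mant = (bits & 0x3FF) as f32;
    match exp {
        0 => sign * (mant / 1024.0) * 2.0f32.powi(-14), // subnormal
        31 => {
            if mant == 0.0 { sign * f32::INFINITY } else { f32::NAN }
        }
        _ => sign * (1.0 + mant / 1024.0) * 2.0f32.powi(exp - 15), // normal
    }
}

/// Scalar Q8_0 dot product: each 34-byte block holds an f16 scale `d`
/// and 32 int8 quants, and contributes d * sum(q[i] * x[i]).
fn dot_q8_0_ref(blocks: &[u8], x: &[f32]) -> f32 {
    assert_eq!(blocks.len() % 34, 0);
    let mut acc = 0.0f32;
    for (b, blk) in blocks.chunks_exact(34).enumerate() {
        let d = decode_f16(u16::from_le_bytes([blk[0], blk[1]]));
        let mut sum = 0.0f32;
        for (i, &q) in blk[2..].iter().enumerate() {
            sum += (q as i8) as f32 * x[b * 32 + i];
        }
        acc += d * sum;
    }
    acc
}
```

Note that the dense f32 row is never materialized: the scale is applied once per 32-element block, which is what keeps the per-element traffic at roughly 1.06 bytes.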

§Key functions

§GEMM/GEMV utilities

  • gemm_kernel::dot_f32 — AVX2+FMA vectorized f32 dot product
  • gemm_kernel::gemv_par — Rayon-parallelized GEMV
  • gemm_kernel::gemm_par — Rayon-parallelized GEMM
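The parallel variants distribute output rows across worker threads. Below is a minimal row-split GEMV sketch using std scoped threads as a stand-in for the crate's Rayon-based version; the function name, signature, and chunking policy here are illustrative assumptions, not the crate's API.

```rust
/// y = A * x for a row-major `rows x cols` matrix. Output rows are
/// split into contiguous chunks, one chunk per thread, mirroring a
/// Rayon par_chunks-style parallel GEMV.
fn gemv_rows_sketch(a: &[f32], x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(a.len(), rows * cols);
    assert_eq!(x.len(), cols);
    let mut y = vec![0.0f32; rows];
    let threads = rows.max(1).min(4);
    let chunk = (rows + threads - 1) / threads;
    std::thread::scope(|s| {
        for (t, y_chunk) in y.chunks_mut(chunk).enumerate() {
            // Slice of A covering this thread's rows.
            let a_part = &a[t * chunk * cols..];
            s.spawn(move || {
                for (i, out) in y_chunk.iter_mut().enumerate() {
                    // Dot product of one matrix row with x.
                    *out = a_part[i * cols..(i + 1) * cols]
                        .iter()
                        .zip(x)
                        .map(|(&w, &v)| w * v)
                        .sum();
                }
            });
        }
    });
    y
}
```

Row-wise splitting is the natural axis here because each output element depends on one matrix row, so the threads share only read-only data and never contend on writes.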

§Example

use rage_quant::{dot_q8_0_f32, dequantize_q8_0_block};

// Direct quantized dot product: `quantized_data` holds raw Q8_0 block
// bytes, `input_vector` the dense f32 activations; no dequantization
// step or dense f32 buffer is needed
let result = dot_q8_0_f32(&quantized_data, &input_vector, num_elements);

// Or dequantize a single Q8_0 block to its 32 f32 values if needed
let f32_values = dequantize_q8_0_block(&block_bytes).unwrap();

Re-exports§

pub use ggml_quant::dot_q8_0_f32;
pub use ggml_quant::dot_q6_k_f32;
pub use ggml_quant::dot_q4_k_f32;
pub use ggml_quant::dequantize_q8_0_block;
pub use ggml_quant::dequantize_q4_k_block;
pub use ggml_quant::dequantize_q6_k_block;
pub use ggml_quant::decode_f16;
pub use gemm_kernel::dot_f32;
pub use gemm_kernel::gemv_rows_f32;
pub use gemm_kernel::gemm_f32_row_major;

Modules§

gemm_kernel
ggml_quant