1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
//! # rage-quant
//!
//! High-performance quantized GEMV kernels for CPU-only LLM inference.
//!
//! This crate provides direct dot product operations on GGML quantized
//! tensor blocks (Q8_0, Q6_K, Q4_K) without requiring dequantization
//! to dense f32 tensors first. This approach:
//!
//! - Reduces DRAM bandwidth by 3.76x (1.06 bytes/elem vs 4 bytes/elem)
//! - Eliminates dense f32 cache entirely (78.8% RAM savings)
//! - Achieves 3.0x decode speedup on CPU inference
//!
//! ## Key functions
//!
//! - [`dot_q8_0_f32`] — Direct dot product on Q8_0 blocks (auto-detects AVX2)
//! - [`dot_q6_k_f32`] — Direct dot product on Q6_K blocks
//! - [`dot_q4_k_f32`] — Direct dot product on Q4_K blocks
//! - [`dequantize_q8_0_block`] — Dequantize Q8_0 block to f32
//! - [`dequantize_q4_k_block`] — Dequantize Q4_K block to f32
//! - [`dequantize_q6_k_block`] — Dequantize Q6_K block to f32
//!
//! ## GEMM/GEMV utilities
//!
//! - [`gemm_kernel::dot_f32`] — AVX2+FMA vectorized f32 dot product
//! - [`gemm_kernel::gemv_par`] — Rayon-parallelized GEMV
//! - [`gemm_kernel::gemm_par`] — Rayon-parallelized GEMM
//!
//! ## Example
//!
//! ```ignore
//! use rage_quant::{dot_q8_0_f32, dequantize_q8_0_block};
//!
//! // Direct quantized dot product (no dequantization needed)
//! let result = dot_q8_0_f32(&quantized_data, &input_vector, num_elements);
//!
//! // Or dequantize a single block if needed
//! let f32_values = dequantize_q8_0_block(&block_bytes).unwrap();
//! ```
// Re-export primary quantized dot product functions
pub use ;
// Re-export GEMM/GEMV utilities
pub use ;