rage-quant

Run LLMs 3x faster on your CPU — no GPU required.

Pure Rust quantized GEMV kernels that perform matrix-vector multiplication directly on GGML quantized data (Q8_0, Q6_K, Q4_K) with AVX2+FMA SIMD. No dequantization. No wasted bandwidth. No GPU needed.


Why use rage-quant?

If you are building a Rust LLM inference pipeline and want CPU performance that approaches GPU speed on small-to-medium models, this crate gives you the core computation kernels.

The problem with standard inference

Mainstream Rust frameworks (candle, burn) do this:

GGUF quantized weights -> dequantize to f32 -> f32 GEMV -> result
                          ^ 4x bandwidth wasted ^ 3.2 GB RAM for dense cache

What rage-quant does differently

GGUF quantized weights -> quantized GEMV -> result
         reads 1.06 bytes/element instead of 4 bytes = 3.76x less DRAM traffic

No dequantization step. No f32 cache. 57% less RAM. 3x faster decode.
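
The 1.06 bytes/element figure falls straight out of the Q8_0 block layout: each block stores one f16 scale (2 bytes) plus 32 signed-byte quants, so 34 bytes cover 32 weights. A one-line sanity check:

fn main() {
    let bytes_per_elem = 34.0 / 32.0;         // Q8_0: 1.0625 bytes per weight
    let traffic_ratio = 4.0 / bytes_per_elem; // vs. 4-byte f32 weights
    println!("{bytes_per_elem:.4} B/elem -> {traffic_ratio:.2}x less DRAM traffic");
}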

Real benchmarks (not theoretical)

Tested on Qwen3-0.6B-Q8_0.gguf | CPU-only | AMD Ryzen 9 9900X | 12 threads

What we measured                      Before        After         Improvement
Decode latency per token              42 ms         14 ms         3.0x faster
From naive Rust implementation        120,000 ms    466 ms        257x faster
From sgemm baseline (standard BLAS)   74,758 ms     466 ms        160x faster
Peak RAM usage                        3.2 GB        1.38 GB       57% less
Throughput                            ~24 tok/s     67-71 tok/s   ~3x more

These numbers are real, measured, and reproducible. See docs/cpu-optimizations.md for the methodology.

Why is it faster? One insight.

On modern CPUs, LLM decode (batch=1) is DRAM bandwidth-limited, not compute-limited. By reading roughly 1 byte per weight (quantized) instead of 4 bytes (f32), you move 3.76x less data through the memory bus. The speedup follows directly.
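
To see why the floor moves, note that every decoded token streams essentially all weight bytes through DRAM once, so token latency is bounded below by weight bytes divided by bandwidth. A back-of-envelope sketch (the 0.6B parameter count matches the benchmark model above; the 60 GB/s bandwidth figure is an assumption, not a measurement):

fn main() {
    let params = 0.6e9;   // weights in the model
    let bandwidth = 60e9; // sustained DRAM bytes/s (assumed)
    let f32_floor = 1e3 * (params * 4.0) / bandwidth;    // ~40 ms/token
    let q8_floor = 1e3 * (params * 1.0625) / bandwidth;  // ~10.6 ms/token
    println!("f32 floor:  {f32_floor:.1} ms/token");
    println!("Q8_0 floor: {q8_floor:.1} ms/token");
}

The two floors differ by exactly the 3.76x traffic ratio, which is why the measured 3.0x decode speedup approaches it.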

Additionally: LLVM cannot auto-vectorize the i8-to-f32 widening path. It tries i8->i16->i32->f32, wasting registers. Manual vpmovsxbd (i8->i32 direct) via _mm256_cvtepi8_epi32 is required. This is why hand-written AVX2 intrinsics beat the compiler here.
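
The widening pattern is small enough to show inline. A minimal sketch of one 8-element step (illustrative, not the crate's actual kernel; the per-block f16 scale and the final horizontal reduction are left to the caller):

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// One step of the widening inner loop: 8 quants against 8 activations.
/// vpmovsxbd sign-extends i8 -> i32 in one instruction, vcvtdq2ps converts
/// to f32, and vfmadd accumulates -- no i8 -> i16 -> i32 detour.
///
/// # Safety
/// Requires AVX2+FMA; `q` must point at >= 8 bytes, `x` at >= 8 f32 values.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn widen_fma_step(q: *const i8, x: *const f32, acc: __m256) -> __m256 {
    let q8 = _mm_loadl_epi64(q as *const __m128i); // load 8 x i8
    let q32 = _mm256_cvtepi8_epi32(q8);            // vpmovsxbd: i8 -> i32
    let qf = _mm256_cvtepi32_ps(q32);              // vcvtdq2ps: i32 -> f32
    _mm256_fmadd_ps(qf, _mm256_loadu_ps(x), acc)   // acc += qf * x
}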

Quick start

Install

[dependencies]
rage-quant = "0.1"

Quantized dot product (the main feature)

use rage_quant::dot_q8_0_f32;

// quantized_weights: &[u8] -- raw Q8_0 blocks straight from a GGUF file
// input_vector: &[f32] -- your activation vector
// num_elements: total number of f32 elements represented

let result = dot_q8_0_f32(&quantized_weights, &input_vector, num_elements);
// Auto-detects AVX2+FMA at runtime; falls back to scalar on older CPUs
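
The runtime fallback above follows the standard is_x86_feature_detected! dispatch pattern. A minimal sketch of that shape (dot_avx2 and dot_scalar are hypothetical stand-ins, not the crate's internals):

fn dot(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            // SAFETY: the AVX2+FMA requirement was just verified at runtime.
            return unsafe { dot_avx2(a, b) };
        }
    }
    dot_scalar(a, b) // portable fallback, correct on any CPU
}

fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
    dot_scalar(a, b) // stand-in body; a real kernel uses _mm256_fmadd_ps here
}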

Other quantization formats

use rage_quant::{dot_q6_k_f32, dot_q4_k_f32};

let result_q6 = dot_q6_k_f32(&q6k_data, &input, num_elements);
let result_q4 = dot_q4_k_f32(&q4k_data, &input, num_elements);

Dequantization (when you need raw f32)

use rage_quant::{dequantize_q8_0_block, dequantize_q4_k_block, dequantize_q6_k_block};

let f32_values = dequantize_q8_0_block(&block_bytes).unwrap();  // -> 32 f32 values
let f32_q4k = dequantize_q4_k_block(&block_bytes).unwrap();     // -> 256 f32 values
let f32_q6k = dequantize_q6_k_block(&block_bytes).unwrap();     // -> 256 f32 values
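
For reference, a Q8_0 block is 34 bytes on disk: a little-endian f16 scale followed by 32 signed-byte quants. A minimal decoding sketch, assuming the scale is zero or a normal f16 (the crate's dequantize_q8_0_block is the validated version):

/// Decode one 34-byte Q8_0 block ([le f16 scale | 32 x i8 quants]) -> 32 f32s.
fn dequant_q8_0(block: &[u8; 34]) -> [f32; 32] {
    let bits = u16::from_le_bytes([block[0], block[1]]);
    let scale = if bits & 0x7fff == 0 {
        0.0 // +/- zero
    } else {
        // Minimal f16 -> f32 for normal values: rebias the exponent
        // (bias 15 -> bias 127) and widen the mantissa from 10 to 23 bits.
        let sign = (bits as u32 & 0x8000) << 16;
        let exp = ((bits as u32 >> 10) & 0x1f) + 112;
        let frac = (bits as u32 & 0x3ff) << 13;
        f32::from_bits(sign | (exp << 23) | frac)
    };
    let mut out = [0.0f32; 32];
    for (o, &q) in out.iter_mut().zip(&block[2..]) {
        *o = scale * (q as i8) as f32;
    }
    out
}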

Parallel GEMV and GEMM (f32, rayon-accelerated)

use rage_quant::{gemv_rows_f32, gemm_f32_row_major, dot_f32};

// Matrix-vector multiply (uses all cores via rayon)
let output = gemv_rows_f32(&matrix_data, num_cols, &vector);

// Matrix-matrix multiply (rayon + gemm crate backend)
let output = gemm_f32_row_major(m, n, k, &a, &b);

// Single dot product with AVX2+FMA
let dot = dot_f32(&a, &b);
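
gemv_rows_f32 exploits the fact that each output element is an independent dot product of one matrix row with the vector, so rows split cleanly across cores. A hedged sketch of that row-parallel shape (assumes rayon as a dependency; not the crate's exact code):

use rayon::prelude::*;

/// Row-major GEMV: y[i] = dot(row_i, x), one rayon task per chunk of rows,
/// with no synchronization in the hot path.
fn gemv_rows(matrix: &[f32], num_cols: usize, x: &[f32]) -> Vec<f32> {
    assert_eq!(x.len(), num_cols);
    matrix
        .par_chunks_exact(num_cols) // one chunk per matrix row
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}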

Supported quantization formats

Format   Block size   Bits/weight   Function          SIMD status
Q8_0     32           8             dot_q8_0_f32()    AVX2+FMA
Q6_K     256          6             dot_q6_k_f32()    Scalar (AVX2 planned)
Q4_K     256          4             dot_q4_k_f32()    Scalar (AVX2 planned)

How it compares

Feature                       rage-quant   llama.cpp (ggml)   candle (HuggingFace)   burn
Language                      Pure Rust    C/C++              Rust                   Rust
Quantized GEMV (skip deq.)    Yes          Yes (in C)         No                     No
AVX2 SIMD on quantized data   Yes          Yes (in C)         Partial                No
Q8_0 + Q6_K + Q4_K            Yes          Yes                Q8_0 only              No
GGUF-native blocks            Yes          Yes                Limited                No
Standalone reusable crate     Yes          No (monolithic)    No (framework)         No (framework)
Zero C/C++ dependency         Yes          No (is C/C++)      Has some C deps        Yes
Measured 3x decode speedup    Yes          Baseline           N/A                    N/A

Why not just use llama.cpp?

llama.cpp is excellent, but:

  1. It is C/C++ -- integrating into a Rust project requires unsafe FFI bindings
  2. It is monolithic -- you cannot extract just the quantized dot product without pulling the entire engine
  3. rage-quant is a standalone Rust crate -- cargo add rage-quant and you have the kernels

Why not candle or burn?

Neither implements quantized GEMV on GGUF blocks. They dequantize to f32 first, losing the bandwidth advantage that gives rage-quant its 3x speedup.

CPU optimization findings (T1-T9)

This crate embodies 9 validated CPU inference optimizations discovered during development. Full details with formulas, measurements, and methodology in docs/cpu-optimizations.md.

ID   What was optimized                   Measured result
T1   GEMV on quantized data (skip f32)    Decode 42 ms -> 18 ms = 2.3x
T2   Eliminate dense f32 weight caches    RSS 3.2 GB -> 1.38 GB = 57% less RAM
T3   AVX2 widening i8->f32 intrinsics     +18.8% on top of T1
T4   Memory-bound diagnosis               Proved DRAM is the bottleneck
T5   In-place residual addition           Marginal on small models
T6   Software prefetch hints              ~10-20% estimated on 14B+ models
T7   GEMV vs sgemm for m=1 decode         sgemm 180 ms vs GEMV 18 ms = 10x
T8   QKV fusion (decode-only path)        1.8x per-layer QKV compute
T9   Column tiling for GEMM prefill       5091 ms -> 3057 ms = 1.67x
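
T6 deserves a note: when the next rows' quantized blocks are known ahead of time, an explicit prefetch can overlap DRAM latency with compute. A sketch of the pattern (x86_64 only; the distance is illustrative and must be tuned per CPU and model):

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};

/// Hint the CPU to pull a cache line we will read shortly. The hint is
/// advisory: the hardware may ignore it, and it never faults.
#[cfg(target_arch = "x86_64")]
fn prefetch_ahead(data: &[u8], pos: usize) {
    const DISTANCE: usize = 512; // bytes ahead of the read cursor (illustrative)
    if let Some(byte) = data.get(pos + DISTANCE) {
        // SAFETY: prefetch has no side effects observable by safe Rust.
        unsafe { _mm_prefetch::<{ _MM_HINT_T0 }>(byte as *const u8 as *const i8) };
    }
}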

Hardware requirements

  • Minimum: Any x86_64 CPU (scalar fallback works everywhere)
  • Recommended: AVX2+FMA support (Intel Haswell 2013+ / AMD Zen 2017+)
  • Tested on: AMD Ryzen 9 9900X (Zen 5), DDR5, 12 threads

ARM NEON and AVX-512 support are planned.

License

Dual-licensed:

  • Open Source: AGPL-3.0 -- free for open-source, personal, and academic use
  • Commercial: Commercial License -- for proprietary/closed-source use

For commercial licensing: the@angriestboy.com

Author

Carlos Enrique Castro Lazaro

Contributing

Contributions welcome under AGPL-3.0. By submitting a PR, you agree to license your contribution under the same terms.

Priority areas:

  • AVX2 SIMD for Q6_K and Q4_K dot products
  • ARM NEON implementation
  • AVX-512 implementation
  • Benchmarks on different hardware (Intel, older AMD, server CPUs)
  • Additional quantization formats (Q5_K, Q2_K, IQ formats)