rage-quant

Run LLMs 3x faster on your CPU — no GPU required.

Pure Rust quantized GEMV kernels that perform matrix-vector multiplication directly on GGML quantized data (Q8_0, Q6_K, Q4_K) with AVX2+FMA SIMD. No dequantization. No wasted bandwidth. No GPU needed.


Why use rage-quant?

If you are building a Rust LLM inference pipeline and want CPU performance that approaches GPU speed on small-to-medium models, this crate gives you the core computation kernels.

The problem with standard inference

Mainstream Rust frameworks (candle, burn) do this:

GGUF quantized weights -> dequantize to f32 -> f32 GEMV -> result
                          ^ 4x bandwidth wasted ^ 3.2 GB RAM for dense cache

What rage-quant does differently

GGUF quantized weights -> quantized GEMV -> result
         reads 1.06 bytes/element instead of 4 bytes = 3.76x less DRAM traffic

No dequantization step. No f32 cache. 57% less RAM. 3x faster decode.
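
The 1.06 bytes/element figure falls straight out of the Q8_0 block layout: each block stores one f16 scale (2 bytes) plus 32 signed-byte quants, so 34 bytes cover 32 weights. A one-line sanity check:

fn main() {
    let bytes_per_elem = 34.0 / 32.0;         // Q8_0: 1.0625 bytes per weight
    let traffic_ratio = 4.0 / bytes_per_elem; // vs. 4-byte f32 weights
    println!("{bytes_per_elem:.4} B/elem -> {traffic_ratio:.2}x less DRAM traffic");
}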

Real benchmarks (not theoretical)

Tested on Qwen3-0.6B-Q8_0.gguf | CPU-only | AMD Ryzen 9 9900X | 12 threads

What we measured                      Before        After         Improvement
Decode latency per token              42 ms         14 ms         3.0x faster
From naive Rust implementation        120,000 ms    466 ms        257x faster
From sgemm baseline (standard BLAS)   74,758 ms     466 ms        160x faster
Peak RAM usage                        3.2 GB        1.38 GB       57% less
Throughput                            ~24 tok/s     67-71 tok/s   ~3x more

These numbers are real, measured, and reproducible. See docs/cpu-optimizations.md for the methodology.

Why is it faster? One insight.

On modern CPUs, LLM decode (batch=1) is DRAM bandwidth-limited, not compute-limited. By reading roughly 1 byte per weight (quantized) instead of 4 bytes (f32), you move 3.76x less data through the memory bus. The speedup follows directly.
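
To see why the floor moves, note that every decoded token streams essentially all weight bytes through DRAM once, so token latency is bounded below by weight bytes divided by bandwidth. A back-of-envelope sketch (the 0.6B parameter count matches the benchmark model above; the 60 GB/s bandwidth figure is an assumption, not a measurement):

fn main() {
    let params = 0.6e9;   // weights in the model
    let bandwidth = 60e9; // sustained DRAM bytes/s (assumed)
    let f32_floor = 1e3 * (params * 4.0) / bandwidth;    // ~40 ms/token
    let q8_floor = 1e3 * (params * 1.0625) / bandwidth;  // ~10.6 ms/token
    println!("f32 floor:  {f32_floor:.1} ms/token");
    println!("Q8_0 floor: {q8_floor:.1} ms/token");
}

The two floors differ by exactly the 3.76x traffic ratio, which is why the measured 3.0x decode speedup approaches it.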

Additionally: LLVM cannot auto-vectorize the i8-to-f32 widening path. It tries i8->i16->i32->f32, wasting registers. Manual vpmovsxbd (i8->i32 direct) via _mm256_cvtepi8_epi32 is required. This is why hand-written AVX2 intrinsics beat the compiler here.
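
The widening pattern is small enough to show inline. A minimal sketch of one 8-element step (illustrative, not the crate's actual kernel; the per-block f16 scale and the final horizontal reduction are left to the caller):

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// One step of the widening inner loop: 8 quants against 8 activations.
/// vpmovsxbd sign-extends i8 -> i32 in one instruction, vcvtdq2ps converts
/// to f32, and vfmadd accumulates -- no i8 -> i16 -> i32 detour.
///
/// # Safety
/// Requires AVX2+FMA; `q` must point at >= 8 bytes, `x` at >= 8 f32 values.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn widen_fma_step(q: *const i8, x: *const f32, acc: __m256) -> __m256 {
    let q8 = _mm_loadl_epi64(q as *const __m128i); // load 8 x i8
    let q32 = _mm256_cvtepi8_epi32(q8);            // vpmovsxbd: i8 -> i32
    let qf = _mm256_cvtepi32_ps(q32);              // vcvtdq2ps: i32 -> f32
    _mm256_fmadd_ps(qf, _mm256_loadu_ps(x), acc)   // acc += qf * x
}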

Quick start

Install

[dependencies]
rage-quant = "0.1"

Quantized dot product (the main feature)

use rage_quant::dot_q8_0_f32;

// quantized_weights: &[u8] -- raw Q8_0 blocks straight from a GGUF file
// input_vector: &[f32] -- your activation vector
// num_elements: total number of f32 elements represented

let result = dot_q8_0_f32(&quantized_weights, &input_vector, num_elements);
// Auto-detects AVX2+FMA at runtime; falls back to scalar on older CPUs
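
The runtime fallback above follows the standard is_x86_feature_detected! dispatch pattern. A minimal sketch of that shape (dot_avx2 and dot_scalar are hypothetical stand-ins, not the crate's internals):

fn dot(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            // SAFETY: the AVX2+FMA requirement was just verified at runtime.
            return unsafe { dot_avx2(a, b) };
        }
    }
    dot_scalar(a, b) // portable fallback, correct on any CPU
}

fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
    dot_scalar(a, b) // stand-in body; a real kernel uses _mm256_fmadd_ps here
}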

Other quantization formats

use rage_quant::{dot_q6_k_f32, dot_q4_k_f32};

let result_q6 = dot_q6_k_f32(&q6k_data, &input, num_elements);
let result_q4 = dot_q4_k_f32(&q4k_data, &input, num_elements);

Dequantization (when you need raw f32)

use rage_quant::{dequantize_q8_0_block, dequantize_q4_k_block, dequantize_q6_k_block};

let f32_values = dequantize_q8_0_block(&block_bytes).unwrap();  // -> 32 f32 values
let f32_q4k = dequantize_q4_k_block(&block_bytes).unwrap();     // -> 256 f32 values
let f32_q6k = dequantize_q6_k_block(&block_bytes).unwrap();     // -> 256 f32 values
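
For reference, a Q8_0 block is 34 bytes on disk: a little-endian f16 scale followed by 32 signed-byte quants. A minimal decoding sketch, assuming the scale is zero or a normal f16 (the crate's dequantize_q8_0_block is the validated version):

/// Decode one 34-byte Q8_0 block ([le f16 scale | 32 x i8 quants]) -> 32 f32s.
fn dequant_q8_0(block: &[u8; 34]) -> [f32; 32] {
    let bits = u16::from_le_bytes([block[0], block[1]]);
    let scale = if bits & 0x7fff == 0 {
        0.0 // +/- zero
    } else {
        // Minimal f16 -> f32 for normal values: rebias the exponent
        // (bias 15 -> bias 127) and widen the mantissa from 10 to 23 bits.
        let sign = (bits as u32 & 0x8000) << 16;
        let exp = ((bits as u32 >> 10) & 0x1f) + 112;
        let frac = (bits as u32 & 0x3ff) << 13;
        f32::from_bits(sign | (exp << 23) | frac)
    };
    let mut out = [0.0f32; 32];
    for (o, &q) in out.iter_mut().zip(&block[2..]) {
        *o = scale * (q as i8) as f32;
    }
    out
}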

Parallel GEMV and GEMM (f32, rayon-accelerated)

use rage_quant::{gemv_rows_f32, gemm_f32_row_major, dot_f32};

// Matrix-vector multiply (uses all cores via rayon)
let output = gemv_rows_f32(&matrix_data, num_cols, &vector);

// Matrix-matrix multiply (rayon + gemm crate backend)
let output = gemm_f32_row_major(m, n, k, &a, &b);

// Single dot product with AVX2+FMA
let dot = dot_f32(&a, &b);
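
gemv_rows_f32 exploits the fact that each output element is an independent dot product of one matrix row with the vector, so rows split cleanly across cores. A hedged sketch of that row-parallel shape (assumes rayon as a dependency; not the crate's exact code):

use rayon::prelude::*;

/// Row-major GEMV: y[i] = dot(row_i, x), one rayon task per chunk of rows,
/// with no synchronization in the hot path.
fn gemv_rows(matrix: &[f32], num_cols: usize, x: &[f32]) -> Vec<f32> {
    assert_eq!(x.len(), num_cols);
    matrix
        .par_chunks_exact(num_cols) // one chunk per matrix row
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}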

Supported quantization formats

Format   Block size   Bits/weight   Function          SIMD status
Q8_0     32           8             dot_q8_0_f32()    AVX2+FMA
Q6_K     256          6             dot_q6_k_f32()    Scalar (AVX2 planned)
Q4_K     256          4             dot_q4_k_f32()    Scalar (AVX2 planned)

How it compares

Feature                       rage-quant   llama.cpp (ggml)   candle (HuggingFace)   burn
Language                      Pure Rust    C/C++              Rust                   Rust
Quantized GEMV (skip deq.)    Yes          Yes (in C)         No                     No
AVX2 SIMD on quantized data   Yes          Yes (in C)         Partial                No
Q8_0 + Q6_K + Q4_K            Yes          Yes                Q8_0 only              No
GGUF-native blocks            Yes          Yes                Limited                No
Standalone reusable crate     Yes          No (monolithic)    No (framework)         No (framework)
Zero C/C++ dependency         Yes          No (is C/C++)      Has some C deps        Yes
Measured 3x decode speedup    Yes          Baseline           N/A                    N/A

Why not just use llama.cpp?

llama.cpp is excellent, but:

  1. It is C/C++ -- integrating into a Rust project requires unsafe FFI bindings
  2. It is monolithic -- you cannot extract just the quantized dot product without pulling the entire engine
  3. rage-quant is a standalone Rust crate -- cargo add rage-quant and you have the kernels

Why not candle or burn?

Neither implements quantized GEMV on GGUF blocks. They dequantize to f32 first, losing the bandwidth advantage that gives rage-quant its 3x speedup.

CPU optimization findings (T1-T9)

This crate embodies 9 validated CPU inference optimizations discovered during development. Full details with formulas, measurements, and methodology in docs/cpu-optimizations.md.

ID   What was optimized                   Measured result
T1   GEMV on quantized data (skip f32)    Decode 42 ms -> 18 ms = 2.3x
T2   Eliminate dense f32 weight caches    RSS 3.2 GB -> 1.38 GB = 57% less RAM
T3   AVX2 widening i8->f32 intrinsics     +18.8% on top of T1
T4   Memory-bound diagnosis               Proved DRAM is the bottleneck
T5   In-place residual addition           Marginal on small models
T6   Software prefetch hints              ~10-20% estimated on 14B+ models
T7   GEMV vs sgemm for m=1 decode         sgemm 180 ms vs GEMV 18 ms = 10x
T8   QKV fusion (decode-only path)        1.8x per-layer QKV compute
T9   Column tiling for GEMM prefill       5091 ms -> 3057 ms = 1.67x
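
T6 deserves a note: when the next rows' quantized blocks are known ahead of time, an explicit prefetch can overlap DRAM latency with compute. A sketch of the pattern (x86_64 only; the distance is illustrative and must be tuned per CPU and model):

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};

/// Hint the CPU to pull a cache line we will read shortly. The hint is
/// advisory: the hardware may ignore it, and it never faults.
#[cfg(target_arch = "x86_64")]
fn prefetch_ahead(data: &[u8], pos: usize) {
    const DISTANCE: usize = 512; // bytes ahead of the read cursor (illustrative)
    if let Some(byte) = data.get(pos + DISTANCE) {
        // SAFETY: prefetch has no side effects observable by safe Rust.
        unsafe { _mm_prefetch::<{ _MM_HINT_T0 }>(byte as *const u8 as *const i8) };
    }
}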

Hardware requirements

  • Minimum: Any x86_64 CPU (scalar fallback works everywhere)
  • Recommended: AVX2+FMA support (Intel Haswell 2013+ / AMD Zen 2017+)
  • Tested on: AMD Ryzen 9 9900X (Zen 5), DDR5, 12 threads

ARM NEON and AVX-512 support are planned.

License

Dual-licensed:

  • Open Source: AGPL-3.0 -- free for open-source, personal, and academic use
  • Commercial: Commercial License -- for proprietary/closed-source use

For commercial licensing: the@angriestboy.com

Author

Carlos Enrique Castro Lazaro

Contributing

Contributions welcome under AGPL-3.0. By submitting a PR, you agree to license your contribution under the same terms.

Priority areas:

  • AVX2 SIMD for Q6_K and Q4_K dot products
  • ARM NEON implementation
  • AVX-512 implementation
  • Benchmarks on different hardware (Intel, older AMD, server CPUs)
  • Additional quantization formats (Q5_K, Q2_K, IQ formats)