# rage-quant
**Run LLMs 3x faster on your CPU — no GPU required.**
Pure Rust quantized GEMV kernels that perform matrix-vector multiplication directly on GGML quantized data (Q8_0, Q6_K, Q4_K) with AVX2+FMA SIMD. No dequantization. No wasted bandwidth. No GPU needed.
---
## Why use rage-quant?
If you are building a Rust LLM inference pipeline and want CPU performance that approaches GPU speed on small-to-medium models, this crate gives you the core computation kernels.
### The problem with standard inference
Mainstream Rust inference frameworks (candle, burn) take this path:
```
GGUF quantized weights -> dequantize to f32 -> f32 GEMV -> result
                          ^                    ^
                          4x DRAM bandwidth    3.2 GB RAM for the
                          wasted               dense f32 cache
```
### What rage-quant does differently
```
GGUF quantized weights -> quantized GEMV -> result
reads 1.06 bytes/element instead of 4 bytes = 3.76x less DRAM traffic
```
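Where 1.06 comes from: the GGML Q8_0 block packs a 2-byte f16 scale plus 32 int8 quants into 34 bytes. A quick check of the arithmetic:

```rust
// Q8_0: 34 bytes per 32-weight block.
const BYTES_PER_ELEMENT: f32 = 34.0 / 32.0;             // 1.0625
const TRAFFIC_REDUCTION: f32 = 4.0 / BYTES_PER_ELEMENT; // ~3.76x vs f32
```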
**No dequantization step. No f32 cache. 57% less RAM. 3x faster decode.**
## Real benchmarks (not theoretical)
Tested on **Qwen3-0.6B-Q8_0.gguf** | CPU-only | AMD Ryzen 9 9900X | 12 threads
| Metric | Before | After | Improvement |
|---|---|---|---|
| Decode latency per token | 42 ms | 14 ms | **3.0x faster** |
| From naive Rust implementation | 120,000 ms | 466 ms | **257x faster** |
| From sgemm baseline (standard BLAS) | 74,758 ms | 466 ms | **160x faster** |
| Peak RAM usage | 3.2 GB | 1.38 GB | **57% less** |
| Throughput | ~24 tok/s | 67-71 tok/s | **~3x more** |
These numbers are real, measured, reproducible. See [docs/cpu-optimizations.md](docs/cpu-optimizations.md) for methodology.
### Why is it faster? One insight.
On modern CPUs, LLM decode (batch=1) is **DRAM bandwidth-limited**, not compute-limited. By reading ~1.06 bytes per element (Q8_0) instead of 4 bytes (f32), you move 3.76x less data through the memory bus. The speedup follows directly.
Additionally: **LLVM cannot auto-vectorize the i8-to-f32 widening path.** It tries i8->i16->i32->f32, wasting registers. Manual `vpmovsxbd` (i8->i32 direct) via `_mm256_cvtepi8_epi32` is required. This is why hand-written AVX2 intrinsics beat the compiler here.
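A minimal sketch of that widening step, assuming AVX2+FMA (illustrative only, not the crate's exact kernel; the per-block Q8_0 scale is omitted):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// One 8-lane step of the i8 -> f32 dot product (sketch).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn fma_q8_step(quants: *const i8, activations: *const f32, acc: __m256) -> __m256 {
    // Load 8 signed bytes into the low 64 bits of an XMM register.
    let q8 = _mm_loadl_epi64(quants as *const __m128i);
    // vpmovsxbd: sign-extend i8 -> i32 in one instruction,
    // skipping the i8 -> i16 -> i32 ladder the auto-vectorizer emits.
    let q32 = _mm256_cvtepi8_epi32(q8);
    // i32 -> f32, then a fused multiply-add against the activations.
    _mm256_fmadd_ps(_mm256_cvtepi32_ps(q32), _mm256_loadu_ps(activations), acc)
}
```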
## Quick start
### Install
```toml
[dependencies]
rage-quant = "0.1"
```
### Quantized dot product (the main feature)
```rust
use rage_quant::dot_q8_0_f32;
// quantized_weights: &[u8] -- raw Q8_0 blocks straight from a GGUF file
// input_vector: &[f32] -- your activation vector
// num_elements: total number of f32 elements represented
let result = dot_q8_0_f32(&quantized_weights, &input_vector, num_elements);
// Auto-detects AVX2+FMA at runtime; falls back to scalar on older CPUs
```
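The slice length is fully determined by the Q8_0 layout. Continuing the example above, a sanity check (assuming `num_elements` is a multiple of the 32-weight block size):

```rust
// Each GGML Q8_0 block is 34 bytes (2-byte f16 scale + 32 int8 quants) for 32 weights.
assert_eq!(quantized_weights.len(), num_elements / 32 * 34);
assert_eq!(input_vector.len(), num_elements);
```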
### Other quantization formats
```rust
use rage_quant::{dot_q6_k_f32, dot_q4_k_f32};
let result_q6 = dot_q6_k_f32(&q6k_data, &input, num_elements);
let result_q4 = dot_q4_k_f32(&q4k_data, &input, num_elements);
```
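Both K-quant formats use 256-weight super-blocks, so the expected slice lengths differ from Q8_0. The block sizes below are the standard GGML layouts, and `num_elements` is assumed to be a multiple of 256:

```rust
// Standard GGML super-block sizes per 256 weights: Q4_K = 144 bytes, Q6_K = 210 bytes.
assert_eq!(q4k_data.len(), num_elements / 256 * 144); // ~0.56 bytes/element
assert_eq!(q6k_data.len(), num_elements / 256 * 210); // ~0.82 bytes/element
```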
### Dequantization (when you need raw f32)
```rust
use rage_quant::{dequantize_q8_0_block, dequantize_q4_k_block, dequantize_q6_k_block};
let f32_values = dequantize_q8_0_block(&block_bytes).unwrap(); // -> 32 f32 values
let f32_q4k = dequantize_q4_k_block(&block_bytes).unwrap(); // -> 256 f32 values
let f32_q6k = dequantize_q6_k_block(&block_bytes).unwrap(); // -> 256 f32 values
```
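To dequantize a whole Q8_0 tensor rather than one block, walking the buffer in 34-byte strides is enough. A sketch (`dequantize_q8_0_tensor` is a hypothetical helper, not a crate API):

```rust
use rage_quant::dequantize_q8_0_block;

// Hypothetical helper: dequantize an entire Q8_0 tensor block by block.
fn dequantize_q8_0_tensor(data: &[u8]) -> Vec<f32> {
    data.chunks_exact(34) // one GGML Q8_0 block = 34 bytes = 32 weights
        .flat_map(|block| dequantize_q8_0_block(block).unwrap())
        .collect()
}
```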
### Parallel GEMV and GEMM (f32, rayon-accelerated)
```rust
use rage_quant::{gemv_rows_f32, gemm_f32_row_major, dot_f32};
// Matrix-vector multiply (uses all cores via rayon)
let output = gemv_rows_f32(&matrix_data, num_cols, &vector);
// Matrix-matrix multiply (rayon + gemm crate backend)
let output = gemm_f32_row_major(m, n, k, &a, &b);
// Single dot product with AVX2+FMA
let dot = dot_f32(&a, &b);
```
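For reference, the shapes these calls expect. Row-major layout is an assumption inferred from the function names, so treat this as a sketch:

```rust
use rage_quant::gemv_rows_f32;

// A 3x4 row-major matrix times a length-4 vector yields a length-3 output.
let num_cols = 4;
let matrix_data: Vec<f32> = (0..3 * num_cols).map(|i| i as f32).collect();
let vector = vec![1.0f32; num_cols];
let output = gemv_rows_f32(&matrix_data, num_cols, &vector);
assert_eq!(output.len(), 3);
```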
## Supported quantization formats
| Format | Block size (weights) | Bits per weight | Dot kernel | SIMD |
|---|---|---|---|---|
| Q8_0 | 32 | 8 | `dot_q8_0_f32()` | **AVX2+FMA** |
| Q6_K | 256 | 6 | `dot_q6_k_f32()` | Scalar (AVX2 planned) |
| Q4_K | 256 | 4 | `dot_q4_k_f32()` | Scalar (AVX2 planned) |
## How it compares
| Feature | rage-quant | llama.cpp | candle | burn |
|---|---|---|---|---|
| Language | **Pure Rust** | C/C++ | Rust | Rust |
| Quantized GEMV (skip dequantization) | **Yes** | Yes (in C) | No | No |
| AVX2 SIMD on quantized data | **Yes** | Yes (in C) | Partial | No |
| Q8_0 + Q6_K + Q4_K | **Yes** | Yes | Q8_0 only | No |
| GGUF-native blocks | **Yes** | Yes | Limited | No |
| Standalone reusable crate | **Yes** | No (monolithic) | No (framework) | No (framework) |
| Zero C/C++ dependency | **Yes** | No (is C/C++) | Has some C deps | Yes |
| Measured 3x decode speedup | **Yes** | Baseline | N/A | N/A |
### Why not just use llama.cpp?
llama.cpp is excellent, but:
1. **It is C/C++** -- integrating into a Rust project requires unsafe FFI bindings
2. **It is monolithic** -- you cannot extract just the quantized dot product without pulling in the entire engine
3. **rage-quant is a standalone Rust crate** -- `cargo add rage-quant` and you have the kernels
### Why not candle or burn?
Neither implements quantized GEMV on GGUF blocks. They dequantize to f32 first, losing the bandwidth advantage that gives rage-quant its 3x speedup.
## CPU optimization findings (T1-T9)
This crate embodies 9 validated CPU inference optimizations discovered during development. Full details with formulas, measurements, and methodology in [docs/cpu-optimizations.md](docs/cpu-optimizations.md).
| # | Optimization | Measured effect |
|---|---|---|
| T1 | GEMV on quantized data (skip f32) | decode 42 ms -> 18 ms = **2.3x** |
| T2 | Eliminate dense f32 weight caches | RSS 3.2 GB -> 1.38 GB = **-57% RAM** |
| T3 | AVX2 widening i8->f32 intrinsics | **+18.8%** on top of T1 |
| T4 | Memory-bound diagnosis | Proved DRAM is the bottleneck |
| T5 | In-place residual addition | Marginal on small models |
| T6 | Software prefetch hints (sketch below) | ~10-20% estimated on 14B+ models |
| T7 | GEMV vs sgemm for m=1 decode | sgemm 180 ms vs GEMV 18 ms = **10x** |
| T8 | QKV fusion (decode-only path) | **1.8x** per-layer QKV compute |
| T9 | Column-tiling for GEMM prefill | 5091 ms -> 3057 ms = **1.67x** |
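For a taste of T6, a software prefetch hint in Rust looks like this (a generic sketch, not the crate's internal code):

```rust
#[cfg(target_arch = "x86_64")]
#[inline(always)]
fn prefetch(ptr: *const u8) {
    use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    // Prefetch is only a hint and never faults, so any address is fine to pass.
    unsafe { _mm_prefetch::<_MM_HINT_T0>(ptr.cast::<i8>()) };
}
```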
## Hardware requirements
- **Minimum**: Any x86_64 CPU (scalar fallback works everywhere)
- **Recommended**: AVX2+FMA support (Intel Haswell 2013+ / AMD Zen 2017+)
- **Tested on**: AMD Ryzen 9 9900X (Zen 5), DDR5, 12 threads
ARM NEON and AVX-512 support are planned.
## License
Dual-licensed:
- **Open Source**: [AGPL-3.0](LICENSE) -- free for open-source, personal, and academic use
- **Commercial**: [Commercial License](LICENSE-COMMERCIAL.md) -- for proprietary/closed-source use
For commercial licensing: the@angriestboy.com
## Author
**Carlos Enrique Castro Lazaro**
- GitHub: [@OnCeUponTry](https://github.com/OnCeUponTry)
- Website: [angriestboy.com](https://www.angriestboy.com)
## Contributing
Contributions welcome under AGPL-3.0. By submitting a PR, you agree to license your contribution under the same terms.
**Priority areas:**
- AVX2 SIMD for Q6_K and Q4_K dot products
- ARM NEON implementation
- AVX-512 implementation
- Benchmarks on different hardware (Intel, older AMD, server CPUs)
- Additional quantization formats (Q5_K, Q2_K, IQ formats)