# Phase 15: Fused Q4K Quantized GEMV
**Version:** 1.0.0
**Status:** SPECIFICATION
**Date:** 2025-01-15
**Target:** 2x Ollama CPU throughput for APR format
**Scope:** Fused dequantization + dot product for Q4_K quantized inference
---
## Executive Summary
This specification defines the implementation of **fused Q4K dequant+dot kernels** for trueno, enabling APR format to achieve performance parity with GGUF/llama.cpp. The current APR CPU path achieves ~15 tok/s vs llama.cpp's ~42 tok/s on TinyLlama-1.1B Q4_0.
**Root Cause:** APR's current path separates dequantization from GEMV, causing:
1. Extra memory traffic (dequant to temp buffer, then dot)
2. Cache pollution from intermediate f32 expansion
3. Missed SIMD fusion opportunities
**Solution:** Fused Q4K kernels that dequantize directly into SIMD registers during dot product computation.
---
## 1. Problem Statement
### 1.1 Current Architecture Gap
| Engine | Approach | Throughput |
|---|---|---|
| llama.cpp | Fused Q4×Q8 SIMD (llamafile) | ~42 tok/s |
| candle | Inline AVX2 dequant+dot | ~10 tok/s |
| APR (current) | Separate dequant → f32 dot | ~15 tok/s |
| **APR (target)** | **Fused Q4×Q8 SIMD** | **≥42 tok/s** |
### 1.2 Memory Bandwidth Analysis
For a 4096×4096 Q4_K matmul (single token decode):
| Approach | Read traffic | Write traffic | Bandwidth efficiency |
|---|---|---|---|
| Separate dequant | 8 MB (Q4) + 64 MB (f32 temp) | 64 MB | 17% |
| Fused Q4×Q8 | 8 MB (Q4) + ~4 KB (Q8 input) | 0 | 100% |
**Key Insight:** The fused approach eliminates 128 MB of unnecessary memory traffic per matmul: the 64 MB f32 temporary is written once and then read back once.
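A back-of-envelope check of the figures above, for a 4096×4096 matrix. Note the table's "8 MB" counts only the nibble payload; including per-block scale metadata (144 bytes per 256 elements, per §3.1) the full Q4_K weights are 9 MiB:

```rust
// Traffic arithmetic for a single 4096×4096 Q4_K matmul (decode step)
const ELEMS: usize = 4096 * 4096;
const BLOCKS: usize = ELEMS / 256;

const Q4_PAYLOAD: usize = BLOCKS * 128; // packed nibbles only: 8 MiB
const Q4_TOTAL: usize = BLOCKS * 144;   // full Q4_K blocks: 9 MiB
const F32_TEMP: usize = ELEMS * 4;      // dequantized f32 temporary: 64 MiB

fn main() {
    println!("Q4 nibble payload: {} MiB", Q4_PAYLOAD / (1024 * 1024));
    println!("Q4_K full blocks:  {} MiB", Q4_TOTAL / (1024 * 1024));
    println!("f32 temporary:     {} MiB", F32_TEMP / (1024 * 1024));
}
```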
### 1.3 Scope
**IN SCOPE:**
- Fused Q4_K × Q8_K SIMD dot product (AVX2, AVX-512, NEON)
- Cache-blocked quantized GEMV
- Integration with realizar APR transformer
- Benchmarks against llama.cpp baseline
**OUT OF SCOPE:**
- GPU quantized kernels (see trueno-gpu Phase 16)
- Training quantization
- New quantization formats
---
## 2. Root Cause Analysis (5 Whys)
### Why #1: Why is APR 2.8x slower than llama.cpp on CPU?
**Answer:** Each token generation performs ~200 matmuls, and APR's matmul is 2.8x slower.
### Why #2: Why is APR matmul 2.8x slower?
**Answer:** APR dequantizes Q4_K weights to f32 before GEMV, while llama.cpp keeps data quantized.
### Why #3: Why does dequantization hurt performance?
**Answer:** Dequantizing 4096×4096 Q4_K (8 MB) produces 64 MB of f32, exceeding L3 cache.
### Why #4: Why not keep data quantized during GEMV?
**Answer:** trueno's current SIMD kernels only support f32 dot products, not Q4×Q8.
### Why #5: Why weren't fused Q4×Q8 kernels implemented?
**Answer:** Phase 2 focused on f32 matmul parity with NumPy. **This is the root cause.**
---
## 3. Solution Architecture
### 3.1 Q4_K Block Format
Q4_K quantization uses 256-element blocks, each split into 8 sub-blocks of 32 elements:
```
Block layout (256 elements = 8 sub-blocks × 32 elements):
- d: f16 (2 bytes) - block-wide scale
- dmin: f16 (2 bytes) - block-wide minimum scale
- scales: [u8; 12] (12 bytes) - packed 6-bit scale/min pairs, one per sub-block
- qs: [u8; 128] (128 bytes) - packed 4-bit quantized values
Total: 2 + 2 + 12 + 128 = 144 bytes per 256 elements = 4.5 bits/element
```
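The layout above can be sketched as Rust structs. This is a minimal illustration, not the final trueno API: f16 fields are held as raw `u16` bits, and the `BlockQ8K` companion (whose field layout this spec does not pin down elsewhere) is assumed to carry one f32 scale, 256 int8 values, and per-sub-block sums used later for the `dmin` correction term:

```rust
/// Q4_K weight block: 144 bytes covering 256 elements (4.5 bits/element).
/// f16 values are stored as raw bits to keep the sketch dependency-free.
#[repr(C)]
pub struct BlockQ4K {
    pub d: u16,           // f16 bits: block-wide scale
    pub dmin: u16,        // f16 bits: block-wide minimum scale
    pub scales: [u8; 12], // packed 6-bit sub-block scale/min pairs
    pub qs: [u8; 128],    // 256 4-bit values, two per byte
}

/// Assumed Q8_K companion block for quantized inputs.
#[repr(C)]
pub struct BlockQ8K {
    pub d: f32,          // block-wide scale
    pub qs: [i8; 256],   // quantized input values
    pub bsums: [i16; 8], // sum of qs per 32-element sub-block
}

fn main() {
    // The Q4_K block must be exactly 144 bytes for the 4.5 bits/elem claim
    println!("BlockQ4K: {} bytes", std::mem::size_of::<BlockQ4K>());
}
```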
### 3.2 Fused Q4K×Q8K Kernel Design
The fused kernel computes dot(Q4_K_weights, Q8_K_input) without intermediate dequantization:
```rust
use std::arch::x86_64::*;

/// Fused Q4K × Q8K dot product
///
/// Computes sum(dequant(q4) * dequant(q8)) directly in SIMD registers,
/// without materializing an f32 temporary.
#[target_feature(enable = "avx2")]
unsafe fn fused_q4k_q8k_dot_avx2(
    q4_block: &BlockQ4K, // 256 quantized weights
    q8_block: &BlockQ8K, // 256 quantized inputs
) -> f32 {
    // Step 1: Load block-wide scales (stay in registers),
    // folding in the Q8 block scale once up front
    let d = f16_to_f32(q4_block.d) * q8_block.d;
    let dmin = f16_to_f32(q4_block.dmin) * q8_block.d;
    // Step 2: Process 32 elements at a time (one sub-block)
    let mut acc = _mm256_setzero_ps();
    let mut min_acc = 0.0f32;
    for sb in 0..8 { // 8 sub-blocks × 32 elements = 256
        let offset = sb * 32;
        // Load 16 packed bytes = 32 nibbles; byte j holds element j
        // (low nibble) and element j+16 (high nibble)
        let q4_packed =
            _mm_loadu_si128(q4_block.qs.as_ptr().add(offset / 2) as *const __m128i);
        // Duplicate into both 128-bit lanes, shift the upper lane right
        // by 4, then mask: lane 0 = elements 0..16, lane 1 = 16..32
        let q4_bytes = _mm256_set_m128i(_mm_srli_epi16(q4_packed, 4), q4_packed);
        let q4_nib = _mm256_and_si256(q4_bytes, _mm256_set1_epi8(0x0F));
        // Load 32 int8 inputs (same element order as the nibbles)
        let q8_vec =
            _mm256_loadu_si256(q8_block.qs.as_ptr().add(offset) as *const __m256i);
        // u8 × i8 multiply-add into i16 pairs, then widen pairs to i32
        let prod16 = _mm256_maddubs_epi16(q4_nib, q8_vec);
        let prod32 = _mm256_madd_epi16(prod16, _mm256_set1_epi16(1));
        // Accumulate with the per-sub-block scale (unpacked from the
        // packed 6-bit `scales` field)
        let (scale, min) = get_scale_min(&q4_block.scales, sb);
        acc = _mm256_fmadd_ps(
            _mm256_cvtepi32_ps(prod32),
            _mm256_set1_ps(d * scale),
            acc,
        );
        // Q4_K is affine (w = d*scale*q - dmin*min): accumulate the min
        // term using the precomputed per-sub-block sum of q8 values
        min_acc += dmin * min * f32::from(q8_block.bsums[sb]);
    }
    // Horizontal sum, minus the accumulated min correction
    horizontal_sum_avx2(acc) - min_acc
}
```
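For validating the SIMD kernel (and as the `reference_q4k_q8k_dot` used in §8.1), a scalar reference helps. The sketch below is hypothetical and simplified: sub-block scales and mins are passed pre-decoded as f32 arrays rather than unpacked from the 6-bit `scales` field, and it follows the same nibble layout the kernel assumes (byte j of a sub-block holds elements j and j+16):

```rust
/// Scalar reference dot product over one 256-element block pair.
/// Dequantizes elementwise: w = d*scale*q4 - dmin*min, x = q8d*q8.
fn reference_q4k_q8k_dot(
    d: f32,
    dmin: f32,
    scales: &[f32; 8], // pre-decoded sub-block scales
    mins: &[f32; 8],   // pre-decoded sub-block mins
    q4: &[u8; 128],    // packed nibbles
    q8d: f32,          // Q8 block scale
    q8: &[i8; 256],    // quantized inputs
) -> f32 {
    let mut sum = 0.0f32;
    for sb in 0..8 {
        for j in 0..16 {
            let byte = q4[sb * 16 + j];
            // byte j packs element j (low nibble) and j+16 (high nibble)
            let w_lo = d * scales[sb] * f32::from(byte & 0x0F) - dmin * mins[sb];
            let w_hi = d * scales[sb] * f32::from(byte >> 4) - dmin * mins[sb];
            sum += w_lo * q8d * f32::from(q8[sb * 32 + j]);
            sum += w_hi * q8d * f32::from(q8[sb * 32 + 16 + j]);
        }
    }
    sum
}
```

With d = 1, dmin = 0, unit scales, every nibble = 1, and every q8 value = 1, the result is exactly 256.0 (two unit products per packed byte × 128 bytes), which makes a convenient smoke test.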
### 3.3 Cache-Blocked GEMV
For large matrices, apply L2 cache blocking:
```rust
use rayon::prelude::*;

/// Cache-blocked Q4K GEMV
///
/// y[M] = A[M×K] × x[K] where A is Q4K quantized
pub fn q4k_gemv_blocked(
    output: &mut [f32],   // M outputs
    weights: &[BlockQ4K], // M × (K/256) blocks, row-major
    input: &[BlockQ8K],   // K/256 blocks (quantized input)
    m: usize,
    k: usize,
) {
    const BLOCK_M: usize = 64; // Rows per L2 block
    let blocks_per_row = k / 256;
    assert_eq!(weights.len(), m * blocks_per_row);
    // Parallelize over L2-friendly row chunks: each chunk's weight rows
    // stream through cache once while the shared Q8 input stays resident
    output
        .par_chunks_mut(BLOCK_M)
        .enumerate()
        .for_each(|(chunk, out_rows)| {
            for (i, out) in out_rows.iter_mut().enumerate() {
                let row = chunk * BLOCK_M + i;
                let mut sum = 0.0f32;
                // Walk the K dimension one 256-element block at a time
                for k_block in 0..blocks_per_row {
                    // SAFETY: caller must verify AVX2 support at dispatch
                    sum += unsafe {
                        fused_q4k_q8k_dot_avx2(
                            &weights[row * blocks_per_row + k_block],
                            &input[k_block],
                        )
                    };
                }
                *out = sum;
            }
        });
}
```
---
## 4. Implementation Plan
### 4.1 File Structure
```
trueno/src/
├── quantize/
│ ├── mod.rs # Module exports
│ ├── formats.rs # Q4_K, Q5_K, Q6_K, Q8_K structs
│ ├── fused_avx2.rs # AVX2 fused kernels
│ ├── fused_avx512.rs # AVX-512 fused kernels
│ ├── fused_neon.rs # ARM NEON fused kernels
│ └── blocked_gemv.rs # Cache-blocked GEMV
```
### 4.2 API Design
```rust
// trueno/src/lib.rs
pub mod quantize;
// Public API
pub use quantize::{
BlockQ4K, BlockQ5K, BlockQ6K, BlockQ8K,
q4k_gemv, q4k_gemm,
quantize_f32_to_q4k, quantize_f32_to_q8k,
};
```
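The `quantize_f32_to_q8k` entry point above can be sketched as per-block symmetric absmax quantization. This is a simplified illustration, not the final signature: it returns the scale and values directly instead of a `BlockQ8K`, and omits the per-sub-block sums:

```rust
/// Quantize one 256-element f32 block to int8 with a single scale.
/// The largest magnitude maps to ±127; dequant is x ≈ d * q.
fn quantize_f32_to_q8k(x: &[f32; 256]) -> (f32, [i8; 256]) {
    // Find the absolute maximum to derive the scale
    let amax = x.iter().fold(0.0f32, |m, &v| m.max(v.abs()));
    let d = if amax == 0.0 { 0.0 } else { amax / 127.0 };
    let inv = if d == 0.0 { 0.0 } else { 1.0 / d };
    let mut qs = [0i8; 256];
    for (q, &v) in qs.iter_mut().zip(x.iter()) {
        // Round-to-nearest keeps the quantization error within d/2
        *q = (v * inv).round() as i8;
    }
    (d, qs)
}
```

Round-tripping through `d * q` then bounds the per-element error by `d / 2`, which is what hypothesis H4's 1e-3 relative-error budget ultimately rests on.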
### 4.3 Integration with realizar
```rust
// realizar/src/apr_transformer.rs
use trueno::quantize::{q4k_gemv, quantize_f32_to_q8k, BlockQ4K, BlockQ8K};
fn forward_ffn(&mut self, hidden: &mut [f32]) {
// Quantize input to Q8
let input_q8 = quantize_f32_to_q8k(hidden);
// Fused Q4K × Q8K GEMV (no intermediate f32)
q4k_gemv(
&mut self.up_out,
&self.up_weights_q4k,
&input_q8,
self.intermediate_dim,
self.hidden_dim,
);
// ... rest of FFN
}
```
---
## 5. Falsifiable Hypotheses
### H1: Fused Kernel Throughput
**Claim:** Fused Q4K×Q8K dot product achieves ≥2x throughput vs separate dequant+dot.
**Falsification:** Benchmark 10M dot products. If fused < 1.5x separate, hypothesis falsified.
**Prediction:** fused_throughput / separate_throughput ≥ 2.0
### H2: Memory Bandwidth Reduction
**Claim:** Fused approach reduces memory traffic by ≥80% for Q4K matmul.
**Falsification:** Profile with `perf stat`. If LLC-load-misses reduced <50%, hypothesis falsified.
**Prediction:** memory_traffic_fused / memory_traffic_separate ≤ 0.2
### H3: End-to-End Inference Speedup
**Claim:** APR with fused kernels achieves ≥2x Ollama throughput on TinyLlama-1.1B.
**Falsification:** Benchmark 100 tokens. If throughput < 1.5x Ollama, hypothesis falsified.
**Prediction:** apr_fused_throughput ≥ 2.0 × ollama_throughput
### H4: Numerical Accuracy
**Claim:** Fused kernel produces results within 1e-3 relative error of f32 reference.
**Falsification:** Compare 1000 random dot products. If max_rel_error > 1e-3, falsified.
**Prediction:** max_relative_error < 1e-3
---
## 6. Benchmark Targets
### 6.1 Micro-Benchmarks
| Operation | Current (ns) | Target (ns) | Speedup |
|---|---|---|---|
| Q4K dequant (256 elem) | 180 | N/A | - |
| f32 dot (256 elem) | 45 | N/A | - |
| Separate (dequant+dot) | 225 | N/A | baseline |
| **Fused Q4K×Q8K** | N/A | **<100** | **>2.2x** |
### 6.2 End-to-End Targets
| Model | Format | Current | Target | Speedup |
|---|---|---|---|---|
| TinyLlama-1.1B | APR Q4_K | 15 tok/s | **≥42 tok/s** | **≥2x** |
| Qwen2.5-0.5B | APR Q4_K | 21 tok/s | **≥60 tok/s** | **≥2x** |
| Phi-2 2.7B | APR Q4_K | 7 tok/s | **≥20 tok/s** | **≥2x** |
---
## 7. llamafile Reference Analysis
### 7.1 Key Techniques from llamafile sgemm
1. **Matrix Repacking**: Transpose and repack B matrix for sequential access
2. **4×1 Micro-kernel**: Process 4 output rows simultaneously
3. **L2 Cache Blocking**: 64×64 blocks fit in L2 (256KB)
4. **Fused Dequant**: Q4 nibble extraction inline with FMA
### 7.2 Adaptation for trueno
| llamafile | trueno |
|---|---|
| C++ with inline asm | Rust with `std::arch` intrinsics |
| Fixed block sizes | Configurable via `BlockConfig` |
| OpenMP parallelism | Rayon with `with_min_len()` |
| Platform-specific files | Feature-gated backends |
---
## 8. Testing Strategy
### 8.1 Unit Tests
```rust
#[test]
fn test_fused_q4k_q8k_dot_correctness() {
    let q4 = random_q4k_block();
    let q8 = random_q8k_block();
    let fused_result = unsafe { fused_q4k_q8k_dot_avx2(&q4, &q8) };
    let reference = reference_q4k_q8k_dot(&q4, &q8);
    // Absolute floor keeps the bound meaningful when the reference is near zero
    assert!((fused_result - reference).abs() <= 1e-3 * reference.abs().max(1e-6));
}
#[test]
fn test_fused_kernel_speedup() {
let (q4_blocks, q8_blocks) = setup_benchmark_data();
let separate_time = bench(|| separate_dequant_dot(&q4_blocks, &q8_blocks));
let fused_time = bench(|| fused_q4k_q8k_dot(&q4_blocks, &q8_blocks));
assert!(separate_time / fused_time >= 1.5, "Fused must be ≥1.5x faster");
}
```
### 8.2 Integration Tests
```rust
#[test]
fn test_apr_inference_with_fused_kernels() {
let model = load_apr_model("tinyllama-1.1b-q4k.apr");
let input = "Hello, world!";
let (output, throughput) = benchmark_inference(&model, input, 50);
assert!(throughput >= 30.0, "Must achieve ≥30 tok/s");
assert!(output.contains_coherent_text());
}
```
---
## 9. Revision History
| Version | Date | Changes |
|---|---|---|
| 1.0.0 | 2025-01-15 | Initial specification |
---
## 10. References
[1] Goto, K., & Van Geijn, R. A. (2008). "Anatomy of High-Performance Matrix Multiplication." ACM TOMS.
[2] Intel Corporation. (2024). "Intel 64 and IA-32 Architectures Optimization Reference Manual."
[3] Dettmers, T., et al. (2022). "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." NeurIPS.
[4] llamafile sgemm: https://github.com/Mozilla-Ocho/llamafile/blob/main/llamafile/sgemm.cpp
[5] Trueno Phase 2 Micro-Kernel: [phase2-microkernel.md](./phase2-microkernel.md)
---
*Specification for Trueno Phase 15 (2025-01-15)*
*Zero excuses. Zero defects. APR IS THE FORMAT.*