Quantized weight storage and on-the-fly dequantizing dot-products.
The whole point of running huge models on small devices is to keep weights quantized in memory and dequantize one block at a time inside the dot product, instead of expanding the whole weight matrix to F32 (which costs roughly 7× the RAM of Q4_0: 4 bytes per weight versus 0.5625). This module is the Phase-0 spike proving that mechanic: a Q4_0 matrix-vector product computed straight from the packed blocks matches the F32 reference while storing only 0.5625 bytes per weight.
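The footprint arithmetic behind that claim can be checked with a small sketch. The helper names below are hypothetical and not part of this module's API; the 18-byte and 34-byte block sizes are those of the Q4_0 and Q8_0 layouts this module uses.

```rust
// Hypothetical helpers for illustration only; not this module's API.
const QK: usize = 32; // weights per block
const Q4_0_BLOCK_BYTES: usize = 18; // 2-byte f16 scale + 16 packed nibble bytes
const Q8_0_BLOCK_BYTES: usize = 34; // 2-byte f16 scale + 32 i8 quants

/// Bytes needed to hold a rows x cols matrix as plain f32.
fn f32_matrix_bytes(rows: usize, cols: usize) -> usize {
    rows * cols * std::mem::size_of::<f32>()
}

/// Bytes needed when every row is stored as packed blocks of `block_bytes`.
fn quantized_matrix_bytes(rows: usize, cols: usize, block_bytes: usize) -> usize {
    assert!(cols % QK == 0, "row length must be a multiple of QK");
    rows * (cols / QK) * block_bytes
}
```

For a 4096 × 4096 matrix this gives 67,108,864 bytes as F32 versus 9,437,184 bytes as Q4_0, a ratio of about 7.1.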
Block layouts follow the canonical ggml conventions (so the same code reads real GGUF files in Phase 1):
- Q4_0: 32 weights per block, 18 bytes = f16 scale + 16 packed nibble bytes. Byte j holds element j (low nibble) and element j+16 (high nibble).
- Q8_0: 32 weights per block, 34 bytes = f16 scale + 32 × i8.
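A minimal sketch of how a dot product can be taken directly from the Q4_0 layout described above, without materializing the row as F32. The `sketch_*` functions and `f16_bits_to_f32` are hypothetical names, not this module's functions. Two details are assumptions taken from the ggml Q4_0 convention rather than from the text above: the scale is stored first as a little-endian f16, and each 4-bit value is offset by 8 before scaling.

```rust
const QK: usize = 32;
const Q4_0_BLOCK_BYTES: usize = 18;

/// Decode IEEE 754 half-precision bits to f32 (hypothetical helper).
fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = if h & 0x8000 != 0 { -1.0f32 } else { 1.0 };
    let exp = ((h >> 10) & 0x1F) as i32;
    let frac = (h & 0x03FF) as f32;
    match exp {
        // Zero and subnormals: frac * 2^-24.
        0 => sign * frac * 2.0f32.powi(-24),
        // Infinity or NaN.
        0x1F => {
            if frac == 0.0 { sign * f32::INFINITY } else { f32::NAN }
        }
        // Normal numbers: (1 + frac / 2^10) * 2^(exp - 15).
        _ => sign * (1.0 + frac / 1024.0) * 2.0f32.powi(exp - 15),
    }
}

/// Dot one packed Q4_0 block with QK activations.
/// Assumes the ggml convention: value = (nibble - 8) * scale.
fn sketch_dot_q4_0_block(block: &[u8], x: &[f32]) -> f32 {
    assert_eq!(block.len(), Q4_0_BLOCK_BYTES);
    assert_eq!(x.len(), QK);
    let d = f16_bits_to_f32(u16::from_le_bytes([block[0], block[1]]));
    let mut acc = 0.0f32;
    for j in 0..QK / 2 {
        let byte = block[2 + j];
        let lo = (byte & 0x0F) as i32 - 8; // element j
        let hi = (byte >> 4) as i32 - 8; // element j + 16
        acc += lo as f32 * x[j] + hi as f32 * x[j + QK / 2];
    }
    // The scale is applied once per block, not once per weight.
    d * acc
}

/// Dot a full packed row by walking it one block at a time.
fn sketch_dot_q4_0_row(row: &[u8], x: &[f32]) -> f32 {
    assert_eq!(row.len() % Q4_0_BLOCK_BYTES, 0);
    assert_eq!(x.len(), row.len() / Q4_0_BLOCK_BYTES * QK);
    row.chunks_exact(Q4_0_BLOCK_BYTES)
        .zip(x.chunks_exact(QK))
        .map(|(block, xs)| sketch_dot_q4_0_block(block, xs))
        .sum()
}
```

Only one block's worth of nibbles is decoded at any moment, so the working set stays at 18 bytes of weights plus 32 activations regardless of row length.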
Constants§
- Q4_0_BLOCK_BYTES - Bytes per Q4_0 block: 2 (f16 scale) + 16 (packed nibbles).
- Q4_K_BLOCK_BYTES
- Q5_K_BLOCK_BYTES
- Q6_K_BLOCK_BYTES
- Q8_0_BLOCK_BYTES - Bytes per Q8_0 block: 2 (f16 scale) + 32 (i8 quants).
- QK - Weights per quantized block (both Q4_0 and Q8_0 use 32).
- QK_K
Functions§
- dequantize_q4_0_block - Dequantize one Q4_0 block into out (length QK).
- dot_q4_0_block_f32 - Dot product of one Q4_0 block with a length-QK f32 activation slice.
- dot_q4_0_row_f32 - Dot product of a full Q4_0-quantized weight row.
- dot_q4_k_row_f32 - Dot product of a full Q4_K-quantized weight row with an f32 activation vector.
- dot_q5_k_row_f32 - Dot product of a full Q5_K-quantized weight row with an f32 activation vector.
- dot_q6_k_row_f32 - Dot product of a full Q6_K-quantized weight row with an f32 activation vector.
- dot_q8_0_block_f32 - Dot product of a Q8_0 block with an f32 activation slice.
- dot_q8_0_row_f32 - Dot product of a full Q8_0-quantized weight row with an f32 activation vector.
- dot_q8_0_row_i8_scalar - Scalar fallback for dot_q8_0_row_sdot (non-dotprod aarch64 or other platforms). Uses i32 integer arithmetic with no widening chain; still faster than the f32 path on targets without AVX2, and correct everywhere.
- quantize_q4_0_block - Quantize a length-QK slice of f32 into one Q4_0 block (ggml convention).
- quantize_q4_0_row - Quantize a full f32 weight row (k % QK == 0) into packed Q4_0 blocks.
- quantize_q8_0_block - Quantize a length-QK (32) slice of f32 into one Q8_0 block (ggml convention).
- quantize_row_to_i8 - Quantize a row of f32 activations to i8 (symmetric Q8_0 style). Returns (quantized_bytes, scale) where scale = max_abs / 127. The caller multiplies each block’s weight scale by this scale to recover f32.
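The activation-quantization step described for quantize_row_to_i8 can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: the `sketch_*` names and signatures are hypothetical, rounding to nearest and returning a zero scale for an all-zero row are guesses, and the weight scale is passed as an already-decoded f32 rather than read from a packed block.

```rust
/// Symmetric i8 quantization of one activation row (hypothetical sketch).
/// Returns the quantized values and scale = max_abs / 127.
fn sketch_quantize_row_to_i8(x: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    if max_abs == 0.0 {
        // Assumption: an all-zero row quantizes to zeros with scale 0.
        return (vec![0; x.len()], 0.0);
    }
    let scale = max_abs / 127.0;
    let q = x
        .iter()
        .map(|v| (v / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Integer dot of one block of i8 weights with i8 activations.
/// Products are accumulated in i32; the two scales are multiplied in
/// once at the end to recover an f32 result.
fn sketch_dot_i8_block(w: &[i8], w_scale: f32, a: &[i8], a_scale: f32) -> f32 {
    assert_eq!(w.len(), a.len());
    let sum: i32 = w
        .iter()
        .zip(a)
        .map(|(&wi, &ai)| wi as i32 * ai as i32)
        .sum();
    sum as f32 * w_scale * a_scale
}
```

Keeping the inner loop in integers defers all floating-point work to a single multiply per block, which is the property the scalar fallback above relies on.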