Module quant

Quantized weight storage and on-the-fly dequantizing dot-products.

The whole point of running huge models on small devices is to keep weights quantized in memory and dequantize one block at a time inside the dot-product, instead of expanding the whole weight matrix to F32 (which costs about 7× the RAM for Q4_0: 4 bytes versus 0.5625 bytes per weight). This module is the Phase-0 spike proving that mechanic: a Q4_0 matrix-vector product computed straight from the packed blocks matches the F32 reference, while storing only 0.5625 bytes/weight.
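The storage figures above follow from the block sizes. The sketch below works through that arithmetic; `bytes_per_weight`, `quantized_bytes`, and `f32_bytes` are hypothetical helpers written for illustration, not functions of this module. Counting the f16 scale, Q4_0 stores 18 bytes per 32 weights, so F32 comes out roughly 7.1× larger (8× if only the 4-bit nibbles are counted).

```rust
/// Weights per quantized block (named QK in this module).
const QK: usize = 32;
/// 2-byte f16 scale + 16 packed nibble bytes.
const Q4_0_BLOCK_BYTES: usize = 2 + QK / 2;
/// 2-byte f16 scale + 32 i8 quants.
const Q8_0_BLOCK_BYTES: usize = 2 + QK;

/// Average storage cost of one weight for a given block size (hypothetical helper).
fn bytes_per_weight(block_bytes: usize) -> f64 {
    block_bytes as f64 / QK as f64
}

/// Bytes needed for a rows x cols matrix stored as packed blocks (hypothetical helper).
/// Each row must contain a whole number of blocks.
fn quantized_bytes(rows: usize, cols: usize, block_bytes: usize) -> usize {
    assert_eq!(cols % QK, 0, "row length must be a multiple of QK");
    rows * (cols / QK) * block_bytes
}

/// Bytes needed for the same matrix expanded to f32 (hypothetical helper).
fn f32_bytes(rows: usize, cols: usize) -> usize {
    rows * cols * std::mem::size_of::<f32>()
}
```

For a 4096 × 4096 matrix this gives 9,437,184 bytes in Q4_0 against 67,108,864 bytes in F32.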

Block layouts follow the canonical ggml conventions (so the same code reads real GGUF files in Phase 1):

  • Q4_0: 32 weights per block, 18 bytes = f16 scale + 16 packed nibble bytes. Byte j holds element j (low nibble) and element j+16 (high nibble).
  • Q8_0: 32 weights per block, 34 bytes = f16 scale + 32 × i8.
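As an illustration of the Q4_0 layout, here is a minimal sketch of decoding one block and of a dot product that dequantizes on the fly. It assumes the usual ggml conventions: the f16 scale is stored little-endian in the first two bytes, and each nibble is an unsigned value offset by 8. The names `f16_bits_to_f32`, `dequantize_q4_0_block_sketch`, and `dot_q4_0_block_f32_sketch` are hypothetical; the module's real functions may have different signatures.

```rust
/// Weights per block (named QK in this module).
const QK: usize = 32;
/// 2-byte f16 scale followed by 16 packed nibble bytes.
const Q4_0_BLOCK_BYTES: usize = 18;

/// Convert IEEE 754 half-precision bits to f32 (hand-rolled to avoid a dependency).
fn f16_bits_to_f32(bits: u16) -> f32 {
    let sign = if bits & 0x8000 != 0 { -1.0f32 } else { 1.0 };
    let exp = ((bits >> 10) & 0x1f) as i32;
    let frac = (bits & 0x03ff) as f32;
    match exp {
        // Zero and subnormals: frac * 2^-24.
        0 => sign * frac * 2f32.powi(-24),
        // Infinity or NaN.
        31 => if frac == 0.0 { sign * f32::INFINITY } else { f32::NAN },
        // Normal numbers: (1 + frac / 1024) * 2^(exp - 15).
        _ => sign * (1.0 + frac / 1024.0) * 2f32.powi(exp - 15),
    }
}

/// Expand one Q4_0 block into 32 f32 weights (assumed ggml convention: (nibble - 8) * scale).
fn dequantize_q4_0_block_sketch(block: &[u8; Q4_0_BLOCK_BYTES], out: &mut [f32; QK]) {
    let d = f16_bits_to_f32(u16::from_le_bytes([block[0], block[1]]));
    for j in 0..QK / 2 {
        let byte = block[2 + j];
        // Low nibble is element j, high nibble is element j + 16.
        out[j] = ((byte & 0x0f) as i32 - 8) as f32 * d;
        out[j + QK / 2] = ((byte >> 4) as i32 - 8) as f32 * d;
    }
}

/// Dot product of one Q4_0 block with 32 activations, without materializing the weights.
fn dot_q4_0_block_f32_sketch(block: &[u8; Q4_0_BLOCK_BYTES], x: &[f32; QK]) -> f32 {
    let d = f16_bits_to_f32(u16::from_le_bytes([block[0], block[1]]));
    let mut acc = 0.0f32;
    for j in 0..QK / 2 {
        let byte = block[2 + j];
        acc += ((byte & 0x0f) as i32 - 8) as f32 * x[j];
        acc += ((byte >> 4) as i32 - 8) as f32 * x[j + QK / 2];
    }
    // The scale is shared by the whole block, so it is applied once at the end.
    acc * d
}
```

Applying the scale once per block rather than once per weight is what keeps the fused path cheap: the inner loop only touches the packed bytes and the activations.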

Constants

Q4_0_BLOCK_BYTES
Bytes per Q4_0 block: 2 (f16 scale) + 16 (packed nibbles).
Q4_K_BLOCK_BYTES
Q5_K_BLOCK_BYTES
Q6_K_BLOCK_BYTES
Q8_0_BLOCK_BYTES
Bytes per Q8_0 block: 2 (f16 scale) + 32 (i8 quants).
QK
Weights per quantized block (both Q4_0 and Q8_0 use 32).
QK_K

Functions

dequantize_q4_0_block
Dequantize one Q4_0 block into out (length QK).
dot_q4_0_block_f32
Dot product of one Q4_0 block with a length-QK f32 activation slice.
dot_q4_0_row_f32
Dot product of a full Q4_0-quantized weight row with an f32 activation vector.
dot_q4_k_row_f32
Dot product of a full Q4_K-quantized weight row with an f32 activation vector.
dot_q5_k_row_f32
Dot product of a full Q5_K-quantized weight row with an f32 activation vector.
dot_q6_k_row_f32
Dot product of a full Q6_K-quantized weight row with an f32 activation vector.
dot_q8_0_block_f32
Dot product of a Q8_0 block with an f32 activation slice.
dot_q8_0_row_f32
Dot product of a full Q8_0-quantized weight row with an f32 activation vector.
dot_q8_0_row_i8_scalar
Scalar fallback for dot_q8_0_row_sdot, used on aarch64 without the dotprod feature and on all other platforms. Accumulates in i32 integer arithmetic with no widening chain; it is still faster than the f32 path on targets without AVX2, and correct everywhere.
quantize_q4_0_block
Quantize a length-QK slice of f32 into one Q4_0 block (ggml convention).
quantize_q4_0_row
Quantize a full f32 weight row (k % QK == 0) into packed Q4_0 blocks.
quantize_q8_0_block
Quantize a length-QK (32) slice of f32 into one Q8_0 block (ggml convention).
quantize_row_to_i8
Quantize a row of f32 activations to i8 (symmetric Q8_0 style). Returns (quantized_bytes, scale) where scale = max_abs / 127. The caller multiplies each block’s weight scale by this scale to recover f32.
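The integer path described for quantize_row_to_i8 and dot_q8_0_row_i8_scalar can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: `Q8BlockSketch` is a hypothetical stand-in that keeps the weight scale as f32 (the real Q8_0 block packs an f16 scale and 32 i8 quants into 34 bytes), the rounding mode is assumed to be round-to-nearest, and the function names and signatures are invented for the example.

```rust
/// Weights per block (named QK in this module).
const QK: usize = 32;

/// Hypothetical, simplified Q8_0 block: f32 scale instead of the packed f16.
struct Q8BlockSketch {
    scale: f32,
    qs: [i8; QK],
}

/// Quantize a whole activation row to i8 with one symmetric scale = max_abs / 127.
fn quantize_row_to_i8_sketch(x: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    if max_abs == 0.0 {
        // An all-zero row quantizes to zeros; a zero scale avoids dividing by zero.
        return (vec![0i8; x.len()], 0.0);
    }
    let scale = max_abs / 127.0;
    let inv = 1.0 / scale;
    let q = x
        .iter()
        .map(|v| (v * inv).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Dot product of a Q8_0-style weight row with i8 activations.
/// Each block accumulates in i32, then is rescaled by weight scale * activation scale.
fn dot_q8_0_row_i8_sketch(row: &[Q8BlockSketch], act_q: &[i8], act_scale: f32) -> f32 {
    assert_eq!(act_q.len(), row.len() * QK, "activation length must match the row");
    let mut total = 0.0f32;
    for (block, a) in row.iter().zip(act_q.chunks_exact(QK)) {
        let mut acc: i32 = 0;
        for (w, v) in block.qs.iter().zip(a) {
            acc += (*w as i32) * (*v as i32);
        }
        total += block.scale * act_scale * acc as f32;
    }
    total
}
```

A block's integer sum is at most 32 × 127 × 127 in magnitude, so i32 accumulation cannot overflow within a block; floating-point work is limited to one multiply-add per block.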