Module quant_std

Standard GGUF quantization block types: Q4_0 (4-bit) and Q8_0 (8-bit).

These are the most common quantization formats in distributed GGUF model files, accounting for roughly 80% of publicly released models.

Q4_0 (GGML type 2): 32 weights per block, 18 bytes total. Layout: a 2-byte f16 block scale d followed by 16 bytes of packed 4-bit nibbles (two weights per byte). Dequantization: w[j] = d × (nibble[j] − 8), which maps the unsigned nibble range 0..15 to −8..7.
Q8_0 (GGML type 8): 32 weights per block, 34 bytes total. Layout: a 2-byte f16 block scale d followed by 32 bytes of i8 weights qs. Dequantization: w[j] = d × qs[j].
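
The two dequantization formulas above can be sketched on raw block bytes as follows. This is a minimal illustration, not this module's API: the function names are hypothetical, the f16 scale is assumed to be stored little-endian, and the nibble order (low nibble of byte i is weight i, high nibble is weight i + 16) is assumed to follow the ggml reference layout, which this page does not state.

```rust
/// Convert raw IEEE 754 binary16 bits to f32 (hypothetical helper, std only).
pub fn f16_bits_to_f32(bits: u16) -> f32 {
    let sign = if bits & 0x8000 != 0 { -1.0f32 } else { 1.0f32 };
    let exp = ((bits >> 10) & 0x1f) as i32;
    let frac = (bits & 0x03ff) as f32;
    let magnitude = match exp {
        // Subnormal: (frac / 1024) * 2^-14
        0 => frac * 2f32.powi(-24),
        // All-ones exponent: infinity or NaN
        31 => {
            if frac == 0.0 {
                f32::INFINITY
            } else {
                f32::NAN
            }
        }
        // Normal: (1 + frac / 1024) * 2^(exp - 15)
        _ => (1.0 + frac / 1024.0) * 2f32.powi(exp - 15),
    };
    sign * magnitude
}

/// Dequantize one 18-byte Q4_0 block: w[j] = d * (nibble[j] - 8).
/// Assumption: low nibbles hold weights 0..16, high nibbles hold weights 16..32.
pub fn dequantize_q4_0(block: &[u8; 18]) -> [f32; 32] {
    let d = f16_bits_to_f32(u16::from_le_bytes([block[0], block[1]]));
    let qs = &block[2..];
    let mut out = [0.0f32; 32];
    for i in 0..16 {
        let lo = (qs[i] & 0x0f) as i32 - 8;
        let hi = (qs[i] >> 4) as i32 - 8;
        out[i] = d * lo as f32;
        out[i + 16] = d * hi as f32;
    }
    out
}

/// Dequantize one 34-byte Q8_0 block: w[j] = d * qs[j].
pub fn dequantize_q8_0(block: &[u8; 34]) -> [f32; 32] {
    let d = f16_bits_to_f32(u16::from_le_bytes([block[0], block[1]]));
    let mut out = [0.0f32; 32];
    for (o, &b) in out.iter_mut().zip(&block[2..]) {
        // Reinterpret each byte as a signed 8-bit weight.
        *o = d * (b as i8) as f32;
    }
    out
}
```

With a scale of 1.0 (f16 bits 0x3C00), a Q4_0 nibble of 0 decodes to −8.0 and a nibble of 15 decodes to 7.0, so the representable range per block is asymmetric around zero.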

Structs

BlockQ4_0: Q4_0 block: 32 weights quantized to 4 bits each with a shared FP16 scale.
BlockQ8_0: Q8_0 block: 32 weights quantized to 8-bit signed integers with a shared FP16 scale.
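
To make the block layouts concrete, here is one way such structs could be laid out in memory. This is a sketch under assumptions: the type names are hypothetical stand-ins, the field names d and qs are borrowed from the formulas in the module description, and the scale is kept as raw f16 bits in a u16 so the example needs no external crate. The actual fields of BlockQ4_0 and BlockQ8_0 are not shown on this page.

```rust
use std::mem::size_of;

/// Hypothetical Q4_0 layout: f16 scale bits plus 16 bytes of packed nibbles.
#[repr(C)]
#[derive(Clone, Copy, Debug)]
pub struct SketchBlockQ4_0 {
    /// Raw bits of the shared FP16 scale.
    pub d: u16,
    /// 32 weights, two 4-bit values per byte.
    pub qs: [u8; 16],
}

/// Hypothetical Q8_0 layout: f16 scale bits plus 32 signed 8-bit weights.
#[repr(C)]
#[derive(Clone, Copy, Debug)]
pub struct SketchBlockQ8_0 {
    /// Raw bits of the shared FP16 scale.
    pub d: u16,
    /// 32 weights, one i8 each.
    pub qs: [i8; 32],
}

// Compile-time checks that the layouts match the documented block sizes.
const _: () = assert!(size_of::<SketchBlockQ4_0>() == 18);
const _: () = assert!(size_of::<SketchBlockQ8_0>() == 34);
```

Using `#[repr(C)]` with a 2-byte-aligned scale and byte arrays leaves no padding, which is why the struct sizes can equal the on-disk block sizes of 18 and 34 bytes.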

Constants

BLOCK_Q4_0_BYTES: Number of bytes per Q4_0 block (2-byte f16 scale + 16 bytes of 4-bit pairs).
BLOCK_Q8_0_BYTES: Number of bytes per Q8_0 block (2-byte f16 scale + 32 bytes of i8 weights).
QK_Q4_0: Number of weights per Q4_0 block.
QK_Q8_0: Number of weights per Q8_0 block.
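
As an illustration of how these four constants relate, the sketch below redefines them locally with the values given in the module description and uses them to compute the storage size of a quantized tensor. The helper `quantized_len` is hypothetical and not part of this module, and the way the real constants are defined in the source is an assumption.

```rust
// Local stand-ins carrying the values documented above.
pub const QK_Q4_0: usize = 32;
pub const QK_Q8_0: usize = 32;
/// 2-byte f16 scale + one nibble (half a byte) per weight.
pub const BLOCK_Q4_0_BYTES: usize = 2 + QK_Q4_0 / 2;
/// 2-byte f16 scale + one byte per weight.
pub const BLOCK_Q8_0_BYTES: usize = 2 + QK_Q8_0;

/// Bytes needed to store `n_weights` weights in blocks of `qk` weights,
/// each `block_bytes` long. Returns None if `n_weights` is not a whole
/// number of blocks or the size overflows usize.
pub fn quantized_len(n_weights: usize, qk: usize, block_bytes: usize) -> Option<usize> {
    if qk == 0 || n_weights % qk != 0 {
        return None;
    }
    (n_weights / qk).checked_mul(block_bytes)
}
```

For example, a row of 4096 weights occupies 128 blocks: 2304 bytes in Q4_0 and 4352 bytes in Q8_0. Because of the per-block scale, the effective cost is 4.5 bits per weight for Q4_0 and 8.5 bits per weight for Q8_0.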