Module quantizer

Groupwise symmetric quantization with f16 scales.

For each group of group_len values:

  • scale = max(|v_i|) / qmax
  • q_i = round(v_i / scale), clamped to [-qmax, +qmax]
  • u_i = q_i + qmax (bias to unsigned for packing)
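
The three steps above can be sketched for a single group as follows. This is a minimal illustration, not the module's implementation: `quantize_group` is a hypothetical helper, it uses an f32 scale rather than f16, it returns unpacked codes instead of a bitstream, and its handling of an all-zero group (where the scale is 0) is an assumption.

```rust
/// Hypothetical helper, not part of this module's API.
/// Quantizes one group and returns (scale, biased unsigned codes).
fn quantize_group(values: &[f32], qmax: i32) -> (f32, Vec<u32>) {
    // scale = max(|v_i|) / qmax
    let max_abs = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let scale = max_abs / qmax as f32;

    let codes = values
        .iter()
        .map(|&v| {
            // Assumption: an all-zero group (scale == 0) maps every value to q = 0.
            let q = if scale > 0.0 { (v / scale).round() as i32 } else { 0 };
            // q_i clamped to [-qmax, +qmax]
            let q = q.clamp(-qmax, qmax);
            // u_i = q_i + qmax, which is non-negative and ready for packing
            (q + qmax) as u32
        })
        .collect();

    (scale, codes)
}

fn main() {
    // qmax = 7 corresponds to a signed 4-bit range used symmetrically.
    let (scale, codes) = quantize_group(&[-3.5, 0.0, 1.0, 3.5], 7);
    println!("scale = {scale}, codes = {codes:?}");
}
```

With `qmax = 7` the biased codes fall in `0..=14`, so each fits in 4 bits.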

Functions

compute_scales
Compute f16 group scales for a frame.
dequantize
Dequantize packed codes using f16 scales, writing f32 values.
dequantize_f32
Dequantize packed codes using f32 scales, writing f32 values.
frame_fits_scales
Check if a frame fits within existing f16 scales (within drift tolerance).
frame_fits_scales_f32
Check if a frame fits within existing f32 scales (within drift tolerance).
quantize_and_pack
Quantize a frame using pre-computed f16 scales and pack into a bitstream.
quantize_and_pack_f32
Quantize a frame using pre-computed f32 scales and pack into a bitstream.
scales_to_f32
Pre-convert f16 scales to f32 for hot-path use.
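
Dequantization inverts the bias and multiplies by the group scale, so each value is recovered as `(u_i - qmax) * scale`. The sketch below illustrates that arithmetic only. `dequantize_groups` is a hypothetical function whose signature is an assumption: it takes already-unpacked codes and one f32 scale per group, whereas the module's `dequantize_f32` reads packed codes and its actual parameters are not shown here.

```rust
/// Hypothetical sketch, not this module's `dequantize_f32`.
/// `codes` are unpacked biased codes; `scales` holds one f32 scale per group.
fn dequantize_groups(
    codes: &[u32],
    scales: &[f32],
    group_len: usize,
    qmax: i32,
    out: &mut [f32],
) {
    for (i, (&u, o)) in codes.iter().zip(out.iter_mut()).enumerate() {
        // Undo the bias: q_i = u_i - qmax
        let q = u as i32 - qmax;
        // v_i is approximately q_i * scale of the group that element i belongs to
        *o = q as f32 * scales[i / group_len];
    }
}

fn main() {
    let codes = [0u32, 7, 9, 14];
    let scales = [0.5f32, 0.25]; // two groups of two values each
    let mut out = [0.0f32; 4];
    dequantize_groups(&codes, &scales, 2, 7, &mut out);
    println!("{out:?}");
}
```

Working from f32 scales keeps any f16-to-f32 conversion out of the per-element loop, which is presumably why `scales_to_f32` exists as a separate pre-conversion step.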