Groupwise symmetric quantization with f16 scales.
For each group of group_len values:
scale = max(|v_i|) / qmax
q_i = round(v_i / scale), clamped to [-qmax, +qmax]
u_i = q_i + qmax (bias to unsigned for packing)
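The three steps above can be sketched for a single group as follows. This is a minimal illustration, not this module's implementation: the function names `quantize_group` and `dequantize_group` and the `bits` parameter are hypothetical, `qmax` is assumed to be `2^(bits-1) - 1`, the scale is kept in f32 rather than f16 to avoid an external crate, and the all-zero-group fallback is an assumption.

```rust
/// Quantize one group symmetrically; returns (scale, biased unsigned codes).
/// `bits` is a hypothetical parameter (assumed >= 2): qmax = 2^(bits-1) - 1.
fn quantize_group(values: &[f32], bits: u32) -> (f32, Vec<u32>) {
    let qmax = ((1u32 << (bits - 1)) - 1) as f32;
    // scale = max(|v_i|) / qmax
    let max_abs = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    // Assumption: an all-zero group falls back to scale 1.0 to avoid dividing by zero.
    let scale = if max_abs > 0.0 { max_abs / qmax } else { 1.0 };
    let codes = values
        .iter()
        .map(|&v| {
            // q_i = round(v_i / scale), clamped to [-qmax, +qmax]
            let q = (v / scale).round().clamp(-qmax, qmax);
            // u_i = q_i + qmax (bias to unsigned for packing)
            (q + qmax) as u32
        })
        .collect();
    (scale, codes)
}

/// Invert the bias and rescale: v_i is approximately (u_i - qmax) * scale.
fn dequantize_group(scale: f32, codes: &[u32], bits: u32) -> Vec<f32> {
    let qmax = ((1u32 << (bits - 1)) - 1) as f32;
    codes.iter().map(|&u| (u as f32 - qmax) * scale).collect()
}

fn main() {
    let group = [0.7f32, -1.0, 0.3, 0.0];
    let (scale, codes) = quantize_group(&group, 4);
    let restored = dequantize_group(scale, &codes, 4);
    println!("scale = {scale}, codes = {codes:?}, restored = {restored:?}");
}
```

Because the codes are biased by `qmax`, every `u_i` lies in `[0, 2 * qmax]`, so it fits in `bits` unsigned bits and can be packed without a sign bit.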
Functions
- compute_scales - Compute f16 group scales for a frame.
- dequantize - Dequantize packed codes using f16 scales, writing f32 values.
- dequantize_f32 - Dequantize packed codes using f32 scales, writing f32 values.
- frame_fits_scales - Check if a frame fits within existing f16 scales (within drift tolerance).
- frame_fits_scales_f32 - Check if a frame fits within existing scales (within drift tolerance).
- quantize_and_pack - Quantize a frame using pre-computed f16 scales and pack into bitstream.
- quantize_and_pack_f32 - Quantize a frame using pre-computed f32 scales and pack into bitstream.
- scales_to_f32 - Pre-convert f16 scales to f32 for hot-path use.