Function quantize_4bit_double 

pub fn quantize_4bit_double(values: &[f32]) -> DoubleQuantized4Bit
Quantizes values to 4 bits, with double quantization of the scale factors.

First applies standard 4-bit quantization, then quantizes the resulting FP32 scale factors to 8-bit with a second-level block size of 256.
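
The two-level scheme can be illustrated with a minimal, self-contained sketch. This is not the crate's implementation: the layout of `DoubleQuantized4Bit` is not shown here, so the struct `SketchDoubleQuantized`, the functions `sketch_quantize` and `sketch_dequantize`, the first-level block size of 64, and the symmetric absmax rounding are all assumptions made for illustration. Only the 8-bit scale codes and the second-level block size of 256 come from the description above.

```rust
/// Hypothetical container (assumption); not the crate's `DoubleQuantized4Bit`.
#[derive(Debug)]
pub struct SketchDoubleQuantized {
    /// Two 4-bit codes per byte, low nibble first.
    pub packed: Vec<u8>,
    /// First-level scales, each stored as an 8-bit code.
    pub scale_codes: Vec<u8>,
    /// One FP32 "scale of scales" per group of 256 first-level scales.
    pub super_scales: Vec<f32>,
    /// Number of original values.
    pub len: usize,
}

const BLOCK: usize = 64; // assumed first-level block size
const SUPER_BLOCK: usize = 256; // second-level block size, from the description

pub fn sketch_quantize(values: &[f32]) -> SketchDoubleQuantized {
    // Level 1: one absmax scale per block; codes in [-7, 7] stored with offset 8.
    let mut scales: Vec<f32> = Vec::new();
    let mut codes: Vec<u8> = Vec::with_capacity(values.len());
    for block in values.chunks(BLOCK) {
        let absmax = block.iter().fold(0.0f32, |m, v| m.max(v.abs()));
        let scale = if absmax > 0.0 { absmax / 7.0 } else { 1.0 };
        scales.push(scale);
        for &v in block {
            let q = (v / scale).round().clamp(-7.0, 7.0) as i8;
            codes.push((q + 8) as u8);
        }
    }

    // Pack two nibbles per byte; an odd tail is padded with the zero code (8).
    let packed: Vec<u8> = codes
        .chunks(2)
        .map(|p| p[0] | (p.get(1).copied().unwrap_or(8) << 4))
        .collect();

    // Level 2: the scales are positive, so map each group of 256 to unsigned 8-bit codes.
    let mut scale_codes: Vec<u8> = Vec::with_capacity(scales.len());
    let mut super_scales: Vec<f32> = Vec::new();
    for group in scales.chunks(SUPER_BLOCK) {
        let max = group.iter().fold(0.0f32, |m, &s| m.max(s));
        let super_scale = max / 255.0;
        super_scales.push(super_scale);
        for &s in group {
            scale_codes.push((s / super_scale).round().clamp(0.0, 255.0) as u8);
        }
    }

    SketchDoubleQuantized { packed, scale_codes, super_scales, len: values.len() }
}

pub fn sketch_dequantize(q: &SketchDoubleQuantized) -> Vec<f32> {
    (0..q.len)
        .map(|i| {
            let byte = q.packed[i / 2];
            let nibble = if i % 2 == 0 { byte & 0x0F } else { byte >> 4 };
            let block = i / BLOCK;
            // Rebuild the first-level scale from its 8-bit code and the group scale.
            let scale = q.scale_codes[block] as f32 * q.super_scales[block / SUPER_BLOCK];
            (nibble as i8 - 8) as f32 * scale
        })
        .collect()
}
```

Under these assumptions, each first-level scale costs one byte instead of four, plus one extra `f32` per 256 scales. The trade-off is a small additional error in the reconstructed scales, and a scale that is very small relative to the largest scale in its group may round to a code of zero.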