pub fn quantize_float(v: f32, n: i32) -> f32
Quantizes a float into a floating point value with a limited number of significant mantissa bits.
Generates +-inf for overflow, preserves NaN, flushes denormals to zero, rounds to nearest.
Assumes n is in a valid mantissa precision range, which is 1..23.
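
The behavior described above can be illustrated with a small bit-manipulation sketch. This is not necessarily how the crate implements the function: the name quantize_float_sketch is hypothetical, the range 1..23 is assumed to be inclusive, and ties are assumed to round away from zero in magnitude.

```rust
/// Illustrative sketch only; `quantize_float_sketch` is a hypothetical name,
/// not part of the documented API.
fn quantize_float_sketch(v: f32, n: i32) -> f32 {
    // Assumption: the documented range 1..23 is inclusive on both ends.
    assert!((1..=23).contains(&n), "n must be in 1..=23");

    let bits = v.to_bits();
    let dropped = 23 - n as u32; // number of low mantissa bits to discard

    let mask: u32 = (1u32 << dropped) - 1; // bits that get cleared
    let round: u32 = (1u32 << dropped) >> 1; // half of the last kept unit

    let exponent = bits & 0x7f80_0000;

    if exponent == 0x7f80_0000 {
        // Infinity and NaN pass through unchanged, so a NaN payload
        // can never be rounded into a different value.
        v
    } else if exponent == 0 {
        // Zero and denormals are flushed to (positive) zero.
        0.0
    } else {
        // Adding half a unit and then truncating rounds to nearest.
        // A carry out of the mantissa bumps the exponent, so values
        // near f32::MAX overflow to infinity with the original sign.
        f32::from_bits(bits.wrapping_add(round) & !mask)
    }
}
```

With this sketch, quantize_float_sketch(1.2, 1) yields 1.0 because only one mantissa bit is kept, while n = 23 keeps every mantissa bit and returns normal inputs unchanged.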