Function quantize_float

pub fn quantize_float(v: f32, n: i32) -> f32

Quantizes a float into a floating point value with a limited number of significant mantissa bits.

Generates ±inf on overflow, preserves NaN, flushes denormals to zero, and rounds to nearest.

Assumes n is within the valid mantissa precision range of 1 to 23 bits, inclusive.
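The behavior described above can be illustrated with a minimal sketch that works directly on the IEEE 754 bit pattern of an f32. The name `quantize_float_sketch` is hypothetical and this is not necessarily how the crate implements `quantize_float`. It assumes that rounding adds half of the dropped precision before masking, and that denormals flush to positive zero.

```rust
/// Hypothetical stand-in that mirrors the documented behavior of
/// `quantize_float`; not the crate's actual implementation.
fn quantize_float_sketch(v: f32, n: i32) -> f32 {
    // Assumption: n is the number of mantissa bits to keep, 1 to 23 inclusive.
    assert!((1..=23).contains(&n), "n must be in 1..=23");

    let bits = v.to_bits();
    let dropped = (23 - n) as u32; // low mantissa bits to discard

    let mask: u32 = (1u32 << dropped) - 1; // selects the discarded bits
    let half: u32 = (1u32 << dropped) >> 1; // half of the discarded range

    let exponent = bits & 0x7f80_0000;

    // Add half and clear the low bits. A carry out of the mantissa bumps the
    // exponent, which is how values near f32::MAX overflow to infinity.
    let rounded = bits.wrapping_add(half) & !mask;

    let out = if exponent == 0x7f80_0000 {
        bits // inf and NaN pass through unchanged
    } else if exponent == 0 {
        0 // zero and denormals flush to +0.0 (sign dropped in this sketch)
    } else {
        rounded
    };

    f32::from_bits(out)
}
```

With this sketch, `quantize_float_sketch(1.3, 2)` keeps two mantissa bits and yields `1.25`, while `quantize_float_sketch(f32::MAX, 1)` overflows to positive infinity. Passing `n = 23` leaves normal values unchanged, since no mantissa bits are dropped.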