pub fn quantize_half(v: f32) -> u16
Quantizes a float into a half-precision floating point value.
Generates ±inf on overflow, preserves NaN, flushes denormals to zero, and rounds to nearest.
Representable magnitude range: [6e-5; 65504]
Maximum relative reconstruction error: 5e-4
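
The behavior described above can be illustrated with a minimal bit-manipulation sketch. This is not necessarily how the library implements the function; `quantize_half_sketch` is a hypothetical name, and the code assumes the usual IEEE 754 binary32 and binary16 layouts.

```rust
/// Hypothetical sketch of float -> half quantization (not the library's code).
/// Assumes IEEE 754 binary32 input and binary16 output.
fn quantize_half_sketch(v: f32) -> u16 {
    let ui = v.to_bits();

    // Move the sign bit from position 31 to position 15.
    let s = ((ui >> 16) & 0x8000) as i32;
    // Exponent and mantissa with the sign cleared.
    let em = (ui & 0x7fff_ffff) as i32;

    // Rebias the exponent (127 - 15 = 112) and round to nearest by adding
    // half of the discarded 13 mantissa bits before shifting.
    let mut h = (em - (112 << 23) + (1 << 12)) >> 13;

    // Underflow: anything below 2^-14 (biased exponent 113) is flushed to zero.
    if em < (113 << 23) {
        h = 0;
    }
    // Overflow: anything at or above 2^16 (biased exponent 143) becomes infinity.
    if em >= (143 << 23) {
        h = 0x7c00;
    }
    // NaN: every NaN payload is mapped to a quiet NaN.
    if em > (255 << 23) {
        h = 0x7e00;
    }

    (s | h) as u16
}
```

In this sketch, `quantize_half_sketch(1.0)` yields `0x3c00`, `65504.0` yields the largest finite half `0x7bff`, and a magnitude below the representable range such as `1e-6` yields `0`.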