Function quantize_half

pub fn quantize_half(v: f32) -> u16


Quantizes a float into a half-precision floating-point value.

Generates ±inf on overflow, preserves NaN, flushes denormals to zero, and rounds to nearest.

Representable magnitude range: [6e-5; 65504]

Maximum relative reconstruction error: 5e-4
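
The behaviour described above can be illustrated with a minimal sketch. It is not taken from this crate's source: `quantize_half_sketch` and `half_to_f32_sketch` are hypothetical names, and the bit manipulation is one assumed way to obtain the documented results (overflow to infinity, NaN preserved as a quiet NaN, small values flushed to zero, round to nearest).

```rust
/// Hypothetical stand-in that mirrors the documented behaviour;
/// an illustration only, not the crate's own implementation.
fn quantize_half_sketch(v: f32) -> u16 {
    let bits = v.to_bits();
    let sign = ((bits >> 16) & 0x8000) as i32; // sign moved to bit 15
    let em = (bits & 0x7fff_ffff) as i32; // exponent and mantissa

    // Rebias the exponent (127 - 15 = 112) and round to nearest by
    // adding half of the 13 mantissa bits that are dropped.
    let mut h = (em - (112 << 23) + (1 << 12)) >> 13;

    // Below the smallest normal half (2^-14): flush to zero.
    if em < (113 << 23) {
        h = 0;
    }
    // Exponent 16 and above does not fit: infinity.
    if em >= (143 << 23) {
        h = 0x7c00;
    }
    // NaN input: emit a quiet NaN.
    if em > (255 << 23) {
        h = 0x7e00;
    }

    (sign | h) as u16
}

/// Hypothetical decoder used only to measure reconstruction error.
/// Assumption: half-precision denormals never occur, because the
/// quantizer above flushes them to zero.
fn half_to_f32_sketch(h: u16) -> f32 {
    let sign = ((h as u32) & 0x8000) << 16;
    let em = (h as u32) & 0x7fff;
    let bits = if em == 0 {
        0 // signed zero
    } else if em >= 0x7c00 {
        0x7f80_0000 | ((em & 0x3ff) << 13) // infinity or NaN
    } else {
        (em << 13) + (112 << 23) // normal value: undo the rebias
    };
    f32::from_bits(sign | bits)
}
```

Under these assumptions, `1.0` maps to `0x3c00`, `65504.0` maps to `0x7bff` (the largest finite half), and a value such as `1e-6` is flushed to `0`. Decoding a quantized in-range value with the helper should land within the 5e-4 relative error quoted above, since rounding to nearest with a 10-bit mantissa bounds the error by 2^-11.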
