Quantize a float into a floating point value with a limited number of significant mantissa bits.
Generates +-inf for overflow, preserves NaN, flushes denormals to zero, rounds to nearest.
Assumes N is within the valid mantissa precision range, 1..23.
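
The behavior described above can be sketched in C as follows. This is a minimal illustration, not the actual implementation: the name `quantize_float_sketch` and the parameter `bits` (standing in for N) are hypothetical, and the sketch assumes `float` is a 32-bit IEEE 754 binary32 value. It rounds by adding half of the dropped range and masking, so ties round away from zero, and it keeps the sign when flushing denormals.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: keep only `bits` mantissa bits of a binary32 float. */
float quantize_float_sketch(float v, int bits)
{
    assert(bits >= 1 && bits <= 23);

    uint32_t ui;
    memcpy(&ui, &v, sizeof ui); /* reinterpret the float as raw bits */

    const uint32_t drop = 23u - (uint32_t)bits;  /* mantissa bits to discard */
    const uint32_t mask = (1u << drop) - 1u;     /* low bits that get cleared */
    const uint32_t half = (1u << drop) >> 1;     /* half of the discarded range */

    const uint32_t exponent = ui & 0x7f800000u;

    if (exponent == 0x7f800000u)
        return v; /* inf and NaN pass through untouched */

    if (exponent == 0)
        ui &= 0x80000000u; /* zero or denormal: flush to (signed) zero */
    else
        ui = (ui + half) & ~mask; /* round to nearest; a carry may reach inf */

    float result;
    memcpy(&result, &ui, sizeof result);
    return result;
}
```

With `bits` set to 1, for example, 1.1f becomes 1.0f and 1.75f rounds up to 2.0f, because the carry out of the mantissa increments the exponent. The same carry is what turns the largest finite floats into infinity.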
Quantize a float into half-precision floating point value.
Generates +-inf for overflow, preserves NaN, flushes denormals to zero, rounds to nearest.
Representable magnitude range: [6e-5; 65504].
Maximum relative reconstruction error: 5e-4.
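
A rough C sketch of a conversion with these properties is shown below. The name `quantize_half_sketch` is hypothetical, and the code assumes 32-bit IEEE 754 floats; mapping every NaN to a single quiet NaN pattern is a choice made for this sketch. Half precision stores 10 mantissa bits, so rounding to nearest gives a relative error of at most 2^-11, roughly 4.9e-4, which is consistent with the 5e-4 bound stated above.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: convert a binary32 float to binary16 bits. */
uint16_t quantize_half_sketch(float v)
{
    uint32_t ui;
    memcpy(&ui, &v, sizeof ui); /* reinterpret the float as raw bits */

    const uint32_t sign = (ui >> 16) & 0x8000u; /* sign moved to bit 15 */
    const uint32_t em = ui & 0x7fffffffu;       /* exponent and mantissa */

    uint32_t h;
    if (em > (255u << 23))
        h = 0x7e00u; /* NaN: emit a quiet NaN */
    else if (em >= (143u << 23))
        h = 0x7c00u; /* magnitude >= 2^16 (or inf): overflow to infinity */
    else if (em < (113u << 23))
        h = 0u;      /* magnitude < 2^-14: flush to zero */
    else
        /* rebias the exponent by 127 - 15 = 112, round to nearest, and
           drop 13 mantissa bits; a carry past 65504 yields infinity */
        h = (em - (112u << 23) + (1u << 12)) >> 13;

    return (uint16_t)(sign | h);
}
```

For instance, 1.0f maps to 0x3c00 and 65504.0f maps to 0x7bff, the largest finite half value.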
Quantize a float in [-1..1] range into an N-bit fixed point snorm value.
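
One common way to do the snorm conversion is sketched below in C. The name `quantize_snorm_sketch`, the parameter `bits` (standing in for N), and the accepted range of 2..16 bits are assumptions of this sketch. It also assumes the symmetric convention in which 1.0 maps to 2^(N-1) - 1 and -1.0 maps to its negation, so the most negative N-bit integer is never produced.

```c
#include <assert.h>

/* Hypothetical helper: map [-1..1] to a signed integer in
   [-(2^(bits-1) - 1) .. 2^(bits-1) - 1]. */
int quantize_snorm_sketch(float v, int bits)
{
    assert(bits >= 2 && bits <= 16); /* range assumed for this sketch */

    const float scale = (float)((1 << (bits - 1)) - 1);

    /* clamp out-of-range inputs to [-1..1] */
    if (v < -1.0f) v = -1.0f;
    if (v > 1.0f) v = 1.0f;

    /* round to nearest: the int cast truncates toward zero, so the
       half-unit offset must carry the sign of the value */
    const float round = (v >= 0.0f) ? 0.5f : -0.5f;
    return (int)(v * scale + round);
}
```

Under this convention, decoding is a division of the stored integer by the same `scale`.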
Quantize a float in [0..1] range into an N-bit fixed point unorm value.
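
The unorm case can be sketched the same way. As before, the name `quantize_unorm_sketch`, the parameter `bits` (standing in for N), and the accepted range of 1..16 bits are assumptions of this sketch. It uses the usual convention in which 1.0 maps to the largest N-bit value, 2^N - 1.

```c
#include <assert.h>

/* Hypothetical helper: map [0..1] to an unsigned integer in [0 .. 2^bits - 1]. */
int quantize_unorm_sketch(float v, int bits)
{
    assert(bits >= 1 && bits <= 16); /* range assumed for this sketch */

    const float scale = (float)((1 << bits) - 1);

    /* clamp out-of-range inputs to [0..1] */
    if (v < 0.0f) v = 0.0f;
    if (v > 1.0f) v = 1.0f;

    /* round to nearest: add half a unit, then truncate */
    return (int)(v * scale + 0.5f);
}
```

With 8 bits, 0.5f quantizes to 128 and decodes to 128 / 255, about 0.502, which illustrates the reconstruction error of at most half a step.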