Struct f16

Source

#[repr(C)]
pub struct f16(/* private fields */);

A 16-bit floating point type implementing the IEEE 754-2008 standard binary16 (a.k.a. “half”) format.

This 16-bit floating point type is intended for efficient storage where the full range and precision of a larger floating point value is not required.

Implementations§

Source §

impl f16

Source

pub const DIGITS: u32 = 3u32

Approximate number of f16 significant digits in base 10

Source

pub const EPSILON: f16

f16 machine epsilon value

This is the difference between 1.0 and the next larger representable number.
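
As a rough illustration of where this value comes from: with 11 significant binary digits (see MANTISSA_DIGITS), the gap after 1.0 is 2^(1 - 11) = 2^-10 = 0.0009765625. The sketch below uses plain Rust and does not depend on this crate; `epsilon_for` is a hypothetical helper, not part of the API.

```rust
/// Hypothetical helper: machine epsilon of a binary floating point format
/// with `mantissa_digits` significant bits (hidden bit included).
/// Intended for 1 <= mantissa_digits <= 64.
fn epsilon_for(mantissa_digits: u32) -> f64 {
    // epsilon = 2^(1 - mantissa_digits); dividing by an exact power of two
    // keeps the result exact.
    1.0 / (1u64 << (mantissa_digits - 1)) as f64
}
```

Under this formula, `epsilon_for(11)` gives the binary16 value, and `epsilon_for(24)` and `epsilon_for(53)` match `f32::EPSILON` and `f64::EPSILON`.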

Source

pub const INFINITY: f16

f16 positive Infinity (+∞)

Source

pub const MANTISSA_DIGITS: u32 = 11u32

Number of f16 significant digits in base 2

Source

pub const MAX: f16

Largest finite f16 value

Source

pub const MAX_10_EXP: i32 = 4i32

Maximum possible f16 power of 10 exponent

Source

pub const MAX_EXP: i32 = 16i32

Maximum possible f16 power of 2 exponent

Source

pub const MIN: f16

Smallest (most negative) finite f16 value

Source

pub const MIN_10_EXP: i32 = -4i32

Minimum possible normal f16 power of 10 exponent

Source

pub const MIN_EXP: i32 = -13i32

One greater than the minimum possible normal f16 power of 2 exponent

Source

pub const MIN_POSITIVE: f16

Smallest positive normal f16 value

Source

pub const NAN: f16

f16 Not a Number (NaN)

Source

pub const RADIX: u32 = 2u32

The radix or base of the internal representation of f16

Source

pub const MIN_POSITIVE_SUBNORMAL: f16

Minimum positive subnormal f16 value

Source

pub const E: f16

f16 Euler’s number (ℯ)

Source

pub const PI: f16

f16 Archimedes’ constant (π)

Source

pub const LN_10: f16

f16 𝗅𝗇 10

Source

pub const LN_2: f16

f16 𝗅𝗇 2

Source

pub const LOG10_E: f16

f16 𝗅𝗈𝗀₁₀ℯ

Source

pub const LOG10_2: f16

f16 𝗅𝗈𝗀₁₀2

Source

pub const LOG2_E: f16

f16 𝗅𝗈𝗀₂ℯ

Source

pub const LOG2_10: f16

f16 𝗅𝗈𝗀₂10

Source

pub const SQRT_2: f16

f16 √2

Source

pub const SIGN_MASK: u16 = 32_768u16

Sign bit

Source

pub const EXP_MASK: u16 = 31_744u16

Exponent mask

Source

pub const HIDDEN_BIT_MASK: u16 = 1_024u16

Mask for the hidden bit.

Source

pub const MAN_MASK: u16 = 1_023u16

Mantissa mask

Source

pub const TINY_BITS: u16 = 1u16

Minimum representable positive value (min subnormal)

Source

pub const NEG_TINY_BITS: u16 = 32_769u16

Bit pattern of the negative value closest to zero (min negative subnormal)
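
The mask constants above partition the 16 bits into a sign bit, a 5-bit exponent field, and a 10-bit mantissa field. A minimal sketch in plain Rust follows; the values are re-declared locally so it compiles without the crate, and `split_fields` is a hypothetical helper, not part of the API.

```rust
// Values copied from the documented constants; re-declared locally so the
// sketch does not depend on the crate.
const SIGN_MASK: u16 = 32_768; // 0x8000
const EXP_MASK: u16 = 31_744; // 0x7C00
const MAN_MASK: u16 = 1_023; // 0x03FF

/// Hypothetical helper: splits a binary16 bit pattern into
/// (sign bit set, biased exponent, mantissa field).
fn split_fields(bits: u16) -> (bool, u16, u16) {
    let sign = (bits & SIGN_MASK) != 0;
    let exponent = (bits & EXP_MASK) >> 10; // 5-bit biased exponent
    let mantissa = bits & MAN_MASK; // 10 explicitly stored mantissa bits
    (sign, exponent, mantissa)
}
```

For the pattern 0x4A40 (the value 12.5 used in the byte-order examples), this yields a clear sign bit, a biased exponent of 18 (3 + 15), and a mantissa field of 576 (0.5625 × 1024).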

Source

pub const fn from_bits(bits: u16) -> f16

Constructs a 16-bit floating point value from the raw bits.

Source

pub fn from_f32(value: f32) -> f16

Constructs a 16-bit floating point value from a 32-bit floating point value.

This operation is lossy. If the 32-bit value is too large to fit in 16 bits, ±∞ will result. NaN values are preserved. 32-bit subnormal values are too tiny to be represented in 16 bits and result in ±0. Exponents that underflow the minimum 16-bit exponent will result in 16-bit subnormals or ±0. All other values are truncated and rounded to the nearest representable 16-bit value.

This will prefer correctness over speed. Currently, this always uses an intrinsic if available.
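
The rules above (overflow to ±∞, NaN preserved, underflow to subnormals or ±0, rounding to nearest) can be sketched as a software conversion on raw bits. This is a common round-to-nearest-even bit-manipulation approach written in plain Rust; it is an illustration only and is not claimed to be this crate's implementation. The name `f32_to_half_bits` is hypothetical.

```rust
/// Hypothetical software fallback: f32 -> binary16 bit pattern,
/// rounding to nearest with ties to even.
fn f32_to_half_bits(value: f32) -> u16 {
    let x = value.to_bits();
    let sign = x & 0x8000_0000;
    let exp = x & 0x7F80_0000;
    let man = x & 0x007F_FFFF;
    let half_sign = sign >> 16;

    // Infinity or NaN: keep the class, and force a mantissa bit for NaN.
    if exp == 0x7F80_0000 {
        let nan_bit = if man == 0 { 0 } else { 0x0200 };
        return (half_sign | 0x7C00 | nan_bit | (man >> 13)) as u16;
    }

    // Re-bias the exponent from f32 (127) to binary16 (15).
    let half_exp = ((exp >> 23) as i32) - 127 + 15;

    // Too large for binary16: overflow to signed infinity.
    if half_exp >= 0x1F {
        return (half_sign | 0x7C00) as u16;
    }

    // Too small for a normal binary16: subnormal or signed zero.
    if half_exp <= 0 {
        if 14 - half_exp > 24 {
            return half_sign as u16; // underflows completely to ±0
        }
        let man = man | 0x0080_0000; // restore the hidden bit
        let mut half_man = man >> (14 - half_exp);
        let round_bit = 1u32 << (13 - half_exp);
        // Round up if above the halfway point, or exactly halfway and odd.
        if (man & round_bit) != 0 && (man & (3 * round_bit - 1)) != 0 {
            half_man += 1;
        }
        return (half_sign | half_man) as u16;
    }

    // Normal range: drop 13 mantissa bits and round to nearest even.
    let half_exp = (half_exp as u32) << 10;
    let half_man = man >> 13;
    let round_bit = 0x0000_1000u32;
    if (man & round_bit) != 0 && (man & (3 * round_bit - 1)) != 0 {
        // A mantissa carry here correctly bumps the exponent.
        ((half_sign | half_exp | half_man) + 1) as u16
    } else {
        (half_sign | half_exp | half_man) as u16
    }
}
```

In this sketch 12.5 maps to 0x4A40, 1e10 overflows to 0x7C00 (+∞), and 1e-10 underflows to 0x0000.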

Source

pub const fn from_f32_const(value: f32) -> f16

Constructs a 16-bit floating point value from a 32-bit floating point value.

This function is identical to from_f32 except it never uses hardware intrinsics, which allows it to be const. from_f32 should be preferred in any non-const context.

This operation is lossy. If the 32-bit value is too large to fit in 16 bits, ±∞ will result. NaN values are preserved. 32-bit subnormal values are too tiny to be represented in 16 bits and result in ±0. Exponents that underflow the minimum 16-bit exponent will result in 16-bit subnormals or ±0. All other values are truncated and rounded to the nearest representable 16-bit value.

Source

pub fn from_f32_instrinsic(value: f32) -> f16

Constructs a 16-bit floating point value from a 32-bit floating point value.

This operation is lossy. If the 32-bit value is too large to fit in 16 bits, ±∞ will result. NaN values are preserved. 32-bit subnormal values are too tiny to be represented in 16 bits and result in ±0. Exponents that underflow the minimum 16-bit exponent will result in 16-bit subnormals or ±0. All other values are truncated and rounded to the nearest representable 16-bit value.

Source

pub const fn from_f32_lossless(value: f32) -> Option<f16>

Creates an f16 losslessly from an f32.

This only succeeds if the f32 is non-finite (infinite or NaN), or if its exponent can be represented by a normal f16 and no non-zero bits would be truncated.
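
The condition can be sketched as a check on the f32 bit pattern. The helper below, `fits_f16_losslessly`, is hypothetical and follows the stated rule literally, so zeros and values that would become f16 subnormals are rejected; whether the crate treats those cases the same way is not confirmed here.

```rust
/// Hypothetical check mirroring the documented rule for an f32 input.
fn fits_f16_losslessly(value: f32) -> bool {
    // Infinities and NaN are passed through.
    if !value.is_finite() {
        return true;
    }
    let bits = value.to_bits();
    // Unbiased f32 exponent; normal binary16 exponents span -14..=15.
    let exponent = ((bits >> 23) & 0xFF) as i32 - 127;
    // binary16 keeps 10 of the 23 mantissa bits, so the low 13 must be zero.
    let truncated = bits & 0x1FFF;
    (-14..=15).contains(&exponent) && truncated == 0
}
```

For instance, 12.5 and 65504.0 pass this check, while 0.1 (non-zero low mantissa bits) and 65536.0 (exponent out of range) do not.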

“Lossless” does not mean the value prints the same as a decimal number. For example, the significant digits (excluding the hidden bit) of the f32 closest to 1e35, and of that same value stored as an f64, are:

f32: 110100001001100001100
f64: 11010000100110000110000000000000000000000000000000

However, the f64 is displayed as 1.0000000409184788e+35, while the f64 value closest to 1e35 is 11010000100110000101110010110001110100110110000010. This makes it look like precision has been lost, but this is due to the approximations used to represent binary values as a decimal.

This does not respect signalling NaNs: if the value is NaN or inf, then it will return that value.
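
The 1e35 effect described above can be reproduced with the standard f32 and f64 types alone. In this sketch, `widen_round_trip` is a hypothetical helper and no part of this crate is used.

```rust
/// Hypothetical helper: widens an f32 to f64, then narrows it back.
fn widen_round_trip(value: f32) -> (f64, f32) {
    // Widening is exact: every f32 value is representable as an f64.
    let wide = value as f64;
    // Narrowing the widened value recovers the original bits.
    (wide, wide as f32)
}
```

Calling `widen_round_trip(1e35_f32)` returns an f64 that differs from the literal `1e35_f64` even though no bits were lost: narrowing it again gives back exactly `1e35_f32`.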

Source

pub fn from_f64(value: f64) -> f16

Constructs a 16-bit floating point value from a 64-bit floating point value.

This operation is lossy. If the 64-bit value is too large to fit in 16 bits, ±∞ will result. NaN values are preserved. 64-bit subnormal values are too tiny to be represented in 16 bits and result in ±0. Exponents that underflow the minimum 16-bit exponent will result in 16-bit subnormals or ±0. All other values are truncated and rounded to the nearest representable 16-bit value.

This will prefer correctness over speed: on x86 systems, this currently uses a software implementation rather than an intrinsic.

Source

pub const fn from_f64_const(value: f64) -> f16

Constructs a 16-bit floating point value from a 64-bit floating point value.

This function is identical to from_f64 except it never uses hardware intrinsics, which allows it to be const. from_f64 should be preferred in any non-const context.

This operation is lossy. If the 64-bit value is too large to fit in 16 bits, ±∞ will result. NaN values are preserved. 64-bit subnormal values are too tiny to be represented in 16 bits and result in ±0. Exponents that underflow the minimum 16-bit exponent will result in 16-bit subnormals or ±0. All other values are truncated and rounded to the nearest representable 16-bit value.

Source

pub fn from_f64_instrinsic(value: f64) -> f16

Constructs a 16-bit floating point value from a 64-bit floating point value.

This operation is lossy. If the 64-bit value is too large to fit in 16 bits, ±∞ will result. NaN values are preserved. 64-bit subnormal values are too tiny to be represented in 16 bits and result in ±0. Exponents that underflow the minimum 16-bit exponent will result in 16-bit subnormals or ±0. All other values are truncated and rounded to the nearest representable 16-bit value.

This prefers to use vendor intrinsics if possible; otherwise, it uses a fallback. On x86 and x86_64, this can be more lossy than from_f64.

Source

pub const fn from_f64_lossless(value: f64) -> Option<f16>

Creates an f16 losslessly from an f64.

This only succeeds if the f64 is non-finite (infinite or NaN), or if its exponent can be represented by a normal f16 and no non-zero bits would be truncated.

“Lossless” does not mean the value prints the same as a decimal number. For example, the significant digits (excluding the hidden bit) of the f32 closest to 1e35, and of that same value stored as an f64, are:

f32: 110100001001100001100
f64: 11010000100110000110000000000000000000000000000000

However, the f64 is displayed as 1.0000000409184788e+35, while the f64 value closest to 1e35 is 11010000100110000101110010110001110100110110000010. This makes it look like precision has been lost, but this is due to the approximations used to represent binary values as a decimal.

This does not respect signalling NaNs: if the value is NaN or inf, then it will return that value.

Source

pub const fn to_bits(self) -> u16

Converts a f16 into the underlying bit representation.

Source

pub const fn to_le_bytes(self) -> [u8; 2]

Returns the memory representation of the underlying bit representation as a byte array in little-endian byte order.

§Examples

let bytes = f16::from_f32(12.5).to_le_bytes();
assert_eq!(bytes, [0x40, 0x4A]);

Source

pub const fn to_be_bytes(self) -> [u8; 2]

Returns the memory representation of the underlying bit representation as a byte array in big-endian (network) byte order.

§Examples

let bytes = f16::from_f32(12.5).to_be_bytes();
assert_eq!(bytes, [0x4A, 0x40]);

Source

pub const fn to_ne_bytes(self) -> [u8; 2]

Returns the memory representation of the underlying bit representation as a byte array in native byte order.

As the target platform’s native endianness is used, portable code should use to_be_bytes or to_le_bytes, as appropriate, instead.

§Examples

let bytes = f16::from_f32(12.5).to_ne_bytes();
assert_eq!(bytes, if cfg!(target_endian = "big") {
    [0x4A, 0x40]
} else {
    [0x40, 0x4A]
});

Source

pub const fn from_le_bytes(bytes: [u8; 2]) -> f16

Creates a floating point value from its representation as a byte array in little endian.

§Examples

let value = f16::from_le_bytes([0x40, 0x4A]);
assert_eq!(value, f16::from_f32(12.5));

Source

pub const fn from_be_bytes(bytes: [u8; 2]) -> f16

Creates a floating point value from its representation as a byte array in big endian.

§Examples

let value = f16::from_be_bytes([0x4A, 0x40]);
assert_eq!(value, f16::from_f32(12.5));

Source

pub const fn from_ne_bytes(bytes: [u8; 2]) -> f16

Creates a floating point value from its representation as a byte array in native endian.

As the target platform’s native endianness is used, portable code likely wants to use from_be_bytes or from_le_bytes, as appropriate, instead.

§Examples

let value = f16::from_ne_bytes(if cfg!(target_endian = "big") {
    [0x4A, 0x40]
} else {
    [0x40, 0x4A]
});
assert_eq!(value, f16::from_f32(12.5));

Source

pub fn to_f32(self) -> f32

Converts a f16 value into a f32 value.

This conversion is lossless as all 16-bit floating point values can be represented exactly in 32-bit floating point.
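
To illustrate why widening is exact, here is a software decoder from a binary16 bit pattern to f32, written in plain Rust. Both `pow2` and `half_bits_to_f32` are hypothetical helpers for this sketch, not the crate's implementation, and the sketch does not preserve NaN payloads.

```rust
/// Builds 2^exp as an f32 directly from its bit pattern
/// (exact for -126 <= exp <= 127).
fn pow2(exp: i32) -> f32 {
    f32::from_bits(((exp + 127) as u32) << 23)
}

/// Hypothetical software decoder: binary16 bit pattern -> f32.
fn half_bits_to_f32(bits: u16) -> f32 {
    let sign = if bits & 0x8000 != 0 { -1.0_f32 } else { 1.0 };
    let exponent = ((bits >> 10) & 0x1F) as i32; // biased by 15
    let mantissa = (bits & 0x03FF) as f32; // 10 stored bits, exact in f32
    let magnitude = match exponent {
        // Zero or subnormal: no hidden bit, fixed scale of 2^-24.
        0 => mantissa * pow2(-24),
        // All-ones exponent: infinity if the mantissa is empty, else NaN.
        0x1F if mantissa == 0.0 => f32::INFINITY,
        0x1F => f32::NAN,
        // Normal: hidden bit plus 10-bit fraction, scaled by 2^(e - 15).
        e => (1.0 + mantissa / 1024.0) * pow2(e - 15),
    };
    sign * magnitude
}
```

Every step uses at most 11 significant bits and exponents between -24 and 15, all of which fit comfortably in f32, so no rounding occurs.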

This will prefer correctness over speed. Currently, this always uses an intrinsic if available.

Source

pub const fn to_f32_const(self) -> f32

Converts a f16 value into a f32 value.

This function is identical to to_f32 except it never uses hardware intrinsics, which allows it to be const. to_f32 should be preferred in any non-const context.

This conversion is lossless as all 16-bit floating point values can be represented exactly in 32-bit floating point.

Source

pub fn to_f32_intrinsic(self) -> f32

Converts a f16 value into a f32 value.

This conversion is lossless as all 16-bit floating point values can be represented exactly in 32-bit floating point.

Source

pub fn as_f32(self) -> f32

Convert the data to an f32 type, used for numerical operations.

Source

pub const fn as_f32_const(self) -> f32

Convert the data to an f32 type, used for numerical operations.

Source

pub fn to_f64(self) -> f64

Converts a f16 value into a f64 value.

This conversion is lossless as all 16-bit floating point values can be represented exactly in 64-bit floating point.

This will prefer correctness over speed: on x86 systems, this currently uses a software implementation rather than an intrinsic.

Source

pub const fn to_f64_const(self) -> f64

Converts a f16 value into a f64 value.

This function is identical to to_f64 except it never uses hardware intrinsics, which allows it to be const. to_f64 should be preferred in any non-const context.

This conversion is lossless as all 16-bit floating point values can be represented exactly in 64-bit floating point.

Source

pub fn to_f64_intrinsic(self) -> f64

Converts a f16 value into a f64 value.

This conversion is lossless as all 16-bit floating point values can be represented exactly in 64-bit floating point.

Source

pub fn as_f64(self) -> f64

Convert the data to an f64 type, used for numerical operations.

Source

pub const fn as_f64_const(self) -> f64

Convert the data to an f64 type, used for numerical operations.

Source

pub const fn is_nan(self) -> bool

Returns true if this value is NaN and false otherwise.

§Examples


let nan = f16::NAN;
let f = f16::from_f32(7.0_f32);

assert!(nan.is_nan());
assert!(!f.is_nan());

Source

pub const fn abs(self) -> Self

Computes the absolute value of self.

Source

pub const fn is_infinite(self) -> bool

Returns true if this value is ±∞ and false otherwise.

§Examples


let f = f16::from_f32(7.0f32);
let inf = f16::INFINITY;
let neg_inf = f16::NEG_INFINITY;
let nan = f16::NAN;

assert!(!f.is_infinite());
assert!(!nan.is_infinite());

assert!(inf.is_infinite());
assert!(neg_inf.is_infinite());

Source

pub const fn is_finite(self) -> bool

Returns true if this number is neither infinite nor NaN.

§Examples


let f = f16::from_f32(7.0f32);
let inf = f16::INFINITY;
let neg_inf = f16::NEG_INFINITY;
let nan = f16::NAN;

assert!(f.is_finite());

assert!(!nan.is_finite());
assert!(!inf.is_finite());
assert!(!neg_inf.is_finite());

Source

pub const fn is_subnormal(self) -> bool

Returns true if the number is subnormal.

Source

pub const fn is_normal(self) -> bool

Returns true if the number is not zero, infinite, subnormal, or NaN.

§Examples


let min = f16::MIN_POSITIVE;
let max = f16::MAX;
let lower_than_min = f16::from_f32(1.0e-5_f32);
let zero = f16::from_f32(0.0_f32);

assert!(min.is_normal());
assert!(max.is_normal());

assert!(!zero.is_normal());
assert!(!f16::NAN.is_normal());
assert!(!f16::INFINITY.is_normal());
// Values between `0` and `min` are Subnormal.
assert!(!lower_than_min.is_normal());

Source

pub const fn classify(self) -> FpCategory

Returns the floating point category of the number.

If only one property is going to be tested, it is generally faster to use the specific predicate instead.

§Examples

use std::num::FpCategory;

let num = f16::from_f32(12.4_f32);
let inf = f16::INFINITY;

assert_eq!(num.classify(), FpCategory::Normal);
assert_eq!(inf.classify(), FpCategory::Infinite);

Source

pub const fn signum(self) -> f16

Returns a number that represents the sign of self.

1.0 if the number is positive, +0.0 or INFINITY
-1.0 if the number is negative, -0.0 or NEG_INFINITY
NAN if the number is NaN

§Examples


let f = f16::from_f32(3.5_f32);

assert_eq!(f.signum(), f16::from_f32(1.0));
assert_eq!(f16::NEG_INFINITY.signum(), f16::from_f32(-1.0));

assert!(f16::NAN.signum().is_nan());

Source

pub const fn is_sign_positive(self) -> bool

Returns true if and only if self has a positive sign, including +0.0, NaNs with a positive sign bit and +∞.

§Examples


let nan = f16::NAN;
let f = f16::from_f32(7.0_f32);
let g = f16::from_f32(-7.0_f32);

assert!(f.is_sign_positive());
assert!(!g.is_sign_positive());
// `NaN` can be either positive or negative
assert!(nan.is_sign_positive() != nan.is_sign_negative());

Source

pub const fn is_sign_negative(self) -> bool

Returns true if and only if self has a negative sign, including -0.0, NaNs with a negative sign bit and −∞.

§Examples


let nan = f16::NAN;
let f = f16::from_f32(7.0f32);
let g = f16::from_f32(-7.0f32);

assert!(!f.is_sign_negative());
assert!(g.is_sign_negative());
// `NaN` can be either positive or negative
assert!(nan.is_sign_positive() != nan.is_sign_negative());

Source

pub const fn copysign(self, sign: f16) -> f16

Returns a number composed of the magnitude of self and the sign of sign.

Equal to self if the sign of self and sign are the same, otherwise equal to -self. If self is NaN, then NaN with the sign of sign is returned.
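
On raw bit patterns, this amounts to combining the magnitude bits of one value with the sign bit of the other. A minimal sketch in plain Rust follows; `copysign_bits` is a hypothetical helper operating on `u16` patterns, not the crate's implementation.

```rust
/// Hypothetical helper: copysign on raw binary16 bit patterns.
fn copysign_bits(magnitude: u16, sign: u16) -> u16 {
    // Keep the low 15 bits (exponent and mantissa) of `magnitude`
    // and take only bit 15 (the sign bit) from `sign`.
    (magnitude & 0x7FFF) | (sign & 0x8000)
}
```

With 3.5 encoded as 0x4300, applying the sign of any negative value gives 0xC300 (-3.5); a NaN pattern keeps its payload and only its sign bit changes.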

§Examples

let f = f16::from_f32(3.5);

assert_eq!(f.copysign(f16::from_f32(0.42)), f16::from_f32(3.5));
assert_eq!(f.copysign(f16::from_f32(-0.42)), f16::from_f32(-3.5));
assert_eq!((-f).copysign(f16::from_f32(0.42)), f16::from_f32(3.5));
assert_eq!((-f).copysign(f16::from_f32(-0.42)), f16::from_f32(-3.5));

assert!(f16::NAN.copysign(f16::from_f32(1.0)).is_nan());

Source

pub fn recip(self) -> Self

Takes the reciprocal (inverse) of a number, 1/x.

Source

pub fn to_degrees(self) -> Self

Converts radians to degrees.

Source

pub fn to_radians(self) -> Self

Converts degrees to radians.

Source

pub const fn max(self, other: f16) -> f16

Returns the maximum of the two numbers.

If one of the arguments is NaN, then the other argument is returned.

§Examples

let x = f16::from_f32(1.0);
let y = f16::from_f32(2.0);

assert_eq!(x.max(y), y);
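
The NaN rule can be spelled out as a small reference function. The sketch below works on f32 so it compiles without the crate; `max_ignoring_nan` is a hypothetical name, and how the crate orders +0.0 against -0.0 is not covered here.

```rust
/// Hypothetical reference for the documented NaN rule, written for f32.
fn max_ignoring_nan(a: f32, b: f32) -> f32 {
    if a.is_nan() {
        b // the other argument is returned, even if it is also NaN
    } else if b.is_nan() {
        a
    } else if a > b {
        a
    } else {
        b
    }
}
```

The result is NaN only when both arguments are NaN.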

Source

pub const fn min(self, other: f16) -> f16

Returns the minimum of the two numbers.

If one of the arguments is NaN, then the other argument is returned.

§Examples

let x = f16::from_f32(1.0);
let y = f16::from_f32(2.0);

assert_eq!(x.min(y), x);

Source

pub const fn clamp(self, min: f16, max: f16) -> f16

Restrict a value to a certain interval unless it is NaN.

Returns max if self is greater than max, and min if self is less than min. Otherwise this returns self.

Note that this function returns NaN if the initial value was NaN as well.

§Panics

Panics if min > max, min is NaN, or max is NaN.

§Examples

assert!(f16::from_f32(-3.0).clamp(f16::from_f32(-2.0), f16::from_f32(1.0)) == f16::from_f32(-2.0));
assert!(f16::from_f32(0.0).clamp(f16::from_f32(-2.0), f16::from_f32(1.0)) == f16::from_f32(0.0));
assert!(f16::from_f32(2.0).clamp(f16::from_f32(-2.0), f16::from_f32(1.0)) == f16::from_f32(1.0));
assert!(f16::NAN.clamp(f16::from_f32(-2.0), f16::from_f32(1.0)).is_nan());

Source

pub fn total_cmp(&self, other: &Self) -> Ordering

Returns the ordering between self and other.

Unlike the standard partial comparison between floating point numbers, this comparison always produces an ordering in accordance with the totalOrder predicate as defined in the IEEE 754 (2008 revision) floating point standard. The values are ordered in the following sequence:

negative quiet NaN
negative signaling NaN
negative infinity
negative numbers
negative subnormal numbers
negative zero
positive zero
positive subnormal numbers
positive numbers
positive infinity
positive signaling NaN
positive quiet NaN.

The ordering established by this function does not always agree with the PartialOrd and PartialEq implementations of f16. For example, they consider negative and positive zero equal, while total_cmp doesn’t.

The interpretation of the signaling NaN bit follows the definition in the IEEE 754 standard, which may not match the interpretation by some of the older, non-conformant (e.g. MIPS) hardware implementations.
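
One common way to obtain this sequence is to map each bit pattern to a signed integer key whose ordinary integer order matches the list above. The sketch below does this for raw `u16` patterns in plain Rust; `total_order_key` and `total_cmp_bits` are hypothetical helpers, and the crate's actual implementation may differ.

```rust
use std::cmp::Ordering;

/// Hypothetical helper: maps a binary16 bit pattern to an i16 whose integer
/// order matches the totalOrder sequence.
fn total_order_key(bits: u16) -> i16 {
    let signed = bits as i16;
    // `signed >> 15` is all ones for negative patterns and zero otherwise.
    // Shifting that mask right by one (as unsigned) gives 0x7FFF or 0, so
    // negative patterns get every bit except the sign flipped, which makes
    // larger magnitudes sort lower.
    signed ^ ((((signed >> 15) as u16) >> 1) as i16)
}

/// Hypothetical helper: total ordering of two binary16 bit patterns.
fn total_cmp_bits(a: u16, b: u16) -> Ordering {
    total_order_key(a).cmp(&total_order_key(b))
}
```

Under this mapping, negative zero (0x8000) sorts just below positive zero (0x0000), and a positive quiet NaN (0x7E00) sorts above positive infinity (0x7C00).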

§Examples

let mut v: Vec<f16> = vec![];
v.push(f16::ONE);
v.push(f16::INFINITY);
v.push(f16::NEG_INFINITY);
v.push(f16::NAN);
v.push(f16::MAX_SUBNORMAL);
v.push(-f16::MAX_SUBNORMAL);
v.push(f16::ZERO);
v.push(f16::NEG_ZERO);
v.push(f16::NEG_ONE);
v.push(f16::MIN_POSITIVE);

v.sort_by(|a, b| a.total_cmp(b));

assert!(v
    .into_iter()
    .zip(
        [
            f16::NEG_INFINITY,
            f16::NEG_ONE,
            -f16::MAX_SUBNORMAL,
            f16::NEG_ZERO,
            f16::ZERO,
            f16::MAX_SUBNORMAL,
            f16::MIN_POSITIVE,
            f16::ONE,
            f16::INFINITY,
            f16::NAN
        ]
        .iter()
    )
    .all(|(a, b)| a.to_bits() == b.to_bits()));

Source §