#[repr(C)]
pub struct f16(/* private fields */);
A 16-bit floating point type implementing the IEEE 754-2008 standard binary16, a.k.a. “half” format.
This 16-bit floating point type is intended for efficient storage where the full range and precision of a larger floating point value is not required.
Implementations§
impl f16
pub const EPSILON: f16
f16 machine epsilon value.
This is the difference between 1.0 and the next larger representable number.
pub const MANTISSA_DIGITS: u32 = 11u32
Number of f16 significant digits in base 2
pub const MAX_10_EXP: i32 = 4i32
Maximum possible f16 power of 10 exponent
pub const MIN_10_EXP: i32 = -4i32
Minimum possible normal f16 power of 10 exponent
pub const MIN_EXP: i32 = -13i32
One greater than the minimum possible normal f16 power of 2 exponent
pub const MIN_POSITIVE: f16
Smallest positive normal f16 value
pub const NEG_INFINITY: f16
f16 negative infinity (-∞)
pub const MIN_POSITIVE_SUBNORMAL: f16
Minimum positive subnormal f16 value
pub const MAX_SUBNORMAL: f16
Maximum subnormal f16 value
pub const FRAC_1_SQRT_2: f16
f16 1/√2
pub const FRAC_2_SQRT_PI: f16
f16 2/√π
pub const HIDDEN_BIT_MASK: u16 = 1_024u16
Mask for the hidden bit.
pub const NEG_TINY_BITS: u16 = 32_769u16
Bit pattern of the negative value closest to zero (the smallest-magnitude negative subnormal)
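The constants above all follow from the binary16 layout: 1 sign bit, 5 exponent bits with a bias of 15, and 10 stored significand bits plus a hidden bit. The sketch below decodes a raw `u16` bit pattern into the `f64` holding the same value, which makes it easy to check constants such as `HIDDEN_BIT_MASK` (1024) and `NEG_TINY_BITS` (32769, i.e. `0x8001`). It works on plain integers rather than on this crate's `f16`, and the names `pow2` and `half_bits_to_f64` are hypothetical helpers, not part of the API.

```rust
/// Exact power of two as an f64, built directly from the f64 exponent field.
/// Only valid for exponents in the normal f64 range, which is all this sketch needs.
fn pow2(exp: i32) -> f64 {
    f64::from_bits(((exp + 1023) as u64) << 52)
}

/// Decodes a binary16 bit pattern into the f64 holding the same value.
/// Hypothetical helper for illustration; not part of the documented API.
fn half_bits_to_f64(bits: u16) -> f64 {
    let sign = if bits & 0x8000 != 0 { -1.0 } else { 1.0 };
    let exp = ((bits >> 10) & 0x1F) as i32;
    let frac = (bits & 0x03FF) as f64;
    match exp {
        // Zero or subnormal: no hidden bit, fixed scale of 2^-24.
        0 => sign * frac * pow2(-24),
        // All exponent bits set: infinity if the fraction is zero, NaN otherwise.
        0x1F => {
            if frac == 0.0 {
                sign * f64::INFINITY
            } else {
                f64::NAN
            }
        }
        // Normal: hidden bit (1024) plus fraction, scaled by 2^(exp - 15 - 10).
        _ => sign * (1024.0 + frac) * pow2(exp - 25),
    }
}
```

Under this layout, `0x0400` decodes to 2^-14 (the smallest positive normal value), `0x0001` to 2^-24 (the smallest positive subnormal), and `0x8001` to -2^-24.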
pub const fn from_bits(bits: u16) -> f16
Constructs a 16-bit floating point value from the raw bits.
pub fn from_f32(value: f32) -> f16
Constructs a 16-bit floating point value from a 32-bit floating point value.
This operation is lossy. If the 32-bit value is too large to fit in 16 bits, ±∞ will result. NaN values are preserved. 32-bit subnormal values are too tiny to be represented in 16 bits and result in ±0. Exponents that underflow the minimum 16-bit exponent will result in 16-bit subnormals or ±0. All other values are truncated and rounded to the nearest representable 16-bit value.
This will prefer correctness over speed. Currently, this always uses an intrinsic if available.
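The rules above (overflow to ±∞, NaN preserved, underflow to subnormals or ±0, rounding otherwise) can be sketched in software on raw bit patterns. The function below is a hypothetical, std-only illustration that returns the binary16 bits as a `u16`. It assumes round-to-nearest with ties to even, which the text does not state explicitly, and it is not presented as this crate's actual implementation.

```rust
/// Converts an f32 to binary16 bits. Hypothetical sketch, assuming
/// round-to-nearest, ties-to-even.
fn f32_to_f16_bits(value: f32) -> u16 {
    let x = value.to_bits();
    let sign = (x & 0x8000_0000) >> 16;
    let exp = x & 0x7F80_0000;
    let man = x & 0x007F_FFFF;

    // Infinity or NaN: keep the top payload bits and force a NaN to stay a NaN.
    if exp == 0x7F80_0000 {
        let nan_bit = if man == 0 { 0 } else { 0x0200 };
        return (sign | 0x7C00 | nan_bit | (man >> 13)) as u16;
    }

    // Re-bias the exponent from f32 (bias 127) to binary16 (bias 15).
    let half_exp = ((exp >> 23) as i32) - 127 + 15;

    // Too large for 16 bits: ±∞.
    if half_exp >= 0x1F {
        return (sign | 0x7C00) as u16;
    }

    // Exponent underflow: subnormal result or ±0 (f32 subnormals land here too).
    if half_exp <= 0 {
        if 14 - half_exp > 24 {
            return sign as u16;
        }
        let man = man | 0x0080_0000; // restore the hidden bit
        let mut half_man = man >> (14 - half_exp);
        let round_bit = 1u32 << (13 - half_exp);
        // Round up when above halfway, or exactly halfway with an odd result.
        if (man & round_bit) != 0 && (man & (3 * round_bit - 1)) != 0 {
            half_man += 1;
        }
        return (sign | half_man) as u16;
    }

    // Normal result: drop 13 significand bits with the same rounding rule.
    let bits = sign | ((half_exp as u32) << 10) | (man >> 13);
    let round_bit = 0x0000_1000u32;
    if (man & round_bit) != 0 && (man & (3 * round_bit - 1)) != 0 {
        // A carry out of the significand correctly bumps the exponent (up to ±∞).
        (bits + 1) as u16
    } else {
        bits as u16
    }
}
```

For instance, `12.5` maps to `0x4A40`, which matches the byte values used in the `to_le_bytes` examples on this page.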
pub const fn from_f32_const(value: f32) -> f16
Constructs a 16-bit floating point value from a 32-bit floating point value.
This function is identical to from_f32 except it never uses hardware intrinsics, which allows it to be const. from_f32 should be preferred in any non-const context.
This operation is lossy. If the 32-bit value is too large to fit in 16 bits, ±∞ will result. NaN values are preserved. 32-bit subnormal values are too tiny to be represented in 16 bits and result in ±0. Exponents that underflow the minimum 16-bit exponent will result in 16-bit subnormals or ±0. All other values are truncated and rounded to the nearest representable 16-bit value.
pub fn from_f32_instrinsic(value: f32) -> f16
Constructs a 16-bit floating point value from a 32-bit floating point value.
This operation is lossy. If the 32-bit value is too large to fit in 16 bits, ±∞ will result. NaN values are preserved. 32-bit subnormal values are too tiny to be represented in 16 bits and result in ±0. Exponents that underflow the minimum 16-bit exponent will result in 16-bit subnormals or ±0. All other values are truncated and rounded to the nearest representable 16-bit value.
pub const fn from_f32_lossless(value: f32) -> Option<f16>
Creates an f16 losslessly from an f32.
This returns a value only if the f32 is non-finite (infinite or NaN), or if its exponent can be represented by a normal f16 and no non-zero bits would be truncated.
“Lossless” does not mean the data is represented the same as a decimal number. For example, an f32 and f64 have the significant digits (excluding the hidden bit) for a value closest to 1e35 of:
f32: 110100001001100001100
f64: 11010000100110000110000000000000000000000000000000
However, the f64 is displayed as 1.0000000409184788e+35, while the value closest to 1e35 in f64 is 11010000100110000101110010110001110100110110000010. This makes it look like precision has been lost, but this is due to the approximations used to represent binary values as a decimal.
This does not respect signalling NaNs: if the value is NaN or inf, then it will return that value.
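The lossless condition described above can be sketched as a check on the `f32` bit pattern: the unbiased exponent must fall in the normal binary16 range (-14 to 15), and the 13 low significand bits that binary16 cannot store must all be zero. The function below is a hypothetical illustration, not the crate's code. In particular, treating ±0 as lossless is an assumption, since the text above does not say how zero is handled.

```rust
/// Returns true if `value` could be stored in binary16 without losing bits,
/// following the rule in the text. Hypothetical sketch.
fn fits_in_half_losslessly(value: f32) -> bool {
    // Infinities and NaN are always accepted.
    if !value.is_finite() {
        return true;
    }
    // Assumption: signed zero is accepted (not settled by the text).
    if value == 0.0 {
        return true;
    }
    let bits = value.to_bits();
    let unbiased_exp = ((bits >> 23) & 0xFF) as i32 - 127;
    // binary16 keeps 10 of the 23 stored significand bits; the low 13 are dropped.
    let dropped_bits = bits & 0x1FFF;
    (-14..=15).contains(&unbiased_exp) && dropped_bits == 0
}
```

With this rule, `12.5` and `65504.0` pass, while `0.1` (non-zero low bits) and `65536.0` (exponent too large) do not.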
pub fn from_f64(value: f64) -> f16
Constructs a 16-bit floating point value from a 64-bit floating point value.
This operation is lossy. If the 64-bit value is too large to fit in 16 bits, ±∞ will result. NaN values are preserved. 64-bit subnormal values are too tiny to be represented in 16 bits and result in ±0. Exponents that underflow the minimum 16-bit exponent will result in 16-bit subnormals or ±0. All other values are truncated and rounded to the nearest representable 16-bit value.
This will prefer correctness over speed: on x86 systems, this currently uses a software implementation rather than an intrinsic.
pub const fn from_f64_const(value: f64) -> f16
Constructs a 16-bit floating point value from a 64-bit floating point value.
This function is identical to from_f64 except it never uses hardware intrinsics, which allows it to be const. from_f64 should be preferred in any non-const context.
This operation is lossy. If the 64-bit value is too large to fit in 16 bits, ±∞ will result. NaN values are preserved. 64-bit subnormal values are too tiny to be represented in 16 bits and result in ±0. Exponents that underflow the minimum 16-bit exponent will result in 16-bit subnormals or ±0. All other values are truncated and rounded to the nearest representable 16-bit value.
pub fn from_f64_instrinsic(value: f64) -> f16
Constructs a 16-bit floating point value from a 64-bit floating point value.
This operation is lossy. If the 64-bit value is too large to fit in 16 bits, ±∞ will result. NaN values are preserved. 64-bit subnormal values are too tiny to be represented in 16 bits and result in ±0. Exponents that underflow the minimum 16-bit exponent will result in 16-bit subnormals or ±0. All other values are truncated and rounded to the nearest representable 16-bit value.
This prefers to use vendor intrinsics if possible; otherwise, it uses a fallback. On x86 and x86_64, this can be more lossy than from_f64.
pub const fn from_f64_lossless(value: f64) -> Option<f16>
Creates an f16 losslessly from an f64.
This returns a value only if the f64 is non-finite (infinite or NaN), or if its exponent can be represented by a normal f16 and no non-zero bits would be truncated.
“Lossless” does not mean the data is represented the same as a decimal number. For example, an f32 and f64 have the significant digits (excluding the hidden bit) for a value closest to 1e35 of:
f32: 110100001001100001100
f64: 11010000100110000110000000000000000000000000000000
However, the f64 is displayed as 1.0000000409184788e+35, while the value closest to 1e35 in f64 is 11010000100110000101110010110001110100110110000010. This makes it look like precision has been lost, but this is due to the approximations used to represent binary values as a decimal.
This does not respect signalling NaNs: if the value is NaN or inf, then it will return that value.
pub const fn to_le_bytes(self) -> [u8; 2]
Returns the memory representation of the underlying bit representation as a byte array in little-endian byte order.
§Examples
let bytes = f16::from_f32(12.5).to_le_bytes();
assert_eq!(bytes, [0x40, 0x4A]);
pub const fn to_be_bytes(self) -> [u8; 2]
Returns the memory representation of the underlying bit representation as a byte array in big-endian (network) byte order.
§Examples
let bytes = f16::from_f32(12.5).to_be_bytes();
assert_eq!(bytes, [0x4A, 0x40]);
pub const fn to_ne_bytes(self) -> [u8; 2]
Returns the memory representation of the underlying bit representation as a byte array in native byte order.
As the target platform’s native endianness is used, portable code should use to_be_bytes or to_le_bytes, as appropriate, instead.
§Examples
let bytes = f16::from_f32(12.5).to_ne_bytes();
assert_eq!(bytes, if cfg!(target_endian = "big") {
[0x4A, 0x40]
} else {
[0x40, 0x4A]
});
pub const fn from_le_bytes(bytes: [u8; 2]) -> f16
Creates a floating point value from its representation as a byte array in little endian.
§Examples
let value = f16::from_le_bytes([0x40, 0x4A]);
assert_eq!(value, f16::from_f32(12.5));
pub const fn from_be_bytes(bytes: [u8; 2]) -> f16
Creates a floating point value from its representation as a byte array in big endian.
§Examples
let value = f16::from_be_bytes([0x4A, 0x40]);
assert_eq!(value, f16::from_f32(12.5));
pub const fn from_ne_bytes(bytes: [u8; 2]) -> f16
Creates a floating point value from its representation as a byte array in native endian.
As the target platform’s native endianness is used, portable code likely wants to use from_be_bytes or from_le_bytes, as appropriate, instead.
§Examples
let value = f16::from_ne_bytes(if cfg!(target_endian = "big") {
[0x4A, 0x40]
} else {
[0x40, 0x4A]
});
assert_eq!(value, f16::from_f32(12.5));
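The byte-order methods are mainly useful for storing many half-precision values in a portable buffer. As a minimal sketch of that pattern, the helpers below pack and unpack raw 16-bit patterns in little-endian order using `u16::to_le_bytes` and `u16::from_le_bytes` from the standard library. The names `encode_le` and `decode_le` are hypothetical. With the `f16` type on this page, the per-value calls would presumably be `to_le_bytes` and `from_le_bytes` instead.

```rust
/// Packs 16-bit patterns into a little-endian byte buffer (hypothetical helper).
fn encode_le(values: &[u16]) -> Vec<u8> {
    values.iter().flat_map(|bits| bits.to_le_bytes()).collect()
}

/// Unpacks a little-endian byte buffer back into 16-bit patterns.
/// Any trailing odd byte is ignored by `chunks_exact`.
fn decode_le(bytes: &[u8]) -> Vec<u16> {
    bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect()
}
```

Encoding the pattern `0x4A40` (the bits of 12.5 in the examples above) yields the bytes `[0x40, 0x4A]`, and decoding restores the original pattern on any platform.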
pub fn to_f32(self) -> f32
Converts an f16 value into an f32 value.
This conversion is lossless as all 16-bit floating point values can be represented exactly in 32-bit floating point.
This will prefer correctness over speed. Currently, this always uses an intrinsic if available.
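The widening direction is lossless because every binary16 exponent and significand fits inside the wider `f32` fields. The sketch below shows one way to do this on raw bits, including the renormalization of subnormal inputs. It is a hypothetical, std-only illustration (the name `half_bits_to_f32` is not part of the API) and does not claim to mirror the crate's software fallback.

```rust
/// Widens a binary16 bit pattern to the f32 with the same value.
/// Hypothetical sketch for illustration.
fn half_bits_to_f32(bits: u16) -> f32 {
    let sign = ((bits & 0x8000) as u32) << 16;
    let exp = ((bits >> 10) & 0x1F) as u32;
    let man = (bits & 0x03FF) as u32;

    let out = match (exp, man) {
        // Signed zero.
        (0, 0) => sign,
        // Subnormal: shift the leading one up into the hidden-bit position
        // and lower the exponent by the same amount.
        (0, _) => {
            let shift = man.leading_zeros() - 21;
            let man = (man << shift) & 0x03FF;
            let exp = 127 - 15 + 1 - shift;
            sign | (exp << 23) | (man << 13)
        }
        // Infinity.
        (0x1F, 0) => sign | 0x7F80_0000,
        // NaN: keep the payload and set the quiet bit.
        (0x1F, _) => sign | 0x7FC0_0000 | (man << 13),
        // Normal: re-bias the exponent from 15 to 127.
        _ => sign | ((exp + 127 - 15) << 23) | (man << 13),
    };
    f32::from_bits(out)
}
```

Every finite result is exact: no rounding step appears anywhere in the function.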
pub const fn to_f32_const(self) -> f32
Converts an f16 value into an f32 value.
This function is identical to to_f32 except it never uses hardware intrinsics, which allows it to be const. to_f32 should be preferred in any non-const context.
This conversion is lossless as all 16-bit floating point values can be represented exactly in 32-bit floating point.
pub fn to_f32_intrinsic(self) -> f32
Converts an f16 value into an f32 value.
This conversion is lossless as all 16-bit floating point values can be represented exactly in 32-bit floating point.
pub const fn as_f32_const(self) -> f32
Converts the data to an f32 type, used for numerical operations.
pub fn to_f64(self) -> f64
Converts an f16 value into an f64 value.
This conversion is lossless as all 16-bit floating point values can be represented exactly in 64-bit floating point.
This will prefer correctness over speed: on x86 systems, this currently uses a software implementation rather than an intrinsic.
pub const fn to_f64_const(self) -> f64
Converts an f16 value into an f64 value.
This function is identical to to_f64 except it never uses hardware intrinsics, which allows it to be const. to_f64 should be preferred in any non-const context.
This conversion is lossless as all 16-bit floating point values can be represented exactly in 64-bit floating point.
pub fn to_f64_intrinsic(self) -> f64
Converts an f16 value into an f64 value.
This conversion is lossless as all 16-bit floating point values can be represented exactly in 64-bit floating point.
pub const fn as_f64_const(self) -> f64
Converts the data to an f64 type, used for numerical operations.
pub const fn is_nan(self) -> bool
Returns true if this value is NaN and false otherwise.
§Examples
let nan = f16::NAN;
let f = f16::from_f32(7.0_f32);
assert!(nan.is_nan());
assert!(!f.is_nan());
pub const fn is_infinite(self) -> bool
Returns true if this value is ±∞ and false otherwise.
§Examples
let f = f16::from_f32(7.0f32);
let inf = f16::INFINITY;
let neg_inf = f16::NEG_INFINITY;
let nan = f16::NAN;
assert!(!f.is_infinite());
assert!(!nan.is_infinite());
assert!(inf.is_infinite());
assert!(neg_inf.is_infinite());
pub const fn is_finite(self) -> bool
Returns true if this number is neither infinite nor NaN.
§Examples
let f = f16::from_f32(7.0f32);
let inf = f16::INFINITY;
let neg_inf = f16::NEG_INFINITY;
let nan = f16::NAN;
assert!(f.is_finite());
assert!(!nan.is_finite());
assert!(!inf.is_finite());
assert!(!neg_inf.is_finite());
pub const fn is_subnormal(self) -> bool
Returns true if the number is subnormal.
pub const fn is_normal(self) -> bool
Returns true if the number is not zero, infinite, subnormal, or NaN.
§Examples
let min = f16::MIN_POSITIVE;
let max = f16::MAX;
let lower_than_min = f16::from_f32(1.0e-10_f32);
let zero = f16::from_f32(0.0_f32);
assert!(min.is_normal());
assert!(max.is_normal());
assert!(!zero.is_normal());
assert!(!f16::NAN.is_normal());
assert!(!f16::INFINITY.is_normal());
// Values between `0` and `min` are Subnormal.
assert!(!lower_than_min.is_normal());
pub const fn classify(self) -> FpCategory
Returns the floating point category of the number.
If only one property is going to be tested, it is generally faster to use the specific predicate instead.
§Examples
use std::num::FpCategory;
let num = f16::from_f32(12.4_f32);
let inf = f16::INFINITY;
assert_eq!(num.classify(), FpCategory::Normal);
assert_eq!(inf.classify(), FpCategory::Infinite);
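All of the predicates above (`is_nan`, `is_infinite`, `is_finite`, `is_subnormal`, `is_normal`) and `classify` itself reduce to inspecting the 5-bit exponent field and the 10-bit fraction of the binary16 encoding. The following is a minimal sketch of that decision table on a raw `u16`. The function name `classify_half_bits` is hypothetical, and the code is an illustration of the encoding rather than the crate's implementation.

```rust
use std::num::FpCategory;

/// Classifies a binary16 bit pattern by its exponent and fraction fields.
/// Hypothetical helper for illustration.
fn classify_half_bits(bits: u16) -> FpCategory {
    let exp = bits & 0x7C00; // 5 exponent bits
    let man = bits & 0x03FF; // 10 fraction bits
    match (exp, man) {
        (0, 0) => FpCategory::Zero,
        (0, _) => FpCategory::Subnormal,
        (0x7C00, 0) => FpCategory::Infinite,
        (0x7C00, _) => FpCategory::Nan,
        _ => FpCategory::Normal,
    }
}
```

The sign bit is masked out by both fields, so `0x8000` (negative zero) and `0xFC00` (negative infinity) classify the same way as their positive counterparts.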
pub const fn signum(self) -> f16
Returns a number that represents the sign of self.
- 1.0 if the number is positive, +0.0 or INFINITY
- -1.0 if the number is negative, -0.0 or NEG_INFINITY
- NAN if the number is NaN
§Examples
let f = f16::from_f32(3.5_f32);
assert_eq!(f.signum(), f16::from_f32(1.0));
assert_eq!(f16::NEG_INFINITY.signum(), f16::from_f32(-1.0));
assert!(f16::NAN.signum().is_nan());
pub const fn is_sign_positive(self) -> bool
Returns true if and only if self has a positive sign, including +0.0, NaNs with a positive sign bit and +∞.
§Examples
let nan = f16::NAN;
let f = f16::from_f32(7.0_f32);
let g = f16::from_f32(-7.0_f32);
assert!(f.is_sign_positive());
assert!(!g.is_sign_positive());
// `NaN` can be either positive or negative
assert!(nan.is_sign_positive() != nan.is_sign_negative());
pub const fn is_sign_negative(self) -> bool
Returns true if and only if self has a negative sign, including -0.0, NaNs with a negative sign bit and −∞.
§Examples
let nan = f16::NAN;
let f = f16::from_f32(7.0f32);
let g = f16::from_f32(-7.0f32);
assert!(!f.is_sign_negative());
assert!(g.is_sign_negative());
// `NaN` can be either positive or negative
assert!(nan.is_sign_positive() != nan.is_sign_negative());
pub const fn copysign(self, sign: f16) -> f16
Returns a number composed of the magnitude of self and the sign of sign.
Equal to self if the sign of self and sign are the same, otherwise equal to -self. If self is NaN, then NaN with the sign of sign is returned.
§Examples
let f = f16::from_f32(3.5);
assert_eq!(f.copysign(f16::from_f32(0.42)), f16::from_f32(3.5));
assert_eq!(f.copysign(f16::from_f32(-0.42)), f16::from_f32(-3.5));
assert_eq!((-f).copysign(f16::from_f32(0.42)), f16::from_f32(3.5));
assert_eq!((-f).copysign(f16::from_f32(-0.42)), f16::from_f32(-3.5));
assert!(f16::NAN.copysign(f16::from_f32(1.0)).is_nan());
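Because the sign lives in a single bit (bit 15) of the binary16 encoding, `copysign` and the sign predicates can be expressed as pure mask operations. The sketch below shows this on raw `u16` patterns. The names `SIGN_MASK`, `copysign_bits`, and `is_sign_negative_bits` are hypothetical and chosen for illustration only.

```rust
/// Bit 15 holds the sign in the binary16 encoding.
const SIGN_MASK: u16 = 0x8000;

/// Combines the magnitude bits of `magnitude` with the sign bit of `sign`.
/// Works unchanged for NaN, infinity, and zero inputs.
fn copysign_bits(magnitude: u16, sign: u16) -> u16 {
    (magnitude & !SIGN_MASK) | (sign & SIGN_MASK)
}

/// True when the sign bit is set, including for -0.0 and negative NaNs.
fn is_sign_negative_bits(bits: u16) -> bool {
    bits & SIGN_MASK != 0
}
```

For example, `0x4300` encodes 3.5, and copying the sign of `0x8000` (negative zero) onto it gives `0xC300`, which encodes -3.5.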
pub fn to_degrees(self) -> Self
Converts radians to degrees.
pub fn to_radians(self) -> Self
Converts degrees to radians.
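The two conversions are the usual scalings by 180/π and π/180. As a small sketch of the arithmetic, the functions below perform it in `f32`. How the `f16` methods compute and round their results is not specified here, so treat these hypothetical helpers only as an illustration of the formulas.

```rust
use std::f32::consts::PI;

/// degrees = radians * 180 / π (hypothetical helper, computed in f32).
fn radians_to_degrees(radians: f32) -> f32 {
    radians * (180.0 / PI)
}

/// radians = degrees * π / 180 (hypothetical helper, computed in f32).
fn degrees_to_radians(degrees: f32) -> f32 {
    degrees * (PI / 180.0)
}
```

A half-precision result carries only 11 significant bits, so values such as 180.0 converted to radians can only approximate π.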
pub const fn max(self, other: f16) -> f16
Returns the maximum of the two numbers.
If one of the arguments is NaN, then the other argument is returned.
§Examples
let x = f16::from_f32(1.0);
let y = f16::from_f32(2.0);
assert_eq!(x.max(y), y);
pub const fn min(self, other: f16) -> f16
Returns the minimum of the two numbers.
If one of the arguments is NaN, then the other argument is returned.
§Examples
let x = f16::from_f32(1.0);
let y = f16::from_f32(2.0);
assert_eq!(x.min(y), x);
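The NaN rule for `max` and `min` (a NaN argument is ignored in favour of the other one) can be sketched as follows. The example uses `f32` so that it is self-contained, and the function names are hypothetical. It illustrates the documented behaviour, not the crate's code, and it does not specify which zero is returned when comparing +0.0 with -0.0.

```rust
/// Maximum of two values where a NaN argument yields the other argument.
fn max_ignoring_nan(a: f32, b: f32) -> f32 {
    if a.is_nan() {
        b
    } else if b.is_nan() {
        a
    } else if a > b {
        a
    } else {
        b
    }
}

/// Minimum of two values where a NaN argument yields the other argument.
fn min_ignoring_nan(a: f32, b: f32) -> f32 {
    if a.is_nan() {
        b
    } else if b.is_nan() {
        a
    } else if a < b {
        a
    } else {
        b
    }
}
```

The result is NaN only when both arguments are NaN.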
pub const fn clamp(self, min: f16, max: f16) -> f16
Restrict a value to a certain interval unless it is NaN.
Returns max if self is greater than max, and min if self is less than min. Otherwise this returns self.
Note that this function returns NaN if the initial value was NaN as well.
§Panics
Panics if min > max, min is NaN, or max is NaN.
§Examples
assert!(f16::from_f32(-3.0).clamp(f16::from_f32(-2.0), f16::from_f32(1.0)) == f16::from_f32(-2.0));
assert!(f16::from_f32(0.0).clamp(f16::from_f32(-2.0), f16::from_f32(1.0)) == f16::from_f32(0.0));
assert!(f16::from_f32(2.0).clamp(f16::from_f32(-2.0), f16::from_f32(1.0)) == f16::from_f32(1.0));
assert!(f16::NAN.clamp(f16::from_f32(-2.0), f16::from_f32(1.0)).is_nan());
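The three rules for `clamp` (panic on an invalid interval, pass NaN through, otherwise restrict to the interval) fit in a few lines. The sketch below uses `f32` to stay self-contained, and `clamp_sketch` is a hypothetical name. It mirrors the documented behaviour rather than reproducing the crate's implementation.

```rust
/// Restricts `value` to `[min, max]`, following the rules described above.
fn clamp_sketch(value: f32, min: f32, max: f32) -> f32 {
    // `min <= max` is false when either bound is NaN, so this single check
    // covers all three documented panic conditions.
    assert!(min <= max, "min must not exceed max, and neither may be NaN");
    if value < min {
        min
    } else if value > max {
        max
    } else {
        // Reached for in-range values and for NaN, since every comparison
        // with NaN is false.
        value
    }
}
```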
pub fn total_cmp(&self, other: &Self) -> Ordering
Returns the ordering between self and other.
Unlike the standard partial comparison between floating point numbers, this comparison always produces an ordering in accordance with the totalOrder predicate as defined in the IEEE 754 (2008 revision) floating point standard. The values are ordered in the following sequence:
- negative quiet NaN
- negative signaling NaN
- negative infinity
- negative numbers
- negative subnormal numbers
- negative zero
- positive zero
- positive subnormal numbers
- positive numbers
- positive infinity
- positive signaling NaN
- positive quiet NaN.
The ordering established by this function does not always agree with the PartialOrd and PartialEq implementations of f16. For example, they consider negative and positive zero equal, while total_cmp doesn’t.
The interpretation of the signaling NaN bit follows the definition in the IEEE 754 standard, which may not match the interpretation by some of the older, non-conformant (e.g. MIPS) hardware implementations.
§Examples
let mut v: Vec<f16> = vec![];
v.push(f16::ONE);
v.push(f16::INFINITY);
v.push(f16::NEG_INFINITY);
v.push(f16::NAN);
v.push(f16::MAX_SUBNORMAL);
v.push(-f16::MAX_SUBNORMAL);
v.push(f16::ZERO);
v.push(f16::NEG_ZERO);
v.push(f16::NEG_ONE);
v.push(f16::MIN_POSITIVE);
v.sort_by(|a, b| a.total_cmp(&b));
assert!(v
.into_iter()
.zip(
[
f16::NEG_INFINITY,
f16::NEG_ONE,
-f16::MAX_SUBNORMAL,
f16::NEG_ZERO,
f16::ZERO,
f16::MAX_SUBNORMAL,
f16::MIN_POSITIVE,
f16::ONE,
f16::INFINITY,
f16::NAN
]
.iter()
)
.all(|(a, b)| a.to_bits() == b.to_bits()));
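A common way to obtain the totalOrder sequence is to reinterpret the bits as a signed integer and, for negative values, flip every bit except the sign, so that larger negative magnitudes compare as smaller. The sketch below applies that trick to raw binary16 patterns. It is the same idea the standard library documents for `f32::total_cmp`, but the names here are hypothetical and the text above does not confirm that this crate uses it.

```rust
use std::cmp::Ordering;

/// Maps a binary16 bit pattern to a signed key whose integer order matches
/// the IEEE 754 totalOrder sequence.
fn total_order_key(bits: u16) -> i16 {
    let signed = bits as i16;
    // `signed >> 15` is all ones for negative inputs and zero otherwise;
    // dropping the top bit gives a mask that leaves the sign bit untouched.
    signed ^ ((((signed >> 15) as u16) >> 1) as i16)
}

/// Compares two binary16 bit patterns in totalOrder.
fn total_cmp_bits(a: u16, b: u16) -> Ordering {
    total_order_key(a).cmp(&total_order_key(b))
}
```

Under this ordering `0x8000` (negative zero) sorts strictly before `0x0000` (positive zero), and a positive quiet NaN such as `0x7E00` sorts after positive infinity.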
Trait Implementations§
impl AddAssign<&f16> for f16
fn add_assign(&mut self, rhs: &f16)
Performs the += operation.
impl AddAssign for f16
fn add_assign(&mut self, rhs: Self)
Performs the += operation.
impl DivAssign<&f16> for f16
fn div_assign(&mut self, rhs: &f16)
Performs the /= operation.
impl DivAssign for f16
fn div_assign(&mut self, rhs: Self)
Performs the /= operation.
impl MulAssign<&f16> for f16
fn mul_assign(&mut self, rhs: &f16)
Performs the *= operation.
impl MulAssign for f16
fn mul_assign(&mut self, rhs: Self)
Performs the *= operation.
impl PartialOrd for f16
impl RemAssign<&f16> for f16
fn rem_assign(&mut self, rhs: &f16)
Performs the %= operation.
impl RemAssign for f16
fn rem_assign(&mut self, rhs: Self)
Performs the %= operation.
impl SubAssign<&f16> for f16
fn sub_assign(&mut self, rhs: &f16)
Performs the -= operation.
impl SubAssign for f16
fn sub_assign(&mut self, rhs: Self)
Performs the -= operation.
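Each compound-assignment trait is implemented twice, once for an owned right-hand side and once for a reference, so that `x += y` and `x += &y` both compile. The sketch below shows that pairing on a hypothetical newtype `Wrapped(f32)`. It stands in for a small float wrapper and says nothing about how `f16` performs the arithmetic internally.

```rust
use std::ops::AddAssign;

/// Hypothetical newtype used only to illustrate the trait pairing.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Wrapped(f32);

impl AddAssign for Wrapped {
    fn add_assign(&mut self, rhs: Self) {
        self.0 += rhs.0;
    }
}

impl AddAssign<&Wrapped> for Wrapped {
    fn add_assign(&mut self, rhs: &Wrapped) {
        // Delegate to the by-value implementation; `Wrapped` is `Copy`.
        *self += *rhs;
    }
}

/// Sums a slice; iterating yields `&Wrapped`, so the by-reference impl is used.
fn sum_all(values: &[Wrapped]) -> Wrapped {
    let mut total = Wrapped(0.0);
    for value in values {
        total += value;
    }
    total
}
```

The by-reference implementation is what lets accumulation loops over slices and iterators avoid an explicit dereference at every step.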