pub struct E8M0(/* private fields */);
An 8-bit floating-point type that represents scale factors as powers of two.
§Format Details
The e8m0 format is an 8-bit representation where:
- Values 0-254: Represent powers of two from 2^-127 to 2^127
- Value 255 (0xFF): Reserved for NaN
- No mantissa bits - only represents exact powers of two
- Exponent bias: 127
§Conversion Behavior
This implementation follows NVIDIA’s CUDA specification:
- Rounding mode: round toward positive infinity.
- Saturation mode: clamp values to representable range (satfinite).
§From f64 to E8M0
- NaN → 0xFF (NaN)
- Values ≤ 0 → 0x00 (2^-127, smallest positive value)
- Positive values that are not exact powers of two are rounded UP to the next power of two
- Values > 2^127 → 0xFE (2^127, largest finite value)
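The rules above can be sketched as a stand-alone function. This is a minimal illustration, not the crate's implementation: `encode_e8m0` is a hypothetical name, and sending positive values below 2^-127 to 0x00 is an assumption inferred from the round-up and saturation rules.

```rust
/// Hypothetical stand-alone sketch of the f64 -> e8m0 rules listed above.
/// It returns the raw e8m0 bit pattern and is not the crate's code.
fn encode_e8m0(value: f64) -> u8 {
    if value.is_nan() {
        return 0xFF; // NaN
    }
    if value <= 2f64.powi(-127) {
        // Zero and negatives clamp to the minimum. Treating tiny positive
        // inputs the same way is an assumption of this sketch.
        return 0x00;
    }
    if value >= 2f64.powi(127) {
        return 0xFE; // saturate to the largest finite value (covers +infinity)
    }
    // `value` is now a normal f64 strictly between 2^-127 and 2^127.
    let raw = value.to_bits();
    // Unbiased f64 exponent, i.e. floor(log2(value)).
    let exponent = ((raw >> 52) & 0x7FF) as i32 - 1023;
    // Any non-zero mantissa bit means the value lies above 2^exponent.
    let has_fraction = raw & ((1u64 << 52) - 1) != 0;
    // Round toward positive infinity: bump to the next power of two.
    let rounded = exponent + i32::from(has_fraction);
    (rounded + 127) as u8
}
```

Under these assumptions, `encode_e8m0(3.0)` and `encode_e8m0(4.0)` both yield 0x81 (2^2), and `encode_e8m0(f64::INFINITY)` yields 0xFE.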
§From E8M0 to f64
- 0x00-0xFE → 2^(value - 127)
- 0xFF → NaN
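As a rough illustration of the decoding rule, the sketch below applies the bias of 127 by hand. `decode_e8m0` is a hypothetical helper introduced here for illustration and is not part of the crate.

```rust
/// Hypothetical helper: decode a raw e8m0 byte following the rule above.
fn decode_e8m0(bits: u8) -> f64 {
    const BIAS: i32 = 127;
    if bits == 0xFF {
        f64::NAN // the single reserved NaN encoding
    } else {
        // Every other byte is a biased exponent with no mantissa bits.
        2f64.powi(i32::from(bits) - BIAS)
    }
}
```

With this sketch, 0x7F decodes to 1.0, 0x80 to 2.0, and 0x7E to 0.5.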
§Examples
use float4::E8M0;
// Exact powers of two convert precisely
let e = E8M0::from(4.0_f64);
assert_eq!(e.to_f64(), 4.0);
// Non-powers round UP to next power of two
let e = E8M0::from(3.0_f64);
assert_eq!(e.to_f64(), 4.0); // rounds up
// Special values
assert!(E8M0::from(f64::NAN).to_f64().is_nan());
assert_eq!(E8M0::from(-1.0).to_f64(), 2f64.powi(-127)); // clamps to minimum
§Reference
Based on NVIDIA’s CUDA e8m0 specification: https://docs.nvidia.com/cuda/cuda-math-api/cuda_math_api/struct____nv__fp8__e8m0.html
Implementations§
impl E8M0
pub const NAN: Self
The NaN (Not-a-Number) value for E8M0 format.
Represented by the bit pattern 0xFF (255).
pub fn from_f64(value: f64) -> Self
Converts an f64 to E8M0 format.
This implementation follows NVIDIA’s specification with:
- Rounding: cudaRoundPosInf (rounds toward positive infinity)
- Saturation: __NV_SATFINITE (clamps to representable range)
§Conversion Rules
- NaN → E8M0::NAN (0xFF)
- Values ≤ 0 → 0x00 (represents 2^-127)
- Positive values are rounded UP to the next power of two
- Values > 2^127 → 0xFE (represents 2^127)
§Examples
use float4::E8M0;
// Exact powers of two
assert_eq!(E8M0::from(1.0).to_f64(), 1.0); // 2^0
assert_eq!(E8M0::from(2.0).to_f64(), 2.0); // 2^1
// Non-powers round UP
assert_eq!(E8M0::from(1.5).to_f64(), 2.0); // rounds to 2^1
assert_eq!(E8M0::from(3.0).to_f64(), 4.0); // rounds to 2^2
// Edge cases
assert_eq!(E8M0::from(0.0).to_f64(), 2f64.powi(-127)); // minimum
assert_eq!(E8M0::from(-5.0).to_f64(), 2f64.powi(-127)); // negative → minimum
assert_eq!(E8M0::from(f64::INFINITY).to_f64(), 2f64.powi(127)); // saturates
pub fn to_f64(self) -> f64
Converts this E8M0 value to an f64.
§Returns
- For bits 0x00-0xFE: Returns 2^(bits - 127)
- For bits 0xFF: Returns NaN
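Every finite E8M0 value fits in an f64 exactly, because the exponent range -127 to 127 lies well inside the f64 normal range. One way to see this is to assemble the f64 bit pattern directly, as in the sketch below. `e8m0_bits_to_f64` is a hypothetical helper; whether the crate computes the result this way is not stated in the source.

```rust
/// Hypothetical helper: build the f64 for an e8m0 byte from raw bits.
fn e8m0_bits_to_f64(bits: u8) -> f64 {
    if bits == 0xFF {
        return f64::NAN;
    }
    // f64 has an exponent bias of 1023 and 52 mantissa bits. The e8m0
    // exponent (bits - 127) maps to the biased f64 exponent
    // bits - 127 + 1023 = bits + 896, with an all-zero mantissa.
    f64::from_bits((u64::from(bits) + 896) << 52)
}
```

The biased exponent ranges from 896 to 1150 here, so the result is always a normal f64 and no rounding occurs.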
§Examples
use float4::E8M0;
assert_eq!(E8M0::from_bits(0x7F).to_f64(), 1.0); // 2^(127-127) = 2^0 = 1
assert_eq!(E8M0::from_bits(0x80).to_f64(), 2.0); // 2^(128-127) = 2^1 = 2
assert!(E8M0::NAN.to_f64().is_nan());
pub fn from_f32_slice(values: &[f32]) -> Self
Creates an E8M0 scale factor from a slice of f32 values.
This function computes an appropriate scale factor for quantizing the given values. It finds the maximum absolute value in the slice and converts it to a power of two scale factor following E8M0 conversion rules.
§Arguments
- values: A slice of f32 values to compute the scale from
§Returns
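An E8M0 scale factor: the largest absolute value in the slice, rounded up to a power of two under the E8M0 conversion rules. Multiplying quantized values by this scale recovers magnitudes up to the largest input.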
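The computation described above (take the maximum absolute value, then round it up to a power of two) might look roughly like the following. Both function names are hypothetical, and the NaN handling is an assumption of this sketch: `f32::max` ignores NaN operands, which may differ from what the crate does.

```rust
/// Hypothetical encoder: round a value up to an e8m0 bit pattern.
fn round_up_to_e8m0(value: f64) -> u8 {
    if value.is_nan() {
        return 0xFF;
    }
    if value <= 2f64.powi(-127) {
        return 0x00; // clamp to the smallest scale
    }
    if value >= 2f64.powi(127) {
        return 0xFE; // saturate to the largest finite scale
    }
    let raw = value.to_bits();
    let exponent = ((raw >> 52) & 0x7FF) as i32 - 1023;
    // A non-zero mantissa means the value is not an exact power of two.
    let bump = i32::from(raw & ((1u64 << 52) - 1) != 0);
    (exponent + bump + 127) as u8
}

/// Hypothetical sketch of the slice-to-scale step, returning raw bits.
fn scale_bits_from_slice(values: &[f32]) -> u8 {
    // Largest magnitude in the slice; an empty slice leaves this at 0.0.
    // Assumption: NaN inputs are skipped, since `f32::max` ignores NaN.
    let max_abs = values.iter().fold(0.0_f32, |acc, v| acc.max(v.abs()));
    round_up_to_e8m0(f64::from(max_abs))
}
```

For `[1.0, 5.0, -3.5]` this sketch yields 0x82, which is 2^3 = 8.0, and for an empty slice it yields 0x00.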
§Examples
use float4::E8M0;
// Scale for values within a small range
let values = [0.5, -0.75, 0.25];
let scale = E8M0::from_f32_slice(&values);
assert_eq!(scale.to_f64(), 1.0); // rounds 0.75 up to 1.0
// Scale for larger values
let values = [1.0, 5.0, -3.5];
let scale = E8M0::from_f32_slice(&values);
assert_eq!(scale.to_f64(), 8.0); // rounds 5.0 up to 8.0
// Empty slice returns smallest scale
let scale = E8M0::from_f32_slice(&[]);
assert_eq!(scale.to_f64(), 2f64.powi(-127));
pub const fn from_bits(bits: u8) -> Self
Creates an E8M0 from raw bits.
This performs no validation - the u8 value is directly used as the bit pattern.
§Examples
use float4::E8M0;
let e = E8M0::from_bits(0x7F);
assert_eq!(e.to_f64(), 1.0); // 2^(127-127) = 1
let e = E8M0::from_bits(0xFF);
assert!(e.to_f64().is_nan()); // 0xFF is NaN
Trait Implementations§
impl From<E8M0> for f64
Converts E8M0 to f64.
This is a convenience trait implementation that calls E8M0::to_f64.
impl From<E8M0> for u8
Extracts the raw bits from an E8M0.
Returns the underlying 8-bit representation.
impl From<u8> for E8M0
Creates an E8M0 from raw bits.
This performs no validation - the u8 value is directly used as the bit pattern.
§Examples
use float4::E8M0;
let e = E8M0::from(0x7Fu8);
assert_eq!(e.to_f64(), 1.0); // 2^(127-127) = 1
let e = E8M0::from(0xFFu8);
assert!(e.to_f64().is_nan()); // 0xFF is NaN
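To show how the three conversions listed above fit together, here is a small sketch built around a stand-in newtype. `Scale` is a hypothetical type used only for illustration; the crate's own `E8M0` has private fields and may be implemented differently.

```rust
/// Hypothetical stand-in for an 8-bit power-of-two scale type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Scale(u8);

impl From<u8> for Scale {
    /// No validation: the byte is stored as the bit pattern.
    fn from(bits: u8) -> Self {
        Scale(bits)
    }
}

impl From<Scale> for u8 {
    /// Returns the underlying 8-bit representation.
    fn from(scale: Scale) -> u8 {
        scale.0
    }
}

impl From<Scale> for f64 {
    /// Decodes 2^(bits - 127), with 0xFF reserved for NaN.
    fn from(scale: Scale) -> f64 {
        if scale.0 == 0xFF {
            f64::NAN
        } else {
            2f64.powi(i32::from(scale.0) - 127)
        }
    }
}
```

Because the `u8` conversions only wrap and unwrap the byte, every bit pattern round-trips unchanged, for example `u8::from(Scale::from(0x80u8))` gives back 0x80.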