Struct E8M0 

pub struct E8M0(/* private fields */);

An 8-bit floating-point type that represents scale factors as powers of two.

§Format Details

The e8m0 format is an 8-bit representation where:

  • Values 0-254: Represent powers of two from 2^-127 to 2^127
  • Value 255 (0xFF): Reserved for NaN
  • No sign bit and no mantissa bits: only exact positive powers of two are representable
  • Exponent bias: 127

§Conversion Behavior

This implementation follows NVIDIA’s CUDA specification:

  • Rounding mode: round toward positive infinity.
  • Saturation mode: clamp values to representable range (satfinite).

§From f64 to E8M0

  • NaN → 0xFF (NaN)
  • Values ≤ 0 → 0x00 (2^-127, smallest positive value)
  • Positive values that are not exact powers of two are rounded UP to the next power of two
  • Values > 2^127 → 0xFE (2^127, largest finite value)
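
The rules above can be sketched as a standalone function. This is a minimal illustration, not the crate's actual implementation: encode_e8m0 is a hypothetical name, and the sketch assumes that positive inputs below 2^-127 also clamp to 0x00, which follows from rounding up plus saturation but is not stated explicitly above.

```rust
/// Hypothetical helper (not part of `float4`): encodes an f64 as an E8M0
/// bit pattern using the conversion rules listed above.
fn encode_e8m0(value: f64) -> u8 {
    // 2^-127 and 2^127, built directly from their IEEE 754 exponent fields.
    let min = f64::from_bits((1023u64 - 127) << 52);
    let max = f64::from_bits((1023u64 + 127) << 52);

    if value.is_nan() {
        return 0xFF; // NaN
    }
    if value <= min {
        // Zero, negatives, and (by assumption) tiny positives clamp to 2^-127.
        return 0x00;
    }
    if value > max {
        return 0xFE; // saturate to 2^127, including +infinity
    }

    // Here value is a normal f64 in (2^-127, 2^127].
    let bits = value.to_bits();
    let exponent = ((bits >> 52) & 0x7FF) as i32 - 1023;
    let mantissa = bits & ((1u64 << 52) - 1);

    // A nonzero mantissa means the value lies strictly between two powers
    // of two, so round up to the next exponent.
    let rounded = if mantissa == 0 { exponent } else { exponent + 1 };
    (rounded + 127) as u8
}
```

Under these assumptions, encode_e8m0(3.0) and encode_e8m0(4.0) both yield 0x81, the pattern for 2^2.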

§From E8M0 to f64

  • 0x00-0xFE → 2^(value - 127)
  • 0xFF → NaN
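
The decoding rule can be sketched without the crate as well. decode_e8m0 is a hypothetical name used only for illustration. The sketch relies on the fact that f64 stores its exponent with a bias of 1023, so a power of two can be assembled from the exponent field alone.

```rust
/// Hypothetical helper (not part of `float4`): decodes an E8M0 bit pattern
/// into an f64 following the rules listed above.
fn decode_e8m0(bits: u8) -> f64 {
    if bits == 0xFF {
        return f64::NAN;
    }
    // 2^(bits - 127) has the f64 exponent field (bits - 127 + 1023) and an
    // all-zero mantissa, so the result is exact for every finite pattern.
    let exponent_field = u64::from(bits) + (1023 - 127);
    f64::from_bits(exponent_field << 52)
}
```

Because every finite E8M0 value is a normal f64, this conversion loses no precision.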

§Examples

use float4::E8M0;

// Exact powers of two convert precisely
let e = E8M0::from(4.0_f64);
assert_eq!(e.to_f64(), 4.0);

// Non-powers round UP to next power of two
let e = E8M0::from(3.0_f64);
assert_eq!(e.to_f64(), 4.0);  // rounds up

// Special values
assert!(E8M0::from(f64::NAN).to_f64().is_nan());
assert_eq!(E8M0::from(-1.0).to_f64(), 2f64.powi(-127));  // clamps to minimum

§Reference

Based on NVIDIA’s CUDA e8m0 specification: https://docs.nvidia.com/cuda/cuda-math-api/cuda_math_api/struct____nv__fp8__e8m0.html

Implementations§

Source§

impl E8M0

pub const NAN: Self

The NaN (Not-a-Number) value for E8M0 format.

Represented by the bit pattern 0xFF (255).

pub fn from_f64(value: f64) -> Self

Converts an f64 to E8M0 format.

This implementation follows NVIDIA’s specification with:

  • Rounding: cudaRoundPosInf - rounds toward positive infinity
  • Saturation: __NV_SATFINITE - clamps to representable range

§Conversion Rules

  • NaN → E8M0::NAN (0xFF)
  • Values ≤ 0 → 0x00 (represents 2^-127)
  • Positive values are rounded UP to the next power of two
  • Values > 2^127 → 0xFE (represents 2^127)

§Examples

use float4::E8M0;

// Exact powers of two
assert_eq!(E8M0::from(1.0).to_f64(), 1.0);   // 2^0
assert_eq!(E8M0::from(2.0).to_f64(), 2.0);   // 2^1

// Non-powers round UP
assert_eq!(E8M0::from(1.5).to_f64(), 2.0);   // rounds to 2^1
assert_eq!(E8M0::from(3.0).to_f64(), 4.0);   // rounds to 2^2

// Edge cases
assert_eq!(E8M0::from(0.0).to_f64(), 2f64.powi(-127));   // minimum
assert_eq!(E8M0::from(-5.0).to_f64(), 2f64.powi(-127));  // negative → minimum
assert_eq!(E8M0::from(f64::INFINITY).to_f64(), 2f64.powi(127));  // saturates

pub fn to_f64(self) -> f64

Converts this E8M0 value to an f64.

§Returns

  • For bits 0x00-0xFE: Returns 2^(bits - 127)
  • For bits 0xFF: Returns NaN

§Examples

use float4::E8M0;

assert_eq!(E8M0::from_bits(0x7F).to_f64(), 1.0);  // 2^(127-127) = 2^0 = 1
assert_eq!(E8M0::from_bits(0x80).to_f64(), 2.0);  // 2^(128-127) = 2^1 = 2
assert!(E8M0::NAN.to_f64().is_nan());

pub fn from_f32_slice(values: &[f32]) -> Self

Creates an E8M0 scale factor from a slice of f32 values.

This function computes a scale factor for quantizing the given values. It finds the maximum absolute value in the slice and converts it to a power-of-two scale factor following the E8M0 conversion rules, so the maximum is rounded up.

§Arguments

  • values - A slice of f32 values to compute the scale from

§Returns

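An E8M0 scale factor: the largest absolute value in the slice, rounded up to a power of two and clamped to the representable range. Dividing each value by this scale therefore yields a magnitude of at most 1, unless the scale saturated at 2^127.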

§Examples

use float4::E8M0;

// Scale for values within a small range
let values = [0.5, -0.75, 0.25];
let scale = E8M0::from_f32_slice(&values);
assert_eq!(scale.to_f64(), 1.0);  // rounds 0.75 up to 1.0

// Scale for larger values
let values = [1.0, 5.0, -3.5];
let scale = E8M0::from_f32_slice(&values);
assert_eq!(scale.to_f64(), 8.0);  // rounds 5.0 up to 8.0

// Empty slice returns smallest scale
let scale = E8M0::from_f32_slice(&[]);
assert_eq!(scale.to_f64(), 2f64.powi(-127));
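
The computation described above can be approximated outside the crate as follows. This is a sketch under stated assumptions, not the library's code: round_up_pow2 and scale_for are hypothetical names, the rounding uses a simple doubling and halving loop rather than whatever the crate does internally, and NaN inputs are skipped because their handling is not documented here.

```rust
/// Hypothetical helper (not part of `float4`): smallest power of two that
/// is >= x, clamped to the E8M0 range [2^-127, 2^127].
fn round_up_pow2(x: f64) -> f64 {
    let min = f64::from_bits((1023u64 - 127) << 52); // 2^-127
    let max = f64::from_bits((1023u64 + 127) << 52); // 2^127
    if x <= min {
        return min;
    }
    if x >= max {
        return max;
    }
    let mut p = 1.0_f64;
    // Grow until p is at least x, then shrink while half of p still covers x.
    while p < x {
        p *= 2.0;
    }
    while p / 2.0 >= x {
        p /= 2.0;
    }
    p
}

/// Hypothetical helper: power-of-two scale for a slice, returned as f64.
fn scale_for(values: &[f32]) -> f64 {
    // f64::max ignores NaN operands, so NaN inputs are skipped (assumption).
    let max_abs = values
        .iter()
        .fold(0.0_f64, |acc, v| acc.max(f64::from(v.abs())));
    round_up_pow2(max_abs)
}
```

With the inputs from the examples above, scale_for returns 1.0 for [0.5, -0.75, 0.25] and 8.0 for [1.0, 5.0, -3.5], so dividing each value by the scale keeps its magnitude at or below 1.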

pub const fn from_bits(bits: u8) -> Self

Creates an E8M0 from raw bits.

This performs no validation - the u8 value is directly used as the bit pattern.

§Examples

use float4::E8M0;

let e = E8M0::from_bits(0x7F);
assert_eq!(e.to_f64(), 1.0);  // 2^(127-127) = 1

let e = E8M0::from_bits(0xFF);
assert!(e.to_f64().is_nan());  // 0xFF is NaN

pub const fn to_bits(&self) -> u8

Extracts the raw bits from an E8M0.

Returns the underlying 8-bit representation.

Trait Implementations§

Source§

impl Clone for E8M0

Source§

fn clone(&self) -> E8M0

Returns a duplicate of the value.
1.0.0 · Source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.
Source§

impl Debug for E8M0

Source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.
Source§

impl From<E8M0> for f64

Converts E8M0 to f64.

This is a convenience trait implementation that calls E8M0::to_f64.

Source§

fn from(v: E8M0) -> Self

Converts to this type from the input type.
Source§

impl From<E8M0> for u8

Extracts the raw bits from an E8M0.

Returns the underlying 8-bit representation.

Source§

fn from(v: E8M0) -> Self

Converts to this type from the input type.
Source§

impl From<f32> for E8M0

Source§

fn from(value: f32) -> Self

Converts an f32 to E8M0 format.

This is equivalent to converting via f64.

§Examples

use float4::E8M0;

let e: E8M0 = 2.5f32.into();
assert_eq!(e.to_f64(), 4.0); // rounds up to 2^2
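
Why converting via f64 is safe can be illustrated with a stand-in type. Pow2 below is hypothetical and unrelated to the crate's internals. It stores only the rounded-up exponent and handles only positive, finite, normal inputs. The point of the sketch is the delegation pattern: widening f32 to f64 is exact, so the f32 conversion can reuse the f64 one without changing the result.

```rust
/// Hypothetical stand-in for an E8M0-like type: holds the unbiased
/// exponent of the power of two that a value rounds up to.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Pow2(i32);

impl From<f64> for Pow2 {
    fn from(value: f64) -> Self {
        // Sketch only: assumes value is positive, finite, and normal.
        let bits = value.to_bits();
        let exponent = ((bits >> 52) & 0x7FF) as i32 - 1023;
        let has_fraction = bits & ((1u64 << 52) - 1) != 0;
        // Any nonzero mantissa bit pushes the value past 2^exponent.
        Pow2(exponent + i32::from(has_fraction))
    }
}

impl From<f32> for Pow2 {
    fn from(value: f32) -> Self {
        // f32 -> f64 widening is lossless, so delegation is exact.
        Pow2::from(f64::from(value))
    }
}
```

Under these assumptions, Pow2::from(2.5f32) and Pow2::from(2.5f64) both give Pow2(2), matching the 2^2 result in the example above.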
Source§

impl From<f64> for E8M0

Source§

fn from(value: f64) -> Self

Converts to this type from the input type.
Source§

impl From<u8> for E8M0

Creates an E8M0 from raw bits.

This performs no validation - the u8 value is directly used as the bit pattern.

§Examples

use float4::E8M0;

let e = E8M0::from(0x7Fu8);
assert_eq!(e.to_f64(), 1.0);  // 2^(127-127) = 1

let e = E8M0::from(0xFFu8);
assert!(e.to_f64().is_nan());  // 0xFF is NaN
Source§

fn from(b: u8) -> Self

Converts to this type from the input type.
Source§

impl Hash for E8M0

Source§

fn hash<__H: Hasher>(&self, state: &mut __H)

Feeds this value into the given Hasher.
1.3.0 · Source§

fn hash_slice<H>(data: &[Self], state: &mut H)
where H: Hasher, Self: Sized,

Feeds a slice of this type into the given Hasher.
Source§

impl PartialEq for E8M0

Source§

fn eq(&self, other: &E8M0) -> bool

Tests for self and other values to be equal, and is used by ==.
1.0.0 · Source§

fn ne(&self, other: &Rhs) -> bool

Tests for !=. The default implementation is almost always sufficient, and should not be overridden without very good reason.
Source§

impl Copy for E8M0

Source§

impl Eq for E8M0

Source§

impl StructuralPartialEq for E8M0
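
The presence of Eq, Hash, and StructuralPartialEq suggests that equality is derived and compares raw bit patterns. If so, the NaN pattern compares equal to itself, unlike f64::NAN. The sketch below demonstrates that behavior on a hypothetical stand-in type, RawScale, which assumes a single u8 field and mirrors the derives listed on this page. It does not use the crate itself.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in (not the crate's type): a one-byte newtype with
/// the same derived traits as those listed for E8M0.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct RawScale(u8);

/// The NaN bit pattern, by analogy with E8M0::NAN.
const RAW_NAN: RawScale = RawScale(0xFF);

/// Derived equality is bitwise, so the NaN pattern equals itself.
fn nan_equals_itself() -> bool {
    RAW_NAN == RAW_NAN
}

/// Counts distinct bit patterns; relies on Eq and Hash being consistent.
fn distinct_patterns(items: &[RawScale]) -> usize {
    items.iter().copied().collect::<HashSet<_>>().len()
}
```

One practical consequence, if the assumption holds for E8M0, is that values can serve as HashMap or HashSet keys, and NaN checks should go through to_f64().is_nan() or a comparison with E8M0::NAN rather than through f64 equality.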

Auto Trait Implementations§

§

impl Freeze for E8M0

§

impl RefUnwindSafe for E8M0

§

impl Send for E8M0

§

impl Sync for E8M0

§

impl Unpin for E8M0

§

impl UnwindSafe for E8M0

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self.
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value.
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.
Source§

impl<T> CloneToUninit for T
where T: Clone,

Source§

unsafe fn clone_to_uninit(&self, dest: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dest.
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> ToOwned for T
where T: Clone,

Source§

type Owned = T

The resulting type after obtaining ownership.
Source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.
Source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.