Struct E8M0

pub struct E8M0(/* private fields */);

An 8-bit floating-point type that represents scale factors as powers of two.

§Format Details

The e8m0 format is an 8-bit representation where:

Values 0-254: Represent powers of two from 2^-127 to 2^127
Value 255 (0xFF): Reserved for NaN
No mantissa bits - only represents exact powers of two
Exponent bias: 127

§Conversion Behavior

This implementation follows NVIDIA’s CUDA specification:

Rounding mode: round toward positive infinity.
Saturation mode: clamp values to representable range (satfinite).

§From f64 to E8M0

NaN → 0xFF (NaN)
Values ≤ 0 → 0x00 (2^-127, smallest positive value)
Positive values that are not exact powers of two are rounded UP to the next power of two; exact powers of two are preserved
Values > 2^127, including +∞ → 0xFE (2^127, largest finite value)

§From E8M0 to f64

0x00-0xFE → 2^(value - 127)
0xFF → NaN

§Examples

use float4::E8M0;

// Exact powers of two convert precisely
let e = E8M0::from(4.0_f64);
assert_eq!(e.to_f64(), 4.0);

// Non-powers round UP to next power of two
let e = E8M0::from(3.0_f64);
assert_eq!(e.to_f64(), 4.0);  // rounds up

// Special values
assert!(E8M0::from(f64::NAN).to_f64().is_nan());
assert_eq!(E8M0::from(-1.0).to_f64(), 2f64.powi(-127));  // clamps to minimum

§Reference

Based on NVIDIA’s CUDA e8m0 specification: https://docs.nvidia.com/cuda/cuda-math-api/cuda_math_api/struct____nv__fp8__e8m0.html

Implementations§

impl E8M0

pub const NAN: Self

The NaN (Not-a-Number) value for E8M0 format.

Represented by the bit pattern 0xFF (255).

pub fn from_f64(value: f64) -> Self

Converts an f64 to E8M0 format.

This implementation follows NVIDIA’s specification with:

Rounding: cudaRoundPosInf - rounds toward positive infinity
Saturation: __NV_SATFINITE - clamps to representable range

§Conversion Rules

NaN → E8M0::NAN (0xFF)
Values ≤ 0 → 0x00 (represents 2^-127)
Positive values that are not exact powers of two are rounded UP to the next power of two; exact powers of two are preserved
Values > 2^127, including +∞ → 0xFE (represents 2^127)

§Examples

use float4::E8M0;

// Exact powers of two
assert_eq!(E8M0::from(1.0).to_f64(), 1.0);   // 2^0
assert_eq!(E8M0::from(2.0).to_f64(), 2.0);   // 2^1

// Non-powers round UP
assert_eq!(E8M0::from(1.5).to_f64(), 2.0);   // rounds to 2^1
assert_eq!(E8M0::from(3.0).to_f64(), 4.0);   // rounds to 2^2

// Edge cases
assert_eq!(E8M0::from(0.0).to_f64(), 2f64.powi(-127));   // minimum
assert_eq!(E8M0::from(-5.0).to_f64(), 2f64.powi(-127));  // negative → minimum
assert_eq!(E8M0::from(f64::INFINITY).to_f64(), 2f64.powi(127));  // saturates
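
The conversion rules above can be sketched directly on the IEEE 754 bit layout of an f64. This is a minimal, standalone illustration and not the crate's actual implementation: the helper name `encode_e8m0` is hypothetical, and mapping positive values below 2^-127 to 0x00 is an assumption that follows from rounding up and then clamping, not something the documentation states.

```rust
/// Hypothetical helper (not part of `float4`): encodes an f64 as an E8M0 bit
/// pattern using the documented rules (round up, saturate to finite).
fn encode_e8m0(value: f64) -> u8 {
    if value.is_nan() {
        return 0xFF; // NaN maps to the reserved pattern
    }
    if value <= 0.0 {
        return 0x00; // zero and negatives clamp to 2^-127
    }
    let bits = value.to_bits();
    // Unbiased f64 exponent field (the f64 bias is 1023).
    let mut exp = ((bits >> 52) & 0x7FF) as i32 - 1023;
    // Any set mantissa bit means the value lies above 2^exp, so round up.
    if bits & ((1u64 << 52) - 1) != 0 {
        exp += 1;
    }
    // Re-bias by 127 and clamp to 0x00..=0xFE. +infinity and values above
    // 2^127 land on 0xFE; values below 2^-127 land on 0x00 (assumption).
    (exp + 127).clamp(0, 254) as u8
}
```

Under these assumptions, `encode_e8m0(3.0)` yields 0x81, which decodes to 2^(129 - 127) = 4 and matches the example above.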

pub fn to_f64(self) -> f64

Converts this E8M0 value to an f64.

§Returns

For bits 0x00-0xFE: Returns 2^(bits - 127)
For bits 0xFF: Returns NaN

§Examples

use float4::E8M0;

assert_eq!(E8M0::from_bits(0x7F).to_f64(), 1.0);  // 2^(127-127) = 2^0 = 1
assert_eq!(E8M0::from_bits(0x80).to_f64(), 2.0);  // 2^(128-127) = 2^1 = 2
assert!(E8M0::NAN.to_f64().is_nan());
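
The decoding rule is a single re-biasing step, which can be sketched without the crate. The helper name `decode_e8m0` is hypothetical, and building the f64 from its exponent field is one possible approach, not necessarily how `to_f64` is implemented.

```rust
/// Hypothetical helper (not part of `float4`): decodes an E8M0 bit pattern.
fn decode_e8m0(bits: u8) -> f64 {
    if bits == 0xFF {
        return f64::NAN; // reserved NaN pattern
    }
    // Unbiased exponent in -127..=127.
    let exp = i64::from(bits) - 127;
    // Build the f64 directly: re-bias by 1023, shift into the exponent field,
    // and leave sign and mantissa at zero, which gives exactly 2^exp.
    f64::from_bits(((exp + 1023) as u64) << 52)
}
```

Every exponent from -127 to 127 is a normal f64, so each finite E8M0 pattern decodes exactly.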

pub fn from_f32_slice(values: &[f32]) -> Self

Creates an E8M0 scale factor from a slice of f32 values.

This function computes an appropriate scale factor for quantizing the given values. It finds the maximum absolute value in the slice and converts it to a power of two scale factor following E8M0 conversion rules.

§Arguments

values - A slice of f32 values to compute the scale from

§Returns

An E8M0 scale factor equal to the maximum absolute value in the slice rounded up to a power of two (subject to saturation at 2^127), so that quantized values multiplied by the scale can cover the largest input.

§Examples

use float4::E8M0;

// Scale for values within a small range
let values = [0.5, -0.75, 0.25];
let scale = E8M0::from_f32_slice(&values);
assert_eq!(scale.to_f64(), 1.0);  // rounds 0.75 up to 1.0

// Scale for larger values
let values = [1.0, 5.0, -3.5];
let scale = E8M0::from_f32_slice(&values);
assert_eq!(scale.to_f64(), 8.0);  // rounds 5.0 up to 8.0

// Empty slice returns smallest scale
let scale = E8M0::from_f32_slice(&[]);
assert_eq!(scale.to_f64(), 2f64.powi(-127));
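
The scale computation described above (maximum absolute value, then round up to a power of two) can be sketched as follows. Both function names are hypothetical, the result is returned as raw bits so the block does not depend on the crate, and ignoring NaN inputs is an assumption of this sketch, since the documentation does not say how `from_f32_slice` treats them.

```rust
/// Hypothetical encoder following the documented rules (round up, satfinite).
fn encode_e8m0(value: f64) -> u8 {
    if value.is_nan() {
        return 0xFF;
    }
    if value <= 0.0 {
        return 0x00;
    }
    let bits = value.to_bits();
    let mut exp = ((bits >> 52) & 0x7FF) as i32 - 1023;
    if bits & ((1u64 << 52) - 1) != 0 {
        exp += 1; // not an exact power of two: round up
    }
    (exp + 127).clamp(0, 254) as u8
}

/// Hypothetical sketch of the scale computation for a slice of f32 values.
fn scale_bits_from_f32_slice(values: &[f32]) -> u8 {
    // Largest magnitude in the slice; an empty slice leaves this at 0.0.
    // `f32::max` skips NaN operands, so NaN inputs are ignored (assumption).
    let max_abs = values.iter().fold(0.0_f32, |acc, v| acc.max(v.abs()));
    // Widening f32 to f64 is lossless, so the f64 rules apply unchanged.
    encode_e8m0(f64::from(max_abs))
}
```

Because the maximum is rounded up, dividing every input by the decoded scale leaves magnitudes of at most 1 whenever the scale has not saturated.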

pub const fn from_bits(bits: u8) -> Self

Creates an E8M0 from raw bits.

This performs no validation - the u8 value is directly used as the bit pattern.

§Examples

use float4::E8M0;

let e = E8M0::from_bits(0x7F);
assert_eq!(e.to_f64(), 1.0);  // 2^(127-127) = 1

let e = E8M0::from_bits(0xFF);
assert!(e.to_f64().is_nan());  // 0xFF is NaN

pub const fn to_bits(&self) -> u8

Extracts the raw bits from an E8M0.

Returns the underlying 8-bit representation.
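
Since `from_bits` performs no validation and `to_bits` returns the stored byte, the two should be exact inverses for all 256 patterns. The sketch below illustrates that property with a hypothetical stand-in newtype `Scale`; the fields of E8M0 are private, so the single-`u8` layout shown here is an assumption.

```rust
/// Hypothetical stand-in for a one-byte wrapper such as E8M0.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Scale(u8);

impl Scale {
    /// Reserved NaN pattern.
    const NAN: Self = Self(0xFF);

    /// Stores the byte as-is, with no validation.
    const fn from_bits(bits: u8) -> Self {
        Self(bits)
    }

    /// Returns the stored byte unchanged.
    const fn to_bits(&self) -> u8 {
        self.0
    }
}

/// Checks that every one of the 256 bit patterns survives a round trip.
fn all_patterns_round_trip() -> bool {
    (0..=u8::MAX).all(|b| Scale::from_bits(b).to_bits() == b)
}
```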

Trait Implementations§

impl Clone for E8M0

fn clone(&self) -> E8M0

Returns a duplicate of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl Debug for E8M0

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl From<E8M0> for f64

Converts E8M0 to f64.

This is a convenience trait implementation that calls E8M0::to_f64.

fn from(v: E8M0) -> Self

Converts to this type from the input type.

impl From<E8M0> for u8

Extracts the raw bits from an E8M0.

Returns the underlying 8-bit representation.

fn from(v: E8M0) -> Self

Converts to this type from the input type.

impl From<f32> for E8M0

fn from(value: f32) -> Self

Converts an f32 to E8M0 format.

This is equivalent to converting via f64.

§Examples

use float4::E8M0;

let e: E8M0 = 2.5f32.into();
assert_eq!(e.to_f64(), 4.0); // rounds up to 2^2

impl From<f64> for E8M0

fn from(value: f64) -> Self

Converts to this type from the input type.

impl From<u8> for E8M0

Creates an E8M0 from raw bits.

This performs no validation - the u8 value is directly used as the bit pattern.

§Examples

use float4::E8M0;

let e = E8M0::from(0x7Fu8);
assert_eq!(e.to_f64(), 1.0);  // 2^(127-127) = 1

let e = E8M0::from(0xFFu8);
assert!(e.to_f64().is_nan());  // 0xFF is NaN