Crate microfloat

§8-bit and sub-byte floating point types for Rust

This crate implements microfloat types for Rust, including common 8-bit formats and sub-byte 4-bit and 6-bit formats. Microfloats are a subset of minifloat formats.

8-bit floating point representations:

f8e3m4 - signed E3M4, bias 3, IEEE-like NaN/Inf.
f8e4m3 - signed E4M3, bias 7, IEEE-like NaN/Inf.
f8e4m3b11fnuz - signed E4M3, bias 11, finite-only, unsigned zero.
f8e4m3fn - signed E4M3, bias 7, finite-only, signed outer NaNs.
f8e4m3fnuz - signed E4M3, bias 8, finite-only, unsigned zero.
f8e5m2 - signed E5M2, bias 15, IEEE-like NaN/Inf.
f8e5m2fnuz - signed E5M2, bias 16, finite-only, unsigned zero.
f8e8m0fnu - unsigned E8M0 scale, bias 127, no zero, single NaN.
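To make the ExMy and bias notation concrete, the following sketch decodes a raw byte for the three IEEE-like formats above (f8e3m4, f8e4m3, f8e5m2). It assumes that "IEEE-like" means the all-ones exponent is reserved for Inf (zero mantissa) and NaN (nonzero mantissa), and that exponent zero encodes subnormals. `decode_ieee_like` and `pow2` are hypothetical helpers written for illustration using only the standard library; they are not part of the microfloat API.

```rust
/// Exact power of two as f32, valid for normal exponents -126..=127.
/// Hypothetical helper for this sketch.
fn pow2(e: i32) -> f32 {
    f32::from_bits(((e + 127) as u32) << 23)
}

/// Decode an 8-bit pattern laid out as 1 sign bit, `exp_bits` exponent bits,
/// and `man_bits` mantissa bits, assuming IEEE 754 style special values.
/// Hypothetical helper; not the crate's implementation.
fn decode_ieee_like(bits: u8, exp_bits: u32, man_bits: u32, bias: i32) -> f32 {
    debug_assert_eq!(1 + exp_bits + man_bits, 8);
    let sign = if bits & 0x80 != 0 { -1.0_f32 } else { 1.0 };
    let exp_mask = (1u32 << exp_bits) - 1;
    let man_mask = (1u32 << man_bits) - 1;
    let exp = (bits as u32 >> man_bits) & exp_mask;
    let man = bits as u32 & man_mask;
    // Mantissa field as a fraction in [0, 1).
    let frac = man as f32 / (1u32 << man_bits) as f32;

    if exp == exp_mask {
        // All-ones exponent: Inf for a zero mantissa, NaN otherwise.
        if man == 0 { sign * f32::INFINITY } else { f32::NAN }
    } else if exp == 0 {
        // Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        sign * frac * pow2(1 - bias)
    } else {
        // Normal: implicit leading 1, exponent is the field minus the bias.
        sign * (1.0 + frac) * pow2(exp as i32 - bias)
    }
}
```

Under these assumptions, `decode_ieee_like(0x46, 4, 3, 7)` gives 3.5 for an E4M3 pattern with bias 7, and `decode_ieee_like(0x3C, 5, 2, 15)` gives 1.0 for E5M2 with bias 15.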

Microscaling (MX) sub-byte floating point representations:

f4e2m1fn - signed 4-bit E2M1, bias 1, finite-only, saturating.
f6e2m3fn - signed 6-bit E2M3, bias 1, finite-only, saturating.
f6e3m2fn - signed 6-bit E3M2, bias 3, finite-only, saturating.
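The sub-byte formats are small enough to enumerate by hand. As an illustration, the sketch below decodes a 4-bit E2M1 code with bias 1. It assumes the code sits in the low nibble of the byte with bit 3 as the sign, bits 2..1 as the exponent, and bit 0 as the mantissa; the storage layout inside the crate is not stated here, so treat that as an assumption. `decode_e2m1` is a hypothetical standalone helper, not a microfloat function.

```rust
/// Decode a 4-bit E2M1 code (assumed to be in the low nibble) with bias 1.
/// Assumed layout: bit 3 = sign, bits 2..1 = exponent, bit 0 = mantissa.
/// Hypothetical helper; not the crate's implementation.
fn decode_e2m1(code: u8) -> f32 {
    let sign = if code & 0b1000 != 0 { -1.0_f32 } else { 1.0 };
    let exp = (code >> 1) & 0b11;
    let man = (code & 1) as f32;
    let magnitude = if exp == 0 {
        // Subnormal: 0 or 0.5.
        man * 0.5
    } else {
        // Normal: (1 + m/2) * 2^(exp - 1). The format is finite-only,
        // so the top exponent is an ordinary binade.
        (1.0 + man * 0.5) * (1u32 << (exp - 1)) as f32
    };
    sign * magnitude
}
```

Decoding codes 0 through 7 yields the eight non-negative magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, and 6; setting bit 3 negates them.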

In type suffixes:

f - finite-only, with no infinities.
n - the format has a special NaN encoding.
uz - unsigned zero, with no distinct negative zero encoding.
u - unsigned.
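To show what the f, n, and uz suffixes change at the bit level, here is a sketch of decoding an E4M3 pattern with bias 8 in the fnuz style. It assumes the convention used by ml-dtypes for these formats: the pattern 0x80, which would otherwise be negative zero, is the single NaN, and the all-ones exponent holds ordinary finite values. `decode_e4m3fnuz` and `pow2` are hypothetical helpers for illustration only.

```rust
/// Exact power of two as f32, valid for normal exponents -126..=127.
/// Hypothetical helper for this sketch.
fn pow2(e: i32) -> f32 {
    f32::from_bits(((e + 127) as u32) << 23)
}

/// Decode an E4M3 "fnuz" pattern with bias 8.
/// Assumptions: 0x80 is the only NaN, there is no negative zero,
/// and no exponent value is reserved for infinities.
/// Hypothetical helper; not the crate's implementation.
fn decode_e4m3fnuz(bits: u8) -> f32 {
    if bits == 0x80 {
        // The would-be negative zero encoding is reused as NaN.
        return f32::NAN;
    }
    let sign = if bits & 0x80 != 0 { -1.0_f32 } else { 1.0 };
    let exp = ((bits >> 3) & 0x0F) as i32;
    let frac = (bits & 0x07) as f32 / 8.0;
    if exp == 0 {
        // Subnormal: exponent fixed at 1 - bias = -7.
        sign * frac * pow2(-7)
    } else {
        // Normal, including the all-ones exponent.
        sign * (1.0 + frac) * pow2(exp - 8)
    }
}
```

With these assumptions the largest finite value is `decode_e4m3fnuz(0x7F)`, which is 240.0, and 0x00 is the only zero.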

This crate is designed to be compatible with the microfloat types in the ml-dtypes Python package. For larger minifloat types such as f16 and bf16, use the half crate; microfloat is heavily inspired by half.

§Usage

The float types match the functionality of Rust's built-in floating point types where possible, and provide conversion operations, classification, formatting, parsing, arithmetic operations, and common math operations. Calculations are performed in f32 and the result is rounded back to the target format.

use microfloat::f8e4m3;

let x = f8e4m3::from_f32(1.5);
let y = f8e4m3::from_f32(2.0);
let z = x + y;

assert_eq!(z.to_f32(), 3.5);
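The "compute in f32, then round back" approach is easiest to see on the smallest format. The sketch below rounds an f32 to the nearest E2M1 value (bias 1) and saturates at the largest magnitude, which is one reading of "finite-only, saturating". Round-to-nearest with ties to even is an assumption here, as is returning `None` for NaN inputs, since NaN handling for this format is not described above. `round_to_e2m1` and `add_e2m1` are hypothetical helpers, not microfloat functions.

```rust
/// Non-negative magnitudes representable in E2M1 with bias 1, indexed by
/// the 3-bit code (exponent << 1 | mantissa).
const E2M1_MAGNITUDES: [f32; 8] = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0];

/// Round `x` to the nearest E2M1 value, ties to even, saturating at +/-6.
/// Returns `None` for NaN; NaN handling is deliberately left out of this sketch.
fn round_to_e2m1(x: f32) -> Option<f32> {
    if x.is_nan() {
        return None;
    }
    let a = x.abs();
    let max = E2M1_MAGNITUDES[7];
    let rounded = if a >= max {
        // Saturate: anything at or beyond the largest magnitude clamps to it.
        max
    } else {
        // Index of the largest table entry that is <= a.
        let i = E2M1_MAGNITUDES.iter().rposition(|&m| m <= a).unwrap();
        let (lo, hi) = (E2M1_MAGNITUDES[i], E2M1_MAGNITUDES[i + 1]);
        let mid = (lo + hi) / 2.0;
        // On an exact tie, pick the neighbor whose mantissa bit is 0,
        // which is the entry with an even index.
        if a < mid || (a == mid && i % 2 == 0) { lo } else { hi }
    };
    Some(rounded.copysign(x))
}

/// Add two values in f32 and round the sum back to E2M1.
fn add_e2m1(a: f32, b: f32) -> Option<f32> {
    round_to_e2m1(a + b)
}
```

For example, `add_e2m1(1.5, 3.0)` computes 4.5 in f32 and rounds it to 4.0, while `add_e2m1(4.0, 6.0)` saturates to 6.0.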

This crate provides no_std support.

Requires Rust 1.85 or greater.

§Optional Features

serde - Implement Serialize and Deserialize traits for the float types. This adds a dependency on the serde crate.
num-traits - Enable ToPrimitive, FromPrimitive, Num, NumCast, FloatCore, Signed, Bounded, Zero, and One trait implementations from the num-traits crate.
bytemuck - Enable Zeroable and Pod trait implementations from the bytemuck crate.
rand_distr - Enable sampling from distributions like StandardUniform and StandardNormal from the rand_distr crate.
rkyv - Enable zero-copy serialization support with the rkyv crate.

§Testing

Compatibility with ml-dtypes is tested against generated fixtures in tests/fixtures/. These fixtures validate conversions, classification, arithmetic, and math methods.

§`float8`

The float8 crate provides F8E4M3 and F8E5M2 types that are not fully OCP compliant. They use NVIDIA’s __NV_SATFINITE saturation mode (cuda_fp8.hpp). In this mode, the INFINITY constants are FP8_MAXNORM overflow sentinels rather than true infinities. In contrast, microfloat uses __NV_NOSAT semantics (IEEE NaN/Inf on overflow).

Structs§

f4e2m1fn: Signed 4-bit E2M1 MX finite-only type with bias 1, stored in a byte.
f6e2m3fn: Signed 6-bit E2M3 MX finite-only type with bias 1, stored in a byte.
f6e3m2fn: Signed 6-bit E3M2 MX finite-only type with bias 3, stored in a byte.
f8e3m4: Signed 8-bit E3M4 floating point type with bias 3 and IEEE-like NaN/Inf.
f8e4m3: Signed 8-bit E4M3 floating point type with bias 7 and IEEE-like NaN/Inf.
f8e4m3b11fnuz: Signed 8-bit E4M3 finite-only type with bias 11, unsigned zero, and a single NaN.
f8e4m3fn: Signed 8-bit E4M3 finite-only type with bias 7 and signed outer NaNs.
f8e4m3fnuz: Signed 8-bit E4M3 finite-only type with bias 8, unsigned zero, and a single NaN.
f8e5m2: Signed 8-bit E5M2 floating point type with bias 15 and IEEE-like NaN/Inf.
f8e5m2fnuz: Signed 8-bit E5M2 finite-only type with bias 16, unsigned zero, and a single NaN.
f8e8m0fnu: Unsigned 8-bit E8M0 MX scale format with bias 127, no zero, and a single NaN.