Crate float4

Four-bit floating point types and block formats for Rust.

This crate provides low-precision floating-point types following the OCP Microscaling Formats (MX) specification, designed for compact storage and computation in machine learning applications where aggressive quantization is acceptable.

§Available Types

  • F4E2M1: 4-bit floating-point with 2 exponent bits and 1 mantissa bit
  • E8M0: 8-bit scale factor representing powers of two (2^-127 to 2^127)
  • MXFP4Block: Block format storing 32 F4E2M1 values with a shared E8M0 scale
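
To make the E8M0 description concrete, here is a minimal sketch of decoding a scale byte. It assumes the OCP MX layout (unsigned exponent, bias 127, 0xFF reserved for NaN); the function name `e8m0_to_f64` is hypothetical and this is not necessarily how the crate implements it.

```rust
/// Decode an E8M0 scale byte into an f64.
/// Assumption: OCP MX layout, i.e. value = 2^(bits - 127), with 0xFF meaning NaN.
fn e8m0_to_f64(bits: u8) -> f64 {
    if bits == 0xFF {
        return f64::NAN; // reserved NaN encoding (assumed from the OCP MX spec)
    }
    // Build the power of two directly in the f64 exponent field:
    // f64 bias is 1023, so the stored exponent is bits - 127 + 1023 = bits + 896.
    f64::from_bits((u64::from(bits) + 896) << 52)
}
```

Because every E8M0 value is a power of two, multiplying by the scale only changes a float's exponent and introduces no rounding error of its own (barring overflow or underflow).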

§F4E2M1 Format Details

The F4E2M1 type implements the E2M1 format with:

  • 1 sign bit
  • 2 exponent bits
  • 1 mantissa bit
  • Exponent bias of 1
  • Round-to-nearest-even (roundTiesToEven) rounding mode

This format has 16 distinct bit patterns, covering values from -6.0 to 6.0:

  • Normal numbers: ±1.0, ±1.5, ±2.0, ±3.0, ±4.0, ±6.0
  • Subnormal numbers: ±0.5
  • Zero: ±0.0
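
The value table above follows directly from the field layout. As a rough sketch (the helper `decode_e2m1` is hypothetical and not part of the crate's API), a 4-bit pattern can be decoded like this, assuming the bit order sign, exponent, mantissa from most to least significant:

```rust
/// Decode a 4-bit E2M1 pattern (low nibble of `bits`) into an f32.
/// Assumed layout: [sign | exp1 exp0 | mantissa], exponent bias 1.
fn decode_e2m1(bits: u8) -> f32 {
    let sign: f32 = if bits & 0b1000 != 0 { -1.0 } else { 1.0 };
    let exp = (bits >> 1) & 0b11;
    let man = f32::from(bits & 0b1);
    let magnitude = if exp == 0 {
        // Subnormal: no implicit leading 1, so the value is 0.0 or 0.5.
        0.5 * man
    } else {
        // Normal: (1 + mantissa/2) * 2^(exp - bias), with bias = 1.
        (1.0 + 0.5 * man) * f32::from(1u8 << (exp - 1))
    };
    sign * magnitude
}
```

For example, `0b0011` has exponent 1 and mantissa 1, giving (1 + 0.5) × 2^0 = 1.5, and `0b0111` gives (1 + 0.5) × 2^2 = 6.0.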

§Examples

Basic usage:

use float4::F4E2M1;

// Create from f64
let a = F4E2M1::from_f64(1.5);
assert_eq!(a.to_f64(), 1.5);

// Create from raw bits
let b = F4E2M1::from_bits(0x3); // 0b0011 = 1.5
assert_eq!(b.to_f64(), 1.5);

// Values outside representable range saturate
let c = F4E2M1::from_f64(10.0);
assert_eq!(c.to_f64(), 6.0); // Saturates to maximum

§Rounding Behavior

The type uses round-to-nearest-even as specified by IEEE 754:

use float4::F4E2M1;

// Rounding to nearest
assert_eq!(F4E2M1::from_f64(2.25).to_f64(), 2.0); // closer to 2.0 than to 3.0

// Round-to-even when exactly halfway
assert_eq!(F4E2M1::from_f64(1.25).to_f64(), 1.0); // tie between 1.0 and 1.5
assert_eq!(F4E2M1::from_f64(1.75).to_f64(), 2.0); // tie between 1.5 and 2.0
assert_eq!(F4E2M1::from_f64(2.5).to_f64(), 2.0);  // tie between 2.0 and 3.0
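
One way to picture this rounding rule is a search over the eight representable magnitudes, where "even" means the candidate whose mantissa bit is 0. The sketch below is an illustration only; `round_e2m1` and `E2M1_MAGNITUDES` are hypothetical names, and the crate may well use a different (faster) method.

```rust
/// Magnitudes representable in E2M1, indexed by the 3-bit code (exponent, mantissa).
const E2M1_MAGNITUDES: [f64; 8] = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0];

/// Round `x` to the nearest E2M1 value, ties to the even code.
/// Intended for non-NaN inputs; magnitudes above 6.0 saturate.
fn round_e2m1(x: f64) -> f64 {
    let mag = x.abs().min(6.0);
    let mut best = 0usize;
    for (i, &cand) in E2M1_MAGNITUDES.iter().enumerate().skip(1) {
        let d_best = (mag - E2M1_MAGNITUDES[best]).abs();
        let d = (mag - cand).abs();
        // Prefer the closer candidate; on an exact tie prefer the even code,
        // i.e. the one whose mantissa bit (low bit of the index) is 0.
        if d < d_best || (d == d_best && i % 2 == 0) {
            best = i;
        }
    }
    E2M1_MAGNITUDES[best].copysign(x)
}
```

Note that the grid spacing grows with magnitude (0.5 up to 2.0, then 1.0, then 2.0), so a tie such as 5.0 resolves to 4.0 rather than 6.0.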

§Special Values

Unlike standard floating-point formats, F4E2M1 has no representation for infinity or NaN. Conversions saturate instead: positive and negative infinity map to 6.0 and -6.0, and NaN maps to 6.0:

use float4::F4E2M1;

assert_eq!(F4E2M1::from_f64(f64::INFINITY).to_f64(), 6.0);
assert_eq!(F4E2M1::from_f64(f64::NEG_INFINITY).to_f64(), -6.0);
assert_eq!(F4E2M1::from_f64(f64::NAN).to_f64(), 6.0);
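
The behavior shown above can be mimicked with a small pre-processing step applied before rounding. This is a sketch under the assumption that saturation is equivalent to clamping into [-6.0, 6.0]; `saturate_e2m1` is a hypothetical helper, not a crate function.

```rust
/// Map non-finite or out-of-range inputs into the E2M1 range before rounding.
fn saturate_e2m1(x: f64) -> f64 {
    const MAX: f64 = 6.0;
    if x.is_nan() {
        MAX // matches the documented result for NaN: +6.0
    } else {
        x.clamp(-MAX, MAX) // ±infinity and large finite values become ±6.0
    }
}
```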

§MXFP4 Block Format

The MXFP4Block type provides efficient storage for multiple F4E2M1 values by sharing a common scale factor:

use float4::{F4E2M1, E8M0, MXFP4Block};

// Original f32 data
let data = vec![1.5, -2.0, 0.5, 3.0];

// Compute scale (rounds up to power of 2)
let scale = E8M0::from_f32_slice(&data);
assert_eq!(scale.to_f64(), 4.0); // 3.0 rounds up to 4.0

// Quantize values (slots beyond data.len() stay at 0.0)
let mut quantized = [F4E2M1::from_f64(0.0); 32];
for (q, &x) in quantized.iter_mut().zip(&data) {
    *q = F4E2M1::from_f64(x as f64 / scale.to_f64());
}

// Pack into block (17 bytes total for 32 values)
let block = MXFP4Block::from_f32_slice(quantized, scale);

// Convert back
let restored = block.to_f32_array();
// Note: Due to F4E2M1's limited precision, values may be quantized
assert_eq!(restored[0], 2.0);  // 1.5/4.0 = 0.375 -> rounds to 0.5 -> 0.5*4.0 = 2.0
assert_eq!(restored[1], -2.0); // -2.0/4.0 = -0.5 is exactly representable
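
The "rounds up to power of 2" step in the example can be sketched as follows, assuming the scale is the smallest power of two that is at least the largest magnitude in the data. The function `scale_exponent` is hypothetical, and the crate's own rule may differ in edge cases.

```rust
/// Return e such that 2^e is a power of two >= the largest |x| in `data`.
/// Assumption: empty, all-zero, or non-finite input falls back to e = 0.
fn scale_exponent(data: &[f32]) -> i32 {
    // f32::max ignores NaN operands, so NaN entries do not affect the maximum.
    let max_abs = data.iter().fold(0.0_f32, |m, &x| m.max(x.abs()));
    if max_abs == 0.0 || !max_abs.is_finite() {
        return 0;
    }
    let bits = max_abs.to_bits();
    // The biased exponent field gives floor(log2(max_abs)) for normal floats.
    let floor_log2 = ((bits >> 23) & 0xFF) as i32 - 127;
    // Any nonzero fraction bits mean max_abs is not itself a power of two.
    let has_fraction = bits & 0x007F_FFFF != 0;
    floor_log2 + i32::from(has_fraction)
}
```

For the data above the largest magnitude is 3.0, so the exponent is 2 and the scale is 2^2 = 4.0, in line with the assertion in the example.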

This format stores 32 values in 17 bytes instead of the 128 bytes needed for f32, roughly 7.5× compression, making it well suited for:

  • Neural network weight storage
  • Activation caching in quantized models
  • Memory-bandwidth limited applications
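
As a rough illustration of where the 17-byte figure comes from, the sketch below packs 32 four-bit codes two per byte and prepends one scale byte. The byte order and nibble order here are assumptions made for the example, and `pack_block` / `unpack_block` are hypothetical names; the crate's actual in-memory layout may differ.

```rust
/// Pack 32 four-bit codes (one per u8, in the low nibble) plus a scale byte.
/// Assumed layout: byte 0 is the scale; each following byte holds two codes,
/// with the even-indexed code in the low nibble.
fn pack_block(codes: &[u8; 32], scale: u8) -> [u8; 17] {
    let mut out = [0u8; 17];
    out[0] = scale;
    for (i, pair) in codes.chunks_exact(2).enumerate() {
        out[1 + i] = (pair[0] & 0x0F) | ((pair[1] & 0x0F) << 4);
    }
    out
}

/// Inverse of `pack_block`: recover the 32 codes and the scale byte.
fn unpack_block(bytes: &[u8; 17]) -> ([u8; 32], u8) {
    let mut codes = [0u8; 32];
    for (i, &b) in bytes[1..].iter().enumerate() {
        codes[2 * i] = b & 0x0F;
        codes[2 * i + 1] = b >> 4;
    }
    (codes, bytes[0])
}
```

The same 32 values stored as f32 would take 32 × 4 = 128 bytes, compared with 16 bytes of packed codes plus 1 scale byte here.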

Structs§

E8M0
An 8-bit floating-point type that represents scale factors as powers of two.
F4E2M1
A 4-bit floating point type with 2 exponent bits and 1 mantissa bit.
MXFP4Block
A compressed block of 32 F4E2M1 values with a shared E8M0 scale factor.