Four-bit floating point types and block formats for Rust.
This crate provides low-precision floating-point types following the OCP MX specification, designed for efficient storage and computation in machine learning applications where extreme quantization is beneficial.
§Available Types
- F4E2M1: 4-bit floating-point with 2 exponent bits and 1 mantissa bit
- E8M0: 8-bit scale factor representing powers of two (2^-127 to 2^127)
- MXFP4Block: Block format storing 32 F4E2M1 values with a shared E8M0 scale
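Because an E8M0 value carries only a biased 8-bit exponent, with no sign or mantissa bits, its decoding can be sketched in a few lines. This is an illustration of the format, not the float4 crate's internals; the function name `e8m0_to_f64` and the bias of 127 are inferred from the stated 2^-127 to 2^127 range.

```rust
// Illustrative E8M0 decode: the byte is a pure biased exponent, so the
// represented scale is 2^(bits - 127), mirroring the single-precision bias.
// (The all-ones byte is a special case in the MX spec, not handled here.)
fn e8m0_to_f64(bits: u8) -> f64 {
    2f64.powi(bits as i32 - 127)
}
```

For example, `e8m0_to_f64(127)` yields 1.0 (the bias point, 2^0) and `e8m0_to_f64(129)` yields 4.0 (2^2).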
§F4E2M1 Format Details
The F4E2M1 type implements the E2M1 format with:
- 1 sign bit
- 2 exponent bits
- 1 mantissa bit
- Exponent bias of 1
- Round-to-nearest-even (roundTiesToEven) rounding mode
The format's 16 bit patterns represent values ranging from -6.0 to 6.0, including:
- Normal numbers: ±1.0, ±1.5, ±2.0, ±3.0, ±4.0, ±6.0
- Subnormal numbers: ±0.5
- Zero: ±0.0
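As a concrete illustration of the layout above (1 sign bit, 2 exponent bits, 1 mantissa bit, bias 1), here is a self-contained sketch of E2M1 decode and round-to-nearest-even encode. The names `e2m1_to_f64` and `f64_to_e2m1` and the table-based rounding are illustrative only; the crate's implementation may differ.

```rust
// All eight nonnegative E2M1 magnitudes, indexed by the low 3 bits
// (exponent and mantissa fields) of the encoding.
const MAGS: [f64; 8] = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0];

// Decode a 4-bit E2M1 pattern: exponent 0 means subnormal (no implicit
// leading one), otherwise value = 2^(exp - bias) * (1 + mantissa/2).
fn e2m1_to_f64(bits: u8) -> f64 {
    let sign = if bits & 0b1000 != 0 { -1.0 } else { 1.0 };
    let exp = (bits >> 1) & 0b11; // 2-bit exponent field
    let man = (bits & 0b1) as f64; // 1-bit mantissa field
    let mag = if exp == 0 {
        0.5 * man // subnormal: 0.0 or 0.5
    } else {
        2f64.powi(exp as i32 - 1) * (1.0 + 0.5 * man)
    };
    sign * mag
}

// Encode with round-to-nearest-even and saturation. Even table indices
// have mantissa bit 0, so "ties to even" means preferring an even index.
fn f64_to_e2m1(x: f64) -> u8 {
    let sign = if x.is_sign_negative() { 0b1000 } else { 0 };
    let a = x.abs();
    if a.is_nan() || a >= 6.0 {
        return sign | 0b0111; // saturate (also NaN/infinity)
    }
    let mut best = 0;
    for i in 1..8 {
        let (db, di) = ((MAGS[best] - a).abs(), (MAGS[i] - a).abs());
        if di < db || (di == db && i % 2 == 0) {
            best = i;
        }
    }
    sign | best as u8
}
```

With this sketch, `e2m1_to_f64(0b0011)` gives 1.5, and `f64_to_e2m1(1.25)` rounds to the even neighbor 1.0, matching the examples below.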
§Examples
Basic usage:
use float4::F4E2M1;
// Create from f64
let a = F4E2M1::from_f64(1.5);
assert_eq!(a.to_f64(), 1.5);
// Create from raw bits
let b = F4E2M1::from_bits(0x3); // 0b0011 = 1.5
assert_eq!(b.to_f64(), 1.5);
// Values outside representable range saturate
let c = F4E2M1::from_f64(10.0);
assert_eq!(c.to_f64(), 6.0); // Saturates to maximum
§Rounding Behavior
The type uses round-to-nearest-even as specified by IEEE 754:
use float4::F4E2M1;
// Rounding to nearest
assert_eq!(F4E2M1::from_f64(1.75).to_f64(), 2.0);
assert_eq!(F4E2M1::from_f64(2.25).to_f64(), 2.0);
// Round-to-even when exactly halfway
assert_eq!(F4E2M1::from_f64(1.25).to_f64(), 1.0); // Rounds to even
assert_eq!(F4E2M1::from_f64(2.5).to_f64(), 2.0); // Rounds to even
§Special Values
Unlike standard floating point formats, F4E2M1 has no representation for infinity or NaN. These values saturate to the maximum representable value:
use float4::F4E2M1;
assert_eq!(F4E2M1::from_f64(f64::INFINITY).to_f64(), 6.0);
assert_eq!(F4E2M1::from_f64(f64::NEG_INFINITY).to_f64(), -6.0);
assert_eq!(F4E2M1::from_f64(f64::NAN).to_f64(), 6.0);
§MXFP4 Block Format
The MXFP4Block type provides efficient storage for multiple F4E2M1 values by sharing
a common scale factor:
use float4::{F4E2M1, E8M0, MXFP4Block};
// Original f32 data
let data = vec![1.5, -2.0, 0.5, 3.0];
// Compute scale (rounds up to power of 2)
let scale = E8M0::from_f32_slice(&data);
assert_eq!(scale.to_f64(), 4.0); // 3.0 rounds up to 4.0
// Quantize values
let mut quantized = [F4E2M1::from_f64(0.0); 32];
for i in 0..data.len() {
    quantized[i] = F4E2M1::from_f64(data[i] as f64 / scale.to_f64());
}
// Pack into block (17 bytes total for 32 values)
let block = MXFP4Block::new(quantized, scale);
// Convert back
let restored = block.to_f32_array();
// Note: Due to F4E2M1's limited precision, values may be quantized
assert_eq!(restored[0], 2.0); // 1.5/4.0 = 0.375 -> rounds to 0.5 -> 0.5*4.0 = 2.0
assert_eq!(restored[1], -2.0); // -2.0/4.0 = -0.5 is exactly representable
This format achieves nearly 8× compression compared to f32 (17 bytes instead of 128 for a 32-value block), making it ideal for:
- Neural network weight storage
- Activation caching in quantized models
- Memory-bandwidth limited applications
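The quantize/dequantize round trip above can be sketched end to end with plain f64 in place of the crate's types. This is an illustration under two assumptions inferred from the example: the shared scale is the smallest power of two ≥ max(|x|), and each scaled value snaps to the nearest E2M1 magnitude (ties here go to the smaller magnitude for brevity, not round-to-even as the crate specifies).

```rust
// The eight nonnegative E2M1 magnitudes.
const E2M1_VALUES: [f64; 8] = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0];

// Snap |x| to the closest representable magnitude, then restore the sign.
// (Exact halfway cases take the first/smaller magnitude, not ties-to-even.)
fn nearest_e2m1(x: f64) -> f64 {
    let a = x.abs();
    let mag = E2M1_VALUES
        .iter()
        .copied()
        .min_by(|p, q| (p - a).abs().partial_cmp(&(q - a).abs()).unwrap())
        .unwrap();
    mag.copysign(x)
}

// Quantize to a shared power-of-two scale and dequantize back.
// Assumes at least one nonzero input value.
fn round_trip(data: &[f64]) -> Vec<f64> {
    let max_abs = data.iter().fold(0.0f64, |m, v| m.max(v.abs()));
    let scale = 2f64.powi(max_abs.log2().ceil() as i32); // e.g. 3.0 -> 4.0
    data.iter().map(|v| nearest_e2m1(v / scale) * scale).collect()
}
```

Running `round_trip(&[1.5, -2.0, 0.5, 3.0])` reproduces the lossy behavior shown above: 1.5 comes back as 2.0 while -2.0 survives exactly, since -0.5 is representable after scaling.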
Structs§
- E8M0
- An 8-bit floating-point type that represents scale factors as powers of two.
- F4E2M1
- A 4-bit floating point type with 2 exponent bits and 1 mantissa bit.
- MXFP4Block
- A compressed block of 32 F4E2M1 values with a shared E8M0 scale factor.