#[repr(transparent)] wrapper around u8 storage, bit-compatible with
__nv_fp8_storage_t on the CUDA side and with float8::F8E4M3 on the
host side. Because the layout is identical to u8, a DeviceBuffer<u8>
(the raw byte substrate) can be reinterpreted as
DeviceBuffer<Fp8E4M3> via view_as without copying.
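
The zero-copy reinterpretation above relies only on the layout guarantee of #[repr(transparent)]. A minimal host-side sketch of the same idea follows; the Fp8E4M3 newtype here is a stand-in and view_bytes_as_fp8 is a hypothetical helper, not the crate's view_as, whose actual signature is not shown in this section.

```rust
/// Stand-in for the FP8 element type: a single byte with no padding.
#[repr(transparent)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Fp8E4M3(pub u8);

/// Hypothetical helper: reinterpret a byte slice as FP8 elements without copying.
pub fn view_bytes_as_fp8(bytes: &[u8]) -> &[Fp8E4M3] {
    // SAFETY: `Fp8E4M3` is `#[repr(transparent)]` over `u8`, so the two types
    // share size and alignment, and every bit pattern is a valid `Fp8E4M3`.
    unsafe { std::slice::from_raw_parts(bytes.as_ptr().cast::<Fp8E4M3>(), bytes.len()) }
}
```

The returned slice aliases the original bytes, so its length and base address are unchanged; only the element type differs.
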
Numerical range: ±448 (largest finite magnitude, S.1111.110). NaN
has a single encoding per sign (S.1111.111, i.e. 0x7F or 0xFF);
E4M3 has no infinities. The conversion path matches NVIDIA’s
__nv_cvt_float_to_fp8(x, __NV_SATFINITE, __NV_E4M3):
round-to-nearest with ties to even, saturating to the largest finite
value on overflow.
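
The rounding and saturation rules above can be sketched on the host in a few lines. This is an illustrative reference, not the crate's conversion routine: f32_to_e4m3_satfinite is a hypothetical name, the sign handling for NaN inputs is an assumption, and the result has not been checked bit-for-bit against the CUDA intrinsic.

```rust
/// Hypothetical helper: convert an f32 to an E4M3 byte using
/// round-to-nearest-even and saturation to the largest finite value.
pub fn f32_to_e4m3_satfinite(x: f32) -> u8 {
    let sign = ((x.to_bits() >> 24) & 0x80) as u8;
    if x.is_nan() {
        // Assumption: keep the input sign; S.1111.111 is the NaN pattern.
        return sign | 0x7F;
    }
    let a = x.abs();
    if a >= 448.0 {
        // Saturate (also covers infinities): 0x7E is S.1111.110 = 448.
        return sign | 0x7E;
    }
    // Unbiased f32 exponent, clamped to -6 so that everything below the
    // smallest E4M3 normal (2^-6) uses the subnormal spacing of 2^-9.
    let e = (((a.to_bits() >> 23) & 0xFF) as i32 - 127).max(-6);
    // Spacing between neighbouring E4M3 values at this magnitude: 2^(e - 3).
    let step = f32::from_bits(((e - 3 + 127) as u32) << 23);
    let scaled = a / step; // exact: division by a power of two, result < 16
    let floor = scaled.floor();
    let frac = scaled - floor;
    let mut q = floor as u32;
    // Round to nearest; on an exact tie, move to the even neighbour.
    if frac > 0.5 || (frac == 0.5 && q % 2 == 1) {
        q += 1;
    }
    // For normals `q` is in 8..=16 and includes the implicit leading one, so
    // adding it to (e + 6) << 3 yields biased exponent e + 7 plus mantissa;
    // q == 16 carries into the next exponent. For subnormals the code is `q`.
    let code = (((e + 6) as u32) << 3) + q;
    sign | code as u8
}
```

For instance, 1.0 maps to 0x38, 1.0625 (an exact tie between 1.0 and 1.125) rounds down to the even code 0x38, and any magnitude at or above 448 yields 0x7E with the input sign.
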
Matrix multiplication routes through Ada Lovelace (sm_89) FP8 tensor
cores (mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32) with F32
accumulation and float alpha / beta scaling. First landed in
baracuda-kernels Phase 2.
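
To make the numerical contract of the tensor-core path concrete, here is a scalar host-side reference, assuming the usual GEMM form D = alpha * (A x B) + beta * C with E4M3 inputs decoded to f32 and products accumulated in f32. The names e4m3_to_f32 and gemm_e4m3_ref are hypothetical; the row-major A and column-major B layout mirrors the row.col qualifier of the instruction, and the summation order will generally differ from the hardware's.

```rust
/// Hypothetical helper: decode one E4M3 byte to f32 (exact for all finite codes).
pub fn e4m3_to_f32(b: u8) -> f32 {
    let exp = ((b >> 3) & 0x0F) as u32;
    let man = (b & 0x07) as f32;
    let mag = if exp == 0x0F && (b & 0x07) == 0x07 {
        f32::NAN // S.1111.111 is the only NaN pattern; there are no infinities
    } else if exp == 0 {
        man / 512.0 // subnormal: man * 2^-9
    } else {
        // normal: (8 + man) * 2^(exp - 10) == (1 + man / 8) * 2^(exp - 7)
        (8.0 + man) * f32::from_bits((exp + 117) << 23)
    };
    if b & 0x80 != 0 { -mag } else { mag }
}

/// Hypothetical reference GEMM: c = alpha * (a x b) + beta * c.
/// `a` is m x k row-major, `b` is k x n column-major, `c` is m x n row-major.
pub fn gemm_e4m3_ref(
    m: usize, n: usize, k: usize,
    alpha: f32, a: &[u8], b: &[u8],
    beta: f32, c: &mut [f32],
) {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(c.len(), m * n);
    for i in 0..m {
        for j in 0..n {
            // Accumulate in f32, as the tensor-core path does.
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += e4m3_to_f32(a[i * k + p]) * e4m3_to_f32(b[j * k + p]);
            }
            c[i * n + j] = alpha * acc + beta * c[i * n + j];
        }
    }
}
```

A reference like this is mainly useful for tolerance-based comparison against device results, since accumulation order affects the low bits of an f32 sum.
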