FP8 (float8_e4m3fn) dequantization support.
Models like Qwen3.5-27B-FP8 store most weight tensors in F8_E4M3 format with
per-block scale factors (weight_scale_inv). This module provides a custom
VarBuilder backend that transparently dequantizes FP8 weights at load time,
allowing cake to run FP8-quantized models on any backend (CUDA, Metal, CPU).
Dequantization formula (block size 128×128):

`bf16_weight[i*128..(i+1)*128, j*128..(j+1)*128] = cast_bf16(fp8_weight[…same range…]) * scale_inv[i, j]`
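The block-wise scaling above can be sketched in plain Rust. This is an illustrative example, not the module's actual implementation: the block size is shrunk from 128 to 2 for readability, and `f32` stands in for both the FP8 source values and the bf16 output; the `dequantize` helper is hypothetical.

```rust
/// Illustrative block size (the real module uses 128).
const BLOCK: usize = 2;

/// Dequantize a `rows x cols` row-major matrix: each BLOCK x BLOCK tile
/// of `fp8` is multiplied by its scalar from `scale_inv`, which holds one
/// scale per tile in row-major tile order.
fn dequantize(fp8: &[f32], scale_inv: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    let blocks_per_row = cols / BLOCK; // assumes dims divide evenly
    let mut out = vec![0.0f32; rows * cols];
    for r in 0..rows {
        for c in 0..cols {
            // Tile index (r / BLOCK, c / BLOCK) selects the scale factor.
            let scale = scale_inv[(r / BLOCK) * blocks_per_row + (c / BLOCK)];
            out[r * cols + c] = fp8[r * cols + c] * scale;
        }
    }
    out
}

fn main() {
    // A 2x4 weight split into two 2x2 tiles with scales 0.5 and 2.0.
    let fp8 = [1.0, 2.0, 3.0, 4.0,
               5.0, 6.0, 7.0, 8.0];
    let scale_inv = [0.5, 2.0];
    let w = dequantize(&fp8, &scale_inv, 2, 4);
    assert_eq!(w, vec![0.5, 1.0, 6.0, 8.0, 2.5, 3.0, 14.0, 16.0]);
    println!("{:?}", w);
}
```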
Functions§
- `is_fp8_quantized` — Check whether a model uses FP8 block-wise quantization by looking at its config.
- `load_fp8_var_builder` ⚠ — Create a VarBuilder that transparently dequantizes FP8 weights.