Module fp8

FP8 (float8_e4m3fn) dequantization support.

Models like Qwen3.5-27B-FP8 store most weight tensors in F8_E4M3 format with per-block scale factors (weight_scale_inv). This module provides a custom VarBuilder backend that transparently dequantizes FP8 weights at load time, allowing cake to run FP8-quantized models on any backend (CUDA, Metal, CPU).

Dequantization formula (block size 128×128): bf16_weight[i*128..(i+1)*128, j*128..(j+1)*128] = cast(fp8_weight[…same range…]) * scale_inv[i, j]
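The block-wise dequantization above can be sketched in plain Rust. This is an illustrative sketch only: it uses `f32` slices in a row-major layout in place of the real F8_E4M3 and bf16 tensor types, and the function name `dequantize_blockwise` is hypothetical; the actual module performs this inside a VarBuilder backend on framework tensors.

```rust
// Block size used by FP8 block-wise quantization in this scheme.
const BLOCK: usize = 128;

/// Multiply each 128x128 block of `weight` (rows x cols, row-major)
/// by its per-block scale from `scale_inv`, whose shape is
/// ceil(rows/128) x ceil(cols/128), also row-major.
/// Hypothetical helper; real code operates on tensors, not slices.
fn dequantize_blockwise(weight: &[f32], rows: usize, cols: usize, scale_inv: &[f32]) -> Vec<f32> {
    let scale_cols = (cols + BLOCK - 1) / BLOCK;
    let mut out = vec![0.0f32; rows * cols];
    for r in 0..rows {
        for c in 0..cols {
            // scale_inv[i, j] applies to the whole 128x128 block (i, j).
            let s = scale_inv[(r / BLOCK) * scale_cols + (c / BLOCK)];
            out[r * cols + c] = weight[r * cols + c] * s;
        }
    }
    out
}
```

Edge blocks (when a dimension is not a multiple of 128) are handled here by the ceiling division on `scale_cols` and the `r / BLOCK`, `c / BLOCK` indexing, which maps every trailing partial block to its own scale entry.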

Functions

is_fp8_quantized
Check whether a model uses FP8 block-wise quantization by looking at its config.
load_fp8_var_builder
Create a VarBuilder that transparently dequantizes FP8 weights.