pub enum QuantScheme {
}
How a tensor is quantized. Mirrors the schemes RLX needs for LLM inference on Apple Silicon: blockwise int8 (GPTQ-style), blockwise int4 (Q4_K), and per-tensor fp8 (e4m3 / e5m2).
Each variant carries the parameters the dequantizer needs to read at runtime — scale, zero-point, block size. Where these live in the actual weight tensor is up to the loader (#56).
Variants
Int8Block
Symmetric int8 with one scale per block_size elements.
Int8BlockAsym
Asymmetric int8 with scale + zero-point per block_size elements.
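As a rough illustration of what the two int8 variants imply for a dequantizer, here is a minimal sketch. The function names, the flat `scales` / `zero_points` slices, and the `(q - zero_point) * scale` convention are assumptions made for this example; the crate's actual kernels and tensor layout may differ.

```rust
/// Symmetric blockwise int8: x = q * scale, one scale per `block_size` elements.
/// (Hypothetical helper, not part of the documented API.)
fn dequant_int8_sym(q: &[i8], scales: &[f32], block_size: usize) -> Vec<f32> {
    q.iter()
        .enumerate()
        .map(|(i, &v)| v as f32 * scales[i / block_size])
        .collect()
}

/// Asymmetric blockwise int8: x = (q - zero_point) * scale.
/// The i8 zero-point type and the subtraction convention are assumptions.
fn dequant_int8_asym(
    q: &[i8],
    scales: &[f32],
    zero_points: &[i8],
    block_size: usize,
) -> Vec<f32> {
    q.iter()
        .enumerate()
        .map(|(i, &v)| {
            let block = i / block_size;
            // Widen to i32 first so the subtraction cannot overflow i8.
            (v as i32 - zero_points[block] as i32) as f32 * scales[block]
        })
        .collect()
}
```

With `block_size = 2`, the values `[2, -4, 1, 3]` and scales `[0.5, 2.0]` dequantize symmetrically to `[1.0, -2.0, 2.0, 6.0]`.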
Int4Block
Int4 packed two per byte, with one scale per block_size elements
(Q4_K-ish; matches GGUF block layout).
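The two-per-byte packing can be sketched as follows. The nibble order (low nibble first) and the recentering by 8 are assumptions chosen for illustration; the source does not specify either, and the helper names are hypothetical.

```rust
/// Unpack bytes into signed 4-bit values.
/// Assumed layout: low nibble is the earlier element, and the unsigned
/// code 0..=15 is recentered to -8..=7 by subtracting 8.
fn unpack_int4(packed: &[u8]) -> Vec<i8> {
    let mut out = Vec::with_capacity(packed.len() * 2);
    for &byte in packed {
        out.push((byte & 0x0F) as i8 - 8);
        out.push((byte >> 4) as i8 - 8);
    }
    out
}

/// Dequantize with one scale per `block_size` unpacked elements.
fn dequant_int4_block(packed: &[u8], scales: &[f32], block_size: usize) -> Vec<f32> {
    unpack_int4(packed)
        .iter()
        .enumerate()
        .map(|(i, &q)| q as f32 * scales[i / block_size])
        .collect()
}
```

Under these assumptions the bytes `[0x80, 0xF1]` unpack to `[-8, 0, -7, 7]`.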
Fp8E4m3
FP8 e4m3 (no scale; same domain as half).
Fp8E5m2
FP8 e5m2 (no scale; wider range than e4m3).
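To make the range difference between the two FP8 formats concrete, the sketch below decodes a single byte of each into `f32`. It assumes the common OCP encodings (e4m3 with bias 7 and no infinities, e5m2 with bias 15 and IEEE-style infinities and NaNs); whether this crate uses exactly these flavours is not stated in the source, and the function names are hypothetical.

```rust
/// Decode one FP8 e4m3 byte. Assumed flavour: bias 7, no infinities,
/// and only the pattern S.1111.111 is NaN.
fn fp8_e4m3_to_f32(bits: u8) -> f32 {
    let sign = if bits & 0x80 != 0 { -1.0_f32 } else { 1.0 };
    let exp = ((bits >> 3) & 0x0F) as i32;
    let man = (bits & 0x07) as f32;
    if exp == 0x0F && man == 7.0 {
        return f32::NAN;
    }
    let magnitude = if exp == 0 {
        (man / 8.0) * 2.0_f32.powi(-6) // subnormal
    } else {
        (1.0 + man / 8.0) * 2.0_f32.powi(exp - 7)
    };
    sign * magnitude
}

/// Decode one FP8 e5m2 byte. Assumed flavour: bias 15, exponent 31
/// reserved for infinity (mantissa 0) and NaN (mantissa non-zero).
fn fp8_e5m2_to_f32(bits: u8) -> f32 {
    let sign = if bits & 0x80 != 0 { -1.0_f32 } else { 1.0 };
    let exp = ((bits >> 2) & 0x1F) as i32;
    let man = (bits & 0x03) as f32;
    if exp == 0x1F {
        return if man == 0.0 { sign * f32::INFINITY } else { f32::NAN };
    }
    let magnitude = if exp == 0 {
        (man / 4.0) * 2.0_f32.powi(-14) // subnormal
    } else {
        (1.0 + man / 4.0) * 2.0_f32.powi(exp - 15)
    };
    sign * magnitude
}
```

Under these encodings the largest finite e4m3 value is 448 (byte `0x7E`), while e5m2 reaches 57344 (byte `0x7B`) at the cost of one mantissa bit.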
GgufQ4K
GGUF / llama.cpp Q4_K super-block (256 elements / 144 bytes).
Packs an f16 super-scale + f16 super-min + 8 sub-block 6-bit
scales + 8 sub-block 6-bit mins + 128 bytes of packed nibbles
(256 4-bit quants). Block layout is fixed by the format, so
there is no block_size knob.
GgufQ5K
GGUF Q5_K (256 / 176 bytes). Adds a 32-byte high-bit plane on top of Q4_K.
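The byte counts quoted for Q4_K and Q5_K follow directly from the fields listed above. The sketch below only redoes that arithmetic; the constant and function names are hypothetical and are not part of this crate.

```rust
/// Elements per K-quant super-block.
const QK_K: usize = 256;

/// Sum the Q4_K fields described above, in bytes.
const fn q4k_block_bytes() -> usize {
    let super_scale = 2; // one f16
    let super_min = 2; // one f16
    // 8 six-bit scales + 8 six-bit mins = 96 bits = 12 bytes.
    let sub_block_scales_and_mins = (8 * 6 + 8 * 6) / 8;
    // 256 four-bit quants packed two per byte = 128 bytes.
    let quants = QK_K * 4 / 8;
    super_scale + super_min + sub_block_scales_and_mins + quants
}

/// Q5_K adds one extra high bit per element: 256 bits = 32 bytes.
const fn q5k_block_bytes() -> usize {
    q4k_block_bytes() + QK_K / 8
}
```

This gives 2 + 2 + 12 + 128 = 144 bytes for Q4_K and 144 + 32 = 176 bytes for Q5_K.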
GgufQ6K
GGUF Q6_K (256 / 210 bytes). Per-sub-block signed scales, no min term.
GgufQ8K
GGUF Q8_K (256 / 292 bytes). Per-super-block f32 scale (4 bytes) plus 256 i8 quants and a 32-byte sum-of-blocks table that’s only used by Q8_K × Q8_K matmul accumulation paths.
GgufQ2K
GGUF Q2_K (256 / 84 bytes). 2-bit quants with per-sub-block scale/min.
GgufQ3K
GGUF Q3_K (256 / 110 bytes). 3-bit quants with hmask high bit plane.
GgufQ4_0
GGUF Q4_0 (32 / 18 bytes). Legacy llama.cpp block: f16 scale + nibbles.
GgufQ8_0
GGUF Q8_0 (32 / 34 bytes). Legacy block: f16 scale + 32×i8 quants.
Nvfp4Block
NVIDIA FP4 (E2M1) block — fixed 16-element groups, FP8 E4M3 block
scales, optional f32 global scale on input 3 (legacy zp slot).
Used by FLUX.2 / MLX nvfp4 checkpoints.
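A minimal sketch of how one 16-element NVFP4 group might be decoded is shown below. The E2M1 value table is the standard one for that format; the nibble order (low nibble first), the already-decoded `block_scale`, and the multiplicative use of the optional global scale are assumptions for this example, and the names are hypothetical.

```rust
/// Magnitudes of the eight non-negative E2M1 codes (2 exponent bits, 1 mantissa bit).
const E2M1_MAGNITUDES: [f32; 8] = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0];

/// Decode one 4-bit E2M1 code; bit 3 is the sign.
fn e2m1_to_f32(code: u8) -> f32 {
    let magnitude = E2M1_MAGNITUDES[(code & 0x07) as usize];
    if code & 0x08 != 0 { -magnitude } else { magnitude }
}

/// Dequantize one group of 16 elements stored in 8 bytes.
/// `block_scale` is the group's FP8 E4M3 scale already converted to f32;
/// `global_scale` is assumed to multiply every element when present.
fn dequant_nvfp4_group(
    packed: &[u8; 8],
    block_scale: f32,
    global_scale: Option<f32>,
) -> [f32; 16] {
    let scale = block_scale * global_scale.unwrap_or(1.0);
    let mut out = [0.0_f32; 16];
    for (i, &byte) in packed.iter().enumerate() {
        out[2 * i] = e2m1_to_f32(byte & 0x0F) * scale;
        out[2 * i + 1] = e2m1_to_f32(byte >> 4) * scale;
    }
    out
}
```

For instance, the code `0xB` (sign bit set, magnitude index 3) decodes to -1.5 before scaling.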
Implementations
impl QuantScheme
pub const fn bits_per_element_x10(self) -> u32
Bits per element after packing, multiplied by 10 because K-quants have fractional bit budgets. Divide by 10 when comparing.
pub const fn bits_per_element(self) -> u32
Bits per element after packing (rounded down). Use
bits_per_element_x10 for the K-quant fractional values.
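The fractional budgets can be derived from the block sizes listed under the variants. The sketch below is an independent calculation, not the crate's implementation: `bits_x10` is a hypothetical helper, and truncating (rather than rounding) the last digit is an assumption.

```rust
/// Bits per element times 10 for a block of `block_elems` elements stored
/// in `block_bytes` bytes, truncated toward zero.
const fn bits_x10(block_elems: u32, block_bytes: u32) -> u32 {
    block_bytes * 8 * 10 / block_elems
}

/// Whole bits per element, rounded down.
const fn bits_floor(block_elems: u32, block_bytes: u32) -> u32 {
    bits_x10(block_elems, block_bytes) / 10
}
```

For Q4_K (256 elements in 144 bytes) this yields 45, that is 4.5 bits per element, which rounds down to 4. Q4_0 (32 elements in 18 bytes) also yields 45, and Q8_0 (32 elements in 34 bytes) yields 85.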
pub const fn has_scale(self) -> bool
True if this scheme requires a per-block scale tensor on the side.
pub const fn scale_is_fp8(self) -> bool
True for NVFP4 block scales stored as FP8 E4M3 bytes (not f32).
pub const fn nvfp4_group_size(self) -> u32
Fixed NVFP4 group size along K (0 for other schemes).
pub const fn has_zero_point(self) -> bool
True if this scheme requires a per-block zero-point.
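One way a caller might use these two predicates is to work out how many input tensors a quantized operand needs. The sketch below uses a stand-in enum, `SchemeSketch`, covering only the variants whose scale and zero-point requirements are spelled out in the variant docs; it is a hypothetical mirror, not the real `QuantScheme`, and `operand_count` is an invented helper.

```rust
/// Stand-in for a subset of the documented variants (hypothetical).
#[derive(Clone, Copy, Debug, PartialEq)]
enum SchemeSketch {
    Int8Block,
    Int8BlockAsym,
    Int4Block,
    Fp8E4m3,
    Fp8E5m2,
}

impl SchemeSketch {
    /// Blockwise integer schemes carry a side scale; FP8 is documented as "no scale".
    const fn has_scale(self) -> bool {
        matches!(self, Self::Int8Block | Self::Int8BlockAsym | Self::Int4Block)
    }

    /// Only the asymmetric int8 scheme is documented with a zero-point.
    const fn has_zero_point(self) -> bool {
        matches!(self, Self::Int8BlockAsym)
    }

    /// Quantized weights, plus a scale tensor and a zero-point tensor when required.
    const fn operand_count(self) -> usize {
        1 + self.has_scale() as usize + self.has_zero_point() as usize
    }
}
```

Under this reading, `Int8BlockAsym` needs three tensors (weights, scale, zero-point), while the FP8 variants need only the weights.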
pub const fn gguf_block_size(self) -> u32
GGUF K-quant block size (256 elements) — meaningless for the non-GGUF schemes (returns 0).
pub const fn gguf_block_bytes(self) -> u32
Bytes per GGUF super-block. 0 for non-GGUF schemes.
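Together, the block size and block byte count are enough to compute how much storage a GGUF-quantized tensor occupies. The helper below is a hypothetical sketch that takes the two values as plain integers; treating a zero block size or a non-multiple element count as an error is an assumption about how a caller would want to handle those cases.

```rust
/// Storage size in bytes of a tensor with `n_elements` elements, given the
/// scheme's block size (in elements) and block byte count.
/// Returns None for non-GGUF schemes (block size 0) or when the element
/// count is not a whole number of blocks.
fn gguf_tensor_bytes(n_elements: u64, block_size: u64, block_bytes: u64) -> Option<u64> {
    if block_size == 0 || n_elements % block_size != 0 {
        return None;
    }
    Some(n_elements / block_size * block_bytes)
}
```

A 4096 × 4096 weight matrix in Q4_K (256 elements per 144-byte block) has 65536 blocks and therefore occupies 9,437,184 bytes.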
Trait Implementations
impl Clone for QuantScheme
fn clone(&self) -> QuantScheme
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
impl Debug for QuantScheme
impl Display for QuantScheme
impl PartialEq for QuantScheme
fn eq(&self, other: &QuantScheme) -> bool
Tests for self and other values to be equal, and is used by ==.