#[non_exhaustive]
#[repr(u16)]
pub enum QuantizeKind {
PerTensor = 0,
PerTensorBackward = 1,
DequantizePerTensor = 2,
DequantizePerTensorBackward = 3,
PerChannel = 4,
PerChannelBackward = 5,
DequantizePerChannel = 6,
DequantizePerChannelBackward = 7,
FakeQuantize = 8,
FakeQuantizeBackward = 9,
PerToken = 16,
PerTokenBackward = 17,
PerGroup = 18,
PerGroupBackward = 19,
DynamicRange = 20,
DequantizePerToken = 21,
DequantizePerTokenBackward = 22,
DequantizePerGroup = 23,
DequantizePerGroupBackward = 24,
QuantizedLinear = 25,
GgufDequantize = 26,
GgufMmvq = 27,
}
Quantization op discriminant — Category P from the comprehensive plan.
Stored as u16 in crate::KernelSku::op when
category == OpCategory::Quantization. Phase 8 Milestone 8.1 wires the
trailblazer set: per-tensor + per-channel quantize / dequantize plus
fake_quantize (round-trip in FP space). All entries support FW + BW
where applicable (FW-only for kinds that have no meaningful gradient).
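As a rough illustration of the u16 storage described above, the sketch below mirrors a handful of variants in a local stand-in enum and converts to and from the raw discriminant. `KindSketch`, `encode`, and `decode` are hypothetical names invented for this example; they are not part of the crate, and only a subset of variants is mirrored.

```rust
// Hypothetical local mirror of four variants; the real enum lives in the crate.
#[repr(u16)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum KindSketch {
    PerTensor = 0,
    FakeQuantizeBackward = 9,
    PerToken = 16,
    GgufMmvq = 27,
}

/// Convert a variant to the raw u16 that would be stored in an op field.
pub fn encode(kind: KindSketch) -> u16 {
    kind as u16
}

/// Convert a raw u16 back to a variant. Values that this sketch does not
/// mirror (including the unassigned range 10..=15) yield None.
pub fn decode(raw: u16) -> Option<KindSketch> {
    match raw {
        0 => Some(KindSketch::PerTensor),
        9 => Some(KindSketch::FakeQuantizeBackward),
        16 => Some(KindSketch::PerToken),
        27 => Some(KindSketch::GgufMmvq),
        _ => None,
    }
}
```

Returning `Option` rather than transmuting keeps unassigned discriminants from ever becoming an invalid enum value.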
Trailblazer dtype scope. Input FP × output int:
- Input FP: f32, f64, f16, bf16.
- Output int: s8, u8. Sub-byte packed types (s4, u4) are deferred.
scale matches the input FP dtype; zero_point is always i32 (wide enough for any int output qmin/qmax range).
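To make the qmin/qmax remark concrete, here is a minimal sketch of the clamp range for the two supported output types, expressed in i32 so it lines up with the zero_point type. `IntOut` and `q_range` are hypothetical names, and the sketch assumes the full native range of each integer type is used.

```rust
/// Hypothetical tag for the two integer output dtypes in scope.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum IntOut {
    S8,
    U8,
}

/// (qmin, qmax) for each output dtype, widened to i32.
/// Assumption: the full native range of the type is used.
pub fn q_range(out: IntOut) -> (i32, i32) {
    match out {
        IntOut::S8 => (i8::MIN as i32, i8::MAX as i32),
        IntOut::U8 => (u8::MIN as i32, u8::MAX as i32),
    }
}
```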
Backward convention (Straight-Through Estimator). The BW of
quantize and fake_quantize uses STE — the gradient passes through
(with a 1/scale factor for quantize, no factor for fake_quantize)
where the rounded result was in-range [qmin, qmax], zero elsewhere.
The “in-range mask” is recomputed in BW from the saved input
tensor rather than saved as a separate FW output — this matches
PyTorch’s internal FakeQuantize and keeps the FW signature clean.
Callers must therefore retain the original input tensor for the BW
pass (which they would do anyway for autograd).
Future milestones extend this enum with PerToken / PerGroup /
DynamicRange variants — discriminant gaps are intentionally left
for those.
Variants (Non-exhaustive)
This enum is marked as non-exhaustive
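In practice, a non-exhaustive enum means that a `match` in a downstream crate needs a wildcard arm. The sketch below shows the pattern on a local stand-in; `KindSketch` and `is_inference_only` are hypothetical, and the classification simply follows the per-variant notes on this page.

```rust
// Hypothetical stand-in for a few variants of the real enum.
#[non_exhaustive]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum KindSketch {
    PerTensor,
    PerTensorBackward,
    QuantizedLinear,
    GgufMmvq,
}

/// True for kinds documented as inference-only (no backward shipped).
pub fn is_inference_only(kind: KindSketch) -> bool {
    match kind {
        KindSketch::QuantizedLinear | KindSketch::GgufMmvq => true,
        // The wildcard arm is what a downstream crate must write so that
        // variants added later do not break compilation.
        _ => false,
    }
}
```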
PerTensor = 0
quantize_per_tensor(x, scale, zero_point) —
q = clamp(round(x / scale) + zero_point, qmin, qmax).
One scalar scale (FP) and zero_point (i32) for the whole
tensor. PyTorch torch.quantize_per_tensor.
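The formula above can be sketched as a scalar reference loop. This is an illustration only, not the crate's kernel: the function name and signature are assumptions, the result is kept in i32 for clarity, and `f32::round` rounds ties away from zero, which may differ from the rounding mode the real kernel uses.

```rust
/// Reference sketch of q = clamp(round(x / scale) + zero_point, qmin, qmax).
/// Assumption: results are returned as i32 already clamped to [qmin, qmax];
/// a real kernel would narrow them to s8 or u8.
pub fn quantize_per_tensor(
    x: &[f32],
    scale: f32,
    zero_point: i32,
    qmin: i32,
    qmax: i32,
) -> Vec<i32> {
    x.iter()
        .map(|&v| {
            // The float-to-int cast saturates; saturating_add avoids overflow.
            let rounded = (v / scale).round() as i32;
            rounded.saturating_add(zero_point).clamp(qmin, qmax)
        })
        .collect()
}
```

For example, with scale 0.5 and zero_point 3 in the s8 range, the input 1.3 maps to round(2.6) + 3 = 6, while 100.0 saturates at 127.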
PerTensorBackward = 1
Gradient of Self::PerTensor via STE:
dx = (dy / scale) * in_range_mask, where the mask is
qmin <= round(x/scale) + zp <= qmax.
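A minimal sketch of this STE rule follows, recomputing the in-range mask from the saved input as the backward convention describes. The function name and signature are assumptions made for this example, and `f32::round` stands in for whatever rounding the forward kernel uses.

```rust
/// Reference sketch of dx = (dy / scale) * in_range_mask.
/// The mask is recomputed from the saved input x, not stored by the forward.
pub fn quantize_per_tensor_backward(
    dy: &[f32],
    x: &[f32],
    scale: f32,
    zero_point: i32,
    qmin: i32,
    qmax: i32,
) -> Vec<f32> {
    assert_eq!(dy.len(), x.len());
    dy.iter()
        .zip(x)
        .map(|(&g, &v)| {
            // Unclamped quantized value, exactly as in the forward pass.
            let q = ((v / scale).round() as i32).saturating_add(zero_point);
            if q >= qmin && q <= qmax {
                g / scale
            } else {
                0.0
            }
        })
        .collect()
}
```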
DequantizePerTensor = 2
dequantize_per_tensor(q, scale, zero_point) —
x = scale * (q - zero_point). Linear; exactly invertible up to
rounding. PyTorch torch.Tensor.dequantize.
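The dequantize formula is simple enough to sketch directly. The function below is illustrative only; its name and the choice of i8 input with f32 output are assumptions made for this example.

```rust
/// Reference sketch of x = scale * (q - zero_point) for s8 input.
pub fn dequantize_per_tensor(q: &[i8], scale: f32, zero_point: i32) -> Vec<f32> {
    q.iter()
        // Widen to i32 before subtracting so the difference cannot overflow i8.
        .map(|&v| scale * (v as i32 - zero_point) as f32)
        .collect()
}
```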
DequantizePerTensorBackward = 3
Gradient of Self::DequantizePerTensor: dq = dy * scale
(linear identity scaled).
PerChannel = 4
quantize_per_channel(x, scale[C], zero_point[C], axis) — same
math as Self::PerTensor but with one scale[c] /
zero_point[c] pair per slice along axis. PyTorch
torch.quantize_per_channel.
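As an illustration of the per-channel indexing, the sketch below handles the special case of a contiguous row-major tensor whose channel axis is the outermost one (axis = 0). The function name, the flat-slice layout, and the restriction to axis 0 are all assumptions of this example.

```rust
/// Reference sketch of per-channel quantization with axis = 0.
/// x is a flat row-major buffer of shape [channels, inner].
pub fn quantize_per_channel_axis0(
    x: &[f32],
    channels: usize,
    scale: &[f32],
    zero_point: &[i32],
    qmin: i32,
    qmax: i32,
) -> Vec<i32> {
    assert_eq!(scale.len(), channels);
    assert_eq!(zero_point.len(), channels);
    assert!(channels > 0 && x.len() % channels == 0);
    let inner = x.len() / channels;
    x.iter()
        .enumerate()
        .map(|(i, &v)| {
            // Every element of slice c shares scale[c] and zero_point[c].
            let c = i / inner;
            ((v / scale[c]).round() as i32)
                .saturating_add(zero_point[c])
                .clamp(qmin, qmax)
        })
        .collect()
}
```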
PerChannelBackward = 5
Gradient of Self::PerChannel via STE:
dx = (dy / scale[c]) * in_range_mask[c].
DequantizePerChannel = 6
dequantize_per_channel(q, scale[C], zero_point[C], axis) —
x = scale[c] * (q - zero_point[c]).
DequantizePerChannelBackward = 7
Gradient of Self::DequantizePerChannel:
dq = dy * scale[c].
FakeQuantize = 8
fake_quantize_per_tensor(x, scale, zero_point) —
y = scale * (clamp(round(x/scale)+zp, qmin, qmax) - zp). This is the
quantize-then-dequantize round trip in FP space; it produces a lossy
FP output. PyTorch
torch.fake_quantize_per_tensor_affine.
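A short sketch of the round trip, under the same caveats as any scalar reference: the function name and signature are assumptions, and `f32::round` may not match the kernel's rounding mode on exact ties.

```rust
/// Reference sketch of y = scale * (clamp(round(x/scale) + zp, qmin, qmax) - zp).
pub fn fake_quantize_per_tensor(
    x: &[f32],
    scale: f32,
    zero_point: i32,
    qmin: i32,
    qmax: i32,
) -> Vec<f32> {
    x.iter()
        .map(|&v| {
            // Quantize step.
            let q = ((v / scale).round() as i32)
                .saturating_add(zero_point)
                .clamp(qmin, qmax);
            // Dequantize step: the output is back in FP space.
            scale * (q - zero_point) as f32
        })
        .collect()
}
```

With scale 0.5 and zero_point 3 in the s8 range, 1.3 comes back as 1.5 (snapped to the grid) and 100.0 comes back as 62.0 (clamped at qmax).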
FakeQuantizeBackward = 9
Gradient of Self::FakeQuantize via STE:
dx = dy * in_range_mask. No 1/scale factor — the
dequant-side multiplication by scale in FW cancels the
1/scale from STE.
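For comparison with the plain quantize gradient, here is a sketch of the fake-quantize backward, where the gradient is passed through unscaled inside the range. The function name and signature are assumptions made for this example.

```rust
/// Reference sketch of dx = dy * in_range_mask (no 1/scale factor).
pub fn fake_quantize_backward(
    dy: &[f32],
    x: &[f32],
    scale: f32,
    zero_point: i32,
    qmin: i32,
    qmax: i32,
) -> Vec<f32> {
    assert_eq!(dy.len(), x.len());
    dy.iter()
        .zip(x)
        .map(|(&g, &v)| {
            // Mask recomputed from the saved forward input.
            let q = ((v / scale).round() as i32).saturating_add(zero_point);
            if q >= qmin && q <= qmax {
                g
            } else {
                0.0
            }
        })
        .collect()
}
```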
PerToken = 16
Reserved — quantize_per_token (per-row dynamic-range
quantization used by activation quantization).
PerTokenBackward = 17
Reserved — gradient of Self::PerToken.
PerGroup = 18
Reserved — quantize_per_group (block-wise quantization used by
GPTQ / AWQ / GGML).
PerGroupBackward = 19
Reserved — gradient of Self::PerGroup.
DynamicRange = 20
Reserved — dynamic_range_quantize (post-training dynamic
quantization).
DequantizePerToken = 21
dequantize_per_token(q, scale[N], zero_point[N]) —
y[n, d] = scale[n] * (q[n, d] - zp[n]). Per-row inverse of
Self::PerToken.
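The per-row indexing can be sketched as follows, assuming a flat row-major [N, D] buffer of s8 values. The function name, the explicit `d` parameter, and the layout are assumptions of this example.

```rust
/// Reference sketch of y[n, d] = scale[n] * (q[n, d] - zp[n]).
/// q is a flat row-major buffer of shape [N, d], with N = scale.len().
pub fn dequantize_per_token(
    q: &[i8],
    d: usize,
    scale: &[f32],
    zero_point: &[i32],
) -> Vec<f32> {
    let n = scale.len();
    assert_eq!(zero_point.len(), n);
    assert_eq!(q.len(), n * d);
    let mut y = Vec::with_capacity(q.len());
    for row in 0..n {
        for col in 0..d {
            let v = q[row * d + col] as i32;
            y.push(scale[row] * (v - zero_point[row]) as f32);
        }
    }
    y
}
```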
DequantizePerTokenBackward = 22
Gradient of Self::DequantizePerToken:
dq = dy * scale[n] (straight-through).
DequantizePerGroup = 23
dequantize_per_group(q, scale[outer, num_groups], zero_point[outer, num_groups]) — per-group inverse of
Self::PerGroup.
DequantizePerGroupBackward = 24
Gradient of Self::DequantizePerGroup:
dq[i, j] = dy[i, j] * scale[i, j/g] (straight-through).
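The `j/g` index in the formula is integer division by the group size. A minimal sketch, assuming a flat row-major [outer, inner] gradient and a [outer, num_groups] scale with num_groups = inner / g (the function and parameter names are hypothetical):

```rust
/// Reference sketch of dq[i, j] = dy[i, j] * scale[i, j / g].
pub fn dequantize_per_group_backward(
    dy: &[f32],
    inner: usize,
    group_size: usize,
    scale: &[f32],
) -> Vec<f32> {
    assert!(group_size > 0 && inner > 0 && inner % group_size == 0);
    assert!(dy.len() % inner == 0);
    let num_groups = inner / group_size;
    let outer = dy.len() / inner;
    assert_eq!(scale.len(), outer * num_groups);
    let mut dq = Vec::with_capacity(dy.len());
    for i in 0..outer {
        for j in 0..inner {
            // Each run of group_size consecutive columns shares one scale.
            let s = scale[i * num_groups + j / group_size];
            dq.push(dy[i * inner + j] * s);
        }
    }
    dq
}
```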
QuantizedLinear = 25
quantized_linear(activation_fp, weight_q_s8, weight_scale, bias?) — W8A8 fused quantized matmul. Pipeline: dynamic-range
per-token quantize the activation → int8 GEMM with int32
accumulator → dequantize via per-row scale_a and per-channel
scale_w. The canonical inference-time LLM matmul recipe
(e.g. SmoothQuant, AWQ-runtime); FP activation in, FP output out,
int8 storage only on the GEMM. Backward isn’t shipped — this op
is inference-only by convention.
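The three-stage pipeline can be sketched with naive loops. Several details here are assumptions, not documented behavior: symmetric quantization (zero point 0) for both operands, a per-token scale of max|x| / 127, a weight layout of [n, k] row-major, and the function name itself. The sketch is meant only to show where each scale enters.

```rust
/// Naive sketch of a W8A8 linear: activation [m, k] f32, weight [n, k] s8.
/// Assumptions: symmetric quantization, scale_a = max|x| / 127 per row.
pub fn quantized_linear_sketch(
    activation: &[f32],
    m: usize,
    k: usize,
    weight_q: &[i8],
    weight_scale: &[f32],
    bias: Option<&[f32]>,
) -> Vec<f32> {
    let n = weight_scale.len();
    assert_eq!(activation.len(), m * k);
    assert_eq!(weight_q.len(), n * k);
    let mut out = vec![0.0f32; m * n];
    for row in 0..m {
        let x = &activation[row * k..(row + 1) * k];
        // Stage 1: dynamic-range per-token quantization of the activation.
        let max_abs = x.iter().fold(0.0f32, |a, &v| a.max(v.abs()));
        let scale_a = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
        let xq: Vec<i8> = x
            .iter()
            .map(|&v| (v / scale_a).round().clamp(-127.0, 127.0) as i8)
            .collect();
        for col in 0..n {
            let w = &weight_q[col * k..(col + 1) * k];
            // Stage 2: int8 dot product with an int32 accumulator.
            let acc: i32 = xq
                .iter()
                .zip(w)
                .map(|(&a, &b)| (a as i32) * (b as i32))
                .sum();
            // Stage 3: dequantize with per-row scale_a and per-channel scale_w.
            let mut y = acc as f32 * scale_a * weight_scale[col];
            if let Some(b) = bias {
                y += b[col];
            }
            out[row * n + col] = y;
        }
    }
    out
}
```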
GgufDequantize = 26
gguf_dequantize(packed_bytes) -> fp_tensor — unpack a
GGUF-packed weight buffer (Q4_0 / Q4_1 / Q5_0 / Q5_1 / Q8_0 +
Q2_K / Q3_K / Q4_K / Q5_K / Q6_K / Q8_K) into a dense FP
tensor. The block format is carried out-of-band on the plan
descriptor (see GgufBlockFormat); the kernel surface
fans out across block formats but the enum value is the same.
Inference-only by convention (BW not shipped).
GgufMmvq = 27
gguf_mmvq(packed_weight, fp_activation) -> fp_output —
fused dequant + matrix-vector multiply: the inference-time
“decode-step” matmul used by llama.cpp on GGUF weights.
FP activation in (f32 today), FP output out. Inference-only
(BW not shipped).
Trait Implementations
impl Clone for QuantizeKind
fn clone(&self) -> QuantizeKind
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
impl Copy for QuantizeKind
impl Debug for QuantizeKind
impl Eq for QuantizeKind
impl Hash for QuantizeKind
impl PartialEq for QuantizeKind
fn eq(&self, other: &QuantizeKind) -> bool
Tests for self and other values to be equal, and is used by ==.