Enum QuantizeKind

#[non_exhaustive]
#[repr(u16)]
pub enum QuantizeKind {
    PerTensor = 0,
    PerTensorBackward = 1,
    DequantizePerTensor = 2,
    DequantizePerTensorBackward = 3,
    PerChannel = 4,
    PerChannelBackward = 5,
    DequantizePerChannel = 6,
    DequantizePerChannelBackward = 7,
    FakeQuantize = 8,
    FakeQuantizeBackward = 9,
    PerToken = 16,
    PerTokenBackward = 17,
    PerGroup = 18,
    PerGroupBackward = 19,
    DynamicRange = 20,
    DequantizePerToken = 21,
    DequantizePerTokenBackward = 22,
    DequantizePerGroup = 23,
    DequantizePerGroupBackward = 24,
    QuantizedLinear = 25,
    GgufDequantize = 26,
    GgufMmvq = 27,
}

Quantization op discriminant — Category P from the comprehensive plan.

Stored as u16 in crate::KernelSku::op when category == OpCategory::Quantization. Phase 8 Milestone 8.1 wires the trailblazer set: per-tensor and per-channel quantize / dequantize, plus fake_quantize (round-trip in FP space). Every entry supports forward (FW) and, where applicable, backward (BW); kinds that have no meaningful gradient are FW-only.
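A minimal sketch of the u16 round trip described above. The enum here is a local stand-in that mirrors only a few of the documented discriminants, and encode_op / decode_op are hypothetical helper names, not part of the crate.

```rust
/// Local stand-in for illustration; the real enum has more variants.
#[non_exhaustive]
#[repr(u16)]
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum QuantizeKind {
    PerTensor = 0,
    PerTensorBackward = 1,
    FakeQuantize = 8,
    PerToken = 16,
    GgufMmvq = 27,
}

/// Hypothetical helper: the value that would be stored in a u16 op field.
pub fn encode_op(kind: QuantizeKind) -> u16 {
    // `#[repr(u16)]` makes this cast yield the declared discriminant.
    kind as u16
}

/// Hypothetical helper: map a raw u16 back to a kind.
/// Unassigned discriminants (for example the 10..=15 gap) yield `None`.
pub fn decode_op(op: u16) -> Option<QuantizeKind> {
    match op {
        0 => Some(QuantizeKind::PerTensor),
        1 => Some(QuantizeKind::PerTensorBackward),
        8 => Some(QuantizeKind::FakeQuantize),
        16 => Some(QuantizeKind::PerToken),
        27 => Some(QuantizeKind::GgufMmvq),
        _ => None,
    }
}
```

Decoding through an explicit match avoids transmuting an arbitrary u16 into the enum, which would be undefined behaviour for values that have no variant.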

Trailblazer dtype scope. Input FP × output int:

Input FP: f32, f64, f16, bf16.
Output int: s8, u8. Sub-byte packed types (s4, u4) are deferred.
scale matches the input FP dtype; zero_point is always i32 (wide enough for any int output qmin/qmax range).
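To illustrate why an i32 zero_point is wide enough, the sketch below lists the clamp bounds for the two supported output types. It assumes qmin / qmax span the full range of the integer type, which the text above does not state explicitly; OutInt and q_range are hypothetical names.

```rust
/// Hypothetical tag for the two output integer dtypes in scope.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum OutInt {
    S8,
    U8,
}

/// (qmin, qmax) for an output dtype, widened to i32.
/// Assumption: the full representable range of the type is used.
pub fn q_range(dtype: OutInt) -> (i32, i32) {
    match dtype {
        OutInt::S8 => (i8::MIN as i32, i8::MAX as i32),
        OutInt::U8 => (u8::MIN as i32, u8::MAX as i32),
    }
}
```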

Backward convention (Straight-Through Estimator). The BW of quantize and fake_quantize uses STE — the gradient passes through (with a 1/scale factor for quantize, no factor for fake_quantize) where the rounded result was in-range [qmin, qmax], zero elsewhere. The “in-range mask” is recomputed in BW from the saved input tensor rather than saved as a separate FW output — this matches PyTorch’s internal FakeQuantize and keeps the FW signature clean. Callers must therefore retain the original input tensor for the BW pass (which they would do anyway for autograd).
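The convention above can be sketched on flat f32 slices as follows. This is an illustration only: the function names are hypothetical, and the tie-breaking mode of round is an assumption (Rust's f32::round rounds halves away from zero, which may differ from the kernels).

```rust
/// Recompute the in-range mask from the saved input:
/// true where qmin <= round(x / scale) + zero_point <= qmax.
fn in_range(x: f32, scale: f32, zero_point: i32, qmin: i32, qmax: i32) -> bool {
    let q = (x / scale).round() as i64 + zero_point as i64;
    q >= qmin as i64 && q <= qmax as i64
}

/// STE backward of quantize: dx = (dy / scale) * mask.
pub fn quantize_backward(
    x: &[f32], dy: &[f32], scale: f32, zero_point: i32, qmin: i32, qmax: i32,
) -> Vec<f32> {
    x.iter()
        .zip(dy)
        .map(|(&xi, &g)| if in_range(xi, scale, zero_point, qmin, qmax) { g / scale } else { 0.0 })
        .collect()
}

/// STE backward of fake_quantize: dx = dy * mask (no 1/scale factor).
pub fn fake_quantize_backward(
    x: &[f32], dy: &[f32], scale: f32, zero_point: i32, qmin: i32, qmax: i32,
) -> Vec<f32> {
    x.iter()
        .zip(dy)
        .map(|(&xi, &g)| if in_range(xi, scale, zero_point, qmin, qmax) { g } else { 0.0 })
        .collect()
}
```

Both functions take the original input x rather than a saved mask, which is the reason the caller has to keep x alive until the backward pass.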

Later milestones extend this enum with the PerToken / PerGroup / DynamicRange variants, which are listed below starting at discriminant 16; discriminants 10 through 15 are currently unassigned.

Variants (Non-exhaustive)

This enum is marked as non-exhaustive

Non-exhaustive enums could have additional variants added in future. Therefore, when matching against variants of non-exhaustive enums, an extra wildcard arm must be added to account for any future variants.
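A short sketch of the wildcard arm in practice. The enum is a local stand-in with a subset of the documented variants, and is_inference_only is a hypothetical helper; in a downstream crate the `_` arm is required by the compiler, while inside the defining crate it is optional.

```rust
/// Local stand-in for illustration; the real enum has more variants.
#[non_exhaustive]
#[repr(u16)]
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum QuantizeKind {
    PerTensor = 0,
    FakeQuantize = 8,
    QuantizedLinear = 25,
    GgufDequantize = 26,
    GgufMmvq = 27,
}

/// Hypothetical helper: true for the kinds documented as inference-only.
pub fn is_inference_only(kind: QuantizeKind) -> bool {
    match kind {
        QuantizeKind::QuantizedLinear
        | QuantizeKind::GgufDequantize
        | QuantizeKind::GgufMmvq => true,
        // Wildcard arm: covers the remaining variants and any added later.
        _ => false,
    }
}
```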

PerTensor = 0

quantize_per_tensor(x, scale, zero_point) — q = clamp(round(x / scale) + zero_point, qmin, qmax). One scalar scale (FP) and zero_point (i32) for the whole tensor. PyTorch torch.quantize_per_tensor.
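As a reference for the formula above, a scalar sketch for the f32 to s8 case. The function name is hypothetical, and rounding uses Rust's f32::round (halves away from zero), which is an assumption about tie-breaking rather than a statement about the kernel.

```rust
/// q = clamp(round(x / scale) + zero_point, qmin, qmax) with an s8 output.
pub fn quantize_per_tensor_s8(x: &[f32], scale: f32, zero_point: i32) -> Vec<i8> {
    let (qmin, qmax) = (i8::MIN as i64, i8::MAX as i64);
    x.iter()
        .map(|&v| {
            // Widen to i64 so adding the zero point cannot overflow.
            let q = (v / scale).round() as i64 + zero_point as i64;
            q.clamp(qmin, qmax) as i8
        })
        .collect()
}
```

For example, with scale = 0.5 and zero_point = 3, the input 1.3 maps to round(2.6) + 3 = 6, and 1000.0 saturates at 127.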

PerTensorBackward = 1

Gradient of Self::PerTensor via STE: dx = (dy / scale) * in_range_mask, where the mask is qmin <= round(x/scale) + zp <= qmax.

DequantizePerTensor = 2

dequantize_per_tensor(q, scale, zero_point) — x = scale * (q - zero_point). Linear; exactly invertible up to rounding. PyTorch torch.Tensor.dequantize.
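The inverse mapping can be sketched the same way for s8 input and f32 output; the function name is hypothetical.

```rust
/// x = scale * (q - zero_point), computed element by element.
pub fn dequantize_per_tensor_s8(q: &[i8], scale: f32, zero_point: i32) -> Vec<f32> {
    q.iter()
        // Subtract in i32 so `q - zero_point` cannot overflow the i8 range.
        .map(|&qi| scale * (qi as i32 - zero_point) as f32)
        .collect()
}
```

Values that were clamped during quantization are not recovered: a q of 127 with scale = 0.5 and zero_point = 3 dequantizes to 62.0, whatever the original input was.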

DequantizePerTensorBackward = 3

Gradient of Self::DequantizePerTensor: dq = dy * scale (linear identity scaled).

PerChannel = 4

quantize_per_channel(x, scale[C], zero_point[C], axis) — same math as Self::PerTensor but with one scale[c] / zero_point[c] pair per slice along axis. PyTorch torch.quantize_per_channel.
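A sketch of the per-channel variant, restricted for brevity to a row-major [C, inner] tensor with axis = 0. The layout, the function name, and the rounding mode (f32::round) are assumptions made for illustration.

```rust
/// Per-channel quantize for a row-major [C, inner] tensor, axis = 0.
/// Channel c uses scale[c] and zero_point[c].
pub fn quantize_per_channel_s8(
    x: &[f32], inner: usize, scale: &[f32], zero_point: &[i32],
) -> Vec<i8> {
    assert_eq!(scale.len(), zero_point.len());
    assert_eq!(x.len(), scale.len() * inner);
    x.iter()
        .enumerate()
        .map(|(i, &v)| {
            let c = i / inner; // channel index of flat element i
            let q = (v / scale[c]).round() as i64 + zero_point[c] as i64;
            q.clamp(i8::MIN as i64, i8::MAX as i64) as i8
        })
        .collect()
}
```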

PerChannelBackward = 5

Gradient of Self::PerChannel via STE: dx = (dy / scale[c]) * in_range_mask[c].

DequantizePerChannel = 6

dequantize_per_channel(q, scale[C], zero_point[C], axis) — x = scale[c] * (q - zero_point[c]).

DequantizePerChannelBackward = 7

Gradient of Self::DequantizePerChannel: dq = dy * scale[c].

FakeQuantize = 8

fake_quantize_per_tensor(x, scale, zero_point) — y = scale * (clamp(round(x/scale)+zp, qmin, qmax) - zp). The roundtrip quantize-then-dequantize in FP space; produces a lossy FP output. PyTorch torch.fake_quantize_per_tensor_affine.
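The round trip can be sketched as a single pass that never materializes the integer tensor. The function name is hypothetical, qmin / qmax are passed explicitly, and the rounding mode is assumed to be f32::round.

```rust
/// y = scale * (clamp(round(x / scale) + zp, qmin, qmax) - zp), all in FP.
pub fn fake_quantize_per_tensor(
    x: &[f32], scale: f32, zero_point: i32, qmin: i32, qmax: i32,
) -> Vec<f32> {
    let zp = zero_point as i64;
    x.iter()
        .map(|&v| {
            let q = ((v / scale).round() as i64 + zp).clamp(qmin as i64, qmax as i64);
            scale * (q - zp) as f32
        })
        .collect()
}
```

With scale = 0.5, zero_point = 0 and the s8 range, 1.3 snaps to the grid point 1.5 and 1000.0 saturates at 63.5, which shows the two sources of loss: rounding and clamping.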

FakeQuantizeBackward = 9

Gradient of Self::FakeQuantize via STE: dx = dy * in_range_mask. No 1/scale factor — the dequant-side multiplication by scale in FW cancels the 1/scale from STE.

PerToken = 16

Reserved — quantize_per_token (per-row dynamic-range quantization used by activation quantization).

PerTokenBackward = 17

Reserved — gradient of Self::PerToken.

PerGroup = 18

Reserved — quantize_per_group (block-wise quantization used by GPTQ / AWQ / GGML).

PerGroupBackward = 19

Reserved — gradient of Self::PerGroup.

DynamicRange = 20

Reserved — dynamic_range_quantize (post-training dynamic quantization).

DequantizePerToken = 21

dequantize_per_token(q, scale[N], zero_point[N]) — y[n, d] = scale[n] * (q[n, d] - zp[n]). Per-row inverse of Self::PerToken.
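A sketch of the per-row formula for a row-major [N, D] s8 tensor; the function name and the flat layout are assumptions for illustration.

```rust
/// y[n, d] = scale[n] * (q[n, d] - zp[n]) for a row-major [N, D] tensor.
pub fn dequantize_per_token_s8(
    q: &[i8], d: usize, scale: &[f32], zero_point: &[i32],
) -> Vec<f32> {
    assert_eq!(scale.len(), zero_point.len());
    assert_eq!(q.len(), scale.len() * d);
    q.iter()
        .enumerate()
        .map(|(i, &qi)| {
            let n = i / d; // row (token) index of flat element i
            scale[n] * (qi as i32 - zero_point[n]) as f32
        })
        .collect()
}
```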

DequantizePerTokenBackward = 22

Gradient of Self::DequantizePerToken: dq = dy * scale[n] (straight-through).

DequantizePerGroup = 23

dequantize_per_group(q, scale[outer, num_groups], zero_point[outer, num_groups]) — per-group inverse of Self::PerGroup.

DequantizePerGroupBackward = 24

Gradient of Self::DequantizePerGroup: dq[i, j] = dy[i, j] * scale[i, j/g] (straight-through).
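The indexing in this gradient can be sketched as below, assuming a row-major [rows, cols] gradient, groups of g consecutive elements along the last axis, and g dividing cols evenly. The function name is hypothetical.

```rust
/// dq[i, j] = dy[i, j] * scale[i, j / g], with scale stored as [rows, cols / g].
pub fn dequantize_per_group_backward(
    dy: &[f32], cols: usize, group_size: usize, scale: &[f32],
) -> Vec<f32> {
    assert!(cols > 0 && group_size > 0 && cols % group_size == 0);
    assert_eq!(dy.len() % cols, 0);
    let num_groups = cols / group_size;
    assert_eq!(scale.len(), (dy.len() / cols) * num_groups);
    dy.iter()
        .enumerate()
        .map(|(idx, &g)| {
            let (i, j) = (idx / cols, idx % cols);
            // Every element of a group shares one scale entry.
            g * scale[i * num_groups + j / group_size]
        })
        .collect()
}
```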

QuantizedLinear = 25

quantized_linear(activation_fp, weight_q_s8, weight_scale, bias?) — W8A8 fused quantized matmul. Pipeline: dynamic-range per-token quantize the activation → int8 GEMM with int32 accumulator → dequantize via per-row scale_a and per-channel scale_w. The canonical inference-time LLM matmul recipe (e.g. SmoothQuant, AWQ-runtime); FP activation in, FP output out, int8 storage only on the GEMM. Backward isn’t shipped — this op is inference-only by convention.
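The three-stage pipeline above can be sketched as a naive reference loop. Several details are assumptions not stated in the text: the activation is quantized symmetrically (zero_point = 0, scale_a = max|x| / 127 per row), the weight is stored row-major as [out_channels, k], and the argument order of this hypothetical function is illustrative only.

```rust
/// activation: [n, k] f32; weight_q: [m, k] s8; weight_scale: [m]; returns [n, m].
pub fn quantized_linear(
    activation: &[f32], n: usize, k: usize,
    weight_q: &[i8], weight_scale: &[f32], bias: Option<&[f32]>,
) -> Vec<f32> {
    let m = weight_scale.len();
    assert_eq!(activation.len(), n * k);
    assert_eq!(weight_q.len(), m * k);
    if let Some(b) = bias {
        assert_eq!(b.len(), m);
    }
    let mut out = vec![0.0f32; n * m];
    for row in 0..n {
        let a = &activation[row * k..(row + 1) * k];
        // Stage 1: dynamic-range per-token quantization (assumed symmetric).
        let max_abs = a.iter().fold(0.0f32, |acc, &v| acc.max(v.abs()));
        let scale_a = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
        let a_q: Vec<i8> = a
            .iter()
            .map(|&v| (v / scale_a).round().clamp(-127.0, 127.0) as i8)
            .collect();
        for ch in 0..m {
            let w = &weight_q[ch * k..(ch + 1) * k];
            // Stage 2: int8 dot product accumulated in i32.
            let acc: i32 = a_q.iter().zip(w).map(|(&x, &y)| x as i32 * y as i32).sum();
            // Stage 3: dequantize with per-row scale_a and per-channel scale_w.
            let mut y = acc as f32 * scale_a * weight_scale[ch];
            if let Some(b) = bias {
                y += b[ch];
            }
            out[row * m + ch] = y;
        }
    }
    out
}
```

Only the inner dot product touches int8 data; the input and output stay in FP, matching the "int8 storage only on the GEMM" description.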

GgufDequantize = 26

gguf_dequantize(packed_bytes) -> fp_tensor — unpack a GGUF-packed weight buffer (Q4_0 / Q4_1 / Q5_0 / Q5_1 / Q8_0 + Q2_K / Q3_K / Q4_K / Q5_K / Q6_K / Q8_K) into a dense FP tensor. The block format is carried out-of-band on the plan descriptor (see GgufBlockFormat); the kernel surface fans out across block formats but the enum value is the same. Inference-only by convention (BW not shipped).

GgufMmvq = 27

gguf_mmvq(packed_weight, fp_activation) -> fp_output — fused dequant + matrix-vector multiply: the inference-time “decode-step” matmul used by llama.cpp on GGUF weights. FP activation in (f32 today), FP output out. Inference-only (BW not shipped).

Trait Implementations

impl Clone for QuantizeKind

fn clone(&self) -> QuantizeKind

fn clone_from(&mut self, source: &Self)

impl Copy for QuantizeKind

impl Debug for QuantizeKind

fn fmt(&self, f: &mut Formatter<'_>) -> Result<(), Error>

impl Eq for QuantizeKind

impl Hash for QuantizeKind

fn hash<__H>(&self, state: &mut __H) where __H: Hasher,

fn hash_slice<H>(data: &[Self], state: &mut H) where H: Hasher, Self: Sized,

impl PartialEq for QuantizeKind

fn eq(&self, other: &QuantizeKind) -> bool

fn ne(&self, other: &Rhs) -> bool

impl StructuralPartialEq for QuantizeKind

Auto Trait Implementations

impl Freeze for QuantizeKind

impl RefUnwindSafe for QuantizeKind

impl Send for QuantizeKind

impl Sync for QuantizeKind

impl Unpin for QuantizeKind

impl UnsafeUnpin for QuantizeKind

impl UnwindSafe for QuantizeKind

Blanket Implementations

impl<T> Any for T where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

impl<T> Borrow<T> for T where T: ?Sized,

fn borrow(&self) -> &T

impl<T> BorrowMut<T> for T where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

impl<T> CloneToUninit for T where T: Clone,

unsafe fn clone_to_uninit(&self, dest: *mut u8)

impl<T> From<T> for T

fn from(t: T) -> T

impl<T, U> Into<U> for T where U: From<T>,

fn into(self) -> U

impl<T> ToOwned for T where T: Clone,

type Owned = T

fn to_owned(&self) -> T

fn clone_into(&self, target: &mut T)

impl<T, U> TryFrom<U> for T where U: Into<T>,

type Error = Infallible

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

impl<T, U> TryInto<U> for T where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>
