Enum QuantizeKind 

#[non_exhaustive]
#[repr(u16)]
pub enum QuantizeKind {
    PerTensor = 0,
    PerTensorBackward = 1,
    DequantizePerTensor = 2,
    DequantizePerTensorBackward = 3,
    PerChannel = 4,
    PerChannelBackward = 5,
    DequantizePerChannel = 6,
    DequantizePerChannelBackward = 7,
    FakeQuantize = 8,
    FakeQuantizeBackward = 9,
    PerToken = 16,
    PerTokenBackward = 17,
    PerGroup = 18,
    PerGroupBackward = 19,
    DynamicRange = 20,
    DequantizePerToken = 21,
    DequantizePerTokenBackward = 22,
    DequantizePerGroup = 23,
    DequantizePerGroupBackward = 24,
    QuantizedLinear = 25,
    GgufDequantize = 26,
    GgufMmvq = 27,
}

Quantization op discriminant — Category P from the comprehensive plan.

Stored as u16 in crate::KernelSku::op when category == OpCategory::Quantization. Phase 8 Milestone 8.1 wires the trailblazer set: per-tensor + per-channel quantize / dequantize plus fake_quantize (round-trip in FP space). All entries support FW + BW where applicable (FW-only for kinds that have no meaningful gradient).
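As a rough illustration of the u16 storage described above, the sketch below round-trips a kind through its raw discriminant. The enum is a local stand-in with only four of the variants, and `to_op` / `from_op` are hypothetical helpers; the crate's actual conversion API is not shown on this page.

```rust
/// Local stand-in for a handful of `QuantizeKind` variants (illustration only).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
#[repr(u16)]
pub enum QuantizeKind {
    PerTensor = 0,
    FakeQuantize = 8,
    PerToken = 16,
    GgufMmvq = 27,
}

/// Hypothetical helper: the raw value that would be written to the `op` field.
pub fn to_op(kind: QuantizeKind) -> u16 {
    // `#[repr(u16)]` makes this cast yield the declared discriminant.
    kind as u16
}

/// Hypothetical helper: decode a raw `op` value; unassigned codes give `None`.
pub fn from_op(op: u16) -> Option<QuantizeKind> {
    match op {
        0 => Some(QuantizeKind::PerTensor),
        8 => Some(QuantizeKind::FakeQuantize),
        16 => Some(QuantizeKind::PerToken),
        27 => Some(QuantizeKind::GgufMmvq),
        _ => None,
    }
}
```

Returning `Option` for the decode direction keeps unassigned codes (for example 10 through 15) from being silently mapped onto a real kind.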

Trailblazer dtype scope. Input FP × output int:

  • Input FP: f32, f64, f16, bf16.
  • Output int: s8, u8. Sub-byte packed types (s4, u4) are deferred.
  • scale matches the input FP dtype; zero_point is always i32 (wide enough for any int output qmin/qmax range).
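To make the i32 remark concrete, here is a minimal sketch of the qmin/qmax bounds for the two output types in scope. `OutInt` and `q_range` are hypothetical names, and the sketch assumes the bounds span the full range of the output type, which this page does not state explicitly.

```rust
/// Output integer dtypes in the trailblazer scope (hypothetical local type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum OutInt {
    S8,
    U8,
}

/// Assumed (qmin, qmax) for each output dtype, widened to i32 so that
/// `round(x / scale) + zero_point` can be compared against them without wrapping.
pub fn q_range(dtype: OutInt) -> (i32, i32) {
    match dtype {
        OutInt::S8 => (i8::MIN as i32, i8::MAX as i32),
        OutInt::U8 => (u8::MIN as i32, u8::MAX as i32),
    }
}
```

Both ranges fit comfortably inside i32, which is why a single zero_point type can serve every output dtype listed here.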

Backward convention (Straight-Through Estimator). The BW of quantize and fake_quantize uses STE — the gradient passes through (with a 1/scale factor for quantize, no factor for fake_quantize) where the rounded result was in-range [qmin, qmax], zero elsewhere. The “in-range mask” is recomputed in BW from the saved input tensor rather than saved as a separate FW output — this matches PyTorch’s internal FakeQuantize and keeps the FW signature clean. Callers must therefore retain the original input tensor for the BW pass (which they would do anyway for autograd).

Later milestones extend this enum with the PerToken / PerGroup / DynamicRange families, which start at discriminant 16. Discriminants 10 through 15 are unassigned; the numbering gaps are intentional.

Variants (Non-exhaustive)§

This enum is marked as non-exhaustive
Non-exhaustive enums could have additional variants added in future. Therefore, when matching against variants of non-exhaustive enums, an extra wildcard arm must be added to account for any future variants.
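A short sketch of what the wildcard requirement looks like in practice. The enum below is a local stand-in with a subset of the variants, and `describe` is a hypothetical function; in downstream code the `_` arm is mandatory because the real enum is `#[non_exhaustive]`.

```rust
/// Local stand-in for a subset of `QuantizeKind` (illustration only).
#[non_exhaustive]
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
#[repr(u16)]
pub enum QuantizeKind {
    PerTensor = 0,
    PerTensorBackward = 1,
    DequantizePerTensor = 2,
    FakeQuantize = 8,
    QuantizedLinear = 25,
    GgufMmvq = 27,
}

/// Hypothetical classifier that handles the kinds it knows about and
/// routes everything else, including future variants, to a fallback.
pub fn describe(kind: QuantizeKind) -> &'static str {
    match kind {
        QuantizeKind::PerTensor => "quantize",
        QuantizeKind::PerTensorBackward => "quantize (backward)",
        QuantizeKind::DequantizePerTensor => "dequantize",
        QuantizeKind::FakeQuantize => "fake quantize",
        // Wildcard arm: covers variants not listed above and any added later.
        _ => "other",
    }
}
```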
§

PerTensor = 0

quantize_per_tensor(x, scale, zero_point): q = clamp(round(x / scale) + zero_point, qmin, qmax). One scalar scale (FP) and one scalar zero_point (i32) cover the whole tensor. PyTorch equivalent: torch.quantize_per_tensor.
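The formula can be sketched on the CPU as follows. This is a reference-style illustration for f32 input and s8 output, not the kernel itself; `quantize_per_tensor_s8` is a hypothetical name, and the tie-breaking rule of `round` is an assumption (Rust's `f32::round` rounds halves away from zero, whereas PyTorch rounds halves to even).

```rust
/// Sketch of per-tensor affine quantization, f32 in, s8 out.
pub fn quantize_per_tensor_s8(x: &[f32], scale: f32, zero_point: i32) -> Vec<i8> {
    const QMIN: i32 = i8::MIN as i32;
    const QMAX: i32 = i8::MAX as i32;
    x.iter()
        .map(|&v| {
            // The float-to-int `as` cast saturates, so huge inputs cannot wrap.
            let rounded = (v / scale).round() as i32;
            // Shift by the zero point, then clamp into the s8 range.
            rounded.saturating_add(zero_point).clamp(QMIN, QMAX) as i8
        })
        .collect()
}
```

Values far outside the representable range saturate at qmin or qmax, which is exactly the region where the backward pass later zeroes the gradient.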

§

PerTensorBackward = 1

Gradient of Self::PerTensor via STE: dx = (dy / scale) * in_range_mask, where the mask is qmin <= round(x/scale) + zp <= qmax.
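A minimal sketch of this STE rule, recomputing the in-range mask from the saved input as described in the backward convention. The function name is hypothetical, and the rounding mode is the same assumption as in any forward sketch (`f32::round`).

```rust
/// Sketch of the STE backward for per-tensor quantize:
/// dx = (dy / scale) where qmin <= round(x / scale) + zp <= qmax, else 0.
pub fn quantize_per_tensor_backward(
    dy: &[f32],
    x: &[f32],
    scale: f32,
    zero_point: i32,
    qmin: i32,
    qmax: i32,
) -> Vec<f32> {
    assert_eq!(dy.len(), x.len());
    dy.iter()
        .zip(x)
        .map(|(&g, &v)| {
            // Recompute the unclamped quantized value from the saved input.
            // i64 avoids overflow when adding the zero point.
            let q = (v / scale).round() as i64 + zero_point as i64;
            if q >= qmin as i64 && q <= qmax as i64 {
                g / scale
            } else {
                0.0
            }
        })
        .collect()
}
```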

§

DequantizePerTensor = 2

dequantize_per_tensor(q, scale, zero_point): x = scale * (q - zero_point). Linear; recovers the original input up to the rounding (and clamping) error introduced by quantization. PyTorch equivalent: torch.Tensor.dequantize.
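To illustrate the round-trip behaviour, the sketch below quantizes a scalar to s8, dequantizes it, and reports the absolute error. All three function names are hypothetical. For inputs that land inside [qmin, qmax] the error should stay within about scale / 2; clamped inputs can be arbitrarily far off.

```rust
/// Scalar quantize to s8 (rounding mode of `f32::round` is an assumption).
pub fn quantize(x: f32, scale: f32, zero_point: i32) -> i8 {
    let q = (x / scale).round() as i64 + zero_point as i64;
    q.clamp(i8::MIN as i64, i8::MAX as i64) as i8
}

/// Scalar dequantize: x = scale * (q - zero_point).
pub fn dequantize(q: i8, scale: f32, zero_point: i32) -> f32 {
    scale * (q as i64 - zero_point as i64) as f32
}

/// Absolute error after a quantize-then-dequantize round trip.
pub fn round_trip_error(x: f32, scale: f32, zero_point: i32) -> f32 {
    (x - dequantize(quantize(x, scale, zero_point), scale, zero_point)).abs()
}
```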

§

DequantizePerTensorBackward = 3

Gradient of Self::DequantizePerTensor: dq = dy * scale (linear identity scaled).

§

PerChannel = 4

quantize_per_channel(x, scale[C], zero_point[C], axis) — same math as Self::PerTensor but with one scale[c] / zero_point[c] pair per slice along axis. PyTorch torch.quantize_per_channel.
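A sketch of the per-channel variant under simplifying assumptions: the tensor is a row-major 2-D buffer of shape [C, K] and axis is fixed to 0, so each row has its own scale and zero_point. The function name and the layout are illustrative choices, not the crate's kernel signature.

```rust
/// Sketch of per-channel quantization (axis = 0 of a row-major [C, K] tensor),
/// f32 in, s8 out.
pub fn quantize_per_channel_axis0(
    x: &[f32],
    channels: usize,
    scale: &[f32],
    zero_point: &[i32],
) -> Vec<i8> {
    assert!(channels > 0 && x.len() % channels == 0);
    assert_eq!(scale.len(), channels);
    assert_eq!(zero_point.len(), channels);
    let inner = x.len() / channels;
    let mut out = Vec::with_capacity(x.len());
    for c in 0..channels {
        // Every element of row `c` shares scale[c] and zero_point[c].
        for &v in &x[c * inner..(c + 1) * inner] {
            let q = (v / scale[c]).round() as i64 + zero_point[c] as i64;
            out.push(q.clamp(i8::MIN as i64, i8::MAX as i64) as i8);
        }
    }
    out
}
```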

§

PerChannelBackward = 5

Gradient of Self::PerChannel via STE: dx = (dy / scale[c]) * in_range_mask[c].

§

DequantizePerChannel = 6

dequantize_per_channel(q, scale[C], zero_point[C], axis): x = scale[c] * (q - zero_point[c]).

§

DequantizePerChannelBackward = 7

Gradient of Self::DequantizePerChannel: dq = dy * scale[c].

§

FakeQuantize = 8

fake_quantize_per_tensor(x, scale, zero_point): y = scale * (clamp(round(x/scale) + zp, qmin, qmax) - zp). The quantize-then-dequantize round trip in FP space; produces a lossy FP output. PyTorch equivalent: torch.fake_quantize_per_tensor_affine.
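The formula translates almost directly into code. The sketch below is a CPU illustration with a hypothetical function name; qmin and qmax are passed explicitly, and the rounding mode of `f32::round` is an assumption.

```rust
/// Sketch of fake quantization: quantize and dequantize in one step,
/// keeping the result in floating point.
pub fn fake_quantize_per_tensor(
    x: &[f32],
    scale: f32,
    zero_point: i32,
    qmin: i32,
    qmax: i32,
) -> Vec<f32> {
    x.iter()
        .map(|&v| {
            let q = ((v / scale).round() as i64 + zero_point as i64)
                .clamp(qmin as i64, qmax as i64);
            // Map the clamped integer back onto the FP grid.
            scale * (q - zero_point as i64) as f32
        })
        .collect()
}
```

The output only ever takes values of the form scale * k for integer k, which is what makes it a faithful FP simulation of the quantized tensor.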

§

FakeQuantizeBackward = 9

Gradient of Self::FakeQuantize via STE: dx = dy * in_range_mask. No 1/scale factor — the dequant-side multiplication by scale in FW cancels the 1/scale from STE.
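For comparison with the quantize backward, a sketch of the fake-quantize gradient: the same in-range mask, but the upstream gradient passes through unscaled. The function name is hypothetical and the mask uses the same assumed rounding mode (`f32::round`).

```rust
/// Sketch of the STE backward for fake quantize: dx = dy inside the
/// representable range, 0 outside. No division by `scale` on the gradient.
pub fn fake_quantize_backward(
    dy: &[f32],
    x: &[f32],
    scale: f32,
    zero_point: i32,
    qmin: i32,
    qmax: i32,
) -> Vec<f32> {
    assert_eq!(dy.len(), x.len());
    dy.iter()
        .zip(x)
        .map(|(&g, &v)| {
            // `scale` is only needed to rebuild the mask from the saved input.
            let q = (v / scale).round() as i64 + zero_point as i64;
            if q >= qmin as i64 && q <= qmax as i64 { g } else { 0.0 }
        })
        .collect()
}
```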

§

PerToken = 16

Reserved — quantize_per_token (per-row dynamic-range quantization used by activation quantization).

§

PerTokenBackward = 17

Reserved — gradient of Self::PerToken.

§

PerGroup = 18

Reserved — quantize_per_group (block-wise quantization used by GPTQ / AWQ / GGML).

§

PerGroupBackward = 19

Reserved — gradient of Self::PerGroup.

§

DynamicRange = 20

Reserved — dynamic_range_quantize (post-training dynamic quantization).

§

DequantizePerToken = 21

dequantize_per_token(q, scale[N], zero_point[N]): y[n, d] = scale[n] * (q[n, d] - zp[n]). Per-row inverse of Self::PerToken.
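A sketch of the per-row indexing, assuming a row-major [N, D] buffer of s8 values. The function name and the explicit `rows` / `cols` parameters are illustrative, not the crate's signature.

```rust
/// Sketch of per-token dequantization: y[n, d] = scale[n] * (q[n, d] - zp[n]).
pub fn dequantize_per_token(
    q: &[i8],
    rows: usize,
    cols: usize,
    scale: &[f32],
    zero_point: &[i32],
) -> Vec<f32> {
    assert_eq!(q.len(), rows * cols);
    assert_eq!(scale.len(), rows);
    assert_eq!(zero_point.len(), rows);
    let mut y = Vec::with_capacity(q.len());
    for n in 0..rows {
        // One (scale, zero_point) pair per token row.
        for d in 0..cols {
            let centered = q[n * cols + d] as i64 - zero_point[n] as i64;
            y.push(scale[n] * centered as f32);
        }
    }
    y
}
```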

§

DequantizePerTokenBackward = 22

Gradient of Self::DequantizePerToken: dq = dy * scale[n] (straight-through).

§

DequantizePerGroup = 23

dequantize_per_group(q, scale[outer, num_groups], zero_point[outer, num_groups]) — per-group inverse of Self::PerGroup.

§

DequantizePerGroupBackward = 24

Gradient of Self::DequantizePerGroup: dq[i, j] = dy[i, j] * scale[i, j/g] (straight-through).
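The `j/g` index is integer division by the group size, so each run of g consecutive columns shares one scale. A sketch under the assumption of a row-major [outer, inner] gradient with inner divisible by g; the function name and parameter layout are hypothetical.

```rust
/// Sketch of the per-group dequantize backward:
/// dq[i, j] = dy[i, j] * scale[i, j / group_size].
pub fn dequantize_per_group_backward(
    dy: &[f32],
    outer: usize,
    inner: usize,
    scale: &[f32],
    group_size: usize,
) -> Vec<f32> {
    assert!(group_size > 0 && inner % group_size == 0);
    let num_groups = inner / group_size;
    assert_eq!(dy.len(), outer * inner);
    assert_eq!(scale.len(), outer * num_groups);
    let mut dq = vec![0.0f32; dy.len()];
    for i in 0..outer {
        for j in 0..inner {
            // Integer division selects the group that column j belongs to.
            let s = scale[i * num_groups + j / group_size];
            dq[i * inner + j] = dy[i * inner + j] * s;
        }
    }
    dq
}
```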

§

QuantizedLinear = 25

quantized_linear(activation_fp, weight_q_s8, weight_scale, bias?) — W8A8 fused quantized matmul. Pipeline: dynamic-range per-token quantize the activation → int8 GEMM with int32 accumulator → dequantize via per-row scale_a and per-channel scale_w. The canonical inference-time LLM matmul recipe (e.g. SmoothQuant, AWQ-runtime); FP activation in, FP output out, int8 storage only on the GEMM. Backward isn’t shipped — this op is inference-only by convention.
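The three-stage pipeline can be sketched on the CPU as below. Several details are assumptions not stated on this page: activations are [N, K] row-major, weights are [M, K] row-major, the per-token activation scale is symmetric (absmax / 127, zero point 0), and the explicit shape parameters are illustrative. It is a reference-style sketch, not the fused kernel.

```rust
/// Sketch of a W8A8 linear layer: per-token dynamic quantize, int8 dot
/// products with an i32 accumulator, then dequantize with scale_a * scale_w.
pub fn quantized_linear(
    activation: &[f32], // [n, k]
    n: usize,
    k: usize,
    weight_q: &[i8],      // [m, k]
    m: usize,
    weight_scale: &[f32], // [m], one scale per output channel
    bias: Option<&[f32]>, // [m]
) -> Vec<f32> {
    assert_eq!(activation.len(), n * k);
    assert_eq!(weight_q.len(), m * k);
    assert_eq!(weight_scale.len(), m);
    let mut out = vec![0.0f32; n * m];
    let mut q_row = vec![0i8; k];
    for row in 0..n {
        let a = &activation[row * k..(row + 1) * k];
        // Stage 1: dynamic-range scale for this token (assumed symmetric).
        let absmax = a.iter().fold(0.0f32, |acc, v| acc.max(v.abs()));
        let scale_a = if absmax > 0.0 { absmax / 127.0 } else { 1.0 };
        for (q, &v) in q_row.iter_mut().zip(a) {
            *q = (v / scale_a).round().clamp(-127.0, 127.0) as i8;
        }
        for col in 0..m {
            let w = &weight_q[col * k..(col + 1) * k];
            // Stage 2: int8 x int8 products accumulated in i32.
            let acc: i32 = q_row
                .iter()
                .zip(w)
                .map(|(&qa, &qw)| qa as i32 * qw as i32)
                .sum();
            // Stage 3: dequantize with the per-row and per-channel scales.
            let mut y = acc as f32 * scale_a * weight_scale[col];
            if let Some(b) = bias {
                y += b[col];
            }
            out[row * m + col] = y;
        }
    }
    out
}
```

Accumulating in i32 matters because a single int8 product can reach 16129 in magnitude, so a narrower accumulator would overflow after only a few terms.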

§

GgufDequantize = 26

gguf_dequantize(packed_bytes) -> fp_tensor — unpack a GGUF-packed weight buffer (Q4_0 / Q4_1 / Q5_0 / Q5_1 / Q8_0 + Q2_K / Q3_K / Q4_K / Q5_K / Q6_K / Q8_K) into a dense FP tensor. The block format is carried out-of-band on the plan descriptor (see GgufBlockFormat); the kernel surface fans out across block formats but the enum value is the same. Inference-only by convention (BW not shipped).
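As one concrete instance of the unpacking step, here is a sketch for Q8_0 only. It assumes the block layout used by ggml / llama.cpp as commonly documented: 34-byte blocks holding a little-endian f16 scale followed by 32 signed 8-bit values, with each output equal to scale * value. Both function names are hypothetical, and how this crate's kernels select or lay out block formats is not shown here.

```rust
/// Decode IEEE 754 half-precision bits into f32 (stable Rust has no f16 type).
fn f16_bits_to_f32(bits: u16) -> f32 {
    let sign = if bits & 0x8000 != 0 { -1.0f32 } else { 1.0 };
    let exp = ((bits >> 10) & 0x1f) as i32;
    let frac = (bits & 0x03ff) as f32;
    let mag = match exp {
        0 => frac * 2f32.powi(-24), // subnormal: (frac / 1024) * 2^-14
        31 => {
            if frac == 0.0 {
                f32::INFINITY
            } else {
                f32::NAN
            }
        }
        _ => (1.0 + frac / 1024.0) * 2f32.powi(exp - 15),
    };
    sign * mag
}

/// Sketch of Q8_0 dequantization under the assumed 34-byte block layout.
pub fn dequantize_q8_0(packed: &[u8]) -> Vec<f32> {
    const BLOCK_BYTES: usize = 34; // 2-byte f16 scale + 32 int8 values
    assert_eq!(packed.len() % BLOCK_BYTES, 0);
    let mut out = Vec::with_capacity(packed.len() / BLOCK_BYTES * 32);
    for block in packed.chunks_exact(BLOCK_BYTES) {
        let d = f16_bits_to_f32(u16::from_le_bytes([block[0], block[1]]));
        for &b in &block[2..] {
            // Reinterpret each byte as a signed 8-bit weight.
            out.push(d * (b as i8) as f32);
        }
    }
    out
}
```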

§

GgufMmvq = 27

gguf_mmvq(packed_weight, fp_activation) -> fp_output — fused dequant + matrix-vector multiply: the inference-time “decode-step” matmul used by llama.cpp on GGUF weights. FP activation in (f32 today), FP output out. Inference-only (BW not shipped).

Trait Implementations§

impl Clone for QuantizeKind

fn clone(&self) -> QuantizeKind

Returns a duplicate of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl Copy for QuantizeKind

impl Debug for QuantizeKind

fn fmt(&self, f: &mut Formatter<'_>) -> Result<(), Error>

Formats the value using the given formatter.

impl Eq for QuantizeKind

impl Hash for QuantizeKind

fn hash<__H>(&self, state: &mut __H)
where __H: Hasher,

Feeds this value into the given Hasher.

fn hash_slice<H>(data: &[Self], state: &mut H)
where H: Hasher, Self: Sized,

Feeds a slice of this type into the given Hasher.

impl PartialEq for QuantizeKind

fn eq(&self, other: &QuantizeKind) -> bool

Tests for self and other values to be equal, and is used by ==.

fn ne(&self, other: &Rhs) -> bool

Tests for !=. The default implementation is almost always sufficient, and should not be overridden without very good reason.

impl StructuralPartialEq for QuantizeKind
