quantized_linear — fused W8A8 quantized matmul (Phase 8.3).
The canonical inference-time LLM matmul recipe:
- Quantize the FP activation per-token (dynamic-range, symmetric).
- Accumulate the int8 × int8 GEMM into int32.
- Dequantize the int32 accumulator by scale_a[m] · scale_w[n] and store as FP.
Used by SmoothQuant, AWQ-runtime, and most production W8A8 LLM kernels. The Plan owns the orchestration; the underlying bespoke kernel fuses the int8 mma + dequant + FP store as one launch.
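The three steps above can be sketched as a CPU reference. This is only an illustrative sketch, not the fused kernel or the Plan API: the function names `quantize_per_token` and `quantized_linear_ref` are hypothetical, the int8 range is assumed to be clamped to [-127, 127], and an all-zero row is assumed to get a scale of 1.0.

```rust
/// Symmetric per-token (per-row) int8 quantization of a row-major [M, K] activation.
/// Returns the quantized values and one scale per row.
/// Assumption: scale = max|x| / 127, with 1.0 used for an all-zero row.
fn quantize_per_token(a: &[f32], m: usize, k: usize) -> (Vec<i8>, Vec<f32>) {
    assert_eq!(a.len(), m * k);
    let mut q = vec![0i8; m * k];
    let mut scales = vec![0f32; m];
    for row in 0..m {
        let span = &a[row * k..(row + 1) * k];
        let amax = span.iter().fold(0f32, |acc, v| acc.max(v.abs()));
        // Guard against all-zero rows so the division below stays finite.
        let scale = if amax > 0.0 { amax / 127.0 } else { 1.0 };
        scales[row] = scale;
        for (dst, &v) in q[row * k..(row + 1) * k].iter_mut().zip(span) {
            *dst = (v / scale).round().clamp(-127.0, 127.0) as i8;
        }
    }
    (q, scales)
}

/// Reference W8A8 linear: quantize the activation, accumulate int8 x int8 in int32,
/// then dequantize each cell by scale_a[m] * scale_w[n].
/// `w_q` is row-major [C_out, K]; the result is row-major [M, C_out].
fn quantized_linear_ref(
    a: &[f32],
    w_q: &[i8],
    w_scale: &[f32],
    m: usize,
    k: usize,
    c_out: usize,
) -> Vec<f32> {
    assert_eq!(w_q.len(), c_out * k);
    assert_eq!(w_scale.len(), c_out);
    let (a_q, a_scale) = quantize_per_token(a, m, k);
    let mut out = vec![0f32; m * c_out];
    for row in 0..m {
        for n in 0..c_out {
            // int32 accumulator, one per output cell.
            let mut acc: i32 = 0;
            for i in 0..k {
                acc += a_q[row * k + i] as i32 * w_q[n * k + i] as i32;
            }
            out[row * c_out + n] = acc as f32 * a_scale[row] * w_scale[n];
        }
    }
    out
}
```

A reference like this can be useful for checking a fused kernel's output on small shapes, since each of the three steps stays separately inspectable.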
§Layout
- activation: [M, K] FP (row-major).
- weight_q: [C_out, K] int8 (row-major, one row per output channel).
- weight_scale: [C_out] FP (per-output-channel, saved when the weight was quantized offline).
- output: [M, C_out] FP.
weight_q is [C_out, K] rather than [K, C_out] so the inner-K
reduction reads contiguous K spans from both the activation row and
the weight row — the natural layout for the linear-layer convention
y = x · W^T where W is the weight matrix in [C_out, C_in] form
(PyTorch nn.Linear.weight layout).
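To make the layout argument concrete, the sketch below computes one output cell in two ways. The function names are hypothetical and the code is a plain CPU illustration, not the kernel itself. With weight_q as [C_out, K], both operands are contiguous slices of length K; with a [K, C_out] weight, the same reduction would step through the weight with a stride of C_out.

```rust
/// One output cell with weight_q stored row-major as [C_out, K]:
/// both the activation row and the weight row are contiguous K-length slices.
fn cell_c_out_major(a_q: &[i8], w_q: &[i8], k: usize, m_idx: usize, n_idx: usize) -> i32 {
    let a_row = &a_q[m_idx * k..(m_idx + 1) * k];
    let w_row = &w_q[n_idx * k..(n_idx + 1) * k];
    a_row
        .iter()
        .zip(w_row)
        .map(|(&a, &w)| a as i32 * w as i32)
        .sum()
}

/// The same cell with the weight stored row-major as [K, C_out] (shown for contrast):
/// consecutive K elements of one output channel sit C_out entries apart.
fn cell_k_major(
    a_q: &[i8],
    w_t: &[i8],
    k: usize,
    c_out: usize,
    m_idx: usize,
    n_idx: usize,
) -> i32 {
    (0..k)
        .map(|i| a_q[m_idx * k + i] as i32 * w_t[i * c_out + n_idx] as i32)
        .sum()
}
```

Both functions return the same value for matching data; only the memory access pattern over the weight differs.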
§Trailblazer scope
- Symmetric + per-token activation quantization (composes super::DynamicRangeQuantizePlan).
- Per-output-channel weight scale (caller supplies, computed offline).
- TIn ∈ {f32, f64} activation + output; weight = S8.
- Naive kernel (one thread per output cell, register-only int32 accumulator): a correctness scaffold, not throughput-optimized. Tiled-smem / mma.sync optimizations land in a perf milestone.
- Inference-only: no backward pass. The W8A8 path is forward-only
by convention; a downstream consumer that needs gradients should use
super::FakeQuantizePlan for QAT (quantization-aware training) and run a normal FP matmul.
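Because the caller supplies weight_q and weight_scale, it may help to see one way they could be produced offline. The sketch below is an assumption, not this crate's tooling: `quantize_weight_per_channel` is a hypothetical helper that applies symmetric max-abs scaling per output channel with an assumed int8 range of [-127, 127] and a scale of 1.0 for an all-zero channel.

```rust
/// Offline symmetric int8 quantization of a row-major [C_out, K] FP weight.
/// Returns (weight_q as [C_out, K], weight_scale as [C_out]).
/// Assumption: scale[n] = max|W[n, :]| / 127, with 1.0 for an all-zero channel.
fn quantize_weight_per_channel(w: &[f32], c_out: usize, k: usize) -> (Vec<i8>, Vec<f32>) {
    assert_eq!(w.len(), c_out * k);
    let mut w_q = Vec::with_capacity(c_out * k);
    let mut w_scale = Vec::with_capacity(c_out);
    for channel in w.chunks_exact(k) {
        let amax = channel.iter().fold(0f32, |acc, v| acc.max(v.abs()));
        let scale = if amax > 0.0 { amax / 127.0 } else { 1.0 };
        w_scale.push(scale);
        // Round to nearest and clamp into the symmetric int8 range.
        w_q.extend(
            channel
                .iter()
                .map(|&v| (v / scale).round().clamp(-127.0, 127.0) as i8),
        );
    }
    (w_q, w_scale)
}
```

The returned scale vector has one entry per output channel, matching the weight_scale shape [C_out] described in the layout section.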
Structs§
- QuantizedLinearArgs: Args bundle for a quantized_linear launch.
- QuantizedLinearDescriptor: Descriptor for a quantized_linear op.
- QuantizedLinearPlan: quantized_linear plan (W8A8 fused).