Module quantized_linear


quantized_linear — fused W8A8 quantized matmul (Phase 8.3).

The canonical inference-time LLM matmul recipe:

1. Quantize the FP activation per-token (dynamic-range, symmetric).
2. Accumulate the int8 × int8 GEMM into int32.
3. Dequantize the int32 accumulator by scale_a[m] · scale_w[n] (the per-token activation scale times the per-output-channel weight_scale) and store as FP.
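The three steps above can be sketched as a CPU reference in plain Rust. This is an illustration only, not the crate's kernel or API: the function names are hypothetical, and the absmax / 127 scale with round-to-nearest is an assumed symmetric scheme that may differ from what super::DynamicRangeQuantizePlan actually does.

```rust
/// Per-token symmetric int8 quantization of one activation row.
/// ASSUMPTION: scale = absmax / 127, round-to-nearest, clamp to [-127, 127].
fn quantize_row_symmetric(row: &[f32]) -> (Vec<i8>, f32) {
    let absmax = row.iter().fold(0.0f32, |m, &v| m.max(v.abs()));
    // An all-zero row gets scale 1.0 so the division below stays finite.
    let scale = if absmax > 0.0 { absmax / 127.0 } else { 1.0 };
    let q = row
        .iter()
        .map(|&v| (v / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Hypothetical reference for the W8A8 recipe (all buffers row-major):
/// activation [M, K] f32, weight_q [C_out, K] i8, weight_scale [C_out] f32,
/// returns output [M, C_out] f32.
fn quantized_linear_ref(
    activation: &[f32],
    weight_q: &[i8],
    weight_scale: &[f32],
    m: usize,
    k: usize,
    c_out: usize,
) -> Vec<f32> {
    assert_eq!(activation.len(), m * k);
    assert_eq!(weight_q.len(), c_out * k);
    assert_eq!(weight_scale.len(), c_out);
    let mut output = vec![0.0f32; m * c_out];
    for row in 0..m {
        // Step 1: quantize this token's activation row, yielding scale_a[row].
        let (a_q, scale_a) = quantize_row_symmetric(&activation[row * k..(row + 1) * k]);
        for n in 0..c_out {
            let w_row = &weight_q[n * k..(n + 1) * k];
            // Step 2: int8 x int8 products accumulated in int32.
            let acc: i32 = a_q
                .iter()
                .zip(w_row)
                .map(|(&a, &w)| a as i32 * w as i32)
                .sum();
            // Step 3: dequantize by scale_a[row] * scale_w[n] and store as FP.
            output[row * c_out + n] = acc as f32 * scale_a * weight_scale[n];
        }
    }
    output
}
```

With an identity-like weight (127 on the diagonal, weight_scale of 1/127), the output should reproduce the activation to within about half a quantization step per element, which makes a convenient sanity check for any fused kernel.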

Used by SmoothQuant, AWQ-runtime, and most production W8A8 LLM kernels. The Plan owns the orchestration; the underlying bespoke kernel fuses the int8 mma, the dequantization, and the FP store into a single launch.

§Layout

activation : [M, K] FP (row-major).
weight_q : [C_out, K] int8 (row-major — one row per output channel).
weight_scale : [C_out] FP (per-output-channel, saved when the weight was quantized offline).
output : [M, C_out] FP.

weight_q is [C_out, K] rather than [K, C_out] so the inner-K reduction reads contiguous K spans from both the activation row and the weight row — the natural layout for the linear-layer convention y = x · W^T where W is the weight matrix in [C_out, C_in] form (PyTorch nn.Linear.weight layout).
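To make the layout argument concrete, the sketch below computes one int32 accumulator two ways: once with the weight stored [C_out, K] as described here, and once with a hypothetical [K, C_out] copy for contrast. The function names and the already-quantized activation buffer a_q are assumptions for illustration, not part of this module's API.

```rust
/// int32 accumulator for output cell (m, n) with weight_q stored [C_out, K].
/// Both operands are read as contiguous K-length slices.
fn acc_cout_major(a_q: &[i8], w_q: &[i8], k: usize, m: usize, n: usize) -> i32 {
    let a_row = &a_q[m * k..(m + 1) * k];
    let w_row = &w_q[n * k..(n + 1) * k];
    a_row
        .iter()
        .zip(w_row)
        .map(|(&a, &w)| a as i32 * w as i32)
        .sum()
}

/// Same cell with a hypothetical [K, C_out] weight: the activation read is
/// still contiguous, but each weight read jumps by c_out elements.
fn acc_k_major(a_q: &[i8], w_kn: &[i8], k: usize, c_out: usize, m: usize, n: usize) -> i32 {
    (0..k)
        .map(|kk| a_q[m * k + kk] as i32 * w_kn[kk * c_out + n] as i32)
        .sum()
}
```

Both functions return the same value for matching data; only the memory access pattern differs, which is the point of choosing [C_out, K].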

§Trailblazer scope

Symmetric + per-token activation quantization (composes super::DynamicRangeQuantizePlan).
Per-output-channel weight scale (caller supplies, computed offline).
TIn ∈ {f32, f64} activation + output; weight = S8.
Naive kernel (one thread per output cell, register-only int32 accumulator): a correctness scaffold, not throughput-optimized. Tiled-smem / mma.sync optimizations are deferred to a perf milestone.
Inference-only: no backward pass. The W8A8 path is forward-only by convention; a downstream consumer that needs gradients should use super::FakeQuantizePlan for QAT (quantization-aware training) and run a normal FP matmul.
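The per-output-channel weight scale is supplied by the caller, and this page does not specify how it is computed. One common way to produce weight_q and weight_scale offline is sketched below, assuming a symmetric absmax / 127 scale per row; the function name is hypothetical and the scheme is an assumption, not a requirement of this module.

```rust
/// Quantize a row-major [C_out, K] f32 weight to int8 with one symmetric
/// scale per output channel (row). Returns (weight_q, weight_scale).
/// ASSUMPTION: scale = absmax / 127 per row, round-to-nearest.
fn quantize_weight_per_channel(weight: &[f32], c_out: usize, k: usize) -> (Vec<i8>, Vec<f32>) {
    assert!(k > 0, "K must be non-zero");
    assert_eq!(weight.len(), c_out * k);
    let mut weight_q = Vec::with_capacity(c_out * k);
    let mut weight_scale = Vec::with_capacity(c_out);
    for row in weight.chunks_exact(k) {
        let absmax = row.iter().fold(0.0f32, |m, &v| m.max(v.abs()));
        // An all-zero channel gets scale 1.0 to avoid dividing by zero.
        let scale = if absmax > 0.0 { absmax / 127.0 } else { 1.0 };
        weight_q.extend(
            row.iter()
                .map(|&v| (v / scale).round().clamp(-127.0, 127.0) as i8),
        );
        weight_scale.push(scale);
    }
    (weight_q, weight_scale)
}
```

The two returned buffers match the weight_q and weight_scale shapes listed under Layout, so they could be passed to a W8A8 launch as-is, provided the kernel assumes the same symmetric convention.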

Structs§

QuantizedLinearArgs: Args bundle for a quantized_linear launch.
QuantizedLinearDescriptor: Descriptor for a quantized_linear op.
QuantizedLinearPlan: quantized_linear plan (W8A8 fused).