pub struct QuantizedLinearPlan<TIn: Element, TWQ: IntElement> { /* private fields */ }
quantized_linear plan (W8A8 fused).
Composes two passes internally:
- Activation quantize: per-token symmetric dynamic-range
  quantization, fused max-abs reduce + scale compute + quantize.
  Implemented by the same
  dynamic_range_quantize_per_token_sym kernel that backs
  super::DynamicRangeQuantizePlan.
- Quantized matmul: fused int8 GEMM + per-row/per-col
  dequantize + FP store. Implemented by the bespoke
  quantized_linear_w8a8 kernel.
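The activation-quantize pass can be illustrated with a small CPU reference. This is only a sketch under stated assumptions: the function name quantize_per_token_sym is hypothetical, the scale is assumed to be max_abs / 127 with round-to-nearest, and an all-zero row is assumed to get scale 1.0. The actual kernel may differ in rounding mode and zero-row handling.

```rust
/// CPU reference sketch of per-token symmetric dynamic-range quantization.
/// `x` is row-major [m, k]; returns (int8 values [m, k], per-row scales [m]).
/// Assumption (not confirmed by the crate): scale = max_abs / 127.
fn quantize_per_token_sym(x: &[f32], m: usize, k: usize) -> (Vec<i8>, Vec<f32>) {
    assert_eq!(x.len(), m * k);
    let mut q = vec![0i8; m * k];
    let mut scales = vec![0f32; m];
    for row in 0..m {
        let xs = &x[row * k..(row + 1) * k];
        // Max-abs reduce over the token (one row of the activation).
        let max_abs = xs.iter().fold(0f32, |acc, v| acc.max(v.abs()));
        // Assumed guard: an all-zero row gets scale 1.0 to avoid dividing by zero.
        let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
        scales[row] = scale;
        // Quantize: divide by the scale, round, clamp to the symmetric int8 range.
        for (dst, &v) in q[row * k..(row + 1) * k].iter_mut().zip(xs) {
            *dst = (v / scale).round().clamp(-127.0, 127.0) as i8;
        }
    }
    (q, scales)
}
```

Each row gets its own scale, so a single outlier token does not reduce the resolution of the other tokens.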
Both passes share the same stream and execute back-to-back; the Plan
does NOT own an internal DynamicRangeQuantizePlan instance — it
invokes the FFI directly to keep the launch ordering explicit.
When to use: W8A8 inference matmul (SmoothQuant / AWQ-runtime
style). Inference-only: there is no backward pass. For QAT, use
FakeQuantizePlan plus a normal FP matmul.
Dtypes (trailblazer): TIn (activation/output) ∈ {f32, f64}; TWQ = S8.
f16 / bf16 activations and u8 weights are not yet wired.
Shape limits: activation [M, K]; weight_q [C_out, K];
weight_scale [C_out]; output [M, C_out]. The weight_q layout
[C_out, K] matches y = x · W^T (PyTorch nn.Linear.weight).
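To make the shapes concrete, the following is a CPU reference sketch of the quantized matmul with per-row (activation) and per-column (weight) dequantization. The function name quantized_linear_ref and its parameter order are hypothetical and are not the crate's API. The dequantization formula y[i, j] = acc * act_scale[i] * weight_scale[j] is an assumption based on the symmetric scheme described above.

```rust
/// CPU reference sketch: y = x · W^T with int8 operands and int32 accumulation.
/// act_q: [m, k], act_scale: [m], weight_q: [c_out, k], weight_scale: [c_out].
/// Returns the row-major output [m, c_out]. All names here are illustrative.
fn quantized_linear_ref(
    act_q: &[i8],
    act_scale: &[f32],
    weight_q: &[i8],
    weight_scale: &[f32],
    m: usize,
    k: usize,
    c_out: usize,
) -> Vec<f32> {
    assert_eq!(act_q.len(), m * k);
    assert_eq!(act_scale.len(), m);
    assert_eq!(weight_q.len(), c_out * k);
    assert_eq!(weight_scale.len(), c_out);
    let mut out = vec![0f32; m * c_out];
    for i in 0..m {
        for j in 0..c_out {
            // One output cell: dot product of activation row i and weight row j.
            let mut acc: i32 = 0;
            for p in 0..k {
                acc += act_q[i * k + p] as i32 * weight_q[j * k + p] as i32;
            }
            // Assumed dequantize: per-row activation scale times per-column weight scale.
            out[i * c_out + j] = acc as f32 * act_scale[i] * weight_scale[j];
        }
    }
    out
}
```

Because weight_q is stored as [C_out, K], both operands of the inner loop are read along contiguous K, which is what the y = x · W^T convention amounts to.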
Workspace: none (zero bytes in Workspace). The caller supplies
act_q_scratch [M, K] (int8) and act_scale_scratch [M]
(FP) in QuantizedLinearArgs for the fused activation-quant
pass.
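A caller sizing the two scratch buffers might compute the byte counts as in the sketch below. The helper scratch_bytes is hypothetical, and it assumes that the FP scale buffer uses the same element type as TIn, which the text above does not state explicitly.

```rust
use std::mem::size_of;

/// Returns (act_q_scratch bytes, act_scale_scratch bytes) for an [m, k] activation.
/// Assumption: the scale scratch uses the activation type TIn (f32 or f64).
fn scratch_bytes<TIn>(m: usize, k: usize) -> (usize, usize) {
    let act_q = m * k * size_of::<i8>(); // one int8 per activation element
    let act_scale = m * size_of::<TIn>(); // one FP scale per token (row)
    (act_q, act_scale)
}
```

Since both buffers depend only on M and K, a caller could allocate them once for the largest expected batch and pass them to every launch.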
Precision guarantee: deterministic, bit-stable. The current kernel is naive (one thread per output cell, register-only int32 accumulator) and chosen for correctness; tiled-smem / mma.sync optimizations are deferred to a perf milestone. Treat the current variant as a correctness scaffold, not a throughput-optimized kernel.
Implementations
impl<TIn: Element, TWQ: IntElement> QuantizedLinearPlan<TIn, TWQ>
pub fn select(
    _stream: &Stream,
    desc: &QuantizedLinearDescriptor,
    _pref: PlanPreference,
) -> Result<Self>
Pick a kernel for desc.
pub fn can_implement(
    &self,
    args: &QuantizedLinearArgs<'_, TIn, TWQ>,
) -> Result<()>
Validate args at run time.
pub fn workspace_size(&self) -> usize
Workspace bytes — none. Scratch buffers are caller-owned via the
args bundle (act_q_scratch + act_scale_scratch), allowing
reuse across launches.
pub fn precision_guarantee(&self) -> PrecisionGuarantee
Numerical guarantees.