pub struct SmoothQuantLinearPlan<TIn: Element, TWQ: IntElement> { /* private fields */ }
SmoothQuant linear plan — pure Rust composition over the
bespoke quantized_linear_w8a8 kernel.
When to use: SmoothQuant inference matmul. Activation has
already been smoothed (divided by per-channel s[K]) and
quantized per-tensor to int8; weight has already been smoothed
(multiplied by s[K]) and quantized per-output-channel to int8;
caller passes both, plus the static per-tensor act-scale + per-N
weight-scale, to this plan.
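The smoothing step described above leaves the full-precision product unchanged, because dividing activation column k by s[k] cancels against multiplying weight column k by s[k]. The following is a minimal CPU sketch of that identity, not code from this crate; the helper names `smooth_activation`, `smooth_weight`, and `linear_f32` are hypothetical, and all buffers are assumed to be row-major.

```rust
/// Hypothetical helper: divide activation column k by s[k].
/// `x` is row-major [M, K].
fn smooth_activation(x: &[f32], s: &[f32], k: usize) -> Vec<f32> {
    x.iter().enumerate().map(|(i, v)| *v / s[i % k]).collect()
}

/// Hypothetical helper: multiply weight column k by s[k].
/// `w` is row-major [N, K] (the nn.Linear.weight layout).
fn smooth_weight(w: &[f32], s: &[f32], k: usize) -> Vec<f32> {
    w.iter().enumerate().map(|(i, v)| *v * s[i % k]).collect()
}

/// Full-precision reference for y = x · W^T.
/// `x` is [M, K], `w` is [N, K], the result is [M, N].
fn linear_f32(x: &[f32], w: &[f32], m: usize, n: usize, k: usize) -> Vec<f32> {
    let mut y = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            y[i * n + j] = (0..k).map(|p| x[i * k + p] * w[j * k + p]).sum::<f32>();
        }
    }
    y
}
```

With power-of-two smoothing factors the two products agree bit for bit; with arbitrary factors they agree only up to floating-point rounding. Int8 quantization of the smoothed tensors is a separate step that this sketch does not cover.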
Dtypes (trailblazer): TIn (scales/out) ∈ {f32, f64};
TWQ = S8 weight; activation is fixed at S8. f16 / bf16 / u8
weight will follow once the underlying quantized_linear_w8a8 kernel
supports those dtypes (same matrix as
super::QuantizedLinearPlan).
Shape limits: act_q [M, K]; weight_q [N, K];
weight_scale [N]; output [M, N]. [N, K] weight layout
matches y = x · W^T (PyTorch nn.Linear.weight).
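The shape contract above can be expressed as a small buffer-length check. This is an illustrative sketch only: `check_shapes` and `ShapeError` are hypothetical names, and nothing here is claimed to match what `can_implement` actually validates.

```rust
/// Hypothetical error type naming the buffer whose length is wrong.
#[derive(Debug, PartialEq)]
enum ShapeError {
    ActQ,
    WeightQ,
    WeightScale,
    Output,
}

/// Check element counts against the documented shapes:
/// act_q [M, K], weight_q [N, K], weight_scale [N], output [M, N].
fn check_shapes(
    m: usize,
    n: usize,
    k: usize,
    act_q_len: usize,
    weight_q_len: usize,
    weight_scale_len: usize,
    out_len: usize,
) -> Result<(), ShapeError> {
    if act_q_len != m * k {
        return Err(ShapeError::ActQ);
    }
    if weight_q_len != n * k {
        return Err(ShapeError::WeightQ);
    }
    if weight_scale_len != n {
        return Err(ShapeError::WeightScale);
    }
    if out_len != m * n {
        return Err(ShapeError::Output);
    }
    Ok(())
}
```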
Workspace: zero bytes in Workspace. The caller supplies
act_scale_scratch [M] (FP) in SmoothQuantLinearArgs for
the act-scale broadcast.
Precision guarantee: deterministic, bit-stable on the same
hardware (inherits from the underlying quantized_linear_w8a8
kernel — register-only int32 accumulator + serial FP scale
multiply, no atomics).
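The arithmetic behind this guarantee can be modelled on the CPU: an exact i32 dot product over K, followed by a fixed sequence of floating-point scale multiplies. The sketch below is a reference model, not the kernel itself. The function name `w8a8_linear_ref` is hypothetical, and the multiply order (accumulator, then act scale, then weight scale) is an assumption.

```rust
/// CPU reference sketch for a w8a8 linear layer.
/// act_q is [M, K], weight_q is [N, K], weight_scale is [N]; all row-major.
/// Returns out [M, N].
fn w8a8_linear_ref(
    act_q: &[i8],
    weight_q: &[i8],
    act_scale: f32,
    weight_scale: &[f32],
    m: usize,
    n: usize,
    k: usize,
) -> Vec<f32> {
    assert_eq!(act_q.len(), m * k);
    assert_eq!(weight_q.len(), n * k);
    assert_eq!(weight_scale.len(), n);

    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            // Integer accumulation is exact, so summation order cannot
            // change the result (for K small enough not to overflow i32).
            let mut acc: i32 = 0;
            for p in 0..k {
                acc += act_q[i * k + p] as i32 * weight_q[j * k + p] as i32;
            }
            // Assumed order: accumulator, then act scale, then weight scale.
            out[i * n + j] = acc as f32 * act_scale * weight_scale[j];
        }
    }
    out
}
```

Because the only floating-point operations are two multiplies per output element, applied in a fixed order, repeated calls on the same inputs return identical bits.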
Implementations
impl<TIn: Element, TWQ: IntElement> SmoothQuantLinearPlan<TIn, TWQ>
pub fn select(
_stream: &Stream,
desc: &SmoothQuantLinearDescriptor,
_pref: PlanPreference,
) -> Result<Self>
Pick a kernel for desc.
pub fn can_implement(
&self,
args: &SmoothQuantLinearArgs<'_, TIn, TWQ>,
) -> Result<()>
Validate args at run time.
pub fn workspace_size(&self) -> usize
Workspace bytes — none. The act-scale [M] broadcast buffer is
caller-owned via the args bundle (act_scale_scratch),
allowing reuse across launches.
pub fn precision_guarantee(&self) -> PrecisionGuarantee
Numerical guarantees.
pub fn run(
&self,
stream: &Stream,
_workspace: Workspace<'_>,
args: SmoothQuantLinearArgs<'_, TIn, TWQ>,
) -> Result<()>
Launch.
Two-pass: (1) fill the [M] scratch with the descriptor’s
act_scale; (2) launch the quantized_linear_w8a8 kernel
directly via the FFI (skips the dynamic-range-quantize pass
that super::QuantizedLinearPlan does).
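The two passes can be mirrored on the CPU to show how the data flows. This is a sketch under stated assumptions, not the plan's implementation: `fill_act_scale`, `w8a8_per_row`, and `run_two_pass` are hypothetical names, and the stand-in for the kernel is assumed to read one act scale per output row, which is why the static per-tensor scale is first broadcast into the [M] scratch.

```rust
/// Pass 1 (sketch): broadcast the static per-tensor act scale into the
/// caller-owned [M] scratch buffer.
fn fill_act_scale(scratch: &mut [f32], act_scale: f32) {
    scratch.fill(act_scale);
}

/// Pass 2 (sketch): stand-in for a kernel that takes a per-row act scale [M]
/// and a per-output-channel weight scale [N].
/// act_q is [M, K], weight_q is [N, K], out is [M, N]; all row-major.
fn w8a8_per_row(
    act_q: &[i8],
    weight_q: &[i8],
    act_scale_rows: &[f32],
    weight_scale: &[f32],
    out: &mut [f32],
    k: usize,
) {
    let (m, n) = (act_scale_rows.len(), weight_scale.len());
    assert_eq!(act_q.len(), m * k);
    assert_eq!(weight_q.len(), n * k);
    assert_eq!(out.len(), m * n);
    for i in 0..m {
        for j in 0..n {
            let mut acc: i32 = 0;
            for p in 0..k {
                acc += act_q[i * k + p] as i32 * weight_q[j * k + p] as i32;
            }
            out[i * n + j] = acc as f32 * act_scale_rows[i] * weight_scale[j];
        }
    }
}

/// Hypothetical composition of the two passes. The scratch is borrowed from
/// the caller, so the same buffer can be reused across launches.
fn run_two_pass(
    act_q: &[i8],
    weight_q: &[i8],
    act_scale: f32,
    weight_scale: &[f32],
    act_scale_scratch: &mut [f32],
    out: &mut [f32],
    k: usize,
) {
    fill_act_scale(act_scale_scratch, act_scale);
    w8a8_per_row(act_q, weight_q, act_scale_scratch, weight_scale, out, k);
}
```

Keeping the scratch caller-owned means the composition itself allocates nothing, which is consistent with workspace_size reporting zero bytes.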