pub struct QuantizePerTokenPlan<TIn: Element, TOut: IntElement> { /* private fields */ }
quantize_per_token forward plan.
q[n, d] = clamp(round(x[n, d] / scale[n]) + zero_point[n], qmin, qmax).
Per-row quantization for 2-D activations (W8A8 LLM-style).
When to use: forward activation quantization at inference, with one
(scale, zp) pair per token row, computed upstream from the row’s
max-abs range. For weight quantization use QuantizePerChannelPlan;
for a single global scale use QuantizePerTensorPlan.
Pair with QuantizePerTokenBackwardPlan for straight-through
estimator (STE) gradients.
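The upstream step that produces the per-row (scale, zp) pairs is not part of this plan. A minimal CPU sketch of one common choice follows, assuming a symmetric scheme (zero point 0, scale = max|x| / qmax). The function name `row_scales_max_abs` and the fallback scale of 1.0 for all-zero rows are illustrative assumptions, not part of this crate.

```rust
/// Hypothetical helper: derive one (scale, zero_point) pair per row of a
/// row-major [n, d] activation matrix from that row's max-abs value.
/// Assumes symmetric quantization, so every zero point is 0.
fn row_scales_max_abs(x: &[f32], n: usize, d: usize, qmax: i32) -> (Vec<f32>, Vec<i32>) {
    assert!(d > 0, "rows must be non-empty");
    assert!(qmax > 0, "qmax must be positive for a symmetric range");
    assert_eq!(x.len(), n * d, "x must hold n * d elements");

    let scale: Vec<f32> = x
        .chunks_exact(d)
        .map(|row| {
            // Largest magnitude in this token row.
            let max_abs = row.iter().fold(0.0f32, |m, v| m.max(v.abs()));
            if max_abs > 0.0 {
                max_abs / qmax as f32
            } else {
                // Assumption: avoid a zero scale (division by zero later)
                // for rows that are entirely zero.
                1.0
            }
        })
        .collect();

    // Symmetric scheme: zero_point[n] is 0 for every row.
    (scale, vec![0; n])
}
```

With qmax = 127, a row whose largest magnitude is 254.0 gets scale 2.0, so that element maps to ±127 after division and rounding.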
Dtypes: input FP {f32, f64, f16, bf16} × output int
{s8, u8}. scale[] is input dtype; zero_point[] is i32.
Shape limits: rank-2 [N, D]; scale and zero_point are
[N]. q_max ≥ q_min.
Workspace: none.
Precision guarantee: deterministic, bit-stable. One thread per output cell, no atomics. Round-ties-even.
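For checking results, the formula and constraints above can be mirrored in a small CPU reference. This is a sketch for the f32 to s8 case only. The names `quantize_per_token_ref` and `round_ties_even` are hypothetical and do not come from this crate, and the actual kernel may order its operations differently.

```rust
/// Round to nearest, ties to even. On Rust 1.77+ the standard
/// `f32::round_ties_even` does the same thing.
fn round_ties_even(v: f32) -> f32 {
    if (v - v.trunc()).abs() == 0.5 {
        // Exactly halfway: pick the even neighbour.
        2.0 * (v / 2.0).round()
    } else {
        v.round()
    }
}

/// Hypothetical CPU reference for
/// q[n, d] = clamp(round(x[n, d] / scale[n]) + zero_point[n], qmin, qmax)
/// on a row-major [n, d] input, with scale and zero_point of length n.
fn quantize_per_token_ref(
    x: &[f32],
    n: usize,
    d: usize,
    scale: &[f32],
    zero_point: &[i32],
    qmin: i32,
    qmax: i32,
) -> Vec<i8> {
    assert_eq!(x.len(), n * d, "x must be rank-2 [n, d]");
    assert_eq!(scale.len(), n, "scale must be [n]");
    assert_eq!(zero_point.len(), n, "zero_point must be [n]");
    assert!(qmax >= qmin, "q_max must be >= q_min");
    assert!(qmin >= i8::MIN as i32 && qmax <= i8::MAX as i32, "range must fit in i8");

    let mut q = Vec::with_capacity(n * d);
    for row in 0..n {
        for col in 0..d {
            // Float-to-int `as` casts saturate in Rust, so huge or infinite
            // quotients cannot wrap around.
            let rounded = round_ties_even(x[row * d + col] / scale[row]) as i32;
            let shifted = rounded.saturating_add(zero_point[row]);
            q.push(shifted.clamp(qmin, qmax) as i8);
        }
    }
    q
}
```

Each output cell depends only on its own input element and its row's (scale, zero_point) pair, which is consistent with the one-thread-per-cell, no-atomics description above.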
Implementations
impl<TIn: Element, TOut: IntElement> QuantizePerTokenPlan<TIn, TOut>
pub fn select(
    _stream: &Stream,
    desc: &QuantizePerTokenDescriptor,
    _pref: PlanPreference,
) -> Result<Self>
Pick a kernel for desc.
pub fn can_implement(
    &self,
    args: &QuantizePerTokenArgs<'_, TIn, TOut>,
) -> Result<()>
Validate args at run time.
pub fn workspace_size(&self) -> usize
Workspace bytes — none.
pub fn precision_guarantee(&self) -> PrecisionGuarantee
Numerical guarantees.