pub struct QuantizedLinearArgs<'a, TIn: Element, TWQ: IntElement> {
    pub activation: TensorRef<'a, TIn, 2>,
    pub weight_q: TensorRef<'a, TWQ, 2>,
    pub weight_scale: TensorRef<'a, TIn, 1>,
    pub output: TensorMut<'a, TIn, 2>,
    pub act_q_scratch: TensorMut<'a, S8, 2>,
    pub act_scale_scratch: TensorMut<'a, TIn, 1>,
}
Argument bundle for a quantized_linear launch.
The caller supplies the already-quantized weight and its
per-output-channel scale, both computed offline. The activation stays
in floating point (FP); per-token activation quantization happens
inside QuantizedLinearPlan::run, which internally orchestrates a
super::DynamicRangeQuantizePlan pass.
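The data flow described above can be illustrated with a plain CPU reference. This is only a sketch under stated assumptions, not the library's kernel: it assumes symmetric absmax quantization to the range [-127, 127], `f32` as the FP type, `i8` weights, and row-major slices in place of `TensorRef`/`TensorMut`. The function names `quantize_rows` and `quantized_linear_ref` are hypothetical.

```rust
/// Per-row (per-token) symmetric int8 quantization.
/// ASSUMPTION: absmax scaling to [-127, 127]; the real
/// DynamicRangeQuantizePlan may use a different scheme.
fn quantize_rows(act: &[f32], m: usize, k: usize) -> (Vec<i8>, Vec<f32>) {
    assert_eq!(act.len(), m * k);
    let mut q = vec![0i8; m * k]; // plays the role of act_q_scratch [M, K]
    let mut scales = vec![0f32; m]; // plays the role of act_scale_scratch [M]
    for row in 0..m {
        let src = &act[row * k..(row + 1) * k];
        let absmax = src.iter().fold(0f32, |a, &x| a.max(x.abs()));
        // An all-zero row gets scale 1.0 to avoid dividing by zero.
        let scale = if absmax > 0.0 { absmax / 127.0 } else { 1.0 };
        scales[row] = scale;
        for (dst, &x) in q[row * k..(row + 1) * k].iter_mut().zip(src) {
            *dst = (x / scale).round().clamp(-127.0, 127.0) as i8;
        }
    }
    (q, scales)
}

/// Reference quantized linear: output[i, j] =
///   (sum_p act_q[i, p] * weight_q[j, p]) * act_scale[i] * weight_scale[j].
/// Shapes follow the field docs: activation [M, K], weight_q [C_out, K],
/// weight_scale [C_out], output [M, C_out].
fn quantized_linear_ref(
    act: &[f32],
    weight_q: &[i8],
    weight_scale: &[f32],
    m: usize,
    k: usize,
    c_out: usize,
) -> Vec<f32> {
    assert_eq!(weight_q.len(), c_out * k);
    assert_eq!(weight_scale.len(), c_out);
    let (act_q, act_scale) = quantize_rows(act, m, k);
    let mut out = vec![0f32; m * c_out];
    for i in 0..m {
        for j in 0..c_out {
            // Accumulate in i32 so int8 products cannot overflow.
            let mut acc: i32 = 0;
            for p in 0..k {
                acc += act_q[i * k + p] as i32 * weight_q[j * k + p] as i32;
            }
            // Dequantize with the per-token and per-channel scales.
            out[i * c_out + j] = acc as f32 * act_scale[i] * weight_scale[j];
        }
    }
    out
}
```

A reference like this may be useful for checking a launch's output against a tolerance, provided the assumed quantization scheme matches what the plan actually does.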
act_q_scratch and act_scale_scratch are caller-owned scratch
buffers for the quantized activation and the computed per-row
activation scale. They are part of the args bundle rather than the
workspace, so callers can reuse them across launches without
reallocating; accordingly, the Plan’s workspace_size() returns 0.
Fields
activation: TensorRef<'a, TIn, 2>
FP activation [M, K].
weight_q: TensorRef<'a, TWQ, 2>
Already-quantized int8 weight [C_out, K].
weight_scale: TensorRef<'a, TIn, 1>
Per-output-channel weight scale [C_out] in FP.
output: TensorMut<'a, TIn, 2>
FP output [M, C_out].
act_q_scratch: TensorMut<'a, S8, 2>
Scratch for the per-token quantized activation [M, K] in int8.
Caller-owned; reused across launches.
act_scale_scratch: TensorMut<'a, TIn, 1>
Scratch for the per-token activation scale [M] in FP.
Caller-owned; reused across launches. Populated by the
internally orchestrated dynamic-range pass.
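Because the scratch buffers are caller-owned, the caller has to size them from the problem dimensions. The helper below is a hypothetical sketch (neither `QuantizedLinearShapes`, `expected_shapes`, nor `scratch_bytes` is part of the documented API) that derives every field's shape from M, K, and C_out as listed above, and estimates the scratch footprint assuming densely packed buffers with no alignment padding.

```rust
/// Hypothetical shape summary mirroring the fields of QuantizedLinearArgs.
#[derive(Debug, PartialEq)]
struct QuantizedLinearShapes {
    activation: [usize; 2],        // [M, K]
    weight_q: [usize; 2],          // [C_out, K]
    weight_scale: [usize; 1],      // [C_out]
    output: [usize; 2],            // [M, C_out]
    act_q_scratch: [usize; 2],     // [M, K]
    act_scale_scratch: [usize; 1], // [M]
}

/// Derive all expected shapes from the three problem dimensions.
fn expected_shapes(m: usize, k: usize, c_out: usize) -> QuantizedLinearShapes {
    QuantizedLinearShapes {
        activation: [m, k],
        weight_q: [c_out, k],
        weight_scale: [c_out],
        output: [m, c_out],
        act_q_scratch: [m, k],
        act_scale_scratch: [m],
    }
}

/// Bytes needed for both scratch buffers.
/// ASSUMPTION: dense packing, no padding; `fp_elem_bytes` is the size of
/// the FP element type (for example 2 for a 16-bit float, 4 for f32).
fn scratch_bytes(m: usize, k: usize, fp_elem_bytes: usize) -> usize {
    let act_q = m * k * std::mem::size_of::<i8>(); // int8 activation [M, K]
    let act_scale = m * fp_elem_bytes; // per-token scale [M]
    act_q + act_scale
}
```

Allocating the two buffers once for the largest M a caller expects, and reusing them for each launch, is one way to take advantage of the zero-workspace design; whether the library accepts oversized scratch tensors is not stated in the source.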