QuantizedLinear: an INT8 weight matrix with per-channel or per-tensor
scale/zero_point.
§Design
Weights are stored as Array2<i8> with shape (out_features, in_features).
The forward pass dequantizes the weight matrix on every call and then
performs a standard f64 matmul. This is a deliberately simple, CPU-first
implementation; an integer matmul (packed int8 GEMM) is left as a future
follow-up.
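The dequantize-then-matmul path described above can be sketched roughly as follows. This is a minimal std-only illustration, not the crate's code: it uses a flat row-major slice instead of Array2<i8>, covers only the per-tensor case, and assumes the usual affine convention w = scale * (q - zero_point) with an i8 zero point. The names dequantize_per_tensor and forward_sketch are hypothetical.

```rust
/// Hypothetical helper: map each i8 weight back to f64 using one
/// scale/zero_point pair (per-tensor). Assumes w = scale * (q - zero_point).
fn dequantize_per_tensor(q: &[i8], scale: f64, zero_point: i8) -> Vec<f64> {
    q.iter()
        .map(|&v| scale * (f64::from(v) - f64::from(zero_point)))
        .collect()
}

/// Hypothetical forward pass for a single input vector `x`.
/// `q` is row-major with shape (out_features, in_features).
fn forward_sketch(
    q: &[i8],
    out_features: usize,
    in_features: usize,
    scale: f64,
    zero_point: i8,
    x: &[f64],
) -> Vec<f64> {
    assert_eq!(q.len(), out_features * in_features);
    assert_eq!(x.len(), in_features);
    if in_features == 0 {
        // `chunks(0)` would panic; an empty input gives all-zero outputs.
        return vec![0.0; out_features];
    }
    // Step 1: dequantize the whole weight matrix (done on every call).
    let w = dequantize_per_tensor(q, scale, zero_point);
    // Step 2: plain f64 matrix-vector product, one output per weight row.
    w.chunks(in_features)
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f64>())
        .collect()
}
```

For example, calling forward_sketch with the weights [1, 2, 3, 4] as a 2x2 matrix, scale 0.5, zero point 0, and input [1.0, 1.0] dequantizes the rows to [0.5, 1.0] and [1.5, 2.0] before multiplying. Because the f64 copy is rebuilt on each call, the sketch trades compute for simplicity, in line with the design note above.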
For PerChannel granularity each row (output channel) has its own
scale[c] and zero_point[c]. For PerTensor a single pair applies to
all elements.
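To make the two granularities concrete, here is a small std-only sketch. The GranularitySketch enum and dequantize_rows function are hypothetical stand-ins, not the crate's actual types, and the i8 zero-point type and the formula w = scale * (q - zero_point) are assumptions.

```rust
/// Hypothetical stand-in for the crate's granularity setting.
enum GranularitySketch {
    /// One (scale, zero_point) pair for every element.
    PerTensor { scale: f64, zero_point: i8 },
    /// One pair per row, indexed by output channel `c`.
    PerChannel { scale: Vec<f64>, zero_point: Vec<i8> },
}

/// Dequantize a weight matrix given as rows of i8 (one row per output
/// channel). Assumes the affine mapping w = scale * (q - zero_point).
fn dequantize_rows(q: &[Vec<i8>], g: &GranularitySketch) -> Vec<Vec<f64>> {
    q.iter()
        .enumerate()
        .map(|(c, row)| {
            // Select the parameters that apply to output channel `c`.
            let (s, z) = match g {
                GranularitySketch::PerTensor { scale, zero_point } => (*scale, *zero_point),
                GranularitySketch::PerChannel { scale, zero_point } => {
                    (scale[c], zero_point[c])
                }
            };
            row.iter()
                .map(|&v| s * (f64::from(v) - f64::from(z)))
                .collect()
        })
        .collect()
}
```

With PerChannel, two rows holding the same i8 values can dequantize to different f64 values because each row uses its own pair; with PerTensor they always agree.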
Structs§
- QuantizedLinear - A weight matrix quantized to i8, with per-channel or per-tensor
scale/zero_point.
Enums§
- QuantizationError - Error type for quantization operations on linear layers.